Following our post about Hadoop security for the enterprise, or the lack thereof, one way to make Hadoop more secure is to add a security framework to the mix. Five major Hadoop security projects are currently available: Apache Knox Gateway, Apache Sentry, Apache Ranger (formerly Apache Argus), Apache Accumulo, and Project Rhino. Let’s see what they do.
Apache Knox Gateway
While other projects attempt to improve Hadoop’s security from the inside, Apache Knox Gateway tries to do it from the outside. Apache Knox Gateway creates a security perimeter between Hadoop and the rest of the world by providing a REST API gateway for interacting with Hadoop clusters.
All communication with Hadoop is done via Knox Gateway, which controls and moderates it. Knox includes the following features: LDAP and Active Directory integration, support for identity federation based on HTTP headers, and service-level authorization and auditing.
Knox sounds like a great security solution for the enterprise. It integrates with identity management frameworks and hides Hadoop hosts and ports behind it. This also simplifies Hadoop access: Instead of connecting to different Hadoop clusters, which all have different security policies, Knox becomes the single entry point for all the Hadoop clusters in the organization.
This project is backed by Hortonworks and runs as a server or a cluster of servers.
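To make the "single entry point" idea concrete, here is a minimal sketch of how a client might address WebHDFS through a Knox gateway instead of contacting a NameNode directly. The host name, credentials, and the `knox_webhdfs_request` helper are hypothetical; port 8443 and the `default` topology are common Knox defaults, assumed here rather than taken from any particular deployment. The request is only built, never sent.

```python
import base64
import urllib.request
from urllib.parse import quote, urlencode


def knox_webhdfs_request(gateway_host, topology, hdfs_path, op,
                         user, password, port=8443):
    """Build (but do not send) a WebHDFS request routed through Knox.

    Hypothetical helper for illustration; not part of any Knox client library.
    """
    # Knox publishes cluster services under /gateway/<topology>/<service>,
    # so the client never sees the real Hadoop hosts and ports.
    url = (
        f"https://{gateway_host}:{port}/gateway/{topology}"
        f"/webhdfs/v1{quote(hdfs_path)}?{urlencode({'op': op})}"
    )
    request = urllib.request.Request(url, method="GET")
    # HTTP Basic credentials, which the gateway can check against
    # LDAP or Active Directory before forwarding the call.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Assumed example values: none of these come from a real cluster.
request = knox_webhdfs_request(
    "knox.example.com", "default", "/tmp", "LISTSTATUS",
    "guest", "guest-password",
)
print(request.full_url)
```

Every cluster behind the gateway would be reached the same way, with only the topology name in the URL changing.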
Apache Sentry
Developed by Cloudera, Apache Sentry is the security layer for Hadoop applications such as Hive, Impala, and Solr. It allows administrators to grant or revoke access to servers, databases, and tables, not just at the file system (HDFS) level. Finer granularity, down to columns or cells, isn’t supported at the moment.
Sentry also allows one to set different privileges for SELECT, INSERT, and TRANSFORM statements and for creating and modifying schemas. It even makes multi-tenant administration available for Hadoop, so separate policies can be maintained by separate admins for databases and schemas.
Although the project is still in incubation, it promises to work right out of the box with Apache Hive and Cloudera Impala. Apache Sentry is a part of the Project Rhino initiative but still deserves focus in its own right.
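The server, database, and table hierarchy described above can be pictured with a small toy model. This is only a sketch of the idea, not Sentry's actual API or policy format: the `Privilege` and `RolePolicy` classes, the role names, and the rule that a missing database or table means "everything below this level" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Privilege:
    """A hypothetical privilege scoped to a server, database, or table."""
    action: str                      # "SELECT", "INSERT", or "ALL"
    server: str
    database: Optional[str] = None   # None: every database on the server
    table: Optional[str] = None      # None: every table in the database

    def allows(self, action, server, database, table):
        if self.action not in ("ALL", action):
            return False
        if self.server != server:
            return False
        if self.database is not None and self.database != database:
            return False
        if self.table is not None and self.table != table:
            return False
        return True


class RolePolicy:
    """Toy in-memory policy store: role name -> set of privileges."""

    def __init__(self):
        self._grants = {}

    def grant(self, role, privilege):
        self._grants.setdefault(role, set()).add(privilege)

    def revoke(self, role, privilege):
        self._grants.get(role, set()).discard(privilege)

    def is_allowed(self, role, action, server, database, table):
        return any(
            p.allows(action, server, database, table)
            for p in self._grants.get(role, ())
        )


policy = RolePolicy()
# Database-level grant: analysts may read any table in "sales".
policy.grant("analyst", Privilege("SELECT", "server1", "sales"))
# Table-level grant: the ETL role may write to one table only.
policy.grant("etl", Privilege("INSERT", "server1", "sales", "orders"))

print(policy.is_allowed("analyst", "SELECT", "server1", "sales", "orders"))
print(policy.is_allowed("etl", "INSERT", "server1", "sales", "customers"))
```

Note that nothing in this model goes below the table, which mirrors the column and cell limitation mentioned earlier.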
Apache Ranger
Formerly known as Apache Argus, Apache Ranger overlaps with Apache Sentry since it also deals with authorization and permissions. It adds an authorization layer to Hive, HBase, and Knox, and its developers claim an advantage over Sentry because it includes column-level permissions in Hive.
Originally, this product was called XA Secure; Hortonworks recently acquired it and then open-sourced it.
Apache Accumulo
Apache Accumulo is not a security project per se but a distributed key/value store that is based on Google’s BigTable and built on top of Apache Hadoop, ZooKeeper, and Thrift.
Nonetheless, Accumulo includes authorization at the cell level. Access to the data can therefore be granted or restricted at the finest resolution possible: per user and per key/value cell. Surprisingly, Accumulo was originally created and contributed to Apache by the NSA.
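Cell-level authorization of this kind is usually expressed by attaching a visibility label to each cell and comparing it with the authorizations a user holds. The sketch below is a simplified Python evaluator for labels that combine tokens with `&`, `|`, and parentheses. It illustrates the idea only and is not Accumulo's implementation: the `is_visible` and `tokenize` functions, the token character set, and the example labels are all assumptions.

```python
import re

# Assumed token syntax: alphanumeric names plus the operators & | ( ).
TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expression):
    tokens, pos = [], 0
    expression = expression.strip()
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression, authorizations):
    """Return True if a cell labelled `expression` is readable by a user
    holding the given set of authorization tokens."""
    tokens = tokenize(expression)
    if not tokens:
        return True  # an unlabelled cell is visible to everyone
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("expression ends unexpectedly")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in authorizations

    def parse_expression():
        nonlocal pos
        value = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator is None:
                operator = tokens[pos]
            elif tokens[pos] != operator:
                # Mixed operators must be grouped explicitly.
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            right = parse_term()
            value = (value and right) if operator == "&" else (value or right)
        return value

    result = parse_expression()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


print(is_visible("admin&(audit|ops)", {"admin", "ops"}))
print(is_visible("admin&(audit|ops)", {"ops"}))
```

Because the check runs per cell, two users scanning the same table can see different subsets of it without any change to the table itself.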
Project Rhino
Project Rhino is an initiative to bring Hadoop’s security up to par by contributing code directly to the relevant Apache projects.
This initiative, which is led by Intel, views Hadoop as a full stack solution that also includes projects such as ZooKeeper, Hive, Oozie, etc., and wishes to improve security for all of them.
Some of Project Rhino’s goals are to add support for encryption and key management, create a common authorization framework for Hadoop, add single sign-on and token-based authentication, extend HBase support for ACLs down to the cell level, and improve audit logging. Basically, it aims to add all the security features that are missing from Hadoop. Sounds pretty sweet.
Extending HBase support for ACLs down to the cell level has already been achieved, and with Intel backing the project, we will surely see more tasks marked off their to-do list.
Several projects aim to strengthen Hadoop’s security:
- Apache Knox Gateway—secure REST API gateway for Hadoop with authorization and authentication
- Apache Sentry—Hadoop data and metadata authorization
- Apache Ranger—similar to Apache Sentry
- Apache Accumulo—NoSQL key/value store with cell-level authorization
- Project Rhino—general initiative to improve Hadoop security by contributing code to the entire stack
Whether you decide to use any of these projects or not, we are surely on the way to a more secure Hadoop and increased adoption in the enterprise.