Five major Hadoop security projects are currently available: 

  • Apache Knox: an open-source framework for managing and securing access to Hadoop clusters
  • Apache Sentry: a granular, role-based authorization module for Hadoop
  • Apache Ranger: a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform
  • Apache Accumulo: a sorted, distributed key/value store designed as a robust, scalable, high-performance storage and retrieval system
  • Intel's Project Rhino: an initiative to provide integrated, end-to-end data security across the Hadoop ecosystem

Big Data has changed society and continues to do so at a rapid pace. Massive data sets pour in from smart home devices, genome sequencing, pacemakers, satellites, social blogs, and more. Harvesting this data correctly and extracting valuable insights from it requires serious engineering: a distributed, scalable, and robust system with fast computational capabilities. Apache Hadoop is currently the dominant open-source system in this regard.

Hadoop 2.x's core components (Hadoop Common, the Hadoop Distributed File System (HDFS), MapReduce, and YARN), together with Apache projects such as Hive, HBase, Storm, and Kafka, enable non-expert users to harness the potential hidden in valuable data assets. However, the platform's distributed nature and massive scale result in a broad attack surface, making it vulnerable to security threats. Given the shortcomings of Hadoop security for the enterprise, one way to make Hadoop more secure is to add a security framework to the mix. We discuss five of them here.

For information on Integrate.io's native Hadoop HDFS connector, visit our Integration page.

Table of Contents

  1. Apache Knox Gateway
  2. Apache Sentry
  3. Apache Ranger
  4. Apache Accumulo
  5. Project Rhino
  6. Summary

1) Apache Knox Gateway

While other projects attempt to improve Hadoop's security from the inside, Apache Knox Gateway tries to do it from the outside. It creates a security perimeter between Hadoop and the rest of the world by providing a REST API gateway for interacting with Hadoop clusters.

Knox is a REST (Representational State Transfer) API gateway that provides a single point of authentication and access for Apache Hadoop-based services. The gateway runs as a server (or as a cluster of servers) that provides centralized access to one or more Hadoop clusters, and it is designed to hide the Hadoop cluster topology from the outside world. It supports popular Hadoop services such as WebHDFS, Oozie, Hive, HBase, and HCatalog.

All communication with Hadoop is done via Knox Gateway, which controls and moderates it. Knox includes the following features: LDAP and Active Directory integration, support for identity federation based on HTTP headers, end-to-end encryption on the wire, and service-level authorization and auditing.

Knox sounds like a great security solution for the enterprise. It integrates with identity-management frameworks and hides Hadoop hosts and ports behind it. This simplifies Hadoop access: instead of connecting to separate Hadoop clusters that all have different security policies, clients use Knox as the single entry point for all the Hadoop clusters in the organization.

The project is backed by Hortonworks.
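
To make this concrete, here is a minimal sketch of what access through Knox can look like from a client's point of view. The WebHDFS `LISTSTATUS` call is standard, but the gateway host, topology name (`default`), credentials, and certificate path are all placeholders for your own deployment, not values from any particular cluster.

```python
# A minimal sketch of a client call through Knox (not a drop-in script):
# the gateway host, topology name, credentials, and TLS certificate path
# are all placeholders.
import requests

KNOX_URL = "https://knox.example.com:8443/gateway/default"

# List an HDFS directory via the standard WebHDFS LISTSTATUS operation,
# proxied through the gateway instead of hitting a NameNode directly.
resp = requests.get(
    f"{KNOX_URL}/webhdfs/v1/tmp",
    params={"op": "LISTSTATUS"},
    auth=("alice", "alice-password"),   # checked against LDAP/Active Directory
    verify="/etc/knox/gateway.pem",     # the gateway's TLS certificate
)
resp.raise_for_status()
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"])
```

Note that the client never learns the NameNode's host or port; it only ever talks to the gateway.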

2) Apache Sentry

Secure and robust authorization has always been an issue within the Hadoop ecosystem. Historically there was no consistency: administrators had to implement and maintain a separate authorization scheme for each component, and the components offered varying levels of granularity in their authorization and enforcement controls. The resulting confusion evolved into a significant security concern.

In response, Cloudera released Apache Sentry, an open-source Hadoop component that offers fine-grained, role-based authorization and multi-tenant administration to provide unified access control for data and metadata stored in Hadoop. Traditionally, as covered already, HDFS authorization controls were limited to simple POSIX-style permissions and extended ACLs.

Sentry integrates with Hive/HCatalog, Apache Solr, and Cloudera Impala. As a result, organizations can store more sensitive data in Hadoop, and more end users can access that data. Sentry's goal is to implement authorization for Hadoop ecosystem components in a unified way, so that security administrators can easily control what users and groups have access to without knowing the ins and outs of every single element in the Hadoop stack.
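
As a rough illustration of that unified model, the sketch below uses the third-party `pyhive` client to issue Sentry's SQL grant statements against a hypothetical Sentry-enabled HiveServer2. The host, database, role, and group names are all invented for the example.

```python
# A rough sketch, not a drop-in script: the host, database, role, and
# group names are invented. Assumes a Sentry-enabled HiveServer2 and the
# third-party pyhive client.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000,
                    username="hive_admin")
cur = conn.cursor()

cur.execute("CREATE ROLE analyst")
# Fine-grained, read-only access to a single database
cur.execute("GRANT SELECT ON DATABASE sales TO ROLE analyst")
# Sentry maps roles to groups, not to individual users
cur.execute("GRANT ROLE analyst TO GROUP analysts")

cur.close()
conn.close()
```

Because Sentry grants privileges to roles and maps roles to groups, adding a new analyst becomes a matter of group membership rather than a fresh set of per-component permissions.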

3) Apache Ranger

Apache Ranger, formerly known as Apache Argus, overlaps with Apache Sentry in that it also deals with authorization and permissions. It adds an authorization layer to Hive, HBase, and Knox, and its developers claim an advantage over Sentry: column-level permissions in Hive. The product was initially called XA Secure, until Hortonworks acquired it and open-sourced it.

It delivers a comprehensive security approach for a Hadoop cluster, providing a centralized system that defines, administers, and manages security policies consistently across Hadoop components. Using the Apache Ranger console in Hortonworks, security administrators can easily manage access to files, folders, databases, tables, or columns. It is possible to set policies at the resource level (files, folders, databases) and to pinpoint specific rows and columns within databases. Ranger also gives security administrators extensive visibility and oversight into their Hadoop environment through a centralized audit location.
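
The same policies can be managed programmatically. The sketch below posts a column-level Hive policy to Ranger's public REST API; the payload shape follows Ranger's v2 policy API, but the admin host, service name, credentials, and resource names are placeholders, so treat this as an outline rather than a drop-in script.

```python
# A sketch against Ranger's public v2 policy API; the admin host,
# service name, credentials, and resource names are placeholders.
import requests

RANGER_URL = "https://ranger.example.com:6182"

policy = {
    "service": "hadoopdev_hive",        # the Ranger service this policy belongs to
    "name": "analysts-read-sales-columns",
    "resources": {
        "database": {"values": ["sales"]},
        "table":    {"values": ["orders"]},
        "column":   {"values": ["order_id", "amount"]},  # column-level scope
    },
    "policyItems": [{
        "accesses": [{"type": "select", "isAllowed": True}],
        "groups": ["analysts"],
    }],
}

resp = requests.post(
    f"{RANGER_URL}/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "admin-password"),
    verify="/etc/ranger/admin.pem",     # the admin server's TLS certificate
)
resp.raise_for_status()
print("Created policy id:", resp.json()["id"])
```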

4) Apache Accumulo

Apache Accumulo is not a security project per se but a distributed key/value store based on Google's BigTable and built on top of Apache Hadoop, ZooKeeper, and Thrift.

Nonetheless, Accumulo includes authorization at the cell level. Highly specific access to the data can therefore be granted or restricted at the finest resolution possible: per user and per key/value cell. Surprisingly, Accumulo was initially created and contributed to Apache by the NSA.

Accumulo is a relaxed-consistency database, widely used in government applications, that is designed to deliver high performance on unstructured data such as graphs of network data. Although largely a clone of BigTable, it includes several novel features:

  • Iterators: a framework for processing sorted streams of key/value entries
  • Cell-level security: mandatory, attribute-based access control with key/value granularity (see the sketch after this list)
  • FATE: a fault-tolerant execution framework
  • A novel compaction scheduler
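
The cell-level security model is easiest to see with a toy example. The sketch below is not the Accumulo client API; it merely mimics how a cell's visibility expression (e.g. `admin&(audit|ops)`) is tested against the authorizations a user presents at scan time.

```python
import re

def cell_visible(expression: str, authorizations: set) -> bool:
    """Toy model of Accumulo's visibility check: a cell is returned
    only if the user's authorizations satisfy its visibility expression."""
    if not expression:
        return True  # an unlabeled cell is visible to everyone
    # Replace each label with True/False, then evaluate the remaining
    # &, |, and parentheses as Python boolean operators.
    substituted = re.sub(r"\w+",
                         lambda m: str(m.group(0) in authorizations),
                         expression)
    return bool(eval(substituted))  # safe here: input reduced to True/False, &, |, ()

print(cell_visible("admin&(audit|ops)", {"admin", "ops"}))  # True
print(cell_visible("admin&(audit|ops)", {"ops"}))           # False
```

In real Accumulo, the expression is stored in each key's column visibility field, and the tablet servers filter cells against the scanner's authorizations, so the check happens server-side.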

5) Project Rhino

While Hadoop provides several security mechanisms covering authentication and authorization, enterprises often require greater assurance of data protection, including encryption of data at rest and in motion, role-based access control (RBAC), and other features necessary for compliance and data governance.

Project Rhino is an initiative to bring Hadoop’s security up to par by contributing code directly to the relevant Apache projects. This initiative, which is led by Intel, views Hadoop as a full-stack solution that also includes projects such as ZooKeeper, Hive, Oozie, etc., and wishes to improve security for all of them. 

Some of Project Rhino’s goals are to: 

  • Provide encryption with hardware-enhanced performance, and add support for encryption and key management.
  • Create an enterprise-grade authorization framework for Hadoop, with single sign-on and token-based authentication.
  • Provide role-based access control that is unified across multiple Hadoop components, with finer granularity such as cell-level access control in Apache HBase, and improve audit logging.
  • Ensure consistent auditing across essential Apache Hadoop components.

They are aiming to add all the security features that are missing from Hadoop. Sounds pretty sweet. Extending HBase support for ACLs down to the cell level has already been achieved, and with Intel backing the project, we will surely see more tasks marked off their to-do list.
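
To make the encryption-and-key-management goal concrete, the sketch below shows the envelope-encryption pattern that underlies most encryption-at-rest designs, including HDFS transparent encryption: each file gets its own data key, and only an encrypted copy of that key is stored next to the data. It uses the third-party `cryptography` package purely for illustration; this is not Rhino's actual code.

```python
# Envelope encryption, sketched: data is encrypted with a per-file data
# key, and that key is itself encrypted with a master key held by a
# key-management service (KMS). Illustrative only.
from cryptography.fernet import Fernet

master_key = Fernet.generate_key()   # lives in the KMS, never stored with the data
kms = Fernet(master_key)

# --- write path ---
data_key = Fernet.generate_key()     # fresh key for this file
encrypted_data = Fernet(data_key).encrypt(b"sensitive records")
encrypted_data_key = kms.encrypt(data_key)   # stored alongside the file

# --- read path ---
recovered_key = kms.decrypt(encrypted_data_key)
plaintext = Fernet(recovered_key).decrypt(encrypted_data)
assert plaintext == b"sensitive records"
```

The payoff of this design is that rotating or revoking the master key never requires re-encrypting the data itself, only the small encrypted data keys.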

Summary

As data and analytics teams grow and data flows increase, GDPR compliance, security, and governance become more significant. As organizations' analytical maturity improves, both business and technical users must understand data best practices.

How Integrate.io Can Help

At Integrate.io, data security and compliance are two of the most critical aspects of our automated ETL service. We have incorporated the most advanced data security and encryption technology into our platform, including:

  • Physical infrastructure hosted on accredited Amazon Web Services (AWS) technology
  • Advanced preparations to meet the European General Data Protection Regulation (GDPR) standards
  • SSL/TLS encryption on all our websites and microservices (encryption in transit)
  • Field-level encryption
  • Encryption of sensitive data anytime it's at rest in the Integrate.io platform, using industry-standard encryption
  • Constant verification of our security certificates and encryption algorithms
  • Firewalls that restrict access to systems from external networks and between systems internally

If you'd like to know more about our data security standards, schedule a demo with the Integrate.io team.