Is Hadoop secure for the enterprise? This is the question that data analysts must answer if they want to bring Hadoop to large organisations.
While Hadoop has proved its power for scalable storage and processing of Big Data, it may not be enterprise-ready when it comes to security. Hortonworks, Cloudera and MapR address this problem by providing Enterprise Hadoop distributions. There are also several Hadoop security projects, such as Apache Argus (since renamed Apache Ranger) and Knox. But what does Hadoop provide right out of the box?
The bad news is that a fresh Hadoop installation isn't secure; it was never designed to be. Hadoop's original purpose was to process data arriving in high volume, variety and velocity, while letting everyone access that data and run jobs against it. Things have changed since, though, and security features were added in later Hadoop versions. There are at least four areas of concern for Hadoop's security: authentication, authorisation, auditing and encryption.
No one wants anonymous users to browse through their data. That’s why Hadoop supports Kerberos: a mature authentication protocol that has been around since the late eighties.
Nonetheless, to get Kerberos up and running for Hadoop, sysadmins need to install, configure and maintain a Kerberos server. If the organisation already runs some other kind of centralised authentication server, this doubles the amount of work. Not to mention that Kerberos is notorious for being a nightmare to maintain.
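As an illustration, once a KDC and the service principals exist, switching the cluster over to Kerberos comes down to a couple of properties in core-site.xml. This is a minimal sketch; a real deployment also needs keytabs and per-daemon principal settings:

```xml
<!-- core-site.xml: switch authentication from "simple" to Kerberos -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<!-- also turn on RPC-level authorisation checks -->
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```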
Hadoop also supports HTTP simple authentication for its web consoles. This method sends the password in plaintext with every HTTP request. Even if you use SSL to hide it in transit, the password may still be logged by the server and cached in the browser. That's not secure enough.
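The web consoles can instead be pointed at Kerberos via SPNEGO. A sketch of the relevant core-site.xml properties, where the realm and keytab path are placeholders:

```xml
<!-- core-site.xml: authenticate web console users with Kerberos/SPNEGO -->
<property>
  <name>hadoop.http.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.http.authentication.kerberos.principal</name>
  <!-- placeholder realm; _HOST expands to the local hostname -->
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>hadoop.http.authentication.kerberos.keytab</name>
  <!-- placeholder path -->
  <value>/etc/security/keytabs/spnego.service.keytab</value>
</property>
```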
Although Hadoop was founded on the democratic principles of open access to data for all, this isn’t right for the enterprise. Organisations need strict control over who can access which data and what they can do with it.
Fortunately, HDFS supports authorisation via the traditional file permission model as well as ACLs (Access Control Lists). Hadoop therefore makes it possible to control access to files and directories per user and group.
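For example, the path, user and group names below are hypothetical, and ACLs must first be switched on with dfs.namenode.acls.enabled=true in hdfs-site.xml:

```shell
# Classic POSIX-style permissions: owner, group, mode
hdfs dfs -chown alice:finance /data/reports
hdfs dfs -chmod 750 /data/reports

# ACLs go beyond a single owner/group: grant one extra user access
hdfs dfs -setfacl -m user:bob:r-x /data/reports

# Inspect the resulting ACL
hdfs dfs -getfacl /data/reports
```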
What about control over who can send jobs to the cluster? For that Hadoop provides Service Level Authorization: a mechanism that makes sure that clients who use a certain Hadoop service have the right permissions.
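Service Level Authorization is driven by ACL properties in hadoop-policy.xml; one property per service protocol. A sketch, with placeholder user and group names:

```xml
<!-- hadoop-policy.xml: who may connect to HDFS as a client -->
<property>
  <name>security.client.protocol.acl</name>
  <!-- format: comma-separated users, a space, then comma-separated groups -->
  <value>alice,bob datascience</value>
</property>
```

Conveniently, these ACLs can be reloaded on a running cluster with `hdfs dfsadmin -refreshServiceAcl`, without restarting the NameNode.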
Hadoop authorisation is pretty tight so far, and it can be tighter: HDFS, Oozie, YARN and every other Hadoop process should run under its own dedicated, unprivileged service account. If one of those systems is compromised, whether from inside or outside the organisation, the attacker's limited permissions prevent them from harming the machine or disrupting the other processes.
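A sketch of that hardening on a Linux node; the account names follow common convention and the directory paths are hypothetical, not mandated by Hadoop:

```shell
# One locked-down system account per daemon, with no login shell
sudo useradd -r -s /sbin/nologin hdfs
sudo useradd -r -s /sbin/nologin yarn

# Each account owns only its own data and log directories
sudo chown -R hdfs:hadoop /var/lib/hadoop-hdfs   # hypothetical path
sudo chown -R yarn:hadoop /var/log/hadoop-yarn   # hypothetical path
```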
Any secure system must include auditing: the ability to monitor and report changes in the system. In Hadoop's case, there should be a record of who accessed which data and when, which jobs they ran, which settings they changed, and so on.
Hadoop and its related components do offer built-in audit logging. However, they still have a long way to go. There’s no unified or even consistent audit format, which makes log analysis really difficult. Intel’s Project Rhino, a general Hadoop security initiative, will attempt to build tools that transform audit logs into a standard format. Until then, auditing is available but it’s not easy.
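For instance, the NameNode's HDFS audit trail is controlled through log4j. A sketch of the relevant lines in log4j.properties, assuming the stock RFAAUDIT rolling-file appender that ships with Hadoop's default configuration:

```properties
# log4j.properties: route the NameNode audit trail to its own file
hdfs.audit.logger=INFO,RFAAUDIT
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
```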
Part of data protection is making sure that the data becomes useless if stolen—whether physically or by a man-in-the-middle attack. Data encryption is the obvious solution.
Fortunately, RPC data—the data transferred between Hadoop services and clients—and block data transfer between nodes can be encrypted. Even connections to Hadoop’s web console can be encrypted.
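Concretely, both kinds of wire encryption are switched on through configuration. A sketch:

```xml
<!-- core-site.xml: "privacy" = authentication + integrity + encryption of RPC -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypt block data moving between DataNodes and clients -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
```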
Well, what about the data itself? Sadly, HDFS doesn't support encryption of data at rest just yet. Once again, Project Rhino comes to the rescue, promising an encryption and key management framework for Hadoop. They're still working on it.
Hadoop isn’t secure for the enterprise right out of the box. Nonetheless, it comes with several built-in security features such as Kerberos authentication, HDFS file permissions, Service Level Authorization, audit logging and network encryption. These need to be set up and configured by a sysadmin.
Organisations that need stronger security will probably opt for a Hadoop distribution by Hortonworks, Cloudera, or MapR. These distributions include extra security measures as well as integration with Apache Hadoop security projects, thus making it safe to let the elephant in.