One of the greatest Big Data myths is that you need terabytes or even petabytes of data before you can use Hadoop. However, there are plenty of advantages to using Hadoop for small data: Big Data is not just about volume; Hadoop integrates different data types; it processes data quickly; and it saves money because it scales as your data grows. So, here are four ways to process small data with Hadoop.
Concatenating Text Files
Website logs, emails, or any other textual data can be concatenated into large files. Hadoop's default text input format reads data line by line, so the data will be processed just the same even after concatenation.
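As a rough illustration, here is a minimal Python sketch of the concatenation step, run locally before the data is uploaded to HDFS. The function name `concatenate_text_files`, the `*.log` pattern, and the UTF-8 encoding are assumptions made for this example, not part of Hadoop.

```python
from pathlib import Path


def concatenate_text_files(source_dir, output_path, pattern="*.log"):
    """Merge every file matching `pattern` in `source_dir` into one file.

    Hypothetical helper for illustration. Returns the number of files merged.
    """
    source_dir = Path(source_dir)
    output_path = Path(output_path)
    merged = 0
    with output_path.open("w", encoding="utf-8") as out:
        # Sort the paths so the merged file has a predictable order.
        for path in sorted(source_dir.glob(pattern)):
            # Skip the output file if it lives in the same directory.
            if path.resolve() == output_path.resolve():
                continue
            text = path.read_text(encoding="utf-8")
            out.write(text)
            # End every file with a newline so its last record does not
            # fuse with the first record of the next file.
            if text and not text.endswith("\n"):
                out.write("\n")
            merged += 1
    return merged
```

Because each small file is read fully into memory, this approach only suits files that really are small, which is the case being discussed here.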
What about binary data, such as images or videos? Hadoop Archives, or HAR, can help. HAR can archive small files of any format into a single file via the command line. HAR files operate as another file system layer on top of HDFS, so the archived files can also be accessed directly using har:// URLs.
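The command in question is `hadoop archive`, which takes an archive name, a parent directory, one or more source paths relative to that parent, and a destination directory. The Python sketch below only builds the argument list and the matching har:// URL; the helper names and the example paths are hypothetical, and actually running the command assumes a working Hadoop installation.

```python
def build_har_command(archive_name, parent_dir, sources, dest_dir):
    """Build the argument list for `hadoop archive` (hypothetical helper).

    `sources` are paths relative to `parent_dir`.
    """
    if not archive_name.endswith(".har"):
        raise ValueError("archive name must end with .har")
    return [
        "hadoop", "archive",
        "-archiveName", archive_name,
        "-p", parent_dir,
        *sources,
        dest_dir,
    ]


def har_url(dest_dir, archive_name, inner_path=""):
    """Build a har:// URL on the default file system (hypothetical helper).

    `dest_dir` is expected to be an absolute HDFS path.
    """
    base = f"har://{dest_dir.rstrip('/')}/{archive_name}"
    return f"{base}/{inner_path}" if inner_path else base
```

For example, `build_har_command("images.har", "/user/hadoop", ["images"], "/user/hadoop/archives")` yields a list that could be passed to `subprocess.run(..., check=True)`, and `har_url("/user/hadoop/archives", "images.har", "images")` gives the URL under which the archived directory can then be listed or read.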
Compress Files with Parquet
Parquet is a columnar storage format. It’s available to projects across the Hadoop ecosystem, including Hive and Impala. Parquet is based on Google’s famous Dremel paper. Although it’s similar to the RCFile and ORC formats, it claims to have better performance due to efficient compression and encoding schemes. Just put your data into Parquet and let it do its magic.
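To get a feel for why a columnar layout compresses well, consider the toy Python sketch below. It is not Parquet's actual implementation, which combines several encodings; it only shows the general idea that once values are stored column by column, repeated values sit next to each other and a simple scheme such as run-length encoding can collapse them. The function names and the sample records are made up for this example.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every row has the same keys.
    """
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Made-up web log records, stored row by row.
rows = [
    {"status": 200, "country": "US"},
    {"status": 200, "country": "US"},
    {"status": 200, "country": "DE"},
    {"status": 404, "country": "DE"},
]

columns = to_columns(rows)
encoded_status = run_length_encode(columns["status"])
# Four status values shrink to two (value, count) pairs.
```

In the row layout the status codes are interleaved with country names, so the repeats are never adjacent; in the column layout they are, which is what makes this kind of encoding effective.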
Xplenty runs Hadoop under the hood and includes automatic optimization for small files on the cloud. This feature works by copying files to the cluster’s local HDFS and optimizing them during the copy process. The files are automatically deleted once the cluster has finished processing the data. To use this feature, set the “Pre-process Action” to “Copy” in the cloud storage source component.
Try Xplenty for 7 days to see how it can easily process your small files.