“Big Data? No, we don’t have big data—our data is rather small. We don’t need anything big or fancy for now, but when we grow…”
Working in the data business, we often hear that statement. We get it. Big Data is mostly famous, well, for being big. So if you have anything under a petabyte, why would you even think about using Apache Hadoop?
But you should.
Yes, traditional databases can handle datasets up to the terabyte range, but there are several reasons why Hadoop can be great for processing small data.
1. Big Data Is Not Just Volume
Remember Big Data’s four V’s: Volume, Velocity, Variety and Veracity. Your data doesn’t need to tick all four boxes to count as Big Data. Even a modest volume arriving at high velocity and in great variety may be enough to choke MySQL and call for something different: Hadoop, which handles all the V’s.
2. Hadoop Integrates Different Data Types
Speaking of variety: these days data from many sources needs to be integrated, including web server log files, social network data, emails, images and videos, as well as traditional ERP, CRM and other data from enterprise relational databases. All of these formats, whether flat files, JSON or relational data, can be processed by Hadoop.
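To make that concrete, here is a minimal sketch of how a Hadoop Streaming-style mapper might normalize two of those formats, web server log lines and JSON events, into a common key-value shape. The log pattern, field names and record layout below are illustrative assumptions, not part of any particular pipeline.

```python
import json
import re

# Hypothetical Apache-style access-log pattern (an assumption for illustration).
LOG_PATTERN = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)"')

def map_log_line(line):
    """Parse one access-log line into an (ip, request) pair, or None if malformed."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    return (m.group("ip"), m.group("req"))

def map_json_event(line):
    """Parse one JSON event (e.g. from a social feed) into a (user, action) pair."""
    try:
        event = json.loads(line)
        return (event["user"], event["action"])
    except (ValueError, KeyError):
        return None

if __name__ == "__main__":
    log = '203.0.113.9 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1"'
    event = '{"user": "alice", "action": "share"}'
    print(map_log_line(log))      # ('203.0.113.9', 'GET /index.html HTTP/1.1')
    print(map_json_event(event))  # ('alice', 'share')
```

Once both sources emit the same key-value shape, downstream reducers can join or aggregate them without caring where each record came from.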
3. Hadoop Processes Data Faster
Even small data can take a long time to process with traditional data stores. Hadoop, however, uses MapReduce, which processes data in parallel and brings significant advantages in failure management, redundancy and scalability for batch processes such as ETL offloading, data preparation for analytics and data transformation. Better yet, Hadoop lets you execute multiple jobs at the same time, so you can shorten your processing window and get more work done in less time. Wouldn’t that make your boss happy?
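The MapReduce model itself is simple enough to sketch in plain Python. The word-count example below is the classic illustration: a map phase emits (word, 1) pairs, a shuffle groups values by key, and a reduce phase sums them. On a real cluster Hadoop would run the map calls in parallel across nodes; this single-process version only shows the shape of the computation.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle/sort: group all emitted values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(lines):
    # On a cluster, each mapper would process its own slice of the input in parallel.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))

if __name__ == "__main__":
    lines = ["small data big results", "small data"]
    print(word_count(lines))  # {'small': 2, 'data': 2, 'big': 1, 'results': 1}
```

Because mappers never share state and reducers only see grouped values, the framework is free to restart failed tasks and spread the work across as many machines as you have, which is exactly where the failure management and scalability benefits come from.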
4. Hadoop Saves Money
Apache Hadoop runs on commodity servers, and you don’t even need to buy them or install anything yourself: just use Hadoop in the cloud. Services such as Amazon’s EMR and our very own Xplenty run in the cloud and make Hadoop easier to use, more affordable, elastic and scalable. EMR cuts Hadoop’s total cost of ownership (TCO) for hardware and operations, and Xplenty further reduces TCO for people, training and time to solution. Say goodbye to dust-gathering hardware: Hadoop in the cloud lets you use exactly the resources you need at any given time and discard them when you’re finished.
5. Your Data Is Growing
According to IDC, the amount of data in the digital universe is growing by 40% each year. We’re all a part of it. So instead of building an infrastructure that will get clogged up with data by the next year, choose a technology that easily scales, a technology like Apache Hadoop that lets you start small and grow big.
So now you're convinced you should be using Hadoop to process small data, but how do you do that? We explore that in our blog post, 4 Ways to Process Small Data with Hadoop.
There are plenty of reasons to process small data with Hadoop: it handles data with high velocity, variety or veracity, not just high volume; it integrates different data types; it speeds up processing by running in parallel; and it saves you money when you run it in the cloud. Most importantly, your data is growing, and you need something that grows with it: that yellow elephant.