Everyone is always looking for the next big thing. Big Data isn’t the next big thing; it’s already the big thing. Anyone in an industry that collects massive amounts of data knows what that data means to the success of their company. What’s less apparent, however, is how to put that data to use. The need to use the data properly has presented a plethora of challenges. To meet these challenges, technologies like Hadoop have been developed, and occupations have evolved as a result.
There used to be that guy in the company who handled all the data. Then the need to collect and analyze high volumes of data arose, necessitating the creation of more jobs within the data community. There are DBAs specializing in infrastructure, applications, or both, as well as Business Intelligence analysts and developers, ETL developers, and Data Warehouse architects.
Needs have since progressed, data has gotten bigger, and more jobs have been created, such as the data scientist. It has actually gotten to the point where the demand for data jobs has surpassed the supply of capable data workers. Soon, the gap will widen even further, as shown in the graph below from the McKinsey Global Institute’s May 2011 report on Big Data.
[Figure: Demand for deep analytical talent in the United States could be 50 to 60 percent greater than its projected supply by 2018. Supply and demand of deep analytical talent by 2018, in thousands of people.]
Note: Other supply drivers include attrition (-), immigration (+), and reemploying previously unemployed deep analytical talent (+). Source: US Bureau of Labor Statistics; US Census; Dun & Bradstreet; company interviews; McKinsey Global Institute analysis.
As data requirements have advanced, traditional systems no longer have the ability or capacity to handle the new forms and amounts of data being collected. For the aforementioned companies that deal with data volume, variability, and velocity issues, a technology that can handle those demands is just as essential as the data itself.
Hadoop is one such technology. Originally developed by Doug Cutting and Michael Cafarella as part of the Apache Nutch web-search project, it spun off and took on a life of its own.
An open-source software framework released under the Apache License, Hadoop:
Is a distributed, scalable storage and processing engine.
Supports data-intensive distributed applications as well as storage on large clusters of commodity servers.
Is very effective at processing large amounts of structured, semi-structured, and unstructured data.
Implements a programming model known as MapReduce, in which the main server (master node) in the cluster splits a job into tasks and distributes them across numerous other servers (slave nodes), whose partial results are then combined.
Scales by adding and removing nodes as needed, which makes it well suited to an elastic cloud environment.
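To make the map/reduce idea in the list above a little more concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are made up for illustration, and the "nodes" are simulated by an ordinary loop inside one process.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group emitted values by key, roughly what happens between the two phases."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values collected for one key into one result."""
    return key, sum(values)


def word_count(chunks):
    # Assumption: each chunk stands in for the slice of data one worker node
    # would receive. On a real cluster these map calls would run in parallel.
    mapped = [map_phase(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, split into two chunks as a master node might split a job.
chunks = ["big data is big", "data is everywhere"]
counts = word_count(chunks)
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Each map call looks at only its own chunk, which is why the work can be spread across many servers. Only the grouping and reduce steps need to see results from more than one chunk.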
As with everything in life, though, Hadoop has its pros and cons. The next post will delve into the cons, as well as what you can do to avoid them.