Writing your own ETL code is not trivial. What starts out as a simple ETL process grows more complex over time, and so does the code, which becomes less and less manageable. It's like a short story that morphs into a convoluted volume rivaling Tolstoy's War & Peace.
Welcome to Xplenty's Blog
All things data
Everything comes as a service these days, and so does collecting Big Data. Various platforms on the web are happy to take data collection off your coding hands, making it easy for you to collect data from various sources in one location. Some call this a data hub. The following five platforms will help meet your ever-increasing data collection needs.
There are quite a few real-time platforms out there. A lot of them are newcomers, and the differences between them aren't always clear. The least we can do is present the options for you to choose from, so here are five real-time streaming platforms for Big Data.
Happy birthday to Redshift! Amazon’s data warehouse-as-a-service has just celebrated two years of data querying. Several reviews were written about Redshift at the time, but as far as we know, no one has looked back to check on what’s happening with the red giant since then. So, we went ahead and did a little checkup. Here’s our up-to-date Redshift review.
Big Data consultant David Gruzman answered some of our burning questions about which Big Data platform to use, whether streaming is a must or not, and what are the biggest issues with the cloud.
One of the greatest Big Data myths is that you need terabytes or even petabytes of data before you can use Hadoop. However, there are plenty of advantages to using Hadoop for small data. The only question left is "how".
On paper, Spark and Tez have a lot in common: both possess in-memory capabilities, can run on top of Hadoop YARN and support all data types from any data sources. So, what’s the difference?
People talk a lot about Hadoop, and we like to keep up to date with the latest gossip by reading Hadoop blogs. If you'd also like to jump into the conversation and read the best Hadoop posts out there, here are our favorite Hadoop blogs for 2014.
Apache Spark is setting the world of Big Data on fire. With a promise of speeds up to 100 times faster than Hadoop MapReduce and comfortable APIs, some think this could be the end of Hadoop MapReduce. Or is it?
Following our post about Hadoop security for the enterprise (or the lack thereof), one way to make Hadoop more secure is to install an additional security platform. Five major Hadoop security projects are currently available: Apache Knox Gateway, Apache Sentry, Apache Argus, Apache Accumulo, and Project Rhino. Let's see what they provide.
Let’s say that you’re doing some marketing for a Big Data startup. As part of your campaign, you want to find the most influential tweeters who talk about Hadoop and determine where they come from. So you collect tweets, with DataSift for example. But now you have a ton of JSON objects filled with data from Twitter and no clue what to do with them.
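To see why those raw JSON objects are worth keeping, here's a minimal Python sketch of the kind of analysis the teaser describes: ranking tweeters by follower count and pulling out their location. The tweet objects below are hypothetical and heavily simplified; real Twitter/DataSift payloads carry many more fields.

```python
import json

# Hypothetical, simplified tweet objects; real Twitter JSON has many more fields.
raw = '''[
  {"text": "Hadoop rocks", "user": {"screen_name": "alice", "followers_count": 5200, "location": "Berlin"}},
  {"text": "Trying Hadoop", "user": {"screen_name": "bob", "followers_count": 300, "location": "NYC"}}
]'''

tweets = json.loads(raw)

# Rank tweeters by follower count to approximate "influence".
ranked = sorted(tweets, key=lambda t: t["user"]["followers_count"], reverse=True)

for t in ranked:
    u = t["user"]
    print(u["screen_name"], u["followers_count"], u["location"])
```

At scale, of course, you wouldn't loop over a list in memory; this is exactly the kind of aggregation you'd push into a processing platform.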
We concentrate on making data processing as fast and easy as possible. To complete the dataflow, Xplenty integrates with a plethora of services that can store, analyze, or visualize data. One of these services is Chart.io, a popular service for data visualization and analysis. You can use Xplenty to process the data and then visualize the results in Chart.io. Here’s how.
Everybody has issues, and so do users and repositories on GitHub. That's why we decided to answer this year's GitHub Data Challenge by heading where developers fear to tread and analyzing GitHub issues from 2013.
Integrating data from MongoDB and a relational database sounds like a major headache. On one hand you have a schemaless NoSQL database containing JSON objects, and on the other, an SQL database with a fully defined schema. How can you easily integrate them? With Xplenty’s data integration on the cloud, of course!
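To appreciate the headache, here's a hand-rolled sketch of the core problem: nested, schemaless JSON documents have to be flattened into flat relational columns before an SQL database will take them. The document, table name, and column-naming scheme below are all hypothetical; an in-memory SQLite database stands in for the relational side.

```python
import sqlite3

# A hypothetical MongoDB-style document with a nested sub-document.
doc = {"_id": "42", "name": "Ada", "address": {"city": "London", "zip": "N1"}}

def flatten(d, prefix=""):
    """Flatten nested dicts into column_name -> value pairs (address.city -> address_city)."""
    out = {}
    for k, v in d.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "_"))
        else:
            out[key] = v
    return out

row = flatten(doc)

# Build a table whose columns mirror the flattened document and insert the row.
conn = sqlite3.connect(":memory:")
cols = ", ".join(row)
conn.execute(f"CREATE TABLE users ({cols})")
conn.execute(f"INSERT INTO users ({cols}) VALUES ({', '.join('?' * len(row))})",
             list(row.values()))

print(conn.execute("SELECT name, address_city FROM users").fetchone())
```

This works for one well-behaved document; once documents start disagreeing about which fields exist, you're maintaining schema-migration code by hand, which is the pain a managed integration service takes off your plate.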
Although the Internet made the world flat, geography still matters. Knowing which countries your users live in could provide business opportunities to localize your services and increase profits. The only question is how in the world to do it.
You’ve spent hours tinkering and preparing the perfect dataflow to batch process zillions of web logs. Feeling satisfied, you run the job on one of the clusters and leave your desk. The boss catches you on the way out - he wonders what’s going on with the clicks and impressions report. You promise it will be ready tomorrow and head for the exit.
A regular expression, AKA regex, is a powerful yet really confusing tool. Although regular expressions are the technology behind text replacement and natural language processing, they are hard to read, and even harder to write.
Comparing Hive with HBase is like comparing Google with Facebook - although they compete over the same turf (our private information), they don’t provide the same functionality. But things can get confusing for the Big Data beginner when trying to understand what Hive and HBase do and when to use each one of them. Let’s try and clear it up.
According to the Elephant Care Manual for Mahouts and Camp Managers: "It is essential to cleanse the elephant's body carefully every day by using half of a coconut shell to scrape the elephant on a daily basis." ETL developers may not have coconuts at their disposal, but some of them may still need to do the dirty work of cleansing Big Data.
Last year Cloudera published a blog post on Big Data’s new use cases: transformation, active archive, and exploration. There’s one more use case that isn’t explicitly mentioned - data integration.
An overview of eleven open source and commercial SQL-on-Hadoop tools: Apache Hive, Apache Sqoop, Apache Phoenix, Impala, Presto, BigSQL, CitusDB, Hadapt, Jethro, Lingual, and HAWQ.
Mortal Kombat’s master of ice Sub-Zero and the living-dead fire breathing Scorpion are major archenemies. As the story goes, Sub-Zero and his clan of assassin ninjas slaughtered their rival clan, which Scorpion and his family were members of. Scorpion’s hatred made him rise from the Netherrealm to avenge his family’s death and kill Sub-Zero in the great tournament.
Ken and Ryu are both the best of friends and the greatest of rivals in the Street Fighter game series. When it comes to Hadoop data storage on the cloud though, the rivalry lies between Hadoop Distributed File System (HDFS) and Amazon's Simple Storage Service (S3). Although Apache Hadoop traditionally works with HDFS, it can also use S3 since it meets Hadoop's file system requirements. Netflix utilizes this feature and stores data on S3 rather than HDFS. Why did Netflix choose this data architecture? To understand their motives let's see how HDFS and S3 do in battle!
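The "meets Hadoop's file system requirements" part is what makes the battle possible at all: pointing a job at S3 is mostly a matter of credentials and a URI scheme. As an illustrative sketch (bucket name and credential values are placeholders; the property names come from Hadoop's S3 connector configuration), a `core-site.xml` fragment might look like:

```xml
<!-- core-site.xml: credentials for Hadoop's s3a connector (placeholder values).
     With these set, a job can read s3a://my-bucket/logs/ instead of hdfs:///logs/. -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```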
Childhood dreams do come true: in 2015, "Batman vs. Superman" will bring the world's biggest superheroes to battle on-screen, finally settling the eternal debate of who will prevail (I put my Bitcoins on Batman).
Data warehousing projects are challenging. Quite often, the sheer number of business requirements involved and the data volumes attached to them have made these projects notoriously risky and costly. Today, however, most organizations understand that deriving insights from data is essential, and the majority of them are executing various types of data warehousing projects.