Welcome to Xplenty's Blog

All things data

Using An ETL Platform VS Writing Your Own Code

Writing your own ETL code is not trivial. What starts out as a simple ETL process gets more complex over time, and so does the code, which becomes less manageable. A short story morphs into a convoluted volume that rivals Tolstoy’s War & Peace.

5 Platforms for Collecting Big Data

Everything comes as a service these days, and so does collecting Big Data. Various platforms on the web are happy to take data collection off your coding hands, making it easy for you to collect data from various sources in one location. Some call this a data hub. The following five platforms will help meet your ever-increasing data collection needs.

5 Real-time Streaming Platforms for Big Data

There are quite a few real-time platforms out there. A lot of them are newcomers, and the differences between them aren’t clear at all. The least we can do is present all the options for you to choose from, so here are five real-time streaming platforms for Big Data.

Amazon Redshift Review 2015

Happy birthday to Redshift! Amazon’s data warehouse-as-a-service has just celebrated two years of data querying. Several reviews were written about Redshift at the time, but as far as we know, no one has looked back to check on what’s happening with the red giant since then. So, we went ahead and did a little checkup. Here’s our up-to-date Redshift review.

Spark, Impala, Tez and Hive: Interview with David Gruzman

Big Data consultant David Gruzman answered some of our burning questions about which Big Data platform to use, whether streaming is a must or not, and what the biggest issues with the cloud are.

4 Ways to Process Small Data with Hadoop

One of the greatest Big Data myths is that you need terabytes or even petabytes of data before you can use Hadoop. However, there are plenty of advantages to using Hadoop for small data. The only question left is “how”.

Spark vs. Tez: What's the Difference?

On paper, Spark and Tez have a lot in common: both have in-memory capabilities, can run on top of Hadoop YARN, and support all data types from any data source. So, what’s the difference?

Top 7 Hadoop Blogs for 2014

People talk a lot about Hadoop, and we like to keep up to date with the latest gossip by reading Hadoop blogs. If you'd also like to jump into the conversation and read the best Hadoop posts out there, here are our favorite Hadoop blogs for 2014.

Spark vs. Hadoop MapReduce

Apache Spark is setting the world of Big Data on fire. With a promise of speeds up to 100 times faster than Hadoop MapReduce and comfortable APIs, some think this could be the end of Hadoop MapReduce. Or is it?

5 Hadoop Security Projects

Following our post about Hadoop security for the enterprise, or the lack thereof, one of the ways to make Hadoop more secure is by installing an additional platform. Five major Hadoop security projects are currently available: Apache Knox Gateway, Apache Sentry, Apache Argus, Apache Accumulo and Project Rhino. Let’s see what they provide.

Become a Twitter Data Analyst with Xplenty

Let’s say that you’re doing some marketing for a Big Data startup. As part of your campaign, you want to find the most influential tweeters who talk about Hadoop and determine where they come from. So you collect tweets, with DataSift for example. But now you have a ton of JSON objects filled with data from Twitter and no clue what to do with them.

Process Data with Xplenty and Visualize it with Chart.io

We concentrate on making data processing as fast and easy as possible. To complete the dataflow, Xplenty integrates with a plethora of services that can store, analyze, or visualize data. One of these services is Chart.io, a popular service for data visualization and analysis. You can use Xplenty to process the data and then visualize the results in Chart.io. Here’s how.

GitHub, You Got Issues: An Analysis of Issues on GitHub in 2013

Everybody has issues, and so do users and repositories on GitHub. That's why we decided to answer this year’s GitHub Data Challenge by heading where developers fear to tread and analyzing GitHub issues in 2013.

How to Integrate MongoDB with Relational Databases

Integrating data from MongoDB and a relational database sounds like a major headache. On one hand you have a schemaless NoSQL database containing JSON objects, and on the other, an SQL database with a fully defined schema. How can you easily integrate them? With Xplenty’s data integration on the cloud, of course!

How to get Website Visitor Geolocations from IPs

Although the Internet made the world flat, geography still matters. Knowing which countries your users live in could provide business opportunities to localize your services and increase profits. The only question is how in the world to do it.

8 Data Integration Best Practices

You’ve spent hours tinkering and preparing the perfect dataflow to batch process zillions of web logs. Feeling satisfied, you run the job on one of the clusters and leave your desk. The boss catches you on the way out - he wonders what’s going on with the clicks and impressions report. You promise it will be ready tomorrow and head for the exit.

Using Regular Expressions in Big Data

A regular expression, AKA regex, is a powerful yet really confusing tool. Although regular expressions are the technology behind text replacement and natural language processing, they are hard to read, and even harder to write.

Hive vs. HBase

Comparing Hive with HBase is like comparing Google with Facebook - although they compete over the same turf (our private information), they don’t provide the same functionality. But things can get confusing for the Big Data beginner when trying to understand what Hive and HBase do and when to use each one of them. Let’s try and clear it up.

Data Cleansing Big Data: Scrubbing the Elephant

According to the Elephant Care Manual for Mahouts and Camp Managers: "It is essential to cleanse the elephant's body carefully every day by using half of a coconut shell to scrape the elephant on a daily basis." ETL developers may not have coconuts at their disposal, but some of them may still need to do the dirty work of cleansing Big Data.

Hadoop Data Integration 101

Last year Cloudera published a blog post on Big Data’s new use cases: transformation, active archive, and exploration. There’s one more use case that isn’t explicitly mentioned - data integration.

12 SQL-on-Hadoop Tools

An overview of 12 open source and commercial SQL-on-Hadoop tools: Apache Hive, Apache Sqoop, Apache Phoenix, Impala, Presto, BigSQL, CitusDB, Hadapt, Jethro, Lingual, and HAWQ.

Hadoop-as-a-Service vs. On-Premise...FINISH HIM

Mortal Kombat’s master of ice Sub-Zero and the living-dead fire breathing Scorpion are major archenemies. As the story goes, Sub-Zero and his clan of assassin ninjas slaughtered their rival clan, which Scorpion and his family were members of. Scorpion’s hatred made him rise from the Netherrealm to avenge his family’s death and kill Sub-Zero in the great tournament.

Storing Apache Hadoop Data on the Cloud - HDFS vs. S3

Ken and Ryu are both the best of friends and the greatest of rivals in the Street Fighter game series. When it comes to Hadoop data storage on the cloud though, the rivalry lies between Hadoop Distributed File System (HDFS) and Amazon's Simple Storage Service (S3). Although Apache Hadoop traditionally works with HDFS, it can also use S3 since it meets Hadoop's file system requirements. Netflix utilizes this feature and stores data on S3 rather than HDFS. Why did Netflix choose this data architecture? To understand their motives let's see how HDFS and S3 do in battle!

Hadoop vs. Redshift

Childhood dreams do come true - in 2015 "Batman vs. Superman" will bring the world’s biggest superheroes to battle on-screen, finally settling the eternal debate over who will prevail (I put my Bitcoins on Batman).

Data Sources and Destinations with Xplenty's Hadoop Platform

Data Warehousing projects are challenging. Quite often, the sheer number of business requirements involved, and the data volumes attached to them, make these projects notoriously risky and costly. Today, however, most organizations understand that deriving insights from data is essential, and to a certain extent, the majority of them are executing various types of data warehousing projects.