Scale Your Data Collection on the Cloud Like a Champ

When we met with WalkMe, a company that offers helpful in-app walkthroughs (we use it for our app and it’s great), our meeting took a surprising turn. We expected a discussion about crunching big data; they already had a data collection mechanism in place. But their problem preceded any sort of crunching: they couldn’t scale the data collection process.

Their setup was simple: events from their application were sent to a tracking server (developed in-house), and the server wrote the data into a relational database. It worked fine for a while, but when the company wanted to collect a lot more data about app events, the system became overloaded. Fixing it would have meant hiring an extra DBA and restructuring the DB via partitioning. The resources were not available, and even if they had been, the setup would only scale up to a limit. It was not an option.

WalkMe is not alone. Plenty of companies are limited by their data collection methods when it comes to scalability. Once they need more detailed data in larger quantities, scaling the system can become a major pain. Three common methods are used to collect big data - storing it directly in the DB, keeping it in a local file, or using S3/CloudFront logging. Is your company using the right one? Let’s dive in and see.

Storing Directly in the DB

This is what companies usually start with. As the name suggests, data is inserted right into the DB. There are two ways to do it - row by row or bulk insert. Row by row means each event is added to the DB as its own row in real time. Bulk insert adds multiple rows to the DB in one transaction. It’s faster than row by row, but if the transaction fails, the entire batch has to be re-inserted.
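
A minimal sketch of the two styles, using Python’s built-in sqlite3 for illustration (the table name and columns are hypothetical; a real setup would use your relational DB’s driver):

```python
import sqlite3

conn = sqlite3.connect("events.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (user_id TEXT, event_type TEXT, ts TEXT)")

# Row by row: each event is committed as it arrives, so the data is
# available in real time, at the cost of one transaction per event.
def insert_row(event):
    conn.execute("INSERT INTO events VALUES (?, ?, ?)",
                 (event["user_id"], event["event_type"], event["ts"]))
    conn.commit()

# Bulk insert: many events go in as a single transaction. Faster overall,
# but if the transaction fails, the whole batch must be re-inserted.
def insert_bulk(events):
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                     [(e["user_id"], e["event_type"], e["ts"]) for e in events])
    conn.commit()
```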

Pros

  • Better performance for inserting data than other methods.
  • Real-time data available when adding row by row.


Cons

  • Schema changes are required to add new types of data.
  • Scaling is required in two layers - application and database. Scaling the application is usually easier (using a network load balancer, for example), but scaling the database requires hiring an expert DBA, partitioning the DB, and scaling up the server. Relational DBs that scale out to multiple nodes are expensive and require a lot of maintenance.


Bottom Line

Fast performance that doesn’t scale.

Local File

Data is dumped into big local files. A program periodically uploads these files to S3 or inserts them in batches into a NoSQL DB such as Amazon DynamoDB, or into a data warehouse like Amazon Redshift.
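
A rough sketch of the pattern, assuming boto3 and a hypothetical bucket and file path; a real tracker also needs the rotation and failure handling discussed in the cons below:

```python
import json
import time
import boto3  # AWS SDK for Python

BUCKET = "my-event-logs"              # hypothetical bucket name
LOG_PATH = "/var/log/app/events.log"  # hypothetical local log file

def append_event(event):
    # Dump each event as one JSON line in a big local file.
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(event) + "\n")

def upload_and_rotate():
    # Periodically ship the file to S3 under a date/time-stamped key,
    # which makes later filtering by day or hour much quicker.
    key = time.strftime("events/%Y/%m/%d/%H%M%S.log")
    boto3.client("s3").upload_file(LOG_PATH, BUCKET, key)
    # Naive rotation - production code must not drop events written
    # while the upload is in flight.
    open(LOG_PATH, "w").close()
```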

Pros

  • New types of data can be added easily since no schema changes are required.
  • Compatible with all applications because any file format can be used.
  • Quicker filtering via customized directory/file names, e.g. with date/time indication.


Cons

  • You need to develop a tracking program to deal with the files - rotating logs while new data keeps coming in, handling failures, and guaranteeing transactionality. Even with the manpower, time, and money, such a program is hard to get right.
  • Scaling means adding more servers, more maintenance, and more money.
  • Data is not as queryable as it would be in a DB.
  • Staging and production environments require extra servers.


Bottom Line

More flexible than direct DB storage, but requires more development, and scaling is still an issue.

S3/CloudFront Logging

This old-school solution goes back to the early days when visitor counters and burning "hot!" animations ruled the web. To track an event, an HTTP request is sent for a 1x1 pixel image in a relevant S3 directory. Accessing the image automatically generates a W3C log entry with all the HTTP request parameters: IP address, browser, date/time, etc. Extra session-level data, such as username or mouse position, is passed via the querystring. To differentiate between event types, the images are placed in accordingly named directories, e.g. /click/.
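
On a web page the pixel is usually just an <img> tag, but any HTTP client can fire the same request. A sketch in Python, with a hypothetical CloudFront domain and made-up session fields:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical CloudFront distribution serving the 1x1 pixels from S3.
TRACKING_HOST = "https://d1234example.cloudfront.net"

def track(event_type, **session_data):
    # The directory encodes the event type (e.g. /click/pixel.gif) and
    # session-level data rides along in the querystring. The response is
    # ignored - the request itself is what lands in the access log,
    # together with the IP, user agent, and date/time.
    url = "{}/{}/pixel.gif?{}".format(TRACKING_HOST, event_type,
                                      urlencode(session_data))
    urlopen(url)

track("click", username="alice", x=120, y=45)
```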

Pros

  • No tracking server required - data reaches S3 automatically.
  • No file management - Amazon handles all the file monkey business.
  • No servers - Amazon provides them.
  • Cost effective - you pay only for log storage and bandwidth. The logs take little space since they are gzipped, and the bandwidth for 1x1 pixel images is marginal.
  • Easily scalable, with practically infinite space and firepower.
  • Quick and easy to implement.
  • Simple setup for staging/production environments via additional distributions and a prefix.
  • Web application performance is unharmed, especially when using the CloudFront CDN.


Cons

  • Slower filtering performance compared to a local setup. Amazon handles log file/directory names automatically, and no customization is available.
  • Not suitable for real time (or for the impatient). Data is aggregated into a new file in the bucket only once per hour, and that’s Amazon’s best effort, so it could take longer.
  • Data is not as queryable as it would be in a DB.
  • Vendor dependent. Having your servers outside of Amazon will decrease performance.
  • No control over the file format. The W3C Extended Log File Format is mandatory, and some applications may not like that (see the parsing sketch after this list).
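
To give a feel for that format: the logs are gzipped, tab-separated, and self-describing, with a "#Fields:" comment naming the columns. A hedged sketch of reading one back (the filename is made up; the field names follow CloudFront’s documented layout):

```python
import gzip

def parse_w3c_log(path):
    # The "#Fields:" comment names the columns; every non-comment line
    # is one tab-separated record, which we map back to a dict.
    fields = []
    with gzip.open(path, "rt") as f:
        for line in f:
            if line.startswith("#Fields:"):
                fields = line.split()[1:]
            elif not line.startswith("#"):
                yield dict(zip(fields, line.rstrip("\n").split("\t")))

for record in parse_w3c_log("EDFDVBD6EXAMPLE.2014-01-01-12.abcdef12.gz"):
    print(record.get("cs-uri-stem"), record.get("cs-uri-query"))
```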


Bottom Line

Quick, cheap, and scalable, though it doesn’t provide the best performance or customization.

WalkMe’s Solution

We advised WalkMe to use S3/CloudFront logging. It took their developer only four hours to implement it without a DBA in sight. As an added bonus, it separated their application from the DB - now it can collect new event types and keep backwards compatibility with the DB without any schema changes. Whenever the shiny new schema is ready, new event types will slide right in. Maybe S3/CloudFront logging could work for your big data collection as well.

So much emphasis is put on the technologies for processing, analyzing, and visualizing data, yet the importance of collecting that data often gets lost in the shuffle. The two go hand in hand: to get good output from your data, you must first have proper input. Only once you achieve synergy between the two will you be able to fully tap into your data’s potential.

 

