Calculating key performance indicators (KPIs) requires huge amounts of data, a luxury that until recently only large enterprises could afford. This post series discusses how companies of all sizes can measure KPIs by collecting and processing Big Data on the cloud.
Define Company KPIs
The first step is to understand what KPIs are, set company goals, and define KPIs for each goal.
What Are KPIs?
Key performance indicators are measurements that help determine whether a business succeeds in achieving its goals. For example, if an e-commerce company would like to increase revenues by 20% this year, the KPIs measuring this goal could be sales in US Dollars per month and the number of orders per month.
Which goals would your business like to achieve in the near future? Write them all down.
Find quantifiable KPIs to measure each goal. There are many performance indicators, but stick to only 2-3 key performance indicators per goal.
Example Goals & KPIs
Below are a few example goals for a company as well as relevant KPIs:
Grow site traffic by 40% in the next year
Page views per month
Unique visitors per month
Increase user engagement by 5% in the next 6 months
User registration per month
Facebook likes per month
Boost revenue by 10% in the next quarter
Orders per week
Shopping cart abandonment rate per week
Decrease support costs by 20% in the next year
Number of support calls per month
Support call time per month
A lot of data is needed to calculate KPIs. To get this data, web applications should generate and store detailed logs (also see “Scale Your Data Collection on the Cloud Like a Champ”).
Log HTTP Data
When a web browser loads a website or submits a form, it sends an HTTP request to the web server. This request carries a lot of user data, such as IP addresses, browser types, locales, URLs, and query strings, all of which should be logged for later analysis.
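As a rough sketch of what such logging could look like, the function below formats the relevant fields of a WSGI request environ into one tab-separated log line. The field selection and ordering are illustrative, loosely following the W3C extended log style, not a prescribed schema.

```python
# Minimal sketch of HTTP request logging from a WSGI environ dict.
# Field choice and order are illustrative assumptions.
from datetime import datetime, timezone

def format_log_line(environ):
    """Build one tab-separated log line from a WSGI environ dict."""
    fields = [
        datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
        environ.get("REMOTE_ADDR", "-"),           # client IP address
        environ.get("REQUEST_METHOD", "-"),        # GET, POST, ...
        environ.get("PATH_INFO", "-"),             # requested URL path
        environ.get("QUERY_STRING", "-") or "-",   # query string, if any
        environ.get("HTTP_USER_AGENT", "-"),       # browser type
        environ.get("HTTP_ACCEPT_LANGUAGE", "-"),  # locale
    ]
    return "\t".join(fields)
```

In a real deployment the web server (or a WSGI middleware) would call this on every request and append the line to a log file.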
Save Application Data
Application-level data, such as user ID, order ID, or event type, should also be logged. In some cases it is already available in the URL's query string and is therefore captured as part of the HTTP data; if it isn't, the application should log it explicitly.
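One common way for the application to log its own events is to append them as JSON lines to a local file that is later shipped to storage. This is a hedged sketch, not the post's prescribed method; the file name and field names are illustrative.

```python
# Sketch: append application events as JSON lines.
# File name and record layout are illustrative assumptions.
import json
from datetime import datetime, timezone

def log_event(event_type, **data):
    """Write one application event (user ID, order ID, etc.) to a log file."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event_type,
        **data,
    }
    with open("app_events.log", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

A registration handler, for example, might call `log_event("registration", user_id=7, source="facebook")`.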
Store the Data
The above logs are usually saved in a relational database, as local files, or on a cloud service such as Amazon S3/CloudFront. Another great, cost-effective solution is CloudFront logging. This method tracks application events via HTTP requests to 1×1-pixel images in an S3 directory (e.g. /mouse/click.png). Accessing these images automatically generates W3C logs that include all HTTP request parameters; application-level data is passed via the query string. To set up this method in your AWS account, please see the following tutorial.
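The tracking-pixel pattern boils down to building an image URL whose query string carries the application data. The sketch below shows one way to do this; the CloudFront domain is a hypothetical placeholder.

```python
# Sketch of building a tracking-pixel URL for CloudFront logging.
# The distribution domain below is a hypothetical example value.
from urllib.parse import urlencode

def pixel_url(event_path, **params):
    """Build the URL of a 1x1 tracking image; app data goes in the query string."""
    base = "https://d111111abcdef8.cloudfront.net"
    return f"{base}/{event_path}?{urlencode(params)}"
```

The page (or an email, or a mobile client) then fetches this URL, e.g. via an `<img>` tag, and CloudFront records the full request, query string included, in its W3C access logs.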
What Is Hadoop?
One of the most popular Big Data platforms is Apache Hadoop, an open source framework for distributed processing and storage of Big Data on commodity machines. Hadoop is great because it can store huge amounts of data, process that data in parallel on a cluster of servers, and scale horizontally with off-the-shelf machines.
Hadoop as a Service
Setting up your own Hadoop environment is difficult and expensive: you need plenty of hardware and expert manpower for development, support, and maintenance. Luckily, there are affordable Hadoop solutions on the cloud that eliminate these issues and scale easily. Most require coding skills to process the data, but they are still the best option for getting Hadoop up and running as soon as possible.
Now that everything is ready, it’s data crunching time.
Load the Data
Load all your data into the Hadoop cluster. This should be easier if the data is already on the cloud and you use Hadoop as a Service. However, if the data is on-premise but your Hadoop is on the cloud, it could take a while to get that data online.
Parse the Logs
Since the logs are probably stored as CSV or another raw text format, they should be parsed to extract the individual fields. A second parsing pass may be needed to extract parameters and values from the query string.
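The two-pass parse described above can be sketched as follows. The column order is an assumption for illustration; real W3C logs declare their field order in a `#Fields:` header line.

```python
# Sketch of a two-pass log parse: split tab-separated fields,
# then extract parameters from the query-string field.
# The column order in FIELDS is an illustrative assumption.
from urllib.parse import parse_qs

FIELDS = ["date", "time", "ip", "method", "uri", "query"]

def parse_line(line):
    """First pass: split the raw line; second pass: unpack the query string."""
    row = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    row["params"] = {k: v[0] for k, v in parse_qs(row.get("query", "")).items()}
    return row
```

On Hadoop this logic would typically run inside a map step (or be expressed in a higher-level tool such as Hive or Pig), one log line per record.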
Clone the Chain
A separate data processing chain is needed to calculate each KPI. To accomplish this, create one copy of the chain for each defined KPI (eight in the example above).
Different fields are relevant for each KPI. For instance, user registration per month might require fields for user ID, date/time, and registration source (e.g. direct or Facebook). Filter out the rest of the fields to discard irrelevant data and speed up processing.
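The filtering step amounts to projecting each record down to the fields one KPI needs. A minimal sketch, assuming parsed records are dicts (field names here are illustrative):

```python
# Sketch: keep only the fields a given KPI's chain needs,
# discarding the rest to speed up downstream processing.
def project(records, keep):
    """Yield each record reduced to the fields listed in `keep`."""
    for rec in records:
        yield {k: rec[k] for k in keep if k in rec}
```

For the registrations KPI, `keep` might be `["user_id", "ts", "source"]`, while the page-views chain would keep a different subset.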
Process the Data
Calculating a KPI may require joins, filters, aggregations, and other operations. For example, to calculate user registrations per month, filter out registrations older than 30 days, deduplicate the remaining results by user ID, and finally count them.
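The registrations-per-month calculation just described can be sketched as a single function: keep events from the last 30 days, dedupe by user ID, count. In practice this would be expressed as a Hadoop job; the in-memory version below only illustrates the logic.

```python
# Sketch of the registrations-per-month KPI: filter by date,
# deduplicate by user ID, then count the distinct users.
from datetime import datetime, timedelta, timezone

def registrations_last_30_days(events, now=None):
    """Count distinct user IDs among registration events from the last 30 days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=30)
    recent = (e for e in events if e["ts"] >= cutoff)   # filter step
    return len({e["user_id"] for e in recent})          # distinct + count
```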
Now that the KPIs have been calculated, they should be saved for later analysis. The simplest option is to save them in files, preferably on the cloud. If you have a data warehouse or use a service like Amazon Redshift then that’s where the KPIs should be stored.
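Saving the computed KPIs to files can be as simple as writing a small CSV that is then uploaded to cloud storage or loaded into a warehouse such as Redshift. A minimal sketch (the file path and KPI names are illustrative):

```python
# Sketch: persist computed KPI values as a CSV file for later
# analysis. Path and KPI names are illustrative assumptions.
import csv

def save_kpis(kpis, path):
    """Write a {kpi_name: value} mapping as a two-column CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["kpi", "value"])
        writer.writerows(kpis.items())
```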
Measuring key performance indicators is best done by collecting and processing Big Data on the cloud. Once the KPIs are defined and detailed application logs are collected, the KPIs can be calculated by processing that data on Hadoop. With the cloud providing a scalable, cost-effective platform for handling Big Data, it can help measure your business on its way to success.