Collecting Big Data with S3/CloudFront Logging

In our recent article, "Scale Your Data Collection on the Cloud Like a Champ", we reviewed several ways of collecting big data, the most promising of which was S3/CloudFront logging. It’s low cost and quick to implement. In this article we’d like to go deeper into the woods and show how to set up S3/CloudFront logging with your application.

1. Define App Data

Sit back and think - which data would you like to collect? Which app events should be logged? These could be page visits, mouse clicks, logins, errors, etc. Some of them may include parameters such as the page visit URL. Write them all down. Be as thorough as possible so you don’t lose any precious data.
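One way to "write them all down" is to keep the list as a small data structure that the rest of the code can share. The sketch below is only an illustration: the EVENTS catalog, its categories, and the listEvents helper are hypothetical names, not something this setup requires.

```javascript
// Hypothetical event catalog: category -> action -> parameters to log.
var EVENTS = {
  page:  { visit: ['url'] },
  mouse: { click: ['id', 'url'] },
  user:  { login: [], error: ['message'] }
};

// Lists every event in the catalog as "category/action".
function listEvents(catalog) {
  var names = [];
  Object.keys(catalog).forEach(function (category) {
    Object.keys(catalog[category]).forEach(function (action) {
      names.push(category + '/' + action);
    });
  });
  return names;
}

console.log(listEvents(EVENTS).join(', '));
// page/visit, mouse/click, user/login, user/error
```

Each "category/action" pair then maps naturally to one directory and one image file later on.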

2. Create an AWS Account

If you don’t already have an AWS (Amazon Web Services) account, you can sign up here. Registration is free with the basic support package.

3. Create an S3 Bucket

Go to the S3 dashboard and create a bucket for saving the logs. Note that the bucket name must be unique across all of Amazon S3 and adhere to DNS rules: 3-63 characters; only lowercase letters, numbers, periods, and hyphens; no underscores; and it shouldn't look like an IP address. Don’t turn on logging for the bucket itself - we will do so via CloudFront. Also, make sure to enable CORS (Cross Origin Resource Sharing) for this bucket so that AJAX requests for the images will be accepted.
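If you generate bucket names in code, a quick check along these lines can catch bad names before the S3 console rejects them. This is a rough sketch, not Amazon's own validation: isValidBucketName is a hypothetical helper, and it assumes lowercase letters, numbers, periods, and hyphens are the only allowed characters and that a name must start and end with a letter or number.

```javascript
// Hypothetical helper; not part of any AWS SDK.
function isValidBucketName(name) {
  if (typeof name !== 'string') return false;
  // 3-63 characters long
  if (name.length < 3 || name.length > 63) return false;
  // Assumed character set: lowercase letters, numbers, periods, hyphens;
  // must start and end with a letter or number (this also rules out underscores)
  if (!/^[a-z0-9][a-z0-9.-]*[a-z0-9]$/.test(name)) return false;
  // No empty labels such as "my..bucket"
  if (name.indexOf('..') !== -1) return false;
  // Must not look like an IP address, e.g. 192.168.0.1
  if (/^\d{1,3}(\.\d{1,3}){3}$/.test(name)) return false;
  return true;
}

console.log(isValidBucketName('my-app-logs')); // true
console.log(isValidBucketName('My_Bucket'));   // false
console.log(isValidBucketName('192.168.0.1')); // false
```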

Creating a S3 Bucket

4. Create Event Images

Set up directories in the image bucket, for example /mouse, to organize events by category, and create 1x1 pixel images (see previous post) for all the events that you defined in the first step, e.g. click.png, login.png, error.png. Don’t worry about event parameters at the moment; we will deal with them shortly. Files uploaded to S3 are private by default, so make sure to change the file permissions to public. You may use tools such as CloudBerry Explorer or S3 Browser to do so and much more.

Set HTTP headers for all the images so that they will be cached by CloudFront, thus saving GET requests from CloudFront edge locations to S3. Go to the relevant bucket, check the image files on the left, click Actions at the top, choose Properties, and open the Metadata section. Add the following metadata line and click Save:

  • Cache-Control: max-age=31536000
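In case the number looks arbitrary, max-age is expressed in seconds, and 31536000 is simply 365 days. The snippet below just spells out that arithmetic; cacheControlForDays is a hypothetical helper name.

```javascript
var SECONDS_PER_DAY = 24 * 60 * 60; // 86400

// Builds a Cache-Control value that keeps an object cached for the given number of days.
function cacheControlForDays(days) {
  return 'max-age=' + (days * SECONDS_PER_DAY);
}

console.log(cacheControlForDays(365)); // max-age=31536000
```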


5. Create a CloudFront Distribution

Creating a CloudFront distribution costs extra, but it’s mandatory - it logs the query string, adds extra log info such as edge locations, and helps to deliver files via Amazon’s CDN to shorten load times. Access the CloudFront dashboard and create a web distribution for the image S3 bucket. Make sure that Use Origin Cache Headers is set under Object Caching (it’s the default setting).

Note that the distribution gets a random domain name. It could take a while before it starts working because the DNS servers need to be updated to support it. You can also set a more friendly domain using the Alternate Domain Names (CNAMEs) option under Distribution Settings, though it requires configuring your DNS settings so that your domain points to CloudFront’s domain name. See Amazon’s documentation for more info.

Creating a CloudFront Distribution

6. Turn Logging On

Still in the CloudFront dashboard, check the distribution on the left, click Distribution Settings at the top, click Edit under the General tab, enable logging, and insert the bucket where you want to store the logs.

Turn Logging On

7. Code a Function to Call Events

Time to get your hands dirty and write a method that registers events, or call one of your app’s developers to do it for you. The code could be on the client side, server side, or both depending on the architecture. The method should simply send an asynchronous HTTP GET request to the relevant image URL, e.g. to (link for demo purposes only). If you need to send additional event parameters, use the query string (don’t forget URL encoding), e.g.

Here’s an example of such a client side function in JavaScript/jQuery:

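$.CloudFrontLog = function (attr) {
  // Prepend your distribution's base URL to the empty string below
  var url = '' + attr.category + '/' + attr.action + '.png',
    data = {
      url: attr.url
    };
  // $.get sends an asynchronous GET and URL-encodes data into the query string
  return $.get(url, data);
};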
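If you'd rather not depend on jQuery, the same idea can be sketched in plain JavaScript: build the image URL, URL-encode any parameters into the query string, and request the image. Treat this as one possible approach rather than the definitive one. BASE_URL is a made-up placeholder for your distribution's domain name, and buildEventUrl and logEvent are hypothetical names.

```javascript
// Hypothetical placeholder; replace with your own distribution's domain name.
var BASE_URL = 'https://d1234example.cloudfront.net/';

// Builds the event image URL and appends URL-encoded parameters, if any.
function buildEventUrl(category, action, params) {
  var url = BASE_URL + encodeURIComponent(category) + '/' +
            encodeURIComponent(action) + '.png';
  var pairs = Object.keys(params || {}).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
  });
  return pairs.length ? url + '?' + pairs.join('&') : url;
}

// Browser only: setting src triggers an asynchronous GET for the image.
function logEvent(category, action, params) {
  var img = new Image();
  img.src = buildEventUrl(category, action, params);
  return img;
}

console.log(buildEventUrl('mouse', 'click', { id: 'buy now' }));
// https://d1234example.cloudfront.net/mouse/click.png?id=buy%20now
```

Keeping the URL building separate from the request makes it easy to check the encoding on its own before wiring it into the app.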

8. Call the Events

Dig through your app’s code and add event calls using the method that you’ve just written. This will collect the data that you defined in step 1. Here’s a jQuery code sample for logging client-side button clicks:

$('.btn').click(function(e) {
  var id = $(this).attr('id');
  // Log the click using the method from step 7
  $.CloudFrontLog({
    action: 'click',
    category: 'mouse',
    id: id,
    url: location.href
  });
});

9. Test

Use your staging environment to call events via the application and check that the logs are generated accordingly. Patience, young padawan: it may take an hour or so until Amazon writes them. In the meantime, check out behind-the-scenes Star Wars footage taken by Peter Mayhew, the actor who played Chewbacca.
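Once the log files show up, a quick way to check them is to parse a file and look for your events. CloudFront access logs are tab-separated text with a #Fields: header line naming the columns, so the sketch below reads the column names from that header instead of hard-coding them. It assumes the file has already been decompressed. parseCloudFrontLog is a hypothetical helper, and the sample log is made up with a shortened field list purely for illustration.

```javascript
// Parses the text of one decompressed CloudFront log file into row objects.
function parseCloudFrontLog(text) {
  var fields = [];
  var rows = [];
  text.split('\n').forEach(function (line) {
    line = line.replace(/\r$/, '');
    if (line.indexOf('#Fields:') === 0) {
      // Column names are separated by spaces in the header
      fields = line.slice('#Fields:'.length).trim().split(/\s+/);
    } else if (line !== '' && line.charAt(0) !== '#') {
      // Data rows are separated by tabs
      var values = line.split('\t');
      var row = {};
      fields.forEach(function (name, i) { row[name] = values[i]; });
      rows.push(row);
    }
  });
  return rows;
}

// Made-up sample; real logs contain many more fields.
var sample = [
  '#Version: 1.0',
  '#Fields: date time cs-uri-stem cs-uri-query',
  '2014-01-01\t12:00:00\t/mouse/click.png\tid=buy'
].join('\n');

var events = parseCloudFrontLog(sample);
console.log(events[0]['cs-uri-stem']);  // /mouse/click.png
console.log(events[0]['cs-uri-query']); // id=buy
```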

10. Go Live!

Everything should be ready for you to collect big data like a champ - update the production environment and let the logging begin. Don't know what to do with the data? See how to analyze AWS logs in 15 minutes.

(Logs and loggers photo by Paukrus, some rights reserved)
