AWS CloudTrail is a service that logs activity in your AWS account. These logs can be analyzed to check what’s going on, for example, by filtering activity by user, checking for suspicious behavior from various IP addresses, and monitoring access to valuable resources.

However, CloudTrail logs need some preparation before they can be analyzed. They are automatically saved as many small gzipped JSON files, arranged in separate folders by year, month, and day. In this post, we’ll see how to parse these log files with Integrate.io’s cloud data integration platform. We’ll use Integrate.io’s visual dataflow designer to generate usable tab-delimited results that are ready for analysis.
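
For reference, CloudTrail delivers each log file under a key structure like the following (the account ID and file suffix here are illustrative):

AWSLogs/012345678901/CloudTrail/us-east-1/2014/01/31/012345678901_CloudTrail_us-east-1_20140131T1200Z_AbCdEf12.json.gz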

AWS CloudTrail Data

Below is sample data from a CloudTrail log file. Each file contains a single JSON object with an array called “Records,” which holds a detailed object for each AWS event:

{
"Records": [{
    "eventVersion": "1.01",
    "userIdentity": {
      "type": "IAMUser",
      "principalId": "XXXXXXXXXXXXXXXXXXXX",
      "arn": "arn:aws:iam::012345678901:user/rhendriks",
      "accountId": "012345678901",
      "accessKeyId": "XXXXXXXXXXXXXXXXXXXX",
      "userName": "rhendriks"
    },
    "eventTime": "2014-01-31T12:00:00Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "DescribeInstances",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "11.111.111.111",
    "userAgent": "aws-sdk-ruby/1.33.0 ruby/1.9.3 x86_64-linux",
    "requestParameters": {
      "instancesSet": {
        "items": [{
          "instanceId": "i-01234567"
        }]
      },
      "filterSet": {
      }
    },
    "responseElements": null,
    "requestID": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaaa",
    "eventID": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaaa"
},

and so on...

Before we can start processing the data, we need to combine the files into one big file, since processing a few large files is much faster than processing many small ones. The easiest way to do so is to copy and concatenate the files with S3DistCp using the “groupBy” option, as in the sketch below.
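
For example, on an Amazon EMR cluster, a command along the following lines merges a month of logs into one gzipped file per day by grouping file names on their date stamp (the bucket names and regular expression are illustrative and will need adjusting to your log layout):

s3-dist-cp \
  --src s3://my-log-bucket/AWSLogs/012345678901/CloudTrail/us-east-1/2014/01/ \
  --dest s3://my-work-bucket/cloudtrail-merged/ \
  --groupBy '.*CloudTrail_us-east-1_(2014[0-9]+)T.*' \
  --outputCodec gz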

Processing CloudTrail Logs

Once we have a single CloudTrail log file, it can be parsed with an Integrate.io dataflow.


Let’s dive in to see what each component does (a rough Python equivalent of the whole dataflow is sketched after the list):

  1. Source component - loads data from the relevant S3 bucket/path as a JSON array. Once you set the relevant connection, bucket, and path and select JSON as the source type, it’s recommended to fill the fields automatically by clicking the circular arrows on the top right.

  2. Select - because Records is an array and we’d like to iterate over each object in the array, the Flatten(record) function should be used.

  3. Select - converts each JSON string to an object via the JsonStringToMap(record) function.

  4. Select - selects the relevant JSON object properties. Before setting aliases, these properties can be accessed with the # (map dereference) operator, e.g. record#'eventTime'.

    1. Since userName is located in another object under the userIdentity property (see CloudTrail log sample above), we need to use JsonStringToMap(record#'userIdentity')#'userName' to access the user name.

  5. Extracting the instance IDs is even trickier and requires the JsonStringToBag function:
Flatten(JsonStringToBag(JsonStringToMap(JsonStringToMap(record#'requestParameters')#'instancesSet')#'items'))

  6. Select - keeps the same aliases and extracts the instance ID with JsonStringToMap(instance)#'instanceId'.

  7. Sort - reorders the data by event time in ascending order, then by event name in ascending order in case several events happened at the same time.

  8. Destination - stores the results back on Amazon S3 as tab-delimited, GZIP-compressed files. The output can also be saved as uncompressed JSON if you wish.
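To make the steps above concrete, here is a minimal Python sketch of what the dataflow computes end to end. It is an illustration of the same logic, not Integrate.io’s implementation, and the file names are hypothetical:

import csv
import gzip
import json

# Source: load the merged log file and take the Records array.
with gzip.open("cloudtrail-merged.json.gz", "rt") as f:
    records = json.load(f)["Records"]

rows = []
for record in records:  # Flatten(record): one row per event
    # Select: the relevant top-level properties (record#'eventTime' etc.),
    # with userName nested under userIdentity, as in
    # JsonStringToMap(record#'userIdentity')#'userName'.
    base = [
        record.get("eventTime"),
        record.get("eventName"),
        record.get("eventSource"),
        record.get("awsRegion"),
        (record.get("userIdentity") or {}).get("userName"),
        record.get("sourceIPAddress"),
    ]
    # Flatten(JsonStringToBag(...)): one row per instance ID, or a single
    # row without an instance ID when the event has none.
    items = ((record.get("requestParameters") or {})
             .get("instancesSet", {}) or {}).get("items", [])
    if items:
        for item in items:
            rows.append(base + [item.get("instanceId")])
    else:
        rows.append(base)

# Sort: by event time ascending, then event name ascending.
rows.sort(key=lambda r: (r[0], r[1]))

# Destination: tab-delimited output (compression omitted for brevity).
with open("cloudtrail-events.tsv", "w", newline="") as out:
    csv.writer(out, delimiter="\t").writerows(rows)

Note that flattening the items array means a single event can produce several output rows, one per instance ID.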

Results

Below are sample results of running the above dataflow on CloudTrail logs. The columns are event time, event name, event source, AWS region, user name, source IP address, and instance ID (when present). As you can see, the data are now ready for analysis:

2014-01-01T01:09:06Z	DescribeVolumes	ec2.amazonaws.com	us-east-1	piedpiper	11.111.11.11
2014-01-01T01:09:06Z	DescribeVolumes	ec2.amazonaws.com	us-east-1	piedpiper	11.111.11.11	i-01234567
2014-01-01T01:09:11Z	DescribeJobFlows	elasticmapreduce.amazonaws.com	us-east-1	piedpiper	11.111.11.11	i-88888888
2014-01-01T01:09:11Z	DescribeJobFlows	elasticmapreduce.amazonaws.com	us-east-1	piedpiper	11.111.11.11
2014-01-01T01:20:08Z	DescribeInstances	ec2.amazonaws.com	us-east-1	rhendriks	22.222.222.222	i-87654321
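
With the tab-delimited output in hand, even a few lines of Python can start answering questions such as which users are most active (the file name is again hypothetical):

from collections import Counter
import csv

# Count events per user; the user name is the fifth column of the output.
with open("cloudtrail-events.tsv", newline="") as f:
    users = Counter(row[4] for row in csv.reader(f, delimiter="\t"))

print(users.most_common())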

Start analyzing your AWS CloudTrail log files with a free Integrate.io account.

Read more about Amazon Redshift and other Amazon integrations on the Integrate.io blog.