Parsing AWS CloudTrail Log Files

AWS CloudTrail is a service that logs activity in your AWS account. These logs can be analyzed to see what’s going on in the account, for example by filtering activity by user, checking for suspicious behavior from specific IP addresses, and monitoring access to valuable resources.

However, CloudTrail logs need some preparation before they can be analyzed. They are automatically saved as many gzipped JSON files, arranged in separate folders by year, month, and day. In this post, we’ll see how to parse these log files with Xplenty’s cloud data integration service. We’ll use Xplenty’s visual dataflow designer to generate usable tab-delimited results that are ready for analysis.
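For reference, CloudTrail typically delivers its files under a key structure along these lines (the bucket name, account ID, and file name below are placeholders):

    s3://your-bucket/AWSLogs/012345678901/CloudTrail/us-east-1/2014/01/31/
        012345678901_CloudTrail_us-east-1_20140131T1200Z_AbCdEf123456.json.gz

This per-day folder layout is why the logs end up as many small files, and why the consolidation step below is needed.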

AWS CloudTrail Data

Below is sample data from a CloudTrail log file. It contains a single JSON object with an array called “Records.” The array contains detailed objects about each AWS event:

{
  "Records": [{
    "eventVersion": "1.01",
    "userIdentity": {
      "type": "IAMUser",
      "principalId": "XXXXXXXXXXXXXXXXXXXX",
      "arn": "arn:aws:iam::012345678901:user/rhendriks",
      "accountId": "012345678901",
      "accessKeyId": "XXXXXXXXXXXXXXXXXXXX",
      "userName": "rhendriks"
    },
    "eventTime": "2014-01-31T12:00:00Z",
    "eventSource": "",
    "eventName": "DescribeInstances",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "",
    "userAgent": "aws-sdk-ruby/1.33.0 ruby/1.9.3 x86_64-linux",
    "requestParameters": {
      "instancesSet": {
        "items": [{
          "instanceId": "i-01234567"
        }]
      },
      "filterSet": {}
    },
    "responseElements": null,
    "requestID": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaaa",
    "eventID": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaaa"
  },

and so on...

Before we can start processing the data, we need to combine the files into one big file, since processing a few large files is much faster than processing many small ones. The easiest way to do so is to copy and concatenate the files with S3DistCp using the “groupBy” option.
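As a rough sketch, such an S3DistCp invocation might look like the following when run as a step on an EMR cluster. The bucket names, paths, and grouping regex here are placeholders you’d adapt to your own log layout:

```shell
# Combine the many small CloudTrail files for one day into a single
# gzipped output file. Files whose keys match the regex are grouped
# together; the capture group determines the output file name.
s3-dist-cp \
  --src  s3://your-bucket/AWSLogs/012345678901/CloudTrail/us-east-1/2014/01/31/ \
  --dest s3://your-bucket/cloudtrail-combined/ \
  --groupBy '.*(2014/01/31).*' \
  --outputCodec gz
```

The `--outputCodec gz` option keeps the combined output compressed, which S3-based processing tools can usually read directly.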

Processing CloudTrail Logs

Once we have a single CloudTrail log file, it can be parsed via Xplenty:


Let’s dive in to see what each component does:

  1. Source component - loads data from the relevant S3 bucket/path as a JSON array. Once you set the relevant connection, bucket, and path and select JSON as the source type, it’s recommended to fill the fields automatically by clicking the circular arrows at the top right.


  2. Select - because Records is an array and we’d like to iterate over each object in the array, the Flatten(record) function should be used.


  3. Select - converts each JSON string to an object via the JsonStringToMap(record) function.


  4. Select - selects relevant JSON object properties. Before setting aliases, these properties can be accessed with the # (hash) operator, e.g. record#'eventTime'.

    1. Since userName is located in another object under the userIdentity property (see CloudTrail log sample above), we need to use JsonStringToMap(record#'userIdentity')#'userName' to access the user name.
    2. Extracting the instance IDs is even trickier and requires the JsonStringToBag function:
      Flatten(JsonStringToBag(JsonStringToMap(JsonStringToMap(record#'requestParameters')#'instancesSet')#'items'))
  5. Select - keeps the same aliases and extracts the instance ID with JsonStringToMap(instance)#'instanceId'.
  6. Sort - reorders the data by event time in ascending order and then by event name in ascending order in case several events happened at the same time.
  7. Destination - stores the results back on Amazon S3 as tab-delimited GZIP files. The output can also be saved as uncompressed JSON if you wish.
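For readers who want to check the logic of the steps above locally, here is a minimal Python sketch of the same transformation. It is not the Xplenty implementation itself; the field paths simply follow the CloudTrail sample shown earlier:

```python
import json

# A single-record sample in the shape of the CloudTrail log above.
sample = """{"Records": [{
  "eventTime": "2014-01-31T12:00:00Z",
  "eventName": "DescribeInstances",
  "userIdentity": {"userName": "rhendriks"},
  "requestParameters": {"instancesSet": {"items": [{"instanceId": "i-01234567"}]}}
}]}"""

def parse_records(raw):
    rows = []
    for record in json.loads(raw)["Records"]:          # steps 2-3: flatten Records
        user = record.get("userIdentity", {}).get("userName")  # step 4.1: nested userName
        items = (record.get("requestParameters") or {}) \
            .get("instancesSet", {}).get("items", [])  # step 4.2: nested items array
        for item in items or [{}]:                     # keep events with no instances
            rows.append((record["eventTime"], record["eventName"],
                         user, item.get("instanceId")))  # step 5: instanceId
    return sorted(rows)                                # step 6: eventTime, then eventName

for row in parse_records(sample):
    print("\t".join(str(field) for field in row))      # step 7: tab-delimited output
```

Running this prints one tab-delimited line per (event, instance) pair, mirroring the dataflow’s final output.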


Below are sample results of running the above dataflow with CloudTrail logs. As you can see, the data are now ready for analysis:

Start analyzing your AWS CloudTrail log files with a free Xplenty account.
