Parsing URL query string parameters is easy with Integrate.io. You can take a huge pile of web server logs and analyze them via Integrate.io’s visual interface.

Related reading: Parsing User Agent Strings in Big Data

Let’s get to it, shall we?

Demo Data

For this demo we’re going to use publicly available HTTP server logs from the University of Saskatchewan that were recorded between January and December of 1995. Here’s a sample of the data:

comp.uark.edu - - [01/Jun/1995:10:18:39 -0600] "GET /cgi-bin/hytelnet HTTP/1.0" 200 1485
halley.execpc.com - - [01/Jun/1995:10:18:55 -0600] "GET /cgi-bin/hytelnet?file=US000OTH HTTP/1.0" 200 48064
192.139.11.251 - - [01/Jun/1995:10:18:55 -0600] "GET / HTTP/1.0" 304 0
192.139.11.251 - - [01/Jun/1995:10:18:58 -0600] "GET /images/logo.gif HTTP/1.0" 304 0
halley.execpc.com - - [01/Jun/1995:10:19:27 -0600] "GET /cgi-bin/hytelnet?file=US385 HTTP/1.0" 200 328

As you can see, the format consists of the following fields:

  1. Source IP/domain

  2. User Identifier (blank)

  3. UserID (blank)

  4. Date - in the format of dd/MMM/yyyy:HH:mm:ss Z

  5. HTTP request - type, URL, HTTP version

  6. HTTP code

  7. Bytes transferred
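If you want to sanity-check this format outside Integrate.io first, the seven fields can be pulled apart with a short regular expression. Here is a minimal standalone Python sketch mirroring the field list above (the function and pattern names are our own, not part of Integrate.io):

```python
import re

# One pattern per Common Log Format line:
# host, user identifier, user ID, [timestamp], "request", status code, bytes.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)$'
)

def parse_log_line(line):
    """Split one access-log line into its seven fields, or return None."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    host, ident, user, ts, request, status, size = m.groups()
    return {
        "host": host,
        "identity": ident,
        "user": user,
        "timestamp": ts,        # dd/MMM/yyyy:HH:mm:ss Z
        "request": request,     # type, URL, HTTP version
        "status": int(status),
        "bytes": 0 if size == "-" else int(size),
    }

line = ('halley.execpc.com - - [01/Jun/1995:10:19:27 -0600] '
        '"GET /cgi-bin/hytelnet?file=US385 HTTP/1.0" 200 328')
print(parse_log_line(line)["request"])  # GET /cgi-bin/hytelnet?file=US385 HTTP/1.0
```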

Parsing Query Strings

Aside from regular requests for HTML pages and images, you may notice that the log contains URLs with query strings, for example /cgi-bin/hytelnet?file=US385. According to Wikipedia, HyTelnet “...was an early attempt to create a universal or at least simpler interface for the various Telnet-based information resources available before the World Wide Web”. This CGI may have been some kind of interface to view or download files via HyTelnet.

Integrate.io provides two functions for parsing query strings:

  1. ExtractQueryStringParam(url, paramname) - receives the field which contains the URL and the name of the query string parameter to extract, and returns that parameter’s value. Use this function if you need to retrieve only one query string parameter. E.g. ExtractQueryStringParam('/cgi-bin/hytelnet?file=US385', 'file') returns US385.

  2. QueryStringToMap(string_expression) - also receives the field which contains the URL, and returns the query string parameters and their values as a hash map. Use this function if you need to retrieve several query string parameters. E.g. QueryStringToMap('/cgi-bin/cusi?query=culture&service=http%3A%2F%2Fc.gp.cs.cmu.edu%3A5103%2Fprog%2Fwebster%3F_cusi-search-term-here_') returns [query#culture, service#http%3A%2F%2Fc.gp.cs.cmu.edu%3A5103%2Fprog%2Fwebster%3F_cusi-search-term-here_]. The values can be accessed later by using the hash character, e.g. fieldname#service.
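For comparison, Python’s standard library offers similar functionality, which is handy for verifying results locally. This sketch shows rough equivalents of both functions; note one assumed difference: parse_qs decodes percent-escapes by default, whereas the map shown above keeps the service value encoded.

```python
from urllib.parse import urlparse, parse_qs

url = ('/cgi-bin/cusi?query=culture&service='
       'http%3A%2F%2Fc.gp.cs.cmu.edu%3A5103%2Fprog%2Fwebster%3F_cusi-search-term-here_')

# parse_qs returns a dict of lists, roughly like QueryStringToMap.
params = parse_qs(urlparse(url).query)

# Single parameter, like ExtractQueryStringParam(url, 'query'):
print(params['query'][0])  # culture

# Unlike the hash map above, percent-encoding is decoded automatically:
print(params['service'][0])
```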

Processing the Logs

Let’s say we want to find out which files HyTelnet served most often at the University of Saskatchewan in the latter half of 1995. This means we need to retrieve the URLs from the logs, filter down to the ones that contain ‘hytelnet’ with the file query string parameter, parse them, and aggregate. All of this can be done with Integrate.io.


  1. Log in to your Integrate.io account, or sign up for free if you do not have one.

  2. Create a new package.

  3. Click ‘new source’ at the top and select ‘Cloud Storage’.

  4. Click the source component that was just created:

    1. Select ‘Integrate.io demo data’ as the cloud storage connection.

    2. Enter the bucket and path. The data is available in the public integrate.io.public bucket, weblogs/usaskaccesslog.gz path. Click the check mark button at the top to test the connection and make sure it works. If it doesn’t, then the cloud storage connection, bucket, or path aren’t defined correctly.

    3. Enter a single space character as a field delimiter and choose double quotes as the string qualifier.

    4. Click the circling arrows button on the top right to auto-detect the schema. A data preview should show the data separated into different fields. Give the fields meaningful names such as: ip, empty1, empty2, date, time, request, code, and bytes.

    5. Click okay on the bottom right.

  5. Add a select component to use the ‘request’ field only (you will not need any of the other fields for now).

  6. Add a filter component to screen only URLs which contain ‘hytelnet?file’ in the request field: set the operator to ‘text matches’ and use .*hytelnet\?file.* as the pattern.

  7. Add a select component to extract only the URL portion of the request via a regular expression. Open the dropdown, select ‘edit’ to open the expression editor, and enter REGEX_EXTRACT(request, '(\\/.*)\\s', 1). Back in the main component screen, enter url as the field name on the right.

  8. Now that the URL is available, add another select component to parse the query string. We only need the ‘file’ parameter, so we’ll use ExtractQueryStringParam(url, 'file') on the left, and file on the right.

  9. To count the file types, add an aggregate component next. Set file under ‘group by’, ‘Count All’ as the function, and set file_count as the alias.

  10. Add a sort component to sort the data by file_count in descending order.

  11. Finally, add a cloud storage destination component to output the final data back to S3. Set the relevant cloud storage connection, bucket, and path (you may use the integrate.io.dumpster bucket for demo purposes). Turn on ‘Overwrite destination path’ if the path should be overwritten.

  12. Save and validate the package via the checkmark button next to the ‘save’ button on the top right. Fix any package problems that arise.

  13. All done! Set up a cluster and run the job. Once the job finishes, you can preview the output: the link is available in the notification email you’ll receive, and also via ‘my clusters’, the relevant cluster, and ‘View outputs’ for the topmost job.
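To double-check the result outside the platform, the pipeline above (filter, extract the URL, parse the ‘file’ parameter, aggregate, sort) can be sketched in a few lines of standalone Python. The function names here are our own, not Integrate.io components:

```python
import re
from collections import Counter
from urllib.parse import urlparse, parse_qs

def top_hytelnet_files(log_lines, n=10):
    """Count 'file' query parameter values across hytelnet requests,
    sorted in descending order of frequency."""
    counts = Counter()
    # Pull the URL out of the quoted request, e.g. "GET /url HTTP/1.0".
    request_re = re.compile(r'"\S+ (\S+)[^"]*"')
    for line in log_lines:
        m = request_re.search(line)
        if m is None:
            continue
        url = m.group(1)
        if 'hytelnet?file' not in url:
            continue  # the filter step
        params = parse_qs(urlparse(url).query)
        for value in params.get('file', []):
            counts[value] += 1  # the aggregate step
    return counts.most_common(n)  # the sort step

sample = [
    'comp.uark.edu - - [01/Jun/1995:10:18:39 -0600] "GET /cgi-bin/hytelnet HTTP/1.0" 200 1485',
    'halley.execpc.com - - [01/Jun/1995:10:18:55 -0600] "GET /cgi-bin/hytelnet?file=US000OTH HTTP/1.0" 200 48064',
    'halley.execpc.com - - [01/Jun/1995:10:19:27 -0600] "GET /cgi-bin/hytelnet?file=US385 HTTP/1.0" 200 328',
]
print(top_hytelnet_files(sample))  # [('US000OTH', 1), ('US385', 1)]
```

On the full usaskaccesslog.gz file you would stream lines from the gzip archive instead of a list; the counting logic stays the same.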

Conclusion

After parsing the University of Saskatchewan’s HTTP server logs from the latter half of 1995, here are the top HyTelnet file parameter values:

Of course, Integrate.io has many more features for simple data integration in the cloud. Sign up to see what you can do with your logs.