Parsing URL query string parameters is easy with Xplenty. You can take a huge pile of web server logs and analyze them via Xplenty’s visual interface. Let’s get to it, shall we?
For this demo we’re going to use publicly available HTTP server logs from the University of Saskatchewan that were recorded between June and December of 1995. Here’s a sample of the data:
comp.uark.edu - - [01/Jun/1995:10:18:39 -0600] "GET /cgi-bin/hytelnet HTTP/1.0" 200 1485
halley.execpc.com - - [01/Jun/1995:10:18:55 -0600] "GET /cgi-bin/hytelnet?file=US000OTH HTTP/1.0" 200 48064
18.104.22.168 - - [01/Jun/1995:10:18:55 -0600] "GET / HTTP/1.0" 304 0
22.214.171.124 - - [01/Jun/1995:10:18:58 -0600] "GET /images/logo.gif HTTP/1.0" 304 0
halley.execpc.com - - [01/Jun/1995:10:19:27 -0600] "GET /cgi-bin/hytelnet?file=US385 HTTP/1.0" 200 328
As you can see, the format consists of the following fields:
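Host - hostname or IP address of the client
Client Identifier (blank)
User Identifier (blank)
Date - in the format of dd/MMM/yyyy:HH:mm:ss Z
HTTP request - type, URL, HTTP version
HTTP status code
Response size in bytes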
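To make the field layout concrete, here is a minimal Python sketch (not part of Xplenty) that splits one of the sample lines and parses its timestamp. It assumes a single space as the field delimiter and double quotes as the string qualifier, which is why the bracketed date arrives as two separate pieces. The helper names split_log_line and parse_timestamp are hypothetical, and the month abbreviation is assumed to be parsed under an English locale.

```python
import csv
from datetime import datetime

# One of the sample lines shown above.
SAMPLE = ('halley.execpc.com - - [01/Jun/1995:10:19:27 -0600] '
          '"GET /cgi-bin/hytelnet?file=US385 HTTP/1.0" 200 328')


def split_log_line(line):
    """Split on single spaces, treating double quotes as the string qualifier."""
    return next(csv.reader([line], delimiter=" ", quotechar='"'))


def parse_timestamp(date_part, zone_part):
    """Rejoin the two bracketed pieces and parse dd/MMM/yyyy:HH:mm:ss Z."""
    raw = f"{date_part} {zone_part}".strip("[]")
    return datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")


fields = split_log_line(SAMPLE)
host, _, _, date_part, zone_part, request, status, size = fields

print(request)
print(parse_timestamp(date_part, zone_part).isoformat())
```

Because the request is quoted, it stays in one field even though it contains spaces, while the unquoted timestamp is split at the space before the time zone offset.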
Parsing Query Strings
Aside from regular requests for HTML pages and images, you may notice that the log contains URLs with query strings, for example /cgi-bin/hytelnet?file=US385. According to Wikipedia, HyTelnet “...was an early attempt to create a universal or at least simpler interface for the various Telnet-based information resources available before the World Wide Web”. This CGI may have been some kind of interface to view or download files via HyTelnet.
Xplenty provides two functions for parsing query strings:
ExtractQueryStringParam(url, paramname) - receives the field that contains the URL and the name of the query string parameter to extract, and returns that parameter's value. Use this function if you need to retrieve only one query string parameter.
QueryStringToMap(string_expression) - also receives the field that contains the URL, and returns the query string parameters and their values as a hash map. Use this function if you need to retrieve several query string parameters. For a URL with query and service parameters, for example, the result looks like:
[query#culture, service#http%3A%2F%2Fc.gp.cs.cmu.edu%3A5103%2Fprog%2Fwebster%3F_cusi-search-term-here_]
The values can be accessed later by using the hash character (#).
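As a rough illustration of what these two functions compute, here is a small Python sketch. It is not Xplenty's implementation: the names extract_query_string_param and query_string_to_map are hypothetical stand-ins, and the sketch assumes values are returned without percent-decoding, as the map output above suggests.

```python
from urllib.parse import urlsplit


def query_string_to_map(url):
    """Return the query string parameters of a URL as a dict (values left encoded)."""
    query = urlsplit(url).query
    params = {}
    for pair in query.split("&"):
        if not pair:
            continue  # no query string, or an empty segment such as "a=1&&b=2"
        name, _, value = pair.partition("=")
        params[name] = value
    return params


def extract_query_string_param(url, param_name):
    """Return a single parameter value, or None when the parameter is absent."""
    return query_string_to_map(url).get(param_name)


print(extract_query_string_param("/cgi-bin/hytelnet?file=US385", "file"))
print(query_string_to_map("/cgi-bin/hytelnet"))
```

The single-parameter helper is built on top of the map version here only to keep the sketch short; the two Xplenty functions may well be implemented independently.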
Processing the Logs
Let’s say we want to find out which files HyTelnet served most often at the University of Saskatchewan in the latter half of 1995. This means we need to retrieve the URLs from the logs, keep only the ones that contain ‘hytelnet’ and have the file query string parameter, parse them, and aggregate the results. This can be done with Xplenty.
Create a new package.
Click the source component that was just created:
Select ‘Xplenty demo data’ as the cloud storage connection.
Enter the bucket and path. The data is available in the public xplenty.public bucket, weblogs/usask_access_log.gz path. Click the check mark button at the top to test the connection and make sure it works. If it doesn’t, then the cloud storage connection, bucket, or path aren’t defined correctly.
Enter a single space character as a field delimiter and choose double quotes as the string qualifier.
Click the circling arrows button on the top right to auto-detect the schema. A data preview should show the data separated into different fields. Give the fields meaningful names such as: ip, empty1, empty2, date, time, request, code, and bytes.
Add a select component to extract only the URL portion of the request with a regular expression. Open the dropdown and select ‘edit’ to open the expression editor, then enter REGEX_EXTRACT(request, '(\\/.*)\\s',1). Back in the main component screen, enter url on the right.
Now that the URL is available, add another select component to parse the query string. We only need to extract the ‘file’ parameter, so we’ll use ExtractQueryStringParam(url, 'file') on the left and file on the right.
Finally, add a cloud storage destination component to output the final data back to S3. Set the relevant cloud storage connection, bucket, and path (you may use the xplenty.dumpster bucket for demo purposes). Turn on ‘Overwrite destination path’ if the path should be overwritten.
Save and validate the package via the checkmark button next to the ‘save’ button on the top right. Fix any package problems that arise.
All done! Set up a cluster and run the job. After the job is done, you will be able to preview the output. The link is available in the notification email that you’ll receive, and also via ‘my clusters’, the relevant cluster, and ‘View outputs’ for the topmost job.
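For readers who want to see the logic of the package outside the visual editor, the following Python sketch mirrors the same flow on a handful of request strings: extract the URL with the regular expression, keep only HyTelnet requests that carry a file parameter, and count the values. It is only an approximation under stated assumptions: the function names are hypothetical, the date filter for the latter half of 1995 is left out, and the count-based ranking is one plausible reading of "top files".

```python
import re
from collections import Counter

# Python equivalent of the pattern passed to REGEX_EXTRACT above.
URL_PATTERN = re.compile(r"(/.*)\s")


def extract_url(request):
    """Return the URL portion of a request such as 'GET /path HTTP/1.0'."""
    match = URL_PATTERN.search(request)
    return match.group(1) if match else None


def file_param(url):
    """Return the value of the 'file' query string parameter, if present."""
    _, _, query = url.partition("?")
    for pair in query.split("&"):
        name, _, value = pair.partition("=")
        if name == "file":
            return value
    return None


def top_files(requests, limit=10):
    """Count 'file' values across HyTelnet requests and return the most common."""
    counts = Counter()
    for request in requests:
        url = extract_url(request)
        if url is None or "hytelnet" not in url:
            continue  # not a HyTelnet request
        value = file_param(url)
        if value:
            counts[value] += 1
    return counts.most_common(limit)


# Request fields taken from the sample log lines.
SAMPLE_REQUESTS = [
    "GET /cgi-bin/hytelnet HTTP/1.0",
    "GET /cgi-bin/hytelnet?file=US000OTH HTTP/1.0",
    "GET / HTTP/1.0",
    "GET /images/logo.gif HTTP/1.0",
    "GET /cgi-bin/hytelnet?file=US385 HTTP/1.0",
]

print(top_files(SAMPLE_REQUESTS))
```

The greedy `.*` in the pattern backtracks to the last whitespace in the request, so the captured group stops just before the HTTP version.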
Based on the HTTP server logs from the University of Saskatchewan for the latter half of 1995, here’s an analysis of the top HyTelnet file parameter values:
Of course, Xplenty has a lot more features for simple data integration on the cloud. Sign up to see what you could do with your logs.