How to Parse Query String Parameters from URLs in Big Data

Parsing URL query string parameters is easy with Integrate.io. You can take a huge pile of web server logs and analyze them via Integrate.io’s visual interface.

Related reading: Parsing User Agent Strings in Big Data

Let’s get to it shall we?

Demo Data

For this demo we’re going to use publicly available HTTP server logs from the University of Saskatchewan that were recorded between January and December of 1995. Here’s a sample of the data:

comp.uark.edu - - [01/Jun/1995:10:18:39 -0600] "GET /cgi-bin/hytelnet HTTP/1.0" 200 1485
halley.execpc.com - - [01/Jun/1995:10:18:55 -0600] "GET /cgi-bin/hytelnet?file=US000OTH HTTP/1.0" 200 48064
192.139.11.251 - - [01/Jun/1995:10:18:55 -0600] "GET / HTTP/1.0" 304 0
192.139.11.251 - - [01/Jun/1995:10:18:58 -0600] "GET /images/logo.gif HTTP/1.0" 304 0
halley.execpc.com - - [01/Jun/1995:10:19:27 -0600] "GET /cgi-bin/hytelnet?file=US385 HTTP/1.0" 200 328

As you can see, the format consists of the following fields:

Source IP/domain
User Identifier (blank)
UserID (blank)
Date - in the format of dd/MMM/yyyy:HH:mm:ss Z
HTTP request - type, URL, HTTP version
HTTP code
Bytes transferred

Parsing Query Strings

Aside from regular requests for HTML pages and images, you may notice that the log contains URLs with query strings, for example /cgi-bin/hytelnet?file=US385. According to Wikipedia, HyTelnet “...was an early attempt to create a universal or at least simpler interface for the various Telnet-based information resources available before the World Wide Web”. This CGI may have been some kind of interface to view or download files via HyTelnet.

Integrate.io provides two functions for parsing query strings:

ExtractQueryStringParam(url, paramname) - receives the field name which contains the URL, the query string parameter to extract, and returns the parameter value. Use this function if you need to retrieve only one query string parameter. E.g. ExtractQueryStringParam('/cgi-bin/hytelnet?file=US385', 'file') returns US385.
QueryStringToMap(string_expression) - also receives the field name which contains the URL and returns the querystring parameters + values as a hash map. Use this function if you need to retrieve several query string parameters. E.g. QueryStringToMapy(‘/cgi-bin/cusi?query=culture&service=http%3A%2F%2Fc.gp.cs.cmu.edu%3A5103%2Fprog%2Fwebster%3F_cusi-search-term-here_’) returns [query#culture, service#http%3A%2F%2Fc.gp.cs.cmu.edu%3A5103%2Fprog%2Fwebster%3F_cusi-search-term-here_]. The values can be accessed later by using the hashtag character, e.g. fieldname#service.

Processing the Logs

Let’s say we want to find out what were the top files served by HyTelnet in the University of Saskatchewan in the latter half of 1995. This means we need retrieve URLs from the logs, filter only the ones that contain ‘hytelnet’ with the file query string parameter, parse them, and aggregate. This can be done with Integrate.io.

Login to your Integrate.io account, sign up for free if you do not have one.
Create a new package.
Click ‘new source’ at the top and select ‘Cloud Storage’.
Click the source component that was just created:
1. Select ‘Integrate.io demo data’ as the cloud storage connection.
2. Enter the bucket and path. The data is available in the public integrate.io.public bucket, weblogs/usaskaccesslog.gz path. Click the check mark button at the top to test the connection and make sure it works. If it doesn’t, then the cloud storage connection, bucket, or path aren’t defined correctly.
3. Enter a single space character as a field delimiter and choose double quotes as the string qualifier.
4. Click the circling arrows button on the top right to auto-detect the schema. A data preview should show the data separated into different fields. Give the fields meaningful names such as: ip, empty1, empty2, date, time, request, code, and bytes.
5. Click okay on the bottom right.
Add a select component to use the ‘request’ field only (you will not need any of the other fields for now).
Add a filter component to screen only URLs which contain ‘hytelnet?file’ from the request field with text matchesand .*hytelnet\?file.* as the pattern.
Add a select component to extract the URL section only from the request via regular expressions. Open the dropdown and select ‘edit’ to open the expression editor and enter REGEX_EXTRACT(request, '(\\/.*)\\s',1). Enter url on the right back in the main component screen.
Now that the URL is available, add another select component to parse the querystring. We only need to extract the ‘file’ parameter, so we’ll use ExtractQueryStringParam(url, 'file') on the left, and file on the right.
To count the file types, add an aggregate component next. Set file under ‘group by’, ‘Count All’ as the function, and set file_count as the alias.
Add a sort component to sort the data by file_count in descending order.
Finally, add a cloud storage destination component to output the final data back to S3. Set the relevant cloud storage connection, bucket, and path (you may use the integrate.io.dumpster bucket for demo purposes). Turn on ‘Overwrite destination path’ if the path should be overwritten.
Save and validate the package via the checkmark button next to the ‘save’ button on the top right. Fix any package problems that arise.
All done! Setup a cluster and run the job. After the job is done, you will be able to preview the output. The link is available in the notification email that you’ll receive, and also via ‘my clusters’, the relevant cluster, and ‘View outputs’ for the top most job.

Conclusion

Parsing HTTP server logs from the University of Saskatchewan in the latter half of 1995, here’s an analysis of the top HyTelnet file parameter values:

Of course, Integrate.io has a lot more features for simple data integration on the cloud. Sign up to see what you could do with your logs.

Big Data

How to Parse Query String Parameters from URLs in Big Data

Demo Data

Parsing Query Strings

Processing the Logs

Conclusion

Snowpark Unleashed: Data Magic Within Snowflake

The Essential Role of a Data Steward in Modern Business Intelligence

Maximizing Efficiency: Streamlining Your Business with Advanced SFDC Strategies

Solutions

Support

Company

Language

How to Parse Query String Parameters from URLs in Big Data

The Unified Stack for Modern Data Teams

Get a personalized platform demo & 30-minute Q&A session with a Solution Engineer

Demo Data

Parsing Query Strings

Processing the Logs

The Unified Stack for Modern Data Teams

Get a personalized platform demo & 30-minute Q&A session with a Solution Engineer

Conclusion

Related Readings

Snowpark Unleashed: Data Magic Within Snowflake

The Essential Role of a Data Steward in Modern Business Intelligence

Maximizing Efficiency: Streamlining Your Business with Advanced SFDC Strategies

Subscribe To The Stack Newsletter

Solutions

Support

Company

Language

Subscribe To
The Stack Newsletter