Unstructured data is big. Really big. According to an IDC study, about 90 percent of the world's storage is used for unstructured data. This should come as no surprise considering the number of photos, videos, documents, and emails generated on the web every minute. However, in 2012 only 0.5 percent of all data was being analyzed. With such huge potential in unstructured data, and technologies like Hadoop to handle it, old-school database administrators need to make the transition from the comfortable world of structured data to the chaotic realm of the unstructured.
Unstructured Data Processing Methods
Unlike structured data, which is neatly organized in relational databases, unstructured data has no predefined schema and doesn’t arrive in a specified format. How can such data be processed? Let’s discuss it by dividing unstructured data into two groups.
The first group consists of application logs. They are stored as files that list events such as page visits, button clicks, logins, exceptions, and so forth. Part of each log line may be structured and contain the date, log type (info/warning/error), and URL, while the rest may be fully unstructured, holding whatever information the app’s developers chose to include. Log entries can also contain newline characters, which complicates processing further: it becomes harder to determine where one entry ends and the next one begins.
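The newline problem described above can be handled by regrouping lines before any analysis. The following is a minimal sketch, assuming every new entry starts with a `YYYY-MM-DD HH:MM:SS` timestamp; that layout, the function name `group_records`, and the sample lines are all invented for illustration, and real logs may need a different start-of-entry pattern.

```python
import re

# Assumption: every new log entry starts with "YYYY-MM-DD HH:MM:SS ".
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} ")


def group_records(lines):
    """Merge continuation lines (e.g. stack traces) into the entry they belong to."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line) or not records:
            # A timestamp (or the very first line) opens a new entry.
            records.append(line)
        else:
            # Anything else is treated as a continuation of the previous entry.
            records[-1] += "\n" + line
    return records


# Hypothetical sample: the second entry spans three physical lines.
sample = [
    "2012-06-01 10:00:00 INFO /cart item added",
    "2012-06-01 10:00:05 ERROR /checkout payment failed",
    "Traceback (most recent call last):",
    '  File "app.py", line 42, in pay',
    "2012-06-01 10:00:09 INFO /home page visit",
]
records = group_records(sample)
print(len(records))
```

Once entries are regrouped this way, each one can be parsed as a single unit regardless of how many physical lines it occupies.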
So, your boss needs analytics on web app logs pronto - which transactions happened, how long they took, and what errors occurred. What tools can you use to provide these numbers? Unfortunately, the solution is to write custom code that checks for sequences of events (e.g. add item to cart, checkout, thank you) and extracts values using regular expressions. Hive or Pig could be used to process the data, but you’d still need to find or write user-defined functions (UDFs).
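As a rough illustration of that custom code, the sketch below uses a regular expression to pull fields out of each line and then looks for the cart, checkout, and thank-you sequence within each session. The log layout, the `session=` field, the URLs in `FUNNEL`, and the function names are all assumptions made for the example, not a format any particular application uses. It also assumes the lines are already in chronological order.

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical layout: "<date> <time> <LEVEL> session=<id> <url> [free text]"
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>INFO|WARNING|ERROR) "
    r"session=(?P<session>\S+) "
    r"(?P<url>\S+)"
    r"(?: (?P<message>.*))?$"
)

# Assumed URLs for the "add item to cart, checkout, thank you" sequence.
FUNNEL = ["/cart/add", "/checkout", "/thank-you"]


def parse(lines):
    """Yield one dict per line that matches the assumed layout; skip the rest."""
    for line in lines:
        m = LINE.match(line)
        if m:
            event = m.groupdict()
            event["ts"] = datetime.strptime(event["ts"], "%Y-%m-%d %H:%M:%S")
            yield event


def completed_transactions(events):
    """Return {session: seconds} for sessions that hit every funnel step in order."""
    by_session = defaultdict(list)
    for e in events:
        by_session[e["session"]].append(e)

    durations = {}
    for session, evs in by_session.items():
        step, start = 0, None
        for e in evs:
            if e["url"] == FUNNEL[step]:
                if step == 0:
                    start = e["ts"]
                step += 1
                if step == len(FUNNEL):
                    durations[session] = (e["ts"] - start).total_seconds()
                    break
    return durations


sample = [
    "2012-06-01 10:00:00 INFO session=a1 /cart/add sku=42",
    "2012-06-01 10:00:30 INFO session=a1 /checkout",
    "2012-06-01 10:00:31 INFO session=b2 /cart/add sku=7",
    "2012-06-01 10:00:45 ERROR session=b2 /checkout card declined",
    "2012-06-01 10:01:10 INFO session=a1 /thank-you",
]
events = list(parse(sample))
durations = completed_transactions(events)
errors = [e for e in events if e["level"] == "ERROR"]
print(durations, [e["message"] for e in errors])
```

The same parsing logic could be wrapped in a UDF for Hive or Pig when the logs are too large for a single machine; the sketch only shows the extraction step itself.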
Fully Unstructured Data
Data such as social network statuses, emails, documents, images, and videos is considered fully unstructured. In practice, this label can be misleading, because emails and binary file formats have well-defined headers with metadata. Their content, however, is fully unstructured and may appear as free text or as binary bits and bytes, either raw or compressed.
Processing unstructured data means extracting structure from it. Take, for example, sentiment analysis, also known as opinion mining. It processes unstructured text and analyzes how the words fit together to determine what kind of judgement, evaluation, or even emotional state the text conveys. The text is then assigned a polarity: positive, negative, or neutral.
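To make the idea of assigning a polarity concrete, here is a deliberately simple lexicon-based sketch. The word lists are a toy lexicon invented for this example, and the negation handling only flips the word immediately after "not", "never", or "no". Production sentiment analysis typically relies on much larger lexicons or trained models, so treat this as an illustration of the output, not of a real method.

```python
import re

# Toy lexicon, invented for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "slow"}
NEGATIONS = {"not", "never", "no"}


def polarity(text):
    """Classify free text as 'positive', 'negative', or 'neutral'."""
    score = 0
    negate = False
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in NEGATIONS:
            negate = True  # flips only the word immediately after it
            continue
        value = (word in POSITIVE) - (word in NEGATIVE)  # +1, -1, or 0
        if value:
            score += -value if negate else value
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for status in (
    "I love this phone",
    "The service was not good",
    "The parcel arrived on Tuesday",
):
    print(status, "->", polarity(status))
```

The free text goes in, and a single structured attribute comes out, which can then be stored and aggregated like any other column.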
Biometrics is another field that uses unstructured data, specifically images. Fingerprints and facial images are processed to extract structured attributes. For example, ink smears are transformed into lines and polygons. Biometric comparison is then done using the structured attributes rather than the raw data.
In fact, unstructured data already lives inside structured databases. Take BLOBs (binary large objects) - collections of binary data stored as a single entity in a database. BLOBs can hold text, documents, videos, images, and other kinds of unstructured binary data.
As for textual data, some features are already available in relational databases: LIKE pattern matching, regular-expression operators, and full-text search, plus, in some products, entity extraction, text classification, and text similarity. Processing XML is also possible via XQuery support on Oracle and SQL Server, and PostgreSQL adds support for JSON. SQL Server can also map file system directories to database tables through its FileTable feature.
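A small sketch of two of these features, BLOB storage and LIKE matching, is shown here using SQLite through Python's built-in `sqlite3` module. SQLite stands in for the databases named above purely because it needs no server; the table, column names, and sample rows are hypothetical, and other databases differ in syntax and capabilities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, name TEXT, body TEXT, raw BLOB)"
)

# Hypothetical rows: free text in `body`, arbitrary binary content in `raw`.
docs = [
    ("ticket-1.txt", "Checkout failed with a timeout error", b"\x89PNG\r\n"),
    ("ticket-2.txt", "Customer praised the new search page", b"%PDF-1.4"),
]
conn.executemany("INSERT INTO documents (name, body, raw) VALUES (?, ?, ?)", docs)

# LIKE matches with the % and _ wildcards, not with regular expressions.
matches = conn.execute(
    "SELECT name FROM documents WHERE body LIKE ?", ("%error%",)
).fetchall()

# The BLOB comes back as raw bytes; the database stores it without interpreting it.
raw = conn.execute(
    "SELECT raw FROM documents WHERE name = ?", ("ticket-2.txt",)
).fetchone()[0]

print(matches, raw)
conn.close()
```

The database can find text by pattern, but the BLOB stays opaque: extracting structure from it still requires processing outside the query.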
To sum up, there are two major groups of unstructured data that need to be processed. The first is application logs, which are handled with custom code and regular expressions. The second is fully unstructured data, from which advanced algorithms extract structured attributes.
Considering the huge amount of unstructured data and how little of it is being analyzed, venturing into this dense information jungle with the right tools and methods could lead to some amazing discoveries. Seton Healthcare Family, a non-profit health care facility in Texas, uses unstructured clinical information to cut costs by predicting which patients are likely to return and intervening early. An investment firm called BNY Mellon integrates customer interactions with its traditional systems to get a better view of customer needs.
Which insights will you gain by processing unstructured data?