Unstructured data is any digital information that doesn’t exist in a recognized data structure, such as a relational database table. Essentially, anything that isn’t structured data or semi-structured data counts as unstructured data.
The term Big Data almost exclusively refers to this kind of data. Unstructured data is voluminous, but it can be hard to navigate and analyze. As a result, organizations have developed new ways to store and process this data. These new approaches include Hadoop, NoSQL, and data lakes.
How is Unstructured Data Different from Structured Data?
All data has some structure, either explicit or implicit. For example, when a digital image is stored in a format such as JPG or PNG, the image data exists in a structure implied by the file format.
But generally, structured data refers to information that is suitable for queries in a language such as SQL. This almost exclusively means relational databases, ideally normalized and with key-based relationships between tables.
The term semi-structured refers to data that is ready for conversion to a queryable format with relative ease. A CSV file, for example, is a text file, which is not structured data. But it’s a trivial task to import a CSV file into a relational database, at which point the values in the file become suitable for queries in SQL.
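The conversion described above can be sketched in a few lines. This is a minimal illustration, assuming hypothetical CSV contents and table names, using SQLite as the relational database:

```python
import csv
import io
import sqlite3

# A CSV file is just text (semi-structured); this string stands in for a file on disk.
csv_text = "name,city\nAda,London\nGrace,Arlington\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, city TEXT)")

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
conn.executemany("INSERT INTO people VALUES (?, ?)", reader)

# Once imported, the values are suitable for SQL queries.
print(conn.execute("SELECT city FROM people WHERE name = 'Ada'").fetchone()[0])
```

The interesting point is how little work the "transformation" takes: the CSV already carries rows and columns, which is exactly what makes it semi-structured rather than unstructured.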
Everything else is unstructured data. Common examples of unstructured data include:
- Flat files
- Documents, such as Word files or PDFs
- Multimedia, including audio and video
- Scans of documents (technically images, but they contain text that an OCR process can retrieve)
- System logs
- Biometric data
All of these instances contain data that is of use to the business. Individual files may contain vital information, such as scans of contracts. Or the business may be able to use data analytics techniques to uncover patterns within unstructured data. For example, a deep analysis of website activity logs may reveal information about user behavioral patterns.
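As a toy version of the log-analysis idea above, the sketch below counts page visits per user from a handful of access-log lines. The log format ("user path timestamp") and the sample data are assumptions for illustration only:

```python
from collections import Counter

# Hypothetical access-log lines in a "user_id path timestamp" format.
log_lines = [
    "u1 /home 2024-01-01T10:00",
    "u1 /pricing 2024-01-01T10:01",
    "u2 /home 2024-01-01T10:02",
    "u1 /pricing 2024-01-01T10:05",
]

# Tally which pages each user visits, a first step toward behavioral patterns.
visits = Counter()
for line in log_lines:
    user, path, _timestamp = line.split()
    visits[(user, path)] += 1

print(visits.most_common(1))  # the most frequent (user, page) pair
```

Real log analysis runs over millions of lines with far messier formats, but the principle is the same: the raw text has no query-ready structure until a process like this imposes one.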
How is Unstructured Data Stored?
Organizations often generate vast amounts of unstructured data every day, and this data can sit everywhere from desktop folders to email servers. Most of these companies will want to organize and consolidate this data for purposes such as storage and analytics.
There are two main strategies for handling unstructured data at scale: a NoSQL database and a data lake.
NoSQL has emerged over recent years as one of the preferred methods for dealing with large volumes of unstructured data.
NoSQL stands for “Not Only SQL”: these systems can store data that would fit a relational model, but they also support more flexible data structures. NoSQL approaches unstructured data in a variety of ways, such as:
- Key-value stores: The database holds a table of keys, with each key pointing to a data item. This can be any type of data, including video, text files, or JSON. This is one of the simpler NoSQL strategies and is often used for storing data, rather than building transactional databases.
- Document store: This strategy involves encoding values in a standardized format, such as YAML, JSON, or BSON. Depending on the database, it may organize these documents into a logical structure and cache the most commonly used values.
- Graph store: This system is ideal for unstructured data containing relationships represented by a graph. It’s a popular system for storing social media data, where you can graph relationships between users.
- Wide-column store: These databases work like relational databases, except with a greater degree of flexibility: column names and formats can vary between rows. This technique is not quite the same as a columnar database, which is a storage layout for large relational tables.
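To make the simplest of these strategies concrete, here is a toy in-memory key-value store. This is a sketch of the concept only; production systems such as Redis or DynamoDB add persistence, replication, and expiry on top of the same basic interface:

```python
import json

# A toy key-value store: each key maps to an arbitrary blob of bytes.
store = {}

def put(key, value_bytes):
    store[key] = value_bytes

def get(key):
    return store.get(key)

# The value can be any type of data; here, a JSON document serialized to bytes.
put("user:42", json.dumps({"name": "Ada", "tags": ["admin"]}).encode())

doc = json.loads(get("user:42"))
print(doc["name"])
```

Note that the store itself knows nothing about the value's contents; that opacity is what lets a key-value store hold video, text, or JSON with equal ease.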
While data warehouses are highly structured, a data lake has almost no inherent structure, making it an ideal repository for unstructured data storage.
A data lake consists of several components:
- File system: Lakes often hold vast amounts of data. To do this efficiently, they rely on a distributed file system such as the Hadoop Distributed File System (HDFS), which spreads the data across a large cluster of storage nodes. Each node holds a small section of the overall lake. In some models, the node that stores a piece of data also handles processing requests for it; in other models, the file system decouples processing from storage.
- Data pipeline: Data has to get from its sources into the data lake. This generally involves an automated ELT (Extract, Load, Transform) process. ELT is faster than ETL (Extract, Transform, Load) because it skips the intermediate transformation layer. Instead, ELT relies on schema-on-read: the data is stored as-is, and the end user applies structure to it at query time.
- Sorting tool: Data lake users require some kind of tool to help them navigate a vast data lake. MapReduce is one of the more familiar tools. Generally associated with Hadoop, MapReduce works by mapping data into key-value pairs and then grouping and reducing those pairs, creating a more logical data structure. Hadoop’s architecture allows thousands of MapReduce tasks to run concurrently, which can produce fast results.
- Analytics tools: For the most part, organizations will use their data lakes as a source of analytics insights. There are many tools, such as GoodData and Dundas BI, that can explore and analyze data lakes, returning actionable insights.
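The MapReduce pattern mentioned above can be sketched in miniature with the canonical word-count example. This single-process version only illustrates the three phases; a real framework like Hadoop distributes each phase across the cluster:

```python
from collections import defaultdict

# Map: break each input record into (token, 1) pairs.
def map_phase(records):
    for record in records:
        for token in record.split():
            yield token, 1

# Shuffle: group pairs by key, as the framework would across many nodes.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: combine each key's values into a single result.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

records = ["big data is big", "data lakes hold data"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts["data"])
```

Because the map and reduce functions are independent per record and per key, the framework can run thousands of them concurrently, which is the source of MapReduce's speed at scale.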
For analytics purposes, the most important aspect of a data lake is that its data be current. That’s why data lakes are so reliant on a data pipeline to keep them constantly refreshed.
Finally, a data lake also requires good data governance. Otherwise, it risks becoming a data swamp, filled with extraneous data that only serves to slow analytics queries.
Data governance for unstructured data means:
- Creating detailed metadata for all data ingested into the lake
- Establishing clear business rules about the lifecycle of different data types
- Performing regular audits to ensure data quality
- Expunging all data that has expired
Metadata can itself be structured data. For example, in a large cache of images, each image will have metadata describing image format, resolution, geolocation information, and other key details. It’s possible to store this data in a relational database, while the images themselves persist in a data lake.
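A minimal sketch of this split, assuming hypothetical lake paths and metadata fields, again using SQLite as the relational side:

```python
import sqlite3

# The images themselves stay in the lake; only their descriptive
# attributes live in a relational table that supports SQL queries.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE image_metadata (
        lake_path TEXT PRIMARY KEY,  -- where the object lives in the lake
        format    TEXT,
        width_px  INTEGER,
        height_px INTEGER,
        latitude  REAL,
        longitude REAL
    )
""")
conn.execute(
    "INSERT INTO image_metadata VALUES (?, ?, ?, ?, ?, ?)",
    ("lake/images/0001.jpg", "JPEG", 4000, 3000, 51.5, -0.1),
)

# Queries run against the metadata, never against the image bytes.
row = conn.execute(
    "SELECT lake_path FROM image_metadata WHERE width_px >= 4000"
).fetchone()
print(row[0])
```

The query returns a pointer into the lake, so an application can first filter on structured metadata and only then fetch the heavyweight unstructured object it needs.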