What are Columnar Databases?

A columnar database is a database management system that stores data in columns rather than rows. This helps to minimize resource usage related to queries on big data sets.

Why are Columnar Databases Used?

Typically, a database table stores data in rows like this:

When the system needs to add a record, it appends a new row to the table. When a record is deleted, the corresponding row is deleted. This approach works for systems whose main purpose is transactional – production databases such as the CRM or ERP, for example.

All data is held on a physical storage device, either a hard disk drive (HDD) or solid-state drive (SSD), and there is a resource cost when querying those devices for data. With a row-oriented database, each row is usually contained in nearby sectors, which allows for faster reading from and writing to the HDD or SSD.

However, data analytics often looks at columns rather than rows. For example, imagine an analyst is trying to run a report on the database above to study customer spending by state. They would begin by querying all spending totals and all states. Because the database is physically stored by row, the operation could end up searching across the entire storage array for results.

Columnar databases are a solution to this problem. A columnar version of the above database would look like this:

In physical terms, this structure means that all fields are now grouped together, which means that there is minimal overhead when selecting all states or all order totals in the database.

Columnar databases are unsuitable for high-transaction environments. They are most useful for large volumes of static data that are subject to many read requests, and few write requests.

How are Columnar Databases Implemented?

Columnar databases are mostly used in environments such as data warehouses, where the data stored is mostly used for the purpose of analytics. There are several technical solutions to the implementation of columnar databases. For example, Apache Hadoop includes Parquet, a data storage format designed for column-oriented databases.

To move row-oriented databases to a columnar model, most organizations use Extract, Transform, Load (ETL) applications such as Xplenty. ETL works in the following way:

  • Extract: Production databases must usually be row-oriented to facilitate day-to-day transactions. The Extract stage exports all relevant data without altering the production data.
  • Transform: The exported data is transformed from row-oriented to column-oriented, using schema defined in the ETL application. Errors and outliers can be flagged up for manual intervention.
  • Load: The transformed data is loaded into an appropriate data structure, where it can be queried directly or via analytics tools.

The ETL pipeline ensures that the columnar database always contains the most recent data so that all analytics are accurate and representative.