Building a data lake in Amazon S3 and using Amazon Redshift Spectrum to query that data from a Redshift cluster is a common practice. However, when it comes to boosting performance, a few tricks are worth learning. One of them is storing data in Parquet format, which Redshift considers a best practice.

Here's how to use Parquet format with Xplenty for the best data lake performance.

Table of Contents

  1. What Is Parquet?
  2. Why Should a Data Lake Use Parquet?
  3. The Downsides of Parquet
  4. Using Parquet with Xplenty
  5. Improve The Efficiency of Your Data Processing with Xplenty


What Is Parquet?

Apache Parquet is an open-source file format. It's a popular columnar storage format for Impala, AWS Athena, Presto, and countless other data warehouse (DWH) solutions. Compared to row-based files, such as CSV or TSV, Parquet's flat columnar storage format is efficient and high-performing.  

With a superior algorithm for record shredding and assembly, Parquet outperforms approaches that simply flatten nested namespaces. Parquet is built to handle complex, nested data sets in bulk and at scale. It also offers a range of efficient compression and encoding options for large amounts of data.

Parquet is often the go-to option when a business uses queries that require information from certain columns within a large table. By only reading the necessary columns, Parquet format greatly minimizes your latency and compute usage. 

Why Should a Data Lake Use Parquet?

There are countless advantages to storing data in a columnar format such as Apache Parquet. Parquet is designed to improve efficiency, performing much better than row-based files, such as CSV. Additional benefits include the following:

  • With columnar storage, queries can skip irrelevant data quickly. This results in aggregation queries being far less resource- and time-consuming compared to running the same query in a row-based data store. This directly translates into reduced latency and increased hardware savings. 
  • From the ground up, Parquet supports advanced, nested data structures. By default, Parquet lays out data files in a manner that optimizes the information for queries that process large amounts of information at once (i.e., gigabytes for each file). 
  • To further improve performance, Parquet accommodates flexible compression along with various, highly efficient encoding schemes. Since the data type for every column in Parquet format is similar, it's straightforward to compress them all, further speeding up query execution. You can select from several available codecs to compress the data, allowing separate files to be compressed differently. 

As you can see, there are countless advantages to using Parquet. But there are some circumstances where it may not work.


The Downsides of Parquet

Parquet isn't perfect. In some scenarios, it may be less than ideal. Understanding the downsides of Parquet is important to ensure you select the right format for your data lake. Here are some aspects of Parquet that may pose a challenge:

  • Support for Parquet's datetime and complex data types varies across query engines, so check your tooling before committing to them.
  • Parquet is best suited to interactive and serverless analytics solutions, like Amazon Redshift Spectrum and Google BigQuery; it's a poor fit for row-at-a-time, transactional workloads. 
  • With a column-based design, you need to plan the schema and data types up front.
  • Built-in support in tools other than Spark, Apache's analytics engine, can be limited.
  • Parquet doesn't support in-place data modification, since Parquet files are immutable.
  • Parquet doesn't handle schema evolution on its own. If you're using Spark, it can merge schemas from multiple files over time. 

Despite these downsides, Parquet is still an excellent option for many situations. Fortunately, getting it up-and-running with Xplenty is very easy.

Using Parquet with Xplenty

At Xplenty, we believe everything should be easy — especially when it comes to defining and managing the file formats of your ETL output. With Xplenty, opting to use Parquet for your storage destination only requires you to define a few simple parameters.

Parquet for File Storage Destination

When using Xplenty, you can opt to use Parquet by setting up the parameters within the Destination Properties menu. You can use the following system variables to control the Parquet output:

_PARQUET_COMPRESSION: This variable allows you to control the compression type, with available values being UNCOMPRESSED, GZIP, or SNAPPY.

_PARQUET_PAGE_SIZE: This variable simply defines the page size for Apache Parquet compression.

_PARQUET_BLOCK_SIZE: This variable allows you to define the size of a row group that Parquet is buffering in memory.

You can find additional steps for defining your file storage destination in Xplenty's detailed guide.

Troubleshooting

As you build your data lake in S3 with the help of Amazon Redshift Spectrum, follow the setup steps to connect the components together. If you run into trouble, reach out to the Xplenty support team for assistance.

Improve The Efficiency of Your Data Processing with Xplenty


Xplenty makes robust ETL simple through a streamlined, graphical user interface that puts your team in total control, with minimal programming knowledge required. Plus, with our collaborative tools, multiple people can participate in the design of your data flow.

Ready to get started? Xplenty can help you improve the efficiency of your data processing at every step. Schedule an intro call to book a demo or start your risk-free 14-day trial of the Xplenty platform.