Most businesses start tracking data with Excel or Google Sheets – the manual way. In fact, 63% of U.S. businesses prefer Excel for budgeting and planning, according to a 2018 Robert Half survey. However, the number of businesses depending on simple spreadsheets is rapidly declining. Just a year earlier in 2017, 6% more businesses preferred these legacy, spreadsheet-based solutions.
The reason why companies are leaving Excel and Google Sheets is because they want the benefit of more in-depth data analysis, using data warehouses like Amazon Redshift and Snowflake. Data warehouses are better than simple spreadsheets because they offer scalability, reduced human error through automation, and business intelligence insights – all of which can transform the efficiency and profitability of your business.
That being said, firing up a data warehouse is a lot more complicated than plugging numbers into a spreadsheet. That’s why we wrote this guide. In this guide, you’ll learn:
- What's a Data Warehouse? Why Are They Valuable?
- What's Amazon Redshift?
- How Do You Create a Free Amazon Redshift Data Warehouse?
- How Do You Load Data Onto Your Redshift Data Warehouse?
- How Do You Query Your Redshift Data Warehouse for Basic Information?
(1) What's a Data Warehouse? Why Are They Valuable?
A data warehouse serves as a central data hub for the key information related to your company. By providing access to all your information in one place, the data warehouse integrates data from various databases, spreadsheets, applications, and more. Data from different sources usually doesn't easily fit together, so the data warehouse connects the information in a way that you can perform deep queries and analysis that incorporates all of it.
In this way, the data warehouse brings tremendous value because it lets you access, analyze, interpret, and extract deep insights using as much information as possible. These insights help your company forecast future trends and make better business decisions. In fact, the companies with access to deep insights from machine learning analytics like this gain an immediate leg up on the competition.
Data Warehouses vs. Databases: How Are They Different?
Although you can use a data warehouse to interact with, query, and manage your database, there are some fundamental differences between data warehouses and databases. For one, databases are designed for "transactional data processing." In other words, they store and provide access to information about what happened in your company's record-keeping and accounting. They allow you to run basic queries and reports for stats and other data on this information.
Conversely, a data warehouse is built for pulling together and analyzing massive amounts of data at once. The data warehouse integrates your data for use with powerful business insights tools, machine learning, and artificial intelligence algorithms. This complex analysis provides predictive insights about what could happen in the future (not just reports on what happened or is happening now).
Cloud-Based Data Warehouses: The Way of the Future
An important trend in modern data warehousing strategies relates to cloud-based data warehouses. Most businesses have begun to use cloud-based data warehouse platforms like Amazon Redshift, Snowflake, Google BigQuery, and others. These platforms offer the following advantages:
- Fast, easy, and sometimes free: You can set up a cloud-based data warehouse quickly and inexpensively. If your data needs are modest, you can even create and use a data warehouse for free. We're going to show you how to that in this article.
- No need to purchase hardware: Without the time and expense of buying physical hardware, it's a lot easier to set up a cloud-based data warehouse than a physical one.
- Scalable: Businesses used to suffer far too long with sluggish onsite servers in desperate need of upgrades. With a cloud-based data warehouse they can upgrade the speed and size of their systems simply by changing the configurations.
- Speed: Analytics, business intelligence, and complex data queries can slow down a traditional data system, but cloud-based data warehouses are built for handling analytics tasks.
- Convenient tech maintenance: To maintain your cloud-based data warehouse, you don't need to employ a full-time tech person. Just call the data warehouse provider and their customer support will sort out your challenges.
- Easy integration: Cloud-based data warehouses were made to integrate with as many data types and formats as possible. However, they're still not perfect, and that's where services like Xplenty can help.
We chose to use Amazon Redshift for this how-to guide. In the next sections, we'll discuss Amazon Redshift and provide a step-by-step procedure of creating a free warehouse on the platform.
(2) What's Amazon Redshift and Why Did We Choose It for this Guide?
As one of the most popular data warehouse solutions, Amazon Redshift comes with key advantages regardless of the type and size of your business:
- A free basic tier for setup purposes: You can set up, query, and play with a fully-functional data warehouse on Redshift for free. That way you can decide if this data solution is right for you and your needs.
- Zero hardware requirements: Like all cloud-based data warehouse platforms, you don't need to buy or install any special hardware to use Redshift.
- Scalability: From gigabyte to petabyte scale, rest assured that it's highly unlikely your data needs will "outgrow" the capacity of Redshift no matter how big your company gets.
- Easy for beginners: As part of the Amazon AWS ecosystem, it’s one of the easiest for beginners to set up use
- Use-based pricing: Amazon's "Spectrum Pricing" allows you to merely pay for queries that you run via a use-based pricing model.
Redshift is also one of the most popular data warehousing solutions. Redshift customers currently include notable companies like Nokia, Coinbase, Yelp, Soundcloud, Pinterest, and AMGEN. In addition, a website that tracks data warehouse popularity reports that more companies are adopting Redshift with each passing year (See Image 1):
Redshift is an excellent option for data warehousing “first-timers," but alternatives abound, such as Snowflake and Google BigQuery. Keep a lookout for guides on these data warehouses in the future.
(3) Create a Free Amazon Redshift Data Warehouse
The first step of your data warehouse journey is to create a free Amazon Web Services (AWS) account and log into it. After doing that, you’re ready to start a Redshift instance. But first, you’ll have to set up and connect an IAM role.
(a) Set Up Your IAM Role:
Without getting too technical, your IAM role allows you to access data from AWS instances called S3 (Simple Storage Service) buckets. S3 buckets are like file folders in the AWS ecosystem. These “buckets” contain data and descriptive metadata. The IAM role provides a data connection, so your Redshift cluster can access the data in your S3 buckets. Use the instructions and images that follow to set up your IAM role.
Click this link to go to the IAM console. Then click the “roles” link (See Image 2):
Click “create role” (See Image 3):
Click “Redshift” (See Image 4):
Next, (1) click “Redshift – Customizable”, then (2) click “Next: Permissions” (See Image 5):
On the “Attach Permissions Policies” page, (1) type “AmazonS3ReadOnlyAccess” into the search field, (2) select the checkbox, and (3) click “Next: Tags” (See Image 6):
“Tag” your IAM role so it’s easy to find. Create any tag you want in the field under “Key” (this step is optional) (See Image 7):
Name your IAM role (this step is necessary): (1) Name it what you want, then (2) click “Create Role” (See Image 8):
Congratulations. You’ll now see your IAM role on the list of roles (See Image 9):
(b) Set Up a Cluster for Your Redshift Data Warehouse:
According to Amazon: “An Amazon Redshift data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases.”
To set up a Redshift cluster for your data warehouse, you need to open your Redshift Management Console. Next, you’ll click the button “Quick Launch Cluster” (See Image 10):
Next, you’re going to configure a basic cluster that keeps you eligible for Amazon’s free tier. This configuration shouldn’t trigger any charges, but please keep in mind that Amazon’s pricing structure could change, and we are not responsible for any charges you incur. Always double-check to ensure you will not incur charges before proceeding with any steps in this guide.
If you understand your future data needs, and know you’ll need a larger setup, consider a bigger node type than the default “dc2.large.” Otherwise, leave all of the default settings as they are.
Now, you will (1) choose a password for your cluster; (2) attach your newly created IAM role to the cluster by selecting it from the drop-down titled “Available IAM Roles”; and (3) launch the cluster (See Image 11):
HINT: Look at arrow #2 in the image above. Do you see the text with letters and numbers in the blue box under “TestRole”? These are the credentials for your IAM role. Copy them and save them in a file. You will need them when you’re importing data into your data warehouse.
Congratulations! Your cluster is launching! You can watch the progress by clicking the “clusters” link on the left of the screen (See Image 12):
(c) Secure Access to Your Redshift Data Warehouse:
After your cluster is ready, but before you integrate data into the cluster, it’s vital to adjust the access settings to make sure your data warehouse is secure. You'll secure access to your Redshift cluster by going to the Redshift Console and clicking “Clusters.” Next, click on your newly created cluster, which should be the only one on the list (see image below) (See Image 13):
Now you should be on the tab labeled “Configuration.” Under the section called “Cluster Properties,” click the link beside “VPC Security Groups.” It will say “default” along with some numbers and letters as per the image below (See Image 14):
In the new screen, (1) click the tab labeled “inbound,” then (2) click “edit” (See Image 15):
Here’s how to configure the settings:
- Under “Type,” select “Custom TCP Rule.”
- Under “Protocol,” select “TCP.”
- Under “Port Range,” type 5439. This will be the same port you set when launching the cluster.
- Under “Source,” select “Custom IP” and input the IP address you’ll use to connect to the database (and any other IP addresses you or your team plan to use). Don’t know your IP address? Go here, or just select “My PI.”
- Click “Save.”
(See Image 16):
(4) How to Load Data Onto Your Redshift Data Warehouse
After securing access to your data warehouse, it’s safe to load your data into it. Amazon provides sample data in an S3 bucket that’s ready to load, so we’ll use that for this tutorial. However, you may want to load your own data by following the same instructions we have outlined below.
(a) Configure Your Tables:
First, you’ll configure the schema for the tables that will hold your data. For that, you'll need to access the Query Editor. Go back to the Redshift Management Console and click “Open Query Editor” (See Image 17):
The system may prompt you to fill out a few fields. Use the same data from Section 3 to fill out these fields (See, above, Image 11 for this data). Once you're in the Query Editor, you will cut-and-paste some data to create the schema for your first tables (See Image 18):
Amazon offers several pipe-delimited text files to cut-and-paste and create schema for various tables. Create your tables one-by-one with the following steps:
- Cut-and-paste the table schema for User Table (see data below) into the query text field.
- Click the “Run Query” button.
- Wait for the query to complete.
- Click “Clear.”
- Repeat the process for the rest of the tables (see data below).
Here's the code you'll cut-and-past into the query field to create a variety of tables:
Create User Table:
create table users( userid integer not null distkey sortkey, username char(8), firstname varchar(30), lastname varchar(30), city varchar(30), state char(2), email varchar(100), phone char(14), likesports boolean, liketheatre boolean, likeconcerts boolean, likejazz boolean, likeclassical boolean, likeopera boolean, likerock boolean, likevegas boolean, likebroadway boolean, likemusicals boolean);
Create Venue Table:
create table venue( venueid smallint not null distkey sortkey, venuename varchar(100), venuecity varchar(30), venuestate char(2), venueseats integer);
Create Category Table:
create table category( catid smallint not null distkey sortkey, catgroup varchar(10), catname varchar(10), catdesc varchar(50));
Create Date Table:
create table date( dateid smallint not null distkey sortkey, caldate date not null, day character(3) not null, week smallint not null, month character(5) not null, qtr character(5) not null, year smallint not null, holiday boolean default('N'));
Create Event Table:
create table event( eventid integer not null distkey, venueid smallint not null, catid smallint not null, dateid smallint not null sortkey, eventname varchar(200), starttime timestamp);
Create Listing Table:
create table listing( listid integer not null distkey, sellerid integer not null, eventid integer not null, dateid smallint not null sortkey, numtickets smallint not null, priceperticket decimal(8,2), totalprice decimal(8,2), listtime timestamp);
Create Sales Table:
create table sales( salesid integer not null, listid integer not null distkey, sellerid integer not null, buyerid integer not null, eventid integer not null, dateid smallint not null sortkey, qtysold smallint not null, pricepaid decimal(8,2), commission decimal(8,2), saletime timestamp);
(b) Load Data Into Your Tables:
Now you’ll connect the tables to the information in the Sample S3 Buckets. You’ll do this with the following cut-and-paste commands. Enter the commands, one-by-one, into the query field – just like you did when creating the tables. IMPORTANT: You’ll need the IAM credentials you saved in Step #3 (above). In each of the commands below, replace the text “am-role-arn with the text of your own credentials.
Here's the code you'll cut-and-paste into the query field to connect the data to your tables:
Load Users Data
copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt' credentials 'aws_iam_role=arn:aws:iam-role-arn' delimiter '|' region 'us-west-2';
Load Venue Data:
copy venue from 's3://awssampledbuswest2/tickit/venue_pipe.txt' credentials 'aws_iam_role=arn:aws:iam-role-arn' delimiter '|' region 'us-west-2';
Load Category Data:
copy category from 's3://awssampledbuswest2/tickit/category_pipe.txt' credentials 'aws_iam_role=arn:aws:iam-role-arn' delimiter '|' region 'us-west-2';
Load Date Data:
copy date from 's3://awssampledbuswest2/tickit/date2008_pipe.txt' credentials 'aws_iam_role=arn:aws:iam-role-arn' delimiter '|' region 'us-west-2';
Load Event Data:
copy event from 's3://awssampledbuswest2/tickit/allevents_pipe.txt' credentials 'aws_iam_role=arn:aws:iam-role-arn' delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-west-2';
Load Listing Data:
copy listing from 's3://awssampledbuswest2/tickit/listings_pipe.txt' credentials 'aws_iam_role=arn:aws:iam-role-arn' delimiter '|' region 'us-west-2';
Load Sales Data:
copy sales from 's3://awssampledbuswest2/tickit/sales_tab.txt' credentials 'aws_iam_role=arn:aws:iam-role-arn' delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region 'us-west-2';
(5) Write Queries and Play With Your Redshift Data Warehouse
Now that you’ve loaded your data warehouse with information, it’s time to practice writing a few queries. Try these basic SQL queries on for size:
(a) Query All Data From User Table:
SELECT * FROM users
The result should look like this:
(b) Query Total Sales For a Specific Date:
SELECT sum(qtysold) FROM sales, date WHERE sales.dateid = date.dateid AND caldate = '2008-01-05';
The result should look like this:
Get creative and invent some fresh queries of your own!
Xplenty: Advanced Solutions for Data Integration Challenges
This was a simple example of how to create a basic data warehouse. You can use this data warehouse to integrate data from all the databases and applications related to your company, like data pertaining to: marketing, accounting, payroll, clients, customers, vendors, distributors, scheduling, shipping, production, billing, warehousing, customer satisfaction, market research, advertising, public relations, and more.
Having all of this information in one place will help you make better business decisions, but how do you get the data in there? Redshift interfaces with a wide range of data types. However, not all of your data will readily fit into your warehousing solution (guaranteed).
Whether you're an experienced data guru or a "data newbie", data integration challenges are a massive headache. This is where Xplenty can help. When you encounter data integration bottlenecks, Xplenty's advanced ETL solutions help you extract and transform all kinds of data to work with your data warehouse.
Don't worry! Xplenty is a lot easier to use than you think! Contact the Xplenty team now to learn about our powerful, easy-to-use data integration technologies.