What's the Cheapest Way to Store Big Data in the Cloud?

(Cloud image by PiccoloNamek, modified by Xplenty, Some rights reserved)

There are three ways to collect data on the cloud: storing it directly in the database, uploading log files, or logging via S3/CloudFront. We reviewed the pros and cons of each method, but there was one aspect we didn't cover - price.

Cloud costs are hard to pin down due to complicated pricing models and the difficulty of predicting how many resources a job will need. It's nearly impossible to get an accurate one-size-fits-all figure, but let's try to estimate how much collecting data on the cloud actually costs. The estimates below rest on the following assumptions:


  • AWS services

  • US East region

  • 24/7 usage

  • One-year Heavy Utilization Reserved Instances

  • One billion (10^9) log lines per day

  • Average of 1,000 bytes per log line

  • Total - 1 TB per day, 30 TB per month

  • Only storage costs, processing not included

  • Prices calculated using the Amazon Web Services Simple Monthly Calculator

  • All prices in USD rounded off to the nearest dollar


These are only estimates. Your needs may differ from those listed here, performance tweaks can change the required hardware, and Amazon may modify its prices at any time. If you collect data on the cloud, please feel free to share which method you use and how much it costs in the comments section.

Storing Directly in the DB

Amazon provides two options for running a relational database on the cloud: Relational Database Service (RDS) and a custom installation on Elastic Compute Cloud (EC2). Whichever one is used, a log server is required to collect, generate, and store the logs.

Log Server

The log server should be able to handle 10^9 logs per day. On average that's ~11,000 logs per second, with peaks of up to 10 times that rate. Some companies use in-house solutions, but these take plenty of time and money to develop and maintain. Let's go with an off-the-shelf logger like Fluentd and use a plugin to integrate it with a database.

According to Kazuki Ohta, CTO of Treasure Data (a major contributor to Fluentd), “Fluentd can handle 18,000 messages per sec per core with Intel Xeon L3426 (4Core HT 1.87GHz)”. Hence, four Amazon EC2 m2.2xlarge instances with Intel Xeon E5-2670 (4 cores, 2.6 GHz), 34.2 GB of RAM, and 850 GB drives should be more than enough to handle logging, including peak times, and writing the data into the database. The price for a one-year heavy reservation is $498 per month plus a one-time fee of $6,883 - $12,859 in total per year.
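
As a sanity check, here is a short Python sketch of the sizing arithmetic, using the per-core throughput quoted above (the 10x peak factor is the assumption from the previous paragraph):

```python
import math

# Sizing the Fluentd tier from the figures quoted above.
LOGS_PER_DAY = 10**9
PEAK_FACTOR = 10                      # assume peaks at 10x the average rate
PER_CORE_RATE = 18_000                # Fluentd msgs/sec/core (Ohta's figure)
CORES_PER_INSTANCE = 4                # m2.2xlarge

avg_rate = LOGS_PER_DAY / 86_400               # ~11,574 logs/sec
peak_rate = avg_rate * PEAK_FACTOR             # ~115,741 logs/sec
cores = math.ceil(peak_rate / PER_CORE_RATE)   # 7 cores for peak ingestion
instances = math.ceil(cores / CORES_PER_INSTANCE)  # 2 instances

print(avg_rate, peak_rate, cores, instances)
```

Two instances would cover peak ingestion alone; the four provisioned above leave headroom for writing the data into the database.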

There are several extra charges. Intra-region data transfer of 30 TB per month costs $307 per month, or $3,684 per year (there are no charges for data transfer in). An Elastic Load Balancer is also needed to balance traffic between the instances - it costs $18 per month plus $246 per month for processing 30 TB, $3,168 per year in total.

EC2 instances: $12,859 / year
Intra-region data transfer: $3,684 / year
Elastic Load Balancer: $3,168 / year
Total: $19,711 / year
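
The annual figures above follow from simple monthly arithmetic; a minimal sketch using the rates quoted in this section:

```python
# Annual roll-up for the log server (all rates as quoted above).
ec2 = 498 * 12 + 6_883          # reserved m2.2xlarge fleet: $12,859
transfer = 307 * 12             # intra-region, 30 TB/month: $3,684
elb = (18 + 246) * 12           # ELB hourly fee + 30 TB processed: $3,168
total = ec2 + transfer + elb    # $19,711
print(ec2, transfer, elb, total)
```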

Amazon RDS

Unfortunately, RDS databases have a storage limit of 3 TB and don't support built-in sharding (more complicated sharding solutions are available). At 1 TB of logs per day, 3 TB holds less than three days' worth of data, so RDS isn't a valid option.

Elastic Compute Cloud (EC2)

Running MySQL on EC2 requires a lot of space. A storage-optimized hs1.8xlarge instance with a total of 48 TB of storage should do. 30 TB are generated each month, so another instance is needed every month and a half - eight instances throughout the year. It's cheaper to book them in advance with a one-year heavy reservation than to work on demand and keep scaling. The cost is $5,926 per month plus a one-time fee of $95,820, a total of $166,932 per year.
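
A quick sketch of the instance-count arithmetic (capacity and prices as quoted above):

```python
import math

# Storing a full year of raw logs on hs1.8xlarge instances.
TB_PER_MONTH = 30
INSTANCE_TB = 48                # hs1.8xlarge local storage

instances = math.ceil(12 * TB_PER_MONTH / INSTANCE_TB)  # 360/48 -> 8
annual = 5_926 * 12 + 95_820                            # $166,932
print(instances, annual)
```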

Unfortunately, all data on instance storage is lost when an instance is stopped (it does survive a reboot of the virtual machine). To make sure the data stays put, EBS Provisioned IOPS volumes are needed, and they don't come cheap: 1 TB with 4,000 provisioned IOPS (the maximum allowed) costs $557 per month. Keeping just one month of data - 30 TB, i.e. thirty such volumes - for a whole year costs $200,520.
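
The $200,520 figure comes from pricing a month's worth of volumes for twelve months; in sketch form:

```python
# EBS Provisioned IOPS volumes to keep one month (30 TB) durable.
VOLUME_MONTHLY = 557            # 1 TB volume with 4,000 provisioned IOPS
volumes = 30                    # 30 TB = thirty 1 TB volumes
annual = volumes * VOLUME_MONTHLY * 12   # $200,520
print(annual)
```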

A more affordable option is to keep only a month's worth of raw data, plus aggregations for older data, while archiving the rest on S3. The log files can be gzipped at a ratio of about 1:4, so each month's 30 TB compresses to 7.5 TB for archiving. On S3 that costs $586 per month, $7,032 per year. Only one hs1.8xlarge instance is needed, which costs $773 per month plus a one-time fee of $12,245, a total of $21,521 per year.
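
A sketch of the archive arithmetic (the compression ratio is the 1:4 estimate above; the dollar figures are the calculator outputs quoted in the text):

```python
# The "keep one month locally, archive the rest" option.
GZIP_RATIO = 4                            # ~1:4 for text logs
archive_tb = 30 / GZIP_RATIO              # 7.5 TB of gzipped logs per month
s3_annual = 586 * 12                      # $7,032
ec2_annual = 773 * 12 + 12_245            # one hs1.8xlarge: $21,521
print(archive_tb, s3_annual, ec2_annual)
```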

Log Server: $19,711 / year
EC2: $21,521 / year
S3: $7,032 / year
Total: $48,264 / year

(Lattakia Beach image by Taras Kalapun, Some rights reserved)

Uploading Log Files

The data is stored as large log files that are continuously uploaded to S3, DynamoDB, or Redshift.

Log Server

Requirements for the log server are the same as in the previous method, except the data is saved as files rather than stored in the DB. See prices above.


Amazon S3

7.5 TB per month costs $7,032 per year, as previously calculated.

Log Server: $19,711 / year
S3: $7,032 / year
Total: $26,743 / year


DynamoDB

30 TB per month with DynamoDB costs $9,273 per month, $111,276 per year.

Log Server: $19,711 / year
DynamoDB: $111,276 / year
Total: $130,987 / year


Amazon Redshift

To have 360 TB available for the entire year, 23 dw1.8xlarge instances (16 TB of storage each) are needed. Reserved for one year, that's a one-time payment of $479,242 plus $31,285 monthly - $854,662 in total.

A more cost-effective option is to save only one month's worth of data using two instances: a one-time fee of $43,024 plus $2,770 per month, $76,264 per year. The rest can be archived on Amazon S3, as already calculated.
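
A sketch of the node-count and cost arithmetic for both options (capacities and prices as quoted above):

```python
import math

# Redshift sizing on dw1.8xlarge nodes (16 TB each).
NODE_TB = 16
year_nodes = math.ceil(12 * 30 / NODE_TB)    # 360/16 -> 23 nodes
year_cost = 479_242 + 31_285 * 12            # $854,662

month_nodes = math.ceil(30 / NODE_TB)        # 2 nodes
month_cost = 43_024 + 2_770 * 12             # $76,264
print(year_nodes, year_cost, month_nodes, month_cost)
```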

Log Server: $19,711 / year
Redshift: $76,264 / year
S3: $7,032 / year
Total: $103,007 / year

S3/CloudFront Logging

In this method, events are tracked via HTTP requests for tiny images stored on S3, and access logs are generated automatically. No extra logging servers are needed - only 7.5 TB of log storage per month, which costs $7,032 per year (storage for the transparent image files themselves is marginal, since they are 68 bytes each).
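
As an illustration of the mechanism, a client tracks an event simply by requesting a tiny image with the event details in the query string; the request then shows up in the access logs. The distribution domain, file name, and field names below are made up for the example:

```python
import urllib.request
from urllib.parse import urlencode

# Hypothetical CloudFront distribution serving a 68-byte transparent GIF.
PIXEL_URL = "https://d1234abcd.cloudfront.net/t.gif"

def track(event, **fields):
    # The query string ends up verbatim in the CloudFront access log.
    qs = urlencode({"event": event, **fields})
    urllib.request.urlopen(f"{PIXEL_URL}?{qs}", timeout=2)

# Example usage (the domain above is fictitious, so this would not resolve):
# track("page_view", page="/pricing", uid="42")
```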

CloudFront has to be used as well, or features like logging via the query string won't work. CloudFront GET requests cost $0.0075 per 10,000 requests (see pricing), so 10^9 requests per day cost $750 per day, or $270,000 per year. Ordinarily there are charges for requests to S3 as well, but as long as caching headers are set these charges will be minimal. Serving the transparent images incurs a data transfer of 68 x 10^9 bytes, i.e. 68 GB per day or 2,040 GB per month. That costs $510 per month, $6,120 per year.
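
A sketch of the request and transfer arithmetic (the annual request figure implies a 360-day year):

```python
# CloudFront request and data-transfer charges (rates quoted above).
requests_per_day = 10**9
get_rate = 0.0075 / 10_000                 # $ per GET request

daily_requests = requests_per_day * get_rate    # $750 per day
yearly_requests = daily_requests * 360          # $270,000 per year

gb_per_month = 68 * requests_per_day * 30 / 1e9  # 2,040 GB out
transfer_annual = 510 * 12                       # $6,120
print(yearly_requests, gb_per_month, transfer_annual)
```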

S3 Storage: $7,032 / year
CloudFront GET requests: $270,000 / year
CloudFront Data Transfer Out: $6,120 / year
Total: $283,152 / year


Summary

Following the above research, here are the estimated prices for collecting big data on the cloud using the various methods. All prices are per year, rounded to the nearest thousand dollars.

Method                  Via                Price (USD per year)
Directly in the DB      EC2                $48,000
Uploading Log Files     S3                 $27,000
Uploading Log Files     DynamoDB           $131,000
Uploading Log Files     Redshift           $103,000
S3/CloudFront Logging   S3 + CloudFront    $283,000

Clearly, uploading log files to S3 is the cheapest way to store Big Data on the cloud. Contrary to our previous claims, S3/CloudFront logging is quite expensive. There are some unknowns in the equation, since DBA and developer costs for implementation and maintenance are not included, and neither are the costs of processing the data. Nonetheless, hopefully this review gives a decent indication of how much it costs to collect data on the cloud.
