Although there are many different advantages of moving your data to the cloud, there's one that might be especially important for you: price. Unfortunately, calculating the exact costs of storing data in the cloud can be a little messy, due to complicated pricing models and challenges predicting how many resources you'll need to run a job.
Let’s motivate the issue of cloud storage costs by discussing an example use case: logging big data. In this case, there are three ways to store big data on the cloud: storing it directly in the database, uploading log files, or logging via S3/CloudFront. While it’s challenging to come up with an accurate one-size-fits-all figure, this article will talk about how much storing big data in the cloud really costs, so that you can have a better estimate for your own organization.
Table of Contents
- Storing Big Data Pricing Assumptions
- Storing Directly in the Database
- Uploading Log Files
- S3/CloudFront Logging
- Storing Big Data: What's the Verdict?
Storing Big Data Pricing Assumptions
We first need to set some parameters for our pricing model. Of course, this won't reflect every situation out there—but for the purposes of this exercise, we want to establish a few simple assumptions.
Here's what we’ll be working with:
- Amazon Web Services, US East region
- 24/7 usage, 1-year reserved instance, heavy utilization
- 1 billion log lines per day, with an average of 1,000 bytes per log line, for a total of 1 terabyte per day, or 30 terabytes per month
- Only storage costs (processing not included)
- Prices calculated using the AWS Pricing Calculator (all prices in USD)
Again, a disclaimer: these are only estimates. Your needs may be different from those listed here, performance tweaks might change the required hardware, and Amazon can modify prices at any given time. If you collect data in the cloud, please feel free to let us know which method you use and how much it costs.
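As a quick sanity check on the workload assumptions above, the following sketch derives the daily, monthly, and yearly volumes and the average log rate. It assumes decimal terabytes (10^12 bytes) and a 30-day month; the function and constant names are illustrative only.

```python
# Workload assumptions taken from the list above.
LOG_LINES_PER_DAY = 1_000_000_000
BYTES_PER_LINE = 1_000
DAYS_PER_MONTH = 30      # simplification: every month has 30 days
BYTES_PER_TB = 10**12    # assumption: decimal terabytes


def daily_volume_tb(lines_per_day: int, bytes_per_line: int) -> float:
    """Raw log volume per day, in terabytes."""
    return lines_per_day * bytes_per_line / BYTES_PER_TB


def average_lines_per_second(lines_per_day: int) -> float:
    """Average ingest rate, assuming traffic is spread evenly over the day."""
    return lines_per_day / (24 * 60 * 60)


daily_tb = daily_volume_tb(LOG_LINES_PER_DAY, BYTES_PER_LINE)
monthly_tb = daily_tb * DAYS_PER_MONTH
yearly_tb = monthly_tb * 12

print(f"{daily_tb} TB/day, {monthly_tb} TB/month, {yearly_tb} TB/year")
print(f"{average_lines_per_second(LOG_LINES_PER_DAY):,.0f} log lines per second on average")
```

Real traffic is rarely uniform, so the average rate is a lower bound on what the logging tier has to absorb at peak.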
Storing Big Data Directly in the Database
AWS provides two options for running a relational database in the cloud: Relational Database Service (RDS) and a custom installation on Elastic Compute Cloud (EC2). In both cases, you'll need a log server to collect, generate, and store the logs.
The log server should be able to handle 1 billion logs per day, or roughly 11,000 logs per second on average. Some companies use in-house solutions, but they take plenty of time and money to develop and maintain. Let's go with an off-the-shelf logger like Fluentd and use a plugin to integrate it with a database.
According to Kazuki Ohta, CTO of Treasure Data (a major contributor to Fluentd): “Fluentd can handle 18,000 messages per sec per core with Intel Xeon L3426 (4Core HT 1.87GHz)”. In other words, 4 Amazon EC2 r4.xlarge instances, each with an Intel Xeon E5-2686 v4 (Broadwell) processor, 4 vCPUs, and 30.5 gigabytes of RAM, should be more than enough to handle logging, including during peak times, and to write the data into the database.
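To see how much headroom that leaves, here is a rough sketch that compares the quoted Fluentd throughput with the average log rate. It treats one vCPU as equivalent to one of the benchmarked cores, which is an optimistic assumption, and the peak factor passed to `cores_needed` is a hypothetical knob, not a measured value.

```python
import math

MESSAGES_PER_SEC_PER_CORE = 18_000                   # figure quoted above
AVERAGE_MESSAGES_PER_SEC = 1_000_000_000 / 86_400    # about 11,574
INSTANCES = 4
VCPUS_PER_INSTANCE = 4                               # r4.xlarge


def cores_needed(peak_factor: float) -> int:
    """Cores required if peak traffic is `peak_factor` times the average."""
    peak_rate = AVERAGE_MESSAGES_PER_SEC * peak_factor
    return math.ceil(peak_rate / MESSAGES_PER_SEC_PER_CORE)


# Assumption: each vCPU performs like one benchmarked core.
fleet_capacity = INSTANCES * VCPUS_PER_INSTANCE * MESSAGES_PER_SEC_PER_CORE
headroom = fleet_capacity / AVERAGE_MESSAGES_PER_SEC

print(f"Cores needed at average load: {cores_needed(1.0)}")
print(f"Cores needed at 10x peak:     {cores_needed(10.0)}")
print(f"Fleet capacity: {fleet_capacity:,} msg/s ({headroom:.1f}x the average rate)")
```

Under these assumptions the four instances are sized for redundancy and database writes rather than raw Fluentd throughput.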
The hourly rate for a 1-year reserved instance is $0.168: with 4 instances, that’s 4 * $0.168 * 24 * 365, or roughly $5,900 per year.
There are several possible extra charges, however:
- Transferring 30 terabytes of data per month from US East to another AWS region costs $0.02 per gigabyte, or $0.02 * 30 * 1000 = $600 per month ($7,200 per year).
- You’ll need an elastic load balancer to balance between your instances. According to Amazon, a load balancer costs about $0.0225 per hour, or about $16 per month (roughly $200 per year) if running full-time.
|4 Amazon EC2 r4.xlarge instances||$5,900 / year|
|Data transfers||$7,200 / year|
|Elastic Load Balancer||$200 / year|
|Total||$13,300 / year|
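The log server total can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the rates quoted above; the variable names are illustrative, and the table rounds each line item.

```python
HOURS_PER_YEAR = 24 * 365

ec2_hourly_rate = 0.168          # USD per r4.xlarge, 1-year reserved
ec2_instances = 4
transfer_rate_per_gb = 0.02      # USD, inter-region data transfer
monthly_transfer_gb = 30 * 1000  # 30 TB per month, decimal gigabytes
elb_hourly_rate = 0.0225         # USD per load balancer hour

ec2_yearly = ec2_instances * ec2_hourly_rate * HOURS_PER_YEAR
transfer_yearly = transfer_rate_per_gb * monthly_transfer_gb * 12
elb_yearly = elb_hourly_rate * HOURS_PER_YEAR
log_server_yearly = ec2_yearly + transfer_yearly + elb_yearly

print(f"EC2 instances:  ${ec2_yearly:,.0f}")
print(f"Data transfers: ${transfer_yearly:,.0f}")
print(f"Load balancer:  ${elb_yearly:,.0f}")
print(f"Total:          ${log_server_yearly:,.0f}")
```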
First, a note on Amazon RDS: RDS databases have a storage limit of 64 terabytes, which means that they’ll be too small for our needs in this example. At 30 terabytes per month, we’ll run out of room in just 2 months! There is an option to shard data across multiple Amazon RDS instances, but in this post, we'll focus on some other, more appropriate options.
Elastic Compute Cloud (EC2)
Running MySQL on EC2 requires a lot of space. A storage-optimized d2.8xlarge instance with a total of 48 terabytes of storage should do. With 30 terabytes of data generated each month, you'll need another instance every month and a half, for a total of 8 instances throughout the year. It's cheaper to book them in advance for 1 year than it is to work on-demand and keep scaling. The cost of a single d2.8xlarge instance is around $3.216 per hour, $2,300 per month, or $28,000 per year. With 8 instances per year, that comes to roughly $225,000.
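The instance count and yearly figure follow directly from the numbers in this paragraph. The sketch below assumes, as the paragraph does, that all instances are reserved for the full year even though the later ones are only needed as data accumulates.

```python
import math

YEARLY_DATA_TB = 360          # 30 TB per month for 12 months
INSTANCE_STORAGE_TB = 48      # d2.8xlarge local storage
HOURLY_RATE = 3.216           # USD, rate quoted above
HOURS_PER_YEAR = 24 * 365

# 360 / 48 = 7.5, so we round up to whole instances.
instances = math.ceil(YEARLY_DATA_TB / INSTANCE_STORAGE_TB)
cost_per_instance_year = HOURLY_RATE * HOURS_PER_YEAR
fleet_cost_year = instances * cost_per_instance_year

print(f"Instances needed:      {instances}")
print(f"Cost per instance:     ${cost_per_instance_year:,.0f} / year")
print(f"Cost for all instances: ${fleet_cost_year:,.0f} / year")
```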
Unfortunately, the local storage on these instances is ephemeral: you lose all your data when stopping instances (although the data stays when rebooting the virtual machine). To make sure it stays put, you'll need an Amazon EBS Provisioned IOPS SSD volume, and these don't come cheap.
A more affordable option is to keep only a month's worth of raw data in the database, along with aggregations for older data, while archiving the rest on S3. You can gzip log files down to about a quarter of their original size, which means that each month's 30 terabytes shrinks to about 7.5 terabytes of archive. On S3, each 7.5-terabyte batch costs around $170 per month to store, which adds up to about $13,700 per year as the archive grows (12 months for the first 7.5 terabytes, 11 months for the next 7.5 terabytes, etc.). We'll only need one d2.8xlarge instance, which costs $28,000 per year, as mentioned above.
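The growing-archive calculation is easy to get wrong, so here is a small sketch of it. It assumes the $0.023 per gigabyte S3 Standard rate used elsewhere in this article and shows both decimal (1,000 GB) and binary (1,024 GB) terabytes, since the conversion shifts the result slightly; the function name is illustrative.

```python
def cumulative_storage_cost(batch_gb: float, price_per_gb_month: float,
                            months: int = 12) -> float:
    """Yearly cost when one new batch of `batch_gb` is added every month.

    The first batch is billed for `months` months, the second for
    `months - 1`, and so on, which sums to months * (months + 1) / 2
    batch-months in total.
    """
    batch_months = months * (months + 1) // 2   # 78 for a 12-month year
    return batch_gb * price_per_gb_month * batch_months


S3_PRICE_PER_GB_MONTH = 0.023    # assumption: S3 Standard rate from this article
compressed_tb = 30 / 4           # 30 TB of raw logs gzipped to a quarter

low = cumulative_storage_cost(compressed_tb * 1000, S3_PRICE_PER_GB_MONTH)
high = cumulative_storage_cost(compressed_tb * 1024, S3_PRICE_PER_GB_MONTH)

print(f"Yearly S3 archive cost: ${low:,.0f} to ${high:,.0f}")
```

The article's figure of about $13,700 sits inside this range, close to the binary-terabyte end.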
Together, these costs come to an estimated $55,000/year for storing big data directly in the database.
|Log Server||$13,300 / year|
|S3 Storage||$13,700 / year|
|d2.8xlarge instance||$28,000 / year|
|Total||$55,000 / year|
Uploading Log Files
In this case, data is stored as big log files that are continuously uploaded into S3, DynamoDB, or Redshift.
The requirements for the log server are the same as in the previous method, except that you'll be saving the data as files rather than in the database. See the prices above.
S3 Standard storage costs $0.023 per gigabyte per month. With 7.5 terabytes of compressed logs added each month, that’s about $170 per month for each batch or, as previously calculated, about $13,700 per year. Added to the cost of running the log server ($13,300 per year), that gives roughly $27,000, or about $30,000/year in round numbers. (The costs of transparent image file storage are marginal since each file is only 68 bytes.)
DynamoDB charges $0.25 per gigabyte per month for data storage. If you’re adding 30 terabytes per month to DynamoDB, each month's data costs roughly $7,700 per month to keep, which accumulates to a little over $600,000 over the year. Combined with the cost of the log server, that’s $613,300/year.
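Because nothing is deleted, the DynamoDB bill grows every month. The sketch below lists the month-by-month storage bill under that assumption, using binary terabytes (1,024 GB) because that conversion comes closest to the $7,700 figure; the names are illustrative.

```python
PRICE_PER_GB_MONTH = 0.25        # USD, DynamoDB storage rate quoted above
MONTHLY_DATA_GB = 30 * 1024      # assumption: 30 TB of uncompressed logs, binary GB


def monthly_bills(months: int = 12) -> list[float]:
    """Storage bill for each month, given that data is never deleted."""
    return [m * MONTHLY_DATA_GB * PRICE_PER_GB_MONTH for m in range(1, months + 1)]


bills = monthly_bills()
for month, bill in enumerate(bills, start=1):
    print(f"Month {month:2d}: ${bill:>9,.0f}")
print(f"Year total: ${sum(bills):,.0f}")
```

The unrounded total lands just under $600,000; the article's slightly higher figure comes from rounding the monthly rate up to $7,700 first.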
With Redshift, to have 360 terabytes available for the entire year, you'll need 23 ds2.8xlarge nodes (16 TB of space each). Reserved for one year and paid upfront, that will cost you around $790,000.
A more cost-effective option is to only save one month's worth of data using 2 ds2.8xlarge nodes while archiving the rest on Amazon S3. The cost will be roughly $69,000, paid upfront for a 1-year term, plus the costs of S3 calculated above.
The total cost for uploading log files to Redshift will then be about $96,000/year.
|Log Server||$13,300 / year|
|S3 Storage||$13,700 / year|
|2 ds2.8xlarge instances||$69,000 / year|
|Total||$96,000 / year|
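The Redshift sizing above comes down to rounding up to whole nodes. Here is a minimal sketch, assuming 16 TB of usable space per ds2.8xlarge node and uncompressed data; the dollar amounts are the rounded figures from this article, not live prices.

```python
import math

NODE_STORAGE_TB = 16   # ds2.8xlarge capacity quoted above


def nodes_needed(data_tb: float) -> int:
    """Whole ds2.8xlarge nodes required to hold `data_tb` terabytes."""
    return math.ceil(data_tb / NODE_STORAGE_TB)


full_year_nodes = nodes_needed(360)   # keep the whole year in Redshift
one_month_nodes = nodes_needed(30)    # keep one month, archive the rest on S3

# Rounded yearly figures from this article (USD).
redshift_two_nodes = 69_000
s3_archive = 13_700
log_server = 13_300
total = redshift_two_nodes + s3_archive + log_server

print(f"Nodes for a full year of data: {full_year_nodes}")
print(f"Nodes for one month of data:   {one_month_nodes}")
print(f"Total with S3 archive:         ${total:,} / year")
```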
S3/CloudFront Logging
This method tracks events via HTTP requests to images stored in S3 directories; each request automatically generates a log entry. It needs no extra logging servers and only 7.5 terabytes per month of storage. As previously calculated, 7.5 terabytes of storage per month on S3 is roughly $13,700 per year.
You'll need to use CloudFront as well, or features like logging via the query string won't work. CloudFront GET requests cost $0.0075 per 10,000 requests. 1 billion HTTP requests will cost $750 per day, or around $270,000 per year.
Traditionally there are charges for requests to S3 as well, but as long as you set caching headers, these charges will be minimal. Accessing the transparent images incurs a data transfer of 68 bytes × 1 billion requests: 68 gigabytes per day, or 2,040 gigabytes per month. Outward data transfers from S3 cost $0.02 per gigabyte, which comes to about $40 per month, or about $500 per year.
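The request and transfer charges can be checked with the sketch below. It uses only the rates quoted in this section, decimal gigabytes, and a 30-day month; the constant names are illustrative.

```python
REQUESTS_PER_DAY = 1_000_000_000
PRICE_PER_10K_REQUESTS = 0.0075   # USD, CloudFront GET requests
PIXEL_BYTES = 68                  # size of the transparent image
TRANSFER_PRICE_PER_GB = 0.02      # USD, outbound data transfer
BYTES_PER_GB = 10**9              # assumption: decimal gigabytes

# Request charges.
request_cost_per_day = REQUESTS_PER_DAY / 10_000 * PRICE_PER_10K_REQUESTS
request_cost_per_year = request_cost_per_day * 365

# Data transfer charges for serving the tracking image.
transfer_gb_per_day = REQUESTS_PER_DAY * PIXEL_BYTES / BYTES_PER_GB
transfer_cost_per_month = transfer_gb_per_day * 30 * TRANSFER_PRICE_PER_GB
transfer_cost_per_year = transfer_cost_per_month * 12

print(f"Requests: ${request_cost_per_day:,.0f} / day, ${request_cost_per_year:,.0f} / year")
print(f"Transfer: {transfer_gb_per_day:.0f} GB / day, ${transfer_cost_per_year:,.0f} / year")
```

Request pricing dominates here: the per-request fee is several hundred times larger than the transfer fee for a 68-byte image.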
Adding these costs together, we get a total price of about $284,000/year for storing big data via S3/CloudFront logging.
|S3 Storage||$13,700 / year|
|CloudFront Requests||$270,000 / year|
|CloudFront Data Transfers||$500 / year|
|Total||$284,000 / year|
Storing Big Data: What's the Verdict?
As a reminder, here are the total costs of all the methods we’ve discussed:
|Directly in the Database||$55,000 / year|
|Uploading log files to S3||$30,000 / year|
|Uploading log files to DynamoDB||$613,000 / year|
|Uploading log files to Redshift||$96,000 / year|
|S3/CloudFront Logging||$284,000 / year|
Based on these analyses, uploading log files to S3 is the cheapest way to store big data in the cloud. Contrary to some assumptions, S3/CloudFront logging is quite expensive.
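To make the comparison concrete, this short sketch ranks the yearly totals from the table above and expresses each one as a multiple of the cheapest option. The figures are the article's rounded estimates, not live AWS prices.

```python
# Rounded yearly totals from the summary table (USD).
yearly_costs = {
    "Directly in the database": 55_000,
    "Uploading log files to S3": 30_000,
    "Uploading log files to DynamoDB": 613_000,
    "Uploading log files to Redshift": 96_000,
    "S3/CloudFront logging": 284_000,
}

# Sort from cheapest to most expensive.
ranked = sorted(yearly_costs.items(), key=lambda item: item[1])
cheapest_name, cheapest_cost = ranked[0]

for name, cost in ranked:
    print(f"{name:<34} ${cost:>8,} / year  ({cost / cheapest_cost:.1f}x)")
```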
Of course, this is just one example of how to calculate the cost of storing big data in the cloud; your mileage may vary based on your own business needs and objectives. Most importantly, there are a few big unknowns in the equation: DBA and developer costs for implementation and maintenance are not included, and neither are costs for processing the data.
Nonetheless, we hope that this overview has helped you figure out a decent way to estimate how much it costs to store data in the cloud.