The Importance of Good Data Hygiene
The Importance of Good Data Hygiene
Data hygiene is just like personal hygiene; everybody wants it, but not everyone has the easiest time building or maintaining it. Join Ben Perlmutter, Solutions Architect at Xplenty, and Jennifer Hudiono, Product Manager at Chartio, as they discuss the importance of good data hygiene and highlight key methodologies that will help you build and maintain it.
- How data hygiene can improve analytics and long-term insights
- The key differences between data warehouses and data lakes
- Large-scale strategies to help you implement and maintain good data hygiene
- How to organize your data lakes and data tables for optimum hygiene
- Best practices and good habits that will increase efficiency and make data accessible company-wide
Q: What is a data lake and what are its key characteristics?
A: In general, if you want to put data in a data lake, you don’t need to do anything to that data - you just throw it in there. This means that you can collect unstructured data from social media, structured data from your accounting system, unstructured data from an advertising API, and more. All of this information, whether it’s related or not, can just be dumped in that place. So it’s easy to set up.
In general, as well, data lakes are built on inexpensive commodity hardware, which means that it’s not that expensive to store a ton of information. For that reason, you can store all of your historical data and company information.
Some examples of data lakes - also called “blob storage” - include things like Amazon S3, Google Cloud Storage, SFTP and HDFS.
Q: What is a data warehouse and what are its defining characteristics?
A: I personally like to think of a data warehouse as the hygienic place where my data goes when it’s all clean and ready for processing. If you think about a lake, it’s kind of a murky pond. A data warehouse, on the other hand, is this place where I can take a white glove on my finger, scrape it across the top of any surface, and there won’t be any dust on it. So some characteristics of that data warehouse would be things like well-defined schemas and architectures.
This does often take planning up-front to determine things like how you want to store your data, what the tables are going to look like, what the data types are going to be, what the columns are going to be called, etc. So there is this up-front cost to getting a warehouse started. However, once you have all of that in place, you have common datatypes that are ready for analytics and that’s the key to a warehouse.
Generally, it’s a bit more expensive than your “blob storage” but that’s because it’s organized and it’s ready for those analytics.
Some examples of data warehouses would be Redshift, BigQuery, Snowflake, Oracle, and Microsoft SQL server.
Q: What is data aggregation and why is it important?
A: Aggregation is a blanket term for aggregating and consolidating your data so that it’s effective for visualization.
Over time, your data only grows. When you bring raw data into a visualization tool, you’re really just faced with more tables, columns and fields to work with because you haven’t touched that data at all. Working with data in this way can be equivalent to not washing your hands - if you don’t address it and clean it, it will only build up and cause problems for you down the line.
When fixing this problem, you want to think about your data audience. If you’re working with analysts, it’s fine to let them use raw data. They know exactly how to work with that and they’re very quick and efficient. But if you’re really opening your data up to business users or other teams who are not quite as proficient with data, this becomes a problem, since lots of tables and lots of columns can be very overwhelming.
Now let’s apply this to Chartio. As a best practice, we have two versions of our application database - one that’s a full schema that has all our tables, columns, etc., and a second version where we’ve already aggregated those tables together. In the latter version, we’ve reduced the number of columns for certain tables because they’re not needed by most business users. Thinking ahead and aggregating your databases like this can optimize for performance in many instances.
Here’s an example. I have a couple of friends who work at Facebook and Facebook has a ton of data. Instead of trying to look at all of that data at once, they tend to aggregate their tables - they’ll aggregate it to the last 30 days, the last 60 days, or the last 3 months. This optimizes the query so that when users are getting their data it’s very fast and very efficient.
Q: What are some naming convention best practices when it comes to data and data tables?
A: If aggregation or consolidations is like washing your data clean with water, then naming convention would be your soap.
Washing your hands is only truly effective if you use some kind of disinfectant like soap. Again, across my two years working as a customer success engineer, I’ve worked with a lot of companies and I’ve seen a lot of schemas, data tables, and column names - and one thing I’ve realized is that people have bad habits. To break those bad habits and have good naming conventions, I recommend that you:
- Don’t append table names - it can get really confusing.
- Be very specific and don’t use ambiguous names. People love using acronyms, shorthand, abbreviations. That’s great if you know it, but other people might not and you have to keep that in mind.
- Know your audience and communicate with them accordingly. If you’re trying to decentralize your data and have other users and other teams use it, make sure that you appropriately name your tables and your columns so that they know exactly what kind of data they’ll be using.
- Implement a data dictionary. This is a central repository where you get to keep all of your relevant data information - add notes, give explanations, use cases, things that people can refer to. I think that if you’re a part of the data team, you probably get a lot of requests, a lot of people asking a lot of things, and this is a great way to kind of defer that so if someone wants to quickly look something up, they can go to look at the data dictionary as a way to get that information.
Q: What are foreign key relationships and why are they important?
A: This is where I emphasize foreign key relationships. Think about a database: each one holds a collection of tables, and you can establish relationships between those tables and define how they are used with one another. That’s what I mean by establishing a table relationship.
Now, if you just want to condense a couple of tables, you can aggregate them. But you can’t do that with every single table. For more complex relationships, the best way to keep them organized and ensure that your users can actually use the data is by establishing foreign key relationships.
That said, make sure not to over-complicate these relationships. I’ve seen a lot of customers go really crazy with the foreign key relationships and it becomes a mess. It’s very hard to work with because there will be certain dependencies or certain queries will require certain relationships over another. When it gets that complex, just go back to washing your hands.
In other words, if you see a problem where you have a lot of tables and it requires very complex foreign key relationships, maybe it’s worthwhile to aggregate that and put it into its own custom tables so that users don’t have to go back and do that themselves.
Q: How do you apply these tips and practice good data habits?
A: So, you’ve set everything up, prepared your tables, aggregated where you can, and set up your foreign keys. Now it’s getting down to actually using a visualization tool, and you want to make sure you make everything as easy as possible. For any Chartio users - since we offer interactive vs SQL mode - this means trying to use the interactive mode when possible and SQL mode when necessary. This is because most users end up sharing their charts with business users who don’t know SQL, and they’re not going be able to read SQL code. However, they will be able to understand the interactive mode.
When you do use SQL, always format and structure it so it’s not just one giant blob of code that no one can read. Additionally, always leave comments so that people can break it down and understand what you’ve written.
Q: Why practice good data hygiene?
A: Good data hygiene leads to more insights across your entire organization. When you practice all of these good habits, it allows for more users and spreads knowledge across your entire organization.