What Is a Data Lake?

Imagine a seemingly bottomless hole that you can fill endlessly with data from any source, without having to consider the value or format of that data. That is a data lake. A data lake can store as much data as you need, from any number of sources.

In a data lake, data is stored in its original format, or after only a very basic “cleaning” step, without being transformed or integrated with other data sources. As a result, a data lake can hold a wide range of data, from totally unstructured to highly structured. Because this kind of raw object storage, such as Amazon S3, is relatively cheap, a data lake is comparatively inexpensive, and data can sit in it indefinitely. Only when you extract data from the lake do you decide what format it needs to take.
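The idea of storing data as-is and choosing a format only at extraction time can be illustrated with a small sketch. This is a minimal example under stated assumptions, not a reference implementation: a local temporary directory stands in for object storage such as Amazon S3, and the function names (land_raw, read_orders, demo), the file names, and the field names are all hypothetical.

```python
import csv
import io
import json
import tempfile
from pathlib import Path


def land_raw(lake_dir: Path, name: str, payload: bytes) -> Path:
    """Write the payload into the lake untouched; no schema is applied."""
    target = lake_dir / name
    target.write_bytes(payload)
    return target


def read_orders(lake_dir: Path) -> list:
    """Apply a schema at read time: normalize files to (order_id, amount)."""
    rows = []
    for path in sorted(lake_dir.iterdir()):
        text = path.read_text(encoding="utf-8")
        if path.suffix == ".json":
            # Hypothetical source 1: one JSON object per line.
            for line in text.splitlines():
                if not line.strip():
                    continue
                record = json.loads(line)
                rows.append(
                    {"order_id": str(record["id"]), "amount": float(record["total"])}
                )
        elif path.suffix == ".csv":
            # Hypothetical source 2: CSV with a header row.
            for record in csv.DictReader(io.StringIO(text)):
                rows.append(
                    {"order_id": record["order_id"], "amount": float(record["amount"])}
                )
        # Any other file (logs, images, ...) stays in the lake, unread for now.
    return rows


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        # Three sources, three formats, all stored exactly as received.
        land_raw(lake, "web_orders.json", b'{"id": 1, "total": "19.99"}\n{"id": 2, "total": 5}\n')
        land_raw(lake, "store_orders.csv", b"order_id,amount\nA7,12.50\n")
        land_raw(lake, "server.log", b"GET /checkout 200\n")
        return read_orders(lake)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Writing costs almost nothing because land_raw never inspects the data. All the parsing and type conversion happens in read_orders, which is why extraction from a lake takes more work than querying data that was structured before it was stored.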

This freedom makes data lakes highly adaptable and allows a broader range of analysis to be done on your data. However, because the data has been neither transformed nor integrated, that analysis takes more time than it would in a more structured data warehouse.

In the end, a data lake exists to easily store data that you might need later, whereas a data warehouse usually holds data that you already know you are going to need. With a data lake, you don't have to worry about having thrown away data that later turns out to be useful. It’s always there, whether you need it or not.

