data lake

Categories: Storage

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
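The flat-architecture model above can be sketched in a few lines of Python. This is a hypothetical illustration, not a real data lake API: `DataLake`, `ingest`, and `query` are invented names, and the store is an in-memory dictionary standing in for object storage.

```python
import uuid

class DataLake:
    """Toy sketch of a flat data lake: no folder hierarchy, just
    elements keyed by a unique identifier and tagged with metadata."""

    def __init__(self):
        self._store = {}  # element_id -> (raw_data, tags)

    def ingest(self, raw_data, tags):
        """Store the data as-is; no schema is imposed at write time."""
        element_id = str(uuid.uuid4())
        self._store[element_id] = (raw_data, set(tags))
        return element_id

    def query(self, *tags):
        """Return the smaller subset of elements whose metadata
        matches the business question's tags."""
        wanted = set(tags)
        return [data for data, t in self._store.values() if wanted <= t]

lake = DataLake()
lake.ingest(b'{"order": 17, "total": 49.90}', ["sales", "json"])
lake.ingest(b"\x89PNG...", ["marketing", "image"])
print(lake.query("sales"))  # -> [b'{"order": 17, "total": 49.90}']
```

Note that a query touches only tags, never the raw payloads, which is why heterogeneous formats (JSON, images, logs) can coexist in the same flat store.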

The term data lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization’s data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides, on the cluster’s commodity nodes.

Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried.
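Deferring the schema until the data is queried is often called schema-on-read. A minimal sketch, assuming raw records kept as untyped JSON text (the field names and `read_schema` below are invented for illustration):

```python
import json

# Raw records are stored exactly as they arrived -- no types,
# no validation at write time.
raw_records = [
    '{"sku": "A1", "qty": "3", "price": "9.50"}',
    '{"sku": "B2", "qty": "7", "price": "4.25"}',
]

# The schema exists only here, at read time, and can differ per query.
read_schema = {"sku": str, "qty": int, "price": float}

def apply_schema(line, schema):
    """Parse a raw record and cast each field per the read-time schema."""
    parsed = json.loads(line)
    return {field: cast(parsed[field]) for field, cast in schema.items()}

rows = [apply_schema(r, read_schema) for r in raw_records]
total_units = sum(row["qty"] for row in rows)
print(total_units)  # -> 10
```

A different question could apply a different schema to the same raw records, which is the flexibility that distinguishes this approach from schema-on-write systems such as a data warehouse.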

Data lake vs. data warehouse

Data lakes and data warehouses are both used for storing big data, but each approach has its own uses. Typically, a data warehouse is a relational database housed on an enterprise mainframe server or in the cloud. The data stored in a warehouse is extracted from various online transaction processing (OLTP) applications to support business analytics (BA) queries and data marts for specific internal business groups, such as sales or inventory teams.

Data warehouses are useful when a massive amount of data from operational systems needs to be readily available for analysis. Because the data in a lake is often uncurated and can originate from sources outside the company’s operational systems, lakes are not a good fit for the average business analytics user.