
Data Lake Vs Data Warehouse – Understanding the Difference

For organizations today, data is the new currency. Because it has become such a valuable asset, every organization needs a data management strategy that lets it draw insights from its data and also keeps that data easy to access and use. People who work in data management and analytics use the terms ‘data warehouse’ and ‘data lake’ all the time. A non-technical person, however, can get lost in this sea of jargon and end up even more confused about what to do with their data. Should they store it in the warehouse, or just dive into the data lake? How should businesses decide which kind of data repository will deliver the most business value? In this blog, we take a look at data lakes and data warehouses and try to understand the difference between the two.

Today, organizations are keen to store every bit of the data they generate so that they can use it later and make informed, data-driven decisions to stay ahead of the curve. However, the meteoric rise of data didn’t happen overnight. The importance of data has been growing since the 1970s, when ACNielsen and IRI used ‘dimensional data marts’ to increase retail sales. It was then that the foundation of the modern-day data warehouse was laid, and Bill Inmon, the Father of Data Warehousing, coined the term ‘data warehouse’. A typical data warehouse is made up of a number of layers and needs metadata, data governance, and data quality processes in place so that IT staff can record each piece of data along with its source and format, and decide how it should be used. The data warehouse holds data in well-structured databases that are ready for analytics. To put it simply, a data warehouse mimics a real storage warehouse, where boxes are labeled neatly and stored in rows on shelves, and forklifts can move them to specific locations. Data warehouses are most commonly used for online analytical processing (OLAP), and they are typically fed by online transaction processing (OLTP) systems rather than used for transaction processing themselves.

Over time, the volume of data began to increase because of new data sources like social media and mobile devices. Along with this, we also witnessed the emergence of new data types that came from social media, transaction logs, user reviews, etc. This data, though immensely valuable, is largely semi-structured or unstructured. Until then, organizations were used to structured data that could be stored in a standard format in data warehouses. How could they store this unstructured data, which demanded greater architectural flexibility? This need gave birth to the ‘data lake’, a term coined by Pentaho CTO James Dixon in 2010, which allows any type and amount of data to be stored and made available for use on demand. Dixon defines a data lake as follows: “If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

While it may seem that a data lake is nothing more than version 2.0 of the data warehouse, there are some major differences between the two. The commonalities between the two end at both being data storage repositories. The differences can be listed as follows:

Data Type

Data warehouses essentially store transaction system data and quantitative metrics. The data stored in a data warehouse is therefore structured and modeled. A data lake, on the other hand, is not so picky. It lets users store all kinds of data, structured or unstructured, irrespective of volume, variety, or form, whether historical or streaming in real time. This gives organizations the flexibility to use the data as it is being funneled in or to use it later. Because of this, enterprises can collect as much data as they want without first having to model or structure it.

Data Processing

When data is loaded into a data warehouse, it first has to be modeled and given a shape and structure, an approach called ‘schema-on-write’. Data lakes, on the other hand, are ‘schema-on-read’, which means that data is transformed into a particular shape and structure only when users select the data they want to use. Since data lakes do not enforce a structure at load time, any amount of data from any data source can be dumped into them and processed as required.
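To make the difference concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular product works: the sample events, the `sales` table, and the function names are all hypothetical, an in-memory SQLite database stands in for the warehouse, and a plain list of raw JSON strings stands in for the lake.

```python
import json
import sqlite3

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    '{"customer": "ann", "amount": 12.5, "channel": "web"}',
    '{"customer": "bob", "amount": "7", "review": "great service"}',
    '{"customer": "cat", "clicks": 3}',
]


def load_warehouse(raw_events):
    """Schema-on-write: shape and validate each record before storing it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rejected = []
    for line in raw_events:
        record = json.loads(line)
        try:
            row = (str(record["customer"]), float(record["amount"]))
        except (KeyError, ValueError):
            rejected.append(record)  # record does not fit the fixed schema
            continue
        conn.execute("INSERT INTO sales VALUES (?, ?)", row)
    return conn, rejected


def load_lake(raw_events):
    """Schema-on-read, write side: keep every record exactly as received."""
    return list(raw_events)


def read_sales_from_lake(lake):
    """Schema-on-read, read side: apply a shape only when querying."""
    for line in lake:
        record = json.loads(line)
        if "amount" in record:
            yield record["customer"], float(record["amount"])


warehouse, rejected = load_warehouse(RAW_EVENTS)
lake = load_lake(RAW_EVENTS)

warehouse_total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
lake_total = sum(amount for _, amount in read_sales_from_lake(lake))
```

In this sketch both stores give the same sales total, but they differ in what they keep. The warehouse turns away the record that has no amount and drops fields such as `review` and `channel` at load time, while the lake still holds all three raw records, so a question nobody anticipated (say, about clicks or reviews) can be answered later by writing a new reader.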

Data Retention

Quite simply, data lakes store all the data while data warehouses do not. Because a data warehouse is a highly structured repository, businesses building one have to decide what kind of data they want to store and how they want to store it, what questions they want the data to answer, and which data sources they want to access. It is essential to ask these questions, since the cost of storing data in a data warehouse is much higher than in a data lake. A data lake, on the other hand, is more economical because it uses open source technologies such as Hadoop, which come with free licensing and community support and can be installed on low-cost hardware.

Data Agility

The unstructured nature of the data lake also gives data scientists the agility to play around with the data by configuring and re-configuring models, queries, and so on. Data warehouses, however, are not as well suited to rapid change: their configuration is fixed, and changing their structure can be a lengthy process.

Data Security

Data warehouses have been in existence for decades and hence have a more mature security structure than data lakes. Having said that, given the growth and increased adoption of big data technologies in the enterprise, significant efforts are being made to make data lakes more secure. It is only a matter of time before a data lake becomes as secure as a data warehouse, if not more so.

Users

While the data warehouse was built so that everyone could leverage the benefits of business intelligence and analytics, it has mostly been used by business professionals who want to access the data for better reporting and faster decision making. The data lake, on the other hand, has managed to attract a more cosmopolitan crowd and allows its users to dive further into the data and analyze it in greater depth. At this point in their maturity, data lakes are better suited to data scientists and others who do similar exploratory work.

Given that the volume of data and the number of new data sources are increasing, organizations will probably have to use the data warehouse and the data lake together in a hybrid architecture. Without a data warehouse, decision makers could be making decisions based on inaccurate data. However, traditional data warehouses were not designed to handle the data deluge from a plethora of data sources. Organizations of the future, thus, will have to implement what Gartner calls a ‘logical data warehouse’, which, amongst other technologies, also includes a data lake to derive greater value.
