Enterprise data and analytics teams are sometimes confused about the difference between data warehouses vs. data lakes. They struggle to evaluate their relative merits and demerits to figure out what is better suited for their organization. This blog is intended to clarify this confusion between data warehouses vs. data lakes.
The reality is that data warehouses and data lakes are complementary to each other and best suited to solve different problems. A data-driven organization needs both — and the cloud offers new, cost-effective architectures. Taken together, cloud data lakes and data warehouses can co-exist and help a wide variety of end-users get the most value out of data and analytics.
What is a Data Warehouse?
A data warehouse is a database for analytics that has structured data with a predominantly relational processing engine. In a data warehouse, the data is organized in terms of tables and columns. Data warehouses are generally categorized as schema-on-write, meaning the schema is already designed and implemented and the writes to a data warehouse need to adhere to this schema. Since the data warehouse engine is largely relational, SQL is the lingua franca.
There are some data warehouse products that market the functionality to handle semi-structured data like JSON with SQL extensions. These are attempts to provide a schema-on-read type of functionality in the data warehouse. But they incur the strict ACID transaction overhead of the data warehouse, which many non-SQL applications do not need. Such applications can be naturally supported by schema-on-read with less strict transaction semantics and superior performance.
Data warehouses have been around for decades. Making schema changes driven by business needs is often a time-consuming process that involves design and landing the data before analysis can take place. A data warehouse assumes that the raw data is cleaned and structured to suit the questions the business applications need to answer.
Understanding Data Warehouse UDXs
While standard SQL provides a set of features to do business analytics, more advanced analysis can be done in a data warehouse relational engines using what are called user-defined functions (UDFs) and user-defined aggregates (UDAs) written by the application developers. UDFs and UDAs are sometimes referred to as user-defined extensions (UDX).
Almost all data warehouses in the market support UDXs. UDXs can be used just like other standard SQL functions and aggregates in a SQL statement. UDXs could be as simple as validating a URL to more complex ones like mathematical and statistical functions, encryption and decryption, compression and decompression.
Example: Data Warehouses for Business Intelligence
Data warehouses support analysis on historical data and primarily drive Business Intelligence (BI) applications and ad hoc and interactive reporting needs by business analysts. One example of a data warehouse is an automobile manufacturer analyzing inventory and sales by country, region, state and city, for various models they manufacture.
What is a Data Lake?
A data lake is a generalized data processing platform that supports a wider variety of data and analytical processing versus SQL data warehouses. Data lakes are categorized as schema-on-read, meaning the schema of the data is determined at the time of reading the data – essentially data as it arrived and before any cleaning. Data can be structured, semi-structured or unstructured.
Schema-on-read data lakes encompass SQL as well as a wide variety of data processing engines like Spark, Flink, NoSQL and Search to deal with the wider variety of analytics, tools and data. Data lakes support data engineering, data science, machine learning and reporting from a single unified platform.
A common misconception is that a data lake is just a data store (like AWS S3, or Azure ADLS). While early lakes had a strong focus on storage for large data sets, in reality, a data lake is now a complete analytical environment that combines data storage, data processing, and tools. Popular SQL processing engines within Data Lakes include support many new, advanced analytics as well as modern open-source SQL engines like Impala, Presto, Arrow, etc. Spark is also commonly used with its in-memory model and capability to rapidly process large data sets.
Example: Cloud Data Lakes for Advanced Analytics and Data Science
A popular use case, especially for cloud data lakes, is data science and advanced analytics. That typically involves pre-processing data using Python or R to interact with the Spark framework and then feeding the data to a data science application for predictive analytics or machine learning.
Taking the case of the automotive manufacturer, the cloud data lake can be used to land internet-of-things (IOT) sensor data from vehicles running on the road. This data can be analyzed to understand and predict the functioning of the various automotive components, potential failures and suitable actions. A cloud data lake is necessary, due to the semi-structured sensor data and the types of advanced analytics used in this case. This is a good example of a data science application running on a cloud data lake.
Data Lake vs Data Warehouse: Differences in Governance Models
Due to the broader range of use cases that data lakes support, we also see differences in the governance models in data lakes vs. data warehouses. Data warehouses typically go through strict change control process for any schema change or addition of data. This is a direct consequence of the expense of change with schema-on-write.
Most data lakes have a flexible governance model. For example, they could have a strong governance model for centrally ingested core data, with a looser model for rapid data ingestion of ad hoc datasets for data science or exploratory analysis.
Data Lakes: The Platform for All Analytics
In general, the data lakes support a wider range of data, more use cases and more advanced analytics vs a data warehouse. That combination, along with a flexible governance model, has made data lakes a popular platform for all analytics.
Next-Gen Cloud Data Lakes and Data Warehouses Can Co-Exist
With the arrival of cloud computing, the storage and the compute are separated and can be provisioned and scaled separately. This has led to a new generation of cloud data warehouses and cloud data lakes that leverage separate storage and compute to offer flexible, scalable and cost-effective analytics.
A common model is when data on cloud object store can be shared between cloud data warehouses and cloud data lakes, without the need for multiple copies or the need for ingestion/transformation. The cloud computing ecosystem has democratized data. Now, enterprises can easily deploy data warehouses and cloud data lakes in a co-existence model.
To learn more about cloud data lakes, and to try it yourself, please contact Cazena today.
Data Warehouses vs. Data Lakes