A data lake is a generalized data processing platform that supports a wider variety of data and analytical processing than standard SQL data warehouses. For over a decade, enterprises have invested heavily in building on-premises data lakes. Over the past few years, however, a new trend has emerged: the Cloud Data Lake.
The Cloud Data Lake is a next-generation Data Lake hosted in the cloud that delivers more attractive price/performance, a variety of analytical engines, and best-of-breed tooling, all on virtually unlimited Cloud storage.
Residing in public cloud environments such as AWS and Microsoft Azure, the Cloud Data Lake is more than just storage. It is a complete analytical environment that supports a variety of analytical tools and languages (SQL, R, Python, Java, Scala, etc.) and a variety of workloads, from traditional analytics and BI through streaming event/IoT processing to advanced Machine Learning and AI processing.
Compared to its on-premises counterparts, the Cloud Data Lake brings a set of distinctly different advantages across Storage, Compute, and Cost. With these advantages, however, come new challenges around the skills required and the operational complexity of Cloud Data Lakes. This post explores the advantages and challenges of this new analytical platform.
Storage
One of the big challenges with data lake deployments is the growth of data. Data is being created at astonishing rates and piles up quickly.
With on-premises data lakes, you must regularly monitor data growth. As your data approaches your capacity, you must either add disk drives to existing hardware or purchase more compute and storage to expand your cluster, even though you might not need the extra compute power.
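The capacity monitoring described above can be reduced to a simple projection. The following is a minimal sketch, assuming data grows by a fixed percentage each month. The function names, the 500 TB cluster, the 5% growth rate, and the 3-month procurement lead time are all illustrative assumptions, not figures from a real deployment.

```python
def months_until_full(capacity_tb, used_tb, monthly_growth_rate):
    """Return how many whole months of growth still fit in the cluster.

    Assumes a fixed percentage of growth each month (a simplification).
    Returns None when there is no growth, since the cluster never fills.
    """
    if used_tb >= capacity_tb:
        return 0
    if monthly_growth_rate <= 0:
        return None
    months = 0
    # Advance one month at a time while the projected data still fits.
    while used_tb * (1 + monthly_growth_rate) <= capacity_tb:
        used_tb *= 1 + monthly_growth_rate
        months += 1
    return months


def should_order_hardware(months_left, lead_time_months):
    """Order once the remaining runway is no longer than the lead time."""
    return months_left is not None and months_left <= lead_time_months


# Hypothetical cluster: 500 TB total, 300 TB used, growing 5% per month.
runway = months_until_full(capacity_tb=500, used_tb=300, monthly_growth_rate=0.05)
print(runway)                                             # 10
print(should_order_hardware(runway, lead_time_months=3))  # False
```

A real projection would use measured growth rather than a constant rate, but even this rough version shows why the analysis has to be repeated regularly: the runway shrinks every month.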
With Cloud Data Lakes, storage is essentially infinite when using a cloud vendor’s low-cost, serverless object storage, such as S3 on AWS or Azure Data Lake Storage (ADLS) on Azure. These storage layers offer many 9s of durability, high availability, and options for geo-replication. With effectively limitless capacity, there is no need to plan capacity for data growth.
Compute
The separation of storage from compute is significant as it increases the flexibility and capacity of your Cloud Data Lake.
In a Cloud Data Lake, compute (the analytic engines, such as Spark, Hive, Presto, or Impala) can be both on-demand and elastic.
Analytics engines can be created on-demand for specific purposes. Different teams can spin up different compute engines for different workloads, such as ETL, ML, and ad hoc analytics, on the same shared data. It is not uncommon to spin up a Spark cluster for a few hours to process a pipeline of data, then shut down the compute resources when the pipeline has completed.
The result is that you can provision infrastructure that is optimized for individual workloads, and since the infrastructure can be transient, infrastructure cost is often significantly reduced.
Analytics engines can also be configured to increase compute on-demand, adding the power of compute elasticity to your data lake. Often this lets you meet performance SLAs at a lower infrastructure cost over the long run.
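To make compute elasticity concrete, here is a minimal sketch of a proportional scaling rule: resize the cluster so that utilization lands near a target. This is a generic illustration, not the scaling policy of any particular engine or cloud service. The function name, the 60% target, and the node limits are assumptions.

```python
def desired_nodes(current_nodes, utilization_pct, target_pct=60,
                  min_nodes=2, max_nodes=20):
    """Proportional scaling rule: size the cluster so load lands near the target.

    Integer percentages keep the ceiling division exact.
    """
    # Ceiling of current_nodes * utilization_pct / target_pct, in integer math.
    needed = -(-(current_nodes * utilization_pct) // target_pct)
    # Clamp to the configured bounds.
    return max(min_nodes, min(max_nodes, needed))


print(desired_nodes(4, 90))                 # 6  (scale out under load)
print(desired_nodes(10, 20))                # 4  (scale in when idle)
print(desired_nodes(10, 95, max_nodes=12))  # 12 (capped at the maximum)
```

The scale-in case is where the cost savings come from: a cluster that shrinks when idle stops accruing charges for the nodes it releases.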
These scenarios are practically impossible with on-premises clusters. To increase compute in your data lake, you must add hardware. Unless you have it lying around, this means procuring the hardware and installing it, a process that often takes weeks to months. To avoid or at least reduce this delay, you must perform regular compute capacity analysis and projections to stay ahead of demand.
Cost
With a Cloud Data Lake, you pay only for the compute that you use. If you are not using it, you can easily shut it down and avoid the wasted expense.
For on-premises data lakes, once you spend the money on new hardware, you own it. If it goes unused, it is still a capitalized expense, one that you may be stuck with for 3-5 years. This means that even if new options better fit your workloads, you can adopt them only during a hardware refresh.
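The trade-off between a capitalized purchase and pay-per-use compute can be sketched as a break-even calculation. This is a rough illustration only: the hardware price, the 3-year lifetime, and the $20 hourly rate are hypothetical, and the model ignores power, staffing, licensing, and storage costs.

```python
def amortized_monthly_cost(hardware_cost, lifetime_years):
    """Spread a capitalized hardware purchase evenly over its lifetime."""
    return hardware_cost / (lifetime_years * 12)


def on_demand_monthly_cost(hourly_rate, hours_used):
    """Pay-per-use: cost accrues only while the cluster is running."""
    return hourly_rate * hours_used


def break_even_hours(hardware_cost, lifetime_years, hourly_rate):
    """Monthly hours of use above which owned hardware becomes cheaper."""
    return amortized_monthly_cost(hardware_cost, lifetime_years) / hourly_rate


# Hypothetical figures: $180,000 of hardware kept for 3 years versus
# an equivalent on-demand cluster at $20 per hour.
print(amortized_monthly_cost(180_000, 3))  # 5000.0
print(on_demand_monthly_cost(20, 90))      # 1800
print(break_even_hours(180_000, 3, 20))    # 250.0
```

Under these assumptions, a pipeline that runs 3 hours a day (about 90 hours a month) costs well under the amortized hardware figure, while a cluster that is busy around the clock would exceed it. The owned hardware costs the same either way.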
Software licensing costs are similar. With on-premises data lakes, you must buy software licenses and software support contracts, and if you find you are not using the software, you usually can’t get your money back. With Cloud Data Lakes, software and services are usually billed hourly, so if you are not using a service, you don’t have to pay for it.
Cloud Data Lake Challenges
Enterprises should evaluate Cloud Data Lakes based on the architectural advantages described above. However, Cloud Data Lakes also present new challenges around complexity and the operational skills they require.
Common challenges for Cloud Data Lakes include long deployment cycles for production, integration issues with on-premises applications and users, security and compliance, data governance, and management of ongoing costs.
Take security and compliance as an example. You must put serious thought into securing your data, especially if you plan to store sensitive data in the cloud data lake. As with an on-premises solution, you should encrypt data both in motion and at rest. Also be sure not to expose your data and services to the internet: assign no public IP addresses, and fully audit access to data and services.
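The controls listed above (encryption in motion and at rest, no public IP addresses, audited access) lend themselves to an automated check. The sketch below is purely illustrative: the `LakeResource` class and its fields are hypothetical and do not correspond to any cloud provider's API. In practice the values would come from your provider's configuration or inventory tooling.

```python
from dataclasses import dataclass


@dataclass
class LakeResource:
    """Hypothetical description of one data lake component."""
    name: str
    encrypted_at_rest: bool
    encrypted_in_transit: bool
    has_public_ip: bool
    access_logging_enabled: bool


def audit(resource):
    """Return a list of findings; an empty list means all checks passed."""
    findings = []
    if not resource.encrypted_at_rest:
        findings.append("data is not encrypted at rest")
    if not resource.encrypted_in_transit:
        findings.append("data is not encrypted in motion")
    if resource.has_public_ip:
        findings.append("resource is reachable via a public IP address")
    if not resource.access_logging_enabled:
        findings.append("access is not audited")
    return findings


bucket = LakeResource("raw-zone", True, True, False, True)
notebook = LakeResource("adhoc-notebook", True, False, True, True)

print(audit(bucket))    # []
print(audit(notebook))
```

Running a check like this on a schedule, rather than once at provisioning time, is one small piece of the ongoing security work a Cloud Data Lake requires.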
A number of cloud-native or third-party security services and controls must be deployed, integrated, and managed for your specific cloud data lake. And once the Cloud Data Lake is provisioned, the work doesn’t stop there. You still need ongoing SecOps to fully secure your data and to detect and protect against the ever-present threats to the system.
Enterprises must be aware of these issues and work to address them. Many enterprises will need to augment their teams with new skills and expertise around the new Cloud Data Lake stack. Partners with expertise can help with skills. Also, new SaaS Cloud Data Lake offerings are emerging that accelerate deployments with minimal operational complexity.
Conclusion
Cloud Data Lakes deliver significant flexibility at potentially significant cost savings versus traditional on-premises data lakes. They combine low-cost, limitless storage capacity with on-demand, flexible compute, where you pay only for the compute you use. Cloud Data Lakes are the next-generation enterprise data platform for all analytical workloads, including BI, ML, and data engineering. As enterprises evaluate Cloud Data Lakes, they must work to address their deployment and operational complexity.