
Hadoop Performance, Sizing and Scaling

Series: The Hidden Challenges of Putting Hadoop and Spark in Production

A recent Gartner survey estimates that only 14% of Hadoop deployments are in production. We’re not surprised. We’ve been in many conversations with companies that have been piloting Hadoop to bolster their analytic capabilities beyond relational databases. Common challenges fall into a few important categories, which we explore in this blog series:

Hadoop Performance, Sizing and Scaling

Closely related to infrastructure decisions for Hadoop and Spark is the question of how to size your cluster. Since Hadoop and Spark are newer technologies, often with new workloads, it’s frequently a big unknown. Most organizations don’t have a lot of history (if any) to estimate cluster characteristics – and it’s not just a matter of data volume estimates. We have seen cases where the initial cluster is grossly undersized due to rapid adoption (arguably a good thing) and other deployments where the cluster is drastically underutilized, often due to operational or business issues. It’s a tough equation.
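To make that equation a little more concrete, here is a rough Python sketch of a first-pass node estimate that looks at both storage and compute rather than data volume alone. The function name, its parameters and every default except the HDFS replication factor of 3 are illustrative assumptions, not recommendations; plug in numbers from your own workload.

```python
import math


def estimate_node_count(
    data_tb: float,
    replication_factor: int = 3,     # HDFS default replication
    annual_growth: float = 0.5,      # assumption: data grows 50% per year
    years: int = 1,                  # assumption: planning horizon
    headroom: float = 0.25,          # assumption: share of disk kept free for temp data
    disk_tb_per_node: float = 24.0,  # assumption: usable disk per node
    peak_concurrent_cores: int = 0,  # assumption: cores the workload needs at peak
    cores_per_node: int = 16,        # assumption: cores per node
) -> int:
    """Return a first-guess node count driven by storage or compute, whichever is larger."""
    # Project the logical data volume forward over the planning horizon.
    projected_tb = data_tb * (1 + annual_growth) ** years
    # Account for replication and for the disk space that must stay free.
    raw_tb = projected_tb * replication_factor / (1 - headroom)
    storage_nodes = math.ceil(raw_tb / disk_tb_per_node)
    # Compute demand can dominate even when the data volume is small.
    compute_nodes = math.ceil(peak_concurrent_cores / cores_per_node)
    return max(storage_nodes, compute_nodes, 1)


storage_bound = estimate_node_count(100)
compute_bound = estimate_node_count(100, peak_concurrent_cores=800)
```

With these made-up inputs the same 100 TB of data yields a larger cluster once peak compute demand is included, which is why data volume alone is a poor predictor. Treat the result as a starting point for experimentation, not an answer.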

Contrary to popular belief, managing Hadoop performance doesn’t automatically get easier if you deploy in the cloud vs. on-premises. In fact, the stakes can be even higher with metered services, where a poor performance configuration can translate to a big bill. There’s no easy advice here. It’s critical to do your research, consider many factors (which components of the Hadoop ecosystem, storage, compute, growth, adoption, etc.) and be ready to experiment.

With the cluster in place, the next step is configuring Hadoop and optimizing it for production. Configuring the cluster has everything to do with the workload, which drives how you configure HDFS, or whether you use HDFS at all. After storage, you then keep stepping up the stack and configure YARN, Spark, Impala and other components. However, this is not a one-time process. The Hadoop ecosystem is rapidly expanding and evolving, so configuration optimization is an ongoing process.

Let’s use the Cloudera Impala example that I used in the last blog, which described how Impala rapidly evolved to exploit more and more cores on a single machine. It would have been totally feasible 24 months ago to configure a cluster that had lots of nodes with very few cores. Today, however, you can get better price-performance using fewer nodes that have more cores.
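One way to compare node shapes like these is to run the same benchmark on each candidate cluster and compare the cost per run. The Python sketch below assumes you have measured the runtimes yourself; the class, the prices and the runtimes are all hypothetical placeholders, not vendor figures or Impala measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterOption:
    name: str
    nodes: int
    cores_per_node: int
    hourly_rate_per_node: float     # hypothetical metered price
    benchmark_runtime_hours: float  # measured by running your own workload

    @property
    def cost_per_run(self) -> float:
        # Metered cost of one benchmark run on this cluster shape.
        return self.nodes * self.hourly_rate_per_node * self.benchmark_runtime_hours


def cheapest_per_run(options: List[ClusterOption]) -> ClusterOption:
    """Pick the option with the lowest cost for the same benchmark."""
    return min(options, key=lambda option: option.cost_per_run)


# Both shapes total 160 cores; only the packaging differs (made-up numbers).
many_small = ClusterOption("many small nodes", 40, 4, 0.25, 1.0)
few_large = ClusterOption("few large nodes", 10, 16, 1.0, 0.5)
best = cheapest_per_run([many_small, few_large])
```

The comparison only means something if the runtimes come from your real queries, since an engine that scales well across cores on one machine will shift the result as it evolves. Rerunning the benchmark after major upgrades keeps the comparison honest.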

Another example is the transition from MapReduce (M/R) to Spark, which usually entails having nodes that have a lot more memory than the optimal node for an M/R workload. Trying to do this on-premises is close to impossible, unless you have an unlimited budget and can depreciate hardware in a year or less.

The Cloud Helps (But Is Not a Panacea)

Leveraging the public cloud (AWS, Azure, etc.) can mitigate these challenges. There is no long-term commitment to either the server type that you choose for your data nodes or the number of nodes that you start with. The cloud lets you take advantage of advances by the chip manufacturers, as well as advances by the cloud platforms, with minimal lag time. And with the cloud, there are no sunk costs due to hardware not being fully depreciated.

Sizing Hadoop clusters in the cloud can take a totally different approach from what would traditionally be done on-premises. You will still need some requirements gathering and planning to bring up an initial Hadoop platform that meets current analytic needs. But there’s room for flexibility, so it’s almost more important to have processes that monitor the platform and constantly evolve it to provide the best price-performance.
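As a minimal illustration of such a monitoring process, the Python sketch below turns recent utilization samples into a suggested node count. The thresholds, the target utilization and the minimum node count are assumptions chosen for the example; a real process would also look at memory, storage and queue wait times before resizing anything.

```python
import math
from statistics import mean
from typing import Sequence


def recommend_nodes(
    utilization_samples: Sequence[float],  # fractions between 0.0 and 1.0
    current_nodes: int,
    low: float = 0.25,     # assumption: below this the cluster is underutilized
    high: float = 0.75,    # assumption: above this the cluster is undersized
    target: float = 0.5,   # assumption: utilization to aim for after resizing
    min_nodes: int = 3,    # assumption: floor, e.g. to keep enough storage nodes
) -> int:
    """Suggest a node count that moves average utilization toward the target."""
    if not utilization_samples:
        return current_nodes  # no data, no change
    average = mean(utilization_samples)
    if low <= average <= high:
        return current_nodes  # inside the acceptable band, leave it alone
    # Scale the cluster so the same load would land at the target utilization.
    return max(min_nodes, math.ceil(current_nodes * average / target))


grow = recommend_nodes([0.875, 0.875], current_nodes=8)
shrink = recommend_nodes([0.125, 0.125], current_nodes=16)
```

The band between `low` and `high` keeps the recommendation from changing on every small fluctuation, which matters when each resize has its own cost.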

The cloud provides many great tools to dynamically size and configure platforms; however, it also raises some questions. How do I secure this environment that is no longer behind my firewall? How do I monitor and manage this environment to the same level as my existing on-premises systems?

We will explore the process of putting these distributed technologies into production in the next blog.

Original artwork by Carlos Joaquin in collaboration with Cazena.
Apache®, Apache Hadoop, Hadoop®, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
