Performance testing Impala and Spark on Azure Data Lake Store vs. HDFS

Microsoft’s Azure Data Lake Store (ADLS) is a highly scalable storage solution that can store trillions of files, including individual files a petabyte in size. It is also 3x cheaper than running HDFS on Azure’s Standard Storage solution. It is optimized for analytic workload performance and features enterprise-grade authentication, access control, auditing, and encryption at rest.

Naturally, ever since it was announced, we have been eager to see what it can do. Cloudera recently published an analysis of ADLS read/write performance that focused on pure throughput and scalability. We did some complementary benchmarking of popular SQL-on-Hadoop tools. In this blog post we present our findings and assess the price-performance of ADLS vs HDFS. In a future blog post, we look forward to using the same toolkit to benchmark the latest versions of Spark and Impala against S3.

For our analysis we used the Big Data Benchmark (BDB) published by UC Berkeley’s AMPLab. This benchmark measures the response time of a handful of relational queries (scans, aggregations, joins, and UDFs) across different data sizes. We find it useful for producing simple, interpretable benchmarks of various cloud features and software products. We ran 1) a scan query; 2) an aggregation query; and 3) a join query, against about 250 GB of raw data. Each query in turn has three flavors, A, B, and C, which return increasingly large result sets. For example, query 1A is a scan returning relatively little data, while query 3C is a join returning a result set measured in gigabytes.
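To make the A/B/C flavors concrete, here is a minimal sketch of how the scan and aggregation variants can be generated from one template each. The SQL is modeled on the public BDB description; the table names, column names, and cutoff values are assumptions and are not taken from our actual runs. The join query is omitted for brevity.

```python
# Illustrative templates modeled on the public Big Data Benchmark description.
# Table names, column names, and cutoff values are assumptions, not the exact
# statements used in the runs described in this post.
SCAN = "SELECT pageURL, pageRank FROM rankings WHERE pageRank > {x}"
AGGREGATION = (
    "SELECT SUBSTR(sourceIP, 1, {x}), SUM(adRevenue) "
    "FROM uservisits GROUP BY SUBSTR(sourceIP, 1, {x})"
)

# Hypothetical parameters: a looser filter (query 1) or a longer grouping
# prefix (query 2) makes the result set larger from A to C.
VARIANTS = {
    "1": (SCAN, {"A": 1000, "B": 100, "C": 10}),
    "2": (AGGREGATION, {"A": 8, "B": 10, "C": 12}),
}


def build_queries():
    """Return a dict such as {"1A": "<sql>", ...} for every query flavor."""
    queries = {}
    for number, (template, params) in VARIANTS.items():
        for flavor, x in params.items():
            queries[number + flavor] = template.format(x=x)
    return queries


for name, sql in build_queries().items():
    print(name, sql)
```

The point of the sketch is only that a single parameter controls the result size, which is what lets the same query shape probe both latency-bound and throughput-bound behavior.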

We ran the BDB queries on identical clusters and compared the performance of HDFS to that of ADLS. Each test cluster had three Hadoop worker nodes (Azure D13s), each with 8 cores, 56 GB RAM, and 512 GB of HDFS storage. All data was in the Hadoop SequenceFile format, compressed using the Snappy compression codec. There are many adjustments that could be made to this setup, but we tried to pick a good starting point and see what we got out of the box.

Our benchmarks taught us some pretty interesting lessons.

1. ADLS works well with more data.

One of the reasons people are drawn to the Impala + HDFS combination is its speed. The fastest and simplest BDB query, 1A, a scan returning a small result set, completes in a zippy 1.5 seconds on Impala + HDFS, vs 35.7 seconds on ADLS. That should not come as a surprise, as ADLS promises high throughput, not low latency. The strength of ADLS becomes apparent with query 3C, which returns a much larger result set: Impala finishes in 41.9 seconds on HDFS and 42.0 seconds on ADLS. In other words, the substantially cheaper storage solution can also be just as fast.

This presents interesting design choices. Users who require low-latency response times may need to pay a significant price premium, while users who think 42 seconds is good enough can save some money.
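The two Impala timings quoted above can be put side by side with a few lines of arithmetic. This is only a sketch over the four numbers reported in this post; the `compare` helper is a hypothetical name introduced for illustration.

```python
# Impala response times in seconds, as quoted in this post: (HDFS, ADLS).
IMPALA_SECONDS = {
    "1A": (1.5, 35.7),
    "3C": (41.9, 42.0),
}


def compare(hdfs_seconds, adls_seconds):
    """Return (relative slowdown of ADLS, absolute extra seconds on ADLS)."""
    return adls_seconds / hdfs_seconds, adls_seconds - hdfs_seconds


for query, (hdfs, adls) in IMPALA_SECONDS.items():
    slowdown, gap = compare(hdfs, adls)
    print(f"{query}: ADLS takes {slowdown:.1f}x as long ({gap:.1f} s extra)")
```

The small query pays a large relative penalty on ADLS, while the large query is at near parity, which is the pattern you would expect from a store tuned for throughput and not for latency.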

2. The more complicated (or CPU intensive) the query, the more price-performant ADLS gets.

We also saw ADLS do pretty well when the query in question was complicated and primarily CPU-bound.


The above graph shows Spark SQL performance on HDFS vs ADLS for query 3C, a join of two large tables that also returns a large result set. Interestingly, HDFS throughput (blue line) spikes to higher values than ADLS, but ADLS throughput (orange line) holds up quite respectably overall. While the HDFS query finishes faster, the difference of 13.5 minutes vs 20 minutes may not be material for a number of use cases and users. Which brings us to our next point.

3. It’s not all about speed

Perhaps the most important takeaway doesn’t come from reading a graph. ADLS satisfies a number of use cases, and if yours is one of them, then it’s possible to have data storage cleanly separated from compute. The ability to turn compute on and off is a huge potential cost saving. The low cost of ADLS makes it appealing for storing full backups that would otherwise be prohibitively expensive. And maybe most importantly, storing data separately from other components can make the overall architecture more future-proof and easier to modify and upgrade.

4. ADLS is decidedly a useful tool in the Hadoop storage toolkit.

Average price-performance for ADLS BDB queries for Impala: 3.2X compared to HDFS

Average price-performance for ADLS BDB queries for Spark: 3.2X compared to HDFS 

We were a bit surprised by this price-performance calculation. Impala and Spark are quite different. The former does its own resource management and is written in C++; the latter runs under YARN in our setup and is written in Scala. Yet when we run this relative price-performance calculation, we see exactly a 3.2x improvement for both. Go figure.
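For readers who want to run a similar comparison on their own workloads, here is one plausible way to define relative price-performance: the ratio of cost per query on the baseline to cost per query on the candidate. This definition is an assumption for illustration, not necessarily the exact formula behind the 3.2x figures, and the input numbers below are made up.

```python
def cost_per_query(hourly_cost, runtime_seconds):
    """Cost of one query run, given an hourly price and a runtime."""
    return hourly_cost * runtime_seconds / 3600.0


def price_performance(baseline, candidate):
    """Ratio of baseline to candidate cost per query.

    Both arguments are (hourly_cost, runtime_seconds) tuples. A value above
    1.0 means the candidate delivers the same query for less money.
    """
    return cost_per_query(*baseline) / cost_per_query(*candidate)


# Hypothetical inputs: the baseline costs 3 units per hour and answers in
# 100 s; the candidate costs 1 unit per hour and answers in 120 s.
hdfs_like = (3.0, 100.0)
adls_like = (1.0, 120.0)

print(f"relative price-performance: {price_performance(hdfs_like, adls_like):.2f}x")
```

Under this definition, a candidate that is somewhat slower can still come out well ahead if its price advantage is larger than its slowdown.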

There’s definitely a lot more to be done in this area. These results are just a promising first pass at benchmarking a technology that’s been generally available for just under a year. There are a number of variables that could be tweaked to realize better performance: vertical and horizontal scaling, the compression codec used, Spark and YARN configurations, and multi-stream testing. But we are probably most excited to start analyzing our customers’ workloads to see how they could benefit from a hybrid HDFS/ADLS deployment architecture that leverages the benefits of each.
