The Strata Word Cloud 2012 vs 2016: Data Lakes, Spark, Real-Time and other Trends

This week Cazena made a major announcement, un-coincidentally timed with Strata + Hadoop World. We seriously enhanced our Data Lake as a Service, which is based on Cloudera Enterprise, runs on Microsoft Azure or AWS, and includes many new features for data science. Read more here. It’s been exciting to see the momentum in the Big Data as a Service category and I loved sharing the news at Strata. Walking through the buzzy hum of expo floor conversations, I overheard the same terms over and over. It began to feel like floating through a word cloud. That gave me an idea – and with it, I uncovered some surprising trends.

I scraped the session descriptions from the 2012 schedule in Santa Clara, the first year Strata merged with Hadoop World. Then I did the same for session descriptions in 2016 in NYC. (I know, analysts, I probably already have sample problems. But this is just for fun.) Then I put the text into a word cloud generator (Tagul is cool!), did some lightweight data massaging and output the graphics. It’s not super scientific, but it’s an interesting back-of-the-napkin look at trends and passes the gut test for industry regulars.
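For anyone who wants to try the same back-of-the-napkin exercise, here is a minimal Python sketch of the counting step that sits behind a word cloud. It is not the exact process used for this post (that was a word cloud generator plus some manual cleanup). The stop-word list, the phrase mappings and the function name `term_counts` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "with", "on", "is", "how"}

# Assumed phrase mappings so multi-word terms are counted as one token
# (a stand-in for the "lightweight data massaging" step).
PHRASES = {
    "data lakes": "data_lake",
    "data lake": "data_lake",
    "real-time": "real_time",
    "real time": "real_time",
    "machine learning": "machine_learning",
}


def term_counts(text: str) -> Counter:
    """Count terms in a block of session-description text."""
    text = text.lower()
    # Replace longer phrases first so "data lakes" is handled before "data lake".
    for phrase in sorted(PHRASES, key=len, reverse=True):
        text = text.replace(phrase, PHRASES[phrase])
    words = re.findall(r"[a-z_]+", text)
    return Counter(w for w in words if w not in STOP_WORDS)


if __name__ == "__main__":
    # Made-up sample sentence, not real Strata session text.
    sample = "Building a data lake with Spark. Real-time streaming with Spark."
    for term, count in term_counts(sample).most_common(3):
        print(term, count)
```

Feeding each year's scraped descriptions through a function like this gives two frequency tables, which is essentially what a word cloud visualizes.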

What’s New? Data Lakes and Spark. Big data technology innovation moves quickly. Neither of these terms appeared in any session descriptions in 2012. The in-memory engine Spark made a huge impact this year, having only reached its 1.0 release in 2014. Now it’s often considered a must-have (and Cazena has it “as a Service,” BTW). But perhaps the biggest splash was made by the “data lake,” a Hadoop-based repository for storing raw data. The term was around in 2012, but didn’t make any Strata session descriptions. Now data lakes are all over the place, and at the peak of the hype cycle this year. Cazena also sees lots of interest, inspiring the recent updates to our cloud-based Data Lake as a Service.

What’s Hot? Cloud, Real-time, Stream, IOT, AI, Machine Learning and…Data Warehousing?! These terms ranked in 2012, but were used much more frequently in 2016. Clearly, cloud has come a long way in four years, and everything has gotten bigger and faster. The desire for “real-time,” a term that barely ranked in 2012, has driven technology innovation in streaming, especially in Kafka, a stream processing platform. Analytics have gotten much more sophisticated, with more talk about machine learning and artificial intelligence. Unsurprisingly, the Internet of Things (IOT) gets a lot more lip service. While IOT only had a couple of dedicated sessions this year, many others referenced the potential and challenge of IOT sensor data.

The biggest surprise was how many more sessions mentioned data warehousing at Strata this year than four years ago. That’s likely due to the conference expanding topically, but also to the desire to apply new technologies, like cloud and Hadoop, to aging data warehouse architectures. Cazena knows this challenge well. Many clients are augmenting aging appliances with Big Data as a Service, by migrating data science or other workloads to the cloud. That’s why Cazena’s Data Mart/Warehouse as a Service leverages technologies like Pivotal Greenplum and Amazon Redshift, runs on Microsoft Azure and AWS, supports standard BI/analytics tools and includes the Cazena Gateway for secure integration with on-premises systems. Just sayin’.

What’s Still Cool? Big Data, Hadoop, Data Science, Pipelines. The conference’s focus on Big Data and Hadoop remains clear – the relative use of these terms was consistent year over year. I was a bit surprised to find the same pattern with the terms data science and data scientist. It feels like there has been much more emphasis on this role recently. That may be because there are more robust tools emerging to help (wink, wink) – and because this conference has focused on data science since its early days. On a smaller scale, the concept of a data “pipeline” was also used about equally this year and four years ago.

What Terms Are Fading? Social, MapReduce (?!) While IOT is the big data example du jour now, social data was the big thing four years ago. There are definitely social analytics success stories, often for consumer brands, but attention seems to have shifted to other, shinier objects for now. Interestingly, “MapReduce” appeared often in 2012 Strata session descriptions and was notably absent this year. It’s clearly still very much in use, but perhaps that’s because there are more tools, services and platforms abstracting that complex technology. Like…you guessed it, Cazena’s Data Lake as a Service.
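The four buckets above (new, hot, still cool, fading) came from eyeballing the word clouds, but the same idea can be sketched in a few lines of Python by comparing each term's share of all words in the two years. This is an illustrative assumption about how one might do it, not the method used for this post. The function name `classify_terms` and the 25% `tolerance` threshold are hypothetical choices.

```python
from collections import Counter


def classify_terms(then: Counter, now: Counter, tolerance: float = 0.25) -> dict:
    """Label each term by how its relative frequency changed between two years.

    `tolerance` is an assumed cutoff: a term's share must move by more than
    this fraction to count as "hot" or "fading" rather than "steady".
    """
    total_then = sum(then.values())
    total_now = sum(now.values())
    labels = {}
    for term in set(then) | set(now):
        if then[term] == 0 and now[term] == 0:
            continue  # never actually used in either year
        share_then = then[term] / total_then if total_then else 0.0
        share_now = now[term] / total_now if total_now else 0.0
        if share_then == 0:
            labels[term] = "new"      # absent earlier, present now
        elif share_now > share_then * (1 + tolerance):
            labels[term] = "hot"      # used noticeably more
        elif share_now < share_then * (1 - tolerance):
            labels[term] = "fading"   # used noticeably less
        else:
            labels[term] = "steady"   # roughly the same relative use
    return labels
```

Comparing shares rather than raw counts matters because the two conferences had different numbers of sessions, so a bigger schedule alone would otherwise make every term look hot.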

The meta message? The tech world changes quickly. That’s why Big Data as a Service is transformational. Instead of tracking these rapid-fire changes, figuring out what’s significant, testing, training and deploying, now you can simply plug your favorite tools and datasets into a cloud service like Cazena. We bring the benefits of new technologies to you, without all the complexity. It may be early days for the Big Data as a Service term…but as you can see, things change fast around here. 
