Strata NYC 2018 Notebook: Tech Trends & Takeaways

It’s always great to attend the Strata conference and see what people are working on across the industry. This O’Reilly and Cloudera-hosted event remains the premier big data conference, though its emphasis has changed over the years. Cazena attended in force this year, with colleagues, partners, customers and friends. We debuted several solution demos, cheered as one of our customers won a notable award and attended sessions. We also had a cool booth on the expo floor, keeping attendees headache-free with our handy giveaways.

Here are my takeaways on the fun and interesting technical concepts and trends. There was a stronger focus on using GPUs to train machine learning (ML) models. We heard changing views on cloud vs. edge computing. Overall, we observed more emphasis on “self-service,” and plenty of evidence that SQL continues to play a major role in big data workflows. As always, there was that one hot technology that everyone kept asking about. But we’ll get to that.

Avoid Complicated Architectures

One of the first things I noticed during the keynotes was that some people were getting jaded at the sight of overly complicated data architectures with too many components. Diagrams that would have been taken seriously and photographed two years ago were presented as examples of lengthy, costly paths to avoid. While some complexity is necessary for certain use cases, many reported that they had underestimated the hidden costs of overly complicated systems. Architecture advice tended toward streamlining: many promoted reducing the number of platforms, optimizing storage and increasing “self-service” capabilities where possible, and some suggested focusing on unified engines like Spark and relying on a single read-only master copy of data to simplify dev/test/production environments. I agreed with many of these concepts. At Cazena, these kinds of approaches have helped our enterprise customers vastly simplify their initial cloud rollouts, hybrid cloud expansions and workload migrations. We design solutions that help teams ramp up faster and save time down the line, so they won’t need to re-architect existing workflows when tools or datasets change.

Less Data is Big Data?

Another trend aimed at simplifying architectures was the concept of ingesting less data. This is a change. Three years ago, the Hadoop paradigm was to take all possible data, every little bit of it, and put it in your data lake because you never know what you’ll need down the line. That attitude has shifted. This year, more architects recommended filtering data at the edge. Some describe this as a shift to an “edge-cloud” world, in which a lot of useful processing is done on edge devices. At Cazena, we have observed more interest from companies in moderating the volume of data streamed into storage. Companies that stream raw log files into HDFS often find that capacity fills up quickly. Alternatives include storing that data in S3 or ADLS, or filtering out more “noise” at the edge.
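
As a rough illustration of edge filtering (a generic sketch, not Cazena’s implementation), the Python snippet below drops low-value log lines before anything is compressed and shipped to S3 with boto3. The log layout, the severity levels treated as noise, and the bucket and key names are all assumptions for the example.

```python
import gzip

import boto3  # AWS SDK for Python; assumes credentials are already configured

NOISE_LEVELS = {"DEBUG", "TRACE"}  # hypothetical levels we choose not to pay to store


def filter_log_lines(lines):
    """Yield only the lines worth keeping; assumes '<timestamp> <LEVEL> <message>' records."""
    for line in lines:
        parts = line.split(" ", 2)
        level = parts[1] if len(parts) > 1 else ""
        if level not in NOISE_LEVELS:
            yield line


def ship_to_s3(local_log_path, bucket, key):
    """Compress the filtered lines and upload a single object to S3."""
    with open(local_log_path) as f:
        kept = "\n".join(filter_log_lines(f.read().splitlines()))
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=gzip.compress(kept.encode("utf-8")))


# Example call (hypothetical names):
# ship_to_s3("/var/log/app/events.log", "my-raw-zone", "edge/2018-09-12/events.log.gz")
```

The same filter could just as easily write to ADLS or a Kafka topic; the point is simply that the noise never leaves the edge device.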

SQL Still Rules

More than one firm proudly stood by its SQL-based workflows and sees a path for SQL to complement ML applications. Despite the proliferation of languages, tools and methods, many see SQL as the lingua franca of data and analytics. MemSQL pitched its in-memory SQL-based storage layer as a simple way of handling ingestion and analytics. Google’s BigQuery ML product lets analysts combine SQL and ML models as a way of streamlining workflows. The SQL theme continued with Cloudera, which continues to emphasize many uses for Impala, such as handling data warehousing and business intelligence workloads. Several Cazena customers make heavy use of SQL for ETL and data preparation stages, with no superior replacement in sight.
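
To give a flavor of combining SQL and ML in one workflow, here is a hedged sketch using the google-cloud-bigquery Python client to train and then query a BigQuery ML model. The dataset, table and column names are invented for the example, and the model options shown are just one common configuration.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # assumes GCP credentials and a default project are configured

# Train a model with plain SQL. `mydataset.customers` and its columns are hypothetical.
train_sql = """
CREATE OR REPLACE MODEL `mydataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM `mydataset.customers`
"""
client.query(train_sql).result()  # blocks until the training job finishes

# Score new rows in the same SQL dialect analysts already use.
predict_sql = """
SELECT customer_id, predicted_churned
FROM ML.PREDICT(MODEL `mydataset.churn_model`,
                (SELECT customer_id, tenure_months, monthly_spend, support_tickets
                 FROM `mydataset.new_customers`))
"""
for row in client.query(predict_sql).result():
    print(row.customer_id, row.predicted_churned)
```

Nothing here requires analysts to leave SQL, which is much of the appeal.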

Serving ML Models is Important

We also saw a lot of focus on designing efficient machine learning workflows, from development to production. This theme is consistent with the data science trends we observed at ODSC East earlier this year. Another observation shared across the Strata and ODSC events is that the technology industry seems to have hired heavily on the ML research side, but has not placed the same emphasis on developing capabilities for production-grade deployments of models. This may be changing. Some projects, notably KubeFlow, are targeted at solving the ML problem end-to-end. Google, NVIDIA, and AMD all prominently promoted their GPU/TPU technologies for ML applications, and there seems to be a resurgence of startups utilizing GPUs for ML, ingest, and database applications.
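
To make that “last mile” concrete, here is a minimal serving sketch (not KubeFlow, and not any particular vendor’s approach): a trained scikit-learn model wrapped in a small Flask HTTP endpoint. The model.pkl file and the feature layout are assumptions for the example.

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.pkl")  # assumes a scikit-learn model was trained and saved earlier


@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}; shape must match the training data.
    features = request.get_json()["features"]
    return jsonify({"predictions": model.predict(features).tolist()})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Packaging an endpoint like this into a container image is exactly the kind of deployment artifact the Kubernetes discussion below is about.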

Kubernetes: the Hottest New Thing Since Spark?

There’s always one technology at Strata that seems to be on the tip of everyone’s tongue, from first-time attendees to seasoned veterans. People sometimes ask if it’s in our product before knowing what it does! Three years ago, that technology was Spark (or maybe Spark Streaming). This year, the clear winner was Kubernetes, the open-source container-orchestration system. The buzz is not without merit. Kubernetes, though relatively new, addresses many of the scalability challenges discussed at this conference, and it is supported by big players across the big data ecosystem, including AWS, Azure and Google Cloud, as well as Spark 2.3, TensorFlow and Kafka. The idea of training a model on a Kubernetes cluster running on a modestly sized laptop, then pushing that model to the cloud to run at scale with a single command, is appealing to many. Container-based deployments can also solve many versioning, dependency and scalability issues for big data/ML projects, though they can add some complexity of their own.
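
As a sketch of that “same workflow, laptop or cloud” idea, the snippet below uses the official kubernetes Python client to submit a containerized training Job to whichever cluster the current kubeconfig context points at, whether that is a local cluster on a laptop or a managed cloud cluster. The image name and training command are hypothetical.

```python
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()  # uses the active kubeconfig context: local cluster or cloud cluster

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="train-churn-model"),
    spec=client.V1JobSpec(
        backoff_limit=1,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="registry.example.com/ml/trainer:1.0",  # hypothetical image
                        command=["python", "train.py"],
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

The Job definition does not change when the target cluster does, which is a big part of why container orchestration keeps coming up in versioning and dependency conversations.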

Self-Service & the Self-Service Data Lake

During case study sessions, several enterprise leaders reiterated the need for more self-service capabilities, along with governance. This story is familiar. That’s why, along with our partners, Cazena launched a new Self-Service Cloud Data Lake at Strata. The idea is to provide a fully managed analytics environment for all workloads and tools (data prep, SQL, ML) so that analysts can ingest data and start navigating it immediately. The solution integrates a new AppCloud tool called ShareInsights, which adds self-service capabilities across the entire data pipeline, no coding skills required.

Congratulations to FairVentures!

It was especially exciting to learn that our customer FairVentures was recognized with a Cloudera Data Impact Award. FairVentures won the Cloud Success award category by sharing the impact of their Cazena-Cloudera powered Data Lake with an impressive panel of industry judges. We’re honored to be their solution of choice. The recognition means a tremendous amount to our hardworking team, and was a highlight of an exciting week! Read about the awards here on the Cloudera blog.


Overall, we’ve come home inspired by the strength of existing technologies, as well as the potential of emerging products. Strata 2018 felt like a good blend of seasoned production architects and practitioners as well as newer data scientists, analysts and application developers. Everyone is talking about how to deploy big data platforms faster and with more agility, and more teams are valuing simplicity. At Cazena, we look forward to continuing to partner with technology providers and enterprises to provide well-architected, cutting-edge big data solutions.

