ODSC Notebook: Trends in Data Science

Part of our job in Cazena Engineering is to track technology and trends in data science and analytics, so it was great to send a few of our engineers to ODSC East, the Open Data Science Conference in Boston, again this year. The 4,500-person turnout was significantly higher than last year’s, and the conference felt more like Strata, with a number of sponsors and attendees flying in from other cities.

It’s always nice to talk with a variety of data scientists and engineers about how Cazena’s fully-managed data lakes and cloud stacks for analytics could help them operate more efficiently. It’s also a good time to connect with peers. Data science and AI companies are natural partners for Cazena: they build tools and algorithms to analyze data; we provide a fully-functioning, end-to-end data platform.

Conferences are a great time to find out what’s new, what’s working and what’s on the horizon for analytics and data science. Several trends stood out this year compared with last.

Placing machine learning (ML) algorithms in production was a big theme, because it’s a big challenge. A number of companies are thinking about how to make ML algorithms perform at scale and run efficiently across cloud architectures. Many companies were also building workflows designed to factor in ongoing model maintenance and improvement. For example, how do you know when it’s time to completely retrain a model? Or how do you build a model that can be continuously updated as new information comes in? (Hint: TensorFlow can now talk to Kafka directly!) We enjoy helping our customers build systems that address these issues.
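
To make those two questions concrete, here is a minimal sketch in plain Python. It is not something presented at the conference, and it does not use TensorFlow or Kafka: the stream is just a Python iterable standing in for a message topic. The class names (OnlineLinearModel, RetrainMonitor), the run_stream helper, and the window and threshold values are all hypothetical, chosen only for illustration. The idea is to update a small model one observation at a time, and to flag a possible full retrain when the rolling prediction error stays high.

```python
from collections import deque


class OnlineLinearModel:
    """One-feature linear model updated one observation at a time with SGD."""

    def __init__(self, learning_rate=0.1):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.weight * x + self.bias

    def update(self, x, y):
        """Take one gradient step on squared error.

        Returns the absolute error measured before the update, so the
        caller can track how well the model was doing on fresh data.
        """
        error = self.predict(x) - y
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error
        return abs(error)


class RetrainMonitor:
    """Flags a possible full retrain when the rolling mean error is too high."""

    def __init__(self, window=20, threshold=0.5):
        # Illustrative defaults; real values depend on the data and the model.
        self.errors = deque(maxlen=window)
        self.threshold = threshold

    def record(self, abs_error):
        self.errors.append(abs_error)

    def should_retrain(self):
        # Wait for a full window so start-up noise does not trigger a retrain.
        if len(self.errors) < self.errors.maxlen:
            return False
        return sum(self.errors) / len(self.errors) > self.threshold


def run_stream(stream, model, monitor):
    """Consume (x, y) pairs; return the indices where a retrain was flagged."""
    flagged = []
    for index, (x, y) in enumerate(stream):
        monitor.record(model.update(x, y))
        if monitor.should_retrain():
            flagged.append(index)
    return flagged
```

In a real pipeline the monitored metric, the window size, and what happens when a retrain is flagged would all depend on the model and on how quickly the underlying data drifts.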

Tooling is reaching a reasonable level of maturity. Apache Spark provides support for a number of ML algorithms; however, the popularity of TensorFlow was impressive. At least four vendors I talked to were building platforms to help clients build full end-to-end deep learning pipelines, from exploratory analysis to deploying models in production, with TensorFlow, Docker, and Kubernetes. The TensorFlow project has taken the philosophy of keeping full end-to-end workflows in TensorFlow, which means duplicating some of the ETL and data processing functionality that we are more used to seeing in Spark.

ML and deep learning applications are shifting away from multimedia such as photos and videos toward more industrial applications. We heard case studies about analyzing data from jet engines and turbines, and about using phone accelerometer and GPS data to detect car crashes. We’ve recently observed a trend of our customers using the Cazena platform to ingest and analyze IoT or application data using some combination of Spark and Impala. It’s nice to see extremely useful algorithms applied to industrial applications where they can have a major impact.

Growth in use cases and algorithms demands flexibility. As companies push their analytic boundaries, new requirements tend to emerge. For example, a number of data scientists mentioned that the Convolutional Neural Networks (CNNs) often used in image classification problems didn’t always provide the best results, and a number of labs and companies were using Recurrent Neural Networks (RNNs) or hybrid CNN + RNN architectures. That’s a conversation that wasn’t relevant to very many people a few years ago, but now it’s a topic of interest. In an industry that moves quickly, flexibility is paramount.

We are glad to see continued strong growth and evolution in the data science industry. As usual, we came away from the conference inspired by future possibilities in the space, and excited to move our services forward with this strong ecosystem of partners and customers.
