Cazena’s Data Science Sandbox Test Drive has spurred many conversations in the last few weeks, at the Strata + Hadoop event in San Jose, at meetups, and at other events. There is an interesting common thread among the leaders of data science and advanced analytics groups: all of them are focused on making their teams as productive as possible. The people on these teams are notoriously hard to find, so, naturally, team leaders want to ensure that these scarce, highly skilled workers have everything they need.
Here are the common challenges I’ve heard about recently that are impacting data science teams’ ability to collaborate efficiently and productively:
1) Working (too) locally
One of the things these leaders point to as a major drag on efficiency is members of their teams working locally on laptops or desktops. Many data scientists are quite adept at quickly creating complex models by extracting data from various enterprise and non-enterprise systems. My initial instinct was that the “desktop constraint” discussion would lead into a conversation about the challenges of volume and scale with limited local compute and storage resources – which it did – but there’s another big issue that’s second on their list of challenges.
2) No Common Environment for Sharing Data, Models and Code Snippets
The leaders want their teams to be able to iterate quickly and fail fast. To do this, data science teams need to be able to easily share data, models and code snippets. However, this is where the inefficiencies often start. Since most of these analysts are working locally, on their own desktops, sharing data involves copying or emailing files between team members.
In one extreme case, we spoke to an organization that had its analytics team scattered across three countries and two continents. The files they were passing back and forth averaged only 10GB, but with transfers happening multiple times a day, the team typically spent at least an hour a day waiting for files to arrive. One analyst tried to make his day more efficient by scheduling his snack breaks for the times when he was waiting for a file from someone else on his team.
Another common scenario leaders face is multiple members of their team extracting the same data. This is inefficient at the team level and puts unneeded strain on the systems the data is extracted from.
3) Serious Library and Version Incompatibilities
Working locally also results in library and version incompatibilities. In the company example above, each analyst had their own versions of local libraries, which meant that code snippets received from other team members – who each had their own, different local libraries – had to be refactored before they would run.
A vexing example of this: an analyst sent a code snippet to another team member that referenced a library the recipient did not have. The recipient had to download the library, which was a newer version than the one originally used, and then refactor the code to work with that newer version. The cycle continued when that analyst made changes and sent an updated snippet back to the original team member, who then had to refactor it again to work with the older version. Even if you don’t know anything about refactoring, that is clearly not an efficient process.
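Even before moving to a shared environment, one low-tech mitigation is for the team to agree on exact library versions in a common manifest and check against it before exchanging code. Here is a minimal sketch using only the Python standard library; the pinned package names and versions are hypothetical placeholders, not a recommendation:

```python
from importlib import metadata

# Hypothetical team manifest: exact versions everyone agrees to use.
# Pinning with exact versions (not "latest") is what breaks the
# refactor-and-send-back cycle described above.
PINNED = {
    "numpy": "1.26.4",
    "pandas": "2.2.2",
}

def check_pins(pinned):
    """Return a list of (name, expected, found) mismatches.

    'found' is None when the package is not installed at all.
    """
    mismatches = []
    for name, expected in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches.append((name, expected, found))
    return mismatches
```

Running `check_pins(PINNED)` before sharing a snippet turns a silent version drift into an explicit, fixable error message.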
4) Slow and Inefficient Processing (Especially for High-Volume Data)
While sample sizes can always be adjusted, single-threaded processing is increasingly an issue for data science productivity. Many teams want to use distributed computing and engines like Apache Spark, but the time and skills required for implementation are too daunting. Some teams get “timeshare” slots on larger clusters, but that can lead to many manual processes, and to inefficient processing when the data can’t sit in close proximity to the processing engines.
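The partition-and-aggregate pattern that engines like Spark provide can be sketched in miniature with the Python standard library. This toy example (the function names are mine, not from any particular engine) splits work across a local process pool the same way a cluster engine splits it across nodes:

```python
from concurrent.futures import ProcessPoolExecutor

def summarize(chunk):
    # Toy per-partition computation: sum of squares over one chunk of data.
    return sum(x * x for x in chunk)

def single_threaded(chunks):
    # The baseline many teams start with: one core, one chunk at a time.
    return sum(summarize(c) for c in chunks)

def distributed(chunks, workers=4):
    # The same computation mapped over a local process pool. A cluster
    # engine applies the identical map/reduce idea across machines
    # instead of local cores.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(summarize, chunks))
```

Both functions return the same result; the point is that once work is expressed as independent chunks, scaling out becomes a change of executor rather than a rewrite.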
5) Can’t Leverage Cloud due to Security and Compliance
There’s an obvious, well-known solution to compute and storage constraints: the public cloud. But it has been off limits for many due to the security and compliance policies at their companies. Teams and their leaders haven’t been able to take advantage of the cloud, because it’s time-consuming to address the many requirements of enterprise security. That’s precisely why Cazena’s services are single-tenant and private for each enterprise team.
How Cazena helps
Cazena’s Data Science Sandbox as a Service addresses all of these pitfalls – and a few more. The immediate value many leaders see in the Cazena service is a secure, centralized cloud platform where datasets can be stored in a single place. It’s much more than a secure fileshare, though. The Cazena platform also gives teams a simple way to use a variety of distributed processing engines (e.g. Spark, MPP), all in close proximity to their data for maximum efficiency.
Having centralized tools means the entire team uses a common version of R and Python. Libraries the members need are added in a single place, so all analysts have access to the same set of libraries and, more importantly, to consistent versions of those libraries. When a library is upgraded, every member of the team automatically moves to the new version.
These features of the Cazena platform form the critical foundation that enables data science teams to be productive, and analytics leaders get most excited about those efficiency-enhancing functions. The fact that Cazena also gives teams the flexibility to use any analytic language – and makes it easy to run analytics across full datasets – is just the cherry on top.
Do you have additional examples of simple things that make the daily life of a data scientist more – or less – productive? I am genuinely interested in hearing about them, to see whether there are additional things we can do to add value to the data science process.
Lovan Chetty is Director of Product Management with Cazena, and can be reached at firstname.lastname@example.org.