07 Aug 2019
Welcome to the Wednesday edition of my KDD 25 Roundup. Yesterday was all about time series data and real-time detection. Today was a nice change of pace: rather than watching paper presentations, I attended a roughly day-long tutorial session on Databricks.
Building data pipelines in a cloud environment can involve a considerable amount of overhead, both organizational and mental. There are practically endless knobs to tune: which tools to use, how to configure them, which options to enable, and so on. That’s a lot of decisions to make when starting a data operation.
Databricks is a preconfigured data management, exploration, and analysis tool, with a nice point-and-click self-service portal for adding it to your existing cloud environment.
In short, you can add data to Databricks, then access it through the handful of tools it exposes, most of which are built on Spark. I spent the most time in the notebook environment, writing Python code to test out some predetermined happy paths. It feels like a Jupyter notebook, but it is missing some of the nice convenience features (e.g. a hotkey for pulling up docstrings in a pop-up, though I understand they’ll be integrating Jupyter in the future).
From what I can tell, the biggest value-add of Databricks is the super slick startup process; you can be up and running with a pretty functional environment in a couple of clicks and a few minutes. When this environment is enough, Databricks is probably worth it. That being said, I didn’t get the impression that the platform is highly customizable or extensible. This could impede adoption for organizations with very specific needs, such as government entities that need to integrate with monitoring or compliance systems.
Overall, I think my lukewarm impression of Databricks might improve if I spent more time on the platform and got to know some of its more advanced features. Hopefully these initial reactions of mine can help you decide whether it’s right for your organization.