How to develop a Data Science platform on Kubernetes

12th April 2019



Kubernetes is an open-source system for automating deployment, scaling, and management of containerised applications. It is rapidly becoming the de facto standard platform for data science applications.

Any data science platform should be simple, affordable, scalable and available:

Simple: The platform needs to be relatively simple to maintain and deploy. It should also be simple to use: single sign-on, simple dashboards, etc.

Affordable: Taking advantage of things like spot pricing to get cheaper, opportunistic compute. Also, auto-scaling should bring the cluster size down when the platform is not in use.

Scalable: Able to scale up to support very large datasets and long-running, highly parallel computing tasks.

Available: Running everything in Kubernetes gives the best possible chance of being able to run the platform anywhere.

Kubernetes can run on Google Cloud, AWS, Azure, on-premises and even on one’s own laptop.

The tools that make up such a platform fall into three main categories: code, compute and storage.

Code

Data science is often conducted in interactive coding and visualization environments such as Jupyter Notebooks and RStudio. These environments provide support for writing code in R or Python for querying, transforming and visualizing data. Limitations (CPU, memory, disk space, etc.) may arise fairly quickly when dealing with larger datasets.
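
One common way to stay within a notebook kernel's memory limits is to stream large files in chunks rather than loading them whole. A minimal sketch, assuming a hypothetical events.csv file with an event_type column:

```python
from collections import Counter

import pandas as pd

totals = Counter()
# Read the file in one-million-row chunks instead of all at once,
# keeping peak memory usage bounded regardless of file size.
for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
    totals.update(chunk["event_type"].value_counts().to_dict())

print(totals)
```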

Storage

There are a few types of cloud storage available.

Object Stores

We can store large amounts of data in object stores like AWS S3 or Google Cloud Storage. By caching parts of the data in memory, we can still achieve good query performance over large datasets.
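
As a rough illustration of that pattern, the sketch below pairs boto3 with a small in-memory cache; the bucket name and object key are hypothetical:

```python
import io
from functools import lru_cache

import boto3
import pandas as pd

s3 = boto3.client("s3")

@lru_cache(maxsize=32)  # keep the most recently used partitions in memory
def load_partition(key: str) -> pd.DataFrame:
    # Hypothetical bucket name; the first call fetches from S3, and
    # repeat queries over the same partition are served from memory.
    obj = s3.get_object(Bucket="analytics-data", Key=key)
    return pd.read_parquet(io.BytesIO(obj["Body"].read()))

daily = load_partition("events/2019-04-12.parquet")
print(daily.head())
```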

AWS EFS

AWS’s Elastic File System (EFS) is a network file system that is effectively unlimited in storage and supports massive throughput. It is costly compared to S3 and EBS, but allows portable, persistent home and shared directories across the various coding environments.
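
In practice this means mounting an EFS-backed volume into each coding environment. A minimal sketch using the official Kubernetes Python client; the "efs-sc" storage class and "data-science" namespace are assumptions:

```python
from kubernetes import client, config

config.load_kube_config()

# Claim EFS-backed storage that many pods can mount read-write at once,
# giving every coding environment the same persistent home directory.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="home-dirs"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],  # EFS supports concurrent mounts
        storage_class_name="efs-sc",     # assumed EFS-backed StorageClass
        resources=client.V1ResourceRequirements(requests={"storage": "50Gi"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="data-science", body=pvc
)
```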

Compute

Kubernetes supports running arbitrary compute out of the box. The industry standard for processing large datasets through a programmatic interface is Spark. While Spark 2.3 supports running natively on Kubernetes, this mode does not yet support R or Python.
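
For illustration, here is a minimal sketch of the kind of PySpark job we mean; given the limitation above it would run on a non-Kubernetes Spark cluster for now, and the s3a:// paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-aggregates").getOrCreate()

# Hypothetical input path; s3a:// lets Spark read straight from the object store.
events = spark.read.parquet("s3a://analytics-data/events/")

(events
    .groupBy("event_type")
    .count()
    .write.mode("overwrite")
    .parquet("s3a://analytics-data/aggregates/"))  # hypothetical output path

spark.stop()
```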

We use Pachyderm as a general storage and compute solution for end-to-end data and code traceability. For TensorFlow training, we are looking at Kubeflow, which allows simple provisioning of TensorFlow clusters within Kubernetes.
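
As a rough sketch, this is the kind of TensorFlow training script one would package into a container image for Kubeflow to run; the data here is synthetic:

```python
import numpy as np
import tensorflow as tf

# Synthetic regression data standing in for a real training set.
x = np.random.rand(1000, 4).astype("float32")
y = x.sum(axis=1, keepdims=True)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=5, batch_size=32)

# A real job would write the model to shared storage (e.g. the EFS volume above).
model.save("/tmp/model")
```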


Manish Prasad

An experienced data scientist with a passion for working on new challenges.
