I’m a huge advocate for proper software development practice in data science workflows. Data science is an explosive field – which also means it’s a bit loosely defined. It is often self-taught in some form, and it encompasses a wide range of skills. One of the weakest skills I often see in newer data scientists is software development and engineering: containerisation, continuous integration, unit testing, etc. So I’m going to focus on one of the most popular new tools for data science and engineering – Docker.
Docker is a tool for containerising code. There’s a ton of Docker introductions out there, so I want to emphasise one important point: in 2019 you will want (or even need) to know Docker as a data scientist. Before we start with Docker, however, we need to talk about containers.
What is a container?
A container is a more intuitive concept than you’d think. Basically, it wraps up your code with everything it needs to run into a neat little package. This is important as it makes code scalable and portable.
“A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings.” – Docker.com
So, what do they mean by that? Say you’re building a (useless) application in Python that takes in text, labels it with part-of-speech (POS) tags and returns them. You write all the code for your application, then commit it to GitHub. Your friend wants to use your application, so he clones your repository to his computer and tries to use it. Bam – error! He’s using Python 3.5 and you used 2.7. No big deal, he downloads 2.7 and tries to run it again. Bam – another error! He doesn’t have the spaCy library you used installed in his Python environment. He tries to install it, but receives yet another error because it’s not the same version of the library!
After a long, painstaking process he finally manages to get it working. However, this was neither efficient nor effective. All these problems could have been avoided if you’d both been running the code in the exact same environment. Now, since our application only contains Python, we could just use the virtualenv package for this. However, what happens if our application also has bash scripts? Some Java code? Maybe a local database or other services running on some port? At this point we need more.
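For the Python-only case, here’s a rough sketch of that virtualenv-style workflow (directory and file names are illustrative, and the activation line assumes Linux/macOS):

```shell
# Create an isolated Python environment in ./venv
python3 -m venv venv
# Activate it so python/pip point inside ./venv
. venv/bin/activate
# The interpreter now comes from the virtual environment
python -V
# Record exact dependency versions so someone else can reproduce them
pip freeze > requirements.txt
```

Your friend could then run `pip install -r requirements.txt` inside his own virtual environment – which solves the Python problem, but nothing beyond it.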
Why use Containers?
Containers allow us to run all our code with all its bells, whistles, languages and services across many different hosts consistently. If your friend had container software installed (like Docker), he could easily run your application on his own computer without needing to weed through the variety of dependencies necessary for it.
Unlike virtual machines, which install an entire operating system (OS) on your host machine, containers are lightweight and can be spun up with ease. Unlike language-specific virtual environments (like Python’s virtualenv), they also have far more functionality. You can have containers running Python, different Linux distributions, and Windows, along with a number of other software and services.
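To make that concrete, here’s a sketch of spinning up throwaway containers from different base images (this assumes Docker is installed and running; the image tags are illustrative):

```shell
docker run --rm python:2.7 python --version       # a Python 2.7 environment
docker run --rm python:3.5 python --version       # a Python 3.5 environment
docker run --rm ubuntu:18.04 cat /etc/os-release  # a different Linux distribution
```

Each command runs in its own isolated container, and `--rm` discards the container when it exits.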
Containers become even more important in production, where you could be managing tons of instances of your code across large clusters of nodes. This can be an abstract concept for a new data scientist, but I can tell you from experience that companies are using Docker more and more for machine learning. It is highly encouraged that those running deep learning workflows in libraries like TensorFlow do so in containers for reproducibility and portability.
So what is Docker?
As you have no doubt guessed, Docker is open source and enterprise software for containerising code. There are a couple of offerings out there, but Docker is definitely the most popular, and therefore the most desired on a resume. Now that you understand the goal of containerising code, we can dive into some Docker-specific terms:
Docker Image – An image is a definition of a container. Docker images contain the plans for what software will be in the container, and what code it will execute when it runs. For example, the aforementioned application would use a base image of Python 2.7.
Dockerfile – This is the file in which we define our image. You can see an example further below, but it’s the file that contains the instructions on how to build your image. This includes things like the base image (Python, Ubuntu, etc…), other packages we might need (pip, Python libraries), and what code you want to execute when the container runs (my_app.py).
Docker Container – We know what a container does, but what a Docker container actually is in computer science terms is an instance of an image. We define an image in our Dockerfile, build that Dockerfile into an actual image, and then run the instructions from that image in a container. An image definition is like a class definition – we can create multiple containers from the same image just as we can have multiple instances of the same class.
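The class/instance analogy can be made literal in a toy Python sketch (this is just an analogy, not how Docker is implemented):

```python
# A toy analogy: the image is the class, containers are instances of it
class PosTaggerImage:
    """Plays the role of a Docker image: one fixed definition."""
    def __init__(self):
        self.base = "python:2.7"  # like the FROM line in a Dockerfile

# "docker run" twice -> two independent containers from one image
container_1 = PosTaggerImage()
container_2 = PosTaggerImage()
print(container_1 is container_2)  # False: separate, isolated instances
```

Both instances were built from the same definition, but each has its own independent state – just like containers from the same image.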
The general, simplified trajectory is: write a Dockerfile, build it into an image, then run that image as a container.
Getting Started with Docker
I see a lot of Docker tutorials out there – but they often focus on deploying an application, multiple services or something more complex. For demonstration purposes, and those completely new to Docker/data science, I want to demonstrate the most minimal and useless Dockerfile, image, and container that I can using the example above. Let’s start.
First, create a new directory for our work. I prefer to use the terminal whenever possible, and encourage beginners to do so to gain comfort. Make a new directory; we’ll populate it with our application file and Dockerfile below.
Build a Quick App
Before we decide what we’ll need in our Docker image, let’s build our app! Like I said, it’ll be a simple app that takes in text and returns POS tags using Python 2.7. Normally, you might use a web framework like flask for this, but because I want this to be as bare bones as possible we’re going to take terminal input. I named this file app.py:
```python
import spacy

nlp = spacy.load('en_core_web_sm')
text = raw_input("Please enter a phrase: ")
doc = nlp(unicode(text))
for token in doc:
    print(token, token.pos_)
```
Create your Dockerfile
Feel free to try out these lines of code outside Docker – but remember, the point is to containerise this code! We now need to create our image. Open up a new file named Dockerfile and put the following lines in it:
```dockerfile
# Use an official Python runtime as a parent image
FROM python:2.7

# Set the working directory to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install the only library we really need for this
RUN pip install spacy

# Install spaCy's English model
RUN python -m spacy download en

# And finally, run our Python script!
CMD ["python", "app.py"]
```
It should be pretty intuitive – but I’ll run through the lines anyway. First we specify a base image with the Docker FROM instruction. It’s good practice to keep these images as lightweight as possible. Normally I’d use the “slim” or “alpine” versions of Python, which are minimal, but we need a full Python installation to work with spaCy.
Next, WORKDIR /app sets the working directory to /app – note that this directory lives inside our Docker container. The following instruction is key: COPY . /app copies everything from our current directory (specified by the .) into the /app directory inside our Docker image. So if you created a new directory with app.py and this Dockerfile in it, building the image will copy that Python code into your Docker image.
Lastly, we use the Docker RUN instruction to install the packages we need, and the CMD instruction to tell Docker we want to run our Python script when we run this container.
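One optional addition, not strictly needed for this tutorial: since COPY . /app copies everything in the directory, a .dockerignore file keeps junk out of your image. The entries below are just common examples:

```
# .dockerignore – paths Docker should skip when copying the build context
.git
__pycache__/
*.pyc
venv/
```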
Build Your Image
Now that we’ve defined our image, we have to build it! First, open up a terminal and cd into the directory containing your app.py and Dockerfile. Make sure you have Docker in your $PATH and that it is running by typing a simple command like docker --help. Then run the command below.
docker build -t my_image_name .
The first part of the command is standard when using Docker, and build builds the image. The -t flag lets us tag (name) the image for convenience – here I chose the filler name my_image_name, but it can be whatever you want. The . specifies the build context: the current directory, where Docker will look for your Dockerfile.
When you run the command above, you’ll notice it pulls the Python image from the official Docker registry online, and then begins installing all of the things you told it to! It takes a little while to build the first time, but it’s worth the wait. You can check that the image was built by running docker image ls.
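As an aside, you can also tag versions of the same image – handy once your app evolves (the version tags below are illustrative):

```shell
docker build -t my_image_name:0.1 .     # an explicit version tag
docker build -t my_image_name:latest .  # the default "latest" tag
docker image ls my_image_name           # lists both tags for the image
```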
Run the Container
Now that our image is built, we’re all set to actually run our container! The command for this is simple.
docker run -it my_image_name
There ya go. The -i flag is for interactive, and states that Docker should keep the terminal open for you to enter the input (the -t allocates a terminal for the session). If you’re running something like a flask application that doesn’t read from the terminal, you won’t need the -it flags.
Upon running this, you should be met with your input prompt – enter some text and it will spit back POS tags! If you want it to keep going instead of shutting down the container after one pass, just wrap the input code in a while True: loop. Of course, in production you’d want to wrap an API around this and expose it on a port – but we’ll cover that at a later date.
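For reference, if app.py instead served HTTP on a port (say via flask on port 5000 – hypothetical here), the run command would publish that port rather than use -it:

```shell
# -d runs detached (in the background); -p maps host port 5000 -> container port 5000
docker run -d -p 5000:5000 my_image_name
```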
Docker for Data Science
In this tutorial we hardly did any data science – spaCy handled it for us. However, Docker is equally important for managing, training and deploying your models. I always train frameworks like TensorFlow in a Docker image to avoid dependency and versioning issues. It’s actually recommended to use the TensorFlow Docker image for GPU support: https://www.tensorflow.org/install/docker.
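For example, the TensorFlow docs linked above suggest commands along these lines (the GPU variant requires Docker 19.03+ with NVIDIA container support; image tags are illustrative):

```shell
# CPU-only TensorFlow in a container
docker run -it --rm tensorflow/tensorflow \
    python -c "import tensorflow as tf; print(tf.__version__)"

# GPU variant: drop into a shell inside the GPU-enabled image
docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu bash
```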
With that – we’re still only at the tip of the iceberg. Companies often use Kubernetes (K8s) to manage all the Docker containers running at scale across a number of clusters: https://kubernetes.io/. There is even a machine learning framework built for K8s called Kubeflow that is growing in popularity.
Docker’s role in data science is becoming more prevalent every day – especially as data science moves from your local machine into the cloud at scale. Data science is no longer completely isolated from software engineering, and the best data scientists of the next century will know not only statistics and mathematics but scalable training and deployment.