Unlike the well-established and regulated field of software development, data science is still new and exploratory. Many data scientists are self-taught and bring together the creativity and thought processes of many different fields. However, that background doesn’t always include the structure, rigor, and conventions of computer science.
Remember, data science is a demanding field. It’s a combination of statistics, higher-level mathematics, computer science, and general problem solving and creativity. While extensive exploration in Jupyter notebooks is all well and good, you won’t be able to accomplish anything truly tangible without the dev skills and terminology to bring your exploration up to production-grade code. Here are some simple things you should keep in mind when doing your next data science project.
The Stuff You Should Already be Doing
I wish this could go without saying, but that isn’t always the case. If it needs to be run more than once, your code should be wrapped in reusable functions. Organizing code into logical, parametrized units is a fundamental aspect of computer science. This doesn’t only make things easier for you, but also for everyone else who will be using your code. Remember, like anything else, data science is about generating value. If your code can’t be reused by your team or future teams, then in the long term it’s useless.
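To make that concrete, here is a minimal sketch of turning a copy-pasted notebook calculation into one parametrized function. The function name `standardize`, its `sample` flag, and the example columns are all made up for illustration; treat it as one possible shape, not the way to do it.

```python
from statistics import mean, pstdev, stdev


def standardize(values, sample=True):
    """Rescale values to zero mean and unit standard deviation."""
    center = mean(values)
    # sample=True uses the sample standard deviation, otherwise the population one.
    spread = stdev(values) if sample else pstdev(values)
    if spread == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - center) / spread for v in values]


# Hypothetical columns: one function call each instead of one pasted cell each.
ages = [23, 35, 47]
incomes = [40_000, 52_000, 64_000]
scaled_ages = standardize(ages)
scaled_incomes = standardize(incomes, sample=False)
```

The point is less the math than the shape: the behavior that used to vary between pasted cells is now a parameter, so a teammate can reuse the function without reading your notebook.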
Again, you probably aren’t the only person who’s ever going to need to use your code! At some point, someone else at your company may also need to use your functions. Make it easier on them and tell them in plain text what the code actually does. This saves a ton of time, avoids errors, and makes your work reproducible.
Not only that, but if it’s code you wrote a long time ago, even you might forget what it does. Avoid the embarrassment and keep well-documented information on everything you build.
Keeping a project organized is a simple thing, and yet many data scientists refuse to actively practice it (honestly, myself included sometimes). Take a look at some popular GitHub repositories. Their code is typically organized into intuitive modules with the proper descriptive documents (requirements.txt, README.md, etc.). If you’re new to professional programming and looking to get started, try Cookiecutter Data Science. This repository gives a great introduction to a good directory structure for data science projects.
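As a rough illustration of "tell them in plain text," here is a small function with a docstring. The function `clip_outliers` is hypothetical, and the section layout (Parameters, Returns, Raises) follows the common NumPy docstring convention, which is only one of several reasonable styles.

```python
def clip_outliers(values, lower, upper):
    """Limit every value to the closed interval [lower, upper].

    Parameters
    ----------
    values : list of float
        The raw measurements.
    lower : float
        Smallest value allowed in the output.
    upper : float
        Largest value allowed in the output.

    Returns
    -------
    list of float
        A new list; the input is not modified.

    Raises
    ------
    ValueError
        If lower is greater than upper.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return [min(max(v, lower), upper) for v in values]
```

Anyone who calls `help(clip_outliers)` now gets the explanation without opening the source file, including future you.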
The Stuff You Should Give a Shot
Most data scientists know what Git is and kind of use it, but they don’t seem to use it the same way that software engineers do. Git has a ton of functionality for version control and branching, and through hooks and hosted services it ties into testing and deployment as well. It’s crucial to have good Git habits when coding in a team and when bringing data science work up to software engineering quality.
There’s a lot of material here that can really take your Git game to the next level.
It may be the propensity of data scientists to rely on the notebook format for coding, but most data scientists really aren’t good with test cases. I rarely see unit tests included in the code, and often people try to push notebook-quality code to production without properly vetting it.
Do yourself, your team, and your company a favor: make sure everything you write is tested against a broad variety of potential inputs. I’ve heard of data scientists pushing untested notebooks to production on AWS and running the company bill through the roof with some computationally intensive trash. Here’s a Medium article on it to get you started. Additionally, you can track how much of your code the tests actually exercise with a coverage service like Coveralls.
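If you have never written a unit test, a sketch like the following shows the general idea. The helper `fill_missing` and the three test cases are invented for illustration; the `test_` naming follows the convention that pytest uses to discover tests, assuming the code lives in a file named something like test_cleaning.py.

```python
import math


def fill_missing(values, fill_value=0.0):
    """Replace None and NaN entries with fill_value, leaving other entries as-is."""
    cleaned = []
    for v in values:
        is_nan = isinstance(v, float) and math.isnan(v)
        cleaned.append(fill_value if v is None or is_nan else v)
    return cleaned


def test_fills_none_and_nan():
    assert fill_missing([1.0, None, float("nan")]) == [1.0, 0.0, 0.0]


def test_empty_input_returns_empty_list():
    # Edge case: notebooks rarely exercise empty inputs, production data will.
    assert fill_missing([]) == []


def test_does_not_mutate_input():
    raw = [None, 2.0]
    fill_missing(raw, fill_value=-1.0)
    assert raw == [None, 2.0]
```

Each test pins down one behavior, including the boring edge cases. Those are usually the ones that blow up once the code leaves the notebook.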
Continuous Integration (CI)
Essentially, continuous integration ties it all together. A popular standard for developers, CI involves merging all developer changes together as frequently as possible, then automating the build of your application (or just a successful run of the code). Basically, this means committing your changes via Git as often as possible to avoid conflicts between working branches. The longer your team edits code in isolation, the greater the likelihood of a problem when you go to combine your changes later. The automated build part makes sure that all the code still runs smoothly after integrating these changes.
There are a ton of different tools for CI, although personally, I hear a lot about (and use) Jenkins and Travis CI. Regardless of whether you adopt a particular tool or every CI practice, the general idea is good practice for everyone. When coding with your team, make sure to merge, commit, and run your code as frequently as you can. Trust me, it can help you avoid some pretty severe headaches.