Tools for setting up a new ML project.

Vasnetsov Andrey
3 min read · Jan 15, 2020

Here is a list of tools I find worth trying if you are setting up a new ML project. The list is not intended to be an exhaustive overview, and it does not include any ML frameworks or libraries.

It focuses on auxiliary tools that can make development easier and experiments reproducible. Some of these tools I have used in real projects; others I have only tried on toy examples but found interesting enough to use in the future.

Starting a new project

cookiecutter — a scaffold generator for all sorts of projects. It generates boilerplate code for an empty project and is useful for keeping project directory structures uniform. It also creates basic configurations for packaging and testing (and may include pre-configured code coverage tools).
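As a rough sketch of what this looks like in practice, here is project generation via cookiecutter's Python API (the template URL and the context key below are only examples; the CLI works just as well):

from cookiecutter.main import cookiecutter

# Generate a project skeleton from an example template (here the popular
# cookiecutter-data-science one); extra_context overrides template variables.
cookiecutter(
    "https://github.com/drivendata/cookiecutter-data-science",
    no_input=True,                                     # take defaults instead of prompting
    extra_context={"project_name": "my-ml-project"},   # hypothetical project name
)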

Training

Tracking experiments

MLflow — a tool for keeping track of performed experiments and their parameters. It has its own web view for accessing experiment data and requires a running tracking server to store experiment runs.
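A minimal sketch of what tracking a run looks like (the tracking URI, experiment name, parameters and metrics below are made up for illustration):

import mlflow

mlflow.set_tracking_uri("http://localhost:5000")   # assumes a tracking server is running here
mlflow.set_experiment("my-experiment")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)    # hyperparameters of the run
    mlflow.log_metric("val_accuracy", 0.93)    # metrics (can also be logged per step)
    mlflow.log_artifact("model.pkl")           # attach files such as the trained model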

Managing artifacts

DVC — a tool that can be thought of as Git for large or binary files. It does not try to calculate a diff between versions and relies on third-party file storage (for example, S3). In essence, it consists of two parts: data versioning and data pipelines. The first is definitely a must-have and is very useful for keeping your trained models tracked right next to the code. The second part, DVC pipelines, looks redundant to me; I would rather replace it with a good old Makefile.
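Versioning itself happens on the command line (dvc add, dvc push), but DVC also ships a small Python API for reading tracked files back. A sketch, with a placeholder repo URL, path and revision:

import pickle

import dvc.api

# Stream a model that was versioned with `dvc add` and pushed to remote storage.
with dvc.api.open(
    "models/model.pkl",                         # path inside the repository
    repo="https://github.com/user/project",     # hypothetical repo URL
    rev="v1.0",                                 # Git tag or commit of the experiment
    mode="rb",
) as f:
    model = pickle.load(f)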

Data pipelines & ETL

  • Makefile — a very old tool. You define target files and the steps needed to produce them.
  • Luigi or Airflow — distributed versions of the Makefile idea. They allow building Directed Acyclic Graphs (DAGs) of tasks in Python, and they also have a web view and a built-in scheduler (see the Luigi sketch after this list).
  • kedro — yet another data pipeline framework, which also provides its own project generation tool. It may be interesting because of its high level of abstraction: Kedro pipelines are abstract enough to be converted into Airflow pipelines. However, they depend heavily on the Dataset abstraction, so it might be tricky to implement lazy data loaders.
  • metaflow — a serverless variation of Airflow: it allows you to build pipelines and DAGs, but it has no scheduler or server side. Instead, it can launch pipelines directly in the cloud.
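To make the Makefile analogy concrete, here is a toy Luigi pipeline; the task names and file paths are invented for illustration. Each task declares its output file and the tasks it depends on, much like a Makefile target and its prerequisites:

import luigi


class Preprocess(luigi.Task):
    def output(self):
        return luigi.LocalTarget("data/clean.csv")    # the "target file"

    def run(self):
        with self.output().open("w") as f:            # the "recipe" that builds it
            f.write("feature,label\n1,0\n")


class Train(luigi.Task):
    def requires(self):
        return Preprocess()                           # an edge in the DAG

    def output(self):
        return luigi.LocalTarget("models/model.txt")

    def run(self):
        with self.input().open() as data, self.output().open("w") as out:
            out.write("trained on %d rows" % len(data.readlines()))


if __name__ == "__main__":
    luigi.build([Train()], local_scheduler=True)      # run without the central scheduler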

Deployment

Create API

  • Flask-RESTPlus + gunicorn — a slightly more sophisticated setup than plain Flask. RESTPlus nudges you toward better API development practices than pure Flask does, and it makes it easy to document your API with Swagger.
  • FastAPI — an even more compact take on Flask + Swagger that builds API documentation automatically (see the sketch after this list).
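A minimal FastAPI sketch for serving predictions; the request schema and the scoring logic are placeholders standing in for a real model call:

from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="My model API")   # interactive Swagger docs are served at /docs


class PredictRequest(BaseModel):
    features: List[float]             # hypothetical input schema


@app.post("/predict")
def predict(request: PredictRequest):
    # Replace with a real model call, e.g. model.predict([request.features])
    score = sum(request.features) / max(len(request.features), 1)
    return {"score": score}

# Run with: uvicorn main:app --host 0.0.0.0 --port 8000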

Run on cluster

  • Docker + Kubernetes — there are tons of training materials and reviews on the Internet. One little piece of advice: don't try to deploy your own K8s control plane.
  • cortex — allows you to quickly deploy your models as a service on AWS. It automatically wraps your predict function into an API endpoint and instantiates EC2 instances for serving it. In my opinion, though, this framework is only suitable for prototypes, as the solution is not transferable and no containers are created.

Further reading

  • Feel free to join my Telegram channel or visit my blog if you are interested in ML Engineering and NLP.
