Managing your data science project environments with Conda

A minimal set of “best practices” for the discerning data scientist.

The environment and package manager for the discerning data scientist. Source: https://docs.conda.io/en/latest/

Conda “Best Practices”

In this post I describe a minimal set of “best practices” that I use in my own data science work to manage project-specific environments with Conda. The post assumes basic familiarity with Conda. If you have never heard of Conda or are just getting started, I recommend you take a look at Getting started with Conda.

TL;DR

Here is the basic recipe for using Conda to manage a project-specific software stack.

(base) $ mkdir project-dir
(base) $ cd project-dir
(base) $ nano environment.yml # create the environment file
(base) $ conda env create --prefix ./env --file environment.yml
(base) $ conda activate ./env # activate the environment
(/path/to/env) $ nano environment.yml # forgot to add some deps
(/path/to/env) $ conda env update --prefix ./env --file environment.yml --prune # update the environment
(/path/to/env) $ conda deactivate # done working on project (for now!)

New project, new directory

Every new project (no matter how small!) should live in its own directory. A good reference for organizing your project directory is Good Enough Practices for Scientific Computing.

mkdir project-dir
cd project-dir

New project, new environment

Now that you have a new project directory you are ready to create a new environment for your project. We will do this in two steps.

  1. Create an environment file that describes the software dependencies (including specific version numbers!) for the project.
  2. Use the newly created environment file to build the software environment.

Here is an example of a typical environment file that could be used to run GPU accelerated, distributed training of deep learning models developed using PyTorch.

name: null

channels:
  - pytorch
  - conda-forge
  - defaults

dependencies:
  - cudatoolkit=10.1
  - jupyterlab=1.2
  - pip=20.0
  - python=3.7
  - pytorch=1.5
  - tensorboard=2.1
  - torchvision=0.6
  - torchtext=0.6
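
Since the point of the environment file is to pin specific version numbers, a small check can catch entries that were left unpinned. The following is a minimal sketch, not part of Conda: `unpinned_deps` is a hypothetical helper name, and it assumes a simple environment.yml laid out like the one above (a top-level dependencies: key followed by "- package=version" entries).

```shell
# Hypothetical helper (not a Conda command): print every entry under
# "dependencies:" that carries no version constraint (=, <, or >).
unpinned_deps() {
  awk '
    /^dependencies:/ { in_deps = 1; next }   # start of the dependency list
    /^[^ \t-]/       { in_deps = 0 }         # any other top-level key ends it
    in_deps && /^[ \t]*-[ \t]*[^ \t]/ {
      line = $0
      sub(/^[ \t]*-[ \t]*/, "", line)        # strip the leading "- "
      sub(/[ \t]*#.*$/, "", line)            # strip trailing comments
      sub(/[ \t]+$/, "", line)               # strip trailing whitespace
      # Skip nested section headers such as "pip:"; report unpinned names.
      if (line !~ /:$/ && line !~ /[=<>]/) print line
    }
  ' "${1:-environment.yml}"
}

# Usage (from inside the project directory):
#   unpinned_deps environment.yml
```

An empty result means every listed dependency has a version constraint.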

Once you have created an environment.yml file inside your project directory, you can use the following command to create the environment as a sub-directory called env inside your project directory.

conda env create --prefix ./env --file environment.yml
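
If you find yourself repeating this step across projects, the create and update commands used in this post can be wrapped in one small shell function. This is only a sketch: `ensure_env` is a hypothetical name, and it assumes you run it from the project directory and that the environment always lives in ./env.

```shell
# Hypothetical wrapper: create ./env on first use, update it afterwards.
ensure_env() {
  if [ ! -f environment.yml ]; then
    echo "environment.yml not found in $(pwd)" >&2
    return 1
  fi
  if [ -d ./env ]; then
    # The environment already exists: sync it with the environment file.
    conda env update --prefix ./env --file environment.yml --prune
  else
    # First run: build the environment inside the project directory.
    conda env create --prefix ./env --file environment.yml
  fi
}

# Usage:
#   cd project-dir
#   ensure_env
```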

Activating an environment

Activating environments is essential to making the software in environments work well (or sometimes at all!). Activation of an environment does two things.

  1. Adds entries to PATH for the environment.
  2. Runs any activation scripts that the environment may contain.

Step 2 is particularly important as activation scripts are how packages can set arbitrary environment variables that may be necessary for their operation.

conda activate ./env # activate the environment
(/path/to/env) $ # prompt indicates which environment is active!
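
To make the two activation steps concrete, here is a simplified illustration for a Unix-like shell. It is not Conda's implementation (conda activate also sets variables such as CONDA_PREFIX and changes the prompt), and `sketch_activate` is a hypothetical name used only for this example. It assumes the usual environment layout, with executables in bin and activation scripts in etc/conda/activate.d.

```shell
# Illustration only: mimics the two effects of activation described above.
sketch_activate() {
  env_dir=$1

  # 1. Put the environment's executables first on PATH.
  PATH="$env_dir/bin:$PATH"
  export PATH

  # 2. Source any activation scripts that installed packages provide.
  for script in "$env_dir"/etc/conda/activate.d/*.sh; do
    if [ -f "$script" ]; then
      . "$script"
    fi
  done
  return 0
}

# Usage (for real work, prefer: conda activate ./env):
#   sketch_activate "$(pwd)/env"
```

Because the scripts are sourced rather than executed, any variables they export remain set in your current shell, which is why running a program from an environment without activating it can fail.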

Updating an environment

You are unlikely to know ahead of time which packages (and version numbers!) you will need for your research project. For example, it may be the case that…

  • one of your core dependencies just released a new version (dependency version number update).
  • you need an additional package for data analysis (add a new dependency).
  • you have found a better visualization package and no longer need the old visualization package (add new dependency and remove old dependency).

If any of these occurs during the course of your research project, all you need to do is update the contents of your environment.yml file accordingly and then run the following command.

conda env update --prefix ./env --file environment.yml --prune

Alternatively, you can simply rebuild the environment from scratch with the following command.

conda env create --prefix ./env --file environment.yml --force

Unless building the environment from scratch takes a significant amount of time (which should be extremely rare!) I almost always rebuild my environments from scratch when I add (or remove) dependencies.
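
A rebuild from scratch can also be written without relying on any particular flag: delete the env directory and create the environment again. The sketch below assumes the environment has been deactivated first and that it lives in ./env; `rebuild_env` is a hypothetical name.

```shell
# Hypothetical helper: throw away ./env and rebuild it from environment.yml.
rebuild_env() {
  if [ ! -f environment.yml ]; then
    echo "environment.yml not found in $(pwd)" >&2
    return 1
  fi
  # Remove the old environment directory entirely (deactivate it first).
  rm -rf ./env
  # Build a fresh environment from the current environment file.
  conda env create --prefix ./env --file environment.yml
}

# Usage:
#   conda deactivate
#   rebuild_env
```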

Deactivating an environment

When you are done working on your project it is a good idea to deactivate the current environment. To deactivate the currently active environment use the deactivate command as follows.

conda deactivate # done working on project (for now!)
(base) $ # now you are back to the base environment

Interested in Learning More?

For more details on using Conda to manage software stacks for your data science projects, check out the Introduction to Conda for (Data) Scientists training materials that I have contributed to The Carpentries Incubator. These lesson materials are under active development, so please feel free to open issues or submit PRs!
