In this post I detail a minimal set of “best practices” for using Conda to manage project-specific environments, drawn from my own data science work. The post assumes a basic familiarity with Conda. If you have never heard of Conda or are just getting started, I recommend you take a look at Getting started with Conda.
Here is the basic recipe for using Conda to manage a project-specific software stack.
(base) $ mkdir project-dir
(base) $ cd project-dir
(base) $ nano environment.yml # create the environment file
(base) $ conda env create --prefix ./env --file environment.yml
(base) $ conda activate ./env # activate the environment
(/path/to/env) $ nano environment.yml # forgot to add some deps
(/path/to/env) $ conda env update --prefix ./env --file environment.yml --prune # update the environment
(/path/to/env) $ conda deactivate # done working on project (for now!)
Every new project (no matter how small!) should live in its own directory. A good reference for organizing your project directory is Good Enough Practices for Scientific Computing.
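Now that you have a new project directory, you are ready to create a new environment for your project. We will do this in two steps: first write an environment file that describes the project's dependencies, then use that file to build the environment.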
Here is an example of a typical environment file that could be used to run GPU accelerated, distributed training of deep learning models developed using PyTorch.
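As a minimal sketch of what such a file might contain, the following shell snippet writes a small environment.yml into the current directory. The channels and package names are illustrative assumptions rather than the exact file used for the GPU accelerated, distributed training setup described above, and version numbers are deliberately left out.

```shell
# Write an illustrative environment.yml (hypothetical contents; adjust to your project).
# No "name:" key is included, on the assumption that --prefix supplies the location.
cat > environment.yml << 'EOF'
channels:
  - pytorch
  - conda-forge
  - defaults

dependencies:
  - python
  - pip
  - pytorch
  - torchvision
  - jupyterlab
EOF
```

Once you know which versions you need, pin them in the file using the package=version form so that the environment can be rebuilt reproducibly.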
Once you have created an environment.yml file inside your project directory, you can use the following command to create the environment as a sub-directory called env inside your project directory.
conda env create --prefix ./env --file environment.yml
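If you re-run your setup steps often, a small guard can avoid trying to create an environment that is already there. The function below is a hypothetical helper (the name create_env_if_missing is not part of Conda); it only wraps the command shown above and assumes it is run from the project directory.

```shell
# Hypothetical helper: build ./env from environment.yml unless ./env already exists.
create_env_if_missing() {
    if [ -d ./env ]; then
        # Leave an existing environment untouched.
        echo "./env already exists; skipping creation"
    else
        conda env create --prefix ./env --file environment.yml
    fi
}
```

Calling create_env_if_missing in a fresh project directory builds the environment; calling it again simply reports that ./env is already present.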
Activating environments is essential to making the software in environments work well (or sometimes at all!). Activation of an environment does two things.
1. Adds entries to PATH for the environment.
2. Runs any activation scripts that the environment may contain.
Step 2 is particularly important as activation scripts are how packages can set arbitrary environment variables that may be necessary for their operation.
(base) $ conda activate ./env # activate the environment
(/path/to/env) $ # prompt indicates which environment is active!
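To make the role of activation scripts more concrete, here is a hedged sketch of how a project could add its own. Conda sources shell scripts found under etc/conda/activate.d and etc/conda/deactivate.d inside the environment directory; the file name env_vars.sh and the variable PROJECT_LOG_LEVEL are hypothetical choices made for illustration.

```shell
# Assumes the environment lives in ./env, as in the rest of this post.
ENV_DIR=./env

# Conda looks for shell scripts in these two directories inside the environment.
mkdir -p "$ENV_DIR/etc/conda/activate.d" "$ENV_DIR/etc/conda/deactivate.d"

# Sourced when the environment is activated: set a (hypothetical) project variable.
cat > "$ENV_DIR/etc/conda/activate.d/env_vars.sh" << 'EOF'
export PROJECT_LOG_LEVEL=debug
EOF

# Sourced when the environment is deactivated: undo what the activation script did.
cat > "$ENV_DIR/etc/conda/deactivate.d/env_vars.sh" << 'EOF'
unset PROJECT_LOG_LEVEL
EOF
```

Because these files live inside ./env, they would likely need to be recreated whenever the environment is rebuilt from scratch.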
You are unlikely to know ahead of time which packages (and version numbers!) you will need for your research project. For example, it may be the case that…
If any of these occurs during the course of your research project, all you need to do is update the contents of your environment.yml file accordingly and then run the following command.
conda env update --prefix ./env --file environment.yml --prune
Alternatively, you can simply rebuild the environment from scratch with the following command.
conda env create --prefix ./env --file environment.yml --force
Unless building the environment from scratch takes a significant amount of time (which should be extremely rare!) I almost always rebuild my environments from scratch when I add (or remove) dependencies.
When you are done working on your project, it is a good idea to deactivate the current environment. To deactivate the currently active environment, use the deactivate command as follows.
(/path/to/env) $ conda deactivate # done working on project (for now!)
(base) $ # now you are back to the base environment
For more details on using Conda to manage software stacks for your data science projects, check out the Introduction to Conda for (Data) Scientists training materials that I have contributed to The Carpentries Incubator. These lesson materials are under active development, so please feel free to open issues or submit PRs!