PyData Global 2024

Reproducible Python projects using Nix
12-04, 18:30–19:00 (UTC), AI/ML Track

As data scientists and machine learning engineers, it is crucial that we can reproduce results and seamlessly share projects across teams and stakeholders. However, differing operating systems, Python environments, package versions, and package managers often hinder reproducibility across different machines. This talk will explore how Nix can be leveraged to create reproducible work environments and how it can be a convenient tool for any Data Scientist or ML Engineer.


Audience

This talk is for Data Scientists and Machine Learning Engineers at any level. It would also be useful for any Python developer. Basic knowledge of Docker containers is helpful but optional.

Objective

Attendees will learn why reproducibility is necessary and how to use Nix’s features in daily work.

Description

In recent years, containerization using tools like Docker has become a cornerstone for efficiently deploying applications. One key reason for its popularity is the ability to create consistent and deterministic production environments. However, reproducible environments are also highly beneficial for development. Achieving reproducibility often becomes complex due to the diverse range of operating systems, package versions, dependency managers, and Python versions. What runs smoothly on one machine may fail on another, leading to inconsistencies that hinder collaboration and slow down development velocity.

We'll begin by exploring the fundamental challenges of maintaining consistency in development setups. While Docker is widely adopted for encapsulating production environments, it has limitations, particularly in development. We'll briefly discuss these limitations to set the stage for why a tool like Nix is needed.

Next, we’ll dive into the Nix ecosystem, introducing key components such as Nixpkgs, NixOS, and the Nix language. We will see how these elements work together to create environments that are not only consistent across different machines but also highly customizable and reproducible.

Finally, we’ll put theory into practice by building a sample data science project. We'll walk through each step to make the development environment reproducible using Nix, demonstrating how this tool can streamline collaboration and ensure that your code runs identically, regardless of the underlying system.
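
As a rough illustration of what "runs identically" can mean in practice, the sketch below fingerprints the Python version and a few package versions so that two machines can compare their environments. It is not taken from the talk: the helper name `environment_fingerprint` and the package list are hypothetical, and only the standard library is assumed.

```python
import hashlib
import platform
from importlib import metadata


def environment_fingerprint(packages):
    """Return (details, digest) for the interpreter and the given package names.

    Hypothetical helper for illustration; not part of Nix or of the talk's project.
    """
    details = {"python": platform.python_version()}
    for name in sorted(packages):
        try:
            details[f"pkg:{name}"] = metadata.version(name)
        except metadata.PackageNotFoundError:
            details[f"pkg:{name}"] = "not installed"
    # Hash a sorted, canonical rendering so identical environments give identical digests.
    canonical = "\n".join(f"{key}={value}" for key, value in sorted(details.items()))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return details, digest


if __name__ == "__main__":
    # Placeholder package list; substitute the project's real dependencies.
    details, digest = environment_fingerprint(["numpy", "pandas", "scikit-learn"])
    for key, value in sorted(details.items()):
        print(f"{key}: {value}")
    print(f"fingerprint: {digest}")
```

If two collaborators run this and get different fingerprints, their environments have drifted. Inside an environment pinned with Nix, the expectation is that the fingerprints match.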

Outline

  1. Introduction and motivation [1 min]
  2. Why do we need reproducible behavior? [2 min]
  3. Why is it hard to have a deterministic environment? [2 min]
  4. Docker containers [2 min]
    • Why is it widely used?
    • Where is it useful?
    • Where does it fall short for dev environments?
  5. Nix concepts [5 min]
    • What is it?
    • Benefits
    • Ecosystem
    • Isolated environments
  6. Sample Data Science project [2 min]
  7. Show non-deterministic behavior of the project [3 min]
  8. Introduce Nix in the project [3 min]
    • Install Nix
    • Write a Nix file using the Nix language
  9. Spin up the Nix environment [1 min]
  10. Run the project and verify reproducibility [3 min]
  11. Further use cases and extensions [3 min]
    • CI/CD integration
    • Handling different Python versions
    • Collaboration
  12. Q/A [3 min]

Prior Knowledge Expected

No previous knowledge expected

Avik is a seasoned data scientist who has worked in multiple domains of machine learning. He loves coding in Python and writing elegant and scalable code.