PyData Global 2024

Enabling Multi-Language Programming in Data Engineering Workflows with the Snakemake Framework
12-03, 14:30–15:00 (UTC), AI/ML Track

Streamlining clinical trial output workflows is a key challenge in clinical studies. To deliver reports to health authorities, clinical trial statisticians need to write several scripts that produce deliverables such as output datasets, tables, figures, and listings. Statisticians must also run these scripts in a specific order so that dependencies between the generated datasets are respected.

Our project uses Python to automatically generate orchestration workflows from clinical trial project metadata using the Snakemake framework. Snakemake can execute jobs inside Docker containers, which makes it possible to orchestrate steps written in different languages. This enables our users to run end-to-end (E2E) data engineering workflows in their preferred programming languages, primarily SAS and R. Snakemake can also run independent jobs in parallel, which makes workflow execution more efficient.
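
To make the idea concrete, here is a minimal sketch of how a Snakefile could be generated from project metadata in Python. It is not the project's actual implementation: the metadata layout, the program names, the container image names, and the helper functions are all hypothetical and chosen for illustration only. The sketch relies on Snakemake's standard rule directives (`input`, `output`, `container`, `shell`), and Snakemake derives the execution order from matching input and output files.

```python
import json

# Hypothetical project metadata: one entry per program, with the files it
# reads and writes. All names and paths are illustrative assumptions.
PROGRAMS = [
    {
        "name": "adsl",
        "language": "sas",
        "script": "adsl.sas",
        "inputs": ["raw/dm.csv"],
        "outputs": ["adam/adsl.csv"],
    },
    {
        "name": "t_demog",
        "language": "r",
        "script": "t_demog.R",
        "inputs": ["adam/adsl.csv"],
        "outputs": ["tlf/t_demog.rtf"],
    },
]

# Hypothetical mapping from language to container image and command line.
RUNNERS = {
    "sas": {"image": "docker://example/sas-runtime", "command": "sas -sysin {script}"},
    "r": {"image": "docker://example/r-runtime", "command": "Rscript {script}"},
}


def quoted_list(paths, indent=8):
    """Render paths as an indented, comma-separated list of quoted strings."""
    pad = " " * indent
    return ",\n".join(f"{pad}{json.dumps(path)}" for path in paths)


def render_rule(program):
    """Render one Snakemake rule for a single program entry."""
    runner = RUNNERS[program["language"]]
    command = runner["command"].format(script=program["script"])
    return "\n".join(
        [
            f"rule {program['name']}:",
            "    input:",
            quoted_list(program["inputs"]),
            "    output:",
            quoted_list(program["outputs"]),
            "    container:",
            f"        {json.dumps(runner['image'])}",
            "    shell:",
            f"        {json.dumps(command)}",
        ]
    )


def render_snakefile(programs):
    """Render a Snakefile whose first rule requests every declared output."""
    targets = [out for program in programs for out in program["outputs"]]
    rule_all = "rule all:\n    input:\n" + quoted_list(targets)
    return "\n\n".join([rule_all] + [render_rule(p) for p in programs]) + "\n"


snakefile_text = render_snakefile(PROGRAMS)
print(snakefile_text)
```

In this sketch the `t_demog` rule lists `adam/adsl.csv` as an input, so Snakemake would schedule the `adsl` rule first. Rules that do not depend on each other can run concurrently when Snakemake is given more than one core (for example with `--cores 4`).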


This talk will provide context on the clinical trials domain and the challenges of generating and managing output workflows for clinical trial reports. We will then explore our technical solution, which uses Python and the Snakemake framework to automate the orchestration of E2E data engineering workflows.


Prior Knowledge Expected

No previous knowledge expected

I currently work at Roche as a Senior Data Scientist. I have a deep passion for elevating Python code quality and enhancing its role within the pharmaceutical industry. I am also actively engaged in streamlining automation workflows for the delivery of both R and Python packages.