Workshop 3: Workflow Automation with Snakemake
Introduction
In this workshop, you will learn how to make your genome analysis reproducible and shareable by refactoring your scripts into a Snakemake workflow. You will use Large Language Models (LLMs) to help design, implement, and debug your workflow, and you will run your analysis on the compute cluster. This workshop is designed for students with little prior experience in workflow management or Snakemake.
Problem Statement
Your PI wants your ancient genome analysis to be reproducible and easy for other lab members to use. Your job is to:
- Refactor your scripts from previous workshops into a Snakemake workflow
- Automate the steps for downloading data, running sequence analysis, and summarizing results
- Run your workflow on the compute cluster
- Summarize your workflow and findings in a brief report
Technical Skills Introduced
- Using VS Code and git for collaborative workflow development
- Introduction to workflow management with Snakemake
- Writing and debugging Snakemake rules
- Integrating Python scripts into workflows
- Submitting Snakemake jobs to a compute cluster
- Prompt engineering and iterative debugging with LLMs
Workshop Structure
- Setup: Clone your workshop repository from GitHub Classroom, set up your environment, and review your previous scripts.
- Workflow Design: Use LLMs to help you design a Snakemake workflow for your analysis pipeline.
- Implementation: Prompt the LLM to help you write Snakemake rules for each analysis step (download, analysis, reporting).
- Cluster Execution: Use LLMs to help you generate and debug a cluster job submission script for your Snakemake workflow.
- Reporting: Summarize your workflow and findings in a short markdown report. All files should be tracked in git and pushed to GitHub Classroom.
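As an illustration of the download step mentioned under Implementation, here is a minimal Python sketch of a helper that a download rule could call. The function name, the `.part` suffix, and any URL or path you pass in are assumptions made for this example, not part of the workshop materials. Snakemake already decides whether a rule needs to run based on its output files, so the existence check mainly helps when the script is run by hand.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if dest was already present.
    (Hypothetical helper for illustration only.)
    """
    dest = Path(dest)
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file at the final path.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)
    return True
```

Writing to a temporary file and renaming it at the end matters in a workflow context: a half-written FASTA file at the final path could otherwise be mistaken for a finished output.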
Sample Initial Prompt
I need to refactor my genome analysis scripts into a Snakemake workflow that downloads an ancient genome FASTA file, computes sequence statistics, and summarizes the results. Please generate a Snakefile and example rule for running the analysis on a compute cluster.
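To make the "computes sequence statistics" part of the prompt concrete, the following is a rough Python sketch of what such an analysis step might compute. The function names and the choice of statistics (length, GC fraction, ambiguous bases) are assumptions for illustration; your scripts from previous workshops may compute different values.

```python
from pathlib import Path


def read_fasta(path):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines are upper-cased so counting is case-insensitive.
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def sequence_stats(seq):
    """Compute length, GC fraction, and number of non-ACGT bases."""
    length = len(seq)
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return {
        "length": length,
        # GC fraction is taken over unambiguous bases only (an assumption).
        "gc_fraction": gc / acgt if acgt else 0.0,
        "ambiguous": length - acgt,
    }
```

Ancient DNA assemblies often contain many `N` bases, which is why this sketch reports them separately instead of letting them dilute the GC fraction.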
Deliverables
By the end of this workshop, you will have created the following artifacts:
- Snakemake Workflow Files
- A complete and well-documented Snakefile and any config or rule files needed for your workflow
- Example: Snakefile, config.yaml, rules/
- Integrated Python Scripts
- Python scripts for sequence analysis, adapted for use within the Snakemake workflow
- Example: scripts/analyze_mtDNA.py
- Cluster Submission Script
- A script or command for running your Snakemake workflow on the compute cluster (e.g., with qsub or Snakemake’s cluster integration)
- Example: run_snakemake.qsub
- Workflow Output Results
- Output files generated by the workflow, including sequence statistics and any summary files
- Example: results/summary.md, results/gc_content.txt
- Brief Report
- A short markdown report (1–2 paragraphs) summarizing your workflow design, results, and any challenges encountered. This should be clear enough to share with your PI or collaborators.
- Example: workflow_report.md
- Version-Controlled Repository
- All code and workflow files should be tracked in your git repository and pushed to GitHub Classroom, so that your work is reproducible and easy to share with instructors and collaborators.
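One way to adapt an existing Python script for the Integrated Python Scripts deliverable is sketched below. When a rule runs a script through Snakemake's `script:` directive, Snakemake makes an object named `snakemake` available inside the script, with attributes such as `input` and `output`. The sketch uses that object when it is present and otherwise falls back to command-line arguments. The function names, the usage string, and the markdown table layout are assumptions for illustration.

```python
import sys


def resolve_paths(argv, smk=None):
    """Pick input/output paths from Snakemake's injected object or from argv."""
    if smk is not None:
        # Running under a rule's `script:` directive: use the rule's
        # first declared input and output.
        return str(smk.input[0]), str(smk.output[0])
    if len(argv) != 3:
        raise SystemExit("usage: analyze.py <input.fasta> <output.md>")
    return argv[1], argv[2]


def render_summary(stats):
    """Format a dict of {name: value} statistics as a small markdown table."""
    lines = ["| Statistic | Value |", "| --- | --- |"]
    for name, value in stats.items():
        lines.append(f"| {name} | {value} |")
    return "\n".join(lines) + "\n"
```

At the bottom of the script, a call such as `resolve_paths(sys.argv, globals().get("snakemake"))` would then work both inside the workflow and when the script is run by hand, which makes debugging easier before you submit jobs to the cluster.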