Computational Ancient DNA Workshop

Workshop 3: Workflow Automation with Snakemake

Introduction

In this workshop, you will learn how to make your genome analysis reproducible and shareable by refactoring your scripts into a Snakemake workflow. You will use Large Language Models (LLMs) to help design, implement, and debug your workflow, and you will run your analysis on the compute cluster. This workshop is designed for students with little prior experience in workflow management or Snakemake.

Supporting Materials

Problem Statement

Your PI wants your ancient genome analysis to be reproducible and easy for other lab members to use. Your job is to:

Refactor your scripts from previous workshops into a Snakemake workflow
Automate the steps for downloading data, running sequence analysis, and summarizing results
Run your workflow on the compute cluster
Summarize your workflow and findings in a brief report

Technical Skills Introduced

Using VS Code and git for collaborative workflow development
Introduction to workflow management with Snakemake
Writing and debugging Snakemake rules
Integrating Python scripts into workflows
Submitting Snakemake jobs to a compute cluster
Prompt engineering and iterative debugging with LLMs

Workshop Structure

Setup: Clone your workshop repository from GitHub Classroom, set up your environment, and review your previous scripts.
Workflow Design: Use LLMs to help you design a Snakemake workflow for your analysis pipeline.
Implementation: Prompt the LLM to help you write Snakemake rules for each analysis step (download, analysis, reporting).
Cluster Execution: Use LLMs to help you generate and debug a job submission script for running your Snakemake workflow on the compute cluster.
Reporting: Summarize your workflow and findings in a short markdown report. All files should be tracked in git and pushed to GitHub Classroom.
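
To make the three implementation steps (download, analysis, reporting) concrete, a minimal Snakefile might look roughly like the sketch below. It is a starting point under stated assumptions rather than a reference solution: the rule names, the `fasta_url` config key, and the `data/genome.fasta` path are hypothetical, while the script and results paths are taken from the example deliverables. Your own workflow will likely differ.

```python
# Snakefile (sketch). Rule names, the `fasta_url` key, and data/genome.fasta
# are illustrative assumptions, not part of the workshop materials.
configfile: "config.yaml"  # assumed to contain a line like: fasta_url: "<URL>"

# Default target: Snakemake works backward from this file to find the
# rules it needs to run.
rule all:
    input:
        "results/summary.md"

# Step 1: download the ancient genome FASTA file.
rule download_genome:
    output:
        "data/genome.fasta"
    params:
        url=config["fasta_url"]
    shell:
        "curl -L -o {output} {params.url}"

# Step 2: compute sequence statistics with a Python script.
rule sequence_stats:
    input:
        "data/genome.fasta"
    output:
        "results/gc_content.txt"
    script:
        "scripts/analyze_mtDNA.py"

# Step 3: wrap the statistics in a short markdown summary.
rule summarize:
    input:
        "results/gc_content.txt"
    output:
        "results/summary.md"
    shell:
        "(echo '# Sequence statistics'; echo; cat {input}) > {output}"
```

Before running anything, `snakemake -n` performs a dry run that lists the jobs Snakemake would execute, which is a quick way to check that the rules connect as intended. `snakemake --cores 1` then runs the workflow locally.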

Sample Initial Prompt

I need to refactor my genome analysis scripts into a Snakemake workflow that downloads an ancient genome FASTA file, computes sequence statistics, and summarizes the results. Please generate a Snakefile and example rule for running the analysis on a compute cluster.

Deliverables

By the end of this workshop, you will have created the following artifacts:

Snakemake Workflow Files
- A complete and well-documented Snakefile and any config or rule files needed for your workflow
- Example: Snakefile, config.yaml, rules/
Integrated Python Scripts
- Python scripts for sequence analysis, adapted for use within the Snakemake workflow
- Example: scripts/analyze_mtDNA.py
Cluster Submission Script
- A script or command for running your Snakemake workflow on the compute cluster (e.g., with qsub or Snakemake’s cluster integration)
- Example: run_snakemake.qsub
Workflow Output Results
- Output files generated by the workflow, including sequence statistics and any summary files
- Example: results/summary.md, results/gc_content.txt
Brief Report
- A short markdown report (1–2 paragraphs) summarizing your workflow design, results, and any challenges encountered. This should be clear enough to share with your PI or collaborators.
- Example: workflow_report.md
Version-Controlled Repository
- All code and workflow files should be tracked in your git repository and pushed to GitHub Classroom. Version control keeps your work reproducible and easy to share with instructors and collaborators.
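
As an illustration of the Integrated Python Scripts deliverable, the sketch below shows one way a script such as scripts/analyze_mtDNA.py could be adapted so that it runs both under Snakemake's `script:` directive and by hand. The function names, the tab-separated output format, and the choice to compute GC content over called bases only (ignoring N) are assumptions made for this example, not requirements of the workshop.

```python
"""Compute basic per-record sequence statistics for a FASTA file.

Hypothetical sketch of scripts/analyze_mtDNA.py; the function names and
output format are illustrative assumptions.
"""
import sys
from pathlib import Path


def read_fasta(path):
    """Return a dict mapping each record ID to its uppercase sequence."""
    records = {}
    current_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:].split()
                # Use the first word of the header as the record ID.
                current_id = header[0] if header else f"record_{len(records) + 1}"
                records[current_id] = []
            elif current_id is not None:
                records[current_id].append(line.upper())
    return {name: "".join(chunks) for name, chunks in records.items()}


def sequence_stats(seq):
    """Return length, GC fraction over A/C/G/T only, and the number of Ns."""
    counts = {base: seq.count(base) for base in "ACGTN"}
    called = counts["A"] + counts["C"] + counts["G"] + counts["T"]
    gc = (counts["G"] + counts["C"]) / called if called else 0.0
    return {"length": len(seq), "gc_fraction": gc, "n_count": counts["N"]}


def write_stats(fasta_path, out_path):
    """Write one tab-separated line of statistics per FASTA record."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w") as out:
        out.write("record\tlength\tgc_fraction\tn_count\n")
        for name, seq in read_fasta(fasta_path).items():
            s = sequence_stats(seq)
            out.write(
                f"{name}\t{s['length']}\t{s['gc_fraction']:.4f}\t{s['n_count']}\n"
            )


if "snakemake" in globals():
    # Under a rule's `script:` directive, Snakemake injects a `snakemake`
    # object that exposes the rule's input and output files.
    write_stats(snakemake.input[0], snakemake.output[0])
elif __name__ == "__main__" and len(sys.argv) == 3:
    # Run by hand: python scripts/analyze_mtDNA.py genome.fasta gc_content.txt
    write_stats(sys.argv[1], sys.argv[2])
```

Keeping the file handling in `write_stats` and the arithmetic in `sequence_stats` means the same functions can be tested on a small FASTA file before the script is wired into a rule.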