Merged
36 commits
- 80d8420 feat(data): add validation dataset split with motifs (ssenan, Jul 31, 2024)
- 90354a7 fix(library): remove unused refactor folder (ssenan, Jul 31, 2024)
- df756b8 feat(training): remove accelerate dependency for accelerated training (ssenan, Aug 1, 2024)
- dcf8bcf clean(unet): clean up imports after pyproject.toml update (ssenan, Aug 1, 2024)
- dc102fc feat(configs): add base hydra configs (ssenan, Aug 1, 2024)
- 36bd3c2 feat(data): add fn to extract training/val dataset (ssenan, Aug 1, 2024)
- 29d3a60 feat(sampling): write all generated sequences to output folder (ssenan, Aug 1, 2024)
- 0812995 clean(training): remove old train loop (ssenan, Aug 1, 2024)
- fe52b2f feat(training): distributed training setup complete, still needs test… (ssenan, Aug 1, 2024)
- fc5324c fix(data): update encode_data file to include validation split (ssenan, Aug 1, 2024)
- 37caf59 fix(config): change distributed numworkers to 2 (ssenan, Aug 1, 2024)
- e6a34c7 fix(diffusion): fix device of noise generation (ssenan, Aug 1, 2024)
- 6894f36 fix(training): ensure model is on device when training on single gpu (ssenan, Aug 1, 2024)
- 95a4228 fix(training): ensure train_loop tensors on cuda (ssenan, Aug 1, 2024)
- 25aeded fix(training): allow mixed precision for sequence tensor (ssenan, Aug 1, 2024)
- 67b019c fix(training): allow defined precision to map to sequence tensor (ssenan, Aug 1, 2024)
- 327d9be fix(training): ensure val loss precision matches model precision (ssenan, Aug 1, 2024)
- 21a22c2 fix(data): ensure correct dtype, simplify torch dataset creation (ssenan, Aug 2, 2024)
- 1bfa727 fix(data): remove motif related files/functions, to be evaluated exte… (ssenan, Aug 7, 2024)
- 197caf5 fix(logging): keep loss vars the same as previous runs (ssenan, Aug 7, 2024)
- 6e654ea feat(data): add individual files for all cell lines (ssenan, Aug 7, 2024)
- d4879b8 chore(data): update data configs to match motif dict removal (ssenan, Aug 8, 2024)
- 419c316 feat(uv): resolve environment by uv instead of conda (ssenan, Apr 15, 2025)
- ac4c809 fix(mypy): remove mypy (ssenan, Apr 15, 2025)
- 5133423 fix(data): ensure shuffling properly handled in both single & multi gpu (ssenan, Apr 16, 2025)
- 94ee24c fix(dep): add tqdm to dependency list (ssenan, Apr 17, 2025)
- b86617f feat(training): training validation patience parameter, min training … (ssenan, Apr 17, 2025)
- 9dbf0ab feat(uv): add basic testing / formatting with uv (ssenan, Apr 17, 2025)
- 1e01c05 fix(training): remove sequence length parameter from default config (ssenan, Apr 17, 2025)
- 39e0fe0 fix(training): update default config to use train rather than train_d… (ssenan, Apr 17, 2025)
- 8be2493 fix(training): correctly resolve DDP training devices (ssenan, Apr 17, 2025)
- d155632 feat(sampling): add default config for sampling procedure (ssenan, Apr 17, 2025)
- b42c3ae feat(sampling): simple yaml for 1000bp sequence generation (ssenan, Apr 17, 2025)
- 1f8907e feat(readme): add basic readme for training / sampling + overview of … (ssenan, Apr 17, 2025)
- f3592f2 Merge branch 'pinellolab:main' into publish (ssenan, Apr 17, 2025)
- fb96e31 fix(data): remove unused cell type-specific datasets for k562, hepg2,… (ssenan, Apr 18, 2025)
205 changes: 18 additions & 187 deletions README.md
@@ -22,212 +22,43 @@

---

## Introduction

DNA-Diffusion is a diffusion-based model for generating 200 bp cell type-specific synthetic regulatory elements.

## Abstract
<div align="center">
<img src="docs/images/dnadiffusion.png" width="600"/>
</div>

The Human Genome Project has laid bare the DNA sequence of the entire human genome, revealing the blueprint for tens of thousands of genes involved in a plethora of biological processes and pathways.
In addition to this (coding) part of the human genome, DNA contains millions of non-coding elements involved in the regulation of said genes.

Such regulatory elements control the expression levels of genes, in a way that is, at least in part, encoded in their primary genomic sequence.
Many human diseases and disorders are the result of genes being misregulated.
As such, being able to control the behavior of such elements, and thus their effect on gene expression, offers the tantalizing opportunity of correcting disease-related misregulation.

Although such cellular programming should in principle be possible through changing the sequence of regulatory elements, the rules for doing so are largely unknown.
A number of experimental efforts have been guided by preconceived notions and assumptions about what constitutes a regulatory element, essentially resulting in a "trial and error" approach.

Here, we instead propose to use a large-scale data-driven approach to learn and apply the rules underlying regulatory element sequences, applying the latest generative modelling techniques.

## Introduction and Prior Work

The goal of this project is to investigate the application and adaptation of recent diffusion models (see https://lilianweng.github.io/posts/2021-07-11-diffusion-models/ for a nice intro and references) to genomics data. Diffusion models are powerful models that have been used for image generation (e.g. Stable Diffusion, DALL-E) and music generation (recent versions of the Magenta project) with outstanding results.
A particular model formulation called "guided" diffusion allows the generative process to be biased toward a particular direction when text or continuous/discrete labels are provided during training. This allows the creation of "AI artists" that, based on a text prompt, can create beautiful and complex images (a lot of examples here: https://www.reddit.com/r/StableDiffusion/).

Some groups have reported the possibility of generating synthetic DNA regulatory elements in a context-dependent system, for example, cell-specific enhancers.
(https://elifesciences.org/articles/41279 ,
https://www.biorxiv.org/content/10.1101/2022.07.26.501466v1)

### Step 1: generative model

We propose to develop models that can generate cell-type-specific or context-specific DNA sequences with certain regulatory properties based on an input text prompt.
For example:

- "A sequence that will correspond to open (or closed) chromatin in cell type X"

- "A sequence that will activate a gene to its maximum expression level in cell type X"

- "A sequence active in cell type X that contains binding site(s) for the transcription factor Y"

- "A sequence that activates a gene in liver and heart, but not in brain"

### Step 2: extensions and improvements

Beyond individual regulatory elements, so-called "Locus Control Regions" are known to harbour multiple regulatory elements in specific configurations, working in concert to implement more complex regulatory rulesets. Drawing parallels with "collaging" approaches, in which multiple stable diffusion steps are combined into one final (graphical) output, we want to apply this notion to DNA sequences with the goal of designing larger regulatory loci. This is a particularly exciting and, to our knowledge, hitherto unexplored direction.

Besides creating synthetic DNA, a diffusion model can help understand and interpret regulatory sequence element components and, for instance, be a valuable tool for studying single nucleotide variations (https://www.biorxiv.org/content/10.1101/2022.08.22.504706v1) and evolution (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1502-5).

Taken together, we believe our work can accelerate our understanding of the intrinsic properties of DNA regulatory sequences in normal development and different diseases.

## Proposed framework

For this work we propose to build a Bit Diffusion model based on the formulation proposed by Chen, Zhang and Hinton (https://arxiv.org/abs/2208.04202). This model is a generic approach for generating discrete data with continuous diffusion models. An implementation of this approach already exists and is a potential code base to build upon:

https://github.com/lucidrains/bit-diffusion
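The core trick of Bit Diffusion, treating discrete symbols as real-valued "analog bits", can be sketched for DNA as follows (a minimal illustration assuming a 2-bit base encoding; not necessarily the encoding DNA-Diffusion uses internally):

```python
import numpy as np

# Illustrative 2-bit code for the four bases (an assumption for this sketch).
BASE_TO_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

def encode(seq: str) -> np.ndarray:
    """Map a DNA string to 'analog bits' in {-1, +1}, shape (len(seq), 2)."""
    bits = np.array([BASE_TO_BITS[b] for b in seq], dtype=np.float32)
    return bits * 2.0 - 1.0  # {0,1} -> {-1,+1}

def decode(analog: np.ndarray) -> str:
    """Threshold real-valued model output back to discrete bases."""
    bits = (analog > 0.0).astype(int)
    return "".join(BITS_TO_BASE[tuple(row)] for row in bits)

# A continuous diffusion model trains on noisy versions of the analog bits
# and learns to denoise them; generation ends with a simple threshold.
x = encode("ACGTTGCA")
noisy = x + 0.3 * np.random.default_rng(0).normal(size=x.shape)
```

The point of the formulation is that the diffusion process itself stays entirely continuous; discreteness only reappears at decode time.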

## Tasks and potential roadmap:

- Collecting genomic datasets
- Implementing the guided diffusion based on the code base
- Thinking about the best encoding of biological information for the guided diffusion (e.g. cell type: K562, very strong activating sequence for chromatin, or cell type: GM12878, very open chromatin)
- Plans for validation based on existing datasets or how to perform new biological experiments (we need to think about potential active learning strategies).

## Deliverables

- **Dataset:** compile and provide a complete database of cell-specific regulatory regions (DNase assays) to allow scientists to train and generate different diffusion models based on the regulatory sequences.

- **Models:** Provide a model that can generate regulatory sequences given a specific cell type and genomic context.

- **API:** Provide an API to make it possible to manipulate DNA regulatory models and a visual playground to generate synthetic contextual sequences.

## Datasets

### DHS Index:

Chromatin (DNA + associated proteins) that is actively used for the regulation of genes (i.e. "regulatory elements") is typically accessible to DNA-binding proteins such as transcription factors ([review](https://www.nature.com/articles/s41576-018-0089-8), [relevant paper](https://www.nature.com/articles/nature11232)).
Through the use of a technique called [DNase-seq](https://en.wikipedia.org/wiki/DNase-Seq), we've measured which parts of the genome are accessible across 733 human biosamples encompassing 438 cell and tissue types and states, resulting in more than 3.5 million DNase Hypersensitive Sites (DHSs).
Using Non-Negative Matrix Factorization, we've summarized these data into 16 _components_, each corresponding to a different cellular context (e.g. 'cardiac', 'neural', 'lymphoid').

For the efforts described in this proposal, and as part of an earlier [ongoing project](https://www.meuleman.org/research/synthseqs/) in the research group of Wouter Meuleman,
we've put together smaller subsets of these data that can be used to train models to generate synthetic sequences for each NMF component.

Please find these data, along with a data dictionary, [here](https://www.meuleman.org/research/synthseqs/#material).
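The NMF summary described above can be sketched with scikit-learn on a toy matrix (illustrative only; the real DHS matrix is roughly 3.5M sites by 733 biosamples, and the published index was built with a more involved pipeline):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Toy stand-in for the DHS matrix: rows = DHSs, columns = biosamples,
# entries = accessibility signal.
X = rng.random((300, 40))

# 16 components, as in the published index; each DHS gets non-negative
# loadings over components, and the dominant component gives its
# cellular-context label (e.g. 'cardiac', 'neural', 'lymphoid').
model = NMF(n_components=16, init="nndsvd", max_iter=500, random_state=0)
W = model.fit_transform(X)   # (n_dhs, 16): per-DHS component loadings
H = model.components_        # (16, n_biosamples): per-component biosample profile

dominant_component = W.argmax(axis=1)  # one context label per DHS
```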

### Other potential datasets:

- DNA sequences corresponding to annotated regulatory elements, such as gene promoters or distal regulatory sequences like enhancers, annotated (based on chromatin marks or accessibility) for hundreds of cell types by NHGRI-funded projects such as ENCODE or Roadmap Epigenomics.

- Data from MPRA assays that test the regulatory potential of hundreds of DNA sequences in parallel (https://elifesciences.org/articles/69479.pdf , https://www.nature.com/articles/s41588-021-01009-4 , ... )

- MIAA assays that test the ability of sequences to produce open chromatin within a given cell type.

## Models

### Input modality:

A) Cell type + regulatory element. Ex: brain tumor cell, weak enhancer.
B) Cell type + regulatory element + TF combination (presence or absence). Ex: prostate cell, enhancer, AR (present), TFAP2A (present) and ER (absent).
C) Cell type + TF combination + TF positions. Ex: blood stem cell, GATA2 (present) and ER (absent) + GATA1 (positions 100-108).
D) Sequence carrying a GENETIC VARIANT -> low number of diffusion steps = nucleotide importance prediction.

### Output:

DNA-sequence

**Model size:**
The number of enhancers and biological sequences is no larger than the number of images available in the LAION dataset. The dimensionality of our generated DNA outputs should not exceed 4 bases [A,C,T,G] x ~1 kb. The final models should be no bigger than ~2 GB.

**Models:**
Different models can be created based on the total sequence length.

## APIs

TBD depending on interest

## Paper

**Can the project be turned into a paper? What does the evaluation process for such a paper look like? What conferences are we targeting? Can we release a blog post as well as the paper?**

Yes, we intend to use a mix of in silico generation and experimental validation to study our models' performance on classic regulatory systems (e.g. sickle cell disease and cancer).
Our group and collaborators have a substantial reputation in the academic community, with publications in high-impact journals such as Nature and Cell.

## Resources Requirements

**What kinds of resources (e.g. GPU hours, RAM, storage) are needed to complete the project?**

Our initial model can be trained with small datasets (~1k sequences) in about 3 hours (~500 epochs) on a Colab Pro (24 GB RAM) single Tesla K80 GPU. Based on this, we expect that training this or similar models on the large dataset mentioned above (~3 million sequences, 4 x 200) will require several high-performance GPUs for about 3 months. (Optimization suggestions are welcome!)

## Timeline

**What is a (rough) timeline for this project?**

6 months to 1 year.

## Broader Impact

**How is the project expected to positively impact biological research at large?**

We believe this project will help to better understand genomic regulatory sequences: their composition and the potential regulators acting on them in different biological contexts, with the potential to create therapeutics based on this knowledge.

## Reproducibility

We will use best practices to make sure our code is reproducible and versioned. We will release data-processing scripts and conda/Docker environments so that other researchers can easily run them.

We have several assays and technologies to test the synthetic sequences generated by these models at scale based on CRISPR genome editing or massively parallel reporter assays (MPRA).

## Failure Case

Regardless of the performance of the final models, we believe it is important to test diffusion models on novel domains so that other groups can build on top of our investigations.

## Preliminary Findings

Using the Bit Diffusion model, we were able to reconstruct 200 bp sequences whose motif composition closely matches that of the training sequences. The plan is to add cell-type conditional variables to the model to examine how different regulatory regions depend on cell-specific context.

## Next Steps

- Expand the model length to generate complete regulatory regions (enhancer + gene promoter pairs).
- Use our synthetic enhancers in in-vivo models and check how they can regulate transcriptional dynamics in biological scenarios (besides the MPRA arrays).

## How to contribute

If this project sounds exciting to you, **please join us**!
Join the OpenBioML Discord: https://discord.gg/Y9CN2dUzQJ. We are discussing this project in the **dna-diffusion** channel, and we will provide instructions on how to get involved.

## Known contributors

You can access the contributor list [here](https://docs.google.com/spreadsheets/d/1_nxDI6DIoWbyUDpIDX-tJIILejrJ0kEYrcXXdWlzPvU/edit#gid=1871728801).

-## Development
-
-### Setup environment
-
-We use [hatch](https://hatch.pypa.io/latest/install/) to manage the development environment and production build. It is often convenient to install hatch with [pipx](https://pypa.github.io/pipx/installation/).
-
-### Run unit tests
-
-You can run all the tests with:
-
-```bash
-hatch run test
-```
-
-### Format the code
-
-Execute the following command to apply linting and check typing:
-
-```bash
-hatch run lint
-```
-
-### Publish a new version
-
-You can check the current version with:
-
-```bash
-hatch version
-```
-
-You can bump the version with commands such as `hatch version dev` or `patch`, `minor` or `major`. Or edit the `src/dnadiffusion/__about__.py` file. After changing the version, when you push to github, the Test Release workflow will automatically publish it on Test-PyPI and a github release will be created as a draft.
-
-## Serve the documentation
-
-You can serve the mkdocs documentation with:
-
-```bash
-hatch run docs-serve
-```
-
-This will automatically watch for changes in your code.
+## Installation
+Our preferred package / project manager is [uv](https://github.com/astral-sh/uv).
+To install the necessary packages, run:
+
+```bash
+uv sync
+```
+This will create a virtual environment in `.venv` and install all dependencies listed in the pyproject.toml file.
+
+## Usage
+
+### Sequence Generation
+We provide a basic config file for generating sequences with the diffusion model, producing 1000 sequences per cell type. Generation uses a guidance scale of 1.0 by default; this can be tuned in sample.py via the `cond_weight_to_metric` parameter. To generate sequences, call:
+
+```bash
+uv run sample.py
+```
+
+### Training
+If you would like to train the model, we provide a basic config file for training the diffusion model. To train the model, call:
+
+```bash
+uv run train.py
+```

## Contributors ✨

@@ -255,4 +86,4 @@ Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/d
<img src="https://contributors-img.web.app/image?repo=pinellolab/DNA-Diffusion" />
</a>

This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!
5 changes: 5 additions & 0 deletions configs/data/debug.yaml
@@ -0,0 +1,5 @@
_target_: src.dnadiffusion.data.dataloader.get_dataset
data_path: "data/K562_hESCT0_HepG2_GM12878_12k_sequences_per_group.txt"
saved_data_path: "data/encode_data.pkl"
load_saved_data: True
debug: True
5 changes: 5 additions & 0 deletions configs/data/default.yaml
@@ -0,0 +1,5 @@
_target_: src.dnadiffusion.data.dataloader.get_dataset
data_path: "data/K562_hESCT0_HepG2_GM12878_12k_sequences_per_group.txt"
saved_data_path: "data/encode_data.pkl"
load_saved_data: True
debug: False
5 changes: 5 additions & 0 deletions configs/data/gm12878.yaml
@@ -0,0 +1,5 @@
_target_: src.dnadiffusion.data.dataloader.get_dataset
data_path: "data/GM12878_ENCLB441ZZZ.txt"
saved_data_path: "data/gm12878_encode_data.pkl"
load_saved_data: True
debug: False
5 changes: 5 additions & 0 deletions configs/data/hepg2.yaml
@@ -0,0 +1,5 @@
_target_: src.dnadiffusion.data.dataloader.get_dataset
data_path: "data/HepG2_ENCLB029COU.txt"
saved_data_path: "data/hepg2_encode_data.pkl"
load_saved_data: True
debug: False
5 changes: 5 additions & 0 deletions configs/data/k562.yaml
@@ -0,0 +1,5 @@
_target_: src.dnadiffusion.data.dataloader.get_dataset
data_path: "data/K562_ENCLB843GMH.txt"
saved_data_path: "data/k562_encode_data.pkl"
load_saved_data: True
debug: False
4 changes: 4 additions & 0 deletions configs/diffusion/default.yaml
@@ -0,0 +1,4 @@
_target_: src.dnadiffusion.models.diffusion.Diffusion
timesteps: 50
beta_start: 0.0001
beta_end: 0.2
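The values above define a 50-step noise schedule. Assuming a standard DDPM-style linear schedule (whether the `Diffusion` class interpolates linearly is an assumption of this sketch), the derived quantities look like:

```python
import numpy as np

# Values from the config above.
timesteps, beta_start, beta_end = 50, 0.0001, 0.2

# Linear beta schedule (assumed) and the usual derived terms.
betas = np.linspace(beta_start, beta_end, timesteps)
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)  # alpha_bar_t, monotonically decreasing

# Fraction of original signal remaining at the final step: sqrt(alpha_bar_T).
# With beta_end = 0.2 over only 50 steps, almost all signal is destroyed.
signal_at_T = np.sqrt(alphas_cumprod[-1])
```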
5 changes: 5 additions & 0 deletions configs/model/unet.yaml
@@ -0,0 +1,5 @@
_target_: src.dnadiffusion.models.unet.UNet
dim: 200
channels: 1
dim_mults: [1,2,4]
resnet_block_groups: 4
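The `dim: 200` and `channels: 1` above match a one-hot representation of 200 bp sequences. A sketch of that encoding (illustrative; the repository's actual dataloader may differ):

```python
import numpy as np

def one_hot_encode(seq: str, alphabet: str = "ACGT") -> np.ndarray:
    """One-hot encode a DNA sequence to shape (4, len(seq))."""
    idx = {base: i for i, base in enumerate(alphabet)}
    out = np.zeros((len(alphabet), len(seq)), dtype=np.float32)
    for j, base in enumerate(seq):
        out[idx[base], j] = 1.0
    return out

# A 200 bp sequence becomes a (4, 200) matrix; with channels: 1, a batch
# fed to the UNet would plausibly have shape (batch, 1, 4, 200).
x = one_hot_encode("ACGT" * 50)
```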
@@ -1,3 +1,2 @@
_target_: torch.optim.Adam
_partial_: True
-lr: 0.02
+lr: 1e-4
3 changes: 3 additions & 0 deletions configs/sampling/default.yaml
@@ -0,0 +1,3 @@
checkpoint_path: "model.pt"
sample_batch_size: 10
number_of_samples: 1000
6 changes: 6 additions & 0 deletions configs/train.yaml
@@ -0,0 +1,6 @@
defaults:
- model: unet
- data: default
- optimizer: adam
- diffusion: default
- training: default
6 changes: 6 additions & 0 deletions configs/train_debug.yaml
@@ -0,0 +1,6 @@
defaults:
- model: unet
- data: debug
- optimizer: adam
- diffusion: default
- training: debug
13 changes: 13 additions & 0 deletions configs/training/debug.yaml
@@ -0,0 +1,13 @@
distributed: False
precision: "float32"
num_workers: 1
pin_memory: False
batch_size: 1
sample_batch_size: 1
num_epochs: 2200
min_epochs: 5
patience: 2
log_step: 50
sample_epoch: 50000
number_of_samples: 10
use_wandb: False
13 changes: 13 additions & 0 deletions configs/training/default.yaml
@@ -0,0 +1,13 @@
distributed: False
precision: "bf16"
num_workers: 2
pin_memory: True
batch_size: 120
sample_batch_size: 10
num_epochs: 5000
min_epochs: 2000
patience: 10
log_step: 50
sample_epoch: 500
number_of_samples: 1000
use_wandb: True
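The `patience` and `min_epochs` fields above suggest validation-based early stopping. A minimal sketch of that logic (illustrative; the repository's actual training loop may differ):

```python
def should_stop(val_losses: list[float], patience: int, min_epochs: int) -> bool:
    """Stop once the best validation loss hasn't improved for `patience`
    epochs, but never before `min_epochs` epochs have run."""
    epoch = len(val_losses)
    if epoch < min_epochs:
        return False
    best_epoch = min(range(epoch), key=lambda i: val_losses[i])
    return (epoch - 1 - best_epoch) >= patience

# With patience=2 and min_epochs=5: the best loss occurs at epoch index 2,
# then fails to improve for two epochs, so training stops.
losses = [1.0, 0.8, 0.7, 0.75, 0.72]
stop = should_stop(losses, patience=2, min_epochs=5)
```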
Binary file modified data/encode_data.pkl
File renamed without changes.
Binary file added docs/images/dnadiffusion.png