scripts/build_backgrounds_chrombpnet.py: --part variants/baselines --assay X overwrites the interim NPZ #71

@lucapinello

Background

`scripts/build_backgrounds_chrombpnet.py` writes `chrombpnet_effect_cdfs_interim.npz` and `chrombpnet_baseline_cdfs_interim.npz` via plain `np.savez_compressed(interim_path, ...)`, which overwrites whatever was already at that path. So the documented two-pass build flow:

```bash
python scripts/build_backgrounds_chrombpnet.py --part variants --assay ATAC_DNASE ...
python scripts/build_backgrounds_chrombpnet.py --part variants --assay CHIP ...
```

…silently loses the 42 ATAC/DNase tracks: the second invocation replaces the interim with 744 CHIP tracks. The subsequent `--part merge` only sees the 744 and produces a 744-track final NPZ instead of the expected 786.
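The overwrite can be shown in isolation. This is a minimal sketch, not the script's actual code: the array keys `track_ids` and `cdfs`, the CDF width of 8, and the function name `demo_overwrite` are assumptions made for illustration, since the real NPZ layout is not shown here.

```python
import os
import tempfile

import numpy as np


def demo_overwrite():
    """Write two passes to one interim path and return the surviving track count."""
    with tempfile.TemporaryDirectory() as tmp:
        interim_path = os.path.join(tmp, "chrombpnet_effect_cdfs_interim.npz")

        # First pass: 42 ATAC/DNase tracks (key names are hypothetical).
        np.savez_compressed(
            interim_path, track_ids=np.arange(42), cdfs=np.zeros((42, 8))
        )

        # Second pass: 744 CHIP tracks written to the same path.
        # np.savez_compressed replaces the file; nothing is merged.
        np.savez_compressed(
            interim_path, track_ids=np.arange(42, 786), cdfs=np.zeros((744, 8))
        )

        with np.load(interim_path) as data:
            return int(data["track_ids"].shape[0])


if __name__ == "__main__":
    print(demo_overwrite())
```

Only the second pass's rows are left in the file, which matches the 744-track result described above.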

Reproduction

This happened during the 2026-04-30 chrombpnet CDF rebuild (PR #70). Following the handoff at `audits/2026-04-29_chrombpnet_cdf_rebuild/HANDOFF.md` verbatim produced a 744-track final NPZ; the post-merge spot-check caught the 744 ≠ 786 mismatch. Recovered by re-running Phase 1 with the existing 744-CHIP NPZ in place and finishing with `--part merge-incremental` (~50 min cost).

Suggested fix

Two reasonable options:

  1. Append-on-write within an interim: read the existing interim if present, merge its track rows with the new run's rows, save back. Mirrors how `merge-incremental` already handles the final NPZ via `PerTrackNormalizer.append_tracks`.
  2. Refuse to overwrite without `--force`: detect an existing interim and error with a clear message ("found interim from a prior run with different tracks; pass --force to overwrite, or run `--part merge` to consume it first").

Option 1 makes the documented two-pass flow Just Work; option 2 is more conservative but at least avoids silent data loss.
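Both options could be sketched in one helper. Everything here is an assumption for illustration: the function name `save_interim`, the `mode` parameter, and the `track_ids` / `cdfs` keys are hypothetical, and the sketch does not call `PerTrackNormalizer.append_tracks` because its signature is not shown in this issue.

```python
import os

import numpy as np


def save_interim(interim_path, track_ids, cdfs, mode="append"):
    """Save interim rows. mode: 'append' (option 1), 'refuse' (option 2), 'force'."""
    if mode not in ("append", "refuse", "force"):
        raise ValueError(f"unknown mode: {mode!r}")

    track_ids = np.asarray(track_ids)
    cdfs = np.asarray(cdfs)

    if os.path.exists(interim_path) and mode != "force":
        if mode == "refuse":
            raise FileExistsError(
                f"found interim from a prior run at {interim_path}; "
                "pass --force to overwrite, or run --part merge to consume it first"
            )
        # Option 1: merge the existing rows with the new run's rows.
        with np.load(interim_path) as old:
            old_ids, old_cdfs = old["track_ids"], old["cdfs"]
        # Rows from the new run replace existing rows with the same track id.
        keep = ~np.isin(old_ids, track_ids)
        track_ids = np.concatenate([old_ids[keep], track_ids])
        cdfs = np.concatenate([old_cdfs[keep], cdfs], axis=0)

    # Write to a temporary file, then rename, so a crash mid-write
    # cannot leave a truncated interim behind.
    tmp_path = interim_path + ".tmp.npz"
    np.savez_compressed(tmp_path, track_ids=track_ids, cdfs=cdfs)
    os.replace(tmp_path, interim_path)
    return len(track_ids)
```

With `mode="append"`, a 42-track pass followed by a 744-track pass leaves 786 rows in the interim. With `mode="refuse"`, the second pass raises instead of replacing the file. Letting new rows win on a duplicate track id is a design choice of this sketch; a real fix might prefer to error on duplicates.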

Why it matters

Anyone following the handoff's documented invocations as-is loses data without realising it. The failure is silent: the script writes a smaller file successfully, and the merge "succeeds" with the wrong track count. Only a track-count check at the end of merge surfaces it.
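A track-count guard at the end of merge might look like the following. This is a sketch under assumptions: the helper name `check_track_count` and the `track_ids` key are hypothetical, and the expected count would come from wherever the build defines its track list (786 = 42 + 744 in the run above).

```python
import numpy as np


def check_track_count(npz_path, expected, key="track_ids"):
    """Raise if the NPZ does not hold the expected number of tracks."""
    with np.load(npz_path) as data:
        found = int(data[key].shape[0])
    if found != expected:
        raise ValueError(
            f"{npz_path}: expected {expected} tracks, found {found}"
        )
    return found
```

Called with `expected=786` on the 744-track output, this would fail the merge loudly instead of leaving the mismatch to a manual spot-check.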
