PEARL stands for Protein Engineering Adapter via Reinforcement Learning.
This repository explores PETase-family sequence design through remote generation/training on Tinker plus local scoring, selection, mining, and evaluation logic. It is an experimental research codebase, not a validated product.
- Active workspace map after the April 28 cleanup: REPO_MAP.md
- Historical sponsor-facing summary: WHITEPAPER.md
- Repo structure and supported surface: docs/overview.md
- Supported workflows: docs/workflows.md
- Operator notes: docs/operations.md
- Current scientific status: docs/science.md
- Manifold-construction pivot: docs/manifold_construction.md
- Experiment configs: configs/experiments/README.md
- Full experimental history: notes/LABNOTES.md
May 2026 heat check: the project has enough working components to justify the next paid smoke, but not enough evidence to claim the protein-design thesis is solved. The strongest direction is a small PLM plus preference/RL loop: natural PETase/cutinase records as positives, Phase 7 fold-failed generated artifacts as hard negatives, then compact post-train generation and structural validation before any larger library expansion.
April 29, 2026 DPO correction: Phase 7 generated/local-library sequences are no longer allowed on the chosen side of the paid-run DPO dataset. The current local Phase 8 build uses reviewed natural PETase/cutinase records as chosen positives and demotes the fold-failed Phase 7 generated panel to hard negatives.
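The chosen/rejected rule described above can be sketched as a small pairing routine. This is a minimal illustration, not the repo's actual dataset builder: the record fields (`id`, `source`, `sequence`), the function name `build_dpo_pairs`, and the prompt text are all hypothetical, and the sequences are toy placeholder strings.

```python
import random


def build_dpo_pairs(natural_records, hard_negative_records, prompt, seed=0):
    """Pair each natural positive (chosen) with one random hard negative (rejected)."""
    if not hard_negative_records:
        raise ValueError("need at least one hard negative")
    rng = random.Random(seed)  # fixed seed keeps the build reproducible
    pairs = []
    for record in natural_records:
        # Hypothetical "source" field: generated sequences must never be chosen.
        if record["source"] != "natural":
            raise ValueError(f"non-natural record on chosen side: {record['id']}")
        negative = rng.choice(hard_negative_records)
        pairs.append(
            {
                "prompt": prompt,
                "chosen": record["sequence"],
                "rejected": negative["sequence"],
            }
        )
    return pairs


# Toy placeholder strings, not real PETase/cutinase sequences.
naturals = [
    {"id": "nat-1", "source": "natural", "sequence": "MKTAYIAK"},
    {"id": "nat-2", "source": "natural", "sequence": "MSTLLVAG"},
]
negatives = [{"id": "gen-1", "source": "phase7_generated", "sequence": "MGGGGGGG"}]
pairs = build_dpo_pairs(naturals, negatives, prompt="Design a PETase-family sequence.")
```

Raising on a non-natural chosen record, rather than silently skipping it, makes a violation of the April 29 rule fail the build instead of shrinking the dataset unnoticed.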
April 28, 2026 cleanup note: the active workspace is now focused on Phase 8 DPO readiness. The current 10k DPO dataset lives locally in data/phase8_dpo/, its structural evidence lives in reports/analysis/phase7_local_library_v1/, and old run outputs/scripts/configs were moved to the local ignored archive at archive/2026-04-28-labyrinth-cleanup/. See REPO_MAP.md and notes/LABNOTES.md for the current map and latest scientific status.
As of April 23, 2026:
- merged stage-b-lite mined pool:
  - 1,597,184 raw candidates
  - 179 exact-unique functional hits
  - 54 exact-unique family-faithful hits
  - 197 lineage clusters at 0.85
- best historical strict branch: strict-core-v7-repair
  - stage-A and stage-B-lite trained cleanly
  - full robustness stayed narrow:
    - p12: [0, 0, 0]
    - p24: [0, 2, 0]
    - p48: [0, 3, 1]
  - the main miss was prompt coverage breadth, with only 4 / 48 prompts hit at p48
- negative strict/repair evidence:
  - strict-core-v8-coverage failed to broaden v7, regressed at p12/p24, and lost family-faithful robustness
  - the April 21/22 v9 p12/p24 repair rescue found 79 loose high-ESM survivors but 0 strict shortlist rows and 0 retrain positives
  - local Gemma mining and historical local-exploit scans did not expose a usable passive basin
- scaffold-first manifold pivot:
  - Phase 1 built a local scaffold bank with 12,619 unique sequences, 4,893 family-manifold scaffolds, 3,769 strict-manifold scaffolds, 79 recovered v9 negatives, and 274 strict candidate positives
  - Phase 2 built and ESM-scored a 10,000-candidate same-length strict-manifold frontier; all candidates scored >=95
  - Phase 2 selection passed readiness with 230 selected strict candidates across 79 parent scaffolds, 8 lengths, and 100 two-mutants
- manifold curriculum outcomes:
  - v1: nonzero transfer but failed breadth; p12 passed with tier-2 hits [1, 2, 0], while p24 failed with [0, 1, 0]
  - v1.1: p24-only gate failed cleanly with 0 tier-2 hits and 0 raw single-motif plus geometry plus ESM candidates
  - v1.2: length-retargeted repair distillation recovered real but narrow signal: 3 functional hits, 2 family-faithful hits, and 3 / 24 prompt coverage
  - v1.3: support-prompt widening regressed to [0, 0, 1] tier-2 hits, 1 / 24 prompt coverage, and 0 family-faithful hits
- current rule:
  - do not launch another paid manifold v1.x replay, stage-B, p48, or broad mining tranche from this branch line
  - the manifold v2 objective panel is now built at reports/analysis/manifold_v2_objective_panel_20260424/
  - use its 2 v1.2 family-faithful hits as positive anchors and 45 v1.3 stable-only / geometry-only finalists as hard negatives
  - use its 305 v9/v1.1 drift negatives and 190 historical support positives to shape the next offline constructor
  - the first v2 offline constructor selected 64 hard-gated pre-ESM candidates across 38 parents and 8 exact lengths
  - the expanded v2 constructor scored 192 / 192 candidates above ESM 85
  - final reselection produced 34 strict/core/ESM candidates across 18 parent source keys and 14 exact lengths
  - the finalized v2 curriculum has 42 rows: 34 selected candidates plus 8 purebred anchors
  - the v2 p24/c128 diagnostic completed but failed durability with tier-2 hits [0, 1, 0], prompt coverage 1 / 24, and 0 family-faithful hits
  - active next branch is v2.1 bridge-weighted replay at reports/curriculum/manifold_v21_20260424/manifold_v21_bridge_curriculum.jsonl
  - v2.1 has 71 rows: 28 v2 strict-breadth anchors, 15 measured bridge replay rows, 12 support prompt anchors, 12 historical family-faithful anchors, and 4 purebred anchors
  - current paid scope is a tiny v2.1 stage-A train plus p24-only diagnostic gate; no stage-B, p48, or broad mining from this artifact
  - keep paid mining as a small diagnostic only if the offline v2 redesign stalls
See docs/science.md for the current research readout and primary artifact links.
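The per-seed hit vectors quoted in the status list (for example p48: [0, 3, 1] for strict-core-v7-repair) can be summarized with a few lines of Python. This is a sketch under stated assumptions: it reads each vector as tier-2 hits per seed, and the "every seed must produce a hit" rule is an illustrative stand-in, not the repo's actual durability gate. The function names are hypothetical.

```python
# Seed vectors for strict-core-v7-repair, copied from the status list above.
V7_SEED_VECTORS = {"p12": [0, 0, 0], "p24": [0, 2, 0], "p48": [0, 3, 1]}


def summarize_seed_vector(hits_per_seed):
    """Collapse a per-seed hit vector into totals and an illustrative pass flag."""
    seeds_with_hits = sum(1 for hits in hits_per_seed if hits > 0)
    return {
        "total_hits": sum(hits_per_seed),
        "seeds_with_hits": seeds_with_hits,
        # Assumed rule for illustration only: all seeds must hit at least once.
        "durable": seeds_with_hits == len(hits_per_seed),
    }


def prompt_coverage(prompts_hit, prompt_count):
    """Fraction of prompts in a suite that produced at least one hit."""
    return prompts_hit / prompt_count


summaries = {suite: summarize_seed_vector(v) for suite, v in V7_SEED_VECTORS.items()}
p48_coverage = prompt_coverage(4, 48)  # the "4 / 48 prompts hit at p48" figure
```

Under this assumed rule none of the three suites is durable, and the p48 coverage works out to roughly 8%, which is consistent with the "stayed narrow" reading above.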
The supported reusable workflows are:
- mine
- postprocess
- analyze
- build-dataset
- repair
- train
- robustness
- reranker
- manifold-construction (Phase 1 and Phase 2 selection implemented)
The details and entrypoints for those workflows live in docs/workflows.md.
Versioned strict_core_* and strict_first_union wrappers now live under the archive and are exposed at their old scripts/ paths through symlinks for continuity with the historical record. They are not the supported workflow surface anymore. The supported control flow is now config-driven and library-backed through src/pearl.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Pinned local/dev requirements are in requirements.txt:
- tinker==0.16.1
- torch==2.10.0
- transformers==5.2.0
- tiktoken==0.12.0
- numpy==2.4.2
- safetensors==0.7.0
- sentencepiece==0.2.1
- rapidfuzz==3.14.5
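To confirm that an activated venv actually matches these pins, a short standard-library check along the following lines may help. It is a sketch, not a script shipped with the repo: `parse_pins` and `find_mismatches` are hypothetical helper names, and the parser only understands exact `name==version` lines.

```python
from importlib import metadata


def parse_pins(lines):
    """Parse exact 'name==version' pins, skipping blanks and comments."""
    pins = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, version = line.partition("==")
        if not sep:
            continue  # this sketch ignores anything that is not an exact pin
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, lookup=metadata.version):
    """Return {name: (wanted, installed_or_None)} for every pin that is not met."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


example_pins = parse_pins(["# local/dev baseline", "tinker==0.16.1", "", "torch==2.10.0"])
```

A typical use would be `find_mismatches(parse_pins(open("requirements.txt")))`, where an empty result means every pinned package is installed at the pinned version.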
Production CUDA environments used on Nebius are separate from the local/dev baseline.
- main.py: current generation/eval engine with shared helpers now extracted into src/pearl
- src/pearl/family.py: family scoring and catalytic geometry checks
- src/pearl/esm_proxy.py: local ESM proxy scorer
- src/pearl/: reusable library surface for paths, detached jobs, reports, smoke gates, curricula, and run-record assembly
- scripts/: supported workflow entrypoints plus archived compatibility symlinks, including scripts/manifold_construction_experiment.py
- reports/: local run artifacts
- data/: prompts, records, and family datasets
The repo boundary is now explicit:
- reusable engine and shared helpers live under src/pearl
- supported workflow runners are config-driven entrypoints
- historical campaign wrappers are archived and kept only through compatibility symlinks
python scripts/run_ablation.py \
--name my-eval-run \
--model moonshotai/Kimi-K2.5 \
--variant baseline \
--prompts-path /abs/path/prompts.jsonl \
--reference-records-path /abs/path/petase_records.jsonl \
--prompt-count 24 \
--candidate-sample-count 128 \
--second-stage-top-k 16 \
--second-stage-esm-weight 0.4 \
--second-stage-motif-weight 0.3 \
--second-stage-geometry-weight 0.3 \
--second-stage-template-weight 0.05 \
--init-state-path tinker://.../weights/... \
--eval-only \
--resume \
--capture-candidate-audit \
--seed 41

python scripts/run_robustness_suite.py \
--name my-robustness \
--init-state-path tinker://.../weights/... \
--model moonshotai/Kimi-K2.5 \
--variant baseline \
--suite-sizes 12,24,48 \
--temperatures 0.8 \
--seeds 41,53,67 \
--candidate-sample-count 128 \
--second-stage-top-k 16 \
--second-stage-esm-weight 0.4 \
--second-stage-motif-weight 0.3 \
--second-stage-geometry-weight 0.3 \
--second-stage-template-weight 0.05

Use this path when remote Tinker sampling dominates wall clock and you want to decouple:
- stockpile candidate pools first
- then run H100 ESM rescoring/finalization only on completed pools
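The rescoring/finalization step is steered by the `--second-stage-*` weights and `--second-stage-top-k` passed in the commands above. One plausible reading is a weighted sum over per-candidate component scores followed by a top-k cut, sketched below. This is an assumption for illustration: the actual combination rule lives in the repo's engine and may differ, the `Candidate` fields are hypothetical, and component scores are assumed to be normalized to [0, 1].

```python
from dataclasses import dataclass

# Weights and top-k copied from the example commands above.
WEIGHTS = {"esm": 0.4, "motif": 0.3, "geometry": 0.3, "template": 0.05}
TOP_K = 16


@dataclass(frozen=True)
class Candidate:
    sequence: str
    esm: float       # assumed normalized ESM proxy score
    motif: float     # assumed motif-match score
    geometry: float  # assumed catalytic-geometry score
    template: float  # assumed template-similarity score


def second_stage_score(candidate, weights=WEIGHTS):
    """Weighted sum of component scores (an assumed combination rule)."""
    return sum(weight * getattr(candidate, name) for name, weight in weights.items())


def rerank(candidates, top_k=TOP_K, weights=WEIGHTS):
    """Sort by descending second-stage score and keep the top_k candidates."""
    ordered = sorted(candidates, key=lambda c: second_stage_score(c, weights), reverse=True)
    return ordered[:top_k]


# Toy placeholder candidates, not real sequences or scores.
pool = [
    Candidate("AAA", esm=1.0, motif=0.0, geometry=0.0, template=0.0),
    Candidate("BBB", esm=0.0, motif=1.0, geometry=1.0, template=0.0),
    Candidate("CCC", esm=0.0, motif=0.0, geometry=0.0, template=1.0),
]
top_two = rerank(pool, top_k=2)
```

Note that the example weights sum to 1.05 rather than 1.0, so under this reading the score is a relative ranking signal rather than a normalized probability-like value.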
Sync the bundle to a Nebius H100 VM from your Mac:
bash scripts/sync_topoff1m_a_eval_bundle.sh <VM_IP>

Set up the VM once:
ssh -i ~/.ssh/nebius_h200 svdr@<VM_IP>
bash ~/work/tinker/scripts/setup_nebius_h100_eval_env.sh
export TINKER_API_KEY=...

Launch ultra on the VM:
export STOCKPILE_JOBS=4
export STOCKPILE_RETRIES=2
bash ~/work/tinker/scripts/launch_topoff1m_a_robustness_h100.sh ultra

Queue balanced only after ultra is actually complete:
python3 ~/work/tinker/scripts/launch_detached_job.py \
--job-name pearl-topoff1m-a-balanced-robustness-2phase-h100-queue \
--cwd ~/work/tinker \
--metadata-path ~/work/tinker/reports/logs/pearl-topoff1m-a-balanced-robustness-2phase-h100-queue.json \
--log-path ~/work/tinker/reports/logs/pearl-topoff1m-a-balanced-robustness-2phase-h100-queue.log \
--env "TINKER_API_KEY=$TINKER_API_KEY" \
--env "STOCKPILE_JOBS=$STOCKPILE_JOBS" \
--env "STOCKPILE_RETRIES=$STOCKPILE_RETRIES" \
-- bash -lc 'while [ ! -f "$HOME/work/tinker/reports/robustness/pearl-topoff1m-a-ultra-robustness-2phase-h100-p12p24p48-t08-s41s53s67/robustness_summary.json" ]; do sleep 60; done; bash ~/work/tinker/scripts/launch_topoff1m_a_robustness_h100.sh balanced'

Operational notes:
- The queue gate should watch for robustness_summary.json, not the parent PID.
- The VM venv needs sentencepiece, protobuf, and tiktoken installed or some stockpile lanes can fail during tokenizer init.
- run_robustness_two_phase.py now supports:
  - --stockpile-jobs
  - --stockpile-retries
- Kill the VM only after both of these files exist:
  - reports/robustness/pearl-topoff1m-a-ultra-robustness-2phase-h100-p12p24p48-t08-s41s53s67/robustness_summary.json
  - reports/robustness/pearl-topoff1m-a-balanced-robustness-2phase-h100-p12p24p48-t08-s41s53s67/robustness_summary.json
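The "watch for the summary file, not the PID" rule can also be expressed in Python, for example as a guard before tearing down the VM. This is a minimal sketch that mirrors the 60-second poll of the shell loop above; `missing_summaries` and `wait_for_summaries` are hypothetical names, not functions from src/pearl.

```python
import time
from pathlib import Path


def missing_summaries(paths):
    """Return the summary paths that do not exist yet as regular files."""
    return [Path(p) for p in paths if not Path(p).is_file()]


def wait_for_summaries(paths, poll_seconds=60.0, timeout_seconds=None):
    """Poll until every summary file exists.

    Returns True once all files are present, or False if timeout_seconds
    elapses first. With timeout_seconds=None it waits indefinitely.
    """
    start = time.monotonic()
    while True:
        if not missing_summaries(paths):
            return True
        if timeout_seconds is not None and time.monotonic() - start >= timeout_seconds:
            return False
        time.sleep(poll_seconds)
```

Calling `wait_for_summaries([ultra_summary, balanced_summary], timeout_seconds=0)` gives a one-shot check of whether it is safe to kill the VM, where the two arguments stand for the ultra and balanced robustness_summary.json paths listed above.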
python scripts/check_retrain_readiness.py \
reports/ablations/.../candidate_audit.json \
--selected-only

python scripts/run_raft_wave.py \
--name wave1 \
--init-state-path tinker://.../weights/... \
--total-prompt-count 200 \
--shard-count 4 \
--candidate-sample-count 256 \
--second-stage-top-k 16 \
--temperature 0.8

Most runs produce:
- report.json: step-level selected output records
- summary.json: aggregate run metrics
- candidate_audit.json: full per-candidate pool (if enabled)
Robustness suites additionally produce:
- runs_manifest.json
- robustness_summary.json with durability-gate pass/fail and seed vectors
- Sequences from this repo are computational outputs only.
- ESM proxy is a lightweight stability proxy, not a structural truth model.
- Passing local gates does not imply biochemical activity or wet-lab success.
Apache License 2.0. See LICENSE.