Implement a faster GFA (1 and 2) parser by fedarko · Pull Request #422 · marbl/MetagenomeScope

fedarko · 2026-04-29T00:30:33Z

Tentatively ready!

Closes #310: this just ignores lines that do not start with a prefix we expect (e.g. hifiasm's A-lines, comments, ...).

This is partial work on #403. Need to update this to handle E-lines and O-lines in GFA 2 files. These are both a bit tricky because they require some additional logic: we want to ignore containment E-lines (and maybe really all non-dovetail E-lines?), and ideally we want to support O-lines with recursive path definitions. But neither of those is intractable.

Also, we might want to eventually reconsider how we handle inconsistent lengths. Technically, in GFA 2 you are allowed to override a segment's length (as given by its sequence) with another length, but that seems so dreadful to me that I really doubt supporting it will do anything but cause problems for us. The all_line_types.gfa2.gfa test case (c/o gfapy) has an example of this, which is currently causing a test error, as expected.

At least from some initial testing on the hg002 graph, this is already much faster, which is encouraging. I'm sure there are ways to speed it up even further.

Code is kind of sloppy; need to add more tests.
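For readers unfamiliar with the #310 behavior, here is a minimal sketch of prefix-based line filtering. It is not the code in this PR: the function name `iter_recognized_lines` and the `EXPECTED_PREFIXES` set are hypothetical, and the set of prefixes the real parser accepts may differ.

```python
# Hypothetical set of GFA 1 line types to keep; the PR's actual set may differ.
EXPECTED_PREFIXES = frozenset({"H", "S", "L", "P"})


def iter_recognized_lines(lines, expected=EXPECTED_PREFIXES):
    """Yield (line_type, fields) for lines whose first tab-separated field
    is in `expected`; everything else is silently skipped."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # Blank lines carry no record.
            continue
        fields = line.split("\t")
        if fields[0] in expected:
            yield fields[0], fields[1:]


example = [
    "H\tVN:Z:1.0\n",
    "# a comment\n",                      # skipped: comment
    "S\ts1\tACGT\n",
    "A\tutg1\t0\t+\tread1\t0\t100\n",    # skipped: hifiasm-style A-line
    "L\ts1\t+\ts1\t-\t0M\n",
]
recognized = list(iter_recognized_lines(example))
```

Checking only the first field keeps the filter cheap, since unrecognized lines are dropped before any further parsing is attempted.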


Per the GFA 2 specification: https://gfa-spec.github.io/GFA-spec/GFA2.html Matches GfaViz' behavior. This was already tested in the all_line_types.gfa2.gfa test case from Gfapy, but just to be safe I added some tests that explicitly verify that this stuff works as I intend.


codecov · 2026-05-08T02:36:00Z

Codecov Report

❌ Patch coverage is 91.81495% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.21%. Comparing base (65d4a56) to head (fe83fcc).

Files with missing lines	Patch %	Lines
metagenomescope/graph/assembly_graph.py	88.23%	2 Missing and 4 partials ⚠️
metagenomescope/parsers.py	97.16%	2 Missing and 2 partials ⚠️
metagenomescope/_cli.py	0.00%	3 Missing ⚠️
metagenomescope/defaults.py	0.00%	3 Missing ⚠️
metagenomescope/descs.py	0.00%	3 Missing ⚠️
metagenomescope/gfa_utils.py	94.44%	1 Missing and 2 partials ⚠️
metagenomescope/layout/layout.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #422      +/-   ##
==========================================
+ Coverage   65.88%   67.21%   +1.32%     
==========================================
  Files          33       34       +1     
  Lines        4042     4233     +191     
  Branches      990     1039      +49     
==========================================
+ Hits         2663     2845     +182     
- Misses       1308     1314       +6     
- Partials       71       74       +3



fedarko added 26 commits April 27, 2026 22:18

fix enough-line-parts err msg

934a327

test multiple-S-lines-for-same-ID case

fa94628

Recognize dp:i tags for GFA segment coverage

6d52580

Make original_graph sanity checking optional #421

ff8ff8e

currently only exposed within code - should add a cli param i guess

Add --dcheck flag; update changelog re #403, etc

384ab6e

Now that this sanity check is fully configurable, closes #421

changelog wording

fcfa6f3

Mention --dcheck in README usage; CLI mcw doc

334ba02

more GFA cov tag tests

62a618c

make GFA S coverage tag precedence match Bandage

ec4088f

probs not very important but whatever at least now it is consistent

more tag dict parsing tests & docs

d6642c2

abstract count / len logic to util func

186b184

update readme re: #423 a bit

4f70352

more accurate comment re: GFA 1 segment regex

2a050c8

update test info in readme

ef3280c

add tentative support for GFA2 E-lines #403

8fd0180

and only visualizing the dovetail edges. not TOO bad to do although i would like to add more tests... and should really tidy up the code wrt self implying edges to reduce duplication

tidy is-edge-self-implying code

c2effe8

update changelog re dovetail-only; README fmt

41eec3a

quote fmt

b161112

fmt

79d21cf

big commit: support GFA 2 O-lines; various tidying

c83d1ee

the way E-lines are expanded in O-lines needs some work: it should look ahead and also avoid adding the target if the edge is followed by a segment. but I need to finally commit this so I can stop worrying abt losing this work

lint

6473b38

update readme re: gfa2 O-lines and E-lines

2ca7b37

proper edge expanding in O-lines

e706c33

Surprisingly this seems to work ok, but I want to abstract it to another function and add a zillion tests because this is surprisingly tricky
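To make the edge-expansion idea concrete, here is a heavily simplified sketch, not the implementation in this PR. It assumes an O-line has already been reduced to a flat list of IDs, that `edges` maps each edge ID to a hypothetical `(source, target)` pair, and it ignores orientations and nested groups entirely. The name `expand_ordered_group` is made up for illustration.

```python
def expand_ordered_group(items, edges):
    """Turn a list of segment/edge IDs into a list of segment IDs.

    An edge contributes its source unless the path already ends there,
    and its target unless the next item is that same segment (which
    would then be added on the next iteration anyway).
    """
    path = []
    for i, item in enumerate(items):
        if item not in edges:
            # Plain segment reference.
            path.append(item)
            continue
        source, target = edges[item]
        if not path or path[-1] != source:
            path.append(source)
        # Look ahead one item to avoid adding the target twice.
        upcoming = items[i + 1] if i + 1 < len(items) else None
        if upcoming != target:
            path.append(target)
    return path


edges = {"e1": ("s1", "s2"), "e2": ("s2", "s3"), "loop": ("s1", "s1")}
```

Under these assumptions, `["s1", "e1", "s2"]`, `["e1"]`, and `["s1", "e1"]` all expand to the same two-segment path, which is the duplication the look-ahead is meant to prevent.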

abstract edge expansion to misc_utils & test

4eedbae

worked out surprisingly well!

fedarko added 3 commits May 7, 2026 22:40

also explicitly disallow P-lines with * IDs

408258a

this is not really allowed in the GFA 1 specification, but let's nip it in the bud anyway ...

distinguishing the "should never"s

2164783

Test O-line recursion

2a5e4a7

fedarko added 2 commits May 8, 2026 18:40

abstract dovetail edge detection to sep file/func

5bf5888

Add some tests for dovetail edge detection

7e31534

While putting these together I ran into some confusion about what exactly constitutes a dovetail... see the second test added here for details. Need to discuss, but for now let's accept it ig

fedarko changed the title ~~Implement a faster GFA parser~~ Implement a faster GFA (1 and 2) parser May 9, 2026

fedarko added 27 commits May 11, 2026 15:04

clean up test input md file

6af75e1

update changelog re test data md cleanup

a723b39

rm dup gfa edges skeleton

35983bd

make --rmdup not a flag; default "gfaonly"

e11d28f

I think this makes sense. This makes implementation easy (now we can apply this after parsing, so it is mostly graph independent) and futureproofs this, plus makes this clearer imo.

reorder CLI flags a bit

b0e1378

I think having debug / verbose at the end makes sense

simplify layout code a tiny bit

698a469

implement --rmdup #403; mv cov attrs to ui config

fb665aa

I was CONSIDERING doing this whole thing of detecting edge covs then only retaining the highest cov one but ehhh it gets finicky. see #430 for possible extension if i eventually have free time
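As an illustration of the simpler approach (dropping duplicates without comparing coverages), here is a minimal sketch. It assumes edges are hashable tuples of `(source, source_orient, target, target_orient)` and keeps the first occurrence of each; the function name `remove_duplicate_edges` and the tuple layout are hypothetical and not taken from this PR.

```python
def remove_duplicate_edges(edges):
    """Return edges in their original order, keeping only the first
    occurrence of each (source, source_orient, target, target_orient)."""
    seen = set()
    unique = []
    for edge in edges:
        if edge not in seen:
            seen.add(edge)
            unique.append(edge)
    return unique


parsed = [
    ("s1", "+", "s2", "+"),
    ("s2", "+", "s3", "-"),
    ("s1", "+", "s2", "+"),  # exact duplicate of the first edge
    ("s1", "-", "s2", "+"),  # different orientation, so not a duplicate
]
deduplicated = remove_duplicate_edges(parsed)
```

Because this only looks at the parsed edge list, it can run after parsing without knowing anything else about the graph.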

test --rmdup

6052696

Tidy readme & explain --rmdup

84006d4

edge(s)

a385d59

test self-implying gfa 2 edges

a803824

directly test from_suffix_orient() & catch "" case

2fefeb9

using " marks around the input name makes the error messages clearer

better catch & test zero-segment GFA 1 paths

fb9944a

tricky, since "".split(",") yields [""] -- NOT []!
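A small sketch of the gotcha, assuming the comma-separated segment names field of a GFA 1 P-line is passed in as a string. The helper name `split_path_segments` is hypothetical and not necessarily how this PR handles it.

```python
def split_path_segments(field):
    """Split a comma-separated segment list, returning [] for an empty
    field instead of the [""] that str.split(",") would produce."""
    if field == "":
        return []
    return field.split(",")


# The naive check misses the empty case: the result has length 1, not 0.
naive = "".split(",")
safe = split_path_segments("")
normal = split_path_segments("s1+,s2-,s3+")
```

Note that `"".split()` with no argument does return `[]`; the one-element result only appears when an explicit separator is given.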

abstract more stuff to gfa_utils; empty paths

765d4c7

testing empty GFA 2 paths & P-lines in O-lines

8f7ed67

more gfa path tests

32e526f

test O-paths with unrecognized IDs

e919ae7

clearer name...

c8acc6a

test RC path in path

1b0b529

fancier RC paths

e7a7eec

RC path-of-paths test

ffe6983

sty

ff12246

direct E-lines in O-lines tests

b023b8a

testing loop edges in gfa2 paths

1634d60

clean up logger/logging thing

8be18cb

mention --rmdup

6655a73

millions

fe83fcc

I think this is fair to say, since it is totally possible now to load gfa files with hundreds of thousands (or even millions!) of nodes / edges on my goofy 8 GB RAM laptop - i just did that. need to do more benchmarking with, like, actually modern hardware...

fedarko wants to merge 71 commits into main from fast-gfa