Skip to content

Implement a faster GFA (1 and 2) parser#422

Open
fedarko wants to merge 71 commits into
mainfrom
fast-gfa
Open

Implement a faster GFA (1 and 2) parser#422
fedarko wants to merge 71 commits into
mainfrom
fast-gfa

Conversation

@fedarko
Copy link
Copy Markdown
Member

@fedarko fedarko commented Apr 29, 2026

Tentatively ready!

Will address #403, #310, #421, #425

fedarko added 26 commits April 27, 2026 22:18
Closes #310 - this just ignores lines that do not start with a prefix
we expect (e.g. hifiasm's A-lines, comments, ...)

This is partial work on #403. Need to update this to handle
E-lines and O-lines in GFA 2 files. These are both a bit tricky
because they require some fancy additional logic; we want to ignore
containment E-lines (and maybe really all non-dovetail E-lines?),
and ideally we want to support O-lines with fancy recursive path
definitions. But neither of those is super intractable or anything.

Also, we might want to eventually reconsider how we handle
inconsistent lengths. TECHNICALLY in GFA 2 you are allowed to
override segment length (as given by a sequence) with another length.
but like that just seems so dreadful to me that i really doubt that
supporting it will do anything but cause problems for us.
The all_line_types.gfa2.gfa test case (c/o gfapy) has an example of
this which is currently causing a test error, as expected.

Anyway .......
At least from some initial testing on the hg002 graph this is already
much faster which is encouraging. i'm sure there are ways to speed
it up even further

Code is kind of sloppy, need to add more tests
currently only exposed within code - should add a cli param i guess
Now that this sanity check is fully configurable, closes #421
probs not very important but whatever at least now it is consistent
Per the GFA 2 specification:
https://gfa-spec.github.io/GFA-spec/GFA2.html

Matches GfaViz' behavior. This was already tested in the
all_line_types.gfa2.gfa test case from Gfapy, but just to be safe
I added some tests that explicitly verifies that this stuff works
as I intend.
and only visualizing the dovetail edges. not TOO bad to do although
i would like to add more tests...

and should really tidy up thee code wrt self implying edges to
reduce duplication
the way E-lines are expanded in O-lines needs some work - should
make it look ahead and avoid adding target, also, if the edge
is followed by a segment. but I need to finally commit this so I
can stop worrying abt losing this work
Surprisingly this seems to work ok, but I want to abstract it to
another function and add a zillion tests because this is surprisingly
tricky
worked out surprisingly well!
@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 91.81495% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.21%. Comparing base (65d4a56) to head (fe83fcc).

Files with missing lines Patch % Lines
metagenomescope/graph/assembly_graph.py 88.23% 2 Missing and 4 partials ⚠️
metagenomescope/parsers.py 97.16% 2 Missing and 2 partials ⚠️
metagenomescope/_cli.py 0.00% 3 Missing ⚠️
metagenomescope/defaults.py 0.00% 3 Missing ⚠️
metagenomescope/descs.py 0.00% 3 Missing ⚠️
metagenomescope/gfa_utils.py 94.44% 1 Missing and 2 partials ⚠️
metagenomescope/layout/layout.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #422      +/-   ##
==========================================
+ Coverage   65.88%   67.21%   +1.32%     
==========================================
  Files          33       34       +1     
  Lines        4042     4233     +191     
  Branches      990     1039      +49     
==========================================
+ Hits         2663     2845     +182     
- Misses       1308     1314       +6     
- Partials       71       74       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

fedarko added 3 commits May 7, 2026 22:40
this is not really allowed in the GFA 1 specification, but let's
nip it in the bud anyway ...
fedarko added 2 commits May 8, 2026 18:40
While putting these together I ran into some confusion about
what exactly constitutes a dovetail... see the second test added
here for details. Need to discuss, but for now let's accept it ig
@fedarko fedarko changed the title Implement a faster GFA parser Implement a faster GFA (1 and 2) parser May 9, 2026
fedarko added 27 commits May 11, 2026 15:04
I think this makes sense. This makes implementation easy (now we
can apply this after parsing, so it is mostly graph independent)
and futureproofs this, plus makes this clearer imo.
I think having debug / verbose at the end makes sense
I was CONSIDERING doing this whole thing of detecting edge covs
then only retaining the highest cov one but ehhh it gets finicky.
see #430 for possible extension if i eventually have free time
using " marks around the input name makes the error messages clearer
tricky, since "".split(",") yields [""] -- NOT []!
I think this is fair to say, since it is totally possible now to
load gfa files with hundreds of thousands (or even millions!) of
nodes / edges on my goofy 8MB RAM laptop - i just did that.

need to do more benchmarking with, like, actually modern hardware...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant