Excavate: extract IPv6 URLs (#1815) #3071

Merged
liquidsec merged 2 commits into blacklanternsecurity:dev from ChrisJr404:feat/excavate-ipv6-1815
May 7, 2026

@ChrisJr404

Summary

Closes #1815 ("Excavate IPv6 URLs", filed by @TheTechromancer).

The url_full YARA rule and the two Python post-filters (full_url_regex / full_url_regex_strict) accepted only word-character, dotted hostnames in the host slot, so URLs with bracketed IPv6 hosts were dropped at extraction time:

  • http://[2001:db8::1]/api
  • http://[::1]:8080/path
  • https://[fe80::dead:beef]/foo/bar.html

This PR adds a \[[0-9a-fA-F:]+\] alternative to the host part of all three patterns. The bracketed form is preserved in the captured host, so downstream parsers (urllib, etc.) still recognise the URL as IPv6.
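
As a rough illustration of the change described here, the sketch below builds a simplified URL pattern whose host slot tries a bracketed IPv6 alternative first and then falls back to a dotted-hostname branch. The pattern and the names `HOST`, `URL_RE`, and `extract_urls` are assumptions made for this example; they are not bbot's actual `full_url_regex`, which is more involved.

```python
import re
from urllib.parse import urlsplit

# Hypothetical, simplified host pattern: bracketed IPv6 first, then a
# "word characters and dots" form standing in for the pre-existing branch.
HOST = r"(?:\[[0-9a-fA-F:]+\]|[\w-]+(?:\.[\w-]+)*)"
URL_RE = re.compile(r"https?://(" + HOST + r")(?::\d{1,5})?(?:/\S*)?")


def extract_urls(text):
    """Return (url, captured_host) pairs found in text."""
    return [(m.group(0), m.group(1)) for m in URL_RE.finditer(text)]


sample = "see http://[2001:db8::1]/api and http://[::1]:8080/path or https://example.com/x"
for url, host in extract_urls(sample):
    parts = urlsplit(url)
    # The regex capture keeps the brackets; urlsplit strips them in .hostname.
    print(url, host, parts.hostname, parts.port)
```

Because the captured host keeps its brackets, the full match can be handed to `urllib.parse.urlsplit` unchanged, which separates the address from the optional port.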

Tests

bbot/test/test_step_1/test_excavate_url_regexes.py — 6 cases:

  • test_full_url_regex_matches_ipv6 — accepts 6 representative IPv6 URLs and verifies the captured host keeps the leading [.
  • test_full_url_regex_still_matches_existing_patterns — regression guard for plain DNS-name + IPv4 URLs.
  • A corresponding pair of tests for full_url_regex_strict.
  • A corresponding pair of tests for the YARA url_full rule (compiled directly from excavate.URLExtractor.yara_rules['url_full']).

$ pytest bbot/test/test_step_1/test_excavate_url_regexes.py -v
test_full_url_regex_matches_ipv6                              PASSED
test_full_url_regex_still_matches_existing_patterns           PASSED
test_full_url_regex_strict_matches_ipv6                       PASSED
test_full_url_regex_strict_still_matches_existing_patterns    PASSED
test_yara_url_rule_matches_ipv6                               PASSED
test_yara_url_rule_still_matches_existing_patterns            PASSED
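
The test file itself is not reproduced in this conversation. A minimal sketch of the two kinds of check it describes (IPv6 acceptance plus a regression guard) might look like the following. `candidate_regex` is a hypothetical stand-in pattern, and the URL lists are illustrative rather than the exact cases used in the PR.

```python
import re

# Hypothetical stand-in for the pattern under test (not bbot's real regex).
candidate_regex = re.compile(
    r"(\w{2,15})://((?:\[[0-9a-fA-F:]+\]|\w[\w.-]*))(?::\d{1,5})?(?:/\S*)?"
)

IPV6_URLS = [
    "http://[2001:db8::1]/api",
    "http://[::1]:8080/path",
    "https://[fe80::dead:beef]/foo/bar.html",
]
LEGACY_URLS = [
    "http://example.com/index.html",
    "https://192.0.2.10:8443/login",
]


def check_matches_ipv6():
    for url in IPV6_URLS:
        m = candidate_regex.fullmatch(url)
        assert m is not None, url
        # The captured host keeps its surrounding brackets.
        assert m.group(2).startswith("[") and m.group(2).endswith("]"), url


def check_still_matches_existing_patterns():
    for url in LEGACY_URLS:
        m = candidate_regex.fullmatch(url)
        assert m is not None, url
        # DNS-name and IPv4 hosts go through the non-bracketed branch.
        assert not m.group(2).startswith("["), url


check_matches_ipv6()
check_still_matches_existing_patterns()
```

Keeping the regression guard as a separate function mirrors the split in the test names listed here: one check per pattern for the new behaviour, one for the behaviour that must not change.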

Notes

  • No behavioural change for existing DNS-name / IPv4 URLs: the original host pattern is preserved as the second branch of the new alternation.
  • The new patterns accept IPv6-shaped tokens regardless of whether they are valid addresses; downstream URL validation still runs (validators.validate_url_parsed) before the URL becomes a URL_UNVERIFIED event, so malformed inputs still get rejected one stage later.
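
To illustrate why the second validation stage matters, the sketch below shows that a permissive bracket pattern accepts tokens that are not real addresses, and that a later check can reject them. It uses Python's standard `ipaddress` module as a stand-in and does not reproduce `validators.validate_url_parsed`; `BRACKETED` and `is_valid_bracketed_ipv6` are names invented for this example.

```python
import ipaddress
import re

# Same shape as the alternative added in the PR: brackets around hex digits and colons.
BRACKETED = re.compile(r"\[[0-9a-fA-F:]+\]")


def is_valid_bracketed_ipv6(token):
    """True if token is '[...]' wrapping a syntactically valid IPv6 address."""
    if not BRACKETED.fullmatch(token):
        return False
    try:
        # Assumption: stdlib parsing stands in for the real downstream validator.
        ipaddress.IPv6Address(token[1:-1])
    except ValueError:
        return False
    return True


for token in ["[2001:db8::1]", "[::1]", "[:::::]", "[dead]"]:
    shape_ok = bool(BRACKETED.fullmatch(token))
    print(token, "shape:", shape_ok, "valid:", is_valid_bracketed_ipv6(token))
```

All four tokens pass the shape check, but only the first two parse as addresses, which is the kind of input a stricter later stage is expected to filter out.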

The url_full YARA rule and the full_url_regex / full_url_regex_strict
post-filters all required hosts to be word-character labels, so URLs
with bracketed IPv6 hosts (http://[2001:db8::1]/, http://[::1]:8080/...)
were dropped at extraction time. Add a [0-9a-fA-F:]+ alternative to the
host part of all three patterns so IPv6 URLs are emitted as
URL_UNVERIFIED events alongside DNS-name URLs.

Adds bbot/test/test_step_1/test_excavate_url_regexes.py — 6 cases that
pin both the new IPv6 acceptance and a regression guard for the
existing DNS-name / IPv4 URLs.

Closes blacklanternsecurity#1815
@github-actions
Contributor

github-actions Bot commented May 3, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@ChrisJr404
Author

I have read the CLA Document and I hereby sign the CLA

bls-cla-bot Bot added a commit to blacklanternsecurity/CLA that referenced this pull request May 3, 2026
@ChrisJr404
Author

recheck

@liquidsec
Contributor

liquidsec commented May 7, 2026

Hi @ChrisJr404, thanks for this! I let the tests run; right now it's just failing lint. Can you run uv ruff format and commit the resulting changes? Then we can let the rest of the tests run to make sure the changes didn't break anything else. I think as long as they all pass, we're good to merge.

@liquidsec
Contributor

@aconite33 do you know what is wrong with claassistant in this case?

@ChrisJr404
Author

Done in 6ead620, lint should be clean now.

@codecov

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91%. Comparing base (5be4993) to head (6ead620).
⚠️ Report is 8 commits behind head on dev.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #3071   +/-   ##
=====================================
+ Coverage     91%     91%   +1%     
=====================================
  Files        437     440    +3     
  Lines      37509   37560   +51     
=====================================
+ Hits       33925   33979   +54     
+ Misses      3584    3581    -3     

liquidsec merged commit 107415f into blacklanternsecurity:dev May 7, 2026
18 of 19 checks passed
liquidsec mentioned this pull request May 7, 2026