Excavate: extract IPv6 URLs (#1815) by ChrisJr404 · Pull Request #3071 · blacklanternsecurity/bbot

ChrisJr404 · 2026-05-03T23:07:25Z

Summary

Closes #1815 ("Excavate IPv6 URLs", filed by @TheTechromancer).

The url_full YARA rule and the two Python post-filters (full_url_regex / full_url_regex_strict) all only accepted word-character / dotted hostnames in the host slot, so URLs with bracketed IPv6 hosts were dropped at extraction time:

http://[2001:db8::1]/api
http://[::1]:8080/path
https://[fe80::dead:beef]/foo/bar.html

This PR adds a \[[0-9a-fA-F:]+\] alternative to the host part of all three patterns. The bracketed form is preserved in the captured host so downstream parsers (urllib, etc.) still recognise the URL as IPv6.
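
For illustration only, the host alternation described above might look roughly like the following in Python. This is a simplified sketch, not the actual bbot patterns (those are not reproduced in this description); the names HOST_IPV6, HOST_NAME, URL_RE, and extract_urls are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Hypothetical, simplified stand-ins for the real excavate patterns.
HOST_IPV6 = r"\[[0-9a-fA-F:]+\]"  # new branch: bracketed IPv6 literal
HOST_NAME = r"\w[\w.-]*"          # rough approximation of the old word-character / dotted host
URL_RE = re.compile(
    rf"https?://(?P<host>{HOST_IPV6}|{HOST_NAME})(?::\d{{1,5}})?(?:/\S*)?"
)


def extract_urls(text):
    """Return (url, host) pairs for every URL-shaped match in text."""
    return [(m.group(0), m.group("host")) for m in URL_RE.finditer(text)]


sample = "see http://[2001:db8::1]/api and http://[::1]:8080/path or https://example.com/x"
found = extract_urls(sample)

for url, host in found:
    # The captured host keeps its brackets, so urlsplit still parses it as IPv6.
    print(host, "->", urlsplit(url).hostname)
```

Keeping the brackets in the captured host matters because urllib relies on them to tell the address's colons apart from the port separator.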

Tests

bbot/test/test_step_1/test_excavate_url_regexes.py — 6 cases:

test_full_url_regex_matches_ipv6 — accepts 6 representative IPv6 URLs and verifies the captured host keeps the leading [.
test_full_url_regex_still_matches_existing_patterns — regression guard for plain DNS-name + IPv4 URLs.
A corresponding pair of tests for full_url_regex_strict.
A corresponding pair of tests for the YARA url_full rule (compiled directly from excavate.URLExtractor.yara_rules['url_full']).

$ pytest bbot/test/test_step_1/test_excavate_url_regexes.py -v
test_full_url_regex_matches_ipv6                              PASSED
test_full_url_regex_still_matches_existing_patterns           PASSED
test_full_url_regex_strict_matches_ipv6                       PASSED
test_full_url_regex_strict_still_matches_existing_patterns    PASSED
test_yara_url_rule_matches_ipv6                               PASSED
test_yara_url_rule_still_matches_existing_patterns            PASSED
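
The shape of one such test pair could be sketched as follows. The real tests import the patterns from bbot; the regex below is a hypothetical stand-in defined inline so the sketch is self-contained, and the legacy URLs are made-up examples.

```python
import re

# Hypothetical stand-in; the real tests exercise bbot's own full_url_regex.
full_url_regex = re.compile(
    r"(https?)://(\[[0-9a-fA-F:]+\]|\w[\w.-]*)(?::\d{1,5})?(/\S*)?"
)

IPV6_URLS = [
    "http://[2001:db8::1]/api",
    "http://[::1]:8080/path",
    "https://[fe80::dead:beef]/foo/bar.html",
]
LEGACY_URLS = [
    "http://example.com/index.html",
    "https://192.168.0.1:8443/login",
]


def test_matches_ipv6():
    for url in IPV6_URLS:
        m = full_url_regex.fullmatch(url)
        assert m is not None, url
        # The captured host must keep the leading bracket.
        assert m.group(2).startswith("["), url


def test_still_matches_existing_patterns():
    # Regression guard: DNS-name and IPv4 URLs must match as before.
    for url in LEGACY_URLS:
        m = full_url_regex.fullmatch(url)
        assert m is not None, url
        assert not m.group(2).startswith("["), url
```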

Notes

No behavioural change for existing DNS-name / IPv4 URLs: the original host pattern is preserved as the second branch of the alternation.
The new patterns accept IPv6-shaped tokens regardless of whether they are valid addresses; downstream URL validation still runs (validators.validate_url_parsed) before the URL becomes a URL_UNVERIFIED event, so malformed inputs still get rejected one stage later.
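
To make the two-stage idea concrete, here is a minimal sketch of a loose match followed by a stricter check. It uses the standard-library ipaddress module as a stand-in for the downstream validator; it is not how validators.validate_url_parsed works internally, and the names LOOSE_IPV6_HOST and is_valid_ipv6_host are hypothetical.

```python
import ipaddress
import re

# Hypothetical loose pattern, mirroring the bracketed branch described above.
LOOSE_IPV6_HOST = re.compile(r"\[([0-9a-fA-F:]+)\]")


def is_valid_ipv6_host(host):
    """Return True only if host is a bracketed, syntactically valid IPv6 address."""
    m = LOOSE_IPV6_HOST.fullmatch(host)
    if m is None:
        return False
    try:
        # Stand-in for the later validation stage (assumption, not bbot's code).
        ipaddress.IPv6Address(m.group(1))
    except ValueError:
        return False
    return True


for candidate in ["[::1]", "[fe80::dead:beef]", "[:::::]", "[abc]"]:
    loose = LOOSE_IPV6_HOST.fullmatch(candidate) is not None
    print(candidate, "loose match:", loose, "valid:", is_valid_ipv6_host(candidate))
```

Tokens such as [:::::] pass the loose pattern but fail the stricter check, which is the behaviour the note above relies on.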

The url_full YARA rule and the full_url_regex / full_url_regex_strict post-filters all required hosts to be word-character labels, so URLs with bracketed IPv6 hosts (http://[2001:db8::1]/, http://[::1]:8080/...) were dropped at extraction time.

Add a bracketed [0-9a-fA-F:]+ alternative to the host part of all three patterns so IPv6 URLs are emitted as URL_UNVERIFIED events alongside DNS-name URLs.

Adds bbot/test/test_step_1/test_excavate_url_regexes.py: 6 cases that cover both the new IPv6 acceptance and a regression guard for the existing DNS-name / IPv4 URLs.

Closes blacklanternsecurity#1815

github-actions · 2026-05-03T23:07:39Z

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

ChrisJr404 · 2026-05-03T23:08:48Z

I have read the CLA Document and I hereby sign the CLA

ChrisJr404 · 2026-05-04T01:16:40Z

recheck

ChrisJr404 · 2026-05-04T16:30:13Z

recheck

liquidsec · 2026-05-07T15:08:33Z

hi @ChrisJr404, thanks for this! I let the tests run, right now it's just failing lint - can you run uv ruff format and commit the resulting changes? Then we can let the rest of the tests run to make sure the changes didn't break anything else. I think as long as they all pass we're good to merge.

liquidsec · 2026-05-07T15:09:14Z

@aconite33 do you know what is wrong with claassistant in this case?

ChrisJr404 · 2026-05-07T15:59:04Z

Done in 6ead620, lint should be clean now.

codecov · 2026-05-07T19:11:42Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91%. Comparing base (5be4993) to head (6ead620).
⚠️ Report is 8 commits behind head on dev.

Additional details and impacted files

@@          Coverage Diff          @@
##             dev   #3071   +/-   ##
=====================================
+ Coverage     91%     91%   +1%     
=====================================
  Files        437     440    +3     
  Lines      37509   37560   +51     
=====================================
+ Hits       33925   33979   +54     
+ Misses      3584    3581    -3

bls-cla-bot Bot added a commit to blacklanternsecurity/CLA that referenced this pull request May 3, 2026

@ChrisJr404 has signed the CLA in blacklanternsecurity/bbot#3071

b2313c4

TheTechromancer mentioned this pull request May 4, 2026

Dev -> Stable 3.0 #3079

Open

style: ruff format

6ead620

liquidsec approved these changes May 7, 2026

liquidsec merged commit 107415f into blacklanternsecurity:dev May 7, 2026
18 of 19 checks passed

liquidsec mentioned this pull request May 7, 2026

Excavate IPv6 URLs #1815

Closed

liquidsec merged 2 commits into blacklanternsecurity:dev from ChrisJr404:feat/excavate-ipv6-1815
