Excavate: extract IPv6 URLs (#1815) #3071
The url_full YARA rule and the full_url_regex / full_url_regex_strict post-filters all required hosts to be word-character labels, so URLs with bracketed IPv6 hosts (http://[2001:db8::1]/, http://[::1]:8080/...) were dropped at extraction time. Add a bracketed [0-9a-fA-F:]+ alternative to the host part of all three patterns so IPv6 URLs are emitted as URL_UNVERIFIED events alongside DNS-name URLs.

Adds bbot/test/test_step_1/test_excavate_url_regexes.py: 6 cases that pin both the new IPv6 acceptance and a regression guard for the existing DNS-name / IPv4 URLs.

Closes blacklanternsecurity#1815
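
To make the host-slot change concrete, here is a minimal sketch of a URL pattern whose host accepts either dotted word-character labels or a bracketed run of hex digits and colons. The pattern, the name URL_SKETCH, and the helper extract_hosts are hypothetical and heavily simplified; they are not bbot's actual full_url_regex, full_url_regex_strict, or url_full rule.

```python
import re

# Hypothetical, simplified URL pattern for illustration only;
# this is NOT bbot's real full_url_regex.
# The host slot is an alternation: DNS-style labels (which also cover
# dotted IPv4) OR a bracketed run of hex digits and colons.
URL_SKETCH = re.compile(
    r"(?P<scheme>https?)://"
    r"(?P<host>(?:\w[\w-]*\.)*\w[\w-]*|\[[0-9a-fA-F:]+\])"
    r"(?::(?P<port>\d{1,5}))?"
    r"(?P<path>/[^\s\"'<>]*)?"
)


def extract_hosts(text):
    """Return the captured host of every URL the sketch pattern finds in text."""
    return [match.group("host") for match in URL_SKETCH.finditer(text)]


sample = (
    "see http://[2001:db8::1]/api and http://[::1]:8080/path "
    "or https://www.example.com/x or http://192.0.2.10/"
)
# The brackets stay in the captured host for the IPv6 URLs.
print(extract_hosts(sample))
```

The character class is deliberately loose: it accepts anything that looks like hex digits and colons between brackets, so a later validation step is still needed to reject strings that are not real IPv6 addresses.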
hi @ChrisJr404, thanks for this! I let the tests run, right now it's just failing lint - can you run

@aconite33 do you know what is wrong with claassistant in this case?

Done in 6ead620, lint should be clean now.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##             dev   #3071   +/-   ##
=====================================
+ Coverage     91%     91%    +1%
=====================================
  Files        437     440     +3
  Lines      37509   37560    +51
=====================================
+ Hits       33925   33979    +54
+ Misses      3584    3581     -3

Summary
Closes #1815 ("Excavate IPv6 URLs", filed by @TheTechromancer).
The url_full YARA rule and the two Python post-filters (full_url_regex / full_url_regex_strict) all only accepted word-character / dotted hostnames in the host slot, so URLs with bracketed IPv6 hosts were dropped at extraction time:

http://[2001:db8::1]/api
http://[::1]:8080/path
https://[fe80::dead:beef]/foo/bar.html

This PR adds a \[[0-9a-fA-F:]+\] alternative to the host part of all three patterns. The bracketed form is preserved in the captured host so downstream parsers (urllib, etc.) still recognise the URL as IPv6.

Tests
bbot/test/test_step_1/test_excavate_url_regexes.py, 6 cases:
test_full_url_regex_matches_ipv6: accepts 6 representative IPv6 URLs and verifies the captured host keeps the leading [.
test_full_url_regex_still_matches_existing_patterns: regression guard for plain DNS-name + IPv4 URLs.
full_url_regex_strict.
url_full rule (compiled directly from excavate.URLExtractor.yara_rules['url_full']).

Notes
validators.validate_url_parsed) before the URL becomes a URL_UNVERIFIED event, so malformed inputs still get rejected one stage later.
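
As a rough illustration of why a loose bracketed pattern is acceptable when a stricter check follows, the sketch below uses only the Python standard library (urllib.parse and ipaddress) to parse a bracketed host and reject one that is not a real IPv6 address. It does not call or reproduce bbot's validators.validate_url_parsed; the helper name parse_ipv6_url_host is hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def parse_ipv6_url_host(url):
    """Return the URL's host as an IPv6Address, or None if it is not a valid IPv6 literal.

    Hypothetical helper for illustration; not part of bbot.
    """
    try:
        # urlsplit strips the surrounding brackets from an IPv6 host.
        hostname = urlsplit(url).hostname
        if hostname is None:
            return None
        address = ipaddress.ip_address(hostname)
    except ValueError:
        # Raised by urlsplit for malformed bracketed hosts on some Python
        # versions, and by ip_address for strings that are not IP addresses.
        return None
    return address if address.version == 6 else None


for candidate in (
    "http://[2001:db8::1]/api",   # valid IPv6 literal
    "http://[::1]:8080/path",     # valid IPv6 literal with a port
    "http://[:::::]/",            # passes a loose hex-and-colon class, but is not an address
    "http://example.com/",        # DNS name, so not an IPv6 host
):
    print(candidate, "->", parse_ipv6_url_host(candidate))
```

A string such as http://[:::::]/ satisfies a hex-and-colon character class but fails here, which is the kind of input a second validation stage is expected to catch.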