
fix: Prevent accidental request dropping with maxRequestsPerCrawl #3531

Open

janbuchar wants to merge 3 commits into master from fix-accidental-request-dropping

Conversation

@janbuchar
Contributor

@janbuchar janbuchar commented Mar 26, 2026

  • Closes #3153: "maxRequestsPerCrawl with RQ optimizations drops requests"

  • Core issue: maxRequestsPerCrawl and enqueueLinks({ limit }) pre-truncated request lists before sending them to the request queue. Duplicate URLs consumed budget slots, starving actual new URLs.

  • Fix: Instead of guessing upfront which requests will be new, let the request queue tell us.

    • A new maxNewRequests option on addRequestsBatched feeds requests to the queue in budget-capped chunks, counts how many were actually new from the wasAlreadyPresent results, and stops pulling once the budget is exhausted.
    • chunkedAsyncIterable was extended to accept a dynamic chunk-size callback.
    • Leftovers from the source iterator are returned as requestsOverLimit for callers to report.
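The budget-capped feeding loop described above can be sketched as follows. This is a minimal illustration, not the actual Crawlee implementation: `FakeQueue`, `addBatch`, and `addRequestsBudgeted` are hypothetical stand-ins, and only the `wasAlreadyPresent` flag mirrors the real `ProcessedRequest` result shape.

```ts
interface ProcessedRequest {
    url: string;
    wasAlreadyPresent: boolean;
}

// Stand-in for the request queue: remembers which URLs it has seen.
class FakeQueue {
    private seen = new Set<string>();
    addBatch(urls: string[]): ProcessedRequest[] {
        return urls.map((url) => {
            const wasAlreadyPresent = this.seen.has(url);
            this.seen.add(url);
            return { url, wasAlreadyPresent };
        });
    }
}

// Feed requests in chunks no larger than the remaining budget, and count
// only genuinely new requests (wasAlreadyPresent === false) against it.
function addRequestsBudgeted(
    queue: FakeQueue,
    urls: Iterable<string>,
    maxNewRequests: number,
): { added: string[]; requestsOverLimit: string[] } {
    const added: string[] = [];
    const requestsOverLimit: string[] = [];
    const iterator = urls[Symbol.iterator]();
    let budget = maxNewRequests;

    while (budget > 0) {
        // Pull at most `budget` items — the dynamic chunk size.
        const chunk: string[] = [];
        while (chunk.length < budget) {
            const next = iterator.next();
            if (next.done) break;
            chunk.push(next.value);
        }
        if (chunk.length === 0) break;

        for (const result of queue.addBatch(chunk)) {
            if (!result.wasAlreadyPresent) {
                budget -= 1;
                added.push(result.url);
            }
        }
    }

    // Whatever the iterator still holds was never sent to the queue.
    for (let next = iterator.next(); !next.done; next = iterator.next()) {
        requestsOverLimit.push(next.value);
    }
    return { added, requestsOverLimit };
}
```

Note how a duplicate URL does not decrement the budget: the queue's own deduplication result decides what counts, instead of truncating the list upfront.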

@janbuchar janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label Mar 26, 2026
@janbuchar janbuchar requested review from B4nan, barjin and l2ysho March 26, 2026 17:25
@github-actions github-actions bot added this to the 137th sprint - Tooling team milestone Mar 26, 2026
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Mar 26, 2026
@janbuchar janbuchar marked this pull request as ready for review March 27, 2026 19:49
Member

@barjin barjin left a comment


Thank you @janbuchar !

Here are some late-night ideas I had (otherwise it looks pretty solid):

```ts
 *
 * This is useful in combination with `maxRequestsPerCrawl` to avoid duplicate URLs consuming the budget.
 */
maxNewRequests?: number;
```
Member

Let's maybe mention that this effectively forces waitForAllRequestsToBeAdded as per this line?

```ts
    waitForAllRequestsToBeAdded: Promise<ProcessedRequest[]>,
): Promise<AddRequestsBatchedResult> => {
```
```ts
if (maxNewRequests !== undefined) {
    for await (const request of generateRequests()) {
```
Member

I feel like this generateRequests() call will create a new generator instance (and therefore push all the requests to requestsOverLimit, regardless of which ones were processed).

```ts
const result = await rq.addRequestsBatched([/* ...e.g. 5 urls... */], {
    maxNewRequests: 1,
});

// result.addedRequests.length === 1
// result.requestsOverLimit.length === 5 (should be 4?)
```

Can we have a test for this?
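The suspected bug can be reproduced with a plain generator, independent of the PR's code: calling a generator function a second time returns a fresh iterator that starts from the beginning, so collecting leftovers from a new `generateRequests()` instance would include already-processed items. A minimal, self-contained illustration (names are hypothetical):

```ts
// Calling a generator function returns a brand-new iterator each time.
function* generateUrls(): Generator<string> {
    yield* ["a", "b", "c", "d", "e"];
}

// Consume one item from a first instance...
const first = generateUrls();
first.next(); // consumes "a"

// ...but a second call starts over from the beginning.
const freshDrain = [...generateUrls()];
// freshDrain has all 5 items, not the remaining 4.

// Draining the *same* instance yields only what is left.
const remaining = [...first];
// remaining is ["b", "c", "d", "e"]
```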

Member

On a side note, the intent of sharing the state through a common closure variable (the iterator) took me some time to understand. Can we pass the half-eaten iterator explicitly as a param?
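One way to make that explicit, sketched under the assumption that the chunking helper can simply receive the iterator as an argument (`takeUpTo` and the surrounding names are illustrative, not the PR's actual API):

```ts
// Pull at most `n` items from an iterator, leaving the rest in place.
function takeUpTo<T>(iterator: Iterator<T>, n: number): T[] {
    const out: T[] = [];
    while (out.length < n) {
        const next = iterator.next();
        if (next.done) break;
        out.push(next.value);
    }
    return out;
}

// Because the same iterator instance is passed explicitly, each call
// visibly continues where the previous one stopped — no hidden closure state.
const it = ["a", "b", "c", "d"][Symbol.iterator]();
const batch = takeUpTo(it, 2);            // ["a", "b"]
const leftovers = takeUpTo(it, Infinity); // ["c", "d"]
```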



Development

Successfully merging this pull request may close these issues.

maxRequestsPerCrawl with RQ optimizations drops requests

3 participants