perf: 3.3x faster encode_ordinary_batch for single-threaded workloads #531
Open
homanp wants to merge 1 commit into
Conversation
Bypass ThreadPoolExecutor when num_threads <= 1 or batch size is 1, calling the Rust binding directly. Also optimizes decode paths:

- decode_batch/decode_bytes_batch: same ThreadPoolExecutor bypass
- decode_tokens_bytes: map() with direct _core_bpe binding
- decode_with_offsets: ASCII fast-path using bytes.isascii()
- cached frozenset for special token validation
- precomputed token byte lengths lookup table

encode_ordinary_batch benchmark (scripts/benchmark.py, num_threads=1):

- baseline: 10.6M bytes/s
- this PR: 35.2M bytes/s (3.33x)

All 33 tests pass.
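One of the decode-path changes listed above, the ASCII fast path for offsets, can be sketched roughly like this (the function and its shape are illustrative, not the PR's actual code; only the bytes.isascii() trick comes from the description):

```python
def offsets_for_tokens(token_bytes: list[bytes]) -> list[int]:
    """Compute the starting character offset of each token in the decoded text.

    Fast path: when the concatenated output is pure ASCII, byte offsets and
    character offsets coincide, so a running sum of byte lengths suffices and
    no per-token UTF-8 decoding is needed.
    """
    joined = b"".join(token_bytes)
    offsets, pos = [], 0
    if joined.isascii():
        for tb in token_bytes:
            offsets.append(pos)
            pos += len(tb)
        return offsets
    # Slow path (simplified): count decoded characters per token. Real code
    # must also handle tokens that split a multi-byte UTF-8 sequence.
    for tb in token_bytes:
        offsets.append(pos)
        pos += len(tb.decode("utf-8", errors="replace"))
    return offsets
```

bytes.isascii() is a single C-level scan, so the check is cheap relative to decoding each token individually.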
Summary
3.3x faster encode_ordinary_batch for single-threaded workloads, plus decode path optimizations. All 33 tests pass.

Benchmark
Using scripts/benchmark.py with RAYON_NUM_THREADS=1, 10K documents: the huggingface baseline is unchanged at ~6.3M bytes/s, confirming the measurement is stable.
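A simplified version of that measurement looks like the following (the harness is illustrative; scripts/benchmark.py and RAYON_NUM_THREADS come from the PR, the rest is an assumption):

```python
import os
import time

# Rayon reads this at startup, so it must be set before the Rust core loads.
os.environ["RAYON_NUM_THREADS"] = "1"

def throughput_bytes_per_sec(encode_batch, documents: list[str]) -> float:
    """Measure encoding throughput in bytes/s for a batch-encode callable."""
    total_bytes = sum(len(doc.encode("utf-8")) for doc in documents)
    start = time.perf_counter()
    encode_batch(documents)
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed
```

Running the same callable against the same corpus before and after the change gives the bytes/s figures quoted above.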
What changed
The main win: when num_threads <= 1 or batch size is 1, skip the ThreadPoolExecutor entirely and call self._core_bpe.encode_ordinary directly via map(). The executor has significant overhead for small/single-threaded batches (creating the pool, dispatching tasks, collecting results) that dominates when the actual encoding is fast (which it is, because it's Rust).

Most LLM applications encode/decode single prompts or small batches. The default num_threads=8 creates an 8-thread pool for every call even when there's only one item to process. Bypassing the executor for these cases removes pure Python overhead and lets the Rust core do its job without coordination cost.
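The bypass pattern described above can be sketched as follows (the _core_bpe attribute and method names follow the PR description; the surrounding class is a hypothetical stand-in for tiktoken's actual Encoding):

```python
from concurrent.futures import ThreadPoolExecutor

class Encoding:
    """Hypothetical wrapper mirroring the bypass pattern from the PR."""

    def __init__(self, core_bpe, num_threads: int = 8):
        # core_bpe: Rust binding exposing encode_ordinary(str) -> list[int]
        self._core_bpe = core_bpe
        self._num_threads = num_threads

    def encode_ordinary_batch(self, texts: list[str]) -> list[list[int]]:
        # Fast path: a thread pool only adds coordination overhead when
        # there is nothing to parallelize, so call the binding directly.
        if self._num_threads <= 1 or len(texts) == 1:
            return list(map(self._core_bpe.encode_ordinary, texts))
        # Slow path: fan out across threads; this only pays off because
        # the Rust core releases the GIL while encoding.
        with ThreadPoolExecutor(max_workers=self._num_threads) as executor:
            return list(executor.map(self._core_bpe.encode_ordinary, texts))
```

Both paths return results in input order, so the fast path is observationally identical to the executor path.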