perf: 3.3x faster encode_ordinary_batch for single-threaded workloads #531

Open
homanp wants to merge 1 commit into openai:main from homanp:perf/batch-and-decode-fast-paths

@homanp homanp commented Apr 22, 2026

Summary

3.3x faster encode_ordinary_batch for single-threaded workloads, plus decode path optimizations. All 33 tests pass.

Benchmark

Using scripts/benchmark.py with RAYON_NUM_THREADS=1, 10K documents:

| | baseline | this PR | speedup |
| --- | --- | --- | --- |
| encode_ordinary_batch | 10.6M bytes/s | 35.2M bytes/s | 3.33x |

The Hugging Face baseline is unchanged at ~6.3M bytes/s between the two runs, which suggests the measurement is stable.
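For readers who want to sanity-check a bytes/s figure, here is a minimal sketch of how such a throughput number can be computed. It is not the contents of scripts/benchmark.py; the helper name `throughput_bytes_per_s` and the stand-in encoder are hypothetical and only illustrate the arithmetic (total UTF-8 bytes divided by wall-clock time).

```python
import time
from typing import Callable, Sequence


def throughput_bytes_per_s(
    encode_batch: Callable[[Sequence[str]], object],
    docs: Sequence[str],
) -> float:
    """Return UTF-8 bytes processed per second for one call to encode_batch.

    Hypothetical helper for illustration; not taken from scripts/benchmark.py.
    """
    total_bytes = sum(len(doc.encode("utf-8")) for doc in docs)
    start = time.perf_counter()
    encode_batch(docs)
    # Guard against a zero reading on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_bytes / elapsed


if __name__ == "__main__":
    # Stand-in encoder (assumption): turns each document into its UTF-8 bytes.
    docs = ["hello world"] * 10_000
    rate = throughput_bytes_per_s(lambda batch: [list(d.encode()) for d in batch], docs)
    print(f"{rate / 1e6:.1f}M bytes/s")
```

A real comparison would repeat the call several times and keep the best or median run, since a single timing is noisy.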

What changed

The main win: when num_threads <= 1 or the batch size is 1, skip the ThreadPoolExecutor entirely and call self._core_bpe.encode_ordinary directly via map(). For small or single-threaded batches, the executor's overhead (creating the pool, dispatching tasks, collecting results) dominates the total cost, because the actual encoding is fast (it's Rust).

Most LLM applications encode/decode single prompts or small batches. The default num_threads=8 sets up a thread pool with up to 8 workers on every call, even when there's only one item to process. Bypassing the executor for these cases removes pure Python overhead and lets the Rust core do its job without coordination cost.
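The bypass described above can be sketched roughly as follows. This is a simplified, standalone illustration rather than the actual diff: `encode_batch` and `toy_encode` are hypothetical names, and `toy_encode` stands in for the Rust binding so the snippet runs without tiktoken installed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def encode_batch(
    encode_one: Callable[[str], List[int]],
    texts: Sequence[str],
    num_threads: int = 8,
) -> List[List[int]]:
    """Encode a batch, skipping the thread pool when it cannot help."""
    # Fast path: with one thread or at most one item there is nothing to
    # parallelize, so call the encoder directly and avoid pool overhead.
    if num_threads <= 1 or len(texts) <= 1:
        return list(map(encode_one, texts))
    # General path: fan the work out across a thread pool.
    with ThreadPoolExecutor(num_threads) as executor:
        return list(executor.map(encode_one, texts))


def toy_encode(text: str) -> List[int]:
    # Stand-in for the real encoder (assumption): one "token" per UTF-8 byte.
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    print(encode_batch(toy_encode, ["ab"]))                      # fast path
    print(encode_batch(toy_encode, ["a", "b"], num_threads=4))   # pool path
```

Both paths return the same result for the same input; only the dispatch mechanism differs, which is why the change should be invisible to callers.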

Bypass ThreadPoolExecutor when num_threads <= 1 or batch size is 1,
calling the Rust binding directly. Also optimizes decode paths:

- decode_batch/decode_bytes_batch: same ThreadPoolExecutor bypass
- decode_tokens_bytes: map() with direct _core_bpe binding
- decode_with_offsets: ASCII fast-path using bytes.isascii()
- cached frozenset for special token validation
- precomputed token byte lengths lookup table
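As an illustration of the ASCII fast-path idea from the list above, here is a minimal sketch of computing character offsets from per-token bytes. It is not the code in this PR: `text_and_offsets` is a hypothetical standalone function, and it assumes every token is a non-empty bytes object. When the joined bytes are pure ASCII, each byte is exactly one character, so offsets are just a running sum of token lengths; otherwise UTF-8 continuation bytes (0x80 to 0xBF) have to be skipped when counting characters.

```python
from typing import List, Tuple


def text_and_offsets(token_bytes: List[bytes]) -> Tuple[str, List[int]]:
    """Return the decoded text and the starting character offset of each token.

    Hypothetical helper; assumes each token is non-empty.
    """
    joined = b"".join(token_bytes)
    offsets: List[int] = []
    pos = 0

    if joined.isascii():
        # Fast path: one byte per character, so byte offsets are char offsets.
        for tok in token_bytes:
            offsets.append(pos)
            pos += len(tok)
        return joined.decode("ascii"), offsets

    # Slow path: count only bytes that start a character.
    for tok in token_bytes:
        starts_mid_char = 0x80 <= tok[0] < 0xC0
        # A token that begins inside a multi-byte character is attributed
        # to the character started by the previous token.
        offsets.append(max(0, pos - starts_mid_char))
        pos += sum(1 for b in tok if not 0x80 <= b < 0xC0)
    return joined.decode("utf-8"), offsets


if __name__ == "__main__":
    print(text_and_offsets([b"he", b"llo"]))
    print(text_and_offsets([b"a\xc3", b"\xa9b"]))
```

The fast path replaces a per-byte Python loop with a single `bytes.isascii()` check and `len()` calls, which is where the saving would come from on mostly-English text.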

encode_ordinary_batch benchmark (scripts/benchmark.py, num_threads=1):
  baseline: 10.6M bytes/s
  this PR:  35.2M bytes/s  (3.33x)

All 33 tests pass.