Labels: enhancement (New feature or request)
Description
Support Parallel Processing (Sharding) for DB and File Sources
Context
To support high-volume data imports, we need the ability to run multiple importer containers simultaneously. Each container should automatically process a unique slice of the data to avoid duplicates or locking.
Requirements
The system should respect two standard environment variables:
- SHARD_COUNT: The total number of parallel containers (the divisor in the modulo operation).
- SHARD_INDEX: The unique ID of the current container (0 to SHARD_COUNT - 1).
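A minimal sketch of how an importer might read and validate these two variables. The function name `read_shard_config` and the defaults (one shard, index 0, when the variables are unset) are assumptions for illustration and are not specified in this issue.

```python
import os
from typing import Mapping, Tuple


def read_shard_config(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Return (shard_count, shard_index) read from the environment.

    Assumption: when the variables are unset, fall back to a single
    shard (count=1, index=0) so a lone container processes everything.
    """
    shard_count = int(env.get("SHARD_COUNT", "1"))
    shard_index = int(env.get("SHARD_INDEX", "0"))

    if shard_count < 1:
        raise ValueError(f"SHARD_COUNT must be >= 1, got {shard_count}")
    # SHARD_INDEX must fall in the range 0 .. SHARD_COUNT - 1.
    if not 0 <= shard_index < shard_count:
        raise ValueError(
            f"SHARD_INDEX must be between 0 and {shard_count - 1}, got {shard_index}"
        )
    return shard_count, shard_index
```

Failing fast on an out-of-range index is one possible design choice: a misconfigured container would otherwise silently process nothing, or overlap with another shard.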
Approach
Database sources can apply the logic to the primary key (or a configured field) in SQL (e.g. WHERE MOD(record_id, SHARD_COUNT) = SHARD_INDEX).
File sources such as CSV can apply the logic to the line number during iteration (e.g. if current_line_number % SHARD_COUNT == SHARD_INDEX: ...).
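The two approaches above could be sketched roughly as follows. This is an illustration under stated assumptions, not the importer's actual code: the function names, the `records` table, and the choice to skip a CSV header row before numbering lines are all hypothetical. The database example uses SQLite from the standard library so that it runs as-is, and SQLite's `%` operator stands in for `MOD(...)`.

```python
import csv
import sqlite3
from typing import Iterator, List, TextIO


def read_csv_shard(
    fileobj: TextIO, shard_count: int, shard_index: int, has_header: bool = True
) -> Iterator[List[str]]:
    """Yield only the CSV rows that belong to this shard.

    Assumption: the header row (if any) is skipped first, and data rows
    are numbered from 0.
    """
    reader = csv.reader(fileobj)
    if has_header:
        next(reader, None)  # discard the header; it belongs to no shard
    for line_number, row in enumerate(reader):
        if line_number % shard_count == shard_index:
            yield row


def fetch_db_shard(
    conn: sqlite3.Connection, shard_count: int, shard_index: int
) -> List[int]:
    """Return the record_id values owned by this shard.

    Assumption: a hypothetical table `records` with an integer
    `record_id` primary key. Values are bound as parameters rather
    than interpolated into the SQL string.
    """
    cursor = conn.execute(
        "SELECT record_id FROM records "
        "WHERE record_id % ? = ? ORDER BY record_id",
        (shard_count, shard_index),
    )
    return [row[0] for row in cursor.fetchall()]
```

Every row or record matches exactly one shard index, so running one container per index from 0 to SHARD_COUNT - 1 covers the data without overlap. This presumes non-negative integer keys; negative or non-integer keys would need separate handling.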
Configuration
No changes to config.yml should be required. This should be driven entirely by environment variables to allow for easy scaling in Docker Compose.