Feature Request: Support Parallel Processing (Sharding) for DB and File Sources #33

@chriscorrea


Support Parallel Processing (Sharding) for DB and File Sources

Context

To support high-volume data imports, we need the ability to run multiple importer containers simultaneously. Each container should automatically process a unique slice of the data to avoid duplicates or locking.

Requirements

The system should respect two standard environment variables:

  • SHARD_COUNT: The total number of parallel containers (the modulus).
  • SHARD_INDEX: The unique ID of the current container (0 to SHARD_COUNT - 1).
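A minimal sketch of how an importer might read and validate these two variables. The function name `read_shard_config` and the fallback to a single shard when the variables are unset are assumptions for illustration; the request itself does not specify default behavior.

```python
import os


def read_shard_config(environ=os.environ):
    """Return (shard_count, shard_index) parsed from the environment.

    Assumption: when the variables are unset, run as a single shard
    (count=1, index=0) so existing deployments keep working.
    """
    count = int(environ.get("SHARD_COUNT", "1"))
    index = int(environ.get("SHARD_INDEX", "0"))

    if count < 1:
        raise ValueError(f"SHARD_COUNT must be >= 1, got {count}")
    if not 0 <= index < count:
        raise ValueError(
            f"SHARD_INDEX must be in 0..{count - 1}, got {index}"
        )
    return count, index
```

Failing fast on an out-of-range index matters here: a container started with SHARD_INDEX equal to SHARD_COUNT would otherwise match no records and exit silently.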

Approach

Database sources can apply the logic to the primary key (or a configured field) in SQL, e.g. WHERE MOD(record_id, SHARD_COUNT) = SHARD_INDEX.
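The database filter could be sketched as follows, using an in-memory SQLite database purely for demonstration. The table `records`, the column `payload`, and the helper `fetch_shard` are hypothetical. SQLite's `%` operator stands in for `MOD()` here, since `MOD()` is not available in every SQLite build; other databases may need the function form.

```python
import sqlite3


def fetch_shard(conn, shard_count, shard_index):
    """Fetch only the rows whose integer key falls in this shard."""
    # Shard values are bound as parameters rather than formatted
    # into the SQL string.
    sql = (
        "SELECT record_id, payload FROM records "
        "WHERE record_id % ? = ? ORDER BY record_id"
    )
    return conn.execute(sql, (shard_count, shard_index)).fetchall()


# Demo data: ten rows with keys 0..9 (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (record_id INTEGER PRIMARY KEY, payload TEXT)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(10)],
)

# Each of three containers would see a disjoint slice.
shards = [fetch_shard(conn, 3, i) for i in range(3)]
```

This only works as written for integer keys. A non-numeric key (a UUID, for instance) would need to be hashed to an integer first, which this sketch does not cover.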

File sources such as CSV can apply the logic to the line number during iteration, e.g. if current_line_number % SHARD_COUNT == SHARD_INDEX: ...
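For file sources, the per-line check might look like the sketch below. The helper `iter_shard_rows` is hypothetical, and two choices in it are assumptions rather than part of the request: the header row is skipped and not counted, and rows are numbered by parsed CSV record (starting at 0) rather than by physical line, so a quoted field containing a newline still counts as one row.

```python
import csv
import io


def iter_shard_rows(lines, shard_count, shard_index):
    """Yield only the data rows that belong to this shard."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row (assumed present)
    for row_number, row in enumerate(reader):
        if row_number % shard_count == shard_index:
            yield row


SAMPLE = "id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n"

shard_0 = list(iter_shard_rows(io.StringIO(SAMPLE), 2, 0))
shard_1 = list(iter_shard_rows(io.StringIO(SAMPLE), 2, 1))
```

Every container still reads the whole file with this approach. Only the downstream processing is divided, so the gain is largest when per-row work dominates file I/O.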

Configuration

No changes to config.yml should be required. This should be driven entirely by environment variables to allow for easy scaling in Docker Compose.

Labels

enhancement
