# v0.5.1
This release adds Spark integration and improves pipeline execution performance and testability in distributed scenarios.
That's right — type-safe Spark. I hope you're as excited as I am!
## What's New
- **Spark Extension**: You can now build Flowthru pipelines that run on Apache Spark via typed DataFrame wrappers over Spark.NET. The extension provides a catalog entry system and an expression builder that let you declare Spark transformations alongside your native .NET steps. See `Flowthru.Extensions.Spark` and the KedroSpaceflightsSpark starter for a complete example.
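As a rough sketch of the idea, a Spark-backed step body still compiles down to ordinary Spark.NET DataFrame operations. The `PreprocessCompanies` class and `Company` record below are illustrative assumptions, not the actual extension API; only the `Microsoft.Spark.Sql` calls are real Spark.NET.

```csharp
// Illustrative only: the step shape here is an assumption,
// not the real Flowthru.Extensions.Spark API.
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// A hypothetical typed schema the wrapper would pin down.
public record Company(string Id, string CompanyRating);

public class PreprocessCompanies
{
    // A Spark transformation declared alongside native .NET steps:
    // strips the "%" suffix and rescales the rating to [0, 1].
    public DataFrame Run(DataFrame companies) =>
        companies.WithColumn(
            "company_rating",
            RegexpReplace(Col("company_rating"), "%", "")
                .Cast("double") / 100.0);
}
```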
- **Scheduling Strategy System**: The task graph executor now supports pluggable scheduling strategies via the new `ISchedulingStrategy` interface. When multiple steps are ready to dispatch simultaneously, a strategy decides which step claims a worker slot, based on graph topology, step costs, or custom heuristics. This gives you finer control over execution behavior for pipelines with complex dependency patterns.
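For example, a strategy might prioritize the ready step with the widest fan-out, since finishing it unblocks the largest portion of the graph. The interface shape and member names below are assumptions inferred from the description above, not the verified API:

```csharp
// Hypothetical shapes: ISchedulingStrategy, IStepNode, and their
// members are illustrative guesses at the described interface.
using System.Collections.Generic;
using System.Linq;

public interface IStepNode
{
    string Name { get; }
    IReadOnlyCollection<IStepNode> Downstream { get; }
}

public interface ISchedulingStrategy
{
    // Called when several steps are ready at once; returns the
    // one that should claim the next free worker slot.
    IStepNode SelectNext(IReadOnlyList<IStepNode> readySteps);
}

// Prefer steps whose completion unblocks the most downstream work.
public sealed class WidestFanOutFirst : ISchedulingStrategy
{
    public IStepNode SelectNext(IReadOnlyList<IStepNode> readySteps) =>
        readySteps.OrderByDescending(s => s.Downstream.Count).First();
}
```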
- **Parallelized Pre-Flight Checks**: Pre-flight validation now executes independent checks in parallel, reducing startup latency for large pipelines. DAG analysis, file system inspection, and schema validation now share a single concurrent barrier instead of running as a sequential pipeline.
- **Post-Run Metadata**: Metadata providers can now implement `IPostRunMetadataProvider` to receive execution results after a pipeline completes. This enables post-run analytics, such as timing-based diagram coloring or performance reporting, without modifying your pipeline structure. The interface is optional; existing providers continue to work unchanged.
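A minimal sketch of an optional post-run provider, assuming an interface roughly like the one described; the member names and the `PipelineRunResult` and `StepTiming` types are illustrative, not the actual API:

```csharp
// Hypothetical shapes: IPostRunMetadataProvider, PipelineRunResult,
// and StepTiming are guesses at the described interface.
using System;
using System.Collections.Generic;

public sealed record StepTiming(string StepName, TimeSpan Duration);

public sealed record PipelineRunResult(
    bool Succeeded,
    IReadOnlyList<StepTiming> Timings);

public interface IPostRunMetadataProvider
{
    // Invoked once after the pipeline completes, with the run's
    // execution results; implementing this interface is optional.
    void OnPipelineCompleted(PipelineRunResult result);
}

// Example: flag slow steps for performance reporting.
public sealed class SlowStepReporter : IPostRunMetadataProvider
{
    public void OnPipelineCompleted(PipelineRunResult result)
    {
        foreach (var t in result.Timings)
            if (t.Duration > TimeSpan.FromSeconds(5))
                Console.WriteLine($"{t.StepName}: {t.Duration}");
    }
}
```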
- **FUnit Test Aggregation for Distributed Projects**: FUnit tests can now be collected across multiple independent library projects and executed as a single suite. This is useful for multi-catalog pipelines where catalog and flow definitions live in separate projects. See SpaceflightsDistributed for an example of multi-project catalog organization with unified testing.
- **New Starter**: KedroSpaceflightsSpark demonstrates Spark DataFrame integration alongside the familiar Spaceflights pipeline structure.
## Bug Fixes
- **Arrow Enum Marshalling**: Fixed enum serialization in the Python extension when converting Flowthru schemas to Arrow format. Enums are now correctly marshalled as their underlying integer values, preventing schema mismatches during cross-language data transfer. This resolves issues when Python steps consume or produce enum-containing schemas via the Arrow format.
## ❤️ Thank You
- Spencer Elkington