
v0.5.1


Released by github-actions on 15 Apr 10:39 · 131 commits to main since this release

This release adds Spark integration and improves pipeline execution performance and testability in distributed scenarios.

That's right — type-safe Spark. I hope you're as excited as I am!

What's New

  • Spark Extension: You can now build Flowthru pipelines that run on Apache Spark via typed DataFrame wrappers over Spark.NET. The extension provides a catalog entry system and expression builder that let you declare Spark transformations alongside your native .NET steps. See Flowthru.Extensions.Spark and the KedroSpaceflightsSpark starter for a complete example.

  • Scheduling Strategy System: The task graph executor now supports pluggable scheduling strategies via the new ISchedulingStrategy interface. When multiple steps are ready to dispatch simultaneously, strategies can prioritize which step claims a worker slot based on graph topology, step costs, or custom heuristics. This enables better control over execution behavior for pipelines with complex dependency patterns.
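A custom strategy might look roughly like the sketch below. The ISchedulingStrategy name comes from this release, but the member signature, the StepInfo type, and the DownstreamCount property are illustrative assumptions, not the shipped API:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: prefer the ready step with the most downstream
// dependents, so long dependency chains get started as early as possible.
public sealed class CriticalPathFirstStrategy : ISchedulingStrategy
{
    public StepInfo SelectNext(IReadOnlyList<StepInfo> readySteps)
        => readySteps.OrderByDescending(s => s.DownstreamCount).First();
}
```

The idea is that the executor calls the strategy only at the moment several steps are simultaneously dispatchable, so a strategy stays a pure prioritization function with no scheduling machinery of its own.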

  • Parallelized Pre-Flight Checks: Pre-flight validation now executes independent checks in parallel, reducing startup latency for large pipelines. DAG analysis, file system inspection, and schema validation now start concurrently and join at a single barrier instead of running one after another.
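The general .NET pattern behind this (not Flowthru's actual internals; the check method names are placeholders) is to start each independent check as a task and await them together:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical check methods — each returns a Task that completes
// when its validation pass finishes.
Func<Task>[] checks =
{
    AnalyzeDagAsync,        // cycle detection, unreachable steps
    InspectFileSystemAsync, // input paths exist, outputs writable
    ValidateSchemasAsync,   // catalog entries match step signatures
};

// All checks run concurrently; this single await is the barrier.
await Task.WhenAll(checks.Select(check => check()));
```

Startup latency then tracks the slowest check rather than the sum of all of them.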

  • Post-Run Metadata: Metadata providers can now implement IPostRunMetadataProvider to receive execution results after a pipeline completes. This enables post-run analytics like timing-based diagram coloring or performance reporting without modifying your pipeline structure. The interface is optional — existing providers continue to work unchanged.
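A provider opting in might look like this sketch. The IPostRunMetadataProvider name is from the release notes, but the method name and the PipelineRunResult shape are assumptions for illustration:

```csharp
using System;

// Hypothetical post-run provider that prints per-step timings —
// the kind of data that could drive timing-based diagram coloring.
public sealed class TimingReportProvider : IPostRunMetadataProvider
{
    public void OnPipelineCompleted(PipelineRunResult result)
    {
        foreach (var step in result.StepResults)
            Console.WriteLine($"{step.Name}: {step.Duration.TotalMilliseconds:F0} ms");
    }
}
```

Because the interface is additive, a provider that doesn't implement it simply never receives post-run callbacks.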

  • FUnit Test Aggregation for Distributed Projects: FUnit tests can now be collected across multiple independent library projects and executed as a single suite. This is useful for multi-catalog pipelines where catalog and flow definitions live in separate projects. See SpaceflightsDistributed for an example of multi-project catalog organization with unified testing.

  • New Starter: KedroSpaceflightsSpark demonstrates Spark DataFrame integration alongside the familiar Spaceflights pipeline structure.

Bug Fixes

  • Arrow Enum Marshalling: Fixed enum serialization in the Python extension when converting Flowthru schemas to Arrow format. Enums are now correctly marshalled as their underlying integer values, preventing schema mismatches during cross-language data transfer. This resolves issues when Python steps consume or produce enum-containing schemas via the Arrow format.
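In plain C# terms (this is an illustration of the fix's intent, not the extension's actual marshalling code), an enum value should cross the Arrow boundary as its underlying integer:

```csharp
using System;

enum Status : short { Pending = 0, Shipped = 1 }

// Resolve the enum's declared underlying type (short here) and convert,
// so the value lands in an integer column rather than a mismatched type.
object raw = Convert.ChangeType(Status.Shipped, Enum.GetUnderlyingType(typeof(Status)));
// raw now holds (short)1 — the integer representation Python sees.
```

Before the fix, schemas containing enums could produce Arrow columns whose type didn't match what the Python side expected, breaking cross-language transfer.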

❤️ Thank You

  • Spencer Elkington