# v0.5.1
This release adds Spark integration and improves pipeline execution performance and testability in distributed scenarios.
That's right — type-safe Spark. I hope you're as excited as I am!
## What's New
- **Spark Extension**: You can now build Flowthru pipelines that run on Apache Spark via typed DataFrame wrappers over Spark.NET. The extension provides a catalog entry system and an expression builder that let you declare Spark transformations alongside your native .NET steps. See `Flowthru.Extensions.Spark` and the KedroSpaceflightsSpark starter for a complete example.
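As a rough sketch of the idea, a Spark-backed step body still compiles down to ordinary Spark.NET DataFrame operations. The `PreprocessCompanies` class and `Company` record below are illustrative assumptions, not the actual extension API; only the `Microsoft.Spark.Sql` calls are real Spark.NET.

```csharp
// Illustrative only: the step shape here is an assumption,
// not the real Flowthru.Extensions.Spark API.
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// A hypothetical typed schema the wrapper would pin down.
public record Company(string Id, string CompanyRating);

public class PreprocessCompanies
{
    // A Spark transformation declared alongside native .NET steps:
    // strips the "%" suffix and rescales the rating to [0, 1].
    public DataFrame Run(DataFrame companies) =>
        companies.WithColumn(
            "company_rating",
            RegexpReplace(Col("company_rating"), "%", "")
                .Cast("double") / 100.0);
}
```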
- **Scheduling Strategy System**: The task graph executor now supports pluggable scheduling strategies via the new `ISchedulingStrategy` interface. When multiple steps are ready to dispatch simultaneously, a strategy decides which step claims a worker slot, based on graph topology, step costs, or custom heuristics. This gives you finer control over execution behavior for pipelines with complex dependency patterns.
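For example, a strategy might prioritize the ready step with the widest fan-out, since finishing it unblocks the largest portion of the graph. The interface shape and member names below are assumptions inferred from the description above, not the verified API:

```csharp
// Hypothetical shapes: ISchedulingStrategy, IStepNode, and their
// members are illustrative guesses at the described interface.
using System.Collections.Generic;
using System.Linq;

public interface IStepNode
{
    string Name { get; }
    IReadOnlyCollection<IStepNode> Downstream { get; }
}

public interface ISchedulingStrategy
{
    // Called when several steps are ready at once; returns the
    // one that should claim the next free worker slot.
    IStepNode SelectNext(IReadOnlyList<IStepNode> readySteps);
}

// Prefer steps whose completion unblocks the most downstream work.
public sealed class WidestFanOutFirst : ISchedulingStrategy
{
    public IStepNode SelectNext(IReadOnlyList<IStepNode> readySteps) =>
        readySteps.OrderByDescending(s => s.Downstream.Count).First();
}
```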
- **Parallelized Pre-Flight Checks**: Pre-flight validation now executes independent checks in parallel, reducing startup latency for large pipelines. DAG analysis, file system inspection, and schema validation now share a single concurrent barrier instead of running as a sequential pipeline.
- **Post-Run Metadata**: Metadata providers can now implement `IPostRunMetadataProvider` to receive execution results after a pipeline completes. This enables post-run analytics, such as timing-based diagram coloring or performance reporting, without modifying your pipeline structure. The interface is optional; existing providers continue to work unchanged.
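A minimal sketch of an optional post-run provider, assuming an interface roughly like the one described; the member names and the `PipelineRunResult` and `StepTiming` types are illustrative, not the actual API:

```csharp
// Hypothetical shapes: IPostRunMetadataProvider, PipelineRunResult,
// and StepTiming are guesses at the described interface.
using System;
using System.Collections.Generic;

public sealed record StepTiming(string StepName, TimeSpan Duration);

public sealed record PipelineRunResult(
    bool Succeeded,
    IReadOnlyList<StepTiming> Timings);

public interface IPostRunMetadataProvider
{
    // Invoked once after the pipeline completes, with the run's
    // execution results; implementing this interface is optional.
    void OnPipelineCompleted(PipelineRunResult result);
}

// Example: flag slow steps for performance reporting.
public sealed class SlowStepReporter : IPostRunMetadataProvider
{
    public void OnPipelineCompleted(PipelineRunResult result)
    {
        foreach (var t in result.Timings)
            if (t.Duration > TimeSpan.FromSeconds(5))
                Console.WriteLine($"{t.StepName}: {t.Duration}");
    }
}
```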
- **FUnit Test Aggregation for Distributed Projects**: FUnit tests can now be collected across multiple independent library projects and executed as a single suite. This is useful for multi-catalog pipelines where catalog and flow definitions live in separate projects. See SpaceflightsDistributed for an example of multi-project catalog organization with unified testing.
- **New Starter**: KedroSpaceflightsSpark demonstrates Spark DataFrame integration alongside the familiar Spaceflights pipeline structure.
## Bug Fixes
- **Arrow Enum Marshalling**: Fixed enum serialization in the Python extension when converting Flowthru schemas to Arrow format. Enums are now correctly marshalled as their underlying integer values, preventing schema mismatches during cross-language data transfer. This resolves issues when Python steps consume or produce enum-containing schemas via the Arrow format.
## ❤️ Thank You
- Spencer Elkington