Releases: chaoticgoodcomputing/flowthru
v0.8.0
This release brings configuration management into Flowthru's type-safe ecosystem, and adds HTTP data pulling with optional storage-based caching.
What's New
- Configuration as Typed Catalog: Configuration sections are now bound as strongly-typed catalog items, so steps can depend on configuration the same way they depend on data. Use the `[FlowthruConfig]` attribute and `[ConfigSection]` properties to wire configuration sections (from `appsettings.json` or environment variables) directly into your step parameters. This eliminates string-based configuration lookups and moves configuration errors to compile time.
- Idiomatic .NET Integration: Flowthru now uses the standard `Microsoft.Extensions.Configuration` system rather than custom configuration plumbing. Extensions are configured through the standard DI container, and all examples have been migrated. See KedroSpaceflights.Custom for a complete example of configuration as catalog in action.
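As a rough sketch of the attribute wiring described above (the class names, property names, and section layout here are illustrative, not the confirmed Flowthru API):

```csharp
using Microsoft.Extensions.Configuration;

// Hypothetical sketch: a strongly-typed view of a section in
// appsettings.json such as
//   { "Training": { "TestSize": 0.2, "RandomState": 3 } }
// Exact attribute semantics may differ from the shipped release.
[FlowthruConfig]
public class PipelineOptions
{
    [ConfigSection("Training")]
    public TrainingSection Training { get; set; } = new();
}

public class TrainingSection
{
    public double TestSize { get; set; }
    public int RandomState { get; set; }
}
```

A step would then declare a dependency on `TrainingSection` like any other catalog item, so a missing or misspelled section surfaces as a type error rather than a runtime string lookup.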
- HTTP Provider with Storage Caching: You can now pull data over HTTP using the `HttpStorageMediumProvider` from the new HTTP extension. Optionally cache downloaded files locally to avoid repeated HTTP calls; specify a cache directory and maximum age, and cached files are automatically validated and refreshed. RetailDataSplitFlow demonstrates HTTP pulling with caching.
- Enhanced Metadata Surface: The metadata provider API now has better visibility into flow exports and improved default provider behavior, making it easier to inspect and document your pipelines after execution.
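The cached HTTP pull might be configured along these lines (a sketch only; the options type and the `CacheDirectory` / `MaxCacheAge` names are assumptions, not the confirmed extension API):

```csharp
// Hypothetical sketch: pulling a CSV over HTTP with local caching.
// Option names below are illustrative, not the confirmed
// Flowthru HTTP extension surface.
var provider = new HttpStorageMediumProvider(
    new Uri("https://example.com/data/retail.csv"),
    new HttpStorageOptions
    {
        CacheDirectory = "data/.http-cache",  // reuse downloads across runs
        MaxCacheAge = TimeSpan.FromHours(24)  // re-fetch once the copy is stale
    });
```

Files older than the maximum age would be re-validated against the source, so repeated pipeline runs skip the network entirely while the cache is fresh.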
🚀 Features
- pull over HTTP, storage cache (8d430203)
- better metadata surface and default provider behavior (d9df524f)
- migrate to dotnet IConfiguration (d3c8e50d)
- extensions now use idiomatic C# config system (77100c0c)
- configuration as catalog (ed8e3cb4)
🩹 Fixes
- resolve issue with GitHub pre-releases (95b15373)
- resolve CI orphaned head issue (ec5c7237)
- resolve issues with manual dispatch deployments hitting NX versioning clothesline. (07c1da41)
❤️ Thank You
- Spencer Elkington
v0.6.2
v0.6.0
We've added deferred query handles for GQL and EFCore, giving steps explicit control over when data is fetched from external sources. This should give users more control over when data is materialized within their pipelines.
What's New
- Deferred Query Handles: You can now use `GqlQuery<TResult, T>` and `DbQuery<T>` to declare queries in your catalog without triggering data fetches. Steps call `ToListAsync()` to materialize data when needed, making I/O boundaries explicit in step code. This keeps the catalog focused on structural declaration while steps control execution timing.
- GQL Query Object: The `GqlQuery<TResult, T>` handle supports non-paginated, Relay-paginated, and offset-paginated queries, with composable filters via `GqlQuery<TFilter, TResult, T>` and a `WithFilter()` method.
- EFCore Deferred Queries: The `DbQuery<T>` handle offers fluent composition (`Where`, `OrderBy`, `Take`, `Skip`) and server-side `INSERT-FROM-SELECT` fusion when saving to the same database and scope.
- KedroSpaceflightsGQL Example: Demonstrates step-based parameterization using deferred GQL queries to fetch company analytics on demand. See examples/advanced/KedroSpaceflightsGQL.
- Improved Spaceflights Distributed: The distributed Spaceflights example now includes comprehensive FUnit test coverage for all data science and reporting steps, showing how to write testable multi-step pipelines. See examples/advanced/SpaceflightsDistributed.
- Enhanced Spark Support: The Spark extension now includes more Spark-native operations — better expression translation, support for additional scalar functions, and optimizations for ranking and windowing. See KedroSpaceflightsSpark.
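A sketch of how a deferred handle might appear inside a step (the `Shuttle` type and the step signature are illustrative; only the fluent members and `ToListAsync()` named above are taken from the release notes):

```csharp
// Hypothetical sketch: a step consuming a deferred EFCore query handle.
// The catalog declares the query; nothing is fetched until the step
// calls ToListAsync(), so the I/O boundary is explicit in step code.
public async Task<List<Shuttle>> FilterLargeShuttles(DbQuery<Shuttle> shuttles)
{
    // Fluent composition stays unevaluated until materialization.
    return await shuttles
        .Where(s => s.Capacity > 100)
        .OrderBy(s => s.Capacity)
        .Take(50)
        .ToListAsync();
}
```

The same shape would apply to `GqlQuery<TResult, T>`: the catalog entry names the query, and the step decides when the round trip actually happens.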
Bug Fixes
- Spaceflights Distributed Tests: All data processing, model training, evaluation, and reporting steps now have corresponding FUnit tests, ensuring step-level transformations are validated before integration.
❤️ Thank You
- Spencer Elkington
v0.5.1
This release adds Spark integration and improves pipeline execution performance and testability in distributed scenarios.
That's right — type-safe Spark. I hope you're as excited as I am!
What's New
- Spark Extension: You can now build Flowthru pipelines that run on Apache Spark via typed DataFrame wrappers over Spark.NET. The extension provides a catalog entry system and expression builder that lets you declare Spark transformations alongside your native .NET steps. See Flowthru.Extensions.Spark and the KedroSpaceflightsSpark starter for a complete example.
- Scheduling Strategy System: The task graph executor now supports pluggable scheduling strategies via the new `ISchedulingStrategy` interface. When multiple steps are ready to dispatch simultaneously, strategies can prioritize which step claims a worker slot based on graph topology, step costs, or custom heuristics. This enables better control over execution behavior for pipelines with complex dependency patterns.
- Parallelized Pre-Flight Checks: Pre-flight validation now executes independent checks in parallel, reducing startup latency for large pipelines. DAG analysis, file system inspection, and schema validation now share a single concurrent barrier instead of running as a sequential pipeline.
- Post-Run Metadata: Metadata providers can now implement `IPostRunMetadataProvider` to receive execution results after a pipeline completes. This enables post-run analytics like timing-based diagram coloring or performance reporting without modifying your pipeline structure. The interface is optional; existing providers continue to work unchanged.
- FUnit Test Aggregation for Distributed Projects: FUnit tests can now be collected across multiple independent library projects and executed as a single suite. This is useful for multi-catalog pipelines where catalog and flow definitions live in separate projects. See SpaceflightsDistributed for an example of multi-project catalog organization with unified testing.
- New Starter: KedroSpaceflightsSpark demonstrates Spark DataFrame integration alongside the familiar Spaceflights pipeline structure.
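A custom scheduling strategy from the list above might be sketched like this (the member name `SelectNext`, the `StepNode` type, and its `DownstreamCount` property are all assumptions for illustration; only `ISchedulingStrategy` itself comes from the release notes):

```csharp
// Hypothetical sketch: a strategy that dispatches the ready step
// with the most downstream dependents first, on the theory that it
// unblocks the largest portion of the graph. Member names are
// illustrative, not the confirmed ISchedulingStrategy contract.
public class MostDependentsFirstStrategy : ISchedulingStrategy
{
    public StepNode SelectNext(IReadOnlyList<StepNode> readySteps)
    {
        // Prefer the step whose completion frees the most successors.
        return readySteps
            .OrderByDescending(s => s.DownstreamCount)
            .First();
    }
}
```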
Bug Fixes
- Arrow Enum Marshalling: Fixed enum serialization in the Python extension when converting Flowthru schemas to Arrow format. Enums are now correctly marshalled as their underlying integer values, preventing schema mismatches during cross-language data transfer. This resolves issues when Python steps consume or produce enum-containing schemas via the Arrow format.
❤️ Thank You
- Spencer Elkington
v0.3.1
🩹 Fixes
Whoops! I jacked up the mechanism that delivers new packages to NuGet. This should resolve the issue, with v0.3.1 going out containing the last few releases.
❤️ Thank You
- Spencer Elkington
v0.3.0
What's New?
- Parallel Step Execution: You can now run independent steps concurrently by passing the `--parallelism` CLI flag or setting `MaxDegreeOfParallelism` in your execution options. Use `--parallelism auto` to match your machine's core count automatically. Flowthru's DAG guarantees are preserved: steps only run in parallel when their dependencies allow it.
- GraphQL Extension: We've added Flowthru.Extensions.GQL, which lets you use GraphQL endpoints as catalog entry sources and sinks. You can wire queries and mutations into your catalog using `GqlItemFactory`, with built-in pagination support for large result sets.
To demonstrate both parallelism and the new GraphQL support, we've added an end-to-end Spaceflights pipeline that runs ingestion, data science, and reporting stages against GraphQL data sources.
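Setting the parallelism programmatically might look roughly like this (the options type name `FlowthruExecutionOptions` is an assumption; only the `MaxDegreeOfParallelism` property and the `--parallelism` flag are named in the notes):

```csharp
// Hypothetical sketch: the in-code equivalent of `--parallelism 4`.
// The options type name is illustrative, not a confirmed signature.
var options = new FlowthruExecutionOptions
{
    // DAG ordering is preserved: only steps whose dependencies
    // have completed are eligible to run concurrently.
    MaxDegreeOfParallelism = 4
};
```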
🩹 Fixes
- agent hook (98c1cad)
- programmatic defaults for core counts (67a1065)
- corrected log output (107ceea)
- correct test assertion on CLI incoming (780386d)
- add multithread to split pipeline example (4b7f4af)
- refactor gql to better fit (c32837b)
❤️ Thank You
- Spencer Elkington