Skip to content

Commit fc9ee90

Browse files
committed
Harden Junjo library logging behavior
- replace core runtime stdout output with stdlib logging under the junjo logger hierarchy - keep library import silent by installing only a NullHandler at the package root - log propagated execution failures once at the owning workflow/subflow boundary instead of duplicating nested stack traces - add run-scoped correlation fields to runtime log records for workflow, node, and concurrent execution paths - enrich exporter flush/shutdown warnings with endpoint context for operational diagnosis - update the base example and OpenTelemetry docs to teach app-owned logging configuration - add regression tests for NullHandler setup, no direct runtime stdout, structured log correlation, and exporter warning logs
1 parent 5491d45 commit fc9ee90

14 files changed

Lines changed: 598 additions & 52 deletions

File tree

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -63,6 +63,9 @@ execution, state management, and lifecycle observation.
6363
- Workflow telemetry now records `junjo.workflow.execution_graph_snapshot` to make it explicit that the graph payload is an execution-scoped compiled snapshot containing both runtime and structural identities.
6464
- Failed workflow, subflow, node, and concurrent spans now set the standard OpenTelemetry `error.type` attribute in addition to Junjo-specific error metadata.
6565
- `JunjoOtelExporter` now exposes `shutdown()`, and the docs/examples now teach provider shutdown as the normal OpenTelemetry lifecycle while keeping `flush()` as a targeted manual drain tool.
66+
- Core library execution paths now use the standard Python logging system under the `junjo` logger hierarchy instead of writing directly to stdout, and the library root now installs only a `NullHandler`.
67+
- `JunjoOtelExporter.flush()` and `shutdown()` now log warning details through `junjo.telemetry` when Junjo-owned export components fail or refuse to flush cleanly.
68+
- Runtime log records now include run-scoped correlation fields such as `run_id`, and propagated execution failures are logged once at the owning workflow or subflow boundary instead of emitting duplicate stack traces from nested execution layers.
6669
- `on_state_changed` hook payloads and state-change telemetry context now identify the active executable that performed the mutation, rather than mixing workflow metadata with node or subflow runtime identities.
6770
- Hook callback failures are now recorded as `junjo.hook_error` events on the surrounding workflow, subflow, node, or concurrent span, and terminal hooks now dispatch before span close so those events stay attached to the real execution span.
6871
- Lifecycle observation examples and docs now show hook registration as a separate concern from workflow definition.

HARDENING_PLAN.md

Lines changed: 33 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -21,17 +21,19 @@ summarized briefly. Open work is described in more detail.
2121
- Removal of the old subscriber implementation
2222
- Replacement of `HookManager` with the new `Hooks` + internal lifecycle split
2323
- Graph validation, compilation, structural IDs, and rendering hardening
24+
- Standard OpenTelemetry `error.type` alignment on failed spans
25+
- Provider-owned OpenTelemetry shutdown lifecycle with wrapper-local `shutdown()` / `flush()` semantics
26+
- Hook callback failure telemetry attached to the surrounding execution span instead of standalone hook-error spans
2427
- Public docstring, docs, and example alignment for the hardened runtime model
28+
- Changelog and agent guidance cleanup for the current hardening work
2529

2630
### Partially Complete
2731

2832
- Regression coverage for major known runtime/store failures
29-
- Documentation and example truthfulness
30-
- Changelog and agent guidance cleanup
33+
- Observability operational safety
3134

3235
### Still Open
3336

34-
- Production-safe observability controls
3537
- Release/process discipline improvements
3638

3739
## Completed Work
@@ -72,17 +74,20 @@ Delivered:
7274
- Runtime execution, internal lifecycle dispatch, public hooks, and telemetry are separated.
7375
- Hook failures are isolated and recorded without failing workflow execution.
7476
- State change hooks now receive detached state snapshots and JSON patch payloads.
77+
- Hook callback failures are now recorded on the surrounding execution span instead of being modeled as standalone hook-error spans.
78+
- Terminal hooks now dispatch before span close so hook failure telemetry stays attached to the real workflow, subflow, node, or concurrent span.
7579

7680
### 4. Docs And Example Truthfulness
7781

78-
Status: largely completed
82+
Status: completed
7983

8084
Delivered:
8185

8286
- Public runtime docstrings were updated instead of being shortened away.
8387
- `Workflow` and `Subflow` constructor docs remain on `__init__` for generated docs and hover help.
8488
- Examples now separate workflow definition from hook/logging wiring.
8589
- Hook documentation and examples now show real usage rather than placeholder configuration.
90+
- OpenTelemetry docs now describe provider-owned shutdown and state-model-controlled telemetry serialization truthfully.
8691

8792
### 5. Graph Hardening
8893

@@ -116,36 +121,43 @@ Delivered:
116121

117122
## Phase B - Observability Operational Safety
118123

124+
Status: partially complete
125+
126+
### Delivered so far
127+
128+
- Failed workflow, subflow, node, and concurrent spans now set standard OpenTelemetry `error.type`.
129+
- `JunjoOtelExporter` now exposes `shutdown()`, and docs/examples now teach provider shutdown as the normal lifecycle instead of exporter-local flush on exit.
130+
- Hook callback failures now stay attached to the surrounding execution span rather than creating standalone hook-error spans.
131+
- State telemetry docs now explain the current control point clearly:
132+
state snapshots and JSON patches follow the state model's Pydantic serialization.
133+
- Core runtime execution paths now emit through package logging instead of direct `print()`.
134+
- Junjo now uses the standard Python `logging` package under the `junjo` logger hierarchy and installs only a `NullHandler` at the library root.
135+
- Exporter-local `flush()` and `shutdown()` failures now log through `junjo.telemetry` instead of failing silently.
136+
- Runtime log records now carry run-scoped correlation fields, and propagated failures now log once at the owning workflow or subflow boundary instead of duplicating nested stack traces.
137+
119138
### Why this is still open
120139

121-
Telemetry correctness improved, but operational controls are still missing.
122-
Core runtime paths still use `print()` and there is no real telemetry
123-
configuration model for redaction, payload size, or capture profiles.
140+
Telemetry correctness improved, but the library still has no explicit
141+
library-level telemetry capture policy for redaction, payload size ceilings, or
142+
different observability profiles.
124143

125144
### Remaining changes
126145

127-
- Replace direct runtime `print()` calls with package logging.
128-
- Remove stdout emission from shipped library execution paths, currently including:
129-
- workflow start / progress / completion / failure messages in ``src/junjo/workflow.py``
130-
- node failure prints in ``src/junjo/node.py``
131-
- run-concurrent start / completion / failure prints in ``src/junjo/run_concurrent.py``
132-
- Define package logger names, log levels, and expectations for library consumers.
133-
- Keep example-app logging and demo ``print()`` usage out of the core library runtime.
134-
- Introduce explicit telemetry configuration:
146+
- Decide whether Junjo needs explicit library-level telemetry capture configuration beyond the current state-model serialization controls and current graph snapshot defaults.
147+
- If explicit capture controls are added, design them deliberately:
135148
- state capture policy
136149
- graph capture policy
137150
- patch capture policy
138151
- redaction/masking support
139152
- size ceilings
140153
- AI Studio vs generic OTLP profiles
141-
- Add explicit exporter lifecycle behavior such as shutdown/flush expectations.
142-
- Consider versioning Junjo-specific telemetry schema fields.
154+
- Decide later whether point-in-time lifecycle telemetry should stay span/event-first or move toward correlated logs; current instrumentation remains span/event-first intentionally.
155+
- Consider versioning Junjo-specific telemetry schema fields once the capture model stabilizes.
143156

144157
### Exit criteria
145158

146-
- Core runtime emits through logging instead of `print()`.
147-
- Telemetry payload controls are configurable and documented.
148-
- Exporter lifecycle and failure behavior are explicit.
159+
- The repo has a clear documented stance on telemetry payload controls, whether that remains state-model serialization only or expands into explicit capture configuration.
160+
- Exporter lifecycle and failure behavior remain explicit and accurate in docs/examples.
149161

150162
## Phase C - Quality Gates And Release Discipline
151163

@@ -195,4 +207,5 @@ The highest-risk runtime correctness work is already done. The remaining work is
195207
primarily about:
196208

197209
- making observability production-safe
210+
- deciding how much explicit telemetry capture policy Junjo should own as a library
198211
- making release quality enforceable by process instead of memory

docs/opentelemetry.rst

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -70,6 +70,56 @@ It also exposes:
7070
Use ``flush()`` for targeted cases such as tests or short-lived scripts. Use
7171
provider shutdown for the normal application lifecycle.
7272

73+
Library Logging
74+
===============
75+
76+
Junjo emits library logs under the ``junjo`` logger hierarchy. Applications own
77+
handlers, formatting, and log levels.
78+
79+
The main library loggers are:
80+
81+
- ``junjo.workflow``
82+
- ``junjo.node``
83+
- ``junjo.run_concurrent``
84+
- ``junjo.telemetry``
85+
86+
Junjo does not install real log handlers of its own. If you want to see Junjo
87+
execution diagnostics, configure logging in your application and opt in to the
88+
``junjo`` logger namespace.
89+
90+
.. code-block:: python
91+
92+
import logging
93+
94+
logging.basicConfig(
95+
level=logging.INFO,
96+
format="%(levelname)s %(name)s %(message)s",
97+
)
98+
logging.getLogger("junjo").setLevel(logging.DEBUG)
99+
100+
With that configuration, Junjo emits debug-level execution progress through the
101+
standard Python logging system without taking over your application's logging
102+
setup.
103+
104+
Runtime log records include run-scoped correlation fields through standard
105+
logging ``extra`` attributes when that execution context exists:
106+
107+
- ``run_id``
108+
- ``executable_definition_id``
109+
- ``executable_runtime_id``
110+
- ``span_type``
111+
112+
Applications using structured logging handlers can capture those fields
113+
directly from the log record without parsing log message text.
114+
115+
Execution failures are logged at the owning workflow or subflow boundary so one
116+
propagated failure produces one library-owned error log instead of multiple
117+
stack traces from each nested execution layer.
118+
119+
Exporter-local warning logs under ``junjo.telemetry`` also include the OTLP
120+
``endpoint`` on the log record so operational failures can be tied back to the
121+
destination that failed.
122+
73123
Choosing an OpenTelemetry Exporter
74124
===================================
75125

examples/base/src/base/main.py

Lines changed: 18 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,31 +1,45 @@
1+
import logging
2+
13
from dotenv import load_dotenv
24

35
from base.otel_config import init_otel
46
from base.sample_workflow.hooks import create_logging_hooks
57
from base.sample_workflow.workflow import create_sample_workflow
68

9+
logger = logging.getLogger(__name__)
10+
11+
12+
def configure_logging() -> None:
13+
"""Configure application logging for the base example."""
14+
logging.basicConfig(
15+
level=logging.INFO,
16+
format="%(levelname)s %(name)s %(message)s",
17+
)
18+
logging.getLogger("junjo").setLevel(logging.DEBUG)
19+
720

821
async def main():
922
"""The main entry point for the application."""
1023

1124
# Load the environment variables
1225
load_dotenv()
26+
configure_logging()
1327

1428
# Setup OpenTelemetry before anything else happens
1529
telemetry_providers = init_otel(service_name="Junjo Base Example")
1630

1731
try:
1832
workflow = create_sample_workflow(hooks=create_logging_hooks())
1933

20-
print("Executing the workflow...")
34+
logger.info("Executing the workflow...")
2135
result = await workflow.execute()
22-
print("Final state: ", result.state.model_dump_json())
36+
logger.info("Final state: %s", result.state.model_dump_json())
2337

24-
print("Done executing the base example workflow.")
38+
logger.info("Done executing the base example workflow.")
2539
finally:
2640
if telemetry_providers is not None:
2741
tracer_provider, meter_provider = telemetry_providers
28-
print("Shutting down OpenTelemetry providers...")
42+
logger.info("Shutting down OpenTelemetry providers...")
2943
tracer_provider.shutdown()
3044
meter_provider.shutdown()
3145

examples/base/src/base/otel_config.py

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
1+
import logging
12
import os
23

34
from junjo.telemetry.junjo_otel_exporter import JunjoOtelExporter
@@ -7,6 +8,8 @@
78
from opentelemetry.sdk.resources import Resource
89
from opentelemetry.sdk.trace import TracerProvider
910

11+
logger = logging.getLogger(__name__)
12+
1013

1114
def init_otel(
1215
service_name: str,
@@ -16,8 +19,10 @@ def init_otel(
1619
# Load the JUNJO_AI_STUDIO_API_KEY from the environment variable
1720
JUNJO_AI_STUDIO_API_KEY = os.getenv("JUNJO_AI_STUDIO_API_KEY")
1821
if JUNJO_AI_STUDIO_API_KEY is None:
19-
print("JUNJO_AI_STUDIO_API_KEY environment variable is not set. "
20-
"Generate a new API key in the Junjo AI Studio UI.")
22+
logger.warning(
23+
"JUNJO_AI_STUDIO_API_KEY environment variable is not set. "
24+
"Generate a new API key in the Junjo AI Studio UI."
25+
)
2126
return None
2227

2328
# Configure OpenTelemetry for this application

examples/base/src/base/sample_workflow/hooks.py

Lines changed: 14 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,7 @@
11
from __future__ import annotations
22

3+
import logging
4+
35
from junjo import Hooks
46
from junjo.hooks import (
57
LifecycleEvent,
@@ -11,6 +13,8 @@
1113
from base.sample_workflow.sample_subflow.store import SampleSubflowState
1214
from base.sample_workflow.store import SampleWorkflowState
1315

16+
logger = logging.getLogger(__name__)
17+
1418

1519
def _base_event_details(event: LifecycleEvent) -> dict:
1620
details = {
@@ -55,15 +59,16 @@ def _format_event_details(event: LifecycleEvent) -> dict:
5559

5660

5761
def _log_event(event: LifecycleEvent) -> None:
58-
print(f"[hook] {event.hook_name}", _format_event_details(event))
62+
logger.info("[hook] %s %s", event.hook_name, _format_event_details(event))
5963

6064

6165
def _log_workflow_completed(
6266
event: WorkflowCompletedEvent[SampleWorkflowState],
6367
) -> None:
6468
state = event.result.state
65-
print(
66-
f"[hook] {event.hook_name}",
69+
logger.info(
70+
"[hook] %s %s",
71+
event.hook_name,
6772
{
6873
**_base_event_details(event),
6974
"counter": state.counter,
@@ -78,8 +83,9 @@ def _log_subflow_completed(
7883
event: SubflowCompletedEvent[SampleSubflowState],
7984
) -> None:
8085
state = event.result.state
81-
print(
82-
f"[hook] {event.hook_name}",
86+
logger.info(
87+
"[hook] %s %s",
88+
event.hook_name,
8389
{
8490
**_base_event_details(event),
8591
"item_count": len(state.items or []),
@@ -91,8 +97,9 @@ def _log_subflow_completed(
9197

9298
def _log_state_changed(event: StateChangedEvent[SampleWorkflowState]) -> None:
9399
state = event.state
94-
print(
95-
f"[hook] {event.hook_name}",
100+
logger.info(
101+
"[hook] %s %s",
102+
event.hook_name,
96103
{
97104
**_base_event_details(event),
98105
"action_name": event.action_name,

src/junjo/__init__.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,8 @@
77
This library also produces annotated Opentelemetry Spans to help make sense of
88
execution telemetry.
99
"""
10+
import logging
11+
1012
from .condition import Condition
1113
from .edge import Edge
1214
from .graph import (
@@ -32,6 +34,8 @@
3234
Workflow,
3335
)
3436

37+
logging.getLogger("junjo").addHandler(logging.NullHandler())
38+
3539
__all__ = [
3640
"Condition",
3741
"Graph",

src/junjo/hooks.py

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -220,10 +220,13 @@ class Hooks:
220220
221221
.. code-block:: python
222222
223+
import logging
224+
223225
hooks = Hooks()
226+
logger = logging.getLogger(__name__)
224227
225228
def log_completed(event: WorkflowCompletedEvent[MyState]) -> None:
226-
print(event.hook_name, event.result.state.model_dump())
229+
logger.info("%s %s", event.hook_name, event.result.state.model_dump())
227230
228231
hooks.on_workflow_completed(log_completed)
229232

0 commit comments

Comments
 (0)