Python: Foundry Evals integration for Python by alliscode · Pull Request #4750 · microsoft/agent-framework

alliscode · 2026-03-17T21:15:01Z

Add evaluation framework with local and Foundry-hosted evaluator support:

EvalItem/EvalResult core types with conversation splitting strategies
@evaluator decorator for defining custom evaluation functions
LocalEvaluator for running evaluations locally
FoundryEvals provider for Azure AI Foundry hosted evaluations
evaluate_agent() orchestration with expected values support
evaluate_workflow() for multi-agent workflow evaluation
Comprehensive test suite and evaluation samples

Contribution Checklist

The code builds clean without any errors or warnings
The PR follows the Contribution Guidelines
All unit tests pass, and I have added new tests where possible
Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

python/packages/core/agent_framework/_evaluation.py

python/packages/core/agent_framework/__init__.py

python/packages/core/agent_framework/_eval.py

python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py

markwallace-microsoft · 2026-03-19T20:44:50Z

Python Test Coverage Report •

File	Stmts	Miss	Cover	Missing
packages/core/agent_framework
_evaluation.py	627	73	88%	158, 166, 478, 480, 606, 609, 688–690, 695, 732–735, 791–792, 795, 801–803, 807, 840–842, 896, 931, 943–945, 950, 974–979, 1070, 1148–1149, 1151–1155, 1161, 1201, 1546, 1548, 1556, 1566, 1570, 1615, 1633–1634, 1705, 1711, 1726, 1730–1732, 1762, 1768–1772, 1804, 1835–1836, 1838, 1863–1864, 1869
packages/foundry/agent_framework_foundry
_foundry_evals.py	237	4	98%	432, 437, 616, 681
TOTAL	28829	3431	88%

Python Unit Test Overview

Tests	Skipped	Failures	Errors	Time
5729	20 💤	0 ❌	0 🔥	1m 29s ⏱️

Copilot

Pull request overview

This PR adds a provider-agnostic evaluation framework to the Python Agent Framework, with both local (no-API) evaluators and an Azure AI Foundry-backed provider, plus end-to-end samples that demonstrate agent and workflow evaluation patterns.

Changes:

Introduces core evaluation types and orchestration (EvalItem, EvalResults, evaluate_agent(), evaluate_workflow()) plus local checks (LocalEvaluator, @evaluator).
Adds Azure AI Foundry provider integration (FoundryEvals) and trace/target evaluation helpers.
Adds/updates evaluation samples (Foundry evals patterns + self-reflection groundedness) and expands test coverage for local evaluation.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 12 comments.

Show a summary per file

File	Description
python/samples/05-end-to-end/evaluation/self_reflection/self_reflection.py	Migrates groundedness scoring to `FoundryEvals` and updates batch runner.
python/samples/05-end-to-end/evaluation/self_reflection/README.md	Updates self-reflection sample documentation for Foundry Evals usage and env vars.
python/samples/05-end-to-end/evaluation/self_reflection/.env.example	Updates env var example to `FOUNDRY_PROJECT_ENDPOINT`.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_workflow_sample.py	New sample: evaluate multi-agent workflows with Foundry evaluators.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py	New sample: evaluate existing responses / traces via Foundry.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_multiturn_sample.py	New sample: demonstrate conversation split strategies for eval.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_mixed_sample.py	New sample: mix `LocalEvaluator` with Foundry evaluators in one call.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_all_patterns_sample.py	New “kitchen sink” sample covering all evaluation patterns.
python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py	New sample: evaluate_agent patterns + direct `FoundryEvals.evaluate()`.
python/samples/05-end-to-end/evaluation/foundry_evals/README.md	New README describing Foundry eval samples and entry points.
python/samples/05-end-to-end/evaluation/foundry_evals/.env.example	New env example for Foundry eval samples.
python/samples/03-workflows/evaluation/evaluate_workflow.py	New workflow evaluation sample using local checks.
python/samples/02-agents/evaluation/evaluate_with_expected.py	New sample demonstrating expected outputs/tool call expectations.
python/samples/02-agents/evaluation/evaluate_agent.py	New sample demonstrating basic local evaluation for agents.
python/packages/core/tests/core/test_observability.py	Adjusts OTLP exporter-related test skipping.
python/packages/core/tests/core/test_local_eval.py	Adds a comprehensive test suite for local eval framework behaviors.
python/packages/core/agent_framework/_evaluation.py	Adds the provider-agnostic evaluation framework implementation.
python/packages/core/agent_framework/init.py	Re-exports evaluation APIs/types from the package root.
python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py	Adds the Foundry-backed `FoundryEvals` provider + trace/target helpers.
python/packages/azure-ai/agent_framework_azure_ai/init.py	Exposes `FoundryEvals` and helper functions from the azure-ai package.

python/packages/core/tests/core/test_observability.py

python/packages/foundry/agent_framework_foundry/_foundry_evals.py

python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py

python/samples/05-end-to-end/evaluation/self_reflection/self_reflection.py

python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py

python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_mixed_sample.py

python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py

python/samples/05-end-to-end/evaluation/self_reflection/self_reflection.py

python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_workflow_sample.py

python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py

python/packages/foundry/agent_framework_foundry/_foundry_evals.py

python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py

python/packages/core/agent_framework/_evaluation.py

python/packages/foundry/agent_framework_foundry/_foundry_evals.py

python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py

python/packages/foundry/agent_framework_foundry/_foundry_evals.py

python/samples/02-agents/evaluation/evaluate_with_expected.py

python/packages/foundry/agent_framework_foundry/_foundry_evals.py

python/samples/02-agents/evaluation/evaluate_agent.py

python/samples/05-end-to-end/evaluation/foundry_evals/.env.example

python/packages/core/agent_framework/_evaluation.py

Merged and refactored eval module per Eduard's PR review: - Merge _eval.py + _local_eval.py into single _evaluation.py - Convert EvalItem from dataclass to regular class - Rename to_dict() to to_eval_data() - Convert _AgentEvalData to TypedDict - Simplify check system: unified async pattern with isawaitable - Parallelize checks and evaluators with asyncio.gather - Add all/any mode to tool_called_check - Fix bool(passed) truthy bug in _coerce_result - Remove deprecated function_evaluator/async_function_evaluator aliases - Remove _MinimalAgent, tighten evaluate_agent signature - Set self.name in __init__ (LocalEvaluator, FoundryEvals) - Limit FoundryEvals to AsyncOpenAI only - Type project_client as AIProjectClient - Remove NotImplementedError continuous eval code - Add evaluation samples in 02-agents/ and 03-workflows/ - Update all imports and tests (167 passing) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Use cast(list[Any], x) with type: ignore[redundant-cast] comments to satisfy both mypy (which considers casting Any redundant) and pyright strict mode (which needs explicit casts to narrow Unknown types). Also fix evaluator decorator check_name type annotation to be explicitly str, resolving mypy str|Any|None mismatch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…format - Remove import of non-existent _foundry_memory_provider module (incorrectly kept during rebase conflict resolution) - Apply ruff formatter to test_local_eval.py and get-started samples Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The upstream provider-leading client refactor (microsoft#4818) made client= a required parameter on Agent(). Update the three getting-started eval samples to use FoundryChatClient with FOUNDRY_PROJECT_ENDPOINT, matching the standard pattern from 01-get-started samples. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace ~80 lines of manual OpenAI evals API code (create_eval, run_eval, manual polling, raw JSONL params) with FoundryEvals: - evaluate_groundedness() uses FoundryEvals.evaluate() with EvalItem - Remove create_openai_client(), create_eval(), run_eval() functions - Remove openai SDK type imports (DataSourceConfigCustom, etc.) - run_self_reflection_batch creates FoundryEvals instance once, reuses it for all iterations across all prompts Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Migrate all foundry_evals samples from AzureOpenAIResponsesClient to FoundryChatClient - Update env var from AZURE_AI_PROJECT_ENDPOINT to FOUNDRY_PROJECT_ENDPOINT - Use AzureCliCredential consistently across all samples - Fix README.md: correct function names (evaluate_dataset -> FoundryEvals.evaluate, evaluate_responses -> evaluate_traces) - Update self_reflection .env.example and README.md Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…jectClient AIProjectClient from azure.ai.projects.aio requires an async credential. Switch all foundry_evals samples from azure.identity.AzureCliCredential to azure.identity.aio.AzureCliCredential. Also pass project_client to FoundryChatClient instead of duplicating endpoint+credential. Close credential in self_reflection sample to avoid resource leak. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Convert all Example:: / Typical usage:: code blocks to .. code-block:: python format matching codebase convention (both _evaluation.py and _foundry_evals.py) - Add async pagination in _fetch_output_items via async for (handles large result sets) - Replace hasattr(__aenter__) with isinstance(client, AsyncOpenAI) in _resolve_openai_client - Move AsyncOpenAI import from TYPE_CHECKING to runtime (needed for isinstance) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix tests: use MagicMock(spec=AsyncOpenAI) for project_client mocks (isinstance check now requires proper type, not duck-typing) - Fix tests: replace mock_page.__iter__ with _AsyncPage helper for async for - Fix evaluate_response: auto-extract queries from response messages when query is not provided (previously always raised ValueError) - Add debug logging when skipping internal _-prefixed executor IDs Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- T1: Add comment explaining builtin.* pass-through in _resolve_evaluator - T2: Add comment referencing OpenAI evals API for testing_criteria dict - T3: Document Mustache-style {{item.*}} template placeholders - T4: Document poll loop 60s sleep upper bound rationale - T5: Narrow run type to RunRetrieveResponse, use typed field access instead of vars()/getattr dance in _extract_result_counts and _extract_per_evaluator; use run.error and run.report_url directly - T6: Clarify openai_client docstring re: Azure Foundry endpoint - T8: Remove misleading empty expected_tool_calls from sample - Update tests to match real SDK PerTestingCriteriaResult shape Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

RunRetrieveResponse is the correct type — no backward compat needed for a brand new feature. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

FoundryEvals now takes client: FoundryChatClient as its primary parameter instead of openai_client: AsyncOpenAI. The builtin.* evaluators require a Foundry endpoint, so the type should reflect that. - FoundryEvals.__init__: client: FoundryChatClient replaces openai_client - evaluate_traces / evaluate_foundry_target: same change - _resolve_openai_client: extracts .client from FoundryChatClient - project_client fallback retained for standalone functions - All samples updated to construct FoundryChatClient and pass as client= - Tests updated (openai_client= → client=) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

If a developer sets a higher poll_interval, respect it. Only clamp to remaining time and enforce a 1s minimum for rate-limit protection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…mple Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com>

Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com>

- Rename model_deployment -> model across FoundryEvals and all samples - Make model param optional, resolves from client.model - Convert EvalResults from dataclass to regular class - Remove deprecated evaluate_response() function - Refactor splitters: BUILT_IN_SPLITTERS dict + standalone functions - Change per_turn_items from classmethod to staticmethod - Simplify EvalCheck type alias to use Awaitable[CheckResult] - Remove errored property from EvalResults - Remove default value from Evaluator protocol eval_name - Rename assert_passed -> raise_for_status, add EvalNotPassedError - Type agent param as SupportsAgentRun | None - Fix Arguments docstring - Update __init__.py exports - Update all tests and samples Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Move _foundry_evals.py from azure-ai to foundry package - Move test_foundry_evals.py to foundry/tests/ - Update lazy re-exports in agent_framework.foundry namespace - Update .pyi type stubs - All samples now import from agent_framework.foundry - Split tool-call evaluation into evaluate_tool_calls_sample.py - Fix all_passed to check errored count from result_counts - Fix raise_for_status to include errored item details Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

FoundryEvals() now works zero-config when FOUNDRY_PROJECT_ENDPOINT and FOUNDRY_MODEL environment variables are set. Auto-creates a FoundryChatClient under the hood, matching the established env var pattern. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…rror check - Remove unused _normalize_queries function and its tests - Add pyright ignore for EvalAPIError None check (defensive guard) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add image (data/uri) content handling to AgentEvalConverter.convert_message() so that Content.from_data() and Content.from_uri() image payloads are preserved as input_image parts in the Foundry evaluator format. - Handle Content type='data' and type='uri' → emit input_image parts - Add 6 unit tests for image content through convert_message/convert_messages - Add integration test verifying images flow through EvalItem → JSONL path - Add evaluate_multimodal.py sample demonstrating local image eval Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix project_client docstring to say async-only (not sync/async) - Add builtin evaluator name validation warning in _resolve_evaluator - Replace getattr with typed attribute access in _poll_eval_run, _extract_result_counts, _extract_per_evaluator, _fetch_output_items - Remove cast import from _foundry_evals (no longer needed) - Tighten _coerce_result: honour explicit 'passed' when both 'score' and 'passed' are present; remove performative cast - Fix self_reflection sample: add env file existence check - Fix traces sample: correct Pattern 2 section label - Update all Foundry eval samples to FoundryChatClient + FOUNDRY_MODEL (remove AIProjectClient + AZURE_AI_MODEL_DEPLOYMENT_NAME pattern) - Add eval_name and OpenAI client docs to FoundryEvals docstring - Update test mocks to match typed SDK objects (_MockResultCounts) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…mments Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ConversationSplitter is now a runtime-checkable Protocol with a named 'conversation' parameter, making the expected signature self-documenting. ConversationSplit enum members gain a __call__ method so they satisfy the protocol directly -- ConversationSplit.LAST_TURN(conversation) works. This simplifies _split_conversation from an isinstance dispatch to a single split(conversation) call. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Replace FOUNDRY_MODEL with AZURE_AI_MODEL_DEPLOYMENT_NAME in all eval samples to match repo convention - Replace Unicode symbols with ASCII equivalents in all eval sample print statements to avoid cp1252 encoding errors on Windows Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

markwallace-microsoft added documentation Improvements or additions to documentation python labels Mar 17, 2026

github-actions bot changed the title ~~Foundry Evals integration for Python~~ Python: Foundry Evals integration for Python Mar 17, 2026

alliscode force-pushed the af-foundry-evals-python branch from a0edd5f to fe9e621 Compare March 17, 2026 21:21

eavanvalkenburg reviewed Mar 18, 2026

View reviewed changes

alliscode force-pushed the af-foundry-evals-python branch 6 times, most recently from 15d8640 to aad92ac Compare March 19, 2026 20:41

alliscode force-pushed the af-foundry-evals-python branch 8 times, most recently from a74c9d1 to 8d8b6e8 Compare March 25, 2026 17:55

alliscode marked this pull request as ready for review March 25, 2026 19:43

Copilot AI review requested due to automatic review settings March 25, 2026 19:43

Copilot started reviewing on behalf of alliscode March 25, 2026 19:46 View session

Copilot AI reviewed Mar 25, 2026

View reviewed changes

alliscode force-pushed the af-foundry-evals-python branch from d266ee2 to 997a379 Compare March 25, 2026 20:01

moonbox3 reviewed Mar 26, 2026

View reviewed changes

TaoChenOSU reviewed Mar 26, 2026

View reviewed changes

eavanvalkenburg reviewed Mar 27, 2026

View reviewed changes

alliscode and others added 2 commits March 27, 2026 11:05

alliscode and others added 20 commits March 27, 2026 11:05

Revert unrelated formatting changes to get-started samples

b568898

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix lint errors in eval samples (E501, ASYNC240, formatting)

641c25a

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove evaluate_all_patterns_sample.py (redundant with focused samples)

8288bd9

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Revert test_observability.py to upstream/main (not our test)

9c050ef

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove unnecessary Any union from run type annotations

1af02d0

RunRetrieveResponse is the correct type — no backward compat needed for a brand new feature. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove implicit 60s upper bound on poll interval

1156a34

If a developer sets a higher poll_interval, respect it. Only clamp to remaining time and enforce a 1s minimum for rate-limit protection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove 1s floor on poll interval — let the developer control it

b5142f1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Update python/samples/05-end-to-end/evaluation/foundry_evals/.env.exa…

d0a57ef

…mple Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com>

Update python/samples/02-agents/evaluation/evaluate_agent.py

2d4fb5f

Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com>

alliscode force-pushed the af-foundry-evals-python branch from 398621f to b63dd34 Compare March 27, 2026 18:08

alliscode and others added 8 commits March 27, 2026 11:17

Fix pyright errors: remove dead _normalize_queries, suppress EvalAPIE…

fde1bb9

…rror check - Remove unused _normalize_queries function and its tests - Add pyright ignore for EvalAPIError None check (defensive guard) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix ruff lint errors (E501, SIM108, SIM102)

9795c68

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix pyright errors: type-narrow dict to dict[str, Any], add ignore co…

b3eb251

…mments Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

TaoChenOSU approved these changes Mar 30, 2026

View reviewed changes

Conversation

alliscode commented Mar 17, 2026

Contribution Checklist

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

markwallace-microsoft commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Python Unit Test Overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

markwallace-microsoft commented Mar 19, 2026 •

edited

Loading