Python: Foundry Evals integration for Python #4750
Open
alliscode wants to merge 48 commits into microsoft:main
Conversation
Force-pushed a0edd5f to fe9e621
Review threads on python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py (5 threads, outdated, resolved)
Force-pushed 15d8640 to aad92ac
Python Test Coverage Report • Python Unit Test Overview
Force-pushed a74c9d1 to 8d8b6e8
Contributor
Pull request overview
This PR adds a provider-agnostic evaluation framework to the Python Agent Framework, with both local (no-API) evaluators and an Azure AI Foundry-backed provider, plus end-to-end samples that demonstrate agent and workflow evaluation patterns.
Changes:
- Introduces core evaluation types and orchestration (EvalItem, EvalResults, evaluate_agent(), evaluate_workflow()) plus local checks (LocalEvaluator, @evaluator).
- Adds Azure AI Foundry provider integration (FoundryEvals) and trace/target evaluation helpers.
- Adds/updates evaluation samples (Foundry evals patterns + self-reflection groundedness) and expands test coverage for local evaluation.
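A minimal sketch of the local (no-API) evaluation flow named above, using toy stand-ins for EvalItem and EvalResults; the real agent_framework types, signatures, and check interface may differ:

```python
# Toy sketch only: EvalItem-style records run through simple local
# checks and aggregate into a pass/fail summary.
from dataclasses import dataclass, field


@dataclass
class EvalItem:
    query: str
    response: str


@dataclass
class EvalResults:
    results: list[tuple[str, bool]] = field(default_factory=list)

    @property
    def all_passed(self) -> bool:
        return all(passed for _, passed in self.results)


def evaluate_items(items, checks) -> EvalResults:
    out = EvalResults()
    for item in items:
        for name, check in checks.items():
            out.results.append((name, check(item)))
    return out


checks = {"non_empty": lambda it: bool(it.response.strip())}
results = evaluate_items([EvalItem("hi", "Hello!")], checks)
print(results.all_passed)
```
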
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 12 comments.
Summary per file:
| File | Description |
|---|---|
| python/samples/05-end-to-end/evaluation/self_reflection/self_reflection.py | Migrates groundedness scoring to FoundryEvals and updates batch runner. |
| python/samples/05-end-to-end/evaluation/self_reflection/README.md | Updates self-reflection sample documentation for Foundry Evals usage and env vars. |
| python/samples/05-end-to-end/evaluation/self_reflection/.env.example | Updates env var example to FOUNDRY_PROJECT_ENDPOINT. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_workflow_sample.py | New sample: evaluate multi-agent workflows with Foundry evaluators. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py | New sample: evaluate existing responses / traces via Foundry. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_multiturn_sample.py | New sample: demonstrate conversation split strategies for eval. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_mixed_sample.py | New sample: mix LocalEvaluator with Foundry evaluators in one call. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_all_patterns_sample.py | New “kitchen sink” sample covering all evaluation patterns. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py | New sample: evaluate_agent patterns + direct FoundryEvals.evaluate(). |
| python/samples/05-end-to-end/evaluation/foundry_evals/README.md | New README describing Foundry eval samples and entry points. |
| python/samples/05-end-to-end/evaluation/foundry_evals/.env.example | New env example for Foundry eval samples. |
| python/samples/03-workflows/evaluation/evaluate_workflow.py | New workflow evaluation sample using local checks. |
| python/samples/02-agents/evaluation/evaluate_with_expected.py | New sample demonstrating expected outputs/tool call expectations. |
| python/samples/02-agents/evaluation/evaluate_agent.py | New sample demonstrating basic local evaluation for agents. |
| python/packages/core/tests/core/test_observability.py | Adjusts OTLP exporter-related test skipping. |
| python/packages/core/tests/core/test_local_eval.py | Adds a comprehensive test suite for local eval framework behaviors. |
| python/packages/core/agent_framework/_evaluation.py | Adds the provider-agnostic evaluation framework implementation. |
| python/packages/core/agent_framework/__init__.py | Re-exports evaluation APIs/types from the package root. |
| python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py | Adds the Foundry-backed FoundryEvals provider + trace/target helpers. |
| python/packages/azure-ai/agent_framework_azure_ai/__init__.py | Exposes FoundryEvals and helper functions from the azure-ai package. |
Review threads (resolved):
- python/packages/foundry/agent_framework_foundry/_foundry_evals.py (outdated)
- python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py (outdated)
- python/samples/05-end-to-end/evaluation/self_reflection/self_reflection.py (outdated)
- python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py
- python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_mixed_sample.py (outdated)
- python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py (outdated)
- python/samples/05-end-to-end/evaluation/self_reflection/self_reflection.py
- python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_workflow_sample.py (outdated)
- python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py (outdated)
Force-pushed d266ee2 to 997a379
moonbox3 reviewed Mar 26, 2026
Review thread on python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py (outdated, resolved)
TaoChenOSU reviewed Mar 26, 2026
Review threads (outdated, resolved):
- python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py (2 threads)
- python/packages/foundry/agent_framework_foundry/_foundry_evals.py
- python/samples/05-end-to-end/evaluation/foundry_evals/.env.example
Merged and refactored eval module per Eduard's PR review:
- Merge _eval.py + _local_eval.py into single _evaluation.py
- Convert EvalItem from dataclass to regular class
- Rename to_dict() to to_eval_data()
- Convert _AgentEvalData to TypedDict
- Simplify check system: unified async pattern with isawaitable
- Parallelize checks and evaluators with asyncio.gather
- Add all/any mode to tool_called_check
- Fix bool(passed) truthy bug in _coerce_result
- Remove deprecated function_evaluator/async_function_evaluator aliases
- Remove _MinimalAgent, tighten evaluate_agent signature
- Set self.name in __init__ (LocalEvaluator, FoundryEvals)
- Limit FoundryEvals to AsyncOpenAI only
- Type project_client as AIProjectClient
- Remove NotImplementedError continuous eval code
- Add evaluation samples in 02-agents/ and 03-workflows/
- Update all imports and tests (167 passing)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
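The unified async check pattern and asyncio.gather parallelization mentioned in the commit above can be sketched as a toy re-implementation; the real decorator and result types in agent_framework may differ:

```python
# Toy sketch: one @evaluator decorator accepts both sync and async
# check functions via isawaitable, and checks run in parallel.
import asyncio
import inspect
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool


def evaluator(fn):
    """Wrap a sync or async check function into a uniform async check."""
    async def run(output: str) -> CheckResult:
        result = fn(output)
        if inspect.isawaitable(result):  # unified async pattern
            result = await result
        return CheckResult(name=fn.__name__, passed=bool(result))
    return run


@evaluator
def contains_greeting(output: str) -> bool:
    return "hello" in output.lower()


@evaluator
async def is_short(output: str) -> bool:
    return len(output) < 50


async def main():
    checks = [contains_greeting, is_short]
    # Run all checks concurrently, as the commit does with asyncio.gather
    return await asyncio.gather(*(c("Hello, world!") for c in checks))


results = asyncio.run(main())
print([(r.name, r.passed) for r in results])
```
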
Use cast(list[Any], x) with type: ignore[redundant-cast] comments to satisfy both mypy (which considers casting Any redundant) and pyright strict mode (which needs explicit casts to narrow Unknown types). Also fix evaluator decorator check_name type annotation to be explicitly str, resolving mypy str|Any|None mismatch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
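The dual-checker cast pattern from the commit above looks roughly like this (parse_rows is a hypothetical helper, not from the PR): mypy flags casting an Any value as redundant, while pyright strict mode needs the explicit cast to narrow Unknown, so the ignore comment satisfies both at once.

```python
from typing import Any, cast


def parse_rows(payload: Any) -> list[Any]:
    # cast() narrows the Unknown/Any value for pyright; the ignore
    # comment silences mypy's redundant-cast complaint about it.
    rows = cast(list[Any], payload["rows"])  # type: ignore[redundant-cast]
    return rows


print(parse_rows({"rows": [1, 2, 3]}))
```
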
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…format

- Remove import of non-existent _foundry_memory_provider module (incorrectly kept during rebase conflict resolution)
- Apply ruff formatter to test_local_eval.py and get-started samples

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The upstream provider-leading client refactor (microsoft#4818) made client= a required parameter on Agent(). Update the three getting-started eval samples to use FoundryChatClient with FOUNDRY_PROJECT_ENDPOINT, matching the standard pattern from 01-get-started samples. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace ~80 lines of manual OpenAI evals API code (create_eval, run_eval, manual polling, raw JSONL params) with FoundryEvals:
- evaluate_groundedness() uses FoundryEvals.evaluate() with EvalItem
- Remove create_openai_client(), create_eval(), run_eval() functions
- Remove openai SDK type imports (DataSourceConfigCustom, etc.)
- run_self_reflection_batch creates FoundryEvals instance once, reuses it for all iterations across all prompts

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Migrate all foundry_evals samples from AzureOpenAIResponsesClient to FoundryChatClient
- Update env var from AZURE_AI_PROJECT_ENDPOINT to FOUNDRY_PROJECT_ENDPOINT
- Use AzureCliCredential consistently across all samples
- Fix README.md: correct function names (evaluate_dataset -> FoundryEvals.evaluate, evaluate_responses -> evaluate_traces)
- Update self_reflection .env.example and README.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…jectClient

AIProjectClient from azure.ai.projects.aio requires an async credential. Switch all foundry_evals samples from azure.identity.AzureCliCredential to azure.identity.aio.AzureCliCredential. Also pass project_client to FoundryChatClient instead of duplicating endpoint+credential. Close credential in self_reflection sample to avoid resource leak.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Convert all Example:: / Typical usage:: code blocks to .. code-block:: python format matching codebase convention (both _evaluation.py and _foundry_evals.py)
- Add async pagination in _fetch_output_items via async for (handles large result sets)
- Replace hasattr(__aenter__) with isinstance(client, AsyncOpenAI) in _resolve_openai_client
- Move AsyncOpenAI import from TYPE_CHECKING to runtime (needed for isinstance)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
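The `async for` pagination change described above can be illustrated with a toy async page object; the real SDK page type and field names are assumptions, not the actual OpenAI SDK interface:

```python
# Toy sketch: drain an async-iterable page of eval output items.
import asyncio


class _AsyncPage:
    """Minimal stand-in for an SDK async-iterable page of results."""

    def __init__(self, items):
        self._items = items

    def __aiter__(self):
        self._iter = iter(self._items)
        return self

    async def __anext__(self):
        try:
            return next(self._iter)
        except StopIteration:
            raise StopAsyncIteration


async def fetch_output_items(page) -> list:
    items = []
    async for item in page:  # handles arbitrarily large result sets
        items.append(item)
    return items


items = asyncio.run(fetch_output_items(_AsyncPage(["a", "b", "c"])))
print(items)
```
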
- Fix tests: use MagicMock(spec=AsyncOpenAI) for project_client mocks (isinstance check now requires proper type, not duck-typing)
- Fix tests: replace mock_page.__iter__ with _AsyncPage helper for async for
- Fix evaluate_response: auto-extract queries from response messages when query is not provided (previously always raised ValueError)
- Add debug logging when skipping internal _-prefixed executor IDs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- T1: Add comment explaining builtin.* pass-through in _resolve_evaluator
- T2: Add comment referencing OpenAI evals API for testing_criteria dict
- T3: Document Mustache-style {{item.*}} template placeholders
- T4: Document poll loop 60s sleep upper bound rationale
- T5: Narrow run type to RunRetrieveResponse, use typed field access
instead of vars()/getattr dance in _extract_result_counts and
_extract_per_evaluator; use run.error and run.report_url directly
- T6: Clarify openai_client docstring re: Azure Foundry endpoint
- T8: Remove misleading empty expected_tool_calls from sample
- Update tests to match real SDK PerTestingCriteriaResult shape
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
RunRetrieveResponse is the correct type — no backward compat needed for a brand new feature. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
FoundryEvals now takes client: FoundryChatClient as its primary parameter instead of openai_client: AsyncOpenAI. The builtin.* evaluators require a Foundry endpoint, so the type should reflect that.
- FoundryEvals.__init__: client: FoundryChatClient replaces openai_client
- evaluate_traces / evaluate_foundry_target: same change
- _resolve_openai_client: extracts .client from FoundryChatClient
- project_client fallback retained for standalone functions
- All samples updated to construct FoundryChatClient and pass as client=
- Tests updated (openai_client= → client=)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If a developer sets a higher poll_interval, respect it. Only clamp to remaining time and enforce a 1s minimum for rate-limit protection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
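The clamping rule described above can be captured in one expression; next_sleep and its parameters are illustrative names for the sketch, not the actual poll-loop code:

```python
# Respect a developer-set poll_interval, but cap it at the remaining
# timeout budget and keep a 1-second floor for rate-limit protection.
def next_sleep(poll_interval: float, remaining: float) -> float:
    return max(1.0, min(poll_interval, remaining))


print(next_sleep(30.0, 120.0))  # 30.0 — higher interval respected
print(next_sleep(30.0, 5.0))    # 5.0 — clamped to remaining time
print(next_sleep(0.1, 120.0))   # 1.0 — 1s minimum enforced
```
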
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…mple Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com>
Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com>
- Rename model_deployment -> model across FoundryEvals and all samples
- Make model param optional, resolves from client.model
- Convert EvalResults from dataclass to regular class
- Remove deprecated evaluate_response() function
- Refactor splitters: BUILT_IN_SPLITTERS dict + standalone functions
- Change per_turn_items from classmethod to staticmethod
- Simplify EvalCheck type alias to use Awaitable[CheckResult]
- Remove errored property from EvalResults
- Remove default value from Evaluator protocol eval_name
- Rename assert_passed -> raise_for_status, add EvalNotPassedError
- Type agent param as SupportsAgentRun | None
- Fix Arguments docstring
- Update __init__.py exports
- Update all tests and samples

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move _foundry_evals.py from azure-ai to foundry package
- Move test_foundry_evals.py to foundry/tests/
- Update lazy re-exports in agent_framework.foundry namespace
- Update .pyi type stubs
- All samples now import from agent_framework.foundry
- Split tool-call evaluation into evaluate_tool_calls_sample.py
- Fix all_passed to check errored count from result_counts
- Fix raise_for_status to include errored item details

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed 398621f to b63dd34
FoundryEvals() now works zero-config when FOUNDRY_PROJECT_ENDPOINT and FOUNDRY_MODEL environment variables are set. Auto-creates a FoundryChatClient under the hood, matching the established env var pattern. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rror check

- Remove unused _normalize_queries function and its tests
- Add pyright ignore for EvalAPIError None check (defensive guard)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add image (data/uri) content handling to AgentEvalConverter.convert_message() so that Content.from_data() and Content.from_uri() image payloads are preserved as input_image parts in the Foundry evaluator format.
- Handle Content type='data' and type='uri' → emit input_image parts
- Add 6 unit tests for image content through convert_message/convert_messages
- Add integration test verifying images flow through EvalItem → JSONL path
- Add evaluate_multimodal.py sample demonstrating local image eval

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
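The conversion branch described above looks roughly like this toy version; the part shape and field names ("type", "value", "image_url") are assumptions, not the actual converter format:

```python
# Toy sketch: data- and uri-typed content parts become input_image
# parts; everything else stays input_text.
def convert_content(part: dict) -> dict:
    if part["type"] in ("data", "uri"):
        return {"type": "input_image", "image_url": part["value"]}
    return {"type": "input_text", "text": part["value"]}


msg = [
    {"type": "text", "value": "Describe this image"},
    {"type": "uri", "value": "https://example.invalid/cat.png"},
]
converted = [convert_content(p) for p in msg]
print(converted)
```
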
- Fix project_client docstring to say async-only (not sync/async)
- Add builtin evaluator name validation warning in _resolve_evaluator
- Replace getattr with typed attribute access in _poll_eval_run, _extract_result_counts, _extract_per_evaluator, _fetch_output_items
- Remove cast import from _foundry_evals (no longer needed)
- Tighten _coerce_result: honour explicit 'passed' when both 'score' and 'passed' are present; remove performative cast
- Fix self_reflection sample: add env file existence check
- Fix traces sample: correct Pattern 2 section label
- Update all Foundry eval samples to FoundryChatClient + FOUNDRY_MODEL (remove AIProjectClient + AZURE_AI_MODEL_DEPLOYMENT_NAME pattern)
- Add eval_name and OpenAI client docs to FoundryEvals docstring
- Update test mocks to match typed SDK objects (_MockResultCounts)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
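The tightened _coerce_result rule above can be sketched as follows; the 0.5 threshold and dict shape are illustrative, not the actual implementation:

```python
# Toy sketch: when a result dict carries both 'score' and 'passed',
# the explicit 'passed' flag wins instead of being derived from score.
def coerce_result(raw: dict) -> bool:
    if "passed" in raw:
        return bool(raw["passed"])
    return float(raw.get("score", 0.0)) >= 0.5  # threshold is illustrative


print(coerce_result({"score": 0.9, "passed": False}))  # False — flag wins
print(coerce_result({"score": 0.9}))                   # True — from score
```
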
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…mments Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ConversationSplitter is now a runtime-checkable Protocol with a named 'conversation' parameter, making the expected signature self-documenting. ConversationSplit enum members gain a __call__ method so they satisfy the protocol directly -- ConversationSplit.LAST_TURN(conversation) works. This simplifies _split_conversation from an isinstance dispatch to a single split(conversation) call. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
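A condensed sketch of the protocol design described above; the split behaviors shown here (last element vs. whole list) are simplified stand-ins for the real splitter logic:

```python
# Toy sketch: a runtime-checkable Protocol with a named 'conversation'
# parameter, and enum members that satisfy it via __call__.
from enum import Enum
from typing import Protocol, runtime_checkable


@runtime_checkable
class ConversationSplitter(Protocol):
    def __call__(self, conversation: list) -> list: ...


class ConversationSplit(Enum):
    LAST_TURN = "last_turn"
    FULL = "full"

    def __call__(self, conversation: list) -> list:
        # Members are directly callable, so ConversationSplit.LAST_TURN
        # satisfies the ConversationSplitter protocol on its own.
        if self is ConversationSplit.LAST_TURN:
            return conversation[-1:]
        return conversation


split = ConversationSplit.LAST_TURN
print(isinstance(split, ConversationSplitter))
print(split(["turn 1", "turn 2", "turn 3"]))
```
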
- Replace FOUNDRY_MODEL with AZURE_AI_MODEL_DEPLOYMENT_NAME in all eval samples to match repo convention
- Replace Unicode symbols with ASCII equivalents in all eval sample print statements to avoid cp1252 encoding errors on Windows

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TaoChenOSU approved these changes Mar 30, 2026
Add evaluation framework with local and Foundry-hosted evaluator support:
Contribution Checklist