Building Custom Evaluators for Google ADK Agents
Tools like deepeval and Arize test output quality. ADK custom evaluators test agent behavior: routing decisions, tool-call arguments, and execution traces. Here's how to build them, with real Python code.
When we built the CPC Comparison AI Agent for a health insurance payer, writing the actual agent logic was the easy part. The hard part was knowing whether it worked correctly. A single user query triggers a chain: intent classification → agent routing → tool selection → API calls → data transformation → file generation. Any break in that chain produces a wrong answer — and traditional unit tests don't even know the chain exists. This article covers how we built a custom evaluation framework for our Google ADK agents, why popular LLM evaluation tools weren't sufficient on their own, and how you can build the same for your own ADK system.
Why Standard Tests Break for Agentic Systems
Unit testing an agent is fundamentally different from unit testing a function. A function has a deterministic input → output contract. An agent has an input → multi-step execution trace → output contract. The execution trace matters as much as the final output.
- A query answered correctly for the wrong reason is a latent bug — the agent got lucky, not right
- Tool call arguments can be hallucinated even when the final answer looks correct
- Routing failures (wrong sub-agent called) may produce a plausible but wrong output
- Mock mode behavior can diverge from live API behavior in subtle, hard-to-detect ways
- Traditional assert-based tests have no concept of 'agent trace' or 'tool call sequence'
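To make the trace contract concrete, here is a minimal sketch (the agent names and trace shape are hypothetical): a run whose final answer reads correctly but whose routing skipped a step.

```python
# Hypothetical run: the answer looks right, but the orchestrator skipped
# the documents_agent step, so the agent "got lucky" rather than got it right.
trace = [{"agent": "orchestrator"}, {"agent": "comparison_agent"}]
final_answer = "Plan A has a lower deductible than Plan B."

# A traditional unit test only sees the output, so it passes:
output_ok = "Plan A" in final_answer

# A trace-level check sees the broken chain, so it fails:
expected_path = ["orchestrator", "cpc_intake_agent",
                 "documents_agent", "comparison_agent"]
actual_path = [step["agent"] for step in trace]
path_ok = actual_path == expected_path

print(output_ok, path_ok)  # True False
```

The output check passes while the path check fails: exactly the latent bug an output-only test can never surface.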
LLM Evaluation Tools vs Agent Evaluators — What's the Difference?
Tools like deepeval, Arize Phoenix, and Braintrust are excellent for evaluating LLM output quality — whether answers are faithful, relevant, and free of hallucination. But they don't know your agent's routing graph, tool schemas, or execution trace. They evaluate what came out of the model, not how the agent got there. ADK custom evaluators fill the gap by testing agent behavior at every step of execution.
| | deepeval / Arize / Braintrust | Google ADK Custom Evaluators |
|---|---|---|
| Evaluates | LLM output quality | Agent execution behavior |
| Knows routing graph | ❌ No | ✅ Yes — validates agent sequence |
| Inspects tool call args | ❌ No | ✅ Yes — schema-validates every call |
| Framework-agnostic | ✅ Works with any LLM | ADK-specific by design |
| Great for | RAG quality, hallucination, toxicity | Routing accuracy, tool use correctness |
| Catches | Wrong answers, irrelevant responses | Wrong routing, hallucinated tool args |
| When to use | Model output benchmarking | Agent system integration testing |
What Google ADK Provides Out of the Box
Google ADK ships with a built-in evaluation module that handles common cases — but it has real gaps for domain-specific agent systems.
- Built-in: Response quality scoring using LLM-as-judge (relevance, coherence)
- Built-in: Trajectory evaluation — did the agent follow a reasonable path?
- Built-in: Tool use evaluation at a coarse level
- Gap: No domain-specific routing validation (e.g. 'cpc_intake_agent must always precede documents_agent')
- Gap: No schema-level tool argument validation against your defined contracts
- Gap: No mock-vs-live parity testing to catch environment-dependent failures
- Gap: No custom output structure validation (e.g. Excel sheet structure, JSON schema integrity)
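As an example of the mock-vs-live parity gap: the core of such a check can be as small as a diff over the two agent sequences. This is a sketch, and the trace field names are assumptions, not an ADK API.

```python
def trace_parity(mock_trace: list[dict], live_trace: list[dict]) -> list[str]:
    """Report agent-sequence divergences between a mock run and a live run."""
    mock_seq = [step["agent"] for step in mock_trace]
    live_seq = [step["agent"] for step in live_trace]
    if mock_seq == live_seq:
        return []
    return [f"mock routed {mock_seq} but live routed {live_seq}"]

# Identical sequences produce no findings; divergent ones produce a report.
print(trace_parity([{"agent": "orchestrator"}], [{"agent": "orchestrator"}]))
print(trace_parity([{"agent": "orchestrator"}], [{"agent": "router_v2"}]))
```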
Anatomy of a Custom Evaluator
Every custom evaluator follows the same pattern: it receives a test case (input + expected + actual execution trace), validates a specific aspect of agent behavior, and returns a scored result. Here's the base interface everything builds on:
```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EvalTestCase:
    input: str                # The user query
    expected: dict[str, Any]  # Expected behavior (routing, tool args, output)
    actual: dict[str, Any] = field(default_factory=dict)  # Filled after agent runs


@dataclass
class EvalResult:
    evaluator: str
    passed: bool
    score: float  # 0.0 – 1.0
    details: str


class BaseEvaluator:
    """Abstract base for all ADK custom evaluators."""

    def evaluate(self, test_case: EvalTestCase) -> EvalResult:
        raise NotImplementedError

    def name(self) -> str:
        return self.__class__.__name__
```

Building a RoutingEvaluator
The RoutingEvaluator validates that the orchestrator routed the query to the correct sequence of sub-agents. This is the most important evaluator for multi-agent systems — a routing failure silently produces wrong results with no error raised.
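The scoring idea is worth seeing in isolation first: partial credit is simply the proportion of positions where the actual agent sequence matches the expected one.

```python
# Partial-credit sketch: score = proportion of correctly placed agents.
expected = ["orchestrator", "cpc_intake_agent",
            "documents_agent", "comparison_agent"]
actual = ["orchestrator", "cpc_intake_agent", "comparison_agent"]  # skipped a step

matched = sum(1 for a, e in zip(actual, expected) if a == e)
score = matched / max(len(expected), 1)
print(round(score, 2))  # 0.5: the first two positions match, then the paths diverge
```

The `max(..., 1)` guard keeps an empty expected sequence from dividing by zero.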
```python
class RoutingEvaluator(BaseEvaluator):
    """
    Validates that the orchestrator routed to the correct
    sub-agent sequence for a given input query.

    Example expected:
        {"agent_sequence": ["orchestrator", "cpc_intake_agent",
                            "documents_agent", "comparison_agent"]}
    """

    def evaluate(self, test_case: EvalTestCase) -> EvalResult:
        expected_seq = test_case.expected.get("agent_sequence", [])
        actual_seq = [
            step["agent"]
            for step in test_case.actual.get("trace", [])
        ]

        if expected_seq == actual_seq:
            return EvalResult(
                evaluator="RoutingEvaluator",
                passed=True,
                score=1.0,
                details=f"Correct route: {' → '.join(actual_seq)}"
            )

        # Partial credit: proportion of correctly placed agents
        matched = sum(
            1 for a, e in zip(actual_seq, expected_seq) if a == e
        )
        score = matched / max(len(expected_seq), 1)
        return EvalResult(
            evaluator="RoutingEvaluator",
            passed=False,
            score=round(score, 2),
            details=(
                f"Expected: {expected_seq}\n"
                f"Got: {actual_seq}"
            )
        )


# Example test cases
ROUTING_TEST_CASES = [
    EvalTestCase(
        input="Compare Plan A effective 2024-01-15 with Plan B",
        expected={
            "agent_sequence": [
                "orchestrator", "cpc_intake_agent",
                "documents_agent", "comparison_agent"
            ]
        }
    ),
    EvalTestCase(
        input="What plans are available?",
        expected={
            "agent_sequence": ["orchestrator", "cpc_intake_agent"]
        }
    ),
]
```

Building a ToolCallEvaluator
The ToolCallEvaluator validates every tool call argument against a defined JSON schema. This catches hallucinated tool arguments — one of the most common and subtle failure modes in production agent systems — before they reach your APIs.
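Before the evaluator itself, here is a stdlib-only sketch of the kind of contract violation schema validation catches. The tool contract and field names below are illustrative, not real ADK definitions.

```python
import re

# Hypothetical contract for a plan-lookup tool (names are assumptions):
REQUIRED = {"plan_code", "effective_date"}
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def check_call(args: dict) -> list[str]:
    """Return a list of human-readable contract violations for one tool call."""
    errors = [f"missing required arg: {k}" for k in REQUIRED - args.keys()]
    errors += [f"unexpected arg: {k}" for k in args.keys() - REQUIRED]
    date = args.get("effective_date")
    if isinstance(date, str) and not DATE_RE.match(date):
        errors.append(f"effective_date not in YYYY-MM-DD format: {date!r}")
    return errors

# A plausible-looking but hallucinated call: wrong date format, invented arg.
hallucinated = {"plan_code": "PLN-100", "effective_date": "Jan 15, 2024",
                "region": "US"}
print(check_call(hallucinated))
```

The final answer built from such a call might still read correctly; the evaluator below generalizes this check to every call in the trace using proper JSON schemas.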
```python
import jsonschema


class ToolCallEvaluator(BaseEvaluator):
    """
    Validates tool call arguments against defined JSON schemas.
    Catches hallucinated or malformed tool arguments before
    they reach your actual APIs.

    Args:
        tool_schemas: dict mapping tool_name → JSON schema
    """

    def __init__(self, tool_schemas: dict[str, dict]):
        self.tool_schemas = tool_schemas

    def evaluate(self, test_case: EvalTestCase) -> EvalResult:
        errors = []
        tool_calls = test_case.actual.get("tool_calls", [])

        for call in tool_calls:
            tool_name = call.get("tool")
            schema = self.tool_schemas.get(tool_name)

            if not schema:
                errors.append(f"Unknown tool called: '{tool_name}'")
                continue

            try:
                jsonschema.validate(
                    instance=call.get("args", {}),
                    schema=schema
                )
            except jsonschema.ValidationError as e:
                errors.append(f"{tool_name}: {e.message}")

        passed = len(errors) == 0
        return EvalResult(
            evaluator="ToolCallEvaluator",
            passed=passed,
            score=1.0 if passed else 0.0,
            details=(
                "All tool calls valid"
                if passed
                else " | ".join(errors)
            )
        )


# Define your tool schemas (align with ADK tool definitions)
CPC_TOOL_SCHEMAS = {
    "get_plan_highlights": {
        "type": "object",
        "required": ["plan_code", "effective_date"],
        "properties": {
            "plan_code": {"type": "string", "minLength": 1},
            "effective_date": {
                "type": "string",
                "pattern": "^\\d{4}-\\d{2}-\\d{2}$"
            }
        },
        "additionalProperties": False
    },
    "generate_comparison_report": {
        "type": "object",
        "required": ["plan_a_data", "plan_b_data"],
        "properties": {
            "plan_a_data": {"type": "object"},
            "plan_b_data": {"type": "object"},
            "output_format": {
                "type": "string",
                "enum": ["excel", "json"]
            }
        }
    }
}
```

Running Your Eval Suite: Mock Mode vs Live API Mode
The eval runner wires together your test cases, your agent, and your evaluators. The mode parameter is the key design decision — mock mode gives you a fast, deterministic inner loop during development; live mode gives you real production signal before deployment.
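The runner assumes an agent exposing run(input, mode) that returns a dict with the execution trace. For mock mode, a deterministic stub like this keeps the inner loop fast and credential-free; the interface and canned routing below are assumptions for illustration, not an ADK API.

```python
class StubCPCAgent:
    """Deterministic stand-in for mock-mode development (interface assumed)."""

    def run(self, query: str, mode: str = "mock") -> dict:
        if mode == "live":
            raise RuntimeError("live mode requires real credentials")
        # Canned routing keyed off the query text
        if "compare" in query.lower():
            agents = ["orchestrator", "cpc_intake_agent",
                      "documents_agent", "comparison_agent"]
        else:
            agents = ["orchestrator", "cpc_intake_agent"]
        return {"trace": [{"agent": a} for a in agents], "tool_calls": []}


result = StubCPCAgent().run("Compare Plan A with Plan B")
print([step["agent"] for step in result["trace"]])
```

Because the stub is deterministic, the same test cases produce the same scores on every run, which is what makes mock mode usable as a tight inner loop.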
```python
from typing import Literal


def run_eval_suite(
    agent,
    evaluators: list[BaseEvaluator],
    test_cases: list[EvalTestCase],
    mode: Literal["mock", "live"] = "mock",
) -> list[EvalResult]:
    """
    Runs all evaluators against all test cases.

    Mock mode: Deterministic, no credentials required.
               Use during inner-loop development.
    Live mode: Real API calls, real data.
               Use before production deployment.
    """
    all_results: list[EvalResult] = []

    for i, tc in enumerate(test_cases):
        print(f"\nTest case {i + 1}: '{tc.input[:60]}...'")

        # Run the agent and capture the execution trace
        tc.actual = agent.run(tc.input, mode=mode)

        # Run every evaluator against this test case
        for ev in evaluators:
            result = ev.evaluate(tc)
            all_results.append(result)
            status = "✓" if result.passed else "✗"
            print(f"  {status} {ev.name()}: {result.details[:80]}")

    # Summary
    passed = sum(r.passed for r in all_results)
    total = len(all_results)
    print(f"\n{'=' * 50}")
    print(f"Results ({mode} mode): {passed}/{total} passed")
    print(f"Overall score: {sum(r.score for r in all_results) / total:.2%}")
    return all_results


# Wire it all together
if __name__ == "__main__":
    from your_agent import CPCAgent

    agent = CPCAgent()
    evaluators = [
        RoutingEvaluator(),
        ToolCallEvaluator(CPC_TOOL_SCHEMAS),
    ]

    # Fast inner loop — no credentials needed
    run_eval_suite(agent, evaluators, ROUTING_TEST_CASES, mode="mock")

    # Pre-deployment gate — real API calls
    run_eval_suite(agent, evaluators, ROUTING_TEST_CASES, mode="live")
```

Bonus: An Agent That Generates Evaluator Scaffolding
Writing evaluator boilerplate for every new agent is repetitive. This agent takes your ADK agent spec (tools, routing rules, output schema) and generates a complete custom evaluator class scaffold using Gemini on Vertex AI. We're building this out as a standalone POC — see the Evaluator Generator Agent in the POC Lab for the full case study.
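Generated code should never be trusted blindly. A cheap first gate, before any human review, is a syntax check on the returned scaffold; this sketch uses Python's built-in compile, and the function name is our own.

```python
def syntax_ok(scaffold: str) -> bool:
    """Reject generated evaluator code that is not even valid Python."""
    try:
        compile(scaffold, "<generated_evaluator>", "exec")
        return True
    except SyntaxError:
        return False


print(syntax_ok("class MyEvaluator:\n    pass"))  # True
print(syntax_ok("def broken(:\n    pass"))        # False
```

A compile check catches truncated or fence-wrapped model output; it does not prove the evaluator logic is right, which still needs review.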
```python
import json

import vertexai
from vertexai.generative_models import GenerativeModel

GENERATOR_SYSTEM_PROMPT = """You are an expert in Google ADK agent evaluation.
Given an agent specification in JSON, generate a complete Python
custom evaluator class that:
1. Extends BaseEvaluator
2. Validates the specific behavior described in the spec
3. Returns an EvalResult with a score from 0.0 to 1.0
4. Includes 3 representative EvalTestCase examples
5. Has docstrings explaining what is being evaluated
Output only valid Python code. No markdown fences, no explanations."""


def generate_evaluator_scaffold(
    agent_spec: dict,
    project: str,
    location: str = "us-central1",
) -> str:
    """
    Takes an ADK agent spec and generates a custom evaluator scaffold.

    agent_spec example:
        {
            "agent_name": "CPC Intake Agent",
            "routing_rule": "Always called after orchestrator for plan queries",
            "tools": [
                {
                    "name": "extract_plan_codes",
                    "required_args": ["raw_query"],
                    "output": {"plan_codes": "list[str]", "effective_date": "str"}
                }
            ],
            "success_criteria": "plan_codes non-empty, effective_date in YYYY-MM-DD format"
        }
    """
    vertexai.init(project=project, location=location)
    model = GenerativeModel(
        "gemini-1.5-pro",
        system_instruction=GENERATOR_SYSTEM_PROMPT
    )
    prompt = (
        "Generate a custom evaluator for this agent:\n\n"
        f"{json.dumps(agent_spec, indent=2)}"
    )
    response = model.generate_content(prompt)
    return response.text


# Example usage
if __name__ == "__main__":
    spec = {
        "agent_name": "CPC Intake Agent",
        "routing_rule": "Called by orchestrator for all plan comparison queries",
        "tools": [{
            "name": "extract_plan_codes",
            "required_args": ["raw_query"],
            "output": {
                "plan_codes": "list[str] — min 2 items for comparison",
                "effective_date": "str — YYYY-MM-DD format"
            }
        }],
        "success_criteria": (
            "plan_codes has >= 2 items, "
            "effective_date matches ISO format"
        )
    }
    scaffold = generate_evaluator_scaffold(
        agent_spec=spec,
        project="your-gcp-project"
    )
    print(scaffold)
```

Key Principles
Building a custom eval framework is an investment that pays back immediately and compounds over time. Every new agent capability you add should come with a corresponding evaluator.
- Build evaluators before you build agents — define your success criteria first
- Use deepeval/Arize for output quality + ADK custom evaluators for behavior correctness — they are complementary, not competing
- Mock mode for fast inner-loop dev (seconds per test); live mode as a pre-deployment gate (minutes per suite)
- Partial credit scoring (0.0–1.0) gives you more signal than binary pass/fail
- Every bug you fix should add a regression test case to prevent recurrence
- The Evaluator Generator Agent reduces the cost of writing new evaluators — lower cost = more coverage
Takeaway
The biggest lesson from the CPC project: evaluation is not a phase that comes after building — it's the foundation you build on. deepeval and Arize are excellent tools for measuring LLM output quality, but they operate above the agent framework layer. They can't tell you whether your orchestrator routed correctly, whether your tool arguments are valid, or whether your mock and live execution traces are consistent. ADK custom evaluators fill that gap. Build them early, run them continuously, and treat every new evaluator as first-class code.