GenAI · 12 min read · March 2026

Building Custom Evaluators for Google ADK Agents

deepeval and Arize test output quality. ADK custom evaluators test agent behavior — routing decisions, tool call arguments, execution traces. Here's how to build them with real Python code.

Venkat Meruva
AI Solution Architect

When we built the CPC Comparison AI Agent for a Health Insurance Payer, writing the actual agent logic was the easy part. The hard part was knowing whether it worked correctly. A single user query triggers a chain: intent classification → agent routing → tool selection → API calls → data transformation → file generation. Any break in that chain produces a wrong answer — and traditional unit tests don't even know the chain exists. This article covers how we built a custom evaluation framework for our Google ADK agents, why popular LLM evaluation tools weren't sufficient on their own, and how you can build the same for your own ADK system.

Why Standard Tests Break for Agentic Systems

Unit testing an agent is fundamentally different from unit testing a function. A function has a deterministic input → output contract. An agent has an input → multi-step execution trace → output contract. The execution trace matters as much as the final output.

  • A query answered correctly for the wrong reason is a latent bug — the agent got lucky, not right
  • Tool call arguments can be hallucinated even when the final answer looks correct
  • Routing failures (wrong sub-agent called) may produce a plausible but wrong output
  • Mock mode behavior can diverge from live API behavior in subtle, hard-to-detect ways
  • Traditional assert-based tests have no concept of 'agent trace' or 'tool call sequence'
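To make the trace concept concrete, here is a hypothetical shape for a single agent run — the final answer plus the steps that produced it. The field names (`trace`, `tool_calls`) are illustrative, not an ADK API; the point is that a plain `assert` on the answer inspects none of the steps:

```python
# Hypothetical shape of one agent run: final output plus the
# execution trace (routing steps and tool calls) that produced it.
run = {
    "output": "Plan A has a lower deductible than Plan B.",
    "trace": [
        {"agent": "orchestrator"},
        {"agent": "cpc_intake_agent"},
        {"agent": "comparison_agent"},  # documents_agent was skipped
    ],
    "tool_calls": [
        {"tool": "get_plan_highlights",
         # Hallucinated US-style date — the answer can still look fine.
         "args": {"plan_code": "A", "effective_date": "01/15/2024"}},
    ],
}

# A traditional unit test only sees the output...
assert "lower deductible" in run["output"]  # passes

# ...while the trace shows a skipped agent and a malformed tool argument.
agents_called = [step["agent"] for step in run["trace"]]
print(agents_called)
# ['orchestrator', 'cpc_intake_agent', 'comparison_agent']
```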

LLM Evaluation Tools vs Agent Evaluators — What's the Difference?

Tools like deepeval, Arize Phoenix, and Braintrust are excellent for evaluating LLM output quality — whether answers are faithful, relevant, and free of hallucination. But they don't know your agent's routing graph, tool schemas, or execution trace. They evaluate what came out of the model, not how the agent got there. ADK custom evaluators fill the gap by testing agent behavior at every step of execution.

|                         | deepeval / Arize / Braintrust        | Google ADK Custom Evaluators           |
| ----------------------- | ------------------------------------ | -------------------------------------- |
| Evaluates               | LLM output quality                   | Agent execution behavior               |
| Knows routing graph     | ❌ No                                | ✅ Yes — validates agent sequence      |
| Inspects tool call args | ❌ No                                | ✅ Yes — schema-validates every call   |
| Framework-agnostic      | ✅ Works with any LLM                | ADK-specific by design                 |
| Great for               | RAG quality, hallucination, toxicity | Routing accuracy, tool use correctness |
| Catches                 | Wrong answers, irrelevant responses  | Wrong routing, hallucinated tool args  |
| When to use             | Model output benchmarking            | Agent system integration testing       |
💡 The right answer: use both. deepeval/Arize for output quality metrics; ADK custom evaluators for agent behavior correctness. They complement each other perfectly.

What Google ADK Provides Out of the Box

Google ADK ships with a built-in evaluation module that handles common cases — but it has real gaps for domain-specific agent systems.

  • Built-in: Response quality scoring using LLM-as-judge (relevance, coherence)
  • Built-in: Trajectory evaluation — did the agent follow a reasonable path?
  • Built-in: Tool use evaluation at a coarse level
  • Gap: No domain-specific routing validation (e.g. 'cpc_intake_agent must always precede documents_agent')
  • Gap: No schema-level tool argument validation against your defined contracts
  • Gap: No mock-vs-live parity testing to catch environment-dependent failures
  • Gap: No custom output structure validation (e.g. Excel sheet structure, JSON schema integrity)
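The first gap — domain routing rules like "cpc_intake_agent must always precede documents_agent" — is the kind of check that is trivial once you have the trace. A minimal standalone sketch (helper name and rule are illustrative):

```python
def check_precedence(agent_sequence: list[str],
                     before: str, after: str) -> bool:
    """Return True if `before` occurs earlier than the first occurrence
    of `after` (vacuously True when `after` never appears)."""
    if after not in agent_sequence:
        return True
    if before not in agent_sequence:
        return False
    return agent_sequence.index(before) < agent_sequence.index(after)

# Rule: cpc_intake_agent must always precede documents_agent.
print(check_precedence(
    ["orchestrator", "cpc_intake_agent", "documents_agent"],
    before="cpc_intake_agent", after="documents_agent"))  # True
print(check_precedence(
    ["orchestrator", "documents_agent", "cpc_intake_agent"],
    before="cpc_intake_agent", after="documents_agent"))  # False
```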

Anatomy of a Custom Evaluator

Every custom evaluator follows the same pattern: it receives a test case (input + expected + actual execution trace), validates a specific aspect of agent behavior, and returns a scored result. Here's the base interface everything builds on:

evaluators/base.py
from dataclasses import dataclass, field
from typing import Any

@dataclass
class EvalTestCase:
    input: str                        # The user query
    expected: dict[str, Any]          # Expected behavior (routing, tool args, output)
    actual: dict[str, Any] = field(default_factory=dict)  # Filled after agent runs

@dataclass
class EvalResult:
    evaluator: str
    passed: bool
    score: float          # 0.0 – 1.0
    details: str

class BaseEvaluator:
    """Abstract base for all ADK custom evaluators."""

    def evaluate(self, test_case: EvalTestCase) -> EvalResult:
        raise NotImplementedError

    def name(self) -> str:
        return self.__class__.__name__

Building a RoutingEvaluator

The RoutingEvaluator validates that the orchestrator routed the query to the correct sequence of sub-agents. This is the most important evaluator for multi-agent systems — a routing failure silently produces wrong results with no error raised.

evaluators/routing_evaluator.py
from evaluators.base import BaseEvaluator, EvalResult, EvalTestCase

class RoutingEvaluator(BaseEvaluator):
    """
    Validates that the orchestrator routed to the correct
    sub-agent sequence for a given input query.

    Example expected:
        {"agent_sequence": ["orchestrator", "cpc_intake_agent",
                            "documents_agent", "comparison_agent"]}
    """

    def evaluate(self, test_case: EvalTestCase) -> EvalResult:
        expected_seq = test_case.expected.get("agent_sequence", [])
        actual_seq = [
            step["agent"]
            for step in test_case.actual.get("trace", [])
        ]

        if expected_seq == actual_seq:
            return EvalResult(
                evaluator="RoutingEvaluator",
                passed=True,
                score=1.0,
                details=f"Correct route: {' → '.join(actual_seq)}"
            )

        # Partial credit: proportion of correctly placed agents
        matched = sum(
            1 for a, e in zip(actual_seq, expected_seq) if a == e
        )
        score = matched / max(len(expected_seq), 1)

        return EvalResult(
            evaluator="RoutingEvaluator",
            passed=False,
            score=round(score, 2),
            details=(
                f"Expected: {expected_seq}\n"
                f"Got:      {actual_seq}"
            )
        )


# Example test cases
ROUTING_TEST_CASES = [
    EvalTestCase(
        input="Compare Plan A effective 2024-01-15 with Plan B",
        expected={
            "agent_sequence": [
                "orchestrator", "cpc_intake_agent",
                "documents_agent", "comparison_agent"
            ]
        }
    ),
    EvalTestCase(
        input="What plans are available?",
        expected={
            "agent_sequence": ["orchestrator", "cpc_intake_agent"]
        }
    ),
]
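To see how the partial-credit scoring behaves, here is the same positional-match arithmetic from `evaluate()` applied by hand to a trace that skipped `documents_agent`:

```python
expected_seq = ["orchestrator", "cpc_intake_agent",
                "documents_agent", "comparison_agent"]
actual_seq = ["orchestrator", "cpc_intake_agent", "comparison_agent"]

# Positional matching, as in RoutingEvaluator.evaluate():
matched = sum(1 for a, e in zip(actual_seq, expected_seq) if a == e)
score = matched / max(len(expected_seq), 1)
print(matched, round(score, 2))  # 2 0.5
```

Note that positional matching penalizes every agent after a skipped or inserted step; if that proves too harsh for long routes, an alignment-based metric such as longest common subsequence gives smoother partial credit.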

Building a ToolCallEvaluator

The ToolCallEvaluator validates every tool call argument against a defined JSON schema. This catches hallucinated tool arguments — one of the most common and subtle failure modes in production agent systems — before they reach your APIs.

evaluators/tool_call_evaluator.py
import jsonschema

from evaluators.base import BaseEvaluator, EvalResult, EvalTestCase

class ToolCallEvaluator(BaseEvaluator):
    """
    Validates tool call arguments against defined JSON schemas.
    Catches hallucinated or malformed tool arguments before
    they reach your actual APIs.

    Args:
        tool_schemas: dict mapping tool_name → JSON schema
    """

    def __init__(self, tool_schemas: dict[str, dict]):
        self.tool_schemas = tool_schemas

    def evaluate(self, test_case: EvalTestCase) -> EvalResult:
        errors = []
        tool_calls = test_case.actual.get("tool_calls", [])

        for call in tool_calls:
            tool_name = call.get("tool")
            schema = self.tool_schemas.get(tool_name)

            if not schema:
                errors.append(f"Unknown tool called: '{tool_name}'")
                continue

            try:
                jsonschema.validate(
                    instance=call.get("args", {}),
                    schema=schema
                )
            except jsonschema.ValidationError as e:
                errors.append(f"{tool_name}: {e.message}")

        passed = len(errors) == 0
        return EvalResult(
            evaluator="ToolCallEvaluator",
            passed=passed,
            score=1.0 if passed else 0.0,
            details=(
                "All tool calls valid"
                if passed
                else " | ".join(errors)
            )
        )


# Define your tool schemas (align with ADK tool definitions)
CPC_TOOL_SCHEMAS = {
    "get_plan_highlights": {
        "type": "object",
        "required": ["plan_code", "effective_date"],
        "properties": {
            "plan_code": {"type": "string", "minLength": 1},
            "effective_date": {
                "type": "string",
                "pattern": "^\\d{4}-\\d{2}-\\d{2}$"
            }
        },
        "additionalProperties": False
    },
    "generate_comparison_report": {
        "type": "object",
        "required": ["plan_a_data", "plan_b_data"],
        "properties": {
            "plan_a_data": {"type": "object"},
            "plan_b_data": {"type": "object"},
            "output_format": {
                "type": "string",
                "enum": ["excel", "json"]
            }
        }
    }
}
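A quick standalone check of the `get_plan_highlights` schema (assuming the `jsonschema` package is installed) shows exactly what the evaluator would flag — here, a hallucinated US-format date:

```python
import jsonschema

# Same schema as CPC_TOOL_SCHEMAS["get_plan_highlights"]:
schema = {
    "type": "object",
    "required": ["plan_code", "effective_date"],
    "properties": {
        "plan_code": {"type": "string", "minLength": 1},
        "effective_date": {"type": "string",
                           "pattern": "^\\d{4}-\\d{2}-\\d{2}$"},
    },
    "additionalProperties": False,
}

# Valid arguments pass silently.
jsonschema.validate({"plan_code": "A100", "effective_date": "2024-01-15"}, schema)

# A hallucinated US-format date fails the pattern check.
try:
    jsonschema.validate({"plan_code": "A100", "effective_date": "01/15/2024"}, schema)
except jsonschema.ValidationError as e:
    print(e.message)  # pattern mismatch for effective_date
```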

Running Your Eval Suite: Mock Mode vs Live API Mode

The eval runner wires together your test cases, your agent, and your evaluators. The mode parameter is the key design decision — mock mode gives you a fast, deterministic inner loop during development; live mode gives you real production signal before deployment.

evaluators/runner.py
from typing import Literal

from evaluators.base import BaseEvaluator, EvalResult, EvalTestCase

def run_eval_suite(
    agent,
    evaluators: list[BaseEvaluator],
    test_cases: list[EvalTestCase],
    mode: Literal["mock", "live"] = "mock",
) -> list[EvalResult]:
    """
    Runs all evaluators against all test cases.

    Mock mode:  Deterministic, no credentials required.
                Use during inner-loop development.
    Live mode:  Real API calls, real data.
                Use before production deployment.
    """
    all_results: list[EvalResult] = []

    for i, tc in enumerate(test_cases):
        print(f"\nTest case {i + 1}: '{tc.input[:60]}...'")

        # Run the agent and capture the execution trace
        tc.actual = agent.run(tc.input, mode=mode)

        # Run every evaluator against this test case
        for ev in evaluators:
            result = ev.evaluate(tc)
            all_results.append(result)
            status = "✓" if result.passed else "✗"
            print(f"  {status} {ev.name()}: {result.details[:80]}")

    # Summary
    passed = sum(r.passed for r in all_results)
    total = len(all_results)
    print(f"\n{'='*50}")
    print(f"Results ({mode} mode): {passed}/{total} passed")
    print(f"Overall score: {sum(r.score for r in all_results)/total:.2%}")

    return all_results


# Wire it all together
if __name__ == "__main__":
    from evaluators.routing_evaluator import ROUTING_TEST_CASES, RoutingEvaluator
    from evaluators.tool_call_evaluator import CPC_TOOL_SCHEMAS, ToolCallEvaluator
    from your_agent import CPCAgent

    agent = CPCAgent()
    evaluators = [
        RoutingEvaluator(),
        ToolCallEvaluator(CPC_TOOL_SCHEMAS),
    ]

    # Fast inner loop — no credentials needed
    run_eval_suite(agent, evaluators, ROUTING_TEST_CASES, mode="mock")

    # Pre-deployment gate — real API calls
    run_eval_suite(agent, evaluators, ROUTING_TEST_CASES, mode="live")

Bonus: An Agent That Generates Evaluator Scaffolding

Writing evaluator boilerplate for every new agent is repetitive. This agent takes your ADK agent spec (tools, routing rules, output schema) and generates a complete custom evaluator class scaffold using Gemini on Vertex AI. We're building this out as a standalone POC — see the Evaluator Generator Agent in the POC Lab for the full case study.

generator/evaluator_generator.py
import json
import vertexai
from vertexai.generative_models import GenerativeModel

GENERATOR_SYSTEM_PROMPT = """You are an expert in Google ADK agent evaluation.
Given an agent specification in JSON, generate a complete Python
custom evaluator class that:
1. Extends BaseEvaluator
2. Validates the specific behavior described in the spec
3. Returns an EvalResult with a score from 0.0 to 1.0
4. Includes 3 representative EvalTestCase examples
5. Has docstrings explaining what is being evaluated

Output only valid Python code. No markdown fences, no explanations."""


def generate_evaluator_scaffold(
    agent_spec: dict,
    project: str,
    location: str = "us-central1",
) -> str:
    """
    Takes an ADK agent spec and generates a custom evaluator scaffold.

    agent_spec example:
    {
        "agent_name": "CPC Intake Agent",
        "routing_rule": "Always called after orchestrator for plan queries",
        "tools": [
            {
                "name": "extract_plan_codes",
                "required_args": ["raw_query"],
                "output": {"plan_codes": "list[str]", "effective_date": "str"}
            }
        ],
        "success_criteria": "plan_codes non-empty, effective_date in YYYY-MM-DD format"
    }
    """
    vertexai.init(project=project, location=location)
    model = GenerativeModel(
        "gemini-1.5-pro",
        system_instruction=GENERATOR_SYSTEM_PROMPT
    )

    prompt = f"Generate a custom evaluator for this agent:\n\n{json.dumps(agent_spec, indent=2)}"
    response = model.generate_content(prompt)

    return response.text


# Example usage
if __name__ == "__main__":
    spec = {
        "agent_name": "CPC Intake Agent",
        "routing_rule": "Called by orchestrator for all plan comparison queries",
        "tools": [{
            "name": "extract_plan_codes",
            "required_args": ["raw_query"],
            "output": {
                "plan_codes": "list[str] — min 2 items for comparison",
                "effective_date": "str — YYYY-MM-DD format"
            }
        }],
        "success_criteria": (
            "plan_codes has >= 2 items, "
            "effective_date matches ISO format"
        )
    }

    scaffold = generate_evaluator_scaffold(
        agent_spec=spec,
        project="your-gcp-project"
    )
    print(scaffold)
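Generated code should never be trusted blindly. Before saving a scaffold, a cheap sanity pass with the stdlib `ast` module catches most broken generations — this checker and its rules are our own convention, not part of the generator:

```python
import ast

def validate_scaffold(code: str) -> list[str]:
    """Cheap sanity checks on generated evaluator code: it must
    parse, define at least one class, and define an evaluate()
    method somewhere. Returns a list of problems (empty = OK)."""
    problems = []
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"Syntax error: {e}"]
    if not any(isinstance(n, ast.ClassDef) for n in ast.walk(tree)):
        problems.append("No class definition found")
    if not any(isinstance(n, ast.FunctionDef) and n.name == "evaluate"
               for n in ast.walk(tree)):
        problems.append("No evaluate() method found")
    return problems

print(validate_scaffold(
    "class FooEvaluator:\n    def evaluate(self, tc):\n        pass\n"))  # []
print(validate_scaffold("def nope(): pass"))
# ['No class definition found', 'No evaluate() method found']
```

A scaffold that passes these checks still needs a human review before it joins the suite.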

Key Principles

Building a custom eval framework is an investment that pays back immediately and compounds over time. Every new agent capability you add should come with a corresponding evaluator.

  • Build evaluators before you build agents — define your success criteria first
  • Use deepeval/Arize for output quality + ADK custom evaluators for behavior correctness — they are complementary, not competing
  • Mock mode for fast inner-loop dev (seconds per test); live mode as a pre-deployment gate (minutes per suite)
  • Partial credit scoring (0.0–1.0) gives you more signal than binary pass/fail
  • Every bug you fix should add a regression test case to prevent recurrence
  • The Evaluator Generator Agent reduces the cost of writing new evaluators — lower cost = more coverage
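The regression-test principle is easy to wire into CI: replay a recorded trace through a stub agent so the check runs deterministically with no credentials. A sketch with hypothetical names:

```python
# Stub agent that replays a recorded trace for each known query,
# so routing regressions are caught deterministically in CI.
RECORDED_TRACES = {
    "What plans are available?": {
        "trace": [{"agent": "orchestrator"}, {"agent": "cpc_intake_agent"}],
        "tool_calls": [],
    },
}

class ReplayAgent:
    def run(self, query: str, mode: str = "mock") -> dict:
        return RECORDED_TRACES[query]

def test_listing_query_routes_to_intake_only():
    actual = ReplayAgent().run("What plans are available?")
    agents = [step["agent"] for step in actual["trace"]]
    assert agents == ["orchestrator", "cpc_intake_agent"]

test_listing_query_routes_to_intake_only()
print("regression suite passed")
```

Each fixed bug adds one recorded trace and one test function; the suite only grows.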

Takeaway

The biggest lesson from the CPC project: evaluation is not a phase that comes after building — it's the foundation you build on. deepeval and Arize are excellent tools for measuring LLM output quality, but they operate above the agent framework layer. They can't tell you whether your orchestrator routed correctly, whether your tool arguments are valid, or whether your mock and live execution traces are consistent. ADK custom evaluators fill that gap. Build them early, run them continuously, and treat every new evaluator as first-class code.