Back to POC Lab
POCGenAI Agent

Evaluator Generator Agent

Auto-scaffold custom evaluators from an agent spec

Google ADK · Vertex AI · Gemini · Python

Google ADK · Vertex AI · Gemini 1.5 Pro · Python · jsonschema

THE CHALLENGE

Evaluator Coverage Is the Bottleneck

Custom evaluators for ADK agents are essential — but every new agent needs its own evaluator written from scratch. The result: teams skip evaluation, or write minimal tests that miss the most important failure modes. The cost of writing evaluators needs to drop dramatically.

01

Repetitive Boilerplate

Every evaluator follows the same BaseEvaluator pattern — the interface, data classes, scoring logic — but must be hand-written for each agent.
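The repeated boilerplate can be pictured as a minimal interface sketch. The names `BaseEvaluator`, `EvalResult`, and `EvalTestCase` come from the document; the exact fields shown here are illustrative assumptions, not the POC's actual definitions:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any

@dataclass
class EvalTestCase:
    """One scenario to run an agent through: an input plus expected behavior."""
    query: str
    expected_tools: list[str] = field(default_factory=list)
    expected_agent_sequence: list[str] = field(default_factory=list)

@dataclass
class EvalResult:
    """Outcome of one evaluation; score is normalized to the 0.0-1.0 range."""
    score: float
    passed: bool
    details: str = ""

class BaseEvaluator(ABC):
    """The shell every custom evaluator re-implements by hand, per agent."""

    @abstractmethod
    def evaluate(self, test_case: EvalTestCase, agent_output: dict[str, Any]) -> EvalResult:
        """Score one agent run against one test case."""
        ...
```

Every new agent repeats this shell plus agent-specific scoring logic, which is exactly the part the generator targets.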

02

Schema Drift

Tool schemas and routing rules change as agents evolve. Manually updating evaluators to match is easy to forget and hard to audit.

03

Low Eval Coverage

When writing evaluators is tedious, teams skip them — and agents reach production with no behavioral correctness checks.

THE SOLUTION

Generate Evaluator Scaffolds from Agent Specs

1. Developer provides an agent spec JSON (tools, routing rules, success criteria)
2. Generator Agent sends the spec to Gemini 1.5 Pro with a structured system prompt
3. Gemini generates a complete BaseEvaluator subclass with scoring logic
4. Output includes 3 representative EvalTestCase examples ready to run
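Steps 1 and 2 can be sketched as a prompt builder that serializes the spec next to a constrained system instruction. The function name and prompt wording are illustrative assumptions, not the POC's actual prompt:

```python
import json

# Illustrative system prompt; the POC's real constraints are described in
# "System Prompt Engineering Is Critical" below in spirit, not verbatim.
SYSTEM_PROMPT = (
    "You generate Python evaluator classes for Google ADK agents.\n"
    "Constraints: output raw Python only (no markdown fences), "
    "extend BaseEvaluator, keep scores in the 0.0-1.0 range, "
    "and include 3 representative EvalTestCase examples."
)

def build_generation_prompt(spec: dict) -> str:
    """Serialize an agent spec into the user turn sent alongside SYSTEM_PROMPT."""
    return (
        "Generate a complete evaluator for the following agent spec:\n"
        f"{json.dumps(spec, indent=2)}\n"
        "Return one runnable Python module."
    )

# On Vertex AI, this pair would be sent along the lines of:
#   from vertexai.generative_models import GenerativeModel
#   model = GenerativeModel("gemini-1.5-pro", system_instruction=SYSTEM_PROMPT)
#   response = model.generate_content(build_generation_prompt(spec))
```

Keeping the system instruction fixed and varying only the serialized spec is what makes the generation step repeatable across agents.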

Outcome: A complete, runnable evaluator class — with test cases — generated in seconds from a structured agent spec. Developer reviews, adjusts edge cases, and runs against the agent immediately.

AGENT CAPABILITIES

What Each Agent Does

⚙️

Evaluator Generator Agent

Single Agent
  • Parses agent spec: tools, routing rules, output schema, success criteria
  • Constructs a structured prompt for reliable code generation
  • Calls Gemini 1.5 Pro on Vertex AI with system-level constraints
  • Returns a complete, runnable Python evaluator class
Vertex AI · Gemini 1.5 Pro · Structured Prompting
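An agent spec of the kind the generator parses might look like the dict below. All field names and values are illustrative, chosen to mirror the bullets above (tools, routing rules, output schema, success criteria), not taken from the POC:

```python
# Hypothetical agent spec for a customer-support routing agent.
agent_spec = {
    "agent_name": "support_router",
    "tools": [
        {"name": "lookup_order", "args": {"order_id": "string"}},
        # Trailing "?" marks an optional arg, one of the edge cases noted below.
        {"name": "escalate", "args": {"reason": "string", "priority": "string?"}},
    ],
    "routing_rules": [
        {"if_intent": "order_status", "route_to": "lookup_order"},
        {"if_intent": "complaint", "route_to": "escalate"},
    ],
    "output_schema": {"reply": "string", "resolved": "bool"},
    "success_criteria": [
        "correct tool chosen for the detected intent",
        "reply conforms to output_schema",
    ],
}
```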

EVALUATION FRAMEWORK

Evaluating the Evaluator Generator

The generator itself needs evaluation — does it produce syntactically valid Python? Does the generated evaluator correctly reflect the input spec? Does it handle edge cases like optional tool args?

01
SyntaxEvaluator
  • Generated code parses without SyntaxError
  • Class extends BaseEvaluator correctly
  • evaluate() method signature is correct
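The first two checks can be done statically with the standard-library `ast` module. This is a sketch assuming the generated module defines one class extending `BaseEvaluator`:

```python
import ast

def check_syntax(generated_code: str) -> tuple[bool, str]:
    """Static checks: code parses, subclasses BaseEvaluator, defines evaluate()."""
    try:
        tree = ast.parse(generated_code)
    except SyntaxError as exc:
        return False, f"SyntaxError: {exc}"
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            bases = {b.id for b in node.bases if isinstance(b, ast.Name)}
            if "BaseEvaluator" in bases:
                methods = {m.name for m in node.body if isinstance(m, ast.FunctionDef)}
                if "evaluate" in methods:
                    return True, "ok"
                return False, f"{node.name} lacks an evaluate() method"
    return False, "no BaseEvaluator subclass found"
```

Because it never executes the generated code, this check is safe to run first, before any runnability testing.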
02
SpecAlignmentEvaluator
  • All tools from spec appear in evaluator logic
  • Routing rules reflected in agent_sequence checks
  • Success criteria mapped to scoring logic
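A lightweight version of the first alignment check is simple containment: every tool named in the spec should be referenced somewhere in the generated evaluator. A sketch, with the spec shape assumed for illustration:

```python
def missing_tools(spec: dict, generated_code: str) -> list[str]:
    """Return tool names from the spec that never appear in the generated code."""
    names = [t["name"] if isinstance(t, dict) else t for t in spec.get("tools", [])]
    return [n for n in names if n not in generated_code]
```

A non-empty result is a strong signal of spec drift: the evaluator was generated against an older version of the agent.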
03
TestCaseEvaluator
  • At least 3 EvalTestCase examples generated
  • Test cases cover positive and negative paths
  • Expected values are plausible for the spec
04
RunnabilityEvaluator
  • Generated class can be instantiated
  • evaluate() runs without runtime errors
  • EvalResult score is in 0.0–1.0 range
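Runnability can be smoke-tested by exec-ing the generated module in a namespace that already contains the base classes, then instantiating the evaluator and range-checking one score. The helper below is a sketch (no sandboxing shown; real use should isolate untrusted generated code):

```python
def smoke_test(generated_code: str, namespace: dict) -> tuple[bool, str]:
    """Exec generated code, instantiate its evaluator, and range-check one score.

    `namespace` must provide BaseEvaluator plus a sample_case / sample_output pair.
    """
    try:
        exec(generated_code, namespace)
        # Find the subclass the generated module defined.
        cls = next(
            v for v in namespace.values()
            if isinstance(v, type)
            and issubclass(v, namespace["BaseEvaluator"])
            and v is not namespace["BaseEvaluator"]
        )
        result = cls().evaluate(namespace["sample_case"], namespace["sample_output"])
        if not 0.0 <= result.score <= 1.0:
            return False, f"score {result.score} outside 0.0-1.0"
        return True, "ok"
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
```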
⚡ Mock Mode
Static spec → expected output pairs for fast iteration
🔴 Live API Mode
Real Gemini calls · Validates generation quality on novel specs

KEY LEARNINGS

What the AI Lab Taught Us

1

System Prompt Engineering Is Critical

The quality of generated evaluators is almost entirely determined by the system prompt. Explicit constraints (no markdown, extend BaseEvaluator, score 0.0–1.0) dramatically reduce post-processing.
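To see what those constraints buy, consider the cleanup pass they make unnecessary: without a "no markdown" instruction, models routinely wrap code in fences that must be stripped before the module can run. A sketch of that fallback (function name illustrative):

```python
def strip_markdown_fences(text: str) -> str:
    """Remove a wrapping ```python ... ``` fence if the model emitted one anyway."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]
    return "\n".join(lines)
```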

2

Spec Structure Drives Output Quality

Well-structured agent specs (typed tool args, explicit routing rules, clear success criteria) produce much better evaluator scaffolds than loosely described specs.

3

Generated Code Needs Human Review

The generator handles the 80% boilerplate reliably. The remaining 20% — domain-specific edge cases and scoring thresholds — still needs developer judgment.

NEXT STEPS

From Prototype to Production

Spec Schema Validation

Enforce a typed JSON schema for agent specs before generation to improve output consistency
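The POC's stack lists jsonschema, which would enforce a full typed schema here; a stdlib-only sketch of the same gate, with an assumed (illustrative) required-key set, looks like this:

```python
# Illustrative required keys; a real deployment would define a full JSON Schema.
REQUIRED_KEYS = {"agent_name", "tools", "routing_rules", "success_criteria"}

def validate_spec(spec: dict) -> list[str]:
    """Return human-readable errors; an empty list means generation may proceed."""
    errors = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - spec.keys())]
    for i, tool in enumerate(spec.get("tools", [])):
        if not isinstance(tool, dict) or "name" not in tool:
            errors.append(f"tools[{i}] must be an object with a 'name'")
    return errors
```

Rejecting malformed specs before the Gemini call both saves tokens and removes a whole class of low-quality generations.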

Evaluator Registry

Store generated evaluators in a registry that auto-updates when agent specs change

IDE Integration

VS Code extension that generates evaluator scaffolds directly from ADK agent class definitions

Multi-Model Comparison

Compare evaluator quality across Gemini 1.5 Pro, Gemini 2.0, and Claude to find the best code generation model

Note: See the companion Insights article for the full technical deep-dive: Building Custom Evaluators for Google ADK Agents — including the base evaluator interface, RoutingEvaluator, ToolCallEvaluator, and comparison with deepeval/Arize.