Evaluator Generator Agent
Auto-scaffold custom evaluators from an agent spec
Google ADK · Vertex AI · Gemini · Python
THE CHALLENGE
Evaluator Coverage Is the Bottleneck
Custom evaluators for ADK agents are essential, but every new agent needs its own evaluator written from scratch. The result: teams skip evaluation or write minimal tests that miss the most important failure modes. The cost of writing evaluators needs to drop dramatically.
Repetitive Boilerplate
Every evaluator follows the same BaseEvaluator pattern — the interface, data classes, scoring logic — but must be hand-written for each agent.
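The repeated pattern looks roughly like the sketch below. The field names and method signature here are assumptions for illustration; the real `BaseEvaluator`, `EvalTestCase`, and `EvalResult` interfaces are defined in the companion Insights article.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

# Hypothetical shapes for the classes named on this page.
@dataclass
class EvalTestCase:
    input: str
    expected_tool_calls: list
    expected_output: str

@dataclass
class EvalResult:
    score: float          # must land in the 0.0-1.0 range
    passed: bool
    details: dict = field(default_factory=dict)

class BaseEvaluator(ABC):
    """Every hand-written (or generated) evaluator extends this."""

    @abstractmethod
    def evaluate(self, test_case: EvalTestCase, actual_output: str) -> EvalResult:
        ...
```

Writing this shell, plus the scoring logic inside `evaluate()`, is the boilerplate the generator aims to eliminate.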
Schema Drift
Tool schemas and routing rules change as agents evolve. Manually updating evaluators to match is easy to forget and hard to audit.
Low Eval Coverage
Teams that find evaluator writing tedious skip it — leading to agents in production with no behavioral correctness checks.
THE SOLUTION
Generate Evaluator Scaffolds from Agent Specs
Outcome: A complete, runnable evaluator class, with test cases, generated in seconds from a structured agent spec. The developer reviews it, adjusts edge cases, and runs it against the agent immediately.
AGENT CAPABILITIES
What Each Agent Does
Evaluator Generator Agent
Single Agent
- Parses agent spec: tools, routing rules, output schema, success criteria
- Constructs a structured prompt for reliable code generation
- Calls Gemini 1.5 Pro on Vertex AI with system-level constraints
- Returns a complete, runnable Python evaluator class
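The core flow can be sketched as spec-in, prompt, code-out. The constraint wording, prompt fields, and `call_model` injection point below are illustrative assumptions; in the real agent, `call_model` would wrap the Vertex AI Gemini client.

```python
import json

# Assumed system-level constraints; the production prompt is longer.
SYSTEM_CONSTRAINTS = (
    "Return only Python code, no markdown fences. "
    "The class must extend BaseEvaluator and score in the 0.0-1.0 range."
)

def build_generation_prompt(spec: dict) -> str:
    """Turn a structured agent spec into a code-generation prompt."""
    return (
        f"{SYSTEM_CONSTRAINTS}\n\n"
        f"Agent tools: {json.dumps(spec['tools'])}\n"
        f"Routing rules: {json.dumps(spec['routing_rules'])}\n"
        f"Success criteria: {json.dumps(spec['success_criteria'])}\n"
        "Generate a complete evaluator class with at least 3 EvalTestCase examples."
    )

def generate_evaluator(spec: dict, call_model) -> str:
    # call_model is injected so the Vertex AI call stays swappable in tests.
    return call_model(build_generation_prompt(spec))
```

Keeping the model call behind a plain callable makes the prompt construction unit-testable without network access.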
EVALUATION FRAMEWORK
Evaluating the Evaluator Generator
The generator itself needs evaluation — does it produce syntactically valid Python? Does the generated evaluator correctly reflect the input spec? Does it handle edge cases like optional tool args?
- ✓ Generated code parses without SyntaxError
- ✓ Class extends BaseEvaluator correctly
- ✓ evaluate() method signature is correct
- ✓ All tools from spec appear in evaluator logic
- ✓ Routing rules reflected in agent_sequence checks
- ✓ Success criteria mapped to scoring logic
- ✓ At least 3 EvalTestCase examples generated
- ✓ Test cases cover positive and negative paths
- ✓ Expected values are plausible for the spec
- ✓ Generated class can be instantiated
- ✓ evaluate() runs without runtime errors
- ✓ EvalResult score is in 0.0–1.0 range
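The first few structural checks above can be automated with the standard-library `ast` module; a minimal sketch (function name and result shape are my own):

```python
import ast

def check_generated_code(source: str) -> dict:
    """Check: parses, extends BaseEvaluator, defines evaluate()."""
    results = {"parses": False, "extends_base": False, "has_evaluate": False}
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return results
    results["parses"] = True
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # Base classes appear as ast.Name nodes with an .id attribute.
            if any(getattr(base, "id", None) == "BaseEvaluator" for base in node.bases):
                results["extends_base"] = True
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name == "evaluate":
                    results["has_evaluate"] = True
    return results
```

The behavioral checks (instantiation, runtime errors, score range) still require executing the generated class against sample inputs.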
KEY LEARNINGS
What the AI Lab Taught Us
System Prompt Engineering Is Critical
The quality of generated evaluators is almost entirely determined by the system prompt. Explicit constraints (no markdown, extend BaseEvaluator, score 0.0–1.0) dramatically reduce post-processing.
Spec Structure Drives Output Quality
Well-structured agent specs (typed tool args, explicit routing rules, clear success criteria) produce much better evaluator scaffolds than loosely described specs.
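For illustration, a well-structured spec might look like the dict below. Every field name here is an assumption, not a fixed schema; the point is typed tool args, explicit routing, and concrete success criteria.

```python
# Illustrative agent spec; field names and values are hypothetical.
agent_spec = {
    "agent_name": "support_router",
    "tools": [
        {
            "name": "lookup_order",
            "args": {"order_id": "str"},          # typed, required
            "optional_args": {"include_history": "bool"},
        },
    ],
    "routing_rules": [
        {"if_intent": "refund", "route_to": "refund_agent"},
    ],
    "success_criteria": [
        "Correct tool chosen for each intent",
        "Final answer cites the tool result",
    ],
}
```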
Generated Code Needs Human Review
The generator reliably handles the boilerplate, roughly 80% of the work. The remaining 20%, domain-specific edge cases and scoring thresholds, still needs developer judgment.
NEXT STEPS
From Prototype to Production
Spec Schema Validation
Enforce a typed JSON schema for agent specs before generation to improve output consistency
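A minimal sketch of what that pre-generation gate could look like, using only hand-rolled checks (a production version would more likely enforce a full JSON Schema); the required field names are assumptions carried over from the illustrative spec:

```python
# Assumed required fields and types for an agent spec.
REQUIRED_FIELDS = {
    "agent_name": str,
    "tools": list,
    "routing_rules": list,
    "success_criteria": list,
}

def validate_spec(spec: dict) -> list:
    """Return a list of schema errors; an empty list means the spec is valid."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in spec:
            errors.append(f"missing field: {name}")
        elif not isinstance(spec[name], expected):
            errors.append(f"{name} must be {expected.__name__}")
    return errors
```

Rejecting malformed specs before the model call keeps generation failures cheap and attributable.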
Evaluator Registry
Store generated evaluators in a registry that auto-updates when agent specs change
IDE Integration
VS Code extension that generates evaluator scaffolds directly from ADK agent class definitions
Multi-Model Comparison
Compare evaluator quality across Gemini 1.5 Pro, Gemini 2.0, and Claude to find the best code generation model
Note: See the companion Insights article for the full technical deep-dive: Building Custom Evaluators for Google ADK Agents — including the base evaluator interface, RoutingEvaluator, ToolCallEvaluator, and comparison with deepeval/Arize.