Mastering AgentOps: A Deep Dive into Evaluating ADK Agents with Langfuse
Thomas Chong
2025-08-03
The Challenge of Production-Ready Agents
The era of AI agents is upon us. Building a proof-of-concept agent that can chat or perform a simple task has become remarkably accessible. However, the journey from a clever prototype to a reliable, production-grade application is fraught with challenges. How do you ensure your agent behaves predictably? How do you measure its performance over time and prevent regressions when you update a prompt or a tool? Answering these questions requires moving beyond simple "vibe-testing" and adopting a more rigorous, engineering-focused discipline.
This is the domain of AgentOps: the set of practices and tools for observing, analyzing, and optimizing the entire lifecycle of AI agents. Just as DevOps revolutionized traditional software, AgentOps provides the framework for building trustworthy and effective agentic systems. This article will focus on a critical pillar of AgentOps: evaluation.
The Tools for the Job
To tackle this challenge, we will use a powerful combination of open-source tools:
- Google's Agent Development Kit (ADK): ADK is a code-first Python toolkit designed to make agent development feel more like traditional software engineering. It offers a flexible, modular framework for building, orchestrating, and, crucially, evaluating sophisticated AI agents. Its emphasis on a code-first approach provides the flexibility and control needed for complex, production-ready systems.
- Langfuse: Langfuse is an open-source LLM engineering platform built for the entire development workflow. It helps teams collaboratively debug, analyze, and iterate on their LLM applications by providing detailed tracing, prompt management, and evaluation capabilities.
Our Goal
This article provides a complete, step-by-step guide to creating a closed-loop evaluation system. We will build a realistic agent using ADK, systematically evaluate its performance using ADK's built-in features, and then programmatically log those quantitative evaluation scores into Langfuse. This links the outcome of an evaluation with the detailed execution trace, providing a comprehensive view for debugging and continuous improvement.
Part 1: Foundational Concepts: ADK and Langfuse
Before diving into the code, it's essential to understand the core philosophies of our chosen tools and how they complement each other to create a robust AgentOps workflow.
Google's Agent Development Kit (ADK): A Code-First Approach to Agentic Systems
ADK's core philosophy is to treat agent development with the same rigor as software development. It prioritizes flexibility, testability, and version control through a code-first paradigm, allowing developers to define agent logic, tools, and orchestration directly in Python.
The key components a developer interacts with are:
- Agents: These are the central decision-making entities. The most common is the `LlmAgent`, which uses a Large Language Model for reasoning. ADK also provides workflow agents like `SequentialAgent` and `ParallelAgent` to orchestrate other agents in predictable patterns.
- Tools: These are the capabilities an agent can use to interact with the world. A tool can be a simple Python function, a complex class, an API call, or even another agent.
- Runners & Sessions: These components manage the agent's execution flow and maintain the state of a conversation over time, respectively.
ADK also includes a powerful local development UI, launched with the `adk web` command, which is invaluable for testing, debugging, and, as we will see, generating test cases.
Langfuse: The Observability Backbone for LLM Applications
Langfuse provides the observability layer necessary to understand what your agent is doing under the hood. To effectively use it, one must understand its simple but powerful data model:
- Trace: Represents a single, complete end-to-end execution of a request. For our purposes, a trace will correspond to one full run of our agent during an evaluation.
- Observation: A single, discrete step within a trace. This could be an LLM call, a tool execution, or any other instrumented part of the application. Langfuse's native integration with ADK automatically creates these observations, giving a detailed, step-by-step view of the agent's reasoning process.
- Score: A metric used to evaluate the quality or performance of a trace or observation. Scores can be numeric, categorical, or boolean. This is the object we will create programmatically to log our ADK evaluation results.
The Langfuse Python SDK is the primary interface for custom instrumentation, allowing us to send our evaluation scores and enrich the automatically captured traces.
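As a point of reference, attaching a score to an existing trace takes only a few lines of Python. This is a minimal sketch using the v2-style `score()` call (the same call our evaluation script uses later); the trace ID is a placeholder, and newer SDK versions may name the method differently:

```python
from langfuse import Langfuse

# Credentials are read from LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST.
langfuse_client = Langfuse()

# Attach a numeric score to an existing trace (the ID here is a placeholder).
langfuse_client.score(
    trace_id="some-existing-trace-id",
    name="response_match_score",
    value=0.92,
    comment="Manually logged example score",
)

# Make sure buffered events are sent before the process exits.
langfuse_client.flush()
```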
Synergy: Why ADK + Langfuse is a Perfect Match for AgentOps
The combination of ADK and Langfuse creates a powerful, two-part system for robust AgentOps. ADK's evaluation feature provides the quantitative measurement—an objective score of how well the agent performed against a known-good dataset. Langfuse provides the qualitative context—the exact sequence of prompts, tool calls, and model responses that led to that score.
Consider the flow:
- ADK is engineered for building and testing agents with a software development discipline. Its evaluation feature produces specific, objective scores like `tool_trajectory_avg_score` and `response_match_score`.
- However, these scores exist in a vacuum. A `tool_trajectory_avg_score` of 0.5 tells you that the agent deviated from the expected path, but it doesn't tell you how or why.
- Langfuse's strength lies in capturing detailed, hierarchical traces of every step an agent takes. The native ADK integration automates this logging, providing a complete record of the agent's reasoning process.
- By programmatically sending ADK's evaluation scores into the corresponding Langfuse trace, we link the "what" (the score) with the "why" (the detailed trace). This creates a complete feedback loop. A developer can see a low score in Langfuse, click into the associated trace, and immediately debug the exact point of failure. This is the essence of effective AgentOps.
Part 2: Building Our Sample Agent: The "Support Agent"
Let's put theory into practice by building an agent for a realistic use case. Our "Support Agent" will be designed to help users by looking up the status of their orders using a custom tool.
Environment and Project Setup
First, prepare your development environment and project structure.
- Create and activate a virtual environment:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```

  (On Windows, use `.venv\Scripts\activate`.)

- Install dependencies: Create a `requirements.txt` file with the following content:

  ```
  google-adk
  langfuse
  python-dotenv
  google-generativeai
  ```

  Then, install them:

  ```bash
  pip install -r requirements.txt
  ```

  This installs ADK, Langfuse, a helper for environment variables, and the Google Gemini SDK.

- Set up the project structure: Create the following directories and files:

  ```
  adk_langfuse_eval/
  ├── support_agent/
  │   ├── __init__.py
  │   ├── agent.py
  │   └── .env
  ├── run_evaluation.py
  └── requirements.txt
  ```
PHOTO: A screenshot of the final project directory structure as shown in a VS Code sidebar or similar tree view should be placed here.
- Configure environment variables: Create a file named `.env` inside the `support_agent/` directory and add your credentials. You can get these from Google AI Studio and the Langfuse project settings page.

  ```
  GOOGLE_API_KEY="your_google_api_key_here"
  LANGFUSE_PUBLIC_KEY="pk-lf-..."
  LANGFUSE_SECRET_KEY="sk-lf-..."
  LANGFUSE_HOST="https://cloud.langfuse.com"  # or https://us.cloud.langfuse.com for US region
  ```
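Before going further, it is worth confirming that these credentials actually load. The sketch below assumes the Langfuse SDK's `auth_check()` helper is available in your installed version; if it is not, simply running the later steps and checking the dashboard works just as well:

```python
# check_setup.py -- optional sanity check for the .env configuration
from pathlib import Path

from dotenv import load_dotenv
from langfuse import Langfuse

load_dotenv(Path("support_agent/.env"))

client = Langfuse()
# auth_check() pings the Langfuse API with the configured keys.
print("Langfuse credentials valid:", client.auth_check())
```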
Coding the "Support Agent" (agent.py)
Now, let's write the code for our agent. Open `support_agent/agent.py` and add the following:
```python
# support_agent/agent.py
from google.adk.agents import LlmAgent


def lookup_order_status(order_id: str) -> dict:
    """Looks up the status of a specific order by its ID.

    Args:
        order_id: The customer's order ID, e.g. 'ABC-123'.

    Returns:
        A dictionary containing the order's status and details.
    """
    # In a real application, this would query a database or an API.
    # For this example, we'll return a fixed status for a known ID.
    known_orders = {
        "ABC-123": {"status": "Shipped", "estimated_delivery": "2025-07-28"},
        "XYZ-789": {"status": "Processing", "estimated_delivery": "2025-07-30"},
    }
    return known_orders.get(order_id, {"status": "Not Found", "error": "Invalid order ID."})


# This is the main agent ADK will run.
# Passing the plain Python function in `tools` lets ADK wrap it as a tool,
# using the type hints and docstring to describe it to the model.
root_agent = LlmAgent(
    name="support_agent",
    model="gemini-1.5-flash-latest",
    description="A helpful agent for the e-commerce store 'Gadget World', capable of checking order statuses.",
    instruction="""
You are a friendly and helpful customer support agent for 'Gadget World'.
Your primary task is to help users find the status of their order using the `lookup_order_status` tool.
- Politely greet the user.
- You must extract the `order_id` from the user's message.
- You must call the `lookup_order_status` tool with the extracted `order_id`.
- Once the tool returns the status, present the information to the user in a clear, helpful sentence.
- If the tool returns 'Not Found', apologize and inform the user that the order ID might be incorrect.
- Do not ask for any other personally identifiable information.
""",
    tools=[lookup_order_status],
)
```
This code defines a realistic `lookup_order_status` tool and an `LlmAgent` with a detailed prompt that instructs it on how to act as a helpful support agent.
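Because the tool is a plain Python function, you can sanity-check its logic directly, without involving the model at all:

```python
# A quick check of the tool logic on its own (run from the project root).
from support_agent.agent import lookup_order_status

print(lookup_order_status("ABC-123"))
# {'status': 'Shipped', 'estimated_delivery': '2025-07-28'}
print(lookup_order_status("NOPE-000"))
# {'status': 'Not Found', 'error': 'Invalid order ID.'}
```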
Define the `__init__.py`
Finally, make the agent discoverable by ADK. In `support_agent/__init__.py`, add the following line. This makes the directory a Python package and exposes the `root_agent` instance.
```python
# support_agent/__init__.py
from .agent import root_agent
```
Part 3: Mastering ADK's Evaluation Workflow
With our agent built, we now need to create a reliable way to test it. This involves creating a high-quality dataset of expected interactions and defining what a "successful" run looks like.
Creating a High-Quality Evaluation Dataset (`.evalset.json`)
Writing a valid `.evalset.json` file by hand is a complex and error-prone task. Fortunately, ADK provides a much better way. The `adk web` UI is not just a testing ground; it is a powerful test case generation tool. It allows you to capture real, multi-turn interactions and save them as structured, reusable evaluation sets.
Here's how to generate the evalset:
- From your terminal, in the `adk_langfuse_eval` directory, run the ADK web server:

  ```bash
  adk web ./support_agent
  ```

- Open your browser to the provided URL (usually `http://127.0.0.1:8000`). You will see a chat interface.

- Have a conversation with the agent to create a test case. For example: User: "Hi, can you check on my order? The ID is ABC-123"

PHOTO: A screenshot of the `adk web` UI showing the conversation with the "Support Agent". The agent's response, showing the correct tool call and final answer, should be visible.

- Once the interaction is complete, navigate to the Eval tab on the right side of the interface.

- Click Create new eval set. Name it `support.evalset.json` and save it inside the `support_agent` directory.

- Click Add current session to save the conversation you just had as a new evaluation case within that file.

PHOTO: A screenshot of the "Eval" tab in `adk web`, with arrows pointing to the "Create new eval set" and "Add current session" buttons to clearly illustrate the process.
This process serializes the entire interaction—user inputs, the exact tool calls the agent made with their arguments, and the final text response—into the correct JSON format. This becomes our "golden" dataset representing the expected behavior.
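If you are curious about what the UI produced, you can peek inside the file. The exact field names depend on your ADK version, so this sketch only prints the top-level structure rather than assuming a schema:

```python
# inspect_evalset.py -- peek at the generated evalset without assuming its schema
import json
from pathlib import Path

data = json.loads(Path("support_agent/support.evalset.json").read_text())
print(type(data).__name__)      # usually a JSON object
if isinstance(data, dict):
    print(list(data.keys()))    # top-level keys, e.g. the eval set ID and its cases
```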
Defining Success: The `test_config.json` File
Next, we need to define the pass/fail criteria for our evaluation. Create a file named `test_config.json` inside the `support_agent` directory. This file specifies the thresholds for ADK's built-in evaluation metrics:
- `tool_trajectory_avg_score`: This metric is critical. It measures the correctness of the agent's tool usage. A score of 1.0 signifies that the agent called the `lookup_order_status` tool with the perfect arguments (i.e., `order_id: 'ABC-123'`).
- `response_match_score`: This metric uses the ROUGE score to compare the agent's final text response to the reference response in the evalset. It measures the overlap in words and phrases, allowing for some flexibility in the LLM's wording.
Add the following content to `support_agent/test_config.json`:
```json
{
  "criteria": {
    "tool_trajectory_avg_score": 0.9,
    "response_match_score": 0.8
  }
}
```
For our support task, we expect a very high score for tool trajectory, as getting the `order_id` right is essential. The response match score is slightly lower to account for minor variations in the LLM's generated text (e.g., "The order has shipped" vs. "Your order's status is: Shipped").
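To build intuition for `response_match_score`, here is a deliberately simplified sketch of word-overlap scoring. ADK's metric is based on ROUGE, so treat this as an approximation of the idea, not the exact formula the evaluator uses:

```python
def simple_overlap_score(candidate: str, reference: str) -> float:
    """Unigram F1 overlap between two strings (a rough stand-in for ROUGE-1)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = len(set(cand) & set(ref))
    if not cand or not ref or overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(simple_overlap_score(
    "Your order ABC-123 has shipped",
    "Order ABC-123 has shipped",
))  # roughly 0.89: similar wording scores high but not perfect, hence a 0.8 threshold
```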
Choosing Your Evaluation Method
ADK offers multiple ways to run evaluations. Our goal of integrating with Langfuse makes the programmatic approach the only viable option, as it provides structured results that we can parse and use.
| Evaluation Method | Primary Use Case | Pros | Cons |
| :--- | :--- | :--- | :--- |
| `adk web` UI | Interactive debugging, creating evalsets | Visual, easy to inspect traces, great for exploration | Manual, not scalable for CI, limited configuration |
| `adk eval` CLI | Quick checks, simple CI/CD steps | Easy to automate, fast for one-off runs | Text-based output, difficult to parse for integration |
| Programmatic (`AgentEvaluator`) | Advanced CI/CD, custom integrations | Full control, returns structured results, enables integration | Requires writing a custom runner script |
Part 4: The Integration: Piping ADK Eval Scores into Langfuse
This is the heart of the tutorial. We will now write the `run_evaluation.py` script that orchestrates the entire process: running the ADK evaluation programmatically and sending the results to Langfuse.
The `run_evaluation.py` Script
Open the `run_evaluation.py` file at the root of your project and add the following code. The comments explain each step in detail.
```python
# run_evaluation.py
import asyncio
import os
from pathlib import Path

import langfuse
from dotenv import load_dotenv

# This is the programmatic entry point for ADK's evaluation logic.
from google.adk.evaluation.agent_evaluator import AgentEvaluator

# --- 1. Setup and Initialization ---

# Load environment variables from the agent's .env file
dotenv_path = Path('./support_agent/.env')
load_dotenv(dotenv_path=dotenv_path)

# Initialize the Langfuse client. It will automatically pick up credentials
# from the environment variables we just loaded.
try:
    langfuse_client = langfuse.Langfuse()
    print("Langfuse client initialized successfully.")
except Exception as e:
    print(f"Failed to initialize Langfuse client: {e}")
    exit(1)

# Define paths to our agent and evaluation files
AGENT_MODULE_PATH = "support_agent"
EVAL_SET_PATH = os.path.join(AGENT_MODULE_PATH, "support.evalset.json")
CONFIG_FILE_PATH = os.path.join(AGENT_MODULE_PATH, "test_config.json")


async def run_and_log_evaluation():
    """
    Runs the ADK evaluation programmatically and logs scores to Langfuse.
    """
    print(f"Starting evaluation for agent at: {AGENT_MODULE_PATH}")
    print(f"Using evalset: {EVAL_SET_PATH}")
    print(f"Using config: {CONFIG_FILE_PATH}")

    try:
        # --- 2. Execute the Evaluation Programmatically ---
        # The AgentEvaluator.evaluate_eval_set function runs the agent against
        # each case in the evalset and returns a list of result objects.
        # This is the key function that enables our custom integration.
        eval_results_list = await AgentEvaluator.evaluate_eval_set(
            agent_module_path=AGENT_MODULE_PATH,
            eval_set_file_path=EVAL_SET_PATH,
            config_file_path=CONFIG_FILE_PATH,
        )
        print(f"\nEvaluation complete. Found {len(eval_results_list)} results.")
    except Exception as e:
        print(f"An error occurred during ADK evaluation: {e}")
        return

    # --- 3. Capture Results and Score Traces in Langfuse ---
    for result in eval_results_list:
        # The Langfuse ADK integration automatically creates a trace for each
        # evaluation run. We need to get the ID of that trace.
        # We extract it from the 'actual_invocation' metadata.
        trace_id = result.actual_invocation.session_input.metadata.get("langfuse_trace_id")
        if not trace_id:
            print(f"Warning: Langfuse trace ID not found for eval case {result.eval_id}. Skipping.")
            continue

        print(f"\nProcessing results for eval case '{result.eval_id}' (Trace ID: {trace_id})")

        # Extract the scores from the evaluation result object
        tool_score = result.metrics.get("tool_trajectory_avg_score", 0.0)
        response_score = result.metrics.get("response_match_score", 0.0)
        print(f"  - Tool Trajectory Score: {tool_score}")
        print(f"  - Response Match Score: {response_score}")

        # --- 4. Send Scores to Langfuse ---
        # For each result, we make two calls to langfuse.score(), linking
        # our quantitative scores directly to the qualitative trace.
        try:
            langfuse_client.score(
                trace_id=trace_id,
                name="tool_trajectory_avg_score",
                value=float(tool_score),
                comment=f"ADK Eval Case ID: {result.eval_id}"
            )
            print("  - Successfully logged tool_trajectory_avg_score to Langfuse.")

            langfuse_client.score(
                trace_id=trace_id,
                name="response_match_score",
                value=float(response_score),
                comment=f"ADK Eval Case ID: {result.eval_id}"
            )
            print("  - Successfully logged response_match_score to Langfuse.")
        except Exception as e:
            print(f"  - Error logging scores to Langfuse for trace {trace_id}: {e}")

    # --- 5. Shutdown ---
    # Ensure all buffered data is sent to the Langfuse server before exiting.
    langfuse_client.flush()
    print("\nLangfuse client flushed. All data sent.")


if __name__ == "__main__":
    asyncio.run(run_and_log_evaluation())
```
This script is the centerpiece of our workflow. It uses `AgentEvaluator.evaluate_eval_set` to run the tests, iterates through the structured results, extracts the trace ID and metric scores, and then uses the `langfuse.score` method to push this data into the correct trace in Langfuse.
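Note that `score()` is the v2-style call of the Langfuse Python SDK. If your project uses the newer v3 SDK, the method is named differently; the sketch below assumes v3 exposes `create_score()` with the same basic fields, so double-check it against the SDK version you actually have installed:

```python
from langfuse import Langfuse

langfuse_client = Langfuse()

# Hedged v3-style alternative to langfuse_client.score(...); the field names are
# assumed to mirror the v2 arguments, and the trace ID is a placeholder.
langfuse_client.create_score(
    trace_id="the-trace-id-from-your-eval-run",
    name="tool_trajectory_avg_score",
    value=1.0,
    comment="ADK Eval Case ID: example",
)
langfuse_client.flush()
```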
Part 5: Closing the Loop: Analysis in the Langfuse UI
Now, let's execute our script and see the results in the Langfuse dashboard.
- Run the script: From your terminal in the project's root directory, run:

  ```bash
  python run_evaluation.py
  ```

  You should see output indicating that the evaluation is running and that scores are being logged to Langfuse.

- Analyze in Langfuse: Navigate to your Langfuse project in your browser.

- Go to the Traces dashboard. You will see a new trace corresponding to the evaluation run. Notice the Scores column is now populated with the metrics we just sent.

PHOTO: A screenshot of the Langfuse "Traces" dashboard. A row for the "support_agent" trace should be highlighted, with the "Scores" column clearly showing values for "tool_trajectory_avg_score" and "response_match_score".

- Click on the trace to open the detailed view. On the left, you can see the entire execution hierarchy: the agent's main run, the LLM call it made to decide which tool to use, and the execution of the `lookup_order_status` tool itself. This is the rich, qualitative context captured automatically by the Langfuse-ADK integration.

PHOTO: A screenshot of the detailed view of a single trace. The left-hand panel should show the nested observations: the parent `support_agent` span, a child `Generation` observation for the LLM call, and another child `Span` for the `lookup_order_status` tool call.

- Now, click on the Scores tab within this trace view. This is the payoff. You will see our two custom scores, `tool_trajectory_avg_score` and `response_match_score`, neatly logged with their values and comments.

PHOTO: A screenshot of the "Scores" tab within the trace detail view. It should clearly display two entries: one for "tool_trajectory_avg_score" and one for "response_match_score", showing the numeric values and comments we provided in the script.
Beyond a Single Run: Monitoring Over Time
The true power of this workflow is realized when it's automated. By integrating this `run_evaluation.py` script into a CI/CD pipeline that runs after every code or prompt change, you can build powerful dashboards in Langfuse to track these evaluation scores over time. This allows your team to instantly spot performance regressions, objectively compare different models or prompts, and make data-driven decisions to improve agent quality. This is the foundation of a mature AgentOps practice.
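The exact gating logic depends on your pipeline, but one simple pattern is to fail the build when any metric drops below the thresholds in `test_config.json`. A minimal sketch, assuming you aggregate the per-case scores from `run_evaluation.py` into a dictionary (`ci_gate.py` is a hypothetical helper, not part of ADK or Langfuse):

```python
# ci_gate.py -- exit non-zero when evaluation scores fall below configured thresholds
import json
import sys


def gate(scores: dict[str, float], config_path: str = "support_agent/test_config.json") -> None:
    """Compare collected scores against the thresholds in test_config.json."""
    with open(config_path) as f:
        thresholds = json.load(f)["criteria"]
    failures = {
        name: (scores.get(name, 0.0), minimum)
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    }
    for name, (actual, minimum) in failures.items():
        print(f"FAIL: {name}={actual:.2f} is below the threshold of {minimum:.2f}")
    if failures:
        sys.exit(1)
    print("All evaluation metrics meet their thresholds.")


if __name__ == "__main__":
    # Example: plug in the averaged scores collected by run_evaluation.py.
    gate({"tool_trajectory_avg_score": 1.0, "response_match_score": 0.85})
```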
Conclusion
In this guide, we have built a complete, end-to-end evaluation workflow. We started by building a functional, realistic agent with Google's ADK. We then used the `adk web` UI not just for testing, but as a sophisticated tool to generate a "golden" evaluation dataset. Finally, we wrote a programmatic runner that executes the ADK evaluation and pipes the quantitative results directly into our observability platform, Langfuse, linking them to the rich, qualitative execution traces.
This pattern of combining quantitative, dataset-based evaluation with qualitative, trace-based observability is fundamental to building reliable, production-grade AI systems. It moves agent development from an art to an engineering discipline, providing the feedback loops necessary for continuous improvement.
As a next step, you can adapt this pattern for your own, more complex agents. Explore creating larger and more diverse evalsets to cover more edge cases. Investigate other evaluation methods, such as using an LLM-as-a-judge within Langfuse to score aspects like helpfulness or tone. Or, expand your agent's capabilities by building multi-agent systems, a feature where ADK truly shines. The foundation you've built here will support you as you tackle these more advanced challenges in the exciting world of agent development.