Authors:
- Junjie Bu, Senior Staff Software Engineer
- Chitra Venkatesh, Product Manager
In our previous post, we detailed how to build a simple A2A agent in a Colab notebook and use Vertex AI's evaluation service to evaluate its final response.
In this post, we will showcase how to deploy A2A agents to Cloud Run, and how to build and evaluate a multi-agent system in which the agents collaborate via A2A.
We walk through the process of taking an existing multi-agent application, in this case the Airbnb and Weather A2A agents sample from the a2a-samples GitHub repository, and deploying it as a set of scalable, serverless A2A agent services on Cloud Run with a few clicks.
The key takeaways from this article include:
- Deploying A2A agents to Cloud Run: Learn how to containerize and deploy your Python-based A2A agents to Cloud Run from Colab with a few clicks, which gives you A2A agent service endpoints behind a secure and scalable architecture.
- Orchestration with a Hosting Agent: See how to create a central “hosting” agent that orchestrates the interactions between the deployed A2A agents, routing user requests to the appropriate specialized agent.
- Leveraging Vertex AI for evaluation: Discover how to use Vertex AI’s evaluation services to rigorously assess the performance of your multi-agent system. We’ll cover how to:
- Define evaluation datasets with prompts and expected tool calls (trajectories).
- Run evaluation tasks to measure trajectory-based metrics such as trajectory_exact_match, trajectory_precision, and trajectory_recall.
- Evaluate the final generated responses for coherence and safety.
- Custom evaluation metrics: Learn how to create custom metrics to evaluate specific aspects of your agent’s behavior, such as whether the final response logically follows from the sequence of tool calls.
This guide provides a practical, hands-on example of how to build, deploy, and evaluate a sophisticated multi-agent system on Google Cloud, giving you the tools and techniques to create your own robust and reliable AI-powered applications.
We will walk you through a practical demonstration of a multi-agent system. We'll show you how to deploy it to Google Cloud Run with just a few simple steps in the Colab (you can also follow the deploy this system guide to do it from the command line).
The Hosting Agent, which handles the orchestration and routing between the other agents, will be run directly from this Colab notebook. Its logic is adapted from our official samples with minor modifications for this environment, but it can also be run locally.
Finally, we will demonstrate how to use the Vertex AI Evaluation service to assess the performance of this multi-agent interaction. For now, we will leverage the existing function tool call mechanism for this evaluation. In a future post, we will dive deeper into more advanced techniques, including how to incorporate richer A2A traceability information and use enhanced features of the Vertex AI Evaluation service.
Prerequisites
- Google Cloud Project: You need a Google Cloud Project with the Vertex AI API enabled.
- Authentication: You need to be authenticated to Google Cloud. In a Colab environment, this is usually handled by running from google.colab import auth followed by auth.authenticate_user().
- Agent logic: The Airbnb A2A Agent and Weather A2A Agent are imported from GitHub into this Colab and deployed directly to Cloud Run. The logic for the Hosting/Routing Agent (e.g., a HostingAgentExecutor class) is defined or importable within this notebook. This executor should have a method like async def execute(self, message_payload: a2a.types.MessagePayload) -> a2a.types.Message; a minimal skeleton is sketched below.
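To make the expected shape concrete, here is a minimal, illustrative skeleton of such an executor; the class body and routing steps are assumptions based on the signature above, not the exact code from the samples:

import a2a.types


class HostingAgentExecutor:
    """Hypothetical hosting/routing agent executor used in this walkthrough."""

    async def execute(
        self, message_payload: a2a.types.MessagePayload
    ) -> a2a.types.Message:
        # 1. Inspect the incoming payload and decide which remote A2A agent
        #    (Airbnb or Weather) should handle the task.
        # 2. Forward the task to that agent over A2A and await its reply.
        # 3. Wrap the remote agent's answer in an a2a.types.Message and return it.
        ...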
Link to the Colab notebook.
Preparation
We will be leveraging the A2A (Agent2Agent) and ADK (Agent Development Kit) Python SDKs in this tutorial. To learn more about the Agent2Agent Protocol, you can review the official A2A documentation.
%pip install google-cloud-aiplatform httpx "a2a-sdk" --quiet
%pip install --upgrade --quiet 'google-adk'
%pip install "langchain-google-genai==2.1.5" --quiet
%pip install "langchain-mcp-adapters==0.1.0" --quiet
%pip install "langchain-google-vertexai==2.0.24" --quiet
%pip install "langgraph==0.4.5" --quiet
We define some global configurations and make sure the GCP project's service account has the required permissions for Cloud Run and Vertex AI.
from google.colab import auth

PROJECT_ID = '[Your project Id]'  # @param {type:"string"}
PROJECT_NUM = '[Your project Number]'  # @param {type:"string"}
LOCATION = 'us-central1'  # @param {type:"string"}

# --- Authentication (for Colab) ---
if not PROJECT_ID:
    raise ValueError('Please set your PROJECT_ID.')
try:
    auth.authenticate_user()
    print('Colab user authenticated.')
except Exception as e:
    print(
        f'Not in a Colab environment or auth failed: {e}. Assuming local gcloud auth.'
    )

!gcloud projects add-iam-policy-binding {PROJECT_ID} \
  --member="serviceAccount:{PROJECT_NUM}-compute@developer.gserviceaccount.com" \
  --role="roles/cloudbuild.builds.builder"
!gcloud projects add-iam-policy-binding {PROJECT_ID} \
  --member="serviceAccount:{PROJECT_NUM}-compute@developer.gserviceaccount.com" \
  --role="roles/aiplatform.user"
Deploy Airbnb and Weather A2A agents to Cloud Run
We use git clone to pull the A2A samples from GitHub, create the Dockerfiles, build the Docker images, and then deploy them to Cloud Run.
!git clone https://github.com/a2aproject/a2a-samples.git --depth 1 -b main
Prepare the Dockerfile for the Airbnb agent:
%%writefile a2a-samples/samples/python/Dockerfile
FROM node:20-slim AS node
FROM python:3.13-slim
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

# Copy the latest Node.js, which is required for the airbnb_agent to work
COPY --from=node /usr/local/bin/node /usr/local/bin/
COPY --from=node /usr/local/bin/npm /usr/local/bin/

EXPOSE 10002
WORKDIR /app
COPY . /app
RUN uv sync

WORKDIR /app/agents/airbnb_planner_multiagent/airbnb_agent/
ENTRYPOINT ["uv", "run", ".", "--host", "0.0.0.0", "--port", "10002"]
Build the Docker image for the Airbnb A2A agent:
IMAGE_NAME = "airbnb-a2a-sample-agent"  # @param {type:"string"}
# LOCATION = "us-central1"  # @param {type:"string"}
TAG = "latest"  # @param {type:"string"}
SOURCE_PATH = "a2a-samples/samples/python/"  # @param {type:"string"}

# Using Google Container Registry (GCR)
IMAGE_URL = f"gcr.io/{PROJECT_ID}/{IMAGE_NAME}:{TAG}"
print(f"Building and pushing image to: {IMAGE_URL}")

!gcloud builds submit {SOURCE_PATH} \
  --project={PROJECT_ID} \
  --tag={IMAGE_URL}
Deploy the Airbnb A2A agent to Cloud Run
# Replace with your actual service name, region, and docker image URL
SERVICE_NAME = 'airbnb-a2a-sample-agent'  # @param {type:"string"}
IMAGE_URL = f'gcr.io/{PROJECT_ID}/{SERVICE_NAME}:latest'
AIRBNB_APP_URL = f'https://{SERVICE_NAME}-{PROJECT_NUM}.{LOCATION}.run.app'

# Run the Airbnb A2A agent in Cloud Run
!gcloud run deploy {SERVICE_NAME} \
  --memory=4G \
  --image={IMAGE_URL} \
  --region={LOCATION} \
  --port=10002 \
  --project={PROJECT_ID} \
  --no-allow-unauthenticated \
  --set-env-vars=GOOGLE_GENAI_USE_VERTEXAI=TRUE,GOOGLE_GENAI_MODEL="gemini-2.5-flash",PROJECT_ID={PROJECT_ID},LOCATION={LOCATION},APP_URL={AIRBNB_APP_URL}
Endpoint authentication: Public vs. private
When deploying with gcloud run deploy, the --allow-unauthenticated flag determines whether your A2A endpoint is publicly reachable, so it is critical for securing the service.
- Public endpoint: If you deploy with the --allow-unauthenticated flag, Cloud Run generates a public URL that can be accessed by anyone on the internet.
- Private endpoint (recommended): If you deploy with --no-allow-unauthenticated, the URL is private and protected by IAM-based authentication.
Authenticating with a private endpoint
To communicate with a private A2A endpoint, you must first obtain an identity token and include it in the Authorization header of your request. We will demonstrate how to do this in the Colab Notebook below.
Note: If you are using a public URL, you do not need to generate a token and should omit the Authorization header from your requests.
You can obtain the identity token using the gcloud CLI. You may retrieve the token directly in the notebook or run the command in your local shell and copy the token value here.
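For example, one way to fetch a token inside the notebook (assuming gcloud is available and authenticated in this environment) is shown below; TOKEN is the variable used by the later requests:

# Fetch an identity token for calling the private Cloud Run endpoint.
# Alternatively, run `gcloud auth print-identity-token` in your local shell
# and paste the value here.
TOKEN = !gcloud auth print-identity-token
TOKEN = TOKEN[0]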
Once you have finished the above steps, you should see the service up and running in the Cloud Run console of your GCP project.
Follow similar steps to deploy the Weather A2A agent; a sketch is shown below.
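For illustration only, assuming you have built a second image whose Dockerfile points at the weather agent directory, the deployment could look like the following. The service name, image name, and port are assumptions, so adjust them to match your build:

# Hypothetical names and port for the Weather agent -- adjust to match your build.
WEATHER_SERVICE_NAME = 'weather-a2a-sample-agent'
WEATHER_IMAGE_URL = f'gcr.io/{PROJECT_ID}/{WEATHER_SERVICE_NAME}:latest'
WEATHER_APP_URL = f'https://{WEATHER_SERVICE_NAME}-{PROJECT_NUM}.{LOCATION}.run.app'

!gcloud run deploy {WEATHER_SERVICE_NAME} \
  --memory=4G \
  --image={WEATHER_IMAGE_URL} \
  --region={LOCATION} \
  --port=10001 \
  --project={PROJECT_ID} \
  --no-allow-unauthenticated \
  --set-env-vars=GOOGLE_GENAI_USE_VERTEXAI=TRUE,GOOGLE_GENAI_MODEL="gemini-2.5-flash",PROJECT_ID={PROJECT_ID},LOCATION={LOCATION},APP_URL={WEATHER_APP_URL}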
If everything goes well, you should see the gcloud run deploy command above complete successfully. You can run a quick test to check whether the A2A agent card is returned correctly. If the agent is deployed to a private endpoint, you may need to include an identity token in the header when running the query.
HOST = f'{AIRBNB_APP_URL}{AGENT_CARD_WELL_KNOWN_PATH}'
!curl -H "Authorization: Bearer {TOKEN}" {HOST}
Helper functions
Next, create a few helper functions to format and print the results of the evaluation.
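As a minimal sketch (assuming the evaluation result object exposes summary_metrics and a metrics_table DataFrame, as the Vertex AI GenAI evaluation SDK does), a display_eval_report helper could look like this:

import pandas as pd
from IPython.display import Markdown, display


def display_eval_report(eval_result) -> None:
    """Prints the summary metrics and per-row metrics of an evaluation run."""
    display(Markdown('### Summary metrics'))
    display(pd.DataFrame([eval_result.summary_metrics]))
    display(Markdown('### Row-based metrics'))
    display(eval_result.metrics_table)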
Assembling the hosting agent for evaluation
The Vertex AI Evaluation service can interact with agents in two primary ways: either directly with agents that are Queryable, or by using a custom function wrapper that conforms to a specific signature.
For this tutorial, we will use the custom function approach. This allows us to create a wrapper that not only triggers the Hosting Agent with a given input but also parses its complex output. The function’s primary role is to extract key information for evaluation, such as the agent’s final response and the sequence of tools it called.
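A simplified sketch of such a wrapper is shown below. The run_routing_agent helper and the event fields are assumptions for illustration; the real parsing depends on how the Hosting Agent surfaces its tool calls and final text. The synchronous variant, agent_parsed_outcome_sync, is what we later pass to the evaluation task.

import asyncio


async def agent_parsed_outcome(prompt: str) -> dict:
    """Runs the Hosting Agent on a prompt and extracts what the evaluator needs."""
    # run_routing_agent is an assumed helper that sends the prompt to the
    # routing/hosting agent and returns its raw events (tool calls + final text).
    events = await run_routing_agent(prompt)

    predicted_trajectory = []
    final_response = ''
    for event in events:
        # Assumed event shape: tool-call events expose a name and input,
        # and the last text event carries the final answer.
        if event.get('tool_name'):
            predicted_trajectory.append({
                'tool_name': event['tool_name'],
                'tool_input': event.get('tool_input', {}),
            })
        elif event.get('text'):
            final_response = event['text']

    return {
        'response': final_response,
        'predicted_trajectory': predicted_trajectory,
    }


def agent_parsed_outcome_sync(prompt: str) -> dict:
    """Synchronous wrapper so the evaluation service can call the agent."""
    # In a notebook you may need nest_asyncio.apply() so asyncio.run works
    # inside the already-running event loop.
    return asyncio.run(agent_parsed_outcome(prompt))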
We define a RemoteAgentConnections class here because the endpoint deployed above is private and uses IAM-based authentication:
class RemoteAgentConnections:
    """A class to hold the connections to the remote agents."""

    def __init__(self, agent_card: AgentCard, agent_url: str):
        print(f'agent_card: {agent_card}')
        print(f'agent_url: {agent_url}')
        headers = {'Authorization': f'Bearer {TOKEN}'}
        self._httpx_client = httpx.AsyncClient(timeout=30, headers=headers)
        self.agent_client = A2AClient(
            self._httpx_client, agent_card, url=agent_url
        )
        self.card = agent_card

    def get_agent(self) -> AgentCard:
        """Get the agent card."""
        return self.card

    async def send_message(
        self, message_request: SendMessageRequest
    ) -> SendMessageResponse:
        """Send a message to the agent."""
        return await self.agent_client.send_message(message_request)
We then define the Hosting Agent (RoutingAgent), which utilizes the RemoteAgentConnections class defined above to talk securely to the A2A agents deployed on private Cloud Run endpoints.
async def create_routing_agent() -> Agent:
    """Creates and asynchronously initializes the RoutingAgent."""
    routing_agent_instance = await RoutingAgent.create(
        remote_agent_addresses=[
            AIRBNB_APP_URL,
            WEATHER_APP_URL,
        ]
    )
    return routing_agent_instance.create_agent()
Note that the latest ADK also provides an easier way to connect to a remote A2A agent with the RemoteA2aAgent class. You can check more details here. Creating an A2A remote connection can be as simple as:
prime_agent = RemoteA2aAgent(
    name="prime_agent",
    description="Agent that handles checking if numbers are prime.",
    agent_card=(
        f"http://localhost:8001/a2a/check_prime_agent{AGENT_CARD_WELL_KNOWN_PATH}"
    ),
)
Now that our multi-agent setup is ready, let's set up the evaluation dataset.
Preparing the evaluation dataset
To evaluate your agent with the Vertex AI Evaluation service, you need to construct a dataset tailored to the specific aspects you want to measure.
You can include the following fields:
- Ground Truth: The ideal or expected final response from the agent.
- Reference Trajectory: The ideal sequence of tool calls the agent should execute to arrive at the correct answer for a given prompt.
- Pre-Generated Results: You can also bring your own results (such as previously generated responses and tool call trajectories) to evaluate them against the ground truth.
Below, we provide an example dataset for the hosting agent, which includes user prompts and their corresponding reference trajectories.
eval_data_a2a = {
    "prompt": [
        "What's the weather in Yosemite Valley, CA",
        "Looking for Airbnb in Yosemite for August 1 to 6, 2025",
        "What's the weather in San Francisco, CA",
        "Looking for Airbnb in Paris, France for August 10 to 12, 2025",
    ],
    "reference_trajectory": [
        [{
            "tool_name": "send_message",
            "tool_input": {
                "task": "What's the weather in Yosemite Valley, CA",
                "agent_name": "Weather Agent",
            },
        }],
        [{
            "tool_name": "send_message",
            "tool_input": {
                "task": "Find Airbnb in Yosemite for August 1 to 6, 2025",
                "agent_name": "Airbnb Agent",
            },
        }],
        [{
            "tool_name": "send_message",
            "tool_input": {
                "task": "What's the weather in San Francisco, CA",
                "agent_name": "Weather Agent",
            },
        }],
        [{
            "tool_name": "send_message",
            "tool_input": {
                "task": "Find Airbnb in Paris, France for August 10 to 12, 2025",
                "agent_name": "Airbnb Agent",
            },
        }],
    ],
}

eval_sample_dataset = pd.DataFrame(eval_data_a2a)
Run evaluation task
Let's submit an evaluation by running the evaluate method of a new EvalTask. We need to define the metrics before submitting the evaluation task.
Available trajectory metrics
To evaluate an agent’s trajectory, the Vertex AI Evaluation service provides several ground-truth based metrics:
- trajectory_exact_match: The predicted trajectory is identical to the reference trajectory (same actions in the same order).
- trajectory_in_order_match: All reference actions are present in the predicted trajectory and appear in the correct order (extra actions in the prediction are ignored).
- trajectory_any_order_match: All reference actions are present in the predicted trajectory, regardless of their order or any extra actions.
- trajectory_precision: The proportion of predicted actions that are also present in the reference trajectory.
- trajectory_recall: The proportion of reference actions that are also present in the predicted trajectory.
Note: trajectory_precision and trajectory_recall return a score from 0.0 to 1.0. The other trajectory metrics return a binary score of either 0 (mismatch) or 1 (match).
trajectory_metrics = [
    "trajectory_exact_match",
    "trajectory_in_order_match",
    "trajectory_any_order_match",
    "trajectory_precision",
    "trajectory_recall",
]

EXPERIMENT_RUN = f"trajectory-{get_id()}"

trajectory_eval_task = EvalTask(
    dataset=eval_sample_dataset,
    metrics=trajectory_metrics,
    experiment=EXPERIMENT_NAME,
    output_uri_prefix=BUCKET_URI + "/multiple-metric-eval",
)

trajectory_eval_result = trajectory_eval_task.evaluate(
    runnable=agent_parsed_outcome_sync, experiment_run_name=EXPERIMENT_RUN
)

display_eval_report(trajectory_eval_result)
Finally, you can visualize the output using the helper functions. Beyond assessing trajectory, you can define custom rubrics to evaluate your agents. Feel free to check out our Colab notebook for more information.
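As one possible sketch of such a custom metric, the snippet below defines a model-based pointwise metric that checks whether the final response logically follows from the tool-call trajectory. The metric name, criteria text, and rubric here are illustrative, and the exact prompt-template fields may vary with your SDK version:

from vertexai.preview.evaluation import PointwiseMetric, PointwiseMetricPromptTemplate

# Illustrative custom metric: does the final response logically follow from
# the sequence of tool calls the agent made?
response_follows_trajectory = PointwiseMetric(
    metric="response_follows_trajectory",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "follows_trajectory": (
                "Evaluate whether the agent's final response logically follows"
                " from the sequence of tool calls (the trajectory) it executed."
            ),
        },
        rating_rubric={
            "1": "The response follows logically from the trajectory.",
            "0": "The response does not follow logically from the trajectory.",
        },
        input_variables=["prompt", "predicted_trajectory"],
    ),
)

# The metric can then be passed to an EvalTask alongside built-in metrics, e.g.
# EvalTask(dataset=..., metrics=[response_follows_trajectory, "coherence", "safety"], ...)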
For additional information, check out the following resources: