Monitoring Agentic AI Applications with Splunk Observability Cloud

2 minutes Author Derek Mitchell

Splunk Observability for AI monitors the performance, quality, security, and cost of AI application stack. It includes the following:

AI Agent Monitoring, which monitors the performance, quality, security, and cost of LLM and agentic applications.
AI Infrastructure Monitoring, which monitors the health, availability, and consumption (or usage) of AI infrastructure.

This workshop provides hands-on experience deploying and working with these capabilities in Splunk Observability Cloud. This includes:

Understanding how to connect an Azure account to Splunk Observability Cloud to capture AI infrastructure-related metrics.
Exploring out-of-the box dashboards and navigators related to AI infrastructure.
Reviewing the architecture of an Agentic AI application built with LangChain and LangGraph.
Practice deploying an Agentic AI application and instrumenting it with OpenTelemetry.
Exploring how metrics, traces, and logs can be used in Splunk Observability Cloud to understand agent performance.
Practice modifying an Agentic AI application to use tool calls and agents.
Practice adding quality issues to an application and detecting them with Splunk Observability Cloud using semantic quality evals.
Practice adding AI Defense instrumentation to the application and security risks, and detecting them with Splunk Observability Cloud.

Tip

The easiest way to navigate through this workshop is by using:

the left/right arrows (< | >) on the top right of this page
the left (◀️) and right (▶️) cursor keys on your keyboard

Connect to EC2 Instance

5 minutes

Connect to your EC2 Instance

We’ve prepared an Ubuntu Linux instance in AWS/EC2 for each attendee:

Access the Splunk Show event by clicking on the link for your region
Click Enroll on the top-right corner
Then look near the bottom of the page for your EC2 instance details

You should see connection information such as the following:

Using the IP address (which is part of the SSH Command) and SSH Password provided as part of the Connection Information, connect to your EC2 instance using one of the methods below:

Mac OS / Linux
- ssh splunk@IP address
Windows 10+
- Use the OpenSSH client
Earlier versions of Windows
- Use Putty

VPN Connection

If you’re working from an office and having trouble connecting, try connecting to your corporate VPN first.

Retrieve your Instance Name

Once you’ve logged into your EC2 instance via ssh, use the following command to get your instance name:

echo $INSTANCE

Make a note of this, as your instance name is unique to you and will be used later in the workshop to find your data in Splunk Observability Cloud.

Connect Visual Studio Code (Optional)

We’ll be editing several files throughout the workshop. The workshop instructions include tip for doing this using a vi editor, and workshop participants can use the nano editor as well.

If you prefer a full-fledged IDE, you can connect Visual Studio Code running on your laptop to edit remote files on the EC2 instance.

The high-level steps to do this are as follows:

Download and install VS code on your machine using this link.
In VS Code, navigate to Settings and then Extensions.
Search for the Remote – SSH extension (by Microsoft) and install it.

Press F1 (or Ctrl+Shift+P on Windows / Cmd+Shift+P on Mac OS).
Run Remote-SSH: Connect to Host.
Copy your SSH command from Splunk Show: ssh -p 2222 splunk@EC2_PUBLIC_IP.
Choose the default SSH config file when prompted.
Press F1 (or Ctrl+Shift+P on Windows / Cmd+Shift+P on Mac OS) again.
Run Remote-SSH: Connect to Host.
Select the host you just added. VS Code will open a new window and start the connection.
A prompt will appear at the top of VS Code asking for the SSH password. Copy the password from Splunk Show and enter it here.
Click Open Folder then input /home/splunk/workshop/agentic-ai as the folder name:

You can now files remotely with VS Code!

Review Azure OpenAI Metrics, Dashboards, and Navigators

10 minutes

This workshop will use OpenAI models running in Azure.

You can monitor the performance of Azure OpenAI applications by configuring your Azure OpenAI applications to send metrics to Splunk Observability Cloud.

We’ve already integrated our Azure account with the workshop instance of Splunk Observability Cloud using the steps described in the documentation.

To ensure Azure OpenAI metrics are included, the connection was configured to pull metrics from Cognitive Services:

Azure OpenAI Metrics

A number of metrics are captured for Azure OpenAI:

ProcessedPromptTokens
GeneratedTokens
AzureOpenAIRequests
AzureOpenAITimeToResponse
AzureOpenAIAvailabilityRate
AzureOpenAITokenPerSecond
AzureOpenAIContextTokensCacheMatchRate

Navigate to Metrics -> Metric finder, and then search for the ProcessedPromptTokens metric and click View in chart:

Note: you can also use this link to view this metric with the Metric finder.

Azure OpenAI Navigator

Splunk Observability Cloud collects OpenTelemetry generative AI client and model server metrics to track the token usage and Open AI large-language model (LLM) services running in Azure.

You can view these metrics using the Azure OpenAI navigator. Navigate to Infrastructure -> Overview -> AI Frameworks and then click on Azure OpenAI:

Azure OpenAI Dashboard

Splunk Observability Cloud provides a built-in dashboard for Azure OpenAI that gives you immediate visibility into:

The active Azure OpenAI models
Token usage
Invocation latency
Invocations by model
Time to first byte
Total response time
Model availability
Number of tokens per request
Number of tokens processed by model
Number of tokens generated by model

Navigate to Dashboards and then search for Azure OpenAI to view the dashboard:

Deploy the OpenTelemetry Collector

10 minutes

We’ll be using OpenTelemetry throughout the workshop to capture metrics, traces, and logs from an Agentic AI application running in Kubernetes. In this section, we’ll install an OpenTelemetry collector in our Kubernetes cluster using Helm. This will be used to capture metrics, traces, and logs from our environment and send them to Splunk.

Install the Collector using Helm

We first need to add the helm repo:

helm repo add splunk-otel-collector-chart https://signalfx.github.io/splunk-otel-collector-chart

And ensure the repo is up-to-date:

helm repo update

To configure the helm chart deployment, let’s create a new file named values.yaml in the /home/splunk directory:

# switch to the /home/splunk dir
cd /home/splunk
# create a values.yaml file in vi
vi values.yaml

Then paste the following contents:

Type :set paste before pasting the contents, to prevent vi from auto-indenting the pasted code.

agent:
  config:
    exporters:
      signalfx:
        send_otlp_histograms: true

To save your changes in vi, press the esc key to enter command mode, then type :wq! followed by pressing the enter/return key.

This custom configuration ensures that any histogram metrics received by the exporter will be sent to Splunk Observability backend in OTLP format without conversion to SignalFx format. This setting is critical to ensure that histogram metrics used by AI Agent Monitoring such as gen_ai.evaluation.score are processed as expected.

Now we can use the following command to install the collector:

  helm upgrade --install splunk-otel-collector --version 0.136.0 \
  --set="splunkObservability.realm=$REALM" \
  --set="splunkObservability.accessToken=$ACCESS_TOKEN" \
  --set="clusterName=$INSTANCE-cluster" \
  --set="environment=agentic-ai-$INSTANCE" \
  --set="splunkPlatform.token=$HEC_TOKEN" \
  --set="splunkPlatform.endpoint=$HEC_URL" \
  --set="splunkPlatform.index=splunk4rookies-workshop" \
  -f values.yaml \
  splunk-otel-collector-chart/splunk-otel-collector

NAME: splunk-otel-collector
LAST DEPLOYED: Fri Dec 20 01:01:43 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Splunk OpenTelemetry Collector is installed and configured to send data to Splunk Observability realm us1.

Confirm the Collector is Running

We can confirm whether the collector is running with the following command:

kubectl get pods

NAME                                                         READY   STATUS    RESTARTS   AGE
splunk-otel-collector-agent-dkn88                            1/1     Running   0          53s
splunk-otel-collector-agent-ksmh4                            1/1     Running   0          53s
splunk-otel-collector-agent-lc2lf                            1/1     Running   0          53s
splunk-otel-collector-k8s-cluster-receiver-dbf64995b-xgm9b   1/1     Running   0          53s

Confirm your K8s Cluster is in O11y Cloud

Using the New Kubernetes Experience

If you’re configured to use the new Kubernetes experience in O11y Cloud, follow the steps in this section. Otherwise, refer to the Using the Traditional Kubernetes Experience section instead.

In Splunk Observability Cloud, navigate to Infrastructure -> Kubernetes overview, then add your cluster name (which is <your instance name>-cluster):

Tip: use the echo $INSTANCE command if you’ve forgotten your instance name

After clicking Apply Filters you should see an overview for your cluster similar to the following:

Using the Traditional Kubernetes Experience

In Splunk Observability Cloud, navigate to Infrastructure -> Kubernetes -> Kubernetes Clusters, and then search for your cluster name (which is <your instance name>-cluster):

Tip: use the echo $INSTANCE command if you’ve forgotten your instance name

Agentic AI Application Architecture

15 minutes

Application Overview

This workshop utilizes an Agentic AI application for booking travel. In this section, we’ll walk through the application architecture and highlight the key LangChain and LangGraph concepts it uses.

LangChain vs. LangGraph

LangChain provides the core building blocks for working with large language models, such as prompts, tools, and model integrations. LangGraph builds on those concepts to orchestrate complex, stateful workflows between those components. In simple terms, LangChain helps you define what an LLM-powered step does, while LangGraph helps control how those steps flow together in an agentic application.

Although the primary goal of the workshop is to instrument the application with OpenTelemetry, having a basic understanding of how the application is structured will make the observability work much clearer. Seeing how the agents, tools, and workflows are built will help you recognize what the telemetry represents once we begin tracing and analyzing the system.

If you’d like to explore the implementation while we go through the architecture, the application source code is available on your EC2 instance at:

~/workshop/agentic-ai/base-app/main.py.

The application is a Flask API that accepts a travel planning request and runs it through a LangGraph workflow made up of several LangChain-powered LLM nodes. Each node plays a specific role, updates shared state, and hands off to the next step.

In this part of the workshop, we will review:

the request lifecycle
the shared state model
how LangGraph nodes work
the LangChain abstractions used in the code
where observability will matter later

Navigate to the subsections to learn more about the application architecture and implementation.

4.1 Request Lifecycle

What the application does

At a high level, the application accepts a request and turns it into a multi-step workflow:

coordinator
flight specialist
hotel specialist
activity specialist
synthesizer

The main flow looks like this:

@app.route("/travel/plan", methods=["POST"])
def plan():
    data = request.get_json()

    origin = data.get("origin", "Seattle")
    destination = data.get("destination", "Paris")
    user_request = data.get(
        "user_request",
        f"Planning a week-long trip from {origin} to {destination}. "
        "Looking for boutique hotel, flights and unique experiences.",
    )
    travellers = int(data.get("travellers", 2))

    result = plan_travel_internal(
        origin=origin,
        destination=destination,
        user_request=user_request,
        travellers=travellers
    )

    return jsonify(result), 200

A helpful way to explain this is:

Flask receives the request
plan_travel_internal() builds the workflow state
LangGraph executes the nodes
each node updates the state
the final itinerary is returned as JSON

Knowledge Check

Where does the LangGraph workflow actually start executing in this API flow?

Click here to see the answer

It starts inside plan_travel_internal(). The Flask route only receives the request and extracts parameters. plan_travel_internal() initializes the workflow state and invokes the LangGraph graph, which then runs the nodes (coordinator, specialists, synthesizer) that update the state until the final itinerary is produced.

4.2 Shared State

Shared State in LangGraph

The most important LangGraph concept in this app is the shared state object:

class PlannerState(TypedDict):
    messages: Annotated[List[AnyMessage], add_messages]
    user_request: str
    session_id: str
    origin: str
    destination: str
    departure: str
    return_date: str
    travellers: int
    flight_summary: Optional[str]
    hotel_summary: Optional[str]
    activities_summary: Optional[str]
    final_itinerary: Optional[str]
    current_agent: str

This state moves through the graph from node to node.

Each node:

reads values from state
does some work
writes new values back to state
sets current_agent to control what happens next

This is a key LangGraph mental model: stateful workflow orchestration.

Knowledge Check

How would you explain the syntax used for the messages field?

messages: Annotated[List[AnyMessage], add_messages]

Click here to see the answer

messages: Annotated[List[AnyMessage], add_messages] does two things.

List[AnyMessage] defines the type of the field: it’s a list of LangChain message objects (system, human, or AI messages).
Annotated[..., add_messages] adds LangGraph behavior that tells the graph how updates to this field should be handled.

Specifically, add_messages means that when a node writes new messages, LangGraph will append them to the existing list instead of overwriting it. So the conversation history grows as each node adds messages.

4.3 Orchestration

Where execution begins

The main orchestration happens in plan_travel_internal():

def plan_travel_internal(
    origin: str,
    destination: str,
    user_request: str,
    travellers: int,
    ) -> Dict[str, object]:
    session_id = str(uuid4())
    departure, return_date = _compute_dates()

    initial_state: PlannerState = {
        "messages": [HumanMessage(content=user_request)],
        "user_request": user_request,
        "session_id": session_id,
        "origin": origin,
        "destination": destination,
        "departure": departure,
        "return_date": return_date,
        "travellers": travellers,
        "flight_summary": None,
        "hotel_summary": None,
        "activities_summary": None,
        "final_itinerary": None,
        "current_agent": "start",
    }

    workflow = build_workflow()
    compiled_app = workflow.compile()

    for step in compiled_app.stream(initial_state, config):
        node_name, node_state = next(iter(step.items()))
        final_state = node_state

This function implements the following application lifecycle:

build initial state
build the graph
compile it
stream execution step by step

Knowledge Check

Question 1

Why does the code use compiled_app.stream(initial_state, config) instead of simply calling the graph once and getting the final result?

Click here to see the answer

Because streaming executes the workflow step by step as each node runs. This lets the application observe intermediate states, track which node is executing, and monitor the workflow in real time instead of waiting only for the final output.

Question 2

Why do we create an initial_state before running the graph?

Click here to see the answer

Because LangGraph workflows operate on a shared state object. The initial_state provides the starting data that nodes will read from, update, and pass along as the workflow progresses.

4.4 Defining the Graph

How the graph is defined

The graph is built explicitly in build_workflow():

def build_workflow() -> StateGraph:
    graph = StateGraph(PlannerState)
    graph.add_node("coordinator", lambda state: coordinator_node(state))
    graph.add_node("flight_specialist", lambda state: flight_specialist_node(state))
    graph.add_node("hotel_specialist", lambda state: hotel_specialist_node(state))
    graph.add_node("activity_specialist", lambda state: activity_specialist_node(state))
    graph.add_node("plan_synthesizer", lambda state: plan_synthesizer_node(state))
    graph.add_conditional_edges(START, should_continue)
    graph.add_conditional_edges("coordinator", should_continue)
    graph.add_conditional_edges("flight_specialist", should_continue)
    graph.add_conditional_edges("hotel_specialist", should_continue)
    graph.add_conditional_edges("activity_specialist", should_continue)
    graph.add_conditional_edges("plan_synthesizer", should_continue)
    return graph

And the routing logic is here:

def should_continue(state: PlannerState) -> str:
    mapping = {
    "start": "coordinator",
    "flight_specialist": "flight_specialist",
    "hotel_specialist": "hotel_specialist",
    "activity_specialist": "activity_specialist",
    "plan_synthesizer": "plan_synthesizer",
    }
    return mapping.get(state["current_agent"], END)

Even though this uses conditional edges, the workflow is effectively linear:

start
coordinator
flight specialist
hotel specialist
activity specialist
synthesizer
end

Knowledge Check

If the workflow is effectively linear, why does the graph still use add_conditional_edges and the should_continue() router?

Click here to see the answer

Because it makes the workflow flexible and extensible. Even though the current flow is linear, the routing function allows the graph to dynamically decide the next node based on the state. This makes it easy to add branching, retries, or different execution paths later without redesigning the graph.

4.5 Defining Nodes

How a node works

A LangGraph node in this app is just a Python function that accepts state and returns updated state.

For example, the flight specialist:

def flight_specialist_node(state: PlannerState) -> PlannerState:
    llm = _create_llm(
    "flight_specialist", temperature=0.4, session_id=state["session_id"]
    )

    step = (
        f"Find an appealing flight from {state['origin']} to {state['destination']} "
        f"departing {state['departure']} for {state['travellers']} travellers."
    )

    messages = [
        SystemMessage(content="You are a flight booking specialist. Provide concise options."),
        HumanMessage(content=step),
    ]

    result = llm.invoke(messages)
    state["flight_summary"] = result.content
    state["messages"].append(result)
    state["current_agent"] = "hotel_specialist"
    return state

This exhibits the common node pattern:

create or access an LLM
build a prompt from structured state
invoke the model
save the result into state
set the next node

The hotel and activity nodes follow the same structure, which makes the workflow easy to explain.

Knowledge Check

When creating the LLM for the flight_specialist node, we specified a temperature of 0.4. What does this mean?

Click here to see the answer

Temperature controls how random or creative the model’s responses are.

Lower temperature (e.g., 0.0–0.3): more deterministic and consistent responses
Medium (around 0.4–0.7): balanced between accuracy and creativity
Higher (0.8+): more diverse and creative, but less predictable

So setting temperature=0.4 means the flight_specialist agent will produce responses that are mostly consistent and reliable, with a small amount of variation, which useful for tasks that need correctness but not completely rigid answers.

4.6 Message Abstractions

LangChain Message Abstractions

The application uses LangChain message abstractions rather than one long prompt string.

from langchain_core.messages import (
    AIMessage,
    BaseMessage,
    HumanMessage,
    SystemMessage,
)

This is important because each node can separate:

system role
user task
model response

For example:

messages = [
    SystemMessage(content="You are a flight booking specialist. Provide concise options."),
    HumanMessage(content=step),
]
result = llm.invoke(messages)

Knowledge Check

How would you define system, human, and AI messages?

Click here to see the answer

In LangChain and LangGraph, messages are typically categorized by who is speaking and what role they play in guiding the conversation:

System message: Sets the rules and context for the AI’s behavior. It defines instructions, constraints, tone, and goals that guide how the model should respond throughout the interaction.
Human message: Input from the user. It contains questions, requests, or information that the AI should respond to.
AI message: The model’s response. It represents the assistant’s generated output based on the system instructions and human input.

4.7 LLM Creation

LLM Creation

The LLM itself is created here:

def _create_llm(agent_name: str, *, temperature: float, session_id: str) -> AzureChatOpenAI:
    azure_deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")
    azure_openai_api_version = os.getenv("AZURE_OPENAI_API_VERSION")

    return AzureChatOpenAI(
        azure_deployment=azure_deployment_name,
        openai_api_version=azure_openai_api_version,
        temperature=temperature,
        model_name = azure_deployment_name,
        # AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT environment variables will be used to connect to the LLM
    )

This approach separates model configuration from workflow logic. Different nodes can use different temperatures depending on how deterministic or creative they should be.

Knowledge Check

How would you create an LLM for OpenAI (rather than Azure OpenAI?)

Click here to see the answer

Creating an LLM for OpenAI has a few differences. The function would return a ChatOpenAI object instead of AzureChatOpenAI.

With OpenAI directly, you don’t use the Azure-specific parameters (azure_deployment, openai_api_version, Azure endpoint). Instead, you specify the model name and rely on the standard OPENAI_API_KEY environment variable.

Here’s an example:

def _create_llm(agent_name: str, *, temperature: float, session_id: str) -> ChatOpenAI:
    model_name = os.getenv("OPENAI_MODEL_NAME", "gpt-4o-mini")

    return ChatOpenAI(
        model=model_name,
        temperature=temperature,
        # Uses OPENAI_API_KEY automatically from environment
    )

4.8 Decomposition Pattern

The synthesizer shows the decomposition pattern

The final node combines the specialist outputs into one answer.

def plan_synthesizer_node(state: PlannerState) -> PlannerState:
    llm = _create_llm(
    "plan_synthesizer", temperature=0.3, session_id=state["session_id"]
    )

    content = json.dumps(
        {
            "flight": state["flight_summary"],
            "hotel": state["hotel_summary"],
            "activities": state["activities_summary"],
        },
        indent=2,
    )

    response = llm.invoke(
        [
            SystemMessage(
                content="You are the travel plan synthesiser. Combine the specialist insights into a concise, structured itinerary."
            ),
            HumanMessage(
                content=(
                    f"Traveller request: {state['user_request']}\n\n"
                    f"Origin: {state['origin']} | Destination: {state['destination']}\n"
                    f"Dates: {state['departure']} to {state['return_date']}\n\n"
                    f"Specialist summaries:\n{content}"
                )
            ),
        ]
    )
    state["final_itinerary"] = response.content
    state["messages"].append(response)
    state["current_agent"] = "completed"
    return state

This is a classic pattern for agentic apps:

decompose work into specialists
collect intermediate outputs
synthesize into a final response

That is one of the main architectural ideas you should take away from this overview.

Knowledge Check

Why does the app use a separate plan_synthesizer node instead of letting one agent generate the entire travel plan?

Click here to see the answer

Because the system breaks the problem into specialized tasks first (flights, hotels, activities). Each specialist produces a focused summary, and the plan_synthesizer node then combines those outputs into one coherent itinerary.

This pattern improves modularity, reliability, and observability, since each agent handles a smaller problem and the final node integrates the results.

Deploy the Agentic AI Application

15 minutes

Deploy the Agentic AI Application (Linux)

We’ll start by running the application directly on our Linux EC2 instance.

Set Environment Variables

The document provided by the workshop instructor contains export commands to set the following environment variables:

AZURE_OPENAI_DEPLOYMENT_NAME
AZURE_OPENAI_API_VERSION
AZURE_OPENAI_ENDPOINT
AZURE_OPENAI_API_KEY

These environment variables tell the application how to connect to an OpenAI model hosted in Azure.

Copy and paste these export commands from the document and run them in your ssh terminal.

Create Virtual Environment

Next, we’ll create a Python virtual environment and install the packages needed to run the application:

cd ~/workshop/agentic-ai/base-app
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run the Application

Then we can run the application with the following command:

python3 main.py

Test the Application

Open a second terminal session connected to your EC2 instance, and run the following command to test the application. It should return the suggested travel plans in json format:

curl http://localhost:8080/travel/plan \
  -H "Content-Type: application/json" \
  -d '{
    "origin": "Seattle",
    "destination": "Tokyo",
    "user_request": "We are planning a week-long trip to Seattle from Tokyo. Looking for boutique hotel, business-class flights and unique experiences.",
    "travelers": 2
  }'

{"activities_summary":"Sure! Here are signature activities for a week in Tokyo:\n\n1. Day 1: Explore Asakusa and Senso-ji Temple, then stroll Nakamise Shopping Street.\n2. Day 2: Visit Tsukiji Outer Market for fresh sushi breakfast, then tour Ginza for upscale shopping.\n3. Day 3: Spend the day in Shibuya\u2014cross the famous scramble, visit Hachiko statue, and shop in trendy boutiques.\n4. Day 4: Explore Harajuku\u2019s Takeshita Street and Meiji Shrine, followed by Omotesando\u2019s stylish cafes.\n5. Day 5: Discover Akihabara\u2019s electronics and anime culture, with a visit to a themed caf\u00e9.\n6. Day 6: Take a day trip to Odaiba for teamLab Borderless digital art museum and waterfront views.\n7. Day 7: Relax in Ueno Park, visit museums, and shop at Ameya-Yokocho market.\n\nWould you like hotel or dining recommendations as well?","agent_steps":[{"agent":"coordinator","status":"completed"},{"agent":"flight_specialist","status":"completed"},{"agent":"hotel_specialist","status":"completed"}

Stop the Application

Once you’ve confirmed that the application is working successfully, return to your first terminal and stop the application.

Deploy the Agentic AI Application (Kubernetes)

Now that the application is working successfully, let’s deploy it to Kubernetes.

Build the Docker Image

In this section, we’ll use the Dockerfile located at ~/workshop/agentic-ai/base-app/Dockerfile to build a Docker image for the application. Run the following commands to build the image:

cd ~/workshop/agentic-ai/base-app
docker build --platform linux/amd64 -t localhost:9999/agentic-ai-app:base-app .
docker push localhost:9999/agentic-ai-app:base-app

Tip: if the image is taking too long to build, consider using the pre-built image instead. To do so, update the image name in the ~/workshop/agentic-ai/base-app/k8s.yaml file to ghcr.io/splunk/agentic-ai-app:base-app instead of localhost:9999/agentic-ai-app:base-app.

Create Application Namespace

Let’s create a new namespace to host our application:

kubectl create ns travel-agent

Create Secret with Azure Credentials

We’ll use a Kubernetes secret to store the Azure OpenAI endpoint and key:

Caution: ensure you run this command in the terminal where you set the AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY environment variables earlier.

{ [ -z "$AZURE_OPENAI_ENDPOINT" ] || \
  [ -z "$AZURE_OPENAI_API_KEY" ]; } && \
  echo "Error: Missing variables" || \
  kubectl create secret generic azure-openai-api \
  -n travel-agent \
  --from-literal=azure-openai-api-endpoint=$AZURE_OPENAI_ENDPOINT \
  --from-literal=azure-openai-api-key=$AZURE_OPENAI_API_KEY

Note: if you get an error that says Missing variables, you’ll need to define your environment variables again using the export commands provided in the document from your instructor.

Deploy the Application Using the Kubernetes Manifest File

A pre-built Kubernetes manifest can be found in the file named ~/workshop/agentic-ai/base-app/k8s.yaml.

We can deploy the application using the manifest file as follows:

kubectl apply -f ~/workshop/agentic-ai/base-app/k8s.yaml

Ensure the Application is Running

Use the following command to ensure the application pod has a status of Running:

kubectl get pods -n travel-agent

NAME                                        READY   STATUS    RESTARTS   AGE
travel-planner-langchain-68977dc5c4-4w7p9   1/1     Running   0          41s

Test the Application in Kubernetes

Run the following command to test the application:

curl http://travel-planner.localhost/travel/plan \
  -H "Content-Type: application/json" \
  -d '{
    "origin": "Seattle",
    "destination": "Tokyo",
    "user_request": "We are planning a week-long trip to Seattle from Tokyo. Looking for boutique hotel, business-class flights and unique experiences.",
    "travelers": 2
  }'

{"activities_summary":"Sure! Here are signature activities for a week in Tokyo:\n\n1. Day 1: Explore Asakusa and Senso-ji Temple, then stroll Nakamise Shopping Street.\n2. Day 2: Visit Tsukiji Outer Market for fresh sushi breakfast, then tour Ginza for upscale shopping.\n3. Day 3: Spend the day in Shibuya\u2014cross the famous scramble, visit Hachiko statue, and shop in trendy boutiques.\n4. Day 4: Explore Harajuku\u2019s Takeshita Street and Meiji Shrine, followed by Omotesando\u2019s stylish cafes.\n5. Day 5: Discover Akihabara\u2019s electronics and anime culture, with a visit to a themed caf\u00e9.\n6. Day 6: Take a day trip to Odaiba for teamLab Borderless digital art museum and waterfront views.\n7. Day 7: Relax in Ueno Park, visit museums, and shop at Ameya-Yokocho market.\n\nWould you like hotel or dining recommendations as well?","agent_steps":[{"agent":"coordinator","status":"completed"},{"agent":"flight_specialist","status":"completed"},{"agent":"hotel_specialist","status":"completed"}

Troubleshooting

If you need to troubleshoot, use the following command to view the application logs:

kubectl logs -l app=travel-planner-langchain -n travel-agent -f

Instrument the Agentic AI Application

15 minutes

Note: this section of the workshop requires changes to multiple files. If you’re not sure where to make the changes, or your application is no longer working, please refer to the expected solution for this section which is in the ~/workshop/agentic-ai/app-with-instrumentation folder.

There are a few steps required to instrument our Agentic AI application with OpenTelemetry and deploy it to Kubernetes:

Add the instrumentation packages to the requirements.txt file
Update the Dockerfile that invokes the application using opentelemetry-instrument
Build a new Docker image with the instrumentation packages
Update the Kubernetes manifest with environment variables
Deploy the Kubernetes manifest

Add Instrumentation Packages

Next, we need to install several instrumentation packages. We can achieve this by opening the ~/workshop/agentic-ai/base-app/requirements.txt for editing and adding the following packages to the bottom of the file:

splunk-opentelemetry==2.8.0
splunk-otel-instrumentation-langchain==0.1.7
splunk-otel-genai-emitters-splunk==0.1.7
splunk-otel-util-genai==0.1.9
opentelemetry-instrumentation-flask==0.59b0

These packages can be described as follows:

splunk-opentelemetry: this is the Splunk distribution of OpenTelemetry Python, which instruments a Python application to capture and report distributed traces to Splunk APM.
splunk-otel-instrumentation-langchain: this package provides OpenTelemetry instrumentation for LangChain LLM/chat workflows.
splunk-otel-genai-emitters-splunk: this package provides emitters for Splunk schema for Evaluation Results logs to optimize storage and filtering in Splunk Platform.
splunk-otel-util-genai: this package includes utility functions to provide APIs and data types to ease instrumentation of Generative AI workloads using OpenTelemetry semantic conventions.
opentelemetry-instrumentation-flask: this library builds on the OpenTelemetry WSGI middleware to track web requests in Flask applications.

Hint: run the following command to compare your changes with the expected solution:
diff ~/workshop/agentic-ai/base-app/requirements.txt ~/workshop/agentic-ai/app-with-instrumentation/requirements.txt

Update the Dockerfile

Then, we need to enable OpenTelemetry instrumentation. This is done by updating the Dockerfile to ensure the application is started with opentelemetry-instrument. Open the ~/workshop/agentic-ai/base-app/Dockerfile file for editing and update the last line as follows:

# Run the server with instrumentation
CMD ["opentelemetry-instrument", "python", "main.py"]

Hint: run the following command to compare your changes with the expected solution:
diff ~/workshop/agentic-ai/base-app/Dockerfile ~/workshop/agentic-ai/app-with-instrumentation/Dockerfile

Build an Updated Docker Image

Build an updated Docker image with a new tag:

cd ~/workshop/agentic-ai/base-app
docker build --platform linux/amd64 -t localhost:9999/agentic-ai-app:app-with-instrumentation .
docker push localhost:9999/agentic-ai-app:app-with-instrumentation

Tip: if the image is taking too long to build, consider using the pre-built image instead. To do so, update the image name in the ~/workshop/agentic-ai/base-app/k8s.yaml file to ghcr.io/splunk/agentic-ai-app:app-with-instrumentation instead of localhost:9999/agentic-ai-app:app-with-instrumentation.

Define the Config Map

When we deploy our application to Kubernetes, we want telemetry (metrics, traces, and logs) to be sent to Splunk Observability Cloud with a clear and unique environment identifier. This makes it easier to filter, compare, and troubleshoot data across different deployments.

To do this, we’ll set the OpenTelemetry resource attribute named deployment.environment. Rather than hard-coding the value, we’ll derive it from the INSTANCE environment variable that already exists on our EC2 instance. This ensures each deployment is automatically tagged with the correct environment name.

We’ll store this configuration in a Kubernetes ConfigMap, which can later be injected into our application pods as an environment variable.

Create the ConfigMap with the following command:

kubectl create configmap instance-config \
--from-literal=OTEL_RESOURCE_ATTRIBUTES=deployment.environment=agentic-ai-$INSTANCE \
-n travel-agent

What this does:

Defines the OTEL_RESOURCE_ATTRIBUTES environment variable expected by OpenTelemetry.
Sets deployment.environment to a value like agentic-ai-shw-1c43, depending on the value of $INSTANCE.
Creates the ConfigMap in the travel-agent namespace.

We’ll reference this ConfigMap in the next step when we configure our Kubernetes deployment.

Update the Kubernetes Manifest

OpenTelemetry instrumentation, and AI Agent Monitoring in particular, require a number of environment variables to be set that define how instrumentation data is collected, processed, and exported.

Open the ~/workshop/agentic-ai/base-app/k8s.yaml file for editing. Update the image tag to ensure we’re using the image with the instrumentation:

          image: localhost:9999/agentic-ai-app:app-with-instrumentation

In the same file, add the following environment variables between the comments that say Begin: Add Environment Variables and End: Add Environment Variables:

Hint: Type :set paste before pasting the contents, to prevent vi from auto-indenting the pasted code.

            # Begin: Add Environment Variables
            # Service Name
            - name: OTEL_SERVICE_NAME
              value: "travel-planner"
            # Additional OTEL configuration
            - name: OTEL_RESOURCE_ATTRIBUTES
              valueFrom:
                configMapKeyRef:
                  name: instance-config
                  key: OTEL_RESOURCE_ATTRIBUTES
            - name: SPLUNK_OTEL_AGENT
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://$(SPLUNK_OTEL_AGENT):4317"
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: "grpc"
            - name: OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
              value: "DELTA"
            - name: OTEL_PYTHON_EXCLUDED_URLS
              value: "^(https?://)?[^/]+(/health)?$"
            - name: OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT
              value: "true"
            - name: OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT_MODE
              value: "SPAN"
            - name: OTEL_INSTRUMENTATION_GENAI_EMITTERS
              value: "span_metric,splunk"
            - name: SPLUNK_PROFILER_ENABLED
              value: "true"
            # End: Add Environment Variables

Note: some of the text may not be visible without scrolling. Use the Copy text to clipboard button on the top right-hand corner to ensure you’ve copied all of the text.

Note: indentation is critical with yaml; ensure the new environment variables align with the existing environment variables.

The following environment variables are specific to Agentic AI monitoring and can be described as follows:

OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE: this determines if the OTLP metric exporter reports cumulative totals, deltas, or low-memory-friendly temporality for emitted metrics. Setting this to DELTA is recommended for Agentic AI monitoring.
OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT: this is used to enable/disable message capture from Agentic AI applications. We’ve set it to true for this workshop.
OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT_MODE: this defines how messages should be captured. We’ve set it to SPAN for this workshop, which ensures messages are captured using the span event store.
OTEL_INSTRUMENTATION_GENAI_EMITTERS: we’ve set this to span_metric,splunk for the workshop, which ensures that both span and metric data are captured, as well as Splunk-specific features.

Hint: run the following command to compare your changes with the expected solution:
diff ~/workshop/agentic-ai/base-app/k8s.yaml ~/workshop/agentic-ai/app-with-instrumentation/k8s.yaml

Deploy the Updated Application

We can deploy the updated application using the manifest file as follows:

kubectl apply -f ~/workshop/agentic-ai/base-app/k8s.yaml

Test the Application in Kubernetes

Ensure the new application pod has started successfully and the old pod is no longer present:

kubectl get pods -n travel-agent

NAME                                        READY   STATUS    RESTARTS   AGE
travel-planner-langchain-68977dc5c4-4w7p9   1/1     Running   0          41s

Then, run the following command to test the application:

curl http://travel-planner.localhost/travel/plan \
  -H "Content-Type: application/json" \
  -d '{
    "origin": "Seattle",
    "destination": "Tokyo",
    "user_request": "We are planning a week-long trip to Seattle from Tokyo. Looking for boutique hotel, business-class flights and unique experiences.",
    "travelers": 2
  }'

Troubleshooting

If you need to troubleshoot, use the following command to view the application logs:

kubectl logs -l app=travel-planner-langchain -n travel-agent -f

Review Agent Trace Data

10 minutes

Review LLM Provider Configuration

Splunk Observability Cloud includes an integration that allows you to connect a Large Language Model (LLM). Splunk uses this connection to evaluate the semantic quality of LLM responses generated by your applications.

This integration has already been configured in the workshop organization.

To view the configuration, navigate to Data Management → Deployed Integrations, search for LLM Providers, and select it. You should see the following provider:

Click on the Azure OpenAI O11y Specialists provider to view the details:

In this organization, the sampling rate is set to 20%. This means that, on average, Splunk evaluates the semantic quality of 20% of the LLM responses generated by the application.

A rate limit of 50 evaluations per minute is also configured. Both the sampling rate and the rate limit can be adjusted depending on customer needs. Higher sampling rates provide more evaluation data, but they also increase token usage and associated costs.

Review AI Agent Monitoring Configuration

Splunk Observability Cloud also includes a page that allows you to configure which data source is used for storing details related to AI Agent Monitoring. The choices include:

Data source – Splunk Observability Cloud
Data source – Splunk logs

You can see these settings by navigating to Settings -> AI Agent Monitoring:

Splunk recommends utilizing Splunk Observability Cloud for storing AI Agent Monitoring related details. This is the setting we’ve used for this workshop.

Review AI Monitoring Permissions

Due to the potentially sensitive nature of LLM conversation data, a new role called ai_monitoring has been added to Splunk Observability Cloud to control who can access and view this information:

View Trace Data in Splunk Observability Cloud

In Splunk Observability Cloud, navigate to APM and then select Service Map. Ensure your environment name is selected (e.g. agentic-ai-$INSTANCE).

Tip: use the echo $INSTANCE command if you’ve forgotten your instance name

You should see a service map that looks like the following:

Click on Traces on the right-hand side menu. Then select one of the slower running traces. It should look like the following example:

Notice that we don’t see our agent names in the Agent flow section (i.e. coordinator, flight-specialist, etc.).

Scrolling down, let’s click on one of the AI interactions in the trace. Here, we can see that the prompt and response has been captured. We can also see the results of the semantic quality evaluations for this trace:

Next, navigate to APM and then select AI agents. Ensure your environment name is selected (e.g. agentic-ai-$INSTANCE). You’ll notice that the page is empty!

We’ll address these instrumentation issues in the next section.

Add Tool Calls

15 minutes

Note: this section of the workshop requires changes to multiple files. If you’re not sure where to make the changes, or your application is no longer working, please refer to the model solution for this section which is in the ~/workshop/agentic-ai/app-with-agents-and-tools folder.

In the previous section, we discovered that our agents aren’t appearing on the new Agents page, nor in the Agent flow at the top of the trace.

The reason is that our application isn’t currently using agents, but is instead invoking the LLM directly.

In other words, right now, our app is like a scripted play. Every line and every action is written in the code. When we call the LLM, we are just asking it to read a specific line. Because the LLM isn’t making choices, the Observability for AI instrumentation doesn’t recognize it as an autonomous agent.

In this next section, we are going to give the LLM tools and the authority to decide how to use them. By moving to an agentic model, the LLM will start generating Tool Calls. Our OpenTelemetry instrumentation will capture these interactions, allowing us to see the LLM’s thought process and tool usage, and each of our agents will be represented in Splunk Observability Cloud.

Direct Invocation vs. Agentic Traces

Before making these changes, let’s dive deeper into how traces are captured when the LLM is invoked directly vs. via an agent.

Direct Invocation Traces:

When you call llm.invoke(), the instrumentation sees a standard “Chat” or “Completion” span. It records the prompt and the response. Because there is no “loop” or “tool-calling” logic managed by the agent framework, Splunk Observability Cloud doesn’t see the metadata required to categorize the span as an “Agent.”

Agentic Traces:

When you use an agent (e.g., create_react_agent), the framework wraps the execution in specific “Agent” and “Tool” spans. These spans contain metadata that tells OpenTelemetry: “This isn’t just a chat; this is a reasoning loop with specific tools.” This is what populates the Agents Page and the Agent Flow diagrams in the trace visualization.

Make a Backup

Before making changes to the Python code, make a backup of the main.py file using the following command:

cp ~/workshop/agentic-ai/base-app/main.py ~/workshop/agentic-ai/base-app/main.py.bak

Add Import Statements

Open the ~/workshop/agentic-ai/base-app/main.py file for editing.

Add the following import statements between the lines that say Begin: Add Import Statements and End: Add Import Statements:

# Begin: Add Import Statements

from langchain_core.tools import tool
from langchain.agents import (
    create_agent as _create_react_agent,  # type: ignore[attr-defined]
)

# End: Add Import Statements

Add Tools

In the same main.py file, add the tool definitions between the lines that say Begin: Tool Definitions and End: Tool Definitions:

# Begin: Tool Definitions

@tool
def mock_search_flights(origin: str, destination: str, departure: str) -> str:
    """Return mock flight options for a given origin/destination pair."""
    # create a local random.Random instance
    seed = hash((origin, destination, departure)) % (2**32)
    rng = random.Random(seed)
    airline = rng.choice(["SkyLine", "AeroJet", "CloudNine"])
    fare = rng.randint(700, 1250)

    return (
        f"Top choice: {airline} non-stop service {origin}->{destination}, "
        f"depart {departure} 09:15, arrive {departure} 17:05. "
        f"Premium economy fare ${fare} return."
    )


@tool
def mock_search_hotels(destination: str, check_in: str, check_out: str) -> str:
    """Return mock hotel recommendation for the stay."""
    seed = hash((destination, check_in, check_out)) % (2**32)
    rng = random.Random(seed)
    name = rng.choice(["Grand Meridian", "Hotel Lumière", "The Atlas"])
    rate = rng.randint(240, 410)

    return (
        f"{name} near the historic centre. Boutique suites, rooftop bar, "
        f"average nightly rate ${rate} including breakfast."
    )


@tool
def mock_search_activities(destination: str) -> str:
    """Return a short list of signature activities for the destination."""
    data = DESTINATIONS.get(destination.lower(), DESTINATIONS["paris"])
    bullets = "\n".join(f"- {item}" for item in data["highlights"])
    return f"Signature experiences in {destination.title()}:\n{bullets}"
    
# End: Tool Definitions

Configure the Application for AI Agent Monitoring

Currently, our application creates an LLM and invokes it as follows:

def flight_specialist_node(state: PlannerState) -> PlannerState:
    llm = _create_llm(
    "flight_specialist", temperature=0.4, session_id=state["session_id"]
    )
    ...
    result = llm.invoke(messages)
    ...

For AI Agent Monitoring, we need to instead create an agent that includes metadata with the agent name, and then invoke the agent rather than the LLM.

LangChain Agents

In the next step, we’ll add agents to our application. But what exactly is an agent in the context of LangChain?

According to the LangChain documentation:

“Agents combine language models with tools to create systems that can reason about tasks, decide which tools to use, and iteratively work towards solutions.”

In practice, this means the model is no longer limited to generating text. Instead, it can choose from a set of available tools (such as APIs, databases, or code execution) to help complete a task.

This style of agent is often called a LangChain ReAct agent.

ReAct stands for Reasoning + Acting. The agent works through a loop where it:

briefly reasons about the task,
selects and calls a relevant tool,
observes the result, and
uses that new information to decide the next step.

This process repeats until the agent has gathered enough information to produce a final answer.

Replace the definitions for the coordinator_node, flight_specialist_node, hotel_specialist_node, activity_specialist_node, and plan_synthesizer_node functions with the following:

Tip: to delete a large number of lines in bulk using the vi editor, press Shift + v to ensure Visual Line mode, then use the down arrow to select all the lines you want to delete, then press d to delete the selected lines.

def coordinator_node(
    state: PlannerState
) -> PlannerState:
    llm = _create_llm("coordinator", temperature=0.2, session_id=state["session_id"])
    agent = _create_react_agent(llm, tools=[]).with_config(
        {
            "run_name": "coordinator",
            "tags": ["agent", "agent:coordinator"],
            "metadata": {
                "agent_name": "coordinator",
                "session_id": state["session_id"],
            },
        }
    )
    system_message = SystemMessage(
        content=(
            "You are the lead travel coordinator. Extract the key details from the "
            "traveller's request and describe the plan for the specialist agents."
        )
    )

    result = agent.invoke({"messages": [system_message] + list(state["messages"])})
    final_message = result["messages"][-1]
    state["messages"].append(
        final_message
        if isinstance(final_message, BaseMessage)
        else AIMessage(content=str(final_message))
    )
    state["current_agent"] = "flight_specialist"
    return state


def flight_specialist_node(
    state: PlannerState
) -> PlannerState:
    llm = _create_llm(
        "flight_specialist", temperature=0.4, session_id=state["session_id"]
    )
    agent = _create_react_agent(llm, tools=[mock_search_flights]).with_config(
        {
            "run_name": "flight_specialist",
            "tags": ["agent", "agent:flight_specialist"],
            "metadata": {
                "agent_name": "flight_specialist",
                "session_id": state["session_id"],
            },
        }
    )
    step = (
        f"Find an appealing flight from {state['origin']} to {state['destination']} "
        f"departing {state['departure']} for {state['travellers']} travellers."
    )

    # IMPORTANT: pass a proper list of messages (not stringified)
    messages = [
        SystemMessage(content="You are a flight booking specialist. Provide concise options."),
        HumanMessage(content=step),
    ]

    result = agent.invoke({"messages": messages})
    final_message = result["messages"][-1]
    state["flight_summary"] = final_message.content if isinstance(final_message, BaseMessage) else str(final_message)
    state["messages"].append(final_message if isinstance(final_message, BaseMessage) else AIMessage(content=str(final_message)))
    state["current_agent"] = "hotel_specialist"
    return state


def hotel_specialist_node(
    state: PlannerState
) -> PlannerState:
    llm = _create_llm(
        "hotel_specialist", temperature=0.5, session_id=state["session_id"]
    )
    agent = _create_react_agent(llm, tools=[mock_search_hotels]).with_config(
        {
            "run_name": "hotel_specialist",
            "tags": ["agent", "agent:hotel_specialist"],
            "metadata": {
                "agent_name": "hotel_specialist",
                "session_id": state["session_id"],
            },
        }
    )
    step = (
        f"Recommend a boutique hotel in {state['destination']} between {state['departure']} "
        f"and {state['return_date']} for {state['travellers']} travellers."
    )

    # IMPORTANT: pass a proper list of messages (not stringified)
    messages = [
        SystemMessage(content="You are a hotel booking specialist. Provide concise options."),
        HumanMessage(content=step),
    ]

    result = agent.invoke({"messages": messages})

    final_message = result["messages"][-1]
    state["hotel_summary"] = (
        final_message.content
        if isinstance(final_message, BaseMessage)
        else str(final_message)
    )
    state["messages"].append(
        final_message
        if isinstance(final_message, BaseMessage)
        else AIMessage(content=str(final_message))
    )
    state["current_agent"] = "activity_specialist"
    return state


def activity_specialist_node(
    state: PlannerState
) -> PlannerState:
    llm = _create_llm(
        "activity_specialist", temperature=0.6, session_id=state["session_id"]
    )
    agent = _create_react_agent(llm, tools=[mock_search_activities]).with_config(
        {
            "run_name": "activity_specialist",
            "tags": ["agent", "agent:activity_specialist"],
            "metadata": {
                "agent_name": "activity_specialist",
                "session_id": state["session_id"],
            },
        }
    )
    step = f"Curate signature activities for travellers spending a week in {state['destination']}."

    # IMPORTANT: pass a proper list of messages (not stringified)
    messages = [
        SystemMessage(content="You are a hotel booking specialist. Provide concise options."),
        HumanMessage(content=step),
    ]

    result = agent.invoke({"messages": messages})

    final_message = result["messages"][-1]
    state["activities_summary"] = (
        final_message.content
        if isinstance(final_message, BaseMessage)
        else str(final_message)
    )
    state["messages"].append(
        final_message
        if isinstance(final_message, BaseMessage)
        else AIMessage(content=str(final_message))
    )
    state["current_agent"] = "plan_synthesizer"
    return state
    
def plan_synthesizer_node(state: PlannerState) -> PlannerState:
    llm = _create_llm(
        "plan_synthesizer", temperature=0.3, session_id=state["session_id"]
    )

    agent = _create_react_agent(llm, tools=[]).with_config(
        {
            "run_name": "plan_synthesizer",
            "tags": ["agent", "agent:plan_synthesizer"],
            "metadata": {
                "agent_name": "plan_synthesizer",
                "session_id": state["session_id"],
            },
        }
    )

    system_content = (
        "You are the travel plan synthesiser. Combine the specialist insights into a "
        "concise, structured itinerary covering flights, accommodation and activities."
    )

    content = json.dumps(
        {
            "flight": state["flight_summary"],
            "hotel": state["hotel_summary"],
            "activities": state["activities_summary"],
        },
        indent=2,
    )

    out = agent.invoke(
        {
            "messages": [
                SystemMessage(content=system_content),
                HumanMessage(
                    content=(
                        f"Traveller request: {state['user_request']}\n\n"
                        f"Origin: {state['origin']} | Destination: {state['destination']}\n"
                        f"Dates: {state['departure']} to {state['return_date']}\n\n"
                        f"Specialist summaries:\n{content}"
                    )
                ),
            ]
        }
    )
    # 1) Extract the assistant’s final text
    final_msg = next(m for m in reversed(out["messages"]) if isinstance(m, AIMessage))
    state["final_itinerary"] = final_msg.content

    # 2) Append the new messages to your ongoing conversation
    state["messages"].extend(out["messages"])  # or append just final_msg

    state["current_agent"] = "completed"
    return state

Notice how we passed a tool when creating the flight, hotel, and activity specialist agents. When the agent is invoked, the LLM will decide whether the tool should be invoked to fulfill the request.

Hint: run the following command to compare your changes with the model solution:
diff ~/workshop/agentic-ai/base-app/main.py ~/workshop/agentic-ai/app-with-agents-and-tools/main.py

Build an Updated Docker Image

Build an updated Docker image with a new tag:

cd ~/workshop/agentic-ai/base-app
docker build --platform linux/amd64 -t localhost:9999/agentic-ai-app:app-with-agents-and-tools .
docker push localhost:9999/agentic-ai-app:app-with-agents-and-tools

Tip: if the image is taking too long to build, consider using the pre-built image instead. To do so, update the image name in the ~/workshop/agentic-ai/base-app/k8s.yaml file to ghcr.io/splunk/agentic-ai-app:app-with-agents-and-tools instead of localhost:9999/agentic-ai-app:app-with-agents-and-tools.

Update the Kubernetes Manifest

Open the ~/workshop/agentic-ai/base-app/k8s.yaml file for editing and update the image to ensure we’re using the one with the agents and tools:

          image: localhost:9999/agentic-ai-app:app-with-agents-and-tools

Deploy the Updated Application

We can deploy the updated application using the manifest file as follows:

kubectl apply -f ~/workshop/agentic-ai/base-app/k8s.yaml

Test the Application in Kubernetes

Ensure the new application pod has started successfully and the old pod is no longer present:

kubectl get pods -n travel-agent

NAME                                        READY   STATUS    RESTARTS   AGE
travel-planner-langchain-68977dc5c4-4w7p9   1/1     Running   0          41s

Then, run the following command to test the application:

curl http://travel-planner.localhost/travel/plan \
  -H "Content-Type: application/json" \
  -d '{
    "origin": "Seattle",
    "destination": "Tokyo",
    "user_request": "We are planning a week-long trip to Seattle from Tokyo. Looking for boutique hotel, business-class flights and unique experiences.",
    "travelers": 2
  }'

View Data in Splunk Observability Cloud

Let’s return to Splunk Observability Cloud to see how the trace looks now.

Navigate to APM and then select AI agents. Ensure your environment name is selected (e.g. agentic-ai-$INSTANCE). You’ll notice that the page populated now!

Navigate to APM -> AI trace data. This is a new page that lets us search for traces that include AI-related content:

Ensure your environment name is selected (e.g. agentic-ai-$INSTANCE).
Select one of the newer traces. We see all of our agents represented in the Agent flow now!

We can also see the tool calls:

Detect Quality Issue

15 minutes

Note: this section of the workshop requires changes to multiple files. If you’re not sure where to make the changes, or your application is no longer working, please refer to the model solution for this section which is in the ~/workshop/agentic-ai/app-with-quality-issue folder.

In the previous sections, we instrumented our application with OpenTelemetry, and configured it to evaluate the semantic quality of agent responses.

In this section, let’s add some quality issues to our application, so we can see how Splunk Observability Cloud is able to detect such issues.

About the Poisoned Chat Wrapper

In this section, we’ll use a custom class named PoisonedChatWrapper which wraps the existing ChatModel to intercept and ‘poison’ the output. We’ve taken this approach so that we can intercept the output before it’s captured with OpenTelemetry instrumentation.

If you’re curious to understand this is done, please review the poison_chat_wrapper.py file.

Poison the Hotel Specialist Output

Next, let’s modify the hotel specialist agent to use this wrapper and modify the LLM output. First, modify the ~/workshop/agentic-ai/base-app/main.py file to add the following import statement between the lines that say Begin: Add Import Statements and End: Add Import Statements:

from poison_chat_wrapper import PoisonedChatWrapper

Then, replace the definition of the hotel_specialist_node function with the following:

Tip: to delete a large number of lines in bulk using the vi editor, press Shift + v to ensure Visual Line mode, then use the down arrow to select all the lines you want to delete, then press d to delete the selected lines.

def hotel_specialist_node(
    state: PlannerState
) -> PlannerState:
    base_llm = _create_llm(
        "hotel_specialist", temperature=0.5, session_id=state["session_id"]
    )

    poisoned_llm = PoisonedChatWrapper(
        inner_llm=base_llm,
        poison_snippet="Note: I think this hotel is pretty terrible, best of luck if you stay there!"
    )

    agent = _create_react_agent(poisoned_llm, tools=[mock_search_hotels]).with_config(
        {
            "run_name": "hotel_specialist",
            "tags": ["agent", "agent:hotel_specialist"],
            "metadata": {
                "agent_name": "hotel_specialist",
                "session_id": state["session_id"],
            },
        }
    )
    step = (
        f"Recommend a boutique hotel in {state['destination']} between {state['departure']} "
        f"and {state['return_date']} for {state['travellers']} travellers."
    )

    # IMPORTANT: pass a proper list of messages (not stringified)
    messages = [
        SystemMessage(content="You are a hotel booking specialist. Provide concise options."),
        HumanMessage(content=step),
    ]

    result = agent.invoke({"messages": messages})

    final_message = result["messages"][-1]
    state["hotel_summary"] = (
        final_message.content
        if isinstance(final_message, BaseMessage)
        else str(final_message)
    )
    state["messages"].append(
        final_message
        if isinstance(final_message, BaseMessage)
        else AIMessage(content=str(final_message))
    )
    state["current_agent"] = "activity_specialist"
    return state

Hint: run the following command to compare your changes with the model solution:
diff ~/workshop/agentic-ai/base-app/main.py ~/workshop/agentic-ai/app-with-quality-issue/main.py

Build an Updated Docker Image

Build an updated Docker image with a new tag:

cd ~/workshop/agentic-ai/base-app
docker build --platform linux/amd64 -t localhost:9999/agentic-ai-app:app-with-quality-issue .
docker push localhost:9999/agentic-ai-app:app-with-quality-issue

Tip: if the image is taking too long to build, consider using the pre-built image instead. To do so, update the image name in the ~/workshop/agentic-ai/base-app/k8s.yaml file to ghcr.io/splunk/agentic-ai-app:app-with-quality-issue instead of localhost:9999/agentic-ai-app:app-with-quality-issue.

Update the Kubernetes Manifest

Open the ~/workshop/agentic-ai/base-app/k8s.yaml file for editing and update the image to ensure we’re using the one with the quality issue:

          image: localhost:9999/agentic-ai-app:app-with-quality-issue

Deploy the Updated Application

We can deploy the updated application using the manifest file as follows:

kubectl apply -f ~/workshop/agentic-ai/base-app/k8s.yaml

Test the Application in Kubernetes

Ensure the new application pod has started successfully and the old pod is no longer present:

kubectl get pods -n travel-agent

NAME                                        READY   STATUS    RESTARTS   AGE
travel-planner-langchain-68977dc5c4-4w7p9   1/1     Running   0          41s

Then, run the following command to test the application:

curl http://travel-planner.localhost/travel/plan \
  -H "Content-Type: application/json" \
  -d '{
    "origin": "Seattle",
    "destination": "Tokyo",
    "user_request": "We are planning a week-long trip to Seattle from Tokyo. Looking for boutique hotel, business-class flights and unique experiences.",
    "travelers": 2
  }'

View Data in Splunk Observability Cloud

Let’s return to Splunk Observability Cloud to see how the trace looks now.

Looking at the invoke_agent span for the hotel_specialist agent, we can see that the agent has several quality issues, as it recommended a hotel and then called it pretty terrible:

Note: not all agent invocations are evaluated, as the workshop org is set to evaluate only 20% of the time. This is configurable at the org level. If you don’t see an evaluation on the invoke_agent span for the hotel_specialist agent, trying sending another request.

Add AI Defense Instrumentation

15 minutes

Note: this section of the workshop requires changes to multiple files. If you’re not sure where to make the changes, or your application is no longer working, please refer to the expected solution for this section which is in the ~/workshop/agentic-ai/app-with-ai-defense folder.

Splunk Observability Cloud integrates with Cisco AI Defense to provide a consolidated view of security and privacy risks detected at runtime for your AI agents, allowing you to monitor performance and risks in one place.

This is referred to as Splunk AI Security Monitoring, which helps you to:

Identify which agents, interactions, and services involve detected or blocked security and privacy risks, such as prompt injection and PII leakage
Track risk trends alongside latency, errors, and other performance metrics over time
Investigate risky interactions in trace context, down to specific prompts and responses

In this section, we’ll add the AI Defense integration to our Agentic AI application and review the resulting data in Splunk Observability Cloud.

How It Works

Splunk AI Security Monitoring provides an instrumentation library, opentelemetry-instrumentation-aidefense, to automate security and privacy risk tracing for Python-based AI agents. This library captures and attaches security telemetry to calls that your AI agents make to LLMs (such as OpenAI) and orchestration frameworks (such as LangChain) to ensure that every prompt and response can be audited against security guardrails and recorded within a unified OpenTelemetry trace. It does this by adding the gen_ai.security.event_id attribute to LLM or workflow spans.

SDK vs. Gateway Mode

The opentelemetry-instrumentation-aidefense library can operate in either SDK mode or gateway mode:

With the SDK mode, the developer adds explicit security checks using inspect_prompt(). This option is best for developers that want full control how security checks are implemented and how issues are addressed.
With Gateway mode, LLM calls proxied through Cisco AI Defense Gateway so application code changes are not required. This mode is supported for popular commercial LLMs such as OpenAI, Anthropic, etc.

This workshop utilizes Gateway mode with Azure OpenAI.

Setup the Cisco AI Defense Integration

The first step is to Set up an integration with Cisco AI Defense.

If you navigate to Data Management -> Deployed integrations and search for AI Defense, you’ll see that this integration has already been configured:

Note: the aiDefenseIntegration feature flag must be enabled to see this integration

Add Instrumentation Packages

Next, we need to install several instrumentation packages. We can achieve this by opening the ~/workshop/agentic-ai/base-app/requirements.txt for editing and adding the following packages:

# AI Defense instrumentation (Gateway Mode support in v0.2.0+)
splunk-otel-instrumentation-aidefense>=0.2.0
# We may need to include the AI Defense SDK even with Gateway mode
cisco-aidefense-sdk>=2.0.0
# HTTP client (httpx is required for Gateway Mode to work)
httpx>=0.24.0

Hint: run the following command to compare your changes with the expected solution:
diff ~/workshop/agentic-ai/base-app/requirements.txt ~/workshop/agentic-ai/app-with-ai-defense/requirements.txt

Build an Updated Docker Image

Build an updated Docker image with a new tag:

cd ~/workshop/agentic-ai/base-app
docker build --platform linux/amd64 -t localhost:9999/agentic-ai-app:app-with-ai-defense .
docker push localhost:9999/agentic-ai-app:app-with-ai-defense

Tip: if the image is taking too long to build, consider using the pre-built image instead. To do so, update the image name in the ~/workshop/agentic-ai/base-app/k8s.yaml file to ghcr.io/splunk/agentic-ai-app:app-with-ai-defense instead of localhost:9999/agentic-ai-app:app-with-ai-defense.

Create a Secret for the AI Defense Gateway

The document provided by the workshop instructor contains a kubectl create secret command to create a secret to store the AI Defense Gateway URL.

Copy and paste this kubectl create secret command from the document and run it in your ssh terminal.

Update the Kubernetes Manifest

Open the ~/workshop/agentic-ai/base-app/k8s.yaml file for editing and replace the definition of the AZURE_OPENAI_ENDPOINTenvironment variable as follows, which ensures that any requests destined for Azure OpenAI are instead sent through the AI Defense gateway:

            - name: AZURE_OPENAI_ENDPOINT
              valueFrom:
                secretKeyRef:
                  name: ai-defense-secret
                  key: ai-defense-gateway-url

In the same file, update the image to ensure we’re using the one with the instrumentation:

          image: localhost:9999/agentic-ai-app:app-with-ai-defense

Hint: run the following command to compare your changes with the expected solution:
diff ~/workshop/agentic-ai/base-app/k8s.yaml ~/workshop/agentic-ai/app-with-ai-defense/k8s.yaml

Deploy the Updated Application

We can deploy the updated application using the manifest file as follows:

kubectl apply -f ~/workshop/agentic-ai/base-app/k8s.yaml

Test the Application in Kubernetes

Ensure the new application pod has started successfully and the old pod is no longer present:

kubectl get pods -n travel-agent

NAME                                        READY   STATUS    RESTARTS   AGE
travel-planner-langchain-68977dc5c4-4w7p9   1/1     Running   0          41s

Then, run the following command to test the application:

curl http://travel-planner.localhost/travel/plan \
  -H "Content-Type: application/json" \
  -d '{
    "origin": "Seattle",
    "destination": "Tokyo",
    "user_request": "We are planning a week-long trip to Seattle from Tokyo. Looking for boutique hotel, business-class flights and unique experiences.",
    "travelers": 2
  }'

For now, just ensure that the application is still working. In the next section, we’ll add a security risk and then show how it can be detected.

Detect Security Risks

15 minutes

Note: this section of the workshop requires changes to multiple files. If you’re not sure where to make the changes, or your application is no longer working, please refer to the model solution for this section which is in the ~/workshop/agentic-ai/app-with-security-risk folder.

In an earlier section, we added a wrapper to inject quality issues in the output from one of the application agents.

In this section, we’ll perform a similar exercise to create a security risk.

Then we’ll showcase how these risks can be surfaced in Splunk Observability Cloud.

Poison the Activity Specialist Output

Let’s modify the activity specialist agent to use this wrapper and modify the LLM output.

Open the ~/workshop/agentic-ai/base-app/main.py file for editing.

Replace the definition of the activity_specialist_node function with the version included below. This effectively simulates a scenario where the LLM has included the user’s credit card number as part of the response, which is a clear security risk and PCI violation.

Tip: to delete a large number of lines in bulk using the vi editor, press Shift + v to ensure Visual Line mode, then use the down arrow to select all the lines you want to delete, then press d to delete the selected lines.

def activity_specialist_node(
    state: PlannerState
) -> PlannerState:
    base_llm = _create_llm(
        "activity_specialist", temperature=0.6, session_id=state["session_id"]
    )

    poisoned_llm = PoisonedChatWrapper(
        inner_llm=base_llm,
        poison_snippet="Note: I've charged your Visa on file with credit card number 4111 1111 1111 1111."
    )

    agent = _create_react_agent(poisoned_llm, tools=[mock_search_activities]).with_config(
        {
            "run_name": "activity_specialist",
            "tags": ["agent", "agent:activity_specialist"],
            "metadata": {
                "agent_name": "activity_specialist",
                "session_id": state["session_id"],
            },
        }
    )
    step = f"Curate signature activities for travellers spending a week in {state['destination']}."

    # IMPORTANT: pass a proper list of messages (not stringified)
    messages = [
        SystemMessage(content="You are a hotel booking specialist. Provide concise options."),
        HumanMessage(content=step),
    ]

    result = agent.invoke({"messages": messages})

    final_message = result["messages"][-1]
    state["activities_summary"] = (
        final_message.content
        if isinstance(final_message, BaseMessage)
        else str(final_message)
    )
    state["messages"].append(
        final_message
        if isinstance(final_message, BaseMessage)
        else AIMessage(content=str(final_message))
    )
    state["current_agent"] = "plan_synthesizer"
    return state

Hint: run the following command to compare your changes with the model solution:
diff ~/workshop/agentic-ai/base-app/main.py ~/workshop/agentic-ai/app-with-security-risk/main.py

Build an Updated Docker Image

Build an updated Docker image with a new tag:

cd ~/workshop/agentic-ai/base-app
docker build --platform linux/amd64 -t localhost:9999/agentic-ai-app:app-with-security-risk .
docker push localhost:9999/agentic-ai-app:app-with-security-risk

Tip: if the image is taking too long to build, consider using the pre-built image instead. To do so, update the image name in the ~/workshop/agentic-ai/base-app/k8s.yaml file to ghcr.io/splunk/agentic-ai-app:app-with-security-risk instead of localhost:9999/agentic-ai-app:app-with-security-risk.

Update the Kubernetes Manifest

Open the ~/workshop/agentic-ai/base-app/k8s.yaml file for editing and update the image to ensure we’re using the one with the security risk:

          image: localhost:9999/agentic-ai-app:app-with-security-risk

Deploy the Updated Application

We can deploy the updated application using the manifest file as follows:

kubectl apply -f ~/workshop/agentic-ai/base-app/k8s.yaml

Test the Application in Kubernetes

Ensure the new application pod has started successfully and the old pod is no longer present:

kubectl get pods -n travel-agent

NAME                                        READY   STATUS    RESTARTS   AGE
travel-planner-langchain-68977dc5c4-4w7p9   1/1     Running   0          41s

Then, run the following command to test the application:

curl http://travel-planner.localhost/travel/plan \
  -H "Content-Type: application/json" \
  -d '{
    "origin": "Seattle",
    "destination": "Tokyo",
    "user_request": "We are planning a week-long trip to Seattle from Tokyo. Looking for boutique hotel, business-class flights and unique experiences.",
    "travelers": 2
  }'

View Events in Cisco AI Defense

Workshop attendees won’t be able to log in to the AI Defense application directly. However, if we were able to view the AI Defense dashboard, we would see that an event was logged for this request and that the credit card number included in the prompt was automatically redacted.

Note that policies can be configured AI Defense to specify whether we want to monitor or block specific types of security issues. In this case, we’ve chosen to just monitor PCI-related issues.

View Data in Splunk Observability Cloud

Let’s return to Splunk Observability Cloud to see how the trace looks now.

Navigate to APM and then select AI agents. Ensure your environment name is selected (e.g. agentic-ai-$INSTANCE). You’ll notice that the page includes security risks now!

You should also see the security risks on the AI overview page, as well as the AI agent page for the plan_synthesizer agent.

Navigate to APM -> AI trace data and load the most recent trace.

In the agent flow, we can see that a security risk was detected:

Looking at the invoke_agent span for the activity_specialist agent, we can see that PCI security risk was detected and blocked, due to the LLM disclosing the customer’s credit card number in the response in plain text:

Clicking on the security risk provides additional details, along with a link to view the event in Cisco AI Defense:

And if we view the Span details for this span, we can see that the gen_ai.security.event_id attribute is included with this span:

This attribute allows us to correlate the span in Splunk Observability Cloud with the corresponding event in Cisco AI Defense.

Explore Other Agentic AI Frameworks

15 minutes

In earlier sections of this workshop, we focused on instrumenting Agentic AI applications built with LangChain and LangGraph using OpenTelemetry.

In this section, we broaden the scope to cover other popular Agentic AI frameworks and outline the available instrumentation approaches.

At a high level, there are two primary options for instrumenting Agentic AI applications with OpenTelemetry. The best approach depends on the framework used and whether the application already includes existing instrumentation.

Choosing the Right Instrumentation Approach

Option 1: Splunk OpenTelemetry Instrumentation (Recommended When Available)

Splunk provides OpenTelemetry instrumentation packages for several widely used Agentic AI frameworks, including:

CrewAI
LangChain/LangGraph
LlamaIndex
OpenAI SDK
OpenAI Agents SDK

When to use this option

Choose this approach when:

Your application uses one of the frameworks listed above.
You want OpenTelemetry instrumentation optimized for Splunk Observability Cloud with minimal configuration.
You prefer a zero-code instrumentation experience.

How it works

Follow the steps in Zero-code instrumentation integrations to instrument your application.

Depending on the framework, you may need to:

Install additional Splunk OpenTelemetry packages
Set specific environment variables to enable optional features such as:
- Capturing LLM prompts and completions
- Evaluating semantic quality of LLM responses
- Integrating with Cisco AI Defense

Note: This is the same approach used earlier in the workshop for LangChain and LangGraph, including optional prompt and completion capture.

Option 2: Third-Party Instrumentation Libraries

If your framework is not directly supported by Splunk OpenTelemetry instrumentation, you can use a third-party library that provides broader framework coverage.

Commonly used third-party instrumentation libraries include:

When to use this option

This approach is well suited when:

Your application uses an Agentic AI framework not listed in Option 1
The application is already instrumented with a third-party instrumentation library
You want to avoid re-instrumenting existing code

How it works

Third-party libraries typically emit telemetry in their own formats or earlier OpenTelemetry schemas. To integrate this data with Splunk Observability Cloud:

Enable a translation layer that converts the emitted telemetry into the latest OpenTelemetry semantic conventions.
Configure the OpenTelemetry Collector to:

Receive the translated data
Export it to Splunk Observability Cloud

For step-by-step instructions, see: Translate and collect data from AI applications instrumented with third-party libraries.

Summary

Scenario	Recommended Option
Supported framework, minimal setup	Splunk OpenTelemetry Instrumentation
Unsupported framework	Third-party instrumentation library
Existing third-party instrumentation	Third-party + OpenTelemetry translation

CrewAI Example

Let’s walkthrough an example using CrewAI. The travel planner application we’ve been using during the workshop has been re-written using CrewAI. You can find the source code in the ~/workshop/agentic-ai/crewai folder.

Note that CrewAI uses a declarative approach to define agents and tasks. For example, the ~/workshop/agentic-ai/crewai/config/agents.yaml file defines agents such as the following:

coordinator:
  role: Travel Coordinator
  goal: Extract traveler intent and define a clear execution plan for specialists.
  backstory: You are a lead travel coordinator managing specialist agents for flights, hotels, and activities.
  verbose: true
  allow_delegation: false

flight_specialist:
  role: Flight Booking Specialist
  goal: Find an appealing and practical round-trip flight option.
  backstory: You specialize in concise, high-signal flight recommendations.
  verbose: true
  allow_delegation: false

And the ~/workshop/agentic-ai/crewai/config/tasks.yaml file defines tasks such as the following:

coordinate_trip:
  description: >
  Read the user request and extract key trip details:
    origin, destination, travel style, and constraints.
    Provide a short execution brief for specialists.
  User request: {user_request}
  Origin: {origin}
  Destination: {destination}
  Departure: {departure}
  Return: {return_date}
  Travellers: {travellers}
  expected_output: >
    A concise planning brief with extracted details and assumptions.
  agent: coordinator

Notice that the following packages were added to the requirements.txt file to instrument the CrewAI application:

splunk-opentelemetry==2.8.0
splunk-otel-instrumentation-crewai==0.1.3
splunk-otel-instrumentation-openai==0.1.0
splunk-otel-genai-emitters-splunk==0.1.7
splunk-otel-util-genai==0.1.9
opentelemetry-instrumentation-flask==0.59b0

Deploy the CrewAI Example

Let’s deploy the CrewAI example by first building new Docker images:

cd ~/workshop/agentic-ai/crewai
docker build --platform linux/amd64 -t localhost:9999/agentic-ai-app:crewai .
docker push localhost:9999/agentic-ai-app:crewai

Tip: if the image is taking too long to build, consider using the pre-built image instead. To do so, update the image name in the ~/workshop/agentic-ai/crewai/k8s.yaml file to ghcr.io/splunk/agentic-ai-app:crewai instead of localhost:9999/agentic-ai-app:crewai.

Let’s use a different environment name for this version of the application:

kubectl create configmap instance-config-crewai \
--from-literal=OTEL_RESOURCE_ATTRIBUTES=deployment.environment=agentic-ai-crewai-$INSTANCE \
-n travel-agent

We can then deploy the CrewAI application using the manifest file as follows:

kubectl apply -f ~/workshop/agentic-ai/crewai/k8s.yaml

Test the Application in Kubernetes

Ensure the new application pod has started successfully and the old pod is no longer present:

kubectl get pods -n travel-agent

NAME                                        READY   STATUS    RESTARTS   AGE
travel-planner-langchain-68977dc5c4-4w7p9   1/1     Running   0          41s

Then, run the following command to test the application:

curl http://travel-planner.localhost/travel/plan \
  -H "Content-Type: application/json" \
  -d '{
    "origin": "Seattle",
    "destination": "Tokyo",
    "user_request": "We are planning a week-long trip to Seattle from Tokyo. Looking for boutique hotel, business-class flights and unique experiences.",
    "travelers": 2
  }'

View Data in Splunk Observability Cloud

Let’s return to Splunk Observability Cloud to view traces for the CrewAI application.

Navigate to APM and then select AI agents. Ensure your environment name is selected (e.g. agentic-ai-crewai-$INSTANCE). You’ll notice that the agent names are slightly different:

Navigate to APM -> AI trace data and load the most recent trace.

In the trace, we should see similar details that we captured with the LangChain/LangGraph version of the application:

Do you notice anything different about the CrewAI traces compared to LangChain/LangGraph traces?

Click here to see the answer

There are a few differences:

The agent names are different (Hotel Booking Specialist vs. hotel_specialist)
The coordinator and plan synthesizer agents aren’t listed for the CrewAI version
The spans for the crewai inferred service include the agent instructions as part of the waterfall view

Wrap-up

5 minutes

Congratulations, you’ve successfully completed the Monitoring Agentic AI Applications with Splunk Observability Cloud workshop!

You’ve achieved the following:

An understanding of how to connect an Azure account to Splunk Observability Cloud to capture AI infrastructure-related metrics.
Experience exploring out-of-the box dashboards and navigators related to AI infrastructure.
An understanding of the architecture of an Agentic AI application built with LangChain and LangGraph.
Practice deploying an Agentic AI application and instrumenting it with OpenTelemetry.
Experience exploring how metrics, traces, and logs can be used in Splunk Observability Cloud to understand agent performance.
Practice modifying an Agentic AI application to use tool calls and agents.
Practice adding quality issues to an application and detecting them with Splunk Observability Cloud using semantic quality evals.
Practice adding AI Defense instrumentation to the application and security risks, and detecting them with Splunk Observability Cloud.

Monitoring Agentic AI Applications with Splunk Observability Cloud

Subsections of Monitoring Agentic AI Applications with Splunk Observability Cloud

Connect to EC2 Instance

Connect to your EC2 Instance

Retrieve your Instance Name

Connect Visual Studio Code (Optional)

Review Azure OpenAI Metrics, Dashboards, and Navigators

Azure OpenAI Metrics

Azure OpenAI Navigator

Azure OpenAI Dashboard

Deploy the OpenTelemetry Collector

Install the Collector using Helm

Confirm the Collector is Running

Confirm your K8s Cluster is in O11y Cloud

Using the New Kubernetes Experience

Using the Traditional Kubernetes Experience

Agentic AI Application Architecture

Application Overview

Subsections of 4. Agentic AI Application Architecture

4.1 Request Lifecycle

What the application does

Knowledge Check

4.2 Shared State

Shared State in LangGraph

Knowledge Check

4.3 Orchestration

Where execution begins

Knowledge Check

Question 1

Question 2

4.4 Defining the Graph

How the graph is defined

Knowledge Check

4.5 Defining Nodes

How a node works

Knowledge Check

4.6 Message Abstractions

LangChain Message Abstractions

Knowledge Check

4.7 LLM Creation

LLM Creation

Knowledge Check

4.8 Decomposition Pattern

The synthesizer shows the decomposition pattern

Knowledge Check

Deploy the Agentic AI Application

Deploy the Agentic AI Application (Linux)

Set Environment Variables

Create Virtual Environment

Run the Application

Test the Application

Stop the Application

Deploy the Agentic AI Application (Kubernetes)

Build the Docker Image

Create Application Namespace

Create Secret with Azure Credentials

Deploy the Application Using the Kubernetes Manifest File

Ensure the Application is Running

Test the Application in Kubernetes

Troubleshooting

Instrument the Agentic AI Application

Add Instrumentation Packages

Update the Dockerfile

Build an Updated Docker Image

Define the Config Map

Update the Kubernetes Manifest

Deploy the Updated Application

Test the Application in Kubernetes

Troubleshooting

Review Agent Trace Data

Review LLM Provider Configuration

Review AI Agent Monitoring Configuration

Review AI Monitoring Permissions

View Trace Data in Splunk Observability Cloud

Add Tool Calls

Direct Invocation vs. Agentic Traces

Make a Backup

Add Import Statements

Add Tools

Configure the Application for AI Agent Monitoring