This section includes the steps that workshop attendees will follow:
Practice deploying the OpenTelemetry Collector in the Red Hat OpenShift cluster.
Practice adding Prometheus receivers to the collector to ingest infrastructure metrics.
Practice monitoring the Weaviate vector database in the cluster.
Practice gathering the Pure Storage metrics using Prometheus.
Practice instrumenting Python services that interact with Large Language Models (LLMs) with OpenTelemetry.
Understand which details OpenTelemetry captures in traces from applications that interact with LLMs.
Overview of the Workshop Environment
5 minutes
Cisco’s AI-ready PODs combine cutting-edge hardware and software to
deliver a robust, scalable, and efficient AI infrastructure.
Splunk Observability Cloud provides comprehensive visibility
into this entire stack: from infrastructure to application components.
This hands-on workshop teaches you how to monitor AI infrastructure
using OpenTelemetry and Prometheus, without requiring
access to an actual Cisco AI POD. You’ll gain practical experience
deploying and configuring monitoring technologies in a realistic environment.
Lab Environment
The workshop uses a shared OpenShift Cluster running in AWS, equipped
with NVIDIA GPUs and NVIDIA AI Enterprise software.
Pre-Deployed Infrastructure
The workshop instructor has deployed the following shared components to the
workshop environment:
NVIDIA NIM models:
meta/llama-3.2-1b-instruct - Processes user prompts
nvidia/llama-3.2-nv-embedqa-1b-v2 - Calculates vector embeddings for semantic search
Weaviate - A vector database for semantic search and retrieval
Prometheus exporter - Simulates Pure Storage metrics typical of production AI PODs
Your Workspace
Each participant receives a dedicated namespace within the shared cluster,
ensuring isolated environments for independent work.
Workshop Activities
During the workshop, each participant will execute the following tasks:
Deploy and configure an OpenTelemetry collector in your namespace
Integrate observability data collection with the cluster infrastructure
Deploy a Python application that leverages the NVIDIA NIM models
Monitor application performance and infrastructure metrics using Splunk Observability Cloud
What is Prometheus?
While Prometheus typically refers to a full monitoring system used for metrics
collection, storage, and alerting, this workshop focuses on the Prometheus
ecosystem’s data standards.
We will be leveraging Prometheus Exporters, which are small utilities
that translate a component’s internal health into a standardized
metrics endpoint (e.g., http://localhost:9100/metrics).
Instead of using a full Prometheus server to collect this data, we will use
the OpenTelemetry Collector. By using its Prometheus receiver,
the collector can scrape these endpoints, allowing us to gather
rich telemetry data using a widely-supported industry format.
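For example, a minimal static Prometheus receiver in a collector configuration might look like the following sketch (the job name and target are illustrative; later in the workshop we’ll use the receiver_creator to generate these scrape targets dynamically):

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'node-exporter'
          scrape_interval: 60s
          static_configs:
            - targets: ['localhost:9100']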
Connect to the OpenShift Cluster
5 minutes
Connect to your EC2 Instance
We’ve prepared an Ubuntu Linux instance in AWS/EC2 for each attendee.
Using the IP address and password provided by your instructor, connect to your EC2 instance
using one of the methods below:
Mac OS / Linux
ssh splunk@<IP address>
Windows 10+
Use the OpenSSH client
Earlier versions of Windows
Use Putty
Set the Workshop Participant Number
The instructor will provide each participant with a number from 1 to 30.
Store this in an environment variable, and remember what it is, as
it will be used throughout the workshop:
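For example, if your participant number is 1 (the variable name below is an assumption; use the exact name your instructor provides):

export PARTICIPANT_NUMBER=1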
In this section we’ll deploy the OpenTelemetry Collector in our OpenShift namespace,
which gathers metrics, logs, and traces from the infrastructure and applications
running in the cluster, and sends the resulting data to Splunk Observability Cloud.
Deploy the OpenTelemetry Collector
Ensure Helm is installed
Run the following command to confirm that Helm is installed:
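The exact command isn’t reproduced here; a simple check is:

helm version

If Helm is installed, this prints the client version; otherwise you’ll see a command not found error.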
Note: if you get an error that says Missing variables, you’ll need to
define your environment variables again. Add your participant number
before running the following commands:
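The deployment commands aren’t reproduced here; a typical sequence using the standard splunk-otel-collector Helm chart looks like the following sketch (the workshop’s actual commands also substitute values such as your access token, realm, cluster name, and participant number):

helm repo add splunk-otel-collector-chart https://signalfx.github.io/splunk-otel-collector-chart
helm repo update
helm install splunk-otel-collector -f otel-collector-values.yaml splunk-otel-collector-chart/splunk-otel-collector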
Run the following command to confirm that the collector pods are running:
watch -n 1 oc get pods
NAME READY STATUS RESTARTS AGE
splunk-otel-collector-agent-58rwm 1/1 Running 0 6m40s
splunk-otel-collector-agent-8dndr 1/1 Running 0 6m40s
Note: in OpenShift environments, the collector takes about three minutes to
start and transition to the Running state.
Review Collector Data in Splunk Observability Cloud
Confirm that you can see your cluster in Splunk Observability Cloud by navigating to
Infrastructure Monitoring -> Kubernetes -> Kubernetes Clusters and then
adding a filter on k8s.cluster.name with your cluster name (e.g., ai-pod-workshop-participant-1):
Monitor NVIDIA Components
10 minutes
In this section, we’ll use the Prometheus receiver with the OpenTelemetry collector
to monitor the NVIDIA components running in the OpenShift cluster. We’ll start by
navigating to the directory where the collector configuration file is stored:
cd otel-collector
Capture the NVIDIA DCGM Exporter metrics
The NVIDIA DCGM exporter is running
in our OpenShift cluster. It exposes GPU metrics that we can send to Splunk.
To do this, let’s customize the configuration of the collector by editing the
otel-collector-values.yaml file that we used earlier when deploying the collector.
Add the following content, just below the kubeletstats receiver:
receiver_creator/nvidia:
  # Name of the extensions to watch for endpoints to start and stop.
  watch_observers: [k8s_observer]
  receivers:
    prometheus/dcgm:
      config:
        config:
          scrape_configs:
            - job_name: gpu-metrics
              scrape_interval: 60s
              static_configs:
                - targets:
                    - '`endpoint`:9400'
      rule: type == "pod" && labels["app"] == "nvidia-dcgm-exporter"
This tells the collector to look for pods with the label app=nvidia-dcgm-exporter.
When it finds a pod with this label, it connects to port 9400 of the pod and scrapes
the default metrics endpoint (/metrics).
Why are we using the receiver_creator receiver instead of just the Prometheus receiver?
The Prometheus receiver uses a static configuration that scrapes metrics from predefined endpoints.
The receiver_creator receiver enables dynamic creation of receivers (including Prometheus receivers) based on runtime information, allowing for scalable and flexible scraping setups.
Using receiver_creator can simplify configurations in dynamic environments by automating the management of multiple Prometheus scraping targets.
To ensure this new receiver is used, we’ll need to add a new pipeline to the
otel-collector-values.yaml file as well.
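The exact pipeline definition is provided in the workshop repository; as a sketch, it adds a metrics pipeline that uses the new receiver (the processor and exporter names below follow the chart’s defaults and may differ slightly):

service:
  pipelines:
    metrics/nvidia:
      exporters:
        - signalfx
      processors:
        - memory_limiter
        - batch
        - resourcedetection
        - resource
      receivers:
        - receiver_creator/nvidia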
We’ll add one more Prometheus receiver related to NVIDIA in the next section.
Capture the NVIDIA NIM metrics
The meta-llama-3-2-1b-instruct large language model was deployed to the
OpenShift cluster using NVIDIA NIM. It includes a Prometheus endpoint
that we can scrape with the collector. Let’s add the following to the
otel-collector-values.yaml file, just below the prometheus/dcgm receiver
we added earlier:
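The snippet itself is in the workshop repository; based on the description that follows, it looks roughly like this, nested under the receiver_creator/nvidia receivers (the receiver name prometheus/nim is an illustrative choice):

    prometheus/nim:
      config:
        config:
          scrape_configs:
            - job_name: nim-metrics
              scrape_interval: 60s
              metrics_path: /v1/metrics
              static_configs:
                - targets:
                    - '`endpoint`:8000'
      rule: type == "pod" && labels["app"] == "meta-llama-3-2-1b-instruct"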
This tells the collector to look for pods with the label app=meta-llama-3-2-1b-instruct.
When it finds a pod with this label, it connects to port 8000 of the pod and scrapes
the /v1/metrics endpoint.
There’s no need to make changes to the pipeline, as this receiver will already be picked up
as part of the receiver_creator/nvidia receiver.
Add a Filter Processor
Scraping Prometheus endpoints can result in a large number of metrics, sometimes
with high cardinality.
Let’s add a filter processor that defines exactly what metrics we want to send to Splunk.
Specifically, we’ll send only the metrics that are utilized by a dashboard chart or an
alert detector.
Add the following code to the otel-collector-values.yaml file, after the exporters section
but before the receivers section:
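The complete metric list is in the workshop repository; as a rough sketch (the DCGM metric names shown are examples of GPU metrics used by the AI POD dashboards, not the full list):

processors:
  filter/metrics_to_be_included:
    metrics:
      # Include only metrics used in charts and detectors
      include:
        match_type: strict
        metric_names:
          - DCGM_FI_DEV_FB_FREE
          - DCGM_FI_DEV_FB_USED
          - DCGM_FI_DEV_GPU_UTIL
          - ...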
Take a moment to compare the contents of your modified otel-collector-values.yaml
file with the otel-collector-values-with-nvidia.yaml file. Remember that indentation
is important for yaml files, and needs to be precise:
Update your file if needed to ensure the contents match.
Don’t restart the collector yet
Because restarting the collector in an OpenShift environment takes 3 minutes per node,
we’ll wait until we’ve completed all configuration changes before initiating a restart.
Monitor the Vector Database
5 minutes
In this step, we’ll configure the Prometheus receiver to monitor the Weaviate vector database.
What is a Vector Database?
A vector database stores and indexes data as numerical “vector embeddings,” which capture
the semantic meaning of information like text or images. Unlike traditional databases,
they excel at similarity searches, finding conceptually related data points rather
than exact matches.
How is a Vector Database Used?
Vector databases play a key role in a pattern called
Retrieval Augmented Generation (RAG), which is widely used by
applications that leverage Large Language Models (LLMs).
The pattern is as follows:
The end-user asks a question to the application
The application takes the question and calculates a vector embedding for it
The app then performs a similarity search, looking for related documents in the vector database
The app then takes the original question and the related documents, and sends it to the LLM as context
The LLM reviews the context and returns a response to the application
Capture Weaviate Metrics with Prometheus
Let’s modify the OpenTelemetry collector configuration to scrape Weaviate’s Prometheus
metrics.
To do so, let’s add an additional Prometheus receiver creator section
to the otel-collector-values.yaml file. Add it after the receiver_creator/nvidia
section but before the pipelines section:
receiver_creator/weaviate:
  # Name of the extensions to watch for endpoints to start and stop.
  watch_observers: [k8s_observer]
  receivers:
    prometheus/weaviate:
      config:
        config:
          scrape_configs:
            - job_name: weaviate-metrics
              scrape_interval: 60s
              static_configs:
                - targets:
                    - '`endpoint`:2112'
      rule: type == "pod" && labels["app"] == "weaviate"
We’ll need to ensure that Weaviate’s metrics are added to the filter/metrics_to_be_included filter
processor configuration as well:
processors:
  filter/metrics_to_be_included:
    metrics:
      # Include only metrics used in charts and detectors
      include:
        match_type: strict
        metric_names:
          - DCGM_FI_DEV_FB_FREE
          - ...
          - object_count
          - vector_index_size
          - vector_index_operations
          - vector_index_tombstones
          - vector_index_tombstone_cleanup_threads
          - requests_total
          - objects_durations_ms_sum
          - objects_durations_ms_count
          - batch_delete_durations_ms_sum
          - batch_delete_durations_ms_count
Note: add just the new metrics starting with object_count
We also want to add a Resource processor to the configuration file with
the following configuration. Add it after the filter/metrics_to_be_included processor
but before the receivers section:
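A sketch of that processor, based on the description below (the processor name resource/weaviate is an illustrative choice):

resource/weaviate:
  attributes:
    - key: weaviate.instance.id
      from_attribute: service.instance.id
      action: insert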
This processor takes the service.instance.id property on the Weaviate metrics
and copies it into a new property called weaviate.instance.id. This is done so
that we can more easily distinguish Weaviate metrics from other metrics that use
service.instance.id, which is a standard OpenTelemetry property used in
Splunk Observability Cloud.
We’ll need to add a new metrics pipeline for Weaviate metrics as well (we
need to use a separate pipeline since we don’t want the weaviate.instance.id
metric to be added to non-Weaviate metrics). Add the following to the bottom of the file:
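The exact pipeline isn’t reproduced here; as a sketch, it combines the Weaviate receiver with the resource and filter processors we just configured (processor and exporter names follow the chart’s defaults and may differ slightly):

service:
  pipelines:
    metrics/weaviate:
      exporters:
        - signalfx
      processors:
        - memory_limiter
        - batch
        - resourcedetection
        - resource
        - resource/weaviate
        - filter/metrics_to_be_included
      receivers:
        - receiver_creator/weaviate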
Take a moment to compare the
contents of your modified otel-collector-values.yaml file with the
otel-collector-values-with-weaviate.yaml file. Remember that indentation
is important for yaml files, and needs to be precise:
Update your file if needed to ensure the contents match.
Don’t restart the collector yet
Because restarting the collector in an OpenShift environment takes 3 minutes per node,
we’ll wait until we’ve completed all configuration changes before initiating a restart.
Monitor Storage
5 minutes
In this step, we’ll configure the Prometheus receiver to monitor the storage.
What storage do Cisco AI PODs utilize?
Cisco AI PODs have a number of different storage options, including Pure Storage,
VAST, and NetApp.
The workshop will focus on Pure Storage.
How do we capture Pure Storage metrics?
Cisco AI PODs that utilize Pure Storage also use a technology called Portworx,
which provides persistent storage for Kubernetes.
Portworx includes a metrics endpoint that we can scrape using the Prometheus receiver.
Capture Storage Metrics with Prometheus
Let’s modify the OpenTelemetry collector configuration to scrape Portworx metrics
with the Prometheus receiver.
To do so, let’s add an additional Prometheus receiver creator section
to the otel-collector-values.yaml file. Add it after the receiver_creator/weaviate
section but before the pipelines section:
receiver_creator/storage:
  # Name of the extensions to watch for endpoints to start and stop.
  watch_observers: [k8s_observer]
  receivers:
    prometheus/portworx:
      config:
        config:
          scrape_configs:
            - job_name: portworx-metrics
              static_configs:
                - targets:
                    - '`endpoint`:17001'
                    - '`endpoint`:17018'
      rule: type == "pod" && labels["app"] == "portworx-metrics-sim"
We’ll need to ensure that Portworx metrics are added to the filter/metrics_to_be_included filter
processor configuration as well:
processors:
  filter/metrics_to_be_included:
    metrics:
      # Include only metrics used in charts and detectors
      include:
        match_type: strict
        metric_names:
          - DCGM_FI_DEV_FB_FREE
          - ...
          - px_cluster_cpu_percent
          - px_cluster_disk_total_bytes
          - px_cluster_disk_utilized_bytes
          - px_cluster_status_nodes_offline
          - px_cluster_status_nodes_online
          - px_volume_read_latency_seconds
          - px_volume_reads_total
          - px_volume_readthroughput
          - px_volume_write_latency_seconds
          - px_volume_writes_total
          - px_volume_writethroughput
Note: add just the new metrics starting with px_cluster_cpu_percent
We’ll need to add a new metrics pipeline for Portworx metrics as well.
Add the following to the bottom of the file:
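A sketch of that pipeline (again, processor and exporter names follow the chart’s defaults and may differ slightly from the workshop file):

service:
  pipelines:
    metrics/portworx:
      exporters:
        - signalfx
      processors:
        - memory_limiter
        - batch
        - resourcedetection
        - resource
        - filter/metrics_to_be_included
      receivers:
        - receiver_creator/storage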
Take a moment to compare the
contents of your modified otel-collector-values.yaml file with the
otel-collector-values-with-portworx.yaml file. Remember that indentation
is important for yaml files, and needs to be precise:
Update your file if needed to ensure the contents match.
Don’t restart the collector yet
Because restarting the collector in an OpenShift environment takes 3 minutes per node,
we’ll wait until we’ve completed all configuration changes before initiating a restart.
Review AI POD Dashboards
10 minutes
In this section, we’ll review the AI POD dashboards in Splunk Observability Cloud
to confirm that the data from NVIDIA, Pure Storage, and Weaviate is captured
as expected.
Update the OpenTelemetry Collector Config
We can apply the collector configuration changes by running the following Helm command:
Note: if you get an error that says Missing variables, you’ll need to
define your environment variables again. Add your participant number
before running the following commands:
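The command itself is in the workshop repository; it is a helm upgrade against the same values file used earlier, roughly like this (release and chart names assume the standard splunk-otel-collector chart):

helm upgrade splunk-otel-collector -f otel-collector-values.yaml splunk-otel-collector-chart/splunk-otel-collector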
Navigate to Dashboards in Splunk Observability Cloud, then search for the
Cisco AI PODs Dashboard, which is included in the Built-in dashboard groups.
Ensure the dashboard is filtered on your OpenShift cluster name.
The charts should be populated as in the following example:
Review the Pure Storage Dashboard Tab
Navigate to the PURE STORAGE tab and ensure the dashboard is filtered
on your OpenShift cluster name. The charts should be populated as in the
following example:
Review the Weaviate Infrastructure Navigator
Since Weaviate isn’t included by default with an AI POD, it’s
not included on the out-of-the-box AI POD dashboard. Instead,
we can view Weaviate performance data using one of the infrastructure
navigators.
In Splunk Observability Cloud, navigate to Infrastructure -> AI Frameworks -> Weaviate.
Filter on the k8s.cluster.name of interest, and ensure the navigator is populated as in the
following example:
Review the LLM Application
15 minutes
In the final step of the workshop, we’ll deploy an application to our OpenShift cluster
that uses the instruct and embeddings models.
What is LangChain?
Like most applications that interact with LLMs, our application is written in Python.
It also uses LangChain, which is an open-source orchestration
framework that simplifies the development of applications powered by LLMs.
Application Overview
Connect to the LLMs
Our application starts by connecting to two LLMs that we’ll be using:
meta/llama-3.2-1b-instruct: used for responding to user prompts
nvidia/llama-3.2-nv-embedqa-1b-v2: used to calculate embeddings
# connect to a LLM NIM at the specified endpoint, specifying a specific model
llm = ChatNVIDIA(base_url=INSTRUCT_MODEL_URL, model="meta/llama-3.2-1b-instruct")

# Initialize and connect to a NeMo Retriever Text Embedding NIM (nvidia/llama-3.2-nv-embedqa-1b-v2)
embeddings_model = NVIDIAEmbeddings(model="nvidia/llama-3.2-nv-embedqa-1b-v2", base_url=EMBEDDINGS_MODEL_URL)
Why are there two models? Here’s a helpful analogy:
The Embedding model is the “Librarian” (it helps find the right books),
The Instruct model is the “Writer” (it reads the books and writes the answer).
Define the Prompt Template
The application then defines a prompt template that will be used in interactions
with the meta/llama-3.2-1b-instruct LLM:
prompt = ChatPromptTemplate.from_messages([
    ("system",
        "You are a helpful and friendly AI!"
        "Your responses should be concise and no longer than two sentences."
        "Do not hallucinate. Say you don't know if you don't have this information."
        "Answer the question using only the context"
        "\n\nQuestion: {question}\n\nContext: {context}"
    ),
    ("user", "{question}")
])
Note how we’re explicitly instructing the LLM to just say it doesn’t know the answer if
it doesn’t know, which helps minimize hallucinations. There’s also a placeholder for
us to provide context that the LLM can use to answer the question.
Connect to the Vector Database
The application then connects to the vector database that was pre-populated
with NVIDIA data sheet documents:
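The connection code isn’t reproduced here; a minimal sketch using the langchain-weaviate integration (the host names, ports, index name, and text key below are assumptions for illustration):

import weaviate
from langchain_weaviate.vectorstores import WeaviateVectorStore

# Connect to the Weaviate instance running in the cluster (host/ports are assumptions)
weaviate_client = weaviate.connect_to_custom(
    http_host="weaviate", http_port=8080, http_secure=False,
    grpc_host="weaviate-grpc", grpc_port=50051, grpc_secure=False,
)

# Wrap the pre-populated collection as a LangChain vector store (index_name/text_key are assumptions)
vector_store = WeaviateVectorStore(
    client=weaviate_client,
    index_name="DocumentChunks",
    text_key="text",
    embedding=embeddings_model,
)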
The application uses LCEL (LangChain Expression Language) to define the chain.
The | (pipe) symbol works like an assembly line; the output of one step becomes
the input for the next.
Step 1: The Input Map {…}: We are preparing the ingredients for our prompt.
context: We turn our vector store into a retriever. This acts like a search engine that finds the most relevant snippets from our NVIDIA data sheets based on the user’s question.
question: We use RunnablePassthrough() to ensure the user’s original question is passed directly into the prompt.
Note: These keys (context and question) map directly to the {context} and {question} placeholders we defined in our prompt template earlier.
Step 2: The prompt: This is the instruction manual. It takes the context and the question and formats them using the prompt template (e.g., “Answer the question using only the context…”).
Step 3: The llm: This is the “Engine” (in our case, the meta/llama-3.2-1b-instruct NIM). It reads the formatted prompt and generates a response.
Step 4: The StrOutputParser(): By default, AI models return complex objects. This “cleaner” ensures we get back a simple, readable string of text.
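Putting those steps together, the chain definition looks roughly like the following sketch (variable names follow the earlier snippets; the actual application code may differ slightly):

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Turn the vector store into a retriever that finds relevant document chunks
retriever = vector_store.as_retriever()

# Assemble the chain: retrieve context, fill the prompt, call the LLM, clean up the output
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)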
Invoke the Chain
Finally, the application invokes the chain by passing the end user’s question in
as input:
response=chain.invoke(question)
This is the “Start” button. You drop the end user’s question into the beginning of the pipeline,
and it flows through the retriever, the prompt, and the LLM until the answer comes
out the other side.
Instrument the LLM Application
10 minutes
Instrument the Application with OpenTelemetry
Instrumentation Packages
To capture metrics, traces, and logs from our application, we’ve instrumented it with OpenTelemetry.
This required adding the following package to the requirements.txt file (which ultimately gets
installed with pip install):
splunk-opentelemetry==2.8.0
We also added the following to the Dockerfile used to build the
container image for this application, to install additional OpenTelemetry
instrumentation packages:
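Those Dockerfile lines aren’t reproduced here; with the Splunk distribution of OpenTelemetry Python, the additional instrumentation packages are typically installed by running the bootstrap utility during the image build, for example:

RUN opentelemetry-bootstrap -a install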
Finally, to enhance the traces and metrics collected with OpenTelemetry from this
LangChain application, we added additional Splunk instrumentation packages:
To instrument the application with OpenTelemetry, we also included several
environment variables in the Kubernetes manifest file used to deploy the application:
env:
  - name: OTEL_SERVICE_NAME
    value: "llm-app"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://splunk-otel-collector-agent:4317"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  # filter out health check requests to the root URL
  - name: OTEL_PYTHON_EXCLUDED_URLS
    value: "^(https?://)?[^/]+(/)?$"
  - name: OTEL_PYTHON_DISABLED_INSTRUMENTATIONS
    value: "httpx,requests"
  - name: OTEL_INSTRUMENTATION_LANGCHAIN_CAPTURE_MESSAGE_CONTENT
    value: "true"
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: OTEL_PYTHON_LOG_CORRELATION
    value: "true"
  - name: OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
    value: "delta"
  - name: OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED
    value: "true"
  - name: OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT
    value: "true"
  - name: OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT_MODE
    value: "SPAN_AND_EVENT"
  - name: OTEL_INSTRUMENTATION_GENAI_EMITTERS
    value: "span_metric_event,splunk"
  - name: OTEL_INSTRUMENTATION_GENAI_EMITTERS_EVALUATION
    value: "replace-category:SplunkEvaluationResults"
  - name: SPLUNK_PROFILER_ENABLED
    value: "true"
Note that the OTEL_INSTRUMENTATION_LANGCHAIN_CAPTURE_MESSAGE_CONTENT and
OTEL_INSTRUMENTATION_GENAI_* environment variables are specific to the
LangChain instrumentation we’ve used.
Deploy the LLM Application
10 minutes
Deploy the LLM Application
Use the following command to deploy this application to the OpenShift cluster:
cd ~/workshop/cisco-ai-pods
oc apply -f ./llm-app/k8s-manifest.yaml
Note: to build a Docker image for this Python application, we executed the following commands:
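Those commands aren’t shown here; a typical sequence looks like the following (the registry, image name, and tag are placeholders, and the Dockerfile location is an assumption):

cd ~/workshop/cisco-ai-pods/llm-app
docker build --platform linux/amd64 -t <registry>/llm-app:1.0 .
docker push <registry>/llm-app:1.0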
Then run the following command to send a question to the LLM:
curl -X "POST"\
'http://llm-app:8080/askquestion'\
-H 'Accept: application/json'\
-H 'Content-Type: application/json'\
-d '{
"question": "How much memory does the NVIDIA H200 have?"
}'
The NVIDIA H200 has 141GB of HBM3e memory, which is twice the capacity of the NVIDIA H100 Tensor Core GPU with 1.4X more memory bandwidth.
Review Metrics, Traces, and Logs
10 minutes
View Trace Data in Splunk Observability Cloud
In Splunk Observability Cloud, navigate to APM and then select Service Map.
Ensure your environment name is selected (e.g. ai-pod-workshop-participant-1). You should see a service map that looks like the following:
Click on Traces on the right-hand side menu. Then select one of the slower running
traces. It should look like the following example:
The trace shows all the interactions that our application executed to return an answer
to the user’s question (i.e., “How much memory does the NVIDIA H200 have?”).
For example, we can see where our application performed a similarity search to look
for documents related to the question at hand in the Weaviate vector database.
We can also see how the application created a prompt to send to the LLM, including the
context that was retrieved from the vector database:
Note: if you don’t see the chat and invoke_workflow AI interactions
in the trace waterfall view, or you don’t see the AI details tab on the
right-hand side, ask your instructor about the superpowers which need to
be enabled.
Finally, we can see the response from the LLM, the time it took, and the number of
input and output tokens utilized:
Confirm Metrics are Sent to Splunk
Navigate to Dashboards in Splunk Observability Cloud, then search for the
Cisco AI PODs Dashboard, which is included in the Built-in dashboard groups.
Navigate to the NIM FOR LLMS tab and ensure the dashboard is filtered
on your OpenShift cluster name. The charts should be populated as in the
following example:
Wrap-Up
5 minutes
Wrap-Up
We hope you enjoyed this workshop, which provided hands-on experience deploying and working
with several of the technologies that are used to monitor Cisco AI PODs with
Splunk Observability Cloud. Specifically, you had the opportunity to:
Work with a Red Hat OpenShift cluster with GPU-based worker nodes.
Work with the NVIDIA NIM Operator and NVIDIA GPU Operator.
Work with Large Language Models (LLMs) deployed using NVIDIA NIM to the cluster.
Deploy the OpenTelemetry Collector in the Red Hat OpenShift cluster.
Add Prometheus receivers to the collector to ingest infrastructure metrics.
Monitor the Weaviate vector database in the cluster.
Configure monitoring for Pure Storage metrics using Prometheus.
Instrument Python services that interact with Large Language Models (LLMs) with OpenTelemetry.
Understand which details OpenTelemetry captures in traces from applications that interact with LLMs.