This section includes the steps that workshop attendees will follow:
Practice deploying the OpenTelemetry Collector in the Red Hat OpenShift cluster.
Practice adding Prometheus receivers to the collector to ingest infrastructure metrics.
Practice monitoring the Weaviate vector database in the cluster.
Practice gathering the Pure Storage metrics using Prometheus.
Practice instrumenting Python services that interact with Large Language Models (LLMs) with OpenTelemetry.
Understand which details OpenTelemetry captures in traces from applications that interact with LLMs.
Overview of the Workshop Environment
5 minutes
Cisco’s AI-ready PODs combine cutting-edge hardware and software to
deliver a robust, scalable, and efficient AI infrastructure.
Splunk Observability Cloud provides comprehensive visibility
into this entire stack: from infrastructure to application components.
This hands-on workshop teaches you how to monitor AI infrastructure
using OpenTelemetry and Prometheus, without requiring
access to an actual Cisco AI POD. You’ll gain practical experience
deploying and configuring monitoring technologies in a realistic environment.
Lab Environment
The workshop uses a shared OpenShift Cluster running in AWS, equipped
with NVIDIA GPUs and NVIDIA AI Enterprise software.
Pre-Deployed Infrastructure
The workshop instructor has deployed the following shared components to the
workshop environment:
NVIDIA NIM models:
meta/llama-3.2-1b-instruct - Processes user prompts
nvidia/llama-3.2-nv-embedqa-1b-v2 - Calculates vector embeddings for semantic search
Weaviate - A vector database for semantic search and retrieval
Prometheus exporter - Simulates Pure Storage metrics typical of production AI PODs
Your Workspace
Each participant receives a dedicated namespace within the shared cluster,
ensuring isolated environments for independent work.
Workshop Activities
During the workshop, each participant will execute the following tasks:
Deploy and configure an OpenTelemetry collector in your namespace
Integrate observability data collection with the cluster infrastructure
Deploy a Python application that leverages the NVIDIA NIM models
Monitor application performance and infrastructure metrics using Splunk Observability Cloud
What is Prometheus?
While Prometheus typically refers to a full monitoring system used for metrics
collection, storage, and alerting, this workshop focuses on the Prometheus
ecosystem’s data standards.
We will be leveraging Prometheus Exporters, which are small utilities
that translate a component’s internal health into a standardized
metrics endpoint (e.g., http://localhost:9100/metrics).
Instead of using a full Prometheus server to collect this data, we will use
the OpenTelemetry Collector. By using its Prometheus receiver,
the collector can scrape these endpoints, allowing us to gather
rich telemetry data using a widely-supported industry format.
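For example, a minimal static Prometheus receiver in a collector configuration might look like the following sketch (the job name and target are illustrative; later in the workshop we’ll use the receiver_creator to generate these scrape targets dynamically):

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'node-exporter'
          scrape_interval: 60s
          static_configs:
            - targets: ['localhost:9100']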
Connect to the OpenShift Cluster
5 minutes
Connect to your EC2 Instance
We’ve prepared an Ubuntu Linux instance in AWS/EC2 for each attendee.
Using the IP address and password provided by your instructor, connect to your EC2 instance
using one of the methods below:
Mac OS / Linux
ssh splunk@<IP address>
Windows 10+
Use the OpenSSH client
Earlier versions of Windows
Use Putty
Set the Workshop Participant Number
The instructor will provide each participant with a number from 1 to 30.
Store this in an environment variable, and remember what it is, as
it will be used throughout the workshop:
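For example, if your participant number is 1 (the variable name below is an assumption; use the exact name your instructor provides):

export PARTICIPANT_NUMBER=1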
In this section we’ll deploy the OpenTelemetry Collector in our OpenShift namespace,
which gathers metrics, logs, and traces from the infrastructure and applications
running in the cluster, and sends the resulting data to Splunk Observability Cloud.
Deploy the OpenTelemetry Collector
Ensure Helm is installed
Run the following command to confirm that Helm is installed:
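The exact command isn’t reproduced here; a simple check is:

helm version

If Helm is installed, this prints the client version; otherwise you’ll see a command not found error.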
Note: if you get an error that says Missing variables, you’ll need to
define your environment variables again. Add your participant number
before running the following commands:
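The deployment commands aren’t reproduced here; a typical sequence using the standard splunk-otel-collector Helm chart looks like the following sketch (the workshop’s actual commands also substitute values such as your access token, realm, cluster name, and participant number):

helm repo add splunk-otel-collector-chart https://signalfx.github.io/splunk-otel-collector-chart
helm repo update
helm install splunk-otel-collector -f otel-collector-values.yaml splunk-otel-collector-chart/splunk-otel-collector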
Run the following command to confirm that the collector pods are running:
watch -n 1 oc get pods
NAME READY STATUS RESTARTS AGE
splunk-otel-collector-agent-58rwm 1/1 Running 0 6m40s
splunk-otel-collector-agent-8dndr 1/1 Running 0 6m40s
Note: in OpenShift environments, the collector takes about three minutes to
start and transition to the Running state.
Review Collector Data in Splunk Observability Cloud
Confirm that you can see your cluster in Splunk Observability Cloud by navigating to
Infrastructure Monitoring -> Kubernetes -> Kubernetes Clusters and then
adding a filter on k8s.cluster.name with your cluster name (e.g., ai-pod-workshop-participant-1):
Monitor NVIDIA Components
10 minutes
In this section, we’ll use the Prometheus receiver with the OpenTelemetry collector
to monitor the NVIDIA components running in the OpenShift cluster. We’ll start by
navigating to the directory where the collector configuration file is stored:
cd otel-collector
Capture the NVIDIA DCGM Exporter metrics
The NVIDIA DCGM exporter is running
in our OpenShift cluster. It exposes GPU metrics that we can send to Splunk.
To do this, let’s customize the configuration of the collector by editing the
otel-collector-values.yaml file that we used earlier when deploying the collector.
Add the following content, just below the kubeletstats receiver:
receiver_creator/nvidia:
  # Name of the extensions to watch for endpoints to start and stop.
  watch_observers: [k8s_observer]
  receivers:
    prometheus/dcgm:
      config:
        config:
          scrape_configs:
            - job_name: gpu-metrics
              scrape_interval: 60s
              static_configs:
                - targets:
                    - '`endpoint`:9400'
      rule: type == "pod" && labels["app"] == "nvidia-dcgm-exporter"
This tells the collector to look for pods with the label app=nvidia-dcgm-exporter.
When it finds a pod with this label, it connects to port 9400 of the pod and scrapes
the default metrics endpoint (/metrics).
Why are we using the receiver_creator receiver instead of just the Prometheus receiver?
The Prometheus receiver uses a static configuration that scrapes metrics from predefined endpoints.
The receiver_creator receiver enables dynamic creation of receivers (including Prometheus receivers) based on runtime information, allowing for scalable and flexible scraping setups.
Using receiver_creator can simplify configurations in dynamic environments by automating the management of multiple Prometheus scraping targets.
To ensure this new receiver is used, we’ll need to add a new pipeline to the
otel-collector-values.yaml file as well.
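The exact pipeline definition is provided in the workshop repository; as a sketch, it adds a metrics pipeline that uses the new receiver (the processor and exporter names below follow the chart’s defaults and may differ slightly):

service:
  pipelines:
    metrics/nvidia:
      exporters:
        - signalfx
      processors:
        - memory_limiter
        - batch
        - resourcedetection
        - resource
      receivers:
        - receiver_creator/nvidia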
We’ll add one more Prometheus receiver related to NVIDIA in the next section.
Capture the NVIDIA NIM metrics
The meta-llama-3-2-1b-instruct large language model was deployed to the
OpenShift cluster using NVIDIA NIM. It includes a Prometheus endpoint
that we can scrape with the collector. Let’s add the following to the
otel-collector-values.yaml file, just below the prometheus/dcgm receiver
we added earlier:
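The snippet itself is in the workshop repository; based on the description that follows, it looks roughly like this, nested under the receiver_creator/nvidia receivers (the receiver name prometheus/nim is an illustrative choice):

    prometheus/nim:
      config:
        config:
          scrape_configs:
            - job_name: nim-metrics
              scrape_interval: 60s
              metrics_path: /v1/metrics
              static_configs:
                - targets:
                    - '`endpoint`:8000'
      rule: type == "pod" && labels["app"] == "meta-llama-3-2-1b-instruct"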
This tells the collector to look for pods with the label app=meta-llama-3-2-1b-instruct.
When it finds a pod with this label, it connects to port 8000 of the pod and scrapes
the /v1/metrics endpoint.
There’s no need to make changes to the pipeline, as this receiver will already be picked up
as part of the receiver_creator/nvidia receiver.
Add a Filter Processor
Scraping Prometheus endpoints can result in a large number of metrics, sometimes
with high cardinality.
Let’s add a filter processor that defines exactly what metrics we want to send to Splunk.
Specifically, we’ll send only the metrics that are utilized by a dashboard chart or an
alert detector.
Add the following code to the otel-collector-values.yaml file, after the exporters section
but before the receivers section:
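The complete metric list is in the workshop repository; as a rough sketch (the DCGM metric names shown are examples of GPU metrics used by the AI POD dashboards, not the full list):

processors:
  filter/metrics_to_be_included:
    metrics:
      # Include only metrics used in charts and detectors
      include:
        match_type: strict
        metric_names:
          - DCGM_FI_DEV_FB_FREE
          - DCGM_FI_DEV_FB_USED
          - DCGM_FI_DEV_GPU_UTIL
          - ...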
Take a moment to compare the contents of your modified otel-collector-values.yaml
file with the otel-collector-values-with-nvidia.yaml file. Remember that indentation
is important for yaml files, and needs to be precise:
Update your file if needed to ensure the contents match.
Don’t restart the collector yet
Because restarting the collector in an OpenShift environment takes 3 minutes per node,
we’ll wait until we’ve completed all configuration changes before initiating a restart.
Monitor the Vector Database
5 minutes
In this step, we’ll configure the Prometheus receiver to monitor the Weaviate vector database.
What is a Vector Database?
A vector database stores and indexes data as numerical “vector embeddings,” which capture
the semantic meaning of information like text or images. Unlike traditional databases,
they excel at similarity searches, finding conceptually related data points rather
than exact matches.
How is a Vector Database Used?
Vector databases play a key role in a pattern called
Retrieval Augmented Generation (RAG), which is widely used by
applications that leverage Large Language Models (LLMs).
The pattern is as follows:
The end-user asks a question to the application
The application takes the question and calculates a vector embedding for it
The app then performs a similarity search, looking for related documents in the vector database
The app then takes the original question and the related documents, and sends it to the LLM as context
The LLM reviews the context and returns a response to the application
Capture Weaviate Metrics with Prometheus
Let’s modify the OpenTelemetry collector configuration to scrape Weaviate’s Prometheus
metrics.
To do so, let’s add an additional Prometheus receiver creator section
to the otel-collector-values.yaml file. Add it after the receiver_creator/nvidia
section but before the pipelines section:
receiver_creator/weaviate:
  # Name of the extensions to watch for endpoints to start and stop.
  watch_observers: [k8s_observer]
  receivers:
    prometheus/weaviate:
      config:
        config:
          scrape_configs:
            - job_name: weaviate-metrics
              scrape_interval: 60s
              static_configs:
                - targets:
                    - '`endpoint`:2112'
      rule: type == "pod" && labels["app"] == "weaviate"
We’ll need to ensure that Weaviate’s metrics are added to the filter/metrics_to_be_included filter
processor configuration as well:
processors:
  filter/metrics_to_be_included:
    metrics:
      # Include only metrics used in charts and detectors
      include:
        match_type: strict
        metric_names:
          - DCGM_FI_DEV_FB_FREE
          - ...
          - object_count
          - vector_index_size
          - vector_index_operations
          - vector_index_tombstones
          - vector_index_tombstone_cleanup_threads
          - requests_total
          - objects_durations_ms_sum
          - objects_durations_ms_count
          - batch_delete_durations_ms_sum
          - batch_delete_durations_ms_count
Note: add just the new metrics starting with object_count
We also want to add a Resource processor to the configuration file with
the following configuration. Add it after the filter/metrics_to_be_included processor
but before the receivers section:
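A sketch of that processor, based on the description below (the processor name resource/weaviate is an illustrative choice):

resource/weaviate:
  attributes:
    - key: weaviate.instance.id
      from_attribute: service.instance.id
      action: insert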
This processor takes the service.instance.id property on the Weaviate metrics
and copies it into a new property called weaviate.instance.id. This is done so
that we can more easily distinguish Weaviate metrics from other metrics that use
service.instance.id, which is a standard OpenTelemetry property used in
Splunk Observability Cloud.
We’ll need to add a new metrics pipeline for Weaviate metrics as well (we
need to use a separate pipeline since we don’t want the weaviate.instance.id
metric to be added to non-Weaviate metrics). Add the following to the bottom of the file:
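The exact pipeline isn’t reproduced here; as a sketch, it combines the Weaviate receiver with the resource and filter processors we just configured (processor and exporter names follow the chart’s defaults and may differ slightly):

service:
  pipelines:
    metrics/weaviate:
      exporters:
        - signalfx
      processors:
        - memory_limiter
        - batch
        - resourcedetection
        - resource
        - resource/weaviate
        - filter/metrics_to_be_included
      receivers:
        - receiver_creator/weaviate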
Take a moment to compare the
contents of your modified otel-collector-values.yaml file with the
otel-collector-values-with-weaviate.yaml file. Remember that indentation
is important for yaml files, and needs to be precise:
Update your file if needed to ensure the contents match.
Don’t restart the collector yet
Because restarting the collector in an OpenShift environment takes 3 minutes per node,
we’ll wait until we’ve completed all configuration changes before initiating a restart.
Monitor Storage
5 minutes
In this step, we’ll configure the Prometheus receiver to monitor the storage.
What storage do Cisco AI PODs utilize?
Cisco AI PODs have a number of different storage options, including Pure Storage,
VAST, and NetApp.
The workshop will focus on Pure Storage.
How do we capture Pure Storage metrics?
Cisco AI PODs that utilize Pure Storage also use a technology called Portworx,
which provides persistent storage for Kubernetes.
Portworx includes a metrics endpoint that we can scrape using the Prometheus receiver.
Capture Storage Metrics with Prometheus
Let’s modify the OpenTelemetry collector configuration to scrape Portworx metrics
with the Prometheus receiver.
To do so, let’s add an additional Prometheus receiver creator section
to the otel-collector-values.yaml file. Add it after the receiver_creator/weaviate
section but before the pipelines section:
receiver_creator/storage:
  # Name of the extensions to watch for endpoints to start and stop.
  watch_observers: [k8s_observer]
  receivers:
    prometheus/portworx:
      config:
        config:
          scrape_configs:
            - job_name: portworx-metrics
              static_configs:
                - targets:
                    - '`endpoint`:17001'
                    - '`endpoint`:17018'
      rule: type == "pod" && labels["app"] == "portworx-metrics-sim"
We’ll need to ensure that Portworx metrics are added to the filter/metrics_to_be_included filter
processor configuration as well:
processors:
  filter/metrics_to_be_included:
    metrics:
      # Include only metrics used in charts and detectors
      include:
        match_type: strict
        metric_names:
          - DCGM_FI_DEV_FB_FREE
          - ...
          - px_cluster_cpu_percent
          - px_cluster_disk_total_bytes
          - px_cluster_disk_utilized_bytes
          - px_cluster_status_nodes_offline
          - px_cluster_status_nodes_online
          - px_volume_read_latency_seconds
          - px_volume_reads_total
          - px_volume_readthroughput
          - px_volume_write_latency_seconds
          - px_volume_writes_total
          - px_volume_writethroughput
Note: add just the new metrics starting with px_cluster_cpu_percent
We’ll need to add a new metrics pipeline for Portworx metrics as well.
Add the following to the bottom of the file:
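A sketch of that pipeline (again, processor and exporter names follow the chart’s defaults and may differ slightly from the workshop file):

service:
  pipelines:
    metrics/portworx:
      exporters:
        - signalfx
      processors:
        - memory_limiter
        - batch
        - resourcedetection
        - resource
        - filter/metrics_to_be_included
      receivers:
        - receiver_creator/storage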
Take a moment to compare the
contents of your modified otel-collector-values.yaml file with the
otel-collector-values-with-portworx.yaml file. Remember that indentation
is important for yaml files, and needs to be precise:
Update your file if needed to ensure the contents match.
Don’t restart the collector yet
Because restarting the collector in an OpenShift environment takes 3 minutes per node,
we’ll wait until we’ve completed all configuration changes before initiating a restart.
Review AI POD Dashboards
10 minutes
In this section, we’ll review the AI POD dashboards in Splunk Observability Cloud
to confirm that the data from NVIDIA, Pure Storage, and Weaviate is captured
as expected.
Update the OpenTelemetry Collector Config
We can apply the collector configuration changes by running the following Helm command:
Note: if you get an error that says Missing variables, you’ll need to
define your environment variables again. Add your participant number
before running the following commands:
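The command itself is in the workshop repository; it is a helm upgrade against the same values file used earlier, roughly like this (release and chart names assume the standard splunk-otel-collector chart):

helm upgrade splunk-otel-collector -f otel-collector-values.yaml splunk-otel-collector-chart/splunk-otel-collector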
Navigate to Dashboards in Splunk Observability Cloud, then search for the
Cisco AI PODs Dashboard, which is included in the Built-in dashboard groups.
Ensure the dashboard is filtered on your OpenShift cluster name.
The charts should be populated as in the following example:
Review the Pure Storage Dashboard Tab
Navigate to the PURE STORAGE tab and ensure the dashboard is filtered
on your OpenShift cluster name. The charts should be populated as in the
following example:
Review the Weaviate Infrastructure Navigator
Since Weaviate isn’t included by default with an AI POD, it’s
not included on the out-of-the-box AI POD dashboard. Instead,
we can view Weaviate performance data using one of the infrastructure
navigators.
In Splunk Observability Cloud, navigate to Infrastructure -> AI Frameworks -> Weaviate.
Filter on the k8s.cluster.name of interest, and ensure the navigator is populated as in the
following example:
Review the LLM Application
15 minutes
In the final step of the workshop, we’ll deploy an application to our OpenShift cluster
that uses the instruct and embeddings models.
What is LangChain?
Like most applications that interact with LLMs, our application is written in Python.
It also uses LangChain, which is an open-source orchestration
framework that simplifies the development of applications powered by LLMs.
Application Overview
Connect to the LLMs
Our application starts by connecting to two LLMs that we’ll be using:
meta/llama-3.2-1b-instruct: used for responding to user prompts
nvidia/llama-3.2-nv-embedqa-1b-v2: used to calculate embeddings
# connect to a LLM NIM at the specified endpoint, specifying a specific model
llm = ChatNVIDIA(base_url=INSTRUCT_MODEL_URL, model="meta/llama-3.2-1b-instruct")

# Initialize and connect to a NeMo Retriever Text Embedding NIM (nvidia/llama-3.2-nv-embedqa-1b-v2)
embeddings_model = NVIDIAEmbeddings(model="nvidia/llama-3.2-nv-embedqa-1b-v2", base_url=EMBEDDINGS_MODEL_URL)
Why are there two models? Here’s a helpful analogy:
The Embedding model is the “Librarian” (it helps find the right books),
The Instruct model is the “Writer” (it reads the books and writes the answer).
Define the Prompt Template
The application then defines a prompt template that will be used in interactions
with the meta/llama-3.2-1b-instruct LLM:
prompt = ChatPromptTemplate.from_messages([
    ("system",
        "You are a helpful and friendly AI!"
        "Your responses should be concise and no longer than two sentences."
        "Do not hallucinate. Say you don't know if you don't have this information."
        "Answer the question using only the context"
        "\n\nQuestion: {question}\n\nContext: {context}"
    ),
    ("user", "{question}")
])
Note how we’re explicitly instructing the LLM to just say it doesn’t know the answer if
it doesn’t know, which helps minimize hallucinations. There’s also a placeholder for
us to provide context that the LLM can use to answer the question.
Connect to the Vector Database
The application then connects to the vector database that was pre-populated
with NVIDIA data sheet documents:
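The connection code isn’t reproduced here; a minimal sketch using the langchain-weaviate integration (the host names, ports, index name, and text key below are assumptions for illustration):

import weaviate
from langchain_weaviate.vectorstores import WeaviateVectorStore

# Connect to the Weaviate instance running in the cluster (host/ports are assumptions)
weaviate_client = weaviate.connect_to_custom(
    http_host="weaviate", http_port=8080, http_secure=False,
    grpc_host="weaviate-grpc", grpc_port=50051, grpc_secure=False,
)

# Wrap the pre-populated collection as a LangChain vector store (index_name/text_key are assumptions)
vector_store = WeaviateVectorStore(
    client=weaviate_client,
    index_name="DocumentChunks",
    text_key="text",
    embedding=embeddings_model,
)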
The application uses LCEL (LangChain Expression Language) to define the chain.
The | (pipe) symbol works like an assembly line; the output of one step becomes
the input for the next.
Step 1: The Input Map {…}: We are preparing the ingredients for our prompt.
context: We turn our vector store into a retriever. This acts like a search engine that finds the most relevant snippets from our NVIDIA data sheets based on the user’s question.
question: We use RunnablePassthrough() to ensure the user’s original question is passed directly into the prompt.
Note: These keys (context and question) map directly to the {context} and {question} placeholders we defined in our prompt template earlier.
Step 2: The prompt: This is the instruction manual. It takes the context and the question and formats them using the prompt template (e.g., “Answer the question using only the context…”).
Step 3: The llm: This is the “Engine” (in our case, the meta/llama-3.2-1b-instruct NIM). It reads the formatted prompt and generates a response.
Step 4: The StrOutputParser(): By default, AI models return complex objects. This “cleaner” ensures we get back a simple, readable string of text.
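Putting those steps together, the chain definition looks roughly like the following sketch (variable names follow the earlier snippets; the actual application code may differ slightly):

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Turn the vector store into a retriever that finds relevant document chunks
retriever = vector_store.as_retriever()

# Assemble the chain: retrieve context, fill the prompt, call the LLM, clean up the output
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)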
Invoke the Chain
Finally, the application invokes the chain by passing the end user’s question in
as input:
response=chain.invoke(question)
This is the “Start” button. You drop the end user’s question into the beginning of the pipeline,
and it flows through the retriever, the prompt, and the LLM until the answer comes
out the other side.
Instrument the LLM Application
10 minutes
Instrument the Application with OpenTelemetry
Instrumentation Packages
To capture metrics, traces, and logs from our application, we’ve instrumented it with OpenTelemetry.
This required adding the following package to the requirements.txt file (which ultimately gets
installed with pip install):
splunk-opentelemetry==2.8.0
We also added the following to the Dockerfile used to build the
container image for this application, to install additional OpenTelemetry
instrumentation packages:
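Those Dockerfile lines aren’t reproduced here; with the Splunk distribution of OpenTelemetry Python, the additional instrumentation packages are typically installed by running the bootstrap utility during the image build, for example:

RUN opentelemetry-bootstrap -a install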
Finally, to enhance the traces and metrics collected with OpenTelemetry from this
LangChain application, we added additional Splunk instrumentation packages:
To instrument the application with OpenTelemetry, we also included several
environment variables in the Kubernetes manifest file used to deploy the application:
env:
  - name: OTEL_SERVICE_NAME
    value: "llm-app"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://splunk-otel-collector-agent:4317"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  # filter out health check requests to the root URL
  - name: OTEL_PYTHON_EXCLUDED_URLS
    value: "^(https?://)?[^/]+(/)?$"
  - name: OTEL_PYTHON_DISABLED_INSTRUMENTATIONS
    value: "httpx,requests"
  - name: OTEL_INSTRUMENTATION_LANGCHAIN_CAPTURE_MESSAGE_CONTENT
    value: "true"
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: OTEL_PYTHON_LOG_CORRELATION
    value: "true"
  - name: OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
    value: "delta"
  - name: OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED
    value: "true"
  - name: OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT
    value: "true"
  - name: OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT_MODE
    value: "SPAN_AND_EVENT"
  - name: OTEL_INSTRUMENTATION_GENAI_EMITTERS
    value: "span_metric_event,splunk"
  - name: OTEL_INSTRUMENTATION_GENAI_EMITTERS_EVALUATION
    value: "replace-category:SplunkEvaluationResults"
  - name: SPLUNK_PROFILER_ENABLED
    value: "true"
Note that the OTEL_INSTRUMENTATION_LANGCHAIN_CAPTURE_MESSAGE_CONTENT and
OTEL_INSTRUMENTATION_GENAI_* environment variables are specific to the
LangChain instrumentation we’ve used.
Deploy the LLM Application
10 minutes
Deploy the LLM Application
Use the following command to deploy this application to the OpenShift cluster:
cd ~/workshop/cisco-ai-pods
oc apply -f ./llm-app/k8s-manifest.yaml
Note: to build a Docker image for this Python application, we executed the following commands:
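Those commands aren’t shown here; a typical sequence looks like the following (the registry, image name, and tag are placeholders, and the Dockerfile location is an assumption):

cd ~/workshop/cisco-ai-pods/llm-app
docker build --platform linux/amd64 -t <registry>/llm-app:1.0 .
docker push <registry>/llm-app:1.0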
Then run the following command to send a question to the LLM:
curl -X "POST"\
'http://llm-app:8080/askquestion'\
-H 'Accept: application/json'\
-H 'Content-Type: application/json'\
-d '{
"question": "How much memory does the NVIDIA H200 have?"
}'
The NVIDIA H200 has 141GB of HBM3e memory, which is twice the capacity of the NVIDIA H100 Tensor Core GPU with 1.4X more memory bandwidth.
Review Metrics, Traces, and Logs
10 minutes
View Trace Data in Splunk Observability Cloud
In Splunk Observability Cloud, navigate to APM and then select Service Map.
Ensure your environment name is selected (e.g. ai-pod-workshop-participant-1). You should see a service map that looks like the following:
Click on Traces on the right-hand side menu. Then select one of the slower running
traces. It should look like the following example:
The trace shows all the interactions that our application executed to return an answer
to the user’s question (i.e., “How much memory does the NVIDIA H200 have?”).
For example, we can see where our application performed a similarity search to look
for documents related to the question at hand in the Weaviate vector database.
We can also see how the application created a prompt to send to the LLM, including the
context that was retrieved from the vector database:
Note: if you don’t see the chat and invoke_workflow AI interactions
in the trace waterfall view, or you don’t see the AI details tab on the
right-hand side, ask your instructor about the superpowers which need to
be enabled.
Finally, we can see the response from the LLM, the time it took, and the number of
input and output tokens utilized:
Confirm Metrics are Sent to Splunk
Navigate to Dashboards in Splunk Observability Cloud, then search for the
Cisco AI PODs Dashboard, which is included in the Built-in dashboard groups.
Navigate to the NIM FOR LLMS tab and ensure the dashboard is filtered
on your OpenShift cluster name. The charts should be populated as in the
following example:
Wrap-Up
5 minutes
Wrap-Up
We hope you enjoyed this workshop, which provided hands-on experience deploying and working
with several of the technologies that are used to monitor Cisco AI PODs with
Splunk Observability Cloud. Specifically, you had the opportunity to:
Work with a Red Hat OpenShift cluster with GPU-based worker nodes.
Work with the NVIDIA NIM Operator and NVIDIA GPU Operator.
Work with Large Language Models (LLMs) deployed using NVIDIA NIM to the cluster.
Deploy the OpenTelemetry Collector in the Red Hat OpenShift cluster.
Add Prometheus receivers to the collector to ingest infrastructure metrics.
Monitor the Weaviate vector database in the cluster.
Configure monitoring for Pure Storage metrics using Prometheus.
Instrument Python services that interact with Large Language Models (LLMs) with OpenTelemetry.
Understand which details OpenTelemetry captures in traces from applications that interact with LLMs.