Solving Problems with O11y Cloud

2 minutes   Author Derek Mitchell

In this workshop, you’ll get hands-on experience with the following:

  • Deploy the OpenTelemetry Collector and customize the collector config
  • Deploy an application and instrument it with OpenTelemetry
  • See how tags are captured using an OpenTelemetry SDK
  • Create a Troubleshooting MetricSet
  • Troubleshoot a problem and determine root cause using Tag Spotlight

Let’s get started!

Tip

The easiest way to navigate through this workshop is by using:

  • the left/right arrows (< | >) on the top right of this page
  • the left (◀️) and right (▶️) cursor keys on your keyboard

Subsections of Solving Problems with O11y Cloud

Connect to EC2 Instance

5 minutes  

Connect to your EC2 Instance

We’ve prepared an Ubuntu Linux instance in AWS/EC2 for each attendee.

Using the IP address and password provided by your instructor, connect to your EC2 instance using one of the methods below:

  • Mac OS / Linux
    • ssh splunk@<IP address>
  • Windows 10+
    • Use the OpenSSH client
  • Earlier versions of Windows
    • Use Putty

Editing Files

We’ll use vi to edit files during the workshop. Here’s a quick primer.

To open a file for editing:

vi <filename> 
  • To edit the file, press i to switch to Insert mode and begin entering text as normal. Press Esc to return to Command mode.
  • To save your changes without exiting the editor, press Esc to return to Command mode, then enter :w.
  • To exit the editor without saving changes, press Esc to return to Command mode, then enter :q!.
  • To save your changes and exit the editor, press Esc to return to Command mode, then enter :wq.

See An introduction to the vi editor for a comprehensive introduction to vi.

If you’d prefer using another editor, you can use nano instead:

nano <filename> 

Deploy the OpenTelemetry Collector and Customize Config

15 minutes  

The first step to “getting data in” is to deploy an OpenTelemetry collector, which receives and processes telemetry data in our environment before exporting it to Splunk Observability Cloud.

We’ll be using Kubernetes for this workshop, and will deploy the collector in our K8s cluster using Helm.

What is Helm?

Helm is a package manager for Kubernetes which provides the following benefits:

  • Manage Complexity
    • Deal with a single values.yaml file rather than dozens of manifest files
  • Easy Updates
    • In-place upgrades
  • Rollback Support
    • Use helm rollback to roll back to an older version of a release

Install the Collector using Helm

Let’s change into the correct directory and run a script to install the collector:

cd /home/splunk/workshop/tagging
./1-deploy-otel-collector.sh
"splunk-otel-collector-chart" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "splunk-otel-collector-chart" chart repository
Update Complete. ⎈Happy Helming!⎈
NAME: splunk-otel-collector
LAST DEPLOYED: Mon Dec 23 18:47:38 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
Splunk OpenTelemetry Collector is installed and configured to send data to Splunk Observability realm us1.

Note that the script may take a minute or so to run.

How did this script install the collector? It first ensured that the environment variables set in the ~/.profile file are read:

source ~/.profile

It then added the splunk-otel-collector-chart Helm repository and ensured it's up to date:

  helm repo add splunk-otel-collector-chart https://signalfx.github.io/splunk-otel-collector-chart
  helm repo update

And finally, it used helm install to install the collector:

  helm install splunk-otel-collector --version 0.111.0 \
  --set="splunkObservability.realm=$REALM" \
  --set="splunkObservability.accessToken=$ACCESS_TOKEN" \
  --set="clusterName=$INSTANCE-k3s-cluster" \
  --set="environment=tagging-workshop-$INSTANCE" \
  splunk-otel-collector-chart/splunk-otel-collector \
  -f otel/values.yaml

Note that the helm install command references a values.yaml file, which is used to customize the collector configuration. We’ll explore this in more detail below.

Confirm the Collector is Running

We can confirm whether the collector is running with the following command:

kubectl get pods
NAME                                                            READY   STATUS    RESTARTS   AGE
splunk-otel-collector-agent-kfvjb                               1/1     Running   0          2m33s
splunk-otel-collector-certmanager-7d89558bc9-2fqnx              1/1     Running   0          2m33s
splunk-otel-collector-certmanager-cainjector-796cc6bd76-hz4sp   1/1     Running   0          2m33s
splunk-otel-collector-certmanager-webhook-6959cd5f8-qd5b6       1/1     Running   0          2m33s
splunk-otel-collector-k8s-cluster-receiver-57569b58c8-8ghds     1/1     Running   0          2m33s
splunk-otel-collector-operator-6fd9f9d569-wd5mn                 2/2     Running   0          2m33s

Confirm your K8s Cluster is in O11y Cloud

In Splunk Observability Cloud, navigate to Infrastructure -> Kubernetes -> Kubernetes Clusters, and then search for your Cluster Name (which is $INSTANCE-k3s-cluster):

Kubernetes node

Get the Collector Configuration

Before we customize the collector config, how do we determine what the current configuration looks like?

In a Kubernetes environment, the collector configuration is stored in a config map.

We can see which config maps exist in our cluster with the following command:

kubectl get cm -l app=splunk-otel-collector
NAME                                                 DATA   AGE
splunk-otel-collector-otel-k8s-cluster-receiver   1      3h37m
splunk-otel-collector-otel-agent                  1      3h37m

We can then view the config map of the collector agent as follows:

kubectl describe cm splunk-otel-collector-otel-agent
Name:         splunk-otel-collector-otel-agent
Namespace:    default
Labels:       app=splunk-otel-collector
              app.kubernetes.io/instance=splunk-otel-collector
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=splunk-otel-collector
              app.kubernetes.io/version=0.113.0
              chart=splunk-otel-collector-0.113.0
              helm.sh/chart=splunk-otel-collector-0.113.0
              heritage=Helm
              release=splunk-otel-collector
Annotations:  meta.helm.sh/release-name: splunk-otel-collector
              meta.helm.sh/release-namespace: default

Data
====
relay:
----
exporters:
  otlphttp:
    headers:
      X-SF-Token: ${SPLUNK_OBSERVABILITY_ACCESS_TOKEN}
    metrics_endpoint: https://ingest.us1.signalfx.com/v2/datapoint/otlp
    traces_endpoint: https://ingest.us1.signalfx.com/v2/trace/otlp
    (followed by the rest of the collector config in yaml format) 

How to Update the Collector Configuration in K8s

We can customize the collector configuration in K8s using the values.yaml file.

See this file for a comprehensive list of the customization options available in values.yaml.

Let’s look at an example.

Add the Debug Exporter

Suppose we want to see the traces that are sent to the collector. We can use the debug exporter for this purpose, which can be helpful for troubleshooting OpenTelemetry-related issues.

You can use vi or nano to edit the values.yaml file. We will show an example using vi:

vi /home/splunk/workshop/tagging/otel/values.yaml

Add the debug exporter to the bottom of the values.yaml file by copying and pasting the section marked with Add the section below in the following example:

Press i to enter Insert mode in vi before adding the text below.

splunkObservability:
  logsEnabled: false
  profilingEnabled: true
  infrastructureMonitoringEventsEnabled: true
certmanager:
  enabled: true
operator:
  enabled: true

agent:
  config:
    receivers:
      kubeletstats:
        insecure_skip_verify: true
        auth_type: serviceAccount
        endpoint: ${K8S_NODE_IP}:10250
        metric_groups:
          - container
          - pod
          - node
          - volume
        k8s_api_config:
          auth_type: serviceAccount
        extra_metadata_labels:
          - container.id
          - k8s.volume.type
    extensions:
      zpages:
        endpoint: 0.0.0.0:55679
    # Add the section below 
    exporters:
      debug:
        verbosity: detailed
    service:
      pipelines:
        traces:
          exporters:
            - sapm
            - signalfx
            - debug

To save your changes in vi, press the Esc key to enter Command mode, then type :wq! and press Enter.

Once the file is saved, we can apply the changes with:

cd /home/splunk/workshop/tagging

helm upgrade splunk-otel-collector --version 0.111.0 \
--set="splunkObservability.realm=$REALM" \
--set="splunkObservability.accessToken=$ACCESS_TOKEN" \
--set="clusterName=$INSTANCE-k3s-cluster" \
--set="environment=tagging-workshop-$INSTANCE" \
splunk-otel-collector-chart/splunk-otel-collector \
-f otel/values.yaml
Release "splunk-otel-collector" has been upgraded. Happy Helming!
NAME: splunk-otel-collector
LAST DEPLOYED: Mon Dec 23 19:08:08 2024
NAMESPACE: default
STATUS: deployed
REVISION: 2
NOTES:
Splunk OpenTelemetry Collector is installed and configured to send data to Splunk Observability realm us1.

Whenever a change to the collector config is made via a values.yaml file, it’s helpful to review the actual configuration applied to the collector by looking at the config map:

kubectl describe cm splunk-otel-collector-otel-agent

We can see that the debug exporter was added to the traces pipeline as desired:

  traces:
    exporters:
    - sapm
    - signalfx
    - debug

We’ll explore the output of the debug exporter once we deploy an application in our cluster and start capturing traces.


Deploy the Sample Application and Instrument with OpenTelemetry

15 minutes  

At this point, we’ve deployed an OpenTelemetry collector in our K8s cluster, and it’s successfully collecting infrastructure metrics.

The next step is to deploy a sample application and instrument it with OpenTelemetry to capture traces.

We’ll use a microservices-based application written in Python. To keep the workshop simple, we’ll focus on two services: a credit check service and a credit processor service.

Deploy the Application

To save time, we’ve already built Docker images for both of these services and published them to Docker Hub. We can deploy the credit check service in our K8s cluster with the following command:

kubectl apply -f /home/splunk/workshop/tagging/creditcheckservice-py-with-tags/creditcheckservice-dockerhub.yaml
deployment.apps/creditcheckservice created
service/creditcheckservice created

Next, let’s deploy the credit processor service:

kubectl apply -f /home/splunk/workshop/tagging/creditprocessorservice/creditprocessorservice-dockerhub.yaml
deployment.apps/creditprocessorservice created
service/creditprocessorservice created

Finally, let’s deploy a load generator to generate traffic:

kubectl apply -f /home/splunk/workshop/tagging/loadgenerator/loadgenerator-dockerhub.yaml
deployment.apps/loadgenerator created

Explore the Application

We’ll provide an overview of the application in this section. If you’d like to see the complete source code for the application, refer to the Observability Workshop repository on GitHub.

OpenTelemetry Instrumentation

If we look at the Dockerfiles used to build the credit check and credit processor services, we can see that they’ve already been instrumented with OpenTelemetry. For example, let’s look at /home/splunk/workshop/tagging/creditcheckservice-py-with-tags/Dockerfile:

FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Copy requirements over
COPY requirements.txt .

RUN apt-get update && apt-get install --yes gcc python3-dev

ENV PIP_ROOT_USER_ACTION=ignore

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy main app
COPY main.py .

# Bootstrap OTel
RUN splunk-py-trace-bootstrap

# Set the entrypoint command to run the application
CMD ["splunk-py-trace", "python3", "main.py"]

We can see that splunk-py-trace-bootstrap was included, which installs OpenTelemetry instrumentation for supported packages used by our applications. We can also see that splunk-py-trace is used as part of the command to start the application.

And if we review the /home/splunk/workshop/tagging/creditcheckservice-py-with-tags/requirements.txt file, we can see that splunk-opentelemetry[all] was included in the list of packages.

Finally, if we review the Kubernetes manifest that we used to deploy this service (/home/splunk/workshop/tagging/creditcheckservice-py-with-tags/creditcheckservice-dockerhub.yaml), we can see that environment variables were set in the container to tell OpenTelemetry where to export OTLP data:

  env:
    - name: PORT
      value: "8888"
    - name: NODE_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: "http://$(NODE_IP):4317"
    - name: OTEL_SERVICE_NAME
      value: "creditcheckservice"
    - name: OTEL_PROPAGATORS
      value: "tracecontext,baggage"

This is all that’s needed to instrument the service with OpenTelemetry!

Explore the Application

We’ve captured several custom tags with our application, which we’ll explore shortly. Before we do that, let’s introduce the concept of tags and why they’re important.

What are tags?

Tags are key-value pairs that provide additional metadata about spans in a trace, allowing you to enrich the context of the spans you send to Splunk APM.

For example, a payment processing application would find it helpful to track:

  • The payment type used (e.g. credit card, gift card, etc.)
  • The ID of the customer that requested the payment

This way, if errors or performance issues occur while processing the payment, we have the context we need for troubleshooting.
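
For example, here’s a minimal sketch of how such a payment service might capture those two tags using the OpenTelemetry SDK for Python, the same approach we’ll see in the credit check service shortly. The function, attribute names, and values here are hypothetical; they’re not part of the workshop application:

from opentelemetry import trace

def process_payment(payment_type, customer_id):
    # Hypothetical example: enrich the current span with payment metadata.
    # Assumes auto-instrumentation (or a parent) has already started a span;
    # otherwise set_attribute is a no-op.
    current_span = trace.get_current_span()
    current_span.set_attribute("payment.type", payment_type)  # e.g. "credit card"
    current_span.set_attribute("customer.id", customer_id)
    # ... payment processing logic goes here ...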

While some tags can be added with the OpenTelemetry collector, the ones we’ll be working with in this workshop are more granular, and are added by application developers using the OpenTelemetry SDK.

Why are tags so important?

Tags are essential for an application to be truly observable. They add the context to the traces to help us understand why some users get a great experience and others don’t. And powerful features in Splunk Observability Cloud utilize tags to help you jump quickly to root cause.

A note about terminology before we proceed. While we discuss tags in this workshop, and this is the terminology we use in Splunk Observability Cloud, OpenTelemetry uses the term attributes instead. So when you see tags mentioned throughout this workshop, you can treat them as synonymous with attributes.

How are tags captured?

To capture tags in a Python application, we start by importing the trace module by adding an import statement to the top of the /home/splunk/workshop/tagging/creditcheckservice-py-with-tags/main.py file:

import requests
from flask import Flask, request
from waitress import serve
from opentelemetry import trace  # <--- ADDED BY WORKSHOP
...

Next, we need to get a reference to the current span so we can add an attribute (aka tag) to it:

def credit_check():
    current_span = trace.get_current_span()  # <--- ADDED BY WORKSHOP
    customerNum = request.args.get('customernum')
    current_span.set_attribute("customer.num", customerNum)  # <--- ADDED BY WORKSHOP
...

That was pretty easy, right? We’ve captured a total of four tags in the credit check service, with the final result looking like this:

def credit_check():
    current_span = trace.get_current_span()  # <--- ADDED BY WORKSHOP
    customerNum = request.args.get('customernum')
    current_span.set_attribute("customer.num", customerNum)  # <--- ADDED BY WORKSHOP

    # Get Credit Score
    creditScoreReq = requests.get("http://creditprocessorservice:8899/getScore?customernum=" + customerNum)
    creditScoreReq.raise_for_status()
    creditScore = int(creditScoreReq.text)
    current_span.set_attribute("credit.score", creditScore)  # <--- ADDED BY WORKSHOP

    creditScoreCategory = getCreditCategoryFromScore(creditScore)
    current_span.set_attribute("credit.score.category", creditScoreCategory)  # <--- ADDED BY WORKSHOP

    # Run Credit Check
    creditCheckReq = requests.get("http://creditprocessorservice:8899/runCreditCheck?customernum=" + str(customerNum) + "&score=" + str(creditScore))
    creditCheckReq.raise_for_status()
    checkResult = str(creditCheckReq.text)
    current_span.set_attribute("credit.check.result", checkResult)  # <--- ADDED BY WORKSHOP

    return checkResult
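
The getCreditCategoryFromScore helper called above isn’t shown in this excerpt. Purely as an illustration, such a mapping might look like the sketch below; the score ranges are assumptions for this example, not necessarily the values used by the workshop code:

def getCreditCategoryFromScore(creditScore):
    # Illustrative thresholds only; the workshop's actual ranges may differ
    if creditScore < 300 or creditScore > 850:
        return "impossible"
    elif creditScore < 580:
        return "poor"
    elif creditScore < 670:
        return "fair"
    elif creditScore < 800:
        return "good"
    else:
        return "exceptional"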

Review Trace Data

Before looking at the trace data in Splunk Observability Cloud, let’s review what the debug exporter has captured by tailing the agent collector logs with the following command:

kubectl logs -l component=otel-collector-agent -f

Hint: use CTRL+C to stop tailing the logs.

You should see traces written to the agent collector logs such as the following:

InstrumentationScope opentelemetry.instrumentation.flask 0.44b0
Span #0
    Trace ID       : 9f9fc109903f25ba57bea9b075aa4833
    Parent ID      : 
    ID             : 6d71519f454f6059
    Name           : /check
    Kind           : Server
    Start time     : 2024-12-23 19:55:25.815891965 +0000 UTC
    End time       : 2024-12-23 19:55:27.824664949 +0000 UTC
    Status code    : Unset
    Status message : 
Attributes:
     -> http.method: Str(GET)
     -> http.server_name: Str(waitress.invalid)
     -> http.scheme: Str(http)
     -> net.host.port: Int(8888)
     -> http.host: Str(creditcheckservice:8888)
     -> http.target: Str(/check?customernum=30134241)
     -> net.peer.ip: Str(10.42.0.19)
     -> http.user_agent: Str(python-requests/2.31.0)
     -> net.peer.port: Str(47248)
     -> http.flavor: Str(1.1)
     -> http.route: Str(/check)
     -> customer.num: Str(30134241)
     -> credit.score: Int(443)
     -> credit.score.category: Str(poor)
     -> credit.check.result: Str(OK)
     -> http.status_code: Int(200)

Notice how the trace includes the tags (aka attributes) that we captured in the code, such as credit.score and credit.score.category. We’ll use these in the next section, when we analyze the traces in Splunk Observability Cloud to find the root cause of a performance issue.


Create a Troubleshooting MetricSet

5 minutes  

Index Tags

To use advanced features in Splunk Observability Cloud such as Tag Spotlight, we’ll need to first index one or more tags.

To do this, navigate to Settings -> MetricSets and ensure the APM tab is selected. Then click the + Add Custom MetricSet button.

Let’s index the credit.score.category tag by entering the following details (note: since everyone in the workshop is using the same organization, the instructor will do this step on your behalf):

Create Troubleshooting MetricSet

Click Start Analysis to proceed.

The tag will appear in the list of Pending MetricSets while analysis is performed.

Pending MetricSets

Once analysis is complete, click on the checkmark in the Actions column.

Troubleshooting vs. Monitoring MetricSets

You may have noticed that, to index this tag, we created something called a Troubleshooting MetricSet. It’s named this way because a Troubleshooting MetricSet, or TMS, allows us to troubleshoot issues with this tag using features such as Tag Spotlight.

You may have also noticed that there’s another option which we didn’t choose called a Monitoring MetricSet (or MMS). Monitoring MetricSets go beyond troubleshooting and allow us to use tags for alerting and dashboards. While we won’t be exploring this capability as part of this workshop, it’s a powerful feature that I encourage you to explore on your own.


Troubleshoot a Problem Using Tag Spotlight

15 minutes  

Explore APM Data

Let’s explore some of the APM data we’ve captured to see how our application is performing.

Navigate to APM, then use the Environment dropdown to select your environment (i.e. tagging-workshop-instancename).

You should see creditprocessorservice and creditcheckservice displayed in the list of services:

APM Overview

Click on Service Map on the right-hand side to view the service map. We can see that the creditcheckservice makes calls to the creditprocessorservice, with an average response time of at least 3 seconds:

Service Map

Next, click on Traces on the right-hand side to see the traces captured for this application. You’ll see that some traces run relatively fast (i.e. just a few milliseconds), whereas others take a few seconds.

Traces

Click on one of the longer running traces. In this example, the trace took five seconds, and we can see that most of the time was spent calling the /runCreditCheck operation, which is part of the creditprocessorservice:

Long Running Trace

But why are some traces slow, and others are relatively quick?

Close the trace and return to the Trace Analyzer. If you toggle Errors only to on, you’ll also notice that some traces have errors:

Traces

If we look at one of the error traces, we can see that the error occurs when the creditprocessorservice attempts to call another service named otherservice. But why do some requests result in a call to otherservice, and others don’t?

Trace with Errors

To determine why some requests perform slowly, and why some requests result in errors, we could look through the traces one by one and try to find a pattern behind the issues.

Splunk Observability Cloud provides a better way to find the root cause of an issue. We’ll explore this next.

Using Tag Spotlight

Since we indexed the credit.score.category tag, we can use it with Tag Spotlight to troubleshoot our application.

Navigate to APM then click on Tag Spotlight on the right-hand side. Ensure the creditcheckservice service is selected from the Service drop-down (if not already selected).

With Tag Spotlight, we can see 100% of credit score requests that result in a score of impossible have an error, yet requests for all other credit score types have no errors at all!

Tag Spotlight with Errors

This illustrates the power of Tag Spotlight! Finding this pattern would be time-consuming without it, as we’d have to manually look through hundreds of traces to identify the pattern (and even then, there’s no guarantee we’d find it).

We’ve looked at errors, but what about latency? Let’s click on the Requests & errors distribution dropdown and change it to Latency distribution.

IMPORTANT: Click on the settings icon beside Cards display to add the P50 and P99 metrics.

Here, we can see that requests with a poor credit score are running slowly, with P50, P90, and P99 times of around 3 seconds, which is too long for our users to wait and much slower than other requests.

We can also see that some requests with an exceptional credit score are running slowly, with P99 times of around 5 seconds, though the P50 response time is relatively quick.

Tag Spotlight with Latency

Using Dynamic Service Maps

Now that we know the credit score category associated with the request can impact performance and error rates, let’s explore another feature that utilizes indexed tags: Dynamic Service Maps.

With Dynamic Service Maps, we can break down a particular service by a tag. For example, let’s click on APM, then click Service Map to view the service map.

Click on creditcheckservice. Then, on the right-hand menu, click on the drop-down that says Breakdown, and select the credit.score.category tag.

At this point, the service map is updated dynamically, and we can see the performance of requests hitting creditcheckservice broken down by the credit score category:

Service Map Breakdown

This view makes it clear that performance for good and fair credit scores is excellent, while poor and exceptional scores are much slower, and impossible scores result in errors.

Our Findings

Tag Spotlight has uncovered several interesting patterns for the engineers that own this service to explore further:

  • Why are all the impossible credit score requests resulting in errors?
  • Why are all the poor credit score requests running slowly?
  • Why do some of the exceptional requests run slowly?

As an SRE, passing this context to the engineering team would be extremely helpful for their investigation, as it would allow them to track down the issue much more quickly than if we simply told them that the service was “sometimes slow”.

If you’re curious, have a look at the source code for the creditprocessorservice. You’ll see that requests with impossible, poor, and exceptional credit scores are handled differently, thus resulting in the differences in error rates and latency that we uncovered.

The behavior we saw with our application is typical for modern cloud-native applications, where different inputs passed to a service lead to different code paths, some of which result in slower performance or errors. For example, in a real credit check service, requests resulting in low credit scores may be sent to another downstream service to further evaluate risk, and may perform more slowly than requests resulting in higher scores, or encounter higher error rates.
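
Purely as an illustration of this pattern (this is not the actual creditprocessorservice code), a hypothetical handler might branch on the score it receives, with some branches adding latency or raising errors:

import time

from flask import Flask, abort, request

app = Flask(__name__)

# Hypothetical endpoint showing input-dependent code paths
@app.route("/runCreditCheck")
def run_credit_check():
    score = int(request.args.get("score"))

    if score < 300 or score > 850:   # "impossible" scores
        abort(500)                   # this path always errors
    elif score < 580:                # "poor" scores
        time.sleep(3)                # this path is slow

    return "OK"

if __name__ == "__main__":
    app.run(port=8899)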


Summary

2 minutes  

This workshop provided hands-on experience with the following concepts:

  • How to deploy the Splunk Distribution of the OpenTelemetry Collector using Helm.
  • How to instrument an application with OpenTelemetry.
  • How to capture tags of interest from your application using an OpenTelemetry SDK.
  • How to index tags in Splunk Observability Cloud using Troubleshooting MetricSets.
  • How to utilize tags in Splunk Observability Cloud to find “unknown unknowns” using the Tag Spotlight and Dynamic Service Map features.

Collecting tags aligned with the best practices shared in this workshop will let you get even more value from the data you’re sending to Splunk Observability Cloud. Now that you’ve completed this workshop, you have the knowledge you need to start collecting tags from your own applications!

To get started with capturing tags today, check out how to add tags in various supported languages, and then how to use them to create Troubleshooting MetricSets so they can be analyzed in Tag Spotlight. For more help, feel free to ask a Splunk Expert.

And to see how other languages and environments are instrumented with OpenTelemetry, explore the Splunk OpenTelemetry Examples GitHub repository.

Tip for Workshop Facilitator(s)

Once the workshop is complete, remember to delete the APM MetricSet you created earlier for the credit.score.category tag.