Scenarios

  • This scenario is for ITOps teams managing hybrid infrastructure who need to troubleshoot cloud-native performance issues by correlating real-time metrics with logs, helping them troubleshoot faster, improve MTTD/MTTR, and optimize costs.
  • This scenario helps engineering teams identify and fix issues caused by planned and unplanned changes to their microservices-based applications.
  • Use Splunk Real User Monitoring (RUM) and Synthetics to get insight into end user experience, and proactively test scenarios to improve that experience.
  • Deploy ThousandEyes Enterprise Agent in Kubernetes, stream synthetic data into Splunk Observability Cloud, and enable bi-directional drilldowns between ThousandEyes and Splunk APM.
  • Deploy Isovalent Enterprise Platform (Cilium, Hubble, and Tetragon) on Amazon EKS and integrate with Splunk Observability Cloud for comprehensive eBPF-based monitoring and observability. Includes an end-to-end demo investigating a DNS issue using Hubble dashboards.
Last Modified Jan 30, 2026

Subsections of Scenarios

Optimize Cloud Monitoring

3 minutes   Author Tim Hard

The elasticity of cloud architectures means that monitoring artifacts must scale elastically as well, breaking the paradigm of purpose-built monitoring assets. As a result, administrative overhead, visibility gaps, and tech debt skyrocket while MTTR slows. This typically happens for three reasons:

  • Complex and Inefficient Data Management: Infrastructure data is scattered across multiple tools with inconsistent naming conventions, leading to fragmented views and poor metadata and labelling. Managing multiple agents and data flows adds to the complexity.
  • Inadequate Monitoring and Troubleshooting Experience: Slow data visualization and troubleshooting, cluttered with bespoke dashboards and manual correlations, are hindered further by the lack of monitoring tools for ephemeral technologies like Kubernetes and serverless functions.
  • Operational and Scaling Challenges: Manual onboarding, user management, and chargeback processes, along with the need for extensive data summarization, slow down operations and inflate administrative tasks, complicating data collection and scalability.

To address these challenges you need a way to:

  • Standardize Data Collection and Tags: Centralize monitoring with a single, open-source agent to apply uniform naming standards and ensure metadata for visibility. Optimize data collection and use a monitoring-as-code approach for consistent collection and tagging.
  • Reuse Content Across Teams: Streamline new IT infrastructure onboarding and user management with templates and automation. Utilize out-of-the-box dashboards, alerts, and self-service tools to enable content reuse, ensuring uniform monitoring and reducing manual effort.
  • Improve Timeliness of Alerts: Utilize highly performant open source data collection, combined with real-time streaming-based data analytics and alerting, to enhance the timeliness of notifications. Automatically configured alerts for common problem patterns (AutoDetect) and minimal yet effective monitoring dashboards and alerts will ensure rapid response to emerging issues, minimizing potential disruptions.
  • Correlate Infrastructure Metrics and Logs: Achieve full monitoring coverage of all IT infrastructure by enabling seamless correlation between infrastructure metrics and logs. High-performing data visualization and a single source of truth for data, dashboards, and alerts will simplify the correlation process, allowing for more effective troubleshooting and analysis of the IT environment.

In this workshop, we’ll explore:

  • How to standardize data collection and tags using OpenTelemetry.
  • How to reuse content across teams.
  • How to improve the timeliness of alerts.
  • How to correlate infrastructure metrics and logs.
Tip

The easiest way to navigate through this workshop is by using:

  • the left/right arrows (< | >) on the top right of this page
  • the left (◀️) and right (▶️) cursor keys on your keyboard
Last Modified Jul 23, 2025

Subsections of Optimize Cloud Monitoring

Getting Started

3 minutes   Author Tim Hard

During this technical Optimize Cloud Monitoring Workshop, you will build out an environment based on a lightweight Kubernetes[1] cluster.

To simplify the workshop modules, a pre-configured AWS/EC2 instance is provided.

The instance is pre-configured with all the software required to deploy the Splunk OpenTelemetry Collector[2] and the microservices-based OpenTelemetry Demo Application[3] in Kubernetes, which has been instrumented using OpenTelemetry to send metrics, traces, and logs.
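For background, a Kubernetes deployment of the Splunk OpenTelemetry Collector is typically driven by a Helm values file. The sketch below is illustrative only; the realm, access token, and cluster name are placeholders, and in this workshop the pre-configured instance and deployment script handle all of this for you.

```shell
# Illustrative only: a minimal Helm values file for the
# splunk-otel-collector chart. realm, accessToken, and clusterName
# are placeholder values, not the workshop's actual settings.
cat > values.yaml <<'EOF'
clusterName: workshop-cluster
splunkObservability:
  realm: us1
  accessToken: REDACTED
EOF

# The chart would then typically be installed with something like:
#   helm repo add splunk-otel-collector-chart \
#     https://signalfx.github.io/splunk-otel-collector-chart
#   helm install splunk-otel-collector \
#     -f values.yaml splunk-otel-collector-chart/splunk-otel-collector
echo "values.yaml written"
```

Nothing in this sketch needs to be run during the workshop; it simply shows the shape of a collector deployment.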

This workshop will introduce you to the benefits of standardized data collection, how content can be re-used across teams, correlating metrics and logs, and creating detectors to fire alerts. By the end of these technical workshops, you will have a good understanding of some of the key features and capabilities of the Splunk Observability Cloud.

Here are the instructions on how to access your pre-configured AWS/EC2 instance.

Splunk Architecture


  1. Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. ↩︎

  2. OpenTelemetry Collector offers a vendor-agnostic implementation on how to receive, process and export telemetry data. In addition, it removes the need to run, operate and maintain multiple agents/collectors to support open-source telemetry data formats (e.g. Jaeger, Prometheus, etc.) sending to multiple open-source or commercial back-ends. ↩︎

  3. The OpenTelemetry Demo Application is a microservice-based distributed system intended to illustrate the implementation of OpenTelemetry in a near real-world environment. ↩︎

Last Modified Jul 23, 2025

Subsections of 1. Getting Started

How to connect to your workshop environment

5 minutes   Author Tim Hard
  1. How to retrieve the IP address of the AWS/EC2 instance assigned to you.
  2. Connect to your instance using SSH, Putty or your web browser.
  3. Verify your connection to your AWS/EC2 cloud instance.
  4. Using Putty (Optional)
  5. Using Multipass (Optional)

1. AWS/EC2 IP Address

In preparation for the workshop, Splunk has prepared an Ubuntu Linux instance in AWS/EC2.

To get access to the instance that you will be using in the workshop please visit the URL to access the Google Sheet provided by the workshop leader.

Search for your AWS/EC2 instance by looking for your first and last name, as provided during registration for this workshop.

attendee spreadsheet

Find your allocated IP address, SSH command (for Mac OS, Linux and the latest Windows versions) and password to enable you to connect to your workshop instance.

It also has the Browser Access URL that you can use in case you cannot connect via SSH or Putty (see EC2 access via Web browser).

Important

Please use SSH or Putty to gain access to your EC2 instance if possible and make a note of the IP address as you will need this during the workshop.

2. SSH (Mac OS/Linux)

Most attendees will be able to connect to the workshop by using SSH from their Mac or Linux device, or on Windows 10 and above.

To use SSH, open a terminal on your system and type ssh splunk@x.x.x.x (replacing x.x.x.x with the IP address found in Step #1).

ssh login

When prompted Are you sure you want to continue connecting (yes/no/[fingerprint])? please type yes.

ssh password

Enter the password provided in the Google Sheet from Step #1.

Upon successful login, you will be presented with the Splunk logo and the Linux prompt.

ssh connected

3. SSH (Windows 10 and above)

The procedure described above is the same on Windows 10, and the commands can be executed either in the Windows Command Prompt or PowerShell. However, Windows regards its SSH Client as an “optional feature”, which might need to be enabled.

You can verify if SSH is enabled by simply executing ssh

If you are shown a help text on how to use the SSH command (like shown in the screenshot below), you are all set.

Windows SSH enabled

If the result of executing the command looks something like the screenshot below, you will need to enable the “OpenSSH Client” feature manually.

Windows SSH disabled

To do that, open the “Settings” menu, and click on “Apps”. While in the “Apps & features” section, click on “Optional features”.

Windows Apps Settings

Here, you are presented with a list of installed features. At the top, there is a button with a plus icon labeled “Add a feature”. Click it. In the search input field, type “OpenSSH”, find the feature called “OpenSSH Client” (or “OpenSSH Client (Beta)”), click on it, and click the Install button.

Windows Enable OpenSSH Client

Now you are set! In case you are not able to access the provided instance despite enabling the OpenSSH feature, please do not shy away from reaching out to the course instructor, either via chat or directly.

At this point you are ready to continue and start the workshop.


4. Putty (For Windows Versions prior to Windows 10)

If you do not have SSH pre-installed, or if you are on an older Windows system, the best option is to install Putty, which you can find here.

Important

If you cannot install Putty, please go to Web Browser (All).

Open Putty and enter in the Host Name (or IP address) field the IP address provided in the Google Sheet.

You can optionally save your settings by providing a name and pressing Save.

putty-2

To then login to your instance click on the Open button as shown above.

If this is the first time connecting to your AWS/EC2 workshop instance, you will be presented with a security dialogue, please click Yes.

putty-3

Once connected, log in as splunk using the password provided in the Google Sheet.

Once you are connected successfully you should see a screen similar to the one below:

putty-4

At this point, you are ready to continue and start the workshop.


5. Web Browser (All)

If you are blocked from using SSH (Port 22) or unable to install Putty you may be able to connect to the workshop instance by using a web browser.

Note

This assumes that access to port 6501 is not restricted by your company’s firewall.

Open your web browser and type http://x.x.x.x:6501 (where X.X.X.X is the IP address from the Google Sheet).

http-6501

Once connected, log in as splunk using the password provided in the Google Sheet.

http-connect

Once you are connected successfully you should see a screen similar to the one below:

web login

Unlike when you are using regular SSH, copy and paste requires a few extra steps to complete when using a browser session. This is due to cross-browser restrictions.

When the workshop asks you to copy instructions into your terminal, please do the following:

Copy the instruction as normal, but when ready to paste it in the web terminal, choose Paste from the browser as shown below:

web paste 1

This will open a dialogue box asking for the text to be pasted into the web terminal:

web paste 3

Paste the text in the text box as shown, then press OK to complete the copy and paste process.

Note

Unlike a regular SSH connection, the web browser session has a 60-second timeout. When you are disconnected, a Connect button will be shown in the center of the web terminal.

Simply click the Connect button and you will be reconnected and will be able to continue.

web reconnect

At this point, you are ready to continue and start the workshop.


6. Multipass (All)

If you are unable to access AWS but want to install software locally, follow the instructions for using Multipass.

Last Modified Jul 23, 2025

Deploy OpenTelemetry Demo Application

10 minutes   Author Tim Hard

Introduction

For this workshop, we’ll be using the OpenTelemetry Demo Application running in Kubernetes. This application is for an online retailer and includes more than a dozen services written in many different languages. While metrics, traces, and logs are being collected from this application, this workshop is primarily focused on how Splunk Observability Cloud can be used to more efficiently monitor infrastructure.

Pre-requisites

Initial Steps

The initial setup can be completed by executing the following steps on the command line of your EC2 instance.

cd ~/workshop/optimize-cloud-monitoring && \
./deploy-application.sh

You’ll be asked to enter your favorite city. This will be used in the OpenTelemetry Collector configuration as a custom tag to show how easy it is to add additional context to your observability data.

Note

Custom tagging will be covered in more detail in the Standardize Data Collection section of this workshop.

Enter Favorite City

Your application should now be running and sending data to Splunk Observability Cloud. You’ll dig into the data in the next section.

Last Modified Jul 23, 2025

Standardize Data Collection

2 minutes   Author Tim Hard

Why Standards Matter

As cloud adoption grows, we often face requests to support new technologies within a diverse landscape, posing challenges in delivering timely content. Take, for instance, a team containerizing five workloads on AWS requiring EKS visibility. Usually, this involves assisting with integration setup, configuring metadata, and creating dashboards and alerts, a process that is both time-consuming and increases administrative overhead and technical debt.

Splunk Observability Cloud was designed to handle customers with a diverse set of technical requirements and stacks – from monolithic to microservices architectures, from homegrown applications to Software-as-a-Service.

Splunk offers a native experience for OpenTelemetry, which means OTel is the preferred way to get data into Splunk. Between Splunk’s integrations and the OpenTelemetry community, there are a number of integrations available to easily collect from diverse infrastructure and applications. This includes both on-prem systems like VMware as well as guided integrations with cloud vendors, centralizing these hybrid environments.

Splunk Observability Cloud Integrations

For someone like a Splunk admin, the OpenTelemetry Collector can additionally be deployed to a Splunk Universal Forwarder as a Technical Add-on. This enables fast roll-out and centralized configuration management using the Splunk Deployment Server. Let’s assume that the same team adopting Kubernetes is going to deploy a cluster for each one of our B2B customers. I’ll show you how to make a simple modification to the OpenTelemetry collector to add the customerID, and then use mirrored dashboards to allow any of our SRE teams to easily see the customer they care about.
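As a rough sketch of what such a modification can look like, the fragment below uses the standard OpenTelemetry Collector resource processor; this is illustrative only, not the workshop's actual configuration, and the customerID value is a placeholder.

```yaml
# Hypothetical collector fragment: stamp every metric passing through
# this pipeline with a customerID resource attribute.
# "acme" is a placeholder value; receivers/exporters are omitted.
processors:
  resource/add_customer_id:
    attributes:
      - key: customerID
        value: acme
        action: insert

service:
  pipelines:
    metrics:
      processors: [resource/add_customer_id]
```

Because the attribute is applied at the collector, every team's dashboards and detectors can filter on customerID without any application changes.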

Last Modified Jul 23, 2025

Subsections of 2. Standardize Data Collection

What Are Tags?

3 minutes   Author Tim Hard

Tags are key-value pairs that provide additional metadata about metrics, spans in a trace, or logs, allowing you to enrich the context of the data you send to Splunk Observability Cloud. Many tags are collected by default, such as hostname or OS type. Custom tags can be used to provide environment- or application-specific context. Examples of custom tags include:

Infrastructure specific attributes
  • What data center a host is in
  • What services are hosted on an instance
  • What team is responsible for a set of hosts
Application specific attributes
  • What Application Version is running
  • Feature flags or experimental identifiers
  • Tenant ID in multi-tenant applications
  • User ID
  • User role (e.g. admin, guest, subscriber)
  • User geographical location (e.g. country, city, region)
There are two ways to add tags to your data:
  • Add tags as OpenTelemetry attributes to metrics, traces, and logs when you send data to the Splunk Distribution of OpenTelemetry Collector. This option lets you add tags in bulk.
  • Instrument your application to create span tags. This option gives you the most flexibility at the per-application level.
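The bulk option is commonly implemented via the standard OTEL_RESOURCE_ATTRIBUTES environment variable, which OpenTelemetry SDKs honor; the attribute keys and values below are illustrative examples, not part of this workshop's setup.

```shell
# Standard OpenTelemetry mechanism: a comma-separated list of
# key=value resource attributes applied to all emitted telemetry.
# The attribute names and values here are illustrative.
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=prod,team=storefront"
echo "$OTEL_RESOURCE_ATTRIBUTES"
```

Setting this once on a host or in a container spec tags every signal the instrumented process emits, with no code changes.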

Why are tags so important?

Tags are essential for an application to be truly observable. Tags add context to the traces to help us understand why some users get a great experience and others don’t. Powerful features in Splunk Observability Cloud utilize tags to help you jump quickly to the root cause.

Contextual Information: Tags provide additional context to the metrics, traces, and logs allowing developers and operators to understand the behavior and characteristics of infrastructure and traced operations.

Filtering and Aggregation: Tags enable filtering and aggregation of collected data. By attaching tags, users can filter and aggregate data based on specific criteria. This filtering and aggregation help in identifying patterns, diagnosing issues, and gaining insights into system behavior.

Correlation and Analysis: Tags facilitate correlation between metrics and other telemetry data, such as traces and logs. By including common identifiers or contextual information as tags, users can correlate metrics, traces, and logs enabling comprehensive analysis and troubleshooting of distributed systems.

Customization and Flexibility: OpenTelemetry allows developers to define custom tags based on their application requirements. This flexibility enables developers to capture domain-specific metadata or contextual information that is crucial for understanding the behavior of their applications.

Attributes vs. Tags

A note about terminology before we proceed. While this workshop is about tags, and this is the terminology we use in Splunk Observability Cloud, OpenTelemetry uses the term attributes instead. So when you see tags mentioned throughout this workshop, you can treat them as synonymous with attributes.

Last Modified Jul 23, 2025

Adding Context With Tags

3 minutes   Author Tim Hard

When you deployed the OpenTelemetry Demo Application in the Getting Started section of this workshop, you were asked to enter your favorite city. For this workshop, we’ll be using that to show the value of custom tags.

For this workshop, the OpenTelemetry collector is pre-configured to use the city you provided as a custom tag called store.location, which will be used to emulate Kubernetes Clusters running in different geographic locations. We’ll use this tag as a filter to show how you can use out-of-the-box integration dashboards to quickly create views for specific teams, applications, or other attributes of your environment, efficiently enabling content to be reused across teams without increasing technical debt.

Here is the OpenTelemetry Collector configuration used to add the store.location tag to all of the data sent to this collector. This means any metrics, traces, or logs will contain the store.location tag which can then be used to search, filter, or correlate this value.

Tip

If you’re interested in a deeper dive on the OpenTelemetry Collector, head over to the Self Service Observability workshop where you can get hands-on with configuring the collector or the OpenTelemetry Collector Ninja Workshop where you’ll dissect the inner workings of each collector component.

OpenTelemetry Collector Configuration

While this example uses a hard-coded value for the tag, parameterized values can also be used, allowing you to customize the tags dynamically based on the context of each host, application, or operation. This flexibility enables you to capture relevant metadata, user-specific details, or system parameters, providing a rich context for metrics, tracing, and log data while enhancing the observability of your distributed systems.
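As an illustration of that parameterization, a collector fragment can read the tag value from the environment using the Collector's ${env:...} expansion; STORE_LOCATION is an assumed variable name here, not necessarily the one this workshop's scripts use.

```yaml
# Hypothetical fragment: resolve the tag value from an environment
# variable at collector start-up instead of hard-coding it.
processors:
  resource/add_store_location:
    attributes:
      - key: store.location
        value: ${env:STORE_LOCATION}
        action: insert
```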

Now that you have the appropriate context, which as we’ve established is critical to Observability, let’s head over to Splunk Observability Cloud and see how we can use the data we’ve just configured.

Last Modified Jul 23, 2025

Reuse Content Across Teams

3 minutes   Author Tim Hard

In today’s rapidly evolving technological landscape, where hybrid and cloud environments are becoming the norm, the need for effective monitoring and troubleshooting solutions has never been more critical. However, managing the elasticity and complexity of these modern infrastructures poses a significant challenge for teams across various industries. One of the primary pain points encountered in this endeavor is the inadequacy of existing monitoring and troubleshooting experiences.

Traditional monitoring approaches often fall short in addressing the intricacies of hybrid and cloud environments. Teams frequently encounter slow data visualization and troubleshooting processes, compounded by the clutter of bespoke yet similar dashboards and the manual correlation of data from disparate sources. This cumbersome workflow is made worse by the absence of monitoring tools tailored to ephemeral technologies such as containers, orchestrators like Kubernetes, and serverless functions.

Infrastructure Overview in Splunk Observability Cloud

In this section, we’ll cover how Splunk Observability Cloud provides out-of-the-box content for every integration. Not only do the out-of-the-box dashboards provide rich visibility into the infrastructure that is being monitored, they can also be mirrored. This is important because it enables you to create standard dashboards for use by teams throughout your organization. This allows all teams to see any changes to the charts in the dashboard, and members of each team can set dashboard variables and filter customizations relevant to their requirements.

Last Modified Jul 23, 2025

Subsections of 3. Reuse Content Across Teams

Infrastructure Navigators

5 minutes   Author Tim Hard

Splunk Infrastructure Monitoring (IM) is a market-leading monitoring and observability service for hybrid cloud environments. Built on a patented streaming architecture, it provides a real-time solution for engineering teams to visualize and analyze performance across infrastructure, services, and applications in a fraction of the time and with greater accuracy than traditional solutions.

300+ easy-to-use OOTB content: Pre-built navigators and dashboards deliver immediate visualizations of your entire environment so that you can interact with all your data in real time.
Kubernetes navigator: Provides an instant, comprehensive out-of-the-box hierarchical view of nodes, pods, and containers. Ramp up even the most novice Kubernetes user with easy-to-understand interactive cluster maps.
AutoDetect alerts and detectors: Automatically identify the most important metrics, out-of-the-box, to create alert conditions for detectors that accurately alert from the moment telemetry data is ingested and use real-time alerting capabilities for important notifications in seconds.
Log views in dashboards: Combine log messages and real-time metrics on one page with common filters and time controls for faster in-context troubleshooting.
Metrics pipeline management: Control metrics volume at the point of ingest without re-instrumentation with a set of aggregation and data-dropping rules to store and analyze only the needed data. Reduce metrics volume and optimize observability spend.

Infrastructure Overview

Exercise: Find your Kubernetes Cluster
  • From the Splunk Observability Cloud homepage, click the Infrastructure button -> Kubernetes -> K8s nodes
  • First, use the k8s filter option to pick your cluster.
  • From the filter drop-down box, use the store.location value you entered when deploying the application.
  • You can then start typing the city you used, which should also appear in the drop-down values. Select yours and make sure just the one for your workshop is highlighted with a blue tick.
  • Click the Apply Filter button to focus on your Cluster.

Kubernetes Navigator

  • You should now have your Kubernetes Cluster visible.
  • Here we can see all of the different components of the cluster (Nodes, Pods, etc.), each of which has relevant metrics associated with it. On the right side, you can also see which services are running in the cluster.

Before moving to the next section, take some time to explore the Kubernetes Navigator to see the data that is available Out of the Box.

Last Modified Jul 23, 2025

Dashboard Cloning

5 minutes   Author Tim Hard

ITOps teams responsible for monitoring fleets of infrastructure frequently find themselves manually creating dashboards to visualize and analyze metrics, traces, and log data emanating from rapidly changing cloud-native workloads hosted in Kubernetes and serverless architectures, alongside existing on-premises systems. Moreover, due to the absence of a standardized troubleshooting workflow, teams often resort to creating numerous custom dashboards, each resembling the other in structure and content. As a result, administrative overhead skyrockets and MTTR slows.

To address this, you can use the out-of-the-box dashboards available in Splunk Observability Cloud for each and every integration. These dashboards are filterable and can be used for ad hoc troubleshooting or as a templated approach to getting users the information they need without having to start from scratch. Not only do the out-of-the-box dashboards provide rich visibility into the infrastructure that is being monitored, they can also be cloned.

Exercise: Clone a Dashboard
  1. In Splunk Observability Cloud, click the Global Search button. (Global Search can be used to quickly find content.)
  2. Search for Pods and select K8s pods (Kubernetes).
  3. This will take you to the out-of-the-box Kubernetes Pods dashboard, which we will use as a template for mirroring dashboards.
  4. In the upper right corner of the dashboard, click the Dashboard actions button (3 horizontal dots) -> Click Save As…
  5. Enter a dashboard name (e.g. Kubernetes Pods Dashboard)
  6. Under Dashboard group, search for your e-mail address and select it.
  7. Click Save

Note: Every Observability Cloud user who has set a password and logged in at least once gets a user dashboard group and user dashboard. Your user dashboard group is your individual workspace within Observability Cloud.

Save Dashboard

After saving, you will be taken to the newly created dashboard in the Dashboard Group for your user. This is an example of cloning an out-of-the-box dashboard which can be further customized and enables users to quickly build role, application, or environment relevant views.

Custom dashboards are meant to be used by multiple people and usually represent a curated set of charts that you want to make accessible to a broad cross-section of your organization. They are typically organized by service, team, or environment.

Last Modified Jul 23, 2025

Dashboard Mirroring

5 minutes   Author Tim Hard

Not only do the out-of-the-box dashboards provide rich visibility into the infrastructure that is being monitored, they can also be mirrored. This is important because it enables you to create standard dashboards for use by teams throughout your organization. This allows all teams to see any changes to the charts in the dashboard, and members of each team can set dashboard variables and filter customizations relevant to their requirements.

Exercise: Create a Mirrored Dashboard
  1. While on the Kubernetes Pods dashboard you created in the previous step, click the Dashboard actions button (3 horizontal dots) in the upper right corner of the dashboard -> Click Add a mirror…. A configuration modal for the Dashboard Mirror will open.

    Mirror Dashboard Menu

  2. Under My dashboard group search for your e-mail address and select it.

  3. (Optional) Modify the dashboard name in Dashboard name override.

  4. (Optional) Add a dashboard description in Dashboard description override.

  5. Under Default filter overrides search for k8s.cluster.name and select the name of your Kubernetes cluster.

  6. Under Default filter overrides search for store.location and select the city you entered during the workshop setup. Mirror Dashboard Config

  7. Click Save

You will now be taken to the newly created dashboard which is a mirror of the Kubernetes Pods dashboard you created in the previous section. Any changes to the original dashboard will be reflected in this dashboard as well. This allows teams to have a consistent yet specific view of the systems they care about and any modifications or updates can be applied in a single location, significantly minimizing the effort needed when compared to updating each individual dashboard.

In the next section, you’ll add a new logs-based chart to the original dashboard and see how the dashboard mirror is automatically updated as well.

Last Modified Jul 23, 2025

Correlate Metrics and Logs

1 minute   Author Tim Hard

Correlating infrastructure metrics and logs is often a challenging task, primarily due to inconsistencies in naming conventions across various data sources, including hosts operating on different systems. However, leveraging the capabilities of OpenTelemetry can significantly simplify this process. With OpenTelemetry’s robust framework, which offers rich metadata and attribution, metrics, traces, and logs can seamlessly correlate using standardized field names. This automated correlation not only alleviates the burden of manual effort but also enhances the overall observability of the system.

By aligning metrics and logs based on common field names, teams gain deeper insights into system performance, enabling more efficient troubleshooting, proactive monitoring, and optimization of resources. In this workshop section, we’ll explore the importance of correlating metrics with logs and demonstrate how Splunk Observability Cloud empowers teams to unlock additional value from their observability data.

Log Observer

Last Modified Jul 23, 2025

Subsections of 4. Correlate Metrics and Logs

Correlate Metrics and Logs

5 minutes   Author Tim Hard

In this section, we’ll dive into the seamless correlation of metrics and logs facilitated by the robust naming standards offered by OpenTelemetry. By harnessing the power of OpenTelemetry within Splunk Observability Cloud, we’ll demonstrate how troubleshooting issues becomes significantly more efficient for Site Reliability Engineers (SREs) and operators. With this integration, contextualizing data across various telemetry sources no longer demands manual effort to correlate information. Instead, SREs and operators gain immediate access to the pertinent context they need, allowing them to swiftly pinpoint and resolve issues, improving system reliability and performance.

Exercise: View pod logs

The Kubernetes Pods Dashboard you created in the previous section already includes a chart that contains all of the pod logs for your Kubernetes Cluster. The log entries are split by container in this stacked bar chart. To view specific log entries, perform the following steps:

  1. On the Kubernetes Pods Dashboard click on one of the bar charts. A modal will open with the most recent log entries for the container you’ve selected.

    K8s pod logs

  2. Click one of the log entries.

    K8s pod log event

    Here we can see the entire log event with all of the fields and values. You can search for specific field names or values within the event itself using the Search for fields bar in the event.

  3. Enter the city you configured during the application deployment.

    K8s pod log field search

    The event will now be filtered to the store.location field. This feature is great for exploring large log entries for specific fields and values unique to your environment or to search for keywords like Error or Failure.

  4. Close the event using the X in the upper right corner.

  5. Click the Chart actions (three horizontal dots) on the Pod log event rate chart

  6. Click View in Log Observer

View in Log Observer View in Log Observer

This will take us to Log Observer. In the next section, you’ll create a chart based on log events and add it to the K8s Pod Dashboard you cloned in section 3.2 Dashboard Cloning. You’ll also see how this new chart is automatically added to the mirrored dashboard you created in section 3.3 Dashboard Mirroring.

Last Modified Jul 23, 2025

Create Log-based Chart

5 minutes   Author Tim Hard

In Log Observer, you can perform codeless queries on logs to detect the source of problems in your systems. You can also extract fields from logs to set up log processing rules and transform your data as it arrives or send data to Infinite Logging S3 buckets for future use. See What can I do with Log Observer? to learn more about Log Observer capabilities.

In this section, you’ll create a chart filtered to logs that include errors which will be added to the K8s Pod Dashboard you cloned in section 3.2 Dashboard Cloning.

Exercise: Create Log-based Chart

Because you drilled into Log Observer from the K8s Pod Dashboard in the previous section, the dashboard will already be filtered to your cluster and store location using the k8s.cluster.name and store.location fields and the bar chart is split by k8s.pod.name. To filter the dashboard to only logs that contain errors complete the following steps:

Log Observer can be filtered using Keywords or specific key-value pairs.

  1. In Log Observer click Add Filter along the top.

  2. Make sure you’ve selected Fields as the filter type and enter severity in the Find a field… search bar.

  3. Select severity from the fields list.

    You should now see a list of severities and the number of log entries for each.

  4. Under Top values, hover over Error and click the = button to apply the filter.

    Log Observer: Filter errors Log Observer: Filter errors

    The dashboard will now be filtered to only log entries with a severity of Error and the bar chart will be split by the Kubernetes Pod that contains the errors. Next, you’ll save the chart on your Kubernetes Pods Dashboard.

  5. In the upper right corner of the Log Observer dashboard click Save.

  6. Select Save to Dashboard.

  7. In the Chart name field enter a name for your chart.

  8. (Optional) In the Chart description field enter a description for your chart.

    Log Observer: Save Chart Name Log Observer: Save Chart Name

  9. Click Select Dashboard and search for the name of the Dashboard you cloned in section 3.2 Dashboard Cloning.

  10. Select the dashboard in the Dashboard Group for your email address.

    Log Observer: Select Dashboard Log Observer: Select Dashboard

  11. Click OK

  12. For the Chart type select Log timeline

  13. Click Save and go to the dashboard

You will now be taken to your Kubernetes Pods Dashboard where you should see the chart you just created for pod errors.

Log Errors Chart Log Errors Chart

Because you updated the original Kubernetes Pods Dashboard, your mirrored dashboard will also include this chart! You can see this by clicking the mirrored version of your dashboard along the top of the Dashboard Group for your user.

Log Errors Chart Log Errors Chart

Now that you’ve seen how content can be reused across teams through dashboard cloning and mirroring, and how metrics can easily be correlated with logs, let’s look at how to create alerts so your teams are notified when there is an issue with their infrastructure, services, or applications.

Last Modified Jul 23, 2025

Improve Timeliness of Alerts

1 minute   Author Tim Hard

When monitoring hybrid and cloud environments, ensuring timely alerts for critical infrastructure and applications poses a significant challenge. Typically, this involves crafting intricate queries, meticulously scheduling searches, and managing alerts across various monitoring solutions. Moreover, the proliferation of disparate alerts generated from identical data sources often results in unnecessary duplication, contributing to alert fatigue and noise within the monitoring ecosystem.

In this section, we’ll explore how Splunk Observability Cloud addresses these challenges by enabling the effortless creation of alert criteria. Leveraging its 10-second default data collection capability, alerts can be triggered swiftly, surpassing the timeliness achieved by traditional monitoring tools. This enhanced responsiveness not only reduces Mean Time to Detect (MTTD) but also accelerates Mean Time to Resolve (MTTR), ensuring that critical issues are promptly identified and remediated.

Detector Dashboard Detector Dashboard

Last Modified Jul 23, 2025

Subsections of 5. Improve Timeliness of Alerts

Create Custom Detector

10 minutes   Author Tim Hard

Splunk Observability Cloud provides detectors, events, alerts, and notifications to keep you informed when certain criteria are met. There are a number of pre-built AutoDetect Detectors that automatically surface when common problem patterns occur, such as when an EC2 instance’s CPU utilization is expected to reach its limit. You can also create custom detectors if you want something more optimized or specific. For example, you may want a message sent to a Slack channel, or to an email address for the Ops team that manages this Kubernetes cluster, when memory utilization on their pods has reached 85%.

Exercise: Create Custom Detector

In this section you’ll create a detector on Pod Memory Utilization which will trigger if utilization surpasses 85%

  1. On the Kubernetes Pods Dashboard you cloned in section 3.2 Dashboard Cloning, click the Get Alerts button (bell icon) for the Memory usage (%) chart -> Click New detector from chart.

    New Detector from Chart New Detector from Chart

  2. In the Create detector dialog, add your initials to the detector name.

    Create Detector: Update Detector Name Create Detector: Update Detector Name

  3. Click Create alert rule.

    These conditions are expressed as one or more rules that trigger an alert when the conditions in the rules are met. Importantly, multiple rules can be included in the same detector configuration which minimizes the total number of alerts that need to be created and maintained. You can see which signal this detector will alert on by the bell icon in the Alert On column. In this case, this detector will alert on the Memory Utilization for the pods running in this Kubernetes cluster.

    Alert Signal Alert Signal

  4. Click Proceed To Alert Conditions.

    Many pre-built alert conditions can be applied to the metric you want to alert on. This could be as simple as a static threshold or something more complex, for example, is memory usage deviating from the historical baseline across any of your 50,000 containers?

    Alert Conditions Alert Conditions

  5. Select Static Threshold.

  6. Click Proceed To Alert Settings.

    In this case, you want the alert to trigger if any pods exceed 85% memory utilization. Once you’ve set the alert condition, the configuration is back-tested against historical data so you can confirm that the alert configuration is accurate, meaning the alert will trigger on the criteria you’ve defined. This is also a great way to confirm whether the alert generates too much noise.

    Alert Settings Alert Settings

  7. Enter 85 in the Threshold field.

  8. Click Proceed To Alert Message.

    Next, you can set the severity for this alert, include links to runbooks and short tips on how to respond, and customize the message included in the alert details. The message can include parameterized fields from the actual data. For example, you may want to include which Kubernetes node the pod is running on, or the store.location configured when you deployed the application, to provide additional context.

    Alert Message Alert Message

  9. Click Proceed To Alert Recipients.

    You can choose where you want this alert to be sent when it triggers. This could be to a team, specific email addresses, or other systems such as ServiceNow, Slack, Splunk On-Call, or Splunk ITSI. You can also have the alert execute a webhook, which enables you to leverage automation or integrate with many other systems such as homegrown ticketing tools. For the purposes of this workshop, do not include a recipient.

    Alert Recipients Alert Recipients

  10. Click Proceed To Alert Activation.

    Activate Alert Activate Alert

  11. Click Activate Alert.

    Activate Alert Message Activate Alert Message

    You will receive a warning because no recipients were included in the Notification Policy for this detector. This warning can be dismissed.

  12. Click Save.

    Activate Alert Message Activate Alert Message

    You will be taken to your newly created detector where you can see any triggered alerts.

  13. In the upper right corner, Click Close to close the Detector.

The detector status and any triggered alerts will automatically be included in the chart because this detector was configured for this chart.

Alert Chart Alert Chart

Congratulations! You’ve successfully created a detector that will trigger if pod memory utilization exceeds 85%. After a few minutes, the detector should trigger some alerts. You can click the detector name in the chart to view the triggered alerts.
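Under the hood, a static-threshold detector simply watches each incoming datapoint and emits trigger/clear events as the signal crosses the threshold. This sketch is not the actual SignalFlow implementation, just the logic, using made-up utilization values sampled at the 10-second default interval:

```python
def evaluate_static_threshold(datapoints, threshold=85.0):
    """Return (timestamp, event) pairs as values cross above/below the threshold."""
    events = []
    in_alert = False
    for ts, value in datapoints:
        if value > threshold and not in_alert:
            events.append((ts, "triggered"))
            in_alert = True
        elif value <= threshold and in_alert:
            events.append((ts, "cleared"))
            in_alert = False
    return events

# Memory utilization % sampled every 10 seconds (hypothetical values).
series = [(0, 70.0), (10, 86.2), (20, 90.1), (30, 84.0), (40, 88.5)]
evaluate_static_threshold(series)
# [(10, 'triggered'), (30, 'cleared'), (40, 'triggered')]
```

The 10-second collection interval matters here: with one-minute polling, the same crossings could go undetected for up to six times longer.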

Last Modified Jul 23, 2025

Conclusion

1 minute  

Today you’ve seen how Splunk Observability Cloud can help you overcome many of the challenges of monitoring hybrid and cloud environments. You’ve seen how Splunk Observability Cloud streamlines operations with standardized data collection and tags, ensuring consistency across all IT infrastructure. Unified Service Telemetry is a game-changer, providing in-context metrics, logs, and trace data that make troubleshooting swift and efficient. By enabling the reuse of content across teams, you’re minimizing technical debt and bolstering the performance of your monitoring systems.

Happy Splunking!

Dancing Buttercup Dancing Buttercup

Last Modified Jul 23, 2025

Debug Problems in Microservices

  • This workshop shows how tags can be used to reduce the time required for SREs to isolate issues across services, so they know which team to engage to troubleshoot the issue further, and can provide context to help engineering get a head start on debugging.
  • This workshop shows how Database Query Performance and AlwaysOn Profiling can be used to reduce the time required for engineers to debug problems in microservices.
Last Modified Oct 13, 2025

Subsections of Debug Problems in Microservices

Tagging Workshop

2 minutes   Author Derek Mitchell

Splunk Observability Cloud includes powerful features that dramatically reduce the time required for SREs to isolate issues across services, so they know which team to engage to troubleshoot the issue further, and can provide context to help engineering get a head start on debugging.

Unlocking these features requires tags to be included with your application traces. But how do you know which tags are the most valuable and how do you capture them?

In this workshop, we’ll explore:

  • What are tags and why are they such a critical part of making an application observable.
  • How to use OpenTelemetry to capture tags of interest from your application.
  • How to index tags in Splunk Observability Cloud and the differences between Troubleshooting MetricSets and Monitoring MetricSets.
  • How to utilize tags in Splunk Observability Cloud to find β€œunknown unknowns” using the Tag Spotlight and Dynamic Service Map features.
  • How to utilize tags for alerting and dashboards.

The workshop uses a simple microservices-based application to illustrate these concepts. Let’s get started!

Tip

The easiest way to navigate through this workshop is by using:

  • the left/right arrows (< | >) on the top right of this page
  • the left (◀️) and right (▢️) cursor keys on your keyboard
Last Modified Jul 23, 2025

Subsections of Tagging Workshop

Build the Sample Application

10 minutes  

Introduction

For this workshop, we’ll be using a microservices-based application. This application is for an online retailer and normally includes more than a dozen services. However, to keep the workshop simple, we’ll be focusing on two services used by the retailer as part of their payment processing workflow: the credit check service and the credit processor service.

Pre-requisites

You will start with an EC2 instance and perform some initial steps in order to get to the following state:

  • Deploy the Splunk distribution of the OpenTelemetry Collector
  • Build and deploy creditcheckservice and creditprocessorservice
  • Deploy a load generator to send traffic to the services

Initial Steps

The initial setup can be completed by executing the following steps on the command line of your EC2 instance:

cd workshop/tagging
./0-deploy-collector-with-services.sh
Java

There are implementations in multiple languages available for creditcheckservice. Run

./0-deploy-collector-with-services.sh java

to pick Java over Python.

View your application in Splunk Observability Cloud

Now that the setup is complete, let’s confirm that it’s sending data to Splunk Observability Cloud. Note that when the application is deployed for the first time, it may take a few minutes for the data to appear.

Navigate to APM, then use the Environment dropdown to select your environment (i.e. tagging-workshop-instancename).

If everything was deployed correctly, you should see creditprocessorservice and creditcheckservice displayed in the list of services:

APM Overview APM Overview

Click on Explore on the right-hand side to view the service map. We can see that the creditcheckservice makes calls to the creditprocessorservice, with an average response time of at least 3 seconds:

Service Map Service Map

Next, click on Traces on the right-hand side to see the traces captured for this application. You’ll see that some traces run relatively fast (i.e. just a few milliseconds), whereas others take a few seconds.

Traces Traces

If you toggle Errors only to on, you’ll also notice that some traces have errors:

Traces Traces

Toggle Errors only back to off and sort the traces by duration, then click on one of the longer running traces. In this example, the trace took five seconds, and we can see that most of the time was spent calling the /runCreditCheck operation, which is part of the creditprocessorservice.

Long Running Trace Long Running Trace

Currently, we don’t have enough details in our traces to understand why some requests finish in a few milliseconds, and others take several seconds. To provide the best possible customer experience, this will be critical for us to understand.

We also don’t have enough information to understand why some requests result in errors, and others don’t. For example, if we look at one of the error traces, we can see that the error occurs when the creditprocessorservice attempts to call another service named otherservice. But why do some requests result in a call to otherservice, and others don’t?

Trace with Errors Trace with Errors

We’ll explore these questions and more in the workshop.

Last Modified Jul 23, 2025

What are Tags?

3 minutes  

To understand why some requests have errors or slow performance, we’ll need to add context to our traces. We’ll do this by adding tags. But first, let’s take a moment to discuss what tags are, and why they’re so important for observability.

What are tags?

Tags are key-value pairs that provide additional metadata about spans in a trace, allowing you to enrich the context of the spans you send to Splunk APM.

For example, a payment processing application would find it helpful to track:

  • The payment type used (i.e. credit card, gift card, etc.)
  • The ID of the customer that requested the payment

This way, if errors or performance issues occur while processing the payment, we have the context we need for troubleshooting.

While some tags can be added with the OpenTelemetry collector, the ones we’ll be working with in this workshop are more granular, and are added by application developers using the OpenTelemetry API.

Attributes vs. Tags

A note about terminology before we proceed. While this workshop is about tags, and this is the terminology we use in Splunk Observability Cloud, OpenTelemetry uses the term attributes instead. So when you see tags mentioned throughout this workshop, you can treat them as synonymous with attributes.

Why are tags so important?

Tags are essential for an application to be truly observable. As we saw with our credit check service, some users are having a great experience: fast with no errors. But other users get a slow experience or encounter errors.

Tags add the context to the traces to help us understand why some users get a great experience and others don’t. And powerful features in Splunk Observability Cloud utilize tags to help you jump quickly to root cause.

Sneak Peek: Tag Spotlight

Tag Spotlight uses tags to discover trends that contribute to high latency or error rates:

Tag Spotlight Preview Tag Spotlight Preview

The screenshot above provides an example of Tag Spotlight from another application.

Splunk has analyzed all of the tags included as part of traces that involve the payment service.

It tells us very quickly whether some tag values have more errors than others.

If we look at the version tag, we can see that version 350.10 of the service has a 100% error rate, whereas version 350.9 of the service has no errors at all:

Tag Spotlight Preview Tag Spotlight Preview

We’ll be using Tag Spotlight with the credit check service later on in the workshop, once we’ve captured some tags of our own.

Last Modified Jul 23, 2025

Capture Tags with OpenTelemetry

15 minutes  

Please proceed to one of the subsections for Java or Python. Ask your instructor for the one used during the workshop!

Last Modified Jul 23, 2025

Subsections of 3. Capture Tags with OpenTelemetry

1. Capture Tags - Java

15 minutes  

Let’s add some tags to our traces, so we can find out why some customers receive a poor experience from our application.

Identify Useful Tags

We’ll start by reviewing the code for the creditCheck function of creditcheckservice (which can be found in the file /home/splunk/workshop/tagging/creditcheckservice-java/src/main/java/com/example/creditcheckservice/CreditCheckController.java):

@GetMapping("/check")
public ResponseEntity<String> creditCheck(@RequestParam("customernum") String customerNum) {
    // Get Credit Score
    int creditScore;
    try {
        String creditScoreUrl = "http://creditprocessorservice:8899/getScore?customernum=" + customerNum;
        creditScore = Integer.parseInt(restTemplate.getForObject(creditScoreUrl, String.class));
    } catch (HttpClientErrorException e) {
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body("Error getting credit score");
    }

    String creditScoreCategory = getCreditCategoryFromScore(creditScore);

    // Run Credit Check
    String creditCheckUrl = "http://creditprocessorservice:8899/runCreditCheck?customernum=" + customerNum + "&score=" + creditScore;
    String checkResult;
    try {
        checkResult = restTemplate.getForObject(creditCheckUrl, String.class);
    } catch (HttpClientErrorException e) {
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body("Error running credit check");
    }

    return ResponseEntity.ok(checkResult);
}

We can see that this function accepts a customer number as an input. This would be helpful to capture as part of a trace. What else would be helpful?

Well, the credit score returned for this customer by the creditprocessorservice may be interesting (we want to ensure we don’t capture any PII data though). It would also be helpful to capture the credit score category, and the credit check result.

Great, we’ve identified four tags to capture from this service that could help with our investigation. But how do we capture these?

Capture Tags

We start by adding OpenTelemetry imports to the top of the CreditCheckController.java file:

...
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.instrumentation.annotations.WithSpan;
import io.opentelemetry.instrumentation.annotations.SpanAttribute;

Next, we use the @WithSpan annotation to produce a span for creditCheck:

@GetMapping("/check")
@WithSpan // ADDED
public ResponseEntity<String> creditCheck(@RequestParam("customernum") String customerNum) {
    ...

We can now get a reference to the current span and add an attribute (aka tag) to it:

...
try {
    String creditScoreUrl = "http://creditprocessorservice:8899/getScore?customernum=" + customerNum;
    creditScore = Integer.parseInt(restTemplate.getForObject(creditScoreUrl, String.class));
} catch (HttpClientErrorException e) {
    return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body("Error getting credit score");
}
Span currentSpan = Span.current(); // ADDED
currentSpan.setAttribute("credit.score", creditScore); // ADDED
...

That was pretty easy, right? Let’s capture some more, with the final result looking like this:

@GetMapping("/check")
@WithSpan(kind=SpanKind.SERVER)
public ResponseEntity<String> creditCheck(@RequestParam("customernum")
                                          @SpanAttribute("customer.num")
                                          String customerNum) {
    // Get Credit Score
    int creditScore;
    try {
        String creditScoreUrl = "http://creditprocessorservice:8899/getScore?customernum=" + customerNum;
        creditScore = Integer.parseInt(restTemplate.getForObject(creditScoreUrl, String.class));
    } catch (HttpClientErrorException e) {
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body("Error getting credit score");
    }
    Span currentSpan = Span.current();
    currentSpan.setAttribute("credit.score", creditScore);

    String creditScoreCategory = getCreditCategoryFromScore(creditScore);
    currentSpan.setAttribute("credit.score.category", creditScoreCategory);

    // Run Credit Check
    String creditCheckUrl = "http://creditprocessorservice:8899/runCreditCheck?customernum=" + customerNum + "&score=" + creditScore;
    String checkResult;
    try {
        checkResult = restTemplate.getForObject(creditCheckUrl, String.class);
    } catch (HttpClientErrorException e) {
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body("Error running credit check");
    }
    currentSpan.setAttribute("credit.check.result", checkResult);

    return ResponseEntity.ok(checkResult);
}

Redeploy Service

Once these changes are made, let’s run the following script to rebuild the Docker image used for creditcheckservice and redeploy it to our Kubernetes cluster:

./5-redeploy-creditcheckservice.sh java

Confirm Tag is Captured Successfully

After a few minutes, return to Splunk Observability Cloud and load one of the latest traces to confirm that the tags were captured successfully (hint: sort by the timestamp to find the latest traces):

Trace with Attributes Trace with Attributes

Well done, you’ve leveled up your OpenTelemetry game and have added context to traces using tags.

Next, we’re ready to see how you can use these tags with Splunk Observability Cloud!

Last Modified Jul 23, 2025

2. Capture Tags - Python

15 minutes  

Let’s add some tags to our traces, so we can find out why some customers receive a poor experience from our application.

Identify Useful Tags

We’ll start by reviewing the code for the credit_check function of creditcheckservice (which can be found in the /home/splunk/workshop/tagging/creditcheckservice-py/main.py file):

@app.route('/check')
def credit_check():
    customerNum = request.args.get('customernum')

    # Get Credit Score
    creditScoreReq = requests.get("http://creditprocessorservice:8899/getScore?customernum=" + customerNum)
    creditScoreReq.raise_for_status()
    creditScore = int(creditScoreReq.text)

    creditScoreCategory = getCreditCategoryFromScore(creditScore)

    # Run Credit Check
    creditCheckReq = requests.get("http://creditprocessorservice:8899/runCreditCheck?customernum=" + str(customerNum) + "&score=" + str(creditScore))
    creditCheckReq.raise_for_status()
    checkResult = str(creditCheckReq.text)

    return checkResult

We can see that this function accepts a customer number as an input. This would be helpful to capture as part of a trace. What else would be helpful?

Well, the credit score returned for this customer by the creditprocessorservice may be interesting (we want to ensure we don’t capture any PII data though). It would also be helpful to capture the credit score category, and the credit check result.

Great, we’ve identified four tags to capture from this service that could help with our investigation. But how do we capture these?

Capture Tags

We start by importing the trace module, adding an import statement to the top of the creditcheckservice-py/main.py file:

import requests
from flask import Flask, request
from waitress import serve
from opentelemetry import trace  # <--- ADDED BY WORKSHOP
...

Next, we need to get a reference to the current span so we can add an attribute (aka tag) to it:

def credit_check():
    current_span = trace.get_current_span()  # <--- ADDED BY WORKSHOP
    customerNum = request.args.get('customernum')
    current_span.set_attribute("customer.num", customerNum)  # <--- ADDED BY WORKSHOP
...

That was pretty easy, right? Let’s capture some more, with the final result looking like this:

def credit_check():
    current_span = trace.get_current_span()  # <--- ADDED BY WORKSHOP
    customerNum = request.args.get('customernum')
    current_span.set_attribute("customer.num", customerNum)  # <--- ADDED BY WORKSHOP

    # Get Credit Score
    creditScoreReq = requests.get("http://creditprocessorservice:8899/getScore?customernum=" + customerNum)
    creditScoreReq.raise_for_status()
    creditScore = int(creditScoreReq.text)
    current_span.set_attribute("credit.score", creditScore)  # <--- ADDED BY WORKSHOP

    creditScoreCategory = getCreditCategoryFromScore(creditScore)
    current_span.set_attribute("credit.score.category", creditScoreCategory)  # <--- ADDED BY WORKSHOP

    # Run Credit Check
    creditCheckReq = requests.get("http://creditprocessorservice:8899/runCreditCheck?customernum=" + str(customerNum) + "&score=" + str(creditScore))
    creditCheckReq.raise_for_status()
    checkResult = str(creditCheckReq.text)
    current_span.set_attribute("credit.check.result", checkResult)  # <--- ADDED BY WORKSHOP

    return checkResult
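The getCreditCategoryFromScore helper referenced above isn’t shown in the snippet; conceptually, it maps a numeric score to a named band. A hypothetical sketch of such a mapping follows; the threshold values here are illustrative assumptions, not the workshop’s actual ones (the workshop’s categories include impossible, poor, and exceptional):

```python
def getCreditCategoryFromScore(score):
    """Map a numeric credit score to a category name (illustrative thresholds)."""
    if score < 300:
        return "impossible"
    elif score < 580:
        return "poor"
    elif score < 670:
        return "fair"
    elif score < 740:
        return "good"
    elif score < 800:
        return "very good"
    return "exceptional"
```

Whatever the exact thresholds, the important property for observability is that the helper collapses a high-cardinality value (the raw score) into a low-cardinality one (the category), which will matter when we index tags later.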

Redeploy Service

Once these changes are made, let’s run the following script to rebuild the Docker image used for creditcheckservice and redeploy it to our Kubernetes cluster:

./5-redeploy-creditcheckservice.sh

Confirm Tag is Captured Successfully

After a few minutes, return to Splunk Observability Cloud and load one of the latest traces to confirm that the tags were captured successfully (hint: sort by the timestamp to find the latest traces):

Trace with Attributes Trace with Attributes

Well done, you’ve leveled up your OpenTelemetry game and have added context to traces using tags.

Next, we’re ready to see how you can use these tags with Splunk Observability Cloud!

Last Modified Jul 23, 2025

Explore Trace Data

5 minutes  

Now that we’ve captured several tags from our application, let’s explore some of the trace data we’ve captured that include this additional context, and see if we can identify what’s causing a poor user experience in some cases.

Use Trace Analyzer

Navigate to APM, then select Traces. This takes us to the Trace Analyzer, where we can add filters to search for traces of interest. For example, we can filter on traces where the credit score starts with 7:

Credit Check Starts with Seven Credit Check Starts with Seven

If you load one of these traces, you’ll see that the credit score indeed starts with seven.

We can apply similar filters for the customer number, credit score category, and credit score result.

Explore Traces With Errors

Let’s remove the credit score filter and toggle Errors only to on, which results in a list of only those traces where an error occurred:

Traces with Errors Only Traces with Errors Only

Click on a few of these traces, and look at the tags we captured. Do you notice any patterns?

Next, toggle Errors only to off, and sort traces by duration. Look at a few of the slowest running traces, and compare them to the fastest running traces. Do you notice any patterns?

If you found a pattern that explains the slow performance and errors - great job! But keep in mind that this is a difficult way to troubleshoot, as it requires you to look through many traces and mentally keep track of what you saw, so you can identify a pattern.

Thankfully, Splunk Observability Cloud provides a more efficient way to do this, which we’ll explore next.

Last Modified Jul 23, 2025

Index Tags

5 minutes  

Index Tags

To use advanced features in Splunk Observability Cloud such as Tag Spotlight, we’ll need to first index one or more tags.

To do this, navigate to Settings -> APM MetricSets. Then click the + New MetricSet button.

Let’s index the credit.score.category tag by entering the following details (note: since everyone in the workshop is using the same organization, the instructor will do this step on your behalf):

Create Troubleshooting MetricSet Create Troubleshooting MetricSet

Click Start Analysis to proceed.

The tag will appear in the list of Pending MetricSets while analysis is performed.

Pending MetricSets Pending MetricSets

Once analysis is complete, click on the checkmark in the Actions column.

MetricSet Configuration Applied MetricSet Configuration Applied

How to choose tags for indexing

Why did we choose to index the credit.score.category tag and not the others?

To understand this, let’s review the primary use cases for tags:

  • Filtering
  • Grouping

Filtering

With the filtering use case, we can use the Trace Analyzer capability of Splunk Observability Cloud to filter on traces that match a particular tag value.

We saw an example of this earlier, when we filtered on traces where the credit score started with 7.

Or if a customer calls in to complain about slow service, we could use Trace Analyzer to locate all traces with that particular customer number.

Tags used for filtering use cases are generally high-cardinality, meaning that there could be thousands or even hundreds of thousands of unique values. In fact, Splunk Observability Cloud can handle an effectively infinite number of unique tag values! Filtering using these tags allows us to rapidly locate the traces of interest.

Note that we aren’t required to index tags to use them for filtering with Trace Analyzer.

Grouping

With the grouping use case, we can use Trace Analyzer to group traces by a particular tag.

But we can also go beyond this and surface trends for tags that we collect using the powerful Tag Spotlight feature in Splunk Observability Cloud, which we’ll see in action shortly.

Tags used for grouping use cases should be low- to medium-cardinality, with at most hundreds of unique values.

For custom tags to be used with Tag Spotlight, they first need to be indexed.

We decided to index the credit.score.category tag because it has a few distinct values that would be useful for grouping. In contrast, the customer number and credit score tags have hundreds or thousands of unique values, and are more valuable for filtering use cases rather than grouping.
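The decision above can be framed as a simple cardinality check. This sketch (with a hypothetical cutoff of 1,000 distinct values; choose your own limit based on the guidance above) classifies a tag by counting its unique values:

```python
def classify_tag(values, grouping_limit=1000):
    """Suggest a primary use case for a tag based on its cardinality."""
    cardinality = len(set(values))
    return "grouping" if cardinality <= grouping_limit else "filtering"

# Hypothetical tag values mirroring the workshop's tags.
categories = ["poor", "fair", "good", "exceptional"] * 250   # 4 distinct values
customer_nums = [str(n) for n in range(50_000)]              # 50,000 distinct values

classify_tag(categories)     # "grouping"  -> worth indexing for Tag Spotlight
classify_tag(customer_nums)  # "filtering" -> use with Trace Analyzer, no index needed
```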

Troubleshooting vs. Monitoring MetricSets

You may have noticed that, to index this tag, we created something called a Troubleshooting MetricSet. It’s named this way because a Troubleshooting MetricSet, or TMS, allows us to troubleshoot issues with this tag using features such as Tag Spotlight.

You may have also noticed that there’s another option which we didn’t choose called a Monitoring MetricSet (or MMS). Monitoring MetricSets go beyond troubleshooting and allow us to use tags for alerting and dashboards. We’ll explore this concept later in the workshop.

Last Modified Jul 23, 2025

Use Tags for Troubleshooting

5 minutes  

Using Tag Spotlight

Now that we’ve indexed the credit.score.category tag, we can use it with Tag Spotlight to troubleshoot our application.

Navigate to APM then click on Tag Spotlight on the right-hand side. Ensure the creditcheckservice service is selected from the Service drop-down (if not already selected).

With Tag Spotlight, we can see 100% of credit score requests that result in a score of impossible have an error, yet requests for all other credit score types have no errors at all!

Tag Spotlight with Errors Tag Spotlight with Errors

This illustrates the power of Tag Spotlight! Finding this pattern would be time-consuming without it, as we’d have to manually look through hundreds of traces to identify the pattern (and even then, there’s no guarantee we’d find it).

We’ve looked at errors, but what about latency? Let’s click on Latency near the top of the screen to find out.

Here, we can see that requests with a poor credit score are running slowly, with P50, P90, and P99 times of around 3 seconds, which is too long for our users to wait, and much slower than other requests.

We can also see that some requests with an exceptional credit score are running slowly, with P99 times of around 5 seconds, though the P50 response time is relatively quick.

Tag Spotlight with Latency Tag Spotlight with Latency

Using Dynamic Service Maps

Now that we know the credit score category associated with the request can impact performance and error rates, let’s explore another feature that utilizes indexed tags: Dynamic Service Maps.

With Dynamic Service Maps, we can break down a particular service by a tag. For example, let’s click on APM, then click Explore to view the service map.

Click on creditcheckservice. Then, on the right-hand menu, click on the drop-down that says Breakdown, and select the credit.score.category tag.

At this point, the service map is updated dynamically, and we can see the performance of requests hitting creditcheckservice broken down by the credit score category:

Service Map Breakdown Service Map Breakdown

This view makes it clear that performance for good and fair credit scores is excellent, while poor and exceptional scores are much slower, and impossible scores result in errors.

Summary

Tag Spotlight has uncovered several interesting patterns for the engineers that own this service to explore further:

  • Why are all the impossible credit score requests resulting in error?
  • Why are all the poor credit score requests running slowly?
  • Why do some of the exceptional requests run slowly?

As an SRE, passing this context to the engineering team would be extremely helpful for their investigation, as it would allow them to track down the issue much more quickly than if we simply told them that the service was “sometimes slow”.

If you’re curious, have a look at the source code for the creditprocessorservice. You’ll see that requests with impossible, poor, and exceptional credit scores are handled differently, thus resulting in the differences in error rates and latency that we uncovered.

The behavior we saw with our application is typical for modern cloud-native applications, where different inputs passed to a service lead to different code paths, some of which result in slower performance or errors. For example, in a real credit check service, requests resulting in low credit scores may be sent to another downstream service to further evaluate risk, and may perform more slowly than requests resulting in higher scores, or encounter higher error rates.
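The idea of input-dependent code paths can be illustrated with a small sketch. This is hypothetical code, not the actual creditprocessorservice implementation; it only mirrors the behavior we observed in Tag Spotlight:

```java
public class CreditCheckPaths {
    // Hypothetical sketch (not the real creditprocessorservice code):
    // different tag values take different code paths, which explains the
    // latency and error differences surfaced by Tag Spotlight.
    static String handle(String category) {
        switch (category) {
            case "impossible":  return "error";      // always fails
            case "poor":
            case "exceptional": return "slow path";  // extra downstream work
            default:            return "fast path";  // good, fair, etc.
        }
    }

    public static void main(String[] args) {
        for (String c : new String[] {"impossible", "poor", "good", "exceptional"}) {
            System.out.println(c + " -> " + handle(c));
        }
    }
}
```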

Last Modified Jul 23, 2025

Use Tags for Monitoring

15 minutes  

Earlier, we created a Troubleshooting Metric Set on the credit.score.category tag, which allowed us to use Tag Spotlight with that tag and identify a pattern to explain why some users received a poor experience.

In this section of the workshop, we’ll explore a related concept: Monitoring MetricSets.

What are Monitoring MetricSets?

Monitoring MetricSets go beyond troubleshooting and allow us to use tags for alerting, dashboards and SLOs.

Create a Monitoring MetricSet

(note: your workshop instructor will do the following for you, but observe the steps)

Let’s navigate to Settings -> APM MetricSets, and click the edit button (i.e. the little pencil) beside the MetricSet for credit.score.category.

edit APM MetricSet edit APM MetricSet

Check the box beside Also create Monitoring MetricSet then click Start Analysis

Monitoring MetricSet Monitoring MetricSet

The credit.score.category tag appears again as a Pending MetricSet. After a few moments, a checkmark should appear. Click this checkmark to enable the Pending MetricSet.

pending APM MetricSet pending APM MetricSet

Using Monitoring MetricSets

Creating a Monitoring MetricSet adds a new dimension, derived from the tag, to a number of APM metrics, and those metrics can then be filtered and grouped by the values of that dimension. Important: to differentiate the dimension from the original tag, the dots in the tag name are replaced by underscores. The credit.score.category tag therefore becomes a dimension named credit_score_category.
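The renaming convention is simple enough to express in code. A minimal sketch (the method name is our own, chosen for illustration):

```java
public class DimensionName {
    // When a Monitoring MetricSet is created, dots in the tag name are
    // replaced by underscores to form the metric dimension name.
    static String fromTag(String tagName) {
        return tagName.replace('.', '_');
    }

    public static void main(String[] args) {
        System.out.println(fromTag("credit.score.category")); // prints credit_score_category
    }
}
```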

Next, let’s explore how we can use this Monitoring MetricSet.

Last Modified Jul 23, 2025

Subsections of 7. Use Tags for Monitoring

Use Tags with Dashboards

5 minutes  

Dashboards

Navigate to Metric Finder, then type in the name of the tag, which is credit_score_category (remember that the dots in the tag name were replaced by underscores when the Monitoring MetricSet was created). You’ll see that multiple metrics include this tag as a dimension:

Metric Finder Metric Finder

By default, Splunk Observability Cloud calculates several metrics using the trace data it receives. See Learn about MetricSets in APM for more details.

By creating an MMS, credit_score_category was added as a dimension to these metrics, which means that this dimension can now be used for alerting and dashboards.

To see how, let’s click on the metric named service.request.duration.ns.p99, which brings up the following chart:

Service Request Duration Service Request Duration

Add filters for sf_environment, sf_service, and sf_dimensionalized. Then set the Extrapolation policy to Last value and the Display units to Nanosecond:

Chart with Seconds Chart with Seconds

With these settings, the chart allows us to visualize the service request duration by credit score category:

Duration by Credit Score Duration by Credit Score

Now we can see the duration by credit score category. In my example, the red line represents the exceptional category, and we can see that the duration for these requests sometimes goes all the way up to 5 seconds.

The orange represents the very good category, and has very fast response times.

The green line represents the poor category, and has response times between 2-3 seconds.

It may be useful to save this chart on a dashboard for future reference. To do this, click on the Save as… button and provide a name for the chart:

Save Chart As Save Chart As

When asked which dashboard to save the chart to, let’s create a new one named Credit Check Service - Your Name (substituting your actual name):

Save Chart As Save Chart As

Now we can see the chart on our dashboard, and can add more charts as needed to monitor our credit check service:

Credit Check Service Dashboard Credit Check Service Dashboard

Last Modified Jul 23, 2025

Use Tags with Alerting

3 minutes  

Alerts

It’s great that we have a dashboard to monitor the response times of the credit check service by credit score, but we don’t want to stare at a dashboard all day.

Let’s create an alert so we can be notified proactively if customers with exceptional credit scores encounter slow requests.

To create this alert, click on the little bell on the top right-hand corner of the chart, then select New detector from chart:

New Detector From Chart New Detector From Chart

Let’s call the detector Latency by Credit Score Category. Set the environment to your environment name (i.e. tagging-workshop-yourname) then select creditcheckservice as the service. Since we only want to look at performance for customers with exceptional credit scores, add a filter using the credit_score_category dimension and select exceptional:

Create New Detector Create New Detector

For the alert condition, select Sudden Change instead of Static threshold to make the example more vivid.

Alert Condition: Sudden Change Alert Condition: Sudden Change

We can then set the remainder of the alert details as we normally would. The key thing to remember here is that without capturing a tag with the credit score category and indexing it, we wouldn’t be able to alert at this granular level, but would instead be forced to bucket all customers together, regardless of their importance to the business.

Unless you actually want to be notified, there is no need to finish this wizard. Just close it by clicking the X in the top right corner of the pop-up.

Last Modified Jul 23, 2025

Use Tags with Service Level Objectives

10 minutes  

We can now use the Monitoring MetricSet we created together with Service Level Objectives, in much the same way we used it with dashboards and detectors/alerts before. First, let’s be clear about some key concepts:

Key Concepts of Service Level Monitoring

(skip if you know this)

  • Service level indicator (SLI): A quantitative measurement of some aspect of a service’s health, expressed as a metric or combination of metrics. Examples: proportion of requests that resulted in a successful response (availability SLI); proportion of requests that loaded in < 100 ms (performance SLI).
  • Service level objective (SLO): A target for an SLI and a compliance period over which that target must be met. An SLO contains 3 elements: an SLI, a target, and a compliance period. Compliance periods can be calendar, such as monthly, or rolling, such as past 30 days. Examples: availability SLI over a calendar period: our service must respond successfully to 95% of requests in a month; performance SLI over a rolling period: our service must respond to 99% of requests in < 100 ms over a 7-day period.
  • Service level agreement (SLA): A contractual agreement that indicates service levels your users can expect from your organization. If an SLA is not met, there can be financial consequences. Example: a customer service SLA indicates that 90% of support requests received on a normal support day must have a response within 6 hours.
  • Error budget: A measurement of how your SLI performs relative to your SLO over a period of time; it measures the difference between actual and desired performance. It determines how unreliable your service might be during this period and serves as a signal when you need to take corrective action. Example: our service can respond to 1% of requests in > 100 ms over a 7-day period.
  • Burn rate: A unitless measurement of how quickly a service consumes the error budget during the compliance window of the SLO. Burn rate makes the SLO and error budget actionable, showing service owners when a current incident is serious enough to page an on-call responder. Example: for an SLO with a 30-day compliance window, a constant burn rate of 1 means your error budget is used up in exactly 30 days.
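The error budget and burn rate definitions above amount to a small calculation, sketched below with hypothetical numbers (the method names are our own):

```java
public class BurnRate {
    // Error budget: the fraction of requests allowed to violate the SLI.
    static double errorBudget(double sloTarget) {
        return 1.0 - sloTarget;
    }

    // Burn rate: observed bad-request fraction divided by the budget.
    // A constant burn rate of 1 exhausts the budget in exactly one window.
    static double burnRate(double sloTarget, double badFraction) {
        return badFraction / errorBudget(sloTarget);
    }

    public static void main(String[] args) {
        // Hypothetical: 99% latency SLO, and 2% of requests exceed 100 ms.
        // The budget is being burned at twice the sustainable rate.
        System.out.printf("burn rate = %.1f%n", burnRate(0.99, 0.02));
    }
}
```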

Creating a new Service Level Objective

There is an easy-to-follow wizard for creating a new Service Level Objective (SLO). In the left navigation, follow the link Detectors & SLOs. From there, select the third tab, SLOs, and click the blue Create SLO button on the right.

Create new SLO Create new SLO

The wizard guides you through a few easy steps. If everything in the previous sections worked out, you will have no problems here. ;)

In our case we want to use Service & endpoint as our Metric type instead of Custom metric. We filter the Environment down to the environment that we are using during this workshop (i.e. tagging-workshop-yourname) and select the creditcheckservice from the Service and endpoint list. Our Indicator type for this workshop will be Request latency and not Request success.

Now we can select our Filters. Since we are using Request latency as the Indicator type, and that is a metric of the APM service, we can filter on credit.score.category. Feel free to try out what happens when you set the Indicator type to Request success.

Today we are only interested in our exceptional credit scores, so please select that as the filter.

Choose Service or Metric for SLO Choose Service or Metric for SLO

In the next step we define the objective we want to reach. For the Request latency type, we define the Target (%), the Latency (ms), and the Compliance Window. Please set these to 99, 100, and Last 7 days, respectively. This will give us a good idea of what we are already achieving.

The numbers may come as a shock at first. Feel free to play around with them to see how well we achieve the objective and how much error budget we have left to burn.

Define Objective for SLO Define Objective for SLO

The third step gives us the chance to alert (aka annoy) people who should be aware of these SLOs so they can initiate countermeasures. These “people” can also be mechanisms such as ITSM systems or webhooks that trigger automatic remediation steps.

Activate all categories you want to alert on and add recipients to the different alerts.

Define Alerting for SLO Define Alerting for SLO

The next step is naming the SLO. Have your own naming convention ready for this. In our case we would name it creditcheckservice:score:exceptional:YOURNAME and click the Create button. Alternatively, you can cancel the wizard by clicking anything in the left navigation and confirming Discard changes.

Name and Save the SLO Name and Save the SLO

And with that, we have (nearly) successfully created an SLO, including alerting in case we miss our goals.

Last Modified Jul 23, 2025

Summary

2 minutes  

In this workshop, we learned the following:

  • What are tags and why are they such a critical part of making an application observable?
  • How to use OpenTelemetry to capture tags of interest from your application.
  • How to index tags in Splunk Observability Cloud and the differences between Troubleshooting MetricSets and Monitoring MetricSets.
  • How to utilize tags in Splunk Observability Cloud to find “unknown unknowns” using the Tag Spotlight and Dynamic Service Map features.
  • How to utilize tags for dashboards, alerting and service level objectives.

Collecting tags aligned with the best practices shared in this workshop will let you get even more value from the data you’re sending to Splunk Observability Cloud. Now that you’ve completed this workshop, you have the knowledge you need to start collecting tags from your own applications!

To get started with capturing tags today, check out how to add tags in various supported languages, and then how to use them to create Troubleshooting MetricSets so they can be analyzed in Tag Spotlight. For more help, feel free to ask a Splunk Expert.

Tip for Workshop Facilitator(s)

Once the workshop is complete, remember to delete the APM MetricSet you created earlier for the credit.score.category tag.

Last Modified Jul 23, 2025

Profiling Workshop

2 minutes   Author Derek Mitchell

Service Maps and Traces are extremely valuable in determining what service an issue resides in. And related log data helps provide detail on why issues are occurring in that service.

But engineers sometimes need to go even deeper to debug a problem that’s occurring in one of their services.

This is where features such as Splunk’s AlwaysOn Profiling and Database Query Performance come in.

AlwaysOn Profiling continuously collects stack traces so that you can discover which lines in your code are consuming the most CPU and memory.

And Database Query Performance can quickly identify long-running, unoptimized, or heavy queries and mitigate issues they might be causing.

In this workshop, we’ll explore:

  • How to debug an application with several performance issues.
  • How to use Database Query Performance to find slow-running queries that impact application performance.
  • How to enable AlwaysOn Profiling and use it to find the code that consumes the most CPU and memory.
  • How to apply fixes based on findings from Splunk Observability Cloud and verify the result.

The workshop uses a Java-based application called The Door Game hosted in Kubernetes. Let’s get started!

Tip

The easiest way to navigate through this workshop is by using:

  • the left/right arrows (< | >) on the top right of this page
  • the left (◀️) and right (▢️) cursor keys on your keyboard
Last Modified Jul 23, 2025

Subsections of Profiling Workshop

Build the Sample Application

10 minutes  

Introduction

For this workshop, we’ll be using a Java-based application called The Door Game. It will be hosted in Kubernetes.

Pre-requisites

You will start with an EC2 instance and perform some initial steps in order to get to the following state:

  • Deploy the Splunk distribution of the OpenTelemetry Collector
  • Deploy the MySQL database container and populate data
  • Build and deploy the doorgame application container

Initial Steps

The initial setup can be completed by executing the following steps on the command line of your EC2 instance.

You’ll be asked to enter a name for your environment. Please use profiling-workshop-yourname (where yourname is replaced by your actual name).

cd workshop/profiling
./1-deploy-otel-collector.sh
./2-deploy-mysql.sh
./3-deploy-doorgame.sh

Let’s Play The Door Game

Now that the application is deployed, let’s play with it and generate some observability data.

You should be able to access The Door Game application by pointing your browser to port 81 of the IP address for your EC2 instance. For example:

http://52.23.184.60:81

You should be met with The Door Game intro screen:

Door Game Welcome Screen Door Game Welcome Screen

Click Let's Play to start the game:

Let’s Play Let’s Play

Did you notice that it took a long time after clicking Let's Play before we could actually start playing the game?

Let’s use Splunk Observability Cloud to determine why the application startup is so slow.

Last Modified Oct 27, 2025

Troubleshoot Game Startup

10 minutes  

Let’s use Splunk Observability Cloud to determine why the game started so slowly.

View your application in Splunk Observability Cloud

Note: when the application is deployed for the first time, it may take a few minutes for the data to appear.

Navigate to APM, then use the Environment dropdown to select your environment (i.e. profiling-workshop-name).

If everything was deployed correctly, you should see doorgame displayed in the list of services:

APM Overview APM Overview

Click on Explore on the right-hand side to view the service map. We should see the doorgame application on the service map:

Service Map Service Map

Notice how the majority of the time is being spent in the MySQL database. We can get more details by clicking on Database Query Performance on the right-hand side.

Database Query Performance Database Query Performance

This view shows the SQL queries that took the most time. Ensure that the Compare to dropdown is set to None, so we can focus on current performance.

We can see that one query in particular is taking a long time:

select * from doorgamedb.users, doorgamedb.organizations

(do you notice anything unusual about this query?)

Let’s troubleshoot further by clicking on one of the spikes in the latency graph. This brings up a list of example traces that include this slow query:

Traces with Slow Query Traces with Slow Query

Click on one of the traces to see the details:

Trace with Slow Query Trace with Slow Query

In the trace, we can see that the DoorGame.startNew operation took 25.8 seconds, and 17.6 seconds of this was associated with the slow SQL query we found earlier.

What did we accomplish?

To recap what we’ve done so far:

  • We’ve deployed our application and are able to access it successfully.
  • The application is sending traces to Splunk Observability Cloud successfully.
  • We started troubleshooting the slow application startup time, and found a slow SQL query that seems to be the root cause.

To troubleshoot further, it would be helpful to get deeper diagnostic data that tells us what’s happening inside our JVM, from both a memory (i.e. JVM heap) and CPU perspective. We’ll tackle that in the next section of the workshop.

Last Modified Jul 23, 2025

Enable AlwaysOn Profiling

20 minutes  

Let’s learn how to enable the memory and CPU profilers, verify their operation, and use the results in Splunk Observability Cloud to find out why our application startup is slow.

Update the application configuration

We will need to pass additional configuration arguments to the Splunk OpenTelemetry Java agent in order to enable both profilers. The configuration is documented here in detail, but for now we just need the following settings:

SPLUNK_PROFILER_ENABLED="true"
SPLUNK_PROFILER_MEMORY_ENABLED="true"

Since our application is deployed in Kubernetes, we can update the Kubernetes manifest file to set these environment variables. Open the doorgame/doorgame.yaml file for editing, and ensure the values of the following environment variables are set to “true”:

- name: SPLUNK_PROFILER_ENABLED
  value: "true"
- name: SPLUNK_PROFILER_MEMORY_ENABLED
  value: "true"

Next, let’s redeploy the Door Game application by running the following command:

cd workshop/profiling
kubectl apply -f doorgame/doorgame.yaml

After a few seconds, a new pod will be deployed with the updated application settings.

Confirm operation

To ensure the profiler is enabled, let’s review the application logs with the following commands:

kubectl logs -l app=doorgame --tail=100 | grep JfrActivator

You should see a line in the application log output that shows the profiler is active:

[otel.javaagent 2024-02-05 19:01:12:416 +0000] [main] INFO com.splunk.opentelemetry.profiler.JfrActivator - Profiler is active.

This confirms that the profiler is enabled and sending data to the OpenTelemetry collector deployed in our Kubernetes cluster, which in turn sends profiling data to Splunk Observability Cloud.

Profiling in APM

Visit http://<your IP address>:81 and play a few more rounds of The Door Game.

Then head back to Splunk Observability Cloud, click on APM, and click on the doorgame service at the bottom of the screen.

Click on “Traces” on the right-hand side to load traces for this service. Filter on traces involving the doorgame service and the GET new-game operation (since we’re troubleshooting the game startup sequence):

New Game Traces New Game Traces

Selecting one of these traces brings up the following screen:

Trace with Call Stacks Trace with Call Stacks

You can see that the spans now include “Call Stacks”, which is a result of us enabling CPU and memory profiling earlier.

Click on the span named doorgame: SELECT doorgamedb, then click on CPU stack traces on the right-hand side:

Trace with CPU Call Stacks Trace with CPU Call Stacks

This brings up the CPU call stacks captured by the profiler.

Let’s open the AlwaysOn Profiler to review the CPU stack trace in more detail. We can do this by clicking on the Span link beside View in AlwaysOn Profiler:

Flamegraph and table Flamegraph and table

The AlwaysOn Profiler includes both a table and a flamegraph. Take some time to explore this view by doing some of the following:

  • click a table item and notice the change in flamegraph
  • navigate the flamegraph by clicking on a stack frame to zoom in, and a parent frame to zoom out
  • add a search term like splunk or jetty to highlight some matching stack frames

Let’s have a closer look at the stack trace, starting with the DoorGame.startNew method (since we already know that it’s the slowest part of the request)

com.splunk.profiling.workshop.DoorGame.startNew(DoorGame.java:24)
com.splunk.profiling.workshop.UserData.loadUserData(UserData.java:33)
com.mysql.cj.jdbc.StatementImpl.executeQuery(StatementImpl.java:1168)
com.mysql.cj.NativeSession.execSQL(NativeSession.java:655)
com.mysql.cj.protocol.a.NativeProtocol.sendQueryString(NativeProtocol.java:998)
com.mysql.cj.protocol.a.NativeProtocol.sendQueryPacket(NativeProtocol.java:1065)
com.mysql.cj.protocol.a.NativeProtocol.readAllResults(NativeProtocol.java:1715)
com.mysql.cj.protocol.a.NativeProtocol.read(NativeProtocol.java:1661)
com.mysql.cj.protocol.a.TextResultsetReader.read(TextResultsetReader.java:48)
com.mysql.cj.protocol.a.TextResultsetReader.read(TextResultsetReader.java:87)
com.mysql.cj.protocol.a.NativeProtocol.read(NativeProtocol.java:1648)
com.mysql.cj.protocol.a.ResultsetRowReader.read(ResultsetRowReader.java:42)
com.mysql.cj.protocol.a.ResultsetRowReader.read(ResultsetRowReader.java:75)
com.mysql.cj.protocol.a.MultiPacketReader.readMessage(MultiPacketReader.java:44)
com.mysql.cj.protocol.a.MultiPacketReader.readMessage(MultiPacketReader.java:66)
com.mysql.cj.protocol.a.TimeTrackingPacketReader.readMessage(TimeTrackingPacketReader.java:41)
com.mysql.cj.protocol.a.TimeTrackingPacketReader.readMessage(TimeTrackingPacketReader.java:62)
com.mysql.cj.protocol.a.SimplePacketReader.readMessage(SimplePacketReader.java:45)
com.mysql.cj.protocol.a.SimplePacketReader.readMessage(SimplePacketReader.java:102)
com.mysql.cj.protocol.a.SimplePacketReader.readMessageLocal(SimplePacketReader.java:137)
com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:64)
java.io.FilterInputStream.read(Unknown Source:0)
sun.security.ssl.SSLSocketImpl$AppInputStream.read(Unknown Source:0)
sun.security.ssl.SSLSocketImpl.readApplicationRecord(Unknown Source:0)
sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(Unknown Source:0)
sun.security.ssl.SSLSocketInputRecord.readHeader(Unknown Source:0)
sun.security.ssl.SSLSocketInputRecord.read(Unknown Source:0)
java.net.SocketInputStream.read(Unknown Source:0)
java.net.SocketInputStream.read(Unknown Source:0)
java.lang.ThreadLocal.get(Unknown Source:0)

We can interpret the stack trace as follows:

  • When starting a new Door Game, a call is made to load user data.
  • This results in executing a SQL query to load the user data (which is related to the slow SQL query we saw earlier).
  • We then see calls to read data in from the database.

So, what does this all mean? It means that our application startup is slow since it’s spending time loading user data. In fact, the profiler has told us the exact line of code where this happens:

com.splunk.profiling.workshop.UserData.loadUserData(UserData.java:33)

Let’s open the corresponding source file (./doorgame/src/main/java/com/splunk/profiling/workshop/UserData.java) and look at this code in more detail:

public class UserData {

    static final String DB_URL = "jdbc:mysql://mysql/DoorGameDB";
    static final String USER = "root";
    static final String PASS = System.getenv("MYSQL_ROOT_PASSWORD");
    static final String SELECT_QUERY = "select * FROM DoorGameDB.Users, DoorGameDB.Organizations";

    HashMap<String, User> users;

    public UserData() {
        users = new HashMap<String, User>();
    }

    public void loadUserData() {

        // Load user data from the database and store it in a map
        Connection conn = null;
        Statement stmt = null;
        ResultSet rs = null;

        try{
            conn = DriverManager.getConnection(DB_URL, USER, PASS);
            stmt = conn.createStatement();
            rs = stmt.executeQuery(SELECT_QUERY);
            while (rs.next()) {
                User user = new User(rs.getString("UserId"), rs.getString("FirstName"), rs.getString("LastName"));
                users.put(rs.getString("UserId"), user);
            }

Here we can see the application logic in action. It establishes a connection to the database, then executes the SQL query we saw earlier:

select * FROM DoorGameDB.Users, DoorGameDB.Organizations

It then loops through each of the results, and loads each user into a HashMap object, which is a collection of User objects.

We have a good understanding of why the game startup sequence is so slow, but how do we fix it?

For more clues, let’s have a look at the other part of AlwaysOn Profiling: memory profiling. To do this, click on the Memory tab in AlwaysOn profiling:

Memory Profiling Memory Profiling

At the top of this view, we can see how much heap memory our application is using, the heap memory allocation rate, and garbage collection activity.

We can see that our application is using about 400 MB out of the max 1 GB heap size, which seems excessive for such a simple application. We can also see that some garbage collection occurred, which caused our application to pause (and probably annoyed those wanting to play The Door Game).

At the bottom of the screen, we can see which methods in our Java application code are associated with the most heap memory usage. Click on the first item in the list to show the Memory Allocation Stack Traces associated with the java.util.Arrays.copyOf method specifically:

Memory Allocation Stack Traces Memory Allocation Stack Traces

With help from the profiler, we can see that the loadUserData method not only consumes excessive CPU time, but it also consumes excessive memory when storing the user data in the HashMap collection object.

What did we accomplish?

We’ve come a long way already!

  • We learned how to enable the profiler in the Splunk OpenTelemetry Java instrumentation agent.
  • We learned how to verify in the agent output that the profiler is enabled.
  • We have explored several profiling related workflows in APM:
    • How to navigate to AlwaysOn Profiling from the troubleshooting view
    • How to explore the flamegraph and method call duration table through navigation and filtering
    • How to identify when a span has sampled call stacks associated with it
    • How to explore heap utilization and garbage collection activity
    • How to view memory allocation stack traces for a particular method

In the next section, we’ll apply a fix to our application to resolve the slow startup performance.

Last Modified Jul 23, 2025

Fix Application Startup Slowness

10 minutes  

In this section, we’ll use what we learned from the profiling data in Splunk Observability Cloud to resolve the slowness we saw when starting our application.

Examining the Source Code

Open the corresponding source file once again (./doorgame/src/main/java/com/splunk/profiling/workshop/UserData.java) and focus on the following code:

public class UserData {

    static final String DB_URL = "jdbc:mysql://mysql/DoorGameDB";
    static final String USER = "root";
    static final String PASS = System.getenv("MYSQL_ROOT_PASSWORD");
    static final String SELECT_QUERY = "select * FROM DoorGameDB.Users, DoorGameDB.Organizations";

    HashMap<String, User> users;

    public UserData() {
        users = new HashMap<String, User>();
    }

    public void loadUserData() {

        // Load user data from the database and store it in a map
        Connection conn = null;
        Statement stmt = null;
        ResultSet rs = null;

        try{
            conn = DriverManager.getConnection(DB_URL, USER, PASS);
            stmt = conn.createStatement();
            rs = stmt.executeQuery(SELECT_QUERY);
            while (rs.next()) {
                User user = new User(rs.getString("UserId"), rs.getString("FirstName"), rs.getString("LastName"));
                users.put(rs.getString("UserId"), user);
            }

After speaking with a database engineer, you discover that the SQL query being executed includes a cartesian join:

select * FROM DoorGameDB.Users, DoorGameDB.Organizations

Cartesian joins are notoriously slow and, in general, should be avoided.

Upon further investigation, you discover that there are 10,000 rows in the user table, and 1,000 rows in the organization table. When we execute a cartesian join using both of these tables, we end up with 10,000 x 1,000 rows being returned, which is 10,000,000 rows!

Furthermore, the query ends up returning duplicate user data, since each record in the user table is repeated for each organization.

So when our code executes this query, it tries to load 10,000,000 user objects into the HashMap, which explains why it takes so long to execute, and why it consumes so much heap memory.
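To make the row explosion concrete, the arithmetic can be sketched as follows (a hypothetical helper, not part of the workshop code):

```java
// Sketch: why a cartesian join explodes the result set.
// A cross join returns every pairing of rows, so the result size is the
// product of the two table sizes.
public class CartesianJoinMath {

    static long cartesianRows(long leftRows, long rightRows) {
        return leftRows * rightRows;
    }

    public static void main(String[] args) {
        long userRows = 10_000;  // rows in the user table
        long orgRows = 1_000;    // rows in the organization table
        // 10,000 x 1,000 = 10,000,000 rows, each user repeated once per organization
        System.out.println(cartesianRows(userRows, orgRows));
    }
}
```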

Let’s Fix That Bug

After consulting the engineer that originally wrote this code, we determined that the join with the Organizations table was inadvertent.

So when loading the users into the HashMap, we simply need to remove this table from the query.

Open the corresponding source file once again (./doorgame/src/main/java/com/splunk/profiling/workshop/UserData.java) and change the following line of code:

    static final String SELECT_QUERY = "select * FROM DoorGameDB.Users, DoorGameDB.Organizations";

to:

    static final String SELECT_QUERY = "select * FROM DoorGameDB.Users";

Now the method should perform much more quickly, and less memory should be used, as it’s loading the correct number of users into the HashMap (10,000 instead of 10,000,000).

Rebuild and Redeploy Application

Let’s test our changes by using the following commands to re-build and re-deploy the Door Game application:

cd workshop/profiling
./5-redeploy-doorgame.sh

Once the application has been redeployed successfully, visit The Door Game again to confirm that your fix is in place: http://<your IP address>:81

Clicking Let's Play should take us to the game more quickly now (though performance could still be improved):

Choose Door

Start the game a few more times, then return to Splunk Observability Cloud to confirm that the latency of the GET new-game operation has decreased.

What did we accomplish?

  • We discovered why our SQL query was so slow.
  • We applied a fix, then rebuilt and redeployed our application.
  • We confirmed that the application starts a new game more quickly.

In the next section, we’ll continue playing the game and fix any remaining performance issues that we find.

Last Modified Jul 23, 2025

Fix In Game Slowness

10 minutes  

Now that our game startup slowness has been resolved, let’s play several rounds of the Door Game and ensure the rest of the game performs quickly.

As you play the game, do you notice any other slowness? Let’s look at the data in Splunk Observability Cloud to put some numbers on what we’re seeing.

Review Game Performance in Splunk Observability Cloud

Navigate to APM then click on Traces on the right-hand side of the screen. Sort the traces by Duration in descending order:

Slow Traces

We can see that a few of the traces with an operation of GET /game/:uid/picked/:picked/outcome have a duration of just over five seconds. This explains why we’re still seeing some slowness when we play the app (note that the slowness is no longer on the game startup operation, GET /new-game, but rather a different operation used while actually playing the game).

Let’s click on one of the slow traces and take a closer look. Since profiling is still enabled, call stacks have been captured as part of this trace. Click on the child span in the waterfall view, then click CPU Stack Traces:

View Stack on Span

At the bottom of the call stack, we can see that the thread was busy sleeping:

com.splunk.profiling.workshop.ServiceMain$$Lambda$.handle(Unknown Source:0)
com.splunk.profiling.workshop.ServiceMain.lambda$main$(ServiceMain.java:34)
com.splunk.profiling.workshop.DoorGame.getOutcome(DoorGame.java:41)
com.splunk.profiling.workshop.DoorChecker.isWinner(DoorChecker.java:14)
com.splunk.profiling.workshop.DoorChecker.checkDoorTwo(DoorChecker.java:30)
com.splunk.profiling.workshop.DoorChecker.precheck(DoorChecker.java:36)
com.splunk.profiling.workshop.Util.sleep(Util.java:9)
java.util.concurrent.TimeUnit.sleep(Unknown Source:0)
java.lang.Thread.sleep(Unknown Source:0)
java.lang.Thread.sleep(Native Method:0)

The call stack tells us a story – reading from the top down, it lets us describe what is happening inside the service code. A developer, even one unfamiliar with the source code, should be able to look at this call stack and craft a narrative like:

We are getting the outcome of a game. We leverage the DoorChecker to see if something is the winner, but the check for door two somehow issues a precheck() that, for some reason, is deciding to sleep for a long time.

Our workshop application is left intentionally simple – a real-world service might see the thread being sampled during a database call or calling into an un-traced external service. It is also possible that a slow span is executing a complicated business process, in which case maybe none of the stack traces relate to each other at all.

The longer a method or process is, the greater chance we will have call stacks sampled during its execution.
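As a rough sketch of that relationship, the expected number of samples scales linearly with method duration. The sampling interval below is purely illustrative, not a documented Splunk default:

```java
// Sketch with an assumed sampling interval (illustrative only -- check your
// profiler's configured interval): the expected number of call stack samples
// captured during a method grows linearly with how long the method runs.
public class SamplingMath {

    static double expectedSamples(double methodDurationMs, double samplingIntervalMs) {
        return methodDurationMs / samplingIntervalMs;
    }

    public static void main(String[] args) {
        double intervalMs = 1_000; // hypothetical interval, not a Splunk default
        System.out.println(expectedSamples(5_000, intervalMs)); // a ~5s method: 5.0
        System.out.println(expectedSamples(50, intervalMs));    // a 50ms method: 0.05
    }
}
```

This is why a slow span like ours reliably has call stacks attached, while very short methods are often missed entirely.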

Let’s Fix That Bug

By using the profiling tool, we were able to determine that our application is slow when issuing the DoorChecker.precheck() method from inside DoorChecker.checkDoorTwo(). Let’s open the doorgame/src/main/java/com/splunk/profiling/workshop/DoorChecker.java source file in our editor.

By quickly glancing through the file, we see that there are methods for checking each door, and all of them call precheck(). In a real service, we might be uncomfortable simply removing the precheck() call because there could be unseen/unaccounted side effects.

Down on line 29 we see the following:

    private boolean checkDoorTwo(GameInfo gameInfo) {
        precheck(2);
        return gameInfo.isWinner(2);
    }

    private void precheck(int doorNum) {
        long extra = (int)Math.pow(70, doorNum);
        sleep(300 + extra);
    }

With our developer hat on, we notice that the door number is zero-based, so the first door is 0, the second is 1, and the third is 2 (this is conventional). The extra value is used as extra/additional sleep time, and it is computed as 70^doorNum (Math.pow performs an exponent calculation). That’s odd, because this means:

  • door 0 => 70^0 => 1ms
  • door 1 => 70^1 => 70ms
  • door 2 => 70^2 => 4900ms

We’ve found the root cause of our slow bug! This also explains why the first two doors weren’t ever very slow.
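The bullets above can be reproduced directly from the precheck() formula; this small sketch (not part of the workshop code) computes the extra and total sleep per door:

```java
// Sketch: the sleep computed by precheck() for each door.
// total sleep = 300ms base + 70^doorNum ms extra, so door 2 sleeps about
// 5.2 seconds -- matching the ~5 second spans seen in the traces.
public class PrecheckMath {

    static long extraSleepMs(int doorNum) {
        return (long) Math.pow(70, doorNum);
    }

    static long totalSleepMs(int doorNum) {
        return 300 + extraSleepMs(doorNum);
    }

    public static void main(String[] args) {
        for (int door = 0; door <= 2; door++) {
            // prints 301, 370, and 5200 ms respectively
            System.out.println("door " + door + " sleeps " + totalSleepMs(door) + "ms");
        }
    }
}
```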

We have a quick chat with our product manager and team lead, and we agree that the precheck() method must stay but that the extra padding isn’t required. Let’s remove the extra variable and make precheck now read like this:

    private void precheck(int doorNum) {
        sleep(300);
    }

Now all doors will have a consistent behavior. Save your work and then rebuild and redeploy the application using the following command:

cd workshop/profiling
./5-redeploy-doorgame.sh

Once the application has been redeployed successfully, visit The Door Game again to confirm that your fix is in place: http://<your IP address>:81

What did we accomplish?

  • We found another performance issue with our application that impacts game play.
  • We used the CPU call stacks included in the trace to understand application behavior.
  • We learned how the call stack can tell us a story and point us to suspect lines of code.
  • We identified the slow code and fixed it to make it faster.

Last Modified Jul 23, 2025

Summary

3 minutes  

In this workshop, we accomplished the following:

  • We deployed our application and captured traces with Splunk Observability Cloud.
  • We used Database Query Performance to find a slow-running query that impacted the game startup time.
  • We enabled AlwaysOn Profiling and used it to confirm which line of code was causing the increased startup time and memory usage.
  • We found another application performance issue and used AlwaysOn Profiling again to find the problematic line of code.
  • We applied fixes for both of these issues and verified the result using Splunk Observability Cloud.

Enabling AlwaysOn Profiling and utilizing Database Query Performance for your applications will let you get even more value from the data you’re sending to Splunk Observability Cloud.

Now that you’ve completed this workshop, you have the knowledge you need to start collecting deeper diagnostic data from your own applications!

To get started with Database Query Performance today, check out Monitor Database Query Performance.

And to get started with AlwaysOn Profiling, check out Introduction to AlwaysOn Profiling for Splunk APM.

For more help, feel free to ask a Splunk Expert.

Last Modified Jul 23, 2025

Optimize End User Experiences

90 minutes   Author Sarah Ware

How can we use Splunk Observability to get insight into end user experience, and proactively test scenarios to improve that experience?

Sections:

  • Set up basic Synthetic tests to understand availability and performance ASAP
    • Uptime test
    • API test
    • Single page Browser test
  • Explore RUM to understand our real users
  • Write advanced Synthetics tests based on what we’ve learned about our users and what we need them to do
  • Customize dashboard charts to capture our KPIs, show trends, and show data in context of our events
  • Create Detectors to alert on our KPIs
Tip
Keep in mind throughout the workshop: how can I prioritize activities strategically to get the fastest time to value for my end users and for myself/my developers?

Context

As a reminder, we need frontend performance monitoring to capture everything that goes into our end user experience. If we’re just monitoring the backend, we’re missing all of the other resources that are critical to our users’ success. Read What the Fastly Outage Can Teach Us About Observability for a real world example. Click the image below to zoom in. What goes into the front end

References

Throughout this workshop, we will see references to resources to help further understand end user experience and how to optimize it. In addition to Splunk Docs for supported features and Lantern for tips and tricks, Google’s web.dev and Mozilla are great resources.

Remember that the specific libraries, platforms, and CDNs you use often also have their own specific resources. For example React, Wordpress, and Cloudflare all have their own tips to improve performance.

Last Modified Jul 23, 2025

Subsections of Optimize End User Experiences

Synthetics

Let’s quickly set up some tests in Synthetics to immediately start understanding our end user experience, without waiting for real users to interact with our app.

We can capture not only the performance and availability of our own apps and endpoints, but also those third parties we rely on any time of the day or night.

Tip
If you find that your tests are being bot-blocked, see the docs for tips on how to allow Synthetic testing. If you need to test something that is not accessible externally, see the private location instructions.
Last Modified Jul 23, 2025

Subsections of Synthetics

Uptime Test

5 minutes  

Introduction

The simplest way to keep an eye on endpoint availability is with an Uptime test. This lightweight test can run internally or externally around the world, as frequently as every minute. Because this is the easiest (and cheapest!) test to set up, and because it is ideal for monitoring the availability of your most critical endpoints and ports, let’s start here.

Pre-requisites

  • Publicly accessible HTTP(S) endpoint(s) to test
  • Access to Splunk Observability Cloud
Last Modified Jul 23, 2025

Subsections of 1. Uptime Test

Creating a test

  1. Open Synthetics Synthetics navigation item

  2. Click the Add new test button on the right side of the screen, then select Uptime and HTTP test. image

  3. Name your test with your team name (provided by your workshop instructor), your initials, and any other details you’d like to include, like geographic region.

  4. For now let’s test a GET request. Fill in the URL field. You can use one of your own, or one of ours like https://online-boutique-eu.splunko11y.com, https://online-boutique-us.splunko11y.com, or https://www.splunk.com.

  5. Click Try now to validate that the endpoint is accessible from the selected location before saving the test. Try now does not count against your subscription usage, so this is a good practice to make sure you’re not wasting real test runs on a misconfigured test. image

Tip
A common reason for Try now to fail is that there is a non-2xx response code. If that is expected, add a Validation for the correct response code.
  6. Add any additional validations needed, for example: response code, response header, and response size. Advanced settings for test configuration

  7. Add and remove any locations you’d like. Keep in mind where you expect your endpoint to be available.

  8. Change the frequency to test your more critical endpoints more often, up to one minute. image

  9. Make sure “Round-robin” is on so the test will run from one location at a time, rather than from all locations at once.

    • If an endpoint is highly critical, consider whether it is worth testing from every location at the same time every single minute. If you have automations built in with a webhook from a detector, or if you have strict SLAs you need to track, the extra coverage could be worth it. But if you are doing more manual investigation, or if this is a less critical endpoint, you could be wasting test runs that execute while an issue is being investigated.
    • Remember that your license is based on the number of test runs per month. Turning Round-robin off will multiply the number of test runs by the number of locations you have selected.
  10. When you are ready for the test to start running, make sure “Active” is on, then scroll down and click Submit to save the test configuration. submit button

Now the test will start running with your saved configuration. Take a water break, then we’ll look at the results!

Last Modified Jul 23, 2025

Understanding results

  1. From the Synthetics landing page, click into a test to see its summary view and play with the Performance KPIs chart filters to see how you can slice and dice your data. This is a good place to get started understanding trends. Later, we will see what custom charts look like, so you can tailor dashboards to the KPIs you care about most. KPI chart filters

    Workshop Question: Using the Performance KPIs chart

    What metrics are available? Is your data consistent across time and locations? Do certain locations run slower than others? Are there any spikes or failures?

  2. Click into a recent run either in the chart or in the table below. run results chart

  3. If there are failures, look at the response to see whether you need to add a response code assertion (302 is a common one), whether authorization is needed, or whether different request headers should be added. Here we have information about this particular test run, including whether it succeeded or failed, the location, timestamp, and duration, in addition to the other Uptime test metrics. Click through to see the response, request, and connection info as well. uptime test result If you need to edit the test for it to run successfully, click the test name in the top left breadcrumb on this run result page, then click Edit test on the top right of the test overview page. Remember to scroll down and click Submit to save your changes after editing the test configuration.

  4. In addition to the test running successfully, there are other metrics to measure the health of your endpoints. For example, Time to First Byte (TTFB) is a great indicator of performance, and you can optimize TTFB to improve end user experience.

  5. Go back to the test overview page and change the Performance KPIs chart to display First Byte time. Once the test has run for a long enough time, expanding the time frame will draw the data points as lines to better see trends and anomalies, like in the example below. Performance KPIs for Uptime Tests

In the example above, we can see that TTFB varies consistently between locations. Knowing this, we can keep location in mind when reporting on metrics. We could also improve the experience, for example by serving users in those locations an endpoint hosted closer to them, which should reduce network latency. We can also see some slight variations in the results over time, but overall we already have a good idea of our baseline for this endpoint’s KPIs. When we have a baseline, we can alert on worsening metrics as well as visualize improvements.

Tip

We are not setting a detector on this test yet, to make sure it is running consistently and successfully. If you are testing a highly critical endpoint and want to be alerted on it ASAP (and have tolerance for potential alert noise), jump to Single Test Detectors.

Once you have your Uptime test running successfully, let’s move on to the next test type.

Last Modified Jul 23, 2025

API Test

5 minutes  

The API test provides a flexible way to check the functionality and performance of API endpoints. The shift toward API-first development has magnified the necessity to monitor the back-end services that provide your core front-end functionality.

Whether you’re interested in testing multi-step API interactions or you want to gain visibility into the performance of your endpoints, the API Test can help you accomplish your goals.

This exercise will walk through a multi-step test on the Spotify API. You can also use it as a reference to build tests on your own APIs or on those of your critical third parties.

API test result

Last Modified Jul 23, 2025

Subsections of 2. API Test

Global Variables

Global variables allow us to use stored strings in multiple tests, so we only need to update them in one place.

View the global variable that we’ll use to perform our API test. Click on Global Variables under the cog icon. The global variable named env.encoded_auth is the one we’ll use to build the Spotify API transaction.

global variables

Last Modified Jul 23, 2025

Create new API test

Create a new API test by clicking on the Add new test button and select API test from the dropdown. Name the test using your team name, your initials, and Spotify API e.g. [Daisy] RWC - Spotify API

new API test

Last Modified Jul 23, 2025

Authentication Request

Click on + Add requests and enter the request step name e.g. Authenticate with Spotify API.

placeholder

Expand the Request section. From the drop-down, change the request method to POST and enter the following URL:

https://accounts.spotify.com/api/token

In the Payload body section enter the following:

grant_type=client_credentials

Next, add two + Request headers with the following key/value pairings:

  • CONTENT-TYPE: application/x-www-form-urlencoded
  • AUTHORIZATION: Basic {{env.encoded_auth}}

Expand the Validation section and add the following extraction:

  • Extract from Response body JSON $.access_token as access_token

This will parse the JSON payload that is received from the Spotify API, extract the access token and store it as a custom variable.
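Synthetics performs this extraction for you, but to illustrate what the $.access_token JSONPath pulls out of the token response, here is a hypothetical sketch (the payload and token value below are made up, and a real client would use a JSON parser rather than a regex):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: mimics the Synthetics extraction of $.access_token
// from the token response body. Not part of the workshop test itself.
public class TokenExtraction {

    static String extractAccessToken(String json) {
        // Minimal extraction of a flat string field named "access_token"
        Matcher m = Pattern.compile("\"access_token\"\\s*:\\s*\"([^\"]+)\"").matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up sample payload in the shape of an OAuth token response
        String sample = "{\"access_token\":\"BQDummyToken123\",\"token_type\":\"Bearer\",\"expires_in\":3600}";
        System.out.println(extractAccessToken(sample)); // BQDummyToken123
    }
}
```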

Add request payload token

Last Modified Jul 23, 2025

Search Request

Click on + Add Request to add the next step. Name the step Search for Tracks named “Up around the bend”.

Expand the Request section, change the request method to GET, and enter the following URL:

https://api.spotify.com/v1/search?q=Up%20around%20the%20bend&type=track&offset=0&limit=5

Next, add two request headers with the following key/value pairings:

  • CONTENT-TYPE: application/json
  • AUTHORIZATION: Bearer {{custom.access_token}}
    • This uses the custom variable we created in the previous step!

Add search request

Expand the Validation section and add the following extraction:

  • Extract from Response body JSON $.tracks.items[0].id as track.id

Add search payload

To validate the test before saving, scroll to the top and change the location as needed. Click Try now. See the docs for more information on the try now feature.

try now

When the validation is successful, click < Return to test to return to the test configuration page, then click Save to save the API test.

Extra credit

Have more time to work on this test? Take a look at the Response Body in one of your run results. What additional steps would make this test more thorough? Edit the test, and use the Try now feature to validate any changes you make before you save the test.

Last Modified Jul 23, 2025

View results

Wait for a few minutes for the test to provision and run. Once you see the test has run successfully, click on the run to view the results:

API test result

Last Modified Jul 23, 2025

Single Page Browser Test

5 minutes  

We have started testing our endpoints, now let’s test the front end browser experience.

Starting with a single page browser test will let us capture how first- and third-party resources impact how our end users experience our browser-based site. It also allows us to start to understand our user experience metrics before introducing the complexity of multiple steps in one test.

A page where your users commonly “land” is a good choice to start with a single page test. This could be your site homepage, a section main page, or any other high-traffic URL that is important to you and your end users.

  1. Click Create new test and select Browser test Create new browser test

  2. Include your team name and initials in the test name. Add to the Name and Custom properties to describe the scope of the test (like Desktop for device type). Then click + Edit steps Browser test content fields

  3. Change the transaction label (top left) and step name (on the right) to something readable that describes the step. Add the URL you’d like to test. Your workshop instructor can provide you with a URL as well. In the below example, the transaction is “Home” and the step name is “Go to homepage”. Transaction and step label

  4. To validate the test, change the location as needed and click Try now. See the docs for more information on the try now feature. browser test try now buttons

  5. Wait for the test validation to complete. If the test validation failed, double check your URL and test location and try again. With Try now you can see what the result of the test will be if it were saved and run as-is. Try Now browser test results

  6. Click < Return to test to continue the configuration. Return to test button

  7. Edit the locations you want to use, keeping in mind any regional rules you have for your site.

    Browser test details

  8. You can edit the Device and Frequency or leave them at their default values for now. Click Submit at the bottom of the form to save the test and start running it.

    browser test submit button

Bonus Exercise

Have a few spare seconds? Copy this test and change just the title and device type, and save. Now you have visibility into the end user experience on another device and connection speed!

While our Synthetic tests are running, let’s see how RUM is instrumented to start getting data from our real users.

Last Modified Jul 23, 2025

RUM

15 minutes  

With RUM instrumented, we will be able to better understand our end users, what they are doing, and what issues they are encountering.

This workshop walks through how our demo site is instrumented and how to interpret the data. If you already have a RUM license, this will help you understand how RUM works and how you can use it to optimize your end user experience.

Tip

Our Docs also contain guidance such as scenarios using Splunk RUM and demo applications to test out RUM for mobile apps.

Last Modified Jul 23, 2025

Subsections of RUM

Overview

The aim of this Splunk Real User Monitoring (RUM) workshop is to let you:

  • Shop for items on the Online Boutique to create traffic, and create RUM User Sessions¹ that you can view in Splunk Observability Cloud.
  • See an overview of the performance of all your application(s) in the Application Summary Dashboard
  • Examine the performance of a specific website with RUM metrics.

In order to reach this goal, we will use an online boutique to order various products. While shopping on the online boutique you will create what is called a User Session.

You may encounter some issues with this web site, and you will use Splunk RUM to identify the issues, so they can be resolved by the developers.

The workshop host will provide you with a URL for an online boutique store that has RUM enabled.

Each of these Online Boutiques are also being visited by a few synthetic users; this will allow us to generate more live data to be analyzed later.


  1. A RUM User session is a “recording” of a collection of user interactions on an application, basically collecting a website or app’s performance measured straight from the browser or Mobile App of the end user. To do this a small amount of JavaScript is embedded in each page. This script then collects data from each user as he or she explores the page, and transfers that data back for analysis. ↩︎

Last Modified Jul 23, 2025

RUM instrumentation in a browser app

  • Check the HEAD section of the Online-boutique webpage in your browser
  • Find the code that instruments RUM

1. Browse to the Online Boutique

Your workshop instructor will provide you with the Online Boutique URL that has RUM installed so that you can complete the next steps.

2. Inspecting the HTML source

The changes needed for RUM are placed in the <head> section of the host’s web page. Right click to view the page source or to inspect the code. Below is an example of the <head> section with RUM:

Online Boutique

This code enables RUM Tracing, Session Replay, and Custom Events to better understand performance in the context of user workflows:

  • The first part indicates where to download the Splunk OpenTelemetry JavaScript file from: https://cdn.signalfx.com/o11y-gdi-rum/latest/splunk-otel-web.js (this can also be hosted locally if required).
  • The next section defines where to send the traces, via the beacon URL: {beaconUrl: "https://rum-ingest.eu0.signalfx.com/v1/rum"
  • The RUM Access Token: rumAuth: "<redacted>".
  • Identification tags app and environment to identify the application in the Splunk RUM UI, e.g. app: "online-boutique-us-store", environment: "online-boutique-us"} (these values will be different in your workshop)

The above lines 21 and 23-30 are all that is required to enable RUM on your website!

Lines 22 and 31-34 are optional if you want Session Replay instrumented.

Lines 36-39 (var tracer=Provider.getTracer('appModuleLoader');) will add a Custom Event for every page change, allowing you to better track your website conversions and usage. This may or may not be instrumented for this workshop.

Exercise

Time to shop! Take a minute to open the workshop store URL in as many browsers and devices as you’d like, shop around, add items to cart, checkout, and feel free to close the shopping browsers when you’re finished. Keep in mind this is a lightweight demo shop site, so don’t be alarmed if the cart doesn’t match the item you picked!

Last Modified Jul 23, 2025

RUM Landing Page

  • Visit the RUM landing page and check the overview of the performance of all your RUM-enabled applications with the Application Summary Dashboard (both mobile and web based)

1. Visit the RUM Landing Page

Log in to Splunk Observability. From the left side menu bar, select RUM RUM-ico. This will bring you to the RUM Landing Page.

The goal of this page is to give you, on a single page, a clear indication of the health, performance, and potential errors found in your application(s), and to allow you to dive deeper into the information about your User Sessions collected from your web page/app. You will have a pane for each of your active RUM applications. (The view below is the default expanded view.)

RUM-App-sum

If you have multiple applications (which will be the case when every attendee is using their own EC2 instance for the RUM workshop), the pane view may be automatically reduced by collapsing the panes as shown below:

RUM-App-sum-collapsed

You can expand a condensed RUM Application Summary View to the full dashboard by clicking on the small browser RUM-browser or mobile RUM-mobile icon (depending on the type of application: mobile or browser based) to the left of the application’s name, highlighted by the red arrow.

First find the right application to use for the workshop:

If you are participating in a standalone RUM workshop, the workshop leader will tell you the name of the application to use. In a combined workshop, it will follow the naming convention we used for IM and APM and use the EC2 node name as a unique ID, like jmcj-store, shown as the last app in the screenshot above.

2. Configure the RUM Application Summary Dashboard Header Section

RUM Application Summary Dashboard consists of 6 major sections. The first is the selection header, where you can set/filter a number of options:

  • A drop down for the Time Window you’re reviewing (You are looking at the past 15 minutes by default)
  • A drop down to select the Environment¹ you want to look at. This allows you to focus on just the subset of applications belonging to that environment, or Select all to view all available.
  • A drop down list with the various Apps being monitored. You can use the one provided by the workshop host or select your own. This will focus you on just one application.
  • A drop down to select the Source, Browser or Mobile applications to view. For the Workshop leave All selected.
  • A hamburger menu located at the right of the header allowing you to configure some settings of your Splunk RUM application. (We will visit this in a later section).

RUM-SummaryHeader

For the workshop, let’s do a deeper dive into the Application Summary screen in the next section: Check Health Browser Application


A common application deployment pattern is to have multiple, distinct application environments that don’t interact directly with each other but that are all being monitored by Splunk APM or RUM: for instance, quality assurance (QA) and production environments, or multiple distinct deployments in different datacenters, regions or cloud providers.


  1. A deployment environment is a distinct deployment of your system or application that allows you to set up configurations that don’t overlap with configurations in other deployments of the same application. Separate deployment environments are often used for different stages of the development process, such as development, staging, and production. ↩︎

Last Modified Jul 23, 2025

Check Browser Applications health at a glance

  • Get familiar with the UI and options available from this landing page
  • Identify Page Views/JavaScript Errors and Request/Errors in a single view
  • Check the Web Vitals metrics and any Detector that has fired in relation to your Browser Application

Application Summary Dashboard

1. Header Bar

As seen in the previous section, the RUM Application Summary Dashboard consists of five major sections.
The first section is the selection header. You can collapse the pane via the RUM-browser RUM-browser Browser icon or the > in front of the application name, which is jmcj-store in the example below. Clicking the link with your application name also takes you to the Application Overview page.

Further, you can also open the Application Overview or App Health Dashboard via the triple dot trippleburger trippleburger menu on the right.

RUM-SummaryHeader RUM-SummaryHeader

For now, let’s look at the high level information we get on the application summary dashboard.

The RUM Application Summary Dashboard is focused on providing you with at a glance highlights of the status of your application.

2. Page Views / JavaScript Errors & Network Requests / Errors

The first section’s Page Views / JavaScript Errors and Network Requests / Errors charts show the quantity and trend of these issues in your application, whether JavaScript errors or failed network calls to back-end services.

RUM-chart RUM-chart

In the example above you can see that there are no failed network calls in the Network chart, but the Page View chart shows that a number of pages do experience some errors. These are often not visible to regular users, but can seriously impact the performance of your website.

You can see the count of the Page Views / Network Requests / Errors by hovering over the charts.

RUM-chart-clicked RUM-chart-clicked

3. JavaScript Errors

The second section of the RUM Application Summary Dashboard shows an overview of the JavaScript errors occurring in your application, along with a count of each error.

RUM-javascript RUM-javascript

In the example above you can see there are three JavaScript errors: one appears 29 times in the selected time slot, and the other two each appear 12 times.

If you click on one of the errors, a pop-out opens showing a summary (below) of the errors over time, along with a Stack Trace of the JavaScript error, giving you an indication of where the problems occurred. (We will see this in more detail in one of the following sections.)

RUM-javascript-chart RUM-javascript-chart

4. Web Vitals

The next section of the RUM Application Summary Dashboard shows Google’s Core Web Vitals: three metrics that are not only used by Google in its search ranking system, but also quantify end user experience in terms of loading, interactivity, and visual stability.

WEB-vitals WEB-vitals

As you can see our site is well behaved and scores Good for all three Metrics. These metrics can be used to identify the effect changes to your application have, and help you improve the performance of your site.

If you click on any of the metrics shown in the Web Vitals pane, you will be taken to the corresponding Tag Spotlight dashboard. For example, clicking on the Largest Contentful Paint (LCP) chartlet takes you to a dashboard similar to the screenshot below, with timeline and table views of how this metric has performed. This should allow you to spot trends and identify where the problem may be more common, such as a particular operating system, geolocation, or browser version.

WEB-vitals-tag WEB-vitals-tag

5. Most Recent Detectors

The final section of the RUM Application Summary Dashboard provides an overview of recent detectors that have triggered for your application. We created a detector for this screenshot, but your pane will be empty for now. We will add some detectors to your site and make sure they are triggered in one of the next sections.

detectors detectors

In the screenshot you can see we have a critical alert for the RUM Aggregated View Detector, and a Count showing how often this alert has triggered in the selected time window. If you happen to have an alert listed, you can click on its name (shown as a blue link) to be taken to the Alert Overview page showing the details of the alert. (Note: this navigates away from the current page; use your browser’s Back button to return to the overview page.)

alert alert


Exercise

Please take a few minutes to experiment with the RUM Application Summary Dashboard and the underlying chart and dashboards before going on to the next section.

Last Modified Jul 23, 2025

Analyzing RUM Metrics

  • See RUM Metrics and Session information in the RUM UI
  • See correlated APM traces in the RUM & APM UI

1. RUM Overview Pages

From your RUM Application Summary Dashboard you can see detailed information by opening the Application Overview page via the triple dot trippleburger trippleburger menu on the right and selecting Open Application Overview, or by clicking the link with your application name, which is jmcj-rum-app in the example below.

RUM-SummaryHeader RUM-SummaryHeader

This will take you to the RUM Application Overview Page screen as shown below.

RUM app overview with UX metrics RUM app overview with UX metrics

2. RUM Browser Overview

2.1. Header

The RUM UI consists of five major sections. The first is the selection header, where you can set/filter a number of options:

  • A drop down for the time window you’re reviewing (You are looking at the past 15 minutes in this case)
  • A drop down to select the Comparison window (You are comparing current performance on a rolling window - in this case compared to 1 hour ago)
  • A drop down with the available Environments to view
  • A drop down list with the various Web apps
  • Optionally a drop down to select Browser or Mobile metrics (Might not be available in your workshop)

RUM-Header RUM-Header

2.2. UX Metrics

By default, RUM prioritizes the metrics that most directly reflect the experience of the end user.

Additional Tags

All of the dashboard charts allow us to compare trends over time, create detectors, and click through to further diagnose issues.

First, we see page load and route change information, which can help us understand if something unexpected is impacting user traffic trends.

Page load and route change charts Page load and route change charts

Next, Google has defined Core Web Vitals to quantify the user experience as measured by loading, interactivity, and visual stability. Splunk RUM builds Google’s thresholds into the UI, so you can easily see if your metrics are in an acceptable range.

Core Web Vitals charts Core Web Vitals charts

  • Largest Contentful Paint (LCP), measures loading performance. How long does it take for the largest block of content in the viewport to load? To provide a good user experience, LCP should occur within 2.5 seconds of when the page first starts loading.
  • First Input Delay (FID), measures interactivity. How long does it take to be able to interact with the app? To provide a good user experience, pages should have a FID of 100 milliseconds or less.
  • Cumulative Layout Shift (CLS), measures visual stability. How much does the content move around after the initial load? To provide a good user experience, pages should maintain a CLS of 0.1 or less.
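The thresholds above can be sketched as a small classifier. This is an illustrative snippet, not part of the Splunk RUM SDK; the boundary values are Google’s documented good/poor cutoffs for each metric:

```javascript
// Google's published Core Web Vitals thresholds (good / poor boundaries).
// Values between the two boundaries rate as "needs improvement".
const THRESHOLDS = {
  lcp: { good: 2500, poor: 4000 }, // milliseconds
  fid: { good: 100,  poor: 300 },  // milliseconds
  cls: { good: 0.1,  poor: 0.25 }, // unitless layout-shift score
};

// Classify a single measurement the way the RUM UI colors it.
function rateWebVital(metric, value) {
  const t = THRESHOLDS[metric];
  if (!t) throw new Error(`unknown metric: ${metric}`);
  if (value <= t.good) return "good";
  if (value <= t.poor) return "needs improvement";
  return "poor";
}

console.log(rateWebVital("lcp", 1800)); // → good
```

A page that loads its largest content block in 1.8 s rates “good”, while the same page at 3 s would rate “needs improvement” and trip a detector set at the 2500 ms boundary.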

Improving Web Vitals is a key component to optimizing your end user experience, so being able to quickly understand them and create detectors if they exceed a threshold is critical.

Google has some great resources if you want to learn more, for example the business impact of Core Web Vitals.

2.3. Front-end health

Common causes of frontend issues are JavaScript errors and long tasks, which can especially affect interactivity. Creating detectors on these indicators helps us investigate interactivity issues sooner than our users report them, allowing us to build workarounds or roll back related releases faster if needed. Learn more about optimizing long tasks for better end user experience!

JS error charts JS error charts Long task charts Long task charts

2.4. Back-end health

Common back-end issues affecting user experience are network issues and resource requests. In this example, we clearly see a spike in Time To First Byte that lines up with a resource request spike, so we already have a good starting place to investigate.

Back-end health charts Back-end health charts

  • Time To First Byte (TTFB), measures how long it takes for a client’s browser to receive the first byte of the response from the server. The longer it takes for the server to process the request and send a response, the slower your visitors’ browser is at displaying your page.
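As a sketch, TTFB can be computed from the browser’s Navigation Timing entry. The helper and sample values below are illustrative; in a real page you would read the entry with `performance.getEntriesByType("navigation")[0]`:

```javascript
// TTFB: time from the start of the navigation until the first byte of the
// response arrives (responseStart), including redirect, DNS, and connect time.
function timeToFirstByte(navEntry) {
  return navEntry.responseStart - navEntry.startTime;
}

// Illustrative entry; values are milliseconds relative to navigation start.
const entry = { startTime: 0, requestStart: 120, responseStart: 310 };
console.log(timeToFirstByte(entry)); // → 310
```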
Last Modified Jul 23, 2025

Analyzing RUM Tags in the Tag Spotlight view

  • Look into the Metrics views for the various endpoints and use the Tags sent via the Tag spotlight for deeper analysis

1. Find a URL for the Cart endpoint

From the RUM Overview page, select the URL for the Cart endpoint to dive deeper into the information available for this endpoint.

RUM-Cart2 RUM-Cart2

Once you have selected and clicked on the blue URL, you will find yourself in the Tag Spotlight overview.

RUM-Tag RUM-Tag

Here you will see all of the tags that have been sent to Splunk RUM as part of the RUM traces. The tags displayed are relevant to the overview you have selected. Some are generic tags created automatically when the trace was sent; others are additional tags you have added to the trace as part of the configuration of your website.

Additional Tags

We are already sending two additional tags; you have seen them defined in the Beacon URL that was added to your website in the first section of this workshop: app: "[nodename]-store", environment: "[nodename]-workshop". You can add additional tags in a similar way.
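For reference, tags like these are typically set when initializing the RUM agent (splunk-otel-web). The sketch below is illustrative only; the realm, token, and attribute values are placeholders, and `globalAttributes` is shown as one possible way to attach extra tags to every span:

```javascript
// Illustrative RUM agent initialization; realm, token, and names are placeholders.
SplunkRum.init({
  realm: "us1",                           // placeholder realm
  rumAccessToken: "<your-rum-token>",     // placeholder token
  applicationName: "jmcj-store",          // surfaces as the `app` tag
  deploymentEnvironment: "jmcj-workshop", // surfaces as the `environment` tag
  // Extra global attributes become additional tags on every RUM span:
  globalAttributes: { "workshop.team": "example-team" },
});
```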

In our example we have selected the Page Load view as shown here:

RUM-Header RUM-Header

You can select any of the following Tag views, each focused on a specific metric.

RUM-views RUM-views


2. Explore the information in the Tag Spotlight view

Tag Spotlight is designed to help you identify problems, either through the chart view, where you may quickly identify outliers, or via the Tags.

In the Page Load view, if you look at the Browser, Browser Version & OS Name Tag views, you can see the various browser types and versions, as well as the underlying OS.

This makes it easy to identify problems related to specific browser or OS versions, as they would be highlighted.

RUM-Tag2 RUM-Tag2

In the above example you can see that Firefox had the slowest response, that different Browser versions (Chrome) have different response times, and that the Android devices responded slowly.

A further example is the regional Tags, which you can use to identify problems related to ISP, location, and so on. Here you should be able to find the location you have been using to access the Online Boutique. Drill down by clicking on the name of the town or country you are accessing the Online Boutique from, as shown below (City of Amsterdam):

RUM-click RUM-click

This will select only the sessions relevant to the city selected as shown below:

RUM-Adam RUM-Adam

By selecting various Tags you build up a filter; you can see the current selection below.

RUM-Adam RUM-Adam

To clear the filter and see every trace click on Clear All at the top right of the page.

If the overview page is empty or shows RUM-Adam RUM-Adam, no traces have been received in the selected time window. Increase the time window at the top left; you can start with the Last 12 hours, for example.

You can then use your mouse to select the time slot you want, as shown in the view below, and activate that time filter by clicking on the little spyglass icon.

RUM-time RUM-time

Last Modified Jul 23, 2025

Analyzing RUM Sessions

  • Dive into RUM Session information in the RUM UI
  • Identify JavaScript errors in the Span of a user interaction

1. Drill down in the Sessions

After you have analyzed the information and drilled down via the Tag Spotlight to a subset of the traces, you can view the actual session as it was run by the end-user’s browser.

You do this by clicking on the link User Sessions as shown below:

RUM-Header RUM-Header

This will give you a list of sessions that matched both the time filter and the subset selected in the Tag Profile.

Select one by clicking on the session ID. It is a good idea to select one with the longest duration (preferably over 700 ms).

RUM-Header RUM-Header

Once you have selected the session, you will be taken to the session details page. As you are selecting a specific action that is part of the session, you will likely arrive somewhere in the middle of the session, at the moment of the interaction.

You can see that the URL you selected earlier is what we are focusing on in the waterfall.

RUM-Session-Tag RUM-Session-Tag

Scroll down a little bit on the page, so you see the end of the operation as shown below.

RUM-Session-info RUM-Session-info

You can see that we have received a few JavaScript console errors that may not have been detected by or visible to the end users. To examine these in more detail, click on the middle one that says: Cannot read properties of undefined (reading ‘Prcie’)

RUM-Session-info RUM-Session-info

This will expand the page to show the Span detail for this interaction. It contains a detailed error.stack you can pass on to the developer to solve the issue. You may have noticed when buying in the Online Boutique that the final total was always $0.00.

RUM-Session-info RUM-Session-info

Last Modified Jul 23, 2025

Advanced Synthetics

30 minutes  

Introduction

This workshop walks you through using the Chrome DevTools Recorder to create a synthetic test on a Splunk demonstration environment or on your own public website.

The exported JSON from the Chrome DevTools Recorder will then be used to create a Splunk Synthetic Monitoring Real Browser Test.

Pre-requisites

  • Google Chrome Browser installed
  • Publicly browser-accessible URL
  • Access to Splunk Observability Cloud

Supporting resources

  1. Lantern: advanced Selectors for multistep browser tests
  2. Chrome for Developers DevTools Tips
  3. web.dev Core Web Vitals reference
Last Modified Jul 23, 2025

Subsections of Advanced Synthetics

Record a test

Write down a short user journey you want to test. Remember: smaller bites are easier to chew! In other words, get started with just a few steps. This makes the test easier not only to create and maintain, but also to understand and act on the results. Test the features essential to your users, like a support contact form, login widget, or date picker.

Note

Record the test in the same type of viewport that you want to run it. For example, if you want to run a test on a mobile viewport, narrow your browser width to mobile and refresh before starting the recording. This way you are capturing the correct elements that could change depending on responsive style rules.

Open your starting URL in Chrome Incognito. This is important so you’re not carrying cookies into the recording, which we won’t set up in the Synthetic test by default. If your workshop instructor does not have a custom URL, feel free to use https://online-boutique-eu.splunko11y.com or https://online-boutique-us.splunko11y.com, which are used in the examples below.

Open the Chrome DevTools Recorder

Next, open the Developer Tools (in the new tab that was opened above) by pressing Ctrl + Shift + I on Windows or Cmd + Option + I on a Mac, then select Recorder from the top-level menu or the More tools flyout menu.

Open Recorder Open Recorder

Note

Site elements might change depending on viewport width. Before recording, set your browser window to the correct width for the test you want to create (Desktop, Tablet, or Mobile). Change the DevTools “dock side” to pop out as a separate window if it helps.

Create a new recording

With the Recorder panel open in the DevTools window, click on the Create a new recording button to start.

Recorder Recorder

For the Recording Name use your initials to prefix the name of the recording e.g. <your initials> - <website name>. Click on Start Recording to start recording your actions.

Recording Name Recording Name

Now that we are recording, complete a few actions on the site. An example for our demo site is:

  • Click on Vintage Camera Lens
  • Click on Add to Cart
  • Click on Place Order
  • Click on End recording in the Recorder panel.

End Recording End Recording

Export the recording

Click on the Export button:

Export button Export button

Select JSON as the format, then click on Save.

Export JSON Export JSON

Save JSON Save JSON

Congratulations! You have successfully created a recording using the Chrome DevTools Recorder. Next, we will use this recording to create a Real Browser Test in Splunk Synthetic Monitoring.


View the example JSON file for this browser test recording
{
    "title": "RWC - Online Boutique",
    "steps": [
        {
            "type": "setViewport",
            "width": 1430,
            "height": 1016,
            "deviceScaleFactor": 1,
            "isMobile": false,
            "hasTouch": false,
            "isLandscape": false
        },
        {
            "type": "navigate",
            "url": "https://online-boutique-eu.splunko11y.com/",
            "assertedEvents": [
                {
                    "type": "navigation",
                    "url": "https://online-boutique-eu.splunko11y.com/",
                    "title": "Online Boutique"
                }
            ]
        },
        {
            "type": "click",
            "target": "main",
            "selectors": [
                [
                    "div:nth-of-type(2) > div:nth-of-type(2) a > div"
                ],
                [
                    "xpath//html/body/main/div/div/div[2]/div[2]/div/a/div"
                ],
                [
                    "pierce/div:nth-of-type(2) > div:nth-of-type(2) a > div"
                ]
            ],
            "offsetY": 170,
            "offsetX": 180,
            "assertedEvents": [
                {
                    "type": "navigation",
                    "url": "https://online-boutique-eu.splunko11y.com/product/66VCHSJNUP",
                    "title": ""
                }
            ]
        },
        {
            "type": "click",
            "target": "main",
            "selectors": [
                [
                    "aria/ADD TO CART"
                ],
                [
                    "button"
                ],
                [
                    "xpath//html/body/main/div[1]/div/div[2]/div/form/div/button"
                ],
                [
                    "pierce/button"
                ],
                [
                    "text/Add to Cart"
                ]
            ],
            "offsetY": 35.0078125,
            "offsetX": 46.4140625,
            "assertedEvents": [
                {
                    "type": "navigation",
                    "url": "https://online-boutique-eu.splunko11y.com/cart",
                    "title": ""
                }
            ]
        },
        {
            "type": "click",
            "target": "main",
            "selectors": [
                [
                    "aria/PLACE ORDER"
                ],
                [
                    "div > div > div.py-3 button"
                ],
                [
                    "xpath//html/body/main/div/div/div[4]/div/form/div[4]/button"
                ],
                [
                    "pierce/div > div > div.py-3 button"
                ],
                [
                    "text/Place order"
                ]
            ],
            "offsetY": 29.8125,
            "offsetX": 66.8203125,
            "assertedEvents": [
                {
                    "type": "navigation",
                    "url": "https://online-boutique-eu.splunko11y.com/cart/checkout",
                    "title": ""
                }
            ]
        }
    ]
}
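Because the export is plain JSON, it is easy to post-process before importing it. The sketch below is illustrative (not a Splunk or Chrome tool); it prints one line per recorded step, using a trimmed-down copy of the recording above:

```javascript
// Summarize an exported Chrome DevTools recording: one line per step,
// showing the step type, target URL (for navigations), and the first
// selector (for clicks).
function summarizeRecording(recording) {
  return recording.steps.map((step) => {
    const url = step.url ? ` ${step.url}` : "";
    const sel = step.selectors ? ` ${step.selectors[0][0]}` : "";
    return `${step.type}${url}${sel}`;
  });
}

// Trimmed-down version of the exported recording above.
const recording = {
  title: "RWC - Online Boutique",
  steps: [
    { type: "setViewport", width: 1430, height: 1016 },
    { type: "navigate", url: "https://online-boutique-eu.splunko11y.com/" },
    { type: "click", selectors: [["aria/ADD TO CART"], ["button"]] },
  ],
};

console.log(summarizeRecording(recording));
// → [ 'setViewport',
//     'navigate https://online-boutique-eu.splunko11y.com/',
//     'click aria/ADD TO CART' ]
```

A quick summary like this is handy for reviewing what a teammate’s recording actually does before turning it into a browser test.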
Last Modified Jul 23, 2025

Create a Browser Test

In Splunk Observability Cloud, navigate to Synthetics and click on Add new test.

From the dropdown select Browser test.

Add new test Add new test

You will then be presented with the Browser test content configuration page.

New Test New Test

Last Modified Jul 23, 2025

Import JSON

To begin configuring our test, we need to import the JSON that we exported from the Chrome DevTools Recorder. To enable the Import button, we must first give our test a name e.g. [<your team name>] <your initials> - Online Boutique.

Browser edit form Browser edit form

Once the Import button is enabled, click on it and either drop the JSON file that you exported from the Chrome DevTools Recorder or upload the file.

Import JSON Import JSON

Once the JSON file has been uploaded, click on Continue to edit steps

Import complete message Import complete message

Before we make any edits to the test, let’s first configure the settings. Click on < Return to test

Return to test button in the browser test editor Return to test button in the browser test editor

Last Modified Jul 23, 2025

Test settings

The simple settings allow you to configure the basics of the test:

  • Name: The name of the test (e.g. RWC - Online Boutique).
  • Details:
    • Locations: The locations where the test will run from.
    • Device: Emulate different devices and connection speeds. Also, the viewport will be adjusted to match the chosen device.
    • Frequency: How often the test will run.
    • Round-robin: If multiple locations are selected, the test will run from one location at a time, rather than all locations at once.
    • Active: Set the test to active or inactive.

For this workshop, we will configure the locations that we wish to monitor from. Click in the Locations field and you will be presented with a list of global locations (over 50 in total).

Global Locations Global Locations

Select the following locations:

  • AWS - N. Virginia
  • AWS - London
  • AWS - Melbourne

Once complete, scroll down and click on Submit to save the test.

The test will now be scheduled to run every 5 minutes from the 3 locations that we have selected. This does take a few minutes for the schedule to be created.

So while we wait for the test to be scheduled, click on Edit test so we can go through the Advanced settings.

Last Modified Jul 23, 2025

Advanced Test Settings

Click on Advanced. These settings are optional and can be used to further configure the test.

Note

In the case of this workshop, we will not be using any of these settings; this is for informational purposes only.

Advanced Settings Advanced Settings

  • Security:
    • TLS/SSL validation: When activated, this feature is used to enforce the validation of expired, invalid hostname, or untrusted issuer on SSL/TLS certificates.
    • Authentication: Add credentials to authenticate with sites that require additional security protocols, for example from within a corporate network. By using concealed global variables in the Authentication field, you create an additional layer of security for your credentials and simplify the ability to share credentials across checks.
  • Custom Content:
    • Custom headers: Specify custom headers to send with each request. For example, you can add a header in your request to filter out requests from analytics on the back end by sending a specific header in the requests. You can also use custom headers to set cookies.
    • Cookies: Set cookies in the browser before the test starts. For example, to prevent a popup modal from randomly appearing and interfering with your test, you can set cookies. Any cookies that are set will apply to the domain of the starting URL of the check. Splunk Synthetics Monitoring uses the public suffix list to determine the domain.
    • Host overrides: Add host override rules to reroute requests from one host to another. For example, you can create a host override to test an existing production site against page resources loaded from a development site or a specific CDN edge node.

See the Advanced Settings for Browser Tests section of the Docs for more information.

Next, we will edit the test steps to provide more meaningful names for each step.

Last Modified Jul 23, 2025

Edit test steps

To edit the steps, click on the + Edit steps or synthetic transactions button.

For each step, we are going to give it a meaningful, readable name. That could look like:

  • Step 1: replace the text Go to URL with Go to Homepage
  • Step 2: enter the text Select Typewriter.
  • Step 3: enter Add to Cart.
  • Step 4: enter Place Order.

Editing browser test step names Editing browser test step names

Note

If you’d like, group the test steps into Transactions and edit the transaction names as seen above. This is especially useful for Single Page Apps (SPAs), where the resource waterfall is not split by URL. We can also create charts and alerts based on transactions.

Click < Return to test to return to the test configuration page and click Save to save the test.

You will be returned to the test dashboard where you will see test results start to appear.

Browser KPIs chart Browser KPIs chart

Congratulations! You have successfully created a Real Browser Test in Splunk Synthetic Monitoring. Next, we will look into a test result in more detail.

Last Modified Jul 23, 2025

View test results

1. Click into a spike or failure in your test run results.

Spike in the browser test performance KPIs chart Spike in the browser test performance KPIs chart

2. What can you learn about this test run? If it failed, use the error message, filmstrip, video replay, and waterfall to understand what happened.

Single test run result, with an error message and screenshots Single test run result, with an error message and screenshots

3. What do you see in the resources? Make sure to click through all of the page (or transaction) tabs.

resources in the browser test waterfall, with a long request highlighted resources in the browser test waterfall, with a long request highlighted

Workshop Question

Do you see anything interesting? Common issues to find and fix include: unexpected response codes, duplicate requests, forgotten third parties, large or slow files, and long gaps between requests.

Want to learn more about specific performance improvements? Google and Mozilla have great resources to help understand what goes into frontend performance as well as in-depth details of how to optimize it.

Last Modified Jul 23, 2025

Frontend Dashboards

15 minutes  

Go to Dashboards and find the End User Experiences dashboard group.

Click the three dots on the top right to open the dashboard menu, and select Save As, and include your team name and initials in the dashboard name.

Save to the dashboard group that matches your email address. Now you have your own copy of this dashboard to customize!

Dashboard save as Dashboard save as

Last Modified Jul 23, 2025

Subsections of Frontend Dashboards

Copying and editing charts

We have some good charts in our dashboard, but let’s add a few more.

  1. Go to Dashboards by clicking the dashboard icon on the left side of the screen. Find the Browser app health dashboard and scroll to the Largest Contentful Paint (LCP) chart. Click the chart actions icon to open the flyout menu, and click “Copy” to add this chart to your clipboard. copy chart copy chart

  2. Now you can continue to add any other charts to your clipboard by clicking the “add to clipboard” icon. copy chart inline icon copy chart inline icon

  3. When you have collected the charts you want on your dashboard, click the “create” icon on the top right. You might need to reload the page if you were looking at charts in another browser tab. create icon create icon

  4. Click the “Paste charts” menu option. paste charts paste charts

Now you are able to resize and edit the charts as you’d like!

Bonus: edit chart data

  1. Click the chart actions icon and select Open to edit the chart. chart actions menu chart actions menu

  2. Remove the existing Test signal. edit the test signal edit the test signal

  3. Click Add filter and type test: *yourInitials*. This will use a wildcard match so that all of the tests you have created that contain your initials (or any string you decide) will be pulled into the chart. add filter button add filter button

  4. Click into the functions to see how adding and removing dimensions changes how the data is displayed. For example, if you want all of your test location data rolled up, remove that dimension from the function. test signal functions test signal functions

  5. Change the chart name and description as appropriate, and click “Save and close” to commit your changes or just “Close” to cancel your changes. chart close buttons chart close buttons
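The wildcard filter in step 3 behaves roughly like the sketch below, where `*` matches any run of characters and everything else matches literally. This is an illustration of the matching semantics, not Splunk’s implementation:

```javascript
// Convert a `*`-style wildcard pattern into a regular expression and
// test it against a value (case-insensitively).
function wildcardMatch(pattern, value) {
  // Escape regex metacharacters except `*`, then turn `*` into `.*`.
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const re = new RegExp(`^${escaped.replace(/\*/g, ".*")}$`, "i");
  return re.test(value);
}

console.log(wildcardMatch("*jmcj*", "RWC jmcj - Online Boutique")); // → true
console.log(wildcardMatch("*jmcj*", "another test"));               // → false
```

So a filter of `test: *jmcj*` pulls in every test whose name contains `jmcj`, wherever your initials appear in the name.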

Last Modified Jul 23, 2025

Events in context with chart data

Seeing the visualization of our KPIs is great. What’s better? KPIs in context with events! Overlaying events on a dashboard can help us more quickly understand if an event like a deployment caused a change in metrics, for better or worse.

  1. Your instructor will push a condition change to the workshop application. Click the event marker on any of your dashboard charts to see more details. condition change event condition change event

  2. In the dimensions, we can see more details about this specific event. If we click the event record, we can mark for deletion if needed. event record event record event details event details

  3. We can also see a history of events in the event feed by clicking the icon on the top right of the screen, and selecting Event feed. event feed event feed

  4. Again, we can see details about recent events in this feed. event details event details

  5. We can also add new events in the GUI or via API. To add a new event in the GUI, click the New event button. GUI add event button GUI add event button

  6. Name your event with your team name, initials, and what kind of event it is (deployment, campaign start, etc). Choose a timestamp, or leave as-is to use the current time, and click “Create”. create event form create event form

  7. Now, we need to make sure our new event is overlaid in this dashboard. Wait a minute or so (refresh the page if needed) and then search for the event in the Event overlay field. overlay event overlay event

  8. If your event is within the dashboard time window, you should now see it overlaid in your charts. Click “Save” to make sure your event overlay is saved to your dashboard! save dashboard save dashboard
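Sending an event via the API boils down to POSTing a small JSON payload to the ingest endpoint (`https://ingest.<realm>.signalfx.com/v2/event` with an `X-SF-Token` header). The sketch below builds such a payload; the realm, token, event name, and dimensions are placeholders:

```javascript
// Build the JSON body for a custom (USER_DEFINED) event.
function buildEventPayload(eventType, dimensions, timestampMs) {
  return [{
    category: "USER_DEFINED",
    eventType,              // shows up as the event name in the UI
    dimensions,             // key/value pairs, like those in the event record
    timestamp: timestampMs, // milliseconds since epoch
  }];
}

const payload = buildEventPayload(
  "team-jmcj-deployment",                    // placeholder event name
  { service: "frontend", version: "v1.2.3" }, // placeholder dimensions
  Date.now()
);

// Illustrative send (realm and token are placeholders):
// fetch("https://ingest.us1.signalfx.com/v2/event", {
//   method: "POST",
//   headers: { "Content-Type": "application/json", "X-SF-Token": "<token>" },
//   body: JSON.stringify(payload),
// });
console.log(payload[0].eventType); // → team-jmcj-deployment
```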

Keep in mind

Want to add context to that bug ticket, or show your manager how your change improved app performance? Seeing observability data in context with events not only helps with troubleshooting, but also helps us communicate with other teams.

Last Modified Jul 23, 2025

Detectors

20 minutes  

After we have a good understanding of our performance baseline, we can start to create Detectors so that we receive alerts when our KPIs are unexpected. If we create detectors before understanding our baseline, we run the risk of generating unnecessary alert noise.

For RUM and Synthetics, we will explore how to create detectors:

  1. on a single Synthetic test
  2. on a single KPI in RUM
  3. on a dashboard chart

For more Detector resources, please see our Observability docs, Lantern, and consider an Education course if you’d like to go more in depth with instructor guidance.

Last Modified Jul 23, 2025

Subsections of Detectors

Test Detectors

Why would we want a detector on a single Synthetic test? Some examples:

  • The endpoint, API transaction, or browser journey is highly critical
  • We have deployed code changes and want to know if the resulting KPI is or is not as we expect
  • We need to temporarily keep a close eye on a specific change we are testing and don’t want to create a lot of noise, and will disable the detector later
  • We want to know about unexpected issues before a real user encounters them
  1. On the test overview page, click Create Detector on the top right. create detector on a single synthetic test

  2. Name the detector with your team name, your initials, and “LCP” (the signal we will eventually use), so that the instructor can keep track of everyone’s progress.

  3. Change the signal to First byte time. change the signal

  4. Change the alert details, and see how the chart to the right shows the number of alert events under those conditions. This is where you decide how much alert noise to generate, based on how much noise your team tolerates. Play with the settings to see how they affect the estimated alert noise. alert noise preview

  5. Now, change the signal to Largest contentful paint. This is a Core Web Vital that reflects loading performance as the user experiences it. Change the threshold to 2500ms. It’s okay if there is no sample alert event in the detector preview.

  6. Scroll down in this window to see the notification options, including severity and recipients. notification options

  7. Click the notifications link to customize the alert subject, message, tip, and runbook link. notification customization dialog

  8. When you are happy with the amount of alert noise this detector would generate, click Activate. activate the detector

Last Modified Jul 23, 2025

RUM Detectors

Let’s say we want to know about an issue in production without waiting for a ticket from our support center. This is where creating detectors in RUM will be helpful for us.

  1. Go to the RUM overview of our App. Scroll to the LCP chart, click the chart menu icon, and click Create Detector. RUM LCP chart with action menu flyout

  2. Rename the detector to include your team name and initials, and change the scope of the detector to App so we are not limited to a single URL or page. Change the threshold and sensitivity until there is at least one alert event in the time frame. RUM alert details

  3. Change the alert severity and add a recipient if you’d like, and click Activate to save the Detector.

Exercise

Now, your workshop instructor will change something on the website. How do you find out about the issue, and how do you investigate it?

Tip

Wait a few minutes, and take a look at the online store homepage in your browser. How is the experience in an incognito browser window? How is it different when you refresh the page?

Last Modified Jul 23, 2025

Chart Detectors

With our custom dashboard charts, we can create detectors focused directly on the data and conditions we care about. In building our charts, we also built signals that can trigger alerts.

Static detectors

For many KPIs, we have a static value in mind as a threshold.

  1. In your custom End User Experience dashboard, go to the “LCP - all tests” chart.

  2. Click the bell icon on the top right of the chart, and select “New detector from chart” new detector from chart

  3. Change the detector name to include your team name and initials, and adjust the alert details. Change the threshold to 2500 or 4000 and see how the alert noise preview changes. alert details

  4. Change the severity, and add yourself as a recipient before you save this detector. Click Activate. activate the detector
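For teams that manage alerting as code, the same kind of static threshold can also be created programmatically. The sketch below is hedged: the metric name `synthetics.lcp`, the realm, and the exact payload fields are illustrative assumptions showing the shape of a SignalFlow-based detector, not a copy-paste recipe for this workshop environment.

```shell
# Sketch: create a static-threshold detector via the API.
# REALM, API_TOKEN, and the metric name are placeholders.
REALM="us1"                 # your Splunk Observability realm
API_TOKEN="your-api-token"  # API-scope token (not an ingest token)

PAYLOAD=$(cat <<'EOF'
{
  "name": "Team Awesome AB - LCP static threshold",
  "programText": "A = data('synthetics.lcp')\ndetect(when(A > 2500)).publish('LCP above 2500ms')",
  "rules": [
    { "detectLabel": "LCP above 2500ms", "severity": "Major" }
  ]
}
EOF
)
echo "$PAYLOAD"

# Uncomment to create the detector:
# curl -X POST "https://api.${REALM}.signalfx.com/v2/detector" \
#   -H "Content-Type: application/json" \
#   -H "X-SF-TOKEN: ${API_TOKEN}" \
#   -d "$PAYLOAD"
```

The `detectLabel` in the rule must match the label passed to `publish()` in the SignalFlow program; that is how the alert condition is tied to its severity and recipients.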

Advanced: Dynamic detectors

Sometimes we have metrics that vary naturally, so we want to create a more dynamic detector that isn’t limited by the static threshold we decide in the moment.

  1. To create dynamic detectors on your chart, click the link to the “old” detector wizard. click the link to open this detector in the old editor

  2. Change the detector name to include your team name and initials, and click Create alert rule. name and create alert rule

  3. Confirm the signal looks correct and proceed to Alert condition. alert signal details

  4. Select the “Sudden Change” condition and proceed to Alert settings list of dynamic conditions

  5. Play with the settings and see how the estimated alert noise is previewed in the chart above. Tune the settings, and change the advanced settings if you’d like, before proceeding to the Alert message. alert noise marked on chart based on settings

  6. Customize the severity, runbook URL, any tips, and message payload before proceeding to add recipients. alert message options

  7. For the sake of this workshop, only add your own email address as recipient. In your real environment, this is where you would add other options like webhooks, ticketing systems, and Slack channels. recipient options

  8. Finally, confirm the detector name before clicking Activate Alert Rule activate alert rule button

Last Modified Jul 23, 2025

Summary

2 minutes  

In this workshop, we learned the following:

  • How to create simple synthetic tests so that we can quickly begin to understand the availability and performance of our application
  • How to understand what RUM shows us about the end user experience, including specific user sessions
  • How to write advanced synthetic browser tests to proactively test our most important user actions
  • How to visualize our frontend performance data in context with events on dashboards
  • How to set up detectors so we don’t have to wait to hear about issues from our end users
  • How all of the above, plus Splunk and Google’s resources, helps us optimize end user experience

There is a lot more we can do with frontend performance monitoring. If you have extra time, be sure to play with the charts and detectors, and do some more synthetic testing. Remember resources such as Lantern and Splunk Docs, and experiment with Mobile RUM.

This is just the beginning! If you need more time to trial Splunk Observability, or have any other questions, reach out to a Splunk Expert.

Last Modified Jul 23, 2025

ThousandEyes Integration with Splunk Observability Cloud

120 minutes   Author Alec Chamberlain

This workshop demonstrates integrating ThousandEyes with Splunk Observability Cloud to provide unified visibility across your synthetic monitoring and observability data.

What You’ll Learn

By the end of this workshop, you will:

  • Deploy a ThousandEyes Enterprise Agent as a containerized workload in Kubernetes
  • Integrate ThousandEyes metrics with Splunk Observability Cloud using OpenTelemetry
  • Configure distributed tracing so ThousandEyes and Splunk APM can link to the same requests
  • Create synthetic tests for internal Kubernetes services and external dependencies
  • Monitor test results in Splunk Observability Cloud dashboards
  • Move from ThousandEyes into Splunk APM traces and back to the originating ThousandEyes test

Sections

Core path

  • Overview - Understand ThousandEyes agent types and architecture
  • Deployment - Deploy the Enterprise Agent in Kubernetes
  • Splunk Integration - Stream ThousandEyes metrics into Splunk Observability Cloud
  • Distributed Tracing - Enable supported bi-directional drilldowns between ThousandEyes and Splunk APM

Scenario extensions

  • Kubernetes Testing - Create internal tests that are useful for both synthetic monitoring and trace correlation
  • RUM - Correlate ThousandEyes network signals with Splunk RUM for end-user investigations

Support

Tip

Think of this scenario as two connected integrations: the OpenTelemetry stream gets ThousandEyes metrics into Splunk, and distributed tracing gives you the reverse path back into ThousandEyes from Splunk APM.

Prerequisites

  • A Kubernetes cluster (v1.16+)
  • RBAC permissions to deploy resources in your chosen namespace
  • A ThousandEyes account with access to Enterprise Agent tokens
  • A Splunk Observability Cloud account with ingest token access and permission to create an API token for APM lookups

Benefits of Integration

By connecting ThousandEyes to Splunk Observability Cloud, you gain:

  • πŸ”— Unified visibility: Correlate synthetic test results with RUM, APM traces, and infrastructure metrics
  • πŸ“Š Enhanced dashboards: Visualize ThousandEyes data alongside your existing Splunk observability metrics
  • πŸ”„ Bi-directional drilldowns: Move from ThousandEyes Service Map to Splunk traces and from Splunk APM back to the ThousandEyes test that generated the request
  • 🚨 Centralized alerting: Configure alerts based on ThousandEyes test results within Splunk
  • πŸ” Root cause analysis: Quickly identify if issues are network-related (ThousandEyes) or application-related (APM)
  • πŸ“ˆ Comprehensive analytics: Analyze synthetic monitoring trends with Splunk’s powerful analytics engine
Last Modified Apr 10, 2026

Subsections of ThousandEyes Integration

Overview

10 minutes  

ThousandEyes Agent Types

Enterprise Agents

Enterprise Agents are software-based monitoring agents that you deploy within your own infrastructure. They provide:

  • Inside-out visibility: Monitor and test from your internal network to external services
  • Customizable placement: Deploy where your users and applications are
  • Full test capabilities: HTTP, network, DNS, voice, and other test types
  • Persistent monitoring: Continuously running agents that execute scheduled tests

In this workshop, we’re deploying an Enterprise Agent as a containerized workload inside a Kubernetes cluster.

Endpoint Agents

Endpoint Agents are lightweight agents installed on end-user devices (laptops, desktops) that provide:

  • Real user perspective: Monitor from actual user endpoints
  • Browser-based monitoring: Capture real user experience metrics
  • Session data: Detailed insights into application performance from the user’s viewpoint

This workshop focuses on Enterprise Agent deployment only.

Architecture

---
config:
  theme: 'base'
---
graph LR
    subgraph k8s["Kubernetes Cluster"]
        secret["Secret<br/>te-creds"]
        agent["ThousandEyes<br/>Enterprise Agent<br/>Pod"]
        
        subgraph apps["Application Pods"]
            api["API Gateway<br/>Pod"]
            payment["Payment Service<br/>Pod"]
            auth["Auth Service<br/>Pod"]
        end
        
        subgraph svcs["Services"]
            api_svc["api-gateway<br/>Service"]
            payment_svc["payment-svc<br/>Service"]
            auth_svc["auth-service<br/>Service"]
        end
        
        api_svc --> api
        payment_svc --> payment
        auth_svc --> auth
        
        secret -.-> agent
        agent -->|"HTTP Tests"| api_svc
        agent -->|"HTTP Tests"| payment_svc
        agent -->|"HTTP Tests"| auth_svc
    end
    
    external["External<br/>Services"]
    
    agent --> external
    
    subgraph te["ThousandEyes Platform"]
        te_cloud["ThousandEyes<br/>Cloud"]
        te_api["API<br/>v7/stream"]
        te_cloud <--> te_api
    end
    
    agent -->|"Test Results"| te_cloud
    
    subgraph splunk["Splunk Observability Cloud"]
        otel["OpenTelemetry<br/>Collector"]
        metrics["Metrics"]
        dashboards["Dashboards"]
        apm["APM/RUM"]
        alerts["Alerts"]
        
        otel --> metrics
        otel --> dashboards
        metrics --> apm
        dashboards --> alerts
    end
    
    te_cloud -->|"OTel/HTTP metrics"| otel
    te_cloud -->|"Trace lookup"| apm
    apm -->|"Deep links to test"| te_cloud
    
    user["DevOps/SRE<br/>Team"]
    user -.-> te_cloud
    user -.-> dashboards
    user -.-> agent
    
    style k8s fill:#e1f5ff,stroke:#0288d1,stroke-width:2px
    style apps fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    style svcs fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    style agent fill:#ffeb3b,stroke:#f57c00,stroke-width:2px
    style secret fill:#ffcdd2,stroke:#c62828,stroke-width:2px
    style api fill:#e1bee7,stroke:#7b1fa2,stroke-width:1px
    style payment fill:#e1bee7,stroke:#7b1fa2,stroke-width:1px
    style auth fill:#e1bee7,stroke:#7b1fa2,stroke-width:1px
    style api_svc fill:#ce93d8,stroke:#7b1fa2,stroke-width:1px
    style payment_svc fill:#ce93d8,stroke:#7b1fa2,stroke-width:1px
    style auth_svc fill:#ce93d8,stroke:#7b1fa2,stroke-width:1px
    style external fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style te fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    style te_cloud fill:#ffecb3,stroke:#f57f17,stroke-width:2px
    style te_api fill:#ffe082,stroke:#f57f17,stroke-width:2px
    style splunk fill:#ff6e40,stroke:#d84315,stroke-width:2px
    style otel fill:#ff8a65,stroke:#d84315,stroke-width:2px
    style metrics fill:#ffccbc,stroke:#d84315,stroke-width:1px
    style dashboards fill:#ffccbc,stroke:#d84315,stroke-width:1px
    style apm fill:#ffccbc,stroke:#d84315,stroke-width:1px
    style alerts fill:#ffccbc,stroke:#d84315,stroke-width:1px
    style user fill:#b2dfdb,stroke:#00695c,stroke-width:2px

Architecture Components

1. Kubernetes Cluster

  • Secret (te-creds): Stores the base64-encoded TEAGENT_ACCOUNT_TOKEN for authentication
  • ThousandEyes Enterprise Agent Pod:
    • Container image: thousandeyes/enterprise-agent:latest
    • Hostname: te-agent-aleccham (customizable)
    • Security capabilities: NET_ADMIN, SYS_ADMIN (required for network testing)
    • Memory allocation: 2GB request, 3.5GB limit
    • Network mode: IPv4 only (configured via TEAGENT_INET: "4" environment variable)
    • Image pull policy: Always (ensures latest image is pulled)
    • Init command: /sbin/my_init (required for proper agent initialization)
  • Internal Services: Kubernetes workloads including REST APIs, microservices, databases, and gRPC services

2. Test Targets

  • Internal Services: Monitor services within the Kubernetes cluster
  • External Services: Test external dependencies such as:
    • Payment gateways (Stripe, PayPal)
    • Third-party APIs
    • SaaS applications
    • CDN endpoints
    • Public websites

3. ThousandEyes Platform

  • ThousandEyes Cloud: Central platform for:
    • Agent registration and management
    • Test configuration and scheduling
    • Metrics collection and aggregation
    • Built-in alerting engine
  • ThousandEyes API: RESTful API (v7/stream endpoint) for programmatic access

4. Test Types & Metrics

The Enterprise Agent performs:

  • HTTP/HTTPS tests: Web page availability, response times, status codes
  • DNS tests: Resolution time, record validation
  • Network layer tests: Latency, packet loss, path visualization
  • Voice/RTP tests: Quality metrics for voice traffic

Metrics collected include:

  • HTTP server availability (%)
  • Throughput (bytes/s)
  • Request duration (seconds)
  • Page load completion (%)
  • Error codes and failure reasons

5. Splunk Observability Cloud Integration

  • OpenTelemetry Metrics Stream:
    • Endpoint: https://ingest.{realm}.signalfx.com/v2/datapoint/otlp
    • Protocol: HTTP or gRPC
    • Format: Protobuf
    • Authentication: X-SF-Token header
    • Signal type: Metrics (OpenTelemetry v2)
  • Distributed Tracing Integration:
    • ThousandEyes test type: HTTP Server or API with distributed tracing enabled
    • ThousandEyes connector target: https://api.{realm}.signalfx.com
    • Authentication: Splunk API token in the X-SF-Token header
    • Outcome: ThousandEyes can open related Splunk APM traces, and Splunk APM traces can link back to the originating ThousandEyes test
  • Observability Features:
    • Metrics: Real-time visualization of ThousandEyes data
    • Dashboards: Pre-built ThousandEyes dashboard with unified views
    • APM/RUM Integration: Correlate synthetic tests with application traces and real user monitoring
    • Alerting: Centralized alert management with correlation rules

6. Data Flow

  1. Agent authenticates using token from Kubernetes Secret
  2. Agent runs scheduled tests against internal and external targets
  3. Test results sent to ThousandEyes Cloud
  4. ThousandEyes streams metrics to Splunk via OpenTelemetry protocol
  5. For HTTP Server and API tests with distributed tracing enabled, ThousandEyes injects b3, traceparent, and tracestate headers into the request
  6. The instrumented application sends the resulting trace to Splunk APM
  7. ThousandEyes can open the related Splunk trace, and Splunk APM can link back to the original ThousandEyes test
  8. DevOps, network, and application teams collaborate across both views during an investigation

Testing Capabilities

With this deployment, you can:

  • βœ… Test internal services: Monitor Kubernetes services, APIs, and microservices from within the cluster
  • βœ… Test external dependencies: Validate connectivity to payment gateways, third-party APIs, and SaaS platforms
  • βœ… Measure performance: Capture latency, availability, and performance metrics from your cluster’s perspective
  • βœ… Troubleshoot issues: Identify whether problems originate from your infrastructure, network path, or instrumented application services
Note

This is not an officially supported ThousandEyes agent deployment configuration. However, it has been tested and works very well in production-like environments.

Last Modified Apr 10, 2026

Deployment

20 minutes  

This section guides you through deploying the ThousandEyes Enterprise Agent in your Kubernetes cluster.

Components

The deployment consists of two files:

1. Secrets File (credentialsSecret.yaml)

Contains your ThousandEyes agent token (base64 encoded). This secret is referenced by the deployment to authenticate the agent with ThousandEyes Cloud.

apiVersion: v1
kind: Secret
metadata:
  name: te-creds
type: Opaque
data:
  TEAGENT_ACCOUNT_TOKEN: <base64-encoded-token>

2. Deployment Manifest (thousandEyesDeploy.yaml)

Defines the Enterprise Agent pod configuration with the following key settings:

  • Namespace: te-demo (customize as needed)
  • Image: thousandeyes/enterprise-agent:latest from Docker Hub
  • Hostname: te-agent-aleccham (appears in ThousandEyes dashboard)
  • Capabilities: Requires NET_ADMIN and SYS_ADMIN for network testing
  • Resources:
    • Memory limit: 3584Mi
    • Memory request: 2000Mi
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: te-demo
  name: thousandeyes
  labels:
    app: thousandeyes
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thousandeyes
  template:
    metadata:
      labels:
        app: thousandeyes
    spec:
      hostname: te-agent-aleccham
      containers:
      - name: thousandeyes
        image: 'thousandeyes/enterprise-agent:latest'
        imagePullPolicy: Always
        command:
          - /sbin/my_init
        securityContext:
          capabilities:
            add:
              - NET_ADMIN
              - SYS_ADMIN
        env:
          - name: TEAGENT_ACCOUNT_TOKEN
            valueFrom:
              secretKeyRef:
                name: te-creds
                key: TEAGENT_ACCOUNT_TOKEN
          - name: TEAGENT_INET
            value: "4"
        resources:
          limits:
            memory: 3584Mi
          requests:
            memory: 2000Mi
Important Notes
  • The agent requires elevated privileges (NET_ADMIN, SYS_ADMIN) to perform network tests
  • The TEAGENT_INET: "4" environment variable forces IPv4-only mode (required for some network configurations)
  • The /sbin/my_init command is required for proper agent initialization and service management
  • The imagePullPolicy: Always ensures you always pull the latest image version
  • Adjust the hostname field to uniquely identify your agent in the ThousandEyes dashboard
  • Modify the namespace to match your Kubernetes environment
  • The ThousandEyes Enterprise Agent has relatively high hardware requirements; you may need to adjust these depending on your environment

Installation Steps

Step 1: Create the ThousandEyes Token

  1. Log in to the ThousandEyes platform at app.thousandeyes.com/login

  2. Navigate to Cloud & Enterprise Agents > Agent Settings > Add New Enterprise Agent

  3. Copy your Account Group Token

  4. Base64 encode the token:

    echo -n 'your-token-here' | base64
  5. Save the base64-encoded output for the next step

Get ThousandEyes Token
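Before pasting the encoded value into the Secret, it’s worth a quick round-trip sanity check. The token below is a placeholder; `--decode` assumes GNU coreutils (macOS also accepts `-D`):

```shell
TOKEN='your-token-here'               # placeholder: your real account group token
ENCODED=$(echo -n "$TOKEN" | base64)  # this is the value that goes into the Secret
DECODED=$(echo "$ENCODED" | base64 --decode)
echo "$DECODED"                       # should print the original token unchanged
```

If the decoded output differs from the original token (a trailing newline from forgetting `-n` is the usual culprit), the agent will fail to authenticate.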

Step 2: Create the Namespace

Create the namespace (if it doesn’t exist):

kubectl create namespace te-demo

Step 3: Create the Secret

Create a file named credentialsSecret.yaml with your base64-encoded token:

apiVersion: v1
kind: Secret
metadata:
  name: te-creds
  namespace: te-demo
type: Opaque
data:
  TEAGENT_ACCOUNT_TOKEN: <your-base64-encoded-token-here>

Apply the secret:

kubectl apply -f credentialsSecret.yaml
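If you prefer to script this step, the manifest can be generated with the encoding done inline. The token value here is a placeholder:

```shell
TOKEN='your-token-here'   # placeholder: raw (unencoded) ThousandEyes token
cat > credentialsSecret.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: te-creds
  namespace: te-demo
type: Opaque
data:
  TEAGENT_ACCOUNT_TOKEN: $(echo -n "$TOKEN" | base64)
EOF
cat credentialsSecret.yaml
```

Because the heredoc delimiter is unquoted, the `$(...)` substitution runs at generation time, so the file on disk contains the base64-encoded value and never the raw token.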

Step 4: Create the Deployment

Create a file named thousandEyesDeploy.yaml with the deployment manifest shown above (customize the hostname and namespace as needed).

Apply the deployment:

kubectl apply -f thousandEyesDeploy.yaml

Step 5: Verify the Deployment

Verify the agent is running:

kubectl get pods -n te-demo

Expected output:

NAME                            READY   STATUS    RESTARTS   AGE
thousandeyes-xxxxxxxxxx-xxxxx   1/1     Running   0          2m

Check the logs to ensure the agent is connecting:

kubectl logs -n te-demo -l app=thousandeyes

Step 6: Verify in ThousandEyes Dashboard

Verify in the ThousandEyes dashboard that the agent has registered successfully:

Navigate to Cloud & Enterprise Agents > Agent Settings to see your newly registered agent.

Success

Your ThousandEyes Enterprise Agent is now running in Kubernetes! Next, we’ll integrate it with Splunk Observability Cloud.

Background

ThousandEyes does not provide official Kubernetes deployment documentation. Their standard deployment method uses docker run commands, which makes it challenging to translate into reusable Kubernetes manifests. This guide bridges that gap by providing production-ready Kubernetes configurations.

Last Modified Mar 29, 2026

Splunk Integration

15 minutes  

About Splunk Observability Cloud

Splunk Observability Cloud is a real-time observability platform purpose-built for monitoring metrics, traces, and logs at scale. It ingests OpenTelemetry data and provides advanced dashboards and analytics to help teams detect and resolve performance issues quickly. This section explains how to integrate ThousandEyes data with Splunk Observability Cloud using OpenTelemetry.

Scope Of This Section

This section covers the metrics streaming path from ThousandEyes into Splunk Observability Cloud. The next section adds the separate distributed tracing workflow that creates bi-directional links between ThousandEyes and Splunk APM.

Step 1: Create a Splunk Observability Cloud Access Token

To send ThousandEyes metrics to Splunk Observability Cloud, you need an access token with the Ingest scope. Follow these steps:

  1. In the Splunk Observability Cloud platform, go to Settings > Access Token
  2. Click Create Token
  3. Enter a Name
  4. Select Ingest scope
  5. Select Create to generate your access token
  6. Copy the access token and store it securely

You need the access token to send telemetry data to Splunk Observability Cloud.

Step 2: Create an Integration

This integration is the one-way telemetry stream that gets ThousandEyes metrics into Splunk Observability Cloud dashboards and detectors.

Using the ThousandEyes UI

To integrate Splunk Observability Cloud with ThousandEyes:

  1. Log in to your account on the ThousandEyes platform and go to Manage > Integration > Integration 1.0

  2. Click New Integration and select OpenTelemetry Integration

    ThousandEyes Integration Setup

  3. Enter a Name for the integration

  4. Set the Target to HTTP

  5. Enter the Endpoint URL: https://ingest.{REALM}.signalfx.com/v2/datapoint/otlp

    • Replace {REALM} with your Splunk environment (e.g., us1, eu0)
  6. For Preset Configuration, select Splunk Observability Cloud

  7. For Auth Type, select Custom

  8. Add the following Custom Headers:

    • X-SF-Token: {TOKEN} (Enter your Splunk Observability Cloud access token created in Step 1)
    • Content-Type: application/x-protobuf
  9. For OpenTelemetry Signal, select Metric

  10. For Data Model Version, select v2

  11. Select a test

  12. Click Save to complete the integration setup

Integration Complete

You have now successfully integrated your ThousandEyes data with Splunk Observability Cloud.

Using the ThousandEyes API

For a programmatic integration, use the following API commands:

HTTP Protocol

curl -v -XPOST https://api.thousandeyes.com/v7/stream \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BEARER_TOKEN" \
  -d '{
    "type": "opentelemetry",
    "testMatch": [{
      "id": "281474976717575",
      "domain": "cea"
    }],
    "endpointType": "http",
    "streamEndpointUrl": "https://ingest.{REALM}.signalfx.com:443/v2/datapoint/otlp",
    "customHeaders": {
      "X-SF-Token": "{TOKEN}",
      "Content-Type": "application/x-protobuf"
    }
  }'

gRPC Protocol

curl -v -XPOST https://api.thousandeyes.com/v7/stream \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BEARER_TOKEN" \
  -d '{
    "type": "opentelemetry",
    "testMatch": [{
      "id": "281474976717575",
      "domain": "cea"
    }],
    "endpointType": "grpc",
    "streamEndpointUrl": "https://ingest.{REALM}.signalfx.com:443",
    "customHeaders": {
      "X-SF-Token": "{TOKEN}",
      "Content-Type": "application/x-protobuf"
    }
  }'

Replace streamEndpointUrl and X-SF-Token values with the correct values for your Splunk Observability Cloud instance.

Note

Make sure to replace {REALM} with your Splunk environment realm (e.g., us1, us2, eu0) and {TOKEN} with your actual Splunk access token.

What Comes Next

After you finish the metrics integration, continue to Distributed Tracing to add the reverse investigation path from ThousandEyes into Splunk APM and back again.

Step 3: ThousandEyes Dashboard in Splunk Observability Cloud

Once the integration is set up, you can view real-time monitoring data in the ThousandEyes Network Monitoring Dashboard within Splunk Observability Cloud. The dashboard includes:

  • HTTP Server Availability (%): Displays the availability of monitored HTTP servers
  • HTTP Throughput (bytes/s): Shows the data transfer rate over time
  • Client Request Duration (seconds): Measures the latency of client requests
  • Web Page Load Completion (%): Indicates the percentage of successful page loads
  • Page Load Duration (seconds): Displays the time taken to load pages

Dashboard Template

You can download the dashboard template from the following link: Download ThousandEyes Splunk Observability Cloud dashboard template (Google Drive).

Success

Your ThousandEyes data is now streaming to Splunk Observability Cloud. Next, add the distributed tracing connector so you can pivot between ThousandEyes and Splunk APM during troubleshooting.

Last Modified Mar 29, 2026

Distributed Tracing and Bi-Directional Drilldowns

25 minutes  

This section turns the ThousandEyes and Splunk integration into a true investigation workflow. In the previous section, ThousandEyes streamed synthetic metrics into Splunk Observability Cloud. In this section, you will enable the supported ThousandEyes <-> Splunk APM distributed tracing integration so network, platform, and application teams can pivot between both tools while looking at the same request.

Why This Matters

This is the piece that gives you bi-directional access between the two environments. ThousandEyes can open the related trace in Splunk APM, and Splunk APM can take you back to the originating ThousandEyes test.

What You Will Learn

By the end of this section, you will be able to:

  • Instrument an internal service so it sends traces to Splunk APM
  • Enable distributed tracing on a ThousandEyes HTTP Server or API test
  • Configure the ThousandEyes Generic Connector for Splunk APM
  • Open the ThousandEyes Service Map and jump directly into the corresponding Splunk trace
  • Use the ThousandEyes metadata in Splunk APM to jump back to the original ThousandEyes test

Supported Workflow

This learning scenario follows the supported workflow documented by ThousandEyes and Splunk:

  • ThousandEyes automatically injects b3, traceparent, and tracestate headers into HTTP Server and API tests when distributed tracing is enabled.
  • The monitored endpoint must accept headers, be instrumented with OpenTelemetry, propagate trace context, and send traces to your observability backend.
  • For Splunk APM, ThousandEyes uses a Generic Connector that points at https://api.<REALM>.signalfx.com and authenticates with an API-scope Splunk token.
  • Splunk APM enriches matching traces with ThousandEyes attributes such as thousandeyes.test.id and thousandeyes.permalink, which enables the reverse jump back to ThousandEyes.

What Those Headers Actually Mean

This part is easy to gloss over, but it should not be. The trace correlation only works if the service understands the headers ThousandEyes injects and continues the trace correctly.

  • traceparent and tracestate are the W3C Trace Context headers.
  • b3 is the Zipkin B3 single-header format.
  • ThousandEyes injects both because real environments often contain a mix of proxies, meshes, gateways, and app runtimes that do not all prefer the same propagation format.
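On the wire, the two formats look like this (the IDs are made up for illustration). Both carry the same 16-byte trace ID, which is what lets ThousandEyes and Splunk APM agree they are looking at the same request:

```shell
# W3C Trace Context format: <version>-<trace-id>-<parent-span-id>-<trace-flags>
TRACEPARENT="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
# B3 single-header format: <trace-id>-<span-id>-<sampling-state>
B3="4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-1"

# Extract the shared trace ID from the traceparent value
TRACE_ID=$(echo "$TRACEPARENT" | cut -d- -f2)
echo "$TRACE_ID"   # β†’ 4bf92f3577b34da6a3ce929d0e0e4736
```

When debugging a broken correlation, comparing this trace ID against the `trace_id` on the spans arriving at the collector is a quick way to confirm whether the service actually continued the ThousandEyes-initiated trace.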

In OpenTelemetry terms, the important setting is the propagator list:

OTEL_PROPAGATORS=baggage,b3,tracecontext

That does two things:

  1. It allows the service to extract either B3 or W3C context from the inbound ThousandEyes request.
  2. It preserves the W3C tracestate by keeping tracecontext enabled.
Important Detail

You do not add tracestate as a separate OpenTelemetry propagator. The tracecontext propagator handles both traceparent and tracestate.

What “Properly Done” Looks Like

The collector is only one part of this setup. A correct ThousandEyes tracing deployment in Kubernetes has three layers:

  1. Deployment annotation so the OpenTelemetry Operator injects the runtime-specific instrumentation.
  2. Instrumentation resource so the injected SDK knows where to send traces and which propagators to use.
  3. Collector trace pipeline so OTLP traces are actually received and exported to Splunk APM.

The most common mistake is to focus only on the collector. The collector never sees raw b3, traceparent, or tracestate request headers directly. Your application or auto-instrumentation library must extract those headers first, continue the span context, and then emit spans over OTLP to the collector.

Real-World Configuration From The Current Cluster

The examples below are trimmed from the live cluster currently running this workshop. They show the pattern that is actually working in Kubernetes today.

1. Deployment Annotation

In the live cluster, the teastore applications point at the teastore/default Instrumentation resource:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: teastore-webui-v1
  namespace: teastore
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/container-names: teastore-webui-v1
        instrumentation.opentelemetry.io/inject-java: teastore/default

This is the first place to verify when ThousandEyes requests are not turning into traces.

2. Instrumentation Resource

This is the live Instrumentation object from teastore, trimmed to the fields that matter for ThousandEyes:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default
  namespace: teastore
spec:
  exporter:
    endpoint: http://splunk-otel-collector-agent.otel-splunk.svc:4317
  propagators:
    - baggage
    - b3
    - tracecontext
  sampler:
    type: parentbased_always_on
  env:
    - name: OTEL_RESOURCE_ATTRIBUTES
      value: deployment.environment=teastore

This is the critical part for the ThousandEyes scenario:

  • endpoint sends spans to the cluster-local OTel agent service.
  • b3 allows extraction of ThousandEyes B3 headers.
  • tracecontext preserves traceparent and tracestate.
  • parentbased_always_on ensures the trace continues once ThousandEyes starts the request.

3. What The Injected Pod Actually Gets

On the running teastore-webui-v1 pod, the operator injected the following environment variables:

- name: JAVA_TOOL_OPTIONS
  value: " -javaagent:/otel-auto-instrumentation-java-teastore-webui-v1/javaagent.jar"
- name: OTEL_SERVICE_NAME
  value: teastore-webui-v1
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: http://splunk-otel-collector-agent.otel-splunk.svc:4317
- name: OTEL_PROPAGATORS
  value: baggage,b3,tracecontext
- name: OTEL_TRACES_SAMPLER
  value: parentbased_always_on

This is a useful validation checkpoint because it proves the propagators are being applied to the workload, not just declared in an abstract config object.

4. Agent Collector Trace Pipeline

The live agent collector in otel-splunk is receiving OTLP, Jaeger, and Zipkin traffic and forwarding traces upstream. This is a trimmed excerpt from the running ConfigMap:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
  zipkin:
    endpoint: 0.0.0.0:9411

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger, zipkin]
      processors:
        [memory_limiter, k8sattributes, batch, resourcedetection, resource, resource/add_environment]
      exporters: [otlp, signalfx]

For ThousandEyes, the important part is not a special B3 option in the collector. The important part is simply that the collector exposes OTLP on 4317 and 4318, and that your services are exporting their spans there.

5. Gateway Collector Export To Splunk APM

The live gateway collector then forwards traces to Splunk Observability Cloud. This is the relevant part of the running gateway ConfigMap:

exporters:
  otlphttp:
    auth:
      authenticator: headers_setter
    traces_endpoint: https://ingest.us1.signalfx.com/v2/trace/otlp

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        include_metadata: true
      http:
        endpoint: 0.0.0.0:4318
        include_metadata: true

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger, zipkin]
      processors:
        [memory_limiter, k8sattributes, batch, resource/add_cluster_name, resource/add_environment]
      exporters: [otlphttp, signalfx]

This is the part that gets the spans to Splunk APM. If this pipeline is broken, ThousandEyes can still inject headers into the request, but no correlated trace will ever appear in Splunk.

Current Cluster Takeaway

In the live cluster, the teastore/default Instrumentation resource is the pattern to follow for ThousandEyes because it explicitly includes b3 together with tracecontext. That is the configuration you want to replicate for this scenario.

Important

Do not use a browser page URL for this section. ThousandEyes documents that browsers do not accept the custom trace headers required for this workflow. Use an instrumented backend endpoint behind an HTTP Server or API test instead.

Step 1: Make Sure the Workload Emits Traces to Splunk APM

If your application is already instrumented and traces are visible in Splunk APM, you can skip to Step 2. Otherwise, the fastest learning path in Kubernetes is to use the Splunk OpenTelemetry Collector with the Operator enabled for zero-code instrumentation.

Install the Splunk OpenTelemetry Collector with the Operator

helm repo add splunk-otel-collector-chart https://signalfx.github.io/splunk-otel-collector-chart
helm repo update

helm install splunk-otel-collector splunk-otel-collector-chart/splunk-otel-collector \
  --namespace otel-splunk \
  --create-namespace \
  --set splunkObservability.realm=$REALM \
  --set splunkObservability.accessToken=$ACCESS_TOKEN \
  --set clusterName=$CLUSTER_NAME \
  --set operator.enabled=true \
  --set operatorcrds.install=true

Annotate the Deployment for Auto-Instrumentation

For Java workloads, a generic example looks like this:

kubectl patch deployment api-gateway -n production -p '{"spec":{"template":{"metadata":{"annotations":{"instrumentation.opentelemetry.io/inject-java":"otel-splunk/splunk-otel-collector"}}}}}'

For other runtimes, use the annotation that matches the language:

  • instrumentation.opentelemetry.io/inject-nodejs
  • instrumentation.opentelemetry.io/inject-python
  • instrumentation.opentelemetry.io/inject-dotnet

If the collector is installed in the same namespace as the application, the official Splunk documentation notes that you can also use "true" as the annotation value.

If you want to follow the live cluster pattern from this workshop environment, the annotation value is namespace-qualified and points at the teastore/default Instrumentation object:

kubectl patch deployment teastore-webui-v1 -n teastore -p '{"spec":{"template":{"metadata":{"annotations":{"instrumentation.opentelemetry.io/container-names":"teastore-webui-v1","instrumentation.opentelemetry.io/inject-java":"teastore/default"}}}}}'

Validate That Traces Exist

  1. Wait for the deployment rollout to finish:

    kubectl rollout status deployment/api-gateway -n production
  2. Generate a few requests against a backend endpoint that crosses more than one service, for example:

    http://api-gateway.production.svc.cluster.local:8080/api/v1/orders

    In the current workshop cluster, a service such as http://teastore-webui.teastore.svc.cluster.local:8080/ is the right kind of target because it fronts several downstream application services and produces a more useful end-to-end trace than a simple health check.

  3. Confirm that traces are arriving in Splunk APM before you continue.
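Step 2 above can be scripted. The sketch below fires a short burst of requests so fresh traces appear in Splunk APM; the TARGET value is an assumption from this workshop environment, and the curl line is commented out so the loop can be previewed without cluster access:

```shell
# TARGET is an assumption; point it at your own multi-service route.
TARGET="http://teastore-webui.teastore.svc.cluster.local:8080/"
for i in 1 2 3 4 5; do
  echo "request $i -> $TARGET"
  # Uncomment inside the cluster to actually send traffic:
  # curl -s -o /dev/null -w '%{http_code}\n' "$TARGET"
done
```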

Learning Tip

Use a business transaction, not a pure /health endpoint, for the tracing exercise. A multi-service request gives you a far better Service Map in ThousandEyes and a more useful trace in Splunk APM.

Step 2: Enable Distributed Tracing on the ThousandEyes Test

Create or edit an HTTP Server or API test that targets the instrumented backend endpoint from Step 1.

  1. In ThousandEyes, create an HTTP Server or API test.
  2. Open Advanced Settings.
  3. Enable Distributed Tracing.
  4. Save the test and run it against the same endpoint that is already sending traces to Splunk APM.

Enable Distributed Tracing in ThousandEyes

After the test runs, ThousandEyes injects the trace headers and captures the trace context for that request.

Step 3: Create the Splunk APM Connector in ThousandEyes

The metric streaming integration from the previous section uses an Ingest token. This step is different: ThousandEyes needs to query Splunk APM and build trace links, so it uses a Splunk API token instead.

  1. In Splunk Observability Cloud, create an access token with the API scope.
  2. In ThousandEyes, go to Manage > Integrations > Integrations 2.0.
  3. Create a Generic Connector with:
    • Target URL: https://api.<REALM>.signalfx.com
    • Header: X-SF-Token: <your-api-scope-token>
  4. Create a new Operation and select Splunk Observability APM.
  5. Enable the operation and save the integration.

Splunk APM Generic Connector in ThousandEyes

Splunk APM Operation in ThousandEyes

Step 4: Validate the Bi-Directional Investigation Loop

Once the test is running and the connector is enabled, validate the workflow in both directions.

Start in ThousandEyes

  1. Open the test in ThousandEyes.
  2. Navigate to the Service Map tab.
  3. Confirm that you can see the trace path, service latency, and any downstream errors.
  4. Use the ThousandEyes link into Splunk APM to inspect the full trace.

ThousandEyes Service Map with Splunk APM correlation

Continue in Splunk APM

Inside Splunk APM, verify that the trace contains ThousandEyes metadata such as:

  • thousandeyes.account.id
  • thousandeyes.test.id
  • thousandeyes.permalink
  • thousandeyes.source.agent.id

Use either the thousandeyes.permalink field or the Go to ThousandEyes test button in the trace waterfall view to navigate back to the originating ThousandEyes test.

Splunk APM trace linked back to ThousandEyes

Suggested Learning Scenario

Use the following flow during a workshop:

  1. Create a ThousandEyes test against an internal API route that calls multiple services.
  2. Let ThousandEyes surface the issue first, so the class starts from the network and synthetic-monitoring perspective.
  3. Open the Service Map in ThousandEyes and identify where latency or errors begin.
  4. Jump into Splunk APM for span-level analysis.
  5. Jump back to ThousandEyes to inspect the test, agent, and network path again.

This is a strong teaching loop because it mirrors how different teams actually work:

  • Network and edge teams often start in ThousandEyes.
  • SRE and platform teams often start in Splunk dashboards or alerts.
  • Application teams usually want the trace in Splunk APM.

With this integration in place, everyone can pivot without losing context.

Common Pitfalls

  • A test might be visible in Splunk dashboards but still have no trace correlation. That usually means only the metrics stream is configured, not the Splunk APM Generic Connector.
  • A trace might exist in Splunk APM but not show up in ThousandEyes if the monitored endpoint does not propagate the trace headers downstream.
  • A shallow endpoint such as /health often produces limited trace value even when the configuration is correct.


Last Modified Mar 29, 2026

Kubernetes Service Testing and Correlation

20 minutes  

Replicating AppDynamics Test Recommendations

AppDynamics offers a feature called “Test Recommendations” that automatically suggests synthetic tests for your application endpoints. With ThousandEyes deployed inside your Kubernetes cluster, you can replicate this capability by leveraging Kubernetes service discovery combined with Splunk Observability Cloud’s unified view.

Since the ThousandEyes Enterprise Agent runs inside the cluster, it can directly test internal Kubernetes services using their service names as hostnames. This provides a powerful way to monitor backend services that may not be exposed externally.

How It Works

  1. Service Discovery: Use kubectl get svc to enumerate services in your cluster
  2. Hostname Construction: Build test URLs using Kubernetes DNS naming convention: <service-name>.<namespace>.svc.cluster.local
  3. Test Creation: Create both availability tests and trace-enabled transaction tests for internal services
  4. Correlation in Splunk: View synthetic test results alongside APM traces and infrastructure metrics

Benefits of In-Cluster Testing

  • Internal Service Monitoring: Test backend services not exposed to the internet
  • Service Mesh Awareness: Monitor services behind Istio, Linkerd, or other service meshes
  • DNS Resolution Testing: Validate Kubernetes DNS and service discovery
  • Network Policy Validation: Ensure network policies allow proper communication
  • Latency Baseline: Measure cluster-internal network performance
  • Pre-Production Testing: Test services before exposing them via Ingress/LoadBalancer

Step-by-Step Guide

1. Discover Kubernetes Services

List all services in your cluster or a specific namespace:

# Get all services in all namespaces
kubectl get svc --all-namespaces

# Get services in a specific namespace
kubectl get svc -n production

# Get services with detailed output including ports
kubectl get svc -n production -o wide

Example output:

NAMESPACE    NAME           TYPE        CLUSTER-IP      PORT(S)    AGE
production   api-gateway    ClusterIP   10.96.100.50    8080/TCP   5d
production   payment-svc    ClusterIP   10.96.100.51    8080/TCP   5d
production   auth-service   ClusterIP   10.96.100.52    9000/TCP   5d
production   postgres       ClusterIP   10.96.100.53    5432/TCP   5d

2. Build Test Hostnames

Kubernetes services are accessible via DNS using the following naming pattern:

<service-name>.<namespace>.svc.cluster.local

For the services above:

  • api-gateway.production.svc.cluster.local:8080
  • payment-svc.production.svc.cluster.local:8080
  • auth-service.production.svc.cluster.local:9000

Shorthand within the same namespace: If testing services in the same namespace as the ThousandEyes agent, you can use just the service name:

  • api-gateway:8080
  • payment-svc:8080
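The naming rule above can be captured in a small helper. This is a sketch (the svc_url function name is ours, not a kubectl feature); it purely applies the <service-name>.<namespace>.svc.cluster.local pattern and does not verify that the service exists:

```shell
# Build the cluster-local test URL for a service.
# Arguments: service name, namespace, port.
svc_url() {
  printf 'http://%s.%s.svc.cluster.local:%s\n' "$1" "$2" "$3"
}

svc_url api-gateway production 8080
svc_url auth-service production 9000
```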

3. Create ThousandEyes Tests for Internal Services

For the best learning outcome, create two kinds of tests:

  • Availability tests against /health or /readiness endpoints to validate reachability and response time
  • Trace-enabled transaction tests against business endpoints that traverse multiple services

The first test teaches synthetic monitoring. The second teaches cross-tool troubleshooting with Splunk APM.

Via ThousandEyes UI

  1. Navigate to Cloud & Enterprise Agents > Test Settings
  2. Click Add New Test > HTTP Server
  3. Configure an availability test:
    • Test Name: [K8s] API Gateway Health
    • URL: http://api-gateway.production.svc.cluster.local:8080/health
    • Interval: 2 minutes
    • Agents: Select your Kubernetes-deployed Enterprise Agent
    • HTTP Response Code: 200
  4. Configure a trace-enabled transaction test:
    • Test Name: [Trace] API Gateway Orders
    • URL: http://api-gateway.production.svc.cluster.local:8080/api/v1/orders
    • Interval: 2 minutes
    • Agents: Select your Kubernetes-deployed Enterprise Agent
    • Advanced Settings > Distributed Tracing: Enabled
  5. Click Create Test

Via ThousandEyes API

curl -X POST https://api.thousandeyes.com/v6/tests/http-server/new \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BEARER_TOKEN" \
  -d '{
    "testName": "[K8s] API Gateway Health",
    "url": "http://api-gateway.production.svc.cluster.local:8080/health",
    "interval": 120,
    "agents": [
      {"agentId": "<your-k8s-agent-id>"}
    ],
    "httpTimeLimit": 5000,
    "targetResponseTime": 1000,
    "alertsEnabled": 1
  }'

For the trace-enabled version, switch the url to a business transaction endpoint and enable distributed tracing in the ThousandEyes test configuration.
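If you want to replicate the "Test Recommendations" idea across many services, you can generate one payload per service and feed each to the curl call above. The helper below is a sketch (the function name and agent ID are placeholders, not values from this environment):

```shell
# Emit a ThousandEyes HTTP Server test payload for one internal service.
# Arguments: test name, target URL, agent ID (placeholder).
make_te_payload() {
  cat <<EOF
{
  "testName": "$1",
  "url": "$2",
  "interval": 120,
  "agents": [{"agentId": "$3"}],
  "httpTimeLimit": 5000,
  "alertsEnabled": 1
}
EOF
}

make_te_payload "[K8s] Payment Svc Health" \
  "http://payment-svc.production.svc.cluster.local:8080/health" "12345"
```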

Best Practice

If your goal is to teach distributed tracing, avoid using /health as the only example. Health checks are useful for uptime monitoring, but they rarely produce the multi-service traces that make the ThousandEyes and Splunk APM integration compelling.

4. Configure Alerting Rules

Set up alerts for common failure scenarios:

  • Availability Alert: Trigger when HTTP response is not 200
  • Performance Alert: Trigger when response time exceeds baseline
  • DNS Resolution Alert: Trigger when service DNS cannot be resolved

5. View Results in Splunk Observability Cloud

Once tests are running and integrated with Splunk:

  1. Navigate to the ThousandEyes Dashboard in Splunk Observability Cloud
  2. Filter by test name (e.g., [K8s] prefix) to see all Kubernetes internal tests
  3. For trace-enabled tests, start in ThousandEyes first:
    • Open the Service Map
    • Inspect service-level latency and downstream errors
    • Follow the link into Splunk APM
  4. Correlate with APM data:
    • View synthetic test failures alongside APM error rates
    • Identify if issues are network-related (ThousandEyes) or application-related (APM)
    • Use the Splunk trace metadata to jump back to the originating ThousandEyes test
  5. Create custom dashboards combining:
    • ThousandEyes HTTP availability metrics
    • APM service latency and error rates
    • Kubernetes infrastructure metrics (CPU, memory, pod restarts)

Example Use Cases

Use Case 1: Microservices Health Checks

Test multiple microservice health endpoints:

http://user-service.production.svc.cluster.local:8080/actuator/health
http://order-service.production.svc.cluster.local:8080/actuator/health
http://inventory-service.production.svc.cluster.local:8080/actuator/health

Use Case 2: API Gateway Endpoint Testing

Test API gateway routes that are more likely to generate a useful trace:

http://api-gateway.production.svc.cluster.local:8080/api/v1/users
http://api-gateway.production.svc.cluster.local:8080/api/v1/orders
http://api-gateway.production.svc.cluster.local:8080/api/v1/products

Use Case 3: Database Connection Testing

While ThousandEyes is primarily for HTTP testing, you can test database proxies:

# Test PgBouncer or database HTTP management interfaces
http://pgbouncer.production.svc.cluster.local:8080/stats
http://redis-exporter.production.svc.cluster.local:9121/metrics

Use Case 4: External Service Dependencies

One of the most valuable capabilities of the in-cluster ThousandEyes agent is monitoring your application’s external dependencies from the same network perspective as your services. This helps identify whether issues originate from your infrastructure, network path, or the external service itself.

Testing Payment Gateways

Create tests for critical payment gateway endpoints to ensure availability and performance:

Stripe API:

# Via ThousandEyes UI
Test Name: [External] Stripe API Health
URL: https://api.stripe.com/healthcheck
Interval: 2 minutes
Agents: Your Kubernetes Enterprise Agent
Expected Response: 200

PayPal API:

Test Name: [External] PayPal API Health
URL: https://api.paypal.com/v1/notifications/webhooks
Interval: 2 minutes
Agents: Your Kubernetes Enterprise Agent
Expected Response: 401 (authentication required, but endpoint is reachable)

Via ThousandEyes API:

curl -X POST https://api.thousandeyes.com/v6/tests/http-server/new \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BEARER_TOKEN" \
  -d '{
    "testName": "[External] Stripe API Availability",
    "url": "https://api.stripe.com/healthcheck",
    "interval": 120,
    "agents": [
      {"agentId": "<your-k8s-agent-id>"}
    ],
    "httpTimeLimit": 5000,
    "targetResponseTime": 2000,
    "alertsEnabled": 1
  }'

Why Monitor External Dependencies?

  • Proactive Issue Detection: Know about payment gateway outages before your customers report them
  • Network Path Validation: Ensure your Kubernetes egress network can reach external services
  • Performance Baseline: Track latency from your cluster to external APIs
  • Compliance & SLA Monitoring: Verify third-party services meet their SLA commitments
  • Root Cause Analysis: Quickly determine whether issues stem from the network, your infrastructure, or the external provider

Other common external dependencies worth monitoring include:

  • Payment Processors: Stripe, PayPal, Square, Braintree
  • Authentication Providers: Auth0, Okta, Azure AD
  • Email Services: SendGrid, Mailgun, AWS SES
  • SMS/Communications: Twilio, MessageBird
  • CDN Endpoints: Cloudflare, Fastly, Akamai
  • Cloud Storage: AWS S3, Google Cloud Storage, Azure Blob Storage
  • Third-Party APIs: Any critical business partner APIs
Best Practice

Use the [External] prefix in test names to easily distinguish between internal Kubernetes services and external dependencies in your dashboards.

Best Practices

  1. Use Consistent Naming: Prefix test names with [K8s] or [Internal] for easy filtering
  2. Test Health Endpoints First: Start with /health or /readiness endpoints before testing business logic
  3. Set Appropriate Intervals: Use shorter intervals (1-2 minutes) for critical services
  4. Tag Tests: Use ThousandEyes labels/tags to group tests by:
    • Environment (dev, staging, production)
    • Service type (API, database, cache)
    • Team ownership
  5. Monitor Test Agent Health: Ensure the ThousandEyes agent pod is healthy and has sufficient resources
  6. Use Both Test Types: Pair a simple availability test with a trace-enabled business transaction test for each critical service path
  7. Correlate with APM: Create Splunk dashboards that show both synthetic and APM data side-by-side
  8. Use Instrumented Backends for Trace Labs: Distributed tracing works best when the ThousandEyes target is an HTTP Server or API endpoint backed by OpenTelemetry-instrumented services
Tip

By testing internal services before they’re exposed externally, you can catch issues early and ensure your infrastructure is healthy before user traffic reaches it.


ThousandEyes and Splunk RUM

10 minutes  

Integrate ThousandEyes with Splunk RUM to understand if network issues correlate to end user issues.

Requirements

  1. Admin privilege to both Splunk Observability Cloud and ThousandEyes
  2. At least one application sending data into Splunk RUM
  3. At least one ThousandEyes test of a supported type running on the same domain as the app in Splunk RUM

Steps to integrate

  1. In ThousandEyes, create an OAuth Bearer token:
    • Select your username on the top-right corner, and then select Profile.
    • Under User API Tokens, select Create to generate an OAuth Bearer Token.
    • Copy or make a note of the token to use in the Observability Cloud data integration wizard
  2. In Splunk Observability Cloud, open Data Management > Available Integrations > ThousandEyes Integration with RUM
    • Use the same Ingest token from the previous Splunk Integration, or create and select a dedicated Ingest token to better track the data usage of your RUM integration.
    • Enter the OAuth Bearer token from ThousandEyes
    • Review the test matches, change selections as desired, and review the estimated data ingestion before selecting Done

View the integration

Go to the RUM application where your ThousandEyes tests are running, and view the Map. Hover over the locations where you also have ThousandEyes tests running to see the preview of ThousandEyes metrics: Geo map view in Splunk RUM, with Network Latency from ThousandEyes visible

If you have active alerts in ThousandEyes, you will see the ThousandEyes icon over the relevant location bubble in RUM: Geo map view in Splunk RUM

Click into a relevant region to see the Network metrics alongside other metrics from RUM, and open View ThousandEyes Tests to go to the relevant tests in ThousandEyes: RUM metrics with ThousandEyes metrics and Tests dialog open

See RUM and ThousandEyes metrics in a custom dashboard

Now you can correlate other Observability Cloud KPIs with signals from your relevant ThousandEyes tests!

  1. Go to Dashboards > search for “RUM” > click into one of the out-of-the-box RUM dashboards in the RUM applications group
  2. Either copy charts with RUM KPIs that interest you, or open a dashboard’s action menu on the top right and Save As to create a copy in your own dashboard group.
  3. On the new dashboard, create a new chart with the signal network.latency
    • change the extrapolation policy to Last value
    • change the unit of measurement to Time > Millisecond
    • in Chart Options, select Show on-chart legend with the value thousandeyes.source.agent.name. This will segment the chart by agent location from ThousandEyes.
  4. Name and save the new chart, then copy it to create similar charts for network.jitter and network.loss, changing the metric in each copied chart’s signal and adjusting the units of measure and visualization options as needed.

See the Dashboard Workshop for more in-depth guidance on creating custom dashboards and charts.

Tip

Think about other metrics in Observability Cloud that might be handy to view side-by-side with ThousandEyes test metrics.

For example, if you have API tests running in Synthetics, consider adding a heatmap of API test success by location.


Troubleshooting

15 minutes  

This section covers common issues you may encounter when deploying and using the ThousandEyes Enterprise Agent in Kubernetes.

Test Failing with DNS Resolution Error

If your tests are failing with DNS resolution errors, verify DNS from within the ThousandEyes pod:

# Verify DNS resolution from within the pod
kubectl exec -n te-demo -it <pod-name> -- nslookup api-gateway.production.svc.cluster.local

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

Common causes:

  • Service doesn’t exist in the specified namespace
  • Typo in the service name or namespace
  • CoreDNS is not functioning properly

Connection Refused Errors

If you’re seeing connection refused errors, check the following:

# Verify service endpoints exist
kubectl get endpoints -n production api-gateway

# Check if pods are ready
kubectl get pods -n production -l app=api-gateway

# Test connectivity from agent pod
kubectl exec -n te-demo -it <pod-name> -- curl -v http://api-gateway.production.svc.cluster.local:8080/health

Common causes:

  • No pods backing the service (endpoints are empty)
  • Pods are not in Ready state
  • Wrong port specified in the test URL
  • Service selector doesn’t match pod labels

Network Policy Blocking Traffic

If network policies are blocking traffic from the ThousandEyes agent:

# List network policies
kubectl get networkpolicies -n production

# Describe network policy
kubectl describe networkpolicy <policy-name> -n production

Solution: Create a network policy to allow traffic from the te-demo namespace to your services:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-thousandeyes-agent
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-gateway
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: te-demo
    ports:
    - protocol: TCP
      port: 8080
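Note that the namespaceSelector above matches on a name label, which Kubernetes does not add to namespaces automatically. Make sure the te-demo namespace actually carries that label, for example:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: te-demo
  labels:
    name: te-demo  # must match the namespaceSelector in the policy above
```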

Agent Pod Not Starting

If the ThousandEyes agent pod is not starting, check the pod status and events:

# Get pod status
kubectl get pods -n te-demo

# Describe pod to see events
kubectl describe pod -n te-demo <pod-name>

# Check logs
kubectl logs -n te-demo <pod-name>

Common causes:

  • Insufficient resources (memory/CPU)
  • Invalid or missing TEAGENT_ACCOUNT_TOKEN secret
  • Security context capabilities not allowed by Pod Security Policy
  • Image pull errors

Solutions:

  • Increase memory limits if OOMKilled
  • Verify secret is created correctly: kubectl get secret te-creds -n te-demo -o yaml
  • Check Pod Security Policy allows NET_ADMIN and SYS_ADMIN capabilities
  • Verify image pull: kubectl describe pod -n te-demo <pod-name>

Agent Not Appearing in ThousandEyes Dashboard

If the agent is running but not appearing in the ThousandEyes dashboard:

# Check agent logs for connection issues
kubectl logs -n te-demo -l app=thousandeyes --tail=100

Common causes:

  • Invalid or incorrect TEAGENT_ACCOUNT_TOKEN
  • Network egress blocked (firewall or network policy)
  • Agent cannot reach ThousandEyes Cloud servers

Solutions:

  1. Verify the token is correct and properly base64-encoded
  2. Check if egress to *.thousandeyes.com is allowed
  3. Verify the agent can reach the internet:
kubectl exec -n te-demo -it <pod-name> -- curl -v https://api.thousandeyes.com

Data Not Appearing in Splunk Observability Cloud

If ThousandEyes data is not appearing in Splunk:

Verify integration configuration:

  1. Check the OpenTelemetry integration is configured correctly in ThousandEyes
  2. Verify the Splunk ingest endpoint URL is correct for your realm
  3. Confirm the X-SF-Token header contains a valid Splunk access token
  4. Ensure tests are assigned to the integration

Check test assignment:

# Use ThousandEyes API to verify integration
curl -v https://api.thousandeyes.com/v7/stream \
  -H "Authorization: Bearer $BEARER_TOKEN"

Common causes:

  • Wrong Splunk realm in endpoint URL
  • Invalid or expired Splunk access token
  • Tests not assigned to the OpenTelemetry integration
  • Integration not enabled or saved properly

Distributed Tracing Not Appearing in ThousandEyes

If your metric stream is working but the ThousandEyes Service Map is empty or no trace is found:

Verify the monitored endpoint:

  • It accepts HTTP headers
  • It is instrumented with OpenTelemetry
  • It propagates trace context downstream
  • It sends traces to Splunk APM

Common causes:

  • The endpoint is a page URL rather than an HTTP Server or API target
  • The service is not instrumented, so ThousandEyes can inject headers but no trace is emitted
  • The endpoint only returns a local health response and does not exercise downstream services

Recommended fixes:

  1. Switch the ThousandEyes test to an instrumented backend API route
  2. Confirm traces for that route already exist in Splunk APM
  3. Re-run the test after enabling ThousandEyes distributed tracing

If the trace opens in Splunk APM but you do not see the ThousandEyes backlink or metadata:

Common cause:

If the b3 propagator is configured without tracecontext, it overrides the tracestate value that ThousandEyes expects to preserve for the reverse link.

Fix:

Set the propagators explicitly on the instrumented service:

OTEL_PROPAGATORS=baggage,b3,tracecontext

After changing the environment variable, restart the instrumented workload (for example, with kubectl rollout restart) and generate new traffic.

Splunk APM Connector Authentication Errors

If the Generic Connector in ThousandEyes cannot query Splunk APM:

Check the following:

  1. The connector target is https://api.<REALM>.signalfx.com
  2. The token used in the connector has the API scope
  3. The user creating the token has the required role in Splunk Observability Cloud
Token Reminder

The OpenTelemetry metrics stream uses a Splunk Ingest token. The ThousandEyes Generic Connector for APM uses a Splunk API token. Mixing them up is one of the most common causes of partial integration.

High Memory Usage

If the ThousandEyes agent pod is consuming excessive memory:

# Check current memory usage
kubectl top pod -n te-demo

# Check for OOMKilled events
kubectl describe pod -n te-demo <pod-name> | grep -i oom

Solutions:

  1. Increase memory limits in the deployment:
resources:
  limits:
    memory: 4096Mi  # Increase from 3584Mi
  requests:
    memory: 2500Mi  # Increase from 2000Mi
  2. Reduce the number of concurrent tests assigned to the agent
  3. Check if the agent is running unnecessary services

Permission Denied Errors

If you see permission denied errors in the agent logs:

Verify security context:

kubectl get pod -n te-demo <pod-name> -o jsonpath='{.spec.containers[0].securityContext}'

Solution: Ensure the pod has the required capabilities:

securityContext:
  capabilities:
    add:
      - NET_ADMIN
      - SYS_ADMIN
Note

Some Kubernetes clusters with strict Pod Security Policies may not allow these capabilities. You may need to work with your cluster administrators to create an appropriate policy exception.

Getting Help

If you encounter issues not covered in this guide:

  1. ThousandEyes Support: Contact ThousandEyes support at support.thousandeyes.com
  2. Splunk Support: For Splunk Observability Cloud issues, visit Splunk Support
  3. Community Forums:
Tip

When asking for help, always include relevant logs, pod descriptions, and error messages to help troubleshoot more effectively.


Isovalent Enterprise Platform Integration with Splunk Observability Cloud

105 minutes   Author Alec Chamberlain

This workshop demonstrates integrating Isovalent Enterprise Platform with Splunk Observability Cloud to provide comprehensive visibility into Kubernetes networking, security, and runtime behavior using eBPF technology.

What You’ll Learn

By the end of this workshop, you will:

  • Deploy Amazon EKS with Cilium as the CNI in ENI mode
  • Configure Hubble for network observability with L7 visibility
  • Install Tetragon for runtime security monitoring
  • Integrate eBPF-based metrics with Splunk Observability Cloud using OpenTelemetry
  • Monitor network flows, security events, and infrastructure metrics in unified dashboards
  • Understand eBPF-powered observability and kube-proxy replacement

Sections

Tip

This integration leverages eBPF (Extended Berkeley Packet Filter) for high-performance, low-overhead observability directly in the Linux kernel.

Prerequisites

  • AWS CLI configured with appropriate credentials
  • kubectl, eksctl, and Helm 3.x installed
  • An AWS account with permissions to create EKS clusters, VPCs, and EC2 instances
  • A Splunk Observability Cloud account with access token
  • Approximately 90 minutes for complete setup

Benefits of Integration

By connecting Isovalent Enterprise Platform to Splunk Observability Cloud, you gain:

  • 🔍 Deep visibility: Network flows, L7 protocols (HTTP, DNS, gRPC), and runtime security events
  • 🚀 High performance: eBPF-based observability with minimal overhead
  • 🔒 Security insights: Process monitoring, system call tracing, and network policy enforcement
  • 📊 Unified dashboards: Cilium, Hubble, and Tetragon metrics alongside infrastructure and APM data
  • ⚡ Efficient networking: Kube-proxy replacement and native VPC networking with ENI mode

Source Repositories

All configuration files, Helm values, and dashboard JSON files referenced in this workshop are available in the following repositories:

  • isovalent_splunk_o11y: Helm values, OTel Collector configuration, Splunk dashboard JSON files, and the complete integration guide
  • isovalent-demo-jobs-app: The jobs-app Helm chart used in the demo scenario, including the error injection and remediation scripts
Last Modified Apr 10, 2026

Subsections of Isovalent Splunk Observability Integration

Overview

What is Isovalent Enterprise Platform?

The Isovalent Enterprise Platform consists of three core components built on eBPF (Extended Berkeley Packet Filter) technology:

Cilium

Cloud Native CNI and Network Security

  • eBPF-based networking and security for Kubernetes
  • Replaces kube-proxy with high-performance eBPF datapath
  • Native support for AWS ENI mode (pods get VPC IP addresses)
  • Network policy enforcement at L3-L7
  • Transparent encryption and load balancing

Hubble

Network Observability

  • Built on top of Cilium's eBPF visibility
  • Real-time network flow monitoring
  • L7 protocol visibility (HTTP, DNS, gRPC, Kafka)
  • Flow export and historical data storage (Timescape)
  • Metrics exposed on port 9965

Tetragon

Runtime Security and Observability

  • eBPF-based runtime security
  • Process execution monitoring
  • System call tracing
  • File access tracking
  • Security event metrics on port 2112

Architecture

graph TB
    subgraph AWS["Amazon Web Services"]
        subgraph EKS["EKS Cluster"]
            subgraph Node["Worker Node"]
                CA["Cilium Agent<br/>:9962"]
                CE["Cilium Envoy<br/>:9964"]
                HA["Hubble<br/>:9965"]
                TE["Tetragon<br/>:2112"]
                OC["OTel Collector"]
            end
            CO["Cilium Operator<br/>:9963"]
            HR["Hubble Relay"]
        end
    end
    
    subgraph Splunk["Splunk Observability Cloud"]
        IM["Infrastructure Monitoring"]
        DB["Dashboards"]
    end
    
    CA -.->|"Scrape"| OC
    CE -.->|"Scrape"| OC
    HA -.->|"Scrape"| OC
    TE -.->|"Scrape"| OC
    CO -.->|"Scrape"| OC
    
    OC ==>|"OTLP/HTTP"| IM
    IM --> DB

Key Components

| Component | Service Name | Port | Purpose |
|---|---|---|---|
| Cilium Agent | cilium-agent | 9962 | CNI, network policies, eBPF programs |
| Cilium Envoy | cilium-envoy | 9964 | L7 proxy for HTTP, gRPC |
| Cilium Operator | cilium-operator | 9963 | Cluster-wide operations |
| Hubble | hubble-metrics | 9965 | Network flow metrics |
| Tetragon | tetragon | 2112 | Runtime security metrics |

Benefits of eBPF

  • High Performance: Runs in the Linux kernel with minimal overhead
  • Safety: Verifier ensures programs are safe to run
  • Flexibility: Dynamic instrumentation without kernel modules
  • Visibility: Deep insights into network and system behavior
Note

This integration provides visibility into Kubernetes networking at a level not possible with traditional CNI plugins.

Last Modified Nov 26, 2025

Prerequisites

Required Tools

Before starting this workshop, ensure you have the following tools installed:

AWS CLI

# Check installation
aws --version

# Should output: aws-cli/2.x.x or higher

kubectl

# Check installation
kubectl version --client

# Should output: Client Version: v1.28.0 or higher

eksctl

# Check installation
eksctl version

# Should output: 0.150.0 or higher

Helm

# Check installation
helm version

# Should output: version.BuildInfo{Version:"v3.x.x"}

AWS Requirements

  • AWS account with permissions to create:
    • EKS clusters
    • VPCs and subnets
    • EC2 instances
    • IAM roles and policies
    • Elastic Network Interfaces
  • AWS CLI configured with credentials (aws configure)

Splunk Observability Cloud

You'll need:

  • A Splunk Observability Cloud account
  • An Access Token with ingest permissions
  • Your Realm identifier (e.g., us1, us2, eu0)
Getting Splunk Credentials

In Splunk Observability Cloud:

  1. Navigate to Settings β†’ Access Tokens
  2. Create a new token with Ingest permissions
  3. Note your realm from the URL: https://app.<realm>.signalfx.com
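If you have the URL in hand, the realm can be pulled out with plain shell parameter expansion (the URL below is a sample):

```shell
# Extract the realm segment from a Splunk Observability Cloud app URL.
URL="https://app.us1.signalfx.com"          # sample value
realm="${URL#https://app.}"                 # drop the scheme and "app." prefix
realm="${realm%.signalfx.com}"              # drop the domain suffix
echo "$realm"                               # → us1
```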

Cost Considerations

AWS Costs (Approximate)

  • EKS Control Plane: ~$73/month
  • EC2 Nodes (2x m5.xlarge): ~$280/month
  • Data Transfer: Variable
  • EBS Volumes: ~$20/month

Estimated Total: ~$380-400/month for lab environment
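As a sanity check, the fixed-cost portion of that estimate adds up like this (figures from the list above; data transfer excluded):

```shell
# Fixed monthly AWS costs for the lab environment, in USD.
eks_control_plane=73
ec2_nodes=280
ebs_volumes=20
total=$((eks_control_plane + ec2_nodes + ebs_volumes))
echo "Fixed costs: \$${total}/month"        # data transfer comes on top of this
```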

Splunk Costs

  • Based on metrics volume (DPM - Data Points per Minute)
  • Free trial available for testing
Warning

Remember to clean up resources after completing the workshop to avoid ongoing charges.

Time Estimate

  • EKS Cluster Creation: 15-20 minutes
  • Cilium Installation: 10-15 minutes
  • Integration Setup: 10 minutes
  • Total: Approximately 90 minutes
Last Modified Nov 26, 2025

EKS Setup

Step 1: Add Helm Repositories

Add the required Helm repositories:

# Add Isovalent Helm repository
helm repo add isovalent https://helm.isovalent.com

# Add Splunk OpenTelemetry Collector Helm repository
helm repo add splunk-otel-collector-chart https://signalfx.github.io/splunk-otel-collector-chart

# Update Helm repositories
helm repo update

Step 2: Create EKS Cluster Configuration

Create a file named cluster.yaml:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: isovalent-demo
  region: us-east-1
  version: "1.30"
iam:
  withOIDC: true
addonsConfig:
  disableDefaultAddons: true
addons:
- name: coredns

Key Configuration Details:

  • disableDefaultAddons: true - Disables AWS VPC CNI and kube-proxy (Cilium will replace both)
  • withOIDC: true - Enables IAM roles for service accounts (required for Cilium to manage ENIs)
  • coredns addon is retained as it's needed for DNS resolution
Why Disable Default Addons?

Cilium provides its own CNI implementation using eBPF, which is more performant than the default AWS VPC CNI. By disabling the defaults, we avoid conflicts and let Cilium handle all networking.

Step 3: Create the EKS Cluster

Create the cluster (this takes approximately 15-20 minutes):

eksctl create cluster -f cluster.yaml

Verify the cluster is created:

# Update kubeconfig
aws eks update-kubeconfig --name isovalent-demo --region us-east-1

# Check pods
kubectl get pods -n kube-system

Expected Output:

  • CoreDNS pods will be in Pending state (this is normal - they're waiting for the CNI)
  • No worker nodes yet
Note

Without a CNI plugin, pods cannot get IP addresses or network connectivity. CoreDNS will remain pending until Cilium is installed.

Step 4: Get Kubernetes API Server Endpoint

You'll need this for the Cilium configuration:

aws eks describe-cluster --name isovalent-demo --region us-east-1 \
  --query 'cluster.endpoint' --output text

Save this endpoint - you'll use it in the Cilium installation step.
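The Cilium values file in the next section expects this endpoint without the https:// scheme; stripping it is a one-liner (the endpoint below is a made-up sample):

```shell
# k8sServiceHost wants a bare hostname, so strip the scheme from the endpoint.
ENDPOINT="https://ABCDEF1234567890.gr7.us-east-1.eks.amazonaws.com"   # sample value
K8S_SERVICE_HOST="${ENDPOINT#https://}"
echo "$K8S_SERVICE_HOST"
```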

Step 5: Install Prometheus CRDs

Cilium uses Prometheus ServiceMonitor CRDs for metrics:

kubectl apply -f https://github.com/prometheus-operator/prometheus-operator/releases/download/v0.68.0/stripped-down-crds.yaml
Next Steps

With the EKS cluster created, you're ready to install Cilium, Hubble, and Tetragon.

Last Modified Nov 26, 2025

Cilium Installation

Step 1: Configure Cilium Enterprise

Create a file named cilium-enterprise-values.yaml. Replace <YOUR-EKS-API-SERVER-ENDPOINT> with the endpoint from the previous step (without the https:// prefix).

# Enable/disable debug logging
debug:
  enabled: false
  verbose: ~

# Configure unique cluster name & ID
cluster:
  name: isovalent-demo
  id: 0

# Configure ENI specifics
eni:
  enabled: true
  updateEC2AdapterLimitViaAPI: true   # Dynamically fetch ENI limits from EC2 API
  awsEnablePrefixDelegation: true     # Assign /28 CIDR blocks per ENI (16 IPs) instead of individual IPs

enableIPv4Masquerade: false           # Pods use their real VPC IPs; no SNAT needed in ENI mode
loadBalancer:
  serviceTopology: true               # Prefer backends in the same AZ to reduce cross-AZ traffic costs

ipam:
  mode: eni

routingMode: native                   # No overlay tunnels; traffic routes natively through VPC

# BPF / KubeProxyReplacement
# Cilium replaces kube-proxy entirely with eBPF programs in the kernel.
# This requires a direct path to the API server, hence k8sServiceHost.
kubeProxyReplacement: "true"
k8sServiceHost: <YOUR-EKS-API-SERVER-ENDPOINT>
k8sServicePort: 443

# TLS for internal Cilium communication
tls:
  ca:
    certValidityDuration: 3650        # 10 years for the CA cert

# Hubble: network observability built on top of Cilium's eBPF datapath
hubble:
  enabled: true
  metrics:
    enableOpenMetrics: true           # Use OpenMetrics format for better Prometheus compatibility
    enabled:
      # DNS: query/response tracking with namespace-level label context
      - dns:labelsContext=source_namespace,destination_namespace
      # Drop: packet drop reasons (policy deny, invalid, etc.) per namespace
      - drop:labelsContext=source_namespace,destination_namespace
      # TCP: connection state tracking (SYN, FIN, RST) per namespace
      - tcp:labelsContext=source_namespace,destination_namespace
      # Port distribution: which destination ports are being used
      - port-distribution:labelsContext=source_namespace,destination_namespace
      # ICMP: ping/traceroute visibility with workload identity context
      - icmp:labelsContext=source_namespace,destination_namespace;sourceContext=workload-name|reserved-identity;destinationContext=workload-name|reserved-identity
      # Flow: per-workload flow counters (forwarded, dropped, redirected)
      - flow:sourceContext=workload-name|reserved-identity;destinationContext=workload-name|reserved-identity
      # HTTP L7: request/response metrics with full workload context and exemplars for trace correlation
      - "httpV2:exemplars=true;labelsContext=source_ip,source_namespace,source_workload,destination_namespace,destination_workload,traffic_direction;sourceContext=workload-name|reserved-identity;destinationContext=workload-name|reserved-identity"
      # Policy: network policy verdict tracking (allowed/denied) per workload
      - "policy:sourceContext=app|workload-name|pod|reserved-identity;destinationContext=app|workload-name|pod|dns|reserved-identity;labelsContext=source_namespace,destination_namespace"
      # Flow export: enables Hubble to export flow records to Timescape for historical storage
      - flow_export
    serviceMonitor:
      enabled: true                   # Creates a Prometheus ServiceMonitor for auto-discovery
  tls:
    enabled: true
    auto:
      enabled: true
      method: cronJob                 # Automatically rotate Hubble TLS certs on a schedule
      certValidityDuration: 1095      # 3 years per cert rotation
  relay:
    enabled: true                     # Hubble Relay aggregates flows from all nodes cluster-wide
    tls:
      server:
        enabled: true
    prometheus:
      enabled: true
      serviceMonitor:
        enabled: true
  timescape:
    enabled: true                     # Stores historical flow data for time-travel debugging

# Cilium Operator: cluster-wide identity and endpoint management
operator:
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true

# Cilium Agent: per-node eBPF datapath metrics
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true

# Cilium Envoy: L7 proxy metrics (HTTP, gRPC)
envoy:
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true

# Enable the Cilium agent to hand off DNS proxy responsibilities to the
# external DNS Proxy HA deployment, so policies keep working during upgrades
extraConfig:
  external-dns-proxy: "true"

# Enterprise feature gates: these must be explicitly approved
enterprise:
  featureGate:
    approved:
      - DNSProxyHA          # High-availability DNS proxy (installed separately)
      - HubbleTimescape     # Historical flow storage via Timescape
Why label contexts matter

The labelsContext and sourceContext/destinationContext parameters on each Hubble metric control what dimensions the metric is broken down by. Setting labelsContext=source_namespace,destination_namespace means every metric will have those two labels attached, letting you filter by namespace in Splunk without cardinality explosion. The workload-name|reserved-identity fallback chain means Cilium will use the workload name if available, falling back to the security identity if not.
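For instance, with the dns metric configured as above, every scraped sample carries those two namespace labels. The series below is illustrative only (label set simplified, value made up):

```
# Illustrative sample from the Hubble metrics endpoint (port 9965)
hubble_dns_queries_total{source_namespace="jobs-app",destination_namespace="kube-system"} 42
```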

Step 2: Install Cilium Enterprise

When a new node joins an EKS cluster, the kubelet on that node immediately starts looking for a CNI plugin to set up networking. It reads whatever CNI configuration is present in /etc/cni/net.d/ and uses that to initialize the node. If we create the node group first, the AWS VPC CNI gets there first, and once a node has initialized with one CNI, switching to another requires draining and re-initializing the node.

By installing Cilium before any nodes exist, we ensure that Cilium's CNI configuration is already present in kube-system and ready to be picked up the moment a node starts. When the EC2 instances boot, Cilium's DaemonSet pod is scheduled immediately, its eBPF programs are loaded, and the node comes up Ready under Cilium's control from the very first second.

This is also why the cluster was created with disableDefaultAddons: true in Step 2 of the EKS setup: without that, the AWS VPC CNI would be installed automatically and would race against Cilium.

Install Cilium using Helm:

helm install cilium isovalent/cilium --version 1.18.4 \
  --namespace kube-system -f cilium-enterprise-values.yaml
Pending jobs are expected

After installation you'll see some jobs in a pending state; this is normal. Cilium's Helm chart includes a job that generates TLS certificates for Hubble, and that job needs a node to run on. It will complete automatically once nodes are available in the next step.

Step 3: Create Node Group

Create a file named nodegroup.yaml:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: isovalent-demo
  region: us-east-1
managedNodeGroups:
- name: standard
  instanceType: m5.xlarge
  desiredCapacity: 2
  privateNetworking: true

Create the node group (this takes 5-10 minutes):

eksctl create nodegroup -f nodegroup.yaml

Step 4: Verify Cilium Installation

Once nodes are ready, verify all components:

# Check nodes
kubectl get nodes

# Check Cilium pods
kubectl get pods -n kube-system -l k8s-app=cilium

# Check all Cilium components
kubectl get pods -n kube-system | grep -E "(cilium|hubble)"

Expected Output:

  • 2 nodes in Ready state
  • Cilium pods running (1 per node)
  • Hubble relay and timescape running
  • Cilium operator running

Step 5: Install Tetragon with Enhanced Network Observability

Out of the box, Tetragon provides runtime security and process-level visibility. For the Splunk integration, and especially the Network Explorer dashboards, you also want to enable its enhanced network observability mode, which tracks TCP/UDP socket statistics, RTT, connection events, and DNS at the kernel level.

Create a file named tetragon-network-values.yaml:

# Tetragon configuration with Enhanced Network Observability enabled
# Required for Splunk Observability Cloud Network Explorer integration

tetragon:
  # Enable network events: this activates eBPF-based socket tracking
  enableEvents:
    network: true

  # Layer3 settings: track TCP, UDP, and ICMP with RTT and latency
  # These enable the socket stats metrics (srtt, retransmits, bytes, etc.)
  layer3:
    tcp:
      enabled: true
      rtt:
        enabled: true     # Round-trip time per TCP flow
    udp:
      enabled: true
    icmp:
      enabled: true
    latency:
      enabled: true       # Per-connection latency tracking

  # DNS tracking at the kernel level (complements Hubble DNS metrics)
  dns:
    enabled: true

  # Expose Tetragon metrics via Prometheus
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true

  # Filter out noise from internal system namespaces; we only care about
  # application workloads, not the observability stack itself
  exportDenyList: |-
    {"health_check":true}
    {"namespace":["", "cilium", "tetragon", "kube-system", "otel-splunk"]}

  # Only include labels that are meaningful for the Network Explorer
  metricsLabelFilter: "namespace,workload,binary"

  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi

# Enable the Tetragon Operator and TracingPolicy support.
# With tracingPolicy.enabled: true, the operator manages and deploys
# TracingPolicies (TCP connection tracking, HTTP visibility, etc.) automatically.
tetragonOperator:
  enabled: true
  tracingPolicy:
    enabled: true

Install Tetragon with these values:

helm install tetragon isovalent/tetragon --version 1.18.0 \
  --namespace tetragon --create-namespace \
  -f tetragon-network-values.yaml

Verify installation:

kubectl get pods -n tetragon

What you'll see: Tetragon runs as a DaemonSet (one pod per node) plus an operator.

What Enhanced Network Observability adds

With layer3.tcp.rtt.enabled: true, Tetragon hooks into the kernel's TCP socket state and records per-connection metrics including round-trip time, retransmit counts, bytes sent/received, and segment counts. These feed the tetragon_socket_stats_* metrics that power latency and throughput views in Splunk's Network Explorer. Without this you only get event counts; with it, you get connection quality data.

TracingPolicies (TCP connection tracking, HTTP visibility, etc.) are managed automatically by the Tetragon Operator when tetragonOperator.tracingPolicy.enabled: true is set in the Helm values above.

Step 6: Install Cilium DNS Proxy HA

Create a file named cilium-dns-proxy-ha-values.yaml:

enableCriticalPriorityClass: true
metrics:
  serviceMonitor:
    enabled: true

Install DNS Proxy HA:

helm upgrade -i cilium-dnsproxy isovalent/cilium-dnsproxy --version 1.16.7 \
  -n kube-system -f cilium-dns-proxy-ha-values.yaml

Verify:

kubectl rollout status -n kube-system ds/cilium-dnsproxy --watch
Success

You now have a fully functional EKS cluster with Cilium CNI, Hubble observability, and Tetragon security!

Last Modified Mar 4, 2026

Splunk Integration

Overview

The Splunk OpenTelemetry Collector uses Prometheus receivers to scrape metrics from all Isovalent components. Each component exposes metrics on different ports, and because Cilium and Hubble share the same pods (just different ports), we configure separate receivers for each one rather than relying on pod annotations.

| Component | Port | What it provides |
|---|---|---|
| Cilium Agent | 9962 | eBPF datapath, policy enforcement, IPAM, BPF map stats |
| Cilium Envoy | 9964 | L7 proxy metrics (HTTP, gRPC) |
| Cilium Operator | 9963 | Cluster-wide identity and endpoint management |
| Hubble | 9965 | Network flows, DNS, HTTP L7, TCP flags, policy verdicts |
| Tetragon | 2112 | Runtime security, socket stats, network flow events |
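Every receiver in Step 1 below uses the same two-move relabeling pattern: keep only pods with a matching label, then rewrite the scrape address to the pod IP plus the component's port. A plain-shell simulation of that logic (pod label and IP are sample values):

```shell
# Simulate the Prometheus relabeling used for the Cilium agent job:
#   1) action: keep  -> drop targets whose k8s-app label is not "cilium"
#   2) rewrite __address__ to <pod IP>:9962
pod_label_k8s_app="cilium"    # stand-in for __meta_kubernetes_pod_label_k8s_app
pod_ip="10.0.1.23"            # stand-in for __meta_kubernetes_pod_ip

if [ "$pod_label_k8s_app" = "cilium" ]; then
  address="${pod_ip}:9962"
  echo "scrape target: $address"
else
  address=""
  echo "target dropped"
fi
```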

Step 1: Create Configuration File

Create a file named splunk-otel-collector-values.yaml. Replace the credential placeholders with your actual values.

terminationGracePeriodSeconds: 30
agent:
  config:
    extensions:
      # k8s_observer watches the Kubernetes API for pod and port changes.
      # This enables automatic service discovery without static endpoint lists.
      k8s_observer:
        auth_type: serviceAccount
        observe_pods: true

    receivers:
      kubeletstats:
        collection_interval: 30s
        insecure_skip_verify: true

      # Cilium Agent (port 9962) and Hubble (port 9965) both run in the
      # same DaemonSet pod, identified by label k8s-app=cilium.
      # We use two separate scrape jobs because they're on different ports.
      prometheus/isovalent_cilium:
        config:
          scrape_configs:
            - job_name: 'cilium_metrics_9962'
              scrape_interval: 30s
              metrics_path: /metrics
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_label_k8s_app]
                  action: keep
                  regex: cilium
                - source_labels: [__meta_kubernetes_pod_ip]
                  target_label: __address__
                  replacement: $$1:9962   # default regex (.*) captures the pod IP; $$ escapes collector config expansion
                - target_label: job
                  replacement: 'cilium_metrics_9962'
            - job_name: 'hubble_metrics_9965'
              scrape_interval: 30s
              metrics_path: /metrics
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_label_k8s_app]
                  action: keep
                  regex: cilium
                - source_labels: [__meta_kubernetes_pod_ip]
                  target_label: __address__
                  replacement: $$1:9965   # default regex (.*) captures the pod IP; $$ escapes collector config expansion
                - target_label: job
                  replacement: 'hubble_metrics_9965'

      # Cilium Envoy uses a different pod label (k8s-app=cilium-envoy)
      prometheus/isovalent_envoy:
        config:
          scrape_configs:
            - job_name: 'envoy_metrics_9964'
              scrape_interval: 30s
              metrics_path: /metrics
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_label_k8s_app]
                  action: keep
                  regex: cilium-envoy
                - source_labels: [__meta_kubernetes_pod_ip]
                  target_label: __address__
                  replacement: $$1:9964   # default regex (.*) captures the pod IP; $$ escapes collector config expansion
                - target_label: job
                  replacement: 'cilium_metrics_9964'

      # Cilium Operator is a Deployment (not DaemonSet), identified by io.cilium.app=operator
      prometheus/isovalent_operator:
        config:
          scrape_configs:
            - job_name: 'cilium_operator_metrics_9963'
              scrape_interval: 30s
              metrics_path: /metrics
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_label_io_cilium_app]
                  action: keep
                  regex: operator
                - target_label: job
                  replacement: 'cilium_metrics_9963'

      # Tetragon is identified by app.kubernetes.io/name=tetragon
      prometheus/isovalent_tetragon:
        config:
          scrape_configs:
            - job_name: 'tetragon_metrics_2112'
              scrape_interval: 30s
              metrics_path: /metrics
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
                  action: keep
                  regex: tetragon
                - source_labels: [__meta_kubernetes_pod_ip]
                  target_label: __address__
                  replacement: $$1:2112   # default regex (.*) captures the pod IP; $$ escapes collector config expansion
                - target_label: job
                  replacement: 'tetragon_metrics_2112'

    processors:
      # Strict allowlist filter: only forward metrics we've explicitly named.
      # Without this, Cilium and Tetragon can generate thousands of metric series
      # and overwhelm Splunk Observability Cloud with cardinality.
      filter/includemetrics:
        metrics:
          include:
            match_type: strict
            metric_names:
            # --- Kubernetes base metrics ---
            - container.cpu.usage
            - container.memory.rss
            - k8s.container.restarts
            - k8s.pod.phase
            - node_namespace_pod_container
            - tcp.resets
            - tcp.syn_timeouts

            # --- Cilium Agent metrics ---
            # API rate limiting: detect if the agent is being throttled
            - cilium_api_limiter_processed_requests_total
            - cilium_api_limiter_processing_duration_seconds
            # BPF map utilization: alerts when eBPF maps are near capacity
            - cilium_bpf_map_ops_total
            # Controller health: tracks background reconciliation tasks
            - cilium_controllers_group_runs_total
            - cilium_controllers_runs_total
            # Endpoint state: how many pods are in each lifecycle state
            - cilium_endpoint_state
            # Agent error/warning counts: early warning for problems
            - cilium_errors_warnings_total
            # IP address allocation tracking
            - cilium_ip_addresses
            - cilium_ipam_capacity
            # Kubernetes event processing rate
            - cilium_kubernetes_events_total
            # L7 policy enforcement (HTTP, DNS, Kafka)
            - cilium_policy_l7_total
            # DNS proxy latency histogram: key metric for catching DNS saturation
            - cilium_proxy_upstream_reply_seconds_bucket

            # --- Hubble metrics ---
            # DNS query and response counts: primary indicator in the demo scenario
            - hubble_dns_queries_total
            - hubble_dns_responses_total
            # Packet drops by reason (policy_denied, invalid, TTL_exceeded, etc.)
            - hubble_drop_total
            # Total flows processed: overall network activity volume
            - hubble_flows_processed_total
            # HTTP request latency histogram and total count
            - hubble_http_request_duration_seconds_bucket
            - hubble_http_requests_total
            # ICMP traffic tracking
            - hubble_icmp_total
            # Policy verdict counts (forwarded vs. dropped by policy)
            - hubble_policy_verdicts_total
            # TCP flag tracking (SYN, FIN, RST): connection lifecycle visibility
            - hubble_tcp_flags_total

            # --- Tetragon metrics ---
            # Total eBPF events processed
            - tetragon_events_total
            # DNS cache health
            - tetragon_dns_cache_evictions_total
            - tetragon_dns_cache_misses_total
            - tetragon_dns_total
            # HTTP response tracking with latency
            - tetragon_http_response_total
            - tetragon_http_stats_latency_bucket
            - tetragon_http_stats_latency_count
            - tetragon_http_stats_latency_sum
            # Layer3 errors
            - tetragon_layer3_event_errors_total
            # TCP socket statistics: per-connection RTT, retransmits, byte/segment counts
            # These power the latency and throughput views in Network Explorer
            - tetragon_socket_stats_retransmitsegs_total
            - tetragon_socket_stats_rxsegs_total
            - tetragon_socket_stats_srtt_count
            - tetragon_socket_stats_srtt_sum
            - tetragon_socket_stats_txbytes_total
            - tetragon_socket_stats_txsegs_total
            - tetragon_socket_stats_rxbytes_total
            # UDP statistics
            - tetragon_socket_stats_udp_retrieve_total
            - tetragon_socket_stats_udp_txbytes_total
            - tetragon_socket_stats_udp_txsegs_total
            - tetragon_socket_stats_udp_rxbytes_total
            # Network flow events (connect, close, send, receive)
            - tetragon_network_connect_total
            - tetragon_network_close_total
            - tetragon_network_send_total
            - tetragon_network_receive_total

      resourcedetection:
        detectors: [system]
        system:
          hostname_sources: [os]

    service:
      pipelines:
        metrics:
          receivers:
            - prometheus/isovalent_cilium
            - prometheus/isovalent_envoy
            - prometheus/isovalent_operator
            - prometheus/isovalent_tetragon
            - hostmetrics
            - kubeletstats
            - otlp
          processors:
            - filter/includemetrics
            - resourcedetection

autodetect:
  prometheus: true

clusterName: isovalent-demo

splunkObservability:
  accessToken: <YOUR-SPLUNK-ACCESS-TOKEN>
  realm: <YOUR-SPLUNK-REALM>           # e.g. us1, us2, eu0
  profilingEnabled: true

cloudProvider: aws
distribution: eks
environment: isovalent-demo

# Gateway mode runs a central collector deployment that receives from all agents.
# Agents send to the gateway, which handles batching and export to Splunk.
# This reduces the number of direct connections to Splunk's ingest endpoint.
gateway:
  enabled: true
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      cpu: 1
      memory: 1Gi

# certmanager handles mTLS between the OTel Collector agent and gateway
certmanager:
  enabled: true

Important: Replace:

  • <YOUR-SPLUNK-ACCESS-TOKEN> with your Splunk Observability Cloud access token
  • <YOUR-SPLUNK-REALM> with your realm (e.g., us1, us2, eu0)
Why we use a strict metric allowlist

Cilium can emit thousands of unique metric series when you factor in all the label combinations for workloads, namespaces, and protocol details. Without the filter/includemetrics allowlist, a busy cluster can easily generate 50,000+ active series and overwhelm Splunk's ingestion. The list above is curated to include exactly the metrics that power the Cilium and Hubble dashboards, plus the Tetragon socket stats needed for Network Explorer. If you add new dashboards later, you can add metrics to this list.
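The semantics of match_type: strict are exact string equality, which a few lines of shell can illustrate (the three-metric allowlist here is a toy subset of the real one):

```shell
# Toy version of filter/includemetrics: a metric survives only on an exact name match.
allowlist="hubble_dns_queries_total hubble_drop_total cilium_endpoint_state"

is_allowed() {
  for name in $allowlist; do
    [ "$name" = "$1" ] && return 0
  done
  return 1
}

is_allowed "hubble_dns_queries_total" && echo "kept"      # exact match passes
is_allowed "hubble_dns_queries"       || echo "dropped"   # prefixes do not
```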

What Tetragon socket stats enable

The tetragon_socket_stats_* metrics are what make per-connection latency and throughput analysis possible in Splunk's Network Explorer. srtt_count/srtt_sum give you average TCP round-trip time per workload. retransmitsegs_total surfaces packet loss and congestion. txbytes/rxbytes show bandwidth per connection. None of this is visible through APM or standard infrastructure metrics.
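That RTT figure is a ratio of two counters: dividing srtt_sum by srtt_count over a window gives the mean smoothed RTT per workload. A sketch with made-up sample values (units are whatever Tetragon exports; not asserted here):

```shell
# Mean smoothed RTT = srtt_sum / srtt_count (sample counter values)
srtt_sum=125000
srtt_count=50
mean_srtt=$((srtt_sum / srtt_count))
echo "mean srtt: $mean_srtt"
```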

Step 2: Install Splunk OpenTelemetry Collector

Install the collector:

helm upgrade --install splunk-otel-collector \
  splunk-otel-collector-chart/splunk-otel-collector \
  -n otel-splunk --create-namespace \
  -f splunk-otel-collector-values.yaml

Wait for rollout to complete:

kubectl rollout status daemonset/splunk-otel-collector-agent -n otel-splunk --timeout=60s

Step 3: Verify Metrics Collection

Check that the collector is scraping metrics:

kubectl logs -n otel-splunk -l app=splunk-otel-collector --tail=100 | grep -i "cilium\|hubble\|tetragon"

You should see log entries indicating successful scraping of each component.

Next Steps

Metrics are now flowing to Splunk Observability Cloud! Proceed to verification to check the dashboards.

Last Modified Mar 4, 2026

Verification

Verify All Components

Run this comprehensive check to ensure everything is running:

echo "=== Cluster Nodes ==="
kubectl get nodes

echo -e "\n=== Cilium Components ==="
kubectl get pods -n kube-system -l k8s-app=cilium

echo -e "\n=== Hubble Components ==="
kubectl get pods -n kube-system | grep hubble

echo -e "\n=== Tetragon ==="
kubectl get pods -n tetragon

echo -e "\n=== Splunk OTel Collector ==="
kubectl get pods -n otel-splunk

Expected Output:

  • 2 nodes in Ready state
  • Cilium pods: 2 running (one per node)
  • Hubble relay and timescape: running
  • Tetragon pods: 2 running + operator
  • Splunk collector pods: running

Verify Metrics Endpoints

Test that metrics are accessible from each component:

# Test Cilium metrics
kubectl exec -n kube-system ds/cilium -- curl -s localhost:9962/metrics | head -20

# Test Hubble metrics
kubectl exec -n kube-system ds/cilium -- curl -s localhost:9965/metrics | head -20

# Test Tetragon metrics
kubectl exec -n tetragon ds/tetragon -- curl -s localhost:2112/metrics | head -20

Each command should return Prometheus-formatted metrics.
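Beyond eyeballing the first 20 lines, you can post-process the exposition text to see exactly which metric families a component exposes. A self-contained sketch (the sample text below is made up for illustration):

```shell
# Sample Prometheus exposition text standing in for real scrape output.
sample='cilium_endpoint_state{endpoint_state="ready"} 12
cilium_endpoint_state{endpoint_state="waiting-to-regenerate"} 1
hubble_flows_processed_total{type="L3_L4",verdict="FORWARDED"} 40532'

# Strip labels and values to isolate metric names, then de-duplicate.
unique=$(printf '%s\n' "$sample" | sed 's/[{ ].*//' | sort -u | wc -l)
echo "unique metric families: $unique"
```

Pipe the real curl output through the same `sed 's/[{ ].*//' | sort -u` to list every metric family a component exposes, which is handy when tuning the metric allowlist.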

Verify in Splunk Observability Cloud

Check Infrastructure Navigator

  1. Log in to your Splunk Observability Cloud account
  2. Navigate to Infrastructure → Kubernetes
  3. Find your cluster: isovalent-demo
  4. Verify the cluster is reporting metrics

Search for Isovalent Metrics

Navigate to Metrics and search for:

  • cilium_* - Cilium networking metrics
  • hubble_* - Network flow metrics
  • tetragon_* - Runtime security metrics
Tip

It may take 2-3 minutes after installation for metrics to start appearing in Splunk Observability Cloud.

View Dashboards

Create Custom Dashboard

  1. Navigate to Dashboards → Create
  2. Add charts for key metrics:

Cilium Endpoint State:

cilium_endpoint_state{cluster="isovalent-demo"}

Hubble Flow Processing:

hubble_flows_processed_total{cluster="isovalent-demo"}

Tetragon Events:

tetragon_dns_total{cluster="isovalent-demo"}

Example Queries

DNS Query Rate:

rate(hubble_dns_queries_total{cluster="isovalent-demo"}[1m])

Dropped Packets:

sum by (reason) (hubble_drop_total{cluster="isovalent-demo"})

Network Policy Enforcements:

rate(cilium_policy_l7_total{cluster="isovalent-demo"}[5m])
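A sketch that mirrors the "Missing DNS Responses" idea used in the demo later on (assuming the hubble_dns_responses_total counter is enabled alongside queries; treat the metric name as an assumption) is responses minus queries, which dips negative when answers fall behind:

```
rate(hubble_dns_responses_total{cluster="isovalent-demo"}[1m])
  - rate(hubble_dns_queries_total{cluster="isovalent-demo"}[1m])
```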

Troubleshooting

No Metrics in Splunk

If you don't see metrics:

  1. Check collector logs:

    kubectl logs -n otel-splunk -l app=splunk-otel-collector --tail=200
  2. Verify scrape targets:

    kubectl describe configmap -n otel-splunk splunk-otel-collector-otel-agent
  3. Check network connectivity:

    kubectl exec -n otel-splunk -it deployment/splunk-otel-collector -- \
      curl -v https://ingest.<YOUR-REALM>.signalfx.com

Pods Not Running

If Cilium or Tetragon pods are not running:

  1. Check pod status:

    kubectl describe pod -n kube-system <cilium-pod-name>
  2. View logs:

    kubectl logs -n kube-system <cilium-pod-name>
  3. Verify node readiness:

    kubectl get nodes -o wide

Cleanup

To remove all resources and avoid AWS charges:

# Delete the OpenTelemetry Collector
helm uninstall splunk-otel-collector -n otel-splunk

# Delete the EKS cluster (this removes everything)
eksctl delete cluster --name isovalent-demo --region us-east-1
Warning

The cleanup process takes 10-15 minutes. Ensure all resources are deleted to avoid charges.

Next Steps

Now that your integration is working:

  • Deploy sample applications to generate network traffic
  • Create network policies and monitor enforcement
  • Set up alerts in Splunk for dropped packets or security events
  • Explore Hubble’s L7 visibility for HTTP/gRPC traffic
  • Use Tetragon to monitor process execution and file access
Success!

Congratulations! You've successfully integrated Isovalent Enterprise Platform with Splunk Observability Cloud.

Last Modified Nov 26, 2025

Demo — Investigating a DNS Issue with Isovalent and Splunk

What This Demo Shows

This demo tells a story that every ops or platform team has lived through: something is broken, users are complaining, and you have no idea where to start. The investigation takes you through the usual first stops — APM looks fine, infrastructure looks fine — and then pivots to the network layer, where Isovalent's Hubble observability, flowing into Splunk, reveals the real problem: a DNS overload that was completely invisible to every other tool.

The application is jobs-app, a simulated multi-service hiring platform running in the tenant-jobs namespace. It has a frontend (recruiter, jobposting), a central API (coreapi), a background data pipeline (Kafka + resumes + loader), and a crawler service that periodically makes HTTP calls out to the internet. The crawler is going to be the villain in this story.

Key Takeaway

APM and infrastructure metrics look healthy. The root cause — a DNS overload — is only visible through the Isovalent Hubble dashboards in Splunk, because it lives below the application layer.


Before You Start

Do this before anyone is in the room. You want to be sitting at a clean, healthy dashboard when the demo begins — not fiddling with kubectl while people watch.

Deploy the Jobs App

If you haven't already, deploy the jobs-app Helm chart from the isovalent-demo-jobs-app repository:

helm dependency build .
helm upgrade --install jobs-app . --namespace tenant-jobs --create-namespace

Make Sure Everything Is Running

Run through these checks so you're not surprised mid-demo:

# Confirm your nodes are healthy
kubectl get nodes

# Confirm Cilium and Hubble are running on both nodes
kubectl get pods -n kube-system | grep -E "(cilium|hubble)"

# Confirm the Splunk OTel Collector is running — this is what ships metrics to Splunk
kubectl get pods -n otel-splunk

# Confirm the jobs-app is fully deployed and healthy
kubectl get pods -n tenant-jobs
Important

All pods must be in Running state before proceeding. If the OTel Collector isn't up, no metrics will appear in Splunk and the demo won't land.

Reset the App to a Healthy Baseline

Make sure the crawler is running at a calm, normal pace — 1 replica, crawling every 0.5 to 5 seconds:

helm upgrade jobs-app . --namespace tenant-jobs --reuse-values \
  --set crawler.replicas=1 \
  --set crawler.crawlFrequencyLowerBound=0.5 \
  --set crawler.crawlFrequencyUpperBound=5 \
  --set resumes.replicas=1

Then wait at least 5 minutes. Splunk needs time to ingest a clean baseline so the spike you're about to create is visually obvious. Skip this and the charts won't tell a clear story.

Inject the Problem

About 5–10 minutes before the demo starts (or live during the demo for effect), run:

helm upgrade jobs-app . --namespace tenant-jobs --reuse-values \
  --set crawler.replicas=5 \
  --set crawler.crawlFrequencyLowerBound=0.2 \
  --set crawler.crawlFrequencyUpperBound=0.3 \
  --set resumes.replicas=2

This scales the crawler from 1 pod up to 5, and cranks the crawl interval down to 0.2–0.3 seconds. Each crawler pod makes an HTTP request to api.github.com and every one of those requests needs a DNS lookup first. Five pods hammering DNS multiple times per second generates around 15–25 DNS queries per second sustained — enough to saturate the DNS proxy and cause response latency to back up. Other services in the namespace that depend on DNS start experiencing intermittent failures, which is exactly what's in our ticket.
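The 15–25 figure follows directly from the overrides above: replicas divided by seconds per crawl, with at least one DNS lookup per crawl. A quick back-of-envelope check:

```shell
# Back-of-envelope: aggregate DNS query rate implied by the Helm overrides above.
replicas=5
min_qps=$(awk "BEGIN{printf \"%.1f\", $replicas/0.3}")  # slowest crawl interval
max_qps=$(awk "BEGIN{printf \"%.0f\", $replicas/0.2}")  # fastest crawl interval
echo "sustained DNS load: ${min_qps}-${max_qps} queries/sec"
```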


Act 1 — A Ticket Shows Up

Start by painting the picture. You don't need to click anything yet — just set the scene.

“So it's a normal afternoon and an ITSM ticket comes in. The jobs application team is saying that end users are reporting intermittent 500 errors on the recruiter and job posting pages, and load times have gotten noticeably worse over the last 15 minutes or so. It's been escalated to P2. Let's dig in.”

Ticket: INC-4072
Priority: P2 — High
Summary: Intermittent failures and slow response times on jobs-app
Description: Recruiter and job posting pages are returning 500 errors intermittently. Users report page loads have slowed significantly over the last 15 minutes. Engineering has not made any recent deployments.
Reported by: Application Support Team
Affected namespace: tenant-jobs

“No recent deployments β€” that’s actually the interesting part. There’s no obvious change event to blame. So we need to figure out what changed on our own. Where do we start? APM.”


Act 2 — Check APM (Dead End)

This is where most people would go first, and that's the point. Show APM, find it unhelpful, and use that to build the case for needing something deeper.

Navigate to: Splunk Observability Cloud → APM → Service Map

The service map for the tenant-jobs environment shows the topology: recruiter and jobposting both call coreapi, which connects to Elasticsearch. The resumes and loader services communicate over Kafka in the background.

“Here’s our service map. Every service is lit up β€” they’re all responding, all connected. Let’s look at what the numbers are actually saying.

Request rates look normal. Latency is slightly elevated, maybe, but nothing that would explain user-facing errors. Now look at the error rate on coreapi β€” it’s sitting around 10%. You might think that’s the problem, but it’s not. This app has a configurable error rate baked in as part of the setup. Ten percent is baseline, not a regression.

So APM is telling us: services are alive, traffic is flowing, and the error rate hasn’t changed. There’s nothing in the application traces that points to a root cause. Let’s try infrastructure.”

Why APM Can't See This

APM instruments application code — it observes what happens inside your services. It has no visibility into what happens at the network layer before a connection is even established. DNS resolution, connection drops, and packet-level events are invisible to it by design.


Act 3 — Check Infrastructure (Dead End)

Show infra, find it clean, and let the audience feel the frustration of not having answers yet.

Navigate to: Splunk Observability Cloud → Infrastructure → Kubernetes → Cluster: isovalent-demo

“Let's look at the cluster itself. Maybe something is resource-constrained — a node running hot, pods getting OOMKilled, something like that.

Both nodes look healthy. CPU and memory are well within normal bounds. Drilling into the pods — all of them are in Running state, no restarts, nothing being evicted. The containers themselves aren't hitting their resource limits.

So now we're in a bit of an uncomfortable spot. The ticket says users are seeing errors. APM says the app is running. Infrastructure says the cluster is healthy. Where does that leave us?

This is actually a really common situation. There's a whole class of problems that live below the application layer and below the infrastructure layer — things happening at the network level that traditional monitoring tools simply can't see. DNS failures, connection drops, policy denials, traffic asymmetry. These things don't show up in traces or pod metrics. You need something that can observe the network itself. That's where Isovalent comes in.”


Act 4 — The Network Tells the Truth

This is the heart of the demo. Take your time here.

Navigate to: Splunk Observability Cloud → Dashboards → Hubble by Isovalent

“Cilium — our CNI, the networking layer running on every node — has a built-in observability component called Hubble. Hubble uses eBPF to watch every single network flow in the cluster in real time. Not sampled, not approximated — every connection, every DNS request, every packet drop. And because we've set up the OpenTelemetry Collector to scrape those Hubble metrics and forward them to Splunk, we can see all of that right here in the same platform we were just looking at for APM and infrastructure.

Let's pull up the Hubble dashboard.”

DNS Queries Are Out of Control

Point to the DNS Queries chart, then navigate to the DNS Overview tab.

“There it is. Look at the DNS query volume β€” it spiked sharply about 15 minutes ago. That timestamp lines up exactly with when the ticket was opened.

What you’re looking at is hubble_dns_queries_total, broken down by source namespace. The spike is entirely coming from tenant-jobs β€” our application namespace. Something in the application started generating a massive amount of DNS traffic, and the DNS proxy started struggling to keep up.

But look at the bottom right β€” the Missing DNS Responses chart. This is the one with the alert firing. The value is going deeply negative, which means DNS queries are being sent out but responses are never coming back. The DNS proxy is overwhelmed and connections are just timing out in silence. That’s the ripple effect showing up as 500 errors for our users.”

Hubble DNS Overview showing Missing DNS Responses alert firing as values go deeply negative Hubble DNS Overview showing Missing DNS Responses alert firing as values go deeply negative

Top DNS Queries Reveal the Culprit

Point to the Top 10 DNS Queries chart.

“Now let’s figure out what’s making all these DNS requests. The Top 10 DNS Queries chart breaks down the most frequently queried domains, and one name is standing out by a mile: api.github.com.

That’s not a cluster-internal service β€” it’s an external endpoint. And the only thing in our app that talks to external endpoints is the crawler service. The crawler makes HTTP calls to an external URL as part of its job simulation. Every time it makes that HTTP call, it needs to resolve api.github.com through DNS first.

Normally this is fine. One crawler pod making a request every few seconds is totally manageable. But something has clearly changed about how aggressively it’s running.”

Dropped Flows Show the Blast Radius

Point to the Dropped Flows chart.

“The Dropped Flows chart is showing something else worth calling out. Hubble doesn’t just track successful connections β€” it captures every connection that gets rejected or dropped, along with a reason code for why. We’re seeing an uptick in drops starting at the exact same time as the DNS spike.

These drops are the downstream consequence of DNS overload. When services in the namespace try to make connections and DNS is too slow or failing, those connection attempts time out and get dropped. This is what APM was seeing as elevated latency β€” but APM had no idea it was a DNS problem underneath.”

Network Flow Volume Confirms the Pattern

Navigate to the Metrics & Monitoring tab.

“And if you look at the Metrics & Monitoring tab, the full picture becomes even clearer. Flows processed per node has gone vertical β€” that’s raw network traffic volume. The Forwarded vs Dropped chart is showing a meaningful proportion of those flows being dropped rather than forwarded. And the Drop Reason breakdown tells us it’s a mix of TTL_EXCEEDED and DROP_REASON_UNKNOWN β€” exactly what you’d expect when DNS timeouts start cascading. Something changed at a specific moment in time, and everything after that point looks different from the baseline.”

Hubble Metrics & Monitoring showing flow spike, forwarded vs dropped, and drop reasons Hubble Metrics & Monitoring showing flow spike, forwarded vs dropped, and drop reasons

L7 HTTP Traffic Tells an Interesting Story

Navigate to the L7 HTTP Metrics tab.

“Here’s something worth pointing out on the L7 HTTP Metrics tab, because it actually reinforces why APM wasn’t helpful. The incoming request volume is non-zero β€” traffic is still flowing. The success rate chart looks mostly green. If you were only looking at HTTP-level visibility, you might conclude the app is fine.

But look at the Incoming Requests by Source chart. The crawler is generating a disproportionate share of traffic β€” you can see it separating out from the other services. It’s making HTTP calls successfully, which is why APM doesn’t flag it. The problem is happening one layer down, in DNS, before the HTTP connections even establish.”

Hubble L7 HTTP Metrics showing crawler traffic spike with high request volume Hubble L7 HTTP Metrics showing crawler traffic spike with high request volume


Act 5 — Confirming the Root Cause

Now connect the dots and prove it.

“So here’s the full picture: at some point, the crawler service got scaled up from 1 replica to 5, and its crawl interval got set to something extremely aggressive β€” every 0.2 to 0.3 seconds. That’s 5 pods, each firing off a DNS lookup to resolve api.github.com multiple times per second. Combined, that’s 15 to 25 DNS queries per second, sustained. The DNS proxy wasn’t built to handle that kind of load from a single workload, so it starts queuing, slowing down, and eventually dropping requests. Every other service in the namespace that needs DNS resolution gets caught in the crossfire.

Let’s confirm that’s what we’re looking at.”

# Confirm the current crawler replica count — you'll see 5
kubectl get deploy crawler -n tenant-jobs

# Pull the environment config to see the crawl frequency settings
kubectl get deploy crawler -n tenant-jobs \
  -o jsonpath='{.spec.template.spec.containers[0].env}' | jq .

Optionally, switch over to the Cilium by Isovalent dashboard → Policy: L7 Proxy tab.

“If you want to see this from the Cilium side rather than the Hubble side, switch to the Cilium by Isovalent dashboard and look at the Policy: L7 Proxy tab. The L7 Request Processing Rate for FQDN — that's DNS — is sitting at over 21,000 requests. That's not per minute. The DNS proxy has been processing an extraordinary volume of FQDN lookups, all of them being received and forwarded, which is why it started backing up. This view also shows the DNS Proxy Upstream Reply latency, which confirms the proxy is under pressure.”

[Screenshot: Cilium Policy: L7 Proxy showing the FQDN request processing rate spiking to 21k+]

“There it is. Five replicas, crawling every 0.2 to 0.3 seconds.

APM can't see this because it instruments code, not DNS. Infrastructure monitoring can't see this because the pods are healthy — they're doing exactly what they were configured to do. The only tool that could catch this is something operating at the eBPF level, watching every packet, every DNS request, every connection attempt in real time. That's Hubble. And because we've wired it into Splunk, we caught it in the same dashboard we use for everything else.”


Act 6 — Fix It Live

This part is satisfying because you can watch the charts recover in real time.

“The fix is straightforward β€” scale the crawlers back down and restore the normal crawl interval.”

helm upgrade jobs-app . --namespace tenant-jobs --reuse-values \
  --set crawler.replicas=1 \
  --set crawler.crawlFrequencyLowerBound=0.5 \
  --set crawler.crawlFrequencyUpperBound=5 \
  --set resumes.replicas=1

Go back to the Hubble by Isovalent dashboard and let it sit for a minute.

“Watch the DNS Queries chart β€” you can see it coming back down almost immediately. Within a minute or two it’ll be back at baseline. Dropped flows will go to zero. Network flow volume will return to normal.

And if we went back to APM right now, we’d see latency normalizing and the error rate settling back to its expected 10% baseline.

We can close the ticket. Root cause: crawler misconfiguration causing DNS saturation. Resolution: reverted crawler replica count and crawl interval via Helm. Time to resolution: about 15 minutes from when the ticket was opened.”

Remediation Complete

DNS query rate returns to baseline, dropped flows clear, and application health is restored — all visible live in the Hubble dashboard.


Act 7 — What This Actually Means

End by zooming out and making the value statement feel concrete.

“Let’s think about what just happened here. We had a real production-style problem β€” something breaking for end users β€” and we went through the standard playbook. APM said nothing was wrong. Infrastructure said nothing was wrong. And without Hubble, the next step probably would have been a war room call, people staring at logs, maybe a full restart of the namespace hoping it would go away.

Instead, we found it in under three minutes from the moment we opened the Hubble dashboard. Not because we’re smarter, but because we had visibility into the right layer.

The reason this works is eBPF. Cilium’s Hubble component hooks into the Linux kernel and observes network events at the source β€” before they ever reach application code, before they show up in a pod log, before they become a trace in APM. And by shipping those metrics through the OpenTelemetry Collector into Splunk, they sit right alongside your APM data and your infrastructure data in the same platform. You’re not switching tools or context-switching between five different dashboards. You add a layer of visibility that wasn’t there before, and you keep it in the workflow your team already knows.

That’s the story. Network observability isn’t a niche need β€” it’s the gap that APM and infrastructure monitoring leave behind. Isovalent fills that gap, and Splunk is where you see it.”


Quick Reference

Inject the problem (run ~10 min before demo):

helm upgrade jobs-app . -n tenant-jobs --reuse-values \
  --set crawler.replicas=5 \
  --set crawler.crawlFrequencyLowerBound=0.2 \
  --set crawler.crawlFrequencyUpperBound=0.3 \
  --set resumes.replicas=2

Remediate (run live in Act 6):

helm upgrade jobs-app . -n tenant-jobs --reuse-values \
  --set crawler.replicas=1 \
  --set crawler.crawlFrequencyLowerBound=0.5 \
  --set crawler.crawlFrequencyUpperBound=5 \
  --set resumes.replicas=1

Confirm the misconfiguration:

kubectl get deploy crawler -n tenant-jobs
kubectl get deploy crawler -n tenant-jobs \
  -o jsonpath='{.spec.template.spec.containers[0].env}' | jq .

Splunk navigation path: APM → Service Map → (show it's clean) → Infrastructure → Kubernetes → (show it's clean) → Dashboards → Hubble by Isovalent → (show the DNS spike)

Timing Guide

Section                              Approx. Time
Act 1 — The Ticket                   ~1 min
Act 2 — APM (dead end)               ~2–3 min
Act 3 — Infrastructure (dead end)    ~1–2 min
Act 4 — Hubble Dashboards            ~4–5 min
Act 5 — Root Cause Confirmation      ~2 min
Act 6 — Fix It Live                  ~2 min
Act 7 — Value Wrap-Up                ~2 min
Total                                ~14–17 min