Splunk Observability Workshops

Get insights into your applications and infrastructure in real time with the monitoring, analytics and response tools of Splunk Observability Cloud.

These workshops will take you through the best-in-class observability platform for ingesting, monitoring, visualizing and analyzing metrics, traces and logs.


  • Splunk4Rookies - Observability Cloud

    In this workshop, we will show how Splunk Observability Cloud provides instant visibility of the user experience – from the perspective of the front-end application to its back-end services – letting you experience some of the most compelling product features and differentiators of Splunk Observability Cloud.

    • Workshop Overview

      Workshop Overview

    • What is OpenTelemetry & why should you care?

      Learn about OpenTelemetry and why you should care about it.

    • UI - Quick Tour 🚌

      A quick tour of the Splunk Observability Cloud UI.

    • Let's go shopping 💶

      Interact with the Online Boutique web application to generate data for Splunk Observability Cloud.

    • Splunk RUM

      This section helps you understand how to use Splunk RUM to monitor the performance of your applications from the end user's perspective.

    • Splunk APM

      In this section, we will use APM to drill down and identify where the problem is.

    • Splunk Log Observer

      In this section, we will use Log Observer to drill down and identify what the problem is.

    • Splunk Synthetics

      In this section, you will learn how to use Splunk Synthetics to monitor the performance and availability of your applications.

    • Custom Service Health Dashboard 🏥

      In this section, you will learn how to build a custom Service Health Dashboard to monitor the health of your services.

    • Workshop Wrap-up 🎁

      Congratulations, you have completed the Splunk4Rookies - Observability Cloud Workshop. Today, you have become familiar with how to use Splunk Observability Cloud to monitor your applications and infrastructure.

  • Ninja Workshops

    • Spring PetClinic - Spring Boot-Based Microservices on Kubernetes

      Learn how to enable automatic discovery and configuration for your Java-based application running in Kubernetes. Experience real-time monitoring to help you maximize application behavior with end-to-end visibility.

  • Scenarios

    Learn how to build observability solutions with Splunk

    • Debug Problems in Microservices

      This scenario helps software developers make debugging problems in microservices easier and faster, and helps platform engineering teams roll out standardized tooling more cost-effectively.

    • Optimize End User Experiences

      Use Splunk Real User Monitoring (RUM) and Synthetics to get insight into end user experience, and proactively test scenarios to improve that experience.

  • Resources

    Resources for learning about Splunk Observability Cloud


Subsections of Splunk Observability Workshops

Splunk4Rookies - Observability Cloud


In this workshop, we will show how Splunk Observability Cloud provides instant visibility of the user experience – from the perspective of the front-end application to its back-end services – letting you experience some of the most compelling product features and differentiators of Splunk Observability Cloud:

  • Infrastructure Monitoring
  • Full-fidelity Real User Monitoring (RUM)
  • Complete end-to-end NoSample Full-fidelity trace visibility with Application Performance Monitoring (APM)
  • No-code log querying
  • Synthetic Monitoring
  • Finding root causes with tag analytics and error stacks
  • Related content

A key goal in building Splunk Observability Cloud is to unify telemetry data to create a complete picture of the end-user experience and your application stack.

The workshop uses a microservices-based application deployed on AWS EC2 instances. It is a simple e-commerce application that allows users to browse products, add them to a cart, and check out. The application is instrumented with OpenTelemetry.


OpenTelemetry is a collection of tools, APIs, and software development kits (SDKs), used to instrument, generate, collect, and export telemetry data (metrics, traces, and logs) to help you analyze your software’s performance and behavior.
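
To make this concrete, here is a minimal, hedged sketch of instrumenting code with the OpenTelemetry Python SDK. It is not taken from the workshop application (which is already instrumented for you): the endpoint is the default local OTLP/gRPC Collector address, and the tracer, span and attribute names are invented for the example.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export spans to a local OpenTelemetry Collector (default OTLP/gRPC port).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317")))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("online-boutique.checkout")  # illustrative instrumentation scope
with tracer.start_as_current_span("place-order") as span:
    span.set_attribute("cart.items", 3)  # span attributes become tags in Splunk APM
```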

The OpenTelemetry community is growing. As a joint project between Splunk, Google, Microsoft, Amazon, and many other organizations, it currently has the second-largest number of contributors in the Cloud Native Computing Foundation after Kubernetes.

Full Stack

Subsections of Splunk4Rookies - Observability Cloud

Workshop Overview


Introduction

The goal of this workshop is to experience an issue and use Splunk Observability Cloud to troubleshoot and identify the root cause. For this, we have provided a complete microservices-based application running in Kubernetes that has been instrumented to send metrics, traces, and logs to Splunk Observability Cloud.

Who should attend?

Those wanting to gain an understanding of Splunk Observability in a hands-on environment. This workshop is designed for people with little or no experience with Splunk Observability.

What you’ll need

Just you, your laptop, and a browser that can access external websites. We run these workshops in person or over Zoom; if you don’t have the Zoom client on your device, you can join via a web browser.

What’s covered in this workshop?

This 3-hour session will take you through the basics of Splunk Observability, the only observability platform with streaming analytics and NoSample Full Fidelity distributed tracing, in a hands-on environment.

OpenTelemetry

A quick introduction to OpenTelemetry and why it is important for Observability.

Tour of the Splunk Observability User Interface

A walkthrough of the various components of Splunk Observability Cloud, showing you how to easily navigate the 5 main components: APM, RUM, Log Observer, Synthetics and Infrastructure Monitoring.

Generate Real User Data

Enjoy some retail therapy using the Online Boutique website. Use your browser, mobile device or tablet to spend some hard-earned virtual money and send metrics (do we have a problem?), traces (where is the problem?) and logs (what is the problem?).

Splunk Real User Monitoring (RUM)

Examine the real user data that has been provided by the telemetry received from all participants’ browser sessions. The goal is to find a browser, mobile or tablet session that performed poorly and begin the troubleshooting process.

Splunk Application Performance Monitoring (APM)

Discover the power of full end-to-end visibility by jumping from a RUM trace (front-end) to an APM trace (back-end). All the services are sending telemetry (traces and spans) that Splunk Observability Cloud can visualize, analyze and use to detect anomalies and errors.

Splunk Log Observer (LO)

Related Content is a powerful feature that allows you to jump from one component to another. In this case, we will jump from the APM trace to its related logs.

Splunk Synthetics

Wouldn’t it be great if we could have 24/7 monitoring of our application, and be alerted when there is a problem? This is where Synthetics comes in. We will show you a simple test that runs every minute and checks the performance and availability of a typical user journey through the Online Boutique.


What is OpenTelemetry & why should you care?


OpenTelemetry

With the rise of cloud computing, microservices architectures, and ever-more complex business requirements, the need for Observability has never been greater. Observability is the ability to understand the internal state of a system by examining its outputs. In the context of software, this means examining a system’s telemetry data: metrics, traces, and logs.

To make a system observable, it must be instrumented. That is, the code must emit traces, metrics, and logs. The instrumented data must then be sent to an Observability back-end such as Splunk Observability Cloud.

  • Metrics: Do I have a problem?
  • Traces: Where is the problem?
  • Logs: What is the problem?
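
As a hedged illustration of how application code produces each of the three signals, here is a small Python sketch using the OpenTelemetry SDK. Console exporters are used so it runs standalone; the meter, counter, span and log message names are all invented for the example.

```python
import logging
from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Metrics -- "Do I have a problem?": count failed checkouts.
metrics.set_meter_provider(MeterProvider(
    metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())]))
failed_checkouts = metrics.get_meter("shop").create_counter("checkout.failures")

# Traces -- "Where is the problem?": a span around the failing operation.
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(tracer_provider)

# Logs -- "What is the problem?": the error detail itself.
logging.basicConfig(level=logging.INFO)

with trace.get_tracer("shop").start_as_current_span("charge"):
    failed_checkouts.add(1, {"payment.version": "v350.10"})
    logging.error("payment failed: invalid API token")  # illustrative message
```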

OpenTelemetry does two important things:

  • Allows you to own the data that you generate rather than be stuck with a proprietary data format or tool.
  • Allows you to learn a single set of APIs and conventions.

Together, these give teams and organizations the flexibility they need in today’s modern computing world.

There are a lot of variables to consider when getting started with Observability, including the all-important question: “How do I get my data into an Observability tool?”. The industry-wide adoption of OpenTelemetry makes this question easier to answer than ever.

Why Should You Care?

OpenTelemetry is completely open-source and free to use. In the past, monitoring and Observability tools relied heavily on proprietary agents, meaning that changing or adding tooling required extensive changes across systems, from the infrastructure level to the application level.

Since OpenTelemetry is vendor-neutral and supported by many industry leaders in the Observability space, adopters can switch between supported Observability tools at any time with minor changes to their instrumentation. This is true regardless of which distribution of OpenTelemetry is used – like with Linux, the various distributions bundle settings and add-ons but are all fundamentally based on the community-driven OpenTelemetry project.

Splunk has fully committed to OpenTelemetry so that our customers can collect and use ALL their data, of any type, any structure, from any source, at any scale, and all in real time. OpenTelemetry is fundamentally changing the monitoring landscape, enabling IT and DevOps teams to bring data to every question and every action. You will experience this during these workshops.

OpenTelemetry Logo

UI - Quick Tour 🚌

We are going to start with a short walkthrough of the various components of Splunk Observability Cloud. The aim of this is to get you familiar with the UI.

  1. Signing in to Splunk Observability Cloud
  2. Real User Monitoring (RUM)
  3. Application Performance Monitoring (APM)
  4. Log Observer
  5. Synthetics
  6. Infrastructure Monitoring
Tip

The easiest way to navigate through this workshop is by using:

  • the left/right arrows (< | >) on the top right of this page
  • the left (◀️) and right (▶️) cursor keys on your keyboard

Subsections of UI - Quick Tour 🚌

Getting Started


1. Sign in to Splunk Observability Cloud

You should have received an e-mail from Splunk inviting you to the Workshop Org. This e-mail will look like the screenshot below; if you cannot find it, please check your Spam/Junk folders or inform your Instructor. You can also check for other solutions in our login F.A.Q.

To proceed click the Join Now button or click on the link provided in the e-mail.

If you have already completed the registration process you can skip the rest and proceed directly to Splunk Observability Cloud and log in:

email

If this is your first time using Splunk Observability Cloud, you will be presented with the registration form. Enter your full name and desired password. Please note that the password requirements are:

  • Must be between 8 and 32 characters
  • Must contain at least one capital letter
  • Must have at least one number
  • Must have at least one symbol (e.g. !@#$%^&*()_+)

Click the checkbox to agree to the terms and conditions and click the SIGN IN NOW button.

User-Setup

Subsections of Getting Started

Home Page


After you have registered and logged into Splunk Observability Cloud you will be taken to the home or landing page. Here, you will find several useful features to help you get started.

home page

  1. Explore your data pane: Displays which integrations are enabled and allows you to add additional integrations if you are an Administrator.
  2. Documentation pane: Training videos and links to documentation to get you started with Splunk Observability Cloud.
  3. Recents pane: Recently created/visited dashboards and/or detectors for quick access.
  4. Main Menu pane: Navigate the components of Splunk Observability Cloud.
  5. Org Switcher: Easily switch between Organizations (if you are a member of more than one Organization).
  6. Expand/Contract Main Menu: Expand » / Collapse « the main menu if space is at a premium.

Let’s start with our first exercise:

Exercise
  • Expand the Main Menu and click on Settings.
  • Check in the Org Switcher if you have access to more than one Organization.
Tip

If you have used Splunk Observability before, you may be placed in an Organization you have used previously. Make sure you are in the correct workshop organization. Verify this with your instructor if you have access to multiple Organizations.

Exercise
  • Click Onboarding Guidance (Here you can toggle the visibility of the onboarding panes. This is useful if you know the product well enough, and can use the space to show more information).
  • Hide the Onboarding Content for the Home Page.
  • At the bottom of the menu, select your preferred appearance: Light, Dark or Auto mode.
  • Did you also notice this is where the Sign Out option is? Please don’t 😊 !
  • Click < to get back to the main menu.

Next, let’s check out Splunk Real User Monitoring (RUM).


Real User Monitoring Overview


Splunk RUM is the industry’s only end-to-end, NoSample RUM solution - providing visibility into the full user experience of every web and mobile session to uniquely combine all front-end traces with back-end metrics, traces, and logs as they happen. IT Operations and Engineering teams can quickly scope, prioritize and isolate errors, measure how performance impacts real users and optimize end-user experiences by correlating performance metrics alongside video reconstructions of all user interactions.

Full user session analysis: Streaming analytics capture full user sessions from single and multi-page apps, measuring the customer impact of every resource, image, route change and API call.
Correlate issues faster: Infinite cardinality and full transaction analysis help you pinpoint and correlate issues faster across complex distributed systems.
Isolate latency and errors: Easily identify latency, errors and poor performance for each code change and deployment. Measure how content, images and third-party dependencies impact your customers.
Benchmark and improve page performance: Leverage core web vitals to measure and improve your page load experience, interactivity and visual stability. Find and fix impactful JavaScript errors, and easily understand which pages to improve first.
Explore meaningful metrics: Instantly visualize the customer impact with metrics on specific workflows, custom tags and auto-suggest un-indexed tags to quickly find the root cause of issues.
Optimize end-user experience: Correlate performance metrics alongside video reconstructions of all user interactions to optimize end-user experiences.

Architecture Overview

Subsections of Real User Monitoring Overview

Real User Monitoring Home Page

Click RUM in the main menu; this will bring you to the RUM Home or Landing page. This page is designed to give you, at a glance, the overall status of all selected RUM applications, either in a full dashboard or the compact view.

Independent of the type of Status Dashboard used, the RUM Home Page is made up of 3 distinct sections:

RUM Page

  1. Onboarding Pane: Training videos and links to documentation to get you started with Splunk RUM. (You can hide this pane in case you need the screen real estate).
  2. Filter Pane: Filter on the time frame, environment, application and source type.
  3. Application Summary Pane: Summary of all your applications that send RUM data.
RUM Environments, Applications and Source Types
  • Splunk Observability uses the environment tag, sent as part of each RUM trace (created with every interaction with your website or mobile app), to separate data coming from different environments like “Production” or “Development” (a back-end analogue is sketched after this list).
  • A further separation can be made with the application tag. This allows you to distinguish between separate browser/mobile applications running in the same environment.
  • Splunk RUM is available for both browser and mobile applications. You could use Source Type to distinguish between them; however, for this workshop, we will only use browser-based RUM.
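
On instrumented back-end services, the analogous separation comes from OpenTelemetry resource attributes. Here is a minimal Python sketch; the values are placeholders that follow the workshop naming convention:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Every span emitted by this service carries these attributes; Splunk
# Observability Cloud uses deployment.environment and service.name to
# populate the Environment and App drop-downs.
resource = Resource.create({
    "service.name": "myworkshop-store",               # placeholder
    "deployment.environment": "myworkshop-workshop",  # placeholder
})
trace.set_tracer_provider(TracerProvider(resource=resource))
```
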
Exercise
  • Ensure the time window is set to -15m
  • Select the environment for your workshop from the drop-down box. The naming convention is [NAME OF WORKSHOP]-workshop (selecting this will make sure the workshop RUM application is visible).
  • Select the App name. The naming convention here is [NAME OF WORKSHOP]-store; leave Source set to All.
  • In the JavaScript Errors tile click on the TypeError entry that says: Cannot read properties of undefined (reading ‘Prcie’) to see more details. Note that you are given a quick indication of which part of the website the error occurred on, allowing you to fix it quickly.
  • Close the pane.
  • The 3rd tile reports Web Vitals, metrics that focus on three important aspects of the user experience: loading, interactivity, and visual stability.

Based on the Web Vitals metrics, how do you rate the current web performance of the site?

According to the Web Vitals Metrics, the initial load of the site is OK and is rated Good

  • The last tile, Most recent detectors, shows whether any alerts have been triggered for the application.
  • Click on the down arrow in front of the Application name to toggle the view to the compact style. Note that you have all the main information available in this view as well. Click anywhere in the compact view to go back to the full view.

Next, let’s check out Splunk Application Performance Monitoring (APM).


Application Performance Monitoring Overview


Splunk APM provides NoSample end-to-end visibility of every service and its dependencies to solve problems quicker across monoliths and microservices. Teams can immediately detect problems from new deployments, confidently troubleshoot by scoping and isolating the source of an issue, and optimize service performance by understanding how back-end services impact end users and business workflows.

Real-time monitoring and alerting: Splunk provides out-of-the-box service dashboards and automatically detects and alerts on RED metrics (rate, error and duration) when there is a sudden change.
Dynamic telemetry maps: Easily visualize service performance in modern production environments in real-time. End-to-end visibility of service performance from infrastructure, applications, end users, and all dependencies helps quickly scope new issues and troubleshoot more effectively.
Intelligent tagging and analysis: View all tags from your business, infrastructure and applications in one place to easily compare new trends in latency or errors to their specific tag values.
AI-directed troubleshooting identifies the most impactful issues: Instead of manually digging through individual dashboards, isolate problems more efficiently. Automatically identify anomalies and the sources of errors that impact services and customers the most.
Complete distributed tracing analyzes every transaction: Identify problems in your cloud-native environment more effectively. Splunk distributed tracing visualizes and correlates every transaction from the back-end and front-end in context with your infrastructure, business workflows and applications.
Full stack correlation: Within Splunk Observability, APM links traces, metrics, logs and profiling together to easily understand the performance of every component and its dependency across your stack.
Monitor database query performance: Easily identify how slow and high execution queries from SQL and NoSQL databases impact your services, endpoints and business workflows — no instrumentation required.

Architecture Overview

Subsections of Application Performance Monitoring Overview

Application Performance Monitoring Home Page

Click APM in the main menu; the APM Home Page is made up of 3 distinct sections:

APM page

  1. Onboarding Pane: Training videos and links to documentation to get you started with Splunk APM.
  2. APM Overview Pane: Real-time metrics for the Top Services and Top Business Workflows.
  3. Functions Pane: Links for deeper analysis of your services, tags, traces, database query performance and code profiling.

The APM Overview pane provides a high-level view of the health of your application. It includes a summary of the services, latency and errors in your application. It also includes a list of the top services by error rate and the top business workflows by error rate (a business workflow is the start-to-finish journey of the collection of traces associated with a given activity or transaction and enables monitoring of end-to-end KPIs and identifying root causes and bottlenecks).

About Environments

To easily differentiate between multiple applications, Splunk uses environments. The naming convention for workshop environments is [NAME OF WORKSHOP]-workshop. Your instructor will provide you with the correct one to select.

Exercise
  • Verify that the time window we are working with is set to the last 15 minutes (-15m).
  • Change the environment to the workshop one by selecting its name from the drop-down box and make sure that is the only one selected.

What can you conclude from the Top Services by Error Rate chart?

The paymentservice has a high error rate

If you scroll down the Overview Page you will notice some services listed have Inferred Service next to them.

Splunk APM can infer the presence of a remote service, called an inferred service, if the span calling the remote service has the necessary information. Examples of possible inferred services include databases, HTTP endpoints, and message queues. Inferred services are not instrumented, but they are displayed on the service map and in the service list.

Next, let’s check out Splunk Log Observer (LO).


Log Observer Overview


Log Observer Connect allows you to seamlessly bring in the same log data from your Splunk Platform into an intuitive and no-code interface designed to help you find and fix problems quickly. You can easily perform log-based analysis and seamlessly correlate your logs with Splunk Infrastructure Monitoring’s real-time metrics and Splunk APM traces in one place.

End-to-end visibility: By combining the powerful logging capabilities of Splunk Platform with Splunk Observability Cloud’s traces and real-time metrics for deeper insights and more context of your hybrid environment.
Perform quick and easy log-based investigations: By reusing logs that are already ingested in Splunk Cloud Platform or Enterprise in a simplified and intuitive interface (no need to know SPL!) with customizable and out-of-the-box dashboards
Achieve higher economies of scale and operational efficiency: By centralizing log management across teams, breaking down data and team silos, and getting better overall support

Logo graph

Subsections of Log Observer Overview

Log Observer Home Page

Click Log Observer in the main menu; the Log Observer Home Page is made up of 4 distinct sections:

Lo Page

  1. Onboarding Pane: Training videos and links to documentation to get you started with Splunk Log Observer.
  2. Filter Bar: Filter on time, indexes, and fields; you can also save queries.
  3. Logs Table Pane: List of log entries that match the current filter criteria.
  4. Fields Pane: List of fields available in the currently selected index.
Splunk indexes

Generally, in Splunk, an “index” refers to a designated place where your data is stored. It’s like a folder or container for your data. Data within a Splunk index is organized and structured in a way that makes it easy to search and analyze. Different indexes can be created to store specific types of data. For example, you might have one index for web server logs, another for application logs, and so on.

Tip

If you have used Splunk Enterprise or Splunk Cloud before, you are probably used to starting investigations with logs. As you will see in the following exercise, you can do that with Splunk Observability Cloud as well. This workshop, however, will use all the OpenTelemetry signals for investigations.

Let’s run a little search exercise:

Exercise
  • Set the time frame to -15m.

  • Click on Add Filter in the filter bar then click on Fields in the dialog.

  • Type in cardType and select it.

  • Under Top values click on visa, then click on = to add it to the filter.

    logo search

  • Click on one of the log entries in the Logs table to validate that the entry contains cardType: "visa".

  • Let’s find all the orders that have been shipped. Click on Clear All in the filter bar to remove the previous filter.

  • Click again on Add Filter in the filter bar, then select Keyword. Next just type order: in the Enter Keyword… box and press enter.

  • You should now only have log lines that contain the word “order:”. There are still a lot of log lines, so let’s filter some more.

  • Add another filter, this time select the Fields box, then type severity in the Find a field… search box and select it. severity

  • Make sure you click Exclude all logs with this field at the bottom of the dialog box, as the order log lines do not have a severity assigned. This will remove the other log lines.

  • If you still have the onboarding content displayed at the top, you may need to scroll down the page to see the Exclude all logs with this field button.

  • You should now have a list of orders from the last 15 minutes.

Next, let’s check out Splunk Synthetics.


Synthetics Overview


Splunk Synthetic Monitoring provides visibility across URLs, APIs and critical web services to solve problems faster. IT Operations and engineering teams can easily detect, alert and prioritize issues, simulate multi-step user journeys, measure business impact from new code deployments and optimize web performance with guided step-by-step recommendations to ensure better digital experiences.

Ensure Availability: Proactively monitor and alert on the health and availability of critical services, URLs and APIs with customizable browser tests to simulate multi-step workflows that make up the user experience.
Improve Metrics: Core Web Vitals and modern performance metrics allow users to view all their performance defects in one place, measure and improve page load, interactivity and visual stability, and find and fix JavaScript errors to improve page performance.
Front-end to back-end: Integrations with Splunk APM, Infrastructure Monitoring, On-Call and ITSI help teams view endpoint uptime against back-end services, the underlying infrastructure and within their incident response coordination so they can troubleshoot across their entire environment, in a single UI.
Detect and Alert: Monitor and simulate end-user experiences to detect, communicate and resolve issues for APIs, service endpoints and critical business transactions before they impact customers.
Business Performance: Easily define multi-step user flows for key business transactions and start recording and testing your critical user journeys in minutes. Track and report SLAs and SLOs for uptime and performance.
Filmstrips and Video Playback: View screen recordings, film strips, and screenshots alongside modern performance scores, competitive benchmarking, and metrics to visualize artificial end-user experiences. Optimize how fast you deliver visual content, and improve page stability and interactivity to deploy better digital experiences.

Synthetics overview

Subsections of Synthetics Overview

Synthetics Home Page

Click on Synthetics in the main menu. This will bring us to the Synthetics Home Page. It has 3 distinct sections that provide either useful information or allow you to pick or create a Synthetic Test.

Synthetic main

  1. Onboarding Pane: Training videos and links to documentation to get you started with Splunk Synthetics.
  2. Test Pane: List of all the tests that are configured (Browser, API and Uptime)
  3. Create Test Pane: Drop-down for creating new Synthetic tests.

Info

As part of the workshop we have created a default browser test against the application we are running. You will find it in the Test Pane (2), named Workshop Browser Test for followed by the name of your workshop (your instructor should have provided that to you).

To continue our tour, let’s look at the result of our workshop’s automatic browser test.

Exercise
  • In the Test Pane, click on the line that contains the name of your workshop. The result should look like this:

Synthetics-overview

  • Note: on the Synthetic Tests page, the first pane shows the performance of your site for the last day, 8 days and 30 days. As shown in the screenshot above, a chart will only contain valid data if the test started far enough in the past; for the workshop, this depends on when it was created.
  • In the Performance KPI drop-down, change the time from the default 4 hours to the last 1 hour.

How often is the test run, and from where?

The test runs at a 1-minute round-robin interval from Frankfurt, London and Paris

Next, let’s examine the infrastructure our application is running on using Splunk Infrastructure Monitoring (IM).


Infrastructure Overview


Splunk Infrastructure Monitoring (IM) is a market-leading monitoring and observability service for hybrid cloud environments. Built on a patented streaming architecture, it provides a real-time solution for engineering teams to visualize and analyze performance across infrastructure, services, and applications in a fraction of the time and with greater accuracy than traditional solutions.

OpenTelemetry standardization: Gives you full control over your data — freeing you from vendor lock-in and from implementing proprietary agents.
Splunk’s OTel Collector: Seamless installation and dynamic configuration; auto-discovers your entire stack in seconds for visibility across clouds, services, and systems.
300+ easy-to-use OOTB content: Pre-built navigators and dashboards deliver immediate visualizations of your entire environment so that you can interact with all your data in real time.
Kubernetes navigator: Provides an instant, comprehensive out-of-the-box hierarchical view of nodes, pods, and containers. Ramp up even the most novice Kubernetes user with easy-to-understand interactive cluster maps.
AutoDetect alerts and detectors: Automatically identify the most important metrics, out-of-the-box, to create alert conditions for detectors that accurately alert from the moment telemetry data is ingested and use real-time alerting capabilities for important notifications in seconds.
Log views in dashboards: Combine log messages and real-time metrics on one page with common filters and time controls for faster in-context troubleshooting.
Metrics pipeline management: Control metrics volume at the point of ingest without re-instrumentation with a set of aggregation and data-dropping rules to store and analyze only the needed data. Reduce metrics volume and optimize observability spend.

Infrastructure Overview


Let's go shopping 💶

5 minutes  
Persona

You are a hip urban professional, longing to buy your next novelty items in the famous Online Boutique shop. You have heard that the Online Boutique is the place to go for all your hipster needs.

The purpose of this exercise is for you to interact with the Online Boutique web application. This is a sample application used to demonstrate the capabilities of Splunk Observability Cloud. It is a simple e-commerce site that allows you to browse items, add them to your cart, and then check out.

The application will already be deployed for you and your instructor will provide you with a link to the Online Boutique website, e.g.:

  • http://<s4r-workshop-i-xxx.splunk>.show:81/. The application is also running on ports 80 & 443 if you prefer to use those or if port 81 is unreachable.
Exercise - Let’s go shopping
  • Once you have the link to the Online Boutique, have a browse through a few items, add them to your cart and then, finally, do a checkout.
  • Repeat this exercise a few times and if possible use different browsers, mobile devices or tablets as this will generate more data for you to explore.
Tip

While you are waiting for pages to load, please move your mouse cursor around the page. This will generate more data for us to explore later in this workshop.

Exercise (cont.)
  • Did you notice anything about the checkout process? Did it seem to take a while to complete, but it did ultimately complete? When this happens please copy the Order Confirmation ID and save it locally somewhere as we will need it later.
  • Close the browser sessions you used to shop.

This is what a poor user experience can feel like and since this is a potential customer satisfaction issue we had better jump on this and troubleshoot.

Online Boutique

Let’s go take a look at what the data looks like in Splunk RUM.


Splunk RUM

Persona

You are a front-end engineer, or an SRE, tasked with the first triage of a performance issue. You have been asked to investigate a potential customer satisfaction issue with the Online Boutique application.

We are going to examine the real user data that has been provided by the telemetry received from all participants’ browser sessions. The goal is to find a browser, mobile or tablet session that performed poorly and begin the troubleshooting process.


Subsections of Splunk RUM

1. RUM Dashboard

In Splunk Observability Cloud, click on RUM in the main menu. You will arrive at the RUM Home page; this view has already been covered in the short introduction earlier.

multiple apps

Exercise
  • Make sure you select your workshop by ensuring the drop-downs are set/selected as follows:
    • The Time frame is set to -15m.
    • The Environment selected is [NAME OF WORKSHOP]-workshop.
    • The App selected is [NAME OF WORKSHOP]-store.
    • The Source is set to All.
  • Next, click on the [NAME OF WORKSHOP]-store above the Page Views / JavaScript Errors chart.
  • This will bring up a new dashboard view breaking down the metrics by UX Metrics, Front-end Health, Back-end Health and Custom Events and comparing them to historic metrics (1 hour by default).

RUM Dashboard

  • UX Metrics: Page Views, Page Load and Web Vitals metrics.
  • Front-end Health: Breakdown of Javascript Errors and Long Task duration and count.
  • Back-end Health: Network Errors and Requests and Time to First Byte.
  • Custom Events: RED metrics (Rate, Error & Duration) for custom events.
Exercise
  • Click through each of the tabs (UX Metrics, Front-end Health, Back-end Health and Custom Events) and examine the data.

If you examine the charts in the Custom Events tab, which chart clearly shows the latency spikes?

It is the Custom Event Latency chart


2. Tag Spotlight

Exercise
  • Make sure you are on the Custom Events tab by selecting it.

  • Have a look at the Custom Event Latency chart. The metrics shown here show the application latency. The comparison metrics to the side show the latency compared to 1 hour ago (which is selected in the top filter bar).

  • Click on the see all link under the chart title.

RUM Tag Spotlight

In this dashboard view, you are presented with all the tags associated with the RUM data. Tags are key-value pairs that are used to identify the data. In this case, the tags are automatically generated by the OpenTelemetry instrumentation. The tags are used to filter the data and to create the charts and tables. The Tag Spotlight view allows you to drill down into a user session.

Exercise
  • Change the timeframe to Last 1 hour.
  • Click Add Filters, select OS Version, click != and select Synthetics and RUMLoadGen then click the Apply Filter button.
  • Find the Custom Event Name chart, locate PlaceOrder in the list, click on it and select Add to filter.
  • Notice the large spikes in the graph across the top.
  • Click on the User Sessions tab.
  • Click on the Duration heading twice to sort the sessions by duration (longest at the top).
  • Click on the icon above the table and select Sf Geo City from the list of additional columns and click Save.

We now have a User Session table that is sorted by longest duration descending and includes the cities of all the users that have been shopping on the site. We could apply more filters to further narrow down the data e.g. OS version, browser version, etc.

RUM Tag Spotlight

3. Session Replay

Sessions

A session is a collection of traces that correspond to the actions a single user takes when interacting with an application. By default, a session lasts until 15 minutes have passed from the last event captured in the session. The maximum session duration is 4 hours.

Exercise
  • In the User Sessions table, click on the top Session ID with the longest Duration (20 seconds or longer); this will take you to the RUM Session view.

RUM Session

Exercise
  • Click the RUM Session Replay button. RUM Session Replay allows you to replay and see the user session. This is a great way to see exactly what the user experienced.
  • Click the play button to start the replay.

RUM Session Replay can redact information, by default text is redacted. You can also redact images (which has been done for this workshop example). This is useful if you are replaying a session that contains sensitive information. You can also change the playback speed and pause the replay.

Tip

When playing back the session, notice how the mouse movements are captured. This is useful to see where the user is focusing their attention.


4. User Sessions

Exercise
  • Close the RUM Session Replay by clicking on the X in the top right corner.
  • Note the length of the span; this is the time it took to complete the order. Not good!
  • Scroll down the page and you will see the Tags metadata (which is used in Tag Spotlight). After the tags, we come to the waterfall which shows the page objects that have been loaded (HTML, CSS, images, JavaScript etc.)
  • Keep scrolling down the page until you come to a blue APM link (the one with /cart/checkout at the end of the URL) and hover over it.

RUM Session

This brings up the APM Performance Summary. Having this end-to-end (RUM to APM) view is very useful when troubleshooting issues.

Exercise
  • You will see paymentservice and checkoutservice are in an error state as per the screenshot above.
  • Under Workflow Name click on front-end:/cart/checkout, this will bring up the APM Service Map.

RUM to APM

Splunk APM

Persona

You are a back-end developer and you have been called in to help investigate an issue found by the SRE. The SRE has identified a poor user experience and has asked you to investigate the issue.

Discover the power of full end-to-end visibility by jumping from a RUM trace (front-end) to an APM trace (back-end). All the services are sending telemetry (traces and spans) that Splunk Observability Cloud can visualize, analyze and use to detect anomalies and errors.

RUM and APM are two sides of the same coin. RUM is the client-side view of the application and APM is the server-side view. In this section, we will use APM to drill down and identify where the problem is.


Subsections of Splunk APM

1. APM Explore

The APM Service Map displays the dependencies and connections among your instrumented and inferred services in APM. The map is dynamically generated based on your selections in the time range, environment, workflow, service, and tag filters.

When we clicked on the APM link in the RUM waterfall, filters were automatically added to the service map view to show the services that were involved in that Workflow Name (frontend:/cart/checkout).

You can see the services involved in the workflow in the Service Map. In the side pane, under Business Workflow, charts for the selected workflow are displayed. The Service Map and Business Workflow charts are synchronized. When you select a service in the Service Map, the charts in the Business Workflow pane are updated to show metrics for the selected service.

Exercise
  • Click on the paymentservice in the Service Map.

APM Explore

Splunk APM also provides built-in Service Centric Views to help you see problems occurring in real time and quickly determine whether the problem is associated with a service, a specific endpoint, or the underlying infrastructure. Let’s have a closer look.

Exercise
  • In the right hand pane, click on View Service.

APM Service


2. APM Service Dashboard

Service Dashboard

APM Service Dashboards provide request, error, and duration (RED) metrics based on Monitoring MetricSets created from endpoint spans for your services, endpoints, and Business Workflows. If you scroll down the dashboard you will also see the host and Kubernetes-related metrics which help you determine whether there are problems with the underlying infrastructure.

Service Dashboard

Exercise
  • Check the Time box (1); you can see that the dashboards only show data relevant to the time it took for the APM trace we selected to complete (note that the charts are static).
  • In the Time box, enter -1h and hit enter.
  • The Single Value charts, Request rate, Request latency (p90) and Error rate will start updating every 10 seconds showing that we still have a large number of errors occurring.
  • These charts are very useful to quickly identify performance issues. You can use this dashboard to keep an eye on the health of your service or use it as a base for a custom one.
  • We want to use some of these charts in a later exercise:
    • In the Request rate Single Value chart (2), click the chart menu and select Copy. Note that you now have a 1 before the + at the top right of the page (3), indicating you have a copied chart on the clipboard.
    • In the Request rate line chart (4), either click on the Add to clipboard indicator that appeared (just at the (4) in the screenshot) or open the chart menu and select Add to clipboard.
  • Note that you now have a 2 before the + at the top right of the page (3).
  • Now let’s go back to the Explore view; you can hit the back button in your browser.

APM Explore

Exercise

In the Service Map hover over the paymentservice. What can you conclude from the popup service chart?

The error percentage is very high.

APM Service Chart

We need to understand if there is a pattern to this error rate. We have a handy tool for that, Tag Spotlight.


3. APM Service View

Service View

As a service owner, you can use the service view in Splunk APM to get a complete view of your service health in a single pane of glass. The service view includes a service-level indicator (SLI) for availability, dependencies, request, error, and duration (RED) metrics, runtime metrics, infrastructure metrics, Tag Spotlight, endpoints, and logs for a selected service. You can also quickly navigate to code profiling and memory profiling for your service from the service view.

Service Dashboard

Exercise
  • Check the Time box; you can see that the dashboards only show data relevant to the time it took for the APM trace we previously selected to complete (note that the charts are static).
  • In the Time box change the timeframe to -1h.
  • These charts are very useful to quickly identify performance issues. You can use this dashboard to keep an eye on the health of your service.
  • Scroll down the page and expand Infrastructure Metrics. Here you will see the metrics for the Host and Pod.
  • Runtime Metrics are not available as profiling data is not available for services written in Node.js.
  • Now let’s go back to the Explore view; you can hit the back button in your browser.

APM Explore

Exercise

In the Service Map hover over the paymentservice. What can you conclude from the popup service chart?

The error percentage is very high.

APM Service Chart

We need to understand if there is a pattern to this error rate. We have a handy tool for that, Tag Spotlight.


4. APM Tag Spotlight

Exercise
  • To view the tags for the paymentservice, click on the paymentservice and then click on Tag Spotlight in the right-hand functions pane (you may need to scroll down depending upon your screen resolution).
  • Once in Tag Spotlight, ensure the Show tags with no values toggle is off.

APM Tag Spotlight

The views in Tag Spotlight are configurable for both the chart and cards. The view defaults to Requests & Errors.

It is also possible to configure which tag metrics are displayed in the cards. You can select any combination of:

  • Requests
  • Errors
  • Root cause errors
  • P50 Latency
  • P90 Latency
  • P99 Latency


Exercise

Which card exposes the tag that identifies what the problem is?

The version card. The number of requests against v350.10 matches the number of errors, i.e. a 100% error rate.

Now that we have identified the version of the paymentservice that is causing the issue, let’s see if we can find out more information about the error. Click on ← Tag Spotlight at the top of the page to get back to the Service Map.


5. APM Service Breakdown

Exercise
  • Select the paymentservice in the Service Map.
  • In the right-hand pane click on Breakdown.
  • Select tenant.level in the list.
  • Back in the Service Map click on gold.
  • Click on Breakdown and select version, this is the tag that exposes the service version.
  • Repeat this for silver and bronze.

What can you conclude from what you are seeing?

Every tenant.level is being impacted by v350.10

You will now see the paymentservice broken down into three services, gold, silver and bronze. Each tenant is broken down into two services, one for each version (v350.10 and v350.9).

APM Service Breakdown

Span Tags

Using span tags to break down services is a very powerful feature. It allows you to see how your services are performing for different customers, different versions, different regions, etc. In this exercise, we have determined that v350.10 of the paymentservice is causing problems for all our customers.

Next, we need to drill down into a trace to see what is going on.


6. APM Trace Analyzer

As Splunk APM provides NoSample end-to-end visibility of every service, it captures every trace. For this workshop, the Order Confirmation ID is available as a tag. This means that we can use it to search for the exact trace of the poor user experience you encountered earlier in the workshop.
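
Behind the scenes, this works because the application records the order confirmation ID as a span attribute. As a hedged Python sketch of what such instrumentation could look like (the orderId attribute key matches the filter used below; the tracer and function names are illustrative):

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkoutservice")  # illustrative scope name

def confirm_order(order_id: str) -> None:
    # Tag the span with the order confirmation ID so the exact trace
    # can later be found in Trace Analyzer by filtering on orderId.
    with tracer.start_as_current_span("confirm-order") as span:
        span.set_attribute("orderId", order_id)
```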

Trace Analyzer

Splunk Observability Cloud provides several tools for exploring application monitoring data. Trace Analyzer is suited to scenarios where you have high-cardinality, high-granularity searches and explorations to research unknown or new issues.

Exercise
  • With the outer box of the paymentservice selected, in the right-hand pane, click on Traces.
  • To ensure we are using Trace Analyzer make sure the button Switch to Classic View is showing. If it is not, click on Switch to Trace Analyzer.
  • Set Time Range to Last 15 minutes.
  • Ensure the Sample Ratio is set to 1:1 and not 1:10.

APM Trace Analyzer

The Trace & error count view shows the total traces and traces with errors in a stacked bar chart. You can use your mouse to select a specific period within the available time frame.

Exercise
  • Click on the dropdown menu that says Trace & error count, and change it to Trace duration

APM Trace Analyzer Heat Map

The Trace Duration view shows a heatmap of traces by duration. The heatmap represents 3 dimensions of data:

  • Time on the x-axis
  • Trace duration on the y-axis
  • The traces (or requests) per second are represented by the heatmap shades

You can use your mouse to select an area on the heatmap, to focus on a specific time period and trace duration range.

Exercise
  • Switch from Trace duration back to Trace & Error count.
  • In the time picker select Last 1 hour.
  • Note that most of our traces have errors (red) and there are only a limited number of traces that are error-free (blue).
  • Make sure the Sample Ratio is set to 1:1 and not 1:10.
  • Click on Add filters, type in orderId and select orderId from the list.
  • Paste in your Order Confirmation ID from when you went shopping earlier in the workshop and hit enter. If you didn’t capture one, please ask your instructor for one. Traces by Duration

We have now filtered down to the exact trace where you encountered a poor user experience with a very long checkout wait.

A secondary benefit of viewing this trace is that it will remain accessible for up to 13 months, allowing developers to come back to this issue at a later stage and still view the trace.

Exercise
  • Click on the trace in the list.

Next, we will walk through the trace waterfall.


7. APM Waterfall

We have arrived at the Trace Waterfall from the Trace Analyzer. A trace is a collection of spans that share the same trace ID, representing a unique transaction handled by your application and its constituent services.

Each span in Splunk APM captures a single operation. Splunk APM considers a span to be an error span if the operation that the span captures results in an error.
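
The workshop’s paymentservice is written in Node.js (as noted in the service view section), but as a hedged Python sketch, an operation is typically marked as an error span with the OpenTelemetry API by recording the exception and setting the span status:

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("paymentservice")  # illustrative

def charge(request):
    with tracer.start_as_current_span("grpc.hipstershop.PaymentService/Charge") as span:
        try:
            raise ValueError("Invalid request")  # stand-in for the real failure
        except ValueError as exc:
            span.record_exception(exc)  # attaches the exception as a span event
            span.set_status(Status(StatusCode.ERROR, str(exc)))  # makes this an error span
            raise
```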

Trace Waterfall

Exercise
  • Click on the ! next to any of the paymentservice:grpc.hipstershop.PaymentService/Charge spans in the waterfall.

What is the error message and version being reported in the Span Details?

Invalid request and v350.10.

Now that we have identified the version of the paymentservice that is causing the issue, let’s see if we can find out more information about the error. This is where Related Logs come in.

Related Content relies on specific metadata that allows APM, Infrastructure Monitoring, and Log Observer to pass filters around Observability Cloud. For related logs to work, you need to have the following metadata in your logs (a sketch of emitting such a log line follows the list):

  • service.name
  • deployment.environment
  • host.name
  • trace_id
  • span_id
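
As a hedged sketch (the service, environment and host values are placeholders), a Python service could emit a structured log line carrying this metadata by reading the IDs from the active span:

```python
import json
import logging
from opentelemetry import trace

logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_with_trace_context(message: str) -> None:
    ctx = trace.get_current_span().get_span_context()
    # Field names as listed above; the literal values are placeholders.
    logging.error(json.dumps({
        "message": message,
        "service.name": "paymentservice",
        "deployment.environment": "myworkshop-workshop",
        "host.name": "example-host",
        "trace_id": format(ctx.trace_id, "032x"),  # 32-char hex, W3C format
        "span_id": format(ctx.span_id, "016x"),    # 16-char hex
    }))
```
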
Exercise
  • At the very bottom of the Trace Waterfall click on Logs (1). This highlights that there are Related Logs for this trace.
  • Click on the Logs for trace xxx entry in the pop-up, this will open the logs for the complete trace in Log Observer.

Related Logs

Next, let’s find out more about the error in the logs.


Splunk Log Observer

Persona

Remaining in your back-end developer role, you need to inspect the logs from your application to determine the root cause of the issue.

Using the logs related to the APM trace, we will now use Splunk Log Observer to drill down further and understand exactly what the problem is.

Related Content is a powerful feature that allows you to jump from one component to another and is available for metrics, traces and logs.


Subsections of Splunk Log Observer

1. Log Filtering

Log Observer (LO) can be used in multiple ways. In the quick tour, you used the LO no-code interface to search for specific entries in the logs. This section, however, assumes you have arrived in LO from a trace in APM using the Related Content link.

As with the link between RUM & APM, the advantage here is that you are looking at your logs within the context of your previous actions. In this case, the context is the time frame (1), which matches that of the trace, and the filter (2), which is set to the trace_id.

Trace Logs

This view will include all the log lines from all applications or services that participated in the back-end transaction started by the end-user interaction with the Online Boutique.

Even in a small application such as our Online Boutique, the sheer amount of logs found can make it hard to see the specific log lines that matter to the actual incident we are investigating.

Exercise

We need to focus on just the Error messages in the logs:

  • Click on the Group By drop-down box and use the filter to find Severity.
  • Once selected, click the Apply button (notice that the chart legend changes to show debug, error and info). legend
  • Select just the error logs by clicking on the word error (1) in the legend and selecting Add to filter, then click Run Search.
  • You could also add the service name, sf_service=paymentservice, to the filter if there are error lines for multiple services, but in our case, this is not necessary. Error Logs

Next, we will look at log entries in detail.


2. Viewing Log Entries

Before we look at a specific log line, let’s quickly recap what we have done so far and why we are here based on the 3 pillars of Observability:

  • Metrics: Do I have a problem?
  • Traces: Where is the problem?
  • Logs: What is the problem?
  • Using metrics we identified we have a problem with our application. This was obvious from the error rate in the Service Dashboards as it was higher than it should be.
  • Using traces and span tags we found where the problem is. The paymentservice comprises two versions, v350.9 and v350.10, and the error rate was 100% for v350.10.
  • We did see that this error from the paymentservice v350.10 caused multiple retries and a long delay in the response back from the Online Boutique checkout.
  • From the trace, using the power of Related Content, we arrived at the log entries for the failing paymentservice version. Now, we can determine what the problem is.
Exercise
  • Click on an error entry in the log table (make sure it says hostname: "paymentservice-xxxx", in case there is a rare error from a different service in the list too).

Based on the message, what would you tell the development team to do to resolve the issue?

The development team needs to rebuild and deploy the container with a valid API Token or rollback to v350.9.

Log Message

  • Click on the X in the log message pane to close it.
Congratulations

You have successfully used Splunk Observability Cloud to understand why you experienced a poor user experience whilst shopping at the Online Boutique. You used RUM, APM and logs to understand what happened in your service landscape and, subsequently, found the underlying cause, all based on the 3 pillars of Observability: metrics, traces and logs.

You also learned how to use Splunk’s intelligent tagging and analysis with Tag Spotlight to detect patterns in your applications’ behavior and to use the full stack correlation power of Related Content to quickly move between the different components whilst keeping in context of the issue.

In the next part of the workshop, we will move from problem-finding mode into mitigation, prevention and process improvement mode.

Next up, creating log charts in a custom dashboard.


3. Log Timeline Chart

Once you have a specific view in Log Observer, it is very useful to be able to use that view in a dashboard, to help in the future with reducing the time to detect or resolve issues. As part of the workshop, we will create an example custom dashboard that will use these charts.

Let’s look at creating a Log Timeline chart. The Log Timeline chart is used for visualizing log messages over time. It is a great way to see the frequency of log messages and to identify patterns. It is also a great way to see the distribution of log messages across your environment. These charts can be saved to a custom dashboard.

Exercise

First, we will reduce the amount of information to only the columns we are interested in:

  • Click on the Configure Table icon above the Logs table to open the Table Settings, untick _raw and ensure the following fields are selected: k8s.pod.name, message and version. Log Table Settings
  • Remove the fixed time from the time picker, and set it to the Last 15 minutes.
  • To make this work for all traces, remove the trace_id from the filter and add the fields sf_service=paymentservice and sf_environment=[WORKSHOPNAME].
  • Click Save and select Save to Dashboard. save it
  • In the chart creation dialog box that appears, for the Chart name use Log Timeline.
  • Click Select Dashboard and then click New dashboard in the Dashboard Selection dialog box.
  • In the New dashboard dialog box, enter a name for the new dashboard (no need to enter a description). Use the following format: Initials - Service Health Dashboard and click Save
  • Ensure the new dashboard is highlighted in the list (1) and click OK (2). Save dashboard
  • Ensure that Log Timeline is selected as the Chart Type. log timeline
  • Click the Save button (do not click Save and goto dashboard at this time).

Next, we will create a Log View chart.


4. Log View Chart

The next chart type that can be used with logs is the Log View chart type. This chart will allow us to see log messages based on predefined filters.

As with the previous Log Timeline chart, we will add a version of this chart to our Service Health Dashboard:

Exercise
  • After the previous exercise make sure you are still in Log Observer.
  • The filters should be the same as the previous exercise, with the time picker set to the Last 15 minutes and filtering on severity=error, sf_service=paymentservice and sf_environment=[WORKSHOPNAME].
  • Make sure the table header still shows only the fields we selected earlier.
  • Click again on Save and then Save to Dashboard.
  • This will again provide you with the Chart creation dialog.
  • For the Chart name use Log View.
  • This time, click Select Dashboard and search for the dashboard you created in the previous exercise. You can start by typing your initials in the search box (1). search dashboard
  • Click on your dashboard name to highlight it (2) and click OK (3).
  • This will return you to the create chart dialog.
  • Ensure Log View is selected as the Chart Type. log view
  • To see your dashboard click Save and go to dashboard.
  • The result should be similar to the dashboard below: Custom Dashboard
  • As the last step in this exercise, let us add your dashboard to your workshop team page; this will make it easy to find later in the workshop.
  • At the top of the page, click on the icon to the left of your dashboard name. linking
  • Select Link to teams from the drop-down.
  • In the following Link to teams dialog box, find the Workshop team that your instructor will have provided for you and click Done.

In the next session, we will look at Splunk Synthetics and see how we can automate the testing of web-based applications.


Splunk Synthetics

15 minutes  
Persona

Putting your SRE hat back on, you have been asked to set up monitoring for the Online Boutique. You need to ensure that the application is available and performing well 24 hours a day, 7 days a week.

Wouldn’t it be great if we could have 24/7 monitoring of our application, and be alerted when there is a problem? This is where Synthetics comes in. We will show you a simple test that runs every 1 minute and checks the performance and availability of a typical user journey through the Online Boutique.


Subsections of Splunk Synthetics

1. Synthetics Dashboard

In Splunk Observability Cloud from the main menu, click on Synthetics. Click on All or Browser tests to see the list of active tests.

During our investigation in the RUM section, we found there was an issue with the Place Order Transaction. Let’s see if we can confirm this from the Synthetics test as well. We will be using the metric First byte time for the 4th page of the test, which is the Place Order step.

Exercise
  • In the Search box enter [WORKSHOP NAME] and select the test for your workshop (your instructor will advise as to which one to select).
  • Under Performance KPIs set the Time Picker to Last 1 hour and hit enter.
  • Click on Location and from the drop-down select Page. The next filter will be populated with the pages that are part of the test.
  • Click on Duration, deselect Duration and select First byte time. Transaction Filter
  • Look at the legend and note the color of First byte time - Page 4.
  • Select the highest data point for First byte time - Page 4. You will now be taken to the Run results for this particular test run.

2. Synthetics Test Detail

Right now we are looking at the result of a single Synthetic Browser Test. This test is split up into Business Transactions: think of these as groups of one or more logically related interactions that represent a business-critical user flow.

Info

The screenshot below doesn’t contain a red banner with an error in it; however, you might be seeing one in your run results. This is expected, as in some cases the test run fails, and it does not impact the workshop.

waterfall

  1. Filmstrip: Offers a set of screenshots of site performance so that you can see how the page responds in real-time.
  2. Video: Lets you see exactly what a user trying to load your site from the location and device of a particular test run would experience.
  3. Browser test metrics: A view that offers you a picture of website performance.
  4. Synthetic transactions: A list of the Synthetic transactions that made up the interaction with the site.
  5. Waterfall chart: A visual representation of the interaction between the test runner and the site being tested.

By default, Splunk Synthetics provides screenshots and video capture of the test. This is useful for debugging issues. You can see, for example, the slow loading of large images, the slow rendering of a page etc.

Exercise
  • Use your mouse to scroll left and right through the filmstrip to see how the site was being rendered during the test run.
  • In the Video pane, press on the play button to see the test playback. If you click the ellipses you can change the playback speed, view it Picture in Picture and even Download the video.
  • In the Synthetic Transaction pane, under the header Business Transactions, click on the first button, Home.
  • The waterfall below will show all the objects that make up the page. The first line is the HTML page itself. The next lines are the objects that make up the page (HTML, CSS, JavaScript, Images, Fonts, etc.).
  • In the waterfall find the line GET splunk-otel-web.js.
  • Click on the > button to open the metadata section to see the Request/Response Header information.
  • In the Synthetic Transaction pane, click on the second Business Transaction Shop. Note that the filmstrip adjusts and moves to the beginning of the new transaction.
  • Repeat this for all the other Transactions, then finally select the PlaceOrder transaction.

3. Synthetics to APM

We should now have a view similar to the one below.

Place Order

Exercise
  • In the waterfall find an entry that starts with POST checkout.
  • Click on the > button in front of it to drop open the metadata section. Observe the metadata that is collected, and note the Server-Timing header. This header is what allows us to correlate the test run to a back-end trace.
  • Click on the blue APM link on the POST checkout line in the waterfall.

APM trace

Exercise
  • Validate you see one or more errors for the paymentservice (1).
  • To validate that it’s the same error, click on the related content for Logs (2).
  • Repeat the earlier exercise to filter down to the errors only.
  • View the error log to validate the failed payment due to an invalid token.

4. Synthetics Detector

Given that you can run these tests 24/7, Synthetics is an ideal tool for getting warned early if your tests are failing or starting to run longer than your agreed SLA, rather than finding out from social media or uptime websites.

Social media

To stop that from happening, let’s detect if our test is taking more than 1.1 minutes (66,000 ms).

Exercise
  • Go back to the Synthetics home page via the menu on the left

  • Select the workshop test again and click the Create Detector button at the top of the page.
    synth detector

  • Edit the text New Synthetics Detector (1) and replace it with INITIALS - [WORKSHOPNAME].

  • Ensure that Run duration and Static threshold are selected.

  • Set the Trigger threshold (2) to be around 65,000 to 68,000 and hit enter to update the chart. Make sure you have more than one spike cutting through the threshold line as shown above (you may have to adjust the threshold value a bit to match your actual latency). A programmatic sketch of this detector follows this exercise.

  • Leave the rest as default.

  • Note that there is now a row of red and white triangles appearing below the spikes (3). The red triangles let you know that your detector found your test was above the given threshold, and the white triangles indicate that the result returned below the threshold. Each red triangle will trigger an alert.

  • You can change the Alerts criticality (4) by changing the drop-down to a different level, as well as the method of alerting. Make sure you do NOT add a Recipient as this could lead to you being subjected to an alert storm!

  • Click Activate to deploy your detector.

  • To see your newly created detector, click the Edit Test button.

  • At the bottom of the page is a list of active detectors.

    list of detectors

  • If you can’t find yours, but see one called New Synthetics Detector, you may not have saved it correctly with your name. Click on the New Synthetics Detector link and redo the rename.

  • Click on the Close button to exit the edit mode.
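
For reference, the detector you just built is backed by a SignalFlow program, and an equivalent detector can be created programmatically via the Splunk Observability REST API. The sketch below is a hedged illustration only: the 66,000 ms threshold, the rule label, and the REALM and ACCESS_TOKEN variables are assumptions based on this exercise, so verify the payload shape against the current API documentation before relying on it.

# Hedged sketch: create the same static-threshold detector via the v2 API
curl -X POST "https://api.${REALM}.signalfx.com/v2/detector" \
  -H "Content-Type: application/json" \
  -H "X-SF-TOKEN: ${ACCESS_TOKEN}" \
  -d '{
    "name": "INITIALS - [WORKSHOPNAME]",
    "programText": "A = data(\"synthetics.run.duration.time.ms\").mean().publish(label=\"A\")\ndetect(when(A > 66000)).publish(\"Run duration above SLA\")",
    "rules": [{"detectLabel": "Run duration above SLA", "severity": "Major"}]
  }'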


Custom Service Health Dashboard 🏥

15 minutes  
Persona

As the SRE hat suits you, let’s keep it on: you have been asked to build a custom Service Health Dashboard for the paymentservice. The requirement is to display RED metrics, logs and Synthetic test duration results.

It is common for development and SRE teams to require a summary of the health of their applications and/or services. More often than not, these are displayed on wall-mounted TVs. Splunk Observability Cloud has the perfect solution for this by creating custom dashboards.

In this section we are going to build a Service Health Dashboard we can use to display on teams’ monitors or TVs.


Subsections of Custom Service Health Dashboard 🏥

Enhancing the Dashboard

As we already saved some useful log charts in a dashboard in the Log Observer exercise, we are going to extend that dashboard.

Wall mounted

Exercise
  • To get back to your dashboard with the two log charts, click on Dashboards from the main menu and you will be taken to your Team Dashboard view. Under Dashboards, click in the Search dashboards box to search for your Service Health Dashboard group.
  • Click on the name and this will bring up your previously saved dashboard. log list
  • The log information is useful, but the dashboard needs more context to make sense for our team, so let’s add some.
  • The first step is adding a description chart to the dashboard. Click on the New text note, replace the text in the note with the following text, then click the Save and close button and name the chart Instructions.
Information to use with text note
This is a Custom Health Dashboard for the **Payment service**,  
Please pay attention to any errors in the logs.
For more detail visit [link](https://www.splunk.com/en_us/products/observability.html)
  • The charts are not in a nice order, so let’s rearrange them into something more useful.
  • Move your mouse over the top edge of the Instructions chart until the pointer changes to a drag cursor. This allows you to drag the chart around the dashboard. Drag the Instructions chart to the top-left location and resize it to a 1/3rd of the page by dragging the right-hand edge.
  • Drag the Log Timeline chart next to the Instructions chart and resize it so that it fills the remaining 2/3rd of the page.
  • Next, resize the Log View chart to the full width of the page and make it at least twice as tall.
  • You should have something similar to the dashboard below: Initial Dashboard

This looks great, let’s continue and add more meaningful charts.


Adding a Custom Chart

In this part of the workshop we are going to create a chart that we will add to our dashboard; we will also link it to the detector we built previously. This will allow us to see the behavior of our test and get alerted if one or more of our test runs breach the SLA.

Exercise
  • At the top of the dashboard click on the + and select Chart. new chart screen
  • First, use the Untitled chart input field and name the chart Overall Test Duration.
  • For this exercise we want a bar or column chart, so click on the 3rd icon in the chart option box.
  • In the Plot editor, enter synthetics.run.duration.time.ms (this is the run duration of our test) in the Signal box and hit enter.
  • Right now we see different colored bars, a different color for each region the test runs from. As this is not needed, we can change that behavior by adding some analytics.
  • Click the Add analytics button.
  • From the drop-down choose the Mean option, then pick mean:aggregation and click outside the dialog box. Notice how the chart changes to a single color as the metrics are now aggregated.
  • The y-axis does not currently represent time; to change this, click on the settings icon at the end of the plot line. The following dialog will open: signal setup
  • Change the Display units (2) in the drop-down box from None to Time (autoscaling)/Milliseconds(ms). The drop-down changes to Millisecond and the y-axis of the chart now represents the test duration time.
  • Close the dialog, either by clicking on the settings icon or the close button.
  • Add our detector by clicking the Link Detector button and start typing the name of the detector you created earlier.
  • Click on the detector name to select it.
  • Notice that a colored border appears around the chart, indicating the status of the alert, along with a bell icon at the top of the dashboard as shown below: detector added
  • Click the Save and close button.
  • In the dashboard, move the charts so they look like the screenshot below: Service Health Dashboard
  • For the final task, click the three dots at the top of the page (next to Event Overlay) and click on View fullscreen. This will be the view you would use on the TV monitor on the wall (press Esc to go back).
Tip

In your spare time have a try at adding another custom chart to the dashboard using RUM metrics. You could copy a chart from the out-of-the-box RUM applications dashboard group. Or you could use the RUM metric rum.client_error.count to create a chart that shows the number of client errors in the application.

Finally, we will run through a workshop wrap-up.


Workshop Wrap-up 🎁

10 minutes  

Congratulations, you have completed the Splunk4Rookies - Observability Cloud Workshop. Today, you have become familiar with how to use Splunk Observability Cloud to monitor your applications and infrastructure.

Celebrate your achievement by adding this certificate to your LinkedIn profile.

Let’s recap what we have learned and what you can do next.

Champagne


Subsections of Workshop Wrap-up 🎁

Key Takeaways

During the workshop, we have seen how Splunk Observability Cloud, in combination with the OpenTelemetry signals (metrics, traces and logs), can help you reduce both mean time to detect (MTTD) and mean time to resolution (MTTR).

  • We have a better understanding of the main user interface and its components: the Landing, Infrastructure, APM, RUM, Synthetics and Dashboard pages, plus a quick peek at the Settings page.
  • Depending on time, we did an Infrastructure exercise and looked at Metrics used in the Kubernetes Navigators and saw related services found on our Kubernetes cluster:

Kubernetes

  • Understood what users were experiencing and used RUM & APM to troubleshoot a particularly long page load by following its trace across the front and back end, right down to the log entries. We used tools like RUM Session Replay and the APM Dependency map with Breakdown to discover what was causing our issue:

rum and apm

  • Used Tag Spotlight, in both RUM and APM, to understand blast radius and to detect trends and context for our performance issues and errors. We drilled down into Spans in the APM Trace waterfall to see how services interacted and to find errors:

tag and waterfall

  • We used the Related Content feature to follow the link from our Trace directly to the Logs related to that Trace, and used filters to drill down to the exact cause of our issue.

logs

  • We then looked at Synthetics, which can simulate web and mobile traffic. We used the available Synthetic Test, first to confirm our findings from RUM/APM and Log Observer, and then created a Detector so we would be alerted when the run time of a test exceeded our SLA.

  • In the final exercise, we created a health dashboard that our developers and SREs can keep running on a TV screen:

synth and TV


Ninja Workshops


Subsections of Ninja Workshops

Spring PetClinic SpringBoot Based Microservices On Kubernetes

90 minutes  

The goal of this workshop is to introduce the features of Splunk’s automatic discovery and configuration for Java.

The workshop scenario will be created by installing a simple (un-instrumented) Java microservices application in Kubernetes.

By following the simple steps to install the Splunk OpenTelemetry Collector and enabling automatic discovery and configuration for existing Java based deployments you will learn how easy it is to send metrics, traces and logs to Splunk Observability Cloud.

[!SPLUNK] Prerequisites

  • Outbound SSH access to port 2222.
  • Outbound HTTP access to port 81.
  • Familiarity with the Linux command line.

During this workshop we will cover the following components:

  • Splunk Infrastructure Monitoring (IM)
  • Splunk automatic discovery and configuration for Java (APM)
    • Database Query Performance
    • AlwaysOn Profiling
  • Splunk Log Observer (LO)
  • Splunk Real User Monitoring (RUM)

Splunk Synthetics is feeling a little left out here, but we cover that in other workshops.


Subsections of Spring PetClinic SpringBoot Based Microservices On Kubernetes

Architecture

5 minutes  

The Spring PetClinic Java application is a simple microservices application that consists of frontend and backend services. The frontend service is a Spring Boot application that serves a web interface to interact with the backend services. The backend services are Spring Boot applications that serve RESTful APIs to interact with a MySQL database.

By the end of this workshop, you will have a better understanding of how to enable automatic discovery and configuration for your Java-based applications running in Kubernetes.

The diagram below details the architecture of the Spring PetClinic Java application running in Kubernetes with the Splunk OpenTelemetry Operator and automatic discovery and configuration enabled.

Splunk Otel Architecture


Based on the example Josh Voravong created.


Preparation of the Workshop instance

15 minutes  

The instructor will provide you with the login information for the instance that we will be using during the workshop.

When you first log into your instance, you will be greeted by the Splunk Logo as shown below. If you have any issues connecting to your workshop instance then please reach out to your Instructor.

$ ssh -p 2222 splunk@<ip-address>

███████╗██████╗ ██╗     ██╗   ██╗███╗   ██╗██╗  ██╗    ██╗  
██╔════╝██╔══██╗██║     ██║   ██║████╗  ██║██║ ██╔╝    ╚██╗ 
███████╗██████╔╝██║     ██║   ██║██╔██╗ ██║█████╔╝      ╚██╗
╚════██║██╔═══╝ ██║     ██║   ██║██║╚██╗██║██╔═██╗      ██╔╝
███████║██║     ███████╗╚██████╔╝██║ ╚████║██║  ██╗    ██╔╝ 
╚══════╝╚═╝     ╚══════╝ ╚═════╝ ╚═╝  ╚═══╝╚═╝  ╚═╝    ╚═╝  
Last login: Mon Feb  5 11:04:54 2024 from [Redacted]
Waiting for cloud-init status...
Your instance is ready!
splunk@show-no-config-i-0d1b29d967cb2e6ff:~$ 

To ensure your instance is configured correctly, we need to confirm that the required environment variables for this workshop are set correctly. In your terminal run the following script and check that the environment variables are present and set with actual valid values:

. ~/workshop/petclinic/scripts/check_env.sh
ACCESS_TOKEN = <redacted>
REALM = <e.g. eu0, us1, us2, jp0, au0 etc.>
RUM_TOKEN = <redacted>
HEC_TOKEN = <redacted>
HEC_URL = https://<...>/services/collector/event
INSTANCE = <instance_name>

Please make a note of the INSTANCE environment variable value as this will be used later to filter data in Splunk Observability Cloud.

For this workshop, all of the above are required. If any have values missing, please contact your Instructor.
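
If you ever want to re-check a variable without re-running the script, you can inspect the shell environment directly (standard tooling, nothing workshop-specific):

# Print only the workshop-related variables
env | grep -E 'ACCESS_TOKEN|REALM|RUM_TOKEN|HEC_TOKEN|HEC_URL|INSTANCE'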

[!SPLUNK] Delete any existing OpenTelemetry Collectors: If you have previously completed a Splunk Observability workshop using this EC2 instance, you need to ensure that any existing installation of the Splunk OpenTelemetry Collector is deleted. This can be achieved by running the following command:

helm delete splunk-otel-collector
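
If you are unsure whether a previous release exists, you can list it first; an empty result means there is nothing to delete:

# List any Helm release whose name matches splunk-otel-collector
helm list --filter 'splunk-otel-collector'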

Subsections of Preparation of the Workshop instance

Deploy the Splunk OpenTelemetry Collector

To get Observability signals (metrics, traces and logs) into Splunk Observability Cloud the Splunk OpenTelemetry Collector needs to be deployed into the Kubernetes cluster.

For this workshop, we will be using the Splunk OpenTelemetry Collector Helm Chart. First we need to add the Helm chart repository to Helm and update to ensure the latest version:

helm repo add splunk-otel-collector-chart https://signalfx.github.io/splunk-otel-collector-chart && helm repo update
Using ACCESS_TOKEN={REDACTED}
Using REALM=eu0
"splunk-otel-collector-chart" has been added to your repositories
Using ACCESS_TOKEN={REDACTED}
Using REALM=eu0
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "splunk-otel-collector-chart" chart repository
Update Complete. ⎈Happy Helming!⎈

Splunk Observability Cloud offers wizards in the UI to walk you through the setup of the OpenTelemetry Collector on Kubernetes, but in the interest of time, we will use the Helm install command below. Additional parameters are set to enable the operator and automatic discovery and configuration.

  • --set="operator.enabled=true" - this will install the Opentelemetry operator that will be used to handle automatic discovery and configuration.
  • --set="certmanager.enabled=true" - this will install the required certificate manager for the operator.
  • --set="splunkObservability.profilingEnabled=true" - this enables Code Profiling via the operator.

To install the collector run the following command, do NOT edit this:

helm install splunk-otel-collector --version 0.117.0 \
--set="operator.enabled=true" \
--set="certmanager.enabled=true" \
--set="splunkObservability.realm=$REALM" \
--set="splunkObservability.accessToken=$ACCESS_TOKEN" \
--set="clusterName=$INSTANCE-k3s-cluster" \
--set="splunkObservability.profilingEnabled=true" \
--set="agent.service.enabled=true"  \
--set="environment=$INSTANCE-workshop" \
--set="splunkPlatform.endpoint=$HEC_URL" \
--set="splunkPlatform.token=$HEC_TOKEN" \
--set="splunkPlatform.index=splunk4rookies-workshop" \
splunk-otel-collector-chart/splunk-otel-collector \
-f ~/workshop/k3s/otel-collector.yaml
LAST DEPLOYED: Fri Apr 19 09:39:54 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
Splunk OpenTelemetry Collector is installed and configured to send data to Splunk Platform endpoint "https://http-inputs-o11y-workshop-eu0.splunkcloud.com:443/services/collector/event".

Splunk OpenTelemetry Collector is installed and configured to send data to Splunk Observability realm eu0.

[INFO] You've enabled the operator's auto-instrumentation feature (operator.enabled=true)! The operator can automatically instrument Kubernetes hosted applications.
  - Status: Instrumentation language maturity varies. See `operator.instrumentation.spec` and documentation for utilized instrumentation details.
  - Splunk Support: We offer full support for Splunk distributions and best-effort support for native OpenTelemetry distributions of auto-instrumentation libraries.

Ensure the Pods are reported as Running before continuing (this typically takes around 30 seconds).

kubectl get pods | grep splunk-otel 
splunk-otel-collector-certmanager-cainjector-5c5dc4ff8f-95z49   1/1     Running   0          10m
splunk-otel-collector-certmanager-6d95596898-vjxss              1/1     Running   0          10m
splunk-otel-collector-certmanager-webhook-69f4ff754c-nghxz      1/1     Running   0          10m
splunk-otel-collector-k8s-cluster-receiver-6bd5567d95-5f8cj     1/1     Running   0          10m
splunk-otel-collector-agent-tspd2                               1/1     Running   0          10m
splunk-otel-collector-operator-69d476cb7-j7zwd                  2/2     Running   0          10m

Ensure there are no errors reported by the Splunk OpenTelemetry Collector (press ctrl + c to exit) or use the installed awesome k9s terminal UI for bonus points!

kubectl logs -l app=splunk-otel-collector -f --container otel-collector
2021-03-21T16:11:10.900Z        INFO    service/service.go:364  Starting receivers...
2021-03-21T16:11:10.900Z        INFO    builder/receivers_builder.go:70 Receiver is starting... {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus"}
2021-03-21T16:11:11.009Z        INFO    builder/receivers_builder.go:75 Receiver started.       {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus"}
2021-03-21T16:11:11.009Z        INFO    builder/receivers_builder.go:70 Receiver is starting... {"component_kind": "receiver", "component_type": "k8s_cluster", "component_name": "k8s_cluster"}
2021-03-21T16:11:11.009Z        INFO    k8sclusterreceiver@v0.21.0/watcher.go:195       Configured Kubernetes MetadataExporter  {"component_kind": "receiver", "component_type": "k8s_cluster", "component_name": "k8s_cluster", "exporter_name": "signalfx"}
2021-03-21T16:11:11.009Z        INFO    builder/receivers_builder.go:75 Receiver started.       {"component_kind": "receiver", "component_type": "k8s_cluster", "component_name": "k8s_cluster"}
2021-03-21T16:11:11.009Z        INFO    healthcheck/handler.go:128      Health Check state change       {"component_kind": "extension", "component_type": "health_check", "component_name": "health_check", "status": "ready"}
2021-03-21T16:11:11.009Z        INFO    service/service.go:267  Everything is ready. Begin running and processing data.
2021-03-21T16:11:11.009Z        INFO    k8sclusterreceiver@v0.21.0/receiver.go:59       Starting shared informers and wait for initial cache sync.      {"component_kind": "receiver", "component_type": "k8s_cluster", "component_name": "k8s_cluster"}
2021-03-21T16:11:11.281Z        INFO    k8sclusterreceiver@v0.21.0/receiver.go:75       Completed syncing shared informer caches.       {"component_kind": "receiver", "component_type": "k8s_cluster", "component_name": "k8s_cluster"}

[!INFO] Deleting a failed installation: If you make an error installing the OpenTelemetry Collector, you can start over by deleting the installation with the following command:

helm delete splunk-otel-collector

Deploy the PetClinic Application

The first deployment of the application will be using prebuilt containers to give the base scenario: a regular Java microservices-based application running in Kubernetes that we want to start observing. So let’s deploy the application:

kubectl apply -f ~/workshop/petclinic/petclinic-deploy.yaml
deployment.apps/config-server created
service/config-server created
deployment.apps/discovery-server created
service/discovery-server created
deployment.apps/api-gateway created
service/api-gateway created
service/api-gateway-external created
deployment.apps/customers-service created
service/customers-service created
deployment.apps/vets-service created
service/vets-service created
deployment.apps/visits-service created
service/visits-service created
deployment.apps/admin-server created
service/admin-server created
service/petclinic-db created
deployment.apps/petclinic-db created
configmap/petclinic-db-initdb-config created
deployment.apps/petclinic-loadgen-deployment created
configmap/scriptfile created

At this point, we can verify the deployment by checking that the Pods are running. The containers need to be downloaded and started so this may take a couple of minutes.

kubectl get pods
NAME                                                            READY   STATUS    RESTARTS   AGE
splunk-otel-collector-certmanager-dc744986b-z2gzw               1/1     Running   0          114s
splunk-otel-collector-certmanager-cainjector-69546b87d6-d2fz2   1/1     Running   0          114s
splunk-otel-collector-certmanager-webhook-78b59ffc88-r2j8x      1/1     Running   0          114s
splunk-otel-collector-k8s-cluster-receiver-655dcd9b6b-dcvkb     1/1     Running   0          114s
splunk-otel-collector-agent-dg2vj                               1/1     Running   0          114s
splunk-otel-collector-operator-57cbb8d7b4-dk5wf                 2/2     Running   0          114s
petclinic-db-64d998bb66-2vzpn                                   1/1     Running   0          58s
api-gateway-d88bc765-jd5lg                                      1/1     Running   0          58s
visits-service-7f97b6c579-bh9zj                                 1/1     Running   0          58s
admin-server-76d8b956c5-mb2zv                                   1/1     Running   0          58s
customers-service-847db99f79-mzlg2                              1/1     Running   0          58s
vets-service-7bdcd7dd6d-2tcfd                                   1/1     Running   0          58s
petclinic-loadgen-deployment-5d69d7f4dd-xxkn4                   1/1     Running   0          58s
config-server-67f7876d48-qrsr5                                  1/1     Running   0          58s
discovery-server-554b45cfb-bqhgt                                1/1     Running   0          58s

Make sure the output of kubectl get pods matches the output as shown above. Ensure all the services are shown as Running (or use k9s to continuously monitor the status).
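
Alternatively, let kubectl watch the Pods for you until everything settles (press ctrl + c to exit):

# Stream Pod status changes as the containers download and start
kubectl get pods --watch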

To test the application you need to obtain the public IP address of the instance you are running on. You can do this by running the following command:

curl http://ifconfig.me

You can validate if the application is running by visiting http://<IP_ADDRESS>:81 (replace <IP_ADDRESS> with the IP address you obtained above). You should see the PetClinic application running. The application is also running on ports 80 & 443 if you prefer to use those, or if port 81 is unreachable.
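
If you prefer the command line, a quick status check against the gateway should return 200 once the application is up (replace <IP_ADDRESS> as above):

# Print only the HTTP status code returned by the PetClinic front end
curl -s -o /dev/null -w "%{http_code}\n" http://<IP_ADDRESS>:81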

Pet shop

Make sure the application is working correctly by visiting the All Owners (1) and Veterinarians (2) tabs; you should get a list of names in each case.

owners


Verify Kubernetes Cluster metrics

10 minutes  

Once the installation has been completed, you can log in to Splunk Observability Cloud and verify that the metrics are flowing in from your Kubernetes cluster.

From the left-hand menu click on Infrastructure and select Kubernetes, then select the Kubernetes nodes pane. Once you are in the Kubernetes nodes view, change the Time filter from -4h to the last 15 minutes (-15m) to focus on the latest data.

Next, from the list of clusters, select the cluster name of your workshop instance (you can identify yours using the INSTANCE value from the output of the shell script you ran earlier) (1).

NavigatorI

You can now select your node by clicking on its name (1) in the node list.

NavigatorII

Open the Hierarchy Map by clicking on the Hierarchy Map (1) link in the gray pane to show the graphical representation of your node.

NavigatorII

You will now only have your cluster visible. Scroll down the page to see the metrics coming in from your cluster. Locate the Node log events rate chart and click on a vertical bar to see the log entries coming in from your cluster.

logs


Setting up automatic discovery and configuration for APM

10 minutes  

In this section we will enable automatic discovery and configuration for the Java services running in Kubernetes. This means that the OpenTelemetry Collector will look for Pod annotations that indicate that the Java application should be instrumented with the Splunk OpenTelemetry Java agent. This will allow us to get traces, spans, and profiling data from the Java services running on the cluster.

automatic discovery and configuration

It is important to understand that automatic discovery and configuration is designed to get trace, span & profiling data out of your application, without requiring code changes or recompilation.

This is a great way to get started with APM, but it is not a replacement for manual instrumentation. Manual instrumentation allows you to add custom spans, tags, and logs to your application, which can provide more context and detail to your traces.

For Java applications the OpenTelemetry Collector will look for the annotation instrumentation.opentelemetry.io/inject-java.

The annotation value can be set to true or to the namespace/name of the OpenTelemetry Collector, e.g. default/splunk-otel-collector. The latter allows working across namespaces and is what we will use in this workshop.

Using the deployment.yaml

If you want your Pods to send traces automatically, you can add the annotation to the deployment.yaml as shown below. This will add the instrumentation library during the initial deployment. To speed things up we have done that for the following Pods:

  • admin-server
  • config-server
  • discovery-server
apiVersion: apps/v1
kind: Deployment
metadata:
  name: admin-server
  labels: 
    app.kubernetes.io/part-of: spring-petclinic
spec:
  selector:
    matchLabels:
      app: admin-server
  template:
    metadata:
      labels:
        app: admin-server
      annotations:
        instrumentation.opentelemetry.io/inject-java: "default/splunk-otel-collector"
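
To confirm the annotation is present on one of these pre-annotated deployments, you can read it back with kubectl; an empty result means the annotation is missing:

# Dots in the annotation key must be escaped inside the jsonpath expression
kubectl get deployment admin-server \
  -o jsonpath='{.spec.template.metadata.annotations.instrumentation\.opentelemetry\.io/inject-java}'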

Subsections of Setting up automatic discovery and configuration for APM

Patching the Deployment

To configure automatic discovery and configuration the deployments need to be patched to add the instrumentation annotation. Once patched, the OpenTelemetry Collector will inject the automatic discovery and configuration library and the Pods will be restarted in order to start sending traces and profiling data. First, confirm that the api-gateway does not have the splunk-otel-java image.

kubectl describe pods api-gateway | grep Image:
Image:         quay.io/phagen/spring-petclinic-api-gateway:0.0.2

Next, enable Java automatic discovery and configuration for all of the services by adding the annotation to the deployments. The following command will patch all deployments, triggering the OpenTelemetry Operator to inject the splunk-otel-java image into the Pods:

kubectl get deployments -l app.kubernetes.io/part-of=spring-petclinic -o name | xargs -I % kubectl patch % -p "{\"spec\": {\"template\":{\"metadata\":{\"annotations\":{\"instrumentation.opentelemetry.io/inject-java\":\"default/splunk-otel-collector\"}}}}}"
deployment.apps/config-server patched (no change)
deployment.apps/admin-server patched (no change)
deployment.apps/customers-service patched
deployment.apps/visits-service patched
deployment.apps/discovery-server patched (no change)
deployment.apps/vets-service patched
deployment.apps/api-gateway patched

There will be no change for the config-server, discovery-server and admin-server as these have already been patched.

To check the container image(s) of the api-gateway pod again, run the following command:

kubectl describe pods api-gateway | grep Image:
Image:         ghcr.io/signalfx/splunk-otel-java/splunk-otel-java:v1.30.0
Image:         quay.io/phagen/spring-petclinic-api-gateway:0.0.2

A new image has been added to the api-gateway which will pull splunk-otel-java from ghcr.io (if you see two api-gateway containers, the original one is probably still terminating, so give it a few seconds).
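
Rather than polling manually, you can wait for every patched deployment to finish rolling out, reusing the same label selector as the patch command above:

# Block until each deployment's new Pods are ready (or the timeout expires)
kubectl get deployments -l app.kubernetes.io/part-of=spring-petclinic -o name \
  | xargs -I % kubectl rollout status % --timeout=180s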

Navigate back to the Kubernetes Navigator in Splunk Observability Cloud. After a couple of minutes you will see that the Pods are being restarted by the operator and the automatic discovery and configuration container will be added. This will look similar to the screenshot below:

restart

Wait for the Pods to turn green in the Kubernetes Navigator, then go to the next section.


Viewing the data in Splunk APM

Log in to Splunk Observability Cloud, and from the left-hand menu click on APM to see the data generated by the traces from the newly instrumented services. Change the Environment filter (1) to the name of your workshop instance in the dropdown box (this will be <INSTANCE>-workshop, where INSTANCE is the value from the shell script you ran earlier) and make sure it is the only one selected.

apm

You will see the name (2) of the api-gateway service and metrics in the Latency and Request & Errors charts (you can ignore the Critical Alert, as it is caused by the sudden request increase generated by the load generator). You will also see the rest of the services appear.

Once you see the Customer service, Vets service and Visits service as shown in the screenshot above, click on the Service Map (3) pane to get ready for the next section.


APM Features

15 minutes  

As we have seen in the previous section, once you enable automatic discovery and configuration on your services, traces are sent to Splunk Observability Cloud.

With these traces, Splunk will automatically generate Service Maps and RED Metrics. These are the first steps in understanding the behavior of your services and how they interact with each other.

In this next section, we are going to examine the traces themselves and what information they provide to help you understand the behavior of your services all without touching your code.


Subsections of APM Features

APM Service Map

apm map

The above map shows all the interactions between all of the services. The map may still be in an interim state as it will take the Petclinic Microservice application a few minutes to start up and fully synchronize. Reducing the time filter to a custom time of 2 minutes will help. You can click on the Refresh button (1) on the top right of the screen. The initial startup-related errors (red dots) will eventually disappear.

Next, let’s examine the metrics that are available for each instrumented service and visit the request, error, and duration (RED) metrics dashboard.

For this exercise we are going to use a common scenario you would follow if a service operation was showing high latency or errors, for example.

Select (click) on the Customer Service in the Dependency map (1), then make sure the customers-service is selected in the Services dropdown box (2). Next, select GET /owners from the Operations dropdown (3).

This should give you the workflow with a filter on GET /owners (1) as shown below.

select a trace


APM Trace

To pick a trace, select a line in the Service Requests & Errors chart (1), when the dot appears click to get a list of sample traces:

Once you have the list of sample traces, click on the blue (2) Trace ID link (make sure it has the same three services mentioned in the Service column).

workflow-trace-pick

This brings us to the Trace selected in the Waterfall view:

Here we find several sections:

  • The actual Waterfall Pane (1), where you see the trace and all the instrumented functions visible as spans, with their duration representation and order/relationship showing.
  • The Trace Info Pane (2), by default, shows the selected Span information (highlighted with a box around the Span in the Waterfall Pane).
  • The Span Pane (3): here you can find all the Tags that have been sent in the selected Span; you can scroll down to see all of them.
  • The Process Pane, with tags related to the process that created the Span (scroll down to see it, as it is not in the screenshot).
  • The Trace Properties at the top of the right-hand pane by default is collapsed as shown.

waterfall


APM Span

While we examine our spans, let’s look at several features that you get out of the box without code modifications when using automatic discovery and configuration on top of tracing:

First, in the Waterfall Pane, make sure the customers-service:SELECT petclinic or similar span is selected as shown in the screenshot below:

DB-query

  • The basic latency information is shown as a bar for the instrumented function or call; in our example, it took 17.8 milliseconds.
  • Several similar Spans (1) are only visible if the span is repeated multiple times; in this case, there are 10 repeats in our example. (You can show/hide them all by clicking on the 10x and all spans will show in order.)
  • Inferred Services: Calls made to external systems that are not instrumented show up as a grey ‘inferred’ span. The inferred service or span in our case is a call to the MySQL database, mysql:petclinic SELECT petclinic (2), shown above our selected span.
  • Span Tags: In the Tag Pane you find the standard tags produced by automatic discovery and configuration. In this case, the span is calling a database, so it includes the db.statement tag (3). This tag holds the query statement used by the database call performed during this span, and is what the DB-Query Performance feature relies on. We look at DB-Query Performance in the next section.
  • Always-on Profiling: If the system is configured to capture Profiling data and has done so during a Span’s life cycle, it will show the number of Call Stacks captured in the Span’s timeline (18 Call Stacks for the customer-service:GET /owners Span shown above) (4).

We will look at Profiling in the next section.


Service Centric View

Splunk APM provides Service Centric Views that give engineers a deep understanding of service performance in one centralized view. Across every service, engineers can quickly identify errors or bottlenecks from a service’s underlying infrastructure, pinpoint performance degradations from new deployments, and visualize the health of every third-party dependency.

To see this dashboard for the api-gateway, click on APM from the left-hand menu and go to the Dependency Map. Make sure you have the api-gateway service selected in the Service Map, then click on the View Service button at the top of the right-hand pane. This will bring you to the Service Centric View dashboard:

service_maps

This view, which is available for each of your instrumented services, offers an overview of Service metrics, Runtime metrics and Infrastructure metrics.

You can use the Back function of your browser to go back to the previous view.


Always-On Profiling & DB Query Performance

15 minutes  

As we have seen in the previous chapter, you can trace your interactions between the various services using APM without touching your code, which will allow you to identify issues faster.

However, besides tracing, automatic discovery and configuration offers additional features out of the box that can help you find issues even faster. In this section we are going to look at two of them:

  • Always-on Profiling and Java Metrics
  • Database Query Performance

If you want to dive deeper into Always-on Profiling or DB-Query performance, we have a separate Ninja Workshop called Debug Problems in Microservices that you can follow.


Subsections of Always-On Profiling & DB Query Performance

Always-On Profiling & Metrics

When we installed the Splunk Distribution of the OpenTelemetry Collector using the Helm chart earlier, we configured it to enable AlwaysOn Profiling and Metrics. This means that the collector will automatically generate CPU and Memory profiles for the application and send them to Splunk Observability Cloud.

When you deploy the PetClinic application and set the annotation, the collector automatically detects the application and instruments it for traces and profiling. We can verify this by examining the startup logs of one of the Java containers we are instrumenting.

Run the following script; the logs should show which flags were picked up by the Java automatic discovery and configuration:

.  ~/workshop/petclinic/scripts/get_logs.sh
2024/02/15 09:42:00 Problem with dial: dial tcp 10.43.104.25:8761: connect: connection refused. Sleeping 1s
2024/02/15 09:42:01 Problem with dial: dial tcp 10.43.104.25:8761: connect: connection refused. Sleeping 1s
2024/02/15 09:42:02 Connected to tcp://discovery-server:8761
Picked up JAVA_TOOL_OPTIONS:  -javaagent:/otel-auto-instrumentation-java/javaagent.jar
Picked up _JAVA_OPTIONS: -Dspring.profiles.active=docker,mysql -Dsplunk.profiler.call.stack.interval=150
OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
[otel.javaagent 2024-02-15 09:42:03:056 +0000] [main] INFO io.opentelemetry.javaagent.tooling.VersionLogger - opentelemetry-javaagent - version: splunk-1.30.1-otel-1.32.1
[otel.javaagent 2024-02-15 09:42:03:768 +0000] [main] INFO com.splunk.javaagent.shaded.io.micrometer.core.instrument.push.PushMeterRegistry - publishing metrics for SignalFxMeterRegistry every 30s
[otel.javaagent 2024-02-15 09:42:07:478 +0000] [main] INFO com.splunk.opentelemetry.profiler.ConfigurationLogger - -----------------------
[otel.javaagent 2024-02-15 09:42:07:478 +0000] [main] INFO com.splunk.opentelemetry.profiler.ConfigurationLogger - Profiler configuration:
[otel.javaagent 2024-02-15 09:42:07:480 +0000] [main] INFO com.splunk.opentelemetry.profiler.ConfigurationLogger -                  splunk.profiler.enabled : true
[otel.javaagent 2024-02-15 09:42:07:505 +0000] [main] INFO com.splunk.opentelemetry.profiler.ConfigurationLogger -                splunk.profiler.directory : /tmp
[otel.javaagent 2024-02-15 09:42:07:505 +0000] [main] INFO com.splunk.opentelemetry.profiler.ConfigurationLogger -       splunk.profiler.recording.duration : 20s
[otel.javaagent 2024-02-15 09:42:07:506 +0000] [main] INFO com.splunk.opentelemetry.profiler.ConfigurationLogger -               splunk.profiler.keep-files : false
[otel.javaagent 2024-02-15 09:42:07:510 +0000] [main] INFO com.splunk.opentelemetry.profiler.ConfigurationLogger -            splunk.profiler.logs-endpoint : http://10.13.2.38:4317
[otel.javaagent 2024-02-15 09:42:07:513 +0000] [main] INFO com.splunk.opentelemetry.profiler.ConfigurationLogger -              otel.exporter.otlp.endpoint : http://10.13.2.38:4317
[otel.javaagent 2024-02-15 09:42:07:513 +0000] [main] INFO com.splunk.opentelemetry.profiler.ConfigurationLogger -           splunk.profiler.memory.enabled : true
[otel.javaagent 2024-02-15 09:42:07:515 +0000] [main] INFO com.splunk.opentelemetry.profiler.ConfigurationLogger -             splunk.profiler.tlab.enabled : true
[otel.javaagent 2024-02-15 09:42:07:516 +0000] [main] INFO com.splunk.opentelemetry.profiler.ConfigurationLogger -        splunk.profiler.memory.event.rate : 150/s
[otel.javaagent 2024-02-15 09:42:07:516 +0000] [main] INFO com.splunk.opentelemetry.profiler.ConfigurationLogger -      splunk.profiler.call.stack.interval : PT0.15S
[otel.javaagent 2024-02-15 09:42:07:517 +0000] [main] INFO com.splunk.opentelemetry.profiler.ConfigurationLogger -  splunk.profiler.include.internal.stacks : false
[otel.javaagent 2024-02-15 09:42:07:517 +0000] [main] INFO com.splunk.opentelemetry.profiler.ConfigurationLogger -      splunk.profiler.tracing.stacks.only : false
[otel.javaagent 2024-02-15 09:42:07:517 +0000] [main] INFO com.splunk.opentelemetry.profiler.ConfigurationLogger - -----------------------
[otel.javaagent 2024-02-15 09:42:07:518 +0000] [main] INFO com.splunk.opentelemetry.profiler.JfrActivator - Profiler is active.
We are interested in the section written by com.splunk.opentelemetry.profiler.ConfigurationLogger, i.e. the profiler configuration.

We can see the various settings you can control, some of which are useful depending on your use case, like splunk.profiler.directory: the location where the agent writes the call stacks before sending them to Splunk. This may be different depending on how you configure your containers.

Another parameter you may want to change is splunk.profiler.call.stack.interval. This is how often the system takes a CPU stack trace. You may want to reduce this if you have short spans like we have in our application. (We kept the default, as the spans in this demo application are extremely short, so a Span may not always have a CPU Call Stack related to it.)

You can find how to set these parameters here. Below is how you set a higher collection rate for call stacks in your deployment.yaml; this is also how you would pass any Java option to the Java application running in your container:

env: 
- name: JAVA_OPTIONS
  value: "-Xdebug -Dsplunk.profiler.call.stack.interval=150"

If you don’t see those lines as a result of the script, the startup may have taken too long and generated too many connection errors; try looking at the logs directly with kubectl or the k9s utility that is installed.
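
For example, a quick check against a single service (any of the instrumented deployments will do; customers-service here is just one of the deployments listed earlier):

# Look for the profiler configuration banner in the startup logs
kubectl logs deployment/customers-service --tail=200 | grep -i "splunk.profiler"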


Always-On Profiling in the Trace Waterfall

Make sure you have your original (or similar) Trace & Span (1) selected in the APM Waterfall view and select Memory Stack Traces (2) from the right-hand pane:

profiling from span

The pane should show you the Memory Stack Trace Flame Graph (3); you can scroll down and/or enlarge the pane for a better view by dragging its right side.

AlwaysOn Profiling is constantly taking snapshots, or stack traces, of your application’s code, and reading through thousands of stack traces is not practical. AlwaysOn Profiling therefore aggregates and summarizes profiling data in a view called the Flame Graph, which represents a summary of all stack traces captured from your application. You can use the Flame Graph to discover which lines of code might be causing performance issues and to confirm whether the changes you make to the code have the intended effect.

To dive deeper into Always-on Profiling, select Span (3) in the Profiling Pane under Memory Stack Traces. This will bring you to the Always-on Profiling main screen, with the Memory view pre-selected:

Profiling main

  • The Time filter will be set to the time frame of the span we selected (1).
  • Java Memory Metric Charts (2) allow you to monitor heap memory, application activity like memory allocation rate, and garbage collection metrics.
  • The ability to focus on metrics and Stack Traces related only to the Span (3); this will filter out background activities running in the Java application if required.
  • Java Function calls identified (4), allowing you to drill down into the Methods called from that function.
  • The Flame Graph (5), with the visualization of hierarchy based on the stack traces of the profiled service.
  • The ability to select the Service instance (6) in case the service spins up multiple versions of itself.

For further investigation, the UI lets you grab the actual stack trace by selecting a function and the relevant line from the Flame Graph, so you can use it in your coding platform to go to the actual lines of code used at this point (depending, of course, on your preferred coding platform).


Database Query Performance

With Database Query Performance, you can monitor the impact of your database queries on service availability directly in Splunk APM. This way, you can quickly identify long-running, un-optimized, or heavy queries and mitigate issues they might be causing, without having to instrument your databases.

To look at the performance of your database queries, make sure you are on the APM Service Map page, either by going back in the browser or by navigating to the APM section in the Menu bar and clicking on the Service Map tile. Select the inferred database service mysql:petclinic in the Dependency map (1), then scroll the right-hand pane to find the Database Query Performance pane (2).

DB-query from map

If the service you have selected in the map is indeed an (inferred) database server, this pane will populate with the top database calls based on 90th percentile (P90) duration. To dive deeper into the Database Query Performance feature, click on the words Database Query Performance at the top of the pane.

This will bring us to the DB-query Performance overview screen:

DB-query full

Database Query Normalization

By default, Splunk APM instrumentation sanitizes database queries to remove or mask sensitive data, such as secrets or personally identifiable information (PII), from the db.statements. You can find how to turn off database query normalization here.
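
As a hedged illustration only: in the upstream OpenTelemetry Java agent this behavior is governed by the otel.instrumentation.common.db-statement-sanitizer.enabled property, which could be passed to a service the same way other Java options are passed in this workshop. Treat both the property name and the command below as assumptions to verify against the current agent documentation:

# Assumption: disable query sanitization for one service via _JAVA_OPTIONS
kubectl set env deployment/customers-service \
  _JAVA_OPTIONS="-Dotel.instrumentation.common.db-statement-sanitizer.enabled=false"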

This screen shows all the database queries (1) made against our database by your application, based on the Traces & Spans sent to Splunk Observability Cloud. Note that you can compare them across a time block or sort them on Total Time, P90 Latency & Requests (2).

For each database query in the list, we see the highest latency, the total number of calls during the time window, and the number of requests per second (3). This allows you to identify places where you might optimize your queries.

You can select traces containing database calls via the two charts in the right-hand pane (5). Use the Tag Spotlight pane (6) to drill down into which tags are related to the database calls, based on endpoints or tags.

If you need to see a detailed view of a query:

details details

Click on the specific query (1); this will open a detailed Query Details pane (2), which you can use for more detailed investigations:


Log Observer

10 minutes  

Up until this point, there have been no code changes, yet tracing, profiling and Database Query Performance data is being sent to Splunk Observability Cloud.

Next, we will add the Splunk Log Observer to the mix to obtain log data from the Spring PetClinic application.

The Splunk OpenTelemetry Collector automatically collects logs from the Spring PetClinic application and sends them to Splunk Observability Cloud using the OTLP exporter, annotating the log events with trace_id, span_id and trace flags.

Splunk Log Observer is then used to view the logs, and with the changes to the log format, the platform can automatically correlate log information with services and traces.

This feature is called Related Content.
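
To illustrate what makes this correlation possible: the Java agent injects the active trace context into the logging MDC under the keys trace_id, span_id and trace_flags. The sketch below is illustrative only, assuming SLF4J/Logback logging under the Splunk OpenTelemetry Java agent; the class and log message are hypothetical, and no application code changes are actually required:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class VisitService {

    private static final Logger log = LoggerFactory.getLogger(VisitService.class);

    public void recordVisit(String petId) {
        // With the agent attached, the MDC of the current thread already
        // carries the active trace context, so a Logback pattern such as
        //   ... trace_id=%X{trace_id} span_id=%X{span_id} ... %msg%n
        // emits log lines that Splunk Observability Cloud can correlate
        // with the matching APM trace.
        log.info("Recording visit for pet {} (trace_id={})", petId, MDC.get("trace_id"));
    }
}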


Subsections of Log Observer

Related Content

The bottom pane is where any related content will be reported. In the screenshot below, you can see that APM has found a trace that is related to this log line (1):

RC RC

By clicking (2) on Trace for 960432ac9f16b98be84618778905af50 we will be taken to the waterfall in APM for this specific trace, where this log line was generated:

waterfall logs waterfall logs

Note that a Related Content pane for Logs now appears (1). Clicking on this will take you back to Log Observer and will display all the log lines that are part of this trace.


Real User Monitoring

10 minutes  

To enable Real User Monitoring (RUM) instrumentation for an application, you need to add the OpenTelemetry JavaScript snippet (https://github.com/signalfx/splunk-otel-js-web) to the code base.

The Spring PetClinic application uses a single index HTML page that is reused across all views of the application. This is the perfect location to insert the Splunk RUM instrumentation library, as it will be loaded for all pages automatically.

The api-gateway service is already running the instrumentation and sending RUM traces to Splunk Observability Cloud; we will review the data in the next section.

If you want to verify the snippet, you can view the page source in your browser by right-clicking on the page and selecting View Page Source.

    <script src="/env.js"></script>  
    <script src="https://cdn.signalfx.com/o11y-gdi-rum/latest/splunk-otel-web.js" crossorigin="anonymous"></script>
    <script src="https://cdn.signalfx.com/o11y-gdi-rum/latest/splunk-otel-web-session-recorder.js" crossorigin="anonymous"></script>
    <script>
        var realm = env.RUM_REALM;
        console.log('Realm:', realm);
        var auth = env.RUM_AUTH;
        var appName = env.RUM_APP_NAME;
        var environmentName = env.RUM_ENVIRONMENT;
        if (realm && auth) {
            SplunkRum.init({
                realm: realm,
                rumAccessToken: auth,
                applicationName: appName,
                deploymentEnvironment: environmentName,
                version: '1.0.0',
            });
    
            SplunkSessionRecorder.init({
                app: appName,
                realm: realm,
                rumAccessToken: auth
            });
            const Provider = SplunkRum.provider; 
            var tracer=Provider.getTracer('appModuleLoader');
        } else {
            // Realm or auth is empty, provide default values or skip initialization
            console.log("Realm or auth is empty. Skipping Splunk RUM initialization.");
        }
    </script>
     <!-- Section added for RUM -->

Subsections of Real User Monitoring

Select the RUM view for the Petclinic App

Let’s start with a quick high-level tour of RUM by clicking RUM in the left-hand menu. Then change the Environment filter (1) to the name of your workshop instance from the dropdown box; it will be <INSTANCE>-workshop, where INSTANCE is the value from the shell script you ran earlier. Make sure it is the only one selected.

Then change the App (2) dropdown box to the name of your app; it will be <INSTANCE>-store.

rum select rum select

Once you have selected your Environment and App, you will see an overview page showing the RUM status of your app. (If your Summary Dashboard is just a single row of numbers, you are looking at the condensed view; you can expand it by clicking on the > (1) in front of the application name.) If any JavaScript errors occurred, they will show up as shown below:

rum overview rum overview

To continue, click on the blue link (with your workshop name) to get to the details page. This will bring up a new dashboard view breaking down the interactions by UX Metrics, Front-end Health, Back-end Health and Custom Events, and comparing them to historic metrics (1 hour by default).

rum main rum main

Normally you have only one line inside the first chart. Click on the link that relates to your PetClinic shop, http://198.19.249.202:81 in our example:

This will bring us to the Tag Spotlight page.


RUM trace Waterfall view & linking to APM

In the Tag Spotlight view, you are presented with all the tags associated with the RUM data. Tags are key-value pairs that are used to identify the data; in this case, they are automatically generated by the OpenTelemetry instrumentation. The tags are used to filter the data and to create the charts and tables. The Tag Spotlight view allows you to detect trends in behavior and to drill down into a user session.

RUM TAG RUM TAG

Click on User Sessions (1); this will show you the list of user sessions that occurred during the time window. We want to look at one of the sessions, so click on Duration (2) to sort by duration, and make sure you click on the link of one of the longer ones (3):

User sessions User sessions


RUM trace Waterfall view & linking to APM

We are now looking at the RUM trace waterfall. This will tell you what happened during the session on the user’s device as they visited the pages of our PetClinic application.

If you scroll down the waterfall and click on the #!/owners/details segment on the right (1), you see a list of actions that occurred during the handling of that request. Note that the HTTP requests have a blue APM link before the return code. Pick one, and click on the APM link. This will show you the APM information for this service call to our microservices in Kubernetes.

rum apm link rum apm link

Note that this gives you the information about what happened during this action in the microservices. If you want to drill down to verify what happened with the request, click on the Trace ID URL.

This will show you the trace related to your request from RUM:

RUM-APM linked RUM-APM linked

You can see that the entry point into your service now has a RUM (1) Related Content link added, allowing you to return to your RUM session after you have validated what happened in your microservices.


Workshop Wrap-up 🎁

Congratulations, you have completed the Get the Most Out of Your Existing Kubernetes Java Applications Using Automatic Discovery and Configuration With OpenTelemetry workshop.

Today, you have learnt how easy it is to add Tracing, Code Profiling and Database Query Performance to your existing Java application in Kubernetes.

You immediately improved the observability of the application and infrastructure without touching a line of code or configuration, using Automatic Discovery and Configuration.

You also learnt that with simple configuration changes you can add even more observability (logging and RUM) to the application in order to provide end-to-end observability.

Champagne Champagne


Scenarios

  • Debug Problems in Microservices

    This scenario helps software developers to make debugging problems in microservices easier, faster, and more cost-effective for platform engineering teams rolling out standardized tooling.

  • Optimize End User Experiences

    Use Splunk Real User Monitoring (RUM) and Synthetics to get insight into end user experience, and proactively test scenarios to improve that experience.


Subsections of Scenarios

Debug Problems in Microservices

2 minutes   Author Derek Mitchell

Service Maps and Traces are extremely valuable in determining which service an issue resides in. And related log data helps provide detail on why issues are occurring in that service.

But engineers sometimes need to go even deeper to debug a problem that’s occurring in one of their services.

This is where features such as Splunk’s AlwaysOn Profiling and Database Query Performance come in.

AlwaysOn Profiling continuously collects stack traces so that you can discover which lines in your code are consuming the most CPU and memory.

And Database Query Performance can quickly identify long-running, unoptimized, or heavy queries and mitigate issues they might be causing.

In this workshop, we’ll explore:

  • How to debug an application with several performance issues.
  • How to use Database Query Performance to find slow-running queries that impact application performance.
  • How to enable AlwaysOn Profiling and use it to find the code that consumes the most CPU and memory.
  • How to apply fixes based on findings from Splunk Observability Cloud and verify the result.

The workshop uses a Java-based application called The Door Game hosted in Kubernetes. Let’s get started!

Tip

The easiest way to navigate through this workshop is by using:

  • the left/right arrows (< | >) on the top right of this page
  • the left (◀️) and right (▶️) cursor keys on your keyboard

Subsections of Debug Problems in Microservices

Build the Sample Application

10 minutes  

Introduction

For this workshop, we’ll be using a Java-based application called The Door Game. It will be hosted in Kubernetes.

Pre-requisites

You will start with an EC2 instance and perform some initial steps in order to get to the following state:

  • Deploy the Splunk distribution of the OpenTelemetry Collector
  • Deploy the MySQL database container and populate data
  • Build and deploy the doorgame application container

Initial Steps

The initial setup can be completed by executing the following steps on the command line of your EC2 instance.

You’ll be asked to enter a name for your environment. Please use profiling-workshop-yourname (where yourname is replaced by your actual name).

cd workshop/profiling
./1-deploy-otel-collector.sh
./2-deploy-mysql.sh
./3-deploy-doorgame.sh

Let’s Play The Door Game

Now that the application is deployed, let’s play with it and generate some observability data.

Get the external IP address for your application instance using the following command:

kubectl describe svc doorgame | grep "LoadBalancer Ingress"

The output should look like the following:

LoadBalancer Ingress:     52.23.184.60

You should be able to access The Door Game application by pointing your browser to port 81 of the provided IP address. For example:

http://52.23.184.60:81

You should be met with The Door Game intro screen:

Door Game Welcome Screen Door Game Welcome Screen

Click Let's Play to start the game:

Let’s Play Let’s Play

Did you notice that it took a long time after clicking Let's Play before we could actually start playing the game?

Let’s use Splunk Observability Cloud to determine why the application startup is so slow.


Troubleshoot Game Startup

10 minutes  

Let’s use Splunk Observability Cloud to determine why the game started so slowly.

View your application in Splunk Observability Cloud

Note: when the application is deployed for the first time, it may take a few minutes for the data to appear.

Navigate to APM, then use the Environment dropdown to select your environment (i.e. profiling-workshop-name).

If everything was deployed correctly, you should see doorgame displayed in the list of services:

APM Overview APM Overview

Click on Explore on the right-hand side to view the service map. We should see the doorgame application on the service map:

Service Map Service Map

Notice how the majority of the time is being spent in the MySQL database. We can get more details by clicking on Database Query Performance on the right-hand side.

Database Query Performance Database Query Performance

This view shows the SQL queries that took the most time. Ensure that the Compare to dropdown is set to None, so we can focus on current performance.

We can see that one query in particular is taking a long time:

select * from doorgamedb.users, doorgamedb.organizations

(do you notice anything unusual about this query?)

Let’s troubleshoot further by clicking on one of the spikes in the latency graph. This brings up a list of example traces that include this slow query:

Traces with Slow Query Traces with Slow Query

Click on one of the traces to see the details:

Trace with Slow Query Trace with Slow Query

In the trace, we can see that the DoorGame.startNew operation took 25.8 seconds, and 17.6 seconds of this was associated with the slow SQL query we found earlier.

What did we accomplish?

To recap what we’ve done so far:

  • We’ve deployed our application and are able to access it successfully.
  • The application is sending traces to Splunk Observability Cloud successfully.
  • We started troubleshooting the slow application startup time, and found a slow SQL query that seems to be the root cause.

To troubleshoot further, it would be helpful to get deeper diagnostic data that tells us what’s happening inside our JVM, from both a memory (i.e. JVM heap) and CPU perspective. We’ll tackle that in the next section of the workshop.


Enable AlwaysOn Profiling

20 minutes  

Let’s learn how to enable the memory and CPU profilers, verify their operation, and use the results in Splunk Observability Cloud to find out why our application startup is slow.

Update the application configuration

We will need to pass additional configuration arguments to the Splunk OpenTelemetry Java agent in order to enable both profilers. The configuration is documented here in detail, but for now we just need the following settings:

SPLUNK_PROFILER_ENABLED="true"
SPLUNK_PROFILER_MEMORY_ENABLED="true"

Since our application is deployed in Kubernetes, we can update the Kubernetes manifest file to set these environment variables. Open the doorgame/doorgame.yaml file for editing, and ensure the values of the following environment variables are set to “true”:

- name: SPLUNK_PROFILER_ENABLED
  value: "true"
- name: SPLUNK_PROFILER_MEMORY_ENABLED
  value: "true"

Next, let’s redeploy the Door Game application by running the following command:

cd workshop/profiling
./5-redeploy-doorgame.sh

After a few minutes, a new pod will be deployed with the updated application settings.

Confirm operation

To ensure the profiler is enabled, let’s review the application logs with the following command:

kubectl logs -l app=doorgame --tail=100 | grep JfrActivator

You should see a line in the application log output that shows the profiler is active:

[otel.javaagent 2024-02-05 19:01:12:416 +0000] [main] INFO com.splunk.opentelemetry.profiler.JfrActivator - Profiler is active.

This confirms that the profiler is enabled and sending data to the OpenTelemetry collector deployed in our Kubernetes cluster, which in turn sends profiling data to Splunk Observability Cloud.

Profiling in APM

Visit http://<your IP address>:81 and play a few more rounds of The Door Game.

Then head back to Splunk Observability Cloud, click on APM, and click on the doorgame service at the bottom of the screen.

Click on “Traces” on the right-hand side to load traces for this service. Filter on traces involving the doorgame service and the GET new-game operation (since we’re troubleshooting the game startup sequence):

New Game Traces New Game Traces

Selecting one of these traces brings up the following screen:

Trace with Call Stacks Trace with Call Stacks

You can see that the spans now include “Call Stacks”, which is a result of us enabling CPU and memory profiling earlier.

Click on the span named doorgame: SELECT doorgamedb, then click on CPU stack traces on the right-hand side:

Trace with CPU Call Stacks Trace with CPU Call Stacks

This brings up the CPU call stacks captured by the profiler.

Let’s open the AlwaysOn Profiler to review the CPU stack trace in more detail. We can do this by clicking on the Span link beside View in AlwaysOn Profiler:

Flamegraph and table Flamegraph and table

The AlwaysOn Profiler includes both a table and a flamegraph. Take some time to explore this view by doing some of the following:

  • click a table item and notice the change in flamegraph
  • navigate the flamegraph by clicking on a stack frame to zoom in, and a parent frame to zoom out
  • add a search term like splunk or jetty to highlight some matching stack frames

Let’s have a closer look at the stack trace, starting with the DoorGame.startNew method (since we already know that it’s the slowest part of the request):

com.splunk.profiling.workshop.DoorGame.startNew(DoorGame.java:24)
com.splunk.profiling.workshop.UserData.loadUserData(UserData.java:33)
com.mysql.cj.jdbc.StatementImpl.executeQuery(StatementImpl.java:1168)
com.mysql.cj.NativeSession.execSQL(NativeSession.java:655)
com.mysql.cj.protocol.a.NativeProtocol.sendQueryString(NativeProtocol.java:998)
com.mysql.cj.protocol.a.NativeProtocol.sendQueryPacket(NativeProtocol.java:1065)
com.mysql.cj.protocol.a.NativeProtocol.readAllResults(NativeProtocol.java:1715)
com.mysql.cj.protocol.a.NativeProtocol.read(NativeProtocol.java:1661)
com.mysql.cj.protocol.a.TextResultsetReader.read(TextResultsetReader.java:48)
com.mysql.cj.protocol.a.TextResultsetReader.read(TextResultsetReader.java:87)
com.mysql.cj.protocol.a.NativeProtocol.read(NativeProtocol.java:1648)
com.mysql.cj.protocol.a.ResultsetRowReader.read(ResultsetRowReader.java:42)
com.mysql.cj.protocol.a.ResultsetRowReader.read(ResultsetRowReader.java:75)
com.mysql.cj.protocol.a.MultiPacketReader.readMessage(MultiPacketReader.java:44)
com.mysql.cj.protocol.a.MultiPacketReader.readMessage(MultiPacketReader.java:66)
com.mysql.cj.protocol.a.TimeTrackingPacketReader.readMessage(TimeTrackingPacketReader.java:41)
com.mysql.cj.protocol.a.TimeTrackingPacketReader.readMessage(TimeTrackingPacketReader.java:62)
com.mysql.cj.protocol.a.SimplePacketReader.readMessage(SimplePacketReader.java:45)
com.mysql.cj.protocol.a.SimplePacketReader.readMessage(SimplePacketReader.java:102)
com.mysql.cj.protocol.a.SimplePacketReader.readMessageLocal(SimplePacketReader.java:137)
com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:64)
java.io.FilterInputStream.read(Unknown Source:0)
sun.security.ssl.SSLSocketImpl$AppInputStream.read(Unknown Source:0)
sun.security.ssl.SSLSocketImpl.readApplicationRecord(Unknown Source:0)
sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(Unknown Source:0)
sun.security.ssl.SSLSocketInputRecord.readHeader(Unknown Source:0)
sun.security.ssl.SSLSocketInputRecord.read(Unknown Source:0)
java.net.SocketInputStream.read(Unknown Source:0)
java.net.SocketInputStream.read(Unknown Source:0)
java.lang.ThreadLocal.get(Unknown Source:0)

We can interpret the stack trace as follows:

  • When starting a new Door Game, a call is made to load user data.
  • This results in executing a SQL query to load the user data (which is related to the slow SQL query we saw earlier).
  • We then see calls to read data in from the database.

So, what does this all mean? It means that our application startup is slow since it’s spending time loading user data. In fact, the profiler has told us the exact line of code where this happens:

com.splunk.profiling.workshop.UserData.loadUserData(UserData.java:33)

Let’s open the corresponding source file (./doorgame/src/main/java/com/splunk/profiling/workshop/UserData.java) and look at this code in more detail:

public class UserData {

    static final String DB_URL = "jdbc:mysql://mysql/DoorGameDB";
    static final String USER = "root";
    static final String PASS = System.getenv("MYSQL_ROOT_PASSWORD");
    static final String SELECT_QUERY = "select * FROM DoorGameDB.Users, DoorGameDB.Organizations";

    HashMap<String, User> users;

    public UserData() {
        users = new HashMap<String, User>();
    }

    public void loadUserData() {

        // Load user data from the database and store it in a map
        Connection conn = null;
        Statement stmt = null;
        ResultSet rs = null;

        try{
            conn = DriverManager.getConnection(DB_URL, USER, PASS);
            stmt = conn.createStatement();
            rs = stmt.executeQuery(SELECT_QUERY);
            while (rs.next()) {
                User user = new User(rs.getString("UserId"), rs.getString("FirstName"), rs.getString("LastName"));
                users.put(rs.getString("UserId"), user);
            }

Here we can see the application logic in action. It establishes a connection to the database, then executes the SQL query we saw earlier:

select * FROM DoorGameDB.Users, DoorGameDB.Organizations

Note that the query lists two tables with no join condition, so it produces a Cartesian product: every user row is paired with every organization row (for example, 10,000 users × 1,000 organizations would yield 10,000,000 result rows). The code then loops through each of the results and loads each user into a HashMap object, which is a collection of User objects.

We have a good understanding of why the game startup sequence is so slow, but how do we fix it?

For more clues, let’s have a look at the other part of AlwaysOn Profiling: memory profiling. To do this, click on the Memory tab in AlwaysOn profiling:

Memory Profiling Memory Profiling

At the top of this view, we can see how much heap memory our application is using, the heap memory allocation rate, and garbage collection activity.

We can see that our application is using about 400 MB out of the max 1 GB heap size, which seems excessive for such a simple application. We can also see that some garbage collection occurred, which caused our application to pause (and probably annoyed those waiting to play The Door Game).

At the bottom of the screen, we can see which methods in our Java application code are associated with the most heap memory usage. Click on the first item in the list to show the Memory Allocation Stack Traces associated with the java.util.Arrays.copyOf method specifically:

Memory Allocation Stack Traces Memory Allocation Stack Traces

With help from the profiler, we can see that the loadUserData method not only consumes excessive CPU time, but it also consumes excessive memory when storing the user data in the HashMap collection object.
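
Before moving on, it is worth sketching the direction a fix could take. The snippet below is illustrative only and is not the workshop’s actual fix (that comes in the next section); it simply assumes that, since the code only ever reads the UserId, FirstName and LastName columns, querying just the Users table would be enough:

// Illustrative sketch only -- not the workshop's actual fix.
// Selecting only the needed columns from the single table that holds
// user data avoids the implicit cross join with Organizations, which
// shrinks both the CPU time spent reading rows and the heap needed
// to hold them.
static final String SELECT_QUERY =
        "SELECT UserId, FirstName, LastName FROM DoorGameDB.Users";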

What did we accomplish?

We’ve come a long way already!

  • We learned how to enable the profiler in the Splunk OpenTelemetry Java instrumentation agent.
  • We learned how to verify in the agent output that the profiler is enabled.
  • We have explored several profiling related workflows in APM:
    • How to navigate to AlwaysOn Profiling from the troubleshooting view
    • How to explore the flamegraph and method call duration table through navigation and filtering
    • How to identify when a span has sampled call stacks associated with it
    • How to explore heap utilization and garbage collection activity
    • How to view memory allocation stack traces for a particular method

In the next section, we’ll apply a fix to our application to resolve the slow startup performance.


Debug Problems in Microservices

  • Tagging Workshop

    This workshop shows how tags can be used to reduce the time required for SREs to isolate issues across services, so they know which team to engage to troubleshoot the issue further, and can provide context to help engineering get a head start on debugging.

  • Profiling Workshop

    This workshop shows how Database Query Performance and AlwaysOn Profiling can be used to reduce the time required for engineers to debug problems in microservices.


Subsections of Debug Problems in Microservices

Tagging Workshop

2 minutes   Author Derek Mitchell

Splunk Observability Cloud includes powerful features that dramatically reduce the time required for SREs to isolate issues across services, so they know which team to engage to troubleshoot the issue further, and can provide context to help engineering get a head start on debugging.

Unlocking these features requires tags to be included with your application traces. But how do you know which tags are the most valuable and how do you capture them?

In this workshop, we’ll explore:

  • What are tags and why are they such a critical part of making an application observable.
  • How to use OpenTelemetry to capture tags of interest from your application.
  • How to index tags in Splunk Observability Cloud and the differences between Troubleshooting MetricSets and Monitoring MetricSets.
  • How to utilize tags in Splunk Observability Cloud to find “unknown unknowns” using the Tag Spotlight and Dynamic Service Map features.
  • How to utilize tags for alerting and dashboards.

The workshop uses a simple microservices-based application to illustrate these concepts. Let’s get started!

Tip

The easiest way to navigate through this workshop is by using:

  • the left/right arrows (< | >) on the top right of this page
  • the left (◀️) and right (▶️) cursor keys on your keyboard

Subsections of Tagging Workshop

Build the Sample Application

10 minutes  

Introduction

For this workshop, we’ll be using a microservices-based application. This application is for an online retailer and normally includes more than a dozen services. However, to keep the workshop simple, we’ll be focusing on two services used by the retailer as part of their payment processing workflow: the credit check service and the credit processor service.

Pre-requisites

You will start with an EC2 instance and perform some initial steps in order to get to the following state:

  • Deploy the Splunk distribution of the OpenTelemetry Collector
  • Build and deploy creditcheckservice and creditprocessorservice
  • Deploy a load generator to send traffic to the services

Initial Steps

The initial setup can be completed by executing the following steps on the command line of your EC2 instance:

cd workshop/tagging
./0-deploy-collector-with-services.sh
Java

There are implementations in multiple languages available for creditcheckservice. Run

./0-deploy-collector-with-services.sh java

to pick Java over Python.

View your application in Splunk Observability Cloud

Now that the setup is complete, let’s confirm that it’s sending data to Splunk Observability Cloud. Note that when the application is deployed for the first time, it may take a few minutes for the data to appear.

Navigate to APM, then use the Environment dropdown to select your environment (i.e. tagging-workshop-instancename).

If everything was deployed correctly, you should see creditprocessorservice and creditcheckservice displayed in the list of services:

APM Overview APM Overview

Click on Explore on the right-hand side to view the service map. We can see that the creditcheckservice makes calls to the creditprocessorservice, with an average response time of at least 3 seconds:

Service Map Service Map

Next, click on Traces on the right-hand side to see the traces captured for this application. You’ll see that some traces run relatively fast (i.e. just a few milliseconds), whereas others take a few seconds.

Traces Traces

If you toggle Errors only to on, you’ll also notice that some traces have errors:

Traces Traces

Toggle Errors only back to off and sort the traces by duration, then click on one of the longer running traces. In this example, the trace took five seconds, and we can see that most of the time was spent calling the /runCreditCheck operation, which is part of the creditprocessorservice.

Long Running Trace Long Running Trace

Currently, we don’t have enough details in our traces to understand why some requests finish in a few milliseconds, and others take several seconds. To provide the best possible customer experience, this will be critical for us to understand.

We also don’t have enough information to understand why some requests result in errors, and others don’t. For example, if we look at one of the error traces, we can see that the error occurs when the creditprocessorservice attempts to call another service named otherservice. But why do some requests result in a call to otherservice, and others don’t?

Trace with Errors Trace with Errors

We’ll explore these questions and more in the workshop.


What are Tags?

3 minutes  

To understand why some requests have errors or slow performance, we’ll need to add context to our traces. We’ll do this by adding tags. But first, let’s take a moment to discuss what tags are, and why they’re so important for observability.

What are tags?

Tags are key-value pairs that provide additional metadata about spans in a trace, allowing you to enrich the context of the spans you send to Splunk APM.

For example, a payment processing application would find it helpful to track:

  • The payment type used (e.g. credit card, gift card, etc.)
  • The ID of the customer that requested the payment

This way, if errors or performance issues occur while processing the payment, we have the context we need for troubleshooting.

While some tags can be added with the OpenTelemetry collector, the ones we’ll be working with in this workshop are more granular, and are added by application developers using the OpenTelemetry API.
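
As a preview of what this looks like in Java, here is a minimal sketch tied to the payment example above; the class, method, and attribute names (payment.type, customer.id) are hypothetical, and we will capture real tags against creditcheckservice later in the workshop:

import io.opentelemetry.api.trace.Span;

public class PaymentHandler {

    // Hypothetical handler: attach business context to the span that the
    // auto-instrumentation has already created for this request.
    public void processPayment(String paymentType, String customerId) {
        Span currentSpan = Span.current();
        currentSpan.setAttribute("payment.type", paymentType); // e.g. "credit-card"
        currentSpan.setAttribute("customer.id", customerId);
        // ... payment processing logic ...
    }
}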

Attributes vs. Tags

A note about terminology before we proceed. While this workshop is about tags, and this is the terminology we use in Splunk Observability Cloud, OpenTelemetry uses the term attributes instead. So when you see tags mentioned throughout this workshop, you can treat them as synonymous with attributes.

Why are tags so important?

Tags are essential for an application to be truly observable. As we saw with our credit check service, some users are having a great experience: fast with no errors. But other users get a slow experience or encounter errors.

Tags add the context to the traces to help us understand why some users get a great experience and others don’t. And powerful features in Splunk Observability Cloud utilize tags to help you jump quickly to root cause.

Sneak Peek: Tag Spotlight

Tag Spotlight uses tags to discover trends that contribute to high latency or error rates:

Tag Spotlight Preview Tag Spotlight Preview

The screenshot above provides an example of Tag Spotlight from another application.

Splunk has analyzed all of the tags included as part of traces that involve the payment service.

It tells us very quickly whether some tag values have more errors than others.

If we look at the version tag, we can see that version 350.10 of the service has a 100% error rate, whereas version 350.9 of the service has no errors at all:

Tag Spotlight Preview Tag Spotlight Preview

We’ll be using Tag Spotlight with the credit check service later on in the workshop, once we’ve captured some tags of our own.


Capture Tags with OpenTelemetry

15 minutes  

Please proceed to one of the subsections for Java or Python. Ask your instructor for the one used during the workshop!


Subsections of Capture Tags with OpenTelemetry

1. Capture Tags - Java

15 minutes  

Let’s add some tags to our traces, so we can find out why some customers receive a poor experience from our application.

Identify Useful Tags

We’ll start by reviewing the code for the creditCheck function of creditcheckservice (which can be found in the file /home/splunk/workshop/tagging/creditcheckservice-java/src/main/java/com/example/creditcheckservice/CreditCheckController.java):

@GetMapping("/check")
public ResponseEntity<String> creditCheck(@RequestParam("customernum") String customerNum) {
    // Get Credit Score
    int creditScore;
    try {
        String creditScoreUrl = "http://creditprocessorservice:8899/getScore?customernum=" + customerNum;
        creditScore = Integer.parseInt(restTemplate.getForObject(creditScoreUrl, String.class));
    } catch (HttpClientErrorException e) {
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body("Error getting credit score");
    }

    String creditScoreCategory = getCreditCategoryFromScore(creditScore);

    // Run Credit Check
    String creditCheckUrl = "http://creditprocessorservice:8899/runCreditCheck?customernum=" + customerNum + "&score=" + creditScore;
    String checkResult;
    try {
        checkResult = restTemplate.getForObject(creditCheckUrl, String.class);
    } catch (HttpClientErrorException e) {
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body("Error running credit check");
    }

    return ResponseEntity.ok(checkResult);
}

We can see that this function accepts a customer number as an input. This would be helpful to capture as part of a trace. What else would be helpful?

Well, the credit score returned for this customer by the creditprocessorservice may be interesting (we want to ensure we don’t capture any PII data though). It would also be helpful to capture the credit score category, and the credit check result.

Great, we’ve identified four tags to capture from this service that could help with our investigation. But how do we capture these?

Capture Tags

We start by adding OpenTelemetry imports to the top of the CreditCheckController.java file:

...
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind; // needed for @WithSpan(kind=SpanKind.SERVER) below
import io.opentelemetry.instrumentation.annotations.WithSpan;
import io.opentelemetry.instrumentation.annotations.SpanAttribute;

Next, we use the @WithSpan annotation to produce a span for creditCheck:

@GetMapping("/check")
@WithSpan // ADDED
public ResponseEntity<String> creditCheck(@RequestParam("customernum") String customerNum) {
    ...

We can now get a reference to the current span and add an attribute (aka tag) to it:

...
try {
    String creditScoreUrl = "http://creditprocessorservice:8899/getScore?customernum=" + customerNum;
    creditScore = Integer.parseInt(restTemplate.getForObject(creditScoreUrl, String.class));
} catch (HttpClientErrorException e) {
    return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body("Error getting credit score");
}
Span currentSpan = Span.current(); // ADDED
currentSpan.setAttribute("credit.score", creditScore); // ADDED
...

That was pretty easy, right? Let’s capture some more, with the final result looking like this:

@GetMapping("/check")
@WithSpan(kind=SpanKind.SERVER)
public ResponseEntity<String> creditCheck(@RequestParam("customernum")
                                          @SpanAttribute("customer.num")
                                          String customerNum) {
    // Get Credit Score
    int creditScore;
    try {
        String creditScoreUrl = "http://creditprocessorservice:8899/getScore?customernum=" + customerNum;
        creditScore = Integer.parseInt(restTemplate.getForObject(creditScoreUrl, String.class));
    } catch (HttpClientErrorException e) {
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body("Error getting credit score");
    }
    Span currentSpan = Span.current();
    currentSpan.setAttribute("credit.score", creditScore);

    String creditScoreCategory = getCreditCategoryFromScore(creditScore);
    currentSpan.setAttribute("credit.score.category", creditScoreCategory);

    // Run Credit Check
    String creditCheckUrl = "http://creditprocessorservice:8899/runCreditCheck?customernum=" + customerNum + "&score=" + creditScore;
    String checkResult;
    try {
        checkResult = restTemplate.getForObject(creditCheckUrl, String.class);
    } catch (HttpClientErrorException e) {
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body("Error running credit check");
    }
    currentSpan.setAttribute("credit.check.result", checkResult);

    return ResponseEntity.ok(checkResult);
}

Redeploy Service

Once these changes are made, let’s run the following script to rebuild the Docker image used for creditcheckservice and redeploy it to our Kubernetes cluster:

./5-redeploy-creditcheckservice.sh java

Confirm Tag is Captured Successfully

After a few minutes, return to Splunk Observability Cloud and load one of the latest traces to confirm that the tags were captured successfully (hint: sort by the timestamp to find the latest traces):

Trace with Attributes Trace with Attributes

Well done, you’ve leveled up your OpenTelemetry game and have added context to traces using tags.

Next, we’re ready to see how you can use these tags with Splunk Observability Cloud!


2. Capture Tags - Python

15 minutes  

Let’s add some tags to our traces, so we can find out why some customers receive a poor experience from our application.

Identify Useful Tags

We’ll start by reviewing the code for the credit_check function of creditcheckservice (which can be found in the /home/splunk/workshop/tagging/creditcheckservice-py/main.py file):

@app.route('/check')
def credit_check():
    customerNum = request.args.get('customernum')

    # Get Credit Score
    creditScoreReq = requests.get("http://creditprocessorservice:8899/getScore?customernum=" + customerNum)
    creditScoreReq.raise_for_status()
    creditScore = int(creditScoreReq.text)

    creditScoreCategory = getCreditCategoryFromScore(creditScore)

    # Run Credit Check
    creditCheckReq = requests.get("http://creditprocessorservice:8899/runCreditCheck?customernum=" + str(customerNum) + "&score=" + str(creditScore))
    creditCheckReq.raise_for_status()
    checkResult = str(creditCheckReq.text)

    return checkResult

We can see that this function accepts a customer number as an input. This would be helpful to capture as part of a trace. What else would be helpful?

Well, the credit score returned for this customer by the creditprocessorservice may be interesting (we want to ensure we don’t capture any PII data though). It would also be helpful to capture the credit score category, and the credit check result.

Great, we’ve identified four tags to capture from this service that could help with our investigation. But how do we capture these?

Capture Tags

We start by importing the trace module, adding an import statement to the top of the creditcheckservice-py/main.py file:

import requests
from flask import Flask, request
from waitress import serve
from opentelemetry import trace  # <--- ADDED BY WORKSHOP
...

Next, we need to get a reference to the current span so we can add an attribute (aka tag) to it:

def credit_check():
    current_span = trace.get_current_span()  # <--- ADDED BY WORKSHOP
    customerNum = request.args.get('customernum')
    current_span.set_attribute("customer.num", customerNum)  # <--- ADDED BY WORKSHOP
...

That was pretty easy, right? Let’s capture some more, with the final result looking like this:

def credit_check():
    current_span = trace.get_current_span()  # <--- ADDED BY WORKSHOP
    customerNum = request.args.get('customernum')
    current_span.set_attribute("customer.num", customerNum)  # <--- ADDED BY WORKSHOP

    # Get Credit Score
    creditScoreReq = requests.get("http://creditprocessorservice:8899/getScore?customernum=" + customerNum)
    creditScoreReq.raise_for_status()
    creditScore = int(creditScoreReq.text)
    current_span.set_attribute("credit.score", creditScore)  # <--- ADDED BY WORKSHOP

    creditScoreCategory = getCreditCategoryFromScore(creditScore)
    current_span.set_attribute("credit.score.category", creditScoreCategory)  # <--- ADDED BY WORKSHOP

    # Run Credit Check
    creditCheckReq = requests.get("http://creditprocessorservice:8899/runCreditCheck?customernum=" + str(customerNum) + "&score=" + str(creditScore))
    creditCheckReq.raise_for_status()
    checkResult = str(creditCheckReq.text)
    current_span.set_attribute("credit.check.result", checkResult)  # <--- ADDED BY WORKSHOP

    return checkResult

Redeploy Service

Once these changes are made, let’s run the following script to rebuild the Docker image used for creditcheckservice and redeploy it to our Kubernetes cluster:

./5-redeploy-creditcheckservice.sh

Confirm Tag is Captured Successfully

After a few minutes, return to Splunk Observability Cloud and load one of the latest traces to confirm that the tags were captured successfully (hint: sort by the timestamp to find the latest traces):

Trace with Attributes Trace with Attributes

Well done, you’ve leveled up your OpenTelemetry game and have added context to traces using tags.

Next, we’re ready to see how you can use these tags with Splunk Observability Cloud!


Explore Trace Data

5 minutes  

Now that we’ve captured several tags from our application, let’s explore some of the trace data we’ve captured that include this additional context, and see if we can identify what’s causing a poor user experience in some cases.

Use Trace Analyzer

Navigate to APM, then select Traces. This takes us to the Trace Analyzer, where we can add filters to search for traces of interest. For example, we can filter on traces where the credit score starts with 7:

Credit Check Starts with Seven Credit Check Starts with Seven

If you load one of these traces, you’ll see that the credit score indeed starts with seven.

We can apply similar filters for the customer number, credit score category, and credit score result.

Explore Traces With Errors

Let’s remove the credit score filter and toggle Errors only to on, which results in a list of only those traces where an error occurred:

Traces with Errors Only Traces with Errors Only

Click on a few of these traces, and look at the tags we captured. Do you notice any patterns?

Next, toggle Errors only to off, and sort traces by duration. Look at a few of the slowest running traces, and compare them to the fastest running traces. Do you notice any patterns?

If you found a pattern that explains the slow performance and errors - great job! But keep in mind that this is a difficult way to troubleshoot, as it requires you to look through many traces and mentally keep track of what you saw, so you can identify a pattern.

Thankfully, Splunk Observability Cloud provides a more efficient way to do this, which we’ll explore next.


Index Tags

5 minutes  

Index Tags

To use advanced features in Splunk Observability Cloud such as Tag Spotlight, we’ll need to first index one or more tags.

To do this, navigate to Settings -> APM MetricSets. Then click the + New MetricSet button.

Let’s index the credit.score.category tag by entering the following details (note: since everyone in the workshop is using the same organization, the instructor will do this step on your behalf):

Create Troubleshooting MetricSet Create Troubleshooting MetricSet

Click Start Analysis to proceed.

The tag will appear in the list of Pending MetricSets while analysis is performed.

Pending MetricSets Pending MetricSets

Once analysis is complete, click on the checkmark in the Actions column.

MetricSet Configuration Applied MetricSet Configuration Applied

How to choose tags for indexing

Why did we choose to index the credit.score.category tag and not the others?

To understand this, let’s review the primary use cases for tags:

  • Filtering
  • Grouping

Filtering

With the filtering use case, we can use the Trace Analyzer capability of Splunk Observability Cloud to filter on traces that match a particular tag value.

We saw an example of this earlier, when we filtered on traces where the credit score started with 7.

Or if a customer calls in to complain about slow service, we could use Trace Analyzer to locate all traces with that particular customer number.

Tags used for filtering use cases are generally high-cardinality, meaning that there could be thousands or even hundreds of thousands of unique values. In fact, Splunk Observability Cloud can handle an effectively infinite number of unique tag values! Filtering using these tags allows us to rapidly locate the traces of interest.

Note that we aren’t required to index tags to use them for filtering with Trace Analyzer.

Grouping

With the grouping use case, we can use Trace Analyzer to group traces by a particular tag.

But we can also go beyond this and surface trends for tags that we collect using the powerful Tag Spotlight feature in Splunk Observability Cloud, which we’ll see in action shortly.

Tags used for grouping use cases should be low to medium-cardinality, with hundreds of unique values.

For custom tags to be used with Tag Spotlight, they first need to be indexed.

We decided to index the credit.score.category tag because it has a few distinct values that would be useful for grouping. In contrast, the customer number and credit score tags have hundreds or thousands of unique values, and are more valuable for filtering use cases rather than grouping.

Troubleshooting vs. Monitoring MetricSets

You may have noticed that, to index this tag, we created something called a Troubleshooting MetricSet. It’s named this way because a Troubleshooting MetricSet, or TMS, allows us to troubleshoot issues with this tag using features such as Tag Spotlight.

You may have also noticed that there’s another option which we didn’t choose called a Monitoring MetricSet (or MMS). Monitoring MetricSets go beyond troubleshooting and allow us to use tags for alerting and dashboards. We’ll explore this concept later in the workshop.


Use Tags for Troubleshooting

5 minutes  

Using Tag Spotlight

Now that we’ve indexed the credit.score.category tag, we can use it with Tag Spotlight to troubleshoot our application.

Navigate to APM then click on Tag Spotlight on the right-hand side. Ensure the creditcheckservice service is selected from the Service drop-down (if not already selected).

With Tag Spotlight, we can see that 100% of credit score requests that result in a score of impossible have an error, yet requests for all other credit score types have no errors at all!

Tag Spotlight with Errors Tag Spotlight with Errors

This illustrates the power of Tag Spotlight! Finding this pattern would be time-consuming without it, as we’d have to manually look through hundreds of traces to identify the pattern (and even then, there’s no guarantee we’d find it).

We’ve looked at errors, but what about latency? Let’s click on Latency near the top of the screen to find out.

Here, we can see that requests with a poor credit score are running slowly, with P50, P90, and P99 times of around 3 seconds, which is too long for our users to wait, and much slower than other requests.

We can also see that some requests with an exceptional credit score are running slowly, with P99 times of around 5 seconds, though the P50 response time is relatively quick.

Tag Spotlight with Latency Tag Spotlight with Latency

Using Dynamic Service Maps

Now that we know the credit score category associated with the request can impact performance and error rates, let’s explore another feature that utilizes indexed tags: Dynamic Service Maps.

With Dynamic Service Maps, we can break down a particular service by a tag. For example, let’s click on APM, then click Explore to view the service map.

Click on creditcheckservice. Then, on the right-hand menu, click on the drop-down that says Breakdown, and select the credit.score.category tag.

At this point, the service map is updated dynamically, and we can see the performance of requests hitting creditcheckservice broken down by the credit score category:

Service Map Breakdown Service Map Breakdown

This view makes it clear that performance for good and fair credit scores is excellent, while poor and exceptional scores are much slower, and impossible scores result in errors.

Summary

Tag Spotlight has uncovered several interesting patterns for the engineers that own this service to explore further:

  • Why are all the impossible credit score requests resulting in error?
  • Why are all the poor credit score requests running slowly?
  • Why do some of the exceptional requests run slowly?

As an SRE, passing this context to the engineering team would be extremely helpful for their investigation, as it would allow them to track down the issue much more quickly than if we simply told them that the service was “sometimes slow”.

If you’re curious, have a look at the source code for the creditprocessorservice. You’ll see that requests with impossible, poor, and exceptional credit scores are handled differently, thus resulting in the differences in error rates and latency that we uncovered.

The behavior we saw with our application is typical for modern cloud-native applications, where different inputs passed to a service lead to different code paths, some of which result in slower performance or errors. For example, in a real credit check service, requests resulting in low credit scores may be sent to another downstream service to further evaluate risk, and may perform more slowly than requests resulting in higher scores, or encounter higher error rates.
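
To make that concrete, here is a hedged sketch of how such category-dependent code paths might look. This is not the actual creditprocessorservice source; it simply reproduces the split that Tag Spotlight surfaced, with impossible always failing, poor always slow, and exceptional slow only at the tail:

// Illustrative sketch only -- not the workshop's actual source code.
public class CreditCheckPaths {

    static String runCreditCheck(String category) throws InterruptedException {
        switch (category) {
            case "impossible":
                // Always fails: shows up as a 100% error rate for this tag value.
                throw new IllegalStateException("cannot process impossible score");
            case "poor":
                Thread.sleep(3000); // Always slow: ~3 s at P50, P90 and P99.
                return "DECLINED";
            case "exceptional":
                if (Math.random() < 0.05) {
                    Thread.sleep(5000); // Occasionally slow: inflates P99 only.
                }
                return "APPROVED";
            default:
                return "APPROVED"; // Fast path for good and fair scores.
        }
    }
}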


Use Tags for Monitoring

15 minutes  

Earlier, we created a Troubleshooting Metric Set on the credit.score.category tag, which allowed us to use Tag Spotlight with that tag and identify a pattern to explain why some users received a poor experience.

In this section of the workshop, we’ll explore a related concept: Monitoring MetricSets.

What are Monitoring MetricSets?

Monitoring MetricSets go beyond troubleshooting and allow us to use tags for alerting, dashboards and SLOs.

Create a Monitoring MetricSet

(note: your workshop instructor will do the following for you, but observe the steps)

Let’s navigate to Settings -> APM MetricSets, and click the edit button (i.e. the little pencil) beside the MetricSet for credit.score.category.

edit APM MetricSet edit APM MetricSet

Check the box beside Also create Monitoring MetricSet then click Start Analysis

Monitoring MetricSet Monitoring MetricSet

The credit.score.category tag appears again as a Pending MetricSet. After a few moments, a checkmark should appear. Click this checkmark to enable the Pending MetricSet.

pending APM MetricSet pending APM MetricSet

Using Monitoring MetricSets

This mechanism creates a new dimension from the tag on a number of APM metrics, which can then be used to filter those metrics based on the values of the dimension. Important: to differentiate between the original tag and the copy, the dots in the tag name are replaced by underscores for the new dimension. So the metrics gain a dimension named credit_score_category, not credit.score.category.

Next, let’s explore how we can use this Monitoring MetricSet.


Subsections of Use Tags for Monitoring

Use Tags with Dashboards

5 minutes  

Dashboards

Navigate to Metric Finder, then type in the name of the tag, which is credit_score_category (remember that the dots in the tag name were replaced by underscores when the Monitoring MetricSet was created). You’ll see that multiple metrics include this tag as a dimension:

Metric Finder Metric Finder

By default, Splunk Observability Cloud calculates several metrics using the trace data it receives. See Learn about MetricSets in APM for more details.

By creating an MMS, credit_score_category was added as a dimension to these metrics, which means that this dimension can now be used for alerting and dashboards.

To see how, let’s click on the metric named service.request.duration.ns.p99, which brings up the following chart:

Service Request Duration Service Request Duration

Add filters for sf_environment, sf_service, and sf_dimensionalized. Then set the Extrapolation policy to Last value and the Display units to Nanosecond:

Chart with Seconds Chart with Seconds

With these settings, the chart allows us to visualize the service request duration by credit score category:

Duration by Credit Score Duration by Credit Score

Now we can see the duration by credit score category. In my example, the red line represents the exceptional category, and we can see that the duration for these requests sometimes goes all the way up to 5 seconds.

The orange represents the very good category, and has very fast response times.

The green line represents the poor category, and has response times between 2-3 seconds.

It may be useful to save this chart on a dashboard for future reference. To do this, click on the Save as… button and provide a name for the chart:

Save Chart As Save Chart As

When asked which dashboard to save the chart to, let’s create a new one named Credit Check Service - Your Name (substituting your actual name):

Save Chart As Save Chart As

Now we can see the chart on our dashboard, and can add more charts as needed to monitor our credit check service:

Credit Check Service Dashboard Credit Check Service Dashboard


Use Tags with Alerting

3 minutes  

Alerts

It’s great that we have a dashboard to monitor the response times of the credit check service by credit score, but we don’t want to stare at a dashboard all day.

Let’s create an alert so we can be notified proactively if customers with exceptional credit scores encounter slow requests.

To create this alert, click on the little bell on the top right-hand corner of the chart, then select New detector from chart:

New Detector From Chart New Detector From Chart

Let’s call the detector Latency by Credit Score Category. Set the environment to your environment name (i.e. tagging-workshop-yourname) then select creditcheckservice as the service. Since we only want to look at performance for customers with exceptional credit scores, add a filter using the credit_score_category dimension and select exceptional:

Create New Detector Create New Detector

As the alert condition, instead of Static threshold we want to select Sudden Change, to make the example more vivid.

Alert Condition: Sudden Change Alert Condition: Sudden Change

We can then set the remainder of the alert details as we normally would. The key thing to remember here is that without capturing a tag with the credit score category and indexing it, we wouldn’t be able to alert at this granular level, but would instead be forced to bucket all customers together, regardless of their importance to the business.

Unless you want to get notified, we do not need to finish this wizard. You can just close the wizard by clicking the X on the top right corner of the wizard pop-up.


Use Tags with Service Level Objectives

10 minutes  

We can now use the Monitoring MetricSet we created together with Service Level Objectives, in a similar way to how we used it with dashboards and detectors/alerts before. For that, we want to be clear about some key concepts:

Key Concepts of Service Level Monitoring

(skip if you know this)

  • Service level indicator (SLI): A quantitative measurement showing some aspect of a service’s health, expressed as a metric or combination of metrics. Examples: Availability SLI: proportion of requests that resulted in a successful response. Performance SLI: proportion of requests that loaded in < 100 ms.
  • Service level objective (SLO): Defines a target for an SLI and a compliance period over which that target must be met. An SLO contains 3 elements: an SLI, a target, and a compliance period. Compliance periods can be calendar (such as monthly) or rolling (such as the past 30 days). Examples: Availability SLI over a calendar period: our service must respond successfully to 95% of requests in a month. Performance SLI over a rolling period: our service must respond to 99% of requests in < 100 ms over a 7-day period.
  • Service level agreement (SLA): A contractual agreement that indicates service levels your users can expect from your organization. If an SLA is not met, there can be financial consequences. Example: a customer service SLA indicates that 90% of support requests received on a normal support day must have a response within 6 hours.
  • Error budget: A measurement of how your SLI performs relative to your SLO over a period of time. Error budget measures the difference between actual and desired performance. It determines how unreliable your service might be during this period and serves as a signal when you need to take corrective action. Example: our service can respond to 1% of requests in > 100 ms over a 7-day period.
  • Burn rate: A unitless measurement of how quickly a service consumes the error budget during the compliance window of the SLO. Burn rate makes the SLO and error budget actionable, showing service owners when a current incident is serious enough to page an on-call responder. Example: for an SLO with a 30-day compliance window, a constant burn rate of 1 means your error budget is used up in exactly 30 days.
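
To make the error budget and burn rate concepts concrete, here is a quick back-of-the-envelope sketch in Python (the numbers are illustrative, not taken from our workshop data):

# Toy error budget / burn rate arithmetic -- all numbers are made up.
target = 0.99                 # SLO: 99% of requests must respond in < 100 ms
window_days = 7               # rolling compliance window
error_budget = 1 - target     # 1% of requests are allowed to be slow

observed_slow_rate = 0.02     # suppose 2% of requests are currently > 100 ms
burn_rate = observed_slow_rate / error_budget   # 2.0: burning budget twice as fast as allowed

days_to_exhaust = window_days / burn_rate       # at this pace the budget is gone in 3.5 days
print(f"burn rate: {burn_rate:.1f}, budget exhausted in {days_to_exhaust:.1f} days")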

Creating a new Service Level Objective

There is an easy-to-follow wizard to create a new Service Level Objective (SLO). In the left navigation, follow the link “Detectors & SLOs”. From there, select the third tab, “SLOs”, and click the blue “Create SLO” button on the right.

Create new SLO Create new SLO

The wizard guides you through some easy steps. If everything in the previous steps worked out, you will have no problems here. ;)

In our case we want to use Service & endpoint as our Metric type instead of Custom metric. We filter the Environment down to the environment that we are using during this workshop (i.e. tagging-workshop-yourname) and select the creditcheckservice from the Service and endpoint list. Our Indicator type for this workshop will be Request latency and not Request success.

Now we can select our Filters. Since we are using Request latency as the Indicator type, which is a metric of the APM service, we can filter on credit.score.category. Feel free to try out what happens when you set the Indicator type to Request success.

Today we are only interested in our exceptional credit scores, so select that as the filter.

Choose Service or Metric for SLO Choose Service or Metric for SLO

In the next step we define the objective we want to reach. For the Request latency type, we define the Target (%), the Latency (ms) and the Compliance Window. Please set these to 99, 100 and Last 7 days. This will give us a good idea of what we are already achieving.

You may already be in for a bit of a shock here. Feel free to play around with the numbers to see how well we are achieving the objective and how much error budget we have left to burn.

Define Objective for SLO Define Objective for SLO

The third step gives us the chance to alert (aka annoy) the people who should be aware of these SLOs so they can initiate countermeasures. These “people” can also be mechanisms like ITSM systems or webhooks that trigger automatic remediation steps.

Activate all categories you want to alert on and add recipients to the different alerts.

Define Alerting for SLO Define Alerting for SLO

The next step is just the naming of this SLO. Have your own naming convention ready for this. In our case we would name it creditcheckservice:score:exceptional:YOURNAME and click the Create button. Alternatively, you can cancel the wizard by clicking anything in the left navigation and confirming to Discard changes.

Name and Save the SLO Name and Save the SLO

And with that we have (nearly) successfully created an SLO, including alerting in case we miss our goals.

Last Modified By Snehangshu Karmakar   Jan 27, 2025

Summary

2 minutes  

In this workshop, we learned the following:

  • What are tags and why are they such a critical part of making an application observable?
  • How to use OpenTelemetry to capture tags of interest from your application.
  • How to index tags in Splunk Observability Cloud and the differences between Troubleshooting MetricSets and Monitoring MetricSets.
  • How to utilize tags in Splunk Observability Cloud to find “unknown unknowns” using the Tag Spotlight and Dynamic Service Map features.
  • How to utilize tags for dashboards, alerting and service level objectives.

Collecting tags aligned with the best practices shared in this workshop will let you get even more value from the data you’re sending to Splunk Observability Cloud. Now that you’ve completed this workshop, you have the knowledge you need to start collecting tags from your own applications!

To get started with capturing tags today, check out how to add tags in various supported languages, and then how to use them to create Troubleshooting MetricSets so they can be analyzed in Tag Spotlight. For more help, feel free to ask a Splunk Expert.

Tip for Workshop Facilitator(s)

Once the workshop is complete, remember to delete the APM MetricSet you created earlier for the credit.score.category tag.

Last Modified By Snehangshu Karmakar   Jan 27, 2025

Profiling Workshop

2 minutes   Author Derek Mitchell

Service Maps and Traces are extremely valuable in determining what service an issue resides in. And related log data helps provide detail on why issues are occurring in that service.

But engineers sometimes need to go even deeper to debug a problem that’s occurring in one of their services.

This is where features such as Splunk’s AlwaysOn Profiling and Database Query Performance come in.

AlwaysOn Profiling continuously collects stack traces so that you can discover which lines in your code are consuming the most CPU and memory.

And Database Query Performance can quickly identify long-running, unoptimized, or heavy queries so you can mitigate any issues they might be causing.

In this workshop, we’ll explore:

  • How to debug an application with several performance issues.
  • How to use Database Query Performance to find slow-running queries that impact application performance.
  • How to enable AlwaysOn Profiling and use it to find the code that consumes the most CPU and memory.
  • How to apply fixes based on findings from Splunk Observability Cloud and verify the result.

The workshop uses a Java-based application called The Door Game hosted in Kubernetes. Let’s get started!

Tip

The easiest way to navigate through this workshop is by using:

  • the left/right arrows (< | >) on the top right of this page
  • the left (◀️) and right (▶️) cursor keys on your keyboard
Author Derek Mitchell   Last Modified By Snehangshu Karmakar   Jan 27, 2025

Subsections of Profiling Workshop

Build the Sample Application

10 minutes  

Introduction

For this workshop, we’ll be using a Java-based application called The Door Game. It will be hosted in Kubernetes.

Pre-requisites

You will start with an EC2 instance and perform some initial steps in order to get to the following state:

  • Deploy the Splunk distribution of the OpenTelemetry Collector
  • Deploy the MySQL database container and populate data
  • Build and deploy the doorgame application container

Initial Steps

The initial setup can be completed by executing the following steps on the command line of your EC2 instance.

You’ll be asked to enter a name for your environment. Please use profiling-workshop-yourname (where yourname is replaced by your actual name).

cd workshop/profiling
./1-deploy-otel-collector.sh
./2-deploy-mysql.sh
./3-deploy-doorgame.sh

Let’s Play The Door Game

Now that the application is deployed, let’s play with it and generate some observability data.

Get the external IP address for your application instance using the following command:

kubectl describe svc doorgame | grep "LoadBalancer Ingress"

The output should look like the following:

LoadBalancer Ingress:     52.23.184.60

You should be able to access The Door Game application by pointing your browser to port 81 of the provided IP address. For example:

http://52.23.184.60:81

You should be met with The Door Game intro screen:

Door Game Welcome Screen Door Game Welcome Screen

Click Let's Play to start the game:

Let’s Play Let’s Play

Did you notice that it took a long time after clicking Let's Play before we could actually start playing the game?

Let’s use Splunk Observability Cloud to determine why the application startup is so slow.

Last Modified By Snehangshu Karmakar   Jan 27, 2025

Troubleshoot Game Startup

10 minutes  

Let’s use Splunk Observability Cloud to determine why the game started so slowly.

View your application in Splunk Observability Cloud

Note: when the application is deployed for the first time, it may take a few minutes for the data to appear.

Navigate to APM, then use the Environment dropdown to select your environment (i.e. profiling-workshop-name).

If everything was deployed correctly, you should see doorgame displayed in the list of services:

APM Overview APM Overview

Click on Explore on the right-hand side to view the service map. We should see the doorgame application on the service map:

Service Map Service Map

Notice how the majority of the time is being spent in the MySQL database. We can get more details by clicking on Database Query Performance on the right-hand side.

Database Query Performance Database Query Performance

This view shows the SQL queries that took the most amount of time. Ensure that the Compare to dropdown is set to None, so we can focus on current performance.

We can see that one query in particular is taking a long time:

select * from doorgamedb.users, doorgamedb.organizations

(do you notice anything unusual about this query?)

Let’s troubleshoot further by clicking on one of the spikes in the latency graph. This brings up a list of example traces that include this slow query:

Traces with Slow Query Traces with Slow Query

Click on one of the traces to see the details:

Trace with Slow Query Trace with Slow Query

In the trace, we can see that the DoorGame.startNew operation took 25.8 seconds, and 17.6 seconds of this was associated with the slow SQL query we found earlier.

What did we accomplish?

To recap what we’ve done so far:

  • We’ve deployed our application and are able to access it successfully.
  • The application is sending traces to Splunk Observability Cloud successfully.
  • We started troubleshooting the slow application startup time, and found a slow SQL query that seems to be the root cause.

To troubleshoot further, it would be helpful to get deeper diagnostic data that tells us what’s happening inside our JVM, from both a memory (i.e. JVM heap) and CPU perspective. We’ll tackle that in the next section of the workshop.

Last Modified By Snehangshu Karmakar   Jan 27, 2025

Enable AlwaysOn Profiling

20 minutes  

Let’s learn how to enable the memory and CPU profilers, verify their operation, and use the results in Splunk Observability Cloud to find out why our application startup is slow.

Update the application configuration

We will need to pass additional configuration arguments to the Splunk OpenTelemetry Java agent in order to enable both profilers. The configuration is documented here in detail, but for now we just need the following settings:

SPLUNK_PROFILER_ENABLED="true"
SPLUNK_PROFILER_MEMORY_ENABLED="true"

Since our application is deployed in Kubernetes, we can update the Kubernetes manifest file to set these environment variables. Open the doorgame/doorgame.yaml file for editing, and ensure the values of the following environment variables are set to “true”:

- name: SPLUNK_PROFILER_ENABLED
  value: "true"
- name: SPLUNK_PROFILER_MEMORY_ENABLED
  value: "true"

Next, let’s redeploy the Door Game application by running the following command:

cd workshop/profiling
kubectl apply -f doorgame/doorgame.yaml

After a few seconds, a new pod will be deployed with the updated application settings.

Confirm operation

To ensure the profiler is enabled, let’s review the application logs with the following commands:

kubectl logs -l app=doorgame --tail=100 | grep JfrActivator

You should see a line in the application log output that shows the profiler is active:

[otel.javaagent 2024-02-05 19:01:12:416 +0000] [main] INFO com.splunk.opentelemetry.profiler.JfrActivator - Profiler is active.

This confirms that the profiler is enabled and sending data to the OpenTelemetry collector deployed in our Kubernetes cluster, which in turn sends profiling data to Splunk Observability Cloud.

Profiling in APM

Visit http://<your IP address>:81 and play a few more rounds of The Door Game.

Then head back to Splunk Observability Cloud, click on APM, and click on the doorgame service at the bottom of the screen.

Click on “Traces” on the right-hand side to load traces for this service. Filter on traces involving the doorgame service and the GET new-game operation (since we’re troubleshooting the game startup sequence):

New Game Traces New Game Traces

Selecting one of these traces brings up the following screen:

Trace with Call Stacks Trace with Call Stacks

You can see that the spans now include “Call Stacks”, which is a result of us enabling CPU and memory profiling earlier.

Click on the span named doorgame: SELECT doorgamedb, then click on CPU stack traces on the right-hand side:

Trace with CPU Call Stacks Trace with CPU Call Stacks

This brings up the CPU call stacks captured by the profiler.

Let’s open the AlwaysOn Profiler to review the CPU stack trace in more detail. We can do this by clicking on the Span link beside View in AlwaysOn Profiler:

Flamegraph and table Flamegraph and table

The AlwaysOn Profiler includes both a table and a flamegraph. Take some time to explore this view by doing some of the following:

  • click a table item and notice the change in flamegraph
  • navigate the flamegraph by clicking on a stack frame to zoom in, and a parent frame to zoom out
  • add a search term like splunk or jetty to highlight some matching stack frames

Let’s have a closer look at the stack trace, starting with the DoorGame.startNew method (since we already know that it’s the slowest part of the request):

com.splunk.profiling.workshop.DoorGame.startNew(DoorGame.java:24)
com.splunk.profiling.workshop.UserData.loadUserData(UserData.java:33)
com.mysql.cj.jdbc.StatementImpl.executeQuery(StatementImpl.java:1168)
com.mysql.cj.NativeSession.execSQL(NativeSession.java:655)
com.mysql.cj.protocol.a.NativeProtocol.sendQueryString(NativeProtocol.java:998)
com.mysql.cj.protocol.a.NativeProtocol.sendQueryPacket(NativeProtocol.java:1065)
com.mysql.cj.protocol.a.NativeProtocol.readAllResults(NativeProtocol.java:1715)
com.mysql.cj.protocol.a.NativeProtocol.read(NativeProtocol.java:1661)
com.mysql.cj.protocol.a.TextResultsetReader.read(TextResultsetReader.java:48)
com.mysql.cj.protocol.a.TextResultsetReader.read(TextResultsetReader.java:87)
com.mysql.cj.protocol.a.NativeProtocol.read(NativeProtocol.java:1648)
com.mysql.cj.protocol.a.ResultsetRowReader.read(ResultsetRowReader.java:42)
com.mysql.cj.protocol.a.ResultsetRowReader.read(ResultsetRowReader.java:75)
com.mysql.cj.protocol.a.MultiPacketReader.readMessage(MultiPacketReader.java:44)
com.mysql.cj.protocol.a.MultiPacketReader.readMessage(MultiPacketReader.java:66)
com.mysql.cj.protocol.a.TimeTrackingPacketReader.readMessage(TimeTrackingPacketReader.java:41)
com.mysql.cj.protocol.a.TimeTrackingPacketReader.readMessage(TimeTrackingPacketReader.java:62)
com.mysql.cj.protocol.a.SimplePacketReader.readMessage(SimplePacketReader.java:45)
com.mysql.cj.protocol.a.SimplePacketReader.readMessage(SimplePacketReader.java:102)
com.mysql.cj.protocol.a.SimplePacketReader.readMessageLocal(SimplePacketReader.java:137)
com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:64)
java.io.FilterInputStream.read(Unknown Source:0)
sun.security.ssl.SSLSocketImpl$AppInputStream.read(Unknown Source:0)
sun.security.ssl.SSLSocketImpl.readApplicationRecord(Unknown Source:0)
sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(Unknown Source:0)
sun.security.ssl.SSLSocketInputRecord.readHeader(Unknown Source:0)
sun.security.ssl.SSLSocketInputRecord.read(Unknown Source:0)
java.net.SocketInputStream.read(Unknown Source:0)
java.net.SocketInputStream.read(Unknown Source:0)
java.lang.ThreadLocal.get(Unknown Source:0)

We can interpret the stack trace as follows:

  • When starting a new Door Game, a call is made to load user data.
  • This results in executing a SQL query to load the user data (which is related to the slow SQL query we saw earlier).
  • We then see calls to read data in from the database.

So, what does this all mean? It means that our application startup is slow since it’s spending time loading user data. In fact, the profiler has told us the exact line of code where this happens:

com.splunk.profiling.workshop.UserData.loadUserData(UserData.java:33)

Let’s open the corresponding source file (./doorgame/src/main/java/com/splunk/profiling/workshop/UserData.java) and look at this code in more detail:

public class UserData {

    static final String DB_URL = "jdbc:mysql://mysql/DoorGameDB";
    static final String USER = "root";
    static final String PASS = System.getenv("MYSQL_ROOT_PASSWORD");
    static final String SELECT_QUERY = "select * FROM DoorGameDB.Users, DoorGameDB.Organizations";

    HashMap<String, User> users;

    public UserData() {
        users = new HashMap<String, User>();
    }

    public void loadUserData() {

        // Load user data from the database and store it in a map
        Connection conn = null;
        Statement stmt = null;
        ResultSet rs = null;

        try{
            conn = DriverManager.getConnection(DB_URL, USER, PASS);
            stmt = conn.createStatement();
            rs = stmt.executeQuery(SELECT_QUERY);
            while (rs.next()) {
                User user = new User(rs.getString("UserId"), rs.getString("FirstName"), rs.getString("LastName"));
                users.put(rs.getString("UserId"), user);
            }

Here we can see the application logic in action. It establishes a connection to the database, then executes the SQL query we saw earlier:

select * FROM DoorGameDB.Users, DoorGameDB.Organizations

It then loops through each of the results and loads each user into the HashMap, which maps each UserId to a User object.

We have a good understanding of why the game startup sequence is so slow, but how do we fix it?

For more clues, let’s have a look at the other part of AlwaysOn Profiling: memory profiling. To do this, click on the Memory tab in AlwaysOn profiling:

Memory Profiling Memory Profiling

At the top of this view, we can see how much heap memory our application is using, the heap memory allocation rate, and garbage collection activity.

We can see that our application is using about 400 MB out of the max 1 GB heap size, which seems excessive for such a simple application. We can also see that some garbage collection occurred, which caused our application to pause (and probably annoyed those waiting to play the Door Game).

At the bottom of the screen, we can see which methods in our Java application code are associated with the most heap memory usage. Click on the first item in the list to show the Memory Allocation Stack Traces associated with the java.util.Arrays.copyOf method specifically:

Memory Allocation Stack Traces Memory Allocation Stack Traces

With help from the profiler, we can see that the loadUserData method not only consumes excessive CPU time, but it also consumes excessive memory when storing the user data in the HashMap collection object.

What did we accomplish?

We’ve come a long way already!

  • We learned how to enable the profiler in the Splunk OpenTelemetry Java instrumentation agent.
  • We learned how to verify in the agent output that the profiler is enabled.
  • We have explored several profiling related workflows in APM:
    • How to navigate to AlwaysOn Profiling from the troubleshooting view
    • How to explore the flamegraph and method call duration table through navigation and filtering
    • How to identify when a span has sampled call stacks associated with it
    • How to explore heap utilization and garbage collection activity
    • How to view memory allocation stack traces for a particular method

In the next section, we’ll apply a fix to our application to resolve the slow startup performance.

Last Modified By Snehangshu Karmakar   Jan 27, 2025

Fix Application Startup Slowness

10 minutes  

In this section, we’ll use what we learned from the profiling data in Splunk Observability Cloud to resolve the slowness we saw when starting our application.

Examining the Source Code

Open the corresponding source file once again (./doorgame/src/main/java/com/splunk/profiling/workshop/UserData.java) and focus on the following code:

public class UserData {

    static final String DB_URL = "jdbc:mysql://mysql/DoorGameDB";
    static final String USER = "root";
    static final String PASS = System.getenv("MYSQL_ROOT_PASSWORD");
    static final String SELECT_QUERY = "select * FROM DoorGameDB.Users, DoorGameDB.Organizations";

    HashMap<String, User> users;

    public UserData() {
        users = new HashMap<String, User>();
    }

    public void loadUserData() {

        // Load user data from the database and store it in a map
        Connection conn = null;
        Statement stmt = null;
        ResultSet rs = null;

        try{
            conn = DriverManager.getConnection(DB_URL, USER, PASS);
            stmt = conn.createStatement();
            rs = stmt.executeQuery(SELECT_QUERY);
            while (rs.next()) {
                User user = new User(rs.getString("UserId"), rs.getString("FirstName"), rs.getString("LastName"));
                users.put(rs.getString("UserId"), user);
            }

After speaking with a database engineer, you discover that the SQL query being executed includes a cartesian join:

select * FROM DoorGameDB.Users, DoorGameDB.Organizations

Cartesian joins are notoriously slow and, in general, shouldn’t be used.

Upon further investigation, you discover that there are 10,000 rows in the user table, and 1,000 rows in the organization table. When we execute a cartesian join using both of these tables, we end up with 10,000 x 1,000 rows being returned, which is 10,000,000 rows!

Furthermore, the query ends up returning duplicate user data, since each record in the user table is repeated for each organization.

So when our code executes this query, it tries to load 10,000,000 user objects into the HashMap, which explains why it takes so long to execute, and why it consumes so much heap memory.
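
A tiny Python sketch (with made-up data) shows why a comma join behaves this way: it is simply a cartesian product, so every user row gets paired with every organization row:

# Illustrative only: a comma join is a cartesian product.
from itertools import product

users = [("u1", "Alice"), ("u2", "Bob")]   # 2 users
orgs = ["org1", "org2", "org3"]            # 3 organizations

rows = list(product(users, orgs))          # every user paired with every org
print(len(rows))                           # 2 x 3 = 6 rows, each user duplicated per org
# Scale this up to 10,000 users x 1,000 organizations and you get 10,000,000 rows.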

Let’s Fix That Bug

After consulting the engineer that originally wrote this code, we determined that the join with the Organizations table was inadvertent.

So when loading the users into the HashMap, we simply need to remove this table from the query.

Open the corresponding source file once again (./doorgame/src/main/java/com/splunk/profiling/workshop/UserData.java) and change the following line of code:

    static final String SELECT_QUERY = "select * FROM DoorGameDB.Users, DoorGameDB.Organizations";

to:

    static final String SELECT_QUERY = "select * FROM DoorGameDB.Users";

Now the method should perform much more quickly, and less memory should be used, as it’s loading the correct number of users into the HashMap (10,000 instead of 10,000,000).

Rebuild and Redeploy Application

Let’s test our changes by using the following commands to re-build and re-deploy the Door Game application:

cd workshop/profiling
./5-redeploy-doorgame.sh

Once the application has been redeployed successfully, visit The Door Game again to confirm that your fix is in place: http://<your IP address>:81

Clicking Let's Play should take us to the game more quickly now (though performance could still be improved):

Choose Door Choose Door

Start the game a few more times, then return to Splunk Observability Cloud to confirm that the latency of the GET new-game operation has decreased.

What did we accomplish?

  • We discovered why our SQL query was so slow.
  • We applied a fix, then rebuilt and redeployed our application.
  • We confirmed that the application starts a new game more quickly.

In the next section, we’ll continue playing the game and fix any remaining performance issues that we find.

Last Modified By Snehangshu Karmakar   Jan 27, 2025

Fix In Game Slowness

10 minutes  

Now that our game startup slowness has been resolved, let’s play several rounds of the Door Game and ensure the rest of the game performs quickly.

As you play the game, do you notice any other slowness? Let’s look at the data in Splunk Observability Cloud to put some numbers on what we’re seeing.

Review Game Performance in Splunk Observability Cloud

Navigate to APM then click on Traces on the right-hand side of the screen. Sort the traces by Duration in descending order:

Slow Traces Slow Traces

We can see that a few of the traces with an operation of GET /game/:uid/picked/:picked/outcome have a duration of just over five seconds. This explains why we’re still seeing some slowness when we play the app (note that the slowness is no longer on the game startup operation, GET /new-game, but rather a different operation used while actually playing the game).

Let’s click on one of the slow traces and take a closer look. Since profiling is still enabled, call stacks have been captured as part of this trace. Click on the child span in the waterfall view, then click CPU Stack Traces:

View Stack on Span View Stack on Span

At the bottom of the call stack, we can see that the thread was busy sleeping:

com.splunk.profiling.workshop.ServiceMain$$Lambda$.handle(Unknown Source:0)
com.splunk.profiling.workshop.ServiceMain.lambda$main$(ServiceMain.java:34)
com.splunk.profiling.workshop.DoorGame.getOutcome(DoorGame.java:41)
com.splunk.profiling.workshop.DoorChecker.isWinner(DoorChecker.java:14)
com.splunk.profiling.workshop.DoorChecker.checkDoorTwo(DoorChecker.java:30)
com.splunk.profiling.workshop.DoorChecker.precheck(DoorChecker.java:36)
com.splunk.profiling.workshop.Util.sleep(Util.java:9)
java.util.concurrent.TimeUnit.sleep(Unknown Source:0)
java.lang.Thread.sleep(Unknown Source:0)
java.lang.Thread.sleep(Native Method:0)

The call stack tells us a story – reading from the bottom up, it lets us describe what is happening inside the service code. A developer, even one unfamiliar with the source code, should be able to look at this call stack to craft a narrative like:

We are getting the outcome of a game. We leverage the DoorChecker to see if something is the winner, but the check for door two somehow issues a precheck() that, for some reason, is deciding to sleep for a long time.

Our workshop application is left intentionally simple – a real-world service might see the thread being sampled during a database call or calling into an un-traced external service. It is also possible that a slow span is executing a complicated business process, in which case maybe none of the stack traces relate to each other at all.

The longer a method or process runs, the greater the chance that we will have call stacks sampled during its execution.

Let’s Fix That Bug

By using the profiling tool, we were able to determine that our application is slow when issuing the DoorChecker.precheck() method from inside DoorChecker.checkDoorTwo(). Let’s open the doorgame/src/main/java/com/splunk/profiling/workshop/DoorChecker.java source file in our editor.

By quickly glancing through the file, we see that there are methods for checking each door, and all of them call precheck(). In a real service, we might be uncomfortable simply removing the precheck() call because there could be unseen/unaccounted side effects.

Down on line 29 we see the following:

    private boolean checkDoorTwo(GameInfo gameInfo) {
        precheck(2);
        return gameInfo.isWinner(2);
    }

    private void precheck(int doorNum) {
        long extra = (int)Math.pow(70, doorNum);
        sleep(300 + extra);
    }

With our developer hat on, we notice that the door number is zero-based, so the first door is 0, the second is 1, and the third is 2 (this is conventional). The extra value is used as extra/additional sleep time, and it is computed as 70^doorNum (Math.pow performs an exponent calculation). That’s odd, because this means:

  • door 0 => 70^0 => 1ms
  • door 1 => 70^1 => 70ms
  • door 2 => 70^2 => 4900ms

We’ve found the root cause of our bug! This also explains why the first two doors were never very slow.

We have a quick chat with our product manager and team lead, and we agree that the precheck() method must stay but that the extra padding isn’t required. Let’s remove the extra variable and make precheck now read like this:

    private void precheck(int doorNum) {
        sleep(300);
    }

Now all doors will have a consistent behavior. Save your work and then rebuild and redeploy the application using the following command:

cd workshop/profiling
./5-redeploy-doorgame.sh

Once the application has been redeployed successfully, visit The Door Game again to confirm that your fix is in place: http://<your IP address>:81

What did we accomplish?

  • We found another performance issue with our application that impacts game play.
  • We used the CPU call stacks included in the trace to understand application behavior.
  • We learned how the call stack can tell us a story and point us to suspect lines of code.
  • We identified the slow code and fixed it to make it faster.
Last Modified By Snehangshu Karmakar   Jan 27, 2025

Summary

3 minutes  

In this workshop, we accomplished the following:

  • We deployed our application and captured traces with Splunk Observability Cloud.
  • We used Database Query Performance to find a slow-running query that impacted the game startup time.
  • We enabled AlwaysOn Profiling and used it to confirm which line of code was causing the increased startup time and memory usage.
  • We found another application performance issue and used AlwaysOn Profiling again to find the problematic line of code.
  • We applied fixes for both of these issues and verified the result using Splunk Observability Cloud.

Enabling AlwaysOn Profiling and utilizing Database Query Performance for your applications will let you get even more value from the data you’re sending to Splunk Observability Cloud.

Now that you’ve completed this workshop, you have the knowledge you need to start collecting deeper diagnostic data from your own applications!

To get started with Database Query Performance today, check out Monitor Database Query Performance.

And to get started with AlwaysOn Profiling, check out Introduction to AlwaysOn Profiling for Splunk APM

For more help, feel free to ask a Splunk Expert.

Last Modified By Snehangshu Karmakar   Jan 27, 2025

Optimize End User Experiences

90 minutes   Author Sarah Ware

How can we use Splunk Observability to get insight into end user experience, and proactively test scenarios to improve that experience?

Sections:

  • Set up basic Synthetic tests to understand availability and performance ASAP
    • Uptime test
    • API test
    • Single page Browser test
  • Explore RUM to understand our real users
  • Write advanced Synthetics tests based on what we’ve learned about our users and what we need them to do
  • Customize dashboard charts to capture our KPIs, show trends, and show data in context of our events
  • Create Detectors to alert on our KPIs
Tip

Keep in mind throughout the workshop: how can I prioritize activities strategically to get the fastest time to value for my end users and for myself/ my developers?

Context

As a reminder, we need frontend performance monitoring to capture everything that goes into our end user experience. If we’re just monitoring the backend, we’re missing all of the other resources that are critical to our users’ success. Read What the Fastly Outage Can Teach Us About Observability for a real world example. Click the image below to zoom in. What goes into the front end What goes into the front end

References

Throughout this workshop, we will see references to resources to help further understand end user experience and how to optimize it. In addition to Splunk Docs for supported features and Lantern for tips and tricks, Google’s web.dev and Mozilla are great resources.

Remember that the specific libraries, platforms, and CDNs you use often also have their own specific resources. For example React, Wordpress, and Cloudflare all have their own tips to improve performance.

Author Sarah Ware   Last Modified By Snehangshu Karmakar   Jan 27, 2025

Subsections of Optimize End User Experiences

Synthetics

Let’s quickly set up some tests in Synthetics to immediately start understanding our end user experience, without waiting for real users to interact with our app.

We can capture not only the performance and availability of our own apps and endpoints, but also those third parties we rely on any time of the day or night.

Tip

If you find that your tests are being bot-blocked, see the docs for tips on how to allow Synthetic testing. If you need to test something that is not accessible externally, see the private location instructions.

Last Modified By Snehangshu Karmakar   Jan 27, 2025

Subsections of Synthetics

Uptime Test

5 minutes  

Introduction

The simplest way to keep an eye on endpoint availability is with an Uptime test. This lightweight test can run internally or externally around the world, as frequently as every minute. Because this is the easiest (and cheapest!) test to set up, and because it is ideal for monitoring the availability of your most critical endpoints and ports, let’s start here.

Pre-requisites

  • Publicly accessible HTTP(S) endpoint(s) to test
  • Access to Splunk Observability Cloud
Last Modified By Snehangshu Karmakar   Jan 27, 2025

Subsections of Uptime Test

Creating a test

  1. Open Synthetics o11y nav with Synthetics icon highlighted

  2. Click the Add new test button on the right side of the screen, then select Uptime and HTTP test. image image

  3. Name your test with your team name (provided by your workshop instructor), your initials, and any other details you’d like to include, like geographic region.

  4. For now let’s test a GET request. Fill in the URL field. You can use one of your own, or one of ours like https://online-boutique-eu.splunko11y.com, https://online-boutique-us.splunko11y.com, or https://www.splunk.com.

  5. Click Try now to validate that the endpoint is accessible from the selected location before saving the test. Try now does not count against your subscription usage, so this is a good practice to make sure you’re not wasting real test runs on a misconfigured test. image image

    Tip

    A common reason for Try now to fail is that there is a non-2xx response code. If that is expected, add a Validation for the correct response code.

  6. Add any additional validations needed, for example: response code, response header, and response size. Advanced settings for test configuration Advanced settings for test configuration

  7. Add and remove any locations you’d like. Keep in mind where you expect your endpoint to be available.

  8. Change the frequency to test your more critical endpoints more often, up to one minute. image image

  9. Make sure “Round-robin” is on so the test will run from one location at a time, rather than from all locations at once.

    • If an endpoint is highly critical, consider whether it is worth testing from all locations at the same time every single minute. If you have automations built in with a webhook from a detector, or if you have strict SLAs you need to track, it could be worth having as much coverage as possible. But if you are doing more manual investigation, or if this is a less critical endpoint, you could be wasting test runs that execute while an issue is being investigated.
    • Remember that your license is based on the number of test runs per month. Turning Round-robin off multiplies the number of test runs by the number of locations you have selected (see the sketch after this list).
  10. When you are ready for the test to start running, make sure “Active” is on, then scroll down and click Submit to save the test configuration. image image
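
To put some rough numbers on the round-robin trade-off from step 9, here is a quick sketch in Python (the frequency and location count are made-up examples, not workshop settings):

# Toy estimate of Synthetics test runs per month -- numbers are illustrative.
minutes_per_month = 30 * 24 * 60      # ~43,200 minutes in a 30-day month
frequency_minutes = 5                 # test runs every 5 minutes
locations = 4                         # number of selected locations

runs_round_robin = minutes_per_month // frequency_minutes   # one location per run: 8,640
runs_all_locations = runs_round_robin * locations           # round-robin off: 34,560
print(runs_round_robin, runs_all_locations)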

Now the test will start running with your saved configuration. Take a water break, then we’ll look at the results!

Last Modified By Snehangshu Karmakar   Jan 27, 2025

Understanding results

  1. Click into a test summary view and play with the Performance KPIs chart filters to see how you can slice and dice your data. This is a good place to get started understanding trends. Later, we will see what custom charts look like, so you can tailor dashboards to the KPIs you care about most. KPI chart filters KPI chart filters

    Workshop Question: Using the Performance KPIs chart

    What metrics are available? Is your data consistent across time and locations? Do certain locations run slower than others? Are there any spikes or failures?

  2. Click into a recent run either in the chart or in the table below. image image

  3. If there are failures, look at the response to see if you need to add a response code assertion (302 is a common one), if there is some authorization needed, or different request headers added. Here we have information about this particular test run including if it succeeded or failed, the location, timestamp, and duration in addition to the other Uptime test metrics. Click through to see the response, request, and connection info as well. image image If you need to edit the test for it to run successfully, click the test name in the top left breadcrumb on this run result page, then click Edit test on the top right of the test overview page. Remember to scroll down and click Submit to save your changes after editing the test configuration.

  4. In addition to the test running successfully, there are other metrics to measure the health of your endpoints. For example, Time to First Byte (TTFB) is a great indicator of performance, and you can optimize TTFB to improve the end user experience.

  5. Go back to the test overview page and change the Performance KPIs chart to display First Byte time, and change the interval if needed to better see trends in the data. Performance KPIs for Uptime Tests Performance KPIs for Uptime Tests In the example above, we can see that TTFB varies consistently between locations. Knowing this, we can keep location in mind when reporting on metrics. We could also improve the experience, for example by serving users in those locations an endpoint hosted closer to them, which should reduce network latency. We can also see some slight variations in the results over time, but overall we already have a good idea of our baseline for this endpoint’s KPIs. When we have a baseline, we can alert on worsening metrics as well as visualize improvements.
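
If you want a rough, client-side feel for TTFB outside of Synthetics, a sketch like the following can help. It approximates TTFB using the requests library’s elapsed property, which measures time from sending the request until the response headers are parsed; Synthetics measures this far more precisely from each test location:

# Rough TTFB approximation -- not a substitute for Synthetics measurements.
import requests

resp = requests.get("https://online-boutique-us.splunko11y.com", stream=True, timeout=10)
print(f"status: {resp.status_code}, approx TTFB: {resp.elapsed.total_seconds() * 1000:.0f} ms")
resp.close()   # stream=True means the response body was never downloaded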

    Tip

    We are not setting a detector on this test yet, to make sure it is running consistently and successfully. If you are testing a highly critical endpoint and want to be alerted on it ASAP (and have tolerance for potential alert noise), jump to Single Test Detectors.

Once you have your Uptime test running successfully, let’s move on to the next test type.

Last Modified By Snehangshu Karmakar   Jan 27, 2025

API Test

5 minutes  

The API test provides a flexible way to check the functionality and performance of API endpoints. The shift toward API-first development has magnified the necessity to monitor the back-end services that provide your core front-end functionality.

Whether you’re interested in testing multi-step API interactions or you want to gain visibility into the performance of your endpoints, the API Test can help you accomplish your goals.

This exercise will walk through a multi-step test on the Spotify API. You can also use it as a reference to build tests on your own APIs or on those of your critical third parties.

API test result API test result

Last Modified By Snehangshu Karmakar   Jan 27, 2025

Subsections of API Test

Global Variables

Global variables allow us to use stored strings in multiple tests, so we only need to update them in one place.

View the global variable that we’ll use to perform our API test. Click on Global Variables under the cog icon. The global variable named env.encoded_auth is the one we’ll use to build the Spotify API transaction.

placeholder placeholder
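
For context, a value like env.encoded_auth is typically the Base64 encoding of client_id:client_secret, as required by Spotify’s client-credentials flow. A minimal sketch of how such a value is produced (the credentials below are placeholders, not real values):

# Build a Basic auth value like env.encoded_auth -- placeholder credentials.
import base64

client_id, client_secret = "your-client-id", "your-client-secret"
encoded_auth = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
print(encoded_auth)   # this string is what gets stored in the global variable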

Last Modified By Snehangshu Karmakar   Jan 27, 2025

Create new API test

Create a new API test by clicking on the Add new test button and select API test from the dropdown. Name the test using your team name, your initials, and Spotify API e.g. [Daisy] RWC - Spotify API

placeholder placeholder

Last Modified By Snehangshu Karmakar   Jan 27, 2025

Authentication Request

Click on + Add requests and enter the request step name e.g. Authenticate with Spotify API.

placeholder placeholder

Expand the Request section, from the drop-down change the request method to POST and enter the following URL:

https://accounts.spotify.com/api/token

In the Payload body section enter the following:

grant_type=client_credentials

Next, add two + Request headers with the following key/value pairings:

  • CONTENT-TYPE: application/x-www-form-urlencoded
  • AUTHORIZATION: Basic {{env.encoded_auth}}

Expand the Validation section and add the following extraction:

  • Extract from Response body JSON $.access_token as access_token

This will parse the JSON payload that is received from the Spotify API, extract the access token and store it as a custom variable.

Add payload token Add payload token

Last Modified By Snehangshu Karmakar   Jan 27, 2025

Search Request

Click on + Add Request to add the next step. Name the step Search for Tracks named “Up around the bend”.

Expand the Request section and change the request method to GET and enter the following URL:

https://api.spotify.com/v1/search?q=Up%20around%20the%20bend&type=track&offset=0&limit=5

Next, add two request headers with the following key/value pairings:

  • CONTENT-TYPE: application/json
  • AUTHORIZATION: Bearer {{custom.access_token}}
    • This uses the custom variable we created in the previous step!

Add search request Add search request

Expand the Validation section and add the following extraction:

  • Extract from Response body JSON $.tracks.items[0].id as track.id

Add search payload Add search payload

To validate the test before saving, change the location as needed and click Try now. See the docs for more information on the try now feature.

try now try now

When the validation is successful, click on < Return to test to return to the test configuration page. And then click Save to save the API test.

Extra credit

Have more time to work on this test? Take a look at the Response Body in one of your run results. What additional steps would make this test more thorough? Edit the test, and use the Try now feature to validate any changes you make before you save the test.
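
For reference, the transaction we just built maps roughly to the following Python sketch. You would need valid Spotify credentials for encoded_auth; this illustrates the flow, it is not part of the workshop steps:

# Sketch of the two-step Spotify transaction the Synthetic API test performs.
import requests

encoded_auth = "<base64 of client_id:client_secret>"   # see Global Variables

# Step 1: authenticate and extract the access token (like $.access_token)
token_resp = requests.post(
    "https://accounts.spotify.com/api/token",
    data={"grant_type": "client_credentials"},          # form-encoded payload body
    headers={"Authorization": f"Basic {encoded_auth}"},
)
access_token = token_resp.json()["access_token"]

# Step 2: search for tracks with the token (extracting $.tracks.items[0].id)
search_resp = requests.get(
    "https://api.spotify.com/v1/search",
    params={"q": "Up around the bend", "type": "track", "offset": 0, "limit": 5},
    headers={"Authorization": f"Bearer {access_token}"},
)
print(search_resp.json()["tracks"]["items"][0]["id"])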

Last Modified By Snehangshu Karmakar   Jan 27, 2025

View results

Wait for a few minutes for the test to provision and run. Once you see the test has run successfully, click on the run to view the results:

API test result API test result

Resources

Last Modified By Snehangshu Karmakar   Jan 27, 2025

Single Page Browser Test

5 minutes  

We have started testing our endpoints, now let’s test the front end browser experience.

Starting with a single page browser test will let us capture how first- and third-party resources impact how our end users experience our browser-based site. It also allows us to start to understand our user experience metrics before introducing the complexity of multiple steps in one test.

A page where your users commonly “land” is a good choice to start with a single page test. This could be your site homepage, a section main page, or any other high-traffic URL that is important to you and your end users.

  1. Click Add new test and select Browser test

  2. Include your team name and initials in the test name. Add details to the Name and Custom properties to describe the scope of the test (like Desktop for the device type). Then click + Edit steps

  3. Change the transaction label (top left) and step name (on the right) to something readable that describes the step. Add the URL you’d like to test. Your workshop instructor can provide you with a URL as well. In the below example, the transaction is “Home” and the step name is “Go to homepage”.

    Transaction and step label Transaction and step label

  4. To validate the test, change the location as needed and click Try now. See the docs for more information on the try now feature.

  5. Wait for the test validation to complete. If the test validation failed, double check your URL and test location and try again. With Try now you can see what the result of the test will be if it were saved and run as-is.

    Try Now browser test results Try Now browser test results

  6. Click < Return to test to continue the configuration.

  7. Edit the locations you want to use, keeping in mind any regional rules you have for your site.

  8. You can edit the Device and Frequency or leave them at their default values for now. Click Submit to save the test and start running it.

Bonus Exercise

Have a few spare seconds? Copy this test and change just the title and device type, and save. Now you have visibility into the end user experience on another device and connection speed!

While our Synthetic tests are running, let’s see how RUM is instrumented to start getting data from our real users.

Last Modified By Snehangshu Karmakar   Jan 27, 2025

RUM

15 minutes  

With RUM instrumented, we will be able to better understand our end users, what they are doing, and what issues they are encountering.

This workshop walks through how our demo site is instrumented and how to interpret the data. If you already have a RUM license, this will help you understand how RUM works and how you can use it to optimize your end user experience.

Tip

Our Docs also contain guidance such as scenarios using Splunk RUM and demo applications to test out RUM for mobile apps.


Subsections of RUM

Overview

The aim of this Splunk Real User Monitoring (RUM) workshop is to let you:

  • Shop for items on the Online Boutique to create traffic, and create RUM User Sessions1 that you can view in Splunk Observability Cloud.
  • See an overview of the performance of all your application(s) in the Application Summary Dashboard
  • Examine the performance of a specific website with RUM metrics.

In order to reach this goal, we will use an online boutique to order various products. While shopping on the online boutique you will create what is called a User Session.

You may encounter some issues with this web site, and you will use Splunk RUM to identify the issues, so they can be resolved by the developers.

The workshop host will provide you with a URL for an online boutique store that has RUM enabled.

Each of these Online Boutiques is also being visited by a few synthetic users; this allows us to generate more live data to be analyzed later.


  1. A RUM User session is a “recording” of a collection of user interactions on an application, basically collecting a website or app’s performance measured straight from the browser or Mobile App of the end user. To do this a small amount of JavaScript is embedded in each page. This script then collects data from each user as he or she explores the page, and transfers that data back for analysis. ↩︎


RUM instrumentation in a browser app

  • Check the HEAD section of the Online-boutique webpage in your browser
  • Find the code that instruments RUM

1. Browse to the Online Boutique

Open the frontend demo site in your browser (EU or US).

2. Inspecting the HTML source

The changes needed for RUM are placed in the <head> section of the host’s web page. Right click to view the page source or to inspect the code. Below is an example of the <head> section with RUM:

 4<!DOCTYPE html>
 5<html lang="en">
 6
 7<head>
 8    <meta charset="UTF-8">
 9    <meta name="viewport" content="width=device-width, initial-scale=1.0, shrink-to-fit=no">
10    <meta http-equiv="X-UA-Compatible" content="ie=edge">
11    <title>Online Boutique</title>
12    <link href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.1/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-WskhaSGFgHYWDcbwN70/dfYBj47jz9qbsMId/iRN3ewGhXQFZCSftd1LZCfmhktB"
13        crossorigin="anonymous">
14    <link href="https://fonts.googleapis.com/css?family=Roboto:300,400,500,700" rel="stylesheet">
15    <link rel="stylesheet" type="text/css" href="/static/styles/styles.css">
16    <link rel="stylesheet" type="text/css" href="/static/styles/cart.css">
17    <link rel="stylesheet" type="text/css" href="/static/styles/order.css">
18    <link rel='shortcut icon' type='image/x-icon' href='/static/favicon.ico' />
19 
20 
21    <script src="https://cdn.signalfx.com/o11y-gdi-rum/latest/splunk-otel-web.js" crossorigin="anonymous"></script>
22    <script src="https://cdn.signalfx.com/o11y-gdi-rum/latest/splunk-otel-web-session-recorder.js" crossorigin="anonymous"></script>
23    <script>
24        window.SplunkRum && window.SplunkRum.init({ 
25            beaconUrl: 'https://rum-ingest.eu0.signalfx.com/v1/rum',
26            rumAuth: '<redacted>',
27            app: 'online-boutique-eu-store',
28            version: '1',
29            environment: 'online-boutique-eu'
30        });
31        SplunkSessionRecorder.init({
32            beaconUrl: 'https://rum-ingest.eu0.signalfx.com/v1/rumreplay',
33            rumAuth: '<redacted>'
34        });
35    </script>
36    <script>
37        const Provider = SplunkRum.provider; 
38        var tracer=Provider.getTracer('appModuleLoader');
39    </script>
40</head>
41
42<body>

This code enables RUM Tracing, Session Replay, and Custom Events to better understand performance in the context of user workflows:

  • The first part indicates where to download the Splunk OpenTelemetry JavaScript file from: https://cdn.signalfx.com/o11y-gdi-rum/latest/splunk-otel-web.js (this can also be hosted locally if so required).
  • The next section defines where to send the traces, in the beacon URL: beaconUrl: 'https://rum-ingest.eu0.signalfx.com/v1/rum'
  • The RUM Access Token: rumAuth: '<redacted>'.
  • Identification tags app and environment used to identify the application in the Splunk RUM UI, e.g.:
      <script> 
          window.SplunkRum && window.SplunkRum.init({
              beaconUrl: 'https://rum-ingest.eu0.signalfx.com/v1/rum',
              rumAuth: '<redacted>',
              app: 'online-boutique-eu-store',
              version: '1',
              environment: 'online-boutique-eu'
          });
          SplunkSessionRecorder.init({
              beaconUrl: 'https://rum-ingest.eu0.signalfx.com/v1/rumreplay',
              rumAuth: '<redacted>'
          });
      </script>

The above lines 21 and 23-30 are all that is required to enable RUM on your website!

Lines 22 and 31-34 are optional if you want Session Replay instrumented.

Lines 36-39 (var tracer=Provider.getTracer('appModuleLoader');) will add a Custom Event for every page change, allowing you to better track your website conversions and usage. This may or may not be instrumented for this workshop. A sketch of how such a tracer can be used is shown below.
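For illustration, here is a minimal sketch of how code elsewhere in the app could use that tracer to emit a custom span around a user action. The span name and attribute below are hypothetical; startSpan, setAttribute, and end are standard OpenTelemetry JavaScript tracer calls:

// Sketch only: emit a custom span using the tracer created above.
// 'checkout.click' and 'workflow.name' are made-up example names.
const span = tracer.startSpan('checkout.click');
span.setAttribute('workflow.name', 'checkout');
// ... the user action happens here ...
span.end(); // the finished span is sent to Splunk RUM with the page data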

Exercise

Time to shop! Take a minute to open the workshop store URL in as many browsers and devices as you’d like, shop around, add items to cart, checkout, and feel free to close the shopping browsers when you’re finished. Keep in mind this is a lightweight demo shop site, so don’t be alarmed if the cart doesn’t match the item you picked!


RUM Landing Page

  • Visit the RUM landing page and check the overview of the performance of all your RUM-enabled applications with the Application Summary Dashboard (both Mobile and Web based)

1. Visit the RUM Landing Page

Log in to Splunk Observability. From the left side menu bar, select RUM RUM-ico RUM-ico. This will bring you to the RUM Landing Page.

The goal of this page is to give you, in a single view, a clear indication of the health, performance, and potential errors found in your application(s), and to let you dive deeper into the User Session information collected from your web page or app. You will have a pane for each of your active RUM applications. (The view below is the default expanded view.)

RUM-App-sum RUM-App-sum

If you have multiple applications (which will be the case when every attendee is using their own ec2 instance for the RUM workshop), the pane view may be automatically reduced by collapsing the panes as shown below:

RUM-App-sum-collapsed RUM-App-sum-collapsed

You can expand a condensed RUM Application Summary View to the full dashboard by clicking on the small Browser RUM-browser RUM-browser or Mobile RUM-mobile RUM-mobile icon (depending on whether the application is Mobile or Browser based) to the left of the application name, highlighted by the red arrow.

First find the right application to use for the workshop:

If you are participating in a standalone RUM workshop, the workshop leader will tell you the name of the application to use. In the case of a combined workshop, it will follow the naming convention we used for IM and APM, using the ec2 node name as a unique ID, like jmcj-store, shown as the last app in the screenshot above.

2. Configure the RUM Application Summary Dashboard Header Section

The RUM Application Summary Dashboard consists of five major sections. The first is the selection header, where you can set/filter a number of options:

  • A drop down for the Time Window you’re reviewing (You are looking at the past 15 minutes by default)
  • A drop down to select the Environment1 you want to look at. This allows you to focus on just the subset of applications belonging to that environment, or Select all to view all available.
  • A drop down list with the various Apps being monitored. You can use the one provided by the workshop host or select your own. This will focus you on just one application.
  • A drop down to select the Source, Browser or Mobile applications to view. For the Workshop leave All selected.
  • A hamburger menu located at the right of the header allowing you to configure some settings of your Splunk RUM application. (We will visit this in a later section).

RUM-SummaryHeader RUM-SummaryHeader

For the workshop, let’s do a deeper dive into the Application Summary screen in the next section: Check Health Browser Application


A common application deployment pattern is to have multiple, distinct application environments that don’t interact directly with each other but that are all being monitored by Splunk APM or RUM: for instance, quality assurance (QA) and production environments, or multiple distinct deployments in different datacenters, regions or cloud providers.


  1. A deployment environment is a distinct deployment of your system or application that allows you to set up configurations that don’t overlap with configurations in other deployments of the same application. Separate deployment environments are often used for different stages of the development process, such as development, staging, and production. ↩︎


Check Browser Applications health at a glance

  • Get familiar with the UI and options available from this landing page
  • Identify Page Views/JavaScript Errors and Network Requests/Errors in a single view
  • Check the Web Vitals metrics and any Detectors that have fired in relation to your Browser Application

Application Summary Dashboard

1. Header Bar

As seen in the previous section, the RUM Application Summary Dashboard consists of 5 major sections.
The first section is the selection header, where you can collapse the pane via the RUM-browser RUM-browser Browser icon or the > in front of the application name (jmcj-store in the example below). It also provides access to the Application Overview page if you click the link with your application name.

Further, you can also open the Application Overview or App Health Dashboard via the triple dot trippleburger trippleburger menu on the right.

RUM-SummaryHeader RUM-SummaryHeader

For now, let’s look at the high level information we get on the application summary dashboard.

The RUM Application Summary Dashboard is focused on providing you with at a glance highlights of the status of your application.

2. Page Views / JavaScript Errors & Network Requests / Errors

This section’s Page Views / JavaScript Errors and Network Requests / Errors charts show the quantity and trend of these issues in your application. These could be JavaScript errors, or failed network calls to back-end services.

RUM-chart RUM-chart

In the example above you can see that there are no failed network calls in the Network chart, but in the Page View chart you can see that a number of pages do experience some errors. These are often not visible to regular users, but can seriously impact the performance of your website.

You can see the count of the Page Views / Network Requests / Errors by hovering over the charts.

RUM-chart-clicked RUM-chart-clicked

3. JavaScript Errors

This section of the RUM Application Summary Dashboard shows an overview of the JavaScript errors occurring in your application, along with a count of each error.

RUM-javascript RUM-javascript

In the example above you can see there are three JavaScript errors: one appears 29 times in the selected time slot, and the other two each appear 12 times.

If you click on one of the errors, a pop-out opens that shows a summary of the error over time (below), along with a Stack Trace of the JavaScript error, giving you an indication of where the problem occurred. (We will see this in more detail in one of the following sections.)

RUM-javascript-chart RUM-javascript-chart

4. Web Vitals

The next section of the RUM Application Summary Dashboard shows Google’s Core Web Vitals: three metrics that are not only used by Google in its search ranking system, but also quantify end user experience in terms of loading, interactivity, and visual stability.

WEB-vitals WEB-vitals

As you can see, our site is well behaved and scores Good for all three metrics. These metrics can be used to identify the effect of changes to your application, and help you improve the performance of your site.

If you click on any of the metrics shown in the Web Vitals pane you will be taken to the corresponding Tag Spotlight Dashboard. E.g., clicking on the Largest Contentful Paint (LCP) chart will take you to a dashboard similar to the screenshot below, which gives you timeline and table views of how this metric has performed. This should allow you to spot trends and identify where the problem may be more common, such as a particular operating system, geolocation, or browser version.

WEB-vitals-tag WEB-vitals-tag

5. Most Recent Detectors

The final section of the RUM Application Summary Dashboard is focused on providing you an overview of recent detectors that have triggered for your application. We have created a detector for this screenshot, but your pane will be empty for now. We will add some detectors to your site and make sure they are triggered in one of the next sections.

detectors detectors

In the screenshot you can see we have a critical alert for the RUM Aggregated View Detector, along with a count of how often this alert has triggered in the selected time window. If you happen to have an alert listed, you can click on the name of the alert (shown as a blue link) and you will be taken to the Alert Overview page showing the details of the alert. (Note: this will move you away from the current page; please use the Back option of your browser to return to the overview page.)

alert alert


Exercise

Please take a few minutes to experiment with the RUM Application Summary Dashboard and the underlying chart and dashboards before going on to the next section.


Analyzing RUM Metrics

  • See RUM Metrics and Session information in the RUM UI
  • See correlated APM traces in the RUM & APM UI

1. RUM Overview Pages

From your RUM Application Summary Dashboard you can see detailed information by opening the Application Overview page via the triple dot trippleburger trippleburger menu on the right and selecting Open Application Overview, or by clicking the link with your application name, which is jmcj-rum-app in the example below.

RUM-SummaryHeader RUM-SummaryHeader

This will take you to the RUM Application Overview Page screen as shown below.

RUM app overview with UX metrics RUM app overview with UX metrics

2. RUM Browser Overview

2.1. Header

The RUM UI consists of five major sections. The first is the selection header, where you can set/filter a number of options:

  • A drop down for the time window you’re reviewing (You are looking at the past 15 minutes in this case)
  • A drop down to select the Comparison window (You are comparing current performance on a rolling window - in this case compared to 1 hour ago)
  • A drop down with the available Environments to view
  • A drop down list with the various Web apps
  • Optionally a drop down to select Browser or Mobile metrics (Might not be available in your workshop)

RUM-Header RUM-Header

2.2. UX Metrics

By default, RUM prioritizes the metrics that most directly reflect the experience of the end user.

Additional Tags

All of the dashboard charts allow us to compare trends over time, create detectors, and click through to further diagnose issues.

First, we see page load and route change information, which can help us understand if something unexpected is impacting user traffic trends.

Page load and route change charts Page load and route change charts

Next, Google has defined Core Web Vitals to quantify the user experience as measured by loading, interactivity, and visual stability. Splunk RUM builds in Google’s thresholds into the UI, so you can easily see if your metrics are in an acceptable range.

Core Web Vitals charts Core Web Vitals charts

  • Largest Contentful Paint (LCP), measures loading performance. How long does it take for the largest block of content in the viewport to load? To provide a good user experience, LCP should occur within 2.5 seconds of when the page first starts loading.
  • First Input Delay (FID), measures interactivity. How long does it take to be able to interact with the app? To provide a good user experience, pages should have a FID of 100 milliseconds or less.
  • Cumulative Layout Shift (CLS), measures visual stability. How much does the content move around after the initial load? To provide a good user experience, pages should maintain a CLS of 0.1 or less.
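Splunk RUM collects these metrics automatically, but for intuition, here is a small sketch of where one of these numbers comes from in the browser, using the standard PerformanceObserver API (independent of any Splunk instrumentation):

// Sketch: observing Largest Contentful Paint with the native browser API.
// RUM does this for you; this only illustrates the underlying measurement.
new PerformanceObserver((entryList) => {
  const entries = entryList.getEntries();
  const lastEntry = entries[entries.length - 1]; // the latest LCP candidate
  console.log('LCP (ms):', lastEntry.startTime); // good experience: <= 2500 ms
}).observe({ type: 'largest-contentful-paint', buffered: true });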

Improving Web Vitals is a key component to optimizing your end user experience, so being able to quickly understand them and create detectors if they exceed a threshold is critical.

Google has some great resources if you want to learn more, for example the business impact of Core Web Vitals.

2.3. Front-end health

Common causes of frontend issues are JavaScript errors and long tasks, which can especially affect interactivity. Creating detectors on these indicators helps us investigate interactivity issues before our users report them, allowing us to build workarounds or roll back related releases faster if needed. Learn more about optimizing long tasks for better end user experience!

JS error charts JS error charts Long task charts Long task charts

2.4. Back-end health

Common back-end issues affecting user experience are network issues and resource requests. In this example, we clearly see a spike in Time To First Byte that lines up with a resource request spike, so we already have a good starting place to investigate.

Back-end health charts Back-end health charts

  • Time To First Byte (TTFB) measures how long it takes for a client’s browser to receive the first byte of the response from the server. The longer it takes for the server to process the request and send a response, the slower your visitors’ browser is at displaying your page.
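For reference, TTFB for the current page can also be read directly in the browser from the standard Navigation Timing API. RUM collects this automatically; the sketch below only shows where the number comes from:

// Sketch: reading Time To First Byte for the current page load.
const [nav] = performance.getEntriesByType('navigation');
if (nav) {
  // responseStart: ms (relative to navigation start) when the
  // first byte of the server response arrived.
  console.log('TTFB (ms):', nav.responseStart);
}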

Analyzing RUM Tags in the Tag Spotlight view

  • Look into the Metrics views for the various endpoints and use the Tags sent via Tag Spotlight for deeper analysis

1. Find a URL for the Cart endpoint

From the RUM Overview page, please select the URL for the Cart endpoint to dive deeper into the information available for this endpoint.

RUM-Cart2 RUM-Cart2

Once you have selected and clicked on the blue URL, you will find yourself in the Tag Spotlight overview:

RUM-Tag RUM-Tag

Here you will see all of the Tags that have been sent to Splunk RUM as part of the RUM traces. The Tags displayed will be relevant to the overview that you have selected. These are generic Tags created automatically when the trace was sent, plus any additional Tags you have added to the trace as part of the configuration of your website.

Additional Tags

We are already sending two additional Tags; you have seen them defined in the beacon URL that was added to your website in the first section of this workshop: app: "[nodename]-store", environment: "[nodename]-workshop". You can add additional Tags in a similar way, as the sketch below shows.
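As an illustration, assuming the globalAttributes option of the splunk-otel-web initializer, extra Tags could be set at init time (the attribute names and values below are placeholders):

// Sketch, assuming splunk-otel-web's globalAttributes option;
// the attribute names and values are placeholders.
SplunkRum.init({
  beaconUrl: 'https://rum-ingest.eu0.signalfx.com/v1/rum',
  rumAuth: '<redacted>',
  app: '[nodename]-store',
  environment: '[nodename]-workshop',
  globalAttributes: {
    'app.release': '2025.01',
    'customer.tier': 'workshop'
  }
});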

In our example we have selected the Document Load Latency view as shown here:

RUM-Header RUM-Header

You can select any of the following Tag views, each focused on a specific metric.

RUM-views RUM-views


2. Explore the information in the Tag Spotlight view

Tag Spotlight is designed to help you identify problems, either through the chart view, where you may quickly identify outliers, or via the Tags.

In the Document Load Latency view, if you look at the Browser, Browser Version & OS Name Tag views, you can see the various browser types and versions, as well as the underlying OS.

This makes it easy to identify problems related to specific browser or OS versions, as they would be highlighted.

RUM-Tag2 RUM-Tag2

In the above example you can see that Firefox had the slowest response, that various Chrome versions have different response times, and that the Android devices responded slowly.

A further example is the regional Tags that you can use to identify problems related to ISPs, locations, etc. Here you should be able to find the location you have been using to access the Online Boutique. Drill down by selecting the town or country you are accessing the Online Boutique from, clicking on the name as shown below (City of Amsterdam):

RUM-click RUM-click

This will select only the traces relevant to the city selected as shown below:

RUM-Adam RUM-Adam

By selecting various Tags you build up a filter; you can see the current selection below.

RUM-Adam RUM-Adam

To clear the filter and see every trace click on Clear All at the top right of the page.

If the overview page is empty or shows RUM-Adam RUM-Adam, no traces have been received in the selected time slot. You need to increase the time window at the top left; you can start with the Last 12 hours, for example.

You can then use your mouse to select the time slot you want, as shown in the view below, and activate that time filter by clicking on the small spyglass icon.

RUM-time RUM-time


Analyzing RUM Sessions

  • Dive into RUM Session information in the RUM UI
  • Identify JavaScript errors in the span of a user interaction

1. Again select the cart URL

After you have focused the time slot with the time selector, you need to reselect the cart URL from the URL Name view, as shown below:

RUM-Cart-3 RUM-Cart-3

In the example above we selected http://34.246.124.162:81/cart

2. Drill down in the Sessions

After you have analyzed the information and drilled down via the Tag Spotlight to a subset of the traces, you can view the actual session as it was run by the end-user’s browser.

You do this by clicking on the link User Sessions as shown below:

RUM-Header RUM-Header

This will give you a list of sessions that matched both the time filter and the subset selected via the Tag Spotlight filter.

Select one by clicking on the session ID. It is a good idea to select one with the longest duration (preferably over 500 ms).

RUM-Header RUM-Header

Once you have selected the session, you will be taken to the session details page. As you are selecting a specific action that is part of the session, you will likely arrive somewhere in the middle of the session, at the moment of the interaction.

You can see that the URL https://online-boutique-eu.splunko11y.com/cart, the one you selected earlier, is where we are focusing in the session stream.

RUM-Session-Tag RUM-Session-Tag

Scroll down a little on the page so you see the end of the operation, as shown below.

RUM-Session-info RUM-Session-info

You can see that we have received a few JavaScript Console errors that may not have been detected or visible to the end users. To examine these in more detail, click on the middle one that says: Cannot read properties of undefined (reading 'Prcie')

RUM-Session-info RUM-Session-info

This will cause the page to expand and show the Span detail for this interaction. It will contain a detailed error.stack that you can pass on to the developer to solve the issue. You may have noticed when buying in the Online Boutique that the final total was always $0.00.

RUM-Session-info RUM-Session-info


Advanced Synthetics

30 minutes  

Introduction

This workshop walks you through using the Chrome DevTools Recorder to create a synthetic test on a Splunk demonstration environment or on your own public website.

The exported JSON from the Chrome DevTools Recorder will then be used to create a Splunk Synthetic Monitoring Real Browser Test.

Pre-requisites

  • Google Chrome Browser installed
  • Publicly browser-accessible URL
  • Access to Splunk Observability Cloud

Supporting resources

  1. Lantern: advanced Selectors for multi-step browser tests
  2. Chrome for Developers DevTools Tips
  3. web.dev Core Web Vitals reference

Subsections of Advanced Synthetics

Record a test

Write down a short user journey you want to test. Remember: smaller bites are easier to chew! In other words, get started with just a few steps. This makes the test easier not only to create and maintain, but also to understand and act on the results. Test the features essential to your users, like a support contact form, login widget, or date picker.

Note

Record the test in the same type of viewport that you want to run it. For example, if you want to run a test on a mobile viewport, narrow your browser width to mobile and refresh before starting the recording. This way you are capturing the correct elements that could change depending on responsive style rules.

Open your starting URL in Chrome Incognito. This is important so you’re not carrying cookies into the recording, which we won’t set up in the Synthetic test by default. If your workshop instructor does not have a custom URL, feel free to use https://online-boutique-eu.splunko11y.com or https://online-boutique-us.splunko11y.com, which are used in the examples below.

Open the Chrome DevTools Recorder

Next, open the Developer Tools in the tab you just opened by pressing Ctrl + Shift + I on Windows or Cmd + Option + I on a Mac, then select Recorder from the top-level menu or the More tools flyout menu.

Open Recorder Open Recorder

Note

Site elements might change depending on viewport width. Before recording, set your browser window to the correct width for the test you want to create (Desktop, Tablet, or Mobile). Change the DevTools “dock side” to pop out as a separate window if it helps.

Create a new recording

With the Recorder panel open in the DevTools window, click on the Create a new recording button to start.

Recorder Recorder

For the Recording Name, use your initials to prefix the name of the recording, e.g. <your initials> - <website name>. Click on Start Recording to start recording your actions.

Recording Name Recording Name

Now that we are recording, complete a few actions on the site. An example for our demo site is:

  • Click on Vintage Camera Lens
  • Click on Add to Cart
  • Click on Place Order
  • Click on End recording in the Recorder panel.

End Recording End Recording

Export the recording

Click on the Export button:

Export button Export button

Select JSON as the format, then click on Save

Export JSON Export JSON

Save JSON Save JSON

Congratulations! You have successfully created a recording using the Chrome DevTools Recorder. Next, we will use this recording to create a Real Browser Test in Splunk Synthetic Monitoring. For reference, the exported JSON for the demo-site recording looks like this:


{
    "title": "RWC - Online Boutique",
    "steps": [
        {
            "type": "setViewport",
            "width": 1430,
            "height": 1016,
            "deviceScaleFactor": 1,
            "isMobile": false,
            "hasTouch": false,
            "isLandscape": false
        },
        {
            "type": "navigate",
            "url": "https://online-boutique-eu.splunko11y.com/",
            "assertedEvents": [
                {
                    "type": "navigation",
                    "url": "https://online-boutique-eu.splunko11y.com/",
                    "title": "Online Boutique"
                }
            ]
        },
        {
            "type": "click",
            "target": "main",
            "selectors": [
                [
                    "div:nth-of-type(2) > div:nth-of-type(2) a > div"
                ],
                [
                    "xpath//html/body/main/div/div/div[2]/div[2]/div/a/div"
                ],
                [
                    "pierce/div:nth-of-type(2) > div:nth-of-type(2) a > div"
                ]
            ],
            "offsetY": 170,
            "offsetX": 180,
            "assertedEvents": [
                {
                    "type": "navigation",
                    "url": "https://online-boutique-eu.splunko11y.com/product/66VCHSJNUP",
                    "title": ""
                }
            ]
        },
        {
            "type": "click",
            "target": "main",
            "selectors": [
                [
                    "aria/ADD TO CART"
                ],
                [
                    "button"
                ],
                [
                    "xpath//html/body/main/div[1]/div/div[2]/div/form/div/button"
                ],
                [
                    "pierce/button"
                ],
                [
                    "text/Add to Cart"
                ]
            ],
            "offsetY": 35.0078125,
            "offsetX": 46.4140625,
            "assertedEvents": [
                {
                    "type": "navigation",
                    "url": "https://online-boutique-eu.splunko11y.com/cart",
                    "title": ""
                }
            ]
        },
        {
            "type": "click",
            "target": "main",
            "selectors": [
                [
                    "aria/PLACE ORDER"
                ],
                [
                    "div > div > div.py-3 button"
                ],
                [
                    "xpath//html/body/main/div/div/div[4]/div/form/div[4]/button"
                ],
                [
                    "pierce/div > div > div.py-3 button"
                ],
                [
                    "text/Place order"
                ]
            ],
            "offsetY": 29.8125,
            "offsetX": 66.8203125,
            "assertedEvents": [
                {
                    "type": "navigation",
                    "url": "https://online-boutique-eu.splunko11y.com/cart/checkout",
                    "title": ""
                }
            ]
        }
    ]
}

Create a Browser Test

In Splunk Observability Cloud, navigate to Synthetics and click on Add new test.

From the dropdown select Browser test.

Add new test Add new test

You will then be presented with the Browser test content configuration page.

New Test New Test


Import JSON

To begin configuring our test, we need to import the JSON that we exported from the Chrome DevTools Recorder. To enable the Import button, we must first give our test a name e.g. [<your team name>] <your initials> - Online Boutique.

Import Import

Once the Import button is enabled, click on it and either drop the JSON file that you exported from the Chrome DevTools Recorder or upload the file.

Import JSON Import JSON

Once the JSON file has been uploaded, click on Continue to edit steps

Import Complete Import Complete

Edit Steps Edit Steps

Before we make any edits to the test, let’s first configure the settings. Click on < Return to test.


Test settings

The simple settings allow you to configure the basics of the test:

  • Name: The name of the test (e.g. RWC - Online Boutique).
  • Details:
    • Locations: The locations where the test will run from.
    • Device: Emulate different devices and connection speeds. Also, the viewport will be adjusted to match the chosen device.
    • Frequency: How often the test will run.
    • Round-robin: If multiple locations are selected, the test will run from one location at a time, rather than all locations at once.
    • Active: Set the test to active or inactive.

For this workshop, we will configure the locations that we wish to monitor from. Click in the Locations field and you will be presented with a list of global locations (over 50 in total).

Global Locations Global Locations

Select the following locations:

  • AWS - N. Virginia
  • AWS - London
  • AWS - Melbourne

Once complete, scroll down and click on Submit to save the test.

The test will now be scheduled to run every 5 minutes from the 3 locations that we have selected. This does take a few minutes for the schedule to be created.

So while we wait for the test to be scheduled, click on Edit test so we can go through the Advanced settings.


Advanced Test Settings

Click on Advanced. These settings are optional and can be used to further configure the test.

Note

In the case of this workshop, we will not be using any of these settings; this is for informational purposes only.

Advanced Settings Advanced Settings

  • Security:
    • TLS/SSL validation: When activated, this feature enforces validation of expired certificates, invalid hostnames, or untrusted issuers on SSL/TLS certificates.
    • Authentication: Add credentials to authenticate with sites that require additional security protocols, for example from within a corporate network. By using concealed global variables in the Authentication field, you create an additional layer of security for your credentials and simplify the ability to share credentials across checks.
  • Custom Content:
    • Custom headers: Specify custom headers to send with each request. For example, you can add a header in your request to filter out requests from analytics on the back end by sending a specific header in the requests. You can also use custom headers to set cookies.
    • Cookies: Set cookies in the browser before the test starts. For example, to prevent a popup modal from randomly appearing and interfering with your test, you can set cookies. Any cookies that are set will apply to the domain of the starting URL of the check. Splunk Synthetics Monitoring uses the public suffix list to determine the domain.
    • Host overrides: Add host override rules to reroute requests from one host to another. For example, you can create a host override to test an existing production site against page resources loaded from a development site or a specific CDN edge node.

Next, we will edit the test steps to provide more meaningful names for each step.


Edit test steps

To edit the steps, click on the + Edit steps or synthetic transactions button. From here, we are going to give each step a meaningful, readable name:

  • In Step 1, replace the text Go to URL with Go to Homepage.
  • In Step 2, enter the text Select Typewriter.
  • In Step 3, enter Add to Cart.
  • In Step 4, enter Place Order.

Step names Step names

Note

If you’d like, group the test steps into Transactions and edit the transaction names. This is especially useful for Single Page Apps (SPAs), where the resource waterfall is not split by URL. We can also create charts and alerts based on transactions.

Click < Return to test to return to the test configuration page and click Save to save the test.

You will be returned to the test dashboard where you will see test results start to appear.

Scatterplot Scatterplot

Congratulations! You have successfully created a Real Browser Test in Splunk Synthetic Monitoring. Next, we will look into a test result in more detail.


View test results

1. Click into a spike or failure in your test run results.

image image

2. What can you learn about this test run? If it failed, use the error message, filmstrip, video replay, and waterfall to understand what happened.

Single test run result Single test run result

3. What do you see in the resources? Make sure to click through all of the page (or transaction) tabs.

image image

Workshop Question

Do you see anything interesting? Common issues to find and fix include: unexpected response codes, duplicate requests, forgotten third parties, large or slow files, and long gaps between requests.

Want to learn more about specific performance improvements? Google and Mozilla have great resources to help understand what goes into frontend performance as well as in-depth details of how to optimize it.


Frontend Dashboards

15 minutes  

Go to Dashboards and find the End User Experience dashboard group.

Click the three dots on the top right to open the dashboard menu, select Save As, and include your team name and initials in the dashboard name.

Save to the dashboard group that matches your email address. Now you have your own copy of this dashboard to customize!

Dashboard save as Dashboard save as


Subsections of Frontend Dashboards

Copying and editing charts

We have some good charts in our dashboard, but let’s add a few more.

  1. Go to Dashboards by clicking the dashboard icon on the left side of the screen. Find the Browser app health dashboard and scroll to the Largest Contentful Paint (LCP) chart. Click the chart actions icon to open the flyout menu, and click “Copy” to add this chart to your clipboard. copy chart copy chart

  2. Now you can continue to add any other charts to your clipboard by clicking the “add to clipboard” icon. copy chart inline icon copy chart inline icon

  3. When you have collected the charts you want on your dashboard, click the “create” icon on the top right. You might need to reload the page if you were looking at charts in another browser tab. create icon create icon

  4. Click the “Paste charts” menu option. paste charts paste charts

Now you are able to resize and edit the charts as you’d like!

Bonus: edit chart data

  1. Click the chart actions icon and select Open to edit the chart. chart actions menu chart actions menu

  2. Remove the existing Test signal. edit the test signal edit the test signal

  3. Click Add filter and type test: *yourInitials*. This will use a wildcard match so that all of the tests you have created that contain your initials (or any string you decide) will be pulled into the chart. add filter button add filter button

  4. Click into the functions to see how adding and removing dimensions changes how the data is displayed. For example, if you want all of your test location data rolled up, remove that dimension from the function. test signal functions test signal functions

  5. Change the chart name and description as appropriate, and click “Save and close” to commit your changes or just “Close” to cancel your changes. chart close buttons chart close buttons


Events in context with chart data

Seeing the visualization of our KPIs is great. What’s better? KPIs in context with events! Overlaying events on a dashboard can help us more quickly understand if an event like a deployment caused a change in metrics, for better or worse.

  1. Your instructor will push a condition change to the workshop application. Click the event marker on any of your dashboard charts to see more details. condition change event condition change event

  2. In the dimensions, we can see more details about this specific event. If we click the event record, we can mark it for deletion if needed. event record event record event details event details

  3. We can also see a history of events in the event feed by clicking the icon on the top right of the screen, and selecting Event feed. event feed event feed

  4. Again, we can see details about recent events in this feed. event details event details

  5. We can also add new events in the GUI or via API (see the sketch after this list). To add a new event in the GUI, click the New event button. GUI add event button GUI add event button

  6. Name your event with your team name, initials, and what kind of event it is (deployment, campaign start, etc). Choose a timestamp, or leave as-is to use the current time, and click “Create”. create event form create event form

  7. Now, we need to make sure our new event is overlaid in this dashboard. Wait a minute or so (refresh the page if needed) and then search for the event in the Event overlay field. overlay event overlay event

  8. If your event is within the dashboard time window, you should now see it overlaid in your charts. Click “Save” to make sure your event overlay is saved to your dashboard! save dashboard save dashboard
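For the API route mentioned in step 5, here is a hedged sketch of sending a custom event programmatically, assuming the Splunk Observability ingest API’s /v2/event endpoint and an org access token with ingest permissions (realm, token, and all event fields below are placeholders):

// Hypothetical sketch: POST a custom event to the ingest API.
const REALM = 'us1';                                    // placeholder: your realm
const INGEST_TOKEN = process.env.SPLUNK_INGEST_TOKEN;   // placeholder: ingest token

await fetch(`https://ingest.${REALM}.signalfx.com/v2/event`, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-SF-TOKEN': INGEST_TOKEN,
  },
  body: JSON.stringify([{
    category: 'USER_DEFINED',
    eventType: 'team1-sw-deployment',                   // the event name
    dimensions: { service: 'frontend', team: 'team1' }, // searchable dimensions
    timestamp: Date.now(),                              // ms since epoch; omit for "now"
  }]),
});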

Keep in mind

Want to add context to that bug ticket, or show your manager how your change improved app performance? Seeing observability data in context with events not only helps with troubleshooting, but also helps us communicate with other teams.


Optimize End User Experiences

90 minutes   Author Sarah Ware

How can we use Splunk Observability to get insight into end user experience, and proactively test scenarios to improve that experience?

Sections:

  • Set up basic Synthetic tests to understand availability and performance ASAP
    • Uptime test
    • API test
    • Single page Browser test
  • Explore RUM to understand our real users
  • Write advanced Synthetics tests based on what we’ve learned about our users and what we need them to do
  • Customize dashboard charts to capture our KPIs, show trends, and show data in context of our events
  • Create Detectors to alert on our KPIs
Tip

Keep in mind throughout the workshop: how can I prioritize activities strategically to get the fastest time to value for my end users and for myself/ my developers?

Context

As a reminder, we need frontend performance monitoring to capture everything that goes into our end user experience. If we’re just monitoring the backend, we’re missing all of the other resources that are critical to our users’ success. Read What the Fastly Outage Can Teach Us About Observability for a real world example. Click the image below to zoom in. What goes into the front end What goes into the front end

References

Throughout this workshop, we will see references to resources to help further understand end user experience and how to optimize it. In addition to Splunk Docs for supported features and Lantern for tips and tricks, Google’s web.dev and Mozilla are great resources.

Remember that the specific libraries, platforms, and CDNs you use often also have their own specific resources. For example React, Wordpress, and Cloudflare all have their own tips to improve performance.


Subsections of Optimize End User Experiences

Synthetics

Let’s quickly set up some tests in Synthetics to immediately start understanding our end user experience, without waiting for real users to interact with our app.

We can capture not only the performance and availability of our own apps and endpoints, but also those third parties we rely on any time of the day or night.

Tip

If you find that your tests are being bot-blocked, see the docs for tips on how to allow Synthetic testing. If you need to test something that is not accessible externally, see the private location instructions.


Subsections of Synthetics

Uptime Test

5 minutes  

Introduction

The simplest way to keep an eye on endpoint availability is with an Uptime test. This lightweight test can run internally or externally around the world, as frequently as every minute. Because this is the easiest (and cheapest!) test to set up, and because it is ideal for monitoring the availability of your most critical endpoints and ports, let’s start here.

Pre-requisites

  • Publicly accessible HTTP(S) endpoint(s) to test
  • Access to Splunk Observability Cloud

Subsections of Uptime Test

Creating a test

  1. Open Synthetics Synthetics navigation item Synthetics navigation item

  2. Click the Add new test button on the right side of the screen, then select Uptime and HTTP test. image image

  3. Name your test with your team name (provided by your workshop instructor), your initials, and any other details you’d like to include, like geographic region.

  4. For now let’s test a GET request. Fill in the URL field. You can use one of your own, or one of ours like https://online-boutique-eu.splunko11y.com, https://online-boutique-us.splunko11y.com, or https://www.splunk.com.

  5. Click Try now to validate that the endpoint is accessible from the selected location before saving the test. Try now does not count against your subscription usage, so this is a good practice to make sure you’re not wasting real test runs on a misconfigured test. image image

Tip

A common reason for Try now to fail is that there is a non-2xx response code. If that is expected, add a Validation for the correct response code.

  6. Add any additional validations needed, for example: response code, response header, and response size. Advanced settings for test configuration Advanced settings for test configuration

  7. Add and remove any locations you’d like. Keep in mind where you expect your endpoint to be available.

  8. Change the frequency to test your more critical endpoints more often, up to one minute. image image

  9. Make sure “Round-robin” is on so the test will run from one location at a time, rather than from all locations at once.

    • If an endpoint is highly critical, think about if it is worth it to have all locations tested at the same time every single minute. If you have automations built in with a webhook from a detector, or if you have strict SLAs you need to track, this could be worth it to have as much coverage as possible. But if you are doing more manual investigation, or if this is a less critical endpoint, you could be wasting test runs that are executing while an issue is being investigated.
    • Remember that your license is based on the number of test runs per month. Turning Round-robin off will multiply the number of test runs by the number of locations you have selected. For example, a one-minute test with three locations runs about 43,200 times in a 30-day month with Round-robin on, but about 129,600 times with it off.
  10. When you are ready for the test to start running, make sure “Active” is on, then scroll down and click Submit to save the test configuration. submit button submit button

Now the test will start running with your saved configuration. Take a water break, then we’ll look at the results!


Understanding results

  1. From the Synthetics landing page, click into a test to see its summary view and play with the Performance KPIs chart filters to see how you can slice and dice your data. This is a good place to get started understanding trends. Later, we will see what custom charts look like, so you can tailor dashboards to the KPIs you care about most. KPI chart filters KPI chart filters

    Workshop Question: Using the Performance KPIs chart

    What metrics are available? Is your data consistent across time and locations? Do certain locations run slower than others? Are there any spikes or failures?

  2. Click into a recent run either in the chart or in the table below. run results chart run results chart

  3. If there are failures, look at the response to see whether you need to add a response code assertion (302 is a common one), whether authorization is needed, or whether different request headers should be added. Here we have information about this particular test run, including whether it succeeded or failed, the location, timestamp, and duration, in addition to the other Uptime test metrics. Click through to see the response, request, and connection info as well. uptime test result uptime test result If you need to edit the test for it to run successfully, click the test name in the top left breadcrumb on this run result page, then click Edit test on the top right of the test overview page. Remember to scroll down and click Submit to save your changes after editing the test configuration.

  4. In addition to the test running successfully, there are other metrics to measure the health of your endpoints. For example, Time to First Byte (TTFB) is a great indicator of performance, and you can optimize TTFB to improve end user experience.

  5. Go back to the test overview page and change the Performance KPIs chart to display First Byte time. Once the test has run for a long enough time, expanding the time frame will draw the data points as lines to better see trends and anomalies, like in the example below. Performance KPIs for Uptime Tests Performance KPIs for Uptime Tests

In the example above, we can see that TTFB varies consistently between locations. Knowing this, we can keep location in mind when reporting on metrics. We could also improve the experience, for example by serving users in those locations an endpoint hosted closer to them, which should reduce network latency. We can also see some slight variations in the results over time, but overall we already have a good idea of our baseline for this endpoint’s KPIs. When we have a baseline, we can alert on worsening metrics as well as visualize improvements.

Tip

We are not setting a detector on this test yet, to make sure it is running consistently and successfully. If you are testing a highly critical endpoint and want to be alerted on it ASAP (and have tolerance for potential alert noise), jump to Single Test Detectors.

Once you have your Uptime test running successfully, let’s move on to the next test type.


API Test

5 minutes  

The API test provides a flexible way to check the functionality and performance of API endpoints. The shift toward API-first development has magnified the necessity to monitor the back-end services that provide your core front-end functionality.

Whether you’re interested in testing multi-step API interactions or you want to gain visibility into the performance of your endpoints, the API Test can help you accomplish your goals.

This exercise will walk through a multi-step test on the Spotify API. You can also use it as a reference to build tests on your own APIs or on those of your critical third parties.

API test result API test result


Subsections of API Test

Global Variables

Global variables allow us to use stored strings in multiple tests, so we only need to update them in one place.

View the global variable that we’ll use to perform our API test. Click on Global Variables under the cog icon. The global variable named env.encoded_auth is the one that we’ll use to build the Spotify API transaction.

global variables global variables
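For context, a value like env.encoded_auth is typically just the Base64 encoding of clientId:clientSecret, as used by Spotify’s client-credentials flow. A sketch of producing it (CLIENT_ID and CLIENT_SECRET are placeholders for your own Spotify app credentials):

// Sketch: producing the Basic-auth value stored in env.encoded_auth.
const CLIENT_ID = process.env.SPOTIFY_CLIENT_ID;         // placeholder
const CLIENT_SECRET = process.env.SPOTIFY_CLIENT_SECRET; // placeholder

const encodedAuth = Buffer.from(`${CLIENT_ID}:${CLIENT_SECRET}`).toString('base64');
console.log(encodedAuth); // store this as a concealed global variable in Synthetics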


Create new API test

Create a new API test by clicking on the Add new test button and select API test from the dropdown. Name the test using your team name, your initials, and Spotify API e.g. [Daisy] RWC - Spotify API

new API test new API test


Authentication Request

Click on + Add requests and enter the request step name e.g. Authenticate with Spotify API.

placeholder placeholder

Expand the Request section, change the request method to POST from the drop-down, and enter the following URL:

https://accounts.spotify.com/api/token

In the Payload body section enter the following:

grant_type=client_credentials

Next, add two + Request headers with the following key/value pairings:

  • CONTENT-TYPE: application/x-www-form-urlencoded
  • AUTHORIZATION: Basic {{env.encoded_auth}}

Expand the Validation section and add the following extraction:

  • Extract from Response body JSON $.access_token as access_token

This will parse the JSON payload that is received from the Spotify API, extract the access token and store it as a custom variable.

Add request payload token Add request payload token
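To make the mechanics concrete, here is a sketch of the same request expressed as a Node.js (18+) fetch call; ENCODED_AUTH stands in for the env.encoded_auth global variable:

// Sketch of the Synthetic step above as a plain fetch call.
const ENCODED_AUTH = process.env.ENCODED_AUTH; // stands in for env.encoded_auth

const tokenResponse = await fetch('https://accounts.spotify.com/api/token', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Authorization': `Basic ${ENCODED_AUTH}`,
  },
  body: 'grant_type=client_credentials',
});

// The $.access_token extraction is equivalent to:
const { access_token } = await tokenResponse.json();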


Search Request

Click on + Add Request to add the next step. Name the step Search for Tracks named “Up around the bend”.

Expand the Request section, change the request method to GET, and enter the following URL:

https://api.spotify.com/v1/search?q=Up%20around%20the%20bend&type=track&offset=0&limit=5

Next, add two request headers with the following key/value pairings:

  • CONTENT-TYPE: application/json
  • AUTHORIZATION: Bearer {{custom.access_token}}
    • This uses the custom variable we created in the previous step!

Add search request Add search request

Expand the Validation section and add the following extraction:

  • Extract from Response body JSON $.tracks.items[0].id as track.id

Add search payload Add search payload
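Expressed the same way, this step is an authorized GET that reuses the token from the previous step; the $.tracks.items[0].id extraction maps to plain property access (a sketch; access_token carries over from the previous sketch):

// Sketch of the search step as a plain fetch call.
const searchResponse = await fetch(
  'https://api.spotify.com/v1/search?q=Up%20around%20the%20bend&type=track&offset=0&limit=5',
  {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${access_token}`, // token from the previous step
    },
  },
);

// The $.tracks.items[0].id extraction is equivalent to:
const results = await searchResponse.json();
const trackId = results.tracks.items[0].id;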

To validate the test before saving, scroll to the top and change the location as needed. Click Try now. See the docs for more information on the try now feature.

try now try now

When the validation is successful, click on < Return to test to return to the test configuration page. And then click Save to save the API test.

Extra credit

Have more time to work on this test? Take a look at the Response Body in one of your run results. What additional steps would make this test more thorough? Edit the test, and use the Try now feature to validate any changes you make before you save the test.


View results

Wait for a few minutes for the test to provision and run. Once you see the test has run successfully, click on the run to view the results:

API test result API test result



Single Page Browser Test

5 minutes  

We have started testing our endpoints, now let’s test the front end browser experience.

Starting with a single page browser test will let us capture how first- and third-party resources impact how our end users experience our browser-based site. It also allows us to start to understand our user experience metrics before introducing the complexity of multiple steps in one test.

A page where your users commonly “land” is a good choice to start with a single page test. This could be your site homepage, a section main page, or any other high-traffic URL that is important to you and your end users.

  1. Click Create new test and select Browser test Create new browser test Create new browser test

  2. Include your team name and initials in the test name. Add details to the Name and Custom properties to describe the scope of the test (like Desktop for device type). Then click + Edit steps Browser test content fields Browser test content fields

  3. Change the transaction label (top left) and step name (on the right) to something readable that describes the step. Add the URL you’d like to test. Your workshop instructor can provide you with a URL as well. In the below example, the transaction is “Home” and the step name is “Go to homepage”. Transaction and step label Transaction and step label

  4. To validate the test, change the location as needed and click Try now. See the docs for more information on the try now feature. browser test try now buttons browser test try now buttons

  5. Wait for the test validation to complete. If the test validation failed, double check your URL and test location and try again. With Try now you can see what the result of the test will be if it were saved and run as-is. Try Now browser test results Try Now browser test results

  6. Click < Return to test to continue the configuration. Return to test button Return to test button

  7. Edit the locations you want to use, keeping in mind any regional rules you have for your site.

    Browser test details Browser test details

  8. You can edit the Device and Frequency or leave them at their default values for now. Click Submit at the bottom of the form to save the test and start running it.

    browser test submit button browser test submit button

Bonus Exercise

Have a few spare seconds? Copy this test and change just the title and device type, and save. Now you have visibility into the end user experience on another device and connection speed!

While our Synthetic tests are running, let’s see how RUM is instrumented to start getting data from our real users.


RUM

15 minutes  

With RUM instrumented, we will be able to better understand our end users, what they are doing, and what issues they are encountering.

This workshop walks through how our demo site is instrumented and how to interpret the data. If you already have a RUM license, this will help you understand how RUM works and how you can use it to optimize your end user experience.

Tip

Our Docs also contain guidance such as scenarios using Splunk RUM and demo applications to test out RUM for mobile apps.


Subsections of RUM

Overview

The aim of this Splunk Real User Monitoring (RUM) workshop is to let you:

  • Shop for items on the Online Boutique to create traffic, and create RUM User Sessions1 that you can view in Splunk Observability Cloud.
  • See an overview of the performance of all your application(s) in the Application Summary Dashboard
  • Examine the performance of a specific website with RUM metrics.

In order to reach this goal, we will use an online boutique to order various products. While shopping on the online boutique you will create what is called a User Session.

You may encounter some issues with this website, and you will use Splunk RUM to identify them so they can be resolved by the developers.

The workshop host will provide you with a URL for an online boutique store that has RUM enabled.

Each of these Online Boutiques is also being visited by a few synthetic users; this allows us to generate more live data to be analyzed later.


  1. A RUM User Session is a “recording” of a collection of user interactions on an application; it captures a website or app’s performance measured straight from the end user’s browser or mobile app. To do this, a small amount of JavaScript is embedded in each page. This script then collects data from each user as he or she explores the page, and transfers that data back for analysis. ↩︎


RUM instrumentation in a browser app

  • Check the HEAD section of the Online-boutique webpage in your browser
  • Find the code that instruments RUM

1. Browse to the Online Boutique

Your workshop instructor will provide you with the Online Boutique URL that has RUM installed so that you can complete the next steps.

2. Inspecting the HTML source

The changes needed for RUM are placed in the <head> section of the host’s web page. Right click to view the page source or to inspect the code. Below is an example of the <head> section with RUM:

Online Boutique

This code enables RUM Tracing, Session Replay, and Custom Events to better understand performance in the context of user workflows:

  • The first part indicates where to download the Splunk OpenTelemetry JavaScript file from: https://cdn.signalfx.com/o11y-gdi-rum/latest/splunk-otel-web.js (this can also be hosted locally if required).
  • The next section defines where to send the traces via the beacon URL: {beaconUrl: "https://rum-ingest.eu0.signalfx.com/v1/rum"
  • The RUM Access Token: rumAuth: "<redacted>".
  • Identification tags app and environment used to identify the application in the Splunk RUM UI, e.g. app: "online-boutique-us-store", environment: "online-boutique-us"} (these values will be different in your workshop)

The above lines 21 and 23-30 are all that is required to enable RUM on your website!

Lines 22 and 31-34 are optional if you want Session Replay instrumented.

Lines 36-39 (var tracer=Provider.getTracer('appModuleLoader');) will add a Custom Event for every page change, allowing you to better track your website conversions and usage. This may or may not be instrumented for this workshop.
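
Putting those pieces together, here is an illustrative sketch of what such a <head> snippet can look like. It mirrors the fields described above; the token, app, and environment values are placeholders and will differ in your workshop:

<!-- Load the Splunk OpenTelemetry JavaScript agent from the CDN -->
<script src="https://cdn.signalfx.com/o11y-gdi-rum/latest/splunk-otel-web.js" crossorigin="anonymous"></script>
<script>
  // Initialize RUM: where to send traces, the access token, and identification tags
  SplunkRum.init({
    beaconUrl: 'https://rum-ingest.eu0.signalfx.com/v1/rum',
    rumAuth: '<redacted>',            // your RUM Access Token
    app: 'online-boutique-us-store',  // example value; yours will differ
    environment: 'online-boutique-us' // example value; yours will differ
  });
</script>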

Exercise

Time to shop! Take a minute to open the workshop store URL in as many browsers and devices as you’d like, shop around, add items to cart, checkout, and feel free to close the shopping browsers when you’re finished. Keep in mind this is a lightweight demo shop site, so don’t be alarmed if the cart doesn’t match the item you picked!


RUM Landing Page

  • Visit the RUM landing page and check the overview of the performance of all your RUM-enabled applications with the Application Summary Dashboard (both mobile and web based)

1. Visit the RUM Landing Page

Log in to Splunk Observability Cloud. From the left side menu bar, select RUM RUM-ico. This will bring you to the RUM Landing Page.

The goal of this page is to give you, in a single view, a clear indication of the health, performance, and potential errors found in your application(s), and to let you dive deeper into the User Session information collected from your web page or app. You will have a pane for each of your active RUM applications. (The view below is the default expanded view.)

RUM-App-sum

If you have multiple applications (which will be the case when every attendee is using their own EC2 instance for the RUM workshop), the pane view may be automatically reduced by collapsing the panes, as shown below:

RUM-App-sum-collapsed

You can expand a condensed RUM Application Summary View to the full dashboard by clicking on the small browser RUM-browser or mobile RUM-mobile icon (depending on the type of application: mobile or browser based) on the left in front of the application’s name, highlighted by the red arrow.

First find the right application to use for the workshop:

If you are participating in a standalone RUM workshop, the workshop leader will tell you the name of the application to use. In the case of a combined workshop, it will follow the naming convention we used for IM and APM, using the EC2 node name as a unique ID, like jmcj-store, shown as the last app in the screenshot above.

2. Configure the RUM Application Summary Dashboard Header Section

The RUM Application Summary Dashboard consists of five major sections. The first is the selection header, where you can set/filter a number of options:

  • A drop down for the Time Window you’re reviewing (You are looking at the past 15 minutes by default)
  • A drop down to select the Environment1 you want to look at. This allows you to focus on just the subset of applications belonging to that environment, or Select all to view all available.
  • A drop down list with the various Apps being monitored. You can use the one provided by the workshop host or select your own. This will focus you on just one application.
  • A drop down to select the Source, Browser or Mobile applications to view. For the Workshop leave All selected.
  • A hamburger menu located at the right of the header allowing you to configure some settings of your Splunk RUM application. (We will visit this in a later section).

RUM-SummaryHeader

For the workshop, let’s do a deeper dive into the Application Summary screen in the next section: Check Browser Applications health at a glance.


A common application deployment pattern is to have multiple, distinct application environments that don’t interact directly with each other but that are all being monitored by Splunk APM or RUM: for instance, quality assurance (QA) and production environments, or multiple distinct deployments in different datacenters, regions or cloud providers.


  1. A deployment environment is a distinct deployment of your system or application that allows you to set up configurations that don’t overlap with configurations in other deployments of the same application. Separate deployment environments are often used for different stages of the development process, such as development, staging, and production. ↩︎


Check Browser Applications health at a glance

  • Get familiar with the UI and options available from this landing page
  • Identify Page Views/JavaScript Errors and Request/Errors in a single view
  • Check the Web Vitals metrics and any Detector that has fired in relation to your Browser Application

Application Summary Dashboard

1. Header Bar

As seen in the previous section, the RUM Application Summary Dashboard consists of five major sections.
The first section is the selection header, where you can collapse the pane via the browser RUM-browser icon or the > in front of the application name, which is jmcj-store in the example below. It also provides access to the Application Overview page if you click the link with your application name.

Further, you can also open the Application Overview or App Health Dashboard via the triple dot trippleburger menu on the right.

RUM-SummaryHeader

For now, let’s look at the high level information we get on the application summary dashboard.

The RUM Application Summary Dashboard is focused on providing you with at a glance highlights of the status of your application.

2. Page Views / JavaScript Errors & Network Requests / Errors

The Page Views / JavaScript Errors and Network Requests / Errors charts show the quantity and trend of these issues in your application. These could be JavaScript errors, or failed network calls to back-end services.

RUM-chart

In the example above you can see that there are no failed network calls in the Network chart, but in the Page View chart you can see that a number of pages do experience some errors. These are often not visible to regular users, but can seriously impact the performance of your website.

You can see the count of the Page Views / Network Requests / Errors by hovering over the charts.

RUM-chart-clicked

3. JavaScript Errors

The second section of the RUM Application Summary Dashboard shows an overview of the JavaScript errors occurring in your application, along with a count of each error.

RUM-javascript

In the example above you can see there are three JavaScript errors, one that appears 29 times in the selected time slot, and the other two each appear 12 times.

If you click on one of the errors, a pop-out opens that shows a summary (below) of the errors over time, along with a Stack Trace of the JavaScript error, giving you an indication of where the problems occurred. (We will see this in more detail in one of the following sections.)

RUM-javascript-chart

4. Web Vitals

The next section of the RUM Application Summary Dashboard shows Google’s Core Web Vitals: three metrics that are not only used by Google in its search ranking system, but that also quantify end user experience in terms of loading, interactivity, and visual stability.

WEB-vitals

As you can see, our site is well behaved and scores Good for all three metrics. These metrics can be used to identify the effect of changes to your application, and to help you improve the performance of your site.

If you click on any of the metrics shown in the Web Vitals pane, you will be taken to the corresponding Tag Spotlight dashboard. For example, clicking on the Largest Contentful Paint (LCP) chart takes you to a dashboard similar to the screenshot below, which gives you timeline and table views of how this metric has performed. This should allow you to spot trends and identify where a problem may be more common, such as on a particular operating system, geolocation, or browser version.

WEB-vitals-tag

5. Most Recent Detectors

The final section of the RUM Application Summary Dashboard is focused on providing you with an overview of recent detectors that have triggered for your application. We have created a detector for this screenshot, but your pane will be empty for now. We will add some detectors to your site and make sure they are triggered in one of the next sections.

detectors

In the screenshot you can see we have a critical alert for the RUM Aggregated View Detector, along with a count of how often this alert has triggered in the selected time window. If you happen to have an alert listed, you can click on the name of the alert (shown as a blue link) and you will be taken to the Alert Overview page showing the details of the alert. (Note: this will move you away from the current page; please use the Back option of your browser to return to the overview page.)

alert


Exercise

Please take a few minutes to experiment with the RUM Application Summary Dashboard and the underlying chart and dashboards before going on to the next section.


Analyzing RUM Metrics

  • See RUM Metrics and Session information in the RUM UI
  • See correlated APM traces in the RUM & APM UI

1. RUM Overview Pages

From your RUM Application Summary Dashboard, you can see detailed information by opening the Application Overview page via the triple dot trippleburger menu on the right and selecting Open Application Overview, or by clicking the link with your application name, which is jmcj-rum-app in the example below.

RUM-SummaryHeader

This will take you to the RUM Application Overview Page screen as shown below.

RUM app overview with UX metrics

2. RUM Browser Overview

2.1. Header

The RUM UI consists of five major sections. The first is the selection header, where you can set/filter a number of options:

  • A drop down for the time window you’re reviewing (You are looking at the past 15 minutes in this case)
  • A drop down to select the Comparison window (You are comparing current performance on a rolling window - in this case compared to 1 hour ago)
  • A drop down with the available Environments to view
  • A drop down list with the various web apps
  • Optionally a drop down to select Browser or Mobile metrics (Might not be available in your workshop)

RUM-Header

2.2. UX Metrics

By default, RUM prioritizes the metrics that most directly reflect the experience of the end user.


All of the dashboard charts allow us to compare trends over time, create detectors, and click through to further diagnose issues.

First, we see page load and route change information, which can help us understand if something unexpected is impacting user traffic trends.

Page load and route change charts

Next, Google has defined Core Web Vitals to quantify the user experience as measured by loading, interactivity, and visual stability. Splunk RUM builds in Google’s thresholds into the UI, so you can easily see if your metrics are in an acceptable range.

Core Web Vitals charts

  • Largest Contentful Paint (LCP), measures loading performance. How long does it take for the largest block of content in the viewport to load? To provide a good user experience, LCP should occur within 2.5 seconds of when the page first starts loading.
  • First Input Delay (FID), measures interactivity. How long does it take to be able to interact with the app? To provide a good user experience, pages should have a FID of 100 milliseconds or less.
  • Cumulative Layout Shift (CLS), measures visual stability. How much does the content move around after the initial load? To provide a good user experience, pages should maintain a CLS of 0.1 or less.

Improving Web Vitals is a key component to optimizing your end user experience, so being able to quickly understand them and create detectors if they exceed a threshold is critical.

Google has some great resources if you want to learn more, for example the business impact of Core Web Vitals.
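
If you are curious how these metrics can be captured in the browser yourself, here is a minimal sketch using Google’s open-source web-vitals library (v3 API shown; newer versions replace FID with INP). Splunk RUM collects these automatically once instrumented, so this is purely illustrative:

import { onCLS, onFID, onLCP } from 'web-vitals';

// Log each Core Web Vital as it becomes available.
// Real instrumentation would send these values to an analytics endpoint.
onLCP((metric) => console.log('LCP (ms):', metric.value));
onFID((metric) => console.log('FID (ms):', metric.value));
onCLS((metric) => console.log('CLS (score):', metric.value));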

2.3. Front-end health

Common causes of frontend issues are JavaScript errors and long tasks, which can especially affect interactivity. Creating detectors on these indicators helps us investigate interactivity issues before our users report them, allowing us to build workarounds or roll back related releases faster if needed. Learn more about optimizing long tasks for better end user experience!

JS error charts Long task charts

2.4. Back-end health

Common back-end issues affecting user experience are network issues and resource requests. In this example, we clearly see a spike in Time To First Byte that lines up with a resource request spike, so we already have a good starting place to investigate.

Back-end health charts

  • Time To First Byte (TTFB), measures how long it takes for a client’s browser to receive the first byte of the response from the server. The longer it takes for the server to process the request and send a response, the slower your visitors’ browser is at displaying your page.
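
As an aside, TTFB for the current page can be read directly from the browser’s Navigation Timing API. The sketch below uses only standard Web APIs and is purely illustrative; Splunk RUM gathers this for you automatically:

// Run in the browser console on any page.
const [nav] = performance.getEntriesByType('navigation');
// responseStart marks the arrival of the first response byte.
const ttfb = nav.responseStart - nav.startTime;
console.log(`TTFB: ${ttfb.toFixed(0)} ms`);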

Analyzing RUM Tags in the Tag Spotlight view

  • Look into the Metrics views for the various endpoints and use the Tags sent with the traces in Tag Spotlight for deeper analysis

1. Find a URL for the Cart endpoint

From the RUM Overview page, select the URL for the Cart endpoint to dive deeper into the information available for this endpoint.

RUM-Cart2

Once you have selected and clicked on the blue URL, you will find yourself in the Tag Spotlight overview:

RUM-Tag

Here you will see all of the tags that have been sent to Splunk RUM as part of the RUM traces. The tags displayed will be relevant to the overview that you have selected. These are generic Tags created automatically when the Trace was sent, and additional Tags you have added to the trace as part of the configuration of your website.

Additional Tags

We are already sending two additional tags; you saw them defined in the beacon snippet that was added to your website in the first section of this workshop: app: "[nodename]-store", environment: "[nodename]-workshop". You can add additional tags in a similar way, as sketched below.
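
For illustration, here is a hedged sketch of how extra tags can be attached using splunk-otel-web’s global attributes; the attribute names (user.tier, cart.size) are invented for this example:

// At initialization time:
SplunkRum.init({
  beaconUrl: 'https://rum-ingest.eu0.signalfx.com/v1/rum',
  rumAuth: '<redacted>',
  app: 'online-boutique-us-store',
  environment: 'online-boutique-us',
  globalAttributes: { 'user.tier': 'premium' } // hypothetical tag
});

// Or later, at runtime:
SplunkRum.setGlobalAttributes({ 'cart.size': 3 }); // hypothetical tag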

In our example we have selected the Page Load view as shown here:

RUM-Header

You can select any of the following Tag views, each focused on a specific metric.

RUM-views


2. Explore the information in the Tag Spotlight view

Tag Spotlight is designed to help you identify problems, either through the chart view, where you may quickly identify outliers, or via the Tags.

In the Page Load view, if you look at the Browser, Browser Version & OS Name Tag views, you can see the various browser types and versions, as well as the underlying OS.

This makes it easy to identify problems related to specific browser or OS versions, as they would be highlighted.

RUM-Tag2

In the above example you can see that Firefox had the slowest response, that various browser versions (Chrome) have different response times, and that the Android devices responded slowly.

A further example is the regional Tags that you can use to identify problems related to ISPs, locations, and so on. Here you should be able to find the location you have been using to access the Online Boutique. Drill down by selecting the town or country you are accessing the Online Boutique from, clicking on the name as shown below (City of Amsterdam):

RUM-click

This will select only the sessions relevant to the city selected as shown below:

RUM-Adam

By selecting the various Tags you build up a filter; you can see the current selection below.

RUM-Adam

To clear the filter and see every trace click on Clear All at the top right of the page.

If the overview page is empty or shows RUM-Adam, no traces have been received in the selected time slot. You need to increase the time window at the top left; you can start with the Last 12 hours, for example.

You can then use your mouse to select the time slot you want, as shown in the view below, and activate that time filter by clicking on the little spyglass icon.

RUM-time


Analyzing RUM Sessions

  • Dive into RUM Session information in the RUM UI
  • Identify JavaScript errors in the Span of a user interaction

1. Drill down in the Sessions

After you have analyzed the information and drilled down via the Tag Spotlight to a subset of the traces, you can view the actual session as it was run by the end-user’s browser.

You do this by clicking on the link User Sessions as shown below:

RUM-Header

This will give you a list of sessions that matched both the time filter and the subset selected in the Tag Profile.

Select one by clicking on the session ID. It is a good idea to select one with the longest duration (preferably over 700 ms).

RUM-Header

Once you have selected the session, you will be taken to the session details page. As you are selecting a specific action that is part of the session, you will likely arrive somewhere in the middle of the session, at the moment of the interaction.

You can see the URL that you selected earlier is where we are focusing on in the waterfall.

RUM-Session-Tag

Scroll down a little bit on the page, so you see the end of the operation as shown below.

RUM-Session-info

You can see that we have received a few JavaScript Console errors that may not have been detected by or visible to the end users. To examine these in more detail, click on the middle one that says: Cannot read properties of undefined (reading ‘Prcie’).

RUM-Session-info

This will cause the page to expand and show the Span detail for this interaction. It will contain a detailed error.stack that you can pass on to the developer to solve the issue. You may have noticed when buying in the Online Boutique that the final total was always $0.00.

RUM-Session-info
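
To make the error concrete, here is a hypothetical reconstruction of the kind of typo that produces this message (the object shape is invented; only the error pattern matches the demo):

// 'order.totals' does not exist, so reading any property of it throws.
const order = { total: { price: 42.5 } }; // note: 'total', not 'totals'
const shown = order.totals.Prcie;
// -> TypeError: Cannot read properties of undefined (reading 'Prcie')
// A bug like this is why the checkout total renders as $0.00.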


Advanced Synthetics

30 minutes  

Introduction

This workshop walks you through using the Chrome DevTools Recorder to create a synthetic test on a Splunk demonstration environment or on your own public website.

The exported JSON from the Chrome DevTools Recorder will then be used to create a Splunk Synthetic Monitoring Real Browser Test.

Pre-requisites

  • Google Chrome Browser installed
  • Publicly browser-accessible URL
  • Access to Splunk Observability Cloud

Supporting resources

  1. Lantern: advanced Selectors for multi-step browser tests
  2. Chrome for Developers DevTools Tips
  3. web.dev Core Web Vitals reference

Subsections of Advanced Synthetics

Record a test

Write down a short user journey you want to test. Remember: smaller bites are easier to chew! In other words, get started with just a few steps. This not only makes the test easier to create and maintain, but also makes the results easier to understand and act on. Test the features essential to your users, like a support contact form, login widget, or date picker.

Note

Record the test in the same type of viewport that you want to run it. For example, if you want to run a test on a mobile viewport, narrow your browser width to mobile and refresh before starting the recording. This way you are capturing the correct elements that could change depending on responsive style rules.

Open your starting URL in Chrome Incognito. This is important so you’re not carrying cookies into the recording, which we won’t set up in the Synthetic test by default. If your workshop instructor does not have a custom URL, feel free to use https://online-boutique-eu.splunko11y.com or https://online-boutique-us.splunko11y.com, which are used in the examples below.

Open the Chrome DevTools Recorder

Next, open the Developer Tools (in the new tab that was opened above) by pressing Ctrl + Shift + I on Windows or Cmd + Option + I on a Mac, then select Recorder from the top-level menu or the More tools flyout menu.

Open Recorder

Note

Site elements might change depending on viewport width. Before recording, set your browser window to the correct width for the test you want to create (Desktop, Tablet, or Mobile). Change the DevTools “dock side” to pop out as a separate window if it helps.

Create a new recording

With the Recorder panel open in the DevTools window, click on the Create a new recording button to start.

Recorder

For the Recording Name use your initials to prefix the name of the recording e.g. <your initials> - <website name>. Click on Start Recording to start recording your actions.

Recording Name

Now that we are recording, complete a few actions on the site. An example for our demo site is:

  • Click on Vintage Camera Lens
  • Click on Add to Cart
  • Click on Place Order
  • Click on End recording in the Recorder panel.

End Recording

Export the recording

Click on the Export button:

Export button

Select JSON as the format, then click on Save.

Export JSON

Save JSON

Congratulations! You have successfully created a recording using the Chrome DevTools Recorder. Next, we will use this recording to create a Real Browser Test in Splunk Synthetic Monitoring.
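
For reference, below is an example of an exported recording against the EU demo store; your JSON will differ depending on your site, viewport, and steps.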


{
    "title": "RWC - Online Boutique",
    "steps": [
        {
            "type": "setViewport",
            "width": 1430,
            "height": 1016,
            "deviceScaleFactor": 1,
            "isMobile": false,
            "hasTouch": false,
            "isLandscape": false
        },
        {
            "type": "navigate",
            "url": "https://online-boutique-eu.splunko11y.com/",
            "assertedEvents": [
                {
                    "type": "navigation",
                    "url": "https://online-boutique-eu.splunko11y.com/",
                    "title": "Online Boutique"
                }
            ]
        },
        {
            "type": "click",
            "target": "main",
            "selectors": [
                [
                    "div:nth-of-type(2) > div:nth-of-type(2) a > div"
                ],
                [
                    "xpath//html/body/main/div/div/div[2]/div[2]/div/a/div"
                ],
                [
                    "pierce/div:nth-of-type(2) > div:nth-of-type(2) a > div"
                ]
            ],
            "offsetY": 170,
            "offsetX": 180,
            "assertedEvents": [
                {
                    "type": "navigation",
                    "url": "https://online-boutique-eu.splunko11y.com/product/66VCHSJNUP",
                    "title": ""
                }
            ]
        },
        {
            "type": "click",
            "target": "main",
            "selectors": [
                [
                    "aria/ADD TO CART"
                ],
                [
                    "button"
                ],
                [
                    "xpath//html/body/main/div[1]/div/div[2]/div/form/div/button"
                ],
                [
                    "pierce/button"
                ],
                [
                    "text/Add to Cart"
                ]
            ],
            "offsetY": 35.0078125,
            "offsetX": 46.4140625,
            "assertedEvents": [
                {
                    "type": "navigation",
                    "url": "https://online-boutique-eu.splunko11y.com/cart",
                    "title": ""
                }
            ]
        },
        {
            "type": "click",
            "target": "main",
            "selectors": [
                [
                    "aria/PLACE ORDER"
                ],
                [
                    "div > div > div.py-3 button"
                ],
                [
                    "xpath//html/body/main/div/div/div[4]/div/form/div[4]/button"
                ],
                [
                    "pierce/div > div > div.py-3 button"
                ],
                [
                    "text/Place order"
                ]
            ],
            "offsetY": 29.8125,
            "offsetX": 66.8203125,
            "assertedEvents": [
                {
                    "type": "navigation",
                    "url": "https://online-boutique-eu.splunko11y.com/cart/checkout",
                    "title": ""
                }
            ]
        }
    ]
}

Create a Browser Test

In Splunk Observability Cloud, navigate to Synthetics and click on Add new test.

From the dropdown select Browser test.

Add new test

You will then be presented with the Browser test content configuration page.

New Test


Import JSON

To begin configuring our test, we need to import the JSON that we exported from the Chrome DevTools Recorder. To enable the Import button, we must first give our test a name e.g. [<your team name>] <your initials> - Online Boutique.

Browser edit form

Once the Import button is enabled, click on it and either drop the JSON file that you exported from the Chrome DevTools Recorder or upload the file.

Import JSON

Once the JSON file has been uploaded, click on Continue to edit steps.

Import complete message

Before we make any edits to the test, let’s first configure the settings. Click on < Return to test.

Return to test button in the browser test editor


Test settings

The simple settings allow you to configure the basics of the test:

  • Name: The name of the test (e.g. RWC - Online Boutique).
  • Details:
    • Locations: The locations where the test will run from.
    • Device: Emulate different devices and connection speeds. Also, the viewport will be adjusted to match the chosen device.
    • Frequency: How often the test will run.
    • Round-robin: If multiple locations are selected, the test will run from one location at a time, rather than all locations at once.
    • Active: Set the test to active or inactive.

For this workshop, we will configure the locations that we wish to monitor from. Click in the Locations field and you will be presented with a list of global locations (over 50 in total).

Global Locations

Select the following locations:

  • AWS - N. Virginia
  • AWS - London
  • AWS - Melbourne

Once complete, scroll down and click on Submit to save the test.

The test will now be scheduled to run every 5 minutes from the 3 locations that we have selected. This does take a few minutes for the schedule to be created.

So while we wait for the test to be scheduled, click on Edit test so we can go through the Advanced settings.


Advanced Test Settings

Click on Advanced. These settings are optional and can be used to further configure the test.

Note

In the case of this workshop, we will not be using any of these settings; this is for informational purposes only.

Advanced Settings

  • Security:
    • TLS/SSL validation: When activated, this feature is used to enforce the validation of expired, invalid hostname, or untrusted issuer on SSL/TLS certificates.
    • Authentication: Add credentials to authenticate with sites that require additional security protocols, for example from within a corporate network. By using concealed global variables in the Authentication field, you create an additional layer of security for your credentials and simplify the ability to share credentials across checks.
  • Custom Content:
    • Custom headers: Specify custom headers to send with each request. For example, you can add a header in your request to filter out requests from analytics on the back end by sending a specific header in the requests. You can also use custom headers to set cookies.
    • Cookies: Set cookies in the browser before the test starts. For example, to prevent a popup modal from randomly appearing and interfering with your test, you can set cookies. Any cookies that are set will apply to the domain of the starting URL of the check. Splunk Synthetics Monitoring uses the public suffix list to determine the domain.
    • Host overrides: Add host override rules to reroute requests from one host to another. For example, you can create a host override to test an existing production site against page resources loaded from a development site or a specific CDN edge node.

See the Advanced Settings for Browser Tests section of the Docs for more information.

Next, we will edit the test steps to provide more meaningful names for each step.


Edit test steps

To edit the steps, click on the + Edit steps or synthetic transactions button. From here, we are going to give each step a meaningful, readable name. That could look like:

  • Step 1 replace the text Go to URL with Go to Homepage
  • Step 2 enter the text Select Typewriter.
  • Step 3 enter Add to Cart.
  • Step 4 enter Place Order.

Editing browser test step names

Note

If you’d like, group the test steps into Transactions and edit the transaction names as seen above. This is especially useful for Single Page Apps (SPAs), where the resource waterfall is not split by URL. We can also create charts and alerts based on transactions.

Click < Return to test to return to the test configuration page and click Save to save the test.

You will be returned to the test dashboard where you will see test results start to appear.

Browser KPIs chart

Congratulations! You have successfully created a Real Browser Test in Splunk Synthetic Monitoring. Next, we will look into a test result in more detail.


View test results

1. Click into a spike or failure in your test run results.

Spike in the browser test performance KPIs chart

2. What can you learn about this test run? If it failed, use the error message, filmstrip, video replay, and waterfall to understand what happened.

Single test run result, with an error message and screenshots

3. What do you see in the resources? Make sure to click through all of the page (or transaction) tabs.

resources in the browser test waterfall, with a long request highlighted

Workshop Question

Do you see anything interesting? Common issues to find and fix include: unexpected response codes, duplicate requests, forgotten third parties, large or slow files, and long gaps between requests.

Want to learn more about specific performance improvements? Google and Mozilla have great resources to help understand what goes into frontend performance as well as in-depth details of how to optimize it.


Frontend Dashboards

15 minutes  

Go to Dashboards and find the End User Experiences dashboard group.

Click the three dots on the top right to open the dashboard menu, select Save As, and include your team name and initials in the dashboard name.

Save to the dashboard group that matches your email address. Now you have your own copy of this dashboard to customize!

Dashboard save as


Subsections of Frontend Dashboards

Copying and editing charts

We have some good charts in our dashboard, but let’s add a few more.

  1. Go to Dashboards by clicking the dashboard icon on the left side of the screen. Find the Browser app health dashboard and scroll to the Largest Contentful Paint (LCP) chart. Click the chart actions icon to open the flyout menu, and click “Copy” to add this chart to your clipboard. copy chart

  2. Now you can continue to add any other charts to your clipboard by clicking the “add to clipboard” icon. copy chart inline icon

  3. When you have collected the charts you want on your dashboard, click the “create” icon on the top right. You might need to reload the page if you were looking at charts in another browser tab. create icon

  4. Click the “Paste charts” menu option. paste charts

Now you are able to resize and edit the charts as you’d like!

Bonus: edit chart data

  1. Click the chart actions icon and select Open to edit the chart. chart actions menu

  2. Remove the existing Test signal. edit the test signal

  3. Click Add filter and type test: *yourInitials*. This will use a wildcard match so that all of the tests you have created that contain your initials (or any string you decide) will be pulled into the chart. add filter button

  4. Click into the functions to see how adding and removing dimensions changes how the data is displayed. For example, if you want all of your test location data rolled up, remove that dimension from the function. test signal functions

  5. Change the chart name and description as appropriate, and click “Save and close” to commit your changes or just “Close” to cancel your changes. chart close buttons


Events in context with chart data

Seeing the visualization of our KPIs is great. What’s better? KPIs in context with events! Overlaying events on a dashboard can help us more quickly understand if an event like a deployment caused a change in metrics, for better or worse.

  1. Your instructor will push a condition change to the workshop application. Click the event marker on any of your dashboard charts to see more details. condition change event

  2. In the dimensions, we can see more details about this specific event. If we click the event record, we can mark it for deletion if needed. event record event details

  3. We can also see a history of events in the event feed by clicking the icon on the top right of the screen, and selecting Event feed. event feed

  4. Again, we can see details about recent events in this feed. event details

  5. We can also add new events in the GUI or via API (see the sketch after this list). To add a new event in the GUI, click the New event button. GUI add event button

  6. Name your event with your team name, initials, and what kind of event it is (deployment, campaign start, etc). Choose a timestamp, or leave as-is to use the current time, and click “Create”. create event form

  7. Now, we need to make sure our new event is overlaid in this dashboard. Wait a minute or so (refresh the page if needed) and then search for the event in the Event overlay field. overlay event

  8. If your event is within the dashboard time window, you should now see it overlaid in your charts. Click “Save” to make sure your event overlay is saved to your dashboard! save dashboard
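
As referenced above, events can also be created programmatically. Here is a minimal sketch against the Splunk Observability (SignalFx) event ingest endpoint; the realm, token, and event name are placeholders you would replace with your own:

// Node 18+ ES module (built-in fetch). All values below are placeholders.
const realm = 'us1';
const token = '<your-access-token>';

await fetch(`https://ingest.${realm}.signalfx.com/v2/event`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'X-SF-Token': token },
  body: JSON.stringify([{
    category: 'USER_DEFINED',
    eventType: 'team1-rc-deployment', // hypothetical event name
    dimensions: { service: 'online-boutique', environment: 'workshop' },
    timestamp: Date.now(),            // milliseconds since epoch
  }]),
});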

Keep in mind

Want to add context to that bug ticket, or show your manager how your change improved app performance? Seeing observability data in context with events not only helps with troubleshooting, but also helps us communicate with other teams.


Detectors

20 minutes  

After we have a good understanding of our performance baseline, we can start to create Detectors so that we receive alerts when our KPIs are unexpected. If we create detectors before understanding our baseline, we run the risk of generating unnecessary alert noise.

For RUM and Synthetics, we will explore how to create detectors:

  1. on a single Synthetic test
  2. on a single KPI in RUM
  3. on a dashboard chart

For more Detector resources, please see our Observability docs, Lantern, and consider an Education course if you’d like to go more in depth with instructor guidance.


Subsections of Detectors

Test Detectors

Why would we want a detector on a single Synthetic test? Some examples:

  • The endpoint, API transaction, or browser journey is highly critical
  • We have deployed code changes and want to know if the resulting KPI is or is not as we expect
  • We need to temporarily keep a close eye on a specific change we are testing and don’t want to create a lot of noise, and will disable the detector later
  • We want to know about unexpected issues before a real user encounters them
  1. On the test overview page, click Create Detector on the top right. create detector on a single synthetic test

  2. Name the detector with your team name and your initials and LCP (the signal we will eventually use), so that the instructor can better keep track of everyone’s progress.

  3. Change the signal to First byte time. change the signal

  4. Change the alert details, and see how the chart to the right shows the amount of alert events under those conditions. This is where you can decide how much alert noise you want to generate, based on how much your team tolerates. Play with the settings to see how they affect estimated alert noise. alert noise preview

  5. Now, change the signal to Largest contentful paint. This is a key web vital related to the user experience as it relates to loading time. Change the threshold to 2500ms. It’s okay if there is no sample alert event in the detector preview.

  6. Scroll down in this window to see the notification options, including severity and recipients. notification options

  7. Click the notifications link to customize the alert subject, message, tip, and runbook link. notification customization dialog

  8. When you are happy with the amount of alert noise this detector would generate, click Activate. activate the detector


RUM Detectors

Let’s say we want to know about an issue in production without waiting for a ticket from our support center. This is where creating detectors in RUM will be helpful for us.

  1. Go to the RUM overview of our App. Scroll to the LCP chart, click the chart menu icon, and click Create Detector. RUM LCP chart with action menu flyout

  2. Rename the detector to include your team name and initials, and change the scope of the detector to App so we are not limited to a single URL or page. Change the threshold and sensitivity until there is at least one alert event in the time frame. RUM alert details

  3. Change the alert severity and add a recipient if you’d like, and click Activate to save the Detector.

Exercise

Now, your workshop instructor will change something on the website. How do you find out about the issue, and how do you investigate it?

Tip

Wait a few minutes, and take a look at the online store homepage in your browser. How is the experience in an incognito browser window? How is it different when you refresh the page?


Chart Detectors

With our custom dashboard charts, we can create detectors focused directly on the data and conditions we care about. In building our charts, we also built signals that can trigger alerts.

Static detectors

For many KPIs, we have a static value in mind as a threshold.

  1. In your custom End User Experience dashboard, go to the “LCP - all tests” chart.

  2. Click the bell icon on the top right of the chart, and select “New detector from chart”. new detector from chart

  3. Change the detector name to include your team name and initials, and adjust the alert details. Change the threshold to 2500 or 4000 and see how the alert noise preview changes. alert details

  4. Change the severity, and add yourself as a recipient before you save this detector. Click Activate. activate the detector

Advanced: Dynamic detectors

Sometimes we have metrics that vary naturally, so we want to create a more dynamic detector that isn’t limited by the static threshold we decide in the moment.

  1. To create dynamic detectors on your chart, click the link to the “old” detector wizard. click the link to open this detector in the old editor

  2. Change the detector name to include your team name and initials, and click Create alert rule. name and create alert rule

  3. Confirm the signal looks correct and proceed to Alert condition. alert signal details

  4. Select the “Sudden Change” condition and proceed to Alert settings. list of dynamic conditions

  5. Play with the settings and see how the estimated alert noise is previewed in the chart above. Tune the settings, and change the advanced settings if you’d like, before proceeding to the Alert message. alert noise marked on chart based on settings

  6. Customize the severity, runbook URL, any tips, and message payload before proceeding to add recipients. alert message options

  7. For the sake of this workshop, only add your own email address as recipient. This is where you would add other options like webhooks, ticketing systems, and Slack channels if it’s in your real environment. recipient options

  8. Finally, confirm the detector name before clicking Activate Alert Rule. activate alert rule button


Summary

2 minutes  

In this workshop, we learned the following:

  • How to create simple synthetic tests so that we can quickly begin to understand the availability and performance of our application
  • How to understand what RUM shows us about the end user experience, including specific user sessions
  • How to write advanced synthetic browser tests to proactively test our most important user actions
  • How to visualize our frontend performance data in context with events on dashboards
  • How to set up detectors so we don’t have to wait to hear about issues from our end users
  • How all of the above, plus Splunk and Google’s resources, helps us optimize end user experience

There is a lot more we can do with front end performance monitoring. If you have extra time, be sure to play with the charts, detectors, and do some more synthetic testing. Remember our resources such as Lantern, Splunk Docs, and experiment with apps for Mobile RUM.

This is just the beginning! If you need more time to trial Splunk Observability, or have any other questions, reach out to a Splunk Expert.


Resources

Frequently Asked Questions

A collection of the common questions and their answers associated with Observability, DevOps, Incident Response and Splunk On-Call.

Dimension, Properties and Tags

One conversation that frequently comes up is Dimensions vs Properties and when you should use one versus the other.

Naming Conventions for Tagging with OpenTelemetry and Splunk

When deploying OpenTelemetry in a large organization, it’s critical to define a standardized naming convention for tagging, and a governance process to ensure the convention is adhered to.


Subsections of Resources

Frequently Asked Questions

A collection of the common questions and their answers associated with Observability, DevOps, Incident Response and Splunk On-Call.

Q: Alerts v. Incident Response v. Incident Management

A: Alerts, Incident Response and Incident Management are related functions. Together they comprise the incident response and resolution process.

Monitoring and Observability tools send alerts to incident response platforms. Those platforms take a collection of alerts and correlate them into incidents.

Those incidents are recorded in incident management (ITSM) platforms for the record. Alerts are the trigger that something has happened, and provide context to an incident.

Incidents consist of the alert payload, all activity associated with the incident from the time it was created, and the on-call policies to be followed. ITSM is the system of record for incidents that are active and after they have been resolved.

All these components are necessary for successful incident response and management practices.

On-Call

Q: Is Observability Monitoring?

A: The key difference between Monitoring and Observability is the difference between “known knowns” and “unknown unknowns”, respectively.

In monitoring the operator generally has prior knowledge of the architecture and elements in their system. They can reliably predict the relationship between elements, and their associated metadata. Monitoring is good for stateful infrastructure that is not frequently changed.

Observability is for systems where the operator’s ability to predict and trace all elements in the system and their relationships is limited.

Observability is a set of practices and technology, which include traditional monitoring metrics.

These practices and technologies combined give the operator the ability to understand ephemeral and highly complex environments without prior knowledge of all elements of a system. Observability technology can also account for fluctuations in the environment, and variation in metadata (cardinality) better than traditional monitoring which is more static.

Observability

Q: What are Traces and Spans

A: Traces and spans, combined with metrics and logs, make up the core types of data that feed modern Observability tools. They all have specific elements and functions, but work well together.

Because microservices-based architectures are distributed, transactions in the system touch multiple services before completing. This makes accurately pinpointing the location of an issue difficult. Traces are a method for tracking the full path of a request through all the services in a distributed system. Spans are the timed operations within each service; traces are the connective tissue for the spans, and together they give more detail on individual service processes.

While metrics give a good snapshot of the health of a system, and logs give depth when investigating issues, traces and spans help navigate operators to the source of issues with greater context. This saves time when investigating incidents and supports the increasing complexity of modern architectures.
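
To make the trace/span relationship concrete, here is a minimal sketch using the OpenTelemetry Python SDK. The span names and the console exporter are illustrative only; a real deployment would export to an OTLP endpoint or rely on the defaults of a vendor distribution.

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

    # Wire up a tracer that prints finished spans to the console.
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("faq.example")

    # One trace: the root span covers the whole request, and the child
    # spans time the individual operations. All spans share a trace ID.
    with tracer.start_as_current_span("checkout"):           # root span
        with tracer.start_as_current_span("reserve-stock"):  # child span
            pass
        with tracer.start_as_current_span("charge-card"):    # child span
            pass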

APM

Q: What is the Sidecar Pattern?

A: The sidecar pattern is a design pattern in which related services are deployed together and connected directly by shared infrastructure. The related service can add functionality to, or support, the application logic it is connected to. It is used heavily as a method for deploying agents associated with the management plane alongside the application services they support.

In Observability, the sidecar services are the application logic and the agent collecting data from that service. The setup requires two containers: one with the application service, and one running the agent. The containers share a pod, and resources such as disk, network, and namespace. They are also deployed together and share the same lifecycle.

Observability

Dimension, Properties and Tags

Applying context to your metrics

One conversation that frequently comes up is Dimensions vs. Properties, and when you should use one versus the other. Instead of starting with their descriptions, it makes sense to understand how we use them and how they are similar before diving into their differences and examples of why you would use one or the other.

How are Dimensions and Properties similar?

The simplest answer is that they are both metadata key:value pairs that add context to our metrics. Metrics themselves are what we actually want to measure, whether it's a standard infrastructure metric like cpu.utilization or a custom metric like the number of API calls received.

If we receive a value of 50% for the cpu.utilization metric without knowing where it came from or any other context, it is just a number and not useful to us. At a minimum, we would need to know what host it comes from.

These days we often care more about the performance or utilization of a cluster or data center as a whole than that of an individual host. We are therefore more interested in things like the average cpu.utilization across a cluster of hosts, whether a host's cpu.utilization is an outlier compared to other hosts running the same service, or how the average cpu.utilization of one environment compares to another.

To be able to slice, aggregate, or group our cpu.utilization metrics in this way, the metrics we receive need additional metadata covering which cluster a host belongs to, what service is running on the host, and which environment it is part of. This metadata can be in the form of either dimension or property key:value pairs.

For example, if I go to apply a filter to a dashboard or use the Group by function when running analytics, I can use a property or a dimension.

So how are Dimensions and Properties different?

Dimensions are sent in with metrics at the time of ingest, while properties are applied to metrics or dimensions after ingest. This means that any metadata you need to make a datapoint (a single reported value of a metric) unique, such as which host a value of cpu.utilization is coming from, needs to be a dimension. Metric names + dimensions uniquely define an MTS (metric time series).

Example: the cpu.utilization metric sent by a particular host (server1) with the dimension host:server1 would be considered a unique time series. If you have 10 servers, each sending that metric, you would have 10 time series, with each time series sharing the metric name cpu.utilization and uniquely identified by its dimension key-value pair (host:server1, host:server2 … host:server10).

However, if your server names are only unique within a datacenter rather than across your whole environment, you would need to add a second dimension, dc, for the datacenter location. You could now have double the number of possible MTSs: cpu.utilization metrics received would be uniquely identified by two sets of dimension key-value pairs.

cpu.utilization plus dc:east & host:server1 would create a different time series than cpu.utilization plus dc:west & host:server1.
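
To see how dimensions travel with every datapoint, here is a minimal sketch using the signalfx-python client library (see the Client Libraries documentation referenced at the end of this page). The token and values are placeholders.

    # pip install signalfx
    import signalfx

    # <INGEST_TOKEN> is a placeholder for your organization's ingest token.
    client = signalfx.SignalFx().ingest('<INGEST_TOKEN>')
    try:
        # Same metric name, different dimension combinations: each unique
        # set of dimensions below creates its own MTS.
        client.send(gauges=[
            {'metric': 'cpu.utilization', 'value': 48,
             'dimensions': {'host': 'server1', 'dc': 'east'}},
            {'metric': 'cpu.utilization', 'value': 53,
             'dimensions': {'host': 'server1', 'dc': 'west'}},
        ])
    finally:
        client.stop()  # flush queued datapoints before exiting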

Dimensions are immutable while properties are mutable

As mentioned above, metric name + dimensions make a unique MTS. Therefore, if a dimension value changes, we have a new unique combination of metric name + dimension value, and a new MTS is created.

Properties, on the other hand, are applied to metrics (or dimensions) after they are ingested. If you apply a property to a metric, it propagates to all MTSs that the metric is a part of. If you apply a property to a dimension, say host:server1, then all metrics from that host will have those properties attached. If you change the value of a property, it propagates and updates the value of the property on all MTSs with that property attached. Why is this important? It means that if you care about the historical value of a property, you need to make it a dimension.

Example: We are collecting custom metrics on our application. One metric is latency, which measures the latency of requests made to our application. We have a dimension customer, so we can sort and compare latency by customer. We decide we want to track the application version as well, so we can sort and compare our application latency by the version customers are using. We create a property version that we attach to the customer dimension. Initially all customers are using application version 1, so version:1.

We now have some customers using version 2 of our application; for those customers we update the property to version:2. When we update the value of the version property for those customers, it propagates down to all MTSs for each customer. We lose the history that those customers at some point used version 1, so if we wanted to compare the latency of version 1 and version 2 over a historical period, we would not get accurate data. In this case, even though we don't need the application version to make our metric time series unique, we need to make version a dimension, because we care about its historical value.

So when should something be a Property instead of a Dimension?

The first reason is if there is metadata you want attached to metrics, but that you don't know at the time of ingest. The second reason is the best practice: if it doesn't need to be a dimension, make it a property. Why? One reason is that today there is a limit of 5K MTSs per analytics job or chart rendering, and the more dimensions you have, the more MTSs you will create. Properties are completely free-form and let you add as much information as you want or need to metrics or dimensions without adding to MTS counts.

As dimensions are sent in with every datapoint, the more dimensions you have, the more data you send to us, which could mean higher costs to you if your cloud provider charges for data transfer.

A good example of things that should be properties is additional host information. You want to be able to see things like machine_type, processor, or os, but instead of making these dimensions and sending them with every metric from a host, you could make them properties and attach them to the host dimension.

Example: for host:server1 you would set the properties machine_type:ucs, processor:xeon-5560, and os:rhel71. Any time a metric comes in with the dimension host:server1, all the above properties are applied to it automatically.
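
As a sketch of how that could look programmatically, the following uses the /dimension/:key/:value API referenced under “Additional information” below. The realm, token, and exact request body are assumptions; consult the API documentation for the authoritative schema.

    import requests

    REALM = 'us1'            # placeholder: your Splunk Observability realm
    TOKEN = '<API_TOKEN>'    # placeholder: a token with metadata write access

    # Attach properties to the dimension host:server1.
    resp = requests.put(
        f'https://api.{REALM}.signalfx.com/v2/dimension/host/server1',
        headers={'X-SF-TOKEN': TOKEN, 'Content-Type': 'application/json'},
        json={
            'key': 'host',
            'value': 'server1',
            'customProperties': {
                'machine_type': 'ucs',
                'processor': 'xeon-5560',
                'os': 'rhel71',
            },
        },
    )
    resp.raise_for_status()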

Some other example use cases for properties are tracking who the escalation contact is for each service, or the SLA level for every customer. You do not need these items to make metrics uniquely identifiable, and you don't care about their historical values, so they can be properties. The properties could be added to the service and customer dimensions, and would then apply to all metrics and MTSs with those dimensions.

What about Tags?

Tags are the third type of metadata that can be used to give context to or help organize your metrics. Unlike dimensions and properties, tags are NOT key:value pairs. Tags can be thought of as labels or keywords. Similar to properties, tags are applied to your data after ingest, via the Catalog in the UI or programmatically via the API. Tags can be applied to metrics, dimensions, or other objects such as detectors.

Where would I use Tags?

Tags are used when there is a need for a many-to-one relationship of tags to an object or a one-to-many relationship between the tag and the objects you are applying them to. They are useful for grouping together metrics that may not be intrinsically associated.

For example, suppose you have hosts that run multiple applications. You can create a tag (label) for each application and then apply multiple tags to each host to indicate which applications are running on it.

Example: Server1 runs 3 applications. You create tags app1, app2 and app3 and apply all 3 tags to the dimension host:server1.

To expand on the example above, let's say you also collect metrics from the applications themselves. You could apply the tags you created to any metrics coming in from those applications. You can then filter on a tag, allowing you to filter on an application and get the full picture, including both the application metrics and the relevant host metrics.

Example: App1 sends in metrics with the dimension service:application1. You would apply the tag app1 to the dimension service:application1. You can then filter on the tag app1 in charts and dashboards.
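
Programmatically, the same hypothetical /dimension/:key/:value call sketched earlier for properties can carry tags as well; again, verify the schema against the API documentation.

    import requests

    # Placeholders as in the properties sketch above.
    resp = requests.put(
        'https://api.us1.signalfx.com/v2/dimension/service/application1',
        headers={'X-SF-TOKEN': '<API_TOKEN>', 'Content-Type': 'application/json'},
        json={'key': 'service', 'value': 'application1', 'tags': ['app1']},
    )
    resp.raise_for_status()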

Another use case for tags is binary states, where there is just one possible value. For example, you do canary testing, and when you do a canary deployment you want to be able to mark the hosts that received the new code, so you can easily identify their metrics and compare their performance against the hosts that did not receive the new code. There is no need for a key:value pair, as there is just the single value “canary”.

Be aware that while you can filter on tags, you cannot use the groupBy function on them. The groupBy function is run by supplying the key part of a key:value pair, and the results are then grouped by the values of that key.

Additional information

For information on sending dimensions for custom metrics, please review the Client Libraries documentation for your library of choice.

For information on how to apply properties and tags to metrics or dimensions via the API, please see the API documentation for /metric/:name and /dimension/:key/:value.

For information on how to add or edit properties and tags via the Metadata Catalog in the UI, please reference the section Add or edit metadata in Search the Metric Finder and Metadata Catalog.


Naming Conventions for Tagging with OpenTelemetry and Splunk

Introduction

When deploying OpenTelemetry in a large organization, it’s critical to define a standardized naming convention for tagging, and a governance process to ensure the convention is adhered to.

This ensures that MELT data collected via OpenTelemetry can be efficiently utilized for alerting, dashboarding, and troubleshooting purposes. It also ensures that users of Splunk Observability Cloud can quickly find the data they’re looking for.

Naming conventions also ensure that data can be aggregated effectively. For example, if we wanted to count the number of unique hosts by environment, then we must use a standardized convention for capturing the host and environment names.

Attributes vs. Tags

Before we go further, let’s make a note regarding terminology. Tags in OpenTelemetry are called “attributes”. Attributes can be attached to metrics, logs, and traces, either via manual instrumentation or automated instrumentation.

Attributes can also be attached to metrics, logs, and traces at the OpenTelemetry collector level, using various processors such as the Resource Detection processor.

Once traces with attributes are ingested into Splunk Observability Cloud, they are available as “tags”. Optionally, attributes collected as part of traces can be used to create Troubleshooting Metric Sets, which can in turn be used with various features such as Tag Spotlight.

Alternatively, attributes can be used to create Monitoring Metric Sets, which can be used to drive alerting.

Resource Semantic Conventions

OpenTelemetry resource semantic conventions should be used as a starting point when determining which attributes an organization should standardize on. In the following sections, we’ll review some of the more commonly used attributes.

Service Attributes

A number of attributes are used to describe the service being monitored.

service.name is a required attribute that defines the logical name of the service. It's added automatically by the OpenTelemetry SDK but can be customized. It's best to keep this simple (e.g. inventoryservice would be better than inventoryservice-prod-hostxyz, as other attributes can be utilized to capture other aspects of the service instead).

The following service attributes are recommended (a configuration sketch follows the list):

  • service.namespace can be used to identify the team that owns the service
  • service.instance.id is used to identify a unique instance of the service
  • service.version is used to identify the version of the service
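
As a minimal sketch, these attributes can be set on the SDK's resource; here in Python, with invented example values for the names and versions:

    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider

    # Every span produced by this provider carries these resource attributes.
    resource = Resource.create({
        "service.name": "inventoryservice",
        "service.namespace": "team-fulfillment",
        "service.instance.id": "inventoryservice-6b7c9-xkq2p",
        "service.version": "1.4.2",
    })
    trace.set_tracer_provider(TracerProvider(resource=resource))

The same attributes can typically also be supplied without code changes via the OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES environment variables.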

Telemetry SDK

These attributes are set automatically by the SDK, to record information about the instrumentation libraries being used:

  • telemetry.sdk.name is typically set to opentelemetry
  • telemetry.sdk.language is the language of the SDK, such as java
  • telemetry.sdk.version identifies which version of the SDK is utilized

Containers

For services running in containers, there are numerous attributes used to describe the container runtime, such as container.id, container.name, and container.image.name. The full list can be found here.

Hosts

These attributes describe the host where the service is running, and include attributes such as host.id, host.name, and host.arch. The full list can be found here.

Deployment Environment

The deployment.environment attribute is used to identify the environment where the service is deployed, such as staging or production.

Splunk Observability Cloud uses this attribute to enable related content (as described here), so it’s important to include it.

Cloud

There are also attributes to capture information for services running in public cloud environments, such as AWS. Attributes include cloud.provider, cloud.account.id, and cloud.region.

The full list of attributes can be found here.

Some cloud providers, such as GCP, define semantic conventions specific to their offering.

Kubernetes

There are a number of standardized attributes for applications running in Kubernetes as well. Many of these are added automatically by Splunk’s distribution of the OpenTelemetry collector, as described here.

These attributes include k8s.cluster.name, k8s.node.name, k8s.pod.name, k8s.namespace.name, and kubernetes.workload.name.

Best Practices for Creating Custom Attributes

Many organizations require attributes that go beyond what’s defined in OpenTelemetry’s resource semantic conventions.

In this case, it’s important to avoid naming conflicts with attribute names already included in the semantic conventions. So it’s a good idea to check the semantic conventions before deciding on a particular attribute name for your naming convention.

In addition to a naming convention for attribute names, you also need to consider attribute values. For example, if you'd like to capture the particular business unit to which an application belongs, then you'll also want a standardized list of business unit values to choose from, to facilitate effective filtering.

The OpenTelemetry community provides guidelines that should be followed when naming attributes as well, which can be found here.

The Recommendations for Application Developers section is most relevant to our discussion.

They recommend (a short example follows the list):

  • Prefixing the attribute name with your company’s domain name, e.g. com.acme.shopname (if the attribute may be used outside your company as well as inside).
  • Prefixing the attribute name with the application name, if the attribute is unique to a particular application and is only used within your organization.
  • Not using existing OpenTelemetry semantic convention names as a prefix for your attribute name.
  • Submitting a proposal to add your attribute name to the OpenTelemetry specification, if there’s a general need for it across different organizations and industries.
  • Avoiding attribute names that start with otel.*, as this prefix is reserved for OpenTelemetry specification usage.
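
For instance, a custom attribute following these guidelines might look like the sketch below; the com.acme.* names and values are invented for illustration:

    from opentelemetry import trace

    tracer = trace.get_tracer("acme.checkout")

    with tracer.start_as_current_span("process-order") as span:
        # The company-domain prefix avoids collisions with semantic
        # conventions and with other organizations' attributes.
        span.set_attribute("com.acme.business_unit", "payments")
        span.set_attribute("com.acme.cost_center", "cc-1234")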

Metric Cardinality Considerations

One final thing to keep in mind when deciding on naming standards for attribute names and values is related to metric cardinality.

Metric cardinality is defined as the number of unique metric time series (MTS) produced by a combination of metric names and their associated dimensions.

A metric has high cardinality when it has a high number of dimension keys and a high number of possible unique values for those dimension keys.

For example, suppose your application sends in data for a metric named custom.metric. In the absence of any attributes, custom.metric would generate a single metric time series (MTS).

On the other hand, if custom.metric includes an attribute named customer.id, and there are thousands of customer ID values, this would generate thousands of metric time series, which may impact costs and query performance.
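
The sketch below illustrates this with the OpenTelemetry Python metrics API, reusing the metric and attribute names from the example above (the console exporter is just for demonstration): each distinct attribute value produces its own time series.

    from opentelemetry import metrics
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import (
        ConsoleMetricExporter,
        PeriodicExportingMetricReader,
    )

    reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
    meter = metrics.get_meter("cardinality.example")

    counter = meter.create_counter("custom.metric")

    # Without attributes: one MTS. With customer.id attached, every
    # distinct customer value below becomes a separate MTS.
    for customer_id in ("cust-0001", "cust-0002", "cust-0003"):
        counter.add(1, {"customer.id": customer_id})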

Splunk Observability Cloud provides a report that allows for the management of metrics usage, and rules can be created to drop undesirable dimensions. However, the first line of defense is understanding how attribute name and value combinations can drive increased metric cardinality.

Summary

In this document, we highlighted the importance of defining naming conventions for OpenTelemetry tags, preferably before starting a large rollout of OpenTelemetry instrumentation.

We discussed how OpenTelemetry’s resource semantic conventions define the naming conventions for several attributes, many of which are automatically collected via the OpenTelemetry SDKs, as well as processors that run within the OpenTelemetry collector.

Finally, we shared some best practices for creating your attribute names, for situations where the resource semantic conventions are not sufficient for your organization’s needs.