Financial Services Observability Cloud

30 minutes | Authors: Robert Castley, Pieter Hagen, Jeremy Hicks, Deepti Bhutani

In this workshop, we’ll demonstrate how Splunk Observability Cloud delivers value to financial services customers by providing real-time, full-fidelity, AI-powered monitoring across your entire digital ecosystem, from infrastructure to applications to user experiences. It’s purpose-built for modern, cloud-native, microservices-based environments. You’ll have the opportunity to explore some of the platform’s most powerful features, which set it apart from other observability solutions:

  • Infrastructure Monitoring
  • Complete end-to-end trace visibility with NoSample Full-fidelity Application Performance Monitoring (APM)
  • No-code log querying
  • Root cause analysis with tag analytics and error stacks
  • Related Content for seamless navigation between components

One of the core strengths of Splunk Observability Cloud is its ability to unify telemetry data, creating a comprehensive picture of both the end-user experience and your entire application stack.

The workshop will focus on a microservices-based Wire transfer application deployed on Kubernetes. Users can initiate a wire transfer and all of the appropriate checks for user, balance, and compliance will be handled. This application is fully instrumented with OpenTelemetry to capture detailed performance data.

What is OpenTelemetry?
OpenTelemetry is an open-source collection of tools, APIs, and software development kits (SDKs) designed to help you instrument, generate, collect, and export telemetry data—such as metrics, traces, and logs. This data enables in-depth analysis of your software’s performance and behavior.
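To make the data model concrete, here is a rough sketch of what a span (the building block of a trace) looks like, using plain Python dicts rather than the actual OpenTelemetry SDK types. The field names follow OpenTelemetry conventions, but this is an illustration, not SDK code:

```python
import time
import uuid

def make_span(name, trace_id=None, parent_id=None, attributes=None):
    """Build a minimal span-like record. Real OpenTelemetry SDKs manage
    IDs, timing, context propagation, and export; this shows the shape."""
    return {
        "name": name,
        "trace_id": trace_id or uuid.uuid4().hex,  # shared by all spans in one trace
        "span_id": uuid.uuid4().hex[:16],          # unique per operation
        "parent_span_id": parent_id,               # None for the root span
        "start_time": time.time(),
        "attributes": attributes or {},            # e.g. {"http.method": "POST"}
    }

# A root span for a request, and a child span for one operation inside it.
root = make_span("POST /wiretransfer", attributes={"http.method": "POST"})
child = make_span("check-balance", trace_id=root["trace_id"],
                  parent_id=root["span_id"])

assert child["trace_id"] == root["trace_id"]  # same trace, linked operations
```

The shared `trace_id` is what lets a backend such as Splunk Observability Cloud stitch individual operations back into one end-to-end transaction.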

The OpenTelemetry community is growing rapidly, supported by leading companies like Splunk, Google, Microsoft, and Amazon. It currently has the second-largest number of contributors within the Cloud Native Computing Foundation, following only Kubernetes.

Full Stack

Last Modified May 21, 2025

Subsections of Financial Services Observability Cloud

Workshop Overview

2 minutes  

Introduction
The goal of this workshop is to give you hands-on experience troubleshooting an issue using Splunk Observability Cloud to identify its root cause. We’ve provided a fully instrumented microservices-based application that mimics a wire transfer workflow, running on Kubernetes, which sends metrics, traces, and logs to Splunk Observability Cloud for real-time analysis.

Who Should Attend?
This workshop is ideal for anyone looking to gain practical knowledge of Splunk Observability. It’s designed for individuals with little or no prior experience with the platform.

What You’ll Need
All you need is your laptop and a browser with access to external websites. The workshop can be attended either in-person or via Zoom. If you don’t have the Zoom client installed, you can still join using your browser.

Workshop Overview
In this 3-hour session, we’ll cover the fundamentals of Splunk Observability—the only platform offering streaming analytics and NoSample Full Fidelity distributed tracing—in an interactive, hands-on setting. Here’s what you can expect:

  • OpenTelemetry
    Learn why OpenTelemetry is essential for modern observability and how it enhances visibility into your systems.

  • Tour of the Splunk Observability User Interface
    Take a guided tour of Splunk Observability Cloud’s interface, where we’ll show you how to navigate the key components: APM, Log Observer, and Infrastructure Monitoring.

  • Splunk Application Performance Monitoring (APM)
    Gain end-to-end visibility of your customers’ request path using APM traces. You’ll explore how telemetry from various services is captured and visualized in Splunk Observability Cloud, helping you detect anomalies and errors.

  • Splunk Log Observer (LO)
    Learn how to leverage the “Related Content” feature to easily navigate between components. In this case, we’ll move from an APM trace to the related logs for deeper insight into issues.

By the end of this session, you’ll have gained practical experience with Splunk Observability Cloud and a solid understanding of how to troubleshoot and resolve issues across your application stack.

Last Modified May 21, 2025

What is OpenTelemetry & why should you care?

2 minutes  

OpenTelemetry

With the rise of cloud computing, microservices architectures, and ever-more complex business requirements, the need for Observability has never been greater. Observability is the ability to understand the internal state of a system by examining its outputs. In the context of software, this means being able to understand the internal state of a system by examining its telemetry data, which includes metrics, traces, and logs.

To make a system observable, it must be instrumented. That is, the code must emit traces, metrics, and logs. The instrumented data must then be sent to an Observability back-end such as Splunk Observability Cloud.

  • Metrics: Do I have a problem?
  • Traces: Where is the problem?
  • Logs: What is the problem?

OpenTelemetry does two important things:

  • Allows you to own the data that you generate rather than be stuck with a proprietary data format or tool.
  • Allows you to learn a single set of APIs and conventions.

Together, these give teams and organizations the flexibility they need in today’s modern computing world.

There are a lot of variables to consider when getting started with Observability, including the all-important question: “How do I get my data into an Observability tool?”. The industry-wide adoption of OpenTelemetry makes this question easier to answer than ever.

Why Should You Care?

OpenTelemetry is completely open-source and free to use. In the past, monitoring and Observability tools relied heavily on proprietary agents, meaning that changing or adding tooling required extensive changes across systems, from the infrastructure level to the application level.

Since OpenTelemetry is vendor-neutral and supported by many industry leaders in the Observability space, adopters can switch between supported Observability tools at any time with minor changes to their instrumentation. This is true regardless of which distribution of OpenTelemetry is used – like with Linux, the various distributions bundle settings and add-ons but are all fundamentally based on the community-driven OpenTelemetry project.

Splunk has fully committed to OpenTelemetry so that our customers can collect and use ALL their data, in any type, any structure, from any source, on any scale, and all in real-time. OpenTelemetry is fundamentally changing the monitoring landscape, enabling IT and DevOps teams to bring data to every question and every action. You will experience this during these workshops.

OpenTelemetry Logo

Last Modified May 20, 2025

UI - Quick Tour 🚌

We are going to start with a short walkthrough of the various components of Splunk Observability Cloud. The aim of this is to get you familiar with the UI.

  1. Signing in to Splunk Observability Cloud
  2. Application Performance Monitoring (APM)
  3. Infrastructure Monitoring
  4. Log Observer
Tip

The easiest way to navigate through this workshop is by using:

  • the left/right arrows (< | >) on the top right of this page
  • the left (◀) and right (▶) cursor keys on your keyboard
Last Modified May 20, 2025

Subsections of 3. UI - Quick Tour

Getting Started

2 minutes  

1. Sign in to Splunk Observability Cloud

You should have received an e-mail from Splunk inviting you to the Workshop Org. This e-mail will look like the screenshot below; if you cannot find it, please check your Spam/Junk folders or inform your instructor. You can also check for other solutions in our login FAQ.

To proceed click the Join Now button or click on the link provided in the e-mail.

If you have already completed the registration process you can skip the rest and proceed directly to Splunk Observability Cloud and log in:

email

If this is your first time using Splunk Observability Cloud, you will be presented with the registration form. Enter your full name and desired password. Please note that the password requirements are:

  • Must be between 8 and 32 characters
  • Must contain at least one capital letter
  • Must have at least one number
  • Must have at least one symbol (e.g. !@#$%^&*()_+)

Click the checkbox to agree to the terms and conditions and click the SIGN IN NOW button.
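As a quick illustration, the four password rules above could be checked like this. This is a hedged sketch for clarity only; the real validation is performed by the registration form itself:

```python
import re

def password_ok(pw: str) -> bool:
    """Check the workshop registration password rules."""
    return (
        8 <= len(pw) <= 32                              # between 8 and 32 characters
        and re.search(r"[A-Z]", pw) is not None          # at least one capital letter
        and re.search(r"[0-9]", pw) is not None          # at least one number
        and re.search(r"[^A-Za-z0-9]", pw) is not None   # at least one symbol
    )

print(password_ok("Observe2025!"))  # True
print(password_ok("short1!"))       # False: fewer than 8 characters
```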

User-Setup

Last Modified May 20, 2025

Subsections of 1. Getting Started

Home Page

5 minutes  

After you have registered and logged into Splunk Observability Cloud you will be taken to the home or landing page. Here, you will find several useful features to help you get started.

home page

  1. Explore your data pane: Displays which integrations are enabled and allows you to add additional integrations if you are an Administrator.
  2. Documentation pane: Training videos and links to documentation to get you started with Splunk Observability Cloud.
  3. Recents pane: Recently created/visited dashboards and/or detectors for quick access.
  4. Main Menu pane: Navigate the components of Splunk Observability Cloud.
  5. Org Switcher: Easily switch between Organizations (if you are a member of more than one Organization).
  6. Expand/Contract Main Menu: Expand » / Collapse « the main menu if space is at a premium.

Let’s start with our first exercise:

Exercise
  • Expand the Main Menu and click on Settings.
  • Check in the Org Switcher if you have access to more than one Organization.
Tip

If you have used Splunk Observability before, you may be placed in an Organization you have used previously. Make sure you are in the correct workshop organization. Verify this with your instructor if you have access to multiple Organizations.

Exercise
  • Click Onboarding Guidance (Here you can toggle the visibility of the onboarding panes. This is useful if you know the product well enough, and can use the space to show more information).
  • Hide the Onboarding Content for the Home Page.
  • At the bottom of the menu, select your preferred appearance: Light, Dark or Auto mode.
  • Did you also notice this is where the Sign Out option is? Please don’t 😊 !
  • Click < to get back to the main menu.

Next, let’s check out Splunk Application Performance Monitoring (APM).

Last Modified May 20, 2025

Application Performance Monitoring Overview

5 minutes  

Splunk APM provides NoSample end-to-end visibility of every service and its dependencies to solve problems quicker across monoliths and microservices. Teams can immediately detect problems from new deployments, confidently troubleshoot by scoping and isolating the source of an issue, and optimize service performance by understanding how back-end services impact end users and business workflows.

Real-time monitoring and alerting: Splunk provides out-of-the-box service dashboards and automatically detects and alerts on RED metrics (rate, error and duration) when there is a sudden change.
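To make RED metrics concrete, here is a minimal sketch of how rate, error percentage, and median duration could be derived from raw request data. The record shape and field names are assumptions for illustration; Splunk APM computes these for you automatically:

```python
from statistics import median

def red_metrics(requests, window_seconds):
    """Compute RED metrics from a list of request records, each a dict
    like {"error": bool, "duration_ms": float}."""
    rate = len(requests) / window_seconds                      # requests per second
    errors = sum(1 for r in requests if r["error"])
    error_pct = 100.0 * errors / len(requests) if requests else 0.0
    p50 = median(r["duration_ms"] for r in requests)           # median latency
    return {"rate": rate, "error_pct": error_pct, "p50_ms": p50}

reqs = [
    {"error": False, "duration_ms": 120.0},
    {"error": True,  "duration_ms": 900.0},
    {"error": False, "duration_ms": 80.0},
    {"error": False, "duration_ms": 100.0},
]
print(red_metrics(reqs, window_seconds=60))
# roughly 0.067 req/s, 25% errors, p50 of 110.0 ms
```

A sudden change in any of these three numbers is exactly what the out-of-the-box detectors alert on.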

Dynamic telemetry maps: Easily visualize service performance in modern production environments in real-time. End-to-end visibility of service performance from infrastructure, applications, end users, and all dependencies helps quickly scope new issues and troubleshoot more effectively.

Intelligent tagging and analysis: View all tags from your business, infrastructure and applications in one place to easily compare new trends in latency or errors to their specific tag values.

AI-directed troubleshooting identifies the most impactful issues: Instead of manually digging through individual dashboards, isolate problems more efficiently. Automatically identify anomalies and the sources of errors that impact services and customers the most.

Complete distributed tracing analyses every transaction: Identify problems in your cloud-native environment more effectively. Splunk distributed tracing visualizes and correlates every transaction from the back-end and front-end in context with your infrastructure, business workflows and applications.

Full stack correlation: Within Splunk Observability, APM links traces, metrics, logs and profiling together to easily understand the performance of every component and its dependency across your stack.

Monitor database query performance: Easily identify how slow and high execution queries from SQL and NoSQL databases impact your services, endpoints and business workflows — no instrumentation required.

Architecture Overview

Last Modified May 21, 2025

Subsections of 2. APM Overview

Application Performance Monitoring Home page

Click APM in the main menu. The APM Home Page is made up of 3 distinct sections:

APM page

  1. Onboarding Pane: Training videos and links to documentation to get you started with Splunk APM.
  2. APM Overview Pane: Real-time metrics for the Top Services and Top Business Workflows.
  3. Functions Pane: Links for deeper analysis of your services, tags, traces, database query performance and code profiling.

The APM Overview pane provides a high-level view of the health of your application. It includes a summary of the services, latency, and errors in your application, along with the top services and top business workflows by error rate. (A business workflow is the start-to-finish journey of the collection of traces associated with a given activity or transaction; it enables monitoring of end-to-end KPIs and identifying root causes and bottlenecks.)

About Environments

To easily differentiate between multiple applications, Splunk uses environments. The naming convention for workshop environments is [NAME OF WORKSHOP]-workshop. Your instructor will provide you with the correct one to select.

Exercise
  • Verify that the time window we are working with is set to the last 15 minutes (-15m).
  • Change the environment to the workshop one by selecting its name from the drop-down box and make sure that is the only one selected.

What can you conclude from the Top Services by Error Rate chart?

The wire-transfer-service has a high error rate

If you scroll down the Overview Page you will notice some services listed have Inferred Service next to them.

Splunk APM can infer the presence of a remote service, or inferred service, if the span calling the remote service has the necessary information. Examples of possible inferred services include databases, HTTP endpoints, and message queues. Inferred services are not instrumented, but they are displayed on the service map and in the service list.

Next, let’s check out Splunk Log Observer (LO).

Last Modified May 21, 2025

Log Observer Overview

5 minutes  

Log Observer Connect allows you to seamlessly bring in the same log data from your Splunk Platform into an intuitive and no-code interface designed to help you find and fix problems quickly. You can easily perform log-based analysis and seamlessly correlate your logs with Splunk Infrastructure Monitoring’s real-time metrics and Splunk APM traces in one place.

End-to-end visibility: Combine the powerful logging capabilities of Splunk Platform with Splunk Observability Cloud’s traces and real-time metrics for deeper insights and more context of your hybrid environment.

Perform quick and easy log-based investigations: Reuse logs that are already ingested in Splunk Cloud Platform or Enterprise in a simplified and intuitive interface (no need to know SPL!) with customizable and out-of-the-box dashboards.

Achieve higher economies of scale and operational efficiency: Centralize log management across teams, break down data and team silos, and get better overall support.

Logo graph

Last Modified May 21, 2025

Subsections of 3. Log Observer Overview

Log Observer Home Page

Click Log Observer in the main menu. The Log Observer Home Page is made up of 4 distinct sections:

Lo Page

  1. Onboarding Pane: Training videos and links to documentation to get you started with Splunk Log Observer.
  2. Filter Bar: Filter on time, indexes, and fields and also Save Queries.
  3. Logs Table Pane: List of log entries that match the current filter criteria.
  4. Fields Pane: List of fields available in the currently selected index.
Splunk indexes

Generally, in Splunk, an “index” refers to a designated place where your data is stored. It’s like a folder or container for your data. Data within a Splunk index is organized and structured in a way that makes it easy to search and analyze. Different indexes can be created to store specific types of data. For example, you might have one index for web server logs, another for application logs, and so on.

Tip

If you have used Splunk Enterprise or Splunk Cloud before, you are probably used to starting investigations with logs. As you will see in the following exercise, you can do that with Splunk Observability Cloud as well. This workshop, however, will use all the OpenTelemetry signals for investigations.

Let’s run a little search exercise:

Exercise
  • Set the time frame to -15m.

  • Click on Add Filter in the filter bar then click on Fields in the dialog.

  • Type in cardType and select it.

  • Under Top values click on visa, then click on = to add it to the filter.

  • Click Run search

    logo search

  • Click on one of the log entries in the Logs table to validate that the entry contains cardType: "visa".

  • Let’s find all the wire transfer orders that have been completed. Click on Clear All in the filter bar to remove the previous filter.

  • Click again on Add Filter in the filter bar, then select Keyword. Next just type order in the Enter Keyword… box and press enter.

  • Click Run search

  • You should now only have log lines that contain the word order. There are still a lot of log lines – some of which may not be our service – so let’s filter some more.

  • Add another filter, this time select the Fields box, then type severity in the Find a field … search box and select it. severity

  • Under Top values click on error, then click on = to add it to the filter.

  • Click Run search

  • You should now have a list of wire transfer orders that failed to complete for the last 15 minutes.
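Conceptually, the keyword and field filters you just applied boil down to something like the following sketch. The log record shape here is an assumption for illustration; Log Observer does all of this for you with no code:

```python
def search_logs(logs, keyword=None, **field_filters):
    """Return log records whose message contains `keyword` and whose
    fields match every key=value filter, like Log Observer's filter bar."""
    out = []
    for rec in logs:
        if keyword and keyword not in rec.get("message", ""):
            continue  # keyword filter: message must contain the word
        if any(rec.get(k) != v for k, v in field_filters.items()):
            continue  # field filters: every field must match exactly
        out.append(rec)
    return out

logs = [
    {"message": "order 1001 completed", "severity": "info"},
    {"message": "order 1002 failed: compliance check", "severity": "error"},
    {"message": "health check ok", "severity": "info"},
]

# keyword "order" plus severity=error, as in the exercise above
failed = search_logs(logs, keyword="order", severity="error")
print(failed)  # only the failed wire transfer order
```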

Next, let’s check out Splunk Infrastructure Monitoring.

Last Modified May 21, 2025

Infrastructure Overview

5 minutes  

Splunk Infrastructure Monitoring (IM) is a market-leading monitoring and observability service for hybrid cloud environments. Built on a patented streaming architecture, it provides a real-time solution for engineering teams to visualize and analyze performance across infrastructure, services, and applications in a fraction of the time and with greater accuracy than traditional solutions.

OpenTelemetry standardization: Gives you full control over your data, freeing you from vendor lock-in and proprietary agents.

Splunk’s OTel Collector: Seamless installation and dynamic configuration, auto-discovers your entire stack in seconds for visibility across clouds, services, and systems.

300+ easy-to-use OOTB content: Pre-built navigators and dashboards deliver immediate visualizations of your entire environment so that you can interact with all your data in real time.

Kubernetes navigator: Provides an instant, comprehensive out-of-the-box hierarchical view of nodes, pods, and containers. Ramp up even the most novice Kubernetes user with easy-to-understand interactive cluster maps.

AutoDetect alerts and detectors: Automatically identify the most important metrics out-of-the-box to create detector alert conditions that fire accurately from the moment telemetry data is ingested, with real-time alerting for important notifications in seconds.

Log views in dashboards: Combine log messages and real-time metrics on one page with common filters and time controls for faster in-context troubleshooting.

Metrics pipeline management: Control metrics volume at the point of ingest without re-instrumentation with a set of aggregation and data-dropping rules to store and analyze only the needed data. Reduce metrics volume and optimize observability spend.

Infrastructure Overview

Last Modified May 21, 2025

Splunk APM

20 minutes  
Persona

You are a back-end developer and you have been called in to help investigate an issue found by the SRE. The SRE has identified a poor user experience and has asked you to investigate the issue.

Getting to the root cause of a problem in cloud-native environments requires engineers to navigate through immense complexity within a distributed system. Oftentimes, you didn’t write the code and you lack the background and context to quickly understand what’s going on when a problem occurs. That is where Splunk APM comes in.

Last Modified May 20, 2025

Subsections of 4. Splunk APM

1. APM Explore

When you click into the APM section of Splunk Observability Cloud you are greeted with an overview of your APM data, including top services by error rate and RED metrics for services and workflows.

The APM Service Map displays the dependencies and connections among your instrumented and inferred services in APM. The map is dynamically generated based on your selections in the time range, environment, workflow, service, and tag filters.

You can see the services involved in any of your APM user workflows by clicking into the Service Map. When you select a service in the Service Map, the charts in the Business Workflow sidepane are updated to show metrics for the selected service. The Service Map and any indicators are synchronized with the time picker and chart data displayed.

Exercise
  • Click on the wire-transfer-service in the Service Map.

APM Explore

Splunk APM also provides built-in Service Centric Views to help you see problems occurring in real time and quickly determine whether the problem is associated with a service, a specific endpoint, or the underlying infrastructure. Let’s have a closer look.

Exercise
  • In the right hand pane, click on wire-transfer-service in blue.

APM Service

Last Modified May 21, 2025

2. APM Service View

Service View

As a service owner you can use the service view in Splunk APM to get a complete view of your service health in a single pane of glass. The service view includes a service-level indicator (SLI) for availability; dependencies; request, error, and duration (RED) metrics; runtime metrics; infrastructure metrics; Tag Spotlight; endpoints; and logs for a selected service. You can also quickly navigate to code profiling and memory profiling for your service from the service view.

Service Dashboard

Exercise
  • In the Time box change the timeframe to -1h. Note how the charts update.
  • These charts are very useful to quickly identify performance issues. You can use this dashboard to keep an eye on the health of your service.
  • Scroll down the page and expand Infrastructure Metrics. Here you will see the metrics for the Host and Pod.
  • Runtime Metrics are not available as profiling data is not available for services written in Node.js.
  • Now let’s go back to the Explore view; you can hit the back button in your browser.

APM Explore

Exercise

In the Service Map hover over the wire-transfer-service. What can you conclude from the popup service chart?

The error percentage is very high.

APM Service Chart

We need to understand if there is a pattern to this error rate. We have a handy tool for that, Tag Spotlight.

Last Modified May 21, 2025

3. APM Tag Spotlight

Exercise
  • To view the tags for the wire-transfer-service click on the wire-transfer-service and then click on Tag Spotlight in the right-hand side functions pane (you may need to scroll down depending upon your screen resolution).
  • Once in Tag Spotlight, ensure the toggle Show tags with no values is off.

APM Tag Spotlight

The views in Tag Spotlight are configurable for both the chart and cards. The view defaults to Requests & Errors.

You can also configure which tag metrics are displayed in the cards, selecting any combination of:

  • Requests
  • Errors
  • Root cause errors
  • P50 Latency
  • P90 Latency
  • P99 Latency


Scroll through the cards and get familiar with the tags provided by the wire-transfer-service’s telemetry.
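Under the hood, a Tag Spotlight card is essentially a group-by over a span tag with request and error counts per tag value. A rough sketch, with an assumed span shape, purely for intuition:

```python
from collections import defaultdict

def tag_spotlight(spans, tag):
    """Group spans by the value of `tag` and count requests and errors,
    similar to what a Tag Spotlight card displays."""
    cards = defaultdict(lambda: {"requests": 0, "errors": 0})
    for span in spans:
        value = span["tags"].get(tag, "unknown")
        cards[value]["requests"] += 1
        if span["error"]:
            cards[value]["errors"] += 1
    return dict(cards)

spans = [
    {"tags": {"version": "v350.9"},  "error": False},
    {"tags": {"version": "v350.10"}, "error": True},
    {"tags": {"version": "v350.10"}, "error": True},
]
print(tag_spotlight(spans, "version"))
# v350.10 shows 2 requests and 2 errors (100%), the pattern you will
# spot in the exercise below
```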

Exercise

Which card exposes the tag that identifies what the problem is?

The version card. The number of requests against v350.10 matches the number of errors i.e. 100%

Now that we have identified the version of the wire-transfer-service that is causing the issue, let’s see if we can find out more information about the error. Press the back button on your browser to get back to the Service Map.

Last Modified May 21, 2025

4. APM Service Breakdown

Exercise
  • Select the wire-transfer-service in the Service Map.
  • In the right-hand pane click on Breakdown.
  • Select tenant.level in the list.
  • Back in the Service Map click on gold (our most valuable user tier).
  • Click on Breakdown and select version, this is the tag that exposes the service version.
  • Repeat this for silver and bronze.

What can you conclude from what you are seeing?

Every tenant.level is being impacted by v350.10

You will now see the wire-transfer-service broken down into three services, gold, silver and bronze. Each tenant is broken down into two services, one for each version (v350.10 and v350.9).

APM Service Breakdown

Span Tags

Using span tags to break down services is a very powerful feature. It allows you to see how your services are performing for different customers, different versions, different regions, etc. In this exercise, we have determined that v350.10 of the wire-transfer-service is causing problems for all our customers.

Next, we need to drill down into a trace to see what is going on.

Last Modified May 21, 2025

5. APM Trace Analyzer

Since Splunk APM provides NoSample end-to-end visibility of every service, Splunk APM captures every trace. For this workshop, the wire transfer orderId is available as a tag. This means that we can use it to search for the exact trace behind the poor user experience encountered by users.

Trace Analyzer

Splunk Observability Cloud provides several tools for exploring application monitoring data. Trace Analyzer is suited to scenarios where you have high-cardinality, high-granularity searches and explorations to research unknown or new issues.

Exercise
  • With the outer box of the wire-transfer-service selected, in the right-hand pane, click on Traces.
  • Set Time Range to Last 15 minutes.
  • Ensure the Sample Ratio is set to 1:1 and not 1:10.

APM Trace Analyzer

The Trace & error count view shows the total traces and traces with errors in a stacked bar chart. You can use your mouse to select a specific period within the available time frame.

Exercise
  • Click on the dropdown menu that says Trace & error count, and change it to Trace duration

APM Trace Analyzer Heat Map

The Trace Duration view shows a heatmap of traces by duration. The heatmap represents 3 dimensions of data:

  • Time on the x-axis
  • Trace duration on the y-axis
  • The traces (or requests) per second are represented by the heatmap shades

You can use your mouse to select an area on the heatmap, to focus on a specific time period and trace duration range.
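The three dimensions can be sketched as a simple bucketing operation. Bin sizes and the trace record shape below are assumptions for illustration, not the actual Trace Analyzer implementation:

```python
from collections import Counter

def heatmap_bins(traces, time_bin_s=60, dur_bin_ms=250):
    """Bucket traces into (time, duration) cells; the count per cell is
    what the heatmap shading represents."""
    cells = Counter()
    for t in traces:  # each trace: {"start": epoch_seconds, "duration_ms": float}
        x = int(t["start"] // time_bin_s)        # time on the x-axis
        y = int(t["duration_ms"] // dur_bin_ms)  # duration on the y-axis
        cells[(x, y)] += 1                       # traces per cell -> shade
    return cells

traces = [
    {"start": 0,  "duration_ms": 100},
    {"start": 10, "duration_ms": 120},
    {"start": 70, "duration_ms": 900},
]
print(heatmap_bins(traces))
```

Selecting an area on the heatmap is equivalent to restricting the search to a set of these (time, duration) cells.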

Exercise
  • Switch from Trace duration back to Trace & error count.
  • In the time picker select Last 1 hour.
  • Note that most of our traces have errors (red) and only a limited number of traces are error-free (blue).
  • Make sure the Sample Ratio is set to 1:1 and not 1:10.
  • Click on Add filters, type in orderId and select orderId from the list.
  • Find and select the orderId provided by your workshop leader and hit enter. Traces by Duration

We have now filtered down to the exact trace where users reported a poor experience with a very long processing wait.

A secondary benefit of viewing this trace is that it remains accessible for up to 13 months, allowing developers to come back to this issue at a later stage and still view this exact trace.

Exercise
  • Click on the trace in the list.

Next, we will walk through the trace waterfall.

Last Modified May 21, 2025

6. APM Waterfall

We have arrived at the Trace Waterfall from the Trace Analyzer. A trace is a collection of spans that share the same trace ID, representing a unique transaction handled by your application and its constituent services.

Each span in Splunk APM captures a single operation. Splunk APM considers a span to be an error span if the operation that the span captures results in an error.
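The definitions above can be sketched in a few lines: group spans by their shared trace ID, then flag the error spans. The span fields here are assumptions for illustration:

```python
from collections import defaultdict

def group_traces(spans):
    """Group spans by trace_id; a trace is the set of spans sharing one ID."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    return traces

def error_spans(trace):
    """A span is an error span if the operation it captures resulted in an error."""
    return [s for s in trace if s.get("error")]

spans = [
    {"trace_id": "t1", "name": "POST /wiretransfer", "error": False},
    {"trace_id": "t1", "name": "wire-transfer-service", "error": True},
    {"trace_id": "t2", "name": "GET /health", "error": False},
]
traces = group_traces(spans)
print([s["name"] for s in error_spans(traces["t1"])])  # ['wire-transfer-service']
```

The waterfall view renders exactly this grouping, with error spans marked by a !.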

Trace Waterfall

Exercise
  • Click on the ! next to any of the wire-transfer-service spans in the waterfall.

What is the error message and version being reported in the Span Details?

Invalid request and v350.10.

Now that we have identified the version of the wire-transfer-service that is causing the issue, let’s see if we can find out more information about the error. This is where Related Logs come in.

Related Content relies on specific metadata that allow APM, Infrastructure Monitoring, and Log Observer to pass filters around Observability Cloud. For related logs to work, you need to have the following metadata in your logs:

  • service.name
  • deployment.environment
  • host.name
  • trace_id
  • span_id
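For illustration, a structured log event carrying that correlation metadata might look like the following. All field values here are made up; only the field names come from the list above:

```python
import json

# A hypothetical structured log event with the metadata Related Content
# needs to link this log line back to its APM trace and host.
log_event = {
    "message": "wire transfer failed: invalid API token",
    "severity": "error",
    "service.name": "wire-transfer-service",
    "deployment.environment": "finserv-workshop",
    "host.name": "node-1.example.internal",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
}

required = {"service.name", "deployment.environment", "host.name",
            "trace_id", "span_id"}
assert required <= log_event.keys()  # all correlation fields are present
print(json.dumps(log_event))
```

When `trace_id` and `span_id` in a log line match a captured trace, Observability Cloud can offer the one-click jump from the waterfall to Log Observer used in the next exercise.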
Exercise
  • At the very bottom of the Trace Waterfall click on Logs (1). This highlights that there are Related Logs for this trace.
  • Click on the Logs for trace xxx entry in the pop-up, this will open the logs for the complete trace in Log Observer.

Related Logs

Next, let’s find out more about the error in the logs.

Last Modified May 20, 2025

Splunk Log Observer

20 minutes  
Persona

Remaining in your back-end developer role, you need to inspect the logs from your application to determine the root cause of the issue.

Using the content related to the APM trace (logs) we will now use Splunk Log Observer to drill down further to understand exactly what the problem is.

Related Content is a powerful feature that allows you to jump from one component to another and is available for metrics, traces and logs.

Last Modified May 20, 2025

Subsections of 5. Splunk Log Observer

1. Log Filtering

Log Observer (LO) can be used in multiple ways. In the quick tour, you used the LO no-code interface to search for specific entries in the logs. This section, however, assumes you have arrived in LO from a trace in APM using the Related Content link.

The advantage of this is that you are looking at your logs within the context of your previous actions. In this case, the context is the time frame (1), which matches that of the trace, and the filter (2), which is set to the trace_id.

Trace Logs

This view will include all the log lines from all applications or services that participated in the back-end transaction started by the end-user’s wire transfer request.

Even in a small application, the sheer amount of logs found can make it hard to see the specific log lines that matter to the actual incident we are investigating.

Exercise

We need to focus on just the Error messages in the logs:

  • Click on the Group By drop-down box and use the filter to find Severity.
  • Once selected, click the Apply button (notice that the chart legend changes to show debug, error and info). legend
  • To select just the error logs, click on the word error (1) in the legend, then select Add to filter and click Run Search.
  • You could also add the service name, sf_service=wire-transfer-service, to the filter if there are error lines for multiple services, but in our case, this is not necessary. Error Logs
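The Group By and filter steps above amount to a simple aggregation followed by a selection. A rough sketch in Python, with invented sample records standing in for the log lines in Log Observer:

```python
from collections import Counter

# Invented sample records standing in for the log lines in Log Observer.
logs = [
    {"severity": "info",  "service": "wire-transfer-service"},
    {"severity": "error", "service": "wire-transfer-service"},
    {"severity": "debug", "service": "compliance-check-service"},
    {"severity": "error", "service": "wire-transfer-service"},
]

# Group By Severity: count records per severity, as the chart legend does.
by_severity = Counter(log["severity"] for log in logs)

# Add to filter: keep only the error logs, optionally narrowed to one service.
errors = [log for log in logs
          if log["severity"] == "error"
          and log["service"] == "wire-transfer-service"]

print(by_severity, len(errors))
```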

Next, we will look at log entries in detail.


2. Viewing Log Entries

Before we look at a specific log line, let’s quickly recap what we have done so far and why we are here based on the 3 pillars of Observability:

| Metrics | Traces | Logs |
| --- | --- | --- |
| Do I have a problem? | Where is the problem? | What is the problem? |
  • Using metrics, we identified that we have a problem with our application. This was obvious from the error rate in the Service Dashboards, which was higher than it should have been.
  • Using traces and span tags, we found where the problem is. The wire-transfer-service consists of two versions, v350.9 and v350.10, and the error rate was 100% for v350.10.
  • We did see that this error from the wire-transfer-service v350.10 caused multiple retries and a long delay in the response back from the compliance check service.
  • From the trace, using the power of Related Content, we arrived at the log entries for the failing wire-transfer-service version. Now, we can determine what the problem is.
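The retry behaviour we saw in the trace also explains the long response time: each failed call to the compliance check adds its own timeout before the next attempt. A toy model with invented numbers (not measured from the workshop application):

```python
def total_latency(attempts, per_call_timeout_s, backoff_s):
    """Worst-case latency when every attempt fails until the caller gives up."""
    calls = attempts * per_call_timeout_s   # every call waits out its timeout
    waits = (attempts - 1) * backoff_s      # pause between attempts
    return calls + waits

# Three attempts with a 2 s timeout and a 1 s backoff already cost 8 s,
# which is how a single bad dependency turns into a slow user transaction.
print(total_latency(attempts=3, per_call_timeout_s=2.0, backoff_s=1.0))
```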
Exercise
  • Click on an error entry in the log table (make sure it says hostname: "wire-transfer-service-xxxx", in case there is a rare error from a different service in the list too).

Based on the message, what would you tell the development team to do to resolve the issue?

The development team needs to rebuild and deploy the container with a valid API Token, or roll back to v350.9.
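One way the development team could prevent a build like v350.10 from ever taking traffic is to validate the API token at startup and fail fast, instead of retrying every request. A hypothetical sketch, where the token format, the COMPLIANCE_API_TOKEN variable name and the check itself are all invented for illustration:

```python
import os

def check_token(token: str) -> bool:
    """Invented placeholder check; a real service would call the token issuer."""
    return bool(token) and token.startswith("wt-") and len(token) > 8

def startup_status(env: dict) -> str:
    token = env.get("COMPLIANCE_API_TOKEN", "")
    if not check_token(token):
        # Fail fast: a pod that refuses to start is caught at deploy time,
        # instead of serving 100% errors and retries at runtime.
        return "FATAL: invalid or missing compliance API token"
    return "token OK, starting wire-transfer-service"

print(startup_status(os.environ))
print(startup_status({"COMPLIANCE_API_TOKEN": "wt-1234567890"}))
```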

Log Message

  • Click on the X in the log message pane to close it.
Congratulations

You have successfully used Splunk Observability Cloud to understand why users are experiencing issues while using the wire transfer service. You used Splunk APM and Splunk Log Observer to understand what happened in your service landscape and, subsequently, found the underlying cause, all based on the 3 pillars of Observability: metrics, traces and logs.

You also learned how to use Splunk’s intelligent tagging and analysis with Tag Spotlight to detect patterns in your applications’ behavior and to use the full stack correlation power of Related Content to quickly move between the different components and telemetry while keeping in context of the issue.

In the next part of the workshop, we will move from problem-finding mode into mitigation, prevention and process improvement mode.

Next up, creating log charts in a custom dashboard.


3. Log Timeline Chart

Once you have a specific view in Log Observer, it is very useful to be able to reuse that view in a dashboard, helping to reduce the time to detect or resolve future issues. As part of the workshop, we will create an example custom dashboard that uses these charts.

Let’s look at creating a Log Timeline chart. The Log Timeline chart visualizes log messages over time, making it a great way to see the frequency of log messages, identify patterns and see their distribution across your environment. These charts can be saved to a custom dashboard.
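Under the hood, a Log Timeline chart is essentially a count of log records per time bucket, split by a field such as severity. A simplified sketch with invented timestamps:

```python
from collections import defaultdict

# Invented (timestamp_seconds, severity) pairs standing in for log records.
events = [(3, "error"), (7, "info"), (12, "error"), (14, "error"), (21, "info")]

def timeline(events, bucket_s=10):
    """Count events per (bucket, severity), as a Log Timeline chart does."""
    counts = defaultdict(int)
    for ts, severity in events:
        bucket = ts // bucket_s * bucket_s   # floor to the bucket start time
        counts[(bucket, severity)] += 1
    return dict(counts)

print(timeline(events))
```

Each (bucket, severity) pair becomes one stacked segment in the chart, which is why a burst of errors stands out immediately.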

Exercise

First, we will reduce the amount of information to only the columns we are interested in:

  • Click on the Configure Table icon above the Logs table to open the Table Settings, untick _raw and ensure the following fields are selected: k8s.pod.name, message and version. Log Table Settings
  • Remove the fixed time from the time picker, and set it to the Last 15 minutes.
  • To make this work for all traces, remove the trace_id from the filter and add the fields sf_service=wire-transfer-service and sf_environment=[WORKSHOPNAME].
  • Click Save and select Save to Dashboard. save it
  • In the chart creation dialog box that appears, for the Chart name use Log Timeline.
  • Click Select Dashboard and then click New dashboard in the Dashboard Selection dialog box.
  • In the New dashboard dialog box, enter a name for the new dashboard (no need to enter a description). Use the following format: Initials - Service Health Dashboard and click Save.
  • Ensure the new dashboard is highlighted in the list (1) and click OK (2). Save dashboard
  • Ensure that Log Timeline is selected as the Chart Type. log timeline
  • Click the Save button (do not click Save and go to dashboard at this time).

Next, we will create a Log View chart.


4. Log View Chart

The next chart type that can be used with logs is the Log View chart type. This chart will allow us to see log messages based on predefined filters.

As with the previous Log Timeline chart, we will add a version of this chart to our Service Health Dashboard:

Exercise
  • After the previous exercise, make sure you are still in Log Observer.
  • The filters should be the same as the previous exercise, with the time picker set to the Last 15 minutes and filtering on severity=error, sf_service=wire-transfer-service and sf_environment=[WORKSHOPNAME].
  • Make sure the table header still shows only the fields we selected earlier.
  • Click again on Save and then Save to Dashboard.
  • This will again provide you with the Chart creation dialog.
  • For the Chart name use Log View.
  • This time, click Select Dashboard and search for the dashboard you created in the previous exercise. You can start by typing your initials in the search box (1). search dashboard
  • Click on your dashboard name to highlight it (2) and click OK (3).
  • This will return you to the create chart dialog.
  • Ensure Log View is selected as the Chart Type. log view
  • To see your dashboard click Save and go to dashboard.
  • The result should be similar to the dashboard below: Custom Dashboard
  • As the last step in this exercise, let us add your dashboard to your workshop team page; this will make it easy to find later in the workshop.
  • At the top of the page, click on the icon to the left of your dashboard name. linking
  • Select Link to teams from the drop-down.
  • In the following Link to teams dialog box, find the Workshop team that your instructor will have provided for you and click Done.

Custom Service Health Dashboard

15 minutes  
Persona

The SRE hat suits you, so let’s keep it on: you have been asked to build a custom Service Health Dashboard for the wire-transfer-service. The requirement is to display RED metrics, logs and Synthetic test duration results.

It is common for development and SRE teams to require a summary of the health of their applications and/or services. More often than not, these are displayed on wall-mounted TVs. Splunk Observability Cloud has the perfect solution for this: custom dashboards.

In this section we are going to build a Service Health Dashboard we can use to display on teams’ monitors or TVs.


Subsections of 6. Service Health Dashboard

Enhancing the Dashboard

As we already saved some useful log charts in a dashboard in the Log Observer exercise, we are going to extend that dashboard.

Wall mounted

Exercise
  • To get back to your dashboard with the two log charts, click on Dashboards in the main menu and you will be taken to your Team Dashboard view. Under Dashboards, click in Search dashboards to search for your Service Health Dashboard group.
  • Click on the name and this will bring up your previously saved dashboard. log list
  • Even though the log information is useful, the dashboard needs more context to make sense to our team, so let’s add it.
  • The first step is adding a description chart to the dashboard. Click on New text note, replace the text in the note with the following text, name the chart Instructions, and then click the Save and close button.
Information to use with text note
This is a Custom Health Dashboard for the **wire-transfer-service**,  
Please pay attention to any errors in the logs.
For more detail visit [link](https://www.splunk.com/en_us/products/observability.html)
  • The charts are not in a nice order; let’s rearrange them so that they are useful.
  • Move your mouse over the top edge of the Instructions chart; your mouse pointer will change to a move cursor, allowing you to drag the chart around the dashboard. Drag the Instructions chart to the top-left location and resize it to 1/3rd of the page width by dragging its right-hand edge.
  • Drag the Log Timeline chart next to the Instructions chart and resize it so that it fills the other 2/3rd of the page width.
  • Next, resize the Log View chart to the full width of the page and make it at least twice as tall.
  • You should have something similar to the dashboard below: Initial Dashboard

This looks great, let’s continue and add more meaningful charts.


Adding a Custom Chart

In this part of the workshop, we are going to create a chart to add to our dashboard; we will also link it to the detector we previously built. This will allow us to see the behavior of our test and get alerted if one or more of our test runs breach its SLA.

Exercise
  • At the top of the dashboard click on the + and select Chart. new chart screen
  • First, use the Untitled chart input field and name the chart Requests by version & error.
  • For this exercise we want a bar or column chart, so click on the 3rd icon in the chart option box.
  • In the Plot editor, enter spans.count (the APM metric that counts spans) in the Signal box and hit enter.
  • Click Add filter and choose sf_service:wire-transfer-service.
  • Right now we see different colored bars, one for each underlying metric time series. As this is not needed, we can change that behavior by adding some analytics.
  • Click the Add analytics button.
  • From the drop-down choose the Sum option, then under sum:aggregation click version and then sf_error to group by both of these dimensions. Notice how the chart changes as the metrics are now aggregated. new chart screen
  • Click the Save and close button.
  • In the dashboard, move the charts so they look like the screenshot below: Service Health Dashboard
  • For the final task, click the three dots at the top of the page (next to Event Overlay) and click on View fullscreen. This is the view you would use on a wall-mounted TV monitor (press Esc to go back).
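The Sum aggregation grouped by version and sf_error collapses the individual time series into one value per (version, error-flag) pair. A sketch of that roll-up with invented per-pod span counts:

```python
from collections import defaultdict

# Invented per-pod measurements of spans.count for the wire-transfer-service.
series = [
    {"version": "v350.9",  "sf_error": "false", "pod": "a", "spans": 40},
    {"version": "v350.9",  "sf_error": "false", "pod": "b", "spans": 35},
    {"version": "v350.10", "sf_error": "true",  "pod": "c", "spans": 50},
    {"version": "v350.10", "sf_error": "true",  "pod": "d", "spans": 22},
]

totals = defaultdict(int)
for point in series:
    # Sum grouped by version and sf_error: pod and any other
    # dimensions not in the group-by list are rolled up together.
    totals[(point["version"], point["sf_error"])] += point["spans"]

print(dict(totals))
```

Grouping by these two dimensions is what makes the chart show one bar per version/error combination instead of one bar per reporting pod.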
Tip

In your spare time, try adding another custom chart to the dashboard using APM or Infrastructure metrics. You could copy a chart from the out-of-the-box Kubernetes dashboard group, or use the APM metric traces.count to create a chart that shows the number of errors on a specific endpoint.

Finally, we will run through a workshop wrap-up.

Last Modified May 21, 2025

Workshop Wrap-up 🎁

10 minutes  

Congratulations, you have completed the Financial Services Observability Cloud Workshop. Today, you have become familiar with how to use Splunk Observability Cloud to monitor your applications and infrastructure.

Celebrate your achievement by adding this certificate to your LinkedIn profile.

Let’s recap what we have learned and what you can do next.

Champagne


Subsections of 7. Workshop Wrap-up

Key Takeaways

During the workshop, we have seen how Splunk Observability Cloud, in combination with the OpenTelemetry signals (metrics, traces and logs), can help you reduce both mean time to detect (MTTD) and mean time to resolution (MTTR).

  • We have a better understanding of the main user interface and its components: the Landing, Infrastructure, APM, Log Observer and Dashboard pages, plus a quick peek at the Settings page.
  • Depending on time, we did an Infrastructure exercise, looked at the metrics used in the Kubernetes Navigators and saw the related services found on our Kubernetes cluster:

Kubernetes

  • Understood what users were experiencing and used APM to troubleshoot a particularly long load time and an error, by following its trace across the front end and back end, right to the log entries. We used tools like the APM Dependency map with Breakdown to discover what was causing our issue:

apm

  • Used Tag Spotlight, in APM, to understand the blast radius and to detect trends and context for our performance issues and errors. We drilled down into spans in the APM trace waterfall to see how services interacted and to find errors:

tag and waterfall

  • We used the Related Content feature to follow the link from our trace directly to its related logs and used filters to drill down to the exact cause of our issue.

logs

  • In the final exercise, we created a health dashboard for our developers and SREs, to keep running on a TV screen:

synth and TV