Splunk APM

20 minutes  
Persona

You are a back-end developer who has been called in to help investigate an issue found by the SRE. The SRE has identified a poor user experience and has asked you to find the root cause.

Discover the power of full end-to-end visibility by jumping from a RUM trace (front-end) to an APM trace (back-end). All the services are sending telemetry (traces and spans) that Splunk Observability Cloud can visualize, analyze and use to detect anomalies and errors.

RUM and APM are two sides of the same coin. RUM is the client-side view of the application and APM is the server-side view. In this section, we will use APM to drill down and identify where the problem is.


Subsections of Splunk APM

1. APM Explore

The APM Service Map displays the dependencies and connections among your instrumented and inferred services in APM. The map is dynamically generated based on your selections in the time range, environment, workflow, service, and tag filters.

When we clicked on the APM link in the RUM waterfall, filters were automatically added to the Service Map view to show the services that were involved in that Workflow Name (frontend:/cart/checkout).

You can see the services involved in the workflow in the Service Map. In the side pane, under Business Workflow, charts for the selected workflow are displayed. The Service Map and Business Workflow charts are synchronized. When you select a service in the Service Map, the charts in the Business Workflow pane are updated to show metrics for the selected service.

APM Business Workflow

Exercise
  • Click on the paymentservice in the Service Map.

APM Explore

Splunk APM also provides a set of built-in dashboards that present charts and visualized metrics to help you see problems occurring in real time and quickly determine whether the problem is associated with a service, a specific endpoint, or the underlying infrastructure. Let’s have a closer look.

Exercise
  • Click on the three dots next to paymentservice and select APM Service

APM Service


2. APM Service Dashboard

Service Dashboard

APM Service Dashboards provide request, error, and duration (RED) metrics based on Monitoring MetricSets created from endpoint spans for your services, endpoints, and Business Workflows. If you scroll down the dashboard, you will also see host- and Kubernetes-related metrics, which help you determine whether there are problems with the underlying infrastructure.

Service Dashboard

Exercise
  • Check the Time box (1): the dashboards only show data for the time window of the APM trace we selected (note that the charts are static).
  • In the time box, enter -1h and hit enter.
  • The Single Value charts (Request rate, Request latency (p90) and Error rate) will start updating every 10 seconds, showing that we still have a large number of errors occurring.
  • These charts are very useful for quickly identifying performance issues. You can use this dashboard to keep an eye on the health of your service or use it as a base for a custom one.
  • We want to use some of these charts in a later exercise:
    • In the Request rate Single Value chart (2), click the three dots menu and select Copy. Note that you now have a 1 before the + at the top right of the page (3), indicating you have a copied chart on the clipboard.
    • In the Request rate line chart (4), either click on the Add to clipboard indicator that appeared (next to the (4) in the screenshot) to add it to the clipboard, or use the three dots menu and select Add to clipboard.
  • Note that you now have a 2 before the + at the top right of the page (3).
  • Now let's go back to the Explore view; you can hit the back button in your browser.

APM Explore

Exercise

In the Service Map hover over the paymentservice. What can you conclude from the popup service chart?

The error percentage is very high.

APM Service Chart

We need to understand if there is a pattern to this error rate. We have a handy tool for that, Tag Spotlight.


3. APM Tag Spotlight

Exercise
  • To view the tags for the paymentservice, click on the paymentservice and then click on Tag Spotlight in the right-hand side functions pane (you may need to scroll down depending on your screen resolution).
  • Once in Tag Spotlight ensure the toggle Show tags with no values is off.

APM Tag Spotlight

There are two views available in Tag Spotlight. The default is Request/Errors and the other is Latency.

Request/Error charts display the total number of requests, errors, and root cause errors. The Latency charts display p50, p90, and p99 latency. These values are based on Troubleshooting MetricSets (TMS), which Splunk APM generates for every indexed span tag. These are known as RED metrics (request, error, and duration).

Exercise

Which chart exposes the tag that identifies what the problem is?

The version chart. The number of requests against v350.10 matches the number of errors.

  • Now that we have identified the version of the paymentservice that is causing the issue, let’s see if we can find out more information about the error. Click on ← Tag Spotlight at the top of the page to get back to the Service Map.

4. APM Service Breakdown

Exercise
  • In the right-hand pane, click on Breakdown.
  • Select tenant.level in the list. This tag exposes the customers' status and can be useful for spotting trends related to customer status.
  • Back in the Service Map, click on gold to select it.
  • Click on Breakdown and select version; this is the tag that exposes the service version.
  • Repeat this for silver and bronze.

You will now see the paymentservice broken down into three services: gold, silver and bronze. Each tenant is broken down into two services, one for each version (v350.10 and v350.9).

What can you conclude from what you are seeing?

Every tenant is being impacted by v350.10.

APM Service Breakdown

Exercise
  • Click on the outer main box that surrounds the 3 red circles; the box will become highlighted.
Span Tags

Using span tags to break down services is a very powerful feature. It allows you to see how your services are performing for different customers, different versions, different regions, etc. In this exercise, we have determined that v350.10 of the paymentservice is causing problems for all our customers.
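To make this kind of breakdown possible, a service only needs to set those values as span attributes in its instrumentation. The snippet below is a minimal sketch using the OpenTelemetry Python API (illustrative only, not the workshop's actual instrumentation); the charge() function and tenant_level parameter are hypothetical, while the version and tenant.level attribute names match the tags you just used in the Breakdown.

```python
from opentelemetry import trace

tracer = trace.get_tracer("paymentservice")

def charge(tenant_level: str) -> None:
    """Hypothetical payment operation that tags its span for later breakdown."""
    with tracer.start_as_current_span("grpc.hipstershop.PaymentService/Charge") as span:
        # These attributes become the indexed span tags used in the Breakdown:
        span.set_attribute("version", "v350.10")
        span.set_attribute("tenant.level", tenant_level)  # e.g. "gold", "silver" or "bronze"
        # ... perform the actual charge here ...
```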

Next, we need to drill down into a trace to see what is going on.


5. APM Trace Analyzer

As Splunk APM provides NoSample™ end-to-end visibility, it captures every trace from every service. For this workshop, the Order Confirmation ID is available as a tag. This means we can use it to search for the exact trace of the poor user experience you encountered earlier in the workshop.
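How might that ID end up on the trace? Here is a minimal sketch, assuming the checkout operation records the confirmation ID as a span attribute named orderId (the same name we will filter on later); the place_order() function and the generated ID are purely illustrative.

```python
import uuid

from opentelemetry import trace

tracer = trace.get_tracer("checkoutservice")

def place_order(cart: dict) -> str:
    """Hypothetical checkout operation that records the confirmation ID as a tag."""
    with tracer.start_as_current_span("placeOrder") as span:
        order_id = str(uuid.uuid4())              # stand-in for the real confirmation ID
        span.set_attribute("orderId", order_id)   # indexed span tag, searchable in Trace Analyzer
        return order_id
```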

Trace Analyzer

Splunk Observability Cloud provides several tools for exploring application monitoring data. Trace Analyzer is suited to scenarios where you have high-cardinality, high-granularity searches and explorations to research unknown or new issues.

Exercise
  • With the outer box of the paymentservice selected, in the right-hand pane, click on Traces.
  • To ensure we are using Trace Analyzer, make sure the button Switch to Classic View is showing. If it is not, click on Switch to Trace Analyzer.

APM Trace Analyzer

The Trace & error count view shows the total traces and traces with errors in a stacked bar chart. You can use your mouse to select a specific period within the available time frame.

Exercise
  • Click on the dropdown menu that says Trace & error count, and change it to Trace duration

APM Trace Analyzer Heat Map

The Trace Duration view shows a heatmap of traces by duration. The heatmap represents 3 dimensions of data:

  1. Time on the x-axis
  2. Trace duration on the y-axis
  3. The traces (or requests) per second are represented by the heatmap shades

You can use your mouse to select an area on the heatmap, to focus on a specific time period and trace duration range.

Exercise
  • Switch from Trace duration back to Trace & error count.
  • In the time picker select Last 1 hour.
  • Note that most of our traces have errors (red) and only a limited number of traces are error-free (blue).
  • Make sure the Sample Ratio is set to 1:1 and not 1:10.
  • Click on Add filters, type in orderId and select orderId from the list.
  • Paste in your Order Confirmation ID from when you went shopping earlier in the workshop and hit enter. If you didn't capture one, please ask your instructor for one.

Traces by Duration

We have now filtered down to the exact trace where you encountered a poor user experience with a very long checkout wait. A secondary benefit of viewing this trace is that it remains accessible for up to 13 months, allowing developers to come back to this issue at a later stage and still view the trace.

Exercise
  • Click on the trace in the list.

Next, we will walk through the trace waterfall.


6. APM Waterfall

We have arrived at the Trace Waterfall from the Trace Analyzer. A trace is a collection of spans that share the same trace ID, representing a unique transaction handled by your application and its constituent services.

Each span in Splunk APM captures a single operation. Splunk APM considers a span to be an error span if the operation that the span captures results in an error.
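In OpenTelemetry terms, an operation typically becomes an error span when the instrumentation records the exception and sets the span status to ERROR. The sketch below is illustrative (the InvalidRequest exception and charge() function are hypothetical); the span name matches the Charge spans you will inspect next.

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("paymentservice")

class InvalidRequest(Exception):
    """Illustrative stand-in for the payment service's validation error."""

def charge(amount_cents: int) -> None:
    with tracer.start_as_current_span("grpc.hipstershop.PaymentService/Charge") as span:
        try:
            if amount_cents <= 0:
                raise InvalidRequest("Invalid request")
            # ... call the payment provider ...
        except InvalidRequest as exc:
            span.record_exception(exc)                       # exception details land in the span metadata
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
```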

Trace Waterfall

Exercise
  • Click on the ! next to any of the paymentservice:grpc.hipstershop.PaymentService/Charge spans in the waterfall.

What is the error message and version being reported in the span metadata?

Invalid request and v350.10.

Now that we have identified the version of the paymentservice that is causing the issue, let’s see if we can find out more information about the error. This is where Related Logs come in.

Related Logs

Related Content relies on specific metadata that allows APM, Infrastructure Monitoring, and Log Observer to pass filters around Observability Cloud. For Related Logs to work, you need to have the following metadata in your logs (a sketch of how a service might emit such a log line follows the list):

  • service.name
  • deployment.environment
  • host.name
  • trace_id
  • span_id
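Below is a minimal sketch of how a Python service could enrich every log record with these fields so Log Observer can relate the logs back to the trace. The filter class, field values (such as "workshop") and JSON layout are illustrative assumptions; the workshop services use their own logging setup.

```python
import logging
import socket

from opentelemetry import trace


class RelatedContentFilter(logging.Filter):
    """Attach trace context and resource metadata to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x")
        record.span_id = format(ctx.span_id, "016x")
        record.service_name = "paymentservice"           # service.name
        record.deployment_environment = "workshop"       # deployment.environment (assumed value)
        record.host_name = socket.gethostname()          # host.name
        return True


handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"message": "%(message)s", "service.name": "%(service_name)s", '
    '"deployment.environment": "%(deployment_environment)s", '
    '"host.name": "%(host_name)s", '
    '"trace_id": "%(trace_id)s", "span_id": "%(span_id)s"}'
))

logger = logging.getLogger("paymentservice")
logger.addHandler(handler)
logger.addFilter(RelatedContentFilter())
logger.setLevel(logging.INFO)

logger.info("charge failed: Invalid request")  # emits a JSON line containing the trace context
```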
Exercise
  • At the very bottom of the Trace Waterfall click on the word Logs (1). This highlights that there are Related Logs for this trace.
  • Click on the Logs for trace XXX entry in the pop-up; this will open the logs for the complete trace in Log Observer.