Splunk APM

20 minutes

Persona

You are a back-end developer and you have been called in to help investigate an issue found by the SRE. The SRE has identified a poor user experience and has asked you to investigate the issue.

Discover the power of full end-to-end visibility by jumping from a RUM trace (front-end) to an APM trace (back-end). All the services are sending telemetry (traces and spans) that Splunk Observability Cloud can visualize, analyze and use to detect anomalies and errors.

RUM and APM are two sides of the same coin. RUM is the client-side view of the application and APM is the server-side view. In this section, we will use APM to drill down and identify where the problem is.

1. APM Explore

The APM Service Map displays the dependencies and connections among your instrumented and inferred services in APM. The map is dynamically generated based on your selections in the time range, environment, workflow, service, and tag filters.

When we clicked on the APM link in the RUM waterfall, filters were automatically added to the service map view to show the services that were involved in that WorkFlow Name (frontend:/cart/checkout).

You can see the services involved in the workflow in the Service Map. In the side pane, under Business Workflow, charts for the selected workflow are displayed. The Service Map and Business Workflow charts are synchronized. When you select a service in the Service Map, the charts in the Business Workflow pane are updated to show metrics for the selected service.

Exercise

Click on the paymentservice in the Service Map.

Splunk APM also provides built-in Service Centric Views to help you see problems occurring in real time and quickly determine whether the problem is associated with a service, a specific endpoint, or the underlying infrastructure. Let’s have a closer look.

Exercise

In the right hand pane, click on View Service.

2. APM Service View

Service View

As a service owners you can use the service view in Splunk APM to get a complete view of your service health in a single pane of glass. The service view includes a service-level indicator (SLI) for availability, dependencies, request, error, and duration (RED) metrics, runtime metrics, infrastructure metrics, Tag Spotlight, endpoints, and logs for a selected service. You can also quickly navigate to code profiling and memory profiling for your service from the service view.

Exercise

Check the Time box, you can see that the dashboards only show data relevant to the time it took for the APM trace we previosuly selected to complete (note that the charts are static).
In the Time box change the timeframe to -1h.
These charts are very useful to quickly identify performance issues. You can use this dashboard to keep an eye on the health of your service.
Scroll down the page and expand Infrastructure Metrics. Here you will see the metrics for the Host and Pod.
Runtime Metrics are not available as profiling data is not available for services written in Node.js.
Now let’s go back to the explore view, you can hit the back button in your Browser

Exercise

In the Service Map hover over the paymentservice. What can you conclude from the popup service chart?

The error percentage is very high.

We need to understand if there is a pattern to this error rate. We have a handy tool for that, Tag Spotlight.

3. APM Tag Spotlight

Exercise

To view the tags for the paymentservice click on the paymentservice and then click on Tag Spotlight in the right-hand side functions pane (you may need to scroll down depending upon your screen resolution).* Once in Tag Spotlight ensure the toggle Show tags with no values is off.

The views in Tag Spotlight are configurable for both the chart and cards. The view defaults to Requests & Errors.

It is also possible to configure which tag metrics are displayed in the cards. It is possible to select any combinations of:

Requests
Errors
Root cause errors
P50 Latency
P90 Latency
P99 Latency

Also ensure that the Show tags with no values toggle is unchecked.

Exercise

Which card exposes the tag that identifies what the problem is?

The version card. The number of requests against v350.10 matches the number of errors i.e. 100%

Now that we have identified the version of the paymentservice that is causing the issue, let’s see if we can find out more information about the error. Click on ← Tag Spotlight at the top of the page to get back to the Service Map.

4. APM Service Breakdown

Exercise

Select the paymentservice in the Service Map.
In the right-hand pane click on the Breakdown.
Select tenant.level in the list.
Back in the Service Map click on gold.
Click on Breakdown and select version, this is the tag that exposes the service version.
Repeat this for silver and bronze.

What can you conclude from what you are seeing?

Every tenant.level is being impacted by v350.10

You will now see the paymentservice broken down into three services, gold, silver and bronze. Each tenant is broken down into two services, one for each version (v350.10 and v350.9).

Span Tags

Using span tags to break down services is a very powerful feature. It allows you to see how your services are performing for different customers, different versions, different regions, etc. In this exercise, we have determined that v350.10 of the paymentservice is causing problems for all our customers.

Next, we need to drill down into a trace to see what is going on.

5. APM Trace Analyzer

As Splunk APM provides a NoSample™ end-to-end visibility of every service Splunk APM captures every trace. For this workshop, the Order Confirmation ID is available as a tag. This means that we can use this to search for the exact trace of the poor user experience you encountered earlier in the workshop.

Trace Analyzer

Splunk Observability Cloud provides several tools for exploring application monitoring data. Trace Analyzer is suited to scenarios where you have high-cardinality, high-granularity searches and explorations to research unknown or new issues.

Exercise

With the outer box of the paymentservice selected, in the right-hand pane, click on Traces.
To ensure we are using Trace Analyzer make sure the button Switch to Classic View is showing. If it is not, click on Switch to Trace Analyzer.
Set Time Range to Last 15 minutes.
Ensure the Sample Ratio is set to 1:1 and not 1:10.

The Trace & error count view shows the total traces and traces with errors in a stacked bar chart. You can use your mouse to select a specific period within the available time frame.

Exercise

Click on the dropdown menu that says Trace & error count, and change it to Trace duration

The Trace Duration view shows a heatmap of traces by duration. The heatmap represents 3 dimensions of data:

Time on the x-axis
Trace duration on the y-axis
The traces (or requests) per second are represented by the heatmap shades

You can use your mouse to select an area on the heatmap, to focus on a specific time period and trace duration range.

Exercise

Switch from Trace duration back to Trace & Error count.
In the time picker select Last 1 hour.
Note, that most of our traces have errors (red) and there are only a limited amount of traces that are error-free (blue).
Make sure the Sample Ratio is set to 1:1 and not 1:10.
Click on Add filters, type in orderId and select orderId from the list.
Paste in your Order Confirmation ID from when you went shopping earlier in the workshop and hit enter. If you didn’t capture one, please ask your instructor for one.

We have now filtered down to the exact trace where you encountered a poor user experience with a very long checkout wait.

A secondary benefit to viewing this trace is that the trace will be accessible for up to 13 months. This will allow developers to come back to this issue at a later stage and still view this trace for example.

Exercise

Click on the trace in the list.

Next, we will walk through the trace waterfall.

6. APM Waterfall

We have arrived at the Trace Waterfall from the Trace Analyzer. A trace is a collection of spans that share the same trace ID, representing a unique transaction handled by your application and its constituent services.

Each span in Splunk APM captures a single operation. Splunk APM considers a span to be an error span if the operation that the span captures results in an error.

Exercise

Click on the ! next to any of the paymentservice:grpc.hipstershop.PaymentService/Charge spans in the waterfall.

What is the error message and version being reported in the Span Details?

Invalid request and v350.10.

Now that we have identified the version of the paymentservice that is causing the issue, let’s see if we can find out more information about the error. This is where Related Logs come in.

Related Content relies on specific metadata that allow APM, Infrastructure Monitoring, and Log Observer to pass filters around Observability Cloud. For related logs to work, you need to have the following metadata in your logs:

service.name
deployment.environment
host.name
trace_id
span_id

Exercise

At the very bottom of the Trace Waterfall click on Logs (1). This highlights that there are Related Logs for this trace.
Click on the Logs for trace xxx entry in the pop-up, this will open the logs for the complete trace in Log Observer.

Next, let’s find out more about the error in the logs.

Splunk APM

Subsections of 6. Splunk APM

1. APM Explore

2. APM Service View

3. APM Tag Spotlight

4. APM Service Breakdown

5. APM Trace Analyzer

6. APM Waterfall