Deploy Gateway
5 minutes
Gateway
First we will deploy the OTel Gateway. The workshop instructor will deploy the gateway, but the steps are documented here in case you wish to try this yourself on a second instance.
The steps:
- Click the Data Management icon in the toolbar
- Click the + Add integration button
- Click Deploy the Splunk OpenTelemetry Collector button
- Click Next
- Select Linux
- Change mode to Data forwarding (gateway)
- Set the environment to prod
- Choose the access token for this workshop
- Click Next
- Copy the installer script and run it in the provided Linux environment (a sketch of what that command looks like follows this list).
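For reference, the command generated by the UI generally has the following shape. Treat the realm and token below as placeholders; the script the UI produces also encodes the prod environment you selected, so use that exact script rather than this sketch:

curl -sSL https://dl.signalfx.com/splunk-otel-collector.sh > /tmp/splunk-otel-collector.sh && \
  sudo sh /tmp/splunk-otel-collector.sh --realm <REALM> --mode gateway -- <ACCESS_TOKEN>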
Once our gateway is started we will notice… Nothing. The gateway doesn’t send any data by default, although it can be configured to.
We can review the config file with:
sudo cat /etc/otel/collector/splunk-otel-collector.conf
And see that the config being used is gateway_config.yaml.
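That file is mostly a list of environment variables the systemd service starts with. A rough sketch of what you will find (your realm and token will differ; the key line for our purposes is SPLUNK_CONFIG pointing at the gateway config):

SPLUNK_CONFIG=/etc/otel/collector/gateway_config.yaml
SPLUNK_ACCESS_TOKEN=<access token>
SPLUNK_REALM=<realm>
SPLUNK_API_URL=https://api.<realm>.signalfx.com
SPLUNK_INGEST_URL=https://ingest.<realm>.signalfx.com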
Tip
Diagrams created with OTelBin.io. Click on them to see them in detail.
| Diagram | What it Tells Us |
|---|---|
| Metrics | The gateway receives metrics over the otlp or signalfx protocols and sends them to Splunk Observability Cloud using the signalfx protocol. There is also a pipeline for prometheus metrics, labeled internal, which is meant for the collector's own metrics. (In other words, if we want to receive prometheus metrics directly we should add them to the main pipeline.) |
| Traces | The gateway receives traces over the jaeger, otlp, sapm, or zipkin protocols and sends them to Splunk Observability Cloud using the sapm protocol. |
| Logs | The gateway receives logs over otlp and sends them to two places: Splunk Enterprise (Cloud) for logs, and Splunk Observability Cloud for profiling data. There is also a pipeline labeled signalfx that sends signalfx events to Splunk Observability Cloud; these can be used to add events to charts, as well as to populate the process list. |
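Expressed as a simplified YAML sketch of the service section (illustrative only; the gateway_config.yaml on the instance is the source of truth for the exact receiver, processor, and exporter names):

service:
  pipelines:
    traces:
      receivers: [jaeger, otlp, sapm, zipkin]
      processors: [memory_limiter, batch]
      exporters: [sapm]
    metrics:
      receivers: [otlp, signalfx]
      processors: [memory_limiter, batch]
      exporters: [signalfx]
    metrics/internal:
      receivers: [prometheus/internal]
      processors: [memory_limiter, batch]
      exporters: [signalfx]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [splunk_hec, splunk_hec/profiling]
    logs/signalfx:
      receivers: [signalfx]
      processors: [memory_limiter, batch]
      exporters: [signalfx]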
We’re not going to see any host metrics, and we aren’t sending any other data through the gateway yet. But the gateway’s internal metrics are being sent in.
You can find one of them by creating a new chart and adding a metric:
- Click the + in the top-right
- Click Chart
- For the signal of Plot A, type otelcol_process_uptime
- Add a filter with the + to the right, and type: host.id:<name of instance>
You should get a chart like the following:
You can look at the Metric Finder to find other internal metrics to explore.
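If you prefer SignalFlow, the chart above can also be expressed directly in the plot editor; the host.id value is a placeholder for your instance name:

data('otelcol_process_uptime', filter=filter('host.id', '<name of instance>')).publish()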
Before we deploy a collector (agent), let’s add some metadata to metrics and traces with the gateway. That’s how we will know data is passing through it.
The attributes processor lets us add that metadata.
sudo vi /etc/otel/collector/gateway_config.yaml
Here’s what we want to add to the processors section:
processors:
  attributes/gateway_config:
    actions:
      - key: gateway
        value: oac
        action: insert
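A note on the action chosen here: insert only adds the gateway key when it is not already present on the data. If you wanted to overwrite an existing value instead, the attributes processor also supports upsert, for example:

processors:
  attributes/gateway_config:
    actions:
      - key: gateway
        value: oac
        # upsert inserts or overwrites; insert (above) never overwrites
        action: upsert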
And then we add attributes/gateway_config to each of the pipelines:
service:
  pipelines:
    traces:
      receivers: [jaeger, otlp, smartagent/signalfx-forwarder, zipkin]
      processors:
        - memory_limiter
        - batch
        - resourcedetection
        - attributes/gateway_config
        #- resource/add_environment
      exporters: [sapm, signalfx]
      # Use instead when sending to gateway
      #exporters: [otlp, signalfx]
    metrics:
      receivers: [hostmetrics, otlp, signalfx, smartagent/signalfx-forwarder]
      processors: [memory_limiter, batch, resourcedetection, attributes/gateway_config]
      exporters: [signalfx]
      # Use instead when sending to gateway
      #exporters: [otlp]
And finally we need to restart the gateway:
sudo systemctl restart splunk-otel-collector.service
We can make sure it is still running fine by checking the status:
sudo systemctl status splunk-otel-collector.service
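If the service does not come back up cleanly (a stray indent in the YAML is the usual culprit), the collector's logs will say why:

sudo journalctl -u splunk-otel-collector.service -f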
Next
Next, let’s deploy a collector and then configure it to use this gateway.
Deploy Collector (Agent)
10 minutes
Collector (Agent)
Now we will deploy a collector. At first this will be configured to go directly to the back-end, but we will change the configuration and restart the collector to use the gateway.
The steps:
- Click the Data Management icon in the toolbar
- Click the + Add integration button
- Click Deploy the Splunk OpenTelemetry Collector button
- Click Next
- Select Linux
- Leave the mode as Host monitoring (agent)
- Set the environment to prod
- Leave the rest as defaults
- Choose the access token for this workshop
- Click Next
- Copy the installer script and run it in the provided Linux environment (again, a sketch of the command follows this list).
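As before, the generated command will look roughly like the gateway one, only in the default agent mode; the values below are placeholders, so use the exact script from the UI:

curl -sSL https://dl.signalfx.com/splunk-otel-collector.sh > /tmp/splunk-otel-collector.sh && \
  sudo sh /tmp/splunk-otel-collector.sh --realm <REALM> --mode agent -- <ACCESS_TOKEN>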
This collector is sending host metrics, so you can find it in common navigators:
- Click the Infrastructure icon in the toolbar
- Click the EC2 panel under Amazon Web Services
- The AWSUniqueId is the easiest thing to find; add a filter and look for it with a wildcard (e.g. i-0ba6575181cb05226*)
We can also simply look at the cpu.utilization metric. Create a new chart to display it, filtered on the AWSUniqueId:
The reason we wanted to do that is so we can easily spot the new dimension that gets added once the collector's data is sent through the gateway. You can click on the Data table to see the dimensions currently being sent:
Next
Next we’ll reconfigure the collector to send to the gateway.
Configure Collector to Use Gateway
10 minutes
To reconfigure the collector we need to make these changes:
- In agent_config.yaml:
  - We need to adjust the signalfx exporter to use the gateway
  - The otlp exporter is already there, so we leave it alone
  - We need to change the pipelines to use otlp
- In splunk-otel-collector.conf:
  - We need to set SPLUNK_GATEWAY_URL to the URL provided by the instructor
See this docs page for more details.
The exporters will be the following:
exporters:
  # Metrics + Events
  signalfx:
    access_token: "${SPLUNK_ACCESS_TOKEN}"
    #api_url: "${SPLUNK_API_URL}"
    #ingest_url: "${SPLUNK_INGEST_URL}"
    # Use instead when sending to gateway
    api_url: "http://${SPLUNK_GATEWAY_URL}:6060"
    ingest_url: "http://${SPLUNK_GATEWAY_URL}:9943"
    sync_host_metadata: true
    correlation:
  # Send to gateway
  otlp:
    endpoint: "${SPLUNK_GATEWAY_URL}:4317"
    tls:
      insecure: true
You can leave the other exporters as they are; they won't be used, as you will see in the pipelines.
The pipeline changes (note which items have been commented out and which have been uncommented):
service:
  pipelines:
    traces:
      receivers: [jaeger, otlp, smartagent/signalfx-forwarder, zipkin]
      processors:
        - memory_limiter
        - batch
        - resourcedetection
        #- resource/add_environment
      #exporters: [sapm, signalfx]
      # Use instead when sending to gateway
      exporters: [otlp, signalfx]
    metrics:
      receivers: [hostmetrics, otlp, signalfx, smartagent/signalfx-forwarder]
      processors: [memory_limiter, batch, resourcedetection]
      #exporters: [signalfx]
      # Use instead when sending to gateway
      exporters: [otlp]
    metrics/internal:
      receivers: [prometheus/internal]
      processors: [memory_limiter, batch, resourcedetection]
      # When sending to gateway, at least one metrics pipeline needs
      # to use signalfx exporter so host metadata gets emitted
      exporters: [signalfx]
    logs/signalfx:
      receivers: [signalfx, smartagent/processlist]
      processors: [memory_limiter, batch, resourcedetection]
      exporters: [signalfx]
    logs:
      receivers: [fluentforward, otlp]
      processors:
        - memory_limiter
        - batch
        - resourcedetection
        #- resource/add_environment
      #exporters: [splunk_hec, splunk_hec/profiling]
      # Use instead when sending to gateway
      exporters: [otlp]
And finally we can add the SPLUNK_GATEWAY_URL in splunk-otel-collector.conf, for example:
SPLUNK_GATEWAY_URL=gateway.splunk011y.com
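If you would rather not open an editor, appending the line also works (the hostname here is just the example above; use the URL your instructor provided):

echo 'SPLUNK_GATEWAY_URL=gateway.splunk011y.com' | sudo tee -a /etc/otel/collector/splunk-otel-collector.conf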
Then we can restart the collector:
sudo systemctl restart splunk-otel-collector.service
And check the status:
sudo systemctl status splunk-otel-collector.service
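Another quick sanity check, assuming the default health_check extension is still enabled in the agent config, is the collector's local health endpoint:

curl http://localhost:13133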
And finally see the new dimension on the metrics:
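One way to confirm the data is really flowing through the gateway is to filter on the attribute we added there earlier (gateway:oac), for example in SignalFlow:

data('cpu.utilization', filter=filter('gateway', 'oac')).publish()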