Resources

Frequently Asked Questions

A collection of common questions and answers related to Observability, DevOps, Incident Response, and Splunk On-Call.

Dimension, Properties and Tags

One conversation that frequently comes up is Dimensions vs. Properties and when you should use one versus the other.

Naming Conventions for Tagging with OpenTelemetry and Splunk

When deploying OpenTelemetry in a large organization, it’s critical to define a standardized naming convention for tagging, and a governance process to ensure the convention is adhered to.

Local Hosting

Resources for setting up a locally hosted workshop environment.


Subsections of Resources

Frequently Asked Questions

A collection of common questions and answers related to Observability, DevOps, Incident Response, and Splunk On-Call.

Q: Alerts v. Incident Response v. Incident Management

A: Alerts, Incident Response and Incident Management are related functions. Together they comprise the incident response and resolution process.

Monitoring and Observability tools send alerts to incident response platforms. Those platforms take a collection of alerts and correlate them into incidents.

Those incidents are recorded in incident management (ITSM) platforms as the system of record. Alerts are the trigger that something has happened, and they provide context for an incident.

Incidents consist of the alert payload, all activity associated with the incident from the time it was created, and the on-call policies to be followed. ITSM is the system of record for incidents that are active and after they have been resolved.

All these components are necessary for successful incident response and management practices.


Q: Is Observability the same as Monitoring?

A: The key difference between Monitoring and Observability is the difference between “known knowns” and “unknown unknowns”, respectively.

In monitoring, the operator generally has prior knowledge of the architecture and elements in their system. They can reliably predict the relationships between elements and their associated metadata. Monitoring is well suited to stateful infrastructure that is not frequently changed.

Observability is for systems where the operator’s ability to predict and trace all elements in the system, and their relationships, is limited.

Observability is a set of practices and technology, which include traditional monitoring metrics.

These practices and technologies combined give the operator the ability to understand ephemeral and highly complex environments without prior knowledge of all elements of a system. Observability technology can also account for fluctuations in the environment and variation in metadata (cardinality) better than traditional monitoring, which is more static.


Q: What are Traces and Spans?

A: Traces and spans, combined with metrics and logs, make up the core types of data that feed modern Observability tools. They all have specific elements and functions, but work well together.

Because microservices-based architectures are distributed, transactions in the system touch multiple services before completing. This makes accurately pinpointing the location of an issue difficult. Traces are a method for tracking the full path of a request through all the services in a distributed system. Spans are the timed operations within each service. Traces are the connective tissue for the spans, and together they give more detail on individual service processes. While metrics give a good snapshot of the health of a system, and logs give depth when investigating issues, traces and spans guide operators to the source of issues with greater context. This saves time when investigating incidents and supports the increasing complexity of modern architectures.
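To make this concrete, here is a minimal sketch using the OpenTelemetry Python SDK (the service and span names are hypothetical, and the console exporter is used only so the spans can be seen locally). The two nested spans share a single trace, illustrating how spans are the timed operations and the trace ties them together:

# Minimal sketch: two nested spans that belong to the same trace (assumes: pip install opentelemetry-sdk)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

with tracer.start_as_current_span("checkout"):          # parent span
    with tracer.start_as_current_span("charge-card"):   # child span, same trace ID as the parent
        pass  # the timed work for this operation would happen here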


Q: What is the Sidecar Pattern?

A: The sidecar pattern is a design pattern in which related services are deployed alongside each other and connected directly by the infrastructure. The related service can add functionality to, or support, the application logic it is attached to. The pattern is used heavily as a method for deploying agents associated with the management plane alongside the application service they support.

In Observability, the sidecar services are the application logic and the agent collecting data from that service. The setup requires two containers: one running the application service and one running the agent. The containers share a pod and resources such as disk, network, and namespace. They are also deployed together and share the same lifecycle.


Dimension, Properties and Tags

Applying context to your metrics

One conversation that frequently comes up is Dimensions vs. Properties and when you should use one versus the other. Instead of starting off with their descriptions, it makes sense to understand how we use them and how they are similar before diving into their differences and examples of why you would use one or the other.

How are Dimensions and Properties similar?

The simplest answer is that they are both metadata key:value pairs that add context to our metrics. Metrics themselves are what we actually want to measure, whether it’s a standard infrastructure metric like cpu.utilization or a custom metric like the number of API calls received.

If we receive a value of 50% for the cpu.utilization metric without knowing where it came from or any other context, it is just a number and not useful to us. At a minimum, we would need to know what host it came from.

These days it is likely we care more about the performance or utilization of a cluster or data center as a whole than that of an individual host. We are therefore more interested in things like the average cpu.utilization across a cluster of hosts, whether a host’s cpu.utilization is an outlier compared to other hosts running the same service, or how the average cpu.utilization of one environment compares to another.

To be able to slice, aggregate, or group our cpu.utilization metrics in this way, we need additional metadata for each cpu.utilization metric we receive: what cluster the host belongs to, what service is running on the host, and what environment it is part of. This metadata can be in the form of either dimension or property key:value pairs.

For example, if I go to apply a filter to a dashboard or use the Group by function when running analytics, I can use a property or a dimension.

So how are Dimensions and Properties different?

Dimensions are sent in with metrics at the time of ingest, while properties are applied to metrics or dimensions after ingest. This means that any metadata you need to make a datapoint (a single reported value of a metric) unique, like what host a value of cpu.utilization is coming from, needs to be a dimension. Metric names + dimensions uniquely define an MTS (metric time series).

Example: the cpu.utilization metric sent by a particular host (server1) with a dimension host:server1 would be considered a unique time series. If you have 10 servers, each sending that metric, then you would have 10 time series, with each time series sharing the metric name cpu.utilization and uniquely identified by the dimension key-value pair (host:server1, host:server2…host:server10).

However, if your server names are only unique within a datacenter rather than across your whole environment, you would need to add a second dimension, dc, for the datacenter location. You could now have double the number of possible MTS. cpu.utilization metrics received would now be uniquely identified by two dimension key-value pairs.

cpu.utilization plus dc:east & host:server1 would create a different time series than cpu.utilization plus dc:west & host:server1.
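As a hedged sketch of what this looks like at ingest time (the realm and access token below are placeholders), the following Python snippet sends a single cpu.utilization datapoint to the Splunk Observability Cloud ingest endpoint; the host and dc values in the dimensions map are exactly what make the resulting MTS unique:

# Sketch: send one datapoint with its dimensions at ingest time (REALM and TOKEN are placeholders)
import requests

REALM = "us1"
TOKEN = "YOUR_ACCESS_TOKEN"

payload = {
    "gauge": [
        {
            "metric": "cpu.utilization",
            "value": 50,
            "dimensions": {"host": "server1", "dc": "east"},  # metric name + these dimensions = one MTS
        }
    ]
}

resp = requests.post(
    f"https://ingest.{REALM}.signalfx.com/v2/datapoint",
    headers={"X-SF-Token": TOKEN, "Content-Type": "application/json"},
    json=payload,
)
print(resp.status_code)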

Dimensions are immutable while properties are mutable

As we mentioned above, metric name + dimensions make a unique MTS. Therefore, if a dimension value changes, we have a new combination of metric name + dimension value, which creates a new MTS.

Properties, on the other hand, are applied to metrics (or dimensions) after they are ingested. If you apply a property to a metric, it propagates and applies to all MTS that the metric is a part of. If you apply a property to a dimension, say host:server1, then all metrics from that host will have those properties attached. If you change the value of a property, it propagates and updates the value of the property on all MTS with that property attached. Why is this important? It means that if you care about the historical value of a property, you need to make it a dimension.

Example: We are collecting custom metrics on our application. One metric is latency, which measures the latency of requests made to our application. We have a dimension customer, so we can sort and compare latency by customer. We decide we want to track the application version as well, so we can sort and compare our application latency by the version customers are using. We create a property version that we attach to the customer dimension. Initially all customers are using application version 1, so version:1.

We now have some customers using version 2 of our application; for those customers we update the property to version:2. When we update the value of the version property for those customers, it propagates down to all MTS for that customer. We lose the history that those customers at some point used version 1, so if we wanted to compare the latency of version 1 and version 2 over a historical period we would not get accurate data. In this case, even though we don’t need the application version to make our metric time series unique, we need to make version a dimension, because we care about its historical value.

So when should something be a Property instead of a dimension?

The first reason would be if there is any metadata you want attached to metrics but you don’t know it at the time of ingest. The second is a best practice: if it doesn’t need to be a dimension, make it a property. Why? Today there is a limit of 5K MTS per analytics job or chart rendering, and the more dimensions you have, the more MTS you will create. Properties are completely free-form and let you add as much information as you want or need to metrics or dimensions without adding to MTS counts.

As dimensions are sent in with every datapoint, the more dimensions you have the more data you send to us, which could mean higher costs to you if your cloud provider charges for data transfer.

A good example of some things that should be properties would be additional host information. You want to be able to see things like machine_type, processor, or os, but instead of making these things dimensions and sending them with every metric from a host you could make them properties and attach the properties to the host dimension.

Example: for host:server1 you would set the properties machine_type:ucs, processor:xeon-5560, and os:rhel71. Any time a metric comes in with the dimension host:server1, all of the above properties are applied to it automatically.
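A sketch of doing the same thing programmatically is shown below, assuming the /dimension/:key/:value API mentioned under Additional information (the realm and API token are placeholders):

# Sketch: attach properties to the host:server1 dimension after ingest (REALM and API_TOKEN are placeholders)
import requests

REALM = "us1"
API_TOKEN = "YOUR_API_TOKEN"

body = {
    "key": "host",
    "value": "server1",
    "customProperties": {  # these properties propagate to every MTS that carries host:server1
        "machine_type": "ucs",
        "processor": "xeon-5560",
        "os": "rhel71",
    },
}

resp = requests.put(
    f"https://api.{REALM}.signalfx.com/v2/dimension/host/server1",
    headers={"X-SF-TOKEN": API_TOKEN, "Content-Type": "application/json"},
    json=body,
)
print(resp.status_code)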

Some other examples of use cases for properties would be if you want to know who is the escalation contact for each service or SLA level for every customer. You do not need these items to make metrics uniquely identifiable and you don’t care about the historical values, so they can be properties. The properties could be added to the service dimension and customer dimensions and would then apply to all metrics and MTSs with those dimensions.

What about Tags?

Tags are the third type of metadata that can be used to give context to or help organize your metrics. Unlike dimensions and properties, tags are NOT key:value pairs. Tags can be thought of as labels or keywords. Similar to properties, tags are applied to your data after ingest, via the Catalog in the UI or programmatically via the API. Tags can be applied to metrics, dimensions, or other objects such as detectors.

Where would I use Tags?

Tags are used when there is a need for a many-to-one relationship of tags to an object or a one-to-many relationship between the tag and the objects you are applying them to. They are useful for grouping together metrics that may not be intrinsically associated.

One example is you have hosts that run multiple applications. You can create tags (labels) for each application and then apply multiple tags to each host to label the applications that are running on it.

Example: Server1 runs 3 applications. You create tags app1, app2 and app3 and apply all 3 tags to the dimension host:server1.

To expand on the example above, let us say you also collect metrics from your applications. You could apply the tags you created to any metrics coming in from the applications themselves. You can then filter on a tag to focus on a single application while still getting the full picture, including both the application metrics and the relevant host metrics.

Example: App1 sends in metrics with the dimension service:application1. You would apply the tag app1 to the dimension service:application1. You can then filter on the tag app1 in charts and dashboards.
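A similar hedged sketch for tags, again assuming the /dimension/:key/:value API and placeholder credentials, applies the app1 tag to the service:application1 dimension:

# Sketch: apply the tag "app1" to the service:application1 dimension (REALM and API_TOKEN are placeholders)
import requests

REALM = "us1"
API_TOKEN = "YOUR_API_TOKEN"

body = {
    "key": "service",
    "value": "application1",
    "tags": ["app1"],  # tags are labels, not key:value pairs
}

resp = requests.put(
    f"https://api.{REALM}.signalfx.com/v2/dimension/service/application1",
    headers={"X-SF-TOKEN": API_TOKEN, "Content-Type": "application/json"},
    json=body,
)
print(resp.status_code)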

Another use case for tags is binary states where there is just one possible value. An example: you do canary testing, and when you do a canary deployment you want to be able to mark the hosts that received the new code, so you can easily identify their metrics and compare their performance to the hosts that did not receive the new code. There is no need for a key:value pair, as there is just a single value, “canary”.

Be aware that while you can filter on tags, you cannot use the groupBy function on them. The groupBy function works by supplying the key part of a key:value pair; the results are then grouped by the values of that key.

Additional information

For information on sending dimensions for custom metrics, please review the Client Libraries documentation for your library of choice.

For information on how to apply properties and tags to metrics or dimensions via the API, please see the API documentation for /metric/:name and /dimension/:key/:value.

For information on how to add or edit properties and tags via the Metadata Catalog in the UI, please reference the section Add or edit metadata in Search the Metric Finder and Metadata catalog.


Naming Conventions for Tagging with OpenTelemetry and Splunk

Introduction

When deploying OpenTelemetry in a large organization, it’s critical to define a standardized naming convention for tagging, and a governance process to ensure the convention is adhered to.

This ensures that MELT data collected via OpenTelemetry can be efficiently utilized for alerting, dashboarding, and troubleshooting purposes. It also ensures that users of Splunk Observability Cloud can quickly find the data they’re looking for.

Naming conventions also ensure that data can be aggregated effectively. For example, if we wanted to count the number of unique hosts by environment, then we must use a standardized convention for capturing the host and environment names.

Attributes vs. Tags

Before we go further, let’s make a note regarding terminology. Tags in OpenTelemetry are called “attributes”. Attributes can be attached to metrics, logs, and traces, either via manual instrumentation or automated instrumentation.

Attributes can also be attached to metrics, logs, and traces at the OpenTelemetry collector level, using various processors such as the Resource Detection processor.

Once traces with attributes are ingested into Splunk Observability Cloud, they are available as “tags”. Optionally, attributes collected as part of traces can be used to create Troubleshooting Metric Sets, which can in turn be used with various features such as Tag Spotlight.

Alternatively, attributes can be used to create Monitoring Metric Sets, which can be used to drive alerting.

Resource Semantic Conventions

OpenTelemetry resource semantic conventions should be used as a starting point when determining which attributes an organization should standardize on. In the following sections, we’ll review some of the more commonly used attributes.

Service Attributes

A number of attributes are used to describe the service being monitored.

service.name is a required attribute that defines the logical name of the service. It’s added automatically by the OpenTelemetry SDK but can be customized. It’s best to keep this simple (e.g. inventoryservice would be better than inventoryservice-prod-hostxyz, as other attributes can be utilized to capture other aspects of the service instead).

The following service attributes are recommended (a configuration sketch follows the list):

  • service.namespace can be used to identify the team that owns the service
  • service.instance.id is used to identify a unique instance of the service
  • service.version is used to identify the version of the service
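As a configuration sketch (the attribute values here are hypothetical), these service attributes can be set on the OpenTelemetry resource with the Python SDK; any spans emitted through the resulting tracer provider will carry them:

# Sketch: define service attributes on the OpenTelemetry resource (values are hypothetical)
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({
    "service.name": "inventoryservice",
    "service.namespace": "warehouse-team",
    "service.version": "1.4.2",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)  # spans created via this provider carry the attributes above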

Telemetry SDK

These attributes are set automatically by the SDK, to record information about the instrumentation libraries being used:

  • telemetry.sdk.name is typically set to opentelemetry
  • telemetry.sdk.language is the language of the SDK, such as java
  • telemetry.sdk.version identifies which version of the SDK is utilized

Containers

For services running in containers, there are numerous attributes used to describe the container runtime, such as container.id, container.name, and container.image.name. The full list can be found here.

Hosts

These attributes describe the host where the service is running, and include attributes such as host.id, host.name, and host.arch. The full list can be found here.

Deployment Environment

The deployment.environment attribute is used to identify the environment where the service is deployed, such as staging or production.

Splunk Observability Cloud uses this attribute to enable related content (as described here), so it’s important to include it.

Cloud

There are also attributes to capture information for services running in public cloud environments, such as AWS. Attributes include cloud.provider, cloud.account.id, and cloud.region.

The full list of attributes can be found here.

Some cloud providers, such as GCP, define semantic conventions specific to their offering.

Kubernetes

There are a number of standardized attributes for applications running in Kubernetes as well. Many of these are added automatically by Splunk’s distribution of the OpenTelemetry collector, as described here.

These attributes include k8s.cluster.name, k8s.node.name, k8s.pod.name, k8s.namespace.name, and kubernetes.workload.name.

Best Practices for Creating Custom Attributes

Many organizations require attributes that go beyond what’s defined in OpenTelemetry’s resource semantic conventions.

In this case, it’s important to avoid naming conflicts with attribute names already included in the semantic conventions. So it’s a good idea to check the semantic conventions before deciding on a particular attribute name for your naming convention.

In addition to a naming convention for attribute names, you also need to consider attribute values. For example, if you’d like to capture the particular business unit with which an application belongs, then you’ll also want to have a standardized list of business unit values to choose from, to facilitate effective filtering.

The OpenTelemetry community provides guidelines that should be followed when naming attributes as well, which can be found here.

The Recommendations for Application Developers section is most relevant to our discussion.

They recommend the following (a short example appears after the list):

  • Prefix the attribute name with your company’s domain name, e.g. com.acme.shopname (if the attribute may be used outside your company as well as inside).
  • Prefix the attribute name with the application name, if the attribute is unique to a particular application and is only used within your organization.
  • Do not use existing OpenTelemetry semantic convention names as a prefix for your attribute name.
  • Consider submitting a proposal to add your attribute name to the OpenTelemetry specification, if there’s a general need for it across different organizations and industries.
  • Avoid attribute names that start with otel.*, as this prefix is reserved for OpenTelemetry specification usage.
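As an illustration of the first recommendation (the attribute name and value below are purely hypothetical, following the com.acme example above), a custom company-prefixed attribute might be set on a span like this:

# Sketch: a custom attribute with a company-domain prefix (hypothetical name and value)
from opentelemetry import trace

tracer = trace.get_tracer("shopping-cart")  # hypothetical application name

with tracer.start_as_current_span("add-to-cart") as span:
    span.set_attribute("com.acme.business_unit", "retail")  # value taken from a standardized list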

Metric Cardinality Considerations

One final thing to keep in mind when deciding on naming standards for attribute names and values is related to metric cardinality.

Metric cardinality is defined as the number of unique metric time series (MTS) produced by a combination of metric names and their associated dimensions.

A metric has high cardinality when it has a high number of dimension keys and a high number of possible unique values for those dimension keys.

For example, suppose your application sends in data for a metric named custom.metric. In the absence of any attributes, custom.metric would generate a single metric time series (MTS).

On the other hand, if custom.metric includes an attribute named customer.id, and there are thousands of customer ID values, this would generate thousands of metric time series, which may impact costs and query performance.
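To make the arithmetic concrete (the numbers are only illustrative): if custom.metric carried a customer.id attribute with 5,000 values and a region attribute with 10 values, the worst case would be 5,000 × 10 = 50,000 MTS for that single metric name.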

Splunk Observability Cloud provides a report that allows for the management of metrics usage, and rules can be created to drop undesirable dimensions. However, the first line of defense is understanding how attribute name and value combinations can drive increased metric cardinality.

Summary

In this document, we highlighted the importance of defining naming conventions for OpenTelemetry tags, preferably before starting a large rollout of OpenTelemetry instrumentation.

We discussed how OpenTelemetry’s resource semantic conventions define the naming conventions for several attributes, many of which are automatically collected via the OpenTelemetry SDKs, as well as processors that run within the OpenTelemetry collector.

Finally, we shared some best practices for creating your attribute names, for situations where the resource semantic conventions are not sufficient for your organization’s needs.


Local Hosting

Info

Please disable any VPNs or proxies before setting up an instance, e.g.:

  • ZScaler
  • Cisco AnyConnect

These tools will prevent the instance from being created properly.


Subsections of Local Hosting

Local Hosting with Multipass

Install Multipass and Terraform for your operating system. On a Mac (Intel), you can also install via Homebrew, e.g.:

brew install multipass
brew install terraform

Clone workshop repository:

git clone https://github.com/splunk/observability-workshop

Change into multipass directory:

cd observability-workshop/local-hosting/multipass

Log Observer Connect:

If you plan to use your own Splunk Observability Cloud Suite Org and/or Splunk instance, you may need to create a new Log Observer Connect connection. Follow the instructions found in the documentation for Splunk Cloud or Splunk Enterprise.

Additional requirements for running your own Log Observer Connect connection are:

  • Create an index called splunk4rookies-workshop.
  • Make sure the service account user used in the Log Observer Connect connection has access to the splunk4rookies-workshop index. (You can remove all other indexes, as all workshop log data should go to this index.)

Initialise Terraform:

terraform init -upgrade

Initializing the backend...

Initializing provider plugins...
- Finding latest version of hashicorp/random...
- Finding latest version of hashicorp/local...
- Finding larstobi/multipass versions matching "~> 1.4.1"...
- Installing hashicorp/random v3.5.1...
- Installed hashicorp/random v3.5.1 (signed by HashiCorp)
- Installing hashicorp/local v2.4.0...
- Installed hashicorp/local v2.4.0 (signed by HashiCorp)
- Installing larstobi/multipass v1.4.2...
- Installed larstobi/multipass v1.4.2 (self-signed, key ID 797707331BF3549C)

Create the Terraform variables file. Variables are kept in the file terraform.tfvars and we provide a template, terraform.tfvars.template, to copy and edit:

cp terraform.tfvars.template terraform.tfvars

The following Terraform variables are required:

  • splunk_access_token: Observability Access Token
  • splunk_api_token: Observability API Token
  • splunk_rum_token: Observability RUM Token
  • splunk_realm: Observability Realm e.g. eu0
  • splunk_hec_url: Splunk HEC URL
  • splunk_hec_token: Splunk HEC Token

Instance type variables:

  • splunk_hec_url: Do not use a Raw Endpoint, but use an Event Endpoint as this will process the logs correctly
  • splunk_presetup: Provide a preconfigured instance (OTel Collector and Online Boutique deployed with RUM enabled). The default is false.
  • splunk_diab: Install and run Demo-in-a-Box. The default is false.
  • tagging_workshop: Install and configure the Tagging Workshop. The default is false.
  • otel_demo: Install and configure the OpenTelemetry Astronomy Shop Demo. This requires that splunk_presetup is set to false. The default is false.

Optional advanced variables:

  • wsversion: Set this to main if working on the development of the workshop, otherwise this can be omitted.

Run terraform plan to check that all configuration is OK. Once happy, run terraform apply to create the instance.

terraform apply
random_string.hostname: Creating...
random_string.hostname: Creation complete after 0s [id=cynu]
local_file.user_data: Creating...
local_file.user_data: Creation complete after 0s [id=46a5c50e396a1a7820c3999c131a09214db903dd]
multipass_instance.ubuntu: Creating...
multipass_instance.ubuntu: Still creating... [10s elapsed]
multipass_instance.ubuntu: Still creating... [20s elapsed]
...
multipass_instance.ubuntu: Still creating... [1m30s elapsed]
multipass_instance.ubuntu: Creation complete after 1m38s [name=cynu]
data.multipass_instance.ubuntu: Reading...
data.multipass_instance.ubuntu: Read complete after 1s [name=cynu]

Apply complete! Resources: 3 added, 0 changed, 0 destroyed.

Outputs:

instance_details = [
  {
    "image" = "Ubuntu 22.04.2 LTS"
    "image_hash" = "345fbbb6ec82 (Ubuntu 22.04 LTS)"
    "ipv4" = "192.168.205.185"
    "name" = "cynu"
    "state" = "Running"
  },
]

Once the instance has been successfully created (this can take several minutes), shell into it using the name from the output above e.g.

multipass shell cynu
███████╗██████╗ ██╗     ██╗   ██╗███╗   ██╗██╗  ██╗    ██╗
██╔════╝██╔══██╗██║     ██║   ██║████╗  ██║██║ ██╔╝    ╚██╗
███████╗██████╔╝██║     ██║   ██║██╔██╗ ██║█████╔╝      ╚██╗
╚════██║██╔═══╝ ██║     ██║   ██║██║╚██╗██║██╔═██╗      ██╔╝
███████║██║     ███████╗╚██████╔╝██║ ╚████║██║  ██╗    ██╔╝
╚══════╝╚═╝     ╚══════╝ ╚═════╝ ╚═╝  ╚═══╝╚═╝  ╚═╝    ╚═╝

To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details

Waiting for cloud-init status...
Your instance is ready!

ubuntu@cynu ~ $

Change user to splunk (the password is Splunk123!):

su -l splunk

Validate the instance:

kubectl version --output=yaml

To delete the instance, run the following command:

terraform destroy

Local Hosting with OrbStack

Install OrbStack:

brew install orbstack

Log Observer Connect:

If you plan to use your own Splunk Observability Cloud Suite Org and/or Splunk instance, you may need to create a new Log Observer Connect connection. Follow the instructions found in the documentation for Splunk Cloud or Splunk Enterprise.

Additional requirements for running your own Log Observer Connect connection are:

  • Create an index called splunk4rookies-workshop.
  • Make sure the service account user used in the Log Observer Connect connection has access to the splunk4rookies-workshop index. (You can remove all other indexes, as all workshop log data should go to this index.)

Clone workshop repository:

git clone https://github.com/splunk/observability-workshop

Change into Orbstack directory:

cd observability-workshop/local-hosting/orbstack

Copy start.sh.example to start.sh and edit the file to set the following required variables. Make sure that you do not use a Raw Endpoint for the HEC URL; use an Event Endpoint instead, as this will process the logs correctly:

  • ACCESS_TOKEN
  • REALM
  • API_TOKEN
  • RUM_TOKEN
  • HEC_TOKEN
  • HEC_URL

Run the script and provide an instance name, e.g.:

./start.sh my-instance

Once the instance has been successfully created (this can take several minutes), you will automatically be logged into the instance. If you exit you can SSH back in using the following command (replace <my_instance> with the name of your instance):

ssh splunk@<my_instance>@orb

Once in the shell, you can validate that the instance is ready by running the following command:

kubectl version --output=yaml

To get the IP address of the instance, run the following command:

ifconfig eth0

To delete the instance, run the following command:

orb delete my-instance

Running Demo-in-a-Box

Demo-in-a-box is a method for running demo apps easily using a web interface.

It provides:

  • A quick way to deploy demo apps and states
  • A way to easily change the configuration of your OTel Collector and see its logs
  • A way to get pod status, pod logs, etc.

To leverage this locally using multipass:

  • Follow the local hosting for multipass instructions
    • In the terraform.tfvars file, set splunk_diab to true and make sure all other options are set to false
    • Then set the other required tokens and URLs
    • Then run the terraform steps
  • Once the instance is up, navigate in your browser to: http://<IP>:8083
    • In the terraform.tfvars file the wsversion variable defaults to the current version of the workshop, e.g. 4.64:
      • To use the latest developments, change wsversion to main
      • Only three versions of the workshop are maintained: development (main), the current release (e.g. 4.64), and the previous release (e.g. 4.63)
      • After making the change, run terraform apply to apply it
  • Now you can deploy any of the demos; this will also deploy the collector as part of the deployment