Applying context to your metrics
One conversation that frequently comes up is Dimensions versus Properties and when you should use one versus the other. Instead of starting with their descriptions, it makes sense to understand how we use them and how they are similar, before diving into their differences and examples of why you would use one or the other.
How are Dimensions and Properties similar?
The simplest answer is that they are both metadata key:value
pairs that add context to our metrics. Metrics themselves are what we actually want to measure, whether it’s a standard infrastructure metric like cpu.utilization
or a custom metric like number of API calls received.
If we receive a value of 50% for the cpu.utilization
metric without knowing where it came from, or any other context, it is just a number and not useful to us. At a minimum, we would need to know which host it came from.
These days it is likely we care more about the performance or utilization of a cluster or data center as a whole than that of an individual host. We are therefore more interested in things like the average cpu.utilization
across a cluster of hosts, whether a host’s cpu.utilization
is an outlier compared to other hosts running the same service, or comparing the average cpu.utilization
of one environment to another.
To be able to slice, aggregate or group our cpu.utilization
metrics in this way, the metadata we receive with the cpu.utilization
metrics will need to include which cluster a host belongs to, what service is running on the host, and what environment it is a part of. This metadata can be in the form of either dimension or property key:value
pairs.
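To make this concrete, here is a minimal Python sketch (hosts, clusters and values are invented for illustration) of grouping cpu.utilization datapoints by a cluster key and averaging:

```python
# Sketch: once each datapoint carries cluster metadata, cpu.utilization
# can be averaged per cluster (all names and values are illustrative).
datapoints = [
    {"metric": "cpu.utilization", "value": 50, "host": "server1", "cluster": "a"},
    {"metric": "cpu.utilization", "value": 70, "host": "server2", "cluster": "a"},
    {"metric": "cpu.utilization", "value": 30, "host": "server3", "cluster": "b"},
]

# Group values by the cluster key, then average each group
by_cluster = {}
for dp in datapoints:
    by_cluster.setdefault(dp["cluster"], []).append(dp["value"])
averages = {c: sum(v) / len(v) for c, v in by_cluster.items()}
print(averages)  # {'a': 60.0, 'b': 30.0}
```

This is the kind of grouping the platform performs for you when the metadata is attached as dimensions or properties.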
For example, if I go to apply a filter to a dashboard or use the Group by function when running analytics, I can use a property or a dimension.
So how are Dimensions and Properties different?
Dimensions are sent in with metrics at the time of ingest, while properties are applied to metrics or dimensions after ingest. This means that any metadata you need to make a datapoint (a single reported value of a metric) unique, like which host a value of cpu.utilization
comes from, needs to be a dimension. Metric name + dimensions uniquely define an MTS (metric time series).
Example: the cpu.utilization
metric sent by a particular host (server1) with a dimension host:server1
would be considered a unique time series. If you have 10 servers, each sending that metric, then you would have 10 time series, with each time series sharing the metric name cpu.utilization
and uniquely identified by the dimension key-value pair (host:server1, host:server2…host:server10).
However, if your server names are only unique within a datacenter, rather than across your whole environment, you would need to add a second dimension, dc,
for the datacenter location. You could now have double the number of possible MTSs: cpu.utilization metrics received would now be uniquely identified by two dimension key-value pairs.
cpu.utilization plus dc:east & host:server1 would create a different time series than cpu.utilization plus dc:west & host:server1.
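The identity rule can be sketched in a few lines of Python (an illustrative helper, not a Splunk API):

```python
# Sketch: an MTS is uniquely identified by the metric name plus its
# dimension key:value pairs.
def mts_id(metric, dimensions):
    return (metric, tuple(sorted(dimensions.items())))

east = mts_id("cpu.utilization", {"dc": "east", "host": "server1"})
west = mts_id("cpu.utilization", {"dc": "west", "host": "server1"})
assert east != west  # same host name, different datacenter: two distinct MTS

# Ten hosts in one datacenter yield ten distinct MTS for one metric name
cluster = {mts_id("cpu.utilization", {"dc": "east", "host": f"server{i}"})
           for i in range(1, 11)}
print(len(cluster))  # 10
```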
Dimensions are immutable while properties are mutable
As we mentioned above, Metric name + dimensions make a unique MTS. Therefore, if the dimension value changes we will have a new unique combination of metric name + dimension value and create a new MTS.
Properties on the other hand are applied to metrics (or dimensions) after they are ingested. If you apply a property to a metric, it propagates and applies to all MTS that the metric is a part of. Or if you apply a property to a dimension, say host:server1 then all metrics from that host will have those properties attached. If you change the value of a property it will propagate and update the value of the property to all MTSs with that property attached. Why is this important? It means that if you care about the historical value of a property you need to make it a dimension.
Example: We are collecting custom metrics on our application. One metric is latency, which records the latency of requests made to our application. We have a dimension customer, so we can sort and compare latency by customer. We decide we want to track the application version as well, so we can sort and compare our application latency by the version customers are using. We create a property version that we attach to the customer dimension. Initially all customers are using application version 1, so version:1.
We now have some customers using version 2 of our application; for those customers we update the property to version:2. When we update the value of the version property for those customers it will propagate down to all MTS for that customer. We lose the history that those customers at some point used version 1, so if we wanted to compare latency of version 1 and version 2 over a historical period we would not get accurate data. In this case, even though we don’t need application version to make our metric time series unique, we need to make version a dimension, because we care about the historical value.
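A minimal Python sketch of this propagation behavior, with invented names and values:

```python
# Sketch of why property updates erase history: properties hang off the
# dimension, so one update rewrites the context for every datapoint,
# past and present (names and values are illustrative).
properties = {("customer", "acme"): {"version": "1"}}

# Three historical latency datapoints for this customer
history = [("latency", {"customer": "acme"}, minute) for minute in range(3)]

properties[("customer", "acme")]["version"] = "2"  # customer upgrades

# Every historical datapoint now resolves to version 2; version 1 is gone
resolved = {properties[("customer", "acme")]["version"] for _ in history}
print(resolved)  # {'2'}
```

Had version been a dimension, the upgrade would instead have started a new MTS, leaving the version-1 history intact.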
So when should something be a Property instead of a dimension?
The first reason would be if there is any metadata you want attached to metrics, but you don’t know it at the time of ingest.
The second reason is a best practice: if it doesn’t need to be a dimension, make it a property. Why?
One reason is that today there is a limit of 5K MTSs per analytics job or chart rendering, and the more dimensions you have, the more MTSs you will create. Properties are completely free-form and let you add as much information as you want or need to metrics or dimensions without adding to MTS counts.
As dimensions are sent in with every datapoint, the more dimensions you have the more data you send to us, which could mean higher costs to you if your cloud provider charges for data transfer.
A good example of some things that should be properties would be additional host information. You want to be able to see things like machine_type, processor, or os, but instead of making these things dimensions and sending them with every metric from a host you could make them properties and attach the properties to the host dimension.
Example: for host:server1 you would set the properties machine_type:ucs, processor:xeon-5560, and os:rhel71. Anytime a metric comes in with the dimension host:server1, all the above properties will be applied to it automatically.
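For illustration, the JSON body for attaching these properties to the host:server1 dimension might look like the sketch below. The customProperties field name follows the SignalFx-style dimension object, but treat the exact endpoint and schema as an assumption and verify them against the API documentation:

```python
import json

# Hypothetical sketch of a dimension update payload; field names are
# assumptions modeled on the SignalFx-style /v2/dimension object.
dimension_update = {
    "key": "host",
    "value": "server1",
    "customProperties": {
        "machine_type": "ucs",
        "processor": "xeon-5560",
        "os": "rhel71",
    },
}
body = json.dumps(dimension_update)
print(body)
```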
Some other examples of use cases for properties would be if you want to know who is the escalation contact for each service or SLA level for every customer. You do not need these items to make metrics uniquely identifiable and you don’t care about the historical values, so they can be properties. The properties could be added to the service dimension and customer dimensions and would then apply to all metrics and MTSs with those dimensions.
Tags are the 3rd type of metadata that can be used to give context to or help organize your metrics. Unlike dimensions and properties, tags are NOT key:value pairs. Tags can be thought of as labels or keywords. Similar to Properties, Tags are applied to your data after ingest via the Catalog in the UI or programmatically via the API. Tags can be applied to Metrics, Dimensions or other objects such as Detectors.
Tags are used when there is a need for a many-to-one relationship of tags to an object or a one-to-many relationship between the tag and the objects you are applying them to. They are useful for grouping together metrics that may not be intrinsically associated.
One example is you have hosts that run multiple applications. You can create tags (labels) for each application and then apply multiple tags to each host to label the applications that are running on it.
Example: Server1 runs 3 applications. You create tags app1, app2 and app3 and apply all 3 tags to the dimension host:server1
To expand on the example above let us say you also collect metrics from your applications. You could apply the tags you created to any metrics coming in from the applications themselves. You can filter based on a tag allowing you to filter based on an application, but get the full picture including both application and the relevant host metrics.
Example: App1 sends in metrics with the dimension service:application1. You would apply tag app1 to the dimension service:application1. You can then filter on the tag app1 in charts and dashboards.
Another use case for tags is binary states, where there is just one possible value. An example is canary testing: when you do a canary deployment you want to be able to mark the hosts that received the new code, so you can easily identify their metrics and compare their performance to the hosts that did not receive the new code. There is no need for a key:value pair, as there is just a single value, “canary”.
Be aware that while you can filter on tags you cannot use the groupBy function on them. The groupBy function is run by supplying the key part of a key:value pair and the results are then grouped by values of that key pair.
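The structural difference can be sketched in a few lines of Python (illustrative data only): dimensions and properties carry a key you can group by, while tags are bare labels you can only match on.

```python
# Dimensions/properties are key:value pairs; tags are bare labels (a set).
dimensions = {"host": "server1", "dc": "east"}  # grouping by "host" is meaningful
tags = {"app1", "app2", "canary"}               # labels only: filter, not group

assert "canary" in tags        # filtering on a tag is a membership check
assert "host" in dimensions    # grouping needs a key, which tags lack
```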
For information on sending dimensions for custom metrics please review the Client Libraries documentation for your library of choice.
For information on how to apply properties & tags to metrics or dimensions via the API, please see the API documentation for /metric/:name and /dimension/:key/:value
For information on how to add or edit properties and tags via the Metadata Catalog in the UI please reference the section Add or edit metadata in Search the Metric Finder and Metadata catalog.
Naming Conventions for Tagging with OpenTelemetry and Splunk
Introduction
When deploying OpenTelemetry in a large organization, it’s critical to define a standardized naming convention for tagging, and a governance process to ensure the convention is adhered to.
This ensures that MELT data collected via OpenTelemetry can be efficiently utilized for alerting, dashboarding, and troubleshooting purposes. It also ensures that users of Splunk Observability Cloud can quickly find the data they’re looking for.
Naming conventions also ensure that data can be aggregated effectively. For example, if we wanted to count the number of unique hosts by environment, then we must use a standardized convention for capturing the host and environment names.
Before we go further, let’s make a note regarding terminology. Tags in OpenTelemetry are called “attributes”. Attributes can be attached to metrics, logs, and traces, either via manual instrumentation or automated instrumentation.
Attributes can also be attached to metrics, logs, and traces at the OpenTelemetry collector level, using various processors such as the Resource Detection processor.
Once traces with attributes are ingested into Splunk Observability Cloud, they are available as “tags”. Optionally, attributes collected as part of traces can be used to create Troubleshooting Metric Sets, which can in turn be used with various features such as Tag Spotlight.
Alternatively, attributes can be used to create Monitoring Metric Sets, which can be used to drive alerting.
Resource Semantic Conventions
OpenTelemetry resource semantic conventions should be used as a starting point when determining which attributes an organization should standardize on. In the following sections, we’ll review some of the more commonly used attributes.
Service Attributes
A number of attributes are used to describe the service being monitored.
service.name
is a required attribute that defines the logical name of the service. It’s added automatically by the OpenTelemetry SDK but can be customized. It’s best to keep this simple (e.g. inventoryservice
would be better than inventoryservice-prod-hostxyz
, as other attributes can be utilized to capture other aspects of the service instead).
The following service attributes are recommended:
- service.namespace: this attribute could be utilized to identify the team that owns the service
- service.instance.id: used to identify a unique instance of the service
- service.version: used to identify the version of the service
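As a sketch, these service attributes can be supplied without touching code via the standard OTEL_RESOURCE_ATTRIBUTES environment variable; the Python below just builds that comma-separated string (all values are illustrative placeholders):

```python
import os

# Illustrative service attributes; the env var name is part of the
# OpenTelemetry specification, the values here are placeholders.
attrs = {
    "service.namespace": "team-inventory",
    "service.instance.id": "627cc493-f310-47de-96bd-71410b7dec09",
    "service.version": "2.1.0",
}
os.environ["OTEL_RESOURCE_ATTRIBUTES"] = ",".join(
    f"{key}={value}" for key, value in attrs.items()
)
print(os.environ["OTEL_RESOURCE_ATTRIBUTES"])
```

SDKs read this variable at startup and attach the attributes to all emitted telemetry.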
Telemetry SDK
These attributes are set automatically by the SDK, to record information about the instrumentation libraries being used:
- telemetry.sdk.name is typically set to opentelemetry
- telemetry.sdk.language is the language of the SDK, such as java
- telemetry.sdk.version identifies which version of the SDK is utilized
Containers
For services running in containers, there are numerous attributes used to describe the container runtime, such as container.id
, container.name
, and container.image.name
. The full list can be found here.
Hosts
These attributes describe the host where the service is running, and include attributes such as host.id
, host.name
, and host.arch
. The full list can be found here.
Deployment Environment
The deployment.environment
attribute is used to identify the environment where the service is deployed, such as staging or production.
Splunk Observability Cloud uses this attribute to enable related content (as described here), so it’s important to include it.
Cloud
There are also attributes to capture information for services running in public cloud environments, such as AWS. Attributes include cloud.provider, cloud.account.id
, and cloud.region
.
The full list of attributes can be found here.
Some cloud providers, such as GCP, define semantic conventions specific to their offering.
Kubernetes
There are a number of standardized attributes for applications running in Kubernetes as well. Many of these are added automatically by Splunk’s distribution of the OpenTelemetry collector, as described here.
These attributes include k8s.cluster.name
, k8s.node.name
, k8s.pod.name
, k8s.namespace.name
, and kubernetes.workload.name
.
Best Practices for Creating Custom Attributes
Many organizations require attributes that go beyond what’s defined in OpenTelemetry’s resource semantic conventions.
In this case, it’s important to avoid naming conflicts with attribute names already included in the semantic conventions. So it’s a good idea to check the semantic conventions before deciding on a particular attribute name for your naming convention.
In addition to a naming convention for attribute names, you also need to consider attribute values. For example, if you’d like to capture the particular business unit with which an application belongs, then you’ll also want to have a standardized list of business unit values to choose from, to facilitate effective filtering.
The OpenTelemetry community provides guidelines that should be followed when naming attributes as well, which can be found here.
The Recommendations for Application Developers section is most relevant to our discussion.
They recommend:
- Prefixing the attribute name with your company’s domain name, e.g. com.acme.shopname (if the attribute may be used outside your company as well as inside).
- Prefixing the attribute name with the application name, if the attribute is unique to a particular application and is only used within your organization.
- Not using existing OpenTelemetry semantic convention names as a prefix for your attribute name.
- Considering submitting a proposal to add your attribute name to the OpenTelemetry specification, if there’s a general need for it across different organizations and industries.
- Avoiding attribute names that start with otel.*, as this is reserved for OpenTelemetry specification usage.
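A short sketch of attribute names that follow these recommendations (com.acme.shopname is the guideline's own example; the other name is hypothetical):

```python
# Illustrative custom attribute names; com.acme.shopname follows the
# company-domain prefix guideline, the second is a hypothetical
# app-prefixed attribute used only inside one organization.
custom_attributes = {
    "com.acme.shopname": "main-street-store",   # company-domain prefix
    "inventoryservice.cache.tier": "hot",       # app-name prefix, internal only
}
for name in custom_attributes:
    assert not name.startswith("otel.")  # otel.* is reserved for the spec
print(sorted(custom_attributes))
```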
Metric Cardinality Considerations
One final thing to keep in mind when deciding on naming standards for attribute names and values is related to metric cardinality.
Metric cardinality is defined as the number of unique metric time series (MTS) produced by a combination of metric names and their associated dimensions.
A metric has high cardinality when it has a high number of dimension keys and a high number of possible unique values for those dimension keys.
For example, suppose your application sends in data for a metric named custom.metric
. In the absence of any attributes, custom.metric
would generate a single metric time series (MTS).
On the other hand, if custom.metric
includes an attribute named customer.id
, and there are thousands of customer ID values, this would generate thousands of metric time series, which may impact costs and query performance.
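The arithmetic behind this can be sketched as follows: assuming independent dimension values, the worst-case number of MTS for one metric name is the product of the unique-value counts per dimension key (the counts below are illustrative):

```python
# Sketch: worst-case metric cardinality is the product of the number of
# unique values per dimension key (counts are illustrative).
unique_value_counts = {"customer.id": 5000, "region": 3}

worst_case_mts = 1
for key, count in unique_value_counts.items():
    worst_case_mts *= count
print(worst_case_mts)  # 15000 possible MTS for a single metric name
```

Adding one more dimension key multiplies, rather than adds to, the possible MTS count, which is why high-cardinality attributes like customer IDs deserve extra scrutiny.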
Splunk Observability Cloud provides a report that allows for the management of metrics usage. And rules can be created to drop undesirable dimensions. However, the first line of defence is understanding how attribute name and value combinations can drive increased metric cardinality.
Summary
In this document, we highlighted the importance of defining naming conventions for OpenTelemetry tags, preferably before starting a large rollout of OpenTelemetry instrumentation.
We discussed how OpenTelemetry’s resource semantic conventions define the naming conventions for several attributes, many of which are automatically collected via the OpenTelemetry SDKs, as well as processors that run within the OpenTelemetry collector.
Finally, we shared some best practices for creating your attribute names, for situations where the resource semantic conventions are not sufficient for your organization’s needs.
Subsections of Local Hosting
Local Hosting with Multipass
Install Multipass and Terraform for your operating system. On a Mac (Intel), you can also install via Homebrew e.g.
brew install multipass
brew install terraform
Clone workshop repository:
git clone https://github.com/splunk/observability-workshop
Change into multipass directory:
cd observability-workshop/local-hosting/multipass
Log Observer Connect:
If you plan to use your own Splunk Observability Cloud Org and/or Splunk instance, you may need to create a new Log Observer Connect connection:
Follow the instructions found in the documentation for Splunk Cloud or Splunk Enterprise.
Additional requirements for running your own Log Observer Connect connection are:
- Create an index called splunk4rookies-workshop
- Make sure the service account user used in the Log Observer Connect connection has access to the splunk4rookies-workshop index (you can remove all other indexes, as all workshop log data should go to this index).
Initialise Terraform:
terraform init
```text
Initializing the backend...
Initializing provider plugins...
- Finding latest version of hashicorp/random...
- Finding latest version of hashicorp/local...
- Finding larstobi/multipass versions matching "~> 1.4.1"...
- Installing hashicorp/random v3.5.1...
- Installed hashicorp/random v3.5.1 (signed by HashiCorp)
- Installing hashicorp/local v2.4.0...
- Installed hashicorp/local v2.4.0 (signed by HashiCorp)
- Installing larstobi/multipass v1.4.2...
- Installed larstobi/multipass v1.4.2 (self-signed, key ID 797707331BF3549C)
```
Create the Terraform variables file. Variables are kept in the file terraform.tfvars,
and a template, terraform.tfvars.template,
is provided to copy and edit:
cp terraform.tfvars.template terraform.tfvars
The following Terraform variables are required:
- splunk_access_token: Observability Cloud Access Token
- splunk_api_token: Observability Cloud API Token
- splunk_rum_token: Observability Cloud RUM Token
- splunk_realm: Observability Cloud Realm e.g. eu0
- splunk_hec_url: Splunk HEC URL. Do not use a raw endpoint; use the event endpoint so logs process correctly.
- splunk_hec_token: Splunk HEC Token
- splunk_index: Splunk Index to send logs to. Defaults to splunk4rookies-workshop.
Instance type variables:
- splunk_presetup: Provide a preconfigured instance (OTel Collector and Online Boutique deployed with RUM enabled). The default is false.
- splunk_diab: Install and run Demo-in-a-Box. The default is false.
- tagging_workshop: Install and configure the Tagging Workshop. The default is false.
- otel_demo: Install and configure the OpenTelemetry Astronomy Shop Demo. This requires that splunk_presetup is set to false. The default is false.
Optional advanced variables:
- wsversion: Set this to main if working on the development of the workshop; otherwise this can be omitted.
- architecture: Set this to arm64 if you are running on Apple Silicon. Defaults to amd64.
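Pulling the variables above together, a terraform.tfvars for a presetup instance might look like the following sketch (every token value is a placeholder and the HEC URL host is invented; substitute your own values):

```text
splunk_access_token = "<access-token>"
splunk_api_token    = "<api-token>"
splunk_rum_token    = "<rum-token>"
splunk_realm        = "eu0"
splunk_hec_url      = "https://http-inputs-example.splunkcloud.com/services/collector/event"
splunk_hec_token    = "<hec-token>"
splunk_index        = "splunk4rookies-workshop"
splunk_presetup     = true
```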
Run terraform plan
to check that all configuration is OK. Once happy, run terraform apply
to create the instance.
```text
random_string.hostname: Creating...
random_string.hostname: Creation complete after 0s [id=cynu]
local_file.user_data: Creating...
local_file.user_data: Creation complete after 0s [id=46a5c50e396a1a7820c3999c131a09214db903dd]
multipass_instance.ubuntu: Creating...
multipass_instance.ubuntu: Still creating... [10s elapsed]
multipass_instance.ubuntu: Still creating... [20s elapsed]
...
multipass_instance.ubuntu: Still creating... [1m30s elapsed]
multipass_instance.ubuntu: Creation complete after 1m38s [name=cynu]
data.multipass_instance.ubuntu: Reading...
data.multipass_instance.ubuntu: Read complete after 1s [name=cynu]

Apply complete! Resources: 3 added, 0 changed, 0 destroyed.

Outputs:

instance_details = [
  {
    "image" = "Ubuntu 22.04.2 LTS"
    "image_hash" = "345fbbb6ec82 (Ubuntu 22.04 LTS)"
    "ipv4" = "192.168.205.185"
    "name" = "cynu"
    "state" = "Running"
  },
]
```
Once the instance has been successfully created (this can take several minutes), exec
into it using the name
from the output above. The password for the Multipass instance is Splunk123!.
multipass exec cynu -- su -l splunk
```text
$ multipass exec cynu -- su -l splunk
Password:
Waiting for cloud-init status...
Your instance is ready!
```
Validate the instance:
kubectl version --output=yaml
To delete the instance, first make sure you have exited from the instance and then run the following command:
Local Hosting with OrbStack
Install OrbStack:
Log Observer Connect:
If you plan to use your own Splunk Observability Cloud Org and/or Splunk instance, you may need to create a new Log Observer Connect connection:
Follow the instructions found in the documentation for Splunk Cloud or Splunk Enterprise.
Additional requirements for running your own Log Observer Connect connection are:
- Create an index called splunk4rookies-workshop
- Make sure the service account user used in the Log Observer Connect connection has access to the splunk4rookies-workshop index (you can remove all other indexes, as all workshop log data should go to this index).
Clone workshop repository:
git clone https://github.com/splunk/observability-workshop
Change into Orbstack directory:
cd observability-workshop/local-hosting/orbstack
Copy start.sh.example
to start.sh
and edit the file to set the following required variables.
Make sure that you do not use a Raw endpoint; use an Event endpoint instead, as this will process the logs correctly.
ACCESS_TOKEN
REALM
API_TOKEN
RUM_TOKEN
HEC_TOKEN
HEC_URL
Run the script and provide an instance name, e.g.:
Once the instance has been successfully created (this can take several minutes), you will automatically be logged into the instance. If you exit you can SSH back in using the following command (replace <my_instance>
with the name of your instance):
ssh splunk@<my_instance>@orb
Once in the shell, you can validate that the instance is ready by running the following command:
kubectl version --output=yaml
To get the IP address of the instance, run the following command:
To delete the instance, run the following command:
Running Demo-in-a-Box
Demo-in-a-Box is a method for running demo apps easily using a web interface.
It provides:
- A quick way to deploy demo apps and states
- A way to easily change the configuration of your OTel Collector and see its logs
- A way to get pod status, pod logs, etc.
To leverage this locally using multipass:
- Follow the local hosting for Multipass instructions
- In the terraform.tfvars file, set splunk_diab to true and make sure all other options are set to false
- Then set the other required and important tokens/URLs
- Then run the Terraform steps
- Once the instance is up, navigate in your browser to: http://<IP>:8083
- In the terraform.tfvars file the wsversion defaults to the current version of the workshop, e.g. 4.64:
  - To use the latest developments, change wsversion to main
  - There are only three versions of the workshop maintained: development (main), current (e.g. 4.64) and the previous (e.g. 4.63)
  - After making the change, run terraform apply to make the changes
- Now you can deploy any of the demos; this will also deploy the collector as part of the deployment