Unsupported Field Workshops

Splunk IM
Splunk delivers real-time monitoring and troubleshooting to help you maximize infrastructure performance with complete visibility.
Build a Distributed Trace in Lambda and Kinesis
This workshop shows you how a distributed trace is constructed for a small serverless application that runs on AWS Lambda, producing and consuming a message via AWS Kinesis.
Getting Data In (GDI) with OTel and UF
Learn how to get data into Splunk Observability Cloud with OpenTelemetry and the Splunk Universal Forwarder.
Splunk OnCall
Make expensive service outages a thing of the past. Remediate issues faster, reduce on-call burnout and keep your services up and running.

Splunk IM

During this technical Splunk Observability Cloud Infrastructure Monitoring and APM Workshop, you will build out an environment based on a lightweight Kubernetes¹ cluster.

To simplify the workshop modules, a pre-configured AWS/EC2 instance is provided.

The instance is pre-configured with all the software required to deploy the Splunk OpenTelemetry Collector² in Kubernetes, deploy an NGINX ReplicaSet, and finally deploy a microservices-based application that has been instrumented using OpenTelemetry to send metrics, traces, spans and logs³.

The workshops also introduce you to dashboards, editing and creating charts, creating detectors to fire alerts, Monitoring as Code and the Service Bureau⁴.

By the end of these technical workshops, you will have a good understanding of some of the key features and capabilities of the Splunk Observability Cloud.

Here are the instructions on how to access your pre-configured AWS/EC2 instance

¹ Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services that facilitates both declarative configuration and automation.
² OpenTelemetry Collector offers a vendor-agnostic implementation of how to receive, process and export telemetry data. In addition, it removes the need to run, operate and maintain multiple agents/collectors to support open-source telemetry data formats (e.g. Jaeger, Prometheus, etc.) sending to multiple open-source or commercial back-ends.
³ Jaeger, inspired by Dapper and OpenZipkin, is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems.
⁴ Monitoring as Code and Service Bureau

How to connect to your workshop environment

5 minutes

How to retrieve the IP address of the AWS/EC2 instance assigned to you.
Connect to your instance using SSH, Putty¹ or your web browser.
Verify your connection to your AWS/EC2 cloud instance.
Using Putty (Optional)
Using Multipass (Optional)

1. AWS/EC2 IP Address

In preparation for the workshop, Splunk has prepared an Ubuntu Linux instance in AWS/EC2.

To get access to the instance that you will be using in the workshop, open the Google Sheet at the URL provided by the workshop leader.

Search for your AWS/EC2 instance by looking for your first and last name, as provided during registration for this workshop.

Find your allocated IP address, SSH command (for Mac OS, Linux and the latest Windows versions) and password to enable you to connect to your workshop instance.

The sheet also lists the Browser Access URL, which you can use if you cannot connect via SSH or Putty - see EC2 access via Web browser.

Important

Please use SSH or Putty to gain access to your EC2 instance if possible and make a note of the IP address as you will need this during the workshop.

2. SSH (Mac OS/Linux)

Most attendees will be able to connect to the workshop by using SSH from their Mac or Linux device, or on Windows 10 and above.

To use SSH, open a terminal on your system and type ssh splunk@x.x.x.x (replacing x.x.x.x with the IP address found in Step #1).

When prompted Are you sure you want to continue connecting (yes/no/[fingerprint])? please type yes.

Enter the password provided in the Google Sheet from Step #1.

Upon successful login, you will be presented with the Splunk logo and the Linux prompt.
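The connection step above can be sketched as a small shell helper. This is only an illustration: `workshop_ssh_cmd` is a hypothetical name that is not part of the workshop, and the only things taken from the instructions are the `splunk` user and the IP address from the Google Sheet.

```shell
# Hypothetical helper (not part of the workshop): print the SSH command
# for a given instance IP, rejecting anything that is not a dotted quad.
workshop_ssh_cmd() {
  ip="$1"
  # Accept only four dot-separated groups of 1-3 digits.
  if printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    printf 'ssh splunk@%s\n' "$ip"
  else
    printf 'not an IPv4 address: %s\n' "$ip" >&2
    return 1
  fi
}
```

For example, `workshop_ssh_cmd 192.0.2.10` prints `ssh splunk@192.0.2.10`, while passing the literal placeholder `x.x.x.x` is rejected, which catches the common mistake of pasting the command without substituting the address.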

3. SSH (Windows 10 and above)

The procedure described above is the same on Windows 10, and the commands can be executed either in the Windows Command Prompt or PowerShell. However, Windows regards its SSH Client as an “optional feature”, which might need to be enabled.

You can verify whether SSH is enabled by simply executing ssh.

If you are shown a help text on how to use the ssh command (as shown in the screenshot below), you are all set.

If the result of executing the command looks something like the screenshot below, you need to enable the “OpenSSH Client” feature manually.

To do that, open the “Settings” menu, and click on “Apps”. While in the “Apps & features” section, click on “Optional features”.

Here, you are presented with a list of installed features. At the top, you see a button with a plus icon to “Add a feature”. Click it. In the search input field, type “OpenSSH”, find the feature called “OpenSSH Client” (or “OpenSSH Client (Beta)”), click on it, and click the “Install” button.

Now you are set! If you are not able to access the provided instance despite enabling the OpenSSH feature, please reach out to the course instructor, either via chat or directly.

At this point you are ready to continue and start the workshop.

4. Putty (For Windows Versions prior to Windows 10)

If you do not have SSH pre-installed or if you are on a Windows system, the best option is to install Putty, which you can find here.

Important

If you cannot install Putty, please go to Web Browser (All).

Open Putty and enter in the Host Name (or IP address) field the IP address provided in the Google Sheet.

You can optionally save your settings by providing a name and pressing Save.

To log in to your instance, click on the Open button as shown above.

If this is the first time connecting to your AWS/EC2 workshop instance, you will be presented with a security dialogue. Please click Yes.

Once connected, log in as splunk with the password provided in the Google Sheet.

Once you are connected successfully you should see a screen similar to the one below:

At this point, you are ready to continue and start the workshop.

5. Web Browser (All)

If you are blocked from using SSH (Port 22) or unable to install Putty you may be able to connect to the workshop instance by using a web browser.

Note

This assumes that access to port 6501 is not restricted by your company’s firewall.

Open your web browser and type http://x.x.x.x:6501 (where x.x.x.x is the IP address from the Google Sheet).

Once connected, log in as splunk with the password provided in the Google Sheet.

Once you are connected successfully you should see a screen similar to the one below:

Unlike regular SSH, copy and paste requires a few extra steps when using a browser session. This is due to cross-browser restrictions.

When the workshop asks you to copy instructions into your terminal, please do the following:

Copy the instruction as normal, but when ready to paste it in the web terminal, choose Paste from browser as shown below:

This will open a dialogue box asking for the text to be pasted into the web terminal:

Paste the text in the text box as shown, then press OK to complete the copy and paste process.

Note

Unlike a regular SSH connection, the web browser session has a 60-second timeout. When you are disconnected, a Connect button will be shown in the center of the web terminal.

Simply click the Connect button and you will be reconnected and will be able to continue.

At this point you are ready to continue and start the workshop.

6. Multipass (All)

If you are unable to access AWS, but want to install software locally, follow the instructions for using Multipass.

¹ Download Putty

Deploying the OpenTelemetry Collector in Kubernetes

15 minutes

Use the Splunk Helm chart to install the OpenTelemetry Collector in K3s
Explore your cluster in the Kubernetes Navigator

1. Installation using Helm

Install the OpenTelemetry Collector using the Splunk Helm chart. First, add the Splunk Helm chart repository to Helm and update:

helm repo add splunk-otel-collector-chart https://signalfx.github.io/splunk-otel-collector-chart && helm repo update

Using ACCESS_TOKEN={REDACTED}
Using REALM=eu0
"splunk-otel-collector-chart" has been added to your repositories
Using ACCESS_TOKEN={REDACTED}
Using REALM=eu0
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "splunk-otel-collector-chart" chart repository
Update Complete. ⎈Happy Helming!⎈

Install the OpenTelemetry Collector Helm chart with the following command. Do NOT edit it:

helm install splunk-otel-collector \
--set="splunkObservability.realm=$REALM" \
--set="splunkObservability.accessToken=$ACCESS_TOKEN" \
--set="clusterName=$INSTANCE-k3s-cluster" \
--set="splunkObservability.logsEnabled=false" \
--set="logsEngine=otel" \
--set="splunkObservability.profilingEnabled=true" \
--set="splunkObservability.infrastructureMonitoringEventsEnabled=true" \
--set="environment=$INSTANCE-workshop" \
--set="splunkPlatform.endpoint=$HEC_URL" \
--set="splunkPlatform.token=$HEC_TOKEN" \
--set="splunkPlatform.index=splunk4rookies-workshop" \
splunk-otel-collector-chart/splunk-otel-collector \
-f ~/workshop/k3s/otel-collector.yaml

Using ACCESS_TOKEN={REDACTED}
Using REALM=eu0
NAME: splunk-otel-collector
LAST DEPLOYED: Fri May  7 11:19:01 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
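The install command above depends on several shell variables (`REALM`, `ACCESS_TOKEN`, `INSTANCE`, `HEC_URL` and `HEC_TOKEN`) that the workshop instance is expected to provide. If the install misbehaves, a quick check that none of them is empty can help. The following is a minimal sketch; `check_workshop_env` is a hypothetical helper name and not something shipped with the workshop.

```shell
# Hypothetical pre-flight helper: report which of the variables referenced
# by the helm install command are unset or empty in the current shell.
check_workshop_env() {
  missing=0
  for name in REALM ACCESS_TOKEN INSTANCE HEC_URL HEC_TOKEN; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf 'missing: %s\n' "$name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_workshop_env && echo "environment looks complete"` prints the confirmation only when all five variables have a value. It does not validate the values themselves.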

You can monitor the progress of the deployment by running kubectl get pods, which should typically report that the new pods are up and running after about 30 seconds.

Ensure the status is reported as Running before continuing.

kubectl get pods

NAME                                                          READY   STATUS    RESTARTS   AGE
splunk-otel-collector-agent-2sk6k                             0/1     Running   0          10s
splunk-otel-collector-k8s-cluster-receiver-6956d4446f-gwnd7   0/1     Running   0          10s
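The status check above can also be scripted. The sketch below reads `kubectl get pods` output and succeeds only when every pod is Running and its READY column shows all containers ready. That is an assumption that is slightly stricter than the instruction above, which only asks for a Running status. `all_pods_ready` is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and succeed
# only if every pod is Running with all of its containers ready (READY n/n).
all_pods_ready() {
  awk '
    NR == 1 { next }                      # skip the header row
    {
      n++
      split($2, ready, "/")               # READY column, e.g. "0/1"
      if ($3 != "Running" || ready[1] != ready[2]) bad = 1
    }
    END { exit (n == 0 || bad) ? 1 : 0 }  # fail on no pods or any bad pod
  '
}
```

A typical use would be `kubectl get pods | all_pods_ready && echo "all pods ready"`. With the sample output above (READY `0/1`), the helper reports failure, because the containers have started but are not yet ready.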

Ensure there are no errors by tailing the logs from the OpenTelemetry Collector pod. The output should look similar to the log output shown below.

Use the label set by the helm install to tail logs (you will need to press Ctrl+C to exit). Or use the installed k9s terminal UI for bonus points!

kubectl logs -l app=splunk-otel-collector -f --container otel-collector

2021-03-21T16:11:10.900Z INFO service/service.go:364 Starting receivers...
2021-03-21T16:11:10.900Z INFO builder/receivers_builder.go:70 Receiver is starting... {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus"}
2021-03-21T16:11:11.009Z INFO builder/receivers_builder.go:75 Receiver started. {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus"}
2021-03-21T16:11:11.009Z INFO builder/receivers_builder.go:70 Receiver is starting... {"component_kind": "receiver", "component_type": "k8s_cluster", "component_name": "k8s_cluster"}
2021-03-21T16:11:11.009Z INFO k8sclusterreceiver@v0.21.0/watcher.go:195 Configured Kubernetes MetadataExporter {"component_kind": "receiver", "component_type": "k8s_cluster", "component_name": "k8s_cluster", "exporter_name": "signalfx"}
2021-03-21T16:11:11.009Z INFO builder/receivers_builder.go:75 Receiver started. {"component_kind": "receiver", "component_type": "k8s_cluster", "component_name": "k8s_cluster"}
2021-03-21T16:11:11.009Z INFO healthcheck/handler.go:128 Health Check state change {"component_kind": "extension", "component_type": "health_check", "component_name": "health_check", "status": "ready"}
2021-03-21T16:11:11.009Z INFO service/service.go:267 Everything is ready. Begin running and processing data.
2021-03-21T16:11:11.009Z INFO k8sclusterreceiver@v0.21.0/receiver.go:59 Starting shared informers and wait for initial cache sync. {"component_kind": "receiver", "component_type": "k8s_cluster", "component_name": "k8s_cluster"}
2021-03-21T16:11:11.281Z INFO k8sclusterreceiver@v0.21.0/receiver.go:75 Completed syncing shared informer caches. {"component_kind": "receiver", "component_type": "k8s_cluster", "component_name": "k8s_cluster"}

Deleting a failed installation

If you make an error installing the OpenTelemetry Collector you can start over by deleting the installation using:

helm delete splunk-otel-collector

2. Validate metrics in the UI

In the Splunk UI, click the » at the bottom left, then click on Infrastructure.

Under Containers click on Kubernetes to open the Kubernetes Navigator Cluster Map to ensure metrics are being sent in.

Validate that your cluster is discovered and reported by finding your cluster (in the workshop you will see many other clusters). To find your cluster name run the following command and copy the output to your clipboard:

echo $INSTANCE-k3s-cluster
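The cluster name is built by plain shell expansion, the same way the install command builds `clusterName` and `environment` from `INSTANCE`. A small sketch of that naming follows; `workshop_names` is a hypothetical helper used only for illustration.

```shell
# Hypothetical helper: derive the names the install command builds from INSTANCE.
workshop_names() {
  instance="${1:?usage: workshop_names INSTANCE}"
  # Braces make the end of the variable name explicit; "$instance-k3s-cluster"
  # expands identically because "-" cannot appear in a variable name.
  printf 'cluster=%s\n' "${instance}-k3s-cluster"
  printf 'environment=%s\n' "${instance}-workshop"
}
```

Calling `workshop_names "$INSTANCE"` prints both derived names, which is handy when you need to paste them into filters in the UI.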

Then in the UI, click on the “Cluster: -” menu just below the Splunk Logo, paste the Cluster name you just copied into the search box, click the box to select your cluster, and finally click off the menu into white space to apply the filter.

To examine the health of your node, hover over the pale blue background of your cluster, then click on the blue magnifying glass that appears in the top left-hand corner.

This will drill down to the node level. Next, open the Metrics sidebar by clicking on the sidebar button.

Once it is open, you can use the slider on the side to explore the various charts relevant to your cluster/node: CPU, Memory, Network, Events etc.

Deploying NGINX in K3s

Deploy an NGINX ReplicaSet into your K3s cluster and confirm the discovery of your NGINX deployment.
Run a load test to create metrics and confirm they are streaming into Splunk Observability Cloud!

1. Start your NGINX

Verify the number of pods running in the Splunk UI by selecting the WORKLOADS tab. This should give you an overview of the workloads on your cluster.

Note the single agent container running per node among the default Kubernetes pods. This single container will monitor all the pods and services being deployed on this node!

Now switch back to the default cluster node view by selecting the MAP tab and selecting your cluster again.

In your AWS/EC2 or Multipass shell session change into the nginx directory:

cd ~/workshop/k3s/nginx

2. Create NGINX deployment

Create the NGINX ConfigMap¹ using the nginx.conf file:

kubectl create configmap nginxconfig --from-file=nginx.conf

configmap/nginxconfig created

Then create the deployment:

kubectl create -f nginx-deployment.yaml

deployment.apps/nginx created
service/nginx created

Next, we will deploy Locust² which is an open-source tool used for creating a load test against NGINX:

kubectl create -f locust-deployment.yaml

deployment.apps/nginx-loadgenerator created
service/nginx-loadgenerator created

Validate the deployment has been successful and that the Locust and NGINX pods are running.

If you have the Splunk UI open you should see new Pods being started and containers being deployed.

It should only take around 20 seconds for the pods to transition into a Running state. In the Splunk UI you will have a cluster that looks like the screenshot below:

If you select the WORKLOADS tab again you will now see that there is a new ReplicaSet and a deployment added for NGINX:

Let’s validate this in your shell as well:

kubectl get pods

NAME                                                          READY   STATUS    RESTARTS   AGE
splunk-otel-collector-k8s-cluster-receiver-77784c659c-ttmpk   1/1     Running   0          9m19s
splunk-otel-collector-agent-249rd                             1/1     Running   0          9m19s
svclb-nginx-vtnzg                                             1/1     Running   0          5m57s
nginx-7b95fb6b6b-7sb9x                                        1/1     Running   0          5m57s
nginx-7b95fb6b6b-lnzsq                                        1/1     Running   0          5m57s
nginx-7b95fb6b6b-hlx27                                        1/1     Running   0          5m57s
nginx-7b95fb6b6b-zwns9                                        1/1     Running   0          5m57s
svclb-nginx-loadgenerator-nscx4                               1/1     Running   0          2m20s
nginx-loadgenerator-755c8f7ff6-x957q                          1/1     Running   0          2m20s

3. Run Locust load test

Locust, an open-source load generator, is available on port 8083 of the EC2 instance’s IP address. Open a new tab in your web browser and go to http://EC2-IP:8083/ (replacing EC2-IP with the IP address of your instance). You will then be able to see Locust running.

Set the Spawn rate to 2 and click Start Swarming.

This will start a gentle continuous load on the application.

As you can see from the above screenshot, most of the calls will report a failure. This is expected, as we have not yet deployed the application behind it. However, NGINX is reporting on your attempts and you should be able to see those metrics.

Validate you are seeing those metrics in the UI by selecting Dashboards → Built-in Dashboard Groups → NGINX → NGINX Servers. Using the Overrides filter on k8s.cluster.name:, find the name of your cluster as returned by echo $INSTANCE-k3s-cluster in the terminal.

¹ A ConfigMap is an API object used to store non-confidential data in key-value pairs. Pods can consume ConfigMaps as environment variables, command-line arguments, or configuration files in a volume. A ConfigMap allows you to decouple environment-specific configuration from your container images so that your applications are easily portable.
² What is Locust?

Working with Dashboards

20 minutes

Introduction to the Dashboards and Charts
Editing and creating charts
Filtering and analytical functions
Using formulas
Saving charts in a dashboard
Introduction to SignalFlow

1. Dashboards

Dashboards are groupings of charts and visualizations of metrics. Well-designed dashboards can provide useful and actionable insight into your system at a glance. Dashboards can be complex or contain just a few charts that drill down only into the data you want to see.

During this module, we are going to create the following charts and dashboard and connect them to your Team page.

2. Your Teams’ Page

Click on the from the navbar. As you have already been assigned to a team, you will land on the team dashboard. We use the Example Team as an example here. The one in your workshop will be different!

This page shows the total number of team members, the number of active alerts for your team and all dashboards that are assigned to your team. Right now there are no dashboards assigned but, as stated before, we will add the new dashboard that you will create to your Teams page later.

3. Sample Charts

To continue, click on All Dashboards in the top right corner of the screen. This brings you to the view that shows all the available dashboards, including the pre-built ones.

If you are already receiving metrics from a Cloud API integration or another service through the Splunk Agent you will see relevant dashboards for these services.

4. Inspecting the Sample Data

Among the dashboards, you will see a Dashboard group called Sample Data. Expand the Sample Data dashboard group by clicking on it, and then click on the Sample Charts dashboard.

In the Sample Charts dashboard, you can see a selection of charts that show a sample of the various styles, colors and formats you can apply to your charts in the dashboards.

Have a look through all the dashboards in this dashboard group (PART 1, PART 2, PART 3 and INTRO TO SPLUNK OBSERVABILITY CLOUD).

Editing charts

1. Editing a chart

Select the SAMPLE CHARTS dashboard and then click on the three dots ... on the Latency histogram chart, then on Open (or you can click on the name of the chart which here is Latency histogram).

You will see the plot options, current plot and signal (metric) for the Latency histogram chart in the chart editor UI.

In the Plot Editor tab under Signal you see the metric demo.trans.latency we are currently plotting.

You will see a number of Line plots. The number 18 ts indicates that we are plotting 18 metric time series in the chart.

Click on the different chart type icons to explore each of the visualizations. Note their names as you hover over them. For example, click on the Heat Map icon:

See how the chart changes to a heat map.

Note

You can use different charts to visualize your metrics - you choose which chart type fits best for the visualization you want to have.

For more info on the different chart types see Choosing a chart type.

Click on the Line chart type and you will see the line plot.

2. Changing the time window

You can also increase the time window of the chart by selecting Past 15 minutes from the Time dropdown.

3. Viewing the Data Table

Click on the Data Table tab.

You now see 18 rows, each representing a metric time series with a number of columns. These columns represent the dimensions of the metric. The dimensions for demo.trans.latency are:

demo_datacenter
demo_customer
demo_host

In the demo_datacenter column you see that there are two data centers, Paris and Tokyo, for which we are getting metrics.

If you move your cursor over the lines in the chart horizontally you will see the data table update accordingly. If you click on one of the lines in the chart you will see a pinned value appear in the data table.

Now click on Plot editor again to close the Data Table and let’s save this chart into a dashboard for later use!

Saving charts

1. Saving a chart

To start saving your chart, let’s give it a name and description. Click the name of the chart Copy of Latency Histogram and rename it to “Active Latency”.

To change the description click on Spread of latency values across time. and change this to Overview of latency values in real-time.

Click the Save As button. Make sure your chart has a name. It will use the name Active Latency that you defined in the previous step, but you can edit it here if needed.

Press the Ok button to continue.

2. Creating a dashboard

In the Choose dashboard dialog, we need to create a new dashboard. Click on the New Dashboard button.

You will now see the New Dashboard Dialog. In here you can give your dashboard a name and description, and set Read and Write Permissions.

Please use your own name in the following format to give your dashboard a name e.g. YOUR_NAME-Dashboard.

Please replace YOUR_NAME with your own name, change the dashboard permissions to Restricted Read and Write access, and verify your user can read/write.

You should see your own login information displayed, meaning you are now the only one who can edit this dashboard. Of course, you have the option to add other users or teams from the drop box below that may edit your dashboard and charts, but for now make sure you change it back to Everyone can Read or Write to remove any restrictions and press the Save button to continue.

Your new dashboard is now available and selected so you can save your chart in your new dashboard.

Make sure you have your dashboard selected and press the Ok button.

You will now be taken to your dashboard like below. You can see at the top left that your YOUR_NAME-DASHBOARD is part of a Dashboard Group YOUR_NAME-Dashboard. You can add other dashboards to this dashboard group.

3. Add to Team page

It is common practice to link dashboards that are relevant to a Team to a teams page. So let’s add your dashboard to the team page for easy access later. Use the from the navbar again.

This will bring you to your team’s dashboard. We use the team Example Team as an example here; the workshop one will be different.

Press the + Add Dashboard Group button to add your dashboard to the team page.

This will bring you to the Select a dashboard group to link to this team dialog. Type your name (that you used above) in the search box to find your Dashboard. Select it so it’s highlighted and click the Ok button to add your dashboard.

Your dashboard group will appear as part of the team page. Please note during the course of the workshop many more will appear here.

Now click on the link for your Dashboard to add more charts!

Using Filters & Formulas

1. Creating a new chart

Let’s now create a new chart and save it in our dashboard!

Select the plus icon (top right of the UI) and from the drop down, choose the option Chart. Or click on the + New Chart Button to create a new chart.

You will now see a chart template like the following.

Let’s enter a metric to plot. We are still going to use the metric demo.trans.latency.

In the Plot Editor tab under Signal enter demo.trans.latency.

You should now have a familiar line chart. Please switch the time to 15 mins.

2. Filtering and Analytics

Let’s now select the Paris datacenter to do some analytics - for that we will use a filter.

Let’s go back to the Plot Editor tab and click on Add Filter, wait until it automatically populates, choose demo_datacenter, and then Paris.

In the F(x) column, add the analytic function Percentile:Aggregation, and leave the value at 95 (click outside to confirm).

For info on the Percentile function and the other functions see Chart Analytics.

3. Using Timeshift analytical function

Let’s now compare with older metrics. Click on ... and then on Clone in the dropdown to clone Signal A.

You will see a new row identical to A, called B, both visible and plotted.

For Signal B, in the F(x) column add the analytic function Timeshift and enter 1w (or 7d for 7 days), and click outside to confirm.

Click on the cog on the far right, and choose a Plot Color e.g. pink, to change color for the plot of B.

Click on Close.

We now see plots for Signal A (the past 15 minutes) as a blue plot, and the plots from a week ago in pink.

In order to make this clearer we can click on the Area chart icon to change the visualization.

We can now see when last week’s latency was higher!

Next, click into the field next to Time on the Override bar and choose Past Hour from the dropdown.

4. Using Formulas

Let’s now plot the difference between the current metric values and the values from 7 days earlier.

Click on Enter Formula then enter A-B (A minus B) and hide (deselect) all Signals using the eye, except C.

We now see only the difference of all metric values of A and B being plotted. We see that we have some negative values on the plot because the metric value of B is sometimes larger than the metric value of A at that time.

Let’s look at the SignalFlow that drives our Charts and Detectors!

SignalFlow

1. Introduction

Let’s take a look at SignalFlow - the analytics language of Observability Cloud that can be used to set up monitoring as code.

The heart of Splunk Infrastructure Monitoring is the SignalFlow analytics engine that runs computations written in a Python-like language. SignalFlow programs accept streaming input and produce output in real time. SignalFlow provides built-in analytical functions that take metric time series (MTS) as input, perform computations, and output a resulting MTS.

These functions support analyses such as:

Comparisons with historical norms, e.g. on a week-over-week basis
Population overviews using a distributed percentile chart
Detecting if the rate of change (or other metric expressed as a ratio, such as a service level objective) has exceeded a critical threshold
Finding correlated dimensions, e.g. to determine which service is most correlated with alerts for low disk space

Infrastructure Monitoring creates these computations in the Chart Builder user interface, which lets you specify the input MTS to use and the analytical functions you want to apply to them. You can also run SignalFlow programs directly by using the SignalFlow API.

SignalFlow includes a large library of built-in analytical functions that take a metric time series as an input, perform computations on its datapoints, and output time series that are the result of the computation.

Info

For more information on SignalFlow see Analyze incoming data using SignalFlow.

2. View SignalFlow

In the chart builder, click on View SignalFlow.

You will see the SignalFlow code that composes the chart we were working on. You can now edit the SignalFlow directly within the UI. Our documentation has the full list of SignalFlow functions and methods.

Also, you can copy the SignalFlow and use it when interacting with the API or with Terraform to enable Monitoring as Code.

A = data('demo.trans.latency', filter=filter('demo_datacenter', 'Paris')).percentile(pct=95).publish(label='A', enable=False)
B = data('demo.trans.latency', filter=filter('demo_datacenter', 'Paris')).percentile(pct=95).timeshift('1w').publish(label='B', enable=False)
C = (A-B).publish(label='C')
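Because the program above is plain text, it is easy to template when you want the same week-over-week comparison for another metric or datacenter. The following is a minimal sketch of that idea; `signalflow_wow` is a hypothetical helper name, and how the generated text is then submitted (API, Terraform) is deliberately left out.

```shell
# Hypothetical helper: emit the week-over-week SignalFlow program shown above
# for a chosen metric, datacenter and percentile.
signalflow_wow() {
  metric="$1"; datacenter="$2"; pct="$3"
  # Shared stream expression used by both signal A and signal B.
  stream="data('${metric}', filter=filter('demo_datacenter', '${datacenter}')).percentile(pct=${pct})"
  printf "A = %s.publish(label='A', enable=False)\n" "$stream"
  printf "B = %s.timeshift('1w').publish(label='B', enable=False)\n" "$stream"
  printf "C = (A-B).publish(label='C')\n"
}
```

Running `signalflow_wow demo.trans.latency Paris 95` reproduces the three lines above, and swapping `Paris` for `Tokyo` gives the equivalent program for the other datacenter seen in the data table.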

Click on View Builder to go back to the Chart Builder UI.

Let’s save this new chart to our Dashboard!

Adding charts to dashboards

1. Save to existing dashboard

Check that you have YOUR_NAME-Dashboard: YOUR_NAME-Dashboard in the top left corner. This means your chart will be saved in this Dashboard.

Name the Chart Latency History and add a Chart Description if you wish.

Click on Save And Close. This returns you to your dashboard that now has two charts!

Now let’s quickly add another Chart based on the previous one.

2. Copy and Paste a chart

Click on the three dots ... on the Latency History chart in your dashboard and then on Copy.

You see the chart being copied, and you should now have a red circle with a white 1 next to the + on the top left of the page.

Click on the plus icon at the top of the page, and then in the menu on Paste Charts (there should also be a red dot with a 1 visible at the end of the line).

This will place a copy of the previous chart in your dashboard.

3. Edit the pasted chart

Click on the three dots ... on one of the Latency History charts in your dashboard and then on Open (or you can click on the name of the chart which here is Latency History).

This will bring you to the editor environment again.

First set the time for the chart to -1 hour in the Time box at the top right of the chart. Then to make this a different chart, click on the eye icon in front of signal “A” to make it visible again, and then hide signal “C” via the eye icon and change the name for Latency history to Latency vs Load.

Click on the Add Metric Or Event button. This will bring up the box for a new signal. Type and select demo.trans.count for Signal D.

This will add a new Signal D to your chart. It shows the number of active requests. Add the filter for the demo_datacenter:Paris, then change the Rollup type by clicking on the Configure Plot button and changing the roll-up from Auto (Delta) to Rate/sec. Change the name from demo.trans.count to Latency vs Load.
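To see what the roll-up change does to the plotted values, here is a small worked calculation. It assumes that a Delta roll-up reports the count of requests seen in each interval and that Rate/sec divides that count by the interval length in seconds. Both are stated here as assumptions for illustration, and `rate_per_sec` is a hypothetical helper.

```shell
# Illustrative arithmetic only: convert a per-interval delta (a count of
# requests seen in one reporting interval) into a per-second rate.
rate_per_sec() {
  delta="$1"; interval_seconds="$2"
  awk -v d="$delta" -v s="$interval_seconds" 'BEGIN { printf "%g\n", d / s }'
}
```

Under these assumptions, 120 requests counted in a 10-second interval plot as 12 requests per second (`rate_per_sec 120 10`), which keeps the load signal comparable when the chart resolution changes.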

Finally press the Save And Close button. This returns you to your dashboard that now has three different charts!

Let’s add an “instruction” note and arrange the charts!

Adding Notes and Dashboard Layout

1. Adding Notes

Often on dashboards it makes sense to place a short “instruction” pane that helps users of a dashboard. Let’s add one now by clicking on the New Text Note Button.

This will open the notes editor.

To allow you to add more than just text to your notes, Splunk lets you use Markdown in these notes/panes. Markdown is a lightweight markup language for creating formatted text using plain text, and is often used in web pages.

This includes (but is not limited to):

Headers. (in various sizes)
Emphasis styles.
Lists and Tables.
Links. These can be external webpages (for documentation for example) or directly to other Splunk IM Dashboards

Below is an example of the above Markdown options that you can use in your note.

# h1 Big headings

###### h6 To small headings

##### Emphasis

**This is bold text**, *This is italic text* , ~~Strikethrough~~

##### Lists

Unordered

+ Create a list by starting a line with `+`, `-`, or `*`
- Sub-lists are made by indenting 2 spaces:
- Marker character change forces new list start:
    * Ac tristique libero volutpat at
    + Facilisis in pretium nisl aliquet
* Very easy!

Ordered

1. Lorem ipsum dolor sit amet
2. Consectetur adipiscing elit
3. Integer molestie lorem at massa

##### Tables

| Option | Description |
| ------ | ----------- |
| chart  | path to data files to supply the data that will be passed into templates. |
| engine | engine to be used for processing templates. Handlebars is the default. |
| ext    | extension to be used for dest files. |

#### Links

[link to webpage](https://www.splunk.com)

[link to dashboard with title](https://app.eu0.signalfx.com/#/dashboard/EaJHrbPAEAA?groupId=EaJHgrsAIAA&configId=EaJHsHzAEAA "Link to the Sample chart Dashboard!")

Copy the above by using the copy button and paste it into the Edit box. The preview will show you how it will look.

2. Saving our chart

Give the Note chart a name (in our example we used Example text chart), then press the Save And Close button.

This will bring you back to your Dashboard, which now includes the note.

3. Ordering & sizing of charts

If you do not like the default order and sizes of your charts, you can simply drag them to the desired location and resize them.

Grab the top border of a chart and you should see the mouse pointer change to a drag icon (see picture below).

Now drag the Latency vs Load chart to sit between the Latency History Chart and the Example text chart.

You can also resize windows by dragging from the left, right and bottom edges.

As a last exercise reduce the width of the note chart to about a third of the other charts. The chart will automatically snap to one of the sizes it supports. Widen the 3 other charts to about a third of the Dashboard. Drag the notes to the right of the others and resize it to match it to the 3 others. Set the Time to -1h and you should have the following dashboard!

On to Detectors!

Working with Detectors

10 minutes

Create a Detector from one of your charts
Setting Alert conditions
Running a pre-flight check
Working with muting rules

1. Introduction

Splunk Observability Cloud uses detectors, events, alerts, and notifications to keep you informed when certain criteria are met. For example, you might want a message sent to a Slack channel or an email address for the Ops team when CPU Utilization has reached 95%, or when the number of concurrent users is approaching a limit that might require you to spin up an additional AWS instance.

These conditions are expressed as one or more rules that trigger an alert when the conditions in the rules are met. Individual rules in a detector are labeled according to criticality: Info, Warning, Minor, Major, and Critical.

2. Creating a Detector

In Dashboards click on your Custom Dashboard Group (that you created in the previous module) and then click on the dashboard name.

We are now going to create a new detector from a chart on this dashboard. Click on the bell icon on the Latency vs Load chart, and then click New Detector From Chart.

In the text field next to Detector Name, ADD YOUR INITIALS before the proposed detector name.

Naming the detector

It’s important that you add your initials in front of the proposed detector name.

It should be something like this: XYZ’s Latency Chart Detector.

Click on Create Alert Rule

In the Detector window, inside Alert signal, the Signal we will alert on is marked with a (blue) bell in the Alert on column. The bell indicates which Signal is being used to generate the alert.

Click on Proceed to Alert Condition

3. Setting Alert condition

In Alert condition, click on Static Threshold and then on Proceed to Alert Settings

In Alert Settings, enter the value 290 in the Threshold field. In the same window change Time on top right to past day (-1d).

4. Alert pre-flight check

A pre-flight check will take place after 5 seconds. See the Estimated alert count. Based on the current alert settings, the number of alerts we would have received in 1 day would have been 3.

About pre-flight checks

Once you set an alert condition, the UI estimates how many alerts you might get based on the current settings, and in the timeframe set on the upper right corner - in this case, the past day.

The platform immediately starts analyzing the signals with the current settings and performs what we call a Pre-flight Check. This enables you to test the alert conditions against the historical data in the platform, to ensure the settings are logical and will not inadvertently generate an alert storm. This removes the guesswork from configuring alerts in a simple but very powerful way, available only in Splunk Observability Cloud.

To read more about detector previewing, see Preview detector alerts.
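To make the idea of a pre-flight check concrete, here is a minimal sketch in JavaScript. It is not how the platform implements the check; the function name estimateAlertCount and the sample values are assumptions made for illustration. It counts how often a static threshold rule would have started firing over a series of historical data points.

```javascript
// Hypothetical helper (not a Splunk API): estimate how many alerts a static
// "above threshold" rule would have raised over historical values.
// An alert is counted each time the signal moves from at-or-below the
// threshold to above it; staying above does not raise a second alert.
function estimateAlertCount(values, threshold) {
  let alerts = 0;
  let firing = false;
  for (const value of values) {
    const above = value > threshold;
    if (above && !firing) {
      alerts += 1;
    }
    firing = above;
  }
  return alerts;
}

// Made-up latency samples, not real workshop data.
const history = [250, 270, 295, 300, 280, 291, 285, 289, 310, 260];

console.log(estimateAlertCount(history, 290)); // 3
console.log(estimateAlertCount(history, 305)); // 1
```

Running the same history against different thresholds is the essence of the preview: you can see the effect of a setting before any real notification is sent.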

Click on Proceed to Alert Message

5. Alert message

In Alert message, under Severity choose Major.

Click on Proceed to Alert Recipients

Click on Add Recipient and then on your email address displayed as the first option.

Notification Services

That’s the same as entering that email address. Alternatively, click on E-mail… to enter another email address.

This is just one example of the many Notification Services the platform has available. You can check this out by going to the Integrations tab of the top menu, and see Notification Services.

6. Alert Activation

Click on Proceed to Alert Activation

In Activate… click on Activate Alert Rule

If you want to get alerts sooner, you can edit the rule and lower the value from 290 to, say, 280.

If you change the Time to -1h you can see how many alerts you might get with the threshold you have chosen based on the metrics from the last 1 hour.

Click on Alerts & Detectors in the navbar and then click on Detectors. You can optionally filter for your initials. You will see your detector listed here. If you don’t, please refresh your browser.

Congratulations! You have created your first detector and activated it!

Working with Muting Rules

Learn how to configure Muting Rules
Learn how to resume notifications

1. Configuring Muting Rules

There will be times when you might want to mute certain notifications. For example, if you want to schedule downtime for maintenance on a server or set of servers, or if you are testing new code or settings etc. For that you can use muting rules in Splunk Observability Cloud. Let’s create one!

Click on Alerts & Detectors in the sidebar and then click Detectors to see the list of active detectors.

If you created a detector in Creating a Detector you can click on the three dots ... on the far right for that detector; if not, do that for another detector.

From the drop-down click on Create Muting Rule…

In the Muting Rule window check Mute Indefinitely and enter a reason.

Important

This will mute the notifications permanently until you come back here and un-check this box or resume notifications for this detector.

Click Next and in the new modal window confirm the muting rule setup.

Click on Mute Indefinitely to confirm.

You won’t be receiving any email notifications from your detector until you resume notifications again. Let’s now see how to do that!

2. Resuming notifications

To resume notifications, click on Muting Rules. You will see the name of the detector you muted notifications for under the Detector heading.

Click on the three dots ... on the far right, and click on Resume Notifications.

Click on Resume to confirm and resume notifications for this detector.

Congratulations! You have now resumed your alert notifications!

Monitoring as Code

10 minutes

Use Terraform¹ to manage Observability Cloud Dashboards and Detectors
Initialize the Terraform Splunk Provider².
Run Terraform to create detectors and dashboards from code using the Splunk Terraform Provider.
See how Terraform can also delete detectors and dashboards.

1. Initial setup

Monitoring as code adopts the same approach as infrastructure as code. You can manage monitoring the same way you do applications, servers, or other infrastructure components.

You can use monitoring as code to build out your visualizations, what to monitor, and when to alert, among other things. This means your monitoring setup, processes, and rules can be versioned, shared, and reused.

Full documentation for the Splunk Terraform Provider is available here.

Remaining in your AWS/EC2 instance, change into the terraform-jumpstart directory:

cd ~/observability-content-contrib/integration-examples/terraform-jumpstart

Initialize Terraform and upgrade to the latest version of the Splunk Terraform Provider.

Note: Upgrading the SignalFx Terraform Provider

You will need to run the command below each time a new version of the Splunk Terraform Provider is released. You can track the releases on GitHub.

terraform init -upgrade

    Upgrading modules...
    - aws in modules/aws
    - azure in modules/azure
    - docker in modules/docker
    - gcp in modules/gcp
    - host in modules/host
    - kafka in modules/kafka
    - kubernetes in modules/kubernetes
    - parent_child_dashboard in modules/dashboards/parent
    - pivotal in modules/pivotal
    - rum_and_synthetics_dashboard in modules/dashboards/rum_and_synthetics
    - usage_dashboard in modules/dashboards/usage

    Initializing the backend...

    Initializing provider plugins...
    - Finding latest version of splunk-terraform/signalfx...
    - Installing splunk-terraform/signalfx v6.20.0...
    - Installed splunk-terraform/signalfx v6.20.0 (self-signed, key ID CE97B6074989F138)

    Partner and community providers are signed by their developers.
    If you'd like to know more about provider signing, you can read about it here:
    https://www.terraform.io/docs/cli/plugins/signing.html

    Terraform has created a lock file .terraform.lock.hcl to record the provider
    selections it made above. Include this file in your version control repository
    so that Terraform can guarantee to make the same selections by default when
    you run "terraform init" in the future.

    Terraform has been successfully initialized!

    You may now begin working with Terraform. Try running "terraform plan" to see
    any changes that are required for your infrastructure. All Terraform commands
    should now work.

    If you ever set or change modules or backend configuration for Terraform,
    rerun this command to reinitialize your working directory. If you forget, other
    commands will detect it and remind you to do so if necessary.

2. Create execution plan

The terraform plan command creates an execution plan. By default, creating a plan consists of:

Reading the current state of any already-existing remote objects to make sure that the Terraform state is up-to-date.
Comparing the current configuration to the prior state and noting any differences.
Proposing a set of change actions that should, if applied, make the remote objects match the configuration.

The plan command alone will not actually carry out the proposed changes, so you can use this command to check whether the proposed changes match what you expected before you apply them.

terraform plan -var="api_token=$API_TOKEN" -var="realm=$REALM" -var="o11y_prefix=[$INSTANCE]"

Plan: 146 to add, 0 to change, 0 to destroy.
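The three planning steps described above can be sketched in a few lines of JavaScript. This is only an illustration of the compare-and-propose idea, not Terraform's implementation; the plan function and the resource names below are hypothetical.

```javascript
// Hypothetical planner: compares the desired configuration with the recorded
// state and proposes which resources to add, change, or destroy.
function plan(config, state) {
  const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
  const summary = { add: [], change: [], destroy: [] };

  for (const [name, desired] of Object.entries(config)) {
    if (!has(state, name)) {
      summary.add.push(name); // in config, not yet in state
    } else if (JSON.stringify(desired) !== JSON.stringify(state[name])) {
      // Simplistic comparison: sensitive to key order, fine for a sketch.
      summary.change.push(name);
    }
  }
  for (const name of Object.keys(state)) {
    if (!has(config, name)) {
      summary.destroy.push(name); // in state, no longer in config
    }
  }
  return summary;
}

// Made-up resources for illustration.
const config = {
  cpu_detector: { threshold: 95 },
  latency_dashboard: { charts: 3 },
};
const state = {
  cpu_detector: { threshold: 90 },
  old_dashboard: { charts: 1 },
};

const result = plan(config, state);
console.log(
  `Plan: ${result.add.length} to add, ${result.change.length} to change, ` +
    `${result.destroy.length} to destroy.`
);
// Plan: 1 to add, 1 to change, 1 to destroy.
```

Nothing is modified by the function itself, which mirrors why terraform plan is safe to run before terraform apply.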

If the plan executes successfully, we can go ahead and apply:

3. Apply execution plan

The terraform apply command executes the actions proposed in the Terraform plan above.

The most straightforward way to use terraform apply is to run it without any arguments at all, in which case it will automatically create a new execution plan (as if you had run terraform plan) and then prompt you to provide the API Token, Realm (the prefix defaults to Splunk) and approve the plan, before taking the indicated actions.

Because this is a workshop, the prefix must be unique, so you need to run the terraform apply command below.

terraform apply -var="api_token=$API_TOKEN" -var="realm=$REALM" -var="o11y_prefix=[$INSTANCE]"

Apply complete! Resources: 146 added, 0 changed, 0 destroyed.

Once the apply has completed, validate that the detectors were created by going to Alerts & Detectors and clicking on the Detectors tab. They will be prefixed by the instance name. To check the prefix value, run:

echo $INSTANCE

You will see a list of the new detectors and you can search for the prefix that was output from above.

4. Destroy all your hard work

The terraform destroy command is a convenient way to destroy all remote objects managed by your Terraform configuration.

While you will typically not want to destroy long-lived objects in a production environment, Terraform is sometimes used to manage ephemeral infrastructure for development purposes, in which case you can use terraform destroy to conveniently clean up all of those temporary objects once you are finished with your work.

Now go and destroy all the Detectors and Dashboards that were previously applied!

terraform destroy -var="api_token=$API_TOKEN" -var="realm=$REALM"

Destroy complete! Resources: 146 destroyed.

Validate all the detectors have been removed by navigating to Alerts → Detectors

Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions. ↩︎
A provider is responsible for understanding API interactions and exposing resources. Providers generally are an IaaS (e.g. Alibaba Cloud, AWS, GCP, Microsoft Azure, OpenStack), PaaS (e.g. Heroku), or SaaS services (e.g. Splunk, Terraform Cloud, DNSimple, Cloudflare). ↩︎

Service Bureau

10 minutes

How to keep track of the usage of Observability Cloud in your organization
Learn how to keep track of spend by exploring the Subscription Usage interface
Creating Teams
Adding notification rules to Teams
Controlling usage

1. Understanding engagement

To fully understand Observability Cloud engagement inside your organization, click on the » at the bottom left and select Settings → Organization Overview. This provides the following dashboards that show you how your Observability Cloud organization is being used:

You will see various dashboards such as Throttling, System Limits, Entitlements & Engagement. The workshop organization you’re using now may have less data to work with as this is cleared down after each workshop.

Take a minute to explore the various dashboards and charts in the Organization Overview of this workshop instance.

2. Subscription Usage

If you want to see what your usage is against your subscription you can select Subscription Usage.

This screen may take a few seconds to load whilst it calculates and pulls in the usage.

3. Understanding usage

You will see a screen similar to the one below that will give you an overview of the current usage, the average usage and your entitlement per category: Hosts, Containers, Custom Metrics and High Resolution Metrics.

For more information about these categories please refer to Monitor Splunk Infrastructure Monitoring subscription usage.

4. Examine usage in detail

The top chart shows you the current subscription levels per category (shown by the red arrows at the top in the screenshot below).

Also, your current usage of the four categories is displayed (shown in the red lines at the bottom of the chart).

In this example, you can see that there are 25 Hosts, 0 Containers, 100 Custom Metrics and 0 High Resolution Metrics.

In the bottom chart, you can see the usage per category for the current period (shown in the drop-down box on the top right of the chart).

The blue line marked Average Usage indicates what Observability Cloud will use to calculate your average usage for the current Subscription Usage Period.

Info

As you can see from the screenshot, Observability Cloud does not use High Watermark or P95% for cost calculation but the actual average hourly usage, allowing you to do performance testing or Blue/Green style deployments etc. without the risk of overage charges.
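The difference between these three measures is easy to see with a small calculation. The sketch below uses made-up hourly host counts and hypothetical helper names; it is not the platform's billing logic, only an illustration of why a short burst affects the peak and the P95 far more than the average.

```javascript
// Hypothetical helpers for comparing usage measures over hourly samples.
function average(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function highWatermark(values) {
  return Math.max(...values);
}

// Nearest-rank percentile: the smallest value such that at least p percent
// of the samples are less than or equal to it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Made-up data: 20 hosts normally, 70 hosts for two hours of load testing.
const hourlyHosts = [20, 20, 20, 20, 70, 70, 20, 20, 20, 20];

console.log(average(hourlyHosts));        // 30
console.log(highWatermark(hourlyHosts));  // 70
console.log(percentile(hourlyHosts, 95)); // 70
```

With these numbers, a two-hour test more than triples the peak but raises the average from 20 to only 30.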

To get a feel for the options you can change the metric displayed by selecting the different options from the Usage Metric drop-down on the left, or change the Subscription Usage Period with the drop-down on the right.

Please take a minute to explore the different time periods & categories and their views.

Finally, the pane on the right shows you information about your Subscription.

Teams

Introduction to Teams
Create a Team and add members to the Team

1. Introduction to Teams

To make sure that users see the dashboards and alerts that are relevant to them when using Observability Cloud, most organizations will use Observability Cloud’s Teams feature to assign a member to one or more Teams.

Ideally, this matches work-related roles, for example, members of a Dev-Ops or Product Management group would be assigned to the corresponding Teams in Observability Cloud.

When a user logs into Observability Cloud, they can choose which Team Dashboard will be their home page and they will typically select the page for their primary role.

In the example below, the user is a member of the Development, Operations and Product Management Teams, and is currently viewing the Dashboard for the Operations Team.

This Dashboard has specific Dashboard Groups for Usage, SaaS and APM Business Workflows assigned but any Dashboard Group can be linked to a Teams Dashboard.

They can use the menu along the top left to quickly navigate between their allocated teams, or they can use the ALL TEAMS dropdown on the right to select specific Team Dashboards, as well as quickly access ALL Dashboards using the adjacent link.

Alerts can be linked to specific Teams so the Team can monitor only the Alerts they are interested in, and in the above example, they currently have 1 active Critical Alert.

The Description for the Team Dashboard can be customized and can include links to team-specific resources (using Markdown).

2. Creating a new Team

To work with Splunk’s Team UI click on the hamburger icon top left and select the Organizations Settings → Teams.

When the Team UI is selected you will be presented with the list of current Teams.

To add a new Team click on the Create New Team button. This will present you with the Create New Team dialog.

Create your own team by naming it [YOUR-INITIALS]-Team and add yourself by searching for your name and selecting the Add link next to your name. This should result in a dialog similar to the one below:

You can remove selected users by pressing Remove or the small x.

Make sure you have your group created with your initials and with yourself added as a member, then click Done

This will bring you back to the Teams list that will now show your Team and the ones created by others.

Note

The Team(s) you are a member of have a grey Member icon in front of them.

If no members are assigned to your Team, you should see a blue Add Members link instead of the member count. Clicking on that link takes you to the Edit Team dialog where you can add yourself.

This is the same dialog you get when pressing the 3 dots … at the end of the line with your Team and selecting Edit Team

The … menu gives you the option to Edit, Join, Leave or Delete a Team (leave and join will depend on if you are currently a member).

3. Adding Notification Rules

You can set up specific notification rules per team by clicking on the Notification Policy tab. This will open the notification edit menu.

By default, the system offers you the ability to set up a general notification rule for your team.

Note

The Email all team members option means all members of this Team will receive an email with the Alert information, regardless of the alert type.

3.1 Adding recipients

You can add other recipients, by clicking Add Recipient. These recipients do not need to be Observability Cloud users.

However, if you click on the link Configure separate notification tiers for different severity alerts you can configure every alert level independently.

Different alert rules for the different alert levels can be configured, as shown in the above image.

Critical and Major alerts use Splunk's On-Call Incident Management solution. Minor alerts are sent to the Team's Slack channel, and Warning and Info alerts are sent by email.
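Conceptually, a tiered notification policy is a lookup from severity to destination. The sketch below mirrors the example tiers described above; the object layout, the destination labels, and the routeAlert function are hypothetical and do not represent how Observability Cloud stores its policies.

```javascript
// Hypothetical severity-to-destination table mirroring the example tiers.
const notificationPolicy = {
  Critical: 'on-call',
  Major: 'on-call',
  Minor: 'slack',
  Warning: 'email',
  Info: 'email',
};

// Returns the destination for a severity, or throws for unknown severities.
function routeAlert(severity) {
  if (!Object.prototype.hasOwnProperty.call(notificationPolicy, severity)) {
    throw new Error(`Unknown severity: ${severity}`);
  }
  return notificationPolicy[severity];
}

console.log(routeAlert('Critical')); // on-call
console.log(routeAlert('Minor'));    // slack
console.log(routeAlert('Info'));     // email
```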

3.2 Notification Integrations

In addition to sending alert notifications via email, you can configure Observability Cloud to send alert notifications to the services shown below.

Take a moment to create some notification rules for your Team.

Controlling Usage

Discover how you can restrict usage by creating separate Access Tokens and setting limits.

1. Access Tokens

If you wish to control the consumption of Hosts, Containers, Custom Metrics and High Resolution Metrics, you can create multiple Access Tokens and allocate them to different parts of your organization.

In the UI, click on the » at the bottom left and select Settings → Access Tokens under General Settings.

The Access Tokens Interface provides an overview of your allotments in the form of a list of Access Tokens that have been generated. Every Organization will have a Default token generated when they are first set up, but there will typically be multiple Tokens configured.

Each Token is unique and can be assigned limits for the number of Hosts, Containers, Custom Metrics and High Resolution Metrics it can consume.

The Usage Status Column quickly shows if a token is above or below its assigned limits.

2. Creating a new token

Let’s create a new token by clicking on the New Token button. This will provide you with the Name Your Access Token dialog.

Enter the name of the new Token using your initials, e.g. RWC-Token, and make sure to tick both the Ingest Token and API Token checkboxes!

After you press OK you will be taken back to the Access Token UI. Here your new token should be present, among the ones created by others.

If you have made an error in your naming, want to disable/enable a token or set a Token limit, click on the ellipsis (…) menu button behind a token limit to open the manage token menu.

If you made a typo you can use the Rename Token option to correct the name of your token.

3. Disabling a token

If you need to make sure a token cannot be used to send metrics in, you can disable it.

Click on Disable to disable the token. This means the token cannot be used for sending data to Splunk Observability Cloud.

The line with your token should have become greyed out to indicate that it has been disabled as you can see in the screenshot below.

Go ahead and click on the ellipsis (…) menu button to Disable and Enable your token.

4. Manage token usage limits

Now, let’s start limiting usage by clicking on Manage Token Limit in the 3 … menu.

This will show the Manage Token Limit Dialog:

In this dialog, you can set the limits per category.

Please go ahead and specify the limits as follows for each usage metric:

Limit	Value
Host Limit	5
Container Limit	15
Custom Metric Limit	20
High Resolution Metric Limit	0

For our lab, use your own email address as the recipient, and double-check that the numbers in your dialog box match the table above.

Token limits are used to trigger an alert that notifies one or more recipients when the usage has been above 90% of the limit for 5 minutes.

To specify the recipients, click Add Recipient, then select the recipient or notification method you want to use (specifying recipients is optional but highly recommended).

The severity of token alerts is always Critical.
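The alert rule described above (usage above 90% of the limit for 5 minutes) can be sketched as follows. This is an illustration only, assuming one usage sample per minute; the function name shouldAlert and its options are hypothetical and not part of any Splunk API.

```javascript
// Hypothetical check: alert when every sample in the most recent window is
// above the given fraction of the limit.
// samples: one usage value per minute, oldest first (an assumption).
function shouldAlert(samples, limit, { ratio = 0.9, minutes = 5 } = {}) {
  if (samples.length < minutes) {
    return false; // not enough history to cover the window
  }
  const recent = samples.slice(-minutes);
  return recent.every((usage) => usage > limit * ratio);
}

// With a Host Limit of 5, the alert level is 4.5 hosts.
console.log(shouldAlert([4, 5, 5, 5, 5, 5], 5)); // true
console.log(shouldAlert([5, 5, 5, 4, 5, 5], 5)); // false (one dip inside the window)
console.log(shouldAlert([5, 5, 5], 5));          // false (fewer than 5 samples)
```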

Click on Update to save your Access Token limits and the alert settings.

Note: Going above token limit

When a token is at or above its limit in a usage category, new metrics for that usage category will not be stored and processed by Observability Cloud. This will make sure there will be no unexpected cost due to a team sending in data without restriction.

Note: Advanced alerting

If you wish to get alerts before you hit 90%, you can create additional detectors using whatever values you want. These detectors could target the Teams consuming the specific Access Tokens so they can take action before the admins need to get involved.

In your company you would distribute these new Access Tokens to various teams, controlling how much information/data they can send to Observability Cloud.

This will allow you to fine-tune the way you consume your Observability Cloud allotment and prevent overages from happening.

Congratulations! You have now completed the Service Bureau module.

Build a Distributed Trace in Lambda and Kinesis

45 minutes Author Katie Hymers

This workshop will show you how a distributed trace is constructed for a small serverless application that runs on AWS Lambda, producing and consuming a message via AWS Kinesis.

We will see how auto-instrumentation works together with manual steps that force a Producer function’s context to be sent to a Consumer function via a Record put on a Kinesis stream.

For this workshop Splunk has prepared an Ubuntu Linux instance in AWS/EC2 all pre-configured for you.

To get access to the instance that you will be using in the workshop, please visit the URL provided by the workshop leader.

Setup

This lab will make a tracing superhero out of you!

In this lab you will learn how a distributed trace is constructed for a small serverless application that runs on AWS Lambda, producing and consuming your message via AWS Kinesis.

Pre-requisites

You should already have the lab content available on your EC2 lab host.

Ensure that this lab’s required folder o11y-lambda-lab is in your home directory:

cd ~ && ls

o11y-lambda-lab

Note

If you don’t see it, fetch the lab contents by running the following command:

git clone https://github.com/kdroukman/o11y-lambda-lab.git

Set Environment Variables

In your Splunk Observability Cloud Organisation (Org) obtain your Access Token and Realm Values.

Please reset your environment variables from the earlier lab. Take care that for this lab we may be using different names - make sure to match the Environment Variable names below.

export ACCESS_TOKEN=CHANGE_ME
export REALM=CHANGE_ME
export PREFIX=$INSTANCE

Update Auto-instrumentation serverless template

Update your auto-instrumentation Serverless template to include new values from the Environment variables.

cat ~/o11y-lambda-lab/auto/serverless_unset.yml | envsubst > ~/o11y-lambda-lab/auto/serverless.yml

Examine the output of the updated serverless.yml contents (you may need to scroll up to the relevant section).

cat ~/o11y-lambda-lab/auto/serverless.yml

# USER SET VALUES =====================              
custom: 
  accessToken: <updated to your Access Token>
  realm: <updated to your Realm>
  prefix: <updated to your Hostname>
#======================================
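If you are curious what the envsubst step does, the following JavaScript sketch imitates its basic behaviour: it replaces $NAME and ${NAME} references with values from an environment map, and replaces unset variables with an empty string. The substituteEnv function and the template lines are hypothetical; the actual contents of serverless_unset.yml may differ.

```javascript
// Hypothetical imitation of basic envsubst behaviour.
// Only simple names (letters, digits, underscore) are treated as variables,
// so Serverless references such as ${self:custom.prefix} are left untouched.
function substituteEnv(template, env) {
  return template.replace(/\$\{(\w+)\}|\$(\w+)/g, (match, braced, bare) => {
    const name = braced || bare;
    return Object.prototype.hasOwnProperty.call(env, name) ? env[name] : '';
  });
}

// Assumed template lines, for illustration only.
const template = [
  'accessToken: ${ACCESS_TOKEN}',
  'realm: $REALM',
  'prefix: ${PREFIX}',
  'environment: ${self:custom.prefix}-apm-lambda',
].join('\n');

console.log(
  substituteEnv(template, { ACCESS_TOKEN: 'abc123', REALM: 'eu0', PREFIX: 'abcd' })
);
// accessToken: abc123
// realm: eu0
// prefix: abcd
// environment: ${self:custom.prefix}-apm-lambda
```

This also shows why a missing environment variable is easy to overlook: the placeholder silently becomes empty, which is why the lab asks you to examine the generated serverless.yml.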

Update Manual instrumentation template

Update your manual instrumentation Serverless template to include new values from the Environment variables.

cat ~/o11y-lambda-lab/manual/serverless_unset.yml | envsubst > ~/o11y-lambda-lab/manual/serverless.yml

Examine the output of the updated serverless.yml contents (you may need to scroll up to the relevant section).

cat ~/o11y-lambda-lab/manual/serverless.yml

# USER SET VALUES =====================              
custom: 
  accessToken: <updated to your Access Token>
  realm: <updated to your Realm>
  prefix: <updated to your Hostname>
#======================================

Set your AWS Credentials

You will be provided with AWS Access Key ID and AWS Secret Access Key values. Substitute these values in place of AWS_ACCESS_KEY_ID and AWS_ACCESS_KEY_SECRET in the command below:

sls config credentials --provider aws --key AWS_ACCESS_KEY_ID --secret AWS_ACCESS_KEY_SECRET

This command will create a file ~/.aws/credentials with your AWS Credentials populated.

Note that we are using sls here, the command-line tool of the Serverless Framework, which is used for developing and deploying AWS Lambda functions. We will be using this command throughout the lab.

Now you are set up and ready to go!

Auto-Instrumentation

Navigate to the auto directory that contains auto-instrumentation code.

cd ~/o11y-lambda-lab/auto

Inspect the contents of the files in this directory. Take a look at the serverless.yml template.

cat serverless.yml

Workshop Question

Can you identify which AWS entities are being created by this template?
Can you identify where OpenTelemetry instrumentation is being set up?
Can you determine which instrumentation information is being provided by the Environment Variables?

You should see the Splunk OpenTelemetry Lambda layer being added to each function.

layers:
      - arn:aws:lambda:us-east-1:254067382080:layer:splunk-apm:70

You can see the relevant layer ARNs (Amazon Resource Name) and latest versions for each AWS region here: https://github.com/signalfx/lambda-layer-versions/blob/main/splunk-apm/splunk-apm.md

You should also see a section where the Environment variables are being set.

environment:
  AWS_LAMBDA_EXEC_WRAPPER: /opt/nodejs-otel-handler
  OTEL_RESOURCE_ATTRIBUTES: deployment.environment=${self:custom.prefix}-apm-lambda
  OTEL_SERVICE_NAME: consumer-lambda
  SPLUNK_ACCESS_TOKEN: ${self:custom.accessToken}
  SPLUNK_REALM: ${self:custom.realm}

Using the environment variables, we are configuring and enriching our auto-instrumentation.

Here we provide the minimum information, such as the NodeJS wrapper location in the Splunk APM Layer, the environment name, the service name, and our Splunk Org credentials. We are sending trace data directly to Splunk Observability Cloud. You could alternatively export traces to an OpenTelemetry Collector set up in Gateway mode.
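OTEL_RESOURCE_ATTRIBUTES carries a comma-separated list of key=value pairs, which is how the deployment.environment value reaches the instrumentation. As a rough illustration of that format (not the OpenTelemetry SDK's own parser), the hypothetical function below splits such a string into an object.

```javascript
// Hypothetical parser for a "key=value,key=value" attribute string.
// Pairs without an "=" are skipped; whitespace around keys and values is trimmed.
function parseResourceAttributes(raw) {
  const attributes = {};
  for (const pair of raw.split(',')) {
    const index = pair.indexOf('=');
    if (index === -1) {
      continue;
    }
    const key = pair.slice(0, index).trim();
    const value = pair.slice(index + 1).trim();
    attributes[key] = value;
  }
  return attributes;
}

// "abcd" stands in for your instance prefix; service.version is made up.
console.log(
  parseResourceAttributes('deployment.environment=abcd-apm-lambda,service.version=1.0')
);
// { 'deployment.environment': 'abcd-apm-lambda', 'service.version': '1.0' }
```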

Take a look at the function code.

cat handler.js

Workshop Question

Can you identify the code for the producer function?
Can you identify the code for the consumer function?

Notice there is no mention of Splunk or OpenTelemetry in the code. We are adding the instrumentation using the Lambda layer and Environment Variables only.

Deploy your Lambdas

Run the following command to deploy your Lambda Functions:

sls deploy

Deploying hostname-lambda-lab to stage dev (us-east-1)
...
...
endpoint: POST - https://randomstring.execute-api.us-east-1.amazonaws.com/dev/producer
functions:
  producer: hostname-lambda-lab-dev-producer (1.6 kB)
  consumer: hostname-lambda-lab-dev-consumer (1.6 kB)

This command will follow the instructions in your serverless.yml template to create your Lambda functions and your Kinesis stream. Note that it may take 1-2 minutes to execute.

Note

serverless.yml is compiled by the Serverless Framework into a CloudFormation template. CloudFormation is an infrastructure as code service from AWS. You can read more about it here - https://aws.amazon.com/cloudformation/

Check the details of your serverless functions:

sls info

Take note of your endpoint value:

Send some Traffic

Use the curl command to send a payload to your producer function. Note the command option -d is followed by your message payload.

Try changing the value of name to your name and telling the Lambda function about your superpower. Replace YOUR_ENDPOINT with the endpoint from your previous step.

curl -d '{ "name": "CHANGE_ME", "superpower": "CHANGE_ME" }' YOUR_ENDPOINT

For example:

curl -d '{ "name": "Kate", "superpower": "Distributed Tracing" }' https://xvq043lj45.execute-api.us-east-1.amazonaws.com/dev/producer

You should see the following output if your message is successful:

{"message":"Message placed in the Event Stream: hostname-eventSteam"}

If unsuccessful, you will see:

{"message": "Internal server error"}

If this occurs, ask one of the lab facilitators for assistance.

If you see a success message, generate more load: re-send that message 5+ times. You should keep seeing a success message after each send.

Check the lambda logs output:

Producer function logs:

sls logs -f producer

Consumer function logs:

sls logs -f consumer

Examine the logs carefully.

Workshop Question

Do you see OpenTelemetry being loaded? Look out for lines with splunk-extension-wrapper.

Lambdas in Splunk APM

Now it’s time to check how your Lambda traffic has been captured in Splunk APM.

Navigate to your Splunk Observability Cloud

Select APM from the Main Menu and then select your APM Environment. Your APM environment should be in the format $INSTANCE-apm-lambda, where $INSTANCE is the four-letter name of your lab host. (Check it by looking at your command prompt, or by running echo $INSTANCE.)

Note

It may take a few minutes for your traces to appear in Splunk APM. Try hitting refresh in your browser until you find your environment name in the list of Environments.

Go to Explore the Service Map to see the Dependencies between your Lambda Functions.

You should be able to see the producer-lambda and the call it is making to the Kinesis service.

Workshop Question

What about your consumer-lambda?

Click into Traces and examine some traces that contain producer function calls and some that contain consumer function calls.

We can see the producer-lambda putting a Record on the Kinesis stream. But the action of the consumer-lambda is disconnected!

This is because the Trace Context is not being propagated.

At the time of this lab, the Kinesis service does not support this automatically out of the box. Our Distributed Trace stops at the Kinesis inferred service, and we can’t see the propagation any further.

Not yet…

Let’s see how we work around this in the next section of this lab.

Manual Instrumentation

Navigate to the manual directory that contains manually instrumented code.

cd ~/o11y-lambda-lab/manual

Inspect the contents of the files in this directory. Take a look at the serverless.yml template.

cat serverless.yml

Workshop Question

Do you see any difference from the same file in your auto directory?

You can try to compare them with a diff command:

diff ~/o11y-lambda-lab/auto/serverless.yml ~/o11y-lambda-lab/manual/serverless.yml

19c19
< #======================================    
---
> #======================================

There is no difference! (Well, there shouldn’t be. Ask your lab facilitator to assist you if there is)

Now compare handler.js with the same file in the auto directory using the diff command:

diff ~/o11y-lambda-lab/auto/handler.js ~/o11y-lambda-lab/manual/handler.js

Look at all these differences!

You may wish to view the entire file with the cat handler.js command and examine its content.

Notice how we are now importing some OpenTelemetry libraries directly into our function to handle some of the manual instrumentation tasks we require.

const otelapi  = require('@opentelemetry/api');
const otelcore = require('@opentelemetry/core');

We are using https://www.npmjs.com/package/@opentelemetry/api to manipulate the tracing logic in our functions. We are using https://www.npmjs.com/package/@opentelemetry/core to access the Propagator objects that we will use to manually propagate our context with.

Inject Trace Context in Producer Function

The code below executes the following steps inside the Producer function:

Get the current Active Span.
Create a Propagator.
Initialize a context carrier object.
Inject the context of the active span into the carrier object.
Modify the record we are about to put on our Kinesis stream to include the carrier that will carry the active span’s context to the consumer.

const activeSpan = otelapi.trace.getSpan(otelapi.context.active());
const propagator = new otelcore.W3CTraceContextPropagator();
let carrier = {};
propagator.inject(otelapi.trace.setSpanContext(otelapi.ROOT_CONTEXT, activeSpan.spanContext()),
    carrier,
    otelapi.defaultTextMapSetter
  );
const data = "{\"tracecontext\": " + JSON.stringify(carrier) + ", \"record\":" + event.body + "}";
console.log(`Record with Trace Context added: 
  ${data}`);
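To make the shape of the resulting record easier to see, here is a minimal Python sketch of the same wrapping step (the lab code itself is JavaScript). The function name wrap_record and the sample traceparent value are assumptions for illustration, not values taken from the lab.

```python
import json


def wrap_record(carrier: dict, body: str) -> str:
    """Wrap the trace-context carrier and the original request body in one JSON envelope."""
    envelope = {
        "tracecontext": carrier,     # what the propagator injected, e.g. {"traceparent": "..."}
        "record": json.loads(body),  # the JSON payload the producer received
    }
    return json.dumps(envelope)


# Hypothetical carrier, shaped like what a W3C Trace Context propagator injects.
carrier = {"traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"}
data = wrap_record(carrier, '{"name": "Kate", "superpower": "Distributed Tracing"}')
print(data)
```

Building the envelope with a JSON serializer instead of string concatenation also rejects a request body that is not valid JSON before it reaches the stream.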

Extract Trace Context in Consumer Function

The code below executes the following steps inside the Consumer function:

Extract the context that we obtained from the Producer into a carrier object.
Create a Propagator.
Extract the context from the carrier object into the Consumer function’s parent span context.
Start a new span with the parent span context.
Bonus: Add extra attributes to your span, including custom ones with the values from your message!
Once completed, end the span.

const carrier = JSON.parse( message ).tracecontext;
const propagator = new otelcore.W3CTraceContextPropagator();
const parentContext = propagator.extract(otelapi.ROOT_CONTEXT, carrier, otelapi.defaultTextMapGetter);
const tracer = otelapi.trace.getTracer(process.env.OTEL_SERVICE_NAME);
const span = tracer.startSpan("Kinesis.getRecord", undefined, parentContext);
                         
// Mark the span as a server span and tag it with values from the message.
span.setAttribute("span.kind", "server");
const body = JSON.parse( message ).record;
if (body.name) {
    span.setAttribute("custom.tag.name", body.name);
}
if (body.superpower) {
    span.setAttribute("custom.tag.superpower", body.superpower);
}
// ... function does some work ...
span.end();
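The consumer side is the mirror image of the producer: split the envelope back into the carrier and the original record, then derive the span attributes from the record. The following Python sketch shows only that data handling (no OpenTelemetry calls); the function names unwrap_record and custom_attributes and the sample message are assumptions for illustration.

```python
import json


def unwrap_record(message: str):
    """Split the envelope into the trace-context carrier and the original record."""
    envelope = json.loads(message)
    return envelope["tracecontext"], envelope["record"]


def custom_attributes(record: dict) -> dict:
    """Collect the span attributes that the consumer sets from the message fields."""
    attributes = {"span.kind": "server"}
    for field in ("name", "superpower"):
        if record.get(field):  # skip fields that are missing or empty
            attributes[f"custom.tag.{field}"] = record[field]
    return attributes


# Hypothetical message, shaped like the record the producer puts on the stream.
message = (
    '{"tracecontext": {"traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"}, '
    '"record": {"name": "Kate", "superpower": "Distributed Tracing"}}'
)
carrier, record = unwrap_record(message)
print(custom_attributes(record))
```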

Now let’s see the difference this makes.

Redeploy Lambdas

Re-deploy your Lambdas

While remaining in your manual directory, run the following commands to re-deploy your Lambda Functions:

sls deploy -f producer

Deploying function producer to stage dev (us-east-1)

✔ Function code deployed (6s)
Configuration did not change. Configuration update skipped. (6s)

sls deploy -f consumer

Deploying function consumer to stage dev (us-east-1)

✔ Function code deployed (6s)
Configuration did not change. Configuration update skipped. (6s)

Note that this deployment now only updates the code changes within the function. Our configuration remains the same.

Check the details of your serverless functions:

sls info

Your endpoint value should remain the same:

Send some Traffic again

Use the curl command to send a payload to your producer function. Note the command option -d is followed by your message payload.

Try changing the value of name to your name and telling the Lambda function about your superpower. Replace YOUR_ENDPOINT with the endpoint from your previous step.

curl -d '{ "name": "CHANGE_ME", "superpower": "CHANGE_ME" }' YOUR_ENDPOINT

For example:

curl -d '{ "name": "Kate", "superpower": "Distributed Tracing" }' https://xvq043lj45.execute-api.us-east-1.amazonaws.com/dev/producer

You should see the following output if your message is successful:

{"message":"Message placed in the Event Stream: hostname-eventSteam"}

If unsuccessful, you will see:

{"message": "Internal server error"}

If this occurs, ask one of the lab facilitators for assistance.

If you see a success message, generate more load: re-send that message 5+ times. You should keep seeing a success message after each send.

Check the lambda logs output:

sls logs -f producer

sls logs -f consumer

Examine the logs carefully.

Workshop Question

Do you notice the difference?

Note that we are logging our Record together with the Trace context that we have added to it. Copy one of the underlined sub-sections of your trace parent context, and save it for later.

Updated Lambdas in Splunk APM

Navigate back to APM in Splunk Observability Cloud

Go back to your Service Dependency map.

Workshop Question

Notice the difference?

You should be able to see the consumer-lambda now clearly connected to the producer-lambda.

Remember the value you copied from your producer logs? You can run the sls logs -f consumer command again on your EC2 lab host to fetch one.

Take that value, and paste it into trace search:

Click on Go and you should be able to find the logged Trace:

Notice that the Trace ID is something that makes up the trace context that we propagated.
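To see where the Trace ID sits inside the propagated context, here is a minimal Python sketch that splits a W3C traceparent value (version 00) into its four fields. The function name parse_traceparent is an assumption for illustration, and the sample value is the example from the W3C Trace Context specification, not a value from your lab.

```python
import re

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(value: str) -> dict:
    """Split a version-00 traceparent header value into its fields."""
    match = TRACEPARENT_RE.match(value)
    if match is None:
        raise ValueError(f"not a valid traceparent: {value!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,    # the value you can paste into trace search
        "parent_id": parent_id,  # the span ID of the calling span
        "sampled": bool(int(flags, 16) & 0x01),  # lowest flag bit is the sampled flag
    }


fields = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(fields["trace_id"])
```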

You can read up on the two common propagation standards:

Workshop Question

Which one are we using?

It should be self-explanatory from the Propagator we are creating in the Functions

Workshop Question

Bonus Question: What happens if we mix and match the W3C and B3 headers?

Expand the consumer-lambda span.

Workshop Question

Can you find the attributes from your message?

Summary

Before you Go

Please kindly clean up your lab using the following command:

sls remove

Conclusion

Congratulations on finishing the lab. You have seen how we complement auto-instrumentation with manual steps to force the Producer function’s context to be sent to the Consumer function via a Record put on a Kinesis stream. This allowed us to build the expected Distributed Trace.

You can now build out a Trace manually by linking two different functions together. This is very powerful when your auto-instrumentation, or third-party systems, do not support context propagation out of the box.

Getting Data In (GDI) with OTel and UF

45 minutes

During this technical workshop, you will learn how to:

Efficiently deploy complex environments
Capture metrics from these environments to Splunk Observability Cloud
Auto-instrument a Python application
Enable OS logging to Splunk Enterprise via Universal Forwarder

To simplify the workshop modules, a pre-configured AWS EC2 instance is provided.

By the end of this technical workshop, you will have an approach to demonstrating metrics collection for complex environments and services.

Getting Started with O11y GDI - Real Time Enrichment Workshop

Please note: to begin the following lab, you must have completed the prework:

Obtain a Splunk Observability Cloud access key
Understand cli commands

Follow these steps if using O11y Workshop EC2 instances

1. Verify yelp data files are present

ll /var/appdata/yelp*

2. Export the following variables

export ACCESS_TOKEN=<your-access-token>
export REALM=<your-o11y-cloud-realm>
export clusterName=<your-k8s-cluster>

3. Clone the following repo

cd /home/splunk 
git clone https://github.com/leungsteve/realtime_enrichment.git 
cd realtime_enrichment/workshop 
python3 -m venv rtapp-workshop 
source rtapp-workshop/bin/activate

Deploy Complex Environments and Capture Metrics

Objective: Learn how to efficiently deploy complex infrastructure components such as Kafka and MongoDB to demonstrate metrics collection with Splunk O11y IM integrations

Duration: 15 Minutes

Scenario

A prospect uses Kafka and MongoDB in their environment. Since there are integrations for these services, you’d like to demonstrate this to the prospect. What is a quick and efficient way to set up a live environment with these services and have metrics collected?

1. Where can I find helm charts?

Google “myservice helm chart”
https://artifacthub.io/ (Note: Look for charts from trusted organizations, with high star count and frequent updates)

2. Review Apache Kafka packaged by Bitnami

We will deploy the helm chart with these options enabled:

replicaCount=3
metrics.jmx.enabled=true
metrics.kafka.enabled=true
deleteTopicEnable=true

3. Review MongoDB(R) packaged by Bitnami

We will deploy the helm chart with these options enabled:

version 12.1.31
metrics.enabled=true
global.namespaceOverride=default
auth.rootUser=root
auth.rootPassword=splunk
auth.enabled=false

4. Install Kafka and MongoDB with helm charts

helm repo add bitnami https://charts.bitnami.com/bitnami

helm install kafka --set replicaCount=3 --set metrics.jmx.enabled=true --set metrics.kafka.enabled=true  --set deleteTopicEnable=true bitnami/kafka

helm install mongodb --set metrics.enabled=true bitnami/mongodb --set global.namespaceOverride=default --set auth.rootUser=root --set auth.rootPassword=splunk --set auth.enabled=false --version 12.1.31
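Long helm install lines with many --set flags are easy to mistype. As a minimal sketch, the following Python helper assembles the argument list from a dictionary of options; the function name helm_install_args is an assumption for illustration, and the sketch only builds the command, it does not run helm.

```python
import shlex


def helm_install_args(release: str, chart: str, values: dict, version: str = "") -> list:
    """Build the argument list for `helm install` from a dict of --set options."""
    args = ["helm", "install", release, chart]
    for key, value in values.items():
        args += ["--set", f"{key}={value}"]
    if version:
        args += ["--version", version]
    return args


kafka_args = helm_install_args(
    "kafka",
    "bitnami/kafka",
    {
        "replicaCount": 3,
        "metrics.jmx.enabled": "true",
        "metrics.kafka.enabled": "true",
        "deleteTopicEnable": "true",
    },
)
# shlex.join renders the list as a single shell-safe command line.
print(shlex.join(kafka_args))
```

The same list could be handed to subprocess.run(kafka_args, check=True) on a host where helm is installed.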

Verify the helm chart installation

helm list

NAME    NAMESPACE   REVISION    UPDATED                                 STATUS      CHART           APP VERSION
kafka   default     1           2022-11-14 11:21:36.328956822 -0800 PST deployed    kafka-19.1.3    3.3.1
mongodb default     1           2022-11-14 11:19:36.507690487 -0800 PST deployed    mongodb-12.1.31 5.0.10

Verify that the pods are being created

kubectl get pods

NAME                              READY   STATUS              RESTARTS   AGE
kafka-exporter-595778d7b4-99ztt   0/1     ContainerCreating   0          17s
mongodb-b7c968dbd-jxvsj           0/2     Pending             0          6s
kafka-1                           0/2     ContainerCreating   0          16s
kafka-2                           0/2     ContainerCreating   0          16s
kafka-zookeeper-0                 0/1     Pending             0          17s
kafka-0                           0/2     Pending             0          17s

Use information for each Helm chart and Splunk O11y Data Setup to generate values.yaml for capturing metrics from Kafka and MongoDB.

Note

values.yaml for the different services will be passed to the Splunk Helm Chart at installation time. These will configure the OTEL collector to capture metrics from these services.

References:

4.1 Example kafka.values.yaml

otelAgent:
  config:
    receivers:
      receiver_creator:
        receivers:
          smartagent/kafka:
            rule: type == "pod" && name matches "kafka"
            config:
              #endpoint: '`endpoint`:5555'
              port: 5555
              type: collectd/kafka
              clusterName: sl-kafka
otelK8sClusterReceiver:
  k8sEventsEnabled: true
  config:
    receivers:
      kafkametrics:
        brokers: kafka:9092
        protocol_version: 2.0.0
        scrapers:
          - brokers
          - topics
          - consumers
    service:
      pipelines:
        metrics:
          receivers:
            #- prometheus
            - k8s_cluster
            - kafkametrics
4.2 Example mongodb.values.yaml

otelAgent:
  config:
    receivers:
      receiver_creator:
        receivers:
          smartagent/mongodb:
            rule: type == "pod" && name matches "mongo"
            config:
              type: collectd/mongodb
              host: mongodb.default.svc.cluster.local
              port: 27017
              databases: ["admin", "O11y", "local", "config"]
              sendCollectionMetrics: true
              sendCollectionTopMetrics: true

4.3 Example zookeeper.values.yaml

otelAgent:
  config:
    receivers:
      receiver_creator:
        receivers:
          smartagent/zookeeper:
            rule: type == "pod" && name matches "kafka-zookeeper"
            config:
              type: collectd/zookeeper
              host: kafka-zookeeper
              port: 2181
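Each rule in these files decides which discovered pods get a receiver. As a rough illustration, the Python sketch below assumes that name matches "..." behaves like an unanchored regular-expression search on the pod name; that is an assumption about the rule language, so check the receiver_creator documentation before relying on it. The function name matching_pods is hypothetical, and the pod names are the ones from the kubectl get pods output earlier in this section.

```python
import re

pods = [
    "kafka-exporter-595778d7b4-99ztt",
    "mongodb-b7c968dbd-jxvsj",
    "kafka-1",
    "kafka-2",
    "kafka-zookeeper-0",
    "kafka-0",
]


def matching_pods(pattern: str, names: list) -> list:
    """Return the pod names in which the pattern is found anywhere (unanchored search)."""
    return [name for name in names if re.search(pattern, name)]


for pattern in ("kafka", "mongo", "kafka-zookeeper"):
    print(pattern, "->", matching_pods(pattern, pods))
```

Under that assumption, the "kafka" pattern would also select the kafka-exporter and kafka-zookeeper pods, which is worth keeping in mind when you write rules for your own services.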

5. Install the Splunk OTEL helm chart

cd /home/splunk/realtime_enrichment/otel_yamls/ 

helm repo add splunk-otel-collector-chart https://signalfx.github.io/splunk-otel-collector-chart

helm repo update

helm install --set provider=' ' --set distro=' ' --set splunkObservability.accessToken=$ACCESS_TOKEN --set clusterName=$clusterName --set splunkObservability.realm=$REALM --set otelCollector.enabled='false' --set splunkObservability.logsEnabled='true' --set gateway.enabled='false' --values kafka.values.yaml --values mongodb.values.yaml --values zookeeper.values.yaml --values alwayson.values.yaml --values k3slogs.yaml --generate-name splunk-otel-collector-chart/splunk-otel-collector

6. Verify installation

Verify that the Kafka, MongoDB and Splunk OTEL Collector helm charts are installed. Note that names may differ.

helm list

NAME                                NAMESPACE   REVISION    UPDATED                                 STATUS      CHART                           APP VERSION
kafka                               default     1           2021-12-07 12:48:47.066421971 -0800 PST deployed    kafka-14.4.1                    2.8.1
mongodb                             default     1           2021-12-07 12:49:06.132771625 -0800 PST deployed    mongodb-10.29.2                 4.4.10
splunk-otel-collector-1638910184    default     1           2021-12-07 12:49:45.694013749 -0800 PST deployed    splunk-otel-collector-0.37.1    0.37.1

kubectl get pods

NAME                                                              READY   STATUS    RESTARTS   AGE
kafka-zookeeper-0                                                 1/1     Running   0          18m
kafka-2                                                           2/2     Running   1          18m
mongodb-79cf87987f-gsms8                                          2/2     Running   0          18m
kafka-1                                                           2/2     Running   1          18m
kafka-exporter-7c65fcd646-dvmtv                                   1/1     Running   3          18m
kafka-0                                                           2/2     Running   1          18m
splunk-otel-collector-1638910184-agent-27s5c                      2/2     Running   0          17m
splunk-otel-collector-1638910184-k8s-cluster-receiver-8587qmh9l   1/1     Running   0          17m
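When many pods are starting at once, it can help to filter the kubectl get pods output down to the pods that are not ready yet. The following is a minimal Python sketch that parses the plain-text table; the function name not_ready is an assumption for illustration, and the sample rows are shaped like the outputs shown in this section.

```python
def not_ready(kubectl_output: str) -> list:
    """Return the names of pods whose READY count is incomplete or whose STATUS is not Running."""
    pending = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        name, ready, status = line.split()[:3]
        have, want = ready.split("/")  # READY is shown as <ready containers>/<total containers>
        if have != want or status != "Running":
            pending.append(name)
    return pending


SAMPLE = """\
NAME                              READY   STATUS              RESTARTS   AGE
kafka-exporter-595778d7b4-99ztt   0/1     ContainerCreating   0          17s
kafka-zookeeper-0                 1/1     Running             0          18m
kafka-0                           2/2     Running             1          18m
"""
print(not_ready(SAMPLE))
```

On the lab host you could feed it live output, for example from subprocess.run(["kubectl", "get", "pods"], capture_output=True, text=True).stdout.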

7. Verify dashboards

Verify that out of the box dashboards for Kafka, MongoDB and Zookeeper are populated in the Infrastructure Monitor landing page. Drill down into each component to view granular details for each service.

Tip: You can use the filter k8s.cluster.name with your cluster name to find your instance.

Infrastructure Monitoring Landing page:

K8 Navigator:

MongoDB Dashboard:

Kafka Dashboard:

Code to Kubernetes - Python

Objective: Understand the activities required to instrument a Python application and run it on Kubernetes.

Verify the code
Containerize the app
Deploy the container in Kubernetes

Note: these steps do not involve Splunk

Duration: 15 Minutes

1. Verify the code - Review service

Navigate to the review directory

cd /home/splunk/realtime_enrichment/flask_apps/review/

Inspect review.py (realtime_enrichment/flask_apps/review)

cat review.py

from flask import Flask, jsonify
import random
import subprocess

review = Flask(__name__)
num_reviews = 8635403
num_reviews = 100000
reviews_file = '/var/appdata/yelp_academic_dataset_review.json'

@review.route('/')
def hello_world():
    return jsonify(message='Hello, you want to hit /get_review. We have ' + str(num_reviews) + ' reviews!')

@review.route('/get_review')
def get_review():
    random_review_int = str(random.randint(1,num_reviews))
    line_num = random_review_int + 'q;d'
    command = ["sed", line_num, reviews_file] # sed "7997242q;d" <file>
    random_review = subprocess.run(command, stdout=subprocess.PIPE, text=True)
    return random_review.stdout

if __name__ == "__main__":
    review.run(host ='0.0.0.0', port = 5000, debug = True)
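The get_review route shells out to sed with an expression of the form "Nq;d", which prints line N of the file and then quits. For comparison, here is a minimal sketch of the same lookup in pure Python; the function name read_line is an assumption for illustration and is not part of review.py.

```python
from itertools import islice


def read_line(path: str, line_number: int) -> str:
    """Return line `line_number` (1-based) of the file, or '' if the file is shorter."""
    with open(path, encoding="utf-8") as handle:
        # islice skips the first line_number - 1 lines and stops after the next one,
        # so the rest of the file is never read (like the `q` in the sed expression).
        return next(islice(handle, line_number - 1, line_number), "")
```

Like the sed call, this reads the file sequentially from the top, so fetching a line near the end of a large file is slower than fetching one near the start.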

Inspect requirements.txt

Flask==2.0.2

Create a virtual environment and Install the necessary python packages

cd /home/splunk/realtime_enrichment/workshop/flask_apps_start/review/

pip freeze #note output
pip install -r requirements.txt
pip freeze #note output

Start the REVIEW service. Note: You can stop the app with control+C

python3 review.py

 * Serving Flask app 'review' (lazy loading)
 * Environment: production
         ...snip...
 * Running on http://10.160.145.246:5000/ (Press CTRL+C to quit)
 * Restarting with stat
127.0.0.1 - - [17/May/2022 22:46:38] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [17/May/2022 22:47:02] "GET /get_review HTTP/1.1" 200 -
127.0.0.1 - - [17/May/2022 22:47:58] "GET /get_review HTTP/1.1" 200 -

Verify that the service is working

Open a new terminal and ssh into your ec2 instance. Then use the curl command in your terminal.

curl http://localhost:5000

Or hit the URL http://{Your_EC2_IP_address}:5000 and http://{Your_EC2_IP_address}:5000/get_review with a browser

curl localhost:5000
{
  "message": "Hello, you want to hit /get_review. We have 100000 reviews!"
}

curl localhost:5000/get_review
{"review_id":"NjbiESXotcEdsyTc4EM3fg","user_id":"PR9LAM19rCM_HQiEm5OP5w","business_id":"UAtX7xmIfdd1W2Pebf6NWg","stars":3.0,"useful":0,"funny":0,"cool":0,"text":"-If you're into cheap beer (pitcher of bud-light for $7) decent wings and a good time, this is the place for you. Its generally very packed after work hours and weekends. Don't expect cocktails. \n\n-You run into a lot of sketchy characters here sometimes but for the most part if you're chilling with friends its not that bad. \n\n-Friendly bouncer and bartenders.","date":"2016-04-12 20:23:24"}

Workshop Question

What does this application do?
Do you see the yelp dataset being used?
Why did the output of pip freeze differ each time you ran it?
Which port is the REVIEW app listening on? Can other python apps use this same port?

2. Create a REVIEW container

To create a container image, you need to create a Dockerfile, run docker build to build the image referencing the Dockerfile, and push it up to a remote repository so it can be pulled by other sources.

Create a Dockerfile
Creating a Dockerfile typically requires you to consider the following:
- Identify an appropriate container image
  - ubuntu vs. python vs. alpine/slim
  - ubuntu - overkill, large image size, wasted resources when running in K8
  - this is a python app, so pick an image that is optimized for it
  - avoid alpine for python
- Order matters
  - you’re building layers.
  - re-use the layers as much as possible
  - have items that change often towards the end
- Other Best practices for writing Dockerfiles

Dockerfile for review

FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt /app
RUN pip install -r requirements.txt
COPY ./review.py /app
EXPOSE 5000
CMD [ "python", "review.py" ]

Create a container image (locally)

Run docker build to build a local container image referencing the Dockerfile:

docker build -f Dockerfile -t localhost:8000/review:0.01 .

[+] Building 35.5s (11/11) FINISHED
 => [internal] load build definition from Dockerfile                              0.0s
         ...snip...
 => [3/5] COPY requirements.txt /app                                              0.0s
 => [4/5] RUN pip install -r requirements.txt                                     4.6s
 => [5/5] COPY ./review.py /app                                                   0.0s
 => exporting to image                                                            0.2s
 => => exporting layers                                                           0.2s
 => => writing image sha256:61da27081372723363d0425e0ceb34bbad6e483e698c6fe439c5  0.0s
 => => naming to docker.io/localhost:8000/review:0.1                                   0.0

Push the container image into a container repository

Run docker push to place a copy of the REVIEW container in a remote location:

docker push localhost:8000/review:0.01

The push refers to repository [docker.io/localhost:8000/review]
02c36dfb4867: Pushed
         ...snip...
fd95118eade9: Pushed
0.1: digest: sha256:3651f740abe5635af95d07acd6bcf814e4d025fcc1d9e4af9dee023a9b286f38 size: 2202

Verify that the image is in the registry (in this lab, the local registry at localhost:8000 rather than Docker Hub). The same info can be found in Docker Desktop.

curl -s http://localhost:8000/v2/_catalog

{"repositories":["review"]}
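The catalog response is plain JSON, so the check is easy to script. A minimal Python sketch, assuming the response has the shape shown above; the function name has_repository is hypothetical.

```python
import json


def has_repository(catalog_json: str, name: str) -> bool:
    """Return True if the registry catalog response lists the given repository."""
    return name in json.loads(catalog_json).get("repositories", [])


print(has_repository('{"repositories":["review"]}', "review"))
```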

3. Run REVIEW in Kubernetes

Create K8 deployment yaml file for the REVIEW app

Reference: Creating a Deployment

review.deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: review
  labels:
    app: review
spec:
  replicas: 1
  selector:
    matchLabels:
      app: review
  template:
    metadata:
      labels:
        app: review
    spec:
      imagePullSecrets:
      - name: regcred
      containers:
      - image: localhost:8000/review:0.01
        name: review
        volumeMounts:
        - mountPath: /var/appdata
          name: appdata
      volumes:
      - name: appdata
        hostPath:
          path: /var/appdata

Notes regarding review.deployment.yaml:

- labels - K8 uses labels and selectors to tag and identify resources
  - In the next step, we’ll create a service and associate it to this deployment using the label
- replicas = 1
  - K8 allows you to scale your deployments horizontally
  - We’ll leverage this later to add load and increase our ingestion rate
- regcred provides this deployment with the ability to access your dockerhub credentials, which is necessary to pull the container image.
- The volume definition and volumeMount make the yelp dataset visible to the container

Create a K8 service yaml file for the review app.

Reference: Creating a service:

review.service.yaml

apiVersion: v1
kind: Service
metadata:
  name: review
spec:
  type: NodePort
  selector:
    app: review
  ports:
    - port: 5000
      targetPort: 5000
      nodePort: 30000

Notes about review.service.yaml:

- the selector associates this service to pods with the label app with the value being review
- the review service exposes the review pods as a network service
  - other pods can now ping ‘review’ and they will hit a review pod.
  - a pod would get a review if it ran curl http://review:5000
- NodePort service
  - the service is accessible to the K8 host by the nodePort, 30000
  - another machine that can reach the K8 host can get a review if it ran curl http://<k8 host ip>:30000
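The way a selector picks pods can be pictured as a simple subset check: a pod is selected when its labels contain every key and value in the selector. The Python sketch below illustrates that idea only; the function name selector_matches and the pod names are assumptions for illustration.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """Return True if every key/value pair in the selector is present in the pod's labels."""
    return all(labels.get(key) == value for key, value in selector.items())


service_selector = {"app": "review"}

# Hypothetical pods and their labels.
pods = {
    "review-5d8c7b9f4-abcde": {"app": "review", "pod-template-hash": "5d8c7b9f4"},
    "mongodb-b7c968dbd-jxvsj": {"app.kubernetes.io/name": "mongodb"},
}

selected = [name for name, labels in pods.items() if selector_matches(service_selector, labels)]
print(selected)
```

Extra labels on a pod do not prevent a match, which is why the same app: review label can tie together the deployment, its pods and the service.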

Apply the review deployment and service

kubectl apply -f review.service.yaml -f review.deployment.yaml

Verify that the deployment and services are running:

kubectl get deployments

NAME                                                    READY   UP-TO-DATE   AVAILABLE   AGE
review                                                  1/1     1            1           19h

kubectl get services

NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE
review                     NodePort    10.43.175.21    <none>        5000:30000/TCP                  154d

curl localhost:30000

{
  "message": "Hello, you want to hit /get_review. We have 100000 reviews!"
}

curl localhost:30000/get_review

{"review_id":"Vv9rHtfBrFc-1M1DHRKN9Q","user_id":"EaNqIwKkM7p1bkraKotqrg","business_id":"TA1KUSCu8GkWP9w0rmElxw","stars":3.0,"useful":1,"funny":0,"cool":0,"text":"This is the first time I've actually written a review for Flip, but I've probably been here about 10 times.  \n\nThis used to be where I would take out of town guests who wanted a good, casual, and relatively inexpensive meal.  \n\nI hadn't been for a while, so after a long day in midtown, we decided to head to Flip.  \n\nWe had the fried pickles, onion rings, the gyro burger, their special burger, and split a nutella milkshake.  I have tasted all of the items we ordered previously (with the exception of the special) and have been blown away with how good they were.  My guy had the special which was definitely good, so no complaints there.  The onion rings and the fried pickles were greasier than expected.  Though I've thought they were delicious in the past, I probably wouldn't order either again.  The gyro burger was good, but I could have used a little more sauce.  It almost tasted like all of the ingredients didn't entirely fit together.  Something was definitely off. It was a friday night and they weren't insanely busy, so I'm not sure I would attribute it to the staff not being on their A game...\n\nDon't get me wrong.  Flip is still good.  The wait staff is still amazingly good looking.  They still make delicious milk shakes.  It's just not as amazing as it once was, which really is a little sad.","date":"2010-10-11 18:18:35"}

Workshop Question

What changes are required if you need to make an update to your Dockerfile now?

Instrument REVIEWS for Tracing

1. Use Data Setup to instrument a Python application

Within the O11y Cloud UI:

Data Management -> Add Integration -> Monitor Applications -> Python (traces) -> Add Integration

Provide the following to the Configure Integration Wizard:

Service: review
Django: no
collector endpoint: http://localhost:4317
Environment: rtapp-workshop-[YOURNAME]
Kubernetes: yes
Legacy Agent: no

We are instructed to:

Install the instrumentation packages for your Python environment.

pip install splunk-opentelemetry[all]
  
splunk-py-trace-bootstrap

Configure the Downward API to expose environment variables to Kubernetes resources.
For example, update a Deployment to inject environment variables by adding .spec.template.spec.containers.env like:

apiVersion: apps/v1
kind: Deployment
spec:
  selector:
    matchLabels:
      app: your-application
  template:
    spec:
      containers:
        - name: myapp
          env:
            - name: SPLUNK_OTEL_AGENT
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://$(SPLUNK_OTEL_AGENT):4317"
            - name: OTEL_SERVICE_NAME
              value: "review"
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "deployment.environment=rtapp-workshop-stevel"

Enable the Splunk OTel Python agent by editing your Python service command.
```
splunk-py-trace python3 main.py --port=8000
```
The actions we must perform include:
- Update the Dockerfile to install the splunk-opentelemetry packages
- Update the deployment.yaml for each service to include these environment variables which will be used by the pod and container.
- Update our Dockerfile for REVIEW so that our program is bootstrapped with splunk-py-trace
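The OTEL_EXPORTER_OTLP_ENDPOINT value in the Downward API example relies on Kubernetes expanding the $(SPLUNK_OTEL_AGENT) reference. Kubernetes resolves such references against variables defined earlier in the same env list and leaves unresolved references unchanged. The Python sketch below imitates that behaviour for illustration only; the function name expand_env and the host IP 10.0.0.12 (standing in for status.hostIP) are assumptions.

```python
import re

_REFERENCE = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def expand_env(entries: list) -> dict:
    """Expand $(NAME) references using variables declared earlier in the list."""
    resolved = {}
    for name, value in entries:
        # Unknown names are left as-is, mirroring how unresolved references stay unchanged.
        resolved[name] = _REFERENCE.sub(
            lambda match: resolved.get(match.group(1), match.group(0)), value
        )
    return resolved


env = expand_env([
    ("SPLUNK_OTEL_AGENT", "10.0.0.12"),  # hypothetical node IP injected via status.hostIP
    ("OTEL_EXPORTER_OTLP_ENDPOINT", "http://$(SPLUNK_OTEL_AGENT):4317"),
])
print(env["OTEL_EXPORTER_OTLP_ENDPOINT"])
```

Order matters: if OTEL_EXPORTER_OTLP_ENDPOINT were declared before SPLUNK_OTEL_AGENT, the reference would stay as the literal text $(SPLUNK_OTEL_AGENT).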

Note

We will accomplish this by:

generating a new requirements.txt file
generating a new container image with an updated Dockerfile for REVIEW and then
updating review.deployment.yaml to capture all of these changes.

2. Update the REVIEW container

Generate a new container image

Update the Dockerfile for REVIEW (/workshop/flask_apps_finish/review)

FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt /app
RUN pip install -r requirements.txt
RUN pip install splunk-opentelemetry[all]
RUN splunk-py-trace-bootstrap

COPY ./review.py /app

EXPOSE 5000
ENTRYPOINT [ "splunk-py-trace" ]
CMD [ "python", "review.py" ]

Note

Note that the only lines added to the Dockerfile are the two RUN lines that install and bootstrap splunk-opentelemetry, and the ENTRYPOINT line.

Generate a new container image with docker build in the ‘finished’ directory
Notice that I have changed the repository name from localhost:8000/review:0.01 to localhost:8000/review-splkotel:0.01

Ensure you are in the correct directory.

pwd
./workshop/flask_apps_finish/review

docker build -f Dockerfile.review -t localhost:8000/review-splkotel:0.01 .

[+] Building 27.1s (12/12) FINISHED
=> [internal] load build definition from Dockerfile                                                        0.0s
=> => transferring dockerfile: 364B                                                                        0.0s
=> [internal] load .dockerignore                                                                           0.0s
=> => transferring context: 2B                                                                             0.0s
=> [internal] load metadata for docker.io/library/python:3.10-slim                                         1.6s
=> [auth] library/python:pull token for registry-1.docker.io                                               0.0s
=> [1/6] FROM docker.io/library/python:3.10-slim@sha256:54956d6c929405ff651516d5ebbc204203a6415c9d2757aad  0.0s
=> [internal] load build context                                                                           0.3s
=> => transferring context: 1.91kB                                                                         0.3s
=> CACHED [2/6] WORKDIR /app                                                                               0.0s
=> [3/6] COPY requirements.txt /app                                                                        0.0s
=> [4/6] RUN pip install -r requirements.txt                                                              15.3s
=> [5/6] RUN splk-py-trace-bootstrap                                                                       9.0s
=> [6/6] COPY ./review.py /app                                                                             0.0s
=> exporting to image                                                                                      0.6s
=> => exporting layers                                                                                     0.6s
=> => writing image sha256:164977dd860a17743b8d68bcc50c691082bd3bfb352d1025dc3a54b15d5f4c4d                0.0s
=> => naming to docker.io/localhost:8000/review-splkotel:0.01                                              0.0s

Push the image to the registry with the docker push command

docker push localhost:8000/review-splkotel:0.01

The push refers to repository [docker.io/localhost:8000/review-splkotel]
682f0e550f2c: Pushed
dd7dfa312442: Pushed
917fd8334695: Pushed
e6782d51030d: Pushed
c6b19a64e528: Mounted from localhost:8000/review
8f52e3bfc0ab: Mounted from localhost:8000/review
f90b85785215: Mounted from localhost:8000/review
d5c0beb90ce6: Mounted from localhost:8000/review
3759be374189: Mounted from localhost:8000/review
fd95118eade9: Mounted from localhost:8000/review
0.01: digest: sha256:3b251059724dbb510ea81424fc25ed03554221e09e90ef965438da33af718a45 size: 2412

3. Update the REVIEW deployment in Kubernetes

review.deployment.yaml must be updated with the following changes:
- Load the new container image from Docker Hub
- Add environment variables so traces can be emitted to the OTEL collector

The deployment must be replaced using the updated review.deployment.yaml

Update review.deployment.yaml (the updates are the new image value and the env block)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: review
  labels:
    app: review
spec:
  replicas: 1
  selector:
    matchLabels:
      app: review
  template:
    metadata:
      labels:
        app: review
    spec:
      imagePullSecrets:
        - name: regcred
      containers:
      - image: localhost:8000/review-splkotel:0.01
        name: review
        volumeMounts:
        - mountPath: /var/appdata
          name: appdata
        env:
        - name: SPLUNK_OTEL_AGENT
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: OTEL_SERVICE_NAME
          value: 'review'
        - name: SPLUNK_METRICS_ENDPOINT
          value: "http://$(SPLUNK_OTEL_AGENT):9943"
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://$(SPLUNK_OTEL_AGENT):4317"
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: 'deployment.environment=rtapp-workshop-stevel'
      volumes:
      - name: appdata
        hostPath:
          path: /var/appdata
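
The endpoint values above rely on Kubernetes expanding $(SPLUNK_OTEL_AGENT) from a variable defined earlier in the same env list, which is why SPLUNK_OTEL_AGENT is listed first. The following is a minimal Python sketch of that substitution rule, not the Kubernetes implementation. The node IP is a made-up placeholder, the name pattern is simplified, and the $$( escape is not handled.

```python
import re

# Hypothetical node IP standing in for status.hostIP; the real value is
# injected by Kubernetes when the pod is scheduled.
EXAMPLE_HOST_IP = "10.0.0.12"

REFERENCE = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def expand_env(entries):
    """Resolve $(VAR) references against variables defined earlier in the list."""
    resolved = {}
    for name, value in entries:
        def substitute(match):
            # A reference to an unknown name is left untouched.
            return resolved.get(match.group(1), match.group(0))

        resolved[name] = REFERENCE.sub(substitute, value)
    return resolved


env = expand_env([
    ("SPLUNK_OTEL_AGENT", EXAMPLE_HOST_IP),
    ("OTEL_SERVICE_NAME", "review"),
    ("SPLUNK_METRICS_ENDPOINT", "http://$(SPLUNK_OTEL_AGENT):9943"),
    ("OTEL_EXPORTER_OTLP_ENDPOINT", "http://$(SPLUNK_OTEL_AGENT):4317"),
])
print(env["OTEL_EXPORTER_OTLP_ENDPOINT"])
```

If SPLUNK_OTEL_AGENT were listed after the endpoint variables, the sketch would leave the $(SPLUNK_OTEL_AGENT) text in place, which mirrors why ordering matters in the manifest.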

Apply review.deployment.yaml. Kubernetes will automatically pick up the changes to the deployment and redeploy new pods with these updates.
- Notice that the old review-* pod is terminated and replaced by a new one

kubectl apply -f review.deployment.yaml

kubectl get pods

NAME                                                              READY   STATUS        RESTARTS   AGE
kafka-client                                                      0/1     Unknown       0          155d
curl                                                              0/1     Unknown       0          155d
kafka-zookeeper-0                                                 1/1     Running       0          26h
kafka-2                                                           2/2     Running       0          26h
kafka-exporter-647bddcbfc-h9gp5                                   1/1     Running       2          26h
mongodb-6f6c78c76-kl4vv                                           2/2     Running       0          26h
kafka-1                                                           2/2     Running       1          26h
kafka-0                                                           2/2     Running       1          26h
splunk-otel-collector-1653114277-agent-n4dfn                      2/2     Running       0          26h
splunk-otel-collector-1653114277-k8s-cluster-receiver-5f48v296j   1/1     Running       0          26h
splunk-otel-collector-1653114277-agent-jqxhh                      2/2     Running       0          26h
review-6686859bd7-4pf5d                                           1/1     Running       0          11s
review-5dd8cfd77b-52jbd                                           0/1     Terminating   0          2d10h

kubectl get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
kafka-client                                                      0/1     Unknown   0          155d
curl                                                              0/1     Unknown   0          155d
kafka-zookeeper-0                                                 1/1     Running   0          26h
kafka-2                                                           2/2     Running   0          26h
kafka-exporter-647bddcbfc-h9gp5                                   1/1     Running   2          26h
mongodb-6f6c78c76-kl4vv                                           2/2     Running   0          26h
kafka-1                                                           2/2     Running   1          26h
kafka-0                                                           2/2     Running   1          26h
splunk-otel-collector-1653114277-agent-n4dfn                      2/2     Running   0          26h
splunk-otel-collector-1653114277-k8s-cluster-receiver-5f48v296j   1/1     Running   0          26h
splunk-otel-collector-1653114277-agent-jqxhh                      2/2     Running   0          26h
review-6686859bd7-4pf5d                                           1/1     Running   0          15s
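
The two listings above differ only in whether the old review pod is still terminating. As a rough illustration, the check you perform by eye can be sketched in Python. The function names are hypothetical, and the sample text is a shortened copy of the listings above.

```python
ROLLING = """\
NAME                      READY   STATUS        RESTARTS   AGE
mongodb-6f6c78c76-kl4vv   2/2     Running       0          26h
review-6686859bd7-4pf5d   1/1     Running       0          11s
review-5dd8cfd77b-52jbd   0/1     Terminating   0          2d10h
"""

SETTLED = """\
NAME                      READY   STATUS    RESTARTS   AGE
mongodb-6f6c78c76-kl4vv   2/2     Running   0          26h
review-6686859bd7-4pf5d   1/1     Running   0          15s
"""


def parse_pods(listing):
    """Turn `kubectl get pods` text into dicts keyed by the header row."""
    lines = [line for line in listing.splitlines() if line.strip()]
    header = lines[0].lower().split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def rollout_finished(pods, prefix="review-"):
    """True when exactly one pod with the prefix exists and it is Running and ready."""
    matching = [pod for pod in pods if pod["name"].startswith(prefix)]
    if len(matching) != 1:
        return False
    ready, total = matching[0]["ready"].split("/")
    return matching[0]["status"] == "Running" and ready == total


print(rollout_finished(parse_pods(ROLLING)), rollout_finished(parse_pods(SETTLED)))
```

In practice, kubectl rollout status deployment/review waits for the same condition without any parsing.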

Monitor System Logs with Splunk Universal Forwarder

Objective: Learn how to monitor Linux system logs with the Universal Forwarder sending logs to Splunk Enterprise

Duration: 10 Minutes

Scenario

You’ve been tasked with monitoring the OS logs of the host running your Kubernetes cluster. We are going to utilize a script that will autodeploy the Splunk Universal Forwarder. You will then configure the Universal Forwarder to send logs to the Splunk Enterprise instance assigned to you.

1. Ensure You’re in the Correct Directory

We will need to be in /home/splunk/session-2

cd /home/splunk/session-2

2. Review the Universal Forwarder Install Script

Let’s take a look at the script that will install the Universal Forwarder and Linux TA automatically for you.
- This script is primarily used for remote instances.
- Note that we are not using a deployment server in this lab; however, using one is recommended in production.
- What user are we installing Splunk as?

#!/bin/bash
# This EXAMPLE script shows how to deploy the Splunk universal forwarder
# to many remote hosts via ssh and common Unix commands.
# For "real" use, this script needs ERROR DETECTION AND LOGGING!!
# --Variables that you must set -----
# Set the username that splunkd runs as.
  SPLUNK_RUN_USER="ubuntu"

# Populate this file with a list of hosts that this script should install to,
# with one host per line. This must be specified in the form that should
# be used for the ssh login, ie. username@host
#
# Example file contents:
# splunkuser@10.20.13.4
# splunkker@10.20.13.5
  HOSTS_FILE="myhost.txt"

# This should be a WGET command that was *carefully* copied from splunk.com!!
# Sign into splunk.com and go to the download page, then look for the wget
# link near the top of the page (once you have selected your platform)
# copy and paste your wget command between the ""
  WGET_CMD="wget -O splunkforwarder-9.0.3-dd0128b1f8cd-Linux-x86_64.tgz 'https://download.splunk.com/products/universalforwarder/releases/9.0.3/linux/splunkforwarder-9.0.3-dd0128b1f8cd-Linux-x86_64.tgz'"
# Set the install file name to the name of the file that wget downloads
# (the second argument to wget)
  INSTALL_FILE="splunkforwarder-9.0.3-dd0128b1f8cd-Linux-x86_64.tgz"

# After installation, the forwarder will become a deployment client of this
# host.  Specify the host and management (not web) port of the deployment server
# that will be managing these forwarder instances.
# Example 1.2.3.4:8089
#  DEPLOY_SERVER="x.x.x.x:8089"



# After installation, the forwarder can have additional TAs added to the
# etc/apps directory. Provide the location where the TAs will be.
  TA_INSTALL_DIRECTORY="/home/splunk/session-2"

# Set the seed app folder name for deploymentclient.conf
#  DEPLOY_APP_FOLDER_NAME="seed_all_deploymentclient"
# Set the new Splunk admin password
  PASSWORD="buttercup"

REMOTE_SCRIPT_DEPLOY="
  cd /opt
  sudo $WGET_CMD
  sudo tar xvzf $INSTALL_FILE
  sudo rm $INSTALL_FILE
  #sudo useradd $SPLUNK_RUN_USER
  sudo find $TA_INSTALL_DIRECTORY -name '*.tgz' -exec tar xzvf {} --directory /opt/splunkforwarder/etc/apps \;
  sudo chown -R $SPLUNK_RUN_USER:$SPLUNK_RUN_USER /opt/splunkforwarder
  echo \"[user_info] 
  USERNAME = admin
  PASSWORD = $PASSWORD\" > /opt/splunkforwarder/etc/system/local/user-seed.conf   
  #sudo cp $TA_INSTALL_DIRECTORY/*.tgz /opt/splunkforwader/etc/apps/
  #sudo find /opt/splunkforwarder/etc/apps/ -name '*.tgz' -exec tar xzvf {} \;
  #sudo -u splunk /opt/splunkforwarder/bin/splunk start --accept-license --answer-yes --auto-ports --no-prompt
  /opt/splunkforwarder/bin/splunk start --accept-license --answer-yes --auto-ports --no-prompt
  #sudo /opt/splunkforwarder/bin/splunk enable boot-start -user $SPLUNK_RUN_USER
  /opt/splunkforwarder/bin/splunk enable boot-start -user $SPLUNK_RUN_USER
  #sudo cp $TA_INSTALL_DIRECTORY/*.tgz /opt/splunkforwarder/etc/apps/

  exit
 "

DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"


#===============================================================================================
  echo "In 5 seconds, will run the following script on each remote host:"
  echo
  echo "===================="
  echo "$REMOTE_SCRIPT_DEPLOY"
  echo "===================="
  echo 
  sleep 5
  echo "Reading host logins from $HOSTS_FILE"
  echo
  echo "Starting."
  for DST in `cat "$DIR/$HOSTS_FILE"`; do
    if [ -z "$DST" ]; then
      continue;
    fi
    echo "---------------------------"
    echo "Installing to $DST"
    echo "Initial UF deployment"
    sudo ssh -t "$DST" "$REMOTE_SCRIPT_DEPLOY"
  done  
  echo "---------------------------"
  echo "Done"
  echo "Please use the following app folder name to override deploymentclient.conf options: $DEPLOY_APP_FOLDER_NAME"
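
The script reads one username@host login per line from the hosts file and skips empty entries. A small Python sketch of that parsing step is shown below. It is an illustration rather than part of install.sh, and it adds two checks the script does not perform: lines starting with # are ignored, and entries that are not in username@host form are reported separately. The function name is hypothetical.

```python
import re

# One login per line, in the username@host form the script comments describe.
LOGIN_PATTERN = re.compile(r"[^@\s]+@[^@\s]+")


def read_hosts(text):
    """Split hosts-file text into usable logins and rejected lines."""
    hosts, rejected = [], []
    for raw in text.splitlines():
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue  # blank lines and comments are skipped
        if LOGIN_PATTERN.fullmatch(entry):
            hosts.append(entry)
        else:
            rejected.append(entry)
    return hosts, rejected


sample = "splunkuser@10.20.13.4\n\n# staging box\nlocalhost\nsplunkker@10.20.13.5\n"
print(read_hosts(sample))
```

Validating the list before the loop starts avoids a failed ssh attempt part-way through a long run.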

3. Run the install script

We will run the install script now. You will see some warnings at the end; this is normal. The script is built for use on remote machines, but for today’s lab you will be using localhost.

./install.sh

You will be asked Are you sure you want to continue connecting (yes/no/[fingerprint])? Answer yes.

Enter your ssh password when prompted.

4. Verify installation of the Universal Forwarder

We need to verify that the Splunk Universal Forwarder is installed and running.
- You should see a couple of PIDs returned and a “Splunk is currently running.” message.

/opt/splunkforwarder/bin/splunk status

5. Configure the Universal Forwarder to Send Data to Splunk Enterprise

We can send the data to our Splunk Enterprise environment by entering one line into the CLI.
- Universal Forwarder Config Guide

/opt/splunkforwarder/bin/splunk add forward-server <your_splunk_enterprise_ip>:9997
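
Behind the scenes, forwarding targets live in the forwarder's outputs.conf. The Python sketch below renders a roughly equivalent configuration so you can see what the one-line command stands for. Treat the details as assumptions: the group name default-autolb-group and the exact stanzas the CLI writes may differ between versions, and 192.0.2.10 is a documentation placeholder address.

```python
def render_outputs_conf(indexer, port=9997, group="default-autolb-group"):
    """Render an outputs.conf that forwards all data to a single indexer."""
    target = f"{indexer}:{port}"
    return "\n".join([
        "[tcpout]",
        f"defaultGroup = {group}",  # assumed default group name
        "",
        f"[tcpout:{group}]",
        f"server = {target}",
        "",
    ])


print(render_outputs_conf("192.0.2.10"))
```

For the lab, the CLI command above is all you need; the sketch is only meant to make the resulting configuration less opaque.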

6. Verify the Data in Your Splunk Enterprise Environment

We are now going to take a look at the Splunk Enterprise environment to verify logs are coming in.
- Logs will be coming into index=main
Open your web browser and navigate to: http://<your_splunk_enterprise_ip>:8000
- You will use the credentials admin:<your_ssh_password>
In the search bar, type in the following:
index=main host=<your_host_name>
You should see data from your host. Take note of the interesting fields and the different data sources flowing in.

Splunk OnCall

1 hour 30 minutes Author Geoff Higginbottom

Aim

This module is simply to ensure you have access to the Splunk On-Call UI (formerly known as VictorOps), Splunk Infrastructure Monitoring UI (formerly known as SignalFx) and the EC2 Instance which has been allocated to you.

Once you have access to each platform, keep them open for the duration of the workshop as you will be switching between them and the workshop instructions.

You should have received an invitation to Activate your Splunk On-Call account via e-mail. If you have not already done so, click the Activate Account link and follow the prompts.

If you did not receive an invitation it is probably because you already have a Splunk On-Call login, linked to a different organization.

If so, log in to that Org, then use the organization dropdown next to your username in the top left to switch to the Observability Workshop Org.

Note

If you do not see the Organisation dropdown menu item next to your name with Observability Workshop EMEA, that is OK. It simply means you only have access to a single Org, so that menu is not visible to you.

If you have forgotten your password go to the https://portal.victorops.com/membership/#/ page and use the forgotten password link to reset your password.

You should have received an invitation to join the Splunk Infrastructure Monitoring - Observability Workshop. If you have not already done so click the JOIN NOW button and follow the prompts to set a password and activate your login.

3. Access your EC2 Instance

Splunk has provided you with a dedicated EC2 Instance which you can use during this workshop for triggering Incidents the same way the instructor did during the introductory demo. This VM has Splunk Infrastructure Monitoring deployed and has an associated Detector configured. The Detector will pass Alerts to Splunk On-Call which will then create Incidents and page the on-call user.

The welcome e-mail you received with the details for this Workshop contains the instructions for accessing your allocated EC2 Instance.

SSH (Mac OS/Linux)

Most attendees will be able to connect to the workshop by using SSH from their Mac or Linux device.

To use SSH, open a terminal on your system and type ssh splunk@x.x.x.x (replacing x.x.x.x with the IP address found in your welcome e-mail).

When prompted Are you sure you want to continue connecting (yes/no/[fingerprint])? please type yes.

Enter the password provided in the welcome e-mail.

Upon successful login you will be presented with the Splunk logo and the Linux prompt.

At this point you are ready to continue with the workshop when instructed to do so by the instructor

Putty (Windows users only)

If you do not have ssh pre-installed or if you are on a Windows system, the best option is to install Putty. You can find the downloads here.

Important

If you cannot install Putty, please go to Web Browser (All).

Open Putty and in the Host Name (or IP address) field enter the IP address provided in the welcome e-mail.

You can optionally save your settings by providing a name and pressing Save.

To then login to your instance click on the Open button as shown above.

If this is the first time connecting to your EC2 instance, you will be presented with a security dialogue, please click Yes.

Once connected, log in as splunk using the password provided in the welcome e-mail.

Once you are connected successfully you should see a screen similar to the one below:

At this point you are ready to continue with the workshop when instructed to do so by the instructor

Web Browser (All)

If you are blocked from using SSH (Port 22) or unable to install Putty you may be able to connect to the workshop instance by using a web browser.

Note

This assumes that access to port 6501 is not restricted by your company’s firewall.

Open your web browser and type http://x.x.x.x:6501 (where x.x.x.x is the IP address from the welcome e-mail).

Once connected, log in as splunk using the password provided in the welcome e-mail.

Once you are connected successfully you should see a screen similar to the one below:

Copy & Paste in browser

Unlike when you are using regular SSH, copy and paste does require a few extra steps to complete when using a browser session. This is due to cross browser restrictions.

When the workshop asks you to copy instructions into your terminal, please do the following:

Copy the instruction as normal, but when ready to paste it in the web terminal, choose Paste from browser as shown below:

This will open a dialogue box asking for the text to be pasted into the web terminal:

Paste the text into the text box as shown, then press OK to complete the copy and paste process.

Unlike a regular SSH connection, the web browser session has a 60 second timeout. When it expires you will be disconnected and a Connect button will be shown in the center of the web terminal.

Simply click the Connect button and you will be reconnected and will be able to continue.

At this point you are ready to continue with the workshop when instructed to do so by the instructor

User Profile

Aim

The aim of this module is for you to configure your personal profile which controls how you will be notified by Splunk On-Call whenever you get paged.

1. Contact Methods

Switch to the Splunk On-Call UI, click on your login name in the top right-hand corner and choose Profile from the drop-down. Confirm your contact methods are listed correctly and add any additional phone numbers and e-mail addresses you wish to use.

2. Mobile Devices

To install the Splunk On-Call app for your smartphone, search your phone’s App Store for Splunk On-Call to find the appropriate version of the app. The publisher should be listed as VictorOps Inc.

Apple Store

Google Play

Configuration help guides are available:

Install the App and login, then refresh the Profile page and your device should now be listed under the devices section. Click the Test push notification button and confirm you receive the test message.

3. Personal Calendar

This link will enable you to sync your on-call schedule with your calendar, however as you do not have any allocated shifts yet this will currently be empty. You can add it to your calendar by copying the link into your preferred application and setting it up as a new subscription.

4. Paging Policies

Paging Policies specify how you will be contacted when on-call. The Primary Paging Policy will have defaulted to sending you an SMS, assuming you added your phone number when activating your account. We will now configure this policy into a three-tier multi-stage policy similar to the image below.

4.1 Send a push notification

Click the edit policy button in the top right corner for the Primary Paging Policy.

Send a push notification to all my devices
Execute the next step if I have not responded within 5 minutes

Click Add a Step

4.2 Send an e-mail

Send an e-mail to [your email address]
Execute the next step if I have not responded within 5 minutes

Click Add a Step

4.3 Call your number

Every 5 minutes until we have reached you
Make a phone call to [your phone number]

Click Save to save the policy.

When you are on-call or in the escalation path of an incident, you will receive notifications in this order following these time delays.
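
To make the timing concrete, here is a small Python sketch of the three-step policy configured above. It is a simplified model, not On-Call's scheduler: it assumes the push fires at minute 0, the e-mail at minute 5, and the first phone call at minute 10, repeating every 5 minutes until you acknowledge. The function name is hypothetical.

```python
def paging_timeline(minutes_unacknowledged):
    """Return (minute, action) pairs fired while an incident stays unacknowledged."""
    events = [(0, "push notification")]
    if minutes_unacknowledged >= 5:
        events.append((5, "e-mail"))
    minute = 10
    while minute <= minutes_unacknowledged:
        events.append((minute, "phone call"))  # repeats every 5 minutes
        minute += 5
    return events


for minute, action in paging_timeline(20):
    print(f"+{minute:>2} min: {action}")
```

Acknowledging the incident at any point stops the sequence, so in this model an acknowledgement after 7 minutes means the phone never rings.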

To cease the paging you must acknowledge the incident. Acknowledgements can occur in one of the following ways:

Expanding the Push Notification on your device and selecting Acknowledge
Responding to the SMS with the 5 digit code included
Pressing 4 during the Phone Call
Slack Button

For more information on Notification Types, see here.

5. Custom Paging Policies

Custom paging policies enable you to override the primary policy based on the time and day of the week. A good example would be to have the system immediately phone you whenever you get a page during the evening or at weekends, as this is more likely to get your attention than a push notification.

Create a new Custom Policy by clicking Add a Policy and configure with the following settings:

5.1 Custom evening policy

Policy Name: Evening

Every 5 minutes until we have reached you
- Make a phone call to [your phone number]
- Time Period: All 7 Days
- Time zone
  - Between 7pm and 9am

Click Save to save the policy then add one more.

5.2 Custom weekend policy

Policy Name: Weekend

Every 5 minutes until we have reached you
- Make a phone call to [your phone number]
- Time Period: Sat & Sun
- Time zone
  - Between 9am and 7pm

Click Save to save the policy.

These custom paging policies will be used during the specified times in place of the Primary Policy. However, admins do have the ability to ignore these custom policies, and we will highlight how this is achieved in a later module.
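
The following Python sketch shows how the two custom windows and the Primary Policy fit together for a given local time. It models only the example settings above (Evening: every day from 7pm to 9am, Weekend: Saturday and Sunday from 9am to 7pm). The boundary handling, with 7pm counted as evening and 9am as daytime, is an assumption.

```python
from datetime import datetime


def active_policy(local_time):
    """Pick the paging policy that applies at a local date and time."""
    hour = local_time.hour
    if hour >= 19 or hour < 9:
        return "Evening"  # the window wraps around midnight
    if local_time.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return "Weekend"
    return "Primary"


print(active_policy(datetime(2024, 1, 6, 10, 0)))  # a Saturday morning
```

Note that the evening window has to be tested as two half-open ranges because it spans midnight; a single "between 19 and 9" comparison would never match.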

The final option here is the setting for Recovery Notifications. These are typically low priority and will default to Push, but can also be e-mail, SMS or phone call. Your profile is now fully configured using these example configurations.

Organizations will have different views on how profiles should be configured and will typically issue guidelines for paging policies and times between escalations etc.

Please wait for the instructor before proceeding to the Teams module.

Teams

Aim

The aim of this module is for you to complete the first step of Team configuration by adding users to your Team.

1. Find your Team

Navigate to the Teams tab on the main toolbar. You should find that a Team has been created for you as part of the workshop pre-setup, and you will have been informed of your Team Name via e-mail.

If you have found your pre-configured Team, skip Step 2. and proceed to Step 3. Configure Your Team. However, if you cannot find your allocated Team, you will need to create a new one, so proceed with Step 2. Create Team

2. Create Team

Only complete this step if you cannot find your pre-allocated Team as detailed in your workshop e-mail. Select Add Team, then enter your allocated team name (this will typically be in the format “AttendeeID Workshop”) and save by clicking the Add Team button.

3. Configure Your Team

You now need to add other users to your team. If you are running this workshop using the Splunk provided environment, the following accounts are available for testing. If you are running this lab in your own environment, you will have been provided a list of usernames you can use in place of the table below.

These users are dummy accounts who will not receive notifications when they are on call.

Name	Username	Shift
Duane Chow	duanechow	Europe
Steven Gomez	gomez	Europe
Walter White	heisenberg	Europe
Jim Halpert	jimhalpert	Asia
Lydia Rodarte-Quayle	lydia	Asia
Marie Schrader	marie	Asia
Maximo Arciniega	maximo	West Coast
Michael Scott	michaelscott	West Coast
Tuco Salamanca	tuco	West Coast
Jack Welker	jackwelker	24/7
Hank Schrader	hank	24/7
Pam Beesly	pambeesly	24/7

Add the users to your team, using either the above list or the alternate one provided to you. The value in the Shift column can be ignored for now, but will be required for a later step.

Click the Invite User button on the right-hand side, then either start typing the usernames (this will filter the list), or copy and paste them into the dialogue box. Once all users are added to the list click the Add User button.

To make a team member a Team Admin, simply click the edit icon in the right-hand column, pick any user and make them an Admin.

Tip

For large team management you can use the APIs to streamline this process.
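
As a hedged illustration of what that could look like, the Python sketch below builds (but does not send) one request per username. The base URL, the /team/{team}/members path, and the X-VO-Api-Id and X-VO-Api-Key headers are assumptions based on the Splunk On-Call public API and should be checked against the current API reference. The team slug and credentials are placeholders.

```python
import json
import urllib.request

# Assumed base URL of the Splunk On-Call (VictorOps) public API.
API_BASE = "https://api.victorops.com/api-public/v1"


def build_add_member_request(team_slug, username, api_id, api_key):
    """Build a POST request that would add one user to a team."""
    body = json.dumps({"username": username}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/team/{team_slug}/members",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-VO-Api-Id": api_id,    # placeholder credential
            "X-VO-Api-Key": api_key,  # placeholder credential
        },
    )


usernames = ["duanechow", "gomez", "heisenberg"]
requests_to_send = [
    build_add_member_request("team-placeholder", name, "my-api-id", "my-api-key")
    for name in usernames
]
print([req.full_url for req in requests_to_send])
```

Passing each request to urllib.request.urlopen would send it; that step is left out here so the sketch can be run without credentials.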

Continue and also complete the Configure Rotations module.

Configure Rotations

Aim

A rotation is a recurring schedule, that consists of one or more shifts, with members who rotate through a shift.

The aim of this module is for you to configure two example Rotations, and assign Team Members to the Rotations.

Navigate to the Rotations tab on the Teams sub menu. You should have no existing Rotations, so we need to create some.

The 1st Rotation you will create is for a follow the sun support pattern where the members of each shift provide cover during their normal working hours within their time zone.

The 2nd will be a Rotation used to provide escalation support by more experienced senior members of the team, based on a 24/7, 1 week shift pattern.

1. Follow the Sun Support - Business Hours

Click Add Rotation

Enter a name of “Follow the Sun Support - Business Hours” and Select Partial day from the three available shift templates.

Enter a Shift name of “Asia”
Time Zone set to “Asia/Tokyo”
Each user is on duty from “Monday through Friday from 9.00am to 5.00pm”
Handoff happens every “5 days”
The next handoff happens - Select the next Monday using the calendar
Click Save Rotation

You will now be prompted to add Members to this shift; add the Asia members who are Jim Halpert, Lydia Rodarte-Quayle and Marie Schrader, but only if you’re using the Splunk provided environment for this workshop.

If you’re using your own Organisation refer to the specific list provided separately.

Now add a 2nd shift for Europe by again clicking +Add a shift → Partial Day

Enter a Shift name of “Europe”
Time Zone set to “Europe/London”
Each user is on duty from “Monday through Friday from 9.00am to 5.00pm”
Handoff happens every “5 days”
The next handoff happens - Select the next Monday using the calendar
Click Save Shift

You will again be prompted to add Members to this shift; add the Europe members who are Duane Chow, Steven Gomez and Walter White, but only if you’re using the Observability Workshop Org for this workshop.

If you’re using your own Organisation refer to the specific list provided separately.

Now add a 3rd shift for West Coast USA by again clicking +Add a shift → Partial Day

Enter a Shift name of “West Coast”
Time Zone set to “US/Pacific”
Each user is on duty from “Monday through Friday from 9.00am to 5.00pm”
Handoff happens every “5 days”
The next handoff happens - Select the next Monday using the calendar
Click Save Shift

You will again be prompted to add Members to this shift; add the West Coast members who are Maximo Arciniega, Michael Scott and Tuco Salamanca, but only if you’re using the Observability Workshop Org for this workshop.

If you’re using your own Organisation refer to the specific list provided separately.

The first user added will be the ‘current’ user for that shift.

You can re-order the shifts by simply dragging the users up and down, and you can change the current user by clicking Set Current on an alternate user

You will now have three different Shift patterns that provide 24-hour cover, Mon - Fri, but with no cover at weekends.
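
To see how the three shifts hand over to each other, the Python sketch below reports which shift is on duty at a given UTC time. It is a simplification: it uses fixed UTC offsets (+9, 0 and -8 hours) instead of the real Asia/Tokyo, Europe/London and US/Pacific zones, so daylight saving time is ignored.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets used as stand-ins for the three time zones (no DST handling).
SHIFTS = {
    "Asia": timezone(timedelta(hours=9)),
    "Europe": timezone(timedelta(hours=0)),
    "West Coast": timezone(timedelta(hours=-8)),
}


def shifts_on_duty(moment_utc):
    """Return the shifts whose local time is Monday to Friday, 9am to 5pm."""
    on_duty = []
    for name, zone in SHIFTS.items():
        local = moment_utc.astimezone(zone)
        if local.weekday() < 5 and 9 <= local.hour < 17:
            on_duty.append(name)
    return on_duty


wednesday_noon = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
print(shifts_on_duty(wednesday_noon))
```

Stepping through a day hour by hour with this function is a quick way to spot any brief overlap or gap around the handover times for your own choice of time zones.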

We will now add another Rotation for our Senior SRE Escalation cover.

2. Senior SRE Escalation

Click Add Rotation
Enter a name of “Senior SRE Escalation”
Select 24/7 from the three available shift templates
Enter a Shift name of “Senior SRE Escalation”
Time Zone set to “Asia/Tokyo”
Handoff happens every “7 days at 9.00am”
The next handoff happens [select the next Monday from the date picker]
Click Save Rotation

You will again be prompted to add Members to this shift; add the 24/7 members who are Jack Welker, Hank Schrader and Pam Beesly, but only if you’re using the Observability Workshop Org for this workshop.

If you’re using your own Organisation refer to the specific list provided separately.
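
The weekly handoff can be pictured with a short Python sketch. It assumes a simple model: the first member added is on call until the next handoff, after which the rotation advances by one member every 7 days. The handoff date is an arbitrary example Monday, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def on_call_member(members, next_handoff, moment, handoff_every=timedelta(days=7)):
    """Return who is on call at `moment` for a rotation that advances at each handoff."""
    if moment < next_handoff:
        return members[0]  # the 'current' user holds the shift until the first handoff
    handoffs_done = (moment - next_handoff) // handoff_every + 1
    return members[handoffs_done % len(members)]


seniors = ["Jack Welker", "Hank Schrader", "Pam Beesly"]
first_handoff = datetime(2024, 1, 8, 9, 0)  # example: a Monday at 9.00am

print(on_call_member(seniors, first_handoff, datetime(2024, 1, 10, 14, 0)))
```

With three members and a 7 day handoff, each person is on call one week in three.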

Please wait for the instructor before proceeding to the Configuring Escalation Policies module.

Configure Escalation Policies

Aim

Escalation policies determine who is actually on-call for a given team and are the link to utilizing any rotations that have been created.

The aim of this module is for you to create three different Escalation Policies to demonstrate a number of different features and operating models.

The instructor will start by explaining the concepts before you proceed with the configuration.

Navigate to the Escalation Policies tab on the Teams sub menu. You should have no existing Policies, so we need to create some.

We are going to create the following Policies to cover three typical use cases.

1. 24/7 Policy

Click Add Escalation Policy

Policy Name: 24/7
Step 1
Immediately
- Notify the on-duty user(s) in rotation → Senior SRE Escalation
- Click Save

2. Primary Policy

Click Add Escalation Policy

Policy Name: Primary
Step 1
Immediately
Notify the on-duty user(s) in rotation → Follow the Sun Support - Business Hours
Click Add Step

Step 2
If still un-acknowledged after 15 minutes
Notify the next user(s) in the current on-duty shift → Follow the Sun Support - Business Hours
Click Add Step

Step 3
If still un-acknowledged after 15 more minutes
Execute Policy → [Your Team Name] : 24/7
Click Save

3. Waiting Room Policy

Click Add Escalation Policy

Policy Name: Waiting Room
Step 1
If still un-acknowledged after 10 more minutes
Execute Policy → [Your Team Name] : Primary
Click Save
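
Because the Waiting Room policy executes Primary, which in turn executes 24/7, it can help to see the combined timeline. The Python sketch below models the three policies as lists of (delay in minutes, action) steps and flattens the chain. It is an illustrative model of the settings above, not how On-Call evaluates policies internally.

```python
# Each step is (minutes to wait after the previous step, action).
# A tuple action ("execute", name) hands over to another policy.
POLICIES = {
    "24/7": [(0, "notify on-duty user(s): Senior SRE Escalation")],
    "Primary": [
        (0, "notify on-duty user(s): Follow the Sun Support - Business Hours"),
        (15, "notify next user(s): Follow the Sun Support - Business Hours"),
        (15, ("execute", "24/7")),
    ],
    "Waiting Room": [(10, ("execute", "Primary"))],
}


def escalation_timeline(policy, start=0):
    """Flatten a policy, following Execute Policy steps, into (minute, action) pairs."""
    timeline, clock = [], start
    for delay, action in POLICIES[policy]:
        clock += delay
        if isinstance(action, tuple):
            timeline.extend(escalation_timeline(action[1], clock))
        else:
            timeline.append((clock, action))
    return timeline


for minute, action in escalation_timeline("Waiting Room"):
    print(f"+{minute} min: {action}")
```

In this model an incident routed to the Waiting Room policy pages nobody for the first 10 minutes, which gives a self-recovering alert time to clear before anyone is notified.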

You should now have the following three escalation policies:

You may have noticed that when we created each policy there was the following warning message:

Warning

There are no routing keys for this policy - it will only receive incidents via manual reroute or when on another escalation policy

This is because there are no Routing Keys linked to these Escalation Policies, so now that we have these policies configured we can create the Routing Keys and link them to our Policies.

Continue and also complete the Creating Routing Keys module.

Creating Routing Keys

Aim

Routing Keys map the incoming alert messages from your monitoring system to an Escalation Policy which in turn sends the notifications to the appropriate team.

Note that routing keys are case insensitive and should only be composed of letters, numbers, hyphens, and underscores.
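
The naming rule above can be expressed as a short Python check. This is a sketch of the stated rule only (letters, numbers, hyphens and underscores, compared without regard to case); the function names are hypothetical and On-Call may apply further limits that are not covered here.

```python
import re

# Letters, numbers, hyphens and underscores only.
ROUTING_KEY_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_routing_key(key):
    """True when the key contains only the allowed characters."""
    return ROUTING_KEY_PATTERN.fullmatch(key) is not None


def same_routing_key(first, second):
    """Routing keys are case insensitive, so compare them lower-cased."""
    return first.lower() == second.lower()


print(is_valid_routing_key("zevn_PRI"), same_routing_key("zevn_PRI", "ZEVN_pri"))
```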

The aim of this module is for you to create some routing keys and then link them to your Escalation Policies you have created in the previous exercise.

1. Instance ID

Each participant requires a unique Routing Key so we use the Hostname of the EC2 Instance you were allocated. We are only doing this to ensure your Routing Key is unique and we know all Hostnames are unique. In a production deployment the Routing Key would typically reflect the name of a System or Service being monitored, or a Team such as 1st Line Support etc.

Your welcome e-mail informed you of the details of your EC2 Instance that has been provided for you to use during this workshop and you should have logged into this as part of the 1st exercise.

The e-mail also contained the Hostname of the Instance, but you can also obtain it from the Instance directly. To get your Hostname from within the shell session connected to your Instance run the following command:

echo ${HOSTNAME}

zevn

It is very important that when creating the Routing Keys you use the 4 letter hostname allocated to you as a Detector has been configured within Splunk Infrastructure Monitoring using this hostname, so any deviation will cause future exercises to fail.

2. Create Routing Keys

Navigate to Settings on the main menu bar. You should now be at the Routing Keys page.

You are going to create the following two Routing Keys using the naming conventions listed in the following table, replacing HOSTNAME with the value from above and TEAM_NAME with the team you were allocated or created earlier.

Routing Key	Escalation Policies
`HOSTNAME`_PRI	`TEAM_NAME` : Primary
`HOSTNAME`_WR	`TEAM_NAME` : Waiting Room
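
As a quick way to double-check the names before typing them into the UI, the table can be generated with a few lines of Python. The hostname zevn is the example shown earlier, and the team name is a made-up placeholder in the "AttendeeID Workshop" format.

```python
def workshop_routing_keys(hostname, team_name):
    """Map each routing key to the escalation policy it should be linked to."""
    return {
        f"{hostname}_PRI": f"{team_name} : Primary",
        f"{hostname}_WR": f"{team_name} : Waiting Room",
    }


# "zevn Workshop" is a hypothetical team name used for illustration.
for key, policy in workshop_routing_keys("zevn", "zevn Workshop").items():
    print(f"{key}\t{policy}")
```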

There will probably already be a number of Routing Keys configured, but to add a new one simply scroll to the bottom of the page and then click Add Key

In the left-hand box, in the Routing Key column, enter the name for the key as per the table above. Then select your Team’s Primary policy from the drop-down in the Escalation Policies column. You can start typing your Team Name to filter the results.

Note

If there are a large number of participants in the workshop, resulting in an unusually large number of Escalation Policies, the search filter sometimes does not list all the Policies under your Team Name. If this happens, instead of using the search feature, simply scroll down to your team name; all the policies will then be listed.

Repeat the above steps for both Keys, xxxx_PRI and xxxx_WR, mapping them to your Team’s Primary and Waiting Room policies.

You should now have two Routing Keys configured, similar to the following:

Tip

You can assign a Routing Key to multiple Escalation Policies if required by simply selecting more from the list

If you now navigate back to Teams → [Your Team Name] → Escalation Policies and look at the settings for your Primary and Waiting Room policies you will see that these now have Routes assigned to them.

The 24/7 policy does not have a Route assigned as this will only be triggered via an Execute Policy escalation from the Primary policy.

Please wait for the instructor before proceeding to the Incident Lifecycle/Overview module.

Incident Lifecycle

Aim

The aim of this module is for you to get more familiar with the Timeline Tab and the filtering features.

1. Timeline

The aim of Splunk On-Call is to make being on call more bearable, and it does this by getting the critical data, to the right people, at the right time.

The key to making it work for you is to centralize all your alerting sources, sending them all to the Splunk On-Call platform, then you have a single pane of glass in which to manage all of your alerting.

Log in to the Splunk On-Call UI and select the Timeline tab on the main menu bar. You should see a screen similar to the following image:

2. People

On the left we have the People section with the Teams and Users sub tabs. On the Teams tab, click on All Teams then expand [Your Team name].

Users with the Splunk On-Call Logo against their name are currently on call. Here you can see who is on call within a particular Team, or across all Teams via Users → On-Call.

If you click into one of the currently on call users, you can see their status. It shows which Rotation they are on call for, when their current Shift ends and their next Shift starts (times are displayed in your time zone), what contact methods they have and which Teams they belong to (dummy users such as Hank do not have Contact Methods configured).

3. Timeline

In the centre Timeline section you get a real-time view of what is happening within your environment, with the newest messages at the top. Here you can quickly post update messages to make your colleagues aware of important developments.

You can filter the view using the buttons on the top toolbar showing only update messages, GitHub integrations, or apply more advanced filters.

Let’s change the Filters settings to streamline your view. Click the Filters button, then within the Routing Keys tab change the Show setting from all routing keys to selected routing keys. Change the My Keys value to all and the Other Keys value to selected, and deselect all keys under the Other Keys section.

Click anywhere outside of the dialogue box to close it.

You will probably now have a much simpler view as you will not currently have Incidents created using your Routing Keys, so you are left with the other types of messages that the Timeline can display.

Click on Filters again, but this time switch to the Message Types tab. Here you control the types of messages that are displayed.

For example, deselect On-call Changes and Escalations. This will reduce the number of messages displayed.

4. Incidents

On the right we have the Incidents section. Here we get a list of all the incidents within the platform, or we can view a more specific list such as incidents you are specifically assigned to, or for any of the Teams you are a member of.

Select the Team Incidents tab. You should find that the Triggered, Acknowledged and Resolved tabs are all currently empty, as you have had no incidents logged.

Let’s change that by generating your first incident!

Continue with the Create Incidents module.

Create Incidents

Aim

The aim of this module is for you to place yourself ‘On-Call’ then generate an Incident using the supplied EC2 Instance so you can then work through the lifecycle of an Incident.

1. On-Call

Before generating any incidents you should assign yourself to the current Shift within your Follow the Sun Support - Business Hours Rotation and also place yourself On-Call.

Click on the Schedule link within your Team in the People section on the left, or navigate to Teams → [Your Team] → Rotations
Expand the Follow the Sun Support - Business Hours Rotation
Click on the Manage members icon (the figures) for the current active shift depending on your time zone
Use the Select a user to add… dropdown to add yourself to the shift
Then click on Set Current next to your name to make yourself the current on-call user within the shift
You should now get a Push Notification to your phone informing you that You Are Now On-Call

2. Trigger Alert

Switch back to your shell session connected to your EC2 Instance; all of the following commands will be executed from your Instance.

Force the CPU to spike to 100% by running the following command:

openssl speed -multi $(grep -ci processor /proc/cpuinfo)

Forked child 0
+DT:md4:3:16
+R:19357020:md4:3.000000
+DT:md4:3:64
+R:14706608:md4:3.010000
+DT:md4:3:256
+R:8262960:md4:3.000000
+DT:md4:3:1024
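The `$(grep -ci processor /proc/cpuinfo)` part of the command tells `openssl speed` how many worker processes to fork, one per CPU. Because the match is case-insensitive and not anchored, on some machines it can also count lines such as a model name ending in "Processor", which would only mean a few extra workers are forked. If you want to check the CPU count yourself, the following sketch shows one way to do it, assuming a Linux host with either `nproc` (GNU coreutils) or `/proc/cpuinfo` available:

```shell
# Count logical CPUs: prefer nproc, fall back to counting only the
# lines in /proc/cpuinfo that start with "processor".
cores=$(nproc 2>/dev/null || grep -c '^processor' /proc/cpuinfo)
echo "openssl speed -multi would fork ${cores} worker process(es)"
```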

This will result in an Alert being generated by Splunk Infrastructure Monitoring, which in turn will generate an Incident within Splunk On-Call. This should take no more than 10 seconds, which is the default polling interval for the OpenTelemetry Collector installed on your instance (note that it can be reduced to 1 second).

Continue with the Manage Incidents module.

Manage Incidents

1. Acknowledge

Use your Splunk On-Call App on your phone to acknowledge the Incident by clicking on the push notification …

…to open the alert in the Splunk On-Call mobile app, then clicking on either the single tick in the top right hand corner, or the Acknowledge link to acknowledge the incident and stop the escalation process.

The single tick will then transform into a double tick, and the status will change from TRIGGERED to ACKNOWLEDGED.

Triggered Incident	Acknowledge Incident

2. Details and Annotations

Still on your phone, select the Alert Details tab. Then on the Web UI, navigate back to Timeline, select Team Incidents on the right, select Acknowledged and click into the new Incident. This will open up the War Room Dashboard view of the Incident.

You should now have the Details tab displayed on both your Phone and the Web UI. Notice how they both show the exact same information.

Now select the Annotations tab on both the Phone and the Web UI. In the UI you should see a Graph which is generated by Splunk Infrastructure Monitoring.

On your phone you should get the same image displayed (sometimes it’s a simple hyperlink, depending on the image size).

Splunk On-Call is a ‘Mobile First’ platform, meaning the phone app is fully functional and you can manage an incident directly from your phone.

For the remainder of this module we will focus on the Web UI; however, please spend some time later exploring the phone app features.

3. Link to Alerting System

Sticking with the Web UI, click the 2. Alert Details in SignalFx link.

This will open a new browser tab and take you directly to the Alert within Splunk Infrastructure Monitoring where you could then progress your troubleshooting using the powerful tools built into its UI.

However, we are focussing on Splunk On-Call, so close this tab and return to the Splunk On-Call UI.

4. Similar Incidents

What if Splunk On-Call could identify previous incidents within the system which may give you a clue to the best way to tackle this incident?

The Similar Incidents tab does exactly that, surfacing previous incidents allowing you to look at them and see what actions were taken to resolve them, actions which could be easily repeated for this incident.

5. Timeline

On the right we have a Timeline view where you can add messages and see the history of previous alerts and interactions.

6. Add Responders

On the far left you have the option of allocating additional resources to this incident by clicking on the Add Responders link.

This allows you to build a virtual team specific to this incident by adding other Teams or individual Users, and also to share details of a Conference Bridge where you can all get together and collaborate.

Once the system has built up some incident data history, it will use Machine Learning to suggest Teams and Users who have historically worked on similar incidents, as they may be best placed to help resolve this incident quickly.

You can select different Teams and/or Users and also choose from a pre-configured conference bridge, or populate the details of a new bridge from your preferred provider.

We do not need to add any Responders in this exercise so close the Add Responders dialogue by clicking Cancel.

7. Reroute

If you decide that the incident would be better dealt with by a different Team, it can be rerouted by clicking the Reroute button at the top of the left-hand panel.

In a similar way to the Add Responders dialogue, you can select the Teams or Users to reroute the Incident to.

We do not need to actually Reroute in this exercise so close the Reroute Incident dialogue by clicking Cancel.

8. Snooze

You can also snooze this incident by clicking on the alarm clock button at the top of the left-hand panel.

You can enter an amount of time up to 24 hours to snooze the incident. This action will be tracked in the Timeline, and when the time expires the paging will restart.

This is useful for low priority incidents, enabling you to put them on a back burner for a few hours, but it ensures they do not get forgotten thanks to the paging process starting again.

We do not need to actually Snooze in this exercise so close the Snooze Incident dialogue by clicking Cancel.

9. Action Tracking

Now let’s fix this issue and update the Incident with what we did. Add a new message at the top of the right-hand panel, such as Discovered rogue process, terminated it.

All the actions related to the Incident will be recorded here, and can then be summarized in a Post Incident Review Report available from the Reports tab.

10. Resolution

Now kill off the process we started in the VM to max out the CPU by switching back to the shell session for the VM and pressing Ctrl+C.
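If Ctrl+C is not an option, for example because the original shell session was disconnected, the load can also be stopped from a second session on the same instance. The following is a minimal sketch, assuming `pkill` (from the procps package) is installed. The `stop_cpu_load` function is a hypothetical helper, and it stops every process named exactly `openssl` that you own, so only use it on the workshop instance.

```shell
# Stop the CPU load started earlier with "openssl speed".
# pkill -x matches the exact process name; it exits non-zero when
# nothing matched, which is reported here as "already stopped".
stop_cpu_load() {
  pkill -x openssl || echo "no openssl process found"
}
```

Call `stop_cpu_load` from the second session; the output of the original command should then stop scrolling.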

Within no more than 10 seconds Splunk Infrastructure Monitoring should detect the new CPU value, clear its alert state, and then automatically update the Incident in Splunk On-Call, marking it as Resolved.

As we have a two-way integration between Splunk Infrastructure Monitoring and Splunk On-Call, we could also have marked the incident as Resolved in Splunk On-Call, and this would have resulted in the alert in Splunk Infrastructure Monitoring being resolved as well.

That completes this introduction to Splunk On-Call!

Unsupported Field Workshops

Subsections of Unsupported Field Workshops

Splunk IM

Subsections of Splunk IM

How to connect to your workshop environment

1. AWS/EC2 IP Address

2. SSH (Mac OS/Linux)

3. SSH (Windows 10 and above)

4. Putty (For Windows Versions prior to Windows 10)

5. Web Browser (All)

6. Multipass (All)

Deploying the OpenTelemetry Collector in Kubernetes

1. Installation using Helm

2. Validate metrics in the UI

Subsections of Deploying the OpenTelemetry Collector in Kubernetes

Deploying NGINX in K3s

1. Start your NGINX

2. Create NGINX deployment

3. Run Locust load test

Working with Dashboards

1. Dashboards

2. Your Teams’ Page

3. Sample Charts

4. Inspecting the Sample Data

Subsections of Working with Dashboards

Editing charts

1. Editing a chart

2. Changing the time window

3. Viewing the Data Table

Saving charts

1. Saving a chart

2. Creating a dashboard

3. Add to Team page

Using Filters & Formulas

1. Creating a new chart

2. Filtering and Analytics

3. Using Timeshift analytical function

4. Using Formulas

SignalFlow

1. Introduction

2. View SignalFlow

Adding charts to dashboards

1. Save to existing dashboard

2. Copy and Paste a chart

3. Edit the pasted chart

Adding Notes and Dashboard Layout

1. Adding Notes

2. Saving our chart

3. Ordering & sizing of charts

Working with Detectors

1. Introduction

2. Creating a Detector

3. Setting Alert condition

4. Alert pre-flight check

5. Alert message

6. Alert Activation

Subsections of Working with Detectors

Working with Muting Rules

1. Configuring Muting Rules

2. Resuming notifications

Monitoring as Code

1. Initial setup

2. Create execution plan

3. Apply execution plan

4. Destroy all your hard work

Service Bureau

1. Understanding engagement

2. Subscription Usage

3. Understanding usage

4. Examine usage in detail

Subsections of Service Bureau

Teams

1. Introduction to Teams

2. Creating a new Team

3. Adding Notification Rules

3.1 Adding recipients

3.2 Notification Integrations

Controlling Usage

1. Access Tokens

2. Creating a new token