4. Building In Resilience

10 minutes  

The OpenTelemetry Collector’s FileStorage Extension enhances the resilience of your telemetry pipeline by providing reliable checkpointing, managing retries, and handling temporary failures effectively.

With this extension enabled, the OpenTelemetry Collector can store intermediate states on disk, preventing data loss during network disruptions and allowing it to resume operations seamlessly.

Note

This solution will work for metrics as long as the connection downtime is brief—up to 15 minutes. If the downtime exceeds this, Splunk Observability Cloud will drop data due to datapoints being out of order.

For logs, there are plans to implement a more enterprise-ready solution in one of the upcoming Splunk OpenTelemetry Collector releases.

Exercise
  • Inside the [WORKSHOP] directory, create a new subdirectory named 4-resilience.
  • Next, copy all contents from the 3-filelog directory into 4-resilience.
  • After copying, remove any *.out and *.log files.
  • Change all terminal windows to the [WORKSHOP]/4-reslilience directory.

Your updated directory structure will now look like this:

WORKSHOP
├── 1-agent
├── 2-gateway
├── 3-filelog
├── 4-resilience
│   ├── agent.yaml
│   ├── gateway.yaml
│   ├── log-gen.sh (or .ps1)
│   └── trace.json
└── otelcol
Last Modified Feb 7, 2025

Subsections of 4. Building Resilience

4.1 File Storage Configuration

In this exercise, we will update the extensions: section of the agent.yaml file. This section is part of the OpenTelemetry configuration YAML and defines optional components that enhance or modify the OpenTelemetry Collector’s behavior.

While these components do not process telemetry data directly, they provide valuable capabilities and services to improve the Collector’s functionality.

Exercise

Update the agent.yaml: Add the file_storage extension and name it checkpoint:

  file_storage/checkpoint:         # Extension Type/Name
    directory: "./checkpoint-dir"  # Define directory
    create_directory: true         # Create directory
    timeout: 1s                    # Timeout for file operations
    compaction:                    # Compaction settings
      on_start: true               # Start compaction at Collector startup
      # Define compaction directory
      directory: "./checkpoint-dir/tmp"
      # Max. size limit before compaction occurs
      max_transaction_size: 65536

Add file_storage to existing otlphttp exporter: Modify the otlphttp: exporter to configure retry and queuing mechanisms, ensuring data is retained and resent if failures occur:

  otlphttp:                       # Exporter Type
    endpoint: "http://localhost:5318" # Gateway OTLP endpoint
    headers:                      # Headers to add to the HTTPcall 
      X-SF-Token: "ACCESS_TOKEN"  # Splunk ACCESS_TOKEN header
    retry_on_failure:             # Retry on failure settings
      enabled: true               # Enables retrying
    sending_queue:                # Sending queue settings
      enabled: true               # Enables Sending queue
      num_consumers: 10           # Number of consumers
      queue_size: 10000           # Maximum queue size
      # File storage extension
      storage: file_storage/checkpoint

Update the services section: Add the file_storage/checkpoint extension to the existing extensions: section. This will cause the extension to be enabled:

service:
  extensions:
  - health_check
  - file_storage/checkpoint       # Enabled extensions for this collector

Update the metrics pipeline: For this exercise we are going to remove the hostmetrics receiver from the Metric pipeline to reduce debug and log noise:

  metrics:
    receivers: 
    - otlp                        # OTLP Receiver
    # - hostmetrics               # Hostmetrics Receiver

Validate the agent configuration using otelbin.io. For reference, the logs: section of your pipelines will look similar to this:

%%{init:{"fontFamily":"monospace"}}%%
graph LR
    %% Nodes
      REC1(&nbsp;&nbsp;otlp&nbsp;&nbsp;<br>fa:fa-download):::receiver
      PRO1(memory_limiter<br>fa:fa-microchip):::processor
      PRO2(resourcedetection<br>fa:fa-microchip):::processor
      PRO3(resource<br>fa:fa-microchip):::processor
      PRO4(batch<br>fa:fa-microchip):::processor
      EXP1(&ensp;debug&ensp;<br>fa:fa-upload):::exporter
      EXP2(otlphttp<br>fa:fa-upload):::exporter
    %% Links
    subID1:::sub-logs
    subgraph " "
      subgraph subID1[**Logs**]
      direction LR
      REC1 --> PRO1
      PRO1 --> PRO2
      PRO2 --> PRO3
      PRO3 --> PRO4
      PRO4 --> EXP1
      PRO4 --> EXP2
      end
    end
classDef receiver,exporter fill:#8b5cf6,stroke:#333,stroke-width:1px,color:#fff;
classDef processor fill:#6366f1,stroke:#333,stroke-width:1px,color:#fff;
classDef con-receive,con-export fill:#45c175,stroke:#333,stroke-width:1px,color:#fff;
classDef sub-logs stroke:#34d399,stroke-width:1px, color:#34d399,stroke-dasharray: 3 3;
Last Modified Feb 7, 2025

4.2 Setup environment for Resilience Testing

Next, we will configure our environment to be ready for testing the File Storage configuration.

Exercise

Start the Gateway: In the Gateway terminal window navigate to the [WORKSHOP]/4-resilience directory and run:

../otelcol --config=gateway.yaml

Start the Agent: In the Agent terminal window navigate to the [WORKSHOP]/4-resilience directory and run:

../otelcol --config=agent.yaml

Send a test trace: In the Test terminal window navigate to the [WORKSHOP]/4-resilience directory and run:

curl -X POST -i http://localhost:4318/v1/traces -H "Content-Type: application/json" -d "@trace.json"

Both the Agent and Gateway should display debug logs, and the Gateway should create a ./gateway-traces.out file.

If everything functions correctly, we can proceed with testing system resilience.

Last Modified Feb 7, 2025

4.3 Simulate Failure

To assess the Agent’s resilience, we’ll simulate a temporary Gateway outage and observe how the Agent handles it:

Summary:

  1. Send Traces to the Agent – Generate traffic by sending traces to the Agent.
  2. Stop the Gateway – This will trigger the Agent to enter retry mode.
  3. Restart the Gateway – The Agent will recover traces from its persistent queue and forward them successfully. Without the persistent queue, these traces would have been lost permanently.
Exercise

Simulate a network failure: In the Gateway terminal stop the Gateway with Ctrl-C and wait until the gateway console shows that it has stopped:

2025-01-28T13:24:32.785+0100  info  service@v0.116.0/service.go:309  Shutdown complete.

Send traces: In the Test terminal window send two traces using the curl command we used earlier.

Notice that the agent’s retry mechanism is activated as it continuously attempts to resend the data. In the agent’s console output, you will see repeated messages similar to the following:

2025-01-28T14:22:47.020+0100  info  internal/retry_sender.go:126  Exporting failed. Will retry the request after interval.  {"kind": "exporter", "data_type": "traces", "name": "otlphttp", "error": "failed to make an HTTP request: Post \"http://localhost:5318/v1/traces\": dial tcp 127.0.0.1:5318: connect: connection refused", "interval": "9.471474933s"}

Stop the Agent: Use Ctrl-C to stop the agent. Wait until the agent’s console confirms it has stopped:

2025-01-28T14:40:28.702+0100  info  extensions/extensions.go:66  Stopping extensions...
2025-01-28T14:40:28.702+0100  info  service@v0.116.0/service.go:309  Shutdown complete.
Tip

Stopping the agent will halt its retry attempts and prevent any future retry activity.

If the agent runs for too long without successfully delivering data, it may begin dropping traces, depending on the retry configuration, to conserve memory. By stopping the agent, any metrics, traces, or logs currently stored in memory are lost before being dropped, ensuring they remain available for recovery.

This step is essential for clearly observing the recovery process when the agent is restarted.

Last Modified Feb 7, 2025

4.4 Simulate Recovery

In this exercise, we’ll test how the OpenTelemetry Collector recovers from a network outage by restarting the Gateway. When the Gateway becomes available again, the Agent will resume sending data from its last checkpointed state, ensuring no data loss.

Exercise

Restart the Gateway: In the Gateway terminal window run:

../otelcol --config=gateway.yaml

Restart the Agent: In the Agent terminal window run:

../otelcol --config=agent.yaml

After the Agent is up and running, the File_Storage extension will detect buffered data in the checkpoint folder.
It will start to dequeue the stored spans from the last checkpoint folder, ensuring no data is lost.

Exercise

Verify the Agent Debug output
Note that the Agent Debug Screen does NOT change and still shows the following line indicating no new data is being exported.

2025-02-07T13:40:12.195+0100    info    service@v0.117.0/service.go:253 Everything is ready. Begin running and processing data.

Watch the Gateway Debug output
You should see from the Gateway debug screen, it has started receiving the previously missed traces without requiring any additional action on your part.

2025-02-07T12:44:32.651+0100    info    service@v0.117.0/service.go:253 Everything is ready. Begin running and processing data.
2025-02-07T12:47:46.721+0100    info    Traces  {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 4, "spans": 4}
2025-02-07T12:47:46.721+0100    info    ResourceSpans #0
Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
Resource attributes:

Check the gateway-traces.out file
Count the number of traces in the recreated ./gateway-traces.out. It should match the number you send when the Gateway was down

Conclusion

This exercise demonstrated how to enhance the resilience of the OpenTelemetry Collector by configuring the file_storage extension, enabling retry mechanisms for the otlp exporter, and using a file-backed queue for temporary data storage.

By implementing file-based checkpointing and queue persistence, you ensure the telemetry pipeline can gracefully recover from temporary interruptions, making it a more robust and reliable for production environments.

Stop the Agent and Gateway using Ctrl-C.