2. Building In Resilience

10 minutes  

The OpenTelemetry Collector’s FileStorage Extension is a critical component for building a more resilient telemetry pipeline. It enables the Collector to reliably checkpoint in-flight data, manage retries efficiently, and gracefully handle temporary failures without losing valuable telemetry.

With FileStorage enabled, the Collector can persist intermediate states to disk, ensuring that your traces, metrics, and logs are not lost during network disruptions, backend outages, or Collector restarts. This means that even if your network connection drops or your backend becomes temporarily unavailable, the Collector will continue to receive and buffer telemetry, resuming delivery seamlessly once connectivity is restored.

By integrating the FileStorage Extension into your pipeline, you can strengthen the durability of your observability stack and maintain high-quality telemetry ingestion, even in environments where connectivity may be unreliable.
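At a high level, the pattern used throughout this section combines two pieces of Collector configuration: a file_storage extension that owns an on-disk directory, and an exporter whose sending_queue uses that extension as its storage backend. The following is a minimal sketch of that pattern only, not the full agent.yaml; the exact configuration used in this workshop is built step by step in 2.1:

extensions:
  file_storage/checkpoint:             # Owns the on-disk checkpoint directory
    directory: "./checkpoint-dir"

exporters:
  otlphttp:
    endpoint: "http://localhost:5318"  # Endpoint used later in this workshop
    retry_on_failure:
      enabled: true                    # Retry failed exports
    sending_queue:
      enabled: true
      storage: file_storage/checkpoint # Persist the queue via the extension

service:
  extensions: [file_storage/checkpoint]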

Note

This solution works for metrics as long as the connection downtime is brief (up to 15 minutes). If the downtime exceeds this, Splunk Observability Cloud might drop data to ensure no data points are ingested out of order.

For logs, there are plans to implement a full enterprise-ready solution in one of the upcoming Splunk OpenTelemetry Collector releases.

2.1 File Storage Configuration

In this exercise, we will update the extensions: section of the agent.yaml file. This section is part of the OpenTelemetry configuration YAML and defines optional components that enhance or modify the OpenTelemetry Collector’s behavior.

While these components do not process telemetry data directly, they provide valuable capabilities and services to improve the Collector’s functionality.

Exercise
Important

Change ALL terminal windows to the 2-building-resilience directory and run the clear command.

Your directory structure will look like this:

.
├── agent.yaml
└── gateway.yaml

Update the agent.yaml: In the Agent terminal window, add the file_storage extension under the existing health_check extension (a sketch of the resulting extensions: block follows the snippet):

  file_storage/checkpoint:             # Extension Type/Name
    directory: "./checkpoint-dir"      # Define directory
    create_directory: true             # Create directory
    timeout: 1s                        # Timeout for file operations
    compaction:                        # Compaction settings
      on_start: true                   # Start compaction at Collector startup
      # Define compaction directory
      directory: "./checkpoint-dir/tmp"
      max_transaction_size: 65536      # Max. size of a compaction transaction
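
For reference, after this change the extensions: section should look roughly like the sketch below. The health_check settings are whatever your existing agent.yaml already defines; they are omitted here:

extensions:
  health_check:                        # Existing extension, unchanged
    # ... existing health_check settings ...
  file_storage/checkpoint:             # Newly added extension
    directory: "./checkpoint-dir"
    create_directory: true
    timeout: 1s
    compaction:
      on_start: true
      directory: "./checkpoint-dir/tmp"
      max_transaction_size: 65536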

Add file_storage to the exporter: Modify the otlphttp exporter to configure retry and queuing mechanisms, ensuring data is retained and resent if failures occur. Add the following under endpoint: "http://localhost:5318", making sure the indentation matches that of endpoint: (a sketch of the complete exporter entry follows the snippet):

    retry_on_failure:
      enabled: true                    # Enable retry on failure
    sending_queue:                     # Configure sending queue
      enabled: true                    # Enable sending queue
      num_consumers: 10                # No. of consumers
      queue_size: 10000                # Max. queue size
      storage: file_storage/checkpoint # File storage extension
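
Assuming the otlphttp exporter previously contained only the endpoint setting shown above, the complete exporter entry should now look roughly like this (any other existing settings stay as they are):

exporters:
  otlphttp:                            # OTLP/HTTP exporter
    endpoint: "http://localhost:5318"  # Gateway endpoint, as already configured
    retry_on_failure:
      enabled: true                    # Enable retry on failure
    sending_queue:                     # Configure sending queue
      enabled: true                    # Enable sending queue
      num_consumers: 10                # No. of consumers
      queue_size: 10000                # Max. queue size
      storage: file_storage/checkpoint # File storage extension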

Update the service section: Add the file_storage/checkpoint extension to the existing extensions: section so that it looks like this:

service:
  extensions:
  - health_check
  - file_storage/checkpoint            # Enabled extensions for this collector

Update the metrics pipeline: For this exercise, we are going to comment out the hostmetrics receiver in the metrics pipeline to reduce debug and log noise. The configuration needs to look like this (a sketch of the resulting service: section follows the snippet):

    metrics:
      receivers:
      # - hostmetrics                    # Host metrics receiver (cpu only)
      - otlp
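
Putting the service-level changes together, the relevant parts of the service: section should now look roughly like this. Only the parts touched in this exercise are shown; the traces and logs pipelines, as well as the existing processors and exporters, remain unchanged:

service:
  extensions:
  - health_check
  - file_storage/checkpoint            # Enabled extensions for this collector
  pipelines:
    metrics:
      receivers:
      # - hostmetrics                  # Host metrics receiver (cpu only), commented out
      - otlp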

2.2 Set Up the Environment for Resilience Testing

Next, we will configure our environment to be ready for testing the File Storage configuration.

Exercise

Start the Gateway: In the Gateway terminal window run:

../otelcol --config=gateway.yaml

Start the Agent: In the Agent terminal window run:

../otelcol --config=agent.yaml

Send five test spans: In the Loadgen terminal window run:

../loadgen -count 5

Both the Agent and Gateway should display debug logs, and the Gateway should create a ./gateway-traces.out file.
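
Optionally, you can verify the contents of gateway-traces.out right away using the same jq command that the final exercise uses (this assumes jq is available in your environment):

jq '.resourceSpans | length | "\(.) resourceSpans found"' gateway-traces.out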

If everything functions correctly, we can proceed with testing system resilience.

2.3 Simulate Failure

To assess the Agent’s resilience, we’ll simulate a temporary Gateway outage and observe how the Agent handles it:

Exercise

Simulate a network failure: In the Gateway terminal window, stop the Gateway with Ctrl-C and wait until the Gateway console shows that it has stopped. The Agent will continue running, but it will not be able to send data to the Gateway. The output in the Gateway terminal should look similar to this:

2025-07-09T10:22:37.941Z        info    service@v0.126.0/service.go:345 Shutdown complete.      {"resource": {}}

Send traces: In the Loadgen terminal window send five more traces using the loadgen.

../loadgen -count 5

Notice that the Agent's retry mechanism is activated as it continuously attempts to resend the data. In the Agent's console output, you will see repeated messages similar to the following:

2025-01-28T14:22:47.020+0100  info  internal/retry_sender.go:126  Exporting failed. Will retry the request after interval.  {"kind": "exporter", "data_type": "traces", "name": "otlphttp", "error": "failed to make an HTTP request: Post \"http://localhost:5318/v1/traces\": dial tcp 127.0.0.1:5318: connect: connection refused", "interval": "9.471474933s"}

Stop the Agent: In the Agent terminal window, use Ctrl-C to stop the Agent. Wait until the Agent's console confirms it has stopped:

2025-07-09T10:25:59.344Z        info    service@v0.126.0/service.go:345 Shutdown complete.      {"resource": {}}

When you stop the Agent, any metrics, traces, or logs held only in memory for retry would normally be lost. However, because we have configured the FileStorage extension, all telemetry that has not yet been accepted by the target endpoint is safely checkpointed on disk.
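
If you are curious, you can list the checkpoint directory configured in 2.1 (relative to the 2-building-resilience directory) to confirm that queued telemetry has been persisted. The exact file names are managed by the file_storage extension and may vary between Collector versions:

ls ./checkpoint-dir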

Stopping the Agent is a crucial step to clearly demonstrate how the system recovers when the Agent is restarted.

2.4 Recovery

In this exercise, we'll test how the OpenTelemetry Collector recovers from a network outage by restarting the Gateway collector. When the Gateway becomes available again, the Agent will resume sending data from its last checkpointed state, ensuring no data loss.

Exercise

Restart the Gateway: In the Gateway terminal window run:

../otelcol --config=gateway.yaml

Restart the Agent: In the Agent terminal window run:

../otelcol --config=agent.yaml

After the Agent is up and running, the file_storage extension will detect the buffered data in the checkpoint directory and begin dequeuing the stored spans from the last checkpoint, ensuring no data is lost.

Verify the Agent debug output: Note that the Agent debug output does NOT change and still shows the following line, indicating that no new data is being exported:

2025-07-11T08:31:58.176Z        info    service@v0.126.0/service.go:289 Everything is ready. Begin running and processing data.   {"resource": {}}

Watch the Gateway debug output: You should see in the Gateway debug output that it has started receiving the previously missed traces without requiring any additional action on your part, e.g.:

Attributes:
   -> user.name: Str(Luke Skywalker)
   -> user.phone_number: Str(+1555-867-5309)
   -> user.email: Str(george@deathstar.email)
   -> user.password: Str(LOTR>StarWars1-2-3)
   -> user.visa: Str(4111 1111 1111 1111)
   -> user.amex: Str(3782 822463 10005)
   -> user.mastercard: Str(5555 5555 5555 4444)
   -> payment.amount: Double(75.75)
      {"resource": {}, "otelcol.component.id": "debug", "otelcol.component.kind": "exporter", "otelcol.signal": "traces"}

Check the gateway-traces.out file: Using jq, count the number of traces in the recreated gateway-traces.out. It should match the number you sent while the Gateway was down.

jq '.resourceSpans | length | "\(.) resourceSpans found"' gateway-traces.out
"5 resourceSpans found"
Important

Stop the Agent and the Gateway processes by pressing Ctrl-C in their respective terminals.

Conclusion

This exercise demonstrated how to enhance the resilience of the OpenTelemetry Collector by configuring the file_storage extension, enabling retry mechanisms for the otlphttp exporter, and using a file-backed queue for temporary data storage.

By implementing file-based checkpointing and queue persistence, you ensure the telemetry pipeline can gracefully recover from temporary interruptions, making it more robust and reliable for production environments.