4.4 Simulate Recovery

In this exercise, we’ll test how the OpenTelemetry Collector recovers from a network outage by restarting the Gateway. When the Gateway becomes available again, the Agent will resume sending data from its last checkpointed state, ensuring no data loss.

Exercise

Restart the Gateway: In the Gateway terminal window run:

../otelcol --config=gateway.yaml

Restart the Agent: In the Agent terminal window run:

../otelcol --config=agent.yaml

After the Agent is up and running, the File_Storage extension will detect buffered data in the checkpoint folder.
It will start to dequeue the stored spans from the last checkpoint folder, ensuring no data is lost.

Exercise

Verify the Agent Debug output
Note that the Agent Debug Screen does NOT change and still shows the following line indicating no new data is being exported.

2025-02-07T13:40:12.195+0100    info    service@v0.117.0/service.go:253 Everything is ready. Begin running and processing data.

Watch the Gateway Debug output
You should see from the Gateway debug screen, it has started receiving the previously missed traces without requiring any additional action on your part.

2025-02-07T12:44:32.651+0100    info    service@v0.117.0/service.go:253 Everything is ready. Begin running and processing data.
2025-02-07T12:47:46.721+0100    info    Traces  {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 4, "spans": 4}
2025-02-07T12:47:46.721+0100    info    ResourceSpans #0
Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
Resource attributes:

Check the gateway-traces.out file
Count the number of traces in the recreated ./gateway-traces.out. It should match the number you send when the Gateway was down

Conclusion

This exercise demonstrated how to enhance the resilience of the OpenTelemetry Collector by configuring the file_storage extension, enabling retry mechanisms for the otlp exporter, and using a file-backed queue for temporary data storage.

By implementing file-based checkpointing and queue persistence, you ensure the telemetry pipeline can gracefully recover from temporary interruptions, making it a more robust and reliable for production environments.

Stop the Agent and Gateway using Ctrl-C.