2.4 Recovery
In this exercise, we’ll test how the OpenTelemetry Collector recovers from a network outage by restarting the Gateway collector. When the Gateway becomes available again, the Agent will resume sending data from its last check-pointed state, ensuring no data loss.
Exercise
Restart the Gateway: In the Gateway terminal window run:
Restart the Agent: In the Agent terminal window run:
After the Agent is up and running, the File_Storage extension will detect buffered data in the checkpoint folder. It will start to dequeue the stored spans from the last checkpoint folder, ensuring no data is lost.
Verify the Agent Debug output: Note that the Agent debug output does NOT change and still shows the following line indicating no new data is being exported:
Watch the Gateway Debug output
You should see from the Gateway debug screen, it has started receiving the previously missed traces without requiring any additional action on your part e.g.:
Check the gateway-traces.out file: Using jq, count the number of traces in the recreated gateway-traces.out. It should match the number you send when the Gateway was down.
Important
Stop the Agent and the Gateway processes by pressing Ctrl-C in their respective terminals.
Conclusion
This exercise demonstrated how to enhance the resilience of the OpenTelemetry Collector by configuring the file_storage extension, enabling retry mechanisms for the otlp exporter, and using a file-backed queue for temporary data storage.
By implementing file-based check-pointing and queue persistence, you ensure the telemetry pipeline can gracefully recover from temporary interruptions, making it a more robust and reliable for production environments.