Configure Azure Storage Blob modular inputs for the Splunk Add-on for Microsoft Cloud Services¶
Before you enable inputs, complete the previous steps in the configuration process:
- Configure a Storage Account in Microsoft Cloud Service
- Connect to your Azure Storage account with the Splunk Add-on for Microsoft Cloud Services
Configure your inputs on the Splunk platform instance responsible for collecting data for this add-on, usually a heavy forwarder. As a best practice, configure inputs using Splunk Web, though you can also use the configuration files.
Note
Versions 5.0.0 and higher of the Splunk Add-on for Microsoft Cloud Services contain changes to the checkpoint mechanism for the Storage Blob input. See the upgrade steps in this manual for more information.
Note
The Azure Storage Blob modular input for Splunk Add-on for Microsoft Cloud Services does not support the ingestion of gzip files. Only plaintext files are supported.
Because the format of the data in the Azure Storage Blob channel varies, use source types to make the event data more usable. See Overview of Event Processing for more information.
Configure parameters to tune the performance of this input. For more information, see Configure Global settings.
Horizontal Scaling¶
Horizontal scaling was introduced in version 5.0.0 of the Splunk Add-on for Microsoft Cloud Services. It lets multiple inputs collect data from the same storage container in parallel to reduce data ingestion delays.
Analyze your use case before opting for horizontal scaling. Horizontal scaling is designed for containers that hold a very large number of files. If a container holds a small number of large files, scaling up might instead be limited by the indexing rate of the environment.
The ingestion rate does not scale linearly with the number of inputs. For instance, if one input can collect the entire container's data in 1 hour, creating two inputs does not necessarily reduce the net collection time to 30 minutes, nor do three inputs necessarily bring it down to 20 minutes.
Note
Scale the inputs incrementally and monitor the ingestion rate before scaling up again. If the added inputs start filling up the Splunk indexing queue, the health of the environment might be adversely affected.
Note
Use horizontal scaling only after the file-based checkpoint for the input has been successfully migrated to the KV store. Otherwise, data duplication can occur.
Prerequisites¶
- All inputs must use the same index.
- All Splunk platform instances must use the same centralized KV store. On a Victoria stack, the KV store is centralized, so this feature can be used there. If Splunk instances use different KV stores, data is duplicated. For example, if two heavy forwarders each use their own KV store and both have inputs collecting data from the same storage container, the data is duplicated.
Risks¶
- There is a small chance of data duplication, up to 5%.
Configure inputs using Splunk Web¶
Configure your inputs using Splunk Web on the Splunk platform instance responsible for collecting data for this add-on, usually a heavy forwarder.
- In the Splunk Add-on for Microsoft Cloud Services, select Inputs.
- Select Create New Input and select Azure Storage Blob.
- Enter the Name, Storage Account, Container Name, Blob list, Interval, Index and Sourcetype using the Inputs parameters table.
Configure inputs using Configuration File¶
- Create a file called inputs.conf under $SPLUNK_HOME/etc/apps/Splunk_TA_microsoft-cloudservices/local.
- Configure the Azure Storage Blob input with the following stanza:
[mscs_storage_blob://<input_name>]
account = <value>
application_insights = <value>
blob_mode = <value>
collection_interval = <value>
container_name = <value>
prefix = <value>
blob_list = <value>
exclude_blob_list = <value>
decoding = <value>
guids = <value>
index = <value>
log_type = <value>
sourcetype = <value>
disabled = <value>
read_timeout = <value>
blob_compression = <value>
worker_threads_num = <value>
get_blob_batch_size = <value>
agent = <value>
dont_reupload_blob_same_size = <value>
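For reference, a complete stanza might look like the following. The input name, account, and container values here are hypothetical placeholders; parameters that you omit fall back to their defaults.

```
[mscs_storage_blob://example_blob_input]
account = my_storage_account
container_name = my-container
blob_list = blob*
collection_interval = 3600
index = main
sourcetype = mscs:storage:blob
blob_mode = random
disabled = 0
```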
Input parameters¶
Each attribute in the following table corresponds to a field in Splunk Web.
Basic configuration¶
| Attribute | Corresponding field in Splunk Web | Description |
|---|---|---|
| mscs_storage_blob://<input_name> | Name | Enter a unique, user-friendly name for each input. The name must not contain whitespace and is used to distinguish between different inputs. |
| account | Azure Storage Account | The name of the Azure storage account where your blobs are stored. Select the storage account name that you configured. |
| application_insights | Application Insights Check | Indicates whether the Azure storage blob ingests data from Application Insights. Set the value to 1 to ingest this data; otherwise, set it to 0. You must configure Log type and GUIDs if application_insights is set to 1. |
| container_name | Container Name | Specifies the blob container in the storage account to be processed. You can configure only one container per input. |
| prefix | Prefix | Filters blobs based on a prefix path inside the container, so that the Splunk platform reads only specific folders. The input collects data only from blobs whose names begin with the specified prefix. For example, to collect data from the h=08 blob named y=2022/m=10/d=05/h=08/m=00/blob1.txt, specify the prefix y=2022/m=10/d=05/h=08. Constraints: the blob name must not contain leading or trailing spaces, and the prefix must not contain spaces in a path segment. For instance, the prefix y=2022 / m=10 / d=05 / would not match the blob. |
| blob_list | Blob List | Enter the blob name, wildcard, or regular expression for the data that you want to collect, separating multiple entries with commas. If you leave this field empty, the add-on collects all blobs under the Container Name that you configured. To collect data from a specific blob, enter its name, such as blob_name. You can use wildcards, such as blob*, to collect data from blobs whose names start with blob. Use commas to separate multiple blob names, such as blob, name*. To use regular expressions, use this JSON format syntax: {"regex syntax" :3}, where 3 stands for a regular expression. To combine a regular expression with a wildcard, enter both separated by commas, for example {"regex syntax" :3, blob* :2}, where 2 stands for a wildcard. To combine all three expression types, use {"regex syntax" :3, blob* :2, blob :1}, where 1 stands for a specific blob name. Constraints: a blob name must be between 1 and 1,024 characters long; blob names are case-sensitive; reserved URL characters must be properly escaped; and the number of path segments in the blob name cannot exceed 254. |
| exclude_blob_list | Excluded Blob List | Optional. Enter the blob names or regular expressions for the data that you do not want to collect, separating multiple entries with commas. The syntax is the same as for Blob List. For example, if you do not want to include blobs from 2020 and your blob name is y=2020/m=10/d=05/h=08/m=00/blob1.txt, enter this regular expression: .*y=2020\/.*. |
| guids | GUIDs | Indicates the GUID identifier used for Application Insights data, in this format: <application insights resource name>_<instrumentation key>. Required if application_insights is turned on. Enter individual GUIDs as comma-separated values. |
| log_type | Log type | Filters the results to return only blobs whose names begin with the specified log type. Use the following Application Insights blob format: <container_name>/<guid>/<arbitrary_log_type_value>/<yyyy-mm-dd>/<hh>/<blob_file>. Use only one log type value per input. Required if application_insights is turned on. |
| decoding | Decoding | Specify the character set of the file (for example, UTF-8 or UTF-32). If you leave this field blank, the file's default character set is used. |
| collection_interval | Interval | The number of seconds the Splunk platform waits before executing the command again to check for new blobs or blob changes. The default is 3600 seconds. This interval does not affect the ingestion process for static blobs when there are no updates. |
| index | Index | The index in which to store Azure Storage Blob data. |
| sourcetype | Sourcetype | The default is mscs:storage:blob, used for general storage blob data. To simplify field extraction, you can enter one of the predefined sourcetypes instead: mscs:storage:blob:json for blobs containing JSON-formatted data, or mscs:storage:blob:xml for blobs containing XML-formatted data. These sourcetypes tailor parsing to the JSON or XML structure. |
Advanced settings¶
| Attribute | Corresponding field in Splunk Web | Description |
|---|---|---|
| blob_mode | Blob Mode | Select the blob processing mode. append: retrieves only incremental changes from the blob; this mode is used automatically when the Azure Storage blob_type is Append blob and suits sequential data scenarios such as logging. random: retrieves the entire blob each time it is updated; this mode is used when the Azure Storage blob_type is Block or Page blob. The default is random. |
| blob_compression | Blob Compression Type | Select one of the following blob compression types. Not compressed: the blob content is not compressed, and the input processes the content as downloaded, without attempting decompression. Gzip: the downloaded blob content is inflated using the gzip algorithm; this is currently the only supported compression type. If inflation fails, an error is logged and the blob content is ingested without inflation. To avoid this scenario, ensure that the container scope is correctly defined using the prefix, blob_list, and/or exclude_blob_list configuration parameters. Gzip is supported only in random blob mode; in append blob mode, only Not compressed is supported. The default is Not compressed. |
| read_timeout | Read Timeout | The maximum duration, in seconds, that the system waits for a response from the Azure Storage service when reading data. The default is 60 seconds. If the service does not respond within this timeframe, the request is terminated. Increasing this value lets the system wait longer for a response, which is beneficial in high-latency environments or when processing exceptionally large blobs, but it can also lead to longer ingestion delays if the service is unresponsive. Adjust this value based on your network stability and the typical size of the blobs being processed. |
| dont_reupload_blob_same_size | Skip re-ingesting blobs with unchanged size | If enabled, a blob is not re-ingested when its size remains unchanged, even if it was re-uploaded to the container. This setting applies to append blobs and append mode only and is disabled by default. It does not affect performance or ingestion in random blob mode. |
Global settings override¶
| Attribute | Corresponding field in Splunk Web | Description |
|---|---|---|
| worker_threads_num | Number of worker threads | Defines the number of concurrent worker threads used to retrieve data from individual blobs. Increasing this value allows multiple blobs to be processed concurrently, which can significantly reduce overall ingestion time. The default is 10. Adjust this value in small increments and monitor ingestion logs and system memory usage to determine the optimal setting for your environment. This setting is a local override and takes precedence over any globally defined thread settings; modify it only in advanced scenarios where fine-tuning of thread utilization is required. |
| get_blob_batch_size | Batch size of append blobs | Specifies the number of bytes to download in a single batch operation when processing append blobs. The default is 120,000 bytes. This setting is a local override and takes precedence over any global batch size settings defined at the system level. |
| agent | Log level | Controls the granularity of the information that the system records for analysis and troubleshooting. Use the DEBUG level during development and troubleshooting, as it provides the most comprehensive diagnostic information. The default is INFO. |
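These overrides can be set per input. For example, to raise the thread count and batch size for one heavy input while leaving the global defaults untouched, you might add the parameters to that input's stanza in inputs.conf. The input name and values below are illustrative placeholders, not recommendations:

```
[mscs_storage_blob://example_blob_input]
account = my_storage_account
container_name = my-container
index = main
worker_threads_num = 15
get_blob_batch_size = 240000
```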
Note
If a blob matches the syntax in both Blob List and Exclude Blob List, Exclude Blob List takes priority. For example, if a blob named blob1 matches the syntax that you set in both Blob List and Exclude Blob List, the add-on excludes it because Exclude Blob List has the higher priority.
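The priority rule in the note above can be sketched in Python. This is an illustrative model only, with hypothetical blob names and a hypothetical helper function; the add-on's actual matching logic is internal to the modular input:

```python
import re

def should_collect(blob_name, include_patterns, exclude_patterns):
    """Return True if the blob matches an include pattern and no exclude pattern.

    Exclude patterns take priority over include patterns, mirroring the
    Blob List / Exclude Blob List behavior described in the note above.
    """
    included = any(re.fullmatch(p, blob_name) for p in include_patterns)
    excluded = any(re.fullmatch(p, blob_name) for p in exclude_patterns)
    return included and not excluded

include = [r".*"]              # collect everything by default
exclude = [r".*y=2020/.*"]     # but skip blobs from 2020

print(should_collect("y=2020/m=10/d=05/h=08/m=00/blob1.txt", include, exclude))  # False: excluded
print(should_collect("y=2022/m=10/d=05/h=08/m=00/blob1.txt", include, exclude))  # True: collected
```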
Optimal config settings for large size container and files¶
Use the following settings to manage containers that hold a large number of blobs or very large blobs:
Number of Worker Threads¶
This field is defined in the Global settings override section. It specifies the number of concurrent worker threads dedicated to retrieving data from individual blobs. The setting overrides the global configuration: when a value is set locally, for example for a particular input or instance, it takes precedence over the default or global setting that applies system-wide.
If a global default number of worker threads is set for inputs, but you specify a different number for a single input using the worker_threads_num parameter, the system uses the worker_threads_num value for that input instead of the global default. This allows more granular control and tuning on a per-input basis without changing the global configuration.
The default value is 10.
Adjust this value in small increments and monitor the ingestion logs and system memory usage to find the optimal balance for your specific environment.
Batch Size of Append Blobs¶
The Batch size of append blobs setting in the user interface lets you increase the batch size used when processing append blobs. You can adjust this value based on your CPU capacity to optimize performance. Although this field was originally designed specifically for append blobs, you can use it to monitor memory consumption rates as you increase the batch size.
In future releases, a dedicated batch size field for random blobs will be provided to offer more precise control.
Note
Increasing the batch size also increases memory consumption. This is because the batch size controls how many chunks of data are processed together, and larger batches require more memory to handle the data simultaneously.
Adjusting this setting carefully allows you to balance throughput and resource usage according to your environment’s capabilities.
Configure ingestion mode¶
Configure ingestion mode by selecting a blob mode that aligns with the blob type that you selected while creating the blob in your Azure storage account.
- On your Splunk platform deployment, navigate to the $SPLUNK_HOME/etc/apps/Splunk_TA_microsoft-cloudservices/local directory.
- Open the inputs.conf file with a text editor.
- Navigate to the stanza of the blob storage input that you created.
- Change the blob_mode attribute to append or random, based on the following table:
| blob_type \ ingestion_mode | Incremental | Full |
|---|---|---|
| append | blob_mode is irrelevant. You always receive incremental changes to your blob. | N/A |
| block or page | If you use a block blob to append data to the blob and only want the incremental changes, set blob_mode = append. | Set blob_mode = random. After a blob is complete or closed, the contents are ingested to the Splunk platform. |
- Save your changes.
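For instance, to switch a block blob input that receives appended data over to incremental collection, the edited stanza might look like this (the input name, account, and container are hypothetical placeholders):

```
[mscs_storage_blob://example_blob_input]
account = my_storage_account
container_name = my-container
blob_mode = append
```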
Advanced Configuration¶
The Allow Storage Blob Deletion option was introduced on the Configuration > Advanced tab in version 5.0.0 of the Splunk Add-on for Microsoft Cloud Services. This option allows the deletion of checkpoint files from the Splunk environment after migration to the KV store. Enable this option only after all Storage Blob inputs have been migrated to the KV store successfully and the system is stable.