Configure Azure Storage Blob modular inputs for the Splunk Add-on for Microsoft Cloud Services¶
Before you enable inputs, complete the previous steps in the configuration process:
- Configure a Storage Account in Microsoft Cloud Service
- Connect to your Azure Storage account with the Splunk Add-on for Microsoft Cloud Services
Configure your inputs on the Splunk platform instance responsible for collecting data for this add-on, usually a heavy forwarder. As a best practice, configure inputs using Splunk Web, though you can also use the configuration files.
Note
Versions 5.0.0 and higher of the Splunk Add-on for Microsoft Cloud Services contain changes to the checkpoint mechanism for the Storage Blob input. See the upgrade steps in this manual for more information.
Note
The Azure Storage Blob modular input for Splunk Add-on for Microsoft Cloud Services does not support the ingestion of gzip files. Only plaintext files are supported.
Because the format of the data in the Azure Storage Blob channel varies, use source types to make the event data more usable. See Overview of Event Processing for more information.
Configure parameters to tune the performance of this input. For more information, see Configure Global settings.
Horizontal Scaling¶
Horizontal scaling was introduced in version 5.0.0 of the Splunk Add-on for Microsoft Cloud Services. It lets multiple inputs collect data from the same storage container in parallel to reduce data ingestion delays.
Analyze your use case before opting for horizontal scaling. Horizontal scaling is designed for containers that hold a very large number of files. If a container holds a small number of large files, scaling up might instead be limited by the indexing rate of the environment.
The ingestion rate does not scale linearly with the number of inputs. For instance, if one input can collect the entire container's data in 1 hour, creating two inputs does not necessarily reduce the net collection time to 30 minutes, nor do three inputs necessarily bring it down to 20 minutes.
Note
Scale the inputs incrementally and monitor the ingestion rate before scaling up again. If the added inputs start filling up the Splunk indexing queue, the health of the environment might be adversely affected.
Note
Use horizontal scaling only after the file-based checkpoint for the input has been successfully migrated to the KV store. Otherwise, data duplication can occur.
Prerequisites¶
- All inputs must use the same index.
- All Splunk platform instances must use the same centralized KV store. On a Victoria stack, the KV store is centralized, so this feature can be used there. If Splunk instances use different KV stores, data is duplicated. For example, if two heavy forwarders each use their own KV store and both have inputs collecting data from the same storage container, the data is duplicated.
Risks¶
- There is a small chance of data duplication, up to 5%.
Configure inputs using Splunk Web¶
Configure your inputs using Splunk Web on the Splunk platform instance responsible for collecting data for this add-on, usually a heavy forwarder.
- In the Splunk Add-on for Microsoft Cloud Services, select Inputs.
- Select Create New Input and select Azure Storage Blob.
- Enter the Name, Storage Account, Container Name, Blob list, Interval, Index and Sourcetype using the Inputs parameters table.
Configure inputs using Configuration File¶
- Create a file called inputs.conf under $SPLUNK_HOME/etc/apps/Splunk_TA_microsoft-cloudservices/local.
- Configure the Azure Storage Blob input with the following stanza:
[mscs_storage_blob://<input_name>]
account = <value>
application_insights = <value>
blob_mode = <value>
collection_interval = <value>
container_name = <value>
prefix = <value>
blob_list = <value>
exclude_blob_list = <value>
decoding = <value>
guids = <value>
index = <value>
log_type = <value>
sourcetype = <value>
disabled = <value>
read_timeout = <value>
blob_compression = <value>
worker_threads_num = <value>
get_blob_batch_size = <value>
agent = <value>
dont_reupload_blob_same_size = <value>
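For reference, a complete stanza might look like the following. The input name, account, and container values here are hypothetical placeholders; parameters that you omit fall back to their defaults.

```
[mscs_storage_blob://example_blob_input]
account = my_storage_account
container_name = my-container
blob_list = blob*
collection_interval = 3600
index = main
sourcetype = mscs:storage:blob
blob_mode = random
disabled = 0
```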
Input parameters¶
Each attribute in the following table corresponds to a field in Splunk Web.
Basic configuration¶
| Attribute | Corresponding field in Splunk Web | Description |
|---|---|---|
| mscs_storage_blob://<input_name> | Name | Enter a unique, user-friendly name for each input. The name must not contain whitespace and is used to distinguish between different inputs. |
| account | Azure Storage Account | The name of the Azure storage account where your blobs are stored. Select the storage account name that you configured. |
| application_insights | Application Insights Check | Indicates whether the Azure storage blob ingests data from Application Insights. Set the value to 1 to ingest this data; otherwise, set it to 0. You must configure Log type and GUIDs if application_insights is set to 1. |
| container_name | Container Name | Specifies the blob container in the storage account to be processed. You can configure only one container per input. |
| prefix | Prefix | Filters blobs based on a prefix path inside the container, so that the Splunk platform reads only specific folders. The input collects data only from blobs whose names begin with the specified prefix. For example, to collect data from the h=08 blob named y=2022/m=10/d=05/h=08/m=00/blob1.txt, specify the prefix y=2022/m=10/d=05/h=08. Constraints: the blob name must not contain leading or trailing spaces, and the prefix must not contain spaces in a path segment. For instance, the prefix y=2022 / m=10 / d=05 / would not match the blob. |
| blob_list | Blob List | Enter the blob name, wildcard, or regular expression for the data that you want to collect, separating multiple entries with commas. If you leave this field empty, the add-on collects all blobs under the Container Name that you configured. To collect data from a specific blob, enter its name, such as blob_name. You can use wildcards, such as blob*, to collect data from blobs whose names start with blob. Use commas to separate multiple blob names, such as blob, name*. To use regular expressions, use this JSON format syntax: {"regex syntax" :3}, where 3 stands for a regular expression. To combine a regular expression with a wildcard, enter both separated by commas, for example {"regex syntax" :3, blob* :2}, where 2 stands for a wildcard. To combine all three expression types, use {"regex syntax" :3, blob* :2, blob :1}, where 1 stands for a specific blob name. Constraints: a blob name must be between 1 and 1,024 characters long; blob names are case-sensitive; reserved URL characters must be properly escaped; and the number of path segments in the blob name cannot exceed 254. |
| exclude_blob_list | Excluded Blob List | Optional. Enter the blob names or regular expressions for the data that you do not want to collect, separating multiple entries with commas. The syntax is the same as for Blob List. For example, if you do not want to include blobs from 2020 and your blob name is y=2020/m=10/d=05/h=08/m=00/blob1.txt, enter this regular expression: .*y=2020\/.*. |
| guids | GUIDs | Indicates the GUID identifier used for Application Insights data, in this format: <application insights resource name>_<instrumentation key>. Required if application_insights is turned on. Enter individual GUIDs as comma-separated values. |
| log_type | Log type | Filters the results to return only blobs whose names begin with the specified log type. Use the following Application Insights blob format: <container_name>/<guid>/<arbitrary_log_type_value>/<yyyy-mm-dd>/<hh>/<blob_file>. Use only one log type value per input. Required if application_insights is turned on. |
| decoding | Decoding | Specify the character set of the file (for example, UTF-8 or UTF-32). If you leave this field blank, the file's default character set is used. |
| collection_interval | Interval | The number of seconds the Splunk platform waits before executing the command again to check for new blobs or blob changes. The default is 3600 seconds. This interval does not affect the ingestion process for static blobs when there are no updates. |
| index | Index | The index in which to store Azure Storage Blob data. |
| sourcetype | Sourcetype | The default is mscs:storage:blob, used for general storage blob data. To simplify field extraction, you can enter one of the predefined sourcetypes instead: mscs:storage:blob:json for blobs containing JSON-formatted data, or mscs:storage:blob:xml for blobs containing XML-formatted data. These sourcetypes tailor parsing to the JSON or XML structure. |
Advanced settings¶
| Attribute | Corresponding field in Splunk Web | Description |
|---|---|---|
| blob_mode | Blob Mode | Select the blob processing mode. append: retrieves only incremental changes from the blob; this mode is used automatically when the Azure Storage blob_type is Append blob and suits sequential data scenarios such as logging. random: retrieves the entire blob each time it is updated; this mode is used when the Azure Storage blob_type is Block or Page blob. The default is random. |
| blob_compression | Blob Compression Type | Select one of the following blob compression types. Not compressed: the blob content is not compressed, and the input processes the content as downloaded, without attempting decompression. Gzip: the downloaded blob content is inflated using the gzip algorithm; this is currently the only supported compression type. If inflation fails, an error is logged and the blob content is ingested without inflation. To avoid this scenario, ensure that the container scope is correctly defined using the prefix, blob_list, and/or exclude_blob_list configuration parameters. Gzip is supported only in random blob mode; in append blob mode, only Not compressed is supported. The default is Not compressed. |
| read_timeout | Read Timeout | The maximum duration, in seconds, that the system waits for a response from the Azure Storage service when reading data. The default is 60 seconds. If the service does not respond within this timeframe, the request is terminated. Increasing this value lets the system wait longer for a response, which is beneficial in high-latency environments or when processing exceptionally large blobs, but it can also lead to longer ingestion delays if the service is unresponsive. Adjust this value based on your network stability and the typical size of the blobs being processed. |
| dont_reupload_blob_same_size | Skip re-ingesting blobs with unchanged size | If enabled, a blob is not re-ingested when its size remains unchanged, even if it was re-uploaded to the container. This setting applies to append blobs and append mode only and is disabled by default. It does not affect performance or ingestion in random blob mode. |
Global settings override¶
| Attribute | Corresponding field in Splunk Web | Description |
|---|---|---|
| worker_threads_num | Number of worker threads | Defines the number of concurrent worker threads used to retrieve data from individual blobs. Increasing this value allows multiple blobs to be processed concurrently, which can significantly reduce overall ingestion time. The default is 10. Adjust this value in small increments and monitor ingestion logs and system memory usage to determine the optimal setting for your environment. This setting is a local override and takes precedence over any globally defined thread settings; modify it only in advanced scenarios where fine-tuning of thread utilization is required. |
| get_blob_batch_size | Batch size of append blobs | Specifies the number of bytes to download in a single batch operation when processing append blobs. The default is 120,000 bytes. This setting is a local override and takes precedence over any global batch size settings defined at the system level. |
| agent | Log level | Controls the granularity of the information that the system records for analysis and troubleshooting. Use the DEBUG level during development and troubleshooting, as it provides the most comprehensive diagnostic information. The default is INFO. |
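These overrides can be set per input. For example, to raise the thread count and batch size for one heavy input while leaving the global defaults untouched, you might add the parameters to that input's stanza in inputs.conf. The input name and values below are illustrative placeholders, not recommendations:

```
[mscs_storage_blob://example_blob_input]
account = my_storage_account
container_name = my-container
index = main
worker_threads_num = 15
get_blob_batch_size = 240000
```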
Note
If a blob matches the syntax in both Blob List and Exclude Blob List, Exclude Blob List takes priority. For example, if a blob named blob1 matches the syntax that you set in both Blob List and Exclude Blob List, the add-on excludes it because Exclude Blob List has the higher priority.
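The priority rule in the note above can be sketched in Python. This is an illustrative model only, with hypothetical blob names and a hypothetical helper function; the add-on's actual matching logic is internal to the modular input:

```python
import re

def should_collect(blob_name, include_patterns, exclude_patterns):
    """Return True if the blob matches an include pattern and no exclude pattern.

    Exclude patterns take priority over include patterns, mirroring the
    Blob List / Exclude Blob List behavior described in the note above.
    """
    included = any(re.fullmatch(p, blob_name) for p in include_patterns)
    excluded = any(re.fullmatch(p, blob_name) for p in exclude_patterns)
    return included and not excluded

include = [r".*"]              # collect everything by default
exclude = [r".*y=2020/.*"]     # but skip blobs from 2020

print(should_collect("y=2020/m=10/d=05/h=08/m=00/blob1.txt", include, exclude))  # False: excluded
print(should_collect("y=2022/m=10/d=05/h=08/m=00/blob1.txt", include, exclude))  # True: collected
```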
Optimal config settings for large size container and files¶
Use the following settings to manage containers that hold a large number of blobs or very large blobs:
Number of Worker Threads¶
This field is defined in the Global settings override section. It specifies the number of concurrent worker threads dedicated to retrieving data from individual blobs. The setting overrides the global configuration: when a value is set locally, for example for a particular input or instance, it takes precedence over the default or global setting that applies system-wide.
If a global default number of worker threads is set for inputs, but you specify a different number for a single input using the worker_threads_num parameter, the system uses the worker_threads_num value for that input instead of the global default. This allows more granular control and tuning on a per-input basis without changing the global configuration.
The default value is 10.
Adjust this value in small increments and monitor the ingestion logs and system memory usage to find the optimal balance for your specific environment.
Batch Size of Append Blobs¶
The Batch size of append blobs setting in the user interface lets you increase the batch size used when processing append blobs. You can adjust this value based on your CPU capacity to optimize performance. Although this field was originally designed specifically for append blobs, you can use it to monitor memory consumption rates as you increase the batch size.
In future releases, a dedicated batch size field for random blobs will be provided to offer more precise control.
Note
Increasing the batch size also increases memory consumption. This is because the batch size controls how many chunks of data are processed together, and larger batches require more memory to handle the data simultaneously.
Adjusting this setting carefully allows you to balance throughput and resource usage according to your environment’s capabilities.
Configure ingestion mode¶
Configure ingestion mode by selecting a blob mode that aligns with the blob type that you selected while creating the blob in your Azure storage account.
- On your Splunk platform deployment, navigate to the $SPLUNK_HOME/etc/apps/Splunk_TA_microsoft-cloudservices/local directory.
- Open the inputs.conf file with a text editor.
- Navigate to the stanza of the blob storage input that you created.
- Change the blob_mode attribute to append or random, based on the following table:
| blob_type \ ingestion_mode | Incremental | Full |
|---|---|---|
| append | blob_mode is irrelevant. You always receive incremental changes to your blob. | N/A |
| block or page | If you use a block blob to append data to the blob and only want the incremental changes, set blob_mode = append. | Set blob_mode = random. After a blob is complete or closed, the contents are ingested to the Splunk platform. |
- Save your changes.
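For instance, to switch a block blob input that receives appended data over to incremental collection, the edited stanza might look like this (the input name, account, and container are hypothetical placeholders):

```
[mscs_storage_blob://example_blob_input]
account = my_storage_account
container_name = my-container
blob_mode = append
```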
Advanced Configuration¶
The Allow Storage Blob Deletion option was introduced on the Configuration > Advanced tab in version 5.0.0 of the Splunk Add-on for Microsoft Cloud Services. This option allows the deletion of checkpoint files from the Splunk environment after migration to the KV store. Enable this option only after all Storage Blob inputs have been migrated to the KV store successfully and the system is stable.