Configure Generic S3 inputs for the Splunk Add-on for AWS

Versions 6.2.0 and higher of the Splunk Add-on for AWS include a UI warning message when you configure a new Generic S3 input or edit or clone an existing input. A warning message is also logged while the data input is enabled.

Complete the steps to configure Generic S3 inputs for the Splunk Add-on for Amazon Web Services (AWS):

  1. You must manage accounts for the add-on as a prerequisite. See Manage accounts for the Splunk Add-on for AWS.
  2. Configure AWS services for the Generic S3 input.
  3. Configure AWS permissions for the Generic S3 input.
  4. (Optional) Configure VPC Interface Endpoints for STS and S3 services from your AWS Console if you want to use private endpoints for data collection and authentication. For more information, see the Interface VPC endpoints (AWS PrivateLink) topic in the Amazon Virtual Private Cloud documentation.
  5. Configure Generic S3 inputs either through Splunk Web or configuration files.

Configuration prerequisites

To pull uncollected data from an S3 bucket, the Generic S3 input lists all the objects in the bucket and examines each file's modified date every time it runs. When the number of objects in a bucket is large, this can be a very time-consuming process with low throughput.

Before you begin configuring your Generic S3 inputs, be aware of the following expected behaviors:

  1. You cannot edit the initial scan time parameter of an S3 input after you create it. If you need to adjust the start time of an S3 input, delete it and recreate it.
  2. The S3 data input is not intended to read frequently modified files. If a file is modified after it has been indexed, the Splunk platform indexes the file again, resulting in duplicated data. Use key, blocklist, and allowlist options to instruct the add-on to index only those files that you know will not be modified later.
  3. The S3 data input processes compressed files according to their suffixes. Use these suffixes only if the file is in the corresponding format; otherwise, data processing errors occur. The data input supports the following compression types:
    • single file in ZIP, GZIP, TAR, or TAR.GZ formats
    • multiple files with or without folders in ZIP, TAR, or TAR.GZ format

Expanding compressed files requires significant operating system resources.

  4. The Generic S3 custom data types input processes delimited files (.csv, .psv, .tsv) according to the status of the fields parse_csv_with_header and parse_csv_with_delimiter. The data input supports the following compression types:

    • Single file in ZIP, GZIP, TAR, or TAR.GZ formats.
    • Multiple files with or without folders in GZIP, TAR, or TAR.GZ formats.
    • CSV parsing within a TAR file might fail if binary files (._) exist within the TAR. A TAR file created on Mac OS contains binary files (._) packaged with your CSV files. These files are not processed and throw an error.

    Delimited file parsing prerequisites if parse_csv_with_header is enabled

    • The Generic S3 custom data types input processes delimited files (.csv, .psv, .tsv) according to the status of the fields parse_csv_with_header and parse_csv_with_delimiter.

      • When parse_csv_with_header is enabled, all files ingested by the input, whether delimited or not, will be processed as if they were delimited files with the value of parse_csv_with_delimiter used to split the fields. The first line of each file will be considered the header.
      • When parse_csv_with_header is disabled, events will be indexed line by line without any CSV processing.

      • The field parse_csv_with_delimiter is a comma by default, but you can change it to any single-character delimiter that is not alphanumeric, a single quote, or a double quote.
      • Ensure that each delimited file contains a header. The CSV parsing functionality takes the first non-empty line of the file as the header before parsing.
      • Ensure that every file ends with a carriage return. Otherwise, the last line of the CSV file is not indexed.
      • Ensure that there are no duplicate values in the header of the CSV file(s) to avoid missing data.
      • Set the polling interval to the default of 1800 seconds or higher to avoid data duplication or incorrect parsing of CSV file data.
      • Some illegal sequences of string characters throw a UnicodeDecodeError. For example, VI,Visa,Cabela�s

    Processing outcomes

    • The end result after CSV parsing is a JSON object that maps the header values to the values in each subsequent row. See the example after this list.
  5. The Splunk platform auto-detects the character set used in your files among these options:

    • UTF-8 with or without BOM
    • UTF-16LE/BE with BOM
    • UTF-32BE/LE with BOM

    If your S3 key uses a different character set, you can specify it in inputs.conf using the character_set parameter and separate out this collection job into its own input. Mixing non-autodetected character sets in a single input causes errors.
  6. If your S3 bucket contains a very large number of files, you can configure multiple S3 inputs for a single S3 bucket to improve performance. The Splunk platform dedicates one process for each data input, so provided that your system has sufficient processing power, performance improves with multiple inputs. See “Performance for the Splunk Add-on for AWS data inputs” in Sizing, Performance, and Cost Considerations for the Splunk Add-on for AWS for details.

To prevent indexing duplicate data, verify that multiple inputs do not collect the same S3 folder and file data.

  7. As a best practice, archive your S3 bucket contents when you no longer need to actively collect them. AWS charges for the list key API calls that the input uses to scan your buckets for new and changed files, so you can reduce costs and improve performance by archiving older S3 keys to another bucket or storage type.

  8. After configuring an S3 input, you might need to wait a few minutes before new events are ingested and can be searched. The wait time depends on the number of files in the S3 buckets from which you are collecting data. The larger the quantity of files, the longer the delay. Also, more verbose logging levels cause longer data ingestion times.
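
As an illustration of the processing outcome described above, the following sketch shows how a delimited file might be indexed when parse_csv_with_header is enabled and parse_csv_with_delimiter is a comma. The object name, header fields, and values are hypothetical.

Contents of a hypothetical S3 object named transactions.csv:

id,vendor,amount
1001,Visa,42.50
1002,Mastercard,17.25

Resulting events, one JSON object per data row, with the header values mapped to the row values:

{"id": "1001", "vendor": "Visa", "amount": "42.50"}
{"id": "1002", "vendor": "Mastercard", "amount": "17.25"}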

Configure AWS services for the Generic S3 input

To collect access logs, configure logging in the AWS console to collect the logs in a dedicated S3 bucket. See the AWS documentation for more information on how to configure access logs.

Refer to the AWS S3 documentation for more information about how to configure S3 buckets and objects: http://docs.aws.amazon.com/gettingstarted/latest/swh/getting-started-create-bucket.html

Configure S3 permissions

Required permissions for S3 buckets and objects:

  • ListBucket
  • GetObject
  • ListAllMyBuckets
  • GetBucketLocation

Required permissions for KMS:

  • Decrypt

In the Resource section of the policy, specify the Amazon Resource Names (ARNs) of the S3 buckets from which you want to collect S3 Access Logs, CloudFront Access Logs, ELB Access Logs, or generic S3 log data.

See the following sample inline policy to configure S3 input permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:ListAllMyBuckets",
                "s3:GetBucketLocation",
                "kms:Decrypt"
            ],
            "Resource": "*"
        }
    ]
}
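
The sample policy above grants access to all resources. As a sketch of the more restrictive approach described earlier, you can limit the Resource element to the ARNs of the buckets you collect from and, if your objects are KMS-encrypted, to the relevant KMS key. The bucket name, account ID, and key ID below are placeholders:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::your-log-bucket",
                "arn:aws:s3:::your-log-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "kms:Decrypt",
            "Resource": "arn:aws:kms:us-east-1:111122223333:key/your-key-id"
        }
    ]
}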

For more information and sample policies, see http://docs.aws.amazon.com/AmazonS3/latest/dev/using-iam-policies.html.

Configure a Generic S3 input using Splunk Web

To configure inputs in Splunk Web, click Splunk Add-on for AWS in the navigation bar on Splunk Web home, then choose one of the following menu paths depending on which data type you want to collect:

  • Create New Input > CloudTrail > Generic S3
  • Create New Input > CloudFront Access Log > Generic S3
  • Create New Input > ELB Access Logs > Generic S3
  • Create New Input > S3 Access Logs > Generic S3
  • Create New Input > Custom Data Type > Generic S3
  • Create New Input > Custom Data Type > Generic S3 > aws:s3:csv sourcetype

Make sure you choose the right menu path corresponding to the data type you want to collect. The system automatically sets the appropriate source type and may display slightly different field settings in the subsequent configuration page based on the menu path.

Use the following table to complete the fields for the new input in the .conf file or in Splunk Web:

Argument in configuration file

Field in Splunk Web

Description

aws_account

AWS Account

The AWS account or EC2 IAM role the Splunk platform uses to access the keys in your S3 buckets. In Splunk Web, select an account from the drop-down list. In inputs.conf, enter the friendly name of one of the AWS accounts that you configured on the Configuration page or the name of the automatically discovered EC2 IAM role.
If the region of the AWS account you select is GovCloud, you might encounter errors such as "Failed to load options for S3 Bucket". You need to manually add AWS GovCloud Endpoint in the S3 Host Name field. See http://docs.aws.amazon.com/govcloud-us/latest/UserGuide/using-govcloud-endpoints.html for more information.

aws_iam_role

Assume Role

The IAM role to assume. See Manage accounts for the Splunk Add-on for AWS.

aws_s3_region

AWS Region (Optional)

The AWS region that contains your bucket. In inputs.conf, enter the region ID.
Provide an AWS Region only if you want to use specific regional endpoints instead of public endpoints for data collection.
See the AWS service endpoints topic in the AWS General Reference manual for more information.

private_endpoint_enabled

Use Private Endpoints

Check the checkbox to use private endpoints of the AWS Security Token Service (STS) and Amazon Simple Storage Service (S3) for authentication and data collection. In inputs.conf, enter 0 or 1 to disable or enable the use of private endpoints, respectively.

s3_private_endpoint_url

Private Endpoint (S3)

Private Endpoint (Interface VPC Endpoint) of your S3 service, which can be configured from your AWS console.
Supported Formats:
<http|https>://bucket.vpce-<endpoint-id>-<unique-id>.s3.<region>.vpce.amazonaws.com
<http|https>://bucket.vpce-<endpoint-id>-<unique-id>-<availability-zone>.s3.<region>.vpce.amazonaws.com
See the example after this table.

sts_private_endpoint_url

Private Endpoint (STS)

Private Endpoint (Interface VPC Endpoint) of your STS service, which can be configured from your AWS console.
Supported Formats:
<http|https>://vpce-<endpoint-id>-<unique-id>.sts.<region>.vpce.amazonaws.com
<http|https>://vpce-<endpoint-id>-<unique-id>-<availability-zone>.sts.<region>.vpce.amazonaws.com

bucket_name

S3 Bucket

The AWS bucket name.

log_file_prefix

Log File Prefix/S3 Key Prefix

Configure the prefix of the log file. This add-on searches the log files under this prefix. This argument is titled Log File Prefix in incremental S3 field inputs, and is titled S3 Key Prefix in generic S3 field inputs.

log_start_date

Start Date/Time

The start date of the log.

log_end_date

End Date/Time

The end date of the log.

sourcetype

Source Type

A source type for the events. Specify only if you want to override the default of aws:s3. You can select a source type from the drop-down list or type a custom source type yourself. To index access logs, enter aws:s3:accesslogs, aws:cloudfront:accesslogs, or aws:elb:accesslogs, depending on the log types in the bucket. To index CloudTrail events directly from an S3 bucket, change the source type to aws:cloudtrail.

index

Index

The index name where the Splunk platform puts the S3 data. The default is main.

ct_blacklist

CloudTrail Event Blacklist

Only valid if the source type is set to aws:cloudtrail. A Perl Compatible Regular Expression (PCRE) that specifies event names to exclude. The default regex is ^(?:Describe|List|Get) to exclude read-only events that can produce a high volume of data. Leave it blank if you want all data to be indexed.

blacklist

Blacklist

A regular expression to indicate the S3 paths that the Splunk platform should exclude from scanning. The regex should match the full path. For example, to exclude .conf files and files that end in .bin, use the regex .*(\.conf$|\.bin$).

polling_interval

Polling Interval

The number of seconds to wait before the Splunk platform runs the command again. The default is 1,800 seconds.

parse_csv_with_header

Parse all files as CSV

If selected, all files are parsed as delimited files, with the first line of each file treated as the header. Leave this checkbox cleared for delimited files that do not have a header. For new Generic S3 inputs, this setting is disabled by default.

Supported values:

  • 1 = enabled
  • 0 = disabled (default)

parse_csv_with_delimiter

CSV field delimiter

The delimiter must be one character. The character cannot be alphanumeric, a single quote, or a double quote. For tab-delimited files, use \t. By default, the delimiter is a comma.
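
For reference, the following sketch shows how the private endpoint settings described in the table might look in inputs.conf. The endpoint ID, unique ID, and region in these URLs are hypothetical placeholders; copy the actual DNS names from the VPC endpoint details in your AWS console.

# Hypothetical values; replace with the endpoint DNS names from your AWS console.
private_endpoint_enabled = 1
s3_private_endpoint_url = https://bucket.vpce-0a1b2c3d4e5f67890-abcd1234.s3.us-east-1.vpce.amazonaws.com
sts_private_endpoint_url = https://vpce-0a1b2c3d4e5f67890-abcd1234.sts.us-east-1.vpce.amazonaws.com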

Configure a Generic S3 input using configuration files

When you configure inputs manually in inputs.conf, create a stanza using the following template and add it to $SPLUNK_HOME/etc/apps/Splunk_TA_aws/local/inputs.conf. If the file or path does not exist, create it.

[aws_s3://<name>]
is_secure = <whether to use a secure connection to AWS>
host_name = <the host name of the S3 service>
aws_account = <AWS account used to connect to AWS>
aws_s3_region = <value>
private_endpoint_enabled = <value>
s3_private_endpoint_url = <value>
sts_private_endpoint_url = <value>
bucket_name = <S3 bucket name>
polling_interval = <Polling interval for statistics>
key_name = <S3 key prefix>. For example, key_name = cloudtrail. This value does not accept regex.
recursion_depth = <For folder keys, -1 == unconstrained>
initial_scan_datetime = <Splunk relative time>
terminal_scan_datetime = <Only S3 keys which have been modified before this datetime will be considered. Using datetime format: %Y-%m-%dT%H:%M:%S%z (for example, 2011-07-06T21:54:23-0700).>
log_partitions = AWSLogs/<Account ID>/CloudTrail/<Region>
max_items = <Max trackable items.>
max_retries = <Max number of retry attempts to stream incomplete items.>
whitelist = <Override regex for the allow list when using a folder key.> A regular expression to indicate the S3 paths that the Splunk platform should include in scanning. The regex should match the path, starting from the folder name. For example, to include the contents of a folder named Test, provide the regex Test/.*
blacklist = <Keys to ignore when using a folder key.> A regular expression to indicate the S3 paths that the Splunk platform should exclude from scanning. The regex should match the path, starting from the folder name. For example, to exclude the contents of a folder named Test, provide the regex Test/.*
ct_blacklist = <The blocklist to exclude CloudTrail events. Only valid when the sourcetype is manually set to aws:cloudtrail.>
ct_excluded_events_index = <name of index to put excluded events into. default is empty, which discards the events>
aws_iam_role = <AWS IAM role to be assumed>

Under one AWS account, to ingest logs in different prefixed locations in the bucket, you need to configure multiple AWS data inputs, one for each prefix name. Alternatively, you can configure one data input but use different AWS accounts to ingest logs in different prefixed locations in the bucket.
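
For example, the following sketch shows two inputs that collect from different prefixes of the same bucket. The input names, account name, bucket name, and prefixes are hypothetical:

# Hypothetical example; replace the names, bucket, and prefixes with your own.
[aws_s3://s3_cloudtrail_logs]
aws_account = my_aws_account
bucket_name = my-example-bucket
key_name = AWSLogs/cloudtrail/
sourcetype = aws:cloudtrail
index = main

[aws_s3://s3_elb_logs]
aws_account = my_aws_account
bucket_name = my-example-bucket
key_name = AWSLogs/elb/
sourcetype = aws:elb:accesslogs
index = main

Because the key_name prefixes do not overlap, the two inputs do not collect the same objects, which prevents indexing duplicate data.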

Some of these settings have default values that can be found in $SPLUNK_HOME/etc/apps/Splunk_TA_aws/default/inputs.conf:

[aws_s3]
aws_account =
sourcetype = aws:s3
initial_scan_datetime = default
log_partitions = AWSLogs/<Account ID>/CloudTrail/<Region>
max_items = 100000
max_retries = 3
polling_interval=
interval = 30
recursion_depth = -1
character_set = auto
is_secure = True
host_name = s3.amazonaws.com
ct_blacklist = ^(?:Describe|List|Get)
ct_excluded_events_index =
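
To override any of these defaults for a particular input, set the corresponding attributes in that input's stanza in $SPLUNK_HOME/etc/apps/Splunk_TA_aws/local/inputs.conf. For example, the following sketch, with hypothetical input, account, and bucket names, collects files in a character set that is not autodetected and polls less frequently. Substitute a character_set value that matches your data:

# Hypothetical example; replace the names and values with your own.
[aws_s3://s3_legacy_logs]
aws_account = my_aws_account
bucket_name = my-legacy-bucket
character_set = ISO-8859-1
polling_interval = 3600
sourcetype = aws:s3
index = main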