Troubleshoot the Splunk Add-on for AWS¶
Use the following information to troubleshoot the Splunk Add-on for Amazon Web Services (AWS). For helpful troubleshooting tips that you can apply to all add-ons, see Troubleshoot add-ons and Support and resource links for add-ons in the Splunk Add-ons manual.
Data collection errors and performance issues¶
You can choose dashboards from the Health Check menu to troubleshoot data collection errors and performance issues. See AWS Health Check Dashboards for more information.
Internal logs¶
You can directly access internal log data for help with troubleshooting. Data collected with these source types is used in the Health Check dashboards.
| Data source | Source type |
| --- | --- |
| splunk_ta_aws_cloudtrail_cloudtrail_{input_name}.log | aws:cloudtrail:log |
| splunk_ta_aws_cloudwatch.log | aws:cloudwatch:log |
| splunk_ta_aws_cloudwatch_logs.log | aws:cloudwatchlogs:log |
| splunk_ta_aws_config_{input_name}.log | aws:config:log |
| splunk_ta_aws_config_rule.log | aws:configrule:log |
| splunk_ta_aws_inspector_main.log, splunk_ta_aws_inspector_app_env.log, splunk_ta_aws_inspector_proxy_conf.log, and splunk_ta_aws_inspector_util.log | aws:inspector:log |
| splunk_ta_aws_inspector_v2_main.log, splunk_ta_aws_inspector_v2_app_env.log, splunk_ta_aws_inspector_v2_proxy_conf.log, and splunk_ta_aws_inspector_v2_util.log | aws:inspector:v2:log |
| splunk_ta_aws_description.log | aws:description:log |
| splunk_ta_aws_metadata.log | aws:metadata:log |
| splunk_ta_aws_billing_{input_name}.log | aws:billing:log |
| splunk_ta_aws_generic_s3_{input_name} | aws:s3:log |
| splunk_ta_aws_logs_{input_name}.log. Each incremental S3 input has one log file with the input name in the log file name. | aws:logs:log |
| splunk_ta_aws_kinesis.log | aws:kinesis:log |
| splunk_ta_aws_sqs_based_s3_{input_name} | aws:sqsbaseds3:log |
| splunk_ta_aws_sns_alert_modular.log and splunk_ta_aws_sns_alert_search.log | aws:sns:alert:log |
| splunk_ta_aws_rest.log, populated by REST API handlers called when setting up the add-on or data inputs | aws:resthandler:log |
| splunk_ta_aws_proxy_conf.log, the proxy handler used in all AWS data inputs | aws:proxy-conf:log |
| splunk_ta_aws_s3util.log, populated by the S3, CloudWatch, and SQS connectors | aws:resthandler:log |
| splunk_ta_aws_util.log, a shared utilities library | aws:util:log |
Configure log levels¶
- Click Splunk Add-on for AWS in the navigation bar on Splunk Web.
- Click Configuration in the app navigation bar.
- Click the Logging tab.
- Adjust the log levels for each of the AWS services as needed by changing the default level of INFO to DEBUG or ERROR.
These log level configurations apply only to runtime logs. Some REST endpoint logs from configuration activity log at DEBUG, and some validation logs log at ERROR. These levels cannot be configured.
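For example, after raising the CloudWatch log level to DEBUG, a keyword search like the following against the internal index shows whether debug-level entries are being written. This is a minimal sketch; swap the source type for the service you changed, using the internal logs table above.
index=_internal sourcetype=aws:cloudwatch:log DEBUG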
Troubleshoot custom sourcetypes for SQS Based S3 inputs¶
Troubleshoot custom sourcetypes created with an SQS-based S3 input.
- If a custom sourcetype is used (for example, custom_sourcetype), it can be replaced. Perform the following steps:
  - Navigate to the Inputs page of the Splunk Add-on for AWS.
  - Create a new SQS-based S3 input, or edit an existing SQS-based S3 input.
  - Navigate to the Source Type input box and change the sourcetype name.
  - Save your changes.
- Adding a custom sourcetype does not split the events. To split events, perform the following steps:
  - Navigate to Splunk_TA_aws/local/.
  - Open props.conf with a text editor.
  - Add the following stanza:
    [custom_sourcetype]
    SHOULD_LINEMERGE = false
  - Save your changes.
Low throughput for the Splunk Add-on for AWS¶
If you do not achieve the expected AWS data ingestion throughput, follow these steps to troubleshoot the throughput performance:
- Identify the problem in your system.
- Adjust the factors affecting performance.
- Verify whether performance meets your requirements.
- Identify the problem in your system that prevents it from achieving a higher level of throughput performance. The problem in AWS data ingestion might be caused by one of the following components:
- The amount of data the Splunk Add-on for AWS can pull in through API calls
- The heavy forwarder’s capacity to parse and forward data to the indexer tier, which involves the throughput of the parsing, merging, and typing pipelines
- The index pipeline throughput
To troubleshoot the indexing performance on the heavy forwarder and indexer, refer to Troubleshooting indexing performance in the Capacity Planning Manual.
- Troubleshoot the performance of the problem component. If heavy forwarders or indexers are affecting performance, refer to the Summary of performance recommendations in the Splunk Enterprise Capacity Planning Manual. If the Splunk Add-on for AWS is affecting performance, adjust the following factors:
  - Parallelization settings. To achieve optimal throughput performance, set the value of parallelIngestionPipelines to 2 in the server.conf file if your resource capacity permits (a minimal server.conf sketch follows these steps). For information about parallelIngestionPipelines, see Parallelization settings in the Splunk Enterprise Capacity Planning Manual.
  - AWS data inputs. If you have sufficient resources, you can increase the number of inputs to improve throughput, but be aware that this also consumes more memory and CPU. Increase the number of inputs to improve throughput until memory or CPU runs short. If you are using SQS-based S3 inputs, you can horizontally scale data collection by configuring more inputs on multiple heavy forwarders to consume messages from the same SQS queue.
  - Number of keys in a bucket. For both the Generic S3 and Incremental S3 inputs, the number of keys or objects in a bucket can impact initial data collection performance. A large number of keys in a bucket requires more memory for S3 inputs in the initial data collection and limits the number of inputs you can configure in the add-on. If applicable, you can use a log file prefix to divide the keys in a bucket into smaller groups and configure different inputs to ingest them separately. For information about how to configure inputs to use a log file prefix, see Configure Generic S3 inputs for the Splunk Add-on for AWS. For SQS-based S3 inputs, the number of keys in a bucket is not a primary factor because data collection can be horizontally scaled out based on messages consumed from the same SQS queue.
  - File format. Compressed files consume much more memory than plain text files.
- When you resolve the performance issue, see if the improved performance meets your requirements. If not, repeat the previous steps to identify the next bottleneck in the system and address it until you're satisfied with the overall throughput performance.
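The following is a minimal sketch of the parallelization change described above, assuming the setting is placed in $SPLUNK_HOME/etc/system/local/server.conf on the heavy forwarder. Verify that your hardware can support two pipeline sets before applying it.
# server.conf on the heavy forwarder
[general]
# Run two ingestion pipeline sets if CPU and memory capacity permit.
parallelIngestionPipelines = 2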
Problem saving during account or input configuration¶
If you experience errors or trouble saving while configuring your AWS accounts on the setup page, go to $SPLUNK_HOME/etc/system/local/web.conf and change the following timeout setting:
[settings]
splunkdConnectionTimeout = 300
Problems deploying with a deployment server¶
If you use a deployment server to deploy the Splunk Add-on for Amazon Web Services to multiple heavy forwarders, you must configure the Amazon Web Services accounts using the Splunk Web setup page for each instance separately because the deployment server does not support sharing hashed password storage across instances.
S3 issues¶
Troubleshoot the S3 inputs for the Splunk Add-on for AWS.
S3 input performance issues¶
You can configure multiple S3 inputs for a single S3 bucket to improve performance. The Splunk platform dedicates one process for each data input, so provided that your system has sufficient processing power, you can improve performance with multiple inputs. See Hardware and software requirements for the Splunk Add-on for AWS.
To prevent indexing duplicate data, don’t overlap the S3 key names in multiple inputs against the same bucket.
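As an illustrative sketch, two Generic S3 input stanzas in the add-on's local/inputs.conf could read from the same bucket with non-overlapping key_name prefixes. The stanza form and the attributes other than key_name are assumptions based on a typical Generic S3 input, so copy the exact attribute names from an input you have already created in Splunk Web.
# Illustrative sketch; attribute names other than key_name are assumptions.
# Two inputs against the same bucket with non-overlapping key name prefixes.
[aws_s3://prod_logs_2023]
bucket_name = example-bucket
key_name = logs/2023/

[aws_s3://prod_logs_2024]
bucket_name = example-bucket
key_name = logs/2024/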
S3 key name filtering issues¶
Troubleshoot regex to fix filtering issues.
The deny list and allow list match against the full key name, not just the last segment. For example, the allow list .*abc/.* matches /a/b/abc/e.gz.
Your regular expression must match the full key name for both the whitelist and the blacklist. For example, if the directory in your bucket is cloudtrail/cloudtrail2, the desired file is under the path cloudtrail/cloudtrail2/abc.txt, and you want to ingest abc.txt, you need to specify both the key_name and the whitelist. The following example ingests any files under the path cloudtrail/cloudtrail2:
key_name = cloudtrail
whitelist = ^.*\/cloudtrail2\/.*$
- Watch “All My Regex’s Live in Texas” on Splunk Blogs.
- Read “About Splunk regular expressions” in the Splunk Enterprise Knowledge Manager Manual.
S3 event line breaking issues¶
If your indexed S3 data has incorrect line breaking, configure a custom source type in props.conf to control how the lines break for your events.
If S3 events are too long and get truncated, set TRUNCATE = 0 in props.conf to prevent truncation.
For more information, see Configure event line breaking in the Getting Data In manual.
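A minimal props.conf sketch follows, assuming a custom source type named custom_s3_sourcetype and newline-delimited events. The stanza name and the LINE_BREAKER pattern are illustrative and must match your data.
# Illustrative stanza; adjust the name and LINE_BREAKER to your data.
[custom_s3_sourcetype]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
TRUNCATE = 0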
S3 event Access Denied issue¶
For a configured SQS-based S3 input in versions 6.0.0 and later of the Splunk Add-on for AWS, if you encounter the error botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied, verify the following:
- Check whether the S3 bucket has versioning enabled.
- If versioning is enabled for the S3 bucket, add the s3:GetObjectVersion permission to the account associated with the S3 bucket (see the example policy statement after this list).
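The following is an illustrative IAM policy statement that adds the permission. The bucket name is a placeholder, and your existing policy likely already grants s3:GetObject and other actions.
{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:GetObjectVersion"
  ],
  "Resource": "arn:aws:s3:::<your_bucket_name>/*"
}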
CloudWatch configuration issues¶
Troubleshoot your CloudWatch configuration.
API throttling issues¶
If you have a high volume of CloudWatch data, search
index=_internal Throttling
to determine if you are experiencing an API
throttling issue. If you are, contact AWS support to increase your
CloudWatch API rate. You can also decrease the number of metrics you
collect or increase the granularity of your indexed data in order to
make fewer API calls.
Granularity¶
If the granularity of your indexed data does not match your expectations, check that your configured granularity falls within what AWS supports for the metric you have selected. Different AWS metrics support different minimum granularities, based on the allowed sampling period for that metric. For example, CPUUtilization has a sampling period of 5 minutes, whereas Billing Estimated Charge has a sampling period of 4 hours.
If you configured a granularity that is less than the sampling period for the selected metric, the reported granularity in your indexed data reflects the actual sampling granularity but is labeled with your configured granularity. Clear the CloudWatch stanza in local/inputs.conf that has the problem, adjust the granularity configuration to match the supported sampling granularity so that newly indexed data is correct, and reindex the data.
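As an illustrative sketch only, a CloudWatch stanza in local/inputs.conf might carry its granularity in a period-style attribute expressed in seconds. The stanza name and the attribute name below are assumptions, so edit the attribute that already exists in your stanza rather than adding this one.
# Assumed stanza and attribute names; 300 seconds matches a metric sampled every 5 minutes.
[aws_cloudwatch://ec2_cpu_metrics]
period = 300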
CloudTrail data indexing problems¶
If you are not seeing CloudTrail data in the Splunk platform, follow this troubleshooting process.
- Review the internal logs with the following search: index=_internal source=*cloudtrail*
- Verify that the Splunk platform is connecting to SQS successfully by searching for the string Connected to SQS.
- Verify that the Splunk platform is processing messages successfully. Look for strings with the following pattern: X completed, Y failed while processing notification batch. Copy-ready versions of these searches appear after this list.
- Review your Amazon Web Services configuration to verify that SQS messages are being placed into the queue. If messages are being removed and the logs do not show that the input is removing them, then there might be another script or input consuming messages from the queue. Review your data inputs to ensure there are no other inputs configured to consume the same queue.
- Go to the AWS console to view CloudWatch metrics with the detail set to 1 minute to view the trend. For more details, see https://aws.amazon.com/blogs/aws/amazon-cloudwatch-search-and-browse-metrics-in-the-console/. If you see messages consumed but no Splunk platform inputs are consuming them, check for remote services that might be accessing the same queue.
- If your AWS deployment contains large S3 buckets with a large number of subdirectories for 60 or more AWS accounts, perform one of the following tasks:
  - Enable SQS notification for each S3 bucket and switch to an SQS-based S3 input. This lets you add multiple copies of the input for scaling purposes.
  - Split your inputs into one bucket per account and use multiple incremental inputs.
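Copy-ready versions of the verification searches referenced in the steps above. These are minimal sketches; the source filter assumes the CloudTrail log file naming shown in the internal logs table.
index=_internal source=*cloudtrail* "Connected to SQS"
index=_internal source=*cloudtrail* "failed while processing notification batch"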
Billing Report issues¶
Troubleshoot the Splunk Add-on for AWS Billing inputs.
Problems accessing billing reports from AWS¶
If you have problems accessing billing reports from AWS, ensure that:
- There are Billing Reports available in the S3 bucket you select when you configure the billing input.
- The AWS account you specify has the permission to read the files inside that bucket.
Problems understanding the billing report data¶
If you have problems understanding the billing report data, access the saved searches included with the add-on to analyze billing report data.
Problems configuring the billing data interval¶
The default collection intervals for billing report data are designed to minimize license usage. Review the default behavior and make adjustments with caution.
Configure the interval by which the Splunk platform pulls Monthly and Detailed Billing Reports:
- In Splunk Web, go to the Splunk Add-on for AWS inputs screen.
- Create a new Billing input or click to edit your existing one.
- Click the Settings tab.
- Customize the value in the Interval field.
SNS alert issues¶
Because the modular input module is inactive, it cannot check whether the AWS account is correctly configured or whether the topic exists in AWS SNS. If you cannot send a message to the AWS SNS account, perform the following procedures:
- Ensure the SNS topic name exists in AWS and the region ID is correctly configured.
- Ensure the AWS account is correctly configured in Splunk Add-on for AWS.
If you still have the issue, use the following search to check the log for AWS SNS:
index=_internal sourcetype=aws:sns:alert:log
Proxy settings for VPC endpoints¶
When using a proxy with VPC endpoints, check the proxy setting defined in the splunk-launch.conf file located at $SPLUNK_HOME/etc/splunk-launch.conf. You must add each S3 region endpoint to the no_proxy setting, using the correct hostname for your region: s3.<your_aws_region>.amazonaws.com. The no_proxy setting does not allow any spaces between the IP addresses. For example:
no_proxy = 169.254.169.254,127.0.0.1,s3.amazonaws.com,s3.ap-southeast-2.amazonaws.com
Certificate verify failed (_ssl.c:741) error message¶
If you create a new input, you might receive the following error
message:
certificate verify failed (_ssl.c:741)
Perform the following steps to resolve the error:
- Navigate to $SPLUNK_HOME/etc/auth/cacert.pem and open the cacert.pem file with a text editor.
- Copy the text from your deployment's proxy server certificate and paste it into the cacert.pem file.
- Save your changes.
Internet restrictions prevent add-on from collecting AWS data¶
If your deployment has a security policy that doesn't allow connection to the public internet from AWS virtual private clouds (VPCs), this might prevent the Splunk Add-on for AWS from collecting data from CloudWatch inputs, S3 inputs, and other inputs that depend on access to AWS services.
To identify this issue in your deployment:
- Check if you have a policy that restricts outbound access to the public Internet from your AWS VPC.
- Identify whether you have error messages showing that your attempts to connect to sts.amazonaws.com result in a timeout. For example:
ConnectTimeout: HTTPSConnectionPool(host='sts.amazonaws.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<botocore.awsrequest.AWSHTTPSConnection object at 0x7fdfd97bc350>, 'Connection to sts.amazonaws.com timed out. (connect timeout=60)'))
To fix this issue in your deployment:
- Your VPC endpoint interface needs to be set up in your AWS environment. See the AWS documentation for details regarding VPC endpoints.
- Update the Splunk instance that is being used for data collection to use your VPC endpoint as a gateway to allow connections to be established to your AWS services:
  - In your Splunk instance, navigate to ./etc/apps/Splunk_TA_aws/bin/3rdparty/botocore/data/endpoints.json and open it using a text editor.
  - Update the hostname to use the hostname of your VPC endpoint interface. For example:
    Before:
    "sts": { "defaults": { "credentialScope": { "region": "us-east-1" }, "hostname": "sts.amazonaws.com"
    After:
    "sts" : { "defaults" : { "credentialScope" : { "region" : "us-east-1" }, "hostname" : "<Enter VPC endpoint Interface DNS name here>"
  - Save your changes.
- Restart your Splunk instance.
- Validate that the connection to your VPC has been established, for example with the search shown after these steps.
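As a quick validation, confirm that the connection timeout errors shown earlier are no longer being written to the add-on's internal logs. The search below is a minimal sketch; the source wildcard assumes the splunk_ta_aws_* log file naming used throughout this topic.
index=_internal source=*splunk_ta_aws* "ConnectTimeout"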
Failed to load input and configuration page when running the Splunk software on a custom management port¶
If the Splunk software fails to load the input and configuration pages while running on a custom management port (for example, <IP>:<CUSTOM_PORT>), perform the following troubleshooting steps.
- Navigate to $SPLUNK_HOME/etc/.
- Open splunk-launch.conf using a text editor.
- Add the environment variable SPLUNK_MGMT_HOST_PORT=<IP>:<CUSTOM_PORT>. An example line is shown after these steps.
- Save your changes.
- Restart your Splunk instance.
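For example, if your instance runs on the hypothetical address 192.0.2.10 with management port 9089, the added line would look like the following. Both values are placeholders for your own address and port.
SPLUNK_MGMT_HOST_PORT=192.0.2.10:9089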
Amazon Kinesis Firehose error exceptions¶
See Data Not Delivered to Splunk in the AWS documentation.
Amazon Kinesis Firehose data delivery errors¶
You can view the error logs related to Kinesis Firehose data delivery failure using the Kinesis Firehose console or CloudWatch console. See the Accessing CloudWatch Logs for Kinesis Firehose section in the Monitoring with Amazon CloudWatch Logs topic from the AWS documentation.
SSL-related data delivery errors¶
Amazon Kinesis Firehose requires the HTTP Event Collector (HEC) endpoint to be terminated with a valid CA-signed certificate matching the DNS hostname used to connect to your HEC endpoint. If you see the error message "Could not connect to the HEC endpoint. Make sure that the HEC endpoint URL is valid and reachable from Kinesis Firehose," your SSL certificate might not be valid.
Test whether your SSL certificate is valid by opening your HEC endpoint in a web browser. If you are using a self-signed certificate, you receive a certificate error in your browser (for example, a certificate warning page in Google Chrome).
Amazon Kinesis Firehose Error: “Received event for unconfigured/disabled/deleted index” but indexer acknowledgement is returning positives¶
If you see this error in messages or logs, edit your HEC token configurations to send data to an index that is able to accept data.
If indexer acknowledgment for your Amazon Kinesis Firehose data is successful but your data is not successfully indexed, the data may have been dropped by the parsing queue as an unparseable event. This is expected behavior when data is processed successfully in the input phase but cannot be parsed due to a logical error. For example, if the HTTP event collector is routing data to an index that has been deleted or disabled, the Splunk platform will still accept the data and begin processing it, which triggers indexer acknowledgment to confirm receipt. However, the parsing queue cannot pass the data to the index queue because the specified index is not available, thus the data does not appear in your index. For more information about the expected behavior of the indexer acknowledgment feature, see About HTTP Event Collector Indexer Acknowledgment.
If you suspect events have been dropped, search your “last chance” index, if you have one configured. If you are on Splunk Cloud, contact Splunk Support if you do not know the name of your last chance index. If you are on Splunk Enterprise, see the lastChanceIndex setting in indexes.conf for more information about the behavior of the last chance index feature and how to configure it.
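For Splunk Enterprise, a minimal indexes.conf sketch follows. lastChanceIndex is a global setting, and the index name shown is an illustrative placeholder that must refer to an index that already exists.
# Illustrative placeholder index name; the target index must already exist.
lastChanceIndex = last_chance_events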
Troubleshoot performance with the Splunk Monitoring Console¶
For Splunk Cloud Platform, see Introduction to the Cloud Monitoring Console. For Splunk Enterprise, see About the Monitoring Console.
Queue fill dashboard¶
If you are experiencing performance issues with your HEC server, you may need to increase the number of HEC-enabled indexers to which your events are sent.
Use the Monitoring Console to determine the queue fill pattern. Follow these steps to check whether your indexers are at capacity.
Steps:
- Navigate to either Monitoring Console > Indexing > Performance > Indexing Performance: Deployment or Monitoring Console > Indexing > Performance > Indexing Performance: Instance.
- From the Median Fill Ratio of Data Processing Queues dashboard, select Indexing queue from the Queue dropdown and 90th percentile from the Aggregation dropdown.
- (Optional) Set a Platform Alert to get a notification when one or more of your indexer queues reports a fill percentage of 90% or more. This alert can inform you of potential indexing latency.
  - From paid Splunk Cloud, navigate to Settings > Searches, reports, and alerts and select Monitoring Console in the app filter. Find the SIM Alert - Abnormal State of Indexer Processor platform alert, and click Edit > Enable to enable the alert.
  - From the Splunk Enterprise Monitoring Console Overview page, click Triggered Alerts > Enable or Disable and then select the Enabled checkbox next to the SIM Alert - Abnormal State of Indexer Processor platform alert.

See determine queue fill pattern for an example of a healthy and unhealthy queue.
HTTP Event Collector dashboards¶
The Monitoring Console also comes with pre-built dashboards for monitoring the HTTP Event Collector. To interpret the HTTP event collector dashboards information panels correctly, be aware of the following:
- The Data Received and Indexed panel shows data as “indexed” even when the data is sent to a deleted or disabled index. Thus, this graph shows the data that is acknowledged by the indexer acknowledgment feature, even if that data is not successfully indexed. See the Error: ‘Received event for unconfigured/disabled/deleted index’ but indexer acknowledgment is returning positive section of this topic for more information about the expected behavior of the indexer acknowledgment feature when the index is not usable.
- The Errors panel is expected to show a steady stream of errors under normal operation. These errors occur because Amazon Kinesis Firehose sends empty test events to check that the authentication token is enabled, and the HTTP event collector cannot parse these empty events. Filter the Errors panel by Reason to help find significant errors.
For more information about the specific HTTP event collector dashboards, see HTTP Event Collector dashboards.
The HTTP event collector dashboards show all indexes, even if they are disabled or have been deleted.
Amazon Kinesis Firehose Kinesis timestamp issues¶
If your Kinesis events are ingested with the wrong timestamp, perform the following troubleshooting steps to disable the Splunk software’s timestamp extraction feature.
- Stop your Splunk instance.
- Navigate to $SPLUNK_HOME/etc/apps/Splunk_TA_aws/local.
- Open the props.conf file using a text editor.
- In the props.conf file, locate the stanza for the Kinesis sourcetype. If it doesn't exist, create one with the Kinesis sourcetype.
- Inside the Kinesis sourcetype stanza, add DATETIME_CONFIG = NONE. A minimal example stanza appears after these steps.
- Save your changes.
- Restart your Splunk instance.
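A minimal props.conf sketch of the result, assuming the Kinesis data arrives with the aws:firehose:json source type. Substitute the source type your Kinesis input actually uses.
# Illustrative stanza name; use the Kinesis sourcetype configured for your data.
[aws:firehose:json]
DATETIME_CONFIG = NONE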
Metadata WAFv2 API “The scope is not valid” error¶
If you encounter the following error in the logs:
botocore.errorfactory.WAFInvalidParameterException: An error occurred (WAFInvalidParameterException) when calling the ListLoggingConfigurations operation: Error reason: The scope is not valid., field: SCOPE_VALUE, parameter: CLOUDFRONT
For the following APIs, select the “us-east-1” (N. Virginia) region, because “us-east-1” is the only region supported by these APIs.
- wafv2_list_available_managed_rule_group_versions_cloudfront
- wafv2_list_logging_configurations_cloudfront
- wafv2_list_ip_sets_cloudfront
Metadata Input - Data is not getting collected for “s3_buckets” and “iam_users” API¶
If data is not getting collected for the s3_buckets and iam_users APIs when using the Metadata input, select the region that was enabled on the AWS side when you originally created the input. The s3_buckets and iam_users APIs are global APIs and use the first selected region for all of their API calls.
Config Rules input - Data is not getting collected¶
If data is not getting collected for the Config Rules input, check whether the following error message appears in the log file:
botocore.errorfactory.ValidationException: An error occurred (ValidationException) when calling the DescribeConfigRules operation: 1 validation error detected: Value '[....]' at 'configRuleNames' failed to satisfy constraint: Member must have length less than or equal to 25
CloudWatch Logs ModInput Ingestion Delay¶
The CloudWatch Logs modular input uses the lastEventTimestamp field from the response of the AWS boto3 SDK method describe_log_streams to determine whether new data is available for ingestion. As described in the AWS boto3 describe_log_streams API response documentation, the lastEventTimestamp value is updated on an eventual consistency basis. It typically updates within an hour of ingestion into the CloudWatch log stream, but sometimes it takes more than one hour. For more details, see the describe_log_streams Boto3 API documentation. Because of this AWS behavior, delay is expected in the modular input. To avoid ingestion delay, use the push-based mechanism to ingest CloudWatch Logs data.
Facing issues with aws:firehose:json sourcetype extractions¶
If you are collecting data with the aws:firehose:json sourcetype by configuring HEC on a search head instead of on an IDM/HF in a Classic Splunk Cloud Platform environment, then, due to partitioned builds, not all of the extractions will be present on your search head. Add the required extractions to your desired sourcetype from your search head's UI (Settings > Source types). This pushes the changes to the indexers as well.
Getting warning message while collecting metric data through VPC Flow Logs input¶
If you have configured the VPC Flow Logs input to use a metric index and you encounter a warning in the Splunk message tray or in the splunkd logs stating The metric event is not properly structured, source=xyz, sourcetype=metric, host=xyz, index=xyz. Metric event data without a metric name and properly formated numerical values are invalid and cannot be indexed. Ensure the input metric data is not malformed, have one or more keys of the form “metric_name:
then make sure that the data you are trying to ingest is VPC Flow Logs data; otherwise, it will not be ingested into the metric index.