Troubleshoot the Splunk Add-on for CrowdStrike FDR¶
For troubleshooting tips that you can apply to all add-ons, see Troubleshoot add-ons in Splunk Add-ons. For additional resources, see Support and resource links for add-ons in Splunk Add-ons.
Troubleshooting resources¶
To troubleshoot your forwarder setup, see “Troubleshoot the forwarder/receiver connection” in the Forwarding Data manual.
Monitor the troubleshooting dashboard¶
Starting in version 1.3.0, the add-on provides a monitoring dashboard that lets you quickly spot possible issues in the ingest process:
SQS message by event type¶
This time chart shows the SQS messages received by the add-on per hour. Based on the batch folder that each SQS message points to, the chart splits messages by color into the following kinds: data, aidmaster, managedassets, notmanaged, appinfo, and userinfo.
S3 located event files versus event files received in SQS notifications¶
This time chart uses information collected by the “CrowdStrike FDR S3 bucket monitor” input. It compares the list of batches found in your S3 bucket with the list of batches that the add-on receives in SQS messages. Batches that have not yet been received in SQS messages are marked as “Missed”. Depending on the size of the event backlog and the ingestion rate of your Splunk environment, the time chart typically shows the largest number of missed batches in the most recent hours and fewer missed batches for older ones. No missed batches should appear beyond a certain point in the past. If you do find missed batches far in the past, another unknown process may be consuming SQS events from the same CrowdStrike feed.
Batches seen by the add-on in SQS messages are marked as “Notified”. Missed and Notified batches are split by type, so in the time chart legend you can see, for example, “Missed aidmaster” or “Notified data”.
Bucket resources ingest by stage¶
This panel shows the ingestion stage of each event file received by the add-on in SQS messages. It helps you verify that events from those batch files appear in the Splunk index by checking the source property of ingested events (every CrowdStrike event has the URL of the S3 bucket resource it originates from as its source value). If events from an event file are not yet visible in the Splunk index, the add-on derives the ingestion stage of that file from the add-on logs (a manual per-file check sketch follows this list):
- Skipped: Means that the whole SQS message containing this event file was ignored. This can happen because this type of event was not selected for ingestion, or because the “Ignore SQS messages older than” parameter is configured and the SQS message is too old to be ingested.
- Scheduled: Means that the corresponding SQS message was not skipped. The event file was registered in the ingest journal, but no managed consumer input has started to ingest it yet.
- InProgress: Means that the event file was assigned to a managed consumer input, which has started ingesting it.
- Failed: Means that the managed consumer input failed to ingest the event file. If the error is recoverable (for example, a communication issue), the add-on retries ingesting this event file later.
- Ingested: Means that events from the batch files received in SQS messages appeared in the Splunk index.
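If you want to verify manually that events from a specific file reached the index, you can count indexed events by their source value, as described above. The following is a sketch with placeholder index, bucket, and batch path values; substitute the values from your SQS message or the add-on logs:
index=<your_crowdstrike_index> source="s3://<bucket>/data/<batch_id>/part-00000.gz"
| stats count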
Event ingestion average delay¶
Calculates the difference between the time the event was created by CrowdStrike and the time the event appeared in the Splunk ingest pipeline, and shows the average time per hour. A large average ingestion delay indicates a significant backlog of events to be ingested. Combined with a noticeable number of “Failed” event file ingestions, this can point to Splunk environment configuration and/or communication issues. If the time chart shows that the average ingestion delay is growing, then the environment most likely has insufficient resources. To confirm this, check the “Modular inputs average batch processing time (in seconds)” time chart and make sure that the average time divided by the number of running consuming modular inputs is less than seven minutes, the approximate interval at which CrowdStrike uploads the next event batch to the S3 bucket.
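For example, with hypothetical numbers: if the average batch processing time is 1200 seconds and 4 consuming modular inputs are running, then 1200 / 4 = 300 seconds (5 minutes), which is below the roughly 7-minute upload interval, so the environment is keeping up. A result above about 420 seconds would suggest that more ingest capacity (pipelines, inputs, or hosts) is needed.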
Ingested vs expected (missing and duplicated events)¶
This time chart compares the number of events reported by the add-on as sent to the Splunk ingest pipeline with the number of events it can find in the Splunk index, on a per-event-file basis. It shows whether any events are duplicated and whether any are missing. Ideally, all calculated values are zero and the time chart does not show negative or positive columns.
- Missing: A negative value showing the number of events found in the Splunk index minus the number of events sent by the add-on to the Splunk pipeline. This lets you see whether all events sent to the pipeline were actually ingested and helps you pinpoint potential errors in the Splunk ingest pipeline.
A small number of events may show as missing for a short period after ingestion, because the pipeline needs some time to ingest and index an event.
- Duplicated: Shows the number of events found in the Splunk index minus the number of events sent by the add-on to the Splunk pipeline, and can show whether the same event file was ingested more than once. Duplication indicates an ingestion interruption and can point to an issue or crash, or can be caused by a user changing the input configuration or manually restarting (disabling and then enabling) the input.
Modular inputs ingest rates (MB/hour)¶
This time chart shows the size of raw data, in megabytes, sent by each consuming modular input per hour. This number is similar to Splunk license consumption; however, it does not match it exactly because it does not take into account the size of internal index structures and index-time extracted fields.
Modular inputs ingest rates (files/hour)¶
This time chart shows how many event files have been processed by each consuming modular input per hour.
Modular inputs average batch processing time (in seconds)¶
This time chart shows how much time, in seconds, on average, it takes to ingest a batch of event files. It counts the time from the moment the add-on receives the SQS message pointing to the event batch to the time the last event file from this batch is successfully ingested. This means that if errors occur during an event file's ingestion, the batch processing time also includes the waiting period and the time required for another attempt to ingest this event file.
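To reproduce this panel's numbers ad hoc, a search sketch along the following lines charts the average batch processing time per hour. It relies on the BATCH processing summary log format shown later in this topic and extracts cs_batch_time_taken with rex; adjust the extraction if your field names differ:
index="_internal" sourcetype="crowdstrike_fdr_ta*" "BATCH processing summary:"
| rex "cs_batch_time_taken=(?<cs_batch_time_taken>[\d.]+)"
| timechart span=1h avg(cs_batch_time_taken) AS avg_batch_processing_seconds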
Troubleshoot event ingestion¶
If “CrowdStrike FDR SQS based S3 consumer” is running but you do not see new events appear in your index, try the following to diagnose and mitigate:
- Try making the search time window larger (the time range picker to the right of the search bar). Set it, for example, to seven days. Because the add-on assigns events the time of event creation, not the time of ingestion, ingested events can be several days old and may not fall within the default search time frame.
- Switch the search time frame to the last 15 or 60 minutes and run the following search:
index="_internal" sourcetype="crowdstrike_fdr_ta*"
By default the add-on is configured to log only informational and error messages, so this search should show you the latest logs and give you an idea of the Splunk Add-on for CrowdStrike FDR's activities. Here are examples of messages that you can find when you run this search:
cs_input_stanza=simple_consumer_input://my_input1, error='aws_error_message='Proxy connection error''
Indicates that the provided proxy configuration does not allow the add-on to communicate with the CrowdStrike AWS environment. Additional information about proxy settings can be found in log messages like
AWS proxy is disabled, aws_proxy=disabled
and
AWS proxy is enabled, aws_proxy=https://*****:*****@proxy.host.fqnd:765
FILE processing summary: cs_input_stanza=simple_consumer_input://my_input1, cs_file_time_taken=223.106, cs_file_path=data/d811c19e-7729-4c9b-abb8-357d539aa4a0/part-00000.gz, cs_file_size_bytes=24178342, cs_file_error_count=0
Indicates that one event file 'data/d811c19e-7729-4c9b-abb8-357d539aa4a0/part-00000.gz' of size 24178342 bytes was ingested by input 'my_input1' in 223.106 seconds with 0 errors during the process. (A search sketch that aggregates these messages appears at the end of this section.)
INGEST |< cs_input_stanza=simple_consumer_input://my_input1, cs_ingest_time_taken=229.321, cs_ingest_file_path=s3://crowdstrike-generated-big-batch-us-west-2/data/d811c19e-7729-4c9b-abb8-357d539aa4a0/part-00016.gz, cs_ingest_total_events=600540, cs_ingest_filter_matches=599705, cs_ingest_error_count=0
Indicates that input 'my_input1' consumed the S3 bucket file 's3://crowdstrike-generated-big-batch-us-west-2/data/d811c19e-7729-4c9b-abb8-357d539aa4a0/part-00016.gz' in 229.321 seconds, and that 599705 of the total 600540 events in this file matched the filter criteria and were sent to the Splunk index. Pay attention to the number of matching events: if it is 0 for all logged messages, check the selected filter, because it may be defined incorrectly.
BATCH processing summary: cs_input_stanza=simple_consumer_input://si1, cs_batch_time_taken=230.002, cs_batch_path=data/d811c19e-7729-4c9b-abb8-357d539aa4a0, cs_batch_error_count=0
Indicates that one event file batch located at 'data/d811c19e-7729-4c9b-abb8-357d539aa4a0' was ingested by the input in 230.002 seconds with 0 errors during the process.
simple_consumer_input://si1, Skipping batch fdrv2/aidmaster/d811c19e-7729-4c9b-abb8-357d539aa4a0 according to input configuration
Indicates that the whole batch was skipped because the input was configured not to ingest this kind of events. Only inventory events can be skipped like this.
simple_consumer_input://my_input1, Stopping input as EVENT WRITER PIPE IS BROKEN. The add-on will re-try to ingest failed file after AWS SQS visibility_timeout expires
Indicates that communication with the indexers was broken during the file ingestion process and input 'my_input1' has to shut down, to be started again by Splunk. In Splunk Cloud Platform, this often happens when you apply a new configuration to a running input or stop an input, because Splunk restarts or stops the input in response. If you cannot correlate this error message with a corresponding input enable, disable, or reconfigure action, check communication between the ingesting host (heavy forwarder, IDM, or search head) and the indexers.
- If none of the above messages appear, try switching the add-on logging level to DEBUG. Go to the Splunk Add-on for CrowdStrike FDR Configuration screen and select the Logging tab. Then select DEBUG in the logging level dropdown box and click Save. Restart the input to make it use the new logging level. Wait several minutes to let the add-on log new information, then re-run the search:
index="_internal" sourcetype="crowdstrike_fdr_ta*"
Look for the following messages to make sure that the add-on can successfully communicate with the AWS infrastructure:
<<< aws_error_code=AWS.SimpleQueueService.NonExistentQueue, aws_error_message='The specified queue does not exist for this wsdl version.'
Indicates that an AWS client error has taken place. aws_error_code and aws_error_message can vary depending on the exact AWS client issue.
<<< receive_sqs_messages_time_taken=0.940, receive_sqs_message_count=1
Indicates that a request for a new SQS message was sent and one message was returned. If the value of receive_sqs_message_count is 0, there are no messages in the SQS queue. Check that no other consumers are getting messages from this SQS queue. Also take into account that CrowdStrike FDR does not create new messages in SQS very often - one SQS message every 7-10 minutes - so you may have to wait for a new message to appear.
- Check for the following message:
<<< check_success_time_taken=0.934, found_SUCCESS=True
If found_SUCCESS is False, the event batch referenced by the received SQS message will be skipped and no ingestion takes place. To figure out which batch has failed the check, look for a preceding log message like:
>>> check_success_bucket=crowdstrike-generated-big-batch-us-west-2, check_success_bucket_prefix=data/d811c19e-7729-4c9b-abb8-357d539aa4a0
- If you see the message:
<<< download_file_time_taken=7.107, download_file_path=data/d811c19e-7729-4c9b-abb8-357d539aa4a0/part-00023.gz
then the add-on is able to successfully download message files from the S3 bucket.
- If you do not see any add-on logs, run the following searches:
index="_internal" traceback
and
index="_internal" ERROR
Look for error messages in the returned logs.
Cannot find the destination field 'ComputerName' in the lookup table
This error can indicate corruption of a CSV lookup used at index time, which can prevent ingestion of CrowdStrike events. If you see this error and new CrowdStrike events do not appear in the index, refer to “Recover index time host resolution lookup” below. For Splunk Cloud environments, contact Splunk Support.
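To get an aggregate view of per-file ingestion errors, the FILE processing summary messages shown above can be summarized with a search along these lines. This is a sketch, not part of the add-on: it assumes the cs_input_stanza and cs_file_error_count values are extracted from the raw log line with rex, so adjust the extractions if your field names differ:
index="_internal" sourcetype="crowdstrike_fdr_ta*" "FILE processing summary:"
| rex "cs_input_stanza=(?<cs_input_stanza>[^,]+)"
| rex "cs_file_error_count=(?<cs_file_error_count>\d+)"
| stats count AS files_processed sum(cs_file_error_count) AS total_file_errors BY cs_input_stanza
A non-zero total_file_errors for an input is a signal to inspect that input's logs more closely.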
Data is not going through all opened pipelines¶
This issue mostly occurs with multiple pipelines under heavy workload. Each ingesting modular input instance of the add-on can fully load a dedicated Splunk ingest pipeline. To increase throughput, you can increase the number of pipelines if the host hardware allows it, and increase the number of ingesting modular inputs as well so that each input has a dedicated ingest pipeline. However, inputs may not be distributed evenly among pipelines, and two or more of the add-on's modular inputs can become assigned to the same ingest pipeline. This results in a decreased ingestion rate while available resources are not fully used. As a possible solution, consider using the weighted_random value for the pipelineSetSelectionPolicy setting in the server.conf file, for example: pipelineSetSelectionPolicy=weighted_random
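For reference, a minimal server.conf sketch with this policy might look like the following. The [general] stanza placement and the parallelIngestionPipelines value shown here are illustrative; set the pipeline count according to what your host hardware can support:
[general]
parallelIngestionPipelines = 2
pipelineSetSelectionPolicy = weighted_random
A Splunk restart is generally required for server.conf changes to take effect.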
Recover index time host resolution lookup¶
CrowdStrike event ingestion can be blocked by corruption of the index time host resolution lookup CSV. As a result of the corruption, the index time lookup fails with the error message
Cannot find the destination field 'ComputerName' in the lookup table
logged to the _internal index. This corruption can result from running the “CrowdStrike FDR host information sync” input when it is configured with an incorrect source search head or with limited user access. In version 1.2.0 of the Splunk Add-on for CrowdStrike FDR, additional validation was added to the “CrowdStrike FDR host information sync” modular input code. This helps prevent damaging the CSV lookup with bad data received during the sync process. If the CSV lookup table has been corrupted, follow these steps to fix it:
- If it is still running, disable the “CrowdStrike FDR host information sync” input.
- On each heavy forwarder (IDM or search head in the case of a Splunk Cloud environment), locate the file Splunk_TA_CrowdStrike_FDR/lookups/crowdstrike_ta_index_time_host_resolution_lookup.csv under the Splunk etc/apps directory.
- Download Splunk_TA_CrowdStrike_FDR from Splunkbase and unpack it. Locate lookups/crowdstrike_ta_index_time_host_resolution_lookup.csv and replace the same file on the Splunk instance. Splunk reloads the updated CSV file automatically.
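After replacing the file, you can sanity-check the restored lookup with a search like the following. This is a sketch that assumes the lookup file from the steps above is visible to your search app and that the ComputerName column referenced by the error message is present in the packaged CSV:
| inputlookup crowdstrike_ta_index_time_host_resolution_lookup.csv
| head 5
Confirm that the returned rows include a ComputerName column.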
Data duplication¶
In a heavily loaded environment, batches can be processed more than once. This can happen when a message is not processed within the expected time or when an input job is interrupted.
Processed message is visible again in SQS queue¶
The visibility timeout makes the same SQS message visible again if the software that started processing it is unable to finish the processing or shut down gracefully. When the processing time for a single batch takes longer than the visibility timeout defined for the related SQS message, the message becomes visible in the queue again and other jobs can re-ingest the same data. This results in event duplication on the indexers (a search sketch for spotting duplicated event files follows the list below). To mitigate this:
- When you configure the Splunk Add-on for CrowdStrike FDR, the visibility timeout is set to six hours by default. This value is sufficient to ingest big event batches (300-400 files of up to 20 MB each), which is typical for heavily loaded environments with around 10 TB of raw event data per day. If your environment has a different amount of raw event data per day, determine the biggest batch and change the visibility timeout proportionally (the maximum allowed value is 12 hours, the minimum is five minutes). Decreasing the visibility timeout makes the SQS message return to the SQS queue faster, so it has more opportunities to be processed before it expires and is removed from the queue permanently.
- Scale out data collection horizontally by adding heavy forwarders (HF)/IDM and use fewer inputs on each HF/IDM.
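As referenced above, one way to spot duplicated event files is to count indexed events by their source value, which holds the originating S3 bucket resource URL. The following is a sketch with a placeholder index name; the crowdstrike:events:sensor sourcetype is the sensor event sourcetype mentioned later in this topic:
index=<your_crowdstrike_index> sourcetype="crowdstrike:events:sensor"
| stats count BY source
| sort - count
This alone does not prove duplication, but a source whose count is roughly a multiple of the cs_ingest_filter_matches value logged for that file is a strong hint that the file was ingested more than once.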
Visibility timeout troubleshooting¶
The Splunk Add-on for CrowdStrike FDR logs information to help you determine whether the selected visibility timeout is adequate for your environment. To check and adjust it:
- Run the search:
index="_internal" sourcetype="crowdstrike_fdr_ta*" "BATCH processing summary:"
to see the time taken to ingest event batches. You will see messages like
this:
2022-11-16 08:37:59,814 INFO pid=2228 tid=Thread-2 file=sqs_manager.py:finalize_batch_process:129 | BATCH processing summary: cs_input_stanza=sqs_based_manager://cs_feed1_sqs_man, cs_batch_time_taken=30.982, cs_batch_bucket=cs-prod-cannon-076270a656259f84-c33a6429, cs_batch_path=data/942fae26-bc6d-42cb-ae14-9e1eb84f761e, cs_batch_error_count=0
This tells you how much time, in seconds, it takes to process one event batch. Run this search with a sufficient time frame selected (you can select “All time”), find the largest ingest time taken, and update the visibility timeout setting to an equal or larger value. This improves the likelihood that all future event batches are processed within the visibility timeout. A search sketch that extracts the maximum observed batch processing time follows this list.
- If the add-on finishes processing an event file after visibility
timeout has expired, it logs a warning message like this:
ALERT: data/d811c19e-7729-4c9b-abb8-357d539aa4a0/part-00018.gz ingested 233.720 seconds after SQS message visibility timeout expiration
Run the following search to see these messages:
index="_internal" sourcetype="crowdstrike_fdr_ta*" "seconds after SQS message visibility timeout expiration"
Use the maximum value reported in these messages to adjust the input visibility timeout. Consider creating a Splunk alert based on this log message so that you are notified about visibility timeout expirations.
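As referenced in the first step above, a search sketch along these lines extracts the largest observed batch processing time; it assumes the cs_batch_time_taken value is pulled from the BATCH processing summary messages with rex:
index="_internal" sourcetype="crowdstrike_fdr_ta*" "BATCH processing summary:"
| rex "cs_batch_time_taken=(?<cs_batch_time_taken>[\d.]+)"
| stats max(cs_batch_time_taken) AS max_batch_time_seconds
Compare max_batch_time_seconds with your configured visibility timeout (in seconds) and increase the timeout if the two are close.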
Verify that there is no other process consuming SQS notifications from the CrowdStrike queue¶
Make sure that no other process is reading and deleting SQS notifications from the SQS queue used for ingesting CrowdStrike events into the Splunk index. If another process consumes them, the SQS notifications it receives will never reach the ingesting modular inputs, so a significant portion of the events collected during the corresponding 7-10 minute window will never be ingested and indexed. Such a process can run on any host capable of connecting to the AWS SQS queue dedicated to the CrowdStrike feed, so it can be hard to catch. Noticeable symptoms of such a process starting to work include a decrease in the frequency of received SQS messages and a decrease in the amount of ingested CrowdStrike data, which in turn decreases Splunk license usage. To help identify such a process, version 1.3.0 adds a new monitoring modular input called “CrowdStrike FDR S3 bucket monitor”. This modular input is optional and is needed only when monitoring is required. It reads all available CrowdStrike resources in the S3 bucket dedicated to the event feed and logs its findings. These logs can be found by running the following search:
index=_internal sourcetype=crowdstrike_fdr_ta* "FDR S3 bucket monitor, new event file detected:"
Information about the S3 bucket resources found can be compared with other add-on logs carrying information about the SQS messages received by the add-on:
index=_internal sourcetype=crowdstrike_fdr_ta* "is processing SQS messages: sqs_msg_bucket="
If this comparison shows that some event batches (folders) present in the S3 bucket are missing from the SQS messages received by the add-on, this can indicate another process stealing the notifying SQS messages. It can also be evidence of a growing backlog, so verify this assumption with additional checks. It is not necessary to run these searches manually: they are encapsulated in the add-on's monitoring dashboard as the “S3 located event files vs event files received in SQS notifications” time chart. For more details, see the documentation section dedicated to the monitoring and troubleshooting dashboard. To summarize, here are the steps required to spot an external process “stealing” CrowdStrike SQS messages from the SQS queue:
- Make sure “Crowdstrike FDR S3 bucket monitor” modular input is configured and running
- Give it time to collect and log information
- Switch to monitoring dashboard and analyze “S3 located event files vs event files received in SQS notifications” time chart
Interrupted input job¶
When indexer connectivity is lost, the affected SQS messages become available to process again after the configured visibility timeout expires. Data already ingested from the interrupted job remains on the indexer and is ingested again with the next processing attempt, which results in duplicates. Try to avoid unnecessary input reconfiguration and establish stable connections between your instances.
Troubleshooting host resolution¶
If you do not see the aid_computer_name field in search results over events with the crowdstrike:events:sensor sourcetype, then host resolution did not work. Use the following steps to troubleshoot the host resolution process:
Search time host resolution troubleshooting¶
- Search for all events with
sourcetype=crowdstrike:inventory:aidmaster
Aidmaster events are used as a source for host resolution. If no events are found, then no host information has been ingested and there is nothing to use for host resolution. In that case, check the “CrowdStrike FDR SQS based S3 consumer” modular input configuration to make sure aidmaster events have been chosen for ingestion. If you do see aidmaster events, narrow the search and try to find an unresolved record.
- Check the
aid_master
information for aid values in events. You should be able to find aidmaster events with the same aid. If this information is missing, then Splunk has not ingested the information needed to resolve all host names.
- Use an SPL search to check the lookup
crowdstrike_ta_host_resolution_lookup
Look for the aid inside this lookup (see the search sketch after this list).
- Find the saved search
crowdstrike_ta_build_host_resolution_table
and execute it manually in Splunk Web, then check the lookup again.
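A minimal sketch of the lookup check from the third step above, assuming the lookup is shared with your search app and that the column holding the agent ID is named aid (substitute a real aid value from your events):
| inputlookup crowdstrike_ta_host_resolution_lookup
| search aid="<aid_value_from_your_events>"
If the aid is missing here, run the crowdstrike_ta_build_host_resolution_table saved search as described in the last step and check again.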
Index time host resolution troubleshooting¶
If you configured and started “CrowdStrike FDR host information sync”
input for index time host resolution, run the search
index="_internal" sourcetype="crowdstrike_fdr_ta-inventory_sync_service"
for useful messages. Below is a list of log samples pointing to
successful and failed host information sync operations:
Inventory collection successfully synced. File size: 677 bytes. Records count: 2. Time taken: 0.5001018047332764 seconds.
Failed to retrieve collection
Inventory collection is not synced as source collection is empty
Inventory collection is not synced as source collection has unexpected formatting
Unexpected error when retrieving kvstore collection
Failed to authenticate to splunk instance
Inventory collection final rewrite failed with error
Always a good idea to check¶
Even when CrowdStrike ingestion seems to be running smoothly, there are several log messages that are good to check from time to time:
LineBreakingProcessor - Truncating line because limit of 150000 bytes has been exceeded
Indicates that an actual CrowdStrike event was longer than the maximum value configured in the TRUNCATE setting, so the event was truncated. For sensor events the TRUNCATE value is set to 150000, which CrowdStrike has confirmed is sufficient to handle all possible sensor events, but this can change in the future. It is better to search for this message without reference to the limit value, for example:
index=* "LineBreakingProcessor - Truncating line because limit of "
so that you are aware of any case of exceeding the limit.
Failed to parse timestamp in first MAX_TIMESTAMP_LOOKAHEAD
Another log message that is good to watch, to make sure event timestamps are extracted correctly and there are no unreadable values or datetime formats.
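By analogy with the truncation check above, a search sketch for spotting these timestamp warnings (assuming, as is typical for ingest pipeline messages, that they are written to the _internal index):
index=_internal "Failed to parse timestamp in first MAX_TIMESTAMP_LOOKAHEAD"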