Index time vs search time JSON field extractions¶
Overview¶
CrowdStrike FDR uploads events to a dedicated S3 bucket in JSON format, and you can configure a sourcetype stanza for automatic JSON parsing and field extraction either at index time or at search time.
- Index time automatic field extraction extracts field name-value pairs only once, when events enter the Splunk ingest pipeline. Extracted properties and values are then stored in the Splunk index along with the raw event itself. This saves resources and speeds up searches later, since these fields and values no longer require extraction with every search. Index time extraction uses more index space and Splunk license and should typically be configured only if transient data, such as an IP address or hostname, would otherwise be lost, or if the logs will be used in multiple searches.
- Search time automatic field extraction runs with every search, which avoids using additional index space but increases the resources and time required for searches to complete.
Default configuration¶
Historically, the Splunk Add-on for CrowdStrike FDR is configured to perform index time automatic extractions. This is implemented using the following settings under the corresponding sourcetype stanza (these can also be seen in the add-on sourcetype configurations via the Splunk user interface):
INDEXED_EXTRACTIONS = json
KV_MODE = none
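As a sketch, the relevant part of such a sourcetype stanza in props.conf might look like the following (the stanza name crowdstrike:events:sensor is illustrative, not necessarily one of the add-on's real sourcetype names):

```
[crowdstrike:events:sensor]
INDEXED_EXTRACTIONS = json
KV_MODE = none
```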
The benefits of the default configuration:
- Sensor events collected under the S3 bucket data folder belong to several sourcetypes. Historically, sourcetype assignment was implemented using index-time transformations and required the event JSON to be parsed in order to access the properties used for the sourcetype selection decision.
- Searches over security events are often parts of various detection dashboards and visualizations, so repeated searches over the same events are very likely, and index time extractions can be more beneficial.
- CrowdStrike sensors generate a huge amount of data, up to tens of thousands of events per device per minute. Extracting and indexing the events' JSON fields enables using them in TSTATS searches, which are many times faster than regular STATS searches.
As of version 1.3.0, sourcetype assignment is fully implemented in the modular input, and index time JSON extraction is no longer a requirement. If points two and three in the above list do not bring sufficient benefits compared to the Splunk license savings, you can reconfigure the add-on for search-time-only extractions.
Switch from index time to search time JSON fields extractions¶
Turning off index-time JSON extraction will not remove indexed properties from old (already ingested) events. However, turning on search time extractions will cause field extraction duplication for the old events (fields extracted at index time plus the same fields extracted at search time). As a result, field types will change from atomic types (number, string, etc.) to multi-value types. This will break the CIM extractions implemented in the add-on and the custom extractions used in your searches for all events ingested before the configuration change.
Avoid changing this configuration for some of the crowdstrike:inventory:* sourcetypes, because the TSTATS command is used to build KV store lookups for host resolution (crowdstrike:inventory:aidmaster sourcetype) and for host IP and MAC address resolution (crowdstrike:inventory:managedassets sourcetype). Turning off index time JSON extractions can affect the results of these TSTATS based saved searches.
Reconfigure using Splunk user interface¶
- In the menu select Settings, then click the Source types item.
- In the App dropdown list, select Splunk Add-on for CrowdStrike FDR to see only the add-on dedicated sourcetypes.
- Click the Sourcetype you want to adjust.
- In the Advanced tab, locate the INDEXED_EXTRACTIONS property and click the button next to the field value to delete the field.
- Locate the KV_MODE property and change its value from none to json.
- Click Save.
- Make sure these changes are applied on all Splunk hosts where this add-on is installed.
Reconfigure using Splunk props.conf¶
- In the folder where the Splunk Add-on for CrowdStrike FDR is installed, find the “local” folder. If it does not exist, create it.
- Inside the local folder, find the “props.conf” file. If it does not exist, create it.
- Inside the props.conf file, locate the desired sourcetype stanza. If it does not exist, create it.
- Assign an empty value to the INDEXED_EXTRACTIONS property:
INDEXED_EXTRACTIONS =
- Assign the value json to the KV_MODE property:
KV_MODE = json
- Save your file and restart Splunk.
- Apply these changes to all Splunk hosts where this add-on is installed.
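Putting the steps above together, a minimal local/props.conf might look like this (the stanza name crowdstrike:events:sensor is illustrative; use the actual sourcetype you are reconfiguring):

```
[crowdstrike:events:sensor]
INDEXED_EXTRACTIONS =
KV_MODE = json
```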
Initial estimation of CrowdStrike FDR data volume¶
It is possible to estimate the volume of CrowdStrike FDR data before configuring the ingesting modular inputs. The CrowdStrike FDR S3 bucket monitor provides all the necessary information for this purpose. As described earlier, this modular input connects directly to the CrowdStrike FDR dedicated S3 bucket and logs information about all resources located there. A typical log event looks like the following:
FDR S3 bucket monitor, new event file detected: fdr_scan_checkpoint="None", fdr_bucket=bucket-name, fdr_event_batch=fdrv2/userinfo/e1acbcb4-a016-4dd1-9c2f-3b5eaaa79817, fdr_event_file=fdrv2/userinfo/e1acbcb4-a016-4dd1-9c2f-3b5eaaa79817/part-00000.gz, fdr_event_source=s3://bucket-name/fdrv2/userinfo/e1acbcb4-a016-4dd1-9c2f-3b5eaaa79817/part-00000.gz, fdr_event_file_last_modified="2022-09-12 07:58:33+00:00", fdr_event_file_size=647
It shows information about a single file containing events. By analyzing the fdr_event_file, fdr_event_file_last_modified, and fdr_event_file_size values, you can understand the distribution of incoming data volumes during a day or a wider period of time. For example, the following search can be used:
index=_internal sourcetype=crowdstrike_fdr_ta* "FDR S3 bucket monitor" | eval event_type_split=split(fdr_event_file, "/") | eval _time = strptime(fdr_event_file_last_modified, "%Y-%m-%d %H:%M:%S%z") | timechart sum(fdr_event_file_size) span=1h
Additionally, the data can be split by event type (data, aidmaster, userinfo, …) as in the example search below:
index=_internal sourcetype=crowdstrike_fdr_ta* "FDR S3 bucket monitor" | eval event_type_split=split(fdr_event_file, "/") | eval event_type=if(mvindex(event_type_split, 0) == "fdrv1", mvindex(event_type_split, 1),mvindex(event_type_split, 0)) | eval _time = strptime(fdr_event_file_last_modified, "%Y-%m-%d %H:%M:%S%z") | timechart sum(fdr_event_file_size) by event_type span=1h
Note that the fdr_event_file_size property value is the size of the event file as it is stored in the S3 bucket, i.e., it is the size of the compressed data. A compressed event file 25 MB in size can turn into 650 MB of uncompressed data and contain around 700,000 events on average.
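As a rough illustration of this arithmetic, the expansion from compressed file sizes to uncompressed volume and event counts can be sketched as follows. The 26x ratio and ~700,000 events per 25 MB file are the averages quoted above, not guaranteed constants:

```python
# Rough sizing estimate based on the averages quoted above:
# a ~25 MB compressed file expands to ~650 MB and holds ~700,000 events.
COMPRESSION_RATIO = 650 / 25             # ~26x expansion when decompressed
EVENTS_PER_COMPRESSED_MB = 700_000 / 25  # ~28,000 events per compressed MB

def estimate(compressed_mb: float) -> tuple[float, int]:
    """Return (estimated uncompressed MB, estimated event count)."""
    uncompressed_mb = compressed_mb * COMPRESSION_RATIO
    events = int(compressed_mb * EVENTS_PER_COMPRESSED_MB)
    return uncompressed_mb, events

# Example: a day's worth of files totaling 1,000 MB compressed
mb, events = estimate(1000)
print(f"~{mb:,.0f} MB uncompressed, ~{events:,} events")
```

Feeding the summed fdr_event_file_size values from the searches above into such an estimate gives a first approximation of daily license impact before any inputs are configured.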
Initial setup and scaling¶
Overview¶
The following steps are required to start ingesting CrowdStrike events using Splunk Add-on for CrowdStrike FDR.
- Install the Splunk Add-on for CrowdStrike FDR in order to create the FDR AWS Collection. Specify connection information for the CrowdStrike FDR feed located in the AWS environment.
- Configure a filter that allows you to ingest only the events you need. This is an optional step because the add-on already has one predefined filter that drops all heartbeat events. On the CrowdStrike FDR side, you can also configure a filter that controls which collected events should be sent to your FDR feed. This makes the amount of data stored in the AWS S3 bucket smaller, which saves additional resources when the add-on downloads, unpacks, and scans event files being ingested.
- Configure modular inputs to start ingesting CrowdStrike events. You can use a direct or distributed ingestion architecture, and therefore one or another set of ingesting modular input types:
1. Crowdstrike FDR SQS based S3 consumer is a modular input responsible for all the steps of the ingest process, from getting the next SQS notification to ingesting all files in the corresponding event batch, one by one. Use this for PoC environments and CrowdStrike environments generating up to 1-2 TB of events per day in Splunk license usage.
2. Crowdstrike FDR SQS based manager and Crowdstrike FDR managed S3 consumer. These inputs split the responsibilities of monitoring the SQS queue and ingesting events. Crowdstrike FDR SQS based manager:
- takes care of the SQS queue
- gets new batches of event files when needed
- keeps the journal of received and ingested files
- updates checkpoints
- tells available consumer inputs which event file to ingest.
The Crowdstrike FDR managed S3 consumer input ingests event files requested by the manager input.
The manager and worker modular inputs require a KV store cluster to communicate properly if they run on different hosts. A KV store cluster is available by default in the Splunk Cloud Victoria search head cluster; in other configurations, you must set it up manually.
- If possible, start by creating a single input instance of each modular input type belonging to the selected architecture. This means you should either create a single Crowdstrike FDR SQS based S3 consumer input for the direct architecture, or create one input each of Crowdstrike FDR SQS based manager and Crowdstrike FDR managed S3 consumer if the distributed architecture is selected.
- Windows OS is not supported as an ingester host for this add-on.
- Once you have configured your selected inputs, check your ingestion and determine whether it needs additional resources to consume all events. You can do this using the Splunk Add-on for CrowdStrike FDR monitoring dashboard. In Splunk Web, go to the Inputs > Configuration and Search tab and check the values shown there. To calculate resources, you can use the Modular input's average batch processing time (in seconds).
- New CrowdStrike event batches arrive in the dedicated AWS S3 bucket approximately every 7 minutes (420 seconds). Factoring this into the average time the add-on spends ingesting a single event batch, you can estimate how many ingesting input processes (Crowdstrike FDR SQS based S3 consumer or Crowdstrike FDR managed S3 consumer) are needed.
- For the Crowdstrike FDR SQS based S3 consumer input, ingesting a batch is an effort of a single modular input process. To figure out the minimal required number of inputs, divide the average batch ingest time by 420 and round up the result.
- For Crowdstrike FDR managed S3 consumer, ingesting a batch is an effort of all inputs of this type, so the minimal number of required ingest input processes can be calculated as the product of the average batch ingestion time and the current number of ingest input processes, divided by 420 and rounded up.
- To create your calculated resources, you can:
- Create modular inputs on separate Splunk hosts
- Create all of your inputs on a single powerful Splunk instance
- Spread them between hosts in any proportion depending on the resources they have.
- If you plan to run several ingesting input processes on the same host, make sure the host has sufficient processing resources and a proper number of parallel ingestion pipelines configured. Take into account that a single ingesting input process can fully load one Splunk ingestion pipeline and that each Splunk ingestion pipeline can use up to 6 vCPUs. So, for example, if you plan to use two ingesting input processes on the same Splunk host, the host should have at least two parallel ingestion pipelines configured (parallelIngestionPipelines=2 in host server.conf) and at least 12 vCPUs dedicated to ingestion.
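For example, a host intended to run two ingesting input processes could carry the following setting in server.conf (shown here as a sketch; the exact file location depends on whether you manage the host directly or via a deployment app):

```
[general]
parallelIngestionPipelines = 2
```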
Note the following for Splunk Cloud Victoria:
- To increase the number of ingest pipelines, contact the Splunk Cloud support team to request an exception.
- The CrowdStrike FDR SQS based S3 consumer and Crowdstrike FDR managed S3 consumer modular inputs are configured by default so that Splunk runs each created input on each cluster search head. So, for example, if your search head cluster has three hosts and you configure a single CrowdStrike FDR managed S3 consumer input, Splunk runs three ingesting processes.
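The sizing arithmetic described above can be sketched as follows. This is a minimal illustration assuming the 420 second batch arrival interval stated earlier; the function names are illustrative, not part of the add-on:

```python
import math

BATCH_INTERVAL_SECONDS = 420  # new event batches arrive roughly every 7 minutes

def required_direct_consumers(avg_batch_ingest_seconds: float) -> int:
    """Minimal number of 'SQS based S3 consumer' inputs (direct architecture):
    each batch is handled entirely by a single input process."""
    return math.ceil(avg_batch_ingest_seconds / BATCH_INTERVAL_SECONDS)

def required_managed_consumers(avg_batch_ingest_seconds: float,
                               current_consumers: int) -> int:
    """Minimal number of 'managed S3 consumer' inputs (distributed architecture):
    ingesting a batch is a joint effort of all current consumer inputs."""
    return math.ceil(avg_batch_ingest_seconds * current_consumers
                     / BATCH_INTERVAL_SECONDS)

# Examples: a batch takes 900 s for one direct consumer -> 3 inputs needed;
# 600 s with 2 managed consumers already running -> 3 inputs needed.
print(required_direct_consumers(900))       # 3
print(required_managed_consumers(600, 2))   # 3
```

If the result exceeds your current input count, add inputs (or hosts) until the calculated minimum is met, keeping the pipeline and vCPU requirements above in mind.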
Index time host resolution¶
Overview¶
In version 1.5.0, a new modular input, “CrowdStrike Device API Inventory Sync Service”, was introduced, which allows users to perform index time host resolution on Splunk Cloud Platform (SCP) stacks. Users can now choose between two types of index time host resolution: “Inventory events” and “CrowdStrike Device API”.
Inventory events index time host resolution¶
Advantages of Inventory events index time host resolution:
- Host information can be used for search time host resolution
Disadvantages of Inventory events index time host resolution:
- Host information may arrive with a delay
- Enrichment of events at ingest time increases pipeline load by 10%-20%, depending on the resolution table size
- Corruption of the host resolution table (CSV lookup) breaks ingestion
- Extra events need to be collected (aidmaster and managedassets) in order to make host resolution work
This variant of index time host resolution is not supported by Splunk Cloud Platform Stacks (SCP).
CrowdStrike Device API index time host resolution¶
Advantages of CrowdStrike Device API index time host resolution:
- Improved ingest performance, because it does not use pipeline resources
- Comes with a new type of filter, where users can specify the required host information fields
- No risk of corrupting a CSV lookup table at index time, as none is used
- Comes with a new modular input that also allows specifying bucket check intervals
Disadvantages of CrowdStrike Device API index time host resolution:
- Data collected from the Device API can’t be used in search time host resolution
This variant of index time host resolution is supported by Splunk Cloud Platform Stacks (SCP) and by Splunk Enterprise.