Redirecting _internal for a large forwarder deployment

Sometimes this goes unnoticed because there is no license charge associated with the internal logs of Splunk's universal forwarders and, in some cases, heavy forwarders. In very large deployments these logs can be a significant portion of the storage used per day. Do you really need to keep those events around as long as the events associated with the Splunk Enterprise instances? Probably not.

Warning!

The following changes will disable the Splunk Monitoring Console's built-in forwarder monitoring feature. You can customize the searches, but be aware this is not upgrade safe.

Second Warning!

If you have any custom forwarder monitoring searches, dashboards, or alerts, they may be impacted.

Define an index

The index we need to define is _internal_forwarder. The following sample configuration will keep about 3 days of data from our forwarders; adjust according to need.

[_internal_forwarder]
maxWarmDBCount = 200
# 259200 seconds = 3 days of retention
frozenTimePeriodInSecs = 259200
quarantinePastSecs = 459200
homePath = $SPLUNK_DB/$_index_name/db
coldPath = $SPLUNK_DB/$_index_name/colddb
thawedPath = $SPLUNK_DB/$_index_name/thaweddb
# 43200 seconds = 12 hour maximum hot bucket span
maxHotSpanSecs = 43200
maxHotBuckets = 10

Change the index for internal logs

We need to create a new “TA” named “Splunk_TA_splunkforwarder”. We will CAREFULLY use the deployment server (DS) to push this to forwarders only. DO NOT push this to any Splunk Enterprise instance (CM/LM/MC/SH/IDX/deployer/DS), but you may push it to a “heavy” or intermediate forwarder. The app only needs two files in default: app.conf and inputs.conf.

#app.conf
[install]
state_change_requires_restart = true
is_configured = 0
state = enabled
build = 2

[launcher]
author = Ryan Faircloth
version = 1.0.0

[ui]
is_visible = 0
label = Splunk_UF Inputs

[package]
id = Splunk_TA_splunkforwarder

#inputs.conf
[monitor://$SPLUNK_HOME/var/log/splunk]
index = _internal_forwarder
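
To target forwarders only from the deployment server, here is a minimal serverclass.conf sketch; the server class name and whitelist pattern are hypothetical, so adjust them so they match only your universal forwarders:

#serverclass.conf (on the deployment server)
[serverClass:uf_internal_redirect]
# match UF hostnames only; make sure no indexers, search heads, or other
# Splunk Enterprise instances match this pattern
whitelist.0 = uf-*

[serverClass:uf_internal_redirect:app:Splunk_TA_splunkforwarder]
restartSplunkd = true
stateOnClient = enabled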

Check our Work

First, let's check the positive case: make sure the UFs have moved to the new index. We should get results.

index=_internal_forwarder source=*splunkforwarder*

Second, let's check the negative case: make sure only UF logs were moved. We should get no results.

index=_internal_forwarder source=*splunk* NOT source=*splunkforwarder*
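
Optionally, to see how much disk the new index is consuming, a quick sketch using the dbinspect command:

| dbinspect index=_internal_forwarder
| stats sum(sizeOnDiskMB) as total_mb by state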

Updates

  • Index definition example used “_internal” rather than “_internal_uf”
  • renamed app to “Splunk_TA_splunkforwarder”
  • renamed index to _internal_forwarder

Windows TA 6.0 is out!

Splunk released a major update to the Splunk TA for Windows last month. You may not have noticed, but I think you should take a closer look. A few key things:

  • Simplified deployment for new customers: Splunk merged the TA for Microsoft DNS and the TA for Microsoft AD into this TA
  • The improved support for “XML” format Windows events introduced in 5.0.1 is now the default in 6.0.0; there is an upgrade action to accept this switch. XML events allow extraction of additional valuable data, such as the restart reason from event ID 1074
  • Improved CIM compliance for Security events from modern logging channels like Remote Desktop Session
  • Improved extensibility: it is now much easier to add support for third-party logging via the Windows Event Log
  • Improved support for Windows Event Forwarding. Note: I still strongly discourage this solution for performance, reliability, and audit reasons.

If you are a SecKit for Windows user, it is safe to upgrade; just follow Splunk's upgrade instructions. Need some guidance on good practices for Windows data onboarding to Splunk? Be sure to check out SecKit.

But Change!

While this is not a replacement for the upgrade notes, you are probably wondering how this will impact your users.

  • sourcetype changes: Prepare for the upgrade by reviewing uses of sourcetype=wineventlog:* and replacing them with an appropriate eventtype OR source= (see the example after this list). With this TA version we use the source to differentiate between the specific event logs; the sourcetype, which represents the format of the log, becomes a constant regardless of log type. This reduces the memory used at index and search time.
  • License impact: XML is bigger, yes, but classic has whitespace, and that's not free either, and all that static text is gone. In my travels I have not seen much impact, if any, to license; it seems to be a wash.
  • XML logs are ugly: You are not wrong there. What can I say, it's Windows.
  • XML parsing is slower: Yes and no; overall, the switch from classic to XML is not much slower. The TA uses regex parsing, not “XML” parsing; while you see XML on screen, Splunk treats it like normal text. The changes implemented in the prior release (5.0.1) made improvements compared to 4.8.4; if your prior experience relates to that version, it is worth a second look.
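
For example, a search pinned to the classic sourcetype can usually be migrated by swapping sourcetype for source; the event code and eventtype name below are for illustration, and the exact source value depends on whether the input uses XML rendering, so confirm against the eventtypes shipped in your TA version:

Before: sourcetype=WinEventLog:Security EventCode=4624
After:  source=XmlWinEventLog:Security EventCode=4624
Or:     eventtype=wineventlog_security EventCode=4624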

Five things you can do now to get ready for Splunk SmartStore

Splunk's SmartStore technology is a game-changing advancement in data retention for Splunk Enterprise, allowing Splunk to move the least-used data to AWS S3 for low-cost “colder” storage.

Reduce the maximum size of a bucket

We will review indexes.conf on the indexer and identify any references to the setting maxDataSize. Common historical practice has been to increase this setting from the default of auto to an arbitrarily large value or to auto_high_volume. SmartStore is optimized for, and enforces, a maximum bucket size of “auto” (750MB). This task should be completed at least 7 days prior to cutover to SmartStore.
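
A minimal sketch of the change, using a hypothetical index named “oswinsec”:

#indexes.conf (indexer)
[oswinsec]
# remove any auto_high_volume or fixed-size overrides; SmartStore
# expects the default of auto (750MB)
maxDataSize = auto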

Reduce the maximum span of a bucket

We will review indexes.conf and identify all indexes which continuously stream data. Common historical practice is to leave this setting at its default, which is very wide; this increases the likelihood that a user will retrieve buckets from S3 that do not actually meet their needs. We will determine a value of maxHotSpanSecs that allows SmartStore to evict buckets that are not used while keeping the buckets likely to be used available. Often 1 day (86400s) is appropriate. Two questions guide the choice; a sample configuration follows the list.

  • What is the time window a typical search will use for this index, relative to now? i.e. 15 min, 1 day, 3 days, 1 week
  • What span of time would allow a set of buckets to contain the events for the user's search without excessive “extra” events? For example, if the span is 90 days and users primarily work with only 1 day's worth of events, then 89 days of events will use cache space in a wasteful way.
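
The sample configuration mentioned above, again using a hypothetical index named “oswinsec”:

#indexes.conf (indexer)
[oswinsec]
# cap hot bucket span at 1 day so cached buckets align with the
# time windows users actually search
maxHotSpanSecs = 86400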

Review Getting Data In problems impacting bucket use

Certain oversights in onboarding data into Splunk impact both the usability of the data and performance. Review and resolve any issues identified by the Splunk Monitoring Console's Data Quality page; the most important indicators of concern are

  • time stamp extraction
  • time zone detection
  • indexing latency (_indextime - _time)

One common source of “latency” is events from offline endpoints such as Windows laptops. Any endpoint that can spool events locally for an undetermined period of time and then forward old events should be routed to an index not used for normal streaming events. For example, “oswinsec” is the normal index I use for Windows Security events, but for endpoint monitoring I use “oswinsecep”.
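
To quantify the latency indicator, a quick sketch using the _indextime field; adjust the index filter and time range for your environment:

index=* earliest=-4h
| eval latency=_indextime - _time
| stats avg(latency) as avg_latency_s max(latency) as max_latency_s by index, sourcetype
| sort -max_latency_s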

Review bucket roll behavior

After the above activities are done, wait an hour before beginning this work. We should identify premature bucket roll behavior, that is, buckets regularly rolled from hot to warm for less-than-ideal reasons, using the following search:

index=_internal source=*splunkd.log component=HotDBManager evicting_count="*"
| stats max(maxHotBuckets) as maxHotBuckets count by idx
| sort -count

This search identifies indexes which are “high volume” and are rolling buckets due to the lack of an available hot bucket to index a new event in the correct relative order. For each index where maxHotBuckets is less than 10, increase the value of maxHotBuckets in indexes.conf to no more than 10. For these indexes, 10 is a safe value.
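
The corresponding change, again sketched with a hypothetical index named “oswinsec”:

#indexes.conf (indexer)
[oswinsec]
# allow up to 10 concurrent hot buckets so out-of-order events do not
# force premature hot-to-warm rolls
maxHotBuckets = 10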