Let's Encrypt and get an A for a great Splunk TLS config

Setting up SSL/TLS on Splunk doesn’t have to be hard or costly. Running Splunk with a cloud provider has many benefits, but it also brings hassles like provisioning certificates, which we can manage more easily using Let’s Encrypt. This method of installing browser-trusted certificates can help keep your administrative costs down in large Splunk deployments such as MSSP services.

This expands on prior work: https://www.splunk.com/blog/2016/08/12/secure-splunk-web-in-five-minutes-using-lets-encrypt.html

NGINX

First we are going to install NGINX to use as a front-end reverse proxy. Why? It lets us renew our certs with minimal downtime in the future, enables OCSP stapling (improved page load times), and opens the door to other things (future posts).

#centos

yum install nginx

#ubuntu

apt-get install nginx

Second, set up a new vhost for the Splunk reverse proxy. Any HTTP request will be redirected to HTTPS except for requests related to certificate management (the ACME challenge path).

map $uri $redirect_https {

    /.well-known/                      0;

    default                            1;

}

server {

    listen       80;

    server_name  hf-scan.splunk.example.com;

    root /usr/share/nginx/html;

    if ($redirect_https = 1) {

       return 301 https://$server_name$request_uri;

    }

#    return       301 $scheme://hf-scan.splunk.example.com$request_uri;

}

server {

    

    listen 443 ssl http2;

    server_name hf-scan.splunk.example.com;

    root /usr/share/nginx/html;

    index index.html index.htm;

   location / {

        proxy_pass_request_headers on;

        proxy_set_header X-Real-IP $remote_addr;

        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        proxy_set_header Host $host;

        proxy_pass https://127.0.0.1:8000;

        add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;

      }

    

    

    ssl_certificate     /etc/letsencrypt/live/hf-scan.splunk.example.com/fullchain.pem;

    ssl_certificate_key /etc/letsencrypt/live/hf-scan.splunk.example.com/privkey.pem;

    ssl_protocols       TLSv1.2;

    ssl_ciphers         HIGH:!aNULL:!MD5;

    ssl_dhparam /etc/nginx/ssl/dhparam.pem;

    ssl_session_cache shared:SSL:50m;

    ssl_session_timeout 1d;

    ssl_session_tickets off;

    ssl_prefer_server_ciphers on;

    ssl_stapling on;

    ssl_stapling_verify on;

    resolver 8.8.8.8 8.8.4.4 valid=300s;

    resolver_timeout 5s;

    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;

}
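
The server block above references /etc/nginx/ssl/dhparam.pem, which is not created by default. A minimal sketch to generate it and then validate and reload the NGINX configuration (paths assumed from the config above):

mkdir -p /etc/nginx/ssl
openssl dhparam -out /etc/nginx/ssl/dhparam.pem 2048
nginx -t && systemctl reload nginx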

Set up a deploy hook script. This will prepare the cert files the way Splunk needs them and will also run on every renewal. Save this script as /etc/letsencrypt/renewal-hooks/deploy/splunk.sh

#!/bin/bash
#deploy to /etc/letsencrypt/renewal-hooks/deploy/splunk.sh
#when requesting a cert add "--deploy-hook /etc/letsencrypt/renewal-hooks/deploy/splunk.sh" to the command
dir=/opt/splunk/etc/auth/ssl
if [[ ! -e $dir ]]; then
    mkdir -p $dir
elif [[ ! -d $dir ]]; then
    echo "$dir already exists but is not a directory" 1>&2
    exit 1
fi
# Re-encrypt the private key for inclusion in server.pem. Splunk's default key password
# is "password"; if you change it, set the matching sslPassword in server.conf.
openssl rsa -aes256 -in "$RENEWED_LINEAGE/privkey.pem" -out "$dir/protected.pem" -passout pass:password
if [[ ! -f $dir/protected.pem ]]; then
    exit 1
fi
cat $dir/protected.pem $RENEWED_LINEAGE/fullchain.pem > $dir/server.pem
cp $RENEWED_LINEAGE/fullchain.pem $dir/
cp $RENEWED_LINEAGE/privkey.pem $dir/
chown splunk:splunk $dir/*
systemctl restart splunk
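
After the hook runs you can sanity-check the certificate chain it copied into place (assuming the default path used above):

openssl x509 -in /opt/splunk/etc/auth/ssl/fullchain.pem -noout -subject -dates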

Request the certificate. Note: correct the webroot folder for your platform and replace the certificate name with the FQDN of your server.

certbot certonly --webroot -w /var/www/html --hsts -d hf-scan.splunk.example.com --noninteractive --agree-tos --email your@example.com --deploy-hook /etc/letsencrypt/renewal-hooks/deploy/splunk.sh

Setup Splunk

Update /opt/splunk/etc/system/local/web.conf

[settings]

enableSplunkWebSSL = true

#sendStrictTransportSecurityHeader = true

sslVersions = tls1.2

cipherSuite = TLSv1.2:!NULL-SHA256:!AES128-SHA256:!ADH-AES128-SHA256:!ADH-AES256-SHA256:!ADH-AES128-GCM-SHA256:!ADH-AES256-GCM-SHA384

privKeyPath =  /opt/splunk/etc/auth/ssl/privkey.pem

caCertPath = /opt/splunk/etc/auth/ssl/fullchain.pem

Update /opt/splunk/etc/system/local/server.conf

[general]

serverName = hf-scan.splunk.example.com

[sslConfig]

sslVersions = tls1.2

sslVersionsForClient = tls1.2

serverCert = $SPLUNK_HOME/etc/auth/ssl/server.pem

sslRootCAPath = $SPLUNK_HOME/etc/auth/ssl/fullchain.pem

dhFile = /opt/splunk/etc/auth/ssl/dhparam.pem

sendStrictTransportSecurityHeader = true

allowSslCompression = false

cipherSuite = TLSv1.2:!NULL-SHA256:!AES128-SHA256:!ADH-AES128-SHA256:!ADH-AES256-SHA256:!ADH-AES128-GCM-SHA256:!ADH-AES256-GCM-SHA384

useClientSSLCompression = false

useSplunkdClientSSLCompression = false
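
server.conf above points dhFile at a dhparam.pem that nothing has created yet. A minimal sketch to generate it and restart Splunk so both configuration files take effect (assuming the paths above):

openssl dhparam -out /opt/splunk/etc/auth/ssl/dhparam.pem 2048
chown splunk:splunk /opt/splunk/etc/auth/ssl/dhparam.pem
/opt/splunk/bin/splunk restart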

Test

  • Option 1: SSL Labs, limited to port 443 (don’t forget about 8089)
  • Option 2: testssl.sh, CLI based; doesn’t share your data, but no letter grade (management likes letters). See the example below.
  • Option 3: High Tech Bridge https://www.htbridge.com/ssl allows testing multiple ports with similar coverage to SSL Labs, though it is less well known
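
A minimal testssl.sh run against the Splunk management port (assuming testssl.sh has already been downloaded and using the hostname from the examples above):

./testssl.sh hf-scan.splunk.example.com:8089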

Renew certs

Set up a cron job to run the following command at least once per week during your scheduled change window. If a certificate renewal is required, Splunk will be restarted by the deploy hook.

certbot renew --webroot -w /usr/share/nginx/html
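
A sample /etc/cron.d entry (assuming a Sunday 03:00 change window; adjust the schedule and webroot to your environment):

# /etc/cron.d/certbot-renew
0 3 * * 0 root certbot renew --webroot -w /usr/share/nginx/html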

Can we even patch this? Spectre/Meltdown, oh, and AV too

Isn’t it great when things are in meltdown and you can’t patch yet because you’re waiting on another patch?

Microsoft has stated you can’t patch until AV goes first

http://www.zdnet.com/article/windows-meltdown-spectre-fix-how-to-check-if-your-av-is-blocking-microsoft-patch/

https://support.microsoft.com/en-us/help/4072699/january-3-2018-windows-security-updates-and-antivirus-software

Bottom line: if your AV vendor hasn’t updated to set this registry key giving the update permission to install, or you don’t use AV and instead rely on an application-whitelisting approach for security, the patch won’t apply. You can use Splunk to track down hosts that will refuse to apply the patch by adding this monitor and, well, Splunking the results.

Key="HKEY_LOCAL_MACHINE" Subkey="SOFTWARE\Microsoft\Windows\CurrentVersion\QualityCompat" Value="cadca5fe-87d3-4b96-b7fb-a231484277cc" Type="REG_DWORD”
Data="0x00000000

Add the following to an inputs.conf applied to all Windows systems and ensure the server class is set to restart the UF. A sample search for reviewing the results follows the stanza. Happy Splunking.

 

[WinRegMon://HKLMSoftwareMSWindowsQualityCompat]
index = epintel
baseline = 1
disabled = 0
hive = \\REGISTRY\\MACHINE\\Software\\Microsoft\\Windows\\CurrentVersion\\QualityCompat\\.*
proc = .*
type = delete|create|set|rename
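
A hedged example search to review which hosts have reported the QualityCompat key (index taken from the stanza above; a host that never appears in the results has likely not set the key and will refuse the patch):

index=epintel QualityCompat
| stats latest(_time) as last_seen count by host
| convert ctime(last_seen)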

Tuning Splunk when max concurrent searches are reached

Your searches are queued but you have cores, memory and IO to spare? Tuning your limits can allow Splunk to utilize “more” of your hardware when scaled up instances are in use.

Warnings

This approach is NOT useful when searches run LONG. If regular searches such as datamodel acceleration, summary, and reporting searches are not completing inside the expected/required time constraints, this information could make the symptoms worse.

This approach is useful when searches consistently execute faster than the required times for datamodel acceleration, summary, and reporting, and additional searches are queued while the utilization of CPU, memory, storage IOPS, and storage bandwidth is well below the validated capacity of the infrastructure.

 

Details

First, in all versions of Splunk beginning with 6.5.x, apply the following setting to disable a feature that can slow search initialization.

$SPLUNK_HOME/etc/system/local/limits.conf

$SPLUNK_HOME/etc/master-apps/_cluster/local/limits.conf

[search]
#SPL-136845 Review future release notes to determine if this can be reverted to auto
max_searches_per_process = 1

On the search head only, where DMA (datamodel acceleration) is utilized (e.g. ES), update the following:

$SPLUNK_HOME/etc/system/local/limits.conf

#this is useful when you have ad-hoc capacity to spare but are skipping searches
#(ES I'm looking at you) or other home-grown or similar scheduled work
[scheduler]
max_searches_perc = 75
auto_summary_perc = 100

Evaluate the load on the search heads and indexers, including CPU and memory utilization. We can increase the value of base_max_searches in increments of 10 to allow more concurrent searches per SH until one of the following occurs:

  • CPU or memory utilization reaches 60% on the IDX or SH tier.
  • IOPS or storage throughput hits its ceiling and no longer increases; the system is fully utilized. To prevent failure due to unexpected load, decrement the base_max_searches value by 10 and confirm IOPS is no longer pegged at its maximum.
  • Skipping/queuing no longer occurs (increase by 1-3 additional units from this point to provide some head room). Skipping can be tracked with the example search following the limits.conf snippet below.
#limits.conf set on the SH only
[search]
#base value is 6; increase by 10 (starting at 20) until utilization on IDX or SH reaches 60% CPU/memory
#base_max_searches = TBD
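
An example search for tracking skipped scheduled searches from the scheduler logs (standard _internal data; adjust the time range as needed):

index=_internal sourcetype=scheduler status=skipped
| stats count by reason, app, savedsearch_name
| sort - count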

Outage due to DDoS

The site’s been down for a few days; BlueHost has been suffering from a DDoS on at least one of the sites they host, and my site shared that infrastructure. For $3.95 a month I don’t expect too much, but having some ability to move sites to new hosts would be nice. Anyway, I’m up on Azure now until I decide if I want to be my own webmaster or revert to paying someone else to pretend to worry about things like that. On the plus side, the outage forced me to update the site infrastructure; it’s now using certificates from Let’s Encrypt. If you have CLI access to your Apache-hosted site, it’s super easy and free to enable good encryption.

sudo certbot --apache -d www.rfaircloth.com -d rfaircloth.com -d rfaircloth.westus.cloudapp.azure.com --must-staple --redirect --hsts --uir --rsa-key-size 4096

What’s in a URL? Now you can Splunk that

When hunting, we find interesting URLs in logs, both email and proxy, all the time. What will that URL return? If it redirects, where is it going, and what kind of content is served? These are questions you might be asking; if you are not asking them, now is the time to start. I’ve released a new add-on to Splunkbase, a little adaptive response action that can be used with just Splunk Enterprise OR Splunk Enterprise Security to collect and index information about those URLs.

https://splunkbase.splunk.com/app/3630/

How to enable the Alexa Domain list in ES 4.7

This post is short and sweet. In ES 4.7 the Alexa download is not enabled by default; enabling and using this list, which can be very valuable in domain/FQDN-based analysis, is a simple two-step process:

  1. Navigate to Enterprise Security –> Configure –> Threat Intelligence Downloads
    1. Find Alexa
    2. Click enable
  2. Navigate to Splunk Settings –> Searches, Reports, and Alerts
    1. Select “All” from the app drop down
    2. Search for “Threat - Alexa Top Sites - Lookup Gen”
    3. Click Edit under Actions and then Enable.
    4. Optional: click Edit under Actions again and edit the cron schedule. Set the task to daily execution at 03:00 with an auto window. This reduces the chance the list will not be updated if a run is skipped due to search head maintenance.
    5. Optional: the out-of-the-box gen search creates a large dispatch directory entry, which is not desirable on search head clusters or where disk space is at a premium, such as public clouds. Update the search as follows (appending the stats count) to prevent creation of a result set on the search head: | inputthreatlist alexa_top_one_million_sites fieldnames="rank,domain" | outputlookup alexa_lookup_by_stra | stats count
    6. Click “Run” to build the list so you can have it right now. A quick verification search follows below.
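
To confirm the lookup populated, a quick check using the lookup name from the modified gen search above:

| inputlookup alexa_lookup_by_stra | stats count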

Data Streams – Fill the data river with events from Splunk

I’ve had this in the bucket for a while waiting for the right time to share. There is a growing demand to develop “real time” analytic capability using machine data. Some great things are being created in labs; their problem coming out of the lab is generally the inability to get events from the source systems, immediately followed by difficulty normalizing those events. If you’ve been working with these systems for very long and have also worked with Splunk, you may share my opinion that the Universal Forwarder and the schema-at-read power of Splunk are simply unmatched. How can we leverage the power of Splunk without reinventing the wheel, the axle, and the engine?

Credits

  • Liu-yuan Lai, Engineer, Splunk https://conf.splunk.com/session/2015/conf2015_LYuan_Splunk_BigData_DistributedProcessingwithSpark.pdf
  • Splunk App for CEF https://splunkbase.splunk.com/app/1847/

Back in 2015 I attended a short .conf presentation that introduced me to the concepts and the value of Spark-like engines. Last year our new CEF app introduced the idea that message distribution can be executed on the indexer, allowing very-large-scale processing with Splunk.

Introducing Integration Kit (IntKit)

The solution adds three interesting abilities to Splunk using “summarizing searches” to distribute events via a durable message bus.

  1. Send raw events using a durable message queue
  2. Send reformatted events using an arbitrary schema
  3. Send “Data Model” schema events, eliminating the need to build parsing logic for each type of source on the receiving side.

But what about other solutions?

  • Syslog output using the heavy forwarder
    • Syslog is not a reliable delivery protocol; it is unable to resend lost events and can cause backup on the UF.
  • CEF 2.0
    • A great tool, but limited to single-line events or reformatting, and it also allows for data loss.

The tools consist of a message formatter, currently preparing a _json field (other formats such as XML or CSV could be implemented), and a producer that places the message onto the Kafka queue (other queues can also be implemented).

example

[code lang=text]
| datamodel Network_Traffic All_Traffic search
| fields + _raw,All_Traffic.*
| generatejsonmsg suppress_empty=true suppress_unknown=true suppress_stringnull=true output_field=_json
include_metadata=true include_fields=true include_raw=false sort_fields=true sort_mv=true
| ProduceKafkamsgCommand bootstrap_servers="localhost:9092" topic="topicname" msgfield="_json"
| stats count
[/code]

What does this do:

  1. Using the datamodel command gather all Network_Traffic events
  2. Keep only _raw and the data model fields
  3. Generate a _json field containing the fields in JSON format; omit empty strings and “null”, and sort the values of mv fields
  4. Send the message to Kafka using a bootstrap server (localhost) and the topic “topicname”
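
To confirm messages are arriving, you can tail the topic with the standard Kafka console consumer (a sketch assuming a local broker and the topic name from the example above):

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic topicname --from-beginning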

This project is only slightly above science-project grade: that is, poorly documented and mostly functional. I expect it will fit in well with the ecosystem it’s helping. Please submit enhancements to make it better, including documentation, if you use it.

Using systemd to squash THP and start splunk enterprise

Updated Jan 16, 2018: user security issue

Updated Jan 19, 2018: using forking type for Splunk

Fixing INIT Scripts

If you are currently using, or prefer to use, init script startup to remain as close to “out of box” configuration as possible, be aware of a serious security risk present in the traditional startup method (REF: https://www.splunk.com/view/SP-CAAAP3M). To mitigate the issue and address THP/ulimits, consider moving to a field-modified version of the script: https://bitbucket.org/snippets/rfaircloth-splunk/Gek8My

Going forward using SYSTEMD

 

The concepts presented in this post, as well as the original inspiration, have some risks. Using alternatives to the vendor-provided init scripts has support risks, including loss of the configuration in future upgrades. Each operating system vendor has its own specific guidance on how to do this, and each automation vendor has example automation scripts as well. Picking an approach that is appropriate for your environment is up to you.

THP, the bane of performance for so many things in big data, is often left on by default and is slightly difficult to disable. As a popular Splunk Answers post and Splunk consultants including Marquis have found, the best way to ensure ulimit and THP settings are properly configured is to modify the init scripts. This is a really crafty and reliable way to ensure THP is disabled for Splunk; it works on all Linux operating systems regardless of how services are started.

I’m doing some work with newer operating systems and wanted to explore how systemd really works and how it changes what is possible in managing a server. Let’s face it, systemd has not gotten the best of receptions in the community; after all, it moved our cheese, toys, and the ball all at once. But it seems to be here to stay, so what if we could use its powers for good in relation to Splunk? Let’s put an end to THP and start Splunk the systemd-native way.

Note: the following config file is included inline for readability and Google. A downloadable text file is available at https://bitbucket.org/snippets/rfaircloth-splunk/ze7rqL

Create the file /etc/systemd/system/disable-transparent-huge-pages.service


[Unit]
Description=Disable Transparent Huge Pages

[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo never >/sys/kernel/mm/transparent_hugepage/enabled"
ExecStart=/bin/sh -c "echo never >/sys/kernel/mm/transparent_hugepage/defrag"
RemainAfterExit=true
[Install]
WantedBy=multi-user.target

Verify THP and defrag are presently enabled, to avoid a false sense of success:

# cat /sys/kernel/mm/transparent_hugepage/enabled

[always] madvise never

# cat /sys/kernel/mm/transparent_hugepage/defrag

[always] madvise never

Enable and start the unit to disable THP

# systemctl enable disable-transparent-huge-pages.service

# systemctl start disable-transparent-huge-pages.service

# cat /sys/kernel/mm/transparent_hugepage/enabled

always madvise [never]

# cat /sys/kernel/mm/transparent_hugepage/defrag

always madvise [never]

Reboot and repeat the verification to ensure the process is enforced

Note: the following config file is included inline for readability and Google. A downloadable text file is available at https://bitbucket.org/snippets/rfaircloth-splunk/xe7rqj

Create the unit file /etc/systemd/system/splunk.service

<p class="p1"><span class="s1">[unit]</span></p>
<p class="p1"><span class="s1">After=network.target</span></p>
<p class="p1"><span class="s1">Wants=network.target</span></p>
<p class="p1"><span class="s1">Description=Splunk Enterprise</span></p>
<p class="p1"><span class="s1">[Service]</span></p>
<p class="p1"><span class="s1">Type=forking</span></p>
<p class="p1"><span class="s1">RemainAfterExit=False</span></p>
<p class="p1"><span class="s1">User=splunk</span></p>
<p class="p1"><span class="s1">Group=splunk</span></p>
<p class="p1"><span class="s1">ExecStart=/opt/splunk/bin/splunk start --answer-yes --no-prompt --accept-license</span></p>
<p class="p1"><span class="s1">ExecStop=/opt/splunk/bin/splunk stop</span></p>
<p class="p1"><span class="s1">PIDFile=/opt/splunk/var/run/splunk/splunkd.pid</span></p>
<p class="p1"><span class="s1">Restart=always</span></p>
<p class="p1"><span class="s1">TimeoutSec=300</span></p>
<p class="p1"><span class="s1">#ulimit -Sn 65535</span></p>
<p class="p1"><span class="s1">#ulimit -Hn 65535</span></p>
<p class="p1"><span class="s1">LimitNOFILE=65535</span></p>
<p class="p1"><span class="s1">#ulimit -Su 20480</span></p>
<p class="p1"><span class="s1">#ulimit -Hu 20480</span></p>
<p class="p1"><span class="s1">LimitNPROC=20480</span></p>
<p class="p1"><span class="s1">#ulimit -Hf unlimited</span></p>
<p class="p1"><span class="s1">#ulimit -Sf unlimited</span></p>
<p class="p1"><span class="s1">LimitFSIZE=infinity</span></p>
<p class="p1"><span class="s1">LimitCORE=infinity</span></p>
<p class="p1"><span class="s1">[Install]</span></p>
<p class="p1"><span class="s1">WantedBy=multi-user.target</span></p>

# systemctl enable splunk.service

# systemctl start splunk.service

Verify the ulimits have been applied via splunk logs

#cat /opt/splunk/var/log/splunk/splunkd.log | grep ulimit

Reboot and repeat all verifications.

Bonus material, kill Splunk (lab env only) and watch systemd bring it back

# killall splunk

# ps aux | grep splunk

You just noticed splunkd was brought back up when it died without being stopped via systemctl. This means using splunk start|stop directly is no longer valid once systemd manages Splunk; use systemctl instead.

Splunk OS data onboarding – best practices updated

I’ve updated my best practices a bit and moved the implementation guides from Confluence out to Bitbucket in Markdown so they can be more easily referenced on any platform, or in secured environments where PDFs might be discouraged.

Each repo will contain a README.md and one or more INSTALL.md files with the implementation guides. If you find an issue, have a better practice, or have another enhancement, please open an issue in the repository’s tracker.

Trust but verify – your logs might be lying to you, aka the case for network tap traffic monitoring

I really do “get” it: logging and monitoring can be very costly, though we all agree not nearly as costly as a breach. Each organization struggles to ensure they log enough to get detection and value while being good stewards of their company budget. It has been a day of reading Vault 7 leaks, and honestly not much surprises me. I do see something worth a strong restatement: an encouragement to rethink what you log and how you log it.

The CIA has a very cool (sorry, hacker at heart) tool we have known about for some time but have not been able to talk about. Their tool “Drillbit” allows the creation of a covert tunnel using common Cisco gear in a way that typical monitoring and logging using IDS and firewalls will not identify. American companies should note that criminal gangs and foreign governments certainly have similar capabilities. Don’t use Cisco? Sorry, the bad news is that almost every major gear vendor has been exploited with similar approaches. The danger these leaks present is an increased awareness of the effectiveness of these techniques, encouraging the advancement of commodity cybercrime toolkits with ever more difficult-to-detect features.

Splunk has your back if you are willing to let us. Using Splunk Stream and proper sensor placement, we can collect data from the inside and outside of your firewall that can be used to identify covert tunnels. Detection should be performed using three approaches (a sketch of the first appears after the list):

  • Static rules, such as
    • This not that: “Stream-identified traffic not in firewall logs”
    • New patterns in DNS, NTP, or GRE flows
    • A change/login to a firewall or switch not associated with a change record
  • Threat list enrichment and detection
    • Source and destination traffic matching quality threat lists; traffic for protocols other than HTTP(S) and DNS should be treated with high or critical priority
  • Machine learning
    • Anomalous egress traffic by source from network devices
    • Anomalous admin connections by source to network devices
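
A hedged sketch of the “this not that” rule: compare sources seen by Stream against sources seen in firewall logs over the same time window. The index, sourcetype, and field names here are assumptions; substitute your own.

| multisearch
    [ search index=stream sourcetype=stream:* | eval witness="stream" ]
    [ search index=firewall | eval witness="firewall" ]
| stats values(witness) as witness by src_ip dest_ip dest_port
| where mvcount(witness)=1 AND witness="stream"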