Sunday, 4 November 2018

Enriching Logs With Active Directory Attributes

Date of writing: 4th November 2018
Last edit: 4th November 2018

JUST GIVE ME THE SCRIPT AND CONFIG


If all you need are the logstash configuration and script to generate the translate data, go here:

https://github.com/kevinwilcox/python-elk

It is in the directory called "crawl_ad_ldap"!

Everyone Else 8^)


I recently tweeted about something that popped up in my day job and promised a bit more detail in a blog post. It has sort of worked out "to the good" because I've mentioned wanting to cover enrichment with AD info before but this really pushed that into overdrive...so here we are!

The Problem


When I talk to folks about what they're logging, and why they want to use something like Elastic, Splunk, GrayLog, etc., there are a few log types that regularly pop up.  It seems like everyone wants to pull login logs, be they 4624 events in Windows or SSH logins from the *nix world (I know, OpenSSH has been ported to Windows).  Both of these use pretty well-defined formats -- even though openssh logs to auth.log (generally) and uses syslog (generally), the format of an SSH success or failure is pretty consistent.

For the purpose of this post, we're going to focus on one field that's common to both Windows 4624 events and SSH authentications - the username field.

Let's make things a bit "real life".  Think about the scenario where you want to search for every login to a system in Human Resources, or for every login by a user in Finance and Payroll.  You would need to pull every login over <x> time period, do an OU lookup for every workstation name or username and then discard anything not in the OU you care about, or you'd need to pull a list of each user in those specific OUs and look for logins for those users. Those methods are pretty traditional but I was 100% sure there's a better way using modern tools (specifically, using my SIEM of choice - Elastic).

Method One - Elasticsearch


My initial solution was to use python to crawl Active Directory and LDAP for all computer and user objects (and the properties I need), cache that data locally in Elasticsearch and then query for the relevant data each time logstash parses or extracts a username field.  By doing that I can somewhat normalise all of my login logs - and then it doesn't matter if I know what all of the OUs or groups are in my organisation, or who is a member of which one, as long as I have mostly-current information from AD and LDAP.

I figured I already did this with OUI info for DHCP logs and Elasticsearch was performing great so doing it for username fields shouldn't be a big issue, right?  I'd just finished writing the python to pull the data and was working on the filter when I had a chat with Justin Henderson, author of the SANS SEC555 course, and he completely changed my approach.

Method Two - Translate


Justin recommended I try the translate filter.  I already used the filter for login logs to add a field with the description for each login type so I wasn't completely new to using it but I had never used it with a dictionary file.  After chatting with him and reading the documentation for a bit, I realised I could create a dictionary file that looked like this:

username_0:
  distinguished_name: username_0, ou=foo, ou=foo2, ou=org, ou=local
  member_of:
    - cn=foo
    - cn=another_foo
    - cn=yet_another_foo
username_1:
  distinguished_name: username_1, ou=foo3, ou=foo4, ou=org, ou=local
  member_of:
    - cn=foo3

Then any time logstash see username_0 in the "username" field, it can add the distinguished_name and member_of fields to the metadata for that log.

The operative word being "could"...I'd still have a bit of work to go from that idea to something I could roll out to my logstash nodes, especially since they run on BSD and Linux where there isn't exactly an option to run "get-aduser".

A First Pass


The AD controller I used for this post is configured to use osg.local as its domain. I've added an OU called "osg_users" and that OU has two users, "testuser" and "charlatan".

"charlatan" is also a member of a group called "InfoSec" and "testuser" is member of a group called "Random Group".

First, I needed to setup a configuration file so I could put directory-specific information in there. The one I wrote for this blog is named conn_info.py and looks like this:

ad_server = "IP_of_AD_controller"
ad_search_user = "osg\charlatan"
ad_search_pass = 'charlatans_password'
ad_search_base = "ou=osg_users,dc=osg,dc=local"
ad_search_props = ['displayname', 'distinguishedname', 'memberof']

Ideally the charlatan user/password would be a service account but it works for this example.

You may assign any manner of properties to the users in your AD or LDAP but displayName, distinguishedName and memberOf are practically universal.

The next step was a script that could connect to and interrogate my AD server.

A really basic script using the ldap3 module and my conn_info.py file might look like this:

#!/usr/bin/env python
from ldap3 import Server, Connection, ALL, NTLM
import conn_info
server = Server(conn_info.ad_server, get_info=ALL)
conn = Connection(server, conn_info.ad_search_user, conn_info.ad_search_pass, auto_bind=True, auto_range=True)
print("starting search...")
try:
  searchString = "(objectclass=user)"
  conn.search(conn_info.ad_search_base, searchString, attributes=conn_info.ad_search_props, paged_size = 500)
  for entry in conn.entries:
    print(str(entry))
  cookie = conn.result['controls']['1.2.840.113556.1.4.319']['value']['cookie']
  while(cookie):
    print('receiving next batch')
    conn.search(conn_info.ad_search_base, searchString, attributes=conn_info.ad_search_props, paged_size = 500, paged_cookie = cookie)
    for entry in conn.entries:
      print(str(entry))
    cookie = conn.result['controls']['1.2.840.113556.1.4.319']['value']['cookie']
    print('searches finished')
except Exception as e:
  print("error")
  print(e)
exit("exiting...")

Let's step through that a bit. ldap3 is a fantastic module for interrogating AD and LDAP and it works with both python 2 and python 3. Importing conn_info makes sure we can read variables (I know, constants, and they should be in all caps) from the conn_info.py file. The next few lines connect to the AD server; the auto_bind option saves a step by binding to the server after connecting as the specified user.

Unless the directory configuration is modified, most objects in AD will probably be of either type "user" or type "computer".

"paged_size" is an interesting attribute in that it will limit the results to 500 at a time, then the cookie is used to paginate through to the last result set (hence the "for" loop).

This script doesn't produce output that's useful to logstash but it is really useful to us as people. For my example DC, I get this:

(ldap_post) test@u18:~/ad_ldap_post$ python test.py
starting search...
DN: CN=Just A. Charlatan,OU=osg_users,DC=osg,DC=local - STATUS: Read - READ TIME: 2018-10-15T00:32:44.624881
    displayName: Just A. Charlatan
    distinguishedName: CN=Just A. Charlatan,OU=osg_users,DC=osg,DC=local
    memberOf: CN=InfoSec,DC=osg,DC=local
DN: CN=Testing User,OU=osg_users,DC=osg,DC=local - STATUS: Read - READ TIME: 2018-10-15T00:32:44.625089
    displayName: Testing User
    distinguishedName: CN=Testing User,OU=osg_users,DC=osg,DC=local
    memberOf: CN=Random Group,DC=osg,DC=local
exiting...

Notice how similar this output is to doing:

get-aduser -searchbase "ou=osg_users,dc=osg,dc=local" -filter * -properties displayname, distinguishedname, memberof

Making it More Useful


The real goal is to have something I can read into logstash. The translate module can read CSV, JSON and YAML and, for this purpose, in my opinion the cleanest to read is YAML. Instead of using a library to convert between object types and output YAML, I'm just going to output the YAML directly.

A sample function to make it easy to build the output might look like:

def format_output(user_entry):
  entry_string = str(user_entry['samaccountname']) + ":\n"
  entry_string += "  dn: " + str(user_entry['distinguishedname']) + "\n"
  entry_string += "  displayname: " + str(user_entry['displayname']) + "\n"
  entry_string += "  memberof: " + "\n"
  for i in user_entry['memberof']:
    entry_string += "    - " + str(i) + "\n"
  return entry_string

With a couple of minor edits, my output now looks like this:

charlatan:
  dn: CN=Just A. Charlatan,OU=osg_users,DC=osg,DC=local
  displayname: Just A. Charlatan
  memberof:
    - CN=InfoSec,DC=osg,DC=local
testuser:
  dn: CN=Testing User,OU=osg_users,DC=osg,DC=local
  displayname: Testing User
  memberof:
    - CN=Random Group,DC=osg,DC=local

It's important to have memberof output like that as default because as soon as you add users to two or more groups, that's how YAML parsers will expect them to be formatted.

The last step in the script is to write to file instead of outputting to the display. Stripping out the print statements and using file output instead, my script now looks like this:

#!/usr/bin/env python
def format_output(user_entry):
  entry_string = str(user_entry['samaccountname']) + ":\n"
  entry_string += "  dn: " + str(user_entry['distinguishedname']) + "\n"
  entry_string += "  displayname: " + str(user_entry['displayname']) + "\n"
  entry_string += "  memberof: " + "\n"
  for i in user_entry['memberof']:
    entry_string += "    - " + str(i) + "\n"
  return entry_string
from ldap3 import Server, Connection, ALL, NTLM
import conn_info
server = Server(conn_info.ad_server, get_info=ALL)
conn = Connection(server, conn_info.ad_search_user, conn_info.ad_search_pass, auto_bind=True, auto_range=True)
try:
  out_string = ""
  searchString = "(objectclass=user)"
  conn.search(conn_info.ad_search_base, searchString, attributes=conn_info.ad_search_props, paged_size = 500)
  for entry in conn.entries:
    out_string = out_string + format_output(entry)
  cookie = conn.result['controls']['1.2.840.113556.1.4.319']['value']['cookie']
  while(cookie):
    conn.search(conn_info.ad_search_base, searchString, attributes=conn_info.ad_search_props, paged_size = 500, paged_cookie = cookie)
    for entry in conn.entries:
      out_string = out_string + format_output(entry)
    cookie = conn.result['controls']['1.2.840.113556.1.4.319']['value']['cookie']
  out_fh = open('ad_users_file.yml', 'w')
  out_fh.write(out_string)
  out_fh.close()
except Exception as e:
  print("error: " + e)
exit()

When I run it, I have a new file created called "ad_users_file.yml". It looks like this:

charlatan:
  dn: CN=Just A. Charlatan,OU=osg_users,DC=osg,DC=local
  displayname: Just A. Charlatan
  memberof:
    - CN=InfoSec,DC=osg,DC=local
testuser:
  dn: CN=Testing User,OU=osg_users,DC=osg,DC=local
  displayname: Testing User
  memberof:
    - CN=Random Group,DC=osg,DC=local

This is ideal for logstash to use as a dictionary - now I can tell it if it ever sees "charlatan" in a field called "user" then add the information in my dictionary to those log files.

A Sample Logstash Config


Instead of testing with a Windows log or an OpenSSH log, I can have a really simple logstash "testing" config file that takes in user input in JSON format, looks up the information in a dictionary when it sees a field called "user" then sends the resulting object to the display.

That "simple" configuration for logstash, which I'm naming ls_test.conf,  could look like this:

input { stdin { codec => json } }
filter {
  if [user] {
    translate {
      field                       => "user"
      destination             => "from_ad"
      dictionary_path     => "ad_users_file.yml"
      refresh_behaviour => "replace"
    }
  }
}
output { stdout { codec => rubydebug } }

Logstash can be started with a specific config file with:

sudo /usr/share/logstash/bin/logstash -f ls_test.conf

Once it's ready for input, if I use the following two objects then I get a good test of what happens if logstash both does and does not find my object in the dictionary:

{"user":"charlie"}
{"user":"charlatan"}

When I use them, I get the following output:

{"user":"charlie"}
{
      "@version" => "1",
          "user" => "charlie",
    "@timestamp" => 2018-10-15T02:10:06.240Z,
          "host" => "u18"
}
{"user":"charlatan"}
{
      "@version" => "1",
          "user" => "charlatan",
    "@timestamp" => 2018-10-15T02:10:14.579Z,
          "host" => "u18",
       "from_ad" => {
                 "dn" => "CN=Just A. Charlatan,OU=osg_users,DC=osg,DC=local",
        "displayname" => "Just A. Charlatan",
           "memberof" => [
            [0] "CN=InfoSec,DC=osg,DC=local"
        ]
    }
}

This is exactly what I'd expect to see. Having everything under the "from_ad" field may not be ideal, I can always use a mutate statement to move or rename any field to a more "reasonable" or "usable" place.

Wrapping Up


This is an enrichment technique that I really like and that I recommend implementing anywhere possible. It's also really flexible! I know I've written about it in the context of Active Directory but it's really in the context of any LDAP - Active Directory just happens to be the one folks tend to be familiar with these days. Want to search something by username? Great, here you go! Need to search the rest of your logs and limit it to only members of a given OU? No more waiting for your directory to return all of those objects and search for those, no more retrieving all of your logs and waiting while your SIEM looks for them one-at-a-time in a lookup table *at search time* with potentially out-of-date information - it's all added when the log is cut with the most current information at the time!

Saturday, 14 July 2018

Enriching Domain Names with Frequency Analysis and ASN Information

I saw a really interesting question on Twitter today.  Someone asked if there were an IP/whois plugin for ELK so that folks could add ASN information to log data.  I thought that question was just perfect because I've spent a lot of time lately talking to folks about the importance of log enrichment and how to do it with logstash (for example, see the SANS webcast Justin Henderson and I did: https://www.sans.org/webcasts/high-fidelity-alerts-context-context-107870).

Since the question was specifically about ASN information, we're going to take a look at the "geoip" filter to get information about an IP.  Since I like DNS enrichment, I'm going to use the "dns" filter to get information about a hostname and then we're going to look at a Mark Baggett tool called domain_stats to get whois information - specifically, the creation date of the domain.  Since it runs as a web service, you can use logstash's "rest" filter to query it from inside your configuration!

That got me thinking about another Mark Baggett tool called "freq_server".  Have you ever looked at a domain name and thought, "there's no way a human being created that, it had to be an algorithm..."? Generally speaking, in the English language, some letters come after others a LOT more frequently than some other combinations.  For example, in English, if we see the letter 'Q' then it is followed by the letter 'U' much more frequently than it is followed by the letter 'J'.  If I see a domain name that has 'qj' in it then that is much more interesting (*very* generally speaking) than a domain name that has 'qu' in it.  Mark's tool looks at those tuples and returns a score based on how frequently the letter combinations in a given word fit with the English literature used to "train" it - and it runs as a web service, too, so we can query it with logstash's "rest" filter as well!

The Environment


My environment is Ubuntu Server 18.04 LTS with logstash 6.3.1 installed per the instructions here:


The "rest" filter is not installed by default but it is available for the latest version of logstash.  You can install it with:

sudo /usr/share/logstash/bin/logstash-plugin install logstash-filter-rest

Both of Mark's tools are written with python2.7 in mind so I've installed python2.7 and pip via apt:

sudo apt install python python-pip

And then I've added python-whois via pip:

sudo pip install python-whois

One quick word: ordinarily I would use something like virtualenv for this.  There is absolutely no reason not to do that right now other than I'm destroying this VM when I'm done tonight.

Getting domain_stats and freq_server


Both of the tools I'm using for enrichment, domain_stats.py and freq_server.py, are available from Mark's GitHub:


I have the latest version of each from GitHub using "git clone <above_url>", so I have a directory called domain_stats and another called freq.

The easiest way to run domain_stats.py (remember, it starts a web server!!) is to cd to the domain_stats directory and just run it as a backgrounded python process.

cd domain_stats
python domain_stats.py 10000 &

This starts domain_stats on port 10000.

I'm going to do the same thing for freq_server.py except I want it on port 20000. It doesn't matter which port you run them on (as long as it's a free port above 1024!). In production I would use a service and set the owner/group to logstash but for this demo, it's sufficient. Notice that freq_server takes a frequency table as its final option -- Mark provides one at the GitHub page!

cd freq
python freq_server.py 20000 freqtable2018 &

You should get something saying the server is ready and an example of the URL to use to query it.

First Logstash Config


I am starting with a VERY simple logstash config file, called "test.conf", in my local directory. My initial config looks like this:

input { stdin { codec => json } }
filter {
  if [message] == "" {
    drop { }
  }
}
output { stdout { codec => rubydebug } }

Basically this lets me type in a JSON object as input and it parses the fields and prints it as a formatted JSON object. The filter block lets me hit enter a few times without throwing a parse error or printing an object for null input by telling logstash to just drop those events instead of displaying them.

I'm starting logstash with:

sudo /usr/share/logstash/bin/logstash -f test.conf

For the entire blog post, I'm going to use a simple example log entry of:

{"host":"google.co.uk"}

With my existing logstash configuration, I get something like this:


Notice that because I told logstash to expect JSON as input, and because I gave it a JSON object as input, it parsed out the "host" field for me.  This will be really important later.

The First Enrichment: DNS


The very first thing we need to do is go from a hostname to an IP address.  In a "real" environment you'd have something like passive DNS information or netflow so you may already have an interesting IP address.  I still like letting my enrichment boxes do DNS lookups, though, because I find it interesting if I see one domain looked up ten times in two minutes and the results are for vastly different ASNs (warning warning warning!! that could be malware using fast flux DNS!).

To use the filter, I'm going to add a "filter { }" section to my test.conf file that looks like this:

filter {
  if [host] {
    dns {
      resolve => "host"
      action => "append"
    }
  }
}

This will do a lookup for whatever is in the "host" field and then *append* it to the host field, effectively turning it into an array.  If I use my new config, my result looks something like this:


Move the IP With Mutate


I think that's a little difficult to deal with later because I don't like arrays in my config files.  The easiest way to solve that wee issue is to use the "mutate" filter to rename fields into something more useful.  I want to break the IP address into its own field, resolved_ip, and keep the hostname in its own field, host.  With a little work, my filter now looks like this:

filter {
  if [host] {
    dns {
      resolve => "host"
      action => "append"
    }
    mutate {
      rename => {
        "[host][1]" => "resolved_ip"
        "[host][0]" => "host"
      }
    }
  }
}

When I run the new config, I get this:


Much easier to read (and definitely easier to use later!!).

Next Enrichment: geoip


Now that I have an IP address, I can answer the question that prompted this post - which ASN is assigned the IP in question!

There are two ways to use the built-in geoip filter.  The first adds city/state/country information, the second adds ASN information.  You can't add these with a single block, you have to call geoip twice, so I'm going to add a new filter block that looks for "resolved_ip" and, if its present, calls geoip twice.

filter {
  if [resolved_ip] {
    geoip {
      source => "resolved_ip"
      target => "geoip_city"
      default_database_type => "City"
    }
    geoip {
      source => "resolved_ip"
      target => "geoip_asn"
      default_database_type => "ASN"
    }
  }
}
This adds a *considerable* amount of information to the log event!


The most important fields, generally speaking, are the geoip_asn[asn] and geoip_asn[as_org] fields.  Microsoft, for example, may have some 20 million IPs across multiple CIDR blocks, but I don't need to know what each block is - I just need to know to look for "Microsoft" as the as_org.  With mutate I could rename any of these fields to make searching easier.

Using REST With domain_stats.py


In day-to-day Internet usage, it is rare to see people visiting domains that are "new".  Sure, the occasional start-up makes a name for themselves or a site takes off practically overnight but that is not the norm.  If I see domains that are only a few days or weeks old, I find that *REALLY* interesting.  It may not be the sole reason to alert on a given domain but it certainly can contribute towards me taking a second look at one.

Now that I have domain_stats running, I can query it for information using the "rest" filter.  To do so, I'll add the following filter block to my test.conf:

filter {
  if [host] {
    rest {
      request => {
        url => "http://localhost:10000/domain/creation_date/%{host}"
      }
      sprintf => true
      json => false
      target => "creation_date"
    }
  }
}

This tells it to query the service running on port 10000 on the local machine using the URL that tells domain_stats to return the creation date for the hostname in [host].  It then stores that information in "creation_date":


Note the addition of the "creation_date" field - Google's UK domain has been around for quite a while, almost 20 years!

Final Enrichment: Using REST With freq_server


The final enrichment I'll add is the frequency analysis.  This assigns a numeric value between 1 and 35 (configurable) depending on how "randomly" the letter sequences appear - the smaller the number, the more likely the domain is the product of a Domain Generation Algorithm.  Because it uses the rest filter, the final "filter" block to be added will look a lot like the block for domain_stats:

filter {
  if [host] {
    rest {
      request => {
        url => "http://localhost:20000/measure/%{host}"
      }
      sprintf => true
      json => false
      target => "freq_score"
    }
  }
}

With that added, my final output looks like this:


The "freq_score" value has two different values based on a recent code update and using two different algorithms.  You can use whichever value you prefer but be consistent with whether you want the first or second one.  I'll leave it to you to determine how to extract the values but grok, dissect or ruby would be viable starting points...

Wrapping Up


I really like the performance of Elasticsearch but for me, the *best* thing about the Elastic stack is logstash.  To be able to take a single domain name or IP address and produce a log event, in real time, that has whois, GeoIP, passive DNS and anything else you can think of, is a MASSIVE win in my opinion.

To make things easier, here is the full content of my config file:

input { stdin { codec => json } }
filter {
  if [message] == "" {
    drop { }
  }
}
filter {
  if [host] {
    dns {
      resolve => "host"
      action => "append"
    }
    mutate {
      rename => {
        "[host][1]" => "resolved_ip"
        "[host][0]" => "host"
      }
    }
  }
}
filter {
  if [resolved_ip] {
    geoip {
      source => "resolved_ip"
      target => "geoip_city"
      default_database_type => "City"
    }
    geoip {
      source => "resolved_ip"
      target => "geoip_asn"
      default_database_type => "ASN"
    }
  }
}
filter {
  if [host] {
    rest {
      request => {
        url => "http://localhost:10000/domain/creation_date/%{host}"
      }
      sprintf => true
      json => false
      target => "creation_date"
    }
  }
}
filter {
  if [host] {
    rest {
      request => {
        url => "http://localhost:20000/measure/%{host}"
      }
      sprintf => true
      json => false
      target => "freq_score"
    }
  }
}
output { stdout { codec => rubydebug } }

Sunday, 21 January 2018

SIEM From Scratch: Custom Data With Python

This was a bit of an unexpected post for me to make. When I laid out all of the "SIEM From Scratch" posts I wanted to do, I fully expected to use use filebeat and syslog to get data from endpoints. If I have programs (or scripts...) that query for data via an API, such as my previous post about getting logs from Google, I typically run those on a schedule via cron and then import that data into ELK via filebeat. I can write each login event as a JSON object so, really, why would I NOT do that?

Then this week I was at work and was bitten, a bit harder than I found comfortable, by an inode re-use issue with Linux and filebeat and that got me to thinking (I know, how problematic is *that*?!)...

For the impatient curious - the ultimate solution was to create the new file and have filebeat pick it up, then delete it after <x> number of minutes so that it drops out of the filebeat registry before the new file is created. This lets me avoid fiddling with filebeat options and I can control it all from my script (or with cron).

The Issue - inode Re-Use


Before I go into a possible solution, allow me to outline the problem. Filebeat is amazing for watching text files and getting them into the Elastic stack - it has all the TLS and congestion-control bells and whistles, plus it lets you do some interesting things on the client if you're reading JSON files. To make sure it gets your data (and doesn't get your data twice), it has a registry file that makes note of which byte it just read in your file. To know that it's your file, it also keeps track of the name, the inode and the device on which that file resides.

The important part here is the concept of an inode.

Without getting too "into the weeds", an "inode" is a way to address where a file starts on a hard drive. An analogy would be your home. If you want to invite someone over, you give them your address. The equivalent for a file or directory may be its path -- "/var/log/syslog" or "C:\Windows\system32", for example.

However, you could ALSO give them your GPS coordinates. Even if your address changes, your GPS coordinates would remain the same. The equivalent for a file in POSIX-compliant operating systems is an inode number. Just like giving someone your GPS coordinates is kind of a pain, trying to remember inode numbers is a bit of a pain. This is why we don't really use them in day-to-day work but we use filenames and paths all the time. One is really easy for people to remember, one is really problematic.

If the town/city/village/whatever renames your street then your address will change, even though your home didn't move - therefore your GPS coordinates won't change. Likewise, if I rename my file from "my_file.txt" to "my_file2.txt", I'm just changing the path and the inode number remains the same. Here is the output of "stat" on FreeBSD. The second column is the inode number:


Here's where it gets really interesting. If I have a file called "my_file.txt", delete it and create a new file called "my_file.txt", the inode number may be the same - even though it's a completely new file with new information. Again, here is the output of doing that and of the 'stat' command:


Notice the inode number is the same for both files. If I have a program, like filebeat, that knows to watch "my_file.txt" with inode '348828', it may read the content of the original file ("hi") and store that it read <n> bytes. Then, even though I deleted the file and have *completely new data* in my new file, because the inode number is the same, it will start reading again at <n + 1> bytes. That means I will have a log entry of "hi" - and then another that says " someone else". Now my SIEM has missing log data!

One Solution - Write Directly to Logstash


This brings me to the fun part of this post - what to do when you really don't need an intermediary and just want to log something directly to logstash. I think Solaris may have been the Last Great Unix in some ways but there isn't a filebeat for Solaris and nxlog (community edition) is a bit of a pain to compile and install. That's okay because you can have something pick up a logfile and send it *as JSON* to your SIEM. Can't get a log shipper approved for install on <x> system but you need output from your scripts? That's okay because you can send that output directly to your SIEM!

Step One - Logstash Needs to Listen For a TCP Connection


The first step is to setup a TCP listener in logstash. I have logstash 6.1.2 installed from Elastic's "apt" repo and a config file, "test.conf", with just enough in it to be able to test input/output. The "stdin" input type tells logstash to accept from "standard input", in this case my keyboard, and print to "standard output", in this case my monitor. The "rubydebug" code tells logstash to make the output pretty so it's easier for a human to read:


And when I run logstash manually with that configuration using:

sudo -u logstash /usr/share/logstash/bin/logstash --path.config test.conf

I can confirm it works by hitting "enter" a couple of times and then typing in some text (I just wanted to show that it grabs lines where the only input is the "enter" key):


Logstash has a multitude of possible input types. For example, I have some logstash servers that accept syslog from some devices, un-encrypted beats from others, encrypted beats from still others, raw data over UDP from even more and, finally, raw data from TCP from yet *another* set of devices. There are a LOT of possibilities!

NB: logstash will run as the logstash user, not root, so by default it can NOT bind to ports lower than 1024. Generally speaking you want to use high-numbered ports so things "Just Work"!

For this post, I want to tell logstash to listen for TCP connections on port 10001 and I want it to parse that input as JSON. To do this, I add a "TCP" input section. I've also added a comment line above each section telling whether it was in the above configuration so you can more easily see the added section:


Since I'm already logged into that VM, I'm going to go ahead and restart logstash with my custom configuration using the same sudo command as above:

sudo -u logstash /usr/share/logstash/bin/logstash --path.config test.conf

Since I still have the section for stdin as an input, logstash should start and give me a message saying it's waiting for input from "stdin":


Now I'm going to move over to my FreeBSD VM that has python installed to write my test script.

Step Two - The Scripting VM


My development VM is running FreeBSD 11 and the only things I have done to it are add a user named 'demo' and installed pkg, vim-lite and python3:

adduser demo
pkg install pkg
pkg install vim-lite
pkg install python3

After logging in, I made sure I could run "python3" and get an interpreter (the ">>>" means it is ready for me to start entering python code):


To exit the interpreter, I can either type "quit()" and hit the enter key or I can hold down the Control key and press 'd'.

With that done, it's time to write my script!

Step Three - Python, JSON and Sockets


I am a very beginner-level python programmer - I know just enough about it to read a little of it, so don't be intimidated if you're new to it. We're going to keep things as basic as possible.

First I need to tell python that I'm either going to read, parse or output JSON. It has a native module for JSON called, appropriately enough, 'json', and all I need to do is "import" it for my script to use it as a library. It also has a module called 'socket' that is used for network programming so I want to include that as well.

import json
import socket

Next I want to build an item to use for testing. I'm going to name it "sample_item" and it's going to have a single field called "user_name". Since my user on the VM is named 'demo', I'm going to use 'demo' as the value for the "user_name" field:

sample_item = { 'user_name': 'demo' }

That's kind of a JSON object but not quite standards-compliant. To make sure it is exactly what I want, I am going to use the json.dumps() function to force it to be converted. Logstash expects each JSON object to be delineated with a newline character ("\n") so I'm going to add one of those to the end and I'm going to save all of that into a new variable called "sample_to_send":

sample_to_send = json.dumps(sample_item) + "\n"

Now I need to create a network connection to my logstash server. I know that it has an IP address of 192.168.1.8 because I checked it earlier, and I know that logstash is listening for TCP connections on port 10001 because I configured it that way. To open that connection in python is a multi-step process. First, I create a socket item that I call "my_socket":

my_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

Then I'm going to tell it to connect to the logstash server:

my_socket.connect(("192.168.1.8", 10001))

If that doesn't give an error then I'm going to try to write the test object, sample_to_send, to the socket. With python 3.x, sendall() takes a *bytes-like* object, not a string, so I need to use the encode() function -- by default it will encode to 'utf-8' which is just fine:

my_socket.sendall(sample_to_send.encode())

Finally I'm going to close my socket and exit the script:

my_socket.close()
exit()

Since I can run all of that in the interpreter, it looks like this:


Notice I didn't get an error after the "sendall()" function. Moving back over to my logstash VM I see that I have received something:


192.168.1.7 is indeed the IP address of my FreeBSD VM and my JSON object has been parsed by logstash!

Step Four - Save A Script and Add Another Field


Running things inside the interpreter is pretty handy for testing small things but I want to run this repeatedly so I'm going to save it all in a script. At that point I can do interesting things like add error checking, send multiple items in a loop, etc.

I'm going to name the script "test.py" and it will be pretty much exactly what I ran in the interpreter but with some blank lines thrown in to break it up for readability:


What if I want to add an email address for my user? How would that look? Well, it's just another field in the "sample_item" object (it's called a dictionary but dictionaries are objects...):

sample_item = { 'user_name': 'demo', 'email_address': 'demo@my-great-domain.com' }

In my script it looks like this (notice I've put the new field on a new line - this is purely for readability):


If I modify my script so that "sample_item" looks like the above and run it with "python3 test.py", I should see a new JSON object come in with a field called "email_address":


Sure enough, there it is!

Wrapping Up


This script is in pretty poor shape - for starters, it has zero error handling and it isn't documented. In a production environment I would want to do both of those things, as well as write out any errors to an optional error log on the disk. For now, though, I think it is a good starting point for getting custom data from a script to logstash without needing to rely on additional software like filebeat or nxlog. Never again will I have to look at custom log data sitting on a Solaris server and think, "you know, I'd like to have that in my SIEM but I don't know an easy way to get it there..."

There are some additional benefits, too. I don't *HAVE* to use python to send my events over. I could just as easily have sent data via netcat (nc) or via any other language that supports writing to a network socket (or, for that matter, just writing to file!). Bash even supports doing things like:

echo '{"user_name":"demo"}' > /dev/tcp/192.168.1.8/10001

It does not even need to be bash on Linux - I've tested it on both Linux and FreeBSD, and it even works on Windows 10 using the Windows Subsystem for Linux.

A slightly better-documented version of the script, and the sample logstash configuration, can be found at:

https://github.com/kevinwilcox/python-elk

Saturday, 30 December 2017

SIEM From Scratch: Getting Logs From Google

A SIEM should be able to consume, correlate and alert on data from multiple types of logs.  Google Apps (G Suite) and Microsoft 365 have been growing in popularity for years - almost everyone in my industry use one or the other - but I see very few resources publicly available for SecOps teams to get information from them.  Even worse, most of the InfoSec and SecOps teams in my industry that DO have visibility into their Google environments are stuck with the painfully slow Google web interface for search and have to go to their Ops teams for API stuff.  I have seen *hours* shaved off investigations by having Google login and access logs stored in a local SIEM, or at least searchable by the API, versus trying to pivot inside of the web interface.  In chatting with colleagues, though, I keep finding that very few use the API - or they have to rely on an external group for that type of search.

I want to change that by writing and releasing a set of scripts to search for specific types of logs, written on top of python3 and maintained against the latest version of the Google API, using the "readonly" scopes provided by Google.  My goal is to have a starting point for SecOps teams, a place where they can see how to get rolling with the API and then build off of my starter scripts to do interesting things that address the problems they face and are tailored to their environments.

Note: I stated I want to provide something small and light.  There is already a *comprehensive* solution to interacting with the Google API via the command line called GAM: https://github.com/jay0lee/GAM

Prerequisites - Google Service Account


Before you go any further, you're going to need a service account or oauth2 access token for your Google domain.  For most of us that will mean going to our Google admins and asking for an oauth2 credentials file.  Google offers a two-week test setup if you're interested in "G Suite"/"Google Apps for Business" so I moved one of my domains and dove into their account/IAM/token tools.

I'm not going to try to document how to create a service account, that could be its own post, but more information on how to generate one can be found in the Google documentation:


After the account is created, the oauth2 token (in JSON) should look something like this:


To keep things simple, I have named mine "client_secret.json".

The three scopes that need to be authorised for the purposes of this post are:


  • https://www.googleapis.com/auth/admin.reports.audit.readonly
  • https://www.googleapis.com/auth/admin.reports.usage.readonly
  • https://www.googleapis.com/auth/gmail.readonly

In the Google Admin interface, they should be specified on one line, separated by a comma.

Each API will require a "delegation" user -- a user on whose behalf the script is running.  For the "audit" API it will probably be your account but for the "gmail" API it will need to be the user against whose account the script runs.  If you have a user, foo@bar.com, and your script is getting a list of email subjects for that user, the delegation user will be "foo@bar.com".

Prerequisites - Python Modules


My API VM is a "fresh" FreeBSD 11.1 installation but it can be any system capable of running python3 - it could be FreeBSD, macOS, any modern Linux or any modern Windows.  As I've noted in other posts, I just happen to like FreeBSD.  I've added a user named 'demo' and they're able to issue commands via 'sudo'.

With that setup, I need to install python, pip and a few python modules. First, python and pip can be installed with

sudo pkg install py36-pip

If you have a Linux background, this is the equivalent of "sudo apt install python3-pip" on Debian/Ubuntu or "sudo yum install python-pip" on RH and derivatives.


Notice that on Linux systems, this usually provides "python3".  That is typically a link to a specific 3.x version of python.  FreeBSD doesn't provide the link by default so if you're installing on FreeBSD, keep that in mind. This is why it's "py36-pip" instead of something like "py3-pip".  For example, on a system running Ubuntu 16.04.3 LTS, "/usr/bin/python3" is a link to "/usr/bin/python3.5".

Once python and pip are installed, it's time to install the necessary python modules.  These are:


  • httplib2
  • oauth2client
  • google-api-python-client


This is why I wanted pip - some package managers will actually have separate packages for each of these but using pip lets me stay current.  Again, notice I'm calling it with "pip-3.6" instead of the "pip3" you'd see on a Linux system.


Start python from the command line with either "python3" or "python3.6" and you can interact with it directly.  You can use the following as a simple "test" script to make sure your modules are installed and will work:

import httplib2
from apiclient import discovery
from oauth2client.service_account import ServiceAccountCredentials
quit()

When I did it, it looked like this:


Start Scripting


Now that python/pip/necessary modules are installed and I have my oauth2 token and delegated account name, it's time to write a small script that reads in those credentials and attempts to connect to Google.  This first script is going to use the "Admin SDK" and you can read up on it here:  https://developers.google.com/admin-sdk.

There is a LOT of information there so to be a little more specific, we're going to use the "reports" API.  You can read more about that here:  https://developers.google.com/admin-sdk/reports/v1/get-start/getting-started

If you want to dive straight into some of their more technical documentation, I do find the API reference for the Admin SDK to be quite good in some ways and it can be found here: https://developers.google.com/admin-sdk/reports/v1/reference

With that bit of "light reading" provided, let's start on a "first script".  This will:


  • import the required modules
  • attempt to read the oauth2 token file (remember, mine is "client_secret.json")
  • attempt to set the necessary scopes (I listed them above)
  • attempt to create delegated access (this means the script will act on behalf of an actual account)
  • attempt to build an "admin"/"reports_v1" API object
  • attempt to authorise that object with Google


In code, this would look like:

import httplib2
from apiclient import discovery
from oauth2client.service_account import ServiceAccountCredentials
oauth2_file = "client_secret.json"
oauth2_acct = "my-account@my-company.com"
oauth2_scopes = ['https://www.googleapis.com/auth/admin.reports.audit.readonly',
                             'https://www.googleapis.com/auth/admin.reports.usage.readonly']
sa_creds = ServiceAccountCredentials.from_json_keyfile_name(oauth2_file, oauth2_scopes)
delegated = sa_creds.create_delegated(oauth2_acct)
http_auth = delegated.authorize(httplib2.Http())
service = discovery.build('admin', 'reports_v1', http=http_auth)
exit()

If I save it as "test_script.py", run it with "python3.6 test_script.py" and get no output, I know it works (and indeed it does for me).



The next thing I'm going to do is move the oauth2_ variables to another file called "api_info.py".  Instead of using, for example, "oauth2_file", I would have "import api_info" and then use that value with "api_info.oauth2_file".  In the long run it's going to save time and effort because I'm going to have several scripts all using the same credential information and if I change the scope, change the account, etc., I only have to change it in one place.  With that change, my "test_script.py" now looks like:


Authentication Logs


Now that I have a starting point, I want to make my script do something useful.  If you're going to start pulling logs from Google and put them into your SIEM (which should ultimately be the goal...), I would recommend starting with the login logs.  This gives you really good data like:


  • who logged in/out
  • at what time
  • from which IP address
  • success or failure


If you're correlating other authentication logs, this is a great addition for correlation.  Don't be afraid of volume here - in my day job we have approximately 20,000 users and the Google login logs are typically just a few megabytes per day.  I know of organisations with tens of thousands of users and they have 50GB/day Splunk licenses, they make sure they get their Google authentication logs.

When you query the "reports_v1" API, Google provides a list of "activities".  Each activity is a JSON object.  Those fields are documented here:  https://developers.google.com/admin-sdk/reports/v1/reference/activities/list

This is a sample failed login for one of my domains:


At 17.03 UTC on 28th of December, someone at the IP address 1.2.3.4 tried to login to Google using "a.user@my-domain.com" as the username and it failed due to an invalid password.  Yes, I edited the *content* of the fields for demonstration purposes but each successful and unsuccessful login will have each of those fields with the appropriate values. That object will come all on one line, though, so it can be a bit difficult to read.

Getting the Authentication Logs


Now it's time to work on the script so it retrieves and displays authentication logs in a meaningful way!

From the API documentation, I know that I need to call activities().list().execute(), and I know I need to give it two parameters:


  • applicationName - this will be 'login'
  • userKey - this is a specific FULL email address in your domain OR you can use the keyword 'all'


From the above screenshot, I also know that I'm going to get a bunch of JSON objects that have ['id']['time'], ['actor']['email'] and ['ipAddress'] fields, so I know I can look specifically for those.  I also know ['events'][0]['name'] is going to tell me where it was a login_success, login_failure or logout, so I want that as well.

Adding that information to my script, I know have:

import api_info
import httplib2
from apiclient import discovery
from oauth2client.service_account import ServiceAccountCredentials
sa_creds = ServiceAccountCredentials.from_json_keyfile_name(api_info.oauth2_file, api_info.oauth2_scope)
delegated = sa_creds.create_delegated(api_info.oauth2_email)
http_auth = delegated.authorize(httplib2.Http())
service = discovery.build('admin', 'reports_v1', http=http_auth)
results = service.activities().list(userKey='all', applicationName='login').execute()
activities = results.get('items', [])
for activity in activities:
  print()
  print("New login record")
  print("Time: " + activity['id']['time'])
  print("Email Address: " + activity['actor']['email'])
  print("IP Address: " + activity['ipAddress'])
  print("Event result: " + activity['events'][0]['name'])
exit()

Since I'm doing something specific, I'm going to go ahead and save this version as "get_logins.py".

When it runs, I'll get a list of the first 1000 login successes and failures for all users in my domain for UP TO the last 180 days.  The 1000 limit is easy to address but it's beyond the scope of this post; I'll provide a github link at the end that has a version of this script with it included. For example, this is an (edited) sample of what I get for an account I've only used from one location:


If I wanted to search for JUST logs for "test.acct@my-domain.com", I'd use that as the userKey value instead of 'all' and my results would be identical.

IR-driven Scripting


Since most of my work is incident response driven, let's have a wee scenario.  An attacker phishes a set of credentials and logs into someone's email.  At that point they decide to launch a spear-phishing campaign against select members of your management but they want any responses to be invisible to the person actually uses the account - maybe they add a filter that automatically sends those emails to the Bin.  It's 2 AM and the SOC analyst on-call, Lexi, gets a support ticket from the CFO saying, "this is Janet, the CFO. I just received an odd email from Steve in HR saying he has an emergency purchase that needs approval tomorrow but the file he sent me won't open! the subject is 'emergency purchase'." The analyst takes a closer look and sees it was actually submitted at 9PM the night before but they're just now receiving it.

Okay, let's walk through this.  It's 2.00 AM so calling Janet is a Really Bad Idea.  You don't call C-levels in the middle of the night unless they've JUST contacted you.  Angry spouses, upset babies, waking up a C-level, these are all resume-generating events.  Your analyst probably isn't a Google "superadmin" so they can't check the actual email log to get information about the emails sent from Steve to Janet.  For the sake of argument let's say you aren't using Vault or some other archival tool because <pick a reason>.  What does Lexi do?

As it turns out, there are a host of tools available to her via the "gmail" API.  Google's documentation for it is here:

https://developers.google.com/gmail/api/guides/

and the API reference is available here:

https://developers.google.com/gmail/api/v1/reference/

One simple thing to do would be to search Janet's email for any messages from Steve with a subject of "emergency purchase".  From the above reference, I know the API lets me retrieve a list of message IDs that match a query filter and that I can then use that message ID field to retrieve actual emails; additionally, I know that I can use ['payload']['headers'] to get the message headers (like "From", "To", "Subject", etc) and I know there is a ['snippet'] field that has a short, plain-text version  of the email. With that knowledge, I can write something like the following:

import api_info
import httplib2
from apiclient import discovery
from oauth2client.service_account import ServiceAccountCredentials
query = 'subject: "emergency purchase'
userID = 'cfo.janet@my-company.com'
sa_creds = ServiceAccountCredentials.from_json_keyfile_name(api_info.oauth2_file, api_info.oauth2_scope)
delegated = sa_creds.create_delegated(userID)
http_auth = delegated.authorize(httplib2.Http())
service = discovery.build('gmail', 'v1', http=http_auth)
query = 'subject: "emergency purchase'
userID = 'cfo.janet@my-company.com'
results = service.users().messages().list(userId=userID, q=query).execute()
messages = results.get('messages', [])
for aMessage in messages:
  mid = aMessage['id']
  msgObject = service.users().messages().get(userId=userID,id=mid).execute()
  for aHeader in msgObject['payload']['headers']:
    if aHeader['name'] == "To":
      print("Recipient is: " + aHeader['value'])
    elif aHeader['name'] == "From":
      print("Sender is: " + aHeader['value'])
    elif aHeader['name'] == "Subject":
      print("Subject is: " + aHeader['value'])
    elif aHeader['name'] == "Message-ID":
      print("Message ID is: " + aHeader['value'])
    print("Snippet from email: ")
    snippet = msgObject['snippet']
    print(snippet)
    print()
exit()


NOTE: the gmail.readonly, gmail.modify or https://mail.google.com/ scopes must be allowed for this to work.  I HIGHLY recommend using the gmail.readonly scope unless you want your SecOps team to have the ability to delete emails (which you may do once they're adept at finding the message IDs of phishing messages).

If Lexi had such a script, named 'get_headers.py', and were to run it, she may get something like this:


Using the same API, she could go further and retrieve the actual attachment.

What if the email HAD come from Steve's proper account, though?  At this point Lexi could use get_logins.py to see which IPs had accessed Steve's account and then look for additional accounts being accessed from the same address.  She could then possibly find other malicious/phishing emails that were sent and, if allowed the .modify scope, delete them from user mailboxes before the user ever sees them.

Wrapping Up


The Google/G Suite API offers a fantastic opportunity for incident responders and SecOps teams to have visibility into their email and collaboration environment.  It not only allows us to pull information in an easily-parsed format (json) for one-off searches but also to pull logs in volume to import into our (hopefully) much faster and more powerful SIEMs.  With a little bit of tuning, any of the scripts I've offered above can write log data in CSV, JSON or XML, with field and header names of your choosing, and they can be executed via any scheduling mechanism your operating system uses.  Since this is part of a series about being a SIEM, a post in the very near future will rely on some of these scripts writing out in JSON so if you do take a look at them on github, know they are very much early versions!

As promised, they're at:

https://github.com/kevinwilcox/python-google-api

Take a look at all the other tools that are available, by all means.  GAM is the de facto for managing Google domains from the CLI and should be part of everyone's toolkit...but if you're going to embed GAM into a script to do something, why not use the API directly to accomplish *exactly* what you want?

Saturday, 16 December 2017

New Multi-Part Series: SIEM From Scratch

This year I've really jumped into the Elastic stack, rolling out dozens of small environments for personal projects and working countless undocumented evenings at my day job to profile performance, test filters and deliver the log aggregation ecosystem I'd promised my system administrators, network admins and superiors that I could deliver. I've written several posts on Logstash and Elasticsearch that were geared for folks trying to get started with the ecosystem. I hope others have found them useful and I hope they've been able to grow their environments into something that does what they need.

Every day I see more and more articles and blog posts about doing interesting things with Elastic, about deploying it as a log aggregation system, written by people who are infinitely more talented than I. They do fantastic things with the platform and I will readily admit my envy for what they are able to accomplish. My issue with those great tutorials is that they don't usually put things together in a way a true beginner can follow - they assume someone who has a specific task they want to accomplish (using various filters for enrichment, getting Windows logs into an existing stack, using the stack for DNS analysis, etc).

I am guilty of the same thing. My "getting started" tutorials assume you already have an interest in the stack and need to accomplish specific tasks. I believe this is useful and has its place; I also believe there are people who want to solve a larger problem and need a little bit of hand-holding until they're ready to start asking questions about those specific things. That's where my new multi-part series will come in.

This weekend I am starting a new set of posts that have a clear goal in mind. This is for the people who are 100% brand-new to log aggregation and SIEM, the folks who may have just made a career change (or want to make a career change) to SecOps (or Ops), the ones who have just inherited a dozen servers that all log to local files in interesting and who will benefit from aggregating those logs, the people who have just found out they're responsible for PCI compliance at their small organisation and need to start building a monitoring and alerting program with basically no budget. My goal is to give you the means to deploy a log aggregation platform that allows you to normalise log data, search logs from dozens of devices in seconds and generate alerts so you know when odd things are happening, all with the assumption that you have no prior experience in any of those areas.

This weekend I'm starting "SIEM From Scratch" and I'm pretty excited about it.

Sunday, 12 November 2017

Logstash Profiling: Time in the Pipeline

I have been pushing Mark Baggett's domain_stats.py (https://github.com/MarkBaggett/domain_stats) script out to my logstash nodes this week and I saw a pretty big hit on throughput. That's okay, this was expected - the script goes out and does a whois lookup for every domain you send to it, then caches the results in memory. The first time it's super slow if whois is slow, the next time it's super fast. It's pretty awesome.

When I found out about the script from Justin Henderson (https://github.com/smapper), he warned me that it's best used on domains that aren't in the Alexa/Umbrella Top One Million. He's right, for the most part there isn't really a reason to do lookups on domains that are popular enough to be in the top one million sites on the Internet, but I'm not tagging those yet and sometimes I just really want more information about ad networks when they show up in my DNS logs. As a result, I started sending my 1k DNS queries per second to domain_stats and I *very quickly* saw a drop in throughput.

That left me with two questions:

  1. How many events were going through my pipeline each second before and after I started doing DNS enrichment?
  2. How long does each event, on average, spend in my pipeline, both before and after I started doing DNS enrichment?

X-Pack is fantastic for measuring events per second and I'll address its installation/configuration soon. The more interesting question right now is the latter and the solution was a lot easier than I thought it would be.

My Pipeline


Before I get into timing, I want to give an overview of my log pipeline. I have a cluster of logstash ingest nodes that receive events via a combination of inputs. These nodes send data to a RabbitMQ cluster that uses mirrored, persistent queues (highly available queues that store messages on disk). Another logstash cluster pulls events from RabbitMQ and then runs each event through a series of filters that parse fields, normalise names, add fields based on the workflows of my coworkers, enrich with additional information based on the fields already present in each log item and then send the enriched/processed data to the appropriate Elasticsearch index. That means from end to end, the log flow is:

[ endpoint beat ] --> [ logstash_ingest ] --> [ rabbitmq ] --> [ logstash_enrich ] --> [ elasticsearch ]

There are several ways to lay things out and it's perfectly acceptable to have endpoints use beats to write directly to Elasticsearch, if that's your preferred method. I've used Kafka as a message buffer and it works well - and the beats can write to it natively! Don't assume this is THE way to do it, figure out what works for you and what addresses the risks you face.

The Goal


The question I want to answer is how much time an event spends in my pipeline: I want to know when it enters the pipeline, when it comes out of RabbitMQ and when it is sent to Elasticsearch. To do this, I decided to:

  • add a timestamp as the first (and only) filter when an event is received by an ingest node, called "pipeline_start"
  • add a timestamp as the first filter on the enrich cluster when an event is retrieved from RabbitMQ, called "pipeline_rabbitmq_out"
  • store the difference between "pipeline_rabbitmq_out" and "pipeline_start" as "pipeline_rabbitmq_processing"
  • add a timestamp as the last filter on the enrich cluster, before the event goes to Elasticsearch, called "pipeline_stop"
  • store the difference between "pipeline_start" and "pipeline_stop" as a field called "pipeline_processing_total"

It's pretty basic timing and it gives me an easy way to look *at any log event* and see if the overhead was on ingest/rabbitmq or if it was the enrichment cluster.

Making it Work


The easiest way I've found to add a timestamp to an event is with the ruby filter:

filter {
  ruby {
    code => "
      event.set('s_time', Time.now)
      sleep(1)
      event.set('e_time', Time.now)
    "
  }
}

This adds a field, s_time, that is an ISO8601-formatted timestamp with the current time. It then sleeps/waits for one second and adds *another* field, e_time, that is also an ISO8601-formatted timestamp with the current time. Since there is a one-second pause between the two timestamps, there should be a one-second difference when I use this as my filter. Note I'm using stdin for input (so I can type something in) and stdout with the rubydebug codec as my output (so I can see the parsed version of what I type in).

My entire testing config looks like this:


And when I run it with the test input of "my great log event", I get this:


Indeed, I have two timestamps, s_time and e_time, and they are one second apart!


Taking the Difference


Now let's go a step further. Having the timestamps is great but that leaves me looking at multiple fields and seeing the difference. Can't I just get the difference in ruby and store that as a third field?

Well...yes I can!

Ruby lets you subtract one timestamp from another and it handles all the type conversion on the backend. That means I can do something like "e_time - s_time" and Ruby will "just take care of it".

I'm going to update my filter to do the subtraction and store the third field, p_time (for processing time). Note I'm using "event.set('foo')" and "event.get('foo')" to set and retrieve the fields in my log event:


Now when I run with the same test input, I get:


p_time exists and it's being parsed as a number - you can tell because it's in blue and NOT in quotation marks. If I were sending this to a full ELK stack I could have a dashboard that showed the average p_time and, if I were tagging my logs, the tags associated with the events that took longest to process. Pretty useful information!


A Caveat: Time Objects Versus Strings


When I put this into production, I hit a bit of a snag. I was adding my start time in the first logstash cluster, sending the event through my buffer, then adding the second timestamp and finding the difference in the second cluster. event.get('s_time') would give me the ISO8601 timestamp but my processing times were coming back as the floating point representation of the ending timestamp. This means my start times were being stored but they were being treated as a zero during subtraction. I don't know why this happened, I don't know if it was something happening with logstash or RabbitMQ, but it was problematic!

After a bit of digging, I found out about the "Time.parse()" method in Ruby. This lets you create a Time object from a formatted string. To better represent how I've setup my filters, I'm going to separate setting the s_time and e_times into different filter blocks. In the first block I'm going to get the timestamp as a formatted string instead of as a Time object and store that as s_time. In the second block I'll use local variables to do the arithmetic and then store the e_time and p_time values using the "event.set('event_field', 'local_variable')" syntax:


When it runs, I get the following:


e_time is a timestamp object but s_time is an ISO8601-formatted string. That's okay! When the values are sent to Elasticsearch, it will store both as timestamps. I don't have a use-case for searching these yet but while I'm debugging I have no incentive to remove them.

One decision I did make when I did this in my production clusters is to only apply the pipeline timing filters to events with a "pipeline_metric" tag. This allows me to only add the timestamps to select groups of logs (or for one of my sys-admins to tag the logs they're sending, should they decide to test processing time).

Wrapping Up


As I stated, X-Pack does a great job of showing general metrics - events received and sent by logstash, how much CPU and heap is being used by logstash or elasticsearch, how many events per second are being indexed by an Elasticsearch cluster (or node), etc - but it doesn't really help me evaluate how long it takes to process an event. Logstash offers an "elapsed" filter but that is only useful for the time that has elapsed between two log events, not since certain actions were taken against a specific event. This gives me a clean way to solve that issue.

More than just the system time is available to ruby. With multiple Logstash clusters, maybe it's good to know which ingest or processing node handled a given event - you can add this with:

event.set('node_name', Socket.gethostname)

If you combine that with time-based metrics, you suddenly have the ability to make *very* useful dashboards about which types of logs are taking the longest to process, which filters are taking the most time to run and which nodes have the highest (or lowest) processing time for those logs. It's pretty powerful stuff!

Saturday, 16 September 2017

Puppet Part Three: In Which I Write My First Module

In my previous post I wrote a manifest for stark that removed specific groups/users and added specific groups/users. This allowed me to do some user standardisation on stark (if this doesn't make sense to you, please read my previous post, Puppet Part Two). On the whole it's pretty nifty, right? I mean, I could copy and paste the user/group stuff from stark's manifest into the manifests for the other systems and have my user needs satisfied...but is that the best way?

Modules: A Quick Overview


There are _a lot_ of things to say about modules and the Puppet documentation on them is here:


The short, short version is that modules let you write something one time and then use it in lots of places. Modules are made up of classes and a class, ideally, should do one thing (object-oriented programming is the programmer's derivative of the Unix philosophy?). For example, you may have a module called 'ssh' - then that module may have one class for installing the SSH server package, another class to configure sshd_config and another class to configure the system ssh_config.

Using my scenario above, I have a group of accounts that I want to exist on <some or all of> my servers. Instead of creating those accounts in every manifest I have, I can write a module that does account-y stuff and then add that one module to each server's manifest. Then when I need to delete an account I can do it one time, in my module, and as my servers check in, they'll get the new configuration and remove the account. Write once, use lots of places, problem solved!

One quick note: remember, I'm a Puppet noob. I've written very basic modules to do user-y and basic administration-y type stuff. I've not written modules to do more complex things. That's okay, we're learning (more or less) together and the Puppet folks on Twitter (@puppetize) are phenomenally supportive.  Their documentation covers doing interesting things, I'm just covering getting started. I may do some more complex things in several months but for now, I'm relying mostly on the work of others and I'm keeping MY work fairly simple.

With that said, let's write a module that does some basic administration-y type things that allow for standardisation of some system services.

The Layout


There are some requirements for Puppet to use a module. It needs to be in a directory that is designated for modules in Puppet's configuration. On Ubuntu, by default, these are:

/etc/puppetlabs/code/modules
/etc/puppetlabs/code/environments/production/modules

This is configurable but that configuration is outside the scope of this post. Maybe later, if I dive into multiple environments or custom paths for multiple maintainers.

A module itself needs a few things to work. At a bare minimum, it needs:

o a metadata.json file that gives the module name, version, author, summary, license and some other information
o a manifests directory
o a manifests/init.pp file that has the initial class declaration; the initial class is the same name as the module

I know, that last bullet may be a little confusing. It will make more sense when we take a look at one.

Three Choices


There are three ways to create a module. You can copy an existing module into the appropriate directory and modify it to fit your needs but that can be a lot of work - it is worth it to have a template you can copy over for new modules if you're going to write several of them but that may be unlikely.

The second option is to use the 'puppet module' command. This is the same way you install and remove 3rd party modules and it is an easy way to create the initial directory structure, metadata file, init.pp file and basic documentation. Using the 'puppet module' command also creates several more files and sub-directories built around the idea that you're going to share your module with the world and do testing across multiple platforms. It is the most complete method available and everyone writing puppet modules should use it at least once.

The third option is to create the necessary directories and files yourself. This was my choice since it's not a lot of work, I'm not doing anything especially complex in my modules and I don't plan on sharing them with anyone. By choosing this option I'm almost guaranteeing to do something in an incorrect way, do it manually at your own risk!

Create the Module


Puppet lets you store modules pretty much anywhere on the filesystem you want to, provided you tell it where to find them. I like using /etc/puppetlabs/code/modules so that's where I'll create this one.

First I'll change to that directory:

cd /etc/puppetlabs/code/modules

Then I'll make the directory structure and required files for my new module, "my_users". Ordinarily I would name using "camel case" or "stair-stepped case" - "myUsers" - but when the puppet agent does a manifest lookup it would look for "myusers" and that would fail. I have several classes with an underscore in the name now...

mkdir my_users
mkdir my_users/manifests
touch my_users/metadata.json
touch my_users/manifests/init.pp

Then I'll add the following to the 'my_users/metadata.json' file:


Here is the copy/paste version:

{
  "name": "test-my_users",
  "version": "0.0.1",
  "author": "my name",
  "summary": "User and group standardisation for my VMs",
  "license": "BSD 3-Clause",
  "source": "",
  "project_page": "",
  "issues_url": "",
  "dependencies": [
    { "name":"puppetlabs-stdlib", "version_requirement":">=1.0.0" }
  ],
  "data_provider": null
}

A few quick items. Notice I've named it "test-my_users". The name of the module is the format <author>-<module name>. I'm just using "test" because, well, this is a test after all! I've used "my name" as the author but it doesn't matter which name you use here. I like the 3-Clause BSD licence, also known as BSD 2.0, so that's what I'm using - it basically says use the module however you want but do so at your own risk, I'm not liable if it destroys your data centre, just using the module doesn't mean you'll get support, I'm not endorsing your product and, if you use the module in your product, you have to say you're using it. Read the licences that are out there, they're important.

Right now I just want to get a module that loads so I'm going to use a very basic my_users/init.pp file. All I'm putting in it is the initial class declaration:


At this point it doesn't do anything but it's a good time to see if I can use it with a manifest.

Include and Require


There are two ways to make sure a manifest uses a module or class - either by using "include" or by using "require". These have two very different meanings!

"include" tells puppet to make sure the contents of a class are included when it generates the list of things for an agent to do. You could, in theory, include ten different classes that all include each other and puppet can sort that out. Using "include" does not put any specific constraints on ordering - it lets puppet sort all of that out.

"require" tells puppet to make sure the contents of a class are included *in a specific order". If puppet sees a "require" statement, it will make sure everything that is part of that "require" statement is done before continuing. That is great if you need to have things happen in a certain order, ensure specific files exist before starting a service, etc., but in general it can lead to some serious management headaches. I will use "include" unless I absolutely need to use a "require".

Using a Module or Class in a Manifest


I know that baratheon, my actual puppet server, is set to be managed by puppet because I configured it that way in my Puppet Part One post - if you read that post, though, you'll see it has an empty manifest. All the manifest contains is:

node baratheon { }

(reminder: I put that file at /etc/puppetlabs/code/environments/production/manifests/baratheon.pp)

To use the 'my_users' module, I need to add a single line to baratheon's manifest file. After editing, it will look like this:


For copy/paste, that is:

node baratheon {
  include my_users
}

This tells puppet to look in the module directories and use/include the code from the class named 'my_users'. I also could have used:

node baratheon {
  require Class['my_users']
}

Again, I don't have a specific need to use "require" so I'm using "include".

The Test...


Now that I've told puppet to use the "my_users" module, I need to test it. I can do that with:

sudo /opt/puppet/bin/puppet agent --test

If everything is good, it should compile the catalogue and return me to a prompt:


Success! Now I'm ready to make the module do something useful.

Make my_users Useful


A module is made of classes and classes do one thing. Otherwise, they do the same thing as a system's manifest! Since I have already written the code/configuration necessary to do user stuff in the stark.pp manifest, I'm going to copy it into the my_users class:


Then I'm going to test it with "puppet agent --test":


Oh no, something went wrong! ... or did it? Remember, the problem I'm trying to solve from Puppet Part Two is that I have Ubuntu systems with a user named 'test' and CentOS systems with a user named 'demo'. My goal is to remove those users and add one named 'secops'. Since I'm logged in as the 'test' user, it's going to fail on deleting the 'test' user and group. I have two choices - I can either logout and wait ten minutes (because I have puppet configured to run every ten minutes) or I can just reboot the machine. I'm going to reboot...

With the VM rebooted, I'm going to see if the 'test' user still exists:


Notice the 'Login incorrect' - the system did remove the 'test' user! But am I locked out?


No, I'm not! Success!

Just because you can use puppet to manage itself doesn't mean that's always a good idea. Weigh the cost/benefit before doing this. I like this example because it shows how quickly and easily you can lock yourself out of your management server. BE CAREFUL!

Wrapping Up


Now that I have a class for user management, I can start to simplify management. First, I can edit the manifest for stark and cut out everything I have in there -- and replace it with an include for the my_users module, the same way I did for baratheon. I can add the same include statement to the manifests for lannister and bolton. By using modules/classes, I can write one time and then include it everywhere I want to have the 'secops' user with the 'secopspass' password.

To go even further, if I need to change the password for the 'secops' user on all of my systems, add a different user, add an SSH key or more, I only need to edit the my_users class and that change propagates to all systems that use it. That is MUCH more efficient (and reliable!) than having that functionality in each server's manifest and editing possibly hundreds or thousands of files just to make one small change.

As I pointed out in a previous post, Puppet is not alone in this functionality. SCCM, ansible, chef and others are all capable and each brings their own strengths/weaknesses to the table. Each one deserves a close look and consideration, even in smaller environments.

Enriching Logs With Active Directory Attributes

Date of writing: 4th November 2018 Last edit: 4th November 2018 JUST GIVE ME THE SCRIPT AND CONFIG If all you need are the logstash co...