04 April 2017

Generating Sample Log Files

As I work on ELK posts (and even on a couple of upcoming Splunk posts), I've realised I have a significant problem: I don't really have a good set of log data to use as an example for doing interesting things with ELK. That left me with a couple of options: either install some server software and generate a lot of requests over several weeks, or script something that creates a month or two of logs in a few minutes.

Well, that is a pretty easy decision!

I'm not going to walk through the script line by line. Instead, I'll talk a little bit about its design and its output, then provide a link to the script at the end.

Make Life Easy, Use JSON


This could start a text-format Holy War, but I'm going to make an executive decision from the start: use JSON. It may not be as compact as BSD syslog or CSV, but it's easily readable, it's still easy to parse with cut or awk, and you don't need to run it through a grok pattern in logstash to push it to ELK - you can dump each line directly into elasticsearch!
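
Because each line is a complete JSON document, pushing entries into elasticsearch can be as simple as a curl loop. This is a minimal sketch, not a recommendation: the local node on port 9200, the proxy-logs index and the entry type are all assumptions for illustration.

#!/usr/bin/env bash
# POST each JSON log line to a (hypothetical) local elasticsearch node
while read -r line; do
  curl -s -XPOST 'http://localhost:9200/proxy-logs/entry' \
    -H 'Content-Type: application/json' \
    -d "$line"
done < proxy.log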

My Key/Value Pairs


I wanted a log that would be *similar* to something you might find in the wild, that would let me do one or two interesting things in ELK (and later in Splunk), and that would be simple to read. I opted for something close to a web proxy log, so I decided on the following fields (keys):

1. timestamp - this is in ISO-8601 format (and in UTC)

2. ip_address - this is the IP address of the system making the request; I have six possible IP addresses:
  • 10.10.10.1
  • 10.10.10.3
  • 10.10.10.8
  • 10.10.20.2
  • 10.10.20.5
  • 10.10.20.13

3. username - this is the user authenticated with the proxy; I have seven users:
  • alvin
  • simon
  • theodore
  • piper
  • prue
  • pheobe
  • paige

4. site - this is the URL being requested; I have four sites:
  • https://www.bbc.co.uk
  • https://www.bbc.com
  • https://www.google.com
  • https://www.cnn.com

To make things interesting, any user may have any IP at any time. That means piper may request https://www.bbc.co.uk from IP 10.10.20.13 and one second later simon may request https://www.cnn.com using the same IP.
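
In bash terms, that just means choosing the user and the IP independently on every line - a hypothetical two-liner, assuming arrays named users and ips like the ones described above:

# choose a user and an IP independently, so any pairing can occur
user=${users[RANDOM % ${#users[@]}]}
ip=${ips[RANDOM % ${#ips[@]}]}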

A Note on BASH


Bash isn't my favourite shell and it isn't my favourite scripting language, but it truly is Good Enough for a lot of things, including this. To really get an idea of what you can do with it, I recommend perusing the "Bash Guide for Beginners" at The Linux Documentation Project:

http://tldp.org/LDP/Bash-Beginners-Guide/html/

The Overview


The script itself does the following:

• establish arrays for users, sites and IPs
• get the current time
• get the time from one month ago (this is arbitrary)
• write one log entry for each second between those two times (appends to a log file via stdout redirection)
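
Putting those steps together, a minimal sketch of the loop might look like the following. This is an illustration rather than the actual script: GNU date is assumed for the epoch arithmetic and proxy.log is a placeholder file name.

#!/usr/bin/env bash
# arrays of possible users, sites and source IPs
users=(alvin simon theodore piper prue pheobe paige)
sites=(https://www.bbc.co.uk https://www.bbc.com https://www.google.com https://www.cnn.com)
ips=(10.10.10.1 10.10.10.3 10.10.10.8 10.10.20.2 10.10.20.5 10.10.20.13)

now=$(date -u +%s)                     # current time, epoch seconds
start=$(date -u -d '1 month ago' +%s)  # one month back (GNU date)

# one entry per second between the two times
for ((ts = start; ts <= now; ts++)); do
  stamp=$(date -u -d "@${ts}" +%Y-%m-%dT%H:%M:%S+00:00)
  user=${users[RANDOM % ${#users[@]}]}
  ip=${ips[RANDOM % ${#ips[@]}]}
  site=${sites[RANDOM % ${#sites[@]}]}
  echo "{\"timestamp\":\"${stamp}\",\"ip_address\":\"${ip}\",\"username\":\"${user}\",\"site\":\"${site}\"}" >> proxy.log
done

Calling date and re-opening the log with >> on every iteration is a big part of why a run takes hours - more on that trade-off in Wrapping Up below.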

A Sample Entry


What does this look like in the actual log? Here's a sample entry:

{"timestamp":"2017-03-28T01:37:54+00:00","ip_address":"10.10.10.1","username":"paige","site":"https://www.bbc.co.uk"}

Again, I chose this format because the key/value pairs mean you can look at a single entry and never have to guess what any field is. One of the best things I did for my Bro logs, where I may need to give exports to people who don't read Bro every day, was to start outputting them in JSON format!
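
As a quick illustration of how easy the parsing is, pulling every site "paige" requested out of the file is a one-liner with jq (the file name proxy.log is, again, just a placeholder):

# list every site requested by paige (requires jq)
jq -r 'select(.username == "paige") | .site' proxy.log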

Wrapping Up


Generating the two million-ish log entries I want to use took about two and a half hours in my VM.

There are certainly more efficient mechanisms, BUT I also want to do some things in the future with reading log files that change (for example, having logstash or filebeat read from an active log file), so this approach fits both needs at the cost of speed, elegance and correctness (there are certainly better ways to write to a file in bash than using output redirection after every line!).
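
For example, redirecting once at the end of the loop opens the file a single time instead of once per entry - the same hypothetical loop from the earlier sketch, minus roughly two million open/close cycles:

# same loop body as the sketch above, one redirection for the whole run
for ((ts = start; ts <= now; ts++)); do
  echo "{...}"   # build the JSON entry exactly as before
done > proxy.log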

Sometimes you just don't want to run through all of the setup necessary to generate a lot of log data. Sometimes you DO want to but don't have the hardware available (and don't want to spin up a bunch of systems at AWS or Azure). When that happens, I think it's perfectly fine to script something that generates sample log data - just know it can take some work to get there!

The script I used can be found here:

https://raw.githubusercontent.com/kevinwilcox/samplelogs/master/createProxyLogs.sh
