22 May 2017

Beginning ELK Part Five: Let's Talk Templates

A Teachable Moment...


I've been pushing Bro logs to ELK for some time now. I think it's a brilliant way to index and search the 125GB+ of Bro logs my organisation generates each day. That particular installation was not tweaked beyond what was necessary in logstash to read in the Bro logs and do some basic manipulation (which I'll cover here in a future post!) to make them a tiny bit more useful.

Since I didn't really change anything, I left the number of shards at five...but I only have three elasticsearch nodes in that cluster. That means two of the nodes have to do two searches *every time* I search an index.

Which brings me to tonight's post. What do you do if you have one node in your cluster but the default number of shards is five? Let's see!

Laying Out a Template


Remember that every item in elasticsearch is a JSON object - and that includes templates. For example, a template that sets the number of shards to two for any index whose name starts with "my_index" would look like this:

{
    "template" : "my_index*",
    "settings" :
    {
        "index" :
        {
            "number_of_shards" : 2
        }
    }
}

This could also be represented:

{ "template" : "my_index", "settings" : { "index" : { "number_of_shards" : 2 } } }

Or, if I wanted to set the number of shards to one and disable replication (all on one line):

{ "template" : "my_index", "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 } } }

There are two ways to apply the template -- by using curl to interact with elasticsearch directly or by telling logstash how to define the index when it writes to elasticsearch.

Create the Template With curl


First, I want to make sure I don't have a template or index named "my_index" using:

curl http://192.168.1.111:9200/_cat/indices?pretty
curl http://192.168.1.111:9200/_cat/templates?pretty

I have one template, "logstash", that gets applied to any new indexes with a name starting with "logstash-", and one index, ".kibana", that's used to store the settings for kibana. Note that the ".kibana" index has a status of "green" - that means elasticsearch has been able to allocate all of the shards for that index (I've already set it to only have one shard):


Now I can create my template with one curl command (this would all be on one line):

curl -XPUT http://192.168.1.111:9200/_template/my_template?pretty -H 'Content-Type: application/json' -d '{ "template" : "my_index*", "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 } } } '

And then I can verify it with:

curl http://192.168.1.111:9200/_template/my_template?pretty

When it runs, it looks like this:


Now if I create two indexes, "test" and "my_index-2017.05.07", I can see whether my template is used as intended. The date isn't really important in that index name; I just added it because I used "my_index*" in the template and I want to show that anything beginning with "my_index" has the template applied. I can create the indexes with:

curl -XPUT http://192.168.1.111:9200/test?pretty
curl -XPUT http://192.168.1.111:9200/my_index-2017.05.07?pretty

When I run it, I get the following output:


Then I can verify the indexes were created and the template applied with:

curl http://192.168.1.111:9200/test
curl http://192.168.1.111:9200/my_index-2017.05.07



Notice how "test" has the default five shards/one replica and "my_index" has one shard/zero replicas. The template worked!

Create the Template With Logstash


Defining the template with curl is interesting but you can also create/apply the template in your logstash configuration using the "template" and "template_name" options. I did this with the same information I used when I defined it with curl - I want to name the index "my_index-<date>", I want to name the template "my_template" and I want to set one shard with zero replicas.

First, I wrote out a file in the /etc/logstash/ directory called "my_template.json". It contains the JSON object that I want to use as my template:


And the copy/paste version:

{
  "template": "my_template",
  "settings": { "index" : { "number_of_replicas": 0, "number_of_shards": 1 } }
}

Then I added a section in my output { } block to tell the elasticsearch plugin to use my template:


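One thing worth calling out: the "template" field inside the JSON file is the index name pattern the template applies to (hence "my_index*" above), while the name the template gets in elasticsearch comes from the "template_name" option. A minimal sketch of what that elasticsearch output section can look like - the host, file path, index name and template name all match the examples above:

output {
  elasticsearch {
    hosts              => ["192.168.1.111:9200"]
    index              => "my_index-%{+YYYY.MM.dd}"
    template           => "/etc/logstash/my_template.json"
    template_name      => "my_template"
    template_overwrite => true   # optional, but convenient while iterating on the template
  }
}
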
At that point I can restart logstash and it will create the index (and template) as soon as it uses the elasticsearch plugin for storage.
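
On a package-based install that usually means something like the following (assuming systemd - adjust for however you run logstash), and then the same verification curl from earlier will show whether the template made it in:

sudo systemctl restart logstash
curl http://192.168.1.111:9200/_template/my_template?pretty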

In Closing


The default settings for elasticsearch are Pretty Good; if you're going to change anything then it's good to make small, incremental changes and document the results after each one. Maybe you need to increase/decrease shard counts per index, maybe you want to change the number of replica shards per index, maybe you want to specify a data type for a given field - all of these are set at index creation by templates.
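
As a sketch of that last point, a template can carry mappings alongside the settings - the field names here are made up for illustration, and this uses the same ES 5.x-style template format as the examples above:

{
  "template" : "my_index*",
  "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 } },
  "mappings" : {
    "_default_" : {
      "properties" : {
        "src_ip"   : { "type" : "ip" },
        "duration" : { "type" : "float" }
      }
    }
  }
}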

So what's the best way to test? I'd say take a sample data set, import it with a given configuration, see what you think. Don't like it? No problem! Delete the index, tweak the template, import the data again. For me, that's the real benefit of setting the template via the elasticsearch plugin -- if I don't like something I don't have to try to copy/paste and then edit a command, I can just open the file in an editor and continue testing.
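
Deleting a test index (or the template itself) is a single curl call each - just be sure you're pointing at the right name before you hit enter:

curl -XDELETE http://192.168.1.111:9200/my_index-2017.05.07?pretty
curl -XDELETE http://192.168.1.111:9200/_template/my_template?pretty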

As I've said before, don't be afraid to try different things and see what works for you!

10 May 2017

Beginning ELK Part Four: On Shards and Replicas

When elasticsearch creates a new index, it looks to see if the name matches a template it knows about and, if no match is found, it uses the default template. Without configuration changes, a new index:

o has five shards
o has one replica

What does that really mean for usage and performance? That depends on your cluster. Even though this isn't going to be a hands-on post, I think it's important to geek out on the nitty gritty for a bit because this is important when talking templates (which I'll do in Part Five).

Shards and replicas are covered in the elasticsearch documentation here: 


Shards Per Index


If you only have one node in your cluster, five shards for an index could be overkill. Shards are how elasticsearch splits up an index - every query against an index has to search every shard. An index is spread across the nodes in a cluster by distributing its shards amongst them. If you have five nodes and five shards, that's great - each node searches one shard and it's done (and those searches happen in parallel!). If you have one node and five shards, every search against that index means that single node has to perform *five searches*, all fighting each other for resources. This means that from a search perspective, one shard per node would be ideal.

There are some hard limits around shards - for example, a single shard can only contain a little over two billion documents (2^31 - 1, to be exact) - so sometimes you HAVE to have multiple shards per node. Please note that the number of shards per index is set when the index is created and *cannot be changed*. You can change it for NEW indexes but not for EXISTING indexes, so if the performance for a daily time-based index is cruddy now, you can change it for future indexes.
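
If you want to see how an index's shards are actually laid out, the _cat API will list every shard, whether it's a primary or a replica, and which node holds it (the host here is just my test node - substitute your own):

curl http://192.168.1.111:9200/_cat/shards?v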

Replicas


elasticsearch is built from the ground up to cluster and protect your data. It has replication built in and by default, it wants to create a replica of your data. This means that using the default template, when an index gets created with five shards and one replica, it actually wants to create TEN shards -- five primary and five replica shards.

An easier way to explain it is to pretend that you've set the number of shards to two and you have two nodes. The first node has the PRIMARY for shard one and the REPLICA for shard two. The second node has the REPLICA for shard one and the PRIMARY for shard two. A primary and its replica hold the same data - writes always go to the primary first and are then copied to the replica, and searches can be answered by either copy.

If Node 1 goes away, the replica of shard one on Node 2 is promoted, so Node 2 becomes the PRIMARY for shard one AND shard two. All searching and indexing will continue to work because Node 2 holds a complete copy of the data -- but now there are no replica shards.

If Node 1 comes back up, Node 2 starts reallocating data from either shard one or shard two to Node 1. It will continue to accept and index data so this reallocation can take a while. When it finishes, though, Node 1 will become primary for one of the shards and the cluster will be back to full capacity, complete with replication. By design, this means no data loss if one of the cluster members has a problem.
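
If you want to watch that reallocation happen, the cluster health and recovery endpoints will show shards moving and the cluster going from yellow back to green (again, the host is just my test node):

curl http://192.168.1.111:9200/_cluster/health?pretty
curl http://192.168.1.111:9200/_cat/recovery?v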

Suppose you decide you want to add a new node, Node 3. You can increase the resiliency of your cluster by changing the number of replicas from one to two - that would mean a copy of each shard would exist on each node. Unlike the number of shards per index, you can change this on-the-fly for existing indexes!
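
That change is a single settings update against the index (the index name and host here are just examples - any existing index works the same way):

curl -XPUT http://192.168.1.111:9200/my_index-2017.05.07/_settings?pretty -H 'Content-Type: application/json' -d '{ "index" : { "number_of_replicas" : 2 } }'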

Yes, I'm going to demonstrate replication and reallocation of data in a future post :)

So What Is Ideal?


Well...as some of my SANS instructors were so fond of saying when we would ask questions in class, "that depends". It depends on your index/search ratio, on the number of nodes in your cluster, on the type of data you're indexing, on your requirements for search speed, on your requirements for data replication and various other factors. Your goal should be to test multiple configurations with a defined data set so you can time data import and search responses. Pay attention to disk and other resource utilisation when you're importing and searching data. Don't be afraid to delete an index and start over with different shard/replica ratios!
