10 May 2017

Beginning ELK Part Four: On Shards and Replicas

When elasticsearch creates a new index, it looks to see if the name matches a template it knows about and, if no match is found, it uses the default template. Without configuration changes, a new index:

o has five shards
o has one replica

What does that really mean for usage and performance? That depends on your cluster. Even though this isn't going to be a hands-on post, I think it's important to geek out on the nitty gritty for a bit because this is important when talking templates (which I'll do in Part Five).

Shards and replicas are covered in the elasticsearch documentation here: 


Shards Per Index


If you only have one node in your cluster, five shards for an index could be overkill. Shards are how elasticsearch segments an index - every query against an index has to search every shard. Indexes are shared across nodes in a cluster, shards are distributed amongst nodes. If you have five nodes and five shards, that's great - each node has to search one segment and it's done (these searches happen in parallel!). If you have one node and five shards, each search against that index means the node has to perform *five searches* with searches fighting for resources. This means that from a search perspective, one shard per node would be ideal.

There are some hard limits around shards - for example, a single shard can only contain a little over two billion documents (2^31 to be exact) - so sometimes you HAVE to have multiple shards per node. Please note that the number of shards per index is set when the index is created and *can not be changed*. You can change it for NEW indexes but not for EXISTING indexes, so if the performance for a daily time-based index is cruddy now, you can change it for future indexes.

Replicas


elasticsearch is built from the ground up to cluster and protect your data. It has replication built in and by default, it wants to create a replica of your data. This means that using the default template, when an index gets created with five shards and one replica, it actually wants to create TEN shards -- five primary and five replica shards.

An easier way to explain it is to pretend that you've set the number of shards to two and you have two nodes. The first node has the PRIMARY for shard one and the REPLICA for shard two. The second node has the REPLICA for shard one and the PRIMARY for shard two. Primary one and replica one have the same data in them but only the primary is used for searching. 

If Node 1 goes away, Node 2 becomes the PRIMARY for shard one AND shard two. All searches will continue to work because Node 2 has the primary shards and all indexing will continue to function -- but now there are no replica shards.

If Node 1 comes back up, Node 2 starts reallocating data from either shard one or shard two to Node 1. It will continue to accept and index data so this reallocation can take a while. When it finishes, though, Node 1 will become primary for one of the shards and the cluster will be back to full capacity, complete with replication. By design, this means no data loss if one of the cluster members has a problem.

Suppose you decide you want to add a new node, Node 3. You can increase the resiliency of your cluster by changing the number of replicas from one to two - that would mean a copy of each shard would exist on each node. Unlike the number of shards per index, you can change this on-the-fly for existing indexes!

Yes, I'm going to demonstrate replication and reallocation of data in a future post :)

So What Is Ideal?


Well...as some of my SANS instructors were so fond of saying when we would ask questions in class, "that depends". It depends on your index/search ratio, on the number of nodes in your cluster, on the type of data you're indexing, on your requirements for search speed, on your requirements for data replication and various other factors. Your goal should be to test multiple configurations with a defined data set so you can time data import and search responses. Pay attention to disk and other resource utilisation when you're importing and searching data. Don't be afraid to delete an index and start over with different shard/replica ratios!

No comments:

Post a Comment

Note: only a member of this blog may post a comment.

A New Year, A New Lab -- libvirt and kvm

For years I have done the bulk of my personal projects with either virtualbox or VMWare Professional (all of the SANS courses use VMWare). R...