- How many events were going through my pipeline each second before and after I started doing DNS enrichment?
- How long does each event, on average, spend in my pipeline, both before and after I started doing DNS enrichment?
X-Pack is fantastic for measuring events per second and I'll address its installation/configuration soon. The more interesting question right now is the latter and the solution was a lot easier than I thought it would be.
- add a timestamp as the first (and only) filter when an event is received by an ingest node, called "pipeline_start"
- add a timestamp as the first filter on the enrich cluster when an event is retrieved from RabbitMQ, called "pipeline_rabbitmq_out"
- store the difference between "pipeline_rabbitmq_out" and "pipeline_start" as "pipeline_rabbitmq_processing"
- add a timestamp as the last filter on the enrich cluster, before the event goes to Elasticsearch, called "pipeline_stop"
- store the difference between "pipeline_start" and "pipeline_stop" as a field called "pipeline_processing_total"
It's pretty basic timing and it gives me an easy way to look *at any log event* and see if the overhead was on ingest/rabbitmq or if it was the enrichment cluster.
Making it Work
code => "
Taking the Difference
A Caveat: Time Objects Versus Strings
When it runs, I get the following:
e_time is a timestamp object but s_time is an ISO8601-formatted string. That's okay! When the values are sent to Elasticsearch, it will store both as timestamps. I don't have a use-case for searching these yet but while I'm debugging I have no incentive to remove them.
One decision I did make when I did this in my production clusters is to only apply the pipeline timing filters to events with a "pipeline_metric" tag. This allows me to only add the timestamps to select groups of logs (or for one of my sys-admins to tag the logs they're sending, should they decide to test processing time).
As I stated, X-Pack does a great job of showing general metrics - events received and sent by logstash, how much CPU and heap is being used by logstash or elasticsearch, how many events per second are being indexed by an Elasticsearch cluster (or node), etc - but it doesn't really help me evaluate how long it takes to process an event. Logstash offers an "elapsed" filter but that is only useful for the time that has elapsed between two log events, not since certain actions were taken against a specific event. This gives me a clean way to solve that issue.
More than just the system time is available to ruby. With multiple Logstash clusters, maybe it's good to know which ingest or processing node handled a given event - you can add this with:
If you combine that with time-based metrics, you suddenly have the ability to make *very* useful dashboards about which types of logs are taking the longest to process, which filters are taking the most time to run and which nodes have the highest (or lowest) processing time for those logs. It's pretty powerful stuff!