Tuning Elasticsearch for 40k events/sec on bare metal
Thread pools, shard strategy, custom analyzers, and the Logstash-to-Fluentd migration. How I got an NDR pipeline to handle Suricata at 100Gbps.
At Stamus Networks, Clear NDR uses Suricata for network intrusion detection. Suricata watches traffic and generates structured JSON events (called EVE JSON): alerts, flow records, DNS queries, TLS handshakes, HTTP transactions, file metadata, SSH sessions, SMB traffic, and more. All of that gets indexed into Elasticsearch (now OpenSearch in newer deployments) for analysis and correlation by security analysts.
One of our larger deployments runs 4 probes at 100Gbps, producing roughly 40,000 events per second sustained. This is a bare-metal on-premise deployment with fixed hardware. No cloud autoscaling, no “just add another node.”
The out-of-box Elasticsearch config was dropping events within minutes of starting the probes. Here’s what I changed and why.
Thread pools: the first thing that breaks
Elasticsearch’s default write queue size is 200. At 40k events/sec, even a brief GC pause fills that queue and the cluster starts rejecting writes. Each rejected bulk request means lost security events.
thread_pool.write.queue_size: 3000
thread_pool.search.size: 16
thread_pool.search.queue_size: 50000
3000 for the write queue absorbs GC pauses and spike bursts without rejecting. I arrived at this number by profiling the actual GC pause duration on the hardware and calculating how many events would queue up during a worst-case pause.
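The sizing logic can be sketched as a back-of-envelope calculation. Note that the write queue holds bulk requests, not individual events, so the required depth depends on the bulk batch size. The 40k events/sec figure comes from the deployment above; the pause duration, batch size, and headroom factor below are illustrative stand-ins, not the actual profiled numbers.

```python
# Back-of-envelope sizing for thread_pool.write.queue_size.
# The queue holds bulk *requests*, not individual events.
# Only the 40k events/sec rate comes from the real deployment;
# the other inputs are illustrative.

def write_queue_depth(events_per_sec, gc_pause_sec, events_per_bulk, headroom=2.0):
    """Estimate how many bulk requests pile up during a worst-case GC pause."""
    queued_events = events_per_sec * gc_pause_sec
    queued_requests = queued_events / events_per_bulk
    return int(queued_requests * headroom)  # headroom for back-to-back pauses

depth = write_queue_depth(
    events_per_sec=40_000,  # sustained rate from the probes
    gc_pause_sec=1.5,       # worst-case pause (illustrative)
    events_per_bulk=50,     # bulk batch size (illustrative)
)
print(depth)  # 2400 with these inputs -- same ballpark as the 3000 setting
```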
The search queue at 50,000 sounds insane. But this is a security product. SOC analysts run heavy aggregation queries across millions of events while data is still being ingested. A query that spans 30 days of flow records and groups by source IP can generate thousands of shard-level search tasks internally. If the search queue is too small, analyst queries fail intermittently during peak ingest periods, and they stop trusting the tool.
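The fan-out math makes the number less surprising. Each query spawns one task per shard it searches, and a SOC dashboard fires many queries at once. The shard count, analyst count, and panel count below are illustrative assumptions, not measured values:

```python
# Why 50,000 search queue entries is plausible: one query produces one
# task per shard it touches, multiplied by everything running concurrently.
# All inputs are illustrative.

def in_flight_shard_tasks(days, shards_per_daily_index, analysts, panels_per_dashboard):
    tasks_per_query = days * shards_per_daily_index      # one task per shard searched
    concurrent_queries = analysts * panels_per_dashboard # dashboards fan out too
    return tasks_per_query * concurrent_queries

tasks = in_flight_shard_tasks(
    days=30,                  # query window over daily indices
    shards_per_daily_index=4, # illustrative
    analysts=10,              # illustrative
    panels_per_dashboard=8,   # illustrative
)
print(tasks)  # 9600 shard-level tasks from routine dashboard use alone
```

Add a few ad-hoc investigations during peak ingest and a small queue rejects queries long before the cluster is actually out of capacity.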
Search thread pool is pinned to 16, matching the core count. The default would have been higher, but on a write-heavy workload, search threads competing with bulk indexing threads for CPU time causes both to degrade. Better to limit search concurrency and let writes have priority.
Shard limits and disk watermarks
cluster.max_shards_per_node: 2500
cluster.routing.allocation.disk.watermark.low: 90%
cluster.routing.allocation.disk.watermark.high: 92%
cluster.routing.allocation.disk.watermark.flood_stage: 94%
The shard limit prevents the “1000 tiny indices” problem that time-series workloads create. Each daily index has at least one shard, and with 10+ event types, you’re creating 10+ shards per day. Over months, that adds up. At 2500 max per node, the cluster enforces a ceiling and you’re forced to clean up old indices instead of letting them accumulate.
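The accumulation rate is easy to put a number on. Treating one shard per event type per day as the floor (real volume-heavy types get more):

```python
# How fast daily per-event-type indices eat the shard budget.
# One shard per index is the floor described above, so this is
# a lower bound on the accumulation rate.

event_types = 10
shards_per_index = 1  # minimum; high-volume types get more
shards_per_day = event_types * shards_per_index

days_to_ceiling = 2500 // shards_per_day
print(days_to_ceiling)  # 250 days before cluster.max_shards_per_node bites
```

Without a retention job, a cluster like this hits the ceiling in well under a year even at the minimum shard count.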
Disk watermarks are pushed way above the defaults (85/90/95). On a dedicated cluster where a retention job deletes old indices hourly, the conservative defaults waste a lot of disk space. The flood stage at 94% is a last resort that makes all indices read-only. In a security monitoring system, going read-only because of a disk threshold is worse than running at 92% for an hour while the retention job catches up.
Index templates: not all events are equal
Suricata generates wildly different volumes per event type. Flow records are 80%+ of total volume. Alerts might be 1%. HTTP metadata is somewhere in between. Putting them all in one index with the same shard count wastes resources on small event types and under-provisions large ones.
The Logstash output routes events to separate index patterns:
logstash-alert-YYYY.MM.DD
logstash-flow-YYYY.MM.DD
logstash-dns-YYYY.MM.DD
logstash-http-YYYY.MM.DD
logstash-tls-YYYY.MM.DD
logstash-fileinfo-YYYY.MM.DD
logstash-host_id-YYYY.MM.DD
logstash-aggregate-YYYY.MM.DD
Each gets its own template with shard counts based on actual volume. The shared template sets some important defaults:
{
  "settings": {
    "index": {
      "number_of_replicas": 0,
      "refresh_interval": "30s",
      "mapping.total_fields.limit": 10000
    }
  }
}

Zero replicas because it’s a single-node deployment. No point replicating shards to the same machine. The 30-second refresh interval is a big one: default is 1 second, which means Elasticsearch creates a new Lucene segment every second. At 40k events/sec, that’s a lot of tiny segments that need constant merging. Bumping to 30 seconds means fewer, larger segments and less merge pressure. The tradeoff is that new events take up to 30 seconds to appear in search results, but for flow records that’s fine. Alert indices stay at a shorter refresh because analysts need those immediately.
The 10,000 field limit accommodates Suricata’s deeply nested EVE JSON. A single HTTP event can have request headers, response headers, user agent parsed fields, geo IP data, and file metadata. The default 1000 fields limit would be hit within days.
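A per-event-type template layers the volume-specific settings on top of those shared defaults. The post doesn't give the actual shard counts, so the numbers in this sketch are assumptions; the shape would be something like:

```json
{
  "index_patterns": ["logstash-flow-*"],
  "settings": {
    "index": {
      "number_of_shards": 4,
      "number_of_replicas": 0,
      "refresh_interval": "30s"
    }
  }
}
```

The alert template would be the mirror image: `logstash-alert-*` with a single shard (alerts are ~1% of volume) and a shorter `refresh_interval` so new alerts show up in search quickly.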
Custom analyzer for network data
The index template includes a custom analyzer tuned for network security data:
"analysis": {
"analyzer": {
"sn_analyzer": {
"tokenizer": "sn_tokenizer",
"char_filter": ["sn_lowercase"]
}
},
"tokenizer": {
"sn_tokenizer": {
"type": "pattern",
"pattern": "[ \\(\\)]"
}
}
}
This tokenizes on spaces and parentheses, which is how Suricata formats many of its string fields (rule messages, file type descriptions from libmagic). The default analyzer would tokenize on more characters and produce too many tokens for fields like file magic strings (“PE32 executable (GUI) Intel 80386, for MS Windows”), making exact-match queries unreliable.
Dynamic templates map all string fields to both text (for full-text search) and keyword (for exact match and aggregations), with norms disabled to save memory. Percentage fields get cast from long to float because Suricata sometimes outputs them as integers.
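A dynamic template implementing that string mapping might look like the following sketch. The template name and the `ignore_above` cutoff are assumptions; the text-plus-keyword multi-field with norms disabled is what the post describes:

```json
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_text_and_keyword": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "norms": false,
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          }
        }
      }
    ]
  }
}
```

Disabling norms drops the per-field length-normalization data Lucene keeps for relevance scoring, which security queries (filters and aggregations) rarely need.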
The Logstash pipeline in detail
The ingest pipeline listens on two ports: 5044 (TLS-encrypted, for remote probes) and 5045 (plaintext, for the local manager). Both use the JSON codec and receive Suricata's EVE output directly, so there's no log parsing overhead.
The filter chain does:
- Date parsing from ISO8601 timestamps
- File type extraction via a Ruby filter that splits the libmagic string on commas (taking only the first part)
- GeoIP enrichment on source IPs using MaxMind databases
- User-Agent parsing on HTTP events
- HTTP body truncation to 32,700 bytes (just under the Elasticsearch field size limit)
- Host_id upserts using document IDs derived from host+IP+worker, so repeated detections of the same host update rather than duplicate
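The file type extraction step can be sketched as a Logstash Ruby filter. The field names (`[fileinfo][magic]`, `[fileinfo][type]`) are assumptions based on Suricata's EVE fileinfo records, not taken from the actual pipeline:

```
filter {
  if [fileinfo][magic] {
    # Keep only the leading description, e.g.
    # "PE32 executable (GUI) Intel 80386, for MS Windows"
    #   -> "PE32 executable (GUI) Intel 80386"
    # Field names here are illustrative.
    ruby {
      code => "
        magic = event.get('[fileinfo][magic]')
        event.set('[fileinfo][type]', magic.split(',').first) if magic
      "
    }
  }
}
```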
The output routing uses event type to decide which index pattern each event lands in. Host_id and aggregate events use Elasticsearch’s update action with doc_as_upsert: true, which is critical for the asset tracking and beaconing detection features. An aggregate document ID combines the beaconing statistics value, tracking type, source, and destination, so repeated observations of the same C2 beacon pattern update the existing record.
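The output stage routing might look like the following sketch. The `hosts` value and the exact fields composing the document ID are assumptions; the update action with `doc_as_upsert` and the host+IP+worker ID scheme come from the description above:

```
output {
  if [event_type] == "host_id" {
    elasticsearch {
      hosts         => ["localhost:9200"]
      index         => "logstash-host_id-%{+YYYY.MM.dd}"
      action        => "update"
      doc_as_upsert => true
      # ID derived from host+IP+worker so repeated detections
      # update the same document instead of duplicating it.
      document_id   => "%{host}-%{ip}-%{worker}"
    }
  } else {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "logstash-%{event_type}-%{+YYYY.MM.dd}"
    }
  }
}
```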
Moving to Fluentd
Newer deployments use Fluentd instead of Logstash. The main motivation was memory: Logstash’s JVM heap (1GB minimum in our config, with CMS GC tuned to trigger at 75% occupancy) is a lot on memory-constrained hardware. Fluentd uses significantly less.
The Fluentd config uses file-based buffering instead of in-memory:
<buffer tag, time>
  @type file
  path /var/log/fluentd/buffer/stats
  timekey 86400
  flush_interval 10s
  retry_type exponential_backoff
  chunk_limit_size 5M
  queue_limit_length 32
  overflow_action block
</buffer>
File-based buffering means events survive service restarts. If the Elasticsearch cluster goes down for maintenance, Fluentd buffers to disk and retries with exponential backoff. Logstash’s in-memory buffers lost everything on restart.
The overflow_action: block is important: when the buffer queue is full (32 chunks of 5MB each = 160MB), Fluentd blocks the input rather than dropping events. This applies backpressure to the upstream probes, which is the correct behavior. Late data is better than lost data in a security context.
Alert events get a shorter flush interval (5 seconds vs 10 seconds for everything else) because analysts need to see alerts quickly. Stats and engine metrics are fine with the 10-second default.
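An alert-specific match block carrying that override might look like this sketch. The tag name and buffer path are assumptions; the 5-second flush and file buffering come from the setup described above:

```
<match suricata.alert>
  @type elasticsearch
  # ... connection settings omitted ...
  <buffer tag, time>
    @type file
    path /var/log/fluentd/buffer/alert
    timekey 86400
    flush_interval 5s        # alerts surface faster than the 10s default
    overflow_action block
  </buffer>
</match>
```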
What I learned
Most Elasticsearch “performance problems” at this scale aren’t about hardware. They’re about defaults that don’t match the workload. The defaults are safe for a general-purpose cluster. A write-heavy time-series workload with 40k events/sec and concurrent analytical queries is not general-purpose.
The specific numbers (3000 write queue, 50000 search queue, 30s refresh, 2500 max shards) came from profiling the actual workload on the actual hardware. They won’t be right for every deployment. But the methodology is always the same: find where events are being dropped or queries are being rejected, understand which resource is the bottleneck (thread pool, queue, disk I/O, segment merging), and adjust that specific parameter. Don’t change everything at once.