Running Grafana Loki in Production: What We Actually Learned
Why Loki Exists: A Different Philosophy on Logs
The Components: What Each Piece Does and Why It Exists
Distributor — The Front Door
Ingester — Where Data Lives Before Storage
Querier — The Brute-Force Search Engine
Query Frontend — The Query Optimizer
Query Scheduler — The Traffic Controller
Index Gateway — The Index Serving Layer
Compactor — The Janitor
Table Manager — The Schema Enforcer
Gateway (nginx) — The Entry Point
Our Setup at a Glance
Component Sizing
The Configuration That Matters
Ingester Tuning
Limits: The Rate Limiting That Keeps You Alive
Storage: S3 + BoltDB Shipper
Caching: The Three Layers
Ring Membership: Memberlist over Consul/etcd
gRPC Tuning
The Numbers from Production
Monitoring Loki: The PromQL Queries That Matter
Distributor Metrics — Is Data Getting In?
Ingester Metrics — The Heart of the System
Query Path Metrics — Are Queries Healthy?
Cache Metrics — Your Performance Multiplier
Index Gateway Metrics — Is Index Serving Healthy?
Global Panic Metric — The Last Resort
Recommended Alert Rules
Things We'd Do Differently
We run Loki in distributed mode on EKS, processing ~1.16 TB of logs per day across ~34,000 lines/second. This post covers the architecture we landed on, the configuration decisions that actually matter, and the numbers from production that validate (or challenge) those decisions. But first — if you're evaluating Loki or just heard the name, let's build up from first principles.

Why Loki Exists: A Different Philosophy on Logs

Traditional logging systems like Elasticsearch (the ELK stack) or Splunk work by full-text indexing every log line. When a log line comes in, the system tokenizes it, builds an inverted index over every word, and stores that index alongside the raw data. This makes arbitrary text search fast, but the index itself becomes enormous — often larger than the raw logs. At scale, you're paying more to store and maintain the index than the data it points to.

Loki takes the opposite approach: index only the metadata, and store the logs as compressed chunks. Instead of indexing the contents of "ERROR: connection refused to database host db-prod-3", Loki indexes only the labels attached to that line — things like {namespace="payments", app="api-gateway", pod="api-gateway-7f8b9c"}. When you query, Loki uses the label index to find the right chunks, then brute-force greps through those chunks.

This is the fundamental trade-off: Loki trades query-time compute for storage-time simplicity. Queries are slower than Elasticsearch for arbitrary text search, but storage costs drop dramatically because you're not maintaining a massive inverted index. For most operational use cases — "show me the logs from the payments namespace in the last hour where the line contains ERROR" — this is fast enough, and the cost savings are substantial.

Think of it like this: Elasticsearch is a search engine that happens to store logs. Loki is a log storage system that happens to support search.

The Components: What Each Piece Does and Why It Exists

Before we get into our specific setup, let's understand what each Loki component does from first principles.
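To make the trade-off concrete: the operational query quoted above ("logs from the payments namespace where the line contains ERROR") is a one-liner in LogQL. The label value is the illustrative one from the example; the "last hour" part comes from the query's time range parameter, not the expression itself:

```logql
{namespace="payments"} |= "ERROR"
```

Loki uses the {namespace="payments"} matcher against the label index to find candidate chunks, then the |= filter greps each decompressed chunk for the literal string. That two-phase shape is the whole model in miniature.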
In a traditional monolithic logging system, one process handles everything — accept logs, store them, index them, and query them. Loki breaks this into discrete components so each can be scaled independently based on its bottleneck (CPU, memory, I/O, or network).

Distributor — The Front Door

The distributor is the first component that touches your log data. Every log push (from Promtail, Fluentd, or any other agent) hits a distributor.

What it does: Validates incoming log streams (checks labels, enforces rate limits, rejects old samples), then hashes the stream's labels to determine which ingester(s) should own that stream. It uses a consistent hash ring to route — the same label set always goes to the same ingester, which is critical for keeping related log lines together in memory.

Why it's separate: In Elasticsearch, the coordinating node handles both routing and querying. Loki splits these because the write path and read path have completely different scaling characteristics. Distributors are CPU-light and stateless — you can add or remove them without any data migration. They're essentially smart load balancers for your write path.

Scaling signal: CPU usage and push latency. If P99 push latency climbs above 500ms, add more distributors.

Ingester — Where Data Lives Before Storage

The ingester is the most critical and most resource-hungry component. It's the equivalent of what Elasticsearch calls a "data node" for recent data, but with a key difference.

What it does: Receives log streams from distributors, holds them in memory as "chunks" (compressed blocks of log lines), builds the index entries for those chunks, and periodically flushes both to long-term storage (S3). While data is in the ingester, it's queryable directly from memory — no storage round-trip needed.

Why it's separate and stateful: Ingesters are StatefulSets, not Deployments, because they hold state — unflushed chunks in memory and a Write-Ahead Log (WAL) on disk.
The WAL is Loki's crash recovery mechanism: if an ingester dies, the replacement can replay the WAL to recover data that hadn't been flushed to S3 yet. This is fundamentally different from Elasticsearch, where data is replicated at the storage level through Lucene segment replication. In Loki, data durability between flush cycles is handled by a combination of replication factor (writing to N ingesters simultaneously) and WAL replay.

Scaling signal: Memory usage. Ingesters are memory-bound — they hold all active streams plus unflushed chunks. If memory pressure rises, you either add ingesters (spreading the hash ring thinner) or tune flush intervals to push data to storage faster.

Querier — The Brute-Force Search Engine

The querier is where Loki's "index little, grep a lot" philosophy becomes concrete.

What it does: Executes LogQL queries by first consulting the index (via the index gateway) to identify which chunks match the label matchers, then fetches those chunks from S3 (or cache), decompresses them, and does a line-by-line scan for your filter expression. It also queries ingesters directly for data that hasn't been flushed to storage yet.

Why it's separate: In Elasticsearch, a query coordinator fans out to data nodes that each search their local shards. Loki separates this because queriers are compute-intensive and bursty — one expensive query shouldn't affect your ability to ingest logs. By isolating the read path, you can scale queriers independently of ingesters. The brute-force scan is what makes Loki queries slower than Elasticsearch for wide searches, but it's also what makes the storage layer so simple: no inverted index to update on every write, no segment merges eating I/O in the background.

Scaling signal: Query latency and memory. Wide time-range queries or unselective label matchers cause queriers to fetch and decompress many chunks. More queriers = more parallelism.

Query Frontend — The Query Optimizer

The query frontend sits between the user (Grafana) and the queriers. It doesn't execute queries itself.
What it does: Takes an incoming query, splits it into smaller sub-queries by time interval, dispatches those sub-queries to queriers (via the query scheduler), and merges the results. It also caches query results and deduplicates identical in-flight queries.

Why it's separate: This is a pattern borrowed from Cortex/Mimir (Prometheus long-term storage). Without the query frontend, a single "show me the last 24 hours" query would force one querier to scan 24 hours of data sequentially. With it, that query becomes 96 parallel 15-minute queries spread across your querier fleet. This is the single biggest lever for query performance. Elasticsearch has a similar concept with its search "phases" (query then fetch), but Loki makes the time-based splitting explicit and configurable.

Scaling signal: Rarely the bottleneck. Scale if you see request queuing.

Query Scheduler — The Traffic Controller

What it does: Maintains a queue of pending sub-queries from the query frontend and distributes them fairly across available queriers. Implements per-tenant fair queuing so one user's expensive query doesn't starve others.

Why it's separate: Without the scheduler, query frontends connect directly to queriers via round-robin. This is fine at small scale, but at high concurrency it leads to uneven load distribution. The scheduler ensures no single querier gets overwhelmed while others sit idle. Think of it as the difference between random checkout-lane selection and a single serpentine queue at a bank.

Index Gateway — The Index Serving Layer

What it does: Downloads the BoltDB index files from S3, caches them on local disk (EFS in our case), and serves index lookups to queriers over gRPC.

Why it's separate: This is Loki-specific and exists because of the BoltDB Shipper pattern. Without an index gateway, every querier downloads and caches the full index locally. With 10 queriers, that's 10 copies of the index, 10x the S3 GET requests, and 10x the local disk needed. The index gateway centralizes this into 3 replicas that serve the entire querier fleet.
This has no real equivalent in Elasticsearch, because ES stores the index within its Lucene segments on each data node — it's not a separate concern.

Compactor — The Janitor

What it does: Runs as a singleton (only one instance) and performs two jobs: (1) compacts multiple small index files into larger ones to improve query performance, and (2) applies retention policies by marking expired chunks for deletion.

Why it's separate: Compaction is I/O-intensive and runs on its own schedule. You don't want compaction competing with ingest or query for resources. In Elasticsearch, segment merging (the equivalent operation) happens on each data node and is one of the primary sources of I/O contention at scale. Loki avoids this by centralizing compaction into a dedicated component.

Table Manager — The Schema Enforcer

What it does: Pre-creates and manages the index "tables" (time-based partitions) according to the schema config. Ensures tables exist before data arrives.

Why it's separate: Mostly a legacy component from when Loki supported external index stores like DynamoDB or Cassandra with time-sharded tables. With BoltDB Shipper or TSDB it's less critical, but it still handles table lifecycle.

Gateway (nginx) — The Entry Point

What it does: A simple nginx reverse proxy that routes /loki/api/v1/push to distributors and /loki/api/v1/query to query frontends. Provides a single endpoint for clients.

Our Setup at a Glance

Cluster: EKS (Kubernetes 1.33) on AWS, 8 nodes running Bottlerocket OS
Deployment: loki-distributed Helm chart (v0.80.2), Loki version 2.9.10
Storage: S3 for chunks, BoltDB Shipper for index, EFS for compactor and index gateway
Caching: Memcached (3 tiers — chunks, query frontend, index writes)
Monitoring: Prometheus + Grafana co-located in the same cluster

Component Sizing

A few things jump out:

Ingesters are the memory hogs. Each ingester requests 16Gi and uses nearly all of it (~15.9 GB working set in production). This is because ingesters hold all active streams and recent chunks in memory before flushing to storage. If you undersize these, you'll see OOMKills that take down a chunk of your in-flight data with them.

Queriers need headroom. We run 10 queriers not because steady-state demands it, but because queries are bursty. A single user running a broad {namespace=~".+"} query over a wide time range can spike memory on multiple queriers simultaneously. The 3Gi request with a 6Gi limit gives them room to burst.

Distributors are lightweight but need replicas. At 34k lines/sec ingest, 6 distributors keep each one comfortably below its CPU limit. They're stateless, so scaling is trivial.

The Configuration That Matters

Here are the config knobs we've tuned away from the defaults, and why.

Ingester Tuning

replication_factor: 2 is the sweet spot for us. RF=2 already doubles your ingester memory overhead vs RF=1; RF=3 is safer still but would triple it. With RF=2 and WAL enabled, we can lose one ingester and recover without data loss. The WAL with 1-minute checkpointing gives us a recovery path that doesn't rely on ring state alone.

chunk_idle_period: 2m and max_chunk_age: 30m control how long data lives in ingester memory before hitting S3. Shorter values mean lower memory usage but more small chunks in object storage (which slows queries). 30 minutes is a good balance — our ingesters flush about 9.4 chunks/sec at steady state.

chunk_encoding: snappy over gzip, because at 13.4 MB/s ingest CPU matters more than compression ratio. Snappy gives us good-enough compression without burning cores.

Limits: The Rate Limiting That Keeps You Alive

A few things worth calling out:

retention_period: 72h — we keep only 3 days of logs in Loki. This is deliberate. Loki isn't our log archive; it's our log search tool. Anything older goes to S3 lifecycle rules for long-term retention. This keeps our index small and queries fast.
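Pulled together, the knobs discussed above look roughly like this in Loki's config file. This is a sketch, not our full config: the key names are Loki 2.x's, but the surrounding structure is abbreviated.

```yaml
ingester:
  lifecycler:
    ring:
      replication_factor: 2   # survive the loss of one ingester
  chunk_idle_period: 2m       # flush streams that go quiet for 2 minutes
  max_chunk_age: 30m          # force-flush chunks older than 30 minutes
  chunk_encoding: snappy      # cheaper on CPU than gzip at high ingest
  wal:
    enabled: true
    checkpoint_duration: 1m   # checkpoint the WAL every minute
limits_config:
  retention_period: 72h       # Loki is a search tool, not an archive
```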
per_stream_rate_limit: 512M — this is intentionally high. We run with auth_enabled: false (single tenant), so there's no per-tenant isolation. Instead, we rely on per-stream limits to prevent any single application from overwhelming the pipeline. If you're multi-tenant, you'd want much tighter per-tenant limits.

reject_old_samples_max_age: 168h — logs older than 7 days get rejected at the distributor. This prevents backfill jobs or misbehaving agents from pushing stale data that would create index entries far from the current write head.

split_queries_by_interval: 15m — the query frontend splits every query into 15-minute sub-queries. These are parallelized across queriers, which is why a 1-hour query range actually fans out to 4 sub-queries. This, combined with the query scheduler (5 replicas), is what keeps our P99 query latency at ~2.75 seconds despite scanning terabytes.

Storage: S3 + BoltDB Shipper

BoltDB Shipper with an Index Gateway is the key pattern here. Instead of every querier downloading the full BoltDB index from S3, the index gateway (3 replicas on EFS) serves index lookups over gRPC. This dramatically reduces the number of S3 API calls and keeps query latency consistent.

We back the index gateways and compactor with EFS (not EBS), because EFS gives us shared persistent storage that survives pod rescheduling across AZs. The ingesters use gp3 EBS volumes (200Gi each) for WAL and local index — they need low-latency local disk, not shared access.

Caching: The Three Layers

We run three separate memcached tiers: chunks, query-frontend results, and index writes. Our memcached hit rate sits at 97.8%, meaning only ~2.2% of chunk fetches actually hit S3. This is the single biggest factor in keeping query latency reasonable at our scale.

Ring Membership: Memberlist over Consul/etcd

We use memberlist (a gossip protocol) instead of Consul or etcd for hash ring coordination. One fewer external dependency to manage. It works well up to the scale we're at; if you're running 50+ ingesters, you might want to evaluate Consul for faster ring convergence.

gRPC Tuning

The default gRPC message sizes are too small for production.
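Raising those limits is a handful of config keys. A hedged sketch (the keys are from Loki's server and ingester_client config blocks; the ~200MB value is what we run):

```yaml
server:
  grpc_server_max_recv_msg_size: 209715200  # ~200MB, up from the small default
  grpc_server_max_send_msg_size: 209715200
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 209715200
    max_send_msg_size: 209715200
    grpc_compression: gzip   # trades CPU for inter-component bandwidth
```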
At high ingest rates, distributor-to-ingester messages can get large, especially when batching. We set both client and server to ~200MB. The gzip compression on the ingester client cuts inter-component bandwidth significantly.

The Numbers from Production

Here's what the cluster looks like right now:

632 active streams is low for 34k lines/sec — this means our log labels are well-structured. High cardinality (thousands of unique label combinations) is the number one killer of Loki performance. We keep stream count low by using only a handful of labels (namespace, pod, container, and app) and avoiding dynamic labels like request IDs or user IDs.

245ms P99 push latency is solid at this ingest rate. The 6 distributors with gRPC compression keep the write path fast. If this creeps above 500ms, it's time to add distributors or check whether ingesters are falling behind on flushes.

2.75s P99 query latency is acceptable for our use case (human-driven debugging sessions). If you need sub-second queries, look at increasing the querier count and reducing split_queries_by_interval.

Monitoring Loki: The PromQL Queries That Matter

Running Loki without monitoring Loki is flying blind. Loki exposes a rich set of Prometheus metrics out of the box — the challenge is knowing which ones to watch and what they're telling you. Here's the monitoring playbook, broken down by component.

Distributor Metrics — Is Data Getting In?

The distributor is your canary. If the write path is unhealthy, these metrics will show it first.

Ingest rate (bytes/sec): This is your headline number — total bytes/sec hitting Loki across all distributors. Track this on a dashboard as the primary throughput gauge. A sudden drop means log agents are failing to ship; a sudden spike means something is logging excessively (a crash loop, debug logging left on, etc.).

Ingest rate (lines/sec): Compare this against bytes/sec to derive your average line size. If lines/sec stays flat but bytes/sec spikes, something is producing abnormally large log lines (stack traces, serialized payloads).
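The two ingest-rate gauges above come straight from the distributor's counters. A sketch of the PromQL, using the metric names Loki 2.x exposes:

```promql
# Headline throughput: total bytes/sec across all distributors
sum(rate(loki_distributor_bytes_received_total[5m]))

# Lines/sec; divide the bytes query by this one for average line size
sum(rate(loki_distributor_lines_received_total[5m]))
```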
If lines/sec spikes but bytes/sec doesn't, you've got a chatty service producing many small lines.

Distributor-to-ingester failures: This should be zero. Any non-zero value means distributors are failing to write to ingesters — the ingester ring might be unhealthy, an ingester may be OOMing, or gRPC connections are timing out. Alert on this immediately.

Discarded samples (dropped logs): Logs that Loki actively rejected, grouped by reason. Common reasons: rate_limited (hitting per-tenant limits), per_stream_rate_limit (hitting per-stream limits), and greater_than_max_sample_age (old logs rejected by reject_old_samples_max_age). If you see rate_limited, your limits are too tight or a service is misbehaving. This is the metric that tells you when logs are being silently dropped.

Ingester Metrics — The Heart of the System

Ingesters are stateful, memory-heavy, and the most likely component to cause data loss if they go wrong. Monitor them closely.

Active streams: The number of unique label combinations currently active in memory across all ingesters. This is your cardinality gauge — the single most important metric for Loki health. If this number grows unbounded, you have a label cardinality problem (someone added a dynamic label like request_id or user_id). We alert if this crosses 2x our baseline.

Chunks in memory: The number of chunk objects held in ingester memory. Each active stream has at least one chunk being actively written to. This correlates with memory usage — more chunks = more RAM. If chunks grow faster than flushes, ingesters will OOM.

Flush rate: How many chunks/sec are being flushed to long-term storage. This should be steady. A drop in flush rate while ingest stays constant means chunks are accumulating in memory — check if S3 is slow or the compactor is backed up.

Chunk age at flush (P99): How old chunks are when they get flushed. Should be close to your max_chunk_age setting (30 minutes = 1800s for us). If P99 chunk age drifts significantly higher, ingesters are holding data too long — either the flush loop is slow or chunks aren't hitting the idle timeout.
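The ingester gauges described above map onto a small set of metrics. Again a sketch; the names are those exposed by Loki 2.x:

```promql
# Cardinality gauge: active streams across the fleet
sum(loki_ingester_memory_streams)

# Chunks held in memory (tracks RAM usage)
sum(loki_ingester_memory_chunks)

# Flush rate to object storage; should hold steady (~9.4/sec for us)
sum(rate(loki_ingester_chunks_flushed_total[5m]))
```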
Chunk compression ratio: How well your chunks compress. A ratio of 0.3 means data compresses to 30% of its original size (~3.3x). If this ratio climbs toward 1.0, your logs are either already compressed or have high entropy (e.g., binary data being logged). Snappy encoding typically gives 3-4x for structured text logs.

WAL size: How much data is sitting in the Write-Ahead Log. If this grows steadily, WAL checkpointing may be falling behind. A zero value (like ours currently) means the WAL is keeping up — data is checkpointed and flushed regularly.

WAL corruptions (alert on any non-zero value): WAL corruption means potential data loss on recovery. This should always be zero. Any corruption usually points to disk issues (EBS volume problems, filesystem corruption). Alert immediately and investigate the underlying storage.

WAL checkpoint duration: How long WAL checkpoints take. If this exceeds your checkpoint_duration setting (1 minute for us), checkpoints are overlapping and the WAL will grow unbounded. Usually means disk I/O is saturated.

Query Path Metrics — Are Queries Healthy?

Push latency (P99): End-to-end time for a push request. This spans distributor validation, hashing, and ingester writes. Under 500ms is healthy; above 1s means something in the write path is saturated.

Query latency (P99): End-to-end time for read queries. This is what your Grafana users feel. The acceptable threshold depends on your use case — under 5s for interactive debugging is reasonable. If this degrades, check whether queriers are memory-saturated, cache hit rates have dropped, or someone is running expensive queries.

Request rate by status code: Break down all requests by HTTP status code. 204 is success for pushes; 200 is success for queries. Watch for 429 (rate limited), 500 (internal errors), and 503 (service unavailable — usually means queriers are overloaded).

Cache Metrics — Your Performance Multiplier

Caching is what makes Loki viable at scale. If cache hit rates drop, query latency will spike and S3 costs will climb.

Chunk cache hit rate: What percentage of chunk fetches are served from memcached instead of S3. We target >95%.
Below 90% means your memcached is undersized or your query patterns aren't cache-friendly (too many unique time ranges).

Query result cache hit rate: How often the query frontend serves results from cache instead of dispatching to queriers. High hit rates here mean your users are running the same (or overlapping) queries repeatedly — common when multiple people investigate the same incident.

Memcached client health: The number of memcached servers each Loki component can see. If this drops below the expected count (e.g., from 3 to 2), a memcached pod is down and you're losing cache capacity. Consistent hashing will redistribute keys, but the hit rate will temporarily drop while the remaining nodes re-warm.

Cache write queue: How many cache writes are queued. A growing queue means memcached can't keep up with write volume — either the memcached pods need more CPU/memory, or network latency between Loki and memcached is too high.

Index Gateway Metrics — Is Index Serving Healthy?

Index gateway request latency (P99): How long it takes the index gateway to serve a lookup. Should be low (under 500ms). High latency here means the index is too large for memory and the gateway is reading from disk, or EFS is slow.

BoltDB shipper upload health: Rate of index table uploads from ingesters to shared storage. A zero rate means ingesters aren't shipping index tables — queries for recently ingested data may fail.

BoltDB shipper request latency (P99): How long BoltDB shipper operations take. Our P99 sits at ~378ms. If this spikes, shared storage (S3/EFS) is likely the bottleneck.

Global Panic Metric — The Last Resort

Panics: The count of Loki processes that panicked (crashed). This should always be zero. Any non-zero value means a component is hitting an unhandled error — check logs immediately. This is your "something is deeply wrong" alert.

Recommended Alert Rules

Based on what we've learned running this in production, the alerts worth wiring up are the zero-tolerance ones called out above (distributor-to-ingester failures, WAL corruptions, panics) plus thresholds on push latency, active stream growth, and chunk cache hit rate.

Things We'd Do Differently

Start with the TSDB index (schema v12+) instead of BoltDB Shipper. We're on schema v11 because we started early.
TSDB is significantly better for index performance and is the direction Grafana is investing in. New deployments should use it.

Set up recording rules earlier. We added Prometheus recording rules for Loki metrics after the fact. Having loki:ingester_memory_streams:sum, loki:distributor_bytes:rate5m, etc. pre-built saves a lot of dashboard query overhead.

Consider upgrading to Loki 3.x. We're on 2.9.10. Loki 3.x brings native OTLP ingest, improved bloom filters for faster queries, and the new v13 schema. The migration path exists but requires careful planning around schema migration.

Wrapping Up

Loki in distributed mode isn't "install and forget." But it also isn't the operational nightmare that some make it out to be. The key principles: keep label cardinality low, give ingesters the memory they need (and a WAL for when they die), cache aggressively, and watch the handful of metrics that tell you when data is being silently dropped.