Running Grafana Loki in Production: What We Actually Learned
Why Loki Exists: A Different Philosophy on Logs
The Components: What Each Piece Does and Why It Exists
Distributor — The Front Door
Ingester — Where Data Lives Before Storage
Querier — The Brute-Force Search Engine
Query Frontend — The Query Optimizer
Query Scheduler — The Traffic Controller
Index Gateway — The Index Serving Layer
Compactor — The Janitor
Table Manager — The Schema Enforcer
Gateway (nginx) — The Entry Point
Our Setup at a Glance
Component Sizing
The Configuration That Matters
Ingester Tuning
Limits: The Rate Limiting That Keeps You Alive
Storage: S3 + BoltDB Shipper
Caching: The Three Layers
Ring Membership: Memberlist over Consul/etcd
gRPC Tuning
The Numbers from Production
Monitoring Loki: The PromQL Queries That Matter
Distributor Metrics — Is Data Getting In?
Ingester Metrics — The Heart of the System
Query Path Metrics — Are Queries Healthy?
Cache Metrics — Your Performance Multiplier
Index Gateway Metrics — Is Index Serving Healthy?
Global Panic Metric — The Last Resort
Recommended Alert Rules
Things We'd Do Differently
We run Loki in distributed mode on EKS, processing ~1.16 TB of logs per day across ~34,000 lines/second. This post covers the architecture we landed on, the configuration decisions that actually matter, and the numbers from production that validate (or challenge) those decisions. But first — if you're evaluating Loki or just heard the name, let's build up from first principles.

Why Loki Exists: A Different Philosophy on Logs

Traditional logging systems like Elasticsearch (the ELK stack) or Splunk work by full-text indexing every log line. When a log line comes in, the system tokenizes it, builds an inverted index over every word, and stores that index alongside the raw data. This makes arbitrary text search fast, but the index itself becomes enormous — often larger than the raw logs. At scale, you're paying more to store and maintain the index than the data it points to.

Loki takes the opposite approach: index only the metadata, and store the logs as compressed chunks. Instead of indexing the contents of "ERROR: connection refused to database host db-prod-3", Loki indexes only the labels attached to that line — things like {namespace="payments", app="api-gateway", pod="api-gateway-7f8b9c"}. When you query, Loki uses the label index to find the right chunks, then brute-force greps through those chunks.

This is the fundamental trade-off: Loki trades query-time compute for storage-time simplicity. Queries are slower than Elasticsearch for arbitrary text search, but storage costs drop dramatically because you're not maintaining a massive inverted index. For most operational use cases — "show me the logs from the payments namespace in the last hour where the line contains ERROR" — this is fast enough, and the cost savings are substantial.

Think of it like this: Elasticsearch is a search engine that happens to store logs. Loki is a log storage system that happens to support search.

The Components: What Each Piece Does and Why It Exists

Before we get into our specific setup, let's understand what each Loki component does from first principles.
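To make the trade-off concrete: the operational query quoted above ("logs from the payments namespace where the line contains ERROR") is a one-liner in LogQL. The label value is the illustrative one from the example; the "last hour" part comes from the query's time range parameter, not the expression itself:

```logql
{namespace="payments"} |= "ERROR"
```

Loki uses the {namespace="payments"} matcher against the label index to find candidate chunks, then the |= filter greps each decompressed chunk for the literal string. That two-phase shape is the whole model in miniature.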
In a traditional monolithic logging system, one process handles everything — accept logs, store them, index them, and query them. Loki breaks this into discrete components so each can be scaled independently based on its bottleneck (CPU, memory, I/O, or network).

Distributor — The Front Door

The distributor is the first component that touches your log data. Every log push (from Promtail, Fluentd, or any other agent) hits a distributor.

What it does: Validates incoming log streams (checks labels, enforces rate limits, rejects old samples), then hashes the stream's labels to determine which ingester(s) should own that stream. It uses a consistent hash ring to route — the same label set always goes to the same ingester, which is critical for keeping related log lines together in memory.

Why it's separate: In Elasticsearch, the coordinating node handles both routing and querying. Loki splits these because the write path and read path have completely different scaling characteristics. Distributors are CPU-light and stateless — you can add or remove them without any data migration. They're essentially smart load balancers for your write path.

Scaling signal: CPU usage and push latency. If P99 push latency climbs above 500ms, add more distributors.

Ingester — Where Data Lives Before Storage

The ingester is the most critical and most resource-hungry component. It's the equivalent of what Elasticsearch calls a "data node" for recent data, but with a key difference.

What it does: Receives log streams from distributors, holds them in memory as "chunks" (compressed blocks of log lines), builds the index entries for those chunks, and periodically flushes both to long-term storage (S3). While data is in the ingester, it's queryable directly from memory — no storage round-trip needed.

Why it's separate and stateful: Ingesters are StatefulSets, not Deployments, because they hold state — unflushed chunks in memory and a Write-Ahead Log (WAL) on disk.
The WAL is Loki's crash recovery mechanism: if an ingester dies, the replacement can replay the WAL to recover data that hadn't been flushed to S3 yet. This is fundamentally different from Elasticsearch, where data is replicated at the storage level through Lucene segment replication. In Loki, data durability between flush cycles is handled by a combination of replication factor (writing to N ingesters simultaneously) and WAL replay.

Scaling signal: Memory usage. Ingesters are memory-bound — they hold all active streams plus unflushed chunks. If memory pressure rises, you either add ingesters (spreading the hash ring thinner) or tune flush intervals to push data to storage faster.

Querier — The Brute-Force Search Engine

The querier is where Loki's "index little, grep a lot" philosophy becomes concrete.

What it does: Executes LogQL queries by first consulting the index (via the index gateway) to identify which chunks match the label matchers, then fetches those chunks from S3 (or cache), decompresses them, and does a line-by-line scan for your filter expression. It also queries ingesters directly for data that hasn't been flushed to storage yet.

Why it's separate: In Elasticsearch, a query coordinator fans out to data nodes that each search their local shards. Loki separates this because queriers are compute-intensive and bursty — one expensive query shouldn't affect your ability to ingest logs. By isolating the read path, you can scale queriers independently of ingesters. The brute-force scan is what makes Loki queries slower than Elasticsearch for wide searches, but it's also what makes the storage layer so simple: no inverted index to update on every write, no segment merges eating I/O in the background.

Scaling signal: Query latency and memory. Wide time-range queries or unselective label matchers cause queriers to fetch and decompress many chunks. More queriers = more parallelism.

Query Frontend — The Query Optimizer

The query frontend sits between the user (Grafana) and the queriers. It doesn't execute queries itself.
What it does: Takes an incoming query, splits it into smaller sub-queries by time interval, dispatches those sub-queries to queriers (via the query scheduler), and merges the results. It also caches query results and deduplicates identical in-flight queries.

Why it's separate: This is a pattern borrowed from Cortex/Mimir (Prometheus long-term storage). Without the query frontend, a single "show me the last 24 hours" query would force one querier to scan 24 hours of data sequentially. With it, that query becomes 96 parallel 15-minute queries spread across your querier fleet. This is the single biggest lever for query performance. Elasticsearch has a similar concept with its search "phases" (query then fetch), but Loki makes the time-based splitting explicit and configurable.

Scaling signal: Rarely the bottleneck. Scale if you see request queuing.

Query Scheduler — The Traffic Controller

What it does: Maintains a queue of pending sub-queries from the query frontend and distributes them fairly across available queriers. Implements per-tenant fair queuing so one user's expensive query doesn't starve others.

Why it's separate: Without the scheduler, query frontends connect directly to queriers via round-robin. This is fine at small scale, but at high concurrency it leads to uneven load distribution. The scheduler ensures no single querier gets overwhelmed while others sit idle. Think of it as the difference between random checkout-lane selection and a single serpentine queue at a bank.

Index Gateway — The Index Serving Layer

What it does: Downloads the BoltDB index files from S3, caches them on local disk (EFS in our case), and serves index lookups to queriers over gRPC.

Why it's separate: This is Loki-specific and exists because of the BoltDB Shipper pattern. Without an index gateway, every querier downloads and caches the full index locally. With 10 queriers, that's 10 copies of the index, 10x the S3 GET requests, and 10x the local disk needed. The index gateway centralizes this into 3 replicas that serve the entire querier fleet.
This has no real equivalent in Elasticsearch, because ES stores the index within its Lucene segments on each data node — it's not a separate concern.

Compactor — The Janitor

What it does: Runs as a singleton (only one instance) and performs two jobs: (1) compacts multiple small index files into larger ones to improve query performance, and (2) applies retention policies by marking expired chunks for deletion.

Why it's separate: Compaction is I/O-intensive and runs on its own schedule. You don't want compaction competing with ingest or query for resources. In Elasticsearch, segment merging (the equivalent operation) happens on each data node and is one of the primary sources of I/O contention at scale. Loki avoids this by centralizing compaction into a dedicated component.

Table Manager — The Schema Enforcer

What it does: Pre-creates and manages the index "tables" (time-based partitions) according to the schema config. Ensures tables exist before data arrives.

Why it's separate: Mostly a legacy component from when Loki supported external index stores like DynamoDB or Cassandra with time-sharded tables. With BoltDB Shipper or TSDB it's less critical, but it still handles table lifecycle.

Gateway (nginx) — The Entry Point

What it does: A simple nginx reverse proxy that routes /loki/api/v1/push to distributors and /loki/api/v1/query to query frontends. Provides a single endpoint for clients.

Our Setup at a Glance

Cluster: EKS (Kubernetes 1.33) on AWS, 8 nodes running Bottlerocket OS
Deployment: loki-distributed Helm chart (v0.80.2), Loki version 2.9.10
Storage: S3 for chunks, BoltDB Shipper for index, EFS for compactor and index gateway
Caching: Memcached (3 tiers — chunks, query frontend, index writes)
Monitoring: Prometheus + Grafana co-located in the same cluster

Component Sizing

A few things jump out:

Ingesters are the memory hogs. Each ingester requests 16Gi and uses nearly all of it (~15.9 GB working set in production). This is because ingesters hold all active streams and recent chunks in memory before flushing to storage. If you undersize these, you'll see OOMKills that take down a chunk of your in-flight data with them.

Queriers need headroom. We run 10 queriers not because steady-state demands it, but because queries are bursty. A single user running a broad {namespace=~".+"} query over a wide time range can spike memory on multiple queriers simultaneously. The 3Gi request with a 6Gi limit gives them room to burst.

Distributors are lightweight but need replicas. At 34k lines/sec ingest, 6 distributors keep each one comfortably below its CPU limit. They're stateless, so scaling is trivial.

The Configuration That Matters

Here are the config knobs we've tuned away from the defaults, and why.

Ingester Tuning

replication_factor: 2 is the sweet spot for us. RF=2 already doubles your ingester memory overhead vs RF=1; RF=3 is safer still but would triple it. With RF=2 and WAL enabled, we can lose one ingester and recover without data loss. The WAL with 1-minute checkpointing gives us a recovery path that doesn't rely on ring state alone.

chunk_idle_period: 2m and max_chunk_age: 30m control how long data lives in ingester memory before hitting S3. Shorter values mean lower memory usage but more small chunks in object storage (which slows queries). 30 minutes is a good balance — our ingesters flush about 9.4 chunks/sec at steady state.

chunk_encoding: snappy over gzip, because at 13.4 MB/s ingest CPU matters more than compression ratio. Snappy gives us good-enough compression without burning cores.

Limits: The Rate Limiting That Keeps You Alive

A few things worth calling out:

retention_period: 72h — we keep only 3 days of logs in Loki. This is deliberate. Loki isn't our log archive; it's our log search tool. Anything older goes to S3 lifecycle rules for long-term retention. This keeps our index small and queries fast.
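Pulled together, the knobs discussed above look roughly like this in Loki's config file. This is a sketch, not our full config: the key names are Loki 2.x's, but the surrounding structure is abbreviated.

```yaml
ingester:
  lifecycler:
    ring:
      replication_factor: 2   # survive the loss of one ingester
  chunk_idle_period: 2m       # flush streams that go quiet for 2 minutes
  max_chunk_age: 30m          # force-flush chunks older than 30 minutes
  chunk_encoding: snappy      # cheaper on CPU than gzip at high ingest
  wal:
    enabled: true
    checkpoint_duration: 1m   # checkpoint the WAL every minute
limits_config:
  retention_period: 72h       # Loki is a search tool, not an archive
```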
per_stream_rate_limit: 512M — this is intentionally high. We run with auth_enabled: false (single tenant), so there's no per-tenant isolation. Instead, we rely on per-stream limits to prevent any single application from overwhelming the pipeline. If you're multi-tenant, you'd want much tighter per-tenant limits.

reject_old_samples_max_age: 168h — logs older than 7 days get rejected at the distributor. This prevents backfill jobs or misbehaving agents from pushing stale data that would create index entries far from the current write head.

split_queries_by_interval: 15m — the query frontend splits every query into 15-minute sub-queries. These are parallelized across queriers, which is why a 1-hour query range actually fans out to 4 sub-queries. This, combined with the query scheduler (5 replicas), is what keeps our P99 query latency at ~2.75 seconds despite scanning terabytes.

Storage: S3 + BoltDB Shipper

BoltDB Shipper with an Index Gateway is the key pattern here. Instead of every querier downloading the full BoltDB index from S3, the index gateway (3 replicas on EFS) serves index lookups over gRPC. This dramatically reduces the number of S3 API calls and keeps query latency consistent.

We back the index gateways and compactor with EFS (not EBS), because EFS gives us shared persistent storage that survives pod rescheduling across AZs. The ingesters use gp3 EBS volumes (200Gi each) for WAL and local index — they need low-latency local disk, not shared access.

Caching: The Three Layers

We run three separate memcached tiers: chunks, query-frontend results, and index writes. Our memcached hit rate sits at 97.8%, meaning only ~2.2% of chunk fetches actually hit S3. This is the single biggest factor in keeping query latency reasonable at our scale.

Ring Membership: Memberlist over Consul/etcd

We use memberlist (a gossip protocol) instead of Consul or etcd for hash ring coordination. One fewer external dependency to manage. It works well up to the scale we're at; if you're running 50+ ingesters, you might want to evaluate Consul for faster ring convergence.

gRPC Tuning

The default gRPC message sizes are too small for production.
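Raising those limits is a handful of config keys. A hedged sketch (the keys are from Loki's server and ingester_client config blocks; the ~200MB value is what we run):

```yaml
server:
  grpc_server_max_recv_msg_size: 209715200  # ~200MB, up from the small default
  grpc_server_max_send_msg_size: 209715200
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 209715200
    max_send_msg_size: 209715200
    grpc_compression: gzip   # trades CPU for inter-component bandwidth
```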
At high ingest rates, distributor-to-ingester messages can get large, especially when batching. We set both client and server to ~200MB. The gzip compression on the ingester client cuts inter-component bandwidth significantly.

The Numbers from Production

Here's what the cluster looks like right now:

632 active streams is low for 34k lines/sec — this means our log labels are well-structured. High cardinality (thousands of unique label combinations) is the number one killer of Loki performance. We keep stream count low by using only a handful of labels (namespace, pod, container, and app) and avoiding dynamic labels like request IDs or user IDs.

245ms P99 push latency is solid at this ingest rate. The 6 distributors with gRPC compression keep the write path fast. If this creeps above 500ms, it's time to add distributors or check whether ingesters are falling behind on flushes.

2.75s P99 query latency is acceptable for our use case (human-driven debugging sessions). If you need sub-second queries, look at increasing the querier count and reducing split_queries_by_interval.

Monitoring Loki: The PromQL Queries That Matter

Running Loki without monitoring Loki is flying blind. Loki exposes a rich set of Prometheus metrics out of the box — the challenge is knowing which ones to watch and what they're telling you. Here's the monitoring playbook, broken down by component.

Distributor Metrics — Is Data Getting In?

The distributor is your canary. If the write path is unhealthy, these metrics will show it first.

Ingest rate (bytes/sec): This is your headline number — total bytes/sec hitting Loki across all distributors. Track this on a dashboard as the primary throughput gauge. A sudden drop means log agents are failing to ship; a sudden spike means something is logging excessively (a crash loop, debug logging left on, etc.).

Ingest rate (lines/sec): Compare this against bytes/sec to derive your average line size. If lines/sec stays flat but bytes/sec spikes, something is producing abnormally large log lines (stack traces, serialized payloads).
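The two ingest-rate gauges above come straight from the distributor's counters. A sketch of the PromQL, using the metric names Loki 2.x exposes:

```promql
# Headline throughput: total bytes/sec across all distributors
sum(rate(loki_distributor_bytes_received_total[5m]))

# Lines/sec; divide the bytes query by this one for average line size
sum(rate(loki_distributor_lines_received_total[5m]))
```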
If lines/sec spikes but bytes/sec doesn't, you've got a chatty service producing many small lines.

Distributor-to-ingester failures: This should be zero. Any non-zero value means distributors are failing to write to ingesters — the ingester ring might be unhealthy, an ingester may be OOMing, or gRPC connections are timing out. Alert on this immediately.

Discarded samples (dropped logs): Logs that Loki actively rejected, grouped by reason. Common reasons: rate_limited (hitting per-tenant limits), per_stream_rate_limit (hitting per-stream limits), and greater_than_max_sample_age (old logs rejected by reject_old_samples_max_age). If you see rate_limited, your limits are too tight or a service is misbehaving. This is the metric that tells you when logs are being silently dropped.

Ingester Metrics — The Heart of the System

Ingesters are stateful, memory-heavy, and the most likely component to cause data loss if they go wrong. Monitor them closely.

Active streams: The number of unique label combinations currently active in memory across all ingesters. This is your cardinality gauge — the single most important metric for Loki health. If this number grows unbounded, you have a label cardinality problem (someone added a dynamic label like request_id or user_id). We alert if this crosses 2x our baseline.

Chunks in memory: The number of chunk objects held in ingester memory. Each active stream has at least one chunk being actively written to. This correlates with memory usage — more chunks = more RAM. If chunks grow faster than flushes, ingesters will OOM.

Flush rate: How many chunks/sec are being flushed to long-term storage. This should be steady. A drop in flush rate while ingest stays constant means chunks are accumulating in memory — check if S3 is slow or the compactor is backed up.

Chunk age at flush (P99): How old chunks are when they get flushed. Should be close to your max_chunk_age setting (30 minutes = 1800s for us). If P99 chunk age drifts significantly higher, ingesters are holding data too long — either the flush loop is slow or chunks aren't hitting the idle timeout.
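The ingester gauges described above map onto a small set of metrics. Again a sketch; the names are those exposed by Loki 2.x:

```promql
# Cardinality gauge: active streams across the fleet
sum(loki_ingester_memory_streams)

# Chunks held in memory (tracks RAM usage)
sum(loki_ingester_memory_chunks)

# Flush rate to object storage; should hold steady (~9.4/sec for us)
sum(rate(loki_ingester_chunks_flushed_total[5m]))
```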
Chunk compression ratio: How well your chunks compress. A ratio of 0.3 means data compresses to 30% of its original size (~3.3x). If this ratio climbs toward 1.0, your logs are either already compressed or have high entropy (e.g., binary data being logged). Snappy encoding typically gives 3-4x for structured text logs.

WAL size: How much data is sitting in the Write-Ahead Log. If this grows steadily, WAL checkpointing may be falling behind. A zero value (like ours currently) means the WAL is keeping up — data is checkpointed and flushed regularly.

WAL corruptions (alert on any non-zero value): WAL corruption means potential data loss on recovery. This should always be zero. Any corruption usually points to disk issues (EBS volume problems, filesystem corruption). Alert immediately and investigate the underlying storage.

WAL checkpoint duration: How long WAL checkpoints take. If this exceeds your checkpoint_duration setting (1 minute for us), checkpoints are overlapping and the WAL will grow unbounded. Usually means disk I/O is saturated.

Query Path Metrics — Are Queries Healthy?

Push latency (P99): End-to-end time for a push request. This spans distributor validation, hashing, and ingester writes. Under 500ms is healthy; above 1s means something in the write path is saturated.

Query latency (P99): End-to-end time for read queries. This is what your Grafana users feel. The acceptable threshold depends on your use case — under 5s for interactive debugging is reasonable. If this degrades, check whether queriers are memory-saturated, cache hit rates have dropped, or someone is running expensive queries.

Request rate by status code: Break down all requests by HTTP status code. 204 is success for pushes; 200 is success for queries. Watch for 429 (rate limited), 500 (internal errors), and 503 (service unavailable — usually means queriers are overloaded).

Cache Metrics — Your Performance Multiplier

Caching is what makes Loki viable at scale. If cache hit rates drop, query latency will spike and S3 costs will climb.

Chunk cache hit rate: What percentage of chunk fetches are served from memcached instead of S3. We target >95%.
Below 90% means your memcached is undersized or your query patterns aren't cache-friendly (too many unique time ranges).

Query result cache hit rate: How often the query frontend serves results from cache instead of dispatching to queriers. High hit rates here mean your users are running the same (or overlapping) queries repeatedly — common when multiple people investigate the same incident.

Memcached client health: The number of memcached servers each Loki component can see. If this drops below the expected count (e.g., from 3 to 2), a memcached pod is down and you're losing cache capacity. Consistent hashing will redistribute keys, but the hit rate will temporarily drop while the remaining nodes re-warm.

Cache write queue: How many cache writes are queued. A growing queue means memcached can't keep up with write volume — either the memcached pods need more CPU/memory, or network latency between Loki and memcached is too high.

Index Gateway Metrics — Is Index Serving Healthy?

Index gateway request latency (P99): How long it takes the index gateway to serve a lookup. Should be low (under 500ms). High latency here means the index is too large for memory and the gateway is reading from disk, or EFS is slow.

BoltDB shipper upload health: Rate of index table uploads from ingesters to shared storage. A zero rate means ingesters aren't shipping index tables — queries for recently ingested data may fail.

BoltDB shipper request latency (P99): How long BoltDB shipper operations take. Our P99 sits at ~378ms. If this spikes, shared storage (S3/EFS) is likely the bottleneck.

Global Panic Metric — The Last Resort

Panics: The count of Loki processes that panicked (crashed). This should always be zero. Any non-zero value means a component is hitting an unhandled error — check logs immediately. This is your "something is deeply wrong" alert.

Recommended Alert Rules

Based on what we've learned running this in production, the alerts worth wiring up are the zero-tolerance ones called out above (distributor-to-ingester failures, WAL corruptions, panics) plus thresholds on push latency, active stream growth, and chunk cache hit rate.

Things We'd Do Differently

Start with the TSDB index (schema v12+) instead of BoltDB Shipper. We're on schema v11 because we started early.
TSDB is significantly better for index performance and is the direction Grafana is investing in. New deployments should use it.

Set up recording rules earlier. We added Prometheus recording rules for Loki metrics after the fact. Having loki:ingester_memory_streams:sum, loki:distributor_bytes:rate5m, etc. pre-built saves a lot of dashboard query overhead.

Consider upgrading to Loki 3.x. We're on 2.9.10. Loki 3.x brings native OTLP ingest, improved bloom filters for faster queries, and the new v13 schema. The migration path exists but requires careful planning around schema migration.

Wrapping Up

Loki in distributed mode isn't "install and forget." But it also isn't the operational nightmare that some make it out to be. The key principles: keep label cardinality low, give ingesters the memory they need (and a WAL for when they die), cache aggressively, and watch the handful of metrics that tell you when data is being silently dropped.