Tools: Prometheus #1

Source: Dev.to

## Why We Use Grafana Alongside Prometheus
## Prometheus: Metrics Collection and Storage
## The Visualization Problem
## Grafana: Unified Visualization Layer
## One Dashboard, Multiple Sources
## Alerting in Grafana
## Open Source and Enterprise Options
## Summary
## How Prometheus Collects and Stores Metrics
## The Prometheus Architecture (High Level)
## Case 1: When You Have the Application Source Code
## Case 2: When You Do NOT Have the Source Code
## Why “Push to Prometheus” Is a Bad Idea
## Exporters: The Correct Solution
## What Is an Exporter?
## Where Exporters Run
## Scraping: How Prometheus Pulls Metrics
## Case 3: Short-Lived Jobs and PushGateway
## How PushGateway Works
## Important Design Rule
## Why This Design Matters
## Summary
## Node Exporter: Collecting Host Metrics with Prometheus
## What Is Node Exporter?
## Why Node Exporter Exists
## Official vs Community Exporters
## Where Node Exporter Is Installed
## Network & Security (Very Important)
## Security Rule (Best Practice)
## Installing Node Exporter on Ubuntu / Linux
## 1. Update the system
## 2. Download Node Exporter
## 3. Extract
## 4. Run Node Exporter (temporary)
## Configuring Prometheus to Scrape Node Exporter
## Running Node Exporter as a Service (Production)
## Solution: systemd service
## Node Exporter on macOS (Homebrew)
## Install
## Start as service
## Update Prometheus config
## Key Takeaways
## Prometheus Data Model (Foundations)
## 1. Time Series Basics
## 2. Metric Name
## 3. Labels (Key–Value Pairs)
## 4. Time Series Identity
## 5. Metric Format
## 6. Example: Authentication API Metrics
## Metric name:
## Labels:
## Full time series example:
## 7. Key Takeaways
## PromQL and Prometheus Data Types
## 1. Scalar (Scalar Data Type)
## 2. Labels Are Always Strings
## Example Metric
## 3. String Matching vs Numeric Matching
## String Matching Example
## Numeric Matching (Wrong Usage)
## 4. Instant Vector
## How to Create an Instant Vector
## Filtering an Instant Vector
## 5. Range Vector
## Syntax
## Supported Time Units (Case-Sensitive)
## 6. Range Vector Example in Prometheus UI
## Instant Vector
## Range Vector
## Scrape Interval Impact
## 7. PromQL Arithmetic Operators
## 8. Scalar + Instant Vector
## 9. Instant Vector + Instant Vector
## Example
## Key Takeaways
## PromQL Binary Operators, Filters, Aggregations, and Time Offset
## 1. Binary Comparison Operators
## Scalar vs Scalar Comparison
## Instant Vector vs Scalar
## Instant Vector vs Instant Vector
## 2. Set Binary Operators
## unless
## 3. Label Filtering (Selectors)
## Label Match Operators
## Regex Matching Example
## Label Type Matters
## 4. Aggregation Operators
## Common Aggregation Operators
## Basic Aggregation Syntax
## Grouping with by
## Excluding Labels with without
## topk and bottomk
## 5. Time Offset
## Offset Syntax
## Offset Example
## Important Offset Rule
## 6. Graph View vs Table View
## Aggregation Required for Graphs
## Final Key Takeaways
## PromQL Functions – Part 1 (Time & Utility Functions)
## 1. day_of_month() and day_of_week()
## day_of_month()
## day_of_week()
## 2. delta() and idelta()
## Important Rules
## delta()
## idelta()
## 3. absent()
## Purpose
## Behavior of absent()
## Example
## 4. absent_over_time()
## Syntax
## Key Points
## 5. Mathematical Functions
## ceil()
## floor()
## 6. Clamp Functions (Very Important)
## clamp()
## clamp_min()
## clamp_max()
## Examples
## Why Clamp Is Useful
## Key Takeaways
## PromQL Functions – Part 2 (Math, Sorting, Time & Alerts)
## 1. Logarithmic Functions
## log2()
## log10()
## 2. Sorting Functions
## sort()
## sort_desc()
## Example
## 3. Time Functions
## time()
## timestamp()
## Offset + Timestamp Example
## 4. Aggregation Over Time Functions
## Common Aggregation-Over-Time Functions
## Example
## Filtering + Over Time
## 5. Why Alerts Matter
## Goal of Alerts
## 6. Prometheus Alerts vs Alertmanager
## Prometheus
## Alertmanager
## Why Alertmanager Is Required
## 7. Creating an Alert Rule (YAML)
## Rule File Structure
## 8. Linking Rules to Prometheus
## 9. Reloading Prometheus
## 10. Viewing Alerts in Prometheus UI
## 11. Testing the Alert
## 12. Pre-Built Alert Rules (Very Important Tip)
## Key Takeaways
## Improving Prometheus Alerts with for, Labels, Annotations & Alertmanager Setup
## 1. Why We Need the for Clause
## The Problem
## 2. Using the for Clause
## Syntax (YAML indentation matters!)
## Updated Alert Example
## 3. Using absent() Instead of Comparisons
## Reminder: How absent() Works
## Cleaner Alert Expression
## 4. Adding Context with Labels
## Labels
## 5. Adding Context with Annotations
## 6. Alert Templating ($labels, $value)
## Available Variables
## Example with Templates
## 7. Full Alert Rule Example
## 8. Seeing Alerts in Prometheus UI
## 9. Alertmanager Recap
## What It Does
## 10. Alertmanager UI
## 11. Installing Alertmanager – Windows
## 12. Installing Alertmanager – macOS (MacPorts)
## 13. Installing Alertmanager – Linux (Ubuntu)
## Steps Overview
## Key Takeaways
## Advanced Alerting: Routes, Matchers, Inhibition, Silencing & Recording Rules
## 1. How Alertmanager Works Internally
## Internal Flow
## 2. Matchers and Routes
## Matchers
## Legacy vs Modern Matching
## Route Concept
## 3. Multiple Receivers Example (Email)
## Routing Based on Severity
## 4. Sending Alerts to Slack (Incoming Webhooks)
## Steps in Slack
## Alertmanager Slack Receiver Example
## 5. PagerDuty Integration
## Steps in PagerDuty
## PagerDuty Receiver Example
## 6. Silencing Alerts (Temporary)
## 7. Inhibiting Alerts (Permanent Logic)
## Inhibition Example Scenario
## Inhibit Rule Example
## Important Rule
## 8. Recording Rules (Why We Need Them)
## Problem
## Solution: Recording Rules
## Real-World Example
## 9. Recording Rule Concept
## 10. Recording Rules File Structure
## 11. Where Recording Rules Live
## macOS / Windows
## Prometheus Config
## 12. Alerting Rules vs Recording Rules
## Key Takeaways
## Recording Rules and Prometheus Client Libraries (Python)
## Part 1: Writing a Recording Rule
## Why Recording Rules Matter
## Step 1: Build the PromQL Expression First
## ❌ This is not useful:
## ✅ Grouping makes it meaningful:
## ✅ Correct Expression for Recording Rule
## Step 2: Create the Recording Rule File
## File Location
## Example File Name
## Step 3: Recording Rule YAML Structure
## Naming Convention (Best Practice)
## Step 4: Load the Rule in Prometheus
## Step 5: Verify the New Metric
## Key Takeaways (Recording Rules)
## Part 2: Short-Lived Jobs & Client Libraries
## What Are Short-Lived Jobs?
## Official Prometheus Client Libraries
## Part 3: Prometheus Client Library (Python)
## Step 1: Install the Client Library
## Step 2: Simple Python App (No Web Framework)
## Example: Summary Metric
## Step 3: Counters
## Counter Basics
## Incrementing Counters
## Counting Exceptions
## Step 4: Gauges
## Gauge Definition
## Gauge Operations
## Step 5: Adding Labels to Metrics
## Define Labels
## Assign Label Values
## Step 6: Expose App to Prometheus
## Prometheus Target
## Verify in Prometheus
## Key Takeaways (Client Libraries)
## Prometheus Client Libraries
## Java Client Library (Simpleclient) & .NET Client Library
## Part 1: Prometheus Java Client Library
## Overview
## Key Java Client Modules
## Step 1: Add Maven Dependencies
## Step 2: Create a Basic Java Application
## What Prometheus Sees
## Important Prometheus Behavior
## Adding Labels (Java)
## Define Labels
## Use Labels (Mandatory!)
## Summary (Java)
## Part 2: Prometheus .NET Client Library
## Important Note
## Step 1: Install NuGet Package
## Step 2: Create Metrics in .NET Console App
## Adding Labels (.NET)
## Dynamic Labels
## Static Labels (Per Metric)
## Global Static Labels (All Metrics)
## Counting Exceptions (.NET)
## Prometheus Configuration
## Final Key Takeaways
## Universal Rules
## Prometheus with ASP.NET Core (.NET Core Web Application)
## Using prometheus-net to Expose Metrics to Prometheus
## Goal of This Lecture
## Step 1: Create an ASP.NET Core Web Application
## Step 2: Add Required NuGet Packages
## Required
## Optional (Best Practice)
## Step 3: Expose /metrics Endpoint
## Step 4: Create a Custom Counter (Controller Example)
## Example: HomeController
## Step 5: Add Health Checks (Best Practice)
## Register Health Checks
## Map Health Check Endpoint
## Health Check Metrics in Prometheus
## Summary So Far
## Why Static Scrape Configs Are Not Enough
## Problems in Cloud Environments
## Solution 1: Service Discovery
## Why Load Balancers Don’t Work for Scraping
## Solution 2: Pushgateway (Special Cases)
## Pushgateway Solves This
## Introduction to Service Discovery in Prometheus
## AWS EC2 Service Discovery (Concept)
## Filtering EC2 Instances (Important)
## Example: Filter by Tag
## Relabeling (Critical Skill)
## Example: Use Public IP Instead of Private IP
## File-Based Service Discovery
## Example File: targets.yml
## Prometheus Config
## When to Use Each Method
## Final Takeaway

In modern systems, we usually have servers and workloads running across different environments. From these systems, we want to:

This type of data is called time-series data.

Prometheus is a time-series database designed to:

Prometheus is excellent at collecting, storing, and querying metrics. However, Prometheus has a very basic built-in UI. Its visualization capabilities are limited and not sufficient for real-world dashboards.
In real production environments:

You can move all these metrics into Prometheus, but:

If your goal is visualization only, moving data into Prometheus is not required.

Grafana is an open-source visualization and monitoring platform. Grafana allows you to:

Supported data sources include:

In a single Grafana dashboard:

All of this is displayed together, giving a complete system view.

Grafana also provides:

For more details, you can explore the official Grafana website.

Together, they provide:

This is why, in real DevOps environments, Prometheus and Grafana are almost always used together.

Now that we know how to install Prometheus, the next question is: how does Prometheus actually collect metrics and store them?

At a high level, we usually have:

Many systems we want to monitor:

Prometheus is a pull-based time-series database, meaning:

If you own the application code, things are easy. Client libraries exist for:

This approach works well only when you control the source code. In many real-world cases, you cannot modify the code.

“Let’s write a script that collects data and sends it to Prometheus.”

This is not a good solution because:

Prometheus is not designed to accept pushed metrics.

The correct solution is to use exporters. An exporter is a small service that:

The process of Prometheus pulling metrics from exporters is called scraping. Prometheus always controls when and how often data is collected.

There is one special case:

For this case, Prometheus provides the Pushgateway.

Prometheus is always pull-based. Always.

This model allows Prometheus to:

This is the foundation of real-world Prometheus monitoring.

Node Exporter is an official Prometheus exporter used to collect host-level metrics from Unix-based systems.

Important clarification first: Node Exporter has NOTHING to do with Node.js. In Prometheus terminology, a “node” means:

So Node Exporter = exporter for machine (host) metrics.

Applications expose application metrics.
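To make "exposing metrics" concrete, here is a minimal sketch of what an exporter-style endpoint might look like, using only the Python standard library. It is an illustration under stated assumptions, not how any real exporter is implemented: the metric names (`demo_queue_depth`, `demo_up`), the `collect_samples` helper, and port 9200 are all hypothetical, and label-value escaping is omitted for brevity.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render (name, labels, value) tuples in the Prometheus text exposition format."""
    lines = []
    for name, labels, value in samples:
        if labels:
            # Label values are always quoted strings; escaping is omitted in this sketch.
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect_samples():
    # Hypothetical readings; a real exporter would query the system it translates for.
    return [
        ("demo_queue_depth", {"queue": "emails"}, 7),
        ("demo_up", {}, 1),
    ]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Values are collected only when a scrape request arrives (pull model).
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect_samples()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9200):
    # Blocks forever; Prometheus would scrape http://<host>:<port>/metrics.
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics(collect_samples()), end="")
```

Calling `serve()` would start the endpoint. The exporter never contacts Prometheus itself; it only answers when it is scraped.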
Node Exporter exposes machine metrics.

Examples of metrics collected by Node Exporter:

These metrics are critical for:

Node Exporter is official, meaning:

Other exporters (MySQL, NGINX, CloudWatch, etc.) may be:

Install Node Exporter on the machines you want to monitor, not on the Prometheus server (unless you also want to monitor the Prometheus host itself).

Example architecture:

Node Exporter listens on port 9100. Port 9100 must ONLY be accessible by Prometheus.

Download Node Exporter from the official Prometheus download page.

The /metrics endpoint shows raw metrics (hard to read, but correct).

Edit Prometheus config:

Add under scrape_configs:

Verify in Prometheus UI:

Running Node Exporter in a terminal is not acceptable in production.

If Prometheus is installed via Homebrew:

Prometheus config location (Homebrew):

To query metrics stored in Prometheus, you must first understand how Prometheus stores data.

Prometheus stores all data as time series. A time series consists of:

Each data point represents the value of a metric at a specific moment in time.

The metric name identifies what is being measured. The metric name is always required.

Labels provide dimensions to a metric and allow you to slice and filter data. Labels answer questions like:

In Prometheus, a time series is uniquely identified by:

Even if the metric name is the same, different label combinations create different time series.

The general format of a Prometheus metric is:

Imagine an authentication API where we want to track how often it is called. Each time the API is hit:

A new data point is recorded with:

Important: Labels describe metadata. The metric value (e.g., the current counter value) is stored separately, not as a label.

Prometheus comes with a powerful query language called PromQL (Prometheus Query Language). Using PromQL, you can read, filter, and calculate metrics stored in Prometheus.

Before we deep-dive into writing PromQL queries, we must first understand the data types available in Prometheus. These data types are used:

A scalar is a single numeric value.
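The identity rule from the data model (metric name plus the full set of labels) can be sketched in a few lines of Python. This is only a toy in-memory model for intuition, not how Prometheus stores data internally. The `status` label and the `series_key` and `record` helpers are hypothetical names chosen for illustration.

```python
def series_key(metric_name, labels):
    """Identity of a time series: metric name plus the full set of label pairs."""
    return (metric_name, frozenset(labels.items()))


def record(store, metric_name, labels, timestamp, value):
    """Append a (timestamp, value) sample to the series identified by name + labels."""
    store.setdefault(series_key(metric_name, labels), []).append((timestamp, value))


store = {}
name = "authentication_api_hits_total"

# Same name and same labels (in any order) -> same time series.
record(store, name, {"account_id": "12345", "status": "200"}, 1000, 1.0)
record(store, name, {"status": "200", "account_id": "12345"}, 1015, 2.0)

# Same name but a different label value -> a separate time series.
record(store, name, {"account_id": "12345", "status": "500"}, 1015, 1.0)

print(len(store))
```

Note that the label values are strings even when they look numeric, while the sample values are numbers.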
Labels in Prometheus are always strings, even if they look like numbers.

Match any code starting with 2:

This works only because code is a string.

This does not work (PromQL rejects an unquoted label value), because:

Labels are metadata → always strings
Metric values are numbers → used for calculations

An instant vector is a set of time series, each with one single value at a specific timestamp. That’s why it’s called instant.

A range vector is similar to an instant vector, but instead of one value per series, it returns multiple values over time.

PromQL supports arithmetic operations:

When you apply a scalar to an instant vector, the scalar is applied to every element in the vector.

When applying arithmetic between two instant vectors:

label="c" is excluded because it does not exist in both vectors.

Arithmetic operations:

PromQL never mutates existing data.

To write meaningful queries in Prometheus, we need to understand:

Prometheus supports six binary comparison operators:

How these operators behave depends on the data types on the left and right sides.

If you compare two scalar values (this requires the bool modifier):

Imagine an instant vector:

The comparison is applied to every element in the instant vector.

When comparing two instant vectors:

If you use > instead of ==:

Prometheus has three set operators:

and returns only time series that exist in both vectors.
or returns the union of both vectors.
unless returns time series from the left vector that do NOT exist in the right vector.

A PromQL selector looks like:

Each comma means AND.

Labels are always strings. Prometheus does not auto-convert types.

Aggregation operators:

This aggregates while ignoring a label.

Returns one row per mode, value = 1.

By default, Prometheus returns the latest scrape. To query past data, use offset.
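As a rough mental model of what an instant query and `offset` do, the sketch below looks up the most recent sample at or before the evaluation time, optionally shifted into the past. It is a simplification and not Prometheus's actual implementation. The `instant_value` function and the sample data are hypothetical, and the 300-second lookback window is modeled on Prometheus's default 5-minute lookback.

```python
from bisect import bisect_right


def instant_value(samples, eval_time, offset=0, lookback=300):
    """Return the latest sample value at or before (eval_time - offset).

    samples: list of (timestamp_seconds, value), sorted by timestamp.
    Returns None if no sample exists within the lookback window.
    """
    t = eval_time - offset
    timestamps = [ts for ts, _ in samples]
    i = bisect_right(timestamps, t)  # index of the first sample newer than t
    if i == 0:
        return None  # no sample at or before t
    ts, value = samples[i - 1]
    if t - ts > lookback:
        return None  # newest candidate is too old to count
    return value


# Hypothetical samples for one series of prometheus_http_requests_total.
samples = [(0, 18.0), (240, 19.0), (480, 20.0), (900, 20.0), (960, 21.0)]

now = 960
print(instant_value(samples, now))                 # latest value
print(instant_value(samples, now, offset=8 * 60))  # value as of 8 minutes earlier
```

In this model `offset` only moves the point in time at which the lookup happens. The stored samples themselves are never changed.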
offset asks Prometheus: “Give me the value from that time in the past.”

Offset must be applied directly to the metric, NOT after aggregation.

A range vector selector cannot be graphed directly, because it returns a range vector.

This shows flat lines:

This shows meaningful graphs:

Now that we’ve learned about operators in Prometheus, it’s time to learn about functions. PromQL functions are extremely important. You will use them constantly when:

In total, we will cover these functions across four lectures. In this lecture, we’ll focus on basic time-based and utility functions.

day_of_month() and day_of_week() are time-based functions.

delta() and idelta() are very similar.

“How much did the CPU temperature change over the last 2 hours?”

absent() is a very important and commonly used function, especially in alerts. Its purpose is to check whether an instant vector is empty.

⚠️ The behavior is counterintuitive, so pay attention.

This is how Prometheus detects missing metrics.

absent_over_time() follows the same idea as absent(), but works with range vectors. You cannot use absent() with range vectors; that’s why this function exists.

The mathematical functions modify values inside an instant vector.

Clamp functions are extremely useful for visualization and dashboards. They allow you to trim values that are too small or too large.

In Prometheus, besides operators, we also have many built-in functions. These functions are heavily used in dashboards, alerts, and troubleshooting. In this lecture, we cover:

If you previously used:

sort() → starts from 300 → ends at 150000
sort_desc() → starts from 150000 → ends at 300

Normal aggregation functions work on instant vectors. When you use range vectors, you must use *_over_time functions.

Imagine you are monitoring an API.

This is the point of chaos.

We define a threshold:

Without Alertmanager:

Alerts are defined in YAML rule files.

This alert fires when:

Restart Node Exporter:

There is a community-maintained repository with ready-to-use alert rules for:

You do not need to write alerts from scratch. This saves huge amounts of time.

So far, we’ve learned how to write basic alerts in Prometheus.
Now it’s time to make our alerts smarter, quieter, and more informative. In this lecture, we cover:

In the previous lecture, we created an alert like this:

Some applications have:

We do not want false alerts.

The for clause tells Prometheus: “Only fire this alert if the condition stays true for a specific duration.”

Supported time units:

Previously, we wrote:

An alternative (often cleaner) approach is using absent().

This alert fires when:

Both approaches are valid. Use whichever is more readable for your team.

Alerts are often received by people who didn’t write them. We must add metadata.

Labels are key-value pairs attached to the alert. Labels are mainly used by Alertmanager routing rules.

Annotations are human-readable descriptions.

Prometheus supports templates inside annotations.

⚠️ Always wrap templates in quotes in YAML.

This gives rich context in Slack, email, PagerDuty, etc.

Clicking the alert shows:

Alertmanager is an official Prometheus component.

Groups related alerts
Silences alerts during maintenance

Prometheus does NOT send notifications by itself.

Homebrew does not support Alertmanager.

Download Alertmanager binary

In this lecture, we cover how Alertmanager actually works internally, how alerts are routed, how to send notifications to different channels, how to silence and inhibit alerts, and finally we introduce recording rules.

We already know the high-level flow:

But inside Alertmanager, there is an important decision process.

Alert is sent to a receiver

Matchers define conditions based on alert labels. Matchers work only on alert labels, not on metric values.

Recommended (modern):

Always use matchers in new configurations.

You can define multiple receivers:

Slack uses Incoming Webhooks.

Restart Alertmanager → Alerts go to Slack.

PagerDuty is used for on-call incident management. Route alerts to PagerDuty using matchers as before.

Silencing does not change Prometheus behavior; it only suppresses notifications.
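Routing and silencing both come down to comparing matchers against an alert's labels, and a small Python sketch can make that tangible. It is a deliberately simplified model, not Alertmanager's real behavior: only exact-equality matchers are handled, the routes form a flat list rather than a tree, and the receiver names and label values are hypothetical.

```python
def matches(labels, matchers):
    """True when every matcher (label name -> required value) equals the alert's label."""
    return all(labels.get(name) == value for name, value in matchers.items())


def pick_receiver(labels, routes, default_receiver):
    """Return the receiver of the first route whose matchers all match, else the default."""
    for route in routes:
        if matches(labels, route["matchers"]):
            return route["receiver"]
    return default_receiver


def is_silenced(labels, silences):
    """A silence is a set of matchers; the alert still exists, only notifications stop."""
    return any(matches(labels, silence) for silence in silences)


# Hypothetical routing setup: critical alerts page someone, warnings go to chat.
routes = [
    {"matchers": {"severity": "critical"}, "receiver": "pagerduty"},
    {"matchers": {"severity": "warning"}, "receiver": "slack"},
]
alert = {"alertname": "NodeDown", "severity": "critical", "instance": "app-1:9100"}

print(pick_receiver(alert, routes, default_receiver="email"))
print(is_silenced(alert, silences=[{"instance": "app-1:9100"}]))
```

Because matching looks only at labels, the labels attached in the alert rule decide where a notification ends up.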
If the server is down:

PromQL calculations like:

can be expensive when:

Recording rules precompute values and store them as new metrics.

Instead of calculating:

Calculating averages on demand becomes slow.

Recording rules are computed at every rule evaluation interval.

Recording rules are defined in YAML, similar to alert rules.

Restart Prometheus after changes.

Recording rules are used to:

Instead of repeatedly calculating:

We compute it once and store it as:

Let’s start with an existing metric:

It returns one number, losing all context.

But this still doesn’t work well, because:

For these cases, Prometheus provides:

Prometheus provides official client libraries for:

There are many community-maintained libraries as well (e.g., .NET).

The Python client includes a built-in HTTP server, perfect for console apps.

⚠️ The client library automatically appends _total to counter names.

⚠️ All labels must be assigned values.

Labels added automatically by Prometheus.

Prometheus provides an official Java client library called simpleclient. It allows Java applications to expose metrics that Prometheus can scrape. For this lecture we use:

⚠️ Once labels are defined:

There is no official Prometheus client library for .NET, but the community library prometheus-net is widely used and production-grade.

Now every metric includes:

Restart Prometheus after changes.

In this lecture, we will learn how to:

Open NuGet Package Manager and install:

These packages allow:

Open Startup.cs (or Program.cs for minimal hosting). Inside Configure (or middleware section):

This automatically creates:

If you run the app and visit:

You will already see default runtime metrics, such as:

These are exposed automatically by prometheus-net.

Imagine we want to count how many times an API endpoint is hit.

In ConfigureServices:

The library then exposes health check results as metrics:

This allows monitoring health without polling /health manually.

Up to now, we used static targets in prometheus.yml:

Prometheus cannot scrape what it doesn’t know exists.
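One way to keep the target list current without editing prometheus.yml by hand is file-based service discovery, which appears in this series' outline. A script regenerates a targets file and Prometheus picks up the change. The sketch below writes such a file in the JSON form (a list of groups with `targets` and `labels`), assuming a `file_sd_configs` entry points at the same path. The `write_targets_file` helper, the IP addresses, and the `env` label are hypothetical, and JSON is used instead of YAML only so the example needs nothing beyond the standard library.

```python
import json
import os
import tempfile


def write_targets_file(path, groups):
    """Write file-based service discovery target groups as JSON.

    groups: list of {"targets": ["host:port", ...], "labels": {...}}.
    The file is written to a temporary name and then renamed, so a reader
    never sees a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    # mkstemp creates the file readable only by its owner; relax that so a
    # Prometheus process running as another user could read it (assumption).
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)


# Hypothetical targets: two Node Exporter instances in a demo environment.
groups = [
    {"targets": ["10.0.1.15:9100", "10.0.1.16:9100"], "labels": {"env": "demo"}},
]

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "targets.json")
    write_targets_file(path, groups)
    with open(path) as handle:
        print(handle.read())
```

The script only describes where the targets are. Prometheus still scrapes each target itself, so the pull model is unchanged.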
Prometheus supports native service discovery, configured entirely in prometheus.yml.

Common discovery types:

No extra Prometheus components required.

If Prometheus scrapes a Load Balancer:

Prometheus must scrape each instance directly.

Some workloads cannot be scraped:

Pushgateway does not make Prometheus push-based.
It is a buffer, not a database.

Service discovery is configured in prometheus.yml.

You rarely scrape all instances.

Relabeling allows you to:

This is mandatory if Prometheus is outside AWS.

COMMAND_BLOCK:
sudo apt update
sudo apt upgrade -y

COMMAND_BLOCK:
wget https://github.com/prometheus/node_exporter/releases/download/vX.Y.Z/node_exporter-X.Y.Z.linux-amd64.tar.gz

CODE_BLOCK:
tar xvf node_exporter-*.tar.gz

CODE_BLOCK:
./node_exporter

CODE_BLOCK:
Listening on :9100

CODE_BLOCK:
http://<server-ip>:9100/metrics

COMMAND_BLOCK:
sudo nano /etc/prometheus/prometheus.yml

CODE_BLOCK:
- job_name: "application-server"
  static_configs:
    - targets: ["<APPLICATION_SERVER_IP>:9100"]

COMMAND_BLOCK:
sudo systemctl restart prometheus

CODE_BLOCK:
systemctl status node_exporter

CODE_BLOCK:
Active: active (running)

CODE_BLOCK:
brew install node_exporter

CODE_BLOCK:
brew services start node_exporter

CODE_BLOCK:
http://localhost:9100/metrics

CODE_BLOCK:
/usr/local/etc/prometheus.yml

CODE_BLOCK:
- job_name: "mac-node"
  static_configs:
    - targets: ["localhost:9100"]

CODE_BLOCK:
brew services restart prometheus

CODE_BLOCK:
metric name + full set of labels

CODE_BLOCK:
metric_name{label1="value1", label2="value2", label3="value3"}

CODE_BLOCK:
authentication_api_hits_total

CODE_BLOCK:
authentication_api_hits_total{account_id="12345", response_time_ms="800"}

CODE_BLOCK:
prometheus_http_requests_total{code="200", job="prometheus"}

CODE_BLOCK:
prometheus_http_requests_total{job="prometheus", code=~"2.*"}

CODE_BLOCK:
prometheus_http_requests_total{code=200}

CODE_BLOCK:
auth_api_hits_total

CODE_BLOCK:
auth_api_hits_total{count="1", time_taken="800"}

CODE_BLOCK:
metric_name[time_range]

CODE_BLOCK:
auth_api_hits_total[5m]

CODE_BLOCK:
node_network_transmit_errs_total

CODE_BLOCK:
node_network_transmit_errs_total

CODE_BLOCK:
node_network_transmit_errs_total[5m]

CODE_BLOCK:
5 minutes ÷ 15 seconds = ~20 data points

CODE_BLOCK:
node_cpu_seconds_total + 5

CODE_BLOCK:
5
6

CODE_BLOCK:
10
11

CODE_BLOCK:
m1{label="a"} = 10
m1{label="b"} = 20
m1{label="c"} = 30

CODE_BLOCK:
m1{label="a"} = 5
m1{label="b"} = 2

CODE_BLOCK:
A + B

CODE_BLOCK:
m1{label="a"} = 15
m1{label="b"} = 22

CODE_BLOCK:
10 == bool 10

CODE_BLOCK:
1

CODE_BLOCK:
10 == bool 5 → 0

CODE_BLOCK:
m == 10

CODE_BLOCK:
m{label="a"} = 10

CODE_BLOCK:
A == B

CODE_BLOCK:
A and B

CODE_BLOCK:
A or B

CODE_BLOCK:
A unless B

CODE_BLOCK:
metric_name{label1="value1", label2="value2"}

CODE_BLOCK:
prometheus_http_requests_total{code="200", job="prometheus"}

CODE_BLOCK:
code=~"2.*"

CODE_BLOCK:
le="1000"

CODE_BLOCK:
le=1000

CODE_BLOCK:
sum(metric_name)

CODE_BLOCK:
sum(node_cpu_seconds_total)
CODE_BLOCK: sum(node_cpu_seconds_total) CODE_BLOCK: sum(metric_name) by (label) Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: sum(metric_name) by (label) CODE_BLOCK: sum(metric_name) by (label) CODE_BLOCK: sum(node_cpu_seconds_total) by (mode) Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: sum(node_cpu_seconds_total) by (mode) CODE_BLOCK: sum(node_cpu_seconds_total) by (mode) CODE_BLOCK: sum(metric_name) without (label) Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: sum(metric_name) without (label) CODE_BLOCK: sum(metric_name) without (label) CODE_BLOCK: topk(3, node_cpu_seconds_total) bottomk(3, node_cpu_seconds_total) Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: topk(3, node_cpu_seconds_total) bottomk(3, node_cpu_seconds_total) CODE_BLOCK: topk(3, node_cpu_seconds_total) bottomk(3, node_cpu_seconds_total) CODE_BLOCK: group(metric_name) Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: group(metric_name) CODE_BLOCK: group(metric_name) CODE_BLOCK: group(metric_name) by (mode) Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: group(metric_name) by (mode) CODE_BLOCK: group(metric_name) by (mode) CODE_BLOCK: metric_name offset 10m Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: metric_name offset 10m CODE_BLOCK: metric_name offset 10m CODE_BLOCK: prometheus_http_requests_total Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: prometheus_http_requests_total CODE_BLOCK: prometheus_http_requests_total CODE_BLOCK: 21 Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: prometheus_http_requests_total offset 8m Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: prometheus_http_requests_total offset 8m CODE_BLOCK: prometheus_http_requests_total offset 8m CODE_BLOCK: 20 Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: avg(prometheus_http_requests_total offset 8h) by (code) Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: avg(prometheus_http_requests_total offset 8h) by (code) CODE_BLOCK: 
```promql
avg(prometheus_http_requests_total) by (code) offset 8h
```

```promql
metric_name[5m]
```

```promql
group(metric_name) by (code)
```

```promql
avg(metric_name) by (code)
sum(metric_name) by (code)
count(metric_name) by (code)
```

```promql
day_of_month(<instant_vector>)
```

```promql
day_of_week(<instant_vector>)
```

```promql
delta(<range_vector>)
```

```text
last_value − first_value
```

```promql
delta(node_cpu_temp[2h])
```

```promql
idelta(<range_vector>)
```

```promql
absent(<instant_vector>)
```
```promql
absent(node_cpu_seconds_total)
```

```promql
absent(node_cpu_seconds_total{cpu="fake"})
```

```promql
absent_over_time(<range_vector>)
```

```promql
absent_over_time(node_cpu_seconds_total[1h])
```

```promql
abs(<instant_vector>)
```

```promql
ceil(<instant_vector>)
```

```promql
floor(<instant_vector>)
```

```promql
clamp(<instant_vector>, min, max)
```

```promql
clamp_min(<instant_vector>, min)
```

```promql
clamp_max(<instant_vector>, max)
```

```promql
clamp_min(node_cpu_seconds_total, 300)
```
```promql
clamp_max(node_cpu_seconds_total, 150000)
```

```promql
clamp(node_cpu_seconds_total, 300, 150000)
```

```promql
log2(<instant_vector>)
```

```promql
log10(<instant_vector>)
```

```promql
ln(<instant_vector>)
```

```promql
sort(<instant_vector>)
```

```promql
sort_desc(<instant_vector>)
```

```promql
clamp(node_cpu_seconds_total, 300, 150000)
```

```promql
sort(...)
```

```promql
sort_desc(...)
```
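As a rough illustration of what `clamp`, `sort`, and `sort_desc` do to the samples of an instant vector, the Python sketch below mimics them on a plain dictionary. It is not how Prometheus implements these functions; the sample values and the helper names `clamp` and `sort_by_value` are assumptions made up for the example.

```python
# Toy model of an instant vector: label set -> sample value (values are invented).
samples = {'cpu="0"': 120.0, 'cpu="1"': 98000.0, 'cpu="2"': 310000.0}


def clamp(vector, lo, hi):
    """Mimic clamp(v, min, max): bound every sample to the range [lo, hi]."""
    if lo > hi:
        # PromQL returns an empty vector when min > max.
        return {}
    return {labels: max(lo, min(hi, value)) for labels, value in vector.items()}


def sort_by_value(vector, descending=False):
    """Mimic sort() / sort_desc(): order series by sample value."""
    return sorted(vector.items(), key=lambda item: item[1], reverse=descending)


clamped = clamp(samples, 300, 150000)
ascending = sort_by_value(clamped)                    # like sort(clamp(...))
descending = sort_by_value(clamped, descending=True)  # like sort_desc(clamp(...))

for labels, value in ascending:
    print(labels, value)
```

Values below the lower bound are raised to it, values above the upper bound are lowered to it, and everything in between passes through unchanged, which is why clamping is usually applied before sorting or graphing noisy series.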
```promql
time()
```

```promql
timestamp(<instant_vector>)
```

```promql
timestamp(node_cpu_seconds_total offset 1h)
```

```promql
avg(node_cpu_seconds_total[2h])
```

```promql
avg_over_time(node_cpu_seconds_total[2h])
```

```promql
avg_over_time(node_cpu_seconds_total{cpu="0"}[2h])
```

```yaml
groups:
  - name: alerts
    rules:
      - alert: NodeExporterDown
        expr: up{job="node_exporter"} == 0
```

```yaml
rule_files:
  - "rules/*.yml"
```

```bash
systemctl restart prometheus
```

```bash
brew services restart prometheus
```

```text
Status → Alerts
```
```bash
systemctl stop node_exporter
```

```bash
brew services stop node_exporter
```

```yaml
expr: up{job="node_exporter"} == 0
```

```yaml
for: 5m
```

```yaml
groups:
  - name: alerts
    rules:
      - alert: NodeExporterDown
        expr: up{job="node_exporter"} == 0
        for: 5m
```

```yaml
expr: up{job="node_exporter"} == 0
```

```yaml
expr: absent(up{job="node_exporter"})
```

```yaml
labels:
  team: team-alpha
  severity: critical
```

```yaml
annotations:
  summary: "Node exporter is down"
  description: "Node exporter on {{ $labels.instance }} is not reachable"
```
```yaml
annotations:
  summary: "{{ $labels.instance }} node exporter is down"
  description: |
    Job: {{ $labels.job }}
    Instance: {{ $labels.instance }}
    Value: {{ $value }}
```

```yaml
groups:
  - name: alerts
    rules:
      - alert: NodeExporterDown
        # absent() returns a series carrying only the labels taken from
        # equality matchers (here: job), so {{ $labels.instance }} renders
        # empty with this expression.
        expr: absent(up{job="node_exporter"})
        for: 5m
        labels:
          severity: critical
          team: team-alpha
        annotations:
          summary: "Node exporter down on {{ $labels.instance }}"
          description: "Node exporter has been unreachable for 5 minutes."
```
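The effect of the `for: 5m` clause can be pictured as a small state machine: an alert whose expression is true starts out as pending and only becomes firing once the expression has stayed true for the whole duration. The Python sketch below is a simplification under stated assumptions: real Prometheus only re-checks the expression at each rule evaluation, and the function name `alert_state` and the timestamps are hypothetical.

```python
FOR_SECONDS = 5 * 60  # corresponds to "for: 5m" in the rule above


def alert_state(active_since, now, for_seconds=FOR_SECONDS):
    """Return the alert state at time `now` (both arguments in seconds).

    `active_since` is the time the expression first became true,
    or None if the expression is currently false.
    """
    if active_since is None:
        return "inactive"
    if now - active_since < for_seconds:
        return "pending"   # condition holds, but not yet for the full duration
    return "firing"        # condition has held for at least `for_seconds`


# The exporter went down at t=0; inspect the alert at a few points in time.
for t in (60, 299, 300, 900):
    print(t, alert_state(active_since=0, now=t))
```

A short outage that recovers before the five minutes elapse never leaves the pending state, which is the reason for adding `for` to rules that would otherwise flap.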
```text
Status → Alerts
```

```text
http://localhost:1993
```

```text
alertmanager.exe
```

```text
http://localhost:1993
```

```bash
sudo port install alertmanager
sudo port load alertmanager
```

```text
/opt/local/etc/alertmanager.yml
```

```bash
sudo port unload alertmanager
sudo port load alertmanager
```

```text
/var/lib/alertmanager
```

```text
/var/lib/alertmanager/data
```

```bash
chown -R prometheus:prometheus /var/lib/alertmanager
chmod -R 755 /var/lib/alertmanager
```

```text
/etc/systemd/system/alertmanager.service
```
```bash
sudo systemctl daemon-reload
sudo systemctl start alertmanager
sudo systemctl enable alertmanager
```

```text
http://<server-ip>:1993
```

```text
Prometheus → Alertmanager → Notifications
```

```yaml
receivers:
  - name: default-email
    email_configs:
      - to: [email protected]
  - name: urgent-email
    email_configs:
      - to: [email protected]
```

```yaml
route:
  receiver: default-email
  routes:
    - receiver: urgent-email
      matchers:
        - severity="critical"
```

```yaml
receivers:
  - name: slack-alerts
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#udemy-prometheus"
```
```yaml
receivers:
  - name: pagerduty-alerts
    pagerduty_configs:
      - service_key: "PAGERDUTY_INTEGRATION_KEY"
```

```yaml
inhibit_rules:
  - source_matchers:
      - team="team-alpha"
    target_matchers:
      - team="team-beta"
    equal:
      - instance
```

```promql
avg(sensor_temperature)
```

```text
sensor_temperature_avg
```

```text
iot_temperature_avg
```

```yaml
groups:
  - name: iot-rules
    rules:
      - record: iot_temperature_avg
        expr: avg(iot_temperature)
```

```text
/etc/prometheus/rules/
```

```yaml
rule_files:
  - "rules/*.yml"
```

```promql
avg(rate(node_cpu_seconds_total[5m])) by (cpu)
```

```text
cpu:node_cpu_seconds_total:avg_rate
```

```promql
node_cpu_seconds_total
```

```promql
avg(node_cpu_seconds_total)
```

```promql
avg by (cpu) (node_cpu_seconds_total)
```

```promql
avg by (cpu) (
  rate(node_cpu_seconds_total[5m])
)
```

```text
/etc/prometheus/rules/
```

```text
recording-rules.yml
```

```yaml
groups:
  - name: node-exporter-recording-rules
    rules:
      - record: cpu:node_cpu_seconds_total:avg_rate
        expr: avg by (cpu) ( rate(node_cpu_seconds_total[5m]) )
        labels:
          exporter_type: node
```
```text
<labels>:<metric_name>:<operation>
```

```text
cpu:node_cpu_seconds_total:avg_rate
```

```yaml
rule_files:
  - "rules/*.yml"
```

```bash
systemctl restart prometheus
```

```bash
brew services restart prometheus
```

```promql
cpu:node_cpu_seconds_total:avg_rate
```

```promql
sum(cpu:node_cpu_seconds_total:avg_rate)
```

```bash
pip install prometheus-client
```

```python
from prometheus_client import start_http_server, Summary
import random
import time

REQUEST_TIME = Summary(
    'request_processing_seconds',
    'Time spent processing requests'
)

@REQUEST_TIME.time()
def process_request(t):
    time.sleep(t)

if __name__ == "__main__":
    start_http_server(8000)
    while True:
        process_request(random.random())
```
```text
http://localhost:8000/metrics
```

```python
from prometheus_client import Counter

MY_COUNTER = Counter(
    'my_counter',
    'Example counter'
)
```

```python
MY_COUNTER.inc()
MY_COUNTER.inc(5)
```

```python
@MY_COUNTER.count_exceptions()
def process_request():
    raise Exception("error")
```

```python
from prometheus_client import Gauge

MY_GAUGE = Gauge(
    'my_gauge',
    'Example gauge'
)
```
```python
MY_GAUGE.set(5)
MY_GAUGE.inc(5)
MY_GAUGE.dec(2)
```

```text
8
```

```python
MY_COUNTER = Counter(
    'my_counter',
    'Counter with labels',
    ['name', 'age']
)
```

```python
MY_COUNTER.labels(name="John", age="30").inc()
```

```yaml
- job_name: "python-app"
  static_configs:
    - targets: ["localhost:8000"]
```

```promql
my_counter_total{name="John", age="30"}
```

```text
https://github.com/prometheus/client_java
```

```xml
<dependencies>
  <dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient</artifactId>
    <version>0.16.0</version>
  </dependency>
  <dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient_httpserver</artifactId>
    <version>0.16.0</version>
  </dependency>
</dependencies>
```

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Gauge;
import io.prometheus.client.Summary;
import io.prometheus.client.exporter.HTTPServer;

public class PrometheusApp {

    static final Counter counter = Counter.build()
        .name("java_random_counter")
        .help("Example Java counter")
        .register();

    static final Gauge gauge = Gauge.build()
        .name("java_random_gauge")
        .help("Example Java gauge")
        .register();

    static final Summary summary = Summary.build()
        .name("java_process_time")
        .help("Time spent processing")
        .register();

    public static void main(String[] args) throws Exception {
        HTTPServer server = new HTTPServer(8000);

        counter.inc();
        counter.inc(4.5);

        gauge.set(100);
        gauge.inc(10);
        gauge.dec(5);

        Summary.Timer timer = summary.startTimer();
        try {
            Thread.sleep(1000);
        } finally {
            timer.observeDuration();
        }

        Thread.currentThread().join();
    }
}
```
```text
http://localhost:8000/metrics
```

```java
static final Counter labeledCounter = Counter.build()
    .name("java_labeled_counter")
    .help("Counter with labels")
    .labelNames("foo", "bar")
    .register();
```

```java
labeledCounter.labels("1", "2").inc();
```

```text
prometheus-net
```

```powershell
Install-Package prometheus-net
```

```csharp
using System.Threading; // needed for Thread.Sleep when implicit usings are disabled
using Prometheus;

class Program
{
    private static readonly Counter counter =
        Metrics.CreateCounter("dotnet_counter", "Example counter");

    private static readonly Gauge gauge =
        Metrics.CreateGauge("dotnet_gauge", "Example gauge");

    private static readonly Summary summary =
        Metrics.CreateSummary("dotnet_summary", "Example summary");

    static void Main()
    {
        var server = new MetricServer(port: 8000);
        server.Start();

        counter.Inc();
        gauge.Set(100);
        gauge.Dec(10);

        using (summary.NewTimer())
        {
            Thread.Sleep(1000);
        }

        while (true)
        {
            Thread.Sleep(1000);
        }
    }
}
```
```csharp
var labeledGauge = Metrics.CreateGauge(
    "dotnet_labeled_gauge",
    "Gauge with labels",
    new[] { "foo", "bar" }
);

labeledGauge
    .WithLabels("1", "2")
    .Set(100);
```

```csharp
var gauge = Metrics.CreateGauge(
    "dotnet_env_gauge",
    "Gauge with static labels",
    new GaugeConfiguration
    {
        LabelNames = new[] { "foo", "bar" },
        StaticLabels = new Dictionary<string, string>
        {
            { "environment", "dev" }
        }
    }
);
```

```csharp
Metrics.DefaultRegistry.SetStaticLabels(
    new Dictionary<string, string>
    {
        { "country", "us" }
    }
);
```

```text
country="us"
```

```csharp
counter.CountExceptions(() =>
{
    try
    {
        throw new NotImplementedException();
    }
    catch
    {
        // swallow exception
    }
});
```
CODE_BLOCK:
    scrape_configs:
      - job_name: "java"
        static_configs:
          - targets: ["localhost:8000"]
      - job_name: "dotnet"
        static_configs:
          - targets: ["localhost:8000"]

CODE_BLOCK: Prometheus.Web.Auth

CODE_BLOCK:
    app.UseEndpoints(endpoints =>
    {
        endpoints.MapControllers();
        endpoints.MapMetrics(); // exposes /metrics
    });

CODE_BLOCK: /metrics

CODE_BLOCK: http://localhost:<port>/metrics

CODE_BLOCK:
    using Prometheus;
    using Microsoft.AspNetCore.Mvc;

    public class HomeController : Controller
    {
        // Static, so every request shares the same counter instance.
        private static readonly Counter IndexCounter = Metrics.CreateCounter(
            "index_action_total",
            "Number of times Index action is called"
        );

        public IActionResult Index()
        {
            IndexCounter.Inc();
            return Ok("Hello from Prometheus!");
        }
    }

COMMAND_BLOCK:
    // Requires: using Microsoft.Extensions.Diagnostics.HealthChecks;
    services.AddHealthChecks()
        .AddCheck("self", () => HealthCheckResult.Healthy())
        // From prometheus-net.AspNetCore.HealthChecks: publishes the
        // health check results as Prometheus metrics.
        .ForwardToPrometheus();

CODE_BLOCK:
    app.UseEndpoints(endpoints =>
    {
        endpoints.MapHealthChecks("/health");
        endpoints.MapMetrics();
    });

CODE_BLOCK: aspnetcore_healthcheck_status{name="self"} 1

CODE_BLOCK:
    static_configs:
      - targets: ["localhost:5000"]

CODE_BLOCK:
    scrape_configs:
      - job_name: "ec2"
        ec2_sd_configs:
          - region: ap-southeast-2
            port: 9100

CODE_BLOCK:
    filters:
      - name: tag:Environment
        values: ["prod"]

CODE_BLOCK:
    relabel_configs:
      - source_labels: [__meta_ec2_public_ip]
        target_label: __address__
        replacement: "$1:9100"

CODE_BLOCK:
    - targets:
        - localhost:9100
      labels:
        team: alpha

CODE_BLOCK:
    scrape_configs:
      - job_name: "file_sd"
        file_sd_configs:
          - files:
              - /etc/prometheus/file_sd/*.yml

- Collect metrics
- Store their values
- Keep track of timestamps
- Analyze trends over time

- Scrape and store metrics
- Attach labels to metrics
- Store metrics efficiently over time
- Allow querying using PromQL
- Create alerts based on metric conditions

- Metrics are not stored in one place
- You may have:
  - Metrics in Prometheus
  - Time-series data in SQL databases
  - Infrastructure metrics in cloud platforms

- A SQL database like SQL Server or MySQL
- Cloud metrics in Amazon CloudWatch

- That adds unnecessary complexity
- It increases maintenance overhead
- It is only useful if you need to combine metrics mathematically

- Connect to multiple data sources
- Build rich dashboards
- Visualize metrics from different systems in one place

- SQL databases
- Cloud providers like Amazon CloudWatch
- Many others

- One panel may show data from Prometheus
- Another panel may read from a SQL database
- Another panel may show CloudWatch metrics

- Centralized alerting
- Visual alert states on dashboards
- Unified alert management across data sources

- You don’t need separate alerting systems for each tool
- Teams can see what’s broken and why from one place

- Open source (widely used in DevOps and SRE teams)
- Also available as an enterprise offering with advanced features

- Prometheus → collects and stores metrics
- Grafana → visualizes metrics from many sources
- Together, they provide:
  - Strong monitoring
  - Clear dashboards
  - Unified alerting
  - Real production-ready observability

- One Prometheus server (or a Prometheus cluster)
- Many systems we want to monitor:
  - Applications
  - Databases
  - Servers
  - Cloud services
  - Proxies, load balancers
  - IoT devices

- Prometheus always pulls metrics
- Nothing ever pushes metrics directly into Prometheus

- Add a Prometheus client library to the application
- Expose a /metrics endpoint
- Many others

- Collects metrics internally
- Exposes them over HTTP
- Prometheus scrapes them

- Databases (MySQL, PostgreSQL, SQL Server)
- Cloud services like Amazon CloudWatch
- Proxies and load balancers
- Third-party systems
- IoT devices (sensors, meters, traffic lights)

- Add libraries
- Change the application logic
- Modify how metrics are exposed

- It does not scale
- Scripts fail silently
- Scheduling becomes complex
- Millions of devices pushing data can overload Prometheus

- Knows how to talk to a system
- Collects metrics from it
- Exposes those metrics in Prometheus format

- Node Exporter → Linux servers
- MySQL Exporter → MySQL databases
- Windows Exporter → Windows servers
- CloudWatch Exporter → AWS metrics
- Proxy exporters (NGINX, HAProxy, Envoy)

- On the same machine (Linux, Windows)
- Next to the system (for cloud services, databases, proxies)
- As a container
- As a Kubernetes Pod

- Discovers the exporter
- Connects to it
- Pulls metrics

- Configured in prometheus.yml
- Scrape interval: 15 seconds in the sample prometheus.yml (Prometheus's built-in default is 1 minute)
- Prometheus:
  - Connects to exporters
  - Pulls metrics
  - Stores them as time-series data

- Short-lived processes
- Do not stay running long enough to be scraped

- Applications push metrics to PushGateway
- PushGateway stores them temporarily
- PushGateway exposes a /metrics endpoint
- Prometheus scrapes PushGateway

- Metrics are not pushed to Prometheus
- Prometheus still pulls
- PushGateway only acts as an intermediate buffer

- Is optional
- Used only for short-lived jobs
- Should NOT be used for normal services or IoT streams

- Scale safely
- Control load
- Avoid overload
- Work with thousands of heterogeneous systems

- Large infrastructures
- Cloud-native systems
- Hybrid environments
- IoT at scale

- Applications with source code → Client libraries
- Systems without source code → Exporters
- Short-lived jobs → PushGateway
- Prometheus → Always pulls metrics
- Scraping → Happens on a fixed interval (15s in the sample configuration)

- Any machine running a Unix-based OS
- Examples: Linux servers, Ubuntu, Amazon Linux, macOS

- Memory usage
- Network I/O
- File system stats
- Load average
- System uptime

- Capacity planning
- Performance troubleshooting
- Infrastructure monitoring
- Alerting on system health

- It is part of the Prometheus project
- Maintained by the Prometheus team
- Stable and production-ready

- Community maintained
- Vendor maintained
- Third-party maintained

- Prometheus server → central collector
- Node Exporter → installed on each machine you want to monitor

- Metrics include sensitive system information
- Opening 9100 to the internet exposes your server

- Open port 9100
- Source = Prometheus server security group
- NOT 0.0.0.0/0

- Only Prometheus can scrape metrics
- No public access

- Status → Targets
- Target state should be UP (green)

- Check firewall / security group
- Check Node Exporter is running

- Terminal closes → exporter stops
- Server restarts → exporter stops

- Create user & group
- Move binary to /var/lib/node_exporter
- Create node_exporter.service
- Enable & start service

- Survives reboots
- Starts automatically
- Production-ready

- Prometheus → Targets
- Both Prometheus and Node Exporter should be UP

- Node Exporter = host metrics
- Not related to Node.js
- Installed on machines, not Prometheus
- Uses port 9100
- Must be secured
- Always run as a service in production
- Works on Linux and macOS

- A metric name
- A set of labels (key–value pairs)
- A timestamp (Unix timestamp)

- http_requests_total
- cpu_usage_seconds_total
- authentication_api_hits_total

- Labels are optional
- Each label is a key = value pair
- A metric can have multiple labels

- Which service?
- Which user/account?
- Which endpoint?
- Which instance?
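To make the data model concrete, here is a minimal Python sketch of how a metric name and its labels together identify a time series, and how one sample could be written in the text style seen on a /metrics page. The helper names (`series_key`, `render`) and the label values are assumptions made up for illustration; Prometheus does this internally and the sketch skips details such as escaping label values.

```python
from typing import Dict, Tuple

# Hypothetical type alias: metric name plus sorted (label, value) pairs.
SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """Identity of a series: the metric name plus all of its label pairs."""
    return (name, tuple(sorted(labels.items())))


def render(name: str, labels: Dict[str, str], value: float) -> str:
    """Write one sample as name{label="value",...} value."""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{pairs}}} {value}"


# Same metric name, different label value: two distinct time series.
first = series_key("authentication_api_hits_total", {"instance": "web-1"})
second = series_key("authentication_api_hits_total", {"instance": "web-2"})
print(first == second)

print(render("authentication_api_hits_total", {"service": "auth", "instance": "web-1"}, 1))
```

Sorting the label pairs means the order in which labels are supplied does not change the identity of the series.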
- Metric name comes first
- Labels go inside { }
- Labels are separated by commas

- account_id="12345"
- response_time_ms="800"

- The counter increases by 1
- A new data point is recorded with:
  - Current timestamp
  - Updated value

- Prometheus stores data as time series
- Every time series = metric name + labels
- Labels are key–value pairs used for filtering and aggregation
- Timestamps are automatically attached
- Different label values = different time series

- When storing metrics in Prometheus
- When retrieving metrics using PromQL (via UI or API)

- Scalars can be integers or floating-point numbers
- In Prometheus, all numbers are treated as floats

- code="200" → string, not a number
- job="prometheus" → string

- Label values must be enclosed in quotes
- Both double quotes (" ") and single quotes (' ') are accepted

- code=~"2.*" is a regular expression
- Match any code starting with 2:
  - 200, 201, 204, 205, etc.
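The regex matcher above can be imitated in Python to show why the trailing `.*` matters. Prometheus anchors label regexes, so the pattern has to match the whole label value; the sketch below mirrors that with `re.fullmatch`. The function name `match_regex` and the sample series are assumptions for illustration, and Python's regex dialect differs from the one Prometheus uses, although this simple pattern behaves the same in both.

```python
import re
from typing import Dict, List


def match_regex(series: List[Dict[str, str]], label: str, pattern: str) -> List[Dict[str, str]]:
    """Keep series whose label value matches the whole pattern, like label=~"pattern"."""
    compiled = re.compile(pattern)
    # A missing label is treated as an empty string.
    return [s for s in series if compiled.fullmatch(s.get(label, ""))]


# Label values are strings, even when they look like numbers.
series = [
    {"code": "200", "job": "prometheus"},
    {"code": "201", "job": "prometheus"},
    {"code": "204", "job": "prometheus"},
    {"code": "404", "job": "prometheus"},
    {"code": "500", "job": "prometheus"},
]

print([s["code"] for s in match_regex(series, "code", "2.*")])

# Without ".*" the pattern must equal the entire value, so nothing matches.
print([s["code"] for s in match_regex(series, "code", "2")])
```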
- code is stored as a string
- You are comparing it as a number

- Use only the metric name
- Optionally apply label filters

- One value per time series
- All values sampled at the same timestamp

- Selects only time series matching the labels
- Still returns one value per series

- Return all samples from the last 5 minutes
- Time range is always in the past
- There is no month unit
- Units are case-sensitive

- Multiple rows
- Each row has one value
- Same timestamp
- Different label values (e.g., device="eth0", device="lo")

- Same metrics
- Each metric has multiple values
- Values depend on:
  - Time range (5m)
  - Scrape interval

- Scrape interval = 15s
- Time range = 5 minutes (so roughly 300s / 15s = 20 samples per series)

- The original vector is not modified
- PromQL always returns a new vector

- Prometheus matches metric name + labels
- Only matching series appear in the result

- Labels are always strings
- Scalars are single numeric values
- Instant vectors = one value per time series
- Range vectors = multiple values over time
- Arithmetic operations:
  - Scalar + Vector → applied to every element
  - Vector + Vector → matched by labels
- PromQL never mutates existing data

- Binary comparison operators
- Set binary operators
- Label filtering (selectors)
- Aggregation operators
- Time offset
- How Prometheus visualizes results

- 1 represents true
- 0 represents false

- Only the time series where the value equals 10 remains

- Only time series that exist in both vectors are compared (they are paired by identical label sets; the metric name is not part of the match)
- Only matching elements appear in the result

- Only elements present in both A and B
- Only if their values satisfy the comparison

- You get elements where the left-side value is greater than the right-side value

- Case-sensitive
- Work only with instant vectors
- Do NOT compare values; they compare the existence of time series

- Metric name must match
- code must be "200"
- job must be "prometheus"

- 200, 201, 204, 205

- Always ensure your regex cannot match an empty string
- Use .* when you want to ignore remaining characters

- Work on a single instant vector
- Return a new instant vector
- Usually reduce the number of time series

- One value (sum of all elements)
- One value per mode
- Largest or smallest values

- Values are always 1
- Only labels matter

- Instant vectors → can be graphed
- Range vectors → cannot be graphed directly

- Comparison operators return 1 or 0
- Set operators work on existence, not values
- Labels are always strings
- Aggregations reduce vectors
- group always returns value = 1
- offset must be applied before aggregation
- Graphs require numeric values

- Writing queries
- Building dashboards
- Creating alerts

- Both functions accept an instant vector
- The value must represent time in seconds (Unix timestamp)
- Time is evaluated in UTC

- A number between 1 and 31
- Represents the day of the month

- A number between 0 and 6 (0 = Sunday)

- They work only on gauges
- They do NOT work on counters

- They compare the first and last samples in the time window
- Accepts a range vector
- Calculates:

- Uses only the last two samples
- More sensitive to short-term changes
- Useful for quick fluctuations

- If data exists → returns nothing
- If data is missing → returns 1

- Empty (because data exists)
- One time series

- Input: range vector
- Output: instant vector
- If data is missing → returns 1
- If data exists → returns empty

- Converts all values to absolute values
- Example: -5 → 5

- Rounds values up
- Example: 1.6 → 2

- Rounds values down
- Example: 1.6 → 1

- Limits values to a range (no series is dropped):
  - Values less than min are raised to min
  - Values greater than max are lowered to max

- Raises values below min up to min
- Leaves everything else unchanged

- Lowers values above max down to max
- Leaves everything else unchanged

- All values < 300 are raised to 300
- All values > 150000 are lowered to 150000
- Values are kept between 300 and 150000

- Prevents outliers from ruining graphs
- Makes dashboards clean and readable
- Very common in Grafana visualizations

- day_of_month() and day_of_week() work on time values
- delta() and idelta() work only on gauges
- absent() and absent_over_time() detect missing data
- Mathematical functions modify values
- Clamp functions are critical for dashboard hygiene
- Many functions accept range vectors but return instant vectors

- Logarithmic & utility functions
- Sorting & time functions
- Aggregation over time
- Alerts and Alertmanager (concept + hands-on)

- Returns the binary logarithm (base-2)
- Example:
  - Value = 2 → result = 1
  - Value = 8 → result = 3

- Returns the decimal logarithm
- Example:
  - Value = 10 → result = 1
  - Value = 100 → result = 2

- Returns the natural logarithm
- Function name is lowercase

- Sorts values in ascending order

- Sorts values in descending order

- Returns the current Unix timestamp
- Not guaranteed to be exact current second

- Returns the timestamp when each sample was scraped
- Output value = timestamp

- Returns timestamps from one hour ago
- Notice how timestamps change with offset

- Averages CPU 0
- Over the last 2 hours
- Returns an instant vector

- Errors suddenly spike at 4:30 PM
- Developer fixes it later
- Users experience failures before you notice

- Detect problems before chaos
- Give engineers time to react
- Avoid:
  - Too many alerts (noise)
  - Alerts too late (damage already done)

- Not too low (avoid flapping)
- Not too high (avoid late alerts)

- Evaluates alert rules
- Shows alerts in the UI only

- Receives alerts from Prometheus
- Sends notifications:
  - Email
  - Slack
  - PagerDuty
  - OpsGenie
  - Webhooks
- Handles:
  - Deduplication
  - Grouping
  - Throttling

- Each Prometheus instance sends alerts independently
- Duplicate alerts everywhere

- Same alerts are grouped
- Only one notification is sent
- Repeated alerts are batched

- groups → required
- rules → list of alert rules
- alert → alert name
- expr → PromQL expression

- Node Exporter is not reachable

- Paths are relative
- You can load multiple rule files

- Homebrew (macOS):
- Windows:
  - Stop the process
  - Start Prometheus again

- 🟢 Inactive → condition not met
- 🔴 Firing → alert active

- See expression
- See duration
- Evaluate the query manually

- Alert turns red
- Status = Firing

- Alert returns to green

- MySQL / PostgreSQL
- Elasticsearch
- NGINX / Apache
- Cloud services

- Adjust labels / thresholds
- Use in production

- Log functions help normalize values
- Sorting helps with visibility
- Aggregation-over-time works on range vectors
- Alerts detect issues before chaos
- Prometheus evaluates alerts
- Alertmanager sends notifications
- Deduplication prevents alert spam
- Always reuse community alert rules

- The for clause (time-based alert stability)
- Using absent() vs comparisons
- Adding labels and annotations
- Alert templating ($labels, $value)
- Alertmanager recap
- Installing Alertmanager (Windows, macOS, Linux)

- Prometheus evaluates alert rules every 1 minute
- If the expression is true for one evaluation, the alert fires

- Temporary failures
- Intermittent network issues
- Self-healing behavior

- s – seconds
- m – minutes

- The exporter must be down continuously for 5 minutes
- Only then does the alert fire

- Returns nothing if data exists
- Returns 1 if data is missing
- In Prometheus: 1 = true

- No target exists with job="node_exporter"

- team → who owns the alert
- severity → how serious it is

- summary → short message
- description → detailed explanation

- $labels → all labels of the time series
- $labels.instance → specific label
- $value → result of the alert expression

- 🟢 Inactive – condition not met
- 🔴 Firing – alert active

- Annotations
- Evaluation timestamp (UTC)

- Converts alerts → notifications
- Sends alerts to:
  - Email
  - Slack
  - PagerDuty
  - OpsGenie
  - Webhooks
- Deduplicates alerts
- Groups related alerts
- Silences alerts during maintenance

- Runs on port 9093
- UI is read-only for configuration
- Configuration happens only via YAML

- Go to Prometheus download page
- Download Alertmanager (Windows AMD64)
- Extract the ZIP
- Files inside:
  - alertmanager.exe
  - alertmanager.yml
- Run:

- Install MacPorts
- Config file location:
- Restart after changes:

- Download Alertmanager binary
- Extract files
- Set ownership:
- Create systemd service:
- Reload and start:

- for prevents alert flapping
- absent() is cleaner for missing targets
- Labels route alerts
- Annotations explain alerts
- Templates add dynamic context
- Alertmanager handles notifications
- UI is read-only for configuration
- Configuration is always YAML-based

- Prometheus raises an alert
- Alertmanager receives the alert
- Alertmanager evaluates routes
- Routes contain matchers
- If a matcher matches alert labels:
  - Alert is sent to a receiver
  - Receiver sends notification to:

- severity = critical
- team = billing
- Regex matches like service =~ "billing.*"

- Deprecated (legacy): match, match_re
- Recommended (modern): matchers

- Matchers (conditions)
- A receiver (destination)

- Alert is sent to the configured receiver

- All alerts → default email
- Critical alerts → urgent email

- Create or choose a channel
- Go to Integrations
- Install Incoming Webhooks
- Choose the channel
- Copy the Webhook URL
- (Optional) Customize icon or emoji

- Go to Services
- Select a service
- Go to Integrations
- Add integration → Prometheus
- Copy the Integration Key

- Done via Alertmanager UI
- Used during maintenance or deployments

- Silence alerts for 2 hours
- Silence alerts matching team=billing

- Defined in Alertmanager config
- Suppresses alerts based on other alerts
- Used to reduce noise

- Alert A: Server is down
- Alert B: Website is down

- Website alert is redundant
- Suppress Alert B

- If a Team Alpha alert is firing
- Suppress Team Beta alerts
- When instance label is equal

- Inhibition happens only in Alertmanager
- Prometheus will still show both alerts
- Only notifications are suppressed

- You have thousands of metrics
- Dashboards refresh frequently
- Data volume is large

- Compute it once
- Store it as:

- Thousands of IoT sensors
- Hundreds of hotels
- Constant dashboards

- Compute periodically
- Store results
- Dashboards become fast

- Runs a PromQL expression
- Saves the result as a new metric

- Same directory as prometheus.yml
- Create a rules/ folder
- Reference it in config

- Routes decide where alerts go
- Matchers work on labels
- Slack & PagerDuty use webhooks
- Silencing = temporary
- Inhibition = rule-based suppression
- Recording rules improve performance
- Recording rules create new metrics

- Precompute expensive PromQL expressions
- Store the result as a new metric
- Improve dashboard and alert performance

- Is a counter
- Has many labels (cpu, mode, instance, etc.)
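Because the metric is a counter, what usually gets graphed or recorded is its per-second rate over a time window. The Python sketch below shows the core idea only: it sums the increases between consecutive samples, treats any drop as a counter reset, and divides by the elapsed time. It is a simplification under stated assumptions, since the real `rate()` also extrapolates to the edges of the window; `simple_rate` and the sample values are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase of a counter across the given samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")

    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # The counter went down, so the process restarted:
            # count the growth since the reset to zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    return increase / elapsed


# One sample every 15 seconds, with a restart between t=30 and t=45.
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]
print(simple_rate(window))
```

A recording rule would evaluate an expression of this kind on a schedule and store the result as a new metric, so dashboards read the precomputed value instead of recomputing it.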
- Counters must use rate() or irate()
- We also need a time window

- Converts the counter into a rate (per second)
- Produces an instant vector
- Can be graphed
- Is ideal for recording rules

- macOS / Windows:
  - Create a rules/ directory
  - Place it next to prometheus.yml

- Windows: Restart the Prometheus process

- Behaves like any normal metric
- Can be aggregated again
- Can be used in alerts and dashboards

- Always build the query first
- Counters → rate() → aggregate
- Recording rules create new metrics
- Great for dashboards and alerts
- Reduce query load dramatically

- Do not run continuously
- Start → do work → exit
- Cannot always be scraped

- Background tasks
- One-time functions

- Client libraries
- Pushgateway (covered later)

- Counters reset when the application restarts
- Values exist only while the app is running

- Client libraries expose /metrics
- Python client works without Flask
- Counters, Gauges, Summaries are easy
- Labels add powerful dimensions
- App restart resets metrics
- Prometheus handles scraping

- simpleclient
- simpleclient_httpserver

- java_random_counter_total = 5.5
- java_random_gauge = 105
- java_process_time_count = 1
- java_process_time_sum ≈ 1

- Counters always end with _total
- Metrics reset when the application restarts
- Summary produces:
  - _count
  - _sum

- You must always use .labels()
- Calling .inc() directly will throw an exception

- Once labels exist → must always use WithLabels()

- Exception still occurs
- Metric increments automatically
- Exception handling is your responsibility

- Official Prometheus client
- Uses embedded HTTP server
- Strongly typed, explicit registration

- Community-driven but mature
- Very flexible label handling
- Supports global static labels

- Counters reset on restart
- Labels must always be populated
- /metrics endpoint is mandatory
- Prometheus scrapes; clients only expose

- Use the Prometheus .NET client library (prometheus-net)
- Expose metrics from an ASP.NET Core web application
- Scrape those metrics using Prometheus
- Understand why service discovery and Pushgateway are needed later

- Add a new project to your solution
- Choose ASP.NET Core Web Application
- Name it something like:

- Authentication: None
- HTTPS: optional
- Framework: .NET 6 / .NET 7 (either is fine)

- prometheus-net
- prometheus-net.AspNetCore
- prometheus-net.AspNetCore.HealthChecks

- Metric creation
- /metrics endpoint
- Health check metrics

- Thread count
- GC collections
- Process CPU
- Memory usage

- Every request increments the counter
- Metric appears automatically in /metrics
- Prometheus can scrape it without extra configuration

- 1 → Healthy
- 0 → Unhealthy

- /metrics endpoint
- Default runtime metrics
- Custom counters and gauges
- Health checks exposed as Prometheus metrics

- IPs never change
- Number of servers is fixed

- Auto Scaling Groups
- VM scale-out / scale-in
- Ephemeral IPs
- Serverless functions (no IPs)

- AWS Lightsail

- Requests are round-robin
- Metrics are mixed across instances
- You lose instance-level visibility
- Labels become unreliable

- Azure Functions
- Short-lived processes

- Applications push metrics
- Pushgateway stores them temporarily
- Prometheus scrapes Pushgateway

- ec2_sd_configs
- kubernetes_sd_configs
- dns_sd_configs
- file_sd_configs

- Discovers instances
- Updates targets dynamically
- Scrapes node exporters

- Instance state
- Availability zone
- Instance type

- Build labels
- Replace IPs
- Drop unwanted targets
- Control __address__

- Cloud provider is unsupported
- Custom environments
- On-prem / hybrid setups

- Use wildcard (*)
- Let automation update files
- No Prometheus restart required

- How ASP.NET Core exposes metrics
- How Prometheus scrapes applications
- Why static configs fail in cloud
- When to use service discovery
- Why Pushgateway exists
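For the file-based discovery case, "let automation update files" can be as small as a script that rewrites a target file. The Python sketch below is one possible shape, not a prescribed tool: `write_targets`, the file name `nodes.json`, and the temporary directory are assumptions for illustration. It writes JSON, which file-based discovery accepts alongside YAML, so the `files` pattern in the scrape config would need to include `*.json` for Prometheus to pick the file up.

```python
import json
import os
import tempfile
from typing import Dict, List


def write_targets(path: str, targets: List[str], labels: Dict[str, str]) -> None:
    """Write one target group to a discovery file, replacing it atomically."""
    groups = [{"targets": targets, "labels": labels}]
    directory = os.path.dirname(path) or "."
    # Write to a temporary file first so a reader never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(tmp_path, path)


# Demo in a throwaway directory; a real setup would point at the
# directory referenced by file_sd_configs.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "nodes.json")
write_targets(demo_file, ["localhost:9100"], {"team": "alpha"})

with open(demo_file) as handle:
    print(handle.read())
```

Because Prometheus watches the matched files for changes, rerunning a script like this is enough to add or remove targets without restarting the server.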