#+date: <2024-09-20 Fri 13:38:52>
#+title: Linux Observability with Self-Hosted Prometheus and Grafana Cloud
#+description: Learn how to self-host a Prometheus data collection tool with Docker and visualize the results with Grafana Cloud.
#+filetags: :linux:grafana:
#+slug: prometheus-grafana-cloud

This tutorial will guide you through the process of:

1. Configuring a free Grafana Cloud account.
2. Installing Prometheus to store metrics.
3. Installing Node Exporter to export machine metrics for Prometheus.
4. Installing Nginx Exporter to export Nginx metrics for Prometheus.
5. Visualizing data in Grafana dashboards.
6. Configuring alerts based on Grafana metrics.

* Grafana Cloud

To get started, visit the [[https://grafana.com/auth/sign-up/create-user][Grafana website]] and create a free account.

** Prometheus Data Source

By default, a Prometheus data source should exist on your data sources page (=$yourOrg.grafana.net/connections/datasources=). If not, add a new data source using the Prometheus type.

Once you have a valid Prometheus data source, open it and note the following items:

| Data                  | Example                                                             |
|-----------------------+---------------------------------------------------------------------|
| Prometheus Server URL | https://prometheus-prod-13-prod-us-east-0.grafana.net/api/prom/push |
| User                  | 1234567                                                             |
| Password              | configured                                                          |

** Cloud Access Policy Token

Now let's create an access token in Grafana. Navigate to the Administration > Users and Access > Cloud Access Policies page and create an access policy. The =metrics > write= scope must be enabled within the access policy you choose.

Once you have an access policy with the correct scope, click the Add Token button. Be sure to copy and save the token, since it will disappear once the modal window is closed.

** Dashboards

Finally, let's create a couple of dashboards so that we can easily explore the data that we will be importing from the server. I recommend importing the following dashboards:

- [[https://grafana.com/grafana/dashboards/1860-node-exporter-full/][Node Exporter Full]]
- [[https://github.com/nginxinc/nginx-prometheus-exporter/blob/main/grafana][nginx-prometheus-exporter]]
- Prometheus 2.0 Stats

Refer to the bottom of the post for dashboard screenshots!

* Docker

On the machine that you want to observe, make sure Docker and Docker Compose are installed. This tutorial uses Docker Compose to create a group of containers that work together to send metrics to Grafana.

Let's start by creating a working directory and the compose file:

#+begin_src sh
mkdir ~/prometheus && \
  cd ~/prometheus && \
  nano compose.yml
#+end_src

Within the =compose.yml= file, let's paste the following:

#+begin_src yaml
# compose.yml
networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data: {}

services:
  nginx-exporter:
    image: nginx/nginx-prometheus-exporter
    container_name: nginx-exporter
    restart: unless-stopped
    command:
      - '--nginx.scrape-uri=http://host.docker.internal:8080/stub_status'
    expose:
      - 9113
    networks:
      - monitoring
    extra_hosts:
      - host.docker.internal:host-gateway

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    expose:
      - 9100
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    expose:
      - 9090
    networks:
      - monitoring
#+end_src

Next, let's create the =prometheus.yml= configuration file, which the compose file mounts into the Prometheus container. Replace the =remote_write= username and password with the User and access policy token you noted earlier.

#+begin_src sh
nano prometheus.yml
#+end_src

#+begin_src yaml
# prometheus.yml
global:
  scrape_interval: 1m

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 1m
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'nginx'
    scrape_interval: 5s
    static_configs:
      - targets: ['nginx-exporter:9113']

remote_write:
  - url: 'https://prometheus-prod-13-prod-us-east-0.grafana.net/api/prom/push'
    basic_auth:
      username: 'prometheus-grafana-username'
      password: 'access-policy-token'
#+end_src

With both files in place, start the containers:

#+begin_src sh
sudo docker compose up -d
#+end_src

#+begin_quote
I'm not sure if it made a difference, but I also whitelisted port 8080 on my local firewall with =sudo ufw allow 8080=.
#+end_quote
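The compose file starts Prometheus with =--web.enable-lifecycle=, which enables an HTTP reload endpoint; alternatively, Prometheus also reloads its configuration on =SIGHUP=, which is easy to send through Docker. Either way, later edits to =prometheus.yml= can be applied without recreating the container:

#+begin_src sh
# Prometheus re-reads prometheus.yml on SIGHUP; no container restart needed
sudo docker kill --signal=HUP prometheus
#+end_src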
** Nginx

To enable the Nginx statistics we need for the nginx-exporter container, we need to modify the Nginx configuration on the host. More specifically, we need to create a location that returns =stub_status= when we query port 8080 on localhost.

#+begin_src sh
sudo nano /etc/nginx/conf.d/default.conf
#+end_src

#+begin_src conf
server {
    listen 8080;
    listen [::]:8080;

    location /stub_status {
        stub_status;
    }
}
#+end_src

#+begin_src sh
sudo systemctl restart nginx.service
#+end_src

** Debugging

At this point, everything should be running smoothly. If not, here are a few areas to check for obvious errors.

Nginx: Curl the =stub_status= page from the Nginx web server on the host machine to see if Nginx and =stub_status= are working properly.

#+begin_src sh
curl http://127.0.0.1:8080/stub_status

# EXPECTED RESULTS:
Active connections: 101
server accepts handled requests
 7510 7510 9654
Reading: 0 Writing: 1 Waiting: 93
#+end_src

Nginx-Exporter: Curl the exported Nginx metrics. First, figure out the IP address of the nginx-exporter container on the Docker network.

#+begin_src sh
# Figure out the IP address of the Docker container
sudo docker network inspect prometheus_monitoring
...
"Name": "nginx-exporter",
"EndpointID": "ef999a53eb9e0753199a680f8d78db7c2a8d5f442626df0b1bb945f03b73dcdd",
"MacAddress": "02:42:c0:a8:40:02",
"IPv4Address": "192.168.64.2/20",
...
#+end_src
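If you only need a container's IP address rather than the full network dump, =docker inspect= can print it directly with a format string; the container name below matches the =container_name= values set in =compose.yml=:

#+begin_src sh
# Print only the nginx-exporter container's IP address
sudo docker inspect \
  -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' \
  nginx-exporter
#+end_src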
With the container's IP address in hand, curl the metrics endpoint on port 9113:

#+begin_src sh
# Curl the exported Nginx metrics
curl 192.168.64.2:9113/metrics

# EXPECTED RESULTS:
...
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.9927e-05
go_gc_duration_seconds{quantile="0.25"} 4.24e-05
go_gc_duration_seconds{quantile="0.5"} 4.8531e-05
...
#+end_src

Node-Exporter: Curl the exported node machine metrics.

#+begin_src sh
# Curl the exported Node metrics
curl 192.168.64.3:9100/metrics

# EXPECTED RESULTS:
...
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 47
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
...
#+end_src

Grafana: Open the Explore panel and look to see if any metrics are coming through the Prometheus data source. If not, something on the machine is preventing data from flowing through.

* Alerts & IRM

Now that we have our data connected and visualized, we can define alerting rules and determine what Grafana should do when an alert is triggered.

** OnCall

#+caption: OnCall
[[https://img.cleberg.net/blog/20240920-prometheus-grafana-cloud/oncall.png]]

Within the Alerts & IRM section of Grafana (=/alerts-and-incidents=), open the Users page. The Users page allows you to configure user connections such as:

- Mobile App
- Slack
- Telegram
- MS Teams
- iCal
- Google Calendar

In addition to the connections of each user, you can specify how each user or team is alerted for Default Notifications and Important Notifications.

Finally, you can access the Schedules page within the OnCall module to schedule users and teams to be on call for specific date and time ranges. For my purposes, I put myself on call 24/7 so that I receive all alerts.

#+caption: User Information
[[https://img.cleberg.net/blog/20240920-prometheus-grafana-cloud/irm_user_info.png]]

** Alerting

#+caption: Alerting Insights
[[https://img.cleberg.net/blog/20240920-prometheus-grafana-cloud/alerting_insights.png]]

Now that we have users and teams associated with an on-call schedule and configured to receive the proper alerts, let's define a rule that will generate alerts. Within the Alerting section of the Alerts & IRM module, you can create alert rules, contact points, and notification policies.

Let's start by opening the Alert Rules page and clicking the New Alert Rule button. As shown in the image below, we will create an alert for high CPU temperature by querying the =node_hwmon_temp_celsius= metric from our Prometheus data source. Next, we will set the threshold to anything above 50 (degrees Celsius). Finally, we will tell Grafana to evaluate this rule every minute via our Default evaluation group. This rule is connected to our Grafana email, but can be associated with any notification policy.

#+caption: New Alert Rule
[[https://img.cleberg.net/blog/20240920-prometheus-grafana-cloud/new_alert.png]]

When the alert fires, it will generate an email (or whatever notification policy you assigned) that will look something like the following image.

#+caption: Alerting Example
[[https://img.cleberg.net/blog/20240920-prometheus-grafana-cloud/email_alert.png]]
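Note that =node_hwmon_temp_celsius= only exists when node-exporter can read hardware temperature sensors, which isn't the case on many virtual machines. A quick sanity check is to run the same query against the local Prometheus API; the IP address below is just a placeholder, so substitute your prometheus container's address from the debugging section (port 9090 is only exposed on the Docker network):

#+begin_src sh
# Query the metric behind the alert rule; an empty "result" array means
# node-exporter isn't reporting any temperature sensors on this machine
# (192.168.64.4 is a placeholder; find yours via docker network inspect)
curl 'http://192.168.64.4:9090/api/v1/query' \
  --data-urlencode 'query=node_hwmon_temp_celsius'
#+end_src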
** Dashboards

As promised above, here are some dashboard screenshots based on the configurations in this post.

#+caption: Nginx Dashboard
[[https://img.cleberg.net/blog/20240920-prometheus-grafana-cloud/dashboard_nginx.png]]

#+caption: Node Dashboard
[[https://img.cleberg.net/blog/20240920-prometheus-grafana-cloud/dashboard_node.png]]

#+caption: OnCall Dashboard
[[https://img.cleberg.net/blog/20240920-prometheus-grafana-cloud/dashboard_oncall.png]]

#+caption: Prometheus Dashboard
[[https://img.cleberg.net/blog/20240920-prometheus-grafana-cloud/dashboard_prometheus.png]]