Monitoring vs. Observability

Monitoring tells you when something is broken. Observability tells you why. A good DevOps setup needs both. In practice, that means collecting three types of data:

  • Metrics — numbers over time (CPU usage, request rate, error rate)
  • Logs — timestamped events from your applications and infrastructure
  • Traces — the path a request takes through your system

In this guide, we'll set up Prometheus for metrics collection and Grafana for visualization using Docker Compose. That covers the metrics pillar; logs and traces need their own tooling and are not covered here.

Step 1: Docker Compose Setup

Create a docker-compose.yml:

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
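      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Mount the rules file too: prometheus.yml references it as "alerts.yml",
      # which is resolved relative to /etc/prometheus inside the container
      - ./alerts.yml:/etc/prometheus/alerts.yml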
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=devopspack
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      # Point the filesystem collector at the host root mounted above
      - '--path.rootfs=/rootfs'
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Step 2: Configure Prometheus

Create prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        # Empty for now: alerts are evaluated and shown in the Prometheus UI but not delivered anywhere
        - targets: []

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

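  # Placeholder: replace with your own service, or remove this job.
  # A target that does not exist is reported as down and will fire ServiceDown.
  - job_name: 'your-app'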
    static_configs:
      - targets: ['your-app:8080']
    metrics_path: '/metrics'
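
The your-app:8080 target only resolves if a service with that name runs on the same Compose network as Prometheus. A minimal sketch of such a service for docker-compose.yml, assuming a hypothetical image your-org/your-app that serves Prometheus metrics on port 8080 at /metrics (the image name and tag are placeholders):

```yaml
services:
  your-app:
    image: your-org/your-app:1.0.0   # hypothetical image, replace with your own
    container_name: your-app
    expose:
      - "8080"                       # reachable by Prometheus, not published to the host
    restart: unless-stopped
```

Prometheus scrapes over the internal network, so the port does not need to be published with ports: unless you also want to reach the app from the host.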

Step 3: Create Alerts

Create alerts.yml with three baseline alerts that cover the most common infrastructure failures:

groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage on {{ $labels.instance }} has been above 85% for 5 minutes"

      - alert: LowDiskSpace
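        # avail_bytes excludes blocks reserved for root; tmpfs and overlay mounts are skipped to avoid noise
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Less than 10% disk space remaining on {{ $labels.instance }} ({{ $labels.mountpoint }})"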

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.job }} target {{ $labels.instance }} has been down for more than 1 minute"
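
These rules only fire inside Prometheus: the alerting block in prometheus.yml has an empty targets list, so nothing is delivered anywhere. Delivery is the job of Alertmanager. A minimal sketch of an alertmanager.yml, assuming a hypothetical webhook endpoint; the receiver name and URL are placeholders and not part of the setup in this guide:

```yaml
route:
  receiver: team-webhook          # default receiver for every alert
  group_by: ['alertname', 'job']  # batch related alerts into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: team-webhook
    webhook_configs:
      # Hypothetical endpoint: replace with your chat or paging integration
      - url: 'http://alert-receiver.example.internal:5001/alerts'
```

To use it, you would run the prom/alertmanager image as another Compose service (it listens on port 9093 by default) with this file mounted as its configuration, and change targets: [] in prometheus.yml to targets: ['alertmanager:9093'].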

Step 4: Launch and Connect Grafana

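# Start the stack in the background
docker compose up -d

# Check everything is running
docker compose ps

# Prometheus UI ("open" is macOS-only; use xdg-open on Linux, or paste the URL into a browser)
open http://localhost:9090

# Grafana (login: admin / devopspack)
open http://localhost:3000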

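In Grafana, add Prometheus as a data source: Connections → Data sources → Add data source → Prometheus (older Grafana versions list this under Configuration → Data sources). Set the URL to http://prometheus:9090 and click Save & test. Use the Compose service name rather than localhost here, because Grafana reaches Prometheus over the Compose network, not through your host.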
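
The same data source can also be provisioned from a file, so it survives a rebuild without any clicking. A minimal sketch, assuming you save it as ./grafana/provisioning/datasources/prometheus.yml (a path chosen here for illustration):

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend makes the requests
    url: http://prometheus:9090   # Compose service name, not localhost
    isDefault: true
```

For Grafana to pick it up, the directory has to be mounted into the container, for example by adding - ./grafana/provisioning:/etc/grafana/provisioning to the volumes of the grafana service.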

Step 5: Import a Dashboard

Instead of building dashboards from scratch, import Node Exporter Full, a widely used community dashboard:

  1. Go to Dashboards → New → Import (Dashboards → Import in older Grafana versions)
  2. Enter dashboard ID 1860 (Node Exporter Full) and click Load
  3. Select your Prometheus data source
  4. Click Import

You now have a full system dashboard showing CPU, memory, disk, and network, with new data points arriving at every scrape (every 15 seconds with the configuration above).

Key Metrics to Watch

  • CPU — alert above 85% sustained for 5+ minutes
  • Memory — alert above 90% used
  • Disk — alert below 10% free (and below 5% as critical)
  • HTTP error rate — alert if 5xx responses exceed 1% of traffic
  • Response time (p99) — alert if the 99th percentile exceeds your SLA
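
The alerts.yml from Step 3 covers CPU, disk, and target availability. The remaining three thresholds could be sketched as one more rule group. The memory rule uses standard node-exporter metrics. The two HTTP rules assume your application exports a counter named http_requests_total with a status label and a histogram named http_request_duration_seconds; these are common naming conventions, not something this guide sets up, and the 0.5-second limit is a placeholder for your own SLA:

```yaml
groups:
  - name: key-metrics
    rules:
      - alert: HighMemoryUsage
        # MemAvailable is the kernel's estimate of memory usable without swapping
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage on {{ $labels.instance }} is above 90%"

      - alert: HighHTTPErrorRate
        # Assumption: http_requests_total is a counter with a "status" label
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP error rate"
          description: "More than 1% of requests returned 5xx over the last 5 minutes"

      - alert: HighP99Latency
        # Assumption: http_request_duration_seconds is a histogram; 0.5 stands in for your SLA in seconds
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p99 latency"
          description: "99th percentile response time is above 0.5s"
```

If you append this group to alerts.yml, keep a single top-level groups: key and add key-metrics as a second entry under it.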

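Good monitoring catches problems before your users do. Set it up early, tune your thresholds over time, and make sure your alerts actually reach your team. The prometheus.yml in this guide lists no Alertmanager targets, so until you add one, firing alerts are visible only in the Prometheus UI. A silent alert is no alert at all.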