Homelab Monitoring Concepts

Monitoring is what tells you a service is broken before users do. A homelab benefits from monitoring at three levels: is the host alive (metrics), what is the host actually doing (logs), and where does a slow user request spend its time (traces). The tooling has converged on a handful of patterns that scale from a single Raspberry Pi to a small datacenter.

The three pillars

Pillar  | What it captures          | Use for
Metrics | Numbers sampled over time | Trends, capacity planning, alerting on thresholds
Logs    | Timestamped events / text | What happened in a specific incident; debugging
Traces  | End-to-end request paths  | Finding bottlenecks across services ("Why is this user request slow?")

For most homelabs, metrics plus logs are enough. Traces matter once you run multiple services and need to diagnose slow requests; you can skip them until then.

The simplest useful monitoring: uptime checks

Before any sophisticated stack, the most-used monitoring tool is the uptime check: every minute, try to reach each service. If it fails, send a notification.

  • Uptime Kuma — self-hosted, web UI, supports HTTP/TCP/ping/DNS/custom checks. Notifies via webhook, email, Telegram, etc.
  • Public services — UptimeRobot, BetterStack, Healthchecks.io. Free tiers cover personal use.

For a homelab with 10-20 services, Uptime Kuma alone covers the most common monitoring need.
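
The uptime-check idea above can be sketched in a few lines of Python. This is only an illustration of the concept, not how Uptime Kuma or any hosted service is implemented; the function names `tcp_check` and `run_checks` are hypothetical, and the probe covers plain TCP reachability only.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def run_checks(targets: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Probe every named target once and return a name -> up/down mapping."""
    return {name: tcp_check(host, port) for name, (host, port) in targets.items()}
```

A scheduler (cron, a systemd timer, or a simple loop with a 60-second sleep) would call `run_checks` with your own service list, for example `{"nas-ssh": ("192.168.1.10", 22)}` (a made-up address), and send a notification for every entry that comes back `False`.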

Metrics: Prometheus and friends

The de facto standard homelab metrics stack:

  • Prometheus — pulls (scrapes) metrics from configured targets at a set interval (15 seconds is a common choice; the default is 1 minute) and stores them in a time-series database.
  • node_exporter — exposes host-level metrics (CPU, memory, disk, network) on each machine.
  • cAdvisor — exposes container-level metrics for Docker / Kubernetes workloads.
  • SNMP exporter / blackbox exporter — for network devices and HTTP endpoints.
  • Grafana — dashboards on top of Prometheus data.
  • Alertmanager — handles alert routing and notification.

Each piece is one container. The full stack runs comfortably on a Raspberry Pi 4 for a small homelab.
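
To make the exporter idea concrete, here is a minimal sketch of what an exporter does: collect a few numbers and serve them as plain text on /metrics in the Prometheus text exposition format. It is not node_exporter; the `demo_` metric names, `render_metrics`, `collect_host_samples`, and `MetricsHandler` are all hypothetical, and only gauges without labels are handled.

```python
import os
import shutil
from http.server import BaseHTTPRequestHandler


def render_metrics(samples: dict[str, float]) -> str:
    """Render gauge samples in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(samples.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect_host_samples(path: str = "/") -> dict[str, float]:
    """Collect a few host-level numbers using only the standard library."""
    usage = shutil.disk_usage(path)
    samples = {
        "demo_disk_total_bytes": float(usage.total),
        "demo_disk_free_bytes": float(usage.free),
    }
    if hasattr(os, "getloadavg"):  # not available on Windows
        samples["demo_load1"] = os.getloadavg()[0]
    return samples


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current samples on /metrics for a scraper to pull."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect_host_samples()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it, run `HTTPServer(("", 9200), MetricsHandler).serve_forever()` with `HTTPServer` imported from `http.server` (the port is an arbitrary choice) and point a scrape target at it. In practice you would use node_exporter or the official Prometheus client library rather than hand-rolling the format.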

Pull vs push for metrics

Model | Examples                     | Best for
Pull  | Prometheus                   | Long-running services with stable endpoints
Push  | Telegraf with InfluxDB, StatsD | Short-lived processes; high-cardinality events

Prometheus also supports push for short-lived jobs via the Pushgateway, but the primary model is pull.
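
As a rough illustration of the push model, the sketch below formats a metric in the StatsD line protocol (`name:value|type`) and sends it over UDP. The helper names `statsd_line` and `push_metric` and the metric names in the usage note are hypothetical; 8125 is the conventional StatsD port, but check what your collector actually listens on.

```python
import socket


def statsd_line(name: str, value: float, kind: str = "c") -> str:
    """Format one StatsD datagram: <name>:<value>|<type>."""
    if kind not in {"c", "g", "ms"}:  # counter, gauge, timer
        raise ValueError(f"unsupported StatsD type: {kind}")
    return f"{name}:{value}|{kind}"


def push_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; the sender never learns whether it arrived."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

A nightly backup script could end with `push_metric(statsd_line("backup.runs", 1))`. The job exits immediately afterwards, which is exactly the case a pull-based scraper would miss.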

Logs: Loki and alternatives

For log aggregation:

  • Grafana Loki — Prometheus-style log aggregation. Indexes labels rather than full text, which keeps it lightweight.
  • Elastic Stack (ELK) — Elasticsearch + Logstash + Kibana. Full-text indexing; powerful but resource-intensive.
  • Graylog — Java-based; similar capabilities to ELK.
  • Plain syslog server — for lightweight setups; logs go to text files; grep when needed.

Loki is the modern homelab default — it integrates with Grafana naturally and has much lower resource requirements than the ELK stack.
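
The difference between indexing labels and indexing full text can be shown with a toy model. The class below is a conceptual sketch only, not Loki's storage engine or API: `LabelIndexedStore` and its methods are hypothetical. Log lines are grouped into streams by their label set, a query first selects streams by label (cheap), and only then scans the text of the selected lines.

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy log store: only label sets are indexed; line text is scanned on query."""

    def __init__(self) -> None:
        # Maps a frozen label set to the raw lines of that stream.
        self._streams: dict[frozenset, list[str]] = defaultdict(list)

    def append(self, labels: dict[str, str], line: str) -> None:
        """Store a line under the stream identified by its labels."""
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict[str, str], contains: str = "") -> list[str]:
        """Select streams by label, then filter their lines by substring."""
        wanted = set(selector.items())
        hits: list[str] = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:  # label match narrows the search
                hits.extend(line for line in lines if contains in line)
        return hits
```

The index stays small because it grows with the number of distinct label sets, not with the number of words in the logs. The trade-off is that a text search over a broad label selector has to scan many lines.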

Traces: skip unless needed

Tracing tools (Jaeger, Tempo, OpenTelemetry collector) are useful when you have multiple services and need to follow a single request across them. For a homelab with a few self-hosted apps, the value rarely justifies the setup cost. Skip tracing until you have a "this request is slow but I can't tell which service" problem.

What to monitor on the host

  • CPU usage — overall + per-core.
  • Memory — used, available, swap.
  • Disk space — per filesystem, with alerts at 80% and 90% full.
  • Disk I/O — IOPS, throughput, queue depth.
  • Network — RX/TX bytes per interface, packet errors.
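  • Load average — the 1-, 5-, and 15-minute averages of runnable tasks (on Linux, also tasks in uninterruptible I/O wait); read it relative to the number of CPU cores.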
  • Temperatures and fan speeds — especially for homelabs with consumer hardware.
  • UPS battery state — if a UPS protects the homelab.
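
Two of the host checks above, disk space thresholds and load average, can be sketched with the standard library alone. The function names are hypothetical, the 80% and 90% defaults simply mirror the thresholds mentioned in the list, and `load_per_core` works only on Unix-like systems.

```python
import os
import shutil


def disk_percent_used(path: str = "/") -> float:
    """Percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_severity(percent_used: float, warn: float = 80.0, crit: float = 90.0) -> str:
    """Map a usage percentage onto ok / warning / critical."""
    if percent_used >= crit:
        return "critical"
    if percent_used >= warn:
        return "warning"
    return "ok"


def load_per_core() -> float:
    """1-minute load average divided by the core count (Unix only)."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)
```

Calling `disk_severity(disk_percent_used("/"))` for each mounted filesystem gives a simple per-filesystem status. A `load_per_core()` value that stays well above 1.0 suggests the host has more runnable work than cores.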

What to monitor per service

  • Process running? — basic liveness.
  • Responding to HTTP / TCP probe? — basic functional check.
  • Request latency — p50, p95, p99 if the service exposes the metric.
  • Error rate — 5xx responses, exceptions.
  • Throughput / queue depth — for background workers.
  • Application-specific — Plex transcode count, Home Assistant unavailable entities, etc.
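
For services that expose only raw request data, the latency percentiles and error rate in the list above can be computed directly. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions; monitoring systems often estimate percentiles from histograms instead. The function names are hypothetical.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def error_rate(status_codes: list[int]) -> float:
    """Fraction of responses with a 5xx status code."""
    if not status_codes:
        return 0.0
    server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return server_errors / len(status_codes)
```

With ten latency samples of 80, 85, 90, 95, 100, 105, 110, 115, 120, and 900 ms, `percentile(samples, 50)` is 100 while `percentile(samples, 95)` is 900. The median looks healthy; the tail shows the one request a user actually complained about.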

Alerting discipline

The biggest mistake in homelab monitoring is over-alerting. A few rules keep it in check:

  • Alert on symptoms, not causes. "Service is returning 500s" not "memory is high."
  • Use thresholds with hysteresis. Fire when the value crosses the threshold; clear only once it has dropped back below it by some margin, so a value hovering at the threshold does not flap.
  • Require sustained breach. "5 minutes of 5xx rate above 1%" not "any 5xx in the last minute."
  • Different severity, different channels. Critical → push notification; warning → daily summary email.
  • Quiet hours. Many alerts can wait until morning; routing rules handle this.
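
The hysteresis and sustained-breach rules above combine naturally into a small state machine. The following is a minimal sketch under assumed names (`ThresholdAlert`, `observe`); real alerting systems express the same idea declaratively, for example as a hold duration on an alert rule.

```python
from dataclasses import dataclass


@dataclass
class ThresholdAlert:
    fire_above: float      # a sample above this counts as a breach
    clear_below: float     # once firing, clear only when below this (hysteresis)
    sustain_samples: int   # consecutive breaches required before firing
    firing: bool = False
    _breaches: int = 0

    def observe(self, value: float) -> bool:
        """Feed one sample; return whether the alert is firing afterwards."""
        if self.firing:
            if value < self.clear_below:
                self.firing = False
                self._breaches = 0
        else:
            # Any non-breaching sample resets the streak.
            self._breaches = self._breaches + 1 if value > self.fire_above else 0
            if self._breaches >= self.sustain_samples:
                self.firing = True
        return self.firing
```

With one sample per minute, `ThresholdAlert(fire_above=1.0, clear_below=0.5, sustain_samples=5)` approximates "5 minutes of 5xx rate above 1%": a single bad minute never fires, and once firing the alert stays up until the rate drops below 0.5%.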

Retention

  • Metrics: 30-90 days at full resolution. Downsample for longer retention if needed (5-minute averages can extend back years cheaply).
  • Logs: 14-30 days for application logs. 90+ days for security logs (auth attempts, audit).
  • Traces: 7-14 days typically. High cardinality; large storage.

Retention costs storage. For a homelab, pick what you actually look back at.
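
The saving from downsampling is easy to quantify. The sketch below averages raw samples into fixed-width buckets and counts samples per day at a given interval. It illustrates the arithmetic only and is not how any particular time-series database implements downsampling; the function names are hypothetical.

```python
from collections import defaultdict


def downsample(
    samples: list[tuple[int, float]], bucket_seconds: int = 300
) -> list[tuple[int, float]]:
    """Average (unix_timestamp, value) samples into buckets keyed by bucket start."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


def samples_per_day(interval_seconds: int) -> int:
    """Number of samples one series produces per day at the given interval."""
    return 86_400 // interval_seconds
```

At a 15-second interval one series produces 5,760 samples per day; as 5-minute averages it produces 288, a 20-fold reduction. The cost is that short spikes are smoothed away, which is why recent data is usually kept at full resolution.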

Dashboards: useful patterns

  • Overview dashboard — green/yellow/red status of every service at a glance. The "is everything OK?" view.
  • Per-host dashboard — CPU/memory/disk/network for one host. Use during incident investigation.
  • Per-service dashboard — latency, error rate, queue depth for one service. Use during service-specific issues.
  • Capacity dashboard — storage growth, RAM trend, request volume over months. Use for planning.

Monitoring the monitoring

If Prometheus goes down, you stop getting metrics — and you might not notice until you go to check something. Mitigation:

  • External uptime check on the monitoring stack itself (Healthchecks.io or similar).
  • An always-firing heartbeat alert (a dead man's switch) routed through Alertmanager to an external receiver; if the heartbeats stop arriving, you know something's wrong.
  • Run monitoring on different hardware from the workloads it monitors so they don't fail together.
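
The heartbeat approach inverts normal alerting: silence is the alarm. A minimal sketch of the receiving side follows, under assumed names (`HeartbeatWatcher`, `beat`, `is_stale`). It has to run somewhere that does not share a failure domain with the monitoring stack, which is the role a hosted service such as Healthchecks.io plays.

```python
import time


class HeartbeatWatcher:
    """Dead man's switch: reports stale when expected heartbeats stop arriving."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic) -> None:
        self._timeout = timeout_seconds
        self._clock = clock            # injectable so the logic can be tested
        self._last_seen = clock()

    def beat(self) -> None:
        """Record that a heartbeat just arrived."""
        self._last_seen = self._clock()

    def is_stale(self) -> bool:
        """True when no heartbeat has arrived within the timeout."""
        return self._clock() - self._last_seen > self._timeout
```

The monitoring stack calls `beat()` (for instance via a webhook) every minute; a separate loop checks `is_stale()` and notifies you when it turns true. Setting the timeout to two or three heartbeat intervals avoids false alarms from a single delayed beat.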

Frequently Asked Questions

What are the three pillars of observability?

Metrics, logs, and traces. Metrics are numeric measurements sampled over time — CPU, memory, response time. Logs are timestamped text events — what happened, when. Traces show the path of a single request through multiple services. Each pillar answers different questions; mature monitoring uses all three.

What is the difference between pull and push collection?

Pull-based: the monitoring system scrapes metrics from each target periodically. Prometheus is the canonical example: at each scrape interval (commonly 15 seconds) it asks each endpoint for its current metrics. Push-based: each target sends metrics to a collector when it has them. InfluxDB and StatsD work this way. Pull is simpler to reason about and detects target downtime naturally; push works better for short-lived processes that can't be scraped.

What should I alert on in a homelab?

User-facing problems: services down, certs expiring, disks filling, backups failing. Don't alert on routine events: high CPU during expected scheduled jobs, single brief spikes, transient log warnings. The goal is "when paged, something needs attention from you" — not "a number went up and the system reported it." Bad alerting trains you to ignore alerts.

How long should I retain monitoring data?

For metrics, 30-90 days of full-resolution data is usually enough; longer retention can be downsampled (5-minute average instead of 15-second samples) to save storage. For logs, 14-30 days for application logs; longer for security-relevant logs (auth attempts, audit). Retention costs grow linearly with volume; pick what you can actually use.

What is the simplest useful homelab monitoring stack?

Uptime Kuma alone covers "is each service responding to HTTP/TCP/ping?" — answers the most common question with no agent installs. For more depth: Prometheus + Grafana + node_exporter (host metrics) + cAdvisor (container metrics) + a Loki or Grafana Cloud Logs setup for logs. Both stacks are containerized and run with minimal effort.
