
Installing a Distributed Monitoring Platform: 3-VM Setup Process

Learn to set up a three-tier system with separate VMs for application, monitoring, and logging, mirroring production infrastructure.


System Architecture

Overview: Three-tier distributed system using separate VMs for application, monitoring, and logging - mimicking production infrastructure.

Why This Architecture:

  • Separation of Concerns: Each VM has a dedicated role (app/monitoring/logging)

  • Scalability: Easy to scale each tier independently

  • Observability Pillars: Covers metrics (Prometheus), logs (ELK), and visualization (Grafana/Kibana)

Setup Flow

Purpose: Build infrastructure from golden image → deploy services → integrate monitoring/logging

Phase 1: VM Foundation
├── Create Golden Image Template (base OS with common tools)
├── Clone VMs (app-vm, monitoring-vm, logging-vm)
├── Fix Hostnames & Machine IDs  ⚠️ [ERROR #1] (prevent duplicate identity)
├── Configure Static IPs (stable addressing for monitoring)
└── Create Users & Permissions (security & access control)

Phase 2: Application VM Setup
├── Install Python & Dependencies (runtime environment)
├── Create Flask Application (web app with metrics/logging)
├── Install Node Exporter (system-level metrics)
└── Install & Configure Filebeat  ⚠️ [ERROR #2, #3] (log shipper)

Phase 3: Monitoring VM Setup
├── Install Prometheus (time-series metrics database)
├── Configure Scrape Targets (collect from app-vm)
├── Install Grafana (visualization & dashboards)
└── Create Dashboards (display metrics)

Phase 4: Logging VM Setup
├── Install Elasticsearch (log storage & search)
├── Install Kibana (log visualization)
├── Configure Data Views (index patterns)
└── Verify Log Ingestion (confirm data flow)

Phase 5: Integration & Testing
├── Test Metrics Collection (Prometheus → Grafana)
├── Test Log Shipping (Filebeat → ELK)
└── Create Comprehensive Dashboards (unified view)

Detailed Setup Steps

PHASE 1: VM Foundation

Goal: Create reusable VM template and properly configure cloned instances with unique identities.

1.1 Create Golden Image Template

Purpose: Single source of truth - install once, clone many times. Ensures consistency across all VMs.

# Install base Ubuntu Server 22.04
sudo apt update && sudo apt upgrade -y
sudo apt install -y qemu-guest-agent curl wget vim htop net-tools
sudo systemctl enable qemu-guest-agent

# Optional: Install Docker (the Ubuntu package is docker.io, not docker)
sudo apt install -y docker.io docker-compose

1.2 Clean VM Before Cloning

Purpose: Remove machine-specific identifiers to prevent conflicts when cloning.

sudo cloud-init clean
sudo truncate -s 0 /etc/machine-id
sudo rm -f /var/lib/dbus/machine-id
sudo poweroff

1.3 Convert to Template in Proxmox UI

  • Right-click VM → Convert to Template

  • Note: Template becomes read-only - cannot boot directly

1.4 Clone VMs

  • Clone from template for: app-vm, monitoring-vm, logging-vm

  • Use Full Clone (recommended for independent VMs)

  • Result: 3 identical VMs that need unique configuration


⚠️ ERROR #1 ENCOUNTERED HERE

Why This Matters: Cloned VMs have identical hostnames and machine-ids, causing:

  • Prometheus to see only 1 node instead of 3

  • Systemd service conflicts

  • Network confusion

1.5 Fix Hostnames & Machine IDs

Purpose: Give each VM unique identity for proper monitoring and logging.

App VM:

sudo hostnamectl set-hostname app-vm
hostnamectl  # Verify

Monitoring VM:

sudo hostnamectl set-hostname monitoring-vm
hostnamectl  # Verify

Logging VM:

sudo hostnamectl set-hostname logging-vm
hostnamectl  # Verify

1.6 Fix /etc/hosts (Each VM)

Purpose: Ensure hostname resolves correctly locally.

sudo nano /etc/hosts

Change:

127.0.1.1 app-server

To:

127.0.1.1 app-vm  # (or monitoring-vm, logging-vm respectively)

1.7 Regenerate machine-id (CRITICAL)

Purpose: Create unique systemd identifier - required for proper journaling and service management. ⚠️ Do NOT manually create IDs - let systemd generate them.

sudo rm -f /etc/machine-id
sudo rm -f /var/lib/dbus/machine-id
sudo systemd-machine-id-setup
cat /etc/machine-id  # Verify unique ID
sudo reboot

After reboot, verify each VM has different machine-id

1.8 Create Common User (All VMs)

Purpose: Standard non-root user for application management and SSH access.

sudo useradd -m -s /bin/bash devops
sudo passwd devops
sudo usermod -aG sudo devops

# Verify
getent passwd devops
id devops

1.9 Set Root Password (Optional)

Purpose: Enable root access for emergency situations (homelab only - disable in production).

sudo passwd root       # setting a password also unlocks the account
sudo passwd -u root    # explicit unlock (no-op if already unlocked)

1.10 Configure Static IPs

Purpose: Fixed IPs are essential for monitoring systems - DHCP changes would break scrape targets and log shipping.

Identify Network Interface:

ip a  # Note interface name (e.g., ens18)

Edit Netplan (Each VM):

sudo nano /etc/netplan/00-installer-config.yaml

App VM (192.168.8.50):

network:
  version: 2
  renderer: networkd
  ethernets:
    ens18:
      dhcp4: no
      addresses:
        - 192.168.8.50/24
      gateway4: 192.168.8.1  # deprecated in newer netplan; "routes: [{to: default, via: 192.168.8.1}]" is the modern form
      nameservers:
        addresses:
          - 8.8.8.8
          - 1.1.1.1

Monitoring VM (192.168.8.60):

addresses:
  - 192.168.8.60/24
# (Everything else same)

Logging VM (192.168.8.70):

addresses:
  - 192.168.8.70/24
# (Everything else same)

Apply Configuration:

sudo netplan apply
ip a  # Verify
ip route  # Verify gateway

Test Connectivity:

ping -c 3 192.168.8.1  # Gateway
ping 192.168.8.60      # Monitoring VM
ping 192.168.8.70      # Logging VM

Expected: All pings successful = network ready

1.11 Update /etc/hosts (All VMs)

Purpose: Enable hostname-based communication between VMs (easier than remembering IPs).

sudo nano /etc/hosts

Add:

192.168.8.50 app-vm
192.168.8.60 monitoring-vm
192.168.8.70 logging-vm

Test: ping monitoring-vm should work from any VM


PHASE 2: Application VM Setup

Goal: Deploy Flask web application with Prometheus metrics export and JSON logging.

2.1 Install Python & Dependencies

Purpose: Python runtime for Flask application.

ssh devops@app-vm
sudo apt update
sudo apt install -y python3 python3-pip python3-venv

2.2 Create Application Directory

Purpose: Isolated virtual environment prevents dependency conflicts.

cd ~
mkdir myapp
cd myapp
python3 -m venv venv
source venv/bin/activate

Result: Shell prompt shows (venv) prefix

2.3 Install Python Packages

pip install flask prometheus-client
pip list  # Verify

2.4 Create Flask Application

Purpose: Web app that:

  • Serves HTTP requests

  • Exposes /metrics for Prometheus

  • Writes JSON logs for ELK

nano app.py

Paste the following code:

from flask import Flask, Response, render_template_string
import time
import random
import logging
import json
from datetime import datetime
from prometheus_client import (
    Counter,
    Histogram,
    generate_latest,
    CONTENT_TYPE_LATEST
)

app = Flask(__name__)

# ----------------------
# JSON Logging Setup
# ----------------------
class JsonFormatter(logging.Formatter):
    def format(self, record):
        log_record = {
            "@timestamp": datetime.utcnow().isoformat(),
            "log.level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name
        }
        # logging merges the `extra` dict into the record as individual
        # attributes (record.extra itself never exists), so collect the
        # known fields one by one
        for key in ("endpoint", "method", "status", "latency_ms"):
            if hasattr(record, key):
                log_record[key] = getattr(record, key)
        return json.dumps(log_record)

handler = logging.FileHandler("/var/log/myapp/app.log")
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("myapp")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# ----------------------
# Prometheus Metrics
# ----------------------
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)

REQUEST_LATENCY = Histogram(
    "http_request_latency_seconds",
    "HTTP request latency in seconds",
    ["endpoint"]
)

# ----------------------
# HTML Template
# ----------------------
HTML_TEMPLATE = """
<!DOCTYPE html>
<html>
<head>
    <title>MyApp - Distributed Monitoring Platform</title>
    <style>
        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }
        html, body {
            height: 100%;
            overflow: hidden;
            font-family: 'Segoe UI', Arial, sans-serif;
        }
        body {
            background: linear-gradient(135deg, #0d47a1 0%, #1976d2 50%, #42a5f5 100%);
            display: flex;
            align-items: center;
            justify-content: center;
        }
        .container {
            width: 95vw;
            height: 95vh;
            background: rgba(227, 242, 253, 0.95);
            border-radius: 15px;
            padding: 2vh 2vw;
            box-shadow: 0 15px 50px rgba(0, 0, 0, 0.3);
            display: flex;
            flex-direction: column;
        }
        header {
            text-align: center;
            padding-bottom: 1.5vh;
            border-bottom: 3px solid #1976d2;
            margin-bottom: 2vh;
        }
        h1 {
            font-size: 2.5vw;
            color: #0d47a1;
            margin-bottom: 0.5vh;
        }
        .tagline {
            font-size: 1.2vw;
            color: #1565c0;
        }
        .main-content {
            flex: 1;
            display: grid;
            grid-template-columns: 1fr 1fr;
            grid-template-rows: auto 1fr;
            gap: 2vh;
            overflow: hidden;
        }
        .section {
            background: #bbdefb;
            padding: 2vh 1.5vw;
            border-radius: 10px;
            border-left: 5px solid #1976d2;
            overflow: auto;
        }
        .section h2 {
            color: #0d47a1;
            font-size: 1.5vw;
            margin-bottom: 1vh;
        }
        .section p, .section li {
            color: #1565c0;
            font-size: 1vw;
            line-height: 1.5;
        }
        .architecture {
            grid-column: 1 / -1;
            background: #90caf9;
        }
        .vm-grid {
            display: grid;
            grid-template-columns: repeat(3, 1fr);
            gap: 1.5vw;
            margin-top: 1vh;
        }
        .vm-box {
            background: #e3f2fd;
            padding: 1.5vh 1vw;
            border-radius: 8px;
            border: 2px solid #1976d2;
            text-align: center;
        }
        .vm-box h3 {
            color: #0d47a1;
            font-size: 1.3vw;
            margin-bottom: 1vh;
        }
        .vm-box p {
            color: #1565c0;
            font-size: 0.9vw;
            margin: 0.5vh 0;
        }
        .vm-icon {
            font-size: 2.5vw;
            margin-bottom: 1vh;
        }
        .api-list {
            list-style: none;
        }
        .api-item {
            background: #e3f2fd;
            padding: 1vh 1vw;
            margin: 0.8vh 0;
            border-radius: 5px;
            border-left: 3px solid #1976d2;
            display: flex;
            justify-content: space-between;
            align-items: center;
        }
        .api-endpoint {
            font-weight: bold;
            color: #0d47a1;
            font-size: 1vw;
        }
        .api-method {
            background: #1976d2;
            color: white;
            padding: 0.3vh 0.8vw;
            border-radius: 3px;
            font-size: 0.8vw;
        }
        .sample-page {
            display: flex;
            flex-direction: column;
            gap: 1vh;
        }
        .sample-card {
            background: #e3f2fd;
            padding: 1vh 1vw;
            border-radius: 5px;
            border-left: 3px solid #1976d2;
        }
        .sample-card h4 {
            color: #0d47a1;
            font-size: 1.1vw;
            margin-bottom: 0.5vh;
        }
        .sample-card p {
            font-size: 0.9vw;
        }
        footer {
            text-align: center;
            padding-top: 1vh;
            border-top: 2px solid #1976d2;
            color: #1565c0;
            font-size: 0.9vw;
            margin-top: 1.5vh;
        }
    </style>
</head>
<body>
    <div class="container">
        <header>
            <h1>MyApp - Distributed Monitoring Platform</h1>
            <p class="tagline">Three-Tier Architecture | Application • Monitoring • Logging</p>
        </header>

        <div class="main-content">
            <div class="section architecture">
                <h2>🏗️ System Architecture</h2>
                <div class="vm-grid">
                    <div class="vm-box">
                        <div class="vm-icon">🖥️</div>
                        <h3>App VM</h3>
                        <p><strong>Role:</strong> Application Server</p>
                        <p><strong>Stack:</strong> Python Flask</p>
                        <p><strong>Port:</strong> 5000</p>
                        <p><strong>Features:</strong> REST APIs, Metrics Export</p>
                    </div>
                    <div class="vm-box">
                        <div class="vm-icon">📊</div>
                        <h3>Monitor VM</h3>
                        <p><strong>Role:</strong> Metrics & Visualization</p>
                        <p><strong>Stack:</strong> Prometheus + Grafana</p>
                        <p><strong>Ports:</strong> 9090, 3000</p>
                        <p><strong>Features:</strong> Time-series DB, Dashboards</p>
                    </div>
                    <div class="vm-box">
                        <div class="vm-icon">📝</div>
                        <h3>Logging VM</h3>
                        <p><strong>Role:</strong> Log Aggregation</p>
                        <p><strong>Stack:</strong> Elasticsearch + Kibana</p>
                        <p><strong>Ports:</strong> 9200, 5601</p>
                        <p><strong>Features:</strong> Log Search, Analysis</p>
                    </div>
                </div>
            </div>

            <div class="section">
                <h2>🔌 Available APIs</h2>
                <ul class="api-list">
                    <li class="api-item">
                        <span class="api-endpoint">/</span>
                        <span class="api-method">GET</span>
                    </li>
                    <li class="api-item">
                        <span class="api-endpoint">/api</span>
                        <span class="api-method">GET</span>
                    </li>
                    <li class="api-item">
                        <span class="api-endpoint">/slow</span>
                        <span class="api-method">GET</span>
                    </li>
                    <li class="api-item">
                        <span class="api-endpoint">/error</span>
                        <span class="api-method">GET</span>
                    </li>
                    <li class="api-item">
                        <span class="api-endpoint">/metrics</span>
                        <span class="api-method">GET</span>
                    </li>
                </ul>
            </div>

            <div class="section sample-page">
                <h2>📄 Sample Page</h2>
                <div class="sample-card">
                    <h4>Application Features</h4>
                    <p>Real-time monitoring with Prometheus metrics collection and Grafana visualization</p>
                </div>
                <div class="sample-card">
                    <h4>Logging System</h4>
                    <p>Centralized log management using Elasticsearch with Kibana dashboards</p>
                </div>
                <div class="sample-card">
                    <h4>Performance Tracking</h4>
                    <p>Request latency, error rates, and throughput metrics tracked across all endpoints</p>
                </div>
                <div class="sample-card">
                    <h4>Distributed Architecture</h4>
                    <p>Scalable three-tier setup with dedicated VMs for app, monitoring, and logging</p>
                </div>
            </div>
        </div>

        <footer>
            <p>🚀 MyApp v1.0 | Powered by Flask • Prometheus • Grafana • Elasticsearch • Kibana | Status: ✅ Running</p>
        </footer>
    </div>
</body>
</html>
"""

# ----------------------
# Routes
# ----------------------
@app.route("/")
def home():
    start_time = time.time()
    REQUEST_COUNT.labels("GET", "/", "200").inc()
    latency = time.time() - start_time
    REQUEST_LATENCY.labels("/").observe(latency)

    logger.info(
        "request_completed",
        extra={
            "endpoint": "/",
            "method": "GET",
            "status": 200,
            "latency_ms": round(latency * 1000, 2)
        }
    )

    return render_template_string(HTML_TEMPLATE)

@app.route("/api")
def api():
    start_time = time.time()
    REQUEST_COUNT.labels("GET", "/api", "200").inc()
    latency = time.time() - start_time
    REQUEST_LATENCY.labels("/api").observe(latency)

    logger.info(
        "request_completed",
        extra={
            "endpoint": "/api",
            "method": "GET",
            "status": 200,
            "latency_ms": round(latency * 1000, 2)
        }
    )

    return "API is running\n"

@app.route("/slow")
def slow():
    delay = random.uniform(1, 4)
    time.sleep(delay)
    REQUEST_COUNT.labels("GET", "/slow", "200").inc()
    REQUEST_LATENCY.labels("/slow").observe(delay)

    logger.warning(
        "slow_request",
        extra={
            "endpoint": "/slow",
            "method": "GET",
            "status": 200,
            "latency_ms": round(delay * 1000, 2)
        }
    )

    return f"Slow response: {delay:.2f}s\n"

@app.route("/error")
def error():
    REQUEST_COUNT.labels("GET", "/error", "500").inc()

    logger.error(
        "application_error",
        extra={
            "endpoint": "/error",
            "method": "GET",
            "status": 500
        }
    )

    return "Error occurred\n", 500

@app.route("/metrics")
def metrics():
    return Response(
        generate_latest(),
        mimetype=CONTENT_TYPE_LATEST
    )

# ----------------------
# Application Entry
# ----------------------
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

App Features:

  • / - Home page with system info

  • /api - Simple API endpoint

  • /slow - Simulates slow requests (1-4s)

  • /error - Returns 500 error

  • /metrics - Prometheus metrics endpoint

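One subtlety in the logging setup: Python's logging module merges the extra dict into the LogRecord as individual attributes, not as a single record.extra attribute, so a JSON formatter must pick the fields off the record one by one. A standalone sketch (not part of app.py) demonstrating the pattern:

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Same pattern as app.py: pull known extra fields off the record."""
    def format(self, record):
        out = {"log.level": record.levelname, "message": record.getMessage()}
        for key in ("endpoint", "status"):
            if hasattr(record, key):
                out[key] = getattr(record, key)
        return json.dumps(out)

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# the `extra` keys land on the LogRecord as attributes
logger.info("request_completed", extra={"endpoint": "/", "status": 200})

line = json.loads(buf.getvalue())
print(line)  # -> {"log.level": "INFO", "message": "request_completed", "endpoint": "/", "status": 200}
```

Writing to an in-memory buffer keeps the sketch self-contained; the app writes the same shape of line to /var/log/myapp/app.log for Filebeat to pick up.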
2.5 Create Log Directory

Purpose: Application needs write permissions for log file.

sudo mkdir -p /var/log/myapp
sudo chown -R devops:devops /var/log/myapp

2.6 Test Application Manually

Purpose: Verify app works before creating systemd service.

python3 app.py

From your laptop:

curl http://192.168.8.50:5000
curl http://192.168.8.50:5000/metrics

On app-vm:

cat /var/log/myapp/app.log  # Verify logs

Expected: HTTP responses and JSON logs being written


⚠️ ERROR #3 ENCOUNTERED HERE

2.7 Create Systemd Service

Purpose: Auto-start the Flask app on boot and keep it running - the production standard, as opposed to running python app.py manually.

sudo nano /etc/systemd/system/myapp.service

Paste:

[Unit]
Description=MyApp Flask Application
After=network.target

[Service]
Type=simple
User=devops
Group=devops
WorkingDirectory=/home/devops/myapp
ExecStart=/home/devops/myapp/venv/bin/python3 /home/devops/myapp/app.py

Restart=always
RestartSec=5

StandardOutput=journal
StandardError=journal
SyslogIdentifier=myapp

NoNewPrivileges=true

[Install]
WantedBy=multi-user.target

Critical Line: ExecStart must point to venv Python, not system Python (see Error #3)

Enable and Start:

sudo systemctl daemon-reload
sudo systemctl enable myapp.service
sudo systemctl start myapp.service
sudo systemctl status myapp.service

Expected: Status shows "active (running)"


2.8 Install Node Exporter

Purpose: Export system metrics (CPU, memory, disk) to Prometheus - app metrics come from Flask, system metrics from Node Exporter.

wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvf node_exporter-1.6.1.linux-amd64.tar.gz
sudo mv node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter

Create Service:

sudo nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Start Service:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
curl http://localhost:9100/metrics  # Verify

Expected: Hundreds of metrics like node_cpu_seconds_total, node_memory_MemAvailable_bytes


⚠️ ERROR #2 ENCOUNTERED HERE

2.9 Install Filebeat

Purpose: Lightweight log shipper - tails log files and sends to Elasticsearch. Part of the Elastic Stack.

Issue: Filebeat not in standard Ubuntu repos - requires Elastic repository.

# Fix apt repositories first
sudo apt install -y apt-transport-https curl gnupg
curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch | \
sudo gpg --dearmor -o /usr/share/keyrings/elastic.gpg
echo "deb [signed-by=/usr/share/keyrings/elastic.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | \
sudo tee /etc/apt/sources.list.d/elastic-8.x.list
sudo apt update
sudo apt install filebeat -y

2.10 Configure Filebeat

Purpose: Tell Filebeat what to read (app.log) and where to send (Elasticsearch on logging-vm).

sudo nano /etc/filebeat/filebeat.yml

Minimal Config:

filebeat.inputs:
- type: log        # deprecated in Filebeat 8.x in favour of filestream, but still works
  enabled: true
  paths:
    - /var/log/myapp/app.log
  fields:
    service: my-python-app
  fields_under_root: true

output.elasticsearch:
  hosts: ["http://192.168.8.70:9200"]

setup.kibana:
  host: "http://192.168.8.70:5601"

Key Points:

  • Input: Monitor /var/log/myapp/app.log

  • Output: Send to Elasticsearch at 192.168.8.70:9200

Start Filebeat:

sudo systemctl enable filebeat
sudo systemctl start filebeat
sudo journalctl -u filebeat -f  # Monitor logs

Expected Output:

  • "Publishing events"

  • "Connection to Elasticsearch established"

  • No "connection refused" errors


PHASE 3: Monitoring VM Setup

Goal: Deploy Prometheus (metrics storage) and Grafana (visualization) to monitor app-vm.

3.1 Install Prometheus

Purpose: Time-series database that pulls metrics from app-vm every 15 seconds. Industry standard for metrics.

ssh devops@monitoring-vm
sudo apt update && sudo apt upgrade -y
sudo useradd --no-create-home --shell /bin/false prometheus

Download and Install:

wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
tar xvf prometheus-2.47.0.linux-amd64.tar.gz
sudo mv prometheus-2.47.0.linux-amd64/prometheus /usr/local/bin/
sudo mv prometheus-2.47.0.linux-amd64/promtool /usr/local/bin/

Create Directories:

sudo mkdir /etc/prometheus
sudo mv prometheus-2.47.0.linux-amd64/consoles /etc/prometheus/
sudo mv prometheus-2.47.0.linux-amd64/console_libraries /etc/prometheus/
sudo mv prometheus-2.47.0.linux-amd64/prometheus.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool

3.2 Configure Prometheus

Purpose: Define scrape targets - tell Prometheus where to collect metrics from.

sudo nano /etc/prometheus/prometheus.yml

Add Scrape Targets:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'flask_app'
    static_configs:
      - targets: ['192.168.8.50:5000']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['192.168.8.50:9100']

What This Does:

  • Every 15s, scrape 192.168.8.50:5000/metrics (Flask app metrics)

  • Every 15s, scrape 192.168.8.50:9100/metrics (system metrics)

  • Store data in time-series database
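What a scrape actually returns is plain text in the Prometheus exposition format: comment lines for HELP/TYPE, then one sample per line. A minimal standalone sketch of reading it (the payload below is illustrative, not captured from the app):

```python
# Parse a Prometheus text-format scrape response into {series: value}
sample = """# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/",status="200"} 100.0
http_requests_total{method="GET",endpoint="/error",status="500"} 20.0
"""

metrics = {}
for line in sample.splitlines():
    if not line or line.startswith("#"):
        continue
    # each sample line is "name{labels} value"
    series, value = line.rsplit(" ", 1)
    metrics[series] = float(value)

print(metrics['http_requests_total{method="GET",endpoint="/error",status="500"}'])  # -> 20.0
```

Prometheus does this parsing on every scrape and appends each sample, stamped with the scrape time, to its time-series database.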

3.3 Create Prometheus Service

sudo nano /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file /etc/prometheus/prometheus.yml \
  --storage.tsdb.path /var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target

Create Storage & Start:

sudo mkdir /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus
sudo systemctl status prometheus

Test: Open http://192.168.8.60:9090

  • Go to Status → Targets

  • Both targets should be "UP"


3.4 Install Grafana

Purpose: Visualization layer on top of Prometheus - creates beautiful dashboards from metrics.

sudo apt update
sudo apt install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://apt.grafana.com/gpg.key | sudo gpg --dearmor -o /etc/apt/keyrings/grafana.gpg
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | \
sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana

Start Grafana:

sudo systemctl start grafana-server
sudo systemctl enable grafana-server
sudo systemctl status grafana-server

Access: http://192.168.8.60:3000

  • Default login: admin/admin

  • You'll be prompted to set new password

3.5 Configure Grafana

1. Add Prometheus Data Source:

  • Settings → Data Sources → Add data source

  • Select Prometheus

  • URL: http://localhost:9090

  • Click Save & Test (should show green checkmark)

2. Import Node Exporter Dashboard:

  • Dashboards → Import

  • Dashboard ID: 1860 (Node Exporter Full)

  • Select Prometheus data source

  • Import

Result: System metrics dashboard (CPU, memory, disk, network) for app-vm

3. Create Custom Flask Dashboard:

Purpose: Monitor application-specific metrics not covered by Node Exporter.

  • Create new dashboard

  • Add panel with queries:

Request Rate:

rate(http_requests_total[1m])

Latency (95th percentile):

histogram_quantile(0.95, sum(rate(http_request_latency_seconds_bucket[5m])) by (le))

Error Rate:

rate(http_requests_total{status="500"}[1m])
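The histogram_quantile query works on cumulative bucket counts: it finds the bucket the requested quantile falls into and linearly interpolates within it. A rough standalone sketch of the idea (a simplification - Prometheus's real implementation has more edge-case handling):

```python
def histogram_quantile(q, buckets):
    """buckets: [(upper_bound, cumulative_count)], ascending, ending
    with float('inf') like Prometheus's +Inf bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound  # quantile falls in the open-ended bucket
            # linear interpolation inside this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 95th percentile: rank 95 of 100 falls in the (0.5, 1.0] bucket
q95 = histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 99), (float('inf'), 100)])
print(round(q95, 4))  # -> 0.7778
```

This is why bucket boundaries matter: the estimate can never be more precise than the width of the bucket the quantile lands in.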

Why Two Dashboards:

  • Dashboard 1860: System health (CPU, RAM, disk)

  • Custom dashboard: App health (requests, latency, errors)


PHASE 4: Logging VM Setup

Goal: Deploy ELK stack (Elasticsearch + Kibana) for centralized log management and analysis.

4.1 Install Elasticsearch

Purpose: Scalable search engine - stores and indexes logs for fast querying. Core of the ELK stack.

ssh devops@logging-vm
sudo apt update && sudo apt upgrade -y

# Add Elastic repo
curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elastic.gpg
echo "deb [signed-by=/usr/share/keyrings/elastic.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | \
sudo tee /etc/apt/sources.list.d/elastic-8.x.list
sudo apt update
sudo apt install elasticsearch -y

4.2 Configure Elasticsearch

Purpose: Make Elasticsearch accessible from network and disable security for homelab simplicity.

sudo nano /etc/elasticsearch/elasticsearch.yml

Set:

cluster.name: my-logging-cluster
node.name: logging-node-1

network.host: 0.0.0.0
http.port: 9200

discovery.type: single-node

xpack.security.enabled: false

Configuration Explained:

  • network.host: 0.0.0.0 - Accept connections from any IP

  • discovery.type: single-node - Not clustering (single VM)

  • xpack.security.enabled: false - Disable auth (⚠️ production should enable)

Start Elasticsearch:

sudo systemctl daemon-reload
sudo systemctl enable elasticsearch
sudo systemctl start elasticsearch
curl http://localhost:9200  # Verify

Expected: JSON response with cluster name, version info


4.3 Install Kibana

Purpose: Web UI for Elasticsearch - search, visualize, and analyze logs through dashboards.

sudo apt install kibana -y

Configure:

sudo nano /etc/kibana/kibana.yml
server.port: 5601
server.host: "0.0.0.0"

elasticsearch.hosts: ["http://localhost:9200"]

Start Kibana:

sudo systemctl enable kibana
sudo systemctl start kibana
sudo systemctl status kibana

Access: http://192.168.8.70:5601

  • Initial load may take 1-2 minutes

4.4 Create Data View in Kibana

Purpose: Tell Kibana which Elasticsearch indices to query. Filebeat creates indices like filebeat-2026.02.07.

  1. Go to Stack Management → Data Views

  2. Click Create data view

  3. Fill in:

    • Name: filebeat-myapp

    • Index pattern: filebeat-* (matches all filebeat indices)

    • Time field: @timestamp

  4. Click Save

Why filebeat-* Pattern:

  • Filebeat creates daily indices: filebeat-2026.02.07, filebeat-2026.02.08, etc.

  • Wildcard * matches all of them

  • New indices auto-included
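The wildcard behaves like a shell glob over index names; a standalone sketch using Python's fnmatch (index names follow the doc's examples):

```python
from fnmatch import fnmatch

# daily indices like the ones Filebeat creates, plus an unrelated index
indices = ["filebeat-2026.02.07", "filebeat-2026.02.08", "metrics-prometheus"]
matched = [name for name in indices if fnmatch(name, "filebeat-*")]
print(matched)  # -> ['filebeat-2026.02.07', 'filebeat-2026.02.08']
```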

4.5 View Logs

Purpose: Verify logs are flowing from app-vm → Filebeat → Elasticsearch → Kibana.

  • Navigate to Discover

  • Select data view: filebeat-myapp

  • You should see JSON logs with fields:

    • @timestamp - When log was created

    • log.level - INFO, WARNING, ERROR

    • message - Log message

    • endpoint - Which API endpoint

    • latency_ms - Request duration

If No Logs Appear:

  • Check Filebeat status on app-vm: sudo systemctl status filebeat

  • Check Elasticsearch indices: curl http://192.168.8.70:9200/_cat/indices?v

  • Look for filebeat-* indices


PHASE 5: Integration & Testing

Goal: Verify complete data flow and create comprehensive monitoring dashboards.

5.1 Test Complete Flow

Purpose: Generate realistic traffic to produce metrics and logs for visualization.

Generate Traffic:

# From your laptop
for i in {1..100}; do curl http://192.168.8.50:5000/; done
for i in {1..50}; do curl http://192.168.8.50:5000/slow; done
for i in {1..20}; do curl http://192.168.8.50:5000/error; done

What This Creates:

  • 100 normal requests → http_requests_total metric increments

  • 50 slow requests → http_request_latency_seconds histogram data

  • 20 errors → ERROR level logs in Kibana

5.2 Verify Metrics in Grafana

Purpose: Confirm Prometheus is scraping and Grafana is displaying metrics.

  • Check http://192.168.8.60:3000

  • Verify dashboards show:

    • Request rate: Should spike during traffic generation

    • Latency percentiles: /slow endpoint shows higher latency

    • Error rate: Spike from /error requests

    • System metrics: CPU/memory usage from Node Exporter (Dashboard 1860)

Queries to Verify:

# In Grafana Explore
rate(http_requests_total[1m])           # Should show recent activity
histogram_quantile(0.95, sum(rate(http_request_latency_seconds_bucket[5m])) by (le))  # Higher for /slow
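rate() turns a monotonically increasing counter into a per-second rate, treating any decrease as a counter reset (e.g. an app restart). A rough standalone sketch of the idea (Prometheus's actual implementation also extrapolates to the window boundaries):

```python
def simple_rate(samples, window):
    """samples: (timestamp, counter_value) pairs inside the window, in order."""
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # counter reset: the new value is the increase since the reset
            increase += value
        else:
            increase += value - prev
        prev = value
    return increase / window

# 30 + 10 (reset at t=30) + 30 = 70 requests over a 60s window
print(round(simple_rate([(0, 100), (15, 130), (30, 10), (45, 40)], 60), 4))  # -> 1.1667
```

This reset handling is why restarting the Flask app does not produce a huge negative spike in the request-rate panels.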

5.3 Verify Logs in Kibana

Purpose: Confirm Filebeat → Elasticsearch → Kibana pipeline is working.

  • Check http://192.168.8.70:5601

  • Go to Discover → Select filebeat-myapp

Search Examples:

log.level: ERROR                    # Find all errors
endpoint: "/slow"                   # Find slow requests
latency_ms > 1000                   # Requests over 1 second

Create Visualizations:

  1. Errors Over Time:

    • Lens → Line chart

    • Filter: log.level : "ERROR"

    • X-axis: @timestamp

  2. Requests by Endpoint:

    • Lens → Bar chart

    • Y-axis: Count

    • X-axis: endpoint.keyword

  3. Latency Distribution:

    • Filter: latency_ms exists

    • Histogram of latency values

Save to Dashboard: Combine visualizations into unified logging dashboard


⚠️ Errors & Solutions Summary

ERROR #1: Duplicate Machine IDs & Hostnames After Cloning

Symptom: All VMs report same hostname and machine-id after cloning from template.

Why This Happens: Proxmox clones EVERYTHING including /etc/machine-id, /etc/hostname, and system identifiers.

Impact:

  • Prometheus sees only 1 node instead of 3 (metrics collision)

  • Systemd services conflict

  • Logs from all VMs appear to come from same source

  • Network confusion in monitoring tools

Root Cause: Machine-specific files were copied during clone operation.

Solution:

# Fix hostname
sudo hostnamectl set-hostname <vm-name>

# Fix /etc/hosts
sudo nano /etc/hosts
# Change 127.0.1.1 to correct hostname

# Regenerate machine-id (CRITICAL - must be systemd-generated)
sudo rm -f /etc/machine-id
sudo rm -f /var/lib/dbus/machine-id
sudo systemd-machine-id-setup
sudo reboot

Verification:

# On each VM, these should be DIFFERENT:
hostnamectl
cat /etc/machine-id

Why This is Critical for Observability:

  • Prometheus distinguishes scrape targets by their instance label, so each VM needs a stable, unique identity

  • Grafana dashboards group metrics by hostname

  • ELK tags each log entry with host.name (and host.id, which is derived from the machine-id)

  • Without unique identities, data from all three VMs collapses into a single apparent source
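
To check all three VMs at once from a single host, a small sketch (assumes SSH key access as the devops user created earlier; the helper itself just counts distinct values):

```shell
# check_unique: prints "unique" if every argument is distinct, else "duplicates".
check_unique() {
  distinct=$(printf '%s\n' "$@" | sort -u | wc -l)
  [ "$distinct" -eq "$#" ] && echo "unique" || echo "duplicates"
}

# Usage against the three VMs (IPs from this guide):
#   check_unique \
#     "$(ssh devops@192.168.8.50 cat /etc/machine-id)" \
#     "$(ssh devops@192.168.8.60 cat /etc/machine-id)" \
#     "$(ssh devops@192.168.8.70 cat /etc/machine-id)"
```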


ERROR #2: Apt Timeout When Installing Filebeat

Symptom:

E: Failed to fetch ...
E: Unable to fetch some archives
Timeout was reached

Why This Happens:

  • Using a slow or blocked regional mirror (here, lk.archive.ubuntu.com)

  • Missing Elastic repository (Filebeat is not in the standard Ubuntu repos)

  • Network routing issues for HTTP/HTTPS

Impact: Cannot install Filebeat, blocking log shipping pipeline.

Root Cause: Two-part problem:

  1. Ubuntu mirrors unreachable/slow

  2. Elastic repo not configured
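
Before editing anything, you can probe each repository host to confirm which leg is failing. A quick sketch:

```shell
# probe: report whether an HTTP(S) endpoint answers within 5 seconds.
probe() {
  if curl -sI --max-time 5 "$1" >/dev/null 2>&1; then
    echo "$1 reachable"
  else
    echo "$1 UNREACHABLE"
  fi
}

probe http://lk.archive.ubuntu.com/ubuntu/      # the slow regional mirror
probe http://archive.ubuntu.com/ubuntu/          # the main mirror
probe https://artifacts.elastic.co/packages/8.x/apt/
```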

Solution:

# Step 1: Fix Ubuntu repositories
sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak
sudo nano /etc/apt/sources.list

# Replace with main Ubuntu mirrors:
deb http://archive.ubuntu.com/ubuntu/ jammy main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu/ jammy-updates main restricted universe multiverse
deb http://security.ubuntu.com/ubuntu jammy-security main restricted universe multiverse

# Step 2: Add Elastic repository
sudo apt install -y apt-transport-https curl gnupg
curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch | \
sudo gpg --dearmor -o /usr/share/keyrings/elastic.gpg
echo "deb [signed-by=/usr/share/keyrings/elastic.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | \
sudo tee /etc/apt/sources.list.d/elastic-8.x.list

# Step 3: Update and install
sudo apt update
sudo apt install filebeat -y

Verification:

filebeat version
# Should show: filebeat version 8.x.x

Why This Matters:

  • Filebeat is the log shipper, a critical component of the ELK pipeline

  • Without it, logs stay local on app-vm

  • No centralized logging = harder debugging in distributed systems


ERROR #3: Flask Service Failed - ModuleNotFoundError

Symptom:

ModuleNotFoundError: No module named 'flask'
systemctl status myapp.service → failed (code=exited, status=1)

Why This Happens: Systemd service points to system Python (/usr/bin/python3) instead of virtual environment Python.

How to Identify:

# Flask installed in venv:
/home/devops/myapp/venv/bin/python3 -c "import flask; print('OK')"
# Returns: OK

# System Python doesn't have Flask:
/usr/bin/python3 -c "import flask"
# Returns: ModuleNotFoundError

Root Cause: Virtual environment isolates dependencies. Systemd service must use venv Python, not system Python.

Incorrect Service File:

ExecStart=/usr/bin/python3 /home/devops/myapp/app.py
# ❌ Uses system Python → no Flask module

Correct Service File:

ExecStart=/home/devops/myapp/venv/bin/python3 /home/devops/myapp/app.py
# ✅ Uses venv Python → Flask available
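
For context, a complete unit file with the corrected ExecStart might look like this (the Description, User, and Restart settings are illustrative; keep whatever the unit created earlier uses and change only the ExecStart path):

```ini
[Unit]
Description=Flask demo application
After=network.target

[Service]
User=devops
WorkingDirectory=/home/devops/myapp
ExecStart=/home/devops/myapp/venv/bin/python3 /home/devops/myapp/app.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```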

Full Fix:

sudo systemctl stop myapp.service
sudo nano /etc/systemd/system/myapp.service
# Update ExecStart line to use venv/bin/python3
sudo systemctl daemon-reload
sudo systemctl start myapp.service
sudo systemctl status myapp.service

Verification:

# Service should show "active (running)"
sudo systemctl status myapp.service

# Test endpoint
curl http://localhost:5000
# Should return HTML response

Why This Matters:

  • Common mistake when deploying Python apps

  • Virtual environments prevent dependency conflicts

  • Production best practice: isolate app dependencies

  • The systemd service must use the same Python environment the app was developed against

Prevention: Always specify full path to venv Python in systemd services.

🎯 Key Endpoints

| Service | VM | URL |
| --- | --- | --- |
| Flask App | app-vm | http://192.168.8.50:5000 |
| Flask Metrics | app-vm | http://192.168.8.50:5000/metrics |
| Node Exporter | app-vm | http://192.168.8.50:9100/metrics |
| Prometheus | monitoring-vm | http://192.168.8.60:9090 |
| Grafana | monitoring-vm | http://192.168.8.60:3000 |
| Elasticsearch | logging-vm | http://192.168.8.70:9200 |
| Kibana | logging-vm | http://192.168.8.70:5601 |
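
A quick smoke test can confirm every endpoint above is answering. A sketch, run from any host that can reach all three VMs (the /-/healthy, /api/health, and /api/status paths are the standard health endpoints of Prometheus, Grafana, and Kibana respectively):

```shell
# check: print the HTTP status each endpoint returns (000 = no answer).
check() {
  code=$(curl -s -o /dev/null --max-time 3 -w '%{http_code}' "$1")
  echo "$1 -> $code"
}

check http://192.168.8.50:5000/
check http://192.168.8.50:5000/metrics
check http://192.168.8.50:9100/metrics
check http://192.168.8.60:9090/-/healthy
check http://192.168.8.60:3000/api/health
check http://192.168.8.70:9200/
check http://192.168.8.70:5601/api/status
```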