Build a Self-Healing GitOps Platform: GKE, Argo CD & n8n

Repository Link: https://github.com/sachindumalshan/gitops-repo.git

Architecture Overview

Here is the system-level architecture.

Google Cloud (GKE)
│
├── Kubernetes Cluster (gitops-cluster)
│   ├── default namespace
│   │   ├──── service-a (Device Service)
│   │   │     ├─ app.py
│   │   │     ├─ deployment.yaml
│   │   │     ├─ hpa.yaml
│   │   │     ├─ Dockerfile
│   │   │     └─ service.yaml
│   │   ├──── service-b (Sensor Service)
│   │   │     ├─ app.py
│   │   │     ├─ deployment.yaml
│   │   │     ├─ hpa.yaml
│   │   │     ├─ Dockerfile
│   │   │     └─ service.yaml
│   │   └──── service-c (Alert Service)
│   │         ├─ app.py
│   │         ├─ deployment.yaml
│   │         ├─ hpa.yaml
│   │         ├─ Dockerfile
│   │         └─ service.yaml
│   │
│   ├── argocd namespace
│   │   ├── argocd-server
│   │   ├── argocd-repo-server
│   │   ├── argocd-application-controller
│   │   └── argocd-dex-server
│   │
│   └── automation namespace
│       └── n8n
│
├── Google Artifact Registry
│
└── GitHub (Source of Truth)

End-to-End Flow

Developer pushes code → GitHub
        ↓
Argo CD detects change
        ↓
Argo CD syncs to GKE
        ↓
Pods deployed / updated
        ↓
Kubernetes handles:
    - Self-healing
    - Auto-scaling (HPA)
        ↓
n8n monitors services
        ↓
Slack alerts if failure

PHASE 1 - Install gcloud CLI & Setup GKE

Update System

sudo apt-get update

Install required packages:

sudo apt-get install apt-transport-https ca-certificates gnupg curl

Import Google Cloud Public Key

For newer distributions:

curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg

For older distributions:

curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -

If unsupported:

curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

Add Repository

Newer systems:

echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list

Older systems:

echo "deb https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list

Install gcloud

sudo apt-get update && sudo apt-get install google-cloud-cli

gcloud auth login
gcloud projects list
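gcloud config set project gitops-self-healing-7687

Replace gitops-self-healing-7687 with your own project ID. Later commands in this guide write the project ID as gitops-self-healing; substitute the same ID you set here.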

Enable APIs:

gcloud services enable \
  container.googleapis.com \
  compute.googleapis.com \
  cloudbuild.googleapis.com \
  artifactregistry.googleapis.com

🟢 Create GKE Cluster

gcloud container clusters create gitops-cluster \
  --zone asia-south1-a \
  --num-nodes 2 \
  --machine-type e2-medium

Get credentials:

gcloud container clusters get-credentials gitops-cluster \
  --zone asia-south1-a

❗ ERROR 1 — GKE Auth Plugin Missing

If kubectl fails to authenticate, or reports that gke-gcloud-auth-plugin is missing, install the plugin:

sudo apt-get install google-cloud-cli-gke-gcloud-auth-plugin

Verify:

which gke-gcloud-auth-plugin

Refresh kubeconfig:

gcloud container clusters get-credentials gitops-cluster \
  --zone asia-south1-a

Test:

kubectl get nodes

PHASE 2 - Build & Push Python Microservice

Sample Flask App: app.py

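from flask import Flask
import os

app = Flask(__name__)

# Health endpoint used by the Kubernetes liveness and readiness probes.
# Setting the FAIL environment variable to "true" simulates an unhealthy service.
@app.route("/health")
def health():
    if os.getenv("FAIL") == "true":
        return "FAIL", 500
    return "OK", 200

@app.route("/")
def home():
    return "Service A Running", 200

# Start the server only when run directly (python app.py), not on import.
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)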

Dockerfile

FROM python:3.11-slim
WORKDIR /app
RUN pip install flask
COPY app.py .
CMD ["python", "app.py"]

Create Artifact Registry

gcloud artifacts repositories create docker-repo \
  --repository-format=docker \
  --location=asia-south1

Configure Docker:

gcloud auth configure-docker asia-south1-docker.pkg.dev

Build & push:

docker build -t asia-south1-docker.pkg.dev/gitops-self-healing/docker-repo/service-a:v1 .
docker push asia-south1-docker.pkg.dev/gitops-self-healing/docker-repo/service-a:v1
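
The image name used above follows Artifact Registry's LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG layout. As a minimal sketch (the helper name image_ref is hypothetical and not part of any Google tool), composing the reference in one place makes it harder to mistype the project ID between the build, push, and Deployment steps:

```python
def image_ref(location: str, project: str, repo: str, image: str, tag: str) -> str:
    """Compose an Artifact Registry Docker image reference.

    Hypothetical helper for illustration; it only joins the parts
    and rejects empty components.
    """
    parts = {"location": location, "project": project, "repo": repo,
             "image": image, "tag": tag}
    for label, value in parts.items():
        if not value:
            raise ValueError(f"{label} must not be empty")
    return f"{location}-docker.pkg.dev/{project}/{repo}/{image}:{tag}"


if __name__ == "__main__":
    # Values taken from the commands in this guide; replace with your own.
    print(image_ref("asia-south1", "gitops-self-healing", "docker-repo", "service-a", "v1"))
```

The same string is what goes into the image field of deployment.yaml later on.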

❗ ERROR 2 — ImagePullBackOff

Cause: GKE nodes don’t have permission to pull images.

Get project number:

gcloud projects describe gitops-self-healing \
  --format="value(projectNumber)"

Grant permission:

gcloud projects add-iam-policy-binding gitops-self-healing \
  --member="serviceAccount:<PROJECT_NUMBER>-compute@developer.gserviceaccount.com" \
  --role="roles/artifactregistry.reader"

Restart pod:

kubectl delete pod <pod-name>   # for example: service-a-595cc8c965-tlmrh
kubectl get pods -w

PHASE 3 - Kubernetes Self-Healing

Deployment with probes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: service-a
  template:
    metadata:
      labels:
        app: service-a
    spec:
      containers:
      - name: service-a
        image: asia-south1-docker.pkg.dev/gitops-self-healing/docker-repo/service-a:v1
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5

Apply:

kubectl apply -f deployment.yaml

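Test:

kubectl set env deployment/service-a FAIL=true
kubectl get pods -w

🔥 The change rolls out a new pod whose /health endpoint returns 500. Kubernetes keeps that pod out of service (readiness probe) and restarts its container after repeated liveness failures. Remove the variable to recover:

kubectl set env deployment/service-a FAIL-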
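
To make the restart behaviour concrete, here is a simplified model of how consecutive liveness failures lead to a restart. It is a sketch only: the function name is hypothetical, and it ignores initialDelaySeconds, probe timeouts, and restart back-off. The Deployment above does not set failureThreshold, so the Kubernetes default of 3 is assumed.

```python
def count_restarts(probe_results, failure_threshold=3):
    """Count container restarts for a sequence of liveness probe outcomes.

    probe_results: iterable of booleans, True meaning the probe succeeded.
    A restart happens once failure_threshold probes fail in a row.
    """
    restarts = 0
    consecutive_failures = 0
    for healthy in probe_results:
        if healthy:
            # Any success resets the failure streak.
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restarts += 1
            # The restarted container begins with a clean streak.
            consecutive_failures = 0
    return restarts
```

With FAIL=true every probe fails, so in this model each run of three failed probes produces one restart, which matches the climbing RESTARTS column in kubectl get pods.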

PHASE 4 - Auto Scaling (HPA)

Check CPU and memory metrics:

kubectl top pods

HPA File:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: service-a-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: service-a
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

Apply HPA:

kubectl apply -f hpa.yaml
kubectl get hpa -w

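The HPA reads CPU usage from Metrics Server and compares it with each pod's CPU request, so the containers need a CPU request (set in deployment.yaml or supplied by a namespace LimitRange default) for the 50% utilization target to be computed.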
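
The scaling decision itself follows the formula given in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies it with the values from hpa.yaml (target 50%, 1 to 5 replicas). The function name is hypothetical, the 10% tolerance is the documented default, and stabilization windows, missing metrics, and not-yet-ready pods are deliberately left out.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization=50,
                     min_replicas=1, max_replicas=5, tolerance=0.1):
    """Approximate the HPA replica calculation for a CPU utilization target."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band around 1.0 the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas / maxReplicas bounds from the HPA spec.
    return max(min_replicas, min(max_replicas, desired))
```

For example, one pod running at 100% of its CPU request against a 50% target gives a ratio of 2, so the HPA asks for two pods; a very high reading is capped at maxReplicas.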

PHASE 5 - Push to GitHub

Upload the files to the GitHub repository:

git init
git add .
git commit -m "Initial commit"
git branch -M main
git remote add origin https://github.com/<your-username>/gitops-repo.git
git push -u origin main

PHASE 6 - Install Argo CD (GitOps Engine)

Create namespace:

kubectl create namespace argocd

Install:

kubectl apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

Verify:

kubectl get pods -n argocd

How to access the Argo CD UI:

Option 1 (Learning / Local / GKE lab)

# Port-forward
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Access
https://localhost:8080

Option 2 — Change Service Type to LoadBalancer (Direct Public IP)

Instead of port-forward, expose Argo CD externally.

# Check service:
kubectl get svc argocd-server -n argocd

# The type is probably:
ClusterIP

# Patch it:
kubectl patch svc argocd-server -n argocd \
  -p '{"spec": {"type": "LoadBalancer"}}'

# Now check:
kubectl get svc argocd-server -n argocd

# Access:
https://<external-ip>

Option 3 — Use Ingress (Recommended for Real DevOps Setup)

Instead of exposing service directly, use an Ingress.

This allows:

Domain name
HTTPS with TLS
Multiple apps behind one Load Balancer

In GKE, Ingress uses: Google Cloud Load Balancing

Example Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: argocd-ingress
  namespace: argocd
spec:
  rules:
  - host: argocd.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: argocd-server
            port:
              number: 443

Get Admin Password

kubectl get secret argocd-initial-admin-secret \
  -n argocd \
  -o jsonpath="{.data.password}" | base64 -d

Username: admin
Password: decoded value
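
Kubernetes stores Secret values base64-encoded, not encrypted, which is why the command above pipes the value through base64 -d. As a small illustration (the function name is hypothetical), the same decoding step in Python looks like this:

```python
import base64


def decode_secret_value(encoded: str) -> str:
    """Decode a base64-encoded Secret value, as `base64 -d` does on the shell."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    # "aGVsbG8=" is simply the base64 form of "hello", used here as a stand-in
    # for the value read from argocd-initial-admin-secret.
    print(decode_secret_value("aGVsbG8="))
```

Anyone who can read the Secret can recover the password, so treat access to the argocd namespace accordingly.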

GitOps Application Creation

Sync Policy: Automatic
Auto-Prune: Enabled
Self-Heal: Enabled

Argo CD now continuously reconciles cluster state with Git.

## STEP 1️⃣ What you see after login (IMPORTANT)
You’ll see:
- Empty dashboard
- No applications yet

## STEP 2️⃣ Create your FIRST GitOps Application

Click: + NEW APP

Fill like this:
- Application Name: `service-a`
- Project: `default`
- Sync Policy: `Automatic`
    - ✅ Automatic
    - ✅ Auto-Prune        
    - ✅ Self-Heal        
- Repository URL:
  https://github.com/<your-username>/gitops-repo
- Revision:
  `main`
- Path:
  `service-a`

📌 This path must contain:
- `deployment.yaml`    
- `service.yaml`    
- `hpa.yaml`
    
### Destination
- Cluster URL:
  `https://kubernetes.default.svc`
- Namespace:
  `default`

Click: CREATE

## STEP 3️⃣ What happens immediately after clicking CREATE

Behind the scenes:
1.  Argo CD pulls Git repo
2.  Reads YAML files    
3.  Compares with live cluster    
4.  Applies manifests   
5.  Shows app as **Healthy / Synced**
    
You’ll see:
- Green boxes
- Pod creation in real time

## STEP 4️⃣ Verify from terminal (important habit)
kubectl get pods
kubectl get svc
kubectl get hpa

## STEP 5️⃣ Prove SELF-HEALING

# Break something manually
kubectl delete pod -l app=service-a

Result:
- Pod deleted
- Deployment recreates pod
- Argo CD remains synced

# Create configuration drift
kubectl scale deployment service-a --replicas=5

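Watch the Argo CD UI: with Self-Heal enabled, it detects the drift and reverts replicas to the value declared in Git. Note that hpa.yaml also manages this Deployment's replica count, so Argo CD and the HPA can both change replicas; a common approach is to omit spec.replicas from deployment.yaml once an HPA is in place.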
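
Conceptually, self-healing is a compare-and-correct loop: take the state declared in Git, compare it with what is live, and overwrite whatever differs. The sketch below models that idea with plain dictionaries. It is not Argo CD's implementation (Argo CD diffs full manifests), and the function names are hypothetical.

```python
def diff_state(desired: dict, live: dict) -> dict:
    """Return the fields whose live value differs from the value declared in Git."""
    return {
        key: {"git": value, "live": live.get(key)}
        for key, value in desired.items()
        if live.get(key) != value
    }


def self_heal(desired: dict, live: dict):
    """Return (healed_state, drift): live state with drifted fields reset to Git's values."""
    drift = diff_state(desired, live)
    healed = dict(live)  # copy so the caller's live state is not mutated
    for key in drift:
        healed[key] = desired[key]
    return healed, drift


if __name__ == "__main__":
    git_state = {"replicas": 1, "image": "service-a:v1"}
    cluster_state = {"replicas": 5, "image": "service-a:v1"}  # after kubectl scale
    print(self_heal(git_state, cluster_state))
```

In this model the manual kubectl scale shows up as a single drifted field, replicas, which is reset to the Git value while the untouched image field is left alone.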

❗ ERROR 3 — Application Not Syncing

Common causes:

Wrong repo path
Wrong branch
Missing YAML files
Incorrect namespace

Fix the path, branch, or namespace in the application settings, then sync again.
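
One quick way to rule out the "missing YAML files" cause is to check the service directory in a local clone before pushing. This is a minimal sketch: the function name is hypothetical, and the file list is the one this guide expects in each service path.

```python
from pathlib import Path

# Manifests this guide expects in each service directory.
REQUIRED_MANIFESTS = ("deployment.yaml", "service.yaml", "hpa.yaml")


def missing_manifests(app_path, required=REQUIRED_MANIFESTS):
    """Return the required manifest file names that are absent from app_path."""
    base = Path(app_path)
    return [name for name in required if not (base / name).is_file()]


if __name__ == "__main__":
    # Run from the root of the cloned repository.
    print(missing_manifests("service-a"))
```

An empty list means the path contains everything Argo CD is expected to apply; any names returned point at files to add or at a wrong Path value.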

PHASE 7 - Expose Service via LoadBalancer

Update Service YAML:

apiVersion: v1
kind: Service
metadata:
  name: service-a
  namespace: default
spec:
  selector:
    app: service-a   # matches your pod labels
  ports:
    - protocol: TCP
      port: 80       # the port clients use to access
      targetPort: 8080  # the port your container listens on (matches containerPort)
  type: LoadBalancer   # gives an external IP in GKE

Commit & push:

git add service-a/
git commit -m "Expose service-a via LoadBalancer"
git push origin main

Check external IP:

kubectl get svc service-a

# Access
http://<EXTERNAL-IP>

PHASE 8 - Multi-Service IoT Micro-services

Implement three services to see how they communicate inside the cluster, then add each one to Argo CD.

service-a	Device Service
service-b	Sensor Service
service-c	Alert Service

Following the same steps used for service-a, create one Argo CD application per service, three in total, each with its Path set to that service's directory in the repository.

PHASE 9 - Setup n8n

Why Add n8n?

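Kubernetes and Argo CD already handle most recovery on their own: restarting failed containers, scaling under load, and reverting drift.

As configured in this guide, they do not: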

Send Slack alerts
Trigger email notifications
Execute external Git rollback
Call external APIs
Run conditional business logic

That’s where n8n comes in.

Why Install n8n Inside Kubernetes?

You could run n8n in several places. Here it runs inside Kubernetes.

We choose Kubernetes because:

Same cluster access
Easy internal DNS communication
Scalable
Production-style deployment
Strong DevOps portfolio value

Step-by-Step — Install n8n in Kubernetes

1️⃣ Create Namespace

kubectl create namespace automation

Verify:

kubectl get ns

2️⃣ Create n8n Deployment

Create file: n8n-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: n8n
  namespace: automation
spec:
  replicas: 1
  selector:
    matchLabels:
      app: n8n
  template:
    metadata:
      labels:
        app: n8n
    spec:
      containers:
      - name: n8n
        image: n8nio/n8n:latest
        ports:
        - containerPort: 5678
        env:
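        # Demo credentials only; keep real credentials in a Kubernetes Secret.
        # Recent n8n releases may ignore the N8N_BASIC_AUTH_* variables and
        # ask you to create an owner account on first login instead.
        - name: N8N_BASIC_AUTH_ACTIVE
          value: "true"
        - name: N8N_BASIC_AUTH_USER
          value: "admin"
        - name: N8N_BASIC_AUTH_PASSWORD
          value: "admin123"
        - name: N8N_SECURE_COOKIE
          value: "false"
        # External address n8n uses for editor and webhook URLs; replace the IP with your own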
        - name: N8N_EDITOR_BASE_URL
          value: "http://35.210.234.209"
        - name: WEBHOOK_URL
          value: "http://35.210.234.209/"
        - name: N8N_HOST
          value: "35.210.234.209"
        - name: N8N_PROTOCOL
          value: "http"

Apply:

kubectl apply -f n8n-deployment.yaml

3️⃣ Create Service (NodePort)

Create file: n8n-service.yaml

apiVersion: v1
kind: Service
metadata:
  name: n8n
  namespace: automation
spec:
  type: NodePort
  selector:
    app: n8n
  ports:
  - port: 5678
    targetPort: 5678

Apply:

kubectl apply -f n8n-service.yaml

Verify:

kubectl get pods -n automation
kubectl get svc -n automation

Access n8n:

http://<node-ip>:<nodeport>

❗ ERROR — n8n Not Accessible

Possible causes:

Firewall blocking NodePort
Wrong external IP
Pod not running
Service type incorrect

Check:

kubectl describe pod -n automation
kubectl describe svc n8n -n automation

🟢 Build n8n Workflow — Service Health Monitor

[Schedule Trigger] every 1 min
        ↓
[HTTP Request] service-a /live
        ↓
[HTTP Request] service-b /live
        ↓
[HTTP Request] service-c /live
        ↓
[Code Node] evaluate results
        ↓
[IF Node] has Issues?
       ↓           ↓
   [Slack]       [End]

Node 1 - Schedule Trigger

Trigger Interval: Minutes
Every: 1

This runs monitoring every 60 seconds.

Nodes 2, 3, 4 — HTTP Requests

Each node:

Method: GET

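URLs:

http://service-a.default.svc.cluster.local/live
http://service-b.default.svc.cluster.local/live
http://service-c.default.svc.cluster.local/live

The services run in the default namespace, so n8n (in the automation namespace) must use default in the DNS name. Use whichever health path your services expose; the sample app.py shown earlier serves /health.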

Name them exactly:

Check Service A
Check Service B
Check Service C

⚠️ IMPORTANT:

In each HTTP Request node, open the Settings tab and enable “Continue on Fail”.

Without this, the workflow stops at the first failed request and the remaining checks never run.
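
The effect of “Continue on Fail” can be sketched outside n8n as well: probe each service, record the error instead of raising it, and move on to the next one. The code below is an illustration, not what n8n runs internally. It assumes the services live in the default namespace (as in the architecture overview) and expose /live; the function names are hypothetical.

```python
import urllib.request


def service_url(name, namespace="default", path="/live"):
    """Build the in-cluster DNS URL for a Service."""
    return f"http://{name}.{namespace}.svc.cluster.local{path}"


def http_get(url, timeout=5):
    """Return the HTTP status code; urllib raises for connection errors and 4xx/5xx."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def check_services(names, fetch=http_get):
    """Probe every service, recording failures instead of stopping at the first one."""
    results = []
    for name in names:
        url = service_url(name)
        try:
            results.append({"name": name, "url": url, "status": fetch(url), "error": None})
        except OSError as exc:  # covers URLError, HTTPError, timeouts, refused connections
            results.append({"name": name, "url": url, "status": None, "error": str(exc)})
    return results
```

The fetch parameter exists so the loop can be exercised with a stub; inside the cluster, calling check_services(["service-a", "service-b", "service-c"]) would issue the real requests.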

Node 5 — Code Node

const serviceA = $('Check Service A').first();
const serviceB = $('Check Service B').first();
const serviceC = $('Check Service C').first();

const results = [
  { name: 'service-a', data: serviceA },
  { name: 'service-b', data: serviceB },
  { name: 'service-c', data: serviceC },
];

const issues = [];

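for (const svc of results) {
  // With "Continue on Fail" enabled, a failed request leaves an error on the item.
  // Depending on the n8n version it may sit on the item or inside item.json, so check both.
  const err = svc.data?.error ?? svc.data?.json?.error;
  if (err !== undefined) {
    issues.push({
      service: svc.name,
      detail: err?.message || String(err) || 'No response'
    });
  }
}

// Slack formatting: *bold* service name, then the error detail on the next line.
const lines = issues.map(i => `*${i.service}* - DOWN\nDetail: ${i.detail}`).join('\n\n');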

return [{
  json: {
    hasIssues: issues.length > 0,
    message: issues.length > 0 ? `🚨 Service Health Alert\n\n${lines}` : 'All OK'
  }
}];

⚠️ Node names inside $('...') must match EXACTLY.
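
For testing the alert logic away from n8n, the same evaluation can be written as a small Python function. This is a sketch that mirrors the Code node above under one assumption: each result is a dictionary with a name and an error field, where error is None for a healthy service. The function name is hypothetical.

```python
def build_alert(results):
    """Turn per-service check results into the payload the IF and Slack nodes consume."""
    issues = [r for r in results if r.get("error") is not None]
    # One Slack-formatted entry per failing service, separated by blank lines.
    lines = "\n\n".join(
        f"*{r['name']}* - DOWN\nDetail: {r['error'] or 'No response'}"
        for r in issues
    )
    has_issues = bool(issues)
    return {
        "hasIssues": has_issues,
        "message": f"🚨 Service Health Alert\n\n{lines}" if has_issues else "All OK",
    }


if __name__ == "__main__":
    sample = [
        {"name": "service-a", "error": None},
        {"name": "service-b", "error": "connect ECONNREFUSED"},
    ]
    print(build_alert(sample)["message"])
```

The hasIssues flag is what the IF node branches on, and message is the text handed to the Slack node.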

Node 6 — IF Node

Condition:

{{ $json.hasIssues }}

Operation: is true

True → Slack
False → End

Node 7 — Setup a Slack Bot

1. Go to https://api.slack.com/apps
2. Create New App → From scratch
3. Under OAuth & Permissions, add these bot token scopes:
   * chat:write
   * chat:write.public
   * channels:read
4. Install the app to your workspace
5. Copy the Bot Token (starts with xoxb-)

In n8n:
Settings → Credentials → Slack API → paste token

In Slack:
/invite @n8n-alerts

Slack Node Configuration:
* Resource: Message
* Operation: Send
* Channel: #alerts
* Message: {{ $json.message }}
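
Outside n8n, the same message can be sent with Slack's chat.postMessage Web API method, authenticated with the bot token. The sketch below only builds the request so it can be inspected without network access; the function name is hypothetical and the token and channel values are placeholders. It is offered as an outline of the equivalent API call, not as a description of what the n8n Slack node does internally.

```python
import json
import urllib.request

SLACK_POST_MESSAGE_URL = "https://slack.com/api/chat.postMessage"


def build_slack_request(token, channel, text):
    """Build (but do not send) a chat.postMessage request for the given bot token."""
    body = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SLACK_POST_MESSAGE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen would send it; Slack replies with a JSON body whose ok field indicates whether the message was accepted.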

Activate workflow.

Testing the Workflow

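In the Check Service A node, change the host in the URL from service-a to a name that does not exist, such as service-x, leaving the rest of the URL unchanged.

Execute the workflow.

✅ A Slack alert should fire. Revert the URL afterwards.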

PHASE 10 - Realistic Production Architecture

Now your architecture becomes:

Kubernetes
 ├── Pod crashes
 ├── Scaling events
 ├── Restarts
 │
 │ (metrics / events)
 ▼
n8n
 ├── Evaluate logic
 ├── Notify Slack
 ├── Optional Git rollback
 │
 ▼
Git
 │
 ▼
Argo CD
 └── Re-sync cluster

What Makes This Production-Grade?

Self-healing pods
GitOps reconciliation
Auto-scaling
Internal service communication
Automated alerting
Event-driven workflows
Slack integration
Extensible automation engine
