Running Kubernetes across Scaleway, OVH, and a Proxmox box

Terraform for 4 providers, ArgoCD app-of-apps, Cilium network policies, CrunchyData PostgreSQL operator, and NixOS VMs for on-prem K3s. How I manage 5 clusters.

I manage Kubernetes infrastructure across two jobs: Stamus Networks (cybersecurity) and my own projects (Dosismart, a dosimetry SaaS). Both use similar patterns because I built both. The total footprint is 5 clusters across 4 providers, managed entirely through Terraform and ArgoCD.

No one runs kubectl apply in production. Everything goes through Git.

The providers and why

Scaleway is the primary cloud for both. Kapsule for managed Kubernetes, S3-compatible storage for Terraform state, VPC networking. It’s cheaper than AWS for small-to-medium European workloads and the Terraform provider is solid.

OVH runs the Stamus production cluster in SBG5 (Strasbourg). Some workloads need to stay on French infrastructure for compliance. OVH’s managed Kubernetes is basic but reliable.

DigitalOcean handles development clusters at Stamus plus DNS hosting and a container registry. At MasterMonkeys (Dosismart’s parent), it hosts Terraform state in Spaces and manages domains.

Proxmox runs on-prem NixOS VMs for a K3s cluster. This handles workloads that can’t leave the local network and doubles as a testing ground for deployment changes without burning cloud credits.

Cluster configuration

The Scaleway production cluster for Dosismart runs Kubernetes 1.32 with Cilium CNI and auto-upgrading enabled (Sunday 3 AM maintenance window). Node pools are spread across three availability zones (fr-par-1/2/3) with autoscaling:

resource "scaleway_k8s_pool" "prod" {
  # one pool per AZ; the map value is a stable index used for naming
  for_each = {
    "fr-par-1" = 1,
    "fr-par-2" = 2,
    "fr-par-3" = 3
  }

  cluster_id  = scaleway_k8s_cluster.prod.id
  name        = "prod-${each.value}"
  zone        = each.key
  node_type   = "PRO2-XXS"
  size        = 1
  autoscaling = true
  min_size    = 1
  max_size    = 2
  autohealing = true
}

PRO2-XXS (2 vCPU, 4GB) per zone. Conservative, but for a SaaS that’s not yet live, I don’t need more. Autoscaling and autohealing mean I don’t need to babysit the cluster.

The Stamus OVH cluster uses a different strategy because it needs dedicated CI/CD capacity. There are 3 separate runner node pools, each autoscaling up to 10 nodes and tainted so only GitLab runners schedule on them:

resource "ovh_cloud_project_kube_nodepool" "runner_pool" {
  service_name = var.project_id                  # OVH public cloud project
  kube_id      = ovh_cloud_project_kube.prod.id  # the managed cluster
  name         = "runner-pool"
  flavor_name  = "d2-8"  # 8 vCPUs, 32GB
  autoscale    = true
  max_nodes    = 10
  min_nodes    = 1
  autoscaling_scale_down_unneeded_time_seconds = 3600

  template {
    spec {
      taints = [{
        effect = "NoExecute"
        key    = "stamus.com/type"
        value  = "runner"
      }]
    }
  }
}

The 1-hour scale-down delay prevents nodes from being killed mid-build. Without it, the autoscaler would kill a node 10 minutes after a build finishes, then spin a new one up when the next build starts. That’s more expensive than just keeping it around for an hour.
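The taint alone only keeps other workloads off the pool; the runner pods also need a matching toleration and node selector so they land there. A sketch of the relevant gitlab-runner Helm values (the label key mirrors the taint; exact values are assumptions):

```yaml
# gitlab-runner Helm values (sketch) — pin runner pods to the tainted pool
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        [runners.kubernetes.node_selector]
          "stamus.com/type" = "runner"
        [runners.kubernetes.node_tolerations]
          "stamus.com/type=runner" = "NoExecute"
```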

On-prem K3s with NixOS

The Proxmox setup provisions NixOS VMs via Terraform:

resource "proxmox_vm_qemu" "main-1" {
  name        = "main-1"
  target_node = "proxmox1"
  cores       = 4
  memory      = 8192

  disks {
    scsi {
      scsi1 {
        disk {
          size    = "64G"
          storage = "local-lvm"
        }
      }
      scsi2 {
        disk {
          size    = "32G"
          storage = "local-lvm"
        }
      }
    }
    ide {
      ide2 {
        cdrom {
          iso = "local:iso/nixos-kube-init.iso"
        }
      }
    }
  }
}

Three VMs: one init node (bootstraps the K3s cluster), two workers. The NixOS images are built from flakes in the repo, so the VMs are fully declarative. Destroy a VM, re-apply Terraform, and it comes back identical in about 5 minutes. 64GB primary disk for workloads, 32GB secondary for etcd and logs.
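NixOS ships a K3s module, so the declarative part boils down to a few lines. A minimal sketch of what a worker's flake could look like (hostnames, the init node address, and the token path are assumptions, not the actual repo):

```nix
# flake.nix (sketch) — declarative K3s worker joining the init node
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.05";

  outputs = { self, nixpkgs }: {
    nixosConfigurations.worker-1 = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [{
        services.k3s = {
          enable = true;
          role = "agent";
          serverAddr = "https://main-1:6443";  # the init node
          tokenFile = "/var/lib/k3s-token";    # provisioned out of band
        };
        networking.hostName = "worker-1";
        system.stateVersion = "24.05";
      }];
    };
  };
}
```

Build the ISO from the flake, attach it as the cdrom in the Terraform resource above, and the VM converges to this config on first boot.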

Terraform state management

Each provider’s Terraform lives in its own directory with its own state backend:

  • Scaleway resources store state in Scaleway S3
  • OVH resources store state in OVH S3
  • DigitalOcean resources store state in either DigitalOcean Spaces or OVH S3 (depending on the project)
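Each directory just points Terraform's generic S3 backend at the provider's object storage. A sketch for the Scaleway case (bucket name and key are assumptions):

```hcl
# backend.tf (sketch) — Terraform state in Scaleway Object Storage
terraform {
  backend "s3" {
    bucket                      = "mm-terraform-state"  # assumed bucket
    key                         = "scaleway/prod.tfstate"
    region                      = "fr-par"
    endpoints                   = { s3 = "https://s3.fr-par.scw.cloud" }
    skip_credentials_validation = true
    skip_region_validation      = true
    skip_requesting_account_id  = true
  }
}
```

The `skip_*` flags are needed because the backend assumes AWS by default; they tell it not to validate against AWS's region list or STS.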

I tried sharing state across providers once. Don’t do this. A Terraform apply that touches Scaleway and OVH resources in the same state means both providers have to be reachable for any operation, and a transient error on one blocks changes to the other.

ArgoCD and the app-of-apps pattern

Every cluster runs ArgoCD with GitLab OIDC for authentication. No local ArgoCD accounts exist. You log in with GitLab, and your GitLab group membership determines your ArgoCD role:

configs:
  cm:
    admin.enabled: "false"
  rbac:
    policy.default: "role:none"
    policy.csv: |
      p, role:org-admin, applications, *, */*, allow
      g, master-monkey, role:org-admin

The root Application points to a directory of ApplicationSets. Each ApplicationSet generates Kubernetes resources from a Helm chart plus environment-specific values:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: dosismart
spec:
  generators:
    - list:
        elements:
          - tag: main-bd342d74
            hostname: dosismart.com
  template:
    spec:
      sources:
        - chart: basic-app
          helm:
            valuesObject:
              apps:
                - name: front
                  image:
                    repository: registry.gitlab.com/.../app
                    tag: main-bd342d74
                  hpa:
                    enabled: true
                    minReplicas: 2
                    maxReplicas: 6
                    targetCPU: 70
                  pdb:
                    enabled: true
                    maxUnavailable: 1
                  networkPolicy:
                    enabled: true

The basic-app Helm chart is a generic template I wrote that takes a list of apps and generates Deployments, Services, Ingresses, HPAs, PDBs, NetworkPolicies, ServiceMonitors, and ExternalSecrets for each one. One chart handles everything because the variation between services is in the values, not the template logic.
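Internally that pattern is just a range over the list. A trimmed sketch of what one template in such a chart might look like (illustrative, not the actual chart):

```yaml
# templates/deployment.yaml (sketch) — one Deployment per entry in .Values.apps
{{- range .Values.apps }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .name }}
spec:
  selector:
    matchLabels:
      app: {{ .name }}
  template:
    metadata:
      labels:
        app: {{ .name }}
    spec:
      containers:
        - name: {{ .name }}
          image: "{{ .image.repository }}:{{ .image.tag }}"
{{- end }}
```

Adding a new service is then one more entry in `apps`, not a new set of manifests.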

Network policies with Cilium

Every deployment gets a Cilium NetworkPolicy. The frontend only accepts traffic from the ingress-nginx namespace on port 80. The backend can talk to the monitoring namespace (OpenTelemetry collector on port 4317), to the database on the private network, and to external services on 443 (Stripe, Zitadel):

networkPolicy:
  enabled: true
  egress:
    - toEntities: ["world"]
      toPorts: [{ ports: [{ port: "443", protocol: TCP }] }]
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: monitoring
      toPorts: [{ ports: [{ port: "4317", protocol: TCP }] }]

This is the part most people skip. Default Kubernetes networking is flat: everything can talk to everything. In a security product, that's not acceptable. Cilium makes L3/L4/L7 policies easy to define in the Helm values.
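For the frontend side of that rule, the chart renders something like the following CiliumNetworkPolicy (a sketch; label names are assumptions):

```yaml
# rendered policy (sketch) — frontend accepts only ingress-nginx on port 80
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: front
spec:
  endpointSelector:
    matchLabels:
      app: front
  ingress:
    - fromEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: ingress-nginx
      toPorts:
        - ports:
            - port: "80"
              protocol: TCP
```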

Secret management

Secrets come from Scaleway Secret Manager via the External Secrets Operator. A ClusterSecretStore authenticates to Scaleway, and ExternalSecret resources in each namespace pull specific secrets with automatic refresh:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: dosismart-db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: scaleway-secret-manager
    kind: ClusterSecretStore
  dataFrom:
    - extract:
        key: name:db-dosi-credentials
        version: latest_enabled

No secrets in Git, no secrets in Helm values, no secrets in CI/CD variables (except the initial Scaleway credentials that bootstrap the External Secrets Operator). Rotation happens in Scaleway’s UI and the operator picks up changes within the hour.
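The store itself is the one-time bootstrap piece. A sketch of what the ClusterSecretStore could look like with ESO's Scaleway provider (the project ID and credential secret names are placeholders):

```yaml
# ClusterSecretStore (sketch) — authenticates ESO to Scaleway Secret Manager
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: scaleway-secret-manager
spec:
  provider:
    scaleway:
      region: fr-par
      projectId: "11111111-2222-3333-4444-555555555555"  # placeholder
      accessKey:
        secretRef:
          name: scaleway-credentials  # assumed bootstrap secret
          namespace: external-secrets
          key: access-key
      secretKey:
        secretRef:
          name: scaleway-credentials
          namespace: external-secrets
          key: secret-key
```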

PostgreSQL with CrunchyData

Both projects use the CrunchyData PostgreSQL Operator for stateful databases on Kubernetes. The operator manages PostgreSQL 17 instances with zstd WAL compression, automated pgbackrest backups retaining two full backups, and proper PersistentVolumeClaims.
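A trimmed sketch of such a PostgresCluster spec (instance and repo names are assumptions chosen to match the service names below):

```yaml
# PostgresCluster (sketch) — PG 17, zstd WAL compression, 2 full backups kept
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: pgsql-main
spec:
  postgresVersion: 17
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          wal_compression: zstd
  instances:
    - name: ha
      replicas: 2
      dataVolumeClaimSpec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
  backups:
    pgbackrest:
      global:
        repo1-retention-full: "2"
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 20Gi
```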

ApplicationSets reference the operator-managed PostgreSQL instances via DNS (pgsql-main-ha.postgres.svc.cluster.local) and the operator-created secrets (pgsql-main-pguser-scirius). The operator handles failover, backup scheduling, and credential rotation.

For Dosismart’s production database, I use Scaleway’s managed RDB (PostgreSQL 16) instead of the operator because it’s simpler to manage and Scaleway handles backups. The trade-off is vendor lock-in, but for a database I don’t want to operate myself, that’s fine.

CI/CD flow

The actual deployment flow:

  1. Developer pushes to main
  2. GitLab CI builds Docker images, pushes to registry
  3. A downstream pipeline updates the image tag in the app-of-apps repo
  4. ArgoCD detects the change and syncs
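Step 3 is the only moving part. A minimal sketch of the tag bump (the file path and tag format are assumptions; the stand-in `printf` represents the existing file in the checkout):

```shell
# sketch: bump the image tag in the app-of-apps values file
TAG="main-bd342d74"
VALUES="values.yaml"                       # assumed path in the app-of-apps repo
printf 'tag: main-deadbeef\n' > "$VALUES"  # stand-in for the checked-out file
sed -i -E "s/tag: main-[0-9a-f]+/tag: ${TAG}/" "$VALUES"
# the CI job then commits and pushes; ArgoCD syncs the new tag
```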

The “deploy” stage in GitLab CI doesn’t actually deploy. It updates a YAML file in Git. ArgoCD does the rest. This means rollback is git revert on the app-of-apps repo, not “find the right kubectl command.”

push:
  image: bitnami/git
  script:
    - git remote add origin "https://token:$ACCESS_TOKEN@gitlab.com/.../app-of-apps.git"
    - git add .
    - ./scripts/git_commit.sh
    - git push origin HEAD:main

What I’d simplify

Five clusters is too many for the workloads I'm running. If compliance weren't a factor, I'd consolidate to two: one production, one dev. The multi-provider Terraform modules add complexity that doesn't pay for itself at this scale.

The basic-app Helm chart is getting large. It started simple but now handles HPAs, PDBs, NetworkPolicies, ExternalSecrets, PrometheusRules, and ServiceMonitors. I might split it into a base chart and addons, or switch to Kustomize for the simpler services.