Running Kubernetes across Scaleway, OVH, and a Proxmox box
Terraform for 4 providers, ArgoCD app-of-apps, Cilium network policies, CrunchyData PostgreSQL operator, and NixOS VMs for on-prem K3s. How I manage 5 clusters.
I manage Kubernetes infrastructure across two jobs: Stamus Networks (cybersecurity) and my own projects (Dosismart, a dosimetry SaaS). Both use similar patterns because I built both. The total footprint is 5 clusters across 4 providers, managed entirely through Terraform and ArgoCD.
No one runs kubectl apply in production. Everything goes through Git.
The providers and why
Scaleway is the primary cloud for both. Kapsule for managed Kubernetes, S3-compatible storage for Terraform state, VPC networking. It’s cheaper than AWS for small-to-medium European workloads and the Terraform provider is solid.
OVH runs the Stamus production cluster in SBG5 (Strasbourg). Some workloads need to stay on French infrastructure for compliance. OVH’s managed Kubernetes is basic but reliable.
DigitalOcean handles development clusters at Stamus plus DNS hosting and a container registry. At MasterMonkeys (Dosismart’s parent), it hosts Terraform state in Spaces and manages domains.
Proxmox runs on-prem NixOS VMs for a K3s cluster. This handles workloads that can’t leave the local network and doubles as a testing ground for deployment changes without burning cloud credits.
Cluster configuration
The Scaleway production cluster for Dosismart runs Kubernetes 1.32 with Cilium CNI and auto-upgrading enabled (Sunday 3 AM maintenance window). Node pools are spread across three availability zones (fr-par-1/2/3) with autoscaling:
resource "scaleway_k8s_pool" "prod" {
  # One pool per availability zone; the map value is only used for naming.
  for_each = {
    "fr-par-1" = 1,
    "fr-par-2" = 2,
    "fr-par-3" = 3
  }

  cluster_id  = scaleway_k8s_cluster.prod.id # cluster resource defined elsewhere
  name        = "prod-${each.value}"
  zone        = each.key
  node_type   = "PRO2-XXS"
  size        = 1
  autoscaling = true
  min_size    = 1
  max_size    = 2
  autohealing = true
}
One PRO2-XXS node (2 vCPU, 4GB) per zone to start. Conservative, but for a SaaS that’s not yet live, I don’t need more. Autoscaling and autohealing mean I don’t need to babysit the cluster.
The Stamus OVH cluster uses a different strategy because it needs dedicated CI/CD capacity. There are 3 separate runner node pools with 10 nodes each, tainted so only GitLab runners schedule on them:
resource "ovh_cloud_project_kube_nodepool" "runner_pool" {
  service_name = var.ovh_project_id            # OVH public cloud project ID
  kube_id      = ovh_cloud_project_kube.prod.id # cluster resource defined elsewhere
  flavor_name  = "d2-8"                        # 8 vCPUs, 32GB
  autoscale    = true
  max_nodes    = 10
  min_nodes    = 1
  autoscaling_scale_down_unneeded_time_seconds = 3600

  template {
    spec {
      taints = [{
        effect = "NoExecute"
        key    = "stamus.com/type"
        value  = "runner"
      }]
    }
  }
}
The 1-hour scale-down delay prevents nodes from being killed mid-build. Without it, the autoscaler would kill a node 10 minutes after a build finishes, then spin a new one up when the next build starts. That’s more expensive than just keeping it around for an hour.
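The taint alone only keeps other workloads off the runner nodes; the runner’s job pods also need a matching toleration (and typically a node selector) to land there. In plain pod-spec terms, a sketch of what the job pods carry (the node label is an assumption):

```yaml
# Sketch: toleration/selector the GitLab runner's job pods would need
# to schedule onto the tainted runner pools.
tolerations:
  - key: stamus.com/type
    operator: Equal
    value: runner
    effect: NoExecute
nodeSelector:
  stamus.com/type: runner   # assumes nodes are labeled to match the taint
```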
On-prem K3s with NixOS
The Proxmox setup provisions NixOS VMs via Terraform:
resource "proxmox_vm_qemu" "main-1" {
  name        = "main-1"
  target_node = "proxmox1"
  cores       = 4
  memory      = 8192 # MB

  disks {
    scsi {
      scsi1 {
        disk {
          size    = "64G"
          storage = "local-lvm"
        }
      }
      scsi2 {
        disk {
          size    = "32G"
          storage = "local-lvm"
        }
      }
    }
    ide {
      ide2 {
        cdrom {
          iso = "local:iso/nixos-kube-init.iso"
        }
      }
    }
  }
}
Three VMs: one init node (bootstraps the K3s cluster), two workers. The NixOS images are built from flakes in the repo, so the VMs are fully declarative. Destroy a VM, re-apply Terraform, and it comes back identical in about 5 minutes. 64GB primary disk for workloads, 32GB secondary for etcd and logs.
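On the NixOS side, turning a VM into a K3s node is a few lines of module config. A sketch using nixpkgs’ services.k3s options (the token path is an assumption, not the repo’s actual layout):

```nix
{ config, ... }: {
  services.k3s = {
    enable = true;
    role = "server";                  # "agent" on the two workers
    clusterInit = true;               # only on the init node
    tokenFile = "/var/lib/k3s/token"; # assumed path, shared with workers
  };
  networking.firewall.allowedTCPPorts = [ 6443 ]; # K3s API server
}
```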
Terraform state management
Each provider’s Terraform lives in its own directory with its own state backend:
- Scaleway resources store state in Scaleway S3
- OVH resources store state in OVH S3
- DigitalOcean resources store state in either DigitalOcean Spaces or OVH S3 (depending on the project)
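All of these are S3-compatible, so each directory just points Terraform’s generic S3 backend at the right endpoint. A sketch for the Scaleway case (the bucket name and the Terraform ≥1.6 `endpoints` syntax are assumptions):

```hcl
terraform {
  backend "s3" {
    bucket = "tfstate-dosismart"          # assumed bucket name
    key    = "scaleway/terraform.tfstate"
    region = "fr-par"
    endpoints = {
      s3 = "https://s3.fr-par.scw.cloud"  # Scaleway's S3-compatible endpoint
    }
    # The backend is AWS-flavored, so skip the AWS-specific checks:
    skip_credentials_validation = true
    skip_region_validation      = true
    skip_requesting_account_id  = true
  }
}
```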
I tried sharing state across providers once. Don’t do this. A Terraform apply that touches Scaleway and OVH resources in the same state means both providers have to be reachable for any operation, and a transient error on one blocks changes to the other.
ArgoCD and the app-of-apps pattern
Every cluster runs ArgoCD with GitLab OIDC for authentication. No local ArgoCD accounts exist. You log in with GitLab, and your GitLab group membership determines your ArgoCD role:
configs:
  admin.enabled: "false"
  rbac:
    'policy.default': 'role:none'
    'policy.csv': |
      p, role:org-admin, applications, *, */*, allow
      g, master-monkey, role:org-admin
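The GitLab side of that login is standard Argo CD OIDC configuration. A sketch (the client ID and secret resolve from argocd-secret via the `$` reference syntax; exact key names are assumptions):

```yaml
configs:
  cm:
    oidc.config: |
      name: GitLab
      issuer: https://gitlab.com
      clientID: $oidc.gitlab.clientID         # resolved from argocd-secret
      clientSecret: $oidc.gitlab.clientSecret
      requestedScopes: ["openid", "profile", "email", "groups"]
```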
The root Application points to a directory of ApplicationSets. Each ApplicationSet generates Kubernetes resources from a Helm chart plus environment-specific values:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: dosismart
spec:
  generators:
    - list:
        elements:
          - tag: main-bd342d74
            hostname: dosismart.com
  template:
    spec:
      sources:
        - chart: basic-app
          helm:
            valuesObject:
              apps:
                - name: front
                  image:
                    repository: registry.gitlab.com/.../app
                    tag: main-bd342d74
                  hpa:
                    enabled: true
                    minReplicas: 2
                    maxReplicas: 6
                    targetCPU: 70
                  pdb:
                    enabled: true
                    maxUnavailable: 1
                  networkPolicy:
                    enabled: true
The basic-app Helm chart is a generic template I wrote that takes a list of apps and generates Deployments, Services, Ingresses, HPAs, PDBs, NetworkPolicies, ServiceMonitors, and ExternalSecrets for each one. One chart handles everything because the variation between services is in the values, not the template logic.
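Internally that chart is just a loop over the apps list. A minimal sketch of the deployment template (field names beyond those shown in the values above are hypothetical):

```yaml
# templates/deployment.yaml (sketch)
{{- range .Values.apps }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .name }}
spec:
  selector:
    matchLabels:
      app: {{ .name }}
  template:
    metadata:
      labels:
        app: {{ .name }}
    spec:
      containers:
        - name: {{ .name }}
          image: "{{ .image.repository }}:{{ .image.tag }}"
{{- end }}
```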
Network policies with Cilium
Every deployment gets a Cilium NetworkPolicy. The frontend only accepts traffic from the ingress-nginx namespace on port 80. The backend can talk to the monitoring namespace (OpenTelemetry collector on port 4317), to the database on the private network, and to external services on 443 (Stripe, Zitadel):
networkPolicy:
  enabled: true
  egress:
    - toEntities: ["world"]
      toPorts: [{ ports: [{ port: "443", protocol: TCP }] }]
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: monitoring
      toPorts: [{ ports: [{ port: "4317", protocol: TCP }] }]
This is the part most people skip. Default Kubernetes networking is flat: everything can talk to everything. In a security product, that’s not acceptable. Cilium makes L3/L4/L7 policies easy to define in the Helm values.
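Rendered out, the frontend’s ingress side looks roughly like this CiliumNetworkPolicy (the pod label names are assumptions):

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: front
spec:
  endpointSelector:
    matchLabels:
      app: front
  ingress:
    # Only ingress-nginx pods may reach the frontend, and only on port 80.
    - fromEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: ingress-nginx
      toPorts:
        - ports:
            - port: "80"
              protocol: TCP
```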
Secret management
Secrets come from Scaleway Secret Manager via the External Secrets Operator. A ClusterSecretStore authenticates to Scaleway, and ExternalSecret resources in each namespace pull specific secrets with automatic refresh:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: dosismart-db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: scaleway-secret-manager
    kind: ClusterSecretStore
  dataFrom:
    - extract:
        key: name:db-dosi-credentials
        version: latest_enabled
No secrets in Git, no secrets in Helm values, no secrets in CI/CD variables (except the initial Scaleway credentials that bootstrap the External Secrets Operator). Rotation happens in Scaleway’s UI and the operator picks up changes within the hour.
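The ClusterSecretStore behind this looks roughly as follows (a sketch using the External Secrets Operator’s Scaleway provider fields; the bootstrap secret name and namespace are assumptions):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: scaleway-secret-manager
spec:
  provider:
    scaleway:
      region: fr-par
      projectId: "<project-id>"            # placeholder
      accessKey:
        secretRef:
          name: scaleway-credentials       # the bootstrap secret, assumed name
          namespace: external-secrets
          key: access-key
      secretKey:
        secretRef:
          name: scaleway-credentials
          namespace: external-secrets
          key: secret-key
```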
PostgreSQL with CrunchyData
Both projects use the CrunchyData PostgreSQL Operator for stateful databases on Kubernetes. The operator manages PostgreSQL 17 instances with WAL compression (zstd), automated pgbackrest backups retaining two full backups, and proper PersistentVolumeClaims:
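A trimmed PostgresCluster spec for that setup (a sketch against the PGO v5 CRD; replica counts and storage sizes are assumptions):

```yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: pgsql-main
spec:
  postgresVersion: 17
  instances:
    - name: ha
      replicas: 2
      dataVolumeClaimSpec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi              # assumed size
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          wal_compression: zstd        # compress WAL with zstd
  users:
    - name: scirius                    # operator creates pgsql-main-pguser-scirius
  backups:
    pgbackrest:
      global:
        repo1-retention-full: "2"      # keep two full backups
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 50Gi        # assumed size
```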
ApplicationSets reference the operator-managed PostgreSQL instances via DNS (pgsql-main-ha.postgres.svc.cluster.local) and the operator-created secrets (pgsql-main-pguser-scirius). The operator handles failover, backup scheduling, and credential rotation.
For Dosismart’s production database, I use Scaleway’s managed RDB (PostgreSQL 16) instead of the operator because it’s simpler to manage and Scaleway handles backups. The trade-off is vendor lock-in, but for a database I don’t want to operate myself, that’s fine.
CI/CD flow
The actual deployment flow:
- Developer pushes to main
- GitLab CI builds Docker images, pushes to registry
- A downstream pipeline updates the image tag in the app-of-apps repo
- ArgoCD detects the change and syncs
The “deploy” stage in GitLab CI doesn’t actually deploy. It updates a YAML file in Git. ArgoCD does the rest. This means rollback is git revert on the app-of-apps repo, not “find the right kubectl command.”
push:
  image: bitnami/git
  script:
    - git remote add origin "https://token:$ACCESS_TOKEN@gitlab.com/.../app-of-apps.git"
    - git add .
    - ./scripts/git_commit.sh
    - git push origin HEAD:main
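The tag update itself is just a text substitution before the commit. A hypothetical sketch (the file path and key layout are assumptions, not the real app-of-apps structure; a stand-in values file is created so the snippet is self-contained):

```shell
#!/bin/sh
set -eu
NEW_TAG="main-bd342d74"

# Stand-in values file for illustration only:
cat > values.yaml <<'EOF'
apps:
  - name: front
    image:
      tag: main-00000000
EOF

# Rewrite the image tag in place; the commit + push that follows is
# what ArgoCD actually reacts to.
sed -i "s/^\([[:space:]]*tag:[[:space:]]*\).*/\1${NEW_TAG}/" values.yaml
```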
What I’d simplify
Five clusters is too many for the workloads I’m running. If compliance weren’t a factor, I’d consolidate to two: one production, one dev. The multi-provider Terraform modules add complexity that doesn’t pay for itself at this scale.
The basic-app Helm chart is getting large. It started simple but now handles HPAs, PDBs, NetworkPolicies, ExternalSecrets, PrometheusRules, and ServiceMonitors. I might split it into a base chart and addons, or switch to Kustomize for the simpler services.