Building stamusctl · Part 3

The daemon, observability, and testing

stamusd exposes the same CLI as a REST API. Priority-based shutdown, hot-reloading auth, rate limiting with Redis fallback, and testing with an in-memory filesystem.

In the previous parts I covered the template pipeline and the Docker/PCAP plumbing. This last part is about stamusd (the REST API daemon), observability, and how the codebase is tested.

Same binary, different mode

stamusctl and stamusd are the same Go binary. An environment variable sets the application name, which selects the mode to run:

switch app.Name {
case "stamusctl":
    ctl.Execute()
case "stamusd":
    daemon.Execute()
}

Both use the same internal handlers. Anything you can do from the CLI, you can do over HTTP. The daemon exists for automation: CI/CD pipelines, custom management UIs, or anything that needs to talk to Clear NDR programmatically.

Gin middleware stack

The daemon runs Gin with a layered middleware stack, and the order matters:

  1. Recovery - catch panics, return 500 instead of crashing
  2. Security headers - CSP, HSTS, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, Permissions-Policy
  3. CORS - configurable allowed origins, with wildcard support
  4. OpenTelemetry - otelgin middleware instruments every request with a trace
  5. Request ID - propagates the OTel trace ID as X-Request-ID in the response
  6. Request/response logging - structured log of every request
  7. Auth - token-based, with hot-reload from a file
  8. Rate limiting - per-IP, Redis-backed with in-memory fallback

The auth middleware watches a token file on disk. When the file changes, the expected token updates without restarting the daemon. This is useful in Kubernetes where secrets get rotated by operators. The middleware does Basic auth validation: Authorization: Basic <base64(username:token)>. If no token is configured, auth is disabled entirely, which is the default for local development.

Rate limiting tries Redis first (for distributed limiting across multiple daemon instances) and falls back to an in-memory limiter (5 requests/second/IP). The fallback matters because the daemon should work standalone; requiring Redis just to start would be a bad dependency.

Shutdown ordering

When stamusd receives SIGTERM, things need to stop in the right order. If you flush telemetry before in-flight Docker operations finish, you lose traces. If you close connections before the HTTP server stops accepting requests, clients get broken pipes.

The shutdown system uses priority-based handlers:

const (
    PriorityFirst       Priority = 100  // Stop accepting requests
    PriorityInFlight    Priority = 200  // Wait for Docker operations
    PriorityConnections Priority = 300  // Close connections
    PriorityTelemetry   Priority = 400  // Flush traces
    PriorityLast        Priority = 500  // Final cleanup
)

Each component registers a handler with a priority and a function. On shutdown, handlers execute in priority order with a 30-second timeout. The HTTP server stops first, then Docker operations drain, then connections close, then OTel flushes its buffer.

The CLI has a different but related pattern: double Ctrl+C handling. First interrupt triggers graceful shutdown. Second interrupt within 2 seconds forces exit. This prevents the “I pressed Ctrl+C and nothing happened so I mashed it” situation that kills in-flight operations.

OpenTelemetry from day one

Every API request generates a distributed trace. The OTel collector URL is configurable. Traces include the request ID, the operation being performed, and any Docker container IDs involved.

When compose init fails through the daemon, I can look up the trace and see exactly which step broke: was it the template pull from the OCI registry? The template rendering? The Docker Compose file write? The image pull? Each step is a span in the trace.

The exporter supports both OTLP gRPC (for production collectors like Tempo or Jaeger) and stdout (for development). The sampler is set to AlwaysSample because the request volume is low enough that sampling every request is fine.

Logging is Zap with dual output: JSON to a file (for machine parsing and log aggregation), colored console to stdout (for humans watching the daemon). Both include structured fields: instance name, operation type, container ID, duration.

Testing with a fake filesystem

All file operations go through Afero, an abstraction layer over the OS filesystem:

var FS afero.Fs = afero.NewOsFs()  // Real filesystem

// In tests:
app.FS = afero.NewMemMapFs()  // In-memory

This lets me test that compose init creates the right directory structure, writes the right config files, and handles permission errors, without touching disk. The template rendering, parameter extraction, and config writing all operate on the abstract filesystem. They don’t know or care whether it’s real.

The daemon has route registration tests that verify every Gin endpoint exists with the correct HTTP method:

func TestNewCompose_RegistersRoutes(t *testing.T) {
    router := gin.New()
    v1 := router.Group("/api/v1")
    NewCompose(v1)

    routes := router.Routes()
    expectedRoutes := map[string]string{
        "/api/v1/compose/init": http.MethodPost,
        "/api/v1/compose/up":   http.MethodPost,
    }

    for path, method := range expectedRoutes {
        found := false
        for _, route := range routes {
            if route.Path == path && route.Method == method {
                found = true
                break
            }
        }
        assert.Truef(t, found, "route %s %s not registered", method, path)
    }
}

This catches the “I added a handler but forgot to register the route” mistake, which is easy to make with Gin’s group-based routing.

The circuit breaker has state machine tests that verify transitions between closed, open, and half-open states. Integration tests for the daemon spin up a real Gin server and make HTTP requests against it.

What I’d change

The Cobra command tree is 3 levels deep (stamusctl compose readpcap). Adding a new command means touching multiple files across the hierarchy. I’d flatten it or generate more of the boilerplate.

Go templates with Sprig get hard to read when the conditionals nest deep. For the CE templates, where the goal is “one question, everything works,” this is fine. But if the templates grow more complex, a real configuration DSL would be cleaner.

The code is at github.com/StamusNetworks/stamusctl.