Adding AI chat and observability to an open-source NDR

Clear NDR Community Edition ships as a Docker Compose stack managed by stamusctl. The core is Suricata, Fluentd, OpenSearch, Scirius, and a handful of supporting services. Recently I added three optional components: an AI chat assistant backed by LibreChat, application log collection via OpenTelemetry, and Grafana dashboards. Each is a feature flag. None touches the core stack unless you turn it on.

The interesting part isn’t the tools themselves. It’s how they integrate into an existing template system, share authentication, and solve problems that only show up when you’re wiring third-party software into a security appliance.

OpenTelemetry Collector for application logs

The stack has seven services that produce logs: Suricata, Scirius, Arkime, OpenSearch, PostgreSQL, NGINX, and Scout. Before this change, those logs went to Docker’s logging driver and nowhere else. If you wanted to debug why Scirius was throwing 500s, you ran docker logs and grep’d through unstructured text.

The OTel Collector reads log files from each service and ships them to OpenSearch under per-application indices:

receivers:
    filelog/scirius:
        include:
            - /var/log/scirius/*.log
        start_at: end
        multiline:
            line_start_pattern: '^\d{4}-\d{2}-\d{2}|^\{'
        operators:
            - type: json_parser
              if: 'body matches "^\\{"'

processors:
    transform/scirius:
        log_statements:
            - context: log
              statements:
                  - set(attributes["opensearch.index"], "applogs-scirius")

Each service gets its own filelog receiver and a transform processor that tags the source and sets the target index. All of them export to the same OpenSearch instance through a single exporter. The opensearch.index attribute drives dynamic index routing, so applogs-scirius, applogs-nginx, and applogs-suricata are separate indices you can query independently or together via applogs-*.
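
As a rough illustration of the routing idea (not the exporter's actual code), the Python sketch below groups records by their opensearch.index attribute and then selects indices with a wildcard, the way an applogs-* query would. The record shape, the sample log lines, and the fallback index name applogs-unknown are assumptions made for the example.

```python
from collections import defaultdict
from fnmatch import fnmatch

def route(records, fallback="applogs-unknown"):
    """Group log bodies by the index named in each record's attributes."""
    indices = defaultdict(list)
    for record in records:
        index = record.get("attributes", {}).get("opensearch.index", fallback)
        indices[index].append(record["body"])
    return dict(indices)

def query(indices, pattern):
    """Return bodies from every index whose name matches a wildcard pattern."""
    return [body
            for name in sorted(indices) if fnmatch(name, pattern)
            for body in indices[name]]

# Made-up sample records, one per service.
records = [
    {"body": "GET /rest/ 500", "attributes": {"opensearch.index": "applogs-nginx"}},
    {"body": "Internal Server Error", "attributes": {"opensearch.index": "applogs-scirius"}},
    {"body": "engine started", "attributes": {"opensearch.index": "applogs-suricata"}},
]
indices = route(records)
```

Calling query(indices, "applogs-scirius") returns only the Scirius line, while query(indices, "applogs-*") returns all three.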

NGINX access logs get parsed with a regex operator that extracts remote_addr, status, request, and user agent into structured attributes. Scirius logs are mixed: some are JSON (Django), some are plain text (Celery workers). The json_parser operator only fires when the line starts with {, so plain text lines pass through unparsed rather than failing.
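
The multiline grouping and the conditional JSON parse can be modelled in a few lines of Python. This is a sketch of the behaviour described above, not the collector's implementation: the regular expression is the line_start_pattern from the receiver config, while the sample log lines and the fallback for malformed JSON are assumptions.

```python
import json
import re

# Same pattern as line_start_pattern in the receiver config above.
LINE_START = re.compile(r"^\d{4}-\d{2}-\d{2}|^\{")

def group_records(lines):
    """Attach continuation lines (e.g. tracebacks) to the record they follow."""
    records = []
    for line in lines:
        if LINE_START.search(line) or not records:
            records.append(line)
        else:
            records[-1] += "\n" + line
    return records

def parse(record):
    """Parse JSON only when the record starts with '{'; pass other text through."""
    if record.startswith("{"):
        try:
            return {"body": record, "attributes": json.loads(record)}
        except json.JSONDecodeError:
            pass  # looked like JSON but was not; keep the raw text (assumption)
    return {"body": record, "attributes": {}}

# Made-up sample input: one JSON line, then a plain-text line with a traceback.
lines = [
    '{"level": "ERROR", "message": "Internal Server Error"}',
    "2024-05-01 12:00:00 celery worker ready",
    "Traceback (most recent call last):",
    '  File "tasks.py", line 10, in run',
]
parsed = [parse(r) for r in group_records(lines)]
```

The four input lines become two records: the JSON line with structured attributes, and the plain-text line with its traceback attached and no attributes.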

The receiver config is a Go template, so optional services are handled with conditionals:

{{- if .Values.scout }}
  filelog/scout:
    include:
      - /var/log/scout/*.log
{{- end }}

If Scout isn’t deployed, the receiver doesn’t render, and the collector doesn’t complain about missing log files.

Infrastructure metrics

The same OTel Collector optionally scrapes infrastructure metrics: PostgreSQL connection stats, RabbitMQ queue depths, OpenSearch cluster health, NGINX stub_status, and Docker container CPU/memory/network. Each metric source is independently toggleable:

otel:
    metrics:
        pg:
            enabled: true
            interval: '60s'
        docker:
            enabled: true
            interval: '30s'

The OpenSearch metrics use the elasticsearch receiver, which works because OpenSearch keeps a wire-compatible API. Docker stats require mounting the Docker socket into the collector container. Both are opt-in because they have side effects: the socket mount is a security consideration, and the elasticsearch receiver makes API calls that show up in OpenSearch’s own slow log.

VictoriaMetrics and Grafana

Metrics need somewhere to go. When victoriametrics.enabled is true, the OTel Collector exports via Prometheus remote write to a VictoriaMetrics single-node instance. When it’s false, metrics go to a debug exporter (logged and discarded). This lets you enable OTel for logs without committing to a metrics store.

{{- if .Values.victoriametrics.enabled }}
  prometheusremotewrite/victoriametrics:
    endpoint: http://victoriametrics:8428/api/v1/write
{{- else }}
  debug/metrics:
    verbosity: basic
{{- end }}
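
The same decisions read plainly as ordinary code. stamusctl renders Go templates, so the Python below only mirrors the logic: one function lists the metric sources whose toggle is on, the other picks the exporter block the conditional would render. The shape of the values dictionary follows the config fragments shown above; the function names are hypothetical.

```python
def enabled_sources(values):
    """List metric sources whose toggle is on, with their scrape interval."""
    metrics = values.get("otel", {}).get("metrics", {})
    return {name: cfg.get("interval")
            for name, cfg in metrics.items() if cfg.get("enabled")}

def metrics_exporter(values):
    """Choose the exporter block the template conditional would render."""
    if values.get("victoriametrics", {}).get("enabled"):
        return {"prometheusremotewrite/victoriametrics":
                {"endpoint": "http://victoriametrics:8428/api/v1/write"}}
    return {"debug/metrics": {"verbosity": "basic"}}

# Example values: PostgreSQL metrics on, Docker metrics off, no metrics store.
values = {
    "otel": {"metrics": {"pg": {"enabled": True, "interval": "60s"},
                         "docker": {"enabled": False, "interval": "30s"}}},
    "victoriametrics": {"enabled": False},
}
```

With these values only pg is scraped and the metrics land in the debug exporter; flipping victoriametrics.enabled swaps in the remote write block without touching the sources.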

Grafana connects to both OpenSearch (for Suricata EVE data and application logs) and VictoriaMetrics (for infrastructure metrics). Three dashboards are provisioned automatically:

Suricata EVE Overview: alert counts by severity, top signatures, event type distribution over time. Queries the logstash-* indices that Fluentd already populates.
Application Logs: log volume per service, error rate trends, filterable log viewer. Queries the applogs-* indices from OTel.
Infrastructure Metrics: container CPU/memory, PostgreSQL connections, OpenSearch heap, NGINX request rates. Queries VictoriaMetrics.

The dashboards are JSON files shipped as Docker configs. No import step, no manual setup. Grafana’s provisioning system picks them up on startup.

NGINX auth gating

Grafana and the AI chat both need to be accessible only to authenticated users. Rather than configure authentication in each service separately, I use NGINX’s auth_request directive against Scirius’s existing session:

# Shared auth check for protected services
location = /scirius-auth-check {
    internal;
    proxy_pass http://scirius:8000/rest/rules/system_settings/;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
}

location /grafana/ {
    auth_request /scirius-auth-check;
    error_page 401 403 = /accounts/login/?next=$request_uri;
    proxy_pass http://grafana:3000/grafana/;
}

If you’re logged into Scirius, the subrequest succeeds (the REST endpoint returns 200 for authenticated users), and NGINX proxies through to Grafana. If you’re not logged in, NGINX redirects to Scirius’s login page with a next parameter that sends you back after authentication. One login for the whole platform.

The internal directive on the auth check location means it can’t be reached directly, only as a subrequest from other location blocks. Grafana and AI Chat share the same check, so there’s one auth configuration to maintain.
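
The decision NGINX makes for each protected request can be modelled as a small function. This Python sketch follows the documented auth_request rules (2xx allows, 401 and 403 deny, any other subrequest status is treated as an error) and the error_page line from the config above. The function name and return shape are hypothetical.

```python
def gate(auth_status, request_uri, login_path="/accounts/login/"):
    """Model the outcome of the auth subrequest for a protected location."""
    if 200 <= auth_status < 300:
        return ("proxy", None)  # session is valid: pass the request upstream
    if auth_status in (401, 403):
        # error_page hands these to the login page with a return target
        return ("login", f"{login_path}?next={request_uri}")
    return ("error", 500)  # auth_request treats any other status as an error
```

One consequence of the last branch: if the endpoint used for the check ever answers an anonymous request with a redirect instead of 401 or 403, the user gets a 500 rather than the login page, so the endpoint's behaviour for unauthenticated requests matters.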

LibreChat and MCP

The AI chat is LibreChat, a self-hosted ChatGPT alternative that supports multiple LLM providers and MCP servers. The deployment has five containers: LibreChat itself, MongoDB (required by LibreChat), an init container that creates a default user, a Scirius token init container, and an NGINX-based MCP proxy.

The MCP integration is the interesting part. LibreChat connects to two MCP servers:

mcpServers:
    scirius:
        type: streamable-http
        url: http://mcp-proxy:8080/scirius-mcp

    opensearch:
        type: streamable-http
        url: http://opensearch:9200/_plugins/_ml/mcp

The Scirius MCP server exposes IDS alerts, detection rules, and network talker data. The OpenSearch MCP server (built into OpenSearch 3.x) provides direct index queries. Together, they let the AI assistant investigate security events without the user writing OpenSearch queries by hand.

Scirius’s MCP endpoint requires token authentication, and LibreChat doesn’t support per-MCP-server auth headers. The solution is an NGINX sidecar that injects the token:

location /scirius-mcp {
    rewrite ^/scirius-mcp(.*)$ /mcp$1 break;
    proxy_pass http://scirius:8000;
    proxy_set_header Authorization "Token __SCIRIUS_TOKEN__";
}

The token is generated by the init container, which runs manage.py shell to create a DRF auth token for the default user and writes it to a shared Docker volume. The MCP proxy reads the token from the volume on startup, substitutes it into the nginx config template, and starts serving. LibreChat talks to http://mcp-proxy:8080/scirius-mcp, which transparently adds the auth header and proxies to Scirius.
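
What the sidecar does to each request is small enough to model directly. The Python sketch below applies the same path rewrite as the rewrite directive and adds the Authorization header; the function name and the sample token are made up for illustration.

```python
import re

def proxy_request(path, headers, token):
    """Mimic the sidecar: map /scirius-mcp... to /mcp... and add the token header."""
    upstream_path = re.sub(r"^/scirius-mcp(.*)$", r"/mcp\1", path)
    upstream_headers = dict(headers)  # copy so the caller's headers stay untouched
    upstream_headers["Authorization"] = f"Token {token}"
    return upstream_path, upstream_headers

# "0123abcd" stands in for the DRF token read from the shared volume.
path, headers = proxy_request(
    "/scirius-mcp", {"Content-Type": "application/json"}, "0123abcd"
)
```

LibreChat never sees the token: it only knows the proxy URL, and the secret lives in the sidecar's rendered config.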

The SSRF problem

LibreChat has built-in SSRF protection that blocks requests to private IP ranges and .internal hostnames. This makes sense for a public-facing chat app but breaks MCP connections to Docker Compose services, which resolve to private IPs on the internal network.

The fix is an allowlist in the LibreChat config:

mcpSettings:
    allowedDomains:
        - 'mcp-proxy'
        - 'opensearch'
        - 'host.docker.internal'

Without this, LibreChat silently drops MCP connections. The error in LibreChat’s logs just says “connection refused” with no mention of SSRF filtering. I found this by running curl from inside the LibreChat container (which worked), then reading LibreChat’s source to find the request filter.
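
To make the failure mode concrete, here is a minimal Python model of an SSRF filter with an allowlist. It is not LibreChat's implementation, only the general technique: refuse .internal names and private addresses unless the hostname is explicitly allowed. The function name and the sample IP addresses are assumptions.

```python
import ipaddress

def is_blocked(hostname, resolved_ip, allowed_domains=()):
    """Return True when an SSRF-style filter would refuse the request."""
    if hostname in allowed_domains:
        return False  # explicit allowlist wins
    if hostname.endswith(".internal"):
        return True
    return ipaddress.ip_address(resolved_ip).is_private

# The allowlist from the LibreChat config above.
allowed = ["mcp-proxy", "opensearch", "host.docker.internal"]
```

A Compose service such as mcp-proxy typically resolves to an address in a private range, so is_blocked("mcp-proxy", "172.18.0.5") is true until the name is on the allowlist. It also suggests why curl from inside the container succeeded: the network path is fine, and the refusal happens in the application before any connection is made.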

LibreChat is served behind NGINX at /aichat/. The NGINX auth_request check against Scirius already handles authentication, but LibreChat has its own login system that can’t be disabled cleanly. Rather than fork LibreChat, I inject a script that logs in the default user automatically:

sub_filter '</head>' '<script>(function(){
    if(localStorage.getItem("_ndral")) return;
    fetch("/aichat/api/auth/login", {
        method: "POST",
        headers: {"Content-Type": "application/json"},
        body: JSON.stringify({
            email: "admin@clearndr.local",
            password: "clearndr"
        })
    }).then(function(r){
        if(r.ok){
            localStorage.setItem("_ndral","1");
            window.location.reload()
        }
    })
})()</script></head>';
sub_filter_once on;

The injected script logs in once per browser, tracked via a localStorage flag; on later page loads it returns immediately. On the first visit it POSTs to LibreChat’s login API with the default credentials, stores the flag, and reloads. After that, LibreChat’s session cookie handles subsequent requests. The user sees a brief flash on first visit, then the chat interface loads directly.

This requires proxy_set_header Accept-Encoding "" on the location block. sub_filter can only rewrite uncompressed responses, and clearing the header stops LibreChat from compressing what it sends to NGINX. Responses from other locations are still compressed as usual.
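
A short Python sketch shows both behaviours: replacing only the first match, as sub_filter_once on does, and why a compressed body defeats the rewrite. The HTML and the snippet are made-up stand-ins for LibreChat's page and the injected script.

```python
import gzip

MARKER = b"</head>"

def inject_once(body, snippet):
    """Insert snippet before the first </head> only, leaving later matches alone."""
    return body.replace(MARKER, snippet + MARKER, 1)

# Sample page with a second literal </head> further down.
html = (b"<html><head><title>Chat</title></head>"
        b"<body><code></head></code></body></html>")
snippet = b"<script>/* auto-login */</script>"

rewritten = inject_once(html, snippet)

# The marker is not present as literal bytes in the compressed stream,
# so the same rewrite finds nothing to replace.
compressed = gzip.compress(html)
untouched = inject_once(compressed, snippet)
```

The uncompressed page gets exactly one copy of the snippet, before the first closing head tag. The gzip-compressed bytes come back unchanged, which is the silent failure you would see without the Accept-Encoding override.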

Go templates within Go templates

The LibreChat config (librechat.yaml.template) is a Go template that stamusctl renders at deploy time. But it also has runtime substitutions (the Scirius token, which isn’t known until the init containers run). So it’s a template that produces a template:

# stamusctl renders this at deploy time:
{{- if .Values.aichat.local_llm_url }}
  custom:
    - name: "Local LLM"
      baseURL: "{{ .Values.aichat.local_llm_url }}/v1"
      models:
        fetch: true
{{- end }}

stamusctl renders the Go template conditionals. The output file still has __SCIRIUS_TOKEN__ placeholders, which the start-librechat.sh entrypoint replaces with sed at container startup after reading the token from the shared volume. Two-phase rendering: template engine for structure, sed for secrets.

The start-librechat.sh script waits up to 60 seconds for the token file to appear, then generates the final config and exec’s into LibreChat’s original entrypoint. If the token never appears (init container failed), it uses a placeholder and LibreChat starts without MCP access, degraded but not broken.
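
The wait-then-degrade logic looks roughly like this in Python. The real entrypoint is a shell script using sed; this sketch only models its behaviour. The file name scirius-token, the fallback value, and the function names are hypothetical.

```python
import tempfile
import time
from pathlib import Path

PLACEHOLDER = "__SCIRIUS_TOKEN__"
FALLBACK = "unavailable"  # hypothetical stand-in used when no token appears

def wait_for_token(path, timeout=60.0, poll=1.0):
    """Poll for a non-empty token file; return None if it never shows up."""
    deadline = time.monotonic() + timeout
    while True:
        p = Path(path)
        if p.is_file():
            token = p.read_text().strip()
            if token:
                return token
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)

def render_config(template_text, token):
    """Fill in the token, or the fallback so startup still succeeds."""
    return template_text.replace(PLACEHOLDER, token or FALLBACK)

with tempfile.TemporaryDirectory() as shared_volume:
    token_file = Path(shared_volume) / "scirius-token"
    missing = wait_for_token(token_file, timeout=0)  # init container "failed"
    token_file.write_text("abc123\n")                 # init container "succeeded"
    found = wait_for_token(token_file, timeout=0)

degraded = render_config("Token __SCIRIUS_TOKEN__", missing)
working = render_config("Token __SCIRIUS_TOKEN__", found)
```

Returning None instead of raising keeps the failure local: the chat starts either way, and only the MCP calls that need the token fail.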

Everything is optional

All three components (OTel, Grafana+VictoriaMetrics, and AI Chat) default to false. A standard deployment is unchanged:

stamusctl compose init --default

Turning them on is one flag per service, plus an LLM API key for the chat in this example:

stamusctl compose init --default \
    otel.enabled=true \
    grafana.enabled=true \
    victoriametrics.enabled=true \
    aichat.enabled=true \
    aichat.anthropic_api_key=sk-ant-...

The template system from part 1 handles the rest. Each component’s compose fragment is wrapped in {{- if .Values.X.enabled }}, so disabled components produce zero containers, zero volumes, zero config. NGINX’s location blocks for /grafana/ and /aichat/ are similarly guarded. The dependency graph in compose (depends_on) only references services that will actually exist.

The config files (otel.config.yaml, grafana.config.yaml, etc.) use the same self-describing format as every other template parameter. When you run compose init, stamusctl discovers the new parameters, applies defaults, and prompts only for values that have no default and no CLI override. Adding AI chat didn’t require any CLI changes, just new template files and config declarations.
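
The resolution order described above (CLI override, then declared default, then prompt) can be sketched as follows. stamusctl is not written in Python and its internal data structures are not shown here, so the declaration format, the function names, and the fake prompt are all assumptions.

```python
def resolve(declared, overrides, prompt):
    """Resolve each parameter: CLI override first, then default, else prompt."""
    values = {}
    for name, spec in declared.items():
        if name in overrides:
            values[name] = overrides[name]
        elif "default" in spec:
            values[name] = spec["default"]
        else:
            values[name] = prompt(name)
    return values

# Hypothetical declarations, as if discovered from the template config files.
declared = {
    "aichat.enabled": {"default": False},
    "otel.enabled": {"default": False},
    "aichat.anthropic_api_key": {},  # assumed to have no default
}

asked = []

def fake_prompt(name):
    """Record which parameters would have triggered an interactive prompt."""
    asked.append(name)
    return "entered-by-user"

values = resolve(declared, {"aichat.enabled": True}, fake_prompt)
```

Because the loop iterates over whatever the templates declare, a new component adds entries to declared and nothing else changes, which matches the claim that no CLI changes were needed.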

The template repo is at github.com/StamusNetworks/stamusctl-public-templates.