NxMesh Architecture

Table of Contents

  1. Overview
  2. System Components
  3. Data Flow
  4. Communication Protocols
  5. Security Model
  6. Deployment Patterns
  7. Failure Handling
  8. Technology Stack

Overview

NxMesh follows a Control Plane / Data Plane architecture pattern, similar to service meshes like Istio or Linkerd, but specifically optimized for nginx management.

Design Principles

  1. Separation of Concerns: Master handles policy and state; Agent handles execution
  2. Eventual Consistency: Configuration changes propagate asynchronously
  3. Local Autonomy: Agents can operate independently during master outages
  4. Zero-Downtime Updates: Nginx reloads without dropping connections
  5. Observability First: Every action is observable and traceable

System Components

1. Master (Control Plane)

The Master is the brain of the system. It maintains the desired state and coordinates all agents.

┌──────────────────────────────────────────────────────────────────┐
│                         MASTER                                   │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────────────────────┐ │
│  │   API        │  │  Config      │  │    Event & Agent        │ │
│  │   Layer      │  │  Engine      │  │    Coordination         │ │
│  │              │  │              │  │                         │ │
│  │ ┌─────────┐  │  │ ┌─────────┐  │  │  ┌───────────────────┐  │ │
│  │ │ REST    │  │  │ │ Template│  │  │  │  Agent Registry   │  │ │
│  │ │ Handler │  │  │ │ Engine  │  │  │  │  (Connections)    │  │ │
│  │ └─────────┘  │  │ └─────────┘  │  │  └───────────────────┘  │ │
│  │ ┌─────────┐  │  │ ┌─────────┐  │  │  ┌───────────────────┐  │ │
│  │ │ gRPC    │  │  │ │ Version │  │  │  │  Event Bus        │  │ │
│  │ │ Server  │  │  │ │ Control │  │  │  │  (Config Dist.)   │  │ │
│  │ └─────────┘  │  │ └─────────┘  │  │  └───────────────────┘  │ │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │  ┌───────────────────┐  │ │
│  │ │ WebSocket│ │  │ │ Validator│ │  │  │  Broadcast        │  │ │
│  │ │ Handler  │ │  │ │          │ │  │  │  (Agent Updates)  │  │ │
│  │ └──────────┘ │  │ └──────────┘ │  │  └───────────────────┘  │ │
│  └──────────────┘  └──────────────┘  └─────────────────────────┘ │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐   │
│  │   Auth      │  │  Storage    │  │    Observability        │   │
│  │   Service   │  │  Layer      │  │                         │   │
│  │             │  │             │  │  ┌───────────────────┐  │   │
│  │ ┌─────────┐ │  │ ┌─────────┐ │  │  │  Metrics          │  │   │
│  │ │ JWT     │ │  │ │ Postgres│ │  │  │  (Prometheus)     │  │   │
│  │ │ OAuth2  │ │  │ │ (SeaORM)│ │  │  └───────────────────┘  │   │
│  │ └─────────┘ │  │ └─────────┘ │  │  ┌───────────────────┐  │   │
│  │ ┌─────────┐ │  │ ┌─────────┐ │  │  │  Tracing          │  │   │
│  │ │ Password│ │  │ │ Cache   │ │  │  │  (OpenTelemetry)  │  │   │
│  │ │ Login   │ │  │ │ (Redis) │ │  │  └───────────────────┘  │   │
│  │ └─────────┘ │  │ └─────────┘ │  │                         │   │
│  │ ┌─────────┐ │  │             │  │                         │   │
│  │ │ RBAC    │ │  │             │  │                         │   │
│  │ │ Engine  │ │  │             │  │                         │   │
│  │ └─────────┘ │  │             │  │                         │   │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘   │
└──────────────────────────────────────────────────────────────────┘

Master Responsibilities

| Module | Responsibility |
|--------|----------------|
| API Layer | HTTP REST API for external clients (CLI, Web UI, external systems) |
| Config Engine | Template rendering, validation, versioning |
| Event & Agent Coordination | Agent connection management, config event broadcasting |
| Auth Service | Authentication (JWT/OAuth2, Password) and authorization (RBAC) |
| Storage Layer | PostgreSQL for persistent state, Redis for caching |
| Observability | Metrics collection, distributed tracing, structured logging |
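
As a rough illustration of how the Event & Agent Coordination module might route a config event to connected agents, here is a minimal in-memory sketch in Python (chosen for brevity; the project itself is Rust). `AgentRegistry`, its method names, and the event shape are hypothetical and not taken from the NxMesh codebase.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List


class AgentRegistry:
    """Hypothetical registry: one outbound event queue per connected agent."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[dict]] = {}

    def connect(self, agent_id: str) -> None:
        self._queues[agent_id] = deque()

    def disconnect(self, agent_id: str) -> None:
        self._queues.pop(agent_id, None)

    def broadcast(self, event: dict, agent_ids: Iterable[str]) -> List[str]:
        """Queue the event for each affected agent that is currently connected.

        Returns the ids that received it; disconnected agents are skipped and
        would catch up when they reconnect.
        """
        delivered = []
        for agent_id in agent_ids:
            queue = self._queues.get(agent_id)
            if queue is not None:
                queue.append(event)
                delivered.append(agent_id)
        return delivered

    def pending(self, agent_id: str) -> List[dict]:
        return list(self._queues.get(agent_id, ()))
```

Calling `broadcast({"config_id": "c1", "version": 2}, ["a1", "a2"])` with only `a1` connected would return `["a1"]`, leaving `a2` to reconcile later.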

Future: High Availability Mode

For large-scale deployments, the master can be extended with:

  • Raft Consensus for leader election and state replication
  • Cluster Manager for coordinating multiple master instances

This is not required for single-organization, self-hosted deployments.

2. Agent (Data Plane)

The Agent is a lightweight sidecar that runs alongside each nginx instance.

┌─────────────────────────────────────────────────────────────────┐
│                         AGENT                                   │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │   Master    │  │  Nginx      │  │    Health Monitor       │  │
│  │   Client    │  │  Controller │  │                         │  │
│  │             │  │             │  │  ┌───────────────────┐  │  │
│  │ ┌─────────┐ │  │ ┌─────────┐ │  │  │  Nginx Health     │  │  │
│  │ │ gRPC    │ │  │ │ Config  │ │  │  │  (HTTP checks)    │  │  │
│  │ │ Client  │ │  │ │ Renderer│ │  │  └───────────────────┘  │  │
│  │ └─────────┘ │  │ └─────────┘ │  │  ┌───────────────────┐  │  │
│  │ ┌─────────┐ │  │ ┌─────────┐ │  │  │  System Metrics   │  │  │
│  │ │WebSocket│ │  │ │ Reload  │ │  │  │  (CPU/Mem/IO)     │  │  │
│  │ │ Client  │ │  │ │ Manager │ │  │  └───────────────────┘  │  │
│  │ └─────────┘ │  │ └─────────┘ │  │                         │  │
│  │ ┌─────────┐ │  │ ┌─────────┐ │  │  ┌───────────────────┐  │  │
│  │ │Reconnect│ │  │ │ Process │ │  │  │  Self-Health      │  │  │
│  │ │ Handler │ │  │ │ Signal  │ │  │  │  (Heartbeat)      │  │  │
│  │ └─────────┘ │  │ └─────────┘ │  │  └───────────────────┘  │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │   Metrics   │  │  Local      │  │    Watchdog             │  │
│  │   Exporter  │  │  Cache      │  │                         │  │
│  │             │  │             │  │  ┌───────────────────┐  │  │
│  │ ┌──────────┐│  │ ┌─────────┐ │  │  │  Config Drift     │  │  │
│  │ │Prometheus││  │ │ Config  │ │  │  │  Detection        │  │  │
│  │ │Endpoint  ││  │ │ State   │ │  │  └───────────────────┘  │  │
│  │ └──────────┘│  │ └─────────┘ │  │  ┌───────────────────┐  │  │
│  │ ┌─────────┐ │  │ ┌─────────┐ │  │  │  Auto-Recovery    │  │  │
│  │ │Statsd   │ │  │ │ Backup  │ │  │  │  (Nginx restart)  │  │  │
│  │ │Client   │ │  │ │ Files   │ │  │  └───────────────────┘  │  │
│  │ └─────────┘ │  │ └─────────┘ │  │                         │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Agent Responsibilities

| Module | Responsibility |
|--------|----------------|
| Master Client | Maintains persistent connection to master (gRPC + WebSocket fallback) |
| Nginx Controller | Generates configs, manages reloads, handles lifecycle |
| Health Monitor | Monitors nginx health, system resources, reports status |
| Metrics Exporter | Prometheus endpoint, statsd client for metrics |
| Local Cache | Caches configs for offline operation, backup/restore |
| Watchdog | Detects config drift, auto-recovery from failures |
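
One plausible way for the Watchdog to detect config drift is to compare a checksum of the file on disk with the checksum recorded when the config was last applied. The Python sketch below assumes SHA-256 and uses hypothetical helper names (`config_checksum`, `detect_drift`); the source does not specify the actual mechanism.

```python
import hashlib
from pathlib import Path


def config_checksum(path: Path) -> str:
    """SHA-256 hex digest of the file contents (hash choice is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def detect_drift(path: Path, expected_checksum: str) -> bool:
    """True when the file is missing or no longer matches the applied config."""
    if not path.exists():
        return True
    return config_checksum(path) != expected_checksum
```

A watchdog loop could call `detect_drift` on a timer and restore the cached config when it returns `True`.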

Data Flow

1. Configuration Push Flow

┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
│  User  │────▶│  API   │────▶│ Config │────▶│ Event  │────▶│ Agents │
│ Action │     │ Server │     │ Engine │     │  Bus   │     │        │
└────────┘     └────────┘     └────────┘     └────────┘     └────────┘
                                                                  │
                                                                  ▼
┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
│ Nginx  │◀────│ Config │◀────│Template│◀────│ gRPC   │◀────│ Agent  │
│Reloaded│     │Applied │     │ Render │     │ Stream │     │Receive │
└────────┘     └────────┘     └────────┘     └────────┘     └────────┘

Flow Description:

  1. User creates/updates configuration via API or Web UI
  2. Master validates and stores configuration in database
  3. Config Engine determines affected agents
  4. Event Bus broadcasts configuration change event
  5. Agents receive event via gRPC streaming
  6. Agent renders local nginx configuration from templates
  7. Agent validates new configuration (nginx -t)
  8. Agent applies configuration via graceful reload
  9. Agent reports status back to master
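
Steps 6 to 9 on the agent side can be sketched as follows. This is an illustrative Python outline, not the agent's actual (Rust) implementation: `apply_config` and the `"applied"` / `"rolled_back"` status strings are hypothetical, while `nginx -t` and `nginx -s reload` are the standard nginx commands for testing and gracefully reloading a configuration. The validator and reloader are injectable so the logic can be exercised without nginx installed.

```python
import os
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Callable


def nginx_test() -> bool:
    """Step 7: `nginx -t` exits with 0 when the configuration is valid."""
    return subprocess.run(["nginx", "-t"], capture_output=True).returncode == 0


def nginx_reload() -> bool:
    """Step 8: `nginx -s reload` asks the master process for a graceful reload."""
    return subprocess.run(["nginx", "-s", "reload"], capture_output=True).returncode == 0


def apply_config(
    rendered: str,
    target: Path,
    validate: Callable[[], bool] = nginx_test,
    reload: Callable[[], bool] = nginx_reload,
) -> str:
    """Swap in the rendered config, validate, reload, and roll back on failure."""
    backup = target.with_name(target.name + ".bak")
    had_previous = target.exists()
    if had_previous:
        shutil.copy2(target, backup)

    # Write to a temp file in the same directory, then rename over the target,
    # so nginx never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=target.parent, prefix=target.name + ".")
    with os.fdopen(fd, "w") as tmp:
        tmp.write(rendered)
    os.replace(tmp_path, target)

    if validate() and reload():
        return "applied"

    # Validation or reload failed: restore the previous config.
    if had_previous:
        os.replace(backup, target)
    else:
        target.unlink()
    return "rolled_back"
```

The returned status is what the agent would report back to the master in step 9.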

2. Health Reporting Flow

┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
│ Nginx  │────▶│ Agent  │────▶│ Master │────▶│  DB    │
│ Health │     │ Health │     │ API    │     │ Store  │
└────────┘     └────────┘     └────────┘     └────────┘
                    │
                    ▼
              ┌──────────┐
              │Prometheus│
              │  Server  │
              └──────────┘

Flow Description:

  1. Agent periodically checks nginx health (HTTP health endpoint)
  2. Agent collects system metrics (CPU, memory, connections)
  3. Agent sends health report to master via gRPC
  4. Master aggregates and stores in database
  5. Prometheus scrapes agent metrics endpoint
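
For step 5, the agent's metrics endpoint has to return metrics in the Prometheus text exposition format. A minimal Python sketch of that rendering is shown here; the metric names and the `nxmesh_agent` prefix are invented for illustration, and every value is exposed as a gauge for simplicity.

```python
from typing import Dict


def render_metrics(metrics: Dict[str, float], prefix: str = "nxmesh_agent") -> str:
    """Render gauges in Prometheus text format: a TYPE line, then `name value`."""
    lines = []
    for name in sorted(metrics):
        full_name = f"{prefix}_{name}"
        lines.append(f"# TYPE {full_name} gauge")
        lines.append(f"{full_name} {metrics[name]}")
    return "\n".join(lines) + "\n"
```

For example, `render_metrics({"nginx_up": 1})` produces a `# TYPE nxmesh_agent_nginx_up gauge` line followed by `nxmesh_agent_nginx_up 1`.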

3. Certificate Management Flow

┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
│ Let's  │◀────│ Master │────▶│ Agent  │────▶│ Nginx  │◀────│ Client │
│Encrypt │     │ ACME   │     │ Deploy │     │ Serve  │     │Request │
└────────┘     └────────┘     └────────┘     └────────┘     └────────┘

Flow Description:

  1. Master requests certificate from Let's Encrypt (ACME protocol)
  2. Master distributes certificate to relevant agents
  3. Agent stores certificate locally (encrypted at rest)
  4. Agent updates nginx configuration with new certificate
  5. Nginx serves HTTPS traffic with new certificate

Communication Protocols

Master-Agent Protocol

NxMesh uses a bidirectional gRPC stream as the primary communication channel between master and agents.

// agent.proto
syntax = "proto3";
package nxmesh.agent;

service AgentService {
  // Bidirectional streaming for real-time communication
  rpc Stream(stream AgentMessage) returns (stream MasterMessage);

  // Unary calls for specific operations
  rpc ReportHealth(HealthReport) returns (Ack);
  rpc ReportMetrics(MetricsBatch) returns (Ack);
}

message AgentMessage {
  string agent_id = 1;
  uint64 timestamp = 2;
  oneof payload {
    RegistrationRequest register = 3;
    HealthReport health = 4;
    ConfigStatus config_status = 5;
    MetricsBatch metrics = 6;
    LogBatch logs = 7;
  }
}

message MasterMessage {
  uint64 timestamp = 1;
  oneof payload {
    RegistrationResponse register_response = 2;
    ConfigUpdate config_update = 3;
    Command command = 4;
    Ack ack = 5;
  }
}

message ConfigUpdate {
  string config_id = 1;
  uint64 version = 2;
  repeated VirtualHost virtual_hosts = 3;
  repeated Upstream upstreams = 4;
  map<string, string> ssl_certificates = 5;
}
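
To make the `ConfigUpdate` payload more concrete, the Python sketch below shows how an agent might turn one upstream and one virtual host into nginx configuration text. The `VirtualHost` and `Upstream` messages are not defined in this excerpt, so the parameters used here (`name`, `servers`, `server_name`, `listen`, `upstream`) are assumptions; only the emitted nginx directives (`upstream`, `server`, `listen`, `server_name`, `proxy_pass`) are standard.

```python
from typing import List


def render_upstream(name: str, servers: List[str]) -> str:
    """Render an nginx `upstream` block listing each backend address."""
    lines = [f"upstream {name} {{"]
    lines += [f"    server {address};" for address in servers]
    lines.append("}")
    return "\n".join(lines)


def render_virtual_host(server_name: str, listen: int, upstream: str) -> str:
    """Render a `server` block that proxies all requests to the named upstream."""
    return "\n".join([
        "server {",
        f"    listen {listen};",
        f"    server_name {server_name};",
        "    location / {",
        f"        proxy_pass http://{upstream};",
        "    }",
        "}",
    ])
```

The document lists Handlebars as the template engine; plain string formatting is used here only to keep the example dependency-free.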

Connection Management

┌─────────────────────────────────────────────────────────────────────┐
│                        CONNECTION LIFECYCLE                         │
│                                                                     │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐           │
│  │  INIT   │───▶│ CONNECT │───▶│ STREAM  │───▶│  READY  │           │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘           │
│                      │              │              │                │
│                      ▼              ▼              ▼                │
│                 ┌─────────┐    ┌─────────┐    ┌─────────┐           │
│                 │  RETRY  │    │RECONNECT│    │  ERROR  │           │
│                 └─────────┘    └─────────┘    └─────────┘           │
│                                                                     │
│  Connection Parameters:                                             │
│  - Heartbeat interval: 30s                                          │
│  - Reconnect backoff: 1s, 2s, 4s, 8s... (max 60s)                   │
│  - gRPC keepalive: 10s ping, 20s timeout                            │
│  - TLS: Server-side TLS (auto-generated or custom)                  │
│  - Agent auth: Bootstrap token → Shared secret (HMAC)               │
└─────────────────────────────────────────────────────────────────────┘
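
The reconnect backoff listed above (1s, 2s, 4s, 8s, capped at 60s) can be expressed as a small generator. This Python sketch uses a hypothetical function name and omits jitter, which the source does not mention.

```python
from typing import Iterator


def reconnect_delays(base: float = 1.0, cap: float = 60.0) -> Iterator[float]:
    """Yield reconnect delays in seconds, doubling each time up to `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)
```

The first eight delays are 1, 2, 4, 8, 16, 32, 60, 60; a reconnect handler would restart the generator after a successful connection.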

Security Model

Authentication

| Component | Method | Details |
|-----------|--------|---------|
| Master API | JWT (RS256) | Short-lived access tokens, refresh tokens |
| Master WebSocket | JWT | Same tokens as API |
| Master-Agent gRPC | TLS + Shared Secret | Server TLS + bootstrap token → session HMAC |
| Agent Registration | One-time Bootstrap Token | Generated in Master UI, single-use, short expiry |

Agent Authentication Flow (TLS + Shared Secret)

┌─────────────┐                                    ┌──────────────┐
│    Agent    │                                    │    Master    │
└──────┬──────┘                                    └──────┬───────┘
       │                                                 │
       │  1. TLS Handshake (verify server certificate)   │
       │◄───────────────────────────────────────────────►│
       │                                                 │
       │  2. Register with bootstrap_token               │
       │  ── gRPC: RegisterAgent { token } ─────────────▶│
       │                                                 │
       │  3. Receive agent_id + session_key (+ key_id)   │
       │◄────────────────────────────────────────────────│
       │     [Encrypted over TLS]                        │
       │                                                 │
       │  4. Subsequent requests: HMAC-signed            │
       │  ── gRPC + Headers:                             │
       │     X-Agent-ID: <agent_id>                      │
       │     X-Key-ID: <session_key_id>                  │
       │     X-Signature: HMAC(request_body, session_key)│
       │────────────────────────────────────────────────▶│
       │                                                 │
       │  5. Key Rotation (primary/secondary)            │
       │◄═══════════════════════════════════════════════►│

Security Properties:

  • TLS: Encrypts channel, verifies master identity (server cert)
  • Bootstrap Token: One-time use, time-limited, proves initial identity
  • Session Key: Per-agent secret, used for HMAC request signing
  • Key Rotation: Primary/secondary key design for seamless rotation
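
Step 4 of the flow and the primary/secondary rotation scheme can be sketched in Python as below. The header names come from the diagram; the use of SHA-256 as the HMAC hash, the hex encoding, and the function names are assumptions, since the source only says "HMAC".

```python
import hashlib
import hmac
from typing import Dict


def sign(body: bytes, session_key: bytes) -> str:
    """HMAC over the request body (SHA-256 and hex output are assumptions)."""
    return hmac.new(session_key, body, hashlib.sha256).hexdigest()


def build_headers(agent_id: str, key_id: str, body: bytes, session_key: bytes) -> Dict[str, str]:
    """Agent side: attach identity, key id, and signature to a request."""
    return {
        "X-Agent-ID": agent_id,
        "X-Key-ID": key_id,
        "X-Signature": sign(body, session_key),
    }


def verify(headers: Dict[str, str], body: bytes, keys: Dict[str, bytes]) -> bool:
    """Master side: `keys` maps key ids to secrets and may hold both the
    primary and the secondary key while a rotation is in progress."""
    key = keys.get(headers.get("X-Key-ID", ""))
    if key is None:
        return False
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(sign(body, key), headers.get("X-Signature", ""))
```

During rotation the master would accept signatures made with either key id, then drop the old key once the agent has switched to the new one.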

Authorization (RBAC)

# Example RBAC Configuration
roles:
  admin:
    permissions:
      - "*:*"

  operator:
    permissions:
      - "config:read"
      - "config:write"
      - "agent:read"
      - "agent:reload"

  viewer:
    permissions:
      - "config:read"
      - "agent:read"
      - "metrics:read"

# Resource hierarchy
resources:
  - organization:
      - workspace:
          - agent
          - certificate
          - config   # virtual_host, upstream
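
A permission check against the roles above might look like the following Python sketch. The role table mirrors the example configuration; treating `*` as a wildcard for either the resource or the action segment is an assumption, since the source only shows `"*:*"`.

```python
from typing import Dict, List

ROLES: Dict[str, List[str]] = {
    "admin": ["*:*"],
    "operator": ["config:read", "config:write", "agent:read", "agent:reload"],
    "viewer": ["config:read", "agent:read", "metrics:read"],
}


def is_allowed(role: str, resource: str, action: str) -> bool:
    """True when any of the role's `resource:action` permissions matches."""
    for permission in ROLES.get(role, []):
        perm_resource, perm_action = permission.split(":", 1)
        if perm_resource in ("*", resource) and perm_action in ("*", action):
            return True
    return False
```

With this table, `is_allowed("operator", "agent", "reload")` is `True` while `is_allowed("viewer", "config", "write")` is `False`.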

Deployment Patterns

Pattern 1: Docker Sidecar (Development/Single Host)

# docker-compose.yml
version: '3.8'

services:
  nxmesh-master:
    image: nxmesh/master:latest
    ports:
      - "8080:8080"   # API
      - "8443:8443"   # gRPC
    environment:
      - DATABASE_URL=postgres://...

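  nginx-site-a:
    image: nginx:alpine
    volumes:
      - site-a-html:/usr/share/nginx/html
      - site-a-conf:/etc/nginx/conf.d      # Assumed: config dir shared with the agent, as in Pattern 2

  nxmesh-agent-a:
    image: nxmesh/agent:latest
    network_mode: service:nginx-site-a  # Share network namespace with nginx
    pid: service:nginx-site-a            # Share PID namespace (for nginx reload)
    volumes:
      - site-a-conf:/etc/nginx/conf.d      # Assumed: agent writes rendered config here
    environment:
      - NXMESH_MASTER_URL=wss://nxmesh-master:8443
      - NXMESH_AGENT_TOKEN=${AGENT_TOKEN_A}
      - NXMESH_DEPLOYMENT_MODE=docker_sidecar
      - NXMESH_NGINX_PID_FILE=/var/run/nginx.pid

# Named volumes must be declared at the top level
volumes:
  site-a-html:
  site-a-conf: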

Pros: Simple, isolated, good for development

Cons: Docker-only, single host limitation

Pattern 2: Kubernetes Sidecar

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
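spec:
  replicas: 3
  selector:                  # Required by apps/v1; the label value is illustrative
    matchLabels:
      app: web-service
  template:
    metadata:
      labels:
        app: web-service
    spec:
      shareProcessNamespace: true  # Assumed: lets the agent signal nginx, like `pid:` in Pattern 1
      containers: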
        - name: nginx
          image: nginx:alpine
          volumeMounts:
            - name: nxmesh-config
              mountPath: /etc/nginx/conf.d

        - name: nxmesh-agent
          image: nxmesh/agent:latest
          env:
            - name: NXMESH_MASTER_URL
              value: "wss://nxmesh-master.default.svc:8443"
            - name: NXMESH_AGENT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: nxmesh-agent-token
                  key: token
          volumeMounts:
            - name: nxmesh-config
              mountPath: /etc/nginx/conf.d
      volumes:
        - name: nxmesh-config
          emptyDir: {}

Pros: Native K8s integration, auto-scaling, health checks

Cons: K8s-only, more complex setup

Pattern 3: Standalone (VM/Bare Metal)

┌─────────────────────────────────────────────────────────────────┐
│                         VM / Bare Metal                         │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Systemd                                                  │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │ nxmesh-agent.service                                │  │  │
│  │  │  ┌──────────────┐  ┌──────────────┐  ┌───────────┐  │  │  │
│  │  │  │   Agent      │  │   Nginx      │  │  Config   │  │  │  │
│  │  │  │   Process    │──│   Process    │──│  Files    │  │  │  │
│  │  │  └──────────────┘  └──────────────┘  └───────────┘  │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Pros: Works anywhere, minimal dependencies

Cons: Manual setup, no container isolation


Failure Handling

Master Failure Scenarios

| Scenario | Impact | Mitigation |
|----------|--------|------------|
| Master unreachable | Agents continue with cached config | Agents retry with exponential backoff |
| Master crashes | New connections fail, existing continue | External load balancer + health checks (HA: future) |
| Database down | Read-only mode for existing configs | Database replication, failover |

Agent Failure Scenarios

| Scenario | Impact | Mitigation |
|----------|--------|------------|
| Agent crashes | Nginx continues running | Systemd restart, watchdog |
| Config validation fails | Previous config kept | Atomic config swap, rollback |
| Nginx crashes | Agent restarts nginx | Health checks, auto-restart |
| Network partition | Agent operates in "island mode" | Local cache, reconciliation on reconnect |

Recovery Procedures

┌─────────────────────────────────────────────────────────────────────┐
│                        FAILURE RECOVERY FLOW                        │
│                                                                     │
│  Agent Disconnect                                                   │
│       │                                                             │
│       ▼                                                             │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐           │
│  │  Retry  │───▶│  Cache  │───▶│  Alert  │───▶│  Watch  │           │
│  │ Connect │    │ Config  │    │ Master  │    │  Dog    │           │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘           │
│       │                                            │                │
│       ▼                                            ▼                │
│ ┌───────────┐                                 ┌─────────┐           │
│ │Reconnected│                                 │ Restart │           │
│ │   Sync    │                                 │  Nginx  │           │
│ └───────────┘                                 └─────────┘           │
│                                                                     │
│  Recovery Strategies:                                               │
│  1. Exponential backoff for reconnection                            │
│  2. Circuit breaker for failed operations                           │
│  3. Config checksum verification after reconnect                    │
│  4. Automatic nginx restart on health check failure                 │
└─────────────────────────────────────────────────────────────────────┘
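
Strategy 2, the circuit breaker, is sketched below in Python. The class name, thresholds, and the injectable clock are hypothetical; the source does not say which operations are guarded or what limits apply.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop calling a failing operation until a cool-down period has passed."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True when closed, or when open long enough to permit a trial call."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            # Open (or re-open) the circuit and restart the cool-down.
            self.opened_at = self.clock()
```

A caller would check `allow()` before an operation such as an nginx reload, then report the outcome with `record_success()` or `record_failure()`.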

Technology Stack

| Layer | Technology | Rationale |
|-------|------------|-----------|
| Master Backend | Rust (Axum) | Performance, safety, async ecosystem |
| Agent | Rust (Tokio) | Small binary, low memory, fast startup |
| Database | PostgreSQL | ACID, JSON support, reliability |
| Cache | Redis | Fast key-value, pub/sub for events |
| Frontend | React + Vite (embedded) | Static build served by master, fast HMR in dev |
| gRPC | Tonic | Native Rust implementation |
| ORM | SeaORM | Async, type-safe, migration support |
| Config Template | Handlebars | Logic-less, secure templating |
| Metrics | Prometheus | Industry standard, rich ecosystem |
| Tracing | OpenTelemetry | Vendor-neutral, future-proof |