NxMesh Architecture
Table of Contents
- Overview
- System Components
- Data Flow
- Communication Protocols
- Security Model
- Deployment Patterns
- Failure Handling
- Technology Stack
Overview
NxMesh follows a Control Plane / Data Plane architecture pattern, similar to service meshes like Istio or Linkerd, but specifically optimized for nginx management.
Design Principles
- Separation of Concerns: Master handles policy and state; Agent handles execution
- Eventual Consistency: Configuration changes propagate asynchronously
- Local Autonomy: Agents can operate independently during master outages
- Zero-Downtime Updates: Nginx reloads without dropping connections
- Observability First: Every action is observable and traceable
System Components
1. Master (Control Plane)
The Master is the brain of the system. It maintains the desired state and coordinates all agents.
┌──────────────────────────────────────────────────────────────────┐
│ MASTER │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────────────┐ │
│ │ API │ │ Config │ │ Event & Agent │ │
│ │ Layer │ │ Engine │ │ Coordination │ │
│ │ │ │ │ │ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌───────────────────┐ │ │
│ │ │ REST │ │ │ │ Template│ │ │ │ Agent Registry │ │ │
│ │ │ Handler │ │ │ │ Engine │ │ │ │ (Connections) │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ └───────────────────┘ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌───────────────────┐ │ │
│ │ │ gRPC │ │ │ │ Version │ │ │ │ Event Bus │ │ │
│ │ │ Server │ │ │ │ Control │ │ │ │ (Config Dist.) │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ └───────────────────┘ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌───────────────────┐ │ │
│ │ │ WebSocket│ │ │ │ Validator│ │ │ │ Broadcast │ │ │
│ │ │ Handler │ │ │ │ │ │ │ │ (Agent Updates) │ │ │
│ │ └──────────┘ │ │ └──────────┘ │ │ └───────────────────┘ │ │
│ └──────────────┘ └──────────────┘ └─────────────────────────┘ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Auth │ │ Storage │ │ Observability │ │
│ │ Service │ │ Layer │ │ │ │
│ │ │ │ │ │ ┌───────────────────┐ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Metrics │ │ │
│ │ │ JWT │ │ │ │ Postgres│ │ │ │ (Prometheus) │ │ │
│ │ │ OAuth2 │ │ │ │ (SeaORM)│ │ │ └───────────────────┘ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ ┌───────────────────┐ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Tracing │ │ │
│ │ │ Password│ │ │ │ Cache │ │ │ │ (OpenTelemetry) │ │ │
│ │ │ Login │ │ │ │ (Redis) │ │ │ └───────────────────┘ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ │ │
│ │ ┌─────────┐ │ │ │ │ │ │
│ │ │ RBAC │ │ │ │ │ │ │
│ │ │ Engine │ │ │ │ │ │ │
│ │ └─────────┘ │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
Master Responsibilities
| Module | Responsibility |
|---|---|
| API Layer | REST API for external clients (CLI, Web UI, external systems); also hosts the gRPC server and WebSocket handler |
| Config Engine | Template rendering, validation, versioning |
| Event & Agent Coordination | Agent connection management, config event broadcasting |
| Auth Service | Authentication (JWT/OAuth2, Password) and authorization (RBAC) |
| Storage Layer | PostgreSQL for persistent state, Redis for caching |
| Observability | Metrics collection, distributed tracing, structured logging |
Future: High Availability Mode
For large-scale deployments, the master can be extended with:
- Raft Consensus for leader election and state replication
- Cluster Manager for coordinating multiple master instances
This mode is not required for single-organization, self-hosted deployments.
2. Agent (Data Plane)
The Agent is a lightweight sidecar that runs alongside each nginx instance.
┌─────────────────────────────────────────────────────────────────┐
│ AGENT │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Master │ │ Nginx │ │ Health Monitor │ │
│ │ Client │ │ Controller │ │ │ │
│ │ │ │ │ │ ┌───────────────────┐ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Nginx Health │ │ │
│ │ │ gRPC │ │ │ │ Config │ │ │ │ (HTTP checks) │ │ │
│ │ │ Client │ │ │ │ Renderer│ │ │ └───────────────────┘ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ ┌───────────────────┐ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ System Metrics │ │ │
│ │ │ WebSocket│ │ │ │ Reload │ │ │ │ (CPU/Mem/IO) │ │ │
│ │ │ Client │ │ │ │ Manager │ │ │ └───────────────────┘ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌───────────────────┐ │ │
│ │ │ Reconnect│ │ │ │ Process │ │ │ │ Self-Health │ │ │
│ │ │ Handler │ │ │ │ Signal │ │ │ │ (Heartbeat) │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ └───────────────────┘ │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Metrics │ │ Local │ │ Watchdog │ │
│ │ Exporter │ │ Cache │ │ │ │
│ │ │ │ │ │ ┌───────────────────┐ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Config Drift │ │ │
│ │ │Prometheus│ │ │ │ Config │ │ │ │ Detection │ │ │
│ │ │Endpoint │ │ │ │ State │ │ │ └───────────────────┘ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ ┌───────────────────┐ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Auto-Recovery │ │ │
│ │ │Statsd │ │ │ │ Backup │ │ │ │ (Nginx restart) │ │ │
│ │ │Client │ │ │ │ Files │ │ │ └───────────────────┘ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Agent Responsibilities
| Module | Responsibility |
|---|---|
| Master Client | Maintains persistent connection to master (gRPC + WebSocket fallback) |
| Nginx Controller | Generates configs, manages reloads, handles lifecycle |
| Health Monitor | Monitors nginx health, system resources, reports status |
| Metrics Exporter | Prometheus endpoint, statsd client for metrics |
| Local Cache | Caches configs for offline operation, backup/restore |
| Watchdog | Detects config drift, auto-recovery from failures |
Data Flow
1. Configuration Push Flow
┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
│  User  │────▶│  API   │────▶│ Config │────▶│ Event  │────▶│ Agents │
│ Action │     │ Server │     │ Engine │     │  Bus   │     │        │
└────────┘     └────────┘     └────────┘     └────────┘     └────────┘
                                                                │
                                                                ▼
┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
│ Nginx  │◀────│ Config │◀────│Template│◀────│  gRPC  │◀────│ Agent  │
│Reloaded│     │Applied │     │ Render │     │ Stream │     │Receive │
└────────┘     └────────┘     └────────┘     └────────┘     └────────┘
Flow Description:
- User creates/updates configuration via API or Web UI
- Master validates and stores configuration in database
- Config Engine determines affected agents
- Event Bus broadcasts configuration change event
- Agents receive event via gRPC streaming
- Agent renders local nginx configuration from templates
- Agent validates new configuration (`nginx -t`)
- Agent applies configuration via graceful reload
- Agent reports status back to master
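The agent-side half of this flow (validate, apply, report) can be sketched as a small state holder. This is a minimal sketch, not the NxMesh implementation: `ConfigSlot`, `ApplyOutcome`, and the `validate` and `reload` callbacks are hypothetical names, standing in for `nginx -t` and the graceful reload respectively.

```rust
/// Result the agent would report back to the master (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum ApplyOutcome {
    Applied,
    RejectedByValidation,
    RolledBack,
}

/// Holds the configuration text that nginx is currently running with.
pub struct ConfigSlot {
    pub active: String,
}

impl ConfigSlot {
    /// Validate the candidate, swap it in, and reload.
    /// `validate` stands in for `nginx -t`; `reload` for the graceful reload.
    pub fn apply<V, R>(&mut self, candidate: &str, validate: V, mut reload: R) -> ApplyOutcome
    where
        V: Fn(&str) -> bool,
        R: FnMut(&str) -> bool,
    {
        // A candidate that fails validation never replaces the active config.
        if !validate(candidate) {
            return ApplyOutcome::RejectedByValidation;
        }
        let previous = std::mem::replace(&mut self.active, candidate.to_string());
        if reload(&self.active) {
            ApplyOutcome::Applied
        } else {
            // Reload failed: restore the previous config and reload it again.
            self.active = previous;
            let _ = reload(&self.active);
            ApplyOutcome::RolledBack
        }
    }
}
```

Passing the two steps as closures keeps the decision logic testable without a running nginx; a real agent would presumably shell out to nginx in both callbacks.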
2. Health Reporting Flow
┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
│ Nginx  │────▶│ Agent  │────▶│ Master │────▶│   DB   │
│ Health │     │ Health │     │  API   │     │ Store  │
└────────┘     └────────┘     └────────┘     └────────┘
                   │
                   ▼
              ┌──────────┐
              │Prometheus│
              │  Server  │
              └──────────┘
Flow Description:
- Agent periodically checks nginx health (HTTP health endpoint)
- Agent collects system metrics (CPU, memory, connections)
- Agent sends health report to master via gRPC
- Master aggregates and stores in database
- Prometheus scrapes agent metrics endpoint
3. Certificate Management Flow
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Let's │◀────│ Master │────▶│ Agent │────▶│ Nginx │◀────│ Client │
│Encrypt │ │ ACME │ │ Deploy │ │ Serve │ │Request │
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
Flow Description:
- Master requests certificate from Let's Encrypt (ACME protocol)
- Master distributes certificate to relevant agents
- Agent stores certificate locally (encrypted at rest)
- Agent updates nginx configuration with new certificate
- Nginx serves HTTPS traffic with new certificate
Communication Protocols
Master-Agent Protocol
NxMesh uses a bidirectional gRPC stream as the primary communication channel between master and agents.
// agent.proto
syntax = "proto3";
package nxmesh.agent;

service AgentService {
  // Bidirectional streaming for real-time communication
  rpc Stream(stream AgentMessage) returns (stream MasterMessage);
  // Unary calls for specific operations
  rpc ReportHealth(HealthReport) returns (Ack);
  rpc ReportMetrics(MetricsBatch) returns (Ack);
}

message AgentMessage {
  string agent_id = 1;
  uint64 timestamp = 2;
  oneof payload {
    RegistrationRequest register = 3;
    HealthReport health = 4;
    ConfigStatus config_status = 5;
    MetricsBatch metrics = 6;
    LogBatch logs = 7;
  }
}

message MasterMessage {
  uint64 timestamp = 1;
  oneof payload {
    RegistrationResponse register_response = 2;
    ConfigUpdate config_update = 3;
    Command command = 4;
    Ack ack = 5;
  }
}

message ConfigUpdate {
  string config_id = 1;
  uint64 version = 2;
  repeated VirtualHost virtual_hosts = 3;
  repeated Upstream upstreams = 4;
  map<string, string> ssl_certificates = 5;
}
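On the agent, the `oneof payload` of `MasterMessage` maps naturally onto a Rust enum that is dispatched with `match`. The sketch below is hand-written rather than generated code, and it makes several assumptions: the fields inside `RegisterResponse` and `Command` are invented for illustration (the schema above does not define them), and skipping updates whose `version` is not newer than the applied one is one plausible policy, not a documented one.

```rust
/// Hand-written stand-in for the `MasterMessage.payload` oneof.
#[derive(Debug, Clone, PartialEq)]
pub enum MasterPayload {
    RegisterResponse { agent_id: String }, // field is an assumption
    ConfigUpdate { config_id: String, version: u64 },
    Command(String), // command name; shape is an assumption
    Ack,
}

/// What the agent decides to do with one incoming message.
#[derive(Debug, PartialEq)]
pub enum Action {
    StoreIdentity(String),
    ApplyConfig(u64),
    SkipStale(u64),
    RunCommand(String),
    Nothing,
}

pub struct AgentState {
    /// Highest config version applied so far.
    pub applied_version: u64,
}

impl AgentState {
    pub fn handle(&mut self, payload: MasterPayload) -> Action {
        match payload {
            MasterPayload::RegisterResponse { agent_id } => Action::StoreIdentity(agent_id),
            // Assumed policy: ignore updates that are not newer than the applied one.
            MasterPayload::ConfigUpdate { version, .. } if version <= self.applied_version => {
                Action::SkipStale(version)
            }
            MasterPayload::ConfigUpdate { version, .. } => {
                self.applied_version = version;
                Action::ApplyConfig(version)
            }
            MasterPayload::Command(name) => Action::RunCommand(name),
            MasterPayload::Ack => Action::Nothing,
        }
    }
}
```

Because the `match` is exhaustive, adding a new payload variant to the enum forces every handler to be updated at compile time.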
Connection Management
┌─────────────────────────────────────────────────────────────────────┐
│ CONNECTION LIFECYCLE │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ INIT │───▶│ CONNECT │───▶│ STREAM │───▶│ READY │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ RETRY │ │RECONNECT│ │ ERROR │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Connection Parameters: │
│ - Heartbeat interval: 30s │
│ - Reconnect backoff: 1s, 2s, 4s, 8s... (max 60s) │
│ - gRPC keepalive: 10s ping, 20s timeout │
│ - TLS: Server-side TLS (auto-generated or custom) │
│ - Agent auth: Bootstrap token → Shared secret (HMAC) │
└─────────────────────────────────────────────────────────────────────┘
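The reconnect backoff listed above (1s, 2s, 4s, 8s, capped at 60s) can be computed from the attempt number alone. A minimal sketch follows; the function and constant names are hypothetical, and attempts are counted from zero.

```rust
use std::time::Duration;

const BASE_DELAY_SECS: u64 = 1;
const MAX_DELAY_SECS: u64 = 60;

/// Delay before reconnect attempt number `attempt` (0-based):
/// 1s, 2s, 4s, 8s, ... doubling until capped at 60s.
pub fn reconnect_delay(attempt: u32) -> Duration {
    // 2^attempt, saturating instead of overflowing for very large attempt counts.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    let secs = BASE_DELAY_SECS.saturating_mul(factor).min(MAX_DELAY_SECS);
    Duration::from_secs(secs)
}
```

Deployments with many agents often add random jitter on top of such a schedule so that agents do not all reconnect at the same instant; the parameters above do not mention jitter, so it is left out here.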
Security Model
Authentication
| Component | Method | Details |
|---|---|---|
| Master API | JWT (RS256) | Short-lived access tokens, refresh tokens |
| Master WebSocket | JWT | Same tokens as API |
| Master-Agent gRPC | TLS + Shared Secret | Server TLS + bootstrap token → session HMAC |
| Agent Registration | One-time Bootstrap Token | Generated in Master UI, single-use, short expiry |
Agent Authentication Flow (TLS + Shared Secret)
┌─────────────┐ ┌──────────────┐
│ Agent │ │ Master │
└──────┬──────┘ └──────┬───────┘
│ │
│ 1. TLS Handshake (verify server certificate) │
│◄───────────────────────────────────────────────►│
│ │
│ 2. Register with bootstrap_token │
│ ── gRPC: RegisterAgent { token } ─────────────▶│
│ │
│ 3. Receive agent_id + session_key (+ key_id) │
│◄────────────────────────────────────────────────│
│ [Encrypted over TLS] │
│ │
│ 4. Subsequent requests: HMAC-signed │
│ ── gRPC + Headers: │
│ X-Agent-ID: <agent_id> │
│ X-Key-ID: <session_key_id> │
│ X-Signature: HMAC(request_body, session_key)│
│────────────────────────────────────────────────▶│
│ │
│ 5. Key Rotation (primary/secondary) │
│◄═══════════════════════════════════════════════►│
Security Properties:
- TLS: Encrypts channel, verifies master identity (server cert)
- Bootstrap Token: One-time use, time-limited, proves initial identity
- Session Key: Per-agent secret, used for HMAC request signing
- Key Rotation: Primary/secondary key design for seamless rotation
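The primary/secondary rotation can be illustrated on the verifying side: after a rotation the previous key stays valid as the secondary, so requests signed just before the rotation still verify. This is a sketch under assumptions: `KeyRing` and `SessionKey` are hypothetical names, and the MAC function is passed in as a parameter because the Rust standard library has no HMAC; a real deployment would supply an HMAC from a vetted cryptography crate.

```rust
pub struct SessionKey {
    pub id: String,
    pub secret: Vec<u8>,
}

/// Keys currently accepted for one agent.
pub struct KeyRing {
    primary: SessionKey,
    secondary: Option<SessionKey>,
}

impl KeyRing {
    pub fn new(primary: SessionKey) -> Self {
        Self { primary, secondary: None }
    }

    /// Install a new primary key; the old primary stays valid as secondary.
    pub fn rotate(&mut self, new_primary: SessionKey) {
        let old = std::mem::replace(&mut self.primary, new_primary);
        self.secondary = Some(old);
    }

    /// Look up a key by the id sent in the X-Key-ID header.
    fn find(&self, key_id: &str) -> Option<&SessionKey> {
        if self.primary.id == key_id {
            Some(&self.primary)
        } else {
            self.secondary.as_ref().filter(|k| k.id == key_id)
        }
    }

    /// Recompute the MAC with the referenced key and compare it to the signature.
    /// `mac(key, message)` is supplied by the caller (assumed to be an HMAC).
    pub fn verify<M>(&self, key_id: &str, body: &[u8], signature: &[u8], mac: M) -> bool
    where
        M: Fn(&[u8], &[u8]) -> Vec<u8>,
    {
        match self.find(key_id) {
            Some(key) => bytes_equal_no_early_exit(&mac(&key.secret, body), signature),
            None => false,
        }
    }
}

/// Compares every byte instead of returning at the first mismatch.
fn bytes_equal_no_early_exit(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}
```

A second rotation drops the oldest key, so at most two keys are accepted at any time.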
Authorization (RBAC)
# Example RBAC Configuration
roles:
  admin:
    permissions:
      - "*:*"
  operator:
    permissions:
      - "config:read"
      - "config:write"
      - "agent:read"
      - "agent:reload"
  viewer:
    permissions:
      - "config:read"
      - "agent:read"
      - "metrics:read"

# Resource hierarchy
resources:
  - organization
  - workspace
  - agent
  - certificate
  - config (virtual_host, upstream)
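Checking a request against these `resource:action` permissions reduces to a string match with wildcard support. The sketch below assumes that `*` may stand for either segment independently; the example configuration only shows `*:*`, so partial wildcards such as `config:*` are an assumption, and both function names are hypothetical.

```rust
/// True if a single granted permission covers the required one.
/// Both are expected in `resource:action` form; `*` matches any segment.
pub fn permission_matches(granted: &str, required: &str) -> bool {
    match (granted.split_once(':'), required.split_once(':')) {
        (Some((g_res, g_act)), Some((r_res, r_act))) => {
            (g_res == "*" || g_res == r_res) && (g_act == "*" || g_act == r_act)
        }
        // Anything not in `resource:action` form is denied.
        _ => false,
    }
}

/// True if any permission of the role covers the required one.
pub fn is_allowed(granted: &[&str], required: &str) -> bool {
    granted.iter().any(|g| permission_matches(g, required))
}
```

With the roles above, `is_allowed(&["*:*"], "agent:reload")` holds for the admin, while a viewer holding only read permissions is denied `config:write`.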
Deployment Patterns
Pattern 1: Docker Sidecar (Development/Single Host)
# docker-compose.yml
version: '3.8'
services:
  nxmesh-master:
    image: nxmesh/master:latest
    ports:
      - "8080:8080" # API
      - "8443:8443" # gRPC
    environment:
      - DATABASE_URL=postgres://...
  nginx-site-a:
    image: nginx:alpine
    volumes:
      - site-a-html:/usr/share/nginx/html
  nxmesh-agent-a:
    image: nxmesh/agent:latest
    network_mode: service:nginx-site-a # Share network namespace with nginx
    pid: service:nginx-site-a # Share PID namespace (for nginx reload)
    environment:
      - NXMESH_MASTER_URL=wss://nxmesh-master:8443
      - NXMESH_AGENT_TOKEN=${AGENT_TOKEN_A}
      - NXMESH_DEPLOYMENT_MODE=docker_sidecar
      - NXMESH_NGINX_PID_FILE=/var/run/nginx.pid
volumes:
  site-a-html: {} # Named volumes must be declared at the top level
Pros: Simple, isolated, good for development
Cons: Docker-only, limited to a single host
Pattern 2: Kubernetes Sidecar
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
spec:
  replicas: 3
  selector: # Required by apps/v1; the label value is illustrative
    matchLabels:
      app: web-service
  template:
    metadata:
      labels:
        app: web-service
    spec:
      containers:
        - name: nginx
          image: nginx:alpine
          volumeMounts:
            - name: nxmesh-config
              mountPath: /etc/nginx/conf.d
        - name: nxmesh-agent
          image: nxmesh/agent:latest
          env:
            - name: NXMESH_MASTER_URL
              value: "wss://nxmesh-master.default.svc:8443"
            - name: NXMESH_AGENT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: nxmesh-agent-token
                  key: token
          volumeMounts:
            - name: nxmesh-config
              mountPath: /etc/nginx/conf.d
      volumes:
        - name: nxmesh-config
          emptyDir: {}
Pros: Native K8s integration, auto-scaling, health checks
Cons: K8s-only, more complex setup
Pattern 3: Standalone (VM/Bare Metal)
┌─────────────────────────────────────────────────────────────────┐
│ VM / Bare Metal │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Systemd │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ nxmesh-agent.service │ │ │
│ │ │ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │ │ │
│ │ │ │ Agent │ │ Nginx │ │ Config │ │ │ │
│ │ │ │ Process │──│ Process │──│ Files │ │ │ │
│ │ │ └──────────────┘ └──────────────┘ └───────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Pros: Works anywhere, minimal dependencies
Cons: Manual setup, no container isolation
Failure Handling
Master Failure Scenarios
| Scenario | Impact | Mitigation |
|---|---|---|
| Master unreachable | Agents continue with cached config | Agents retry with exponential backoff |
| Master crashes | New connections fail; agents keep running on their last applied config | External load balancer + health checks (HA: future) |
| Database down | Read-only mode for existing configs | Database replication, failover |
Agent Failure Scenarios
| Scenario | Impact | Mitigation |
|---|---|---|
| Agent crashes | Nginx continues running | Systemd restart, watchdog |
| Config validation fails | Previous config kept | Atomic config swap, rollback |
| Nginx crashes | Agent restarts nginx | Health checks, auto-restart |
| Network partition | Agent operates in "island mode" | Local cache, reconciliation on reconnect |
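The "atomic config swap" mitigation is commonly implemented by writing the new file next to the target and renaming it into place, so that nginx never observes a half-written config. The following is a minimal sketch of that general technique, not NxMesh's actual code; `atomic_write` and the `.tmp` suffix are hypothetical. The rename is atomic on POSIX systems only when both paths are on the same filesystem.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Replace `target` with `contents` by writing a sibling temp file first
/// and then renaming it over the target.
pub fn atomic_write(target: &Path, contents: &[u8]) -> io::Result<()> {
    // Sibling path "<target>.tmp" keeps the temp file on the same filesystem.
    let mut tmp_name = target.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    {
        let mut file = File::create(&tmp)?;
        file.write_all(contents)?;
        // Flush data to disk before the rename makes the file visible.
        file.sync_all()?;
    }
    // If anything above failed, the old target is still untouched.
    fs::rename(&tmp, target)
}
```

Rollback then amounts to calling the same function with the previously cached contents.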
Recovery Procedures
┌─────────────────────────────────────────────────────────────────────┐
│ FAILURE RECOVERY FLOW │
│ │
│ Agent Disconnect │
│ │ │
│ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Retry │───▶│ Cache │───▶│ Alert │───▶│ Watch │ │
│ │ Connect │ │ Config │ │ Master │ │ Dog │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ │
│ │Reconnected│ │ Restart │ │
│ │ Sync │ │ Nginx │ │
│ └─────────┘ └─────────┘ │
│ │
│ Recovery Strategies: │
│ 1. Exponential backoff for reconnection │
│ 2. Circuit breaker for failed operations │
│ 3. Config checksum verification after reconnect │
│ 4. Automatic nginx restart on health check failure │
└─────────────────────────────────────────────────────────────────────┘
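Strategy 2, the circuit breaker, can be sketched as a counter of consecutive failures plus a cooldown. This is a simplified illustration with hypothetical names and parameters (the threshold and cooldown are not specified by the design above); time is passed in explicitly so the logic can be tested without sleeping.

```rust
use std::time::{Duration, Instant};

/// Stops attempting an operation after repeated failures,
/// then allows a retry once the cooldown has elapsed.
pub struct CircuitBreaker {
    failure_threshold: u32,
    cooldown: Duration,
    consecutive_failures: u32,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, cooldown: Duration) -> Self {
        Self { failure_threshold, cooldown, consecutive_failures: 0, opened_at: None }
    }

    /// Whether the operation may be attempted at time `now`.
    pub fn allows(&self, now: Instant) -> bool {
        match self.opened_at {
            None => true, // closed: operations flow normally
            Some(opened) => now.duration_since(opened) >= self.cooldown,
        }
    }

    /// A success closes the breaker and resets the failure count.
    pub fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.opened_at = None;
    }

    /// A failure at or past the threshold (re)opens the breaker at `now`.
    pub fn record_failure(&mut self, now: Instant) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.failure_threshold {
            self.opened_at = Some(now);
        }
    }
}
```

An agent could wrap nginx restarts or master calls in such a breaker so that a persistently failing operation is not retried in a tight loop.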
Technology Stack
| Layer | Technology | Rationale |
|---|---|---|
| Master Backend | Rust (Axum) | Performance, safety, async ecosystem |
| Agent | Rust (Tokio) | Small binary, low memory, fast startup |
| Database | PostgreSQL | ACID, JSON support, reliability |
| Cache | Redis | Fast key-value, pub/sub for events |
| Frontend | React + Vite (embedded) | Static build served by master, fast HMR in dev |
| gRPC | Tonic | Native Rust implementation |
| ORM | SeaORM | Async, type-safe, migration support |
| Config Template | Handlebars | Logic-less, secure templating |
| Metrics | Prometheus | Industry standard, rich ecosystem |
| Tracing | OpenTelemetry | Vendor-neutral, future-proof |