Add project structure and roadmap documentation
- Created `project-structure.md` to outline the directory layout, crate dependencies, design principles, module guidelines, and naming conventions for the NxMesh codebase.
- Introduced `roadmap.md` detailing the development phases, milestones, tasks, deliverables, and resource requirements for the NxMesh project, spanning from foundational setup to enterprise features.
1107
docs/api.md
Normal file
File diff suppressed because it is too large
527
docs/architecture.md
Normal file
@@ -0,0 +1,527 @@
# NxMesh Architecture

## Table of Contents
1. [Overview](#overview)
2. [System Components](#system-components)
3. [Data Flow](#data-flow)
4. [Communication Protocols](#communication-protocols)
5. [Security Model](#security-model)
6. [Deployment Patterns](#deployment-patterns)
7. [Failure Handling](#failure-handling)
8. [Technology Stack](#technology-stack)

---

## Overview

NxMesh follows a **Control Plane / Data Plane** architecture pattern, similar to service meshes like Istio or Linkerd, but specifically optimized for nginx management.

### Design Principles

1. **Separation of Concerns**: Master handles policy and state; Agent handles execution
2. **Eventual Consistency**: Configuration changes propagate asynchronously
3. **Local Autonomy**: Agents can operate independently during master outages
4. **Zero-Downtime Updates**: Nginx reloads without dropping connections
5. **Observability First**: Every action is observable and traceable

---

## System Components

### 1. Master (Control Plane)

The Master is the brain of the system. It maintains the desired state and coordinates all agents.

```
|
||||
┌──────────────────────────────────────────────────────────────────┐
|
||||
│ MASTER │
|
||||
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────────────┐ │
|
||||
│ │ API │ │ Config │ │ Event & Agent │ │
|
||||
│ │ Layer │ │ Engine │ │ Coordination │ │
|
||||
│ │ │ │ │ │ │ │
|
||||
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌───────────────────┐ │ │
|
||||
│ │ │ REST │ │ │ │ Template│ │ │ │ Agent Registry │ │ │
|
||||
│ │ │ Handler │ │ │ │ Engine │ │ │ │ (Connections) │ │ │
|
||||
│ │ └─────────┘ │ │ └─────────┘ │ │ └───────────────────┘ │ │
|
||||
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌───────────────────┐ │ │
|
||||
│ │ │ gRPC │ │ │ │ Version │ │ │ │ Event Bus │ │ │
|
||||
│ │ │ Server │ │ │ │ Control │ │ │ │ (Config Dist.) │ │ │
|
||||
│ │ └─────────┘ │ │ └─────────┘ │ │ └───────────────────┘ │ │
|
||||
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌───────────────────┐ │ │
|
||||
│ │ │ WebSocket│ │ │ │ Validator│ │ │ │ Broadcast │ │ │
|
||||
│ │ │ Handler │ │ │ │ │ │ │ │ (Agent Updates) │ │ │
|
||||
│ │ └──────────┘ │ │ └──────────┘ │ │ └───────────────────┘ │ │
|
||||
│ └──────────────┘ └──────────────┘ └─────────────────────────┘ │
|
||||
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
|
||||
│ │ Auth │ │ Storage │ │ Observability │ │
|
||||
│ │ Service │ │ Layer │ │ │ │
|
||||
│ │ │ │ │ │ ┌───────────────────┐ │ │
|
||||
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Metrics │ │ │
|
||||
│ │ │ JWT │ │ │ │ Postgres│ │ │ │ (Prometheus) │ │ │
|
||||
│ │ │ OAuth2 │ │ │ │ (SeaORM)│ │ │ └───────────────────┘ │ │
|
||||
│ │ └─────────┘ │ │ └─────────┘ │ │ ┌───────────────────┐ │ │
|
||||
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Tracing │ │ │
|
||||
│ │ │ Password│ │ │ │ Cache │ │ │ │ (OpenTelemetry) │ │ │
|
||||
│ │ │ Login │ │ │ │ (Redis) │ │ │ └───────────────────┘ │ │
|
||||
│ │ └─────────┘ │ │ └─────────┘ │ │ │ │
|
||||
│ │ ┌─────────┐ │ │ │ │ │ │
|
||||
│ │ │ RBAC │ │ │ │ │ │ │
|
||||
│ │ │ Engine │ │ │ │ │ │ │
|
||||
│ │ └─────────┘ │ │ │ │ │ │
|
||||
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
|
||||
└──────────────────────────────────────────────────────────────────┘
|
||||
```

#### Master Responsibilities

| Module | Responsibility |
|--------|----------------|
| API Layer | HTTP REST API for external clients (CLI, Web UI, external systems) |
| Config Engine | Template rendering, validation, versioning |
| Event & Agent Coordination | Agent connection management, config event broadcasting |
| Auth Service | Authentication (JWT/OAuth2, Password) and authorization (RBAC) |
| Storage Layer | PostgreSQL for persistent state, Redis for caching |
| Observability | Metrics collection, distributed tracing, structured logging |

#### Future: High Availability Mode

For large-scale deployments, the master can be extended with:
- **Raft Consensus** for leader election and state replication
- **Cluster Manager** for coordinating multiple master instances

High availability mode is **not required** for single-organization, self-hosted deployments.

### 2. Agent (Data Plane)

The Agent is a lightweight sidecar that runs alongside each nginx instance.

```
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ AGENT │
|
||||
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
|
||||
│ │ Master │ │ Nginx │ │ Health Monitor │ │
|
||||
│ │ Client │ │ Controller │ │ │ │
|
||||
│ │ │ │ │ │ ┌───────────────────┐ │ │
|
||||
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Nginx Health │ │ │
|
||||
│ │ │ gRPC │ │ │ │ Config │ │ │ │ (HTTP checks) │ │ │
|
||||
│ │ │ Client │ │ │ │ Renderer│ │ │ └───────────────────┘ │ │
|
||||
│ │ └─────────┘ │ │ └─────────┘ │ │ ┌───────────────────┐ │ │
|
||||
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ System Metrics │ │ │
|
||||
│ │ │ WebSocket│ │ │ │ Reload │ │ │ │ (CPU/Mem/IO) │ │ │
|
||||
│ │ │ Client │ │ │ │ Manager │ │ │ └───────────────────┘ │ │
|
||||
│ │ └─────────┘ │ │ └─────────┘ │ │ │ │
|
||||
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌───────────────────┐ │ │
|
||||
│ │ │ Reconnect│ │ │ │ Process │ │ │ │ Self-Health │ │ │
|
||||
│ │ │ Handler │ │ │ │ Signal │ │ │ │ (Heartbeat) │ │ │
|
||||
│ │ └─────────┘ │ │ └─────────┘ │ │ └───────────────────┘ │ │
|
||||
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
|
||||
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
|
||||
│ │ Metrics │ │ Local │ │ Watchdog │ │
|
||||
│ │ Exporter │ │ Cache │ │ │ │
|
||||
│ │ │ │ │ │ ┌───────────────────┐ │ │
|
||||
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Config Drift │ │ │
|
||||
│ │ │Prometheus│ │ │ │ Config │ │ │ │ Detection │ │ │
|
||||
│ │ │Endpoint │ │ │ │ State │ │ │ └───────────────────┘ │ │
|
||||
│ │ └─────────┘ │ │ └─────────┘ │ │ ┌───────────────────┐ │ │
|
||||
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Auto-Recovery │ │ │
|
||||
│ │ │Statsd │ │ │ │ Backup │ │ │ │ (Nginx restart) │ │ │
|
||||
│ │ │Client │ │ │ │ Files │ │ │ └───────────────────┘ │ │
|
||||
│ │ └─────────┘ │ │ └─────────┘ │ │ │ │
|
||||
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
```

#### Agent Responsibilities

| Module | Responsibility |
|--------|----------------|
| Master Client | Maintains persistent connection to master (gRPC + WebSocket fallback) |
| Nginx Controller | Generates configs, manages reloads, handles lifecycle |
| Health Monitor | Monitors nginx health, system resources, reports status |
| Metrics Exporter | Prometheus endpoint, statsd client for metrics |
| Local Cache | Caches configs for offline operation, backup/restore |
| Watchdog | Detects config drift, auto-recovery from failures |

---

## Data Flow

### 1. Configuration Push Flow

```
|
||||
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
|
||||
│ User │────▶│ API │────▶│ Config │────▶│ Event │────▶│ Agents │
|
||||
│ Action │ │ Server │ │ Engine │ │ Bus │ │ │
|
||||
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
|
||||
│
|
||||
▼
|
||||
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
|
||||
│ Nginx │◀────│ Config │◀────│ Template│◀────│ gRPC │◀────│ Agent │
|
||||
│Reloaded│ │Applied │ │ Render │ │ Stream │ │Receive │
|
||||
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
|
||||
```

**Flow Description:**
1. User creates/updates configuration via API or Web UI
2. Master validates and stores configuration in database
3. Config Engine determines affected agents
4. Event Bus broadcasts configuration change event
5. Agents receive event via gRPC streaming
6. Agent renders local nginx configuration from templates
7. Agent validates new configuration (`nginx -t`)
8. Agent applies configuration via graceful reload
9. Agent reports status back to master

### 2. Health Reporting Flow

```
|
||||
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
|
||||
│ Nginx │────▶│ Agent │────▶│ Master │────▶│ DB │
|
||||
│ Health │ │ Health │ │ API │ │ Store │
|
||||
└────────┘ └────────┘ └────────┘ └────────┘
|
||||
│
|
||||
▼
|
||||
┌────────┐
|
||||
│Prometheus│
|
||||
│ Server │
|
||||
└────────┘
|
||||
```

**Flow Description:**
1. Agent periodically checks nginx health (HTTP health endpoint)
2. Agent collects system metrics (CPU, memory, connections)
3. Agent sends health report to master via gRPC
4. Master aggregates and stores in database
5. Prometheus scrapes agent metrics endpoint

### 3. Certificate Management Flow

```
|
||||
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
|
||||
│ Let's │◀────│ Master │────▶│ Agent │────▶│ Nginx │◀────│ Client │
|
||||
│Encrypt │ │ ACME │ │ Deploy │ │ Serve │ │Request │
|
||||
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
|
||||
```

**Flow Description:**
1. Master requests certificate from Let's Encrypt (ACME protocol)
2. Master distributes certificate to relevant agents
3. Agent stores certificate locally (encrypted at rest)
4. Agent updates nginx configuration with new certificate
5. Nginx serves HTTPS traffic with new certificate

---

## Communication Protocols

### Master-Agent Protocol

NxMesh uses a **bidirectional gRPC stream** as the primary communication channel between master and agents.

```protobuf
// agent.proto
syntax = "proto3";
package nxmesh.agent;

service AgentService {
  // Bidirectional streaming for real-time communication
  rpc Stream(stream AgentMessage) returns (stream MasterMessage);

  // Unary calls for specific operations
  rpc ReportHealth(HealthReport) returns (Ack);
  rpc ReportMetrics(MetricsBatch) returns (Ack);
}

message AgentMessage {
  string agent_id = 1;
  uint64 timestamp = 2;
  oneof payload {
    RegistrationRequest register = 3;
    HealthReport health = 4;
    ConfigStatus config_status = 5;
    MetricsBatch metrics = 6;
    LogBatch logs = 7;
  }
}

message MasterMessage {
  uint64 timestamp = 1;
  oneof payload {
    RegistrationResponse register_response = 2;
    ConfigUpdate config_update = 3;
    Command command = 4;
    Ack ack = 5;
  }
}

message ConfigUpdate {
  string config_id = 1;
  uint64 version = 2;
  repeated VirtualHost virtual_hosts = 3;
  repeated Upstream upstreams = 4;
  map<string, string> ssl_certificates = 5;
}
```

### Connection Management

```
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ CONNECTION LIFECYCLE │
|
||||
│ │
|
||||
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
|
||||
│ │ INIT │───▶│ CONNECT │───▶│ STREAM │───▶│ READY │ │
|
||||
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
|
||||
│ │ │ │ │
|
||||
│ ▼ ▼ ▼ │
|
||||
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
|
||||
│ │ RETRY │ │RECONNECT│ │ ERROR │ │
|
||||
│ └─────────┘ └─────────┘ └─────────┘ │
|
||||
│ │
|
||||
│ Connection Parameters: │
|
||||
│ - Heartbeat interval: 30s │
|
||||
│ - Reconnect backoff: 1s, 2s, 4s, 8s... (max 60s) │
|
||||
│ - gRPC keepalive: 10s ping, 20s timeout │
|
||||
│ - TLS: Server-side TLS (auto-generated or custom) │
|
||||
│ - Agent auth: Bootstrap token → Shared secret (HMAC) │
|
||||
└─────────────────────────────────────────────────────────────────────┘
|
||||
```
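
The reconnect backoff listed in the connection parameters (1s, 2s, 4s, 8s, capped at 60s) can be sketched as a pure function. This is a minimal illustration only: the function name `reconnect_delay_secs` and the 0-based attempt counter are assumptions, not part of the NxMesh codebase, and a real agent would likely add random jitter.

```rust
/// Delay in seconds before reconnect attempt number `attempt` (0-based).
/// Hypothetical helper: doubles from 1s and is capped at 60s.
fn reconnect_delay_secs(attempt: u32) -> u64 {
    const BASE_SECS: u64 = 1;
    const MAX_SECS: u64 = 60;
    // checked_shl returns None when the shift amount is 64 or more,
    // in which case we fall back to the cap.
    BASE_SECS
        .checked_shl(attempt)
        .map_or(MAX_SECS, |delay| delay.min(MAX_SECS))
}
```

Attempts 0 through 5 yield 1, 2, 4, 8, 16, and 32 seconds; every later attempt waits the 60-second maximum.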

---

## Security Model

### Authentication

| Component | Method | Details |
|-----------|--------|---------|
| Master API | JWT (RS256) | Short-lived access tokens, refresh tokens |
| Master WebSocket | JWT | Same tokens as API |
| Master-Agent gRPC | **TLS + Shared Secret** | Server TLS + bootstrap token → session HMAC |
| Agent Registration | One-time Bootstrap Token | Generated in Master UI, single-use, short expiry |

### Agent Authentication Flow (TLS + Shared Secret)

```
|
||||
┌─────────────┐ ┌──────────────┐
|
||||
│ Agent │ │ Master │
|
||||
└──────┬──────┘ └──────┬───────┘
|
||||
│ │
|
||||
│ 1. TLS Handshake (verify server certificate) │
|
||||
│◄───────────────────────────────────────────────►│
|
||||
│ │
|
||||
│ 2. Register with bootstrap_token │
|
||||
│ ── gRPC: RegisterAgent { token } ─────────────▶│
|
||||
│ │
|
||||
│ 3. Receive agent_id + session_key (+ key_id) │
|
||||
│◄────────────────────────────────────────────────│
|
||||
│ [Encrypted over TLS] │
|
||||
│ │
|
||||
│ 4. Subsequent requests: HMAC-signed │
|
||||
│ ── gRPC + Headers: │
|
||||
│ X-Agent-ID: <agent_id> │
|
||||
│ X-Key-ID: <session_key_id> │
|
||||
│ X-Signature: HMAC(request_body, session_key)│
|
||||
│────────────────────────────────────────────────▶│
|
||||
│ │
|
||||
│ 5. Key Rotation (primary/secondary) │
|
||||
│◄═══════════════════════════════════════════════►│
|
||||
```

**Security Properties:**
- **TLS**: Encrypts channel, verifies master identity (server cert)
- **Bootstrap Token**: One-time use, time-limited, proves initial identity
- **Session Key**: Per-agent secret, used for HMAC request signing
- **Key Rotation**: Primary/secondary key design for seamless rotation

### Authorization (RBAC)

```yaml
# Example RBAC Configuration
roles:
  admin:
    permissions:
      - "*:*"

  operator:
    permissions:
      - "config:read"
      - "config:write"
      - "agent:read"
      - "agent:reload"

  viewer:
    permissions:
      - "config:read"
      - "agent:read"
      - "metrics:read"

# Resource hierarchy
resources:
  - organization
  - workspace
  - agent
  - certificate
  - config (virtual_host, upstream)
```
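
To make the `resource:action` permission strings concrete, here is a minimal sketch of how a check against a role's permission list might work. The function `is_allowed` is hypothetical, and treating `*` as a wildcard on either side of the colon is an assumption inferred from the `"*:*"` admin entry, not a documented rule.

```rust
/// Returns true if any granted permission matches `resource:action`.
/// Assumption: `*` on either side of the colon acts as a wildcard.
fn is_allowed(granted: &[&str], resource: &str, action: &str) -> bool {
    granted.iter().any(|perm| match perm.split_once(':') {
        Some((res, act)) => {
            (res == "*" || res == resource) && (act == "*" || act == action)
        }
        // Entries without a colon are malformed and never grant access.
        None => false,
    })
}
```

With the roles above, a `viewer` passes `is_allowed(&["config:read", "agent:read", "metrics:read"], "config", "read")` but fails the same call with `"write"`, while the `admin` entry `"*:*"` matches every pair.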

## Deployment Patterns

### Pattern 1: Docker Sidecar (Development/Single Host)

```yaml
# docker-compose.yml
version: '3.8'

services:
  nxmesh-master:
    image: nxmesh/master:latest
    ports:
      - "8080:8080"  # API
      - "8443:8443"  # gRPC
    environment:
      - DATABASE_URL=postgres://...

  nginx-site-a:
    image: nginx:alpine
    volumes:
      - site-a-html:/usr/share/nginx/html

  nxmesh-agent-a:
    image: nxmesh/agent:latest
    network_mode: service:nginx-site-a  # Share network namespace with nginx
    pid: service:nginx-site-a  # Share PID namespace (for nginx reload)
    environment:
      - NXMESH_MASTER_URL=wss://nxmesh-master:8443
      - NXMESH_AGENT_TOKEN=${AGENT_TOKEN_A}
      - NXMESH_DEPLOYMENT_MODE=docker_sidecar
      - NXMESH_NGINX_PID_FILE=/var/run/nginx.pid
```

**Pros:** Simple, isolated, good for development
**Cons:** Docker-only, single host limitation

### Pattern 2: Kubernetes Sidecar

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
spec:
  replicas: 3
  # apps/v1 requires a selector matching the pod template labels;
  # the `app: web-service` label is illustrative.
  selector:
    matchLabels:
      app: web-service
  template:
    metadata:
      labels:
        app: web-service
    spec:
      containers:
        - name: nginx
          image: nginx:alpine
          volumeMounts:
            - name: nxmesh-config
              mountPath: /etc/nginx/conf.d

        - name: nxmesh-agent
          image: nxmesh/agent:latest
          env:
            - name: NXMESH_MASTER_URL
              value: "wss://nxmesh-master.default.svc:8443"
            - name: NXMESH_AGENT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: nxmesh-agent-token
                  key: token
          volumeMounts:
            - name: nxmesh-config
              mountPath: /etc/nginx/conf.d
      volumes:
        - name: nxmesh-config
          emptyDir: {}
```

**Pros:** Native K8s integration, auto-scaling, health checks
**Cons:** K8s-only, more complex setup

### Pattern 3: Standalone (VM/Bare Metal)

```
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ VM / Bare Metal │
|
||||
│ ┌───────────────────────────────────────────────────────────┐ │
|
||||
│ │ Systemd │ │
|
||||
│ │ ┌─────────────────────────────────────────────────────┐ │ │
|
||||
│ │ │ nxmesh-agent.service │ │ │
|
||||
│ │ │ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │ │ │
|
||||
│ │ │ │ Agent │ │ Nginx │ │ Config │ │ │ │
|
||||
│ │ │ │ Process │──│ Process │──│ Files │ │ │ │
|
||||
│ │ │ └──────────────┘ └──────────────┘ └───────────┘ │ │ │
|
||||
│ │ └─────────────────────────────────────────────────────┘ │ │
|
||||
│ └───────────────────────────────────────────────────────────┘ │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
```

**Pros:** Works anywhere, minimal dependencies
**Cons:** Manual setup, no container isolation

---

## Failure Handling

### Master Failure Scenarios

| Scenario | Impact | Mitigation |
|----------|--------|------------|
| Master unreachable | Agents continue with cached config | Agents retry with exponential backoff |
| Master crashes | New connections fail, existing continue | External load balancer + health checks (HA: future) |
| Database down | Read-only mode for existing configs | Database replication, failover |

### Agent Failure Scenarios

| Scenario | Impact | Mitigation |
|----------|--------|------------|
| Agent crashes | Nginx continues running | Systemd restart, watchdog |
| Config validation fails | Previous config kept | Atomic config swap, rollback |
| Nginx crashes | Agent restarts nginx | Health checks, auto-restart |
| Network partition | Agent operates in "island mode" | Local cache, reconciliation on reconnect |

### Recovery Procedures

```
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ FAILURE RECOVERY FLOW │
|
||||
│ │
|
||||
│ Agent Disconnect │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
|
||||
│ │ Retry │───▶│ Cache │───▶│ Alert │───▶│ Watch │ │
|
||||
│ │ Connect │ │ Config │ │ Master │ │ Dog │ │
|
||||
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
|
||||
│ │ │ │
|
||||
│ ▼ ▼ │
|
||||
│ ┌─────────┐ ┌─────────┐ │
|
||||
│ │Reconnected│ │ Restart │ │
|
||||
│ │ Sync │ │ Nginx │ │
|
||||
│ └─────────┘ └─────────┘ │
|
||||
│ │
|
||||
│ Recovery Strategies: │
|
||||
│ 1. Exponential backoff for reconnection │
|
||||
│ 2. Circuit breaker for failed operations │
|
||||
│ 3. Config checksum verification after reconnect │
|
||||
│ 4. Automatic nginx restart on health check failure │
|
||||
└─────────────────────────────────────────────────────────────────────┘
|
||||
```

---

## Technology Stack

| Layer | Technology | Rationale |
|-------|------------|-----------|
| **Master Backend** | Rust (Axum) | Performance, safety, async ecosystem |
| **Agent** | Rust (Tokio) | Small binary, low memory, fast startup |
| **Database** | PostgreSQL | ACID, JSON support, reliability |
| **Cache** | Redis | Fast key-value, pub/sub for events |
| **Frontend** | React + Vite (embedded) | Static build served by master, fast HMR in dev |
| **gRPC** | Tonic | Native Rust implementation |
| **ORM** | SeaORM | Async, type-safe, migration support |
| **Config Template** | Handlebars | Logic-less, secure templating |
| **Metrics** | Prometheus | Industry standard, rich ecosystem |
| **Tracing** | OpenTelemetry | Vendor-neutral, future-proof |
814
docs/features.md
Normal file
@@ -0,0 +1,814 @@
# NxMesh Feature Specification

## Table of Contents
1. [Core Features](#core-features)
2. [Master Features](#master-features)
3. [Agent Features](#agent-features)
4. [Configuration Management](#configuration-management)
5. [Observability](#observability)
6. [Security Features](#security-features)

---

## Core Features

### CF-001: Multi-tenancy with Organizations and Workspaces

**Description**: Support for multiple organizations with isolated workspaces within each organization.

**Requirements**:
- Organizations are top-level resource containers
- Each organization can have multiple workspaces
- Resources (agents, configs, certificates) are scoped to a workspace
- Cross-workspace visibility is configurable

**Data Model**:
```rust
struct Organization {
    id: Uuid,
    name: String,
    slug: String, // URL-friendly identifier
    created_at: DateTime,
    settings: OrganizationSettings,
}

struct Workspace {
    id: Uuid,
    organization_id: Uuid,
    name: String,
    slug: String,
    created_at: DateTime,
}
```

**API Endpoints**:
- `GET /api/v1/organizations` - List organizations
- `POST /api/v1/organizations` - Create organization
- `GET /api/v1/organizations/{id}/workspaces` - List workspaces
- `POST /api/v1/organizations/{id}/workspaces` - Create workspace

---

### CF-002: Agent Registration and Lifecycle Management

**Description**: Agents must register with the master before receiving configurations.

**Registration Flow**:
1. Administrator generates bootstrap token in Master UI
2. Token is provided to agent via environment variable or config file
3. Agent establishes TLS connection to master (verifies server certificate)
4. Agent sends bootstrap token for registration
5. Master validates token and establishes shared secret:
   - Master generates session_key (per-agent) + key_id
   - Session key used for HMAC request signing
   - Primary/secondary key design for rotation
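
The single-use, time-limited behavior of the bootstrap token in steps 4 and 5 can be illustrated with a small in-memory store. Everything here is hypothetical (`BootstrapTokens`, `TokenError`, plain Unix-second timestamps); it only sketches the validation rules, and a production master would presumably persist tokens, store them hashed, and compare them in constant time.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum TokenError {
    Unknown,
    Expired,
    AlreadyUsed,
}

/// Hypothetical in-memory token store, for illustration only.
struct BootstrapTokens {
    // token -> (expiry as Unix seconds, already redeemed?)
    tokens: HashMap<String, (u64, bool)>,
}

impl BootstrapTokens {
    fn new() -> Self {
        Self { tokens: HashMap::new() }
    }

    fn issue(&mut self, token: &str, expires_at: u64) {
        self.tokens.insert(token.to_string(), (expires_at, false));
    }

    /// Validates a token and marks it used, so a second redemption fails.
    fn redeem(&mut self, token: &str, now: u64) -> Result<(), TokenError> {
        let entry = self.tokens.get_mut(token).ok_or(TokenError::Unknown)?;
        if entry.1 {
            return Err(TokenError::AlreadyUsed);
        }
        if now >= entry.0 {
            return Err(TokenError::Expired);
        }
        entry.1 = true;
        Ok(())
    }
}
```

Only after `redeem` succeeds would the master go on to generate the per-agent `session_key` and `key_id`.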

**Agent States**:
```rust
enum AgentState {
    Pending,     // Registered but never connected
    Online,      // Connected and healthy
    Offline,     // Disconnected
    Degraded,    // Connected but health checks failing
    Maintenance, // Manually placed in maintenance mode
}
```

**Agent Metadata**:
```rust
struct Agent {
    id: Uuid,
    workspace_id: Uuid,
    name: String,
    hostname: String,
    ip_address: String,
    version: String,
    state: AgentState,
    deployment_mode: DeploymentMode, // DockerSidecar, K8sSidecar, Standalone
    last_seen_at: DateTime,
    capabilities: Vec<String>,       // e.g., ["http3", "websocket", "rate_limiting"]
    labels: HashMap<String, String>, // e.g., {"env": "prod", "region": "us-east"}
}
```

**API Endpoints**:
- `POST /api/v1/agents/register` - Register new agent
- `GET /api/v1/agents` - List agents
- `GET /api/v1/agents/{id}` - Get agent details
- `POST /api/v1/agents/{id}/tokens` - Generate registration token
- `DELETE /api/v1/agents/{id}` - Deregister agent

---

### CF-003: Real-time Configuration Distribution

**Description**: Push configuration changes to agents in real-time with delivery guarantees.

**Requirements**:
- Config changes propagate to all affected agents within 5 seconds
- Support for targeted updates (specific agents or groups)
- Config versioning with rollback capability
- Delivery confirmation from agents

**Configuration Scope**:
```rust
enum ConfigScope {
    Global,             // All agents
    Workspace,          // All agents in workspace
    AgentGroup(String), // Agents with specific label selector
    Agent(Uuid),        // Single agent
}
```
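
As a rough illustration of how a scope could be resolved to a set of agents, the sketch below matches a single agent against a scope. It makes several assumptions that are not in the specification: `Id` is a `u128` stand-in for `Uuid` so the example needs only the standard library, the `Workspace` variant carries a workspace id, the label selector has the form `key=value`, and `AgentInfo` is a trimmed-down hypothetical view of `Agent`.

```rust
use std::collections::HashMap;

type Id = u128; // stand-in for Uuid (assumption, keeps the sketch std-only)

enum ConfigScope {
    Global,
    Workspace(Id),      // assumption: carries the target workspace id
    AgentGroup(String), // assumption: selector of the form "key=value"
    Agent(Id),
}

/// Hypothetical subset of the Agent metadata needed for matching.
struct AgentInfo {
    id: Id,
    workspace_id: Id,
    labels: HashMap<String, String>,
}

/// Returns true if a config with this scope should be sent to `agent`.
fn scope_matches(scope: &ConfigScope, agent: &AgentInfo) -> bool {
    match scope {
        ConfigScope::Global => true,
        ConfigScope::Workspace(ws) => agent.workspace_id == *ws,
        ConfigScope::AgentGroup(selector) => match selector.split_once('=') {
            Some((key, value)) => {
                agent.labels.get(key).map(String::as_str) == Some(value)
            }
            // A selector without '=' matches nothing in this sketch.
            None => false,
        },
        ConfigScope::Agent(id) => agent.id == *id,
    }
}
```

The master would filter its agent registry with a predicate like this before broadcasting a `ConfigUpdate`; richer selectors (multiple labels, set operators) are left out here.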

**Delivery Guarantees**:
- At-least-once delivery
- Automatic retry with exponential backoff
- Config checksum verification
- Offline agents receive updates on reconnection
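
At-least-once delivery means an agent may receive the same update more than once, so applying updates has to be idempotent. One common way to get that, sketched here as an assumption rather than the documented NxMesh mechanism, is to compare the incoming `version` with the version already applied. `AppliedConfig` and `Delivery` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum Delivery {
    Applied,
    Duplicate,
}

/// Hypothetical agent-side record of the config version currently active.
struct AppliedConfig {
    version: u64,
}

impl AppliedConfig {
    /// Applies `incoming` only if it is newer than the active version.
    fn receive(&mut self, incoming: u64) -> Delivery {
        if incoming <= self.version {
            // Redelivered or stale update: acknowledge it, change nothing.
            return Delivery::Duplicate;
        }
        self.version = incoming;
        Delivery::Applied
    }
}
```

An explicit rollback to an older version would need a separate path that bypasses this check.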

---

## Master Features

### MF-001: RESTful API

**Description**: Comprehensive REST API for all operations.

**Base URL**: `/api/v1`

**Resource Endpoints**:

| Resource | Endpoints |
|----------|-----------|
| Organizations | GET, POST, PATCH, DELETE `/organizations` |
| Workspaces | GET, POST, PATCH, DELETE `/workspaces` |
| Agents | GET, POST, PATCH, DELETE `/agents` |
| VirtualHosts | GET, POST, PATCH, DELETE `/virtual-hosts` |
| Upstreams | GET, POST, PATCH, DELETE `/upstreams` |
| Certificates | GET, POST, DELETE `/certificates` |
| AccessLogs | GET `/access-logs` |
| Metrics | GET `/metrics` |

**Response Format**:
```json
{
  "data": { ... },
  "meta": {
    "page": 1,
    "per_page": 20,
    "total": 100
  },
  "links": {
    "self": "/api/v1/agents?page=1",
    "next": "/api/v1/agents?page=2",
    "prev": null
  }
}
```
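
The `next` and `prev` links follow from the `meta` values. A minimal sketch of that derivation, assuming 1-based page numbers and a hypothetical helper named `page_links` (the actual master implementation is not shown in this document):

```rust
/// Returns (next, prev) links for a 1-based `page`; None maps to JSON null.
fn page_links(
    base: &str,
    page: u64,
    per_page: u64,
    total: u64,
) -> (Option<String>, Option<String>) {
    let per_page = per_page.max(1); // guard against division by zero
    // Ceiling division; treat an empty collection as a single page.
    let last_page = ((total + per_page - 1) / per_page).max(1);
    let next = (page < last_page).then(|| format!("{base}?page={}", page + 1));
    let prev = (page > 1).then(|| format!("{base}?page={}", page - 1));
    (next, prev)
}
```

For the response shown above (`page` 1, `per_page` 20, `total` 100) this gives a `next` link to page 2 and no `prev` link.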

**Error Format**:
```json
{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Invalid configuration",
    "details": [
      {"field": "server_name", "message": "Invalid domain format"}
    ]
  }
}
```

---

### MF-002: Web-based Admin Console (Embedded)

**Description**: Modern web UI for managing the entire system. Built with React + Vite and served as static files embedded directly in the master binary.

**Pages**:

| Page | Features |
|------|----------|
| Dashboard | Agent status, recent events, traffic overview |
| Agents | List, detail view, logs, metrics graphs |
| Configurations | Virtual host editor, upstream management |
| Certificates | SSL certificate list, expiration alerts |
| Access Control | Users, roles, permissions management |
| Settings | Organization settings, integrations |

**Key UI Features**:
- Real-time updates via WebSocket
- Monaco editor for nginx configuration
- Visual topology view (agent connections)
- Dark/light mode support
- Responsive design

---

### MF-003: Configuration Template Engine

**Description**: Templating system for generating nginx configurations.

**Template Variables**:
```handlebars
# Example virtual host template
server {
    listen {{port}} {{#if ssl}}ssl{{/if}} {{#if http2}}http2{{/if}};
    server_name {{server_name}};

    {{#if ssl}}
    ssl_certificate {{ssl_certificate_path}};
    ssl_certificate_key {{ssl_certificate_key_path}};
    {{/if}}

    location {{location_path}} {
        proxy_pass http://{{upstream_name}};
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        {{#each custom_headers}}
        add_header {{name}} "{{value}}";
        {{/each}}

        {{#if rate_limiting}}
        limit_req zone={{rate_limit_zone}} burst={{rate_limit_burst}};
        {{/if}}
    }
}
```

**Built-in Templates**:
- `default` - Standard reverse proxy
- `spa` - Single Page Application (with fallback to index.html)
- `api` - API gateway with rate limiting
- `static` - Static file serving with caching
- `websocket` - WebSocket proxy with connection upgrades

---

### MF-004: Certificate Management (ACME)

**Description**: Automatic SSL/TLS certificate provisioning via Let's Encrypt.

**Features**:
- ACME v2 protocol support
- HTTP-01 and DNS-01 challenges
- Automatic renewal (30 days before expiry)
- Wildcard certificate support (DNS-01)
- Certificate monitoring and alerts
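
The "30 days before expiry" renewal rule reduces to a simple time comparison. The sketch below is illustrative only: `needs_renewal` and `RENEWAL_WINDOW_SECS` are hypothetical names, and timestamps are plain Unix seconds rather than the `DateTime` type used in the entity definition.

```rust
/// Renewal window from the feature list: 30 days, in seconds.
const RENEWAL_WINDOW_SECS: u64 = 30 * 24 * 60 * 60;

/// True once `now` is within 30 days of `expires_at` (both Unix seconds)
/// and the certificate has auto-renewal enabled.
fn needs_renewal(now: u64, expires_at: u64, auto_renew: bool) -> bool {
    auto_renew && now.saturating_add(RENEWAL_WINDOW_SECS) >= expires_at
}
```

A periodic task on the master could run this check over all certificates and start an ACME order for each one that returns true; already expired certificates also satisfy the condition.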

**Certificate Entity**:
```rust
struct Certificate {
    id: Uuid,
    workspace_id: Uuid,
    domain: String,
    is_wildcard: bool,
    provider: CertificateProvider, // LetsEncrypt, Custom
    status: CertificateStatus,     // Pending, Active, Expired, Error
    issued_at: DateTime,
    expires_at: DateTime,
    auto_renew: bool,
    certificate_pem: Option<String>, // Encrypted at rest
    private_key_pem: Option<String>, // Encrypted at rest
}
```

---

## Agent Features

### AF-001: Nginx Lifecycle Management

**Description**: Agent manages nginx process lifecycle based on deployment mode.

**Docker Sidecar Mode**:
- Shares PID namespace with nginx container (via `pid: service:nginx`)
- Directly signals nginx process for reload/restart
- Monitors nginx via health checks

**Standalone Mode**:
- Direct process management (signals to PID from file)
- systemd integration (optional, for service management)
- PID file monitoring

**Lifecycle Actions**:
- `start` - Start nginx
- `stop` - Graceful shutdown
- `reload` - Hot reload configuration
- `restart` - Full restart
- `test` - Validate configuration

---

### AF-002: Configuration Rendering and Application

**Description**: Agent renders nginx configs from master templates and applies them using atomic symlink swaps for zero-downtime updates.

**Config Directory Structure**:
```
/etc/nginx/
├── nginx.conf                        # Contains: include /etc/nginx/conf.d/current/*.conf
├── conf.d/
│   ├── current -> ./20260302143000/  # Symlink to active deployment
│   ├── 20260302143000/               # Active config (timestamped)
│   │   ├── default.conf
│   │   └── upstream.conf
│   ├── 20260302141500/               # Previous deployment (for rollback)
│   │   ├── default.conf
│   │   └── upstream.conf
│   └── 20260302140000/               # Older deployment (cleanup candidate)
```
|
||||
|
||||
**Config Rendering Flow**:
|
||||
1. Receive ConfigUpdate from master
|
||||
2. Create new deployment folder: `./conf.d/<timestamp>/`
|
||||
3. Render nginx config files into timestamped folder
|
||||
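4. **Validate** the new config before switching the symlink: `nginx -t -c <test-config>`, where `<test-config>` is a top-level config that includes `/etc/nginx/conf.d/<timestamp>/*.conf` (the timestamped folder itself holds no `nginx.conf`)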
|
||||
5. If validation passes, **atomically update symlink**: `current` → `<timestamp>/`
|
||||
6. Execute graceful nginx reload
|
||||
7. Verify reload success (health check)
|
||||
8. Report status to master
|
||||
9. Cleanup old deployments (keep N recent versions)
|
||||
|
||||
**Atomic Config Swap**:
|
||||
```rust
async fn apply_config(&self, config: ConfigUpdate) -> Result<()> {
    let timestamp = generate_timestamp();
    let deploy_dir = self.conf_d_path.join(&timestamp);
    let symlink_path = self.conf_d_path.join("current");

    // 1. Render config to new timestamped directory
    self.render_config(&config, &deploy_dir).await?;

    // 2. Validate BEFORE switching symlink (point to new folder directly)
    self.validate_config(&deploy_dir).await?;

    // 3. Atomic symlink swap (Unix: symlink + rename)
    let temp_link = self.conf_d_path.join("current.tmp");
    tokio::fs::symlink(&deploy_dir, &temp_link).await?;
    tokio::fs::rename(&temp_link, &symlink_path).await?; // Atomic operation

    // 4. Reload nginx (picks up new symlink target)
    self.reload_nginx().await?;

    // 5. Verify and cleanup
    self.verify_health().await?;
    self.cleanup_old_deployments(5).await?; // Keep last 5 versions

    self.report_success(config.id, timestamp).await;
    Ok(())
}
```
|
||||
|
||||
**Rollback Strategy**:
|
||||
```rust
async fn rollback(&self, target_timestamp: &str) -> Result<()> {
    let target_dir = self.conf_d_path.join(target_timestamp);
    let symlink_path = self.conf_d_path.join("current");

    // Verify target exists
    if !target_dir.exists() {
        return Err(Error::RollbackTargetNotFound);
    }

    // Atomic symlink swap back to previous deployment
    let temp_link = self.conf_d_path.join("current.tmp");
    tokio::fs::symlink(&target_dir, &temp_link).await?;
    tokio::fs::rename(&temp_link, &symlink_path).await?;

    // Reload nginx
    self.reload_nginx().await?;
    Ok(())
}
```
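The cleanup step (keep N recent versions) interacts with rollback: after a rollback, `current` may point at a deployment that is no longer among the N newest. A minimal sketch of the selection logic, assuming directory names use the fixed-width timestamp format shown above; `deployments_to_remove` is a hypothetical helper name, not part of the documented agent API.

```rust
/// Pick deployment directories that are safe to delete.
///
/// `names` are timestamped directory names such as "20260302143000".
/// Because the format is fixed-width, lexicographic order equals
/// chronological order. `current` is the active symlink target and is
/// never returned, even if it is older than the `keep` newest entries.
pub fn deployments_to_remove(names: &[&str], keep: usize, current: &str) -> Vec<String> {
    let mut sorted: Vec<&str> = names.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // newest first
    sorted
        .into_iter()
        .skip(keep) // retain the `keep` newest deployments
        .filter(|name| *name != current) // never delete the active one
        .map(String::from)
        .collect()
}
```

Separating the decision (which directories) from the effect (deleting them) keeps the logic testable without touching the filesystem.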
|
||||
|
||||
---
|
||||
|
||||
### AF-003: Health Monitoring and Reporting
|
||||
|
||||
**Description**: Continuous health monitoring of nginx and the host system.
|
||||
|
||||
**Health Checks**:
|
||||
- **Nginx Health**: HTTP request to nginx health endpoint
|
||||
- **Configuration Health**: Verify current config matches expected
|
||||
- **Resource Health**: CPU, memory, disk usage
|
||||
- **Connection Health**: Active connections, request rate
|
||||
|
||||
**Health Report Structure**:
|
||||
```rust
|
||||
struct HealthReport {
|
||||
agent_id: Uuid,
|
||||
timestamp: DateTime,
|
||||
nginx_status: NginxStatus,
|
||||
system_metrics: SystemMetrics,
|
||||
config_checksum: String,
|
||||
alerts: Vec<Alert>,
|
||||
}
|
||||
|
||||
struct NginxStatus {
|
||||
is_running: bool,
|
||||
pid: Option<u32>,
|
||||
uptime_seconds: u64,
|
||||
active_connections: u32,
|
||||
requests_per_second: f64,
|
||||
}
|
||||
|
||||
struct SystemMetrics {
|
||||
cpu_percent: f64,
|
||||
memory_used_mb: u64,
|
||||
memory_total_mb: u64,
|
||||
disk_used_gb: u64,
|
||||
disk_total_gb: u64,
|
||||
}
|
||||
```
|
||||
|
||||
**Reporting Interval**: Configurable (default: 30 seconds)
|
||||
|
||||
---
|
||||
|
||||
### AF-004: Metrics Collection and Export
|
||||
|
||||
**Description**: Collect and expose metrics in Prometheus format.
|
||||
|
||||
**Metrics Endpoint**: `GET /metrics` (on agent)
|
||||
|
||||
**Built-in Metrics**:
|
||||
```
|
||||
# Nginx metrics (parsed from stub_status)
|
||||
nxmesh_nginx_connections_active{agent_id="..."} 42
|
||||
nxmesh_nginx_connections_reading{agent_id="..."} 5
|
||||
nxmesh_nginx_connections_writing{agent_id="..."} 30
|
||||
nxmesh_nginx_connections_waiting{agent_id="..."} 7
|
||||
nxmesh_nginx_requests_total{agent_id="..."} 1234567
|
||||
|
||||
# Agent metrics
|
||||
nxmesh_agent_uptime_seconds{agent_id="..."} 86400
|
||||
nxmesh_agent_master_connection_status{agent_id="..."} 1
|
||||
nxmesh_agent_config_version{agent_id="...",version="123"} 1
|
||||
|
||||
# System metrics
|
||||
nxmesh_system_cpu_percent{agent_id="..."} 25.5
|
||||
nxmesh_system_memory_used_bytes{agent_id="..."} 1073741824
|
||||
nxmesh_system_disk_used_bytes{agent_id="..."} 53687091200
|
||||
```
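The nginx metrics above are described as parsed from `stub_status`. A minimal sketch of that step, assuming the standard four-line plain-text body that the `stub_status` module returns; the `StubStatus` struct and both function names are illustrative, not the agent's actual types.

```rust
/// Counters exposed by nginx's `stub_status` page.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct StubStatus {
    pub active: u64,
    pub accepts: u64,
    pub handled: u64,
    pub requests: u64,
    pub reading: u64,
    pub writing: u64,
    pub waiting: u64,
}

/// Parse the plain-text body returned by `stub_status`.
/// Returns `None` if any expected field is missing or not a number.
pub fn parse_stub_status(body: &str) -> Option<StubStatus> {
    let mut lines = body.lines().map(str::trim).filter(|l| !l.is_empty());

    // Line 1: "Active connections: 291"
    let active = lines
        .next()?
        .strip_prefix("Active connections:")?
        .trim()
        .parse::<u64>()
        .ok()?;

    // Line 2 is the literal header "server accepts handled requests".
    lines.next()?;

    // Line 3: three counters separated by whitespace.
    let mut counters = lines.next()?.split_whitespace().map(|n| n.parse::<u64>().ok());
    let accepts = counters.next()??;
    let handled = counters.next()??;
    let requests = counters.next()??;

    // Line 4: "Reading: 6 Writing: 179 Waiting: 106"
    let tokens: Vec<&str> = lines.next()?.split_whitespace().collect();
    if tokens.len() != 6 {
        return None;
    }
    let field = |label: &str, idx: usize| -> Option<u64> {
        if tokens[idx] == label { tokens[idx + 1].parse().ok() } else { None }
    };

    Some(StubStatus {
        active,
        accepts,
        handled,
        requests,
        reading: field("Reading:", 0)?,
        writing: field("Writing:", 2)?,
        waiting: field("Waiting:", 4)?,
    })
}

/// Render the counters in Prometheus text exposition format.
pub fn render_metrics(s: &StubStatus, agent_id: &str) -> String {
    let rows = [
        ("nxmesh_nginx_connections_active", s.active),
        ("nxmesh_nginx_connections_reading", s.reading),
        ("nxmesh_nginx_connections_writing", s.writing),
        ("nxmesh_nginx_connections_waiting", s.waiting),
        ("nxmesh_nginx_requests_total", s.requests),
    ];
    rows.iter()
        .map(|(name, value)| format!("{name}{{agent_id=\"{agent_id}\"}} {value}\n"))
        .collect()
}
```

Returning `None` on any malformed field lets the health monitor treat an unparsable status page as a failed check rather than exporting partial counters.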
|
||||
|
||||
**Custom Metrics**: Agents can collect custom metrics from nginx access logs
|
||||
|
||||
---
|
||||
|
||||
### AF-005: Offline Operation and Recovery
|
||||
|
||||
**Description**: Agent can operate independently when master is unreachable.
|
||||
|
||||
**Offline Capabilities**:
|
||||
- Continue serving traffic with cached configuration
|
||||
- Local health monitoring continues
|
||||
- Metrics are buffered for later transmission
|
||||
- Automatic reconnection attempts
|
||||
|
||||
**Recovery Flow**:
|
||||
1. Detect disconnection from master
|
||||
2. Enter "offline mode"
|
||||
3. Continue operating with cached config
|
||||
4. Buffer metrics and logs
|
||||
5. Attempt reconnection with exponential backoff
|
||||
6. On reconnection:
   - Sync configuration (compare checksums)
   - Transmit buffered metrics
   - Resume normal operation
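Step 5 of the recovery flow (reconnection with exponential backoff) can be sketched as a pure delay calculation. The base and cap values are left to the caller because the document does not specify them, and `backoff_delay` is a hypothetical helper name. Jitter is omitted for brevity; a real agent fleet would likely add it so that agents do not reconnect in lockstep.

```rust
use std::time::Duration;

/// Delay before reconnection attempt number `attempt` (0-based):
/// `base * 2^attempt`, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // checked_pow / checked_mul guard against overflow on large attempts;
    // any overflow simply saturates at the cap.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}
```

With a 1-second base and a 60-second cap, the sequence of delays is 1s, 2s, 4s, 8s, 16s, 32s, and then 60s for every later attempt.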
|
||||
|
||||
---
|
||||
|
||||
## Configuration Management
|
||||
|
||||
### CM-001: Virtual Host Configuration
|
||||
|
||||
**Description**: Define nginx server blocks (virtual hosts) via API/UI.
|
||||
|
||||
**VirtualHost Entity**:
|
||||
```rust
|
||||
struct VirtualHost {
|
||||
id: Uuid,
|
||||
workspace_id: Uuid,
|
||||
name: String, // Human-readable name
|
||||
server_name: String, // Domain name(s), comma-separated
|
||||
listen_port: u16, // Usually 80 or 443
|
||||
ssl_enabled: bool,
|
||||
ssl_certificate_id: Option<Uuid>,
|
||||
|
||||
// Routing configuration
|
||||
locations: Vec<Location>,
|
||||
|
||||
// Advanced settings
|
||||
http2_enabled: bool,
|
||||
http3_enabled: bool,
|
||||
gzip_enabled: bool,
|
||||
rate_limiting: Option<RateLimitConfig>,
|
||||
|
||||
// Target agents
|
||||
target_agents: AgentSelector,
|
||||
}
|
||||
|
||||
struct Location {
|
||||
path: String, // e.g., "/api" or "~ \.php$"
|
||||
proxy_pass: Option<String>, // e.g., "http://backend"
|
||||
upstream_id: Option<Uuid>,
|
||||
root: Option<String>, // For static files
|
||||
index: Option<String>, // e.g., "index.html"
|
||||
custom_headers: Vec<Header>,
|
||||
rewrite_rules: Vec<RewriteRule>,
|
||||
}
|
||||
```
|
||||
|
||||
**Validation Rules**:
|
||||
- `server_name` must be valid domain(s)
|
||||
- `listen_port` must be 1-65535
|
||||
- SSL certificate must exist if `ssl_enabled` is true
|
||||
- At least one location must be defined
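The validation rules above can be sketched as a function that collects every violation instead of stopping at the first. This is an illustration under simplifying assumptions: `VirtualHostInput` is a trimmed-down stand-in for `VirtualHost`, the domain check is deliberately loose, and the certificate rule only checks that an ID is set (confirming that the certificate exists would need a database lookup).

```rust
/// Trimmed-down stand-in for the `VirtualHost` entity above.
pub struct VirtualHostInput {
    pub server_name: String,              // comma-separated domain names
    pub listen_port: u16,
    pub ssl_enabled: bool,
    pub ssl_certificate_id: Option<u128>, // stand-in for Uuid
    pub location_count: usize,
}

/// Loose hostname check: dot-separated labels of letters, digits and
/// hyphens, with an optional leading "*." wildcard.
fn is_valid_domain(name: &str) -> bool {
    let name = name.strip_prefix("*.").unwrap_or(name);
    !name.is_empty()
        && name.len() <= 253
        && name.split('.').all(|label| {
            !label.is_empty()
                && label.len() <= 63
                && !label.starts_with('-')
                && !label.ends_with('-')
                && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
        })
}

/// Return every rule violation; an empty vector means the input is valid.
pub fn validate(vh: &VirtualHostInput) -> Vec<String> {
    let mut errors = Vec::new();

    for name in vh.server_name.split(',').map(str::trim) {
        if !is_valid_domain(name) {
            errors.push(format!("invalid server_name: {name:?}"));
        }
    }
    // `u16` already caps the port at 65535, so only 0 is out of range.
    if vh.listen_port == 0 {
        errors.push("listen_port must be 1-65535".to_string());
    }
    if vh.ssl_enabled && vh.ssl_certificate_id.is_none() {
        errors.push("ssl_enabled requires ssl_certificate_id".to_string());
    }
    if vh.location_count == 0 {
        errors.push("at least one location must be defined".to_string());
    }
    errors
}
```

Returning all errors at once suits an API/UI workflow, where a user can fix every field in a single round trip.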
|
||||
|
||||
---
|
||||
|
||||
### CM-002: Upstream Configuration
|
||||
|
||||
**Description**: Define backend server pools for load balancing.
|
||||
|
||||
**Upstream Entity**:
|
||||
```rust
|
||||
struct Upstream {
|
||||
id: Uuid,
|
||||
workspace_id: Uuid,
|
||||
name: String, // Used as upstream identifier
|
||||
|
||||
// Load balancing algorithm
|
||||
algorithm: LoadBalanceAlgorithm, // RoundRobin, LeastConnections, IPHash, etc.
|
||||
|
||||
// Backend servers
|
||||
servers: Vec<UpstreamServer>,
|
||||
|
||||
// Health check configuration
|
||||
health_check: Option<HealthCheckConfig>,
|
||||
|
||||
// Connection settings
|
||||
keepalive_connections: Option<u32>,
|
||||
keepalive_timeout: Option<u32>,
|
||||
}
|
||||
|
||||
struct UpstreamServer {
|
||||
address: String, // IP:port or hostname:port
|
||||
weight: u32, // Default: 1
|
||||
backup: bool, // Backup server
|
||||
down: bool, // Temporarily down
|
||||
max_fails: u32, // Default: 1
|
||||
fail_timeout: u32, // Seconds, default: 10
|
||||
}
|
||||
|
||||
enum LoadBalanceAlgorithm {
|
||||
RoundRobin,
|
||||
LeastConnections,
|
||||
IPHash,
|
||||
WeightedRoundRobin,
|
||||
}
|
||||
```
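To illustrate how the `Upstream` entity could map onto nginx configuration, the sketch below renders an `upstream` block as plain text. It uses only standard nginx directives (`least_conn`, `ip_hash`, `server`, `keepalive`); the `render_upstream` function and its flat parameter list are assumptions made for this example, and the documented design renders configs through templates instead.

```rust
pub enum LoadBalanceAlgorithm {
    RoundRobin,
    LeastConnections,
    IPHash,
    WeightedRoundRobin,
}

pub struct UpstreamServer {
    pub address: String,
    pub weight: u32,
    pub backup: bool,
    pub down: bool,
    pub max_fails: u32,
    pub fail_timeout: u32, // seconds
}

/// Render an nginx `upstream` block for the given servers.
pub fn render_upstream(
    name: &str,
    algorithm: &LoadBalanceAlgorithm,
    servers: &[UpstreamServer],
    keepalive: Option<u32>,
) -> String {
    let mut out = format!("upstream {name} {{\n");

    match algorithm {
        LoadBalanceAlgorithm::LeastConnections => out.push_str("    least_conn;\n"),
        LoadBalanceAlgorithm::IPHash => out.push_str("    ip_hash;\n"),
        // Round robin is nginx's default; weighting is expressed per server.
        LoadBalanceAlgorithm::RoundRobin | LoadBalanceAlgorithm::WeightedRoundRobin => {}
    }

    for s in servers {
        let mut line = format!(
            "    server {} weight={} max_fails={} fail_timeout={}s",
            s.address, s.weight, s.max_fails, s.fail_timeout
        );
        if s.backup {
            line.push_str(" backup");
        }
        if s.down {
            line.push_str(" down");
        }
        line.push_str(";\n");
        out.push_str(&line);
    }

    if let Some(n) = keepalive {
        out.push_str(&format!("    keepalive {n};\n"));
    }
    out.push_str("}\n");
    out
}
```

Note that nginx rejects the `backup` parameter in combination with `ip_hash`, so a validator would need to catch that pairing before rendering.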
|
||||
|
||||
---
|
||||
|
||||
### CM-003: Configuration Versioning
|
||||
|
||||
**Description**: Track all configuration changes with full history.
|
||||
|
||||
**Versioning Features**:
|
||||
- Every change creates a new version
|
||||
- Versions are immutable
|
||||
- Rollback to any previous version
|
||||
- Diff between versions
|
||||
- Audit log of who changed what
|
||||
|
||||
**Version Entity**:
|
||||
```rust
|
||||
struct ConfigVersion {
|
||||
id: Uuid,
|
||||
resource_type: String, // "virtual_host", "upstream", etc.
|
||||
resource_id: Uuid,
|
||||
version_number: u64, // Auto-incrementing
|
||||
data: Json, // Full configuration snapshot
|
||||
checksum: String, // SHA-256 of data
|
||||
created_by: Uuid, // User ID
|
||||
created_at: DateTime,
|
||||
change_summary: String, // Human-readable description
|
||||
}
|
||||
```
|
||||
|
||||
**API Endpoints**:
|
||||
- `GET /api/v1/virtual-hosts/{id}/versions` - List versions
|
||||
- `GET /api/v1/virtual-hosts/{id}/versions/{version}` - Get specific version
|
||||
- `POST /api/v1/virtual-hosts/{id}/rollback` - Rollback to version
|
||||
- `GET /api/v1/virtual-hosts/{id}/diff?from=v1&to=v2` - Compare versions
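Because versions are immutable, one way to reconcile rollback with a full history is to treat a rollback as a new version that copies an older snapshot. The in-memory sketch below illustrates that idea; it is an assumption about the semantics, not the documented behaviour, and `VersionHistory` with its methods is a hypothetical type (the `data` field is a plain string standing in for the JSON snapshot).

```rust
/// One immutable snapshot (simplified from `ConfigVersion`).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Version {
    pub version_number: u64,
    pub data: String,
    pub change_summary: String,
}

/// Append-only history for a single resource.
#[derive(Default)]
pub struct VersionHistory {
    versions: Vec<Version>,
}

impl VersionHistory {
    /// Store a new snapshot and return its auto-incremented number.
    pub fn commit(&mut self, data: &str, change_summary: &str) -> u64 {
        let version_number = self.versions.len() as u64 + 1;
        self.versions.push(Version {
            version_number,
            data: data.to_string(),
            change_summary: change_summary.to_string(),
        });
        version_number
    }

    pub fn get(&self, version_number: u64) -> Option<&Version> {
        self.versions.iter().find(|v| v.version_number == version_number)
    }

    pub fn latest(&self) -> Option<&Version> {
        self.versions.last()
    }

    /// Roll back by committing a copy of an older snapshot as a NEW
    /// version, so existing entries are never modified or removed.
    pub fn rollback(&mut self, to: u64) -> Option<u64> {
        let data = self.get(to)?.data.clone();
        Some(self.commit(&data, &format!("Rollback to version {to}")))
    }
}
```

Under this scheme the audit log stays linear: version 3 in the history records that someone rolled back to version 1, and version 2 remains available for diffing.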
|
||||
|
||||
---
|
||||
|
||||
## Observability
|
||||
|
||||
### OB-001: Structured Logging
|
||||
|
||||
**Description**: Comprehensive logging with structured format.
|
||||
|
||||
**Log Levels**: ERROR, WARN, INFO, DEBUG, TRACE
|
||||
|
||||
**Log Fields**:
|
||||
```json
|
||||
{
|
||||
"timestamp": "2026-03-02T10:30:00Z",
|
||||
"level": "INFO",
|
||||
"component": "agent",
|
||||
"agent_id": "550e8400-e29b-41d4-a716-446655440000",
|
||||
"trace_id": "abc123",
|
||||
"span_id": "def456",
|
||||
"message": "Configuration applied successfully",
|
||||
"fields": {
|
||||
"config_id": "config-123",
|
||||
"version": 42,
|
||||
"duration_ms": 150
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Log Targets**:
|
||||
- Master: systemd journal, file, or centralized (ELK/Loki)
|
||||
- Agent: stdout (Docker), file (standalone), or remote
|
||||
|
||||
---
|
||||
|
||||
### OB-002: Distributed Tracing
|
||||
|
||||
**Description**: OpenTelemetry tracing for request flow visualization.
|
||||
|
||||
**Traced Operations**:
|
||||
- Configuration push (master → agent → nginx)
|
||||
- Health check cycles
|
||||
- Certificate issuance
|
||||
- API requests
|
||||
|
||||
**Span Attributes**:
|
||||
- `nxmesh.agent_id`
|
||||
- `nxmesh.config_id`
|
||||
- `nxmesh.workspace_id`
|
||||
- `nxmesh.organization_id`
|
||||
|
||||
---
|
||||
|
||||
### OB-003: Access Log Aggregation
|
||||
|
||||
**Description**: Collect and query nginx access logs from all agents.
|
||||
|
||||
**Features**:
|
||||
- Centralized access log storage
|
||||
- Real-time log streaming
|
||||
- SQL-like query interface
|
||||
- Log retention policies
|
||||
|
||||
**Access Log Schema**:
|
||||
```rust
|
||||
struct AccessLogEntry {
|
||||
id: Uuid,
|
||||
agent_id: Uuid,
|
||||
timestamp: DateTime,
|
||||
|
||||
// Request details
|
||||
remote_addr: String,
|
||||
method: String,
|
||||
uri: String,
|
||||
protocol: String,
|
||||
host: String,
|
||||
|
||||
// Response details
|
||||
status: u16,
|
||||
body_bytes_sent: u64,
|
||||
response_time_ms: f64,
|
||||
|
||||
// Additional fields
|
||||
user_agent: Option<String>,
|
||||
referer: Option<String>,
|
||||
request_id: Option<String>,
|
||||
}
|
||||
```
|
||||
|
||||
**Query API**:
|
||||
```graphql
|
||||
# Example query
|
||||
query {
|
||||
accessLogs(
|
||||
filter: {
|
||||
agentId: "...",
|
||||
timeRange: { from: "2026-03-01", to: "2026-03-02" },
|
||||
statusCode: { gte: 500 }
|
||||
},
|
||||
limit: 100
|
||||
) {
|
||||
timestamp
|
||||
method
|
||||
uri
|
||||
status
|
||||
responseTimeMs
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Security Features
|
||||
|
||||
### SF-001: Authentication and Authorization
|
||||
|
||||
**Description**: Multi-method authentication with fine-grained RBAC.
|
||||
|
||||
**Authentication Methods**:
|
||||
- JWT (for API/Web UI)
|
||||
- Password-based login (local user accounts)
|
||||
- OAuth2/OIDC (Google, GitHub, enterprise SSO)
|
||||
- API Keys (for service accounts)
|
||||
- **TLS + Shared Secret** (for agent communication)
  - Server-side TLS (auto-generated self-signed or custom certificates)
  - Bootstrap token for initial registration
  - Session key with HMAC signing for ongoing requests
  - Primary/secondary key rotation
|
||||
|
||||
**RBAC Model**:
|
||||
```rust
|
||||
struct Role {
|
||||
id: Uuid,
|
||||
name: String,
|
||||
permissions: Vec<Permission>,
|
||||
}
|
||||
|
||||
enum Permission {
|
||||
// Organization scope
|
||||
OrganizationRead,
|
||||
OrganizationWrite,
|
||||
OrganizationDelete,
|
||||
|
||||
// Workspace scope
|
||||
WorkspaceRead,
|
||||
WorkspaceWrite,
|
||||
WorkspaceDelete,
|
||||
|
||||
// Agent scope
|
||||
AgentRead,
|
||||
AgentWrite,
|
||||
AgentReload,
|
||||
AgentDelete,
|
||||
|
||||
// Config scope
|
||||
ConfigRead,
|
||||
ConfigWrite,
|
||||
ConfigDeploy,
|
||||
ConfigDelete,
|
||||
|
||||
// Certificate scope
|
||||
CertificateRead,
|
||||
CertificateWrite,
|
||||
CertificateDelete,
|
||||
|
||||
// User management
|
||||
UserRead,
|
||||
UserWrite,
|
||||
UserDelete,
|
||||
}
|
||||
```
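A permission check against this model can be sketched as follows. The sketch assumes a user may hold several roles and that access is granted when any role carries the required permission; the document does not specify how roles combine, so treat that as an assumption. Only a subset of the `Permission` variants is reproduced, and the function names are hypothetical.

```rust
use std::collections::HashSet;

/// Subset of the `Permission` enum above, enough for the example.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Permission {
    AgentRead,
    AgentReload,
    ConfigRead,
    ConfigWrite,
    ConfigDeploy,
}

pub struct Role {
    pub name: String,
    pub permissions: Vec<Permission>,
}

/// Access is granted if any of the user's roles has the permission.
pub fn is_allowed(roles: &[Role], required: Permission) -> bool {
    roles.iter().any(|r| r.permissions.contains(&required))
}

/// Union of permissions across all roles, with duplicates removed.
pub fn effective_permissions(roles: &[Role]) -> HashSet<Permission> {
    roles
        .iter()
        .flat_map(|r| r.permissions.iter().copied())
        .collect()
}
```

An API middleware layer could call `is_allowed` with the permission each route requires, for example `ConfigDeploy` on a deploy endpoint.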
|
||||
|
||||
---
|
||||
|
||||
### SF-002: Secret Management
|
||||
|
||||
**Description**: Secure storage and distribution of sensitive data.
|
||||
|
||||
**Secrets**:
|
||||
- SSL private keys
|
||||
- API tokens
|
||||
- Database passwords
|
||||
- External service credentials
|
||||
|
||||
**Security Measures**:
|
||||
- Encryption at rest (AES-256-GCM)
|
||||
- Encryption in transit (TLS 1.3)
|
||||
- Automatic secret rotation
|
||||
- Audit logging for secret access
|
||||
|
||||
---
|
||||
|
||||
### SF-003: Network Security
|
||||
|
||||
**Description**: Network-level security controls.
|
||||
|
||||
**Features**:
|
||||
- IP allowlisting for agent connections
|
||||
- Rate limiting on API endpoints
|
||||
- DDoS protection recommendations
|
||||
- Security headers enforcement (HSTS, CSP, etc.)
|
||||
|
||||
**Agent Connection Security**:
|
||||
- **TLS Encryption**: Server-side TLS (auto-generated or custom certificates)
  - Development: Self-signed certificates auto-generated on first start
  - Production: Valid certificates (Let's Encrypt or corporate CA)
|
||||
- **Bootstrap Authentication**: One-time token for initial registration
|
||||
- **Session Authentication**: HMAC-signed requests with shared session key
|
||||
- **Key Rotation**: Primary/secondary key design for seamless rotation
|
||||
- **Certificate Pinning**: Optional fingerprint verification for additional security
|
||||
428
docs/project-structure.md
Normal file
@@ -0,0 +1,428 @@
|
||||
# NxMesh Project Structure
|
||||
|
||||
This document outlines the recommended project structure for the NxMesh codebase.
|
||||
|
||||
## Directory Layout
|
||||
|
||||
```
|
||||
nxmesh/
|
||||
├── Cargo.toml # Workspace root
|
||||
├── Cargo.lock
|
||||
├── README.md
|
||||
├── LICENSE
|
||||
├── justfile # Task runner
|
||||
├── AGENTS.md # AI agent context
|
||||
│
|
||||
├── crates/ # Rust workspace crates
|
||||
│ ├── nxmesh-core/ # Shared core library
|
||||
│ │ ├── Cargo.toml
|
||||
│ │ └── src/
|
||||
│ │ ├── lib.rs
|
||||
│ │ ├── models/ # Shared data models
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ ├── organization.rs
|
||||
│ │ │ ├── workspace.rs
|
||||
│ │ │ ├── agent.rs
|
||||
│ │ │ ├── config.rs
|
||||
│ │ │ └── certificate.rs
|
||||
│ │ ├── crypto/ # Encryption, hashing
|
||||
│ │ ├── validation/ # Input validation
|
||||
│ │ └── error.rs # Common error types
|
||||
│ │
|
||||
│ ├── nxmesh-proto/ # Protocol buffers
|
||||
│ │ ├── Cargo.toml
|
||||
│ │ ├── build.rs
|
||||
│ │ └── proto/
|
||||
│ │ ├── agent.proto
|
||||
│ │ ├── config.proto
|
||||
│ │ └── common.proto
|
||||
│ │
|
||||
│ ├── nxmesh-master/ # Control plane
|
||||
│ │ ├── Cargo.toml
|
||||
│ │ └── src/
|
||||
│ │ ├── main.rs
|
||||
│ │ ├── lib.rs
|
||||
│ │ ├── api/ # REST API handlers
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ ├── routes.rs
|
||||
│ │ │ ├── middleware/
|
||||
│ │ │ ├── v1/ # API version 1
|
||||
│ │ │ │ ├── mod.rs
|
||||
│ │ │ │ ├── organizations.rs
|
||||
│ │ │ │ ├── workspaces.rs
|
||||
│ │ │ │ ├── agents.rs
|
||||
│ │ │ │ ├── virtual_hosts.rs
|
||||
│ │ │ │ ├── upstreams.rs
|
||||
│ │ │ │ ├── certificates.rs
|
||||
│ │ │ │ └── metrics.rs
|
||||
│ │ │ └── websocket.rs
|
||||
│ │ ├── grpc/ # gRPC service
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ ├── server.rs
|
||||
│ │ │ ├── agent_service.rs
|
||||
│ │ │ └── interceptor.rs
|
||||
│ │ ├── config/ # Configuration
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ └── settings.rs
|
||||
│ │ ├── db/ # Database layer
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ ├── connection.rs
|
||||
│ │ │ ├── migration.rs
|
||||
│ │ │ └── repositories/
|
||||
│ │ ├── services/ # Business logic
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ ├── organization_service.rs
|
||||
│ │ │ ├── workspace_service.rs
|
||||
│ │ │ ├── agent_service.rs
|
||||
│ │ │ ├── config_service.rs
|
||||
│ │ │ ├── certificate_service.rs
|
||||
│ │ │ └── auth_service.rs
|
||||
│ │ ├── domain/ # Domain entities
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ ├── organization.rs
|
||||
│ │ │ ├── agent.rs
|
||||
│ │ │ └── config.rs
|
||||
│ │ ├── infrastructure/ # External integrations
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ ├── acme/ # Let's Encrypt
|
||||
│ │ │ ├── storage/ # Object storage
|
||||
│ │ │ └── notifier/ # Notifications
|
||||
│ │ ├── events/ # Event bus
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ ├── bus.rs
|
||||
│ │ │ └── handlers.rs
|
||||
│ │ └── cli.rs # CLI commands
|
||||
│ │
|
||||
│ ├── nxmesh-agent/ # Data plane
|
||||
│ │ ├── Cargo.toml
|
||||
│ │ └── src/
|
||||
│ │ ├── main.rs
|
||||
│ │ ├── lib.rs
|
||||
│ │ ├── config/ # Agent configuration
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ └── settings.rs
|
||||
│ │ ├── master/ # Master communication
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ ├── client.rs
|
||||
│ │ │ ├── reconnect.rs
|
||||
│ │ │ └── stream.rs
|
||||
│ │ ├── nginx/ # Nginx management
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ ├── controller.rs
|
||||
│ │ │ ├── config_manager.rs # Symlink-based atomic deployment
|
||||
│ │ │ ├── config_renderer.rs
|
||||
│ │ │ ├── validator.rs
|
||||
│ │ │ ├── docker_sidecar.rs # Docker sidecar (PID namespace sharing)
|
||||
│ │ │ ├── systemd.rs # Standalone mode
|
||||
│ │ │ └── parser.rs # Nginx config parser
|
||||
│ │ ├── health/ # Health monitoring
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ ├── monitor.rs
|
||||
│ │ │ ├── nginx.rs
|
||||
│ │ │ └── system.rs
|
||||
│ │ ├── metrics/ # Metrics collection
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ ├── collector.rs
|
||||
│ │ │ └── exporter.rs
|
||||
│ │ ├── cache/ # Local caching
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ └── config_cache.rs
|
||||
│ │ ├── watch/ # File watchers
|
||||
│ │ │ ├── mod.rs
|
||||
│ │ │ └── config_watch.rs
|
||||
│ │ └── cli.rs # CLI commands
|
||||
│ │
|
||||
│ └── nxmesh-cli/ # CLI tool
|
||||
│ ├── Cargo.toml
|
||||
│ └── src/
|
||||
│ ├── main.rs
|
||||
│ ├── commands/ # CLI commands
|
||||
│ │ ├── mod.rs
|
||||
│ │ ├── login.rs
|
||||
│ │ ├── agent.rs
|
||||
│ │ ├── config.rs
|
||||
│ │ └── deploy.rs
|
||||
│ └── api/ # API client
|
||||
│
|
||||
├── frontend/ # Web UI (embedded in master)
|
||||
│ ├── package.json
|
||||
│ ├── vite.config.ts
|
||||
│ ├── tsconfig.json
|
||||
│ ├── index.html
|
||||
│ ├── src/
|
||||
│ │ ├── main.tsx
|
||||
│ │ ├── App.tsx
|
||||
│ │ ├── components/ # Reusable components
|
||||
│ │ │ ├── common/
|
||||
│ │ │ ├── layout/
|
||||
│ │ │ └── forms/
|
||||
│ │ ├── pages/ # Page components
|
||||
│ │ │ ├── Dashboard/
|
||||
│ │ │ ├── Agents/
|
||||
│ │ │ ├── Configurations/
|
||||
│ │ │ ├── Certificates/
|
||||
│ │ │ └── Settings/
|
||||
│ │ ├── hooks/ # React hooks
|
||||
│ │ ├── stores/ # State management (Zustand)
|
||||
│ │ ├── api/ # API client
|
||||
│ │ ├── types/ # TypeScript types
|
||||
│ │ ├── utils/ # Utilities
|
||||
│ │ └── styles/ # CSS/Tailwind
|
||||
│ └── public/
|
||||
│
|
||||
│ # Build output (dist/) is embedded into master binary
|
||||
│ # Master serves static files at root path ("/")
|
||||
│
|
||||
├── migrations/ # Database migrations
|
||||
│ └── sea-orm/
|
||||
│ ├── Cargo.toml
|
||||
│ └── src/
|
||||
│
|
||||
├── tests/ # Integration tests
|
||||
│ ├── integration/
|
||||
│ │ ├── master_api_tests.rs
|
||||
│ │ ├── agent_master_tests.rs
|
||||
│ │ └── config_flow_tests.rs
|
||||
│ └── fixtures/
|
||||
│
|
||||
├── scripts/ # Build/utility scripts
|
||||
│ ├── build.sh
|
||||
│ ├── test.sh
|
||||
│ └── release.sh
|
||||
│
|
||||
├── deploy/ # Deployment configs
|
||||
│ ├── docker/
|
||||
│ │ ├── master.Dockerfile
|
||||
│ │ ├── agent.Dockerfile
|
||||
│ │ └── docker-compose.yml
|
||||
│ ├── k8s/
|
||||
│ │ ├── namespace.yaml
|
||||
│ │ ├── master/
|
||||
│ │ ├── agent/
|
||||
│ │ └── helm/
|
||||
│ └── terraform/
|
||||
│
|
||||
├── docs/ # Documentation
|
||||
│ ├── architecture.md
|
||||
│ ├── features.md
|
||||
│ ├── roadmap.md
|
||||
│ ├── api.md
|
||||
│ ├── deployment.md
|
||||
│ └── project-structure.md
|
||||
│
|
||||
└── .devcontainer/ # Dev container
|
||||
├── devcontainer.json
|
||||
├── docker-compose.yml
|
||||
├── Dockerfile
|
||||
└── nginx/
|
||||
```
|
||||
|
||||
## Crate Dependencies
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph "Workspace Crates"
|
||||
CLI[nxmesh-cli]
|
||||
AGENT[nxmesh-agent]
|
||||
MASTER[nxmesh-master]
|
||||
PROTO[nxmesh-proto]
|
||||
CORE[nxmesh-core]
|
||||
end
|
||||
|
||||
CORE --> PROTO
|
||||
AGENT --> CORE
|
||||
AGENT --> PROTO
|
||||
MASTER --> CORE
|
||||
MASTER --> PROTO
|
||||
CLI --> CORE
|
||||
```
|
||||
|
||||
## Key Design Principles
|
||||
|
||||
### 1. Separation of Concerns
|
||||
|
||||
- **nxmesh-core**: Only shared types and utilities
|
||||
- **nxmesh-master**: Only control plane logic
|
||||
- **nxmesh-agent**: Only data plane logic
|
||||
- **frontend**: Only UI logic
|
||||
|
||||
### 2. Domain-Driven Design (in Master)
|
||||
|
||||
```
|
||||
domain/ # Domain entities (pure logic)
|
||||
services/ # Application services (orchestration)
|
||||
repositories/ # Data access abstraction
|
||||
api/ # Interface adapters (HTTP, gRPC)
|
||||
infrastructure/ # External concerns
|
||||
```
|
||||
|
||||
### 3. Agent Modularity
|
||||
|
||||
Each major concern in the agent is a separate module:
|
||||
- `nginx/`: All nginx-specific code
|
||||
- `master/`: All master communication code
|
||||
- `health/`: All health monitoring code
|
||||
- `metrics/`: All metrics code
|
||||
|
||||
### 4. Configuration Management
|
||||
|
||||
Use hierarchical config, where each later source overrides the ones before it:
|
||||
1. Default values (in code)
|
||||
2. Config file (`/etc/nxmesh/*.toml`)
|
||||
3. Environment variables
|
||||
4. Command-line arguments (highest priority)
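The precedence order above can be sketched with optional fields that are overlaid from lowest to highest priority. The field names, the `Layer` and `Settings` types, and the default master URL are placeholders chosen for this example; only the 30-second reporting default is taken from the feature documentation.

```rust
/// One configuration layer; `None` means "not set at this level".
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct Layer {
    pub master_url: Option<String>,
    pub report_interval_secs: Option<u64>,
}

/// Fully resolved settings after all layers are merged.
#[derive(Debug, PartialEq, Eq)]
pub struct Settings {
    pub master_url: String,
    pub report_interval_secs: u64,
}

impl Layer {
    /// Overlay `higher` on top of `self`; values set in `higher` win.
    pub fn overlay(self, higher: Layer) -> Layer {
        Layer {
            master_url: higher.master_url.or(self.master_url),
            report_interval_secs: higher
                .report_interval_secs
                .or(self.report_interval_secs),
        }
    }
}

/// Merge the layers in priority order: defaults < file < env < CLI.
pub fn resolve(file: Layer, env: Layer, cli: Layer) -> Settings {
    // Defaults live in code (lowest priority). The URL is a placeholder.
    let defaults = Layer {
        master_url: Some("https://localhost:9443".to_string()),
        report_interval_secs: Some(30),
    };
    let merged = defaults.overlay(file).overlay(env).overlay(cli);
    Settings {
        master_url: merged.master_url.expect("default is always set"),
        report_interval_secs: merged
            .report_interval_secs
            .expect("default is always set"),
    }
}
```

Keeping every layer as `Option` fields means a source that does not mention a key leaves the lower-priority value untouched, which is the behaviour the list above describes.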
|
||||
|
||||
## Module Guidelines
|
||||
|
||||
### API Versioning
|
||||
|
||||
- Always version REST APIs: `/api/v1/...`
|
||||
- Maintain backward compatibility within major versions
|
||||
- Use feature flags for gradual rollouts
|
||||
|
||||
### Error Handling
|
||||
|
||||
- Use `thiserror` for error definitions
|
||||
- Propagate errors with context
|
||||
- Convert to user-friendly messages at API boundary
|
||||
|
||||
### Testing Structure
|
||||
|
||||
```rust
|
||||
// In each module
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_feature() {
|
||||
// unit tests
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
- Unit tests: In same file as code
|
||||
- Integration tests: In `tests/` directory
|
||||
- E2E tests: Separate crate or external repo
|
||||
|
||||
### Documentation
|
||||
|
||||
- All public APIs must have doc comments
|
||||
- Include examples in doc comments
|
||||
- Keep README files in each crate
|
||||
|
||||
## Build Configuration
|
||||
|
||||
### Workspace Cargo.toml
|
||||
|
||||
```toml
|
||||
[workspace]
|
||||
members = [
|
||||
"crates/nxmesh-core",
|
||||
"crates/nxmesh-proto",
|
||||
"crates/nxmesh-master",
|
||||
"crates/nxmesh-agent",
|
||||
"crates/nxmesh-cli",
|
||||
]
|
||||
resolver = "3"
|
||||
|
||||
[workspace.dependencies]
|
||||
# Core dependencies
|
||||
tokio = { version = "1", features = ["full"] }
|
||||
serde = { version = "1", features = ["derive"] }
|
||||
thiserror = "1"
|
||||
tracing = "0.1"
|
||||
|
||||
# Web framework
|
||||
axum = "0.7"
|
||||
tower = "0.4"
|
||||
tower-http = "0.5"
|
||||
|
||||
# gRPC
|
||||
tonic = "0.11"
|
||||
prost = "0.12"
|
||||
|
||||
# Database
|
||||
sea-orm = "2.0.0-rc"
|
||||
sea-orm-migration = "2.0.0-rc"
|
||||
|
||||
# Async
|
||||
async-trait = "0.1"
|
||||
futures = "0.3"
|
||||
|
||||
# Serialization
|
||||
serde_json = "1"
|
||||
toml = "0.8"
|
||||
|
||||
# HTTP
|
||||
reqwest = { version = "0.12", default-features = false }
|
||||
|
||||
# Crypto
|
||||
sha2 = "0.10"
|
||||
hex = "0.4"
|
||||
|
||||
# Testing
|
||||
tokio-test = "0.4"
|
||||
mockall = "0.12"
|
||||
```
|
||||
|
||||
## Naming Conventions
|
||||
|
||||
### Files
|
||||
- Use `snake_case` for file names
|
||||
- Module entry point: `mod.rs` or `{module_name}.rs`
|
||||
|
||||
### Types
|
||||
- Structs/Enums: `PascalCase`
|
||||
- Traits: `PascalCase` (often ending in `able` or with verb prefix)
|
||||
- Functions/Methods: `snake_case`
|
||||
- Constants: `SCREAMING_SNAKE_CASE`
|
||||
- Generic parameters: Single uppercase letter (`T`, `K`, `V`)
|
||||
|
||||
### Error Types
|
||||
- Suffix with `Error`: `ConfigError`, `AgentError`
|
||||
- Group in `error.rs` or `errors/` module
|
||||
|
||||
### Feature Flags
|
||||
- Use `kebab-case`: `postgres-native`, `tls-rustls`
|
||||
|
||||
## CI/CD Structure
|
||||
|
||||
```
|
||||
# .github/workflows/
|
||||
├── ci.yml # PR checks
|
||||
├── test.yml # Test suite
|
||||
├── release.yml # Release builds
|
||||
├── docker.yml # Docker image builds
|
||||
└── docs.yml # Documentation deploy
|
||||
```
|
||||
|
||||
## Scripts
|
||||
|
||||
Common operations should be exposed as `just` recipes:
|
||||
|
||||
```bash
|
||||
# Development
|
||||
just dev # Start all services
|
||||
just dev-backend # Start backend only
|
||||
just dev-frontend # Start frontend only
|
||||
|
||||
# Testing
|
||||
just test # Run all tests
|
||||
just test-unit # Unit tests only
|
||||
just test-integration # Integration tests
|
||||
|
||||
# Building
|
||||
just build # Build all
|
||||
just build-master # Build master only
|
||||
just build-agent # Build agent only
|
||||
|
||||
# Database
|
||||
just db-migrate # Run migrations
|
||||
just db-reset # Reset database
|
||||
just db-console # Open psql
|
||||
|
||||
# Deployment
|
||||
just docker-build # Build Docker images
|
||||
just k8s-deploy # Deploy to Kubernetes
|
||||
```
|
||||
486
docs/roadmap.md
Normal file
@@ -0,0 +1,486 @@
|
||||
# NxMesh Project Roadmap
|
||||
|
||||
## Overview
|
||||
|
||||
This document outlines the development phases and milestones for NxMesh. The project is divided into four major phases, each building upon the previous one.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1: Foundation (Months 1-3)
|
||||
|
||||
**Goal**: Build a working MVP with basic master-agent communication and nginx configuration management.
|
||||
|
||||
### Milestone 1.1: Project Setup and Core Infrastructure
|
||||
**Target**: Week 2
|
||||
|
||||
| Task | Description | Status |
|
||||
|------|-------------|--------|
|
||||
| [ ] | Set up Rust workspace structure (master, agent, shared) | 🔲 |
|
||||
| [ ] | Configure CI/CD pipeline (GitHub Actions) | 🔲 |
|
||||
| [ ] | Set up database schema with SeaORM migrations | 🔲 |
|
||||
| [ ] | Create development environment (devcontainer) | 🔲 |
|
||||
| [ ] | Set up testing framework (unit, integration) | 🔲 |
|
||||
|
||||
**Deliverables**:
|
||||
- Working development environment
|
||||
- Database schema for organizations, workspaces, agents
|
||||
- CI pipeline with linting and testing
|
||||
|
||||
---
|
||||
|
||||
### Milestone 1.2: Master - Core API
|
||||
**Target**: Week 5
|
||||
|
||||
| Task | Description | Status |
|
||||
|------|-------------|--------|
|
||||
| [ ] | Implement Axum-based REST API server | 🔲 |
|
||||
| [ ] | JWT authentication middleware | 🔲 |
|
||||
| [ ] | CRUD endpoints for Organizations | 🔲 |
|
||||
| [ ] | CRUD endpoints for Workspaces | 🔲 |
|
||||
| [ ] | CRUD endpoints for Agents | 🔲 |
|
||||
| [ ] | PostgreSQL persistence layer | 🔲 |
|
||||
|
||||
**Deliverables**:
|
||||
- REST API for basic resource management
|
||||
- JWT authentication working
|
||||
- API documentation (OpenAPI)
|
||||
|
||||
---
|
||||
|
||||
### Milestone 1.3: Master - Agent Communication
|
||||
**Target**: Week 7
|
||||
|
||||
| Task | Description | Status |
|
||||
|------|-------------|--------|
|
||||
| [ ] | gRPC server implementation (Tonic) | 🔲 |
|
||||
| [ ] | Bidirectional streaming protocol | 🔲 |
|
||||
| [ ] | Agent registration flow | 🔲 |
|
||||
| [ ] | Token-based authentication for agents | 🔲 |
|
||||
| [ ] | Agent heartbeat/health monitoring | 🔲 |
|
||||
| [ ] | WebSocket fallback for events | 🔲 |
|
||||
|
||||
**Deliverables**:
|
||||
- Master can accept agent connections
|
||||
- Agent registration and authentication works
|
||||
- Health status tracking
|
||||
|
||||
---
|
||||
|
||||
### Milestone 1.4: Agent - Core Functionality
|
||||
**Target**: Week 9
|
||||
|
||||
| Task | Description | Status |
|
||||
|------|-------------|--------|
|
||||
| [ ] | Agent CLI and configuration | 🔲 |
|
||||
| [ ] | gRPC client for master communication | 🔲 |
|
||||
| [ ] | Automatic reconnection with backoff | 🔲 |
|
||||
| [ ] | Nginx process management (Docker sidecar PID sharing) | 🔲 |
|
||||
| [ ] | Health check reporting | 🔲 |
|
||||
| [ ] | Local config caching | 🔲 |
|
||||
|
||||
**Deliverables**:
|
||||
- Agent binary that connects to master
|
||||
- Nginx lifecycle management (Docker sidecar mode)
|
||||
- Health reporting
|
||||
|
||||
---
|
||||
|
||||
### Milestone 1.5: Configuration Management
|
||||
**Target**: Week 11
|
||||
|
||||
| Task | Description | Status |
|
||||
|------|-------------|--------|
|
||||
| [ ] | VirtualHost CRUD API | 🔲 |
|
||||
| [ ] | Upstream CRUD API | 🔲 |
|
||||
| [ ] | Handlebars template engine integration | 🔲 |
|
||||
| [ ] | Config rendering on agent | 🔲 |
|
||||
| [ ] | Nginx config validation (`nginx -t`) | 🔲 |
|
||||
| [ ] | Graceful reload on config change | 🔲 |
|
||||
|
||||
**Deliverables**:
|
||||
- End-to-end config push: Master → Agent → Nginx
|
||||
- Basic virtual host and upstream management
|
||||
- Template-based nginx config generation
|
||||
|
||||
---
|
||||
|
||||
### Milestone 1.6: Web Admin Console - Foundation
|
||||
**Target**: Week 13
|
||||
|
||||
| Task | Description | Status |
|
||||
|------|-------------|--------|
|
||||
| [ ] | React + Vite project setup | 🔲 |
|
||||
| [ ] | Authentication UI (login/logout) | 🔲 |
|
||||
| [ ] | Dashboard layout and navigation | 🔲 |
|
||||
| [ ] | Agent list and detail views | 🔲 |
|
||||
| [ ] | Basic virtual host form | 🔲 |
|
||||
| [ ] | WebSocket integration for real-time updates | 🔲 |
|
||||
|
||||
**Deliverables**:
|
||||
- Functional Web UI
|
||||
- Agent management via UI
|
||||
- Basic configuration editing
|
||||
|
||||
---
|
||||
|
||||
### Phase 1 Completion Criteria
|
||||
- [ ] Master and Agent communicate via gRPC
|
||||
- [ ] Nginx configs can be pushed from Master to Agent
|
||||
- [ ] Web UI for basic management
|
||||
- [ ] Docker sidecar deployment working
|
||||
- [ ] Documentation complete
|
||||
|
||||
**Estimated Effort**: 3 months
|
||||
**Team Size**: 2-3 engineers
|
||||
|
||||
---
|
||||
|
||||
## Phase 2: Resilience and Observability (Months 4-5)

**Goal**: Make the system production-ready with HA, monitoring, and robust failure handling.

### Milestone 2.1: High Availability - Master Clustering
**Target**: Week 15

| Task | Description | Status |
|------|-------------|--------|
| [ ] | Raft consensus integration (raft-rs) | 🔲 |
| [ ] | Leader election | 🔲 |
| [ ] | State replication across masters | 🔲 |
| [ ] | Agent connection failover | 🔲 |
| [ ] | Cluster health monitoring | 🔲 |

**Deliverables**:
- Multiple master instances can form a cluster
- Automatic failover on master failure
- No single point of failure

---

### Milestone 2.2: Certificate Management
**Target**: Week 17

| Task | Description | Status |
|------|-------------|--------|
| [ ] | ACME client integration (acme-rs) | 🔲 |
| [ ] | Let's Encrypt HTTP-01 challenge | 🔲 |
| [ ] | Certificate storage (encrypted) | 🔲 |
| [ ] | Automatic renewal | 🔲 |
| [ ] | Certificate distribution to agents | 🔲 |
| [ ] | Expiration monitoring and alerts | 🔲 |

**Deliverables**:
- Automatic SSL certificate provisioning
- Certificate renewal before expiry
- UI for certificate management

---

### Milestone 2.3: Observability Stack
**Target**: Week 19

| Task | Description | Status |
|------|-------------|--------|
| [ ] | OpenTelemetry integration | 🔲 |
| [ ] | Structured logging (tracing) | 🔲 |
| [ ] | Prometheus metrics endpoint (agent) | 🔲 |
| [ ] | Custom metrics collection | 🔲 |
| [ ] | Health check dashboard | 🔲 |
| [ ] | Alert configuration | 🔲 |

**Deliverables**:
- Metrics visible in Prometheus
- Distributed traces for config pushes
- Health dashboard in Web UI

---

### Milestone 2.4: Enhanced Failure Handling
**Target**: Week 21

| Task | Description | Status |
|------|-------------|--------|
| [ ] | Configuration drift detection | 🔲 |
| [ ] | Auto-healing (config sync) | 🔲 |
| [ ] | Circuit breaker for master connection | 🔲 |
| [ ] | Nginx crash detection and restart | 🔲 |
| [ ] | Config rollback on validation failure | 🔲 |
| [ ] | Bulk operations and queue management | 🔲 |

**Deliverables**:
- System self-heals from common failures
- Config drift automatically corrected
- Robust reconnection logic
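
As a rough illustration of the "circuit breaker for master connection" task, the sketch below models the usual closed / open / half-open state machine. The type names, the failure threshold, and the cooldown parameter are all assumptions made for the example; the roadmap does not specify how NxMesh will implement this. Time is passed in explicitly so the logic can be exercised without sleeping.

```rust
use std::time::{Duration, Instant};

/// Breaker states for the agent's connection to the master.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    /// Calls flow normally.
    Closed,
    /// Calls are refused until the cooldown has elapsed.
    Open { since: Instant },
    /// One trial call is allowed to probe whether the master has recovered.
    HalfOpen,
}

pub struct CircuitBreaker {
    state: State,
    consecutive_failures: u32,
    failure_threshold: u32,
    cooldown: Duration,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, cooldown: Duration) -> Self {
        Self {
            state: State::Closed,
            consecutive_failures: 0,
            failure_threshold,
            cooldown,
        }
    }

    /// Returns true if a call to the master may be attempted at `now`.
    pub fn allow(&mut self, now: Instant) -> bool {
        match self.state {
            State::Closed | State::HalfOpen => true,
            State::Open { since } => {
                if now.saturating_duration_since(since) >= self.cooldown {
                    // Cooldown elapsed: let a probe call through.
                    self.state = State::HalfOpen;
                    true
                } else {
                    false
                }
            }
        }
    }

    /// A successful call closes the breaker and clears the failure count.
    pub fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.state = State::Closed;
    }

    /// A failed probe, or too many consecutive failures, opens the breaker.
    pub fn record_failure(&mut self, now: Instant) {
        self.consecutive_failures += 1;
        if self.state == State::HalfOpen || self.consecutive_failures >= self.failure_threshold {
            self.state = State::Open { since: now };
        }
    }

    pub fn state(&self) -> State {
        self.state
    }
}
```

While the breaker is open, an agent would keep serving its last known good configuration, which fits the Local Autonomy principle of operating independently during master outages.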

---

### Phase 2 Completion Criteria
- [ ] Master clustering with Raft
- [ ] Automatic SSL certificates
- [ ] Full observability (metrics, logs, traces)
- [ ] Production-grade failure handling
- [ ] Performance benchmarks

**Estimated Effort**: 2 months
**Team Size**: 2-3 engineers

---

## Phase 3: Advanced Traffic Management (Months 6-7)

**Goal**: Add enterprise-grade traffic management features.

### Milestone 3.1: Advanced Load Balancing
**Target**: Week 23

| Task | Description | Status |
|------|-------------|--------|
| [ ] | Multiple load balancing algorithms | 🔲 |
| [ ] | Health checks for upstream servers | 🔲 |
| [ ] | Circuit breaker for upstreams | 🔲 |
| [ ] | Retry policies | 🔲 |
| [ ] | Connection pooling | 🔲 |
| [ ] | Upstream status dashboard | 🔲 |

**Deliverables**:
- Advanced upstream configuration
- Health check visualization
- Circuit breaker metrics

---

### Milestone 3.2: Rate Limiting and WAF
**Target**: Week 25

| Task | Description | Status |
|------|-------------|--------|
| [ ] | Rate limiting rules (IP, user, global) | 🔲 |
| [ ] | Rate limiting zones | 🔲 |
| [ ] | Basic WAF rules (ModSecurity integration) | 🔲 |
| [ ] | IP allowlist/blocklist | 🔲 |
| [ ] | Geo-blocking | 🔲 |
| [ ] | Rate limit analytics | 🔲 |

**Deliverables**:
- Configurable rate limiting
- Basic WAF protection
- Security event dashboard

---

### Milestone 3.3: Traffic Routing and Canary
**Target**: Week 27

| Task | Description | Status |
|------|-------------|--------|
| [ ] | Header-based routing | 🔲 |
| [ ] | Weight-based traffic splitting | 🔲 |
| [ ] | Canary deployment support | 🔲 |
| [ ] | A/B testing configuration | 🔲 |
| [ ] | Blue-green deployment | 🔲 |
| [ ] | Traffic analytics | 🔲 |

**Deliverables**:
- Advanced traffic routing
- Canary deployment UI
- Traffic split visualization
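
To make "weight-based traffic splitting" concrete, here is a small sketch of how a request could be mapped to a backend in proportion to configured weights. The `Backend` struct, the `pick` function, and the idea of a per-request `ticket` (for example, a hash of the client address) are assumptions for illustration only. In practice NxMesh may render this into nginx configuration rather than compute it in Rust.

```rust
/// A routing target with a relative weight (hypothetical type).
pub struct Backend {
    pub name: String,
    pub weight: u32,
}

/// Maps `ticket` onto the backends in proportion to their weights.
///
/// The weights are laid end to end on a line of length `total`; the ticket,
/// reduced modulo `total`, selects the segment it falls into. The same ticket
/// always yields the same backend, which keeps a client pinned to one side of
/// a canary split as long as the weights do not change.
pub fn pick(backends: &[Backend], ticket: u64) -> Option<&Backend> {
    let total: u64 = backends.iter().map(|b| u64::from(b.weight)).sum();
    if total == 0 {
        return None; // no backends, or every weight is zero
    }
    let mut slot = ticket % total;
    for backend in backends {
        let weight = u64::from(backend.weight);
        if slot < weight {
            return Some(backend);
        }
        slot -= weight;
    }
    None
}
```

With weights of 90 and 10, tickets spread evenly over the range send one request in ten to the second backend, which is the shape of a typical canary rollout.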

---

### Milestone 3.4: Access Log Aggregation
**Target**: Week 29

| Task | Description | Status |
|------|-------------|--------|
| [ ] | Nginx access log parsing | 🔲 |
| [ ] | Log streaming to master | 🔲 |
| [ ] | Log storage and indexing | 🔲 |
| [ ] | Log query interface | 🔲 |
| [ ] | Real-time log tailing | 🔲 |
| [ ] | Log-based alerting | 🔲 |

**Deliverables**:
- Centralized access logs
- Log search and filtering
- Log-based metrics

---

### Phase 3 Completion Criteria
- [ ] Advanced load balancing and health checks
- [ ] Rate limiting and basic WAF
- [ ] Canary and A/B testing
- [ ] Access log aggregation
- [ ] Traffic analytics dashboard

**Estimated Effort**: 2 months
**Team Size**: 2-3 engineers

---

## Phase 4: Enterprise Features (Months 8-10)

**Goal**: Enterprise readiness with multi-tenancy, RBAC, and advanced integrations.

### Milestone 4.1: Multi-tenancy and RBAC
**Target**: Week 31

| Task | Description | Status |
|------|-------------|--------|
| [ ] | Organization isolation | 🔲 |
| [ ] | Workspace-scoped resources | 🔲 |
| [ ] | Role-based access control | 🔲 |
| [ ] | User management API | 🔲 |
| [ ] | API key management | 🔲 |
| [ ] | Audit logging | 🔲 |

**Deliverables**:
- Full multi-tenancy
- Granular permissions
- Audit trail

---

### Milestone 4.2: Kubernetes Integration
**Target**: Week 33

| Task | Description | Status |
|------|-------------|--------|
| [ ] | Kubernetes operator | 🔲 |
| [ ] | CRD definitions | 🔲 |
| [ ] | Helm chart | 🔲 |
| [ ] | Service discovery integration | 🔲 |
| [ ] | Ingress controller mode | 🔲 |
| [ ] | K8s-native agent deployment | 🔲 |

**Deliverables**:
- Kubernetes operator
- Helm chart for easy deployment
- Ingress controller functionality

---

### Milestone 4.3: External Integrations
**Target**: Week 35

| Task | Description | Status |
|------|-------------|--------|
| [ ] | Terraform provider | 🔲 |
| [ ] | GitOps integration (Git sync) | 🔲 |
| [ ] | Webhook support | 🔲 |
| [ ] | Slack/Discord notifications | 🔲 |
| [ ] | PagerDuty/Opsgenie integration | 🔲 |
| [ ] | DNS provider integration (Route53, Cloudflare) | 🔲 |

**Deliverables**:
- Infrastructure as Code support
- GitOps workflows
- Notification channels

---

### Milestone 4.4: Performance and Scale
**Target**: Week 37

| Task | Description | Status |
|------|-------------|--------|
| [ ] | Connection pooling optimization | 🔲 |
| [ ] | Config caching improvements | 🔲 |
| [ ] | Database query optimization | 🔲 |
| [ ] | Horizontal scaling tests | 🔲 |
| [ ] | Load testing (10k+ agents) | 🔲 |
| [ ] | Performance tuning documentation | 🔲 |

**Deliverables**:
- Performance benchmarks
- Scaling guidelines
- Optimization recommendations

---

### Milestone 4.5: Enterprise Security
**Target**: Week 39

| Task | Description | Status |
|------|-------------|--------|
| [ ] | mTLS for all communications | 🔲 |
| [ ] | Secret encryption at rest | 🔲 |
| [ ] | HSM integration | 🔲 |
| [ ] | SSO/SAML integration | 🔲 |
| [ ] | Security scanning (SAST/DAST) | 🔲 |
| [ ] | Compliance documentation (SOC2) | 🔲 |

**Deliverables**:
- Enterprise security features
- Compliance documentation
- Security audit

---

### Phase 4 Completion Criteria
- [ ] Full RBAC and multi-tenancy
- [ ] Kubernetes operator
- [ ] External integrations (Terraform, GitOps)
- [ ] Proven scalability (10k+ agents)
- [ ] Enterprise security compliance

**Estimated Effort**: 3 months
**Team Size**: 3-4 engineers

---

## Timeline Summary

```
Month 1-3: ████████████████████████████████████████ Phase 1: Foundation
Month 4-5: ████████████████████ Phase 2: Resilience
Month 6-7: ████████████████████ Phase 3: Advanced
Month 8-10: ██████████████████████████ Phase 4: Enterprise

Week: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
|--M1--|--M2--|--M3--|--M4--|--M5--|--M6--|

Week: 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
|--M7--|--M8--|--M9--|--M10-|--M11-|--M12-|--M13-|--M14-|
```

---

## Resource Requirements

### Phase 1
- **Backend Engineers**: 2
- **Frontend Engineer**: 1
- **Total Person-Months**: 9

### Phase 2
- **Backend Engineers**: 2
- **Frontend Engineer**: 1 (part-time)
- **DevOps Engineer**: 1 (part-time)
- **Total Person-Months**: 7

### Phase 3
- **Backend Engineers**: 2
- **Frontend Engineer**: 1
- **Total Person-Months**: 6

### Phase 4
- **Backend Engineers**: 2
- **Frontend Engineer**: 1
- **DevOps Engineer**: 1
- **Security Engineer**: 1 (part-time)
- **Total Person-Months**: 10

**Total Project**: ~32 person-months

---

## Risk Assessment

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Raft complexity delays HA | Medium | High | Start with single master, add HA later |
| gRPC performance issues | Low | Medium | Implement WebSocket fallback early |
| Nginx reload edge cases | Medium | High | Extensive testing, rollback capability |
| Team scaling challenges | Medium | Medium | Document architecture, modular design |
| Integration complexity | Medium | Medium | Clear APIs, contract testing |