Add project structure and roadmap documentation

- Created `project-structure.md` to outline the directory layout, crate dependencies, design principles, module guidelines, and naming conventions for the NxMesh codebase.
- Introduced `roadmap.md` detailing the development phases, milestones, tasks, deliverables, and resource requirements for the NxMesh project, spanning from foundational setup to enterprise features.
GW_MC
2026-03-03 04:13:31 +00:00
parent 39bd860c55
commit 43b2e44d95
11 changed files with 9293 additions and 7 deletions


@@ -8,6 +8,7 @@ RUN apt-get update && apt-get install -y \
 pkg-config \
 libssl-dev \
 postgresql-client \
+protobuf-compiler \
 && rm -rf /var/lib/apt/lists/*
 # Set working directory


@@ -27,7 +27,7 @@ services:
 - docker
 pid: "service:nginx"
-# Data Plane - Nginx (controlled by agent via Docker)
+# Data Plane - Nginx (controlled by agent via PID namespace sharing)
 nginx:
 image: nginx:alpine
 container_name: nxmesh-nginx

104
AGENTS.md Normal file

@@ -0,0 +1,104 @@
# NxMesh - Agent Instructions
This document provides context for AI agents working on the NxMesh project.
## Project Overview
**NxMesh** is a distributed nginx management system using a master-agent architecture:
- **Master (Control Plane)**: Central API, embedded Web UI, configuration distribution, cluster management
- **Agent (Data Plane)**: Sidecar that manages local nginx instances
- **Web UI**: React admin console (built with Vite), embedded in and served by the master
## Quick Links to Documentation
| Document | Purpose |
|----------|---------|
| [README.md](./README.md) | Project overview and quick start |
| [docs/architecture.md](./docs/architecture.md) | System design and data flow |
| [docs/features.md](./docs/features.md) | Detailed feature specifications |
| [docs/roadmap.md](./docs/roadmap.md) | Development phases and milestones |
| [docs/api.md](./docs/api.md) | REST and gRPC API specifications |
| [docs/project-structure.md](./docs/project-structure.md) | Code organization |
## Technology Stack
| Component | Technology |
|-----------|------------|
| Backend | Rust (Axum, Tonic, SeaORM) |
| Frontend | React + TypeScript + Vite |
| Database | PostgreSQL 16+ |
| Cache | Redis |
| Message Format | Protocol Buffers (gRPC) |
| Container | Docker |
| Orchestration | Kubernetes (optional) |
## Development Environment
This project uses Dev Containers for consistent development:
```bash
# All dependencies are pre-installed in the devcontainer
just setup # Initial setup
just dev # Start development
```
### Pre-configured Services
The devcontainer includes:
- PostgreSQL database
- Redis cache
- Nginx instance
- Rust toolchain
- Node.js/Bun for frontend
## Key Design Decisions
1. **Master-Agent Protocol**: Bidirectional gRPC streaming for real-time communication
2. **Configuration Management**: Template-based (Handlebars) with versioning
3. **Security**: TLS + Shared Secret for agent connections, JWT for API auth
4. **Deployment**: Support for Docker sidecar, K8s sidecar, and standalone modes
## Common Tasks
### Adding a New API Endpoint
1. Define route in `crates/nxmesh-master/src/api/v1/`
2. Add request/response types to shared models
3. Implement handler with proper error handling
4. Add tests
5. Update OpenAPI documentation
### Adding a Database Entity
1. Create migration with `sea-orm-cli migrate generate <name>`
2. Define entity in `crates/nxmesh-master/src/db/entities/`
3. Add repository in `crates/nxmesh-master/src/db/repositories/`
4. Update service layer
### Adding Agent Functionality
1. Add module in `crates/nxmesh-agent/src/`
2. Update gRPC protocol if needed (`crates/nxmesh-proto/proto/`)
3. Implement handler in agent
4. Add corresponding master service
## Testing
```bash
just test # All tests
just test-unit # Unit tests only
just test-integration # Integration tests
```
## Code Style
- Follow Rust API Guidelines
- Use `cargo fmt` and `cargo clippy`
- All public APIs must have doc comments
- Error types should be descriptive and actionable
## Questions?
Refer to the documentation in the `docs/` directory or ask the team.

5552
Cargo.lock generated

File diff suppressed because it is too large


@@ -1,13 +1,80 @@
[workspace] [workspace]
members = [ members = [
"crates/nxmesh-core",
"crates/nxmesh-proto",
"crates/nxmesh-master",
"crates/nxmesh-agent",
"crates/nxmesh-cli",
"migrations/sea-orm",
] ]
resolver = "3" resolver = "3"
[workspace.lints.clippy] [workspace.package]
module_inception = "allow" version = "0.1.0"
edition = "2021"
authors = ["NxMesh Team"]
license = "GPL-3.0-only" # SPDX identifier for the GNU General Public License v3.0
repository = "https://github.com/nxmesh/nxmesh"
rust-version = "1.80"
[workspace.dependencies] [workspace.dependencies]
sea-orm = "2.0.0-rc" # Core dependencies
sea-orm-cli = "2.0.0-rc" tokio = { version = "1", features = ["full"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
thiserror = "1"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter", "json"] }
# Web framework
axum = "0.7"
tower = "0.4"
tower-http = { version = "0.5", features = ["trace", "cors", "fs"] }
# gRPC
tonic = "0.11"
prost = "0.12"
# Database
sea-orm = { version = "2.0.0-rc", features = ["sqlx-postgres", "runtime-tokio-native-tls"] }
sea-orm-migration = "2.0.0-rc" sea-orm-migration = "2.0.0-rc"
# Async
async-trait = "0.1"
futures = "0.3"
# Configuration
toml = "0.8"
config = "0.14"
# HTTP client
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "json"] }
# Crypto
sha2 = "0.10"
hex = "0.4"
argon2 = "0.5"
jsonwebtoken = "9"
# Validation
validator = { version = "0.18", features = ["derive"] }
# Time
chrono = { version = "0.4", features = ["serde"] }
# UUID
uuid = { version = "1", features = ["v4", "serde"] }
# Templating
handlebars = "5"
# CLI
clap = { version = "4", features = ["derive"] }
# Testing
tokio-test = "0.4"
mockall = "0.12"
# NxMesh internal
nxmesh-core = { path = "crates/nxmesh-core" }
nxmesh-proto = { path = "crates/nxmesh-proto" }

202
README.md

@@ -1,2 +1,202 @@
# NxMesh # NxMesh - Distributed Nginx Management System
> **NxMesh** is a modern, scalable, distributed system for managing nginx instances across diverse infrastructure environments. Built with a master-agent architecture inspired by service mesh patterns, NxMesh provides centralized control with local intelligence.
## 🎯 Project Vision
NxMesh transforms nginx from a standalone reverse proxy into a **distributed, programmable edge layer**. By adopting a control plane (master) + data plane (agent/sidecar) architecture, NxMesh enables:
- **Centralized Management**: Control thousands of nginx instances from a single control plane
- **Dynamic Configuration**: Real-time configuration updates without restarts or connection drops
- **Observability**: Unified metrics, logs, and health status across the entire fleet
- **Hybrid Deployment**: Support for Docker, Kubernetes, VMs, and bare metal environments
- **High Availability**: Fault-tolerant design with automatic failover and recovery
## 🏗️ Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────────────────┐
│ CONTROL PLANE (Master) │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ NxMesh Master │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ API │ │ Config │ │ Cluster │ │ Admin │ │ │
│ │ │ Server │ │ Manager │ │ Coordinator │ │ Console │ │ │
│ │ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ │
│ │ └──────────────────┴──────────────────┴──────────────────┘ │ │
│ │ │ │ │
│ │ PostgreSQL (State) │ │
│ └──────────────────────────────┼─────────────────────────────────────────────┘ │
│ │ │
│ gRPC/TLS │ WebSocket (Events) │
│ ▼ │
└─────────────────────────────────────────────────────────────────────────────────┘
┌───────────────────────────┼───────────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ AGENT 1 │ │ AGENT 2 │ │ AGENT N │
│ (Sidecar) │ │ (Standalone) │ │ (K8s Pod) │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ NxMesh │ │ │ │ NxMesh │ │ │ │ NxMesh │ │
│ │ Agent │ │ │ │ Agent │ │ │ │ Agent │ │
│ └─────┬─────┘ │ │ └─────┬─────┘ │ │ └─────┬─────┘ │
│ │ │ │ │ │ │ │ │
│ ┌────┴────┐ │ │ ┌────┴────┐ │ │ ┌────┴────┐ │
│ │ Nginx │ │ │ │ Nginx │ │ │ │ Nginx │ │
│ │ Instance│ │ │ │ Instance│ │ │ │ Instance│ │
│ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │
└───────────────┘ └───────────────┘ └───────────────┘
Docker Compose VM/Bare Metal Kubernetes
```
### Core Components
| Component | Description | Technology |
|-----------|-------------|------------|
| **Master** | Central control plane - API, embedded Web UI, config distribution | Rust (Axum, Tonic) + embedded React (Vite) |
| **Agent** | Local nginx controller - configuration, health checks, metrics | Rust (Tokio) |
| **Database** | Persistent state storage | PostgreSQL |
## 🚀 Key Features
### Phase 1: Foundation
- [ ] **Master Control Plane**
- RESTful API for configuration management
- gRPC for agent communication
- PostgreSQL persistence
- JWT-based authentication
- [ ] **Agent Sidecar**
- Docker deployment mode (sidecar pattern)
- Standalone deployment mode
- Automatic nginx lifecycle management
- Configuration hot-reloading
- [ ] **Configuration Management**
- Virtual host (server block) templating
- Upstream pool management
- SSL/TLS certificate management
- Configuration versioning & rollback
### Phase 2: Resilience
- [ ] **High Availability**
- Master clustering with Raft consensus
- Agent auto-reconnection with exponential backoff
- Configuration drift detection & auto-healing
- [ ] **Observability**
- Real-time metrics collection (Prometheus)
- Structured logging (OpenTelemetry)
- Health check dashboards
- Alert management
### Phase 3: Advanced
- [ ] **Traffic Management**
- Dynamic load balancing strategies
- Circuit breaker patterns
- Rate limiting & WAF rules
- A/B testing & canary deployments
- [ ] **Multi-tenancy**
- Organization/workspace isolation
- RBAC (Role-Based Access Control)
- Resource quotas & limits
## 📦 Deployment Modes
### 1. Docker Sidecar (Recommended for Development)
```yaml
# docker-compose.yml
services:
nginx:
image: nginx:alpine
nxmesh-agent:
image: nxmesh/agent:latest
environment:
- NXMESH_MASTER_URL=wss://master.nxmesh.io:8443
- NXMESH_AGENT_TOKEN=${AGENT_TOKEN}
network_mode: service:nginx # Share network namespace
pid: service:nginx # Share PID namespace (for nginx reload)
```
### 2. Kubernetes Sidecar
```yaml
# deployment.yaml
spec:
containers:
- name: nginx
image: nginx:alpine
- name: nxmesh-agent
image: nxmesh/agent:latest
env:
- name: NXMESH_MASTER_URL
value: "wss://master.nxmesh.svc:8443"
```
### 3. Standalone (VM/Bare Metal)
```bash
# Install agent
curl -fsSL https://get.nxmesh.io | bash
# Configure and start
nxmesh-agent --master-url wss://master.nxmesh.io:8443 --token ${AGENT_TOKEN}
```
## 📋 Quick Start
### Prerequisites
- Docker & Docker Compose
- Rust 1.80+ (for development; matches `rust-version` in `Cargo.toml`)
- PostgreSQL 16+
### Development Setup
```bash
# Clone and setup
git clone https://github.com/nxmesh/nxmesh.git
cd nxmesh
just setup
# Start development environment
just dev
# Access services
# - Web UI: http://localhost:3000
# - API: http://localhost:8080
# - Nginx: http://localhost:80
```
### Production Deployment
```bash
# Deploy master
docker run -d \
-p 8080:8080 \
-p 8443:8443 \
-e DATABASE_URL=postgres://... \
nxmesh/master:latest
# Deploy agent (on nginx host)
docker run -d \
--network container:nginx \
-e NXMESH_MASTER_URL=wss://master.example.com:8443 \
-e NXMESH_AGENT_TOKEN=<token> \
nxmesh/agent:latest
```
## 📚 Documentation
| Document | Description |
|----------|-------------|
| [Architecture](./docs/architecture.md) | System design, data flow, component interactions |
| [Features](./docs/features.md) | Detailed feature specifications |
| [Roadmap](./docs/roadmap.md) | Development phases and milestones |
| [API Reference](./docs/api.md) | REST API and gRPC specifications |
| [Deployment](./docs/deployment.md) | Production deployment guides |
## 📄 License
NxMesh is licensed under the GNU General Public License v3.0. See [LICENSE](./LICENSE) for details.
---

1107
docs/api.md Normal file

File diff suppressed because it is too large

527
docs/architecture.md Normal file

@@ -0,0 +1,527 @@
# NxMesh Architecture
## Table of Contents
1. [Overview](#overview)
2. [System Components](#system-components)
3. [Data Flow](#data-flow)
4. [Communication Protocols](#communication-protocols)
5. [Security Model](#security-model)
6. [Deployment Patterns](#deployment-patterns)
7. [Failure Handling](#failure-handling)
---
## Overview
NxMesh follows a **Control Plane / Data Plane** architecture pattern, similar to service meshes like Istio or Linkerd, but specifically optimized for nginx management.
### Design Principles
1. **Separation of Concerns**: Master handles policy and state; Agent handles execution
2. **Eventual Consistency**: Configuration changes propagate asynchronously
3. **Local Autonomy**: Agents can operate independently during master outages
4. **Zero-Downtime Updates**: Nginx reloads without dropping connections
5. **Observability First**: Every action is observable and traceable
---
## System Components
### 1. Master (Control Plane)
The Master is the brain of the system. It maintains the desired state and coordinates all agents.
```
┌──────────────────────────────────────────────────────────────────┐
│ MASTER │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────────────┐ │
│ │ API │ │ Config │ │ Event & Agent │ │
│ │ Layer │ │ Engine │ │ Coordination │ │
│ │ │ │ │ │ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌───────────────────┐ │ │
│ │ │ REST │ │ │ │ Template│ │ │ │ Agent Registry │ │ │
│ │ │ Handler │ │ │ │ Engine │ │ │ │ (Connections) │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ └───────────────────┘ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌───────────────────┐ │ │
│ │ │ gRPC │ │ │ │ Version │ │ │ │ Event Bus │ │ │
│ │ │ Server │ │ │ │ Control │ │ │ │ (Config Dist.) │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ └───────────────────┘ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌───────────────────┐ │ │
│ │ │ WebSocket│ │ │ │ Validator│ │ │ │ Broadcast │ │ │
│ │ │ Handler │ │ │ │ │ │ │ │ (Agent Updates) │ │ │
│ │ └──────────┘ │ │ └──────────┘ │ │ └───────────────────┘ │ │
│ └──────────────┘ └──────────────┘ └─────────────────────────┘ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Auth │ │ Storage │ │ Observability │ │
│ │ Service │ │ Layer │ │ │ │
│ │ │ │ │ │ ┌───────────────────┐ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Metrics │ │ │
│ │ │ JWT │ │ │ │ Postgres│ │ │ │ (Prometheus) │ │ │
│ │ │ OAuth2 │ │ │ │ (SeaORM)│ │ │ └───────────────────┘ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ ┌───────────────────┐ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Tracing │ │ │
│ │ │ Password│ │ │ │ Cache │ │ │ │ (OpenTelemetry) │ │ │
│ │ │ Login │ │ │ │ (Redis) │ │ │ └───────────────────┘ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ │ │
│ │ ┌─────────┐ │ │ │ │ │ │
│ │ │ RBAC │ │ │ │ │ │ │
│ │ │ Engine │ │ │ │ │ │ │
│ │ └─────────┘ │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
```
#### Master Responsibilities
| Module | Responsibility |
|--------|----------------|
| API Layer | HTTP REST API for external clients (CLI, Web UI, external systems) |
| Config Engine | Template rendering, validation, versioning |
| Event & Agent Coordination | Agent connection management, config event broadcasting |
| Auth Service | Authentication (JWT/OAuth2, Password) and authorization (RBAC) |
| Storage Layer | PostgreSQL for persistent state, Redis for caching |
| Observability | Metrics collection, distributed tracing, structured logging |
#### Future: High Availability Mode
For large-scale deployments, the master can be extended with:
- **Raft Consensus** for leader election and state replication
- **Cluster Manager** for coordinating multiple master instances
High availability mode is **not required** for single-organization, self-hosted deployments.
### 2. Agent (Data Plane)
The Agent is a lightweight sidecar that runs alongside each nginx instance.
```
┌─────────────────────────────────────────────────────────────────┐
│ AGENT │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Master │ │ Nginx │ │ Health Monitor │ │
│ │ Client │ │ Controller │ │ │ │
│ │ │ │ │ │ ┌───────────────────┐ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Nginx Health │ │ │
│ │ │ gRPC │ │ │ │ Config │ │ │ │ (HTTP checks) │ │ │
│ │ │ Client │ │ │ │ Renderer│ │ │ └───────────────────┘ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ ┌───────────────────┐ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ System Metrics │ │ │
│ │ │ WebSocket│ │ │ │ Reload │ │ │ │ (CPU/Mem/IO) │ │ │
│ │ │ Client │ │ │ │ Manager │ │ │ └───────────────────┘ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌───────────────────┐ │ │
│ │ │ Reconnect│ │ │ │ Process │ │ │ │ Self-Health │ │ │
│ │ │ Handler │ │ │ │ Signal │ │ │ │ (Heartbeat) │ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ └───────────────────┘ │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Metrics │ │ Local │ │ Watchdog │ │
│ │ Exporter │ │ Cache │ │ │ │
│ │ │ │ │ │ ┌───────────────────┐ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Config Drift │ │ │
│ │ │Prometheus│ │ │ │ Config │ │ │ │ Detection │ │ │
│ │ │Endpoint │ │ │ │ State │ │ │ └───────────────────┘ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ ┌───────────────────┐ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Auto-Recovery │ │ │
│ │ │Statsd │ │ │ │ Backup │ │ │ │ (Nginx restart) │ │ │
│ │ │Client │ │ │ │ Files │ │ │ └───────────────────┘ │ │
│ │ └─────────┘ │ │ └─────────┘ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
#### Agent Responsibilities
| Module | Responsibility |
|--------|----------------|
| Master Client | Maintains persistent connection to master (gRPC + WebSocket fallback) |
| Nginx Controller | Generates configs, manages reloads, handles lifecycle |
| Health Monitor | Monitors nginx health, system resources, reports status |
| Metrics Exporter | Prometheus endpoint, statsd client for metrics |
| Local Cache | Caches configs for offline operation, backup/restore |
| Watchdog | Detects config drift, auto-recovery from failures |
---
## Data Flow
### 1. Configuration Push Flow
```
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ User │────▶│ API │────▶│ Config │────▶│ Event │────▶│ Agents │
│ Action │ │ Server │ │ Engine │ │ Bus │ │ │
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Nginx │◀────│ Config │◀────│ Template│◀────│ gRPC │◀────│ Agent │
│Reloaded│ │Applied │ │ Render │ │ Stream │ │Receive │
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
```
**Flow Description:**
1. User creates/updates configuration via API or Web UI
2. Master validates and stores configuration in database
3. Config Engine determines affected agents
4. Event Bus broadcasts configuration change event
5. Agents receive event via gRPC streaming
6. Agent renders local nginx configuration from templates
7. Agent validates new configuration (`nginx -t`)
8. Agent applies configuration via graceful reload
9. Agent reports status back to master
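Steps 7 to 9 can be sketched as a validate-then-swap routine. This is a minimal illustration, not the agent's actual code: `ConfigSlot`, `ApplyStatus`, and the `validate`/`reload` callbacks are hypothetical names, and the callbacks stand in for running `nginx -t` and a graceful reload.

```rust
/// Outcome the agent would report back to the master (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum ApplyStatus {
    Applied { version: u64 },
    Rejected { version: u64, reason: String },
}

/// Holds the last known-good rendered configuration (hypothetical type).
pub struct ConfigSlot {
    pub active_version: u64,
    pub active_body: String,
}

impl ConfigSlot {
    /// Validates `candidate`, reloads nginx with it, and only then marks it active.
    /// `validate` stands in for `nginx -t`; `reload` stands in for a graceful reload.
    pub fn apply<V, R>(
        &mut self,
        version: u64,
        candidate: String,
        validate: V,
        reload: R,
    ) -> ApplyStatus
    where
        V: Fn(&str) -> Result<(), String>,
        R: Fn(&str) -> Result<(), String>,
    {
        // Step 7: reject invalid configs before touching the active one.
        if let Err(reason) = validate(&candidate) {
            return ApplyStatus::Rejected { version, reason };
        }
        // Step 8: reload; on failure, make a best-effort return to the previous config.
        if let Err(reason) = reload(&candidate) {
            let _ = reload(&self.active_body);
            return ApplyStatus::Rejected { version, reason };
        }
        self.active_version = version;
        self.active_body = candidate;
        // Step 9: the caller sends this status to the master.
        ApplyStatus::Applied { version }
    }
}
```

Keeping the previous body in memory until the reload succeeds is what makes the "previous config kept" behaviour possible when validation fails.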
### 2. Health Reporting Flow
```
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Nginx │────▶│ Agent │────▶│ Master │────▶│ DB │
│ Health │ │ Health │ │ API │ │ Store │
└────────┘ └────────┘ └────────┘ └────────┘
┌────────┐
│Prometheus│
│ Server │
└────────┘
```
**Flow Description:**
1. Agent periodically checks nginx health (HTTP health endpoint)
2. Agent collects system metrics (CPU, memory, connections)
3. Agent sends health report to master via gRPC
4. Master aggregates and stores in database
5. Prometheus scrapes agent metrics endpoint
### 3. Certificate Management Flow
```
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Let's │◀────│ Master │────▶│ Agent │────▶│ Nginx │◀────│ Client │
│Encrypt │ │ ACME │ │ Deploy │ │ Serve │ │Request │
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
```
**Flow Description:**
1. Master requests certificate from Let's Encrypt (ACME protocol)
2. Master distributes certificate to relevant agents
3. Agent stores certificate locally (encrypted at rest)
4. Agent updates nginx configuration with new certificate
5. Nginx serves HTTPS traffic with new certificate
---
## Communication Protocols
### Master-Agent Protocol
NxMesh uses a **bidirectional gRPC stream** as the primary communication channel between master and agents.
```protobuf
// agent.proto
syntax = "proto3";
package nxmesh.agent;
service AgentService {
// Bidirectional streaming for real-time communication
rpc Stream(stream AgentMessage) returns (stream MasterMessage);
// Unary calls for specific operations
rpc ReportHealth(HealthReport) returns (Ack);
rpc ReportMetrics(MetricsBatch) returns (Ack);
}
message AgentMessage {
string agent_id = 1;
uint64 timestamp = 2;
oneof payload {
RegistrationRequest register = 3;
HealthReport health = 4;
ConfigStatus config_status = 5;
MetricsBatch metrics = 6;
LogBatch logs = 7;
}
}
message MasterMessage {
uint64 timestamp = 1;
oneof payload {
RegistrationResponse register_response = 2;
ConfigUpdate config_update = 3;
Command command = 4;
Ack ack = 5;
}
}
message ConfigUpdate {
string config_id = 1;
uint64 version = 2;
repeated VirtualHost virtual_hosts = 3;
repeated Upstream upstreams = 4;
map<string, string> ssl_certificates = 5;
}
```
### Connection Management
```
┌─────────────────────────────────────────────────────────────────────┐
│ CONNECTION LIFECYCLE │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ INIT │───▶│ CONNECT │───▶│ STREAM │───▶│ READY │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ RETRY │ │RECONNECT│ │ ERROR │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Connection Parameters: │
│ - Heartbeat interval: 30s │
│ - Reconnect backoff: 1s, 2s, 4s, 8s... (max 60s) │
│ - gRPC keepalive: 10s ping, 20s timeout │
│ - TLS: Server-side TLS (auto-generated or custom) │
│ - Agent auth: Bootstrap token → Shared secret (HMAC) │
└─────────────────────────────────────────────────────────────────────┘
```
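The reconnect schedule listed in the connection parameters (1s, 2s, 4s, 8s, capped at 60s) can be computed as follows. This is a sketch under the assumption that attempts are counted from zero and that no jitter is applied; `reconnect_delay` is a hypothetical helper name.

```rust
use std::time::Duration;

/// Returns the wait before reconnect attempt `attempt` (0-based):
/// 1s, 2s, 4s, 8s, ... doubling each time and capped at 60s.
pub fn reconnect_delay(attempt: u32) -> Duration {
    const MAX_SECS: u64 = 60;
    // `checked_shl` yields None once the shift would exceed 63 bits,
    // in which case the cap applies anyway.
    let secs = 1u64
        .checked_shl(attempt)
        .unwrap_or(MAX_SECS)
        .min(MAX_SECS);
    Duration::from_secs(secs)
}
```

A production agent would likely add random jitter so that many agents do not reconnect in lockstep after a master restart; that is left out here to keep the schedule identical to the one documented.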
---
## Security Model
### Authentication
| Component | Method | Details |
|-----------|--------|---------|
| Master API | JWT (RS256) | Short-lived access tokens, refresh tokens |
| Master WebSocket | JWT | Same tokens as API |
| Master-Agent gRPC | **TLS + Shared Secret** | Server TLS + bootstrap token → session HMAC |
| Agent Registration | One-time Bootstrap Token | Generated in Master UI, single-use, short expiry |
### Agent Authentication Flow (TLS + Shared Secret)
```
┌─────────────┐ ┌──────────────┐
│ Agent │ │ Master │
└──────┬──────┘ └──────┬───────┘
│ │
│ 1. TLS Handshake (verify server certificate) │
│◄───────────────────────────────────────────────►│
│ │
│ 2. Register with bootstrap_token │
│ ── gRPC: RegisterAgent { token } ─────────────▶│
│ │
│ 3. Receive agent_id + session_key (+ key_id) │
│◄────────────────────────────────────────────────│
│ [Encrypted over TLS] │
│ │
│ 4. Subsequent requests: HMAC-signed │
│ ── gRPC + Headers: │
│ X-Agent-ID: <agent_id> │
│ X-Key-ID: <session_key_id> │
│ X-Signature: HMAC(request_body, session_key)│
│────────────────────────────────────────────────▶│
│ │
│ 5. Key Rotation (primary/secondary) │
│◄═══════════════════════════════════════════════►│
```
**Security Properties:**
- **TLS**: Encrypts channel, verifies master identity (server cert)
- **Bootstrap Token**: One-time use, time-limited, proves initial identity
- **Session Key**: Per-agent secret, used for HMAC request signing
- **Key Rotation**: Primary/secondary key design for seamless rotation
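The primary/secondary key design can be sketched as a small key ring that the master consults using the `X-Key-ID` header. The types and method names here (`SessionKey`, `KeyRing`, `rotate`, `lookup`) are hypothetical, and the HMAC computation itself is omitted; the sketch only shows how requests signed with the previous key stay verifiable across one rotation.

```rust
/// A per-agent signing key with its identifier (hypothetical type).
pub struct SessionKey {
    pub key_id: String,
    pub secret: Vec<u8>,
}

/// Keeps the current key plus the one it replaced.
pub struct KeyRing {
    primary: SessionKey,
    secondary: Option<SessionKey>,
}

impl KeyRing {
    pub fn new(primary: SessionKey) -> Self {
        KeyRing { primary, secondary: None }
    }

    /// Installs a new primary key; the old primary becomes the secondary,
    /// and any older secondary is dropped.
    pub fn rotate(&mut self, new_primary: SessionKey) {
        let old = std::mem::replace(&mut self.primary, new_primary);
        self.secondary = Some(old);
    }

    /// Finds the secret for the key ID sent by the agent, if it is still accepted.
    pub fn lookup(&self, key_id: &str) -> Option<&[u8]> {
        if self.primary.key_id == key_id {
            return Some(self.primary.secret.as_slice());
        }
        self.secondary
            .as_ref()
            .filter(|k| k.key_id == key_id)
            .map(|k| k.secret.as_slice())
    }
}
```

With this layout an agent that has not yet picked up the new key keeps working until the next rotation, after which its old key ID is rejected.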
### Authorization (RBAC)
```yaml
# Example RBAC Configuration
roles:
admin:
permissions:
- "*:*"
operator:
permissions:
- "config:read"
- "config:write"
- "agent:read"
- "agent:reload"
viewer:
permissions:
- "config:read"
- "agent:read"
- "metrics:read"
# Resource hierarchy
resources:
- organization
- workspace
- agent
- certificate
- config (virtual_host, upstream)
```
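A permission check against the `resource:action` strings above might look like the following sketch. It assumes that `*` matches any value in its segment, which is how the `admin` role's `*:*` entry reads; `permission_matches` and `is_allowed` are hypothetical names, and resource hierarchy is not modelled.

```rust
/// Returns true if a granted permission such as "config:read" or "*:*"
/// covers the required "resource:action" pair.
pub fn permission_matches(granted: &str, required: &str) -> bool {
    match (granted.split_once(':'), required.split_once(':')) {
        (Some((g_res, g_act)), Some((r_res, r_act))) => {
            (g_res == "*" || g_res == r_res) && (g_act == "*" || g_act == r_act)
        }
        // Malformed strings never grant access.
        _ => false,
    }
}

/// Returns true if any permission of the role covers the required one.
pub fn is_allowed(role_permissions: &[&str], required: &str) -> bool {
    role_permissions
        .iter()
        .any(|granted| permission_matches(granted, required))
}
```

Denying on malformed input keeps the check fail-closed, which is the safer default for an authorization path.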
## Deployment Patterns
### Pattern 1: Docker Sidecar (Development/Single Host)
```yaml
# docker-compose.yml
version: '3.8'
services:
nxmesh-master:
image: nxmesh/master:latest
ports:
- "8080:8080" # API
- "8443:8443" # gRPC
environment:
- DATABASE_URL=postgres://...
nginx-site-a:
image: nginx:alpine
volumes:
- site-a-html:/usr/share/nginx/html
nxmesh-agent-a:
image: nxmesh/agent:latest
network_mode: service:nginx-site-a # Share network namespace with nginx
pid: service:nginx-site-a # Share PID namespace (for nginx reload)
environment:
- NXMESH_MASTER_URL=wss://nxmesh-master:8443
- NXMESH_AGENT_TOKEN=${AGENT_TOKEN_A}
- NXMESH_DEPLOYMENT_MODE=docker_sidecar
- NXMESH_NGINX_PID_FILE=/var/run/nginx.pid
```
**Pros:** Simple, isolated, good for development
**Cons:** Docker-only, single host limitation
### Pattern 2: Kubernetes Sidecar
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-service
spec:
replicas: 3
template:
spec:
containers:
- name: nginx
image: nginx:alpine
volumeMounts:
- name: nxmesh-config
mountPath: /etc/nginx/conf.d
- name: nxmesh-agent
image: nxmesh/agent:latest
env:
- name: NXMESH_MASTER_URL
value: "wss://nxmesh-master.default.svc:8443"
- name: NXMESH_AGENT_TOKEN
valueFrom:
secretKeyRef:
name: nxmesh-agent-token
key: token
volumeMounts:
- name: nxmesh-config
mountPath: /etc/nginx/conf.d
volumes:
- name: nxmesh-config
emptyDir: {}
```
**Pros:** Native K8s integration, auto-scaling, health checks
**Cons:** K8s-only, more complex setup
### Pattern 3: Standalone (VM/Bare Metal)
```
┌─────────────────────────────────────────────────────────────────┐
│ VM / Bare Metal │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Systemd │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ nxmesh-agent.service │ │ │
│ │ │ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │ │ │
│ │ │ │ Agent │ │ Nginx │ │ Config │ │ │ │
│ │ │ │ Process │──│ Process │──│ Files │ │ │ │
│ │ │ └──────────────┘ └──────────────┘ └───────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
**Pros:** Works anywhere, minimal dependencies
**Cons:** Manual setup, no container isolation
---
## Failure Handling
### Master Failure Scenarios
| Scenario | Impact | Mitigation |
|----------|--------|------------|
| Master unreachable | Agents continue with cached config | Agents retry with exponential backoff |
| Master crashes | New connections fail, existing continue | External load balancer + health checks (HA: future) |
| Database down | Read-only mode for existing configs | Database replication, failover |
### Agent Failure Scenarios
| Scenario | Impact | Mitigation |
|----------|--------|------------|
| Agent crashes | Nginx continues running | Systemd restart, watchdog |
| Config validation fails | Previous config kept | Atomic config swap, rollback |
| Nginx crashes | Agent restarts nginx | Health checks, auto-restart |
| Network partition | Agent operates in "island mode" | Local cache, reconciliation on reconnect |
### Recovery Procedures
```
┌─────────────────────────────────────────────────────────────────────┐
│ FAILURE RECOVERY FLOW │
│ │
│ Agent Disconnect │
│ │ │
│ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Retry │───▶│ Cache │───▶│ Alert │───▶│ Watch │ │
│ │ Connect │ │ Config │ │ Master │ │ Dog │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ │
│ │Reconnected│ │ Restart │ │
│ │ Sync │ │ Nginx │ │
│ └─────────┘ └─────────┘ │
│ │
│ Recovery Strategies: │
│ 1. Exponential backoff for reconnection │
│ 2. Circuit breaker for failed operations │
│ 3. Config checksum verification after reconnect │
│ 4. Automatic nginx restart on health check failure │
└─────────────────────────────────────────────────────────────────────┘
```
---
## Technology Stack
| Layer | Technology | Rationale |
|-------|------------|-----------|
| **Master Backend** | Rust (Axum) | Performance, safety, async ecosystem |
| **Agent** | Rust (Tokio) | Small binary, low memory, fast startup |
| **Database** | PostgreSQL | ACID, JSON support, reliability |
| **Cache** | Redis | Fast key-value, pub/sub for events |
| **Frontend** | React + Vite (embedded) | Static build served by master, fast HMR in dev |
| **gRPC** | Tonic | Native Rust implementation |
| **ORM** | SeaORM | Async, type-safe, migration support |
| **Config Template** | Handlebars | Logic-less, secure templating |
| **Metrics** | Prometheus | Industry standard, rich ecosystem |
| **Tracing** | OpenTelemetry | Vendor-neutral, future-proof |

814
docs/features.md Normal file

@@ -0,0 +1,814 @@
# NxMesh Feature Specification
## Table of Contents
1. [Core Features](#core-features)
2. [Master Features](#master-features)
3. [Agent Features](#agent-features)
4. [Configuration Management](#configuration-management)
5. [Observability](#observability)
6. [Security Features](#security-features)
---
## Core Features
### CF-001: Multi-tenancy with Organizations and Workspaces
**Description**: Support for multiple organizations with isolated workspaces within each organization.
**Requirements**:
- Organizations are top-level resource containers
- Each organization can have multiple workspaces
- Resources (agents, configs, certificates) are scoped to a workspace
- Cross-workspace visibility is configurable
**Data Model**:
```rust
use chrono::{DateTime, Utc};
use uuid::Uuid;

struct Organization {
    id: Uuid,
    name: String,
    slug: String, // URL-friendly identifier
    created_at: DateTime<Utc>,
    settings: OrganizationSettings, // settings type defined elsewhere
}

struct Workspace {
    id: Uuid,
    organization_id: Uuid, // owning organization
    name: String,
    slug: String,
    created_at: DateTime<Utc>,
}
```
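The `slug` fields are described as URL-friendly identifiers. One possible derivation from a display name is sketched here; the exact rules (ASCII only, lowercase, runs of other characters collapsed to a single hyphen) are an assumption, and `slugify` is a hypothetical helper.

```rust
/// Derives a URL-friendly slug: lowercase ASCII letters and digits,
/// with every run of other characters collapsed into one '-'.
/// Leading and trailing separators are dropped.
pub fn slugify(name: &str) -> String {
    let mut slug = String::with_capacity(name.len());
    let mut pending_dash = false;
    for ch in name.chars() {
        if ch.is_ascii_alphanumeric() {
            // Emit a separator only between two kept characters.
            if pending_dash && !slug.is_empty() {
                slug.push('-');
            }
            pending_dash = false;
            slug.push(ch.to_ascii_lowercase());
        } else {
            pending_dash = true;
        }
    }
    slug
}
```

Because two different names can map to the same slug, a real implementation would also need a uniqueness check within the organization.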
**API Endpoints**:
- `GET /api/v1/organizations` - List organizations
- `POST /api/v1/organizations` - Create organization
- `GET /api/v1/organizations/{id}/workspaces` - List workspaces
- `POST /api/v1/organizations/{id}/workspaces` - Create workspace
---
### CF-002: Agent Registration and Lifecycle Management
**Description**: Agents must register with the master before receiving configurations.
**Registration Flow**:
1. Administrator generates bootstrap token in Master UI
2. Token is provided to agent via environment variable or config file
3. Agent establishes TLS connection to master (verifies server certificate)
4. Agent sends bootstrap token for registration
5. Master validates token and establishes shared secret:
- Master generates session_key (per-agent) + key_id
- Session key used for HMAC request signing
- Primary/secondary key design for rotation
**Agent States**:
```rust
enum AgentState {
Pending, // Registered but never connected
Online, // Connected and healthy
Offline, // Disconnected
Degraded, // Connected but health checks failing
Maintenance, // Manually placed in maintenance mode
}
```
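How the master might derive a state from what it observes is sketched here. The precedence (maintenance first, then never-connected, then disconnected, then health) is an assumption, as are the `AgentObservation` struct and the `derive_state` function; the enum is repeated so the snippet compiles on its own.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentState {
    Pending,
    Online,
    Offline,
    Degraded,
    Maintenance,
}

/// Facts the master knows about an agent at a point in time (hypothetical type).
pub struct AgentObservation {
    pub ever_connected: bool,
    pub connected: bool,
    pub health_checks_passing: bool,
    pub maintenance: bool,
}

/// Maps observations to a state; earlier checks take precedence.
pub fn derive_state(obs: &AgentObservation) -> AgentState {
    if obs.maintenance {
        return AgentState::Maintenance;
    }
    if !obs.ever_connected {
        return AgentState::Pending;
    }
    if !obs.connected {
        return AgentState::Offline;
    }
    if obs.health_checks_passing {
        AgentState::Online
    } else {
        AgentState::Degraded
    }
}
```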
**Agent Metadata**:
```rust
use std::collections::HashMap;

use chrono::{DateTime, Utc};
use uuid::Uuid;

enum DeploymentMode {
    DockerSidecar,
    K8sSidecar,
    Standalone,
}

struct Agent {
    id: Uuid,
    workspace_id: Uuid,
    name: String,
    hostname: String,
    ip_address: String,
    version: String,
    state: AgentState, // see "Agent States"
    deployment_mode: DeploymentMode,
    last_seen_at: DateTime<Utc>,
    capabilities: Vec<String>,       // e.g., ["http3", "websocket", "rate_limiting"]
    labels: HashMap<String, String>, // e.g., {"env": "prod", "region": "us-east"}
}
```
**API Endpoints**:
- `POST /api/v1/agents/register` - Register new agent
- `GET /api/v1/agents` - List agents
- `GET /api/v1/agents/{id}` - Get agent details
- `POST /api/v1/agents/{id}/tokens` - Generate registration token
- `DELETE /api/v1/agents/{id}` - Deregister agent
---
### CF-003: Real-time Configuration Distribution
**Description**: Push configuration changes to agents in real-time with delivery guarantees.
**Requirements**:
- Config changes propagate to all affected agents within 5 seconds
- Support for targeted updates (specific agents or groups)
- Config versioning with rollback capability
- Delivery confirmation from agents
**Configuration Scope**:
```rust
enum ConfigScope {
Global, // All agents
Workspace, // All agents in workspace
AgentGroup(String), // Agents with specific label selector
Agent(Uuid), // Single agent
}
```
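A minimal sketch of how the scopes above might be resolved against a single agent. It makes several assumptions not stated in this document: `Uuid` is replaced by `u128` to avoid external crates, the label selector uses a `key=value` syntax, group selectors are limited to the config's own workspace, and `AgentInfo` and `scope_matches` are hypothetical names.

```rust
use std::collections::HashMap;

// Stand-in for a real UUID type so the sketch needs no external crates.
type Uuid = u128;

enum ConfigScope {
    Global,
    Workspace,
    AgentGroup(String), // assumed selector syntax: "key=value"
    Agent(Uuid),
}

// Hypothetical subset of the Agent struct.
struct AgentInfo {
    id: Uuid,
    workspace_id: Uuid,
    labels: HashMap<String, String>,
}

/// Returns true when a config owned by `config_workspace` should reach `agent`.
fn scope_matches(scope: &ConfigScope, config_workspace: Uuid, agent: &AgentInfo) -> bool {
    match scope {
        ConfigScope::Global => true,
        ConfigScope::Workspace => agent.workspace_id == config_workspace,
        ConfigScope::AgentGroup(selector) => match selector.split_once('=') {
            Some((key, value)) => {
                agent.workspace_id == config_workspace
                    && agent.labels.get(key).map(String::as_str) == Some(value)
            }
            None => false, // a malformed selector matches nothing
        },
        ConfigScope::Agent(id) => agent.id == *id,
    }
}
```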
**Delivery Guarantees**:
- At-least-once delivery
- Automatic retry with exponential backoff
- Config checksum verification
- Offline agents receive updates on reconnection
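
The retry schedule could look like the sketch below. The function name `backoff_delay` and the base and cap values in the usage note are assumptions, not values from this document, and a real implementation would usually add random jitter, which is omitted here.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // checked_shl returns None for shifts of 32 bits or more; saturate instead.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    // checked_mul returns None on overflow; fall back to the cap in that case.
    base.checked_mul(factor).unwrap_or(max).min(max)
}
```

With a 500 ms base and a 30 s cap, attempts 0, 1, 2, 3 wait 0.5 s, 1 s, 2 s, 4 s, and every attempt from the seventh onward waits 30 s.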
---
## Master Features
### MF-001: RESTful API
**Description**: Comprehensive REST API for all operations.
**Base URL**: `/api/v1`
**Resource Endpoints**:
| Resource | Endpoints |
|----------|-----------|
| Organizations | GET, POST, PATCH, DELETE `/organizations` |
| Workspaces | GET, POST, PATCH, DELETE `/workspaces` |
| Agents | GET, POST, PATCH, DELETE `/agents` |
| VirtualHosts | GET, POST, PATCH, DELETE `/virtual-hosts` |
| Upstreams | GET, POST, PATCH, DELETE `/upstreams` |
| Certificates | GET, POST, DELETE `/certificates` |
| AccessLogs | GET `/access-logs` |
| Metrics | GET `/metrics` |
**Response Format**:
```json
{
"data": { ... },
"meta": {
"page": 1,
"per_page": 20,
"total": 100
},
"links": {
"self": "/api/v1/agents?page=1",
"next": "/api/v1/agents?page=2",
"prev": null
}
}
```
**Error Format**:
```json
{
"error": {
"code": "VALIDATION_ERROR",
"message": "Invalid configuration",
"details": [
{"field": "server_name", "message": "Invalid domain format"}
]
}
}
```
---
### MF-002: Web-based Admin Console (Embedded)
**Description**: Modern web UI for managing the entire system. Built with React + Vite and served as static files embedded directly in the master binary.
**Pages**:
| Page | Features |
|------|----------|
| Dashboard | Agent status, recent events, traffic overview |
| Agents | List, detail view, logs, metrics graphs |
| Configurations | Virtual host editor, upstream management |
| Certificates | SSL certificate list, expiration alerts |
| Access Control | Users, roles, permissions management |
| Settings | Organization settings, integrations |
**Key UI Features**:
- Real-time updates via WebSocket
- Monaco editor for nginx configuration
- Visual topology view (agent connections)
- Dark/light mode support
- Responsive design
---
### MF-003: Configuration Template Engine
**Description**: Templating system for generating nginx configurations.
**Template Variables**:
```handlebars
# Example virtual host template
server {
listen {{port}} {{#if ssl}}ssl{{/if}} {{#if http2}}http2{{/if}};
server_name {{server_name}};
{{#if ssl}}
ssl_certificate {{ssl_certificate_path}};
ssl_certificate_key {{ssl_certificate_key_path}};
{{/if}}
location {{location_path}} {
proxy_pass http://{{upstream_name}};
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
{{#each custom_headers}}
add_header {{name}} "{{value}}";
{{/each}}
{{#if rate_limiting}}
limit_req zone={{rate_limit_zone}} burst={{rate_limit_burst}};
{{/if}}
}
}
```
**Built-in Templates**:
- `default` - Standard reverse proxy
- `spa` - Single Page Application (with fallback to index.html)
- `api` - API gateway with rate limiting
- `static` - Static file serving with caching
- `websocket` - WebSocket proxy with connection upgrades
---
### MF-004: Certificate Management (ACME)
**Description**: Automatic SSL/TLS certificate provisioning via Let's Encrypt.
**Features**:
- ACME v2 protocol support
- HTTP-01 and DNS-01 challenges
- Automatic renewal (30 days before expiry)
- Wildcard certificate support (DNS-01)
- Certificate monitoring and alerts
**Certificate Entity**:
```rust
struct Certificate {
id: Uuid,
workspace_id: Uuid,
domain: String,
is_wildcard: bool,
provider: CertificateProvider, // LetsEncrypt, Custom
status: CertificateStatus, // Pending, Active, Expired, Error
issued_at: DateTime,
expires_at: DateTime,
auto_renew: bool,
certificate_pem: Option<String>, // Encrypted at rest
private_key_pem: Option<String>, // Encrypted at rest
}
```
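The "30 days before expiry" renewal rule can be expressed as a small predicate. This is a sketch under assumptions: it uses `std::time::SystemTime` in place of the `DateTime` type above, and `needs_renewal` and `RENEWAL_WINDOW` are hypothetical names.

```rust
use std::time::{Duration, SystemTime};

// 30 days, matching the renewal window stated in the feature list.
const RENEWAL_WINDOW: Duration = Duration::from_secs(30 * 24 * 60 * 60);

/// True when an auto-renewing certificate is inside the renewal window
/// or has already expired.
fn needs_renewal(expires_at: SystemTime, now: SystemTime, auto_renew: bool) -> bool {
    if !auto_renew {
        return false;
    }
    match expires_at.duration_since(now) {
        Ok(remaining) => remaining <= RENEWAL_WINDOW,
        Err(_) => true, // expires_at is in the past
    }
}
```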
---
## Agent Features
### AF-001: Nginx Lifecycle Management
**Description**: Agent manages nginx process lifecycle based on deployment mode.
**Docker Sidecar Mode**:
- Shares PID namespace with nginx container (via `pid: service:nginx`)
- Directly signals nginx process for reload/restart
- Monitors nginx via health checks
**Standalone Mode**:
- Direct process management (signals to PID from file)
- systemd integration (optional, for service management)
- PID file monitoring
**Lifecycle Actions**:
- `start` - Start nginx
- `stop` - Graceful shutdown
- `reload` - Hot reload configuration
- `restart` - Full restart
- `test` - Validate configuration
---
### AF-002: Configuration Rendering and Application
**Description**: Agent renders nginx configs from master templates and applies them using atomic symlink swaps for zero-downtime updates.
**Config Directory Structure**:
```
/etc/nginx/
├── nginx.conf # Contains: include /etc/nginx/conf.d/current/*.conf
├── conf.d/
│ ├── current -> ./20260302143000/ # Symlink to active deployment
│ ├── 20260302143000/ # Active config (timestamped)
│ │ ├── default.conf
│ │ └── upstream.conf
│ ├── 20260302141500/ # Previous deployment (for rollback)
│ │ ├── default.conf
│ │ └── upstream.conf
│ └── 20260302140000/ # Older deployment (cleanup candidate)
```
**Config Rendering Flow**:
1. Receive ConfigUpdate from master
2. Create new deployment folder: `./conf.d/<timestamp>/`
3. Render nginx config files into timestamped folder
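4. **Validate** the new config before switching: run `nginx -t -c <file>` against a top-level config whose `include` points at `/etc/nginx/conf.d/<timestamp>/*.conf` (the timestamped folder holds only fragments such as `default.conf`, so it has no `nginx.conf` of its own)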
5. If validation passes, **atomically update symlink**: `current` → `<timestamp>/`
6. Execute graceful nginx reload
7. Verify reload success (health check)
8. Report status to master
9. Cleanup old deployments (keep N recent versions)
**Atomic Config Swap**:
```rust
async fn apply_config(&self, config: ConfigUpdate) -> Result<()> {
let timestamp = generate_timestamp();
let deploy_dir = self.conf_d_path.join(&timestamp);
let symlink_path = self.conf_d_path.join("current");
// 1. Render config to new timestamped directory
self.render_config(&config, &deploy_dir).await?;
// 2. Validate BEFORE switching symlink (point to new folder directly)
self.validate_config(&deploy_dir).await?;
// 3. Atomic symlink swap (Unix: symlink + rename)
let temp_link = self.conf_d_path.join("current.tmp");
tokio::fs::symlink(&deploy_dir, &temp_link).await?;
tokio::fs::rename(&temp_link, &symlink_path).await?; // Atomic operation
// 4. Reload nginx (picks up new symlink target)
self.reload_nginx().await?;
// 5. Verify and cleanup
self.verify_health().await?;
self.cleanup_old_deployments(5).await?; // Keep last 5 versions
self.report_success(config.id, timestamp).await;
Ok(())
}
```
**Rollback Strategy**:
```rust
async fn rollback(&self, target_timestamp: &str) -> Result<()> {
let target_dir = self.conf_d_path.join(target_timestamp);
let symlink_path = self.conf_d_path.join("current");
// Verify target exists
if !target_dir.exists() {
return Err(Error::RollbackTargetNotFound);
}
// Atomic symlink swap back to previous deployment
let temp_link = self.conf_d_path.join("current.tmp");
tokio::fs::symlink(&target_dir, &temp_link).await?;
tokio::fs::rename(&temp_link, &symlink_path).await?;
// Reload nginx
self.reload_nginx().await?;
Ok(())
}
```
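The cleanup step (keep N recent versions) is not spelled out above. One possible selection rule is sketched here, assuming deployment folders are named with fixed-width `YYYYMMDDHHMMSS` timestamps as in the directory listing. `deployments_to_prune` is a hypothetical helper. It never returns the folder that `current` points to, which matters after a rollback to an older deployment.

```rust
/// Given deployment directory names, return the ones that are safe to delete:
/// everything except the `keep` newest and the one `current` points to.
fn deployments_to_prune(mut names: Vec<String>, current: &str, keep: usize) -> Vec<String> {
    // Fixed-width timestamps sort chronologically as plain strings.
    names.sort();
    let cutoff = names.len().saturating_sub(keep);
    names
        .into_iter()
        .take(cutoff) // the oldest entries, beyond the `keep` newest
        .filter(|name| name.as_str() != current) // never delete the live config
        .collect()
}
```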
---
### AF-003: Health Monitoring and Reporting
**Description**: Continuous health monitoring of nginx and the host system.
**Health Checks**:
- **Nginx Health**: HTTP request to nginx health endpoint
- **Configuration Health**: Verify current config matches expected
- **Resource Health**: CPU, memory, disk usage
- **Connection Health**: Active connections, request rate
**Health Report Structure**:
```rust
struct HealthReport {
agent_id: Uuid,
timestamp: DateTime,
nginx_status: NginxStatus,
system_metrics: SystemMetrics,
config_checksum: String,
alerts: Vec<Alert>,
}
struct NginxStatus {
is_running: bool,
pid: Option<u32>,
uptime_seconds: u64,
active_connections: u32,
requests_per_second: f64,
}
struct SystemMetrics {
cpu_percent: f64,
memory_used_mb: u64,
memory_total_mb: u64,
disk_used_gb: u64,
disk_total_gb: u64,
}
```
**Reporting Interval**: Configurable (default: 30 seconds)
---
### AF-004: Metrics Collection and Export
**Description**: Collect and expose metrics in Prometheus format.
**Metrics Endpoint**: `GET /metrics` (on agent)
**Built-in Metrics**:
```
# Nginx metrics (parsed from stub_status)
nxmesh_nginx_connections_active{agent_id="..."} 42
nxmesh_nginx_connections_reading{agent_id="..."} 5
nxmesh_nginx_connections_writing{agent_id="..."} 30
nxmesh_nginx_connections_waiting{agent_id="..."} 7
nxmesh_nginx_requests_total{agent_id="..."} 1234567
# Agent metrics
nxmesh_agent_uptime_seconds{agent_id="..."} 86400
nxmesh_agent_master_connection_status{agent_id="..."} 1
nxmesh_agent_config_version{agent_id="...",version="123"} 1
# System metrics
nxmesh_system_cpu_percent{agent_id="..."} 25.5
nxmesh_system_memory_used_bytes{agent_id="..."} 1073741824
nxmesh_system_disk_used_bytes{agent_id="..."} 53687091200
```
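The nginx metrics above come from `stub_status`, whose plain-text output has four lines (active connections, a header, three counters, then reading/writing/waiting). A minimal parser sketch follows. `StubStatus` and `parse_stub_status` are hypothetical names, and how the agent fetches the page is out of scope.

```rust
#[derive(Debug, PartialEq)]
struct StubStatus {
    active: u64,
    accepts: u64,
    handled: u64,
    requests: u64,
    reading: u64,
    writing: u64,
    waiting: u64,
}

/// Parse the body of an nginx stub_status page; returns None on any mismatch.
fn parse_stub_status(body: &str) -> Option<StubStatus> {
    let mut lines = body.lines();
    // Line 1: "Active connections: 42"
    let active = lines
        .next()?
        .strip_prefix("Active connections:")?
        .trim()
        .parse::<u64>()
        .ok()?;
    // Line 2: the "server accepts handled requests" header, skipped.
    lines.next()?;
    // Line 3: three cumulative counters.
    let mut counters = lines
        .next()?
        .split_whitespace()
        .map(|n| n.parse::<u64>().ok());
    let accepts = counters.next()??;
    let handled = counters.next()??;
    let requests = counters.next()??;
    // Line 4: "Reading: 5 Writing: 30 Waiting: 7"
    let tokens: Vec<&str> = lines.next()?.split_whitespace().collect();
    if tokens.len() != 6 {
        return None;
    }
    let field = |idx: usize, label: &str| -> Option<u64> {
        if tokens[idx] == label {
            tokens[idx + 1].parse().ok()
        } else {
            None
        }
    };
    Some(StubStatus {
        active,
        accepts,
        handled,
        requests,
        reading: field(0, "Reading:")?,
        writing: field(2, "Writing:")?,
        waiting: field(4, "Waiting:")?,
    })
}
```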
**Custom Metrics**: Agents can collect custom metrics from nginx access logs
---
### AF-005: Offline Operation and Recovery
**Description**: Agent can operate independently when master is unreachable.
**Offline Capabilities**:
- Continue serving traffic with cached configuration
- Local health monitoring continues
- Metrics are buffered for later transmission
- Automatic reconnection attempts
**Recovery Flow**:
1. Detect disconnection from master
2. Enter "offline mode"
3. Continue operating with cached config
4. Buffer metrics and logs
5. Attempt reconnection with exponential backoff
6. On reconnection:
- Sync configuration (compare checksums)
- Transmit buffered metrics
- Resume normal operation
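
Buffering metrics while offline needs a bound so a long outage cannot exhaust memory. The sketch below assumes a drop-oldest policy, which this document does not specify. `MetricsBuffer` is a hypothetical type.

```rust
use std::collections::VecDeque;

/// Fixed-capacity buffer: when full, the oldest sample is dropped.
struct MetricsBuffer<T> {
    capacity: usize,
    samples: VecDeque<T>,
    dropped: u64, // how many samples were discarded, useful to report later
}

impl<T> MetricsBuffer<T> {
    fn new(capacity: usize) -> Self {
        Self {
            capacity,
            samples: VecDeque::with_capacity(capacity),
            dropped: 0,
        }
    }

    fn push(&mut self, sample: T) {
        if self.capacity == 0 {
            self.dropped += 1;
            return;
        }
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
            self.dropped += 1;
        }
        self.samples.push_back(sample);
    }

    /// Take all buffered samples in arrival order, e.g. after reconnecting.
    fn drain(&mut self) -> Vec<T> {
        self.samples.drain(..).collect()
    }
}
```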
---
## Configuration Management
### CM-001: Virtual Host Configuration
**Description**: Define nginx server blocks (virtual hosts) via API/UI.
**VirtualHost Entity**:
```rust
struct VirtualHost {
id: Uuid,
workspace_id: Uuid,
name: String, // Human-readable name
server_name: String, // Domain name(s), comma-separated
listen_port: u16, // Usually 80 or 443
ssl_enabled: bool,
ssl_certificate_id: Option<Uuid>,
// Routing configuration
locations: Vec<Location>,
// Advanced settings
http2_enabled: bool,
http3_enabled: bool,
gzip_enabled: bool,
rate_limiting: Option<RateLimitConfig>,
// Target agents
target_agents: AgentSelector,
}
struct Location {
path: String, // e.g., "/api" or "~ \.php$"
proxy_pass: Option<String>, // e.g., "http://backend"
upstream_id: Option<Uuid>,
root: Option<String>, // For static files
index: Option<String>, // e.g., "index.html"
custom_headers: Vec<Header>,
rewrite_rules: Vec<RewriteRule>,
}
```
**Validation Rules**:
- `server_name` must be valid domain(s)
- `listen_port` must be 1-65535
- SSL certificate must exist if `ssl_enabled` is true
- At least one location must be defined
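
The validation rules above might be checked as in this sketch. The hostname check is deliberately simplified (ASCII labels, optional leading `*.`), `ValidationError` and both function names are hypothetical, and the entity is flattened into plain arguments to keep the example self-contained. Because `listen_port` is a `u16`, only `0` can violate the 1-65535 rule.

```rust
#[derive(Debug, PartialEq)]
enum ValidationError {
    InvalidServerName(String),
    InvalidPort,
    MissingCertificate,
    NoLocations,
}

/// Simplified hostname check: dot-separated ASCII labels, optional "*." prefix.
fn is_valid_domain(name: &str) -> bool {
    let name = name.strip_prefix("*.").unwrap_or(name);
    !name.is_empty()
        && name.len() <= 253
        && name.split('.').all(|label| {
            !label.is_empty()
                && label.len() <= 63
                && !label.starts_with('-')
                && !label.ends_with('-')
                && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
        })
}

/// Collects every rule violation instead of stopping at the first one.
fn validate_virtual_host(
    server_name: &str, // comma-separated, as in the VirtualHost entity
    listen_port: u16,
    ssl_enabled: bool,
    has_certificate: bool,
    location_count: usize,
) -> Vec<ValidationError> {
    let mut errors = Vec::new();
    for name in server_name.split(',').map(str::trim) {
        if !is_valid_domain(name) {
            errors.push(ValidationError::InvalidServerName(name.to_string()));
        }
    }
    if listen_port == 0 {
        errors.push(ValidationError::InvalidPort);
    }
    if ssl_enabled && !has_certificate {
        errors.push(ValidationError::MissingCertificate);
    }
    if location_count == 0 {
        errors.push(ValidationError::NoLocations);
    }
    errors
}
```

Returning all violations at once fits the `details` array in the API error format, which lists one entry per failing field.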
---
### CM-002: Upstream Configuration
**Description**: Define backend server pools for load balancing.
**Upstream Entity**:
```rust
struct Upstream {
id: Uuid,
workspace_id: Uuid,
name: String, // Used as upstream identifier
// Load balancing algorithm
algorithm: LoadBalanceAlgorithm, // RoundRobin, LeastConnections, IPHash, etc.
// Backend servers
servers: Vec<UpstreamServer>,
// Health check configuration
health_check: Option<HealthCheckConfig>,
// Connection settings
keepalive_connections: Option<u32>,
keepalive_timeout: Option<u32>,
}
struct UpstreamServer {
address: String, // IP:port or hostname:port
weight: u32, // Default: 1
backup: bool, // Backup server
down: bool, // Temporarily down
max_fails: u32, // Default: 1
fail_timeout: u32, // Seconds, default: 10
}
enum LoadBalanceAlgorithm {
RoundRobin,
LeastConnections,
IPHash,
WeightedRoundRobin,
}
```
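To show how these entities could map onto nginx syntax, here is a rendering sketch that emits an `upstream` block using the standard `least_conn`, `ip_hash`, `weight`, `max_fails`, `fail_timeout`, `backup`, and `down` directives. `render_upstream` is a hypothetical helper. The document's template engine is Handlebars, so this plain-Rust version is only an illustration of the expected output, and it omits health-check and keepalive settings.

```rust
use std::fmt::Write;

enum LoadBalanceAlgorithm {
    RoundRobin,
    LeastConnections,
    IPHash,
    WeightedRoundRobin,
}

struct UpstreamServer {
    address: String,
    weight: u32,
    backup: bool,
    down: bool,
    max_fails: u32,
    fail_timeout: u32, // seconds
}

/// Render an nginx `upstream` block for the given pool.
fn render_upstream(
    name: &str,
    algorithm: &LoadBalanceAlgorithm,
    servers: &[UpstreamServer],
) -> String {
    let mut out = format!("upstream {} {{\n", name);
    match algorithm {
        LoadBalanceAlgorithm::LeastConnections => out.push_str("    least_conn;\n"),
        LoadBalanceAlgorithm::IPHash => out.push_str("    ip_hash;\n"),
        // Round robin is nginx's default; weights are set per server below.
        LoadBalanceAlgorithm::RoundRobin | LoadBalanceAlgorithm::WeightedRoundRobin => {}
    }
    for s in servers {
        // Writing to a String cannot fail, so the result is ignored.
        let _ = write!(
            out,
            "    server {} weight={} max_fails={} fail_timeout={}s",
            s.address, s.weight, s.max_fails, s.fail_timeout
        );
        if s.backup {
            out.push_str(" backup");
        }
        if s.down {
            out.push_str(" down");
        }
        out.push_str(";\n");
    }
    out.push_str("}\n");
    out
}
```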
---
### CM-003: Configuration Versioning
**Description**: Track all configuration changes with full history.
**Versioning Features**:
- Every change creates a new version
- Versions are immutable
- Rollback to any previous version
- Diff between versions
- Audit log of who changed what
**Version Entity**:
```rust
struct ConfigVersion {
id: Uuid,
resource_type: String, // "virtual_host", "upstream", etc.
resource_id: Uuid,
version_number: u64, // Auto-incrementing
data: Json, // Full configuration snapshot
checksum: String, // SHA-256 of data
created_by: Uuid, // User ID
created_at: DateTime,
change_summary: String, // Human-readable description
}
```
**API Endpoints**:
- `GET /api/v1/virtual-hosts/{id}/versions` - List versions
- `GET /api/v1/virtual-hosts/{id}/versions/{version}` - Get specific version
- `POST /api/v1/virtual-hosts/{id}/rollback` - Rollback to version
- `GET /api/v1/virtual-hosts/{id}/diff?from=v1&to=v2` - Compare versions
---
## Observability
### OB-001: Structured Logging
**Description**: Comprehensive logging with structured format.
**Log Levels**: ERROR, WARN, INFO, DEBUG, TRACE
**Log Fields**:
```json
{
"timestamp": "2026-03-02T10:30:00Z",
"level": "INFO",
"component": "agent",
"agent_id": "550e8400-e29b-41d4-a716-446655440000",
"trace_id": "abc123",
"span_id": "def456",
"message": "Configuration applied successfully",
"fields": {
"config_id": "config-123",
"version": 42,
"duration_ms": 150
}
}
```
**Log Targets**:
- Master: systemd journal, file, or centralized (ELK/Loki)
- Agent: stdout (Docker), file (standalone), or remote
---
### OB-002: Distributed Tracing
**Description**: OpenTelemetry tracing for request flow visualization.
**Traced Operations**:
- Configuration push (master → agent → nginx)
- Health check cycles
- Certificate issuance
- API requests
**Span Attributes**:
- `nxmesh.agent_id`
- `nxmesh.config_id`
- `nxmesh.workspace_id`
- `nxmesh.organization_id`
---
### OB-003: Access Log Aggregation
**Description**: Collect and query nginx access logs from all agents.
**Features**:
- Centralized access log storage
- Real-time log streaming
- SQL-like query interface
- Log retention policies
**Access Log Schema**:
```rust
struct AccessLogEntry {
id: Uuid,
agent_id: Uuid,
timestamp: DateTime,
// Request details
remote_addr: String,
method: String,
uri: String,
protocol: String,
host: String,
// Response details
status: u16,
body_bytes_sent: u64,
response_time_ms: f64,
// Additional fields
user_agent: Option<String>,
referer: Option<String>,
request_id: Option<String>,
}
```
**Query API**:
```graphql
# Example query
query {
accessLogs(
filter: {
agentId: "...",
timeRange: { from: "2026-03-01", to: "2026-03-02" },
statusCode: { gte: 500 }
},
limit: 100
) {
timestamp
method
uri
status
responseTimeMs
}
}
```
---
## Security Features
### SF-001: Authentication and Authorization
**Description**: Multi-method authentication with fine-grained RBAC.
**Authentication Methods**:
- JWT (for API/Web UI)
- Password-based login (local user accounts)
- OAuth2/OIDC (Google, GitHub, enterprise SSO)
- API Keys (for service accounts)
- **TLS + Shared Secret** (for agent communication)
- Server-side TLS (auto-generated self-signed or custom certificates)
- Bootstrap token for initial registration
- Session key with HMAC signing for ongoing requests
- Primary/secondary key rotation
**RBAC Model**:
```rust
struct Role {
id: Uuid,
name: String,
permissions: Vec<Permission>,
}
enum Permission {
// Organization scope
OrganizationRead,
OrganizationWrite,
OrganizationDelete,
// Workspace scope
WorkspaceRead,
WorkspaceWrite,
WorkspaceDelete,
// Agent scope
AgentRead,
AgentWrite,
AgentReload,
AgentDelete,
// Config scope
ConfigRead,
ConfigWrite,
ConfigDeploy,
ConfigDelete,
// Certificate scope
CertificateRead,
CertificateWrite,
CertificateDelete,
// User management
UserRead,
UserWrite,
UserDelete,
}
```
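A permission check over this model could be as simple as the sketch below. It assumes a user's effective permissions are the union of all assigned roles, which the document does not state. Only a subset of the `Permission` variants is repeated, and `effective_permissions` and `is_allowed` are hypothetical helpers.

```rust
use std::collections::HashSet;

// Subset of the Permission enum above, with the derives a set lookup needs.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Permission {
    AgentRead,
    AgentReload,
    ConfigRead,
    ConfigWrite,
    ConfigDeploy,
}

struct Role {
    name: String,
    permissions: Vec<Permission>,
}

/// Union of the permissions granted by all roles (assumed semantics).
fn effective_permissions(roles: &[Role]) -> HashSet<Permission> {
    roles
        .iter()
        .flat_map(|role| role.permissions.iter().copied())
        .collect()
}

/// True when every required permission is granted by at least one role.
fn is_allowed(roles: &[Role], required: &[Permission]) -> bool {
    let granted = effective_permissions(roles);
    required.iter().all(|p| granted.contains(p))
}
```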
---
### SF-002: Secret Management
**Description**: Secure storage and distribution of sensitive data.
**Secrets**:
- SSL private keys
- API tokens
- Database passwords
- External service credentials
**Security Measures**:
- Encryption at rest (AES-256-GCM)
- Encryption in transit (TLS 1.3)
- Automatic secret rotation
- Audit logging for secret access
---
### SF-003: Network Security
**Description**: Network-level security controls.
**Features**:
- IP allowlisting for agent connections
- Rate limiting on API endpoints
- DDoS protection recommendations
- Security headers enforcement (HSTS, CSP, etc.)
**Agent Connection Security**:
- **TLS Encryption**: Server-side TLS (auto-generated or custom certificates)
- Development: Self-signed certificates auto-generated on first start
- Production: Valid certificates (Let's Encrypt or corporate CA)
- **Bootstrap Authentication**: One-time token for initial registration
- **Session Authentication**: HMAC-signed requests with shared session key
- **Key Rotation**: Primary/secondary key design for seamless rotation
- **Certificate Pinning**: Optional fingerprint verification for additional security

docs/project-structure.md Normal file

@@ -0,0 +1,428 @@
# NxMesh Project Structure
This document outlines the recommended project structure for the NxMesh codebase.
## Directory Layout
```
nxmesh/
├── Cargo.toml # Workspace root
├── Cargo.lock
├── README.md
├── LICENSE
├── justfile # Task runner
├── AGENTS.md # AI agent context
│
├── crates/ # Rust workspace crates
│ ├── nxmesh-core/ # Shared core library
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── lib.rs
│ │ ├── models/ # Shared data models
│ │ │ ├── mod.rs
│ │ │ ├── organization.rs
│ │ │ ├── workspace.rs
│ │ │ ├── agent.rs
│ │ │ ├── config.rs
│ │ │ └── certificate.rs
│ │ ├── crypto/ # Encryption, hashing
│ │ ├── validation/ # Input validation
│ │ └── error.rs # Common error types
│ │
│ ├── nxmesh-proto/ # Protocol buffers
│ │ ├── Cargo.toml
│ │ ├── build.rs
│ │ └── proto/
│ │ ├── agent.proto
│ │ ├── config.proto
│ │ └── common.proto
│ │
│ ├── nxmesh-master/ # Control plane
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── main.rs
│ │ ├── lib.rs
│ │ ├── api/ # REST API handlers
│ │ │ ├── mod.rs
│ │ │ ├── routes.rs
│ │ │ ├── middleware/
│ │ │ ├── v1/ # API version 1
│ │ │ │ ├── mod.rs
│ │ │ │ ├── organizations.rs
│ │ │ │ ├── workspaces.rs
│ │ │ │ ├── agents.rs
│ │ │ │ ├── virtual_hosts.rs
│ │ │ │ ├── upstreams.rs
│ │ │ │ ├── certificates.rs
│ │ │ │ └── metrics.rs
│ │ │ └── websocket.rs
│ │ ├── grpc/ # gRPC service
│ │ │ ├── mod.rs
│ │ │ ├── server.rs
│ │ │ ├── agent_service.rs
│ │ │ └── interceptor.rs
│ │ ├── config/ # Configuration
│ │ │ ├── mod.rs
│ │ │ └── settings.rs
│ │ ├── db/ # Database layer
│ │ │ ├── mod.rs
│ │ │ ├── connection.rs
│ │ │ ├── migration.rs
│ │ │ └── repositories/
│ │ ├── services/ # Business logic
│ │ │ ├── mod.rs
│ │ │ ├── organization_service.rs
│ │ │ ├── workspace_service.rs
│ │ │ ├── agent_service.rs
│ │ │ ├── config_service.rs
│ │ │ ├── certificate_service.rs
│ │ │ └── auth_service.rs
│ │ ├── domain/ # Domain entities
│ │ │ ├── mod.rs
│ │ │ ├── organization.rs
│ │ │ ├── agent.rs
│ │ │ └── config.rs
│ │ ├── infrastructure/ # External integrations
│ │ │ ├── mod.rs
│ │ │ ├── acme/ # Let's Encrypt
│ │ │ ├── storage/ # Object storage
│ │ │ └── notifier/ # Notifications
│ │ ├── events/ # Event bus
│ │ │ ├── mod.rs
│ │ │ ├── bus.rs
│ │ │ └── handlers.rs
│ │ └── cli.rs # CLI commands
│ │
│ ├── nxmesh-agent/ # Data plane
│ │ ├── Cargo.toml
│ │ └── src/
│ │ ├── main.rs
│ │ ├── lib.rs
│ │ ├── config/ # Agent configuration
│ │ │ ├── mod.rs
│ │ │ └── settings.rs
│ │ ├── master/ # Master communication
│ │ │ ├── mod.rs
│ │ │ ├── client.rs
│ │ │ ├── reconnect.rs
│ │ │ └── stream.rs
│ │ ├── nginx/ # Nginx management
│ │ │ ├── mod.rs
│ │ │ ├── controller.rs
│ │ │ ├── config_manager.rs # Symlink-based atomic deployment
│ │ │ ├── config_renderer.rs
│ │ │ ├── validator.rs
│ │ │ ├── docker_sidecar.rs # Docker sidecar (PID namespace sharing)
│ │ │ ├── systemd.rs # Standalone mode
│ │ │ └── parser.rs # Nginx config parser
│ │ ├── health/ # Health monitoring
│ │ │ ├── mod.rs
│ │ │ ├── monitor.rs
│ │ │ ├── nginx.rs
│ │ │ └── system.rs
│ │ ├── metrics/ # Metrics collection
│ │ │ ├── mod.rs
│ │ │ ├── collector.rs
│ │ │ └── exporter.rs
│ │ ├── cache/ # Local caching
│ │ │ ├── mod.rs
│ │ │ └── config_cache.rs
│ │ ├── watch/ # File watchers
│ │ │ ├── mod.rs
│ │ │ └── config_watch.rs
│ │ └── cli.rs # CLI commands
│ │
│ └── nxmesh-cli/ # CLI tool
│ ├── Cargo.toml
│ └── src/
│ ├── main.rs
│ ├── commands/ # CLI commands
│ │ ├── mod.rs
│ │ ├── login.rs
│ │ ├── agent.rs
│ │ ├── config.rs
│ │ └── deploy.rs
│ └── api/ # API client
├── frontend/ # Web UI (embedded in master)
│ ├── package.json
│ ├── vite.config.ts
│ ├── tsconfig.json
│ ├── index.html
│ ├── src/
│ │ ├── main.tsx
│ │ ├── App.tsx
│ │ ├── components/ # Reusable components
│ │ │ ├── common/
│ │ │ ├── layout/
│ │ │ └── forms/
│ │ ├── pages/ # Page components
│ │ │ ├── Dashboard/
│ │ │ ├── Agents/
│ │ │ ├── Configurations/
│ │ │ ├── Certificates/
│ │ │ └── Settings/
│ │ ├── hooks/ # React hooks
│ │ ├── stores/ # State management (Zustand)
│ │ ├── api/ # API client
│ │ ├── types/ # TypeScript types
│ │ ├── utils/ # Utilities
│ │ └── styles/ # CSS/Tailwind
│ └── public/
│ # Build output (dist/) is embedded into master binary
│ # Master serves static files at root path ("/")
├── migrations/ # Database migrations
│ └── sea-orm/
│ ├── Cargo.toml
│ └── src/
├── tests/ # Integration tests
│ ├── integration/
│ │ ├── master_api_tests.rs
│ │ ├── agent_master_tests.rs
│ │ └── config_flow_tests.rs
│ └── fixtures/
├── scripts/ # Build/utility scripts
│ ├── build.sh
│ ├── test.sh
│ └── release.sh
├── deploy/ # Deployment configs
│ ├── docker/
│ │ ├── master.Dockerfile
│ │ ├── agent.Dockerfile
│ │ └── docker-compose.yml
│ ├── k8s/
│ │ ├── namespace.yaml
│ │ ├── master/
│ │ ├── agent/
│ │ └── helm/
│ └── terraform/
├── docs/ # Documentation
│ ├── architecture.md
│ ├── features.md
│ ├── roadmap.md
│ ├── api.md
│ ├── deployment.md
│ └── project-structure.md
└── .devcontainer/ # Dev container
├── devcontainer.json
├── docker-compose.yml
├── Dockerfile
└── nginx/
```
## Crate Dependencies
```mermaid
graph TB
subgraph "Workspace Crates"
CLI[nxmesh-cli]
AGENT[nxmesh-agent]
MASTER[nxmesh-master]
PROTO[nxmesh-proto]
CORE[nxmesh-core]
end
CORE --> PROTO
AGENT --> CORE
AGENT --> PROTO
MASTER --> CORE
MASTER --> PROTO
CLI --> CORE
```
## Key Design Principles
### 1. Separation of Concerns
- **nxmesh-core**: Only shared types and utilities
- **nxmesh-master**: Only control plane logic
- **nxmesh-agent**: Only data plane logic
- **frontend**: Only UI logic
### 2. Domain-Driven Design (in Master)
```
domain/ # Domain entities (pure logic)
services/ # Application services (orchestration)
repositories/ # Data access abstraction
api/ # Interface adapters (HTTP, gRPC)
infrastructure/ # External concerns
```
### 3. Agent Modularity
Each major concern in the agent is a separate module:
- `nginx/`: All nginx-specific code
- `master/`: All master communication code
- `health/`: All health monitoring code
- `metrics/`: All metrics code
### 4. Configuration Management
Use hierarchical config:
1. Default values (in code)
2. Config file (`/etc/nxmesh/*.toml`)
3. Environment variables
4. Command-line arguments (highest priority)
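
The four-layer precedence can be sketched by folding optional settings from lowest to highest priority. The field names in `PartialSettings` are hypothetical, and parsing of the TOML file, environment, and arguments is out of scope here. Each source is assumed to have already been read into a `PartialSettings` value.

```rust
// Every field is optional so a layer can leave it unset.
#[derive(Debug, Default, Clone, PartialEq)]
struct PartialSettings {
    master_url: Option<String>,
    report_interval_secs: Option<u64>,
    log_level: Option<String>,
}

impl PartialSettings {
    /// Fields set in `higher` win; unset fields fall through to `self`.
    fn overlay(self, higher: PartialSettings) -> PartialSettings {
        PartialSettings {
            master_url: higher.master_url.or(self.master_url),
            report_interval_secs: higher.report_interval_secs.or(self.report_interval_secs),
            log_level: higher.log_level.or(self.log_level),
        }
    }
}

/// Layers are given lowest priority first: defaults, file, env, CLI.
fn resolve(layers: Vec<PartialSettings>) -> PartialSettings {
    layers
        .into_iter()
        .fold(PartialSettings::default(), PartialSettings::overlay)
}
```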
## Module Guidelines
### API Versioning
- Always version REST APIs: `/api/v1/...`
- Maintain backward compatibility within major versions
- Use feature flags for gradual rollouts
### Error Handling
- Use `thiserror` for error definitions
- Propagate errors with context
- Convert to user-friendly messages at API boundary
### Testing Structure
```rust
// In each module
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_feature() {
// unit tests
}
}
```
- Unit tests: In same file as code
- Integration tests: In `tests/` directory
- E2E tests: Separate crate or external repo
### Documentation
- All public APIs must have doc comments
- Include examples in doc comments
- Keep README files in each crate
## Build Configuration
### Workspace Cargo.toml
```toml
[workspace]
members = [
"crates/nxmesh-core",
"crates/nxmesh-proto",
"crates/nxmesh-master",
"crates/nxmesh-agent",
"crates/nxmesh-cli",
]
resolver = "3"
[workspace.dependencies]
# Core dependencies
tokio = { version = "1", features = ["full"] }
serde = { version = "1", features = ["derive"] }
thiserror = "1"
tracing = "0.1"
# Web framework
axum = "0.7"
tower = "0.4"
tower-http = "0.5"
# gRPC
tonic = "0.11"
prost = "0.12"
# Database
sea-orm = "2.0.0-rc"
sea-orm-migration = "2.0.0-rc"
# Async
async-trait = "0.1"
futures = "0.3"
# Serialization
serde_json = "1"
toml = "0.8"
# HTTP
reqwest = { version = "0.12", default-features = false }
# Crypto
sha2 = "0.10"
hex = "0.4"
# Testing
tokio-test = "0.4"
mockall = "0.12"
```
## Naming Conventions
### Files
- Use `snake_case` for file names
- Module entry point: `mod.rs` or `{module_name}.rs`
### Types
- Structs/Enums: `PascalCase`
- Traits: `PascalCase` (often ending in `able` or with verb prefix)
- Functions/Methods: `snake_case`
- Constants: `SCREAMING_SNAKE_CASE`
- Generic parameters: Single uppercase letter (`T`, `K`, `V`)
### Error Types
- Suffix with `Error`: `ConfigError`, `AgentError`
- Group in `error.rs` or `errors/` module
### Feature Flags
- Use `kebab-case`: `postgres-native`, `tls-rustls`
## CI/CD Structure
```yaml
# .github/workflows/
├── ci.yml # PR checks
├── test.yml # Test suite
├── release.yml # Release builds
├── docker.yml # Docker image builds
└── docs.yml # Documentation deploy
```
## Scripts
Common operations should have `just` commands:
```justfile
# Development
just dev # Start all services
just dev-backend # Start backend only
just dev-frontend # Start frontend only
# Testing
just test # Run all tests
just test-unit # Unit tests only
just test-integration # Integration tests
# Building
just build # Build all
just build-master # Build master only
just build-agent # Build agent only
# Database
just db-migrate # Run migrations
just db-reset # Reset database
just db-console # Open psql
# Deployment
just docker-build # Build Docker images
just k8s-deploy # Deploy to Kubernetes
```

docs/roadmap.md Normal file

@@ -0,0 +1,486 @@
# NxMesh Project Roadmap
## Overview
This document outlines the development phases and milestones for NxMesh. The project is divided into four major phases, each building upon the previous one.
---
## Phase 1: Foundation (Months 1-3)
**Goal**: Build a working MVP with basic master-agent communication and nginx configuration management.
### Milestone 1.1: Project Setup and Core Infrastructure
**Target**: Week 2
| Task | Description | Status |
|------|-------------|--------|
| [ ] | Set up Rust workspace structure (master, agent, shared) | 🔲 |
| [ ] | Configure CI/CD pipeline (GitHub Actions) | 🔲 |
| [ ] | Set up database schema with SeaORM migrations | 🔲 |
| [ ] | Create development environment (devcontainer) | 🔲 |
| [ ] | Set up testing framework (unit, integration) | 🔲 |
**Deliverables**:
- Working development environment
- Database schema for organizations, workspaces, agents
- CI pipeline with linting and testing
---
### Milestone 1.2: Master - Core API
**Target**: Week 5
| Task | Description | Status |
|------|-------------|--------|
| [ ] | Implement Axum-based REST API server | 🔲 |
| [ ] | JWT authentication middleware | 🔲 |
| [ ] | CRUD endpoints for Organizations | 🔲 |
| [ ] | CRUD endpoints for Workspaces | 🔲 |
| [ ] | CRUD endpoints for Agents | 🔲 |
| [ ] | PostgreSQL persistence layer | 🔲 |
**Deliverables**:
- REST API for basic resource management
- JWT authentication working
- API documentation (OpenAPI)
---
### Milestone 1.3: Master - Agent Communication
**Target**: Week 7
| Task | Description | Status |
|------|-------------|--------|
| [ ] | gRPC server implementation (Tonic) | 🔲 |
| [ ] | Bidirectional streaming protocol | 🔲 |
| [ ] | Agent registration flow | 🔲 |
| [ ] | Token-based authentication for agents | 🔲 |
| [ ] | Agent heartbeat/health monitoring | 🔲 |
| [ ] | WebSocket fallback for events | 🔲 |
**Deliverables**:
- Master can accept agent connections
- Agent registration and authentication works
- Health status tracking
---
### Milestone 1.4: Agent - Core Functionality
**Target**: Week 9
| Task | Description | Status |
|------|-------------|--------|
| [ ] | Agent CLI and configuration | 🔲 |
| [ ] | gRPC client for master communication | 🔲 |
| [ ] | Automatic reconnection with backoff | 🔲 |
| [ ] | Nginx process management (Docker sidecar PID sharing) | 🔲 |
| [ ] | Health check reporting | 🔲 |
| [ ] | Local config caching | 🔲 |
**Deliverables**:
- Agent binary that connects to master
- Nginx lifecycle management (Docker sidecar mode)
- Health reporting
---
### Milestone 1.5: Configuration Management
**Target**: Week 11
| Task | Description | Status |
|------|-------------|--------|
| [ ] | VirtualHost CRUD API | 🔲 |
| [ ] | Upstream CRUD API | 🔲 |
| [ ] | Handlebars template engine integration | 🔲 |
| [ ] | Config rendering on agent | 🔲 |
| [ ] | Nginx config validation (`nginx -t`) | 🔲 |
| [ ] | Graceful reload on config change | 🔲 |
**Deliverables**:
- End-to-end config push: Master → Agent → Nginx
- Basic virtual host and upstream management
- Template-based nginx config generation
---
### Milestone 1.6: Web Admin Console - Foundation
**Target**: Week 13
| Task | Description | Status |
|------|-------------|--------|
| [ ] | React + Vite project setup | 🔲 |
| [ ] | Authentication UI (login/logout) | 🔲 |
| [ ] | Dashboard layout and navigation | 🔲 |
| [ ] | Agent list and detail views | 🔲 |
| [ ] | Basic virtual host form | 🔲 |
| [ ] | WebSocket integration for real-time updates | 🔲 |
**Deliverables**:
- Functional Web UI
- Agent management via UI
- Basic configuration editing
---
### Phase 1 Completion Criteria
- [ ] Master and Agent communicate via gRPC
- [ ] Nginx configs can be pushed from Master to Agent
- [ ] Web UI for basic management
- [ ] Docker sidecar deployment working
- [ ] Documentation complete
**Estimated Effort**: 3 months
**Team Size**: 2-3 engineers
---
## Phase 2: Resilience and Observability (Months 4-5)
**Goal**: Make the system production-ready with HA, monitoring, and robust failure handling.
### Milestone 2.1: High Availability - Master Clustering
**Target**: Week 15
| Task | Description | Status |
|------|-------------|--------|
| [ ] | Raft consensus integration (raft-rs) | 🔲 |
| [ ] | Leader election | 🔲 |
| [ ] | State replication across masters | 🔲 |
| [ ] | Agent connection failover | 🔲 |
| [ ] | Cluster health monitoring | 🔲 |
**Deliverables**:
- Multiple master instances can form a cluster
- Automatic failover on master failure
- No single point of failure
---
### Milestone 2.2: Certificate Management
**Target**: Week 17
| Task | Description | Status |
|------|-------------|--------|
| [ ] | ACME client integration (acme-rs) | 🔲 |
| [ ] | Let's Encrypt HTTP-01 challenge | 🔲 |
| [ ] | Certificate storage (encrypted) | 🔲 |
| [ ] | Automatic renewal | 🔲 |
| [ ] | Certificate distribution to agents | 🔲 |
| [ ] | Expiration monitoring and alerts | 🔲 |
**Deliverables**:
- Automatic SSL certificate provisioning
- Certificate renewal before expiry
- UI for certificate management
---
### Milestone 2.3: Observability Stack
**Target**: Week 19
| Task | Description | Status |
|------|-------------|--------|
| [ ] | OpenTelemetry integration | 🔲 |
| [ ] | Structured logging (tracing) | 🔲 |
| [ ] | Prometheus metrics endpoint (agent) | 🔲 |
| [ ] | Custom metrics collection | 🔲 |
| [ ] | Health check dashboard | 🔲 |
| [ ] | Alert configuration | 🔲 |
**Deliverables**:
- Metrics visible in Prometheus
- Distributed traces for config pushes
- Health dashboard in Web UI
---
### Milestone 2.4: Enhanced Failure Handling
**Target**: Week 21
| Task | Description | Status |
|------|-------------|--------|
| [ ] | Configuration drift detection | 🔲 |
| [ ] | Auto-healing (config sync) | 🔲 |
| [ ] | Circuit breaker for master connection | 🔲 |
| [ ] | Nginx crash detection and restart | 🔲 |
| [ ] | Config rollback on validation failure | 🔲 |
| [ ] | Bulk operations and queue management | 🔲 |
**Deliverables**:
- System self-heals from common failures
- Config drift automatically corrected
- Robust reconnection logic
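
The drift-detection task above could be approached by comparing a fingerprint of the config the master expects with a fingerprint of what the agent finds on disk. The following is a minimal sketch, not the project's implementation: `DriftStatus`, `fingerprint`, and `detect_drift` are hypothetical names, and the standard library's `DefaultHasher` stands in for a stable digest such as SHA-256, which a real agent would likely need because `DefaultHasher` output is not guaranteed to stay the same across Rust releases. Line endings and trailing whitespace are normalized so that cosmetic differences are not reported as drift.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Result of comparing the desired config with the config found on disk.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
pub enum DriftStatus {
    InSync,
    Drifted { desired: u64, actual: u64 },
}

/// Fingerprint a config, ignoring line-ending style and trailing whitespace.
/// Assumption: DefaultHasher is good enough for a sketch; it is neither
/// cryptographic nor stable across Rust versions.
fn fingerprint(config: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    for line in config.lines() {
        // `lines()` already strips "\n" and "\r\n"; also drop trailing spaces.
        line.trim_end().hash(&mut hasher);
    }
    hasher.finish()
}

/// Report whether the on-disk config still matches the desired config.
pub fn detect_drift(desired: &str, actual: &str) -> DriftStatus {
    let (d, a) = (fingerprint(desired), fingerprint(actual));
    if d == a {
        DriftStatus::InSync
    } else {
        DriftStatus::Drifted { desired: d, actual: a }
    }
}
```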
---
### Phase 2 Completion Criteria
- [ ] Master clustering with Raft
- [ ] Automatic SSL certificates
- [ ] Full observability (metrics, logs, traces)
- [ ] Production-grade failure handling
- [ ] Performance benchmarks
**Estimated Effort**: 2 months
**Team Size**: 2-3 engineers
---
## Phase 3: Advanced Traffic Management (Months 6-7)
**Goal**: Add enterprise-grade traffic management features.
### Milestone 3.1: Advanced Load Balancing
**Target**: Week 23
| Task | Description | Status |
|------|-------------|--------|
| [ ] | Multiple load balancing algorithms | 🔲 |
| [ ] | Health checks for upstream servers | 🔲 |
| [ ] | Circuit breaker for upstreams | 🔲 |
| [ ] | Retry policies | 🔲 |
| [ ] | Connection pooling | 🔲 |
| [ ] | Upstream status dashboard | 🔲 |
**Deliverables**:
- Advanced upstream configuration
- Health check visualization
- Circuit breaker metrics
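
To make the upstream tasks above more concrete, here is a hedged sketch of how a master might render an nginx `upstream` block from structured settings. The types `UpstreamServer` and `Algorithm` and the function `render_upstream` are hypothetical and not taken from the NxMesh codebase. The generated directives (`least_conn`, `ip_hash`, and the `weight`, `max_fails`, and `fail_timeout` server parameters) are standard nginx; `max_fails` and `fail_timeout` give passive failure detection, which is the closest built-in analogue to a circuit breaker in open-source nginx.

```rust
/// One upstream server entry (hypothetical shape, for illustration).
pub struct UpstreamServer {
    pub address: String,
    pub weight: u32,
    pub max_fails: u32,
    pub fail_timeout_secs: u32,
}

/// Load balancing methods covered by this sketch.
pub enum Algorithm {
    RoundRobin, // nginx default, emits no directive
    LeastConn,
    IpHash,
}

/// Render an nginx `upstream` block as text.
pub fn render_upstream(name: &str, algorithm: &Algorithm, servers: &[UpstreamServer]) -> String {
    let mut out = format!("upstream {} {{\n", name);
    match algorithm {
        Algorithm::RoundRobin => {}
        Algorithm::LeastConn => out.push_str("    least_conn;\n"),
        Algorithm::IpHash => out.push_str("    ip_hash;\n"),
    }
    for s in servers {
        // A server is marked unavailable for `fail_timeout` after
        // `max_fails` failed attempts within that same window.
        out.push_str(&format!(
            "    server {} weight={} max_fails={} fail_timeout={}s;\n",
            s.address, s.weight, s.max_fails, s.fail_timeout_secs
        ));
    }
    out.push_str("}\n");
    out
}
```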
---
### Milestone 3.2: Rate Limiting and WAF
**Target**: Week 25
| Task | Description | Status |
|------|-------------|--------|
| [ ] | Rate limiting rules (IP, user, global) | 🔲 |
| [ ] | Rate limiting zones | 🔲 |
| [ ] | Basic WAF rules (ModSecurity integration) | 🔲 |
| [ ] | IP allowlist/blocklist | 🔲 |
| [ ] | Geo-blocking | 🔲 |
| [ ] | Rate limit analytics | 🔲 |
**Deliverables**:
- Configurable rate limiting
- Basic WAF protection
- Security event dashboard
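
As an illustration of the rate limiting rules and zones listed above, the sketch below renders nginx's `limit_req_zone` and `limit_req` directives from a rule description. `RateLimitKey`, `RateLimitRule`, `render_zone`, and `render_limit` are hypothetical names. The mapping of "user" to `$remote_user` assumes users authenticate via HTTP basic auth, and "global" is approximated by keying on `$server_name`; both are assumptions rather than decisions documented in this roadmap.

```rust
/// What a rate limit is keyed on (hypothetical enum).
pub enum RateLimitKey {
    ClientIp, // one bucket per client address
    User,     // assumption: basic-auth user name
    Global,   // assumption: one shared bucket per virtual server
}

/// A single rate limiting rule (hypothetical shape).
pub struct RateLimitRule {
    pub zone: String,
    pub key: RateLimitKey,
    pub zone_size_mb: u32,
    pub requests_per_second: u32,
    pub burst: u32,
}

/// Render the `limit_req_zone` directive (belongs in the `http` context).
pub fn render_zone(rule: &RateLimitRule) -> String {
    let key = match rule.key {
        RateLimitKey::ClientIp => "$binary_remote_addr",
        RateLimitKey::User => "$remote_user",
        RateLimitKey::Global => "$server_name",
    };
    format!(
        "limit_req_zone {} zone={}:{}m rate={}r/s;",
        key, rule.zone, rule.zone_size_mb, rule.requests_per_second
    )
}

/// Render the `limit_req` directive (for a `server` or `location` block).
pub fn render_limit(rule: &RateLimitRule) -> String {
    format!("limit_req zone={} burst={} nodelay;", rule.zone, rule.burst)
}
```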
---
### Milestone 3.3: Traffic Routing and Canary
**Target**: Week 27
| Task | Description | Status |
|------|-------------|--------|
| [ ] | Header-based routing | 🔲 |
| [ ] | Weight-based traffic splitting | 🔲 |
| [ ] | Canary deployment support | 🔲 |
| [ ] | A/B testing configuration | 🔲 |
| [ ] | Blue-green deployment | 🔲 |
| [ ] | Traffic analytics | 🔲 |
**Deliverables**:
- Advanced traffic routing
- Canary deployment UI
- Traffic split visualization
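
Weight-based traffic splitting and canary rollout, as listed above, map naturally onto nginx's `split_clients` directive, which hashes a key and assigns each request to a named bucket by percentage. The sketch below renders such a block; `render_canary_split` and its parameters are hypothetical, the client address is assumed as the hash key, and the canary share is restricted to whole percentages from 1 to 99 to keep validation simple. The resulting variable could then be referenced from a `proxy_pass` directive to choose between upstream groups.

```rust
/// Render an nginx `split_clients` block that sends `canary_percent` of
/// clients to `canary_upstream` and the rest to `stable_upstream`.
/// Hypothetical helper: keys on the client address and accepts 1..=99 only.
pub fn render_canary_split(
    variable: &str,
    canary_upstream: &str,
    stable_upstream: &str,
    canary_percent: u8,
) -> Result<String, String> {
    if canary_percent == 0 || canary_percent >= 100 {
        return Err(format!(
            "canary_percent must be between 1 and 99, got {}",
            canary_percent
        ));
    }
    // `{{` and `}}` are literal braces in a Rust format string.
    Ok(format!(
        "split_clients \"${{remote_addr}}\" ${} {{\n    {}% {};\n    * {};\n}}\n",
        variable, canary_percent, canary_upstream, stable_upstream
    ))
}
```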
---
### Milestone 3.4: Access Log Aggregation
**Target**: Week 29
| Task | Description | Status |
|------|-------------|--------|
| [ ] | Nginx access log parsing | 🔲 |
| [ ] | Log streaming to master | 🔲 |
| [ ] | Log storage and indexing | 🔲 |
| [ ] | Log query interface | 🔲 |
| [ ] | Real-time log tailing | 🔲 |
| [ ] | Log-based alerting | 🔲 |
**Deliverables**:
- Centralized access logs
- Log search and filtering
- Log-based metrics
---
### Phase 3 Completion Criteria
- [ ] Advanced load balancing and health checks
- [ ] Rate limiting and basic WAF
- [ ] Canary and A/B testing
- [ ] Access log aggregation
- [ ] Traffic analytics dashboard
**Estimated Effort**: 2 months
**Team Size**: 2-3 engineers
---
## Phase 4: Enterprise Features (Months 8-10)
**Goal**: Achieve enterprise readiness with multi-tenancy, RBAC, and advanced integrations.
### Milestone 4.1: Multi-tenancy and RBAC
**Target**: Week 31
| Task | Description | Status |
|------|-------------|--------|
| [ ] | Organization isolation | 🔲 |
| [ ] | Workspace-scoped resources | 🔲 |
| [ ] | Role-based access control | 🔲 |
| [ ] | User management API | 🔲 |
| [ ] | API key management | 🔲 |
| [ ] | Audit logging | 🔲 |
**Deliverables**:
- Full multi-tenancy
- Granular permissions
- Audit trail
---
### Milestone 4.2: Kubernetes Integration
**Target**: Week 33
| Task | Description | Status |
|------|-------------|--------|
| [ ] | Kubernetes operator | 🔲 |
| [ ] | CRD definitions | 🔲 |
| [ ] | Helm chart | 🔲 |
| [ ] | Service discovery integration | 🔲 |
| [ ] | Ingress controller mode | 🔲 |
| [ ] | K8s-native agent deployment | 🔲 |
**Deliverables**:
- Kubernetes operator
- Helm chart for easy deployment
- Ingress controller functionality
---
### Milestone 4.3: External Integrations
**Target**: Week 35
| Task | Description | Status |
|------|-------------|--------|
| [ ] | Terraform provider | 🔲 |
| [ ] | GitOps integration (Git sync) | 🔲 |
| [ ] | Webhook support | 🔲 |
| [ ] | Slack/Discord notifications | 🔲 |
| [ ] | PagerDuty/Opsgenie integration | 🔲 |
| [ ] | DNS provider integration (Route53, Cloudflare) | 🔲 |
**Deliverables**:
- Infrastructure as Code support
- GitOps workflows
- Notification channels
---
### Milestone 4.4: Performance and Scale
**Target**: Week 37
| Task | Description | Status |
|------|-------------|--------|
| [ ] | Connection pooling optimization | 🔲 |
| [ ] | Config caching improvements | 🔲 |
| [ ] | Database query optimization | 🔲 |
| [ ] | Horizontal scaling tests | 🔲 |
| [ ] | Load testing (10k+ agents) | 🔲 |
| [ ] | Performance tuning documentation | 🔲 |
**Deliverables**:
- Performance benchmarks
- Scaling guidelines
- Optimization recommendations
---
### Milestone 4.5: Enterprise Security
**Target**: Week 39
| Task | Description | Status |
|------|-------------|--------|
| [ ] | mTLS for all communications | 🔲 |
| [ ] | Secret encryption at rest | 🔲 |
| [ ] | HSM integration | 🔲 |
| [ ] | SSO/SAML integration | 🔲 |
| [ ] | Security scanning (SAST/DAST) | 🔲 |
| [ ] | Compliance documentation (SOC2) | 🔲 |
**Deliverables**:
- Enterprise security features
- Compliance documentation
- Security audit
---
### Phase 4 Completion Criteria
- [ ] Full RBAC and multi-tenancy
- [ ] Kubernetes operator
- [ ] External integrations (Terraform, GitOps)
- [ ] Proven scalability (10k+ agents)
- [ ] Enterprise security compliance
**Estimated Effort**: 3 months
**Team Size**: 3-4 engineers
---
## Timeline Summary
```
Month 1-3: ████████████████████████████████████████ Phase 1: Foundation
Month 4-5: ████████████████████ Phase 2: Resilience
Month 6-7: ████████████████████ Phase 3: Traffic Management
Month 8-10: ██████████████████████████ Phase 4: Enterprise
Week: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
|--M1--|--M2--|--M3--|--M4--|--M5--|--M6--|
Week: 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
|--M7--|--M8--|--M9--|--M10-|--M11-|--M12-|--M13-|--M14-|
```
---
## Resource Requirements
### Phase 1
- **Backend Engineers**: 2
- **Frontend Engineer**: 1
- **Total Person-Months**: 9
### Phase 2
- **Backend Engineers**: 2
- **Frontend Engineer**: 1 (part-time)
- **DevOps Engineer**: 1 (part-time)
- **Total Person-Months**: 7
### Phase 3
- **Backend Engineers**: 2
- **Frontend Engineer**: 1
- **Total Person-Months**: 6
### Phase 4
- **Backend Engineers**: 2
- **Frontend Engineer**: 1
- **DevOps Engineer**: 1
- **Security Engineer**: 1 (part-time)
- **Total Person-Months**: 10
**Total Project**: ~32 person-months (9 + 7 + 6 + 10)
---
## Risk Assessment
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Raft complexity delays HA | Medium | High | Start with single master, add HA later |
| gRPC performance issues | Low | Medium | Implement WebSocket fallback early |
| Nginx reload edge cases | Medium | High | Extensive testing, rollback capability |
| Team scaling challenges | Medium | Medium | Document architecture, modular design |
| Integration complexity | Medium | Medium | Clear APIs, contract testing |