Add project structure and roadmap documentation
- Created `project-structure.md` to outline the directory layout, crate dependencies, design principles, module guidelines, and naming conventions for the NxMesh codebase.
- Introduced `roadmap.md` detailing the development phases, milestones, tasks, deliverables, and resource requirements for the NxMesh project, spanning from foundational setup to enterprise features.
@@ -8,6 +8,7 @@ RUN apt-get update && apt-get install -y \
     pkg-config \
     libssl-dev \
     postgresql-client \
+    protobuf-compiler \
     && rm -rf /var/lib/apt/lists/*
 
 # Set working directory
@@ -27,7 +27,7 @@ services:
       - docker
     pid: "service:nginx"
 
-  # Data Plane - Nginx (controlled by agent via Docker)
+  # Data Plane - Nginx (controlled by agent via PID namespace sharing)
   nginx:
     image: nginx:alpine
     container_name: nxmesh-nginx
104
AGENTS.md
Normal file
@@ -0,0 +1,104 @@
|
|||||||
|
# NxMesh - Agent Instructions
|
||||||
|
|
||||||
|
This document provides context for AI agents working on the NxMesh project.
|
||||||
|
|
||||||
|
## Project Overview
|
||||||
|
|
||||||
|
**NxMesh** is a distributed nginx management system using a master-agent architecture:
|
||||||
|
|
||||||
|
- **Master (Control Plane)**: Central API, embedded Web UI, configuration distribution, cluster management
|
||||||
|
- **Agent (Data Plane)**: Sidecar that manages local nginx instances
|
||||||
|
- **Web UI**: React admin console built with Vite, embedded in and served by the master
|
||||||
|
|
||||||
|
## Quick Links to Documentation
|
||||||
|
|
||||||
|
| Document | Purpose |
|----------|---------|
| [README.md](./README.md) | Project overview and quick start |
| [docs/architecture.md](./docs/architecture.md) | System design and data flow |
| [docs/features.md](./docs/features.md) | Detailed feature specifications |
| [docs/roadmap.md](./docs/roadmap.md) | Development phases and milestones |
| [docs/api.md](./docs/api.md) | REST and gRPC API specifications |
| [docs/project-structure.md](./docs/project-structure.md) | Code organization |
|
||||||
|
|
||||||
|
## Technology Stack
|
||||||
|
|
||||||
|
| Component | Technology |
|-----------|------------|
| Backend | Rust (Axum, Tonic, SeaORM) |
| Frontend | React + TypeScript + Vite |
| Database | PostgreSQL 16+ |
| Cache | Redis |
| Message Format | Protocol Buffers (gRPC) |
| Container | Docker |
| Orchestration | Kubernetes (optional) |
|
||||||
|
|
||||||
|
## Development Environment
|
||||||
|
|
||||||
|
This project uses Dev Containers for consistent development:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# All dependencies are pre-installed in the devcontainer
just setup    # Initial setup
just dev      # Start development
|
||||||
|
```
|
||||||
|
|
||||||
|
### Pre-configured Services
|
||||||
|
|
||||||
|
The devcontainer includes:
|
||||||
|
- PostgreSQL database
|
||||||
|
- Redis cache
|
||||||
|
- Nginx instance
|
||||||
|
- Rust toolchain
|
||||||
|
- Node.js/Bun for frontend
|
||||||
|
|
||||||
|
## Key Design Decisions
|
||||||
|
|
||||||
|
1. **Master-Agent Protocol**: Bidirectional gRPC streaming for real-time communication
|
||||||
|
2. **Configuration Management**: Template-based (Handlebars) with versioning
|
||||||
|
3. **Security**: TLS + Shared Secret for agent connections, JWT for API auth
|
||||||
|
4. **Deployment**: Support for Docker sidecar, K8s sidecar, and standalone modes
|
||||||
|
|
||||||
|
## Common Tasks
|
||||||
|
|
||||||
|
### Adding a New API Endpoint
|
||||||
|
|
||||||
|
1. Define route in `crates/nxmesh-master/src/api/v1/`
|
||||||
|
2. Add request/response types to shared models
|
||||||
|
3. Implement handler with proper error handling
|
||||||
|
4. Add tests
|
||||||
|
5. Update OpenAPI documentation
|
||||||
|
|
||||||
|
### Adding a Database Entity
|
||||||
|
|
||||||
|
1. Create migration with `sea-orm-cli migrate generate <name>`
|
||||||
|
2. Define entity in `crates/nxmesh-master/src/db/entities/`
|
||||||
|
3. Add repository in `crates/nxmesh-master/src/db/repositories/`
|
||||||
|
4. Update service layer
|
||||||
|
|
||||||
|
### Adding Agent Functionality
|
||||||
|
|
||||||
|
1. Add module in `crates/nxmesh-agent/src/`
|
||||||
|
2. Update gRPC protocol if needed (`crates/nxmesh-proto/proto/`)
|
||||||
|
3. Implement handler in agent
|
||||||
|
4. Add corresponding master service
|
||||||
|
|
||||||
|
## Testing
|
||||||
|
|
||||||
|
```bash
|
||||||
|
just test              # All tests
just test-unit         # Unit tests only
just test-integration  # Integration tests
|
||||||
|
```
|
||||||
|
|
||||||
|
## Code Style
|
||||||
|
|
||||||
|
- Follow Rust API Guidelines
|
||||||
|
- Use `cargo fmt` and `cargo clippy`
|
||||||
|
- All public APIs must have doc comments
|
||||||
|
- Error types should be descriptive and actionable
|
||||||
|
|
||||||
|
## Questions?
|
||||||
|
|
||||||
|
Refer to the documentation in `docs/` directory or ask the team.
|
||||||
5552
Cargo.lock
generated
File diff suppressed because it is too large
77
Cargo.toml
@@ -1,13 +1,80 @@
|
|||||||
[workspace]
|
[workspace]
|
||||||
members = [
|
members = [
|
||||||
|
"crates/nxmesh-core",
|
||||||
|
"crates/nxmesh-proto",
|
||||||
|
"crates/nxmesh-master",
|
||||||
|
"crates/nxmesh-agent",
|
||||||
|
"crates/nxmesh-cli",
|
||||||
|
"migrations/sea-orm",
|
||||||
]
|
]
|
||||||
|
|
||||||
resolver = "3"
|
resolver = "3"
|
||||||
|
|
||||||
[workspace.lints.clippy]
|
[workspace.package]
|
||||||
module_inception = "allow"
|
version = "0.1.0"
|
||||||
|
edition = "2021"
|
||||||
|
authors = ["NxMesh Team"]
|
||||||
|
license = "GNU General Public License v3.0"
|
||||||
|
repository = "https://github.com/nxmesh/nxmesh"
|
||||||
|
rust-version = "1.84"
|
||||||
|
|
||||||
[workspace.dependencies]
|
[workspace.dependencies]
|
||||||
sea-orm = "2.0.0-rc"
|
# Core dependencies
|
||||||
sea-orm-cli = "2.0.0-rc"
|
tokio = { version = "1", features = ["full"] }
|
||||||
|
serde = { version = "1", features = ["derive"] }
|
||||||
|
serde_json = "1"
|
||||||
|
thiserror = "1"
|
||||||
|
tracing = "0.1"
|
||||||
|
tracing-subscriber = { version = "0.3", features = ["env-filter", "json"] }
|
||||||
|
|
||||||
|
# Web framework
|
||||||
|
axum = "0.7"
|
||||||
|
tower = "0.4"
|
||||||
|
tower-http = { version = "0.5", features = ["trace", "cors", "fs"] }
|
||||||
|
|
||||||
|
# gRPC
|
||||||
|
tonic = "0.11"
|
||||||
|
prost = "0.12"
|
||||||
|
|
||||||
|
# Database
|
||||||
|
sea-orm = { version = "2.0.0-rc", features = ["sqlx-postgres", "runtime-tokio-native-tls"] }
|
||||||
sea-orm-migration = "2.0.0-rc"
|
sea-orm-migration = "2.0.0-rc"
|
||||||
|
|
||||||
|
# Async
|
||||||
|
async-trait = "0.1"
|
||||||
|
futures = "0.3"
|
||||||
|
|
||||||
|
# Configuration
|
||||||
|
toml = "0.8"
|
||||||
|
config = "0.14"
|
||||||
|
|
||||||
|
# HTTP client
|
||||||
|
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "json"] }
|
||||||
|
|
||||||
|
# Crypto
|
||||||
|
sha2 = "0.10"
|
||||||
|
hex = "0.4"
|
||||||
|
argon2 = "0.5"
|
||||||
|
jsonwebtoken = "9"
|
||||||
|
|
||||||
|
# Validation
|
||||||
|
validator = { version = "0.18", features = ["derive"] }
|
||||||
|
|
||||||
|
# Time
|
||||||
|
chrono = { version = "0.4", features = ["serde"] }
|
||||||
|
|
||||||
|
# UUID
|
||||||
|
uuid = { version = "1", features = ["v4", "serde"] }
|
||||||
|
|
||||||
|
# Templating
|
||||||
|
handlebars = "5"
|
||||||
|
|
||||||
|
# CLI
|
||||||
|
clap = { version = "4", features = ["derive"] }
|
||||||
|
|
||||||
|
# Testing
|
||||||
|
tokio-test = "0.4"
|
||||||
|
mockall = "0.12"
|
||||||
|
|
||||||
|
# NxMesh internal
|
||||||
|
nxmesh-core = { path = "crates/nxmesh-core" }
|
||||||
|
nxmesh-proto = { path = "crates/nxmesh-proto" }
|
||||||
|
|||||||
202
README.md
@@ -1,2 +1,202 @@
|
|||||||
# NxMesh
|
# NxMesh - Distributed Nginx Management System
|
||||||
|
|
||||||
|
> **NxMesh** is a modern, scalable, distributed system for managing nginx instances across diverse infrastructure environments. Built with a master-agent architecture inspired by service mesh patterns, NxMesh provides centralized control with local intelligence.
|
||||||
|
|
||||||
|
## 🎯 Project Vision
|
||||||
|
|
||||||
|
NxMesh transforms nginx from a standalone reverse proxy into a **distributed, programmable edge layer**. By adopting a control plane (master) + data plane (agent/sidecar) architecture, NxMesh enables:
|
||||||
|
|
||||||
|
- **Centralized Management**: Control thousands of nginx instances from a single control plane
|
||||||
|
- **Dynamic Configuration**: Real-time configuration updates without restarts or connection drops
|
||||||
|
- **Observability**: Unified metrics, logs, and health status across the entire fleet
|
||||||
|
- **Hybrid Deployment**: Support for Docker, Kubernetes, VMs, and bare metal environments
|
||||||
|
- **High Availability**: Fault-tolerant design with automatic failover and recovery
|
||||||
|
|
||||||
|
## 🏗️ Architecture Overview
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────────────────────────────┐
|
||||||
|
│ CONTROL PLANE (Master) │
|
||||||
|
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
|
||||||
|
│ │ NxMesh Master │ │
|
||||||
|
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
|
||||||
|
│ │ │ API │ │ Config │ │ Cluster │ │ Admin │ │ │
|
||||||
|
│ │ │ Server │ │ Manager │ │ Coordinator │ │ Console │ │ │
|
||||||
|
│ │ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ │
|
||||||
|
│ │ └──────────────────┴──────────────────┴──────────────────┘ │ │
|
||||||
|
│ │ │ │ │
|
||||||
|
│ │ PostgreSQL (State) │ │
|
||||||
|
│ └──────────────────────────────┼─────────────────────────────────────────────┘ │
|
||||||
|
│ │ │
|
||||||
|
│ gRPC/TLS │ WebSocket (Events) │
|
||||||
|
│ ▼ │
|
||||||
|
└─────────────────────────────────────────────────────────────────────────────────┘
|
||||||
|
│
|
||||||
|
┌───────────────────────────┼───────────────────────────┐
|
||||||
|
│ │ │
|
||||||
|
▼ ▼ ▼
|
||||||
|
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
|
||||||
|
│ AGENT 1 │ │ AGENT 2 │ │ AGENT N │
|
||||||
|
│ (Sidecar) │ │ (Standalone) │ │ (K8s Pod) │
|
||||||
|
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
|
||||||
|
│ │ NxMesh │ │ │ │ NxMesh │ │ │ │ NxMesh │ │
|
||||||
|
│ │ Agent │ │ │ │ Agent │ │ │ │ Agent │ │
|
||||||
|
│ └─────┬─────┘ │ │ └─────┬─────┘ │ │ └─────┬─────┘ │
|
||||||
|
│ │ │ │ │ │ │ │ │
|
||||||
|
│ ┌────┴────┐ │ │ ┌────┴────┐ │ │ ┌────┴────┐ │
|
||||||
|
│ │ Nginx │ │ │ │ Nginx │ │ │ │ Nginx │ │
|
||||||
|
│ │ Instance│ │ │ │ Instance│ │ │ │ Instance│ │
|
||||||
|
│ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │
|
||||||
|
└───────────────┘ └───────────────┘ └───────────────┘
|
||||||
|
Docker Compose VM/Bare Metal Kubernetes
|
||||||
|
```
|
||||||
|
|
||||||
|
### Core Components
|
||||||
|
|
||||||
|
| Component | Description | Technology |
|-----------|-------------|------------|
| **Master** | Central control plane - API, embedded Web UI, config distribution | Rust (Axum/gRPC) + Embedded Vite React |
| **Agent** | Local nginx controller - configuration, health checks, metrics | Rust (Tokio) |
| **Database** | Persistent state storage | PostgreSQL |
|
||||||
|
|
||||||
|
## 🚀 Key Features
|
||||||
|
|
||||||
|
### Phase 1: Foundation
|
||||||
|
- [ ] **Master Control Plane**
|
||||||
|
- RESTful API for configuration management
|
||||||
|
- gRPC for agent communication
|
||||||
|
- PostgreSQL persistence
|
||||||
|
- JWT-based authentication
|
||||||
|
|
||||||
|
- [ ] **Agent Sidecar**
|
||||||
|
- Docker deployment mode (sidecar pattern)
|
||||||
|
- Standalone deployment mode
|
||||||
|
- Automatic nginx lifecycle management
|
||||||
|
- Configuration hot-reloading
|
||||||
|
|
||||||
|
- [ ] **Configuration Management**
|
||||||
|
- Virtual host (server block) templating
|
||||||
|
- Upstream pool management
|
||||||
|
- SSL/TLS certificate management
|
||||||
|
- Configuration versioning & rollback
|
||||||
|
|
||||||
|
### Phase 2: Resilience
|
||||||
|
- [ ] **High Availability**
|
||||||
|
- Master clustering with Raft consensus
|
||||||
|
- Agent auto-reconnection with exponential backoff
|
||||||
|
- Configuration drift detection & auto-healing
|
||||||
|
|
||||||
|
- [ ] **Observability**
|
||||||
|
- Real-time metrics collection (Prometheus)
|
||||||
|
- Structured logging (OpenTelemetry)
|
||||||
|
- Health check dashboards
|
||||||
|
- Alert management
|
||||||
|
|
||||||
|
### Phase 3: Advanced
|
||||||
|
- [ ] **Traffic Management**
|
||||||
|
- Dynamic load balancing strategies
|
||||||
|
- Circuit breaker patterns
|
||||||
|
- Rate limiting & WAF rules
|
||||||
|
- A/B testing & canary deployments
|
||||||
|
|
||||||
|
- [ ] **Multi-tenancy**
|
||||||
|
- Organization/workspace isolation
|
||||||
|
- RBAC (Role-Based Access Control)
|
||||||
|
- Resource quotas & limits
|
||||||
|
|
||||||
|
## 📦 Deployment Modes
|
||||||
|
|
||||||
|
### 1. Docker Sidecar (Recommended for Development)
|
||||||
|
```yaml
|
||||||
|
# docker-compose.yml
services:
  nginx:
    image: nginx:alpine

  nxmesh-agent:
    image: nxmesh/agent:latest
    environment:
      - NXMESH_MASTER_URL=wss://master.nxmesh.io:8443
      - NXMESH_AGENT_TOKEN=${AGENT_TOKEN}
    network_mode: service:nginx  # Share network namespace
    pid: service:nginx           # Share PID namespace (for nginx reload)
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2. Kubernetes Sidecar
|
||||||
|
```yaml
|
||||||
|
# deployment.yaml
spec:
  containers:
    - name: nginx
      image: nginx:alpine
    - name: nxmesh-agent
      image: nxmesh/agent:latest
      env:
        - name: NXMESH_MASTER_URL
          value: "wss://master.nxmesh.svc:8443"
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Standalone (VM/Bare Metal)
|
||||||
|
```bash
|
||||||
|
# Install agent
|
||||||
|
curl -fsSL https://get.nxmesh.io | bash
|
||||||
|
|
||||||
|
# Configure and start
|
||||||
|
nxmesh-agent --master-url wss://master.nxmesh.io:8443 --token ${AGENT_TOKEN}
|
||||||
|
```
|
||||||
|
|
||||||
|
## 📋 Quick Start
|
||||||
|
|
||||||
|
### Prerequisites
|
||||||
|
- Docker & Docker Compose
|
||||||
|
- Rust 1.84+ (for development; `resolver = "3"` in `Cargo.toml` requires Cargo 1.84 or newer)
|
||||||
|
- PostgreSQL 16+
|
||||||
|
|
||||||
|
### Development Setup
|
||||||
|
```bash
|
||||||
|
# Clone and setup
|
||||||
|
git clone https://github.com/nxmesh/nxmesh.git
|
||||||
|
cd nxmesh
|
||||||
|
just setup
|
||||||
|
|
||||||
|
# Start development environment
|
||||||
|
just dev
|
||||||
|
|
||||||
|
# Access services
|
||||||
|
# - Web UI: http://localhost:3000
|
||||||
|
# - API: http://localhost:8080
|
||||||
|
# - Nginx: http://localhost:80
|
||||||
|
```
|
||||||
|
|
||||||
|
### Production Deployment
|
||||||
|
```bash
|
||||||
|
# Deploy master
docker run -d \
  -p 8080:8080 \
  -p 8443:8443 \
  -e DATABASE_URL=postgres://... \
  nxmesh/master:latest

# Deploy agent (on nginx host)
docker run -d \
  --network container:nginx \
  -e NXMESH_MASTER_URL=wss://master.example.com:8443 \
  -e NXMESH_AGENT_TOKEN=<token> \
  nxmesh/agent:latest
|
||||||
|
```
|
||||||
|
|
||||||
|
## 📚 Documentation
|
||||||
|
|
||||||
|
| Document | Description |
|----------|-------------|
| [Architecture](./docs/architecture.md) | System design, data flow, component interactions |
| [Features](./docs/features.md) | Detailed feature specifications |
| [Roadmap](./docs/roadmap.md) | Development phases and milestones |
| [API Reference](./docs/api.md) | REST API and gRPC specifications |
| [Deployment](./docs/deployment.md) | Production deployment guides |
|
||||||
|
|
||||||
|
## 📄 License
|
||||||
|
|
||||||
|
NxMesh is licensed under the GNU General Public License v3.0. See [LICENSE](./LICENSE) for details.
|
||||||
|
|
||||||
|
---
|
||||||
1107
docs/api.md
Normal file
File diff suppressed because it is too large
527
docs/architecture.md
Normal file
@@ -0,0 +1,527 @@
|
|||||||
|
# NxMesh Architecture
|
||||||
|
|
||||||
|
## Table of Contents
|
||||||
|
1. [Overview](#overview)
|
||||||
|
2. [System Components](#system-components)
|
||||||
|
3. [Data Flow](#data-flow)
|
||||||
|
4. [Communication Protocols](#communication-protocols)
|
||||||
|
5. [Security Model](#security-model)
|
||||||
|
6. [Deployment Patterns](#deployment-patterns)
|
||||||
|
7. [Failure Handling](#failure-handling)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
NxMesh follows a **Control Plane / Data Plane** architecture pattern, similar to service meshes like Istio or Linkerd, but specifically optimized for nginx management.
|
||||||
|
|
||||||
|
### Design Principles
|
||||||
|
|
||||||
|
1. **Separation of Concerns**: Master handles policy and state; Agent handles execution
|
||||||
|
2. **Eventual Consistency**: Configuration changes propagate asynchronously
|
||||||
|
3. **Local Autonomy**: Agents can operate independently during master outages
|
||||||
|
4. **Zero-Downtime Updates**: Nginx reloads without dropping connections
|
||||||
|
5. **Observability First**: Every action is observable and traceable
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## System Components
|
||||||
|
|
||||||
|
### 1. Master (Control Plane)
|
||||||
|
|
||||||
|
The Master is the brain of the system. It maintains the desired state and coordinates all agents.
|
||||||
|
|
||||||
|
```
|
||||||
|
┌──────────────────────────────────────────────────────────────────┐
|
||||||
|
│ MASTER │
|
||||||
|
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────────────┐ │
|
||||||
|
│ │ API │ │ Config │ │ Event & Agent │ │
|
||||||
|
│ │ Layer │ │ Engine │ │ Coordination │ │
|
||||||
|
│ │ │ │ │ │ │ │
|
||||||
|
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌───────────────────┐ │ │
|
||||||
|
│ │ │ REST │ │ │ │ Template│ │ │ │ Agent Registry │ │ │
|
||||||
|
│ │ │ Handler │ │ │ │ Engine │ │ │ │ (Connections) │ │ │
|
||||||
|
│ │ └─────────┘ │ │ └─────────┘ │ │ └───────────────────┘ │ │
|
||||||
|
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌───────────────────┐ │ │
|
||||||
|
│ │ │ gRPC │ │ │ │ Version │ │ │ │ Event Bus │ │ │
|
||||||
|
│ │ │ Server │ │ │ │ Control │ │ │ │ (Config Dist.) │ │ │
|
||||||
|
│ │ └─────────┘ │ │ └─────────┘ │ │ └───────────────────┘ │ │
|
||||||
|
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌───────────────────┐ │ │
|
||||||
|
│ │ │ WebSocket│ │ │ │ Validator│ │ │ │ Broadcast │ │ │
|
||||||
|
│ │ │ Handler │ │ │ │ │ │ │ │ (Agent Updates) │ │ │
|
||||||
|
│ │ └──────────┘ │ │ └──────────┘ │ │ └───────────────────┘ │ │
|
||||||
|
│ └──────────────┘ └──────────────┘ └─────────────────────────┘ │
|
||||||
|
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
|
||||||
|
│ │ Auth │ │ Storage │ │ Observability │ │
|
||||||
|
│ │ Service │ │ Layer │ │ │ │
|
||||||
|
│ │ │ │ │ │ ┌───────────────────┐ │ │
|
||||||
|
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Metrics │ │ │
|
||||||
|
│ │ │ JWT │ │ │ │ Postgres│ │ │ │ (Prometheus) │ │ │
|
||||||
|
│ │ │ OAuth2 │ │ │ │ (SeaORM)│ │ │ └───────────────────┘ │ │
|
||||||
|
│ │ └─────────┘ │ │ └─────────┘ │ │ ┌───────────────────┐ │ │
|
||||||
|
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Tracing │ │ │
|
||||||
|
│ │ │ Password│ │ │ │ Cache │ │ │ │ (OpenTelemetry) │ │ │
|
||||||
|
│ │ │ Login │ │ │ │ (Redis) │ │ │ └───────────────────┘ │ │
|
||||||
|
│ │ └─────────┘ │ │ └─────────┘ │ │ │ │
|
||||||
|
│ │ ┌─────────┐ │ │ │ │ │ │
|
||||||
|
│ │ │ RBAC │ │ │ │ │ │ │
|
||||||
|
│ │ │ Engine │ │ │ │ │ │ │
|
||||||
|
│ │ └─────────┘ │ │ │ │ │ │
|
||||||
|
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
|
||||||
|
└──────────────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Master Responsibilities
|
||||||
|
|
||||||
|
| Module | Responsibility |
|--------|----------------|
| API Layer | HTTP REST API for external clients (CLI, Web UI, external systems) |
| Config Engine | Template rendering, validation, versioning |
| Event & Agent Coordination | Agent connection management, config event broadcasting |
| Auth Service | Authentication (JWT/OAuth2, Password) and authorization (RBAC) |
| Storage Layer | PostgreSQL for persistent state, Redis for caching |
| Observability | Metrics collection, distributed tracing, structured logging |
|
||||||
|
|
||||||
|
#### Future: High Availability Mode
|
||||||
|
|
||||||
|
For large-scale deployments, the master can be extended with:
|
||||||
|
- **Raft Consensus** for leader election and state replication
|
||||||
|
- **Cluster Manager** for coordinating multiple master instances
|
||||||
|
- This is **not required** for single-organization, self-hosted deployments
|
||||||
|
|
||||||
|
### 2. Agent (Data Plane)
|
||||||
|
|
||||||
|
The Agent is a lightweight sidecar that runs alongside each nginx instance.
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────────────┐
|
||||||
|
│ AGENT │
|
||||||
|
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
|
||||||
|
│ │ Master │ │ Nginx │ │ Health Monitor │ │
|
||||||
|
│ │ Client │ │ Controller │ │ │ │
|
||||||
|
│ │ │ │ │ │ ┌───────────────────┐ │ │
|
||||||
|
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Nginx Health │ │ │
|
||||||
|
│ │ │ gRPC │ │ │ │ Config │ │ │ │ (HTTP checks) │ │ │
|
||||||
|
│ │ │ Client │ │ │ │ Renderer│ │ │ └───────────────────┘ │ │
|
||||||
|
│ │ └─────────┘ │ │ └─────────┘ │ │ ┌───────────────────┐ │ │
|
||||||
|
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ System Metrics │ │ │
|
||||||
|
│ │ │ WebSocket│ │ │ │ Reload │ │ │ │ (CPU/Mem/IO) │ │ │
|
||||||
|
│ │ │ Client │ │ │ │ Manager │ │ │ └───────────────────┘ │ │
|
||||||
|
│ │ └─────────┘ │ │ └─────────┘ │ │ │ │
|
||||||
|
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌───────────────────┐ │ │
|
||||||
|
│ │ │ Reconnect│ │ │ │ Process │ │ │ │ Self-Health │ │ │
|
||||||
|
│ │ │ Handler │ │ │ │ Signal │ │ │ │ (Heartbeat) │ │ │
|
||||||
|
│ │ └─────────┘ │ │ └─────────┘ │ │ └───────────────────┘ │ │
|
||||||
|
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
|
||||||
|
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
|
||||||
|
│ │ Metrics │ │ Local │ │ Watchdog │ │
|
||||||
|
│ │ Exporter │ │ Cache │ │ │ │
|
||||||
|
│ │ │ │ │ │ ┌───────────────────┐ │ │
|
||||||
|
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Config Drift │ │ │
|
||||||
|
│ │ │Prometheus│ │ │ │ Config │ │ │ │ Detection │ │ │
|
||||||
|
│ │ │Endpoint │ │ │ │ State │ │ │ └───────────────────┘ │ │
|
||||||
|
│ │ └─────────┘ │ │ └─────────┘ │ │ ┌───────────────────┐ │ │
|
||||||
|
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ Auto-Recovery │ │ │
|
||||||
|
│ │ │Statsd │ │ │ │ Backup │ │ │ │ (Nginx restart) │ │ │
|
||||||
|
│ │ │Client │ │ │ │ Files │ │ │ └───────────────────┘ │ │
|
||||||
|
│ │ └─────────┘ │ │ └─────────┘ │ │ │ │
|
||||||
|
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
|
||||||
|
└─────────────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Agent Responsibilities
|
||||||
|
|
||||||
|
| Module | Responsibility |
|--------|----------------|
| Master Client | Maintains persistent connection to master (gRPC + WebSocket fallback) |
| Nginx Controller | Generates configs, manages reloads, handles lifecycle |
| Health Monitor | Monitors nginx health, system resources, reports status |
| Metrics Exporter | Prometheus endpoint, statsd client for metrics |
| Local Cache | Caches configs for offline operation, backup/restore |
| Watchdog | Detects config drift, auto-recovery from failures |
|
||||||
|
|
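As a rough illustration of the Watchdog's drift detection, the sketch below compares a fingerprint of the last configuration the agent applied with the content currently on disk. The names `fingerprint`, `check_drift`, and `DriftStatus` are hypothetical and do not come from the NxMesh codebase. `DefaultHasher` is only a standard-library stand-in; a real agent would more likely use a cryptographic digest such as SHA-256.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Result of comparing the applied config with what is on disk.
#[derive(Debug, PartialEq, Eq)]
pub enum DriftStatus {
    InSync,
    Drifted,
}

/// Fingerprint of a rendered config.
/// Assumption: a non-cryptographic std hash is good enough for a sketch.
pub fn fingerprint(config: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    config.hash(&mut hasher);
    hasher.finish()
}

/// Reports drift when the on-disk content no longer matches the
/// fingerprint recorded at apply time.
pub fn check_drift(applied_fingerprint: u64, on_disk: &str) -> DriftStatus {
    if fingerprint(on_disk) == applied_fingerprint {
        DriftStatus::InSync
    } else {
        DriftStatus::Drifted
    }
}
```

On `Drifted`, the agent could re-render the cached configuration and go through the usual validate-then-reload path instead of overwriting the file blindly.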
||||||
|
---
|
||||||
|
|
||||||
|
## Data Flow
|
||||||
|
|
||||||
|
### 1. Configuration Push Flow
|
||||||
|
|
||||||
|
```
|
||||||
|
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
|
||||||
|
│ User │────▶│ API │────▶│ Config │────▶│ Event │────▶│ Agents │
|
||||||
|
│ Action │ │ Server │ │ Engine │ │ Bus │ │ │
|
||||||
|
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
|
||||||
|
│ Nginx │◀────│ Config │◀────│ Template│◀────│ gRPC │◀────│ Agent │
|
||||||
|
│Reloaded│ │Applied │ │ Render │ │ Stream │ │Receive │
|
||||||
|
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
**Flow Description:**
|
||||||
|
1. User creates/updates configuration via API or Web UI
|
||||||
|
2. Master validates and stores configuration in database
|
||||||
|
3. Config Engine determines affected agents
|
||||||
|
4. Event Bus broadcasts configuration change event
|
||||||
|
5. Agents receive event via gRPC streaming
|
||||||
|
6. Agent renders local nginx configuration from templates
|
||||||
|
7. Agent validates new configuration (`nginx -t`)
|
||||||
|
8. Agent applies configuration via graceful reload
|
||||||
|
9. Agent reports status back to master
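
Steps 7 to 9 can be sketched as a small validate-then-reload routine. This is a minimal sketch, not the agent's actual code: `ApplyOutcome`, `apply_config`, and `run_nginx` are hypothetical names, and the validation and reload steps are passed in as closures so the control flow can be exercised without an nginx binary. `run_nginx` shows how the closures might shell out to `nginx -t` and `nginx -s reload`.

```rust
use std::process::Command;

/// Outcome the agent would report back to the master (step 9).
#[derive(Debug, PartialEq, Eq)]
pub enum ApplyOutcome {
    Applied,
    ValidationFailed(String),
    ReloadFailed(String),
}

/// Validates the rendered config first and reloads only if validation passes,
/// so a broken config never reaches the running nginx.
pub fn apply_config<V, R>(rendered: &str, validate: V, reload: R) -> ApplyOutcome
where
    V: FnOnce(&str) -> Result<(), String>,
    R: FnOnce() -> Result<(), String>,
{
    if let Err(reason) = validate(rendered) {
        return ApplyOutcome::ValidationFailed(reason);
    }
    match reload() {
        Ok(()) => ApplyOutcome::Applied,
        Err(reason) => ApplyOutcome::ReloadFailed(reason),
    }
}

/// Hypothetical helper: runs `nginx <args>` and returns stderr on failure.
/// Example arguments: `["-t"]` to test the config, `["-s", "reload"]` to reload.
pub fn run_nginx(args: &[&str]) -> Result<(), String> {
    let output = Command::new("nginx")
        .args(args)
        .output()
        .map_err(|e| e.to_string())?;
    if output.status.success() {
        Ok(())
    } else {
        Err(String::from_utf8_lossy(&output.stderr).into_owned())
    }
}
```

In a real agent the closures would write the rendered file to disk and then call something like `run_nginx(&["-t"])` followed by `run_nginx(&["-s", "reload"])`.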
|
||||||
|
|
||||||
|
### 2. Health Reporting Flow
|
||||||
|
|
||||||
|
```
|
||||||
|
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
|
||||||
|
│ Nginx │────▶│ Agent │────▶│ Master │────▶│ DB │
|
||||||
|
│ Health │ │ Health │ │ API │ │ Store │
|
||||||
|
└────────┘ └────────┘ └────────┘ └────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌────────┐
|
||||||
|
│Prometheus│
|
||||||
|
│ Server │
|
||||||
|
└────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
**Flow Description:**
|
||||||
|
1. Agent periodically checks nginx health (HTTP health endpoint)
|
||||||
|
2. Agent collects system metrics (CPU, memory, connections)
|
||||||
|
3. Agent sends health report to master via gRPC
|
||||||
|
4. Master aggregates and stores in database
|
||||||
|
5. Prometheus scrapes agent metrics endpoint
|
||||||
|
|
||||||
|
### 3. Certificate Management Flow
|
||||||
|
|
||||||
|
```
|
||||||
|
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
|
||||||
|
│ Let's │◀────│ Master │────▶│ Agent │────▶│ Nginx │◀────│ Client │
|
||||||
|
│Encrypt │ │ ACME │ │ Deploy │ │ Serve │ │Request │
|
||||||
|
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
**Flow Description:**
|
||||||
|
1. Master requests certificate from Let's Encrypt (ACME protocol)
|
||||||
|
2. Master distributes certificate to relevant agents
|
||||||
|
3. Agent stores certificate locally (encrypted at rest)
|
||||||
|
4. Agent updates nginx configuration with new certificate
|
||||||
|
5. Nginx serves HTTPS traffic with new certificate
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Communication Protocols
|
||||||
|
|
||||||
|
### Master-Agent Protocol
|
||||||
|
|
||||||
|
NxMesh uses a **bidirectional gRPC stream** as the primary communication channel between master and agents.
|
||||||
|
|
||||||
|
```protobuf
|
||||||
|
// agent.proto
syntax = "proto3";
package nxmesh.agent;

service AgentService {
  // Bidirectional streaming for real-time communication
  rpc Stream(stream AgentMessage) returns (stream MasterMessage);

  // Unary calls for specific operations
  rpc ReportHealth(HealthReport) returns (Ack);
  rpc ReportMetrics(MetricsBatch) returns (Ack);
}

message AgentMessage {
  string agent_id = 1;
  uint64 timestamp = 2;
  oneof payload {
    RegistrationRequest register = 3;
    HealthReport health = 4;
    ConfigStatus config_status = 5;
    MetricsBatch metrics = 6;
    LogBatch logs = 7;
  }
}

message MasterMessage {
  uint64 timestamp = 1;
  oneof payload {
    RegistrationResponse register_response = 2;
    ConfigUpdate config_update = 3;
    Command command = 4;
    Ack ack = 5;
  }
}

message ConfigUpdate {
  string config_id = 1;
  uint64 version = 2;
  repeated VirtualHost virtual_hosts = 3;
  repeated Upstream upstreams = 4;
  map<string, string> ssl_certificates = 5;
}
|
||||||
|
```
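
To make the `oneof payload` in `MasterMessage` more concrete, here is a hand-written sketch of how an agent might dispatch on it. It is not the code that `prost` would generate: `MasterPayload`, `AgentState`, `Action`, and `handle` are hypothetical names, the fields are reduced to the minimum, and the rule that an update whose version is not newer than the latest seen is ignored is an assumption, not something the protocol above specifies.

```rust
/// Simplified mirror of the `MasterMessage.payload` oneof (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
pub enum MasterPayload {
    RegisterResponse { agent_id: String },
    ConfigUpdate { config_id: String, version: u64 },
    Command(String),
    Ack,
}

/// Minimal agent-side state: the newest config version seen so far.
pub struct AgentState {
    pub latest_version: u64,
}

/// What the agent decides to do with an incoming payload.
#[derive(Debug, PartialEq)]
pub enum Action {
    StoreIdentity(String),
    ApplyConfig { config_id: String, version: u64 },
    IgnoreStale(u64),
    RunCommand(String),
    Nothing,
}

pub fn handle(state: &mut AgentState, payload: MasterPayload) -> Action {
    match payload {
        MasterPayload::RegisterResponse { agent_id } => Action::StoreIdentity(agent_id),
        // Assumption: versions increase monotonically, so older or equal ones are skipped.
        MasterPayload::ConfigUpdate { version, .. } if version <= state.latest_version => {
            Action::IgnoreStale(version)
        }
        MasterPayload::ConfigUpdate { config_id, version } => {
            state.latest_version = version;
            Action::ApplyConfig { config_id, version }
        }
        MasterPayload::Command(name) => Action::RunCommand(name),
        MasterPayload::Ack => Action::Nothing,
    }
}
```

Keeping the decision (`Action`) separate from its side effects makes the dispatch logic easy to unit test without a live gRPC stream.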
|
||||||
|
|
||||||
|
### Connection Management
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────────────────┐
|
||||||
|
│ CONNECTION LIFECYCLE │
|
||||||
|
│ │
|
||||||
|
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
|
||||||
|
│ │ INIT │───▶│ CONNECT │───▶│ STREAM │───▶│ READY │ │
|
||||||
|
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
|
||||||
|
│ │ │ │ │
|
||||||
|
│ ▼ ▼ ▼ │
|
||||||
|
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
|
||||||
|
│ │ RETRY │ │RECONNECT│ │ ERROR │ │
|
||||||
|
│ └─────────┘ └─────────┘ └─────────┘ │
|
||||||
|
│ │
|
||||||
|
│ Connection Parameters: │
|
||||||
|
│ - Heartbeat interval: 30s │
|
||||||
|
│ - Reconnect backoff: 1s, 2s, 4s, 8s... (max 60s) │
|
||||||
|
│ - gRPC keepalive: 10s ping, 20s timeout │
|
||||||
|
│ - TLS: Server-side TLS (auto-generated or custom) │
|
||||||
|
│ - Agent auth: Bootstrap token → Shared secret (HMAC) │
|
||||||
|
└─────────────────────────────────────────────────────────────────────┘
|
||||||
|
```
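
The reconnect backoff listed above (1s, 2s, 4s, 8s, capped at 60s) can be computed as follows. This is a minimal sketch; the function name `reconnect_delay` and the 0-based `attempt` counter are assumptions, and a production client would probably add random jitter so that many agents do not reconnect at the same moment.

```rust
use std::time::Duration;

/// Delay before reconnect attempt `attempt` (0-based):
/// 1s, 2s, 4s, 8s, ... doubling each time and capped at 60s.
pub fn reconnect_delay(attempt: u32) -> Duration {
    const MAX_SECS: u64 = 60;
    // `checked_shl` returns None once the shift would exceed the width of u64,
    // in which case the cap applies.
    let secs = 1u64.checked_shl(attempt).unwrap_or(MAX_SECS);
    Duration::from_secs(secs.min(MAX_SECS))
}
```

The agent would reset `attempt` to 0 once the stream reaches the READY state.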
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Security Model
|
||||||
|
|
||||||
|
### Authentication
|
||||||
|
|
||||||
|
| Component | Method | Details |
|-----------|--------|---------|
| Master API | JWT (RS256) | Short-lived access tokens, refresh tokens |
| Master WebSocket | JWT | Same tokens as API |
| Master-Agent gRPC | **TLS + Shared Secret** | Server TLS + bootstrap token → session HMAC |
| Agent Registration | One-time Bootstrap Token | Generated in Master UI, single-use, short expiry |
|
||||||
|
|
||||||
|
### Agent Authentication Flow (TLS + Shared Secret)
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────┐ ┌──────────────┐
|
||||||
|
│ Agent │ │ Master │
|
||||||
|
└──────┬──────┘ └──────┬───────┘
|
||||||
|
│ │
|
||||||
|
│ 1. TLS Handshake (verify server certificate) │
|
||||||
|
│◄───────────────────────────────────────────────►│
|
||||||
|
│ │
|
||||||
|
│ 2. Register with bootstrap_token │
|
||||||
|
│ ── gRPC: RegisterAgent { token } ─────────────▶│
|
||||||
|
│ │
|
||||||
|
│ 3. Receive agent_id + session_key (+ key_id) │
|
||||||
|
│◄────────────────────────────────────────────────│
|
||||||
|
│ [Encrypted over TLS] │
|
||||||
|
│ │
|
||||||
|
│ 4. Subsequent requests: HMAC-signed │
|
||||||
|
│ ── gRPC + Headers: │
|
||||||
|
│ X-Agent-ID: <agent_id> │
|
||||||
|
│ X-Key-ID: <session_key_id> │
|
||||||
|
│ X-Signature: HMAC(request_body, session_key)│
|
||||||
|
│────────────────────────────────────────────────▶│
|
||||||
|
│ │
|
||||||
|
│ 5. Key Rotation (primary/secondary) │
|
||||||
|
│◄═══════════════════════════════════════════════►│
|
||||||
|
```
|
||||||
|
|
||||||
|
**Security Properties:**
|
||||||
|
- **TLS**: Encrypts channel, verifies master identity (server cert)
|
||||||
|
- **Bootstrap Token**: One-time use, time-limited, proves initial identity
|
||||||
|
- **Session Key**: Per-agent secret, used for HMAC request signing
|
||||||
|
- **Key Rotation**: Primary/secondary key design for seamless rotation
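
The primary/secondary rotation listed above could be modeled on the master side roughly as follows. This is a sketch under assumptions: `SessionKey`, `AgentKeyRing`, `secret_for`, and `rotate` are hypothetical names, and the HMAC computation itself is left out because it needs a cryptography crate.

```rust
/// A session key together with the identifier sent in the `X-Key-ID` header.
#[derive(Debug)]
struct SessionKey {
    key_id: String,
    secret: Vec<u8>,
}

/// Hypothetical master-side record of the keys accepted for one agent.
struct AgentKeyRing {
    primary: SessionKey,
    secondary: Option<SessionKey>,
}

impl AgentKeyRing {
    fn new(primary: SessionKey) -> Self {
        Self { primary, secondary: None }
    }

    /// Look up the secret for the key ID named in a request, if still accepted.
    fn secret_for(&self, key_id: &str) -> Option<&[u8]> {
        if self.primary.key_id == key_id {
            return Some(self.primary.secret.as_slice());
        }
        self.secondary
            .as_ref()
            .filter(|k| k.key_id == key_id)
            .map(|k| k.secret.as_slice())
    }

    /// Install a new primary key; the old primary stays valid as secondary,
    /// and whatever was secondary before is dropped.
    fn rotate(&mut self, new_primary: SessionKey) {
        let old = std::mem::replace(&mut self.primary, new_primary);
        self.secondary = Some(old);
    }
}
```

Keeping the previous key as secondary lets requests signed just before a rotation still verify, which is what makes the rotation seamless.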
|
||||||
|
|
||||||
|
### Authorization (RBAC)
|
||||||
|
|
||||||
|
```yaml
# Example RBAC Configuration
roles:
  admin:
    permissions:
      - "*:*"

  operator:
    permissions:
      - "config:read"
      - "config:write"
      - "agent:read"
      - "agent:reload"

  viewer:
    permissions:
      - "config:read"
      - "agent:read"
      - "metrics:read"

# Resource hierarchy
resources:
  - organization
  - workspace
  - agent
  - certificate
  - config (virtual_host, upstream)
```
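A permission check over `resource:action` strings like those in the example could look like the sketch below. The function names are hypothetical, and treating `*` as a wildcard for each half independently (so that `config:*` would also work) is an assumption; the example configuration only shows `*:*`.

```rust
/// Returns true if a granted permission such as "config:read" or "*:*"
/// covers the requested `resource` and `action`.
fn permission_matches(granted: &str, resource: &str, action: &str) -> bool {
    match granted.split_once(':') {
        Some((res, act)) => {
            (res == "*" || res == resource) && (act == "*" || act == action)
        }
        // Malformed entries (no ':' separator) grant nothing.
        None => false,
    }
}

/// Returns true if any permission held by a role covers the request.
fn role_allows(permissions: &[&str], resource: &str, action: &str) -> bool {
    permissions
        .iter()
        .any(|p| permission_matches(p, resource, action))
}
```

With the roles above, `role_allows(&["config:read", "agent:read", "metrics:read"], "agent", "reload")` is false for a viewer, while the same request is allowed for an operator.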
|
||||||
|
|
||||||
|
## Deployment Patterns
|
||||||
|
|
||||||
|
### Pattern 1: Docker Sidecar (Development/Single Host)
|
||||||
|
|
||||||
|
```yaml
# docker-compose.yml
version: '3.8'

services:
  nxmesh-master:
    image: nxmesh/master:latest
    ports:
      - "8080:8080"  # API
      - "8443:8443"  # gRPC
    environment:
      - DATABASE_URL=postgres://...

  nginx-site-a:
    image: nginx:alpine
    volumes:
      - site-a-html:/usr/share/nginx/html

  nxmesh-agent-a:
    image: nxmesh/agent:latest
    network_mode: service:nginx-site-a  # Share network namespace with nginx
    pid: service:nginx-site-a           # Share PID namespace (for nginx reload)
    environment:
      - NXMESH_MASTER_URL=wss://nxmesh-master:8443
      - NXMESH_AGENT_TOKEN=${AGENT_TOKEN_A}
      - NXMESH_DEPLOYMENT_MODE=docker_sidecar
      - NXMESH_NGINX_PID_FILE=/var/run/nginx.pid
```
|
||||||
|
|
||||||
|
**Pros:** Simple, isolated, good for development
|
||||||
|
**Cons:** Docker-only, single host limitation
|
||||||
|
|
||||||
|
### Pattern 2: Kubernetes Sidecar
|
||||||
|
|
||||||
|
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: nginx
          image: nginx:alpine
          volumeMounts:
            - name: nxmesh-config
              mountPath: /etc/nginx/conf.d

        - name: nxmesh-agent
          image: nxmesh/agent:latest
          env:
            - name: NXMESH_MASTER_URL
              value: "wss://nxmesh-master.default.svc:8443"
            - name: NXMESH_AGENT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: nxmesh-agent-token
                  key: token
          volumeMounts:
            - name: nxmesh-config
              mountPath: /etc/nginx/conf.d
      volumes:
        - name: nxmesh-config
          emptyDir: {}
```
|
||||||
|
|
||||||
|
**Pros:** Native K8s integration, auto-scaling, health checks
|
||||||
|
**Cons:** K8s-only, more complex setup
|
||||||
|
|
||||||
|
### Pattern 3: Standalone (VM/Bare Metal)
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────────────┐
|
||||||
|
│ VM / Bare Metal │
|
||||||
|
│ ┌───────────────────────────────────────────────────────────┐ │
|
||||||
|
│ │ Systemd │ │
|
||||||
|
│ │ ┌─────────────────────────────────────────────────────┐ │ │
|
||||||
|
│ │ │ nxmesh-agent.service │ │ │
|
||||||
|
│ │ │ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │ │ │
|
||||||
|
│ │ │ │ Agent │ │ Nginx │ │ Config │ │ │ │
|
||||||
|
│ │ │ │ Process │──│ Process │──│ Files │ │ │ │
|
||||||
|
│ │ │ └──────────────┘ └──────────────┘ └───────────┘ │ │ │
|
||||||
|
│ │ └─────────────────────────────────────────────────────┘ │ │
|
||||||
|
│ └───────────────────────────────────────────────────────────┘ │
|
||||||
|
└─────────────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
**Pros:** Works anywhere, minimal dependencies
|
||||||
|
**Cons:** Manual setup, no container isolation
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Failure Handling
|
||||||
|
|
||||||
|
### Master Failure Scenarios
|
||||||
|
|
||||||
|
| Scenario | Impact | Mitigation |
|
||||||
|
|----------|--------|------------|
|
||||||
|
| Master unreachable | Agents continue with cached config | Agents retry with exponential backoff |
|
||||||
|
| Master crashes | New connections fail, existing continue | External load balancer + health checks (HA: future) |
|
||||||
|
| Database down | Read-only mode for existing configs | Database replication, failover |
|
||||||
|
|
||||||
|
### Agent Failure Scenarios
|
||||||
|
|
||||||
|
| Scenario | Impact | Mitigation |
|
||||||
|
|----------|--------|------------|
|
||||||
|
| Agent crashes | Nginx continues running | Systemd restart, watchdog |
|
||||||
|
| Config validation fails | Previous config kept | Atomic config swap, rollback |
|
||||||
|
| Nginx crashes | Agent restarts nginx | Health checks, auto-restart |
|
||||||
|
| Network partition | Agent operates in "island mode" | Local cache, reconciliation on reconnect |
|
||||||
|
|
||||||
|
### Recovery Procedures
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────────────────┐
|
||||||
|
│ FAILURE RECOVERY FLOW │
|
||||||
|
│ │
|
||||||
|
│ Agent Disconnect │
|
||||||
|
│ │ │
|
||||||
|
│ ▼ │
|
||||||
|
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
|
||||||
|
│ │ Retry │───▶│ Cache │───▶│ Alert │───▶│ Watch │ │
|
||||||
|
│ │ Connect │ │ Config │ │ Master │ │ Dog │ │
|
||||||
|
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
|
||||||
|
│ │ │ │
|
||||||
|
│ ▼ ▼ │
|
||||||
|
│ ┌─────────┐ ┌─────────┐ │
|
||||||
|
│ │Reconnected│ │ Restart │ │
|
||||||
|
│ │ Sync │ │ Nginx │ │
|
||||||
|
│ └─────────┘ └─────────┘ │
|
||||||
|
│ │
|
||||||
|
│ Recovery Strategies: │
|
||||||
|
│ 1. Exponential backoff for reconnection │
|
||||||
|
│ 2. Circuit breaker for failed operations │
|
||||||
|
│ 3. Config checksum verification after reconnect │
|
||||||
|
│ 4. Automatic nginx restart on health check failure │
|
||||||
|
└─────────────────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Technology Stack
|
||||||
|
|
||||||
|
| Layer | Technology | Rationale |
|
||||||
|
|-------|------------|-----------|
|
||||||
|
| **Master Backend** | Rust (Axum) | Performance, safety, async ecosystem |
|
||||||
|
| **Agent** | Rust (Tokio) | Small binary, low memory, fast startup |
|
||||||
|
| **Database** | PostgreSQL | ACID, JSON support, reliability |
|
||||||
|
| **Cache** | Redis | Fast key-value, pub/sub for events |
|
||||||
|
| **Frontend** | React + Vite (embedded) | Static build served by master, fast HMR in dev |
|
||||||
|
| **gRPC** | Tonic | Native Rust implementation |
|
||||||
|
| **ORM** | SeaORM | Async, type-safe, migration support |
|
||||||
|
| **Config Template** | Handlebars | Logic-less, secure templating |
|
||||||
|
| **Metrics** | Prometheus | Industry standard, rich ecosystem |
|
||||||
|
| **Tracing** | OpenTelemetry | Vendor-neutral, future-proof |
|
||||||
814
docs/features.md
Normal file
@@ -0,0 +1,814 @@
|
|||||||
|
# NxMesh Feature Specification
|
||||||
|
|
||||||
|
## Table of Contents
|
||||||
|
1. [Core Features](#core-features)
|
||||||
|
2. [Master Features](#master-features)
|
||||||
|
3. [Agent Features](#agent-features)
|
||||||
|
4. [Configuration Management](#configuration-management)
|
||||||
|
5. [Observability](#observability)
|
||||||
|
6. [Security Features](#security-features)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Core Features
|
||||||
|
|
||||||
|
### CF-001: Multi-tenancy with Organizations and Workspaces
|
||||||
|
|
||||||
|
**Description**: Support for multiple organizations with isolated workspaces within each organization.
|
||||||
|
|
||||||
|
**Requirements**:
|
||||||
|
- Organizations are top-level resource containers
|
||||||
|
- Each organization can have multiple workspaces
|
||||||
|
- Resources (agents, configs, certificates) are scoped to a workspace
|
||||||
|
- Cross-workspace visibility is configurable
|
||||||
|
|
||||||
|
**Data Model**:
|
||||||
|
```rust
|
||||||
|
struct Organization {
|
||||||
|
id: Uuid,
|
||||||
|
name: String,
|
||||||
|
slug: String, // URL-friendly identifier
|
||||||
|
created_at: DateTime,
|
||||||
|
settings: OrganizationSettings,
|
||||||
|
}
|
||||||
|
|
||||||
|
struct Workspace {
|
||||||
|
id: Uuid,
|
||||||
|
organization_id: Uuid,
|
||||||
|
name: String,
|
||||||
|
slug: String,
|
||||||
|
created_at: DateTime,
|
||||||
|
}
|
||||||
|
```
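The `slug` fields are described only as URL-friendly identifiers. One possible derivation from `name` is sketched below; the exact rules (lowercase ASCII, hyphen-separated, other characters dropped) and the function name `slugify` are assumptions, not part of the specification.

```rust
/// Derive a URL-friendly slug: lowercase ASCII letters and digits, with runs
/// of any other characters collapsed into single hyphens. Leading and
/// trailing separators are dropped.
fn slugify(name: &str) -> String {
    let mut slug = String::new();
    let mut pending_hyphen = false;
    for ch in name.chars() {
        if ch.is_ascii_alphanumeric() {
            // Emit at most one hyphen between two alphanumeric runs.
            if pending_hyphen && !slug.is_empty() {
                slug.push('-');
            }
            pending_hyphen = false;
            slug.push(ch.to_ascii_lowercase());
        } else {
            pending_hyphen = true;
        }
    }
    slug
}
```

A real implementation would also need to reject empty slugs and enforce uniqueness, within the organization for workspaces and globally for organizations.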
|
||||||
|
|
||||||
|
**API Endpoints**:
|
||||||
|
- `GET /api/v1/organizations` - List organizations
|
||||||
|
- `POST /api/v1/organizations` - Create organization
|
||||||
|
- `GET /api/v1/organizations/{id}/workspaces` - List workspaces
|
||||||
|
- `POST /api/v1/organizations/{id}/workspaces` - Create workspace
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### CF-002: Agent Registration and Lifecycle Management
|
||||||
|
|
||||||
|
**Description**: Agents must register with the master before receiving configurations.
|
||||||
|
|
||||||
|
**Registration Flow**:
|
||||||
|
1. Administrator generates bootstrap token in Master UI
|
||||||
|
2. Token is provided to agent via environment variable or config file
|
||||||
|
3. Agent establishes TLS connection to master (verifies server certificate)
|
||||||
|
4. Agent sends bootstrap token for registration
|
||||||
|
5. Master validates token and establishes shared secret:
|
||||||
|
- Master generates session_key (per-agent) + key_id
|
||||||
|
- Session key used for HMAC request signing
|
||||||
|
- Primary/secondary key design for rotation
|
||||||
|
|
||||||
|
**Agent States**:
|
||||||
|
```rust
|
||||||
|
enum AgentState {
|
||||||
|
Pending, // Registered but never connected
|
||||||
|
Online, // Connected and healthy
|
||||||
|
Offline, // Disconnected
|
||||||
|
Degraded, // Connected but health checks failing
|
||||||
|
Maintenance, // Manually placed in maintenance mode
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Agent Metadata**:
|
||||||
|
```rust
|
||||||
|
struct Agent {
|
||||||
|
id: Uuid,
|
||||||
|
workspace_id: Uuid,
|
||||||
|
name: String,
|
||||||
|
hostname: String,
|
||||||
|
ip_address: String,
|
||||||
|
version: String,
|
||||||
|
state: AgentState,
|
||||||
|
deployment_mode: DeploymentMode, // DockerSidecar, K8sSidecar, Standalone
|
||||||
|
last_seen_at: DateTime,
|
||||||
|
capabilities: Vec<String>, // e.g., ["http3", "websocket", "rate_limiting"]
|
||||||
|
labels: HashMap<String, String>, // e.g., {"env": "prod", "region": "us-east"}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**API Endpoints**:
|
||||||
|
- `POST /api/v1/agents/register` - Register new agent
|
||||||
|
- `GET /api/v1/agents` - List agents
|
||||||
|
- `GET /api/v1/agents/{id}` - Get agent details
|
||||||
|
- `POST /api/v1/agents/{id}/tokens` - Generate registration token
|
||||||
|
- `DELETE /api/v1/agents/{id}` - Deregister agent
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### CF-003: Real-time Configuration Distribution
|
||||||
|
|
||||||
|
**Description**: Push configuration changes to agents in real-time with delivery guarantees.
|
||||||
|
|
||||||
|
**Requirements**:
|
||||||
|
- Config changes propagate to all affected agents within 5 seconds
|
||||||
|
- Support for targeted updates (specific agents or groups)
|
||||||
|
- Config versioning with rollback capability
|
||||||
|
- Delivery confirmation from agents
|
||||||
|
|
||||||
|
**Configuration Scope**:
|
||||||
|
```rust
|
||||||
|
enum ConfigScope {
|
||||||
|
Global, // All agents
|
||||||
|
Workspace, // All agents in workspace
|
||||||
|
AgentGroup(String), // Agents with specific label selector
|
||||||
|
Agent(Uuid), // Single agent
|
||||||
|
}
|
||||||
|
```
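Deciding whether a scoped config applies to a given agent might be sketched as follows. Several things here are assumptions made for illustration: `u128` stands in for `Uuid` to avoid external crates, the label selector syntax is a single `key=value` pair, `AgentGroup` matches on labels alone, and `AgentInfo` and `scope_applies` are hypothetical names.

```rust
use std::collections::HashMap;

/// Stand-in for `Uuid` so the sketch needs no external crates.
type Id = u128;

enum ConfigScope {
    Global,             // All agents
    Workspace,          // All agents in the config's workspace
    AgentGroup(String), // Agents with specific label selector
    Agent(Id),          // Single agent
}

/// The subset of agent metadata needed for scope matching.
struct AgentInfo {
    id: Id,
    workspace_id: Id,
    labels: HashMap<String, String>,
}

/// Assumed selector syntax: a single `key=value` pair, e.g. "env=prod".
fn selector_matches(selector: &str, labels: &HashMap<String, String>) -> bool {
    match selector.split_once('=') {
        Some((key, value)) => labels.get(key).map(String::as_str) == Some(value),
        None => false,
    }
}

/// True if a config owned by `config_workspace` with the given scope
/// should be delivered to `agent`.
fn scope_applies(scope: &ConfigScope, config_workspace: Id, agent: &AgentInfo) -> bool {
    match scope {
        ConfigScope::Global => true,
        ConfigScope::Workspace => agent.workspace_id == config_workspace,
        ConfigScope::AgentGroup(selector) => selector_matches(selector, &agent.labels),
        ConfigScope::Agent(id) => agent.id == *id,
    }
}
```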
|
||||||
|
|
||||||
|
**Delivery Guarantees**:
|
||||||
|
- At-least-once delivery
|
||||||
|
- Automatic retry with exponential backoff
|
||||||
|
- Config checksum verification
|
||||||
|
- Offline agents receive updates on reconnection
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Master Features
|
||||||
|
|
||||||
|
### MF-001: RESTful API
|
||||||
|
|
||||||
|
**Description**: Comprehensive REST API for all operations.
|
||||||
|
|
||||||
|
**Base URL**: `/api/v1`
|
||||||
|
|
||||||
|
**Resource Endpoints**:
|
||||||
|
|
||||||
|
| Resource | Endpoints |
|
||||||
|
|----------|-----------|
|
||||||
|
| Organizations | GET, POST, PATCH, DELETE `/organizations` |
|
||||||
|
| Workspaces | GET, POST, PATCH, DELETE `/workspaces` |
|
||||||
|
| Agents | GET, POST, PATCH, DELETE `/agents` |
|
||||||
|
| VirtualHosts | GET, POST, PATCH, DELETE `/virtual-hosts` |
|
||||||
|
| Upstreams | GET, POST, PATCH, DELETE `/upstreams` |
|
||||||
|
| Certificates | GET, POST, DELETE `/certificates` |
|
||||||
|
| AccessLogs | GET `/access-logs` |
|
||||||
|
| Metrics | GET `/metrics` |
|
||||||
|
|
||||||
|
**Response Format**:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"data": { ... },
|
||||||
|
"meta": {
|
||||||
|
"page": 1,
|
||||||
|
"per_page": 20,
|
||||||
|
"total": 100
|
||||||
|
},
|
||||||
|
"links": {
|
||||||
|
"self": "/api/v1/agents?page=1",
|
||||||
|
"next": "/api/v1/agents?page=2",
|
||||||
|
"prev": null
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
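The `links` object in the response can be derived from the `meta` values. The sketch below shows one way to do it; `PageLinks` and `page_links` are hypothetical names, and 1-based page numbering is inferred from the example rather than stated.

```rust
/// Pagination links in the shape of the `links` object above.
#[derive(Debug, PartialEq)]
struct PageLinks {
    this: String, // serialized as "self" (a reserved word in Rust)
    next: Option<String>,
    prev: Option<String>,
}

/// Build links for a 1-based `page` of a collection with `total` items.
fn page_links(base: &str, page: u64, per_page: u64, total: u64) -> PageLinks {
    // Ceiling division; an empty collection still has one (empty) page.
    let last_page = if per_page == 0 {
        1
    } else {
        ((total + per_page - 1) / per_page).max(1)
    };
    let link = |p: u64| format!("{base}?page={p}");
    PageLinks {
        this: link(page),
        next: (page < last_page).then(|| link(page + 1)),
        prev: (page > 1).then(|| link(page - 1)),
    }
}
```

For the example response (`page: 1`, `per_page: 20`, `total: 100`), this yields a `next` link to page 2 and no `prev` link.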
|
||||||
|
|
||||||
|
**Error Format**:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"error": {
|
||||||
|
"code": "VALIDATION_ERROR",
|
||||||
|
"message": "Invalid configuration",
|
||||||
|
"details": [
|
||||||
|
{"field": "server_name", "message": "Invalid domain format"}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### MF-002: Web-based Admin Console (Embedded)
|
||||||
|
|
||||||
|
**Description**: Modern web UI for managing the entire system. Built with React + Vite and served as static files embedded directly in the master binary.
|
||||||
|
|
||||||
|
**Pages**:
|
||||||
|
|
||||||
|
| Page | Features |
|
||||||
|
|------|----------|
|
||||||
|
| Dashboard | Agent status, recent events, traffic overview |
|
||||||
|
| Agents | List, detail view, logs, metrics graphs |
|
||||||
|
| Configurations | Virtual host editor, upstream management |
|
||||||
|
| Certificates | SSL certificate list, expiration alerts |
|
||||||
|
| Access Control | Users, roles, permissions management |
|
||||||
|
| Settings | Organization settings, integrations |
|
||||||
|
|
||||||
|
**Key UI Features**:
|
||||||
|
- Real-time updates via WebSocket
|
||||||
|
- Monaco editor for nginx configuration
|
||||||
|
- Visual topology view (agent connections)
|
||||||
|
- Dark/light mode support
|
||||||
|
- Responsive design
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### MF-003: Configuration Template Engine
|
||||||
|
|
||||||
|
**Description**: Templating system for generating nginx configurations.
|
||||||
|
|
||||||
|
**Template Variables**:
|
||||||
|
```handlebars
|
||||||
|
# Example virtual host template
|
||||||
|
server {
|
||||||
|
listen {{port}} {{#if ssl}}ssl{{/if}} {{#if http2}}http2{{/if}};
|
||||||
|
server_name {{server_name}};
|
||||||
|
|
||||||
|
{{#if ssl}}
|
||||||
|
ssl_certificate {{ssl_certificate_path}};
|
||||||
|
ssl_certificate_key {{ssl_certificate_key_path}};
|
||||||
|
{{/if}}
|
||||||
|
|
||||||
|
location {{location_path}} {
|
||||||
|
proxy_pass http://{{upstream_name}};
|
||||||
|
proxy_set_header Host $host;
|
||||||
|
proxy_set_header X-Real-IP $remote_addr;
|
||||||
|
|
||||||
|
{{#each custom_headers}}
|
||||||
|
add_header {{name}} "{{value}}";
|
||||||
|
{{/each}}
|
||||||
|
|
||||||
|
{{#if rate_limiting}}
|
||||||
|
limit_req zone={{rate_limit_zone}} burst={{rate_limit_burst}};
|
||||||
|
{{/if}}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Built-in Templates**:
|
||||||
|
- `default` - Standard reverse proxy
|
||||||
|
- `spa` - Single Page Application (with fallback to index.html)
|
||||||
|
- `api` - API gateway with rate limiting
|
||||||
|
- `static` - Static file serving with caching
|
||||||
|
- `websocket` - WebSocket proxy with connection upgrades
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### MF-004: Certificate Management (ACME)
|
||||||
|
|
||||||
|
**Description**: Automatic SSL/TLS certificate provisioning via Let's Encrypt.
|
||||||
|
|
||||||
|
**Features**:
|
||||||
|
- ACME v2 protocol support
|
||||||
|
- HTTP-01 and DNS-01 challenges
|
||||||
|
- Automatic renewal (30 days before expiry)
|
||||||
|
- Wildcard certificate support (DNS-01)
|
||||||
|
- Certificate monitoring and alerts
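
The renewal rule (renew 30 days before expiry) reduces to a small time comparison. The sketch below uses `std::time::SystemTime` in place of the `DateTime` type used elsewhere in this document, and the name `needs_renewal` is an assumption.

```rust
use std::time::{Duration, SystemTime};

/// Renewal window from the feature list: 30 days before expiry.
const RENEW_BEFORE: Duration = Duration::from_secs(30 * 24 * 60 * 60);

/// True if an auto-renewing certificate expiring at `expires_at` should be
/// renewed at time `now`. Certificates that have already expired qualify too.
fn needs_renewal(expires_at: SystemTime, now: SystemTime, auto_renew: bool) -> bool {
    if !auto_renew {
        return false;
    }
    match expires_at.duration_since(now) {
        Ok(remaining) => remaining <= RENEW_BEFORE,
        // duration_since fails when expires_at is earlier than now.
        Err(_) => true,
    }
}
```

A periodic task on the master could run this check over all certificates and enqueue ACME orders for those that return true.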
|
||||||
|
|
||||||
|
**Certificate Entity**:
|
||||||
|
```rust
|
||||||
|
struct Certificate {
|
||||||
|
id: Uuid,
|
||||||
|
workspace_id: Uuid,
|
||||||
|
domain: String,
|
||||||
|
is_wildcard: bool,
|
||||||
|
provider: CertificateProvider, // LetsEncrypt, Custom
|
||||||
|
status: CertificateStatus, // Pending, Active, Expired, Error
|
||||||
|
issued_at: DateTime,
|
||||||
|
expires_at: DateTime,
|
||||||
|
auto_renew: bool,
|
||||||
|
certificate_pem: Option<String>, // Encrypted at rest
|
||||||
|
private_key_pem: Option<String>, // Encrypted at rest
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Agent Features
|
||||||
|
|
||||||
|
### AF-001: Nginx Lifecycle Management
|
||||||
|
|
||||||
|
**Description**: Agent manages nginx process lifecycle based on deployment mode.
|
||||||
|
|
||||||
|
**Docker Sidecar Mode**:
|
||||||
|
- Shares PID namespace with nginx container (via `pid: service:nginx`)
|
||||||
|
- Directly signals nginx process for reload/restart
|
||||||
|
- Monitors nginx via health checks
|
||||||
|
|
||||||
|
**Standalone Mode**:
|
||||||
|
- Direct process management (signals to PID from file)
|
||||||
|
- systemd integration (optional, for service management)
|
||||||
|
- PID file monitoring
|
||||||
|
|
||||||
|
**Lifecycle Actions**:
|
||||||
|
- `start` - Start nginx
|
||||||
|
- `stop` - Graceful shutdown
|
||||||
|
- `reload` - Hot reload configuration
|
||||||
|
- `restart` - Full restart
|
||||||
|
- `test` - Validate configuration
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### AF-002: Configuration Rendering and Application
|
||||||
|
|
||||||
|
**Description**: Agent renders nginx configs from master templates and applies them using atomic symlink swaps for zero-downtime updates.
|
||||||
|
|
||||||
|
**Config Directory Structure**:
|
||||||
|
```
|
||||||
|
/etc/nginx/
|
||||||
|
├── nginx.conf # Contains: include /etc/nginx/conf.d/current/*.conf
|
||||||
|
├── conf.d/
|
||||||
|
│ ├── current -> ./20260302143000/ # Symlink to active deployment
|
||||||
|
│ ├── 20260302143000/ # Active config (timestamped)
|
||||||
|
│ │ ├── default.conf
|
||||||
|
│ │ └── upstream.conf
|
||||||
|
│ ├── 20260302141500/ # Previous deployment (for rollback)
|
||||||
|
│ │ ├── default.conf
|
||||||
|
│ │ └── upstream.conf
|
||||||
|
│ └── 20260302140000/ # Older deployment (cleanup candidate)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Config Rendering Flow**:
|
||||||
|
1. Receive ConfigUpdate from master
|
||||||
|
2. Create new deployment folder: `./conf.d/<timestamp>/`
|
||||||
|
3. Render nginx config files into timestamped folder
|
||||||
|
4. **Validate** new config: `nginx -t -c /etc/nginx/conf.d/<timestamp>/nginx.conf`
|
||||||
|
5. If validation passes, **atomically update symlink**: `current` → `<timestamp>/`
|
||||||
|
6. Execute graceful nginx reload
|
||||||
|
7. Verify reload success (health check)
|
||||||
|
8. Report status to master
|
||||||
|
9. Cleanup old deployments (keep N recent versions)
|
||||||
|
|
||||||
|
**Atomic Config Swap**:
|
||||||
|
```rust
async fn apply_config(&self, config: ConfigUpdate) -> Result<()> {
    let timestamp = generate_timestamp();
    let deploy_dir = self.conf_d_path.join(&timestamp);
    let symlink_path = self.conf_d_path.join("current");

    // 1. Render config to new timestamped directory
    self.render_config(&config, &deploy_dir).await?;

    // 2. Validate BEFORE switching symlink (point to new folder directly)
    self.validate_config(&deploy_dir).await?;

    // 3. Atomic symlink swap (Unix: symlink + rename)
    let temp_link = self.conf_d_path.join("current.tmp");
    tokio::fs::symlink(&deploy_dir, &temp_link).await?;
    tokio::fs::rename(&temp_link, &symlink_path).await?; // Atomic operation

    // 4. Reload nginx (picks up new symlink target)
    self.reload_nginx().await?;

    // 5. Verify and cleanup
    self.verify_health().await?;
    self.cleanup_old_deployments(5).await?; // Keep last 5 versions

    self.report_success(config.id, timestamp).await;
    Ok(())
}
```
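The cleanup step keeps only the most recent deployments. The selection logic behind `cleanup_old_deployments` might look like the sketch below, written as a pure function over directory names so that it can be tested without touching the filesystem. The name `deployments_to_remove` and the rule that the active deployment is never removed are assumptions.

```rust
/// Given deployment directory names (timestamps formatted `YYYYMMDDHHMMSS`),
/// return the ones to delete so that only the `keep` newest remain.
/// The directory that `current` points at is never returned, even if it is old
/// (for example after a rollback).
fn deployments_to_remove(mut names: Vec<String>, current: &str, keep: usize) -> Vec<String> {
    // Fixed-width timestamps sort chronologically as plain strings.
    names.sort();
    names.reverse(); // newest first
    names
        .into_iter()
        .skip(keep)
        .filter(|name| name.as_str() != current)
        .collect()
}
```

With the directory tree shown earlier and `keep = 2`, only `20260302140000` would be selected for removal.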
|
||||||
|
|
||||||
|
**Rollback Strategy**:
|
||||||
|
```rust
async fn rollback(&self, target_timestamp: &str) -> Result<()> {
    let target_dir = self.conf_d_path.join(target_timestamp);
    let symlink_path = self.conf_d_path.join("current");

    // Verify target exists
    if !target_dir.exists() {
        return Err(Error::RollbackTargetNotFound);
    }

    // Atomic symlink swap back to previous deployment
    let temp_link = self.conf_d_path.join("current.tmp");
    tokio::fs::symlink(&target_dir, &temp_link).await?;
    tokio::fs::rename(&temp_link, &symlink_path).await?;

    // Reload nginx
    self.reload_nginx().await?;
    Ok(())
}
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### AF-003: Health Monitoring and Reporting
|
||||||
|
|
||||||
|
**Description**: Continuous health monitoring of nginx and the host system.
|
||||||
|
|
||||||
|
**Health Checks**:
|
||||||
|
- **Nginx Health**: HTTP request to nginx health endpoint
|
||||||
|
- **Configuration Health**: Verify current config matches expected
|
||||||
|
- **Resource Health**: CPU, memory, disk usage
|
||||||
|
- **Connection Health**: Active connections, request rate
|
||||||
|
|
||||||
|
**Health Report Structure**:
|
||||||
|
```rust
|
||||||
|
struct HealthReport {
|
||||||
|
agent_id: Uuid,
|
||||||
|
timestamp: DateTime,
|
||||||
|
nginx_status: NginxStatus,
|
||||||
|
system_metrics: SystemMetrics,
|
||||||
|
config_checksum: String,
|
||||||
|
alerts: Vec<Alert>,
|
||||||
|
}
|
||||||
|
|
||||||
|
struct NginxStatus {
|
||||||
|
is_running: bool,
|
||||||
|
pid: Option<u32>,
|
||||||
|
uptime_seconds: u64,
|
||||||
|
active_connections: u32,
|
||||||
|
requests_per_second: f64,
|
||||||
|
}
|
||||||
|
|
||||||
|
struct SystemMetrics {
|
||||||
|
cpu_percent: f64,
|
||||||
|
memory_used_mb: u64,
|
||||||
|
memory_total_mb: u64,
|
||||||
|
disk_used_gb: u64,
|
||||||
|
disk_total_gb: u64,
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Reporting Interval**: Configurable (default: 30 seconds)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### AF-004: Metrics Collection and Export
|
||||||
|
|
||||||
|
**Description**: Collect and expose metrics in Prometheus format.
|
||||||
|
|
||||||
|
**Metrics Endpoint**: `GET /metrics` (on agent)
|
||||||
|
|
||||||
|
**Built-in Metrics**:
|
||||||
|
```
|
||||||
|
# Nginx metrics (parsed from stub_status)
|
||||||
|
nxmesh_nginx_connections_active{agent_id="..."} 42
|
||||||
|
nxmesh_nginx_connections_reading{agent_id="..."} 5
|
||||||
|
nxmesh_nginx_connections_writing{agent_id="..."} 30
|
||||||
|
nxmesh_nginx_connections_waiting{agent_id="..."} 7
|
||||||
|
nxmesh_nginx_requests_total{agent_id="..."} 1234567
|
||||||
|
|
||||||
|
# Agent metrics
|
||||||
|
nxmesh_agent_uptime_seconds{agent_id="..."} 86400
|
||||||
|
nxmesh_agent_master_connection_status{agent_id="..."} 1
|
||||||
|
nxmesh_agent_config_version{agent_id="...",version="123"} 1
|
||||||
|
|
||||||
|
# System metrics
|
||||||
|
nxmesh_system_cpu_percent{agent_id="..."} 25.5
|
||||||
|
nxmesh_system_memory_used_bytes{agent_id="..."} 1073741824
|
||||||
|
nxmesh_system_disk_used_bytes{agent_id="..."} 53687091200
|
||||||
|
```
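The nginx metrics above are described as parsed from `stub_status`. A minimal parser for that page's four-line text format, plus rendering into the metric names listed above, could look like this. The type and function names are hypothetical, and error handling is reduced to returning `None` on any unexpected input.

```rust
/// Counters exposed by nginx's `stub_status` page.
#[derive(Debug, PartialEq)]
struct StubStatus {
    active: u64,
    requests: u64,
    reading: u64,
    writing: u64,
    waiting: u64,
}

fn parse_stub_status(body: &str) -> Option<StubStatus> {
    let mut lines = body.lines().map(str::trim);

    // Line 1: "Active connections: 42"
    let first = lines.next()?.strip_prefix("Active connections:")?;
    let active: u64 = first.trim().parse().ok()?;

    // Line 2 is the header "server accepts handled requests";
    // line 3 holds the three counters in that order.
    lines.next()?;
    let requests: u64 = lines.next()?.split_whitespace().nth(2)?.parse().ok()?;

    // Line 4: "Reading: 5 Writing: 30 Waiting: 7"
    let f: Vec<&str> = lines.next()?.split_whitespace().collect();
    if f.len() != 6 || f[0] != "Reading:" || f[2] != "Writing:" || f[4] != "Waiting:" {
        return None;
    }
    Some(StubStatus {
        active,
        requests,
        reading: f[1].parse().ok()?,
        writing: f[3].parse().ok()?,
        waiting: f[5].parse().ok()?,
    })
}

/// Render the counters in Prometheus text format.
fn render_metrics(s: &StubStatus, agent_id: &str) -> String {
    let samples = [
        ("nxmesh_nginx_connections_active", s.active),
        ("nxmesh_nginx_connections_reading", s.reading),
        ("nxmesh_nginx_connections_writing", s.writing),
        ("nxmesh_nginx_connections_waiting", s.waiting),
        ("nxmesh_nginx_requests_total", s.requests),
    ];
    samples
        .iter()
        .map(|(name, value)| format!("{name}{{agent_id=\"{agent_id}\"}} {value}\n"))
        .collect()
}
```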
|
||||||
|
|
||||||
|
**Custom Metrics**: Agents can collect custom metrics from nginx access logs
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### AF-005: Offline Operation and Recovery
|
||||||
|
|
||||||
|
**Description**: Agent can operate independently when master is unreachable.
|
||||||
|
|
||||||
|
**Offline Capabilities**:
|
||||||
|
- Continue serving traffic with cached configuration
|
||||||
|
- Local health monitoring continues
|
||||||
|
- Metrics are buffered for later transmission
|
||||||
|
- Automatic reconnection attempts
|
||||||
|
|
||||||
|
**Recovery Flow**:
|
||||||
|
1. Detect disconnection from master
|
||||||
|
2. Enter "offline mode"
|
||||||
|
3. Continue operating with cached config
|
||||||
|
4. Buffer metrics and logs
|
||||||
|
5. Attempt reconnection with exponential backoff
|
||||||
|
6. On reconnection:
|
||||||
|
- Sync configuration (compare checksums)
|
||||||
|
- Transmit buffered metrics
|
||||||
|
- Resume normal operation
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Configuration Management
|
||||||
|
|
||||||
|
### CM-001: Virtual Host Configuration
|
||||||
|
|
||||||
|
**Description**: Define nginx server blocks (virtual hosts) via API/UI.
|
||||||
|
|
||||||
|
**VirtualHost Entity**:
|
||||||
|
```rust
|
||||||
|
struct VirtualHost {
|
||||||
|
id: Uuid,
|
||||||
|
workspace_id: Uuid,
|
||||||
|
name: String, // Human-readable name
|
||||||
|
server_name: String, // Domain name(s), comma-separated
|
||||||
|
listen_port: u16, // Usually 80 or 443
|
||||||
|
ssl_enabled: bool,
|
||||||
|
ssl_certificate_id: Option<Uuid>,
|
||||||
|
|
||||||
|
// Routing configuration
|
||||||
|
locations: Vec<Location>,
|
||||||
|
|
||||||
|
// Advanced settings
|
||||||
|
http2_enabled: bool,
|
||||||
|
http3_enabled: bool,
|
||||||
|
gzip_enabled: bool,
|
||||||
|
rate_limiting: Option<RateLimitConfig>,
|
||||||
|
|
||||||
|
// Target agents
|
||||||
|
target_agents: AgentSelector,
|
||||||
|
}
|
||||||
|
|
||||||
|
struct Location {
|
||||||
|
path: String, // e.g., "/api" or "~ \.php$"
|
||||||
|
proxy_pass: Option<String>, // e.g., "http://backend"
|
||||||
|
upstream_id: Option<Uuid>,
|
||||||
|
root: Option<String>, // For static files
|
||||||
|
index: Option<String>, // e.g., "index.html"
|
||||||
|
custom_headers: Vec<Header>,
|
||||||
|
rewrite_rules: Vec<RewriteRule>,
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Validation Rules**:
|
||||||
|
- `server_name` must be valid domain(s)
|
||||||
|
- `listen_port` must be 1-65535
|
||||||
|
- SSL certificate must exist if `ssl_enabled` is true
|
||||||
|
- At least one location must be defined
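
The validation rules above could be checked along these lines. The sketch works on a simplified stand-in for `VirtualHost` (hypothetical `VirtualHostInput`), and the domain check is a rough hostname syntax test chosen for illustration, not a definition of what the master accepts.

```rust
#[derive(Debug, PartialEq)]
enum ValidationError {
    InvalidServerName(String),
    InvalidPort,
    MissingCertificate,
    NoLocations,
}

/// Simplified stand-in for the `VirtualHost` entity: only validated fields.
struct VirtualHostInput<'a> {
    server_name: &'a str, // comma-separated, as in the entity above
    listen_port: u16,
    ssl_enabled: bool,
    has_certificate: bool,
    location_count: usize,
}

/// Rough hostname syntax check; allows a leading wildcard label.
fn is_valid_domain(name: &str) -> bool {
    let name = name.strip_prefix("*.").unwrap_or(name);
    !name.is_empty()
        && name.len() <= 253
        && name.split('.').all(|label| {
            !label.is_empty()
                && label.len() <= 63
                && !label.starts_with('-')
                && !label.ends_with('-')
                && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
        })
}

/// Collect every rule violation instead of stopping at the first one.
fn validate(vh: &VirtualHostInput) -> Vec<ValidationError> {
    let mut errors = Vec::new();
    for name in vh.server_name.split(',').map(str::trim) {
        if !is_valid_domain(name) {
            errors.push(ValidationError::InvalidServerName(name.to_string()));
        }
    }
    // u16 already caps the port at 65535, so only 0 needs rejecting.
    if vh.listen_port == 0 {
        errors.push(ValidationError::InvalidPort);
    }
    if vh.ssl_enabled && !vh.has_certificate {
        errors.push(ValidationError::MissingCertificate);
    }
    if vh.location_count == 0 {
        errors.push(ValidationError::NoLocations);
    }
    errors
}
```

Returning all violations at once fits the `details` array of the API error format, which carries one entry per invalid field.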
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### CM-002: Upstream Configuration
|
||||||
|
|
||||||
|
**Description**: Define backend server pools for load balancing.
|
||||||
|
|
||||||
|
**Upstream Entity**:
|
||||||
|
```rust
|
||||||
|
struct Upstream {
|
||||||
|
id: Uuid,
|
||||||
|
workspace_id: Uuid,
|
||||||
|
name: String, // Used as upstream identifier
|
||||||
|
|
||||||
|
// Load balancing algorithm
|
||||||
|
algorithm: LoadBalanceAlgorithm, // RoundRobin, LeastConn, IPHash, etc.
|
||||||
|
|
||||||
|
// Backend servers
|
||||||
|
servers: Vec<UpstreamServer>,
|
||||||
|
|
||||||
|
// Health check configuration
|
||||||
|
health_check: Option<HealthCheckConfig>,
|
||||||
|
|
||||||
|
// Connection settings
|
||||||
|
keepalive_connections: Option<u32>,
|
||||||
|
keepalive_timeout: Option<u32>,
|
||||||
|
}
|
||||||
|
|
||||||
|
struct UpstreamServer {
|
||||||
|
address: String, // IP:port or hostname:port
|
||||||
|
weight: u32, // Default: 1
|
||||||
|
backup: bool, // Backup server
|
||||||
|
down: bool, // Temporarily down
|
||||||
|
max_fails: u32, // Default: 1
|
||||||
|
fail_timeout: u32, // Seconds, default: 10
|
||||||
|
}
|
||||||
|
|
||||||
|
enum LoadBalanceAlgorithm {
|
||||||
|
RoundRobin,
|
||||||
|
LeastConnections,
|
||||||
|
IPHash,
|
||||||
|
WeightedRoundRobin,
|
||||||
|
}
|
||||||
|
```
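To show how these fields map onto nginx directives, the sketch below renders an `upstream` block by hand. It redeclares trimmed copies of the types so that it compiles on its own, and `render_upstream` is a hypothetical helper; the document specifies Handlebars templates for the real rendering path.

```rust
enum LoadBalanceAlgorithm {
    RoundRobin,
    LeastConnections,
    IPHash,
    WeightedRoundRobin,
}

struct UpstreamServer {
    address: String,
    weight: u32,
    backup: bool,
    down: bool,
    max_fails: u32,
    fail_timeout: u32, // seconds
}

/// Render an nginx `upstream` block from the entity fields.
fn render_upstream(
    name: &str,
    algorithm: &LoadBalanceAlgorithm,
    servers: &[UpstreamServer],
) -> String {
    let mut out = format!("upstream {name} {{\n");
    match algorithm {
        LoadBalanceAlgorithm::LeastConnections => out.push_str("    least_conn;\n"),
        LoadBalanceAlgorithm::IPHash => out.push_str("    ip_hash;\n"),
        // Round robin is nginx's default; weights are set per server below.
        LoadBalanceAlgorithm::RoundRobin | LoadBalanceAlgorithm::WeightedRoundRobin => {}
    }
    for s in servers {
        let mut line = format!(
            "    server {} weight={} max_fails={} fail_timeout={}s",
            s.address, s.weight, s.max_fails, s.fail_timeout
        );
        if s.backup {
            line.push_str(" backup");
        }
        if s.down {
            line.push_str(" down");
        }
        line.push_str(";\n");
        out.push_str(&line);
    }
    out.push_str("}\n");
    out
}
```

Note that nginx does not allow the `backup` parameter together with `ip_hash`, so a validator would need to reject that combination before rendering.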
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### CM-003: Configuration Versioning
|
||||||
|
|
||||||
|
**Description**: Track all configuration changes with full history.
|
||||||
|
|
||||||
|
**Versioning Features**:
|
||||||
|
- Every change creates a new version
|
||||||
|
- Versions are immutable
|
||||||
|
- Rollback to any previous version
|
||||||
|
- Diff between versions
|
||||||
|
- Audit log of who changed what
|
||||||
|
|
||||||
|
**Version Entity**:
|
||||||
|
```rust
|
||||||
|
struct ConfigVersion {
|
||||||
|
id: Uuid,
|
||||||
|
resource_type: String, // "virtual_host", "upstream", etc.
|
||||||
|
resource_id: Uuid,
|
||||||
|
version_number: u64, // Auto-incrementing
|
||||||
|
data: Json, // Full configuration snapshot
|
||||||
|
checksum: String, // SHA-256 of data
|
||||||
|
created_by: Uuid, // User ID
|
||||||
|
created_at: DateTime,
|
||||||
|
change_summary: String, // Human-readable description
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**API Endpoints**:
|
||||||
|
- `GET /api/v1/virtual-hosts/{id}/versions` - List versions
|
||||||
|
- `GET /api/v1/virtual-hosts/{id}/versions/{version}` - Get specific version
|
||||||
|
- `POST /api/v1/virtual-hosts/{id}/rollback` - Rollback to version
|
||||||
|
- `GET /api/v1/virtual-hosts/{id}/diff?from=v1&to=v2` - Compare versions
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Observability
|
||||||
|
|
||||||
|
### OB-001: Structured Logging
|
||||||
|
|
||||||
|
**Description**: Comprehensive logging with structured format.
|
||||||
|
|
||||||
|
**Log Levels**: ERROR, WARN, INFO, DEBUG, TRACE
|
||||||
|
|
||||||
|
**Log Fields**:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"timestamp": "2026-03-02T10:30:00Z",
|
||||||
|
"level": "INFO",
|
||||||
|
"component": "agent",
|
||||||
|
"agent_id": "550e8400-e29b-41d4-a716-446655440000",
|
||||||
|
"trace_id": "abc123",
|
||||||
|
"span_id": "def456",
|
||||||
|
"message": "Configuration applied successfully",
|
||||||
|
"fields": {
|
||||||
|
"config_id": "config-123",
|
||||||
|
"version": 42,
|
||||||
|
"duration_ms": 150
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Log Targets**:
|
||||||
|
- Master: systemd journal, file, or centralized (ELK/Loki)
|
||||||
|
- Agent: stdout (Docker), file (standalone), or remote
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### OB-002: Distributed Tracing
|
||||||
|
|
||||||
|
**Description**: OpenTelemetry tracing for request flow visualization.
|
||||||
|
|
||||||
|
**Traced Operations**:
|
||||||
|
- Configuration push (master → agent → nginx)
|
||||||
|
- Health check cycles
|
||||||
|
- Certificate issuance
|
||||||
|
- API requests
|
||||||
|
|
||||||
|
**Span Attributes**:
|
||||||
|
- `nxmesh.agent_id`
|
||||||
|
- `nxmesh.config_id`
|
||||||
|
- `nxmesh.workspace_id`
|
||||||
|
- `nxmesh.organization_id`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### OB-003: Access Log Aggregation
|
||||||
|
|
||||||
|
**Description**: Collect and query nginx access logs from all agents.
|
||||||
|
|
||||||
|
**Features**:
|
||||||
|
- Centralized access log storage
|
||||||
|
- Real-time log streaming
|
||||||
|
- SQL-like query interface
|
||||||
|
- Log retention policies
|
||||||
|
|
||||||
|
**Access Log Schema**:
|
||||||
|
```rust
|
||||||
|
struct AccessLogEntry {
|
||||||
|
id: Uuid,
|
||||||
|
agent_id: Uuid,
|
||||||
|
timestamp: DateTime<Utc>, // assumes chrono; a bare `DateTime` needs a time zone parameter
|
||||||
|
|
||||||
|
// Request details
|
||||||
|
remote_addr: String,
|
||||||
|
method: String,
|
||||||
|
uri: String,
|
||||||
|
protocol: String,
|
||||||
|
host: String,
|
||||||
|
|
||||||
|
// Response details
|
||||||
|
status: u16,
|
||||||
|
body_bytes_sent: u64,
|
||||||
|
response_time_ms: f64,
|
||||||
|
|
||||||
|
// Additional fields
|
||||||
|
user_agent: Option<String>,
|
||||||
|
referer: Option<String>,
|
||||||
|
request_id: Option<String>,
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Query API**:
|
||||||
|
```graphql
|
||||||
|
# Example query
|
||||||
|
query {
|
||||||
|
accessLogs(
|
||||||
|
filter: {
|
||||||
|
agentId: "...",
|
||||||
|
timeRange: { from: "2026-03-01", to: "2026-03-02" },
|
||||||
|
statusCode: { gte: 500 }
|
||||||
|
},
|
||||||
|
limit: 100
|
||||||
|
) {
|
||||||
|
timestamp
|
||||||
|
method
|
||||||
|
uri
|
||||||
|
status
|
||||||
|
responseTimeMs
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
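One way the filter in the example query might be evaluated on the server is sketched below. `LogRecord`, `LogFilter`, and `query_logs` are hypothetical names, the record is a trimmed-down stand-in for `AccessLogEntry`, and timestamps are plain Unix seconds so the sketch needs no external crates. A real implementation would more likely push these predicates down into the storage layer.

```rust
/// Trimmed-down stand-in for `AccessLogEntry`; fields chosen for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct LogRecord {
    pub agent_id: String,
    pub timestamp_unix: i64,
    pub method: String,
    pub uri: String,
    pub status: u16,
    pub response_time_ms: f64,
}

/// Hypothetical filter mirroring the `filter` argument of the example query.
/// A `None` field means "do not filter on this attribute".
#[derive(Debug, Clone, Default)]
pub struct LogFilter {
    pub agent_id: Option<String>,
    pub from_unix: Option<i64>, // inclusive lower bound
    pub to_unix: Option<i64>,   // exclusive upper bound
    pub status_gte: Option<u16>,
}

impl LogFilter {
    /// Returns true when the record satisfies every populated criterion.
    fn matches(&self, record: &LogRecord) -> bool {
        self.agent_id
            .as_deref()
            .map_or(true, |id| id == record.agent_id.as_str())
            && self.from_unix.map_or(true, |t| record.timestamp_unix >= t)
            && self.to_unix.map_or(true, |t| record.timestamp_unix < t)
            && self.status_gte.map_or(true, |s| record.status >= s)
    }
}

/// Returns at most `limit` matching records, preserving input order.
pub fn query_logs<'a>(
    records: &'a [LogRecord],
    filter: &LogFilter,
    limit: usize,
) -> Vec<&'a LogRecord> {
    records
        .iter()
        .filter(|record| filter.matches(record))
        .take(limit)
        .collect()
}
```

Treating the upper time bound as exclusive is a design choice of this sketch; the source does not say whether `to` is inclusive.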
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Security Features
|
||||||
|
|
||||||
|
### SF-001: Authentication and Authorization
|
||||||
|
|
||||||
|
**Description**: Multi-method authentication with fine-grained RBAC.
|
||||||
|
|
||||||
|
**Authentication Methods**:
|
||||||
|
- JWT (for API/Web UI)
|
||||||
|
- Password-based login (local user accounts)
|
||||||
|
- OAuth2/OIDC (Google, GitHub, enterprise SSO)
|
||||||
|
- API Keys (for service accounts)
|
||||||
|
- **TLS + Shared Secret** (for agent communication)
|
||||||
|
- Server-side TLS (auto-generated self-signed or custom certificates)
|
||||||
|
- Bootstrap token for initial registration
|
||||||
|
- Session key with HMAC signing for ongoing requests
|
||||||
|
- Primary/secondary key rotation
|
||||||
|
|
||||||
|
**RBAC Model**:
|
||||||
|
```rust
|
||||||
|
struct Role {
|
||||||
|
id: Uuid,
|
||||||
|
name: String,
|
||||||
|
permissions: Vec<Permission>,
|
||||||
|
}
|
||||||
|
|
||||||
|
enum Permission {
|
||||||
|
// Organization scope
|
||||||
|
OrganizationRead,
|
||||||
|
OrganizationWrite,
|
||||||
|
OrganizationDelete,
|
||||||
|
|
||||||
|
// Workspace scope
|
||||||
|
WorkspaceRead,
|
||||||
|
WorkspaceWrite,
|
||||||
|
WorkspaceDelete,
|
||||||
|
|
||||||
|
// Agent scope
|
||||||
|
AgentRead,
|
||||||
|
AgentWrite,
|
||||||
|
AgentReload,
|
||||||
|
AgentDelete,
|
||||||
|
|
||||||
|
// Config scope
|
||||||
|
ConfigRead,
|
||||||
|
ConfigWrite,
|
||||||
|
ConfigDeploy,
|
||||||
|
ConfigDelete,
|
||||||
|
|
||||||
|
// Certificate scope
|
||||||
|
CertificateRead,
|
||||||
|
CertificateWrite,
|
||||||
|
CertificateDelete,
|
||||||
|
|
||||||
|
// User management
|
||||||
|
UserRead,
|
||||||
|
UserWrite,
|
||||||
|
UserDelete,
|
||||||
|
}
|
||||||
|
```
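A permission check over this model could look roughly like the following. The sketch keeps only a subset of the `Permission` variants and drops the `id` field for brevity; `has_permission` and `authorize` are hypothetical helper names, not part of the documented design.

```rust
/// Subset of the `Permission` enum above, kept small for the sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Permission {
    AgentRead,
    AgentReload,
    ConfigRead,
    ConfigDeploy,
}

/// Simplified `Role` without the `id` field.
#[derive(Debug, Clone)]
pub struct Role {
    pub name: String,
    pub permissions: Vec<Permission>,
}

impl Role {
    /// Returns true when this role grants `needed`.
    pub fn has_permission(&self, needed: Permission) -> bool {
        self.permissions.contains(&needed)
    }
}

/// A caller is authorized when at least one assigned role grants the permission.
pub fn authorize(roles: &[Role], needed: Permission) -> Result<(), String> {
    if roles.iter().any(|role| role.has_permission(needed)) {
        Ok(())
    } else {
        Err(format!("missing permission: {:?}", needed))
    }
}
```

Under this model a caller holding only a read-only role can read configurations but is rejected when attempting `ConfigDeploy`, and a caller with no roles is rejected for everything.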
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### SF-002: Secret Management
|
||||||
|
|
||||||
|
**Description**: Secure storage and distribution of sensitive data.
|
||||||
|
|
||||||
|
**Secrets**:
|
||||||
|
- SSL private keys
|
||||||
|
- API tokens
|
||||||
|
- Database passwords
|
||||||
|
- External service credentials
|
||||||
|
|
||||||
|
**Security Measures**:
|
||||||
|
- Encryption at rest (AES-256-GCM)
|
||||||
|
- Encryption in transit (TLS 1.3)
|
||||||
|
- Automatic secret rotation
|
||||||
|
- Audit logging for secret access
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### SF-003: Network Security
|
||||||
|
|
||||||
|
**Description**: Network-level security controls.
|
||||||
|
|
||||||
|
**Features**:
|
||||||
|
- IP allowlisting for agent connections
|
||||||
|
- Rate limiting on API endpoints
|
||||||
|
- DDoS protection recommendations
|
||||||
|
- Security headers enforcement (HSTS, CSP, etc.)
|
||||||
|
|
||||||
|
**Agent Connection Security**:
|
||||||
|
- **TLS Encryption**: Server-side TLS (auto-generated or custom certificates)
|
||||||
|
- Development: Self-signed certificates auto-generated on first start
|
||||||
|
- Production: Valid certificates (Let's Encrypt or corporate CA)
|
||||||
|
- **Bootstrap Authentication**: One-time token for initial registration
|
||||||
|
- **Session Authentication**: HMAC-signed requests with shared session key
|
||||||
|
- **Key Rotation**: Primary/secondary key design for seamless rotation
|
||||||
|
- **Certificate Pinning**: Optional fingerprint verification for additional security
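The primary/secondary key design mentioned above might work along these lines: new requests are signed with the primary key, while verification also accepts the previous key until the next rotation. `KeyRing`, `MacFn`, and `toy_mac` are hypothetical names. In particular, `toy_mac` is a non-cryptographic placeholder used only so the sketch runs without external crates; a real deployment would use HMAC from a vetted crate and a constant-time comparison.

```rust
/// Signature of a MAC function: (key, message) -> tag.
/// Assumption: production code would plug in HMAC-SHA256 here.
pub type MacFn = fn(&[u8], &[u8]) -> Vec<u8>;

/// Holds the current signing key and, optionally, the previous one.
pub struct KeyRing {
    primary: Vec<u8>,
    secondary: Option<Vec<u8>>,
}

impl KeyRing {
    pub fn new(primary: Vec<u8>) -> Self {
        KeyRing { primary, secondary: None }
    }

    /// Promotes `new_key` to primary and demotes the old primary to secondary.
    /// Whatever was secondary before is discarded.
    pub fn rotate(&mut self, new_key: Vec<u8>) {
        let old_primary = std::mem::replace(&mut self.primary, new_key);
        self.secondary = Some(old_primary);
    }

    /// New requests are always signed with the primary key.
    pub fn sign(&self, mac: MacFn, message: &[u8]) -> Vec<u8> {
        mac(&self.primary, message)
    }

    /// Accepts a tag produced under either the primary or the secondary key.
    pub fn verify(&self, mac: MacFn, message: &[u8], tag: &[u8]) -> bool {
        if mac(&self.primary, message).as_slice() == tag {
            return true;
        }
        match &self.secondary {
            Some(key) => mac(key, message).as_slice() == tag,
            None => false,
        }
    }
}

/// Toy keyed hash (FNV-1a over key followed by message).
/// NOT cryptographically secure; a placeholder for illustration only.
pub fn toy_mac(key: &[u8], message: &[u8]) -> Vec<u8> {
    let mut acc: u64 = 0xcbf29ce484222325;
    for byte in key.iter().chain(message.iter()) {
        acc ^= u64::from(*byte);
        acc = acc.wrapping_mul(0x100000001b3);
    }
    acc.to_be_bytes().to_vec()
}
```

After one rotation an agent still holding the old key keeps working; after a second rotation its signatures are rejected, which bounds how long a retired key remains valid.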
|
||||||
428
docs/project-structure.md
Normal file
@@ -0,0 +1,428 @@
|
|||||||
|
# NxMesh Project Structure
|
||||||
|
|
||||||
|
This document outlines the recommended project structure for the NxMesh codebase.
|
||||||
|
|
||||||
|
## Directory Layout
|
||||||
|
|
||||||
|
```
|
||||||
|
nxmesh/
|
||||||
|
├── Cargo.toml # Workspace root
|
||||||
|
├── Cargo.lock
|
||||||
|
├── README.md
|
||||||
|
├── LICENSE
|
||||||
|
├── justfile # Task runner
|
||||||
|
├── AGENTS.md # AI agent context
|
||||||
|
│
|
||||||
|
├── crates/ # Rust workspace crates
|
||||||
|
│ ├── nxmesh-core/ # Shared core library
|
||||||
|
│ │ ├── Cargo.toml
|
||||||
|
│ │ └── src/
|
||||||
|
│ │ ├── lib.rs
|
||||||
|
│ │ ├── models/ # Shared data models
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ ├── organization.rs
|
||||||
|
│ │ │ ├── workspace.rs
|
||||||
|
│ │ │ ├── agent.rs
|
||||||
|
│ │ │ ├── config.rs
|
||||||
|
│ │ │ └── certificate.rs
|
||||||
|
│ │ ├── crypto/ # Encryption, hashing
|
||||||
|
│ │ ├── validation/ # Input validation
|
||||||
|
│ │ └── error.rs # Common error types
|
||||||
|
│ │
|
||||||
|
│ ├── nxmesh-proto/ # Protocol buffers
|
||||||
|
│ │ ├── Cargo.toml
|
||||||
|
│ │ ├── build.rs
|
||||||
|
│ │ └── proto/
|
||||||
|
│ │ ├── agent.proto
|
||||||
|
│ │ ├── config.proto
|
||||||
|
│ │ └── common.proto
|
||||||
|
│ │
|
||||||
|
│ ├── nxmesh-master/ # Control plane
|
||||||
|
│ │ ├── Cargo.toml
|
||||||
|
│ │ └── src/
|
||||||
|
│ │ ├── main.rs
|
||||||
|
│ │ ├── lib.rs
|
||||||
|
│ │ ├── api/ # REST API handlers
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ ├── routes.rs
|
||||||
|
│ │ │ ├── middleware/
|
||||||
|
│ │ │ ├── v1/ # API version 1
|
||||||
|
│ │ │ │ ├── mod.rs
|
||||||
|
│ │ │ │ ├── organizations.rs
|
||||||
|
│ │ │ │ ├── workspaces.rs
|
||||||
|
│ │ │ │ ├── agents.rs
|
||||||
|
│ │ │ │ ├── virtual_hosts.rs
|
||||||
|
│ │ │ │ ├── upstreams.rs
|
||||||
|
│ │ │ │ ├── certificates.rs
|
||||||
|
│ │ │ │ └── metrics.rs
|
||||||
|
│ │ │ └── websocket.rs
|
||||||
|
│ │ ├── grpc/ # gRPC service
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ ├── server.rs
|
||||||
|
│ │ │ ├── agent_service.rs
|
||||||
|
│ │ │ └── interceptor.rs
|
||||||
|
│ │ ├── config/ # Configuration
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ └── settings.rs
|
||||||
|
│ │ ├── db/ # Database layer
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ ├── connection.rs
|
||||||
|
│ │ │ ├── migration.rs
|
||||||
|
│ │ │ └── repositories/
|
||||||
|
│ │ ├── services/ # Business logic
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ ├── organization_service.rs
|
||||||
|
│ │ │ ├── workspace_service.rs
|
||||||
|
│ │ │ ├── agent_service.rs
|
||||||
|
│ │ │ ├── config_service.rs
|
||||||
|
│ │ │ ├── certificate_service.rs
|
||||||
|
│ │ │ └── auth_service.rs
|
||||||
|
│ │ ├── domain/ # Domain entities
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ ├── organization.rs
|
||||||
|
│ │ │ ├── agent.rs
|
||||||
|
│ │ │ └── config.rs
|
||||||
|
│ │ ├── infrastructure/ # External integrations
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ ├── acme/ # Let's Encrypt
|
||||||
|
│ │ │ ├── storage/ # Object storage
|
||||||
|
│ │ │ └── notifier/ # Notifications
|
||||||
|
│ │ ├── events/ # Event bus
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ ├── bus.rs
|
||||||
|
│ │ │ └── handlers.rs
|
||||||
|
│ │ └── cli.rs # CLI commands
|
||||||
|
│ │
|
||||||
|
│ ├── nxmesh-agent/ # Data plane
|
||||||
|
│ │ ├── Cargo.toml
|
||||||
|
│ │ └── src/
|
||||||
|
│ │ ├── main.rs
|
||||||
|
│ │ ├── lib.rs
|
||||||
|
│ │ ├── config/ # Agent configuration
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ └── settings.rs
|
||||||
|
│ │ ├── master/ # Master communication
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ ├── client.rs
|
||||||
|
│ │ │ ├── reconnect.rs
|
||||||
|
│ │ │ └── stream.rs
|
||||||
|
│ │ ├── nginx/ # Nginx management
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ ├── controller.rs
|
||||||
|
│ │ │ ├── config_manager.rs # Symlink-based atomic deployment
|
||||||
|
│ │ │ ├── config_renderer.rs
|
||||||
|
│ │ │ ├── validator.rs
|
||||||
|
│ │ │ ├── docker_sidecar.rs # Docker sidecar (PID namespace sharing)
|
||||||
|
│ │ │ ├── systemd.rs # Standalone mode
|
||||||
|
│ │ │ └── parser.rs # Nginx config parser
|
||||||
|
│ │ ├── health/ # Health monitoring
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ ├── monitor.rs
|
||||||
|
│ │ │ ├── nginx.rs
|
||||||
|
│ │ │ └── system.rs
|
||||||
|
│ │ ├── metrics/ # Metrics collection
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ ├── collector.rs
|
||||||
|
│ │ │ └── exporter.rs
|
||||||
|
│ │ ├── cache/ # Local caching
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ └── config_cache.rs
|
||||||
|
│ │ ├── watch/ # File watchers
|
||||||
|
│ │ │ ├── mod.rs
|
||||||
|
│ │ │ └── config_watch.rs
|
||||||
|
│ │ └── cli.rs # CLI commands
|
||||||
|
│ │
|
||||||
|
│ └── nxmesh-cli/ # CLI tool
|
||||||
|
│ ├── Cargo.toml
|
||||||
|
│ └── src/
|
||||||
|
│ ├── main.rs
|
||||||
|
│ ├── commands/ # CLI commands
|
||||||
|
│ │ ├── mod.rs
|
||||||
|
│ │ ├── login.rs
|
||||||
|
│ │ ├── agent.rs
|
||||||
|
│ │ ├── config.rs
|
||||||
|
│ │ └── deploy.rs
|
||||||
|
│ └── api/ # API client
|
||||||
|
│
|
||||||
|
├── frontend/ # Web UI (embedded in master)
|
||||||
|
│ ├── package.json
|
||||||
|
│ ├── vite.config.ts
|
||||||
|
│ ├── tsconfig.json
|
||||||
|
│ ├── index.html
|
||||||
|
│ ├── src/
|
||||||
|
│ │ ├── main.tsx
|
||||||
|
│ │ ├── App.tsx
|
||||||
|
│ │ ├── components/ # Reusable components
|
||||||
|
│ │ │ ├── common/
|
||||||
|
│ │ │ ├── layout/
|
||||||
|
│ │ │ └── forms/
|
||||||
|
│ │ ├── pages/ # Page components
|
||||||
|
│ │ │ ├── Dashboard/
|
||||||
|
│ │ │ ├── Agents/
|
||||||
|
│ │ │ ├── Configurations/
|
||||||
|
│ │ │ ├── Certificates/
|
||||||
|
│ │ │ └── Settings/
|
||||||
|
│ │ ├── hooks/ # React hooks
|
||||||
|
│ │ ├── stores/ # State management (Zustand)
|
||||||
|
│ │ ├── api/ # API client
|
||||||
|
│ │ ├── types/ # TypeScript types
|
||||||
|
│ │ ├── utils/ # Utilities
|
||||||
|
│ │ └── styles/ # CSS/Tailwind
|
||||||
|
│ └── public/
|
||||||
|
│
|
||||||
|
│ # Build output (dist/) is embedded into master binary
|
||||||
|
│ # Master serves static files at root path ("/")
|
||||||
|
│
|
||||||
|
├── migrations/ # Database migrations
|
||||||
|
│ └── sea-orm/
|
||||||
|
│ ├── Cargo.toml
|
||||||
|
│ └── src/
|
||||||
|
│
|
||||||
|
├── tests/ # Integration tests
|
||||||
|
│ ├── integration/
|
||||||
|
│ │ ├── master_api_tests.rs
|
||||||
|
│ │ ├── agent_master_tests.rs
|
||||||
|
│ │ └── config_flow_tests.rs
|
||||||
|
│ └── fixtures/
|
||||||
|
│
|
||||||
|
├── scripts/ # Build/utility scripts
|
||||||
|
│ ├── build.sh
|
||||||
|
│ ├── test.sh
|
||||||
|
│ └── release.sh
|
||||||
|
│
|
||||||
|
├── deploy/ # Deployment configs
|
||||||
|
│ ├── docker/
|
||||||
|
│ │ ├── master.Dockerfile
|
||||||
|
│ │ ├── agent.Dockerfile
|
||||||
|
│ │ └── docker-compose.yml
|
||||||
|
│ ├── k8s/
|
||||||
|
│ │ ├── namespace.yaml
|
||||||
|
│ │ ├── master/
|
||||||
|
│ │ ├── agent/
|
||||||
|
│ │ └── helm/
|
||||||
|
│ └── terraform/
|
||||||
|
│
|
||||||
|
├── docs/ # Documentation
|
||||||
|
│ ├── architecture.md
|
||||||
|
│ ├── features.md
|
||||||
|
│ ├── roadmap.md
|
||||||
|
│ ├── api.md
|
||||||
|
│ ├── deployment.md
|
||||||
|
│ └── project-structure.md
|
||||||
|
│
|
||||||
|
└── .devcontainer/ # Dev container
|
||||||
|
├── devcontainer.json
|
||||||
|
├── docker-compose.yml
|
||||||
|
├── Dockerfile
|
||||||
|
└── nginx/
|
||||||
|
```
|
||||||
|
|
||||||
|
## Crate Dependencies
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
graph TB
|
||||||
|
subgraph "Workspace Crates"
|
||||||
|
CLI[nxmesh-cli]
|
||||||
|
AGENT[nxmesh-agent]
|
||||||
|
MASTER[nxmesh-master]
|
||||||
|
PROTO[nxmesh-proto]
|
||||||
|
CORE[nxmesh-core]
|
||||||
|
end
|
||||||
|
|
||||||
|
CORE --> PROTO
|
||||||
|
AGENT --> CORE
|
||||||
|
AGENT --> PROTO
|
||||||
|
MASTER --> CORE
|
||||||
|
MASTER --> PROTO
|
||||||
|
CLI --> CORE
|
||||||
|
```
|
||||||
|
|
||||||
|
## Key Design Principles
|
||||||
|
|
||||||
|
### 1. Separation of Concerns
|
||||||
|
|
||||||
|
- **nxmesh-core**: Only shared types and utilities
|
||||||
|
- **nxmesh-master**: Only control plane logic
|
||||||
|
- **nxmesh-agent**: Only data plane logic
|
||||||
|
- **frontend**: Only UI logic
|
||||||
|
|
||||||
|
### 2. Domain-Driven Design (in Master)
|
||||||
|
|
||||||
|
```
|
||||||
|
domain/ # Domain entities (pure logic)
|
||||||
|
services/ # Application services (orchestration)
|
||||||
|
repositories/ # Data access abstraction
|
||||||
|
api/ # Interface adapters (HTTP, gRPC)
|
||||||
|
infrastructure/ # External concerns
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Agent Modularity
|
||||||
|
|
||||||
|
Each major concern in the agent is a separate module:
|
||||||
|
- `nginx/`: All nginx-specific code
|
||||||
|
- `master/`: All master communication code
|
||||||
|
- `health/`: All health monitoring code
|
||||||
|
- `metrics/`: All metrics code
|
||||||
|
|
||||||
|
### 4. Configuration Management
|
||||||
|
|
||||||
|
Use hierarchical configuration, where each source overrides the ones listed before it:
|
||||||
|
1. Default values (in code)
|
||||||
|
2. Config file (`/etc/nxmesh/*.toml`)
|
||||||
|
3. Environment variables
|
||||||
|
4. Command-line arguments (highest priority)
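The precedence order above can be modeled as a chain of partial settings merged from lowest to highest priority. The field names, default values, and the `PartialSettings`, `Settings`, and `resolve` identifiers below are all assumptions made for illustration; the sketch also takes already-parsed layers as input rather than reading files or the environment.

```rust
/// One configuration layer; `None` means "this source did not set the value".
#[derive(Debug, Clone, Default, PartialEq)]
pub struct PartialSettings {
    pub master_url: Option<String>,
    pub log_level: Option<String>,
    pub heartbeat_secs: Option<u64>,
}

impl PartialSettings {
    /// Merges two layers; values present in `higher` win over values in `self`.
    pub fn overlay(self, higher: PartialSettings) -> PartialSettings {
        PartialSettings {
            master_url: higher.master_url.or(self.master_url),
            log_level: higher.log_level.or(self.log_level),
            heartbeat_secs: higher.heartbeat_secs.or(self.heartbeat_secs),
        }
    }
}

/// Fully resolved settings with no optional fields left.
#[derive(Debug, Clone, PartialEq)]
pub struct Settings {
    pub master_url: String,
    pub log_level: String,
    pub heartbeat_secs: u64,
}

/// Applies file < environment < command line, then fills remaining gaps
/// with in-code defaults (hypothetical values).
pub fn resolve(file: PartialSettings, env: PartialSettings, cli: PartialSettings) -> Settings {
    let merged = file.overlay(env).overlay(cli);
    Settings {
        master_url: merged
            .master_url
            .unwrap_or_else(|| "https://localhost:9443".to_string()),
        log_level: merged.log_level.unwrap_or_else(|| "info".to_string()),
        heartbeat_secs: merged.heartbeat_secs.unwrap_or(30),
    }
}
```

Keeping every layer as `Option` fields makes it possible to tell "not set" apart from "set to the default", which matters when a lower-priority source must not be masked by a default from a higher one.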
|
||||||
|
|
||||||
|
## Module Guidelines
|
||||||
|
|
||||||
|
### API Versioning
|
||||||
|
|
||||||
|
- Always version REST APIs: `/api/v1/...`
|
||||||
|
- Maintain backward compatibility within major versions
|
||||||
|
- Use feature flags for gradual rollouts
|
||||||
|
|
||||||
|
### Error Handling
|
||||||
|
|
||||||
|
- Use `thiserror` for error definitions
|
||||||
|
- Propagate errors with context
|
||||||
|
- Convert to user-friendly messages at API boundary
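The three guidelines above can be illustrated with a small error type. To stay dependency-free, this sketch writes the `Display` and `Error` implementations by hand, which is roughly what a `thiserror` derive would produce; the variants, `parse_port`, and `to_api_response` are hypothetical and not taken from the NxMesh codebase.

```rust
use std::error::Error;
use std::fmt;
use std::num::ParseIntError;

/// Error definition; with `thiserror` the two impls below would be derived.
#[derive(Debug)]
pub enum ConfigError {
    NotFound { id: String },
    /// Keeps the offending input as context and the low-level cause as `source`.
    InvalidPort { raw: String, source: ParseIntError },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::NotFound { id } => write!(f, "config {id} not found"),
            ConfigError::InvalidPort { raw, .. } => write!(f, "invalid port {raw:?}"),
        }
    }
}

impl Error for ConfigError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            ConfigError::InvalidPort { source, .. } => Some(source),
            ConfigError::NotFound { .. } => None,
        }
    }
}

/// Propagates the parse failure with context instead of discarding it.
pub fn parse_port(raw: &str) -> Result<u16, ConfigError> {
    raw.parse::<u16>().map_err(|source| ConfigError::InvalidPort {
        raw: raw.to_string(),
        source,
    })
}

/// API boundary: maps internal errors to a status code and a user-facing message.
pub fn to_api_response(err: &ConfigError) -> (u16, String) {
    match err {
        ConfigError::NotFound { .. } => {
            (404, "The requested configuration does not exist.".to_string())
        }
        ConfigError::InvalidPort { .. } => {
            (400, "The port must be a number between 0 and 65535.".to_string())
        }
    }
}
```

The internal `Display` text stays precise for logs, while `to_api_response` decides what is safe and useful to show to a caller.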
|
||||||
|
|
||||||
|
### Testing Structure
|
||||||
|
|
||||||
|
```rust
|
||||||
|
// In each module
|
||||||
|
#[cfg(test)]
|
||||||
|
mod tests {
|
||||||
|
use super::*;
|
||||||
|
|
||||||
|
#[test]
|
||||||
|
fn test_feature() {
|
||||||
|
// unit tests
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- Unit tests: In same file as code
|
||||||
|
- Integration tests: In `tests/` directory
|
||||||
|
- E2E tests: Separate crate or external repo
|
||||||
|
|
||||||
|
### Documentation
|
||||||
|
|
||||||
|
- All public APIs must have doc comments
|
||||||
|
- Include examples in doc comments
|
||||||
|
- Keep README files in each crate
|
||||||
|
|
||||||
|
## Build Configuration
|
||||||
|
|
||||||
|
### Workspace Cargo.toml
|
||||||
|
|
||||||
|
```toml
|
||||||
|
[workspace]
|
||||||
|
members = [
|
||||||
|
"crates/nxmesh-core",
|
||||||
|
"crates/nxmesh-proto",
|
||||||
|
"crates/nxmesh-master",
|
||||||
|
"crates/nxmesh-agent",
|
||||||
|
"crates/nxmesh-cli",
|
||||||
|
]
|
||||||
|
resolver = "3"
|
||||||
|
|
||||||
|
[workspace.dependencies]
|
||||||
|
# Core dependencies
|
||||||
|
tokio = { version = "1", features = ["full"] }
|
||||||
|
serde = { version = "1", features = ["derive"] }
|
||||||
|
thiserror = "1"
|
||||||
|
tracing = "0.1"
|
||||||
|
|
||||||
|
# Web framework
|
||||||
|
axum = "0.7"
|
||||||
|
tower = "0.4"
|
||||||
|
tower-http = "0.5"
|
||||||
|
|
||||||
|
# gRPC
|
||||||
|
tonic = "0.11"
|
||||||
|
prost = "0.12"
|
||||||
|
|
||||||
|
# Database
|
||||||
|
sea-orm = "2.0.0-rc"
|
||||||
|
sea-orm-migration = "2.0.0-rc"
|
||||||
|
|
||||||
|
# Async
|
||||||
|
async-trait = "0.1"
|
||||||
|
futures = "0.3"
|
||||||
|
|
||||||
|
# Serialization
|
||||||
|
serde_json = "1"
|
||||||
|
toml = "0.8"
|
||||||
|
|
||||||
|
# HTTP
|
||||||
|
reqwest = { version = "0.12", default-features = false }
|
||||||
|
|
||||||
|
# Crypto
|
||||||
|
sha2 = "0.10"
|
||||||
|
hex = "0.4"
|
||||||
|
|
||||||
|
# Testing
|
||||||
|
tokio-test = "0.4"
|
||||||
|
mockall = "0.12"
|
||||||
|
```
|
||||||
|
|
||||||
|
## Naming Conventions
|
||||||
|
|
||||||
|
### Files
|
||||||
|
- Use `snake_case` for file names
|
||||||
|
- Module entry point: `mod.rs` or `{module_name}.rs`
|
||||||
|
|
||||||
|
### Types
|
||||||
|
- Structs/Enums: `PascalCase`
|
||||||
|
- Traits: `PascalCase` (often ending in `able` or with verb prefix)
|
||||||
|
- Functions/Methods: `snake_case`
|
||||||
|
- Constants: `SCREAMING_SNAKE_CASE`
|
||||||
|
- Generic parameters: Single uppercase letter (`T`, `K`, `V`)
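A short sketch that applies each of the conventions above in one place may help; every identifier in it is made up for illustration.

```rust
/// Constant: SCREAMING_SNAKE_CASE.
pub const DEFAULT_HEARTBEAT_SECS: u64 = 30;

/// Trait: PascalCase, here ending in `able`.
pub trait Reloadable {
    /// Methods: snake_case.
    fn reload(&mut self);
    fn reload_count(&self) -> u32;
}

/// Struct: PascalCase.
pub struct NginxInstance {
    reload_count: u32,
}

impl NginxInstance {
    /// Associated function: snake_case.
    pub fn new() -> Self {
        NginxInstance { reload_count: 0 }
    }
}

impl Reloadable for NginxInstance {
    fn reload(&mut self) {
        self.reload_count += 1;
    }

    fn reload_count(&self) -> u32 {
        self.reload_count
    }
}

/// Generic parameter: a single uppercase letter (`T`).
/// Returns the first item of the slice, or `fallback` when it is empty.
pub fn first_or<T: Clone>(items: &[T], fallback: T) -> T {
    items.first().cloned().unwrap_or(fallback)
}
```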
|
||||||
|
|
||||||
|
### Error Types
|
||||||
|
- Suffix with `Error`: `ConfigError`, `AgentError`
|
||||||
|
- Group in `error.rs` or `errors/` module
|
||||||
|
|
||||||
|
### Feature Flags
|
||||||
|
- Use `kebab-case`: `postgres-native`, `tls-rustls`
|
||||||
|
|
||||||
|
## CI/CD Structure
|
||||||
|
|
||||||
|
```
|
||||||
|
# .github/workflows/
|
||||||
|
├── ci.yml # PR checks
|
||||||
|
├── test.yml # Test suite
|
||||||
|
├── release.yml # Release builds
|
||||||
|
├── docker.yml # Docker image builds
|
||||||
|
└── docs.yml # Documentation deploy
|
||||||
|
```
|
||||||
|
|
||||||
|
## Scripts
|
||||||
|
|
||||||
|
Common operations should be exposed as `just` recipes:
|
||||||
|
|
||||||
|
```justfile
|
||||||
|
# Development
|
||||||
|
just dev # Start all services
|
||||||
|
just dev-backend # Start backend only
|
||||||
|
just dev-frontend # Start frontend only
|
||||||
|
|
||||||
|
# Testing
|
||||||
|
just test # Run all tests
|
||||||
|
just test-unit # Unit tests only
|
||||||
|
just test-integration # Integration tests
|
||||||
|
|
||||||
|
# Building
|
||||||
|
just build # Build all
|
||||||
|
just build-master # Build master only
|
||||||
|
just build-agent # Build agent only
|
||||||
|
|
||||||
|
# Database
|
||||||
|
just db-migrate # Run migrations
|
||||||
|
just db-reset # Reset database
|
||||||
|
just db-console # Open psql
|
||||||
|
|
||||||
|
# Deployment
|
||||||
|
just docker-build # Build Docker images
|
||||||
|
just k8s-deploy # Deploy to Kubernetes
|
||||||
|
```
|
||||||
486
docs/roadmap.md
Normal file
@@ -0,0 +1,486 @@
|
|||||||
|
# NxMesh Project Roadmap
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
This document outlines the development phases and milestones for NxMesh. The project is divided into four major phases, each building upon the previous one.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 1: Foundation (Months 1-3)
|
||||||
|
|
||||||
|
**Goal**: Build a working MVP with basic master-agent communication and nginx configuration management.
|
||||||
|
|
||||||
|
### Milestone 1.1: Project Setup and Core Infrastructure
|
||||||
|
**Target**: Week 2
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | Set up Rust workspace structure (master, agent, shared) | 🔲 |
|
||||||
|
| [ ] | Configure CI/CD pipeline (GitHub Actions) | 🔲 |
|
||||||
|
| [ ] | Set up database schema with SeaORM migrations | 🔲 |
|
||||||
|
| [ ] | Create development environment (devcontainer) | 🔲 |
|
||||||
|
| [ ] | Set up testing framework (unit, integration) | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- Working development environment
|
||||||
|
- Database schema for organizations, workspaces, agents
|
||||||
|
- CI pipeline with linting and testing
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Milestone 1.2: Master - Core API
|
||||||
|
**Target**: Week 5
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | Implement Axum-based REST API server | 🔲 |
|
||||||
|
| [ ] | JWT authentication middleware | 🔲 |
|
||||||
|
| [ ] | CRUD endpoints for Organizations | 🔲 |
|
||||||
|
| [ ] | CRUD endpoints for Workspaces | 🔲 |
|
||||||
|
| [ ] | CRUD endpoints for Agents | 🔲 |
|
||||||
|
| [ ] | PostgreSQL persistence layer | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- REST API for basic resource management
|
||||||
|
- JWT authentication working
|
||||||
|
- API documentation (OpenAPI)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Milestone 1.3: Master - Agent Communication
|
||||||
|
**Target**: Week 7
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | gRPC server implementation (Tonic) | 🔲 |
|
||||||
|
| [ ] | Bidirectional streaming protocol | 🔲 |
|
||||||
|
| [ ] | Agent registration flow | 🔲 |
|
||||||
|
| [ ] | Token-based authentication for agents | 🔲 |
|
||||||
|
| [ ] | Agent heartbeat/health monitoring | 🔲 |
|
||||||
|
| [ ] | WebSocket fallback for events | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- Master can accept agent connections
|
||||||
|
- Agent registration and authentication works
|
||||||
|
- Health status tracking
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Milestone 1.4: Agent - Core Functionality
|
||||||
|
**Target**: Week 9
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | Agent CLI and configuration | 🔲 |
|
||||||
|
| [ ] | gRPC client for master communication | 🔲 |
|
||||||
|
| [ ] | Automatic reconnection with backoff | 🔲 |
|
||||||
|
| [ ] | Nginx process management (Docker sidecar PID sharing) | 🔲 |
|
||||||
|
| [ ] | Health check reporting | 🔲 |
|
||||||
|
| [ ] | Local config caching | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- Agent binary that connects to master
|
||||||
|
- Nginx lifecycle management (Docker sidecar mode)
|
||||||
|
- Health reporting
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Milestone 1.5: Configuration Management
|
||||||
|
**Target**: Week 11
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | VirtualHost CRUD API | 🔲 |
|
||||||
|
| [ ] | Upstream CRUD API | 🔲 |
|
||||||
|
| [ ] | Handlebars template engine integration | 🔲 |
|
||||||
|
| [ ] | Config rendering on agent | 🔲 |
|
||||||
|
| [ ] | Nginx config validation (`nginx -t`) | 🔲 |
|
||||||
|
| [ ] | Graceful reload on config change | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- End-to-end config push: Master → Agent → Nginx
|
||||||
|
- Basic virtual host and upstream management
|
||||||
|
- Template-based nginx config generation
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Milestone 1.6: Web Admin Console - Foundation
|
||||||
|
**Target**: Week 13
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | React + Vite project setup | 🔲 |
|
||||||
|
| [ ] | Authentication UI (login/logout) | 🔲 |
|
||||||
|
| [ ] | Dashboard layout and navigation | 🔲 |
|
||||||
|
| [ ] | Agent list and detail views | 🔲 |
|
||||||
|
| [ ] | Basic virtual host form | 🔲 |
|
||||||
|
| [ ] | WebSocket integration for real-time updates | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- Functional Web UI
|
||||||
|
- Agent management via UI
|
||||||
|
- Basic configuration editing
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Phase 1 Completion Criteria
|
||||||
|
- [ ] Master and Agent communicate via gRPC
|
||||||
|
- [ ] Nginx configs can be pushed from Master to Agent
|
||||||
|
- [ ] Web UI for basic management
|
||||||
|
- [ ] Docker sidecar deployment working
|
||||||
|
- [ ] Documentation complete
|
||||||
|
|
||||||
|
**Estimated Effort**: 3 months
|
||||||
|
**Team Size**: 2-3 engineers
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 2: Resilience and Observability (Months 4-5)
|
||||||
|
|
||||||
|
**Goal**: Make the system production-ready with HA, monitoring, and robust failure handling.
|
||||||
|
|
||||||
|
### Milestone 2.1: High Availability - Master Clustering
|
||||||
|
**Target**: Week 15
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | Raft consensus integration (raft-rs) | 🔲 |
|
||||||
|
| [ ] | Leader election | 🔲 |
|
||||||
|
| [ ] | State replication across masters | 🔲 |
|
||||||
|
| [ ] | Agent connection failover | 🔲 |
|
||||||
|
| [ ] | Cluster health monitoring | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- Multiple master instances can form a cluster
|
||||||
|
- Automatic failover on master failure
|
||||||
|
- No single point of failure
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Milestone 2.2: Certificate Management
|
||||||
|
**Target**: Week 17
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | ACME client integration (acme-rs) | 🔲 |
|
||||||
|
| [ ] | Let's Encrypt HTTP-01 challenge | 🔲 |
|
||||||
|
| [ ] | Certificate storage (encrypted) | 🔲 |
|
||||||
|
| [ ] | Automatic renewal | 🔲 |
|
||||||
|
| [ ] | Certificate distribution to agents | 🔲 |
|
||||||
|
| [ ] | Expiration monitoring and alerts | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- Automatic SSL certificate provisioning
|
||||||
|
- Certificate renewal before expiry
|
||||||
|
- UI for certificate management
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Milestone 2.3: Observability Stack
|
||||||
|
**Target**: Week 19
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | OpenTelemetry integration | 🔲 |
|
||||||
|
| [ ] | Structured logging (tracing) | 🔲 |
|
||||||
|
| [ ] | Prometheus metrics endpoint (agent) | 🔲 |
|
||||||
|
| [ ] | Custom metrics collection | 🔲 |
|
||||||
|
| [ ] | Health check dashboard | 🔲 |
|
||||||
|
| [ ] | Alert configuration | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- Metrics visible in Prometheus
|
||||||
|
- Distributed traces for config pushes
|
||||||
|
- Health dashboard in Web UI
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Milestone 2.4: Enhanced Failure Handling
|
||||||
|
**Target**: Week 21
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | Configuration drift detection | 🔲 |
|
||||||
|
| [ ] | Auto-healing (config sync) | 🔲 |
|
||||||
|
| [ ] | Circuit breaker for master connection | 🔲 |
|
||||||
|
| [ ] | Nginx crash detection and restart | 🔲 |
|
||||||
|
| [ ] | Config rollback on validation failure | 🔲 |
|
||||||
|
| [ ] | Bulk operations and queue management | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- System self-heals from common failures
|
||||||
|
- Config drift automatically corrected
|
||||||
|
- Robust reconnection logic
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Phase 2 Completion Criteria
|
||||||
|
- [ ] Master clustering with Raft
|
||||||
|
- [ ] Automatic SSL certificates
|
||||||
|
- [ ] Full observability (metrics, logs, traces)
|
||||||
|
- [ ] Production-grade failure handling
|
||||||
|
- [ ] Performance benchmarks
|
||||||
|
|
||||||
|
**Estimated Effort**: 2 months
|
||||||
|
**Team Size**: 2-3 engineers
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 3: Advanced Traffic Management (Months 6-7)
|
||||||
|
|
||||||
|
**Goal**: Add enterprise-grade traffic management features.
|
||||||
|
|
||||||
|
### Milestone 3.1: Advanced Load Balancing
|
||||||
|
**Target**: Week 23
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | Multiple load balancing algorithms | 🔲 |
|
||||||
|
| [ ] | Health checks for upstream servers | 🔲 |
|
||||||
|
| [ ] | Circuit breaker for upstreams | 🔲 |
|
||||||
|
| [ ] | Retry policies | 🔲 |
|
||||||
|
| [ ] | Connection pooling | 🔲 |
|
||||||
|
| [ ] | Upstream status dashboard | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- Advanced upstream configuration
|
||||||
|
- Health check visualization
|
||||||
|
- Circuit breaker metrics
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Milestone 3.2: Rate Limiting and WAF
|
||||||
|
**Target**: Week 25
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | Rate limiting rules (IP, user, global) | 🔲 |
|
||||||
|
| [ ] | Rate limiting zones | 🔲 |
|
||||||
|
| [ ] | Basic WAF rules (ModSecurity integration) | 🔲 |
|
||||||
|
| [ ] | IP allowlist/blocklist | 🔲 |
|
||||||
|
| [ ] | Geo-blocking | 🔲 |
|
||||||
|
| [ ] | Rate limit analytics | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- Configurable rate limiting
|
||||||
|
- Basic WAF protection
|
||||||
|
- Security event dashboard
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Milestone 3.3: Traffic Routing and Canary
|
||||||
|
**Target**: Week 27
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | Header-based routing | 🔲 |
|
||||||
|
| [ ] | Weight-based traffic splitting | 🔲 |
|
||||||
|
| [ ] | Canary deployment support | 🔲 |
|
||||||
|
| [ ] | A/B testing configuration | 🔲 |
|
||||||
|
| [ ] | Blue-green deployment | 🔲 |
|
||||||
|
| [ ] | Traffic analytics | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- Advanced traffic routing
|
||||||
|
- Canary deployment UI
|
||||||
|
- Traffic split visualization
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Milestone 3.4: Access Log Aggregation
|
||||||
|
**Target**: Week 29
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | Nginx access log parsing | 🔲 |
|
||||||
|
| [ ] | Log streaming to master | 🔲 |
|
||||||
|
| [ ] | Log storage and indexing | 🔲 |
|
||||||
|
| [ ] | Log query interface | 🔲 |
|
||||||
|
| [ ] | Real-time log tailing | 🔲 |
|
||||||
|
| [ ] | Log-based alerting | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- Centralized access logs
|
||||||
|
- Log search and filtering
|
||||||
|
- Log-based metrics
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Phase 3 Completion Criteria
|
||||||
|
- [ ] Advanced load balancing and health checks
|
||||||
|
- [ ] Rate limiting and basic WAF
|
||||||
|
- [ ] Canary and A/B testing
|
||||||
|
- [ ] Access log aggregation
|
||||||
|
- [ ] Traffic analytics dashboard
|
||||||
|
|
||||||
|
**Estimated Effort**: 2 months
|
||||||
|
**Team Size**: 2-3 engineers
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 4: Enterprise Features (Months 8-10)
|
||||||
|
|
||||||
|
**Goal**: Enterprise readiness with multi-tenancy, RBAC, and advanced integrations.
|
||||||
|
|
||||||
|
### Milestone 4.1: Multi-tenancy and RBAC
|
||||||
|
**Target**: Week 31
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | Organization isolation | 🔲 |
|
||||||
|
| [ ] | Workspace-scoped resources | 🔲 |
|
||||||
|
| [ ] | Role-based access control | 🔲 |
|
||||||
|
| [ ] | User management API | 🔲 |
|
||||||
|
| [ ] | API key management | 🔲 |
|
||||||
|
| [ ] | Audit logging | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- Full multi-tenancy
|
||||||
|
- Granular permissions
|
||||||
|
- Audit trail
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Milestone 4.2: Kubernetes Integration
|
||||||
|
**Target**: Week 33
|
||||||
|
|
||||||
|
| Task | Description | Status |
|
||||||
|
|------|-------------|--------|
|
||||||
|
| [ ] | Kubernetes operator | 🔲 |
|
||||||
|
| [ ] | CRD definitions | 🔲 |
|
||||||
|
| [ ] | Helm chart | 🔲 |
|
||||||
|
| [ ] | Service discovery integration | 🔲 |
|
||||||
|
| [ ] | Ingress controller mode | 🔲 |
|
||||||
|
| [ ] | K8s-native agent deployment | 🔲 |
|
||||||
|
|
||||||
|
**Deliverables**:
|
||||||
|
- Kubernetes operator
|
||||||
|
- Helm chart for easy deployment
|
||||||
|
- Ingress controller functionality
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Milestone 4.3: External Integrations
**Target**: Week 35

| Task | Description | Status |
|------|-------------|--------|
| [ ] | Terraform provider | 🔲 |
| [ ] | GitOps integration (Git sync) | 🔲 |
| [ ] | Webhook support | 🔲 |
| [ ] | Slack/Discord notifications | 🔲 |
| [ ] | PagerDuty/Opsgenie integration | 🔲 |
| [ ] | DNS provider integration (Route53, Cloudflare) | 🔲 |

**Deliverables**:
- Infrastructure as Code support
- GitOps workflows
- Notification channels

---

### Milestone 4.4: Performance and Scale
**Target**: Week 37

| Task | Description | Status |
|------|-------------|--------|
| [ ] | Connection pooling optimization | 🔲 |
| [ ] | Config caching improvements | 🔲 |
| [ ] | Database query optimization | 🔲 |
| [ ] | Horizontal scaling tests | 🔲 |
| [ ] | Load testing (10k+ agents) | 🔲 |
| [ ] | Performance tuning documentation | 🔲 |

**Deliverables**:
- Performance benchmarks
- Scaling guidelines
- Optimization recommendations

---

### Milestone 4.5: Enterprise Security
**Target**: Week 39

| Task | Description | Status |
|------|-------------|--------|
| [ ] | mTLS for all communications | 🔲 |
| [ ] | Secret encryption at rest | 🔲 |
| [ ] | HSM integration | 🔲 |
| [ ] | SSO/SAML integration | 🔲 |
| [ ] | Security scanning (SAST/DAST) | 🔲 |
| [ ] | Compliance documentation (SOC2) | 🔲 |

**Deliverables**:
- Enterprise security features
- Compliance documentation
- Security audit

---

### Phase 4 Completion Criteria
- [ ] Full RBAC and multi-tenancy
- [ ] Kubernetes operator
- [ ] External integrations (Terraform, GitOps)
- [ ] Proven scalability (10k+ agents)
- [ ] Enterprise security compliance

**Estimated Effort**: 3 months
**Team Size**: 3-4 engineers

---

## Timeline Summary

```
Month 1-3: ████████████████████████████████████████ Phase 1: Foundation
Month 4-5: ████████████████████ Phase 2: Resilience
Month 6-7: ████████████████████ Phase 3: Advanced
Month 8-10: ██████████████████████████ Phase 4: Enterprise

Week: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
|--M1--|--M2--|--M3--|--M4--|--M5--|--M6--|

Week: 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
|--M7--|--M8--|--M9--|--M10-|--M11-|--M12-|--M13-|--M14-|
```

---

## Resource Requirements

### Phase 1
- **Backend Engineers**: 2
- **Frontend Engineer**: 1
- **Total Person-Months**: 9

### Phase 2
- **Backend Engineers**: 2
- **Frontend Engineer**: 1 (part-time)
- **DevOps Engineer**: 1 (part-time)
- **Total Person-Months**: 7

### Phase 3
- **Backend Engineers**: 2
- **Frontend Engineer**: 1
- **Total Person-Months**: 6

### Phase 4
- **Backend Engineers**: 2
- **Frontend Engineer**: 1
- **DevOps Engineer**: 1
- **Security Engineer**: 1 (part-time)
- **Total Person-Months**: 10

**Total Project**: ~32 person-months
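
As a quick check of the arithmetic, the per-phase person-month totals listed above can be summed. The following is only an illustrative Rust sketch: `PhaseEstimate`, `PHASES`, and `total_person_months` are hypothetical names chosen for this example and are not part of the NxMesh codebase.

```rust
/// Person-month estimate for one roadmap phase (hypothetical type for illustration).
struct PhaseEstimate {
    name: &'static str,
    person_months: u32,
}

/// Per-phase totals copied from the Resource Requirements section.
const PHASES: [PhaseEstimate; 4] = [
    PhaseEstimate { name: "Phase 1", person_months: 9 },
    PhaseEstimate { name: "Phase 2", person_months: 7 },
    PhaseEstimate { name: "Phase 3", person_months: 6 },
    PhaseEstimate { name: "Phase 4", person_months: 10 },
];

/// Sum the per-phase estimates into a project-wide total.
fn total_person_months(phases: &[PhaseEstimate]) -> u32 {
    phases.iter().map(|p| p.person_months).sum()
}

fn main() {
    for phase in &PHASES {
        println!("{}: {} person-months", phase.name, phase.person_months);
    }
    println!("Total: ~{} person-months", total_person_months(&PHASES));
}
```

Keeping the estimates in one table like this makes it easy to re-check the project total whenever a phase estimate is revised.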

---

## Risk Assessment

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Raft complexity delays HA | Medium | High | Start with single master, add HA later |
| gRPC performance issues | Low | Medium | Implement WebSocket fallback early |
| Nginx reload edge cases | Medium | High | Extensive testing, rollback capability |
| Team scaling challenges | Medium | Medium | Document architecture, modular design |
| Integration complexity | Medium | Medium | Clear APIs, contract testing |