Add project structure and roadmap documentation

- Created `project-structure.md` to outline the directory layout, crate dependencies, design principles, module guidelines, and naming conventions for the NxMesh codebase.
- Introduced `roadmap.md` detailing the development phases, milestones, tasks, deliverables, and resource requirements for the NxMesh project, spanning from foundational setup to enterprise features.
2026-03-03 04:13:31 +00:00
parent 39bd860c55
commit 43b2e44d95
11 changed files with 9293 additions and 7 deletions
--- a/.devcontainer/Dockerfile
+++ b/.devcontainer/Dockerfile
@@ -8,6 +8,7 @@ RUN apt-get update && apt-get install -y \
    pkg-config \
    libssl-dev \
    postgresql-client \
    protobuf-compiler \
    && rm -rf /var/lib/apt/lists/*
 # Set working directory
--- a/.devcontainer/docker-compose.yml
+++ b/.devcontainer/docker-compose.yml
@@ -27,7 +27,7 @@ services:
      - docker
    pid: "service:nginx"
-  # Data Plane - Nginx (controlled by agent via Docker)
+  # Data Plane - Nginx (controlled by agent via PID namespace sharing)
  nginx:
    image: nginx:alpine
    container_name: nxmesh-nginx
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -0,0 +1,104 @@
 # NxMesh - Agent Instructions
 This document provides context for AI agents working on the NxMesh project.
 ## Project Overview
 **NxMesh** is a distributed nginx management system using a master-agent architecture:
 - **Master (Control Plane)**: Central API, embedded Web UI, configuration distribution, cluster management
 - **Agent (Data Plane)**: Sidecar that manages local nginx instances
 - **Web UI**: Vite React-based admin console, embedded and served by master
 ## Quick Links to Documentation
 | Document | Purpose |
 |----------|---------|
 | [README.md](./README.md) | Project overview and quick start |
 | [docs/architecture.md](./docs/architecture.md) | System design and data flow |
 | [docs/features.md](./docs/features.md) | Detailed feature specifications |
 | [docs/roadmap.md](./docs/roadmap.md) | Development phases and milestones |
 | [docs/api.md](./docs/api.md) | REST and gRPC API specifications |
 | [docs/project-structure.md](./docs/project-structure.md) | Code organization |
 ## Technology Stack
 | Component | Technology |
 |-----------|------------|
 | Backend | Rust (Axum, Tonic, SeaORM) |
 | Frontend | React + TypeScript + Vite |
 | Database | PostgreSQL 16+ |
 | Cache | Redis |
 | Message Format | Protocol Buffers (gRPC) |
 | Container | Docker |
 | Orchestration | Kubernetes (optional) |
 ## Development Environment
 This project uses Dev Containers for consistent development:
 ```bash
 # All dependencies are pre-installed in the devcontainer
 just setup    # Initial setup
 just dev      # Start development
 ```
 ### Pre-configured Services
 The devcontainer includes:
 - PostgreSQL database
 - Redis cache
 - Nginx instance
 - Rust toolchain
 - Node.js/Bun for frontend
 ## Key Design Decisions
 1. **Master-Agent Protocol**: Bidirectional gRPC streaming for real-time communication
 2. **Configuration Management**: Template-based (Handlebars) with versioning
 3. **Security**: TLS + Shared Secret for agent connections, JWT for API auth
 4. **Deployment**: Support for Docker sidecar, K8s sidecar, and standalone modes
 ## Common Tasks
 ### Adding a New API Endpoint
 1. Define route in `crates/nxmesh-master/src/api/v1/`
 2. Add request/response types to shared models
 3. Implement handler with proper error handling
 4. Add tests
 5. Update OpenAPI documentation
 ### Adding a Database Entity
 1. Create migration with `sea-orm-cli migrate generate <name>`
 2. Define entity in `crates/nxmesh-master/src/db/entities/`
 3. Add repository in `crates/nxmesh-master/src/db/repositories/`
 4. Update service layer
 ### Adding Agent Functionality
 1. Add module in `crates/nxmesh-agent/src/`
 2. Update gRPC protocol if needed (`crates/nxmesh-proto/proto/`)
 3. Implement handler in agent
 4. Add corresponding master service
 ## Testing
 ```bash
 just test              # All tests
 just test-unit         # Unit tests only
 just test-integration  # Integration tests
 ```
 ## Code Style
 - Follow Rust API Guidelines
 - Use `cargo fmt` and `cargo clippy`
 - All public APIs must have doc comments
 - Error types should be descriptive and actionable
 ## Questions?
 Refer to the documentation in the `docs/` directory or ask the team.
--- a/Cargo.lock
+++ b/Cargo.lock
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -1,13 +1,80 @@
 [workspace]
 members = [
    "crates/nxmesh-core",
    "crates/nxmesh-proto",
    "crates/nxmesh-master",
    "crates/nxmesh-agent",
    "crates/nxmesh-cli",
    "migrations/sea-orm",
 ]
 resolver = "3"
-[workspace.lints.clippy]
+[workspace.package]
-module_inception = "allow"
+version = "0.1.0"
 edition = "2021"
 authors = ["NxMesh Team"]
 license = "GNU General Public License v3.0"
 repository = "https://github.com/nxmesh/nxmesh"
 rust-version = "1.80"
 [workspace.dependencies]
-sea-orm = "2.0.0-rc"
+# Core dependencies
-sea-orm-cli = "2.0.0-rc"
+tokio = { version = "1", features = ["full"] }
 serde = { version = "1", features = ["derive"] }
 serde_json = "1"
 thiserror = "1"
 tracing = "0.1"
 tracing-subscriber = { version = "0.3", features = ["env-filter", "json"] }
 # Web framework
 axum = "0.7"
 tower = "0.4"
 tower-http = { version = "0.5", features = ["trace", "cors", "fs"] }
 # gRPC
 tonic = "0.11"
 prost = "0.12"
 # Database
 sea-orm = { version = "2.0.0-rc", features = ["sqlx-postgres", "runtime-tokio-native-tls"] }
 sea-orm-migration = "2.0.0-rc"
 # Async
 async-trait = "0.1"
 futures = "0.3"
 # Configuration
 toml = "0.8"
 config = "0.14"
 # HTTP client
 reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "json"] }
 # Crypto
 sha2 = "0.10"
 hex = "0.4"
 argon2 = "0.5"
 jsonwebtoken = "9"
 # Validation
 validator = { version = "0.18", features = ["derive"] }
 # Time
 chrono = { version = "0.4", features = ["serde"] }
 # UUID
 uuid = { version = "1", features = ["v4", "serde"] }
 # Templating
 handlebars = "5"
 # CLI
 clap = { version = "4", features = ["derive"] }
 # Testing
 tokio-test = "0.4"
 mockall = "0.12"
 # NxMesh internal
 nxmesh-core = { path = "crates/nxmesh-core" }
 nxmesh-proto = { path = "crates/nxmesh-proto" }
--- a/README.md
+++ b/README.md
@@ -1,2 +1,202 @@
-# NxMesh
+# NxMesh - Distributed Nginx Management System
 > **NxMesh** is a modern, scalable, distributed system for managing nginx instances across diverse infrastructure environments. Built with a master-agent architecture inspired by service mesh patterns, NxMesh provides centralized control with local intelligence.
 ## 🎯 Project Vision
 NxMesh transforms nginx from a standalone reverse proxy into a **distributed, programmable edge layer**. By adopting a control plane (master) + data plane (agent/sidecar) architecture, NxMesh enables:
 - **Centralized Management**: Control thousands of nginx instances from a single control plane
 - **Dynamic Configuration**: Real-time configuration updates without restarts or connection drops
 - **Observability**: Unified metrics, logs, and health status across the entire fleet
 - **Hybrid Deployment**: Support for Docker, Kubernetes, VMs, and bare metal environments
 - **High Availability**: Fault-tolerant design with automatic failover and recovery
 ## 🏗️ Architecture Overview
 ```
 ┌─────────────────────────────────────────────────────────────────────────────────┐
 │                           CONTROL PLANE (Master)                                 │
 │  ┌──────────────────────────────────────────────────────────────────────────┐   │
 │  │                          NxMesh Master                                   │   │
 │  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │   │
 │  │  │   API        │  │  Config      │  │  Cluster     │  │   Admin      │  │   │
 │  │  │   Server     │  │  Manager     │  │  Coordinator │  │   Console    │  │   │
 │  │  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  │   │
 │  │         └──────────────────┴──────────────────┴──────────────────┘        │   │
 │  │                              │                                             │   │
 │  │                         PostgreSQL (State)                                │   │
 │  └──────────────────────────────┼─────────────────────────────────────────────┘   │
 │                                 │                                                 │
 │                    gRPC/TLS     │    WebSocket (Events)                           │
 │                                 ▼                                                 │
 └─────────────────────────────────────────────────────────────────────────────────┘
                                    │
        ┌───────────────────────────┼───────────────────────────┐
        │                           │                           │
        ▼                           ▼                           ▼
 ┌───────────────┐          ┌───────────────┐          ┌───────────────┐
 │   AGENT 1     │          │   AGENT 2     │          │   AGENT N     │
 │  (Sidecar)    │          │  (Standalone) │          │  (K8s Pod)    │
 │ ┌───────────┐ │          │ ┌───────────┐ │          │ ┌───────────┐ │
 │ │ NxMesh    │ │          │ │ NxMesh    │ │          │ │ NxMesh    │ │
 │ │ Agent     │ │          │ │ Agent     │ │          │ │ Agent     │ │
 │ └─────┬─────┘ │          │ └─────┬─────┘ │          │ └─────┬─────┘ │
 │       │       │          │       │       │          │       │       │
 │  ┌────┴────┐  │          │  ┌────┴────┐  │          │  ┌────┴────┐  │
 │  │  Nginx  │  │          │  │  Nginx  │  │          │  │  Nginx  │  │
 │  │ Instance│  │          │  │ Instance│  │          │  │ Instance│  │
 │  └─────────┘  │          │  └─────────┘  │          │  └─────────┘  │
 └───────────────┘          └───────────────┘          └───────────────┘
  Docker Compose              VM/Bare Metal              Kubernetes
 ```
 ### Core Components
 | Component | Description | Technology |
 |-----------|-------------|------------|
 | **Master** | Central control plane - API, embedded Web UI, config distribution | Rust (Axum/gRPC) + Embedded Vite React |
 | **Agent** | Local nginx controller - configuration, health checks, metrics | Rust (Tokio) |
 | **Database** | Persistent state storage | PostgreSQL |
 ## 🚀 Key Features
 ### Phase 1: Foundation
 - [ ] **Master Control Plane**
  - RESTful API for configuration management
  - gRPC for agent communication
  - PostgreSQL persistence
  - JWT-based authentication
 - [ ] **Agent Sidecar**
  - Docker deployment mode (sidecar pattern)
  - Standalone deployment mode
  - Automatic nginx lifecycle management
  - Configuration hot-reloading
 - [ ] **Configuration Management**
  - Virtual host (server block) templating
  - Upstream pool management
  - SSL/TLS certificate management
  - Configuration versioning & rollback
 ### Phase 2: Resilience
 - [ ] **High Availability**
  - Master clustering with Raft consensus
  - Agent auto-reconnection with exponential backoff
  - Configuration drift detection & auto-healing
 - [ ] **Observability**
  - Real-time metrics collection (Prometheus)
  - Structured logging (OpenTelemetry)
  - Health check dashboards
  - Alert management
 ### Phase 3: Advanced
 - [ ] **Traffic Management**
  - Dynamic load balancing strategies
  - Circuit breaker patterns
  - Rate limiting & WAF rules
  - A/B testing & canary deployments
 - [ ] **Multi-tenancy**
  - Organization/workspace isolation
  - RBAC (Role-Based Access Control)
  - Resource quotas & limits
 ## 📦 Deployment Modes
 ### 1. Docker Sidecar (Recommended for Development)
 ```yaml
 # docker-compose.yml
 services:
  nginx:
    image: nginx:alpine
  nxmesh-agent:
    image: nxmesh/agent:latest
    environment:
      - NXMESH_MASTER_URL=wss://master.nxmesh.io:8443
      - NXMESH_AGENT_TOKEN=${AGENT_TOKEN}
    network_mode: service:nginx  # Share network namespace
    pid: service:nginx            # Share PID namespace (for nginx reload)
 ```
 ### 2. Kubernetes Sidecar
 ```yaml
 # deployment.yaml
 spec:
  containers:
    - name: nginx
      image: nginx:alpine
    - name: nxmesh-agent
      image: nxmesh/agent:latest
      env:
        - name: NXMESH_MASTER_URL
          value: "wss://master.nxmesh.svc:8443"
 ```
 ### 3. Standalone (VM/Bare Metal)
 ```bash
 # Install agent
 curl -fsSL https://get.nxmesh.io | bash
 # Configure and start
 nxmesh-agent --master-url wss://master.nxmesh.io:8443 --token ${AGENT_TOKEN}
 ```
 ## 📋 Quick Start
 ### Prerequisites
 - Docker & Docker Compose
 - Rust 1.80+ (for development; matches `rust-version` in `Cargo.toml`)
 - PostgreSQL 16+
 ### Development Setup
 ```bash
 # Clone and setup
 git clone https://github.com/your-org/nxmesh.git
 cd nxmesh
 just setup
 # Start development environment
 just dev
 # Access services
 # - Web UI: http://localhost:3000
 # - API: http://localhost:8080
 # - Nginx: http://localhost:80
 ```
 ### Production Deployment
 ```bash
 # Deploy master
 docker run -d \
  -p 8080:8080 \
  -p 8443:8443 \
  -e DATABASE_URL=postgres://... \
  nxmesh/master:latest
 # Deploy agent (on nginx host)
 docker run -d \
  --network container:nginx \
  -e NXMESH_MASTER_URL=wss://master.example.com:8443 \
  -e NXMESH_AGENT_TOKEN=<token> \
  nxmesh/agent:latest
 ```
 ## 📚 Documentation
 | Document | Description |
 |----------|-------------|
 | [Architecture](./docs/architecture.md) | System design, data flow, component interactions |
 | [Features](./docs/features.md) | Detailed feature specifications |
 | [Roadmap](./docs/roadmap.md) | Development phases and milestones |
 | [API Reference](./docs/api.md) | REST API and gRPC specifications |
 | [Deployment](./docs/deployment.md) | Production deployment guides |
 ## 📄 License
 NxMesh is licensed under the GNU General Public License v3.0, as declared in `Cargo.toml`. See [LICENSE](./LICENSE) for details.
 ---
--- a/docs/api.md
+++ b/docs/api.md
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -0,0 +1,527 @@
 # NxMesh Architecture
 ## Table of Contents
 1. [Overview](#overview)
 2. [System Components](#system-components)
 3. [Data Flow](#data-flow)
 4. [Communication Protocols](#communication-protocols)
 5. [Security Model](#security-model)
 6. [Deployment Patterns](#deployment-patterns)
 7. [Failure Handling](#failure-handling)
 ---
 ## Overview
 NxMesh follows a **Control Plane / Data Plane** architecture pattern, similar to service meshes like Istio or Linkerd, but specifically optimized for nginx management.
 ### Design Principles
 1. **Separation of Concerns**: Master handles policy and state; Agent handles execution
 2. **Eventual Consistency**: Configuration changes propagate asynchronously
 3. **Local Autonomy**: Agents can operate independently during master outages
 4. **Zero-Downtime Updates**: Nginx reloads without dropping connections
 5. **Observability First**: Every action is observable and traceable
 ---
 ## System Components
 ### 1. Master (Control Plane)
 The Master is the brain of the system. It maintains the desired state and coordinates all agents.
 ```
 ┌──────────────────────────────────────────────────────────────────┐
 │                         MASTER                                   │
 │  ┌──────────────┐  ┌──────────────┐  ┌─────────────────────────┐ │
 │  │   API        │  │  Config      │  │    Event & Agent        │ │
 │  │   Layer      │  │  Engine      │  │    Coordination         │ │
 │  │              │  │              │  │                         │ │
 │  │ ┌─────────┐  │  │ ┌─────────┐  │  │  ┌───────────────────┐  │ │
 │  │ │ REST    │  │  │ │ Template│  │  │  │  Agent Registry   │  │ │
 │  │ │ Handler │  │  │ │ Engine  │  │  │  │  (Connections)    │  │ │
 │  │ └─────────┘  │  │ └─────────┘  │  │  └───────────────────┘  │ │
 │  │ ┌─────────┐  │  │ ┌─────────┐  │  │  ┌───────────────────┐  │ │
 │  │ │ gRPC    │  │  │ │ Version │  │  │  │  Event Bus        │  │ │
 │  │ │ Server  │  │  │ │ Control │  │  │  │  (Config Dist.)   │  │ │
 │  │ └─────────┘  │  │ └─────────┘  │  │  └───────────────────┘  │ │
 │  │ ┌──────────┐ │  │ ┌──────────┐ │  │  ┌───────────────────┐  │ │
 │  │ │ WebSocket│ │  │ │ Validator│ │  │  │  Broadcast        │  │ │
 │  │ │ Handler  │ │  │ │          │ │  │  │  (Agent Updates)  │  │ │
 │  │ └──────────┘ │  │ └──────────┘ │  │  └───────────────────┘  │ │
 │  └──────────────┘  └──────────────┘  └─────────────────────────┘ │
 │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐   │
 │  │   Auth      │  │  Storage    │  │    Observability        │   │
 │  │   Service   │  │  Layer      │  │                         │   │
 │  │             │  │             │  │  ┌───────────────────┐  │   │
 │  │ ┌─────────┐ │  │ ┌─────────┐ │  │  │  Metrics          │  │   │
 │  │ │ JWT     │ │  │ │ Postgres│ │  │  │  (Prometheus)     │  │   │
 │  │ │ OAuth2  │ │  │ │ (SeaORM)│ │  │  └───────────────────┘  │   │
 │  │ └─────────┘ │  │ └─────────┘ │  │  ┌───────────────────┐  │   │
 │  │ ┌─────────┐ │  │ ┌─────────┐ │  │  │  Tracing          │  │   │
 │  │ │ Password│ │  │ │ Cache   │ │  │  │  (OpenTelemetry)  │  │   │
 │  │ │ Login   │ │  │ │ (Redis) │ │  │  └───────────────────┘  │   │
 │  │ └─────────┘ │  │ └─────────┘ │  │                         │   │
 │  │ ┌─────────┐ │  │              │  │                         │   │
 │  │ │ RBAC    │ │  │              │  │                         │   │
 │  │ │ Engine  │ │  │              │  │                         │   │
 │  │ └─────────┘ │  │              │  │                         │   │
 │  └─────────────┘  └─────────────┘  └─────────────────────────┘   │
 └──────────────────────────────────────────────────────────────────┘
 ```
 #### Master Responsibilities
 | Module | Responsibility |
 |--------|----------------|
 | API Layer | HTTP REST API for external clients (CLI, Web UI, external systems) |
 | Config Engine | Template rendering, validation, versioning |
 | Event & Agent Coordination | Agent connection management, config event broadcasting |
 | Auth Service | Authentication (JWT/OAuth2, Password) and authorization (RBAC) |
 | Storage Layer | PostgreSQL for persistent state, Redis for caching |
 | Observability | Metrics collection, distributed tracing, structured logging |
 #### Future: High Availability Mode
 For large-scale deployments, the master can be extended with:
 - **Raft Consensus** for leader election and state replication
 - **Cluster Manager** for coordinating multiple master instances
 - This is **not required** for single-organization, self-hosted deployments
 ### 2. Agent (Data Plane)
 The Agent is a lightweight sidecar that runs alongside each nginx instance.
 ```
 ┌─────────────────────────────────────────────────────────────────┐
 │                         AGENT                                   │
 │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
 │  │   Master    │  │  Nginx      │  │    Health Monitor       │  │
 │  │   Client    │  │  Controller │  │                         │  │
 │  │             │  │             │  │  ┌───────────────────┐  │  │
 │  │ ┌─────────┐ │  │ ┌─────────┐ │  │  │  Nginx Health     │  │  │
 │  │ │ gRPC    │ │  │ │ Config  │ │  │  │  (HTTP checks)    │  │  │
 │  │ │ Client  │ │  │ │ Renderer│ │  │  └───────────────────┘  │  │
 │  │ └─────────┘ │  │ └─────────┘ │  │  ┌───────────────────┐  │  │
 │  │ ┌─────────┐ │  │ ┌─────────┐ │  │  │  System Metrics   │  │  │
 │  │ │WebSocket│ │  │ │ Reload  │ │  │  │  (CPU/Mem/IO)     │  │  │
 │  │ │ Client  │ │  │ │ Manager │ │  │  └───────────────────┘  │  │
 │  │ └─────────┘ │  │ └─────────┘ │  │                         │  │
 │  │ ┌─────────┐ │  │ ┌─────────┐ │  │  ┌───────────────────┐  │  │
 │  │ │Reconnect│ │  │ │ Process │ │  │  │  Self-Health      │  │  │
 │  │ │ Handler │ │  │ │ Signal  │ │  │  │  (Heartbeat)      │  │  │
 │  │ └─────────┘ │  │ └─────────┘ │  │  └───────────────────┘  │  │
 │  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
 │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
 │  │   Metrics   │  │  Local      │  │    Watchdog             │  │
 │  │   Exporter  │  │  Cache      │  │                         │  │
 │  │             │  │             │  │  ┌───────────────────┐  │  │
 │  │ ┌─────────┐ │  │ ┌─────────┐ │  │  │  Config Drift     │  │  │
 │  │ │Prometheus│ │  │ │ Config  │ │  │  │  Detection        │  │  │
 │  │ │Endpoint │ │  │ │ State   │ │  │  └───────────────────┘  │  │
 │  │ └─────────┘ │  │ └─────────┘ │  │  ┌───────────────────┐  │  │
 │  │ ┌─────────┐ │  │ ┌─────────┐ │  │  │  Auto-Recovery    │  │  │
 │  │ │Statsd   │ │  │ │ Backup  │ │  │  │  (Nginx restart)  │  │  │
 │  │ │Client   │ │  │ │ Files   │ │  │  └───────────────────┘  │  │
 │  │ └─────────┘ │  │ └─────────┘ │  │                         │  │
 │  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
 └─────────────────────────────────────────────────────────────────┘
 ```
 #### Agent Responsibilities
 | Module | Responsibility |
 |--------|----------------|
 | Master Client | Maintains persistent connection to master (gRPC + WebSocket fallback) |
 | Nginx Controller | Generates configs, manages reloads, handles lifecycle |
 | Health Monitor | Monitors nginx health, system resources, reports status |
 | Metrics Exporter | Prometheus endpoint, statsd client for metrics |
 | Local Cache | Caches configs for offline operation, backup/restore |
 | Watchdog | Detects config drift, auto-recovery from failures |
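The Watchdog's drift detection can be pictured as comparing a fingerprint of the config the master last pushed with a fingerprint of what is on disk. The sketch below is only an illustration under assumptions: `fingerprint`, `check_drift`, and `DriftStatus` are hypothetical names, and `DefaultHasher` is a std-only stand-in for a cryptographic digest such as SHA-256 (the workspace already lists `sha2`).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fingerprint of a rendered nginx config.
/// Assumption: DefaultHasher stands in for SHA-256 to keep the sketch std-only.
fn fingerprint(config: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    config.hash(&mut hasher);
    hasher.finish()
}

/// Result of one watchdog pass (hypothetical type).
#[derive(Debug, PartialEq)]
enum DriftStatus {
    InSync,
    Drifted,
}

/// Compare the fingerprint of the last applied config with the file contents
/// currently on disk.
fn check_drift(expected_fingerprint: u64, on_disk: &str) -> DriftStatus {
    if fingerprint(on_disk) == expected_fingerprint {
        DriftStatus::InSync
    } else {
        DriftStatus::Drifted
    }
}
```

On `Drifted`, an agent could re-apply the cached config or report the mismatch to the master; which of the two NxMesh does is not specified here.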
 ---
 ## Data Flow
 ### 1. Configuration Push Flow
 ```
 ┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
 │  User  │────▶│  API   │────▶│ Config │────▶│ Event  │────▶│ Agents │
 │ Action │     │ Server │     │ Engine │     │  Bus   │     │        │
 └────────┘     └────────┘     └────────┘     └────────┘     └────────┘
                                                                  │
                                                                  ▼
 ┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
 │ Nginx  │◀────│ Config │◀────│ Template│◀────│ gRPC   │◀────│ Agent  │
 │Reloaded│     │Applied │     │ Render │     │ Stream │     │Receive │
 └────────┘     └────────┘     └────────┘     └────────┘     └────────┘
 ```
 **Flow Description:**
 1. User creates/updates configuration via API or Web UI
 2. Master validates and stores configuration in database
 3. Config Engine determines affected agents
 4. Event Bus broadcasts configuration change event
 5. Agents receive event via gRPC streaming
 6. Agent renders local nginx configuration from templates
 7. Agent validates new configuration (`nginx -t`)
 8. Agent applies configuration via graceful reload
 9. Agent reports status back to master
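Steps 7 to 9 on the agent side can be sketched as a small function. This is a minimal illustration, not the project's implementation: `apply_config` and `ApplyStatus` are hypothetical names, `validate` stands in for running `nginx -t`, and `reload` stands in for a graceful reload.

```rust
/// Outcome the agent would report back to the master (step 9). Hypothetical type.
#[derive(Debug, PartialEq)]
enum ApplyStatus {
    Applied,
    RejectedInvalid,
    ReloadFailed,
}

/// Validate the candidate config, reload nginx, and keep the previous config
/// whenever either step fails.
fn apply_config<V, R>(
    active: &mut String,
    candidate: &str,
    validate: V,
    mut reload: R,
) -> ApplyStatus
where
    V: Fn(&str) -> bool,    // stand-in for `nginx -t`
    R: FnMut(&str) -> bool, // stand-in for a graceful reload
{
    if !validate(candidate) {
        // Step 7 failed: the active config is left untouched.
        return ApplyStatus::RejectedInvalid;
    }
    if !reload(candidate) {
        // Step 8 failed: try to go back to the last known-good config.
        let _ = reload(active.as_str());
        return ApplyStatus::ReloadFailed;
    }
    *active = candidate.to_string();
    ApplyStatus::Applied
}
```

Passing the two operations in as closures keeps the decision logic testable without a running nginx.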
 ### 2. Health Reporting Flow
 ```
 ┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
 │ Nginx  │────▶│ Agent  │────▶│ Master │────▶│  DB    │
 │ Health │     │ Health │     │ API    │     │ Store  │
 └────────┘     └────────┘     └────────┘     └────────┘
                    │
                    ▼
              ┌────────┐
              │Prometheus│
              │ Server │
              └────────┘
 ```
 **Flow Description:**
 1. Agent periodically checks nginx health (HTTP health endpoint)
 2. Agent collects system metrics (CPU, memory, connections)
 3. Agent sends health report to master via gRPC
 4. Master aggregates and stores in database
 5. Prometheus scrapes agent metrics endpoint
 ### 3. Certificate Management Flow
 ```
 ┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
 │ Let's  │◀────│ Master │────▶│ Agent  │────▶│ Nginx  │◀────│ Client │
 │Encrypt │     │ ACME   │     │ Deploy │     │ Serve  │     │Request │
 └────────┘     └────────┘     └────────┘     └────────┘     └────────┘
 ```
 **Flow Description:**
 1. Master requests certificate from Let's Encrypt (ACME protocol)
 2. Master distributes certificate to relevant agents
 3. Agent stores certificate locally (encrypted at rest)
 4. Agent updates nginx configuration with new certificate
 5. Nginx serves HTTPS traffic with new certificate
 ---
 ## Communication Protocols
 ### Master-Agent Protocol
 NxMesh uses a **bidirectional gRPC stream** as the primary communication channel between master and agents.
 ```protobuf
 // agent.proto
 syntax = "proto3";
 package nxmesh.agent;
 service AgentService {
  // Bidirectional streaming for real-time communication
  rpc Stream(stream AgentMessage) returns (stream MasterMessage);
  // Unary calls for specific operations
  rpc ReportHealth(HealthReport) returns (Ack);
  rpc ReportMetrics(MetricsBatch) returns (Ack);
 }
 message AgentMessage {
  string agent_id = 1;
  uint64 timestamp = 2;
  oneof payload {
    RegistrationRequest register = 3;
    HealthReport health = 4;
    ConfigStatus config_status = 5;
    MetricsBatch metrics = 6;
    LogBatch logs = 7;
  }
 }
 message MasterMessage {
  uint64 timestamp = 1;
  oneof payload {
    RegistrationResponse register_response = 2;
    ConfigUpdate config_update = 3;
    Command command = 4;
    Ack ack = 5;
  }
 }
 message ConfigUpdate {
  string config_id = 1;
  uint64 version = 2;
  repeated VirtualHost virtual_hosts = 3;
  repeated Upstream upstreams = 4;
  map<string, string> ssl_certificates = 5;
 }
 ```
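On the agent, the `payload` oneof of `MasterMessage` would typically be handled with a `match`. The sketch below uses a hand-written enum instead of generated code, so everything in it is an assumption: `prost` output has a similar enum shape but different paths and field types, the fields of `RegistrationResponse` and `Command` are not shown above, and skipping a `ConfigUpdate` whose version is not newer than the applied one is an illustrative policy, not documented behavior.

```rust
/// Hand-written mirror of the `MasterMessage.payload` oneof (assumed fields).
#[derive(Debug)]
enum MasterPayload {
    RegisterResponse { agent_id: String },
    ConfigUpdate { config_id: String, version: u64 },
    Command(String),
    Ack,
}

/// What the agent decides to do next. Hypothetical type.
#[derive(Debug, PartialEq)]
enum AgentAction {
    StoreIdentity(String),
    ApplyConfig { config_id: String, version: u64 },
    IgnoreStale,
    RunCommand(String),
    Nothing,
}

/// Map one incoming payload to an action, given the config version the agent
/// has already applied.
fn handle(payload: MasterPayload, applied_version: u64) -> AgentAction {
    match payload {
        MasterPayload::RegisterResponse { agent_id } => AgentAction::StoreIdentity(agent_id),
        // Assumed policy: updates that are not newer than the applied version are skipped.
        MasterPayload::ConfigUpdate { version, .. } if version <= applied_version => {
            AgentAction::IgnoreStale
        }
        MasterPayload::ConfigUpdate { config_id, version } => {
            AgentAction::ApplyConfig { config_id, version }
        }
        MasterPayload::Command(name) => AgentAction::RunCommand(name),
        MasterPayload::Ack => AgentAction::Nothing,
    }
}
```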
 ### Connection Management
 ```
 ┌─────────────────────────────────────────────────────────────────────┐
 │                        CONNECTION LIFECYCLE                          │
 │                                                                      │
 │  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐          │
 │  │  INIT   │───▶│ CONNECT │───▶│ STREAM  │───▶│  READY  │          │
 │  └─────────┘    └─────────┘    └─────────┘    └─────────┘          │
 │                      │               │               │               │
 │                      ▼               ▼               ▼               │
 │                 ┌─────────┐    ┌─────────┐    ┌─────────┐          │
 │                 │  RETRY  │    │RECONNECT│    │  ERROR  │          │
 │                 └─────────┘    └─────────┘    └─────────┘          │
 │                                                                      │
 │  Connection Parameters:                                              │
 │  - Heartbeat interval: 30s                                           │
 │  - Reconnect backoff: 1s, 2s, 4s, 8s... (max 60s)                    │
 │  - gRPC keepalive: 10s ping, 20s timeout                             │
 │  - TLS: Server-side TLS (auto-generated or custom)                   │
 │  - Agent auth: Bootstrap token → Shared secret (HMAC)                │
 └─────────────────────────────────────────────────────────────────────┘
 ```
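The reconnect backoff listed above (1s, 2s, 4s, 8s, capped at 60s) can be computed with a small pure function. This is a sketch: `reconnect_delay` and the two constants are hypothetical names, and jitter is left out because the parameters above do not mention it.

```rust
use std::time::Duration;

const BASE_DELAY: Duration = Duration::from_secs(1);
const MAX_DELAY: Duration = Duration::from_secs(60);

/// Delay before reconnect attempt `attempt` (0-based): 1s, 2s, 4s, ... capped at 60s.
fn reconnect_delay(attempt: u32) -> Duration {
    // 2^attempt, saturating so that large attempt counts cannot overflow.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    BASE_DELAY.saturating_mul(factor).min(MAX_DELAY)
}
```

An agent loop would sleep for `reconnect_delay(n)` after the n-th consecutive failure and reset `n` to zero once the stream is established again.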
 ---
 ## Security Model
 ### Authentication
 | Component | Method | Details |
 |-----------|--------|---------|
 | Master API | JWT (RS256) | Short-lived access tokens, refresh tokens |
 | Master WebSocket | JWT | Same tokens as API |
 | Master-Agent gRPC | **TLS + Shared Secret** | Server TLS + bootstrap token → session HMAC |
 | Agent Registration | One-time Bootstrap Token | Generated in Master UI, single-use, short expiry |
 ### Agent Authentication Flow (TLS + Shared Secret)
 ```
 ┌─────────────┐                                    ┌──────────────┐
 │    Agent    │                                    │    Master    │
 └──────┬──────┘                                    └──────┬───────┘
       │                                                 │
       │  1. TLS Handshake (verify server certificate)   │
       │◄───────────────────────────────────────────────►│
       │                                                 │
       │  2. Register with bootstrap_token               │
       │  ── gRPC: RegisterAgent { token } ─────────────▶│
       │                                                 │
       │  3. Receive agent_id + session_key (+ key_id)   │
       │◄────────────────────────────────────────────────│
       │     [Encrypted over TLS]                        │
       │                                                 │
       │  4. Subsequent requests: HMAC-signed            │
       │  ── gRPC + Headers:                             │
       │     X-Agent-ID: <agent_id>                      │
       │     X-Key-ID: <session_key_id>                  │
       │     X-Signature: HMAC(request_body, session_key)│
       │────────────────────────────────────────────────▶│
       │                                                 │
       │  5. Key Rotation (primary/secondary)            │
       │◄═══════════════════════════════════════════════►│
 ```
 **Security Properties:**
 - **TLS**: Encrypts channel, verifies master identity (server cert)
 - **Bootstrap Token**: One-time use, time-limited, proves initial identity
 - **Session Key**: Per-agent secret, used for HMAC request signing
 - **Key Rotation**: Primary/secondary key design for seamless rotation
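The primary/secondary rotation can be sketched as a two-slot key ring that the master keeps per agent. The names `KeyRing`, `rotate`, and `lookup` are hypothetical, and so is the key format (string IDs, raw bytes). The HMAC computation itself is omitted because it needs a crypto crate.

```rust
/// Session keys the master accepts for one agent (assumed layout).
struct KeyRing {
    primary: (String, Vec<u8>),
    secondary: Option<(String, Vec<u8>)>,
}

impl KeyRing {
    fn new(key_id: &str, key: &[u8]) -> Self {
        KeyRing {
            primary: (key_id.to_string(), key.to_vec()),
            secondary: None,
        }
    }

    /// Install a new primary key. The old primary stays valid as secondary,
    /// so requests signed just before the rotation still verify.
    fn rotate(&mut self, key_id: &str, key: &[u8]) {
        let old = std::mem::replace(&mut self.primary, (key_id.to_string(), key.to_vec()));
        self.secondary = Some(old);
    }

    /// Find the key named by the `X-Key-ID` header, if it is still accepted.
    fn lookup(&self, key_id: &str) -> Option<&[u8]> {
        if self.primary.0.as_str() == key_id {
            return Some(self.primary.1.as_slice());
        }
        match &self.secondary {
            Some((id, key)) if id.as_str() == key_id => Some(key.as_slice()),
            _ => None,
        }
    }
}
```

With two slots, a key stops being accepted after the second rotation that follows it, which bounds how long an old session key remains usable.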
 ### Authorization (RBAC)
 ```yaml
 # Example RBAC Configuration
 roles:
  admin:
    permissions:
      - "*:*"
  operator:
    permissions:
      - "config:read"
      - "config:write"
      - "agent:read"
      - "agent:reload"
  viewer:
    permissions:
      - "config:read"
      - "agent:read"
      - "metrics:read"
 # Resource hierarchy
 resources:
  organization:
    workspace:
      - agent
      - certificate
      - config  # virtual_host, upstream
 ```
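A permission check against strings such as `config:read` or `*:*` might look like the following. This is a sketch under assumptions: `is_allowed` is a hypothetical name, and treating `*` as a wildcard on either side of the colon (for example `config:*`) goes beyond the `*:*` case shown in the example configuration.

```rust
/// True if any granted permission (`resource:action`, with `*` as a wildcard)
/// covers the required permission.
fn is_allowed(granted: &[&str], required: &str) -> bool {
    // A required permission without a colon is malformed and never matches.
    let Some((req_resource, req_action)) = required.split_once(':') else {
        return false;
    };
    granted.iter().any(|g| match g.split_once(':') {
        Some((resource, action)) => {
            (resource == "*" || resource == req_resource)
                && (action == "*" || action == req_action)
        }
        None => false,
    })
}
```

For the `operator` role above, `is_allowed(&["config:read", "config:write", "agent:read", "agent:reload"], "agent:reload")` is true, while a check for `metrics:read` is false.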
 ## Deployment Patterns
 ### Pattern 1: Docker Sidecar (Development/Single Host)
 ```yaml
 # docker-compose.yml
 version: '3.8'
 services:
  nxmesh-master:
    image: nxmesh/master:latest
    ports:
      - "8080:8080"   # API
      - "8443:8443"   # gRPC
    environment:
      - DATABASE_URL=postgres://...
  nginx-site-a:
    image: nginx:alpine
    volumes:
      - site-a-html:/usr/share/nginx/html
  nxmesh-agent-a:
    image: nxmesh/agent:latest
    network_mode: service:nginx-site-a  # Share network namespace with nginx
    pid: service:nginx-site-a            # Share PID namespace (for nginx reload)
    environment:
      - NXMESH_MASTER_URL=wss://nxmesh-master:8443
      - NXMESH_AGENT_TOKEN=${AGENT_TOKEN_A}
      - NXMESH_DEPLOYMENT_MODE=docker_sidecar
      - NXMESH_NGINX_PID_FILE=/var/run/nginx.pid
 ```
 **Pros:** Simple, isolated, good for development
 **Cons:** Docker-only, single host limitation
 ### Pattern 2: Kubernetes Sidecar
 ```yaml
 # deployment.yaml
 apiVersion: apps/v1
 kind: Deployment
 metadata:
  name: web-service
 spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: nginx
          image: nginx:alpine
          volumeMounts:
            - name: nxmesh-config
              mountPath: /etc/nginx/conf.d
        - name: nxmesh-agent
          image: nxmesh/agent:latest
          env:
            - name: NXMESH_MASTER_URL
              value: "wss://nxmesh-master.default.svc:8443"
            - name: NXMESH_AGENT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: nxmesh-agent-token
                  key: token
          volumeMounts:
            - name: nxmesh-config
              mountPath: /etc/nginx/conf.d
      volumes:
        - name: nxmesh-config
          emptyDir: {}
 ```
 **Pros:** Native K8s integration, auto-scaling, health checks
 **Cons:** K8s-only, more complex setup
 ### Pattern 3: Standalone (VM/Bare Metal)
 ```
 ┌─────────────────────────────────────────────────────────────────┐
 │                         VM / Bare Metal                          │
 │  ┌───────────────────────────────────────────────────────────┐  │
 │  │  Systemd                                                   │  │
 │  │  ┌─────────────────────────────────────────────────────┐  │  │
 │  │  │ nxmesh-agent.service                                │  │  │
 │  │  │  ┌──────────────┐  ┌──────────────┐  ┌───────────┐  │  │  │
 │  │  │  │   Agent      │  │   Nginx      │  │  Config   │  │  │  │
 │  │  │  │   Process    │──│   Process    │──│  Files    │  │  │  │
 │  │  │  └──────────────┘  └──────────────┘  └───────────┘  │  │  │
 │  │  └─────────────────────────────────────────────────────┘  │  │
 │  └───────────────────────────────────────────────────────────┘  │
 └─────────────────────────────────────────────────────────────────┘
 ```
 **Pros:** Works anywhere, minimal dependencies
 **Cons:** Manual setup, no container isolation
 ---
 ## Failure Handling
 ### Master Failure Scenarios
 | Scenario | Impact | Mitigation |
 |----------|--------|------------|
 | Master unreachable | Agents continue with cached config | Agents retry with exponential backoff |
 | Master crashes | New connections fail, existing continue | External load balancer + health checks (HA: future) |
 | Database down | Read-only mode for existing configs | Database replication, failover |
 ### Agent Failure Scenarios
 | Scenario | Impact | Mitigation |
 |----------|--------|------------|
 | Agent crashes | Nginx continues running | Systemd restart, watchdog |
 | Config validation fails | Previous config kept | Atomic config swap, rollback |
 | Nginx crashes | Agent restarts nginx | Health checks, auto-restart |
 | Network partition | Agent operates in "island mode" | Local cache, reconciliation on reconnect |
 ### Recovery Procedures
 ```
 ┌─────────────────────────────────────────────────────────────────────┐
 │                     FAILURE RECOVERY FLOW                            │
 │                                                                      │
 │  Agent Disconnect                                                     │
 │       │                                                               │
 │       ▼                                                               │
 │  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐           │
 │  │  Retry  │───▶│  Cache  │───▶│  Alert  │───▶│  Watch  │           │
 │  │ Connect │    │ Config  │    │ Master  │    │  Dog    │           │
 │  └─────────┘    └─────────┘    └─────────┘    └─────────┘           │
 │       │                                            │                  │
 │       ▼                                            ▼                  │
 │ ┌───────────┐                                 ┌─────────┐            │
 │ │Reconnected│                                 │ Restart │            │
 │ │   Sync    │                                 │ Nginx   │            │
 │ └───────────┘                                 └─────────┘            │
 │                                                                      │
 │  Recovery Strategies:                                                │
 │  1. Exponential backoff for reconnection                             │
 │  2. Circuit breaker for failed operations                            │
 │  3. Config checksum verification after reconnect                     │
 │  4. Automatic nginx restart on health check failure                  │
 └─────────────────────────────────────────────────────────────────────┘
 ```
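Recovery strategy 1 can be sketched as a pure delay function. This is a minimal illustration, not the agent's actual reconnect logic: the function name, the 0-based attempt counter, and the cap parameter are assumptions.

```rust
use std::time::Duration;

/// Delay before reconnect attempt `attempt` (0-based): `base * 2^attempt`, capped at `max`.
/// Hypothetical helper; the real agent may use different parameters.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // checked_pow / checked_mul return None on overflow, which is treated as "use the cap".
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |delay| delay.min(max))
}
```

With a 1 second base and a 60 second cap, attempts 0, 3 and 6 wait 1, 8 and 60 seconds. Production reconnect loops usually add random jitter on top so that many agents do not reconnect in lockstep; that is left out here to keep the sketch std-only.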
 ---
 ## Technology Stack
 | Layer | Technology | Rationale |
 |-------|------------|-----------|
 | **Master Backend** | Rust (Axum) | Performance, safety, async ecosystem |
 | **Agent** | Rust (Tokio) | Small binary, low memory, fast startup |
 | **Database** | PostgreSQL | ACID, JSON support, reliability |
 | **Cache** | Redis | Fast key-value, pub/sub for events |
 | **Frontend** | React + Vite (embedded) | Static build served by master, fast HMR in dev |
 | **gRPC** | Tonic | Native Rust implementation |
 | **ORM** | SeaORM | Async, type-safe, migration support |
 | **Config Template** | Handlebars | Logic-less, secure templating |
 | **Metrics** | Prometheus | Industry standard, rich ecosystem |
 | **Tracing** | OpenTelemetry | Vendor-neutral, future-proof |
--- a/docs/features.md
+++ b/docs/features.md
@@ -0,0 +1,814 @@
 # NxMesh Feature Specification
 ## Table of Contents
 1. [Core Features](#core-features)
 2. [Master Features](#master-features)
 3. [Agent Features](#agent-features)
 4. [Configuration Management](#configuration-management)
 5. [Observability](#observability)
 6. [Security Features](#security-features)
 ---
 ## Core Features
 ### CF-001: Multi-tenancy with Organizations and Workspaces
 **Description**: Support for multiple organizations with isolated workspaces within each organization.
 **Requirements**:
 - Organizations are top-level resource containers
 - Each organization can have multiple workspaces
 - Resources (agents, configs, certificates) are scoped to a workspace
 - Cross-workspace visibility is configurable
 **Data Model**:
 ```rust
 struct Organization {
    id: Uuid,
    name: String,
    slug: String,  // URL-friendly identifier
    created_at: DateTime,
    settings: OrganizationSettings,
 }
 struct Workspace {
    id: Uuid,
    organization_id: Uuid,
    name: String,
    slug: String,
    created_at: DateTime,
 }
 ```
 **API Endpoints**:
 - `GET /api/v1/organizations` - List organizations
 - `POST /api/v1/organizations` - Create organization
 - `GET /api/v1/organizations/{id}/workspaces` - List workspaces
 - `POST /api/v1/organizations/{id}/workspaces` - Create workspace
 ---
 ### CF-002: Agent Registration and Lifecycle Management
 **Description**: Agents must register with the master before receiving configurations.
 **Registration Flow**:
 1. Administrator generates bootstrap token in Master UI
 2. Token is provided to agent via environment variable or config file
 3. Agent establishes TLS connection to master (verifies server certificate)
 4. Agent sends bootstrap token for registration
 5. Master validates token and establishes shared secret:
   - Master generates session_key (per-agent) + key_id
   - Session key used for HMAC request signing
   - Primary/secondary key design for rotation
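The primary/secondary key design in step 5 could look roughly like the sketch below. `KeyRing` and its methods are hypothetical names, and the HMAC computation itself is omitted; the sketch only shows how a `key_id` selects a key and how rotation keeps the previous key valid.

```rust
/// Hypothetical key ring: the current (primary) session key plus the previous (secondary) one.
struct KeyRing {
    primary: (String, Vec<u8>), // (key_id, key bytes)
    secondary: Option<(String, Vec<u8>)>,
}

impl KeyRing {
    fn new(key_id: &str, key: Vec<u8>) -> Self {
        KeyRing { primary: (key_id.to_string(), key), secondary: None }
    }

    /// Look up the key a request was signed with; both generations are accepted.
    fn key_for(&self, key_id: &str) -> Option<&[u8]> {
        if self.primary.0 == key_id {
            return Some(self.primary.1.as_slice());
        }
        match &self.secondary {
            Some((id, key)) if id == key_id => Some(key.as_slice()),
            _ => None,
        }
    }

    /// Install a new primary key; the old primary stays valid as the secondary.
    fn rotate(&mut self, key_id: &str, key: Vec<u8>) {
        let old = std::mem::replace(&mut self.primary, (key_id.to_string(), key));
        self.secondary = Some(old);
    }
}
```

Because the verifier accepts either generation, requests signed just before a rotation still validate, and a key only becomes invalid after a second rotation.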
 **Agent States**:
 ```rust
 enum AgentState {
    Pending,      // Registered but never connected
    Online,       // Connected and healthy
    Offline,      // Disconnected
    Degraded,     // Connected but health checks failing
    Maintenance,  // Manually placed in maintenance mode
 }
 ```
 **Agent Metadata**:
 ```rust
 struct Agent {
    id: Uuid,
    workspace_id: Uuid,
    name: String,
    hostname: String,
    ip_address: String,
    version: String,
    state: AgentState,
    deployment_mode: DeploymentMode,  // DockerSidecar, K8sSidecar, Standalone
    last_seen_at: DateTime,
    capabilities: Vec<String>,  // e.g., ["http3", "websocket", "rate_limiting"]
    labels: HashMap<String, String>,  // e.g., {"env": "prod", "region": "us-east"}
 }
 ```
 **API Endpoints**:
 - `POST /api/v1/agents/register` - Register new agent
 - `GET /api/v1/agents` - List agents
 - `GET /api/v1/agents/{id}` - Get agent details
 - `POST /api/v1/agents/{id}/tokens` - Generate registration token
 - `DELETE /api/v1/agents/{id}` - Deregister agent
 ---
 ### CF-003: Real-time Configuration Distribution
 **Description**: Push configuration changes to agents in real-time with delivery guarantees.
 **Requirements**:
 - Config changes propagate to all affected agents within 5 seconds
 - Support for targeted updates (specific agents or groups)
 - Config versioning with rollback capability
 - Delivery confirmation from agents
 **Configuration Scope**:
 ```rust
 enum ConfigScope {
    Global,           // All agents
    Workspace,        // All agents in workspace
    AgentGroup(String), // Agents with specific label selector
    Agent(Uuid),      // Single agent
 }
 ```
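A sketch of how the master might decide which agents a scoped change applies to. Several things here are assumptions: `Uuid` is aliased to `u128` to keep the example std-only, `AgentInfo` is a trimmed-down stand-in for `Agent`, the label selector syntax is taken to be comma-separated `key=value` pairs, and group selectors are assumed to stay within the config's workspace.

```rust
use std::collections::HashMap;

type Uuid = u128; // stand-in for a real UUID type

enum ConfigScope {
    Global,
    Workspace,
    AgentGroup(String),
    Agent(Uuid),
}

struct AgentInfo {
    id: Uuid,
    workspace_id: Uuid,
    labels: HashMap<String, String>,
}

/// Assumed selector syntax: comma-separated `key=value` pairs, all of which must match.
fn selector_matches(selector: &str, labels: &HashMap<String, String>) -> bool {
    selector
        .split(',')
        .map(str::trim)
        .filter(|pair| !pair.is_empty())
        .all(|pair| match pair.split_once('=') {
            Some((key, value)) => labels.get(key.trim()).map(String::as_str) == Some(value.trim()),
            None => false, // a malformed pair never matches
        })
}

/// `config_workspace` is the workspace that owns the configuration being pushed.
fn scope_matches(scope: &ConfigScope, config_workspace: Uuid, agent: &AgentInfo) -> bool {
    match scope {
        ConfigScope::Global => true,
        ConfigScope::Workspace => agent.workspace_id == config_workspace,
        ConfigScope::AgentGroup(selector) => {
            agent.workspace_id == config_workspace && selector_matches(selector, &agent.labels)
        }
        ConfigScope::Agent(id) => agent.id == *id,
    }
}
```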
 **Delivery Guarantees**:
 - At-least-once delivery
 - Automatic retry with exponential backoff
 - Config checksum verification
 - Offline agents receive updates on reconnection
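At-least-once delivery means an agent can see the same update more than once, so applying updates has to be idempotent. The sketch below shows one way to get that on the agent side; the `Applier`, `Ack`, and simplified `ConfigUpdate` types are hypothetical, and FNV-1a is used only as a std-only stand-in for a real digest such as SHA-256.

```rust
/// FNV-1a, a stand-in checksum for illustration; not a cryptographic digest.
fn checksum(payload: &str) -> u64 {
    payload
        .bytes()
        .fold(0xcbf29ce484222325u64, |hash, byte| (hash ^ byte as u64).wrapping_mul(0x100000001b3))
}

/// Simplified, hypothetical update message.
struct ConfigUpdate {
    version: u64,
    payload: String,
    checksum: u64,
}

#[derive(Debug, PartialEq)]
enum Ack {
    Applied(u64),
    Duplicate(u64),
    ChecksumMismatch,
}

#[derive(Default)]
struct Applier {
    last_applied: Option<u64>,
}

impl Applier {
    fn handle(&mut self, update: &ConfigUpdate) -> Ack {
        if checksum(&update.payload) != update.checksum {
            return Ack::ChecksumMismatch; // payload damaged in transit; master should resend
        }
        if self.last_applied.map_or(false, |applied| update.version <= applied) {
            return Ack::Duplicate(update.version); // redelivery: acknowledge again, do not reapply
        }
        // Rendering and the nginx reload would happen here.
        self.last_applied = Some(update.version);
        Ack::Applied(update.version)
    }
}
```

Acknowledging duplicates instead of ignoring them lets the master stop retrying even when the first acknowledgement was lost.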
 ---
 ## Master Features
 ### MF-001: RESTful API
 **Description**: Comprehensive REST API for all operations.
 **Base URL**: `/api/v1`
 **Resource Endpoints**:
 | Resource | Endpoints |
 |----------|-----------|
 | Organizations | GET, POST, PATCH, DELETE `/organizations` |
 | Workspaces | GET, POST, PATCH, DELETE `/workspaces` |
 | Agents | GET, POST, PATCH, DELETE `/agents` |
 | VirtualHosts | GET, POST, PATCH, DELETE `/virtual-hosts` |
 | Upstreams | GET, POST, PATCH, DELETE `/upstreams` |
 | Certificates | GET, POST, DELETE `/certificates` |
 | AccessLogs | GET `/access-logs` |
 | Metrics | GET `/metrics` |
 **Response Format**:
 ```json
 {
  "data": { ... },
  "meta": {
    "page": 1,
    "per_page": 20,
    "total": 100
  },
  "links": {
    "self": "/api/v1/agents?page=1",
    "next": "/api/v1/agents?page=2",
    "prev": null
  }
 }
 ```
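The `links` object follows mechanically from `page`, `per_page`, and `total`. A minimal sketch of that calculation, assuming 1-based page numbers and a hypothetical `page_links` helper (serialization to JSON is left out):

```rust
struct Links {
    self_link: String,
    next: Option<String>,
    prev: Option<String>,
}

/// Build pagination links for a collection endpoint such as "/api/v1/agents".
fn page_links(base: &str, page: u64, per_page: u64, total: u64) -> Links {
    let per_page = per_page.max(1); // guard against division by zero
    let last_page = ((total + per_page - 1) / per_page).max(1); // ceiling division
    let url = |p: u64| format!("{}?page={}", base, p);
    Links {
        self_link: url(page),
        next: (page < last_page).then(|| url(page + 1)),
        prev: (page > 1).then(|| url(page - 1)),
    }
}
```

For `page = 1`, `per_page = 20`, `total = 100` this yields the same `self`, `next`, and `prev` values as the response example, with `None` standing in for JSON `null`.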
 **Error Format**:
 ```json
 {
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Invalid configuration",
    "details": [
      {"field": "server_name", "message": "Invalid domain format"}
    ]
  }
 }
 ```
 ---
 ### MF-002: Web-based Admin Console (Embedded)
 **Description**: Modern web UI for managing the entire system. Built with React + Vite and served as static files embedded directly in the master binary.
 **Pages**:
 | Page | Features |
 |------|----------|
 | Dashboard | Agent status, recent events, traffic overview |
 | Agents | List, detail view, logs, metrics graphs |
 | Configurations | Virtual host editor, upstream management |
 | Certificates | SSL certificate list, expiration alerts |
 | Access Control | Users, roles, permissions management |
 | Settings | Organization settings, integrations |
 **Key UI Features**:
 - Real-time updates via WebSocket
 - Monaco editor for nginx configuration
 - Visual topology view (agent connections)
 - Dark/light mode support
 - Responsive design
 ---
 ### MF-003: Configuration Template Engine
 **Description**: Templating system for generating nginx configurations.
 **Template Variables**:
 ```handlebars
 # Example virtual host template
 server {
    listen {{port}} {{#if ssl}}ssl{{/if}} {{#if http2}}http2{{/if}};
    server_name {{server_name}};
    {{#if ssl}}
    ssl_certificate {{ssl_certificate_path}};
    ssl_certificate_key {{ssl_certificate_key_path}};
    {{/if}}
    location {{location_path}} {
        proxy_pass http://{{upstream_name}};
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        {{#each custom_headers}}
        add_header {{name}} "{{value}}";
        {{/each}}
        {{#if rate_limiting}}
        limit_req zone={{rate_limit_zone}} burst={{rate_limit_burst}};
        {{/if}}
    }
 }
 ```
 **Built-in Templates**:
 - `default` - Standard reverse proxy
 - `spa` - Single Page Application (with fallback to index.html)
 - `api` - API gateway with rate limiting
 - `static` - Static file serving with caching
 - `websocket` - WebSocket proxy with connection upgrades
 ---
 ### MF-004: Certificate Management (ACME)
 **Description**: Automatic SSL/TLS certificate provisioning via Let's Encrypt.
 **Features**:
 - ACME v2 protocol support
 - HTTP-01 and DNS-01 challenges
 - Automatic renewal (30 days before expiry)
 - Wildcard certificate support (DNS-01)
 - Certificate monitoring and alerts
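The 30-day renewal rule reduces to a small predicate. The sketch below uses `std::time::SystemTime` rather than whatever date type the master ends up using, and the function and constant names are assumptions.

```rust
use std::time::{Duration, SystemTime};

/// Renew once the certificate is within 30 days of expiry.
const RENEWAL_WINDOW: Duration = Duration::from_secs(30 * 24 * 60 * 60);

/// Hypothetical check a renewal scheduler could run per certificate.
fn needs_renewal(expires_at: SystemTime, now: SystemTime, auto_renew: bool) -> bool {
    if !auto_renew {
        return false;
    }
    match expires_at.duration_since(now) {
        Ok(remaining) => remaining <= RENEWAL_WINDOW,
        Err(_) => true, // expiry is already in the past
    }
}
```

Passing `now` as a parameter instead of calling `SystemTime::now()` inside keeps the rule easy to test.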
 **Certificate Entity**:
 ```rust
 struct Certificate {
    id: Uuid,
    workspace_id: Uuid,
    domain: String,
    is_wildcard: bool,
    provider: CertificateProvider,  // LetsEncrypt, Custom
    status: CertificateStatus,      // Pending, Active, Expired, Error
    issued_at: DateTime,
    expires_at: DateTime,
    auto_renew: bool,
    certificate_pem: Option<String>,  // Encrypted at rest
    private_key_pem: Option<String>,  // Encrypted at rest
 }
 ```
 ---
 ## Agent Features
 ### AF-001: Nginx Lifecycle Management
 **Description**: Agent manages nginx process lifecycle based on deployment mode.
 **Docker Sidecar Mode**:
 - Shares PID namespace with nginx container (via `pid: service:nginx`)
 - Directly signals nginx process for reload/restart
 - Monitors nginx via health checks
 **Standalone Mode**:
 - Direct process management (signals to PID from file)
 - systemd integration (optional, for service management)
 - PID file monitoring
 **Lifecycle Actions**:
 - `start` - Start nginx
 - `stop` - Graceful shutdown
 - `reload` - Hot reload configuration
 - `restart` - Full restart
 - `test` - Validate configuration
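One possible mapping from lifecycle actions to nginx invocations, assuming the `nginx` binary is on the agent's `PATH` (the standalone case). In sidecar mode the agent would instead send the equivalent signals to the nginx master process (`SIGHUP` for reload, `SIGQUIT` for graceful shutdown). The enum and function names are illustrative.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum LifecycleAction {
    Start,
    Stop,
    Reload,
    Restart,
    Test,
}

/// Returns the command lines (program plus arguments) to run, in order, for an action.
fn nginx_invocations(action: LifecycleAction) -> Vec<Vec<&'static str>> {
    match action {
        LifecycleAction::Start => vec![vec!["nginx"]],
        // `-s quit` is the graceful shutdown; `-s stop` would be the fast one.
        LifecycleAction::Stop => vec![vec!["nginx", "-s", "quit"]],
        LifecycleAction::Reload => vec![vec!["nginx", "-s", "reload"]],
        LifecycleAction::Test => vec![vec!["nginx", "-t"]],
        // nginx has no restart signal, so a restart is a graceful stop followed by a start.
        LifecycleAction::Restart => vec![vec!["nginx", "-s", "quit"], vec!["nginx"]],
    }
}
```

A caller executing `Restart` needs to wait for the old master process to exit between the two steps, otherwise the new instance fails to bind its listen sockets.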
 ---
 ### AF-002: Configuration Rendering and Application
 **Description**: Agent renders nginx configs from master templates and applies them using atomic symlink swaps for zero-downtime updates.
 **Config Directory Structure**:
 ```
 /etc/nginx/
 ├── nginx.conf              # Contains: include /etc/nginx/conf.d/current/*.conf
 ├── conf.d/
 │   ├── current -> ./20260302143000/    # Symlink to active deployment
 │   ├── 20260302143000/                 # Active config (timestamped)
 │   │   ├── default.conf
 │   │   └── upstream.conf
 │   ├── 20260302141500/                 # Previous deployment (for rollback)
 │   │   ├── default.conf
 │   │   └── upstream.conf
 │   └── 20260302140000/                 # Older deployment (cleanup candidate)
 ```
 **Config Rendering Flow**:
 1. Receive ConfigUpdate from master
 2. Create new deployment folder: `./conf.d/<timestamp>/`
 3. Render nginx config files into timestamped folder
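 4. **Validate** the new config before switching: run `nginx -t -c <file>` against a top-level config that includes `/etc/nginx/conf.d/<timestamp>/*.conf` (the layout above keeps `nginx.conf` outside the timestamped folders, so there is no per-deployment `nginx.conf` to point at)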
 5. If validation passes, **atomically update symlink**: `current` → `<timestamp>/`
 6. Execute graceful nginx reload
 7. Verify reload success (health check)
 8. Report status to master
 9. Clean up old deployments (keep the N most recent versions)
 **Atomic Config Swap**:
 ```rust
 async fn apply_config(&self, config: ConfigUpdate) -> Result<()> {
    let timestamp = generate_timestamp();
    let deploy_dir = self.conf_d_path.join(&timestamp);
    let symlink_path = self.conf_d_path.join("current");
    // 1. Render config to new timestamped directory
    self.render_config(&config, &deploy_dir).await?;
    // 2. Validate BEFORE switching symlink (point to new folder directly)
    self.validate_config(&deploy_dir).await?;
    // 3. Atomic symlink swap (Unix: symlink + rename)
    let temp_link = self.conf_d_path.join("current.tmp");
    // Remove a stale temp link left by an interrupted run; a "not found" error is fine
    let _ = tokio::fs::remove_file(&temp_link).await;
    tokio::fs::symlink(&deploy_dir, &temp_link).await?;
    tokio::fs::rename(&temp_link, &symlink_path).await?;  // Atomic: rename replaces the old link
    // 4. Reload nginx (picks up new symlink target)
    self.reload_nginx().await?;
    // 5. Verify and cleanup
    self.verify_health().await?;
    self.cleanup_old_deployments(5).await?;  // Keep last 5 versions
    self.report_success(config.id, timestamp).await;
    Ok(())
 }
 ```
 **Rollback Strategy**:
 ```rust
 async fn rollback(&self, target_timestamp: &str) -> Result<()> {
    let target_dir = self.conf_d_path.join(target_timestamp);
    let symlink_path = self.conf_d_path.join("current");
    // Verify target exists
    if !target_dir.exists() {
        return Err(Error::RollbackTargetNotFound);
    }
    // Atomic symlink swap back to previous deployment
    let temp_link = self.conf_d_path.join("current.tmp");
    tokio::fs::symlink(&target_dir, &temp_link).await?;
    tokio::fs::rename(&temp_link, &symlink_path).await?;
    // Reload nginx
    self.reload_nginx().await?;
    Ok(())
 }
 ```
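The "keep N recent versions" cleanup can be expressed as a pure selection step over the deployment folder names. This is a sketch under two assumptions: folder names are fixed-width timestamps as in the directory layout, and the folder that `current` points at must never be deleted, even after a rollback to an older version. The function name is hypothetical.

```rust
/// Given deployment folder names (e.g. "20260302143000"), return those safe to delete.
fn deployments_to_remove(mut names: Vec<String>, current: &str, keep: usize) -> Vec<String> {
    // Fixed-width timestamps sort chronologically as plain strings.
    names.sort();
    names.reverse(); // newest first
    names
        .into_iter()
        .enumerate()
        // Keep the `keep` newest folders and always keep the active one.
        .filter(|(index, name)| *index >= keep && name != current)
        .map(|(_, name)| name)
        .collect()
}
```

Separating the selection from the actual `remove_dir_all` calls makes the retention rule testable without touching the filesystem.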
 ---
 ### AF-003: Health Monitoring and Reporting
 **Description**: Continuous health monitoring of nginx and the host system.
 **Health Checks**:
 - **Nginx Health**: HTTP request to nginx health endpoint
 - **Configuration Health**: Verify current config matches expected
 - **Resource Health**: CPU, memory, disk usage
 - **Connection Health**: Active connections, request rate
 **Health Report Structure**:
 ```rust
 struct HealthReport {
    agent_id: Uuid,
    timestamp: DateTime,
    nginx_status: NginxStatus,
    system_metrics: SystemMetrics,
    config_checksum: String,
    alerts: Vec<Alert>,
 }
 struct NginxStatus {
    is_running: bool,
    pid: Option<u32>,
    uptime_seconds: u64,
    active_connections: u32,
    requests_per_second: f64,
 }
 struct SystemMetrics {
    cpu_percent: f64,
    memory_used_mb: u64,
    memory_total_mb: u64,
    disk_used_gb: u64,
    disk_total_gb: u64,
 }
 ```
 **Reporting Interval**: Configurable (default: 30 seconds)
 ---
 ### AF-004: Metrics Collection and Export
 **Description**: Collect and expose metrics in Prometheus format.
 **Metrics Endpoint**: `GET /metrics` (on agent)
 **Built-in Metrics**:
 ```
 # Nginx metrics (parsed from stub_status)
 nxmesh_nginx_connections_active{agent_id="..."} 42
 nxmesh_nginx_connections_reading{agent_id="..."} 5
 nxmesh_nginx_connections_writing{agent_id="..."} 30
 nxmesh_nginx_connections_waiting{agent_id="..."} 7
 nxmesh_nginx_requests_total{agent_id="..."} 1234567
 # Agent metrics
 nxmesh_agent_uptime_seconds{agent_id="..."} 86400
 nxmesh_agent_master_connection_status{agent_id="..."} 1
 nxmesh_agent_config_version{agent_id="...",version="123"} 1
 # System metrics
 nxmesh_system_cpu_percent{agent_id="..."} 25.5
 nxmesh_system_memory_used_bytes{agent_id="..."} 1073741824
 nxmesh_system_disk_used_bytes{agent_id="..."} 53687091200
 ```
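The nginx metrics above come from the plain-text `stub_status` page. A minimal parser for that format might look like this; the `StubStatus` struct and function name are assumptions, and the mapping to `nxmesh_*` metric names is left to the exporter.

```rust
#[derive(Debug, PartialEq)]
struct StubStatus {
    active: u64,
    accepts: u64,
    handled: u64,
    requests: u64,
    reading: u64,
    writing: u64,
    waiting: u64,
}

/// Parse the four-line body served by nginx's stub_status module.
fn parse_stub_status(body: &str) -> Option<StubStatus> {
    let mut lines = body.lines().map(str::trim).filter(|line| !line.is_empty());

    // Line 1: "Active connections: 291"
    let active: u64 = lines.next()?.strip_prefix("Active connections:")?.trim().parse().ok()?;

    // Line 2 is the "server accepts handled requests" header; line 3 holds the counters.
    lines.next()?;
    let counters = lines
        .next()?
        .split_whitespace()
        .map(|n| n.parse::<u64>())
        .collect::<Result<Vec<u64>, _>>()
        .ok()?;
    if counters.len() != 3 {
        return None;
    }

    // Line 4: "Reading: 6 Writing: 179 Waiting: 106"
    let tokens: Vec<&str> = lines.next()?.split_whitespace().collect();
    if tokens.len() != 6 || tokens[0] != "Reading:" || tokens[2] != "Writing:" || tokens[4] != "Waiting:" {
        return None;
    }

    Some(StubStatus {
        active,
        accepts: counters[0],
        handled: counters[1],
        requests: counters[2],
        reading: tokens[1].parse().ok()?,
        writing: tokens[3].parse().ok()?,
        waiting: tokens[5].parse().ok()?,
    })
}
```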
 **Custom Metrics**: Agents can collect custom metrics from nginx access logs
 ---
 ### AF-005: Offline Operation and Recovery
 **Description**: Agent can operate independently when master is unreachable.
 **Offline Capabilities**:
 - Continue serving traffic with cached configuration
 - Local health monitoring continues
 - Metrics are buffered for later transmission
 - Automatic reconnection attempts
 **Recovery Flow**:
 1. Detect disconnection from master
 2. Enter "offline mode"
 3. Continue operating with cached config
 4. Buffer metrics and logs
 5. Attempt reconnection with exponential backoff
 6. On reconnection:
   - Sync configuration (compare checksums)
   - Transmit buffered metrics
   - Resume normal operation
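Buffering metrics while offline needs an upper bound, otherwise a long partition exhausts the agent's memory. A sketch of a bounded buffer that drops the oldest samples first; the type name and the drop-oldest policy are assumptions rather than something this specification prescribes.

```rust
use std::collections::VecDeque;

/// Fixed-capacity buffer for samples collected while the master is unreachable.
struct OfflineBuffer<T> {
    items: VecDeque<T>,
    capacity: usize,
    dropped: u64, // how many samples were discarded, worth reporting on reconnect
}

impl<T> OfflineBuffer<T> {
    fn new(capacity: usize) -> Self {
        OfflineBuffer { items: VecDeque::with_capacity(capacity), capacity, dropped: 0 }
    }

    /// Add a sample, discarding the oldest one if the buffer is full.
    fn push(&mut self, item: T) {
        if self.capacity == 0 {
            self.dropped += 1;
            return;
        }
        if self.items.len() == self.capacity {
            self.items.pop_front();
            self.dropped += 1;
        }
        self.items.push_back(item);
    }

    /// Take everything out in arrival order, e.g. to transmit after reconnecting.
    fn drain(&mut self) -> Vec<T> {
        self.items.drain(..).collect()
    }
}
```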
 ---
 ## Configuration Management
 ### CM-001: Virtual Host Configuration
 **Description**: Define nginx server blocks (virtual hosts) via API/UI.
 **VirtualHost Entity**:
 ```rust
 struct VirtualHost {
    id: Uuid,
    workspace_id: Uuid,
    name: String,              // Human-readable name
    server_name: String,       // Domain name(s), comma-separated
    listen_port: u16,          // Usually 80 or 443
    ssl_enabled: bool,
    ssl_certificate_id: Option<Uuid>,
    // Routing configuration
    locations: Vec<Location>,
    // Advanced settings
    http2_enabled: bool,
    http3_enabled: bool,
    gzip_enabled: bool,
    rate_limiting: Option<RateLimitConfig>,
    // Target agents
    target_agents: AgentSelector,
 }
 struct Location {
    path: String,              // e.g., "/api" or "~ \.php$"
    proxy_pass: Option<String>, // e.g., "http://backend"
    upstream_id: Option<Uuid>,
    root: Option<String>,      // For static files
    index: Option<String>,     // e.g., "index.html"
    custom_headers: Vec<Header>,
    rewrite_rules: Vec<RewriteRule>,
 }
 ```
 **Validation Rules**:
 - `server_name` must be valid domain(s)
 - `listen_port` must be 1-65535
 - SSL certificate must exist if `ssl_enabled` is true
 - At least one location must be defined
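These rules could be checked along the following lines. The sketch makes several assumptions: `server_name` is comma-separated as noted on the entity, a leading `*.` wildcard is allowed, the hostname check is a simplified letters-digits-hyphens rule, and the result is a list of `(field, message)` pairs shaped like the `details` array of the error format. All function names and messages other than "Invalid domain format" are illustrative.

```rust
/// Simplified hostname check: dot-separated labels of ASCII letters, digits, and hyphens.
fn is_valid_hostname(name: &str) -> bool {
    let name = name.strip_prefix("*.").unwrap_or(name); // allow a leading wildcard label
    !name.is_empty()
        && name.len() <= 253
        && name.split('.').all(|label| {
            !label.is_empty()
                && label.len() <= 63
                && !label.starts_with('-')
                && !label.ends_with('-')
                && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
        })
}

/// Returns one (field, message) pair per violated rule; empty means valid.
fn validate_virtual_host(
    server_name: &str,
    listen_port: u16,
    ssl_enabled: bool,
    certificate_exists: bool,
    location_count: usize,
) -> Vec<(&'static str, &'static str)> {
    let mut errors = Vec::new();
    if server_name.split(',').map(str::trim).any(|name| !is_valid_hostname(name)) {
        errors.push(("server_name", "Invalid domain format"));
    }
    // u16 already caps the port at 65535, so only 0 needs rejecting.
    if listen_port == 0 {
        errors.push(("listen_port", "Port must be between 1 and 65535"));
    }
    if ssl_enabled && !certificate_exists {
        errors.push(("ssl_certificate_id", "Certificate not found"));
    }
    if location_count == 0 {
        errors.push(("locations", "At least one location is required"));
    }
    errors
}
```

Collecting every violation instead of stopping at the first lets the API report all problems in a single `VALIDATION_ERROR` response.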
 ---
 ### CM-002: Upstream Configuration
 **Description**: Define backend server pools for load balancing.
 **Upstream Entity**:
 ```rust
 struct Upstream {
    id: Uuid,
    workspace_id: Uuid,
    name: String,              // Used as upstream identifier
    // Load balancing algorithm
    algorithm: LoadBalanceAlgorithm,  // RoundRobin, LeastConnections, IPHash, WeightedRoundRobin
    // Backend servers
    servers: Vec<UpstreamServer>,
    // Health check configuration
    health_check: Option<HealthCheckConfig>,
    // Connection settings
    keepalive_connections: Option<u32>,
    keepalive_timeout: Option<u32>,
 }
 struct UpstreamServer {
    address: String,           // IP:port or hostname:port
    weight: u32,               // Default: 1
    backup: bool,              // Backup server
    down: bool,                // Temporarily down
    max_fails: u32,            // Default: 1
    fail_timeout: u32,         // Seconds, default: 10
 }
 enum LoadBalanceAlgorithm {
    RoundRobin,
    LeastConnections,
    IPHash,
    WeightedRoundRobin,
 }
 ```
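To show how this entity maps onto nginx syntax, here is a rough renderer that writes an `upstream` block by hand. It is a sketch only: the real agent is described as rendering through Handlebars templates, the structs are trimmed copies of the ones above, and `render_upstream` is a hypothetical name.

```rust
enum LoadBalanceAlgorithm {
    RoundRobin,
    LeastConnections,
    IPHash,
    WeightedRoundRobin,
}

struct UpstreamServer {
    address: String,
    weight: u32,
    backup: bool,
    down: bool,
    max_fails: u32,
    fail_timeout: u32, // seconds
}

/// Render an nginx `upstream` block as text.
fn render_upstream(
    name: &str,
    algorithm: &LoadBalanceAlgorithm,
    servers: &[UpstreamServer],
    keepalive_connections: Option<u32>,
) -> String {
    let mut out = format!("upstream {} {{\n", name);
    match algorithm {
        LoadBalanceAlgorithm::LeastConnections => out.push_str("    least_conn;\n"),
        LoadBalanceAlgorithm::IPHash => out.push_str("    ip_hash;\n"),
        // Round robin is nginx's default; weighting comes from the per-server `weight=`.
        LoadBalanceAlgorithm::RoundRobin | LoadBalanceAlgorithm::WeightedRoundRobin => {}
    }
    for server in servers {
        let mut line = format!(
            "    server {} weight={} max_fails={} fail_timeout={}s",
            server.address, server.weight, server.max_fails, server.fail_timeout
        );
        if server.backup {
            line.push_str(" backup");
        }
        if server.down {
            line.push_str(" down");
        }
        line.push_str(";\n");
        out.push_str(&line);
    }
    // `keepalive` has to come after the load-balancing method directive.
    if let Some(connections) = keepalive_connections {
        out.push_str(&format!("    keepalive {};\n", connections));
    }
    out.push_str("}\n");
    out
}
```

One validation rule worth adding on the master side: nginx does not accept `backup` servers together with `ip_hash`, so that combination should be rejected before rendering.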
 ---
 ### CM-003: Configuration Versioning
 **Description**: Track all configuration changes with full history.
 **Versioning Features**:
 - Every change creates a new version
 - Versions are immutable
 - Rollback to any previous version
 - Diff between versions
 - Audit log of who changed what
 **Version Entity**:
 ```rust
 struct ConfigVersion {
    id: Uuid,
    resource_type: String,     // "virtual_host", "upstream", etc.
    resource_id: Uuid,
    version_number: u64,       // Auto-incrementing
    data: Json,                // Full configuration snapshot
    checksum: String,          // SHA-256 of data
    created_by: Uuid,          // User ID
    created_at: DateTime,
    change_summary: String,    // Human-readable description
 }
 ```
 **API Endpoints**:
 - `GET /api/v1/virtual-hosts/{id}/versions` - List versions
 - `GET /api/v1/virtual-hosts/{id}/versions/{version}` - Get specific version
 - `POST /api/v1/virtual-hosts/{id}/rollback` - Rollback to version
 - `GET /api/v1/virtual-hosts/{id}/diff?from=v1&to=v2` - Compare versions
 ---
 ## Observability
 ### OB-001: Structured Logging
 **Description**: Comprehensive logging with structured format.
 **Log Levels**: ERROR, WARN, INFO, DEBUG, TRACE
 **Log Fields**:
 ```json
 {
  "timestamp": "2026-03-02T10:30:00Z",
  "level": "INFO",
  "component": "agent",
  "agent_id": "550e8400-e29b-41d4-a716-446655440000",
  "trace_id": "abc123",
  "span_id": "def456",
  "message": "Configuration applied successfully",
  "fields": {
    "config_id": "config-123",
    "version": 42,
    "duration_ms": 150
  }
 }
 ```
 **Log Targets**:
 - Master: systemd journal, file, or centralized (ELK/Loki)
 - Agent: stdout (Docker), file (standalone), or remote
 ---
 ### OB-002: Distributed Tracing
 **Description**: OpenTelemetry tracing for request flow visualization.
 **Traced Operations**:
 - Configuration push (master → agent → nginx)
 - Health check cycles
 - Certificate issuance
 - API requests
 **Span Attributes**:
 - `nxmesh.agent_id`
 - `nxmesh.config_id`
 - `nxmesh.workspace_id`
 - `nxmesh.organization_id`
 ---
 ### OB-003: Access Log Aggregation
 **Description**: Collect and query nginx access logs from all agents.
 **Features**:
 - Centralized access log storage
 - Real-time log streaming
 - SQL-like query interface
 - Log retention policies
 **Access Log Schema**:
 ```rust
 struct AccessLogEntry {
    id: Uuid,
    agent_id: Uuid,
    timestamp: DateTime,
    // Request details
    remote_addr: String,
    method: String,
    uri: String,
    protocol: String,
    host: String,
    // Response details
    status: u16,
    body_bytes_sent: u64,
    response_time_ms: f64,
    // Additional fields
    user_agent: Option<String>,
    referer: Option<String>,
    request_id: Option<String>,
 }
 ```
 **Query API**:
 ```graphql
 # Example query
 query {
  accessLogs(
    filter: {
      agentId: "...",
      timeRange: { from: "2026-03-01", to: "2026-03-02" },
      statusCode: { gte: 500 }
    },
    limit: 100
  ) {
    timestamp
    method
    uri
    status
    responseTimeMs
  }
 }
 ```
 ---
 ## Security Features
 ### SF-001: Authentication and Authorization
 **Description**: Multi-method authentication with fine-grained RBAC.
 **Authentication Methods**:
 - JWT (for API/Web UI)
 - Password-based login (local user accounts)
 - OAuth2/OIDC (Google, GitHub, enterprise SSO)
 - API Keys (for service accounts)
 - **TLS + Shared Secret** (for agent communication)
  - Server-side TLS (auto-generated self-signed or custom certificates)
  - Bootstrap token for initial registration
  - Session key with HMAC signing for ongoing requests
  - Primary/secondary key rotation
 **RBAC Model**:
 ```rust
 struct Role {
    id: Uuid,
    name: String,
    permissions: Vec<Permission>,
 }
 enum Permission {
    // Organization scope
    OrganizationRead,
    OrganizationWrite,
    OrganizationDelete,
    // Workspace scope
    WorkspaceRead,
    WorkspaceWrite,
    WorkspaceDelete,
    // Agent scope
    AgentRead,
    AgentWrite,
    AgentReload,
    AgentDelete,
    // Config scope
    ConfigRead,
    ConfigWrite,
    ConfigDeploy,
    ConfigDelete,
    // Certificate scope
    CertificateRead,
    CertificateWrite,
    CertificateDelete,
    // User management
    UserRead,
    UserWrite,
    UserDelete,
 }
 ```
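A permission check against this model can be very small. The sketch below uses a subset of the `Permission` variants and assumes a user may hold several roles, with access granted if any role carries the required permission; `is_allowed` is a hypothetical helper, and scoping by organization or workspace is not modelled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Permission {
    AgentRead,
    AgentReload,
    ConfigRead,
    ConfigWrite,
    ConfigDeploy,
}

struct Role {
    name: String,
    permissions: Vec<Permission>,
}

/// True if at least one of the user's roles grants `required`.
fn is_allowed(roles: &[Role], required: Permission) -> bool {
    roles.iter().any(|role| role.permissions.contains(&required))
}
```

Keeping `ConfigWrite` and `ConfigDeploy` as separate permissions means a role can be allowed to edit configurations without being allowed to push them to agents.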
 ---
 ### SF-002: Secret Management
 **Description**: Secure storage and distribution of sensitive data.
 **Secrets**:
 - SSL private keys
 - API tokens
 - Database passwords
 - External service credentials
 **Security Measures**:
 - Encryption at rest (AES-256-GCM)
 - Encryption in transit (TLS 1.3)
 - Automatic secret rotation
 - Audit logging for secret access
 ---
 ### SF-003: Network Security
 **Description**: Network-level security controls.
 **Features**:
 - IP allowlisting for agent connections
 - Rate limiting on API endpoints
 - DDoS protection recommendations
 - Security headers enforcement (HSTS, CSP, etc.)
 **Agent Connection Security**:
 - **TLS Encryption**: Server-side TLS (auto-generated or custom certificates)
  - Development: Self-signed certificates auto-generated on first start
  - Production: Valid certificates (Let's Encrypt or corporate CA)
 - **Bootstrap Authentication**: One-time token for initial registration
 - **Session Authentication**: HMAC-signed requests with shared session key
 - **Key Rotation**: Primary/secondary key design for seamless rotation
 - **Certificate Pinning**: Optional fingerprint verification for additional security
--- a/docs/project-structure.md
+++ b/docs/project-structure.md
@@ -0,0 +1,428 @@
 # NxMesh Project Structure
 This document outlines the recommended project structure for the NxMesh codebase.
 ## Directory Layout
 ```
 nxmesh/
 ├── Cargo.toml                    # Workspace root
 ├── Cargo.lock
 ├── README.md
 ├── LICENSE
 ├── justfile                      # Task runner
 ├── AGENTS.md                     # AI agent context
 │
 ├── crates/                       # Rust workspace crates
 │   ├── nxmesh-core/             # Shared core library
 │   │   ├── Cargo.toml
 │   │   └── src/
 │   │       ├── lib.rs
 │   │       ├── models/          # Shared data models
 │   │       │   ├── mod.rs
 │   │       │   ├── organization.rs
 │   │       │   ├── workspace.rs
 │   │       │   ├── agent.rs
 │   │       │   ├── config.rs
 │   │       │   └── certificate.rs
 │   │       ├── crypto/          # Encryption, hashing
 │   │       ├── validation/      # Input validation
 │   │       └── error.rs         # Common error types
 │   │
 │   ├── nxmesh-proto/            # Protocol buffers
 │   │   ├── Cargo.toml
 │   │   ├── build.rs
 │   │   └── proto/
 │   │       ├── agent.proto
 │   │       ├── config.proto
 │   │       └── common.proto
 │   │
 │   ├── nxmesh-master/           # Control plane
 │   │   ├── Cargo.toml
 │   │   └── src/
 │   │       ├── main.rs
 │   │       ├── lib.rs
 │   │       ├── api/             # REST API handlers
 │   │       │   ├── mod.rs
 │   │       │   ├── routes.rs
 │   │       │   ├── middleware/
 │   │       │   ├── v1/          # API version 1
 │   │       │   │   ├── mod.rs
 │   │       │   │   ├── organizations.rs
 │   │       │   │   ├── workspaces.rs
 │   │       │   │   ├── agents.rs
 │   │       │   │   ├── virtual_hosts.rs
 │   │       │   │   ├── upstreams.rs
 │   │       │   │   ├── certificates.rs
 │   │       │   │   └── metrics.rs
 │   │       │   └── websocket.rs
 │   │       ├── grpc/            # gRPC service
 │   │       │   ├── mod.rs
 │   │       │   ├── server.rs
 │   │       │   ├── agent_service.rs
 │   │       │   └── interceptor.rs
 │   │       ├── config/          # Configuration
 │   │       │   ├── mod.rs
 │   │       │   └── settings.rs
 │   │       ├── db/              # Database layer
 │   │       │   ├── mod.rs
 │   │       │   ├── connection.rs
 │   │       │   ├── migration.rs
 │   │       │   └── repositories/
 │   │       ├── services/        # Business logic
 │   │       │   ├── mod.rs
 │   │       │   ├── organization_service.rs
 │   │       │   ├── workspace_service.rs
 │   │       │   ├── agent_service.rs
 │   │       │   ├── config_service.rs
 │   │       │   ├── certificate_service.rs
 │   │       │   └── auth_service.rs
 │   │       ├── domain/          # Domain entities
 │   │       │   ├── mod.rs
 │   │       │   ├── organization.rs
 │   │       │   ├── agent.rs
 │   │       │   └── config.rs
 │   │       ├── infrastructure/  # External integrations
 │   │       │   ├── mod.rs
 │   │       │   ├── acme/        # Let's Encrypt
 │   │       │   ├── storage/     # Object storage
 │   │       │   └── notifier/    # Notifications
 │   │       ├── events/          # Event bus
 │   │       │   ├── mod.rs
 │   │       │   ├── bus.rs
 │   │       │   └── handlers.rs
 │   │       └── cli.rs           # CLI commands
 │   │
 │   ├── nxmesh-agent/            # Data plane
 │   │   ├── Cargo.toml
 │   │   └── src/
 │   │       ├── main.rs
 │   │       ├── lib.rs
 │   │       ├── config/          # Agent configuration
 │   │       │   ├── mod.rs
 │   │       │   └── settings.rs
 │   │       ├── master/          # Master communication
 │   │       │   ├── mod.rs
 │   │       │   ├── client.rs
 │   │       │   ├── reconnect.rs
 │   │       │   └── stream.rs
 │   │       ├── nginx/           # Nginx management
 │   │       │   ├── mod.rs
 │   │       │   ├── controller.rs
 │   │       │   ├── config_manager.rs  # Symlink-based atomic deployment
 │   │       │   ├── config_renderer.rs
 │   │       │   ├── validator.rs
 │   │       │   ├── docker_sidecar.rs  # Docker sidecar (PID namespace sharing)
 │   │       │   ├── systemd.rs   # Standalone mode
 │   │       │   └── parser.rs    # Nginx config parser
 │   │       ├── health/          # Health monitoring
 │   │       │   ├── mod.rs
 │   │       │   ├── monitor.rs
 │   │       │   ├── nginx.rs
 │   │       │   └── system.rs
 │   │       ├── metrics/         # Metrics collection
 │   │       │   ├── mod.rs
 │   │       │   ├── collector.rs
 │   │       │   └── exporter.rs
 │   │       ├── cache/           # Local caching
 │   │       │   ├── mod.rs
 │   │       │   └── config_cache.rs
 │   │       ├── watch/           # File watchers
 │   │       │   ├── mod.rs
 │   │       │   └── config_watch.rs
 │   │       └── cli.rs           # CLI commands
 │   │
 │   └── nxmesh-cli/              # CLI tool
 │       ├── Cargo.toml
 │       └── src/
 │           ├── main.rs
 │           ├── commands/        # CLI commands
 │           │   ├── mod.rs
 │           │   ├── login.rs
 │           │   ├── agent.rs
 │           │   ├── config.rs
 │           │   └── deploy.rs
 │           └── api/             # API client
 │
 ├── frontend/                    # Web UI (embedded in master)
 │   ├── package.json
 │   ├── vite.config.ts
 │   ├── tsconfig.json
 │   ├── index.html
 │   ├── src/
 │   │   ├── main.tsx
 │   │   ├── App.tsx
 │   │   ├── components/          # Reusable components
 │   │   │   ├── common/
 │   │   │   ├── layout/
 │   │   │   └── forms/
 │   │   ├── pages/               # Page components
 │   │   │   ├── Dashboard/
 │   │   │   ├── Agents/
 │   │   │   ├── Configurations/
 │   │   │   ├── Certificates/
 │   │   │   └── Settings/
 │   │   ├── hooks/               # React hooks
 │   │   ├── stores/              # State management (Zustand)
 │   │   ├── api/                 # API client
 │   │   ├── types/               # TypeScript types
 │   │   ├── utils/               # Utilities
 │   │   └── styles/              # CSS/Tailwind
 │   └── public/
 │   
 │   # Build output (dist/) is embedded into master binary
 │   # Master serves static files at root path ("/")
 │
 ├── migrations/                  # Database migrations
 │   └── sea-orm/
 │       ├── Cargo.toml
 │       └── src/
 │
 ├── tests/                       # Integration tests
 │   ├── integration/
 │   │   ├── master_api_tests.rs
 │   │   ├── agent_master_tests.rs
 │   │   └── config_flow_tests.rs
 │   └── fixtures/
 │
 ├── scripts/                     # Build/utility scripts
 │   ├── build.sh
 │   ├── test.sh
 │   └── release.sh
 │
 ├── deploy/                      # Deployment configs
 │   ├── docker/
 │   │   ├── master.Dockerfile
 │   │   ├── agent.Dockerfile
 │   │   └── docker-compose.yml
 │   ├── k8s/
 │   │   ├── namespace.yaml
 │   │   ├── master/
 │   │   ├── agent/
 │   │   └── helm/
 │   └── terraform/
 │
 ├── docs/                        # Documentation
 │   ├── architecture.md
 │   ├── features.md
 │   ├── roadmap.md
 │   ├── api.md
 │   ├── deployment.md
 │   └── project-structure.md
 │
 └── .devcontainer/               # Dev container
    ├── devcontainer.json
    ├── docker-compose.yml
    ├── Dockerfile
    └── nginx/
 ```
 ## Crate Dependencies
 ```mermaid
 graph TB
    subgraph "Workspace Crates"
        CLI[nxmesh-cli]
        AGENT[nxmesh-agent]
        MASTER[nxmesh-master]
        PROTO[nxmesh-proto]
        CORE[nxmesh-core]
    end
    CORE --> PROTO
    AGENT --> CORE
    AGENT --> PROTO
    MASTER --> CORE
    MASTER --> PROTO
    CLI --> CORE
 ```
 ## Key Design Principles
 ### 1. Separation of Concerns
 - **nxmesh-core**: Only shared types and utilities
 - **nxmesh-master**: Only control plane logic
 - **nxmesh-agent**: Only data plane logic
 - **frontend**: Only UI logic
 ### 2. Domain-Driven Design (in Master)
 ```
 domain/          # Domain entities (pure logic)
 services/        # Application services (orchestration)
 repositories/    # Data access abstraction
 api/             # Interface adapters (HTTP, gRPC)
 infrastructure/  # External concerns
 ```
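As a rough illustration of how these layers might relate, the sketch below wires a domain entity, a repository trait, an in-memory backend, and a service together. All names (`Agent`, `AgentRepository`, `InMemoryAgentRepository`, `AgentService`) are hypothetical and not taken from the NxMesh codebase; the in-memory map merely stands in for a real database.

```rust
use std::collections::HashMap;

// domain/: a plain entity with no I/O (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Agent {
    pub id: u64,
    pub hostname: String,
}

// repositories/: data access hidden behind a trait.
pub trait AgentRepository {
    fn save(&mut self, agent: Agent);
    fn find(&self, id: u64) -> Option<Agent>;
}

// infrastructure/: one concrete backend; a HashMap stands in for a database.
#[derive(Default)]
pub struct InMemoryAgentRepository {
    agents: HashMap<u64, Agent>,
}

impl AgentRepository for InMemoryAgentRepository {
    fn save(&mut self, agent: Agent) {
        self.agents.insert(agent.id, agent);
    }

    fn find(&self, id: u64) -> Option<Agent> {
        self.agents.get(&id).cloned()
    }
}

// services/: orchestration that depends only on the repository trait.
pub struct AgentService<R: AgentRepository> {
    repo: R,
}

impl<R: AgentRepository> AgentService<R> {
    pub fn new(repo: R) -> Self {
        Self { repo }
    }

    /// Registers an agent, rejecting blank hostnames.
    pub fn register(&mut self, id: u64, hostname: &str) -> Result<Agent, String> {
        if hostname.trim().is_empty() {
            return Err("hostname must not be empty".to_string());
        }
        let agent = Agent { id, hostname: hostname.to_string() };
        self.repo.save(agent.clone());
        Ok(agent)
    }

    pub fn get(&self, id: u64) -> Option<Agent> {
        self.repo.find(id)
    }
}
```

Because the service only sees the trait, a test can substitute the in-memory backend while production code supplies a database-backed one.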
 ### 3. Agent Modularity
 Each major concern in the agent is a separate module:
 - `nginx/`: All nginx-specific code
 - `master/`: All master communication code
 - `health/`: All health monitoring code
 - `metrics/`: All metrics code
 ### 4. Configuration Management
 Use hierarchical config:
 1. Default values (in code)
 2. Config file (`/etc/nxmesh/*.toml`)
 3. Environment variables
 4. Command-line arguments (highest priority)
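The precedence order above can be sketched as a chain of optional overrides. This is a minimal, std-only illustration: the field names (`master_url`, `heartbeat_secs`) and default values are assumptions, and parsing the TOML file, environment, and CLI into `ConfigLayer` values is left out.

```rust
/// One configuration layer; `None` means "not set at this level".
/// Field names are illustrative, not NxMesh's actual settings.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ConfigLayer {
    pub master_url: Option<String>,
    pub heartbeat_secs: Option<u64>,
}

/// Fully resolved configuration.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub master_url: String,
    pub heartbeat_secs: u64,
}

impl Default for Config {
    // 1. Default values live in code (values here are placeholders).
    fn default() -> Self {
        Self {
            master_url: "http://localhost:8080".to_string(),
            heartbeat_secs: 30,
        }
    }
}

impl Config {
    /// Applies a layer on top of the current values.
    /// Only fields the layer actually sets are overridden.
    pub fn apply(mut self, layer: ConfigLayer) -> Self {
        if let Some(v) = layer.master_url {
            self.master_url = v;
        }
        if let Some(v) = layer.heartbeat_secs {
            self.heartbeat_secs = v;
        }
        self
    }
}

/// Resolves layers in ascending priority: defaults < file < environment < CLI.
pub fn resolve(file: ConfigLayer, env: ConfigLayer, cli: ConfigLayer) -> Config {
    Config::default().apply(file).apply(env).apply(cli)
}
```

Applying the layers lowest-priority first means a later layer wins only for the fields it sets, so an environment variable can override one value without repeating the whole file.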
 ## Module Guidelines
 ### API Versioning
 - Always version REST APIs: `/api/v1/...`
 - Maintain backward compatibility within major versions
 - Use feature flags for gradual rollouts
 ### Error Handling
 - Use `thiserror` for error definitions
 - Propagate errors with context
 - Convert to user-friendly messages at API boundary
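A sketch of these three guidelines follows. To stay dependency-free it implements `Display` and `Error` by hand; with `thiserror`, those impls would instead come from `#[derive(Error)]` and `#[error("...")]` attributes. The `ConfigError` variants, `parse_port`, and the status codes chosen in `to_api_response` are hypothetical.

```rust
use std::error::Error;
use std::fmt;
use std::num::ParseIntError;

/// Hypothetical error type for config parsing.
#[derive(Debug)]
pub enum ConfigError {
    /// A required key was absent.
    MissingKey(String),
    /// A value could not be parsed; the low-level cause is kept as `source`.
    InvalidPort { value: String, source: ParseIntError },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::MissingKey(key) => write!(f, "missing config key `{key}`"),
            ConfigError::InvalidPort { value, .. } => write!(f, "invalid port `{value}`"),
        }
    }
}

impl Error for ConfigError {
    // Exposes the underlying cause so callers can walk the error chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            ConfigError::InvalidPort { source, .. } => Some(source),
            ConfigError::MissingKey(_) => None,
        }
    }
}

/// Parses a port, attaching the offending value as context.
pub fn parse_port(raw: Option<&str>) -> Result<u16, ConfigError> {
    let raw = raw.ok_or_else(|| ConfigError::MissingKey("port".to_string()))?;
    raw.parse::<u16>().map_err(|source| ConfigError::InvalidPort {
        value: raw.to_string(),
        source,
    })
}

/// API boundary: maps internal errors to an HTTP-style status and a user-facing message.
pub fn to_api_response(err: &ConfigError) -> (u16, String) {
    match err {
        ConfigError::MissingKey(_) => (400, format!("Bad request: {err}")),
        ConfigError::InvalidPort { .. } => (422, format!("Invalid input: {err}")),
    }
}
```

The parse failure keeps its low-level cause for logs, while the API layer decides what the caller is allowed to see.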
 ### Testing Structure
 ```rust
 // In each module
 #[cfg(test)]
 mod tests {
    use super::*;
    #[test]
    fn test_feature() {
        // unit tests
    }
 }
 ```
 - Unit tests: In same file as code
 - Integration tests: In `tests/` directory
 - E2E tests: Separate crate or external repo
 ### Documentation
 - All public APIs must have doc comments
 - Include examples in doc comments
 - Keep README files in each crate
 ## Build Configuration
 ### Workspace Cargo.toml
 ```toml
 [workspace]
 members = [
    "crates/nxmesh-core",
    "crates/nxmesh-proto",
    "crates/nxmesh-master",
    "crates/nxmesh-agent",
    "crates/nxmesh-cli",
 ]
 resolver = "3"
 [workspace.dependencies]
 # Core dependencies
 tokio = { version = "1", features = ["full"] }
 serde = { version = "1", features = ["derive"] }
 thiserror = "1"
 tracing = "0.1"
 # Web framework
 axum = "0.7"
 tower = "0.4"
 tower-http = "0.5"
 # gRPC
 tonic = "0.11"
 prost = "0.12"
 # Database
 sea-orm = "2.0.0-rc"
 sea-orm-migration = "2.0.0-rc"
 # Async
 async-trait = "0.1"
 futures = "0.3"
 # Serialization
 serde_json = "1"
 toml = "0.8"
 # HTTP
 reqwest = { version = "0.12", default-features = false }
 # Crypto
 sha2 = "0.10"
 hex = "0.4"
 # Testing
 tokio-test = "0.4"
 mockall = "0.12"
 ```
 ## Naming Conventions
 ### Files
 - Use `snake_case` for file names
 - Module entry point: `mod.rs` or `{module_name}.rs`
 ### Types
 - Structs/Enums: `PascalCase`
 - Traits: `PascalCase` (often ending in `able` or with verb prefix)
 - Functions/Methods: `snake_case`
 - Constants: `SCREAMING_SNAKE_CASE`
 - Generic parameters: Single uppercase letter (`T`, `K`, `V`)
 ### Error Types
 - Suffix with `Error`: `ConfigError`, `AgentError`
 - Group in `error.rs` or `errors/` module
 ### Feature Flags
 - Use `kebab-case`: `postgres-native`, `tls-rustls`
 ## CI/CD Structure
 ```
 # .github/workflows/
 ├── ci.yml           # PR checks
 ├── test.yml         # Test suite
 ├── release.yml      # Release builds
 ├── docker.yml       # Docker image builds
 └── docs.yml         # Documentation deploy
 ```
 ## Scripts
 Common operations should be exposed as `just` recipes:
 ```shell
 # Development
 just dev          # Start all services
 just dev-backend  # Start backend only
 just dev-frontend # Start frontend only
 # Testing
 just test         # Run all tests
 just test-unit    # Unit tests only
 just test-integration  # Integration tests
 # Building
 just build        # Build all
 just build-master # Build master only
 just build-agent  # Build agent only
 # Database
 just db-migrate   # Run migrations
 just db-reset     # Reset database
 just db-console   # Open psql
 # Deployment
 just docker-build # Build Docker images
 just k8s-deploy   # Deploy to Kubernetes
 ```
--- a/docs/roadmap.md
+++ b/docs/roadmap.md
@@ -0,0 +1,486 @@
 # NxMesh Project Roadmap
 ## Overview
 This document outlines the development phases and milestones for NxMesh. The project is divided into four major phases, each building upon the previous one.
 ---
 ## Phase 1: Foundation (Months 1-3)
 **Goal**: Build a working MVP with basic master-agent communication and nginx configuration management.
 ### Milestone 1.1: Project Setup and Core Infrastructure
 **Target**: Week 2
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | Set up Rust workspace structure (core, proto, master, agent, cli) | 🔲 |
 | [ ] | Configure CI/CD pipeline (GitHub Actions) | 🔲 |
 | [ ] | Set up database schema with SeaORM migrations | 🔲 |
 | [ ] | Create development environment (devcontainer) | 🔲 |
 | [ ] | Set up testing framework (unit, integration) | 🔲 |
 **Deliverables**:
 - Working development environment
 - Database schema for organizations, workspaces, agents
 - CI pipeline with linting and testing
 ---
 ### Milestone 1.2: Master - Core API
 **Target**: Week 5
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | Implement Axum-based REST API server | 🔲 |
 | [ ] | JWT authentication middleware | 🔲 |
 | [ ] | CRUD endpoints for Organizations | 🔲 |
 | [ ] | CRUD endpoints for Workspaces | 🔲 |
 | [ ] | CRUD endpoints for Agents | 🔲 |
 | [ ] | PostgreSQL persistence layer | 🔲 |
 **Deliverables**:
 - REST API for basic resource management
 - JWT authentication working
 - API documentation (OpenAPI)
 ---
 ### Milestone 1.3: Master - Agent Communication
 **Target**: Week 7
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | gRPC server implementation (Tonic) | 🔲 |
 | [ ] | Bidirectional streaming protocol | 🔲 |
 | [ ] | Agent registration flow | 🔲 |
 | [ ] | Token-based authentication for agents | 🔲 |
 | [ ] | Agent heartbeat/health monitoring | 🔲 |
 | [ ] | WebSocket fallback for events | 🔲 |
 **Deliverables**:
 - Master can accept agent connections
 - Agent registration and authentication works
 - Health status tracking
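Health status tracking on the master could, for example, compare each agent's last heartbeat against a timeout. The sketch below is an assumption about one possible shape (`HeartbeatTracker`, `AgentHealth`, and the timeout value are hypothetical); timestamps are passed in explicitly so the logic can be tested without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentHealth {
    Healthy,
    Unresponsive,
}

/// Remembers when each agent was last heard from.
pub struct HeartbeatTracker {
    timeout: Duration,
    last_seen: HashMap<String, Instant>,
}

impl HeartbeatTracker {
    pub fn new(timeout: Duration) -> Self {
        Self { timeout, last_seen: HashMap::new() }
    }

    /// Records a heartbeat received from `agent_id` at time `at`.
    pub fn record(&mut self, agent_id: &str, at: Instant) {
        self.last_seen.insert(agent_id.to_string(), at);
    }

    /// Returns `None` for agents that never reported, otherwise their health at `now`.
    pub fn health(&self, agent_id: &str, now: Instant) -> Option<AgentHealth> {
        let seen = *self.last_seen.get(agent_id)?;
        if now.saturating_duration_since(seen) > self.timeout {
            Some(AgentHealth::Unresponsive)
        } else {
            Some(AgentHealth::Healthy)
        }
    }
}
```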
 ---
 ### Milestone 1.4: Agent - Core Functionality
 **Target**: Week 9
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | Agent CLI and configuration | 🔲 |
 | [ ] | gRPC client for master communication | 🔲 |
 | [ ] | Automatic reconnection with backoff | 🔲 |
 | [ ] | Nginx process management (Docker sidecar PID sharing) | 🔲 |
 | [ ] | Health check reporting | 🔲 |
 | [ ] | Local config caching | 🔲 |
 **Deliverables**:
 - Agent binary that connects to master
 - Nginx lifecycle management (Docker sidecar mode)
 - Health reporting
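For the "automatic reconnection with backoff" task, one common choice is capped exponential backoff. The roadmap does not specify the policy, so the sketch below is an assumption: the `Backoff` type, its doubling factor, and the cap are illustrative, and jitter is omitted for brevity.

```rust
use std::time::Duration;

/// Exponential backoff schedule: `base * 2^attempt`, capped at `max`.
#[derive(Debug, Clone)]
pub struct Backoff {
    base: Duration,
    max: Duration,
    attempt: u32,
}

impl Backoff {
    pub fn new(base: Duration, max: Duration) -> Self {
        Self { base, max, attempt: 0 }
    }

    /// Returns the delay to wait before the next reconnection attempt.
    pub fn next_delay(&mut self) -> Duration {
        // Saturating arithmetic keeps very long outages from overflowing.
        let factor = 2u32.saturating_pow(self.attempt);
        let delay = self
            .base
            .checked_mul(factor)
            .unwrap_or(self.max)
            .min(self.max);
        self.attempt = self.attempt.saturating_add(1);
        delay
    }

    /// Call after a successful connection so the next failure starts from `base` again.
    pub fn reset(&mut self) {
        self.attempt = 0;
    }
}
```

With a 1-second base and a 10-second cap, successive delays are 1, 2, 4, 8, 10, 10 seconds. A production agent would typically add random jitter so that many agents do not reconnect in lockstep after a master restart.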
 ---
 ### Milestone 1.5: Configuration Management
 **Target**: Week 11
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | VirtualHost CRUD API | 🔲 |
 | [ ] | Upstream CRUD API | 🔲 |
 | [ ] | Handlebars template engine integration | 🔲 |
 | [ ] | Config rendering on agent | 🔲 |
 | [ ] | Nginx config validation (`nginx -t`) | 🔲 |
 | [ ] | Graceful reload on config change | 🔲 |
 **Deliverables**:
 - End-to-end config push: Master → Agent → Nginx
 - Basic virtual host and upstream management
 - Template-based nginx config generation
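The validate-then-reload step on the agent might look roughly like the following. It assumes an `nginx` binary on `PATH` and uses the standard `nginx -t -c <file>` and `nginx -s reload` invocations; `apply_config` and its signature are hypothetical. The validator and reloader are passed in as closures so the flow can be exercised without nginx installed.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Runs `nginx -t -c <path>` and reports whether the config passed validation.
pub fn nginx_test(path: &Path) -> io::Result<bool> {
    Ok(Command::new("nginx").arg("-t").arg("-c").arg(path).status()?.success())
}

/// Asks the running nginx master process to reload gracefully.
pub fn nginx_reload() -> io::Result<bool> {
    Ok(Command::new("nginx").args(["-s", "reload"]).status()?.success())
}

/// Writes `new_config` to `path`, validates it, then reloads.
/// If validation fails or errors, the previous file contents are restored
/// and no reload is attempted.
pub fn apply_config<V, R>(
    path: &Path,
    new_config: &str,
    validate: V,
    reload: R,
) -> io::Result<bool>
where
    V: Fn(&Path) -> io::Result<bool>,
    R: Fn() -> io::Result<bool>,
{
    let previous = fs::read(path).ok(); // None if no config existed yet
    fs::write(path, new_config)?;

    let outcome = validate(path);
    if !matches!(outcome, Ok(true)) {
        // Roll back so a broken file is never left on disk.
        match previous {
            Some(bytes) => fs::write(path, bytes)?,
            None => fs::remove_file(path)?,
        }
        return outcome; // Ok(false) or the validation error
    }
    reload()
}
```

In the agent this would be called as `apply_config(path, &rendered, nginx_test, nginx_reload)`, where `rendered` is the output of the template engine.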
 ---
 ### Milestone 1.6: Web Admin Console - Foundation
 **Target**: Week 13
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | React + Vite project setup | 🔲 |
 | [ ] | Authentication UI (login/logout) | 🔲 |
 | [ ] | Dashboard layout and navigation | 🔲 |
 | [ ] | Agent list and detail views | 🔲 |
 | [ ] | Basic virtual host form | 🔲 |
 | [ ] | WebSocket integration for real-time updates | 🔲 |
 **Deliverables**:
 - Functional Web UI
 - Agent management via UI
 - Basic configuration editing
 ---
 ### Phase 1 Completion Criteria
 - [ ] Master and Agent communicate via gRPC
 - [ ] Nginx configs can be pushed from Master to Agent
 - [ ] Web UI for basic management
 - [ ] Docker sidecar deployment working
 - [ ] Documentation complete
 **Estimated Duration**: 3 months
 **Team Size**: 2-3 engineers
 ---
 ## Phase 2: Resilience and Observability (Months 4-5)
 **Goal**: Make the system production-ready with HA, monitoring, and robust failure handling.
 ### Milestone 2.1: High Availability - Master Clustering
 **Target**: Week 15
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | Raft consensus integration (raft-rs) | 🔲 |
 | [ ] | Leader election | 🔲 |
 | [ ] | State replication across masters | 🔲 |
 | [ ] | Agent connection failover | 🔲 |
 | [ ] | Cluster health monitoring | 🔲 |
 **Deliverables**:
 - Multiple master instances can form a cluster
 - Automatic failover on master failure
 - No single point of failure
 ---
 ### Milestone 2.2: Certificate Management
 **Target**: Week 17
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | ACME client integration (acme-rs) | 🔲 |
 | [ ] | Let's Encrypt HTTP-01 challenge | 🔲 |
 | [ ] | Certificate storage (encrypted) | 🔲 |
 | [ ] | Automatic renewal | 🔲 |
 | [ ] | Certificate distribution to agents | 🔲 |
 | [ ] | Expiration monitoring and alerts | 🔲 |
 **Deliverables**:
 - Automatic SSL certificate provisioning
 - Certificate renewal before expiry
 - UI for certificate management
 ---
 ### Milestone 2.3: Observability Stack
 **Target**: Week 19
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | OpenTelemetry integration | 🔲 |
 | [ ] | Structured logging (tracing) | 🔲 |
 | [ ] | Prometheus metrics endpoint (agent) | 🔲 |
 | [ ] | Custom metrics collection | 🔲 |
 | [ ] | Health check dashboard | 🔲 |
 | [ ] | Alert configuration | 🔲 |
 **Deliverables**:
 - Metrics visible in Prometheus
 - Distributed traces for config pushes
 - Health dashboard in Web UI
 ---
 ### Milestone 2.4: Enhanced Failure Handling
 **Target**: Week 21
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | Configuration drift detection | 🔲 |
 | [ ] | Auto-healing (config sync) | 🔲 |
 | [ ] | Circuit breaker for master connection | 🔲 |
 | [ ] | Nginx crash detection and restart | 🔲 |
 | [ ] | Config rollback on validation failure | 🔲 |
 | [ ] | Bulk operations and queue management | 🔲 |
 **Deliverables**:
 - System self-heals from common failures
 - Config drift automatically corrected
 - Robust reconnection logic
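A circuit breaker for the master connection is usually modeled as a three-state machine (closed, open, half-open). The roadmap does not define thresholds or state names, so everything in this sketch (`CircuitBreaker`, `BreakerState`, the failure threshold, the cooldown) is an assumption.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BreakerState {
    /// Calls flow normally.
    Closed,
    /// Calls are rejected until the cooldown has elapsed.
    Open { since: Instant },
    /// Cooldown elapsed; calls are allowed until one reports success or failure.
    HalfOpen,
}

pub struct CircuitBreaker {
    state: BreakerState,
    failures: u32,
    failure_threshold: u32,
    cooldown: Duration,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, cooldown: Duration) -> Self {
        Self { state: BreakerState::Closed, failures: 0, failure_threshold, cooldown }
    }

    pub fn state(&self) -> BreakerState {
        self.state
    }

    /// Returns true if a call to the master may be attempted at `now`.
    pub fn allow(&mut self, now: Instant) -> bool {
        match self.state {
            BreakerState::Closed | BreakerState::HalfOpen => true,
            BreakerState::Open { since } => {
                if now.saturating_duration_since(since) >= self.cooldown {
                    self.state = BreakerState::HalfOpen;
                    true
                } else {
                    false
                }
            }
        }
    }

    /// A successful call closes the breaker and clears the failure count.
    pub fn on_success(&mut self) {
        self.failures = 0;
        self.state = BreakerState::Closed;
    }

    /// A failure while half-open, or reaching the threshold, opens the breaker.
    pub fn on_failure(&mut self, now: Instant) {
        self.failures = self.failures.saturating_add(1);
        if self.state == BreakerState::HalfOpen || self.failures >= self.failure_threshold {
            self.state = BreakerState::Open { since: now };
        }
    }
}
```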
 ---
 ### Phase 2 Completion Criteria
 - [ ] Master clustering with Raft
 - [ ] Automatic SSL certificates
 - [ ] Full observability (metrics, logs, traces)
 - [ ] Production-grade failure handling
 - [ ] Performance benchmarks
 **Estimated Duration**: 2 months
 **Team Size**: 2-3 engineers
 ---
 ## Phase 3: Advanced Traffic Management (Months 6-7)
 **Goal**: Add enterprise-grade traffic management features.
 ### Milestone 3.1: Advanced Load Balancing
 **Target**: Week 23
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | Multiple load balancing algorithms | 🔲 |
 | [ ] | Health checks for upstream servers | 🔲 |
 | [ ] | Circuit breaker for upstreams | 🔲 |
 | [ ] | Retry policies | 🔲 |
 | [ ] | Connection pooling | 🔲 |
 | [ ] | Upstream status dashboard | 🔲 |
 **Deliverables**:
 - Advanced upstream configuration
 - Health check visualization
 - Circuit breaker metrics
 ---
 ### Milestone 3.2: Rate Limiting and WAF
 **Target**: Week 25
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | Rate limiting rules (IP, user, global) | 🔲 |
 | [ ] | Rate limiting zones | 🔲 |
 | [ ] | Basic WAF rules (ModSecurity integration) | 🔲 |
 | [ ] | IP allowlist/blocklist | 🔲 |
 | [ ] | Geo-blocking | 🔲 |
 | [ ] | Rate limit analytics | 🔲 |
 **Deliverables**:
 - Configurable rate limiting
 - Basic WAF protection
 - Security event dashboard
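Configurable rate limiting ultimately has to be rendered into nginx's `limit_req_zone` and `limit_req` directives. The sketch below shows one way a rule could be turned into those lines; the `RateLimitRule` struct and its fields are hypothetical, while the directive syntax follows nginx's `ngx_http_limit_req_module`.

```rust
/// Hypothetical rule model as it might be stored by the master.
#[derive(Debug, Clone)]
pub struct RateLimitRule {
    /// Name of the shared-memory zone.
    pub zone: String,
    /// nginx variable used as the limiting key, e.g. `$binary_remote_addr` for per-IP limits.
    pub key: String,
    /// Zone size in megabytes.
    pub zone_size_mb: u32,
    /// Sustained request rate per second.
    pub requests_per_sec: u32,
    /// Extra requests allowed in a burst; 0 disables the `burst` parameter.
    pub burst: u32,
}

impl RateLimitRule {
    /// Directive for the `http` block that declares the zone.
    pub fn zone_directive(&self) -> String {
        format!(
            "limit_req_zone {} zone={}:{}m rate={}r/s;",
            self.key, self.zone, self.zone_size_mb, self.requests_per_sec
        )
    }

    /// Directive for a `server` or `location` block that applies the zone.
    pub fn limit_directive(&self) -> String {
        if self.burst > 0 {
            format!("limit_req zone={} burst={};", self.zone, self.burst)
        } else {
            format!("limit_req zone={};", self.zone)
        }
    }
}
```

A per-IP rule with a 10 MB zone, 5 requests per second, and a burst of 20 renders as `limit_req_zone $binary_remote_addr zone=per_ip:10m rate=5r/s;` and `limit_req zone=per_ip burst=20;`.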
 ---
 ### Milestone 3.3: Traffic Routing and Canary
 **Target**: Week 27
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | Header-based routing | 🔲 |
 | [ ] | Weight-based traffic splitting | 🔲 |
 | [ ] | Canary deployment support | 🔲 |
 | [ ] | A/B testing configuration | 🔲 |
 | [ ] | Blue-green deployment | 🔲 |
 | [ ] | Traffic analytics | 🔲 |
 **Deliverables**:
 - Advanced traffic routing
 - Canary deployment UI
 - Traffic split visualization
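Weight-based splitting for canary or A/B traffic can be illustrated with a hash-based selector: hashing a request key into the total weight gives each backend a proportional share while keeping a given key on the same backend. This is only a conceptual sketch with hypothetical names (`Backend`, `pick`); in NxMesh the split would presumably be expressed in generated nginx config rather than in Rust at request time.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug, Clone)]
pub struct Backend {
    pub name: String,
    /// Relative share of traffic; 0 means the backend receives nothing.
    pub weight: u32,
}

/// Picks a backend for `request_key` in proportion to the configured weights.
/// The same key maps to the same backend as long as the backend list is unchanged
/// (note that `DefaultHasher`'s algorithm is not guaranteed stable across Rust releases).
pub fn pick<'a>(backends: &'a [Backend], request_key: &str) -> Option<&'a Backend> {
    let total: u64 = backends.iter().map(|b| u64::from(b.weight)).sum();
    if total == 0 {
        return None;
    }

    let mut hasher = DefaultHasher::new();
    request_key.hash(&mut hasher);
    // Slot in [0, total); each backend owns a contiguous range of `weight` slots.
    let mut slot = hasher.finish() % total;

    for backend in backends {
        let weight = u64::from(backend.weight);
        if slot < weight {
            return Some(backend);
        }
        slot -= weight;
    }
    None
}
```

With weights 90 and 10, roughly one request key in ten lands on the second backend, which is the usual starting point for a canary rollout.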
 ---
 ### Milestone 3.4: Access Log Aggregation
 **Target**: Week 29
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | Nginx access log parsing | 🔲 |
 | [ ] | Log streaming to master | 🔲 |
 | [ ] | Log storage and indexing | 🔲 |
 | [ ] | Log query interface | 🔲 |
 | [ ] | Real-time log tailing | 🔲 |
 | [ ] | Log-based alerting | 🔲 |
 **Deliverables**:
 - Centralized access logs
 - Log search and filtering
 - Log-based metrics
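Access log parsing on the agent could start from nginx's predefined `combined` log format. The parser below is a minimal sketch that assumes that format and extracts only a few fields; `AccessLogEntry` and `parse_combined` are hypothetical names, and edge cases such as quotes inside the request line are not handled.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct AccessLogEntry {
    pub remote_addr: String,
    pub time_local: String,
    pub method: String,
    pub path: String,
    pub status: u16,
    pub body_bytes_sent: u64,
}

/// Parses one line in nginx's `combined` format:
/// `$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"`
/// Returns `None` if the line does not match.
pub fn parse_combined(line: &str) -> Option<AccessLogEntry> {
    let (remote_addr, rest) = line.split_once(' ')?;
    let (_, rest) = rest.split_once('[')?;
    let (time_local, rest) = rest.split_once("] \"")?;
    let (request, rest) = rest.split_once("\" ")?;

    // "$request" is "METHOD PATH PROTOCOL".
    let mut request_parts = request.split(' ');
    let method = request_parts.next()?;
    let path = request_parts.next()?;

    // The first two fields after the request are status and bytes sent.
    let mut tail = rest.split(' ');
    let status = tail.next()?.parse().ok()?;
    let body_bytes_sent = tail.next()?.parse().ok()?;

    Some(AccessLogEntry {
        remote_addr: remote_addr.to_string(),
        time_local: time_local.to_string(),
        method: method.to_string(),
        path: path.to_string(),
        status,
        body_bytes_sent,
    })
}
```

Structured entries like this are what log-based metrics (for example, counting 5xx responses per virtual host) would be computed from.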
 ---
 ### Phase 3 Completion Criteria
 - [ ] Advanced load balancing and health checks
 - [ ] Rate limiting and basic WAF
 - [ ] Canary and A/B testing
 - [ ] Access log aggregation
 - [ ] Traffic analytics dashboard
 **Estimated Duration**: 2 months
 **Team Size**: 2-3 engineers
 ---
 ## Phase 4: Enterprise Features (Months 8-10)
 **Goal**: Enterprise readiness with multi-tenancy, RBAC, and advanced integrations.
 ### Milestone 4.1: Multi-tenancy and RBAC
 **Target**: Week 31
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | Organization isolation | 🔲 |
 | [ ] | Workspace-scoped resources | 🔲 |
 | [ ] | Role-based access control | 🔲 |
 | [ ] | User management API | 🔲 |
 | [ ] | API key management | 🔲 |
 | [ ] | Audit logging | 🔲 |
 **Deliverables**:
 - Full multi-tenancy
 - Granular permissions
 - Audit trail
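Workspace-scoped role checks could be modeled along these lines. The roles, permissions, and `Membership` type are illustrative assumptions; the roadmap does not define the actual permission model.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Permission {
    ReadConfig,
    WriteConfig,
    ManageUsers,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Viewer,
    Operator,
    Admin,
}

impl Role {
    /// Permissions granted by each role (hypothetical mapping).
    pub fn permissions(self) -> HashSet<Permission> {
        use Permission::*;
        match self {
            Role::Viewer => HashSet::from([ReadConfig]),
            Role::Operator => HashSet::from([ReadConfig, WriteConfig]),
            Role::Admin => HashSet::from([ReadConfig, WriteConfig, ManageUsers]),
        }
    }
}

/// A user's role within one workspace.
#[derive(Debug, Clone)]
pub struct Membership {
    pub workspace_id: u64,
    pub role: Role,
}

/// Returns true if any membership in `workspace_id` grants `needed`.
/// Roles held in other workspaces never apply, which keeps resources workspace-scoped.
pub fn is_allowed(memberships: &[Membership], workspace_id: u64, needed: Permission) -> bool {
    memberships
        .iter()
        .any(|m| m.workspace_id == workspace_id && m.role.permissions().contains(&needed))
}
```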
 ---
 ### Milestone 4.2: Kubernetes Integration
 **Target**: Week 33
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | Kubernetes operator | 🔲 |
 | [ ] | CRD definitions | 🔲 |
 | [ ] | Helm chart | 🔲 |
 | [ ] | Service discovery integration | 🔲 |
 | [ ] | Ingress controller mode | 🔲 |
 | [ ] | K8s-native agent deployment | 🔲 |
 **Deliverables**:
 - Kubernetes operator
 - Helm chart for easy deployment
 - Ingress controller functionality
 ---
 ### Milestone 4.3: External Integrations
 **Target**: Week 35
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | Terraform provider | 🔲 |
 | [ ] | GitOps integration (Git sync) | 🔲 |
 | [ ] | Webhook support | 🔲 |
 | [ ] | Slack/Discord notifications | 🔲 |
 | [ ] | PagerDuty/Opsgenie integration | 🔲 |
 | [ ] | DNS provider integration (Route53, Cloudflare) | 🔲 |
 **Deliverables**:
 - Infrastructure as Code support
 - GitOps workflows
 - Notification channels
 ---
 ### Milestone 4.4: Performance and Scale
 **Target**: Week 37
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | Connection pooling optimization | 🔲 |
 | [ ] | Config caching improvements | 🔲 |
 | [ ] | Database query optimization | 🔲 |
 | [ ] | Horizontal scaling tests | 🔲 |
 | [ ] | Load testing (10k+ agents) | 🔲 |
 | [ ] | Performance tuning documentation | 🔲 |
 **Deliverables**:
 - Performance benchmarks
 - Scaling guidelines
 - Optimization recommendations
 ---
 ### Milestone 4.5: Enterprise Security
 **Target**: Week 39
 | Done | Task | Status |
 |------|-------------|--------|
 | [ ] | mTLS for all communications | 🔲 |
 | [ ] | Secret encryption at rest | 🔲 |
 | [ ] | HSM integration | 🔲 |
 | [ ] | SSO/SAML integration | 🔲 |
 | [ ] | Security scanning (SAST/DAST) | 🔲 |
 | [ ] | Compliance documentation (SOC2) | 🔲 |
 **Deliverables**:
 - Enterprise security features
 - Compliance documentation
 - Security audit
 ---
 ### Phase 4 Completion Criteria
 - [ ] Full RBAC and multi-tenancy
 - [ ] Kubernetes operator
 - [ ] External integrations (Terraform, GitOps)
 - [ ] Proven scalability (10k+ agents)
 - [ ] Enterprise security compliance
 **Estimated Duration**: 3 months
 **Team Size**: 3-4 engineers
 ---
 ## Timeline Summary
 ```
 Month 1-3:   ██████████████████████████████ Phase 1: Foundation
 Month 4-5:   ████████████████████           Phase 2: Resilience
 Month 6-7:   ████████████████████           Phase 3: Advanced
 Month 8-10:  ██████████████████████████████ Phase 4: Enterprise
 Milestone targets (W = week):
   Phase 1:  1.1 W2   1.2 W5   1.3 W7   1.4 W9   1.5 W11  1.6 W13
   Phase 2:  2.1 W15  2.2 W17  2.3 W19  2.4 W21
   Phase 3:  3.1 W23  3.2 W25  3.3 W27  3.4 W29
   Phase 4:  4.1 W31  4.2 W33  4.3 W35  4.4 W37  4.5 W39
 ```
 ---
 ## Resource Requirements
 ### Phase 1
 - **Backend Engineers**: 2
 - **Frontend Engineer**: 1
 - **Total Person-Months**: 9
 ### Phase 2
 - **Backend Engineers**: 2
 - **Frontend Engineer**: 1 (part-time)
 - **DevOps Engineer**: 1 (part-time)
 - **Total Person-Months**: 7
 ### Phase 3
 - **Backend Engineers**: 2
 - **Frontend Engineer**: 1
 - **Total Person-Months**: 6
 ### Phase 4
 - **Backend Engineers**: 2
 - **Frontend Engineer**: 1
 - **DevOps Engineer**: 1
 - **Security Engineer**: 1 (part-time)
 - **Total Person-Months**: 10
 **Total Project**: ~32 person-months
 ---
 ## Risk Assessment
 | Risk | Probability | Impact | Mitigation |
 |------|-------------|--------|------------|
 | Raft complexity delays HA | Medium | High | Start with single master, add HA later |
 | gRPC performance issues | Low | Medium | Implement WebSocket fallback early |
 | Nginx reload edge cases | Medium | High | Extensive testing, rollback capability |
 | Team scaling challenges | Medium | Medium | Document architecture, modular design |
 | Integration complexity | Medium | Medium | Clear APIs, contract testing |