Add project structure and roadmap documentation

- Created `project-structure.md` to outline the directory layout, crate dependencies, design principles, module guidelines, and naming conventions for the NxMesh codebase.
- Introduced `roadmap.md` detailing the development phases, milestones, tasks, deliverables, and resource requirements for the NxMesh project, spanning from foundational setup to enterprise features.
2026-03-03 04:13:31 +00:00
parent 39bd860c55
commit 43b2e44d95
11 changed files with 9293 additions and 7 deletions
--- a/docs/features.md
+++ b/docs/features.md
@@ -0,0 +1,814 @@
+# NxMesh Feature Specification
+
+## Table of Contents
+1. [Core Features](#core-features)
+2. [Master Features](#master-features)
+3. [Agent Features](#agent-features)
+4. [Configuration Management](#configuration-management)
+5. [Observability](#observability)
+6. [Security Features](#security-features)
+
+---
+
+## Core Features
+
+### CF-001: Multi-tenancy with Organizations and Workspaces
+
+**Description**: Support for multiple organizations with isolated workspaces within each organization.
+
+**Requirements**:
+- Organizations are top-level resource containers
+- Each organization can have multiple workspaces
+- Resources (agents, configs, certificates) are scoped to a workspace
+- Cross-workspace visibility is configurable
+
+**Data Model**:
+```rust
+struct Organization {
+    id: Uuid,
+    name: String,
+    slug: String,  // URL-friendly identifier
+    created_at: DateTime,
+    settings: OrganizationSettings,
+}
+
+struct Workspace {
+    id: Uuid,
+    organization_id: Uuid,
+    name: String,
+    slug: String,
+    created_at: DateTime,
+}
+```
+
+**API Endpoints**:
+- `GET /api/v1/organizations` - List organizations
+- `POST /api/v1/organizations` - Create organization
+- `GET /api/v1/organizations/{id}/workspaces` - List workspaces
+- `POST /api/v1/organizations/{id}/workspaces` - Create workspace
+
+---
+
+### CF-002: Agent Registration and Lifecycle Management
+
+**Description**: Agents must register with the master before receiving configurations.
+
+**Registration Flow**:
+1. Administrator generates bootstrap token in Master UI
+2. Token is provided to agent via environment variable or config file
+3. Agent establishes TLS connection to master (verifies server certificate)
+4. Agent sends bootstrap token for registration
+5. Master validates token and establishes shared secret:
+   - Master generates session_key (per-agent) + key_id
+   - Session key used for HMAC request signing
+   - Primary/secondary key design for rotation
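+
+The primary/secondary key design in step 5 can be sketched as a small key ring that accepts either key during a rotation window. This is a minimal illustration, not the actual implementation: `SessionKey`, `AgentKeyRing`, `resolve`, and `rotate` are hypothetical names, and the HMAC computation itself is omitted because it would need an external crate.
+
+```rust
+/// Hypothetical per-agent session key; field names are illustrative.
+#[derive(Debug, Clone, PartialEq)]
+struct SessionKey {
+    key_id: String,
+    secret: Vec<u8>,
+}
+
+/// Hypothetical key ring holding the current and the previous key.
+#[derive(Debug)]
+struct AgentKeyRing {
+    primary: SessionKey,
+    secondary: Option<SessionKey>,
+}
+
+impl AgentKeyRing {
+    fn new(primary: SessionKey) -> Self {
+        Self { primary, secondary: None }
+    }
+
+    /// Look up the key named by a request's key_id. Both keys are accepted
+    /// so that requests signed just before a rotation still verify.
+    fn resolve(&self, key_id: &str) -> Option<&SessionKey> {
+        if self.primary.key_id == key_id {
+            return Some(&self.primary);
+        }
+        self.secondary.as_ref().filter(|k| k.key_id == key_id)
+    }
+
+    /// The new key becomes primary, the old primary becomes secondary,
+    /// and the old secondary is dropped.
+    fn rotate(&mut self, new_primary: SessionKey) {
+        let old = std::mem::replace(&mut self.primary, new_primary);
+        self.secondary = Some(old);
+    }
+}
+```
+
+After two rotations the oldest key no longer resolves, which bounds how long a leaked key stays usable.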
+
+**Agent States**:
+```rust
+enum AgentState {
+    Pending,      // Registered but never connected
+    Online,       // Connected and healthy
+    Offline,      // Disconnected
+    Degraded,     // Connected but health checks failing
+    Maintenance,  // Manually placed in maintenance mode
+}
+```
+
+**Agent Metadata**:
+```rust
+struct Agent {
+    id: Uuid,
+    workspace_id: Uuid,
+    name: String,
+    hostname: String,
+    ip_address: String,
+    version: String,
+    state: AgentState,
+    deployment_mode: DeploymentMode,  // DockerSidecar, K8sSidecar, Standalone
+    last_seen_at: DateTime,
+    capabilities: Vec<String>,  // e.g., ["http3", "websocket", "rate_limiting"]
+    labels: HashMap<String, String>,  // e.g., {"env": "prod", "region": "us-east"}
+}
+```
+
+**API Endpoints**:
+- `POST /api/v1/agents/register` - Register new agent
+- `GET /api/v1/agents` - List agents
+- `GET /api/v1/agents/{id}` - Get agent details
+- `POST /api/v1/agents/{id}/tokens` - Generate registration token
+- `DELETE /api/v1/agents/{id}` - Deregister agent
+
+---
+
+### CF-003: Real-time Configuration Distribution
+
+**Description**: Push configuration changes to agents in real-time with delivery guarantees.
+
+**Requirements**:
+- Config changes propagate to all affected agents within 5 seconds
+- Support for targeted updates (specific agents or groups)
+- Config versioning with rollback capability
+- Delivery confirmation from agents
+
+**Configuration Scope**:
+```rust
+enum ConfigScope {
+    Global,           // All agents
+    Workspace,        // All agents in workspace
+    AgentGroup(String), // Agents with specific label selector
+    Agent(Uuid),      // Single agent
+}
+```
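+
+One way to decide whether a scoped config applies to a given agent is sketched below. It rests on several assumptions that the spec does not state: `Id` stands in for `Uuid`, the `AgentGroup` selector uses a simple `key=value` syntax, group scopes are limited to the config's own workspace, and `AgentInfo` and `scope_matches` are hypothetical names.
+
+```rust
+use std::collections::HashMap;
+
+// Stand-in for `Uuid` so the sketch compiles without external crates.
+type Id = u64;
+
+enum ConfigScope {
+    Global,
+    Workspace,
+    AgentGroup(String), // assumed selector syntax: "key=value"
+    Agent(Id),
+}
+
+/// Hypothetical subset of the `Agent` metadata needed for matching.
+struct AgentInfo {
+    id: Id,
+    workspace_id: Id,
+    labels: HashMap<String, String>,
+}
+
+/// Returns true if a config owned by `config_workspace` with the given
+/// scope should be delivered to `agent`.
+fn scope_matches(scope: &ConfigScope, config_workspace: Id, agent: &AgentInfo) -> bool {
+    match scope {
+        ConfigScope::Global => true,
+        ConfigScope::Workspace => agent.workspace_id == config_workspace,
+        ConfigScope::AgentGroup(selector) => match selector.split_once('=') {
+            Some((key, value)) => {
+                agent.workspace_id == config_workspace
+                    && agent.labels.get(key).map(String::as_str) == Some(value)
+            }
+            None => false, // a malformed selector matches nothing
+        },
+        ConfigScope::Agent(id) => agent.id == *id,
+    }
+}
+```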
+
+**Delivery Guarantees**:
+- At-least-once delivery
+- Automatic retry with exponential backoff
+- Config checksum verification
+- Offline agents receive updates on reconnection
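+
+The retry schedule could be computed as in the sketch below. The function name `backoff_delay` and the base and cap values used with it are assumptions for illustration; the spec only says "exponential backoff". Jitter is left out for brevity.
+
+```rust
+use std::time::Duration;
+
+/// Delay before retry number `attempt` (0-based): base * 2^attempt,
+/// capped at `max`. Overflow in either step falls back to the cap.
+fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
+    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
+    base.checked_mul(factor).map_or(max, |delay| delay.min(max))
+}
+```
+
+With a 500 ms base and a 30 s cap, the delays are 500 ms, 1 s, 2 s, 4 s and so on, reaching the cap from the seventh attempt onward.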
+
+---
+
+## Master Features
+
+### MF-001: RESTful API
+
+**Description**: Comprehensive REST API for all operations.
+
+**Base URL**: `/api/v1`
+
+**Resource Endpoints**:
+
+| Resource | Endpoints |
+|----------|-----------|
+| Organizations | GET, POST, PATCH, DELETE `/organizations` |
+| Workspaces | GET, POST, PATCH, DELETE `/workspaces` |
+| Agents | GET, POST, PATCH, DELETE `/agents` |
+| VirtualHosts | GET, POST, PATCH, DELETE `/virtual-hosts` |
+| Upstreams | GET, POST, PATCH, DELETE `/upstreams` |
+| Certificates | GET, POST, DELETE `/certificates` |
+| AccessLogs | GET `/access-logs` |
+| Metrics | GET `/metrics` |
+
+**Response Format**:
+```json
+{
+  "data": { ... },
+  "meta": {
+    "page": 1,
+    "per_page": 20,
+    "total": 100
+  },
+  "links": {
+    "self": "/api/v1/agents?page=1",
+    "next": "/api/v1/agents?page=2",
+    "prev": null
+  }
+}
+```
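+
+The `links` object can be derived from the `meta` values. The sketch below assumes 1-based page numbers and a `?page=` query parameter as in the example above; `PageLinks` and `page_links` are hypothetical names.
+
+```rust
+/// Hypothetical container for the `links` object of a list response.
+struct PageLinks {
+    self_link: String,
+    next: Option<String>,
+    prev: Option<String>,
+}
+
+/// Build pagination links for a collection path such as "/api/v1/agents".
+fn page_links(path: &str, page: u64, per_page: u64, total: u64) -> PageLinks {
+    // Guard against division by zero, then round the page count up.
+    let per_page = per_page.max(1);
+    // An empty collection still has a (blank) first page.
+    let last_page = ((total + per_page - 1) / per_page).max(1);
+    let link = |p: u64| format!("{path}?page={p}");
+    PageLinks {
+        self_link: link(page),
+        next: (page < last_page).then(|| link(page + 1)),
+        prev: (page > 1).then(|| link(page - 1)),
+    }
+}
+```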
+
+**Error Format**:
+```json
+{
+  "error": {
+    "code": "VALIDATION_ERROR",
+    "message": "Invalid configuration",
+    "details": [
+      {"field": "server_name", "message": "Invalid domain format"}
+    ]
+  }
+}
+```
+
+---
+
+### MF-002: Web-based Admin Console (Embedded)
+
+**Description**: Modern web UI for managing the entire system. Built with React + Vite and served as static files embedded directly in the master binary.
+
+**Pages**:
+
+| Page | Features |
+|------|----------|
+| Dashboard | Agent status, recent events, traffic overview |
+| Agents | List, detail view, logs, metrics graphs |
+| Configurations | Virtual host editor, upstream management |
+| Certificates | SSL certificate list, expiration alerts |
+| Access Control | Users, roles, permissions management |
+| Settings | Organization settings, integrations |
+
+**Key UI Features**:
+- Real-time updates via WebSocket
+- Monaco editor for nginx configuration
+- Visual topology view (agent connections)
+- Dark/light mode support
+- Responsive design
+
+---
+
+### MF-003: Configuration Template Engine
+
+**Description**: Templating system for generating nginx configurations.
+
+**Template Variables**:
+```handlebars
+# Example virtual host template
+server {
+    listen {{port}} {{#if ssl}}ssl{{/if}} {{#if http2}}http2{{/if}};
+    server_name {{server_name}};
+    
+    {{#if ssl}}
+    ssl_certificate {{ssl_certificate_path}};
+    ssl_certificate_key {{ssl_certificate_key_path}};
+    {{/if}}
+    
+    location {{location_path}} {
+        proxy_pass http://{{upstream_name}};
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        
+        {{#each custom_headers}}
+        add_header {{name}} "{{value}}";
+        {{/each}}
+        
+        {{#if rate_limiting}}
+        limit_req zone={{rate_limit_zone}} burst={{rate_limit_burst}};
+        {{/if}}
+    }
+}
+```
+
+**Built-in Templates**:
+- `default` - Standard reverse proxy
+- `spa` - Single Page Application (with fallback to index.html)
+- `api` - API gateway with rate limiting
+- `static` - Static file serving with caching
+- `websocket` - WebSocket proxy with connection upgrades
+
+---
+
+### MF-004: Certificate Management (ACME)
+
+**Description**: Automatic SSL/TLS certificate provisioning via Let's Encrypt.
+
+**Features**:
+- ACME v2 protocol support
+- HTTP-01 and DNS-01 challenges
+- Automatic renewal (30 days before expiry)
+- Wildcard certificate support (DNS-01)
+- Certificate monitoring and alerts
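+
+The 30-day renewal rule can be expressed as a small predicate. This is a minimal sketch that uses `std::time::SystemTime` in place of the `DateTime` type from the entity definition; `needs_renewal` and `RENEWAL_WINDOW` are hypothetical names.
+
+```rust
+use std::time::{Duration, SystemTime};
+
+/// Renew when 30 days or less remain before expiry.
+const RENEWAL_WINDOW: Duration = Duration::from_secs(30 * 24 * 60 * 60);
+
+fn needs_renewal(now: SystemTime, expires_at: SystemTime, auto_renew: bool) -> bool {
+    if !auto_renew {
+        return false;
+    }
+    match expires_at.duration_since(now) {
+        Ok(remaining) => remaining <= RENEWAL_WINDOW,
+        // `expires_at` is already in the past: renew immediately.
+        Err(_) => true,
+    }
+}
+```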
+
+**Certificate Entity**:
+```rust
+struct Certificate {
+    id: Uuid,
+    workspace_id: Uuid,
+    domain: String,
+    is_wildcard: bool,
+    provider: CertificateProvider,  // LetsEncrypt, Custom
+    status: CertificateStatus,      // Pending, Active, Expired, Error
+    issued_at: DateTime,
+    expires_at: DateTime,
+    auto_renew: bool,
+    certificate_pem: Option<String>,  // Encrypted at rest
+    private_key_pem: Option<String>,  // Encrypted at rest
+}
+```
+
+---
+
+## Agent Features
+
+### AF-001: Nginx Lifecycle Management
+
+**Description**: Agent manages nginx process lifecycle based on deployment mode.
+
+**Docker Sidecar Mode**:
+- Shares PID namespace with nginx container (via `pid: service:nginx`)
+- Directly signals nginx process for reload/restart
+- Monitors nginx via health checks
+
+**Standalone Mode**:
+- Direct process management (signals to PID from file)
+- systemd integration (optional, for service management)
+- PID file monitoring
+
+**Lifecycle Actions**:
+- `start` - Start nginx
+- `stop` - Graceful shutdown
+- `reload` - Hot reload configuration
+- `restart` - Full restart
+- `test` - Validate configuration
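+
+The actions above map naturally onto the standard nginx command-line flags (`-s quit`, `-s reload`, `-t`). The sketch below shows one possible mapping; `LifecycleAction` and `nginx_invocations` are hypothetical names, and how the agent actually spawns or signals nginx in each deployment mode is not covered.
+
+```rust
+#[derive(Debug, Clone, Copy, PartialEq)]
+enum LifecycleAction {
+    Start,
+    Stop,
+    Reload,
+    Restart,
+    Test,
+}
+
+/// Argument lists to pass to the `nginx` binary, in execution order.
+fn nginx_invocations(action: LifecycleAction) -> Vec<Vec<&'static str>> {
+    match action {
+        // Running `nginx` with no arguments starts the master process.
+        LifecycleAction::Start => vec![vec![]],
+        // `quit` is the graceful variant; `stop` would terminate immediately.
+        LifecycleAction::Stop => vec![vec!["-s", "quit"]],
+        LifecycleAction::Reload => vec![vec!["-s", "reload"]],
+        // nginx has no restart flag: quit, then start again. The caller
+        // must wait for the old master to exit between the two steps.
+        LifecycleAction::Restart => vec![vec!["-s", "quit"], vec![]],
+        LifecycleAction::Test => vec![vec!["-t"]],
+    }
+}
+```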
+
+---
+
+### AF-002: Configuration Rendering and Application
+
+**Description**: Agent renders nginx configs from master templates and applies them using atomic symlink swaps for zero-downtime updates.
+
+**Config Directory Structure**:
+```
+/etc/nginx/
+├── nginx.conf              # Contains: include /etc/nginx/conf.d/current/*.conf
+├── conf.d/
+│   ├── current -> ./20260302143000/    # Symlink to active deployment
+│   ├── 20260302143000/                 # Active config (timestamped)
+│   │   ├── default.conf
+│   │   └── upstream.conf
+│   ├── 20260302141500/                 # Previous deployment (for rollback)
+│   │   ├── default.conf
+│   │   └── upstream.conf
+│   └── 20260302140000/                 # Older deployment (cleanup candidate)
+```
+
+**Config Rendering Flow**:
+1. Receive ConfigUpdate from master
+2. Create new deployment folder: `./conf.d/<timestamp>/`
+3. Render nginx config files into timestamped folder
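+4. **Validate** new config before switching: `nginx -t -c <test-config>`, where `<test-config>` is a top-level nginx.conf that includes `/etc/nginx/conf.d/<timestamp>/*.conf` (the timestamped folder holds only included files, not its own `nginx.conf`)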
+5. If validation passes, **atomically update symlink**: `current` → `<timestamp>/`
+6. Execute graceful nginx reload
+7. Verify reload success (health check)
+8. Report status to master
+9. Cleanup old deployments (keep N recent versions)
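+
+Step 9 can be reduced to a pure selection over directory names, which keeps it easy to test. The sketch below assumes the fixed-width `YYYYMMDDHHMMSS` naming shown in the directory tree; `deployments_to_remove` is a hypothetical helper, and the actual deletion is left to the caller.
+
+```rust
+/// Given deployment directory names and the name `current` points to,
+/// return the names that can be deleted while keeping the `keep` most
+/// recent ones. The active deployment is never returned.
+fn deployments_to_remove(mut names: Vec<String>, active: &str, keep: usize) -> Vec<String> {
+    // Fixed-width timestamps sort chronologically as plain strings.
+    names.sort();
+    names.reverse(); // newest first
+    names
+        .into_iter()
+        .skip(keep)
+        .filter(|name| name.as_str() != active)
+        .collect()
+}
+```
+
+Excluding the active name matters after a rollback, when `current` may point to a deployment that is no longer among the newest.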
+
+**Atomic Config Swap**:
+```rust
+async fn apply_config(&self, config: ConfigUpdate) -> Result<()> {
+    let timestamp = generate_timestamp();
+    let deploy_dir = self.conf_d_path.join(&timestamp);
+    let symlink_path = self.conf_d_path.join("current");
+    
+    // 1. Render config to new timestamped directory
+    self.render_config(&config, &deploy_dir).await?;
+    
+    // 2. Validate BEFORE switching symlink (point to new folder directly)
+    self.validate_config(&deploy_dir).await?;
+    
+    // 3. Atomic symlink swap (Unix: symlink + rename)
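+    let temp_link = self.conf_d_path.join("current.tmp");
+    // Remove a temp link left behind by an interrupted run; symlink() fails if it exists
+    let _ = tokio::fs::remove_file(&temp_link).await;
+    tokio::fs::symlink(&deploy_dir, &temp_link).await?;
+    tokio::fs::rename(&temp_link, &symlink_path).await?;  // Atomic: rename replaces the old link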
+    
+    // 4. Reload nginx (picks up new symlink target)
+    self.reload_nginx().await?;
+    
+    // 5. Verify and cleanup
+    self.verify_health().await?;
+    self.cleanup_old_deployments(5).await?;  // Keep last 5 versions
+    
+    self.report_success(config.id, timestamp).await;
+    Ok(())
+}
+```
+
+**Rollback Strategy**:
+```rust
+async fn rollback(&self, target_timestamp: &str) -> Result<()> {
+    let target_dir = self.conf_d_path.join(target_timestamp);
+    let symlink_path = self.conf_d_path.join("current");
+    
+    // Verify target exists
+    if !target_dir.exists() {
+        return Err(Error::RollbackTargetNotFound);
+    }
+    
+    // Atomic symlink swap back to previous deployment
+    let temp_link = self.conf_d_path.join("current.tmp");
+    tokio::fs::symlink(&target_dir, &temp_link).await?;
+    tokio::fs::rename(&temp_link, &symlink_path).await?;
+    
+    // Reload nginx
+    self.reload_nginx().await?;
+    Ok(())
+}
+```
+
+---
+
+### AF-003: Health Monitoring and Reporting
+
+**Description**: Continuous health monitoring of nginx and the host system.
+
+**Health Checks**:
+- **Nginx Health**: HTTP request to nginx health endpoint
+- **Configuration Health**: Verify current config matches expected
+- **Resource Health**: CPU, memory, disk usage
+- **Connection Health**: Active connections, request rate
+
+**Health Report Structure**:
+```rust
+struct HealthReport {
+    agent_id: Uuid,
+    timestamp: DateTime,
+    nginx_status: NginxStatus,
+    system_metrics: SystemMetrics,
+    config_checksum: String,
+    alerts: Vec<Alert>,
+}
+
+struct NginxStatus {
+    is_running: bool,
+    pid: Option<u32>,
+    uptime_seconds: u64,
+    active_connections: u32,
+    requests_per_second: f64,
+}
+
+struct SystemMetrics {
+    cpu_percent: f64,
+    memory_used_mb: u64,
+    memory_total_mb: u64,
+    disk_used_gb: u64,
+    disk_total_gb: u64,
+}
+```
+
+**Reporting Interval**: Configurable (default: 30 seconds)
+
+---
+
+### AF-004: Metrics Collection and Export
+
+**Description**: Collect and expose metrics in Prometheus format.
+
+**Metrics Endpoint**: `GET /metrics` (on agent)
+
+**Built-in Metrics**:
+```
+# Nginx metrics (parsed from stub_status)
+nxmesh_nginx_connections_active{agent_id="..."} 42
+nxmesh_nginx_connections_reading{agent_id="..."} 5
+nxmesh_nginx_connections_writing{agent_id="..."} 30
+nxmesh_nginx_connections_waiting{agent_id="..."} 7
+nxmesh_nginx_requests_total{agent_id="..."} 1234567
+
+# Agent metrics
+nxmesh_agent_uptime_seconds{agent_id="..."} 86400
+nxmesh_agent_master_connection_status{agent_id="..."} 1
+nxmesh_agent_config_version{agent_id="...",version="123"} 1
+
+# System metrics
+nxmesh_system_cpu_percent{agent_id="..."} 25.5
+nxmesh_system_memory_used_bytes{agent_id="..."} 1073741824
+nxmesh_system_disk_used_bytes{agent_id="..."} 53687091200
+```
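+
+The nginx metrics above come from the `stub_status` page, which has a fixed four-line text layout. A minimal parser might look like the sketch below; `StubStatus` and `parse_stub_status` are hypothetical names, and fetching the page over HTTP is left out.
+
+```rust
+#[derive(Debug, PartialEq)]
+struct StubStatus {
+    active: u64,
+    accepts: u64,
+    handled: u64,
+    requests: u64,
+    reading: u64,
+    writing: u64,
+    waiting: u64,
+}
+
+/// Parse the body of an nginx `stub_status` response.
+/// Returns `None` if the layout is not the expected one.
+fn parse_stub_status(body: &str) -> Option<StubStatus> {
+    let mut lines = body.lines().map(str::trim).filter(|l| !l.is_empty());
+
+    // Line 1: "Active connections: 42"
+    let active: u64 = lines
+        .next()?
+        .strip_prefix("Active connections:")?
+        .trim()
+        .parse()
+        .ok()?;
+
+    // Line 2 is the literal header "server accepts handled requests".
+    lines.next()?;
+
+    // Line 3: three cumulative counters.
+    let mut counters = lines.next()?.split_whitespace().map(|n| n.parse::<u64>().ok());
+    let accepts = counters.next()??;
+    let handled = counters.next()??;
+    let requests = counters.next()??;
+
+    // Line 4: "Reading: 5 Writing: 30 Waiting: 7"
+    let t: Vec<&str> = lines.next()?.split_whitespace().collect();
+    if t.len() != 6 || t[0] != "Reading:" || t[2] != "Writing:" || t[4] != "Waiting:" {
+        return None;
+    }
+    Some(StubStatus {
+        active,
+        accepts,
+        handled,
+        requests,
+        reading: t[1].parse().ok()?,
+        writing: t[3].parse().ok()?,
+        waiting: t[5].parse().ok()?,
+    })
+}
+```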
+
+**Custom Metrics**: Agents can collect custom metrics from nginx access logs
+
+---
+
+### AF-005: Offline Operation and Recovery
+
+**Description**: Agent can operate independently when master is unreachable.
+
+**Offline Capabilities**:
+- Continue serving traffic with cached configuration
+- Local health monitoring continues
+- Metrics are buffered for later transmission
+- Automatic reconnection attempts
+
+**Recovery Flow**:
+1. Detect disconnection from master
+2. Enter "offline mode"
+3. Continue operating with cached config
+4. Buffer metrics and logs
+5. Attempt reconnection with exponential backoff
+6. On reconnection:
+   - Sync configuration (compare checksums)
+   - Transmit buffered metrics
+   - Resume normal operation
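+
+Buffering during offline mode needs an upper bound so a long outage cannot exhaust memory. The sketch below assumes a drop-oldest policy, which the spec does not prescribe; `OfflineBuffer` and its methods are hypothetical names.
+
+```rust
+use std::collections::VecDeque;
+
+/// Bounded FIFO buffer for metrics or log records collected while offline.
+struct OfflineBuffer<T> {
+    items: VecDeque<T>,
+    capacity: usize,
+    dropped: u64, // number of records discarded because the buffer was full
+}
+
+impl<T> OfflineBuffer<T> {
+    fn new(capacity: usize) -> Self {
+        Self { items: VecDeque::new(), capacity, dropped: 0 }
+    }
+
+    /// Append a record, discarding the oldest one when the buffer is full.
+    fn push(&mut self, item: T) {
+        if self.capacity == 0 {
+            self.dropped += 1;
+            return;
+        }
+        if self.items.len() == self.capacity {
+            self.items.pop_front();
+            self.dropped += 1;
+        }
+        self.items.push_back(item);
+    }
+
+    /// Take all buffered records in arrival order, for transmission
+    /// after the connection to the master is restored.
+    fn drain(&mut self) -> Vec<T> {
+        self.items.drain(..).collect()
+    }
+}
+```
+
+Reporting the `dropped` count on reconnection would let the master know that the received history has gaps.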
+
+---
+
+## Configuration Management
+
+### CM-001: Virtual Host Configuration
+
+**Description**: Define nginx server blocks (virtual hosts) via API/UI.
+
+**VirtualHost Entity**:
+```rust
+struct VirtualHost {
+    id: Uuid,
+    workspace_id: Uuid,
+    name: String,              // Human-readable name
+    server_name: String,       // Domain name(s), comma-separated
+    listen_port: u16,          // Usually 80 or 443
+    ssl_enabled: bool,
+    ssl_certificate_id: Option<Uuid>,
+    
+    // Routing configuration
+    locations: Vec<Location>,
+    
+    // Advanced settings
+    http2_enabled: bool,
+    http3_enabled: bool,
+    gzip_enabled: bool,
+    rate_limiting: Option<RateLimitConfig>,
+    
+    // Target agents
+    target_agents: AgentSelector,
+}
+
+struct Location {
+    path: String,              // e.g., "/api" or "~ \.php$"
+    proxy_pass: Option<String>, // e.g., "http://backend"
+    upstream_id: Option<Uuid>,
+    root: Option<String>,      // For static files
+    index: Option<String>,     // e.g., "index.html"
+    custom_headers: Vec<Header>,
+    rewrite_rules: Vec<RewriteRule>,
+}
+```
+
+**Validation Rules**:
+- `server_name` must be valid domain(s)
+- `listen_port` must be 1-65535
+- SSL certificate must exist if `ssl_enabled` is true
+- At least one location must be defined
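+
+The rules above could be checked as in the following sketch, which returns errors in the `field`/`message` shape of the API error format. The domain check is a simplified hostname test (ASCII labels, optional leading `*.`), not a full validator, and `VirtualHostInput`, `FieldError`, `is_valid_domain`, and `validate` are hypothetical names.
+
+```rust
+/// Hypothetical flattened view of the fields the rules refer to.
+struct VirtualHostInput {
+    server_name: String, // comma-separated domains
+    listen_port: u16,
+    ssl_enabled: bool,
+    certificate_exists: bool,
+    location_count: usize,
+}
+
+#[derive(Debug, PartialEq)]
+struct FieldError {
+    field: &'static str,
+    message: String,
+}
+
+/// Simplified hostname check: dot-separated labels of 1-63 characters,
+/// letters, digits and inner hyphens only, optional "*." wildcard prefix.
+fn is_valid_domain(name: &str) -> bool {
+    let name = name.strip_prefix("*.").unwrap_or(name);
+    !name.is_empty()
+        && name.len() <= 253
+        && name.split('.').all(|label| {
+            !label.is_empty()
+                && label.len() <= 63
+                && !label.starts_with('-')
+                && !label.ends_with('-')
+                && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
+        })
+}
+
+fn validate(vh: &VirtualHostInput) -> Vec<FieldError> {
+    let mut errors = Vec::new();
+    for name in vh.server_name.split(',').map(str::trim) {
+        if !is_valid_domain(name) {
+            errors.push(FieldError {
+                field: "server_name",
+                message: format!("Invalid domain format: {name:?}"),
+            });
+        }
+    }
+    // `u16` already caps the port at 65535, so only 0 needs rejecting.
+    if vh.listen_port == 0 {
+        errors.push(FieldError { field: "listen_port", message: "Port must be 1-65535".into() });
+    }
+    if vh.ssl_enabled && !vh.certificate_exists {
+        errors.push(FieldError { field: "ssl_certificate_id", message: "Certificate not found".into() });
+    }
+    if vh.location_count == 0 {
+        errors.push(FieldError { field: "locations", message: "At least one location is required".into() });
+    }
+    errors
+}
+```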
+
+---
+
+### CM-002: Upstream Configuration
+
+**Description**: Define backend server pools for load balancing.
+
+**Upstream Entity**:
+```rust
+struct Upstream {
+    id: Uuid,
+    workspace_id: Uuid,
+    name: String,              // Used as upstream identifier
+    
+    // Load balancing algorithm
+    algorithm: LoadBalanceAlgorithm,  // RoundRobin, LeastConnections, IPHash, WeightedRoundRobin
+    
+    // Backend servers
+    servers: Vec<UpstreamServer>,
+    
+    // Health check configuration
+    health_check: Option<HealthCheckConfig>,
+    
+    // Connection settings
+    keepalive_connections: Option<u32>,
+    keepalive_timeout: Option<u32>,
+}
+
+struct UpstreamServer {
+    address: String,           // IP:port or hostname:port
+    weight: u32,               // Default: 1
+    backup: bool,              // Backup server
+    down: bool,                // Temporarily down
+    max_fails: u32,            // Default: 1
+    fail_timeout: u32,         // Seconds, default: 10
+}
+
+enum LoadBalanceAlgorithm {
+    RoundRobin,
+    LeastConnections,
+    IPHash,
+    WeightedRoundRobin,
+}
+```
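+
+To show how these entities relate to nginx syntax, the sketch below renders an `upstream` block using the standard `least_conn`, `ip_hash`, `server`, and `keepalive` directives. `render_upstream` is a hypothetical function, the structs are trimmed copies of the ones above, and weighted round robin is assumed to be plain round robin with `weight=` parameters.
+
+```rust
+enum LoadBalanceAlgorithm {
+    RoundRobin,
+    LeastConnections,
+    IPHash,
+    WeightedRoundRobin,
+}
+
+struct UpstreamServer {
+    address: String,
+    weight: u32,
+    backup: bool,
+    down: bool,
+    max_fails: u32,
+    fail_timeout: u32, // seconds
+}
+
+fn render_upstream(
+    name: &str,
+    algorithm: &LoadBalanceAlgorithm,
+    servers: &[UpstreamServer],
+    keepalive: Option<u32>,
+) -> String {
+    let mut out = format!("upstream {name} {{\n");
+    match algorithm {
+        LoadBalanceAlgorithm::LeastConnections => out.push_str("    least_conn;\n"),
+        LoadBalanceAlgorithm::IPHash => out.push_str("    ip_hash;\n"),
+        // Round robin is nginx's default and needs no directive.
+        LoadBalanceAlgorithm::RoundRobin | LoadBalanceAlgorithm::WeightedRoundRobin => {}
+    }
+    for s in servers {
+        let mut line = format!(
+            "    server {} weight={} max_fails={} fail_timeout={}s",
+            s.address, s.weight, s.max_fails, s.fail_timeout
+        );
+        if s.backup {
+            line.push_str(" backup");
+        }
+        if s.down {
+            line.push_str(" down");
+        }
+        line.push_str(";\n");
+        out.push_str(&line);
+    }
+    if let Some(n) = keepalive {
+        out.push_str(&format!("    keepalive {n};\n"));
+    }
+    out.push_str("}\n");
+    out
+}
+```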
+
+---
+
+### CM-003: Configuration Versioning
+
+**Description**: Track all configuration changes with full history.
+
+**Versioning Features**:
+- Every change creates a new version
+- Versions are immutable
+- Rollback to any previous version
+- Diff between versions
+- Audit log of who changed what
+
+**Version Entity**:
+```rust
+struct ConfigVersion {
+    id: Uuid,
+    resource_type: String,     // "virtual_host", "upstream", etc.
+    resource_id: Uuid,
+    version_number: u64,       // Auto-incrementing
+    data: Json,                // Full configuration snapshot
+    checksum: String,          // SHA-256 of data
+    created_by: Uuid,          // User ID
+    created_at: DateTime,
+    change_summary: String,    // Human-readable description
+}
+```
+
+**API Endpoints**:
+- `GET /api/v1/virtual-hosts/{id}/versions` - List versions
+- `GET /api/v1/virtual-hosts/{id}/versions/{version}` - Get specific version
+- `POST /api/v1/virtual-hosts/{id}/rollback` - Rollback to version
+- `GET /api/v1/virtual-hosts/{id}/diff?from=v1&to=v2` - Compare versions
+
+---
+
+## Observability
+
+### OB-001: Structured Logging
+
+**Description**: Comprehensive logging with structured format.
+
+**Log Levels**: ERROR, WARN, INFO, DEBUG, TRACE
+
+**Log Fields**:
+```json
+{
+  "timestamp": "2026-03-02T10:30:00Z",
+  "level": "INFO",
+  "component": "agent",
+  "agent_id": "550e8400-e29b-41d4-a716-446655440000",
+  "trace_id": "abc123",
+  "span_id": "def456",
+  "message": "Configuration applied successfully",
+  "fields": {
+    "config_id": "config-123",
+    "version": 42,
+    "duration_ms": 150
+  }
+}
+```
+
+**Log Targets**:
+- Master: systemd journal, file, or centralized (ELK/Loki)
+- Agent: stdout (Docker), file (standalone), or remote
+
+---
+
+### OB-002: Distributed Tracing
+
+**Description**: OpenTelemetry tracing for request flow visualization.
+
+**Traced Operations**:
+- Configuration push (master → agent → nginx)
+- Health check cycles
+- Certificate issuance
+- API requests
+
+**Span Attributes**:
+- `nxmesh.agent_id`
+- `nxmesh.config_id`
+- `nxmesh.workspace_id`
+- `nxmesh.organization_id`
+
+---
+
+### OB-003: Access Log Aggregation
+
+**Description**: Collect and query nginx access logs from all agents.
+
+**Features**:
+- Centralized access log storage
+- Real-time log streaming
+- SQL-like query interface
+- Log retention policies
+
+**Access Log Schema**:
+```rust
+struct AccessLogEntry {
+    id: Uuid,
+    agent_id: Uuid,
+    timestamp: DateTime,
+    
+    // Request details
+    remote_addr: String,
+    method: String,
+    uri: String,
+    protocol: String,
+    host: String,
+    
+    // Response details
+    status: u16,
+    body_bytes_sent: u64,
+    response_time_ms: f64,
+    
+    // Additional fields
+    user_agent: Option<String>,
+    referer: Option<String>,
+    request_id: Option<String>,
+}
+```
+
+**Query API**:
+```graphql
+# Example query
+query {
+  accessLogs(
+    filter: {
+      agentId: "...",
+      timeRange: { from: "2026-03-01", to: "2026-03-02" },
+      statusCode: { gte: 500 }
+    },
+    limit: 100
+  ) {
+    timestamp
+    method
+    uri
+    status
+    responseTimeMs
+  }
+}
+```
+
+---
+
+## Security Features
+
+### SF-001: Authentication and Authorization
+
+**Description**: Multi-method authentication with fine-grained RBAC.
+
+**Authentication Methods**:
+- JWT (for API/Web UI)
+- Password-based login (local user accounts)
+- OAuth2/OIDC (Google, GitHub, enterprise SSO)
+- API Keys (for service accounts)
+- **TLS + Shared Secret** (for agent communication)
+  - Server-side TLS (auto-generated self-signed or custom certificates)
+  - Bootstrap token for initial registration
+  - Session key with HMAC signing for ongoing requests
+  - Primary/secondary key rotation
+
+**RBAC Model**:
+```rust
+struct Role {
+    id: Uuid,
+    name: String,
+    permissions: Vec<Permission>,
+}
+
+enum Permission {
+    // Organization scope
+    OrganizationRead,
+    OrganizationWrite,
+    OrganizationDelete,
+    
+    // Workspace scope
+    WorkspaceRead,
+    WorkspaceWrite,
+    WorkspaceDelete,
+    
+    // Agent scope
+    AgentRead,
+    AgentWrite,
+    AgentReload,
+    AgentDelete,
+    
+    // Config scope
+    ConfigRead,
+    ConfigWrite,
+    ConfigDeploy,
+    ConfigDelete,
+    
+    // Certificate scope
+    CertificateRead,
+    CertificateWrite,
+    CertificateDelete,
+    
+    // User management
+    UserRead,
+    UserWrite,
+    UserDelete,
+}
+```
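+
+A permission check over this model can be as small as the sketch below. It assumes a user holds a flat list of roles and that any role granting the permission is sufficient; scoping a role to a particular organization or workspace is not modeled. `is_allowed` is a hypothetical helper, and only a subset of the permissions is repeated here.
+
+```rust
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum Permission {
+    AgentRead,
+    AgentReload,
+    ConfigRead,
+    ConfigDeploy,
+}
+
+struct Role {
+    name: String,
+    permissions: Vec<Permission>,
+}
+
+/// True if at least one of the user's roles grants `required`.
+fn is_allowed(roles: &[Role], required: Permission) -> bool {
+    roles.iter().any(|role| role.permissions.contains(&required))
+}
+```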
+
+---
+
+### SF-002: Secret Management
+
+**Description**: Secure storage and distribution of sensitive data.
+
+**Secrets**:
+- SSL private keys
+- API tokens
+- Database passwords
+- External service credentials
+
+**Security Measures**:
+- Encryption at rest (AES-256-GCM)
+- Encryption in transit (TLS 1.3)
+- Automatic secret rotation
+- Audit logging for secret access
+
+---
+
+### SF-003: Network Security
+
+**Description**: Network-level security controls.
+
+**Features**:
+- IP allowlisting for agent connections
+- Rate limiting on API endpoints
+- DDoS protection recommendations
+- Security headers enforcement (HSTS, CSP, etc.)
+
+**Agent Connection Security**:
+- **TLS Encryption**: Server-side TLS (auto-generated or custom certificates)
+  - Development: Self-signed certificates auto-generated on first start
+  - Production: Valid certificates (Let's Encrypt or corporate CA)
+- **Bootstrap Authentication**: One-time token for initial registration
+- **Session Authentication**: HMAC-signed requests with shared session key
+- **Key Rotation**: Primary/secondary key design for seamless rotation
+- **Certificate Pinning**: Optional fingerprint verification for additional security