Add project structure and roadmap documentation

- Created `project-structure.md` to outline the directory layout, crate dependencies, design principles, module guidelines, and naming conventions for the NxMesh codebase.
- Introduced `roadmap.md` detailing the development phases, milestones, tasks, deliverables, and resource requirements for the NxMesh project, spanning from foundational setup to enterprise features.
GW_MC
2026-03-03 04:13:31 +00:00
parent 39bd860c55
commit 43b2e44d95
11 changed files with 9293 additions and 7 deletions

docs/features.md Normal file

@@ -0,0 +1,814 @@
# NxMesh Feature Specification
## Table of Contents
1. [Core Features](#core-features)
2. [Master Features](#master-features)
3. [Agent Features](#agent-features)
4. [Configuration Management](#configuration-management)
5. [Observability](#observability)
6. [Security Features](#security-features)
---
## Core Features
### CF-001: Multi-tenancy with Organizations and Workspaces
**Description**: Support for multiple organizations with isolated workspaces within each organization.
**Requirements**:
- Organizations are top-level resource containers
- Each organization can have multiple workspaces
- Resources (agents, configs, certificates) are scoped to a workspace
- Cross-workspace visibility is configurable
**Data Model**:
```rust
struct Organization {
    id: Uuid,
    name: String,
    slug: String, // URL-friendly identifier
    created_at: DateTime,
    settings: OrganizationSettings,
}

struct Workspace {
    id: Uuid,
    organization_id: Uuid,
    name: String,
    slug: String,
    created_at: DateTime,
}
```
**API Endpoints**:
- `GET /api/v1/organizations` - List organizations
- `POST /api/v1/organizations` - Create organization
- `GET /api/v1/organizations/{id}/workspaces` - List workspaces
- `POST /api/v1/organizations/{id}/workspaces` - Create workspace
---
### CF-002: Agent Registration and Lifecycle Management
**Description**: Agents must register with the master before receiving configurations.
**Registration Flow**:
1. Administrator generates bootstrap token in Master UI
2. Token is provided to agent via environment variable or config file
3. Agent establishes TLS connection to master (verifies server certificate)
4. Agent sends bootstrap token for registration
5. Master validates token and establishes shared secret:
- Master generates session_key (per-agent) + key_id
- Session key used for HMAC request signing
- Primary/secondary key design for rotation
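The primary/secondary design in step 5 could be sketched as a small key ring. This is only an illustration under assumptions: `SessionKey`, `KeyRing`, `rotate`, and `lookup` are hypothetical names not taken from the NxMesh codebase, and the HMAC computation itself is left out.

```rust
/// A per-agent session key identified by `key_id` (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
struct SessionKey {
    key_id: String,
    secret: Vec<u8>,
}

/// Holds the key used for new signatures plus the previous key, which
/// stays valid for verification until the next rotation.
struct KeyRing {
    primary: SessionKey,
    secondary: Option<SessionKey>,
}

impl KeyRing {
    fn new(primary: SessionKey) -> Self {
        KeyRing { primary, secondary: None }
    }

    /// Promote `next` to primary; the old primary becomes secondary.
    fn rotate(&mut self, next: SessionKey) {
        let old = std::mem::replace(&mut self.primary, next);
        self.secondary = Some(old);
    }

    /// Find the key a request claims to be signed with, if still valid.
    fn lookup(&self, key_id: &str) -> Option<&SessionKey> {
        if self.primary.key_id == key_id {
            return Some(&self.primary);
        }
        self.secondary.as_ref().filter(|k| k.key_id == key_id)
    }
}
```

With this shape, requests signed with the previous key keep verifying for one rotation cycle, so agents that have not yet picked up the new key are not rejected.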
**Agent States**:
```rust
enum AgentState {
    Pending,     // Registered but never connected
    Online,      // Connected and healthy
    Offline,     // Disconnected
    Degraded,    // Connected but health checks failing
    Maintenance, // Manually placed in maintenance mode
}
```
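One way the master might derive these states from what it observes is sketched below. The `AgentObservation` struct, the `offline_after` threshold, and the precedence order (maintenance first, then pending, offline, degraded) are assumptions made for illustration; the spec does not define them.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
enum AgentState {
    Pending,
    Online,
    Offline,
    Degraded,
    Maintenance,
}

/// What the master knows about an agent at evaluation time (hypothetical).
struct AgentObservation {
    ever_connected: bool,
    in_maintenance: bool,
    since_last_seen: Duration,
    health_checks_passing: bool,
}

/// Map an observation to a state, checking the most specific cases first.
fn derive_state(obs: &AgentObservation, offline_after: Duration) -> AgentState {
    if obs.in_maintenance {
        return AgentState::Maintenance;
    }
    if !obs.ever_connected {
        return AgentState::Pending;
    }
    if obs.since_last_seen > offline_after {
        return AgentState::Offline;
    }
    if !obs.health_checks_passing {
        return AgentState::Degraded;
    }
    AgentState::Online
}
```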
**Agent Metadata**:
```rust
struct Agent {
    id: Uuid,
    workspace_id: Uuid,
    name: String,
    hostname: String,
    ip_address: String,
    version: String,
    state: AgentState,
    deployment_mode: DeploymentMode, // DockerSidecar, K8sSidecar, Standalone
    last_seen_at: DateTime,
    capabilities: Vec<String>, // e.g., ["http3", "websocket", "rate_limiting"]
    labels: HashMap<String, String>, // e.g., {"env": "prod", "region": "us-east"}
}
```
**API Endpoints**:
- `POST /api/v1/agents/register` - Register new agent
- `GET /api/v1/agents` - List agents
- `GET /api/v1/agents/{id}` - Get agent details
- `POST /api/v1/agents/{id}/tokens` - Generate registration token
- `DELETE /api/v1/agents/{id}` - Deregister agent
---
### CF-003: Real-time Configuration Distribution
**Description**: Push configuration changes to agents in real-time with delivery guarantees.
**Requirements**:
- Config changes propagate to all affected agents within 5 seconds
- Support for targeted updates (specific agents or groups)
- Config versioning with rollback capability
- Delivery confirmation from agents
**Configuration Scope**:
```rust
enum ConfigScope {
    Global,             // All agents
    Workspace,          // All agents in workspace
    AgentGroup(String), // Agents with specific label selector
    Agent(Uuid),        // Single agent
}
```
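A minimal sketch of how a scope could be resolved against a single agent follows. The selector syntax (`key=value` pairs separated by commas) is an assumption, since the spec only says "label selector"; `u128` stands in for `Uuid` to keep the example dependency-free, and `AgentInfo`, `selector_matches`, and `scope_applies` are hypothetical names.

```rust
use std::collections::HashMap;

enum ConfigScope {
    Global,
    Workspace,
    AgentGroup(String), // assumed syntax: "env=prod,region=us-east"
    Agent(u128),        // u128 stands in for Uuid
}

struct AgentInfo {
    id: u128,
    labels: HashMap<String, String>,
}

/// True if every `key=value` pair in the selector is present in `labels`.
/// An empty selector matches every agent.
fn selector_matches(selector: &str, labels: &HashMap<String, String>) -> bool {
    selector
        .split(',')
        .map(str::trim)
        .filter(|pair| !pair.is_empty())
        .all(|pair| match pair.split_once('=') {
            Some((k, v)) => labels.get(k.trim()).map(String::as_str) == Some(v.trim()),
            None => false, // malformed pair never matches
        })
}

/// Assumes the caller has already restricted `agent` to the right workspace.
fn scope_applies(scope: &ConfigScope, agent: &AgentInfo) -> bool {
    match scope {
        ConfigScope::Global | ConfigScope::Workspace => true,
        ConfigScope::AgentGroup(selector) => selector_matches(selector, &agent.labels),
        ConfigScope::Agent(id) => *id == agent.id,
    }
}
```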
**Delivery Guarantees**:
- At-least-once delivery
- Automatic retry with exponential backoff
- Config checksum verification
- Offline agents receive updates on reconnection
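The retry schedule could look roughly like the following. The function name `backoff_delay` and the base and cap values are illustrative assumptions; the spec does not fix them, and a real implementation would typically add random jitter.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt,
/// capped at `max`. Overflow in either step falls back to the cap.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}
```

With a 1 second base and a 60 second cap, the delays grow 1 s, 2 s, 4 s, 8 s and so on until they reach the cap.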
---
## Master Features
### MF-001: RESTful API
**Description**: Comprehensive REST API for all operations.
**Base URL**: `/api/v1`
**Resource Endpoints**:
| Resource | Endpoints |
|----------|-----------|
| Organizations | GET, POST, PATCH, DELETE `/organizations` |
| Workspaces | GET, POST, PATCH, DELETE `/workspaces` |
| Agents | GET, POST, PATCH, DELETE `/agents` |
| VirtualHosts | GET, POST, PATCH, DELETE `/virtual-hosts` |
| Upstreams | GET, POST, PATCH, DELETE `/upstreams` |
| Certificates | GET, POST, DELETE `/certificates` |
| AccessLogs | GET `/access-logs` |
| Metrics | GET `/metrics` |
**Response Format**:
```json
{
  "data": { ... },
  "meta": {
    "page": 1,
    "per_page": 20,
    "total": 100
  },
  "links": {
    "self": "/api/v1/agents?page=1",
    "next": "/api/v1/agents?page=2",
    "prev": null
  }
}
```
**Error Format**:
```json
{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Invalid configuration",
    "details": [
      {"field": "server_name", "message": "Invalid domain format"}
    ]
  }
}
```
---
### MF-002: Web-based Admin Console (Embedded)
**Description**: Modern web UI for managing the entire system. Built with React + Vite and served as static files embedded directly in the master binary.
**Pages**:
| Page | Features |
|------|----------|
| Dashboard | Agent status, recent events, traffic overview |
| Agents | List, detail view, logs, metrics graphs |
| Configurations | Virtual host editor, upstream management |
| Certificates | SSL certificate list, expiration alerts |
| Access Control | Users, roles, permissions management |
| Settings | Organization settings, integrations |
**Key UI Features**:
- Real-time updates via WebSocket
- Monaco editor for nginx configuration
- Visual topology view (agent connections)
- Dark/light mode support
- Responsive design
---
### MF-003: Configuration Template Engine
**Description**: Templating system for generating nginx configurations.
**Template Variables**:
```handlebars
# Example virtual host template
server {
    listen {{port}} {{#if ssl}}ssl{{/if}} {{#if http2}}http2{{/if}};
    server_name {{server_name}};

    {{#if ssl}}
    ssl_certificate {{ssl_certificate_path}};
    ssl_certificate_key {{ssl_certificate_key_path}};
    {{/if}}

    location {{location_path}} {
        proxy_pass http://{{upstream_name}};
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        {{#each custom_headers}}
        add_header {{name}} "{{value}}";
        {{/each}}

        {{#if rate_limiting}}
        limit_req zone={{rate_limit_zone}} burst={{rate_limit_burst}};
        {{/if}}
    }
}
```
**Built-in Templates**:
- `default` - Standard reverse proxy
- `spa` - Single Page Application (with fallback to index.html)
- `api` - API gateway with rate limiting
- `static` - Static file serving with caching
- `websocket` - WebSocket proxy with connection upgrades
---
### MF-004: Certificate Management (ACME)
**Description**: Automatic SSL/TLS certificate provisioning via Let's Encrypt.
**Features**:
- ACME v2 protocol support
- HTTP-01 and DNS-01 challenges
- Automatic renewal (30 days before expiry)
- Wildcard certificate support (DNS-01)
- Certificate monitoring and alerts
**Certificate Entity**:
```rust
struct Certificate {
    id: Uuid,
    workspace_id: Uuid,
    domain: String,
    is_wildcard: bool,
    provider: CertificateProvider, // LetsEncrypt, Custom
    status: CertificateStatus,     // Pending, Active, Expired, Error
    issued_at: DateTime,
    expires_at: DateTime,
    auto_renew: bool,
    certificate_pem: Option<String>, // Encrypted at rest
    private_key_pem: Option<String>, // Encrypted at rest
}
```
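The "30 days before expiry" renewal rule can be expressed as a small predicate. This sketch uses `std::time::SystemTime` in place of the spec's `DateTime`, and the names `RENEWAL_WINDOW` and `needs_renewal` are hypothetical.

```rust
use std::time::{Duration, SystemTime};

/// Renew once the certificate has 30 days or less left.
const RENEWAL_WINDOW: Duration = Duration::from_secs(30 * 24 * 60 * 60);

fn needs_renewal(expires_at: SystemTime, now: SystemTime, auto_renew: bool) -> bool {
    if !auto_renew {
        return false;
    }
    match expires_at.duration_since(now) {
        Ok(remaining) => remaining <= RENEWAL_WINDOW,
        // `expires_at` is earlier than `now`: already expired, renew immediately.
        Err(_) => true,
    }
}
```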
---
## Agent Features
### AF-001: Nginx Lifecycle Management
**Description**: Agent manages nginx process lifecycle based on deployment mode.
**Docker Sidecar Mode**:
- Shares PID namespace with nginx container (via `pid: service:nginx`)
- Directly signals nginx process for reload/restart
- Monitors nginx via health checks
**Standalone Mode**:
- Direct process management (signals to PID from file)
- systemd integration (optional, for service management)
- PID file monitoring
**Lifecycle Actions**:
- `start` - Start nginx
- `stop` - Graceful shutdown
- `reload` - Hot reload configuration
- `restart` - Full restart
- `test` - Validate configuration
---
### AF-002: Configuration Rendering and Application
**Description**: Agent renders nginx configs from master templates and applies them using atomic symlink swaps for zero-downtime updates.
**Config Directory Structure**:
```
/etc/nginx/
├── nginx.conf # Contains: include /etc/nginx/conf.d/current/*.conf
├── conf.d/
│ ├── current -> ./20260302143000/ # Symlink to active deployment
│ ├── 20260302143000/ # Active config (timestamped)
│ │ ├── default.conf
│ │ └── upstream.conf
│ ├── 20260302141500/ # Previous deployment (for rollback)
│ │ ├── default.conf
│ │ └── upstream.conf
│ └── 20260302140000/ # Older deployment (cleanup candidate)
```
**Config Rendering Flow**:
1. Receive ConfigUpdate from master
2. Create new deployment folder: `./conf.d/<timestamp>/`
3. Render nginx config files into timestamped folder
4. **Validate** new config: `nginx -t -c /etc/nginx/conf.d/<timestamp>/nginx.conf`
5. If validation passes, **atomically update symlink**: `current` → `<timestamp>/`
6. Execute graceful nginx reload
7. Verify reload success (health check)
8. Report status to master
9. Cleanup old deployments (keep N recent versions)
**Atomic Config Swap**:
```rust
async fn apply_config(&self, config: ConfigUpdate) -> Result<()> {
    let timestamp = generate_timestamp();
    let deploy_dir = self.conf_d_path.join(&timestamp);
    let symlink_path = self.conf_d_path.join("current");

    // 1. Render config to new timestamped directory
    self.render_config(&config, &deploy_dir).await?;

    // 2. Validate BEFORE switching symlink (point to new folder directly)
    self.validate_config(&deploy_dir).await?;

    // 3. Atomic symlink swap (Unix: symlink + rename)
    let temp_link = self.conf_d_path.join("current.tmp");
    // Clear a stale temp link left by an interrupted run; ignore "not found"
    let _ = tokio::fs::remove_file(&temp_link).await;
    tokio::fs::symlink(&deploy_dir, &temp_link).await?;
    tokio::fs::rename(&temp_link, &symlink_path).await?; // Atomic operation

    // 4. Reload nginx (picks up new symlink target)
    self.reload_nginx().await?;

    // 5. Verify and cleanup
    self.verify_health().await?;
    self.cleanup_old_deployments(5).await?; // Keep last 5 versions
    self.report_success(config.id, timestamp).await;
    Ok(())
}
```
**Rollback Strategy**:
```rust
async fn rollback(&self, target_timestamp: &str) -> Result<()> {
    let target_dir = self.conf_d_path.join(target_timestamp);
    let symlink_path = self.conf_d_path.join("current");

    // Verify target exists
    if !target_dir.exists() {
        return Err(Error::RollbackTargetNotFound);
    }

    // Atomic symlink swap back to previous deployment
    let temp_link = self.conf_d_path.join("current.tmp");
    // Clear a stale temp link left by an interrupted run; ignore "not found"
    let _ = tokio::fs::remove_file(&temp_link).await;
    tokio::fs::symlink(&target_dir, &temp_link).await?;
    tokio::fs::rename(&temp_link, &symlink_path).await?;

    // Reload nginx
    self.reload_nginx().await?;
    Ok(())
}
```
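The "keep N recent versions" cleanup step might select its deletion candidates as sketched here. This is an illustration, not the agent's actual code: `cleanup_candidates` is a hypothetical pure helper, and it assumes every deployment directory name is a fixed-width timestamp such as `20260302143000`. It never returns the directory that `current` points at, which matters after a rollback to an older deployment.

```rust
/// Given deployment directory names, return the ones that may be deleted:
/// everything older than the `keep` newest, except the active deployment.
fn cleanup_candidates(mut names: Vec<String>, current: &str, keep: usize) -> Vec<String> {
    // Fixed-width timestamps sort chronologically as plain strings.
    names.sort();
    names.reverse(); // newest first
    names
        .into_iter()
        .skip(keep)
        .filter(|name| name.as_str() != current)
        .collect()
}
```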
---
### AF-003: Health Monitoring and Reporting
**Description**: Continuous health monitoring of nginx and the host system.
**Health Checks**:
- **Nginx Health**: HTTP request to nginx health endpoint
- **Configuration Health**: Verify current config matches expected
- **Resource Health**: CPU, memory, disk usage
- **Connection Health**: Active connections, request rate
**Health Report Structure**:
```rust
struct HealthReport {
    agent_id: Uuid,
    timestamp: DateTime,
    nginx_status: NginxStatus,
    system_metrics: SystemMetrics,
    config_checksum: String,
    alerts: Vec<Alert>,
}

struct NginxStatus {
    is_running: bool,
    pid: Option<u32>,
    uptime_seconds: u64,
    active_connections: u32,
    requests_per_second: f64,
}

struct SystemMetrics {
    cpu_percent: f64,
    memory_used_mb: u64,
    memory_total_mb: u64,
    disk_used_gb: u64,
    disk_total_gb: u64,
}
```
**Reporting Interval**: Configurable (default: 30 seconds)
---
### AF-004: Metrics Collection and Export
**Description**: Collect and expose metrics in Prometheus format.
**Metrics Endpoint**: `GET /metrics` (on agent)
**Built-in Metrics**:
```
# Nginx metrics (parsed from stub_status)
nxmesh_nginx_connections_active{agent_id="..."} 42
nxmesh_nginx_connections_reading{agent_id="..."} 5
nxmesh_nginx_connections_writing{agent_id="..."} 30
nxmesh_nginx_connections_waiting{agent_id="..."} 7
nxmesh_nginx_requests_total{agent_id="..."} 1234567
# Agent metrics
nxmesh_agent_uptime_seconds{agent_id="..."} 86400
nxmesh_agent_master_connection_status{agent_id="..."} 1
nxmesh_agent_config_version{agent_id="...",version="123"} 1
# System metrics
nxmesh_system_cpu_percent{agent_id="..."} 25.5
nxmesh_system_memory_used_bytes{agent_id="..."} 1073741824
nxmesh_system_disk_used_bytes{agent_id="..."} 53687091200
```
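The nginx metrics above are described as parsed from `stub_status`. A minimal parser for the plain-text page that module serves could look like this; `StubStatus` and `parse_stub_status` are hypothetical names, and the sketch assumes the standard four-line output format.

```rust
#[derive(Debug, PartialEq)]
struct StubStatus {
    active: u64,
    reading: u64,
    writing: u64,
    waiting: u64,
    requests_total: u64,
}

/// Parse the four-line stub_status body; returns None on any mismatch.
fn parse_stub_status(body: &str) -> Option<StubStatus> {
    let mut lines = body.lines().map(str::trim);

    // Line 1: "Active connections: 291"
    let active = lines
        .next()?
        .strip_prefix("Active connections:")?
        .trim()
        .parse::<u64>()
        .ok()?;

    // Line 2 is the header "server accepts handled requests".
    lines.next()?;

    // Line 3: accepts, handled, requests counters.
    let counters = lines
        .next()?
        .split_whitespace()
        .map(|n| n.parse::<u64>().ok())
        .collect::<Option<Vec<u64>>>()?;
    let requests_total = *counters.get(2)?;

    // Line 4: "Reading: 6 Writing: 179 Waiting: 106"
    let parts: Vec<&str> = lines.next()?.split_whitespace().collect();
    if parts.len() != 6 {
        return None;
    }
    Some(StubStatus {
        active,
        reading: parts[1].parse().ok()?,
        writing: parts[3].parse().ok()?,
        waiting: parts[5].parse().ok()?,
        requests_total,
    })
}
```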
**Custom Metrics**: Agents can collect custom metrics from nginx access logs
---
### AF-005: Offline Operation and Recovery
**Description**: Agent can operate independently when master is unreachable.
**Offline Capabilities**:
- Continue serving traffic with cached configuration
- Local health monitoring continues
- Metrics are buffered for later transmission
- Automatic reconnection attempts
**Recovery Flow**:
1. Detect disconnection from master
2. Enter "offline mode"
3. Continue operating with cached config
4. Buffer metrics and logs
5. Attempt reconnection with exponential backoff
6. On reconnection:
- Sync configuration (compare checksums)
- Transmit buffered metrics
- Resume normal operation
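Buffering during offline mode needs a bound so that a long outage cannot exhaust memory. One possible shape is a ring buffer that drops the oldest entries first; `OfflineBuffer`, its capacity, and the drop-oldest policy are assumptions for illustration, as the spec does not say how overflow is handled.

```rust
use std::collections::VecDeque;

/// Bounded FIFO buffer that discards the oldest item when full.
struct OfflineBuffer<T> {
    items: VecDeque<T>,
    capacity: usize,
    dropped: u64, // how many items were discarded due to overflow
}

impl<T> OfflineBuffer<T> {
    fn new(capacity: usize) -> Self {
        OfflineBuffer { items: VecDeque::with_capacity(capacity), capacity, dropped: 0 }
    }

    fn push(&mut self, item: T) {
        if self.capacity == 0 {
            self.dropped += 1;
            return;
        }
        if self.items.len() == self.capacity {
            self.items.pop_front();
            self.dropped += 1;
        }
        self.items.push_back(item);
    }

    /// Hand over everything buffered so far, oldest first, for transmission
    /// after reconnecting.
    fn drain_all(&mut self) -> Vec<T> {
        self.items.drain(..).collect()
    }
}
```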
---
## Configuration Management
### CM-001: Virtual Host Configuration
**Description**: Define nginx server blocks (virtual hosts) via API/UI.
**VirtualHost Entity**:
```rust
struct VirtualHost {
    id: Uuid,
    workspace_id: Uuid,
    name: String,        // Human-readable name
    server_name: String, // Domain name(s), comma-separated
    listen_port: u16,    // Usually 80 or 443
    ssl_enabled: bool,
    ssl_certificate_id: Option<Uuid>,

    // Routing configuration
    locations: Vec<Location>,

    // Advanced settings
    http2_enabled: bool,
    http3_enabled: bool,
    gzip_enabled: bool,
    rate_limiting: Option<RateLimitConfig>,

    // Target agents
    target_agents: AgentSelector,
}

struct Location {
    path: String,               // e.g., "/api" or "~ \.php$"
    proxy_pass: Option<String>, // e.g., "http://backend"
    upstream_id: Option<Uuid>,
    root: Option<String>,       // For static files
    index: Option<String>,      // e.g., "index.html"
    custom_headers: Vec<Header>,
    rewrite_rules: Vec<RewriteRule>,
}
```
**Validation Rules**:
- `server_name` must be valid domain(s)
- `listen_port` must be 1-65535
- SSL certificate must exist if `ssl_enabled` is true
- At least one location must be defined
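These rules could be checked along the following lines. `VirtualHostInput` is a hypothetical, trimmed-down view of `VirtualHost`, the error messages other than "Invalid domain format" are invented, and the domain check is a simplified hostname test rather than a full RFC-compliant validator.

```rust
/// Only the fields the validation rules look at (hypothetical view type).
struct VirtualHostInput<'a> {
    server_name: &'a str, // comma-separated domain names
    listen_port: u16,
    ssl_enabled: bool,
    has_ssl_certificate: bool,
    location_count: usize,
}

/// Simplified check: dot-separated labels of 1-63 alphanumeric or '-' chars,
/// no leading or trailing hyphen, with an optional "*." wildcard prefix.
fn is_valid_domain(name: &str) -> bool {
    let name = name.strip_prefix("*.").unwrap_or(name);
    !name.is_empty()
        && name.len() <= 253
        && name.split('.').all(|label| {
            !label.is_empty()
                && label.len() <= 63
                && !label.starts_with('-')
                && !label.ends_with('-')
                && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
        })
}

/// Returns (field, message) pairs; an empty vector means the input is valid.
fn validate(vh: &VirtualHostInput) -> Vec<(&'static str, &'static str)> {
    let mut errors = Vec::new();
    if !vh.server_name.split(',').map(str::trim).all(is_valid_domain) {
        errors.push(("server_name", "Invalid domain format"));
    }
    if vh.listen_port == 0 {
        errors.push(("listen_port", "Port must be between 1 and 65535"));
    }
    if vh.ssl_enabled && !vh.has_ssl_certificate {
        errors.push(("ssl_certificate_id", "SSL certificate is required when SSL is enabled"));
    }
    if vh.location_count == 0 {
        errors.push(("locations", "At least one location must be defined"));
    }
    errors
}
```

The (field, message) pairs map directly onto the `details` array of the API error format.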
---
### CM-002: Upstream Configuration
**Description**: Define backend server pools for load balancing.
**Upstream Entity**:
```rust
struct Upstream {
    id: Uuid,
    workspace_id: Uuid,
    name: String, // Used as upstream identifier

    // Load balancing algorithm
    algorithm: LoadBalanceAlgorithm, // RoundRobin, LeastConnections, IPHash, etc.

    // Backend servers
    servers: Vec<UpstreamServer>,

    // Health check configuration
    health_check: Option<HealthCheckConfig>,

    // Connection settings
    keepalive_connections: Option<u32>,
    keepalive_timeout: Option<u32>,
}

struct UpstreamServer {
    address: String,   // IP:port or hostname:port
    weight: u32,       // Default: 1
    backup: bool,      // Backup server
    down: bool,        // Temporarily down
    max_fails: u32,    // Default: 1
    fail_timeout: u32, // Seconds, default: 10
}

enum LoadBalanceAlgorithm {
    RoundRobin,
    LeastConnections,
    IPHash,
    WeightedRoundRobin,
}
```
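To show how this entity might map onto nginx configuration, here is a rough rendering sketch. The structs are reduced to the fields used, `render_upstream` is a hypothetical helper, and the mapping of `RoundRobin` and `WeightedRoundRobin` to nginx's default method (with weights on each `server` line) is an assumption about the intended behavior.

```rust
enum LoadBalanceAlgorithm {
    RoundRobin,
    LeastConnections,
    IPHash,
    WeightedRoundRobin,
}

struct UpstreamServer {
    address: String,
    weight: u32,
    backup: bool,
    down: bool,
    max_fails: u32,
    fail_timeout: u32, // seconds
}

struct Upstream {
    name: String,
    algorithm: LoadBalanceAlgorithm,
    servers: Vec<UpstreamServer>,
    keepalive_connections: Option<u32>,
}

/// Render an nginx `upstream { ... }` block as text.
fn render_upstream(u: &Upstream) -> String {
    let mut out = format!("upstream {} {{\n", u.name);
    match u.algorithm {
        LoadBalanceAlgorithm::LeastConnections => out.push_str("    least_conn;\n"),
        LoadBalanceAlgorithm::IPHash => out.push_str("    ip_hash;\n"),
        // Round robin is nginx's default; weights go on the server lines.
        LoadBalanceAlgorithm::RoundRobin | LoadBalanceAlgorithm::WeightedRoundRobin => {}
    }
    for s in &u.servers {
        let mut line = format!(
            "    server {} weight={} max_fails={} fail_timeout={}s",
            s.address, s.weight, s.max_fails, s.fail_timeout
        );
        if s.backup {
            line.push_str(" backup");
        }
        if s.down {
            line.push_str(" down");
        }
        out.push_str(&line);
        out.push_str(";\n");
    }
    if let Some(n) = u.keepalive_connections {
        out.push_str(&format!("    keepalive {};\n", n));
    }
    out.push_str("}\n");
    out
}
```

Note that nginx rejects `backup` servers in combination with `ip_hash`, so a real renderer would need to validate that case before emitting the block.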
---
### CM-003: Configuration Versioning
**Description**: Track all configuration changes with full history.
**Versioning Features**:
- Every change creates a new version
- Versions are immutable
- Rollback to any previous version
- Diff between versions
- Audit log of who changed what
**Version Entity**:
```rust
struct ConfigVersion {
    id: Uuid,
    resource_type: String,  // "virtual_host", "upstream", etc.
    resource_id: Uuid,
    version_number: u64,    // Auto-incrementing
    data: Json,             // Full configuration snapshot
    checksum: String,       // SHA-256 of data
    created_by: Uuid,       // User ID
    created_at: DateTime,
    change_summary: String, // Human-readable description
}
```
**API Endpoints**:
- `GET /api/v1/virtual-hosts/{id}/versions` - List versions
- `GET /api/v1/virtual-hosts/{id}/versions/{version}` - Get specific version
- `POST /api/v1/virtual-hosts/{id}/rollback` - Rollback to version
- `GET /api/v1/virtual-hosts/{id}/diff?from=v1&to=v2` - Compare versions
---
## Observability
### OB-001: Structured Logging
**Description**: Comprehensive logging with structured format.
**Log Levels**: ERROR, WARN, INFO, DEBUG, TRACE
**Log Fields**:
```json
{
  "timestamp": "2026-03-02T10:30:00Z",
  "level": "INFO",
  "component": "agent",
  "agent_id": "550e8400-e29b-41d4-a716-446655440000",
  "trace_id": "abc123",
  "span_id": "def456",
  "message": "Configuration applied successfully",
  "fields": {
    "config_id": "config-123",
    "version": 42,
    "duration_ms": 150
  }
}
```
**Log Targets**:
- Master: systemd journal, file, or centralized (ELK/Loki)
- Agent: stdout (Docker), file (standalone), or remote
---
### OB-002: Distributed Tracing
**Description**: OpenTelemetry tracing for request flow visualization.
**Traced Operations**:
- Configuration push (master → agent → nginx)
- Health check cycles
- Certificate issuance
- API requests
**Span Attributes**:
- `nxmesh.agent_id`
- `nxmesh.config_id`
- `nxmesh.workspace_id`
- `nxmesh.organization_id`
---
### OB-003: Access Log Aggregation
**Description**: Collect and query nginx access logs from all agents.
**Features**:
- Centralized access log storage
- Real-time log streaming
- SQL-like query interface
- Log retention policies
**Access Log Schema**:
```rust
struct AccessLogEntry {
    id: Uuid,
    agent_id: Uuid,
    timestamp: DateTime,

    // Request details
    remote_addr: String,
    method: String,
    uri: String,
    protocol: String,
    host: String,

    // Response details
    status: u16,
    body_bytes_sent: u64,
    response_time_ms: f64,

    // Additional fields
    user_agent: Option<String>,
    referer: Option<String>,
    request_id: Option<String>,
}
```
**Query API**:
```graphql
# Example query
query {
  accessLogs(
    filter: {
      agentId: "...",
      timeRange: { from: "2026-03-01", to: "2026-03-02" },
      statusCode: { gte: 500 }
    },
    limit: 100
  ) {
    timestamp
    method
    uri
    status
    responseTimeMs
  }
}
```
---
## Security Features
### SF-001: Authentication and Authorization
**Description**: Multi-method authentication with fine-grained RBAC.
**Authentication Methods**:
- JWT (for API/Web UI)
- Password-based login (local user accounts)
- OAuth2/OIDC (Google, GitHub, enterprise SSO)
- API Keys (for service accounts)
- **TLS + Shared Secret** (for agent communication)
- Server-side TLS (auto-generated self-signed or custom certificates)
- Bootstrap token for initial registration
- Session key with HMAC signing for ongoing requests
- Primary/secondary key rotation
**RBAC Model**:
```rust
struct Role {
    id: Uuid,
    name: String,
    permissions: Vec<Permission>,
}

enum Permission {
    // Organization scope
    OrganizationRead,
    OrganizationWrite,
    OrganizationDelete,

    // Workspace scope
    WorkspaceRead,
    WorkspaceWrite,
    WorkspaceDelete,

    // Agent scope
    AgentRead,
    AgentWrite,
    AgentReload,
    AgentDelete,

    // Config scope
    ConfigRead,
    ConfigWrite,
    ConfigDeploy,
    ConfigDelete,

    // Certificate scope
    CertificateRead,
    CertificateWrite,
    CertificateDelete,

    // User management
    UserRead,
    UserWrite,
    UserDelete,
}
```
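A permission check against this model can be as simple as the sketch below. It uses a subset of the `Permission` variants and omits `Role::id`; `is_allowed` is a hypothetical helper, and the rule that a user is allowed if any of their roles grants the permission is an assumption, since the spec does not describe how multiple roles combine.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Permission {
    AgentRead,
    AgentReload,
    ConfigRead,
    ConfigWrite,
    ConfigDeploy,
}

struct Role {
    name: String,
    permissions: Vec<Permission>,
}

/// A user is allowed if at least one of their roles grants `required`.
fn is_allowed(roles: &[Role], required: Permission) -> bool {
    roles.iter().any(|role| role.permissions.contains(&required))
}
```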
---
### SF-002: Secret Management
**Description**: Secure storage and distribution of sensitive data.
**Secrets**:
- SSL private keys
- API tokens
- Database passwords
- External service credentials
**Security Measures**:
- Encryption at rest (AES-256-GCM)
- Encryption in transit (TLS 1.3)
- Automatic secret rotation
- Audit logging for secret access
---
### SF-003: Network Security
**Description**: Network-level security controls.
**Features**:
- IP allowlisting for agent connections
- Rate limiting on API endpoints
- DDoS protection recommendations
- Security headers enforcement (HSTS, CSP, etc.)
**Agent Connection Security**:
- **TLS Encryption**: Server-side TLS (auto-generated or custom certificates)
- Development: Self-signed certificates auto-generated on first start
- Production: Valid certificates (Let's Encrypt or corporate CA)
- **Bootstrap Authentication**: One-time token for initial registration
- **Session Authentication**: HMAC-signed requests with shared session key
- **Key Rotation**: Primary/secondary key design for seamless rotation
- **Certificate Pinning**: Optional fingerprint verification for additional security