Synnefo SmartSpace

Product Architecture Plan

After building and scaling multiple SaaS platforms, I've learned that education technology succeeds when it solves a real pain point simply and scales efficiently. This platform addresses the fundamental friction in hands-on technical education: environment setup hell.

The Core Problem: DevOps and cybersecurity learning requires complex, multi-node environments that take hours to set up and often fail due to hardware/software conflicts.

Our Solution: Sub-5-second provisioning of production-grade environments accessible from any device, with embedded validation that mirrors real-world scenarios.

Product-Market Fit Analysis

Primary Markets (Launch Focus)

  1. Corporate Training Teams - Immediate revenue potential, willingness to pay for standardization
  2. Cybersecurity Bootcamps - High demand for hands-on labs, premium pricing
  3. Individual Practitioners - Large market, freemium conversion opportunity

Secondary Markets (Expansion)

  • University computer science programs
  • Open source project onboarding
  • Technical interview platforms
  • Conference workshop hosting

Why This Sequencing: Corporate and bootcamp markets have proven willingness to pay for solutions that reduce training overhead. Individual practitioners provide volume and viral growth.

Core Architecture Philosophy

Design Principles (Learned from Past Mistakes)

  1. Start Boring: Use proven technologies; innovate only where it creates competitive advantage
  2. Fail Fast: Design for quick feedback loops in both technical and business validation
  3. Scale Smart: Build for 10x growth, not 100x (premature optimization kills startups)
  4. Security by Design: Security retrofits are expensive and unreliable

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Frontend (Next.js)                       │
│  - Playground UI    - Terminal (Xterm.js)   - Progress     │
│  - Course Content   - File Upload           - Leaderboards │
└─────────────────────────────────────────────────────────────┘
                              │
                    ┌─────────────────┐
                    │   Backend API   │
                    │  (Node.js)      │
                    │ - Auth/Users    │
                    │ - Playground    │
                    │ - Progress Mgmt │
                    └─────────────────┘
                              │
                         Calls Bender
                              │
┌─────────────────────────────────────────────────────────────┐
│                    Docker Host Fleet                        │
│                                                             │
│  ┌─────────────────────────────────────────────────────────┐│
│  │              Playground Container                       ││
│  │  ┌─────────────────────────────────────────────────────┐││
│  │  │         Firecracker Starter                         │││
│  │  │  - Reads playground config                          │││  
│  │  │  - Provisions MicroVMs                              │││
│  │  │  - Manages VM lifecycle                             │││
│  │  └─────────────────────────────────────────────────────┘││
│  │                           │                             ││
│  │  ┌─────────────────────────────────────────────────────┐││
│  │  │              Service Mesh                           │││
│  │  │  ┌──────────────┐  ┌───────────────────────────────┐│││
│  │  │  │ PowerDNS/    │  │         Envoy Proxy           ││││
│  │  │  │ CoreDNS      │  │  - HTTP/TCP/WebSocket Proxy   ││││
│  │  │  │ - VM Service │  │  - Dynamic Port Mapping       ││││
│  │  │  │   Discovery  │  │  - Public Service Exposure    ││││
│  │  │  └──────────────┘  └───────────────────────────────┘│││
│  │  └─────────────────────────────────────────────────────┘││
│  │                           │                             ││
│  │  ┌─────────────────────────────────────────────────────┐││
│  │  │           Isolated MicroVM Network                  │││
│  │  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐  │││
│  │  │  │   VM1    │ │   VM2    │ │   VM3    │ │  VMn   │  │││
│  │  │  │          │ │          │ │          │ │        │  │││
│  │  │  │ Terminal │ │ Terminal │ │ Terminal │ │Terminal│  │││
│  │  │  │(vsock)   │ │ (vsock)  │ │ (vsock)  │ │(vsock) │  │││
│  │  │  │          │ │          │ │          │ │        │  │││
│  │  │  │Examiner  │ │Examiner  │ │Examiner  │ │Examiner│  │││
│  │  │  │Service   │ │Service   │ │Service   │ │Service │  │││
│  │  │  └──────────┘ └──────────┘ └──────────┘ └────────┘  │││
│  │  └─────────────────────────────────────────────────────┘││
│  └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘

Legend: Users see only terminal access to the VMs, not the container structure

Technical Implementation Strategy

Phase 1: Validated MVP (Months 1-3)

Goal: Prove the core hypothesis with minimal viable architecture

Critical Path Features:

  1. Single-node playground provisioning (Ubuntu + Docker only)
  2. Basic terminal access via web interface
  3. Simple task validation (5-10 predefined challenges)
  4. User authentication (GitHub OAuth)
  5. Payment integration (Stripe for premium access)

Architecture Decisions:

  • Monolith First: Single Node.js app handling web + API + background jobs
  • Single Region: One bare-metal server cluster to start
  • PostgreSQL: Proven, familiar, handles both relational and JSON data well
  • No Kubernetes: Direct Docker management reduces complexity

Success Metrics: 100 active users, 80% task completion rate, sub-5s provisioning

Phase 2: Product-Market Fit (Months 4-6)

Goal: Validate market demand and optimize for retention

Key Features:

  1. Multi-node playgrounds (Kubernetes, Ansible clusters)
  2. Structured courses with integrated tutorials
  3. Progress tracking with visual dashboards
  4. Basic collaboration (terminal sharing for mentors)
  5. Cybersecurity templates (Kali, Metasploitable)

Architecture Evolution:

  • Separate API Service: Extract API from monolith as load increases
  • Worker Fleet: 3-5 bare-metal servers with load balancing
  • Redis Integration: Real-time features and session management
  • Content Management: Git-based content pipeline

Success Metrics: 500 active users, 60% week-2 retention, $10k MRR

Phase 3: Scale & Differentiate (Months 7-12)

Goal: Build moat through unique features and operational excellence

Differentiating Features:

  1. Live CTF competitions with real-time leaderboards
  2. Advanced collaboration (file sharing, session recording)
  3. AI-powered hints and debugging assistance
  4. Enterprise features (SSO, team management)
  5. Mobile app for terminal access

Architecture Maturity:

  • Service Decomposition: Separate services for competition, collaboration, content
  • Multi-region: US + EU deployments for latency optimization
  • Advanced Security: Zero-trust network, comprehensive monitoring
  • Observability: Full telemetry stack for operational excellence

Key Architectural Components

1. Bender (Playground Container Manager)

Core Responsibilities (Similar to iximiuz but container-focused):

  • Container lifecycle management (start, stop, cleanup)
  • Playground template management and deployment
  • Resource allocation across Docker hosts
  • Health monitoring and automatic recovery

Container Deployment Flow:

Frontend Request → Backend API → Bender → Docker Host → Container Deploy → 
Firecracker Starter → MicroVM Provisioning → User Terminal Access

2. Firecracker Starter (In-Container Engine)

Responsibilities:

  • Parse playground configuration from mounted config
  • Provision MicroVMs according to specification
  • Configure internal networking between VMs
  • Manage VM lifecycle within container boundary
  • Handle resource allocation and monitoring
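
To make the first three responsibilities concrete, here is a minimal sketch of turning a playground config into a provisioning plan. It is written in Python for brevity (the plan calls for a Go binary), and the config schema, field names, and the first_host_offset default are all assumptions for illustration; the plan does not define the real format. The one firm constraint is that guest vsock CIDs start at 3, since 0-2 are reserved.

```python
import ipaddress
import json
from dataclasses import dataclass

# Hypothetical config shape; the real schema is not defined in this plan.
EXAMPLE_CONFIG = """
{
  "subnet": "192.168.1.0/24",
  "first_host_offset": 10,
  "vms": [
    {"name": "vm1", "vcpus": 1, "mem_mib": 512},
    {"name": "vm2", "vcpus": 2, "mem_mib": 1024}
  ]
}
"""

FIRST_GUEST_CID = 3  # vsock CIDs 0-2 are reserved (hypervisor, local, host)


@dataclass(frozen=True)
class VmPlan:
    name: str
    vcpus: int
    mem_mib: int
    ip: str
    vsock_cid: int


def plan_playground(raw_config: str) -> list[VmPlan]:
    """Assign each VM a static IP in the playground subnet and a unique vsock CID."""
    cfg = json.loads(raw_config)
    subnet = ipaddress.ip_network(cfg["subnet"])
    offset = cfg.get("first_host_offset", 10)

    names = [vm["name"] for vm in cfg["vms"]]
    if len(set(names)) != len(names):
        raise ValueError("VM names must be unique within a playground")

    plans = []
    for index, vm in enumerate(cfg["vms"]):
        ip = subnet.network_address + offset + index
        if ip not in subnet or ip == subnet.broadcast_address:
            raise ValueError(f"subnet {subnet} is too small for {vm['name']}")
        plans.append(
            VmPlan(vm["name"], vm["vcpus"], vm["mem_mib"], str(ip), FIRST_GUEST_CID + index)
        )
    return plans
```

With the example config, vm1 and vm2 land on 192.168.1.10 and 192.168.1.11, matching the addresses used in the service mesh diagram.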

vsock Terminal Architecture:

User Browser → WebSocket → Backend → vsock Proxy → MicroVM Terminal
                                        ↓
                              No SSH/IP dependency!

Why vsock is Superior:

  • IP Independence: Terminal access unaffected by VM IP changes
  • Performance: Direct hypervisor communication, lower latency
  • Security: No SSH daemon needed, reduced attack surface
  • Simplicity: No network configuration needed for terminal access

3. Service Mesh Strategy (PowerDNS + Envoy)

Option A: In-Container Service Mesh (Recommended)

Advantages:

  • Complete isolation per playground
  • No shared state between playgrounds
  • Easier resource management
  • Simpler debugging and troubleshooting

┌─────────────────────────────────────────────┐
│           Playground Container              │
│  ┌─────────────────────────────────────────┐│
│  │         PowerDNS/CoreDNS                ││
│  │  - vm1.playground.local → 192.168.1.10  ││
│  │  - vm2.playground.local → 192.168.1.11  ││
│  └─────────────────────────────────────────┘│
│  ┌─────────────────────────────────────────┐│
│  │            Envoy Proxy                  ││
│  │  - Port 8080 → vm1:80 (web server)      ││
│  │  - Port 3000 → vm2:3000 (nodejs app)    ││
│  │  - WebSocket proxy support              ││
│  └─────────────────────────────────────────┘│
│                     │                       │
│         Internal MicroVM Network            │
│  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐         │
│  │VM1  │  │VM2  │  │VM3  │  │VMn  │         │
│  │:80  │  │:3000│  │:443 │  │:... │         │
│  └─────┘  └─────┘  └─────┘  └─────┘         │
└─────────────────────────────────────────────┘
                      │
              ┌───────────────┐
              │Container Ports│
              │ 8080, 3000,   │
              │ 443, ...      │
              └───────────────┘
                      │
              ┌───────────────┐
              │ Public Access │
              │ https://user- │
              │ 123-playground│
              │ .platform.io  │
              └───────────────┘

Dynamic Service Exposure

  • Real-time Port Detection: Examiner services detect when applications start listening on ports
  • Automatic Proxy Configuration: Envoy config updated dynamically to expose new services
  • DNS Management: PowerDNS updates to provide service discovery

Example Flow - User Starts Nginx:

  1. User runs nginx in VM1
  2. Examiner detects process listening on port 80
  3. Examiner notifies Firecracker Starter via gRPC
  4. Firecracker Starter updates Envoy config: vm1:80 → container:8080
  5. PowerDNS adds: nginx.vm1.playground.local → VM1_IP
  6. User accesses via: https://user-123-playground.platform.io/vm1/
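
Step 2 of this flow could be implemented in several ways. One low-overhead option, sketched here in Python as an assumption rather than a settled design, is for the Examiner to poll /proc/net/tcp inside the VM and report ports that newly enter the LISTEN state (state code 0A). The sample text stands in for the real file; IPv6 listeners would need the same treatment for /proc/net/tcp6.

```python
TCP_LISTEN = "0A"  # socket state code for LISTEN in /proc/net/tcp

# Stand-in for the contents of /proc/net/tcp inside a VM (illustrative data).
SAMPLE_PROC_NET_TCP = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0050 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 20001 1 0000000000000000 100 0 0 10 0
   1: 0100007F:0BB8 00000000:0000 0A 00000000:00000000 00:00000000 00000000   999        0 20002 1 0000000000000000 100 0 0 10 0
   2: 0A01A8C0:0050 0B01A8C0:D431 01 00000000:00000000 00:00000000 00000000     0        0 20003 1 0000000000000000 20 4 30 10 -1
"""


def listening_ports(proc_net_tcp: str) -> set[int]:
    """Return the local TCP ports that are in the LISTEN state."""
    ports = set()
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4 or fields[3] != TCP_LISTEN:
            continue
        _addr_hex, port_hex = fields[1].split(":")
        ports.add(int(port_hex, 16))  # ports are stored as hexadecimal
    return ports


def newly_opened(previous: set[int], current: set[int]) -> set[int]:
    """Ports that started listening since the previous poll."""
    return current - previous
```

Each poll compares the current set against the previous one, and only the difference would be sent to Firecracker Starter, which keeps the notification traffic small.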

WebSocket Support: Envoy natively supports WebSocket proxying with proper upgrade headers

4. Terminal Access via vsock

vsock Implementation Architecture

Browser (Xterm.js) → WebSocket → Backend → vsock Proxy → MicroVM Terminal
                                     │
                                     └── Direct Firecracker connection
                                         (No network dependency!)

Technical Implementation:

  • vsock Address: Each MicroVM gets unique vsock CID (Context ID)
  • Proxy Service: Go service that bridges WebSocket ↔ vsock
  • Connection Management: Handle reconnections, session persistence
  • Multi-Terminal: Support multiple terminal sessions per VM
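
One detail worth sketching for the proxy service: with Firecracker, the host side of a vsock device is a Unix domain socket, and a host-initiated connection begins with a short text handshake (CONNECT <port>, answered by OK <host port>). The sketch below shows that handshake in Python for brevity, although the plan calls for a Go service. The socket path and guest port are hypothetical values supplied by the caller.

```python
import socket


def connect_guest_port(uds_path: str, guest_port: int, timeout: float = 2.0) -> socket.socket:
    """Open a stream to a port inside the guest via Firecracker's vsock Unix socket.

    uds_path is the host-side socket configured for the VM's vsock device
    (hypothetical here); guest_port is where the in-VM terminal agent listens.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(uds_path)
        sock.sendall(f"CONNECT {guest_port}\n".encode("ascii"))

        # Read one byte at a time so no terminal data after the
        # acknowledgement line is consumed by the handshake.
        reply = b""
        while not reply.endswith(b"\n"):
            chunk = sock.recv(1)
            if not chunk:
                raise ConnectionError("vsock backend closed before acknowledging")
            reply += chunk

        if not reply.startswith(b"OK "):
            raise ConnectionError(f"unexpected handshake reply: {reply!r}")
    except Exception:
        sock.close()
        raise
    return sock
```

After the handshake the returned socket is an ordinary byte stream, so the WebSocket bridge can copy bytes in both directions without knowing anything about the VM's IP configuration.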

Why vsock Wins:

  • IP Agnostic: A user can run ifconfig eth0 0.0.0.0 inside the VM and the terminal still works
  • Performance: Direct hypervisor communication, no TCP overhead
  • Security: No SSH service running, no network attack surface
  • Reliability: Immune to firewall changes, network reconfigurations

Data Architecture & State Management

Primary Data Stores

PostgreSQL (Transactional Data)

-- Core business entities
users, organizations, teams
courses, lessons, challenges, playgrounds
user_progress, completions, achievements
competition_events, team_memberships

-- Audit and analytics
session_logs, task_completions, performance_metrics

Redis (Real-time State)

-- Live competition data
competition:{id}:leaderboard → sorted set (score, user_id)
competition:{id}:participants → hash (user_id → status)
session:{id}:state → hash (vm_status, progress, connections)

-- Collaboration state  
mentor_session:{id}:viewers → set (user_ids watching)
file_transfer:{session_id} → list (pending uploads)
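
As a rough illustration of how the leaderboard key would be used, the helpers below build the Redis commands for awarding points and reading the top N. They only construct command tuples (no Redis connection is made), and the function names are hypothetical. ZINCRBY and ZREVRANGE are standard sorted-set commands; newer Redis versions prefer ZRANGE with REV over ZREVRANGE, though the latter is still supported.

```python
def leaderboard_key(competition_id: str) -> str:
    """Key name following the competition:{id}:leaderboard pattern."""
    return f"competition:{competition_id}:leaderboard"


def award_points(competition_id: str, user_id: str, points: int) -> tuple:
    """Command that atomically adds points to a user's score in the sorted set."""
    return ("ZINCRBY", leaderboard_key(competition_id), points, user_id)


def top_n(competition_id: str, n: int) -> tuple:
    """Command that reads the n highest scores, best first, with their values."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return ("ZREVRANGE", leaderboard_key(competition_id), 0, n - 1, "WITHSCORES")
```

Because the score update is a single atomic command, concurrent task completions do not need application-level locking.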

Object Storage (Content & Artifacts)

  • Template images and VM snapshots
  • Course videos and static content
  • User file uploads and session recordings
  • Backup data and disaster recovery assets

State Synchronization Strategy

  • Event Sourcing: Critical state changes (task completions, competition events)
  • CQRS Pattern: Separate read/write models for high-frequency operations
  • WebSocket Streams: Real-time updates without polling overhead
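
A minimal sketch of the event-sourcing idea for task completions, in Python: completions are appended to a log and never updated, and the score table is a read model rebuilt by replaying that log. The event fields and the rule that a repeated completion of the same challenge scores only once are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaskCompleted:
    user_id: str
    challenge_id: str
    points: int


@dataclass
class EventLog:
    """Append-only store; events are never modified or deleted."""
    _events: list = field(default_factory=list)

    def append(self, event: TaskCompleted) -> None:
        self._events.append(event)

    def replay(self) -> tuple:
        return tuple(self._events)


def project_scores(events) -> dict[str, int]:
    """Read model: total points per user, counting each challenge once."""
    scores: dict[str, int] = {}
    seen = set()
    for event in events:
        key = (event.user_id, event.challenge_id)
        if key in seen:
            continue  # duplicate delivery or repeat attempt: ignore
        seen.add(key)
        scores[event.user_id] = scores.get(event.user_id, 0) + event.points
    return scores
```

Since the projection is a pure function of the log, a scoring rule change can be applied retroactively by replaying the same events through a new projection.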

Security Architecture

Multi-Layer Defense Strategy

Layer 1: Network Isolation

  • VPC per customer organization
  • Firewall rules preventing cross-contamination
  • DDoS protection and rate limiting

Layer 2: Host Security

  • Minimal attack surface on worker nodes
  • Regular security patching pipeline
  • Intrusion detection and monitoring

Layer 3: VM Isolation

  • Firecracker hypervisor isolation
  • Jailed execution environment
  • Resource quotas and monitoring

Layer 4: Application Security

  • Examiner process runs with minimal permissions
  • Audit logging for all administrative actions

Threat Model & Mitigations

  • VM Escape: Firecracker isolation + regular security updates
  • Resource Abuse: Hard quotas + monitoring + automatic termination
  • Data Exfiltration: Network egress filtering + content scanning
  • Platform Compromise: Zero-trust architecture + principle of least privilege

Competitive Positioning

  • vs. Katacoda/O'Reilly: Better validation, real-time collaboration
  • vs. Cloud Providers: Faster, cheaper, purpose-built for learning
  • vs. Local VMs: Zero setup, consistent environments, collaboration

Implementation Priorities & Rationale

Must-Have (MVP Blockers)

  1. Sub-5s VM provisioning - Core value proposition, technical differentiator
  2. Reliable task validation - Without this, we're just expensive cloud VMs
  3. Terminal web interface - Baseline functionality for any user interaction
  4. Basic auth & billing - Required for any paid customer validation

Should-Have (PMF Accelerators)

  1. Multi-node playgrounds - Unlocks Kubernetes/complex scenarios
  2. Progress tracking - Gamification drives engagement and retention
  3. Content management - Allows iteration without engineering time
  4. Mentor collaboration - Addresses high-value corporate market

Could-Have (Growth Features)

  1. Live competitions - Viral growth potential, community building
  2. Mobile app - Convenience feature, not core value prop
  3. AI assistance - Interesting but unproven in education
  4. Enterprise SSO - Required for larger deals but not for validation

Won't-Have (V1 Scope Creep)

  1. Custom IDE development - VS Code integration sufficient
  2. Video hosting platform - Use existing solutions (Vimeo, YouTube)
  3. Advanced analytics - Focus on core metrics first
  4. Multi-language support - English market sufficient for validation

Technical Risk Assessment

High-Impact Risks & Mitigations

1. Security Vulnerabilities (High Probability, High Impact)

  • Risk: VM escape leading to host compromise
  • Mitigation: Defense-in-depth, regular penetration testing, bug bounty program
  • Timeline: Security audit every quarter starting Month 2

2. Performance Degradation at Scale (Medium Probability, High Impact)

  • Risk: VM provisioning time increases with load
  • Mitigation: Comprehensive load testing, warm pool optimization
  • Timeline: Load testing framework by Month 2

3. Resource Abuse (High Probability, Medium Impact)

  • Risk: Users consuming excessive resources, driving up costs
  • Mitigation: Hard quotas, automated monitoring, tiered pricing
  • Timeline: Quota system in MVP, advanced monitoring by Month 4
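
For the MVP quota system, the core decision can be a small pure function. The sketch below is one possible shape, in Python: the quota dimensions, the 80% warning threshold, and the example limits are assumptions, not figures from this plan.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    WARN = "warn"
    TERMINATE = "terminate"


@dataclass(frozen=True)
class Quota:
    max_vms: int
    max_mem_mib: int
    max_session_minutes: int


@dataclass(frozen=True)
class Usage:
    vms: int
    mem_mib: int
    session_minutes: int


def enforce(quota: Quota, usage: Usage, warn_ratio: float = 0.8) -> Action:
    """Pick an action from the most heavily used quota dimension."""
    worst = max(
        usage.vms / quota.max_vms,
        usage.mem_mib / quota.max_mem_mib,
        usage.session_minutes / quota.max_session_minutes,
    )
    if worst > 1.0:
        return Action.TERMINATE  # hard quota exceeded
    if worst >= warn_ratio:
        return Action.WARN  # approaching the limit
    return Action.ALLOW
```

Keeping the decision separate from the code that measures usage and stops playgrounds makes it easy to test tier limits before wiring them to billing.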

Technical Debt Management

  • Months 1-3: Acceptable to accrue debt for speed
  • Months 4-6: Pay down critical debt blocking scale
  • Month 7+: Establish sustainable development velocity

Business Model & Unit Economics

Revenue Streams

  1. Subscription SaaS (Primary) - Predictable recurring revenue
  2. Corporate Training (High-value) - Custom content + platform access
  3. Certification Programs (Future) - Verified skill assessment

Unit Economics (Target)

  • Customer Acquisition Cost: $100-150 (individual), $500-1000 (enterprise)
  • Monthly Churn Rate: <5% (individual), <2% (enterprise)
  • Gross Margin: 80%+ (software-centric model)
  • Payback Period: 6 months (individual), 12 months (enterprise)
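
These targets imply a minimum revenue per customer. The quick check below, in Python, assumes the common definition payback = CAC / (monthly revenue × gross margin) and average lifetime = 1 / monthly churn; the plan does not state its formulas, so treat the results as a sanity check rather than a forecast.

```python
def required_monthly_revenue(cac: float, payback_months: float, gross_margin: float) -> float:
    """Monthly revenue per customer needed to recover CAC within the payback window."""
    if payback_months <= 0 or not 0 < gross_margin <= 1:
        raise ValueError("payback_months must be positive and gross_margin in (0, 1]")
    return cac / (payback_months * gross_margin)


def expected_lifetime_months(monthly_churn: float) -> float:
    """Average customer lifetime implied by a constant monthly churn rate."""
    if not 0 < monthly_churn <= 1:
        raise ValueError("monthly_churn must be in (0, 1]")
    return 1 / monthly_churn
```

Under these assumptions, an individual customer would need to bring in roughly $21 to $31 per month (CAC of $100-150, 6-month payback, 80% margin), and an enterprise account roughly $52 to $104 per month (CAC of $500-1000, 12-month payback). Churn of 5% per month corresponds to an average lifetime of about 20 months, and 2% to about 50 months.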

Technical Implementation Roadmap

Phase 1: Technical Foundation (Months 1-3)

Architecture Goals: Prove core hypothesis with the container-first design

Core Infrastructure:

Platform Manager (Node.js)
├── Web Frontend (Next.js + TypeScript)
├── API Layer (Express + Prisma)  
├── Background Jobs (Bull + Redis)
└── Database (PostgreSQL + Redis)

Docker Host Fleet
├── Docker Engine with custom networking
├── Container orchestration logic
└── Host-level monitoring and security

Playground Container Template
├── Firecracker Starter (Go binary)
├── PowerDNS/CoreDNS for service discovery
├── Envoy proxy for ingress/egress
└── Multiple MicroVMs with Examiner services

Key Features Delivered:

  • Container-based playground isolation
  • Firecracker Starter for VM management
  • Internal service mesh (DNS + Envoy)
  • Examiner-based task validation
  • Basic web terminal access

Why This Architecture Works:

  • Docker First: Proven orchestration, easier debugging, familiar ops
  • Container Isolation: Complete playground separation without a full VM per playground
  • Internal Service Mesh: Professional-grade networking with familiar tools
  • Future Flexibility: Can migrate to K8s/Swarm when scale demands it

Success Criteria:

  • 50 beta users completing challenges
  • 95% uptime with <5s provisioning
  • Positive user feedback on core experience

Phase 2: Market Validation (Months 4-6)

Architecture Goals: Scale to hundreds of concurrent users

Infrastructure Evolution:

  • Extract API service from monolith
  • Add 2-3 worker nodes for redundancy
  • Implement proper load balancing
  • Add comprehensive monitoring

Feature Expansion:

  • Multi-node playground support
  • Mentor-student collaboration (terminal sharing)
  • Structured course delivery system
  • Cybersecurity lab templates
  • Progress analytics dashboard

Success Criteria:

  • 300+ active users
  • $10k+ MRR with positive unit economics
  • 10+ corporate pilot programs

Phase 3: Competitive Differentiation (Months 7-12)

Architecture Goals: Build features competitors can't easily replicate

Platform Maturity:

  • Microservice decomposition where justified
  • Multi-region deployment (US East + EU)
  • Advanced security hardening
  • Comprehensive observability stack

Unique Features:

  • Live CTF competition platform
  • Real-time collaborative debugging
  • AI-powered learning assistance
  • Advanced team management
  • Mobile app for terminal access

Success Criteria:

  • 1000+ active users across 3 market segments
  • $50k+ MRR with expanding margins
  • 5+ enterprise customers with annual contracts

Risk Mitigation & Contingency Planning

Technical Contingencies

If VM provisioning doesn't hit <5s consistently:

  • Implement advanced warm pool strategies
  • Consider container-based fallback for simple playgrounds
  • Invest in custom kernel optimization
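
The warm pool idea can be sketched in a few lines of Python: keep a small number of pre-booted VMs ready, hand one out immediately on request, and fall back to a cold boot only when the pool is empty. The class name, the boot_vm callback, and the synchronous refill are simplifying assumptions; a real pool would refill in the background and account for template variety.

```python
from collections import deque
from typing import Callable


class WarmPool:
    """Keeps up to target_size pre-booted VMs ready for instant hand-out."""

    def __init__(self, boot_vm: Callable[[], str], target_size: int):
        self._boot_vm = boot_vm  # hypothetical callback that boots a VM and returns its id
        self._target_size = target_size
        self._ready: deque = deque()
        self.refill()

    @property
    def ready_count(self) -> int:
        return len(self._ready)

    def refill(self) -> int:
        """Boot VMs until the pool is back at its target size; return how many were booted."""
        booted = 0
        while len(self._ready) < self._target_size:
            self._ready.append(self._boot_vm())
            booted += 1
        return booted

    def acquire(self) -> tuple[str, bool]:
        """Return (vm_id, was_warm). Falls back to a cold boot when the pool is empty."""
        if self._ready:
            return self._ready.popleft(), True
        return self._boot_vm(), False
```

Tracking the was_warm flag per request gives a direct measure of the pool hit rate, which is the number to watch when tuning the target size against idle resource cost.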

If security incidents occur:

  • Immediate incident response plan
  • Comprehensive audit and remediation
  • Enhanced monitoring and threat detection

If scaling costs become prohibitive:

  • Implement advanced resource optimization
  • Consider hybrid cloud deployment
  • Develop resource sharing algorithms

Business Contingencies

If corporate market adoption is slow:

  • Pivot focus to bootcamp and individual markets
  • Develop channel partner program
  • Create freemium viral growth mechanics

If competition emerges quickly:

  • Accelerate unique feature development (live CTF, AI assistance)
  • Build strong customer relationships and switching costs
  • Consider strategic partnerships or acquisition opportunities

Success Metrics & Monitoring

Product Metrics (Leading Indicators)

  • Time to First Success: Minutes from signup to first completed challenge
  • Engagement Depth: Average session duration and tasks completed
  • Feature Adoption: % users utilizing collaboration, competition features
  • Content Quality: Task completion rates by difficulty level

Business Metrics (Lagging Indicators)

  • Growth Rate: Month-over-month user acquisition
  • Revenue Growth: MRR expansion and customer lifetime value
  • Churn Analysis: Cohort retention and reasons for cancellation
  • Market Penetration: Share of target segments (bootcamps, corporates)

Technical Metrics (Operational Health)

  • Performance: 95th percentile provisioning time, uptime SLA
  • Security: Incident frequency, vulnerability patch time
  • Scalability: Resource utilization, cost per active user
  • Quality: Bug rates, feature deployment frequency
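
Since the provisioning target is tracked at the 95th percentile, it helps to pin down how that number is computed. The Python sketch below uses the nearest-rank method, which is one of several percentile definitions; the choice of method and the function names are assumptions.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_provisioning_target(samples_s: list[float], target_s: float = 5.0) -> bool:
    """True when the 95th percentile provisioning time is under the target."""
    return percentile_nearest_rank(samples_s, 95) < target_s
```

With 20 samples, for example, the 95th percentile is the 19th fastest provisioning time, so a single slow outlier does not fail the target but two do.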


Architectural Justifications

1. Monolith-First Strategy

  • Decision: Start with monolithic Platform Manager
  • Reasoning: Faster development, easier debugging, simpler deployment
  • Evolution: Extract services only when team/scale demands it

2. In-VM Examiner Design

  • Decision: Run validation inside the microVMs rather than through external monitoring
  • Reasoning: Higher accuracy, lower latency, better security isolation
  • Trade-off: Slightly higher resource usage for significantly better UX
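
A minimal sketch of what an in-VM check could look like, in Python: each challenge is a list of named checks that inspect local state directly, which is the access external monitoring would lack. The Check type, the file_contains helper, and the result shape are hypothetical; the Examiner's real interface is not specified in this plan.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    description: str
    passed: Callable[[], bool]


def file_contains(path: str, needle: str) -> Callable[[], bool]:
    """Build a check that passes when the file exists and contains the given text."""
    def run() -> bool:
        try:
            with open(path, encoding="utf-8") as handle:
                return needle in handle.read()
        except OSError:
            return False  # missing or unreadable file counts as not done yet
    return run


def evaluate(checks: list[Check]) -> dict:
    """Run every check and report per-check results plus overall completion."""
    results = {check.description: check.passed() for check in checks}
    return {"completed": bool(results) and all(results.values()), "results": results}
```

Returning per-check results rather than a single pass/fail lets the UI show learners which step is still missing.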

3. Bare-Metal Workers

  • Decision: Use bare-metal servers instead of cloud VMs
  • Reasoning: Better performance, lower costs, and Firecracker's need for KVM access (on cloud VMs this depends on nested virtualization)
  • Trade-off: More operational complexity for better unit economics

4. Competition Engine as Differentiator

  • Decision: Build live CTF platform from the start
  • Reasoning: Creates network effects, drives engagement, hard to replicate
  • Investment: High development cost but significant competitive moat

Market Timing Advantages

  • DevOps Skills Gap: Massive demand for hands-on training
  • Remote Work: Increased need for virtual learning environments
  • Cloud Native Adoption: Growing complexity requires new training approaches
  • Cybersecurity Demand: Critical skills shortage with high willingness to pay

Next Steps & Decision Points

Immediate Actions (Next 30 Days)

  1. Technical Validation: Build basic Firecracker + Docker prototype
  2. Market Research: Interview 20+ potential corporate customers
  3. Competitive Analysis: Deep dive on existing solutions' weaknesses
  4. Team Planning: Identify first engineering hires and timeline

Key Decision Points

  • Month 2: Validate technical feasibility and initial user feedback
  • Month 4: Decide on market focus based on early traction data
  • Month 6: Evaluate need for additional funding based on growth metrics
  • Month 9: Plan international expansion based on product-market fit

Success Dependencies

  • Technical: Achieving reliable sub-5s provisioning at scale
  • Product: Finding repeatable customer acquisition channels
  • Business: Proving sustainable unit economics with target pricing
  • Team: Building engineering team capable of scaling platform