Sea Lock: Predictive Multi-Agent Wargaming for Adversarial Naval Blockade and Command-and-Control
A reinforcement learning framework that introduces an adaptive adversarial learning layer into mission-space Digital Twins.
Mission-level Digital Twins increasingly support defence decision-making, yet most existing deployments model friendly forces and environments with high fidelity while treating adversarial behaviour as scripted or scenario-fixed — limiting their value as anticipatory Command-and-Control (C2) tools in contested maritime settings where red and blue forces co-evolve under uncertainty and partial observability.
We present Sea Lock, a multi-agent reinforcement learning framework that introduces an adaptive adversarial learning layer into mission-space Digital Twins. The blockade engagement is formalised as a partially observable stochastic game over heterogeneous surface vessels, jointly modelling formation, reconnaissance, electronic warfare, and salvo combat.
Three training regimes are compared under a centralised-training, decentralised-execution architecture: standard PPO, fictitious self-play, and an LLM-shaped doctrinal reward variant. A lightweight Recurrent State-Space Model, trained on interaction trajectories, enables counterfactual rollouts of alternative manoeuvre doctrines without re-simulating the environment.
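The blockade engagement described above can be sketched as a minimal partially observable stochastic game interface. Everything here is an illustrative toy, not the Sea Lock engine: the vessel roles, sensor radii, and noise model are assumed for the example.

```python
import random
from dataclasses import dataclass

@dataclass
class Vessel:
    # Heterogeneous surface vessel: role drives the sensor radius below.
    agent_id: str
    team: str    # "blue" (blockade) or "red" (breakthrough)
    role: str    # e.g. "frigate", "recon", "ew" -- illustrative classes
    x: float = 0.0
    y: float = 0.0
    hp: float = 100.0

class BlockadePOSG:
    """Toy partially observable stochastic game over heterogeneous vessels.

    Each agent observes only contacts inside its own sensor radius (partial
    observability); transitions are stochastic via noisy movement.
    """
    SENSOR_RADIUS = {"frigate": 20.0, "recon": 40.0, "ew": 25.0}

    def __init__(self, vessels, seed=0):
        self.vessels = {v.agent_id: v for v in vessels}
        self.rng = random.Random(seed)

    def observe(self, agent_id):
        # Local observation: own state plus relative positions of contacts in range.
        me = self.vessels[agent_id]
        radius = self.SENSOR_RADIUS[me.role]
        contacts = [
            (v.agent_id, v.x - me.x, v.y - me.y)
            for v in self.vessels.values()
            if v.agent_id != agent_id
            and (v.x - me.x) ** 2 + (v.y - me.y) ** 2 <= radius ** 2
        ]
        return {"self": (me.x, me.y, me.hp), "contacts": contacts}

    def step(self, joint_action):
        # joint_action: agent_id -> (dx, dy); Gaussian perturbation makes
        # the transition kernel stochastic.
        for aid, (dx, dy) in joint_action.items():
            v = self.vessels[aid]
            v.x += dx + self.rng.gauss(0, 0.5)
            v.y += dy + self.rng.gauss(0, 0.5)
        return {aid: self.observe(aid) for aid in joint_action}
```

A real engagement model would add formation, electronic-warfare, and salvo mechanics on top of this interface; the sketch only shows where partial observability and stochasticity enter.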
Sea Lock is the maritime rendering of a broader adversarial-simulation engine designed for cross-domain composition. The same engine architecture extends to land logistics, air swarm operations, cyber resilience, and integrated multi-domain C2 — each rendering shares the engine while varying observation and action spaces, reward composition, and domain-specific physics. Digital Twinning is one application; predictive wargaming, force-structure analysis, doctrine red-teaming, and operator decision support are others.
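The engine/rendering split can be sketched as an interface that each domain implements. The class and method names here are assumptions for illustration, not the actual codebase:

```python
from abc import ABC, abstractmethod

class DomainRendering(ABC):
    """One rendering of a shared adversarial-simulation engine.

    Subclasses supply domain-specific observation/action spaces, reward
    composition, and physics; the engine loop itself stays domain-agnostic.
    """
    @abstractmethod
    def transition(self, state, joint_action): ...
    @abstractmethod
    def reward(self, state, joint_action, next_state): ...

class Engine:
    # Domain-agnostic stepping loop shared by every rendering.
    def __init__(self, rendering, initial_state):
        self.rendering = rendering
        self.state = initial_state

    def step(self, joint_action):
        next_state = self.rendering.transition(self.state, joint_action)
        r = self.rendering.reward(self.state, joint_action, next_state)
        self.state = next_state
        return next_state, r

class MaritimeBlockade(DomainRendering):
    # Minimal maritime rendering: state is each side's distance to the
    # blockade line; blue is rewarded for holding red far from it.
    def transition(self, state, joint_action):
        return {aid: state[aid] + joint_action[aid] for aid in state}

    def reward(self, state, joint_action, next_state):
        return {"blue": next_state["red"], "red": -next_state["red"]}
```

A land-logistics or air-swarm rendering would subclass the same interface with its own state, transition physics, and reward terms, which is the composition property the paragraph describes.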
What we measured.
Dashboard walkthrough.
The simulator dashboard renders live engagement state, LLM tactical narration, and operator-actionable Course-of-Action recommendations. Open it and explore the system yourself.
Open Sea Lock dashboard
Mapbox tactical map · 3 recorded engagement replays with real qwen2.5:7b narration · LIVE policy playback over WebSocket · operator console (move/fire/jam/withdraw with audit log) · WHAT-IF world-model rollouts comparing up to 3 operator decisions side-by-side · custom-scenario builder.
Three training regimes, one engine.
Standard PPO with adaptive KL
Baseline regime. Clipped Proximal Policy Optimization with adaptive KL penalty under centralised-training, decentralised-execution. Establishes the lower bound for what learning-only agents achieve in this engagement scenario.
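The two pieces of this baseline can be shown in a few lines: the clipped surrogate on a single sample, and the adaptive KL coefficient schedule from the original PPO paper (halve the penalty when KL falls well under target, double it when well over). Thresholds and targets below are the paper's conventional defaults, not Sea Lock's tuned values:

```python
def ppo_surrogate(ratio, advantage, clip_eps=0.2):
    """Clipped PPO surrogate for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return min(ratio * advantage, clipped * advantage)

def adapt_kl_coef(beta, observed_kl, kl_target=0.01):
    """Adaptive KL penalty: shrink beta when the policy update is too timid,
    grow it when the update overshoots the KL target."""
    if observed_kl < kl_target / 1.5:
        return beta / 2.0
    if observed_kl > kl_target * 1.5:
        return beta * 2.0
    return beta
```

Under centralised-training, decentralised-execution, the critic consumes joint state during training while each vessel's actor acts on its local observation only.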
Fictitious self-play with bounded opponent pool
Population-based training with a bounded pool of past policies. Yields the most stable strategic differentiation across the three regimes, including a measurable approximate-equilibrium window before late-stage policy-pool drift.
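The bounded pool is the interesting detail here, and it can be sketched with a fixed-length deque. Pool size and the uniform sampling scheme are illustrative assumptions:

```python
import random
from collections import deque

class BoundedOpponentPool:
    """Bounded pool of past policy snapshots for fictitious self-play.

    Training against sampled past opponents, rather than only the latest
    self, damps cyclic exploitation. The bound (maxlen) evicts the oldest
    snapshots, which is one mechanism behind late-stage policy-pool drift.
    """
    def __init__(self, maxlen=8, seed=0):
        self.pool = deque(maxlen=maxlen)  # oldest snapshots evicted first
        self.rng = random.Random(seed)

    def add_snapshot(self, policy_params):
        self.pool.append(policy_params)

    def sample_opponent(self):
        # Classic fictitious play averages over the full history;
        # the bound truncates that history to a sliding window.
        return self.rng.choice(list(self.pool))
```

Each training iteration would snapshot the current policy into the pool and sample an opponent for the next batch of self-play episodes.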
LLM-shaped doctrinal reward variant
Qwen-2.5-7B (via Ollama) provides per-state doctrinal annotations and bounded reward shaping. Matches PPO baseline in aggregate learning while supplying tactical narration for commander-in-the-loop review.
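Bounded reward shaping of this kind reduces to clipping the LLM term before adding it to the task reward. The weight, bound, and the assumption that the doctrinal annotation is parsed to a score in [-1, 1] are illustrative, not the system's actual constants:

```python
def shaped_reward(env_reward, doctrine_score, weight=0.1, bound=0.5):
    """Combine the environment reward with a bounded LLM doctrinal term.

    doctrine_score: per-state annotation parsed from the LLM's judgment,
    assumed in [-1, 1]. Clipping the shaping term keeps the LLM from
    dominating the task reward, so aggregate learning can track the
    unshaped baseline.
    """
    shaping = max(-bound, min(bound, weight * doctrine_score))
    return env_reward + shaping
```

The same per-state annotations that feed the shaping term double as the tactical narration surfaced for commander-in-the-loop review.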
Recurrent State-Space World Model (counterfactual rollout)
Trained on logged interaction trajectories. Enables counterfactual rollouts of alternative manoeuvre doctrines without re-simulating the environment — the predictive C2 application primitive.
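The counterfactual-rollout mechanic can be made concrete with a toy recurrent latent model. A real Recurrent State-Space Model learns its transition function from logged trajectories; here the transition is a fixed decay map so the rollout machinery is visible without a training loop. All parameters are illustrative:

```python
import random

class TinyRecurrentWorldModel:
    """Toy stand-in for a Recurrent State-Space Model: a latent recurrence
    h' = f(h, a) with an optional stochastic residual, rolled forward under
    candidate action sequences without any call to the simulator.
    """
    def __init__(self, decay=0.9, noise=0.0, seed=0):
        self.decay = decay    # latent persistence (learned, in a real RSSM)
        self.noise = noise    # stochastic state component
        self.rng = random.Random(seed)

    def step(self, h, action):
        # One latent transition; the environment is never consulted.
        return self.decay * h + action + self.rng.gauss(0.0, self.noise)

    def rollout(self, h0, actions):
        """Counterfactual rollout: predicted latent trajectory under an
        alternative manoeuvre doctrine encoded as an action sequence."""
        states, h = [], h0
        for a in actions:
            h = self.step(h, a)
            states.append(h)
        return states
```

Comparing two doctrines is then two rollouts from the same initial latent state, which is the primitive behind the dashboard's side-by-side WHAT-IF comparisons.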
Three observations from training runs.
Generated from logged self-play episodes via the system's LLM advisor module. Each insight renders a complete tactical narrative, risk assessment, and Course-of-Action evaluation in commander-readable format.
Decisive blockade closure — early training
Iteration 3, 205 steps, blockade victory. Rapid attrition of breakthrough fleet under concentrated early fire.
Extended breakthrough success — mid training
Iteration 83, 9,063 steps, breakthrough victory. Long-horizon manoeuvre and delayed engagement sequence.
Late-training equilibrium — high-return outliers
Iteration 99, 9,586 steps, return 136.29. Equilibrium-regime counter-example to aggregate win-rate dominance.
Subscribe
Monthly methodology essays and policy commentary.