▸ PRODUCT INFORMATION

Sea Lock: Predictive Multi-Agent Wargaming for Adversarial Naval Blockade and Command-and-Control

A reinforcement learning framework that introduces an adaptive adversarial learning layer into mission-space Digital Twins.

▸ ABSTRACT

Mission-level Digital Twins increasingly support defence decision-making, yet most existing deployments model friendly forces and environments with high fidelity while treating adversarial behaviour as scripted or scenario-fixed. This limits their value as anticipatory Command-and-Control (C2) tools in contested maritime settings, where red and blue forces co-evolve under uncertainty and partial observability.

We present Sea Lock, a multi-agent reinforcement learning framework that introduces an adaptive adversarial learning layer into mission-space Digital Twins. The blockade engagement is formalised as a partially observable stochastic game over heterogeneous surface vessels, jointly modelling formation, reconnaissance, electronic warfare, and salvo combat.
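
For readers who want the formal object, a partially observable stochastic game is conventionally written as the tuple below. This is the textbook definition, with symbols chosen here for exposition; they are not Sea Lock's own notation.

```latex
% Standard POSG tuple; the symbols are textbook notation, not Sea Lock's own.
\mathcal{G} \;=\; \bigl\langle
  \mathcal{N},\;
  \mathcal{S},\;
  \{\mathcal{A}^i\}_{i \in \mathcal{N}},\;
  \{\Omega^i\}_{i \in \mathcal{N}},\;
  T,\; O,\;
  \{R^i\}_{i \in \mathcal{N}},\;
  \gamma
\bigr\rangle
```

Here N indexes the heterogeneous red and blue vessels, S is the joint engagement state, A^i and Ω^i are each vessel's action and observation spaces (covering formation, reconnaissance, electronic warfare, and salvo fire), T(s' | s, a) is the stochastic transition kernel, O the observation function that induces partial observability, R^i the per-team reward, and γ the discount factor.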

Three training regimes are compared under a centralised-training, decentralised-execution architecture: standard PPO, fictitious self-play, and an LLM-shaped doctrinal reward variant. A lightweight Recurrent State-Space Model, trained on interaction trajectories, enables counterfactual rollouts of alternative manoeuvre doctrines without re-simulating the environment.

Sea Lock is the maritime rendering of a broader adversarial-simulation engine designed for cross-domain composition. The same engine architecture extends to land logistics, air swarm operations, cyber resilience, and integrated multi-domain C2; each rendering shares the engine while varying observation and action spaces, reward composition, and domain-specific physics. Digital Twin is one application; predictive wargaming, force structure analysis, doctrine red-teaming, and operator decision support are others.
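
As a sketch of what "one engine, many renderings" could look like in code: the interface below is hypothetical (none of these class or method names come from Sea Lock) and only illustrates the stated split, where a rendering supplies spaces, reward composition, and physics while the engine loop stays fixed.

```python
# Hypothetical sketch of the "one engine, many renderings" composition.
# None of these names are from Sea Lock; they illustrate the idea that a
# domain rendering supplies spaces, rewards, and physics while the
# adversarial engine loop stays domain-agnostic.
from abc import ABC, abstractmethod
from typing import Any


class DomainRendering(ABC):
    """What a domain (maritime, land logistics, air swarm, cyber) must provide."""

    @abstractmethod
    def observation_space(self, agent_id: str) -> Any: ...

    @abstractmethod
    def action_space(self, agent_id: str) -> Any: ...

    @abstractmethod
    def step_physics(self, state: Any, joint_action: dict[str, Any]) -> Any:
        """Domain-specific dynamics: vessel kinematics, truck routing, etc."""

    @abstractmethod
    def compose_reward(self, state: Any, joint_action: dict[str, Any],
                       next_state: Any) -> dict[str, float]:
        """Domain-specific reward terms (e.g. blockade integrity, attrition)."""


class AdversarialEngine:
    """Domain-agnostic loop: the same engine drives every rendering."""

    def __init__(self, rendering: DomainRendering, initial_state: Any):
        self.rendering = rendering
        self.state = initial_state

    def step(self, joint_action: dict[str, Any]) -> tuple[Any, dict[str, float]]:
        next_state = self.rendering.step_physics(self.state, joint_action)
        rewards = self.rendering.compose_reward(self.state, joint_action, next_state)
        self.state = next_state
        return next_state, rewards
```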

▸ KEY RESULTS

What we measured.

1,210 · Training episodes
Across the three training regimes; mean episode length 27.8; 5,814 turns processed.

6.7% · Nash gap
Approximate-equilibrium window during the stable phase of fictitious self-play.

28× · World model loss reduction
Observation reconstruction loss reduced 1.02 → 0.036 over ten epochs of RSSM training.

95.6% · Reconnaissance share
Share of reconnaissance actions among all logged actions: an emergent deterrence-without-engagement equilibrium.

p < 10⁻²⁶² · Reward separation
Independent t-test across teams; Cohen's d = 1.6 (large effect size).

Δ̄ = 0.049 · Counterfactual divergence
Mean trajectory-level divergence between aggressive and cautious doctrines over a 30-step horizon (one plausible formalisation follows below).
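
The divergence statistic is reported without an explicit formula. A minimal formalisation consistent with "mean trajectory-level divergence over a 30-step horizon" is sketched below; the norm and the exact aggregation are assumptions, not taken from Sea Lock.

```latex
% One plausible reading of the reported statistic; notation is illustrative.
\bar{\Delta} \;=\; \frac{1}{H} \sum_{t=1}^{H}
  \bigl\lVert \hat{o}^{\,\mathrm{aggressive}}_{t} - \hat{o}^{\,\mathrm{cautious}}_{t} \bigr\rVert,
\qquad H = 30
```

where ô^d_t is the world model's predicted observation at imagined step t under doctrine d.
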
▸ INTERACTIVE DEMO

Dashboard walkthrough.

The simulator dashboard renders live engagement state, LLM tactical narration, and operator-actionable Course-of-Action recommendations. Open it and explore the system yourself.

Open Sea Lock dashboard

Mapbox tactical map · 3 recorded engagement replays with real qwen2.5:7b narration · LIVE policy playback over WebSocket · operator console (move/fire/jam/withdraw with audit log) · WHAT-IF world-model rollouts comparing up to 3 operator decisions side-by-side · custom-scenario builder.

REPLAY · LIVE · OPERATOR · WHAT-IF · CUSTOM · COUNTERFACTUAL

▸ METHODOLOGY

Three training regimes plus a learned world model, one engine.

01

Standard PPO with adaptive KL

Baseline regime. Clipped Proximal Policy Optimization with adaptive KL penalty under centralised-training, decentralised-execution. Establishes the lower bound for what learning-only agents achieve in this engagement scenario.
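
A minimal sketch of the clipped surrogate with an adaptive KL penalty, in the spirit of this regime. The hyperparameters (clip 0.2, target KL 0.01, doubling/halving schedule) are illustrative defaults from the PPO literature, not Sea Lock's settings.

```python
# Sketch of PPO's clipped objective plus an adaptive KL penalty.
# Hyperparameters are illustrative, not Sea Lock's actual configuration.
import torch


def ppo_loss(logp_new, logp_old, advantages, beta, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()
    approx_kl = (logp_old - logp_new).mean()  # sample estimate of KL(old || new)
    # Minimise the negative objective; the KL term discourages large policy jumps.
    return -(surrogate - beta * approx_kl), approx_kl


def adapt_beta(beta, approx_kl, target_kl=0.01):
    # Schulman-style adaptation: tighten the penalty when KL overshoots the
    # target, relax it when KL undershoots.
    if approx_kl > 1.5 * target_kl:
        return beta * 2.0
    if approx_kl < target_kl / 1.5:
        return beta / 2.0
    return beta
```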

02

Fictitious self-play with bounded opponent pool

Population-based training with a bounded pool of past policies. Yields the most stable strategic differentiation across the three regimes, including a measurable approximate-equilibrium window before late-stage policy-pool drift.
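
A minimal sketch of the bounded-pool mechanism, assuming a fixed-capacity FIFO pool and uniform opponent sampling; the pool size and eviction rule here are assumptions, and the paper's exact population scheme may differ.

```python
# Sketch of a bounded opponent pool for fictitious self-play; pool size,
# FIFO eviction, and uniform sampling are assumptions for illustration.
import copy
import random
from collections import deque


class BoundedOpponentPool:
    def __init__(self, max_size: int = 8):
        # deque(maxlen=...) evicts the oldest snapshot once full, which bounds
        # memory but also permits the late-stage policy-pool drift noted above.
        self.pool: deque = deque(maxlen=max_size)

    def snapshot(self, policy) -> None:
        """Freeze a copy of the current learner into the pool."""
        self.pool.append(copy.deepcopy(policy))

    def sample_opponent(self):
        """Fictitious play: best-respond to a uniform mixture over past selves."""
        return random.choice(self.pool)
```

In use, the learner would periodically call snapshot() and face sample_opponent() each episode; best-responding to a uniform mixture over past policies is what distinguishes fictitious self-play from naive latest-vs-latest self-play.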

03

LLM-shaped doctrinal reward variant

Qwen-2.5-7B (via Ollama) provides per-state doctrinal annotations and bounded reward shaping. Matches PPO baseline in aggregate learning while supplying tactical narration for commander-in-the-loop review.
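
A sketch of what bounded LLM reward shaping could look like against a local Ollama endpoint. The prompt, the scoring scale, and the clipping bound are assumptions for illustration, not Sea Lock's actual shaping scheme.

```python
# Sketch of bounded doctrinal reward shaping via a local Ollama model.
# Prompt, scale, and clipping bound are illustrative assumptions.
import ollama


def doctrinal_bonus(state_summary: str, scale: float = 0.1) -> float:
    resp = ollama.chat(
        model="qwen2.5:7b",
        messages=[{
            "role": "user",
            "content": (
                "Rate how well this naval engagement state conforms to "
                "blockade doctrine on a scale of -1 to 1. Reply with a "
                f"single number.\nState: {state_summary}"
            ),
        }],
    )
    try:
        score = float(resp["message"]["content"].strip())
    except ValueError:
        score = 0.0  # unparseable narration contributes nothing to the reward
    # Clip before scaling so the shaping term stays bounded relative to the
    # environment reward, keeping the base task dominant.
    return scale * max(-1.0, min(1.0, score))
```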

04

Recurrent State-Space World Model (counterfactual rollout)

Trained on logged interaction trajectories. Enables counterfactual rollouts of alternative manoeuvre doctrines without re-simulating the environment — the predictive C2 application primitive.
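
A sketch of the counterfactual-rollout primitive: roll two doctrine policies forward inside the learned latent model, with no simulator calls. The world_model interface (encode, decode, prior_step) is a simplified stand-in for the trained RSSM, not Sea Lock's actual module signatures.

```python
# Sketch of counterfactual rollout in a learned latent model. `world_model`
# stands in for the trained RSSM; its encode/decode/prior_step interfaces
# here are simplified assumptions, not Sea Lock's actual API.
import torch


@torch.no_grad()
def counterfactual_rollout(world_model, policy, obs, horizon=30):
    """Roll a doctrine policy forward in latent space; no simulator calls."""
    latent = world_model.encode(obs)                     # infer initial latent state
    predicted_obs = []
    for _ in range(horizon):
        action = policy(world_model.decode(latent))      # doctrine acts on decoded obs
        latent = world_model.prior_step(latent, action)  # imagined transition
        predicted_obs.append(world_model.decode(latent))
    return torch.stack(predicted_obs)


def doctrine_divergence(world_model, aggressive, cautious, obs, horizon=30):
    """Mean per-step gap between two imagined futures (cf. the reported Δ̄)."""
    a = counterfactual_rollout(world_model, aggressive, obs, horizon)
    c = counterfactual_rollout(world_model, cautious, obs, horizon)
    return (a - c).abs().mean()
```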

▸ INSIGHTS FROM MARL TRAINING

Three observations from training runs.

Generated from logged self-play episodes via the system's LLM advisor module. Each insight renders a complete tactical narrative, risk assessment, and Course-of-Action evaluation in commander-readable format.

INSIGHT 01

Decisive blockade closure — early training

Iteration 3, 205 steps, blockade victory. Rapid attrition of breakthrough fleet under concentrated early fire.

INSIGHT 02

Extended breakthrough success — mid training

Iteration 83, 9,063 steps, breakthrough victory. Long-horizon manoeuvre and delayed engagement sequence.

INSIGHT 03

Late-training equilibrium — high-return outliers

Iteration 99, 9,586 steps, return 136.29. Equilibrium-regime counter-example to aggregate win-rate dominance.

Subscribe

Monthly methodology essays and policy commentary.
