Building a coordination layer that assumes the network is the enemy

June 29, 2026

Charalampos Papamanthou, Chief Scientist, Lagrange · Week 1 after launch.

A swarm operating under real conditions is not a distributed system. It’s a distributed system whose every comforting assumption is wrong.

The network is partial. Messages drop and reorder. Nodes die mid-mission. There is no central authority — anything that looks like one gets targeted first. A swarm has to work anyway.

When we launched Halo last week, the framing was simple: it’s the coordination engine for autonomous systems. This post is what’s underneath — four pieces of the architecture, and what they look like when the network is falling apart.

The setting

Every textbook distributed system assumes things a swarm can’t get: reliable links, durable nodes, a central tie-breaker. In a swarm, connectivity is an irregular, time-varying graph. Spectrum gets congested and jammed. Drones go offline through battery drain, attack, or weather. Messages get lost, reordered, duplicated.

The Ukrainian front line is the most instrumented version of this environment ever observed — a contact zone of more than a thousand kilometers that defense reporting has called the densest electronic-warfare environment in modern history. GNSS-denied operation is mainstream engineering now.

Halo treats every failure mode as a first-class assumption, not an exception to patch.

Halo architecture: four layers. The network is the enemy.

Layer 1: Shared state

Every node needs roughly the same picture of the world, even when the network is fragmented. Think of ten people collaborating on a Google Doc, except half keep losing Wi-Fi and a few are about to be knocked offline by an adversary. Layer 1 makes the doc converge anyway, or at minimum makes the latest agreed-upon state known to everyone still connected.

We run gossip-based replication with structured ordering and a state-fusion layer. State is typed by consistency requirement — identity needs strong agreement; sensor fusion can converge over time. Updates carry vector-clock-style causality, so a returning drone can’t overwrite fresher reality. Sync is opportunistic — nodes converge as the network heals.
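To make the "returning drone can’t overwrite fresher reality" point concrete, here is a minimal vector-clock sketch. It is not Halo’s implementation: the names Register, compare, and dominates are hypothetical, and the hand-off of concurrent updates to a fusion step is only indicated, not implemented.

```python
from typing import Dict, Optional

VectorClock = Dict[str, int]  # node id -> count of updates seen from that node


def dominates(a: VectorClock, b: VectorClock) -> bool:
    """True if clock a has seen at least everything clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify a relative to b: 'equal', 'newer', 'older', or 'concurrent'."""
    a_ge, b_ge = dominates(a, b), dominates(b, a)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "newer"
    if b_ge:
        return "older"
    return "concurrent"


class Register:
    """One replicated value guarded by a vector clock (hypothetical type)."""

    def __init__(self) -> None:
        self.value: Optional[str] = None
        self.clock: VectorClock = {}

    def apply(self, value: str, clock: VectorClock) -> str:
        relation = compare(clock, self.clock)
        if relation in ("older", "equal"):
            return "stale"  # a returning drone cannot overwrite fresher state
        if relation == "concurrent":
            return "concurrent"  # neither side wins; hand off to a fusion step
        self.value, self.clock = value, dict(clock)
        return "applied"
```

A drone that was offline and comes back with clock {"d1": 2} against a stored clock of {"d1": 3, "d2": 1} gets "stale" and the stored value is untouched. An update from a partition that never saw d1 or d2 comes back "concurrent", which is the case a fusion layer has to resolve.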

The application API is deliberately boring: subscribe, read, publish, watch. Boring APIs survive integration.

Layer 2: Failure detection

The dumbest failure detector is also the most common: if I haven’t heard from node X in N seconds, X is dead or temporarily unavailable. Run that in a contested environment and watch the cascade — a jammer kicks in for two seconds, six drones get declared dead, six roles reassign, the jammer stops, and the swarm is mourning members that are still flying.

Halo uses a phi-accrual-style detector (the same approach Cassandra and Akka use) adapted for partial connectivity. Instead of “is this node dead?” it asks “how much do I trust this node to still be functioning?” Every node maintains a suspicion score for every other node, calibrated to that node’s historical message pattern. A chatty drone going quiet ramps fast. A drone on a flaky link ramps slowly. Coordination subscribes to thresholds, not booleans: “warm up a backup mapper” comes before “declare the mapper dead.”
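A rough sketch of the idea, under simplifying assumptions: this version models heartbeat gaps as exponentially distributed around the observed mean, a common simplification of the normal-distribution model in the original phi-accrual work. The class name, the window size, and both threshold values are hypothetical, and none of Halo’s partial-connectivity adaptations are shown.

```python
import math
from collections import deque
from typing import Deque, Optional

WARM_BACKUP_PHI = 2.0   # hypothetical threshold: start warming a backup
DECLARE_DEAD_PHI = 8.0  # hypothetical threshold: treat the node as lost


class PhiAccrualDetector:
    """Suspicion score for one peer, calibrated to that peer's own rhythm."""

    def __init__(self, window: int = 100) -> None:
        self.intervals: Deque[float] = deque(maxlen=window)
        self.last_heartbeat: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat arrival and the gap since the previous one."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """phi = -log10(P(gap > elapsed)) under an exponential gap model."""
        if not self.intervals or self.last_heartbeat is None:
            return 0.0  # no history yet, so no basis for suspicion
        mean = max(sum(self.intervals) / len(self.intervals), 1e-9)
        elapsed = now - self.last_heartbeat
        return elapsed / (mean * math.log(10))


def action_for(phi: float) -> str:
    """Map a suspicion score to a graded response instead of a boolean."""
    if phi >= DECLARE_DEAD_PHI:
        return "declare_dead"
    if phi >= WARM_BACKUP_PHI:
        return "warm_backup"
    return "healthy"
```

With these numbers, a peer that normally reports every 0.1 s crosses the upper threshold after two seconds of silence, while a peer that normally reports once a second is still well under the lower one. The same two-second jamming burst produces very different responses depending on each node’s history.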

In sandbox runs against a 30%-loss, intermittent-partition profile, role thrashing dropped roughly tenfold versus a timeout detector, with no measurable cost in time-to-respond.

A jammer can silence nodes. An adversary can spoof presence. We treat liveness signals as authenticated and replay-resistant, and cross-check presence claims against causal activity in Layer 1. A node that’s “alive” by ping but causally absent from any state update is itself a suspicion signal. Silence is suspicious. Presence without participation is more suspicious.
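One way to picture authenticated, replay-resistant liveness plus the presence-versus-participation cross-check is the sketch below. It assumes pre-shared per-node keys, HMAC-SHA256 tags, and a monotonically increasing beacon counter. Those choices, the LivenessVerifier name, and the five-second window are all illustrative assumptions, not a description of Halo’s wire format.

```python
import hashlib
import hmac
from typing import Dict


def sign_beacon(key: bytes, node_id: str, counter: int) -> bytes:
    """Tag a liveness beacon so it cannot be forged without the node's key."""
    message = f"{node_id}|{counter}".encode()
    return hmac.new(key, message, hashlib.sha256).digest()


class LivenessVerifier:
    """Accepts beacons only if authentic and fresh; flags silent 'presence'."""

    def __init__(self, keys: Dict[str, bytes]) -> None:
        self.keys = keys
        self.last_counter: Dict[str, int] = {}         # highest counter accepted
        self.last_beacon: Dict[str, float] = {}        # time of last good beacon
        self.last_state_update: Dict[str, float] = {}  # time of last state write

    def accept_beacon(self, node_id: str, counter: int, tag: bytes, now: float) -> bool:
        key = self.keys.get(node_id)
        if key is None:
            return False  # unknown node
        if not hmac.compare_digest(sign_beacon(key, node_id, counter), tag):
            return False  # forged or corrupted
        if counter <= self.last_counter.get(node_id, -1):
            return False  # replayed or out-of-date beacon
        self.last_counter[node_id] = counter
        self.last_beacon[node_id] = now
        return True

    def record_state_update(self, node_id: str, now: float) -> None:
        """Call when the node causally contributes to shared state."""
        self.last_state_update[node_id] = now

    def present_but_absent(self, node_id: str, now: float, window: float = 5.0) -> bool:
        """True if the node beacons recently but has not touched shared state."""
        beacon = self.last_beacon.get(node_id)
        if beacon is None or now - beacon > window:
            return False  # not even claiming presence; plain silence
        update = self.last_state_update.get(node_id)
        return update is None or now - update > window
```

A captured beacon replayed later fails the counter check, and a node whose beacons keep arriving while it contributes nothing to shared state is flagged by present_but_absent, which can then feed the suspicion score.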

Layer 3: Decision tiers

A drone reacting to a threat has milliseconds. A swarm repositioning has seconds. A mission re-tasking has minutes. Forcing all three through the same agreement protocol gets you killed.

Halo runs three tiers.

Three tiers keep a swarm a swarm: local reflex in milliseconds, neighborhood in seconds, full consensus in minutes.

Swarm-wide consensus, slowest. Repositioning, role assignment, and mission re-tasking go through full, partition-tolerant consensus. Progress is preserved on the majority side of any split; decisions made during a partition merge deterministically when the swarm reconverges.

Active-neighborhood, mid. Each node knows its currently-reachable peers and keeps coordinating with them when the broader swarm fragments. Two halves keep operating and reconcile when they meet.

Local cached state, fastest. Threat evasion, collision avoidance — the decision happens on the drone, in milliseconds, with the data it already has. Halo’s job isn’t to be in the decision. It’s to make sure the data is fresh enough that the right decision is the obvious one.

The drone never blocks on consensus when its life depends on milliseconds. The swarm never sacrifices coherence on decisions that take seconds.
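Both the neighborhood tier and the majority rule in the consensus tier depend on each node knowing which peers it can currently reach. A minimal sketch of that bookkeeping, assuming a node holds a local view of live links as an adjacency map (the links structure and both function names are hypothetical):

```python
from collections import deque
from typing import Deque, Dict, Set


def reachable_peers(links: Dict[str, Set[str]], start: str) -> Set[str]:
    """Breadth-first search over currently live links, including start itself."""
    seen: Set[str] = {start}
    queue: Deque[str] = deque([start])
    while queue:
        node = queue.popleft()
        for peer in links.get(node, set()):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


def on_majority_side(component: Set[str], swarm_size: int) -> bool:
    """Strict majority of the full swarm, so at most one side can qualify."""
    return len(component) * 2 > swarm_size
```

In a five-node swarm split into {a, b, c} and {d, e}, the first group is on the majority side and can keep making swarm-wide decisions. The second keeps coordinating as a neighborhood and reconciles when the halves meet again.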

Layer 4: Reconfiguration

When the swarm loses members, you want it to hesitate for less than a second and continue. No human intervention. No mission stop. Layer 4 is what makes that happen.

When a node is lost — or several in succession — Halo runs a cooperative reconfiguration cycle: deterministic role reassignment over the survivors (five mappers, ten scouts, three relays minus the dead ones), coverage redistribution (a four-drone sweep loses one; the three remaining shift to hold coverage), and load rebalancing across surviving relays — with no central scheduler.

No master node coordinates it. The shared state from Layer 1 is what makes this possible — reconfiguration is deterministic over that state, so every surviving node arrives at the same conclusion without asking anyone.
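The property that matters is that reconfiguration is a pure function of shared state: same survivor set in, same plan out, on every node. The sketch below illustrates that property only. Sorting by node id, the role priority order, the "reserve" role, and equal-width coverage slices are all assumptions made for the example, not Halo’s actual policy.

```python
from typing import Dict, Iterable, List, Tuple


def assign_roles(survivors: Iterable[str], quotas: List[Tuple[str, int]]) -> Dict[str, str]:
    """Fill roles in priority order over survivors sorted by id.

    Sorting first means every node computes the same answer regardless of
    the order in which it learned about the survivors.
    """
    ordered = sorted(survivors)
    assignment: Dict[str, str] = {}
    index = 0
    for role, count in quotas:
        for node in ordered[index:index + count]:
            assignment[node] = role
        index += count
    for node in ordered[index:]:
        assignment[node] = "reserve"  # hypothetical role for any surplus nodes
    return assignment


def split_coverage(nodes: Iterable[str], start: float, end: float) -> Dict[str, Tuple[float, float]]:
    """Divide the sweep interval [start, end) into equal slices, one per node."""
    ordered = sorted(nodes)
    if not ordered:
        return {}
    step = (end - start) / len(ordered)
    return {
        node: (start + i * step, start + (i + 1) * step)
        for i, node in enumerate(ordered)
    }
```

If a four-drone sweep over a 1200 m line loses one drone, split_coverage over the three survivors widens each slice from 300 m to 400 m. This naive role assignment can reshuffle more roles than strictly necessary; a production policy would more likely try to keep surviving nodes in their current roles and fill only the gaps.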

The honest part

A few things, because launches tend to overpromise.

Halo is not yet running on hardware at scale. The launch sandbox is a simulation at 10–20 nodes. The next horizon is 50-node field trials with a public scorecard — claims we’ll earn with demonstrations, not press releases.

Halo does not solve the application layer. It doesn’t decide what mission to fly, what to attack, what to map. Customer logic does. The boundary is deliberate.

We are not the first to think about distributed consensus. The literature on phi-accrual, gossip, partition-tolerant consensus, and CRDTs runs back decades. What we’re standardizing is a narrow, defense-credible substrate that’s been rebuilt one-off inside every autonomy vendor for a decade — shipped as a library that integrates without rewriting your stack.

What’s next

50-node field trials under real EW and GPS denial. Reference-platform engineering with drone manufacturers shipping into Replicator. An operator console for read-only visibility into coordination state. A published latency SLA — target sub-second swarm-wide reconfiguration at 50 nodes under 20% packet loss — measured, not promised.

Halo is the first piece of what we’re calling verifiable autonomy: distributed autonomous systems an operator, a coalition partner, or a regulator can trust, audit, and hold accountable.

If that’s the problem you have, I’d like to hear from you. defense@lagrange.dev · lagrange.dev/defense

— Charalampos