2026 · 05 · 20 · simulation
Frequent partitions break Raft
Force partitions on a schedule, observe leader churn, stalled writes, and log repair under continuous stress.
Introduction
Distributed systems often work correctly under ideal conditions but fail when the network becomes unreliable or nodes crash. Consensus algorithms like Raft are designed to tolerate node crashes and network partitions, but real behavior depends heavily on implementation details.
This simulation injects frequent, controlled network partitions into a Raft cluster to expose edge cases and weaknesses.
Goal
To observe and analyze:
- Leader churn
- Stalled or failed client writes
- Log divergence and subsequent repair
By stressing the system continuously, the project evaluates:
- Correctness under instability
- Performance degradation
- Resilience of the Raft implementation
Phase 1: Building the baseline cluster
In this phase, we construct a fully functional Raft cluster consisting of five independent nodes. The goal is to establish a stable baseline implementation before introducing any network failures or partitions.
We deploy five nodes. A five-node cluster has a quorum of three, so it tolerates up to two simultaneous node failures and lets us form both majority and minority partitions (three nodes is the minimum for tolerating a single failure). Each node runs as an isolated process and participates equally in the consensus protocol.
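The quorum arithmetic behind the cluster size can be checked with a few lines of Go. This is a small illustrative sketch; the function names `quorum` and `tolerated` are not from the project.

```go
package main

import "fmt"

// quorum returns the smallest majority of an n-node cluster.
func quorum(n int) int { return n/2 + 1 }

// tolerated returns how many nodes can fail while a majority survives.
func tolerated(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("nodes=%d quorum=%d tolerated failures=%d\n", n, quorum(n), tolerated(n))
	}
}
```

Note that a four-node cluster tolerates no more failures than a three-node one, which is why odd sizes are the usual choice.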
Node responsibilities
Each node implements the core components of the Raft protocol:
- Raft state
  - currentTerm to track logical time
  - votedFor to record the vote cast in the current term
  - log[] to store replicated commands
  - commitIndex and lastApplied for state machine progress
- RPC endpoints
  - RequestVote for leader election
  - AppendEntries for log replication and heartbeats
- Persistent storage
  - Critical state (term, votedFor, log) is stored on disk to survive restarts
  - This can be implemented with simple local storage (e.g., a JSON file or an embedded store such as BoltDB)
Cluster deployment with Docker Compose
To simulate a distributed environment in a reproducible and code-driven way, we use Docker Compose. This approach gives us:
- Process isolation (each node behaves like a separate machine)
- A built-in private network for inter-node communication
- Deterministic configuration via YAML
- Easy startup, teardown, and scaling
Each Raft node runs inside its own container, exposing a single port for RPC communication.
docker-compose.yml
version: "3.9"
services:
  node1:
    build: .
    container_name: node1
    ports:
      - "8001:8000"
    environment:
      - NODE_ID=1
      - PORT=8000
      - PEERS=node2:8000,node3:8000,node4:8000,node5:8000
  node2:
    build: .
    container_name: node2
    ports:
      - "8002:8000"
    environment:
      - NODE_ID=2
      - PORT=8000
      - PEERS=node1:8000,node3:8000,node4:8000,node5:8000
  node3:
    build: .
    container_name: node3
    ports:
      - "8003:8000"
    environment:
      - NODE_ID=3
      - PORT=8000
      - PEERS=node1:8000,node2:8000,node4:8000,node5:8000
  node4:
    build: .
    container_name: node4
    ports:
      - "8004:8000"
    environment:
      - NODE_ID=4
      - PORT=8000
      - PEERS=node1:8000,node2:8000,node3:8000,node5:8000
  node5:
    build: .
    container_name: node5
    ports:
      - "8005:8000"
    environment:
      - NODE_ID=5
      - PORT=8000
      - PEERS=node1:8000,node2:8000,node3:8000,node4:8000
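Each container receives its identity and peer list through the NODE_ID, PORT, and PEERS variables. A node might read them at startup roughly as follows. The `Config` type and `parseConfig` function are hypothetical names; the lookup function is passed in so the parsing can be exercised without a real environment.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Config holds the values each container receives from docker-compose.yml.
type Config struct {
	NodeID int
	Port   int
	Peers  []string
}

// parseConfig builds a Config from a lookup function such as os.Getenv.
func parseConfig(getenv func(string) string) (Config, error) {
	id, err := strconv.Atoi(getenv("NODE_ID"))
	if err != nil {
		return Config{}, fmt.Errorf("NODE_ID: %w", err)
	}
	port, err := strconv.Atoi(getenv("PORT"))
	if err != nil {
		return Config{}, fmt.Errorf("PORT: %w", err)
	}
	var peers []string
	for _, p := range strings.Split(getenv("PEERS"), ",") {
		if p = strings.TrimSpace(p); p != "" {
			peers = append(peers, p)
		}
	}
	if len(peers) == 0 {
		return Config{}, fmt.Errorf("PEERS: no peers configured")
	}
	return Config{NodeID: id, Port: port, Peers: peers}, nil
}

func main() {
	// Sample values copied from the node1 service; inside a container,
	// pass os.Getenv instead of this map-backed lookup.
	env := map[string]string{
		"NODE_ID": "1",
		"PORT":    "8000",
		"PEERS":   "node2:8000,node3:8000,node4:8000,node5:8000",
	}
	cfg, err := parseConfig(func(k string) string { return env[k] })
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("node %d listening on :%d with %d peers\n", cfg.NodeID, cfg.Port, len(cfg.Peers))
}
```

Failing fast on a missing or malformed variable keeps a misconfigured node from joining the cluster with a wrong peer set.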
Dockerfile
FROM golang:1.22
WORKDIR /app
COPY . .
RUN go build -o raft-node .
CMD ["./raft-node"]
Running the cluster
docker compose up --build
At this point, we should have:
- 5 nodes running concurrently
- A functioning Raft cluster
- Leader election occurring automatically
- Log replication working under normal conditions
This baseline is critical: any instability here will be amplified once failures are introduced.
Phase 2: Introducing the network layer
With a stable cluster in place, the next step is to take control of communication between nodes.
In a real distributed system, the network is unreliable: messages can be delayed, dropped, or blocked entirely. To simulate this accurately, we must decouple Raft logic from direct network communication.
Instead of nodes communicating directly:
Node A => Node B
We introduce an intermediary:
Node A => Network Layer => Node B
This network layer becomes the single control point for all communication and allows us to inject failures in a controlled and observable way.
Responsibilities of the network layer
- Routing messages between nodes
- Dropping messages (simulate packet loss)
- Delaying messages (simulate latency)
- Blocking communication (simulate partitions)
Implementation approach
The most practical approach for this setup is an application-level transport layer inside our Go code.
Define a transport interface:
type Transport interface {
	Send(to string, msg Message) error
}
All Raft RPCs (RequestVote, AppendEntries) must go through this interface instead of making direct HTTP/gRPC calls.
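To make the interface concrete, here is a minimal in-memory implementation that routes messages to per-node inboxes. It is a sketch for local testing only: the fields of `Message` are assumed (the article does not define them), and `memTransport` is a hypothetical name.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a hypothetical envelope; its fields are assumptions.
type Message struct {
	From string
	Type string // e.g. "RequestVote" or "AppendEntries"
	Term int
}

// Transport matches the interface defined in the article.
type Transport interface {
	Send(to string, msg Message) error
}

var (
	errUnknownPeer = errors.New("unknown peer")
	errInboxFull   = errors.New("inbox full")
)

// memTransport delivers messages to buffered per-node inbox channels.
type memTransport struct {
	inboxes map[string]chan Message
}

func newMemTransport(ids ...string) *memTransport {
	t := &memTransport{inboxes: make(map[string]chan Message)}
	for _, id := range ids {
		t.inboxes[id] = make(chan Message, 16)
	}
	return t
}

// Send enqueues msg for the target node and never blocks the sender.
func (t *memTransport) Send(to string, msg Message) error {
	inbox, ok := t.inboxes[to]
	if !ok {
		return errUnknownPeer
	}
	select {
	case inbox <- msg:
		return nil
	default:
		return errInboxFull
	}
}

func main() {
	mt := newMemTransport("node1", "node2")
	var tr Transport = mt // Raft code would depend only on the interface

	if err := tr.Send("node2", Message{From: "node1", Type: "RequestVote", Term: 1}); err != nil {
		fmt.Println("send failed:", err)
		return
	}
	got := <-mt.inboxes["node2"]
	fmt.Printf("node2 received %s from %s (term %d)\n", got.Type, got.From, got.Term)
}
```

Because the Raft code sees only `Transport`, the same node logic can run against this in-memory version in unit tests and against an HTTP or gRPC version inside the containers.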
Network controller
Implement a central controller that maintains the current network state:
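type Network struct {
	mu       sync.RWMutex               // guards all fields below
	blocked  map[string]map[string]bool // from -> to; true means the directed link is cut
	latency  time.Duration              // artificial delay added to every delivered message
	dropRate float64                    // probability in [0, 1] of dropping a message
}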
Before sending any message:
- Check if the link is blocked => drop
- Apply random drop probability => maybe drop
- Apply artificial delay => sleep
- Forward the message if allowed
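The four checks above could be wired together as in the sketch below. The struct mirrors the one in the article, with one added assumption: a seeded random source (`rng`) so that drop decisions are reproducible between runs. `NewNetwork`, `Block`, `Heal`, and `Deliver` are hypothetical method names.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

type Network struct {
	mu       sync.RWMutex
	blocked  map[string]map[string]bool // from -> to
	latency  time.Duration
	dropRate float64
	rng      *rand.Rand // assumption: seeded source for reproducible drops
}

func NewNetwork(seed int64) *Network {
	return &Network{
		blocked: make(map[string]map[string]bool),
		rng:     rand.New(rand.NewSource(seed)),
	}
}

// Block cuts the directed link from -> to.
func (n *Network) Block(from, to string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.blocked[from] == nil {
		n.blocked[from] = make(map[string]bool)
	}
	n.blocked[from][to] = true
}

// Heal restores every link.
func (n *Network) Heal() {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.blocked = make(map[string]map[string]bool)
}

// Deliver runs the checks in order and calls forward only if the message
// survives. It reports whether the message was forwarded.
func (n *Network) Deliver(from, to string, forward func()) bool {
	// A full lock is taken because *rand.Rand is not safe for concurrent use.
	n.mu.Lock()
	blocked := n.blocked[from][to] // reading a missing inner map yields false
	dropped := !blocked && n.rng.Float64() < n.dropRate
	delay := n.latency
	n.mu.Unlock()

	if blocked || dropped {
		return false
	}
	time.Sleep(delay) // sleep outside the lock so other senders are not stalled
	forward()
	return true
}

func main() {
	nw := NewNetwork(1)
	nw.Block("node1", "node2")
	fmt.Println("node1 -> node2 delivered:", nw.Deliver("node1", "node2", func() {}))
	fmt.Println("node2 -> node1 delivered:", nw.Deliver("node2", "node1", func() {}))
	nw.Heal()
	fmt.Println("after heal delivered:", nw.Deliver("node1", "node2", func() {}))
}
```

Links are directed, so a full partition between two nodes requires blocking both directions; leaving one direction open is a convenient way to test asymmetric failures.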
Experiments we will conduct
01 Leader isolation test
Idea
Force the current leader into a partition where it cannot communicate with a majority of nodes.
Setup
- Identify leader (from logs or client metadata)
- Isolate it: leader alone; remaining 4 nodes form majority group
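Isolating a single node amounts to cutting every directed link between it and the rest of the cluster. A small helper can compute that set of links; `Link` and `isolate` are illustrative names, and the result would be fed to whatever blocking mechanism the network layer exposes.

```go
package main

import "fmt"

// Link is a directed edge between two node IDs.
type Link struct{ From, To string }

// isolate returns every directed link between target and the other nodes.
// Blocking all of them leaves target alone while the rest stay connected.
func isolate(target string, nodes []string) []Link {
	var links []Link
	for _, n := range nodes {
		if n == target {
			continue
		}
		links = append(links, Link{target, n}, Link{n, target})
	}
	return links
}

func main() {
	nodes := []string{"node1", "node2", "node3", "node4", "node5"}
	for _, l := range isolate("node3", nodes) {
		fmt.Printf("block %s -> %s\n", l.From, l.To)
	}
}
```

For a five-node cluster this yields eight directed links, and no link between two non-leader nodes is touched, so the remaining four keep their majority.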
What should happen
- Majority elects a new leader
- Old leader remains “alive” but cannot commit entries
- When the partition heals, the old leader sees the higher term, steps down to follower, and reconciles its log with the new leader
What we are testing
- Correctness of leader step-down logic
- Safety of split-brain prevention
- Term updates under isolation
Failure signals
- Two leaders active in the same term (a stale leader in an older term is expected until the partition heals)
- Old leader reporting entries as committed while isolated
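The first failure signal can be checked mechanically from periodic status snapshots. The sketch below flags any term in which more than one node claims leadership; the `Status` type and its fields are assumptions about what a node might report, not an existing API of the project.

```go
package main

import "fmt"

// Status is a hypothetical snapshot reported by a node.
type Status struct {
	Node     string
	Term     int
	IsLeader bool
}

// sameTermLeaders returns, for each term with more than one self-declared
// leader, the nodes claiming leadership in that term.
func sameTermLeaders(snap []Status) map[int][]string {
	byTerm := make(map[int][]string)
	for _, s := range snap {
		if s.IsLeader {
			byTerm[s.Term] = append(byTerm[s.Term], s.Node)
		}
	}
	for term, leaders := range byTerm {
		if len(leaders) < 2 {
			delete(byTerm, term) // a single leader per term is the healthy case
		}
	}
	return byTerm
}

func main() {
	// node1 is the isolated old leader in term 3; node2 won term 4.
	snap := []Status{
		{Node: "node1", Term: 3, IsLeader: true},
		{Node: "node2", Term: 4, IsLeader: true},
		{Node: "node3", Term: 4, IsLeader: false},
	}
	fmt.Println("violations:", len(sameTermLeaders(snap)))
}
```

Grouping by term matters: an isolated leader that has not yet learned of the new term is expected behavior, while two leaders in one term would indicate a broken election.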
02 Minority partition write block test
Idea
Put the leader in the minority partition and try writing to it.
Setup
- Leader isolated with 1 follower
- Remaining 3 nodes form majority
What should happen
- Leader cannot commit entries
- Writes should fail or remain uncommitted
What we are testing
- Quorum enforcement
- Write safety under partition
- Correct rejection of unsafe commits
Failure signals
- Writes succeeding in minority partition
- Clients receiving a success response for entries that were never committed
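The expected outcome of this experiment follows directly from counting acknowledgements. The sketch below models only that count, with hypothetical helper names; a real Raft leader additionally requires the entry to belong to its current term before committing by replica count.

```go
package main

import "fmt"

// canCommit reports whether an entry stored on acks nodes (leader included)
// has reached a majority of a cluster with clusterSize nodes.
func canCommit(acks, clusterSize int) bool {
	return acks >= clusterSize/2+1
}

// ackCount counts the leader plus every follower it can currently reach.
func ackCount(leader string, reachable map[string]bool) int {
	count := 1 // the leader's own log
	for node, ok := range reachable {
		if ok && node != leader {
			count++
		}
	}
	return count
}

func main() {
	// Leader node1 is partitioned together with node2 only.
	reachable := map[string]bool{"node2": true, "node3": false, "node4": false, "node5": false}
	acks := ackCount("node1", reachable)
	fmt.Printf("acks=%d committed=%v\n", acks, canCommit(acks, 5))
	// prints "acks=2 committed=false"
}
```

With two of five nodes holding the entry, the write must stay uncommitted, so any client that receives a success response during this partition has found a quorum enforcement bug.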