Scaling & Multi-Node
Deploy additional nodes and keep data in sync across them.
Overview
A single ControlBird node handles thousands of data points with sub-millisecond latency, but real deployments often need more: a node close to the devices on the factory floor, another in the cloud for remote dashboards, or a redundant pair so operations continue if one node goes offline. ControlBird's multi-node architecture lets you deploy independent nodes that automatically discover each other and stay synchronized through direct peer-to-peer connections. There is no central coordinator: every node carries a full copy of the Store and accepts both reads and writes.
When a node joins an environment, it connects to its peers, receives a full copy of the current state, and then participates in continuous real-time replication. The complete Store (entity tree, field values with timestamps, schemas, and automations) replicates across the cluster, while each node keeps its own write-ahead log, snapshots, and service logs for independent persistence and debugging. The cluster automatically elects a leader to coordinate certain operations, but all nodes remain readable and writable. Use multiple nodes when you need geographic distribution, high availability, or workload isolation across protocols and device groups.
Prefer a guided tutorial?
New to this? Follow the Scale with Multiple Nodes walkthrough for a step-by-step tour, then come back here for the full reference.
Key Concepts
- Node: a deployed ControlBird instance with a unique NODE_ID (default node-a) and its own data directory. Each node runs the full service set and holds a complete copy of the Store.
- Peer-to-peer sync: nodes connect directly to one another over the kernel's unified port (default 9100). There is no central broker; the Store replicates between peers.
- Leader election: the cluster automatically and deterministically selects one node as leader to coordinate certain operations. If the leader goes offline, a standby takes over.
- Real-time replication: every write is broadcast to all connected peers, keeping all nodes consistent in near real-time.
- Node-local persistence: write-ahead log files, snapshots, and logs live under each node's own data directory and are never synchronized across peers.
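To make the "deterministic" part of leader election concrete, here is a minimal sketch. It assumes the smallest NODE_ID among reachable nodes wins; that rule is an illustration only, and the actual election rule (which also depends partly on timing) is not specified on this page. The function name `pick_leader` is hypothetical and is not part of ControlBird.

```python
from typing import Iterable, Optional


def pick_leader(online_nodes: Iterable[str]) -> Optional[str]:
    """Return the leader among the nodes that are currently reachable.

    ASSUMPTION: the smallest NODE_ID wins. The real election rule is not
    documented here; this only shows what "deterministic" buys you.
    """
    candidates = sorted(set(online_nodes))
    return candidates[0] if candidates else None


cluster = {"node-a", "node-b", "node-c"}
print(pick_leader(cluster))                         # all three online
print(pick_leader(cluster - {"node-a"}))            # node-a offline
print(pick_leader(cluster - {"node-a", "node-b"}))  # only node-c left
```

Because the choice depends only on the set of reachable nodes, every node that sees the same membership arrives at the same leader without a separate coordinator.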
How Nodes Discover and Sync
When a new node joins an environment, it connects to each peer, receives a full copy of the current state, and then switches to continuous replication. The initial state transfer runs once at join time; after that, only changes cross the wire.
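The join sequence above can be pictured with a toy in-memory model. This is a sketch for intuition only: `ToyNode`, its methods, and the field keys are hypothetical, there is no networking, and nothing here reflects ControlBird's actual wire protocol or API.

```python
from typing import Dict, List


class ToyNode:
    """In-memory stand-in for a node; not the ControlBird API."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.store: Dict[str, object] = {}
        self.peers: List["ToyNode"] = []

    def join(self, existing: List["ToyNode"]) -> None:
        """Connect to each peer, then take one full copy of the state."""
        for peer in existing:
            self.peers.append(peer)
            peer.peers.append(self)
        if existing:
            # Initial state transfer: happens once, at join time.
            self.store = dict(existing[0].store)

    def write(self, key: str, value: object) -> None:
        """Apply a write locally, then send only that change to peers."""
        self.store[key] = value
        for peer in self.peers:
            peer.store[key] = value


node_a = ToyNode("node-a")
node_a.write("line1/temp", 21.5)   # written before node-b exists

node_b = ToyNode("node-b")
node_b.join([node_a])              # full copy of the current state
node_b.write("line1/speed", 3)     # incremental change reaches node-a

print(node_a.store == node_b.store)
```

The point of the model is the split between the one-time copy in `join` and the per-change fan-out in `write`; either node can originate a write.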
What synchronizes and what stays local
| Data | Synced across nodes | Notes |
|---|---|---|
| Entity tree | Yes | All entities, hierarchy, and names |
| Field values | Yes | Current values with timestamps and writer info |
| Schemas | Yes | Entity type definitions and field schemas |
| Automations | Yes | Rule Chains are entities, so they replicate too |
| WAL (.data/{NODE_ID}/wal/) | No | Each node keeps its own write-ahead log |
| Snapshots (.data/{NODE_ID}/snapshots/) | No | Point-in-time snapshots are node-local |
| Service logs (.data/{NODE_ID}/service-logs/) | No | Node-specific, used for debugging |
| Comm logs (.data/{NODE_ID}/comm-logs/) | No | Protocol traces stay on the node that produced them |
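As a small illustration of the node-local layout in the table, the sketch below builds the four per-node directories from a NODE_ID. The helper name `node_local_dirs` and the default base of `.data` taken relative to the working directory are assumptions for the example; only the subdirectory names come from the table.

```python
from pathlib import PurePosixPath
from typing import Dict

# Subdirectories listed as "not synced" in the table above.
NODE_LOCAL_SUBDIRS = ("wal", "snapshots", "service-logs", "comm-logs")


def node_local_dirs(node_id: str, base: str = ".data") -> Dict[str, PurePosixPath]:
    """Map each node-local category to its path under .data/{NODE_ID}/."""
    root = PurePosixPath(base) / node_id
    return {name: root / name for name in NODE_LOCAL_SUBDIRS}


for name, path in node_local_dirs("node-a").items():
    print(f"{name}: {path}")
```

Because every path is rooted at the node's own ID, two nodes never share these directories.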
Configuration & Usage
Add a second node via the Control Plane
The Control Plane portal handles node discovery and peer-address injection automatically. Nodes in the same environment discover each other without manual IP configuration.
Control Plane portal:
1. Open your environment and click "Add Node"
2. Select a subscription tier for the node
3. Set a unique node name, e.g. production-node-2
4. Select the deployment region
5. Click Deploy
The new node automatically discovers and connects to node-a,
then syncs the current Store state from it.
Start a self-managed cluster
For self-managed deployments, each node is defined in configuration with its own reachable address (host:port). A node binds the port from its own entry and discovers its peers from the other entries; no peer addresses are passed on the command line. Each node initiates peer connections and performs a full sync on first connect.
# Configuration defines every node and its address, e.g.
# node-a -> 192.168.1.9:9100
# node-b -> 192.168.1.10:9100
# node-c -> 192.168.1.11:9100
# Each kernel only needs its own identity and data directory:
cb-kernel --node node-a --data-dir /data
Client failover with KERNEL_ADDRESS
Services and CLI tools read KERNEL_ADDRESS as a comma-separated list and try each address in order. Tools like cb-cli, cb-tree, and cb-select auto-parse this format.
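For a custom client that wants the same in-order behavior, a minimal sketch follows. The function names, the `host:port` parsing rules, and the 2-second timeout are assumptions made for illustration; this is not how cb-cli or the other tools are implemented.

```python
import socket
from typing import Callable, List, Tuple

Address = Tuple[str, int]


def parse_kernel_address(value: str) -> List[Address]:
    """Split 'host:port,host:port' into (host, port) pairs, keeping order."""
    addresses: List[Address] = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.rpartition(":")
        if not host:
            raise ValueError(f"expected host:port, got {item!r}")
        addresses.append((host, int(port)))
    return addresses


def tcp_connect(address: Address) -> socket.socket:
    """Open a TCP connection; the timeout value is an arbitrary choice."""
    return socket.create_connection(address, timeout=2.0)


def connect_first_available(
    addresses: List[Address],
    connect: Callable[[Address], object] = tcp_connect,
) -> object:
    """Try each address in order; fail once the list is exhausted."""
    last_error = None
    for address in addresses:
        try:
            return connect(address)
        except OSError as exc:
            last_error = exc
    raise ConnectionError(f"no kernel reachable at {addresses}") from last_error


def fake_connect(address: Address) -> Address:
    """Stand-in transport for the demo: pretend node-a is down."""
    if address[0] == "node-a":
        raise OSError("node-a unreachable")
    return address


targets = parse_kernel_address("node-a:9100,node-b:9100")
print(targets)
print(connect_first_available(targets, connect=fake_connect))
```

Passing the transport in as `connect` keeps the ordering logic testable without a live cluster. The sketch makes a single pass over the list and then raises, so any retry policy has to live in the caller.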
# Try node-a first; if it is unreachable, fall back to node-b
KERNEL_ADDRESS=node-a:9100,node-b:9100
Leader election across a 3-node cluster
3-node cluster:
node-a -> elected leader
node-b -> standby
node-c -> standby
If node-a goes offline:
node-b is promoted to leader automatically
If node-b also goes offline:
node-c takes over
Promotion happens automatically with no manual intervention.
Common Patterns
Edge + Cloud topology
The most common multi-node layout. An edge node runs on-site and connects devices with low latency; a cloud node syncs the full Store for remote dashboards and data aggregation. Both nodes run the same services (protocol controllers, automations) and coordinate through Store writes.
Per-node persistence
Write-ahead log and snapshot files are stored per node, for example .data/node-a/wal/ and .data/node-b/wal/. Each node recovers from its own files; there is no cross-node persistence dependency. A node that crashes and restarts re-syncs Store state from its peers while replaying its own local persistence.
Shared schemas across nodes
All nodes in an environment share the same schemas, which are applied automatically at startup. You don't need to copy schema definitions between nodes manually.
Monitoring sync health
Use the Logs app's node selector to view a specific node's peer connections: a connected peer shows as Connected and you can watch data flowing between peers. The same view reflects when a peer is syncing or disconnected.
| Indicator | Meaning |
|---|---|
| Connected | Peer is online and syncing in real-time |
| Syncing | Initial state transfer in progress |
| Disconnected | Peer is offline; it will resync when reconnected |
Troubleshooting & Limitations
- Leader election assumes synchronized clocks. The cluster picks a leader based partly on timing, so large clock skew between nodes can cause unexpected leader changes.
- Initial full sync can take seconds. A joining node receives a full copy of the current state. For deployments with thousands of entities this can take several seconds; incremental updates afterward are near-instant.
- Network partitions create an eventual-consistency window. If two nodes are disconnected and both write conflicting values, the write with the latest timestamp wins on reconnect. This conflict-resolution rule is not configurable.
- KERNEL_ADDRESS failover is client-side only. A service configured with node-a:9100,node-b:9100 tries node-a then node-b, but if both are down the connection fails rather than retrying indefinitely.
- Self-managed nodes need peer addresses at startup. When you deploy through the Control Plane, discovery is automatic. For self-managed deployments, peer addresses are supplied at startup, so manually adding a node means updating peer configuration on existing nodes and restarting them.
- Persistence is not synchronized. Each node maintains its own write-ahead log and snapshots. A restarted node re-syncs Store state from peers but replays only its own local persistence.
- Service logs are node-local. Troubleshooting a multi-node issue means checking logs on the specific node where the problem occurred, via the Logs app node selector.
- No built-in cluster migration. Adding or removing nodes is a manual Control Plane operation. A new node catches up via a full sync rather than a zero-downtime handover.
Eventual consistency under partition
During a network partition, nodes may temporarily diverge. When connectivity is restored, the latest write by timestamp wins; there is no configurable merge strategy. For most workloads the reconnect-and-resolve cycle is fast enough to be imperceptible.
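The latest-timestamp rule can be sketched in a few lines. The `FieldWrite` record and `resolve` function are hypothetical names, the timestamp unit is an assumption, and the handling of an exact tie is not specified on this page; the sketch simply keeps the local write in that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldWrite:
    value: object
    timestamp: float  # ASSUMPTION: seconds since the epoch
    writer: str       # which node or service wrote the value


def resolve(local: FieldWrite, remote: FieldWrite) -> FieldWrite:
    """Latest timestamp wins.

    Tie handling is NOT documented; keeping the local write on an exact
    tie is a placeholder choice for this sketch.
    """
    return remote if remote.timestamp > local.timestamp else local


# Two nodes wrote the same field while partitioned.
from_a = FieldWrite(value=40, timestamp=100.0, writer="node-a")
from_b = FieldWrite(value=55, timestamp=100.2, writer="node-b")

print(resolve(from_a, from_b).value)  # as seen from node-a on reconnect
print(resolve(from_b, from_a).value)  # as seen from node-b on reconnect
```

Both sides settle on the same value because the comparison depends only on the timestamps. That also means each node's clock decides which write survives, so clock skew can affect the outcome.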