Scaling & Multi-Node

Deploy additional nodes and keep data in sync across them.

Overview

A single ControlBird node handles thousands of data points with sub-millisecond latency, but real deployments often need more: a node close to the devices on the factory floor, another in the cloud for remote dashboards, or a redundant pair so operations continue if one node goes offline. ControlBird's multi-node architecture lets you deploy independent nodes that automatically discover each other and stay synchronized through direct peer-to-peer connections. There is no central coordinator: every node carries a full copy of the Store and accepts both reads and writes.

When a node joins an environment, it connects to its peers, receives a full copy of the current state, and then participates in continuous real-time replication. The complete Store (entity tree, field values with timestamps, schemas, and automations) replicates across the cluster, while each node keeps its own write-ahead log, snapshots, and service logs for independent persistence and debugging. The cluster automatically elects a leader to coordinate certain operations, but all nodes remain readable and writable. Use multiple nodes when you need geographic distribution, high availability, or workload isolation across protocols and device groups.

Prefer a guided tutorial?

New to this? Follow the Scale with Multiple Nodes walkthrough for a step-by-step tour, then come back here for the full reference.

Key Concepts

  • Node: a deployed ControlBird instance with a unique NODE_ID (default node-a) and its own data directory. Each node runs the full service set and holds a complete copy of the Store.
  • Peer-to-peer sync: nodes connect directly to one another over the kernel's unified port (default 9100). There is no central broker; the Store replicates between peers.
  • Leader election: the cluster automatically and deterministically selects one node as leader to coordinate certain operations. If the leader goes offline, a standby takes over.
  • Real-time replication: every write is broadcast to all connected peers, keeping all nodes consistent in near real-time.
  • Node-local persistence: write-ahead log files, snapshots, and logs live under each node's own data directory and are never synchronized across peers.

How Nodes Discover and Sync

When a new node joins an environment, it connects to each peer, receives a full copy of the current state, and then switches to continuous replication. The initial state transfer runs once at join time; after that, only changes cross the wire.
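The join-then-replicate flow can be pictured with a toy in-memory model. This is only a sketch: `ToyNode`, its methods, and the `pump-1/speed` key are hypothetical names invented for illustration, and the real kernel's wire protocol is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """In-memory stand-in for a node (hypothetical, not the real kernel)."""
    node_id: str
    store: dict = field(default_factory=dict)
    # Exclude peers from repr/compare to avoid recursing through the peer graph.
    peers: list = field(default_factory=list, repr=False, compare=False)

    def join(self, existing: "ToyNode") -> None:
        # One-time full state transfer from a node already in the environment.
        self.store = dict(existing.store)
        # Connect to that node and to every peer it already knows about.
        for peer in [existing, *existing.peers]:
            peer.peers.append(self)
            self.peers.append(peer)

    def write(self, key: str, value) -> None:
        # Apply locally, then send only this change to each connected peer.
        self.store[key] = value
        for peer in self.peers:
            peer.store[key] = value


node_a = ToyNode("node-a")
node_a.write("pump-1/speed", 1200)

node_b = ToyNode("node-b")
node_b.join(node_a)                  # full copy, once, at join time
node_b.write("pump-1/speed", 1350)   # afterward only the change is sent
```

After the last line both toy nodes hold the same value for `pump-1/speed`, which mirrors the "full copy once, then changes only" behavior described above.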

What synchronizes and what stays local

Data                                           Synced across nodes   Notes
Entity tree                                    Yes                   All entities, hierarchy, and names
Field values                                   Yes                   Current values with timestamps and writer info
Schemas                                        Yes                   Entity type definitions and field schemas
Automations                                    Yes                   Rule Chains are entities, so they replicate too
WAL (.data/{NODE_ID}/wal/)                     No                    Each node keeps its own write-ahead log
Snapshots (.data/{NODE_ID}/snapshots/)         No                    Point-in-time snapshots are node-local
Service logs (.data/{NODE_ID}/service-logs/)   No                    Node-specific, used for debugging
Comm logs (.data/{NODE_ID}/comm-logs/)         No                    Protocol traces stay on the node that produced them

Configuration & Usage

Add a second node via the Control Plane

The Control Plane portal handles node discovery and peer-address injection automatically. Nodes in the same environment discover each other without manual IP configuration.

Control Plane portal:
  1. Open your environment and click "Add Node"
  2. Select a subscription tier for the node
  3. Set a unique node name, e.g. production-node-2
  4. Select the deployment region
  5. Click Deploy

The new node automatically discovers and connects to node-a,
then syncs the current Store state from it.

Start a self-managed cluster

For self-managed deployments, each node is defined in configuration with its own reachable address (host:port). A node binds the port from its own entry and discovers its peers from the other entries; no peer addresses are passed on the command line. Each node initiates peer connections and performs a full sync on first connect.

# Configuration defines every node and its address, e.g.
#   node-a  ->  192.168.1.9:9100
#   node-b  ->  192.168.1.10:9100
#   node-c  ->  192.168.1.11:9100

# Each kernel only needs its own identity and data directory:
cb-kernel --node node-a --data-dir /data
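How a node might derive its bind port and peer list from such a configuration can be sketched as follows. The dictionary layout and the `resolve_node` helper are assumptions made for illustration; the actual configuration file format is not described in this section.

```python
def resolve_node(config: dict[str, str], node_id: str) -> tuple[int, dict[str, str]]:
    """Return (port to bind, peer addresses) for node_id.

    `config` maps node IDs to "host:port" strings (assumed layout).
    """
    if node_id not in config:
        raise KeyError(f"{node_id!r} is not defined in the cluster configuration")
    # The node binds the port from its own entry...
    _host, _, port = config[node_id].rpartition(":")
    # ...and treats every other entry as a peer to connect to.
    peers = {name: addr for name, addr in config.items() if name != node_id}
    return int(port), peers


# Same addresses as the example configuration above.
CLUSTER = {
    "node-a": "192.168.1.9:9100",
    "node-b": "192.168.1.10:9100",
    "node-c": "192.168.1.11:9100",
}

bind_port, peers = resolve_node(CLUSTER, "node-a")
```

Here `bind_port` is 9100 and `peers` holds the entries for node-b and node-c, which is why no peer addresses need to appear on the command line.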

Client failover with KERNEL_ADDRESS

Services and CLI tools read KERNEL_ADDRESS as a comma-separated list and try each address in order. Tools like cb-cli, cb-tree, and cb-select auto-parse this format.

# Try node-a first; if unreachable, retry node-b
KERNEL_ADDRESS=node-a:9100,node-b:9100
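A custom client could mirror this try-in-order behavior roughly as shown here. This is a minimal sketch, not the code used by cb-cli or the other tools: `parse_kernel_address` and `connect_first` are hypothetical helpers, and every entry is assumed to be in host:port form.

```python
import socket
from typing import Callable


def parse_kernel_address(value: str) -> list[tuple[str, int]]:
    """Split a comma-separated KERNEL_ADDRESS value into (host, port) pairs."""
    addresses = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        host, _, port = item.rpartition(":")
        addresses.append((host, int(port)))
    return addresses


def connect_first(value: str, connect: Callable = socket.create_connection):
    """Try each address in order and return the first successful connection."""
    last_error = None
    for address in parse_kernel_address(value):
        try:
            return connect(address)
        except OSError as exc:  # refused, unreachable, timed out, ...
            last_error = exc
    # Every address failed: give up instead of retrying indefinitely.
    raise ConnectionError(f"no kernel reachable in {value!r}") from last_error
```

A caller would typically pass `os.environ["KERNEL_ADDRESS"]` as `value`. The `connect` parameter exists only so the function can be exercised without a network.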

Leader election across a 3-node cluster

3-node cluster:
  node-a  -> elected leader
  node-b  -> standby
  node-c  -> standby

If node-a goes offline:
  node-b is promoted to leader automatically
If node-b also goes offline:
  node-c takes over

Promotion happens automatically with no manual intervention.

Common Patterns

Edge + Cloud topology

The most common multi-node layout. An edge node runs on-site and connects devices with low latency; a cloud node syncs the full Store for remote dashboards and data aggregation. Both nodes run the same services (protocol controllers, automations) and coordinate through Store writes.

Per-node persistence

Write-ahead log and snapshot files are stored per node, for example .data/node-a/wal/ and .data/node-b/wal/. Each node recovers from its own files; there is no cross-node persistence dependency. A node that crashes and restarts re-syncs Store state from its peers while replaying its own local persistence.
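For scripts that back up or inspect node-local files, the per-node layout can be expressed as a small helper. The directory names follow the paths listed in this document; the `node_local_dirs` function itself is a hypothetical convenience, not part of ControlBird.

```python
from pathlib import PurePosixPath


def node_local_dirs(node_id: str, root: str = ".data") -> dict[str, PurePosixPath]:
    """Map each node-local directory name to its path under the node's data directory."""
    base = PurePosixPath(root) / node_id
    names = ("wal", "snapshots", "service-logs", "comm-logs")
    return {name: base / name for name in names}


dirs_a = node_local_dirs("node-a")
dirs_b = node_local_dirs("node-b")
```

The two results share no paths, which reflects that nothing under these directories is synchronized between peers.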

Shared schemas across nodes

All nodes in an environment share the same schemas, which are applied automatically at startup. You don't need to copy schema definitions between nodes manually.

Monitoring sync health

Use the Logs app's node selector to view a specific node's peer connections. Each peer shows one of the indicators below, and for a connected peer you can watch data flowing between the two nodes.

Indicator      Meaning
Connected      Peer is online and syncing in real-time
Syncing        Initial state transfer in progress
Disconnected   Peer is offline; it will resync when reconnected

Troubleshooting & Limitations

  • Leader election assumes synchronized clocks. Leader selection depends partly on timing, so nodes need reasonably synchronized clocks. Large clock skew can cause unexpected leader changes.
  • Initial full sync can take seconds. A joining node receives a full copy of the current state. For deployments with thousands of entities this can take several seconds; incremental updates afterward are near-instant.
  • Network partitions create an eventual-consistency window. If two nodes are disconnected and both write conflicting values, the write with the latest timestamp wins on reconnect. This conflict-resolution rule is not configurable.
  • KERNEL_ADDRESS failover is client-side only. A service configured with node-a:9100,node-b:9100 tries node-a then node-b, but if both are down the connection fails rather than retrying indefinitely.
  • Self-managed nodes need peer addresses at startup. When you deploy through the Control Plane, discovery is automatic. For self-managed deployments, each node reads peer addresses from its configuration at startup, so adding a node manually means updating the configuration on the existing nodes and restarting them.
  • Persistence is not synchronized. Each node maintains its own write-ahead log and snapshots. A restarted node re-syncs Store state from peers but replays only its own local persistence.
  • Service logs are node-local. Troubleshooting a multi-node issue means checking logs on the specific node where the problem occurred, via the Logs app node selector.
  • No built-in cluster migration. Adding or removing nodes is a manual Control Plane operation. A new node catches up via a full sync rather than a zero-downtime handover.

Eventual consistency under partition

During a network partition, nodes may temporarily diverge. When connectivity is restored, the latest write by timestamp wins; there is no configurable merge strategy. For most workloads the reconnect-and-resolve cycle is fast enough to be imperceptible.
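The latest-timestamp-wins rule can be sketched as a merge over two diverged copies of the Store. This is an illustrative model only: `FieldWrite`, `resolve`, and `merge_stores` are hypothetical names, and breaking exact timestamp ties by writer name is an assumption added so that both sides reach the same result, not documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldWrite:
    value: object
    timestamp: float   # time of the write, as recorded with the field value
    writer: str        # which node or service wrote it


def resolve(local: FieldWrite, remote: FieldWrite) -> FieldWrite:
    """Keep the write with the latest timestamp.

    Tie-break by writer name (assumption) so both nodes pick the same winner.
    """
    return max(local, remote, key=lambda w: (w.timestamp, w.writer))


def merge_stores(local: dict, remote: dict) -> dict:
    """Merge two diverged key -> FieldWrite maps after a reconnect."""
    merged = dict(local)
    for key, write in remote.items():
        merged[key] = resolve(merged[key], write) if key in merged else write
    return merged


# Both nodes wrote the same field while partitioned.
on_a = {"pump-1/speed": FieldWrite(1200, timestamp=100.0, writer="node-a")}
on_b = {"pump-1/speed": FieldWrite(1350, timestamp=101.5, writer="node-b")}

merged_on_a = merge_stores(on_a, on_b)
merged_on_b = merge_stores(on_b, on_a)
```

Both merges keep the value 1350 written at timestamp 101.5, so the two nodes converge regardless of which side runs the merge. The earlier write is discarded, which is the cost of having no configurable merge strategy.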