Fault Tolerance

Automatic failover and high availability for stateful services.

Overview

ControlBird's Fault Tolerance feature keeps services running through crashes, restarts, and node outages without manual intervention. You deploy the same service across multiple nodes, and ControlBird automatically elects a single active instance (the leader) to do the work while the others stand by, ready to take over. When a leader stops responding, a standby is promoted automatically within a few seconds.

Reach for fault tolerance whenever a stateful service must have exactly one active instance: protocol controllers, device managers, schedulers, and compute services are typical examples. Stateless services that can safely run many copies at once do not need it and should leave it disabled. The same capability also governs network endpoints, letting you choose whether only the leader accepts connections or every healthy node stays warm.

Prefer a guided tutorial?

New to this? Follow the Service Fault Tolerance walkthrough for a step-by-step tour, then come back here for the full reference.

Key Concepts

  • Candidate: any instance of a service or endpoint that competes for leadership. You run several candidates across different nodes.
  • Leader: the single candidate currently elected to do the work. Only the leader actively processes requests; the rest stand by.
  • Health signal: each candidate continuously reports that it is alive. If that signal stops, ControlBird treats the candidate as failed.
  • Availability: a candidate marks itself ready before it becomes eligible for election, so a service that is still starting up is never promoted prematurely.
  • Automatic failover: if the leader fails, ControlBird promotes a healthy standby and notifies every candidate of the change so the new leader can take over.

How it fits together

  1. You deploy several candidates of the same service across multiple nodes.
  2. Each candidate reports that it is healthy and marks itself available once it is ready.
  3. ControlBird elects one healthy, available candidate as the leader.
  4. Only the leader performs work; standby candidates stay warm and idle.
  5. If the leader fails or shuts down, a healthy standby is promoted and all candidates are notified of the new leader.
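
The election rule in steps 2 to 5 can be pictured with a small in-memory model. This is only an illustrative sketch, not ControlBird's implementation: the Candidate record, its field names, and the elect function are hypothetical. The rules that a healthy leader keeps its role and that ties go to the lowest node name are assumptions made to keep the example deterministic.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    # Hypothetical record; the field names are illustrative, not ControlBird's schema.
    node: str
    healthy: bool = True      # is the health signal still arriving?
    available: bool = False   # has the candidate marked itself ready?


def elect(candidates: list[Candidate], current: Optional[str] = None) -> Optional[str]:
    """Return the node name of the leader, or None if no candidate is eligible."""
    eligible = [c.node for c in candidates if c.healthy and c.available]
    if current in eligible:
        return current  # assumption: a healthy, available leader keeps its role
    # Assumption: break ties by lowest node name so the result is deterministic.
    return min(eligible) if eligible else None
```

A candidate that is healthy but has not yet marked itself available never appears in the eligible list, which mirrors the rule that a service still starting up is not promoted.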

Candidates: Services and Endpoints

Two kinds of things can participate in leader election: services and endpoints.

  • Service: a stateful service instance, identified by its binary name (for example cb-device-manager). Run several copies across nodes and exactly one acts as leader at a time.
  • Endpoint: a network connection, such as an MQTT or Modbus link. Endpoints participate in election the same way services do, and additionally choose how connections are distributed across nodes (see below).

Endpoint Connection Modes

Endpoints support two connection modes, letting you match the behavior to the device or protocol:

Mode | Behavior | Use for
LeaderOnly | Only the leader endpoint accepts connections. | Exactly one active connection to an external device, e.g. a Modbus TCP controller that writes commands to a PLC.
AllWarm | All available endpoints accept connections. | Read-only or idempotent operations where multiple readers are safe, e.g. an MQTT subscription on every node.

LeaderOnly is enforced for you

With LeaderOnly, ControlBird allows only the elected leader to accept connections. It automatically prevents standby nodes from connecting, so you do not have to guard against duplicate connections in your own logic.
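
The difference between the two modes comes down to a single decision per endpoint. The sketch below states that decision in code for illustration only: ConnectionMode and may_accept_connections are hypothetical names, and ControlBird makes this decision for you rather than exposing such a function.

```python
from enum import Enum


class ConnectionMode(Enum):
    # The values mirror the mode names used in this document.
    LEADER_ONLY = "LeaderOnly"
    ALL_WARM = "AllWarm"


def may_accept_connections(mode: ConnectionMode, is_leader: bool, is_available: bool) -> bool:
    """Decide whether an endpoint on this node should accept connections."""
    if not is_available:
        return False          # an endpoint that is not ready never connects
    if mode is ConnectionMode.LEADER_ONLY:
        return is_leader      # exactly one active connection: the leader's
    return True               # AllWarm: every available endpoint connects
```

With LEADER_ONLY, a standby that is healthy and available still gets False, which is why a device such as a PLC sees only one writer at a time.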

Service Lifecycle

A stateful service follows a fixed lifecycle: enable fault tolerance when the service starts, mark itself available, do work only while it is the leader, and leave the election cleanly on shutdown.

  1. Start the service with fault tolerance enabled.
  2. Subscribe to leadership changes so the service learns when it becomes (or stops being) leader.
  3. Mark the service available to enter the election.
  4. On each loop iteration, report health and process any leadership changes.
  5. On shutdown, leave the election cleanly so a standby can take over right away.
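
As a rough illustration of these five steps, the sketch below runs a service loop against a toy in-memory election. Everything here is hypothetical: InMemoryElection, its method names, and run_service are stand-ins invented for the example and are not the ControlBird API. The toy delivers leadership changes synchronously, so "processing" a change is just reading a flag.

```python
from __future__ import annotations

from typing import Callable, Optional


class InMemoryElection:
    """Toy stand-in for the election backend; NOT the ControlBird API."""

    def __init__(self) -> None:
        self.available: list[str] = []
        self.leader: Optional[str] = None
        self.listeners: list[Callable[[Optional[str]], None]] = []

    def subscribe(self, on_change: Callable[[Optional[str]], None]) -> None:
        self.listeners.append(on_change)

    def set_available(self, name: str) -> None:
        self.available.append(name)
        self._elect()

    def report_health(self, name: str) -> None:
        pass  # a real backend would refresh a liveness timestamp here

    def leave(self, name: str) -> None:
        self.available.remove(name)
        self._elect()

    def _elect(self) -> None:
        # Keep the current leader if it is still available, else take the first in line.
        if self.leader in self.available:
            new = self.leader
        else:
            new = self.available[0] if self.available else None
        if new != self.leader:
            self.leader = new
            for notify in self.listeners:
                notify(new)


def run_service(election: InMemoryElection, name: str, iterations: int,
                do_work: Callable[[], None]) -> None:
    """Step 1 is represented by being handed an election object at startup."""
    is_leader = False

    def on_change(leader: Optional[str]) -> None:
        nonlocal is_leader
        is_leader = leader == name

    election.subscribe(on_change)        # step 2: learn about leadership changes
    election.set_available(name)         # step 3: enter the election
    try:
        for _ in range(iterations):      # step 4: report health, work only as leader
            election.report_health(name)
            if is_leader:
                do_work()
    finally:
        election.leave(name)             # step 5: leave cleanly, even on error
```

Placing the leave call in a finally block is one way to make sure the service exits the election on both normal shutdown and unhandled exceptions; it cannot help with a hard crash, which is covered by the health timeout instead.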

Only the leader works

Standby candidates keep running and keep reporting health, but skip work in their main loop. This conserves resources while keeping them ready for fast failover. Only the leader should modify persistent shared state.

Monitoring with the Database Browser

Use the Database Browser to inspect any group of candidates. Each group shows all of its members, which members are currently healthy, and which one is the elected leader, so you can confirm at a glance which node is in charge.

Common Patterns

  • Always shut down gracefully. When stopping a service, let it leave the election cleanly so a standby is promoted immediately rather than after the failure timeout.
  • Keep the main loop responsive. Health reporting and leadership changes are both handled on each loop iteration, so a loop that blocks for a long time delays both and slows failover.
  • Leader-only writes. Only the leader should modify persistent shared state.
  • Disable for stateless services. Services that can safely run many copies at once do not need fault tolerance, so leave it off to avoid the small overhead.

Troubleshooting & Limitations

  • Service never becomes leader. A candidate must mark itself available before it is eligible. A service that never signals readiness is never promoted, even when it is otherwise healthy.
  • Crash without graceful shutdown. If a service crashes instead of leaving the election cleanly, it is still treated as a candidate until its health signal times out, which adds a few seconds of failover delay.
  • All candidates dead. If every candidate becomes unavailable, there is no leader and no service processes requests. The system stays inert until at least one candidate recovers and starts reporting health again.
  • Detection is a trade-off. Faster failure detection reacts more quickly but is more sensitive to brief network hiccups; slower detection is more tolerant but takes longer to fail over. The defaults are tuned for typical deployments.
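
The detection trade-off can be made concrete with a timeout check. This is a minimal sketch under stated assumptions: is_failed and false_failover are hypothetical helpers, and a single timeout value stands in for whatever detection settings ControlBird actually uses. The numbers in the usage note are invented for illustration and are not ControlBird defaults.

```python
def is_failed(last_heartbeat: float, now: float, timeout: float) -> bool:
    """A candidate counts as failed once its health signal is older than the timeout."""
    return now - last_heartbeat > timeout


def false_failover(pause: float, timeout: float) -> bool:
    """Would a healthy candidate that goes silent for `pause` seconds be declared failed?"""
    return is_failed(last_heartbeat=0.0, now=pause, timeout=timeout)
```

For example, with a hypothetical 2-second network pause, false_failover(2.0, 1.0) is True (a short timeout fails over needlessly) while false_failover(2.0, 5.0) is False (a longer timeout rides it out). The cost of the longer timeout is that a leader that really crashed is noticed only after roughly the timeout has elapsed, whereas a graceful shutdown avoids that wait.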