Worker Maintenance
Introduction
In a distributed environment, worker nodes must regularly perform maintenance tasks to ensure the system remains healthy and consistent. In lilota, this responsibility is handled automatically through the worker heartbeat mechanism and a lightweight leader election process.
How Maintenance Works
Each worker periodically sends a heartbeat to indicate that it is alive. As part of each heartbeat, workers also participate in a leader election, so that at any given time a single worker acts as the leader.
Only the elected leader performs maintenance tasks. This avoids duplicate work and ensures consistent system state across all workers.
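The interplay between heartbeat and leader election can be illustrated with a minimal, in-memory sketch. It assumes a lease-based election, which is one common approach and not necessarily how lilota implements it. The names `Lease`, `try_acquire_leadership`, and `heartbeat` are hypothetical, and a real deployment would perform the lease update atomically in shared storage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


@dataclass
class Lease:
    """Hypothetical leadership record shared by all workers."""
    holder: Optional[str] = None
    expires_at: Optional[datetime] = None


def try_acquire_leadership(lease: Lease, node_id: str,
                           now: datetime, ttl: timedelta) -> bool:
    """Renew the lease if we hold it, or take it over once it has expired."""
    expired = lease.expires_at is None or lease.expires_at <= now
    if lease.holder == node_id or expired:
        lease.holder = node_id
        lease.expires_at = now + ttl
        return True
    return False


def heartbeat(lease: Lease, node_id: str, now: datetime, ttl: timedelta,
              run_maintenance: Callable[[], None]) -> bool:
    """One heartbeat tick: only the current leader runs maintenance."""
    is_leader = try_acquire_leadership(lease, node_id, now, ttl)
    if is_leader:
        run_maintenance()
    return is_leader
```

In this sketch a worker that stops sending heartbeats simply lets its lease run out, after which the next worker to send a heartbeat takes over.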
Maintenance Tasks
When a worker becomes the leader, it executes the following maintenance tasks:
1. Cleanup of Dead Nodes
Workers that have not sent a heartbeat within a configured timeout are considered dead.
- A cutoff timestamp is calculated based on the configured node timeout
- All nodes with a `last_seen_at` older than this cutoff are marked as dead
- The current worker (leader) is excluded from this process
This ensures that inactive or crashed workers do not remain in an inconsistent state.
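The cleanup steps listed above can be sketched as follows. This is an illustration under stated assumptions, not lilota's actual code: the `Node` class, the `status` values, and the function `cleanup_dead_nodes` are hypothetical, and only the `last_seen_at` field name comes from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class Node:
    """Hypothetical node record; only last_seen_at is named in the docs."""
    id: str
    last_seen_at: datetime
    status: str = "running"  # assumed status values: "running" / "dead"


def cleanup_dead_nodes(nodes: List[Node], leader_id: str,
                       now: datetime, node_timeout: timedelta) -> List[str]:
    """Mark nodes whose last heartbeat is older than the cutoff as dead."""
    cutoff = now - node_timeout
    marked = []
    for node in nodes:
        if node.id == leader_id:
            continue  # the leader never marks itself as dead
        if node.status != "dead" and node.last_seen_at < cutoff:
            node.status = "dead"
            marked.append(node.id)
    return marked
```

With a node timeout of 60 seconds, a node last seen two minutes ago is marked dead, while one seen ten seconds ago is left alone.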
2. Expiration of Overdue Tasks
Tasks that exceed their configured timeout are automatically marked as expired.
- The task store checks for running tasks whose expiration time has passed
- These tasks are updated to the expired state
This prevents tasks from running indefinitely and ensures system stability.
More information can be found here.
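As a rough sketch of the expiration check described in this section, the logic might look like the following. The `Task` class, its `expires_at` field, the status strings, and `expire_overdue_tasks` are assumptions made for illustration. The real task store would most likely do this with a single database update instead of a Python loop.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical task record used only for this illustration."""
    id: str
    status: str                      # assumed: "running", "expired", ...
    expires_at: Optional[datetime]   # None means no timeout configured


def expire_overdue_tasks(tasks: List[Task], now: datetime) -> List[str]:
    """Move running tasks whose expiration time has passed to 'expired'."""
    expired = []
    for task in tasks:
        overdue = task.expires_at is not None and task.expires_at <= now
        if task.status == "running" and overdue:
            task.status = "expired"
            expired.append(task.id)
    return expired
```

Only tasks in the running state are touched here, so completed or failed tasks keep their final status even if their expiration time lies in the past.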
Fault Tolerance
Maintenance execution is designed to be safe and resilient:
- If maintenance fails, the error is logged but does not stop the worker
- The heartbeat process continues running regardless of failures
- Leadership is automatically renewed or reassigned if a leader becomes unavailable
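The first two guarantees above amount to isolating maintenance errors from the heartbeat loop. A minimal sketch of that pattern, using Python's standard `logging` module, is shown below. The function `heartbeat_tick`, its parameters, and the logger name are hypothetical and do not reflect lilota's internal API.

```python
import logging
from typing import Callable, Iterable, List

# Hypothetical logger name chosen for this example.
logger = logging.getLogger("worker.maintenance")


def heartbeat_tick(send_heartbeat: Callable[[], None],
                   is_leader: Callable[[], bool],
                   maintenance_steps: Iterable[Callable[[], None]]) -> List[str]:
    """Send a heartbeat, then run maintenance without letting errors escape."""
    send_heartbeat()  # the heartbeat is sent before any maintenance runs
    if not is_leader():
        return []
    failed = []
    for step in maintenance_steps:
        name = getattr(step, "__name__", repr(step))
        try:
            step()
        except Exception:
            # Log the full traceback and continue with the next step.
            logger.exception("maintenance step %s failed", name)
            failed.append(name)
    return failed
```

Because each step is wrapped individually, a failure in one maintenance task does not prevent the others from running, and the next heartbeat tick proceeds as usual.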