🚀 Executive Summary
TL;DR: Pacemaker’s default auto-failback behavior can disrupt an active DRBD primary by attempting premature promotion on a recovering node, leading to service outages and potential data risks. This issue can be prevented by configuring high resource stickiness, implementing manual failback, or carefully setting up graceful and delayed promotion with robust STONITH.
🎯 Key Takeaways
- Setting a high `resource-stickiness` value (e.g., 10000 or INFINITY) on the DRBD promotable clone resource reliably prevents automatic failback, ensuring resources remain on the current primary until manually moved.
- Manual failback strategies, such as placing a recovering node into `standby` or using `location` constraints to assign negative scores for the `Promoted` role, provide complete administrative control over when DRBD resources are promoted.
- Achieving graceful and delayed promotion requires robust STONITH, increasing `cluster-delay` to account for state propagation, and configuring generous demote and stop timeouts so the old primary safely demotes before a new one promotes.
Pacemaker/DRBD clusters, while providing high availability, can sometimes exhibit problematic auto-failback behavior where a recovering node attempts to re-assert its primary role, leading to resource conflicts and service disruption. Learn how to prevent these “kill” scenarios and ensure graceful failovers.
Introduction: The Peril of Premature Failback
High-availability clusters built with Pacemaker and DRBD are critical components in modern infrastructure, ensuring services remain operational even during node failures. DRBD provides block-device replication, while Pacemaker orchestrates resources, including DRBD, across nodes. A common challenge arises when a failed node recovers: Pacemaker’s default behavior often prioritizes resource locality, attempting to “failback” resources to their preferred node.
In a DRBD context, this can be disastrous. If the recovering node tries to promote its DRBD resource to Primary while another node is already actively serving as Primary/UpToDate, it creates a split-brain scenario or, more commonly, a forceful demotion/kill of the active primary, leading to service outages, data corruption risks, and general cluster instability. This post details why this happens and provides robust solutions to prevent it.
Symptoms: What Does Uncontrolled Auto-Failback Look Like?
When Pacemaker attempts a premature or uncontrolled failback, several symptoms can indicate the issue:
- Service Outages: Applications running on the DRBD resource unexpectedly stop or become unresponsive on the currently active primary node.
- DRBD Status Changes: You might observe the active DRBD Primary resource suddenly transitioning to Secondary, Unknown, or a connection state indicating a conflict (e.g., `WFConnection`, `StandAlone`).
- Pacemaker Log Entries: The Pacemaker logs (e.g., `/var/log/pacemaker/pacemaker.log` or the system journal) will show attempts to promote the DRBD resource on the recovering node, often followed by demotion attempts on the currently active node or fencing actions. Look for messages related to `drbd_promote`, `drbd_demote`, or conflicts.
# Example Pacemaker log snippet indicating a problem
Sep 20 10:35:01 node-a pacemakerd[12345]: info: Status: Requesting promote of drbd_res on node-a
Sep 20 10:35:01 node-a pacemakerd[12345]: crit: Result: promote_drbd_res_on_node-a: CIB_R_ERR_OP_FAILED
Sep 20 10:35:01 node-b pacemakerd[12345]: info: Status: Requesting demote of drbd_res on node-b
Sep 20 10:35:01 node-b pacemakerd[12345]: info: drbd_demote: stdout [drbd_demote: Attempting to demote resource 'r0']
Sep 20 10:35:02 node-b pacemakerd[12345]: warn: drbd_demote: stderr [drbd_demote: Cannot demote 'r0', it is still in use.]
Sep 20 10:35:02 node-b pacemakerd[12345]: crit: Result: demote_drbd_res_on_node-b: CIB_R_ERR_OP_FAILED
- `drbd-overview` Output: Running `drbd-overview` will show the DRBD resource status. During an issue, you might see unexpected roles or connections.
# Example drbd-overview output during conflict
0:r0 Connected Primary/Primary UpToDate/UpToDate
[WARNING: Dual-primary like this should not occur in a single-primary setup and indicates a serious conflict, which Pacemaker should prevent]
[More likely, you'll see a quick flip or errors.]
Ideally, Pacemaker, especially with fencing (STONITH) enabled, should prevent true split-brain where both nodes are Primary. However, the aggressive failback can lead to a race condition where the recovering node attempts promotion before the active node can be safely demoted, or before the cluster has a clear picture of the state, causing the active primary to be forcefully taken down or experience severe I/O issues.
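If a true DRBD split-brain does occur (both nodes `StandAlone` with diverged data), DRBD provides a documented manual recovery path. A sketch, assuming the resource is named `r0` and you have decided node-b’s changes are the ones to discard — choosing the “victim” node is an administrative judgment, not something these commands decide for you:

```
# On the node whose changes you are willing to discard (the "split-brain victim"):
drbdadm disconnect r0
drbdadm secondary r0
drbdadm connect --discard-my-data r0

# On the surviving primary (only needed if it is also StandAlone):
drbdadm connect r0
```

The victim then resynchronizes from the survivor; any writes it accepted while split are lost, which is why preventing the situation in the first place is the point of this post.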
Root Cause Analysis: Why Auto-Failback Kills DRBD Primary
The core of the problem lies in Pacemaker’s default resource management behavior and its interaction with DRBD’s stateful nature:
- Resource Locality Preference: Pacemaker often tries to keep resources on their preferred nodes. When a node recovers, Pacemaker sees it as a suitable candidate for hosting resources again.
- DRBD Primary Requirement: For most applications, a DRBD resource must be in the “Primary” role to be mounted and serve data. Only one node can be Primary at a time in a two-node synchronous DRBD setup (Protocol C).
- Premature Promotion Attempt: Upon node recovery, Pacemaker evaluates resource placement. If the recovering node is its ‘preferred’ location (e.g., due to configuration default or historical reasons), Pacemaker might attempt to promote the DRBD resource to Primary *immediately*.
- Conflict with Active Primary: If another node is currently acting as the DRBD Primary, this immediate promotion attempt by the recovering node will either:
- Fail (if the DRBD resource agent is robust enough to detect another primary).
- Lead to a race condition where both nodes briefly believe they should be primary.
- Trigger DRBD’s internal mechanisms to resolve the conflict (e.g., automatic demotion of one, or fencing if configured), which can be disruptive.
- Most dangerously, in a poorly configured cluster, it can cause I/O disruption on the existing primary, leading to application failure. The “kill” happens when the active node is forced out of its primary role due to this conflict, often leading to ungraceful shutdown of services.
- Lack of Graceful Demotion: Pacemaker might not have enough time or a clear mandate to gracefully demote the currently active primary *before* the recovering node tries to assert its primary role. This is exacerbated if fencing (STONITH) is not robustly configured or is too slow.
Solution 1: Preventing Automatic Failback with Negative Resource Stickiness and Location Constraints
This is arguably the most common and robust solution. It tells Pacemaker to avoid moving resources back to a node once they’ve failed away from it, effectively disabling automatic failback for DRBD resources.
Mechanism
By setting a high `resource-stickiness` value on the DRBD clone resource, you make it “sticky” to its current location: the stickiness score is added to whichever node currently hosts the resource, so a value larger than any location preference keeps it put. You can further reinforce this with a location constraint that prevents the resource from moving back to the recovering node.
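The arithmetic Pacemaker performs is easy to sketch. This is an illustration of the scoring model, not pcs syntax; the numbers (a location preference of 100 for node-a, stickiness of 10000) are assumptions chosen for the example:

```shell
#!/bin/sh
# Illustrative sketch of Pacemaker placement scoring:
# a node's effective score = its location preference, plus the
# resource-stickiness if the resource is currently active there.
stickiness=10000
pref_node_a=100   # location constraint: prefers node-a
pref_node_b=0     # no explicit preference for node-b

# The resource is currently running on node-b after failing over from node-a:
score_node_a=$pref_node_a
score_node_b=$((pref_node_b + stickiness))

if [ "$score_node_b" -gt "$score_node_a" ]; then
  echo "resource stays on node-b: $score_node_b > $score_node_a"
else
  echo "resource fails back to node-a"
fi
```

Because 10000 dwarfs the 100-point preference for node-a, the resource stays where it is; failback only happens if an administrator removes the advantage (e.g., with an explicit move).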
Configuration Example
First, define your DRBD master/slave resource. Let’s assume your resource is named drbd_r0 and your filesystem/application resource is fs_data.
# Define your DRBD Master/Slave resource (example)
pcs resource create drbd_r0 ocf:linbit:drbd \
    drbd_resource=r0 \
    op monitor interval="31s" role="Unpromoted" \
    op monitor interval="29s" role="Promoted" \
    op promote timeout="90s" \
    op demote timeout="90s" \
    promotable promoted-max=1 promoted-node-max=1 notify=true
# Add a high resource-stickiness to prevent automatic failback
# This tells Pacemaker: "Don't move this resource back unless explicitly told to."
pcs resource meta drbd_r0-clone resource-stickiness=10000
# Create a filesystem resource that depends on drbd_r0 being primary
pcs resource create fs_data ocf:heartbeat:Filesystem \
device="/dev/drbd/by-res/r0" directory="/mnt/data" fstype="ext4" \
op monitor interval="30s"
# Ensure fs_data starts only when drbd_r0 is promoted
pcs constraint colocation add fs_data with Promoted drbd_r0-clone INFINITY
# Ensure fs_data starts after drbd_r0 is promoted
pcs constraint order promote drbd_r0-clone then start fs_data
The key here is pcs resource meta drbd_r0-clone resource-stickiness=10000. This high stickiness score outweighs any location preference, so if the resource fails over to node-b, it stays on node-b even after node-a recovers, unless manually moved.
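When you do decide to fail back, the move is explicit. A sketch of the manual workflow, using the resource and node names from the example above (note that older pcs versions may require a `--master`/`--promoted` flag to move the promoted role specifically):

```
# First verify the recovered node's replica is healthy and in sync:
drbdadm status r0          # DRBD 9; on DRBD 8.x check /proc/drbd for UpToDate/UpToDate

# Move the resource back to node-a:
pcs resource move drbd_r0-clone node-a

# "move" works by injecting a temporary location constraint; clear it
# afterwards so it does not pin the resource to node-a forever:
pcs resource clear drbd_r0-clone
```

Clearing the move constraint matters with high stickiness: once cleared, stickiness again anchors the resource to wherever it now runs.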
Pros and Cons
| Pros | Cons |
| --- | --- |
| Highly predictable and reliable. | Requires manual intervention (`pcs resource move`) to fail the resource back to the original primary node once it has recovered. |
| Prevents split-brain scenarios caused by aggressive auto-failback. | Increased downtime if manual intervention is slow after a node recovers and you want to return to the preferred node. |
| Simplifies troubleshooting by eliminating one potential source of resource flapping. | Might leave resources on less-preferred nodes for extended periods. |
Solution 2: Implementing Manual Failback with Administrative Confirmation
This solution ensures that a returning node never automatically promotes its DRBD resource without explicit administrative approval. It effectively puts the recovering node in a “waiting room” until deemed safe to promote.
Mechanism
This approach prevents Pacemaker from automatically starting or promoting the DRBD resource on the recovering node, either by placing the entire node into standby when it returns, or by using a location rule that blocks the `Promoted` role on that node.
Configuration Example
Assuming the previous DRBD clone setup:
- Place the recovering node into standby mode: When a node comes back online, Pacemaker will detect it. You can immediately put it into standby before it has a chance to act on resources.

# On the administrative workstation or another node
pcs node standby <recovering_node_name>

This prevents Pacemaker from running any resources on `<recovering_node_name>`. Once you’ve verified the node’s health and are ready to consider a failback (still manual, via `pcs resource move`), bring it out of standby:

pcs node unstandby <recovering_node_name>
- Use a `location` constraint to prevent promotion on recovery: While Solution 1 uses `resource-stickiness`, you can also create a location rule that gives the recovering node a -INFINITY score for the `Promoted` role of your DRBD resource (on older Pacemaker versions the role is named `Master`).

# If node-a fails and drbd_r0 moves to node-b, prevent node-a from
# automatically re-promoting drbd_r0 when it recovers:
pcs constraint location drbd_r0-clone rule role=Promoted score=-INFINITY \#uname eq <recovering_node_name>

When the previously failed node comes back up, Pacemaker will start the DRBD clone in the `Secondary` role, but it won’t promote it to `Primary` because of the -INFINITY score for that role. An administrator later removes the constraint and explicitly moves the resource.
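Before lifting either restriction, it is worth confirming that DRBD has fully resynchronized on the recovering node. A sketch of the checks, assuming resource `r0`:

```
# The recovering node should report the Secondary role with UpToDate data:
drbdadm status r0      # DRBD 9
cat /proc/drbd         # DRBD 8.x: look for "Secondary/Primary UpToDate/UpToDate"

# Cluster-level view: the clone should be running (unpromoted) on the node:
pcs status --full

# Only then release the node back into service:
pcs node unstandby <recovering_node_name>
```

Unstandby while a resync is still in flight is safe for DRBD itself (it will refuse to promote an Inconsistent replica), but verifying first avoids surprises from other resources on the node.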
Pros and Cons
| Pros | Cons |
| --- | --- |
| Provides complete administrative control over failback. | Requires continuous monitoring and manual intervention after a node recovers. |
| Minimizes risk of unintentional primary conflicts. | Potentially longer downtime for failback operations as human interaction is needed. |
| Guarantees verification of node health before resources are promoted. | Less “automatic” in a High Availability context. |
Solution 3: Configuring Pacemaker for Graceful & Delayed Promotion
Instead of completely preventing failback, this approach focuses on making the failback process inherently safer by giving Pacemaker ample time and clear instructions to demote the old primary *before* promoting a new one, thereby preventing the “kill” scenario.
Mechanism
This solution leverages several Pacemaker global options and resource meta-attributes to ensure a sequential and controlled transition of the DRBD primary role. Key elements include robust fencing (STONITH), increasing `cluster-delay` for state propagation, and carefully configuring timeouts for resource actions.
Configuration Example
- Ensure Robust Fencing (STONITH): This is paramount. If Pacemaker cannot reliably fence a failed node, no failback strategy is truly safe.

pcs property set stonith-enabled=true
pcs property set no-quorum-policy=stop  # Or 'freeze' depending on requirements
# Ensure you have a working STONITH device configured (fence_ipmilan shown here)
pcs stonith create fence_ipmi_node1 fence_ipmilan ip=192.168.1.10 pcmk_host_list=node-a \
    username=admin password=password op monitor interval=60s
pcs stonith create fence_ipmi_node2 fence_ipmilan ip=192.168.1.11 pcmk_host_list=node-b \
    username=admin password=password op monitor interval=60s
- Increase `cluster-delay`: This gives Pacemaker a larger allowance for network round-trips when coordinating actions across nodes, reducing the chance of premature decisions on slow links.

pcs property set cluster-delay=60s

`cluster-delay` is the estimated maximum round-trip delay for actions across the cluster; Pacemaker adds it to operation timeouts when waiting on results from remote nodes. Adjust as needed, but be aware of the impact on failover times.
- Configure generous demote and stop timeouts for DRBD: Pacemaker demotes a promoted clone by running the resource agent’s demote action; a too-short timeout aborts the demotion and escalates to a stop, and a failed stop escalates to fencing. Give the old primary ample time to step down cleanly:

pcs resource update drbd_r0 \
    op demote timeout="120s" \
    op stop timeout="120s"

Note: by default, a stop failure results in the node being fenced, which is exactly the safety net you want here, provided STONITH is solid. Resist the temptation to set `on-fail=block` on these operations; that leaves the resource stuck in place rather than recovered.
- Combine a positive location preference with `resource-stickiness`: If you want *some* level of auto-failback to a preferred node, but gracefully, give the preferred node a location score *higher* than the resource’s stickiness value and rely on the timeouts above. The preferred node will eventually get its resources back, but only after Pacemaker has had time to safely demote the other.
This relies heavily on STONITH and the timeouts to ensure the demotion of the *current* primary occurs *before* the promotion on the preferred, recovering node.
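These knobs add up. A rough worst-case budget for a graceful role transition is the sum of `cluster-delay`, the demote timeout, and the fencing timeout. A back-of-the-envelope sketch using the values from this section (the 60s stonith timeout is an assumed figure; check your fence agent’s actual configuration):

```shell
#!/bin/sh
# Illustrative worst-case time before a new primary can be promoted:
cluster_delay=60      # pcs property cluster-delay
demote_timeout=120    # op demote timeout on the DRBD resource
stonith_timeout=60    # assumed fencing timeout for the fence device

worst_case=$((cluster_delay + demote_timeout + stonith_timeout))
echo "worst-case transition budget: ${worst_case}s"
```

If that number exceeds what your applications tolerate, trim the timeouts cautiously; shaving them below what demotion actually needs reintroduces the race this solution is meant to avoid.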
Pros and Cons
| Pros | Cons |
| --- | --- |
| Enables a more “automatic” failback while trying to mitigate the “kill” scenario. | Requires extremely robust and well-tested STONITH; without it, this solution is dangerous. |
| Optimizes for shorter downtime than manual failback, if successful. | Can lead to longer failover times due to increased `cluster-delay` and operation timeouts. |
| Leverages Pacemaker’s native recovery mechanisms more fully. | Complex to configure and troubleshoot; misconfigurations can still lead to issues. |
Conclusion: Choosing the Right Strategy
Preventing Pacemaker’s auto-failback from “killing” your active DRBD primary is crucial for cluster stability. The best solution depends on your operational requirements and risk tolerance:
- If predictability and absolute prevention of resource flapping are paramount, and you’re comfortable with manual intervention, Solution 1 (High Resource Stickiness) is your safest bet.
- If you need granular control and human oversight before resources return to a recovering node, Solution 2 (Manual Failback) provides that assurance.
- If you desire a more automated failback but need to ensure it’s handled gracefully with minimal disruption, Solution 3 (Graceful & Delayed Promotion) can work, but it demands meticulous STONITH configuration and extensive testing to be truly reliable.
Regardless of the chosen solution, always ensure your Pacemaker cluster has a properly configured and tested STONITH (fencing) mechanism. Fencing is the last line of defense against data corruption and split-brain scenarios, making any failover strategy significantly safer.
🤖 Frequently Asked Questions
❓ Why does Pacemaker’s auto-failback disrupt an active DRBD primary?
Pacemaker’s default resource locality preference causes a recovering node to prematurely attempt to promote its DRBD resource to Primary. This conflicts with the currently active Primary, leading to forceful demotion, I/O disruption, or a race condition if fencing is not robust.
❓ How do the different Pacemaker/DRBD failback prevention strategies compare?
High `resource-stickiness` offers high predictability and prevents flapping but requires manual intervention. Manual failback via `pcs node standby` provides full administrative control but increases downtime. Graceful/delayed promotion aims for automation but demands meticulous STONITH and careful timeout configurations for safety and reliability.
❓ What is the most critical component for ensuring safe DRBD failover in a Pacemaker cluster?
Robust and well-tested STONITH (fencing) is paramount. It acts as the last line of defense against split-brain scenarios and data corruption by ensuring a failed node is truly isolated and powered off before resources are promoted on another node.