🚀 Executive Summary

TL;DR: Attempting to use VRRP/CARP for active-active load balancing results in network chaos due to ARP contention, not true load balancing. The recommended solutions involve either segregating services with multiple VIPs, deploying a dedicated load balancer tier (e.g., HAProxy/NGINX) with VRRP for its own HA, or utilizing advanced ECMP routing for large-scale, Layer 3 load distribution.

🎯 Key Takeaways

  • VRRP and CARP are High Availability (HA) failover protocols designed for a single, authoritative owner of a Virtual IP (VIP) at any given moment, not for active-active load balancing.
  • Trying to force an ‘active-active’ setup with VRRP/CARP by setting multiple nodes to the same high priority causes ARP contention: neighboring hosts and routers keep rewriting their ARP caches (and switches their MAC tables), which produces packet-level chaos for stateful connections.
  • The ‘right’ way to achieve load balancing is to introduce a dedicated Layer 4 or Layer 7 load balancer tier (e.g., HAProxy, NGINX), which uses VRRP for its own high availability, and intelligently distributes traffic to backend servers with robust, application-specific health checks.

Multi-Primary VRRP/CARP Network Load-Balancing Setup

Unlock true high availability by moving beyond simple VRRP/CARP for load balancing. Discover why active-active failover often fails and learn robust, production-ready patterns using dedicated load balancers or advanced network routing.

You Can’t Load Balance with VRRP. Stop Trying.

I remember it like it was yesterday. 3 AM, the on-call pager screams, and the entire payment processing dashboard is a sea of red. The weird part? All the database nodes were up. `prod-db-01` was healthy, `prod-db-02` was healthy. We could connect to both directly. But the application, connecting through the virtual IP (VIP), was flapping like crazy. It took us an hour of frantic debugging to find the culprit: two over-eager engineers had tried to create an “active-active” database cluster by setting both Keepalived nodes to the same high priority. The result? A network civil war over a single IP address, known as ARP flux, that was killing our application. We’ve all been there, trying to squeeze more performance and availability out of a tool that just wasn’t built for the job.

The “Why”: VRRP is a Failover Protocol, Not a Load Balancer

Let’s get this straight first. Protocols like VRRP (Virtual Router Redundancy Protocol) and CARP (Common Address Redundancy Protocol) are brilliant for one thing: High Availability (HA). They ensure that if one machine (`MASTER`) dies, another (`BACKUP`) can seamlessly take over its Virtual IP address. They are designed for a single, authoritative owner of that IP at any given moment.

The core problem when you try to force an “active-active” or “multi-primary” setup is ARP contention. In simple terms:

  • Server A shouts, “Hey network, the IP 192.168.1.100 is at my MAC address AA:BB:CC:01!”
  • Server B, with the same VRRP priority, simultaneously shouts, “No, 192.168.1.100 is at my MAC address AA:BB:CC:02!”

Every device that talks to the VIP gets confused. Routers and hosts keep rewriting their ARP caches, and switches keep relearning which port the address lives behind, so some packets go to Server A and some to Server B. For stateful connections like a database, this is a death sentence. It’s not load balancing; it’s packet-level chaos. So, how do we solve this the right way?
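To make the failure mode concrete, here is a minimal sketch that replays a timeline of ARP announcements against a single ARP cache. It is a toy model, not a packet capture: the `simulate` function and the event tuples are invented for illustration, and the MAC strings are the abbreviated ones from the example above.

```python
from collections import Counter

VIP = "192.168.1.100"
MAC_A = "AA:BB:CC:01"  # Server A (abbreviated, as in the text)
MAC_B = "AA:BB:CC:02"  # Server B (abbreviated, as in the text)


def simulate(events):
    """Replay ARP announcements and packets for ONE client connection.

    Each event is ("arp", mac) when a node claims the VIP, or ("packet",)
    when the client sends the next packet of its connection.
    Returns the MAC each packet was delivered to, in order.
    """
    arp_cache = {}  # the client's (or gateway's) IP -> MAC mapping
    delivered = []
    for event in events:
        if event[0] == "arp":
            arp_cache[VIP] = event[1]         # the latest announcement wins
        else:
            delivered.append(arp_cache[VIP])  # packet follows the current mapping
    return delivered


# Healthy failover pair: only one node ever announces the VIP.
single_owner = [("arp", MAC_A)] + [("packet",)] * 6

# Two nodes that both believe they own the VIP keep re-announcing it.
contended = [
    ("arp", MAC_A), ("packet",), ("packet",),
    ("arp", MAC_B), ("packet",), ("packet",),
    ("arp", MAC_A), ("packet",),
    ("arp", MAC_B), ("packet",),
]

print("single owner:", Counter(simulate(single_owner)))
print("contended:   ", Counter(simulate(contended)))
```

In the contended timeline, packets belonging to the same connection land on both servers. The server that never saw the connection's handshake has no state for it and will typically reset it, which is what shows up as flapping from the application's point of view.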

Solution 1: The Quick Fix – Service Segregation with Multiple VIPs

This is the simplest pattern and often what people land on first. It’s not true load balancing, but it’s a valid way to distribute load if you can segment your application traffic. Instead of one VIP for everything, you create multiple, distinct VIPs, with each one having a clear primary owner.

Imagine you have two database nodes. You want to direct all write traffic to `prod-db-01` and all read traffic to `prod-db-02`.

On `prod-db-01`, you’d configure Keepalived to be the master for the “write” VIP:


# /etc/keepalived/keepalived.conf on prod-db-01
vrrp_instance VI_WRITE {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150 # Higher priority
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass mysecret
    }
    virtual_ipaddress {
        192.168.1.100/24 # WRITE VIP
    }
}

vrrp_instance VI_READ {
    state BACKUP
    interface eth0
    virtual_router_id 52
    priority 100 # Lower priority
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass mysecret
    }
    virtual_ipaddress {
        192.168.1.101/24 # READ VIP
    }
}

And on `prod-db-02`, you do the opposite—make it the master for the “read” VIP:


# /etc/keepalived/keepalived.conf on prod-db-02
vrrp_instance VI_WRITE {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100 # Lower priority
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass mysecret
    }
    virtual_ipaddress {
        192.168.1.100/24 # WRITE VIP
    }
}

vrrp_instance VI_READ {
    state MASTER
    interface eth0
    virtual_router_id 52
    priority 150 # Higher priority
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass mysecret
    }
    virtual_ipaddress {
        192.168.1.101/24 # READ VIP
    }
}

The result: Your application now connects to `.100` for writes and `.101` for reads. If `prod-db-01` fails, `prod-db-02` takes over the write VIP (and vice versa). It’s simple and effective, but it requires your application to be aware of the different endpoints.
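The failover behavior of this layout can be sketched in a few lines of Python. This is a simplified model, assuming the highest-priority live node always wins each VRRP instance; the `vip_owners` helper is hypothetical, and the priorities and addresses are copied from the two configs above.

```python
# Priorities copied from the two keepalived.conf files above.
PRIORITIES = {
    "VI_WRITE": {"prod-db-01": 150, "prod-db-02": 100},
    "VI_READ": {"prod-db-01": 100, "prod-db-02": 150},
}
VIPS = {"VI_WRITE": "192.168.1.100", "VI_READ": "192.168.1.101"}


def vip_owners(alive):
    """Return {vip: owning node} given the set of nodes that are alive.

    Simplification: the highest-priority live node wins each instance;
    a VIP with no live candidate maps to None.
    """
    owners = {}
    for instance, prios in PRIORITIES.items():
        candidates = {node: prio for node, prio in prios.items() if node in alive}
        winner = max(candidates, key=candidates.get) if candidates else None
        owners[VIPS[instance]] = winner
    return owners


print(vip_owners({"prod-db-01", "prod-db-02"}))  # each node owns one VIP
print(vip_owners({"prod-db-02"}))                # prod-db-02 owns both VIPs
```

Note what the second call implies: after a failure, the surviving node carries both the read and the write traffic, so each node has to be sized to handle the full load on its own.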

Solution 2: The “Right” Way – Use a Real Load Balancer

This is my default recommendation. Stop trying to make your database or application servers do the network’s job. Introduce a dedicated Layer 4 or Layer 7 load balancing tier. This tier’s entire purpose is to accept traffic on a single VIP and intelligently distribute it to a pool of healthy backend servers.

Your new architecture looks like this:

  1. Two load balancer nodes (e.g., running HAProxy or NGINX).
  2. These two nodes use VRRP/Keepalived between them for a single, highly available VIP.
  3. Your backend servers (`prod-db-01`, `prod-db-02`, etc.) are just targets for the load balancers. They don’t run Keepalived at all.

Here’s a conceptual HAProxy config (`haproxy.cfg`) that would run on both load balancers:


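defaults
    mode tcp
    # Illustrative timeouts; tune them for your workload
    timeout connect 5s
    timeout client  30m
    timeout server  30m

frontend postgres_cluster
    # The VIP managed by Keepalived. The standby node does not hold this address,
    # so set net.ipv4.ip_nonlocal_bind=1 there or HAProxy will fail to bind.
    bind 192.168.1.200:5432
    default_backend postgres_nodes

backend postgres_nodes
    balance roundrobin
    # Sends a PostgreSQL startup message as this user; it does not run a query
    option pgsql-check user haproxy_check
    server prod-db-01 10.0.0.11:5432 check
    server prod-db-02 10.0.0.12:5432 check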

In this setup, VRRP’s job is simplified back to its original purpose: making sure the load balancer tier itself doesn’t have a single point of failure. The load balancer handles the complex task of health checking backends and distributing connections. This is the clean, scalable, and professional way to do it.
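If it helps to see what "health checking backends and distributing connections" amounts to, here is a rough Python sketch of round-robin selection that skips backends marked down. It is not how HAProxy is implemented; the `RoundRobinPool` class and its method names are invented for illustration, and the addresses are the ones from the backend section above.

```python
from itertools import cycle


class RoundRobinPool:
    """Hand out backends in rotation, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)  # assume everything starts healthy
        self._ring = cycle(self.backends)

    def mark(self, backend, up):
        """Record the result of a health check for one backend."""
        if up:
            self.healthy.add(backend)
        else:
            self.healthy.discard(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none is available."""
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


pool = RoundRobinPool(["10.0.0.11:5432", "10.0.0.12:5432"])
print([pool.pick() for _ in range(4)])  # alternates between the two backends

pool.mark("10.0.0.11:5432", up=False)   # a health check failed
print([pool.pick() for _ in range(3)])  # only the healthy backend is returned
```

The key difference from the dual-master VRRP hack is that the choice is made once per connection by a single component that holds the full picture, so every packet of a connection reaches the same backend.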

Pro Tip: Don’t skimp on the health checks! A simple port check isn’t enough. Use application-specific checks (like HAProxy’s `pgsql-check` or `http-check`) to ensure the service is actually functional, not just listening on a socket.

Solution 3: The ‘Nuclear’ Option – ECMP Routing

Alright, let’s talk about how the big cloud providers and large-scale networks do it. This option isn’t for everyone and requires tight collaboration with your networking team. It’s called Equal-Cost Multi-Path (ECMP) routing.

Instead of relying on Layer 2 tricks like ARP, ECMP is a Layer 3 routing strategy. Here’s the high-level concept:

  • You announce the same IP address (a /32 prefix) from multiple servers using a routing protocol like BGP.
  • Your network routers see multiple, equally “good” paths to reach that IP.
  • The router then load-balances traffic across those paths, typically using a hash of the flow’s 5-tuple (source IP, destination IP, protocol, source port, and destination port). This ensures that packets for the same connection consistently go to the same server, as long as the set of available paths does not change.
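The hashing step can be sketched as follows. This is a conceptual model only: real routers use hardware hash functions rather than SHA-256, and the `pick_next_hop` function and the `NEXT_HOPS` addresses are hypothetical.

```python
import hashlib

# Hypothetical servers that all announce the same /32 prefix.
NEXT_HOPS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]


def pick_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops=NEXT_HOPS):
    """Map a flow's 5-tuple to one next hop using a deterministic hash."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Every packet of one connection carries the same 5-tuple, so it maps to the same server.
flow = ("203.0.113.7", "192.168.1.200", "tcp", 51515, 5432)
assert pick_next_hop(*flow) == pick_next_hop(*flow)

# Different client ports produce different flows, which spread across the servers.
used = {pick_next_hop("203.0.113.7", "192.168.1.200", "tcp", port, 5432)
        for port in range(40000, 40300)}
print(sorted(used))
```

One design caveat follows directly from the modulo step: when a server is added or withdrawn, `len(next_hops)` changes and many existing flows map to a different server. That is why large deployments tend to pair ECMP with some form of consistent or resilient hashing.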

This is incredibly powerful and scalable, but it comes with a huge caveat:

Warning: This is a networking-heavy solution. You can’t just spin up a VM and do this. You need control over your routing fabric and BGP-capable switches/routers. Tools like BIRD or FRR can be used on your servers to speak BGP, but this is an advanced setup. Attempting this without a solid networking foundation is a recipe for a bad day.

Which Path Should You Choose?

Here’s my advice, distilled into a simple table:

| Solution | When to Use It | Complexity |
|---|---|---|
| Multiple VIPs | You need a quick win and can easily separate traffic at the application level (e.g., read/write splitting). | Low |
| Load Balancer Tier | This is the default for 95% of use cases. You want true load balancing, robust health checks, and a clean separation of concerns. | Medium |
| ECMP | You operate at massive scale, have a strong network engineering team, and need maximum throughput and resiliency at the network layer. | Very High |

So next time you’re tempted to set two VRRP nodes to the same priority, take a step back. Think about the problem you’re truly trying to solve. Chances are, a dedicated load balancer is the reliable, scalable, and sane choice that will let your on-call engineers actually get some sleep.

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.


🤖 Frequently Asked Questions

❓ Why can’t VRRP/CARP be used for active-active load balancing?

VRRP/CARP are failover protocols designed for a single master to own a VIP. When multiple nodes are configured with the same high priority for a VIP, they contend for ownership, leading to ARP flux. Neighboring hosts and routers keep rewriting their ARP entries for the VIP, so packets for the same connection are sent to different servers, which breaks stateful applications.

❓ How do the different load balancing solutions compare in terms of complexity and use case?

Multiple VIPs offer low complexity for quick wins by segregating application traffic (e.g., read/write splitting). A dedicated Load Balancer Tier is of medium complexity, serving as the default for 95% of use cases by providing true load balancing and robust health checks. ECMP routing is very high complexity, suitable for massive scale and requiring strong network engineering expertise and BGP-capable infrastructure.

❓ What is a common implementation pitfall when trying to achieve multi-primary VRRP/CARP load balancing?

A common pitfall is configuring multiple Keepalived nodes with the same high priority for a single Virtual IP. This results in ARP contention: both servers simultaneously claim ownership of the IP, neighboring devices keep rewriting their ARP entries for it, and connections flap and become unstable.
