Failover Manager (EFM) is a tool for managing Postgres database clusters, enabling high availability of primary-standby deployment architectures using streaming replication. The EFM version 5.0 release continues adding improvements and new features to Failover Manager. This is the first change in the major release identifier in several years. The decision to release 5.0 instead of 4.11 was driven by semantics: several of the improvements are not backwards compatible with 4.10, 4.9, and earlier versions of Failover Manager. This blog post covers the changes that resulted in a major version change.
Major version changes
The changes that drove this major version increment are:
- EFM agents no longer write startup output to a separate startup log. If you use systemd, for instance, startup logging and startup errors will be in the systemctl/journalctl output. This puts startup information where users most expect it, and it solves a technical problem: unlike the agent log, the startup log could not be redirected to a different location for each cluster.
- When reconfiguring a standby database to follow a newly-promoted primary, the standby’s local WAL is removed. In previous versions of EFM, this WAL was copied first as a safety precaution. This copy is generally not needed, and the new default behavior is to not make a copy, saving time and disk space. The new backup.wal property can be set to use the old behavior if desired.
- The “xlog” references in the efm cluster-status-json output now use “lsn” instead.
- Previous versions had a property named 'auto.resume.period' that served two purposes. It has been split into two properties, one for each behavior. The EFM upgrade page has an example of the output when upgrading a properties file that contains the old property. The two behaviors are:
  - Control an agent attempting to monitor a database that was started after the agent was started. See auto.resume.startup.period.
  - Control an agent attempting to resume monitoring after a database failure. See auto.resume.failure.period.
- For historical reasons, EFM had a property detach.on.agent.failure whose default was to detach a node from a load balancer if the agent (not the database) failed. The default value for this property is now the generally expected behavior.
- Failover Manager no longer resizes clusters automatically after nodes fail or are disconnected from the cluster. This is discussed in more detail below, but means that cluster status output can now contain the addresses of failed/disconnected nodes.
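When preparing for the upgrade, it can help to check whether an existing properties file still relies on the old auto.resume.period property. The following is a minimal sketch, not part of Failover Manager: the helper names are hypothetical, it assumes simple `key=value` lines with `#` comments, and it only reports what it finds rather than rewriting the file. The EFM upgrade page remains the authoritative description of how the property is converted.

```python
# Hypothetical helper: flag the removed auto.resume.period property
# and report which of its two 5.0 replacements are missing.

REMOVED = "auto.resume.period"
REPLACEMENTS = ("auto.resume.startup.period", "auto.resume.failure.period")


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_auto_resume(text):
    """Return a list of human-readable notes about the auto.resume properties."""
    props = parse_properties(text)
    notes = []
    if REMOVED in props:
        notes.append(
            f"{REMOVED} is set; 5.0 splits it into " + " and ".join(REPLACEMENTS)
        )
    for name in REPLACEMENTS:
        if name not in props:
            notes.append(f"{name} is not set")
    return notes


if __name__ == "__main__":
    # The value below is a placeholder, not a recommended setting.
    sample = "# old-style file\nauto.resume.period=5\n"
    for note in check_auto_resume(sample):
        print(note)
```

Because the check matches whole property names, a file that already uses the two new properties produces no notes.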
Handling cluster size for quorum
As mentioned above, Failover Manager used to resize the cluster automatically after node failures or disconnects, which made the outcome depend on timing. For example, when a primary agent sees the cluster drop from three nodes to one, which often happens as two back-to-back changes, it fences off the primary database. But when it sees the cluster drop from three nodes to two, and later from two to one, it does nothing: an even split in a cluster does not cause the primary to fence itself off, because the other half of the cluster will not promote. The only difference between these two situations is how long the changes took, which can be confusing and can lead to unexpected behavior in edge cases.
Starting with 5.0, when an agent sees other nodes fail or become disconnected, those nodes are still considered part of the cluster until they are restarted, rejoin, or are removed with the efm reset-members command. This simplifies the cluster behavior, and is better at preventing “split brain” situations during very unusual network failures (e.g. cases where some nodes may see others but not vice-versa; “failed” nodes could be separating over time to form their own sub-cluster).
Adding or removing nodes from the cluster will resize the cluster as always. The change in this version is to consider failed/disconnected nodes as still part of the expected cluster size for quorum purposes.
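The timing dependence described in this section can be illustrated with a toy model. This is only a sketch under stated assumptions, not Failover Manager's implementation: it assumes a primary agent fences its database when it can see strictly fewer than half of the expected cluster, and does nothing on an even split, as described above. The function names are hypothetical, and each entry in `observed_sizes` stands for one membership change the agent acts on.

```python
# Toy model of quorum sizing on a primary agent (illustrative only).

def primary_should_fence(visible, expected):
    """Fence only on a strict minority; an even split does nothing."""
    return 2 * visible < expected


def fences_with_resizing(observed_sizes, initial_size):
    """Pre-5.0 style: the expected size shrinks after every change."""
    expected = initial_size
    for visible in observed_sizes:
        if primary_should_fence(visible, expected):
            return True
        expected = visible  # cluster is resized to what is still visible
    return False


def fences_with_fixed_size(observed_sizes, initial_size):
    """5.0 style: failed/disconnected nodes still count toward the size."""
    return any(primary_should_fence(v, initial_size) for v in observed_sizes)


if __name__ == "__main__":
    # Fast failure: three nodes drop straight to one.
    print(fences_with_resizing([1], 3))       # True
    # Slow failure: three to two, later two to one.
    print(fences_with_resizing([2, 1], 3))    # False
    # With a fixed expected size, both sequences give the same answer.
    print(fences_with_fixed_size([1], 3))     # True
    print(fences_with_fixed_size([2, 1], 3))  # True
```

In this model, the slow sequence ends without fencing when the cluster is resized after each change, but ends in fencing when the expected size stays at three, so the result no longer depends on how quickly the failures arrive.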
New features and fixes
In a future post we will give more information on the new efm create-standby feature and other improvements to Failover Manager.