Minimizing Postgres downtime and service interruptions is essential for enterprises to protect against revenue loss, operational disruptions, and poor user experiences. EDB Postgres AI addresses these challenges with EDB Failover Manager (EFM), which enables high availability for primary-standby deployments that rely on streaming replication. In a previous entry, we discussed the changes that precipitated a new major version number. This blog covers two of the new features added in 5.0 in more detail than the release notes and the Q1 Announcement blog provide.
Creating standby databases
Important note with v5.0: If the new efm create-standby command (described below) is run on a standby node (one where the local database is running and is being monitored), you must restart the agent after the command completes. There is a known issue where the agent can be left in an incorrect internal state; if this node is later promoted to primary and the database then fails, that state may prevent a failover. This does not affect "Idle" nodes (where the local database is either not running or not being monitored). The issue will be fixed in 5.1.
EFM 5.0 adds a new command to the ‘efm’ utility: efm create-standby. This command uses pg_basebackup to create a standby database on the node where it is run. The agent must already be running on the node, which means the properties file for this node must already contain usable values. This is where the command gets the information it needs to run, e.g., the database bin and data directories.
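For example, the node's properties file would already contain entries like the following; the paths and user name shown here are illustrative, not defaults:

```
# Illustrative excerpt from a node's EFM properties file; paths are examples
db.bin=/usr/pgsql-17/bin
db.data.dir=/var/lib/pgsql/17/data
db.service.owner=postgres
```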
The efm create-standby command will handle the following for you:
- Determine the current primary node
- Create a physical replication slot if desired, dropping the slot first if it already exists on the primary
- Remove the current data directory as specified by the db.data.dir property
- Run pg_basebackup
- Start the new standby database server
- Resume monitoring the database
- Prompt before doing any of the above if desired
The documentation covers which EFM properties are used, and includes an example of output from the command being run.
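As a rough sketch, an invocation could look like the following; the cluster name (efm) and installation path are assumptions for this example, and the root requirement is explained under “Notes on privileges” below:

```
# Run on the node that will become the new standby.
# The cluster name "efm" and the install path are assumptions for this sketch;
# in 5.0 the command must currently be run with root privileges.
sudo /usr/edb/efm-5.0/bin/efm create-standby efm
```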
Setup after the standby database is created
Further setup may be required on the new standby database before it is fully usable. For example, if synchronous_standby_names was previously set in the database configuration, that setting is lost during standby creation.
The 5.0 release is just the initial implementation of standby creation. More features, including preserving synchronous_standby_names, are planned for upcoming releases.
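Until then, one way to restore such a setting manually is with ALTER SYSTEM; the standby names and count below are purely illustrative:

```
# Hypothetical example: re-apply a synchronous replication setting that was
# lost during standby creation (standby names and count are illustrative)
psql -U postgres -c "ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1, standby2)';"
psql -U postgres -c "SELECT pg_reload_conf();"
```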
Notes on privileges
Currently the command calls pg_basebackup through sudo, so it must be invoked with root privileges in order to use pg_basebackup, start the database if it is running as a service, and have the EFM agent resume monitoring. This will be improved in a future release so that, as with similar commands, it “...must be invoked by efm, a member of the efm group, or root.”
If running Failover Manager in non-sudo mode, the command can be run as the db.service.owner user, and sudo is not needed. Please see the documentation for details about setting up non-sudo mode, e.g., adding the database OS user to the ‘efm’ group and running the database server not as a service.
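As a brief sketch of that setup (the user name postgres and the install path are assumptions; see the documentation for the full procedure):

```
# Add the database OS user (assumed here to be "postgres") to the efm group
sudo usermod -a -G efm postgres

# In non-sudo mode, the command can then be run directly as db.service.owner
/usr/edb/efm-5.0/bin/efm create-standby efm
```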
Synchronous standby handling
EFM monitors a primary database’s synchronous_standby_names setting and, if configured to do so, changes it as needed when standbys are removed from or added to the cluster. The EFM properties that control this are listed here, with an illustrative configuration excerpt after the list:
- reconfigure.num.sync and reconfigure.num.sync.max to control raising or lowering the num_sync value in synchronous_standby_names. As the PostgreSQL documentation puts it, “num_sync is the number of synchronous standbys that transactions need to wait for replies from.”
- reconfigure.sync.primary to tell an agent it can take the primary out of synchronous mode entirely when needed. (There is not currently a way to have Failover Manager put a primary back into synchronous mode.)
- (new in 5.0) check.num.sync.period, which is described below.
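Put together, the relevant settings might appear in a cluster’s properties file like this; the values are illustrative rather than recommended defaults:

```
# Illustrative excerpt; see the documentation for defaults and full semantics
# Allow EFM to raise or lower num_sync in synchronous_standby_names
reconfigure.num.sync=true
# Upper bound when raising num_sync
reconfigure.num.sync.max=2
# Allow an agent to take the primary out of synchronous mode entirely
reconfigure.sync.primary=true
# New in 5.0: how often, in seconds, the primary agent runs its check
check.num.sync.period=30
```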
Prior to EFM 5.0, a primary agent would check whether the local database required a change only after certain events in the cluster, e.g., after standbys left the cluster or after a new primary was promoted. This could cause timing issues, requiring a specific value for the PostgreSQL wal_sender_timeout setting (see the “Note” below the reconfigure.sync.primary property in the documentation). If the primary did not need a change to num_sync at the time of the check, the primary was not reconfigured, and EFM did not check again.
The 5.0 release of Failover Manager removes the dependency on specific events, and now checks every N seconds whether or not there have been changes in the cluster. In many cases this speeds up the database check and reconfiguration, reducing the time that a primary is “stuck” and unable to accept writes. For example, with the default of 50 seconds for node.timeout, an earlier EFM cluster would take roughly 50 seconds to detect that a standby node had failed and take action. With version 5.0 and the default of 30 seconds for check.num.sync.period, the primary agent will detect that the primary database is stuck anywhere from immediately to 30 seconds after the standby fails, depending on where in the check cycle the failure occurs.
Standby agents don’t even have to be in the EFM cluster; the primary agent will perform this check every N seconds as specified. This decouples the monitoring of a primary database’s synchronous standby requirements from other cluster events, making the system more responsive and more stable.
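To see the effect, you can watch the cluster and the setting on the primary while standbys come and go; the session below is purely illustrative, and the cluster name and install path are assumptions:

```
# Show the cluster state as EFM sees it (cluster name "efm" is an assumption)
/usr/edb/efm-5.0/bin/efm cluster-status efm

# Watch the value EFM manages on the primary database
psql -U postgres -c "SHOW synchronous_standby_names;"
```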