In database management, ensuring data availability is very important. PostgreSQL offers powerful replication mechanisms to achieve these goals. It supports two types of replication: physical and logical. While both replicate data across multiple servers, the process is very different, each with its strengths and limitations. This article will explore the distinctions between these approaches and how EDB's Postgres Distributed (PGD) builds on top of both physical and logical replication to offer a more advanced and flexible distributed database solution.
To begin with, let's look into the practical aspects of physical and logical replication.
Physical Replication: A Byte-for-Byte Copy
Physical replication in PostgreSQL operates at the storage level. It involves creating and maintaining an exact, byte-for-byte copy of the source (primary) database server on one or more destination (standby) servers; this is achieved by continuously reading and sending the Write Ahead Log from the primary to one or more standbys which consume the WAL, applying the changes to the data on the standby. That's made possible because the WAL contains a detailed history of every change made to the database, like "at offset 12 of disk page 10 of relation 17556, wrote tuple with hex value 0x1262acab4523...".
Key characteristics of physical replication:
- Binary Copy
- Low Overhead
- Full Database Replication
- Limited Selectivity
- Exact Replica Requirement
- Read Scalability
- High Availability
Because physical replication deals with the raw data blocks and transaction logs, it ensures a complete and exact replica, a full binary copy. And working at that level means there's a minimal overhead on the source server. There is a catch: the servers you are replicating to have to be precisely the same major PostgreSQL version as the source so they fully understand the bits and bytes in the same way.
There's also the fact that it can't be selective over what it replicates. You have to replicate the whole database and there's no picking and choosing a subset of data.
But those destination servers can happily answer read-only queries and that opens the way to read scalability. And when a source server goes offline, any of the destination servers can act as a standby to take its place, though you do need external tooling to make that happen automatically.
Logical Replication: Selective and Decoded Changes
Logical replication in PostgreSQL is about the actual data, not the bits and bytes that physical replication focuses on. By operating at this higher level, it enables more interoperability in the flow of data. Where does logical replication get to know what has changed in the data? Possibly surprisingly, the same place as physical replication, the WAL. The difference is that logical replication has a decoder which turns the WAL into the logical changes. These changes, in the form of DMLs INSERT, UPDATE, and DELETE statements, are then transmitted to subscriber databases and applied.
Key characteristics of logical replication:
- Logical Decoding
- Selective Replication
- Cross-Version Compatibility
- Schema Flexibility
- Higher Overhead
- Conflict Potential
- Manual DDL required
When it comes to logical decoding, the changes are interpreted and replicated as logical operations, which opens up a world of possibilities for data management.
One of the standout features of this approach is selective replication. It gives users fine-grained control over what data gets replicated and how secure that data is. Essentially, you can define publications directly on the primary server, clearly specifying the tables or even specific rows and columns that should be replicated. From there, subscribers can easily create subscriptions to receive these customized publications.
Moreover, cross-version compatibility adds to the appeal. Logical replication allows you to replicate data across different major versions of PostgreSQL, which significantly enhances your flexibility during upgrades. This invaluable feature enables online upgrades of individual nodes, meaning you don't have to shut down the entire replication cluster to perform updates. Consequently, you don't have to stress about staying on the same version when making updates
Another advantage is schema flexibility. Even though unreplicated DDL (Data Definition Language) is allowed, native logical replication offers much more granular control than the typical physical replication method.
However, it's important to be aware of the challenges. For instance, the logical decoding process can lead to higher overhead. This happens because both the decoding of the Write Ahead Log (WAL) on the upstream side and its application on the downstream side can introduce more complexity than simply shipping the raw WAL with physical replication.
There's also the potential for conflicts to arise. Sometimes, replication can hit a snag when incoming data conflicts with existing constraints. Resolving such issues may require manually adjusting the data or permissions, or even skipping the problematic transaction altogether.
Lastly, it’s worth noting that logical replication doesn’t automatically sync metadata— it focuses exclusively on rows. So, it falls on the operator to apply any required DDL changes to ensure that the schemas across the replicas remain synchronized.
In essence, while logical replication offers myriad advantages and control, it comes with some responsibilities and challenges that need to be managed.
PGD: Taking Replication to the Next Level
EDBs Postgres Distributed (PGD) extends the replication concepts and needs to provide a robust active-active replication, high availability, and data distribution solution. It aims to overcome some of the limitations of both physical and logical replication.
How PGD Addresses Limitations of Physical Replication:
- Active-Active Architecture
- Selective Replication
- Version Compatibility
- Parallel Apply
- Builtin routing
PGD leads the way by facilitating a setup where multiple nodes are simultaneously live and actively serving traffic. Furthermore, it can consistently add and remove nodes during normal operations without any impact on the running cluster. This approach not only boosts availability but also enhances performance significantly, especially when you compare it to the more passive nature of typical physical standbys.
Another advantage is Selective Replication. PGD allows you to create replication sets, which means you can intelligently group tables and dictate exactly which data is replicated to specific nodes or groups of nodes. This level of granularity is a game changer, as standard physical replication typically involves replicating the entire database cluster without such control.
Much like logical replication, PGD supports a wide range of PostgreSQL versions, specifically from 12 to 17 in PGD 5. This feature is incredibly useful because it permits online upgrades, unlike physical replication, which can impose stricter version requirements.
With Parallel Apply, a PGD node can use multiple writers for each subscription while maintaining transaction ordering! This means there are no visibility issues compared to the source node. The result is a significant increase in subscription throughput and an overall improvement in replication performance.
It's also worth noting the role of the PGD Connection Manager. This component plays a crucial role in maintaining connections to the PGD cluster, which provides resilience during node failures. By doing so, it also simplifies the underlying node topology for users, making the distributed database easier to manage.
How PGD Addresses Limitations of Logical Replication:
- Advanced Conflict Management
- DDL Replication
- Distributed Sequences
- Automatic Failover
- AutoPartitioning
- Commit Scopes and Synchronicity
In the world of Advanced Conflict Management, one of the standout features is its ability to automatically detect and resolve conflicts that arise when multiple nodes try to write to the same or related data. This capability goes a long way in enhancing basic logical replication, which often lacks sophisticated conflict resolution mechanisms.
Then we have DDL Replication, where PGD really shines by providing effective tools to manage DDL replication. This helps tackle some of the complexities that typically come with DDL changes, especially when dealing with standard logical replication across multiple nodes.
Another impressive offering is Distributed Sequences. PGD ensures that unique identifiers are generated across all nodes in the cluster, effectively mitigating the risk of duplicate keys — a common concern in distributed database systems.
When it comes to Automatic Failover, the PGD Connection Manager takes charge by maintaining connections to the PGD cluster. This feature provides resilience during node failures while also simplifying the underlying node topology for users. Additionally, it simplifies integration with existing load balancer solutions.
Furthermore, PGDs AutoPartitioning features, automatically slices tables into partitions. This functionality is a game-changer for scalability and manageability in distributed environments.
Lastly, let's talk about Commit Scopes and Synchronicity. PGD offers flexibility by allowing users to configure different levels of synchronicity between nodes for each transaction. Options like Synchronous Commit, Group Commit, and CAMO provide enhanced control over data durability and consistency, going beyond the basic synchronous mode of native logical replication.
Conclusion:
Both physical and logical replication are valuable tools in the PostgreSQL ecosystem. They serve different needs for high availability, read scalability, and data sharing. Physical replication provides a robust, low-level mechanism for creating exact replicas, while logical replication offers more flexibility for selective data sharing and cross-version compatibility.
EDBs Postgres Distributed (PGD) provides a comprehensive active-active replication and data distribution solution. By addressing the limitations of traditional physical replication with features like active-active writes and selective replication, and by enhancing logical replication with advanced conflict management, DDL replication, and distributed functionalities, PGD enables the creation of highly available, scalable, and resilient distributed PostgreSQL clusters for demanding enterprise applications.
References:
https://www.enterprisedb.com/docs/pgd/latest/