Last week, OpenAI detailed their design of scaling a single-primary PostgreSQL instance to support 800 million users. Shortly after, MariaDB claimed their product could have handled those same users without the "cracks" OpenAI encountered.
While OpenAI’s approach certainly illustrated deep Postgres design and operational expertise, it also illustrated the lengths teams have to go to in order to serve modern workloads with a single-writer architecture.
MariaDB’s response suggests that the better option is to leave the Postgres ecosystem entirely. The argument against this is that Postgres has won: it is the fastest-growing database across the board and has consistently ranked above MariaDB by factors of 10 for almost the past 10 years, according to the yearly Stack Overflow survey.
In part it has won because of its highly extensible foundation and very vibrant ecosystem of innovation. It is precisely that combination of extensibility and innovation that empowers its users to do more than ‘off the shelf’ Postgres can do on its own. For today’s high-volume workloads, there are better solutions that do not require moving away from Postgres.
A Better Way: Active-Active Postgres with EDB Postgres Distributed
With EDB Postgres Distributed (PGD), organizations don’t have to work within the confines of traditional physical streaming replication models. Nor do they have to go through the painstaking process of moving to new database engines. PGD enables active-active, distributed Postgres architecture while remaining 100% Postgres.
PGD is a cluster management solution that enables its users to meet the scalability and reliability requirements of applications that support core national infrastructure, such as emergency services, utility grids, and global payment processing, to name just a few.
PGD transforms standard PostgreSQL into a distributed, always-on database platform that is designed for modern enterprise scalability, supporting scale-out read performance, granular data locality, and seamless lifecycle management across all nodes and regions.
Part 1: Why Azure "Flexible Server" Hits a Wall
OpenAI used Azure Database for PostgreSQL Flexible Server, a robust service, but one fundamentally built on a Single-Primary architecture. As their user base exploded, the systemic risks to business continuity became clear.
1. The Revenue Risk: The Single-Writer Bottleneck
In Azure Flex, every single write, every ChatGPT conversation and billing record, must pass through one primary node. OpenAI was forced to move write-heavy transactions to Cosmos DB just to handle the load.
- The business risk: If that primary node or its Availability Zone (AZ) suffers a measurable disruption, OpenAI potentially faces a total write outage. A 60-second failover in Azure isn't just a hiccup; it’s millions of dollars in lost customer trust.
- The PGD advantage: PGD's Active-Active (Multi-Master) architecture offers a significant advantage. It multiplies throughput by distributing write workloads across multiple nodes and regions, and its fast failovers remove the single point of failure that a single write leader represents.
2. The Operational Risk: Maintenance as a Liability
According to OpenAI, even routine database changes requiring administrative intervention or cleanup could create measurable performance risk for their 800 million users. That reality placed significant limits on schema and system design, forcing difficult trade-offs where optimization often had to yield to availability and operational stability.
- The business risk: For a global 24/7 service, there is no "off-peak." You are forced to choose between performance degradation and the disruption risk that comes with even routine maintenance.
- The PGD advantage: PGD provides Always On Maintenance. You can perform a REINDEX, VACUUM FULL, or major version upgrade on one node at a time without stopping the write stream on others. In addition, PGD provides ‘cluster aware orchestration’ so you can issue a single command and it will run sequentially across the cluster. In other words, routine maintenance does not compromise availability or increase the operational overhead of the system.
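The node-at-a-time pattern described above can be sketched in a few lines of Python. This is a toy in-memory model, not PGD's orchestration interface: `Node`, `rolling_maintenance`, and the `accepting_writes` flag are hypothetical names introduced only for illustration, and the sketch conservatively assumes a node takes no writes while it is being maintained.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one database node in a cluster."""
    name: str
    accepting_writes: bool = True
    completed: list = field(default_factory=list)


def rolling_maintenance(nodes, task):
    """Run `task` on one node at a time.

    Returns the smallest number of nodes that were still accepting
    writes at any point during the run.
    """
    min_writers = len(nodes)
    for node in nodes:
        node.accepting_writes = False      # take only this node out of rotation
        writers = sum(n.accepting_writes for n in nodes)
        min_writers = min(min_writers, writers)
        node.completed.append(task)        # placeholder for the maintenance step
        node.accepting_writes = True       # rejoin before touching the next node
    return min_writers


cluster = [Node("node-a"), Node("node-b"), Node("node-c")]
writers_left = rolling_maintenance(cluster, "REINDEX")
print(writers_left)  # 2: two of the three nodes kept accepting writes throughout
```

The point of the sketch is the ordering: because each node rejoins before the next one is touched, a multi-writer cluster never drops below N-1 writers, whereas a single-primary system has no equivalent option.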
3. The Customer Satisfaction Risk
OpenAI initially suffered incidents when their 5,000-connection limit was hit, causing cascading failures. To mitigate this, they had to add architectural complexity: external proxies and application-side rate limiting.
- The business risk: When engineers spend 50% of their time building workarounds for database limits, innovation stalls.
- The PGD advantage: PGD includes a Native Connection Manager. It manages thousands of connections and updates routing in under a second, preventing connection overload before it starts.
Part 2: High Availability vs. Business Resilience
| Feature / Risk Factor | Azure PG Flexserver (OpenAI's Current State) | EDB Postgres Distributed (PGD) |
| --- | --- | --- |
| Write Architecture | Single Primary (High Risk Bottleneck) | Multi-Master (Resilient Active-Active) |
| Read Scalability | Primary Overload | Subscriber-only nodes and node groups, expandable beyond 200 total nodes |
| Recovery Time (RTO) | Minutes (Dependent on cloud fabric and physical replication) | Sub-second (leader-election) |
| Maintenance | Potential for degradation; maintenance must be carefully coordinated between app and DBA teams to avoid disruption. | Online maintenance supports upgrades, DDL, VACUUM, REINDEX, etc. |
| Data Durability | Standard sync/async replicas. Potential for unbounded RPO unless a heavy penalty is paid for synchronous replication. | Multiple commit scopes for durability and consistency |
| Dev Velocity | Engineers are limited by database rules and operations | DB scales and adapts to engineers’ needs |
| Portability | Runs only on Azure Cloud | Runs on all clouds, on-prem, Kubernetes, and future Stargate data centers on Oracle Cloud |
Part 3: Addressing the MariaDB "Alternative"
MariaDB suggests their Galera clusters avoid the "Postgres cracks." However, their argument ignores the most important trend: Postgres has won developers' hearts and minds.
Countering the Technical Claims: Commit Scopes
MariaDB claims Galera is superior, but it relies on Synchronous Replication, which creates a "Certification Wall." If you have a node in another region, every write must wait for global agreement, causing massive latency that only gets worse the more distributed your system becomes.
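A back-of-the-envelope model shows why distance matters here. The sketch assumes that a fully synchronous commit cannot finish until the slowest peer has replied, so it costs roughly the local commit time plus the largest round trip. The function name and the millisecond figures are illustrative assumptions, not measurements of Galera or PGD.

```python
def synchronous_commit_latency_ms(local_commit_ms, peer_round_trips_ms):
    """Estimate commit latency when every peer must acknowledge the write.

    The slowest peer sets the pace, so only the largest round trip counts.
    """
    if not peer_round_trips_ms:
        return local_commit_ms
    return local_commit_ms + max(peer_round_trips_ms)


# Illustrative numbers only: two peers in the same region ...
same_region = synchronous_commit_latency_ms(2.0, [1.0, 1.5])
# ... versus the same cluster with one added peer on another continent.
cross_region = synchronous_commit_latency_ms(2.0, [1.0, 1.5, 140.0])

print(same_region, cross_region)  # 3.5 142.0
```

Under these assumed figures, a single distant node raises the latency of every write from a few milliseconds to well over a hundred, regardless of how fast the nearby nodes are.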
PGD offers a more sophisticated answer: Commit Scopes, which allow you to configure and control how data is written and committed to the system. This flexibility enables users to define workload patterns factoring in considerations for performance, durability, and availability.
- Synchronous Commit: This mode commits a transaction locally but keeps its locks open until a configured quorum of nodes acknowledges receipt. It includes an Auto-Degrade functionality that can automatically switch to asynchronous mode if the synchronous nodes become unavailable, balancing durability with system availability.
- Lag Control: Designed for asynchronous replication, this feature allows you to set a threshold (in time or bytes) for replication lag. If the risk of data loss exceeds this configured limit, PGD automatically throttles or inserts delays into the local commit to bound the risk.
- Eager Conflict Resolution: A pessimistic resolution strategy that detects and manages potential conflicts before a transaction is committed, ensuring data consistency across the cluster.
- Synchronous Replication with Consensus: PGD's durability is built on a custom Raft-based consensus implementation, which automates leader election and ensures that all participating nodes have a consistent view of the cluster state and transaction history.
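Two of the behaviors listed above, quorum acknowledgment with automatic degrade and lag-based throttling, can be illustrated with a small Python model. This is a simplified sketch of the decision logic as described in this article, not PGD's implementation or configuration syntax; `commit_mode`, `lag_control_delay_ms`, and all parameters are hypothetical names chosen for the example.

```python
def commit_mode(acks_received, quorum, degrade_allowed):
    """Decide how a commit completes under a quorum-based synchronous rule.

    Returns "sync" when the quorum acknowledged the transaction, "async"
    when it did not but degrading is allowed, and "blocked" otherwise.
    """
    if acks_received >= quorum:
        return "sync"
    return "async" if degrade_allowed else "blocked"


def lag_control_delay_ms(lag_ms, max_lag_ms, max_delay_ms):
    """Return the delay to add to a local commit to keep replication lag bounded.

    No delay is added while lag is within the threshold. Beyond it, the
    delay grows with the excess lag (an assumed linear policy), up to a cap.
    """
    if lag_ms <= max_lag_ms:
        return 0
    return min(max_delay_ms, lag_ms - max_lag_ms)


# Quorum of 2: met, missed with degrade enabled, missed with degrade disabled.
print(commit_mode(2, 2, True))    # sync
print(commit_mode(1, 2, True))    # async
print(commit_mode(1, 2, False))   # blocked

# Lag threshold of 100 ms, commit delay capped at 20 ms.
print(lag_control_delay_ms(50, 100, 20))    # 0
print(lag_control_delay_ms(110, 100, 20))   # 10
print(lag_control_delay_ms(500, 100, 20))   # 20
```

The contrast with an always-synchronous design is that both knobs are per-workload choices: a payment write might refuse to degrade, while a logging write might accept asynchronous delivery with a bounded lag.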
Purpose-Built
OpenAI’s ability to scale to 800 million users highlights the power of Postgres, but it has not tapped into the real potential of Postgres. It is a deliberate display of ingenuity, and a testament to the deep and broad expertise behind the partnership of the two companies, but it still raises the question of whether it is sustainable. It also raises the question of whether Azure Postgres Flexserver is designed to repeat this kind of architecture for a dozen or hundreds of similar deployments.
EDB Postgres Distributed is. It is inherently designed for always-on workloads, and specifically designed to serve some of the most critical workloads around the globe. And PGD does this hundreds of times over.
If you’re an individual who is responsible for maintaining the durability and availability of core national infrastructure systems, do you want to deploy a system that is inherently designed for your workloads, or one that is cleverly configured for them?
Imagine if a global payment processing platform went offline for 60 seconds…
Imagine if your country’s national power grid orchestration all routed through a single primary leader…
Imagine if your nation’s emergency services risked maintenance outages because someone executed a long running query…