EDB Postgres Distributed (PGD), under the EDB Postgres AI product umbrella, seamlessly supports 1) replicating data to an analytics data format (Apache Iceberg) and 2) running regular Postgres queries against that data with EDB Postgres Analytics Accelerator (PGAA), which uses Apache DataFusion under the hood.
In this post we'll set up a PGD cluster and run a basic analytics query against a business table storing customer purchase data, comparing Postgres's built-in query engine with PGAA.
We'll see our query run nearly 7x faster with PGAA than with Postgres's built-in engine on a t2.xlarge EC2 machine (running Ubuntu 24.04). This is not a thorough or definitive experiment! My primary goal here is to show how to set this environment up and how to observe differences between the two engines.
Setting up PGD
Log in or register for a free EDB account and grab your subscription token. Export it in your environment:
$ export EDB_SUBSCRIPTION_TOKEN=whatever-it-is
Now set up repositories for PGD, PGAA, and EDB Postgres Extended (EDB's distribution of Postgres).
$ curl -1sLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/postgres_distributed/setup.deb.sh" | sudo -E bash
$ curl -1sLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/enterprise/setup.deb.sh" | sudo -E bash
Update the apt package index and install the packages for PGD, PGAA, and EDB Postgres Extended, plus hyperfine, which we'll use later for benchmarking.
$ sudo apt-get update -y
$ sudo apt-get install -y edb-pgd6-expanded-pgextended17 edb-postgresextended-17-pgaa hyperfine
Now switch to the postgres user to set up the instance.
$ sudo su postgres
$ cd ~
Use the PGD CLI to create a new Postgres instance and set up a single-node PGD cluster.
$ PGPASSWORD=secret /usr/lib/edb-pge/17/bin/pgd node db1 setup \
--dsn 'host=localhost dbname=pgd' \
--pgdata /var/lib/postgresql/db \
--log-file logfile \
--group-name pgd-group
Now Postgres is running and PGD is set up. We can see how the nodes in the PGD cluster are doing. (There is only one node.)
$ /usr/lib/edb-pge/17/bin/psql -P expanded=auto -h localhost pgd \
-c 'SELECT * FROM bdr.node_summary'
-[ RECORD 1 ]----------+-------------------------------------
node_name | db1
node_group_name | pgd-group
interface_connstr | host=localhost dbname=pgd
peer_state_name | ACTIVE
peer_target_state_name | ACTIVE
node_seq_id | 1
node_local_dbname | pgd
node_id | 2084062396
node_group_id | 1693353463
node_kind_name | data
node_uuid | 19207a60-298e-4d2e-88e5-7cfc9ab3035c
Setting up PGAA
Now let's tweak postgresql.conf for the analytics engine.
echo "
pgaa.max_replication_lag_s = 1
pgaa.flush_task_interval_s = 1
# This is where our analytics-formatted data will end up.
pgfs.allowed_local_fs_paths = '/var/lib/postgresql/pgd-analytics'
pgaa.autostart_seafowl_port = 5445
pgaa.seafowl_url = 'http://localhost:5445'" | tee -a \
/var/lib/postgresql/db/postgresql.conf
Restart Postgres so these changes take effect.
$ /usr/lib/edb-pge/17/bin/pg_ctl -D /var/lib/postgresql/db -l logfile restart
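If you want to confirm the settings took effect, they should show up in pg_settings after the restart (Postgres keeps dotted-prefix settings from postgresql.conf around as placeholders even before the owning extension's library is loaded):

$ /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c \
    "SELECT name, setting FROM pg_settings WHERE name LIKE 'pgaa.%' OR name LIKE 'pgfs.%'"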
Create the directory where PGAA data will go.
$ mkdir /var/lib/postgresql/pgd-analytics
Next we'll add the PGAA extension and tell it where to send data. To keep things simple in this post we'll write data in Iceberg format to the local filesystem, but you can also use any S3-compatible object storage (e.g. MinIO). (And you can also replicate to Iceberg tables by specifying an Iceberg REST Catalog endpoint rather than a raw storage location.)
$ /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c "
-- Create extension and wait for it to be created on all nodes in the cluster.
CREATE EXTENSION pgaa CASCADE;
SELECT bdr.wait_slot_confirm_lsn(NULL, NULL);
-- Create the storage location (on all nodes)
-- This must line up with the pgfs.allowed_local_fs_paths option above.
SELECT bdr.replicate_ddl_command(\$\$SELECT pgfs.create_storage_location('local_fs', 'file:///var/lib/postgresql/pgd-analytics')\$\$);
-- Make sure we will have a write leader in this group.
SELECT bdr.alter_node_group_option('pgd-group', 'enable_proxy_routing', 'true');
-- Point the current PGD group at the storage location we created.
SELECT bdr.alter_node_group_option('pgd-group', 'analytics_storage_location', 'local_fs');
"
Now we'll create a table representing customer orders and we'll give this table 100M fake rows of data. (Takes around 10 minutes on my t2.xlarge, so grab yourself a maté).
$ /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c "
BEGIN;
DROP TABLE IF EXISTS orders CASCADE;
CREATE TABLE orders (
customer_id INT,
order_id BIGSERIAL,
amount_pennies BIGINT,
PRIMARY KEY (customer_id, order_id)
) WITH (pgd.replicate_to_analytics = TRUE);
INSERT INTO orders (customer_id, amount_pennies)
SELECT
random() * 9 + 1,
random() * 100000 + 1
FROM
generate_series(1, 100_000_000);
COMMIT;"
Keep checking until you see Parquet files in /var/lib/postgresql/pgd-analytics/public.orders. It may take another 5 minutes.
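Rather than re-running ls by hand, something like this polls the directory every ten seconds (the path matches the storage location we configured; the layout beneath it is up to the Iceberg writer):

$ watch -n 10 'find /var/lib/postgresql/pgd-analytics/public.orders -name "*.parquet" 2>/dev/null | head'

Once they appear, create a Postgres table that's powered by PGAA's Table Access Method, meaning queries against it will be executed by the vectorized query engine from Apache DataFusion.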
$ /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c "
CREATE TABLE orders_analytics ()
USING PGAA WITH (pgaa.storage_location = 'local_fs', pgaa.path = 'public.orders', pgaa.format = 'iceberg');
"
In the future you ideally won't need to create a new table like this just to query with the analytics engine. But for now this is our little workaround.
NOTE! If you try this in a multi-node cluster, it is going to fail on DDL lock acquisition until replication completes within the PGD cluster itself. Wait a while (around 3 minutes) and keep retrying the command until you get past the DDL lock timeout errors.
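To confirm the new table actually reads the replicated Iceberg data, compare row counts; the analytics copy trails the source slightly, so give it a moment if the numbers don't match at first:

$ /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c \
    "SELECT (SELECT count(*) FROM orders) AS heap_rows,
            (SELECT count(*) FROM orders_analytics) AS iceberg_rows"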
Finally, let's run VACUUM ANALYZE on the orders table (takes around 4 minutes) to make sure Postgres statistics are up to date.
$ /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c "VACUUM ANALYZE orders;"
Analytics queries
Now let's do an analytics query. How much did each customer spend?
SELECT
customer_id,
SUM(amount_pennies) total
FROM
orders
GROUP BY
customer_id
ORDER BY
total DESC
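Before timing anything, it's worth running EXPLAIN against both tables to see which engine will execute the plan. Against orders you'll get the usual Postgres plan nodes; against orders_analytics you should see a scan node supplied by PGAA instead (the exact node names depend on the PGAA version):

$ /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c 'EXPLAIN SELECT
    customer_id, SUM(amount_pennies) total
FROM orders_analytics GROUP BY customer_id ORDER BY total DESC'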
We'll use hyperfine to get more meaningful results. It will run our queries a number of times and collect statistics.
$ hyperfine \
--warmup 10 \
"/usr/lib/edb-pge/17/bin/psql -h localhost pgd -c 'SELECT customer_id, SUM(amount_pennies) total FROM orders_analytics GROUP BY customer_id ORDER BY total DESC'" \
"/usr/lib/edb-pge/17/bin/psql -h localhost pgd -c 'SELECT customer_id, SUM(amount_pennies) total FROM orders GROUP by customer_id ORDER BY total DESC'"
Benchmark 1: /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c 'SELECT customer_id, SUM(amount_pennies) total FROM orders_analytics GROUP by customer_id ORDER BY total DESC'
Time (mean ± σ): 863.1 ms ± 11.1 ms [User: 3.6 ms, System: 4.0 ms]
Range (min … max): 851.6 ms … 891.6 ms 10 runs
Benchmark 2: /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c 'SELECT customer_id, SUM(amount_pennies) total FROM orders GROUP BY customer_id ORDER BY total DESC'
Time (mean ± σ): 5.968 s ± 0.014 s [User: 0.003 s, System: 0.004 s]
Range (min … max): 5.939 s … 5.987 s 10 runs
Summary
/usr/lib/edb-pge/17/bin/psql -h localhost pgd -c 'SELECT customer_id, SUM(amount_pennies) total FROM orders_analytics GROUP BY customer_id ORDER BY total DESC' ran
6.91 ± 0.09 times faster than /usr/lib/edb-pge/17/bin/psql -h localhost pgd -c 'SELECT customer_id, SUM(amount_pennies) total FROM orders GROUP BY customer_id ORDER BY total DESC'
And there we have it: running the query against the analytics engine is nearly 7x faster than running it against Postgres's built-in engine.