Friend of mine asked me a very simple question. He needs to get a top row for every group of rows from a table.
Here’s the table:
\d+ "table"
Table "public.table"
Column | Type | Modifiers | Storage | Description
--------+-----------------------------+-----------+---------+-------------
id | integer | | plain |
value | integer | | plain |
ts | timestamp without time zone | | plain |
Indexes:
"foobar_xz" btree (id)
"foobar_yz_ts" btree (ts)
Has OIDs: no
Lets stuff it with loads of data:
insert into "table"(id, value, ts) select a/10, random()*666, now() + (((random()*1345324532)-5234523)::int::text ||' seconds ')::interval from generate_series(1,10000000) a;
(vacuumed etc)
postgres=# \dt+ "table"
List of relations
Schema | Name | Type | Owner | Size | Description
--------+-------+-------+----------+--------+-------------
public | table | table | postgres | 422 MB |
(1 row)
Now lets run two queries, first one using window functions, second one using a simple sub-select.
postgres=# explain analyze SELECT id, value, ts FROM (select id, value, ts, rank() over (partition by id order by ts desc) from "table") a WHERE a.rank=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on a (cost=1829434.33..2154434.33 rows=50000 width=16) (actual time=26780.491..59825.700 rows=1000001 loops=1)
Filter: (a.rank = 1)
-> WindowAgg (cost=1829434.33..2029434.33 rows=10000000 width=16) (actual time=26780.467..56652.891 rows=10000000 loops=1)
-> Sort (cost=1829434.33..1854434.33 rows=10000000 width=16) (actual time=26780.449..37195.121 rows=10000000 loops=1)
Sort Key: "table".id, "table".ts
Sort Method: external merge Disk: 254032kB
-> Seq Scan on "table" (cost=0.00..154055.00 rows=10000000 width=16) (actual time=0.005..3312.756 rows=10000000 loops=1)
Total runtime: 59988.587 ms
(8 rows)
Time: 60067.760 ms
postgres=# explain analyze select id,value,ts from "table" where (id,ts) in (select id, max(ts) from "table" group by id);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Semi Join (cost=365895.10..680131.10 rows=1 width=16) (actual time=17783.725..45548.519 rows=1000001 loops=1)
Hash Cond: ((public."table".id = public."table".id) AND (public."table".ts = (max(public."table".ts))))
-> Seq Scan on "table" (cost=0.00..154055.00 rows=10000000 width=16) (actual time=0.020..6671.658 rows=10000000 loops=1)
-> Hash (cost=354760.51..354760.51 rows=574106 width=12) (actual time=17780.585..17780.585 rows=1000001 loops=1)
Buckets: 4096 Batches: 32 Memory Usage: 988kB
-> GroupAggregate (cost=0.00..349019.45 rows=574106 width=12) (actual time=81.539..15958.085 rows=1000001 loops=1)
-> Index Scan using foobar_xz on "table" (cost=0.00..291843.13 rows=10000000 width=12) (actual time=81.493..8288.009 rows=10000000 loops=1)
Total runtime: 45706.028 ms
(8 rows)
Time: 45749.765 ms
Now even I was surprised to see, that the simple query was much quicker than the window function.
Please, leave comments as to what could have I done better on the window function front.
Of course, the question would be much different if he wanted - say 3 top results.
This of course isn't suppose to be window function advocacy post, but rather post that advocates healthy choice and please do make sure that you
at least test 2 types of approach to solve your problem before committing to anything.
And fly safe !
(ooops, wrong blog
)