Gj’s SQL blog

December 2, 2010

are window functions always better ?

Filed under: Uncategorized — Grzegorz Jaskiewicz @ 9:42 pm

Friend of mine asked me a very simple question. He needs to get a top row for every group of rows from a table.

Here’s the table:

\d+ "table"
                           Table "public.table"
 Column |            Type             | Modifiers | Storage | Description 
--------+-----------------------------+-----------+---------+-------------
 id     | integer                     |           | plain   | 
 value  | integer                     |           | plain   | 
 ts     | timestamp without time zone |           | plain   | 
Indexes:
    "foobar_xz" btree (id)
    "foobar_yz_ts" btree (ts)
Has OIDs: no

Lets stuff it with loads of data:

insert into "table"(id, value, ts) select a/10, random()*666, now() + (((random()*1345324532)-5234523)::int::text ||' seconds ')::interval from generate_series(1,10000000) a;
(vacuumed etc)
postgres=# \dt+ "table"
                    List of relations
 Schema | Name  | Type  |  Owner   |  Size  | Description 
--------+-------+-------+----------+--------+-------------
 public | table | table | postgres | 422 MB | 
(1 row)


Now lets run two queries, first one using window functions, second one using a simple sub-select.


postgres=# explain analyze SELECT id, value, ts FROM (select id, value, ts, rank() over (partition by id order by ts desc) from "table") a WHERE a.rank=1;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on a  (cost=1829434.33..2154434.33 rows=50000 width=16) (actual time=26780.491..59825.700 rows=1000001 loops=1)
   Filter: (a.rank = 1)
   ->  WindowAgg  (cost=1829434.33..2029434.33 rows=10000000 width=16) (actual time=26780.467..56652.891 rows=10000000 loops=1)
         ->  Sort  (cost=1829434.33..1854434.33 rows=10000000 width=16) (actual time=26780.449..37195.121 rows=10000000 loops=1)
               Sort Key: "table".id, "table".ts
               Sort Method:  external merge  Disk: 254032kB
               ->  Seq Scan on "table"  (cost=0.00..154055.00 rows=10000000 width=16) (actual time=0.005..3312.756 rows=10000000 loops=1)
 Total runtime: 59988.587 ms
(8 rows)

Time: 60067.760 ms
postgres=# explain analyze select id,value,ts from "table" where (id,ts) in (select id, max(ts) from "table" group by id);
                                                                          QUERY PLAN                                                                           
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=365895.10..680131.10 rows=1 width=16) (actual time=17783.725..45548.519 rows=1000001 loops=1)
   Hash Cond: ((public."table".id = public."table".id) AND (public."table".ts = (max(public."table".ts))))
   ->  Seq Scan on "table"  (cost=0.00..154055.00 rows=10000000 width=16) (actual time=0.020..6671.658 rows=10000000 loops=1)
   ->  Hash  (cost=354760.51..354760.51 rows=574106 width=12) (actual time=17780.585..17780.585 rows=1000001 loops=1)
         Buckets: 4096  Batches: 32  Memory Usage: 988kB
         ->  GroupAggregate  (cost=0.00..349019.45 rows=574106 width=12) (actual time=81.539..15958.085 rows=1000001 loops=1)
               ->  Index Scan using foobar_xz on "table"  (cost=0.00..291843.13 rows=10000000 width=12) (actual time=81.493..8288.009 rows=10000000 loops=1)
 Total runtime: 45706.028 ms
(8 rows)

Time: 45749.765 ms


Now even I was surprised to see, that the simple query was much quicker than the window function.
Please, leave comments as to what could have I done better on the window function front.
Of course, the question would be much different if he wanted - say 3 top results. 
This of course isn't suppose to be window function advocacy post, but rather post that advocates healthy choice and please do make sure that you
at least test 2 types of approach to solve your problem before committing to anything. 

And fly safe ! 🙂   (ooops, wrong blog 😉 )

 

November 27, 2010

limited array_agg output on 8.3

Filed under: Uncategorized — Grzegorz Jaskiewicz @ 4:30 pm

Someone asked about it on postgresql-general mailing list. Here’s my way of doing it.

 

We need the limit_array_append() function. So here it is:

CREATE FUNCTION limit_array_append(anyarray, anyelement, integer) RETURNS anyarray AS
$_$
   SELECT (array_append($1, $2))[1:$3];
$_$ LANGUAGE sql IMMUTABLE;

The array_agg() function is obviously predefined on 8.3.

Than we need to define the aggregate:

CREATE AGGREGATE limit_array_agg(anyelement, integer)
(
   SFUNC = limit_array_append,
   STYPE = anyarray,
   INITCOND = '{}'
);

 

The result is beautiful limited array_agg. Same thing can be obviously done much better, both code wise, and performance wise on 8.4+ with window functions.

To test just do:

select limit_array_agg((random()*626)::integer, 3) FROM generate_series(1, 20) i GROUP BY (i/7)::integer; 
limit_array_agg 
----------------- 
{546,360,139} 
{411,287,339} 
{359,288,95}

(3 rows)

Easy, ain’t ?


Create a free website or blog at WordPress.com.