Gj’s SQL blog

April 19, 2009

on using different types in joins, and performance of it

Filed under: index, joins, postgresql — Grzegorz Jaskiewicz @ 4:44 pm

I get this question quite often from folks I work with, or people I join in their projects, usually with database designed by someone who is not very experienced in database world.

Most of them, don’t understand why the hell I am pressing for using bigint/int as a key to a table.

Here’s simple example, and my own theory behind it.

We have a table with domain names, and their alexa rank (you can download csv for free from their website, with 1M addresses).

So without further ado:

gj=# CREATE TEMP TABLE domains_tmp( rank int, domain text);
CREATE TABLE

gj=# COPY  domains_tmp(rank, domain) FROM '/tmp/top-1m.csv' CSV;
COPY 1000000
Time: 2059.112 ms

gj=# create index foo_bar on domains_tmp(domain);
CREATE INDEX
Time: 8792.686 ms

gj=# create table domains(id bigserial, domain text not null, primary key(domain));
NOTICE:  CREATE TABLE will create implicit sequence "domains_id_seq" for serial column "domains.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "domains_pkey" for table "domains"
CREATE TABLE
Time: 98.709 ms

gj=# create unique index domains_id_idx on domains(id);
CREATE INDEX
Time: 13.792 ms

gj=# insert into domains(domain) select domain from domains_tmp;
INSERT 0 1000000
Time: 27627.267 ms

gj=# create table rank_t(rank int not null, domain text not null, primary key(rank));
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "rank_t_pkey" for table "rank_t"
CREATE TABLE
Time: 11.005 ms

gj=# create unique index rank_t_idx on rank_t(domain);
CREATE INDEX
Time: 3.532 ms
  
gj=# create table rank_i(rank int not null, domainid bigint not null, primary key(rank));
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "rank_i_pkey" for table "rank_i"
CREATE TABLE
Time: 5.205 ms

gj=# insert into rank_i(rank, domainid) select dt.rank, d.id FROM domains_tmp dt JOIN domains d ON dt.domain=d.domain;
INSERT 0 1000000
Time: 26621.904 ms

gj=# insert into rank_t(rank, domain) select rank, domain from domains_tmp;
INSERT 0 1000000
Time: 25145.377 ms

gj=# create unique index rank_i_idx on rank_i(domainid);
CREATE INDEX
Time: 2.796 ms

gj=#  alter table rank_t add constraint foo_k foreign key (domain) references domains(domain);
ALTER TABLE
Time: 4694.189 ms

gj=# alter table rank_i add constraint foo_i foreign key (domainid) references domains(id);
ALTER TABLE
Time: 3630.766 ms

gj=# DROP TABLE domains_tmp;
DROP TABLE
Time: 68.319 ms

now , vacuum, reindex. And here we got:


List of relations

Schema |  Name   | Type  | Owner | Size  | Description

——–+———+——-+——-+——-+————-

public | domains | table | gj    | 51 MB |

public | rank_i  | table | gj    | 38 MB |

public | rank_t  | table | gj    | 47 MB |

(3 rows)


(yes, this is 8.4)


gj=# \di+

List of relations

Schema |      Name      | Type  | Owner |  Table  | Size  | Description

——–+—————-+——-+——-+———+——-+————-

public | domains_id_idx | index | gj    | domains | 21 MB |

public | domains_pkey   | index | gj    | domains | 41 MB |

public | rank_i_idx     | index | gj    | rank_i  | 25 MB |

public | rank_i_pkey    | index | gj    | rank_i  | 25 MB |

public | rank_t_idx     | index | gj    | rank_t  | 29 MB |

public | rank_t_pkey    | index | gj    | rank_t  | 39 MB |

(6 rows)

Now, say we want to find out ranks of domains with 'g%' mask.

Let’s first see what postgresql will do, if we want to just count them:


gj=# explain analyze  select count(1) from domains where domain like ‘g%’;

QUERY PLAN

—————————————————————————————————————————————-

Aggregate  (cost=8376.17..8376.18 rows=1 width=0) (actual time=62.967..62.968 rows=1 loops=1)

->  Bitmap Heap Scan on domains  (cost=1250.38..8275.16 rows=40404 width=0) (actual time=20.496..54.766 rows=38976 loops=1)

Filter: (domain ~~ ‘g%’::text)

->  Bitmap Index Scan on domains_pkey  (cost=0.00..1240.28 rows=39982 width=0) (actual time=18.307..18.307 rows=38976 loops=1)

Index Cond: ((domain >= ‘g’::text) AND (domain < ‘h’::text))

Total runtime: 63.216 ms

(6 rows)


Time: 64.232 ms


nice,

now let’s join first rank_t:


gj=# explain analyze SELECT d.domain, r.rank FROM domains d JOIN rank_t r ON r.domain=d.domain WHERE d.domain LIKE ‘g%’;

QUERY PLAN

———————————————————————————————————————————————-

Hash Join  (cost=8584.81..40023.85 rows=40404 width=19) (actual time=100.789..1199.080 rows=38976 loops=1)

Hash Cond: (r.domain = d.domain)

->  Seq Scan on rank_t r  (cost=0.00..16035.00 rows=1000000 width=19) (actual time=0.015..227.224 rows=1000000 loops=1)

->  Hash  (cost=8079.76..8079.76 rows=40404 width=15) (actual time=99.758..99.758 rows=38976 loops=1)

->  Bitmap Heap Scan on domains d  (cost=1054.62..8079.76 rows=40404 width=15) (actual time=39.306..75.026 rows=38976 loops=1)

Filter: (domain ~~ ‘g%’::text)

->  Bitmap Index Scan on domains_pkey  (cost=0.00..1044.52 rows=40011 width=0) (actual time=37.297..37.297 rows=38976 loops=1)

Index Cond: ((domain >= ‘g’::text) AND (domain < ‘h’::text))

Total runtime: 1205.552 ms

(9 rows)


Time: 1249.874 ms

gj=# explain analyze SELECT d.domain, r.rank FROM domains d JOIN rank_i r ON r.domainid=d.id WHERE d.domain LIKE ‘g%’;

QUERY PLAN

—————————————————————————————————————————————-

Hash Join  (cost=28456.62..36390.85 rows=40404 width=19) (actual time=774.058..883.903 rows=38976 loops=1)

Hash Cond: (d.id = r.domainid)

->  Bitmap Heap Scan on domains d  (cost=1054.62..8079.76 rows=40404 width=23) (actual time=19.687..53.550 rows=38976 loops=1)

Filter: (domain ~~ ‘g%’::text)

->  Bitmap Index Scan on domains_pkey  (cost=0.00..1044.52 rows=40011 width=0) (actual time=17.667..17.667 rows=38976 loops=1)

Index Cond: ((domain >= ‘g’::text) AND (domain < ‘h’::text))

->  Hash  (cost=14902.00..14902.00 rows=1000000 width=12) (actual time=753.574..753.574 rows=1000000 loops=1)

->  Seq Scan on rank_i r  (cost=0.00..14902.00 rows=1000000 width=12) (actual time=0.014..258.203 rows=1000000 loops=1)

Total runtime: 896.658 ms

(9 rows)


Time: 898.534 ms


Now, this example is maybe bit simple, and naive (after all, we already have domain name in rank_t, so why join ? – but this is just an example, usually you need to join another table , and another, etc), but it should help you visualize why actually having a simple, short key in join helps performance. (ie, why joining on text, varchar, bytea, etc hurts performance).

You might ask – why is it. My own theory is, it is very cheap to calculate cost of integer, bigint, timestamp, enum, boolean, etc. However, variable lengths are trickier. They are also tricker from btree perspective (which is going to be another think I’ll write about – me thinks).

So, for good of us all – please reference tables using int/bigint – and your queries should be much snappier.

thanks for attention folks 🙂

Advertisements

2 Comments »

  1. It’s great that you blog, but the idea of using surrogate keys is quite controversional. especially given the fact that when using natural keys you can *skip* some of the joins.

    Comment by depesz — April 19, 2009 @ 4:50 pm

    • very true, but I have to say – in my general practice I found what I wrote about true in vast majority of cases.
      The problem with SQL, is that it isn’t as strict as , say C. DBE can use different dirty tricks to optimize queries, and give you the result you need faster.
      So it is very hard, as you probably know from your own experience – to write about anything, and give definite preconceived ideas, directions.

      This is my own experience, but still everyone should test different ideas before committing to anything.
      🙂

      Comment by gryzman — April 19, 2009 @ 5:13 pm


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: