Gj’s SQL blog

April 19, 2009

how to speed up index on bytea, text, etc.

Filed under: index, postgresql — Grzegorz Jaskiewicz @ 5:43 pm

Well,

I recently had to create db to store some binary information. Now, these entries were to be unique in their category.

Average size of bytea field was to be 2400 bytes (about 2kb). Everything was fine, until table filled with 1.3M unique entries, and I was really astound how slow insertion become.

Turns out, that btree to figure out uniqueness has to walk through the tree – obvious, right. But with so many binary rows, that tree was getting bigger and bigger.

PostgreSql, at least for now – doesn’t support unique hash, or thick indices (combination of btree and hash – afaik).

So, here’s homebrew solution to that problem;

Let’s for instance take, as an example db used in previous post. Let’s drop rank* tables, and remove primary key on domains(domain):

alter table domains drop CONSTRAINT domains_pkey;

and make the entries bit longer, so lets execute following query few times:

update domains set domain = domain||domain;
Now, if we create typical index, and homebrew "thick" index:
gj=# create unique index foo1 on domains(domain);
CREATE INDEX
Time: 14741.673 ms
gj=# create unique index foo2 on domains(md5(domain));
CREATE INDEX
Time: 11296.674 ms

Please note, that even the latter creation of index took less time than the “normal” index.

Now, vacuum, analyze, reindex, and lets see the size of indices:

gj=# \di+

List of relations

Schema |      Name      | Type  | Owner |  Table  |  Size  | Description

——–+—————-+——-+——-+———+——–+————-

public | domains_id_idx | index | gj    | domains | 21 MB  |

public | foo1           | index | gj    | domains | 147 MB |

public | foo2           | index | gj    | domains | 52 MB  |

(3 rows)

Now, we can already see difference in size. Most of our gain will come right out of that.

It is the reason why people usually normalize their dbs, but that’s also subject for different entry.

Now, the reason to do it was simple – we wanted to see, first – that insertion of new element is going to be quicker with ‘thick’ index,

and second of all, see if actual search for element is going to be quicker. Naturally, we are trading ability to search using (i)like with that.

But you rarely need prefix search in binary data :).

so for now, lets drop the foo2 index, and choose random row using:

 select domain from domains order by random() limit 1;


gj=# explain analyze select id from domains where domain = 'chinkan.jpchinkan.jpchinkan.jpchinkan.jpchinkan.jpchinkan.jpchinkan.jpchinkan.jp';

 QUERY PLAN 

-------------------------------------------------------------------------------------------------------------------

 Index Scan using foo1 on domains  (cost=0.00..9.02 rows=1 width=8) (actual time=23.335..23.340 rows=1 loops=1)

 Index Cond: (domain = 'chinkan.jpchinkan.jpchinkan.jpchinkan.jpchinkan.jpchinkan.jpchinkan.jpchinkan.jp'::text)

 Total runtime: 23.385 ms

(3 rows)

Time: 24.134 ms

And now with foo2, lets drop foo1 first, although there’s no really a need for that;

gj=# explain analyze select id from domains where md5(domain) = md5('chinkan.jpchinkan.jpchinkan.jpchinkan.jpchinkan.jpchinkan.jpchinkan.jpchinkan.jp'::text);

 QUERY PLAN 

--------------------------------------------------------------------------------------------------------------

 Index Scan using foo2 on domains  (cost=0.00..8.54 rows=1 width=8) (actual time=0.124..0.127 rows=1 loops=1)

 Index Cond: (md5(domain) = 'aad780122d8a5a458e7db23ced18500f'::text)

 Total runtime: 0.157 ms

(3 rows)

Time: 0.857 ms
Again, this example is fairly simple – because of nature of data presented here. It is however a remedy for storing quite large data in columns, sometimes needed (filesystem might not be that quick with many files, and subdirectories).
Enjoy. HTH!
Advertisements

5 Comments »

  1. Useful information — thanks! There is a minor typo, I think, in the description of the index creation.

    and homebrew “tick” index:

    should probably read
    “and homebrew “thick” index:”

    Comment by Greg Williamson — April 20, 2009 @ 12:09 am

    • thanks, indeed it should read “thick” – fixed now.

      Comment by gryzman — April 20, 2009 @ 6:43 am

  2. Interesting. I’m thinking though that the same result is provided by the hash index.

    It would be a nice addition if you included hash index creation and query times as well.

    Comment by Fernando Hevia — September 21, 2009 @ 3:16 pm

    • hash index is slower than btree , plus you can’t have unique constraint on it, for some reason.

      So in general, I am avoiding that index, and that’s the very main reason I wrote that article 🙂

      I am indeed hoping it will be useful, so thanks for your comment.

      Comment by gryzman — September 21, 2009 @ 3:20 pm

      • The performance penalty on hash indexes is certainly an issue. As you provided nice timed examples I was kind of hoping you could include a timed example using a hash index for completeness. I think it would strengthen your solution and fulfill my curiosity 🙂

        BTW, I liked the article very much. Thanks for sharing!

        PD: Uniqueness can’t be guaranteed by hashing indexes because it’s possible that two different values produce the same hash key.
        Since MD5 is a hashing function, the same applies here. There is a slight probability two different bytea values will obtain the same MD5 key.

        Comment by Fernando Hevia — September 21, 2009 @ 6:42 pm


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: