PGCon2010 - Final Release III

PGCon 2010
The PostgreSQL Conference

Speakers
Oleg Bartunov
Teodor Sigaev
Schedule
Day Talks - 1 - 2010-05-20
Room DMS 1140
Start time 11:30
Duration 01:00
Info
ID 227
Event type Lecture
Track Hacking
Language used for presentation English

Efficient k-nn search with GiST and other development

We present implementation of new GiST tree traverse strategy and efficient k-nn search based on this strategy. Also, we'd like to discuss new signature file based index (bloom index), it's implementation and possible improvements.

There are many application where efficient k-nn (k-closest neighbourhood) search is very needed, for example, GIS, multimedia search. Currently, k-nn search in PostgreSQL usually emulated using repeated search with changing of"radius" of a query until the number of rows in result will satisfy query. We introduce new strategy of GiST tree traverse (in addition to the original depth-first), based on priority queue, which allows native implementation of efficient k-nn search. On the test database of POI (point of interests), which has 1034170 spots, we got about 300x perfomance gain due to k-nn search.

The new feature of GiST doesn't introduce any incompatibilities, the only visible change is that consistent user-defined method now can return not just TRUE/FALSE, but - negative value, which means tuple doesn't match query (like FALSE in old implementation) - 0.0 means one of: - a zero distance (exact match) - a match for filtering clause, like a <@ or @> for point. - positive value, which means the method returns distance. In this case keyRecheck should be false!, since it's impossible to make right order with lossy values. GiST was teached to recognize which algorithm of tree traverse to use (depth-first, or distance based priority queue).

In addition, we'd like to present and discuss our new signature file based bloom index. This index is useful if table has many attributes and queries can include their arbitary combinations. Traditional Btree index is faster than bloom index , but it'd require too many indexes to support all possible queries, while one need only one bloom index. Bloom index supports only equality comparison. Since it's a signature file, not a tree, it always should be readed fully, but sequentially, so search performance is constant and doesn't depends on a query. Implementation of Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter) allows fast exclusion of non-candidate tuples. Since signature is a lossy representation of all indexed attributes, search results should be rechecked using heap information.