PGCon2014 - Final Release

PGCon 2014
The PostgreSQL Conference

Speakers
Konstantin Knizhnik
Schedule
Day Talks - Day 2 - Fri May 23 - 2014-05-23
Room Morisset 221
Start time 13:00
Duration 00:45
Info
ID 643
Event type Lecture
Track Big Data
Language used for presentation English

In-Memory Columnar Store extension for PostgreSQL

IMCS is In-Memory Columnar Store for PostgreSQL. Vertical data model is more efficient for analytic queries performing operations on entire column. IMCS provides 10-100 times improvement in performance comparing with standard SQL queries because of:

  • data skipping: fetching only data needed for query execution
  • parallel execution: using multiple threads to execute query
  • vector operations: minimizing interpretation overhead and allowing SIMD instructions
  • reduced locking overhead: simple array level locking
  • no disk IO: all data is in memory

IMCS is implemented as standard PostgreSQL extension. It provides set of function and operators for manipulations with timeseries. Some of them are analog of standard SQL operators (arithmetic, comparisons, sorting, aggregation...). But there are also complex analytic operators like calculation of ranks, percentiles, cross points and extended set of aggregates for financial application like split-adjusted price, volume-weighted average price, moving average...

Columnar store manager stores data tables as sections of columns of data rather than as rows of data.In comparison, most relational DBMSs store data in rows. Such approach allows to load the whole record using one read operation which leads to better performance for OLTP queries. But OLAP queries are mostly performing operations on entire columns, for example calculating sum or average of some column. In this case vertical data representation is more efficient.

Columnar store or vertical representation of data allows to achieve better performance in comparison with horizontal representation due to three factors: * Data skipping. Only columns involved in query are accessed. * Vector operations. Applying an operator to set of values minimize interpretation cost.

Also SIMD instructions of modern processors accelerate execution of vector operations.

  • Compression of data. For example such simple compression algorithm like RLE allows not only to reduce used space, but also minimize number of performed operations.

IMCS is first of all oriented on work with timeseries. Timeseries is sequence of usually small fixed size elements ordered by some timestamp. Operations with timeseries rarely access some particular timeseries element, instead of it them operate either with whole timeseries either with some time interval. Such specific of timeseries operation requires special index for timeseries, which is different from traditional database indexes. Such index should not provide efficient way of locating arbitrary timeseries element. Instead of it this index should be able to efficiently extract range of timeseries elements.

Advantages of IMCS approach:

  • Fast execution based on vector operations
  • Parallel execution of query
  • No changes in PostgreSQL core (just standard extension)
  • No MVCC overhead (MURSIW isolation level)
  • No disk IO (in-memory store)
  • Optimized for timeseries (massive operations with time slices)