PGCon 2008
The PostgreSQL Conference

Speakers
Durgaprasad S. Pawar
Schedule
Day Talks - first day (2008-05-22)
Room A
Start time 17:30
Duration 01:00
Info
ID 84
Event type lecture
Track Horizontal Scaling
Language en

Multi-threaded query accelerator

A case study

Abstract: As the amount of data in a database grows, most applications need to organize it in a way that is optimized both for a high data-insertion rate and for fast query execution. This paper presents a case study of a database design that uses data partitioning to store a huge volume of data, together with a lightweight multi-threaded application that acts as middleware for faster query execution (up to 13 times faster). The emphasis is on the multi-threaded application, which executes queries spanning multiple tables faster than a single single-threaded Postgres process can. This implementation is expected to support approximately 90 GB of data, with queries spanning the entire data set.

I will speak about the design of the multi-threaded application, which was developed and tuned to our project's needs so that results come back faster than the database would return them on its own. Initial tests show a performance gain of roughly 13x, measured as the time taken to retrieve the query result set.
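
As a rough illustration of the fan-out/merge idea (a minimal sketch, not the actual middleware from the talk), the Python snippet below runs one piece of a logical query against each day partition on its own thread and connection, then combines the partial GROUP BY results. The DSN, the partition names (events_YYYYMMDD), and the column names are assumptions made for the example.

    # fan_out.py - minimal sketch of a multi-threaded query fan-out.
    # Assumed for illustration: psycopg2, day partitions named
    # events_YYYYMMDD, and a GROUP BY query with a SUM aggregate.
    import threading
    from collections import defaultdict

    import psycopg2

    DSN = "dbname=casestudy user=app"      # hypothetical connection string
    PARTITIONS = ["events_20080520", "events_20080521", "events_20080522"]

    merged = defaultdict(int)              # combined GROUP BY result
    lock = threading.Lock()                # protects `merged`

    def query_partition(table):
        # Each thread opens its own connection, so each per-partition
        # query runs on its own Postgres backend, in parallel.
        conn = psycopg2.connect(DSN)
        try:
            cur = conn.cursor()
            # Partition names are fixed internal strings, so plain
            # string interpolation is acceptable here.
            cur.execute("SELECT src_ip, SUM(bytes) FROM %s "
                        "GROUP BY src_ip" % table)
            rows = cur.fetchall()
        finally:
            conn.close()
        with lock:
            for key, partial in rows:
                merged[key] += partial

    threads = [threading.Thread(target=query_partition, args=(t,))
               for t in PARTITIONS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    for key, total in sorted(merged.items()):
        print("%s %s" % (key, total))

SUM and COUNT partials can simply be added together when merging; an aggregate like AVG would need SUM and COUNT carried separately per partition.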

Data size and H/W specification: About 90 GB of data, spread across more than 20 tables, is stored in the database. The tables are partitioned by day, and inheritance is used to make maintenance of the data easy. Each table contains between 20,000,000 and 40,000,000 records. The hardware used to run the application is a two-processor Sun Fire V245 with two hard disks, 2 GB of RAM, and 4 GB of swap space.
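
For reference, the day-based partitioning described above corresponds to PostgreSQL 8.x inheritance partitioning. The sketch below creates one day partition with a CHECK constraint so constraint exclusion can skip irrelevant days; the table and column names are invented, and a parent table `events` is assumed to exist.

    # make_partition.py - sketch of creating one day partition with
    # inheritance, the 8.x-era partitioning style described above.
    # The parent table `events` (with a `logged` date column) and the
    # DSN are assumptions for the example.
    import psycopg2

    DSN = "dbname=casestudy user=app"      # hypothetical

    def create_day_partition(day):
        """day is an ISO date string such as '2008-05-22'."""
        child = "events_%s" % day.replace("-", "")
        conn = psycopg2.connect(DSN)
        cur = conn.cursor()
        # The CHECK constraint lets constraint_exclusion prune this
        # partition from queries that touch other days.
        cur.execute(
            "CREATE TABLE %s ("
            "  CHECK (logged >= DATE '%s' AND logged < DATE '%s' + 1)"
            ") INHERITS (events)" % (child, day, day))
        cur.execute("CREATE INDEX %s_logged_idx ON %s (logged)"
                    % (child, child))
        conn.commit()
        conn.close()

    create_day_partition("2008-05-22")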

Need for speed: Data flows into the machine continuously at a very high rate and is available on the machine's local hard disk for processing. It is then processed and inserted into the appropriate tables in the database. The database is expected to be up 24x7, because the data coming from the network must be stored continuously and without loss.

Can the database do it all? No: Postgres handles the high insertion rate with its default configuration and some minimal tuning of configuration parameters. However, we face the following problems/limitations when using Postgres for querying:

  1. Time taken to execute the application-specific queries is very high, since the queries span multiple huge tables and involve ‘GROUP BY’ and aggregation operations.
  2. For data spanning more than 3 tables, Postgres throws an ‘out of memory’ error once the Postgres process handling the query grows beyond 4 GB.
  3. Querying large data sets requires significantly more shared memory and work memory (the ‘shared_buffers’ and ‘work_mem’ settings).
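
Items 2 and 3 come down to per-backend memory. One partial mitigation, shown below as a sketch rather than the configuration actually used in the case study, is to raise work_mem only in the middleware's own sessions, so each per-partition sort or hash gets more room without raising the limit for every backend; the value and query are illustrative.

    # session_memory.py - sketch: raise work_mem for one session only.
    # work_mem is allocated per sort/hash operation, so a per-session
    # SET is safer than a large server-wide default.
    import psycopg2

    conn = psycopg2.connect("dbname=casestudy user=app")  # hypothetical
    cur = conn.cursor()
    cur.execute("SET work_mem = '256MB'")   # affects this session only
    cur.execute("SELECT src_ip, SUM(bytes) FROM events GROUP BY src_ip")
    print(cur.fetchall()[:5])
    conn.close()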

I will explain how this application tries to overcome all of the problems mentioned above.