PGCon2017 - 20180510

PGCon 2017
The PostgreSQL Conference

Jan Urbański
Day Talks - Day 1 - 2017-05-25
Room DMS 1110
Start time 14:00
Duration 00:45
ID 1042
Event type Lecture
Track DBA
Language used for presentation English

Lies, damned lies, and statistics

A journey into the PostgreSQL statistics subsystem

Before executing an SQL query, Postgres needs to decide on an execution plan. While there are multiple steps involved in generating a plan, this talk will focus on the main source of inputs for the query planner machinery, namely the statistics subsystem.

Detecting problems with statistics is often a crucial step towards finding the reasons for bad plans and slow queries, so it's important to understand exactly what gets tracked and how it's getting used.

As the universal rule of Garbage In, Garbage Out teaches us, without some degree of knowledge about the shape of queried data, it is impossible to get a reasonable execution plan. PostgreSQL employs a number of ways to maintain up-to-date statistics about table sizes and the distribution of data inside them, which in turn inform the query planner.

In this talk we'll explore what statistical information is being tracked by Postgres, how is it calculated, where does the server store it and how can the operator query it, and finally how to tweak the whole system.

Maintaining up-to-date statistics needs to balance performance overhead with information quality. For a large database, you can't just read everything and calculate some ratios, so the system employs a number of clever algorithms, which we will examine.

Another concern is that some data types require specific statistical information that's specific to the query pattern for the given type. A column holding an array is less likely to be queried for exact matches, but it is often queried using a "contains" operator. We'll cover special cases such as arrays, full text search vectors and ranges, and we'll talk more in depth about how statistics for them are gathered.

Finally, we'll look at some of the weaknesses of the statistics subsystem and areas where it could be still improved.