PGCon2019 - 3.4

PGCon 2019
The PostgreSQL Conference

Konstantin Evteev
Day Talks - Day 2 - 2019-05-31
Room DMS 1160
Start time 14:00
Duration 00:45
ID 1302
Event type Lecture
Track Scaling Out
Language used for presentation English

Standby in production: scaling an application at the second-largest classifieds site in the world

In this talk, I would like to share Avito’s experience with different cases of standby usage:

  • problems and solutions in replication-based horizontal scale-out;
  • our implementation of a solution that avoids stale reads from a replica;
  • cases highlighting problems that can arise when a standby serves a high request rate, when DDL is applied, and when WAL files are received from the archive, and how we handle some of them by running several standbys in production and routing queries between them;
  • disaster recovery plans (DRP) that keep the standby or standbys synchronized with the archive after crashes.

I joined the Avito team five years ago. It is now the second-largest classifieds site in the world, and its ads are stored in PostgreSQL databases.

Last year, I gave a talk at PGCon 2018 that explained in detail how our recovery use cases around Londiste (and PGQ in general) in distributed data processing could be switched to the new logical replication subsystem in PostgreSQL 10. In this talk, I will turn to “simple” replication-based horizontal scale-out.

One part of this presentation describes the basics of a technique in which running applications powered by PostgreSQL offload read operations to read-only replicas. It is a high-impact, low-effort win for scalability, but it is not without challenges: stale reads are possible.

Even systems using read replicas without any techniques for mitigating stale reads will produce correct results most of the time, but for us “most of the time” isn’t good enough.
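To make the stale-read problem concrete, here is a minimal sketch of one common mitigation: remember the WAL position (LSN) of a session's last write and send its reads only to a replica that has already replayed that position. This is an illustration under stated assumptions, not Avito's implementation. The names `Replica`, `choose_node`, and `parse_lsn` are hypothetical. In a real system the LSN values would come from `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on each standby (PostgreSQL 10 and later); here they are passed in as plain strings.

```python
from dataclasses import dataclass
from typing import List, Optional


def parse_lsn(text: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an integer byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass
class Replica:
    # Hypothetical record; replay_lsn is assumed to be the result of
    # SELECT pg_last_wal_replay_lsn() run on that standby.
    name: str
    replay_lsn: str


def choose_node(last_write_lsn: Optional[str],
                replicas: List[Replica],
                primary: str = "primary") -> str:
    """Pick a node for a read: the first replica that has replayed the
    session's last write, otherwise fall back to the primary."""
    if last_write_lsn is None:
        # The session has not written anything, so any replica is acceptable.
        return replicas[0].name if replicas else primary
    needed = parse_lsn(last_write_lsn)
    for replica in replicas:
        if parse_lsn(replica.replay_lsn) >= needed:
            return replica.name
    # No replica has caught up yet: read from the primary to avoid a stale read.
    return primary
```

For example, with replicas at `16/B374D000` and `16/B374E000`, a session whose last write was at `16/B374D848` would be routed to the second replica, while a session with a newer write than either replica has replayed would go to the primary.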

Another part of my talk covers a few pitfalls that many people are not aware of, while others have simply accepted the risks.

These problems can occur when you use a standby with a high request rate, apply DDL, receive WAL files from the archive, or promote one of your standbys after a primary crash without preserving the consistency of your archive and the other standbys.

Over the years of constantly using physical replication, we have gained extensive experience, rethought a lot, and implemented our own workarounds, archive and restore commands, and disaster recovery plans for crashes in distributed data processing systems.

All these efforts and their results are worth sharing with anyone who plans or runs PostgreSQL in an enterprise setting.