What to do when your app cracks under load

How to scale a web application

Yep, the application can’t handle the load anymore. Alerts, exceptions, timeouts. Vertical scaling helps for a couple of weeks, and then everything starts failing again. And it becomes noticeably expensive for a short-term solution. Ouch.
Good news – it means that users love your product = a strong signal on the PMF journey!
Tech stack: Python, Django, RQ (Redis Queue), Postgres, Kubernetes in the cloud
Client profile: scale-up
Domain: inventory management

Challenge: random spikes of 100% CPU utilisation on the Postgres database, leading to a degraded user experience in the web app (delays, errors, reduced functionality)

Result:
  • first week – stabilize the system, identify hotspots and extend monitoring
  • second week – improve system throughput 10x, downscale the DB instance 3x, confirm CPU usage is below 20% (even during peak hours)
  • and one more day – prepare the system for further growth: create a roadmap to evolve the architecture and gradually improve the codebase
The first rule of working with a system under pressure is the Pareto principle – identify and address the issues with the most severe contribution to performance hiccups. Refactoring, changing the tech stack, comprehensive code analysis – these should be a strategic, educated choice of the core team at a later stage. When a system struggles to serve its main purpose, the top priority is to fix it ASAP, no matter what. Actually, the same applies to an org’s change management, but that is a topic for another essay.

Identifying bottlenecks and hotspots might be tricky. In this case the system was of moderate size – tens of thousands of lines of code, multiple modules with a few flows of load: webhooks triggered by 3rd-party systems; queued jobs ranked by priority; and the usual API calls driven by user actions in the product’s dashboard. All of it has weekly and daily seasonality, as well as random bursts caused by new client onboarding.

Why is the system slow?
Root cause analysis in big distributed systems
How do you find the root cause of poor system performance?
1
First of all – perform “due diligence” – collect information about the system:
  1. deployment
  2. database(s)
  3. the application’s key components and their interactions
  4. external dependencies – 3rd-party services
  5. observability: logging, monitoring, alerting – what we have and what we don’t
  6. ask(!) the team about their concerns/suspicions
  7. the current system throughput = number of operations per second (see the sketch after this list for a rough DB-side proxy)
  8. if it is not yet under monitoring – start collecting this info and watch it after every change in the system
  9. determine and agree on our target number – i.e. is it x10? x100?
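If throughput is not measured anywhere yet, the database itself can give a rough proxy. A minimal sketch (plain Postgres, no extensions assumed): sample pg_stat_database twice, a known number of seconds apart, and divide the delta by the interval.
-- Rough DB-side throughput proxy: transactions committed/rolled back so far.
-- Sample twice, N seconds apart; delta of (xact_commit + xact_rollback) / N ≈ transactions per second.
SELECT
    datname,
    xact_commit,
    xact_rollback,
    now() AS sampled_at
FROM
    pg_stat_database
WHERE
    datname = current_database();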
2
Based on the initial info – check the most suspicious parts
  1. DB: sizes – tables, indexes, data in tables; do the schemas match the queries? what about index usage? what are the hottest tables? do we have full scans? how often does vacuum run? etc. (see the size-check sketch after this list)
  2. App: if we have access to the pod/node where the app is running – a session with atop/htop + netstat/lsof to see whether we hit a ceiling – and where the bottleneck is: CPU/RAM/disk/network?
  3. Depending on the app – add/view logs to spot hotspots: the most frequently called methods and their mean time, or the endpoints with the longest tail of response time
  4. Deployment: CPU type, disk type, network type, throttling; is it a shared environment (with the famous CPU steal time), dedicated, or even spot instances? CDN, caching, load balancing, geo location, etc.
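For the size check, a query like this is a reasonable starting point (a sketch; the LIMIT is arbitrary):
-- Largest tables, with heap and index sizes shown separately
SELECT
    relname,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
    pg_size_pretty(pg_relation_size(relid)) AS table_size,
    pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM
    pg_stat_user_tables
ORDER BY
    pg_total_relation_size(relid) DESC
LIMIT 20;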
3
Prepare an action plan to address the most demanding parts
Depending on the findings – e.g. tuning the schema, sharding the DB or caching data, extracting long-running processes into async jobs, using auto-scaling in case of high CPU usage
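As a tiny illustration of “tuning the schema”: if a hot query turns out to be a full scan on a large table, the fix might be a new index built without blocking writes (table and column names below are invented for the example):
-- Hypothetical example: speed up lookups of recent rows per warehouse
-- CONCURRENTLY avoids locking out writes on a busy table (cannot run inside a transaction)
CREATE INDEX CONCURRENTLY idx_stock_item_warehouse_updated
    ON stock_item (warehouse_id, updated_at DESC);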
4
Do it!
Pick the most impactful item and start working on it
5
Compare metrics before and after the changes
How long to collect them depends on the project and domain – e.g. some jobs or functions might be triggered rarely, yet still lead to noticeable performance degradation
Sometimes it can be super helpful to check performance changes in an isolated setup and use wrk or similar tools for stress testing – reproducing bottlenecks locally and observing system throughput. Depending on the complexity of the setup, this brings better control, and strategically (if even more growth is expected in the near future) it can help to measure the limits of the system.

In this case the vector of investigation was clear – we knew where to start: the database’s CPU was heavily used. But why?
Why is my Postgres so slow?
What can we get from Postgres system tables? Actually, a lot:
Active operations happening right now:
SELECT 
    count(1),
    wait_event_type,
    wait_event,
    backend_type
FROM
    pg_stat_activity
GROUP BY
    wait_event_type,
    wait_event,
    backend_type;
Red flags are:
  • a growing number of connections – either more and more clients are trying to do something, or the DB cannot handle the flow of queries and they are queued
  • event types like:
  • IO – a high rate of write ops – delete/update/insert
  • BufferMapping – full scans
  • idle in transaction – the connection is in use, the DB locks rows for modification, and the client does something after acquiring the lock – ideally everything should be prepared upfront so transactions stay fast
  • the total number of connections exceeding 500 (see the query after this list)
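A quick way to eyeball the last two red flags (another view over pg_stat_activity):
-- Connections by state, with the longest open transaction in each state
SELECT
    state,
    count(*) AS connections,
    max(now() - xact_start) AS longest_open_transaction
FROM
    pg_stat_activity
WHERE
    backend_type = 'client backend'
GROUP BY
    state
ORDER BY
    connections DESC;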
Database usage and access patterns
The number of rows inserted/read/updated/deleted per table – to determine the most heavily used tables and the type of their usage:
SELECT
	relname,
	seq_tup_read,
	idx_scan,
	idx_tup_fetch,
	n_tup_ins,
	n_tup_upd,
	n_tup_del,
	n_tup_hot_upd,
	n_live_tup,
	n_dead_tup,
	n_mod_since_analyze,
	n_ins_since_vacuum
FROM 
	pg_stat_user_tables
Are there full scans?
i.e. are there any missing indexes?
SELECT
    relname,
    seq_scan,
    seq_tup_read,
    idx_scan
FROM
    pg_stat_user_tables
WHERE
    seq_tup_read > idx_scan
ORDER BY
    seq_tup_read DESC
LIMIT 10;
Index usage per table
SELECT
	relid,
	relname, 
	indexrelname,
	idx_scan, 
	idx_tup_read,
	idx_tup_fetch
FROM
	pg_stat_all_indexes
Unused indexes:
However, allow some time after a metrics reset before relying on this data – there might be rare scenarios that do not utilise those indexes right away
-- unused indexes per table, bigger than 100 MB
-- pg_toast tables (blobs of data) are excluded
SELECT 
    s.schemaname,
    s.relname,
    s.indexrelname,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS index_size
FROM 
    pg_stat_all_indexes s
JOIN 
    pg_class c ON c.relname = s.indexrelname
JOIN 
    pg_namespace n ON n.oid = c.relnamespace
WHERE 
    s.idx_scan = 0 
    AND s.idx_tup_read = 0 
    AND s.idx_tup_fetch = 0 
    AND pg_total_relation_size(c.oid) > 100 * 1024 * 1024
    AND s.relname NOT LIKE 'pg_toast_%';
With query statistics activated (e.g. the pg_stat_statements extension), we get information about the most frequent queries: number of calls, number of rows involved, as well as mean and max time. Handy extensions like this increase transparency and help find suspicious code:
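For example, assuming pg_stat_statements is installed, the heaviest queries by total time can be pulled like this (column names are for Postgres 13+; older versions use total_time / mean_time / max_time):
-- Top time-consuming query shapes, as normalised by pg_stat_statements
SELECT
    calls,
    rows,
    round(total_exec_time::numeric, 1) AS total_ms,
    round(mean_exec_time::numeric, 1) AS mean_ms,
    round(max_exec_time::numeric, 1) AS max_ms,
    left(query, 80) AS query_head
FROM
    pg_stat_statements
ORDER BY
    total_exec_time DESC
LIMIT 20;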
Using all of this, we can export it into Prometheus and expose it in a Grafana dashboard to correlate with the CPU spikes: what are the hottest tables? What are the access patterns? Are there full scans (= extensive disk IO / data buffer operations in the DB)? How many rows are fetched? When was the last time vacuum or analyze ran? Can we map the most demanding and frequent queries back to the ORM and the code? How is the code itself organised – e.g. is the ORM boilerplate hidden behind wrappers re-used across the project (so all optimisations can be done in a single place and it is easier to find all references)?

There are some basic precautions that are always worth keeping an eye on – probably even with some static checkers:
  • using a function on a column in a condition (it has to be evaluated for every row), multi-table joins, using `in` in conditions with a looong list of values to check against (see the sketch below)
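A tiny illustration of the first point (table and column names are invented): wrapping a column in a function defeats any index on it, while expressing the same filter as a range keeps it index-friendly.
-- Non-sargable: date_trunc() is evaluated for every row, so an index on created_at is useless
SELECT count(*) FROM orders WHERE date_trunc('day', created_at) = '2024-05-01';
-- Sargable: the same filter as a range – an index on created_at can be used
SELECT count(*) FROM orders WHERE created_at >= '2024-05-01' AND created_at < '2024-05-02';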
Additionally, it can be beneficial to integrate into the code:
  • Upserts, and checks to skip DB updates when the data has not changed (see the sketch after this list)
  • Pre-computed aggregations and caching (the good old LRU cache)
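A minimal SQL sketch of the first item (table and column names are invented; a unique constraint on (item_id, warehouse_id) is assumed): an upsert that only writes when something actually changed, so unchanged rows do not generate dead tuples and WAL.
-- Hypothetical table: upsert that skips the write when nothing changed
INSERT INTO stock_level (item_id, warehouse_id, quantity)
VALUES (42, 7, 100)
ON CONFLICT (item_id, warehouse_id) DO UPDATE
    SET quantity = EXCLUDED.quantity
    WHERE stock_level.quantity IS DISTINCT FROM EXCLUDED.quantity;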
Sometimes it can be much more efficient to zoom out a bit and think from the level of business logic – are there any redundant calls in the function’s flow? What about the data schemas? We should always design storage based on usage patterns: how are we going to query this data? Are we read-heavy or write-heavy? Are there advantages to some de-normalisation, or maybe there is a need for an extra layer of aggregated data?
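One common shape for that extra layer of aggregated data is a materialized view refreshed on a schedule – a sketch with invented names, not the project’s actual schema:
-- Pre-computed aggregation: stock totals per warehouse
CREATE MATERIALIZED VIEW warehouse_stock_summary AS
SELECT
    warehouse_id,
    count(*) AS items,
    sum(quantity) AS total_quantity
FROM
    stock_level
GROUP BY
    warehouse_id;
-- Re-run from a cron/async job; CONCURRENTLY would require a unique index on the view
REFRESH MATERIALIZED VIEW warehouse_stock_summary;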
Whatever hypothesis you have – the golden rule of optimisation is to change the system iteratively: one thing at a time, redeploy, and watch the metrics!