How to choose your database?

Use Postgres! Use whatever you have the most experience with!
Every software design discussion should start with a round of questions to clarify what we are about to build, and what we are not. Boundaries matter: they separate a very shiny but probably too remote future of hundreds of thousands of requests per second from what is actually needed for PMF validation (speed!) or for existing and prospective clients over the next three years (predictable stability).
Start with numbers, the good old capacity planning (a rough back-of-the-envelope sketch follows this list):

  • data volume
  • what kind of data: payload size, number of attributes (i.e. columns), data types (numbers or text, GIS or nested documents, media such as blobs, or other special data types)
  • expected load (read-heavy, write-heavy, and if mixed, in what proportion) and read/write amplification; the actual number of operations is what puts you in the world of High Load
  • peak number of operations per second
  • how long data should stay in hot storage, i.e. available for ad-hoc queries: one month's worth of data or the last 3 years
  • Is there any chance that the data will be changing (110% yes!), i.e. how convenient is schema evolution going to be? Some programming frameworks and databases do have existing mature solutions for this.
  • What if there is a need to derive a new attribute from several existing fields? Are you ready to write backfill utilities to crunch all datasets, taking into account uptime requirements and the limitations of various client apps? (See the batched backfill sketch below.)
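To make the numbers concrete, here is a rough back-of-the-envelope estimate. Every constant in it is a made-up placeholder; plug in your own:

```python
# Back-of-the-envelope capacity estimate.
# All numbers below are hypothetical placeholders -- replace them with your own.

AVG_ROW_BYTES = 1_200          # average payload size per record (columns + overhead)
WRITES_PER_SECOND = 50         # expected sustained write rate
PEAK_FACTOR = 5                # peak load relative to the average
READS_PER_WRITE = 20           # read/write ratio for a read-heavy workload
HOT_RETENTION_DAYS = 90        # how long data must stay in hot storage

SECONDS_PER_DAY = 86_400

rows_per_day = WRITES_PER_SECOND * SECONDS_PER_DAY
hot_rows = rows_per_day * HOT_RETENTION_DAYS
hot_storage_gb = hot_rows * AVG_ROW_BYTES / 1024 ** 3

print(f"rows per day:        {rows_per_day:,}")
print(f"rows in hot storage: {hot_rows:,}")
print(f"hot storage size:    {hot_storage_gb:,.1f} GiB (before indexes and replication)")
print(f"peak writes/s:       {WRITES_PER_SECOND * PEAK_FACTOR:,}")
print(f"peak reads/s:        {WRITES_PER_SECOND * PEAK_FACTOR * READS_PER_WRITE:,}")
```

If the result fits comfortably on a single Postgres instance (and very often it does), that alone is a strong argument to stop the engine debate right there.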
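And if a derived attribute does become necessary later, the backfill itself does not have to be scary. Here is a minimal sketch of a batched backfill that keeps transactions short; it assumes Postgres, and the table and column names (orders, total_price, quantity, unit_price) are invented for illustration:

```python
# Sketch: backfill a new derived column in small batches to keep locks short
# and stay within uptime requirements. Names (orders, total_price, quantity,
# unit_price) are hypothetical.
import time

import psycopg2  # assumes psycopg2 is installed and Postgres is reachable

BATCH_SIZE = 10_000

conn = psycopg2.connect("dbname=shop user=migration")
conn.autocommit = True  # commit every batch independently

with conn.cursor() as cur:
    while True:
        cur.execute(
            """
            UPDATE orders
            SET total_price = quantity * unit_price
            WHERE id IN (
                SELECT id FROM orders
                WHERE total_price IS NULL
                LIMIT %s
            );
            """,
            (BATCH_SIZE,),
        )
        if cur.rowcount == 0:
            break  # nothing left to backfill
        time.sleep(0.5)  # give the database room to breathe between batches

conn.close()
```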

Identify the read usage pattern, i.e.
  • what questions your data should answer: aggregations, fuzzy search or geospatial queries
  • how often
  • how fast is acceptable vs desirable
  • who asks those questions, i.e. do you need dashboards over materialized views built from independent data marts, or will an SQL query interface over a data lake be sufficient for analysts? (a materialized view sketch follows this list)
  • does it require the last 3 months or the whole history
  • whether it is worth preparing thumbnails or you always need to load the full image
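For the dashboard case above, this is roughly what "dashboards over materialized views" can look like. A sketch assuming Postgres with psycopg2; the orders table and its columns are invented for illustration:

```python
# Sketch: a pre-aggregated view for dashboards so analysts do not hit raw tables.
# Table and column names (orders, amount, created_at) are hypothetical.
import psycopg2  # assumes psycopg2 is installed and Postgres is reachable

DAILY_REVENUE_VIEW = """
CREATE MATERIALIZED VIEW IF NOT EXISTS daily_revenue AS
SELECT date_trunc('day', created_at) AS day,
       count(*)                      AS orders,
       sum(amount)                   AS revenue
FROM orders
GROUP BY 1;
"""

with psycopg2.connect("dbname=shop user=analytics") as conn:
    with conn.cursor() as cur:
        cur.execute(DAILY_REVENUE_VIEW)
        # In production, refresh on a schedule (cron, pg_cron, an orchestrator)
        # rather than on every dashboard request.
        cur.execute("REFRESH MATERIALIZED VIEW daily_revenue;")
```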

This should give you a baseline of requirements for choosing the DB and the schema.
In 99.99% of cases migration is not needed: updating the schema and the application resolves all hiccups!
Many teams worry about the burden of a db engine migration and invest too much time debating the best db choice before they have any understanding of real usage. Time-box the design choice and embrace possible changes.

But before migrating, trace the issues with the existing database solution: what are the main pain points we are trying to overcome with the migration? In roughly 99.99% of the cases we have seen, migration was not needed; updating the schema according to the db engine's paradigm and adjusting the application resolved all hiccups.
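If the current engine is Postgres, one way to ground that conversation in facts is to look at the heaviest queries first. A sketch assuming the pg_stat_statements extension is enabled (column names follow Postgres 13+; older versions expose total_time/mean_time instead):

```python
# Sketch: find the queries that actually hurt before blaming the engine.
# Assumes Postgres 13+ with the pg_stat_statements extension enabled.
import psycopg2  # assumes psycopg2 is installed and Postgres is reachable

TOP_QUERIES = """
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       left(query, 80)                    AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
"""

with psycopg2.connect("dbname=shop user=dba") as conn:
    with conn.cursor() as cur:
        cur.execute(TOP_QUERIES)
        for calls, total_ms, mean_ms, query in cur.fetchall():
            print(f"{calls:>10}  {total_ms:>12}  {mean_ms:>10}  {query}")
```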

Yes, a migration will require careful planning: switching applications to the new data sources without downtime, several test runs in a sandbox to better understand possible data issues, and, if needed, extracting some static data for normalisation (roles or product types, for example).
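Extracting that static data for review can be as simple as dumping the distinct values with their frequencies. A sketch with invented names (legacy_orders, product_type):

```python
# Sketch: pull distinct reference values out of a wide legacy table so they can be
# reviewed and turned into a proper lookup table before the real migration.
# Table and column names (legacy_orders, product_type) are hypothetical.
import csv

import psycopg2  # assumes psycopg2 is installed and Postgres is reachable

with psycopg2.connect("dbname=legacy user=migration") as conn:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT product_type, count(*)
            FROM legacy_orders
            GROUP BY product_type
            ORDER BY count(*) DESC;
        """)
        rows = cur.fetchall()

# Dump to CSV so domain experts can mark typos, duplicates and obsolete values.
with open("product_types.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["product_type", "occurrences"])
    writer.writerows(rows)
```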

It is true: every migration I have been part of revealed a number of issues:
  • data quality issues: missing foreign references or logically impossible data, e.g. negative prices or missing mandatory fields
  • data schema issues: when the original idea of the target schema does not accommodate particular data outliers
  • performance: depending on the load patterns, app and business logic, you may end up with hotspot tables that are accessed too often (because of joins or aggregations), or trickier issues related to hotspot nodes
And sure, you can find your way to tackle them:
  • for data quality issues: sample data statistics and start asking questions, e.g. what do we consider a duplicate? Is it okay if a profit value is negative? Is it acceptable for the last name to be null? It helps a lot to sync with domain experts as well as with data producers and consumers to figure out the best outcome (ignore, decide case by case, use agreed defaults, or just fix it). Based on this discovery stage, data quality checks or even alerts can be set up, documented if necessary, using tools like greatexpectations.io or DDataFlow (see the sampling sketch after this list)
  • map current and forecasted data usage to the target database and tailor the schemas. Sometimes it is handy to split data into several layers, as in the medallion architecture: raw data that is later used to derive aggregations for dashboards or domain-specific tables, to avoid frequent heavy queries.
  • sometimes there is simply no ready-to-use migrator available, or the existing solution is not flexible enough, so you need to build your own tooling and make it configurable so it can be reused
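For the data quality discovery mentioned above, you do not need heavy tooling to get started. A plain sampling sketch with invented column names (price, last_name, order_id), which can later graduate to greatexpectations.io checks:

```python
# Sketch: lightweight data-quality sampling before reaching for heavier tools.
# Column names (price, last_name, order_id) are hypothetical.
import pandas as pd

df = pd.read_csv("orders_sample.csv")  # a sample, not the full dataset

checks = {
    "negative prices":     (df["price"] < 0).sum(),
    "missing last names":  df["last_name"].isna().sum(),
    "duplicate order ids": df["order_id"].duplicated().sum(),
}

for name, count in checks.items():
    status = "OK" if count == 0 else f"{count} rows"
    print(f"{name:<22} {status}")
```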
Book a free consulting session!
We do like to chat about data! Let's talk through your struggles: it is free, and we guarantee you will come away with a wider list of options. Usually it is not as bad as you might think :)