Start with numbers — good old capacity planning:
- data volume
- what kind of data — payload size, number of attributes (columns), data types: numbers or text, GIS or nested documents, media (i.e. blobs) or other special data types
- expected load (read-heavy, write-heavy, and if mixed — in what proportion), read/write amplification, the actual number of operations — whether this lives in the world of High Load at all
- peak number of operations per second (a rough sizing sketch follows this list)
- how long data should stay in hot storage — i.e. available for ad-hoc queries — one month's worth of data or the last 3 years
- Is there any chance that the data will change? (110% yes!) — i.e. how easy will schema evolution be? Some programming frameworks and databases already have mature solutions for this.
- What if a new attribute needs to be derived from several existing fields? Are you ready to write backfill utilities to crunch through all datasets, taking into account uptime requirements and the limitations of various client apps? (A backfill sketch also follows this list.)
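For a rough feel of the numbers, here is a back-of-envelope sizing sketch in Python. Every constant in it (payload size, daily writes, peak ratio, retention, overhead factor) is a hypothetical placeholder to be replaced with your own measurements or estimates:

```python
# Back-of-envelope capacity sizing. All numbers below are hypothetical
# placeholders -- substitute your own measurements or estimates.

AVG_PAYLOAD_BYTES = 2 * 1024          # average record size on the wire
WRITES_PER_DAY = 50_000_000           # expected daily ingest
PEAK_TO_AVG_RATIO = 5                 # traffic spikiness: peak ops vs. daily average
HOT_RETENTION_DAYS = 90               # how long data must stay queryable ad hoc
INDEX_AND_REPLICA_FACTOR = 3.0        # index overhead + replication, rough multiplier

avg_writes_per_sec = WRITES_PER_DAY / 86_400
peak_writes_per_sec = avg_writes_per_sec * PEAK_TO_AVG_RATIO

raw_hot_bytes = AVG_PAYLOAD_BYTES * WRITES_PER_DAY * HOT_RETENTION_DAYS
provisioned_hot_bytes = raw_hot_bytes * INDEX_AND_REPLICA_FACTOR

print(f"avg writes/s:  {avg_writes_per_sec:,.0f}")
print(f"peak writes/s: {peak_writes_per_sec:,.0f}")
print(f"hot storage:   {provisioned_hot_bytes / 1024**4:.1f} TiB provisioned")
```

Even this crude arithmetic usually settles whether you are looking at a single Postgres instance or at a sharded / distributed setup.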
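And a minimal batched-backfill sketch for the derived-attribute case. It assumes a hypothetical `users` table where `full_name` is computed from existing `first_name` and `last_name` fields; sqlite3 is only a stand-in for whatever driver you actually use, and the batch size and pause are the knobs that reconcile the backfill with uptime requirements:

```python
import sqlite3
import time

# Batched backfill sketch: derive a new column `full_name` from existing
# `first_name` / `last_name` fields. Table, columns and constants are
# hypothetical; sqlite3 stands in for your real database driver.

BATCH_SIZE = 1_000
PAUSE_SECONDS = 0.1   # throttle so the live workload is not starved

def backfill_full_name(conn: sqlite3.Connection) -> None:
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, first_name, last_name FROM users "
            "WHERE id > ? AND full_name IS NULL ORDER BY id LIMIT ?",
            (last_id, BATCH_SIZE),
        ).fetchall()
        if not rows:
            break  # nothing left to backfill
        conn.executemany(
            "UPDATE users SET full_name = ? WHERE id = ?",
            [(f"{first} {last}", row_id) for row_id, first, last in rows],
        )
        conn.commit()
        last_id = rows[-1][0]        # keyset pagination: resume after the last processed id
        time.sleep(PAUSE_SECONDS)    # give the live workload room to breathe
```

Keyset pagination keeps each batch cheap even on large tables, and committing per batch keeps transactions short, which is exactly what client apps with uptime requirements care about.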
Identify the read usage pattern — i.e.
- what questions your data should answer — aggregations, fuzzy search, or geospatial queries
- how often
- how fast is acceptable vs. desirable (captured as latency targets in the spec sketch after this list)
- who asks those questions — i.e. is it required to provide dashboards over materialized views from independent data marts, or will a SQL query interface over a data lake be sufficient for analysts?
- do queries need only the last 3 months or the whole history
- whether it is worth preparing thumbnails or the full image always has to be loaded
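One way to keep these answers honest is to write them down as a structured spec that can be reviewed next to each database candidate. A minimal sketch, where the query kinds, latencies, and consumers are hypothetical examples:

```python
from dataclasses import dataclass
from enum import Enum

# Sketch of capturing read-pattern answers as a structured spec.
# Field names and example values are hypothetical; the point is to force
# an explicit answer per question the data must serve.

class QueryKind(Enum):
    AGGREGATION = "aggregation"
    FUZZY_SEARCH = "fuzzy_search"
    GEOSPATIAL = "geospatial"

@dataclass
class ReadPattern:
    question: str
    kind: QueryKind
    runs_per_day: int
    acceptable_latency_ms: int
    desirable_latency_ms: int
    history_window_days: int   # e.g. 90 for "last 3 months", full retention for "whole history"
    consumer: str              # "dashboard", "analyst SQL", "backend API", ...

read_patterns = [
    ReadPattern("orders per region per day", QueryKind.AGGREGATION,
                runs_per_day=500, acceptable_latency_ms=2_000,
                desirable_latency_ms=500, history_window_days=90,
                consumer="dashboard over materialized view"),
    ReadPattern("find merchants near a point", QueryKind.GEOSPATIAL,
                runs_per_day=200_000, acceptable_latency_ms=300,
                desirable_latency_ms=100, history_window_days=1,
                consumer="backend API"),
]
```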
This should establish a baseline set of requirements for choosing a database and a schema.