Database indexes
Which type of database index you should use? Lets dive into zoo of available index types in different database engines and their corresponding use cases.
So, you have some data. Most likely you want to work with it - i.e. retrieve information according to some conditions = filtering what is important for you. Probably, even change the shape of this information - transform the data or get summary over it - i.e. apply some aggregations.

Usual question that arise - how to organize data schema - which db engine and indexes to choose.

Here I want to provide a brief overview of different kinds of indexes available in modern databases.

When data volume is small - i.e. less than a few millions of rows - it usually doesn't really matter - even fullscan will be sufficient for the majority of use cases.
With growth of data and load = number of read and write operations - it becomes paramount to structure your data in a way that facilitates processing without the need to read every row from the disk.

Data might be stored just in RAM memory or have a persistent layer at disk. You might have different requirements for data access depending on its age - i.e. ability to quickly deep dive into data from the last 3 months and keep 5 years worth of data in cold storage to comply with regulatory rules.

Data can take different shapes:
  • numbers of various sizes and precisions from uint8 to float point with configurable precision,
  • enum-like categories from fixed list of possible dimensions,
  • sequences of DNA or community relationships
  • tree-like hierarchy with number of nested fields
  • big binary blobs - like images or video
  • geometrical figures - line's segments or polygons
  • gis info - locations and coordinates
  • text fields - fixed length strings with hash values or variable length texts, with some articles
  • simple key-values with types that you know upfront
  • complex documents, where you have no idea which attributes provider might add next day
What is index anyway?
To simplify, indexes, at the end - are search trees and hash maps.
Data are tend to it evolve over time - new fields or dimensions might be added, old - change their type or cardinality or just disappear.

Query - mean you specify some filter criteria to pull only necessary data points. Filter criterias = we need to perform search or lookups. Hence data is organized in some structures that facilitate its retrieval according to specified conditions.

Golden rule of data schema design - define your schema around queries.

Let's dive into the diversity of available indexes - how data type and query type affect choice of index type (and sometimes even database engine).
B-Tree, and in many case its ancestor - B+-Tree is the most universal type of index - used for majority RDBMS databases as a default index type (from SQLite to MS SQL) as well as for NoSQL (from ScyllaDB to Aerospike).

Used to filter based on equality (==, <>, !=), compare - (>, <,>=, <=) and range conditions - BETWEEN.

Can be used with almost any data type - from numbers and strings, arrays or fields inside nested structures.
Hash based index is ideal choice for cases when you need to check for exact match - equality conditions (==, <>, !=) - faster then B-Tree based indexes.

Fit almost any simple data type, but not suitable for composite entities, nested documents or collections, or cases where more complex filtering condition is required - prefix or range searches.
Variations of inverse indexes are handy in case we work with textual data. Specialized engines - Sphinx or Elasticsearch can go the extra mile with stem and n-grams. And in case your data can be expressed as a numerical vector = feature vectors - ANN/KNN kind of queries can be used to rank results in terms of distance to target queries - to express similarity relationship.
Spatial data - geometry figures or locations require special treatment and better suited for indexes tailored for typical kind of queries - proximity search, bounding box or containment tasks served best with multidimensional trees.
It is possible to create indexes on top of results for some functions execution - `upper` for strings or convert epoch based number into timestamp - using expression or function based indexes.

Collections - arrays or sets usually require various logic to check for overlaps or membership as well as checking specific elements is present - for that best suited multikey or inverted indexes - but sometimes expression based indexes can be handy as well.

Semi-structural data - when we are not sure about schema of documents and every next entry might have some deviations - best served with indexes from these groups as well as at the end it is an intermix of array, text and probably expression and vector representations.

Graph and hierarchical relationship can be expressed in RDBMS terms - at the end all of it are vertices and edges - but for efficient traversals special types of indexes might be much more beneficial - Graph or Hierarchical.
OLAP and columnar type of database engines, introduce various tricks to efficiently deal with big data volume across multiple dimensions - mainly to skip non-relevant data partitions and be able to get a rough understanding of data statistics for a chunk of storage.

Wide-column NoSQL engines tend to rely on secondary indexes, sometimes stored locally within data nodes to facilitate filtering of non-primary keys or take advantages of sorting order.
Index Types and Supported DB Engines

Database Index Selection Guide

Structured Data

Type of Data Type of Query Type of Index Supported DB Engines
Numbers Equality, Range B-tree, Hash, GiST, BRIN, Expression, Skip, Zone Map PostgreSQL, MySQL, MSSQL, SQLite, ClickHouse, ScyllaDB, CockroachDB, Redshift, MariaDB, Snowflake, Oracle, RocksDB
Fixed Length String Equality, Range, Prefix search B-tree, Hash, Partial, Expression, Clustered, Non-clustered PostgreSQL, MySQL, MSSQL, MongoDB, SQLite, Couchbase, Oracle, MariaDB, DynamoDB
Category/Dimensions Equality, Membership Hash, B-tree, Partial, Multikey, TSI, TTL PostgreSQL, MongoDB, Couchbase, InfluxDB, TimescaleDB, ScyllaDB, Aerospike
Feature Vectors Similarity search GiST, SP-GiST, BKD Tree PostgreSQL, ScyllaDB, Elasticsearch
Sequences of DNA Sequence matching, Alignment GiST, SP-GiST PostgreSQL, ScyllaDB, Couchbase, MongoDB
Geo Info - Location and Coordinates Proximity search, Range R-tree, GiST, SP-GiST, Geospatial, BKD Tree PostgreSQL, MongoDB, MSSQL, Neo4j, Oracle, MySQL, Couchbase, ScyllaDB, Druid, Snowflake
Geometrical Figures Spatial queries R-tree, GiST, SP-GiST, Geospatial PostgreSQL, MySQL, MongoDB, Neo4j, Oracle
Time Series Data Range, Aggregation, Time-ordered search TSI, BRIN, Skip, Zone Map, Interleaved Sort Key, Expression, B-tree TimescaleDB, InfluxDB, CockroachDB, PostgreSQL, ScyllaDB, Druid, Redshift, ClickHouse, Snowflake, MongoDB, RocksDB, Aerospike
Geo Info (Time-based Data) Spatiotemporal queries Geospatial, R-tree, BRIN, GiST PostgreSQL, TimescaleDB, ScyllaDB, MongoDB, ClickHouse, Druid
Sparse Data Range, Missing values Sparse, Skip, Zone Map, BRIN MongoDB, PostgreSQL, Couchbase, ClickHouse
Time-to-Live Data Time-based expiration queries TTL Index MongoDB, Couchbase, ScyllaDB, Redis, InfluxDB
Temporal Data (Event-based Time Series) Event-based queries, Time windowing, Range aggregation TSI, BRIN, Zone Map, B-tree TimescaleDB, InfluxDB, Druid, ClickHouse, ScyllaDB
Array Data Membership, Containment, Exact match GIN, Multikey, Partial Index, B-tree PostgreSQL, MongoDB, Couchbase, Oracle
Set Data Membership, Overlap, Intersection GIN, Partial, Multikey Index, Hash PostgreSQL, MongoDB, Neo4j, Oracle
Time-partitioned Data Time-based partitioning for large datasets Time-based Partitioning, BRIN, Range Index, Zone Map TimescaleDB, ClickHouse, Druid, Redshift, InfluxDB
Wildcard Queries Prefix, Pattern matching Wildcard Index, Full-text, Hash, B-tree PostgreSQL, MongoDB, Elasticsearch, MySQL
Approximate Search Similarity, Approximate Matching GiST, BKD, Vector Index, HNSW PostgreSQL, ScyllaDB, Elasticsearch
Aggregation Queries Aggregation, Window functions Covering Index, Expression Index, TSI, BRIN, Zone Map PostgreSQL, Redshift, ClickHouse, Snowflake, TimescaleDB
Metadata Queries Querying metadata, properties of files and blobs Hash, B-tree, Partial Index, GIN, Multikey Index MongoDB, PostgreSQL, Couchbase, Oracle

Semi-Structured Data

Type of Data Type of Query Type of Index Supported DB Engines
Arbitrary Length Texts Full-text search GIN (Inverted), Full-text, Wildcard, Zone Map PostgreSQL, MSSQL, MongoDB, Elasticsearch, MySQL, Couchbase, Oracle, Neo4j, Snowflake
Documents - Tree-like Hierarchy Path traversal, Nested queries GIN, Partial, Multikey, Expression, Zone Map, B-tree MongoDB, PostgreSQL, Couchbase, Neo4j, Oracle
Big Binary Blobs (Images, Videos) Exact match, Similarity search GIN, GiST, Hash, Full-text PostgreSQL, MongoDB, Couchbase, Oracle, ScyllaDB
Documents with Nested Fields Querying nested fields, Aggregation Multikey, Partial, Expression, GIN, B-tree MongoDB, Couchbase, PostgreSQL, Neo4j, Oracle
JSON/JSONB (Semi-structured Data) Key-value queries, Path traversal, Nested object matching GIN, JSONB Path Index, Multikey, B-tree, Expression Index PostgreSQL, MongoDB, Couchbase, MySQL, Oracle
Graph Data (Nodes and Relationships) Traversal, Shortest path, Pattern matching Hash, B-tree, Graph Index Neo4j, ArangoDB, OrientDB, PostgreSQL (via extensions), Oracle
Hierarchical Data (Parent-child Relationships) Recursive queries, Hierarchical traversal B-tree, GiST, GIN, Hierarchical Index PostgreSQL, Oracle, Neo4j, ArangoDB, MongoDB
Spatial-temporal Queries Spatiotemporal Queries GiST, BRIN, R-tree, Spatiotemporal Index PostgreSQL, TimescaleDB, ScyllaDB, MongoDB
Do you need a help with schema design?
Capacity planning, query performance tuning, data migrations
Made on
Tilda