Database indexes

Which type of database index you should use? Lets dive into zoo of available index types in different database engines and their corresponding use cases.

So, you have some data. Most likely you want to work with it - i.e. retrieve information according to some conditions = filtering what is important for you. Probably, even change the shape of this information - transform the data or get summary over it - i.e. apply some aggregations.

Usual question that arise - how to organize data schema - which db engine and indexes to choose.

Here I want to provide a brief overview of different kinds of indexes available in modern databases.

When data volume is small - i.e. less than a few millions of rows - it usually doesn't really matter - even fullscan will be sufficient for the majority of use cases.

With growth of data and load = number of read and write operations - it becomes paramount to structure your data in a way that facilitates processing without the need to read every row from the disk.

Data might be stored just in RAM memory or have a persistent layer at disk. You might have different requirements for data access depending on its age - i.e. ability to quickly deep dive into data from the last 3 months and keep 5 years worth of data in cold storage to comply with regulatory rules.

Data can take different shapes:

numbers of various sizes and precisions from uint8 to float point with configurable precision,
enum-like categories from fixed list of possible dimensions,
sequences of DNA or community relationships
tree-like hierarchy with number of nested fields
big binary blobs - like images or video
geometrical figures - line's segments or polygons
gis info - locations and coordinates
text fields - fixed length strings with hash values or variable length texts, with some articles
simple key-values with types that you know upfront
complex documents, where you have no idea which attributes provider might add next day

What is index anyway?

To simplify, indexes, at the end - are search trees and hash maps.

Data are tend to it evolve over time - new fields or dimensions might be added, old - change their type or cardinality or just disappear.

Query - mean you specify some filter criteria to pull only necessary data points. Filter criterias = we need to perform search or lookups. Hence data is organized in some structures that facilitate its retrieval according to specified conditions.

Golden rule of data schema design - define your schema around queries.

Let's dive into the diversity of available indexes - how data type and query type affect choice of index type (and sometimes even database engine).

B-Tree, and in many case its ancestor - B+-Tree is the most universal type of index - used for majority RDBMS databases as a default index type (from SQLite to MS SQL) as well as for NoSQL (from ScyllaDB to Aerospike).

Used to filter based on equality (==, <>, !=), compare - (>, <,>=, <=) and range conditions - BETWEEN.

Can be used with almost any data type - from numbers and strings, arrays or fields inside nested structures.

Hash based index is ideal choice for cases when you need to check for exact match - equality conditions (==, <>, !=) - faster then B-Tree based indexes.

Fit almost any simple data type, but not suitable for composite entities, nested documents or collections, or cases where more complex filtering condition is required - prefix or range searches.

Variations of inverse indexes are handy in case we work with textual data. Specialized engines - Sphinx or Elasticsearch can go the extra mile with stem and n-grams. And in case your data can be expressed as a numerical vector = feature vectors - ANN/KNN kind of queries can be used to rank results in terms of distance to target queries - to express similarity relationship.

Spatial data - geometry figures or locations require special treatment and better suited for indexes tailored for typical kind of queries - proximity search, bounding box or containment tasks served best with multidimensional trees.

It is possible to create indexes on top of results for some functions execution - `upper` for strings or convert epoch based number into timestamp - using expression or function based indexes.

Collections - arrays or sets usually require various logic to check for overlaps or membership as well as checking specific elements is present - for that best suited multikey or inverted indexes - but sometimes expression based indexes can be handy as well.

Semi-structural data - when we are not sure about schema of documents and every next entry might have some deviations - best served with indexes from these groups as well as at the end it is an intermix of array, text and probably expression and vector representations.

Graph and hierarchical relationship can be expressed in RDBMS terms - at the end all of it are vertices and edges - but for efficient traversals special types of indexes might be much more beneficial - Graph or Hierarchical.

OLAP and columnar type of database engines, introduce various tricks to efficiently deal with big data volume across multiple dimensions - mainly to skip non-relevant data partitions and be able to get a rough understanding of data statistics for a chunk of storage.

Wide-column NoSQL engines tend to rely on secondary indexes, sometimes stored locally within data nodes to facilitate filtering of non-primary keys or take advantages of sorting order.

Index Types and Supported DB Engines

Database Index Selection Guide

Structured Data

Type of Data	Type of Query	Type of Index	Supported DB Engines
Numbers	Equality, Range	B-tree, Hash, GiST, BRIN, Expression, Skip, Zone Map	PostgreSQL, MySQL, MSSQL, SQLite, ClickHouse, ScyllaDB, CockroachDB, Redshift, MariaDB, Snowflake, Oracle, RocksDB
Fixed Length String	Equality, Range, Prefix search	B-tree, Hash, Partial, Expression, Clustered, Non-clustered	PostgreSQL, MySQL, MSSQL, MongoDB, SQLite, Couchbase, Oracle, MariaDB, DynamoDB
Category/Dimensions	Equality, Membership	Hash, B-tree, Partial, Multikey, TSI, TTL	PostgreSQL, MongoDB, Couchbase, InfluxDB, TimescaleDB, ScyllaDB, Aerospike
Feature Vectors	Similarity search	GiST, SP-GiST, BKD Tree	PostgreSQL, ScyllaDB, Elasticsearch
Sequences of DNA	Sequence matching, Alignment	GiST, SP-GiST	PostgreSQL, ScyllaDB, Couchbase, MongoDB
Geo Info - Location and Coordinates	Proximity search, Range	R-tree, GiST, SP-GiST, Geospatial, BKD Tree	PostgreSQL, MongoDB, MSSQL, Neo4j, Oracle, MySQL, Couchbase, ScyllaDB, Druid, Snowflake
Geometrical Figures	Spatial queries	R-tree, GiST, SP-GiST, Geospatial	PostgreSQL, MySQL, MongoDB, Neo4j, Oracle
Time Series Data	Range, Aggregation, Time-ordered search	TSI, BRIN, Skip, Zone Map, Interleaved Sort Key, Expression, B-tree	TimescaleDB, InfluxDB, CockroachDB, PostgreSQL, ScyllaDB, Druid, Redshift, ClickHouse, Snowflake, MongoDB, RocksDB, Aerospike
Geo Info (Time-based Data)	Spatiotemporal queries	Geospatial, R-tree, BRIN, GiST	PostgreSQL, TimescaleDB, ScyllaDB, MongoDB, ClickHouse, Druid
Sparse Data	Range, Missing values	Sparse, Skip, Zone Map, BRIN	MongoDB, PostgreSQL, Couchbase, ClickHouse
Time-to-Live Data	Time-based expiration queries	TTL Index	MongoDB, Couchbase, ScyllaDB, Redis, InfluxDB
Temporal Data (Event-based Time Series)	Event-based queries, Time windowing, Range aggregation	TSI, BRIN, Zone Map, B-tree	TimescaleDB, InfluxDB, Druid, ClickHouse, ScyllaDB
Array Data	Membership, Containment, Exact match	GIN, Multikey, Partial Index, B-tree	PostgreSQL, MongoDB, Couchbase, Oracle
Set Data	Membership, Overlap, Intersection	GIN, Partial, Multikey Index, Hash	PostgreSQL, MongoDB, Neo4j, Oracle
Time-partitioned Data	Time-based partitioning for large datasets	Time-based Partitioning, BRIN, Range Index, Zone Map	TimescaleDB, ClickHouse, Druid, Redshift, InfluxDB
Wildcard Queries	Prefix, Pattern matching	Wildcard Index, Full-text, Hash, B-tree	PostgreSQL, MongoDB, Elasticsearch, MySQL
Approximate Search	Similarity, Approximate Matching	GiST, BKD, Vector Index, HNSW	PostgreSQL, ScyllaDB, Elasticsearch
Aggregation Queries	Aggregation, Window functions	Covering Index, Expression Index, TSI, BRIN, Zone Map	PostgreSQL, Redshift, ClickHouse, Snowflake, TimescaleDB
Metadata Queries	Querying metadata, properties of files and blobs	Hash, B-tree, Partial Index, GIN, Multikey Index	MongoDB, PostgreSQL, Couchbase, Oracle

Semi-Structured Data

Type of Data	Type of Query	Type of Index	Supported DB Engines
Arbitrary Length Texts	Full-text search	GIN (Inverted), Full-text, Wildcard, Zone Map	PostgreSQL, MSSQL, MongoDB, Elasticsearch, MySQL, Couchbase, Oracle, Neo4j, Snowflake
Documents - Tree-like Hierarchy	Path traversal, Nested queries	GIN, Partial, Multikey, Expression, Zone Map, B-tree	MongoDB, PostgreSQL, Couchbase, Neo4j, Oracle
Big Binary Blobs (Images, Videos)	Exact match, Similarity search	GIN, GiST, Hash, Full-text	PostgreSQL, MongoDB, Couchbase, Oracle, ScyllaDB
Documents with Nested Fields	Querying nested fields, Aggregation	Multikey, Partial, Expression, GIN, B-tree	MongoDB, Couchbase, PostgreSQL, Neo4j, Oracle
JSON/JSONB (Semi-structured Data)	Key-value queries, Path traversal, Nested object matching	GIN, JSONB Path Index, Multikey, B-tree, Expression Index	PostgreSQL, MongoDB, Couchbase, MySQL, Oracle
Graph Data (Nodes and Relationships)	Traversal, Shortest path, Pattern matching	Hash, B-tree, Graph Index	Neo4j, ArangoDB, OrientDB, PostgreSQL (via extensions), Oracle
Hierarchical Data (Parent-child Relationships)	Recursive queries, Hierarchical traversal	B-tree, GiST, GIN, Hierarchical Index	PostgreSQL, Oracle, Neo4j, ArangoDB, MongoDB
Spatial-temporal Queries	Spatiotemporal Queries	GiST, BRIN, R-tree, Spatiotemporal Index	PostgreSQL, TimescaleDB, ScyllaDB, MongoDB

Do you need a help with schema design?

Capacity planning, query performance tuning, data migrations

Book a free 30 min call with us!