My ultimate SQL memo
The language of data
Why you need to learn what is possible in SQL when AI is everywhere
The majority of apps are built around data. We applaud (or pray) for data-driven decisions. And sometimes we have no idea what we actually store (or we think we know, which is probably even worse).
Why is AI not there (yet)?
AI has definitely raised the bar of expectations to a new level; sometimes I am shocked by people’s blind faith in it. The hope is comparable to a magic wand that retrieves precious pearls of insight from the deepest trenches of the data sea.
It is lovely to have the solid-looking UI of Google’s Looker or AWS Quick Insights, or a custom-made LLM with fine-tuned RAG on top of the company’s data. But what will its output be if we are dealing with raw data, with all its defects? What if those data come from different sources, where silly issues like mismatched timestamps (UTC vs. local time zone) can quite efficiently distort the whole picture? What about situations where identity resolution is only in the plans?

Conclusions based on malformed data lead to wasteful, expensive actions. As the good old saying goes: garbage in, garbage out.

One of the first steps to probe the data is EDA, exploratory data analysis, aka “What the hell am I dealing with?!”: an exciting journey which sometimes leads to amazing (and not-so-amazing) discoveries. How to quickly grasp the gist of the data? Compute common statistics and value distributions, which sometimes bring surprises.

And it is often an issue: recently I have done a lot of EDA and was surprised by how many things people tend to do outside of the data layer.

Why?

Because they do not know what is actually possible in good old SQL. For sure, SQL has its own issues (verbosity, challenges with testing), but for a quick probe of the data surface, what can be better? Pull something as CSV into pandas and play locally? (Not the worst idea, by the way!)
Moreover, many things already exist to help with it: given a prompt and a schema, TEXT2SQL or RAG2SQL can generate a query for you. NLQ (natural language query) engines are sometimes on par with seasoned DBAs.

What if the data is big? An LLM can generate PySpark code for you, as in the pyspark-ai project (especially if you share hints about the schema). You can combine it with unstruct, which tries to derive a schema from unstructured data and build the corresponding Extract-Transform-Load pipelines.

Automatic EDA is also possible: YData Profiling connects to a data source (a pandas or Spark dataframe) and computes standard metrics, which is priceless when the data is complete terra incognita, i.e. it is your first encounter.

But then questions arise: How are those data points produced? What does correct data look like? And, even more important, what does wrong data look like? What if there are several sources of data? How are they connected?

In all those cases you have to know what you can ask. How to ask is less important, especially now that we have a number of AI assistants.
What can SQL do?
I find myself too lazy to memorize exact syntax; instead I try to remember what is conceptually possible.

The examples below are not necessarily ANSI SQL; they work in PostgreSQL / Google BigQuery.

For vanilla SQL some specific tweaking might be necessary, but overall you can do the same things.

Let’s start with aggregate functions, the first step to get statistics over a dataset for EDA.
Percentiles are helpful to grasp the gist of a particular value distribution, while variance and standard deviation highlight how the data is spread:
SELECT 
  direction_id, 
  MIN(orders_per_sec) AS min_orders, 
  MAX(orders_per_sec) AS max_orders, 
  AVG(orders_per_sec) AS avg_orders,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY orders_per_sec) AS median_orders,
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY orders_per_sec) AS p25_orders,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY orders_per_sec) AS p75_orders,
  VAR_POP(orders_per_sec) AS population_variance,
  -- or VAR_SAMP(orders_per_sec) AS sample_variance,
  STDDEV_POP(orders_per_sec) AS population_std_dev
  -- or STDDEV_SAMP(orders_per_sec) AS sample_std_dev
FROM 
  order_counts 
GROUP BY 
  direction_id
A histogram-like view of the data distribution across bins is a great help for identifying outliers (in the example below, with the 3-sigma rule):
WITH stats AS (
    SELECT 
        AVG(numeric_column) AS avg_value,
        STDDEV(numeric_column) AS std_dev,
        MIN(numeric_column) AS min_value,
        MAX(numeric_column) AS max_value
    FROM your_table
),
bucketed_data AS (
    SELECT 
        *,
        WIDTH_BUCKET(
            numeric_column, 
            (SELECT min_value FROM stats), 
            (SELECT max_value FROM stats), 
            10  -- number of buckets
        ) AS bucket,
        (SELECT avg_value FROM stats) AS avg_value,
        (SELECT std_dev FROM stats) AS std_dev
    FROM your_table
)
SELECT 
    *,
    CASE 
        WHEN numeric_column > avg_value + 3 * std_dev THEN 'High Outlier'
        WHEN numeric_column < avg_value - 3 * std_dev THEN 'Low Outlier'
        ELSE 'Normal'
    END AS outlier_status
FROM bucketed_data
ORDER BY bucket, numeric_column;
Hierarchical aggregation is possible via the good old GROUP BY clause. For example, given the sales data:
make   | model   | sales
-------|---------|------
Toyota | Camry   | 500
Toyota | Camry   | 500
Toyota | Corolla | 300
Toyota | Corolla | 500
Honda  | Civic   | 400
Honda  | Civic   | 500
Honda  | Accord  | 600
Honda  | Accord  | 500
if we want the sum of sales per make, per model, and per their combinations, we use GROUPING and ROLLUP/CUBE:
SELECT 
  make, 
  model, 
  GROUPING(make, model),
  sum(sales)
FROM 
  items_sold
GROUP BY ROLLUP(make, model);
-- or CUBE if we also need subtotals per model (with make as NULL)
and then we can get:
make    | model    | GROUPING | sum(sales)
--------|----------|----------|-----------
Toyota  | Camry    | 0        | 1000
Toyota  | Corolla  | 0        |  800
Toyota  | NULL     | 1        | 1800
Honda   | Civic    | 0        |  900
Honda   | Accord   | 0        | 1100
Honda   | NULL     | 1        | 2000
NULL    | NULL     | 3        | 3800
Note that you can combine aggregate functions with conditions, or take into account only unique values:
COUNT(CASE WHEN fk_role = 1 AND fk_status = -13 THEN 1 END) AS removed_users,
AVG(CASE WHEN order_count > 1 THEN order_total ELSE NULL END),
COUNT(DISTINCT user_id) AS distinct_user_count
Want to check whether two or more variables are related? Compute the whole covariance matrix:
SELECT 
    group_id,
    COVAR_SAMP(x1, x1) AS cov_x1_x1,
    COVAR_SAMP(x1, x2) AS cov_x1_x2,
    COVAR_SAMP(x2, x2) AS cov_x2_x2
FROM 
    data_points
GROUP BY 
    group_id;
Some DB engines provide a way to compute linear-regression coefficients. Machine learning in SQL? Nah, just a simple check of advertising effectiveness per region, and whether it explains the sales numbers at all:
SELECT 
    region,
    REGR_SLOPE(sales, advertising_spent) AS slope,
    REGR_INTERCEPT(sales, advertising_spent) AS intercept,
    REGR_R2(sales, advertising_spent) AS r_squared
FROM 
    sales_data
GROUP BY 
    region;
You can use window functions to trace trends over data points. Rank regular patients (those with more than 5 appointments and more than 500 spent in total) by the number of appointments, or use DENSE_RANK, depending on how you want to treat ties:
SELECT 
    patient_id, 
    COUNT(appointment_id) AS total_appointments,
    SUM(amount) AS total_amount,
    RANK() OVER (ORDER BY COUNT(appointment_id) DESC) AS appointment_rank
    --, DENSE_RANK() OVER (...) AS appointment_rank
FROM 
    dental_appointments
GROUP BY 
    patient_id
HAVING 
    COUNT(appointment_id) > 5 
    AND SUM(amount) > 500;
number appointments per patient so that the most recent one gets rn = 1 (wrap this in a subquery and filter on rn = 1 to keep only the latest appointment per patient):
SELECT 
    patient_id, 
    appointment_id,
    appointment_date,
    ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY appointment_date DESC) AS rn
FROM 
    dental_appointments
WHERE 
    status = 'completed'
ORDER BY 
    patient_id, rn;
or, to see trends in appointment scheduling, show the previous and next appointment for every row:
SELECT 
    patient_id, 
    appointment_id, 
    appointment_date,
    LAG(appointment_date, 1) OVER (PARTITION BY patient_id ORDER BY appointment_date) AS previous_appointment,
    LEAD(appointment_date, 1) OVER (PARTITION BY patient_id ORDER BY appointment_date) AS next_appointment
FROM 
    dental_appointments
ORDER BY 
    patient_id, appointment_date;
Note the difference (see the sketch below):
  • OVER (ORDER BY date) — applies the function to the whole dataset, without limiting the columns in the result set the way a GROUP BY clause does
  • OVER (PARTITION BY customer_id ORDER BY order_date) — applies the function to the data in every bucket (= partition) independently
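A minimal sketch of the difference (assuming a hypothetical orders table with customer_id, order_id, order_date and amount columns):
SELECT
    customer_id,
    order_id,
    amount,
    SUM(amount) OVER ()                          AS total_all_orders,     -- same grand total repeated on every row
    SUM(amount) OVER (PARTITION BY customer_id)  AS total_per_customer,   -- one total per customer, repeated on their rows
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total_per_customer
FROM
    orders;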
You can also compute where each value sits within its distribution with CUME_DIST. Note that rank-type window functions (CUME_DIST, RANK, etc.) ignore the frame clause and always look at the whole partition; if you want a true moving window, e.g. the last 30 rows, use an aggregate with a frame, like the moving average below:
SELECT 
    date,
    stock_symbol,
    price,
    CUME_DIST() OVER (
        PARTITION BY stock_symbol
        ORDER BY price
    ) AS price_position,
    AVG(price) OVER (
        PARTITION BY stock_symbol
        ORDER BY date
        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS moving_avg_30
FROM 
    stock_prices
ORDER BY 
    stock_symbol, 
    date;
Subqueries exist not only to create confusing code; for example, you can copy a row while tweaking specific values:
INSERT INTO my_table_name (
  field_1, 
  ... , 
  field_n
) 
SELECT          -- 1
  field_1,
  ... ,
  field_n * 10  -- 2
FROM my_table_name
WHERE id = (    -- 3 
  SELECT id 
  FROM my_table_name
  WHERE name = 'John' 
    AND last_name = 'Doe'
  LIMIT 1
) AND
NOT EXISTS (    -- 4
    SELECT 
      1 
    FROM 
      some_other_table WHERE fk_key_field = 42  -- 5
)
  • 1 — using a SELECT as the source of the INSERT, picking the fields we need
  • 2 — changing the original value on the way
  • 3 — using a subquery to select the id of the source row
  • 4 — using a subquery to check that a precondition is met before insertion
  • 5 — why 42? just an arbitrary example value
Subqueries have some restrictions though.
If the boilerplate of subqueries feels too scary, it is time to bring CTEs (Common Table Expressions) into play: organize the query in a bit more procedural way and split your data wrangling into stages. Divide and conquer (conquer both complexity and your data!).
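A minimal sketch of that staged style (payments is a hypothetical table): each CTE is one step, and the final SELECT reads almost like a pipeline:
WITH cleaned AS (                          -- stage 1: throw away junk rows
    SELECT user_id, amount, created_at
    FROM payments
    WHERE amount > 0
),
daily AS (                                 -- stage 2: aggregate per user and day
    SELECT user_id, date(created_at) AS day, SUM(amount) AS daily_total
    FROM cleaned
    GROUP BY user_id, date(created_at)
)
SELECT                                     -- stage 3: the final metric
    user_id,
    AVG(daily_total) AS avg_daily_spend
FROM daily
GROUP BY user_id;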
A recursive query is your friend if you are dealing with hierarchical data: when entities inside a table reference each other, there is no need (yet) for a dedicated mapping table and data migrations. Yep, just plain tree or connection-graph traversal (on a not-so-perfect schema).
--  descendants
--  id   parent_id   name
--  1    NULL        John
--  2    1           Mary

WITH RECURSIVE family_tree AS (
    -- Base case: start with the root ancestor
    SELECT 
        id, 
        parent_id, 
        name,
        0 AS generation -- Root starts at generation 0
    FROM 
        descendants
    WHERE 
        id = 1 -- Starting with John (id = 1)

    UNION ALL

    -- Recursive case: find all descendants of the previous level
    SELECT 
        d.id, 
        d.parent_id, 
        d.name,
        ft.generation + 1 AS generation -- Increment generation for each descendant
    FROM 
        descendants d
    INNER JOIN 
        family_tree ft ON d.parent_id = ft.id
)
-- Final selection: retrieve the complete family tree
SELECT 
    id, 
    parent_id, 
    name, 
    generation
FROM 
    family_tree
ORDER BY 
    generation, id;
The set operations are similar to ordinary boolean logic: UNION ~ OR (though UNION ALL allows duplicates), INTERSECT ~ AND, and EXCEPT ~ difference (AND NOT):
A = 13  # Binary: 1101
B = 6   # Binary: 0110

# A AND NOT B
result = A & ~B # 9 (Binary: 1001)
The schemas of the combined queries should match, but you can cast types, use aliases and shape the data as you need: un-nest a specific field from a struct in an array inside a JSON column, or pivot columns to rows and vice versa.
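For example (a sketch with hypothetical tables): find users present in the CRM export but missing from the billing system, aligning column types and names with casts and aliases:
SELECT crm.user_id::text AS user_key, lower(crm.email) AS email
FROM crm_users crm
EXCEPT
SELECT bill.external_id   AS user_key, lower(bill.email) AS email
FROM billing_accounts bill;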

So, let’s say I have a couple of tables:
  • cities with zip codes
  • clinic details
  • a list of treatments in demand
  • appointment details

And what I want to understand is the demand distribution across different parts of the city:
SELECT
    ar.request_details->>'clinic_id' AS unique_clinic_id,           -- 1
    array_agg(DISTINCT t.name) AS treatment_names,                  -- 2
    ar.request_details->>'treatment_date' AS treatment_date,        
    ar.request_details->>'appointment_time' AS appointment_time,    
    date(ar.created_at) AS created_date,                            -- 3
    ar.patient_id,
    zc.city,
    zc.zip_code
FROM
    appointment_requests ar
JOIN
    clinics c 
    ON
        ar.request_details->>'clinic_id' = c.clinic_id::text        -- 4
JOIN
    zip_codes zc ON c.zip_code_id = zc.zip_code_id
LEFT JOIN LATERAL                                                   -- 5
    jsonb_array_elements_text(                                      -- 6
      ar.request_details->'required_treatments'
    ) AS req_treatment(treatment_id)
    ON true
LEFT JOIN
    treatments t ON t.treatment_id = req_treatment.treatment_id::integer
WHERE
    EXTRACT(EPOCH FROM (ar.updated_at - ar.created_at)) > 60*60     -- 7
    AND ar.status = 'pending'
    AND ar.patient_id NOT IN (-1, 20000)
GROUP BY
    ar.request_details->>'clinic_id',
    ar.request_details->>'treatment_date',
    ar.request_details->>'appointment_time',
    date(ar.created_at),
    ar.patient_id,
    zc.city,
    zc.zip_code
;
  • 1 — we extract fields from the JSON structure
  • 2 — we combine all unique values into an array
  • 3 — we extract the date from a timestamp
  • 4 — we use a field from the JSON in a join condition, casting the other side to text
  • 5 — we expand the JSON array from appointment_requests via a lateral join…
  • 6 — …into independent rows, one per treatment_id in the array
  • 7 — we keep only requests that were updated more than an hour after creation
JOINing tables might not be the most performant of operations, but in the 3NF world it is the common tool to organize data. A simple illustration of where rows end up for INNER/LEFT/RIGHT and FULL OUTER joins:
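For a quick feel, a sketch with hypothetical tables a and b, both having an id column:
SELECT a.id, a.val AS a_val, b.val AS b_val
FROM a
INNER JOIN b ON a.id = b.id;        -- only ids present in both tables

SELECT a.id, a.val AS a_val, b.val AS b_val
FROM a
LEFT JOIN b ON a.id = b.id;         -- every row of a; b columns are NULL where there is no match
-- RIGHT JOIN is the mirror image: every row of b

SELECT a.id, a.val AS a_val, b.val AS b_val
FROM a
FULL OUTER JOIN b ON a.id = b.id;   -- rows from both sides; NULLs wherever the other side is missing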
Just bear in mind the possibility of duplicates when joining multiple tables, in situations where you have a main entity and several related tables with one-to-many relationships:

Let’s say we have a user with:
 — 2 occupations
 — 3 addresses
 — 2 phone numbers

and this naive query gets you a Cartesian product of those multiple entries = 2 (occupations) × 3 (addresses) × 2 (phone numbers) = 12 rows for this single user:
SELECT
  u.name,
  o.occupation,
  a.address,
  p.phone_number
FROM
  users u
JOIN occupations o ON u.id = o.user_id
JOIN addresses a ON u.id = a.user_id
JOIN phone_numbers p ON u.id = p.user_id;
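One way out (a sketch on the same hypothetical tables) is to aggregate each one-to-many table on its own before joining, so each user contributes exactly one row per side:
SELECT
  u.name,
  o.occupations,
  a.addresses,
  p.phone_numbers
FROM users u
LEFT JOIN (
  SELECT user_id, array_agg(occupation) AS occupations
  FROM occupations
  GROUP BY user_id
) o ON u.id = o.user_id
LEFT JOIN (
  SELECT user_id, array_agg(address) AS addresses
  FROM addresses
  GROUP BY user_id
) a ON u.id = a.user_id
LEFT JOIN (
  SELECT user_id, array_agg(phone_number) AS phone_numbers
  FROM phone_numbers
  GROUP BY user_id
) p ON u.id = p.user_id;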
Also don’t forget about self joins, for cases like finding duplicates, searching for next/previous values, or spotting missing entries in sequential data!
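For instance (a sketch, assuming the same hypothetical users table has an email column), a self join can surface users registered twice with the same email:
SELECT u1.id, u2.id, u1.email
FROM users u1
JOIN users u2
  ON u1.email = u2.email
 AND u1.id < u2.id;   -- the inequality reports each duplicate pair only once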

Just bear in mind what is possible; you can always look up the syntax when needed (and it will stick with you through repetition, i.e. when it is really needed).
SQL methods and functions memo
Math

  • bitwise operations, cube root and exponent
  • logarithms and radians-to-degrees conversion
  • random (for sampling and shuffling, arrays included)
  • width_bucket — fills histogram bins for you
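A quick taste in PostgreSQL (literal values only, so it runs as-is):
SELECT
  5 & 3         AS bitwise_and,   -- 1
  cbrt(27)      AS cube_root,     -- 3
  exp(1)        AS e,
  ln(100)       AS natural_log,
  degrees(pi()) AS half_turn,     -- 180
  random()      AS uniform_0_1;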
String

All the tools that you need for data cleansing:
  • trim/substring + lpad/rpad — replace characters at specific positions
  • concat and encoding manipulation, including binary and md5/sha hash computation
  • format — an sprintf-like function
  • slicing with left/right and reverse
  • regexp — replace, split, count and match
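A few of them in PostgreSQL, applied to literal strings:
SELECT
  trim('  padded  ')                          AS trimmed,
  lpad('42', 5, '0')                          AS zero_padded,     -- '00042'
  format('%s scored %s', 'Alice', 95)         AS formatted,
  left('2024-05-01', 4)                       AS year_part,       -- '2024'
  md5('hello')                                AS md5_hash,
  regexp_replace('a1b2c3', '[0-9]', '', 'g')  AS digits_removed;  -- 'abc'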
Date

  • age, date_bin, date_part, justify_days or justify_hours
  • day of week, day of year or week of year — a lot of fun to do manually for leap years
  • overlap between date ranges
  • and, of course, time zones
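Again in PostgreSQL, on literals:
SELECT
  age(timestamp '2024-01-01', timestamp '1990-06-15')     AS age_interval,
  extract(isodow FROM date '2024-03-01')                  AS day_of_week,     -- 5 = Friday
  date_part('week', date '2024-03-01')                    AS week_of_year,
  (date '2024-03-01', date '2024-03-10')
    OVERLAPS (date '2024-03-05', date '2024-03-20')       AS ranges_overlap,  -- true
  timezone('UTC', timestamptz '2024-03-01 12:00:00+02')   AS back_to_utc;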
Geometry

  • point/box/circle types as first-class citizens
  • closest point to an object, or the distance between them
  • overlap and inclusion
  • area, diagonal or bounding box
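In PostgreSQL the geometric types and operators look like this (literals again):
SELECT
  point '(1,2)' <-> point '(4,6)'             AS distance,        -- 5
  circle '((0,0),3)' @> point '(1,1)'         AS point_in_circle, -- true
  box '((0,0),(2,3))' && box '((1,1),(5,5))'  AS boxes_overlap,   -- true
  area(box '((0,0),(2,3))')                   AS box_area;        -- 6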
IP Address

  • check whether one subnet contains another, or compute the netmask
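For example, with PostgreSQL’s inet/cidr types:
SELECT
  inet '192.168.1.0/24' >> inet '192.168.1.42'  AS subnet_contains_host,    -- true
  netmask(inet '192.168.1.42/24')               AS mask,                    -- 255.255.255.0
  cidr '10.0.0.0/8' >>= cidr '10.1.0.0/16'      AS subnet_contains_subnet;  -- true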
Json

  • extracting elements from a JSON array or from nested keys
  • checking whether one JSON document is contained in another
  • row to JSON, or unnesting JSON into key/value rows
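A small sketch in PostgreSQL, on a jsonb literal:
SELECT
  j -> 'items' -> 0 ->> 'sku'   AS first_sku,   -- element of a nested array
  j @> '{"status": "paid"}'     AS is_paid,     -- containment check
  kv.key,
  kv.value
FROM (SELECT '{"status": "paid", "items": [{"sku": "A1"}, {"sku": "B2"}]}'::jsonb AS j) AS t,
     jsonb_each(j) AS kv(key, value);           -- unnest top-level keys into rows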
Array

  • inclusion checks, min/max, unnesting, appending
  • ranges
  • generate_series — integers or dates with a step:
SELECT * FROM generate_series('2008-03-01 00:00'::timestamp,
                              '2008-03-04 12:00', '10 hours');
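And a couple of array tricks (PostgreSQL, literals only):
SELECT
  3 = ANY (ARRAY[1, 2, 3])                              AS contains_three,  -- true
  array_append(ARRAY[1, 2], 3)                          AS appended,        -- {1,2,3}
  (SELECT max(t.x) FROM unnest(ARRAY[4, 7, 1]) AS t(x)) AS max_element,     -- 7
  ARRAY(SELECT generate_series(1, 5, 2))                AS odd_numbers;     -- {1,3,5}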
Treat GenAI as a pairing buddy with an encyclopedic memory and the confidence of someone under the Dunning–Kruger effect when it simply makes up answers. You know what you want from your data; you have to learn what is possible! (And ask the AI to do the boring work of typing.)