My ultimate SQL memo
The language of data
Why you need to learn what is possible in SQL when AI is everywhere
The majority of apps are built around data. We applaud (or pray) for data-driven decisions. And sometimes we have no idea what we actually store (or we think we know, which is probably even worse).
Why is AI not there (yet)?
AI has definitely raised the bar of expectations to a new level; sometimes I am shocked by people’s blind faith in it. The hope is comparable to a magic wand that retrieves precious pearls of insight from the deepest trenches of the data sea.
It is lovely to have the solid-looking UI of Google’s Looker or AWS Quick Insights, or a custom-made LLM with fine-tuned RAG on top of the company’s data. But what will its output be if we are dealing with raw data, with all its defects? What if those data come from different sources, where silly issues like mismatched timestamps (UTC vs. local time zone) can quite efficiently distort the whole picture? What about situations where identity resolution is only in the plans?

Conclusions based on malformed data lead to wasteful, expensive actions. As the good old saying goes: garbage in, garbage out.

One of the first steps to probe the data is EDA, exploratory data analysis, aka “What the hell am I dealing with?!”: an exciting journey which sometimes leads to amazing (and not-so-amazing) discoveries. How to quickly grasp the gist of the data? Compute common statistics and value distributions, which sometimes bring surprises.

And it is often an issue: recently I have done a lot of EDA and was surprised by how many things people tend to do outside of the data layer.

Why?

Because they do not know what is actually possible in good old SQL. For sure, SQL has its own issues (verbosity, challenges with testing), but for a quick probe of the data surface, what can be better? Pull something as CSV into pandas and play locally? (Not the worst idea, by the way!)
Moreover, many things already exist to help with it: given a prompt and a schema, TEXT2SQL or RAG2SQL can generate a query for you. NLQ (natural language query) engines are sometimes on par with seasoned DBAs.

What if the data is big? An LLM can generate PySpark code for you, as in the pyspark-ai project (especially if you share hints about the schema). You can combine it with unstruct, which tries to derive a schema from unstructured data and build the corresponding Extract-Transform-Load pipelines.

Automatic EDA is also possible: YData Profiling connects to a data source (a pandas or Spark dataframe) and computes standard metrics, which is priceless when the data is complete terra incognita, i.e. it is your first encounter.

But then questions arise: How are those data points produced? What does correct data look like? And, even more important, what does wrong data look like? What if there are several sources of data? How are they connected?

In all those cases you have to know what you can ask. How to ask is less important, especially now that we have a number of AI assistants.
What can SQL do?
I find myself too lazy to memorize exact syntax; instead I try to remember what is conceptually possible.

The examples below are not necessarily ANSI SQL; they work in PostgreSQL / Google BigQuery.

For vanilla SQL some specific tweaking might be necessary, but overall you can do the same things.

Let’s start with aggregate functions, the first step to get statistics over a dataset for EDA.
Percentiles are helpful to grasp the gist of a particular value distribution, while variance and standard deviation highlight how the data is spread:
SELECT 
  direction_id, 
  MIN(orders_per_sec) AS min_orders, 
  MAX(orders_per_sec) AS max_orders, 
  AVG(orders_per_sec) AS avg_orders,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY orders_per_sec) AS median_orders,
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY orders_per_sec) AS p25_orders,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY orders_per_sec) AS p75_orders,
  VAR_POP(orders_per_sec) AS population_variance,
  -- or VAR_SAMP(orders_per_sec) AS sample_variance,
  STDDEV_POP(orders_per_sec) AS population_std_dev
  -- or STDDEV_SAMP(orders_per_sec) AS sample_std_dev
FROM 
  order_counts 
GROUP BY 
  direction_id
A histogram-like view of the data distribution across bins is a great help for identifying outliers (in the example below, with the 3-sigma rule):
WITH stats AS (
    SELECT 
        AVG(numeric_column) AS avg_value,
        STDDEV(numeric_column) AS std_dev,
        MIN(numeric_column) AS min_value,
        MAX(numeric_column) AS max_value
    FROM your_table
),
bucketed_data AS (
    SELECT 
        *,
        WIDTH_BUCKET(
            numeric_column, 
            (SELECT min_value FROM stats), 
            (SELECT max_value FROM stats), 
            10  -- number of buckets
        ) AS bucket,
        (SELECT avg_value FROM stats) AS avg_value,
        (SELECT std_dev FROM stats) AS std_dev
    FROM your_table
)
SELECT 
    *,
    CASE 
        WHEN numeric_column > avg_value + 3 * std_dev THEN 'High Outlier'
        WHEN numeric_column < avg_value - 3 * std_dev THEN 'Low Outlier'
        ELSE 'Normal'
    END AS outlier_status
FROM bucketed_data
ORDER BY bucket, numeric_column;
Hierarchical aggregation is possible via the good old GROUP BY clause. For example, given the sales data:
make   | model   | sales
-------|---------|------
Toyota | Camry   | 500
Toyota | Camry   | 500
Toyota | Corolla | 300
Toyota | Corolla | 500
Honda  | Civic   | 400
Honda  | Civic   | 500
Honda  | Accord  | 600
Honda  | Accord  | 500
if we want the sum of sales per make, per model, and per their combinations, we use GROUPING and ROLLUP/CUBE:
SELECT 
  make, 
  model, 
  GROUPING(make, model),
  sum(sales)
FROM 
  items_sold
GROUP BY ROLLUP(make, model);
-- or CUBE if we also need subtotals per model (with make as NULL)
and then we can get:
make    | model    | GROUPING | sum(sales)
--------|----------|----------|-----------
Toyota  | Camry    | 0        | 1000
Toyota  | Corolla  | 0        |  800
Toyota  | NULL     | 1        | 1800
Honda   | Civic    | 0        |  900
Honda   | Accord   | 0        | 1100
Honda   | NULL     | 1        | 2000
NULL    | NULL     | 3        | 3800
Note that you can combine aggregate functions with conditions, or take into account only unique values:
COUNT(CASE WHEN fk_role = 1 AND fk_status = -13 THEN 1 END) AS removed_users,
AVG(CASE WHEN order_count > 1 THEN order_total ELSE NULL END),
COUNT(DISTINCT user_id) AS distinct_user_count
Want to check whether two or more variables are related? Compute the whole covariance matrix:
SELECT 
    group_id,
    COVAR_SAMP(x1, x1) AS cov_x1_x1,
    COVAR_SAMP(x1, x2) AS cov_x1_x2,
    COVAR_SAMP(x2, x2) AS cov_x2_x2
FROM 
    data_points
GROUP BY 
    group_id;
Some DB engines provide a way to compute linear-regression coefficients. Machine learning in SQL? Nah, just a simple check of advertising effectiveness per region, and whether it explains the sales numbers at all:
SELECT 
    region,
    REGR_SLOPE(sales, advertising_spent) AS slope,
    REGR_INTERCEPT(sales, advertising_spent) AS intercept,
    REGR_R2(sales, advertising_spent) AS r_squared
FROM 
    sales_data
GROUP BY 
    region;
You can use window functions to trace trends over data points. Rank regular patients (those with more than 5 appointments and more than 500 spent in total) by the number of appointments, or use DENSE_RANK, depending on how you want to treat ties:
SELECT 
    patient_id, 
    COUNT(appointment_id) AS total_appointments,
    SUM(amount) AS total_amount,
    RANK() OVER (ORDER BY COUNT(appointment_id) DESC) AS appointment_rank
    --, DENSE_RANK() OVER (...) AS appointment_rank
FROM 
    dental_appointments
GROUP BY 
    patient_id
HAVING 
    COUNT(appointment_id) > 5 
    AND SUM(amount) > 500;
number appointments per patient so that the most recent one gets rn = 1 (wrap this in a subquery and filter on rn = 1 to keep only the latest appointment per patient):
SELECT 
    patient_id, 
    appointment_id,
    appointment_date,
    ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY appointment_date DESC) AS rn
FROM 
    dental_appointments
WHERE 
    status = 'completed'
ORDER BY 
    patient_id, rn;
or, to see trends in appointment scheduling, show the previous and next appointment for every row:
SELECT 
    patient_id, 
    appointment_id, 
    appointment_date,
    LAG(appointment_date, 1) OVER (PARTITION BY patient_id ORDER BY appointment_date) AS previous_appointment,
    LEAD(appointment_date, 1) OVER (PARTITION BY patient_id ORDER BY appointment_date) AS next_appointment
FROM 
    dental_appointments
ORDER BY 
    patient_id, appointment_date;
Note the difference (see the sketch below):
  • OVER (ORDER BY date) — applies the function to the whole dataset, without limiting the columns in the result set the way a GROUP BY clause does
  • OVER (PARTITION BY customer_id ORDER BY order_date) — applies the function to the data in every bucket (= partition) independently
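A minimal sketch of the difference (assuming a hypothetical orders table with customer_id, order_id, order_date and amount columns):
SELECT
    customer_id,
    order_id,
    amount,
    SUM(amount) OVER ()                          AS total_all_orders,     -- same grand total repeated on every row
    SUM(amount) OVER (PARTITION BY customer_id)  AS total_per_customer,   -- one total per customer, repeated on their rows
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total_per_customer
FROM
    orders;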
You can also compute where each value sits within its distribution with CUME_DIST. Note that rank-type window functions (CUME_DIST, RANK, etc.) ignore the frame clause and always look at the whole partition; if you want a true moving window, e.g. the last 30 rows, use an aggregate with a frame, like the moving average below:
SELECT 
    date,
    stock_symbol,
    price,
    CUME_DIST() OVER (
        PARTITION BY stock_symbol
        ORDER BY price
    ) AS price_position,
    AVG(price) OVER (
        PARTITION BY stock_symbol
        ORDER BY date
        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS moving_avg_30
FROM 
    stock_prices
ORDER BY 
    stock_symbol, 
    date;
Subqueries exist not only to create confusing code; for example, you can copy a row while tweaking specific values:
INSERT INTO my_table_name (
  field_1, 
  ... , 
  field_n
) 
SELECT          -- 1
  field_1,
  ... ,
  field_n * 10  -- 2
FROM my_table_name
WHERE id = (    -- 3 
  SELECT id 
  FROM my_table_name
  WHERE name = 'John' 
    AND last_name = 'Doe'
  LIMIT 1
) AND
NOT EXISTS (    -- 4
    SELECT 
      1 
    FROM 
      some_other_table WHERE fk_key_field = 42  -- 5
)
  • 1 — using a SELECT as the source of the INSERT, picking the fields we need
  • 2 — changing the original value on the way
  • 3 — using a subquery to select the id of the source row
  • 4 — using a subquery to check that a precondition is met before insertion
  • 5 — why 42? just an arbitrary example value
Subqueries have some restrictions though.
If the boilerplate of subqueries feels too scary, it is time to bring CTEs (Common Table Expressions) into play: organize the query in a bit more procedural way and split your data wrangling into stages. Divide and conquer (conquer both complexity and your data!).
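A minimal sketch of that staged style (payments is a hypothetical table): each CTE is one step, and the final SELECT reads almost like a pipeline:
WITH cleaned AS (                          -- stage 1: throw away junk rows
    SELECT user_id, amount, created_at
    FROM payments
    WHERE amount > 0
),
daily AS (                                 -- stage 2: aggregate per user and day
    SELECT user_id, date(created_at) AS day, SUM(amount) AS daily_total
    FROM cleaned
    GROUP BY user_id, date(created_at)
)
SELECT                                     -- stage 3: the final metric
    user_id,
    AVG(daily_total) AS avg_daily_spend
FROM daily
GROUP BY user_id;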
A recursive query is your friend if you are dealing with hierarchical data: when entities inside a table reference each other, there is no need (yet) for a dedicated mapping table and data migrations. Yep, just plain tree or connection-graph traversal (on a not-so-perfect schema).
--  descendants
--  id   parent_id   name
--  1    NULL        John
--  2    1           Mary

WITH RECURSIVE family_tree AS (
    -- Base case: start with the root ancestor
    SELECT 
        id, 
        parent_id, 
        name,
        0 AS generation -- Root starts at generation 0
    FROM 
        descendants
    WHERE 
        id = 1 -- Starting with John (id = 1)

    UNION ALL

    -- Recursive case: find all descendants of the previous level
    SELECT 
        d.id, 
        d.parent_id, 
        d.name,
        ft.generation + 1 AS generation -- Increment generation for each descendant
    FROM 
        descendants d
    INNER JOIN 
        family_tree ft ON d.parent_id = ft.id
)
-- Final selection: retrieve the complete family tree
SELECT 
    id, 
    parent_id, 
    name, 
    generation
FROM 
    family_tree
ORDER BY 
    generation, id;
The set operations are similar to ordinary boolean logic: UNION ~ OR (though UNION ALL allows duplicates), INTERSECT ~ AND, and EXCEPT ~ difference (AND NOT):
A = 13  # Binary: 1101
B = 6   # Binary: 0110

# A AND NOT B
result = A & ~B # 9 (Binary: 1001)
The schemas of the combined queries should match, but you can cast types, use aliases and shape the data as you need: un-nest a specific field from a struct in an array inside a JSON column, or pivot columns to rows and vice versa.
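For example (a sketch with hypothetical tables): find users present in the CRM export but missing from the billing system, aligning column types and names with casts and aliases:
SELECT crm.user_id::text AS user_key, lower(crm.email) AS email
FROM crm_users crm
EXCEPT
SELECT bill.external_id   AS user_key, lower(bill.email) AS email
FROM billing_accounts bill;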

So, let’s say I have a couple of tables:
  • cities with zip codes
  • clinic details
  • a list of treatments in demand
  • appointment details

And what I want to understand is the demand distribution across different parts of the city:
SELECT
    ar.request_details->>'clinic_id' AS unique_clinic_id,           -- 1
    array_agg(DISTINCT t.name) AS treatment_names,                  -- 2
    ar.request_details->>'treatment_date' AS treatment_date,        
    ar.request_details->>'appointment_time' AS appointment_time,    
    date(ar.created_at) AS created_date,                            -- 3
    ar.patient_id,
    zc.city,
    zc.zip_code
FROM
    appointment_requests ar
JOIN
    clinics c 
    ON
        ar.request_details->>'clinic_id' = c.clinic_id::text        -- 4
JOIN
    zip_codes zc ON c.zip_code_id = zc.zip_code_id
LEFT JOIN LATERAL                                                   -- 5
    jsonb_array_elements_text(                                      -- 6
      ar.request_details->'required_treatments'
    ) AS req_treatment(treatment_id)
    ON true
LEFT JOIN
    treatments t ON t.treatment_id = req_treatment.treatment_id::integer
WHERE
    EXTRACT(EPOCH FROM (ar.updated_at - ar.created_at)) > 60*60     -- 7
    AND ar.status = 'pending'
    AND ar.patient_id NOT IN (-1, 20000)
GROUP BY
    ar.request_details->>'clinic_id',
    ar.request_details->>'treatment_date',
    ar.request_details->>'appointment_time',
    date(ar.created_at),
    ar.patient_id,
    zc.city,
    zc.zip_code
;
  • 1 — we extract fields from the JSON structure
  • 2 — we combine all unique values into an array
  • 3 — we extract the date from a timestamp
  • 4 — we use a field from the JSON in a join condition, casting the other side to text
  • 5 — we expand the JSON array from appointment_requests via a lateral join…
  • 6 — …into independent rows, one per treatment_id in the array
  • 7 — we keep only requests that were updated more than an hour after creation
JOINing tables might not be the most performant of operations, but in the 3NF world it is the common tool to organize data. A simple illustration of where rows end up for INNER/LEFT/RIGHT and FULL OUTER joins:
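For a quick feel, a sketch with hypothetical tables a and b, both having an id column:
SELECT a.id, a.val AS a_val, b.val AS b_val
FROM a
INNER JOIN b ON a.id = b.id;        -- only ids present in both tables

SELECT a.id, a.val AS a_val, b.val AS b_val
FROM a
LEFT JOIN b ON a.id = b.id;         -- every row of a; b columns are NULL where there is no match
-- RIGHT JOIN is the mirror image: every row of b

SELECT a.id, a.val AS a_val, b.val AS b_val
FROM a
FULL OUTER JOIN b ON a.id = b.id;   -- rows from both sides; NULLs wherever the other side is missing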
Just bear in mind the possibility of duplicates when joining multiple tables, in situations where you have a main entity and several related tables with one-to-many relationships:

Let’s say we have a user with:
 — 2 occupations
 — 3 addresses
 — 2 phone numbers

and this naive query gets you a Cartesian product of those multiple entries = 2 (occupations) × 3 (addresses) × 2 (phone numbers) = 12 rows for this single user:
SELECT
  u.name,
  o.occupation,
  a.address,
  p.phone_number
FROM
  users u
JOIN occupations o ON u.id = o.user_id
JOIN addresses a ON u.id = a.user_id
JOIN phone_numbers p ON u.id = p.user_id;
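One way out (a sketch on the same hypothetical tables) is to aggregate each one-to-many table on its own before joining, so each user contributes exactly one row per side:
SELECT
  u.name,
  o.occupations,
  a.addresses,
  p.phone_numbers
FROM users u
LEFT JOIN (
  SELECT user_id, array_agg(occupation) AS occupations
  FROM occupations
  GROUP BY user_id
) o ON u.id = o.user_id
LEFT JOIN (
  SELECT user_id, array_agg(address) AS addresses
  FROM addresses
  GROUP BY user_id
) a ON u.id = a.user_id
LEFT JOIN (
  SELECT user_id, array_agg(phone_number) AS phone_numbers
  FROM phone_numbers
  GROUP BY user_id
) p ON u.id = p.user_id;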
Also don’t forget about self joins, for cases like finding duplicates, searching for next/previous values, or spotting missing entries in sequential data!
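For instance (a sketch, assuming the same hypothetical users table has an email column), a self join can surface users registered twice with the same email:
SELECT u1.id, u2.id, u1.email
FROM users u1
JOIN users u2
  ON u1.email = u2.email
 AND u1.id < u2.id;   -- the inequality reports each duplicate pair only once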

Just bear in mind what is possible; you can always look up the syntax when needed (and it will stick with you through repetition, i.e. when it is really needed).
SQL methods and functions memo
Math

  • bitwise operations, cube root and exponent
  • logarithms and radians-to-degrees conversion
  • random (for sampling and shuffling, arrays included)
  • width_bucket — fills histogram bins for you
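A quick taste in PostgreSQL (literal values only, so it runs as-is):
SELECT
  5 & 3         AS bitwise_and,   -- 1
  cbrt(27)      AS cube_root,     -- 3
  exp(1)        AS e,
  ln(100)       AS natural_log,
  degrees(pi()) AS half_turn,     -- 180
  random()      AS uniform_0_1;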
String

All the tools that you need for data cleansing:
  • trim/substring + lpad/rpad — replace characters at specific positions
  • concat and encoding manipulation, including binary and md5/sha hash computation
  • format — an sprintf-like function
  • slicing with left/right and reverse
  • regexp — replace, split, count and match
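A few of them in PostgreSQL, applied to literal strings:
SELECT
  trim('  padded  ')                          AS trimmed,
  lpad('42', 5, '0')                          AS zero_padded,     -- '00042'
  format('%s scored %s', 'Alice', 95)         AS formatted,
  left('2024-05-01', 4)                       AS year_part,       -- '2024'
  md5('hello')                                AS md5_hash,
  regexp_replace('a1b2c3', '[0-9]', '', 'g')  AS digits_removed;  -- 'abc'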
Date

  • age, date_bin, date_part, justify_days or justify_hours
  • day of week, day of year or week of year — a lot of fun to do manually for leap years
  • overlap between date ranges
  • and, of course, time zones
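Again in PostgreSQL, on literals:
SELECT
  age(timestamp '2024-01-01', timestamp '1990-06-15')     AS age_interval,
  extract(isodow FROM date '2024-03-01')                  AS day_of_week,     -- 5 = Friday
  date_part('week', date '2024-03-01')                    AS week_of_year,
  (date '2024-03-01', date '2024-03-10')
    OVERLAPS (date '2024-03-05', date '2024-03-20')       AS ranges_overlap,  -- true
  timezone('UTC', timestamptz '2024-03-01 12:00:00+02')   AS back_to_utc;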
Geometry

  • point/box/circle types as first-class citizens
  • closest point to an object, or the distance between them
  • overlap and inclusion
  • area, diagonal or bounding box
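In PostgreSQL the geometric types and operators look like this (literals again):
SELECT
  point '(1,2)' <-> point '(4,6)'             AS distance,        -- 5
  circle '((0,0),3)' @> point '(1,1)'         AS point_in_circle, -- true
  box '((0,0),(2,3))' && box '((1,1),(5,5))'  AS boxes_overlap,   -- true
  area(box '((0,0),(2,3))')                   AS box_area;        -- 6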
IP Address

  • check whether one subnet contains another, or compute the netmask
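For example, with PostgreSQL’s inet/cidr types:
SELECT
  inet '192.168.1.0/24' >> inet '192.168.1.42'  AS subnet_contains_host,    -- true
  netmask(inet '192.168.1.42/24')               AS mask,                    -- 255.255.255.0
  cidr '10.0.0.0/8' >>= cidr '10.1.0.0/16'      AS subnet_contains_subnet;  -- true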
Json

  • extracting elements from a JSON array or from nested keys
  • checking whether one JSON document is contained in another
  • row to JSON, or unnesting JSON into key/value rows
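A small sketch in PostgreSQL, on a jsonb literal:
SELECT
  j -> 'items' -> 0 ->> 'sku'   AS first_sku,   -- element of a nested array
  j @> '{"status": "paid"}'     AS is_paid,     -- containment check
  kv.key,
  kv.value
FROM (SELECT '{"status": "paid", "items": [{"sku": "A1"}, {"sku": "B2"}]}'::jsonb AS j) AS t,
     jsonb_each(j) AS kv(key, value);           -- unnest top-level keys into rows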
Array

  • inclusion checks, min/max, unnesting, appending
  • ranges
  • generate_series — integers or dates with a step:
SELECT * FROM generate_series('2008-03-01 00:00'::timestamp,
                              '2008-03-04 12:00', '10 hours');
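And a couple of array tricks (PostgreSQL, literals only):
SELECT
  3 = ANY (ARRAY[1, 2, 3])                              AS contains_three,  -- true
  array_append(ARRAY[1, 2], 3)                          AS appended,        -- {1,2,3}
  (SELECT max(t.x) FROM unnest(ARRAY[4, 7, 1]) AS t(x)) AS max_element,     -- 7
  ARRAY(SELECT generate_series(1, 5, 2))                AS odd_numbers;     -- {1,3,5}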
Treat GenAI as a pairing buddy with an encyclopedic memory and the confidence of someone under the Dunning–Kruger effect when it simply makes up answers. You know what you want from your data; you have to learn what is possible! (And ask the AI to do the boring work of typing.)