Database Indexing
4 min readMar 13, 2024
Heap
- Where we store all the rows in the table.
- Data are Unordered.
- When inserting new records, new tuple goes wherever. Usually at the end.
Pages(Blocks)
- Heap is divided into 8 KB blocks.
- 8 KB is the default size of the block.
Item Pointers
- In each page, there are slots.
- Some of the slots can be empty.
- If you want to know the physical location of the tuple you can alway address with the block number & slot number.
- You can check the physical location of each tuples.
SELECT CTID, * FROM t_users;
- This CTID column is hidden system column.
Full Table Scan
- Start at the beginning and scan all the way to at the end.
What is an index?
- Indexes are a common way to enhance database performance.
- An index allows the database server to find and retrieve specific rows much faster than it could do without an index.
- Every primary key has an index by default.
- Key value stores.
- Keys are the list of values in the field that we have asked to index.
- Values are the page number and item pointer(slot number) that we mentioned before.(Physical location of that tuple in the heap)
- It is also sorted list. It’s sorts the values so it knows how to find them efficiently in this case.
- Default index is B-Tree index in Postgres.
- There is no visibility information in indexes whether that tuple is dead or updated. Indexes don’t care about those information
- Visibility information is stored in heap and after you have looked up a row from the index. Then we check the visibility in the heap.
- UPDATE — inserts new index tuple.
update t_posts set title = 'test' where user_id = 1;
- Dead tuples are removed by Vaccum. There is no operation to delete one tuple from the index. Vaccums scan through every index completely and delete all the dead tuples.
Example
- Create two tables and insert data.
create table t_test(
id BIGSERIAL NOT NULL PRIMARY KEY,
name text,
address text,
country_code text
);
create table t_posts(
id SERIAL PRIMARY KEY,
title TEXT,
discription TEXT,
user_id INTEGER NOT NULL REFERENCES t_test(id)
);
CREATE OR REPLACE FUNCTION insert_data_with_loop()
RETURNS void AS $$
DECLARE
i integer;
BEGIN
FOR i IN 1..1000000 LOOP
INSERT INTO t_test (name, address, country_code) values( md5(random()::text), CONCAT(array_to_string(ARRAY(SELECT chr((48 + round(random() * 9)) :: integer) FROM generate_series(1,12)), ''), ' ', array_to_string(ARRAY(SELECT chr((48 + round(random() * 9)) :: integer) FROM generate_series(1,12)), '')),substr(md5(random()::text), 0, 3));
END LOOP;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION insert_data_into_posts_with_loop()
RETURNS void AS $$
DECLARE
i integer;
BEGIN
FOR i IN 1..1000000 LOOP
INSERT INTO t_posts (title, discription, user_id) values( md5(random()::text), md5(random()::text), i);
END LOOP;
END;
$$ LANGUAGE plpgsql;
SELECT insert_data_with_loop();
SELECT insert_data_into_posts_with_loop();
- Here there is an index. it is automatically generated by Postgres for primary key.
- For this table also, there is an index for primary key.
Query 01
- Check the execution plan for primary key filtering.
explain analyze select * from t_test where id = 9000000;
- According to the execution plan, it is just scanning the index.
Query 2
explain analyze select * from t_test where address = '673138731337 072912168669';
Here according to the execution plan, it is scanning the heap sequentially.
Query 3
- Create index on address column.
create index on t_test(address);
- Check the execution plan.
explain analyze select * from t_test where address = '673138731337 072912168669';
Here, it is just scanning only index. there is a huge difference between Query 2 execution time and Query 3 execution time. Query 3 execution time is really fast after creating the index.
Query 4
- Check the indexes for the specific table.
SELECT indexname AS index_name, indexdef AS index_definition
FROM pg_indexes
WHERE tablename = 'your_table_name';