Implementation of Databases

Logs - an append only data file that represents the db (not logs of statements for example)

The data would be encoded as binary with a string length then the raw bytes

Indexes - Understanding a Basic Database Implementation

Indexes enable efficient reads, slow down writes

Hash function - maps a key to a byte offset in a data file

Problems: We want sequential saving - ex. keys kitty01 and kitty02 should be close by

In order to avoid running out of disk space, we break the log file into segments then perform compaction - throwing away duplicate logs

But, because compaction makes the segments smaller, we iterate through and throw away duplicates

Thus, we end up with new segment files

Deleting Records

Have a tombstone

Append-Only vs. Overwriting

Overwriting leads to issues if there's a crash mid-overwrite (the data gets merged)
Sequential writes on disk are much faster than random access

Normalization

The principle of keep dating uniform - ex using IDs instead of strings

Denormalization

You can have redundant data in multiple tables to avoid complex joins

Normalization

ext: https://www.guru99.com/database-normalization.html

Avoid redundancy by dividing the tables into smaller tables with unique values
1NF - each table should contain a single value
2NF - no composite keys. We must have primary keys only - ex. an id
3NF - no functional dependencies (when changing a column mandates that another column changes)

SSTables

Update our segment files so that each key only appears once in each segment file, and the keys are in sorted order

Now, when we merge segments, we look at each segment and copy the lowest key into the output merged segment (similar to mergesort)

Now,if we have a key "handiwork", and we don't know it's location (we don't have that index stored), we can search between handbag and handsome. So we can store a sparse in-memory index

Also, our read requests can compress the data they scan over before writing to disk

SQL Tuning

Know that small queries are similar in latency to big queries
- Anything in application code that does multiple queries should probably be one big query
ext: N+1 Query Problem