Database

Implementation of Databases

Logs - an append only data file that represents the db (not logs of statements for example)

The data would be encoded as binary with a string length then the raw bytes

Indexes - Understanding a Basic Database Implementation

Indexes enable efficient reads, slow down writes

Hash function - maps a key to a byte offset in a data file

  • Problems: We want sequential saving - ex. keys kitty01 and kitty02 should be close by

In order to avoid running out of disk space, we break the log file into segments then perform compaction - throwing away duplicate logs

But, because compaction makes the segments smaller, we iterate through and throw away duplicates

  • Thus, we end up with new segment files

Deleting Records

  • Have a tombstone

Append-Only vs. Overwriting

  • Overwriting leads to issues if there's a crash mid-overwrite (the data gets merged)
  • Sequential writes on disk are much faster than random access

Normalization

  • The principle of keep dating uniform - ex using IDs instead of strings

Denormalization

  • You can have redundant data in multiple tables to avoid complex joins

Normalization

ext: https://www.guru99.com/database-normalization.html

  • Avoid redundancy by dividing the tables into smaller tables with unique values
  • 1NF - each table should contain a single value
  • 2NF - no composite keys. We must have primary keys only - ex. an id
  • 3NF - no functional dependencies (when changing a column mandates that another column changes)

SSTables

Update our segment files so that each key only appears once in each segment file, and the keys are in sorted order

Now, when we merge segments, we look at each segment and copy the lowest key into the output merged segment (similar to mergesort)

Now,if we have a key "handiwork", and we don't know it's location (we don't have that index stored), we can search between handbag and handsome. So we can store a sparse in-memory index

Also, our read requests can compress the data they scan over before writing to disk

SQL Tuning

  • Know that small queries are similar in latency to big queries
    • Anything in application code that does multiple queries should probably be one big query
  • ext: N+1 Query Problem