This week we learned about B-Trees and B+ Trees as a way of more efficiently reading and writing data from a hard disk drive (HDD).
In order to understand B Trees, it is important to understand a HDD’s structure and how data is read and written to it. A HDD is split up into tracks and sectors, two vital pieces of information utilized in block addresses. In addition to track and sector, we also need to know the offset (from byte 0) on which to read/write. In order to change sector, a HDD spins; in order to change tracks a HDD’s head shifts.
Once we understand a HDD’s structure, we can understand how data is stored on it, e.g. in terms of a database. Database records are rows within a table. The amount of space a row takes up on the HDD depends on how many bytes each column has been allocated. We can then use that information to determine how many records (rows) we can store per block. Block size is usually 512 bytes, but it depends on the HDD manufacturer. For example, if we create a database that has 5 columns: eid (10 bytes), name (50 bytes), dept (10 bytes), section (8 bytes), address (50 bytes), each record in that database would have a size of 128 bytes. We could then calculate that we could store up to 4 records per block (of 512 bytes). Furthermore, we would need 25 blocks to store a database with 100 rows in it.
Considering a database of this size, in the worst case scenario, we would have to access all 25 blocks on the HDD to return the results of a query. To make this process more efficient, we use indexing. An index is used to store a key (a column from the database) and a pointer to the record on the HDD. This reduces the number of blocks to access during search because the size of the index (key and pointer) will be smaller than the size of an entire record. Smaller size, means less blocks of storage and less blocks to search through. If our indexes, for instance, only take up 16 bytes, we would only need 4 blocks to store the indices for 100 records. This means, at maximum, we would only need to access 5 blocks (all 4 index blocks and the block to which the pointer points to).
But simple indexing is not the most efficient. Sparse indexing builds upon simple dense indexing by adding higher level layers of indexing on the lower level index table previously mentioned. These higher level indexes point to index blocks in the lower level table. Therefore, a single index could point to a block of 32 indexes. This is the notion of multilayer indexing.
Once we understand multilevel indexing, we can understand M-Way Trees. The m stands for the degree and represents the maximum number of children a node can have. The number of keys a parent node can have is m – 1. To use M-Way Trees/Search, our data should be sorted from smallest to largest. A Binary tree is an example of an m-way Tree with a degree of 2. M-way search trees have child pointers associated with each of their keys. In multidimensional indexes, they also have record pointers. This is the node structure of a B-Tree.
B-Trees are M-Way trees with rules. These rules include:
- Each node should have m/2 children before filling the next node.
- Root can have minimum 2 children
- All lead nodes must be at the same level
- Creation process is bottom-up
The difference between a B-Tree and a B+ Tree is that B+ Trees do not have record pointers from every node — just the lead modes. Therefore, each key in a node is also copied on to one of the leaves. Leaves are linked together like a linked list. The leaf level is a dense index. The upper node levels are spare indices. For an illustration of how B+ trees work, use the B+ Tree Simulator!