B-Tree on Big Data

One can say with strong conviction that most of MySQL's commercial success comes from the InnoDB engine. Oracle acquired InnoDB several years before it acquired MySQL!
I believe Oracle acquired not just the InnoDB engine but, more importantly, the people behind it. Where is the InnoDB author now?
One can ask: it's already 2020, is it still so difficult to develop a storage engine? Building one is not difficult, but making it production-ready is another matter.
From an engineering point of view, an ACID engine has countless things to polish, and that demands manpower, money, and patience.
Of course, this is not to encourage everyone to build their own.


Let us examine an old tree in the database garden (let us say forest, for obvious reasons). It is 50 years old and currently supports half of the database industry; it hasn't changed in 50 years, and nobody has changed its intent. To identify the pros and cons of the B-tree algorithm, there is a method called IO complexity analysis, which simply separates the true claims from the false. Let's use this method to analyze the IO complexity of the traditional B-tree. The read and write IOs become clear at a glance, and we can see why reading is fast, why writing is slow, and how to optimize.

Read IO analysis

Consider a 3-level B-tree, where each square represents a page and the number is the page ID.

Read IO Analysis B-Tree

The B-tree structure in the figure above is its in-memory representation. If the record we want to read is on leaf-8, the read path is shown by the blue arrow:

root-9 –> branch-6 –> leaf-8

The following figure shows how the B-tree is stored on disk; the meta page is the starting point.

The total number of random read IOs (assuming there is no page cache in memory and pages are stored at random locations) is (blue arrow):

1 (meta-10) IO + 1 (root-9) IO + 1 (branch-6) IO + 1 (leaf-8) IO = 4 IOs. Since the meta and root pages are almost always cached, this drops to 2 random IOs.

If a disk seek takes 1 ms, the read latency is 2 ms. The B-tree is a read-optimized data structure; both the LSM-tree and the Fractal-tree can only read slower than it, because of their read amplification problem. Storage engine algorithms change with each passing day, but most of them are struggling with the write-optimized side.
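The read-cost accounting above can be sketched in a few lines. This is a toy model, not InnoDB code; the page names, the cached set, and the 1 ms seek time are the assumptions stated in the text.

```python
# Toy model of the point-lookup cost described above:
# meta -> root -> branch -> leaf, with meta and root cached in memory.

CACHED = {"meta-10", "root-9"}            # pages assumed pinned in the buffer pool
READ_PATH = ["meta-10", "root-9", "branch-6", "leaf-8"]

SEEK_MS = 1.0                             # assumed disk seek time per random IO

def read_cost(path, cached):
    """Count random read IOs and estimate latency for one point lookup."""
    ios = sum(1 for page in path if page not in cached)
    return ios, ios * SEEK_MS

ios, latency_ms = read_cost(READ_PATH, CACHED)
print(ios, latency_ms)    # 2 random IOs -> 2.0 ms
```

With an empty cache the same function returns 4 IOs, matching the first count above.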

Write IO analysis

Now let us examine what will happen during writes.

Assuming this write changes both the root and a branch page, the in-place update reflected on the disk is:

So we now have 2 random read IOs, 2 random write IOs, and a WAL fsync: roughly 4 random IOs in total. The analysis shows that the B-tree is not very write-friendly: there are many random IOs, and in-place updates require a page-level WAL to guarantee rollback on failure, which adds real complexity.
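Extending the same toy model to the write path (again an assumption-laden sketch, not real InnoDB accounting): the lookup pays the random reads, each dirty page pays a random in-place write, and the WAL adds one fsync.

```python
# Toy model of the in-place update cost described above.

CACHED = {"meta-10", "root-9"}             # pages assumed cached
READ_PATH = ["meta-10", "root-9", "branch-6", "leaf-8"]
DIRTY_PAGES = ["branch-6", "leaf-8"]       # pages modified by this write (assumed)

def write_cost(read_path, dirty, cached):
    """Return (random reads, random in-place writes, WAL fsyncs)."""
    reads = sum(1 for p in read_path if p not in cached)
    writes = len(dirty)                    # each dirty page is one random write
    wal_fsyncs = 1                         # page-level WAL for failure rollback
    return reads, writes, wal_fsyncs

print(write_cost(READ_PATH, DIRTY_PAGES, CACHED))   # (2, 2, 1)
```

At 1 ms per random IO, those 4 random IOs alone put the write latency around 4 ms before the fsync is counted.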

Write-Optimized B-tree

Speaking of write optimization: in the era of mechanical disks, the common direction was to convert random IO into sequential IO in order to exploit the strengths of spinning disks. Hence the append-only B-tree:

  1. An update generates a new copy of each modified page (green)
  2. When written back to disk, the new pages are appended to the end of the file

The append-only B-tree saves the 2 random write IOs and converts them into a single sequential IO, which greatly improves write performance.
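The two steps above can be sketched as a copy-on-write append. This is a minimal illustration under assumed names and a 4 KB page size, not any particular engine's format; in a real append-only B-tree the parent pages would also be rewritten to point at the new offsets.

```python
import io

def append_only_write(file_obj, new_pages, page_size=4096):
    """Append new page images to the file tail as one sequential batch."""
    file_obj.seek(0, io.SEEK_END)          # never overwrite pages in place
    offsets = {}
    for page_id, payload in new_pages:
        offsets[page_id] = file_obj.tell() # remember where the new version lives
        file_obj.write(payload.ljust(page_size, b"\x00"))
    return offsets                          # parents must be updated to point here

f = io.BytesIO()
offsets = append_only_write(f, [(8, b"leaf-8'"), (6, b"branch-6'"), (9, b"root-9'")])
print(offsets)    # {8: 0, 6: 4096, 9: 8192}
```

Because all three new pages land in one contiguous run at the tail, the disk sees a single sequential write instead of scattered in-place updates.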

To sum up

We are in the post-Hadoop era, and algorithms for Big Data apply to all storage systems, not just databases. The problem with Big Data is Microdata. Consider a filesystem-scale workload:

  1. Store 1 trillion files
  2. Create tens of thousands of files per second
  3. Traverse directory hierarchies fast (ls -R)

B-trees would require at least hundreds of disk drives for this. For on-disk data, there will always be trade-offs between the speed of data ingestion, query speed, and freshness of data, as well as between block size, memory size, and data size. It is usually better to optimize approximately for both block size and memory than to pick the best of either.
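The "hundreds of drives" figure follows from simple arithmetic. The numbers below are assumptions for illustration: roughly 100 random IOs per second per spinning disk, one random leaf write per file create, and an assumed ingest rate of 50,000 creates per second.

```python
import math

creates_per_sec = 50_000   # assumed ingest rate ("tens of thousands per second")
iops_per_drive = 100       # typical random-IO budget of one HDD (assumed)
ios_per_create = 1         # at least one random B-tree leaf write per create

drives = math.ceil(creates_per_sec * ios_per_create / iops_per_drive)
print(drives)    # 500 drives
```

Any structure that turns those creates into sequential IO, as the append-only and LSM approaches do, shrinks this number dramatically.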

