MySQL vs ClickHouse

Filtering by Column: MySQL vs ClickHouse

Filtering by column is a typical business requirement for reporting or building analytical dashboards.

Even though there are many OLAP databases available in the market, in this blog post I will delve deep into ClickHouse.

There is a group of tasks associated with the need to filter data in a table by a large number of columns, usually over datasets of millions of rows. For example, let us assume a table named products. When we run the following query:

mysql> DESCRIBE products;
+---------+---------------------+------+-----+---------+----------------+
| Field   | Type                | Null | Key | Default | Extra          |
+---------+---------------------+------+-----+---------+----------------+
| id      | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| title   | varchar(1024)       | YES  |     | NULL    |                |
| reviews | int(10) unsigned    | NO   |     | 0       |                |
| price   | float unsigned      | NO   |     | 0       |                |
+---------+---------------------+------+-----+---------+----------------+

# There are more than 40 such columns

The client should be able to filter and sort data by almost any field; moreover, for numerical and date filters, ranges can be selected. The table has nearly 100 million records, and even a well-tuned MySQL does a poor job of handling such queries. By design, MySQL is simply not built to do that job effectively – the difference is as vast as a race between a Monkey and a Deer, with the Monkey as MySQL and the Deer as a columnar database.

mysql> SELECT * FROM products
WHERE rank > 100000 AND price > 25 AND price < 50 AND reviews > 1000
ORDER BY queued DESC LIMIT 10;
10 rows in set (23.17 sec)

# Typical query – multiple filters and sorting

The second difficult task is full-text search over the selected filters (on the title field). This requires either MySQL's built-in full-text engine or an external tool (such as Sphinx).

Why MySQL (the Monkey) won’t work

MySQL is a row-oriented database. This means that when filtering, if there are no suitable indexes, MySQL will scan all the data and check each record against the filters.

If there is a suitable index, MySQL will first narrow down the rows using the index, and only then iterate over the selected rows, which is much faster. This is like a hungry Monkey spotting ripe mangoes on a tree branch.
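As a rough sketch of that difference, here is a toy Python illustration (the column names and filter values are taken from the example above; the in-memory structures are simplified stand-ins, not MySQL internals):

```python
import bisect
import random

# Toy "products" table with a few of the columns from the example above.
random.seed(1)
rows = [{"id": i,
         "price": random.uniform(1, 100),
         "reviews": random.randint(0, 5000)}
        for i in range(100_000)]

# Full scan: without an index, every row is examined against the filter.
full_scan = [r for r in rows if 25 < r["price"] < 50]

# With a B-tree-like index on price (modeled as a sorted list of
# (price, row_id) pairs), only the matching slice of the index is touched
# and only the selected rows are fetched.
price_index = sorted((r["price"], r["id"]) for r in rows)
lo = bisect.bisect_right(price_index, (25, float("inf")))
hi = bisect.bisect_left(price_index, (50, float("-inf")))
via_index = [rows[row_id] for _, row_id in price_index[lo:hi]]

print(len(full_scan), len(via_index))  # both strategies find the same rows
```

The index version examines only the rows in the price range instead of all 100,000, which is exactly why a single well-chosen index makes such queries fast.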

But wait…

The peculiarity of our task is that any of the forty columns can be selected for filtering. This means we would need to create at least 40 indexes, and the presence of so many indexes creates the following three problems:

Insert performance

MySQL will struggle here, since it has to maintain a large number of index structures on every insert. In addition, the disk subsystem will be heavily loaded – each insert will have to write much more data (at least twice as much).
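The write amplification is easy to see in a toy model (a minimal sketch – each secondary index is modeled as a plain sorted list, not a real B-tree):

```python
import bisect
import random

# Each of the 40 secondary indexes is modeled as a sorted list; a single
# INSERT must also place one entry into every one of them.
NUM_INDEXES = 40
random.seed(2)
indexes = [[] for _ in range(NUM_INDEXES)]

def insert_row(row_values):
    writes = 1                          # writing the row itself
    for index, value in zip(indexes, row_values):
        bisect.insort(index, value)     # one extra ordered write per index
        writes += 1
    return writes

writes_per_insert = insert_row([random.random() for _ in range(NUM_INDEXES)])
print(writes_per_insert)  # 41: one row write plus 40 index-maintenance writes
```

Every additional index turns one logical insert into one more physical ordered write, which is why insert throughput degrades as indexes pile up.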

Disk space consumption

Creating all those indexes will consume much more disk space. For example, we use only the 4 most necessary indexes (chosen from the initial spec given by BI, which usually changes later in production), and the ratio of data size to index size already looks like this:

# Data   ============================== 60%
# Indices ===================== 40%

# And this is on 4 indices!

In this case, the table occupies 50 GB on disk. If we created all 40 indexes, the picture might look like this:

# Data   ========== 20%
# Indices ========================================== 80%

# We did not dare to check, most likely everything will be even worse

Slow queries

Since there is no way to predict which combination of columns will be selected for filtering (it can be any of them), some slow queries will still remain.

ClickHouse to the Rescue

Remember that ClickHouse is a columnar database equipped with a vectorized execution engine. This means that filtering data by any column is a key strength of such a database.
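A minimal sketch of why this works (plain Python lists standing in for ClickHouse's compressed column files): each column lives in its own contiguous array, so a filter reads only the columns it actually mentions – here 3 of the table's 40+ – while a row store would have to read whole rows.

```python
import random

# Columnar layout: one contiguous array per column.
random.seed(3)
n = 100_000
columns = {
    "rank":    [random.randint(0, 200_000) for _ in range(n)],
    "price":   [random.uniform(1, 100) for _ in range(n)],
    "reviews": [random.randint(0, 5_000) for _ in range(n)],
}

# The filter from the example query touches only these three arrays;
# the other 40-odd columns are never read at all.
matching = [i for i in range(n)
            if columns["rank"][i] > 100_000
            and 25 < columns["price"][i] < 50
            and columns["reviews"][i] > 1_000]

print(len(matching))
```

No per-column index is needed: a tight sequential scan over a single compressed column is cheap, and it works equally well for any column the user happens to pick.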

To build such a search index in ClickHouse, it is enough to export the data from MySQL and load it into a MergeTree table:

clickhouse-client :) CREATE TABLE search_index (
  id UInt32,
  reviews UInt32,
  rank UInt32
  -- ... plus the other filterable columns
) ENGINE = MergeTree()
ORDER BY id
SETTINGS index_granularity = 8192

# Table in Clickhouse for index storage

Now we need to load the data from MySQL into ClickHouse. First, export the data from MySQL to TSV (this can be done in a multitude of ways):

mysql> SELECT * INTO OUTFILE "/var/lib/mysql-files/products.tsv" FROM products;

# exports the entire table to a TSV file

This is the longest step; it takes about 20 minutes. Now we insert the file into the ClickHouse table:

cat /var/lib/mysql-files/products.tsv | clickhouse-client --query "INSERT INTO search_index FORMAT TSV"

# Insert data into Clickhouse, which takes a couple of minutes

Now we run the query with the same filters directly in ClickHouse:

clickhouse-client :) SELECT * FROM search_index WHERE rank > 100000 AND price > 25 AND price < 50 AND reviews > 1000 ORDER BY queued DESC LIMIT 10;
10 rows in set. Elapsed: 0.054 sec.

# Exactly what is needed!

This is like a Deer running for its life….

The table occupies only 1.2 GB in ClickHouse, thanks to the compression techniques ClickHouse provides out of the box.

If this is so impressive, why not get rid of MySQL altogether and store all the data in ClickHouse? Because column databases are not intended for a large (or even small) number of updates: there is no classic UPDATE, only an emulation (the asynchronous ALTER TABLE ... UPDATE mutation). This is like asking the Deer to climb the tree….

Key takeaway…

Instead of burdening MySQL with a large number of indexes, one can use ClickHouse to filter efficiently by any number of columns. In that case, it is necessary to ensure periodic loading of data from MySQL to ClickHouse (re-indexing).
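The export/import steps above can be wired into a single pipeline for that periodic reload, skipping the intermediate OUTFILE. The sketch below only builds the shell command; the `build_sync_pipeline` helper is hypothetical (not part of either tool) and assumes the same mysql and clickhouse-client binaries used in the examples:

```python
import shlex

def build_sync_pipeline(mysql_db: str, source_table: str, target_table: str) -> str:
    # mysql -N -B prints the result as header-less, tab-separated rows –
    # exactly the TSV that clickhouse-client expects on stdin.
    export = (f"mysql -N -B {shlex.quote(mysql_db)} "
              f"-e {shlex.quote(f'SELECT * FROM {source_table}')}")
    load = ("clickhouse-client --query "
            + shlex.quote(f"INSERT INTO {target_table} FORMAT TSV"))
    return f"{export} | {load}"

# Example with the table names from this post ("shop" is a placeholder DB name):
print(build_sync_pipeline("shop", "products", "search_index"))
```

Run the resulting command from cron at whatever staleness your dashboards can tolerate; combined with dropping and re-creating (or truncating) search_index first, this gives the simple full-reload "indexing" flow described above.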

Image credits: A Barasingha at early morning in Kanha, a key tiger habitat in Central India © Joseph Vattakaven

