
ClickHouse – Introduction

Introduction

ClickHouse is an open-source columnar database that has received a lot of attention in recent years and is used mainly for data analysis (OLAP).

Characteristics of OLAP scenarios

  1. Reads far outnumber writes — Unlike transaction-processing (OLTP) scenarios, such as adding items to a cart, placing orders, and paying in e-commerce, where large numbers of insert, update, and delete operations must be applied in place, data in an analytical (OLAP) scenario is usually imported in batches. Analysts then explore it flexibly across arbitrary dimensions, gain insight with BI tools, build reports, and so on. After the data is written once, analysts mine and analyze it from many angles until they find business value, trends, and other information. This is a process of trial and error, constant adjustment, and continuous optimization, in which data is read far more often than it is written. The underlying database should be designed specifically for this characteristic rather than blindly adopting the architecture of traditional databases.
  2. Large wide tables: many rows read but few columns, with a small result set — OLAP scenarios usually involve one or a few wide tables with many columns, sometimes hundreds or even thousands. During analysis, a handful of columns are chosen as dimensions and a few others as metrics, and aggregations are computed over the whole table or a large range of data. This process scans a large number of rows but uses only a few columns, and the aggregated result set is far smaller than the billions of rows of raw data.
  3. Data is written in batches and rarely, if ever, updated — OLTP services demand low latency; in OLAP services the data volume is large, so ingestion matters most, and the goal is to import as much data as possible in a short time (throughput). Once an import completes, the historical data is typically archived and no further updates or deletions are performed.
  4. No transactions required, low data-consistency requirements — OLAP services have few transactional requirements. They usually import historical log data, or pair with a transactional database and synchronize its data in near real time. Most OLAP systems settle for eventual consistency.
  5. Flexible workloads that do not suit pre-modeling — In analytical scenarios, as the business changes, analysis dimensions and mining methods must be adjusted over time to discover the value of the data and to update business indicators or KPIs as quickly as possible. A data warehouse usually stores a large amount of historical data, which is very expensive to remodel. Pre-modeling can speed up calculations in specific scenarios, but it cannot keep up with flexible business requirements and its maintenance cost is too high.

Why ClickHouse?

Designed around the needs of OLAP scenarios, ClickHouse implements a new, efficient columnar storage engine and provides rich features such as ordered data storage, primary key indexing, sparse indexing, data sharding, data partitioning, TTL, and primary/backup replication. Together, these features lay the foundation for ClickHouse’s extremely fast analytical performance.

Columnar storage

Unlike row storage, which stores the data of each row contiguously, column storage stores the data of each column contiguously.

A detailed explanation can be found here

Compared with row storage, column storage has several excellent characteristics for analytical scenarios.

1) As mentioned above, analytical scenarios often read many rows but only a few columns. With row storage, data is laid out contiguously row by row, so all of a row’s columns sit in the same block; columns not involved in the calculation must still be read during IO, which severely amplifies reads. With column storage, only the columns that participate in the calculation need to be read, greatly reducing IO cost and speeding up queries.

2) Data within the same column has the same type, so it compresses very well. Columnar storage often achieves compression ratios of ten times or higher, saving a great deal of storage space and reducing storage cost. In addition, ClickHouse supports per-column encodings (codecs), which make compression even more effective.

3) A higher compression ratio means smaller data, so it takes less time to read the corresponding data from disk.

4) As noted above, the compression codec can be selected per column. Different columns hold different data types, and different compression algorithms suit them, so the most appropriate codec can be chosen for each column type (see the sketch at the end of this section).

5) A high compression ratio also means the same amount of memory can hold more data, so the system cache is more effective.

Official figures show that, in some analytical scenarios, column storage alone can deliver a 100x speedup.
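
As a rough illustration of per-column codec selection, here is a minimal sketch; the table and column names are invented for this example, and codec availability depends on the ClickHouse version:

    -- Hypothetical table: a codec is chosen per column based on its data shape.
    CREATE TABLE metrics
    (
        ts    DateTime CODEC(Delta, LZ4),  -- near-monotonic timestamps compress well with Delta
        value Float64  CODEC(Gorilla),     -- floating-point gauge values
        host  String   CODEC(ZSTD(3))      -- repetitive strings: general-purpose ZSTD
    )
    ENGINE = MergeTree
    ORDER BY ts;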

Ordered data storage

ClickHouse supports sorting data by specified columns when a table is created. After sorting, rows with the same sort key are guaranteed to be stored contiguously and in order on disk. For point and range queries, the data matched by the WHERE condition is packed into one or a few consecutive blocks instead of being scattered across many blocks, greatly reducing the number of blocks that need IO. In addition, sequential IO makes full use of the operating system’s page-cache prefetching and reduces page faults.
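
A minimal sketch of ordered storage, using a hypothetical hits table (the same table is reused in later examples):

    -- Rows are stored on disk sorted by (UserID, EventDate);
    -- index_granularity is the number of rows per index granule (8192 is the default).
    CREATE TABLE hits
    (
        EventDate Date,
        UserID    UInt64,
        URL       String
    )
    ENGINE = MergeTree
    ORDER BY (UserID, EventDate)
    SETTINGS index_granularity = 8192;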

Primary key index

ClickHouse supports a primary key index. It divides each column of data into index granules (by default 8192 rows, i.e. the number of rows per index entry). The first row of each granule is called a mark row, and the primary key index stores the primary key values of the mark rows.

For queries whose WHERE condition includes the primary key, a binary search over the primary key index can locate the corresponding index granules directly, which avoids a full table scan and speeds up the query.

However, it is worth noting that ClickHouse’s primary key index differs from databases such as MySQL: it is not used for deduplication, so rows with the same primary key can coexist in the table. To achieve deduplication, it must be combined with specific table engines such as ReplacingMergeTree, CollapsingMergeTree, or VersionedCollapsingMergeTree, which we will cover in detail later in this article series.
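
Reusing the hypothetical hits table above, here is a sketch of a query that benefits from the primary key index; by default the ORDER BY key also serves as the primary key:

    -- The WHERE clause filters on a prefix of the primary key, so ClickHouse
    -- binary-searches the mark rows and reads only the matching granules
    -- instead of scanning the whole table.
    SELECT count()
    FROM hits
    WHERE UserID = 12345 AND EventDate >= '2020-01-01';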

Sparse index

ClickHouse supports creating any number of sparse (data-skipping) indexes on any column. The indexed value can be any legal SQL expression and is not limited to the column value itself. These indexes are called sparse because each entry summarizes a whole index granule (by default 8192 rows) rather than recording the position of every row in the file. The currently supported sparse index types, with a short sketch after the list, are:

minmax: stores, per index granule, the min and max values of the specified expression; it helps to quickly skip blocks that cannot match point and range queries, reducing IO.

set(max_rows): stores, per index granule, the set of distinct values of the specified expression, used to quickly decide whether an equality query can hit the block, reducing IO.

ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed): splits strings into ngrams and builds a Bloom filter, which can optimize query conditions such as equality, LIKE, and IN.

tokenbf_v1(size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed): similar to ngrambf_v1, except that strings are tokenized on punctuation and whitespace instead of being split into ngrams.

bloom_filter([false_positive]): builds a Bloom filter on the specified column to speed up query conditions such as equality, LIKE, and IN.
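
A minimal sketch of adding data-skipping indexes to the hypothetical hits table (index names are invented; GRANULARITY is the number of index granules summarized by one index entry):

    -- minmax on a numeric/date column: skip blocks whose [min, max] range cannot match.
    ALTER TABLE hits ADD INDEX idx_date_minmax EventDate TYPE minmax GRANULARITY 4;

    -- set(N): keep up to N distinct values per block for fast equality checks.
    ALTER TABLE hits ADD INDEX idx_url_set URL TYPE set(100) GRANULARITY 4;

    -- ngram Bloom filter: helps equality, LIKE, and IN conditions on strings.
    ALTER TABLE hits ADD INDEX idx_url_ngram URL TYPE ngrambf_v1(3, 256, 2, 0) GRANULARITY 4;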

Data Sharding

ClickHouse supports both a standalone mode and a distributed cluster mode. In distributed mode, ClickHouse divides the data into multiple shards and distributes them across different nodes. Different sharding strategies suit different access patterns, and ClickHouse provides a rich set of them so that each business can choose according to its actual needs (see the sketch after this list):

1) Random sharding: written data is randomly distributed to one node of the distributed cluster.

2) Constant (fixed) sharding: written data is always distributed to a fixed node.

3) Column-value sharding: data is sharded by the hash of a given column’s value.

4) Custom-expression sharding: any legal expression can be specified, and data is sharded by the hash of the expression’s computed value.
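
A minimal sketch of sharding through Distributed tables; the cluster name my_cluster, the default database, and the table names are all assumptions for this example:

    -- Local table created on every node of the cluster.
    CREATE TABLE hits_local ON CLUSTER my_cluster
    (
        EventDate Date,
        UserID    UInt64,
        URL       String
    )
    ENGINE = MergeTree
    ORDER BY (UserID, EventDate);

    -- Random sharding: rand() spreads inserts evenly across shards.
    CREATE TABLE hits_all AS hits_local
    ENGINE = Distributed(my_cluster, default, hits_local, rand());

    -- Column-value (hash) sharding: rows with the same UserID land on the same shard,
    -- so a JOIN on UserID can run locally without a shuffle.
    CREATE TABLE hits_by_user AS hits_local
    ENGINE = Distributed(my_cluster, default, hits_local, intHash64(UserID));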

Data sharding allows ClickHouse to make full use of the massive parallel computing capability of the whole cluster and to return query results quickly.

More importantly, the variety of sharding options opens up room for business-level optimization. For example, with hash sharding, a JOIN can avoid a data shuffle and be executed locally on each shard.

With custom sharding expressions, the most suitable strategy can be chosen for each business scenario and SQL access pattern, which also helps solve problems such as data skew between shards.

In addition, the sharding mechanism lets ClickHouse scale out horizontally and linearly into large distributed clusters capable of processing massive volumes of data.

Data Partitioning

ClickHouse supports the PARTITION BY clause. When creating a table, data partitioning can be specified by any legal expression: for example, toYYYYMM() partitions data by month, toMonday() partitions data by day of the week, and an Enum column can use each of its values directly as a partition.

Data partitioning has two main applications in ClickHouse (see the sketch after this list):

  • Partition pruning on the partition key, so that only the necessary data is read. Flexible partition expressions make it possible to design partitioning around the SQL access pattern and maximize the fit with business characteristics.
  • TTL management at the partition level, eliminating outdated partition data.
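
A minimal sketch of monthly partitioning plus partition pruning and manual partition removal (table and columns are hypothetical):

    CREATE TABLE events
    (
        EventDate Date,
        UserID    UInt64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(EventDate)   -- one partition per month
    ORDER BY (UserID, EventDate);

    -- Queries that filter on the partition key read only the matching partitions.
    SELECT count() FROM events
    WHERE EventDate >= '2020-01-01' AND EventDate < '2020-02-01';

    -- Outdated partitions can be dropped wholesale.
    ALTER TABLE events DROP PARTITION 201912;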

Data TTL

In analytical scenarios, the value of data keeps decreasing over time (hot, warm, cold). For cost reasons, most businesses retain only the most recent months of data. ClickHouse provides data lifecycle management through TTL.

ClickHouse supports TTLs at several granularities (sketched after the list):

1) Column-level TTL: when part of a column’s data expires, it is replaced with the column’s default value; when all of the column’s data expires, the column is deleted.

2) Row-level TTL: when a row expires, it is deleted directly.

3) Partition-level TTL: when a partition expires, it is deleted directly.
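
A minimal sketch of column-level and row-level TTL on a hypothetical logs table; since the table is partitioned by month, whole partitions disappear once all of their rows have expired:

    CREATE TABLE logs
    (
        ts      DateTime,
        level   String,
        -- Column-level TTL: after 30 days the value is reset to the column default.
        message String TTL ts + INTERVAL 30 DAY
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(ts)
    ORDER BY ts
    -- Row-level TTL: rows older than 90 days are deleted in the background.
    TTL ts + INTERVAL 90 DAY;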

High-throughput write capability

ClickHouse uses an LSM-tree-like structure: after data is written, it is periodically compacted in the background. Thanks to this structure, ClickHouse writes data strictly sequentially during import, and a data part is immutable once written. During background compaction, multiple parts are merge-sorted and written back to disk sequentially. This sequential-write pattern makes full use of the disk’s throughput and delivers excellent write performance even on HDDs.

Official public benchmarks show a write throughput of 50 MB-200 MB/s; assuming roughly 100 bytes per row, that is equivalent to 500K-2M rows/s.

Limited support for delete and update

In analytical scenarios, delete and update are not core requirements. ClickHouse does not support classic delete and update statements directly, but provides them in disguise through mutations, using syntax of the form ALTER TABLE … DELETE WHERE filter_expr and ALTER TABLE … UPDATE col = val WHERE filter_expr (sketched below).

At present, the main limitation is that delete and update are asynchronous operations that only take effect after being executed in the background.
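
A minimal sketch of the mutation syntax against the hypothetical hits table; both statements return immediately and are applied asynchronously in the background:

    -- Asynchronous "delete": affected data parts are rewritten in the background.
    ALTER TABLE hits DELETE WHERE EventDate < '2019-01-01';

    -- Asynchronous "update": only non-key columns can be updated.
    ALTER TABLE hits UPDATE URL = '' WHERE UserID = 12345;

    -- Pending mutations and their progress can be inspected here.
    SELECT * FROM system.mutations WHERE is_done = 0;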

Active / standby synchronization

ClickHouse provides high availability through active-standby replication and supports operations such as seamless upgrades under this architecture. Its implementation has some distinctive characteristics compared with other systems (see the sketch after this list):

1) In the default configuration, every replica is active and can serve external queries;

2) The number of replicas can be configured arbitrarily, from 0 to any number;

3) Different shards can be configured with different numbers of replicas according to their access patterns, which helps resolve query hotspots on a single shard.
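
A minimal sketch of a replicated table; the ZooKeeper path and the {shard} and {replica} macros are assumed to be defined in the server configuration:

    -- Replicas of the same shard register under the same ZooKeeper path and
    -- replicate inserted parts from each other; every replica can serve queries.
    CREATE TABLE hits_replicated ON CLUSTER my_cluster
    (
        EventDate Date,
        UserID    UInt64,
        URL       String
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/hits_replicated', '{replica}')
    ORDER BY (UserID, EventDate);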

ClickHouse computing layer

ClickHouse’s core has been carefully crafted to extract the maximum from the hardware and improve query speed. It implements a number of important techniques such as single-machine multi-core parallelism, distributed computing, vectorized execution, SIMD instructions, and runtime code generation.

Multi-core parallelism

ClickHouse divides the data into multiple partitions, each partition is further divided into multiple index granules, and multiple CPU cores then each process a part of them, achieving parallel data processing.

Under this design, a single query can use all the CPU cores of the machine. This extreme parallel processing capability greatly reduces query latency.

Distributed Computing

In addition to excellent single-machine parallel processing, ClickHouse provides linearly scalable distributed computing. It automatically splits a query into multiple tasks, sends them to the cluster, processes them in parallel across machines (MPP), and finally aggregates the results together.

When there are multiple replicas, ClickHouse provides several query-dispatch strategies (see the sketch after this list):

  • Random dispatch: randomly select one of the replicas;
  • Nearest hostname: select the replica whose hostname is closest to the current machine’s. Under certain network topologies this reduces network latency, and it keeps queries going to a fixed replica, making full use of the system cache.
  • In order: replicas are tried one by one in a fixed order; when the current replica is unavailable, the next one is used.
  • First or random: in the in-order mode, when the first replica is unavailable, the entire workload piles up on the second replica, causing load imbalance. First or random solves this problem: when the first replica is unavailable, another replica is chosen at random, preserving load balance among the remaining replicas. In addition, in cross-region replication scenarios, setting the first replica to one inside the local region can significantly reduce network latency.
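
These strategies map onto the load_balancing setting, which can be set per session or per query; a minimal sketch (value names as documented, defaults may differ between versions):

    -- Prefer the replica whose hostname differs least from this server's hostname.
    SET load_balancing = 'nearest_hostname';

    -- Other accepted values include 'random' (the default), 'in_order', and 'first_or_random'.
    SELECT count() FROM hits_all SETTINGS load_balancing = 'in_order';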

Vectorized execution and SIMD instructions

ClickHouse not only stores data by column but also computes by column. Traditional OLTP databases usually compute row by row, mainly because transaction processing consists largely of point lookups and the amount of computation per SQL statement is small. In analytical scenarios, however, a single SQL statement may involve an enormous amount of computation, and treating each row as the basic processing unit brings serious performance penalties:

1) The corresponding function must be called for every row of data, so function-call overhead accounts for a large share of the time;

2) The storage layer stores data by column and organizes it by column in memory, but the compute layer processes it row by row, so the CPU cache’s read-ahead capability cannot be used effectively, causing many CPU cache misses;

3) Row-by-row processing cannot use efficient SIMD instructions.

ClickHouse implements a vectorized execution engine. For columnar data in memory, each SIMD instruction call processes a whole batch (rather than one call per row), which not only reduces function calls and cache misses but also fully exploits the parallelism of SIMD instructions, greatly reducing computation time. A vectorized execution engine usually brings a severalfold performance improvement.

Dynamic Code Generation: Runtime Codegen

Classic database implementations usually use the volcano model for expression evaluation: queries are converted into operators such as HashJoin, Scan, IndexScan, and Aggregation, and operators are connected through a uniform interface such as open/next/close, with each operator implementing these virtual functions of the parent class. In analytical scenarios, a single SQL statement typically processes hundreds of millions of rows, so the cost of virtual-function calls is no longer negligible. In addition, each operator must handle many variables such as column type, column size, and number of columns, leading to a large number of if-else branch judgments that defeat CPU branch prediction.

ClickHouse implements expression-level runtime codegen: it dynamically generates code for the current SQL statement and then compiles and executes it. As the example below shows, generating code directly for an expression not only eliminates a large number of virtual-function calls (i.e., the calls through multiple function pointers in the figure), but also, because the types and number of expression arguments are known at runtime, eliminates unnecessary if-else branch judgments.

(Figure: ClickHouse expression-level runtime codegen)
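
Runtime compilation of expressions can be toggled through settings; a rough sketch, assuming a recent ClickHouse version where these settings exist (names and defaults vary between versions):

    -- Enable runtime compilation of frequently used expressions.
    SET compile_expressions = 1;

    -- Compile an expression only after it has been executed at least this many times.
    SET min_count_to_compile_expression = 3;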

Approximate calculation

Approximate calculation greatly improves query performance at the cost of some accuracy. When processing massive data, approximate calculation is even more valuable.

ClickHouse implements several kinds of approximate calculation (sketched below):

  • Approximate aggregate functions for distinct counts, medians, quantiles, and so on;
  • The CREATE TABLE DDL supports the SAMPLE BY clause, enabling sampled processing of the data.
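
A minimal sketch of both flavors, using invented table names; note that the SAMPLE BY expression must also appear in the ORDER BY key:

    -- Approximate aggregates: uniq() estimates distinct counts, quantile() estimates percentiles.
    SELECT uniq(UserID) AS approx_users,
           quantile(0.5)(ResponseTime) AS median_latency
    FROM requests;

    -- Sampling is declared when the table is created ...
    CREATE TABLE requests_sampled
    (
        EventDate    Date,
        UserID       UInt64,
        ResponseTime UInt32
    )
    ENGINE = MergeTree
    ORDER BY (EventDate, intHash32(UserID))
    SAMPLE BY intHash32(UserID);

    -- ... so that queries can read only a fraction of the rows and scale the result.
    SELECT count() * 10 FROM requests_sampled SAMPLE 0.1;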

Complex data type support

ClickHouse also provides composite data types such as array, JSON, tuple, and set, supporting flexible changes to the business schema (sketched below).
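
A minimal sketch using Array, Tuple, and Map columns (names invented; the Map type and JSON support require relatively recent ClickHouse versions):

    CREATE TABLE user_events
    (
        UserID UInt64,
        Tags   Array(String),            -- variable-length list per row
        GeoPos Tuple(Float64, Float64),  -- fixed-shape pair, e.g. (lat, lon)
        Props  Map(String, String)       -- free-form key/value attributes
    )
    ENGINE = MergeTree
    ORDER BY UserID;

    -- Array functions such as has() query nested data directly.
    SELECT UserID FROM user_events WHERE has(Tags, 'premium');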

Conclusion

In recent years ClickHouse has developed rapidly, and both the community and major enterprises are adopting it. Starting from the requirements of OLAP scenarios, this article has introduced the main design of ClickHouse’s storage layer and computing layer. ClickHouse implements most of today’s mainstream data-analysis techniques and has clear technical advantages:

Extreme query performance: open benchmarks show speedups on the order of 100x over traditional approaches, together with high-throughput, real-time import at 50 MB-200 MB/s.

Massive data storage at extremely low cost: with well-designed columnar storage and efficient compression algorithms plus codecs, it achieves compression ratios of up to ten times, greatly increasing the amount of data a single machine can store and process and greatly reducing cost. It is a practical solution for a massive data warehouse.

Simple, flexible, and powerful: it provides comprehensive SQL support that is easy to pick up, flexible data types such as JSON, map, and array to adapt to rapid business changes, and approximate calculation plus probabilistic data structures for massive data processing.

Compared with other analytical technologies in the open source community, such as Druid, Presto, Impala, Kylin, and ElasticSearch, ClickHouse provides a complete solution: it has its own storage and compute (no dependence on other storage components), implements high availability entirely on its own, and supports the full SQL syntax including JOINs, which gives it clear technical advantages. Compared with the Hadoop ecosystem, it makes big-data processing easier to use in a database-like way, with a lower learning curve and higher flexibility. The community is still developing rapidly, and I believe more and more useful features will appear in the future.

Key Takeaways

• Column oriented, you only pay for what you select
• Each column can potentially be processed in parallel
• Carefully crafted code makes use of vectorisation instructions
• Different table engines fit for different needs
• Horizontally scalable
