ClickHouse is an open source, columnar-oriented database that’s been developed primarily by engineers at Yandex. Yandex is one of Europe’s largest Internet-focused businesses. There search engine alone was receiving 150 million searches a day in 2012. ClickHouse has been deployed among a number of their businesses including their Metrica offering which is the world’s second largest web analytics platform. Outside of Yandex, ClickHouse has also been deployed at CERN where it was used to analyse events from the Large Hadron Collider.
ClickHouse is very feature-rich. It supports 16 different table engines, its CLI includes animated progress indicators and syntax highlighting and its performance puts it into its own class among open source offerings.
I have given some basic overall introduction about clickhouse and its features in my previous blog post here
Although clickhouse is not drop in replacement for for something like PostgreSQL. It lacks transactions, full-fledged UPDATE and DELETE statements and calls like DROP TABLE won’t work on tables above 54 GB without first changing the server’s configuration. But it is worth note that it will be a very respectable data store.
Why TPC-DS benchmark ?
There are several benchmarks available for clickhouse , In this post i try to push to the boundaries of clickhouse with the available hardware resources using TPC-DS data and queries.
What is TPC-DS?
The Transaction Processing Performance Council (TPC) is a benchmark model decision support system. Several major firms are member of TPC. You can get more information about TPC and members in the official website.
I have personally used TPC-DS to build one of machine learning model and compare results across different environments.
In this post I will explore how to generate test data and test queries using dsdgen and dsqgen utilities on a Centos machine against the product supplier snowflake-type schema as well as how to load test data into the created database in order to run some queries TPC-DS defines. I don’t want to go too much into details as TPC already provides a very comprehensive overview of the database and schema it provides, along with detail description of constraints and assumptions. This post is more focused on how to generate the required components used to evaluate the target platform(Clickhouse) for TPC-DS data sets and data loading.
Steps to Generate and Load TPC-DS Data into Clickhouse Server
Below are the steps to generate and load TPC-DS data into Clickhouse server:
I used this tool kit
Install git and other tools you need with the following command
sudo yum install gcc make flex bison byacc git
Now clone the tools needed for generating dataset
git clone https://github.com/gregrahn/tpcds-kit.git
after downloading , compile it by CD in to it and
cd tpcds-kit/tools and make OS=LINUX
and hit enter
After the above steps I created some scripts in order to generate data based on my requirement for clickhouse
The hardware used in this test contain 8 vCPUs, 32 GB of RAM, 2TB of SSD storage and support 100 Mbps networking. I’ll be using CentOS Linux 7 (Core)Server for the operating system.
Created a seperate directory to store the generated data
vi datagen-ch.sh #! /bin/bash cd /root/tpcds/tpcds-kit/tools/ ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 1 & ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 2 & ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 3 & ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 4 & ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 5 & ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 6 & ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 7 & ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 8 &
I am going to run <dsdgen> executable tool kit that we got from github which was complied. I generate 150 GB of data in to tpcds-data dir that I created.
The tpc-ds consist of 24 tables.Before we do anything with the script we have to give execute permission
chmod +x datagen-ch.sh
Now we all set to generate data , Since the data set i generate was large enough to take some time i ran it in background mode
nohup ./datagen-ch.sh &
To make sure it was running
ps-ef | grep datagen-ch
It took almost 3 hrs to generate data, below will show the generated data.
[root@aavin-dev tpcds-data]# ls call_center_1_8.dat customer_demographics_5_8.dat ship_mode_1_4.dat warehouse_1_4.dat catalog_page_1_8.dat customer_demographics_6_8.dat ship_mode_1_8.dat warehouse_1_8.dat catalog_returns_1_8.dat customer_demographics_7_8.dat store_1_4.dat web_page_1_4.dat catalog_returns_2_8.dat customer_demographics_8_8.dat store_1_8.dat web_page_1_8.dat catalog_returns_3_8.dat date_dim_1_4.dat store_returns_1_4.dat web_returns_1_4.dat catalog_returns_4_8.dat date_dim_1_8.dat store_returns_1_8.dat web_returns_1_8.dat catalog_returns_5_8.dat dbgen_version_1_4.dat store_returns_2_4.dat web_returns_2_4.dat catalog_returns_6_8.dat dbgen_version_1_8.dat store_returns_2_8.dat web_returns_2_8.dat catalog_returns_7_8.dat household_demographics_1_4.dat store_returns_3_4.dat web_returns_3_4.dat catalog_returns_8_8.dat household_demographics_1_8.dat store_returns_3_8.dat web_returns_3_8.dat catalog_sales_1_8.dat income_band_1_4.dat store_returns_4_4.dat web_returns_4_4.dat catalog_sales_2_8.dat income_band_1_8.dat store_returns_4_8.dat web_returns_4_8.dat catalog_sales_3_8.dat inventory_1_4.dat store_returns_5_8.dat web_returns_5_8.dat catalog_sales_4_8.dat inventory_1_8.dat store_returns_6_8.dat web_returns_6_8.dat catalog_sales_5_8.dat inventory_2_4.dat store_returns_7_8.dat web_returns_7_8.dat catalog_sales_6_8.dat inventory_2_8.dat store_returns_8_8.dat web_returns_8_8.dat catalog_sales_7_8.dat inventory_3_4.dat store_sales_1_4.dat web_sales_1_4.dat catalog_sales_8_8.dat inventory_3_8.dat store_sales_1_8.dat web_sales_1_8.dat customer_1_4.dat inventory_4_4.dat store_sales_2_4.dat web_sales_2_4.dat customer_1_8.dat inventory_4_8.dat store_sales_2_8.dat web_sales_2_8.dat customer_address_1_4.dat inventory_5_8.dat store_sales_3_4.dat web_sales_3_4.dat customer_address_1_8.dat inventory_6_8.dat store_sales_3_8.dat web_sales_3_8.dat customer_demographics_1_4.dat inventory_7_8.dat store_sales_4_4.dat web_sales_4_4.dat customer_demographics_1_8.dat inventory_8_8.dat store_sales_4_8.dat web_sales_4_8.dat customer_demographics_2_4.dat item_1_4.dat store_sales_5_8.dat web_sales_5_8.dat customer_demographics_2_8.dat item_1_8.dat store_sales_6_8.dat web_sales_6_8.dat customer_demographics_3_4.dat promotion_1_4.dat store_sales_7_8.dat web_sales_7_8.dat customer_demographics_3_8.dat promotion_1_8.dat store_sales_8_8.dat web_sales_8_8.dat customer_demographics_4_4.dat reason_1_4.dat time_dim_1_4.dat web_site_1_4.dat customer_demographics_4_8.dat reason_1_8.dat time_dim_1_8.dat web_site_1_8.dat
With this we come to end of this blog post …will discuss loading and running queries in next post….here0