TPC-DS Benchmark On Clickhouse Part 1

Table of Contents

ClickHouse

ClickHouse is an open source, columnar-oriented database that’s been developed primarily by engineers at Yandex. Yandex is one of Europe’s largest Internet-focused businesses. There search engine alone was receiving 150 million searches a day in 2012. ClickHouse has been deployed among a number of their businesses including their Metrica offering which is the world’s second largest web analytics platform. Outside of Yandex, ClickHouse has also been deployed at CERN where it was used to analyse events from the Large Hadron Collider.

ClickHouse is very feature-rich. It supports 16 different table engines, its CLI includes animated progress indicators and syntax highlighting and its performance puts it into its own class among open source offerings.

I have given some basic overall introduction about clickhouse and its features in my previous blog post here

Although clickhouse is not drop in replacement for for something like PostgreSQL. It lacks transactions, full-fledged UPDATE and DELETE statements and calls like DROP TABLE won’t work on tables above 54 GB without first changing the server’s configuration. But it is worth note that it will be a very respectable data store.

Why TPC-DS benchmark ?

There are several benchmarks available for clickhouse , In this post i try to push to the boundaries of clickhouse with the available hardware resources using TPC-DS data and queries.

What is TPC-DS?

The Transaction Processing Performance Council (TPC) is a benchmark model decision support system. Several major firms are member of TPC. You can get more information about TPC and members in the official website.

I have personally used TPC-DS to build one of machine learning model and compare results across different environments.

In this post I will explore how to generate test data and test queries using dsdgen and dsqgen utilities on a Centos machine against the product supplier snowflake-type schema as well as how to load test data into the created database in order to run some queries TPC-DS defines. I don’t want to go too much into details as TPC already provides a very comprehensive overview of the database and schema it provides, along with detail description of constraints and assumptions. This post is more focused on how to generate the required components used to evaluate the target platform(Clickhouse) for TPC-DS data sets and data loading.

Steps to Generate and Load TPC-DS Data into Clickhouse Server

Below are the steps to generate and load TPC-DS data into Clickhouse server:

I used this tool kit

Install git and other tools you need with the following command

sudo yum install gcc make  flex  bison  byacc git

Now clone the tools needed for generating dataset

git clone https://github.com/gregrahn/tpcds-kit.git

after downloading , compile it by CD in to it and

cd tpcds-kit/tools and make OS=LINUX

and hit enter

After the above steps I created some scripts in order to generate data based on my requirement for clickhouse

The hardware used in this test contain 8 vCPUs, 32 GB of RAM, 2TB of SSD storage and support 100 Mbps networking. I’ll be using CentOS Linux 7 (Core)Server for the operating system.

Data Generation

Created a seperate directory to store the generated data

mkdir tpcds-data

vi datagen-ch.sh

#! /bin/bash

cd /root/tpcds/tpcds-kit/tools/
./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 1 &
./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 2 &
./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 3 &
./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 4 &
./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 5 &
./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 6 &
./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 7 &
./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 8 &

I am going to run <dsdgen> executable tool kit that we got from github which was complied. I generate 150 GB of data in to tpcds-data dir that I created.

The tpc-ds consist of 24 tables.Before we do anything with the script we have to give execute permission

chmod +x datagen-ch.sh

Now we all set to generate data , Since the data set i generate was large enough to take some time i ran it in background mode

nohup ./datagen-ch.sh &

To make sure it was running

ps-ef | grep datagen-ch

It took almost 3 hrs to generate data, below will show the generated data.

[root@aavin-dev tpcds-data]# ls
call_center_1_8.dat            customer_demographics_5_8.dat   ship_mode_1_4.dat      warehouse_1_4.dat
catalog_page_1_8.dat           customer_demographics_6_8.dat   ship_mode_1_8.dat      warehouse_1_8.dat
catalog_returns_1_8.dat        customer_demographics_7_8.dat   store_1_4.dat          web_page_1_4.dat
catalog_returns_2_8.dat        customer_demographics_8_8.dat   store_1_8.dat          web_page_1_8.dat
catalog_returns_3_8.dat        date_dim_1_4.dat                store_returns_1_4.dat  web_returns_1_4.dat
catalog_returns_4_8.dat        date_dim_1_8.dat                store_returns_1_8.dat  web_returns_1_8.dat
catalog_returns_5_8.dat        dbgen_version_1_4.dat           store_returns_2_4.dat  web_returns_2_4.dat
catalog_returns_6_8.dat        dbgen_version_1_8.dat           store_returns_2_8.dat  web_returns_2_8.dat
catalog_returns_7_8.dat        household_demographics_1_4.dat  store_returns_3_4.dat  web_returns_3_4.dat
catalog_returns_8_8.dat        household_demographics_1_8.dat  store_returns_3_8.dat  web_returns_3_8.dat
catalog_sales_1_8.dat          income_band_1_4.dat             store_returns_4_4.dat  web_returns_4_4.dat
catalog_sales_2_8.dat          income_band_1_8.dat             store_returns_4_8.dat  web_returns_4_8.dat
catalog_sales_3_8.dat          inventory_1_4.dat               store_returns_5_8.dat  web_returns_5_8.dat
catalog_sales_4_8.dat          inventory_1_8.dat               store_returns_6_8.dat  web_returns_6_8.dat
catalog_sales_5_8.dat          inventory_2_4.dat               store_returns_7_8.dat  web_returns_7_8.dat
catalog_sales_6_8.dat          inventory_2_8.dat               store_returns_8_8.dat  web_returns_8_8.dat
catalog_sales_7_8.dat          inventory_3_4.dat               store_sales_1_4.dat    web_sales_1_4.dat
catalog_sales_8_8.dat          inventory_3_8.dat               store_sales_1_8.dat    web_sales_1_8.dat
customer_1_4.dat               inventory_4_4.dat               store_sales_2_4.dat    web_sales_2_4.dat
customer_1_8.dat               inventory_4_8.dat               store_sales_2_8.dat    web_sales_2_8.dat
customer_address_1_4.dat       inventory_5_8.dat               store_sales_3_4.dat    web_sales_3_4.dat
customer_address_1_8.dat       inventory_6_8.dat               store_sales_3_8.dat    web_sales_3_8.dat
customer_demographics_1_4.dat  inventory_7_8.dat               store_sales_4_4.dat    web_sales_4_4.dat
customer_demographics_1_8.dat  inventory_8_8.dat               store_sales_4_8.dat    web_sales_4_8.dat
customer_demographics_2_4.dat  item_1_4.dat                    store_sales_5_8.dat    web_sales_5_8.dat
customer_demographics_2_8.dat  item_1_8.dat                    store_sales_6_8.dat    web_sales_6_8.dat
customer_demographics_3_4.dat  promotion_1_4.dat               store_sales_7_8.dat    web_sales_7_8.dat
customer_demographics_3_8.dat  promotion_1_8.dat               store_sales_8_8.dat    web_sales_8_8.dat
customer_demographics_4_4.dat  reason_1_4.dat                  time_dim_1_4.dat       web_site_1_4.dat
customer_demographics_4_8.dat  reason_1_8.dat                  time_dim_1_8.dat       web_site_1_8.dat

With this we come to end of this blog post …will discuss loading and running queries in next post….here