Columnar database vs Row based database

Table of Contents

Columnar database vs Row based database

Columnar database is built for Data Analytics(Small/Big/Bigger) data – OLAP

Online Analytical Processing

Let us see how analytical queries get benefited by storing data in column oriented database against traditional Row-based transactional database systems like MySQL,SQL Server,PostgreSQL etc

Let us aasume a scenario for exapmle that your manager wanted to know the average sales of your retail company across the stores.

If it was a Row- based database then the query has to look up row by row

row 1
  row 2
    row 3
     row 4
          .
          .
          .
          row n

But in column oriented database(DB),it simply jumps in to sales data column and skip over all non related data. This drastically improves speed of fetching data

let us see the schema of ” SALES” table

DATE	ITEM_NO	ITEM_NAME	COST	QTY	TOTAL_COST
01/01/2019	11011	levis_jeans	$40	2	$80
01/01/2019	32422	gap_t_shirt	$45	5	$225
01/01/2019	34867	Iphone_x	$1300	1	$1300
01/01/2019	68432	DSLR	$3000	1	$3000

Row based DB systems Physical Structure layout

Block 1 --> 01/01/2019 11011 levis_jeans $40 2 $80
Block 2 --> 01/01/2019 32422 gap_t_shirt $45 5 $225
Block 3 --> 01/01/2019 34867 Iphone_x $1300 1 $1300
Block 4 --> 01/01/2019 68432 DSLR $3000 1 $3000
.
.
.

The value or data related to a row are stored next to each other

Columnar DB systems Physical Structure Layout

Block 1 --> 01/01/2019 01/01/2019 01/01/2019 01/01/2019 01/01/2019 ...
Block 2 --> 11011 32422 34867 68432 ...
Block 3 --> levis_jeans gap_t_shirt Iphone_x DSLR ...
Block 4 --> $40 $45 $1300 $3000 ...
Block 5 --> 2 5 1 1 ...
Block 6 --> $80 $225 $1300 $3000 ...

The value or data from the same column stored together

Examples of Column-oriented Databases are Vertica , Paraccel(Action matrix) , Amazon Redshift , Sybase IQ , Exasol , Infobright , InfiniDB , LucidDB , SAP HANA , Google-Dremel , Google-PowerDrill , Druid etc…

Physical Storage of Data in Disks

Different orders for storing data are better suited to different scenarios

The data access scenario refers to what queries are made. How often , How much data is read for each type of query
1. Rows
2. columns
3. Bytes
The relationship between reading and updating data
The working size of data and how locally it used
Whether transactions are used and how isolated they are
Requirements for data replication and logical integrity
Requirements for latency and throughput for each type of query

The higher the load on the system the more important is to customize the system to the scenario and the more specific this customization becomes..

There is no system that is equally well-suited to significantly different scenarios
If a system is adaptable to a wide set of scenarios , under a high load , then the system will handle all the scenarios equally(poorly)

Where to use Row Vs Column oriented database

Row oriented Database	Column oriented Database
Mainly used for records that users need to look up/change all the time	The vast majority of requests are for read access only
Mainly used for Online Transactions Processing(OLTP)	Mainly used for Online Analytical Processing(OLAP)
Used for ACID Transactions	Used for reads , quite a large number of rows are extracted from the DB but only small subset of columns(No ACID)
Tables are not wide and only few columns — since the data will be normalized	Tables are WIDE – Contains large number of columns – Tables are De- Normalised
Mostly for transactions – Financial Transactions,User update, CRM … etc	There are no or Minimal Transactions with updates in large batches(>1000 rows) or it is immutable data(no updates at all)
A query result may be larger than the source data itself — joins are fairly easy	A query result is smaller than the source data — Joins are costly(usually not performed)
SORT/SEARCH queries will perform more efficiently on a small volume of data , but becomes more costly in large volume of data( Tuning has to be done)	SORT/SEARCH queries will perform more efficiently on a large volume of data( 10x – 1000x speed can be achieved)
Since in Row based database the values in a row is stored in same block with different datatypes for different columns , compression is not achieved or poor compression ratio.Aggregation queries will take more resources (ORDER/GROUP BY)	Since in Column based database the values in a column are stored in same block with same datatypes for a column, compression is highly achieved with good compression ratio.Aggregation queries will perform faster(ORDER/GROUP BY)

Columnar database vs Row based database

Columnar database vs Row based database

Row based DB systems Physical Structure layout

Columnar DB systems Physical Structure Layout

Physical Storage of Data in Disks

Different orders for storing data are better suited to different scenarios

Where to use Row Vs Column oriented database

1 Comment

Leave a Comment Cancel reply