Columnar database vs Row based database

Columnar database vs Row based database

Columnar database vs Row based database

Columnar database is built for Data Analytics(Small/Big/Bigger) data – OLAP

Online Analytical Processing

Let us see how analytical queries get benefited by storing data in column oriented database against traditional Row-based transactional database systems like MySQL,SQL Server,PostgreSQL etc

Let us aasume a scenario for exapmle that your manager wanted to know the average sales of your retail company across the stores.

If it was a Row- based database then the query has to look up row by row

row 1
  row 2
    row 3
     row 4
          .
          .
          .
          row n

But in column oriented database(DB),it simply jumps in to sales data column and skip over all non related data. This drastically improves speed of fetching data

let us see the schema of ” SALES” table

DATEITEM_NOITEM_NAMECOSTQTYTOTAL_COST
01/01/201911011levis_jeans$402$80
01/01/201932422gap_t_shirt$455$225
01/01/201934867Iphone_x$13001$1300
01/01/201968432DSLR$30001$3000

Row based DB systems Physical Structure layout

Block 1 --> 01/01/2019 11011 levis_jeans $40 2 $80
Block 2 --> 01/01/2019 32422 gap_t_shirt $45 5 $225
Block 3 --> 01/01/2019 34867 Iphone_x $1300 1 $1300
Block 4 --> 01/01/2019 68432 DSLR $3000 1 $3000
.
.
.

The value or data related to a row are stored next to each other

Columnar DB systems Physical Structure Layout

Block 1 --> 01/01/2019 01/01/2019 01/01/2019 01/01/2019 01/01/2019 ...
Block 2 --> 11011 32422 34867 68432 ...
Block 3 --> levis_jeans gap_t_shirt Iphone_x DSLR ...
Block 4 --> $40 $45 $1300 $3000 ...
Block 5 --> 2 5 1 1 ...
Block 6 --> $80 $225 $1300 $3000 ...

The value or data from the same column stored together

Examples of Column-oriented Databases are Vertica , Paraccel(Action matrix) , Amazon Redshift , Sybase IQ , Exasol , Infobright , InfiniDB , LucidDB , SAP HANA , Google-Dremel , Google-PowerDrill , Druid etc…

Physical Storage of Data in Disks

Physical Storage of Data In Disks

Different orders for storing data are better suited to different scenarios

  1. The data access scenario refers to what queries are made. How often , How much data is read for each type of query
    1. Rows
    2. columns
    3. Bytes
  2. The relationship between reading and updating data
  3. The working size of data and how locally it used
  4. Whether transactions are used and how isolated they are
  5. Requirements for data replication and logical integrity
  6. Requirements for latency and throughput for each type of query

The higher the load on the system the more important is to customize the system to the scenario and the more specific this customization becomes..

  • There is no system that is equally well-suited to significantly different scenarios
  • If a system is adaptable to a wide set of scenarios , under a high load , then the system will handle all the scenarios equally(poorly)

Where to use Row Vs Column oriented database

Row oriented DatabaseColumn oriented Database
Mainly used for records that users need to look up/change all the timeThe vast majority of requests are for read access only
Mainly used for Online Transactions Processing(OLTP)Mainly used for Online Analytical Processing(OLAP)
Used for ACID TransactionsUsed for reads , quite a large number of rows are extracted from the DB but only small subset of columns(No ACID)
Tables are not wide and only few columns — since the data will be normalizedTables are WIDE – Contains large number of columns – Tables are De- Normalised
Mostly for transactions – Financial Transactions,User update, CRM … etcThere are no or Minimal Transactions with updates in large batches(>1000 rows) or it is immutable data(no updates at all)
A query result may be larger than the source data itself — joins are fairly easyA query result is smaller than the source data — Joins are costly(usually not performed)
SORT/SEARCH queries will perform more efficiently on a small volume of data , but becomes more costly in large volume of data( Tuning has to be done)SORT/SEARCH queries will perform more efficiently on a large volume of data( 10x – 1000x speed can be achieved)
Since in Row based database the values in a row is stored in same block with different datatypes for different columns , compression is not achieved or poor compression ratio.Aggregation queries will take more resources (ORDER/GROUP BY) Since in Column based database the values in a column are stored in same block with same datatypes for a column, compression is highly achieved with good compression ratio.Aggregation queries will perform faster(ORDER/GROUP BY)
0

1 Comment

Leave a Comment

Your email address will not be published. Required fields are marked *