Table of Contents
Columnar database vs Row based database
Columnar database is built for Data Analytics(Small/Big/Bigger) data – OLAP
Online Analytical Processing
Let us see how analytical queries get benefited by storing data in column oriented database against traditional Row-based transactional database systems like MySQL,SQL Server,PostgreSQL etc
Let us aasume a scenario for exapmle that your manager wanted to know the average sales of your retail company across the stores.
If it was a Row- based database then the query has to look up row by row
row 1
row 2
row 3
row 4
.
.
.
row n
But in column oriented database(DB),it simply jumps in to sales data column and skip over all non related data. This drastically improves speed of fetching data
let us see the schema of ” SALES” table
DATE | ITEM_NO | ITEM_NAME | COST | QTY | TOTAL_COST |
01/01/2019 | 11011 | levis_jeans | $40 | 2 | $80 |
01/01/2019 | 32422 | gap_t_shirt | $45 | 5 | $225 |
01/01/2019 | 34867 | Iphone_x | $1300 | 1 | $1300 |
01/01/2019 | 68432 | DSLR | $3000 | 1 | $3000 |
Row based DB systems Physical Structure layout
Block 1 --> 01/01/2019 11011 levis_jeans $40 2 $80
Block 2 --> 01/01/2019 32422 gap_t_shirt $45 5 $225
Block 3 --> 01/01/2019 34867 Iphone_x $1300 1 $1300
Block 4 --> 01/01/2019 68432 DSLR $3000 1 $3000
.
.
.
The value or data related to a row are stored next to each other
Columnar DB systems Physical Structure Layout
Block 1 --> 01/01/2019 01/01/2019 01/01/2019 01/01/2019 01/01/2019 ...
Block 2 --> 11011 32422 34867 68432 ...
Block 3 --> levis_jeans gap_t_shirt Iphone_x DSLR ...
Block 4 --> $40 $45 $1300 $3000 ...
Block 5 --> 2 5 1 1 ...
Block 6 --> $80 $225 $1300 $3000 ...
The value or data from the same column stored together
Examples of Column-oriented Databases are Vertica , Paraccel(Action matrix) , Amazon Redshift , Sybase IQ , Exasol , Infobright , InfiniDB , LucidDB , SAP HANA , Google-Dremel , Google-PowerDrill , Druid etc…
Physical Storage of Data in Disks
Different orders for storing data are better suited to different scenarios
- The data access scenario refers to what queries are made. How often , How much data is read for each type of query
Rows
columns
Bytes
- The relationship between reading and updating data
- The working size of data and how locally it used
- Whether transactions are used and how isolated they are
- Requirements for data replication and logical integrity
- Requirements for latency and throughput for each type of query
The higher the load on the system the more important is to customize the system to the scenario and the more specific this customization becomes..
- There is no system that is equally well-suited to significantly different scenarios
- If a system is adaptable to a wide set of scenarios , under a high load , then the system will handle all the scenarios equally(poorly)
Where to use Row Vs Column oriented database
Row oriented Database | Column oriented Database |
Mainly used for records that users need to look up/change all the time | The vast majority of requests are for read access only |
Mainly used for Online Transactions Processing(OLTP) | Mainly used for Online Analytical Processing(OLAP) |
Used for ACID Transactions | Used for reads , quite a large number of rows are extracted from the DB but only small subset of columns(No ACID) |
Tables are not wide and only few columns — since the data will be normalized | Tables are WIDE – Contains large number of columns – Tables are De- Normalised |
Mostly for transactions – Financial Transactions,User update, CRM … etc | There are no or Minimal Transactions with updates in large batches(>1000 rows) or it is immutable data(no updates at all) |
A query result may be larger than the source data itself — joins are fairly easy | A query result is smaller than the source data — Joins are costly(usually not performed) |
SORT/SEARCH queries will perform more efficiently on a small volume of data , but becomes more costly in large volume of data( Tuning has to be done) | SORT/SEARCH queries will perform more efficiently on a large volume of data( 10x – 1000x speed can be achieved) |
Since in Row based database the values in a row is stored in same block with different datatypes for different columns , compression is not achieved or poor compression ratio.Aggregation queries will take more resources (ORDER/GROUP BY) | Since in Column based database the values in a column are stored in same block with same datatypes for a column, compression is highly achieved with good compression ratio.Aggregation queries will perform faster(ORDER/GROUP BY) |
1 Comment