One way or another, a situation arises when, over time, you have to work with huge tables on the database server. This may be due to your application suddenly becoming popular. Traffic and data is starting to grow, and your database gets more overloaded every day. So you have to find a way to split the database by sharding but how …
Let me explain it in this post.
what is sharding?
Sharding is a method of splitting and storing a single logical dataset in multiple databases. By distributing the data among multiple machines, a cluster of database systems can store larger dataset and handle additional requests. Sharding is necessary if a dataset is too large to be stored in a single database. Moreover, many sharding strategies allow additional machines to be added. Sharding allows a database cluster to scale along with its data and traffic growth.
Sharding can be done in two ways at the top level
- Horizontal Sharding
- Vertical Sharding
The distinction of horizontal vs vertical comes from the traditional tabular view of a database. A database can be split horizontally — storing rows of the same table in multiple database nodes or vertically — storing different tables & columns in a separate database.
Sharding usually begins with the largest and most loaded tables. They are allocated in a separate group and taken out on a separate server.
Vertical partitioning is very domain specific. You draw a logical split within your application data, storing them in different databases. It is almost always implemented at the application level — a piece of code routing reads and writes to a designated database.
In many databases, sharding is a implemented by default and the database knows how to store and retrieve data within a cluster. Almost all modern databases are natively sharded. Cassandra, HBase, HDFS, and MongoDB are popular distributed databases. Notable examples of non-sharded modern databases are Sqlite, Redis (spec in progress), Memcached, and Zookeeper.
There exist variety of strategies to distribute data into multiple databases. Each strategy has its own pros and cons depending on various assumptions a strategy makes. It is crucial to understand these assumptions and limitations while designing a database sharding.
Applications may need to search through many databases to find the requested data. These are called cross-partition operations and they tend to be widely inefficient. Another common problem is Hotspots – having uneven distribution of data . Hotspots largely counteract the benefits of sharding.
Do We Need Sharding ?
Sharding definitely adds additional programming and operational complexity to your application. You lose the convenience of accessing the application’s data in a single location. Managing multiple servers adds operational challenges. Before you begin, see whether sharding can be avoided or deferred.
If your application requires heavy read operations then you can add caches or database replicas. They provide additional read capacity without heavily modifying your application.
Vertically partition based on functionality. Binary blobs tend to occupy large amounts of space and are isolated within your application alternatively you can Store files in S3 by that you can reduce storage burden.
If for example your application requires full text search, tagging, and analytics these things can be done by separate databases specifically designed for that functionality like Elastic search for full text search and of Analytics(OLAP) one can choose any of those available in market like Clickhouse,Redshift,Bigquery.
Not everything may need to be sharded. Often, only few tables occupy a majority of the disk space. Very little is gained by sharding small tables with hundreds of rows. Focus on the large tables.
First principles considered before sharding
How the data is read — Databases are used to store and retrieve data. If we do not need to read data at all, we can simply write it to /dev/null. If we only need to batch process the data occasionally, we can append to a single file and periodically scan through them. Data retrieval strategies heavily influence the sharding strategy.
How the data is distributed — Once you have a cluster of machines acting together, it is important to ensure that data and read and write work is evenly distributed. Uneven load causes storage and performance hotspots. Some databases redistribute data dynamically, while others expect clients to evenly distribute and access data.
Once sharding is employed, redistributing data is an important problem. Once your database is sharded, it is likely that the data is growing rapidly. Adding an additional node becomes a regular routine. It may require changes in configuration and moving large amounts of data between nodes. It adds both performance and operational burden.
Shard or Partition Key is a portion of primary key which determines how data should be distributed. A partition key allows you to retrieve and modify data efficiently by routing operations to the correct database. Entries with the same partition key are stored in the same node. A logical shard is a collection of data sharing the same partition key. A database node, sometimes referred as a physical shard, contains multiple logical shards.
As we discussed earlier we are going to split the table (Big Table) column wise,each column will be placed as a separate logical shard. Let us consider MySQL database server.
Preparation for sharding
Let us Make sure that the application code uses a separate connection for all calls to the selected table. Before turning on a separate server.
$con = mysql_connect('10.10.0.1'); $con_photos = mysql_connect('10.10.0.1'); #... mysql_query('SELECT * FROM users ...', $con); #... mysql_query('SELECT * FROM photos ...', $con_photos);
Create a copy
Next, we need to create a full copy of the selected table columns on the new server. This can be achieved simply by dumping the selected col but one has to stop the entire site for the period of creating the copy usually this is not advisable. We can achieve this with out stopping by using replication. In this case, the replica of the necessary tables is configured on the new server. The old server will act as the Master, and the new one will be the Slave.
After that, just switch the connection to the new server:
$con = mysql_connect('10.10.0.1'); $con_photos = mysql_connect('10.10.0.2'); #...
Rewrite logic from Application level
In the application, we will work with different joins for different tables:
$con = mysql_connect('10.10.0.1'); $con_photos = mysql_connect('10.10.0.2'); #... mysql_query('SELECT * FROM users ...', $con); #... mysql_query('SELECT * FROM photos ...', $con_photos);
This means that we will have as many connections as shards. Their number may be large, so it is better to use the technique of lazy resource loading to establish connections.
The JOIN of two tables on different servers is always a problem. There are two options for solving this problem.
- Table groups
- Application level selection
Often JOIN queries occur only between a group of tables that are logically related to each other. For example, tables that store data about albums and photos of users:
- photos list of photos, contains album_id
- albums album list
In this case, it is more convenient is to immediately place the entire group of these tables on a single server or node. This will allow you to use JOIN within this group.
The functionality of JOIN will have to be transferred to the application. For example, a query like this, assuming that photos and albums are stored in same node
SELECT * FROM photos p JOIN albums a ON (a.id = p.album_id) WHERE a.user_id = 1 # Select photos of user_id = 1
The likelihood of database server failures increases with their number.
To ensure fault tolerance, you must back up the database server using replication . In this case, each shard will have a backup server with a copy of the data.
$con = mysql_connect('10.10.0.1'); $con_photos = mysql_connect('10.10.0.2'); $con_photos = mysql_connect('10.10.0.3'); # switch connection from application level logic #...
Vertical sharding is a one solution for scaling databases. Separating logically related table groups into separate shards will even allow you to use JOINs. Be sure to use a redundancy scheme to increase fault tolerance in sharding. Start with the largest tables. For particularly large tables with millions of rows, use the horizontal sharding technique .
Apart form sharding one has to consider particular business use case ,as mentioned earlier in this post there are systems designed for particular use case , if one has to do analytics then there are systems designed for OLAP purposes like Big Query, Redshift, ClickHouse to mention few…0