Database Sharding

What is Database Sharding?

  • It is the process of storing a large database across multiple machines by splitting data into smaller chunks called shards.

Importance of Sharding

  • As an application grows, the number of users and the volume of data grow with it. Too many concurrent requests against a single database can slow the application down and degrade the user experience.

  • Sharding addresses this by splitting the data so that smaller datasets can be processed in parallel across shards.
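
The idea of parallel processing across shards can be sketched in a few lines. This is only an illustration under stated assumptions: the shards are plain in-memory lists standing in for separate machines, and the names SHARDS, query_shard, and scatter_gather are hypothetical, not part of any database product.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory shards: each list stands in for one database machine.
SHARDS = [
    [{"user_id": 1, "total": 40}, {"user_id": 4, "total": 15}],
    [{"user_id": 2, "total": 75}],
    [{"user_id": 3, "total": 20}, {"user_id": 6, "total": 90}],
]


def query_shard(rows, min_total):
    """Scan a single shard; each call touches only that shard's rows."""
    return [row for row in rows if row["total"] >= min_total]


def scatter_gather(min_total):
    """Send the same query to every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(lambda rows: query_shard(rows, min_total), SHARDS))
    merged = [row for partial in partials for row in partial]
    return sorted(merged, key=lambda row: row["user_id"])


if __name__ == "__main__":
    print(scatter_gather(40))
```

Each worker scans only its own small dataset, and the caller merges the partial results. A real system would send the query over the network to each shard instead of scanning local lists.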

Benefits of Database Sharding

  1. Improve response time:- Data retrieval takes longer on a single large database because each query has more rows to search. Each shard holds fewer rows, so retrieving data from it takes less time.

  2. Avoid total service outage:- If the machine hosting an unsharded database fails, every application that depends on that database fails with it. Sharding limits the damage by distributing shards across different machines: if one shard becomes unavailable, the other shards keep serving their data. The failed shard's own data stays reachable only if that shard is also replicated, which is why sharding is commonly combined with replication.

  3. Scale efficiently:- New shards can be added at runtime, without shutting the application down for maintenance.

Methods of Database Sharding

  1. Range-based sharding:- It splits database rows based on ranges of a chosen value, and the database designer assigns a shard key to each range. Depending on how the data values are distributed, range-based sharding can overload a single physical node.

  2. Hashed sharding:- It applies a hash function to a value from each row and uses the resulting hash value as the shard key, which determines the physical shard where the row is stored. It does not group data based on the meaning of the information.

  3. Directory sharding:- It uses a lookup table to match database information with the corresponding physical shard. Every read and write depends on that table, so routing fails if the lookup table contains wrong information.

  4. Geo sharding:- It splits and stores information according to geographical location. Retrieval is faster because the data sits physically closer to the users who request it.
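
The four methods above differ mainly in how a row is mapped to a shard number. The following is a minimal sketch of one routing function per method. All names and values (NUM_SHARDS, RANGE_BOUNDS, DIRECTORY, GEO_SHARDS, the tenant and region labels) are assumptions made for illustration, and SHA-256 is just one possible choice of hash function.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # assumed number of physical shards

# Range-based: sorted lower bounds of shards 1..3 (shard 0 takes everything below "G").
# Resulting ranges by first letter: A-F -> 0, G-M -> 1, N-S -> 2, T-Z -> 3.
RANGE_BOUNDS = ["G", "N", "T"]


def range_shard(name):
    """Pick the shard whose range contains the first letter of the name."""
    return bisect.bisect_right(RANGE_BOUNDS, name[0].upper())


def hash_shard(key):
    """Hash the key and reduce the hash value to a shard number."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Directory: an explicit lookup table (hypothetical tenants).
DIRECTORY = {"tenant-a": 0, "tenant-b": 2}


def directory_shard(tenant):
    """Look the shard up; a missing or wrong entry breaks routing."""
    if tenant not in DIRECTORY:
        raise LookupError(f"no shard registered for {tenant!r}")
    return DIRECTORY[tenant]


# Geo: one shard per region (hypothetical region codes).
GEO_SHARDS = {"EU": 0, "US": 1, "APAC": 2}


def geo_shard(region, default=1):
    """Route by region, falling back to a default shard for unknown regions."""
    return GEO_SHARDS.get(region, default)


if __name__ == "__main__":
    print(range_shard("Alice"), hash_shard(42), directory_shard("tenant-b"), geo_shard("EU"))
```

Note that hash_shard uses a stable hash rather than Python's built-in hash(), whose output for strings can differ between interpreter runs and would therefore route the same key to different shards.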

Optimizing database sharding

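  1. Cardinality:- It describes the number of possible values of the shard key, which sets the maximum number of possible shards. A key with only two possible values, for example, can support at most two shards.

  2. Frequency:- It is the probability of storing specific information in a particular shard. If most rows share the same shard key value, the shard holding that value becomes overloaded.

  3. Monotonic change:- It is the rate at which shard key values change. A key that only increases or only decreases over time, such as a timestamp, tends to send new rows to the same shard under range-based sharding and so unbalances the shards.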
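
These three properties can be checked on a sample of candidate shard key values before committing to a key. The sketch below is one rough way to do that. The function names key_stats and is_monotonic and the sample data are hypothetical, and the check treats "monotonic" simply as "never decreasing".

```python
from collections import Counter


def key_stats(values):
    """Report cardinality and the share of rows held by the most frequent value."""
    counts = Counter(values)
    top_value, top_count = counts.most_common(1)[0]
    return {
        "cardinality": len(counts),            # distinct key values seen
        "top_value": top_value,                # most frequent key value
        "top_share": top_count / len(values),  # fraction of rows with that value
    }


def is_monotonic(values):
    """True if the values never decrease in insertion order."""
    return all(a <= b for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    countries = ["US"] * 6 + ["IN", "DE"]    # sample values of a candidate key
    order_ids = [1001, 1002, 1003, 1004]     # sample values of another candidate key
    print(key_stats(countries))
    print(is_monotonic(order_ids))
```

In this sample, the country column has only three distinct values and one of them covers three quarters of the rows, so it would make a poor shard key. The order IDs only ever increase, which suggests pairing them with hashed rather than range-based sharding.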

Alternatives to database sharding

  1. Vertical scaling:- It increases the computing power of a single machine. It is less costly than sharding, but there is a limit to how far the resources of one machine can be increased.

  2. Replication:- It makes exact copies of the database and stores them across different computers. This enables high availability.

  3. Partitioning:- It splits the database into multiple groups called partitions. Horizontal partitioning splits the database by rows. Vertical partitioning splits it by columns, placing different columns in different partitions.
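
The difference between horizontal and vertical partitioning is easiest to see on a tiny table. The sketch below models a table as a list of dictionaries. The sample rows, the column names, and the choice of "id modulo number of partitions" for the horizontal split are assumptions for illustration only.

```python
# Hypothetical table: each dict is one row.
ROWS = [
    {"id": 1, "name": "Ana", "email": "ana@example.com", "bio": "Writes about databases"},
    {"id": 2, "name": "Ben", "email": "ben@example.com", "bio": "Runs the ops team"},
    {"id": 3, "name": "Cho", "email": "cho@example.com", "bio": "Builds data pipelines"},
]


def horizontal_partition(rows, num_parts):
    """Split by rows: every partition keeps all columns but only some rows."""
    parts = [[] for _ in range(num_parts)]
    for row in rows:
        parts[row["id"] % num_parts].append(row)
    return parts


def vertical_partition(rows, column_groups, key="id"):
    """Split by columns: every partition keeps all rows but only some columns.

    The key column is repeated in each partition so rows can be joined back.
    """
    return [
        [{key: row[key], **{col: row[col] for col in group}} for row in rows]
        for group in column_groups
    ]


if __name__ == "__main__":
    print(horizontal_partition(ROWS, 2))
    print(vertical_partition(ROWS, [["name"], ["email", "bio"]]))
```

Sharding corresponds to the horizontal case, with the additional step of placing each partition on a different machine.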

Challenges in database sharding

  1. Data hotspots:- Some shards become unbalanced because data is distributed unevenly, so a few shards hold more data and receive more requests than the others.

  2. Operational complexity:- Instead of managing a single database, the team must manage multiple shards.

  3. Infrastructure cost:- Each physical shard is an additional machine, so costs rise as more shards are added.

  4. Application complexity:- Some database management systems have no built-in sharding support, so the application must split, store, and route the data itself.

Conclusion

Database sharding offers a powerful solution for managing large datasets and improving application performance by distributing data across multiple machines. While it enhances scalability and fault tolerance, optimizing sharding strategies and addressing associated challenges are crucial. With careful planning and management, businesses can leverage database sharding to meet the demands of modern data-intensive applications effectively.