Sharding: splits a large dataset across independent database instances (shards), each owning a range or hash of keys. This distributes write load and storage beyond what a single instance can handle.
2 / 5
What is a shard key and why is its choice critical?
Shard key: determines which shard stores each row. A key with low cardinality (e.g., a boolean) concentrates data; a sequential key (e.g., auto-increment) creates hot spots. Random or hash-based keys distribute evenly but complicate range queries.
3 / 5
What is a hot shard?
Hot shard: occurs when the shard key is skewed (e.g., a popular celebrity's data, or time-series with recent data always hitting the same shard). It defeats the purpose of sharding by concentrating load.
4 / 5
How do cross-shard queries (scatter-gather) affect performance?
Scatter-gather: a query that spans shard boundaries must be sent to all relevant shards, results are collected (gathered) and merged. Latency is bounded by the slowest shard, and the aggregation logic must handle partial failures.
5 / 5
What is resharding and why is it operationally complex?
Resharding: when a shard fills up or traffic outgrows the current split, data must be moved to new shards. This requires dual-write, data backfill, cutover coordination, and is typically the most disruptive operation in a sharded architecture.