5 exercises — Practice the English vocabulary used when discussing storage architecture: block, object, and file storage; SAN and NFS; distributed file systems; performance metrics; and data durability strategies.
Core Storage Systems vocabulary clusters
Storage types: block storage (raw volumes, iSCSI), object storage (S3-compatible, metadata, flat namespace), file storage (NFS, SMB, POSIX hierarchy), SAN (Storage Area Network), NAS (Network Attached Storage)
A solutions architect explains storage choices to a new team member: "For the virtual machines we provision raw block storage volumes — each VM sees it as a local disk. For our media pipeline, we use object storage: files are stored as objects with metadata, addressed by a unique key, and accessed over HTTP. For the shared analytics workspace where data scientists need a standard directory tree, we mount an NFS volume so they can use normal file paths." Which storage type is best suited for storing billions of images accessed via an HTTP API?
Object storage (e.g. Amazon S3, Google Cloud Storage, Azure Blob) is the right choice for large-scale unstructured data accessed over HTTP. Each file is stored as an object — a binary blob plus a metadata envelope (content-type, tags, custom headers) — addressed by a globally unique key. The namespace is flat (no real directories, only key prefixes that mimic paths). Key properties: near-infinite horizontal scalability; 11 nines of durability (99.999999999%) via erasure coding or multi-AZ replication; S3-compatible API widely supported by tools and SDKs. Block storage (SAN, EBS) provides raw volumes ideal for databases and VMs that need low-latency random I/O. File storage (NFS, EFS) provides a POSIX filesystem hierarchy ideal for shared workspaces. HDFS is optimised for large sequential reads in batch analytics, not random HTTP access of billions of small files.
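The flat namespace described above can be illustrated with a toy in-memory store. This is only a sketch: `FlatObjectStore` and its methods are hypothetical names invented for this example, not the API of S3 or any real service. It shows that the full key is the address and that "directories" are just a filter on key prefixes.

```python
class FlatObjectStore:
    """Toy in-memory object store: one flat dict of key -> (data, metadata)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **metadata):
        # The whole key is the address; "/" has no special meaning to the store.
        self._objects[key] = (bytes(data), dict(metadata))

    def get(self, key):
        # Returns the (data, metadata) pair stored under the exact key.
        return self._objects[key]

    def list_keys(self, prefix=""):
        # "Directories" are emulated by filtering on a key prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = FlatObjectStore()
store.put("images/2024/cat.jpg", b"\xff\xd8", content_type="image/jpeg", tag="pets")
store.put("images/2024/dog.jpg", b"\xff\xd8", content_type="image/jpeg")
store.put("logs/app.log", b"started", content_type="text/plain")

print(store.list_keys("images/"))  # ['images/2024/cat.jpg', 'images/2024/dog.jpg']
data, meta = store.get("images/2024/cat.jpg")
print(meta["content_type"])        # image/jpeg
```

A real object store exposes the same ideas over HTTP (PUT and GET on a key, list by prefix), which is why path-like keys look like folders in most tools even though no directory objects exist.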
2 / 5
A storage engineer is tuning a database server's disk configuration: "The bottleneck isn't throughput — we're not saturating the 2 GB/s bandwidth. The bottleneck is __________ : the database is doing 45,000 small random reads per second and the SSD is rated for only 40,000. Each read is tiny, so bandwidth is fine, but the drive can't keep up with the sheer number of operations." Which metric does the blank refer to?
IOPS (Input/Output Operations Per Second) counts how many individual read or write operations a storage device can handle each second, regardless of data size. It is the key metric for random-access workloads such as OLTP databases, where queries touch many small rows scattered across disk. The three fundamental storage performance metrics: IOPS — operations/second; critical for random I/O (databases, boot volumes). Throughput — MB/s or GB/s; critical for sequential I/O (video streaming, analytics). Latency — milliseconds (ms) or microseconds (µs) for a single operation to complete; critical for interactive workloads. A workload can be IOPS-bound (many small operations), bandwidth-bound (few large sequential reads), or latency-bound (interactive, requires immediate response). NVMe SSDs offer ~1 million IOPS; SATA SSDs ~100k; HDDs ~200. Cloud block storage (EBS gp3) allows you to provision IOPS independently of capacity.
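The relationship between the metrics is simple arithmetic: throughput is roughly IOPS multiplied by the size of each operation. The sketch below applies it to the scenario in the question. The 8 KiB read size is an assumption chosen for illustration (the scenario only says the reads are "tiny"), and the function names are hypothetical.

```python
def required_throughput_mb_s(iops, io_size_kib):
    """Bandwidth consumed by `iops` operations of `io_size_kib` KiB each, in MB/s."""
    bytes_per_second = iops * io_size_kib * 1024
    return bytes_per_second / 1_000_000


def bottleneck(iops_needed, io_size_kib, iops_limit, throughput_limit_mb_s):
    """Report which device limit the workload exceeds first, if any."""
    if iops_needed > iops_limit:
        return "IOPS"
    if required_throughput_mb_s(iops_needed, io_size_kib) > throughput_limit_mb_s:
        return "throughput"
    return "none"


# Scenario from the question: 45,000 random reads/s against a 40,000 IOPS drive
# with 2 GB/s (2000 MB/s) of bandwidth. The 8 KiB read size is assumed.
print(required_throughput_mb_s(45_000, 8))     # about 368.64 MB/s, far below 2000
print(bottleneck(45_000, 8, 40_000, 2000))     # IOPS

# A sequential workload: few operations, but 1 MiB each (also assumed numbers).
print(bottleneck(2_000, 1024, 40_000, 2000))   # throughput
```

The same drive is IOPS-bound for one workload and bandwidth-bound for the other, which is why both numbers appear on a storage data sheet.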
3 / 5
Match the term to its correct definition.
Term: erasure coding
Erasure coding is a forward error correction technique used in distributed storage to achieve high durability with lower storage overhead than full replication. How it works: data is split into k data shards and m parity shards (written as k+m). Any k of the k+m shards are sufficient to reconstruct the original data — so up to m shards can be lost (node failures, disk failures) and the data is still recoverable. Example: RS(6,3) splits data into 6 shards + 3 parity; tolerates 3 simultaneous failures; storage overhead is 50% (9 shards for 6 shards of data). Compare with 3× replication: tolerates 2 failures; storage overhead is 200%. Amazon S3 uses erasure coding across multiple availability zones to achieve 11 nines (99.999999999%) durability. Ceph uses erasure coding pools as an alternative to replicated pools. Trade-off: erasure coding uses less space but has higher CPU overhead and higher read latency (must read from multiple shards), so it is typically used for warm/cold tiers rather than hot/latency-sensitive data.
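The overhead figures above follow from m / k, and the recovery idea can be shown with the simplest possible code: a single XOR parity shard (k = 3, m = 1, as in RAID 5). This is a deliberately simplified stand-in, not Reed-Solomon; real k+m schemes such as RS(6,3) use finite-field arithmetic to tolerate more than one loss. All names below are hypothetical.

```python
from functools import reduce


def storage_overhead(k, m):
    """Extra raw capacity as a fraction of the logical data size."""
    return m / k


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length shards."""
    return bytes(x ^ y for x, y in zip(a, b))


print(storage_overhead(6, 3))  # 0.5 -> 50% overhead for RS(6,3)
print(storage_overhead(1, 2))  # 2.0 -> 200%: 3x replication viewed as 1 copy + 2 extra

# k = 3 equal-sized data shards and m = 1 parity shard.
data_shards = [b"AAAA", b"BBBB", b"CCCC"]
parity = reduce(xor_bytes, data_shards)

# Lose shard 1 (e.g. a failed disk) and rebuild it from the survivors plus parity.
survivors = [data_shards[0], data_shards[2], parity]
rebuilt = reduce(xor_bytes, survivors)
print(rebuilt)                 # b'BBBB'
```

Rebuilding requires reading every surviving shard, which hints at why erasure-coded pools cost more CPU and I/O during recovery than a replicated pool that simply copies one intact replica.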
4 / 5
A cloud architect describes a cost-optimisation strategy to the engineering team: "We've configured tiered storage with a lifecycle policy on our S3 bucket. Objects created in the last 30 days stay in the hot tier: standard storage class, highest performance, most expensive. Objects from 30–90 days ago move to the warm tier: infrequent access class, cheaper. Anything older than 90 days drops to the cold tier: Glacier, very cheap, but retrieval takes minutes to hours." What is the primary motivation for implementing tiered storage?
Tiered storage exploits the observation that access frequency usually declines as data ages (access patterns often follow a power law). By automatically migrating data to progressively cheaper storage classes, organisations reduce costs significantly without manual intervention. Tier vocabulary: Hot tier: frequently accessed data; highest performance and cost (e.g. S3 Standard, NVMe SSD). Warm tier: occasionally accessed; moderate cost (e.g. S3 Standard-IA, or S3 One Zone-IA, which keeps data in a single AZ). Cold tier: rarely accessed archives; very low storage cost, with retrieval fees and usually slower retrieval (e.g. S3 Glacier Flexible Retrieval, minutes to hours; S3 Glacier Instant Retrieval is the exception, with millisecond access). Archive tier: long-term regulatory archives; lowest cost, retrieval in hours (e.g. S3 Glacier Deep Archive). An object lifecycle policy is a set of rules that automate tier transitions and expiration. Example rules: transition to IA after 30 days; transition to Glacier after 90 days; expire (delete) after 7 years. Durability is the same across tiers (S3 is designed for 11 nines); the difference is cost and retrieval speed, not safety.
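The example rules can be sketched as a small lookup from object age to tier. This is an illustrative model only, not an S3 lifecycle configuration: `LIFECYCLE_RULES` and `tier_for_age` are hypothetical names, age is assumed to be measured in days since the object was created, and 7 years is approximated as 7 × 365 days.

```python
# (minimum age in days, tier) pairs, ordered from the oldest threshold down.
LIFECYCLE_RULES = [
    (7 * 365, "expired"),  # delete after 7 years
    (90, "cold"),          # e.g. Glacier
    (30, "warm"),          # e.g. infrequent access
    (0, "hot"),            # e.g. standard
]


def tier_for_age(age_days):
    """Return the tier an object of the given age should be in."""
    for min_age, tier in LIFECYCLE_RULES:
        if age_days >= min_age:
            return tier
    raise ValueError("age_days must be non-negative")


for age in (5, 45, 400, 3000):
    print(age, tier_for_age(age))
# 5 hot
# 45 warm
# 400 cold
# 3000 expired
```

Keeping the rules as data rather than as nested conditionals mirrors how lifecycle policies are usually declared: a list of thresholds that the storage service evaluates on your behalf.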
5 / 5
An infrastructure engineer explains a distributed file system deployment: "We run Ceph with a replication factor of 3, so every object is written to 3 different OSDs on 3 different hosts. If one host fails, the cluster re-replicates automatically from the remaining two copies. We also separate our OSDs across two __________ so that a single data-centre power outage cannot take out all three replicas of an object." What term completes the blank?
Availability zone (AZ): a physically separate data centre (or group of data centres) within a cloud region, isolated from failures in other AZs, with independent power feeds, cooling, and network paths. Distributing replicas across AZs (or racks/rooms in on-premises Ceph) ensures that a single infrastructure event (power outage, fire, flood) cannot destroy all copies of data simultaneously. Key vocabulary: Replication factor: the number of complete copies maintained; RF=3 means 3 full copies on 3 different OSDs/nodes. CRUSH (Ceph): the algorithm that determines which OSDs receive each object; the CRUSH map describes the failure domain hierarchy it works from (OSD → host → rack → row → room → DC). Placement group (PG): a bucket of objects that Ceph manages as a unit for replication and recovery. OSD: Object Storage Daemon; typically one OSD per physical disk in Ceph. Namenode / Datanode (HDFS): HDFS uses a centralised namenode (metadata) and distributed datanodes (block storage); the namenode is a classic SPOF, mitigated with HA namenode pairs. In conversation: "We spread OSDs across two AZs with CRUSH rules, so even if AZ-A goes dark, AZ-B still holds at least one copy of every object."
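The effect of spreading replicas across failure domains can be checked with a small simulation. This is a simplified round-robin placement written for illustration, not Ceph's CRUSH algorithm; the host names, zone names, and both functions are hypothetical.

```python
def place_replicas(hosts, rf):
    """Pick `rf` distinct hosts, alternating between zones. hosts: {host: zone}."""
    by_zone = {}
    for host, zone in sorted(hosts.items()):
        by_zone.setdefault(zone, []).append(host)
    queues = [by_zone[zone] for zone in sorted(by_zone)]
    chosen = []
    i = 0
    # Take one host from each zone in turn until rf hosts are chosen
    # or every zone has run out of hosts.
    while len(chosen) < rf and any(queues):
        queue = queues[i % len(queues)]
        if queue:
            chosen.append(queue.pop(0))
        i += 1
    return chosen


def survivors(replicas, hosts, failed_zone):
    """Replicas that remain reachable when one whole zone goes dark."""
    return [h for h in replicas if hosts[h] != failed_zone]


hosts = {"h1": "az-a", "h2": "az-a", "h3": "az-b", "h4": "az-b"}
replicas = place_replicas(hosts, rf=3)
print(replicas)                            # ['h1', 'h3', 'h2']
print(survivors(replicas, hosts, "az-a"))  # ['h3']
print(survivors(replicas, hosts, "az-b"))  # ['h1', 'h2']
```

With three replicas and only two zones, one zone necessarily holds two of the copies, so the worst-case outage leaves a single surviving replica. No data is lost, but whether the cluster keeps accepting writes in that state depends on its minimum-replica settings, which is one reason a third failure domain is often preferred.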