The AWS Well-Architected Framework in Plain English

A clear, practical explanation of the five original pillars of the AWS Well-Architected Framework — vocabulary, key concepts, and phrases for architecture reviews and design discussions.

The AWS Well-Architected Framework is the standard reference for evaluating and improving cloud architectures on AWS. It’s also widely referenced in interviews, architecture reviews, and client presentations. Whether you’re running a formal Well-Architected Review (WAR) or just discussing design trade-offs, this vocabulary and phrase guide will help you communicate the framework’s concepts clearly and professionally.


What Is the Well-Architected Framework?

The Well-Architected Framework (WAF) is a set of architectural best practices organised into pillars — a structured way to evaluate cloud architectures against proven principles. This guide covers the five original pillars; AWS added a sixth, Sustainability, in 2021.

Originally developed by AWS, the framework is widely referenced across the industry. Each of the five pillars has:

  • Design principles — guiding philosophies
  • Best practices — specific recommendations
  • Review questions — prompts for self-assessment

“Before we submitted the architecture for production approval, the team ran a Well-Architected Review. We found three medium-risk findings and one high-risk finding in the Security pillar.”


The Five Pillars


Pillar 1: Operational Excellence

Operational Excellence focuses on running and monitoring systems to deliver business value, and continually improving processes and procedures.

Key terms:

Operations as Code — defining and managing infrastructure and operational procedures using code (IaC, runbooks in automation). “We’ve codified our runbooks — deployments and rollbacks are scripted, not manual.”

Runbook — a document (or automated script) describing the steps to complete a specific operational procedure. “Any incident should have a corresponding runbook — no manual triage from memory.”

Playbook — a documented process for investigating and responding to a specific scenario or incident type, often pointing to the runbooks that carry out each fix.
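As a rough illustration of "operations as code", the sketch below models a runbook as an ordered list of steps, each with an action and a rollback. The class names (Step, Runbook), the step names, and the failing health check are all hypothetical; a real runbook would call deployment tooling rather than placeholder functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]    # raises an exception on failure
    rollback: Callable[[], None]  # undoes the action


@dataclass
class Runbook:
    name: str
    steps: List[Step]
    log: List[str] = field(default_factory=list)

    def run(self) -> bool:
        """Run steps in order; on failure, roll back completed steps in reverse."""
        done: List[Step] = []
        for step in self.steps:
            try:
                step.action()
            except Exception as exc:
                self.log.append(f"failed: {step.name} ({exc})")
                for prev in reversed(done):
                    prev.rollback()
                    self.log.append(f"rolled back: {prev.name}")
                return False
            self.log.append(f"ok: {step.name}")
            done.append(step)
        return True


def fail_health_check() -> None:
    # Hypothetical failure, used only to demonstrate the rollback path.
    raise RuntimeError("health check failed")


deploy = Runbook(
    name="deploy-core-api",
    steps=[
        Step("drain traffic", lambda: None, lambda: None),
        Step("switch version", lambda: None, lambda: None),
        Step("health check", fail_health_check, lambda: None),
    ],
)
succeeded = deploy.run()
```

Because every step and its rollback are written down and executed the same way each time, the log doubles as an audit trail of what the procedure actually did.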

Observability — the ability to understand internal system state from its outputs. Encompasses logs, metrics, and traces. “Without observability, we’re flying blind — we can’t tell whether the system is healthy.”

Making frequent, small, reversible changes — a core principle. Small deploys reduce blast radius and enable faster rollback. “We ship every day on the core API — each release is one small change, not a big-bang deployment.”

Failure Mode Analysis — proactively identifying what could go wrong and planning for it. “We ran a failure mode analysis before the peak season — we identified three single points of failure and addressed two of them.”


Pillar 2: Security

Security covers the ability to protect data, systems, and assets — and to detect, investigate, and recover from security events.

Key terms:

Least Privilege — granting the minimum permissions necessary for a function to operate. “Our Lambda functions have IAM roles scoped to exactly the S3 bucket and DynamoDB table they need — nothing more.”
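A least-privilege policy of the kind described in that example might look like the sketch below, written as a Python dictionary in the IAM JSON policy shape. The bucket name, table name, account ID, and region are placeholders, and the overly_broad helper is a hypothetical lint check, not an AWS tool.

```python
from typing import Any, Dict, List

# Hypothetical policy: one bucket, one table, only the actions the function needs.
policy: Dict[str, Any] = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-reports-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],
            "Resource": "arn:aws:dynamodb:eu-west-1:123456789012:table/example-orders",
        },
    ],
}


def as_list(value: Any) -> List[Any]:
    """IAM allows a single string or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def overly_broad(doc: Dict[str, Any]) -> List[str]:
    """Flag Allow statements that use wildcard actions or a bare '*' resource."""
    findings: List[str] = []
    for i, stmt in enumerate(as_list(doc["Statement"])):
        if stmt.get("Effect") != "Allow":
            continue
        for action in as_list(stmt.get("Action", [])):
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {i}: broad action {action}")
        for resource in as_list(stmt.get("Resource", [])):
            if resource == "*":
                findings.append(f"statement {i}: resource *")
    return findings


findings = overly_broad(policy)
```

The check is deliberately simple: it only catches the most obvious wildcards, and a scoped policy like the one above passes with no findings.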

Defence in Depth — using multiple layers of security controls. If one layer fails, others still protect the system. “We don’t rely on network perimeter security alone — we also encrypt data at rest, enforce MFA, and validate inputs at every service boundary.”

Shared Responsibility Model — AWS is responsible for security of the cloud (hardware, networking, physical); the customer is responsible for security in the cloud (data, identity, application). “The shared responsibility model means AWS securing the hypervisor doesn’t relieve us of securing our S3 buckets.”

Encryption at Rest / In Transit — encrypting data both where it is stored (at rest) and as it moves between systems (in transit). “All sensitive data must be encrypted at rest using KMS — no exceptions.”

Traceability — being able to trace every API call and configuration change to an identity. CloudTrail, AWS Config, and VPC Flow Logs enable this. “We have full traceability through CloudTrail — every IAM action is logged and alerts fire on suspicious patterns.”

Automated Security Testing — integrating security checks into CI/CD. “Our pipeline includes SAST, dependency scanning, and Checkov for IaC security — security gates before every deployment.”


Pillar 3: Reliability

Reliability is the ability of a workload to perform its intended function correctly and consistently — and to recover quickly when it fails.

Key terms:

Recovery Time Objective (RTO) — the maximum acceptable downtime after a failure. “Our RTO is 4 hours — we need to be back online within 4 hours of any disaster.”

Recovery Point Objective (RPO) — the maximum acceptable data loss measured in time. “Our RPO is 1 hour — we can’t afford to lose more than one hour of transaction data.”
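The two objectives can be checked with simple arithmetic. The sketch below uses the 4-hour RTO and 1-hour RPO from the examples above; the backup intervals and the detect, restore, and validate durations are made-up figures, and the function names are hypothetical.

```python
from datetime import timedelta

RPO = timedelta(hours=1)  # maximum tolerable data loss, from the example above
RTO = timedelta(hours=4)  # maximum tolerable downtime, from the example above


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    # Worst case: the failure happens just before the next backup,
    # so everything written since the last backup is lost.
    return backup_interval <= rpo


def meets_rto(detect: timedelta, restore: timedelta,
              validate: timedelta, rto: timedelta) -> bool:
    # Downtime is the whole chain: noticing, restoring, and verifying.
    return detect + restore + validate <= rto


nightly_ok = meets_rpo(timedelta(hours=24), RPO)     # nightly backups
frequent_ok = meets_rpo(timedelta(minutes=15), RPO)  # 15-minute snapshots
recovery_ok = meets_rto(
    detect=timedelta(minutes=10),
    restore=timedelta(hours=2, minutes=30),
    validate=timedelta(minutes=30),
    rto=RTO,
)
```

Under these assumptions a nightly backup cannot satisfy a 1-hour RPO, while a recovery that takes 3 hours 10 minutes end to end fits inside the 4-hour RTO.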

High Availability (HA) — designing systems to minimise downtime, typically through redundancy. “The database runs in Multi-AZ HA mode — if the primary fails, the secondary promotes automatically within 60 seconds.”
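To see why redundancy reduces downtime, here is a small worked calculation. It assumes the copies fail independently, which is an idealisation that real systems only approximate, and the 99% single-instance availability is an illustrative figure, not an AWS number.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def redundant_availability(single: float, copies: int) -> float:
    """Availability when the service is up if at least one copy is up.

    Assumes copies fail independently of each other.
    """
    return 1 - (1 - single) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    return (1 - availability) * MINUTES_PER_YEAR


single = 0.99                             # illustrative single-instance availability
pair = redundant_availability(single, 2)  # two redundant copies

single_downtime = downtime_minutes_per_year(single)  # roughly 5,256 minutes
pair_downtime = downtime_minutes_per_year(pair)      # roughly 53 minutes
```

In practice, shared dependencies and failover time mean the real gain is smaller than this idealised figure.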

Fault Isolation — containing failures to prevent them from cascading. VPCs, Availability Zones, and service meshes with circuit breakers all implement fault isolation. “Each AZ is a fault isolation boundary — a failure in AZ-1 shouldn’t affect AZ-2.”

Circuit Breaker — a pattern that stops making requests to a failing downstream service, allowing time for recovery. “We implemented circuit breakers on the payment service — if error rates exceed 50%, calls short-circuit for 30 seconds.”
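A minimal circuit breaker along the lines of that example could be sketched as follows. The 50% threshold and 30-second pause come from the quote above; the rolling window of recent calls, the half-open trial call, and all class and parameter names are assumptions made for illustration.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without contacting the service."""


class CircuitBreaker:
    def __init__(self, error_threshold: float = 0.5, window: int = 10,
                 open_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.error_threshold = error_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.results: Deque[bool] = deque(maxlen=window)  # True means failure
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("circuit open, call skipped")
            # Pause elapsed: half-open, so let this one call through as a trial.
        try:
            result = fn()
        except Exception:
            self._record(failed=True)
            raise
        self._record(failed=False)
        return result

    def _record(self, failed: bool) -> None:
        if self.opened_at is not None:
            # The trial call decides: reopen on failure, close on success.
            if failed:
                self.opened_at = self.clock()
            else:
                self.opened_at = None
                self.results.clear()
            return
        self.results.append(failed)
        window_full = len(self.results) == self.results.maxlen
        error_rate = sum(self.results) / len(self.results)
        if window_full and error_rate > self.error_threshold:
            self.opened_at = self.clock()


# Usage: wrap each downstream call, e.g. breaker.call(lambda: charge_card(order)),
# where charge_card is a hypothetical client function.
breaker = CircuitBreaker(error_threshold=0.5, window=10, open_seconds=30.0)
```

Passing the clock in as a parameter is a design choice that keeps the breaker testable without real waiting; production libraries usually add thread safety and metrics on top of this basic state machine.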

Chaos Engineering — deliberately injecting failures in controlled ways to test resilience. “Our chaos engineering practice includes monthly game days where we simulate AZ failures and database failovers.”

Multi-Region Architecture — deploying across multiple AWS regions for the highest level of resilience. “The core payment service is multi-region active-active — a full region outage won’t take us down.”


Pillar 4: Performance Efficiency

Performance Efficiency is the ability to use cloud resources efficiently to meet requirements, and to maintain that efficiency as demand changes.

Key terms:

Right Resource Selection — choosing the most appropriate instance type, storage type, or service for the workload. “That batch job doesn’t need a general-purpose instance — it’s CPU-intensive and should run on a compute-optimised type.”

Elasticity — the ability to scale resources up or down automatically as demand changes. “We use Auto Scaling groups — during off-hours the fleet scales down to 20% of peak, which cuts our compute bill.”
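The idea behind scaling on a target metric can be shown with a simple proportional rule. This is a simplified sketch and not the algorithm AWS uses internally; the function name, the 60% CPU target, and the fleet limits are assumptions.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     minimum: int, maximum: int) -> int:
    """Scale the fleet so the per-instance metric moves towards the target."""
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    proposed = math.ceil(current * metric / target)
    # Never go below the floor or above the ceiling configured for the group.
    return max(minimum, min(maximum, proposed))


# 10 instances running hot at 90% CPU against a 60% target: scale out.
peak = desired_capacity(current=10, metric=90.0, target=60.0, minimum=2, maximum=20)

# The same 10 instances idling at 12% CPU overnight: scale in.
off_hours = desired_capacity(current=10, metric=12.0, target=60.0, minimum=2, maximum=20)
```

With these figures the fleet grows to 15 instances at peak and shrinks to 2 overnight, which is the kind of swing the example phrase describes.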

Latency — the time between a request and a response. “P99 latency is our KPI for this service — we target sub-200ms at P99.”
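For readers unfamiliar with "P99", the sketch below computes percentiles with the nearest-rank method, which is one of several common definitions. The sample latencies are invented to show how a few slow requests affect P99 while the median stays flat.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Invented data: 98 fast requests and two slow ones, in milliseconds.
latencies_ms = [40.0] * 98 + [180.0, 950.0]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
meets_target = p99 < 200  # the sub-200ms target from the example above
```

Here the median is 40 ms and P99 is 180 ms, so the target is met even though one request took 950 ms. That single outlier would only show up at P100, which is why teams choose the percentile they track deliberately.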

Throughput — the rate of processing — requests per second, messages per second, bytes per second. “The pipeline throughput needs to handle 50,000 events per second during peak ingestion.”

Caching — storing results of expensive operations for reuse. “We added a Redis cache layer — cache hit rate is 85%, and database load dropped by 70%.”
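The pattern behind that example is often called cache-aside: check the cache first and fall back to the database only on a miss. The sketch below uses an in-memory dictionary as a stand-in for Redis and omits expiry, eviction, and invalidation; the class name and the fake database loader are hypothetical.

```python
from typing import Callable, Dict, List


class CacheAside:
    def __init__(self, load: Callable[[str], str]) -> None:
        self.load = load                 # the expensive lookup, e.g. a database query
        self.store: Dict[str, str] = {}  # stand-in for Redis
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = self.load(key)           # only misses reach the database
        self.store[key] = value
        return value

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


db_calls: List[str] = []


def load_from_db(key: str) -> str:
    # Hypothetical database read; records each call so the load is visible.
    db_calls.append(key)
    return f"row-for-{key}"


cache = CacheAside(load_from_db)
for key in ["a", "b", "c"] + ["a"] * 17:  # 20 requests over 3 distinct keys
    cache.get(key)
```

In this toy run, 17 of 20 requests are served from the cache, so the database sees only 3 reads. Real hit rates depend on how skewed the access pattern is and how long entries stay valid.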

CDN (Content Delivery Network) — caching static content at edge locations close to users. “We serve static assets through CloudFront — global P50 latency dropped from 280ms to 35ms after moving to CDN.”

Trade-off analysis — balancing performance against cost, consistency, or complexity. “There’s a trade-off between strong consistency and read performance — we’ve chosen eventual consistency for this cache.”


Pillar 5: Cost Optimisation

Cost Optimisation is delivering business value at the lowest possible price point.

Key terms:

See the full vocabulary guide: FinOps Vocabulary: Cloud Cost Optimisation Terms Explained

Key additional terms in the WAF context:

Adopt a Consumption Model — pay only for what you use. Prefer serverless and managed services where appropriate. “We migrated the batch jobs to Lambda and Step Functions — cost dropped 80% because we no longer pay for idle compute.”

Measure Overall Efficiency — understand the business value delivered per dollar spent. “We track cost per transaction as our unit economics KPI — it needs to decrease as we scale.”
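Cost per transaction is simple division, but a worked example shows why it is more informative than the total bill. The monthly figures below are invented for illustration.

```python
def cost_per_transaction(total_cost: float, transactions: int) -> float:
    """Unit cost: total spend divided by the number of transactions served."""
    if transactions <= 0:
        raise ValueError("transactions must be positive")
    return total_cost / transactions


# Invented figures: the bill went up, but traffic grew faster.
january = cost_per_transaction(total_cost=12_000.0, transactions=400_000)
june = cost_per_transaction(total_cost=15_000.0, transactions=750_000)

unit_cost_improving = june < january
```

Total spend rose by a quarter between the two months, yet the cost per transaction fell from about 3 cents to about 2 cents, which is the trend the example phrase asks for.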

Stop Spending on Undifferentiated Heavy Lifting — use managed services instead of building infrastructure from scratch. “We don’t run our own Kafka — MSK does the undifferentiated work, we focus on business logic.”


The Well-Architected Review (WAR)

A Well-Architected Review is a structured assessment of a workload against the WAF questions. It produces:

  • A risk report — findings rated High, Medium, or Low risk
  • Improvement items — specific recommendations with priority
  • A remediation plan — timeline for addressing high-risk findings
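One way to keep review outputs workable is to hold findings as structured data and sort them by risk. The sketch below mirrors the earlier example of one high-risk and three medium-risk Security findings; the finding descriptions and all type and function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Finding:
    pillar: str
    risk: Risk
    summary: str


def remediation_order(findings: List[Finding]) -> List[Finding]:
    """Highest risk first; findings of equal risk keep their original order."""
    return sorted(findings, key=lambda f: f.risk, reverse=True)


def risk_counts(findings: List[Finding]) -> Counter:
    return Counter(f.risk.name for f in findings)


# Hypothetical findings for illustration only.
findings = [
    Finding("Security", Risk.MEDIUM, "MFA not enforced for console users"),
    Finding("Security", Risk.MEDIUM, "Access keys older than 90 days"),
    Finding("Security", Risk.HIGH, "Publicly readable storage bucket"),
    Finding("Security", Risk.MEDIUM, "No alerting on root account use"),
]

plan = remediation_order(findings)
counts = risk_counts(findings)
```

The sorted list puts the high-risk item at the top of the remediation plan, and the counts give the one-line summary that usually opens a review report.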

Useful phrases for a WAR:

“This is a high-risk finding — we have no automated failover configured for the database. What’s the plan to address this before the Q2 deadline?”

“We have three medium-risk findings in the Security pillar. I’ll prioritise the MFA enforcement finding — everything else can be part of the next quarter’s roadmap.”

“The Well-Architected Review showed we have a strong reliability posture but significant cost optimisation opportunities — an estimated 35% savings through rightsizing and Reserved Instances.”


Practice

Deepen your cloud architecture vocabulary with the Cloud FinOps exercise set and explore all cloud architecture resources in the Cloud Architect learning path.