Kubernetes Autoscaling English: VPA, HPA, KEDA, and Scaling Policy Vocabulary

Master the advanced English vocabulary for Kubernetes autoscaling: HPA, VPA, KEDA, scaling policies, and the capacity planning discussions that surround them.

Kubernetes autoscaling discussions are dense with acronyms, precise technical terms, and subtle distinctions that matter in production. Whether you are presenting a capacity plan to your team, reviewing an HPA configuration in a pull request, or troubleshooting a KEDA scaler on an incident call, the English you use has to be precise. This post builds the vocabulary and phrases you need for those conversations.

Horizontal vs. Vertical Scaling Terms

HPA (Horizontal Pod Autoscaler) — a Kubernetes controller that automatically adjusts the number of pod replicas in a Deployment or StatefulSet based on observed metrics such as CPU utilization or custom metrics.

“Our HPA is configured to maintain 60% CPU target utilization, scaling between 3 and 50 replicas depending on traffic.”
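The Kubernetes documentation describes the HPA's core calculation as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). No scaling happens when the ratio is within a tolerance (0.1 by default), and the result is clamped to the configured minimum and maximum. The Python sketch below applies that formula to the 60% target and the 3 to 50 replica range from the quote. It deliberately ignores missing metrics, pods that are not yet ready, and scaling behavior, so treat it as an approximation, not the controller's actual code.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Approximate the HPA's core replica calculation for a single resource metric."""
    ratio = current_utilization / target_utilization
    # Inside the tolerance band, the HPA keeps the current replica count.
    if abs(1.0 - ratio) <= tolerance:
        return current_replicas
    proposed = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica range (minReplicas..maxReplicas).
    return max(min_replicas, min(max_replicas, proposed))


# 10 pods averaging 90% CPU against a 60% target, allowed range 3..50
print(desired_replicas(10, 90, 60, 3, 50))  # prints 15
```

The clamp is where "hitting the replica limit" comes from. With 40 pods at 95% CPU, the formula asks for 64 pods, but the HPA stops at 50.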

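VPA (Vertical Pod Autoscaler) — a Kubernetes controller that automatically adjusts the CPU and memory requests (and, proportionally, the limits) of pods, making each pod larger or smaller rather than adding more pods. When it applies changes, it typically evicts pods so they are recreated with the new values.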

“We use VPA in recommendation mode for our batch jobs — it tells us what resource requests we should set, but doesn’t apply changes automatically in production.”
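"Recommendation mode" corresponds to a VPA whose update mode is "Off": the recommender still computes targets, but nothing is applied to pods. The sketch below builds such a manifest as a Python dict. The names batch-worker, batch-worker-vpa, and the jobs namespace are hypothetical placeholders, and the example assumes the workload is a Deployment.

```python
import json


def recommendation_only_vpa(name: str, namespace: str, deployment: str) -> dict:
    """Build a VerticalPodAutoscaler manifest that only produces recommendations."""
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "targetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": deployment},
            # "Off": compute recommendations, never evict or resize pods.
            "updatePolicy": {"updateMode": "Off"},
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so this output can be piped to `kubectl apply -f -`.
    manifest = recommendation_only_vpa("batch-worker-vpa", "jobs", "batch-worker")
    print(json.dumps(manifest, indent=2))
```

After the VPA has observed the workload for a while, its suggestions appear under status.recommendation.containerRecommendations, for example in `kubectl describe vpa`.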

Scale-to-zero — the ability to reduce a workload to zero running instances when there is no traffic, eliminating idle compute costs. A standard HPA does not support this, because its minReplicas must be at least 1 unless the alpha HPAScaleToZero feature gate is enabled. KEDA and Knative do support it.

“With KEDA handling our event-driven consumers, we finally achieved scale-to-zero on our message processors — they only spin up when there are messages in the queue.”

Target utilization — the average metric value across all pods that the autoscaler tries to maintain. For CPU-based HPA, this is typically expressed as a percentage of each pod's requested CPU.

“We lowered the target utilization from 80% to 60% after the Black Friday incident — the scaler was reacting too slowly because pods were already saturated before new ones came up.”
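The arithmetic behind lowering the target is worth being able to explain. The sketch below treats the CPU request as the saturation point. This is a simplification, because pods can use CPU above their request up to their limit. The example assumes a hypothetical total demand of 24 cores and a request of 1 core (1,000 millicores) per pod.

```python
def replicas_needed(total_cpu_millicores: int, request_millicores: int,
                    target_percent: int) -> int:
    """Pods needed so average usage sits at target_percent of each pod's CPU request."""
    per_pod_budget = request_millicores * target_percent  # millicore-percent units
    return -(-total_cpu_millicores * 100 // per_pod_budget)  # integer ceiling division


def headroom(target_percent: int) -> float:
    """Extra load (as a fraction) pods can absorb before usage reaches 100% of the request."""
    return (100 - target_percent) / target_percent


for target in (80, 60):
    print(target, replicas_needed(24_000, 1_000, target), f"{headroom(target):.0%}")
# prints:
# 80 30 25%
# 60 40 67%
```

At an 80% target, a 25% traffic jump saturates the pods before new ones are ready. At 60%, the same pods can absorb roughly 67% more load, at the cost of running 10 extra pods in this example.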

KEDA and Advanced Autoscaling Terms

KEDA (Kubernetes Event-Driven Autoscaling) — a CNCF project that extends Kubernetes HPA to support scaling based on external event sources: queues, databases, Prometheus metrics, cron schedules, and more.

“KEDA lets us scale our worker pods based on the number of pending jobs in our RabbitMQ queue, not just CPU — it’s much more responsive to real workload.”
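A KEDA setup like the one in the quote is declared with a ScaledObject. The Python dict below sketches one with a RabbitMQ trigger in QueueLength mode, so KEDA aims for roughly `value` messages per replica. The deployment name queue-worker, the queue jobs, and the environment variable RABBITMQ_HOST are hypothetical. Check the trigger metadata fields against the RabbitMQ scaler documentation for your KEDA version.

```python
import json


def rabbitmq_scaled_object(deployment: str, queue: str, messages_per_pod: int,
                           max_replicas: int) -> dict:
    """Sketch of a KEDA ScaledObject that scales a Deployment on RabbitMQ queue length."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-scaler"},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": 0,  # allow scale-to-zero when the queue is empty
            "maxReplicaCount": max_replicas,
            "pollingInterval": 30,  # seconds between trigger checks (KEDA default)
            "cooldownPeriod": 300,  # seconds of inactivity before scaling to zero (default)
            "triggers": [{
                "type": "rabbitmq",
                "metadata": {  # KEDA expects trigger metadata values as strings
                    "protocol": "amqp",
                    "queueName": queue,
                    "mode": "QueueLength",
                    "value": str(messages_per_pod),
                    "hostFromEnv": "RABBITMQ_HOST",  # hypothetical env var on the worker
                },
            }],
        },
    }


print(json.dumps(rabbitmq_scaled_object("queue-worker", "jobs", 20, 30), indent=2))
```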

Scaling policy — a set of rules that govern how quickly and by how much an autoscaler can add or remove pods. Policies control the rate of scale-up and scale-down separately.

“We have an aggressive scale-up policy (up to 100% more pods every 30 seconds) but a conservative scale-down policy, to avoid thrashing when traffic drops briefly between spikes.”
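In an autoscaling/v2 HPA, policies like these live under spec.behavior. The dict below mirrors that structure. The scale-down values (one pod per 60 seconds after a 300-second window) are illustrative assumptions, not taken from the quote. The helper computes the fastest growth a single Percent policy permits. It is a simplified model: the real HPA evaluates policies over a rolling period and only scales when the metrics call for it.

```python
BEHAVIOR = {
    "scaleUp": {
        "stabilizationWindowSeconds": 0,
        "policies": [{"type": "Percent", "value": 100, "periodSeconds": 30}],
    },
    "scaleDown": {
        "stabilizationWindowSeconds": 300,
        "policies": [{"type": "Pods", "value": 1, "periodSeconds": 60}],
    },
}


def fastest_scale_up(start: int, max_replicas: int, percent: int,
                     period_s: int) -> list[tuple[int, int]]:
    """(seconds, replicas) steps for the fastest growth one Percent policy allows."""
    if start < 1 or percent <= 0:
        raise ValueError("start must be >= 1 and percent must be positive")
    timeline = [(0, start)]
    replicas, t = start, 0
    while replicas < max_replicas:
        t += period_s
        # Allow up to `percent` more pods per period, rounded up, capped at the max.
        grown = -(-replicas * (100 + percent) // 100)
        replicas = min(max_replicas, grown)
        timeline.append((t, replicas))
    return timeline


policy = BEHAVIOR["scaleUp"]["policies"][0]
print(fastest_scale_up(3, 50, policy["value"], policy["periodSeconds"]))
# prints [(0, 3), (30, 6), (60, 12), (90, 24), (120, 48), (150, 50)]
```

Under these settings, the deployment can go from 3 to 50 pods in two and a half minutes. Removing at most one pod per minute keeps scale-down slow and predictable.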

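Cooldown period — a waiting period after a scaling event during which the autoscaler does not trigger another scale-down, preventing rapid, disruptive cycling of pods. In current HPA versions, this role is filled by the scale-down stabilization window (300 seconds by default). In KEDA, cooldownPeriod (also 300 seconds by default) sets how long to wait after the last active trigger before scaling the workload back to zero.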

“The default cooldown period is 5 minutes for scale-down. We extended it to 10 minutes because our traffic has a spiky pattern and we were losing pods too quickly.”
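KEDA's cooldown logic can be modeled in a few lines. The function below is a simplified, hypothetical model, not KEDA's implementation. It treats each sample as one polling interval and restarts the cooldown clock whenever the queue length exceeds an activation threshold (loosely mirroring KEDA's activation settings). It reports the first poll at which the workload would be scaled to zero.

```python
from typing import Optional


def scale_to_zero_at(samples: list[tuple[int, int]], cooldown_s: int,
                     activation_threshold: int = 0) -> Optional[int]:
    """First poll time at which a simplified KEDA-style scaler would scale to zero.

    samples: (timestamp_seconds, queue_length) pairs in time order, one per poll.
    Returns None if the trigger never becomes active or the cooldown has not
    elapsed by the last sample.
    """
    last_active: Optional[int] = None
    for ts, queue_length in samples:
        if queue_length > activation_threshold:
            last_active = ts  # trigger active: restart the cooldown clock
        elif last_active is not None and ts - last_active >= cooldown_s:
            return ts
    return None


polls = [(0, 40), (60, 12)] + [(t, 0) for t in range(120, 601, 60)]
print(scale_to_zero_at(polls, cooldown_s=300))  # prints 360
print(scale_to_zero_at(polls, cooldown_s=600))  # prints None
```

Doubling the cooldown in this model keeps the pods alive through quiet gaps up to twice as long. That is the trade-off behind the 10-minute setting in the quote.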

Stabilization window — a look-back period over which the HPA considers its recent replica recommendations before acting. For scale-up, it uses the lowest recommendation in the window; for scale-down, it uses the highest. As a result, brief spikes or dips do not immediately change the replica count.

“We set a 2-minute stabilization window for scale-down decisions, so a brief dip in traffic doesn't immediately shrink our pod count.”
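The window logic in the definition above can be sketched as a small class. It follows the rule described in the Kubernetes documentation: take the minimum recommendation in the scale-up window and the maximum in the scale-down window, then move the current count only as far as those bounds allow. It is a simplified model. The real controller also applies behavior policies, so here a scale-down can jump straight to the new value.

```python
class StabilizationWindow:
    """Simplified model of HPA stabilization over desired-replica recommendations."""

    def __init__(self, up_window_s: int, down_window_s: int):
        self.up_window_s = up_window_s
        self.down_window_s = down_window_s
        self.history: list[tuple[int, int]] = []  # (timestamp, recommendation)

    def stabilize(self, now: int, current: int, desired: int) -> int:
        up_limit = desired    # lowest recommendation within the scale-up window
        down_limit = desired  # highest recommendation within the scale-down window
        for ts, rec in self.history:
            if now - ts < self.up_window_s:
                up_limit = min(up_limit, rec)
            if now - ts < self.down_window_s:
                down_limit = max(down_limit, rec)
        self.history.append((now, desired))
        keep = max(self.up_window_s, self.down_window_s)
        self.history = [(ts, r) for ts, r in self.history if now - ts < keep]
        # Move from the current count only as far as both limits permit.
        result = current
        if result < up_limit:
            result = up_limit
        if result > down_limit:
            result = down_limit
        return result


window = StabilizationWindow(up_window_s=0, down_window_s=120)
current, decisions = 10, []
for t, desired in [(0, 10), (30, 4), (60, 10), (90, 4), (120, 4), (150, 4), (180, 4), (210, 4)]:
    current = window.stabilize(t, current, desired)
    decisions.append(current)
print(decisions)  # prints [10, 10, 10, 10, 10, 10, 4, 4]
```

The dip at 30 seconds is ignored. The deployment only shrinks once every recommendation in the last 120 seconds is low.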

Metrics server — a cluster-level metrics aggregation component that collects resource usage data (CPU and memory) from kubelets and exposes it to the HPA through the resource metrics API (metrics.k8s.io). Required for resource-based autoscaling.

“After upgrading the cluster, the metrics server wasn’t reinstalled — HPA went into an ‘unknown’ state because it couldn’t retrieve CPU metrics.”
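To see why missing metrics leave the HPA stuck, it helps to know what it computes from metrics-server data. For CPU, it compares the pods' summed usage with their summed requests. The sketch below does the same for CPU quantity strings of the kind returned by `kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods`. It handles only the common n, u, m, and plain-core forms. It returns None when no usage samples exist, which is the case kubectl shows as <unknown> in the HPA's TARGETS column.

```python
from typing import Optional

# Divisors that convert a suffixed CPU quantity into millicores.
_CPU_DIVISORS = {"n": 1_000_000, "u": 1_000, "m": 1}


def cpu_to_millicores(quantity: str) -> float:
    """Convert common CPU quantity forms ('250m', '1', '0.5', '12345n') to millicores."""
    suffix = quantity[-1]
    if suffix in _CPU_DIVISORS:
        return float(quantity[:-1]) / _CPU_DIVISORS[suffix]
    return float(quantity) * 1000  # plain cores


def average_utilization(usages: list[str], requests: list[str]) -> Optional[float]:
    """CPU utilization (%) of summed usage over summed requests, or None without data."""
    if not usages:
        return None  # no samples from the metrics API: the HPA cannot decide
    total_request = sum(cpu_to_millicores(r) for r in requests)
    if total_request == 0:
        raise ValueError("CPU requests are required for utilization-based scaling")
    total_usage = sum(cpu_to_millicores(u) for u in usages)
    return 100.0 * total_usage / total_request


print(average_utilization(["250m", "500m", "750000000n"], ["1", "1000m", "1"]))  # prints 50.0
print(average_utilization([], ["1"]))  # prints None
```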

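Custom metrics — non-default metrics (not CPU or memory) used to drive autoscaling decisions. In-cluster metrics are served through the custom metrics API (custom.metrics.k8s.io), commonly backed by Prometheus Adapter. Metrics from outside the cluster go through the external metrics API (external.metrics.k8s.io), which is the API KEDA's metrics server implements.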

“We wrote a custom metrics adapter that exposes our request queue depth as a Kubernetes metric, which HPA then uses as its scaling target.”
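With an AverageValue target, the HPA divides the metric total across replicas, so the replica count is roughly ceil(total / target per pod). The sketch below assumes the adapter serves queue depth through the external metrics API. The dict shows the matching autoscaling/v2 metric entry, where request_queue_depth is a hypothetical metric name. If the value came through the custom metrics API instead, you would use a Pods or Object metric. The helper ignores the HPA's tolerance band.

```python
import math

# autoscaling/v2 HPA metric entry (hypothetical metric name and target).
QUEUE_DEPTH_METRIC = {
    "type": "External",
    "external": {
        "metric": {"name": "request_queue_depth"},
        "target": {"type": "AverageValue", "averageValue": "30"},
    },
}


def replicas_for_average_value(metric_total: float, target_per_pod: float,
                               min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so that metric_total / replicas is at most target_per_pod."""
    proposed = math.ceil(metric_total / target_per_pod)
    return max(min_replicas, min(max_replicas, proposed))


target = float(QUEUE_DEPTH_METRIC["external"]["target"]["averageValue"])
print(replicas_for_average_value(450, target, 1, 20))  # prints 15
```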

Real IT Context Phrases

These phrases appear in capacity planning documents, incident reports, and Kubernetes-related pull request reviews:

  • “The HPA is flapping — it scaled up and down four times in 20 minutes.” — incident observation, refers to unstable scaling behavior
  • “We need to tune the stabilization window to prevent this kind of thrashing.” — recommended fix during a postmortem
  • “Let’s set a scale-to-zero floor only for the non-critical workers, not the API pods.” — capacity planning decision in a design review
  • “The scaler isn’t reacting because the metrics server hasn’t collected enough samples yet.” — troubleshooting explanation during an incident call
  • “Our VPA recommendations show the memory request is 4x higher than actual usage, so we're over-provisioning significantly.” — cost optimization finding in a quarterly review

Key Collocations

  • trigger autoscaling: “High queue depth triggers autoscaling on the consumer deployment.”
  • tune the HPA: “Let's tune the HPA stabilization window before the next load test.”
  • provision pods: “KEDA provisions pods proactively before the morning traffic peak.”
  • scale down aggressively: “Scaling down too aggressively causes connection-reset errors for in-flight requests.”
  • set resource requests: “VPA recommendations help us set resource requests more accurately.”
  • hit the replica limit: “We hit the replica limit at 50 pods, so we need to raise the max or optimize the service.”
  • drain a node: “Before scaling down the node pool, the cluster autoscaler drains each node to reschedule pods gracefully.”

Practice

Find your team’s most frequently scaled Kubernetes workload. Write a short capacity planning note in English (one paragraph, 6–8 sentences) that covers: what metric drives scaling, what the current HPA or KEDA configuration targets, what the observed failure mode is during traffic spikes, and what one configuration change you would propose. Use at least five terms from this post. This directly mirrors the format of a capacity planning section in a quarterly infrastructure review document.