Cloudflare Workers AI: Edge AI English for Developers
Learn the essential English vocabulary for running AI inference at the edge with Cloudflare Workers AI — bindings, models, embeddings, and wrangler deploy.
Cloudflare Workers AI lets you run machine learning models directly on Cloudflare’s global network, so requests do not need a round trip to a centralised inference server. You can generate text, create embeddings, classify images, and translate text between languages, all from within a Worker script. The vocabulary in this post will help ESL developers read the Workers AI documentation, configure bindings in Wrangler, and discuss edge AI architecture in English.
Bindings and Configuration
AI binding — a configuration entry in your wrangler.toml file that gives your Worker script access to the Workers AI runtime under a specific variable name, such as env.AI.
“We added an AI binding named AI to the wrangler config so every handler in the Worker can call env.AI.run() without additional setup.”
wrangler.toml — the configuration file for Cloudflare Workers projects, where you declare the Worker’s name, bindings, routes, compatibility date, and environment variables.
“The AI binding must appear in the wrangler.toml file before you run wrangler deploy; otherwise env.AI is undefined and the first call to env.AI.run() fails at runtime.”
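A minimal sketch of how the binding surfaces in code, assuming a wrangler.toml whose [ai] table sets binding to "AI". The AiBinding interface and the checkBinding helper are hypothetical names used only for illustration; real projects usually take the binding type from the @cloudflare/workers-types package.

```typescript
// Assumed wrangler.toml fragment that declares the binding:
//
//   [ai]
//   binding = "AI"

// Hypothetical minimal type for the binding; illustration only.
interface AiBinding {
  run(model: string, input: Record<string, unknown>): Promise<unknown>;
}

// Bindings arrive on the `env` argument of the Worker's fetch handler.
// AI is optional here so the missing-binding case can be modelled.
interface Env {
  AI?: AiBinding;
}

// Reports whether the binding exists before any handler calls env.AI.run().
function checkBinding(env: Env): { status: number; message: string } {
  if (env.AI === undefined) {
    return { status: 500, message: "AI binding is not configured in wrangler.toml" };
  }
  return { status: 200, message: "AI binding is available as env.AI" };
}
```

A guard like this turns a confusing runtime failure into a readable message while you are still setting up the project.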
wrangler deploy — the CLI command that builds your Worker and publishes it to Cloudflare’s network, making it live on the configured routes or workers.dev subdomain.
“After testing locally with wrangler dev, we ran wrangler deploy and the AI-powered endpoint was live in under thirty seconds.”
Running Inference
run() — the primary method on the AI binding that executes a model; you pass the model name and an input object, and it returns the model’s output asynchronously.
“We call env.AI.run() with the summarisation model and the raw article text, then return the summary as a JSON response.”
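The call in the example sentence might look like the sketch below. It assumes the catalog model @cf/facebook/bart-large-cnn, an input_text field on the input, and a summary field on the output; check the model's catalog page before relying on these. AiBinding and summarise are hypothetical names.

```typescript
// Hypothetical minimal type for the binding; illustration only.
interface AiBinding {
  run(model: string, input: Record<string, unknown>): Promise<unknown>;
}

// Assumed catalog model name; verify it in the Workers AI model catalog.
const SUMMARY_MODEL = "@cf/facebook/bart-large-cnn";

// Sends the article text to the model and returns the summary as JSON text.
async function summarise(ai: AiBinding, article: string): Promise<string> {
  const output = (await ai.run(SUMMARY_MODEL, { input_text: article })) as {
    summary?: string;
  };
  // Fall back to an empty string if the model returns no summary field.
  return JSON.stringify({ summary: output.summary ?? "" });
}
```

Inside a Worker you would pass env.AI as the first argument and place the returned string in the response body.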
Workers AI model catalog — the list of publicly available, Cloudflare-hosted models you can reference in run() calls, covering text generation, embeddings, speech-to-text, translation, and image classification.
“We browsed the Workers AI model catalog to find a multilingual embeddings model that supports both English and Spanish input.”
inference — the process of running a trained machine learning model on new input data to produce a prediction or output; on Workers AI this happens on Cloudflare’s network, ideally in a data centre near the user, rather than on your origin server.
“Inference latency dropped from 400 ms to under 80 ms after we moved the summarisation step from our origin server to a Cloudflare Worker.”
Model Types and Outputs
text generation — a category of AI tasks where the model produces human-readable text in response to a prompt; on Workers AI this uses models such as Meta’s Llama series.
“We use text generation to produce a short product description from a structured data object before storing it in the database.”
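A sketch of the product description example, assuming the model @cf/meta/llama-3.1-8b-instruct accepts a prompt field and returns a response field. The Product shape, buildPrompt, and describeProduct are hypothetical and exist only for this illustration.

```typescript
// Hypothetical minimal type for the binding; illustration only.
interface AiBinding {
  run(model: string, input: Record<string, unknown>): Promise<unknown>;
}

// Hypothetical structured data object.
interface Product {
  name: string;
  colour: string;
  material: string;
}

// Assumed catalog model name; verify it in the Workers AI model catalog.
const TEXT_MODEL = "@cf/meta/llama-3.1-8b-instruct";

// Turns the structured fields into a plain-English prompt.
function buildPrompt(product: Product): string {
  return (
    "Write one sentence describing this product. " +
    `Name: ${product.name}. Colour: ${product.colour}. Material: ${product.material}.`
  );
}

// Runs the text generation model and returns the trimmed description.
async function describeProduct(ai: AiBinding, product: Product): Promise<string> {
  const output = (await ai.run(TEXT_MODEL, { prompt: buildPrompt(product) })) as {
    response?: string;
  };
  return (output.response ?? "").trim();
}
```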
embeddings — numerical vector representations of text or other data that capture semantic meaning; models that produce embeddings are used for similarity search, clustering, and retrieval-augmented generation.
“We store user query embeddings in Cloudflare Vectorize so the search endpoint can find semantically similar documents without keyword matching.”
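To make "semantically similar" concrete, here is a sketch that requests embeddings and compares two vectors with cosine similarity. It assumes the model @cf/baai/bge-base-en-v1.5 takes a text array and returns the vectors in a data field. The embed and cosineSimilarity helpers are hypothetical; a vector database such as Vectorize would run a comparison like this across many stored vectors for you.

```typescript
// Hypothetical minimal type for the binding; illustration only.
interface AiBinding {
  run(model: string, input: Record<string, unknown>): Promise<unknown>;
}

// Assumed catalog model name; verify it in the Workers AI model catalog.
const EMBEDDING_MODEL = "@cf/baai/bge-base-en-v1.5";

// Requests one embedding vector per input string.
async function embed(ai: AiBinding, texts: string[]): Promise<number[][]> {
  const output = (await ai.run(EMBEDDING_MODEL, { text: texts })) as {
    data?: number[][];
  };
  return output.data ?? [];
}

// Cosine similarity: close to 1 means similar direction, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must have the same non-zero length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as not similar to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```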
streaming response — a mode where the model sends tokens to the client as they are generated rather than waiting for the full output; supported by Workers AI for text generation models.
“We enabled streaming responses so users see the first words of the generated answer within 200 ms instead of waiting for the entire completion.”
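A sketch of a streaming handler, assuming that passing stream: true makes run() resolve to a ReadableStream of server-sent events for the chosen text generation model. The model name is an assumption, and streamAnswer is a hypothetical helper.

```typescript
// Hypothetical minimal type for the binding; illustration only.
interface AiBinding {
  run(model: string, input: Record<string, unknown>): Promise<unknown>;
}

// Assumed catalog model name; verify it in the Workers AI model catalog.
const TEXT_MODEL = "@cf/meta/llama-3.1-8b-instruct";

// Forwards the model's token stream to the client without buffering it.
async function streamAnswer(ai: AiBinding, prompt: string): Promise<Response> {
  const stream = (await ai.run(TEXT_MODEL, { prompt, stream: true })) as ReadableStream;
  return new Response(stream, {
    // Tells the client to read the body as server-sent events.
    headers: { "content-type": "text/event-stream" },
  });
}
```

Because the stream is passed straight into the Response, the Worker does not wait for the full completion before the client starts receiving tokens.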
Rate Limits and Production Concerns
rate limit — the maximum number of AI inference requests allowed per minute or per day on a given plan; exceeding the limit causes Workers AI to return a 429 Too Many Requests error.
“We hit the rate limit during load testing, so we added a caching layer to serve repeated queries from Workers KV instead of calling the model each time.”
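The caching layer from the example sentence could be sketched as follows. KvNamespace lists only the get and put methods used here, mirroring the Workers KV API; cachedGenerate, the key prefix, and the one-hour TTL are hypothetical choices, and the model name is an assumption.

```typescript
// Hypothetical minimal type for the binding; illustration only.
interface AiBinding {
  run(model: string, input: Record<string, unknown>): Promise<unknown>;
}

// Subset of the Workers KV namespace API used in this sketch.
interface KvNamespace {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

// Assumed catalog model name; verify it in the Workers AI model catalog.
const TEXT_MODEL = "@cf/meta/llama-3.1-8b-instruct";

// Serves repeated prompts from KV and calls the model only on a cache miss.
async function cachedGenerate(kv: KvNamespace, ai: AiBinding, prompt: string): Promise<string> {
  // KV keys have a size limit, so long prompts should be hashed in real code.
  const key = `answer:${prompt}`;
  const cached = await kv.get(key);
  if (cached !== null) return cached;

  const output = (await ai.run(TEXT_MODEL, { prompt })) as { response?: string };
  const answer = output.response ?? "";
  // Keep the answer for one hour (hypothetical TTL).
  await kv.put(key, answer, { expirationTtl: 3600 });
  return answer;
}
```

Each cache hit is one inference request that does not count against the rate limit.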
neurons — the billing unit Cloudflare uses for Workers AI; each model run consumes a certain number of neurons depending on the model size and input length.
“Switching from the large language model to a smaller, task-specific model reduced our neuron consumption by roughly 60 percent for the same workload.”
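The arithmetic behind a statement like "reduced by roughly 60 percent" can be sketched with a small helper. The neuron totals are hypothetical, not measured values, and neuronReduction is an illustrative name; real counts would come from your account's usage data.

```typescript
// Percentage saved when usage drops from `before` to `after` neurons.
function neuronReduction(before: number, after: number): number {
  if (before <= 0) throw new Error("before must be positive");
  return ((before - after) * 100) / before;
}

// Hypothetical daily totals for the same workload on two models.
const largeModelNeurons = 50000;
const smallModelNeurons = 20000;

console.log(neuronReduction(largeModelNeurons, smallModelNeurons)); // 60
```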
gateway — Cloudflare AI Gateway, an optional proxy layer you can route AI requests through to gain logging, caching, rate limiting, and cost analytics across multiple AI providers.
“We routed all Workers AI calls through AI Gateway so the team can see a dashboard of inference costs and error rates without instrumenting the Worker code manually.”
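Routing a call through AI Gateway might look like this sketch, which assumes run() accepts a third options argument with a gateway object. The gateway id my-gateway is a placeholder for a gateway you have already created, the model name is an assumption, and generateViaGateway is a hypothetical helper.

```typescript
// Options assumed for routing a call through AI Gateway.
interface GatewayOptions {
  gateway: { id: string; skipCache?: boolean; cacheTtl?: number };
}

// Hypothetical minimal type for the binding; illustration only.
interface AiBinding {
  run(
    model: string,
    input: Record<string, unknown>,
    options?: GatewayOptions,
  ): Promise<unknown>;
}

// Assumed catalog model name; verify it in the Workers AI model catalog.
const TEXT_MODEL = "@cf/meta/llama-3.1-8b-instruct";

// Placeholder gateway id; replace it with the id of your own gateway.
const GATEWAY_ID = "my-gateway";

// Runs the model through the gateway so the call is logged there.
async function generateViaGateway(ai: AiBinding, prompt: string): Promise<string> {
  const output = (await ai.run(
    TEXT_MODEL,
    { prompt },
    { gateway: { id: GATEWAY_ID } },
  )) as { response?: string };
  return output.response ?? "";
}
```

The Worker code stays almost unchanged; only the extra options argument tells Workers AI to send the request through the gateway.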
Practice
Create a minimal Cloudflare Worker with an AI binding in wrangler.toml. Call env.AI.run() with a text generation model and a short prompt, then return the result as a plain text response. Describe in English what the AI binding does and why you need it before you can call run().