English Vocabulary for Modal Labs GPU Functions
Learn the professional English vocabulary for Modal Labs — decorators, GPU parameters, volumes, web endpoints, cold starts, and how to talk about them in team discussions.
Modal is a cloud platform for running Python functions on scalable infrastructure, with first-class support for GPU workloads, ML model serving, and background jobs. If your team trains models, runs inference, or deploys ML pipelines using Modal, you need precise English to discuss decorators, resource allocation, and deployment patterns. This post covers the core Modal vocabulary you will encounter in engineering standups, code reviews, and architecture discussions.
Key Vocabulary
@app.function decorator
The Python decorator that transforms a regular function into a Modal function — one that runs remotely in Modal’s cloud infrastructure. You attach it to any function you want Modal to manage, and it accepts parameters that control the execution environment.
Example: “Add the @app.function decorator to the inference function and set the memory to 8 GB — right now it’s hitting the default limit and crashing.”
gpu= parameter
The argument passed to @app.function (or @app.cls) that specifies which GPU type the function should run on. Common values include "A10G", "A100", "H100", and "T4". Choosing the right GPU tier is a cost-performance decision teams discuss regularly.
Example: “We’re using gpu='A10G' for the embedding generation job — it’s fast enough and about 60% cheaper than the A100 for this batch size.”
Volume
A Modal-managed persistent file system that can be mounted into function containers. Unlike in-memory storage, volumes survive across function invocations and can be shared between multiple functions in the same app.
Example: “Mount a volume at /model-cache so the function doesn’t re-download the 15 GB model checkpoint on every cold start.”
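The "download once, reuse afterwards" idea behind this example can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name ensure_checkpoint, the file name, and the download callback are hypothetical, and nothing here calls Modal's API. Inside a Modal function, cache_dir would be the path where the volume is mounted, such as /model-cache.

```python
from pathlib import Path
from typing import Callable


def ensure_checkpoint(cache_dir: Path, filename: str,
                      download: Callable[[Path], None]) -> Path:
    """Return the checkpoint path, downloading only if it is not cached yet."""
    target = cache_dir / filename
    if not target.exists():
        # First call (for example, the first cold start): fetch and store the file.
        cache_dir.mkdir(parents=True, exist_ok=True)
        download(target)  # hypothetical callback that writes the file to `target`
    # Later calls find the file already on disk and skip the download.
    return target
```

When a teammate says "the weights are cached on the volume," this is the behavior they usually mean: the expensive download happens once, and every later container start reads the file from the mounted path.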
web_endpoint
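A decorator that exposes a Modal function as an HTTP endpoint, turning a Python function into a serverless API without setting up and hosting a separate web server yourself. Useful for model inference APIs that need to be called over HTTP. Newer releases of the Modal client name this decorator fastapi_endpoint, so you may hear either name in discussions.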
Example: “I wrapped the generation function with web_endpoint so the product team can call it directly over HTTPS — no need to go through the internal queue.”
stub.run() / app.run()
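The call that starts a temporary Modal app from your development machine or a script. It is used as a context manager: functions invoked inside the "with app.run():" block execute remotely in Modal’s cloud while their output streams back to your terminal. stub.run() is the same call in older code, written when the app object was still called a stub.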
Example: “Use app.run() in the main block so you can test the pipeline end-to-end locally — it will still execute on Modal infrastructure.”
Cold start
The latency penalty that occurs when a Modal function is invoked but no warm container is already running. The platform must provision infrastructure, download the image, and load the model before the first request is served. Teams actively work to minimize cold start time.
Example: “The cold start for the image generation endpoint is around 40 seconds because of the model download; we should look into keep-warm strategies or pre-downloading to a volume.”
keep_warm
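A parameter that tells Modal to maintain a minimum number of warm containers even when there is no traffic. This avoids cold starts for any request the warm containers can absorb, but you pay for those containers continuously, whether or not they are serving traffic. Recent versions of the Modal client rename this parameter to min_containers, so you may see either name in code.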
Example: “Set keep_warm=1 on the customer-facing inference endpoint — a 40-second cold start is not acceptable for production users.”
Image
In Modal, an Image is a configured container environment — a base OS image with dependencies installed. You define it programmatically using Modal’s image builder API and it is cached and reused across function invocations.
Example: “Build a Modal Image with CUDA 12.1 and install torch, transformers, and accelerate — the function needs a specific CUDA version to run the quantized model.”
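Several of the terms above can appear together in one short file. The following is a minimal sketch, assuming a recent version of the Modal Python client; the app name, volume name, and the embed function are hypothetical, and parameter names can differ between client versions, so check the reference for the version you have installed.

```python
import modal

app = modal.App("vocab-demo")  # hypothetical app name

# Image: a container environment defined in code and cached between runs.
image = modal.Image.debian_slim().pip_install("torch", "transformers", "accelerate")

# Volume: persistent storage, looked up by a hypothetical name and created if missing.
model_cache = modal.Volume.from_name("model-cache", create_if_missing=True)


@app.function(
    image=image,
    gpu="A10G",                             # GPU tier
    memory=8192,                            # requested memory in MiB (about 8 GB)
    timeout=30 * 60,                        # 30 minutes, expressed in seconds
    volumes={"/model-cache": model_cache},  # mount path -> volume
)
def embed(texts: list[str]) -> int:
    # Placeholder body; a real function would load a model from /model-cache.
    return len(texts)


@app.local_entrypoint()
def main():
    # .remote() executes the function on Modal's infrastructure, not locally.
    print(embed.remote(["hello", "world"]))
```

Saved to a file, a sketch like this would typically be started with the modal run command. In a review you could describe it as: "This function runs on an A10G with 8 GB of memory, a 30-minute timeout, and a volume mounted at /model-cache."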
How to Use This Vocabulary
When discussing Modal deployments, engineers typically describe functions in terms of their resource configuration (GPU type, memory, timeout), their triggering mechanism (scheduled, event-driven, or HTTP), and their cold start characteristics. A sentence like “We’re running the fine-tuning job on an A100 with a shared volume for checkpoints and a 30-minute timeout” communicates everything a teammate needs to understand the cost and architecture of that function.
In code reviews, vocabulary around keep_warm, cold starts, and GPU tier selection often comes up in the context of cost optimization. Teams balance the need for low latency (which favors keep_warm and larger GPUs) against cost (which favors smaller GPUs and scaling to zero).
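The cost side of that balance is simple arithmetic. The sketch below is a rough illustration, assuming that cost is proportional to the time a container is running; the hourly rate and traffic numbers are hypothetical placeholders, not real prices, and the model ignores the time spent in cold starts and any idle period before a container shuts down.

```python
def always_warm_cost(hourly_rate: float, hours: float, containers: int = 1) -> float:
    """Cost of keeping `containers` warm for `hours`, whether busy or idle."""
    return hourly_rate * hours * containers


def scale_to_zero_cost(hourly_rate: float, requests: int,
                       seconds_per_request: float) -> float:
    """Cost when containers are paid for only while serving requests."""
    busy_hours = requests * seconds_per_request / 3600
    return hourly_rate * busy_hours


RATE = 1.10  # hypothetical GPU price in USD per hour, not a real quote
warm = always_warm_cost(RATE, hours=24 * 30)  # one warm container for 30 days
on_demand = scale_to_zero_cost(RATE, requests=2000, seconds_per_request=2.0)
print(f"always warm: ${warm:.2f}, scale to zero: ${on_demand:.2f}")
```

For a low-traffic endpoint the gap is large, which is why reviewers often ask whether the latency benefit of keep_warm justifies the constant cost.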
Example Conversation
Alex: The inference endpoint has a 45-second cold start. Users are complaining.
Sam: We should store the model weights on a volume and mount it, so we skip the download. And set keep_warm=1 during peak hours.
Alex: Good call. What GPU tier are we on — still A10G?
Sam: Yes, A10G is fine for this model size. Switching to A100 would cost 3x more for maybe 20% faster inference.
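Sam's estimate can be checked with one line of arithmetic. The sketch below assumes that "3x more" means three times the hourly price and that "20% faster" means 1.2 times the throughput; the function name is hypothetical, and the figures come from the conversation, not from a price list.

```python
def relative_cost_per_request(price_ratio: float, speedup: float) -> float:
    """Cost per request on the bigger GPU relative to the smaller one.

    price_ratio: hourly price of the big GPU / hourly price of the small GPU
    speedup:     throughput of the big GPU / throughput of the small GPU
    """
    if price_ratio <= 0 or speedup <= 0:
        raise ValueError("price_ratio and speedup must be positive")
    return price_ratio / speedup


ratio = relative_cost_per_request(price_ratio=3.0, speedup=1.2)
print(f"A100 cost per request is {ratio:.1f}x the A10G cost")
```

Under these assumptions each request costs about 2.5 times as much on the A100, which is the kind of number that makes "A10G is fine for this model size" a convincing statement in a standup.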
Practice
- Find a Modal function in an open-source repository and identify the GPU type, image definition, and whether a volume is mounted. Try describing it aloud: “This function runs on an [X] GPU, uses an image with [Y] dependencies, and mounts a volume at [Z].”
- Write a sentence explaining the trade-off between keep_warm=1 and scaling to zero for a low-traffic model serving endpoint. Use the words “cold start,” “latency,” and “cost.”
- In a code review comment, explain why gpu='H100' might be excessive for a text embedding function that processes 512-token inputs and suggest a more cost-appropriate alternative.