For most of the last decade, running an AI model meant renting a GPU server, managing drivers, and paying for that hardware whether requests came in or not. Cloudflare Workers AI changes that arrangement entirely: you call an AI model the same way you call any other function in a Worker, it runs on GPUs distributed across Cloudflare’s global network, and you pay only for what you use. No servers to provision, no GPUs to babysit, and inference that happens close to your users.
This guide explains what Workers AI is, which models you can run, what it costs under the Neurons pricing model, and how to call it from code, including a practical image generation example. It is part of my ongoing Cloudflare playbook for building fast, modern applications on the edge.
TL;DR
- Workers AI is serverless AI: it runs LLMs, image generation, embeddings and audio models on Cloudflare’s edge GPUs, called from a Worker
- Pricing uses Neurons: you get 10,000 Neurons per day free, then pay $0.011 per 1,000 Neurons
- It covers four model families: text (LLMs), image, embeddings and audio, with new models added regularly
- You can generate images at the edge and optimise them with Cloudflare Image Transformations
- It pairs naturally with the rest of the Workers platform: KV, R2, D1 and Vectorize for full AI applications
- Need help shipping an AI feature? That is exactly what my AI implementation service is for
What Workers AI Actually Is
Cloudflare Workers AI lets you run machine learning models on Cloudflare’s network using a simple binding inside a Worker. Cloudflare hosts the GPUs and the models; you send input and receive output. There is no infrastructure to manage, no cold-start GPU spin-up to pay for, and no minimum monthly commitment.
Because the inference runs on Cloudflare’s distributed network rather than in a single region, the model executes near your user. For interactive features such as chat, classification or content generation, that proximity cuts the network round trip compared with sending every request to one centralised GPU cluster.
The models are open and managed by Cloudflare. You reference a model by name, such as a Llama variant for text or a Flux variant for images, and Cloudflare keeps the catalogue current as better open models are released.
Cloudflare Workers AI Models You Can Run
Workers AI organises its catalogue into four main families, plus a set of specialised models:
- Text (LLMs). Language models such as Llama, Mistral and Qwen variants for chat, summarisation, extraction, classification and content generation. Billed on input and output tokens.
- Image. Generation models such as Flux variants that create images from text prompts. Billed by tiles and steps.
- Embeddings. Models such as BGE that turn text into vectors for semantic search and retrieval-augmented generation. Billed on input tokens.
- Audio. Speech-to-text and text-to-speech models. Billed per minute or per character.
- Other. Specialised models for translation, reranking, image classification and recognition.
The catalogue evolves quickly, so always check the current model list for the latest options and their exact pricing.
Cloudflare Workers AI Pricing: How Neurons Work
Workers AI uses a unified unit called a Neuron to express the cost of inference across every model type. Rather than juggling separate prices for tokens, tiles, steps and audio minutes, Cloudflare converts all of them into Neurons so you have one number to reason about.
Based on the official Workers AI pricing for 2026:
| Usage | Allocation or rate |
|---|---|
| Free allocation (Free and Paid plans) | 10,000 Neurons per day |
| Paid usage beyond the free allocation | $0.011 per 1,000 Neurons |
The daily free allocation resets at 00:00 UTC. For prototypes, side projects and low-traffic features, 10,000 Neurons per day often means your AI feature runs at no cost at all. When you exceed it, the rate is low enough that a substantial amount of inference still costs only a few dollars.
Because different models consume Neurons at different rates, the practical cost depends on which model you call and how much data you send. The pricing page lists the Neuron cost per model, so you can estimate before you build.
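To make the arithmetic concrete, here is a rough sketch of a daily cost estimate that uses only the two figures in the table above. The helper name `estimateDailyCostUsd` and the usage numbers are illustrative assumptions, and it assumes the free allocation is subtracted before any paid usage is counted. The Neuron count itself still has to come from the per-model rates on the pricing page.

```javascript
// Rough daily cost estimate from the two published figures quoted above.
// Assumption: the daily free allocation is applied before any paid usage.
const FREE_NEURONS_PER_DAY = 10_000;
const USD_PER_1000_NEURONS = 0.011;

// Hypothetical helper, not part of any Cloudflare SDK.
function estimateDailyCostUsd(neuronsUsed) {
  const billable = Math.max(0, neuronsUsed - FREE_NEURONS_PER_DAY);
  return (billable / 1000) * USD_PER_1000_NEURONS;
}

// 8,000 Neurons stays inside the free allocation.
console.log(estimateDailyCostUsd(8_000).toFixed(2));
// 110,000 Neurons leaves 100,000 billable Neurons.
console.log(estimateDailyCostUsd(110_000).toFixed(2));
```

On those assumptions, a day that consumes 110,000 Neurons has 100,000 billable Neurons, which works out to about $1.10.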
Calling a Text Model From a Worker
Here is the core pattern. You bind Workers AI in your wrangler.toml, then call env.AI.run() with a model name and input:
```javascript
export default {
  async fetch(request, env) {
    // env.AI is the Workers AI binding configured for this Worker.
    const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        { role: "system", content: "You are a concise assistant." },
        { role: "user", content: "Summarise the benefits of edge computing in one sentence." },
      ],
    });

    // Return the model's raw output as JSON.
    return Response.json(response);
  },
};
```
The AI binding is configured once in your Worker’s settings. After that, calling a model is a single asynchronous function call, returning the model’s output for you to use however you like.
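In practice you usually want just the generated text rather than the whole output object. The sketch below wraps the call in a small helper. `askModel` is a hypothetical name, and it assumes the text model returns an object with a `response` string, so check the model's page for its exact output schema before relying on it.

```javascript
// wrangler.toml needs an AI binding for env.AI to exist, for example:
//   [ai]
//   binding = "AI"

// Hypothetical helper. Assumption: the model's output is an object with a
// `response` string; anything else is treated as an error.
async function askModel(env, question) {
  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      { role: "system", content: "You are a concise assistant." },
      { role: "user", content: question },
    ],
  });
  if (typeof result?.response !== "string") {
    throw new Error("Unexpected model output shape");
  }
  return result.response.trim();
}

// Inside a Worker's fetch handler you might use it like this:
//   const answer = await askModel(env, "What is edge computing?");
//   return new Response(answer, { headers: { "content-type": "text/plain" } });
```

Keeping the model call in a plain function that takes `env` also makes it easy to test with a stubbed binding.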
Generating Images at the Edge
Image generation is one of the most compelling Workers AI use cases, and it ties straight back to image hosting and delivery, which is a recurring theme of this blog. You can generate an image from a text prompt and return the image bytes in the response:
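```javascript
export default {
  async fetch(request, env) {
    const inputs = { prompt: "a minimalist mountain landscape at sunrise, flat illustration" };

    const result = await env.AI.run(
      "@cf/black-forest-labs/flux-1-schnell",
      inputs
    );

    // Per Cloudflare's model page at the time of writing, this model returns
    // JSON with a base64-encoded `image` field rather than raw bytes, so
    // decode it before responding. Other image models may return a byte stream.
    const bytes = Uint8Array.from(atob(result.image), (ch) => ch.charCodeAt(0));

    return new Response(bytes, {
      headers: { "content-type": "image/jpeg" },
    });
  },
};
```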
A powerful pattern is to generate the image, store it in Cloudflare R2, and then serve it optimised through image transformations. That gives you AI-generated artwork, stored with zero egress fees, delivered as a perfectly sized WebP or AVIF. The entire pipeline, generation, storage and delivery, lives inside Cloudflare.
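The generate-and-store half of that pipeline can be sketched as follows. The names are assumptions for illustration: `env.IMAGES` is an R2 bucket binding you would define yourself, `generateAndStore` and `base64ToBytes` are hypothetical helpers, and the model is assumed to return JSON with a base64-encoded `image` field, which is how Cloudflare's model page describes flux-1-schnell at the time of writing.

```javascript
// Decode a base64 string into raw bytes.
function base64ToBytes(b64) {
  const binary = atob(b64);
  return Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
}

// Hypothetical helper: generate an image and store it in R2 under `key`.
// Assumptions: env.IMAGES is an R2 bucket binding, and the model output
// has the shape { image: "<base64>" }.
async function generateAndStore(env, prompt, key) {
  const result = await env.AI.run("@cf/black-forest-labs/flux-1-schnell", { prompt });
  const bytes = base64ToBytes(result.image);

  // Store the content type alongside the object so it can be served as an image.
  await env.IMAGES.put(key, bytes, {
    httpMetadata: { contentType: "image/jpeg" },
  });

  return { key, size: bytes.byteLength };
}
```

Delivery is deliberately left out of the sketch: once the object is in R2, serving and resizing it is a separate concern handled by image transformations.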
Building Complete AI Applications
Workers AI is most powerful in combination with the rest of the platform. A realistic AI application stitches several pieces together:
- Workers AI for inference (LLMs, embeddings, image generation)
- Vectorize as a vector database for semantic search and retrieval-augmented generation
- R2 for storing documents, images or audio
- D1 for structured application data (read my D1 guide)
- KV for caching and configuration
For example, a documentation chatbot would embed your content with an embeddings model, store the vectors in Vectorize, retrieve the relevant chunks at query time, and feed them to an LLM, all within a single Worker, all on the edge. This is the architecture behind most modern retrieval-augmented AI features. You can manage the storage layers behind these apps from your desktop with my free Easy Cloudflare R2, Easy Cloudflare D1 and Easy Cloudflare KV apps, available for Windows, macOS and Linux.
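The query-time half of that chatbot flow can be sketched in a few lines. Treat the details as assumptions: `env.VECTORIZE` is a Vectorize index binding you would configure, each stored vector is assumed to carry its chunk in a `text` metadata field, `answerFromDocs` is a hypothetical helper, and the two model names are examples from the catalogue that may change. The `returnMetadata: "all"` option follows the current Vectorize query API as I understand it, and older indexes may expect a boolean instead.

```javascript
// Sketch of retrieval-augmented generation at query time.
// Assumptions: env.VECTORIZE is a Vectorize index binding and each stored
// vector has a `text` metadata field containing the original chunk.
async function answerFromDocs(env, question) {
  // 1. Embed the question with an embeddings model.
  const embedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [question],
  });
  const queryVector = embedding.data[0];

  // 2. Retrieve the closest chunks from the vector index.
  const { matches } = await env.VECTORIZE.query(queryVector, {
    topK: 3,
    returnMetadata: "all",
  });
  const context = matches
    .map((match) => match.metadata?.text ?? "")
    .filter(Boolean)
    .join("\n---\n");

  // 3. Ask the LLM to answer using only the retrieved context.
  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      { role: "system", content: `Answer using only this context:\n${context}` },
      { role: "user", content: question },
    ],
  });
  return result.response;
}
```

The indexing half (chunking documents, embedding them and inserting the vectors) would run separately, typically when content is published rather than on every request.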
When Workers AI Is the Right Choice
Workers AI is an excellent fit when:
- You want AI inference without managing GPU infrastructure
- Low latency matters and your users are global
- Your workload is spiky or unpredictable, so pay-per-use beats reserved hardware
- You are already building on Workers and want everything in one platform
- The open models in the catalogue meet your quality needs
It is less suited to cases that require a specific proprietary frontier model not in the catalogue, or extremely heavy batch inference where dedicated hardware may be cheaper at sustained full utilisation. For those, a hybrid approach, calling an external model provider from your Worker, is common, and something I help clients design through my AI integration services.
Key Takeaways
- Workers AI runs LLMs, image, embeddings and audio models on Cloudflare’s edge GPUs with no infrastructure to manage
- Pricing is unified in Neurons: 10,000 free per day, then $0.011 per 1,000 Neurons
- Calling a model is a single env.AI.run() call inside a Worker
- You can generate images at the edge, store them in R2, and serve them optimised through transformations
- It combines with Vectorize, R2, D1 and KV to build complete AI applications on one platform
- Pay-per-use and global low latency make it ideal for spiky, user-facing AI features
Frequently Asked Questions
What is Cloudflare Workers AI? It is a service that runs machine learning models on Cloudflare’s network of edge GPUs. You call a model from a Worker using the AI binding, and Cloudflare handles the hardware and model hosting. There are no servers or GPUs for you to manage.
What is a Neuron in Workers AI pricing? A Neuron is Cloudflare’s unified unit for measuring inference cost across all model types. Tokens, image tiles, generation steps and audio minutes are all converted into Neurons so you have a single figure to track. You get 10,000 free per day, then pay $0.011 per 1,000.
Can Workers AI generate images? Yes. The catalogue includes text-to-image models such as Flux variants. You call the model with a prompt and receive the image data (raw bytes or base64, depending on the model), which you can return in the response, store in R2, or optimise with image transformations.
Is Workers AI free? There is a free allocation of 10,000 Neurons per day on both the Free and Paid Workers plans, which resets daily at 00:00 UTC. Many small features run entirely within this allowance. Beyond it, usage is billed at $0.011 per 1,000 Neurons.
Which models does Workers AI support? It supports four families: text LLMs (Llama, Mistral, Qwen), image generation (Flux), embeddings (BGE), and audio (speech-to-text and text-to-speech), plus specialised models for translation, reranking and classification. The catalogue is updated regularly.
How does Workers AI compare to running my own GPU server? Workers AI removes GPU provisioning, scaling and idle-cost management, and runs inference close to users globally. A dedicated GPU server can be cheaper only at sustained full utilisation; for spiky or user-facing workloads, pay-per-use on the edge usually wins.