
LLM API Pricing 2026 - 20+ Models Compared Per Token

By Rome Thorndike · April 2, 2026 · 12 min read

Full Model Pricing Comparison

For the complete multi-provider pricing table with GPT-4.1, Claude Opus, Gemini Pro, Llama 4, and 15+ models compared, see LLM Pricing 2026: Every Model from $0.01 to $75/1M. This page covers token cost fundamentals and batch/caching discount calculations.

LLM API pricing changes constantly. New models launch, old ones get cheaper, and providers quietly adjust rates between announcements. This page tracks every major model's cost per million tokens, updated for April 2026.

Whether you're picking a model for a new project, budgeting compute costs, or comparing providers for a procurement decision, the tables below give you the numbers you need without digging through five different pricing pages.

Master Pricing Table: All Major LLM APIs (April 2026)

Prices are in USD per 1 million tokens. Input is what you send to the model. Output is what the model generates back. Every model in the table charges more for output than for input because generating tokens requires more compute than reading them.

Provider | Model | Input / 1M Tokens | Output / 1M Tokens | Context Window
OpenAI | GPT-5 | $1.25 | $10.00 | 128K
OpenAI | GPT-4.1 | $2.00 | $8.00 | 1M
OpenAI | GPT-4.1 Mini | $0.40 | $1.60 | 1M
OpenAI | GPT-4.1 Nano | $0.10 | $0.40 | 1M
OpenAI | GPT-4o | $2.50 | $10.00 | 128K
OpenAI | GPT-4o Mini | $0.15 | $0.60 | 128K
OpenAI | o3 | $10.00 | $40.00 | 200K
OpenAI | o4-mini | $1.10 | $4.40 | 200K
Anthropic | Claude Opus 4.6 | $15.00 | $75.00 | 1M
Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | 1M
Anthropic | Claude Haiku 4.5 | $0.80 | $4.00 | 200K
Google | Gemini 2.5 Pro | $1.25 | $10.00 | 1M
Google | Gemini 2.0 Flash | $0.10 | $0.40 | 1M
Mistral | Mistral Large | $2.00 | $6.00 | 128K
Mistral | Mistral Small | $0.10 | $0.30 | 128K
Cohere | Command R+ | $2.50 | $10.00 | 128K
Cohere | Command R | $0.15 | $0.60 | 128K

Note on Gemini 2.5 Pro: Google charges $1.25/$10 for prompts over 200K tokens. Under 200K, input drops to $0.625 and output to $5.00. The table shows the higher tier since most production use cases hit the 200K+ range with system prompts and context.

Models by Budget Tier

Not every project needs a frontier model. Here is how models break down by cost, so you can match your budget to the right capability level.

Under $1 per 1M Input Tokens (Budget Tier)

These models handle classification, extraction, summarization, and simple chat at rock-bottom prices.

Model | Input / 1M | Output / 1M | Best For
GPT-4.1 Nano | $0.10 | $0.40 | High-volume classification, simple extraction
Gemini 2.0 Flash | $0.10 | $0.40 | Fast inference, multimodal on a budget
Mistral Small | $0.10 | $0.30 | Lightweight European-hosted tasks
GPT-4o Mini | $0.15 | $0.60 | General-purpose cheap model
Command R | $0.15 | $0.60 | RAG-optimized retrieval tasks
GPT-4.1 Mini | $0.40 | $1.60 | Coding and instruction following on a budget
Claude Haiku 4.5 | $0.80 | $4.00 | Fast responses, customer-facing chat

$1 to $5 per 1M Input Tokens (Mid-Range)

The sweet spot for most production applications. These models handle complex reasoning, coding, and multi-step tasks reliably.

Model | Input / 1M | Output / 1M | Best For
o4-mini | $1.10 | $4.40 | Reasoning tasks at mid-range cost
GPT-5 | $1.25 | $10.00 | Frontier general intelligence
Gemini 2.5 Pro | $1.25 | $10.00 | Long-context analysis, multimodal
GPT-4.1 | $2.00 | $8.00 | Coding, long-context, instruction following
Mistral Large | $2.00 | $6.00 | European data residency, multilingual
GPT-4o | $2.50 | $10.00 | Multimodal (vision + text)
Command R+ | $2.50 | $10.00 | Enterprise RAG, grounded generation
Claude Sonnet 4.6 | $3.00 | $15.00 | Coding, analysis, agentic workflows

$5+ per 1M Input Tokens (Premium)

Model | Input / 1M | Output / 1M | Best For
o3 | $10.00 | $40.00 | Hard reasoning, math, science problems
Claude Opus 4.6 | $15.00 | $75.00 | Complex agentic tasks, deep analysis

Premium models are rarely needed for production workloads. Use them for difficult reasoning tasks, complex code generation, or when accuracy on edge cases justifies paying roughly 3-17x more per token than the mid-range options above.
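As a rough illustration of the three tiers, the sketch below maps an input price to a tier label. The function name `budget_tier` and the handling of exact boundary values ($1.00 counts as mid-range, $5.00 as premium) are assumptions made for the example; the prices are copied from the master table.

```python
def budget_tier(input_price_per_1m: float) -> str:
    """Map an input price (USD per 1M tokens) to the tiers used in this guide."""
    if input_price_per_1m < 1.00:
        return "budget"
    if input_price_per_1m < 5.00:
        return "mid-range"
    return "premium"


# A few rows from the master table: (input, output) in USD per 1M tokens.
PRICES = {
    "GPT-4.1 Nano": (0.10, 0.40),
    "Claude Haiku 4.5": (0.80, 4.00),
    "GPT-4.1": (2.00, 8.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "o3": (10.00, 40.00),
    "Claude Opus 4.6": (15.00, 75.00),
}

for model, (input_price, _output_price) in PRICES.items():
    print(f"{model}: {budget_tier(input_price)}")
```

The tiers key on input price only, as the headings do. A workload that generates long outputs may want to classify on output price instead.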

Batch API Discounts

If your workload can tolerate latency (minutes to hours instead of seconds), batch APIs cut costs significantly.

Provider | Batch Discount | Typical Latency | How It Works
OpenAI | 50% off all models | Up to 24 hours | Submit JSONL file, results returned asynchronously. Available for all GPT and o-series models.
Anthropic | 50% off all models | Up to 24 hours | Message Batches API. Submit up to 100,000 requests per batch. Results within 24 hours.
Google | 50% off Gemini models | Up to 24 hours | BatchGenerateContent API. Minimum 2x discount on all Gemini models through Vertex AI.
Mistral | Variable | Varies | Batch inference available through La Plateforme. Discount varies by volume commitment.

With batch pricing, GPT-4.1 drops to $1.00 input / $4.00 output per million tokens. Claude Sonnet 4.6 drops to $1.50 / $7.50. These are significant savings for data processing pipelines, evaluation runs, and content generation at scale.
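The batch arithmetic can be sketched as follows. The helper names and the example job size (20M input tokens, 5M output tokens) are hypothetical; the 50% discount is the figure from the table above and would need adjusting for providers with variable discounts.

```python
BATCH_DISCOUNT = 0.50  # 50% off, per the batch table (OpenAI, Anthropic, Google)


def batch_price(input_per_1m, output_per_1m, discount=BATCH_DISCOUNT):
    """Return (input, output) USD per 1M tokens after the batch discount."""
    factor = 1 - discount
    return (round(input_per_1m * factor, 4), round(output_per_1m * factor, 4))


def job_cost(input_tokens, output_tokens, input_per_1m, output_per_1m):
    """Total USD cost of a job given token counts and per-1M prices."""
    return (input_tokens * input_per_1m + output_tokens * output_per_1m) / 1_000_000


# Hypothetical evaluation run on GPT-4.1 ($2.00 / $8.00 per 1M).
standard = job_cost(20_000_000, 5_000_000, 2.00, 8.00)
batched = job_cost(20_000_000, 5_000_000, *batch_price(2.00, 8.00))
print(f"standard: ${standard:.2f}, batch: ${batched:.2f}")
```

Because the discount applies equally to input and output, the saving is the same 50% regardless of the input/output mix.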

Prompt Caching

Prompt caching reduces costs when you send the same system prompt or context prefix repeatedly. Instead of reprocessing identical tokens every call, the provider caches them and charges a reduced rate.

Provider | Cache Write Cost | Cache Read Discount | TTL | Min Tokens
OpenAI | Free (automatic) | 50% off input | 5-10 min | 1,024
Anthropic | 25% surcharge on first write | 90% off input | 5 min (refreshes on hit) | 1,024 (Haiku), 2,048 (Sonnet/Opus)
Google | Same as input | 75% off input | Configurable | 32,768

Anthropic's caching is the most aggressive: 90% off cached input tokens means a long system prompt that costs $3.00/1M on Sonnet 4.6 drops to $0.30/1M on cache hits. The 25% write surcharge pays for itself on the first cache hit: one write plus one cached read costs 1.35x the base input rate, versus 2x for two uncached requests. OpenAI's caching is automatic (no code changes needed) but gives a smaller discount. Google requires the most tokens before caching kicks in but offers configurable TTL.
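To see how quickly a write surcharge is recovered, here is a minimal sketch using the Anthropic figures from the table (1.25x on the first write, 0.10x on reads). It assumes every follow-up request arrives within the TTL and hits the cache; the function name and the 2,000-token prompt are hypothetical.

```python
def prefix_cost(prefix_tokens, requests, input_per_1m,
                cached=True, write_multiplier=1.25, read_multiplier=0.10):
    """USD cost of sending the same prompt prefix `requests` times.

    Assumes one cache write followed by cache hits on every later request.
    """
    base = prefix_tokens * input_per_1m / 1_000_000  # one uncached send
    if not cached or requests == 0:
        return base * requests
    return base * (write_multiplier + read_multiplier * (requests - 1))


# Hypothetical 2,000-token system prompt on Claude Sonnet 4.6 ($3.00 per 1M input).
for n in (1, 2, 10, 100):
    with_cache = prefix_cost(2_000, n, 3.00)
    without = prefix_cost(2_000, n, 3.00, cached=False)
    print(f"{n:>3} requests: cached ${with_cache:.4f} vs uncached ${without:.4f}")
```

With a single request the cached path costs more, because only the write surcharge applies. From the second request onward it is cheaper, and the gap widens with every additional hit.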

Cost Per 1K Tokens (Conversion Table)

Some documentation and older pricing pages still reference cost per 1,000 tokens. To convert: divide the per-1M price by 1,000.

Model | Input / 1K Tokens | Output / 1K Tokens
GPT-4.1 Nano | $0.0001 | $0.0004
Gemini 2.0 Flash | $0.0001 | $0.0004
GPT-4o Mini | $0.00015 | $0.0006
GPT-4.1 Mini | $0.0004 | $0.0016
Claude Haiku 4.5 | $0.0008 | $0.004
GPT-5 | $0.00125 | $0.01
GPT-4.1 | $0.002 | $0.008
GPT-4o | $0.0025 | $0.01
Claude Sonnet 4.6 | $0.003 | $0.015
o3 | $0.01 | $0.04
Claude Opus 4.6 | $0.015 | $0.075

Per-1K pricing looks deceptively cheap. Always multiply by 1,000 to understand real costs at scale. A chatbot handling 1 million tokens per day at $0.002/1K input costs $2/day or $60/month just for input tokens.
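The unit conversions are simple enough to keep as helpers. This is a small sketch with hypothetical function names, checked against the GPT-4.1 row and the chatbot example above.

```python
def per_1m_to_per_1k(price_per_1m):
    """USD per 1M tokens -> USD per 1K tokens."""
    return price_per_1m / 1_000


def per_1k_to_per_1m(price_per_1k):
    """USD per 1K tokens -> USD per 1M tokens."""
    return price_per_1k * 1_000


def daily_cost(tokens_per_day, price_per_1k):
    """USD per day for a given token volume at a per-1K price."""
    return tokens_per_day / 1_000 * price_per_1k


gpt41_input_per_1k = per_1m_to_per_1k(2.00)
per_day = daily_cost(1_000_000, gpt41_input_per_1k)
print(f"${gpt41_input_per_1k:.4f} per 1K -> ${per_day:.2f}/day, ${per_day * 30:.2f}/month")
```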

How to Estimate Your Monthly API Costs

Use this formula to budget your LLM spend before committing to a provider.

Monthly Cost = [(Daily Requests x Avg Input Tokens x Input Price per 1M) + (Daily Requests x Avg Output Tokens x Output Price per 1M)] / 1,000,000 x 30

Example 1: Customer Support Chatbot

  • 500 conversations/day, 800 input tokens avg (system prompt + user message), 400 output tokens avg
  • Using Claude Sonnet 4.6 ($3/$15 per 1M)
  • Input: 500 x 800 = 400,000 tokens/day = $1.20/day
  • Output: 500 x 400 = 200,000 tokens/day = $3.00/day
  • Monthly: ($1.20 + $3.00) x 30 = $126/month

Example 2: Document Processing Pipeline

  • 200 documents/day, 5,000 input tokens avg (document + extraction prompt), 500 output tokens avg
  • Using GPT-4.1 Mini ($0.40/$1.60 per 1M)
  • Input: 200 x 5,000 = 1,000,000 tokens/day = $0.40/day
  • Output: 200 x 500 = 100,000 tokens/day = $0.16/day
  • Monthly: ($0.40 + $0.16) x 30 = $16.80/month

Example 3: High-Volume Classification

  • 50,000 items/day, 200 input tokens avg, 50 output tokens avg
  • Using GPT-4.1 Nano ($0.10/$0.40 per 1M)
  • Input: 50,000 x 200 = 10,000,000 tokens/day = $1.00/day
  • Output: 50,000 x 50 = 2,500,000 tokens/day = $1.00/day
  • Monthly: ($1.00 + $1.00) x 30 = $60/month

These estimates assume no caching or batching. With prompt caching on a chatbot (where the system prompt repeats), expect 30-60% lower input costs. With batch API, cut both input and output costs in half.
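The estimate can be written as a small function. This is a sketch with a hypothetical name and signature; it assumes a 30-day month and no caching or batching, and it is run here on the three examples above.

```python
def monthly_cost(daily_requests, avg_input_tokens, avg_output_tokens,
                 input_per_1m, output_per_1m, days=30):
    """Estimated monthly USD spend, with prices given per 1M tokens."""
    daily_input = daily_requests * avg_input_tokens * input_per_1m / 1_000_000
    daily_output = daily_requests * avg_output_tokens * output_per_1m / 1_000_000
    return (daily_input + daily_output) * days


examples = {
    "Support chatbot (Claude Sonnet 4.6)": monthly_cost(500, 800, 400, 3.00, 15.00),
    "Document pipeline (GPT-4.1 Mini)": monthly_cost(200, 5_000, 500, 0.40, 1.60),
    "Classification (GPT-4.1 Nano)": monthly_cost(50_000, 200, 50, 0.10, 0.40),
}

for name, cost in examples.items():
    print(f"{name}: ${cost:.2f}/month")
```

The three calls reproduce the $126, $16.80, and $60 monthly figures. Swapping in a different model's prices shows how sensitive each workload is to the input/output split.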

Provider Comparison by Use Case

Cheapest for High-Volume Chatbots

Winner: GPT-4.1 Nano ($0.10/$0.40) or Gemini 2.0 Flash ($0.10/$0.40). Both cost the same and handle conversational tasks well. Gemini Flash has the edge for multimodal inputs (images in chat). GPT-4.1 Nano has stronger instruction following for structured system prompts. Mistral Small ($0.10/$0.30) is cheapest on output if you need European data residency.

Best for Coding Assistants

Winner: Claude Sonnet 4.6 ($3/$15). Consistently top-ranked on coding benchmarks. GPT-4.1 ($2/$8) is a strong alternative at lower cost, especially for its 1M context window that fits entire codebases. For budget coding, GPT-4.1 Mini ($0.40/$1.60) punches well above its price.

Best for Complex Reasoning

Winner: o3 ($10/$40) for math-heavy and scientific reasoning. Claude Opus 4.6 ($15/$75) for detail-sensitive analysis and agentic multi-step tasks. These are premium models for premium problems. For most reasoning tasks, Claude Sonnet 4.6 or GPT-5 at a fraction of the cost will be sufficient.

Best for RAG and Retrieval

Winner: Command R+ ($2.50/$10). Cohere built Command R+ specifically for retrieval-augmented generation with built-in citation support. Google Gemini 2.5 Pro is the alternative when you need a massive context window (1M tokens) to stuff retrieved documents into a single prompt.

Best for Enterprise with Data Residency Requirements

Winner: Mistral Large ($2/$6). Hosted in Europe, strong multilingual performance, and competitive pricing. Mistral is the default choice when GDPR compliance and data residency are non-negotiable.

Open Model APIs vs Frontier Model APIs: Where Open Wins on Price in 2026

The short answer: open-weight model APIs are dramatically cheaper than frontier model APIs in 2026, sometimes by 10-100x. The catch is that they measurably trail frontier models on hard reasoning, agentic tool use, and frontier-grade coding. For workloads where the trade-off makes sense (high-volume chat, classification, RAG, summarization, content generation), open-weight APIs through Together AI, Fireworks AI, Groq, and DeepInfra are now the price floor that frontier providers like OpenAI, Anthropic, and Google compete against.

Open-weight pricing across the major hosted-inference providers (verified April 2026):

Open Model | Host | Input / 1M Tokens | Output / 1M Tokens | Context Window
Llama 4 Maverick (400B) | Together AI | $0.27 | $0.85 | 1M
Llama 4 Maverick (400B) | Fireworks AI | $0.22 | $0.88 | 1M
Llama 4 Scout (109B) | Together AI | $0.18 | $0.59 | 10M
Llama 3.3 70B | Together AI | $0.88 | $0.88 | 128K
Llama 3.3 70B | Groq | $0.59 | $0.79 | 128K
DeepSeek-V3 | DeepInfra | $0.27 | $1.10 | 128K
DeepSeek-R1 | Together AI | $3.00 | $7.00 | 128K
Qwen 2.5 72B | Together AI | $1.20 | $1.20 | 128K
Mistral Nemo (12B) | Mistral La Plateforme | $0.15 | $0.15 | 128K
Mixtral 8x22B | Together AI | $1.20 | $1.20 | 64K

The headline matchups against frontier models:

  • Llama 4 Scout ($0.18/$0.59) vs Claude Sonnet 4.6 ($3/$15): Scout is 16x cheaper on input and 25x cheaper on output. Quality gap on general chat is small; gap on hard coding and multi-step agents is real. For RAG, summarization, classification, and most production chatbots, Scout will save 90%+ of the bill at acceptable quality.
  • DeepSeek-V3 ($0.27/$1.10) vs GPT-4.1 ($2/$8): DeepSeek-V3 is 7x cheaper on input and 7x cheaper on output. V3 matches GPT-4.1 on many coding benchmarks (it sits near the top of LiveCodeBench). The trade-off is provider concentration (US users typically route through DeepInfra, Fireworks, or Together rather than DeepSeek directly).
  • DeepSeek-R1 ($3/$7) vs o3 ($10/$40): R1 is about 3x cheaper on input and more than 5x cheaper on output. R1 is a reasoning model in the same class as o3 and scores within a few points on AIME, GPQA, and MATH. For reasoning workloads where you can tolerate a small quality gap, R1 is the cheapest credible alternative to o3 in 2026.
  • Llama 4 Maverick ($0.22/$0.88 on Fireworks) vs Claude Opus 4.6 ($15/$75): Maverick is 68x cheaper on input and 85x cheaper on output. Quality gap is meaningful on agents and tool use; on document analysis and long-context summarization, the gap narrows.
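The "Nx cheaper" multiples in these matchups are plain price ratios. A short sketch for recomputing them when prices change, with a hypothetical helper name and prices copied from the tables above:

```python
def times_cheaper(open_price, frontier_price):
    """How many times cheaper the open model is (frontier / open)."""
    return frontier_price / open_price


# (label, open (input, output), frontier (input, output)) in USD per 1M tokens.
MATCHUPS = [
    ("Llama 4 Scout vs Claude Sonnet 4.6", (0.18, 0.59), (3.00, 15.00)),
    ("DeepSeek-V3 vs GPT-4.1", (0.27, 1.10), (2.00, 8.00)),
    ("DeepSeek-R1 vs o3", (3.00, 7.00), (10.00, 40.00)),
    ("Llama 4 Maverick (Fireworks) vs Claude Opus 4.6", (0.22, 0.88), (15.00, 75.00)),
]

for label, (open_in, open_out), (frontier_in, frontier_out) in MATCHUPS:
    input_ratio = times_cheaper(open_in, frontier_in)
    output_ratio = times_cheaper(open_out, frontier_out)
    print(f"{label}: {input_ratio:.1f}x input, {output_ratio:.1f}x output")
```

A ratio of N corresponds to a saving of 1 - 1/N on that side of the bill, so a 16x multiple is roughly a 94% reduction.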

The economic case for open-model APIs strengthens at three points: when the workload is high volume (token cost compounds fast), when you want to fine-tune (open-weight models support LoRA and full-parameter fine-tuning on the host), and when you need data residency or BYOC deployment (Fireworks, Together, and DeepInfra all support private deployments at higher rates but still below frontier prices).

When Open Loses to Frontier on Price

The naive "open = cheaper" framing breaks in a few specific cases. Three to flag:

  1. Frontier mini-tier vs open. GPT-4.1 Nano at $0.10/$0.40 and Gemini 2.0 Flash at $0.10/$0.40 are price-competitive with most open-weight 7B-13B models on hosted inference, while delivering quality closer to mid-range frontier models. For high-volume classification and routing, GPT-4.1 Nano often beats Llama 3.1 8B and Mistral 7B on both price and quality.
  2. Reasoning workloads. For dedicated reasoning models, the gap narrows. DeepSeek-R1 at $3/$7 is cheaper than o3 at $10/$40, but it is not cheaper than o4-mini at $1.10/$4.40, which performs close enough on most reasoning tasks that o4-mini is often the better dollar-per-quality choice.
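  3. Caching-heavy workloads. Anthropic's 90% prompt caching discount drops cached Sonnet 4.6 input reads to $0.30/MTok, which is within reach of Llama 4 Scout ($0.18) and below several open-model input rates in the table above (Llama 3.3 70B, Qwen 2.5 72B, Mixtral 8x22B). That rate applies only to tokens served from cache: at a 60% hit rate the blended input rate is about $1.38/MTok (0.6 x $0.30 + 0.4 x $3.00), so the hit rate has to be very high before Anthropic wins on price alone. Once you factor in Sonnet 4.6's higher quality on hard tasks, cache-heavy workloads can still come out ahead on total cost.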
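For the caching case, the effective input rate depends on what fraction of input tokens is actually served from cache. The sketch below blends the cached and uncached rates and solves for the hit rate needed to match a target price. It ignores the cache write surcharge, and both function names are hypothetical.

```python
def blended_input_rate(base_per_1m, hit_rate, read_discount=0.90):
    """Effective USD per 1M input tokens when `hit_rate` of tokens are cache reads."""
    cached_rate = base_per_1m * (1 - read_discount)
    return hit_rate * cached_rate + (1 - hit_rate) * base_per_1m


def required_hit_rate(base_per_1m, target_per_1m, read_discount=0.90):
    """Hit rate needed to reach `target_per_1m`, or None if caching cannot get there."""
    cached_rate = base_per_1m * (1 - read_discount)
    if target_per_1m >= base_per_1m:
        return 0.0
    hit_rate = (base_per_1m - target_per_1m) / (base_per_1m - cached_rate)
    return hit_rate if hit_rate <= 1.0 else None


# Claude Sonnet 4.6 input at $3.00 per 1M, 90% off cached reads.
for rate in (0.60, 0.80, 0.95):
    print(f"hit rate {rate:.0%}: ${blended_input_rate(3.00, rate):.2f} per 1M input")

print(required_hit_rate(3.00, 0.88))  # vs Llama 3.3 70B on Together AI
print(required_hit_rate(3.00, 0.18))  # vs Llama 4 Scout: below the cached rate
```

Under these assumptions, matching Llama 3.3 70B's $0.88 input rate takes a hit rate of roughly 79%. Llama 4 Scout's $0.18 cannot be matched on input price at any hit rate, since the cached rate itself is $0.30.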

The pattern in 2026: open-weight APIs dominate the price floor at mid quality, frontier mini-tiers dominate the price floor at low-medium quality with stronger instruction following, and frontier flagships dominate the quality ceiling at a 5-100x price premium. Pick the tier that maps to your accuracy bar, not just the cheapest line.

How to Pick the Right Tier: A Decision Framework

A practical hierarchy for 2026 deployments, in order of decreasing volume tolerance and increasing quality demand:

  • 10M+ tokens/day, classification/extraction: GPT-4.1 Nano, Gemini 2.0 Flash, or Llama 4 Scout on Together. $0.10-$0.18 input. Expect roughly $30-60/month for input at 10M tokens/day, scaling linearly with volume.
  • 1-10M tokens/day, customer-facing chat: Llama 4 Scout, GPT-4.1 Mini, or Claude Haiku 4.5 with caching. $0.18-$0.80 input. Expect $50-300/month.
  • 500K-5M tokens/day, coding assistant or analysis: DeepSeek-V3 (open) or Claude Sonnet 4.6 with caching (frontier). $0.27-$3.00 input. Expect $100-700/month.
  • 100K-1M tokens/day, agentic workflows: Claude Sonnet 4.6, GPT-4.1, or Llama 4 Maverick for cost-sensitive agents. Caching becomes critical here.
  • 50K-500K tokens/day, hard reasoning: DeepSeek-R1 (open) or o3 / Opus 4.6 (frontier). Reserve these for the 10-20% of queries that truly need them.

For deeper coverage, see Anthropic Claude API Pricing for the full Anthropic-specific breakdown, Best Open-Source LLMs for model-quality comparisons across Llama, Mistral, Qwen, and DeepSeek, and AI Free Tiers Compared for which providers give you the most before billing kicks in.

Frequently Asked Questions

What is the cheapest LLM API?

As of April 2026, the cheapest LLM APIs are GPT-4.1 Nano, Gemini 2.0 Flash, and Mistral Small, all at $0.10 per million input tokens. Mistral Small edges ahead on output cost at $0.30/1M vs $0.40/1M for the other two. For batch workloads, GPT-4.1 Nano with the 50% batch discount drops to $0.05/$0.20 per million tokens, making it the absolute cheapest option for asynchronous processing.

How much does GPT-4.1 cost per token?

GPT-4.1 costs $2.00 per million input tokens and $8.00 per million output tokens. That works out to $0.000002 per input token and $0.000008 per output token. With the OpenAI Batch API (50% discount), those drop to $1.00/$4.00 per million. With prompt caching (automatic, 50% off cached tokens), repeated system prompts cost $1.00 per million cached input tokens.

How do LLM API prices compare to self-hosting?

Self-hosting open-source models (Llama 3, Mistral, etc.) on your own GPUs costs roughly $1-3 per GPU-hour on cloud providers. At high volume (millions of tokens per day), self-hosting can be 50-80% cheaper than API pricing. At low to moderate volume, APIs are almost always cheaper because you avoid idle GPU costs, infrastructure management, and the engineering overhead of running inference servers. The break-even point is typically around 10-50 million tokens per day, depending on the model size and hardware choice.
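A back-of-the-envelope break-even can be sketched by comparing a fixed daily GPU bill against a per-token API price. Every input here is an assumption for illustration: the $1.50 GPU-hour rate (inside the $1-3 range above), a single always-on GPU, and a blended API price of $0.88 per 1M tokens. The sketch also ignores engineering overhead and whether the hardware can actually sustain the required throughput.

```python
def breakeven_tokens_per_day(gpu_hourly_usd, gpus, api_blended_per_1m):
    """Daily token volume at which always-on GPU cost equals API spend."""
    daily_gpu_cost = gpu_hourly_usd * 24 * gpus
    return daily_gpu_cost / api_blended_per_1m * 1_000_000


def cheaper_option(tokens_per_day, gpu_hourly_usd, gpus, api_blended_per_1m):
    """Return 'self-host' or 'api' for a given daily volume (cost only)."""
    threshold = breakeven_tokens_per_day(gpu_hourly_usd, gpus, api_blended_per_1m)
    return "self-host" if tokens_per_day > threshold else "api"


threshold = breakeven_tokens_per_day(1.50, 1, 0.88)
print(f"break-even: {threshold / 1_000_000:.1f}M tokens/day")
print(cheaper_option(5_000_000, 1.50, 1, 0.88))
print(cheaper_option(80_000_000, 1.50, 1, 0.88))
```

The threshold moves linearly with GPU count and inversely with the API price, which is why cheap hosted open-model APIs push the break-even volume up.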

What's the difference between input and output token pricing?

Input tokens are what you send to the model: your system prompt, user message, uploaded documents, and any context. Output tokens are what the model generates in response. In the master table above, output tokens cost 3-8x as much as input tokens because each output token is generated sequentially with its own forward pass through the model, while input tokens can be processed in parallel. This is why long system prompts with short responses are relatively cheap, while asking a model to write a 5,000-word essay gets expensive fast.
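Because the multiple varies by model, it can help to compute it from the price table instead of relying on a rule of thumb. This is a minimal sketch using a handful of rows from the master table; the helper name is hypothetical.

```python
def output_to_input_ratio(input_per_1m, output_per_1m):
    """How many times more an output token costs than an input token."""
    return output_per_1m / input_per_1m


# (input, output) in USD per 1M tokens, copied from the master table.
PRICES = {
    "GPT-5": (1.25, 10.00),
    "GPT-4.1": (2.00, 8.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Mistral Large": (2.00, 6.00),
    "o3": (10.00, 40.00),
}

ranked = sorted(PRICES.items(), key=lambda item: output_to_input_ratio(*item[1]))
for model, (input_price, output_price) in ranked:
    ratio = output_to_input_ratio(input_price, output_price)
    print(f"{model}: output costs {ratio:.1f}x input")
```

Models with a high ratio penalize generation-heavy workloads the most, so the ratio is worth checking alongside the headline input price.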

Which LLM API has the best free tier?

Google offers the most generous free tier through Google AI Studio: Gemini 2.0 Flash is free up to 15 requests per minute with generous daily limits. OpenAI offers limited free credits for new accounts. Anthropic provides free access through claude.ai but no free API tier. Mistral offers a free tier on La Plateforme with rate limits. For serious development and testing, Google's free Gemini access is the clear winner.

How often do LLM API prices change?

Prices have been trending down 30-50% per year since 2023. Major price drops usually happen when providers release new model generations (the old model gets cheaper or the new model matches performance at lower cost). OpenAI and Google have been the most aggressive on price cuts. Anthropic tends to hold pricing longer but offers batch and caching discounts. Expect at least 2-3 significant pricing changes per provider per year.

Is an open model API really cheaper than a frontier model API in 2026?

Yes, in most volume tiers. Llama 4 Scout on Together AI costs $0.18/$0.59 per million tokens against Claude Sonnet 4.6 at $3/$15. That is 16x cheaper on input, 25x on output. DeepSeek-V3 at $0.27/$1.10 is 7x cheaper than GPT-4.1 at $2/$8. The savings are real for high-volume chat, RAG, classification, and summarization. The catch is that frontier mini-tiers like GPT-4.1 Nano at $0.10/$0.40 are price-competitive with the smaller open models, and Anthropic's 90% caching discount can flip the math on cache-heavy workloads.

Which open model API has the cheapest input tokens in 2026?

Mistral Nemo (12B) on Mistral La Plateforme at $0.15/$0.15 per million tokens, followed by Llama 4 Scout (109B) on Together AI at $0.18 input and $0.59 output. DeepSeek-V3 on DeepInfra is the cheapest for frontier-quality general intelligence at $0.27/$1.10. For reasoning specifically, DeepSeek-R1 on Together AI at $3/$7 is the cheapest credible alternative to o3.

How does DeepSeek-V3 pricing compare to GPT-4.1 and Claude Sonnet 4.6?

DeepSeek-V3 (hosted on DeepInfra) costs $0.27 per million input tokens and $1.10 per million output tokens. GPT-4.1 costs $2/$8. Claude Sonnet 4.6 costs $3/$15. DeepSeek-V3 is approximately 7x cheaper than GPT-4.1 and 11-14x cheaper than Sonnet 4.6 per token. Quality-wise, V3 sits near the top of LiveCodeBench and matches GPT-4.1 on many coding benchmarks. The gap widens on multi-step agents and frontier-grade reasoning, where Claude Sonnet 4.6 and GPT-4.1 still lead.

When does it make sense to use a frontier API instead of an open model API?

Three cases. First, frontier mini-tiers (GPT-4.1 Nano, Gemini 2.0 Flash) at $0.10/$0.40 are price-competitive with small open models and often have stronger instruction following. Second, on caching-heavy workloads Anthropic's 90% discount drops cached Sonnet 4.6 input to $0.30/MTok, so at very high cache hit rates the blended input cost approaches open-model pricing (at a 60% hit rate it is still about $1.38/MTok). Third, hard reasoning, agentic tool use, and frontier coding still favor Claude Sonnet 4.6, Opus 4.6, GPT-4.1, and o3 by a meaningful quality margin. The decision rule: open wins on volume at mid quality, frontier wins on the hardest tasks regardless of cost.

Figure: LLM API Pricing 2026 - 20+ Models Compared Per Token (data visualization and comparison chart). Data verified by PE Collective.
About the Author

Rome Thorndike is the founder of the Prompt Engineer Collective, a community of over 1,300 prompt engineering professionals, and author of The AI News Digest, a weekly newsletter with 2,700+ subscribers. Rome brings hands-on AI/ML experience from Microsoft, where he worked with Dynamics and Azure AI/ML solutions, and later led sales at Datajoy (acquired by Databricks).