30 models tracked · Updated daily · Prices verified against primary sources

Language Models

Language models are the foundation layer of the AI stack. sourc.dev tracks 30 models at launch — covering pricing per token (input and output), context window size, multimodal capability, open weights status, EU data residency, and generational development history. Every price has a source URL. Every change is timestamped. No estimates. No sponsored placements.

input price · output price · context window · open weights · multimodal · EU residency · API rate limits · drift index · release date · version history

The price collapse

The cost of accessing a frontier language model has fallen 97% in five years. That is not a typo, and it is not a projection. It is the measured decline from GPT-3's launch pricing in 2020 to the models available in late 2024.

GPT-3 launched in June 2020 at $60.00 per million input tokens. It was the only commercial large language model API available, and that price reflected monopoly positioning plus genuine infrastructure costs — inference on 175 billion parameters was expensive. For over two years, that price was the market. Then compression started. GPT-3.5 Turbo arrived in March 2023 at $2.00 per million tokens — a 97% reduction from GPT-3 — while delivering substantially better capability. It was the model that powered the original ChatGPT. GPT-4 launched the same month at $30.00, establishing a new tier: the frontier model premium. You could have good-enough for $2.00, or the best-available for $30.00.

Competition arrived in mid-2023. Anthropic launched Claude 2 in July 2023 at $8.00 per million input tokens, positioning below GPT-4 on price while offering a 100,000-token context window — 12x larger than GPT-4's 8,192. OpenAI responded with GPT-4 Turbo in November 2023 at $10.00, cutting their own frontier price by 67% and expanding context to 128,000 tokens. Google entered with Gemini Pro at $0.50 — a sub-dollar price point that signaled infrastructure-scale competition.

The 2024 acceleration was faster still. GPT-4o launched in May 2024 at $5.00. GPT-4o mini followed in July at $0.15 — frontier-adjacent capability for less than the price of a single API call to GPT-3 four years earlier. Then came the DeepSeek moment. DeepSeek V3 arrived in December 2024 at $0.27 per million tokens. DeepSeek R1, a reasoning model released in January 2025, matched OpenAI's o1 on benchmarks while being open weights and priced at a fraction of the cost — and triggered approximately $600 billion in single-day stock market losses when investors absorbed what that efficiency implied. The market interpreted this as evidence that massive compute budgets were not the only path to frontier performance.

The geographic dimension matters. US-China competition in AI is directly benefiting European builders. Every price cut from OpenAI forces a response from DeepSeek, which forces a response from Google, which pushes Mistral to compete. EU developers and companies are the beneficiaries of a subsidy war they did not start — and the input prices they pay reflect it.

Model           Date      Input Price   Output Price
GPT-3           Jun 2020  $60.00        $60.00
GPT-3.5 Turbo   Mar 2023  $2.00         $2.00
GPT-4           Mar 2023  $30.00        $60.00
Claude 2        Jul 2023  $8.00         $24.00
GPT-4 Turbo     Nov 2023  $10.00        $30.00
Gemini Pro      Dec 2023  $0.50         $1.50
GPT-4o          May 2024  $5.00         $15.00
GPT-4o mini     Jul 2024  $0.15         $0.60
DeepSeek V3     Dec 2024  $0.27         $1.10

LLM input pricing 2020–2025 ($ per million tokens)

[Bar chart, logarithmic scale ($0.10–$100): LLM input price per million tokens falls from $60.00 (GPT-3, 2020) to $0.27 (DeepSeek V3, 2024) — a 97% reduction in five years. Source: provider pricing pages. sourc.dev 2026.]

Context window expansion

The context window — the maximum amount of text a model can process in a single request — has grown 244x in four years. GPT-3 launched in 2020 with a 4,096-token window. That is roughly three pages of text. If you wanted to summarize a 20-page document, you could not. The document simply did not fit.

The first major breakthrough came from Anthropic. Claude 2, launched in July 2023, offered a 100,000-token context window — the first model capable of processing an entire book in a single call. This was not an incremental improvement. It was a 12x jump from GPT-4's 8,192 tokens and enabled an entirely new category of applications: long-document analysis, full codebase understanding, and multi-document synthesis. Developers who had been building complex chunking and retrieval pipelines could suddenly pass entire documents directly to the model.

GPT-4 Turbo responded with 128,000 tokens in November 2023. Then Google set the current record in February 2024: Gemini 1.5 Pro at 1,000,000 tokens — approximately 10 full-length novels in a single prompt, and 244x the original GPT-3 window. Claude 3 followed in March 2024 at 200,000 tokens. At 1 million tokens, the constraint shifts from "what fits in context" to "what is it cost-effective to put in context," because you pay for every token in the window.

An honest caveat: larger context windows are not always better. Research shows that model attention quality can degrade in the middle of very long contexts — a phenomenon called "lost in the middle." Cost scales linearly with context length. And most applications do not need 1 million tokens. The practical question is not "what is the maximum?" but "what is the minimum context window that covers 95% of my use cases?" sourc.dev tracks context window size for every model precisely so builders can make that calculation.
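The 95%-coverage calculation is straightforward to automate. A minimal sketch, assuming you have a log of per-request token counts — the sample log and the set of candidate windows below are invented for illustration:

```python
def percentile(values, pct):
    """Value at the given percentile, nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def smallest_sufficient_window(request_tokens,
                               windows=(16_384, 32_768, 128_000, 200_000, 1_000_000)):
    """Pick the smallest context window covering 95% of observed requests."""
    need = percentile(request_tokens, 95)
    for w in sorted(windows):
        if w >= need:
            return w
    return max(windows)

# Hypothetical request log: mostly short prompts, a few long documents
log = [2_000] * 90 + [30_000] * 8 + [150_000] * 2
print(smallest_sufficient_window(log))  # 32768
```

The remaining 5% of oversized requests get chunking or summarization, as the paragraph above suggests.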

Model           Date      Context Window  Approx. Equivalent
GPT-3           Jun 2020  4,096           3 pages
GPT-4           Mar 2023  8,192           6 pages
GPT-3.5 Turbo   Mar 2023  16,384          12 pages
Claude 2        Jul 2023  100,000         1 novel
GPT-4 Turbo     Nov 2023  128,000         1.3 novels
Gemini 1.5 Pro  Feb 2024  1,000,000       10 novels
Claude 3        Mar 2024  200,000         2 novels

Context window size 2020–2024 (tokens)

[Horizontal bar chart: context window growth from GPT-3 at 4,096 tokens (2020, ≈3 pages) to Gemini 1.5 Pro at 1,000,000 tokens (2024, ≈10 novels), a 244x increase. Source: provider documentation. sourc.dev 2026.]

Who is building at scale

Adoption data tells a clear story: LLM usage crossed from early adopter to mainstream in 2024. GitHub Copilot reached 1.8 million paid subscribers by the end of 2023, making it one of the fastest-growing developer tools ever measured. The Stack Overflow 2024 Developer Survey found that 76% of developers are using or planning to use AI tools — up from 70% the prior year. That is not a niche technology.

Enterprise adoption is moving equally fast. McKinsey's 2024 State of AI report found that 65% of organizations are regularly using generative AI, up from 33% in their 2023 survey — a doubling in twelve months. McKinsey called this the fastest technology adoption curve they have documented. The JetBrains 2023 Developer Ecosystem survey found that 55% of developers had used AI coding assistants.

On the consumer side, ChatGPT reached 100 million weekly active users by November 2023, making it one of the most-used software products globally. The open source ecosystem is scaling at a different level: Hugging Face hosts over 900,000 models as of early 2025. The tools built on these models — code assistants, agents, RAG pipelines, content generators — represent the application layer that sourc.dev tracks separately.

Developer AI tool adoption 2023–2024

[Horizontal bar chart of developer AI adoption rates: Stack Overflow 2024, 76%; McKinsey 2024 (orgs), 65%; JetBrains 2023 (AI assistants), 55%; McKinsey 2023 (orgs), 33%. McKinsey: fastest technology adoption curve documented.]

What sourc.dev tracks for every model

Every language model in the sourc.dev directory is tracked across a consistent set of attributes. Each attribute has a dedicated learn page explaining what it measures, why it matters, and how it is collected.

Model families and generational development

Language models are developed in families — successive generations from the same organization, each building on the architecture and training of the previous version. Understanding model families matters because it reveals pricing trajectories, capability improvements, and the competitive dynamics that drive both.

The GPT family (OpenAI). OpenAI reached a $157 billion valuation in October 2024, making it one of the most valuable private companies in the world. The GPT lineage runs from GPT-3 (June 2020, 175B parameters, $60.00/1M tokens) through GPT-3.5 Turbo (March 2023, $2.00) to GPT-4 (March 2023, $30.00), GPT-4 Turbo (November 2023, $10.00), GPT-4o (May 2024, $5.00), and GPT-4o mini (July 2024, $0.15). Each generation has delivered more capability at lower cost. OpenAI also produces the o1 reasoning model family, a separate development branch optimized for multi-step logical tasks.

The Claude family (Anthropic). Anthropic, valued at $61.5 billion in early 2025, was founded by former OpenAI researchers and differentiates on safety research and long-context capability. The Claude lineage runs from Claude 1 (March 2023) through Claude 2 (July 2023, first 100k context window), Claude 3 (March 2024, Haiku/Sonnet/Opus tiers), to Claude Sonnet 4.6 (late 2024). Anthropic pioneered the tiered model approach — offering Haiku (fast and cheap), Sonnet (balanced), and Opus (maximum capability) under a single family.

The Gemini family (Google DeepMind). Google merged its AI research groups into Google DeepMind and launched the Gemini family in December 2023. Gemini Pro offered sub-dollar pricing ($0.50). Gemini 1.5 Pro set the context window record at 1 million tokens. Gemini 2.0 Flash (late 2024) at $0.10 per million input tokens became one of the cheapest frontier-adjacent models available. Google's advantage is infrastructure scale — they own the TPU hardware, the data centers, and the distribution through Google Cloud and Android.

The Llama family (Meta). Meta released the original Llama in February 2023 and Llama 2 as open weights in July 2023, fundamentally altering the market structure. Before Llama, open weights models were significantly behind proprietary models. Llama 2, and subsequently Llama 3 (April 2024) and Llama 3.3 (late 2024), demonstrated that competitive performance could be achieved in open weights form. Meta does not charge for the models directly — their business model uses AI to improve their advertising and social media platforms, and releasing open weights models builds ecosystem and talent.

The market has settled into a durable structural split: hosted API models (OpenAI, Anthropic, Google) versus open weights models (Meta, Mistral, DeepSeek). Hosted APIs offer convenience, managed infrastructure, and frequent updates. Open weights models offer control, data privacy, and the ability to fine-tune for specific tasks. Most production architectures will use both — APIs for complex tasks requiring frontier capability, self-hosted models for high-volume tasks where cost and latency dominate.

Language models and European data sovereignty

The EU AI Act, signed into law in August 2024, is the world's first comprehensive legal framework for artificial intelligence. Combined with GDPR's existing data residency requirements, it creates a regulatory environment that shapes which language models European organizations can use and how. This is not theoretical — procurement teams at European enterprises and public sector organizations are already filtering model choices by EU data residency capability.

Mistral AI, headquartered in Paris, has emerged as the EU champion for language models. Having raised over $1.1 billion, Mistral offers models that process data within European jurisdiction by default. Aleph Alpha, based in Heidelberg, Germany, builds sovereign AI infrastructure specifically for European government and enterprise customers. These are not the only options — Azure OpenAI offers EU data residency through European regions, Google Cloud's Vertex AI supports EU data location constraints, and open weights models can be self-hosted entirely on European cloud providers like OVHcloud, Hetzner, or Scaleway.

The Nordic dimension is worth noting. The Nordics have among the highest developer density per capita globally, strong digital infrastructure, and government procurement frameworks that increasingly require data sovereignty. For Nordic builders, the practical reality is a three-option matrix: EU-native providers (Mistral, Aleph Alpha), US providers with EU data residency options (Azure OpenAI, Google Vertex), or self-hosted open weights models on European infrastructure. sourc.dev tracks EU data residency as a first-class attribute for every model to support exactly this decision.

Models tracked

placeholder — live data coming

GPT-4o

OpenAI

pricing from $5.00/1M tokens

context: 128k tokens

verified 2026-03-24

placeholder — live data coming

Claude Sonnet 4.6

Anthropic

pricing from $3.00/1M tokens

context: 200k tokens

verified 2026-03-24

placeholder — live data coming

Gemini 2.0 Flash

Google

pricing from $0.10/1M tokens

context: 1M tokens

verified 2026-03-24

placeholder — live data coming

Mistral Large

Mistral AI

pricing from $2.00/1M tokens

context: 128k tokens

verified 2026-03-24

placeholder — live data coming

DeepSeek V3

DeepSeek

pricing from $0.27/1M tokens

context: 128k tokens

verified 2026-03-24

placeholder — live data coming

Llama 3.3 70B

Meta

pricing from $0.40/1M tokens

context: 128k tokens

verified 2026-03-24

New to LLM pricing?

Frequently asked questions

What is a language model?

A language model is a statistical system trained on text data to predict and generate sequences of words. Modern large language models (LLMs) like GPT-4o, Claude, and Gemini use transformer architectures with billions of parameters to perform tasks including text generation, summarization, translation, code writing, and reasoning. They accept text input (a prompt) and produce text output (a completion). Language models are accessed through APIs with per-token pricing or deployed locally using open weights. sourc.dev tracks 30 language models across pricing, context windows, and capability dimensions.

What is a token?

A token is the fundamental unit of text that language models process. One token is approximately 0.75 English words, or conversely, one English word averages about 1.3 tokens. The word "hamburger" becomes three tokens (ham-bur-ger). A typical page of English text contains roughly 400-500 tokens. Tokenization varies across models — each provider uses its own tokenizer (GPT models use tiktoken, Claude uses its own BPE tokenizer). All LLM API pricing is denominated in tokens, typically quoted per million tokens for both input and output. Understanding token counts is essential for estimating API costs and working within context window limits.
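Because every provider tokenizes differently, exact counts require the provider's own tokenizer (OpenAI publishes the tiktoken library for its models). For rough cost estimates, the ~1.3 tokens-per-word rule of thumb above is often enough. A minimal sketch of that heuristic:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~1.3 tokens-per-word rule of thumb.
    For billing-accurate counts, use the provider's own tokenizer
    (e.g. tiktoken for GPT models); this is only an approximation."""
    words = len(text.split())
    return round(words * 1.3)

page = "word " * 450  # roughly one page of English text
print(estimate_tokens(page))  # 585
```

Good enough for back-of-envelope budgeting; switch to the real tokenizer before committing to a cost model.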

What is the difference between input and output pricing?

LLM API providers charge separately for input tokens (what you send to the model) and output tokens (what the model generates back). Input tokens include your prompt, system instructions, conversation history, and any documents or context you provide. Output tokens are the model's response. Output tokens are almost always more expensive than input tokens — typically 2x to 4x the input price. This pricing split exists because generating output requires more computation than processing input. For GPT-4o, input costs $5.00 per million tokens while output costs $15.00 per million tokens — a 3x ratio. For Claude Sonnet 4.6, input is $3.00 and output is $15.00 per million tokens — a 5x ratio. When estimating costs, you need to model both sides. A chatbot application with long system prompts will be input-heavy. A content generation application will be output-heavy. The ratio between input and output spending varies significantly by use case, which is why sourc.dev tracks both input price and output price separately.

How do I calculate my monthly API cost?

To calculate monthly API cost, you need three numbers: your monthly input token volume, your monthly output token volume, and the per-million-token rates for each. Here is a worked example using GPT-4o pricing. Suppose your application sends 1 million input tokens and receives 200,000 output tokens per month. GPT-4o charges $5.00 per million input tokens and $15.00 per million output tokens. Input cost: 1,000,000 tokens x ($5.00 / 1,000,000) = $5.00. Output cost: 200,000 tokens x ($15.00 / 1,000,000) = $3.00. Total monthly cost: $5.00 + $3.00 = $8.00. The same workload on Claude Sonnet 4.6 ($3.00 input, $15.00 output): $3.00 + $3.00 = $6.00. On Gemini 2.0 Flash ($0.10 input, $0.40 output): $0.10 + $0.08 = $0.18. On DeepSeek V3 ($0.27 input, $1.10 output): $0.27 + $0.22 = $0.49. These are raw API costs only — they exclude infrastructure, caching, retry overhead, and engineering time. For production workloads, multiply your estimate by 1.2-1.5x to account for retries, prompt iteration, and monitoring overhead. sourc.dev tracks input pricing and output pricing for all 30 models with daily verification.
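The arithmetic above can be wrapped in a few lines. A minimal sketch using the prices quoted in this section — verify current rates against each provider's pricing page before relying on them:

```python
def monthly_cost(input_tokens, output_tokens, input_price, output_price):
    """Raw API cost in dollars; prices are quoted per million tokens."""
    return input_tokens * input_price / 1e6 + output_tokens * output_price / 1e6

# ($/1M input, $/1M output) as listed on this page
models = {
    "GPT-4o":            (5.00, 15.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 2.0 Flash":  (0.10, 0.40),
    "DeepSeek V3":       (0.27, 1.10),
}

# Workload from the worked example: 1M input, 200k output tokens/month
for name, (inp, out) in models.items():
    print(f"{name}: ${monthly_cost(1_000_000, 200_000, inp, out):.2f}")
```

This prints $8.00, $6.00, $0.18, and $0.49 — matching the worked example. Apply the 1.2-1.5x production multiplier on top.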

How have LLM prices changed since 2020?

LLM pricing has dropped 97% in five years. GPT-3 launched in 2020 at $60.00 per million input tokens — the first commercially available large language model API. By early 2023, GPT-3.5 Turbo brought that to $2.00. GPT-4 launched at $30.00 in March 2023, then GPT-4 Turbo reduced that to $10.00 by November 2023. Claude 2 entered at $8.00 in July 2023. Google's Gemini Pro launched at $0.50 in December 2023. GPT-4o arrived at $5.00 in May 2024, then GPT-4o mini at $0.15 in July 2024. DeepSeek V3 launched in December 2024 at $0.27 per million tokens. The pattern is consistent: each generation delivers equivalent or better capability at a fraction of the previous price. This collapse benefits builders — applications that were economically impossible at $60.00 per million tokens become trivial at $0.15.

How have context windows expanded since GPT-3?

Context windows have grown 244x in four years. GPT-3 launched in 2020 with a 4,096-token context window — roughly three pages of text. That meant the model could only process very short documents. GPT-3.5 Turbo extended to 16,384 tokens in 2023. Claude 2 was the first major model to break the 100,000-token barrier in July 2023, enabling practical long-document processing for the first time. GPT-4 Turbo followed with 128,000 tokens. Claude 3 reached 200,000 tokens in March 2024. Gemini 1.5 Pro set the current record at 1,000,000 tokens — approximately 10 full-length novels in a single prompt. Larger context windows enable new application categories: entire-codebase analysis, long-document summarization, and multi-document reasoning. However, larger is not always better — cost scales linearly with context length, and model attention quality can degrade at extreme lengths.

Should I use a proprietary API or self-host?

The choice between a proprietary API (GPT-4o, Claude, Gemini) and self-hosting an open weights model (Llama 3, Mistral, DeepSeek) depends on five factors. Data sensitivity: if data cannot leave your infrastructure, self-hosting eliminates third-party data exposure. Cost at scale: API pricing is simpler at low volume, but at roughly 10 million+ tokens per day, self-hosting on GPU instances often becomes cheaper. Latency requirements: self-hosted models on dedicated GPUs eliminate network round-trips and provider queue times. Capability requirements: as of early 2025, the largest proprietary models (GPT-4o, Claude Sonnet 4.6) still outperform the best open weights models on complex reasoning tasks, though the gap is narrowing. Operational complexity: self-hosting requires GPU procurement, model serving infrastructure, monitoring, and scaling — a minimum of one dedicated ML engineer. The practical decision rule: start with APIs for development speed, measure actual token volumes and latency needs for 30-60 days, then evaluate self-hosting only if you exceed 5-10 million tokens per day or have strict data residency requirements. Many production systems use both — API models for complex tasks, self-hosted models for high-volume simple tasks.
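The tokens-per-day threshold can be sanity-checked with a one-line break-even calculation. A sketch under illustrative assumptions — the blended API price and the flat daily GPU cost below are invented numbers, not quotes:

```python
def breakeven_tokens_per_day(api_price_per_m, gpu_cost_per_day):
    """Daily token volume at which a dedicated GPU matches API spend.
    Assumes a blended API price per million tokens and a flat daily
    GPU cost (instance + ops overhead); both inputs are assumptions
    you must replace with your own measured figures."""
    return gpu_cost_per_day / api_price_per_m * 1_000_000

# Illustrative: $2.00/M blended API price vs a $60/day GPU instance
print(f"{breakeven_tokens_per_day(2.00, 60):,.0f} tokens/day")  # 30,000,000
```

Below the break-even volume the API is cheaper; above it, self-hosting starts to pay for itself — before counting the engineering time the answer above warns about.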

What is the difference between open weights and open source?

Open weights and open source are different things, though they are frequently conflated. Open weights means the trained model parameters are publicly downloadable — you can run the model on your own hardware. Llama 3, Mistral, and DeepSeek V3 are open weights models. However, open weights alone does not mean open source. True open source, by the Open Source Initiative definition, requires releasing the training data, training code, and model weights under an OSI-approved license with no usage restrictions. Think of it like cooking: open weights gives you the finished dish to reheat at home, but not the recipe or ingredient sourcing. Open source gives you the complete recipe, the ingredient list, and permission to open your own restaurant. Most "open" models use custom licenses with restrictions — Meta's Llama license prohibits use by companies with over 700 million monthly active users. Mistral uses Apache 2.0 for some models, which is a genuine open source license. The distinction matters for procurement, legal compliance, and long-term vendor risk.

What context window size do I actually need?

The context window you need depends on your use case. Here are practical token estimates. A single customer support message: 100-300 tokens. A one-page document: 400-500 tokens. A full conversation history (20 turns): 2,000-5,000 tokens. A 10-page research paper: 4,000-6,000 tokens. A full software file (500 lines): 3,000-5,000 tokens. An entire small codebase (50 files): 100,000-200,000 tokens. A book-length document: 80,000-120,000 tokens. For chatbot applications, 16,000-32,000 tokens handles most conversations. For document analysis, you need at least 32,000 tokens. For codebase-level work, 128,000+ tokens. For multi-document research or book-length analysis, 200,000+ tokens. Remember: you pay for every token in your context window, not just the new input. If you stuff 100,000 tokens of context into every API call, your costs multiply accordingly. The practical approach: choose the smallest context window that covers 95% of your use cases, and handle the remaining 5% with chunking or summarization strategies.
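The "you pay for every token in the window" point is worth quantifying. A minimal sketch using GPT-4o's $5.00/M input price from this page, assuming no prompt-caching discounts:

```python
def context_cost_per_call(context_tokens, input_price_per_m):
    """Input-side cost of carrying a context of this size on one call."""
    return context_tokens * input_price_per_m / 1e6

# Stuffing 100k tokens of context into every GPT-4o call ($5.00/M input)
per_call = context_cost_per_call(100_000, 5.00)
print(f"${per_call:.2f} per call")                  # $0.50 per call
print(f"${per_call * 10_000:,.2f} per 10k calls")   # $5,000.00 per 10k calls
```

Fifty cents per request is invisible in a demo and ruinous at scale — which is why the minimum-sufficient-window approach matters.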

What is a reasoning model?

A reasoning model is a language model specifically trained to perform multi-step logical reasoning before producing a final answer. Unlike standard models that generate responses in a single pass, reasoning models produce an internal chain-of-thought — breaking complex problems into intermediate steps. OpenAI's o1 (September 2024) was the first major reasoning model, showing significant improvements on mathematics, coding, and scientific reasoning benchmarks. DeepSeek R1 (January 2025) demonstrated that reasoning capabilities could be achieved at dramatically lower cost — it is an open weights reasoning model that matched o1 performance on several benchmarks. Reasoning models typically cost more per token and take longer to respond because they generate many internal reasoning tokens. They excel at tasks requiring logical deduction, mathematical proof, code debugging, and complex planning. They are less suited for simple text generation, creative writing, or tasks where speed matters more than accuracy.

How many developers use LLM APIs?

Developer adoption of LLM tools has grown faster than any previous technology wave. GitHub Copilot reached 1.8 million paid subscribers by the end of 2023 — just two years after launch. The Stack Overflow 2024 Developer Survey found that 76% of developers are using or planning to use AI tools in their workflow. McKinsey's 2024 State of AI report found that 65% of organizations are regularly using generative AI, up from 33% in their 2023 survey — effectively doubling in one year. JetBrains' 2023 developer survey found that 55% of developers had used AI code assistants. ChatGPT reached 100 million weekly active users by November 2023. Hugging Face hosts over 900,000 models. McKinsey called this the fastest technology adoption curve they have documented. The adoption is not uniform — it is concentrated in software development, content creation, and data analysis, with slower uptake in regulated industries like healthcare and finance.

Which companies lead LLM development?

As of early 2025, LLM development is concentrated among six organizations. OpenAI (GPT family) reached a $157 billion valuation in October 2024 and remains the market leader by API revenue and brand recognition. Anthropic (Claude family) reached a $61.5 billion valuation, differentiating on safety research and long-context capability. Google DeepMind (Gemini family) has the advantage of Google's infrastructure and data assets, plus the largest context window at 1 million tokens. Meta (Llama family) released Llama 2 as open weights in July 2023, fundamentally changing the market by giving away competitive models. Mistral AI, based in Paris, has raised over $1.1 billion and serves as the leading European LLM developer. DeepSeek, a Chinese AI lab, disrupted the market in late 2024 and early 2025 with models matching frontier performance at a fraction of the cost. The market is split between proprietary API providers (OpenAI, Anthropic, Google) and open weights providers (Meta, Mistral, DeepSeek).

What is model drift?

Model drift is the phenomenon where a language model's behavior changes over time without any change to your code or prompts. Think of it like a scale that slowly loses calibration — you are weighing the same items but getting different results. Drift occurs because providers update, fine-tune, or replace model versions behind the same API endpoint. A prompt that worked reliably in January may produce different outputs in March. Stanford researchers documented measurable drift in GPT-4 and GPT-3.5 over a three-month period in 2023, with accuracy on certain tasks changing by 10-20 percentage points. Drift matters for production applications that depend on consistent model behavior. sourc.dev tracks drift index as a first-class attribute — documenting version changes, behavioral shifts, and provider update announcements for each model.
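One common mitigation is a canary evaluation: run a fixed prompt set on a schedule, record accuracy, and alert when it moves beyond a tolerance. A minimal sketch — the threshold and accuracy figures are illustrative choices, not a standard:

```python
def drift_alert(baseline_accuracy, current_accuracy, threshold=0.05):
    """Flag drift when accuracy on a fixed canary prompt set moves more
    than `threshold` (absolute) from the recorded baseline. The 5-point
    default tolerance is an arbitrary illustration — tune it per task."""
    return abs(current_accuracy - baseline_accuracy) > threshold

# Hypothetical canary results: 92% at baseline, 78% this week
print(drift_alert(0.92, 0.78))  # True
print(drift_alert(0.92, 0.90))  # False
```

Pinning a dated model version (where the provider offers one) plus a canary set like this catches most silent endpoint swaps before users do.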

Large model vs small model — when to use which?

Large models (GPT-4o, Claude Sonnet 4.6, Gemini 1.5 Pro) and small models (GPT-4o mini, Gemini 2.0 Flash, Llama 3.1 8B) serve different purposes, and the optimal architecture often uses both. Large models excel at complex reasoning, nuanced writing, multi-step problem solving, and tasks requiring broad world knowledge. They cost more per token ($3.00-$5.00 per million input tokens) and respond more slowly. Small models excel at classification, extraction, simple summarization, routing, and high-volume tasks where speed and cost matter more than reasoning depth. They cost 10-50x less ($0.10-$0.40 per million input tokens) and respond faster. The practical pattern in production is a routing architecture: a small, fast model handles 80% of requests (simple queries, classification, extraction), and routes complex requests to a large model for the remaining 20%. This can reduce costs by 60-80% compared to using a large model for everything. Decision rule: if a task can be solved with a clear prompt and structured output, use a small model. If it requires multi-step reasoning, ambiguity handling, or creative generation, use a large model. Test both — small models are better than most developers expect.
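The routing pattern can be sketched in a few lines. This toy version uses a keyword rule purely for illustration — production routers typically use a trained classifier or a cheap model as the router itself, and the model names in the comments are just examples:

```python
def route(prompt: str) -> str:
    """Toy router: send obviously simple, short requests to a small model
    and everything else to a large one. The keyword list is an invented
    illustration, not a recommended heuristic."""
    simple_markers = ("classify", "extract", "translate", "summarize briefly")
    if len(prompt) < 500 and any(m in prompt.lower() for m in simple_markers):
        return "small-model"   # e.g. GPT-4o mini / Gemini 2.0 Flash
    return "large-model"       # e.g. GPT-4o / Claude Sonnet 4.6

print(route("Classify this ticket as bug or feature request"))  # small-model
print(route("Design a migration plan for our billing system"))  # large-model
```

Even this crude split captures the economics: every request the router keeps on the small tier costs 10-50x less.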

What is multimodal?

Multimodal refers to language models that can process and generate multiple types of media — not just text. GPT-4o, Claude Sonnet 4.6, and Gemini 1.5 Pro can all accept images as input alongside text, enabling tasks like image description, chart reading, document OCR, and visual question answering. Gemini models additionally support video and audio input. Some systems can generate images (DALL-E 3 via ChatGPT) or speech (GPT-4o voice mode). Multimodal capability matters for applications that need to process real-world documents (PDFs, screenshots, photographs) rather than just clean text. Pricing for multimodal input varies — image tokens are typically more expensive than text tokens.

Which LLMs offer EU data residency?

For organizations subject to GDPR or EU data residency requirements, several options exist. Mistral AI, headquartered in Paris, processes data within EU jurisdiction by default — their La Plateforme API operates from European data centers. Azure OpenAI Service offers EU data residency through Azure's European regions (Netherlands, France, Sweden, Germany). Google Cloud's Vertex AI for Gemini models can be configured with EU data location constraints. Anthropic offers EU processing through AWS European regions for enterprise customers. For maximum control, open weights models (Llama 3, Mistral, DeepSeek) can be self-hosted entirely within EU infrastructure on European cloud providers like OVHcloud, Hetzner, or Scaleway. German company Aleph Alpha builds sovereign AI infrastructure specifically for European government and enterprise customers. sourc.dev tracks EU data residency as a first-class attribute for every model.

What is DeepSeek and why did it matter?

DeepSeek is a Chinese AI research lab that released two models that reshaped the LLM market. DeepSeek V3 launched in December 2024 as an open weights model with performance competitive with GPT-4o and Claude Sonnet 4.6 — but priced at $0.27 per million input tokens, roughly 10-20x cheaper than Western equivalents. DeepSeek R1, a reasoning model, followed in January 2025, matching OpenAI's o1 on several benchmarks while being open weights and dramatically cheaper. The market reaction was immediate: on January 27, 2025, Nvidia lost nearly $600 billion in market capitalization — the largest single-day market cap loss for any company in history at that time — and AI-related stocks sold off sharply across the board. DeepSeek mattered for three reasons. First, it demonstrated that frontier-level AI performance did not require the massive compute budgets assumed by Western labs. Second, as open weights models, DeepSeek V3 and R1 could be self-hosted by anyone, anywhere. Third, it introduced a US-China competitive dynamic that benefits global builders through lower prices and more options.

What is the difference between GPT-4o and Claude?

GPT-4o (OpenAI) and Claude Sonnet 4.6 (Anthropic) are both frontier language models, but they differ in design philosophy, pricing, and capabilities. GPT-4o processes text, images, and audio natively, with a 128,000-token context window. It is priced at $5.00 per million input tokens and $15.00 per million output tokens. It has the largest third-party ecosystem (plugins, integrations, fine-tuning). Claude Sonnet 4.6 focuses on safety, instruction following, and long-context performance. It has a 200,000-token context window — 56% larger than GPT-4o. It is priced at $3.00 per million input tokens and $15.00 per million output tokens — cheaper on input. Anthropic emphasizes Constitutional AI and interpretability research. In independent benchmarks, the models trade leads depending on the task category. GPT-4o tends to score higher on coding and multimodal tasks. Claude tends to score higher on long-document analysis and instruction following. Both are updated frequently. sourc.dev tracks both models with daily price verification and does not rank one above the other.

What is the EU AI Act?

The EU AI Act is the world's first comprehensive legal framework for artificial intelligence, signed into law in August 2024. It classifies AI systems by risk level: unacceptable risk (banned), high risk (strict requirements), limited risk (transparency obligations), and minimal risk (no restrictions). For LLM providers, the Act introduces specific obligations for "general-purpose AI models" (GPAI). Providers of GPAI models must publish training data summaries, comply with EU copyright law, and implement technical documentation requirements. Models classified as presenting "systemic risk" (trained with compute exceeding 10^25 FLOPs) face additional obligations including adversarial testing, incident reporting, and cybersecurity measures. The Act's provisions take effect in phases: prohibited practices from February 2025, GPAI rules from August 2025, and high-risk system requirements from August 2026. For European builders, the practical impact is threefold: increased documentation requirements from LLM providers, stronger incentives to use EU-based providers like Mistral, and new compliance obligations for applications built on top of LLMs in high-risk categories (healthcare, finance, employment).

Is AI going to replace software developers?

This question generates strong opinions. Here is what the data shows from both sides. The case for significant displacement: GitHub Copilot studies show 55% faster task completion. Cognition AI's Devin (2024) demonstrated autonomous multi-step coding. Google reported that 25% of new code at Google is now AI-generated (October 2024). Reasoning models like o1 and DeepSeek R1 can solve complex programming problems that would challenge mid-level developers. Stack Overflow traffic dropped roughly 50% after ChatGPT launched, suggesting developers are shifting where they seek answers.

The case against replacement: software development is more than writing code — it involves understanding requirements, system design, debugging distributed systems, navigating organizational politics, and making tradeoffs that require human judgment. AI-generated code requires human review, testing, and integration. Companies that adopted AI coding tools report they need fewer junior developers but more senior developers to review and architect. The historical pattern with every previous automation technology (compilers, IDEs, frameworks, cloud services) has been that developer productivity increased but total demand for developers also increased — because lower costs expanded the market for software.

The most likely outcome based on current trajectory: AI will change what developers do (less boilerplate, more review and architecture) rather than eliminate the role. But the skill floor will rise. The Stack Overflow 2024 survey found 76% of developers are already using or planning to use AI tools — adaptation is happening whether individual developers choose it or not.

Submit a model or correction →