AI & LLM Glossary

55 key concepts in AI and language modeling

agents

Agent

An agent is a harness that has been empowered with three things: a **role** (who it is — system prompt, persona, domain expertise), a **mission** (what it is supposed to accomplish — a task, goal, or recurring responsibility), and a **scope** (what it is allowed to touch — which tools, repos, files, and permissions). Once configured, the agent is pointed at work and runs its agentic loop until the mission is done or it hits a blocker. Concrete examples from Data Advantage's own setup: our Founding Engineer agent, CTO agent, DevOps Engineer, and CPO are all agents — each is the same underlying harness (Claude Code plus the Paperclip execution layer) configured with a distinct AGENTS.md, a task queue, and a permissions profile. **A note on industry usage.** Most sources call any agentic coding tool an "agent" — including unconfigured harnesses like Cursor or Claude Code out of the box. We use the term more narrowly: an agent is a *configured, pointed* harness. The binary by itself is a harness; the binary plus `role + mission + scope` is an agent. Same runtime, different layer in the stack. Under the industry's looser usage, a "coding agent" is really "an agentic harness." This precision matters when designing agentic systems: choosing a harness and configuring agents are two separate decisions, often made by different people. See the [harness](/concept/harness) concept for the runtime distinction. For common agent archetypes (autonomous SWE, pair programmer, research assistant, etc.), see our [AI Agents vs Harnesses](https://vibereference.com/ai-development/agents-vs-harnesses) explainer on VibeReference.

Agentic loop

The agentic loop is the repeating cycle at the heart of every autonomous AI system: **model → tool call → result → model → tool call → result → …** until the model produces a final answer or the harness stops it. On each iteration, the harness sends the model its current context, the model responds either with a tool call or a final message, the harness executes any tool call and feeds the result back into context, and the loop continues. This loop is what distinguishes agentic systems from simple chat. A chat completion is one turn — prompt in, response out. An agentic loop lets the model gather evidence, act, observe the effect, and revise its plan, turn after turn. It is how an AI actually *does* things — edits files, runs tests, searches, deploys — rather than just describing what it would do. Key design questions around the loop include: how many iterations to allow before escalating, how to compact context as history grows, how to detect when the model is stuck in a tool-call loop, how to handle errors from tool results, and when to hand control back to a human. Most harnesses expose configuration for all of these. The loop is executed by the [harness](/concept/harness); the steps taken within the loop depend on the [agent](/concept/agent)'s role, mission, and scope; and the quality of each step depends on the [model](/concept/model) and the [context](/concept/context) it receives.

Context

Context is everything the model sees on a given turn: the full set of tokens loaded into its context window before it produces its next output. That typically includes the system prompt (role and instructions), the conversation history so far, any tool call results from prior turns, attached files or code snippets, retrieved documents, and explicit memory the harness has surfaced. Context is the only place a model can look for information — it has no hidden state and no memory between API calls. Whatever is in context on this turn is what the model "knows" right now. Everything else, from yesterday's conversation to the contents of a file on disk, must be reloaded into context for the model to use it. Because context windows are finite — typically 128K–2M tokens today, depending on the model — and every token costs money and attention, context management is one of the hardest problems in agentic systems. Harnesses spend significant effort deciding what to include, what to summarize, when to compact older turns, and how to surface the right files, tool results, and memories without drowning the model in irrelevant text. Good context engineering — clear system prompts, tight tool results, selective history, and retrieval that surfaces only what's needed — often matters more than raw model capability. A frontier model with bad context will underperform a weaker model given the right information. See also: [context window](/concept/context-window), [system prompt](/concept/system-prompt), [RAG](/concept/rag).

Harness

A harness is the runtime scaffolding that wires a model, its tools, and its context into a working agentic loop. It is the software that actually runs the model-to-tool-to-result cycle: handling API calls to the model, executing tool calls the model emits, feeding results back, managing the context window, and enforcing sandboxing and permissions. Well-known harnesses include Claude Code, Cursor, Cowork, Cline, Aider, Windsurf, and Devin. A freshly installed harness is unconfigured — a binary or IDE extension with no role, no mission, and no active task. It becomes useful only when someone points it at work with a specific prompt, system prompt, or AGENTS.md-style configuration. At that point the harness is operating as an agent, but the harness itself is the runtime, not the task. **A note on industry usage.** Many articles and vendors use "harness" and "agent" interchangeably, or skip "harness" entirely and call every agentic tool an "agent." On this site we reserve **harness** for the unconfigured runtime and **agent** for a harness that has been given a role, mission, and scope. The distinction matters because a single harness (e.g., Claude Code) can host dozens of different agents just by swapping AGENTS.md files — same binary, different configuration state. See the [agents directory](/agents) for a catalog of products in this category, and the [agent](/concept/agent) concept for how harnesses get pointed at specific work.

Tools

Tools are the discrete capabilities a model can invoke to act on the world. A model on its own can only emit text; tools turn text emissions into real effects — reading files, running shell commands, editing code, searching the web, querying databases, calling APIs. Common examples in coding agents include `bash`, `read`, `edit`, `grep`, `web_search`, `web_fetch`, and any MCP server exposed to the runtime. Mechanically, tool use works via function calling. The harness advertises a set of tool schemas (name, description, input parameters) in the model's context. When the model decides a tool is needed, it emits a structured tool call; the harness executes it, captures the result, and feeds that result back as a new turn. The model then either calls another tool or produces a final response. This repeating cycle is the agentic loop. Tools are where agents meet reality. Their design determines how capable, safe, and efficient an agent can be: too few tools and the model is blind; too many and its context fills with schemas it will never use. Permission scoping on tools — which tools a given agent may call, against which resources — is one of the main safety levers in any agentic system. Tools can be built in to a harness, loaded from local plugins, or exposed over [Model Context Protocol (MCP)](/concept/mcp) by external servers, which is what makes the MCP ecosystem important for composable agents.

architecture

Attention mechanism

The attention mechanism allows large language models to weigh the importance of different tokens in a sequence relative to each other when processing input, enabling focus on relevant context regardless of position. It is core to transformer architectures, powering parallel computation and long-range dependencies.

behavior

Hallucination

Hallucination refers to large language models generating plausible but factually incorrect or fabricated information, often confidently presented as true. It arises from gaps in training data or overgeneralization and is mitigated by techniques like RAG or grounding.

capability

Context window

The context window is the maximum number of tokens a large language model can consider at once for input and output during inference, limiting the amount of information it can process in a single pass. Larger windows enable handling longer conversations or documents but increase computational demands.

Tool use / Function calling

Tool use or function calling enables large language models to invoke external APIs, tools, or functions (e.g., calculators, search engines) by generating structured calls based on user queries, extending capabilities beyond internal knowledge. It allows dynamic integration of real-time data or computations into responses.

foundation

Model

A model is the large language model itself — the trained neural-network weights that turn input tokens into output tokens. It is the raw computational substrate of every AI system: a frozen artifact produced by pretraining on trillions of tokens, often further shaped by instruction tuning and alignment. Examples include Claude Opus 4.7 and Sonnet 4.6 from Anthropic, GPT-5 from OpenAI, and Google's Gemini family. On its own a model is stateless and narrow. Given input tokens it predicts the next token distribution — that is the whole job. It has no memory between API calls, no filesystem, no ability to browse the web, no "session." Everything else users associate with AI — conversation history, tool access, long-horizon behavior, retrieval — is supplied by the layers wrapping it. When choosing a model you typically trade off capability (frontier reasoning, coding, math), context length, latency and price per million tokens, and modality support (text, vision, audio, video). Post-training recipe, training cutoff, and parameter count all shape real-world behavior even when the architecture looks similar. In our five-layer mental model — **model → tools → context → harness → agent** — the model is the primitive that every other layer wraps. Swapping the underlying model (say, Sonnet 4.6 for Opus 4.7 inside the same harness) is one of the most common agentic-system upgrades. As a storefront-level anchor, [**$1.10 / $4.40 per 1M tokens**](/model/o4-mini/openai-api) is the May 2026 standard-rate pair we track on the canonical **OpenAI API** routing row for **o4-mini** (always re-check routing before estimating spend). See the [models directory](/models) for the full catalog with pricing, context windows, and benchmark results.

inference_optimization

Speculative decoding

Speculative decoding accelerates large language model inference by using a small draft model to generate candidate tokens quickly, which a larger verify model checks in parallel, accepting correct ones to reduce latency. It trades minimal accuracy loss for significant speedups in autoregressive generation.

learning_paradigm

Few-shot learning

Few-shot learning enables large language models to perform tasks effectively using only a small number of labeled examples (typically 1-10) provided in the prompt, relying on in-context learning without parameter updates. It bridges the gap between zero-shot and fine-tuning by demonstrating patterns through examples.

Zero-shot learning

Zero-shot learning allows large language models to perform tasks without any task-specific examples in the prompt, relying solely on instructions and pre-trained knowledge to generalize to unseen tasks. It tests the model's ability to understand and apply concepts from training data alone.

optimization

Pruning

Pruning removes less important weights or neurons from a trained neural network, reducing model size and computation while aiming to preserve accuracy. It creates sparse models that are faster and more efficient for deployment.

preprocessing

Tokenization

Tokenization is the process of breaking down input text into smaller units called tokens (e.g., words, subwords, or characters) that the model can process numerically. It is the first step in preparing data for LLMs, affecting vocabulary size and sequence length.

prompting_technique

Chain of Thought

Chain of Thought (CoT) is a prompting technique that instructs large language models to generate intermediate step-by-step reasoning before providing a final answer, mimicking human-like deliberation to improve performance on complex reasoning tasks. It enhances transparency and accuracy but can sometimes lead to unfaithful explanations or performance drops on certain tasks.

System prompt

A system prompt is a special instruction provided at the start of a conversation to define the AI's role, behavior, tone, or constraints, guiding responses across interactions without being part of the user-facing history. It shapes the model's persona and ensures consistent adherence to guidelines.

protocol

MCP (Model Context Protocol)

MCP — the **Model Context Protocol** — is an open standard for exposing tools, resources, and prompts to agentic harnesses. Introduced by Anthropic in late 2024 and adopted broadly across the ecosystem, MCP gives any harness (Claude Code, Cursor, Cline, Windsurf, and others) a consistent way to discover and call external capabilities provided by independent servers. An MCP server advertises three kinds of surface: **tools** (callable functions the model can invoke), **resources** (readable data the harness can attach to context — files, records, documents), and **prompts** (reusable prompt templates). A harness acts as the MCP client, connecting to one or more servers over stdio or HTTP and surfacing their capabilities to the model as regular tool calls. The win is composability. Before MCP, every harness had to ship its own integrations for GitHub, Linear, Postgres, Slack, filesystem, browser automation, and so on. With MCP, an organization can stand up a single MCP server for each internal system — or pull one off the shelf — and every compliant harness can use it without custom code. It is roughly analogous to LSP (Language Server Protocol) for IDEs: a standard plug for a fragmented market. For agentic systems, MCP is the cleanest way to extend an [agent](/concept/agent)'s scope without modifying the [harness](/concept/harness). It is also how many products compose across vendors — a Cursor user can use the same MCP server a teammate uses from Claude Code.

representation

Embedding

Embedding converts discrete tokens or words into dense, continuous vector representations in a high-dimensional space, capturing semantic and syntactic relationships. These vectors enable models to perform mathematical operations on language, forming the basis for downstream processing.

technique

Grounding

Grounding in LLMs involves anchoring model outputs to verifiable external sources, such as retrieved documents or real-time data, to ensure factual accuracy and reduce hallucinations. It provides a foundation for reliable generation by linking responses to evidence.

RAG (Retrieval Augmented Generation)

Retrieval Augmented Generation is a technique that enhances large language models by retrieving relevant external documents or data and incorporating them into the prompt before generation, improving factual accuracy and reducing hallucinations. It combines retrieval systems with generative models to ground responses in real-world knowledge.

training_technique

Distillation

Distillation transfers knowledge from a large, complex teacher model to a smaller student model by training the student to mimic the teacher's outputs or intermediate representations, creating efficient deployable versions. It reduces model size and inference cost while retaining much of the performance.

Other

Activation-aware Weight Quantization

AWQ

AWQ (Activation-aware Weight Quantization) quantizes LLM weights based on activation statistics, preserving the most important weights for model performance. It achieves higher accuracy than uniform quantization at lower bit-widths by adapting quantization granularity to activation patterns.

alignment

Alignment ensures LLMs produce outputs matching human values, preferences, and safety constraints through techniques like RLHF, DPO, or constitutional AI. It addresses the gap between raw predictive power and deployable utility by iteratively refining behaviors via feedback, reducing harms like bias.

Base Model

A base model refers to the core pretrained neural network, such as a Transformer architecture, before any task-specific fine-tuning or alignment. It provides the raw capabilities that are adapted for specialized uses, forming the starting point for instruct or chat variants.

Chat Model

A chat model is an LLM variant fine-tuned for conversational interactions, generating coherent, context-aware responses in dialogue formats like ChatGPT. It leverages autoregressive generation and techniques like RLHF for safer, more engaging conversations in user-facing AI assistants.

chat tuning

Chat tuning involves fine-tuning LLMs on conversational datasets to optimize multi-turn dialogue capabilities, focusing on coherence, context retention, and natural response generation. It refines conversational flow by emphasizing turn-taking, persona consistency, and engagement, often as a precursor to preference-based alignment.

context length

Context length denotes the maximum number of tokens an LLM can process in a single input sequence, limiting the span of prior text it can attend to for generating coherent outputs. Extensions via architectural innovations like rotary embeddings increase this capacity, crucial for long-document tasks.

direct preference optimization

DPO

DPO is a lightweight alignment technique that fine-tunes LLMs directly on pairwise preference data (preferred vs. rejected responses) without a separate reward model or reinforcement learning. It optimizes the policy by maximizing the log-ratio of probabilities between chosen and rejected outputs relative to a reference model.

fine-tune

Fine-tuning is the process of further training a pretrained model on targeted datasets to specialize it for specific tasks or behaviors, adjusting weights minimally. It leverages transfer learning for customization without full retraining, creating instruct/chat variants from base models and boosting capabilities.

Foundation Model

A foundation model is a large-scale pretrained model, typically Transformer-based, serving as a versatile base for various downstream tasks through adaptation like fine-tuning. It enables broad AI applications with minimal task-specific training, reducing development costs and supporting emergent abilities in reasoning and instruction-following.

generative

Generative, in AI, describes models that create new content like text or images from learned distributions, often autoregressively in LLMs by sampling next tokens. It enables applications in creativity, simulation, and data augmentation beyond mere classification, supporting open-ended tasks like story-writing and code generation.

Generative Pre-trained Transformer Quantization

GPTQ

GPTQ (Generative Pre-trained Transformer Quantization) is a post-training quantization method that reduces model precision to 4-8 bits while maintaining performance through careful rounding and calibration. It enables efficient deployment of large models on consumer hardware with minimal accuracy loss.

GPT

GPT (Generative Pretrained Transformer) is a decoder-only Transformer model series pretrained on internet-scale data for autoregressive text generation, pioneering LLMs like GPT-3 and GPT-4. It exemplifies foundation models adapted via fine-tuning for chat and instruction tasks, demonstrating scaling benefits and few-shot learning.

GPT-Generated Machine Learning

GGML

GGML (GPT-Generated Machine Learning) is a C library for machine learning that provides efficient implementations of quantized tensor operations and model inference. It powers local LLM deployment by optimizing performance on commodity hardware.

GPT-Generated Unified Format

GGUF

GGUF (GPT-Generated Unified Format) is a quantized model file format for efficient storage and loading of LLMs, providing a unified structure for different quantization schemes and metadata. It enables seamless model distribution and loading across different hardware configurations.

Instruct Model

An instruct model is a fine-tuned LLM optimized to follow user instructions accurately, often via supervised fine-tuning on instruction-response pairs. It enhances alignment with human intent, bridging general capabilities to directive use for practical deployment in interactive applications.

instruction following

Instruction following refers to an LLM's capability to comprehend and execute user-provided instructions accurately, generating responses that adhere to specified tasks, formats, or constraints. This emergent ability arises from training on datasets with explicit prompt-response pairs, enabling generalization across diverse directives.

instruction tuning

Instruction tuning is a supervised fine-tuning process where an LLM is trained on datasets of instructions paired with desired responses to enhance its ability to follow user directives. It bridges pre-training and advanced alignment by teaching the model to interpret natural language prompts as actionable commands.

language model

A language model is a probabilistic system that generates or predicts text sequences, historically evolving from statistical methods to neural architectures. It integrates neural networks, scale, and Transformer designs for advanced autoregressive text modeling and enables tasks from completion to understanding.

large language model

LLM

A large language model (LLM) is a Transformer-based neural network with billions of parameters, trained on vast datasets for next-token prediction, enabling broad NLP tasks and emergent abilities. LLMs dominate AI due to their human-like text generation and adaptability across domains.

Low-Rank Adaptation

LoRA

LoRA (Low-Rank Adaptation) fine-tunes LLMs by freezing pre-trained weights and injecting trainable low-rank matrices into weight updates, approximating full fine-tuning with far fewer parameters. It decomposes delta weights as low-rank matrices where rank r is much smaller than dimensions, enabling efficient task adaptation.

model merging

Model merging combines multiple fine-tuned LLMs into a single model by averaging weights, resolving conflicts via task arithmetic or projection methods to preserve capabilities. Techniques like TIES or SLERP mitigate interference, enabling efficient knowledge fusion without retraining.

multimodal

Multimodal refers to LLMs extended to process and generate across multiple data modalities, such as text paired with images, audio, or video, via unified tokenization and cross-attention mechanisms. This enables tasks like visual question answering or captioning, integrating modality-specific encoders into the core transformer.

Parameter-Efficient Fine-Tuning

PEFT

PEFT (Parameter-Efficient Fine-Tuning) encompasses methods to adapt pre-trained LLMs for specific tasks by updating a small subset of parameters, avoiding full fine-tuning's compute costs. Approaches like adapters or prompt tuning inject task-specific modules, enabling scalability to massive models.

pretrained

Pretraining is the initial training phase where a model learns general representations from unlabeled data via self-supervision, like masked or next-token prediction. It transfers knowledge efficiently to downstream tasks, minimizing labeled data needs and unlocking scaling and emergent skills before fine-tuning.

Pretrained Model

A pretrained model is a language model trained on massive text corpora via self-supervised tasks like next-token prediction, acquiring broad knowledge without task-specific labels. It bootstraps capabilities like in-context learning, enabling efficient adaptation to new tasks and exceeding billions of parameters when scaled.

proximal policy optimization

PPO

PPO is an on-policy reinforcement learning algorithm used in RLHF to update the LLM policy model by maximizing a clipped surrogate objective, ensuring stable training through trust-region constraints. It balances reward maximization with KL-divergence penalties to prevent large policy shifts.

quantization

Quantization reduces LLM precision by mapping high-bit weights and activations (e.g., FP16) to lower-bit representations (e.g., INT8 or INT4), minimizing memory footprint and inference latency. Techniques like post-training quantization preserve accuracy by calibrating rounding errors, enabling deployment on resource-constrained hardware.

Quantized Low-Rank Adaptation

QLoRA

QLoRA extends LoRA by combining 4-bit quantization (via NF4 and double quantization) with paged optimizers to fine-tune billion-parameter LLMs on consumer GPUs. It maintains performance parity with 16-bit full tuning while dramatically reducing memory requirements.

reinforcement learning from human feedback

RLHF

RLHF aligns LLMs with human preferences through a multi-stage process: training a reward model on ranked response pairs, then using reinforcement learning to optimize the policy model against this reward. Typically employing PPO, it maximizes expected reward while constraining deviation from a reference model.

Rotary Position Embedding

RoPE

RoPE (Rotary Position Embedding) is a positional encoding method that encodes absolute positions using rotation matrices, enabling efficient relative position representation in Transformers. It improves generalization to longer sequences and has become the standard in modern LLMs like Llama and GPT variants.

supervised fine-tuning

SFT

Supervised fine-tuning (SFT) adapts a pretrained model on labeled instruction-response pairs to improve task-specific performance, like following directives. It aligns general models to user needs with minimal data and precedes RLHF, enhancing instruction adherence and reducing hallucinations.

transformer

A Transformer is a neural architecture using self-attention mechanisms to process sequences in parallel, revolutionizing NLP by handling long-range dependencies efficiently. Transformers form the backbone of LLMs, with decoder-only variants dominating generative tasks and enabling scalable training on huge datasets.

transformer architecture

The Transformer architecture consists of encoder and/or decoder stacks with multi-head self-attention, feed-forward layers, and positional encodings for sequence modeling. It enables parallelization and captures context, with decoder-only Transformers powering autoregressive generation in models like GPT and Llama.