AI & LLM Glossary
55 key concepts in AI and language modeling
agents
Agent
An agent is a harness that has been empowered with three things: a **role** (who it is — system prompt, persona, domain expertise), a **mission** (what it is supposed to accomplish — a task, goal, or recurring responsibility), and a **scope** (what it is allowed to touch — which tools, repos, files, and permissions). Once configured, the agent is pointed at work and runs its agentic loop until the mission is done or it hits a blocker.

Concrete examples from Data Advantage's own setup: our Founding Engineer agent, CTO agent, DevOps Engineer, and CPO are all agents — each is the same underlying harness (Claude Code plus the Paperclip execution layer) configured with a distinct AGENTS.md, a task queue, and a permissions profile.

**A note on industry usage.** Most sources call any agentic coding tool an "agent" — including unconfigured harnesses like Cursor or Claude Code out of the box. We use the term more narrowly: an agent is a *configured, pointed* harness. The binary by itself is a harness; the binary plus `role + mission + scope` is an agent. Same runtime, different layer in the stack. Under the industry's looser usage, a "coding agent" is really "an agentic harness."

This precision matters when designing agentic systems: choosing a harness and configuring agents are two separate decisions, often made by different people. See the [harness](/concept/harness) concept for the runtime distinction. For common agent archetypes (autonomous SWE, pair programmer, research assistant, etc.), see our [AI Agents vs Harnesses](https://vibereference.com/ai-development/agents-vs-harnesses) explainer on VibeReference.
Agentic loop
The agentic loop is the repeating cycle at the heart of every autonomous AI system: **model → tool call → result → model → tool call → result → …** until the model produces a final answer or the harness stops it. On each iteration, the harness sends the model its current context, the model responds either with a tool call or a final message, the harness executes any tool call and feeds the result back into context, and the loop continues.

This loop is what distinguishes agentic systems from simple chat. A chat completion is one turn — prompt in, response out. An agentic loop lets the model gather evidence, act, observe the effect, and revise its plan, turn after turn. It is how an AI actually *does* things — edits files, runs tests, searches, deploys — rather than just describing what it would do.

Key design questions around the loop include: how many iterations to allow before escalating, how to compact context as history grows, how to detect when the model is stuck in a tool-call loop, how to handle errors from tool results, and when to hand control back to a human. Most harnesses expose configuration for all of these. The loop is executed by the [harness](/concept/harness); the steps taken within the loop depend on the [agent](/concept/agent)'s role, mission, and scope; and the quality of each step depends on the [model](/concept/model) and the [context](/concept/context) it receives.
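The cycle described above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in — the stub model, the `read_file` tool, and the message shapes are invented for illustration, not any specific harness's API:

```python
# Minimal agentic-loop sketch: model -> tool call -> result -> model -> ...
def fake_model(messages):
    """Stand-in for an LLM call: requests a tool once, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "read_file", "args": {"path": "notes.txt"}}}
    return {"final": "The file says: hello"}

TOOLS = {"read_file": lambda path: "hello"}  # toy tool registry

def run_agentic_loop(task, max_iters=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_iters):                 # iteration budget before escalating
        reply = fake_model(messages)
        if "final" in reply:                   # model produced a final answer
            return reply["final"]
        call = reply["tool_call"]              # execute the tool call...
        result = TOOLS[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": result})  # ...feed result back
    raise RuntimeError("iteration budget exhausted")

print(run_agentic_loop("summarize notes.txt"))  # → The file says: hello
```

Real harnesses layer context compaction, error handling, and permission checks onto this same skeleton.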
Context
Context is everything the model sees on a given turn: the full set of tokens loaded into its context window before it produces its next output. That typically includes the system prompt (role and instructions), the conversation history so far, any tool call results from prior turns, attached files or code snippets, retrieved documents, and explicit memory the harness has surfaced. Context is the only place a model can look for information — it has no hidden state and no memory between API calls. Whatever is in context on this turn is what the model "knows" right now. Everything else, from yesterday's conversation to the contents of a file on disk, must be reloaded into context for the model to use it.

Because context windows are finite — typically 128K–2M tokens today, depending on the model — and every token costs money and attention, context management is one of the hardest problems in agentic systems. Harnesses spend significant effort deciding what to include, what to summarize, when to compact older turns, and how to surface the right files, tool results, and memories without drowning the model in irrelevant text.

Good context engineering — clear system prompts, tight tool results, selective history, and retrieval that surfaces only what's needed — often matters more than raw model capability. A frontier model with bad context will underperform a weaker model given the right information. See also: [context window](/concept/context-window), [system prompt](/concept/system-prompt), [RAG](/concept/rag).
Harness
A harness is the runtime scaffolding that wires a model, its tools, and its context into a working agentic loop. It is the software that actually runs the model-to-tool-to-result cycle: handling API calls to the model, executing tool calls the model emits, feeding results back, managing the context window, and enforcing sandboxing and permissions. Well-known harnesses include Claude Code, Cursor, Cowork, Cline, Aider, Windsurf, and Devin.

A freshly installed harness is unconfigured — a binary or IDE extension with no role, no mission, and no active task. It becomes useful only when someone points it at work with a specific prompt, system prompt, or AGENTS.md-style configuration. At that point the harness is operating as an agent, but the harness itself is the runtime, not the task.

**A note on industry usage.** Many articles and vendors use "harness" and "agent" interchangeably, or skip "harness" entirely and call every agentic tool an "agent." On this site we reserve **harness** for the unconfigured runtime and **agent** for a harness that has been given a role, mission, and scope. The distinction matters because a single harness (e.g., Claude Code) can host dozens of different agents just by swapping AGENTS.md files — same binary, different configuration state.

See the [agents directory](/agents) for a catalog of products in this category, and the [agent](/concept/agent) concept for how harnesses get pointed at specific work.
Tools
Tools are the discrete capabilities a model can invoke to act on the world. A model on its own can only emit text; tools turn text emissions into real effects — reading files, running shell commands, editing code, searching the web, querying databases, calling APIs. Common examples in coding agents include `bash`, `read`, `edit`, `grep`, `web_search`, `web_fetch`, and any MCP server exposed to the runtime.

Mechanically, tool use works via function calling. The harness advertises a set of tool schemas (name, description, input parameters) in the model's context. When the model decides a tool is needed, it emits a structured tool call; the harness executes it, captures the result, and feeds that result back as a new turn. The model then either calls another tool or produces a final response. This repeating cycle is the agentic loop.

Tools are where agents meet reality. Their design determines how capable, safe, and efficient an agent can be: too few tools and the model is blind; too many and its context fills with schemas it will never use. Permission scoping on tools — which tools a given agent may call, against which resources — is one of the main safety levers in any agentic system. Tools can be built into a harness, loaded from local plugins, or exposed over [Model Context Protocol (MCP)](/concept/mcp) by external servers, which is what makes the MCP ecosystem important for composable agents.
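The schema-advertise-then-dispatch mechanic can be sketched as follows. The `grep` tool, its schema, and the in-memory file system are all toy stand-ins invented for this example:

```python
import json

# Hypothetical tool schema, as a harness might advertise it to the model.
TOOL_SCHEMAS = [
    {"name": "grep", "description": "Search a file for lines matching a pattern",
     "parameters": {"pattern": "string", "path": "string"}},
]

def grep(pattern, path):
    # Toy implementation over an in-memory "file system" instead of disk.
    fake_fs = {"main.py": ["import os", "def main():", "    print('hi')"]}
    return [line for line in fake_fs.get(path, []) if pattern in line]

DISPATCH = {"grep": grep}  # harness-side mapping from tool name to code

# A structured tool call as the model might emit it, shaped by the schema.
raw_call = '{"name": "grep", "args": {"pattern": "def", "path": "main.py"}}'
call = json.loads(raw_call)
result = DISPATCH[call["name"]](**call["args"])
print(result)  # → ['def main():']
```

The result would then be appended to context as a tool turn, and the loop continues.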
capability
Context window
The context window is the maximum number of tokens a large language model can consider at once for input and output during inference, limiting the amount of information it can process in a single pass. Larger windows enable handling longer conversations or documents but increase computational demands.
Tool use / Function calling
Tool use or function calling enables large language models to invoke external APIs, tools, or functions (e.g., calculators, search engines) by generating structured calls based on user queries, extending capabilities beyond internal knowledge. It allows dynamic integration of real-time data or computations into responses.
learning_paradigm
Few-shot learning
Few-shot learning enables large language models to perform tasks effectively using only a small number of labeled examples (typically 1-10) provided in the prompt, relying on in-context learning without parameter updates. It bridges the gap between zero-shot and fine-tuning by demonstrating patterns through examples.
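A sketch of what an assembled few-shot prompt looks like; the sentiment task and the example reviews are invented for illustration:

```python
# Three labeled examples in the prompt teach the pattern; no weights change.
examples = [
    ("I loved this movie", "positive"),
    ("Terrible, total waste of time", "negative"),
    ("An instant classic", "positive"),
]

def build_few_shot_prompt(examples, query):
    lines = ["Classify the sentiment of each review."]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")   # model completes this line
    return "\n\n".join(lines)

prompt = build_few_shot_prompt(examples, "I fell asleep halfway through")
print(prompt)
```

The trailing `Sentiment:` cue is what invites the model to continue the demonstrated pattern in-context.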
Zero-shot learning
Zero-shot learning allows large language models to perform tasks without any task-specific examples in the prompt, relying solely on instructions and pre-trained knowledge to generalize to unseen tasks. It tests the model's ability to understand and apply concepts from training data alone.
prompting_technique
Chain of Thought
Chain of Thought (CoT) is a prompting technique that instructs large language models to generate intermediate step-by-step reasoning before providing a final answer, mimicking human-like deliberation to improve performance on complex reasoning tasks. It enhances transparency and accuracy but can sometimes lead to unfaithful explanations or performance drops on certain tasks.
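The difference between a direct prompt and a CoT prompt is just the instruction wording. A sketch, with an invented arithmetic question for illustration:

```python
# Contrast: direct answer vs. explicit step-by-step reasoning request.
question = ("A pen costs $2 and a notebook costs $5. "
            "What do 3 pens and 2 notebooks cost?")

direct_prompt = f"Question: {question}\nAnswer with just the number."

cot_prompt = (
    f"Question: {question}\n"
    "Think step by step: work out each intermediate quantity, "
    "then give the final answer on a line starting with 'Answer:'."
)
```

The CoT variant makes the model spend output tokens on intermediate steps before committing to an answer, which is where the accuracy gains on multi-step problems come from.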
System prompt
A system prompt is a special instruction provided at the start of a conversation to define the AI's role, behavior, tone, or constraints, guiding responses across interactions without being part of the user-facing history. It shapes the model's persona and ensures consistent adherence to guidelines.
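A sketch of where the system prompt sits in a typical chat-style request payload; exact field names and roles vary by provider:

```python
# The system turn defines role and constraints; user turns carry the dialogue.
messages = [
    {"role": "system",
     "content": "You are a concise Python tutor. Answer in at most three sentences."},
    {"role": "user", "content": "What does zip() do?"},
]

system_turns = [m for m in messages if m["role"] == "system"]
```

The system turn is sent with every request but is typically not shown in the user-facing transcript.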
technique
Grounding
Grounding in LLMs involves anchoring model outputs to verifiable external sources, such as retrieved documents or real-time data, to ensure factual accuracy and reduce hallucinations. It provides a foundation for reliable generation by linking responses to evidence.
RAG (Retrieval Augmented Generation)
Retrieval Augmented Generation is a technique that enhances large language models by retrieving relevant external documents or data and incorporating them into the prompt before generation, improving factual accuracy and reducing hallucinations. It combines retrieval systems with generative models to ground responses in real-world knowledge.
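The retrieve-then-assemble flow can be sketched with a toy keyword retriever. Real systems use embedding similarity and a vector store; the documents and scoring here are invented for illustration:

```python
# Toy corpus and keyword-overlap "retriever" standing in for vector search.
DOCS = {
    "doc1": "The Eiffel Tower is 330 metres tall.",
    "doc2": "RAG retrieves documents and adds them to the prompt.",
}

def retrieve(query, k=1):
    scored = sorted(
        DOCS.values(),
        key=lambda d: sum(w.lower() in d.lower() for w in query.split()),
        reverse=True,
    )
    return scored[:k]

def build_rag_prompt(query):
    context = "\n".join(retrieve(query))   # retrieved text goes into the prompt
    return (f"Use only the context to answer.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

prompt = build_rag_prompt("How tall is the Eiffel Tower?")
```

The generation step is unchanged; grounding comes entirely from what the retriever places in context.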
Other
Activation-aware Weight Quantization
AWQ
AWQ (Activation-aware Weight Quantization) quantizes LLM weights using activation statistics to identify the small fraction of salient weights that matter most, rescaling those channels before quantization to preserve them. It achieves higher accuracy than uniform round-to-nearest quantization at low bit-widths such as 4-bit.
alignment
Alignment ensures LLMs produce outputs matching human values, preferences, and safety constraints through techniques like RLHF, DPO, or constitutional AI. It addresses the gap between raw predictive power and deployable utility by iteratively refining behaviors via feedback, reducing harms like bias.
Base Model
A base model refers to the core pretrained neural network, such as a Transformer architecture, before any task-specific fine-tuning or alignment. It provides the raw capabilities that are adapted for specialized uses, forming the starting point for instruct or chat variants.
Chat Model
A chat model is an LLM variant fine-tuned for conversational interactions, generating coherent, context-aware responses in dialogue formats like ChatGPT. It leverages autoregressive generation and techniques like RLHF for safer, more engaging conversations in user-facing AI assistants.
chat tuning
Chat tuning involves fine-tuning LLMs on conversational datasets to optimize multi-turn dialogue capabilities, focusing on coherence, context retention, and natural response generation. It refines conversational flow by emphasizing turn-taking, persona consistency, and engagement, often as a precursor to preference-based alignment.
context length
Context length denotes the maximum number of tokens an LLM can process in a single input sequence, limiting the span of prior text it can attend to for generating coherent outputs. Extensions via architectural innovations like rotary embeddings increase this capacity, crucial for long-document tasks.
direct preference optimization
DPO
DPO is a lightweight alignment technique that fine-tunes LLMs directly on pairwise preference data (preferred vs. rejected responses) without a separate reward model or reinforcement learning. It optimizes the policy by maximizing the log-ratio of probabilities between chosen and rejected outputs relative to a reference model.
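The per-pair loss described above can be written directly from its log-probability form. A minimal sketch, with made-up log-prob values; real sequence log-probs are sums over tokens:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # log πθ(y_w|x)/π_ref(y_w|x)
    rejected_ratio = logp_rejected - ref_logp_rejected  # log πθ(y_l|x)/π_ref(y_l|x)
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log σ(margin)

# Illustrative values: policy already prefers the chosen response slightly.
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=0.1)
```

The loss shrinks as the policy assigns relatively more probability to the chosen response than the reference does, which is the whole optimization signal; no reward model is involved.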
fine-tune
Fine-tuning is the process of further training a pretrained model on targeted datasets to specialize it for specific tasks or behaviors, adjusting weights minimally. It leverages transfer learning for customization without full retraining, creating instruct/chat variants from base models and boosting capabilities.
Foundation Model
A foundation model is a large-scale pretrained model, typically Transformer-based, serving as a versatile base for various downstream tasks through adaptation like fine-tuning. It enables broad AI applications with minimal task-specific training, reducing development costs and supporting emergent abilities in reasoning and instruction-following.
generative
Generative, in AI, describes models that create new content like text or images from learned distributions, often autoregressively in LLMs by sampling next tokens. It enables applications in creativity, simulation, and data augmentation beyond mere classification, supporting open-ended tasks like story-writing and code generation.
Generative Pre-trained Transformer Quantization
GPTQ
GPTQ (Generative Pre-trained Transformer Quantization) is a post-training quantization method that reduces model precision to 4-8 bits while maintaining performance through careful rounding and calibration. It enables efficient deployment of large models on consumer hardware with minimal accuracy loss.
GPT
GPT (Generative Pretrained Transformer) is a decoder-only Transformer model series pretrained on internet-scale data for autoregressive text generation, pioneering LLMs like GPT-3 and GPT-4. It exemplifies foundation models adapted via fine-tuning for chat and instruction tasks, demonstrating scaling benefits and few-shot learning.
GPT-Generated Machine Learning
GGML
GGML (GPT-Generated Machine Learning) is a C library for machine learning that provides efficient implementations of quantized tensor operations and model inference. It powers local LLM deployment by optimizing performance on commodity hardware.
GPT-Generated Unified Format
GGUF
GGUF (GPT-Generated Unified Format) is a quantized model file format for efficient storage and loading of LLMs, providing a unified structure for different quantization schemes and metadata. It enables seamless model distribution and loading across different hardware configurations.
Instruct Model
An instruct model is a fine-tuned LLM optimized to follow user instructions accurately, often via supervised fine-tuning on instruction-response pairs. It enhances alignment with human intent, bridging general capabilities to directive use for practical deployment in interactive applications.
instruction following
Instruction following refers to an LLM's capability to comprehend and execute user-provided instructions accurately, generating responses that adhere to specified tasks, formats, or constraints. This emergent ability arises from training on datasets with explicit prompt-response pairs, enabling generalization across diverse directives.
instruction tuning
Instruction tuning is a supervised fine-tuning process where an LLM is trained on datasets of instructions paired with desired responses to enhance its ability to follow user directives. It bridges pre-training and advanced alignment by teaching the model to interpret natural language prompts as actionable commands.
language model
A language model is a probabilistic system that predicts or generates text sequences, historically evolving from statistical n-gram methods to neural architectures. Modern language models combine neural networks, large-scale training, and Transformer designs for autoregressive text modeling, enabling tasks from completion to understanding.
large language model
LLM
A large language model (LLM) is a Transformer-based neural network with billions of parameters, trained on vast datasets for next-token prediction, enabling broad NLP tasks and emergent abilities. LLMs dominate AI due to their human-like text generation and adaptability across domains.
Low-Rank Adaptation
LoRA
LoRA (Low-Rank Adaptation) fine-tunes LLMs by freezing the pretrained weights and injecting trainable low-rank matrices into the weight updates, approximating full fine-tuning with far fewer parameters. It decomposes the weight delta as a product of two low-rank matrices whose rank r is much smaller than the layer dimensions, enabling efficient task adaptation.
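The decomposition can be sketched with toy matrices: the adapted layer computes h = x(W + BA), where W stays frozen and only B (d×r) and A (r×k) train. Shapes here are tiny for readability; in real models r is on the order of 8-64 against dimensions in the thousands:

```python
# Plain-Python matrix helpers (lists of rows) to keep the sketch dependency-free.
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def madd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (d=2, k=2)
B = [[0.5], [0.5]]             # trainable, d x r with r=1
A = [[1.0, -1.0]]              # trainable, r x k

delta = matmul(B, A)           # low-rank update ΔW = B A (rank 1)
W_adapted = madd(W, delta)     # effective weight at inference: W + BA

x = [[2.0, 4.0]]
h = matmul(x, W_adapted)       # → [[5.0, 1.0]]
```

Only B and A receive gradients, so the trainable parameter count scales with r(d + k) instead of dk.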
model merging
Model merging combines multiple fine-tuned LLMs into a single model by averaging weights, resolving conflicts via task arithmetic or projection methods to preserve capabilities. Techniques like TIES or SLERP mitigate interference, enabling efficient knowledge fusion without retraining.
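The simplest merge is a uniform weight average of two checkpoints; a sketch with a single made-up parameter (methods like TIES and SLERP replace the plain average with conflict-aware combinations):

```python
# Linear interpolation of two fine-tuned checkpoints, parameter by parameter.
def average_weights(m1, m2, alpha=0.5):
    return {name: alpha * m1[name] + (1 - alpha) * m2[name] for name in m1}

ckpt_a = {"layer.w": 1.0}   # e.g., a code-tuned model (hypothetical)
ckpt_b = {"layer.w": 3.0}   # e.g., a chat-tuned model (hypothetical)
merged = average_weights(ckpt_a, ckpt_b)  # → {"layer.w": 2.0}
```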
multimodal
Multimodal refers to LLMs extended to process and generate across multiple data modalities, such as text paired with images, audio, or video, via unified tokenization and cross-attention mechanisms. This enables tasks like visual question answering or captioning, integrating modality-specific encoders into the core transformer.
Parameter-Efficient Fine-Tuning
PEFT
PEFT (Parameter-Efficient Fine-Tuning) encompasses methods to adapt pre-trained LLMs for specific tasks by updating a small subset of parameters, avoiding full fine-tuning's compute costs. Approaches like adapters or prompt tuning inject task-specific modules, enabling scalability to massive models.
pretrained
Pretraining is the initial training phase where a model learns general representations from unlabeled data via self-supervision, like masked or next-token prediction. It transfers knowledge efficiently to downstream tasks, minimizing labeled data needs and unlocking scaling and emergent skills before fine-tuning.
Pretrained Model
A pretrained model is a language model trained on massive text corpora via self-supervised tasks like next-token prediction, acquiring broad knowledge without task-specific labels. It bootstraps capabilities like in-context learning, enabling efficient adaptation to new tasks and exceeding billions of parameters when scaled.
proximal policy optimization
PPO
PPO is an on-policy reinforcement learning algorithm used in RLHF to update the LLM policy model by maximizing a clipped surrogate objective, ensuring stable training through trust-region constraints. It balances reward maximization with KL-divergence penalties to prevent large policy shifts.
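The clipped surrogate for a single sample can be written directly; the inputs here are illustrative numbers, not real rollout statistics:

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate for one sample.

    ratio = pi_theta(a|s) / pi_theta_old(a|s); clipping keeps the
    update inside a trust region around the old policy."""
    clipped = max(min(ratio, 1 + eps), 1 - eps)      # clamp ratio to [1-eps, 1+eps]
    return min(ratio * advantage, clipped * advantage)  # pessimistic of the two

# With a positive advantage, gains beyond the clip range are ignored:
capped = ppo_clipped_objective(1.5, 1.0)  # → 1.2, not 1.5
```

Taking the minimum makes the objective pessimistic: the policy gets no credit for moving the ratio past the clip boundary, which is what stabilizes training.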
quantization
Quantization reduces LLM precision by mapping high-bit weights and activations (e.g., FP16) to lower-bit representations (e.g., INT8 or INT4), minimizing memory footprint and inference latency. Techniques like post-training quantization preserve accuracy by calibrating rounding errors, enabling deployment on resource-constrained hardware.
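A minimal sketch of symmetric round-to-nearest int8 quantization of one weight tensor, with dequantization showing the rounding error; real pipelines work per-channel or per-group and calibrate on data:

```python
# Symmetric int8: one scale per tensor, derived from the largest magnitude.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.10, -0.42, 0.07, 0.99]
q, scale = quantize_int8(w)       # integers in [-127, 127] plus one float scale
w_hat = dequantize(q, scale)      # close to w, within one quantization step
```

Storage drops from 16 or 32 bits per weight to 8 (plus the shared scale), at the cost of the per-weight rounding error visible in `w_hat`.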
Quantized Low-Rank Adaptation
QLoRA
QLoRA extends LoRA by combining 4-bit quantization (via NF4 and double quantization) with paged optimizers to fine-tune billion-parameter LLMs on consumer GPUs. It maintains performance parity with 16-bit full tuning while dramatically reducing memory requirements.
reinforcement learning from human feedback
RLHF
RLHF aligns LLMs with human preferences through a multi-stage process: training a reward model on ranked response pairs, then using reinforcement learning to optimize the policy model against this reward. Typically employing PPO, it maximizes expected reward while constraining deviation from a reference model.
Rotary Position Embedding
RoPE
RoPE (Rotary Position Embedding) is a positional encoding method that encodes absolute positions using rotation matrices, enabling efficient relative position representation in Transformers. It improves generalization to longer sequences and has become the standard in modern LLMs like Llama and GPT variants.
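The rotation can be sketched in plain Python: each consecutive 2-D slice of a query or key vector is rotated by an angle proportional to its position. The vectors below are invented; real models apply this per attention head:

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each (vec[i], vec[i+1]) pair by position * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # lower dims rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# The payoff: query-key dot products depend only on the *relative* offset
# between positions, not on the absolute positions themselves.
q, k = [1.0, 0.5], [0.25, -1.0]
gap = abs(dot(rope(q, 3), rope(k, 5)) - dot(rope(q, 0), rope(k, 2)))  # ~0
```

That relative-offset property is what RoPE buys over plain absolute position embeddings.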
supervised fine-tuning
SFT
Supervised fine-tuning (SFT) adapts a pretrained model on labeled instruction-response pairs to improve task-specific performance, like following directives. It aligns general models to user needs with minimal data and precedes RLHF, enhancing instruction adherence and reducing hallucinations.
transformer
A Transformer is a neural architecture using self-attention mechanisms to process sequences in parallel, revolutionizing NLP by handling long-range dependencies efficiently. Transformers form the backbone of LLMs, with decoder-only variants dominating generative tasks and enabling scalable training on huge datasets.
transformer architecture
The Transformer architecture consists of encoder and/or decoder stacks with multi-head self-attention, feed-forward layers, and positional encodings for sequence modeling. It enables parallelization and captures context, with decoder-only Transformers powering autoregressive generation in models like GPT and Llama.
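The core self-attention computation, softmax(QKᵀ/√d_k)V, can be sketched for a single head with no masking; the toy matrices are invented for illustration:

```python
import math

def softmax(xs):
    m = max(xs)                            # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:                            # one output row per query
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)          # attention distribution over keys
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

out = attention([[1.0, 0.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 2.0], [3.0, 4.0]])
```

Each output row is a weighted mix of the value rows, with weights set by query-key similarity; multi-head attention runs several of these in parallel and concatenates the results.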