LLM Reference
Capability-first concept index

LLM Concepts & Capability Filters

Start with the capability that gates your model choice, then drill into the models and provider routes that support it. The remaining glossary stays below for definitions.

Glossary and capability filters for comparison shopping — find LLMs by context window, agent and coding capability, tool use, structured output, batch API support, and token pricing.

11

capability filters

3,586

model matches

68

concept pages

4

capability groups

Capability Filters

Input, output, and context

Use these filters when modality or window size is the gating requirement.

4 filters

Agent and control surfaces

Find models that expose structured control, tool calls, or execution hooks.

4 filters

Cost and throughput levers

Compare models with batch or cache evidence in model/provider data.

2 filters

Customization

Track models with first-party fine-tuning support in the seed data.

1 filters

Glossary Concepts

agents

Agent

An agent is a harness that has been empowered with three things: a **role** (who it is — system prompt, persona, domain expertise), a **mission** (what it is supposed to accomplish — a task, goal, or recurring responsibility), and a **scope** (what it is allowed to touch — which tools, repos, files, and permissions). Once configured, the agent is pointed at work and runs its agentic loop until the mission is done or it hits a blocker. Concrete examples from Data Advantage's own setup: our Founding Engineer agent, CTO agent, DevOps Engineer, and CPO are all agents — each is the same underlying harness (Claude Code plus the Paperclip execution layer) configured with a distinct AGENTS.md, a task queue, and a permissions profile. **A note on industry usage.** Most sources call any agentic coding tool an "agent" — including unconfigured harnesses like Cursor or Claude Code out of the box. We use the term more narrowly: an agent is a *configured, pointed* harness. The binary by itself is a harness; the binary plus `role + mission + scope` is an agent. Same runtime, different layer in the stack. Under the industry's looser usage, a "coding agent" is really "an agentic harness." This precision matters when designing agentic systems: choosing a harness and configuring agents are two separate decisions, often made by different people. See the [harness](/concept/harness) concept for the runtime distinction. For common agent archetypes (autonomous SWE, pair programmer, research assistant, etc.), see our [AI Agents vs Harnesses](https://vibereference.com/ai-development/agents-vs-harnesses) explainer on VibeReference.

Agentic loop

The agentic loop is the repeating cycle at the heart of every autonomous AI system: **model → tool call → result → model → tool call → result → …** until the model produces a final answer or the harness stops it. On each iteration, the harness sends the model its current context, the model responds either with a tool call or a final message, the harness executes any tool call and feeds the result back into context, and the loop continues. This loop is what distinguishes agentic systems from simple chat. A chat completion is one turn — prompt in, response out. An agentic loop lets the model gather evidence, act, observe the effect, and revise its plan, turn after turn. It is how an AI actually *does* things — edits files, runs tests, searches, deploys — rather than just describing what it would do. Key design questions around the loop include: how many iterations to allow before escalating, how to compact context as history grows, how to detect when the model is stuck in a tool-call loop, how to handle errors from tool results, and when to hand control back to a human. Most harnesses expose configuration for all of these. The loop is executed by the [harness](/concept/harness); the steps taken within the loop depend on the [agent](/concept/agent)'s role, mission, and scope; and the quality of each step depends on the [model](/concept/model) and the [context](/concept/context) it receives.

Context

Context is everything the model sees on a given turn: the full set of tokens loaded into its context window before it produces its next output. That typically includes the system prompt (role and instructions), the conversation history so far, any tool call results from prior turns, attached files or code snippets, retrieved documents, and explicit memory the harness has surfaced. Context is the only place a model can look for information — it has no hidden state and no memory between API calls. Whatever is in context on this turn is what the model "knows" right now. Everything else, from yesterday's conversation to the contents of a file on disk, must be reloaded into context for the model to use it. Because context windows are finite — typically 128K–2M tokens today, depending on the model — and every token costs money and attention, context management is one of the hardest problems in agentic systems. Harnesses spend significant effort deciding what to include, what to summarize, when to compact older turns, and how to surface the right files, tool results, and memories without drowning the model in irrelevant text. Good context engineering — clear system prompts, tight tool results, selective history, and retrieval that surfaces only what's needed — often matters more than raw model capability. A frontier model with bad context will underperform a weaker model given the right information. See also: [context window](/concept/context-window), [system prompt](/concept/system-prompt), [RAG](/concept/rag).

Harness

A harness is the runtime scaffolding that wires a model, its tools, and its context into a working agentic loop. It is the software that actually runs the model-to-tool-to-result cycle: handling API calls to the model, executing tool calls the model emits, feeding results back, managing the context window, and enforcing sandboxing and permissions. Well-known harnesses include Claude Code, Cursor, Cowork, Cline, Aider, Windsurf, and Devin. A freshly installed harness is unconfigured — a binary or IDE extension with no role, no mission, and no active task. It becomes useful only when someone points it at work with a specific prompt, system prompt, or AGENTS.md-style configuration. At that point the harness is operating as an agent, but the harness itself is the runtime, not the task. **A note on industry usage.** Many articles and vendors use "harness" and "agent" interchangeably, or skip "harness" entirely and call every agentic tool an "agent." On this site we reserve **harness** for the unconfigured runtime and **agent** for a harness that has been given a role, mission, and scope. The distinction matters because a single harness (e.g., Claude Code) can host dozens of different agents just by swapping AGENTS.md files — same binary, different configuration state. See the [agents directory](/agents) for a catalog of products in this category, and the [agent](/concept/agent) concept for how harnesses get pointed at specific work.

Tools

Tools are the discrete capabilities a model can invoke to act on the world. A model on its own can only emit text; tools turn text emissions into real effects — reading files, running shell commands, editing code, searching the web, querying databases, calling APIs. Common examples in coding agents include `bash`, `read`, `edit`, `grep`, `web_search`, `web_fetch`, and any MCP server exposed to the runtime. Mechanically, tool use works via function calling. The harness advertises a set of tool schemas (name, description, input parameters) in the model's context. When the model decides a tool is needed, it emits a structured tool call; the harness executes it, captures the result, and feeds that result back as a new turn. The model then either calls another tool or produces a final response. This repeating cycle is the agentic loop. Tools are where agents meet reality. Their design determines how capable, safe, and efficient an agent can be: too few tools and the model is blind; too many and its context fills with schemas it will never use. Permission scoping on tools — which tools a given agent may call, against which resources — is one of the main safety levers in any agentic system. Tools can be built in to a harness, loaded from local plugins, or exposed over [Model Context Protocol (MCP)](/concept/mcp) by external servers, which is what makes the MCP ecosystem important for composable agents.

architecture

Attention mechanism

The attention mechanism allows large language models to weigh the importance of different tokens in a sequence relative to each other when processing input, enabling focus on relevant context regardless of position. It is core to transformer architectures, powering parallel computation and long-range dependencies.

Subquadratic Attention

Linear Attention / SSM

Subquadratic attention refers to any language model architecture that scales compute with sequence length at less than O(n²) — the quadratic cost of standard transformer self-attention. Standard transformers compare every token against every other token, making long-context processing computationally expensive (and practically limited to 8K–200K tokens before inference cost becomes prohibitive). Subquadratic architectures break this bottleneck via one of three main approaches: **State Space Models (SSMs):** Replace attention with a compressed recurrent state that evolves through the sequence. The state size is fixed regardless of context length, yielding O(n) training and O(1) constant-memory inference. The Mamba architecture (and Mamba-2) is the canonical SSM, used in models like Falcon Mamba, Falcon3-Mamba, Jamba, Zamba, Codestral Mamba, and Liquid Foundation Models. **Recurrent LLMs:** The RWKV family (Eagle, Finch, Goose) treats the model as a pure RNN with linear complexity, enabling infinite-length inference with a fixed memory footprint. Unlike attention-based SSMs, RWKV requires no KV cache at all — inference memory does not grow with context length. **Sparse / Selective Attention:** Instead of computing all-pairs attention, models select a subset of relevant tokens per query. SubQ (Subquadratic Sparse Attention, 2026) is the most prominent example, claiming O(n) complexity via sparse token selection with a production 1M-token context window. **Hybrid SSM + Attention:** Several production models combine attention layers with SSM layers for the best of both: AI21's Jamba family and Zyphra's Zamba family interleave Mamba blocks with standard transformer attention. Subquadratic models trade off some in-context retrieval precision (transformers with full attention tend to have stronger exact-match recall) against dramatically lower inference cost and memory at long contexts. As context windows extend past 1M tokens, subquadratic approaches become the only economically viable option at scale.

behavior

capability

foundation

Model

A model is the large language model itself — the trained neural-network weights that turn input tokens into output tokens. It is the raw computational substrate of every AI system: a frozen artifact produced by pretraining on trillions of tokens, often further shaped by instruction tuning and alignment. Examples include Claude Opus 4.7 and Sonnet 4.6 from Anthropic, GPT-5 from OpenAI, and Google's Gemini family. On its own a model is stateless and narrow. Given input tokens it predicts the next token distribution — that is the whole job. It has no memory between API calls, no filesystem, no ability to browse the web, no "session." Everything else users associate with AI — conversation history, tool access, long-horizon behavior, retrieval — is supplied by the layers wrapping it. When choosing a model you typically trade off capability (frontier reasoning, coding, math), context length, latency and price per million tokens, and modality support (text, vision, audio, video). Post-training recipe, training cutoff, and parameter count all shape real-world behavior even when the architecture looks similar. In our five-layer mental model — **model → tools → context → harness → agent** — the model is the primitive that every other layer wraps. Swapping the underlying model (say, Sonnet 4.6 for Opus 4.7 inside the same harness) is one of the most common agentic-system upgrades. As a storefront-level anchor, [**$1.10 / $4.40 per 1M tokens**](/model/o4-mini/openai-api) is the May 2026 standard-rate pair we track on the canonical **OpenAI API** routing row for **o4-mini** (always re-check routing before estimating spend). See the [models directory](/models) for the full catalog with pricing, context windows, and benchmark results.

inference_optimization

infrastructure

Vector Database

A specialized database for storing and querying high-dimensional embedding vectors using similarity search rather than exact key lookup.

A vector database is a data storage system designed specifically for storing high-dimensional embedding vectors and performing fast approximate nearest-neighbor (ANN) search over them. Where traditional databases retrieve records by exact key or index match, a vector database retrieves records by semantic similarity — returning the vectors and associated documents closest to a query vector in the embedding space. This makes vector databases the foundational infrastructure layer for RAG systems, semantic search, recommendation engines, duplicate detection, and long-term agent memory. Common vector database systems include Pinecone, Weaviate, Qdrant, Milvus, Chroma, and pgvector (a PostgreSQL extension). Each makes different trade-offs between index algorithm (HNSW is the most widely used for low-latency high-recall search), filtering capabilities (metadata predicates alongside vector queries), scalability, and operational complexity. Performance is typically measured by recall@k — what fraction of the true top-k nearest neighbors are returned — balanced against query latency and indexing throughput. As RAG architectures have matured, vector databases are typically used in conjunction with an embedding model (to generate vectors at index and query time) and often with a re-ranker (to refine initial ANN results with a more precise cross-encoder pass). Vector search is also increasingly available natively inside general-purpose databases (PostgreSQL via pgvector, Redis, Elasticsearch), reducing the need for a dedicated specialized system. Understanding vector databases is prerequisite knowledge for building any retrieval-augmented LLM system.

learning_paradigm

optimization

preprocessing

prompting_technique

protocol

MCP (Model Context Protocol)

MCP — the **Model Context Protocol** — is an open standard for exposing tools, resources, and prompts to agentic harnesses. Introduced by Anthropic in late 2024 and adopted broadly across the ecosystem, MCP gives any harness (Claude Code, Cursor, Cline, Windsurf, and others) a consistent way to discover and call external capabilities provided by independent servers. An MCP server advertises three kinds of surface: **tools** (callable functions the model can invoke), **resources** (readable data the harness can attach to context — files, records, documents), and **prompts** (reusable prompt templates). A harness acts as the MCP client, connecting to one or more servers over stdio or HTTP and surfacing their capabilities to the model as regular tool calls. The win is composability. Before MCP, every harness had to ship its own integrations for GitHub, Linear, Postgres, Slack, filesystem, browser automation, and so on. With MCP, an organization can stand up a single MCP server for each internal system — or pull one off the shelf — and every compliant harness can use it without custom code. It is roughly analogous to LSP (Language Server Protocol) for IDEs: a standard plug for a fragmented market. For agentic systems, MCP is the cleanest way to extend an [agent](/concept/agent)'s scope without modifying the [harness](/concept/harness). It is also how many products compose across vendors — a Cursor user can use the same MCP server a teammate uses from Claude Code.

representation

safety

Guardrails

Input and output validation layers that enforce safety, quality, and policy constraints on language model behavior in production systems.

Guardrails are software components — operating before, during, or after LLM inference — that validate inputs and outputs against defined safety, quality, or policy criteria. Input guardrails screen user requests for harmful intent, prompt injection attempts, sensitive data exposure, or out-of-scope queries before they reach the model. Output guardrails evaluate model responses for hallucinations, toxic content, policy violations, or format errors before delivery to the end user. Together they provide a controllable safety layer that operates independently of the model's own internal alignment training. Common guardrail implementations span a range of complexity: rule-based filters (regex patterns, keyword blocklists), lightweight classifier models specialized for toxicity or policy violation detection (e.g., Meta Llama Guard, Microsoft Azure Content Safety, Google's safety filters), LLM-as-judge pipelines that use a second model to evaluate outputs, and structured output parsers that enforce schema compliance and type safety. Dedicated orchestration frameworks such as Guardrails AI and NVIDIA NeMo Guardrails combine multiple check types into configurable pipelines. Guardrails are especially critical in agentic systems where models can take real-world actions — sending emails, querying databases, executing code. In these contexts, guardrails often include permission checks, rate limits, and action confirmation gates in addition to content filters. A common production design pairs lightweight, low-latency input guardrails (fast keyword or classifier checks, under 50ms) with asynchronous output guardrails (LLM-as-judge) for post-delivery monitoring and compliance logging. Enterprise deployments require documented guardrail policies for legal liability and regulatory compliance sign-off.

Prompt Injection

An attack where malicious instructions embedded in external content attempt to override a language model's intended behavior.

Prompt injection is a class of adversarial attacks against language model systems in which untrusted external content — such as web pages, documents, user inputs, or tool results — contains hidden or overt instructions designed to override the model's original system prompt or directives. When a model processes this content as part of its context, it may follow the injected instructions instead of (or in addition to) its legitimate ones, potentially leaking sensitive data, taking unauthorized actions, or producing responses the system designer did not intend. Indirect prompt injection is the most dangerous variant in agentic contexts: the attacker's payload is not in the user's direct message but embedded in external content the agent retrieves or processes autonomously — for example, a malicious README file, a webpage loaded by a browsing agent, or a document retrieved via RAG. This enables remote attacks without direct access to the victim's session, and the blast radius scales with the agent's permissions. Defenses include input and output validation (guardrails), privilege separation between trusted system-prompt content and untrusted external content, sandboxing tool execution, model-level safety training, clearly delimited prompt structure that marks untrusted content, and explicit confirmation steps before high-consequence actions. Prompt injection is recognized as the primary security threat in agentic AI systems and is tracked by OWASP as a top vulnerability for LLM applications.

technique

Context Engineering

The practice of deliberately managing what goes into an LLM's context window — prompts, retrieved chunks, history, and tool results — to optimize model performance.

Context engineering is the discipline of systematically controlling and optimizing the inputs placed into a language model's context window — including system prompts, retrieved documents, conversation history, tool results, and user instructions — to achieve reliable, high-quality outputs. While prompt engineering focuses on the phrasing and structure of individual instructions, context engineering addresses the broader orchestration question: what information to include, exclude, compress, or retrieve at each inference step. In production agentic systems, context engineering is often the primary lever for improving model behavior without changing the model itself. Key concerns include managing token budgets across multi-turn conversations, selecting the most relevant retrieved chunks for RAG pipelines, structuring chat history to preserve key facts while compressing stale context, and sequencing tool results to maximize downstream reasoning quality. Effective context engineering requires understanding how different model architectures handle positional information (the 'lost in the middle' effect, attention decay near context boundaries), how models weight system versus user versus assistant turns, and how to exploit prompt caching for repeated prefixes to reduce latency and cost. As context windows have grown to 1M+ tokens, the discipline has shifted from fitting everything in to selecting what matters most — an architectural and product design skill as much as a prompting skill.

Grounding

Grounding in LLMs involves anchoring model outputs to verifiable external sources, such as retrieved documents or real-time data, to ensure factual accuracy and reduce hallucinations. It provides a foundation for reliable generation by linking responses to evidence.

LLM Evaluation

Systematic frameworks for measuring LLM performance, capability, safety, and reliability across standardized or custom tasks.

LLM evaluation (commonly shortened to 'evals') refers to the structured practice of measuring a language model's performance across a defined set of tasks, benchmarks, or behavioral criteria. Evals are the primary method by which researchers, engineers, and product teams compare models, track capability regressions, measure safety properties, and make deployment decisions. They range from automated benchmarks graded by code (MMLU, HumanEval, MATH, GPQA, SWE-bench) to human-preference comparisons (Chatbot Arena) to adversarial red-teaming for safety and robustness. For practitioners building LLM-powered products, custom evals are often more valuable than published benchmarks. A custom eval suite captures the actual query distribution, the output quality dimensions that matter (accuracy, format adherence, tone, latency), and the failure modes specific to the application — including edge cases that public benchmarks do not cover. Frameworks such as OpenAI Evals, Braintrust, LangSmith, and Inspect help teams build, version, and continuously track evaluation results. Key eval design considerations include: choosing between reference-based scoring (exact match, ROUGE, F1) and model-graded scoring (LLM-as-judge), preventing benchmark contamination from training data leakage, ensuring reproducibility via deterministic prompts and fixed temperature settings, and weighting edge cases rather than only the central distribution. As enterprise AI deployments mature, evals are increasingly required for compliance, safety sign-off, and vendor selection — making them as important for buyers as for model researchers.

RAG (Retrieval Augmented Generation)

Retrieval Augmented Generation is a technique that enhances large language models by retrieving relevant external documents or data and incorporating them into the prompt before generation, improving factual accuracy and reducing hallucinations. It combines retrieval systems with generative models to ground responses in real-world knowledge.

training_technique

Other

Activation-aware Weight Quantization

AWQ

AWQ (Activation-aware Weight Quantization) quantizes LLM weights based on activation statistics, preserving the most important weights for model performance. It achieves higher accuracy than uniform quantization at lower bit-widths by adapting quantization granularity to activation patterns.

alignment

Alignment ensures LLMs produce outputs matching human values, preferences, and safety constraints through techniques like RLHF, DPO, or constitutional AI. It addresses the gap between raw predictive power and deployable utility by iteratively refining behaviors via feedback, reducing harms like bias.

Base Model

A base model refers to the core pretrained neural network, such as a Transformer architecture, before any task-specific fine-tuning or alignment. It provides the raw capabilities that are adapted for specialized uses, forming the starting point for instruct or chat variants.

Chat Model

A chat model is an LLM variant fine-tuned for conversational interactions, generating coherent, context-aware responses in dialogue formats like ChatGPT. It leverages autoregressive generation and techniques like RLHF for safer, more engaging conversations in user-facing AI assistants.

chat tuning

Chat tuning involves fine-tuning LLMs on conversational datasets to optimize multi-turn dialogue capabilities, focusing on coherence, context retention, and natural response generation. It refines conversational flow by emphasizing turn-taking, persona consistency, and engagement, often as a precursor to preference-based alignment.

direct preference optimization

DPO

DPO is a lightweight alignment technique that fine-tunes LLMs directly on pairwise preference data (preferred vs. rejected responses) without a separate reward model or reinforcement learning. It optimizes the policy by maximizing the log-ratio of probabilities between chosen and rejected outputs relative to a reference model.

Foundation Model

A foundation model is a large-scale pretrained model, typically Transformer-based, serving as a versatile base for various downstream tasks through adaptation like fine-tuning. It enables broad AI applications with minimal task-specific training, reducing development costs and supporting emergent abilities in reasoning and instruction-following.

generative

Generative, in AI, describes models that create new content like text or images from learned distributions, often autoregressively in LLMs by sampling next tokens. It enables applications in creativity, simulation, and data augmentation beyond mere classification, supporting open-ended tasks like story-writing and code generation.

Generative Pre-trained Transformer Quantization

GPTQ

GPTQ (Generative Pre-trained Transformer Quantization) is a post-training quantization method that reduces model precision to 4-8 bits while maintaining performance through careful rounding and calibration. It enables efficient deployment of large models on consumer hardware with minimal accuracy loss.

GPT

GPT (Generative Pretrained Transformer) is a decoder-only Transformer model series pretrained on internet-scale data for autoregressive text generation, pioneering LLMs like GPT-3 and GPT-4. It exemplifies foundation models adapted via fine-tuning for chat and instruction tasks, demonstrating scaling benefits and few-shot learning.

GPT-Generated Machine Learning

GGML

GGML (GPT-Generated Machine Learning) is a C library for machine learning that provides efficient implementations of quantized tensor operations and model inference. It powers local LLM deployment by optimizing performance on commodity hardware.

GPT-Generated Unified Format

GGUF

GGUF (GPT-Generated Unified Format) is a quantized model file format for efficient storage and loading of LLMs, providing a unified structure for different quantization schemes and metadata. It enables seamless model distribution and loading across different hardware configurations.

Instruct Model

An instruct model is a fine-tuned LLM optimized to follow user instructions accurately, often via supervised fine-tuning on instruction-response pairs. It enhances alignment with human intent, bridging general capabilities to directive use for practical deployment in interactive applications.

instruction following

Instruction following refers to an LLM's capability to comprehend and execute user-provided instructions accurately, generating responses that adhere to specified tasks, formats, or constraints. This emergent ability arises from training on datasets with explicit prompt-response pairs, enabling generalization across diverse directives.

instruction tuning

Instruction tuning is a supervised fine-tuning process where an LLM is trained on datasets of instructions paired with desired responses to enhance its ability to follow user directives. It bridges pre-training and advanced alignment by teaching the model to interpret natural language prompts as actionable commands.

language model

A language model is a probabilistic system that generates or predicts text sequences, historically evolving from statistical methods to neural architectures. It integrates neural networks, scale, and Transformer designs for advanced autoregressive text modeling and enables tasks from completion to understanding.

large language model

LLM

A large language model (LLM) is a Transformer-based neural network with billions of parameters, trained on vast datasets for next-token prediction, enabling broad NLP tasks and emergent abilities. LLMs dominate AI due to their human-like text generation and adaptability across domains.

Low-Rank Adaptation

LoRA

LoRA (Low-Rank Adaptation) fine-tunes LLMs by freezing pre-trained weights and injecting trainable low-rank matrices into weight updates, approximating full fine-tuning with far fewer parameters. It decomposes delta weights as low-rank matrices where rank r is much smaller than dimensions, enabling efficient task adaptation.

model merging

Model merging combines multiple fine-tuned LLMs into a single model by averaging weights, resolving conflicts via task arithmetic or projection methods to preserve capabilities. Techniques like TIES or SLERP mitigate interference, enabling efficient knowledge fusion without retraining.

Parameter-Efficient Fine-Tuning

PEFT

PEFT (Parameter-Efficient Fine-Tuning) encompasses methods to adapt pre-trained LLMs for specific tasks by updating a small subset of parameters, avoiding full fine-tuning's compute costs. Approaches like adapters or prompt tuning inject task-specific modules, enabling scalability to massive models.

pretrained

Pretraining is the initial training phase where a model learns general representations from unlabeled data via self-supervision, like masked or next-token prediction. It transfers knowledge efficiently to downstream tasks, minimizing labeled data needs and unlocking scaling and emergent skills before fine-tuning.

Pretrained Model

A pretrained model is a language model trained on massive text corpora via self-supervised tasks like next-token prediction, acquiring broad knowledge without task-specific labels. It bootstraps capabilities like in-context learning, enabling efficient adaptation to new tasks and exceeding billions of parameters when scaled.

proximal policy optimization

PPO

PPO is an on-policy reinforcement learning algorithm used in RLHF to update the LLM policy model by maximizing a clipped surrogate objective, ensuring stable training through trust-region constraints. It balances reward maximization with KL-divergence penalties to prevent large policy shifts.

quantization

Quantization reduces LLM precision by mapping high-bit weights and activations (e.g., FP16) to lower-bit representations (e.g., INT8 or INT4), minimizing memory footprint and inference latency. Techniques like post-training quantization preserve accuracy by calibrating rounding errors, enabling deployment on resource-constrained hardware.

Quantized Low-Rank Adaptation

QLoRA

QLoRA extends LoRA by combining 4-bit quantization (via NF4 and double quantization) with paged optimizers to fine-tune billion-parameter LLMs on consumer GPUs. It maintains performance parity with 16-bit full tuning while dramatically reducing memory requirements.

reinforcement learning from human feedback

RLHF

RLHF aligns LLMs with human preferences through a multi-stage process: training a reward model on ranked response pairs, then using reinforcement learning to optimize the policy model against this reward. Typically employing PPO, it maximizes expected reward while constraining deviation from a reference model.

Rotary Position Embedding

RoPE

RoPE (Rotary Position Embedding) is a positional encoding method that encodes absolute positions using rotation matrices, enabling efficient relative position representation in Transformers. It improves generalization to longer sequences and has become the standard in modern LLMs like Llama and GPT variants.

supervised fine-tuning

SFT

Supervised fine-tuning (SFT) adapts a pretrained model on labeled instruction-response pairs to improve task-specific performance, like following directives. It aligns general models to user needs with minimal data and precedes RLHF, enhancing instruction adherence and reducing hallucinations.

transformer

A Transformer is a neural architecture using self-attention mechanisms to process sequences in parallel, revolutionizing NLP by handling long-range dependencies efficiently. Transformers form the backbone of LLMs, with decoder-only variants dominating generative tasks and enabling scalable training on huge datasets.

transformer architecture

The Transformer architecture consists of encoder and/or decoder stacks with multi-head self-attention, feed-forward layers, and positional encodings for sequence modeling. It enables parallelization and captures context, with decoder-only Transformers powering autoregressive generation in models like GPT and Llama.