AI Glossary
34 key concepts in AI and language modeling
Activation-aware Weight Quantization
AWQ
AWQ (Activation-aware Weight Quantization) quantizes LLM weights using activation statistics to identify the small fraction of salient weight channels that matter most for model performance. By scaling those channels before rounding, it achieves higher accuracy than round-to-nearest uniform quantization at low bit-widths, without any retraining.
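A minimal numpy sketch of the activation-aware idea (not the full AWQ algorithm; the square-root scaling heuristic and the shapes here are illustrative): channels with larger activation magnitudes are scaled up before rounding, shrinking their relative quantization error, and the scale is undone afterwards (in practice folded into the preceding operation).

```python
import numpy as np

def activation_aware_quantize(W, act_scale, n_bits=4):
    # Illustrative sketch only: scale salient input channels (those with large
    # activation magnitude) up before rounding so their relative error shrinks,
    # then undo the scale after dequantization. Real AWQ searches for the
    # per-channel scales and folds the inverse into the previous layer.
    s = np.sqrt(act_scale)                      # hypothetical scaling heuristic
    Ws = W * s                                  # broadcast over input channels
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(Ws).max(axis=1, keepdims=True) / qmax  # per-output-row absmax
    q = np.clip(np.round(Ws / step), -qmax - 1, qmax)
    return (q * step) / s                       # dequantize, undo channel scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))                 # (out_features, in_features)
act_scale = np.abs(rng.standard_normal(8)) + 0.1  # stand-in activation stats
W_hat = activation_aware_quantize(W, act_scale)
```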
alignment
Alignment ensures LLMs produce outputs matching human values, preferences, and safety constraints through techniques like RLHF, DPO, or constitutional AI. It addresses the gap between raw predictive power and deployable utility by iteratively refining behaviors via feedback, reducing harms like bias.
Base Model
A base model refers to the core pretrained neural network, such as a Transformer architecture, before any task-specific fine-tuning or alignment. It provides the raw capabilities that are adapted for specialized uses, forming the starting point for instruct or chat variants.
Chat Model
A chat model is an LLM variant fine-tuned for conversational interactions, generating coherent, context-aware responses in dialogue formats like ChatGPT. It leverages autoregressive generation and techniques like RLHF for safer, more engaging conversations in user-facing AI assistants.
chat tuning
Chat tuning involves fine-tuning LLMs on conversational datasets to optimize multi-turn dialogue capabilities, focusing on coherence, context retention, and natural response generation. It refines conversational flow by emphasizing turn-taking, persona consistency, and engagement, often as a precursor to preference-based alignment.
context length
Context length denotes the maximum number of tokens an LLM can process in a single input sequence, limiting the span of prior text it can attend to for generating coherent outputs. Extensions via techniques such as RoPE scaling and position interpolation increase this capacity, which is crucial for long-document tasks.
direct preference optimization
DPO
DPO is a lightweight alignment technique that fine-tunes LLMs directly on pairwise preference data (preferred vs. rejected responses) without a separate reward model or reinforcement learning. It optimizes the policy by maximizing the log-ratio of probabilities between chosen and rejected outputs relative to a reference model.
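A sketch of the per-pair DPO loss in plain Python, assuming each argument is the summed log-probability of a full response under the policy (pi_*) or the frozen reference model (ref_*); beta controls how far the policy may drift:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # margin: how much more the policy prefers the chosen response over the
    # rejected one, relative to the reference model, scaled by beta
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2; shifting probability toward chosen responses (relative to the reference) drives it lower.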
fine-tune
Fine-tuning is the process of further training a pretrained model on targeted datasets to specialize it for specific tasks or behaviors, adjusting weights minimally. It leverages transfer learning for customization without full retraining, creating instruct/chat variants from base models and boosting capabilities.
Foundation Model
A foundation model is a large-scale pretrained model, typically Transformer-based, serving as a versatile base for various downstream tasks through adaptation like fine-tuning. It enables broad AI applications with minimal task-specific training, reducing development costs and supporting emergent abilities in reasoning and instruction-following.
generative
Generative, in AI, describes models that create new content like text or images from learned distributions, often autoregressively in LLMs by sampling next tokens. It enables applications in creativity, simulation, and data augmentation beyond mere classification, supporting open-ended tasks like story-writing and code generation.
Generative Pre-trained Transformer Quantization
GPTQ
GPTQ (Generative Pre-trained Transformer Quantization) is a post-training quantization method that reduces model precision to low bit-widths (typically 3-4 bits) while maintaining performance through careful, calibration-guided rounding that compensates for quantization error layer by layer. It enables efficient deployment of large models on consumer hardware with minimal accuracy loss.
GPT
GPT (Generative Pretrained Transformer) is a decoder-only Transformer model series pretrained on internet-scale data for autoregressive text generation, pioneering LLMs like GPT-3 and GPT-4. It exemplifies foundation models adapted via fine-tuning for chat and instruction tasks, demonstrating scaling benefits and few-shot learning.
GPT-Generated Machine Learning
GGML
GGML (GPT-Generated Machine Learning) is a C library for machine learning that provides efficient implementations of quantized tensor operations and model inference. It powers local LLM deployment by optimizing performance on commodity hardware.
GPT-Generated Unified Format
GGUF
GGUF (GPT-Generated Unified Format) is a quantized model file format for efficient storage and loading of LLMs, providing a unified structure for different quantization schemes and metadata. It enables seamless model distribution and loading across different hardware configurations.
Instruct Model
An instruct model is a fine-tuned LLM optimized to follow user instructions accurately, often via supervised fine-tuning on instruction-response pairs. It enhances alignment with human intent, bridging general capabilities to directive use for practical deployment in interactive applications.
instruction following
Instruction following refers to an LLM's capability to comprehend and execute user-provided instructions accurately, generating responses that adhere to specified tasks, formats, or constraints. This emergent ability arises from training on datasets with explicit prompt-response pairs, enabling generalization across diverse directives.
instruction tuning
Instruction tuning is a supervised fine-tuning process where an LLM is trained on datasets of instructions paired with desired responses to enhance its ability to follow user directives. It bridges pre-training and advanced alignment by teaching the model to interpret natural language prompts as actionable commands.
language model
A language model is a probabilistic system that predicts or generates text sequences, having evolved historically from statistical n-gram methods to neural architectures. Modern language models combine neural networks, scale, and Transformer designs for autoregressive text modeling, enabling tasks from completion to understanding.
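The autoregressive factorization that next-token prediction optimizes can be written as:

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P\big(w_t \mid w_1, \dots, w_{t-1}\big)
```

Training maximizes this likelihood over a corpus; generation samples one token at a time from each conditional.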
large language model
LLM
A large language model (LLM) is a Transformer-based neural network with billions of parameters, trained on vast datasets for next-token prediction, enabling broad NLP tasks and emergent abilities. LLMs dominate AI due to their human-like text generation and adaptability across domains.
Low-Rank Adaptation
LoRA
LoRA (Low-Rank Adaptation) fine-tunes LLMs by freezing the pretrained weights and injecting trainable low-rank matrices into the weight updates, approximating full fine-tuning with far fewer parameters. It decomposes the weight update as a product of two low-rank matrices whose rank r is much smaller than the layer dimensions, enabling efficient task adaptation.
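A numpy sketch of the LoRA forward pass with hypothetical dimensions: the frozen weight W is untouched, while only the small A and B matrices would be trained (B starts at zero, so the model is initially unchanged):

```python
import numpy as np

d, r, alpha = 64, 4, 8                   # hidden size, LoRA rank, scaling (illustrative)
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen pretrained weight

A = 0.01 * rng.standard_normal((r, d))   # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init: delta starts at 0

def lora_forward(x):
    # base path plus low-rank update: x W^T + (alpha / r) * x A^T B^T
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((2, d))
y = lora_forward(x)
```

Here the trainable parameter count is 2 * d * r = 512, versus d * d = 4096 for full fine-tuning of this one layer.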
model
A model in LLMs is a neural network trained on data to perform tasks like text generation or prediction, forming the foundational computational structure enabling language understanding and generation. It scales with data and parameters to exhibit emergent capabilities like in-context learning and reasoning.
model merging
Model merging combines multiple fine-tuned LLMs into a single model by averaging weights, resolving conflicts via task arithmetic or projection methods to preserve capabilities. Techniques like TIES or SLERP mitigate interference, enabling efficient knowledge fusion without retraining.
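A sketch of two common merge rules applied per weight tensor, linear interpolation and SLERP (spherical interpolation over the flattened weights); real merging tools apply such rules layer by layer, often with task-arithmetic or interference-resolution steps on top:

```python
import numpy as np

def merge_tensor(w_a, w_b, t=0.5, method="lerp"):
    if method == "lerp":                      # simple weight averaging
        return (1.0 - t) * w_a + t * w_b
    # SLERP: interpolate along the great circle between the two weight vectors
    a, b = w_a.ravel(), w_b.ravel()
    cos_omega = np.clip(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    out = (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)
    return out.reshape(w_a.shape)
```

SLERP degenerates when the two tensors are (anti)parallel (sin(omega) is zero); practical implementations fall back to linear interpolation in that case.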
multimodal
Multimodal refers to LLMs extended to process and generate across multiple data modalities, such as text paired with images, audio, or video, via unified tokenization and cross-attention mechanisms. This enables tasks like visual question answering or captioning, integrating modality-specific encoders into the core transformer.
Parameter-Efficient Fine-Tuning
PEFT
PEFT (Parameter-Efficient Fine-Tuning) encompasses methods to adapt pre-trained LLMs for specific tasks by updating a small subset of parameters, avoiding full fine-tuning's compute costs. Approaches like adapters or prompt tuning inject task-specific modules, enabling scalability to massive models.
pretrained
Pretraining is the initial training phase where a model learns general representations from unlabeled data via self-supervision, like masked or next-token prediction. It transfers knowledge efficiently to downstream tasks, minimizing labeled data needs and unlocking scaling and emergent skills before fine-tuning.
Pretrained Model
A pretrained model is a language model trained on massive text corpora via self-supervised tasks like next-token prediction, acquiring broad knowledge without task-specific labels. It bootstraps capabilities like in-context learning, enabling efficient adaptation to new tasks and exceeding billions of parameters when scaled.
proximal policy optimization
PPO
PPO is an on-policy reinforcement learning algorithm used in RLHF to update the LLM policy model by maximizing a clipped surrogate objective, ensuring stable training through trust-region constraints. It balances reward maximization with KL-divergence penalties to prevent large policy shifts.
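The clipped surrogate can be sketched as follows (per-action, to be maximized; logp_new and logp_old are the action log-probabilities under the current and behavior policies, and the advantage estimate would come from the reward signal and a value baseline):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    ratio = np.exp(logp_new - logp_old)       # probability ratio r_t
    # min(r*A, clip(r, 1-eps, 1+eps)*A): clipping removes the incentive to
    # move the policy far outside the trust region in a single update
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)
```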
quantization
Quantization reduces LLM precision by mapping high-bit weights and activations (e.g., FP16) to lower-bit representations (e.g., INT8 or INT4), minimizing memory footprint and inference latency. Techniques like post-training quantization preserve accuracy by calibrating rounding errors, enabling deployment on resource-constrained hardware.
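A minimal sketch of symmetric absmax INT8 quantization, the simplest post-training scheme (production methods add per-channel or per-group scales and calibration data):

```python
import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0           # map the absmax onto the int8 range
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.linspace(-1.0, 1.0, 11, dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)                  # reconstruction error <= scale / 2
```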
Quantized Low-Rank Adaptation
QLoRA
QLoRA extends LoRA by combining 4-bit quantization (via NF4 and double quantization) with paged optimizers to fine-tune billion-parameter LLMs on consumer GPUs. It maintains performance parity with 16-bit full tuning while dramatically reducing memory requirements.
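As a simplified stand-in for QLoRA's storage scheme (real QLoRA uses the NF4 quantile levels from the paper and additionally quantizes the per-block scales, the "double quantization" step), here is blockwise 4-bit absmax quantization:

```python
import numpy as np

def quantize_4bit_blocks(w, block=64):
    w = w.reshape(-1, block)                  # quantize in fixed-size blocks
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit range
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, scale = quantize_4bit_blocks(w)
w_hat = (q * scale).ravel()                   # dequantize for the forward pass
```

In QLoRA only the dequantized base weights are used in the forward pass; gradients flow solely into the LoRA adapter matrices.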
reinforcement learning from human feedback
RLHF
RLHF aligns LLMs with human preferences through a multi-stage process: training a reward model on ranked response pairs, then using reinforcement learning to optimize the policy model against this reward. Typically employing PPO, it maximizes expected reward while constraining deviation from a reference model.
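The optimization stage can be summarized (in standard notation, with r_phi the learned reward model and beta the KL coefficient) as:

```latex
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\big[ r_\phi(x, y) \big]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
```

The KL term is what keeps the policy close to the reference model while reward is maximized.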
Rotary Position Embedding
RoPE
RoPE (Rotary Position Embedding) is a positional encoding method that encodes absolute positions using rotation matrices while making attention scores depend only on relative positions. It improves generalization to longer sequences and has become standard in modern open LLMs such as Llama.
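A numpy sketch for a single query or key vector: consecutive feature pairs are rotated by position-dependent angles, which makes attention dot products depend only on relative offsets:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    d = x.shape[-1]                               # feature dim, must be even
    theta = base ** (-np.arange(0, d, 2) / d)     # per-pair rotation frequencies
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]                     # rotate each (x1, x2) pair
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out
```

The rotation preserves vector norms, and rope(q, m) dotted with rope(k, n) depends only on m - n, which is what yields the relative-position behavior.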
supervised fine-tuning
SFT
Supervised fine-tuning (SFT) adapts a pretrained model on labeled instruction-response pairs to improve task-specific performance, like following directives. It aligns general models to user needs with minimal data and precedes RLHF, enhancing instruction adherence and reducing hallucinations.
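A sketch of the common loss-masking convention for one example, assuming per-token log-probabilities are given: prompt tokens are masked out so only response tokens contribute to the objective:

```python
import numpy as np

def sft_loss(token_logprobs, response_mask):
    # average negative log-likelihood over response tokens only; prompt
    # tokens (mask 0) are excluded from the objective
    lp = np.asarray(token_logprobs, dtype=float)
    mask = np.asarray(response_mask, dtype=float)
    return -(lp * mask).sum() / mask.sum()
```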
transformer
A Transformer is a neural architecture using self-attention mechanisms to process sequences in parallel, revolutionizing NLP by handling long-range dependencies efficiently. Transformers form the backbone of LLMs, with decoder-only variants dominating generative tasks and enabling scalable training on huge datasets.
transformer architecture
The Transformer architecture consists of encoder and/or decoder stacks with multi-head self-attention, feed-forward layers, and positional encodings for sequence modeling. It enables parallelization and captures context, with decoder-only Transformers powering autoregressive generation in models like GPT and Llama.
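The central operation can be sketched as scaled dot-product attention for one head (sequence-by-dimension matrices; the causal mask is what keeps decoder-only models autoregressive):

```python
import numpy as np

def attention(Q, K, V, causal=True):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise query-key similarity
    if causal:                                    # block attention to future tokens
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, w = attention(Q, K, V)
```

Multi-head attention runs several such heads in parallel on projected subspaces and concatenates the results before the feed-forward layer.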