LLM Reference

InternLM XComposer2 VL 7B

About

InternLM-XComposer2-VL-7B is an advanced vision-language large model (VLLM) built on the InternLM2 architecture, designed for robust text-image comprehension and composition. It uses Partial LoRA (P-LoRA) to align the embedding space of a pre-trained Vision Transformer (ViT) with that of the language model, enhancing multimodal understanding. The model is pretrained on datasets such as COCO and TextCaps to refine general semantics and strengthen visual capabilities, then supervised fine-tuned on a range of vision-language tasks. It excels at image captioning, visual question answering, and creative text-image composition, and can handle high-resolution images and fine-grained detail. The InternLM-XComposer2-VL-7B family includes a 4-bit quantized version for reduced VRAM usage, along with other variants for high-resolution understanding and long-context inputs.
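A minimal usage sketch with the Hugging Face transformers library, assuming the internlm/internlm-xcomposer2-vl-7b checkpoint, a CUDA GPU, and the model's custom chat API (loaded via trust_remote_code); the image path is a placeholder:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint name; trust_remote_code pulls in the model's custom code.
ckpt = 'internlm/internlm-xcomposer2-vl-7b'
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

# <ImageHere> marks where the image embedding is spliced into the prompt.
query = '<ImageHere>Please describe this image in detail.'
image = './example.jpg'  # hypothetical local image path

with torch.no_grad():
    response, _ = model.chat(tokenizer, query=query, image=image,
                             history=[], do_sample=False)
print(response)
```

For the 4-bit quantized variant mentioned above, the same chat interface applies but with a smaller VRAM footprint.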

Capabilities

Multimodal
Function Calling
Tool Use
JSON Mode

Specifications

Parameters: 7B
Architecture: Decoder Only
Specialization: General