LLM Reference

InternLM XComposer2 VL 7B

About

InternLM-XComposer2-VL-7B is a vision-language large model (VLLM) built on the InternLM2 architecture for free-form text-image comprehension and composition. It uses Partial LoRA (P-LoRA), which applies low-rank adaptation weights to image tokens only, to align the embedding space of a pre-trained Vision Transformer (ViT) with the language model while preserving the language model's text capabilities. Pretraining on datasets such as COCO and TextCaps refines general semantics and strengthens visual grounding, followed by supervised fine-tuning on a mix of vision-language tasks. The model performs well at image captioning, visual question answering, and creative text-image composition, and can handle high-resolution images and fine-grained detail. The InternLM-XComposer2-VL-7B family also includes a 4-bit quantized variant for reduced VRAM usage, along with other variants for high-resolution understanding and long-context inputs.
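A minimal inference sketch using Hugging Face transformers, assuming the public `internlm/internlm-xcomposer2-vl-7b` checkpoint and the `model.chat()` helper its remote code exposes (both per the model card; treat the exact signature, the `<ImageHere>` placeholder token, and the `example.jpg` path as assumptions):

```python
def build_query(prompt: str) -> str:
    """InternLM-XComposer2 marks the image position in the prompt
    with an <ImageHere> placeholder token (per the model card)."""
    return f"<ImageHere>{prompt}"


if __name__ == "__main__":
    # Heavy dependencies and the ~7B checkpoint download are kept
    # behind the main guard; requires a CUDA GPU in fp16.
    import torch
    from transformers import AutoModel, AutoTokenizer

    ckpt = "internlm/internlm-xcomposer2-vl-7b"  # assumed model id
    model = AutoModel.from_pretrained(
        ckpt, torch_dtype=torch.float16, trust_remote_code=True
    ).cuda().eval()
    tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

    query = build_query("Please describe this image in detail.")
    with torch.no_grad():
        # chat() signature follows the model card's remote code (assumption)
        response, _ = model.chat(
            tokenizer, query=query, image="example.jpg",
            history=[], do_sample=False,
        )
    print(response)
```

The 4-bit quantized variant mentioned above would be loaded the same way from its own checkpoint id, trading some accuracy for lower VRAM.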

Capabilities

Vision, Multimodal, Reasoning, Function Calling, Tool Use, Structured Outputs, Code Execution

Specifications

Released: 2024-04-09
Parameters: 7B
Architecture: Decoder-only
Specialization: General
Training: Fine-tuning

Created by

Innovative AI research for societal impact

Shanghai, China
Founded 2023