InternLM XComposer2 VL 7B
About
InternLM-XComposer2-VL-7B is a vision-language large model (VLLM) built on the InternLM2 architecture and designed for robust text-image comprehension and composition. It uses Partial LoRA (P-LoRA) to align the embedding space of a pre-trained Vision Transformer (ViT) with that of the language model: the low-rank adaptation weights are applied only to image tokens, preserving the language model's text abilities while enhancing multimodal understanding.

The model is pretrained on datasets such as COCO and TextCaps to refine general semantics and strengthen its visual capabilities, then supervised fine-tuned on a range of vision-language tasks. It excels at image captioning, visual question answering, and creative text-image composition, and can handle high-resolution images and fine-grained detail.

The InternLM-XComposer2-VL-7B family includes a 4-bit quantized variant for reduced VRAM usage, along with further variants for high-resolution understanding and long-context inputs.
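The P-LoRA mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed shapes and names (`plora_forward`, the toy dimensions, and the mask layout are all hypothetical), not the model's actual implementation: the frozen base projection is applied to every token, while the low-rank update is added only at positions flagged as image tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, seq_len = 8, 2, 5

# Frozen base projection weight of a linear layer (stands in for a
# pre-trained LLM weight; shapes are toy-sized for illustration).
W = rng.normal(size=(d_model, d_model))

# Low-rank LoRA factors (trainable in practice). B starts at zero,
# so the low-rank update is initially a no-op, as in standard LoRA.
A = rng.normal(size=(d_model, rank))
B = np.zeros((rank, d_model))

def plora_forward(x, image_mask):
    """Apply the base projection to all tokens, and add the
    low-rank update only where image_mask is True (Partial LoRA)."""
    base = x @ W              # frozen path, all tokens
    lora = (x @ A) @ B        # low-rank path
    return base + image_mask[:, None] * lora  # update image tokens only

x = rng.normal(size=(seq_len, d_model))
image_mask = np.array([False, True, True, False, False])  # tokens 1-2 are image tokens

out = plora_forward(x, image_mask)
# With B initialized to zero, the output equals the plain projection.
assert np.allclose(out, x @ W)
```

The key design point is the mask: text tokens always take the frozen pre-trained path, so adapting the model to visual inputs does not perturb its language behavior.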