InternLM XComposer2 4KHD 7B
About
InternLM-XComposer2-4KHD-7B is a large vision-language model designed to understand high-resolution images up to 4K HD (3840 x 1600 pixels). Built on the InternLM2 architecture, it uses dynamic image partitioning: each image is divided into smaller patches in a layout that preserves the original aspect ratio. This lets the model pick up fine-grained visual details, which suits tasks such as image captioning, visual question answering, and high-resolution OCR. The model pairs a lightweight vision encoder with the InternLM2-7B language model and uses Partial LoRA to align the two efficiently. Its capabilities extend to applied uses such as automated marketing or e-commerce image captioning, and it is reported to be competitive with models like GPT-4V and Gemini Pro. It requires substantial GPU resources, with memory usage reported near 80GB.
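To make the partitioning idea concrete, here is a minimal sketch of how a patch grid could be chosen so that it follows the image's aspect ratio. It is not the model's actual preprocessing code. The function name `plan_partition`, the 336-pixel patch size, and the budget of 55 patches are assumptions made for illustration; none of them come from the description above.

```python
def plan_partition(width, height, patch_size=336, max_patches=55):
    """Pick a patch grid that follows the image's aspect ratio.

    Hypothetical helper: patch_size and max_patches are illustrative
    assumptions, not values taken from the model description.
    Returns (cols, rows, padded_width, padded_height).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if max_patches < 1:
        raise ValueError("max_patches must be at least 1")

    def rows_for(cols):
        # Integer ceiling of cols * height / width, so the grid is
        # never shorter than the aspect ratio requires.
        return -(-cols * height // width)

    # Grow the number of columns while the whole grid fits the budget.
    cols = 1
    while (cols + 1) * rows_for(cols + 1) <= max_patches:
        cols += 1

    # Extremely tall images can exceed the budget even with one column;
    # clamp the rows in that case (the aspect ratio is then only approximate).
    rows = min(rows_for(cols), max_patches)

    # The image would be resized and padded to this size before being
    # cut into cols * rows patches of patch_size x patch_size pixels.
    return cols, rows, cols * patch_size, rows * patch_size


if __name__ == "__main__":
    cols, rows, padded_w, padded_h = plan_partition(3840, 1600)
    print(f"{cols} x {rows} patches, padded canvas {padded_w} x {padded_h}")
```

Under these assumed values, a 3840 x 1600 image maps to an 11 x 5 grid (55 patches) on a 3696 x 1680 canvas, so the layout stays close to the original proportions instead of being squeezed into a square.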