Florence-2 Large
About
Florence-2 Large is a vision-language foundation model from Microsoft Azure AI, released under the MIT license. Single-task models each cover one task. Florence-2 instead handles a range of vision and vision-language tasks, including image captioning, object detection, visual grounding, and segmentation, through one prompt-based interface: a text prompt specifies the task, and the model generates its answer as text. With 0.77 billion parameters it is compact, yet it performs competitively with much larger models, largely because it was trained on the large-scale FLD-5B dataset. The architecture is a sequence-to-sequence framework that pairs a DaViT image encoder with a multi-modality encoder-decoder. A single compact model covers many tasks and delivers strong zero-shot and fine-tuning performance, which makes Florence-2 a practical candidate for deployment on resource-constrained devices.
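The following self-contained Python sketch illustrates the prompt-based interface and text-only outputs. It assumes task tokens such as `<OD>` and `<CAPTION_TO_PHRASE_GROUNDING>`, as used in the Hugging Face examples for `microsoft/Florence-2-large`. It also assumes that box coordinates are written as `<loc_N>` tokens over 1,000 bins per axis. The helper names (`build_prompt`, `box_to_tokens`, `parse_boxes`) and the exact binning formula are hypothetical illustrations, not the official processor code.

```python
import math
import re

# Assumed task tokens (a subset); check the model card for the full list.
TASKS_WITHOUT_TEXT = {"<CAPTION>", "<DETAILED_CAPTION>", "<OD>", "<OCR>"}
TASKS_WITH_TEXT = {"<CAPTION_TO_PHRASE_GROUNDING>", "<REFERRING_EXPRESSION_SEGMENTATION>"}

NUM_BINS = 1000  # assumed number of location bins per image axis


def build_prompt(task: str, text: str = "") -> str:
    """Compose a prompt: a task token, optionally followed by free text (hypothetical helper)."""
    if task in TASKS_WITH_TEXT:
        if not text:
            raise ValueError(f"{task} requires a text input")
        return task + text
    if task in TASKS_WITHOUT_TEXT:
        if text:
            raise ValueError(f"{task} takes no text input")
        return task
    raise ValueError(f"unknown task {task!r}")


def quantize(coord: float, size: int) -> int:
    """Map a pixel coordinate to a bin index, clamped to [0, NUM_BINS - 1]."""
    index = math.floor(coord * NUM_BINS / size)
    return max(0, min(NUM_BINS - 1, index))


def dequantize(index: int, size: int) -> float:
    """Map a bin index back to the pixel coordinate at the center of that bin."""
    return (index + 0.5) * size / NUM_BINS


def box_to_tokens(box, width: int, height: int) -> str:
    """Encode an (x0, y0, x1, y1) pixel box as four <loc_N> tokens."""
    x0, y0, x1, y1 = box
    indices = [quantize(x0, width), quantize(y0, height),
               quantize(x1, width), quantize(y1, height)]
    return "".join(f"<loc_{i}>" for i in indices)


def parse_boxes(text: str, width: int, height: int):
    """Parse 'label<loc_a><loc_b><loc_c><loc_d>' runs into (label, box) pairs."""
    pattern = re.compile(r"([^<]+)((?:<loc_\d+>){4})")
    results = []
    for label, locs in pattern.findall(text):
        a, b, c, d = (int(n) for n in re.findall(r"<loc_(\d+)>", locs))
        box = (dequantize(a, width), dequantize(b, height),
               dequantize(c, width), dequantize(d, height))
        results.append((label.strip(), box))
    return results


if __name__ == "__main__":
    print(build_prompt("<CAPTION_TO_PHRASE_GROUNDING>", "a dog"))
    tokens = box_to_tokens((64, 48, 320, 240), width=640, height=480)
    print(tokens)
    print(parse_boxes("dog" + tokens, width=640, height=480))
```

Because every task, including detection, is expressed as a prompt in and text out, one decoder can serve all of them. In real use, the processor that ships with the model (loaded with `trust_remote_code=True`) builds the prompts and parses the outputs through its `post_process_generation` method. This sketch only shows the general idea.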