LLM Reference

Phi-3 Vision

About

Phi-3 Vision is a multimodal model from Microsoft that combines language and vision capabilities. Unlike text-only language models, it accepts both text and images as input and can perform tasks such as optical character recognition, chart analysis, and image interpretation. Its architecture consists of an image encoder, a text-image connector, a projector that maps image features, and the Phi-3 Mini language model. At 4.2 billion parameters it is relatively small, yet it competes with larger models and suits devices with limited computational power. Its 128K-token context window supports complex multimodal reasoning. It was trained on high-quality and synthetic data and incorporates safety measures.
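The component list above can be pictured as a simple data flow. The following is a toy sketch only, not Microsoft's implementation: the function names, the vector sizes, the random weights, and the choice to place image embeddings ahead of text embeddings are all assumptions made for illustration. It shows the general idea of projecting image features into the language model's embedding space so that both modalities form one input sequence.

```python
import random

# Hypothetical sizes chosen for readability; the real model's dimensions differ.
IMAGE_FEATURE_DIM = 8   # width of each image-encoder output vector
TEXT_EMBED_DIM = 4      # width of each language-model input embedding


def encode_image(num_patches, seed=0):
    """Stand-in for the image encoder: one feature vector per image patch."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(IMAGE_FEATURE_DIM)]
            for _ in range(num_patches)]


def make_projector_weights(seed=2):
    """Random weights for the toy projector: one row per output dimension."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(IMAGE_FEATURE_DIM)]
            for _ in range(TEXT_EMBED_DIM)]


def project(features, weights):
    """Stand-in for the projector: a linear map into text-embedding space."""
    return [[sum(f * w for f, w in zip(vec, row)) for row in weights]
            for vec in features]


def embed_text(tokens, seed=1):
    """Stand-in for the language model's token embedding lookup."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(TEXT_EMBED_DIM)] for _ in tokens]


def build_input_sequence(image_embeds, text_embeds):
    """Assumed connector step: join both modalities into one sequence."""
    return image_embeds + text_embeds


if __name__ == "__main__":
    image_embeds = project(encode_image(num_patches=3), make_projector_weights())
    text_embeds = embed_text(["describe", "this", "chart"])
    sequence = build_input_sequence(image_embeds, text_embeds)
    print(len(sequence), "vectors of width", len(sequence[0]))
```

After projection, every vector in the sequence has the same width, which is what would let a decoder-only language model consume image and text positions uniformly.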

Capabilities

Vision, Multimodal, Reasoning, Function Calling, Tool Use, JSON Mode, Code Execution

Providers (3)

Provider        Input (per 1M tokens)   Output (per 1M tokens)   Type
Azure OpenAI    $0.28                   $0.84                    Provisioned
Fireworks AI    $0.20                   $0.20                    Serverless
NVIDIA NIM      -                       -                        Provisioned
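As a rough illustration of how the per-1M-token prices translate into a per-request cost, here is a minimal sketch. The prices are copied from the provider table above; the function name `estimate_cost` and the example token counts are hypothetical, and NVIDIA NIM is left out because no prices are listed for it.

```python
# Prices in USD per 1M tokens, copied from the provider table.
PRICES_PER_MILLION = {
    "Azure OpenAI": {"input": 0.28, "output": 0.84},
    "Fireworks AI": {"input": 0.20, "output": 0.20},
}


def estimate_cost(provider, input_tokens, output_tokens):
    """Return the estimated USD cost of one request for the given provider."""
    prices = PRICES_PER_MILLION[provider]
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 100,000 input tokens and 20,000 output tokens.
    for name in PRICES_PER_MILLION:
        print(f"{name}: ${estimate_cost(name, 100_000, 20_000):.4f}")
```

For that hypothetical request the estimate comes to about $0.045 on Azure OpenAI and $0.024 on Fireworks AI. Actual billing may differ, particularly for provisioned deployments.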

Specifications

Family: Phi-3
Released: 2024-05-21
Parameters: 4.2B
Context: 128K
Architecture: Decoder Only
Specialization: general
Training: finetuning

Created by

Advancing the state-of-the-art in AI and computing.

Redmond, Washington, United States
Founded 1991