MAI Models by Microsoft AI
Microsoft AI · Proprietary
5 models · 2026 · Up to 33K context · From $0.36/1M input
About
The Microsoft AI (MAI) models were announced on April 2, 2026. They form a suite of foundation models for speech recognition, speech generation, and image understanding and generation, and are available through the Microsoft Foundry enterprise platform.
Current Variants
Use-when guidance is derived from seed capabilities, context, release, and replacement fields.
| Model | Use when | Released | Signals | Status |
|---|---|---|---|---|
| MAI-DS-R1 | Use when the workload needs reasoning. | 2026-05 | reasoning | Current |
| MAI-Image-2e | Use when the workload needs image, 33K context, and multimodal inputs. | 2026-04 | image, 33K context, multimodal inputs | Current |
| MAI-Transcribe-1 | Use when the workload needs audio and multimodal inputs. | 2026-04 | audio, multimodal inputs | Current |
| MAI-Voice-1 | Use when the workload needs audio and multimodal inputs. | 2026-04 | audio, multimodal inputs | Current |
| MAI-Image-2 | Use when the workload needs image and multimodal inputs. | 2026-03 | image, multimodal inputs | Current |
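The "Use when" sentences in the table read as if they were produced from each model's signals with a simple template. The sketch below reproduces that wording under that assumption. The function name `use_when` and the joining rule are hypothetical illustrations, not the catalog's actual generator.

```python
def use_when(signals):
    """Build a 'Use when the workload needs ...' sentence from a list of signals.

    Hypothetical helper: the joining rule is inferred from the table wording,
    not taken from any published implementation.
    """
    if not signals:
        raise ValueError("at least one signal is required")
    if len(signals) == 1:
        needs = signals[0]
    elif len(signals) == 2:
        needs = f"{signals[0]} and {signals[1]}"
    else:
        # Three or more signals: comma-separated with a final ", and".
        needs = ", ".join(signals[:-1]) + f", and {signals[-1]}"
    return f"Use when the workload needs {needs}."


if __name__ == "__main__":
    print(use_when(["reasoning"]))
    print(use_when(["audio", "multimodal inputs"]))
    print(use_when(["image", "33K context", "multimodal inputs"]))
```

Called with the signals listed for MAI-DS-R1, MAI-Voice-1, and MAI-Image-2e, the helper returns the same sentences that appear in the "Use when" column.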
Release Timeline
3 release groups
2026-05 (1 current)
- MAI-DS-R1: Current; reasoning
2026-04 (3 current)
- MAI-Image-2e: Current; image, 33K context, multimodal inputs
- MAI-Transcribe-1: Current; audio, multimodal inputs
- MAI-Voice-1: Current; audio, multimodal inputs
2026-03 (1 current)
- MAI-Image-2: Current; image, multimodal inputs
Specifications (5 models)
| Model | Released | Context | Vision | Multimodal | Reasoning |
|---|---|---|---|---|---|
| MAI-DS-R1 | 2026-05 | — | No | No | Yes |
| MAI-Image-2e | 2026-04 | 33K | Yes | Yes | No |
| MAI-Transcribe-1 | 2026-04 | — | No | Yes | No |
| MAI-Voice-1 | 2026-04 | — | No | Yes | No |
| MAI-Image-2 | 2026-03 | — | Yes | Yes | No |
Available From (1 provider)
Pricing
| Model | Provider | Input / 1M | Output / 1M | Type |
|---|---|---|---|---|
| MAI-Transcribe-1 | Microsoft Foundry | $0.36 | — | Serverless |
| MAI-Image-2 | Microsoft Foundry | $5 | $33 | Serverless |
| MAI-Voice-1 | Microsoft Foundry | $22 | — | Serverless |
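To make the per-1M prices concrete, the sketch below turns the listed figures into a rough cost estimate. It assumes the prices scale linearly with usage and that a "—" in the table means no output price is listed. The page does not define the billing unit for each model, so the code calls it a generic "unit". The names `PRICES`, `estimate_cost`, and `cheapest_by_input` are hypothetical.

```python
# Prices copied from the pricing table (USD per 1M units).
# None marks a price the table does not list.
PRICES = {
    "MAI-Transcribe-1": {"input": 0.36, "output": None},
    "MAI-Image-2": {"input": 5.0, "output": 33.0},
    "MAI-Voice-1": {"input": 22.0, "output": None},
}


def estimate_cost(model, input_units, output_units=0):
    """Estimate the USD cost for a model, assuming linear per-unit pricing."""
    price = PRICES[model]
    cost = input_units / 1_000_000 * price["input"]
    if output_units:
        if price["output"] is None:
            raise ValueError(f"no output price listed for {model}")
        cost += output_units / 1_000_000 * price["output"]
    return cost


def cheapest_by_input():
    """Return the model with the lowest listed input price."""
    return min(PRICES, key=lambda name: PRICES[name]["input"])


if __name__ == "__main__":
    print(cheapest_by_input())
    print(estimate_cost("MAI-Image-2", 2_000_000, 1_000_000))
```

Under these assumptions, 2M input units plus 1M output units on MAI-Image-2 come to 2 × $5 + 1 × $33 = $43. Treat the result as an estimate and confirm the billing units with the provider before budgeting.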
Frequently Asked Questions
- What is MAI used for?
- MAI is used for reasoning, image, and audio workloads. The family description and the listed model capabilities point to those workloads as the best fit.
- How does MAI compare to Claude 3?
- MAI by Microsoft AI is strongest where you need reasoning, while Claude 3 by Anthropic is the closest related family to check for vision and multimodal work. MAI has 5 listed variants and reaches up to 33K context; Claude 3 reaches up to 200K context. Compare the specs and pricing tables before choosing a production model.
- Which MAI model should I use?
- For the lowest listed input price, start with MAI-Transcribe-1 through Microsoft Foundry at $0.36/1M input tokens. For image work with the largest listed context, evaluate MAI-Image-2e, which offers 33K context and multimodal inputs.
