PaddleOCR VL Models by Baidu AI
1 model2025Up to 16K ctxFrom $0.02/1M input
About
PaddleOCR VL is Baidu's ultra-compact vision-language model family for multilingual document parsing and OCR. The flagship 0.9B model combines a NaViT-style dynamic resolution visual encoder with ERNIE-4.5-0.3B and supports 109 languages. Achieved SOTA on OmniDocBench V1.5 at launch.
Current Variants
Use-when guidance is derived from seed capabilities, context, release, and replacement fields.
1 in view
PaddleOCR VLCurrent
Use when the workload needs vision, 16K context, and 900M parameters.
2025-10vision16K context900M parameters
| Model | Use when | Released | Signals | Status |
|---|---|---|---|---|
| PaddleOCR VL | Use when the workload needs vision, 16K context, and 900M parameters. | 2025-10 | vision16K context900M parameters | Current |
Release Timeline
1 release group2025-10
1 current
PaddleOCR VL
Currentvision16K context900M parameters
Specifications(1 models)
| Model | Released | Context | Parameters | Vision | Multimodal |
|---|---|---|---|---|---|
| PaddleOCR VL | 2025-10 | 16K | 0.9B | Yes | Yes |
Available From(1 provider)
Pricing
| Model | Provider | Input / 1M | Output / 1M | Type |
|---|---|---|---|---|
| PaddleOCR VL | Novita AI | $0.02 | $0.02 | Serverless |
Frequently Asked Questions
- What is PaddleOCR VL used for?
- PaddleOCR VL is used for vision, vision and multimodal work, and coding. The family description and listed model capabilities point to those workloads as the best fit.
- How does PaddleOCR VL compare to ERNIE 4.5?
- PaddleOCR VL by Baidu AI is strongest where you need vision, while ERNIE 4.5 by Baidu AI is the closest related family to check for vision and multimodal work. PaddleOCR VL has 1 listed variant and reaches up to 16K context, while ERNIE 4.5 reaches up to 128K context, so compare the specs and pricing tables before choosing a production model.
- Which PaddleOCR VL model should I use?
- For the lowest listed input price, start with PaddleOCR VL through Novita AI at $0.02/1M input tokens. For the most capable/latest local choice, evaluate PaddleOCR VL with 16K context and multimodal inputs.


