LLM Reference

BERT Models by Google DeepMind

This model family is considered obsolete. Consider newer alternatives in Related Model Families below.
2 models | 2018 | Up to 512 ctx

About

BERT, short for Bidirectional Encoder Representations from Transformers, is a prominent family of large language models (LLMs) originally introduced by Google AI in 2018 [1, 3]. These models use the transformer architecture to process text bidirectionally, so the representation of each word draws on both the preceding and the following words in a sentence [8]. Two pre-training objectives, masked language modeling (MLM) and next sentence prediction (NSP), helped BERT achieve stronger results than earlier models on a range of natural language processing (NLP) tasks [10]. BERT was initially released in two configurations, BERT Base with 110 million parameters and BERT Large with 340 million parameters, both trained on BookCorpus and English Wikipedia [3]. The BERT family has since expanded to include multilingual versions and smaller models such as DistilBERT and TinyBERT, which target specific tasks and resource constraints [4]. This adaptability has made BERT integral to applications such as question answering, text classification, and named entity recognition [2].
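To make the masked language modeling objective mentioned above more concrete, here is a minimal sketch of how one training example might be corrupted. It assumes the recipe described in the original BERT paper (select about 15% of tokens; of those, replace 80% with [MASK], 10% with a random token, and leave 10% unchanged). The function name mask_tokens, the toy vocabulary, and the whitespace-level tokens are hypothetical and for illustration only; a real pipeline would operate on WordPiece token ids.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")  # special tokens are never selected for prediction


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels) for one MLM training example.

    labels[i] holds the original token at positions the model must predict,
    and None at positions that do not contribute to the MLM loss.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL:
            continue
        if rng.random() >= select_prob:
            continue  # position not selected
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80% of selected positions: mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random replacement
        # remaining 10%: keep the original token in place
    return corrupted, labels


# Toy usage with a hypothetical vocabulary (assumption, not BERT's real vocab).
demo_tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
demo_vocab = ["dog", "ran", "blue", "tree"]
demo_corrupted, demo_labels = mask_tokens(demo_tokens, demo_vocab, random.Random(0))
print(demo_corrupted)
print(demo_labels)
```

Because the model sees the unmasked words on both sides of each selected position, predicting the original token forces it to use left and right context together, which is the bidirectional behavior the description refers to.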

Current Variants

Use-when guidance is derived from seed capabilities, context, release, and replacement fields.

BERT Large (Current)

Use when the workload fits within a 512-token context and calls for 340M parameters.

2018-10 | 512 context | 340M parameters

BERT Base (Current)

Use when the workload fits within a 512-token context and calls for 110M parameters.

2018-10 | 512 context | 110M parameters

Release Timeline

1 release group

2018-10 (2 current)
BERT Base: 512 context, 110M parameters (Current)
BERT Large: 512 context, 340M parameters (Current)

Specifications (2 models)

BERT model specifications comparison

| Model      | Released | Context | Parameters |
| BERT Large | 2018-10  | 512     | 340M       |
| BERT Base  | 2018-10  | 512     | 110M       |
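Since both variants are limited to a 512-token context, longer inputs have to be shortened or split before they reach the model. The sketch below shows one common workaround, overlapping windows; it is not something this page prescribes. The function name split_into_windows, the overlap size, and the ids used for the [CLS] and [SEP] markers are assumptions for illustration, so take the real special-token ids from your tokenizer.

```python
def split_into_windows(token_ids, max_len=512, overlap=64, cls_id=101, sep_id=102):
    """Split a long list of token ids into windows of at most max_len tokens.

    Each window is wrapped as [CLS] + chunk + [SEP], so only max_len - 2
    positions carry content. Consecutive windows share `overlap` tokens so
    that text near a boundary is seen with context on both sides.
    cls_id and sep_id are placeholder values (assumption).
    """
    body_len = max_len - 2
    if not 0 <= overlap < body_len:
        raise ValueError("overlap must be in the range [0, max_len - 2)")
    step = body_len - overlap
    windows = []
    start = 0
    while True:
        chunk = token_ids[start:start + body_len]
        windows.append([cls_id] + chunk + [sep_id])
        if start + body_len >= len(token_ids):
            break  # this window reached the end of the input
        start += step
    return windows


# Toy usage: 1200 made-up token ids stand in for a long document.
long_input = list(range(1000, 2200))
for window in split_into_windows(long_input):
    print(len(window))
```

Per-window predictions then need to be merged by the caller (for example by averaging classification scores), which is a design choice that depends on the task.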

Frequently Asked Questions

What is BERT used for?
BERT is used for natural language understanding tasks such as question answering, text classification, and named entity recognition. The family description points to those workloads as the best fit.
How does BERT compare to Gemma 4?
BERT by Google DeepMind is strongest on text understanding tasks such as classification and question answering, while Gemma 4 by Google DeepMind is the closest related family to check for multimodal workloads. BERT has 2 listed variants and reaches up to 512 context, while Gemma 4 reaches up to 256k context, so compare the specs and pricing tables before choosing a production model.
Which BERT model should I use?
If price is the main constraint, check the pricing table first, keeping in mind that the local data does not include complete provider pricing for BERT. For the most capable local choice, evaluate BERT Large with its 512-token context.

Models (2)