Snorkel Mistral PairRM
About
Snorkel-Mistral-PairRM-DPO is a chat-optimized large language model built on Mistral-7B-Instruct-v0.2. It is aligned with human preferences using Direct Preference Optimization (DPO), with the Pairwise Reward Model (PairRM) supplying the preference signal. Trained exclusively on the UltraFeedback dataset, without input from other LLMs, it excels at generating text in conversational contexts, ranking third on the AlpacaEval 2.0 leaderboard with a score of 30.22; post-processing with PairRM best-of-16 reranking raises that score to 34.86. Limitations include the absence of built-in moderation, a possible bias toward longer responses inherited from the evaluation benchmark, and limited interpretability of its internal mechanics.
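The best-of-16 step above can be sketched as a pairwise tournament: sample N responses, judge every pair, and keep the candidate with the most wins. Here `prefer` is a toy stand-in for PairRM (the real judge is a learned pairwise reward model), and the candidate texts are purely illustrative:

```python
from itertools import combinations

def pick_best_of_n(prompt, candidates, prefer):
    """Best-of-n selection with a pairwise judge.

    prefer(prompt, a, b) returns True when response `a` is preferred
    over `b` -- a stand-in for PairRM's pairwise reward model.
    """
    wins = {i: 0 for i in range(len(candidates))}
    for i, j in combinations(range(len(candidates)), 2):
        if prefer(prompt, candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return candidates[max(wins, key=wins.get)]

# Toy judge: prefer the longer response (illustrative only; a length
# heuristic is NOT how PairRM actually scores responses).
toy_prefer = lambda prompt, a, b: len(a) > len(b)

candidates = ["Hi.", "Hello, how can I help?", "Hello!"]
print(pick_best_of_n("Greet the user", candidates, toy_prefer))
# → Hello, how can I help?
```

With PairRM best-of-16, `candidates` would be 16 samples drawn from the model for the same prompt, and the judged winner is returned to the user.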
Capabilities
Providers (2)
| Provider | Input (per 1M tokens) | Output (per 1M tokens) | Type |
|---|---|---|---|
| Together AI API | $0.20 | $0.20 | Serverless |
| Fireworks AI Platform | — | — | Provisioned |
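At the serverless rates listed above, per-request cost is a simple per-token calculation; the token counts in the example are illustrative, not measured:

```python
def request_cost(input_tokens, output_tokens,
                 input_rate_per_m=0.20, output_rate_per_m=0.20):
    """Dollar cost of one request, given per-1M-token rates.

    Defaults match the Together AI rates in the table above.
    """
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000

# e.g. a 1,500-token prompt with a 500-token completion:
print(f"${request_cost(1_500, 500):.6f}")  # → $0.000400
```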