LLM Reference
WildBench · active · Composite

WildBench

Metric: WB-Reward / WB-Score (higher is better)
Introduced: 2024

About

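WildBench is an automated LLM evaluation built on 1,024 real user queries selected from WildChat and scored by an LLM judge. Its WB-Reward metric achieves a 0.98 Pearson correlation with Chatbot Arena Elo ratings for top-ranking models. The current release is V2, with periodic updates.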
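
The correlation figure above compares per-model benchmark scores with Chatbot Arena Elo ratings. As a minimal sketch of how such an agreement number can be computed, the snippet below implements a linear (Pearson) and a rank (Spearman) correlation in plain Python. The function names and all score and Elo values are hypothetical, made up for illustration only; they are not WildBench or Chatbot Arena results, and this is not the benchmark's own evaluation code.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Linear (Pearson) correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """Rank each value from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Rank (Spearman) correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-model values, one entry per model (NOT real results).
benchmark_scores = [62.0, 55.5, 48.0, 41.2, 30.1]
arena_elo = [1250, 1210, 1190, 1150, 1100]

print("Pearson: ", round(pearson(benchmark_scores, arena_elo), 3))
print("Spearman:", round(spearman(benchmark_scores, arena_elo), 3))
```

A value close to 1 means the benchmark orders models (Spearman) or scales their scores (Pearson) in much the same way as the human-voted ratings do.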