WildBench
active · Composite
Metric: WB-Reward / WB-Score (higher is better)
Introduced: 2024
About
Automated LLM evaluation on 1,024 real user queries drawn from WildChat, scored by an LLM acting as judge. WB-Reward achieves a 0.98 Pearson correlation with Chatbot Arena Elo ratings for top-ranking models. The current version is V2, which receives periodic updates.
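The agreement with Chatbot Arena mentioned above is a correlation between per-model benchmark scores and per-model Elo ratings. The following is a minimal sketch of how such a figure could be computed, not WildBench's own code: the function names and the sample numbers are hypothetical and chosen only for illustration. It shows both the linear (Pearson) and the rank-based (Spearman) coefficient, since the two can differ on the same data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson (linear) correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


if __name__ == "__main__":
    # Hypothetical numbers, NOT real WildBench or Chatbot Arena results.
    benchmark_scores = [31.2, 45.8, 52.1, 60.4, 64.9]
    arena_elo = [1050, 1120, 1135, 1210, 1250]
    print("Pearson :", round(pearson(benchmark_scores, arena_elo), 3))
    print("Spearman:", round(spearman(benchmark_scores, arena_elo), 3))
```

Pearson measures how close the relationship is to a straight line, while Spearman only checks whether the two rankings agree, so a benchmark can rank models identically to Arena (Spearman of 1) and still have a Pearson value below 1.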