September 12, 2025
TutorBench: Grading the Next Generation of AI Tutors

Can an AI be a great tutor? TutorBench is a new, challenging benchmark from Scale designed to find out. Moving beyond right-or-wrong answers, it grades today's leading AI models on their ability to actually teach, evaluating crucial skills such as adaptive explanation, constructive feedback, and active learning support. Using 1,500 multimodal conversations across STEM subjects, many including images of handwritten work, TutorBench reveals that even the most advanced models still have a long way to go before they master the nuanced art of tutoring, paving the way for the next generation of AI in education.
September 2, 2025
Using Rubrics to Build Better Models

How do you know whether an AI model is actually learning or just getting better at faking it? A new paper from researchers at Scale introduces Rubrics as Rewards (RaR), a framework that addresses this problem by training models against structured, expert-designed checklists instead of simple preference scores. This approach shifts the human role from preference labeler to expert architect of the AI's values, yielding up to a 28% performance gain on challenging benchmarks and offering a more transparent, effective path toward reliable AI.
July 23, 2025
The Future Is Multilingual: Scale's New Evaluation Benchmark

Building truly intelligent and equitable multilingual AI requires a new way to measure cultural reasoning. Scale's new Multilingual Native Reasoning Challenge (MultiNRC) is designed to do just that. Written from scratch by native speakers, this benchmark tests for deep linguistic and cultural understanding beyond simple translation, giving the AI community a clear target for measuring and accelerating progress.
July 23, 2025
WebGuard: A Guardrail for the Agentic Age

As AI agents become more powerful, ensuring they act safely is a critical prerequisite for deployment. This post explores WebGuard, a new benchmark from researchers at Scale, UC Berkeley, and The Ohio State University that reveals a significant safety gap in current models. Learn how high-quality, human-in-the-loop data provides a path forward, dramatically improving a model's ability to avoid risky behavior.
June 9, 2025
Precog: Scale's Platform for Data Quality Post-Training Experiments

At Scale, operations, engineering, and research teams work together to ensure the quality of our data. To do this, we rely on a combination of human review, automated linters, data distribution analyses, and model training experiments. In this post, we will focus on the last category and introduce Precog, our platform for running data quality experiments by training models on our own datasets.
June 5, 2025
It’s Time to Rethink Red Teaming

As advanced AI rapidly evolves, red teaming needs an updated approach. Scale researchers propose shifting red teaming from isolated models to full AI systems, testing them in real-world contexts with a focus on product safety and realistic threats.