Blog
Company Updates & Technology Articles
September 17, 2025
From Prototype to Production: Unlocking Mission-Ready AI

This agreement, known as an Other Transaction Authority (OTA), is designed specifically to help the DoD move at speed and partner with non-traditional tech companies like Scale. It streamlines the procurement process, allowing any component across the entire DoD to access our end-to-end AI platform.
September 16, 2025
How Morgan Stanley deploys AI that actually works (hint: it's evals) | Human in the Loop: Episode 13

Kaitlin Elliott, who leads firmwide Generative AI Solutions at Morgan Stanley, joined us in the studio to unpack how AI evaluations powered the firm’s successful adoption of production GenAI. This is a real-world case study you don't want to miss.
September 15, 2025
Smoothing Out LLM Variance for Reliable Enterprise Evals

A critical challenge in enterprise AI development is the instability of LLM evaluations. Our internal testing revealed that metrics on identical A/B tests can swing by as much as 15% from one day to the next. This level of variance is large enough to invalidate results, making principled, incremental improvement a game of chance. In this post, we dive into the root cause: an industry-wide phenomenon created by the interplay of Sparse Mixture of Experts (MoE) architecture and the batched inference common to provider APIs. By implementing a "cohort of judges," a small panel of LLMs with semantically similar but varied prompts, we successfully reduce this variance by at least 50%. This creates the stable, trustworthy measurement foundation needed to confidently build and improve AI agents.
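The "cohort of judges" idea can be illustrated with a small simulation. This is a sketch under assumed conditions, not Scale's implementation: each stand-in "judge" is a noisy scorer (mimicking the run-to-run nondeterminism the post attributes to MoE routing and batched inference), and averaging a small panel shrinks the spread of the final metric.

```python
import random
import statistics

def make_judge(bias: float, noise: float, rng: random.Random):
    """Hypothetical stand-in for one LLM judge: a 0-100 score with
    independent run-to-run noise. The noise model is illustrative."""
    def judge(answer: str) -> float:
        return 70.0 + bias + rng.gauss(0.0, noise)
    return judge

def cohort_score(judges, answer: str) -> float:
    """Average a small panel of judges instead of trusting one."""
    return statistics.mean(j(answer) for j in judges)

rng = random.Random(0)
# Five judges with semantically similar but varied prompts ->
# modeled here as small biases around the same underlying score.
judges = [make_judge(bias=b, noise=5.0, rng=rng)
          for b in (-1.0, -0.5, 0.0, 0.5, 1.0)]

single_runs = [judges[0]("some model answer") for _ in range(500)]
cohort_runs = [cohort_score(judges, "some model answer") for _ in range(500)]

print(f"single-judge stdev: {statistics.stdev(single_runs):.2f}")
print(f"cohort stdev:       {statistics.stdev(cohort_runs):.2f}")
```

With five independent judges, the standard deviation of the averaged score drops by roughly a factor of √5, which is consistent with the "at least 50%" variance reduction the post reports.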
September 12, 2025
TutorBench: Grading the Next Generation of AI Tutors

Can an AI be a great tutor? TutorBench is a new, challenging benchmark from Scale designed to find out. Moving beyond right or wrong answers, it grades today's leading AI models on their ability to actually teach: evaluating crucial skills like adaptive explanation, constructive feedback, and active learning support. Using 1,500 multimodal conversations across STEM subjects, many including images of handwritten work, TutorBench reveals that even the most advanced models still have a long way to go to master the nuanced art of tutoring, paving the way for the next generation of AI in education.
September 4, 2025
Scale's Commitment to Empower the Next Generation with AI Literacy

Scale is committed to building a brighter, stronger future for America by improving AI literacy among students and teachers across the country. We believe AI can be a tool for creativity, problem solving, and discovery, whether that means addressing local challenges, sparking curiosity in the classroom, or opening doors to future opportunities. That’s why today, we are proud to share Scale’s commitment to advancing AI literacy and expanding access to AI learning for educators and students nationwide.
September 3, 2025
Toolchaining: The Problem No One is Talking About

We found the standard approach to toolchaining insufficient. Simply giving the LLM access to multiple tools and asking it to clean the data produced extremely poor results, when the model could complete the task at all. Instead, when we gave the LLM access to a Python sandbox pre-loaded with these tools and asked it to develop the aforementioned plan, the output improved significantly. For the remainder of this blog, we dive into why this happened, how we set up our experiment, and what the findings mean for you.
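The sandbox pattern can be sketched in a few lines. Everything here is an assumed, minimal design rather than Scale's code: the "tools" are toy functions, and the model's output is hard-coded as a script so the execution path is visible. The key idea is that the tools live together in one Python namespace, so the model can chain them in a single script instead of making one tool call at a time.

```python
def dedupe(rows):
    """Toy 'tool' 1: drop duplicates while preserving order."""
    return list(dict.fromkeys(rows))

def normalize(rows):
    """Toy 'tool' 2: trim whitespace and lowercase."""
    return [r.strip().lower() for r in rows]

SANDBOX_TOOLS = {"dedupe": dedupe, "normalize": normalize}

# In a real system this script would be generated by the LLM; it is
# hard-coded here for illustration.
llm_script = """
cleaned = dedupe(normalize(rows))
"""

def run_in_sandbox(script: str, rows):
    # Pre-load the tools plus the data into one namespace, then run
    # the model's script against it. A production sandbox would
    # isolate this exec call rather than run it in-process.
    namespace = dict(SANDBOX_TOOLS, rows=rows)
    exec(script, namespace)
    return namespace["cleaned"]

result = run_in_sandbox(llm_script, [" Alice", "alice", "Bob ", "bob"])
print(result)
```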
September 2, 2025
Using Rubrics to Build Better Models

How do you know if an AI model is actually learning, or just getting better at faking it? A new paper from researchers at Scale introduces Rubrics as Rewards (RaR), a framework that solves this problem by training models with structured, expert-designed checklists instead of simple preference scores. This approach moves the human role from a simple preference labeler to an expert architect of the AI's values, resulting in up to a 28% performance leap on challenging benchmarks and providing a more transparent, effective path toward reliable AI.
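The core mechanic of a rubric-based reward can be sketched as a weighted checklist. The criteria, weights, and check functions below are invented for illustration; the RaR paper defines its own expert-designed rubrics and training loop. The point is only that the reward becomes a structured, inspectable score rather than an opaque preference label.

```python
# Each rubric item: (description, weight, check). All three are
# illustrative assumptions, not the paper's actual rubrics.
RUBRIC = [
    ("states the final answer explicitly", 2.0,
     lambda ans: "answer:" in ans.lower()),
    ("shows intermediate steps", 1.0,
     lambda ans: "step" in ans.lower()),
    ("justifies its reasoning", 1.0,
     lambda ans: "because" in ans.lower()),
]

def rubric_reward(answer: str) -> float:
    """Weighted fraction of rubric items satisfied, in [0, 1]."""
    total = sum(w for _, w, _ in RUBRIC)
    earned = sum(w for _, w, check in RUBRIC if check(answer))
    return earned / total

good = "Step 1: compute x. Because x = 2, Answer: 4"
bad = "4"
print(rubric_reward(good), rubric_reward(bad))  # 1.0 0.0
```

Because each item is named and weighted, a low reward tells you which behavior is missing, which is what makes the signal more transparent than a single preference score.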
August 25, 2025
Scale AI and Department of Defense Expand Partnership to Advance Army R&D

Scale AI has been awarded a $99 million contract by the U.S. Department of Defense to accelerate Army research and development in artificial intelligence. Building on its expanding partnership with the Pentagon, Scale will deliver data operations, platforms, and engineering support to help the Army adopt AI across critical missions.
August 21, 2025
We decided if these viral AI agent demos are hype or real | Human in the Loop: Episode 12

In today’s episode, Scale’s enterprise team (Clemens Viernickel, Mark Pfeiffer, Sam Denton, and Felix Su) review several viral AI agent demos from the internet and assess how realistically each one, in its current form, could be deployed in an enterprise environment. What do you think of their votes?
August 19, 2025
AI Doesn’t Live in Text Alone

AI is moving beyond text, toward agents that can listen, speak, and interact naturally with the world. Voice AI requires far more than words; it demands the nuanced tones, emotions, and dynamics of human speech. But unlike text, there’s no vast public library of labeled audio to train on. Scale is building that foundation, delivering high-quality, diverse, and emotionally rich speech data to power every stage of model development. From real-time conversation to multimodal perception, these datasets are unlocking the next era of human-computer interaction. The future is listening.