Amazon’s bet that AI benchmarks don’t matter

December 2, 2025

This is an excerpt of Sources by Alex Heath, a newsletter about AI and the tech industry, syndicated just for The Verge subscribers once a week.

Amazon’s AI chief has a message for the model benchmark obsessives: Stop looking at the leaderboards.

“I want real-world utility. None of these benchmarks are real,” Rohit Prasad, Amazon’s SVP of AGI, told me ahead of today’s announcements at AWS re:Invent in Las Vegas. “The only way to do real benchmarking is if everyone conforms to the same training data and the evals are completely held out. That’s not what’s happening. The evals are frankly getting noisy, and they’re not showing the real power of these models.”

It’s a contrarian stance when every other AI lab is quick to boast about how fast its new models climb the leaderboards. It’s also convenient for Amazon, given that the previous version of Nova, its flagship model, was sitting at spot 79 on LMArena when Prasad and I spoke last week. Still, dismissing benchmarks only works if Amazon can offer a different story about what progress looks like.

The centerpiece of today’s re:Invent announcements is Nova Forge, a service that Amazon claims lets companies train custom AI models in ways previously impossible without spending billions of dollars. The problem Forge addresses is real. Most companies trying to customize AI models face three bad options: fine-tune a closed model (but only at the edges), train on open-weight models (but without the original training data and risking capability regression, where the AI becomes an expert on new data but forgets original, broader skills), or build a model from scratch at enormous cost.

Forge offers something else: access to Amazon’s Nova model checkpoints at the pre-training, mid-training, and post-training stages. Companies can inject their proprietary data early in the process, when the model’s “learning capacity is highest,” as Prasad put it, rather than just tweaking model behavior at the end.

“What we have done is democratize AI and frontier model development for your use cases at fractions of what it would cost [before],” Prasad said. Forge was created because Amazon’s internal teams wanted a tool to inject their domain expertise into a base model without having to build from scratch.

“We built Forge because our internal teams wanted Forge,” he said. It’s a familiar Amazon pattern. AWS itself famously began as infrastructure built for Amazon’s own retail operation before becoming the company’s profit engine.

Reddit has been using Forge to build custom safety models trained on 23 years of community moderation data. “I haven’t seen anything like it yet,” Chris Slowe, Reddit’s CTO and first employee, told me. “We’ve had a distinguished engineer who’s just been like a kid in the candy shop.”

Slowe said Reddit ran a continued pre-training job last week that’s “looking really promising.” The goal: Replace multiple bespoke safety models with a single Reddit-expert model that understands the nuances of community moderation, including the notoriously subjective rule that appears across subreddits everywhere: “Don’t be a jerk.”

“Having an expert model, it’s going to understand the community,” Slowe said. “It’s gonna have a pretty good notion of what jerk means.”

Slowe explained that Forge lets Reddit control its models, avoid surprises from API changes, retain ownership of its weights, and keep sensitive data away from third-party model providers. He said Reddit is already exploring the same approach for Reddit Answers and other products.

When I asked Slowe whether it mattered that Nova isn’t a top-tier model on benchmarks, he was blunt: “In this context, what matters is the Reddit expertness of the model.” That’s the thread Amazon wants developers to pull on: not raw IQ points, but control and specialization.

With Forge, Amazon is making a calculated bet that the model race has commoditized and that it can succeed by being the place where companies can build specialized AI for specific business problems. It’s a very AWS-shaped view of the world: infrastructure over intelligence and customization over raw capability. The strategy also lets Amazon sidestep direct comparisons with OpenAI and Anthropic, both of which it once hoped to compete with at the model layer.

Whether Forge is genuinely pioneering or just clever positioning depends, of course, on developer adoption. Amazon insists that the model race, as it’s widely understood, doesn’t matter. If that ends up being true, the scoreboard shifts to something much quieter and harder to game: whether AI models actually deliver real-world utility.

Alex Heath
