Even the most advanced AI models fail more often than you think on structured outputs — raising doubts about the effectiveness of coding assistants
- Report finds AI coding assistants regularly fail one in four structured-output tasks
- Even advanced proprietary models only reach approximately 75% accuracy
- Open source AI models perform worse, averaging closer to 65% reliability
The promise of artificial intelligence as a tireless coding assistant has hit a significant roadblock, after new research found such tools frequently fail when asked to produce structured outputs.
A recent study from the University of Waterloo found AI struggles with software development, with even the most advanced models failing on one in four structured-output tasks.
The research evaluated 11 large language models across 18 different structured formats and 44 tasks to test how well the systems could follow predefined rules, finding a clear disparity between performance on text-based tasks and outputs involving multimedia or complex structures.
Benchmarking reveals a troubling reliability gap
While text-related tasks were generally handled with moderate success, tasks requiring image, video, or website generation proved far more problematic.
Accuracy in these areas dropped sharply, raising questions about how these AI tools can be integrated safely into professional workflows.
“With this kind of study, we want to measure not only the syntax of the code — that is, whether it’s following the set rules — but also whether the outputs produced for various tasks were accurate,” said Dongfu Jiang, a PhD student and co-first author of the study.
Structured outputs, designed to impose format consistency through JSON, XML, or Markdown, were intended to make AI responses more reliable for developers.
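The gap between following a format and producing a correct result can be illustrated with a minimal Python sketch. It is not the Waterloo team's evaluation method; the function name `check_output`, the required keys, and the sample data are all hypothetical, chosen only to show that a response can parse as valid JSON and still be wrong.

```python
import json


def check_output(raw: str, required_keys: set, expected: dict) -> dict:
    """Score a model response on format and content separately.

    Hypothetical helper for illustration only: real benchmarks use
    far richer checks than key presence and exact value matches.
    """
    result = {
        "valid_syntax": False,       # does the text parse as JSON at all?
        "has_required_keys": False,  # does it follow the agreed structure?
        "matches_expected": False,   # are the values actually right?
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return result  # malformed output fails every check
    result["valid_syntax"] = True

    if not isinstance(data, dict):
        return result  # valid JSON, but not the object we asked for

    result["has_required_keys"] = required_keys.issubset(data.keys())
    result["matches_expected"] = all(
        data.get(key) == value for key, value in expected.items()
    )
    return result


# Assumed sample task: return a person's name and age as JSON.
expected = {"name": "Ada", "age": 36}
required = {"name", "age"}

well_formed_but_wrong = '{"name": "Ada", "age": "thirty"}'
truncated = '{"name": "Ada", "age": 36'

print(check_output(well_formed_but_wrong, required, expected))
print(check_output(truncated, required, expected))
```

The first sample passes both format checks and fails only on content, which is the kind of error a syntax-only validator would miss.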
AI companies, including OpenAI, Google, and Anthropic, introduced structured outputs to force responses into predictable formats.
The Waterloo research suggests this approach has not yet delivered the level of dependability developers require.
Waterloo’s benchmarking revealed even the most advanced proprietary models reached only about 75% accuracy, while open source alternatives performed closer to 65%.
These results suggest that, despite improvements, AI systems still make significant errors that cannot be ignored in professional development environments.
The report emphasized the need for human oversight, noting, “Developers might have these agents working for them, but they still need significant human supervision.”
Although structured outputs are a step forward from free-form natural language responses, errors remain common.
The technology is not yet robust enough to operate independently in complex development scenarios.
One might reasonably question whether the industry’s enthusiasm for AI coding assistants and vibe coding has outpaced the actual capabilities of the underlying technology.
Even the most advanced models demonstrate a significant failure rate on structured tasks, revealing a wide gap between marketing claims and actual performance.
Therefore, for now, developers should treat these tools as experimental aids rather than autonomous colleagues.