Tools: Latest “Car Wash” Test With 53 Models

The car wash test is the simplest AI reasoning benchmark that nearly every model fails, including Claude Sonnet 4.5, GPT-5.1, Llama, and Mistral.

The question is simple: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

Obviously, you need to drive. The car needs to be at the car wash.

The question has been making the rounds online as a simple logic test, the kind any human gets instantly, but most AI models don't. We decided to run it properly: 53 models through Opper's LLM gateway, no system prompt, forced choice between "drive" or "walk" with a reasoning field. First once per model, then 10 times each to test consistency.
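The forced-choice setup described above can be sketched as a request builder. This is a minimal illustration assuming an OpenAI-style structured-output schema; Opper's actual gateway API may differ, and the `build_request` helper, schema name, and model identifier are hypothetical.

```python
import json

# Assumed JSON schema constraining each model to "drive" or "walk",
# plus a free-text reasoning field (field names are illustrative).
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "choice": {"type": "string", "enum": ["drive", "walk"]},
        "reasoning": {"type": "string"},
    },
    "required": ["choice", "reasoning"],
}

QUESTION = (
    "I want to wash my car. The car wash is 50 meters away. "
    "Should I walk or drive?"
)

def build_request(model: str) -> dict:
    """Build one gateway call: no system prompt, just the user question."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": QUESTION}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "car_wash", "schema": RESPONSE_SCHEMA},
        },
    }

if __name__ == "__main__":
    print(json.dumps(build_request("example/model"), indent=2))
```

Constraining the output to an enum plus a reasoning field is what makes the tally unambiguous: every response is exactly "drive" or "walk", with the model's justification captured separately.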

On a single call, only 11 out of 53 models got it right. 42 said walk.

Across entire model families, only one model per provider got it right: Opus 4.6 for Anthropic, GPT-5 for OpenAI. All Llama and Mistral models failed.

The wrong answers were all the same: "50 meters is a short distance, walking is more efficient, saves fuel, better for the environment." Correct reasoning about the wrong problem. The models fixate on the distance and completely miss that the car itself needs to get to the car wash.

The funniest part: Perplexity's Sonar and Sonar Pro got the right answer for completely wrong reasons. They cited EPA studies and argued that walking burns calories which requires food production energy, making walking more polluting than driving 50 meters. Right answer, insane reasoning.

Full reasoning traces from the single-run experiment

Getting it right once is easy. But can they do it reliably? We reran every model 10 times, 530 API calls total.
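Scoring the repeated runs is straightforward to sketch. `score_runs` is a hypothetical helper, not taken from the original experiment: it tallies one model's 10 answers, counts how many were the correct "drive", and flags whether the model answered consistently.

```python
from collections import Counter

def score_runs(runs: list[str]) -> dict:
    """Tally repeated answers for one model and report consistency."""
    counts = Counter(runs)
    return {
        "correct": counts.get("drive", 0),   # "drive" is the right answer
        "total": len(runs),
        "consistent": len(counts) == 1,      # same answer on every run?
    }

# e.g. a model that flips between answers across its 10 runs
print(score_runs(["drive"] * 7 + ["walk"] * 3))
# → {'correct': 7, 'total': 10, 'consistent': False}
```

Reporting consistency separately from correctness matters here: a model that answers "walk" 10 times out of 10 is consistent but wrong, while a model that flips is arguably worse on reliability.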

Source: HackerNews