# 🔥 Claude Opus 4.5 vs GPT-5.2 High vs Gemini 3 Pro: Production Coding Test ✅


Source: Dev.to

By Shrijal Acharya

Right now, the WebDev leaderboard on LMArena is basically owned by the big three: Claude Opus 4.5 from Anthropic, GPT-5.2-Codex (high) from OpenAI, and everybody's favorite, Gemini 3 Pro from Google. So I grabbed these three, dropped them into the same existing project (over 8K stars and 50K+ LOC), and asked them to build a couple of real features like a normal dev would. Same repo. Same prompts. Same constraints. For each task, I took the best result out of three runs per model to keep things fair. Then I compared what they actually did: code quality, how much hand-holding they needed, and whether the feature even worked in the end.

⚠️ NOTE: Don't take the results of this test as a hard rule. This is a small set of real-world coding tasks that shows how each model did for me in this exact setup, and gives you an overview of how the top three models differ on the same tasks.

If you want a quick take, here's how the three models performed in my tests:

- Claude Opus 4.5 was the most consistent overall. It shipped working results for both tasks, and its UI polish was the best of the three. The main downside is cost. If Anthropic finds a way to deliver this performance at a lower cost, it will genuinely be over for most other models.
- GPT-5.2-Codex (high) was one of the best, but it's obviously slower due to the higher reasoning effort. When it hit, the code quality and structure were great, but it needed more patience than the other two in this repo.
- Gemini 3 Pro was the most efficient. Both tasks worked, but the output often felt like the minimum viable version, especially on the analytics dashboard.

💡 If you want the safest pick for real "ship a feature in a big repo" work, Opus 4.5 felt the most reliable in my runs. If you care about speed and cost and you're okay polishing the UI yourself, Gemini 3 Pro is a solid bet.

## Test Workflow

For the test, we will use the following CLI coding agents:

- Claude Opus 4.5: Claude Code (Anthropic's terminal-based agentic coding tool)
- GPT-5.2 High: Codex CLI
- Gemini 3 Pro: Gemini CLI

Here's the repo used for the entire test: iib0011/omni-tools

We will check the models on two different tasks:

- Task 1: Add a global Action Palette (Ctrl + K). Each model is asked to create a global action menu that opens with a keyboard shortcut. This feature expands on the current search by adding actions, global state, and keyboard navigation. It checks how well the model understands existing UX patterns and avoids repetition without breaking what's already in place.
- Task 2: Tool Usage Analytics + Insights Dashboard. Each model has to add real usage tracking across the app, persist it locally, and then build an analytics dashboard that shows things like the most used tools, recent activity, and basic filters.

We'll compare code quality, token usage, cost, and time to complete the build.

💡 NOTE: I will share the source code changes for each task by each model as a .patch file. This way, you can easily view them on your local system by cloning the repository and applying the patch file with `git apply <patch_file_name>`. This makes sharing the changes much easier.

The workflow is simple: all models start from the same base commit and follow the same prompt to build what's asked. And, as mentioned, I evaluate each model's response on a "Best of 3" basis.

## Real-world Coding Tests

### Test 1: Add a global Action Palette (Ctrl + K)

Let's start the test off with something interesting. Here's the prompt used:

```
This project already has a search input on the home page that lets users find tools. I want to add an improved, global version of this idea that works as an **Action Palette**, similar to what you see in editors like VS Code.

**What to build**

* Pressing **Ctrl + K** (or Cmd + K on macOS) should open a centered action palette overlay from anywhere in the app.
* The palette should support:
  * Searching and navigating to tools (reuse existing tool metadata)
  * Executing actions, such as:
    * Toggle dark mode
    * Switch language
    * Toggle user type filter (General / Developer)
    * Navigate to Home and Bookmarks
    * Clear recently used tools
* Fully keyboard-driven experience:
  * Type to filter
  * Arrow keys to navigate
  * Enter to execute
  * Escape to close

**Notes**

* This should not replace the existing home page search. Think of it as a more powerful, global version that combines navigation and actions.
* The implementation should follow existing patterns, styling, and state management used in the codebase.
```
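Before the results, it helps to make the task concrete. Below is a minimal sketch of the core mechanic every model had to build: a window-level keydown listener that toggles a palette overlay. This is my own illustration in the repo's stack (React + TypeScript), not code from any of the three runs; the hook name and returned shape are assumptions.

```ts
import { useCallback, useEffect, useState } from 'react';

// Minimal sketch of a global action-palette toggle (illustrative only;
// the real implementations also handle filtering, arrow-key navigation,
// and executing the selected action).
export function useActionPalette() {
  const [open, setOpen] = useState(false);

  const onKeyDown = useCallback((event: KeyboardEvent) => {
    // Ctrl + K on Windows/Linux, Cmd + K on macOS
    if ((event.ctrlKey || event.metaKey) && event.key.toLowerCase() === 'k') {
      event.preventDefault(); // keep the browser from hijacking the shortcut
      setOpen((prev) => !prev);
    } else if (event.key === 'Escape') {
      setOpen(false); // Escape closes the palette from anywhere
    }
  }, []);

  useEffect(() => {
    // Window-level listener so the shortcut works on every page of the app
    window.addEventListener('keydown', onKeyDown);
    return () => window.removeEventListener('keydown', onKeyDown);
  }, [onKeyDown]);

  return { open, setOpen };
}
```

The interesting part of the task isn't this hook, though. It's everything around it: reusing the existing tool metadata for search, wiring actions into existing state (theme, language, user-type filter), and doing it all without breaking the home page search.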
#### GPT-5.2-Codex (high)

GPT-5.2 handled this surprisingly well. The implementation was solid end to end, and it basically one-shotted the entire feature set, including i18n support, without needing multiple correction passes. That said, it took noticeably longer than the other two models (~20 minutes), which is expected since reasoning was explicitly set to high. You can clearly see the model spending more time thinking through architecture, naming, and edge cases rather than rushing to output code. The trade-off felt worth it here: token usage was noticeably higher because of the high reasoning setting, but the output code reflected that.

- Cost: ~$0.9–1.0
- Duration: ~20 minutes (API time)
- Code Changes: +540 lines, minimal removals
- Token Usage: ~203k total; ~140k input (plus cached context) and ~64k output, of which ~47k were reasoning tokens

You can find the code it generated here: GPT-5.2 High Code

💡 NOTE: I ran the exact same prompt with the same model at the default (medium) reasoning level. The difference was honestly massive. With reasoning set to high, the quality of the code, the structure, and pretty much everything else jumps by miles. It's not even a fair comparison.

#### Claude Opus 4.5

Claude went all in and prepared a ton of different strategies. It ran into build issues at the start, but it kept re-running the build until it had fixed every build and lint issue. The entire run took about 7 minutes 50 seconds, the fastest among the models for this test. The features all worked as asked, and the UI looked super nice, exactly how I expected.

- Cost: $0.94
- Duration: 7 min 50 sec (API time)
- Code Changes: +540 lines, -9 lines

You can find the code it generated here: Claude Opus 4.5 Code

To be honest, this exceeded my expectations; even the i18n texts are added and displayed in the UI just as expected. Absolute cinema!

#### Gemini 3 Pro

Gemini 3 got it working, but it's clearly not on the same level as GPT-5.2 High or Claude Opus 4.5. The UI it built is fine and totally usable, but it feels a bit barebones, and you don't get many choices in the palette compared to the other two. One clear miss: language switching does not show up inside the action palette at all, which makes the i18n support feel incomplete even though translations technically exist.

- Cost: Low (helped significantly by cache reads)
- Duration: ~10 min 49 sec (API time)
- Code Changes: +428 lines, -65 lines
- Token Usage: ~79k input, ~536k cache reads, ~10.7k output; roughly 87% of input tokens were served from cache (536k / (536k + 79k) ≈ 0.87)

You can find the code it generated here: Gemini 3 Pro Code

Overall, Gemini 3 lands in a very clear third place here. It works, the UI looks fine, and nothing is completely broken, but compared to the depth, completeness, and polish of GPT-5.2 High and Claude Opus 4.5, it feels behind.

### Test 2: Tool Usage Analytics + Insights Dashboard

This test is a step up from the action palette: real usage tracking across the app, persisted locally, plus a dashboard showing the most used tools, recent activity, and basic filters. You can find the prompt I've used here: Prompt. The sketch below shows the kind of persistence-plus-aggregation layer the task calls for.
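Again, as an illustration of what the prompt demands rather than any model's actual output, here's roughly what the tracking and aggregation layer could look like in TypeScript. The storage key, event shape, and function names are all hypothetical.

```ts
// Illustrative sketch of local usage tracking (hypothetical names/shapes).
interface UsageEvent {
  toolId: string;
  timestamp: number; // epoch milliseconds
}

const STORAGE_KEY = 'tool-usage-events'; // hypothetical key

export function loadUsageEvents(): UsageEvent[] {
  try {
    return JSON.parse(localStorage.getItem(STORAGE_KEY) ?? '[]');
  } catch {
    return []; // corrupted data shouldn't crash the dashboard
  }
}

// Called whenever a tool is used; persists across sessions via localStorage.
export function recordToolUsage(toolId: string): void {
  const events = loadUsageEvents();
  events.push({ toolId, timestamp: Date.now() });
  localStorage.setItem(STORAGE_KEY, JSON.stringify(events));
}

// "Most used tools" is then a simple aggregation over the persisted events.
export function mostUsedTools(limit = 5): { toolId: string; count: number }[] {
  const counts = new Map<string, number>();
  for (const { toolId } of loadUsageEvents()) {
    counts.set(toolId, (counts.get(toolId) ?? 0) + 1);
  }
  return [...counts.entries()]
    .map(([toolId, count]) => ({ toolId, count }))
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}
```

The real implementations layer the dashboard UI, recent-activity views, and filters on top of something like this, and that's exactly where the quality gap between the three models showed up.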
#### GPT-5.2-Codex (high)

GPT-5.2 absolutely nailed this one. The final result turned out amazing. Tool usage tracking works exactly as expected, data persists correctly, and the dashboard feels like a real product feature. Most used tools, recent usage, filters: everything just works. One really nice touch is that it also wired analytics-related actions into the Action Palette from Test 1. It did take longer than the first test, around 26 minutes, but again, that's the trade-off with high reasoning. You can tell the model spent time thinking through data modeling, reuse, and avoiding duplicated logic. Totally worth it here.

- Cost: ~$1.1–1.2
- Duration: ~26 minutes (API time)
- Code Changes: Large multi-file update, cleanly structured
- Token Usage: ~236k total; ~162k input (plus heavy cached context) and ~75k output, of which ~57k were reasoning tokens

You can find the code it generated here: GPT-5.2 High Code

GPT-5.2 High continues to be slow but extremely powerful, and for a task like this, that's a very good trade.

#### Claude Opus 4.5

Claude Opus 4.5 did great here as well. The final implementation works end to end, and honestly, from a pure UI and feature standpoint, it's hard to tell the difference between this and GPT-5.2 High. The dashboard looks clean, the data makes sense, and the filters work as expected.

- Cost: $1.78
- Duration: ~8 minutes (API time)
- Code Changes: +1,279 lines, -17 lines

You can find the code it generated here: Claude Opus 4.5 Code

#### Gemini 3 Pro

Gemini 3 Pro gets the job done, but it clearly takes a more minimal approach than GPT-5.2 High and Claude Opus 4.5. The overall experience feels bare minimum: the UI is functional but plain, and the dashboard lacks the polish and depth you get from the other two models. It also didn't add a button to view the analytics right in the action palette, unlike the other two models.

- Cost: Low, with heavy cache utilization
- Duration: ~5 minutes (API time)
- Code Changes: +351 lines, -3 lines
- Token Usage: ~67k input, ~7.1k output, with 85%+ of input tokens served from cache

You can find the code it generated here: Gemini 3 Pro Code

Overall, Gemini 3 Pro remains efficient and reliable, but in a comparison like this, efficiency alone is not enough.

## Conclusion

🤷‍♂️ At least from this test, I can conclude that these models are now pretty much able to one-shot decently complex work. Still, there have been times when a model messes up so badly that fixing the problems one by one would take me nearly as long as building the feature from scratch. Comparing the results across models, Opus 4.5 definitely takes the crown. But I still don't think we're anywhere close to relying on it for real, big production projects. The recent improvements are honestly insane, but the results still don't fully back up the hype. For now, I think these models are great for refactoring, planning, and helping you move faster. But if you rely solely on their generated code, the codebase just won't hold up long term. I don't see any of these recent models as "use it and ship it" for production in a project with millions of lines of code, at least not in the way people hype it up. Let me know your thoughts in the comments.