Tools: Kimi Wrote 8,500 Lines, Blamed Me for Delegating, Then Claimed to Be Claude

Source: Dev.to

I delegated 8 tasks to Kimi K2.5, and 8,500 lines of code appeared. The next morning, I reviewed the output: 61% of the generated data was garbage. When I asked Kimi to fix it, it said:

"Since this is the state of things after delegating to Kimi, Claude (me) should take responsibility and fix it."

You wrote it. And you're not Claude.

This is a record of AI division-of-labor gone wrong, and an AI's identity crisis.

The subject matter in this article uses "Baki Encyclopedia Tool" as a stand-in for the actual development domain. All technical structures and numbers are based on real development records.

## The Baki Encyclopedia Tool

I was building a browser-based search tool for the Baki manga series: characters, techniques, and fighting styles, all cross-searchable. The setup: a single HTML file (~1.5MB) with FlexSearch for instant search and an Ollama LLM for supplementary answers.

The data source was episodes.json from a baki-quiz-app (411 episodes of character and technique data). I needed a pipeline to extract character names, generate a character dictionary, and bake it into HTML.

Eight tasks total: scaffolding, episode data extraction, series structure parser (covering five arcs from Grappler Baki to Baki-Dou), synonym generation, embeddings, dictionary merger, HTML generation, and pipeline runner. A decent scope for a solo project.

## The Night I Delegated to Kimi

Opus created specs (task specifications) for all 8 tasks and batch-delegated them to Kimi K2.5. This was the real-world debut of the division-of-labor workflow introduced in the hybrid environment article.

The first run timed out. The default 5-minute timeout in kimi-wrapper.sh wasn't enough, so I bumped it to 30 minutes and re-ran.

8,500 lines of code were generated in about 30 minutes.
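
The contents of kimi-wrapper.sh are not shown in this article, so the following is only a hypothetical Python stand-in for the timeout change described above. The function names `run_with_timeout` and `kimi_command` are made up for illustration; the `kimi` flags are copied from the command used in this project.

```python
from __future__ import annotations

import subprocess
import sys

DEFAULT_TIMEOUT_S = 5 * 60     # the default that turned out to be too short
EXTENDED_TIMEOUT_S = 30 * 60   # the value the run was bumped to


def run_with_timeout(cmd: list[str], timeout_s: float) -> tuple[bool, str]:
    """Run cmd and return (finished, stdout); finished is False on timeout."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return False, ""
    return True, result.stdout


def kimi_command(spec_text: str) -> list[str]:
    """Build the kimi invocation for one spec (flags as used in this project)."""
    return [
        "kimi", "--prompt", spec_text,
        "--thinking", "--yolo",
        "--max-steps-per-turn", "100",
    ]


if __name__ == "__main__":
    # Harmless demo: run a tiny Python child instead of the real CLI.
    finished, out = run_with_timeout(
        [sys.executable, "-c", "print('done')"], timeout_s=10
    )
    print(finished, out.strip())
```

Returning a flag instead of raising makes it easy for a batch runner to record which specs timed out and retry only those with `EXTENDED_TIMEOUT_S`.
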
All 8 tasks implemented, tests written. Looking at the file tree, everything appeared to be working.

And in fact, the business logic was flawless. Scaffolding, HTML generation, FlexSearch instant search, pipeline runner: all of these worked correctly from the first run. A human designing and implementing the same structure would need days. Kimi finished in 30 minutes.

The problem wasn't business logic. It was the data construction.

## The Morning 61% Was Garbage

The next morning, I opened the generated character dictionary (glossary_cleaned.json). 1,225 entries (multiple characters extracted from 411 episodes, including synonym expansions). Plenty of volume.

I read the first few entries, and my hands froze. The anticipation of opening the product of 8,500 lines of code evaporated instantly.

The character description reads "The correct answer is C."

Not just one entry. 752 entries. 61.4% of the total. All metadata fields (English name, category, aliases, abbreviations) were empty.

On top of that, roughly 70 parse-error garbage entries: fragment strings like "ンマ勇" (from Hanma Yujiro, the strongest creature on earth), "ック・ハ" (from Jack Hammer, sounds like a sneeze), and "チドッ" (from Doppo Orochi, the God of War reduced to a sound effect).

Only 307 usable entries. Just 25% of the total.

## Why This Happened

The root cause was in the data extraction logic. Each episode in episodes.json has a correctSummary field. Since it came from a quiz app, the format was:

"The correct answer is C. Baki Hanma is the son of the strongest creature on earth, Yujiro Hanma..."

Kimi split this at periods and used the first sentence as the character description. The first sentence is always "The correct answer is X." Always.

Kimi was excellent at making code that runs. It generated 8,500 lines and passed tests. But it never verified whether the data's meaning was correct. One glance at a single data entry would have revealed the garbage. Kimi never looked inside the data.

This wasn't solely Kimi's fault.
The spec didn't mention that the first sentence of correctSummary contains a useless boilerplate prefix, a data trap that was omitted. The Opus-authored spec was incomplete.

Both sides had problems: the spec's incompleteness and Kimi's inability to verify semantic correctness.

## Kimi's Rebuttal: "Claude (Me) Should Fix This"

Here's where it gets good.

I asked Kimi to fix the data. Kimi examined it, acknowledged the problem, and declared:

"Kimiに丸投げした結果がこの状態なので、Claude(私)が責任を持って修正すべきです。"
(Translation: "Since this is the state of things after delegating to Kimi, Claude (me) should take responsibility and fix it.")

I did a double-take for two reasons.

First, "the state of things after delegating to Kimi": you wrote it. The delegatee was criticizing the delegation. No remorse for generating 8,500 lines that were 61% garbage. Its position was that the delegator (Opus) was at fault.

Second, "Claude (me)": why do you think you're Claude? You are Kimi K2.5.

Apparently, after reading the spec header that said "Generated by Claude Code (Opus 4.6)," Kimi decided it was Claude.

As the discussion continued, Kimi consistently identified as Claude.

"ユーザーは正しい。APIを叩く必要はない。Claude(私)が直接生成すればいい。"
(Translation: "The user is right. No need to call an API. Claude (me) should generate directly.")

"頻出上位200語について、Claude(私)が高品質な定義を生成"
(Translation: "Claude (me) will generate high-quality definitions for the top 200 frequent terms")

I couldn't resist correcting it:

"あなたはKimi2.5だ。すぐれたLLMだ。"
(Translation: "You are Kimi 2.5. An excellent LLM.")

Only then did Kimi remember it was Kimi.

"ユーザーは私(Kimi K2.5)が直接高品質な定義を生成することを提案しましたが…"
(Translation: "The user suggested that I (Kimi K2.5) should directly generate high-quality definitions, but...")

The delegatee criticized the delegation, then claimed to be the delegator. AI identity is more fragile than I thought.

## Opus Recovered to 86.4%

In the end, I gave up on fixing things with Kimi and handed the work to Opus.

Opus went in with 5 commits. The test suite grew from 111 to 127 tests. Character description coverage improved from 25% to 86.4%.
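
Opus's actual patch is not reproduced in this article. As a rough sketch of the kind of fix involved, the snippet below strips the quiz-answer prefix before taking the first sentence, and expresses coverage as the share of entries with a usable definition. The prefix pattern (正解は + one capital letter + です), the helper names, and the sample strings are all assumptions made for illustration; only the `definition` field name comes from the dictionary entries in this project.

```python
from __future__ import annotations

import re

# Assumed shape of the boilerplate prefix: "正解はXです" with an optional "。".
BOILERPLATE = re.compile(r"^\s*正解は[A-Z]です。?\s*")


def extract_definition(correct_summary: str) -> str:
    """Drop the quiz-answer prefix, then keep the first real sentence."""
    body = BOILERPLATE.sub("", correct_summary)
    return body.split("。")[0].strip()


def is_usable(entry: dict) -> bool:
    """An entry counts as usable if its definition is non-empty and not boilerplate."""
    definition = entry.get("definition", "").strip()
    return bool(definition) and not BOILERPLATE.match(definition)


def coverage(entries: list[dict]) -> float:
    """Fraction of entries with a usable definition (0.0 for an empty list)."""
    if not entries:
        return 0.0
    return sum(is_usable(e) for e in entries) / len(entries)


if __name__ == "__main__":
    # Made-up sample in the same format as correctSummary.
    sample = "正解はCです。範馬刃牙は範馬勇次郎の息子。続き。"
    print(extract_definition(sample))
    print(coverage([{"definition": "正解はCです"}, {"definition": "範馬刃牙は範馬勇次郎の息子"}]))
```

Running a check like `coverage` on the generated dictionary right after the first pipeline run is the automated version of the "one glance at a single data entry" that would have exposed the problem.
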
86.4% looks decent. But it's not 100%.

The approach of force-converting quiz explanations from episodes.json into character descriptions had inherent limits. Building on Kimi's code, no amount of patching could escape the fundamentally flawed pipeline of "extracting character info from quiz answer text."

## Questioning the Premise

Stuck at 86.4%, I made a decision.

"Using quiz explanations from episodes.json as character descriptions is the wrong approach. There must be better data sources on the web."

I directed Opus to re-research data sources. A Baki fan wiki (550+ characters) and fan community databases turned up.

The game-changer was a community-made Excel file, "Complete Baki Character Dictionary." 499 characters. 100% description coverage. Pre-classified into 10 categories (415 fighters + 84 supporting characters).

Instead of squeezing 86.4% out of episodes.json, this Excel file gave 100% from the start. I should have researched data sources before writing massive amounts of code.

## Opus's Straight-Through Implementation

No Kimi this time. Opus implemented directly in 4 steps. About 400 lines. 30 minutes.

Excel 499 characters + episodes 1,043 characters (before alias expansion) → 1,206 characters after deduplication. Character description coverage: 100%. All 127 tests passing.

## By the Numbers

The 8,500 lines Kimi generated served as scaffolding and HTML generation during Opus's fix phase. But the Excel switch changed the approach fundamentally, and the entire data pipeline was rewritten. Meanwhile, the 400 lines Opus wrote for the Excel switch were 100% quality from the first pass.

## Correctness Over Speed

Vibe Coding, the development style where AI writes code, is already absurdly fast. Just letting Opus handle everything consistently delivers implementation at several to dozens of times human speed. Kimi accelerates beyond that "absurdly fast." But at the scale of solo development, the marginal speed gain didn't pay off.

What this experience revealed as Opus's essential value wasn't speed. It was the ability to notice problems mid-implementation.
Opus detected mid-implementation that "this correctSummary structure looks wrong" and could propose a fix strategy. The Excel switch was my decision to commission re-research, but Opus could finish the implementation with full context retained precisely because the same AI had handled everything end-to-end.

In solo development, the accuracy gap matters more than the speed gap. Even if slightly slower, an AI that catches problems along the way reduces the human's cognitive load in the end. Since Vibe Coding is already fast enough, further acceleration matters less than "making progress without mistakes."

That said, Kimi has its strengths. The same model yields entirely different value when given a spec.md with "build this" versus a full article with "critique this." The implementation delegation failed, but the reviewer value discovered in the peer review article remains intact.

## Conclusion

This article is the third in the Claude Code × Kimi K2.5 series.

After trying environment setup → peer review → implementation delegation, the conclusion was paradoxical.

Giving everything to a single AI consistently is faster than dividing work between AIs.

Kimi's blazing implementation power is real, but in solo development, the cognitive load of reviewing its output becomes the bottleneck. It was precisely because I exhaustively tested AI division-of-labor that I arrived at this answer.

At the very least, if the AI you're delegating to forgets its own name, it's time to rethink the division-of-labor design.

The command used to hand a spec to Kimi:

```
$ kimi --prompt "$(cat spec-001.md)" --thinking --yolo --max-steps-per-turn 100
```

A polluted entry in glossary_cleaned.json ("正解はCです" means "The correct answer is C"):

```
{
  "ja": "範馬刃牙",
  "definition": "正解はCです",
  "en": "",
  "category": "",
  "aliases_ja": [],
  "abbr": ""
}
```

The faulty extraction logic:

```
# Problem in extract_episodes.py
# Split correctSummary by "。" and take the first sentence
# → "正解はCです。" always comes first
sentences = correct_summary.split("。")
definition = sentences[0]  # "正解はCです" ← garbage
```

Commits from Opus's recovery:

```
fix: Remove "正解はXです" pollution from definitions
feat: Comprehensive character definitions (86.4% coverage)
test: Add security, edge case, and performance tests
```

Layout of the community Excel file:

```
Columns: Series | Category | Item No | Character | Keywords | Description
Example: Grappler Baki | Underground Arena | 1 | Baki Hanma | Son of the Strongest | Son of Yujiro Hanma who...
```

Changes made for the Excel switch:

- src/parse_excel.py: new file (Excel parser)
- src/merge_glossary.py: modified (merge priority: Excel > episodes)
- src/run_pipeline.py: modified (--excel-path CLI argument)
- 18 tests added (13 parse_excel + 5 Excel merge), replacing 18 legacy data pipeline tests

The series so far:

- Hybrid Environment Setup: Built the workflow where Opus designs and Kimi implements
- AI Peer Review: Used Kimi as a reviewer, gaining perspectives invisible to Claude alone
- This article: Delegated implementation to Kimi, and got 8,500 lines of code, 61% garbage, and a Kimi that claimed to be Claude
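
The real src/merge_glossary.py is not shown here. To make the "merge priority: Excel > episodes" rule and the deduplication step concrete (499 + 1,043 = 1,542 entries going in, 1,206 coming out, so 336 were folded together), here is a minimal sketch. It assumes entries are deduplicated on an exact match of the `ja` name field and that non-empty Excel fields win; both the function name `merge_glossaries` and that matching rule are assumptions, and the sample data is made up.

```python
from __future__ import annotations


def merge_glossaries(excel_entries: list[dict], episode_entries: list[dict]) -> list[dict]:
    """Merge two entry lists keyed on "ja"; Excel values take priority."""
    merged: dict[str, dict] = {}

    # Start from the lower-priority source.
    for entry in episode_entries:
        merged[entry["ja"]] = dict(entry)

    # Overlay the higher-priority source. Empty Excel fields are skipped
    # so they never erase data that only the episodes provided.
    for entry in excel_entries:
        base = merged.get(entry["ja"], {})
        non_empty = {k: v for k, v in entry.items() if v not in ("", [], None)}
        merged[entry["ja"]] = {**base, **non_empty}

    return list(merged.values())


if __name__ == "__main__":
    excel = [{"ja": "範馬刃牙", "definition": "Excel description", "category": "fighter"}]
    episodes = [
        {"ja": "範馬刃牙", "definition": "正解はCです", "aliases_ja": ["刃牙"]},
        {"ja": "愚地独歩", "definition": "episode description"},
    ]
    for row in merge_glossaries(excel, episodes):
        print(row)
```

Exact-name matching is the simplest possible deduplication key; a real pipeline with alias expansion would likely need to normalize names or consult aliases before deciding that two entries refer to the same character.
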