Tools: I Started Building a Roguelike RPG — Powered by On-Device AI #2 (2026)

Running an On-Device LLM in Unity on Android — Everything That Broke (and How I Fixed It)

0. Why This Tech Stack

1. ONNX Runtime Setup

2. Building a C# Tokenizer From Scratch

3. Building the Inference Engine

4. First Generation Test

5. Android Build

6. Korean Font in TextMeshPro

7. Real Device Results

What I Learned

In my last post, I mentioned I was building a roguelike RPG powered by an on-device LLM. This time I'll cover exactly how I did it, what broke, and what the numbers look like. The short version: I got Phi-4-mini running in Unity on a real Android device in one day. It generated valid JSON. It took 8 minutes and 43 seconds.

0. Why This Tech Stack

Before the details, here's why I made each choice.

Why Phi-4-mini (3.8B)? Microsoft officially distributes it in ONNX format — no conversion work needed. The INT4-quantized version fits in 4.9GB, which is manageable on a 12GB RAM device. And at 3.8B parameters, it's roughly the minimum size that can reliably produce structured JSON output; smaller models tend to fall apart on formatting tasks.

Why ONNX Runtime? Cross-platform support across Android, iOS, Windows, and Mac. There's a Unity C# binding, and the asus4/onnxruntime-unity package makes Unity integration straightforward. Most importantly, switching between hardware acceleration backends (QNN, NNAPI, CoreML) is a single line of code — which matters a lot when you're trying to get NPU acceleration working.

Why Unity? Good ecosystem for 2D roguelikes, Android/iOS cross-platform builds, and I can write the LLM inference code in C# alongside the game logic without needing a Python bridge.

Why min SDK 31? Android 12 (API 31) introduced the ability to declare vendor-partition libraries via uses-native-library. QNN HTP depends on libcdsprpc.so, which lives in the vendor partition; without this declaration, NPU acceleration is completely off the table. Dropping below SDK 31 would mean giving up on QNN entirely.

Why a Samsung Galaxy S24 Ultra as the test device? Snapdragon 8 Gen 3 with a Hexagon NPU — one of the few consumer devices where QNN acceleration is actually possible — and 12GB of RAM, enough headroom for the 4.9GB model. I wanted to measure the performance ceiling with the best available hardware first: if it doesn't work here, it doesn't work anywhere with current technology. Also, it's my personal phone. There's no test device budget.
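To illustrate that "single line of code" backend switch, here is a hedged sketch using the Microsoft.ML.OnnxRuntime C# API. The QNN provider option key (backend_path) comes from ONNX Runtime's QNN EP documentation; exact option names and flags can vary by ORT version, so treat this as a sketch rather than the project's actual code:

```csharp
using System.Collections.Generic;
using Microsoft.ML.OnnxRuntime;

var options = new SessionOptions();

// Pick ONE execution provider; the rest of the pipeline stays identical.
// QNN HTP (Qualcomm NPU): "backend_path" selects the HTP backend library.
options.AppendExecutionProvider("QNN", new Dictionary<string, string>
{
    { "backend_path", "libQnnHtp.so" }
});
// NNAPI (Android) would instead be: options.AppendExecutionProvider_Nnapi();
// CoreML (iOS/Mac) would instead be: options.AppendExecutionProvider_CoreML();

// Operators the chosen EP can't handle silently fall back to the CPU EP —
// which is exactly the failure mode described later in this post.
var session = new InferenceSession("model.onnx", options);
```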
1. ONNX Runtime Setup

Installed com.github.asus4.onnxruntime v0.4.4 via an NPM scoped registry; IL2CPP compatibility confirmed with no issues. Downloaded the Phi-4-mini ONNX model (cpu_and_mobile variant) from Hugging Face: model.onnx at 52MB plus model.onnx.data at 4.9GB.

2. Building a C# Tokenizer From Scratch

Phi-4-mini uses a tiktoken-style BPE tokenizer, and no Unity C# implementation existed, so I wrote one. It loads the vocab (200,029 entries), merges (199,742 entries), and special tokens (12) from tokenizer.json, and implements the GPT-2 byte↔unicode conversion table, BPE encoding/decoding with a cache, and special-token splitting. One gotcha: I assumed the merges format was "tok1 tok2" strings. It was actually ["tok1","tok2"] arrays. I added a branch to handle both formats.

3. Building the Inference Engine

Implemented KV cache-based auto-regressive greedy decoding. Two small fixes along the way: the DenseTensor constructor call had to be new DenseTensor<long>(new[] {batch, seqLen}), and I had the model path at 3 levels up (../../..) when it needed to be 2 (../..).

4. First Generation Test

Kept the prompt short, max 150 tokens. The mob name on floors 1-3 matches the player character name — that's a prompt issue I'll fix later. The important thing is that the JSON structure is valid and complete.

5. Android Build

Putting a 5GB model in StreamingAssets hits Java's 2.1GB array limit. Renaming the folder didn't help — anything inside StreamingAssets gets included regardless of name. The solution: move the model folder completely outside of Assets, delete the Gradle cache (Library/Bee/Android, 15GB worth), and rebuild. The APK ships without the model; the model is pushed separately via adb (4.9GB, ~94 seconds).

6. Korean Font in TextMeshPro

The default TMP font (LiberationSans) doesn't include Korean characters, so I converted AppleSDGothicNeo.ttc using the TMP Font Asset Creator. Important: the Custom Range field only accepts decimal, not hex. Entering AC00-D7A3 throws a FormatException — the ranges have to be entered as decimal code points.

7. Real Device Results

The S24 Ultra is 2.1x slower than the Mac, and adding QNN HTP barely moved the needle. The reason showed up in the INFO logs: QNN EP registration succeeded, but the backend never actually initialized. The entire thing was falling back to CPU.
libcdsprpc.so is Qualcomm's DSP RPC library — it lives in the vendor partition and isn't accessible from the app sandbox by default. The fix is declaring it via uses-native-library in AndroidManifest. That ran into a separate issue: the custom manifest conflicted with Unity's auto-generated one, causing the app to disappear from the launcher entirely. I'll be using a Gradle template to inject just that one line instead.

Next: Getting QNN HTP to Actually Work — The libcdsprpc.so Wall
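For reference, the declaration itself is a single element inside the application tag of the merged manifest (a sketch based on the Android 12 uses-native-library docs; it requires minSdk 31, and android:required="false" lets the app still install on devices without the library):

```xml
<application>
    <!-- Make Qualcomm's vendor-partition DSP RPC stub visible to the app sandbox -->
    <uses-native-library
        android:name="libcdsprpc.so"
        android:required="false" />
</application>
```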

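For context on how the inference engine hangs together, here is a hedged sketch of a KV-cache greedy decode loop against the Microsoft.ML.OnnxRuntime C# API. The input/output names (input_ids, logits, past_key_values.* / present.*) follow the common transformer ONNX export convention; GreedyDecoderSketch and Generate are illustrative names, not the project's code, and a real implementation must also supply attention_mask, the empty past tensors the first run expects, and proper tensor disposal:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

static class GreedyDecoderSketch
{
    // Index of the largest logit — greedy decoding in one helper.
    public static int ArgMax(ReadOnlySpan<float> logits)
    {
        int best = 0;
        for (int i = 1; i < logits.Length; i++)
            if (logits[i] > logits[best]) best = i;
        return best;
    }

    // Prefill the whole prompt once, then decode one token per step,
    // feeding each step's "present.*" outputs back in as "past_key_values.*".
    public static List<long> Generate(InferenceSession session,
                                      long[] promptIds, int maxNewTokens, long eosId)
    {
        var generated = new List<long>(promptIds);
        var inputs = new List<NamedOnnxValue>
        {
            NamedOnnxValue.CreateFromTensor("input_ids",
                new DenseTensor<long>(promptIds, new[] { 1, promptIds.Length }))
        };
        var past = new List<NamedOnnxValue>();

        for (int step = 0; step < maxNewTokens; step++)
        {
            var results = session.Run(inputs.Concat(past).ToList());
            var logits = results.First(r => r.Name == "logits").AsTensor<float>();

            // Greedy pick from the last sequence position: [batch, seq, vocab].
            int seq = logits.Dimensions[1], vocab = logits.Dimensions[2];
            var last = new float[vocab];
            for (int v = 0; v < vocab; v++) last[v] = logits[0, seq - 1, v];
            int next = ArgMax(last);
            if (next == eosId) break;
            generated.Add(next);

            // Next step: only the new token goes in; the KV cache carries the rest.
            inputs = new List<NamedOnnxValue>
            {
                NamedOnnxValue.CreateFromTensor("input_ids",
                    new DenseTensor<long>(new long[] { next }, new[] { 1, 1 }))
            };
            past = results.Where(r => r.Name.StartsWith("present."))
                .Select(r => NamedOnnxValue.CreateFromTensor(
                    r.Name.Replace("present.", "past_key_values."),
                    r.AsTensor<float>()))
                .ToList();
        }
        return generated;
    }
}
```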

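The dual-format merges branch mentioned in the tokenizer section can be sketched with Newtonsoft.Json like this (a minimal sketch; MergesParser and ParseMerge are names I made up for illustration, not the project's actual code):

```csharp
using Newtonsoft.Json.Linq;

static class MergesParser
{
    // tokenizer.json "merges" entries appear either as "tok1 tok2" strings
    // (older dumps) or as ["tok1","tok2"] arrays (what Phi-4-mini ships).
    // Casting the string form (a JValue) straight to JArray throws,
    // hence the type branch.
    public static (string, string) ParseMerge(JToken entry)
    {
        if (entry.Type == JTokenType.Array)
        {
            var arr = (JArray)entry;
            return ((string)arr[0], (string)arr[1]);
        }
        // String form: split on the single separating space.
        var parts = ((string)entry).Split(' ');
        return (parts[0], parts[1]);
    }
}
```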
The raw errors, commands, and outputs from the steps above, deduplicated:

Tokenizer (merges format):

```
Newtonsoft.Json.Linq.JValue → JArray cast failed
```

Inference engine (tensor constructor):

```
CS1503: DenseTensor<long>(seqLen, new[] {batch, seqLen})
```

Inference engine (model path):

```
model.onnx not found
```

Generation test output (the Korean mob name is roughly "lazy baker's son"):

```
[LLM] Generated in 181.4s (150 tokens max)
```

```
[
  {"floor":1,"mob":"게으른 빵집 아들","hp":50,"atk":10},
  {"floor":4,"mob":"elite","hp":100,"atk":20},
  {"floor":5,"mob":"boss","hp":200,"atk":40}
]
```

Android build (StreamingAssets size limit):

```
compressReleaseAssets FAILED
Required array size too large
```

Pushing the model to the device:

```
adb push ./models/phi-4-mini \
  /sdcard/Android/data/com.as1as.helpwantedhero/files/Models/phi-4-mini/
```

TMP Custom Range values (ASCII + Korean 가-힣 + ㄱ-ㅣ, decimal only):

```
32-126,44032-55203,12593-12686
```

QNN HTP on device:

```
Failed in loading stub: dlopen failed: library "libcdsprpc.so" not found
Failed to create transport for device, error: 4000
```

Inference engine notes:

- 32 layers, 8 KV heads, head_size 128
- Prefill (full prompt at once) → Decode (one token at a time)
- past_key_values / present tensor management

What I Learned

- Min SDK 31 is required for vendor library declarations — and therefore for QNN HTP acceleration
- Don't put large files in StreamingAssets. Anything there gets compressed into the APK
- NNAPI is not full NPU acceleration. Most LLM operators fall back to CPU
- TMP Custom Range is decimal only
- 3.8B parameters on CPU is not viable for a game. NPU is not optional