Local LLM Setup Guide (v42)

Overview & Prerequisites

Framework Comparison

Step-by-Step Installation

1. Install Dependencies

2. Install llama.cpp

3. Download Model

Model Selection Guide

Quantization Types Explained

API Setup and Integration

Ollama Setup (Recommended for API)

Custom API Integration

Systemd Service for 24/7 Operation

Monitoring and Performance Tuning

Performance Monitoring

Tuning Parameters

Real Command Examples

Document Processing Pipeline

Batch Processing Script

Troubleshooting Tips

Developer's Guide to Local LLM Deployment

Running LLMs locally does not require specialized hardware, but it does require a significant amount of RAM. 8GB is sufficient for basic use cases such as a 4-bit quantized 7B model; 16GB or more is recommended for larger models.

Recommendation: Use llama.cpp for a minimal setup, or Ollama for a production-ready API.

For 24/7 operation, create a systemd service file so that the server starts automatically at boot.

This guide provides a complete foundation for local LLM deployment that balances performance, cost, and usability. The recommended setup supports real-world

📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)
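The RAM guidance in this section (8GB for basic use, 16GB+ for larger models) can be sanity-checked with rough arithmetic: parameter count times bits per weight, plus some headroom for the context cache and runtime. The sketch below is illustrative only. The bits-per-weight figures are approximate, and the helper name `estimate_ram_gb` and the 1.5 GB default overhead are assumptions, not values from this guide.

```python
# Rough RAM estimate for a quantized model. All figures are approximations.

# Approximate bits per weight for the quantization types named in this guide.
# Assumed values for illustration; real files vary slightly by model.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_ram_gb(params_billions, quant, overhead_gb=1.5):
    """Return an approximate RAM requirement in GB (1 GB = 1e9 bytes).

    overhead_gb is an assumed allowance for the KV cache and runtime buffers.
    """
    bits = APPROX_BITS_PER_WEIGHT[quant]
    weights_gb = params_billions * 1e9 * bits / 8 / 1e9
    return weights_gb + overhead_gb


if __name__ == "__main__":
    for quant in APPROX_BITS_PER_WEIGHT:
        print(f"7B {quant}: ~{estimate_ram_gb(7, quant):.1f} GB")
```

Under these assumptions a 7B model at Q4_K_M fits comfortably within 8GB, while the same model at F16 needs roughly twice that, which is consistent with the 16GB+ recommendation.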


Overview & Prerequisites

- Linux 64-bit system (Ubuntu 20.04+ recommended)
- 8GB+ RAM (16GB+ for larger models)
- NVIDIA GPU with CUDA support (optional but highly recommended)
- 20GB+ disk space
- Python 3.8+

    # Check system requirements
    lscpu | grep -i "model name"
    free -h
    nvidia-smi  # if GPU available

1. Install Dependencies

    # Update system
    sudo apt update && sudo apt upgrade -y

    # Install build tools
    sudo apt install build-essential git cmake python3-pip -y

    # Install CUDA (if using GPU)
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
    sudo dpkg -i cuda-keyring_1.0-1_all.deb
    sudo apt-get update
    sudo apt-get install cuda-toolkit-11-8 -y

2. Install llama.cpp

    # Clone repository
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp

    # Build with GPU support
    make clean
    make -j$(nproc) LLAMA_CUDA=1

    # Verify installation
    ./main --help

    # Return to the parent directory; later commands use ./llama.cpp/main
    cd ..

3. Download Model

    # Create model directory (next to llama.cpp, not inside it)
    mkdir -p models
    cd models

    # Download a 7B model (example: LLaMA-v2)
    wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf

    # Verify download
    ls -la llama-2-7b-chat.Q4_K_M.gguf
    cd ..

Model Selection Guide

    # Example: Run LLaMA-2-7B
    ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
      -c 2048 \
      -n 128 \
      --temp 0.7 \
      -p "Q: What is the capital of France? A:"

Quantization Types Explained

- Q4_K_M: 4-bit k-quant, medium variant (best balance of size and quality)
- Q5_K_M: 5-bit k-quant (better quality, slightly larger)
- Q8_0: 8-bit quantization (highest quality among the quantized types, largest quantized file)
- F16: Unquantized 16-bit float (half precision, largest files)

    # Convert model to different quantizations
    # Note: the input here is already quantized, so --allow-requantize is needed,
    # and the result cannot be better than the Q4_K_M source. Start from an F16
    # GGUF where one is available.
    ./llama.cpp/quantize --allow-requantize models/llama-2-7b-chat.Q4_K_M.gguf models/llama-2-7b-chat.Q5_K_M.gguf Q5_K_M

    # Benchmark different quantizations
    ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf -n 128 --temp 0.7 -p "Hello world"

Ollama Setup (Recommended for API)

    # Install Ollama
    curl -fsSL https://ollama.com/install.sh | sh

    # Start Ollama service
    ollama serve &

    # Pull model
    ollama pull llama2:7b-chat

    # Test API
    curl http://localhost:11434/api/generate -d '{
      "model": "llama2:7b-chat",
      "prompt": "Why is the sky blue?",
      "stream": false
    }'

Custom API Integration

    # api_client.py
    import requests


    class LocalLLM:
        """Minimal client for the Ollama HTTP API."""

        def __init__(self, base_url="http://localhost:11434"):
            self.base_url = base_url

        def generate(self, prompt, model="llama2:7b-chat"):
            # Non-streaming request: the full completion arrives in one JSON body
            response = requests.post(
                f"{self.base_url}/api/generate",
                json={"model": model, "prompt": prompt, "stream": False},
                timeout=300,
            )
            response.raise_for_status()
            return response.json()["response"]


    # Usage
    llm = LocalLLM()
    result = llm.generate("Extract key information from this invoice: [INVOICE DATA]")
    print(result)

Systemd Service for 24/7 Operation

The unit below runs llama.cpp's HTTP server binary (server) instead of the one-shot main CLI, which would exit after generating and be restarted in a loop. The host and port are assumed values; adjust them to your setup.

    # Create service file
    sudo nano /etc/systemd/system/llm.service

    # Service configuration
    [Unit]
    Description=Local LLM Server
    After=network.target

    [Service]
    Type=simple
    User=developer
    WorkingDirectory=/home/developer/llama.cpp
    ExecStart=/home/developer/llama.cpp/server -m /home/developer/models/llama-2-7b-chat.Q4_K_M.gguf -c 2048 --host 127.0.0.1 --port 8080
    Restart=always
    RestartSec=10

    [Install]
    WantedBy=multi-user.target

    # Enable and start the service
    sudo systemctl daemon-reload
    sudo systemctl enable llm.service
    sudo systemctl start llm.service
    sudo systemctl status llm.service

Performance Monitoring

    # Monitor GPU usage
    nvidia-smi -l 1

    # Monitor CPU and memory
    htop

    # Benchmark inference time
    time ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf -n 128 -p "Test prompt"

Tuning Parameters

    # Optimize for speed
    ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
      -c 2048 \
      -n 32 \
      --temp 0.1 \
      --repeat-penalty 1.1

    # Optimize for quality
    ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
      -c 2048 \
      -n 256 \
      --temp 0.8 \
      --repeat-penalty 1.2

Document Processing Pipeline

    # Extract text from PDF
    pdftotext -layout document.pdf extracted.txt

    # Process with LLM
    ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
      -n 256 \
      -p "Extract structured data from this receipt text: $(cat extracted.txt)" \
      --temp 0.3 > output.json

    # Validate JSON
    # Note: stdout may include the echoed prompt and other non-JSON text,
    # in which case this check fails and the output needs cleaning first.
    python3 -m json.tool output.json

Batch Processing Script

    #!/bin/bash
    # batch_processor.sh
    for file in receipts/*.pdf; do
      pdftotext -layout "$file" temp.txt
      ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
        -n 256 \
        -p "Extract invoice data: $(cat temp.txt)" \
        --temp 0.1 > "${file%.pdf}.json"
      rm temp.txt
    done

Troubleshooting Tips

    # Common troubleshooting commands
    ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf -c 1024 -n 64 --temp 0.7 -p "Test"

    # Check GPU memory
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv

- Memory issues: Reduce the context size (-c parameter)
- Slow startup: Use smaller models (Phi-2, TinyLlama)
- GPU memory issues: Offload fewer layers to the GPU with -ngl (--n-gpu-layers), or use -ngl 0 to run on the CPU only
- Permission denied: Check file permissions and user groups
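
The document processing pipeline validates the model's output with json.tool, but a local model often wraps the JSON in extra text (an echoed prompt, a greeting, a closing remark), and validation then fails. The sketch below shows one way to recover the first JSON object from such output before validating it. The helper name `extract_first_json` is hypothetical and not part of llama.cpp or Ollama, and it only looks for JSON objects, not top-level arrays.

```python
import json


def extract_first_json(text):
    """Return the first JSON object found in text, or None if there is none.

    Tries to decode at each '{' in turn, so stray braces in the surrounding
    prose are skipped rather than treated as an error.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return None


if __name__ == "__main__":
    raw = 'Sure! Here is the data: {"vendor": "ACME", "total": 12.5} Hope this helps.'
    print(extract_first_json(raw))
```

In the batch script, the same helper could be applied to each output file, and files for which it returns None could be flagged for reprocessing.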