Local LLM Setup Guide (v42)
Overview & Prerequisites
Framework Comparison
Step-by-Step Installation
1. Install Dependencies
2. Install llama.cpp
3. Download Model
Model Selection Guide
Quantization Types Explained
API Setup and Integration
Ollama Setup (Recommended for API)
Custom API Integration
Systemd Service for 24/7 Operation
Monitoring and Performance Tuning
Performance Monitoring
Tuning Parameters
Real Command Examples
Document Processing Pipeline
Batch Processing Script
Troubleshooting Tips

Developer's Guide to Local LLM Deployment

Running LLMs locally does not require specialized hardware, but it does require a significant amount of RAM. For basic use cases, 8GB of RAM is sufficient. For larger models, 16GB or more is recommended.

Recommendation: Use llama.cpp for a minimal setup, or Ollama for a production-ready API.

Create a service file for automatic startup:

This guide provides a complete foundation for local LLM deployment that balances performance, cost, and usability. The recommended setup supports real-world

Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)
# Check system requirements
lscpu | grep -i "model name"
free -h
nvidia-smi # if GPU available
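The RAM guidance above (8GB for basic use, 16GB+ for larger models) can be checked programmatically. The following is a minimal sketch, assuming a Linux system that exposes `/proc/meminfo`; the helper names `mem_total_gib` and `ram_tier` are hypothetical and not part of any tool used in this guide.

```python
def mem_total_gib(meminfo_text: str) -> float:
    """Return the MemTotal value from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB (KiB)
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


def ram_tier(total_gib: float) -> str:
    """Map total RAM onto the guide's 8GB / 16GB guidance (hypothetical labels)."""
    if total_gib >= 16:
        return "16GB+: suitable for larger models"
    if total_gib >= 8:
        return "8GB+: sufficient for basic use cases"
    return "below 8GB: under the guide's baseline"
```

To use it, pass the contents of `/proc/meminfo` to `mem_total_gib` and the result to `ram_tier`. Note that the kernel usually reports slightly less than the nominal installed RAM, so a machine sold as 16GB may land just under the 16 GiB threshold.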
# Update system
sudo apt update && sudo apt upgrade -y
# Install build tools
sudo apt install build-essential git cmake python3-pip -y
# Install CUDA (if using GPU)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-11-8 -y
# Clone repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build with GPU support
# (newer llama.cpp releases build with CMake and name this binary llama-cli;
# the commands below match the older Makefile-based layout)
make clean
make -j$(nproc) LLAMA_CUDA=1
# Verify installation
./main --help
# Create model directory
mkdir -p models
cd models
# Download a 7B model (example: LLaMA-v2)
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
# Verify download
ls -la llama-2-7b-chat.Q4_K_M.gguf
# Example: Run LLaMA-2-7B
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -c 2048 \
  -n 128 \
  --temp 0.7 \
  -p "Q: What is the capital of France? A:"
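When the same command is launched from a script, building the argument list in code avoids shell-quoting problems with prompts. This is a minimal sketch that reuses only the flags shown in the example above (`-m`, `-c`, `-n`, `--temp`, `-p`); the function name `build_main_args` and its defaults are assumptions for illustration.

```python
import shlex
from typing import List


def build_main_args(
    model_path: str,
    prompt: str,
    ctx: int = 2048,
    n_predict: int = 128,
    temp: float = 0.7,
    binary: str = "./llama.cpp/main",  # path used in this guide; adjust as needed
) -> List[str]:
    """Build an argv list for the llama.cpp main binary (hypothetical helper)."""
    return [
        binary,
        "-m", model_path,
        "-c", str(ctx),
        "-n", str(n_predict),
        "--temp", str(temp),
        "-p", prompt,
    ]


args = build_main_args(
    "models/llama-2-7b-chat.Q4_K_M.gguf",
    "Q: What is the capital of France? A:",
)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(args))
```

The resulting list can be handed to `subprocess.run(args, capture_output=True, text=True)`, which passes the prompt as a single argument without going through a shell.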
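# Convert model to different quantizations
# (the input here is already quantized; quantize normally expects an F16/F32 GGUF
# and needs --allow-requantize for a quantized source, at some cost in quality)
./llama.cpp/quantize --allow-requantize models/llama-2-7b-chat.Q4_K_M.gguf models/llama-2-7b-chat.Q5_K_M.gguf Q5_K_M
# Benchmark different quantizations
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf -n 128 --temp 0.7 -p "Hello world"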
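To get a feel for how quantization level affects file size before downloading or converting, a back-of-the-envelope estimate can help. The sketch below uses the nominal bit width of each format (4, 5, 8, 16); this is an assumption that ignores per-block scale data and metadata, so real GGUF files are somewhat larger than these figures.

```python
# Nominal bits per weight for each format (assumption: ignores block scales
# and file metadata, so treat results as lower bounds).
NOMINAL_BITS = {"Q4_K_M": 4, "Q5_K_M": 5, "Q8_0": 8, "F16": 16}


def nominal_size_gb(n_params: float, quant: str) -> float:
    """Estimate model size in GB from parameter count and nominal bit width."""
    return n_params * NOMINAL_BITS[quant] / 8 / 1e9


for quant in NOMINAL_BITS:
    print(f"7B at {quant}: at least {nominal_size_gb(7e9, quant):.1f} GB")
```

Comparing the estimate against free RAM (or GPU memory) gives a quick first filter on which quantization is worth trying.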
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama service
ollama serve &
# Pull model
ollama pull llama2:7b-chat
# Test API
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:7b-chat",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
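The request above sets `"stream": false`. With streaming enabled, Ollama's generate endpoint returns one JSON object per line, each carrying a `response` fragment and a `done` flag. The following sketch shows one way to reassemble such a stream; the helper name `join_stream` is hypothetical, and the sample lines are made up for illustration rather than captured from a real server.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate 'response' fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of the generation
    return "".join(parts)


# Illustrative sample, not real server output.
sample = [
    '{"response": "The sky", "done": false}',
    '{"response": " is blue.", "done": true}',
]
print(join_stream(sample))
```

With the `requests` library, the same function could consume `response.iter_lines(decode_unicode=True)` from a request made with `stream=True`.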
# api_client.py
import requests


class LocalLLM:
    def __init__(self, base_url="http://localhost:11434"):
        self.base_url = base_url

    def generate(self, prompt, model="llama2:7b-chat"):
        # Send a non-streaming generation request to the local Ollama API
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                "stream": False
            }
        )
        return response.json()['response']


# Usage
llm = LocalLLM()
result = llm.generate("Extract key information from this invoice: [INVOICE DATA]")
print(result)
# Create service file
sudo nano /etc/systemd/system/llm.service

# Service configuration
# Note: main runs a single generation and then exits, so with Restart=always
# systemd will relaunch it every 10 seconds. For a long-running service,
# point ExecStart at a server process instead.
[Unit]
Description=Local LLM Server
After=network.target

[Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/main -m /home/developer/models/llama-2-7b-chat.Q4_K_M.gguf -c 2048 -n 128 --temp 0.7
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable llm.service
sudo systemctl start llm.service
sudo systemctl status llm.service
# Monitor GPU usage
nvidia-smi -l 1
# Monitor CPU and memory
htop
# Benchmark inference time
time ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf -n 128 -p "Test prompt"
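Wall-clock time from `time` is easier to compare across runs when converted to tokens per second. The sketch below shows the arithmetic plus a small timing wrapper; the names `tokens_per_second` and `timed` are hypothetical. Keep in mind that a whole-process timing includes model load time, so it understates pure generation speed.

```python
import time
from typing import Callable, Tuple


def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """Convert a token count and elapsed time into throughput."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return n_tokens / seconds


def timed(fn: Callable[[], object]) -> Tuple[object, float]:
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Example: 128 generated tokens (the -n value above) in 4 seconds.
print(tokens_per_second(128, 4.0))
```

In practice, `timed` could wrap a `subprocess.run` call to the benchmark command, with the `-n` value passed as `n_tokens`.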
# Optimize for speed
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -c 2048 \
  -n 32 \
  --temp 0.1 \
  --repeat-penalty 1.1
# Optimize for quality
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -c 2048 \
  -n 256 \
  --temp 0.8 \
  --repeat-penalty 1.2
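The same two profiles can be carried over to the Ollama API, which accepts sampling settings in an `options` object (`num_ctx`, `num_predict`, `temperature`, `repeat_penalty`). The sketch below maps the speed and quality values from the commands above onto such a payload; the `PRESETS` table and `build_payload` helper are illustrative assumptions, not part of Ollama itself.

```python
from typing import Dict

# Values copied from the llama.cpp examples above, renamed to Ollama's
# option keys (-c -> num_ctx, -n -> num_predict, and so on).
PRESETS: Dict[str, Dict[str, float]] = {
    "speed": {"num_ctx": 2048, "num_predict": 32, "temperature": 0.1, "repeat_penalty": 1.1},
    "quality": {"num_ctx": 2048, "num_predict": 256, "temperature": 0.8, "repeat_penalty": 1.2},
}


def build_payload(prompt: str, preset: str, model: str = "llama2:7b-chat") -> dict:
    """Build a request body for POST /api/generate using a named preset."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": dict(PRESETS[preset]),  # copy so callers can tweak safely
    }


payload = build_payload("Summarize this receipt.", "speed")
print(payload["options"])
```

The payload can be sent with `requests.post("http://localhost:11434/api/generate", json=payload)`.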
# Extract text from PDF
pdftotext -layout document.pdf extracted.txt
# Process with LLM
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -n 256 \
  -p "Extract structured data from this receipt text: $(cat extracted.txt)" \
  --temp 0.3 > output.json
# Validate JSON
python3 -m json.tool output.json
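Model output often wraps the JSON in extra text (an echoed prompt, a greeting, a closing remark), in which case `json.tool` rejects the whole file. One possible workaround, sketched here with a hypothetical helper `extract_json`, is to scan for the first parseable JSON object using the standard library's `JSONDecoder.raw_decode`.

```python
import json


def extract_json(text: str):
    """Return the first JSON object that parses, scanning from each '{'."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON at this brace; try the next one.
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found")


# Illustrative model output, not a real transcript.
raw = 'Sure! {"total": 12.5, "vendor": "ACME"} Hope this helps.'
print(extract_json(raw))
```

This only recovers syntactically valid JSON; it does not check that the fields match what the prompt asked for, so a schema check may still be needed afterward.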
#!/bin/bash
# batch_processor.sh
for file in receipts/*.pdf; do
  pdftotext -layout "$file" temp.txt
  ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
    -n 256 \
    -p "Extract invoice data: $(cat temp.txt)" \
    --temp 0.1 > "${file%.pdf}.json"
  rm temp.txt
done
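The loop above can also be expressed in Python, which makes it easier to add error handling or swap the backend later. This is a sketch under stated assumptions: `output_path_for` and `process_batch` are hypothetical names, and the `extract` callable stands in for whatever actually runs `pdftotext` and the model.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def output_path_for(pdf: Path) -> Path:
    """Mirror the shell expansion ${file%.pdf}.json."""
    return pdf.with_suffix(".json")


def process_batch(pdfs: Iterable[Path], extract: Callable[[Path], str]) -> List[Path]:
    """Run extract on each PDF and write the result next to it as .json."""
    written = []
    for pdf in sorted(pdfs):
        out = output_path_for(pdf)
        out.write_text(extract(pdf), encoding="utf-8")
        written.append(out)
    return written
```

A typical call would be `process_batch(Path("receipts").glob("*.pdf"), extract)`, where `extract` wraps the PDF-to-text and LLM steps from the script above.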
# Common troubleshooting commands
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf -c 1024 -n 64 --temp 0.7 -p "Test"
# Check GPU memory
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
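The CSV output of the GPU memory query can be parsed in a script, for example to warn before loading a model. The sketch below assumes the output has one header line followed by rows such as `1234 MiB, 8192 MiB`; that layout and the helper name `parse_gpu_memory` are assumptions, so check them against the output on your own machine.

```python
from typing import List, Tuple


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' rows (in MiB) from nvidia-smi CSV output."""
    lines = [ln.strip() for ln in csv_text.splitlines() if ln.strip()]
    rows = []
    for line in lines[1:]:  # skip the header row
        used, total = (field.strip().split()[0] for field in line.split(","))
        rows.append((int(used), int(total)))
    return rows


# Assumed sample of the command's output, for illustration only.
sample_output = "memory.used [MiB], memory.total [MiB]\n1234 MiB, 8192 MiB\n"
for used, total in parse_gpu_memory(sample_output):
    print(f"{used} / {total} MiB used ({100 * used / total:.0f}%)")
```

Each tuple corresponds to one GPU, in the order nvidia-smi lists them.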
Prerequisites
- Linux 64-bit system (Ubuntu 20.04+ recommended)
- 8GB+ RAM (16GB+ for larger models)
- NVIDIA GPU with CUDA support (optional but highly recommended)
- 20GB+ disk space
- Python 3.8+

Quantization Types Explained
- Q4_K_M: 4-bit k-quant, medium variant (best balance)
- Q5_K_M: 5-bit k-quant (better quality, slightly larger)
- Q8_0: 8-bit quantization (highest quality among the quantized formats, largest quantized file)
- F16: Unquantized 16-bit float (largest files)

Troubleshooting Tips
- Memory issues: Reduce context size (-c parameter)
- Slow startup: Use smaller models (Phi-2, TinyLlama)
- GPU memory issues: Offload fewer layers to the GPU with -ngl (--n-gpu-layers), or set it to 0 to run on the CPU
- Permission denied: Check file permissions and user groups