Tools: Essential Guide: Local LLM Setup Guide (v5)

Local LLM Setup Guide (v5)

Practical Installation & Optimization Guide for Developers

1. Overview & Prerequisites

2. Framework Comparison

3. Step-by-Step Installation (llama.cpp)

4. Model Selection Guide

5. Quantization Types Explained

6. API Setup and Integration

7. Systemd Service for 24/7 Operation

8. Monitoring and Performance Tuning

9. Real Command Examples

Running LLMs locally requires minimal hardware but significant attention to memory management and system configuration.

OS: Ubuntu 22.04 LTS or Debian 12

Recommendation: Use llama.cpp for development, Ollama for quick testing.

The guide covers: hardware requirements; prerequisites installation; models for chat applications and for research/development; a benchmark comparison (1000 tokens, RTX 4090); the quantization command; a simple HTTP API with Python; integration with existing tools; the systemd unit created at /etc/systemd/system/local-llm.service; a memory monitoring script; performance optimization flags; a complete workflow example; and a quick deployment command.
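Because memory is the main constraint, a rough size estimate helps when matching a model to the VRAM and RAM figures in this guide. The sketch below is a back-of-the-envelope calculation, not part of llama.cpp. The function names and the overhead allowance are hypothetical, and the bits-per-weight values are approximate assumptions (the K-quant figures are averages that vary slightly by model).

```python
# Approximate average bits per weight for common GGUF quantization types.
# These are assumptions for estimation only; real file sizes vary by model.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}


def estimate_weights_gb(n_params: float, quant: str) -> float:
    """Estimate the size of the quantized weights in gigabytes (10^9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9


def fits_in_memory(n_params: float, quant: str, budget_gb: float,
                   overhead_gb: float = 1.5) -> bool:
    """Check whether weights plus a fixed overhead allowance fit in a budget.

    overhead_gb is a hypothetical allowance for the KV cache and runtime
    buffers; tune it for your context size.
    """
    return estimate_weights_gb(n_params, quant) + overhead_gb <= budget_gb


if __name__ == "__main__":
    for quant in APPROX_BITS_PER_WEIGHT:
        size = estimate_weights_gb(7e9, quant)
        print(f"7B model at {quant}: ~{size:.1f} GB of weights")
```

Under these assumptions a 7B model at Q4_K_M leaves headroom on an 8 GB card, while the same model at Q8_0 does not.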

    # Install build prerequisites and PyTorch (CUDA 11.8 wheels)
    sudo apt update
    sudo apt install -y git cmake build-essential python3-pip
    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

    # Clone and build llama.cpp
    # (./main is the binary name in older checkouts; newer llama.cpp releases
    # build with CMake and name this binary llama-cli)
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make clean
    make

    # Download a model (example: Mistral 7B)
    wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf -P models/

    # Basic inference test
    ./main -m models/mistral-7b-v0.1.Q4_K_M.gguf -p "Hello world" --temp 0.2
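A multi-gigabyte download can fail quietly, for example by saving an HTML error page under the model's file name. The following is a minimal sketch of a sanity check, assuming the standard little-endian GGUF layout (a four-byte `GGUF` magic followed by a 32-bit version). The function name `read_gguf_version` is hypothetical and not part of llama.cpp.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or None if the file is not GGUF."""
    with Path(path).open("rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # The version is a little-endian unsigned 32-bit integer after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling `read_gguf_version("models/mistral-7b-v0.1.Q4_K_M.gguf")` before the first inference run should return an integer. A result of `None` suggests the file is truncated or is not a model at all.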
    # Chat model
    ./main -m models/mistral-7b-v0.1.Q4_K_M.gguf \
      -p "You are a helpful AI assistant. User: What is 2+2? Assistant:" \
      --temp 0.1 --repeat_penalty 1.1

    # Code model
    ./main -m models/codellama-7b.Q4_K_M.gguf \
      -p "def fibonacci(n):" \
      --temp 0.0 --repeat_penalty 1.0

    # Convert the Hugging Face checkpoint to an f16 GGUF, then quantize it to Q4_K_M
    # (the convert script does not produce K-quants such as q4_k_m; those come from ./quantize)
    python3 convert-hf-to-gguf.py models/Mistral-7B-v0.1/ --outtype f16 --outfile mistral-7b-f16.gguf
    ./quantize mistral-7b-f16.gguf mistral-7b-q4k.gguf Q4_K_M

    # server.py
    from flask import Flask, request, jsonify
    import subprocess

    app = Flask(__name__)

    @app.route('/generate', methods=['POST'])
    def generate():
        # Reject requests without a JSON body or without a prompt.
        data = request.get_json(silent=True) or {}
        prompt = data.get('prompt')
        if not prompt:
            return jsonify({'error': 'missing "prompt"'}), 400
        model_path = data.get('model', 'models/mistral-7b-v0.1.Q4_K_M.gguf')

        # Run from the llama.cpp directory so ./main and models/ resolve.
        result = subprocess.run([
            './main', '-m', model_path, '-p', prompt,
            '--temp', '0.2', '--repeat_penalty', '1.1'
        ], capture_output=True, text=True)

        if result.returncode != 0:
            return jsonify({'error': result.stderr.strip()}), 500
        return jsonify({'response': result.stdout.strip()})

    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=8000)
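For callers that are themselves Python programs, the `/generate` endpoint of `server.py` can be reached with the standard library alone. This is a minimal client sketch, assuming the server runs on `localhost:8000` as in the guide. The helper names `build_generate_request` and `generate` and the 120-second timeout are hypothetical choices.

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://localhost:8000"):
    """Build a POST request for the /generate endpoint of server.py."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, base_url="http://localhost:8000", timeout=120):
    """Send the prompt and return the 'response' field from the JSON reply."""
    request = build_generate_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```

With the Flask server running, `generate("Explain quantum computing in simple terms:")` returns the model output as a string. Building the request in a separate function keeps the payload format easy to inspect without a live server.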
    # Test API
    curl -X POST http://localhost:8000/generate \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Explain quantum computing in simple terms:"}'

    # /etc/systemd/system/local-llm.service
    [Unit]
    Description=Local LLM Service
    After=network.target

    [Service]
    Type=simple
    User=developer
    WorkingDirectory=/home/developer/llama.cpp
    # --host and --port are options of llama.cpp's HTTP server binary, not of ./main,
    # and a one-shot prompt would exit immediately and trigger a restart loop.
    ExecStart=/home/developer/llama.cpp/server -m /home/developer/models/mistral-7b-v0.1.Q4_K_M.gguf --port 8080 --host 0.0.0.0
    Restart=always
    RestartSec=10
    Environment=LD_LIBRARY_PATH=/usr/local/cuda/lib64

    [Install]
    WantedBy=multi-user.target
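Once a unit like this is enabled and started, loading the model can take a while, so a script that depends on the service may want to wait until the port accepts connections. The sketch below polls a TCP port with the standard library. It only confirms that something is listening, not that the model is loaded. The function name `wait_for_port` and its default timings are hypothetical.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=1.0):
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False
```

For the unit above, `wait_for_port("127.0.0.1", 8080)` would be the matching call.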
    # Register, start, and check the service
    sudo systemctl daemon-reload
    sudo systemctl enable local-llm
    sudo systemctl start local-llm
    sudo systemctl status local-llm

    #!/bin/bash
    # monitor.sh
    while true; do
        echo "Timestamp: $(date)"
        echo "GPU Memory:"
        nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
        echo "System Memory:"
        free -h
        echo "---"
        sleep 30
    done
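The `nvidia-smi` query in `monitor.sh` prints one line per GPU with used and total memory in MiB, separated by a comma. A small parser turns that into a percentage that can drive an alert. This is a sketch under that assumption. The function names, the 90% threshold, and the sample values in the usage note are illustrative, not taken from a real run.

```python
def parse_gpu_memory(line):
    """Parse one 'used, total' line (values in MiB) from the nvidia-smi query."""
    used_str, total_str = line.split(",")
    return int(used_str.strip()), int(total_str.strip())


def gpu_memory_percent(line):
    """Return used GPU memory as a percentage of total."""
    used, total = parse_gpu_memory(line)
    return 100.0 * used / total


def is_near_limit(line, threshold_percent=90.0):
    """Flag a GPU whose memory use is at or above the threshold."""
    return gpu_memory_percent(line) >= threshold_percent
```

For an illustrative line such as `"6144, 24576"`, `gpu_memory_percent` gives 25.0 and `is_near_limit` is false.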
    # For high throughput (1000+ tokens/sec)
    ./main -m model.gguf \
      --threads 16 \
      --ctx-size 4096 \
      --batch-size 512 \
      --temp 0.0 \
      --repeat_penalty 1.0 \
      --n-predict 1000

    # For low latency (sub-100ms response)
    ./main -m model.gguf \
      --threads 8 \
      --ctx-size 1024 \
      --batch-size 64 \
      --temp 0.2 \
      --repeat_penalty 1.1

    # 1. Setup directory
    mkdir -p ~/llm-dev/models
    cd ~/llm-dev

    # 2. Install llama.cpp
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make

    # 3. Download model
    wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf -P models/

    # 4. Test inference
    ./main -m models/mistral-7b-v0.1.Q4_K_M.gguf \
      -p "Write a bash script that checks disk space:" \
      --temp 0.1 --repeat_penalty 1.0

    # 5. Monitor performance
    watch -n 1 nvidia-smi
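The two profiles above differ in `--ctx-size`, which also changes memory use: the key/value cache grows linearly with context length. The sketch below works through that arithmetic. The layer and head counts are assumptions about the Mistral 7B architecture, and the 2 bytes per value assumes a 16-bit cache. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, bytes_per_value=2):
    """Size of the key and value caches for one sequence, in bytes.

    Two tensors (K and V) per layer, each ctx_size x n_kv_heads x head_dim.
    bytes_per_value=2 assumes a 16-bit cache.
    """
    return 2 * n_layers * ctx_size * n_kv_heads * head_dim * bytes_per_value


# Assumed Mistral 7B shape: 32 layers, 8 KV heads, head dimension 128.
MISTRAL_7B = {"n_layers": 32, "n_kv_heads": 8, "head_dim": 128}

if __name__ == "__main__":
    for ctx in (1024, 4096):
        mib = kv_cache_bytes(ctx_size=ctx, **MISTRAL_7B) / 2**20
        print(f"--ctx-size {ctx}: ~{mib:.0f} MiB of KV cache")
```

Under these assumptions the cache takes about 128 MiB at a context of 1024 and about 512 MiB at 4096, on top of the model weights.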
    #!/bin/bash
    # benchmark.sh
    MODEL_PATH="models/mistral-7b-v0.1.Q4_K_M.gguf"
    PROMPT="The quick brown fox jumps over the lazy dog. This is a test prompt for benchmarking."

    echo "Starting benchmark..."
    # Quote the variables so paths and prompts containing spaces survive word splitting.
    time ./main -m "$MODEL_PATH" -p "$PROMPT" --temp 0.0 --repeat_penalty 1.0 --n-predict 100
    echo "Benchmark complete"
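The `time` output from `benchmark.sh` is a wall-clock duration, and turning it into tokens per second takes one more step. The following sketch does the same measurement from Python. The helper names are hypothetical. The measured duration includes model load time, and the calculation assumes the run produced every requested token, so the result is a lower bound on generation speed.

```python
import subprocess
import time


def tokens_per_second(n_tokens, elapsed_s):
    """Convert a token count and a duration into a throughput figure."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return n_tokens / elapsed_s


def time_command(argv):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True, capture_output=True)
    return time.perf_counter() - start
```

A run equivalent to the script would pass `["./main", "-m", model_path, "-p", prompt, "--temp", "0.0", "--repeat_penalty", "1.0", "--n-predict", "100"]` to `time_command` and then call `tokens_per_second(100, elapsed)`.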
    # One-liner setup and test
    git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make && \
    wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/

---

📥 **Get the full guide on Gumroad**: https://gumroad.com/l/auto ($7)

Hardware Requirements:

- GPU: NVIDIA RTX 30xx/40xx series recommended (8GB+ VRAM)
- CPU: Intel i5-12600K or AMD Ryzen 7 5800X
- RAM: Minimum 16GB, 32GB recommended
- Storage: 500GB+ SSD for model storage

Models for chat applications and research/development:

- Mistral 7B Q4_K_M (balanced quality vs size)
- Llama 2 7B Q4_K_M (best commercial support)
- CodeLlama 7B Q4_K_M (best for programming tasks)
- Phi-2 Q4_K_M (smaller, fast)
- Llama 3 8B Q4_K_M (latest architecture)
- Mixtral 8x7B Q4_K_M (sparse mixture of experts)

Benchmark comparison (1000 tokens, RTX 4090):

- Q4_K_M: 150 tokens/sec
- Q5_K_M: 130 tokens/sec
- Q8_0: 100 tokens/sec
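The benchmark figures translate directly into waiting time, which is often the easier number to reason about when picking a quantization. The sketch below only does arithmetic on the three throughput values quoted in this guide. The function names are hypothetical.

```python
# Throughput figures quoted in this guide (tokens/sec, 1000 tokens, RTX 4090).
BENCHMARK_TOKENS_PER_SEC = {"Q4_K_M": 150, "Q5_K_M": 130, "Q8_0": 100}


def seconds_for(n_tokens, quant):
    """Time to generate n_tokens at the quoted throughput."""
    return n_tokens / BENCHMARK_TOKENS_PER_SEC[quant]


def speedup(quant, baseline="Q8_0"):
    """Throughput of one quantization relative to a baseline."""
    return BENCHMARK_TOKENS_PER_SEC[quant] / BENCHMARK_TOKENS_PER_SEC[baseline]


if __name__ == "__main__":
    for quant in BENCHMARK_TOKENS_PER_SEC:
        print(f"{quant}: {seconds_for(1000, quant):.1f} s per 1000 tokens, "
              f"{speedup(quant):.2f}x vs Q8_0")
```

On these figures, 1000 tokens take 10 seconds at Q8_0 and under 7 seconds at Q4_K_M, a 1.5x difference.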