Local LLM Setup Guide (v17)
Overview & Prerequisites
Framework Comparison
Step-by-Step Installation
1. Install llama.cpp
2. Download a Model
3. Test Basic Inference
4. Setup Ollama (Alternative)
Model Selection Guide
Quantization Types Explained
API Setup and Integration
Simple HTTP Server with llama.cpp
Python Integration Example
Systemd Service for 24/7 Operation
Monitoring and Performance Tuning
GPU Memory Monitoring
Performance Testing
Memory Optimization Flags
Real Command Examples
Complete Setup Script
API Integration with curl
Configuration Files
Default llama.cpp Settings
Environment Variables
Benchmark Results

Running large language models locally requires understanding hardware constraints and software requirements. This guide assumes you're working with a modern Linux system (Ubuntu 20.04+ recommended) with at least 8GB RAM and a GPU with CUDA support (RTX 30xx or newer).

Hardware Requirements:

Prerequisites Installation:

Recommendation: Use llama.cpp for development, Ollama for quick testing, and vLLM for production.

For Chat Applications: Mistral-7B-v0.1 or Phi-3-mini
For Code Generation: CodeLlama-7B or StarCoder2
For Research: Llama-3-8B or Mixtral-8x7B
For Memory-Limited Systems: TinyLlama or Phi-2

Quantization reduces model size and memory use, and usually speeds up inference, at some cost in output quality:

Create /etc/systemd/system/local-llm.service:

Create /opt/llama.cpp/config.json:

Model: Mistral-7B Q5_K_M
Hardware: RTX 4090, 32GB RAM
Results:

This setup provides a production-ready local LLM environment that costs $3-7 to operate while offering performance comparable to cloud services.

📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)
sudo apt update && sudo apt install -y git curl build-essential python3-pip
cd /opt
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# A plain make produces a CPU-only build; the -ngl GPU offload used later needs a CUDA-enabled build.
# Build options and binary names (main, server) differ between llama.cpp versions, so check the repository's build docs.
make clean && make
cd /opt
mkdir -p models && cd models
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q5_K_M.gguf
cd /opt/llama.cpp
./main -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -p "Hello world" --temp 0.1
curl -fsSL https://ollama.com/install.sh | sh
ollama run mistral
# Download recommended models
cd /opt/models
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q5_K_M.gguf
wget https://huggingface.co/TheBloke/Phi-3-mini-128k-instruct-GGUF/resolve/main/phi-3-mini-128k-instruct.Q5_K_M.gguf
# Convert model to different quantizations
# The convert scripts only write unquantized or 8-bit GGUF files; K-quants such as Q5_K_M come from the quantize tool.
./quantize /path/to/model-f16.gguf /path/to/model-Q5_K_M.gguf Q5_K_M
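To get a feel for what a quantization choice means on disk, file size can be approximated as parameter count times bits per weight. The sketch below is a rough estimate only: the bits-per-weight figures are assumptions back-calculated from typical 7B GGUF file sizes (K-quant mixes keep some tensors at higher precision, so they are not exactly 4, 5, or 6 bits), and `estimate_file_size_gb` is a hypothetical helper, not part of llama.cpp.

```python
# Approximate effective bits per weight for each quantization type.
# ASSUMPTION: rough figures back-calculated from typical 7B GGUF file sizes.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.83,
    "Q5_K_M": 5.67,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
}


def estimate_file_size_gb(n_params: float, quant: str) -> float:
    """Estimate the model file size in GB (10^9 bytes) for a quantization type."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    n_params = 7.24e9  # ASSUMPTION: parameter count of a Mistral-7B-class model
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimate_file_size_gb(n_params, quant):.1f} GB")
```

The estimate covers the model file only; the context (KV cache) and runtime buffers need memory on top of it.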
# Run model as HTTP server (the server binary, not main, accepts --host and --port)
# --host 0.0.0.0 listens on all interfaces; use 127.0.0.1 to keep the API local
./server -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf --host 0.0.0.0 --port 8080 -ngl 33
import requests


def call_local_llm(prompt):
    """Send a prompt to the local server and return the generated text."""
    response = requests.post(
        "http://localhost:8080/completion",
        json={"prompt": prompt, "n_predict": 100},
        timeout=120,  # generation can take a while on slower hardware
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()["content"]


# Usage
result = call_local_llm("Explain quantum computing in simple terms")
print(result)
[Unit]
Description=Local LLM Service
After=network.target

[Service]
Type=simple
User=your_user
WorkingDirectory=/opt/llama.cpp
ExecStart=/opt/llama.cpp/server -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf --host 0.0.0.0 --port 8080 -ngl 33
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable local-llm
sudo systemctl start local-llm
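After `systemctl start`, the server still needs time to load the model before it can answer requests. A minimal sketch of a readiness check follows, assuming the server exposes a `/health` endpoint that returns HTTP 200 once the model is loaded (llama.cpp's server is understood to do this in recent versions; verify on your build). The names `http_probe` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


def wait_until_ready(url, probe=http_probe, attempts=30, delay=1.0, sleep=time.sleep):
    """Poll `probe(url)` until it succeeds or `attempts` run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to sleep after the final failed attempt
    return False
```

A typical call would be `wait_until_ready("http://localhost:8080/health")`, placed before the first completion request. The probe and sleep functions are injectable so the loop can be exercised without a running server.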
# Monitor GPU usage
watch -n 1 nvidia-smi

# Check memory usage of running process
nvidia-smi pmon -c 1
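For logging or alerting, the same numbers can be read from a script. The sketch below assumes `nvidia-smi` is on the PATH and uses its CSV query mode (`--query-gpu=memory.used,memory.total`), which reports values in MiB; `parse_gpu_memory` and `read_gpu_memory` are hypothetical helper names.

```python
import subprocess
from typing import List, Tuple

QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' lines (MiB) into (used, total) tuples, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_gpu_memory() -> List[Tuple[int, int]]:
    """Run nvidia-smi once and return the memory usage of every GPU."""
    out = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)


if __name__ == "__main__":
    for index, (used, total) in enumerate(read_gpu_memory()):
        print(f"GPU {index}: {used} / {total} MiB ({100 * used / total:.0f}%)")
```

Parsing is kept separate from the subprocess call so it can be checked against a sample string on a machine without a GPU.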
# Benchmark model performance
./main -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -p "Benchmark test" --temp 0.1 --n-predict 100
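A single run is easier to compare across settings when it is reduced to tokens per second. The sketch below shows the arithmetic and a small timing wrapper; `tokens_per_second` and `time_generation` are hypothetical helpers, and `generate` stands for whatever callable produces the tokens (for example a function that calls the local API).

```python
import time


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput for a generation of `n_tokens` that took `elapsed_s` seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return n_tokens / elapsed_s


def time_generation(generate, n_tokens, clock=time.perf_counter):
    """Time one call to generate(n_tokens); return (elapsed seconds, tokens/s)."""
    start = clock()
    generate(n_tokens)
    elapsed = clock() - start
    return elapsed, tokens_per_second(n_tokens, elapsed)
```

With the figure reported in this guide (about 1.2 s for 100 tokens), `tokens_per_second(100, 1.2)` comes to roughly 83 tokens per second. Note that a wall-clock measurement like this includes prompt processing, so short generations understate the steady-state rate.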
# For high memory systems
./main -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -ngl 33 --ctx-size 8192 --temp 0.1

# For low memory systems
./main -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -ngl 10 --ctx-size 2048 --temp 0.1
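The context size matters for memory because the key/value cache grows linearly with it. A back-of-the-envelope sketch follows, assuming a Mistral-7B-shaped model (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache; these defaults are assumptions to adjust for other models, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a given context length.

    The leading factor 2 accounts for one key tensor and one value tensor per layer.
    ASSUMPTION: defaults describe a Mistral-7B-shaped model with an f16 cache.
    """
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    for n_ctx in (2048, 8192):
        gib = kv_cache_bytes(n_ctx) / 1024**3
        print(f"ctx={n_ctx}: {gib:.2f} GiB")
```

Under these assumptions the cache takes about 0.25 GiB at 2048 tokens and 1 GiB at 8192, on top of the model weights.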
#!/bin/bash
# setup-local-llm.sh
set -euo pipefail  # stop at the first failing step

# Build llama.cpp
cd /opt
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make clean && make

# Download model
mkdir -p /opt/models
cd /opt/models
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q5_K_M.gguf

# Run benchmark (absolute path, since the working directory is now /opt/models)
echo "Starting benchmark..."
/opt/llama.cpp/main -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -p "Test" --temp 0.1 --n-predict 50

echo "Setup complete. Run 'systemctl start local-llm' to start the service."
# Basic API test
curl -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a python function to reverse a string", "n_predict": 100}'

# Streaming response
curl -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain neural networks", "n_predict": 100, "stream": true}'
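With `"stream": true` the completion arrives in pieces rather than as one JSON document. A minimal sketch for consuming such a stream from Python follows, assuming each event is a line of the form `data: {json}` carrying a `"content"` field, with `"stop": true` on the final event (the shape llama.cpp's server is understood to use; check it against your build). The function name `iter_stream_content` is hypothetical.

```python
import json


def iter_stream_content(lines):
    """Yield the "content" text from each `data: {...}` line of a streamed response."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        payload = json.loads(line[len("data:"):].strip())
        text = payload.get("content", "")
        if text:
            yield text
        if payload.get("stop"):
            break  # the server marked this as the final event
```

With the `requests` library, the lines could come from `requests.post(url, json={"prompt": "...", "n_predict": 100, "stream": True}, stream=True).iter_lines()`, printing each piece as it is yielded.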
{
  "model_path": "/opt/models/mistral-7b-v0.1.Q5_K_M.gguf",
  "port": 8080,
  "host": "0.0.0.0",
  "n_gpu_layers": 33,
  "ctx_size": 8192,
  "temp": 0.1,
  "n_predict": 100
}
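A file like this is only useful if something reads it. The sketch below assumes a small launcher of your own turns the JSON keys into command-line flags for the server; the key-to-flag mapping, the binary path `/opt/llama.cpp/server`, and the function names are all assumptions for illustration. `temp` and `n_predict` are treated as per-request sampling settings rather than startup flags.

```python
import json

# ASSUMPTION: mapping from this guide's config keys to server command-line flags.
FLAG_FOR_KEY = {
    "model_path": "-m",
    "host": "--host",
    "port": "--port",
    "n_gpu_layers": "-ngl",
    "ctx_size": "--ctx-size",
}


def load_config(path):
    """Read the JSON config file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def build_server_command(config, binary="/opt/llama.cpp/server"):
    """Build the argv list for starting the server from a config dict."""
    cmd = [binary]
    for key, flag in FLAG_FOR_KEY.items():
        if key in config:
            cmd += [flag, str(config[key])]
    return cmd


def request_defaults(config):
    """Collect sampling settings that are sent in each request body."""
    defaults = {}
    if "temp" in config:
        defaults["temperature"] = config["temp"]
    if "n_predict" in config:
        defaults["n_predict"] = config["n_predict"]
    return defaults
```

The resulting list can be passed to `subprocess.run(build_server_command(load_config("/opt/llama.cpp/config.json")))`, which avoids shell quoting problems with paths.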
# Add to ~/.bashrc
export LOCAL_LLM_MODEL="/opt/models/mistral-7b-v0.1.Q5_K_M.gguf"
export LOCAL_LLM_PORT="8080"
export LOCAL_LLM_NGL="33"
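These `LOCAL_LLM_*` variables are this guide's own naming, so it is your scripts that have to read them. A minimal sketch of doing that from Python follows, falling back to the guide's values when a variable is unset; `load_settings` is a hypothetical helper.

```python
import os


def load_settings(env=None):
    """Read the LOCAL_LLM_* variables, using the guide's values as defaults."""
    env = os.environ if env is None else env
    return {
        "model": env.get("LOCAL_LLM_MODEL", "/opt/models/mistral-7b-v0.1.Q5_K_M.gguf"),
        "port": int(env.get("LOCAL_LLM_PORT", "8080")),
        "n_gpu_layers": int(env.get("LOCAL_LLM_NGL", "33")),
    }


if __name__ == "__main__":
    settings = load_settings()
    print(f"Model: {settings['model']}")
    print(f"API:   http://localhost:{settings['port']}/completion")
```

Converting the port and layer count to integers at load time means a typo in `~/.bashrc` fails immediately with a `ValueError` instead of surfacing later as a malformed URL or flag.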

Hardware Requirements:
- CPU: 4+ cores (8+ recommended)
- RAM: 16GB+ (32GB+ for larger models)
- GPU: NVIDIA RTX 30xx or newer with CUDA support
- Storage: 50GB+ free space (models can be 2-10GB each)

Quantization types:
- Q4_K_M: 4-bit, good quality, smallest files of the four; good for most use cases
- Q5_K_M: 5-bit, balanced quality/performance
- Q6_K: 6-bit, excellent quality, larger files
- Q8_0: 8-bit, minimal quality loss, largest files and highest memory use

Benchmark results (Mistral-7B Q5_K_M, RTX 4090, 32GB RAM):
- Context: 8192 tokens
- Response time: ~1.2s for 100 tokens
- GPU memory usage: ~12GB

Tuning tips:
- Use --ctx-size (-c) to increase the context window
- Increase -ngl (--n-gpu-layers) to offload more layers to the GPU
- Lower --temp for more deterministic output; it does not change generation speed
- Use --n-predict to limit generation length