Tools: Essential Guide: Local LLM Setup Guide (v5)
Local LLM Setup Guide (v5)
Practical Installation & Optimization Guide for Developers
1. Overview & Prerequisites
2. Framework Comparison
3. Step-by-Step Installation (llama.cpp)
4. Model Selection Guide
5. Quantization Types Explained
6. API Setup and Integration
7. Systemd Service for 24/7 Operation
8. Monitoring and Performance Tuning
9. Real Command Examples

Running LLMs locally requires minimal hardware but significant attention to memory management and system configuration.

Hardware Requirements:
OS: Ubuntu 22.04 LTS or Debian 12

Recommendation: Use llama.cpp for development, Ollama for quick testing.

The rest of the guide covers model choices for chat applications and for research/development, a benchmark comparison (1000 tokens, RTX 4090), the quantization command, a simple HTTP API with Python and its integration with existing tools, the systemd unit at /etc/systemd/system/local-llm.service, a memory monitoring script, performance optimization flags, a complete workflow example, and a quick deployment command.

Prerequisites Installation:

sudo apt update
sudo apt install -y git cmake build-essential python3-pip
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
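Before building, it can help to confirm that the tools installed above are actually on the PATH. The following is a minimal sketch, not part of llama.cpp: the `REQUIRED_TOOLS` list and the `missing_tools` helper are illustrative names, and `wget` is included only because later steps in this guide use it.

```python
import shutil

# Tools the build and download steps rely on; adjust to your setup.
REQUIRED_TOOLS = ["git", "cmake", "make", "gcc", "python3", "pip3", "wget"]


def missing_tools(tools):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All prerequisite tools found.")
```

Running this once after the apt step gives a quick signal before a long build fails halfway.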
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make clean
make

# Download a model (example: Mistral 7B)
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf -P models/

# Basic inference test
# Note: newer llama.cpp releases build with CMake and name this binary llama-cli
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf -p "Hello world" --temp 0.2
# Chat model
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf \
  -p "You are a helpful AI assistant. User: What is 2+2? Assistant:" \
  --temp 0.1 --repeat_penalty 1.1

# Code model
./main -m models/codellama-7b.Q4_K_M.gguf \
  -p "def fibonacci(n):" \
  --temp 0.0 --repeat_penalty 1.0
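The two commands above differ only in model, prompt, and sampling flags, so the flags can be kept as named presets. This is a small sketch under that assumption; `PRESETS` and `build_command` are hypothetical names, and the values are copied from the commands above.

```python
# Sampling presets taken from the chat and code commands above.
PRESETS = {
    "chat": {"temp": "0.1", "repeat_penalty": "1.1"},
    "code": {"temp": "0.0", "repeat_penalty": "1.0"},
}


def build_command(task, model_path, prompt, binary="./main"):
    """Build the argv list for one inference run using the preset for `task`."""
    preset = PRESETS[task]  # raises KeyError for an unknown task
    return [
        binary, "-m", model_path, "-p", prompt,
        "--temp", preset["temp"],
        "--repeat_penalty", preset["repeat_penalty"],
    ]


if __name__ == "__main__":
    print(build_command("code", "models/codellama-7b.Q4_K_M.gguf", "def fibonacci(n):"))
```

Returning a list rather than a string means the result can be passed to `subprocess.run` without shell quoting.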
Quantization command:

# Convert the Hugging Face checkpoint to GGUF (f16), then quantize it to Q4_K_M
python3 convert-hf-to-gguf.py models/Mistral-7B-v0.1/ --outtype f16 --outfile mistral-7b-f16.gguf
./quantize mistral-7b-f16.gguf mistral-7b-q4k.gguf Q4_K_M
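To see why the quantization type matters for memory, a rough size estimate is parameters times bits per weight. The sketch below assumes approximate bits-per-weight figures (about 4.85 for Q4_K_M and 5.7 for Q5_K_M; Q8_0 stores 8.5 bits per weight) and roughly 7.24 billion parameters for Mistral 7B. These are estimates, not measured file sizes, and the helper name is illustrative.

```python
# Approximate bits per weight; rough figures, not exact for every model.
APPROX_BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

# Assumed parameter count for Mistral 7B (about 7.24 billion).
MISTRAL_7B_PARAMS = 7.24e9


def estimate_weights_gb(n_params, quant):
    """Estimate the size of the model weights in GB (10**9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    for quant in APPROX_BITS_PER_WEIGHT:
        size = estimate_weights_gb(MISTRAL_7B_PARAMS, quant)
        print(f"{quant}: about {size:.1f} GB of weights")
```

The estimate covers weights only; the context cache and runtime buffers need additional memory on top.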
Simple HTTP API with Python:

# server.py
from flask import Flask, request, jsonify
import subprocess

app = Flask(__name__)


@app.route('/generate', methods=['POST'])
def generate():
    data = request.get_json(silent=True) or {}
    prompt = data.get('prompt')
    if not prompt:
        # Reject requests without a prompt instead of failing with a 500
        return jsonify({'error': 'missing "prompt"'}), 400
    model_path = data.get('model', 'models/mistral-7b-v0.1.Q4_K_M.gguf')

    # Run one llama.cpp inference per request and capture its stdout
    result = subprocess.run([
        './main', '-m', model_path, '-p', prompt,
        '--temp', '0.2', '--repeat_penalty', '1.1'
    ], capture_output=True, text=True)
    return jsonify({'response': result.stdout.strip()})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
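The handler above passes the client-supplied `model` value straight to the binary, and the server listens on all interfaces. One possible hardening step is to validate the request body first. The sketch below is an assumption about how that could look, not part of the original server: `MODELS_DIR`, `MAX_PROMPT_CHARS`, and `validate_payload` are hypothetical, and unlike the handler above it accepts only a bare file name for `model`.

```python
from pathlib import Path

MODELS_DIR = Path("models")   # hypothetical: directory holding the allowed models
MAX_PROMPT_CHARS = 4000       # hypothetical limit; tune for your context size
DEFAULT_MODEL = "mistral-7b-v0.1.Q4_K_M.gguf"


def validate_payload(data):
    """Return (prompt, model_path) or raise ValueError for a bad request body."""
    if not isinstance(data, dict):
        raise ValueError("request body must be a JSON object")
    prompt = data.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("'prompt' must be a non-empty string")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("'prompt' is too long")
    name = data.get("model", DEFAULT_MODEL)
    # Accept a bare file name only, so callers cannot point at arbitrary paths.
    if not isinstance(name, str) or Path(name).name != name or not name.endswith(".gguf"):
        raise ValueError("'model' must be a .gguf file name without directories")
    return prompt, (MODELS_DIR / name).as_posix()
```

In the Flask handler, a `ValueError` from this function would map naturally to a 400 response.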
Integration with existing tools:

# Test API
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain quantum computing in simple terms:"}'
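The same request can be sent from Python with only the standard library, which is handy when integrating with existing scripts. This is a minimal sketch assuming the server from this guide is running on localhost:8000 and returns a JSON object with a `response` field; the function names are illustrative.

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://localhost:8000"):
    """Build the POST request for the /generate endpoint."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, base_url="http://localhost:8000", timeout=120):
    """Send the prompt and return the 'response' field of the JSON reply."""
    request = build_generate_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```

With the server running, `generate("Explain quantum computing in simple terms:")` mirrors the curl call. The generous timeout reflects that each request runs a full inference.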
Create /etc/systemd/system/local-llm.service:

[Unit]
Description=Local LLM Service
After=network.target

[Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
# The llama.cpp server binary accepts --host and --port; main is a one-shot CLI
ExecStart=/home/developer/llama.cpp/server -m /home/developer/models/mistral-7b-v0.1.Q4_K_M.gguf --port 8080 --host 0.0.0.0
Restart=always
RestartSec=10
Environment=LD_LIBRARY_PATH=/usr/local/cuda/lib64

[Install]
WantedBy=multi-user.target
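A unit file with a typo usually shows up only when the service fails to start, so a quick check before installing it can save a restart loop. The sketch below parses a simple unit with Python's `configparser` and checks two basics. It assumes a unit without repeated keys (systemd allows them, this parser does not), and `parse_unit` and `check_unit` are hypothetical helpers.

```python
import configparser


def parse_unit(text):
    """Parse simple systemd unit text into {section: {key: value}}."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep key case (ExecStart, not execstart)
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


def check_unit(unit):
    """Return a list of problems found in a parsed service unit."""
    problems = []
    exec_start = unit.get("Service", {}).get("ExecStart", "")
    if not exec_start.startswith("/"):
        problems.append("ExecStart should use an absolute path")
    if "WantedBy" not in unit.get("Install", {}):
        problems.append("[Install] has no WantedBy, so enabling the unit has nothing to link")
    return problems
```

For example, `check_unit(parse_unit(open("/etc/systemd/system/local-llm.service").read()))` returns an empty list when both checks pass.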
sudo systemctl daemon-reload
sudo systemctl enable local-llm
sudo systemctl start local-llm
sudo systemctl status local-llm
Memory monitoring script:

#!/bin/bash
# monitor.sh
while true; do
  echo "Timestamp: $(date)"
  echo "GPU Memory:"
  nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
  echo "System Memory:"
  free -h
  echo "---"
  sleep 30
done
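The nvidia-smi query in the script prints one `used, total` pair per GPU in MiB, which is easy to turn into a usage percentage for alerts or logs. A minimal sketch, assuming that output format; `parse_gpu_memory` and `read_gpu_memory` are illustrative names.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB lines into a list of (used, total, percent)."""
    rows = []
    for line in csv_text.strip().splitlines():
        used_str, total_str = (field.strip() for field in line.split(","))
        used, total = int(used_str), int(total_str)
        rows.append((used, total, round(100 * used / total, 1)))
    return rows


def read_gpu_memory():
    """Run the same nvidia-smi query as monitor.sh and parse its output."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)
```

Keeping the parsing separate from the subprocess call makes it possible to test the logic on a machine without a GPU.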
Performance optimization flags:

# For high throughput (1000+ tokens/sec)
./main -m model.gguf \
  --threads 16 \
  --ctx-size 4096 \
  --batch-size 512 \
  --temp 0.0 \
  --repeat_penalty 1.0 \
  --n-predict 1000

# For low latency (sub-100ms response)
./main -m model.gguf \
  --threads 8 \
  --ctx-size 1024 \
  --batch-size 64 \
  --temp 0.2 \
  --repeat_penalty 1.1
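The `--ctx-size` values above also change how much memory the key/value cache needs, since the cache grows linearly with context length. The sketch below estimates that cost for a 16-bit cache, assuming Mistral 7B's published shape (32 layers, 8 key/value heads, head size 128). Treat it as an approximation; actual allocation depends on the runtime.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, bytes_per_value=2):
    """Bytes for keys and values across all layers at a full context."""
    # Factor 2 covers one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_value


# Assumed Mistral 7B shape: 32 layers, 8 KV heads (grouped-query attention), head size 128.
MISTRAL_7B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (1024, 4096):
    mib = kv_cache_bytes(ctx_size=ctx, **MISTRAL_7B) / 2**20
    print(f"--ctx-size {ctx}: about {mib:.0f} MiB of KV cache")
```

Under these assumptions the two contexts come to about 128 MiB and 512 MiB, which is one reason the low-latency profile uses the smaller context.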
Complete workflow example:

# 1. Setup directory
mkdir -p ~/llm-dev/models
cd ~/llm-dev

# 2. Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# 3. Download model
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf -P models/

# 4. Test inference
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf \
  -p "Write a bash script that checks disk space:" \
  --temp 0.1 --repeat_penalty 1.0

# 5. Monitor performance
watch -n 1 nvidia-smi
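A failed or redirected download can leave an HTML error page where the model file should be, and the inference step then fails with a confusing load error. GGUF files begin with the four ASCII bytes `GGUF`, so a quick header check after step 3 catches this early. A minimal sketch; the helper names are illustrative.

```python
GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def has_gguf_magic(header):
    """Return True if the given leading bytes start with the GGUF magic."""
    return header[:4] == GGUF_MAGIC


def looks_like_gguf(path):
    """Read the first four bytes of `path` and check them against the magic."""
    with open(path, "rb") as handle:
        return has_gguf_magic(handle.read(4))
```

For example, `looks_like_gguf("models/mistral-7b-v0.1.Q4_K_M.gguf")` should return `True` for a complete download. It does not verify that the file is complete or uncorrupted, only that it is the right kind of file.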
#!/bin/bash
# benchmark.sh
MODEL_PATH="models/mistral-7b-v0.1.Q4_K_M.gguf"
PROMPT="The quick brown fox jumps over the lazy dog. This is a test prompt for benchmarking."

echo "Starting benchmark..."
# Quote the variables so paths or prompts with spaces stay intact
time ./main -m "$MODEL_PATH" -p "$PROMPT" --temp 0.0 --repeat_penalty 1.0 --n-predict 100
echo "Benchmark complete"
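The `time` output from the script is a wall-clock duration, which can be converted to tokens per second using the `--n-predict` value. The sketch below shows that conversion with hypothetical helper names. Note that the wall-clock figure includes model loading, so it understates pure generation speed.

```python
import subprocess
import time


def time_command(argv):
    """Run a command and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True, capture_output=True)
    return time.perf_counter() - start


def tokens_per_second(n_tokens, elapsed_s):
    """Convert a token count and a duration in seconds into tokens per second."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return n_tokens / elapsed_s
```

A run could look like `tokens_per_second(100, time_command(["./main", "-m", model, "-p", prompt, "--n-predict", "100"]))`, where `model` and `prompt` hold the values from benchmark.sh.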
Quick deployment command:

# One-liner setup and test
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make && \
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/

---

📥 **Get the full guide on Gumroad**: https://gumroad.com/l/auto ($7)
Hardware Requirements:
- GPU: NVIDIA RTX 30xx/40xx series recommended (8GB+ VRAM)
- CPU: Intel i5-12600K or AMD Ryzen 7 5800X
- RAM: Minimum 16GB, 32GB recommended
- Storage: 500GB+ SSD for model storage

For Chat Applications:
- Mistral 7B Q4_K_M (balanced quality vs size)
- Llama 2 7B Q4_K_M (best commercial support)

- CodeLlama 7B Q4_K_M (best for programming tasks)
- Phi-2 Q4_K_M (smaller, fast)

For Research/Development:
- Llama 3 8B Q4_K_M (latest architecture)
- Mixtral 8x7B Q4_K_M (sparse mixture of experts)

Benchmark comparison (1000 tokens, RTX 4090):
- Q4_K_M: 150 tokens/sec
- Q5_K_M: 130 tokens/sec
- Q8_0: 100 tokens/sec
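The throughput figures above translate directly into wait time for the 1000-token benchmark. A short sketch using only the numbers listed in this guide; the names are illustrative.

```python
# Tokens per second from the benchmark list in this guide (RTX 4090).
BENCHMARK_TOKENS_PER_SEC = {"Q4_K_M": 150, "Q5_K_M": 130, "Q8_0": 100}


def seconds_for(n_tokens, tokens_per_sec):
    """Time in seconds to generate `n_tokens` at a steady rate."""
    return n_tokens / tokens_per_sec


for quant, rate in BENCHMARK_TOKENS_PER_SEC.items():
    print(f"{quant}: {seconds_for(1000, rate):.1f} s for 1000 tokens")
```

At these rates, 1000 tokens take roughly 6.7 s, 7.7 s, and 10.0 s respectively, so moving from Q8_0 to Q4_K_M saves about a third of the generation time.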