Tools: Latest: Local LLM Setup Guide (v17)

Local LLM Setup Guide (v17)

Overview & Prerequisites

Framework Comparison

Step-by-Step Installation

1. Install llama.cpp

2. Download a Model

3. Test Basic Inference

4. Setup Ollama (Alternative)

Model Selection Guide

Quantization Types Explained

API Setup and Integration

Simple HTTP Server with llama.cpp

Python Integration Example

Systemd Service for 24/7 Operation

Monitoring and Performance Tuning

GPU Memory Monitoring

Performance Testing

Memory Optimization Flags

Real Command Examples

Complete Setup Script

API Integration with curl

Configuration Files

Default llama.cpp Settings

Environment Variables

Benchmark Results

Running large language models locally requires understanding hardware constraints and software requirements. This guide assumes you're working with a modern Linux system (Ubuntu 20.04+ recommended) with at least 8GB RAM and a GPU with CUDA support (RTX 30xx or newer). The full hardware requirements and the prerequisites installation command appear with the commands and reference lists further down.

Recommendation: Use llama.cpp for development, Ollama for quick testing, and vLLM for production.

For Chat Applications: Mistral-7B-v0.1 or Phi-3-mini

For Code Generation: CodeLlama-7B or StarCoder2

For Research: Llama-3-8B or Mixtral-8x7B

For Memory-Limited Systems: TinyLlama or Phi-2

Quantization reduces model size and memory use, which usually speeds up inference, at some cost in output quality.

The systemd unit file is created at /etc/systemd/system/local-llm.service and the configuration file at /opt/llama.cpp/config.json.

Benchmark setup:

Model: Mistral-7B Q5_K_M

Hardware: RTX 4090, 32GB RAM

This setup provides a production-ready local LLM environment that costs $3-7 to operate while offering performance comparable to cloud services.

Prerequisites Installation:

    sudo apt update && sudo apt install -y git curl build-essential python3-pip

1. Install llama.cpp

    # Assumes your user can write to /opt
    cd /opt
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    # Plain make produces a CPU-only build; -ngl only helps with a CUDA-enabled build.
    # Recent llama.cpp releases build with CMake and name the binaries llama-cli and llama-server.
    make clean && make

2. Download a Model

    cd /opt
    mkdir -p models && cd models
    wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q5_K_M.gguf

3. Test Basic Inference

    cd /opt/llama.cpp
    ./main -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -p "Hello world" --temp 0.1

4. Setup Ollama (Alternative)

    curl -fsSL https://ollama.com/install.sh | sh
    ollama run mistral

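Before building, it can help to confirm that the tools from the prerequisites step are actually on the PATH. The sketch below is one way to check this from Python with only the standard library. The function name `missing_tools` and the default tool list are illustrative choices, not part of llama.cpp or Ollama.

```python
import shutil

# Tools the installation steps rely on; adjust the list to your setup.
REQUIRED_TOOLS = ["git", "curl", "make", "wget", "nvidia-smi"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

An empty result means every listed tool was found; a missing `nvidia-smi` usually points to a missing or broken NVIDIA driver install.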
Model Selection Guide

    # Download recommended models
    cd /opt/models
    wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q5_K_M.gguf
    # Verify that this Phi-3 repository URL resolves before relying on it
    wget https://huggingface.co/TheBloke/Phi-3-mini-128k-instruct-GGUF/resolve/main/phi-3-mini-128k-instruct.Q5_K_M.gguf

Quantization Types Explained

    # Convert model to different quantizations
    # (quantize takes an f16/f32 GGUF file as input; it is named llama-quantize in recent releases)
    ./quantize /path/to/model-f16.gguf /path/to/model-Q5_K_M.gguf Q5_K_M

API Setup and Integration

Simple HTTP Server with llama.cpp

    # Run model as HTTP server
    # (uses the server binary rather than main; it is named llama-server in recent releases)
    ./server -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf --host 0.0.0.0 --port 8080 -ngl 33

Python Integration Example

    import requests

    def call_local_llm(prompt):
        """Send a prompt to the local llama.cpp server and return the generated text."""
        response = requests.post(
            "http://localhost:8080/completion",
            json={"prompt": prompt, "n_predict": 100},
            timeout=120,  # generation can be slow; avoid hanging forever
        )
        response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
        return response.json()["content"]

    # Usage
    result = call_local_llm("Explain quantum computing in simple terms")
    print(result)

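The requests-based helper can also be written with only the standard library, which avoids an extra dependency on a minimal server. This is a sketch under assumptions: it swaps `requests` for `urllib.request`, and the `complete` function name and its `opener` parameter are hypothetical, added only so the function can be exercised without a running server. It relies on the same `/completion` endpoint and `content` field used in this guide.

```python
import json
import urllib.request


def complete(prompt, n_predict=100,
             url="http://localhost:8080/completion",
             timeout=120, opener=urllib.request.urlopen):
    """POST a prompt to the local server and return the generated text."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # `opener` defaults to urlopen; a stand-in can be passed for offline testing.
    with opener(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    if "content" not in body:
        raise ValueError(f"unexpected response keys: {sorted(body)}")
    return body["content"]
```

With the server running, `complete("Explain quantum computing in simple terms")` should behave like the requests version; checking for the `content` key turns a malformed reply into a clear error.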
Systemd Service for 24/7 Operation

Create /etc/systemd/system/local-llm.service:

    [Unit]
    Description=Local LLM Service
    After=network.target

    [Service]
    Type=simple
    User=your_user
    WorkingDirectory=/opt/llama.cpp
    # Uses the HTTP server binary rather than main (llama-server in recent releases)
    ExecStart=/opt/llama.cpp/server -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf --host 0.0.0.0 --port 8080 -ngl 33
    Restart=always
    RestartSec=10

    [Install]
    WantedBy=multi-user.target

Then reload systemd, enable the service at boot, and start it:

    sudo systemctl daemon-reload
    sudo systemctl enable local-llm
    sudo systemctl start local-llm

Monitoring and Performance Tuning

GPU Memory Monitoring

    # Monitor GPU usage
    watch -n 1 nvidia-smi

    # Check memory usage of running process
    nvidia-smi pmon -c 1

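For logging GPU memory over time rather than watching it interactively, the CSV query mode of nvidia-smi is easier to parse than the default table. A minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_memory` and `read_gpu_memory` are illustrative.

```python
import subprocess

# Ask nvidia-smi for used and total memory per GPU, in MiB, without headers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (MiB) into a list of (used, total) int tuples."""
    readings = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def read_gpu_memory():
    """Run nvidia-smi once and return one (used_mib, total_mib) tuple per GPU."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(output)
```

Calling `read_gpu_memory()` while the model is loaded gives a number that can be compared against the GPU memory usage figure reported in the benchmark.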
Performance Testing

    # Benchmark model performance
    ./main -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -p "Benchmark test" --temp 0.1 --n-predict 100

Memory Optimization Flags

    # For high memory systems
    ./main -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -ngl 33 --ctx-size 8192 --temp 0.1

    # For low memory systems
    ./main -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -ngl 10 --ctx-size 2048 --temp 0.1

Real Command Examples

Complete Setup Script

    #!/bin/bash
    # setup-local-llm.sh
    set -e  # stop at the first failing command

    cd /opt
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make clean && make

    # Download model
    mkdir -p /opt/models
    cd /opt/models
    wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q5_K_M.gguf

    # Run benchmark (main lives in /opt/llama.cpp, not in /opt/models)
    echo "Starting benchmark..."
    /opt/llama.cpp/main -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -p "Test" --temp 0.1 --n-predict 50

    echo "Setup complete. Run 'systemctl start local-llm' to start service."

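To compare configurations from the Performance Testing step, it is convenient to reduce each run to tokens per second. The sketch below shows the arithmetic and a small timing wrapper. `time_generation` and its `generate` callable are hypothetical names; any function that produces `n_predict` tokens (for example a call to the local server) could be passed in.

```python
import time


def tokens_per_second(n_tokens, seconds):
    """Throughput for a run that produced n_tokens in the given wall time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return n_tokens / seconds


def time_generation(generate, prompt, n_predict):
    """Call generate(prompt, n_predict) and return (result, tokens per second).

    Assumes the model really produced n_predict tokens; if it stopped early,
    the figure is an upper bound.
    """
    start = time.perf_counter()
    result = generate(prompt, n_predict)
    elapsed = time.perf_counter() - start
    return result, tokens_per_second(n_predict, elapsed)


# The figure reported in this guide: 100 tokens in about 1.2 s.
print(round(tokens_per_second(100, 1.2), 1))  # 83.3
```

Wall-clock timing includes prompt processing and HTTP overhead, so it will read lower than the generation-only speed the llama.cpp binaries print themselves.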
API Integration with curl

    # Basic API test
    curl -X POST http://localhost:8080/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Write a python function to reverse a string", "n_predict": 100}'

    # Streaming response
    curl -X POST http://localhost:8080/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Explain neural networks", "n_predict": 100, "stream": true}'

Configuration Files

Default llama.cpp Settings

Create /opt/llama.cpp/config.json:

    {
      "model_path": "/opt/models/mistral-7b-v0.1.Q5_K_M.gguf",
      "port": 8080,
      "host": "0.0.0.0",
      "n_gpu_layers": 33,
      "ctx_size": 8192,
      "temp": 0.1,
      "n_predict": 100
    }

Environment Variables

    # Add to ~/.bashrc
    export LOCAL_LLM_MODEL="/opt/models/mistral-7b-v0.1.Q5_K_M.gguf"
    export LOCAL_LLM_PORT="8080"
    export LOCAL_LLM_NGL="33"

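The LOCAL_LLM_* variables are this guide's own convention, so they only take effect if something reads them. As one possible use, the sketch below turns them into a server command line. The binary path, the `build_server_command` name, and the fallback values are assumptions made for illustration; the fallbacks mirror the values exported in the Environment Variables step.

```python
import os

# Fallbacks mirror the values exported in the guide's ~/.bashrc example.
DEFAULTS = {
    "LOCAL_LLM_MODEL": "/opt/models/mistral-7b-v0.1.Q5_K_M.gguf",
    "LOCAL_LLM_PORT": "8080",
    "LOCAL_LLM_NGL": "33",
}


def build_server_command(env=None, binary="/opt/llama.cpp/server"):
    """Build an argv list for the HTTP server from LOCAL_LLM_* variables.

    The default binary path is an assumption; point it at your own build.
    """
    env = os.environ if env is None else env
    settings = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    port = int(settings["LOCAL_LLM_PORT"])  # fail early on a non-numeric port
    ngl = int(settings["LOCAL_LLM_NGL"])    # same for the GPU layer count
    return [
        binary,
        "-m", settings["LOCAL_LLM_MODEL"],
        "--port", str(port),
        "-ngl", str(ngl),
    ]


if __name__ == "__main__":
    print(" ".join(build_server_command()))
```

Returning a list rather than a string lets the result go straight into `subprocess.run` without shell quoting issues.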
Hardware Requirements:

- CPU: 4+ cores (8+ recommended)
- RAM: 16GB+ (32GB+ for larger models)
- GPU: NVIDIA RTX 30xx or newer with CUDA support
- Storage: 50GB+ free space (models can be 2-10GB each)

Quantization types:

- Q4_K_M: 4-bit, smallest files, good quality for most use cases
- Q5_K_M: 5-bit, balanced quality/performance
- Q6_K: 6-bit, excellent quality, larger files
- Q8_0: 8-bit, minimal quality loss, largest files of the four

Benchmark results (Mistral-7B Q5_K_M, RTX 4090, 32GB RAM):

- Context: 8192 tokens
- Response time: ~1.2s for 100 tokens
- GPU memory usage: ~12GB

Tuning flags:

- Use --ctx-size to increase context window
- Increase -ngl for more GPU layers
- Lower --temp for more deterministic output (temperature does not change generation speed)
- Use --n-predict to limit generation length
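To make the trade-off between the quantization types concrete, a rough file-size estimate is parameter count times bits per weight. The sketch below uses the nominal bit widths from the quantization list. Real GGUF files come out somewhat larger, because the formats store extra scale data and keep some tensors at higher precision, so treat the numbers as lower bounds. The function name and the 7-billion parameter count are illustrative assumptions.

```python
# Nominal bit widths as given in the guide's quantization list.
NOMINAL_BITS = {"Q4_K_M": 4, "Q5_K_M": 5, "Q6_K": 6, "Q8_0": 8}


def estimate_size_gb(n_params, quant):
    """Rough lower bound on file size in GB (10^9 bytes) for a quantized model."""
    bits = NOMINAL_BITS[quant]
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    # Assumes a model with roughly 7 billion parameters.
    for quant in NOMINAL_BITS:
        print(f"{quant}: ~{estimate_size_gb(7e9, quant):.2f} GB")
```

The same arithmetic gives a first guess at whether a model fits in GPU memory, before adding the context cache on top.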