Local LLM Setup Guide (v17)

2026-05-25 · admin

Overview & Prerequisites

Framework Comparison

Step-by-Step Installation

1. Install llama.cpp

2. Download a Model

3. Test Basic Inference

4. Set Up Ollama (Alternative)

Model Selection Guide

Quantization Types Explained

API Setup and Integration

Simple HTTP Server with llama.cpp

Python Integration Example

Systemd Service for 24/7 Operation

Monitoring and Performance Tuning

GPU Memory Monitoring

Performance Testing

Memory Optimization Flags

Real Command Examples

Complete Setup Script

API Integration with curl

Configuration Files

Default llama.cpp Settings

Environment Variables

Benchmark Results

Running large language models locally requires understanding hardware constraints and software requirements. This guide assumes a modern Linux system (Ubuntu 20.04+ recommended) with at least 8GB RAM and an NVIDIA GPU with CUDA support (RTX 30xx or newer).

Recommendation: Use llama.cpp for development, Ollama for quick testing, and vLLM for production.

For Chat Applications: Mistral-7B-v0.1 or Phi-3-mini

For Code Generation: CodeLlama-7B or StarCoder2

For Research: Llama-3-8B or Mixtral-8x7B

For Memory-Limited Systems: TinyLlama or Phi-2

Quantization reduces model size and memory use, and usually speeds up inference, at some cost in output quality.

The systemd unit file is /etc/systemd/system/local-llm.service and the settings file is /opt/llama.cpp/config.json.

Benchmark setup:

Model: Mistral-7B Q5_K_M

Hardware: RTX 4090, 32GB RAM

This setup provides a production-ready local LLM environment that costs $3-7 to operate while offering performance comparable to cloud services.

📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)


Prerequisites Installation:

    sudo apt update && sudo apt install -y git curl build-essential cmake python3-pip

1. Install llama.cpp:

    # /opt is usually root-owned: run these as a user with write access there
    cd /opt
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    # Current llama.cpp builds with CMake; the old `make clean && make` no longer works.
    # -DGGML_CUDA=ON needs the CUDA toolkit; drop it for a CPU-only build.
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j

2. Download a Model:

    mkdir -p /opt/models && cd /opt/models
    wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q5_K_M.gguf

3. Test Basic Inference:

    cd /opt/llama.cpp
    # llama-cli is the current name of the binary older checkouts called ./main;
    # check build/bin if your checkout names it differently
    ./build/bin/llama-cli -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -p "Hello world" --temp 0.1

4. Set Up Ollama (Alternative):

    curl -fsSL https://ollama.com/install.sh | sh
    ollama run mistral
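Once `ollama run mistral` has pulled the model, Ollama also answers HTTP requests on the local machine. The sketch below assumes Ollama's default address (`http://localhost:11434`) and its `/api/generate` endpoint. The helper names `build_ollama_payload` and `ask_ollama` are illustrative and not part of the original guide.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default Ollama address


def build_ollama_payload(prompt, model="mistral"):
    """Build the JSON body for a single, non-streaming generation request."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask_ollama(prompt, model="mistral", url=OLLAMA_URL, timeout=120):
    """POST the prompt to a running Ollama instance and return the reply text."""
    body = json.dumps(build_ollama_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

With Ollama running, `print(ask_ollama("Hello world"))` should print the model's reply. Nothing is sent until `ask_ollama` is called.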
Download recommended models:

    cd /opt/models
    wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q5_K_M.gguf
    # Confirm that this repository and file exist on Hugging Face before relying on the URL
    wget https://huggingface.co/TheBloke/Phi-3-mini-128k-instruct-GGUF/resolve/main/phi-3-mini-128k-instruct.Q5_K_M.gguf

Convert a model to a different quantization:

    # Quantization is done by llama-quantize on an existing f16 or f32 GGUF file,
    # not by a convert script with --outtype q5_k_m
    cd /opt/llama.cpp
    ./build/bin/llama-quantize /path/to/model-f16.gguf /path/to/model-Q5_K_M.gguf Q5_K_M

Simple HTTP Server with llama.cpp:

    # The HTTP server is its own binary, llama-server (./server in older checkouts);
    # the command-line binary does not serve HTTP.
    # 0.0.0.0 listens on all interfaces; use 127.0.0.1 to accept local connections only.
    cd /opt/llama.cpp
    ./build/bin/llama-server -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf --host 0.0.0.0 --port 8080 -ngl 33

Python Integration Example:

    import requests

    def call_local_llm(prompt):
        """Send a prompt to the local llama.cpp server and return the generated text."""
        response = requests.post(
            "http://localhost:8080/completion",
            json={"prompt": prompt, "n_predict": 100},
            timeout=120,
        )
        response.raise_for_status()  # fail loudly on HTTP errors
        return response.json()["content"]

    # Usage
    result = call_local_llm("Explain quantum computing in simple terms")
    print(result)
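The Python example above waits for the whole completion. When a request sets `"stream": true`, the server sends the text in pieces as server-sent events, one `data: {...}` line per chunk. The following is a minimal sketch of reading that stream with only the standard library. It assumes each event carries a JSON object with a `content` field, as the non-streaming response does; `parse_stream_line` and `stream_local_llm` are hypothetical helper names.

```python
import json
import urllib.request


def parse_stream_line(raw_line):
    """Return the text chunk in one server-sent-event line, or None if it has none."""
    line = raw_line.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None  # blank separator lines and other SSE fields
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    return json.loads(payload).get("content")


def stream_local_llm(prompt, url="http://localhost:8080/completion", n_predict=100):
    """Yield text chunks from the local server as they are generated."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict, "stream": True})
    request = urllib.request.Request(
        url, data=body.encode("utf-8"), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        for raw_line in response:  # the response object iterates line by line
            chunk = parse_stream_line(raw_line)
            if chunk:
                yield chunk
```

A caller would typically loop over `stream_local_llm("Explain neural networks")` and print each chunk with `end=""` and `flush=True` so the text appears as it arrives.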
Systemd Service for 24/7 Operation:

Create /etc/systemd/system/local-llm.service:

    [Unit]
    Description=Local LLM Service
    After=network.target

    [Service]
    Type=simple
    User=your_user
    WorkingDirectory=/opt/llama.cpp
    # llama-server is the HTTP server binary produced by the CMake build
    ExecStart=/opt/llama.cpp/build/bin/llama-server -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf --host 0.0.0.0 --port 8080 -ngl 33
    Restart=always
    RestartSec=10

    [Install]
    WantedBy=multi-user.target

Then reload systemd, enable the service at boot, and start it:

    sudo systemctl daemon-reload
    sudo systemctl enable local-llm
    sudo systemctl start local-llm

GPU Memory Monitoring:

    # Monitor GPU usage, refreshed every second
    watch -n 1 nvidia-smi

    # Take one sample of per-process GPU usage
    nvidia-smi pmon -c 1
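For logging or alerting, the same memory figures can be read from a script instead of watched by eye. This sketch uses `nvidia-smi` in query mode (`--query-gpu=memory.used,memory.total --format=csv,noheader,nounits`), which prints one `used, total` row in MiB per GPU. The function names are illustrative.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB rows, one per GPU, into a list of dicts."""
    gpus = []
    for row in csv_text.strip().splitlines():
        used, total = (int(field) for field in row.split(","))
        gpus.append({"used_mib": used, "total_mib": total, "used_fraction": used / total})
    return gpus


def read_gpu_memory():
    """Run nvidia-smi in query mode and return the parsed rows."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(output)
```

On a machine with an NVIDIA driver installed, `read_gpu_memory()` returns one dictionary per GPU. Parsing is kept separate so it can be checked without a GPU.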
Performance Testing:

    # Benchmark model performance
    # (llama-cli is the current name of ./main and lives in build/bin after a CMake build)
    cd /opt/llama.cpp
    ./build/bin/llama-cli -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -p "Benchmark test" --temp 0.1 --n-predict 100

Memory Optimization Flags:

    # The context-size flag is -c / --ctx-size; there is no --ctx flag

    # For high memory systems
    ./build/bin/llama-cli -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -ngl 33 --ctx-size 8192 --temp 0.1

    # For low memory systems
    ./build/bin/llama-cli -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -ngl 10 --ctx-size 2048 --temp 0.1
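The context size matters for memory because the key/value cache grows linearly with the number of tokens kept in context. A rough estimate for the two settings above follows. It assumes Mistral-7B's published architecture (32 layers, 8 key/value heads, head dimension 128) and a 16-bit cache. Real usage is higher, since the runtime also allocates compute buffers.

```python
def kv_cache_bytes(ctx_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate KV-cache size: keys and values for every layer and every token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = keys + values
    return ctx_tokens * per_token


for ctx in (2048, 8192):
    print(f"ctx={ctx}: {kv_cache_bytes(ctx) / 2**20:.0f} MiB")
# prints:
# ctx=2048: 256 MiB
# ctx=8192: 1024 MiB
```

Under these assumptions, moving from a 2048-token to an 8192-token context costs roughly 768 MiB of additional memory, on top of the model weights.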
Complete Setup Script:

    #!/bin/bash
    # setup-local-llm.sh
    set -euo pipefail   # stop at the first failed step

    cd /opt
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    # Current llama.cpp builds with CMake (needs cmake, plus the CUDA toolkit for GPU builds)
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j

    # Download model
    mkdir -p /opt/models
    cd /opt/models
    wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q5_K_M.gguf

    # Run benchmark (absolute path, because the working directory is now /opt/models)
    echo "Starting benchmark..."
    /opt/llama.cpp/build/bin/llama-cli -m /opt/models/mistral-7b-v0.1.Q5_K_M.gguf -p "Test" --temp 0.1 --n-predict 50

    echo "Setup complete. Run 'sudo systemctl start local-llm' to start the service."
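After the service is started, the model can take a while to load, so requests sent immediately may fail. A small readiness check is sketched below. It assumes the llama.cpp server's `GET /health` endpoint on the port used in this guide; `server_is_healthy` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def server_is_healthy(url="http://localhost:8080/health", timeout=2):
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503 while loading
        return False


def wait_until_ready(check=server_is_healthy, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll `check` until it succeeds; return the attempt number, or None on timeout."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        sleep(delay)
    return None
```

Calling `wait_until_ready()` after `systemctl start local-llm` blocks for up to about a minute with the defaults and returns `None` if the server never came up. The `check` and `sleep` parameters exist so the loop can be exercised without a running server.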
API Integration with curl:

    # Basic API test
    curl -X POST http://localhost:8080/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Write a python function to reverse a string", "n_predict": 100}'

    # Streaming response (-N turns off curl's output buffering)
    curl -N -X POST http://localhost:8080/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Explain neural networks", "n_predict": 100, "stream": true}'

Default llama.cpp Settings:

Create /opt/llama.cpp/config.json. llama.cpp does not load this file by itself; it records the settings used in this guide so that your own launch scripts can read them.

    {
      "model_path": "/opt/models/mistral-7b-v0.1.Q5_K_M.gguf",
      "port": 8080,
      "host": "0.0.0.0",
      "n_gpu_layers": 33,
      "ctx_size": 8192,
      "temp": 0.1,
      "n_predict": 100
    }

Environment Variables:

The LOCAL_LLM_* names are this guide's own convention for use in scripts.

    # Add to ~/.bashrc
    export LOCAL_LLM_MODEL="/opt/models/mistral-7b-v0.1.Q5_K_M.gguf"
    export LOCAL_LLM_PORT="8080"
    export LOCAL_LLM_NGL="33"
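One way to put the settings file and the environment variables to work is a small launcher that turns them into a server command line. The sketch below assumes the config keys and `LOCAL_LLM_*` names shown above, and a server binary at `/opt/llama.cpp/build/bin/llama-server`; adjust the path if your build differs. The function names are illustrative. `temp` and `n_predict` are left out because they can be sent with each request.

```python
import json
import os


def build_server_command(config, binary="/opt/llama.cpp/build/bin/llama-server"):
    """Translate the guide's config keys into llama-server command-line flags."""
    return [
        binary,
        "-m", config["model_path"],
        "--host", config["host"],
        "--port", str(config["port"]),
        "-ngl", str(config["n_gpu_layers"]),
        "--ctx-size", str(config["ctx_size"]),
    ]


def apply_env_overrides(config, env=os.environ):
    """Let the LOCAL_LLM_* variables override values loaded from the JSON file."""
    merged = dict(config)
    if "LOCAL_LLM_MODEL" in env:
        merged["model_path"] = env["LOCAL_LLM_MODEL"]
    if "LOCAL_LLM_PORT" in env:
        merged["port"] = int(env["LOCAL_LLM_PORT"])
    if "LOCAL_LLM_NGL" in env:
        merged["n_gpu_layers"] = int(env["LOCAL_LLM_NGL"])
    return merged


def load_config(path="/opt/llama.cpp/config.json"):
    """Read the settings file and apply any environment overrides."""
    with open(path, encoding="utf-8") as handle:
        return apply_env_overrides(json.load(handle))
```

The resulting list can be passed to `subprocess.run(build_server_command(load_config()))`, or printed and pasted into the `ExecStart=` line of the systemd unit.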
Hardware Requirements:

- CPU: 4+ cores (8+ recommended)
- RAM: 16GB+ (32GB+ for larger models)
- GPU: NVIDIA RTX 30xx or newer with CUDA support
- Storage: 50GB+ free space (models can be 2-10GB each)

Quantization types:

- Q4_K_M: 4-bit, smallest files of the four, good quality for most use cases
- Q5_K_M: 5-bit, balanced quality and size
- Q6_K: 6-bit, excellent quality, larger files
- Q8_0: 8-bit, minimal quality loss, largest files and highest memory use

Benchmark results (Mistral-7B Q5_K_M on an RTX 4090 with 32GB RAM):

- Context: 8192 tokens
- Response time: ~1.2s for 100 tokens
- GPU memory usage: ~12GB

Performance tuning:

- Use -c (--ctx-size) to increase the context window
- Increase -ngl (--n-gpu-layers) to offload more layers to the GPU
- Lower --temp for more deterministic output (temperature does not change generation speed)
- Use --n-predict to limit generation length
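To see how the quantization types listed above translate into disk and memory footprint, multiply the parameter count by the bits per weight. The sketch below uses the nominal bit widths from the list and an assumed 7 billion parameters. Real GGUF files are somewhat larger, because each block of weights also stores scale factors and the K-quant mixes keep some tensors at higher precision, so treat the result as a lower bound.

```python
NOMINAL_BITS = {"Q4_K_M": 4, "Q5_K_M": 5, "Q6_K": 6, "Q8_0": 8}


def min_size_gb(n_params, quant):
    """Lower-bound file size in GB (10^9 bytes) from nominal bits per weight."""
    return n_params * NOMINAL_BITS[quant] / 8 / 1e9


for quant in NOMINAL_BITS:
    print(f"{quant}: at least {min_size_gb(7e9, quant):.1f} GB for 7B parameters")
```

For a 7B model this gives at least 3.5 GB at 4 bits and 7.0 GB at 8 bits, which is why the lower-bit quantizations are the usual choice on GPUs with limited memory.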
