Local LLM Setup Guide (v30)
Overview & Prerequisites
Framework Comparison
Step-by-Step Installation
1. Installing llama.cpp
2. Downloading and preparing the model
Model Selection Guide
Quantization Types Explained
API Setup and Integration
1. Running the API server
2. Python client example
Systemd Service for 24/7 Operation
1. Creating the service file
2. Managing the service
Monitoring and Performance Tuning
1. System monitoring script
2. Performance tuning parameters
3. Benchmark test
Real Command Examples
1. Model conversion and optimization

Running an LLM locally removes the dependence on cloud services, which matters for both data security and cost. This is a practical guide to an optimized local LLM setup on a Linux-based system.

Recommendation: the llama.cpp + systemd combination, for the best performance and ease of management.

# Install the prerequisites
sudo apt update && sudo apt install -y git curl build-essential
# Download the source and build
# Note: newer llama.cpp releases build with CMake and name the binaries llama-cli, llama-server, and llama-quantize; this guide uses the older names.
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make

# Test run (requires a downloaded model)
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf
./main -m mistral-7b-v0.1.Q4_K_M.gguf -p "Hello world" -n 10
# Create the model directory
mkdir -p ~/models
cd ~/models

# Download the Mistral-7B model (Q4_K_M quantization)
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf

# Set up the service folder structure
mkdir -p ~/llama_service/{models,logs,config}
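A download that was interrupted, or that saved an HTML error page instead of the model, tends to surface later as a confusing load failure. As a minimal sketch, the check below reads the first bytes of the file and confirms they carry the GGUF magic number. The helper name `looks_like_gguf` and the model path are illustrative assumptions, not part of llama.cpp.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def looks_like_gguf(path):
    """Return (is_gguf, version); version is None when the header is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return False, None
    # Bytes 4-7 hold the format version as an unsigned 32-bit integer
    # (little-endian in standard files, which is what this sketch assumes).
    (version,) = struct.unpack("<I", header[4:8])
    return True, version


if __name__ == "__main__":
    # Assumed location: the file downloaded in the step above.
    model = Path.home() / "models" / "mistral-7b-v0.1.Q4_K_M.gguf"
    if model.exists():
        print(looks_like_gguf(model))
    else:
        print(f"{model} not found")
```

This only inspects the header, so it catches wrong-file mistakes but not a file that was cut off partway through; comparing the size against the one listed on the download page covers that case.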
# Performance test example
./main -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf -p "What is your purpose?" -n 100 -e
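# Quantization types compared (savings are approximate, relative to FP16 weights)
# Q4_K_M: 4-bit K-quant, about 70% smaller; the usual balance of size and quality
# Q5_K_M: 5-bit K-quant, about 64% smaller
# Q6_K: 6-bit K-quant, about 59% smaller
# Q8_0: 8-bit, about 47% smaller; highest quality of the four

# Conversion example: the convert script does not write K-quants directly,
# so produce an FP16 GGUF first and then quantize it.
python3 convert-hf-to-gguf.py ~/models/Mistral-7B-v0.1/ --outtype f16 --outfile mistral-7b-v0.1.f16.gguf
./quantize mistral-7b-v0.1.f16.gguf mistral-7b-v0.1.Q4_K_M.gguf Q4_K_M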
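The relationship between bit width and file size is simple arithmetic: weights times bits per weight, divided by eight. The sketch below works that out for a 7B-class model. The bits-per-weight figures are rounded approximations and the parameter count is an assumption, so treat the output as an estimate rather than an exact file size.

```python
# Approximate bits per weight for each format (assumed, rounded values).
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
}


def estimate_size_gb(n_params, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def savings_vs_fp16(bits_per_weight):
    """Fraction of the FP16 size saved at the given bit width."""
    return 1.0 - bits_per_weight / BITS_PER_WEIGHT["F16"]


if __name__ == "__main__":
    n_params = 7.2e9  # assumed parameter count for a 7B-class model
    for name, bpw in BITS_PER_WEIGHT.items():
        size = estimate_size_gb(n_params, bpw)
        saved = savings_vs_fp16(bpw)
        print(f"{name:7s} ~{size:5.1f} GB  ({saved:.0%} smaller than F16)")
```

The estimate covers the weights only. The context window adds KV-cache memory on top, which is why a model that fits on disk can still run out of RAM or VRAM at a large `--ctx-size`.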
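# Run the API server (the HTTP server is the separate server binary; main has no listen-address flag)
./server -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf --host 127.0.0.1 --port 8080
# Or require an API key on the OpenAI-compatible endpoints; a system prompt is sent by the client as a "system" message
./server -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf --host 127.0.0.1 --port 8080 --api-key your-api-key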
# client.py
import openai

# api_key must match the --api-key value the server was started with
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-api-key",
)

response = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Hello, world"}],
    max_tokens=100,
)

print(response.choices[0].message.content)
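When the `openai` package is not available, the same call can be made with the standard library alone, which also makes the request the client sends visible. This is a sketch under stated assumptions: the base URL and key mirror the values used in this guide, and `build_chat_request` and `chat` are illustrative helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumed: the server address used in this guide
API_KEY = "your-api-key"               # assumed: must match the server's --api-key


def build_chat_request(prompt, max_tokens=100):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": "mistral",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello, world"))` sends one request and prints the reply. Building the request separately from sending it keeps the payload easy to inspect when a call fails.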
# /etc/systemd/system/llama.service
# ExecStart runs the server binary, since main does not listen on a port
sudo tee /etc/systemd/system/llama.service << 'EOF'
[Unit]
Description=Local LLM Service
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu/llama.cpp
ExecStart=/home/ubuntu/llama.cpp/server -m /home/ubuntu/models/mistral-7b-v0.1.Q4_K_M.gguf --host 127.0.0.1 --port 8080
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
# Start and enable the service
sudo systemctl daemon-reload
sudo systemctl enable llama.service
sudo systemctl start llama.service
sudo systemctl status llama.service

# Follow the logs
sudo journalctl -u llama.service -f
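With `Type=simple`, `systemctl status` reports the unit as active as soon as the process starts, which says nothing about whether the port accepts connections yet. A small readiness check, sketched below, can fill that gap in deployment scripts. The function names are illustrative, the host and port are the ones assumed throughout this guide, and a successful TCP connection only shows that something is listening, not that the model has finished loading.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, attempts=30, delay=1.0):
    """Poll until the port opens or the attempts run out."""
    for _ in range(attempts):
        if port_is_open(host, port):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    # Assumed address: the one the service file binds to.
    state = "open" if port_is_open("127.0.0.1", 8080) else "closed"
    print(f"127.0.0.1:8080 is {state}")
```

Calling `wait_for_port("127.0.0.1", 8080)` right after `systemctl start` gives the server up to about 30 seconds before a script gives up.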
#!/bin/bash
# monitor.sh: print memory, GPU, and CPU usage every 30 seconds
while true; do
  echo "=== Memory Usage ==="
  free -h
  echo "=== GPU Usage ==="
  # No -l flag here: it would keep nvidia-smi looping and stall the script
  nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
  echo "=== CPU Usage ==="
  top -b -n 1 | head -20
  sleep 30
done
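The CSV output that the monitoring script prints is also easy to consume from code, for example to log VRAM headroom next to request latency. The sketch below parses one line of that output. It assumes `nvidia-smi` is called with `--format=csv,noheader,nounits` so each line is three bare numbers; the helper names and the sample line are hypothetical.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_line(line):
    """Parse one 'util, used, total' line produced with csv,noheader,nounits."""
    util, used, total = (float(field.strip()) for field in line.split(","))
    return {
        "util_pct": util,
        "mem_used_mib": used,
        "mem_total_mib": total,
        "mem_used_pct": 100.0 * used / total,
    }


def read_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


if __name__ == "__main__":
    # Hypothetical sample line, so the sketch runs on a machine without a GPU.
    print(parse_gpu_line("45, 5123, 8192"))
```

Some devices report a field as not available instead of a number, in which case `parse_gpu_line` raises `ValueError`; a production version would handle that case explicitly.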
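# Tuned launch command
# --n-gpu-layers sets how many layers are offloaded to the GPU; --ctx-size sets the context window in tokens.
# If your build rejects the sampling flags, drop them and send temperature, top_p, and max_tokens in each request.
./server -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --n-gpu-layers 30 \
  --ctx-size 8192 \
  --temp 0.7 \
  --top-p 0.9 \
  --n-predict 100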
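# Performance test command (main prints its timing summary on exit; there is no --timing flag)
./main -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf \
  -p "Explain quantum computing in simple terms." \
  -n 100

# Example result (illustrative summary, not literal program output)
# Time for prompt: 18.25 ms
# Time for completion: 124.82 ms
# Total tokens: 100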
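Raw millisecond figures are hard to compare across runs with different token counts, so it helps to convert them to tokens per second. The sketch below does that for the example figures above, which work out to roughly 800 tokens per second. The function names are illustrative.

```python
def tokens_per_second(n_tokens, elapsed_ms):
    """Throughput in tokens per second for n_tokens generated in elapsed_ms."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return n_tokens / (elapsed_ms / 1000.0)


def ms_per_token(n_tokens, elapsed_ms):
    """Average latency per generated token, in milliseconds."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return elapsed_ms / n_tokens


if __name__ == "__main__":
    # Figures taken from the example result above.
    n_tokens, completion_ms = 100, 124.82
    print(f"{tokens_per_second(n_tokens, completion_ms):.1f} tokens/s")
    print(f"{ms_per_token(n_tokens, completion_ms):.2f} ms/token")
```

Running the same conversion on your own measurements gives a number that can be compared directly between quantization levels or `--n-gpu-layers` settings.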
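# Convert an HF model to GGUF (FP16 first, then quantize to Q4_K_M)
python3 convert-hf-to-gguf.py /home/ubuntu/models/Llama-3-8B \
  --outtype f16 \
  --outfile llama3-8b.f16.gguf
./quantize llama3-8b.f16.gguf llama3-8b.Q4_K_M.gguf Q4_K_M

# Reduce memory usage: offload 28 layers to the GPU and cap the context at 4096 tokens
./main -m llama3-8b.Q4_K_M.gguf --n-gpu-layers 28 --ctx-size 4096

---
📥 **Get the full guide on Gumroad**: https://gumroad.com/l/auto ($7)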
Prerequisites:
- 64-bit Linux system (Ubuntu 20.04 or later recommended)
- GPU (NVIDIA RTX 30xx or newer recommended)
- At least 16 GB RAM (32 GB recommended)
- At least 10 GB of disk space (varies by model)
- git, curl, and build-essential installed
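The tool requirements in this list can be checked before starting, which avoids a build that fails halfway through. The sketch below looks each tool up on `PATH`. The tool list is an assumption: `gcc` and `make` stand in for `build-essential`, and `wget` is included because the download steps use it.

```python
import shutil

# Assumed tool list: gcc and make stand in for build-essential.
REQUIRED_TOOLS = ["git", "curl", "gcc", "make", "wget"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from the list that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools are installed.")
```

The check says nothing about RAM, disk space, or GPU drivers; it only confirms that the command-line tools used in the guide are present.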