Tools: 로컬 LLM 셋업 가이드 (v16)
로컬 LLM 셋업 가이드 (v16)
1. 개요 및 전제 조건
필수 라이브러리 설치
2. 프레임워크 비교
3. 추천 설정: llama.cpp + systemd
1. 소스 코드 빌드
2. 모델 다운로드 및 변환
3. 서버 실행
4. 모델 선택 가이드
5. 양자화 타입 설명
6. API 설정 및 통합
HTTP API 사용
Python 클라이언트
7. Systemd 서비스 설정
8. 모니터링 및 성능 튜닝
성능 테스트 명령어
성능 최적화 설정
9. 실제 작업 예시
1. 모델 다운로드 스크립트
2. 전체 서버 스크립트
10. 성능 벤치마크
11. 문제 해결 팁
GPU 메모리 부족
연결 시간 초과 로컬 LLM 실행은 GPU 지원이 있는 Linux 머신이 필요합니다. 추천 사양: 이 가이드는 단일 개발자나 소규모 팀이 빠르게 로컬 LLM 환경을 구축할 수 있도록 설계되었습니다. 실질적인 성능과 확장성을 모두 고려한 실용적인 접근법을 제공합니다. 📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7) Templates let you quickly answer FAQs or store snippets for re-use. Hide child comments as well For further actions, you may consider blocking this person and/or reporting abuseCommandCopy$ -weight: 600;">sudo -weight: 500;">apt -weight: 500;">update
-weight: 600;">sudo -weight: 500;">apt -weight: 500;">install -y build-essential cmake -weight: 500;">git python3--weight: 500;">pip
-weight: 600;">sudo -weight: 500;">apt -weight: 500;">update
-weight: 600;">sudo -weight: 500;">apt -weight: 500;">install -y build-essential cmake -weight: 500;">git python3--weight: 500;">pip
-weight: 600;">sudo -weight: 500;">apt -weight: 500;">update
-weight: 600;">sudo -weight: 500;">apt -weight: 500;">install -y build-essential cmake -weight: 500;">git python3--weight: 500;">pip
-weight: 500;">git clone https://github.com/ggerganov/llama.cpp.-weight: 500;">git
cd llama.cpp
make clean
make -j$(nproc)
-weight: 500;">git clone https://github.com/ggerganov/llama.cpp.-weight: 500;">git
cd llama.cpp
make clean
make -j$(nproc)
-weight: 500;">git clone https://github.com/ggerganov/llama.cpp.-weight: 500;">git
cd llama.cpp
make clean
make -j$(nproc)
# 예시: LLaMA-2 7B 모델 다운로드
-weight: 500;">wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
# 예시: LLaMA-2 7B 모델 다운로드
-weight: 500;">wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
# 예시: LLaMA-2 7B 모델 다운로드
-weight: 500;">wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
./server -m ./llama-2-7b.Q4_K_M.gguf \ --port 11434 \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048 \ --batch-size 512 \ --threads 8
./server -m ./llama-2-7b.Q4_K_M.gguf \ --port 11434 \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048 \ --batch-size 512 \ --threads 8
./server -m ./llama-2-7b.Q4_K_M.gguf \ --port 11434 \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048 \ --batch-size 512 \ --threads 8
-weight: 500;">curl http://localhost:11434/api/generate \ -H "Content-Type: application/json" \ -d '{ "model": "llama-2-7b", "prompt": "Write a 100-word introduction to machine learning.", "stream": false }'
-weight: 500;">curl http://localhost:11434/api/generate \ -H "Content-Type: application/json" \ -d '{ "model": "llama-2-7b", "prompt": "Write a 100-word introduction to machine learning.", "stream": false }'
-weight: 500;">curl http://localhost:11434/api/generate \ -H "Content-Type: application/json" \ -d '{ "model": "llama-2-7b", "prompt": "Write a 100-word introduction to machine learning.", "stream": false }'
import requests def call_llm(prompt): response = requests.post( 'http://localhost:11434/api/generate', json={ 'model': 'llama-2-7b', 'prompt': prompt, 'stream': False } ) return response.json()['response'] print(call_llm("What is Rust programming language?"))
import requests def call_llm(prompt): response = requests.post( 'http://localhost:11434/api/generate', json={ 'model': 'llama-2-7b', 'prompt': prompt, 'stream': False } ) return response.json()['response'] print(call_llm("What is Rust programming language?"))
import requests def call_llm(prompt): response = requests.post( 'http://localhost:11434/api/generate', json={ 'model': 'llama-2-7b', 'prompt': prompt, 'stream': False } ) return response.json()['response'] print(call_llm("What is Rust programming language?"))
-weight: 600;">sudo nano /etc/systemd/system/llm.-weight: 500;">service
-weight: 600;">sudo nano /etc/systemd/system/llm.-weight: 500;">service
-weight: 600;">sudo nano /etc/systemd/system/llm.-weight: 500;">service
[Unit]
Description=Local LLM Server
After=network.target [Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/server \ -m /home/developer/models/llama-2-7b.Q4_K_M.gguf \ --port 11434 \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048
Restart=always
RestartSec=10 [Install]
WantedBy=multi-user.target
[Unit]
Description=Local LLM Server
After=network.target [Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/server \ -m /home/developer/models/llama-2-7b.Q4_K_M.gguf \ --port 11434 \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048
Restart=always
RestartSec=10 [Install]
WantedBy=multi-user.target
[Unit]
Description=Local LLM Server
After=network.target [Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/server \ -m /home/developer/models/llama-2-7b.Q4_K_M.gguf \ --port 11434 \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048
Restart=always
RestartSec=10 [Install]
WantedBy=multi-user.target
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">enable llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">start llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">status llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">enable llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">start llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">status llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">enable llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">start llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">status llm.-weight: 500;">service
# 추론 성능 측정
./tests/test-quantize-perf.sh # 메모리 사용량 모니터링
watch -n 1 nvidia-smi # CPU 사용량
htop
# 추론 성능 측정
./tests/test-quantize-perf.sh # 메모리 사용량 모니터링
watch -n 1 nvidia-smi # CPU 사용량
htop
# 추론 성능 측정
./tests/test-quantize-perf.sh # 메모리 사용량 모니터링
watch -n 1 nvidia-smi # CPU 사용량
htop
# 더 많은 GPU 레이어 사용
--n-gpu-layers 40 # 컨텍스트 크기 증가
--ctx-size 4096 # 배치 크기 조정
--batch-size 1024 # 스레드 수 조정
--threads 16
# 더 많은 GPU 레이어 사용
--n-gpu-layers 40 # 컨텍스트 크기 증가
--ctx-size 4096 # 배치 크기 조정
--batch-size 1024 # 스레드 수 조정
--threads 16
# 더 많은 GPU 레이어 사용
--n-gpu-layers 40 # 컨텍스트 크기 증가
--ctx-size 4096 # 배치 크기 조정
--batch-size 1024 # 스레드 수 조정
--threads 16
#!/bin/bash
# download-model.sh
MODEL_NAME="llama-2-7b"
Q_TYPE="Q4_K_M"
-weight: 500;">wget -O ${MODEL_NAME}.${Q_TYPE}.gguf \ "https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.${Q_TYPE}.gguf"
#!/bin/bash
# download-model.sh
MODEL_NAME="llama-2-7b"
Q_TYPE="Q4_K_M"
-weight: 500;">wget -O ${MODEL_NAME}.${Q_TYPE}.gguf \ "https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.${Q_TYPE}.gguf"
#!/bin/bash
# download-model.sh
MODEL_NAME="llama-2-7b"
Q_TYPE="Q4_K_M"
-weight: 500;">wget -O ${MODEL_NAME}.${Q_TYPE}.gguf \ "https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.${Q_TYPE}.gguf"
#!/bin/bash
# -weight: 500;">start-server.sh
MODEL_PATH="/home/developer/models/llama-2-7b.Q4_K_M.gguf"
PORT=11434 if [ ! -f "$MODEL_PATH" ]; then echo "모델 파일이 없습니다. 먼저 다운로드하세요." exit 1
fi echo "LLM 서버 시작 중..."
./server \ -m "$MODEL_PATH" \ --port $PORT \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048 \ --batch-size 512 \ --threads 8
#!/bin/bash
# -weight: 500;">start-server.sh
MODEL_PATH="/home/developer/models/llama-2-7b.Q4_K_M.gguf"
PORT=11434 if [ ! -f "$MODEL_PATH" ]; then echo "모델 파일이 없습니다. 먼저 다운로드하세요." exit 1
fi echo "LLM 서버 시작 중..."
./server \ -m "$MODEL_PATH" \ --port $PORT \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048 \ --batch-size 512 \ --threads 8
#!/bin/bash
# -weight: 500;">start-server.sh
MODEL_PATH="/home/developer/models/llama-2-7b.Q4_K_M.gguf"
PORT=11434 if [ ! -f "$MODEL_PATH" ]; then echo "모델 파일이 없습니다. 먼저 다운로드하세요." exit 1
fi echo "LLM 서버 시작 중..."
./server \ -m "$MODEL_PATH" \ --port $PORT \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048 \ --batch-size 512 \ --threads 8
# GPU 레이어 수 감소
--n-gpu-layers 20 # 컨텍스트 크기 축소
--ctx-size 1024
# GPU 레이어 수 감소
--n-gpu-layers 20 # 컨텍스트 크기 축소
--ctx-size 1024
# GPU 레이어 수 감소
--n-gpu-layers 20 # 컨텍스트 크기 축소
--ctx-size 1024
# 타임아웃 증가
--timeout 300
# 타임아웃 증가
--timeout 300
# 타임아웃 증가
--timeout 300 - GPU: NVIDIA RTX 3060 이상 (CUDA 지원)
- RAM: 16GB 이상 (32GB 이상 추천)
- 저장소: 50GB 이상 여유 공간
$ -weight: 600;">sudo -weight: 500;">apt -weight: 500;">update
-weight: 600;">sudo -weight: 500;">apt -weight: 500;">install -y build-essential cmake -weight: 500;">git python3--weight: 500;">pip
-weight: 600;">sudo -weight: 500;">apt -weight: 500;">update
-weight: 600;">sudo -weight: 500;">apt -weight: 500;">install -y build-essential cmake -weight: 500;">git python3--weight: 500;">pip
-weight: 600;">sudo -weight: 500;">apt -weight: 500;">update
-weight: 600;">sudo -weight: 500;">apt -weight: 500;">install -y build-essential cmake -weight: 500;">git python3--weight: 500;">pip
-weight: 500;">git clone https://github.com/ggerganov/llama.cpp.-weight: 500;">git
cd llama.cpp
make clean
make -j$(nproc)
-weight: 500;">git clone https://github.com/ggerganov/llama.cpp.-weight: 500;">git
cd llama.cpp
make clean
make -j$(nproc)
-weight: 500;">git clone https://github.com/ggerganov/llama.cpp.-weight: 500;">git
cd llama.cpp
make clean
make -j$(nproc)
# 예시: LLaMA-2 7B 모델 다운로드
-weight: 500;">wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
# 예시: LLaMA-2 7B 모델 다운로드
-weight: 500;">wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
# 예시: LLaMA-2 7B 모델 다운로드
-weight: 500;">wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
./server -m ./llama-2-7b.Q4_K_M.gguf \ --port 11434 \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048 \ --batch-size 512 \ --threads 8
./server -m ./llama-2-7b.Q4_K_M.gguf \ --port 11434 \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048 \ --batch-size 512 \ --threads 8
./server -m ./llama-2-7b.Q4_K_M.gguf \ --port 11434 \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048 \ --batch-size 512 \ --threads 8
-weight: 500;">curl http://localhost:11434/api/generate \ -H "Content-Type: application/json" \ -d '{ "model": "llama-2-7b", "prompt": "Write a 100-word introduction to machine learning.", "stream": false }'
-weight: 500;">curl http://localhost:11434/api/generate \ -H "Content-Type: application/json" \ -d '{ "model": "llama-2-7b", "prompt": "Write a 100-word introduction to machine learning.", "stream": false }'
-weight: 500;">curl http://localhost:11434/api/generate \ -H "Content-Type: application/json" \ -d '{ "model": "llama-2-7b", "prompt": "Write a 100-word introduction to machine learning.", "stream": false }'
import requests def call_llm(prompt): response = requests.post( 'http://localhost:11434/api/generate', json={ 'model': 'llama-2-7b', 'prompt': prompt, 'stream': False } ) return response.json()['response'] print(call_llm("What is Rust programming language?"))
import requests def call_llm(prompt): response = requests.post( 'http://localhost:11434/api/generate', json={ 'model': 'llama-2-7b', 'prompt': prompt, 'stream': False } ) return response.json()['response'] print(call_llm("What is Rust programming language?"))
import requests def call_llm(prompt): response = requests.post( 'http://localhost:11434/api/generate', json={ 'model': 'llama-2-7b', 'prompt': prompt, 'stream': False } ) return response.json()['response'] print(call_llm("What is Rust programming language?"))
-weight: 600;">sudo nano /etc/systemd/system/llm.-weight: 500;">service
-weight: 600;">sudo nano /etc/systemd/system/llm.-weight: 500;">service
-weight: 600;">sudo nano /etc/systemd/system/llm.-weight: 500;">service
[Unit]
Description=Local LLM Server
After=network.target [Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/server \ -m /home/developer/models/llama-2-7b.Q4_K_M.gguf \ --port 11434 \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048
Restart=always
RestartSec=10 [Install]
WantedBy=multi-user.target
[Unit]
Description=Local LLM Server
After=network.target [Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/server \ -m /home/developer/models/llama-2-7b.Q4_K_M.gguf \ --port 11434 \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048
Restart=always
RestartSec=10 [Install]
WantedBy=multi-user.target
[Unit]
Description=Local LLM Server
After=network.target [Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/server \ -m /home/developer/models/llama-2-7b.Q4_K_M.gguf \ --port 11434 \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048
Restart=always
RestartSec=10 [Install]
WantedBy=multi-user.target
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">enable llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">start llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">status llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">enable llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">start llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">status llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">enable llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">start llm.-weight: 500;">service
-weight: 600;">sudo -weight: 500;">systemctl -weight: 500;">status llm.-weight: 500;">service
# 추론 성능 측정
./tests/test-quantize-perf.sh # 메모리 사용량 모니터링
watch -n 1 nvidia-smi # CPU 사용량
htop
# 추론 성능 측정
./tests/test-quantize-perf.sh # 메모리 사용량 모니터링
watch -n 1 nvidia-smi # CPU 사용량
htop
# 추론 성능 측정
./tests/test-quantize-perf.sh # 메모리 사용량 모니터링
watch -n 1 nvidia-smi # CPU 사용량
htop
# 더 많은 GPU 레이어 사용
--n-gpu-layers 40 # 컨텍스트 크기 증가
--ctx-size 4096 # 배치 크기 조정
--batch-size 1024 # 스레드 수 조정
--threads 16
# 더 많은 GPU 레이어 사용
--n-gpu-layers 40 # 컨텍스트 크기 증가
--ctx-size 4096 # 배치 크기 조정
--batch-size 1024 # 스레드 수 조정
--threads 16
# 더 많은 GPU 레이어 사용
--n-gpu-layers 40 # 컨텍스트 크기 증가
--ctx-size 4096 # 배치 크기 조정
--batch-size 1024 # 스레드 수 조정
--threads 16
#!/bin/bash
# download-model.sh
MODEL_NAME="llama-2-7b"
Q_TYPE="Q4_K_M"
-weight: 500;">wget -O ${MODEL_NAME}.${Q_TYPE}.gguf \ "https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.${Q_TYPE}.gguf"
#!/bin/bash
# download-model.sh
MODEL_NAME="llama-2-7b"
Q_TYPE="Q4_K_M"
-weight: 500;">wget -O ${MODEL_NAME}.${Q_TYPE}.gguf \ "https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.${Q_TYPE}.gguf"
#!/bin/bash
# download-model.sh
MODEL_NAME="llama-2-7b"
Q_TYPE="Q4_K_M"
-weight: 500;">wget -O ${MODEL_NAME}.${Q_TYPE}.gguf \ "https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.${Q_TYPE}.gguf"
#!/bin/bash
# -weight: 500;">start-server.sh
MODEL_PATH="/home/developer/models/llama-2-7b.Q4_K_M.gguf"
PORT=11434 if [ ! -f "$MODEL_PATH" ]; then echo "모델 파일이 없습니다. 먼저 다운로드하세요." exit 1
fi echo "LLM 서버 시작 중..."
./server \ -m "$MODEL_PATH" \ --port $PORT \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048 \ --batch-size 512 \ --threads 8
#!/bin/bash
# -weight: 500;">start-server.sh
MODEL_PATH="/home/developer/models/llama-2-7b.Q4_K_M.gguf"
PORT=11434 if [ ! -f "$MODEL_PATH" ]; then echo "모델 파일이 없습니다. 먼저 다운로드하세요." exit 1
fi echo "LLM 서버 시작 중..."
./server \ -m "$MODEL_PATH" \ --port $PORT \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048 \ --batch-size 512 \ --threads 8
#!/bin/bash
# -weight: 500;">start-server.sh
MODEL_PATH="/home/developer/models/llama-2-7b.Q4_K_M.gguf"
PORT=11434 if [ ! -f "$MODEL_PATH" ]; then echo "모델 파일이 없습니다. 먼저 다운로드하세요." exit 1
fi echo "LLM 서버 시작 중..."
./server \ -m "$MODEL_PATH" \ --port $PORT \ --host 0.0.0.0 \ --n-gpu-layers 35 \ --ctx-size 2048 \ --batch-size 512 \ --threads 8
# GPU 레이어 수 감소
--n-gpu-layers 20 # 컨텍스트 크기 축소
--ctx-size 1024
# GPU 레이어 수 감소
--n-gpu-layers 20 # 컨텍스트 크기 축소
--ctx-size 1024
# GPU 레이어 수 감소
--n-gpu-layers 20 # 컨텍스트 크기 축소
--ctx-size 1024
# 타임아웃 증가
--timeout 300
# 타임아웃 증가
--timeout 300
# 타임아웃 증가
--timeout 300 - GPU: NVIDIA RTX 3060 이상 (CUDA 지원)
- RAM: 16GB 이상 (32GB 이상 추천)
- 저장소: 50GB 이상 여유 공간