Local LLM Setup Guide (v11)
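1. Overview and Prerequisites
2. Framework Comparison
3. Installation Procedure (llama.cpp-based)
4. Model Selection Guide
5. Quantization Types Explained
6. API Configuration and Integration
7. Systemd Service Setup
8. Monitoring and Performance Optimization
9. Practical Examples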
Code generation example

Running an LLM locally cuts cloud costs and keeps your data on your own machine. This guide covers a local LLM setup that runs on a Linux machine with at least 16GB of RAM.

Priority: llama.cpp (optimized performance + simple installation)

# 1. Install required packages
sudo apt update
sudo apt install git cmake build-essential python3-pip -y

# 2. Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# 3. Optimized build
# For GPU offload (--n-gpu-layers), configure with CUDA enabled, e.g.
# cmake .. -DGGML_CUDA=ON on recent versions (the flag name has changed across releases).
mkdir build && cd build
cmake ..
make -j$(nproc)

# 4. Check that the executable was built
ls -la bin/llama-server

# 5. Download the model (already in GGUF format, so no conversion step is needed)
# Check that the repository and file name below exist before downloading.
cd ..
mkdir models
wget https://huggingface.co/TheBloke/Llama-3.2-3B-Instruct-GGUF/resolve/main/llama-3.2-3b-instruct.Q4_K_M.gguf -O models/llama32-3b.Q4_K_M.gguf

# 6. Start the server
./build/bin/llama-server \
  --model models/llama32-3b.Q4_K_M.gguf \
  --port 8080 \
  --host 0.0.0.0 \
  --n-gpu-layers 35 \
  --ctx-size 4096 \
  --temp 0.2 \
  --n-predict 2048
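
Loading a model can take a while, so a script that starts the server usually has to wait before it sends the first request. The sketch below polls until a probe succeeds. It assumes the server answers `GET /health` on the port used above, which is worth confirming on your build; the helper names `http_probe` and `wait_until_ready` are illustrative, not part of llama.cpp.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str = "http://localhost:8080/health") -> bool:
    """Return True if the URL answers with HTTP 200 (assumed health endpoint)."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status while the model loads.
        return False


def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 120.0,
                     interval_s: float = 1.0) -> bool:
    """Call probe() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A launcher would call `wait_until_ready(http_probe)` right after starting the server and only continue when it returns `True`.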

# Compare performance across models
./build/bin/llama-server --model models/llama32-3b.Q4_K_M.gguf --n-predict 100 --temp 0.1 --port 8081 &
./build/bin/llama-server --model models/mistral-7b.Q5_K_M.gguf --n-predict 100 --temp 0.1 --port 8082 &
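
With both servers running, one way to compare them is to send the same prompt to each port and compute generated tokens per second. This is a rough sketch: it assumes the response carries OpenAI-style `usage.completion_tokens`, and `measure` and `tokens_per_second` are hypothetical helper names. The timing includes prompt processing, so it understates pure generation speed.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure(port: int, prompt: str, max_tokens: int = 100) -> float:
    """Send one chat request to a local server and return tokens/second."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.1,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=300) as resp:
        payload = json.load(resp)
    elapsed = time.perf_counter() - start
    # Assumption: the response includes OpenAI-style usage counters.
    return tokens_per_second(payload["usage"]["completion_tokens"], elapsed)
```

Calling `measure(8081, prompt)` and `measure(8082, prompt)` with an identical prompt gives two numbers that can be compared directly; averaging several runs reduces noise.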
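
# Quantization example
# Assumes the input is an unquantized (F16/F32) GGUF file; llama-quantize is built alongside llama-server.
./build/bin/llama-quantize models/llama-3.2-3b-instruct.gguf models/llama32-3b.Q4_K_M.gguf Q4_K_M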
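
The quantization type mostly determines how many bits each weight occupies, and from that the file size and memory footprint. The arithmetic below is a sketch for the simple block formats: Q4_0 and Q8_0 store 32 weights plus one 16-bit scale per block. K-quants such as Q4_K_M mix several block types, so their effective bits per weight vary by model and are not computed here. The parameter count of 3.2e9 is a rounded assumption for a 3B-class model.

```python
def block_bits_per_weight(weight_bits: int, block_size: int = 32,
                          scale_bits: int = 16) -> float:
    """Bits per weight for a simple block format: block_size weights plus one scale."""
    return (weight_bits * block_size + scale_bits) / block_size


def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes); ignores metadata."""
    return n_params * bits_per_weight / 8 / 1e9


q4_0 = block_bits_per_weight(4)   # 4.5 bits per weight
q8_0 = block_bits_per_weight(8)   # 8.5 bits per weight

n_params = 3.2e9  # assumed, rounded parameter count for a 3B-class model
for name, bpw in [("F16", 16.0), ("Q8_0", q8_0), ("Q4_0", q4_0)]:
    print(f"{name}: ~{estimate_size_gb(n_params, bpw):.1f} GB")
```

Under these assumptions the estimates come out at about 6.4 GB for F16, 3.4 GB for Q8_0, and 1.8 GB for Q4_0. The context cache needs additional memory on top of the weights.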
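
# config.yaml - llama.cpp server settings
# llama-server is configured through command-line flags, so a launcher script
# has to read this file and pass the values on.
port: 8080
host: 0.0.0.0
model: models/llama32-3b.Q4_K_M.gguf
n_gpu_layers: 35
ctx_size: 4096
temp: 0.2
n_predict: 2048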
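
Because llama-server takes its settings as command-line flags, a settings file like this needs a small launcher that turns keys into arguments. The sketch below shows one way to do that. It assumes every key maps to a flag of the same name with underscores replaced by hyphens, which holds for the seven flags used in this guide but is not a general rule. `build_command` is a hypothetical helper, and the settings are given as a dict to stay within the standard library (reading the YAML file itself would need a parser such as PyYAML).

```python
from typing import Any, Dict, List


def build_command(config: Dict[str, Any],
                  binary: str = "./build/bin/llama-server") -> List[str]:
    """Turn a flat settings mapping into a llama-server argument list."""
    args = [binary]
    for key, value in config.items():
        # n_gpu_layers -> --n-gpu-layers, ctx_size -> --ctx-size, ...
        args.append("--" + key.replace("_", "-"))
        args.append(str(value))
    return args


config = {
    "port": 8080,
    "host": "0.0.0.0",
    "model": "models/llama32-3b.Q4_K_M.gguf",
    "n_gpu_layers": 35,
    "ctx_size": 4096,
    "temp": 0.2,
    "n_predict": 2048,
}
print(" ".join(build_command(config)))
```

The resulting list can be passed to `subprocess.run(build_command(config), check=True)` to start the server with exactly these values.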

# OpenAI-compatible API setup
# The /v1/chat/completions and /v1/completions routes are served by default;
# --api-key makes the server require that key on each request.
./build/bin/llama-server \
  --model models/llama32-3b.Q4_K_M.gguf \
  --port 8080 \
  --host 0.0.0.0 \
  --api-key YOUR_API_KEY

# Python client example
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="YOUR_API_KEY",  # must match the --api-key value the server was started with
)

response = client.chat.completions.create(
    model="llama32-3b",
    messages=[{"role": "user", "content": "Hello world"}],
    temperature=0.2,
    max_tokens=512,
)
print(response.choices[0].message.content)
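
If installing the `openai` package is not an option, the same call can be made with the standard library. The sketch below shows what goes over the wire: a JSON POST to `/chat/completions` under the base URL, with the key in an `Authorization: Bearer` header. The function names are hypothetical, and the response shape is assumed to follow the OpenAI chat format.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(base_url: str, api_key: str,
                       messages: List[Dict[str, str]],
                       temperature: float = 0.2,
                       max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style chat completions endpoint."""
    body = json.dumps({
        "model": "llama32-3b",
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_reply(payload: dict) -> str:
    """Return the assistant text from an OpenAI-style response body."""
    return payload["choices"][0]["message"]["content"]
```

To send it, open the request with `urllib.request.urlopen(req)`, parse the body with `json.load`, and pass the result to `extract_reply`.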

# /etc/systemd/system/llama-server.service
[Unit]
Description=Local LLM Server
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu/llama.cpp
ExecStart=/home/ubuntu/llama.cpp/build/bin/llama-server \
  --model /home/ubuntu/llama.cpp/models/llama32-3b.Q4_K_M.gguf \
  --port 8080 \
  --host 0.0.0.0 \
  --n-gpu-layers 35 \
  --ctx-size 4096 \
  --temp 0.2 \
  --n-predict 2048
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# Register and start the service
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
sudo systemctl status llama-server

# Monitor GPU usage
nvidia-smi -l 1

# CPU and memory usage
htop

# Follow the service logs
journalctl -u llama-server -f
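
For scripted monitoring, the interactive view is less convenient than a machine-readable query. The sketch below runs `nvidia-smi` in CSV mode and parses used and total memory per GPU. The function names are illustrative, and the parser assumes one `used, total` line per GPU in MiB, which is what the `csv,noheader,nounits` format produces for these two fields.

```python
import subprocess
from typing import List, Tuple


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory() -> List[Tuple[int, int]]:
    """Run nvidia-smi once and return (used_mib, total_mib) per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)
```

Comparing `used` against `total` after the model loads shows how much VRAM headroom remains, which helps when deciding whether to raise or lower `--n-gpu-layers`.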
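
# Performance benchmark
# ab sends GET requests by default, but the chat endpoint expects a JSON POST body.
# payload.json is a file you create containing a chat request body like the one in the curl example.
ab -n 100 -c 10 -p payload.json -T application/json http://localhost:8080/v1/chat/completions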
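
The same kind of load test can be sketched in Python when per-request numbers are needed for further analysis. The code below times a callable from several threads and summarizes the latencies with nearest-rank percentiles. All names are hypothetical; `send` stands for whatever function issues one request to the server.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_s: List[float]) -> Dict[str, float]:
    """Mean, median (p50) and p95 of request latencies in seconds."""
    ordered = sorted(latencies_s)
    return {
        "mean": sum(ordered) / len(ordered),
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
    }


def run_load(send: Callable[[], None], total: int = 100,
             concurrency: int = 10) -> List[float]:
    """Call send() `total` times from `concurrency` threads and time each call."""
    def timed(_: int) -> float:
        start = time.perf_counter()
        send()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, range(total)))
```

The defaults mirror the `-n 100 -c 10` values above, so `summarize(run_load(send))` gives a comparable picture with p50 and p95 instead of only averages.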

# Advanced configuration example
server_config:
  model: models/llama32-3b.Q4_K_M.gguf
  n_gpu_layers: 35
  ctx_size: 8192
  n_predict: 2048
  temp: 0.2
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0
  stop: ["\nUser:", "\nAssistant:"]
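
The `stop` list is the least obvious setting here: it ends generation as soon as the model starts writing the next conversation turn by itself. The server applies this while generating; the sketch below only reproduces the effect on a finished string to show what gets cut. `apply_stop` is a hypothetical helper and the sample text is hand-written.

```python
from typing import List


def apply_stop(text: str, stop: List[str]) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for seq in stop:
        idx = text.find(seq)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


raw = "Paris is the capital of France.\nUser: And Spain?"
print(apply_stop(raw, ["\nUser:", "\nAssistant:"]))
# prints: Paris is the capital of France.
```

Without the stop sequences, the reply would include the invented follow-up question and whatever the model generated after it, up to the `n_predict` limit.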

# Call the API with curl
# The Authorization header is only needed if the server was started with --api-key.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "llama32-3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a Python function to calculate Fibonacci numbers"}
    ]
  }'

---
📥 **Get the full guide on Gumroad**: https://gumroad.com/l/auto ($7)

Prerequisites:
- Ubuntu 20.04 or later, or Debian 11 or later
- NVIDIA GPU (at least 8GB VRAM) or a CPU-only environment
- At least 16GB RAM (32GB or more recommended)
- 100GB or more of disk space
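
The RAM and disk minimums in the list above can be checked before installing anything. The sketch below is one way to do it on Linux using only the standard library. The thresholds come from the list; the function names are illustrative. Reported total memory is usually a little below the nominal size, so treat a near miss on RAM as a prompt to look closer rather than a hard failure.

```python
import os
import shutil
from typing import List

MIN_RAM_GB = 16
MIN_DISK_GB = 100


def check_requirements(ram_gb: float, free_disk_gb: float) -> List[str]:
    """Return human-readable problems; an empty list means both minimums are met."""
    problems = []
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM {ram_gb:.0f} GB is below the {MIN_RAM_GB} GB minimum")
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"free disk {free_disk_gb:.0f} GB is below {MIN_DISK_GB} GB")
    return problems


def detect_ram_gb() -> float:
    """Total physical memory in GiB (Linux: page size times page count)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30


def detect_free_disk_gb(path: str = ".") -> float:
    """Free space in GiB on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 2**30
```

Running `check_requirements(detect_ram_gb(), detect_free_disk_gb())` from the directory where the models will be stored returns an empty list when both minimums are met.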