Tools: Local LLM Setup Guide (v11) - Complete Guide

Local LLM Setup Guide (v11)

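1. Overview and Prerequisites

2. Framework Comparison

3. Installation (llama.cpp-based)

4. Model Selection Guide

5. Quantization Types Explained

6. API Setup and Integration

7. Systemd Service Setup

8. Monitoring and Performance Optimization

9. Practical Examples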

Code generation example

Local LLMs cut cloud costs and keep your data on your own machine. This guide covers a local LLM setup that runs on a Linux machine with 16GB of RAM or more.

Priority: llama.cpp (optimized performance + simple installation)

```bash
# 1. Install required packages
sudo apt update
sudo apt install git cmake build-essential python3-pip -y

# 2. Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# 3. Optimized build
# (to use --n-gpu-layers on an NVIDIA GPU, recent llama.cpp versions need a
#  CUDA build: cmake .. -DGGML_CUDA=ON)
mkdir build && cd build
cmake ..
make -j$(nproc)

# 4. Check that the executable exists (CMake places binaries under build/bin)
ls -la bin/llama-server
```

```bash
# 5. Download the model
# (URL as given in this guide; confirm that the repository and file exist
#  on Hugging Face before downloading)
cd ..
mkdir -p models
wget https://huggingface.co/TheBloke/Llama-3.2-3B-Instruct-GGUF/resolve/main/llama-3.2-3b-instruct.Q4_K_M.gguf -O models/llama32-3b.Q4_K_M.gguf

# 6. Start the server
./build/bin/llama-server \
  --model models/llama32-3b.Q4_K_M.gguf \
  --port 8080 \
  --host 0.0.0.0 \
  --n-gpu-layers 35 \
  --ctx-size 4096 \
  --temp 0.2 \
  --n-predict 2048
```
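After step 6 the server needs some time to load the model before it answers requests. The following is a minimal sketch of a readiness check, assuming the server exposes a `/health` endpoint on port 8080 that returns HTTP 200 once the model is loaded. The function names `probe_health` and `wait_until_ready` are illustrative and not part of llama.cpp.

```python
import time
import urllib.error
import urllib.request


def probe_health(url: str) -> bool:
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503
        return False


def wait_until_ready(url, timeout_s=60.0, interval_s=1.0,
                     probe=probe_health, sleep=time.sleep):
    """Poll probe(url) until it succeeds or timeout_s seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_ready("http://localhost:8080/health")` right after starting the server returns `True` as soon as the endpoint responds, or `False` after one minute. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.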
```bash
# Per-model performance comparison: one server per model, each on its own port
# (assumes models/mistral-7b.Q5_K_M.gguf has already been downloaded)
./build/bin/llama-server --model models/llama32-3b.Q4_K_M.gguf --n-predict 100 --temp 0.1 --port 8081 &
./build/bin/llama-server --model models/mistral-7b.Q5_K_M.gguf --n-predict 100 --temp 0.1 --port 8082 &
```

```bash
# Quantization example
# K-quants such as Q4_K_M are produced by the llama-quantize tool, not by the
# Python conversion script. The input is assumed to be an f16/f32 GGUF file.
./build/bin/llama-quantize models/llama-3.2-3b-instruct.gguf models/llama32-3b.Q4_K_M.gguf Q4_K_M
```

```yaml
# config.yaml - llama.cpp server settings
# llama-server takes its settings as command-line flags; treat this file as a
# record of those settings for a wrapper script, not as something the server
# loads by itself.
port: 8080
host: 0.0.0.0
model: models/llama32-3b.Q4_K_M.gguf
n_gpu_layers: 35
ctx_size: 4096
temp: 0.2
n_predict: 2048
```
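The config.yaml keys above mirror the flags of the server start command, with underscores in place of hyphens. As a rough sketch of how a wrapper could bridge the two, the snippet below turns those settings into a `llama-server` command line. The `CONFIG` dict is a hand-copied stand-in for the parsed YAML, and `to_server_args` is a hypothetical helper, not something llama.cpp ships.

```python
import shlex

# Settings copied by hand from the config.yaml listing (stand-in for a YAML parser)
CONFIG = {
    "port": 8080,
    "host": "0.0.0.0",
    "model": "models/llama32-3b.Q4_K_M.gguf",
    "n_gpu_layers": 35,
    "ctx_size": 4096,
    "temp": 0.2,
    "n_predict": 2048,
}


def to_server_args(config: dict, binary: str = "./build/bin/llama-server") -> list[str]:
    """Turn snake_case keys into --kebab-case flags, preserving key order."""
    args = [binary]
    for key, value in config.items():
        args += ["--" + key.replace("_", "-"), str(value)]
    return args


command = to_server_args(CONFIG)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command)`. The mapping only works for keys whose names match a real flag, which holds for the seven settings shown here.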
```bash
# OpenAI-compatible API setup
# llama-server exposes /v1/chat/completions and /v1/completions by default,
# so no endpoint flags are needed; --api-key makes clients send the key.
./build/bin/llama-server \
  --model models/llama32-3b.Q4_K_M.gguf \
  --port 8080 \
  --host 0.0.0.0 \
  --api-key YOUR_API_KEY
```

```python
# Python client example (requires: pip install openai)
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="YOUR_API_KEY",  # must match the value passed to --api-key
)

response = client.chat.completions.create(
    model="llama32-3b",
    messages=[{"role": "user", "content": "Hello world"}],
    temperature=0.2,
    max_tokens=512,
)

# Print the generated reply
print(response.choices[0].message.content)
```

```ini
# /etc/systemd/system/llama-server.service
[Unit]
Description=Local LLM Server
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu/llama.cpp
ExecStart=/home/ubuntu/llama.cpp/build/bin/llama-server \
    --model /home/ubuntu/llama.cpp/models/llama32-3b.Q4_K_M.gguf \
    --port 8080 \
    --host 0.0.0.0 \
    --n-gpu-layers 35 \
    --ctx-size 4096 \
    --temp 0.2 \
    --n-predict 2048
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```
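When the server is started with `--api-key`, a client has to present that key as a bearer token. The sketch below shows the same request the `openai` client makes, built with the standard library only, so the header and body are visible. `build_chat_request` is an illustrative name, and the request is constructed but not sent.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, user_message):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": 0.2,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8080/v1", "YOUR_API_KEY", "llama32-3b", "Hello world"
)
print(request.get_method(), request.full_url)
```

With the server running, `urllib.request.urlopen(request)` sends it. A missing or wrong key should be rejected by the server before any generation happens.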
```bash
# Register and start the service
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
sudo systemctl status llama-server
```

```bash
# Monitor GPU usage
nvidia-smi -l 1

# CPU and memory usage
htop

# Follow the logs
journalctl -u llama-server -f
```

```bash
# Performance benchmark
# The chat endpoint expects a POST with a JSON body, so ab needs -p and -T
echo '{"model": "llama32-3b", "messages": [{"role": "user", "content": "Hello world"}]}' > payload.json
ab -n 100 -c 10 -p payload.json -T application/json http://localhost:8080/v1/chat/completions
```
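For LLM serving, tail latency is often more informative than the averages `ab` prints first. The following is a small sketch of the same 100-request, 10-way-concurrent benchmark in Python that reports mean, median, and 95th-percentile latency. All function names are illustrative, and the demo at the bottom uses a short sleep as a stand-in for a real request to the server.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed(call):
    """Run call() and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    call()
    return time.perf_counter() - start


def run_benchmark(call, total=100, concurrency=10):
    """Issue `total` calls from `concurrency` worker threads; return latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda _: timed(call), range(total)))


def summarize(latencies):
    """Mean, median, and 95th-percentile (nearest-rank) latency."""
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


# Stand-in workload: replace the lambda with a function that POSTs to the server
latencies = run_benchmark(lambda: time.sleep(0.001), total=20, concurrency=5)
print(summarize(latencies))
```

Threads are adequate here because each worker spends its time waiting on the network rather than computing.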
```yaml
# Advanced configuration example
server_config:
  model: models/llama32-3b.Q4_K_M.gguf
  n_gpu_layers: 35
  ctx_size: 8192
  n_predict: 2048
  temp: 0.2
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0
  stop: ["\nUser:", "\nAssistant:"]
```

```bash
# Call the API with curl
# (the listing breaks off after the messages array; the JSON body is closed
#  here so the request is valid)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama32-3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a Python function to calculate Fibonacci numbers"}
    ]
  }'
```

📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)
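A code generation request like the curl call above returns JSON in the OpenAI chat-completions shape, with the generated text under `choices[0].message.content`. Models usually wrap code in a Markdown fence, so a caller typically has to unwrap it. The sketch below shows one way to do that. `SAMPLE_RESPONSE` is a made-up response for illustration, not real model output, and both helper names are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence, built indirectly so this listing contains no literal fence

# Made-up reply text, shaped like a typical answer to the Fibonacci prompt
GENERATED = (
    "Here you go:\n"
    + FENCE + "python\n"
    + "def fib(n):\n"
    + "    a, b = 0, 1\n"
    + "    for _ in range(n):\n"
    + "        a, b = b, a + b\n"
    + "    return a\n"
    + FENCE
)

# Hypothetical response body in the chat-completions shape
SAMPLE_RESPONSE = json.dumps({
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": GENERATED},
        "finish_reason": "stop",
    }]
})


def extract_reply(raw_json: str) -> str:
    """Return the assistant text of the first choice."""
    return json.loads(raw_json)["choices"][0]["message"]["content"]


def extract_code(reply: str) -> str:
    """Return the body of the first fenced block, or the whole reply if none."""
    pattern = re.escape(FENCE) + r"[\w+-]*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    return match.group(1) if match else reply


code = extract_code(extract_reply(SAMPLE_RESPONSE))
print(code)
```

Generated code should be reviewed before it is run. The extraction step only separates it from the surrounding prose.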
Prerequisites:

- Ubuntu 20.04 or later, or Debian 11 or later
- NVIDIA GPU (at least 8GB VRAM) or a CPU-only environment
- At least 16GB RAM (32GB or more recommended)
- 100GB or more of disk space
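The RAM and disk requirements above can be checked before installing anything. This is a minimal sketch, assuming a Linux system where `/proc/meminfo` reports `MemTotal` in kB. The thresholds and function names are illustrative, and the RAM threshold is set slightly below 16 because a machine sold as 16GB reports somewhat less usable memory.

```python
import shutil

MIN_RAM_GB = 15    # a "16GB" machine reports slightly less than 16 GiB
MIN_DISK_GB = 100


def parse_mem_total_gb(meminfo_text: str) -> float:
    """Extract MemTotal (reported in kB) from /proc/meminfo text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) / (1024 ** 2)
    raise ValueError("MemTotal not found")


def check_prerequisites(meminfo_text: str, free_disk_bytes: int) -> dict:
    """Compare RAM and free disk against the guide's minimums."""
    ram_gb = parse_mem_total_gb(meminfo_text)
    disk_gb = free_disk_bytes / (1024 ** 3)
    return {
        "ram_ok": ram_gb >= MIN_RAM_GB,
        "disk_ok": disk_gb >= MIN_DISK_GB,
        # Presence of nvidia-smi hints at an NVIDIA driver; it does not measure VRAM
        "gpu_tool_found": shutil.which("nvidia-smi") is not None,
    }
```

On a real machine, the inputs would come from `open("/proc/meminfo").read()` and `shutil.disk_usage(".").free`. A missing `nvidia-smi` is not a failure, since the guide also covers CPU-only setups.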