
Tools: Complete Guide to Local LLM Setup (v23)

2026-05-25, admin

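Local LLM Setup Guide (v23)

1. Overview and preparation

Prerequisites

System check

2. Framework comparison

3. Recommended setup - installing llama.cpp

4. Model selection guide

5. Quantization types explained

Model conversion example

6. API setup and tool integration

External tool integration example (Python)

7. Systemd service setup

8. Monitoring and performance optimization

Performance monitoring script

Optimization options

9. Real-world performance benchmarks

Inference performance test

Recorded inference times (example)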

10. Practical use cases

Running a local LLM (large language model) is a cost-effective and straightforward way to integrate AI features. This guide gives a practical walkthrough for setting up and tuning a local LLM on a Linux-based system. It recommends llama.cpp, which is simple, fast, and well optimized. For round-the-clock operation, the server is run as a systemd service.

📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)


System check

    # Check the GPU
    nvidia-smi
    # Check RAM
    free -h
    # Check the CPU
    lscpu

3. Recommended setup - installing llama.cpp

    # Preparation before installing
    sudo apt update
    sudo apt install build-essential git -y

    # Download llama.cpp
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp

    # Compile
    # (recent llama.cpp releases build with CMake and name the binaries
    # llama-server and llama-quantize; the commands in this guide use the
    # older make build and its ./server and ./quantize binaries)
    make clean
    make

    # Install required libraries (if needed)
    pip install torch numpy

    # Download a model (example: LLaMA-2 7B; substitute the URL of the model you use)
    mkdir -p models
    wget https://huggingface.co/llamav2-7b/resolve/main/llama-2-7b.gguf -O models/llama-2-7b.gguf

5. Quantization types explained

- Q4_K_M: optimized 4-bit quantization, high performance-to-accuracy ratio
- Q5_K_M: 5-bit quantization, improved accuracy
- Q6_K: 6-bit, higher accuracy still
- Q8_0: 8-bit, highest accuracy of the four and the largest file
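To get a feel for how the quantization types listed above trade file size for accuracy, the sketch below estimates a lower bound on model size from the nominal bit width alone. The 7-billion parameter count is a round figure assumed for illustration, and `nominal_size_gb` is a hypothetical helper rather than anything shipped with llama.cpp. Real GGUF files come out somewhat larger, because block scales and mixed-precision tensors add overhead.

```python
# Nominal bits per weight for the quantization types in this guide.
# These are the headline bit widths only, not measured file sizes.
NOMINAL_BITS = {"Q4_K_M": 4, "Q5_K_M": 5, "Q6_K": 6, "Q8_0": 8}


def nominal_size_gb(n_params: float, quant: str) -> float:
    """Lower-bound size in GB (10^9 bytes) from the nominal bit width alone."""
    bits = NOMINAL_BITS[quant]
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    # 7e9 parameters is an assumed round figure for a "7B" model.
    for quant in NOMINAL_BITS:
        size = nominal_size_gb(7e9, quant)
        print(f"{quant}: at least about {size:.1f} GB for 7B parameters")
```

Comparing these lower bounds against available VRAM or RAM is a quick first filter before downloading or converting a model.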
Model conversion example

    # convert-hf-to-gguf.py does not produce K-quants directly, so convert the
    # Hugging Face model directory to f16 first, then quantize the result.
    ./convert-hf-to-gguf.py models/llama-2-7b/ --outtype f16 --outfile models/llama-2-7b-f16.gguf

    # Q5_K_M quantization
    ./quantize models/llama-2-7b-f16.gguf models/llama-2-7b-q5k.gguf Q5_K_M

    # Q4_K_M quantization
    ./quantize models/llama-2-7b-f16.gguf models/llama-2-7b-q4k.gguf Q4_K_M

6. API setup and tool integration

    # Start the llama.cpp API server
    # (--host 0.0.0.0 listens on all interfaces; use 127.0.0.1 for local-only access)
    ./server -m models/llama-2-7b-q5k.gguf -c 2048 --host 0.0.0.0 --port 8080

    # Test the API
    curl http://localhost:8080/completion \
      -H "Content-Type: application/json" \
      -d '{
        "prompt": "Hello, how are you?",
        "n_predict": 128,
        "temperature": 0.7
      }'

External tool integration example (Python)

    import requests

    def llama_completion(prompt, max_tokens=128, temperature=0.7):
        """Send a prompt to the local llama.cpp server and return the generated text."""
        response = requests.post(
            "http://localhost:8080/completion",
            json={
                "prompt": prompt,
                "n_predict": max_tokens,
                "temperature": temperature,
            },
            timeout=120,  # generation can be slow; avoid hanging forever
        )
        response.raise_for_status()  # surface HTTP errors instead of a KeyError below
        return response.json()["content"]

    # Usage example
    result = llama_completion("How do I parse JSON in Python?")
    print(result)
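Where installing `requests` is not an option, the same call can be sketched with the Python standard library alone. The endpoint and JSON fields are the ones used in the curl test above; `build_payload`, `extract_content`, and `complete` are illustrative names, and returning `None` on a failed request is a design choice assumed here, not something the server prescribes.

```python
import json
from typing import Optional
from urllib import request


def build_payload(prompt: str, max_tokens: int = 128, temperature: float = 0.7) -> dict:
    """Build the JSON body for the /completion endpoint."""
    return {"prompt": prompt, "n_predict": max_tokens, "temperature": temperature}


def extract_content(body: bytes) -> str:
    """Pull the generated text out of a /completion response body."""
    data = json.loads(body)
    if not isinstance(data, dict) or "content" not in data:
        raise ValueError("response has no 'content' field")
    return data["content"]


def complete(prompt: str,
             url: str = "http://localhost:8080/completion",
             timeout: float = 120.0) -> Optional[str]:
    """POST the prompt; return the text, or None if the request fails."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    try:
        with request.urlopen(req, timeout=timeout) as resp:
            return extract_content(resp.read())
    except OSError as exc:  # covers URLError, HTTPError, refused connections, timeouts
        print(f"request failed: {exc}")
        return None
```

Calling `complete("How do I parse JSON in Python?")` returns the generated text when the server is running, and `None` when it is unreachable.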
7. Systemd service setup

    # Create the service file
    sudo nano /etc/systemd/system/llama.service

Service file contents:

    [Unit]
    Description=Local LLM Server
    After=network.target

    [Service]
    Type=simple
    User=your_user
    WorkingDirectory=/home/your_user/llama.cpp
    ExecStart=/home/your_user/llama.cpp/server -m /home/your_user/llama.cpp/models/llama-2-7b-q5k.gguf -c 2048 --host 0.0.0.0 --port 8080
    Restart=always
    RestartSec=10

    [Install]
    WantedBy=multi-user.target

Enable and start the service:

    sudo systemctl daemon-reload
    sudo systemctl enable llama.service
    sudo systemctl start llama.service
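After the service starts, loading the model can take a while before the port accepts connections. One way to wait for that from a client script is a plain TCP readiness check, sketched below. `wait_for_port` is a hypothetical helper; it only confirms that something is listening on the port, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int,
                  timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection succeeds, False if the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a process is listening on host:port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Port 8080 matches the ExecStart line in the service file above.
    ready = wait_for_port("127.0.0.1", 8080, timeout=5.0)
    print("server is accepting connections" if ready else "server not reachable")
```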
8. Monitoring and performance optimization

Performance monitoring script (monitor.sh)

    #!/bin/bash
    # Print memory, GPU, and CPU usage every 30 seconds.
    while true; do
        echo "=== Memory Usage ==="
        free -h
        echo "=== GPU Usage ==="
        nvidia-smi
        echo "=== CPU Load ==="
        top -bn1 | grep "Cpu(s)"
        sleep 30
    done

Optimization options

    # Fast inference (low memory use)
    ./server -m models/llama-2-7b-q5k.gguf -c 512 -n 128

    # Maximum performance (high memory use)
    ./server -m models/llama-2-7b-q5k.gguf -c 2048 -n 2048 --threads 8

    # GPU memory optimization (requires a build with GPU support)
    ./server -m models/llama-2-7b-q5k.gguf --gpu-layers 30 -c 1024
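The monitoring script above prints raw tool output for a person to read. When the memory figure needs to feed an alert or a log line, it can be read programmatically instead. The following sketch parses `/proc/meminfo` on Linux; `parse_meminfo`, `used_fraction`, and `read_used_fraction` are illustrative names, and treating `MemTotal - MemAvailable` as "used" is one common convention assumed here.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; unit-less counters are kept as-is.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def used_fraction(meminfo: dict) -> float:
    """Fraction of RAM in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return (total - available) / total


def read_used_fraction(path: str = "/proc/meminfo") -> float:
    """Read the current used-memory fraction from a Linux system."""
    with open(path, encoding="ascii") as fh:
        return used_fraction(parse_meminfo(fh.read()))
```

On a Linux host, `read_used_fraction()` returns a value between 0 and 1 that can be compared against a threshold before raising the context size or loading a larger model.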
9. Real-world performance benchmarks

Inference performance test

    # Start a server for performance testing
    ./server -m models/llama-2-7b-q5k.gguf -c 2048 --port 8081

    # Quick load test (the /completion endpoint expects a POST with a JSON body)
    echo '{"prompt": "The capital of France is", "n_predict": 10}' > payload.json
    ab -n 10 -c 5 -p payload.json -T application/json http://localhost:8081/completion

    # Single-request test, printing the total request time
    curl -X POST http://localhost:8081/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "The capital of France is", "n_predict": 10}' \
      -w "%{time_total}s\n"

Recorded inference times (example)

LLaMA-2 7B (Q5_K_M):
- Context length 512: 0.8 s
- Context length 1024: 1.2 s
- Context length 2048: 2.1 s

Mistral 7B (Q4_K_M):
- Context length 512: 0.5 s
- Context length 1024: 0.9 s
- Context length 2048: 1.6 s

Prerequisites

- Operating system: Ubuntu 20.04 or later, or Debian 11 or later
- Hardware:
  - GPU: NVIDIA RTX 30xx or later (at least 8 GB VRAM)
  - CPU: at least 8 cores
  - RAM: at least 32 GB (64 GB or more recommended)
  - Storage: at least 100 GB of free space
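Single timings like the ones recorded above vary from run to run, so it can help to repeat a request several times and summarize the samples. The sketch below shows one way to do that; `time_calls` and `summarize` are hypothetical helpers, and the callable passed in stands for whatever request function is being measured.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(call: Callable[[], object], runs: int = 5) -> List[float]:
    """Run `call` repeatedly and return the wall-clock duration of each run in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Summarize latency samples with count, mean, median, and worst case."""
    return {
        "runs": len(samples),
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }


if __name__ == "__main__":
    # Placeholder workload; replace the lambda with a real request function.
    stats = summarize(time_calls(lambda: time.sleep(0.01), runs=3))
    print(stats)
```

Reporting the median alongside the maximum makes a slow first request, such as one that warms up the model, easy to spot.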


InfinitSec - Latest Cybersecurity, Technology & Gaming News
