Tools: Getting Started with RamaLama on Fedora - 2025 Update


RamaLama is an open-source tool built under the containers organization that makes running AI models locally as straightforward as working with containers. The goal is to make AI inference boring and predictable. RamaLama handles host configuration by pulling an OCI (Open Container Initiative) container image tuned to the hardware it detects on your system, so you skip the manual dependency setup entirely. If you already work with Podman or Docker, the mental model is familiar: models are pulled, listed, and removed much like container images.

Prerequisites

Before installing RamaLama, make sure you have the following:

- A Fedora system (this guide uses Fedora with dnf)
- Podman installed; RamaLama uses it as the default container engine
- Sufficient disk space for model storage (models range from ~2GB to 10GB+)
- At least 8GB RAM for smaller models; 16GB+ recommended for 7B+ parameter models

Installation

On Fedora, RamaLama is available directly from the default repositories:

```shell
$ sudo dnf install ramalama
```

Once installed, verify the version:

```shell
$ ramalama version
ramalama version x.x.x
```

How It Works

On first run, RamaLama inspects your system for GPU support and falls back to CPU if no GPU is found. It then pulls the appropriate OCI container image with all the inference dependencies baked in, including llama.cpp, which powers the model execution layer. Models are stored locally and reused across runs, so the pull only happens once per model.

Model Registries

RamaLama supports pulling models from multiple registries. The default registry is Ollama, but you can reference models from any supported source using a transport prefix, such as huggingface:// for Hugging Face.

Pulling and Running Models

From Ollama (default)

```shell
$ ramalama run granite3.1-moe:3b
```

This pulls the granite3.1-moe:3b model from the Ollama registry and drops you into an interactive chat session. On first run, the model is downloaded to local storage; subsequent runs reuse it.

From Hugging Face

```shell
$ ramalama run huggingface://MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF
```

Note: Some newer Hugging Face models may fail with a gguf_init_from_file_impl: failed to read magic error due to format incompatibilities with llama.cpp. When that happens, look for a pre-converted GGUF version of the same model on Hugging Face by searching the model name with "GGUF" appended. In this case, MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF worked as a compatible alternative.

Useful Flags

Set Context Window Size: --ctx-size / -c

By default, RamaLama does not override the model's native context length. For llama3.1:8b, that default is 131072 tokens, which requires ~16GB of KV cache allocation, well above what most dev machines can handle. Use the -c flag to cap the context size:

```shell
$ ramalama run -c 16384 llama3.1:8b
```

KV cache memory grows linearly with context length, so this 8x smaller window needs 8x less memory: a context size of 16384 tokens requires ~2GB of KV cache for llama3.1:8b. You can use the KV Cache Size Calculator to find the right value for your available memory and target model. On memory-constrained machines, this flag is essential.

Set Temperature: --temp

Temperature controls the randomness of the model's output. The default is typically around 0.8. Setting it to 0 makes the model more deterministic:

```shell
$ ramalama run --temp 0 granite3.1-moe:3b
```

A temperature of 0 is useful for factual Q&A or benchmarking where you want consistent, reproducible outputs. Keep in mind it reduces randomness, not hallucination: if the knowledge is absent from the model's training data, --temp 0 will just make it consistently wrong.

Select Inference Backend: --backend

RamaLama auto-detects the best backend for your hardware, but you can override it explicitly:

```shell
$ ramalama run --backend vulkan granite3.1-moe:3b  # AMD/Intel or CPU fallback
$ ramalama run --backend cuda granite3.1-moe:3b    # NVIDIA
$ ramalama run --backend rocm granite3.1-moe:3b    # AMD ROCm
```

On systems without a GPU, RamaLama falls back to CPU inference automatically.

Enable Debug Output: --debug

--debug is a global flag and must be placed before the subcommand:

```shell
$ ramalama --debug run granite3.1-moe:3b
```

This prints the underlying container commands RamaLama executes, hardware detection steps, and registry fetch details. It is useful when troubleshooting model compatibility issues, unexpected behavior, or hardware detection problems.

Managing Models

List locally stored models:

```shell
$ ramalama list
```

Pull a model without running it:

```shell
$ ramalama pull llama3.1:8b
```

Remove a model from local storage:

```shell
$ ramalama rm llama3.1:8b
```

Serving a Model as an API

RamaLama can expose a model as an OpenAI-compatible REST endpoint:

```shell
$ ramalama serve granite3.1-moe:3b
```

This starts a local server on port 8080 by default. You can point any OpenAI-compatible client at it without changing how those clients are written. This is useful for integrating a local model into applications, RAG pipelines, or tooling like LangChain and LlamaIndex.

Web UI

When running ramalama serve, a browser-based chat interface is available at http://localhost:8080 by default. To disable it:

```shell
$ ramalama serve --webui off granite3.1-moe:3b
```

The web UI is powered by the llama.cpp HTTP server's built-in interface and gives you a quick way to interact with the model without writing any client code.

Things to Watch Out For

- Model format compatibility: Some Hugging Face models require a pre-converted GGUF version to work with RamaLama. Stick to GGUF-format models when in doubt.
- Memory and context size: Always check the model's default context length before running on a memory-constrained machine. Use -c to cap it appropriately.
- Model size vs. accuracy: Smaller models (3B) are fast and lightweight but may lack knowledge on niche topics. For factual accuracy, 7B+ models are noticeably more reliable.
- --debug flag placement: It must come before the subcommand, i.e. ramalama --debug run, not ramalama run --debug.
- RamaLama is still in active development: The project moves fast. Flag names, behaviors, and supported features can change between versions. When in doubt, check ramalama --help or the official docs.

References

- RamaLama Official Docs
- RamaLama GitHub Repository
- RamaLama Blog
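As a worked example of the OpenAI-compatible endpoint that ramalama serve exposes, the sketch below builds a standard chat-completion request and sends it with only the Python standard library. This assumes the server is running locally on the default port 8080 and serves the usual /v1/chat/completions route; the helper names (build_payload, ask) are illustrative, not part of RamaLama.

```python
import json
import urllib.request

# Assumption: `ramalama serve granite3.1-moe:3b` is running on the
# default port 8080 with the standard OpenAI-style chat route.
API_URL = "http://localhost:8080/v1/chat/completions"


def build_payload(prompt, model="granite3.1-moe:3b", temperature=0):
    """Construct an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "temperature": temperature,  # 0 for reproducible answers
        "messages": [{"role": "user", "content": prompt}],
    }


def ask(prompt):
    """POST the prompt to the local server and return the reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Standard OpenAI response shape: first choice, message content.
    return body["choices"][0]["message"]["content"]


# Inspect the request body without needing the server running:
print(json.dumps(build_payload("What is RamaLama?"), indent=2))
# With the server up, a round trip would be: print(ask("What is RamaLama?"))
```

Because the payload follows the OpenAI wire format, the same server also works unchanged with the official openai Python client by pointing its base_url at http://localhost:8080/v1.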