Creating Recipes

A Bloc recipe is a YAML manifest that describes how to run a local AI model with llama.cpp (or another engine) on target hardware. It does not contain the weights; it tells the Bloc CLI exactly how to download, optimize, and serve a model.

The Two-Layer Design

Every Bloc recipe YAML has two distinct layers serving different purposes:

Registry Metadata (Hub Website): Read by the Bloc Hub website to power search filters (VRAM, platform, model size), registry cards, and recipe pages.
Engine Configuration (CLI Execution): Passed verbatim to the Bloc CLI. The CLI translates these structured fields into engine arguments (e.g. llama-server flags) and environment setups.

Full Schema Reference

`schema` (required)

Tells the Bloc CLI which parser version to use for compatibility.

schema: "bloc/v1"

`extends` (optional)

Inherit fields from another recipe and override only what differs. Great for multi-VRAM variants of the same model.

extends: "arnav/qwen3-base"
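The override behavior can be pictured as a recursive merge of the child recipe over its parent. The sketch below is only an illustration: the source does not specify the merge rules, so the key-by-key merging of nested mappings, the replacement of lists, and the `merge_recipe` helper itself are all assumptions.

```python
from copy import deepcopy


def merge_recipe(base: dict, child: dict) -> dict:
    """Overlay child on base; nested mappings merge key by key (assumed rule)."""
    merged = deepcopy(base)
    for key, value in child.items():
        if key == "extends":
            continue  # the pointer to the parent is not part of the resolved recipe
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_recipe(merged[key], value)
        else:
            merged[key] = deepcopy(value)  # scalars and lists replace the parent value
    return merged


# Hypothetical parent and child recipes, already parsed from YAML into dicts.
base = {
    "schema": "bloc/v1",
    "hardware": {"min_vram": "8GB", "target_platform": "cuda"},
    "engine_config": {"ctx_size": 8192, "gpu_layers": 28},
}
child = {
    "extends": "arnav/qwen3-base",
    "hardware": {"min_vram": "16GB"},
    "engine_config": {"gpu_layers": 99},
}
resolved = merge_recipe(base, child)
```

Under these assumptions the 16GB variant only restates `min_vram` and `gpu_layers`; everything else comes from the parent.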

`metadata` (required)

metadata:
  name: "qwen3-30b-moe-8gb-cpu-offload"
  description: "What model, what hardware, what result."
  tags: [moe, long-context, 8gb, cuda, reasoning]
  author_notes: |
    Multi-line developer notes: trade-offs, experiments, and tips.

name: Unique per-user lowercase identifier (letters, numbers, hyphens). Combined with your username, it forms the unique ID: {username}/{name}.
description: Brief summary displayed on cards.
tags: Array of strings used for search filters (e.g., moe, cuda, metal, reasoning).
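A minimal sketch of the naming rule, assuming it is enforced with a simple character-class check. The exact pattern Bloc uses is not given in the source; `NAME_PATTERN` and `recipe_id` are hypothetical names.

```python
import re

# Lowercase letters, digits, and hyphens only, as described above (assumed pattern).
NAME_PATTERN = re.compile(r"[a-z0-9-]+")


def recipe_id(username: str, name: str) -> str:
    """Build the {username}/{name} identifier after checking the recipe name."""
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid recipe name: {name!r}")
    return f"{username}/{name}"
```

For example, `recipe_id("arnav", "qwen3-30b-moe-8gb-cpu-offload")` yields `arnav/qwen3-30b-moe-8gb-cpu-offload`, while a name containing uppercase letters or underscores raises `ValueError`.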

`model` (required)

Describes the GGUF model file or the Hugging Face repository to download. You must specify either download_url (for GGUF files) or hf_repo (for full repository downloads, as used by vLLM).

model:
  source: "huggingface:Qwen/Qwen3-30B-A3B"
  
  # For GGUF/llama.cpp downloads:
  gguf_repo: "huggingface:bartowski/Qwen_Qwen3-30B-A3B-GGUF"
  file: "Qwen_Qwen3-30B-A3B-Q4_K_M.gguf"
  download_url: "https://huggingface.co/bartowski/Qwen_Qwen3-30B-A3B-GGUF/resolve/main/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf"
  
  # For full HF repository downloads (e.g. vLLM):
  # hf_repo: "Qwen/Qwen3-30B-A3B"
  
  quantization: "Q4_K_M"
  size_gb: 17.2
  parameters: "30B"
  architecture: "MoE"

source: The upstream base model on Hugging Face (huggingface:org/repo).
hf_repo: Hugging Face repo path (org/repo) for full model repository downloads. Validated against a strict security regex.
file: Precise GGUF filename for deterministic download caching.
download_url: Direct HTTPS GGUF download link.
sha256 (optional): SHA-256 checksum of the GGUF file. Highly recommended so the downloaded file can be verified.
size_gb (optional): Approximate download size in gigabytes (e.g. 4.8). Used to display size stats on the website and check disk space before downloading.
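To fill in `sha256`, the checksum can be computed from a local copy of the GGUF file. The sketch below also shows one way a disk-space check against `size_gb` might look. Both helper names are hypothetical, and treating a gigabyte as 10^9 bytes is an assumption, since the source does not say which unit the CLI uses.

```python
import hashlib
import shutil


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so multi-gigabyte models never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_room_for(size_gb: float, directory: str = ".") -> bool:
    """Return True if the filesystem holding directory has at least size_gb free."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= size_gb * 1000**3  # assumes decimal gigabytes
```

The hex string returned by `sha256_of_file` is what goes into the recipe's `sha256` field.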

`engine` (required)

engine:
  name: "llama.cpp" # options: "llama.cpp", "vllm"
  runtime: "native" # options: "native", "docker"
  version: "0.9.0"  # pinned engine version (semver format)
  image: "vllm/vllm-openai:v0.9.0" # Docker image tag (docker runtime only)
  tested_commit: "b5350" # (llama.cpp only)

runtime: How to run the engine. Defaults to native. Use docker to launch in containerized mode.
version: A validated semver string (e.g., 0.9.0) specifying the pinned version.
image: Validated Docker image tag format (e.g., vllm/vllm-openai:v0.9.0).
tested_commit: The llama.cpp build tag (e.g., b5350) or short git hash this recipe was verified against. The CLI compares it with the user's binary and warns on a version mismatch.
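As a rough illustration of the semver requirement on `version`, the check below accepts only a plain MAJOR.MINOR.PATCH string. This is an assumption about how strict the real validator is: pre-release and build-metadata suffixes, which full semver allows, are deliberately left out, and `is_pinned_version` is a hypothetical helper.

```python
import re

# MAJOR.MINOR.PATCH with no leading zeros; suffixes such as -rc1 are not handled here.
SEMVER_CORE = re.compile(r"(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)")


def is_pinned_version(version: str) -> bool:
    """Return True if version is a bare three-part semver string."""
    return SEMVER_CORE.fullmatch(version) is not None
```

With this check, `"0.9.0"` passes, while `"v0.9.0"` and `"0.9"` do not.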

`hardware` (required)

Specifies the minimum and recommended hardware for running the configuration.

hardware:
  min_vram: "8GB" # options: 4GB, 8GB, 12GB, 16GB, 24GB, Unified (Mac)
  target_platform: "cuda" # options: cuda, metal, rocm, cpu, vulkan
  gpu_count: 1
  recommended_ram: "32GB"

Engine Configuration (`engine_config`)

These fields map to engine-specific CLI flags during deployment execution. Fields set to null are ignored.
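The mapping from fields to flags can be sketched as a lookup table. This is not the Bloc CLI's implementation: the flag names are the ones listed in the template comments in this document, the `to_llama_args` helper is hypothetical, and `flash_attn` is omitted because the form of `-fa` (bare switch or valued option) has differed between llama.cpp versions.

```python
# Fields that take a value, and fields that are bare on/off switches.
LLAMA_VALUE_FLAGS = {
    "ctx_size": "-c",
    "gpu_layers": "-ngl",
    "threads": "-t",
    "batch_size": "-b",
    "ubatch_size": "-ub",
    "cache_type_k": "-ctk",
    "cache_type_v": "-ctv",
}
LLAMA_SWITCHES = {"mlock": "--mlock", "jinja": "--jinja"}


def to_llama_args(engine_config: dict) -> list:
    """Translate engine_config into an argument list, skipping null fields."""
    args = []
    for field, value in engine_config.items():
        if value is None:
            continue  # null means "not set", so no flag is emitted
        if field in LLAMA_VALUE_FLAGS:
            args += [LLAMA_VALUE_FLAGS[field], str(value)]
        elif field in LLAMA_SWITCHES and value is True:
            args.append(LLAMA_SWITCHES[field])
        # Fields outside both tables are silently skipped in this sketch.
    return args
```

A config of `{"ctx_size": 8192, "gpu_layers": 99, "split_mode": None, "jinja": True}` would produce `["-c", "8192", "-ngl", "99", "--jinja"]`.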

[!IMPORTANT] Cross-Engine Validation: To prevent confusing behavior, llama.cpp-only flags (gpu_layers, n_cpu_moe, batch_size, ubatch_size) cannot be defined in a vllm engine recipe, and vLLM-only flags cannot be used in a llama.cpp recipe.
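A minimal sketch of this cross-engine check, assuming it amounts to set membership. The llama.cpp-only set is taken from the note above; the vLLM-only set is an illustrative subset, the choice to ignore null fields is an assumption, and `cross_engine_errors` is a hypothetical name.

```python
LLAMA_CPP_ONLY = {"gpu_layers", "n_cpu_moe", "batch_size", "ubatch_size"}
# Illustrative subset of the vLLM-specific parameters, not the full list.
VLLM_ONLY = {"tensor_parallel_size", "gpu_memory_utilization", "max_model_len"}


def cross_engine_errors(engine_name: str, engine_config: dict) -> list:
    """Return one message per field that belongs to the other engine."""
    if engine_name == "vllm":
        forbidden = LLAMA_CPP_ONLY
    elif engine_name == "llama.cpp":
        forbidden = VLLM_ONLY
    else:
        return [f"unknown engine: {engine_name}"]
    # Assumption: fields explicitly set to null do not count as defined.
    defined = {field for field, value in engine_config.items() if value is not None}
    return [f"{field} is not valid for {engine_name}" for field in sorted(forbidden & defined)]
```

An empty list means the recipe passes; a vLLM recipe that sets `gpu_layers` would get back `["gpu_layers is not valid for vllm"]`.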

[!NOTE] The Explicit Field Rule: Always define options explicitly rather than relying on default values, ensuring your recipes remain compatible if engines update their defaults.

Context & Batching

ctx_size: Context window size in tokens (-c). KV cache memory scales linearly with this value.
batch_size: Logical batch size for prompt processing (-b). Larger values speed up prompt prefill but increase peak memory. (llama.cpp only)
ubatch_size: Physical micro-batch size (-ub). Halving it (e.g., from 512 to 256) can save up to ~1.4GB of VRAM during prefill on tight setups. (llama.cpp only)

GPU Offloading

gpu_layers: Number of layers loaded into VRAM (-ngl). Use 99 for full GPU offload. (llama.cpp only)
split_mode: Multi-GPU splitting strategy (layer, row, or none). (llama.cpp only)
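tensor_split: Proportion of the model assigned to each GPU, as a comma-separated list (e.g. 58,42). (llama.cpp only)
flash_attn: Enables Flash Attention (-fa). It avoids materializing the full attention matrix, so the VRAM saving grows with context depth. Highly recommended on compatible hardware. (llama.cpp only)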
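The `tensor_split` values are relative weights, not percentages that must sum to 100. A small sketch of how such a list normalizes into per-GPU fractions (the `split_fractions` helper is hypothetical):

```python
def split_fractions(tensor_split: str) -> list:
    """Turn a comma-separated weight list like "58,42" into per-GPU fractions."""
    weights = [float(part) for part in tensor_split.split(",")]
    total = sum(weights)
    if total <= 0 or any(weight < 0 for weight in weights):
        raise ValueError(f"invalid tensor_split: {tensor_split!r}")
    return [weight / total for weight in weights]
```

So `"58,42"` and `"29,21"` describe the same split, and `"3,1"` puts three quarters of the model on the first GPU.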

Mixture-of-Experts (MoE)

n_cpu_moe: Number of layers, counted from the first, whose MoE expert weights are kept in system RAM (--n-cpu-moe). (llama.cpp only)
- 99 = Expert weights of every layer on CPU (maximum context, lowest VRAM usage, ~10% slower speed).
- 32 = Partial offload, expert weights of the first 32 layers on CPU (faster inference, uses more VRAM).

KV Cache Quantization

cache_type_k / cache_type_v: Quantization format for the keys and values stored in the KV cache (-ctk / -ctv). (llama.cpp only)
- f16: Engine default, unquantized 16-bit, highest memory.
- q8_0: Near-perfect quality, ~50% smaller KV cache (recommended starting point).
- q4_0: Aggressive VRAM saving (great for deep chat context; may cause minor reasoning issues).
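A back-of-the-envelope estimate shows where these savings come from. The sketch assumes a standard transformer that caches keys and values for every layer, and uses the ggml block layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes). The model dimensions in the usage note are hypothetical and do not describe any particular model.

```python
# Average bytes per cached element for each cache type.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(ctx_size: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 cache_type_k: str = "f16", cache_type_v: str = "f16") -> float:
    """Estimate KV cache size in GiB for a single sequence."""
    # One key vector and one value vector per token, per layer, per KV head.
    elements_per_side = ctx_size * n_layers * n_kv_heads * head_dim
    total_bytes = elements_per_side * (
        BYTES_PER_ELEMENT[cache_type_k] + BYTES_PER_ELEMENT[cache_type_v]
    )
    return total_bytes / 1024**3
```

For a hypothetical model with 48 layers, 4 KV heads, and a head dimension of 128 at `ctx_size: 32768`, this gives 3.0 GiB with f16, about 1.59 GiB with q8_0, and about 0.84 GiB with q4_0. Doubling `ctx_size` doubles each figure.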

Speculative Decoding

spec_type: Speculative decoding type (draft or draft-mtp for native multi-token prediction). (llama.cpp only)
spec_draft_model: Path or HF identifier for the draft model. (llama.cpp only)

vLLM-Specific Parameters

tensor_parallel_size: Number of GPUs to partition the model across using Tensor Parallelism.
gpu_memory_utilization: Fraction of GPU memory allocated for model weights and KV cache (0.0 to 1.0).
max_model_len: Maximum sequence length capacity.
dtype: Model weights precision data type (e.g., auto, float16, bfloat16).
kv_cache_dtype: Key-value cache precision format.
quantization: Specify model quantization type (e.g. awq, gptq, squeezellm).
enable_expert_parallel: Enable parallel MoE expert processing on multi-GPU setups.
tokenizer_mode: Tokenizer behavior override (e.g. auto, slow).
tool_call_parser: Parser to interpret model-generated tool calls (e.g. llama3_json, mistral).
reasoning_parser: Parser for model reasoning/thinking tokens (e.g. deepseek_r1).
trust_remote_code: Must be explicitly set to true to execute custom model-specific code. This triggers an interactive user safety confirmation gate on the CLI.
speculative_model: Pointer to the draft model for speculative decoding.
num_speculative_tokens: Number of tokens the draft model proposes per decoding step.

System Settings

threads: Number of CPU threads (-t). Set it to the number of physical cores; counting logical (SMT/hyper-threaded) cores oversubscribes the CPU and tends to slow inference.
mlock: Locks weights in physical RAM to prevent OS swapping/latency spikes.
jinja: Enable Jinja2 chat templates (required for modern chat formatting like Qwen/Llama).
port: Port to serve on.

[!WARNING] Ports under 1024 (privileged) and port 0 (dynamic binding) are blocked for safety. Only ports between 1024 and 65535 are allowed.

extra_args: Escape hatch array for custom/unsupported flags:

```
extra_args:
  - "--cuda-graphs"
```

[!WARNING] Flags passed to extra_args are validated against a strict security allowlist. Dangerous flags (like --api-key or overriding host addresses) are rejected.
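The port rule from System Settings reduces to a range check. A minimal sketch, with `validate_port` as a hypothetical helper name; how the real CLI reports the error is not covered by the source.

```python
def validate_port(port: int) -> int:
    """Accept only unprivileged, explicitly chosen ports (1024 to 65535)."""
    if isinstance(port, bool) or not isinstance(port, int):
        raise TypeError("port must be an integer")
    if not 1024 <= port <= 65535:
        raise ValueError(f"port {port} is outside the allowed range 1024-65535")
    return port
```

Both `0` and `80` are rejected by this check, while `8080` is returned unchanged.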

Pre-Run Hooks (`pre_run`)

Optional system-level operations executed before the engine launches. The CLI will request confirmation from the user before executing commands.

pre_run:
  env:
    CUDA_VISIBLE_DEVICES: "0,1"
  commands:
    - "sudo nvidia-smi -pl 250" # power limits GPU for efficiency
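One way to picture the confirmation step: environment variables are layered over the current environment, and only commands the user approves are kept for execution. This is a sketch under assumptions; the helper names, the per-command prompt, and the injected `confirm` callback are hypothetical and not taken from the Bloc CLI.

```python
import os


def build_env(pre_run_env: dict, base_env=None) -> dict:
    """Return a copy of the base environment with the pre_run.env values applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({key: str(value) for key, value in pre_run_env.items()})
    return env


def approved_commands(commands: list, confirm) -> list:
    """Keep only the commands for which confirm(command) returns True."""
    return [command for command in commands if confirm(command)]
```

In an interactive CLI, `confirm` could wrap a yes/no prompt; passing `lambda command: False` models a user who declines every hook.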

Common Configuration Patterns

Single GPU Budget Setup (8GB VRAM)

hardware:
  min_vram: "8GB"
  target_platform: "cuda"
engine_config:
  gpu_layers: 28
  ctx_size: 8192
  flash_attn: true
  cache_type_k: "q8_0"
  cache_type_v: "q8_0"
  threads: 8

Apple Silicon (Metal Unified Memory)

hardware:
  min_vram: "Unified"
  target_platform: "metal"
engine_config:
  gpu_layers: 99
  ctx_size: 32768
  flash_attn: true
  cache_type_k: "q8_0"
  cache_type_v: "q8_0"
  threads: 6 # matched to physical performance cores
  mlock: false
  jinja: true

MoE Offloading (Large Model, Low VRAM)

hardware:
  min_vram: "8GB"
  recommended_ram: "32GB"
engine_config:
  gpu_layers: 99
  n_cpu_moe: 99 # moves experts to system RAM
  ctx_size: 262144
  flash_attn: true
  cache_type_k: "q4_0"
  cache_type_v: "q4_0"
  jinja: true

vLLM Single-GPU Setup (Containerized)

model:
  source: "huggingface:meta-llama/Meta-Llama-3-8B-Instruct"
  hf_repo: "meta-llama/Meta-Llama-3-8B-Instruct"
  parameters: "8B"
  architecture: "Dense"
engine:
  name: "vllm"
  runtime: "docker"
  version: "0.9.0"
  image: "vllm/vllm-openai:v0.9.0"
hardware:
  min_vram: "16GB"
  target_platform: "cuda"
  gpu_count: 1
engine_config:
  tensor_parallel_size: 1
  gpu_memory_utilization: 0.90
  max_model_len: 8192
  trust_remote_code: false
  host: "127.0.0.1"
  port: 8080

YAML Template Reference

Below is a complete, annotated template you can copy to create your own recipes. It covers both engine branches and all validated parameters under the bloc/v1 schema.

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
#  BLOC RECIPE BLUEPRINT  ·  schema bloc/v1
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

schema: "bloc/v1"
extends: null

# ─── LAYER 1: Registry Metadata (Parsed & Indexed by Hub) ──────
metadata:
  name: "your-recipe-name"     # Lowercase letters, numbers, and hyphens only
  description: "A short, one-sentence card summary."
  tags:
    - "cuda"                   # cuda | metal | rocm | cpu | vulkan
    - "8gb"                    # VRAM target
    - "reasoning"
  author_notes: |
    Multi-line developer details, benchmarks, and target hardware.

model:
  source: "huggingface:org/repo"
  
  # For llama.cpp (GGUF):
  gguf_repo: "huggingface:quantizer/repo-gguf"
  file: "model-Q4_K_M.gguf"
  download_url: "https://huggingface.co/..."
  
  # For vLLM (Alternative):
  # hf_repo: "org/repo"

  quantization: "Q4_K_M"
  size_gb: 4.8
  parameters: "7B"
  architecture: "Dense"

engine:
  name: "llama.cpp"           # "llama.cpp" or "vllm"
  runtime: "native"           # "native" or "docker"
  version: "0.9.0"            # pinned engine version (semver format)
  image: null                 # Docker image tag (required if runtime is docker)
  tested_commit: "b5350"      # (llama.cpp only)

hardware:
  min_vram: "8GB"             # 4GB | 8GB | 12GB | 16GB | 24GB | Unified
  target_platform: "cuda"     # cuda | metal | rocm | cpu | vulkan
  gpu_count: 1
  recommended_ram: "16GB"

# ─── LAYER 2: Engine Config (Translated directly to backend CLI flags)
engine_config:
  # [ Server (Shared) ]
  host: "127.0.0.1"
  port: 8080                  # Must be between 1024 and 65535
  n_parallel: 1               # -np (llama.cpp) or --max-num-seqs (vLLM)

  # ─── llama.cpp-specific configurations ───
  ctx_size: 8192              # -c
  gpu_layers: 99              # -ngl (99 = offload all)
  flash_attn: true            # -fa
  mlock: false                # --mlock
  mmap: true                  # Set false for --no-mmap
  split_mode: null            # --split-mode  null | "none" | "layer" | "row"
  tensor_split: null          # -ts
  main_gpu: 0                 # -mg
  threads: 8                  # -t
  batch_size: 512             # -b
  ubatch_size: 256            # -ub
  cache_type_k: "q8_0"        # -ctk
  cache_type_v: "q8_0"        # -ctv
  jinja: true                 # --jinja

  # ─── vLLM-specific configurations ───
  # tensor_parallel_size: 1
  # gpu_memory_utilization: 0.90
  # max_model_len: 8192
  # trust_remote_code: false

  # [ Escape Hatch (Allowlisted flags only) ]
  extra_args: []

# ─── Pre-Run System Hooks (optional, outside the two layers) ───
pre_run:
  env: {}
  commands: []