CUDA, 16GB VRAM

arnav080/qwen3-30b-dual-gpu-speculative-100k

Name: arnav080/qwen3-30b-dual-gpu-speculative-100k
Author: arnav080

Runs Qwen3-30B across two GPUs with native MTP speculative decoding. The tensor split is 58/42 to account for MTP pinning the draft head, and for KV growth, on GPU1. The key finding is -ub 256: it halves the per-prefill compute reserve (~1.8 GB), freeing ~1.4 GB on the choking card. Honest context ceiling: 102400 tokens (Config D ran 0→94k clean).
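
For readers who want to see how the settings above might map onto llama.cpp flags, here is a minimal sketch. It is not the recipe's actual YAML, which is not shown on this page. The model filename and the -ngl 99 value are assumptions made for illustration, and the flag that enables MTP speculative decoding is deliberately left out because the description does not name it. The snippet only assembles and prints the command; it does not launch anything.

```shell
# Sketch only: assemble llama-server flags from the values quoted in the description.
MODEL_PATH="./Qwen3-30B-A3B-Q4_K_M.gguf"  # hypothetical local filename (assumption)
CTX=102400                                # stated context ceiling
TENSOR_SPLIT="58,42"                      # stated GPU0/GPU1 split
UBATCH=256                                # stated -ub value
GPU_LAYERS=99                             # assumption: offload all layers to the GPUs

# The MTP speculative-decoding option is intentionally omitted (not given in the source).
LLAMA_CMD="llama-server -m $MODEL_PATH -c $CTX -ts $TENSOR_SPLIT -ub $UBATCH -ngl $GPU_LAYERS"

# Print the command for inspection instead of running it.
echo "$LLAMA_CMD"
```

As far as I know, llama.cpp's default physical batch size (-ub) is 512, which would be consistent with the description's claim that 256 halves the prefill compute reserve; treat that as a likely reading rather than a confirmed one.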

Configuration Specifications

Base Model	huggingface:Qwen/Qwen3-30B-A3B
Engine	llama.cpp
Quantization	Q4_K_M
Platform	cuda
VRAM Required	16GB minimum

Telemetry Benchmarks

No runs recorded yet

Be the first to benchmark this recipe! Run the CLI command from your terminal:

bloc run arnav080/qwen3-30b-dual-gpu-speculative-100k

qwen3-30b-dual-gpu-speculative-100k.yaml
