
Arnav Gautam
Building and learning about local AI
arnav080/ornstein3.6-27b-4060-q3km-bb
arnav080/kimi-k2-6-nvfp4-0xS
Kimi-K2.6 519B (NVFP4) served via SGLang at 256k context with reasoning and tool-call parsers
arnav080/ornstein3.6-27b-dual-3090-mtp-JDT
arnav080/step-3-7-flash-nvfp4
Step 3.7 Flash MoE (198B) quantized to 4-bit (NVFP4) optimized for dual RTX 6000 GPUs
arnav080/minimax-m2-7-nvfp4
MiniMax M2.7 MoE (230B) quantized to 4-bit (NVFP4) optimized for dual RTX 6000 GPUs
arnav080/deepseek-v4-flash
DeepSeek V4-Flash MoE (284B) production profile tuned for 2x RTX 6000 GPUs
arnav080/step-3-7-flash-nvfp4-hikari
Step 3.7 Flash MoE (198B) quantized to 4-bit (NVFP4) optimized for dual RTX 6000 GPUs
arnav080/minimax-m2-7-nvfp4-hikari
MiniMax M2.7 MoE (230B) quantized to 4-bit (NVFP4) optimized for dual RTX 6000 GPUs
arnav080/deepseek-v4-flash-hikari
DeepSeek V4-Flash MoE (284B) production profile tuned for 2x RTX 6000 GPUs
arnav080/qwen3.6-35b-moe-tq3-rtx4060ti-no-checkpoint-witch
8GB GPU recipe for Qwen 3.6 35B MoE that disables context checkpoints to double deep-context throughput (~30 tok/s at 26k context).
arnav080/qwen3.6-35b-moe-rtx5080-128k-LeftCDev
Fully GPU-accelerated Qwen 3.6 35B MoE recipe optimized for 16GB VRAM GPUs, achieving 150 tok/s and 128k context.
arnav080/step-3-7-flash-bf16-multi-gpu
Unquantized BF16 Step-3.7-Flash multimodal recipe optimized for multi-GPU server clusters (e.g., 4x A100/H100 80GB).
arnav080/step-3.7-flash-multimodal-128gb-SudoSU
Multimodal Step-3.7-Flash config for ultra-large hardware (128GB VRAM / GB10) featuring f16 vision support and 65k context.
arnav080/qwen3.6-35b-moe-turboquant-tq3-AJKV
Optimized Qwen 3.6 35B MoE recipe using the experimental llama.cpp-tq3 (TurboQuant) engine, TQ3_4S quant, and hybrid offload.
arnav080/qwen3.6-35b-moe-4060ti-above_spec
Optimized Qwen 3.6 35B MoE config for 8GB GPUs, reaching 200k context via CPU expert offload and q8_0 KV cache.
arnav080/qwen3.6-27b-iq3-k-r4-5060ti-above_spec
Dense Qwen 3.6 27B on a single RTX 5060 Ti 16GB running completely on GPU with R4 quantization and 139k deep context.
arnav080/qwen3.6-35b-moe-4060ti-iq4-above_spec
Qwen 3.6 35B MoE on RTX 4060 Ti 8GB using the ik_llama.cpp fork for high-quality IQ4_K_R4 quantization and 262k context.
arnav080/qwen3.6-27b-dual-gpu-mtp-above_spec
Optimized Qwen 3.6 27B dual-GPU recipe with native MTP speculative decoding and 100k context.
arnav080/qwen3-30b-dual-gpu-speculative-100k
Qwen3-30B across two GPUs with native MTP speculative decoding. The tensor split is 58/42 to account for MTP pinning the draft head, plus KV growth, on GPU1. The key finding is -ub 256: it halves the per-prefill compute reserve (~1.8 GB), freeing ~1.4 GB on the memory-constrained card. Honest context ceiling: 102400 tokens (Config D ran cleanly from 0 to 94k).
arnav080/qwen3-30b-moe-8gb-cpu-offload
Qwen3-30B-A3B (MoE, 3B active parameters) on a single RTX 4060 Ti 8GB. All 41 MoE expert blocks are offloaded to CPU RAM. Achieves 49 tok/s at the full 262k native context with 3.2 GB of VRAM to spare, trading ~10% speed for 4x the context compared with partial GPU expert loading.
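
The last two entries describe their settings in prose, so here is a rough sketch of how they might map to a launch command. It assumes the recipes target llama.cpp's llama-server, which the `-ub` flag suggests but the listing does not state. The model paths, the `-ngl 99` value, the use of `--n-cpu-moe` for the expert offload, and reading "262k" as 262144 tokens are all assumptions. The flags for MTP speculative decoding are left out because the listing does not give them.

```python
import shlex

# Hypothetical GGUF paths: the listing does not name the model files.
DUAL_GPU_MODEL = "models/qwen3-30b.gguf"
MOE_MODEL = "models/qwen3-30b-a3b.gguf"


def dual_gpu_args(model=DUAL_GPU_MODEL):
    """Sketch of the dual-GPU entry: 58/42 split, -ub 256, 102400 context."""
    return [
        "llama-server",
        "-m", model,
        "-ngl", "99",               # assumption: offload every layer to the GPUs
        "--tensor-split", "58,42",  # proportions for GPU0 and GPU1, from the listing
        "-ub", "256",               # physical micro-batch size, from the listing
        "-c", "102400",             # context ceiling, from the listing
    ]


def moe_cpu_offload_args(model=MOE_MODEL, expert_blocks=41):
    """Sketch of the 8GB entry: MoE expert weights kept in CPU RAM."""
    return [
        "llama-server",
        "-m", model,
        "-ngl", "99",                         # assumption: all other weights on the GPU
        "--n-cpu-moe", str(expert_blocks),    # assumption: one way to keep 41 expert blocks on CPU
        "-c", "262144",                       # assumption: "262k" read as 262144 tokens
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    print(shlex.join(dual_gpu_args()))
    print(shlex.join(moe_cpu_offload_args()))
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without the models, and the argument lists can be passed to `subprocess.run` once the paths and flags are checked against the actual recipe.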