Successfully running Qwen 3.5 27B on my NVIDIA RTX 4090 (using 21 GB of CUDA memory)


I have successfully launched the Qwen 3.5 27B model, with 4-bit quantization, on my GPU server with an NVIDIA RTX 4090.

The launch command I used was:

./build/bin/llama-server -m Qwen3.5-27B-Q4_K_M.gguf -ngl 99 -c 262144 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --host 0.0.0.0
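Once llama-server is up, it exposes an OpenAI-compatible HTTP API. As a quick sanity check, something like the following sketch can query the chat completions endpoint (the host and port 8080, llama-server's default, are assumptions; adjust to your setup):

```python
import json
import urllib.request

# Hypothetical request payload for llama-server's OpenAI-compatible API.
payload = {
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}

def build_request(host="http://localhost:8080"):
    # POST to the OpenAI-compatible chat completions endpoint.
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request()
# Send with: urllib.request.urlopen(req) once the server is running.
```

Any OpenAI-compatible client (such as the official `openai` Python package pointed at this base URL) should work the same way.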

I am using the model through the Hermes agent. My setup was inspired by Sudo Su’s blog, which showed how capable the Qwen 3.5 9B model is when paired with the Hermes agent.

With the 27B model, I am using around 21 GB of CUDA memory. It is hard to imagine having access to GPT-5 level intelligence on such modest GPU hardware! I have also switched my email triage system over to Hermes and Qwen 3.5 27B.
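The 21 GB figure is plausible from a back-of-envelope estimate. As a rough sketch (the ~4.8 bits per weight for Q4_K_M is an approximation, and the remainder is attributed to the q4_0 KV cache, activations, and CUDA overhead):

```python
# Rough estimate of VRAM taken by the quantized weights alone.
params = 27e9            # 27B parameters
bits_per_weight = 4.8    # approximate effective rate for Q4_K_M (assumption)
weight_gib = params * bits_per_weight / 8 / 1024**3
print(f"~{weight_gib:.1f} GiB for weights alone")  # ~15.1 GiB
```

That leaves roughly 6 GB for the long-context KV cache and runtime overhead, which is consistent with quantizing the cache to q4_0 at a large context size.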