Not Enough RAM! Gemma3n vs Qwen3 on a 16GB Device: Building My Private AI Assistant


Running large language models like Gemma 3 on a Mac mini M4 with only 16GB of RAM quickly forced me to confront the hard limits of local inference. Initially, I tried running the 12B quantized GGUF models via Ollama, but kept hitting 500 errors—turns out, even Q4_0 quantization wasn’t enough to fit the model into available memory. Digging deeper, I realized that most of the 16GB RAM was already claimed by macOS and background services, leaving barely 6–7GB free for inference.

Instead of forcing it, I shifted strategy: I profiled my system memory using vm_stat, trimmed background processes, and tested multiple quantization levels (Q4, Q5, even Q8) across models. Still, running 12B remained unstable. The breakthrough came when I discovered the MLX community’s Apple-optimized versions of Gemma.
These use Metal-backed GPU acceleration and memory-efficient formats like int4, making it possible to run 7B and even 12B models smoothly on the same hardware. By choosing the right quantization and inference backend (MLX over GGUF), I found a setup that balances speed, memory, and model capability, all locally on Apple Silicon.

So this week, I went full mad-scientist mode trying to pick the best model to finetune for a secure, local, codable chatbot experience. The contenders? Gemma 3B/7B and Qwen3-7B/8B. My goals were clear:

  • Must run on my Mac mini (M4, 16GB RAM)
  • Must respond like a helpful dev agent — not lobotomized
  • Must be finetunable (via LoRA or QLoRA, ideally)
  • Must support private data integration (RAG or post-training)
  • Must be able to code, explain, and assist

This document details my exploration of large language models, including my experiments with quantization and fine-tuning.

Phase 0: Qwen3 vs. Gemma3 – Which to Choose?

Simply put, Qwen3 (especially Qwen3-8B) performs better after instruction tuning – it has a more flexible language style and its answers feel more “human.” Gemma3, coming from a Google-led team, has a cleaner architecture and a stronger research focus, which makes it ideal for DIY projects. However, the default base version doesn’t chat at all; you need to train it yourself or pick a version with the “-it” (instruction-tuned) suffix.

The key takeaway here is: a model that seems bad isn’t necessarily a bad model – it’s often just under-trained.

Phase 1: What is DeepSeek Abliteration and Why Does it Work?

Recently, DeepSeek pulled off a clever move: they used the chain-of-thought output from their large model, DeepSeek-R1-0528, to distill Qwen3-8B. This resulted in a new model:

DeepSeek-R1-0528-Qwen3-8B-GGUF

Technically, it’s a fine-tuned branch of Qwen3, optimized through DeepSeek’s R1 method.

Models like:

  • LLaMA3-Instruct
  • MythoMax / Hermes2
  • OpenChat 3.5

fall into the same category – they use custom SFT (Supervised Fine-Tuning) data and RLHF (Reinforcement Learning from Human Feedback) / Preference Tuning to train base models into specific
behaviors.

Example:

Prompt: “Please write a Python function to determine if a string is a palindrome.”

  • Qwen3-8B-Base: Outputs a lot of explanations, but no code.
  • DeepSeek Abliterated Qwen3-8B: Directly provides the function body, test cases, and an explanation of why.

This demonstrates the power of abliteration – it re-teaches a previously cautious model to think and output complete code logic.
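
For reference, here is a minimal sketch of the kind of complete answer I was hoping for: a working function body plus a couple of test cases (the exact model output will of course vary).

def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards.

    Non-alphanumeric characters are ignored and comparison is
    case-insensitive, so "A man, a plan, a canal: Panama" counts.
    """
    cleaned = [c.lower() for c in s if c.isalnum()]
    return cleaned == cleaned[::-1]

# Quick test cases
assert is_palindrome("racecar")
assert is_palindrome("A man, a plan, a canal: Panama")
assert not is_palindrome("hello")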

Phase 2: Technical Deep Dive into Model Quantization (Especially for Local Deployment)

If you’re using a Mac mini or a laptop with 32GB of memory or less, quantization is a topic you can’t ignore.

What is Base Model Quantization?

Base models are the original, un-instruction-tuned large models, having undergone massive self-supervised training. Quantization is the process of compressing the float16/32 weights into int4,
int8, etc. formats.

| Quantization | Precision | Memory Usage | Inference Speed |
|---|---|---|---|
| FP16 | High | High | Slow |
| Q4_0 | Medium-Low | Very Low | Fast |
| Q4_K_M | Medium-High | Low | Fast |
| Q8_0 | High | Medium-High | Slightly Slow |
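
As a rough sanity check before downloading, you can estimate a model's on-disk size from its parameter count and an approximate bits-per-weight figure for each format. A minimal Python sketch, where the bits-per-weight numbers are ballpark assumptions rather than exact llama.cpp values:

# Approximate bits-per-weight for common GGUF quantization formats (assumed).
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8, "Q4_0": 4.5}

def approx_file_size_gb(params_billions: float, fmt: str) -> float:
    total_bits = params_billions * 1e9 * BITS_PER_WEIGHT[fmt]
    return total_bits / 8 / 1024**3

for fmt in BITS_PER_WEIGHT:
    print(f"8B model at {fmt}: ~{approx_file_size_gb(8, fmt):.1f} GB")
# FP16 ≈ 14.9, Q8_0 ≈ 7.9, Q4_K_M ≈ 4.5, Q4_0 ≈ 4.2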

Comparison

| Model | Size | RAM Needed | TPS | Chinese Ability | Best For |
|---|---|---|---|---|---|
| Gemma-3n E4B Q4_K_M | ~4B | <6GB | ✅ fast | average | local daily use, agents |
| Qwen3-14B Q4_K_M | ~14B | 13–16GB+ | ❌ slow | very strong | offline high-quality inference (slow but tolerable) |
| Qwen1.5-7B Q4_K_M | ~7B | 7–9GB | ⚠ medium | – | a middle-ground choice |

GGUF Memory Calculation

Total memory ≈ weight memory + KV cache memory + other buffers
Weight memory ≈ GGUF file size × decompression factor (approximately 1.5 for Q2_K)
KV cache ≈ number of layers × context_length × embedding_length × bytes_per_value
Other buffers ≈ hundreds of MB
Therefore, when deploying on Ollama, llama.cpp, or mlc-llm, choose the appropriate GGUF version – Q4_K_M and Q8_0 offer a good balance.

First, the model weights themselves. After quantization, the GGUF file stores compressed integer weights, but llama.cpp decodes them into compact tensors when loading. For 2-bit × group quantization (Q2_K), the measured memory usage is approximately 1.5 times the file size—for example, a 9.3 GB Q2_K file will occupy approximately 14 GB after loading. This is consistent with the common saying in the community that “loading a quantized model requires ∼(file size) × (bits/8) × decompression factor”, and also confirms the experience that “memory requirements are about 2-4 times the disk size” (WizardLM-7B 14GB file loading requires ∼30GB of memory) ([Artificial Intelligence Stack Exchange][1]).

Second, the KV cache. The longer the context, the more key/value pairs need to be stored. Each Transformer layer keeps a KV tensor covering context_length (e.g. 4096), with a shape of (context_length, head_count_kv, head_size), and the values are generally stored in FP16. Taking Qwen3-Coder-30B as an example (48 layers, head_count_kv=4, head_size=128, embedding_length=2048), a rough estimate is:

layers × context × embedding_length × 2 bytes
≈ 48 × 4096 × 2048 × 2 bytes
≈ 800 MiB

If you enable a longer context (say 16K tokens), the KV cache grows linearly with it.
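
Putting the rules of thumb above into code, here is a minimal estimator for peak memory; the 1.5× decompression factor and the 0.5 GB buffer allowance are rough assumptions taken from the observations in this section, not exact values.

def estimate_gguf_memory_gb(
    file_size_gb: float,       # size of the .gguf file on disk
    layers: int,               # block_count from the GGUF metadata
    context_length: int,       # the --ctx-size you plan to run with
    embedding_length: int,     # the model's hidden size
    decompression_factor: float = 1.5,  # assumed; measured for Q2_K above
    buffer_gb: float = 0.5,             # assumed allowance for other buffers
) -> float:
    weight_gb = file_size_gb * decompression_factor
    # KV cache: layers × context × embedding_length × 2 bytes (FP16)
    kv_cache_gb = layers * context_length * embedding_length * 2 / 1024**3
    return weight_gb + kv_cache_gb + buffer_gb

# Qwen3-Coder-30B Q2_K example from above: 9.3 GB file, 48 layers, 4096 ctx
print(estimate_gguf_memory_gb(9.3, 48, 4096, 2048))   # ≈ 15.2 GB, too tight for 16 GB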

Phase 3: Understanding the Model Repository Structure (Using Qwen as an Example)

Qwen/Qwen3-8B-Base
├── Adapters (8)
├── Finetunes (80)
├── Merges (3)
└── Quantizations (30)

Interpretation:

  • Base: The original model.
  • Adapters: LoRA (Low-Rank Adaptation) fine-tuning modules.
  • Finetunes: Full or mixed fine-tuning versions.
  • Merges: Weight combinations from multiple finetunes.
  • Quantizations: Quantized format versions (like GGUF).

Benchmarks & Tradeoffs — Qwen3 vs Gemma3

Let’s cut to it: Qwen3 models are generally chattier and more multilingual, with better instruction tuning (especially in their Chat versions). But Gemma3 has some architectural advantages, like being more open, more compact, and Google Research-backed.

But here’s where it gets spicy — Gemma3 Base is not finetuned for chat out-of-the-box. Same with Qwen3 Base. So if you’re not careful, you’ll think they “suck” when in reality, they’re just raw. The value is in finetuning them your way.

I tried running:

  • gemma-3b-it-Q4_K_M (int4 quant via GGUF)
  • qwen3-8b-base-Q4_K_M
  • deepseek-moe-16x1.3b just for fun

Load times were fine. But when asked to “write a function to parse JSON,” only Qwen3 understood my vague prompt. Gemma3 gave up like it was trained on StackOverflow flags.


I stumbled on a distilled model called:

DeepSeek-R1-0528-Qwen3-8B-GGUF

This is Qwen3-8B Base, post-trained using chain-of-thought data from DeepSeek’s raw model. It’s sharper, more logical, and yes, slightly deranged in a good way. I ran some evals:

Prompt: “Implement a WebSocket server in Go”
Vanilla Qwen3-8B-Base: Half answer, talks about goroutines, no code.
Abliterated DeepSeek-Qwen3-8B: Full server, with error handling, log statements, and config options.

Conclusion? Abliteration works — if your model is overaligned, abliterate it with purpose.


Can It Run Locally and Privately?

Yes. With llama.cpp, ggml/gguf formats, and OpenWebUI, you can run a fully uncensored assistant on macOS. For better UX and parameter tweaking (temperature, top_p, stop words), I used OpenWebUI and KoboldCpp.

Settings that worked for me:

  • Temperature: 0.7
  • Top_p: 0.95
  • Context length: 4096 (up to 8192 if you dare)

This way, the model never calls OpenAI. Everything is local, encrypted, and yours. You can even hook it into a vector DB (like Chroma or Milvus) for RAG-style knowledge injection.
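
Here is a minimal sketch of that RAG loop, assuming chromadb is installed (its default embedder downloads a small local model on first use) and a llama.cpp server is listening on localhost:10000; the notes and the endpoint are placeholders.

import requests
import chromadb

# Index a few private notes in an in-memory Chroma collection.
client = chromadb.Client()
notes = client.create_collection("notes")
notes.add(
    ids=["n1", "n2"],
    documents=[
        "Our internal API gateway listens on port 8443 and requires mTLS.",
        "Deploy scripts live in infra/deploy and are run via make release.",
    ],
)

question = "How do I call the internal API gateway?"
hits = notes.query(query_texts=[question], n_results=2)
context = "\n".join(hits["documents"][0])

# Feed the retrieved context to the local model via llama.cpp's /completion endpoint.
prompt = f"Use the notes below to answer.\n\nNotes:\n{context}\n\nQuestion: {question}\nAnswer:"
resp = requests.post(
    "http://localhost:10000/completion",
    json={"prompt": prompt, "n_predict": 256, "temperature": 0.7},
)
print(resp.json()["content"])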

  1. First, select the appropriate GGUF (recommend Q4_K_M or Q8_0).
  2. Then, consider whether you want a specific finetune type (e.g., chat, coder, instruction).

If the model is base + adapter, ensure your framework supports adapters (e.g., llama.cpp > v0.2).
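
If you do go base + adapter, a minimal sketch with llama-cpp-python looks like this; the model and adapter paths are hypothetical, and the LoRA must already be converted into a format llama.cpp can load.

from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen3-8b-base-Q4_K_M.gguf",   # hypothetical path
    lora_path="adapters/my-coder-lora.gguf",         # hypothetical converted LoRA
    n_ctx=4096,
    n_gpu_layers=32,   # partial Metal offload
)

out = llm("Write a Python function that parses a JSON config file.\n", max_tokens=256)
print(out["choices"][0]["text"])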


Here’s what I ended up running:

  • Base: DeepSeek-R1-0528-Qwen3-8B-Q4_K_M
  • Frontend: OpenWebUI or custom Vite/React wrapper
  • Backend: llama.cpp compiled for Metal (M4 optimized)
  • Memory Budget: ~12GB RAM peak for Qwen3-8B
  • Custom Finetune: Alpaca-style SFT on coding prompts (still training…)
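
For the curious, “Alpaca-style” just means instruction/input/output records, usually stored as JSONL; here is a sketch of the format with a made-up coding example.

import json

# One made-up training record in the standard Alpaca schema.
record = {
    "instruction": "Write a Python function that checks whether a string is a valid IPv4 address.",
    "input": "",
    "output": (
        "def is_ipv4(addr: str) -> bool:\n"
        "    parts = addr.split('.')\n"
        "    return len(parts) == 4 and all(\n"
        "        p.isdigit() and 0 <= int(p) <= 255 for p in parts\n"
        "    )"
    ),
}

with open("coding_sft.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")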

Tips for Future Hackers

  1. If it says “chat” and acts dumb, it’s probably over-aligned. Post-train it.

  2. Look into abliteration techniques — distilled thought traces are more valuable than safety data.

  3. Don’t waste time with RLHF-ed-only models if you’re building a coder assistant.

  4. Use gguf formats for performance, especially on Apple Silicon. Compile with Metal backend.

  5. Learn to smell lobotomized models. They always avoid certain topics, answer vaguely, and fail to write full functions.

  6. Local deployment isn’t complicated; quantization and Ollama can handle it.

  7. Don’t let excessive alignment limit your model’s capabilities.

  8. If you’re looking for privacy, security, and customization, local LLM is the only solution.

Coming Next: Local LoRA Fine-Tuning and Gemma3n Image and Audio Tests

I want to see how small I can go and still retain quality coding assistance. Spoiler: Gemma3-2B already shows promise with some light training, but hallucination risk is high.

That’s it. From confusion to abliteration to building my private dev assistant, this journey taught me one thing: you don’t need OpenAI to build a good coder chatbot. You just need good data, good models, and the will to destroy and rebuild.

If you’re curious or want to try abliteration yourself, ping me. I’ll share my LoRA configs, ggml hacks, and more.

Running Large GGUF Models on Apple Silicon: A Real-World Guide for M4 Mac (16GB) Users

You’ve downloaded a massive 30-billion-parameter AI model — maybe Qwen3-Coder-30B-A3B-Instruct-UD-Q2_K_XL.gguf — excited to run it locally on your sleek M4 Mac with 16GB of unified memory.

You fire up llama-server, set --n-gpu-layers 999, and wait.

Then… crash.

ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Insufficient Memory (kIOGPUCommandBufferCallbackErrorOutOfMemory)

Or worse: silent hangs, SIGSTOP, LLDB showing 0xffffffffffffffff, and no way to kill the process.

This is not a failure of will — it’s a collision between ambition and reality.

In this article, we’ll walk through:

  • Why large GGUF models fail on 16GB Macs
  • The truth about quantization (spoiler: Q2_K ≠ small memory)
  • How to actually run models without crashing
  • And how to fix the infamous pos_min == -1 bug in llama.cpp

Let’s get real about local LLMs on Apple Silicon.


🔍 Part 1: The Myth of “Small” Quantized Models

You see a file:
Qwen3-Coder-30B-A3B-Instruct-UD-Q2_K_XL.gguf (10.97 GiB)

“Perfect!” you think. “It’s under 16GB. It should fit.”

But reality hits hard.

📊 Why an 11 GB File ≠ 11 GB of RAM

| Component | Memory Use |
|---|---|
| Model weights (Q2_K hybrid) | ~11 GB |
| KV cache (for context) | ~1–2 GB |
| Metal GPU working buffers | ~2–4 GB |
| Intermediate activations | ~1–2 GB |
| macOS & other apps | ~1–2 GB |
| Total peak usage | ~16–20 GB |

👉 Even with Unsloth UD v2.0 or Q2_K_XL, the runtime memory exceeds your 16GB unified RAM.

The M4 GPU doesn’t have “VRAM” — it shares memory with the CPU.
So when Metal runs out of space, everything stops.


💥 Part 2: The GPU Memory Crash

You see:

ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)

This isn’t a software bug — it’s physics.

The GPU tried to allocate buffers for computation and failed.

Even with --mmap (memory mapping), the active layers must be loaded into memory.

✅ Workaround: Reduce GPU Offloading

--n-gpu-layers 0     # CPU-only (slow but stable)
--n-gpu-layers 32    # Partial GPU offload (best balance)

Or switch to a 7B model — they’re faster, fit in memory, and often outperform bloated 30B models in practice.


🐛 Part 3: The pos_min == -1 Bug in llama.cpp

After surviving the memory battle, you hit a new enemy:

/Users/stanmac/Downloads/llm/llama.cpp/tools/server/server.cpp:3292: pos_min == -1, but n_past > 0 - should not happen

This is a known regression in recent llama.cpp builds (April 2025), triggered by:

  • Chat format mismatches
  • Slot reuse bugs
  • Broken KV cache state

The server crashes or freezes, and even kill -9 fails.

LLDB shows:

* thread #1, stop reason = signal SIGSTOP
frame #0: 0xffffffffffffffff

👉 The process is zombie-stuck — only a reboot can fully free the Metal GPU memory.


✅ Part 4: How to Actually Succeed

Here’s what works — tested on M4 Mac 16GB.

✅ 1. Use a Stable llama.cpp Version

Avoid the latest master. Use a known good commit:

git checkout 8f7c702c
make clean && make server -j

Or wait for the fix in:
👉 https://github.com/ggml-org/llama.cpp/pull/13833


✅ 2. Run This Command (Safe & Stable)

./build/bin/llama-server \
--model "Your model path" \
--port 10000 \
--n-gpu-layers 0 \
--ctx-size 4096 \
--batch-size 512 \
--threads 8 \
--temp 0.7 \
--repeat-penalty 1.1 \
--no-cache \
--log-disable

Key Flags:

  • --n-gpu-layers 0: Avoid GPU OOM
  • --ctx-size 4096: Limit context to save memory
  • --no-cache: Bypass broken KV cache logic
  • --log-disable: Reduce noise

✅ 3. Use /completion, Not /chat/completions

Avoid the buggy chat format parser.

Use:

curl http://localhost:10000/completion -d '{
"prompt": "def binary_search(arr, x):",
"n_predict": 128
}'

Not:

curl http://localhost:10000/v1/chat/completions -d '{ "messages": [...] }'
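
If you are calling the server from code rather than curl, the same safe /completion request in Python (assuming the requests library and the server flags above) looks like:

import requests

resp = requests.post(
    "http://localhost:10000/completion",
    json={"prompt": "def binary_search(arr, x):", "n_predict": 128},
    timeout=120,
)
print(resp.json()["content"])   # the generated continuation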

✅ 4. Kill Stuck Processes

When the server hangs:

pkill -9 -f llama-server

If that fails:

  • Open Activity Monitor → force quit
  • Or reboot (sadly, often necessary)

✅ 5. Better: Switch to 7B Models

| Model | Size | RAM Use | Speed | Quality |
|---|---|---|---|---|
| Qwen2.5-Coder-7B-Q5_K_S.gguf | ~6 GB | ~9 GB | ⚡⚡ | ✅✅✅ |
| DeepSeek-Coder-V2-Lite-Q6_K.gguf | ~5.2 GB | ~8 GB | ⚡⚡⚡ | ✅✅✅✅ |
| CodeLlama-7B-Instruct-Q5_K_M.gguf | ~5.8 GB | ~9 GB | ⚡⚡ | ✅✅ |

👉 These run fully on GPU, are faster, and don’t crash.


🏁 Conclusion: Be Realistic, Be Smart

✅ Do:

  • Use 7B–14B models for best experience
  • Stick to stable llama.cpp versions
  • Use --no-cache, --n-gpu-layers 0 for debugging
  • Prefer /completion over chat endpoints

❌ Don’t:

  • Assume Q2_K = small memory
  • Run 30B models on 16GB Macs
  • Ignore the pos_min == -1 bug — it’s real
  • Expect GPU to save you — it shares memory

Author: Stan ke
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If reproduced, please credit the source: Stan ke!