
Local LLMs, QWEN Models, Claude Code

Engineering has become absurdly fast.

I’ve been in software long enough to see the shift:

  • Years ago, speed was limited by typing and manual refactoring.
  • Copilot accelerated boilerplate, but architecture and reasoning were still manual.
  • Today, local LLMs and agents can scaffold and refactor entire apps in days, not weeks.

Here’s how I’ve been running things locally, why quantization matters, and how I combine speed and precision.

Using LM Studio, I run multiple models locally.

I’ve tested these models:

  1. QWEN3-Coder-Next 80B 4bit (MLX)
  2. QWEN3-Coder-Next 80B 8bit (MLX)
  3. QWEN3-Coder 30B 4bit (MLX)


The reason I picked QWEN3 is simple:
LM Studio shows them as the latest (recently updated) models with the “staff pick” flag.
Please don’t blame me, I’m still new to running local LLMs.

lm studio: models


Before working with Claude Code, you need to download the models in LM Studio and start its local server.

lm studio: download model

lm studio: load model
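
If you prefer the terminal, LM Studio also ships a small CLI called lms that covers the same steps. This is a rough sketch, assuming the lms CLI is installed; the model identifier is illustrative and subcommands may differ between versions, so check the LM Studio docs:

# Download a model, load it into memory, and start the local server
lms get qwen/qwen3-coder-30b     # download (identifier is illustrative)
lms load qwen/qwen3-coder-30b    # load the model into memory
lms server start                 # start the local API server (default port 1234)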


Setting Up Claude Code to Use LM Studio

Claude Code can query a local LM Studio server. I set it up using environment variables:

# Point Claude Code to local LM Studio
export ANTHROPIC_BASE_URL=http://localhost:1234

# LM Studio authentication token
export ANTHROPIC_AUTH_TOKEN=lmstudio
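
A quick way to confirm the server is reachable before launching Claude Code is to query LM Studio’s OpenAI-compatible endpoint (1234 is LM Studio’s default port):

# List the models the local server currently exposes
curl http://localhost:1234/v1/models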

I heavily use the direnv tool, which lets me keep env vars per project by placing an .envrc file at the root of the project.

So when I cd into the project, these env vars are applied automatically.

┌[num8er☮g8way1]-(~)
└> cd ~/Work/Something
direnv: loading ~/Work/Something/.envrc
direnv: export +ANTHROPIC_AUTH_TOKEN +ANTHROPIC_BASE_URL

┌[num8er☮g8way1]-(~/Work/Something)
└> cat .envrc
export ANTHROPIC_BASE_URL=http://localhost:1234
export ANTHROPIC_AUTH_TOKEN=lmstudio
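
One detail worth remembering: direnv won’t load a new or edited .envrc until you approve it once:

# Approve the .envrc so direnv loads it when you cd in
direnv allow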

Now let’s launch Claude Code with our model:

claude --model qwen/qwen3-coder-next

Reference: https://lmstudio.ai/docs/integrations/claude-code
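
If you don’t want to pass --model every time, Claude Code can also take a default model from the ANTHROPIC_MODEL environment variable (per its settings documentation), so the model choice can live in the same .envrc. Treat this as a sketch and verify against the current Claude Code docs:

# Optional: set the default model instead of passing the --model flag
export ANTHROPIC_MODEL=qwen/qwen3-coder-next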

As a result, you’ll see a screen like this:

claude code: qwen3

In the screenshot I’ve typed a prompt that makes the model read my project and report some information about it.

While Claude Code sends our prompt to LM Studio, we can see a screen like this, which confirms that we’ve successfully started using our local LLM.

lm studio: server screen

Nice!
We are good to go!

claude code: edit


Quantization: 4-bit vs 8-bit

4-bit models:

  • Faster token generation
  • Lower memory usage
  • Good for scaffolding, code exploration, iterative edits, drafting

8-bit 80B model:

  • Slower and heavier on RAM (see the rough math after this list)
  • More precise, generates cleaner changes
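
A rough back-of-envelope shows why the 4-bit version feels so much lighter. Counting weights only, and ignoring KV cache and runtime overhead, these are ballpark numbers:

80B parameters × 4 bits ≈ 80 × 10⁹ × 0.5 bytes ≈ 40 GB of weights
80B parameters × 8 bits ≈ 80 × 10⁹ × 1 byte ≈ 80 GB of weights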

But! (:
Even though the 80B 8-bit model is slower than 4-bit, it’s still much faster than opening an IDE, searching for the place to fix, and doing all that by hand.

So there’s always a WIN.

Let’s conclude!

Speed comes from the 4-bit models for scaffolding and iteration; precision comes from careful use of 8-bit or bigger models when editing high-risk code.

With the right stack, local deployment, and quantization strategy, building complex applications becomes faster, more private, and resilient — even without internet.

In Ukraine, internet and electricity outages are real.
Local inference isn’t optional — it’s resilience.
That’s why I decided to run local LLMs on my laptop.
I hope this post is helpful for other beginners (like me) in the LLM world.


P.S. The results I got made me ask ChatGPT to draw my relationship with AI.

me and robot


fin
