Running local models is good now

(vickiboykis.com)

880 points | by jfb 7 hours ago

95 comments

c0rruptbytes 6 hours ago
I don't know about good, I use a lot of local models and they're still pretty painful to run locally
You have dense models (qwen 27b, gemma 31b) who are pretty smart, but pretty slow
You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes
You need a lot of memory to run these well, quantization makes tool calling weaker, so most run at 4 bit quants and are wondering why it kinda sucks and that's because you've essentially lobotomized the model (I recommend unsloth quants, i recommend 6bit for MoEs and 5bit for dense)
So you need a lot of compute to make the pre-fill fast, you need bandwidth to make the decode fast, you need a lot of memory to hold everything - lot of ifs
On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.
So are they good? not really. Do they work? yes
edit: just wanna clarify - i think open models are the future, i think they're super important, i'm contributing constantly to the ecosystem - i think people should play around with these models, i think people should use `pi` and learn how it all works - but don't download a model expecting it to be good out of the box, you will have to tune and configure a lot of stuff to replace a "coding agent" that most people are using models for
[-]
- saghm 6 hours ago
  This is basically my experience as well. I have a moderately recent but high spec desktop (Radeon 6900 XT with 16 GB VRAM, Ryzen 9 7900X 12-core, 64 GB system RAM), and I tried out some recommended models with ollama a month or two ago. Anything not geared specifically towards coding seemed to struggled with actually making tool calls instead of just stating the actions they would take without making them (and trying to get help from them to explain what I needed to configure to change that behavior was useless; qwen refused to believe that it was running in ollama and insisted that it was running from the Alibaba cloud without access to my local system), and the models intended for coding were barely thinking faster than I could type (if they had any ability to show thinking at all).
  The best "free" experience I've found is using OpenCode with Big Pickle. It's not especially smart, so it often won't produce the correct result the first time, but the free tier is generous enough that I don't think I've hit the limit more than twice over around a month with frequent multi-hour sessions. If running locally is truly the goal, it's not going to fit the bill, but if the goal is just "get the best experience without having to pay for a sub or tokens", it's the least bad option I've found so far.
  [-]
  - spockz 4 hours ago
    For what it is worth, I’m on a similar machine. (9070XT,5900X) and found a lot of performance improvement over ollama by compiling llama.cpp and running with —no-mmap and —perf. The context is still quite small though. With online models I use contexts of at least 200k which is useful for longer running/more complicated commands.
    Locally I haven’t gone much further than 8k. That is sufficient for small changes on small code bases. And you need condensed tool output.
    I haven’t tried any tool that compresses the tokens yet.
    [-]
    - echelon 3 hours ago
      I would rather we give up the idea of running open models on RTX cards and instead focus on running much bigger open models on H200s.
      1. The hardware will eventually catch up.
      2. This keeps the delta between frontier models smaller.
      3. We can still fine tune and own the weights.
      4. The models will be more useful, faster, and reliable.
      RTX is hobbyist tier, not professional tier.
      Gated cloud models from hyperscalers treat us like hobbyists in their own right.
      We need equivalent scale models, but open.
      [-]
      - dofm 15 minutes ago
        Pressure on small model quality and design is absolutely what is needed. There are still gains to be made.
      - zozbot234 2 hours ago
        H200s and other enterprise datacenter GPUs are completely overkill in any realistic single- or few-users inference scenario. They're hugely unbalanced towards compute capacity which will go almost entirely unused (i.e. wasted) unless you're running huge batches on a continued basis. I've argued many times that local inference engines should support batched inference on a somewhat smaller scale for a variety of reasons (especially given the unexpected effectiveness of SSD streamed inference with larger-than-RAM models), but even I don't think we can realistically go to 300x or so for real-time inference, which is the range that pencils out quite consistently from a simple roofline model of these datacenter cards.
        [-]
        echelon 2 hours ago
        If you're doing professional work in coding or video, you can easily saturate a single H200.
        This is what RunPod-type services are for.
        For instance, ComfyUI is an abomination that can't do half of what Nano Banana and Seedance 2.0 can do. And you have to sit around and wait 10x longer for single results.
        I can rent an H200 for $3.50 an hour. That's INSANELY cheap.
        I do not understand this split between hosted APIs and rinky-dink local RTX models. Both suck.
        The ideal solution is models we own run on RunPods leveraging H200s.
        I can spend $100-200/day on compute making much more value with the model outputs.
        ----
        edit: I want to respond to comments, but the damned HN rate limits keep me to five comments a day now because I'm a contrarian and say things that rile up the anti-AI folks.
        You don't need to buy an H200. It's a depreciating asset. You rent one. It's cheap to rent.
        [-]
        spockz 1 hour ago
        Sure, to approach frontier model quality locally we need to have more power. And H200s are a way to get there.
        However, we need to use the tools that we have. Even if I wanted to buy a (bunch of) H200 for me and my colleagues and could get the expense approved, they are hard to source where we are.
        Yes. You can rent them, but I’m not sure how that affects the IP discussion.
        Moreover, not everyone is doing coding and video so we have different tasks that can fit quite well on relatively light laptops (Gemma et al), for relatively directed coding sessions we can make do with RTX cards, or a small step up, all the way to H200 in the workstation. Or pods thereof.
        We have the graphics cards and laptops with MLX right now. The H200 will take a year at least to arrive. Better get used to run stuff locally.
        zozbot234 1 hour ago
        I'll definitely believe that for video generation models, but those are also very compute-intensive for rather middling results.
      - SR2Z 3 hours ago
        That GPU costs 25k which means you really should have a rack to put it in. It's not realistic.
      - MrLeap 3 hours ago
        There's a lot more professionals that have RTX cards than H200s. You're inevitably see more development and experimentation on things actual humans have lmao.
  - rapind 4 hours ago
    > The best "free" experience I've found is using OpenCode with Big Pickle.
    I have absolutely zero interest in free. I honestly don't think I'm even remotely in the same demographic as people using free tiers / models.
    I want to pay. I don't want my data used for training. I want it to be open. I want it to be consistently up (more than Claude!). I want it to be fast. I don't want it to be subsidized as that's just an excuse for shitty quality. Deepseek flash knocks it out of the park on all of these except you're data is used in training. I'm fine with it being hosted since there's no way I'm using it 24/7, but data MUST be private.
    Basically I want Hetzner and OVH to run open model clouds. I'm convinced this is going to happen eventually when everyone realizes this is a commodity.
    [-]
    - aamoscodes 4 hours ago
      You can pay, and also use deepseek-v4-flash. OpenRouter even lets you "block" or limit your usage to providers that don't train on data. Since the weights are open, other companies are already serving the model on non-DeepSeek owned hardware: https://openrouter.ai/deepseek/deepseek-v4-flash
      [-]
      - rapind 4 hours ago
        Good to know. I hadn't checks since early is DS4's launch when they were the only provide (I think maybe there was one other, but they also trained on your data). I see several private options now.
    - saghm 3 hours ago
      I'm probably somewhat adjacent to you. I would be happy to pay, but I just don't want to pay any of the companies that are actually offering things right now. I had the $20/month sub for Claude for a couple months, until one day I kept inexplicably getting errors saying I hit the limit even though their site showed my usage at less than half for the session and 8% for the week, and it seemed silly to pay for something that couldn't even properly respect its own measurements. OpenAI sketches me out too much as a company, Cursor feels lackluster when I use it for work from the account they pay for (and now is getting acquired by maybe the only AI company even sketchier than OpenAI), and I wasn't particularly impressed with Gemini or Mistral Vibe either when I tried them on the free tiers either.
      [-]
      - rapind 3 hours ago
        I was paying around $500 / month on average between multiple providers for over a year. I cancelled one a while ago because of pretty bad service availability (Bet you guess who that is!), which by all reports hasn't improved much.
        For me, paying from $200 - $500 / month is reasonable if I can sustain a disruption free flow that doesn't require constant yak shaving. What I've found experimenting with DeepSeek on some open source library stuff is that it's actually going to cost me much less if I don't need frontier vibing (which I don't).
        [-]
        gaolei8888 2 hours ago
        who?
    - rlkf 1 hour ago
      > Basically I want Hetzner and OVH to run open model clouds
      You can run Qwen3 on OVH already:
      <https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalo...>
      [-]
      - johndough 55 minutes ago
        I see that OVH offers Qwen3.5-397B-A17B, which is a bit surprising to me. I thought that EU providers had to comply with the AI act where you have to provide opt-out and information about the training data once the model is sufficiently large (over 10^23 FLOPs, likely the case here), but providing information is not possible since people who train those models only give vague information at best.
        Does anyone know if OVH is ignoring the law here, or whether it does not apply for some reason?
        [-]
        dofm 14 minutes ago
        Which law is that?
        Not doubting you — just want to read it!
        [-]
        johndough 1 minute ago
        Article 53 of the AI Act: https://ai-act-law.eu/article/53/
        The definition of a "genral-purpose AI model" is described in more detail in the "Guidelines on the scope of obligations for providers of general-purpose AI models under the AI Act": https://ec.europa.eu/newsroom/dae/redirection/document/11834...
    - darkmarmot 4 hours ago
      Hard to guarantee it's private if you don't keep it local... I don't have a lot of trust for companies in this space.
      [-]
      - rapind 4 hours ago
        Yes, but I think that'll change eventually. If you trust hosting your code with a specific cloud provider then you'll probably also trust them for code assist. At least that's my theory.
        There'll probably need to be a threat of massive litigation should they fail to comply with such a policy.
        [-]
        pessimizer 4 hours ago
        > If you trust hosting your code with a specific cloud provider then you'll probably also trust them for code assist.
        I'm interested in this thought. There is significant motivation for providers to create a verifiable way for them not to deal with having access to client interactions with LLMs at all. Whatever standards and protocols have to be come up with in order to reassure clients.
        Any good standards for privacy when interacting with LLMs could also trickle down to smaller providers, and everyone could offer guarantees. Even if the guarantee was literally just an insurance policy and a private court to decide if it pays out.
    - Bnjoroge 4 hours ago
      You can specify which providers you want to serve your model in OpenRouter. Then you can chose US-based ones.
    - bel8 4 hours ago
      These competent open models you want to use were trained on data from people like you and me.
      I wonder if there are competent models trained purely on permissive open-source code like MIT or Apache 2.0.
      [-]
      - yencabulator 4 hours ago
        MIT and Apache 2.0 both require attribution, so it's not like limiting to those would help in license compliance.
  - redmalang 4 hours ago
    Try llama.cpp it seems to be a lot more performant and a lot more hackable. Also I'm surprised how substantial the impact of some of the inference configs (beyond just temp) can have, though this is much more model specific.
  - ryukoposting 4 hours ago
    I found that, with the heavily quantized Qwen3 models I can cram onto my 3060 Ti, telling the model to use its tools in the system prompt made it a lot more likely to actually do it. YMMV of course, but give it a shot.
    [-]
    - saghm 1 hour ago
      I did try this, and it was pretty hit-or-miss still. I even went as far as configuring context for Zed to inject into all conversations saying stuff like "If you need to read a file, call read_file NOW. Do not say you will read it", and it still didn't really make a huge difference.
- aftbit 5 hours ago
  IMO running local models "well" still requires an expensive hardware investment. You really want 96GB of VRAM on a modern Blackwell arch to run these models with decent KV cache. Trying to run them on a unified memory Mac, an AI Max AMD processor, or a DGX Spark-alike is really just asking for trouble. Prefill kills perf.
  If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.
  Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.
  [-]
  - girvo 15 minutes ago
    > DGX Spark-alike is really just asking for trouble. Prefill kills perf.
    You're right that prefill kills perf, but shrug the GB10 has far more compute than it has memory bandwidth, so prefill isn't it's bottleneck.
  - ryan_glass 3 hours ago
    For a fraction of the price of 96GB vram, I built a desktop based on a supermicro server mobo and EPYC 9 series CPU, with just under 400GB rdimm ram (approx $4500 all in but this was before the ram price hike). Works really well for serving larger local modals at a decent enough speed (I consider anything more than 10 tokens/second usable and value accuracy over speed).
  - dofm 4 hours ago
    FWIW I think it might be both.
    Ultimately if you skip over the opportunity to play with these models on your own machine you are losing out on a lot of really interesting educational opportunities — it helps make a lot of stuff feel more concrete in a way that only tinkering can.
    But then I think once I had an idea of something that I was building against Gemma 4 or Qwen 3.6 I would be looking at openrouter etc., to stabilise it for the next tier of experimentation (and to get back a kind of multi-device access without tailscale/lm link etc.).
    Are they good enough to replace what people seem to want to do with Claude? Maybe not. But it's an unparalleled learning opportunity.
  - EagnaIonat 4 hours ago
    Depends what you need the model to do. The recent granite4.1:3b just takes 2GB of memory and is fast. Results are pretty good and support tool calling. Barely a squeak out of the Mac laptop.
    Even faster with the MLX builds.
    Then when I need more heavy lifting I fire up a larger model.
    IMHO the issue isn't the models. I've had OpenClaw give the same results as Claude using open models locally. Slower but does the job. Something that can do optimal model switching is what's needed.
    [-]
    - aftbit 2 hours ago
      [dead]
  - wincy 4 hours ago
    If I could just save up $6000 I could sell off my RTX 5090 for $4,000 and buy an RTX 6000 Blackwell Pro Workstation. I can fit models into the 32GB of vram but my context window ends up being tiny for any halfway capable model.
    [-]
    - layer8 3 hours ago
      Isn’t the RTX 6000 Blackwell Pro Workstation over $13000 now?
      [-]
      - girvo 14 minutes ago
        And rising. It's depressing.
  - jtbaker 4 hours ago
    > Trying to run them on a unified memory Mac
    > but still not quite in the realm of Sonnet or DeepSeek 4 Flash
    these are not mutually exclusive anymore. DS4 has set the bar for me these days. https://github.com/antirez/ds4
    [-]
    - trueno 3 hours ago
      someone just put this on my radar yesterday, im about to try this today. how's your experience with it?
      me thinks there's a lot of optimization strats we're currently leaving on the table just because the amount of things to explore and test are so expansive. but this one is super interesting targeting metal primarily and zeroing in on one model. instead of a one size fits all llama.cpp im very interested to see if theres a future where super tailor-made variants per model pans out to harnesses that can rapidly switch ultimately providing something akin to sonnet/early opus territory (that's my personal bench mark of good-enough i shall now cancel the hell out of this claude sub)
      [-]
      - jtbaker 2 hours ago
        I'm on the verge of cancelling my anthropic $20 plan since it's come out. On an M5 Max 128GB, hooked up to the pi.dev harness, I get in the neighborhood of 400-450tps prefill and 30-35tps generation. It is imminently usable and at times feels more stable than my previous CC setup. Occasionally there are things it struggles with that I will bounce back over to CC for, but it is highly usable. The future is bright for local models! As a tinkerer, it makes me really happy to have a local setup I can be just as productive in, and not have the token overlords ready to shut me down at any time.
        [-]
        aftbit 2 hours ago
        That's DS4 Flash right? How does it feel in intelligence and speed compared to DS4 Flash hosted by Deepseek themselves or another API provider? I've been using API DS4 Flash for a lot of personal projects and have been quite impressed. I've spent $1 on building ~10 toy projects and gotten them all to work within the bounds of what I wanted without having to do much besides guide the model away from dumb loops.
        [-]
        jtbaker 2 hours ago
        I'm using the DS4 flash IQ2 2-bit quant, per Salvadore's recommendations for my hardware in the repo. I haven't messed with the cloud hosted variant. The only other paid API I have messed with is a $20 Anthropic sub, primarily with whatever the latest version of Sonnet is. For the most part, this local configuration feels on par with that.
        With this configuration (set up over the last month) I have been working on Python data processing tools, an internal Svelte 5/SvelteKit data intensive BI app, and some smaller Rust projects. It's been doing really well there.
  - monksy 2 hours ago
    That RTX6000Pro you mentioned is $12k.
    [-]
    - aftbit 2 hours ago
      Yep - I'd say either that or 4x 5090 is a great entry point to running local models "well". Two of them would be even better. If you don't have $12-24k to spend, you can try your hand with tiny models or quants or slow speeds, but it will be a much more painful experience. You're already giving up a lot by dropping down from frontier models - you're giving up even more by trying to squeeze them into little RAM and compute.
      Prices will fall in the next few years. Maybe just play with the tiny toy models for now to learn how they work, then keep using API providers until they do.
  - eek2121 5 hours ago
    Not really, Qwen 27b offloads to a decent gaming GPU (RTX 4090 in my case) without needing tons of RAM.
    [-]
    - mathisfun123 5 hours ago
      can you give more info? llama.cpp vs vllm? config? i wanna try specifically this model
- zozbot234 6 hours ago
  Maybe we shouldn't be running these models on laptops with their thermally constrained form factor, and we shouldn't expect quick inference on a par with a large cloud-based platform either, at least not for near-SOTA model quality. It's still worth it to avoid becoming massively reliant on centralized services.
  [-]
  - stemlord 1 hour ago
    > It's still worth it to avoid becoming massively reliant on centralized services.
    This isn't really good enough. Many of us need to get things done in a pinch and if our employers are already getting used to the idea of paying for enterprise subscriptions to cloud llm's then the local option needs to be good
  - greenavocado 6 hours ago
    I have a 5070 12 GB laptop GPU and can hit 72 tokens per second in the first couple thousand tokens before dropping to mid-high 50s after about 15k context.
    This setup is extremely optimized down to the last flag. Changing any param above the temp flag craters performance.
    I don't have enough system RAM to properly handle the large context windows so I don't use local models.
```
  # 1,257 tokens 17s 72.18 t/s

  $env:CUDA_DEVICE_SCHEDULE = "SPIN"
  cd D:\src\llama.cpp\
  .\build\bin\Release\llama-server.exe `
    --port 8080 `
    --host 127.0.0.1 `
    -m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
    -fitt 2048 `
    -c 98304 `
    -n 32768 `
    -fa on `
    -np 1 `
    --kv-unified `
    -ctk q8_0 `
    -ctv q8_0 `
    -ctkd q8_0 `
    -ctvd q8_0 `
    -ctxcp 64 `
    --mlock `
    --no-warmup `
    --spec-type draft-mtp `
    --spec-draft-n-max 2 `
    --spec-draft-p-min 0.1 `
    --chat-template-kwargs '{\"preserve_thinking\": true}' `
    --temp 0.6 `
    --top-p 0.95 `
    --top-k 20 `
    --min-p 0.0 `
    --presence-penalty 0.0 `
    --repeat-penalty 1.0
```
    [-]
    - themanualstates 5 hours ago
      That’s useless without describing WHY you chose those flags, and how you did the optimisation…
      [-]
      - halJordan 4 hours ago
        The switches are all in the -h of llama.cpp (although the maintainers have a tendency to use the word in its definition). The actual values are essentially just what alibaba recommends. So you just need their model card. I would not call it highly optimized, more appropriately tuned.
        [-]
        greenavocado 3 hours ago
        I found every possible flag and its description including CUDA related environment variables and went back and iterated with Claude Opus 4.8 High until every single flag mattered above the temp one.
    - nateb2022 5 hours ago
      I get over 100 tok/s sustained on my M4 Max and M5 Max, in MacBook Pro's. LM Studio + MLX.
      [-]
      - Terretta 4 hours ago
        With Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf?
        Also, funny lumping the M4 "and" the M5, I find them 15% to 45% different performance, depending.
        And for a good deal of work, an M3 Studio Ultra outpaces the M4 and ties the M5 on single work at a time, outpaces both doing multiple work at a time.
    - ridiculous_leke 4 hours ago
      Can you comment on the quality and accuracy of it? People have managed to run Gemma 26b without GPU on old CPUs but I don't think quality is anywhere close to what Gemma 12b offers.
    - mattmanser 5 hours ago
      That's a quant 4 which the thread OP specifically called out as rubbish.
      The Q4_K_XL bit for those not in the know.
      [-]
      - stymaar 4 hours ago
        Anyone calling Qwen3.6-35B-A3B-Q4_K_XL “rubish” has no idea what they are talking about.
        [-]
        embedding-shape 4 hours ago
        I'd agree that the quality degrades a lot between Q8 and Q4, borderline unusable as they start to fail with tool calling syntax even. Personally I'd say Q8 is as low as you want to go.
        c0rruptbytes 4 hours ago
        q4 isn't rubbish, but it's a compromise for a good value, q6 is essentially a no-compromise quantization and it's what i recommend for MoEs in my experience for agentic workflows
        greenavocado 4 hours ago
        He's probably calling me out for this comment https://news.ycombinator.com/item?id=48557579
      - greenavocado 4 hours ago
        I typically find myself using a context of between 150-500k with GPT models so local models are simply not enough and I stopped using them.
        [-]
        stymaar 4 hours ago
        That's way higher than their optimal ceiling (and absolutely suboptimal from a token cost point of view), why are you doing that?
        [-]
        greenavocado 4 hours ago
        You're 100% right and its even severe than that: I daily drive on xhigh. I really try to avoid it, but when reconciling APIs across two large codebases you really start pressing north of 200k. I find myself topping out at 800k sometimes and that's with careful context management. I actually had to drop to GPT 5.4 for 1M context in my subscription because GPT 5.5 tops out at 272k. Hitting 800k context is better than repeatedly hitting let's say 200k out of 272k with multiple rounds of compaction. I run Can's snapcompact and while its better than normal compaction it still lobotomizes the model more than running with a very high context window.
        c0rruptbytes 4 hours ago
        large contexts degrade the performance - attention doesn't work will for large windows like that and cloud models are kind of hacking it
        local models do involve some context engineering to get it okay, but it's not that rough
- adam_arthur 6 hours ago
  Gemma 4 is particularly good at pipeline/automation tasks.
  It outperforms all the Qwen models (even 100B+) for rule following/automation style tasks in my experience. Its image interpretation is also very good, and out-benchmarks Opus.
  Qwen seems to ignore instructions and consistently outputs incorrect formats (when token generation format is not explicitly constrained)
  But yes, on the DGX Spark Gemma 31B Q4 with MTP runs around 20 tok/s and Gemma 26B A4B around 60 tok/s. Still quite slow. But on a high end Nvidia card would run significantly faster and still fit in memory.
  I'd recommend for anyone getting into local models to focus on memory bandwidth over RAM. Models under 100B parameters are now sufficient and hugely useful for automation.
  I agree that for coding/creation use cases, there's still not a compelling argument for local models.
  But e.g. if you want to scan a list of stocks and interpret news/high pass filtering, interpreting logs, interpreting screenshots, the local models are more than sufficient already.
  [-]
  - dstryr 5 hours ago
    This is not my experience at all. Even the Nous Research guys have stated that "Qwen3.6-27B is the canonical local model to use Hermes Agent with" [https://old.reddit.com/r/LocalLLaMA/comments/1sz2y76/ama_wit...]. I am finding the same when used with Pi and OpenCode.
    Gemma will just stop mid-tool call. It's been slower and I've had to reduce context size to run it. Qwen3.6 27b has been rock solid using club 3090's single card setup for agentic use -- https://github.com/noonghunna/club-3090/blob/master/docs/SIN...
    [-]
    - adam_arthur 5 hours ago
      I'm talking about automation generally, not agent loops.
      E.g. prompt A to achieve X, output in format Y. Use Y to do something in prompt B.
      Agentic loops will underperform deterministic control flow pipelines (with non-determinism constrained to LLM calls).
      Agents are more general, which is the main advantage. But inherently a more general solution will waste context on unnecessary reasoning.
      Try asking the smaller Qwen models to output a JSON in a specific format. It basically can't do it consistently with a moderately sized prompt unless you constrain the token generation via GGML or are extremely repetitive and specific about it. (Thinking disabled)
      Gemma 4 will do it correctly pretty much 100% of the time. (Thinking disabled)
      Applies to other rule following as well in my experience.
      Qwen may be better at toolcalling and certainly probably codegen.
      It seems to me Google explicitly designed Gemma for edge device automation, and didn't fine tune for agentic or coding use cases.
  - trouve_search 5 hours ago
    On a 5090, gemma4 26B runs at 350TPS with the command below [1] and gemma4 31B is around 150TPS with a similar command.
    I'm really surprised how much slower a DGX spark is for the same price.
    1. Here's my command.
    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \ vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \ --dtype auto \ --gpu-memory-utilization 0.95 \ --kv-cache-dtype fp8 \ --enable-chunked-prefill \ --enable-prefix-caching \ --trust-remote-code \ --enable-auto-tool-choice \ --tool-call-parser gemma4 \ --reasoning-parser gemma4 \ --max-num-batched 16000 \ --max-model-len 64000 \ --max-num-seqs 12 --speculative-config '{"model": "./gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 4}'
    [-]
    - adam_arthur 5 hours ago
      Yes, I'd recommend a 5090 over the DGX Spark if your goal is general automation.
      You can run multiple instances of these models in parallel on the DGX Spark which somewhat mitigates the difference if your task is parallelizable.
      But I'd take the simplicity of a single thread and higher throughput personally.
      Overall of course still better to wait for next gen devices if you can.
  - ozim 3 hours ago
    I was expecting DGX Spark to run Gemma 31b Q4 much faster.
    I was expecting it would run Q8 in 50 tok/s.
    I guess that’s good I stopped thinking about buying it because I would be disappointed.
    [-]
    - girvo 9 minutes ago
      I love my Spark-alike, but they really aren't inference boxes IMO. They're experimentation boxes. A couple of 3080 20GB's for cheap from China, a 5090, an RTX Pro 6000 if you can swing the horrible cost: those are better choices IMO
      That said, I'm still running Step 3.7 Flash at ~40tk/s decode, 1000tk/s+ prefill on mine and its both very capable and fast enough
      I got Gemma 31b to run on this at ~22tk/s decode at FP8 using MTP
  - gopher_space 5 hours ago
    In my mind it’s a question of knowing what you want to build and how to divide the project into tasks your local setup can handle.
    If you don’t need the machine to respond instantly (or explain your own business model to you) everything can be local and it’s been like that for a few years now.
  - msp26 4 hours ago
    Yep agreed completely. I couldn't imagine torturing myself with a small model for local coding. But Gemma 4 31B is so fucking good for a variety of language modelling tasks.
- freehorse 3 hours ago
  > You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes
  This is sadly also my experience. I wish we had some MoE models with a higher ratio of active parameters per total. My experience is that the newer MoE models that can run in a 64b laptop have too few active parameters to be useful outside narrower, specific tasks. Mixtral 8x7b was a 14b active parameter (56b total) MoE model a few years ago and was probably the best model one could run in that range for some time, but it is too old now.
  I have been using the qwen 27b and it is great, but running a dense model like this in a macbook is a bit suboptimal, and i wish I could run sth faster than 15 tok/s.
  [-]
  - c0rruptbytes 3 hours ago
    I would try a 6-bit MoE and maybe with unsloth's studio, they claim to have auto tool fixing which is where i see a lot of issues with MoEs
    I'm on a 48gb M5 Pro right now and it's been okay, a lot of my rough experiences have been with MLX and I'm finding that GGUFs are okay now
- beadw 25 minutes ago
  I think you’re spot on. In my experience people confuse a models ability to solve some benchmark as a sign of its usefulness. Token throughput is often just as important from my personal usage. I am excited for more diffusion models to see how progress happens there.
  [-]
  - peterlk 19 minutes ago
    Yes to diffusion models! Combo pipelines of generative and diffusion models have super interesting potential
- hnlmorg 3 hours ago
  To be honest even the cloud models are a hot mess at times. This week I’ve spent more time rejected code from OpenAI models than I have approving it.
  In fact it really feels like OpenAI models have taken a nose dive this week compared with Claude. At least for my specific workloads (these things are so variable it’s like trying to compare Google results…)
- Stagnant 2 hours ago
  I've been using unsloth/gemma-4-31B-it-qat-GGUF daily for various small parsing and programming tasks using opencode and llama-server's front end. The past couple of weeks have made a big difference after google released the QAT variant and llama.cpp got support for MTP which means it is possible to now get 60-80 Tok/s with RTX 4090. The model fits in VRAM comfortably enough to keep it loaded even while browsing and having multiple programs.
  [-]
  - amdivia 2 hours ago
    110+ Tok/s as another data point on the RTX 5090 (Gemma 4 31B QAT + MTP at UD-Q4_K_XL) (at peak used 27 GB of vram)
    The real lovely thing was getting 300+ Tok/s (Gemma 4 26B QAT + MTP at UD-Q4_K_XL) (at peak, I think I saw vram usage reach 21 GB of vram)
  - lmedinas 1 hour ago
    the problem of that setup is that it will run out of context pretty quick. So for coding agent it will limit your workflow very fast.
- smcleod 1 hour ago
  Those dense models are pretty fast with MTP now. 40-70TK/s depending on your machine, that's faster than cloud models (although not as smart obviously).
- heipei 6 hours ago
  Depends on what you mean by "local". On your Macbook, large dense models like Qwen 3.6 27B will be slow, sure. On a local workstation with a dedicated RTX card you can get > 100 tps, which is more than good enough to work with it, and faster than cloud models in many cases.
  [-]
  - jstanley 6 hours ago
    But how smart is it? All the people running local models never seem to mention that they are way dumber than cloud models.
    I don't care how many tokens per second of nonsense it can generate.
    [-]
    - throwawayffffas 5 hours ago
      Qwen 3.6 35b a3b is about as good as sonnet 4.5. It varies but it's at that level.
    - notnullorvoid 5 hours ago
      Quantized Gemma 4 26B is as smart or better than GPT 5 in most of my testing. Granted GPT 5 is nearly a year old at this point, but I can run Gemma 4 on a ~6 year old consumer GPU (RTX 3090) and get 140 t/s.
    - heipei 6 hours ago
      It is smart enough that I use for all my coding tasks, and a lot of other mundane tasks.
      It is probably not smart enough for "design this whole architecture of this complex system from scratch, make no mistakes", but that is not something I want from a coding tool anyway. I want a model that I can point to a file and tell it to make some changes to the file and related files. Or that I can ask to review a PR with regards to certain aspects.
      My suggestion is to simply try it and see what it feels like.
    - lelanthran 3 hours ago
      > But how smart is it? All the people running local models never seem to mention that they are way dumber than cloud models.
      Well, you aren't going to give it a 20k line sec and have it churn out a full app after 4 hours hours.
      But, you can get it to write code for you if you do the design.
    - myaccountonhn 6 hours ago
      Its not going to be as good as Claude, but if you know what you're doing, it may be good enough to get your work done.
      [-]
      - data-ottawa 6 hours ago
        This is task dependent.
        I find devstral (even though it’s weak generally) much better at writing and documentation than Opus. I’m actually now delegating all documentation to devstral and away from Claude, which makes a mess.
      - garciasn 5 hours ago
        A highly skilled carpenter may be able to 'get work done' by banging nails in with a heavy-bottomed cocktail glass, doesn't mean it's not painful to do so when it is continuously breaking and leaving shards of glass all over the workshop for you to find every day for the rest of your life until you clean up the mess you made using the wrong tool for the job.
        [-]
        sgt101 3 hours ago
        If someone comes into the workshop and takes all the tools (hello Donald) then having a cocktail glass to hand might be a bit of a lucky break.
        (geddit?)
        CamperBob2 5 hours ago
        More like, a highly-skilled carpenter can work miracles with a $6 hammer from the hardware store, while the pros on the commercial crew are using fancy compressed-air tools.
        The carpenter has to get up close and personal with the wood. He can't match the crew's throughput, but maybe that's not what he's trying to do.
  - c0rruptbytes 5 hours ago
    I'm talking about the common use case that I think hacker news people have:
    you get a macbook for work, you run the macbook
    they're not going to start giving GPUs to employees to run local models
- ridiculous_leke 4 hours ago
  A median laptop is no bueno for running a reliable model(which will be qwen 27b as per my reading here and r/localllama). Powerful macs would be prevalent in certain areas of the world but in rest of the world personal machines aren't always that powerful.
- FuriouslyAdrift 4 hours ago
  Kimi 2.6 or 2.8 is what we are playing with locally. They need 512GB to 1TB to run with full capabilities so that's not exactly "desktop"
  Our GPU computer server cost $110k.
- robomartin 2 hours ago
  > On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.
  Laptop?
  OK, I've made that mistake before. I understand modern laptops are powerful, but nobody wanting to do serious AI/ML work should be using a laptop for anything other than SSH or similar low-performance access into a proper system.
  Years ago I fried two laptops just doing finite element analysis work running 18+ hours per day. It was one of those "I'm giving you all she's got, Captain!" workloads. They fried, even with powerful fans cooling them. I should have known better. Such workloads belong on purpose built systems.
- atomicnumber3 4 hours ago
  I largely don't disagree with you but come to a different conclusion. I have two systems:
  1) a "programming desktop" with a $500 upper mid range Ryzen (idr exact), 8GB VRAM Radeon card I bought solely for RuneScape, and 64GB ram
  2) a maxed out Alienware 16 Area51, so it's a 5090 with 24GB vram and 64GB system ram. I bought it for gaming, of course.
  I run qwen 3.6 35B A3B Q6 with 200k context window. I compare this to Claude pro max or whatever that I use at work.
  The main difference between the machines is that the one with the RuneScape gpu does 10 TPS while the Alienware does 30-40tps. Both are fine though the 30-40tps is obviously a lot snappier.
  I find with both models that:
  - they do really well at "be a 30GB zip file of reddit and stackoverflow answers"
  - they do really well at point fixing random bullshit errors that would otherwise waste my time (this is related to above of course)
  - they do quite well at, given a pretty good specification of what you want, figuring it out, even if you've specified several steps needed
  - they both cannot really be given a large ish task and left to just drive it on their own
  The main difference between the two is with that last one, Claude is somewhat better and figuring SOMETHING out, but if Claude is having to figure it out, it's probably because I don't know what I want and it's very likely to not make a sane choice, and will generally produce slop given even the slightest amount of leash still.
  I've also found that the boundary between "well specified small to medium thing" and "idk just do thing and figure it out" is the difference between you keeping control of the code and losing control. There's an "escape velocity" of AI use that, when you hit it, you're doomed to slop forever. (Or you have to deorbit... enjoy that). And while claude might have slightly higher velocity allowed while remaining suborbital, it's very diminishing returns.
  So, are these models "worse" than Claude? Yeah. Am I looking forward to continued improvements? Yeah. But I now also have no desire to pay anthropic any amount of money, which has the nice side effect that i won't be helping them end up with so much money that they can distort our democracy.
- everdrive 5 hours ago
  What counts as a lot of memory? What could someone do with 16 GB of RAM?
  [-]
  - throwawayffffas 4 hours ago
    Not much, the capable models won't fit unless you go with very low quantization but that leads to a lot of loss.
    You generally want to run q8 or some kind of "6bit" quantization at least.
    40GB of VRAM is the entry-point in my experience, you can run qwen 3.6 35b a3b with full context or qwen 27b with about 92k of context.
    Before you get fully discouraged, you don't need 1 gpu with 40GBs you can use multiple cards, with minimum impact on performance.
  - zozbot234 5 hours ago
    Modern inference engines can stream in weights from SSD in order to save on RAM, but this makes inference very slow, especially for the trivial single-session case. (Jury is still out on whether batching multiple sessions together can mitigate this well enough, but even then that's mostly helpful for the "running lots of inferences overnight and getting fresh results first thing in the morning" case. Which is interesting (the big third-party suppliers don't really offer a way of doing this at reasonable cost) but a bit of a niche.)
  - abalashov 5 hours ago
    Not a ton. I'd say 64 GB minimal to play, 96-128 GB better.
    [-]
    - throwawayffffas 4 hours ago
      Nah, you can run the 24b - 35b class with between 90k and 256k of context with about 40GB and they are pretty good. Especially the MOE variants fit neatly in 40GB.
      [-]
      - abalashov 1 hour ago
        Yeah, but then you need RAM for the rest of your OS and applications. I'd say 64 to be comfortable in the sense to which most HN users are accustomed.
  - ValdikSS 5 hours ago
    Gemma e2b, Gemma e4b. It's made for smartphones basically. You can run e2b with 8GB RAM.
  - trouve_search 5 hours ago
    gemma 12B 4bit quant; try something with MTP and an AWQ quant
  - monegator 5 hours ago
    gemma runs pretty well
- greenavocado 6 hours ago
  4 bit unsloth quants are good if you never ask for more than 20k context, use it as autocomplete on steroids, and never delegate serious questions to it
- iwontberude 6 hours ago
  They are good if you were clever enough to buy a powerful enough rig before memory went up. For everyone else I say just wait. M1 Ultra 128GB and higher is sufficient to run gemma4:31b-mlx or qwen3.6:35b-mlx with subagents. It’s only slow if you don’t know how to plan your work effectively.
- dominotw 5 hours ago
  maybe painful if you are using it like a chatbot. you are sitting there waiting for response. vs ambient ai like automatically classifying your family pics and discarding random things like parking floor number pic.
  i use it usecases like that latter and they are fine.
- citizenpaul 3 hours ago
  They are still terrible at tool usage which loses 99% of the effectiveness of the agent. I've had to concede and use paid frontier models that can use tools or its not worth using agents....copy...paste....copy....paste....
  [-]
  - iwontberude 7 minutes ago
    Your models aren’t big enough and they are forgetting about the tools. Try a larger model. If you can’t, then your rig was too underpowered anyways.
angry_octet 13 minutes ago
Programmers are used to paying nothing for tools. A basic laptop (SSD, multi core, 16GB of RAM) is hugely powerful if you are building in C/C++/Rust, even python. But all of a sudden it's no good, and we're back to using someone else's computer, hiring our tools every day. Worse, we get a different model every day, and maybe we aren't allowed to borrow the good tools some days because some mafioso are shaking down the manufacturer.
Most other trades need to invest significantly in tools. If you want good tooling, you really want 64GB of GPU memory (e.g. 2x 5090) and 96GB of RAM. If I'm paying $200k for an expert engineer then $50k every other year for tooling seems pretty reasonable.
hypfer 7 hours ago
After having been a happy user of Qwen3.6-27B for a few weeks, due to being away from the hardware, I'm currently forced to use Claude Sonnet 4.6
It is such a downgrade. I don't understand how that's even possible. The thing has so many strongly-held opinions I did not ever ask it for, talking just way too much and generally feeling somehow dumber.
Of course, being significantly larger, it will encode more knowledge, but that doesn't help me when I hate talking to it. And all that on top of the fact that talking with it costs real money.
I wonder what it might be that makes me hate it so much. Maybe because it doesn't see itself as a tool but almost an equal? As if its opinions would have weight.
Qwen too can act like an overeager intern, but if you tell it that it is an idiot, it will drop that ego. Not so much with Claude. In my experience, anyway.
Anyway, point is: full ack on that headline.
[-]
- ggerganov 6 hours ago
  I haven't spent a dime on cloud inference, so cannot make a direct comparison like you. But I can 100% attest to the fact that Qwen3.6-27B is a very capable local model for coding tasks. Over the last month and a half I've been using it almost daily, either on my M2 Ultra or on my RTX 5090 box. I use it for small mundane tasks at ggml-org [0] - nothing really impressive, but definitely a helpful tool for a maintainer. I think I would be using it much more, if I didn't have to spend a lot of my time on reviewing PRs. Currently, I have a very lightweight harness - the pi agent with everything stripped (`pi -nc --offline`) and a short system prompt [1] to align it a bit with my style. About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac. I definitely prefer running it on the RTX machine - it's so much faster. But for the sake of testing and getting wider experience with local configurations, I often run it on the Mac too.
  [0] - https://github.com/search?q=%22Assisted-by%22+user%3Aggml-or...
  [1] - https://github.com/ggml-org/llama.cpp/blob/master/.pi/gg/SYS...
  [-]
  - girvo 5 minutes ago
    > Currently, I have a very lightweight harness - the pi agent with everything stripped (`pi -nc --offline`) and a short system prompt [1] to align it a bit with my style
    This really is the secret to getting the most out of these models IMO. Pi is so damned good. I have a strongly tuned Pi for running Step 3.7 Flash (IQ4_XS) and Qwen 3.6 27B (FP8)
    Also, thank you for llama.cpp mate :)
  - trilogic 6 hours ago
    I also confirm that local inference is on par with proprietary cloud services (with a bit of local setup, simple agents.md and some utils skills). This local models come with tools, that's mind blowing, considering that some months ago we had to .md tools ourselves. What makes a model worth even more is "Memory". We implemented that long ago. Last time I used proprietary services was 3 months ago, don´t really need it, my subscription is going blank.
    Gerganov, hope you will consider developing further the CLI cause we suffering with the server.
    [-]
    - jayGlow 4 hours ago
      what are you using for memory with your local models? is there a specific harness you would recommend for local agents?
      [-]
      - trilogic 3 hours ago
        I use HugstonOne (that backend a personalized version of llama.cpp). Implemented it´s own double layer memory that recall the full or partial previous session/file with an ON/OFF switch (which picks up where left off in CLI or Server or both same time) and another that reads back a % of current tab if memory switch is off doing checkpoints every certain tokens, summarizing and referring back to it when needed (recalled by certain logics). There is more to it when involving local RAG (making it tripple memory layer) but thats a long story.
        About the harness depends on for what you need it, but basically for a universal unit of measure, Harness is multilayered and logic and domain specific dependent. I would definitely include Type of Hardware, Model parameters/knowledge, Model Intelligence, Model size/context, type of conversion, type and quantization (models comes with some default tools), but adding your (domain specific), skills, tools, memory, logs, security, Rag, Online search... (which as scary as they sound are mostly simple logics in a txt file, like if this do that).
        The full pack is Harness 10, every missing thing lower the harness score.
        To answer to your question I would definitely recommend smth like HugstonOne (or anyway llama.cpp CLI) with Qwen 3.6 35B finetuned/distill (deepseek 4 or claude 4.7) with none of the current coding agents out there that are screaming internet connection and proprietary API and data collection. DO this, if you can find a tool that you can download and choose a local model (of your choice in whatever folder locally) and load it ready for inference without any need of internet connection that is the tool you should aim for. Right now there is none out there.
      - mft_ 2 hours ago
        I’m using Hermes at the moment - it comes with lots of tools already baked in for the agent to use - for example web and browser access just worked, rather than having to mess around loads with config scripts and plugins.
        I’ve also tried OpenCode (similar but a bit less so) and Pi (fast but you have to add lots of features yourself which is a bit of a pain). Claude Code can also be pointed at a local model and works, but the default system prompt is huge. (~140k of text when I extracted mine, IIRC.)
  - kpw94 5 hours ago
    > About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac
    Curious if you can share the prefill speed too?
    I run locally on a crappy desktop (some AMD iGPU with Vulkan llama.cpp, 32 GB DDR4 RAM) for experimentation. I get 15 tok/s on generation for the qwen & gemma4 MoE models. I get around 150 tok/s prefill speed.
    Reason I'm asking about the prefill is looking at my stats at work, I use between 20M to peaks of 300M input tokens daily. Some of those token are cached but in general, I seem to have roughly 500x more input tokens than output. So interested in prefill tok/s stats.
    Huge Thank you for llama.cpp btw!!
    [-]
    - ggerganov 5 hours ago
      Here are the prefill speeds:
      Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32109 MiB | model | size | params | backend | fa | test | t/s | | ------------------------------ | ---------: | ---------: | -------- | --: | --------------: | -------------------: | | qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d512 | 3714.02 ± 10.85 | | qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d1024 | 3684.86 ± 15.21 | | qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d2048 | 3650.80 ± 8.53 | | qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d8192 | 3473.88 ± 0.97 | | qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d32768 | 2754.69 ± 4.07 | ggml_metal_device_init: GPU name: MTL0 (Apple M2 Ultra) | model | size | params | backend | fa | test | t/s | | ------------------------------ | ---------: | ---------: | -------- | -: | --------------: | -------------------: | | qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d512 | 379.75 ± 0.21 | | qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d1024 | 377.15 ± 0.35 | | qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d2048 | 371.46 ± 0.91 | | qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d8192 | 344.84 ± 0.41 | | qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d32768 | 222.42 ± 5.29 |
      Btw, based on your numbers, I think our use cases are quite different. I use the agent for very targeted sessions - basically things that are clear to me how to do, just want to automate them. My workflow is usually: new session -> read this, this and this -> do that. I.e. I don't let it wander at all in the codebase, so I rarely exceed the context window.
      Also, I get a lot of mileage from the ngram-based speculative decoding functionality [0] as it allows me to iterate on the implementation much faster.
      [0] https://github.com/ggml-org/llama.cpp/pull/19164
      [-]
      - kpw94 5 hours ago
        Thanks! Super helpful.
        I do use it the same way as you're describing on personal projects at home, in a very crude manner (pasting code snippets in llama server web UI prompt. Next will attempt OpenCode)
        At work I use it in similar manner with more mature tools, but the vast majority of token spend comes from a totally different workflow: "pretend the AI is a fleet of junior/intern engineer you're delegating work to", where the agent will on its own do the implementation, commit the changes etc.
        It does indeed spend a lot of tokens wandering the codebase, talking to MCPs, loading skills etc.
  - celrod 6 hours ago
    What quant do you run it at? 32GB seems like cutting it close on the rtx 5090 if going 8b, but other commenters are saying 4b lobotomizes the model.
    [-]
    - ggerganov 5 hours ago
      As a baseline, I run all models in Q8 [0] because I want to be confident that when I observe a problem, the root cause is not due to the quantization. However, in this specific case, I use Q8 on the mac and Q4 on the RTX machine because the latter does not fit the full context at Q8. So far, I don't have conclusive evidence that the Q4 quantization affects the quality in a significant way for this model and the tasks that I am using it for.
      [0] https://huggingface.co/ggerganov/presets/blob/main/preset.in...
      [-]
      - girvo 3 minutes ago
        27B seems surprisingly resiliant to quantisation. Though my evals showed there was some impact to coding ability from 8 bit to 4 bit, it was less than I would've expected: and it was on task types that you've said above that you don't really do with these!
  - toddmorey 4 hours ago
    For the curious, it looks like a PC with a RTX 5090 32GB graphics card will run you about $6,000.
  - fridder 5 hours ago
    Not too shabby. I like the regular Qwen but prompt prefill on my m1max is slow as hell
- StevenWaterman 6 hours ago
  Yep, I daily drive Qwen3.6-27B (including for work), have done pretty much since it came out. IMO it's the only (small-ish, local) model worth using, if you can run it. It might not be as good as Opus at "add X large feature" but I don't want that in a model. I want to do the thinking while it does the typing. And Qwen 3.6 27B is perfectly good at that (while in my experience models like the 35A3B and gemma are significant downgrades)
  Plus, I never have to worry about rate limits, quotas, or sitting in a queue during peak time. And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.
  Running on 2x 3090, 500-1000tok/s prefill and 60tok/s output at Q6_K_XL with MTP on llama.cpp, 220k tokens context window (starts to get a bit dumb above 160k ish), no KV quantization
  [-]
  - indoordin0saur 6 hours ago
    > And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.
    For this reason I wonder if local models are a potential business opportunity. Provide the service to engineering teams to give them a pre-built and setup GPU rig they can run in a closet. No need to worry about all the things you mentioned and clients can rest-assured their data isn't disappearing into a sketchy data center. There might be regulatory reasons that make on-prem setups appealing as well.
    [-]
    - amoshebb 6 hours ago
      This is, as far as I know, the business model of coys like mistral and cohere
    - suncemoje 6 hours ago
      On-premise (1960-2010) -> Cloud (2010-2026) -> On-premise (2026+)?
      [-]
      - indoordin0saur 6 hours ago
        I think that's overstated, but the loss of trust companies have with the big AI players is pretty serious. Not a big deal if your app is for sharing cat videos, but if you're medical or wealth management or a government contractor or the like enterprise clients really like to see good data security policies.
        [-]
        lelanthran 3 hours ago
        > Not a big deal if your app is for sharing cat videos, but if you're medical or wealth management or a government contractor or the like enterprise clients really like to see good data security policies.
        If this mattered to them, they wouldn't be running so much in the cloud or in proprietary software that they have no ability to air-gap.
        If companies ever cared about this, Windows would not be dominant on the desktop.
        [-]
        indoordin0saur 2 hours ago
        There are a lot of government jobs I know of that are absolutely air-gapped. Your computer has basically no internet access, everything is stored on-prem. Hedge funds also tend to be extremely locked down, from what I saw when I interviewed. With certain data sets either having strict encryption-in-transit or a being stored in a quirky on-prem service. I can't imagine they're going to be dumping their data into Claude, etc.
        As to why Windows is so dominant, I'm as clueless as you.
        suncemoje 6 hours ago
        Agree. I also wonder how zero e.g., Claude Enterprise ZDR really is, and what their data pipeline actually looks like.
    - cyanydeez 6 hours ago
      I think the next step to anyone but overbloated USA models is to follow https://chatjimmy.ai/ with one of the qwen models. If they can mass produce something at relative cost, these would be awesome sidecars.
  - giancarlostoro 6 hours ago
    > (starts to get a bit dumb above 160k ish)
    If open models can ever hold roughly 600k token windows, I'll be really excited, I found that around 300 ~ 400k of Claude reading through your codebase results in better outputs. I also have Claude read official docs instead of "guessing" as to how to do something.
    [-]
    - StevenWaterman 6 hours ago
      I think we'll get there. Right now it works for me, because I'm naturally pretty verbose in my prompts, and know the codebase well, so I know what it needs to look at. Plus subagents for anything exploratory.
      I think deepseek v4 pro has 1m context and does pretty well up to around 600k. But if you have the hardware to run that locally, you already know
      Even then if there's a smaller model with 1M context, you'll need a ton of RAM to actually run it at full 1M. I guess that's why you don't see it too much. Anyone that could run Qwen 3.6 27B with 1m context would be better off running a much bigger model with smaller context instead, in the same amount of VRAM.
      In terms of optimizing further, huge context + KV quantization sounds like a terrible idea, but there's some decent innovation in sparse attention, KV cache rotation allowing Q8 to perform nearly as well as full 16-bit precision, plus some ideas around offloading KV cache to system RAM (but I'm skeptical)
      [-]
      - zozbot234 6 hours ago
        DeepSeek V4 (both Flash and Pro) has very good scaling of context length wrt. RAM use, so this is not an inherent limit of LLMs in general.
    - 0xc133 5 hours ago
      With yarn and rope scaling arguments for llama.cpp you could run qwen3.6-27B with 1M context… if you have enough memory to store it.
    - cyanydeez 6 hours ago
      I don't really think you're making reasonable decisions at that size; but I suppose if you're not allowed to refactor it, maybe.
      I think the way these models work excludes sane behaviors the larger the context gets as each token introduces potential ambiguities between "USER" and "SYSTEM" messages leading to all the catastrophic behaviors.
      Anyway, with AMD395+ I'm finding ~100k is both speed and context usefulness unless it's scoped tightly. with opencode, I manage it with dynamic context pruning: https://github.com/Opencode-DCP/opencode-dynamic-context-pru... ; then anything I touch ends up being refactored so context doesn't get bloated with unecessary functions, etc.
      Obviously, this isn't compatible with certain business codebases, so I can see why bloat meets bloat.
  - hughw 6 hours ago
    Just this morning I tweaked my single 3090 setup too:
```
  OLLAMA_FLASH_ATTENTION=1
  OLLAMA_KV_CACHE_TYPE=q8_0
  OLLAMA_CONTEXT_LENGTH=180000
```
    and that fits in 23GB.
    [edited for format]
  - iamtheworstdev 5 hours ago
    are you running an NVLink? I have the same setup but no NVLink and it feels like it's best just splitting the 3090s to run separate models concurrently. But I also have no idea what I'm doing.
    [-]
    - fluoridation 2 hours ago
      It depends on what you're comparing. If the same model fits on the combined VRAM but not on a single contiguous VRAM, then it won't be faster to run two instances of it. If you're comparing a 23 GB model running duplicated vs a 46 GB model running split, then yeah, that will likely be faster, just because there's no synchronization between cards.
      AFAIUI, there'd be little advantage in having a higher speed inter-card connection, because the cards don't really talk to each other during inference. The loss of efficiency compared to a monolithic memory architecture comes from scheduling, not from data transfer.
  - QuantumNoodle 6 hours ago
    Do you have any resources on hardware necessary for running models and tweaks? I see you mention 2x 3090 and I wanted to do more search on what hardware is satisfactory for what models.
  - Andrex 4 hours ago
    How long have you been using it?
- epistasis 6 hours ago
  > talking just way too much
  OMG this is such an annoying property, just shut the hell up please, and be concise.
  I suspect that this is an artifact of the thinking property, but please just summarize the thinking process far more concisely, where a single sentence answer is more than sufficient the frontier models seem devoted to going on to a minimum of 5 paragraphs and offering 3-5 new directions.
  And requests to please only offer a single step at once, or single option at once, or to even stop eagerly offering future directions is really hard to prompt correctly.
  And look, there I did exactly what I was complaining about...
  [-]
  - bityard 6 hours ago
    I'm not sure to what degree you can influence how a model thinks, but you can definitely hide the thinking tokens and tell the model how you want it to talk to you.
    For example, the Claude web UI has an Instructions field where I have told it never to congratulate or praise me for asking questions. Earlier Copilot models used a ridiculous number of emoji and bullet lists when answering literally every prompt, I told it to knock that off and prefer detailed paragraphs in prose.
    Local agents/frameworks/whatever all have their equivalents for overall user preferences.
    [-]
    - epistasis 5 hours ago
      Thanks for the reminder! For others looking for this setting, it is currently under User Menu (click your account name in the lower left), then "Settings", then the "General" tab there's an "Instructions for Claude" box.
      Asking Claude for this provides incorrect instructions for me, so I'm guessing it moves around a lot.
  - illegalsmile 6 hours ago
    That's why you have to give claude and others directives/.md at the beginning so it doesn't go off the deep end with suggestions.
    [-]
    - epistasis 6 hours ago
      Yeah, I've tried, and I'm sure somebody is going to say "skill issue" but it's not so easy to get the model to do that. Maybe it should be a SKILLS.md issue.
      Edit: also, how can I stop the LLM from all this fake glazing, as if every question I have is some sort of unique genius insight, it's so damn annoying. I just got the third straight round of this while merely trying to get summarization of a PDF:
      > Good question — it gets right at a real tension in the paper. Let me check the current state of actual SV-imputation efforts, since this has moved since 2020.
      [-]
      - bornfreddy 5 hours ago
        I didn't try telling to be concise and stop pampering me yet (but good idea, tomorrow), however I found that instead of me writing agent instructions, it works much better if I tell claude to write instructions for itself. I do check if they make sense of course, but its wording works much better than mine.
  - frereubu 2 hours ago
    [dead]
- derethanhausen 6 hours ago
  I would not generalize based on experiences with Sonnet. The flagship models (Opus being the claude equivalent) are dramatically better.
  [-]
  - hypfer 6 hours ago
    Opus in my experience is equally unpleasant "character"-wise, but at least it actually gets stuff done more often, so it's at least slightly more earned at that. It's still a neurotic cargo-culting dogmatic idiot, but one that at least sometimes does produce deliverables instead of only bottom-tier HN-esque opinions.
    Hmm. I think I might just fundamentally disagree with Anthropic about the idea of what a "tool" should be.
    [-]
    - KaoruAoiShiho 3 hours ago
      Fable largely fixed the annoying chatterness so sucks that it's gone now.
- kitd 7 hours ago
  Funny that coding agents have personalities, including "that colleague" you want to avoid even if you know they're probably quite good at what they do!
  [-]
  - otabdeveloper4 1 hour ago
    That's exactly what RLHF is for.
    (In fact, "that colleague" might have even been the source of the RLHF training set.)
- radium3d 6 hours ago
  If you think about it, they're splitting the power across millions of users. Essentially, these AI companies have YOUR hardware that YOU are paying (them) for in a cabinet at some data center. This means the hardware could easily be run locally for inference for these 'big' models. It's just a problem of dynamics-- RAM is being bought in bulk by these companies through these B200 style cards, instead of sold slowly through the open public markets.
  This is likely due to a combination of mass funding for the AI companies, but also they are trying to governmentally restrict which countries get access to these cards so certain countries get a head start. The only way to lock that down is to have them literally locked in their own GPU prisons (data centers). Third reason is it does make it possible to train the models faster by having them in the same data center connected directly. Having them distributed to everyone would slow down training considerably.
  The current way to 'own' decent RAM and GPUs right now is through the stock market it seems.
- giancarlostoro 6 hours ago
  There's a model on Huggingface where someone takes Qwen and makes it think Opus style, and that one seems to be decent, not sure if they have the 27B variant in that style. I do wonder if you can tweak your system prompt to force Qwen to behave better?
  [-]
  - StevenWaterman 6 hours ago
    You read the OP backwards, they said Sonnet is a downgrade from Qwen, and prefer Qwen's tone
    [-]
    - giancarlostoro 6 hours ago
      Sure, but my argument still holds, the idea is that Qwen reasons the way that Opus on High (what is now Max or whatever?) level thinking to reason about problems instead of its standard approach.
  - whythismatters 6 hours ago
    Yes, Qwopus :) I've been pleasantly surprised by its quality
    [-]
    - giancarlostoro 6 hours ago
      Seen that one too, same guy I'm thinking of too, havent had a chance to try all of their models. For anyone curious I believe the username is Jackrong on huggingface? They've got several models out on there each focused on programming from different approaches.
- MostlyStable 6 hours ago
  Curious if you have tried custom instructions. I was never quite as unhappy with Claude's voice as you appear to be, but there were several things I didn't like. A custom prompt fixed almost all of them.
  [-]
  - clickety_clack 6 hours ago
    I think it would be very hard to convince someone to pay $100/mo to go back to Claude if they have a local model up and running, particularly now that model improvement has basically been stalled for the last 6 months. It’s so easy to set it up for yourself now too with things like LM studio. That said, there will always be unsophisticated users who can’t figure it out, so there will always be someone there to pay.
    [-]
    - MostlyStable 6 hours ago
      The person I was replying to specifically said that the Claude will "encode more knowledge" and that their problem was that they didn't like talking to Claude. It sounds like they think that Claude is at least slightly more functional. And the "not liking talking to it" is probably fixable. Someone for whom a local model works, and for whom the economics make sense, should absolutely run a local model and I wouldn't try to convince them otherwise. I'm sure it's the right choice for a lot of people. But not liking the personality of Claude is probably not a great reason on its own, given the minuscule amount of effort it takes to fix.
    - Scoundreller 6 hours ago
      The third category are the occasional users that won’t have the hardware and won’t stomach a monthly fee for “unlimited” but are happy to pay-per-use.
      I’d think the volume for that category would be low but LLMs aren’t just for coding.
      [-]
      - dghlsakjg 6 hours ago
        I’m probably the third category. I like experimenting and trying different models and techniques. I want api access for my own apps and Claude subscriptions don’t have that.
        Sure I could splash out a ton of money for a high ram Mac, but deepseek is so dirt cheap that I think depreciation on a high end machine costs more than my api spend.
        Example of what I’m using it for: building a semantic database of podcast content (podcast discoverability sucks on an episode level). I need a cheap LLM, an embedder, a transcriber, none of which Claude will do.
        My api costs for coding agents plus running apps are about ~$20/month, but I get more than just chat + Claude code.
        If all I was doing was pumping an employers codebase through a coding agent, Claude would be the answer.
    - chrisweekly 6 hours ago
      Not everyone has the right hardware.
      [-]
      - clickety_clack 6 hours ago
        I guess I’m thinking of the $100/mo users, for whom it’s probably possible to get the right hardware.
- andix 4 hours ago
  Sonnet is extremely overpriced. It's a good model, but not worth the money Anthropic charges for it.
- dackdel 6 hours ago
  what kind of hardware do you need in order to run qwen3.6-27b
  [-]
  - giancarlostoro 6 hours ago
    Depends on which variant you pull down, but a single 5090 GPU (I know these are insanely expensive, but for context) could run either the Q8 or Q4_K_M version. It will not fit the 52GB version (BF16) on the other hand. So any modern Mac with a Pro or better processor and more than 52GB of RAM (don't forget VRAM for context window also matters!) would suffice, as someone else noted, probably a 128GB model would do the trick, and give you enough wiggle room to max out the context window.
    My Mac only has 16GB of VRAM (20GB total - 8 is reserved for the OS) so I have to leave room for VRAM, I usually find a model that fits in 5 to 7 GB of VRAM and then max the context window as much as I can.
    [-]
    - daemonologist 2 hours ago
      The benefit of running the full precision version is negligible (probably not even measurable above the benchmark noise floor). Most common for cost-conscious users is to run something around 4-6 bits per weight, which would fit on a 24 or 32 GB card (as you mentioned).
    - pixelesque 4 hours ago
      Note you can change the amount of shared (V)RAM reserved for the OS with:
      sudo sysctl iogpu.wired_limit_mb=18800
      will allow you to use more, but you do need to leave a bit for the OS obviously!
      [-]
      - giancarlostoro 4 hours ago
        Oh man! I had no idea I could do this at all! What do you usually tweak it to? I feel like 8 GB is probably still a reasonable amount to give the rest of the OS.
        [-]
        pixelesque 3 hours ago
        I've got a 32 GB MBPro, and I set it to 27700, which I haven't seen a problem with so far.
  - sbmthakur 6 hours ago
    I could run it on 7900 XT with 64k context. You could run it more comfortably on a 24 gb vram.
  - iagooar 6 hours ago
    I recommend MacBook M5 Max with 128 GB of RAM to run it comfortably and fast. If you have something like a regular M4, go with qwen3.6-35b-a3d - the Mixture of Expert architecture makes it run 2-3x faster than the 27b version.
- indoordin0saur 6 hours ago
  Very curious what hardware you're running this on!
  [-]
  - hypfer 6 hours ago
    The same 24GB VRAM RTX 4090 I bought to play Cyberpunk 2077 with.
    Works perfectly fine in llama.cpp throwing 70+t/s at me with 128k q8 K/V context when using the IQ4_NL quant + MTP at q4 MTP K/V.
    Also leaving this here because you might find it useful: https://hypfer.github.io/will-it-fit-llama-cpp/
    [-]
    - Rzor 3 minutes ago
      [delayed]
    - indoordin0saur 6 hours ago
      Nice! Do you do anything with that compute when you're not actively using it? Is the crypto-mining hobby still worth it? I've also wondered if such expensive hardware can be rented back out to offset cost. Looks like these cards are going for as much as $4k nowadays.
      [-]
      - all2 6 hours ago
        There are services where you can hook your card up and rent it out to other users. I don't know what any of them are called, but they do exist.
        [-]
        dghlsakjg 6 hours ago
        Salad.com is one. (I’m unaffiliated, just happened to come across it this week while looking for a cheap option)
      - hypfer 6 hours ago
        I've paid ~2k€ in 2023. Since I'm usually sitting next to it, I'm only using it when I want to use it. It can get quite loud and warm.
        Crypto (to my knowledge at least) moved away from GPU mining. I guess you could maybe rent out GPU compute, but - being in germany - it's not worth the legal hassle. You could of course always commit tax fraud, though I wouldn't recommend that.
      - esseph 6 hours ago
        > I've also wondered if such expensive hardware can be rented back out to offset cost.
        Massive legal liability. Not worth it.
    - cdelsolar 6 hours ago
      What did you call me?
- zerd 5 hours ago
  I noticed Fable was quite a bit terser, and I think it's due to changes in the system prompt [0]. They're literally saying "just give me the TLDR" and "give brief updates". You can tweak a lot of that with an AGENTS.md.
  [0] https://twelvetables.blog/comparing-claude-fable-5s-system-p...
- chrisweekly 6 hours ago
  Why Sonnet 4.6 not Opus?
- dyauspitr 4 hours ago
  Why would I want some half assed coding assist tool. I want something that takes in a requirement and spits out a finished product. It’s not your equal, it’s better than you.
- ltononro 6 hours ago
  Well but comparing with sonnet 4.6 instead of opus 4.6,.7 or .8 doesnt make a real point I mean, pay 200 USD/month (if you have that cash, or your company has it), might not justify using local at all (unless you have some reason to suspect about data leakage)
- calebm 5 hours ago
  sync/ack
- cmrdporcupine 4 hours ago
  The Anthropic models have always been annoying this way -- chatty/opinionated and Dunning-Krugerish. And love to run away and do things unprompted with me jamming my ESC ESC ESC key over and over so I can get a word in edgewise.
  FWIW Codex/GPT models are way less this way. Maybe to a fault.
  I'm setting up my DGX Spark to try Qwen 3.6 27B again, as I'm hearing a lot of good reviews. When I tried it some time ago it was still early for support in llama.cpp.
rmunn 7 hours ago
This is the kind of thing that Anthropic et al should be worried about. As it becomes easier and easier to run local models, the ceiling of what they'll be able to charge will get lower and lower. Not that nobody will be willing to pay $$$$$ per month, but a lot of people are going to multiply the per-month charge by 12 or 24 and say "Could I set up a local model for less than that, and have it pay for itself within a year or two?" And if a significant portion of customers decide to buy instead of rent, the companies whose business model is entirely centered around renting will suddenly find themselves hurting for customers.
[-]
- sathackr 6 hours ago
  The opposite of that has been happening for 20 years now with cloud compute.
  It won't happen with AI models either.
  It's almost ingrained in the American business model now. Outsource everything. Nobody wants to manage a room full of servers when they can spend 2-3x as much and outsource that headache along with the responsibility for it.
  Same will happen with AI. Whether that means paying Anthropic that premium or paying AWS.
  I'm in a relatively small business, we recently had an outage related to our local infrastructure.
  I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single of the larger recent AWS outages.
  Everyone wants to shuck the chore and the responsibility.
  [-]
  - preommr 6 hours ago
    > The opposite of that has been happening for 20 years now with cloud compute. It won't happen with AI models either.
    AI is different.
    Cloud computing genuinely is cheaper on average. It's better than paying for cisco servers, and at scale, it's cheaper than managed platforms (ala Heroku), and it's a coin toss for when you're in the middle ground and constantly approaching the point of rebuilding poor-man versions of existing products but with very very expensive engineering salaries.
    In contrast, local models offer dramatic savings, and are magnitude of orders better in certain aspects: like stability - the performance is all over the place with traditional AI companies as they divert compute to their next big thing.
    The benefits to maintaining your own infrastructure are pretty moderate to low, with very high risk.
    And also, alternate models are pretty easy to use and easy to swap out unlike the vendor lock-in that exists with cloud services.
    [-]
    - codethief 4 hours ago
      > AI is different.
      I agree. The other thing here is that, once you can run LLMs on a single piece of commodity hardware (whether that includes one GPU or several), the difference between cloud vs. on-premise LLMs will largely be about where your hardware is located. There will be very little software configuration involved (just an HTTP endpoint that talks to the GPU). This is decidedly different from cloud products where the moat of hyperscalers is largely in the software and services on top of the hardware, not the hardware itself. (Sure, GPUs will eventually break & need replacement, too, but there's no state to lose, so that's already orders of magnitude easier than replacing hard drives.)
    - richardwhiuk 4 hours ago
      There's no economic reason why running a model locally should be better than using a cloud hosted version.
      [-]
      - moregrist 55 minutes ago
        “There is no reason anyone would want a computer in their home." - Ken Olson, Founder of Digital Equipment Corporation, in 1977
      - spockz 4 hours ago
        Sure there is. Keeping your IP in house.
    - 15155 3 hours ago
      > Cloud computing genuinely is cheaper on average.
      For some applications, sure. Availability is a large part of what one is paying for with cloud computing, but it's also something that not every business needs.
      If you sacrifice availability and have a pure-compute use case (low durability requirements), on-prem can quickly end up cheaper for far better hardware.
  - TkTech 6 hours ago
    For many companies (country-dependent) that's not really why they use cloud services vs purchasing. It's tax shenanigans and business process overhead. OpEx vs CapEx, and a small (%) bump in the huge AWS bill no one will even notice or a $30k+ invoice for hardware that has to go through rigorous review and 3 departments.
    Same reason people pay for things through the AWS marketplace (like Vanta) instead of having to go through their invoicing process.
    [-]
    - codethief 4 hours ago
      Good point. Maybe there'll be companies that maintain your on-premise GPU cluster just like there are companies that service the coffee machine in your office?
      [-]
      - otabdeveloper4 1 hour ago
        > on-premise GPU cluster
        Renting a GPU server from a cloud and hosting your own llama.cpp is the path of least resistance.
  - dreambuffer 6 hours ago
    It's just not comparable though is it? You need cloud services because it's physically impossible to use your single home computer as a server, CDN, load balancer, mass storage, security service, and distributed system.
    But AI is just weights, you can run a reasonably intelligent model at home, or on a few GPUs if you're a small-medium sized company, and it doesn't require dedicated maintenance.
    [-]
    - pessimizer 3 hours ago
      If you're a medium-large company, you should definitely run your own AI because you can max out the CPUs more often. You're not only able to run privately and locally, but you're also able to run efficiently.
  - cheema33 6 hours ago
    > I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single of the larger recent AWS outages.
    Same here. My job as a software dev does not require me to self-host services we need and use. Quite the opposite. But, I am reluctant to hand over all control to AWS or equivalent for several reasons that I will get into here.
    I have found that Infrastructure as Code (IaC) and modern tools like opentofu, ansible, combined with frontier AI models and harnesses gives you superpowers in this space. Almost all of our self-hosted services are fully managed by these tools. e.g. We perform backups and test them more often now than we ever did before. Entirely because it is so much easier to do all of that now.
  - Terr_ 4 hours ago
    IMO local-vs-cloud may be a misleading dichotomy, versus:
```
    1. Individual dev machines
    2. Shared local server
    3. Shared server in corporate cloud
    4. Third-party LLM SaaS provider
```
    Even if you don't want your laptop melting, there are still some important differences between 3 and 4 in terms of data privacy and security.
  - derfurth 6 hours ago
    That's an interesting take, however there is no ongoing maintenance related to local models, maybe the only effort is giving more capable machines to the workforce; but yeah I can see how it might feel like a barrier.
    [-]
    - sathackr 6 hours ago
      The hardware, the power systems, the cooling systems. They need maintenance.
      The OS needs updates, file systems get corrupted.
      Fans get dirty.
      All the things that you need to deal with in hosting your own server infrastructure you have to deal with when hosting your own AI infrastructure (which runs on servers...)
      [-]
      - ajb 5 hours ago
        However, you can get many of the benefits of a "local model" by outsourcing all the hardware maintenance but still using an open model. Guaranteed repeatability for one.
        A lot of the reason people outsource normal software is its brittle security properties, not sure that even applies to an LLM - it can go and look up the latest security best practices just like an engineer can.
  - otabdeveloper4 1 hour ago
    > in the American business model
    AI company valuations won't survive if they're only for the "American business model".
  - davidw 6 hours ago
    Still though, perhaps the existence of low-margin, generic, cloud LLM's puts some downward pressure on the 'brand name' companies?
  - CamperBob2 5 hours ago
    outsource that headache along with the responsibility for it
    You know what gives me headaches? When I'm in the middle of a session and the model gets rug-pulled out from under me because somebody at the model provider didn't pay the Trump bill that month.
    Or when someone at the model provider decides that the curve-fitting algorithm in my graphics package looks a little too much like Skynet for comfort.
    Or when they do any number of other things to undermine my work for the sake of their business model, some of which I won't even notice until the damage is done.
    The sad thing is, if you know how inference works, you know that it really is insanely wasteful for everybody to run it locally. If anything naturally belongs in the cloud, it's inference. But at the same time, what choice are we being given?
- starshadowx2 24 minutes ago
  Earlier I was thinking it's maybe comparable to paying for Netflix vs torrenting and running Plex or something. For the majority of normal, mainstream users I feel like most would just pay for the thing that is already setup and ready for them. There'll still be all the more techy or determined types who will do it themselves, I just wonder what the percentages of both groups will be.
- indoordin0saur 6 hours ago
  I'm curious when coding-heavy companies will start running their own on-prem AI clusters. Has anyone had the idea to sell something like 4 GPU machine an engineering team could throw in a closet somewhere and run whatever they want on it? I imagine this won't appeal to everybody but with the trust issues the hyperscalers have developed hoovering up people's data and using it to train their models, I imagine some will find value in a machine and model they have transparent control over including the option to walk over and unplug the thing.
  [-]
  - CamperBob2 5 hours ago
    Has anyone had the idea to sell something like 4 GPU machine an engineering team could throw in a closet somewhere and run whatever they want on it?
    I think that's basically Geohot's business model at Tiny Corp.
- wuliwong 6 hours ago
  These local models can do some of the work the non-frontier models can do but for me, that's not worth much. If I am just using Sonnet 4.6, I can pretty much work all day on the $20/month plan. And Sonnet is still a way more powerful model than a one you could self host on an M2 mac.
  If things change to token usage billing for everyone, maybe I'll be singing a different tune but on a subscription, I don't think it makes sense financially.
  Fun? Yes. Financially sound? No.
- storus 6 hours ago
  They are working hard on you not being able to run a thing locally. OpenAI buys all RAM on the spot market, causing the rise of RAM/VRAM prices 6x, making GPUs and decent computers unreachable for the majority of the population. OK, some richer folks might be able to get a 512GB MacStudio or a single RTX Pro 6000 for 13k and be able to run some decent local models, but the vast majority will need to use API. And at some point Nvidia might say: "We don't sell that many 6000s, so let's just cancel them altogether as we can gain 4x profit on datacenter-only GPUs" and then they'll become unobtainium and no private person would ever be able to run anything decent (~1 year behind the frontier) locally.
  [-]
  - nodja 2 hours ago
    I wonder if this move will backfire on them. All the fabs are focusing on HBM and leaving DDR behind, if one of the big frontier labs folds all the memory fabs will be left holding a big bag of HBM memory. They won't have any other choice but sell for cheap so it wouldn't surprise me if we see a return of HBM in the consumer market in 3-5 years.
- bityard 5 hours ago
  The general consensus is that local models will continue to improve drastically, but hosted models will as well. There will _always_ be a pretty big gulf of capability between what you can do with a desk full of hardware at home vs a few racks of hardware in a datacenter. That seems to be the real "moat" of hosted models at this point in time: access to capital.
  What's interesting/exciting is that local models are _already_ quite good at tasks we never imagined AI _ever_ doing before ChatGPT hit the scene just a few short years ago.
  We're also in an interesting point in time where companies are releasing the fruits of their research/labor (the LLMs) to the general public for free. For now, I think they see it in their best interest to gain mindshare and rapport, as well as advancing the state of the art in smaller LLMs ("a rising tide lifts all boats") but I fear and expect that these will dry up as the major players buy the minor players, and all will seek a return on their considerable investments in AI research.
  [-]
  - cogman10 5 hours ago
    I believe there's a level of diminishing returns. Sure, SOTA will probably always benchmark better than local models. But do we need it? That's the question that the likes of OpenAI and Anthropic should be worried about.
    [-]
    - regularfry 5 hours ago
      The difference won't be in the individual tasks. It'll be in the scale of job they can take on and how you interact with the model. Think of pairing with a junior vs replacing a full delivery team, that's the sort of difference we'll be looking at. We'll be able to get closer to the latter by being more clever with harnesses, I reckon, but the frontier labs will run ahead because for any given harness trick they can lean harder on model smarts.
      [-]
      - cogman10 5 hours ago
        True, but my point is that if/when local models get to the point where they are capable of doing the "delivery team" work what's next? What can these bigger SOTA models offer? And especially what can they offer above and beyond what you might be able to get from much cheaper models which the open models are based on?
        That's what I mean by diminishing returns.
  - spockz 4 hours ago
    There is also the thing of workflow.
    We have set up something where you create a ticket, Make sure it contains enough information, and with the right tag added it will make a branch with PR for you which stays up to date based on updates to the ticket and comments on the PR.
    It’s creepy in a way. But you also can’t really use local (as in workstation LLM) for that. Sure we could run something like a distributed task scheduler across all our engineer devices but just pushing it to copilot is easier.
- icoder 6 hours ago
  What I don't understand is that on one hand we read 'what they charge is much less than it costs them' and on the other hand this thread seems to suggest that 'what they charge is more than it would cost me'.
  [-]
  - bluGill 6 hours ago
    What it costs is tricky to measure. A large part of the costs are training the model. Once they have the model they are making a ton of profit from what they charge (or so we think - I haven't seen the numbers). However the sunk costs of getting the model need to be paid for and that means an accounting problem where we have to guess how much the model will be used in the future.
    Accountants are reasonably good at figuring this out - there are a lot of different things that need a large upfront investment before you can charge anything. People still debate if they are correct in this each case.
  - 15155 3 hours ago
    They have to provide the service at peak scale and high-availability, your local setup doesn't have those extremely expensive requirements.
  - esailija 6 hours ago
    Bigger models that Antrophic want to sell cost disproportionately more (e.g. 100% more cost for 5% performance improvement) than small models you would use locally
- frollogaston 4 hours ago
  Anthropic isn't just renting out compute, they're renting out a closed model that's better than anything you can download for free. So they're rightfully focused on preventing others from distilling their model.
- themaninthedark 7 hours ago
  Maybe that is why they are buying up as much hardware as they can? If their service is the only game in town.
  [-]
  - otterdude 7 hours ago
    Data Center providers are buying hardware, not anthropic. Certainly related but alot of the hardware purchased is just sitting in a warehouse waiting for a data center to get built.
- ActorNightly 3 hours ago
  Local models will never achieve "real" performance (i.e actual usage, not benchmarks) compared to frontier models.
- pessimizer 3 hours ago
  > but a lot of people are going to multiply the per-month charge by 12 or 24 and say "Could I set up a local model for less than that, and have it pay for itself within a year or two?" And if a significant portion of customers decide to buy instead of rent, the companies whose business model is entirely centered around renting will suddenly find themselves hurting for customers.
  And those are going to all be big enterprise companies that probably will set up LLM services entirely in-house, because they've got the headcount to utilize servers at 100%.
  I wonder if there will be (or is currently) business in selling their compute while they're not working, to opposite time zones, etc.
  What's left for the big providers will be the dregs of individual subscriptions and small businesses that at their least paranoid might let employees just use their own subscriptions for work.
- sbmthakur 6 hours ago
  Someone was able to run gemma-4-26B-A4B on an i5-8500 with 32 gb ram with NO GPU. Granted this is an extreme example these MoE models are value for money for a lot of use cases.
  https://www.reddit.com/r/LocalLLaMA/s/YontVNVRbL
embedding-shape 7 hours ago
Show us the resulting code of using them! :) I want to use local models, I have the hardware for it, but while trying them out as replacements for GPT 5.5 xhigh or Opus or other SOTA models, they aren't quite ready to be replaced yet, sadly. The quality and bumps they encounter just slows down the workflow so much, even screwing up tool call syntax sometimes.
But, for smaller more well-defined workflows, or as straight "edit this part to be like this exact" edits, they seem more than enough. Still waiting for them to become mature enough to be able to replace what we have as SOTA today, I'd say it's ready to be switched over then.
Speaking of local models, DiffusionGemma (and diffusion models in general) should not be slept on for local usage! Usually the problem locally is that the LLMs aren't efficiently making use of your hardware, unless you start batching requests and run many at the same time, but that require different approaches in general. Instead, diffusion models work much faster for individual prompts, and not by a small margin either.
Today I finally finished porting diffusiongemma-26B-A4B-it support from Transformers into Candle, and together with some optimizations I now have it basically flying with ~450 tok/s (~19 it/s) in Candle during inference, instead of ~180 tok/s (~11 it/s) from HF's Transformers library. Even using vLLM with similar sized LLMs, I don't think I've ever gotten past the ~250 tok/s threshold for single prompts, exciting stuff for local models :)
[-]
- zozbot234 6 hours ago
  > Instead, diffusion models work much faster for individual prompts, and not by a small margin either.
  Diffusion models can't really be trained beyond low-to-mid size and have lower quality than an equally sized, plain one-token-at-a-time model.
  [-]
  - embedding-shape 6 hours ago
    As mentioned, I've just finished the implementation and started playing around with it, seems to be doing similarly well inside of my own agent harness as similarly sized "traditional" LLMs. Of course, neither come close to SOTA models, but I suppose if we can figure out the scaling issues you mention, we'd get a bit closer. The performance just feels like it's too good to quickly ditch diffusion. Do you have more info what those "can't be trained beyond low/mid size" issues are in practice today?
    [-]
    - zozbot234 6 hours ago
      The issues around training diffusion models are well known among researchers. They're likely to not be feasibly scalable far beyond the 26B size of DiffusionGemma itself, and their lower quality compared to an equally-sized auto-regressive model (the usual one-token-at-a-time flow) is also a matter of broad consensus.
      [-]
      - embedding-shape 6 hours ago
        > They're likely to not be feasibly scalable far beyond the 26B size of DiffusionGemma itself
        I think people used to say the same about the 8B text-diffusion models too when they came out, like LLaDA. LLaDA2.0 seemingly claims 100B total / 6.1B active MoE diffusion (DiffusionGemma is also MoE). Not saying you're wrong about the current consensus, but it has a way of changing over time, might be a bit early to claim it's infeasible to scale them, especially considering the final artifact being much more suitable for local usage.
        [-]
        famouswaffles 3 hours ago
        Difficulty of scaling is not the only issue. Nobody is going to be particularly invested in scaling an architecture that has:
        - consistently proven behind their auto-regressive counterparts in quality. Look at the dgemma benchmarks - pretty steep dropoffs and the more difficult the benchmark the worse the dropoff. That's not a good look and it's not like its some artifact of google's release. Every dllm is like this.
        - And whose inference benefits are negated at scale. Transformers are still cheaper if you want to serve lots of users.
        >"DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs"
        Put yourself in the shoes of all the labs, even open source ones. Why would you put much effort into this ?
        [-]
        embedding-shape 2 hours ago
        > - And whose inference benefits are negated at scale. Transformers are still cheaper if you want to serve lots of users.
        But my entire point is about the reverse of this, the context of what I bring up is in single-user scenarios, which is where these diffusion models really make a large difference in performance.
        Sure, I agree it's not a good fit for every single use case out there, everywhere. But after starting to play around with it closer myself, I think people are dismissing it a bit too quickly, at least if you're interested in running local models on your own hardware.
        [-]
        famouswaffles 2 hours ago
        I don't think you're really getting the point I'm trying to make. Everyone training llms regularly cares about serving users at scale and quality per compute invested. It's not just about OpenAI or Anthropic or Google. Qwen, Deepseek, Moonshot, whatever. They all care about it very much and basically can't afford to take a step back in those areas.
        Since training models is currently a very expensive procedure, diffusion llms are destined to be relegated to the occasional research artifact at best. As things stand, making a serious commitment to them is basically the equivalent of throwing money into a fire pit and things are expensive enough as is.
        Alternate Architectures that do a much better job matching transformers in quality have basically gone nowhere but you expect one that is basically worse in every way the labs care about won't ? I'm not trying to 'dismiss' dllms. I'm interested in them for the same reason you are. I'm just stating the factors at play plainly.
        zozbot234 2 hours ago
        Single user scenarios can also use MTP to make auto-regressive inference more compute-intensive with no loss of quality.
pornel 2 hours ago
[meta] I wonder why people have such wildly different bar for what is "good" agentic coding?
In a way, it's absolutely amazing that we've went from "Playing 'Set a Timer' on Apple Music" intelligence to something that may pass the Turing Test, but in practical terms the small models are still far from what I'd call "good" for more than a tech demo.
To me, 7B models are just a fuzzy echo of Wikipedia. Gemma models at 4 bit are too clumsy to even reliably generate JSON for tool calls or copy a line of code to apply a patch.
Qwen needs so much detail and babysitting to stop it from doom looping or losing the plot, that the instructions that I need to give are usually longer than the code I end up keeping.
Is there some magic prompt that I don't know? Do other people just have a lot more patience, or way lower expectations?
[-]
- papersail 2 hours ago
  I had similar doubts. I think expectations differ because the workload differs. For small scripts, glue code, or simple CRUD changes, smaller models such as Qwen3.6-27B can work wonders than they do on a larger, messier code base.
- verdverm 22 minutes ago
  There is a lower bar (that gets lower over time), but ime, the config you are describing is too low still.
  qwen/gemma in the 27/35B range @fp8 are better than gemini-2.5, but less than gemini-3.1, you can run DS4-flash @fp8 on two DGX spark, and things keep becoming better. DiffusionGemma came out recently with 4x token gen speeds.
  tl;dr - the models you appear to be trying with are too small or too quant'd
K0IN 27 minutes ago
In a day to day base i host Qwen3.6:27b, but i *Really* want to host deepseekv4 flash, its such a "good" model for its size/speed/price.
I really wonder when companies will start hosting theire model for everday tasks on prem, cause its good enough (and realative cheap), instead of paying subscriptions for all devs.
iagooar 6 hours ago
I love running two models locally: qwen3.6 27B 8bit (dense) and qwen3.6 35B 4bit (MoE).
The 27B is the smarter, more reliable one - but it is slower. The 35B is faster, still very smart but below 27B, a bit less reliable. The reason is the MoE - Mixture of Experts architecture, which only activates a subset of parameters, making the model much much faster.
I run the 27B on a MacBook Pro M5 Max + 40 GPU cores + 128GB RAM (well, on this beast I can have 27B + 35B in memory at the same time with headroom for all the other stuff). But because this is a laptop, it is not possible to run local LLMs all the time - it just gets too hot and too loud.
What excites me more: I run the 35B model on a MacMini M4 with 64GB RAM. It is fast, it gets a lot of work done (e.g. it scans, extracts and classifies my emails, it watches the mailbox all the time and does work). I also use it as my private Hermes assistant ("when is the next Starship launch?", "who is playing today at the World Cup? Give me some trivia").
Next step I am planning is a RTX Pro 6000 Blackwell workstation I can put in my basement. I want to run qwen really fast, with multiple threads / prompts / agents at once. And MAYBE if the budget allows, a 2x RTX Pro 6000 setup in order to run DeepSeek v4 flash on it (to run research on it).
[-]
- Barbing 6 hours ago
  Did you get a Brave search API key or something for that “Hermes”?
  [-]
  - verdverm 20 minutes ago
    I'm using SearXNG, EXA, Tavily, and soon (tm) Cloudflare
    They all give slightly different results, you can dedup / fusion with heuristics / another agent
  - nickthegreek 4 hours ago
    I have my mine setup with a searxng instance I run in a docker. Works great and costs zero.
  - dghlsakjg 6 hours ago
    Hermes is just an agent that can be setup for whatever you want (coding or more commonly personal assistant ala clawdbot). You can set it up with any of the standard tools and MCPs like brave or tavily for search.
  - iagooar 5 hours ago
    Yes, Brave search is one of these services I highly recommend paying for, the search they provide (similar to Exa, Tavily) is what makes an "OK LLM" become super smart.
- zerd 5 hours ago
  I'd love an RTX 6000 Pro, but how can you justify it when it costs 10 years worth of Claude Max?
  [-]
  - iagooar 5 hours ago
    10 years worth of Claude Max today. Also - Anthropic recently removed a model I relied on and isn't giving it back. As a non-US citizen, I would rather pay in advance but be sure, I will keep having access to inference on my own terms.
    Also, it will just be faster - and more fun too.
sosodev 6 hours ago
I think this is overselling their capabilities. I've used Gemma 4 and Qwen 3.6 quite a bit on my strix halo home server. They're great models and the dense variants are significantly better, but they're still very far behind the frontier. If you boot up Gemma 4 MoE and OpenCode/Pi and expect to perform anything like Claude Code or Codex you're going to be very disappointed.
Computer0 1 minute ago
I have 16GB VRAM and 96GB Ram on all my computers and I do enjoy local models. I would not use them for coding, though I have experimented with it, it is largely a waste of time on my hardware. I love local chat with different models however, when using the model in this way it is much easier to experiment with the largest models near the limit of your hardware, and I do find it useful on the airplane somewhat. I have also used local models for data classification tasks and let it run over the weekend etc and the results were acceptable.
delis-thumbs-7e 1 hour ago
Nobody asked, but I don’t think any of us should be using SoA models to code or to do pretty much anything at all. Instead we should develop open models to work on specific tasks and learn to code, write, draw etc. using fingers made of bones and brains made of flesh. Big corporations and research facilities can run them to generate code or math or whatever, with a bunch of specialists to check the output to be correct. Then again, even that might not be worth the costs (e.g. OpenAI’s 36B$ net loss last year), when the open models are so close and the whole AI scheme is running out of scams to pull.
There’s a lot of things we could use even quite small models for, which would not need an insane amount of computing power and memory, but too few of us is really researching them.
chrismarlow9 6 hours ago
You can use a frontier model to create a plan that's specific enough for a local model of a very small size to execute on. The more specific you are and compartmentalize tasks the "dumber" the local model can be.
Edit: Obviously you'll be using more tokens but this is the trade off for running a smaller model and running locally. Similar to time memory trade off but in token economics. Sorry I need more coffee
segmondy 6 hours ago
It's more than good. As of today, it's great. Those models listed in the blog are horrible compared to what you can run today, There's absolutely no reason to run those, you have Qwen3.6, Gemma4, and plenty other sized comparable models.
If you're resourceful, you can even run SOTA models. KimiK2.7, MiMo-V2.5/V2.5-Pro, MiniMax2.5/2.7/3, DeepSeekV3.1/v3.2/V4-Flash/V4Pro, GLM5.1, Step3.7-Flash, Qwen3.5-397B, Qwen3.5-122B, gpt-oss-120B
ngxson 6 hours ago
My 2c: I think the "cloud vs local" debate is (maybe) a false dichotomy. In my experience, I use a hybrid approach and I've seen a huge productivity boost from it.
The cloud-based models are fine for big and complex tasks, but the pricing is ridiculous for small stuff—like summarizing a discussion or fixing a small bug. And cloud and privacy have never been a good match.
As an example, this comment itself was written with the help of Qwen3.5-4B running locally with an extension on top of llama.cpp default web UI [1]. The extension injects my browser's context directly into the conversation, which allows me to summarize things and draft up comments quickly. Speed is pretty acceptable for the size: ~5s TTFT and ~100 t/s generation, all running on a Macbook M5.
And when I want to run bigger tasks, I don't just stick to one provider. Apart from well-known closed-weight providers like OpenAI or Anthropic, I also experiment with open-weight models like GLM-5.1, DeepSeek V4, and Qwen3.6-27B, which provide quite good results for the price.
I'd argue both have value, and I don't see why anyone needs to choose one exclusively. Anyone else doing this?
[1]: https://github.com/ngxson/llama-companion
[-]
- phainopepla2 5 hours ago
  Why not just use DS V4 Flash for the small stuff? Very fast and extremely cheap.
  [-]
  - ngxson 5 hours ago
    The dsv4 flash is 158B params in total. It is possible to run locally but will require all my system RAM.
    Also, a lot of my day-to-day tasks perform the same on both small and bigger models: summarize a web page, draft a response, translations, quick web search, etc.
    [-]
    - phainopepla2 5 hours ago
      Sorry, I meant non-locally.
      I'm assuming privacy is not a concern since you mentioned using Deepseek already. The cost of V4 Flash for small tasks is so minuscule as to be almost free, and you don't have to deal with a churning laptop (or even buying a high-end laptop, for someone who doesn't already have one).
      I guess what I'm really asking is, what's the advantage of using these small local models if privacy isn't a concern?
      [-]
      - ngxson 4 hours ago
        I do use both DSv4 the "normal" and the flash variant, non-locally. It works well, not exceptionally. And while it's cheap, I'd say that the difference between $1 per month vs $5 per month is not a big concern to me. IMO pricing is pretty competitive among open-weight models: https://huggingface.co/inference/models
        Depending on use cases, but for me I found 2 use cases where a local model is a must and not optional:
        - Running offline without internet access: for example, I have this project that allow transcribe and summarize audio in real time. I already used it in some events where wifi is not available: https://github.com/ngxson/llama.cpp-realtime-audio-recap
        - Handle private personal data, for example health records. This is the same category of "privacy" that you mentioned, but I just want to bring up the fact that people value their privacy differently.
    - coder543 2 hours ago
      dsv4 flash has 284 billion parameters, not 158 billion.
      Huggingface's little parameter count badge seems unreliable.
infogulch 2 hours ago
Anybody used a tinybox? https://tinygrad.org/#tinybox
The most "affordable" option is red v2 with 64GB GPU ram and costs $12,000. This is only ("only") 1.5x-3x the price of a beefy desktop (https://pcpartpicker.com/builds/), and could crush inference work even on bigger models. It could support coding tasks for a small team of developers, or run an AI agent for every person in your household...
0xc0c0c0 6 hours ago
I have used local models (around 128 gb) and the big proprietary models, and while I do want local models to win, it's important we keep the expectations of local models realistic. There are many blog posts about how local models today can fully replace some of the proprietary models and in some cases its true for the much smaller proprietary models, its very clearly much more behind the larger models.
You can be far more ambiguous with your tasks with the larger proprietary models as opposed to the local models. You can achieve the similar results with local models but you need to be much more detailed in your prompt.
One of the biggest things about running these local models is that the harness matters almost just as much as the model too. Codex is optimized for GPT models, CC is optimized for Claude, Cursor has a great harness that works very well across these providers. It took me a couple of iterations of the different harnesses to find one that would work well with the smaller Qwen models to do local coding.
[-]
- failbuffer 6 hours ago
  So which harness did you end up choosing?
minton 2 hours ago
I’m glad people are looking into this because I do think it’s the future. However, why would you not take advantage of the heavily subsidized frontier models while you can. It’s obvious that they’re gonna have to raise prices at which point it might make sense to consider local models, but not today.
[-]
- fendy3002 2 hours ago
  Curiosity or anticipation I think. I have tried it in the name of those 2 factors, because when the frontier model price increase happens and we don't know anything about local models, we're screwed
polotics 1 hour ago
So I've made this [me+vibe+tests]-coded Android alarm app called Promptly, and as Gemini-CLI on the Google Pro subscription is getting google-killed on June 18th, I set up two branches, one for Antigravity+Gemini3.5 and one for Pi-coding-agent with Qwen3-Coder-Next...
Running the same prompt on both with the same .md memory state...
Gemini3.5 is more "intelligent" but Antigravity gets it to decide to go on tangents that are quite time and token-consuming I think. Nice casino machine.
Pi+Qwen3 (~80GB, llama.cpp) is like vibecoding about 1.5 years ago, when you had to babysit, structure your program to have self-contained chunks, and keep an eye on all the cross-cutting concerns to not trip it up. When it works it works fine and when it fails it's my job to ensure it fails fast.
The code is about 10'000 lines of Kotlin in total so it already takes some effort to keep it simple for the AI. It's not a slopped quantity of code, i got solid feature creep :^)
https://play.google.com/store/apps/details?id=com.sixteenam.... ...hat tip to the recent copycat squatter btw it's an honor!
_doctor_love 7 hours ago
"Just get a 64GB Mac with 1TB of storage!"
LOL - some of us have a budget
[-]
- swatcoder 7 hours ago
  Sure, but it's also not really out of scale with the cost of a shop tool in other trades.
  If you're a professional that's confident in a positive return on the investment (optimal or not), or just a hobbyist with the luxury budget for a "shop" that cost is well within norms.
  That's not everybody, of course, but it's not some inconceivable fantasy. A lot of people in the tech community here on HN, specifically, end up with pretty high discretionary budgets that they pour into stuff like this.
  [-]
  - frollogaston 3 hours ago
    But you can get that return from a paid service too, in fact it'll be better. So just comparing costs, what's the annualized ROI on the Mac Studio assuming it means you avoid paying $240/y for Claude? Cause I can always set aside the Mac's price in some investments and pay for Claude out of that.
    [-]
    - swatcoder 3 hours ago
      Same with many and their shop tools in other trades.
      Most hobbyists and many professionals could end up far ahead financially by leveraging makerspaces, tool rentals, and co-op shops or even by hiring out a professional to prep certain intermediates for them, but they get psychological value -- as well as flexibility, reliability, and resale opportunity -- from having their own well-outfitted shop.
      And they can afford that premium, so they do. At the scale of individuals and small shops, not everything that matters gets captured in financial models.
      [-]
      - frollogaston 3 hours ago
        Yeah but the local model doesn't have those advantages for the coding use cases, at least not yet. In theory you could post-train one on your codebase or something, but nobody cares to do that when any vanilla coding agent service can read and understand the whole thing better than a locally tuned free model. I was already being very generous towards the Mac in pretending it does the same thing as the paid service.
        Aside, physical tools tend to be financially advantageous to own if you're going to use them a lot. Even if the owner were targeting 0 profit, they'd have to charge more to factor in the cost of dealing with customers and increased risk of wear/damage by users who don't care as much.
        [-]
        swatcoder 3 hours ago
        The shifting sands of commercial models or pay-per-use managed models are just really not appealing to a lot of people.
        Most come with huge privacy concerns, total costs and availability are impossible to forecast very far out, and the specific behavior of frontier models in particular is not something anybody can rely on as those are subscription products that are subject behavior on their publisher's whims (whether from changing system prompts, new "safeguards", retired models, forced "updates", new regulations, etc).
        It's quite hard to put a price on all that, and as more people find local models productive enough or develop curiosity to explore models, training, or harness-crafting in their own ways, the marginal cost of buying some shop hardware just sort of disappears into the budget noise for plenty enough people.
- amalcon 7 hours ago
  A Strix Halo with similar RAM is considerably cheaper. Still not cheap, mind, but performance is OK (not great) and it will run more or less the same models.
  [-]
  - AbsurdCensor 6 hours ago
    At least for me, it's been pretty great, but I bought my system when it was $1800, now looks like the same system is $2700 and out of stock. I still haven't quite been able to run 120B parameter models under Windows, but for Qwen Coder 30B, it works pretty darn well for my at home needs.
    [-]
    - amalcon 6 hours ago
      Yeah, they have gone up a lot since I bought mine too. I did get Qwen3.5-122b running on all-GPU (on a 128GB machine) under a minimal Arch Linux setup (I do my GUI work on a much cheaper box). It worked, but Qwen3.6-35b is performing almost as well and a lot faster.
      Still cheaper than a new Mac. Maybe not cheaper than a used one.
      [-]
      - AbsurdCensor 4 hours ago
        I've certainly thought about just moving the box to Linux, but it took far to long personally to get everything running under AMD and it works 'well enough' that I don't want to make the switch. I tried playing with GAIA on it, felt a bit limited, and now have Hermes up and running, and that seems to work quite well. All the tools are changing so quickly, it's sometimes difficult to settle in on 'what's best', so I certainly can understand folks that just want to pay for a AI subscription and be done with it.
- techscruggs 7 hours ago
  He is using a 2022 M2, which you can get that for about $2k used. That is beyond reasonable.
  [-]
  - Shekelphile 6 hours ago
    She
  - psychoslave 6 hours ago
    Global Affordability Estimate:
    Top 10% of global earners (~800M people) can afford a $2,000 device without major financial strain.
    Top 25% (~2B people) could afford it with some budget adjustments.
    Bottom 50% (~4B people) would find it prohibitively expensive.
    So for a SV top income, maybe that might look more like the weekly pet brushing budget, but for most people out there this is not that much of a no-brainer.
    [-]
    - disgruntledphd2 6 hours ago
      The maths changes if you're working for yourself. Because I live in Europe, I've ended up working as a contractor due to the lack of a legal entity in my country. While that mostly sucked for a bunch of reasons, I was able to get a 64Gb Mac M2 a few years back with approximately a 52% discount, which was kinda nice.
      [-]
      - weego 6 hours ago
        If you're working for yourself paying monthly is exactly the same as amortising an asset. Personally I'd rather my business just pay $100 a month than have to deal with additional hardware and software maintenance while using a depreciating asset that is break-even after 3-5 years depending on the spec.
    - frollogaston 3 hours ago
      Bottom 50% aren't paying for Claude either, probably also don't own PCs or write code
    - richwater 6 hours ago
      Yes, because the bottom 50%, mostly impoverished or near impoverished folks were spending money on Claude Code subscriptions instead /s
- p-e-w 7 hours ago
  No need. You can run the Gemma 4 and Qwen3.5 MoE models with as little as 12 GB of VRAM at 30-40 tps (Q4/Q5), and they both blow GPT-4o and DeepSeek R1 out of the water.
- tjwebbnorfolk 7 hours ago
  AI and budgets don't mix well at the moment
- themythfable 7 hours ago
  Yeah, I never had a computer that cost north of $800 until recently. While that is far from the typical HN user's budget, my bet is that it is much closer to average.
  Besides those with effectively unlimited budgets for their personal compute, local models are still a long ways off.
  Though, that shouldn't be conflated with the value of open-source models, which can be used by cloud providers to significantly reduce cost of intelligence.
  [-]
  - embedding-shape 7 hours ago
    > Yeah, I never had a computer that cost north of $800 until recently. While that is far from the typical HN user's budget, my bet is that it is much closer to average.
    There are segments, everything from "Average person in world" to "Average creative professional using computers for work" and more on HN, with a wide range of costs for the hardware. HN probably skews towards the latter rather than the former, probably sitting with enterprise hardware next to them basically for fun, hard to make wider conclusions from what people here have or not.
  - sublinear 6 hours ago
    If we define "typical" as the median HN budget, it's probably about the same as yours. Maybe the answer would have been different 10 or 20 years ago, but the era of truly needing a big budget PC has been over for a while.
    It's just for gaming and AI now. Maybe not even gaming as much anymore.
    Consider the perspective of someone who has a practically unlimited budget for PCs, doesn't game much anymore, and doesn't need AI to do their job. It's just part of getting older, and there are plenty of people in their late 30s and older on here.
- anarticle 6 hours ago
  Pros buy their own tools. This is why working for yourself is better than working for a corpo, you get to choose your weapon.
- dofm 6 hours ago
  [dead]
ptx 3 hours ago
> Security: I run every Pi session in a Docker container and give it permissions only to bash so that it can’t run Python code or do web browsing
How does that work? The script in the post references the file "docker-compose.sandbox.yml", but I don't anything about what that file does.
The post that this one links to, that it's based on, says that Pi doesn't do proper sandboxing.
Presumably bash can still execute other binaries, otherwise it would be fairly useless. What stops it from executing Python? Or opening a network connection and downloading Python?
dejawu 6 hours ago
If vibe-coding is hopping into a self-driving car and telling it to take you anywhere you can get a coffee, then I use coding agents more like a bicycle - they let me get further faster than if I'd walked, but I still have to decide where to go and how to get there, and I still have to pedal.
I don't vibe-code, but I do decide what to implement and what patterns to use (perhaps asking the model to analyze and give advice on this first), then I have it handle the nitty-gritty of the implementation itself. For this usage style, the latest local models are as good as having Claude at home.
I won't say it's been _easy_ (I ended up implementing my own harness to accommodate the idiosyncrasies of local models), but I will say that for the effort, having a coding agent that's essentially free to query as much as I want has been life-changing as a dev, especially when it comes to working on side projects. Knowing that my agent will never get worse in quality, suddenly cost more than it does now, or be suddenly made unavailable by external factors, was absolutely worth the trouble. And on top of all that, I can't believe it's as good as it is.
gregwebs 4 hours ago
All these conversations seem like they are missing talking about planning vs execution. I want the best possible frontier model to plan out my changes. I also have a 2nd agent that is a frontier model check the plan. Then at that point the implementation can be done by a lesser and possibly local model. The frontier model can still do a final code review on the implementation of the changes.
Claude code supports this by setting the model to "opusplan"- it will automatically use Opus for planning and sonnet for implementation. This was completely necessary with the fable release. I was able to do this with fable and it was necessary to avoid getting quickly rate limited. In settings.json:
"env": { "ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-fable-5" },
Obviously have that set to "claude-opus-4-8" now.
[-]
- noveltyaccount 2 hours ago
  I do this with Codex 5.5 for planning (specs, technical design, and task list); and Qwen 3.5-35B for task by task build out. It requires more hand holding and makes more mistakes than using Codex for everything, but it helps me spread my $20 chatGPT subscription pretty far.
bayshark 2 hours ago
Hey everyone, made a local LLM, configured for Home Assistant called Selora AI.
Specs: qwen3_17b_base.Q6_K.gguf selora-v047-answer.f16.gguf selora-v047-automation.f16.gguf selora-v047-clarification.f16.gguf selora-v047-command.f16.gguf
The full base model and LoRA adapters are only 3.5GB
Capabilities include configuring for smart home setup to help with answers, clarifications, commands, and creating automations in Home Assistant. The models with the LoRA adapters were made with lean scripted data made specifically for Home Assistant. A lot of work was put into this, feel free to give it a try and happy for any feedback!
https://huggingface.co/selorahomes/Selora-AI
Tharre 6 hours ago
I've been running Qwen3.6-35B-A3B (and 3.5 previously) locally and it's a great model for many small tasks, probably a significant chunk of what most normal people are using LLMs for right now.
But for coding in a harness? In my experience it's unusable even for small projects. It just gets hard stuck at every little problem, wasting hundreds of thousands of tokens trying to make a convoluted solution work instead of doing the obvious thing. Or it will spend hours trying to reason through a fairly simple code flow, incrementally adding debug print statements, only to get confused by the output and then editing completely unrelated code that it convinced itself is the problem.
I've tried instead giving Sonnet the problem description and code and have it come up with a detailed plan that Qwen should implement, but doing that actually consumes a significant amount of tokens compared to just telling it to implement everything, and the results are honestly not that much better. There are just too often subtle issues with the plan that Qwen doesn't recognize when implementing, but make the resulting solution it comes up with unusable.
simonw 6 hours ago
I think gemma-4-26b-a4b and Qwen3.6-35B-A3B show that there's something very interesting about a local model that does mixture-of-experts (which helps a lot with performance) and has in the order of 30 billion parameters.
These models are very capable, and use around 20-30GB of RAM while they are running.
Provided you have 64GB of RAM that leaves space for running other applications at the same time.
[-]
- chrisweekly 6 hours ago
  Obtaining that 64GB RAM is a meaningful obstacle for many.
  [-]
  - simonw 6 hours ago
    I'm still amazed that you can run LLMs of this quality on a machine that costs less than $3,000.
    I used to assume that anything GPT-4 equivalent or higher would need $30,000+ of server-class hardware.
    That said... gemma-4-12b-qat is 7.15GB on disk so should run reasonably well in 16GB, that takes it down to MacBook Air territory https://lmstudio.ai/models/google/gemma-4-12b-qat
  - frollogaston 3 hours ago
    Not just RAM, VRAM, right? Though they're one and the same on the Mac.
richbradshaw 7 hours ago
I’m keen to understand speed here etc etc. if I bought a Mac studio with 96GB - what can I realistically run, how’s it compare to fable/opus etc and how fast is it?
Currently maxing out two Claude code accounts every x hours when working on large code migrations or setting up new iOS apps etc - most of time it’s fine but occasionally it’s mega frustrating!
[-]
- simonw 6 hours ago
  I strongly recommend trying LM Studio - it's the lowest friction way to try out models, you can browse https://lmstudio.ai/models and click "Get" and then "Run in LM Studio" to download and run a model.
  With 96GB I'd start with the Gemma 4 and Qwen 3.6 models. Any of those should work fine.
- AbsurdCensor 6 hours ago
  I think currently you can only get the M3 Ultra Studio with 96gb, and for coding tasks, say you rub Qwen Coder on it (which doesn't need that much ram), it's not the fastest, something like 30-40 tok/sec. Probably better with a MacBook Pro with the M5 chip. There is a website for comparing different configurations and models: https://llmcheck.net/benchmarks
- pizza234 6 hours ago
  [dead]
ltononro 6 hours ago
Good depends a lot. If you are in the token maxxing hype you will probably find these models very bad comparing to SOTA, unfortunately.
The good news might be: opensource models are now good (enough) for day2day usage. But is it really? I feel that companies will always naturally strive for the best and use the SOTA (as long it is not too expensive).
I see OSS models being a good backbone for companies in the future that have validated workflows and could use those for privacy or to spare costs.
IDK, might have gone a little bit off-topic here.
aquarious_ 4 hours ago
I support local models and enjoy playing around with them, but even for personally development it is just more viable for me to pay $200 a month to Anthropic for the latest models. It seems to me with the cost of hardware needed to run local models that, for now, it is pure hobbyist and exploratory (which is fun in its own right)
pjmlp 4 hours ago
Only if blessed with enough RAM and disk space,
> 64 GB RAM and 1TB storage
Ah ok, not something regular joe and jane happen to have lying around at home.
Additionally the whole configuration is still very much low level, bunch of CLI commands, and if the model doesn't fit for the task at hand, it starts allucinating, generating gibberish, whatever.
[-]
- sparkling 3 hours ago
  Even if i had such a machine, im not sure i would be willing to sacrifice 80% of my RAM and 50% of my disk to run a semi-okay model locally.
robertkarl 4 hours ago
You can trade off latency / accuracy / cost for any ML task. And with the local models.... the cost is free.
Having a local Qwen check another Qwen's work increases the accuracy quite a bit at the cost of more latency. You can't have your cake and eat it too.
In benchmarking local models, I'm having success increasing even a 9B qwen's score on terminal-bench adjacent problems, just by asking it to plan and handing the plan back to qwen with a fresh context. Try it with Qwen3.5, unsloth Q4+, and a thinking budget of around 1024 tokens.
abalashov 5 hours ago
And if you want to dial in a setting in between: I've switched to Kimi K2.6 (now K2.7) and DeepSeek through OpenRouter and Reasonix for pretty much everything, with no discernible loss of analytical quality or utility.
However, like many commenters, I don't really believe in vibe-coding, long-horizon agentic one-shot agentic coding, etc. and do not use LLMs for huge generation tasks that involve designing things end-to-end.
I also have an MBP with 128 GB of unified memory and do quite a bit of Qwen3.6-35B-A3B. No, it's not as smart as the aforementioned models, to say nothing of frontier, but many people seem pleasantly shocked by the number of banal tasks that do not require these.
wxw 7 hours ago
> “if we are constrained by performance and price, what architectural tradeoffs do we need to make?” a question that so far has not really been asked in the mad token gold rush.
To be fair, I think the labs are also interested in this (e.g OpenAI parameter golf). But the incentives are tricky. When the subsidies and tokenmaxxing era ends, local models will be essential.
mohamedkoubaa 2 hours ago
I wonder when a cheaper consumer grade inference chip will hit the market. The general purpose GPUs have much more silicon and complex firmware than what's strictly needed for inference
huydotnet 5 hours ago
I love that local LLMs are being discussed more often on HN recently. But for the post, I find it strange that the author claimed they were working with local models from day 1, but wrote a post that still links to Qwen2.5 and Qwen3 in mid June 2026.
lthi747 1 hour ago
Maybe it is good but it is very difficult, or at least with regular computer. For users like me with 16GB laptop it is almost impossible task.
b3ing 4 hours ago
They are ok for simple stuff, coding is weak, chat is alright, writing is ok. But I had many of them write stories for ideas and they kept using the same names regardless of what the story was about. I can’t complain, it’s free. Can’t wait till they get even better, but for local image generation they are good, slow but just create a bunch in the background while you do other things otherwise it’s like 14.4k modems
aliljet 6 hours ago
The problem here is always the cost-benefit. For $200/mo, you're receiving subsidized best of breed access. There's no model competing for that price anywhere. If a 27B param model is what you choose, show me your hardware! I would love to be wrong...
[-]
- rsolva 5 hours ago
  But for how long? The subsidized phase is probably short, and then what? I run Qwen 3.5 27 Dense om my old AMD RX7900XTX at about 45 t/s and barely use my Claude Code subscription anymore.
valisvalis 5 hours ago
There are good use cases for them for sure, the Gemma 4 Good hackathon a while ago showed how local models can solve problems in health and education in areas with low connectivity or small infrastructure.
anubhav200 6 hours ago
I have been using qwen and glm based models from last 2 years, ended up buying mutiple machines for the same. Overall i feel 24vram is a must have to get get performance (speed wise) to match hosted soln. I have 2 machines a 12gb vram one and a 24gb one. On 12gb vram i get around 50tps generation and 500tps prompt processing and on 24gb one i get 180tps generation and 3500tps prompt processing. I have different configs for different scenarios and I also use llama cpp manager manage all my configs (https://github.com/anubhavgupta/llama-cpp-manager)
aleksandrm 1 hour ago
Clickbait title, because running local models is still not good now.
cautiouscat 6 hours ago
> I have no concrete scientific evidence of this - my own personal vibe metric of “is a model good enough” is, “do I have to double-check it against an API model”, and GPT-OSS was the first one where I started doing that a lot less often.
The good old butt dyno!
I’ve been eyeing local models more and more with Anthropic squeezing more and more on the subscriptions. A few comments on HN had me waiting until they improved more but this article makes me wonder if I should reconsider that.
I’ve been doing some pretty niche development using a game and a script extender for said game. If these models can handle that, I’d feel good about switching.
jszymborski 4 hours ago
I run local models and they work fine for me, but specifically for use in coding harnesses, I'm having a hard time. Tools tend to end up in the same loop, trying to `ls` the same folder or `grep` the same file, over and over and eating up the whole context. Super hard to get it to do anything but that. Any tips?
cube00 7 hours ago
The challenge I have is getting a large enough context window so tool calls work reliably, the local models easily slip into hallucinated JSON tool responses and won't trigger the tools as a result.
[-]
- glaslong 6 hours ago
  Same here. I'm curious what others loving Qwen are doing differently, because it constantly hits this issue for me. It's been great for autofilling blocks, but difficult for me to use agentically.
andix 4 hours ago
Because I've seen too many people spending a lot of money on expensive hardware, without really using it in the end:
Most of those models are also available via Openrouter and many other platforms. Dirt cheap, and much faster than on consumer GPUs. Perfect to try and compare the different options.
jlengrand 4 hours ago
Just wanna say it's always fun and nostalgic to see authors pass by here who I was reading back when I started my career. I was reading Vicki's blogs way back, even remember learning some email parsing in python from her over 10 years ago. TY!
MrKoby07 4 hours ago
I think a lot of people just don't have specs like that, making it still painful.
jotato 6 hours ago
I currently have a desktop with a 4060 ti (16gb of vram). Most models I have tested that fit within that are not good enough for anything other then type completion (in regards to coding tasks)
I have been considering getting the 58gb Mac Mini but that is a decent amount of money to spend without confirmation on a) how fast is it and b) will it work for well-defined tasks.
frollogaston 4 hours ago
"Good" refers to the speed and not the quality. There's so much hype about Macs being great for LLMs, but nobody seems to be seriously using them for that because the open models are unfortunately so far behind.
blobbers 3 hours ago
Have you tried optimizing for MLX? It seems like a waste to have neural cores and not use them.
I've often wondered why the hype around apple neural core when 99% of software doesn't use them.
ta-run 4 hours ago
Not related, but, I can't seem to get my copilot-cli (office is an MS shop) use qwen3.5:27b on ollama for some odd reason.
After the recent changes to usage, I've spent an annoyingly long number of hours trying to get this to work.
throwarayes 5 hours ago
I am happy to pay OpenAI for a cheaper model a few generations behind. But they deprecate models aggressively. They push you to bigger and smarter models, when 95% of my work doesn’t need it.
I’d love it if model providers just let old models run and let us pay less, but the deprecation makes me want to look into local models.
WASDx 5 hours ago
Looking at some benchmarks, the latest ~30B Gemma/Qwen score similar as Claude or GPT versions that were released just one year earlier. That's crazy progress. I can't imagine how it will be in a few years.
k__ 5 hours ago
I tried some smaller Gemma4 and Qwen3.6 quants on my MBA with M5/16GB and had like 20-60 tokens per second. At 60 it felt pretty okay and that hardware is on the lower end.
I'd assume a Mac with 32-64GB memory would get some reasonable results.
fridder 6 hours ago
Is there a local harness designed around the local model use case that is claude code like? Opencode has been problematic at times, pi works for one off for me but not back and forth conversations with the LLM. Considering I only use Qwen or Gemma models I'm close to just writing my own at this point
nikagrawal121 3 hours ago
I tried for my legal AI application that I'm building and it was able to do majority of the tasks. I used gemma4:26B
anax32 7 hours ago
I've just made a milestone on my project, moving away from AWS (budget) to self-hosted and the local models are so much faster than in the past. Beyond LLMs, having embeddings, image, video, audio gen available is crazy.
Running locally is the bar; it's hard to make these things a service which scales.
prlin 6 hours ago
If you wanted to do some research or learn about post training and agent harnesses, is that a good option with these local models? What hardware is recommended, or easiest to go with a Mac Studio with 64GB+ RAM?
0xbadcafebee 2 hours ago
Local models have been good for a while. But this being the HN echo chamber, people here think that local models can only be used for coding, and are expecting Opus 4.8 on their iPhone. Turns out AI can be used for things other than just coding. Even tiny models (<4B parameters) can do tons of useful things on local devices. Search, index, summarization, troubleshooting, crafting documents/formatting, image analysis, transcription, object identification, robot navigation, text-to-speech, speech-to-text, browser/window control, MCP/tool calls, and much more.
Larger models just do more complex reasoning. But if you want them to be really good, you need a beefy Mac. They have the best combination of memory bandwidth and RAM to allow medium-sized models to run at speed. GPUs have less memory but more bandwidth, and AMD iGPUs have more memory but less bandwidth. The Mac is the best compromise on the market today.
Once you do have a beefy Mac, you want to run a dense model. This gives you the best possible result with the system you have. You can go MoE for faster results, use cutting-edge inference techniques, parameter tweaks, etc. But a basic dense model (at Q6 quant) on a big-ass mac will serve 90% of your coding needs.
wrxd 5 hours ago
I wonder how much local models hallucinate. I am getting almost daily an "Honest answers: I made that up." reply from Claude Opus when I challenge some silly thing it's trying to do.
malkosta 5 hours ago
The problem with QWEN is that it just can't edit files reliably, I had to hack Pi all over to reduce the pain, but still far from perfect...does Gemma 4 strugle on this?
bthornbury 4 hours ago
the qwopus 27b model is good for grunt work style tasks, even across multiple files. Piping a bunch of things through, small factoring changes, stuff that just takes time to type out.
I wouldn't rely on it for large stuff like codex though. I haven't tried out deepseek/kimi, if we could run those locally it would be great.
ridruejo 4 hours ago
Local models are one of the main drivers for our installer / Desktop app for OpenClaw https://holaclaw.ai (disclaimer I am one of the founders). The smaller models are really only suitable for the most basic tasks, but if you have 32gb-64gb you can get real work done (ie complex web workflows) without third party hosted models
osigurdson 4 hours ago
Running AI on timesharing mainframes does seem like an odd final state for the world.
daniban 6 hours ago
With Apple silicon and now the RTX Spark there are real discussions whether local AI is the future. The only problem is Western open source models are so far behind. I genuinely feel there's a push to fix this. Gemma is getting more frequent releases and Nvdia is quietly creating very cool small models. I hope both the hardware and models catch up and local really does emerge.
ibizaman 6 hours ago
Tangential but reading on mobile, the font size in the code snippets are all over the place. I actually have the same issue on my blog. Anyone knows why?
fl4regun 6 hours ago
In my experience, with a system of 32GB RAM and 24GB VRAM, no, they aren't that good.
fg137 6 hours ago
> I have a 2022 M2 Mac with 64 GB RAM
I closed the article after that.
The author has no idea what a privilege it is to have a machine like that for personal use, and how 99% of the population are not going to afford a setup like that.
Just some back-of-the-envelope maths will tell you that a $20/month Claude subscription makes much more sense financially.
[-]
- orf 6 hours ago
  99% of the population don’t code using models, local or remote. So that’s a useless metric.
  What % of developers could afford an older MacBook model, second hand? Far, far more than 1%.
  [-]
  - fg137 1 hour ago
    could or will?
    I am pretty sure even among software engineers, much fewer than 1% are going to spend their money on that.
    Most software engineers know how to spend their money responsibly.
wasimxyz 6 hours ago
https://canirun.ai
stared 6 hours ago
I really recommend Qwen3.6 27B.
Make some tests, and its 8 bit version runs at 30tok/s when using llama.cpp with MTP and run on Macbook Max M5. I have 128 GB, but but 64 GB is well enough. https://github.com/stared/benching-local-llms-on-apple-silic...
When using benchmarks, it gives more-or-less the level of SotA mid-late 2025.
[-]
- iagooar 6 hours ago
  I run the exact same model, on the exact same hardware - amazing results. Pair it with good search skills (Tavily, Brave, Exa) and you have a near-SOTA model on your desk.
- wizzledonker 6 hours ago
  Did you mean 2025?
  [-]
  - stared 6 hours ago
    Yes, fixed
xienze 6 hours ago
The big caveat here is that these local models require you to invest some time tweaking your harness, AGENTS.md, and skills in order to get things roughly to the level you'd expect. But something like Qwen3.6-27B with web search capabilities and a good set of skills really is impressive! Especially considering that you can go wild and not worry about token costs.
The other thing that people tend to gloss over is that you really do need to spend some $$$ on decent hardware. Yeah, you CAN run some 4-bit quant with heavily quantized cache on your 16GB card, but it's not going to be a great experience (I think this is where a lot of the "if you think it's gonna be any good, you're going to be disappointed" stuff comes from). Yes it's a lot of $$$ upfront but it's very much unknown when hardware prices are going to come back to reality. There's a lot of hopes and dreams that any minute now an H100 will be worth pennies because "that's how it's always been" w.r.t. computer hardware, but we are living in interesting times. So you can't just make the tired old assumptions that a Claude subscription over three years time will work out to be dramatically less than the value of some card three years from now. We STILL have basically anything with >=24GB VRAM appreciating in value, which is absolutely wild. What I'm saying is, the depreciation curve may very well be a lot less dramatic and fast than it used to be, going forward.
drchaim 5 hours ago
really want to try local models, but I don't have the hardware yet. Probably I'm the only one here still using a Mac Mini m1 8gb 2020. :/
[-]
- tennfown 4 hours ago
  I have some decent specs, but I’m stuck with AMD graphics card which I’ve been told is a non-starter
atulmy 3 hours ago
Exact reason I'm building csuite.so, do check it out and let me know if you need early access!
matrix12 2 hours ago
gemma:12b at 75% of frontier? Yeah....
Mr_Eri_Atlov 2 hours ago
I think this is a pivotal moment for LLMs.
Gemma 4 and Qwen3.6 27B aren't perfect, yet they are such a step forward from the previous generation that it's both feasible to get stuff done locally with patience and very likely that future releases will subvert cloud capabilities entirely.
Plus, they have definite reliability advantages over cloud models that can be wiped out by a government order or lobotomized to handle traffic surges.
jmyeet 2 hours ago
It's not "good". A more accurate description would be "sometimes useful and not far from being good". The author is using pretty small models. There have been a lot of improvements that scale in any case (eg MTP) but ultimately this is still hardware limited by 3 factors:
1. Memory bandwidth
2. VRAM size, which limits the size of a model you can use effectively. Yes you can swap but then you're taking a performance hit;
3. Raw FLOPS, including quantization.
Apple here is interesting because they have a shared memory model and you can buy Macs currently with up to 128GB of RAM (previously 256/612GB on Mac Studios, both discontinued). New M5 Mac Studios are expected in Q3 but that's not guaranteed. It may take until next year
Depending on the chip, Macs top out at ~900GB/s. A 5090 or 6000 Pro has 1800GB/s. A B100 is at like 3.2TB/s. A 5090 has, depending on how you count, 5-7x the FLOPS of a M5 Pro so a 5090 is still better than any current Max... except for the 32GB limit.
NVidia aggressively segment the market by limiting VRAM. The RTX 6000 Pro is basically a 5090 with slightly more CUDA cores and 96GB of VRAM instead of 32GB for $10-11k instead of $3k.
So let's project this into the future a little. The M6 Ultra/Max may well be 1TB+/s memory bandwidth with much higher FLOPS and thus actually be competitive for larger models. A 6090 in the current market will probably still have 32GB of VRAM if I had to guess. Maybe it goes up to 48GB.
But anyway I think we're only 2-3 years away from sub-$5000 hardware that does 100-300+tok/s on models larger than 31B. And that's going to be a game changer.
ZionBoggan 5 hours ago
This is actually a really insightful post !
jauntywundrkind 3 hours ago
i'd love to get to a point where big models can launch subagents that are fast and local. there's a lot of focus on token rate, but just as much, the way cloud providers have other latencies & processing styles not optimized for latency (running large batches all at once), and i think local might have some real wins. Gemma 4 seems already on the right track. lfm2.5-8b-a1b (https://www.liquid.ai/blog/lfm2-5-8b-a1b) and DiffusionGemma seem to both be very high token rate. but getting that latency down, so that a series of tool calls can happen faster, would be a real win. I think especially with good prompting that becomes much more possible.
One caveat, I have absolutely no patience for a lot of subagent systems, like opencode, where the subagent is walled off and incommunicatable. My subagents really should be their own session, that i can deal with as I please, with some MessageChannel like offerings/tools available to them. Ideally with modes where messages auto-flow in and out, and modes where I can be a gate-monitor. https://developer.mozilla.org/en-US/docs/Web/API/MessageChan...
Not really super related but MCP has been working on Events for a while. That ability to respond fast would be great. https://github.com/modelcontextprotocol/experimental-ext-tri...
Asking local to be fast feels like an obvious folly, but given how much better small models have got, and seeing these models tune themselves for speed: I want to hope!
jingw222 5 hours ago
open source must win
monegator 6 hours ago
I've been trying local models for the boring stuff you might be thinking about: writing small docs.
So i've tested a couple, and the speed is finally impressive. My colleague uses paid tiers of claude and GPT, and the speed is comparable. Maybe even slightly faster on my end.
The problem is: i'm running the model on my work laptop, a 12th gen i5 with 16GB of RAM (which, you know, i asked to upgrade to 64, but that was right at the time of the great RAM shortage of the '20s) so i'm pretty limited in what i can use. And this is running alongside the usual suspects: Web browser hugging 1.5GB, MPLABX hugging 3, windows taking at least 5 just to sit idle, thermal throttled to 1GHz ... And yet its speed is comparable to a paid service. A lunch's worth of tokens vs a few cents of power.
So, what i found, what i fount... What i found is that i need AT LEAST 16k of context window, otherwise they will halt when i pass a small C file for analysis. And coding models will shit the bed with 4k. But we all know that, context size is King.
I found out that Qwen will keep looping while thinking, but that's not a surprise to you, either. But give it enough time and you will get an useful answer. I was hoping to using it as a better warning system for some languages, but i fear i need muuuch more context size, because i tried to feed a file that had a function with an endless loop:
At 4k context it almost shit the bed if i gave it just the offending function, then told it where to look at. At 16k context, with the whole file, it needed some guidance to what the problem was, and after 10-15 minutes of thinking it found the issue. Problem is, it kept second guessing itself for another 20 minutes on the same unrelated thing before giving the output. For which the fix was wrong, but the semanthic was correct. Good enough. Maybe it will be faster if i don't ask for a fix (which i didn't i just asked to look for a specific issue)
Wish i had 3 times the RAM so i can see what happens with more context.
Then i gave it the task to analyze a C file to make an API document. It took half an hour, but then i had a good starting point, which i had to keep changing because it would confuse commands with IDs and things like that.
This was the Qwen 3.5 9B model.
I then tested Gemma 4, being impressed at the tokens per second it gives on my Pixel 8A. Same tasks: same issues with short context, with long context it gave absolutely useless answers when looking at code, but it took 1/3 the time of qwen.
In producing documentation, instead, it was much faster, and it never hallucinated data. Good. in 15 minutes i had everything done.
Not bad for stuff running on a business laptop, while doing actual work.
Tomorrow i will try Qwen 3.6, let's see how it goes..
holoduke 4 hours ago
Good? My Macbook m3 with 36gb locked up after it filled all memory with Gemma4. A bit useful yes. But it eats all resources. For local models to be useful we need at least 128gb of system memory and 512gb of video memory. Plus 8 times the compute of a single 5090/h200
mrkn1 47 minutes ago
[flagged]
aplomb1026 4 hours ago
[flagged]
eugmai86 4 hours ago
[flagged]
RishiByte 4 hours ago
[flagged]
kordlessagain 7 hours ago
[dead]
maxothex 6 hours ago
[flagged]
Veer_Pratap08 6 hours ago
[flagged]
azzzxcc123 5 hours ago
[dead]
huflungdung 4 hours ago
[dead]
Rekindle8090 4 hours ago
[dead]
Lapsa 3 hours ago
[dead]
iluvcommunism 7 hours ago
[dead]