Runing GLM-5.2 on local hardware

(unsloth.ai)

191 points | by TechTechTech 5 hours ago

19 comments

segmondy 28 minutes ago
I run Q4_K_XL. All it takes to get about 6 tk/sec is 512GB of RAM and two 3090 GPUs with llama.cpp -cmoe. I also have crappy 2400MHz DDR4; 3200MHz would bring that speed up to about 9 tk/sec. I also have an OK 32-core EPYC CPU; a better 64-core would bring it up to about 11 tk/sec. I did a budget build before the crazy hardware costs and I regret it every day. Nevertheless, it's fantastic being able to run this model at home. It's great for planning, and for one-shot prompting once you have a plan or all the context you need. This entire build cost $2400 when it was built. If you're willing to be resourceful, you can find ways to run these models at home. I often get the silly question of why, and suggestions about how much I could save using a cloud API, but the Fable drama has opened eyes to why it's good for us to be independent. Thanks, team Unsloth. Q4_K_XL is solid; if you are going to grab a quant, make sure to get the K_XL variant if it fits.
- redox99 5 minutes ago
  That's crazy good for $2400.
xrd 4 hours ago
So close! My machine with 192GB RAM + RTX 3090 24GB can almost run this. It says it needs 24GB of VRAM and 256GB of RAM for MoE offloading.
https://unsloth.ai/docs/models/glm-5.2#usage-guide
In a prior thread, someone said it would take $500k in hardware:
https://news.ycombinator.com/item?id=48629970
- elliotbnvl 3 hours ago
  $500k is a vast overestimation. For massive concurrency at FP8 or even BF16 maybe.
  NVFP4 at reasonable speeds (~120 tok/s) and concurrency is possible at a $80/90k figure with today's prices, maybe even less. That buys you 6 RTX 6000 PRO Blackwells, a decent CPU and motherboard, power supply. 576gb of VRAM.
  You could do it for under $50k if you're OK with 40 tok/s decode, ~1200 tok/s prefill.
  - hbbio 1 hour ago
    Yes, a single GB300 workstation also does it, probably even more than 120tok/s.
    Official price 85k...
  - __m 2 hours ago
    How fast will the hardware become outdated? Are there big improvements expected in the next 3 years?
    - easygenes 1 hour ago
      M5 Ultra will ship before end of year, likely. Though with current RAM shortage, likely max spec will be 256GB and in short supply.
      In late 2027 or early 2028, Nvidia will release Vera Rubin DGX Spark, likely with double or better the performance of current Blackwell, though unclear if memory capacity will go up much from current 128GB. Two to four of those will run models like this decently.
      In 2028 we should expect Vera Rubin RTX discrete lineup, including the replacement to the RTX PRO 6000. Likely memory spec will be minimum 128GB. Good chance of up to 200GB. Two to four of those will run NVFP4 models in this class very well.
    - digitaltrees 35 minutes ago
      I feel like the models are good enough for a decade of future work. Once you have a working setup, you can keep using it to do the work at the same level. There will be better stuff, and it may make that type of work obsolete, but if you can do useful things with it, it won’t be worth less.
    - segmondy 23 minutes ago
      The P40 was released in 2016 and is still selling like hotcakes!
- mgambati 3 hours ago
  With Q2 you wouldn’t get good results. The ideal range for coding is at least Q8.
  - kibibu 3 hours ago
    According to this very article, 4-bit dynamic is essentially lossless
    - Aurornis 2 hours ago
      Watch out. Those claims are often made based on KL-divergence over some arbitrary corpus, not performance in the real world or benchmarks.
      I’ve found that I need to go a couple steps past whatever quantizations are good enough in the KL-divergence testing to get good performance in real tasks with long context. So when Q4 is claimed to be lossless I end up with Q5 or Q6 for actual long-context tasks.
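For context on the metric being discussed: KL-divergence compares the next-token probability distribution of the full-precision model with that of the quantized model, position by position. A minimal sketch in plain Python follows. The logits are made-up illustrative values, not measurements from GLM-5.2, and real evaluations average this over many positions in a corpus.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats, where p is the reference distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for one token position over a 4-token vocabulary.
full_precision = softmax([2.0, 1.0, 0.5, -1.0])
quantized = softmax([1.9, 1.1, 0.4, -0.8])

print(kl_divergence(full_precision, full_precision))  # identical distributions give 0.0
print(kl_divergence(full_precision, quantized))       # a small positive number
```

A low average KL-divergence says the two distributions are close on the test corpus; as the comment above points out, it does not by itself say how the quantized model behaves on long-context tasks.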
- uberex 2 hours ago
  Funny I casually asked Gemini and it said 500k for unquantized with decent throughput.
  - stymaar 1 hour ago
    This is why you shouldn't uncritically believe an answer from an LLM (nor, for that matter, any answer from a human).
  - colinsane 42 minutes ago
    i asked gemini and it replied with "Error: 400 Your prompt was blocked by safety filters. Please revise and try again."
  - j45 54 minutes ago
    LLMs aren't discrete calculators or estimators of things unless framed and guided to do so.
- cheema33 3 hours ago
  I have the RAM, but not the VRAM. What kind of speed (tokens per second) could you expect from a 3090 with 24GB of VRAM? I am somewhat tempted to pick up a GPU with 24GB of VRAM.
  - phamilton 2 hours ago
    Generation is basically just memory bandwidth math.
    Each token has to read all the active weights. I think that's around 40B active parameters. At a 4-bit quant that's 20GB. With 100GB/s of memory bandwidth (replace with whatever your bandwidth is), you get 5 tokens per second.
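The arithmetic in the comment above can be sketched as a small function. This is a rough upper bound under the comment's assumption that decoding is purely memory-bandwidth bound; it ignores KV-cache reads, compute, and the split between GPU and system memory. The function name is illustrative.

```python
def decode_tokens_per_second(active_params_billion: float,
                             bits_per_weight: float,
                             bandwidth_gb_per_s: float) -> float:
    """Upper-bound decode speed when every token must stream all active weights."""
    # Gigabytes read per token: active parameters (in billions) times bytes per parameter.
    gb_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_per_s / gb_per_token


# Numbers from the comment: ~40B active parameters, 4-bit quant, 100 GB/s.
print(decode_tokens_per_second(40, 4, 100))  # 5.0
```

Swapping in your own bandwidth figure gives a quick sanity check before buying hardware; measured speeds will usually land below this bound.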
- ijidak 1 hour ago
  Crossing my fingers that this boom jumpstarts '90s-like improvements in computing hardware.
  I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.
  Most of the money and energy went to mobile for the last fifteen years.
  Affordable local inference might be the gravy train the server, desktop, and laptop manufacturers need to get back in gear.
  - gruez 36 minutes ago
    >I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.
    No, we're running into limits of moore's law, and it's showing in prices for new nodes, where they're getting denser but not cheaper.
  - linzhangrun 39 minutes ago
    Physical limitations of the manufacturing process may be a more significant factor, starting from TSMC's 10nm node ten years ago.
skiing_crawling 2 hours ago
"it can fit" on 256GB of RAM, but it will be heavily quantized and still run very slowly. The headline number is not token generation, its prompt processing. So if you get 10 tok/s and an API gives you 20-30 tok/s, it doesn't seem that bad on its face, but a mac studio or any other machine that's not loading all of it into GPU will do PP 20-50X slower than a purely GPU based setup, which is what actually makes this unusable without $50k in GPUs.
On top of that, you will still be heavily quantized.
- gerdesj 2 hours ago
  A nvidia spark thingie has 128GB unified RAM. They also have a dual port version of one of these things: https://www.nvidia.com/content/dam/en-zz/Solutions/networkin.... ie 2 x 100GB/s ports, they may even be 2 x 200GB/s. Once I've got my paws on one, I'll know more.
  You can cluster these beasts too. Two and three (with two IP subnets) is fairly obvious. Four or more might need a switch depending on how much network latency affects things.
  Apple seem to have forgotten about M series with gobs of RAM. I can't get the Apple shop to show more than 96GB of unified RAM and that costs a kidney.
  - mapontosevenths 1 hour ago
    I have one, and I love it. That said, my buddy's Mac smokes it for inference workloads in terms of tokens per second AND it's more usable for other things.
    If you are training and doing research it's great, and if you want to cluster them it can't be beat, but if you just want local inference on a single box, buy a Mac or even a Strix Halo device.
    - colinsane 37 minutes ago
      can those macs boot linux? i've heard about Asahi but have no idea how far along they are. i've got my fleet configured with nix and sure, nix can target darwin, but there's a _lot_ of sharp edges there: i don't really want to pull that thread unless i have to...
      - mapontosevenths 5 minutes ago
        I don't know. I think he just uses LMStudio most of the time on his, but that's one place I can say the spark really shines for me.
        I'm a Linux guy, but also don't always have a lot of time. The Spark comes out of the box with a nice Linux distro that's pre-configured to be easy to set up, and the guides and online resources make getting up and running trivial, even for some complex tasks. You would have to do a LOT of tinkering just to figure out some of the things the Nvidia resources walk you through natively. They have guides for a ton of stuff that include the optimal settings so you don't have to figure it all out through trial and error.
        Check out these "playbooks" for some examples. [0] There's a lot to be said for not having to piece all that together yourself.
        [0] https://build.nvidia.com/spark
        I think going from unboxing mine, to setting it up to run headless, to generating tokens took about 20 minutes total.
    - Fizz43 1 hour ago
      which mac is smoking the spark?
      - pmarreck 51 minutes ago
        pretty much any of them, dude, as long as you have enough RAM, since it uses unified RAM and a powerful SoC CPU/GPU. Literally any M-class model, but the M5 is currently top tier.
          - mapontosevenths 15 minutes ago
            Yep. Memory bandwidth is what decides how fast LLMs generate tokens (mostly). The DGX Spark has something like 270 GB/s of memory bandwidth, and the M5 Ultra is ~615 GB/s. Theoretically DOUBLE the speed. In practice he only generates like 25% more tok/s, but that's still very impressive.
            The Spark can fine-tune models in 1/4 the time and excels at other compute tasks in ways that the Mac never can. Plus the high-bandwidth ConnectX-7 ports would be like $1700 to buy on a card just for the network adapters... But for generating tokens, it just plain loses.
  - jauntywundrkind 1 hour ago
    200 Gb / s (not GB/s)!
    (Still potentially very useful! But not magically ultra fast.)
  - Computer0 2 hours ago
    128 gb of much slower ram than Apple.
pheggs 3 hours ago
I feel like the gap is closing to be able to run good enough models locally even for coding and I would assume it could make some companies a bit nervous. Am I wrong about that?
- UncleOxidant 3 hours ago
  If we didn't have a RAM/GPU shortage right now they would be more nervous than they are. But as it is, very few people are going to be able to afford a rig that can run this model effectively. That's probably not going to change for several more years yet. I think if the Z.ai folks decide to come out with a flash version of GLM-5.2 specialized for coding that came in at about 80B params, then the US frontier labs would probably be more worried. Overall, the Chinese AI companies have been showing the way to do the same amount with less (sometimes much less), and as that trend continues it's going to make the frontier labs worried. But even the Chinese AI companies are going to want to protect their moat by not releasing capable models that are significantly smaller than their current flagship models. Alibaba's Qwen seems to be there now: it's gotten mighty quiet from them lately, their latest 395B model is just too large for most folks to run at home, and they don't seem to be making any noises about releasing smaller ones this time around.
  - gpm 2 hours ago
    The ram/gpu shortage won't last forever though. Moreover, we can be pretty confident that long-term the prices will obey Wright's law and come down significantly (from the pre-shortage prices) as we learn to produce them more efficiently.
    LLM companies are valued as if they're going to have some enduring monopoly that they can extract money from... GLM-5.2 and similar models make that valuation very very questionable.
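For readers unfamiliar with Wright's law: it says unit cost falls by a fixed fraction each time cumulative production doubles. A minimal sketch follows. The 20% learning rate and the starting cost of 100 are illustrative assumptions, not figures for RAM or GPUs.

```python
import math


def wright_unit_cost(first_unit_cost: float,
                     cumulative_units: float,
                     learning_rate: float) -> float:
    """Unit cost after `cumulative_units` have been produced, per Wright's law.

    `learning_rate` is the fractional cost drop per doubling of cumulative output.
    """
    exponent = math.log2(1.0 - learning_rate)  # negative for any positive learning rate
    return first_unit_cost * cumulative_units ** exponent


# Illustrative numbers only: cost 100 for the first unit, 20% drop per doubling.
for units in (1, 2, 4, 8):
    print(units, round(wright_unit_cost(100.0, units, 0.20), 1))
```

With these assumed inputs the cost goes from 100 to roughly 80, 64, and 51.2 over three doublings. How fast memory and GPU production actually follows such a curve is exactly what the replies below argue about.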
    - UncleOxidant 2 hours ago
      > The ram/gpu shortage won't last forever though.
      No disagreement there, but it could easily last another 3 to 5 years which is a long time in tech terms.
    - mannanj 2 hours ago
      > The ram/gpu shortage won't last forever though
      Don't underestimate the market's ability to remain irrational.
      - colinsane 27 minutes ago
        the companies which have the power to alleviate these shortages are the same companies who are profiting most from the shortage. scarcity is an asset; it's not irrational that a concentrated market will produce more of that asset.
        - selectodude 7 minutes ago
          The solution for high prices is high prices.
          If making RAM and SSDs is now cause for a 10 figure valuation, after enough time somebody will dive in.
  - elorant 2 hours ago
    Very few people, but quite a lot of companies especially after per token pricing took effect and companies see their invoices skyrocketing. You pay an upfront cost once and you’re done.
  - verdverm 2 hours ago
    I suspect the time horizon is shorter because of software advances. We are getting more capability out of smaller models
    Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)
    - UncleOxidant 29 minutes ago
      > Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)
      True, Qwen3.6-27B is amazing for its size. However, it seems likely that we're not going to see any more of these smaller models from Alibaba/Qwen since several key players exited that organization a few months back.
- simplyluke 54 minutes ago
  You don't even need to run them locally for them to be a threat. Plenty of companies are looking at paying third party companies to host these models and they come in at fractions of the price of the frontier labs.
- cogman10 3 hours ago
  I don't think so. I could easily see a company deciding to host and run these models for their own development. If you have a dev team of about 10 people, a one time $50k investment in an LLM server has to be pretty tempting. Unlimited tokens, decent performance, upgrade options, and potential product integrations.
  For companies wanting LLMs in their products in general, I have to think going the local LLM route is even more tempting. Somewhat dumb models are more than good enough for a lot of the things people are integrating LLMs into their products for.
  - twelvechairs 2 hours ago
    Surely for most the desire is just an LLM provider that doesn't store or sell their queries (including to national actors). As long as that is allowed to happen, surely it's the answer for the vast majority.
  - eventualcomp 3 hours ago
    Where is $50k coming from again?
    - stingraycharles 3 hours ago
      That’s less than the monthly salary of 10 software engineers, and assuming they pay API prices, probably earns itself back in about a year.
      Having said that, I don’t think it’s all that tempting for companies at all, considering this whole market is developing rapidly and it’s nearly impossible to predict where we’ll be at in a year or two.
      - cogman10 2 hours ago
        The hardware requirements aren't evolving and the local models have only been improving.
        It's not like you'd lose capabilities, if anything this solution just gets better with time.
          - chatmasta 2 hours ago
            If the newer models require more/better hardware then you’ll lose capabilities.
            I think you’re better off renting GPU instances and running all the software on those. It’ll be cheaper than Anthropic and OpenRouter but slightly more expensive than electricity and depreciation of hardware.
            - cogman10 1 hour ago
              The newer models don't require more/better hardware. There's a small army of local LLM enthusiasts who are running LLMs using 3090s and H100s because they have lots of memory. Them being old isn't really that big of an issue as the compute power needed is relatively low all things considered.
              The number of parameters needed for these open weight models has mostly stabilized so the actual memory requirements aren't likely to change all that much.
    - cogman10 3 hours ago
      As in who pays for it or how did I arrive at that number?
      For who pays for it, obviously the employer would.
      For "how did I arrive at this number": a ballpark estimate from what I know about part costs. Most of that money would go towards AI cards: about $5k for the motherboard, CPU, power supply, etc., and $45k for as much RAM and as big/expensive Nvidia cards as you can get your hands on. The B300 has 288GB of VRAM in it. Probably what you'd be after.
- scosman 40 minutes ago
  It's not economic to run them locally. It's amazing for privacy, and a fun hobby. But you're either looking at super slow CPU builds with $10k in RAM, $90k worth of GPUs, or a really quantized model that doesn't compare in quality.
  I might build one for fun, but it's not going to change the economics alone. Still exciting it's possible.
- fny 3 hours ago
  The RAM requirements are still pretty painful.
  - yieldcrv 3 hours ago
    equilibrium in one or two more years on the consumer/prosumer side
    think Apple M6 or M7 with a currently unforeseen denser memory style, 256gb RAM
    a couple inference or cache improvements on the algorithmic side, using less ram for context windows and doubling token speed again
    denser open source models, packing more experts for smaller active layers
    it'll still be expensive but like $8,000 - $13,000 instead of $450,000 worth of B200s
    - stingraycharles 3 hours ago
      Fairly certain that model sizes and computational requirements will grow as the price for LLM compute drops.
      - 3stacks 1 hour ago
        Maybe there's a conversation to be had about how much is enough... Unless something beyond my imagination happened, I would be happy enough with Opus 4.5 levels of productivity
      - yieldcrv 1 hour ago
        have you seen the open source LLM space? people fulfill all niches and there are active communities at every range of RAM and all are looking for the most capable in their respective range
        a lot of innovation occurring
- CamouflagedKiwi 3 hours ago
  The hardware requirements to run this locally are still very high. Seems far enough off mainstream for those companies not to be too worried yet.
- notatoad 2 hours ago
  locally on what hardware? something like the new dgx spark, ryzen halo, or mac studio will cost you ~ $4k plus whatever you pay for power. at the rate AI is currently progressing, i think you'd be optimistic to consider that as having a 2 year depreciation.
  for $4k, you can get 20 months of claude max 200. i'd take claude over the hardware.
  anthropic will have something to worry about when you can run a local model on your macbook that can code. but i think we're quite a ways off from that.
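The comparison above is a simple break-even calculation. A rough sketch, using the comment's $4k hardware price and $200/month subscription; the optional monthly power cost parameter is a hypothetical addition, and the function name is illustrative.

```python
def breakeven_months(hardware_cost: float,
                     monthly_subscription: float,
                     monthly_power_cost: float = 0.0) -> float:
    """Months of subscription fees needed to equal the up-front hardware cost."""
    monthly_saving = monthly_subscription - monthly_power_cost
    if monthly_saving <= 0:
        raise ValueError("running costs meet or exceed the subscription price")
    return hardware_cost / monthly_saving


# Figures from the comment: ~$4k of hardware versus a $200/month plan.
print(breakeven_months(4000, 200))  # 20.0
```

This only compares cash outlay. It says nothing about the quality gap between the local model and the hosted one, which is the comment's main point.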
  - chatmasta 2 hours ago
    Just a hunch, but I think the most cost effective “local” deployment method right now is renting GPU clusters by the hour and running all the inference software on them yourself. This will be cheaper than capital expenditure on hardware that will depreciate and become last-gen, and cheaper than OpenRouter pay per token.
  - tomr75 2 hours ago
    people who can't afford Claude max 200 are using qwen 3.6 27b for local coding assistance already
- stymaar 1 hour ago
  Honestly, Qwen3.6 is already what you need for the large majority of tasks.
  (I only ask Opus every 5 to 10 requests, when my local Qwen fails or when I encounter a situation that is too world-knowledge-specific to be worth asking Qwen about, but that way you can live easily with Claude's cheapest plan without ever hitting the usage limit.)
CGamesPlay 2 hours ago
Can somebody help me understand the Quantization Analysis? It says "dynamic 4-bit UD-Q4_K_XL and dynamic 5-bit UD-Q5_K_XL are generally lossless" while showing a top-1% token agreement on the chart of 97.5%. Not what I would consider "generally lossless". Is this implying that some post-processing is going to account for the 2.5% loss? Beam search?
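One possible reading of the chart, offered as an assumption rather than as the article's definition: top-1 agreement is the share of token positions where the quantized model's most likely next token matches the full-precision model's. A minimal sketch with made-up logits:

```python
def top1_agreement(reference_logits, candidate_logits):
    """Fraction of positions where both models rank the same token first."""
    matches = 0
    for ref, cand in zip(reference_logits, candidate_logits):
        if ref.index(max(ref)) == cand.index(max(cand)):
            matches += 1
    return matches / len(reference_logits)


# Hypothetical logits for 4 positions over a 3-token vocabulary.
reference = [[2.0, 1.0, 0.1], [0.2, 3.0, 0.5], [1.0, 0.9, 0.1], [0.1, 0.2, 4.0]]
candidate = [[1.8, 1.1, 0.2], [0.3, 2.7, 0.6], [0.9, 1.0, 0.1], [0.2, 0.1, 3.9]]

print(top1_agreement(reference, candidate))  # 0.75
```

Under that reading, 97.5% agreement means the two models would pick a different greedy token at roughly 1 position in 40. Note that the disagreement in the toy data happens where the reference model's top two tokens were nearly tied; whether real disagreements cluster on such near-ties is not something the figure alone can tell you.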
andai 3 hours ago
How is this model half the size of DeepSeek V4 Pro? Is it because DeepSeek did more aggressive cost cutting on the attention mechanism?
jonathanhefner 52 minutes ago
> Runing GLM-5.2 on local hardware
Do the runes make it smarter or just run faster (or both)?
ramgine 1 hour ago
I have up to 1TB of DDR4 in my server, but it only has a 3060 with 12GB of VRAM. Would getting a 24GB card make this a viable system, or am I throwing money away?
- segmondy 19 minutes ago
  You can run it today with that 12GB 3060, but I would suggest getting two 3090s. Use the -cmoe option. This will keep the attention/router tensors on the GPU and offload the rest to system memory. Try it now and see the performance.
- rnewme 18 minutes ago
  Should work yes.
dofm 1 hour ago
Can't run this myself.
But I do like Unsloth Studio, quite a lot. It's nicely designed.
snootypoot 1 hour ago
if sam altman didn't exist i could afford to run this
Wowfunhappy 2 hours ago
> The full model requires 1.51TB of disk space
...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?
I definitely don't have the hardware to run this model at any kind of reasonable speed (and I don't want to use a super aggressive quantization that would kill performance). Even so, I think it would be cool to retain an offline copy, in case... I don't really know, a solar flare destroys the internet some day, or maybe a zombie apocalypse. It would just be cool to have.
But 1.5 TB is a bit too much! If it could be compressed down into something semi kind of reasonable, that would be fun!
- gcr 2 hours ago
  There are two forms of compression relevant to LLMs:
  1. Reduce the number of parameters
  2. Reduce the resolution of each parameter (quantization)
  For 1, changing the architecture is typically only possible by the labs producing the models, which is why each OSS model release tends to feature a small number of carefully chosen model sizes (for example, Gemma4 comes in e2B, e4B, 12B, 26Ba4B, and 31B sizes).
  Generally, models with higher parameter counts have more world knowledge. For coding models, this shows up as a stronger command of uncommon libraries/languages. Very small models (<20B) also lack “smarts.”
  Reducing the resolution of each parameter is easier which is why lots of practitioners have their own quantizations, but this makes it harder for a model to “think” fluently. Interacting with heavily quantized models feels like interacting with someone who didn’t get any sleep the night before.
  Models that have higher-fidelity quantization take more RAM and have higher “smarts,” but don’t necessarily have more world knowledge. Models with aggressive quantization tend to be more likely to make rookie mistakes, emit malformed tool calls, get stuck in loops, or even exhibit signs of “neuroticism” / “distress” in their thinking tokens.
  Parameter counts = world knowledge, quantization = “smarts.”
  This is a soft rule of thumb, the difference isn’t very strong.
- SirMadam 2 hours ago
  SOTA LLM-specific compression achieves around 54%! https://arxiv.org/abs/2505.06252v3
- redox99 2 hours ago
  Probably not at all, considering weights are randomly initialized.
hxii 2 hours ago
Any time I see one of these posts about models of this size a quote comes to mind – "Your Scientists Were So Preoccupied With Whether Or Not They Could, They Didn’t Stop To Think If They Should".
Only a select few have the hardware required to run this to begin with, and even then the forecasted performance makes me wonder if it’s worth it at all.
- segmondy 14 minutes ago
  Completely worth it, even at 6 tk a second. If I can get 2 hours of token generation, that's 2 hrs * 3600 secs * 6 tk = 43200 tokens; at about 10 tk to a line of code, that's about 4320 lines. Let's trim it even more and cut it in half: that's 2160 lines of code a day. Most professional programmers can't deliver that much consistently in a day.
  The key to a model this large is to (1) use it to plan, generating lots of plans and farming them out to a smaller model, and (2) for very specific and complicated portions, prompt precisely for what you need.
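The back-of-the-envelope estimate above, written out so the inputs can be changed. The 10 tokens per line and the 50% discount are the comment's own rough assumptions; the function and parameter names are illustrative.

```python
def lines_of_code(tokens_per_second: float,
                  hours: float,
                  tokens_per_line: float = 10,
                  keep_fraction: float = 1.0) -> float:
    """Estimated lines of code produced in a generation session."""
    total_tokens = tokens_per_second * hours * 3600  # seconds per hour
    return total_tokens / tokens_per_line * keep_fraction


print(lines_of_code(6, 2))                     # 4320.0
print(lines_of_code(6, 2, keep_fraction=0.5))  # 2160.0
```

The estimate counts only generated tokens. Prompt processing time and thinking tokens, which other comments in the thread flag as the slow part on CPU-heavy builds, would lower the real figure.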
nullc 2 hours ago
Just running cpu only w/ Q6 on 9684X I get about 1tok/s ... also still get about 1tok/s/stream when running 16 in parallel.
zuzululu 3 hours ago
I wonder if AMD's new AI chip can run this with ease? I'm seriously considering buying it. GLM 5.2 is just shy of GPT 5.4 so I would welcome offloading any grunt work locally
I am very excited for local LLMs I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR
This should put more pressure on the frontier models to avoid sitting on any fancy stuff and lower token prices as a whole.
Nothing beats a local LLM disconnected from the cloud.
- UncleOxidant 2 hours ago
  Are you talking about Medusa Halo? It's going to support up to 256GB of unified memory (up from 128GB for Strix Halo and 192GB for Gorgon Halo). That might be just barely enough to run a 2-bit quant of GLM-5.2. It will expand the memory bus to 384 bits, vs. 256 bits for Strix Halo, which will help with bandwidth (projected to be around 500 GB/sec). But don't expect Medusa Halo-based machines to appear until sometime in 2028.
  The other way this could go is that Z.ai could decide to release a smaller model targeted towards coding. They've done that before (GLM-4.7-Flash had 30B params). It would be great if they decided to release something in the 80B-100B param range. Something that size would easily run in a current Strix Halo system.
  - zuzululu 1 hour ago
    yeah you are correct 2 bit quant won't be enough
    guess we'll be paying $200/month for a while
- nl 3 hours ago
  > I am very excited for local LLMs I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR
  We are maybe 10 years off that.
  RAM prices are going to continue to increase for the next 2 years at least.
  Even putting that aside it's currently around 40-70,000 EUR to run this with a FP8 quantization (which you need to get close to maximum performance).
  To actually get GPT 5.5-xhigh performance in the real world you need more headroom to support things like subagents (which will fill up your KV cache).
  I like local models but realism is important. The sweet spot for the next 3 years will continue to be ~35B MoE models. They might match GPT 5.5-xhigh for chat-style problems but not for coding.
  - hsuduebc2 2 hours ago
    I wonder whether, in the near future, AI companies will acquire some RAM producers just to keep RAM prices up. It could seriously hurt their business if companies are eventually able to host their own AI.
    - nl 1 hour ago
      I think AI companies have enough things to spend capital on already.
- Iolaum 3 hours ago
  At full precision GLM 5.2 may be close to GPT 5.4. But at Q2 or whatever one needs in order to run it on a prosumer device it will be worse.
  Also I m not sure where you are getting the under 2k value. I bought a Framework desktop 128GB last year and my setup was around 2.7k. The same setup now sells for around 4.7k.
- nh43215rgb 3 hours ago
  Even with the upcoming AI Max+ PRO 495 we are capped at 192GB, so no...
- kccqzy 3 hours ago
  The AMD 395 supports up to 128GB unified RAM. So still not enough even at 1-bit quant unfortunately.
- benjiro29 3 hours ago
  "GLM 5.2 is just shy of GPT 5.4"... If your running the full model. As in have 750 (FP8) to 1.5TB(FP16) of memory available.
  Do not mix the benchmark results of GLM 5.2 at FP16/FP8 with those at FP4 or FP2.
  * FP4 will mean an accuracy loss of about 3%. Not noticeable, but more chance for mistakes.
  * FP2 is what most people are able to run at home for a "reasonable" price. You're looking at over 17% loss in accuracy.
  At that point, you're running at less than claude-sonnet-4.6, as the issues compound with accuracy losses. And reasonably priced is still in the ~$5000 range (192GB of RAM plus a 32GB GPU for the active weights/KV cache).
  For that price you could use a Codex / Claude Pro subscription for the next 4+ years with better models (by default), let alone compared with an FP2 GLM 5.2 version. And you're looking at < 10 tok/s. A Mac Studio with 512GB will net you 18 to 20+ tok/s with FP4, but ... I mean, those used to be $10,000.
  Unfortunately the local hardware cost is a major issue for running large models like that.
  Edit: It's funny: whenever the issue of cost, and what you need to give up versus the subscription services, comes up, there are always people who downvote in bad faith.
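The memory figures quoted in the comment above (750GB at FP8, 1.5TB at FP16) imply roughly 750B parameters. Taking that inferred count as an assumption, a small sketch of how the weight footprint scales with bits per weight. It covers weights only, with no KV cache or runtime overhead, and treats every weight as stored at the same precision, which dynamic quants do not do.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes."""
    return params_billion * bits_per_weight / 8


PARAMS_BILLION = 750  # assumption inferred from the 750GB (FP8) figure in the comment

for bits in (16, 8, 4, 2):
    size = weight_footprint_gb(PARAMS_BILLION, bits)
    print(f"{bits:>2}-bit: {size:.1f} GB")
```

Under these assumptions a 4-bit quant needs about 375GB and a 2-bit quant about 187.5GB, which is in line with the 192GB-class machines discussed in this subthread only fitting the most aggressive quantizations.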
  - kgeist 10 minutes ago
    The cost of local hardware is amortized if a whole team uses it instead of just 1 dev (GPUs are extremely underutilized if you launch just 1 generation stream). I'm not sure why everyone always assumes solo devs with Macs. We've just ordered a large datacenter-grade node for use by the whole dev team, and the calculations show that it's going to cost the same amount of money if we kept using AWS Bedrock (infosec reasons) for a couple years but... it gives us 100% privacy, we're immune to all the AI regulation dramas in the US/EU, all the random outages, and the developers won't have to think about token limits/weekly caps etc. ever again. And all that with a model which is Opus-grade
  - zuzululu 1 hour ago
    you are right that means GLM is still quite far off from truly competitive
    i think your answer was perfect not sure why you are being downvoted