

The NPUs will have to be rearchitected to optimize themselves for bitnet.
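To make the bitnet point concrete: the weights get snapped to -1/0/+1, so the hot loop becomes adds/subtracts instead of full multiplies. A rough sketch of the idea (my own illustration, not any reference implementation):

```python
import torch

def ternary_quantize(w: torch.Tensor):
    # BitNet-style "1.58-bit" weights: scale by the mean absolute value, then
    # snap every weight to -1, 0, or +1. Rough illustration, not reference code.
    scale = w.abs().mean().clamp(min=1e-8)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

w = torch.randn(256, 256)
x = torch.randn(256)

w_q, scale = ternary_quantize(w)

# Inference approximates w @ x as scale * (w_q @ x); since w_q only holds
# {-1, 0, +1}, a bitnet-tuned NPU could swap multipliers for adders.
y_approx = scale * (w_q @ x)
print(y_approx.shape)
```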
Yeah it’s awesome.
There's a lot of optimization left on the table, too. Check out this fork, which doesn't quite work yet (it seems a Python config file is missing?) but is integrating some optimizations (like torch.compile) and extra options (like last frames or negative prompts): https://github.com/wongfei2009/FramePack
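For anyone wondering what the torch.compile part buys: it's basically a one-line wrapper that JIT-compiles and fuses a module's kernels. Generic pattern below, not FramePack's actual code (the model here is just a stand-in):

```python
import torch

# Not FramePack's actual code; just the generic pattern. torch.compile wraps a
# module and JIT-fuses its kernels, so repeated calls reuse the compiled graph.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# "reduce-overhead" trades a longer warmup for faster steady-state inference.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 1024)
with torch.no_grad():
    y = compiled(x)  # first call compiles, later calls are fast
print(y.shape)
```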
It was selectively given to institutions and “major” celebrities before that.
Selling them dilutes any meaning of "verified," because any Joe can just pay for extra engagement. It's a perverse incentive: the people most interested in grabbing attention buy it and get amplified.
It really has little to do with Musk.
the whole concept is stupid.
+1
Being that algorithmic just makes any Twitter-like design too easy to abuse.
Again, Lemmy (and Reddit) is far from perfect, but fundamentally, grouping posts and feeds by niche is way better. It incentivizes little communities that care about their own health, whereas users shouting into the Twitter maw have zero control over any of that.
Not sure where you’re going with that, but it’s a perverse incentive, just like the engagement algorithm.
Elon is a problem because he can literally force himself into everyone’s feeds, but also because he always posts polarizing/enraging things these days.
Healthy social media design/UI is all about incentivizing good, healthy communities and posts. Lemmy is not perfect, but simply not designing for engagement/profit because Lemmy is “self hosted” instead of commercial is massive.
This. $200 is what raised my eyebrows; Windows should not cost them that much when it's usually "standard."
Is this a tariff thing? Like is it suddenly more expensive to license Windows, hence pushing OEMs to offer discount options?
Lenovo is at least partially Chinese.
I mean, “modest” may be too strong a word, but a 2080 TI-ish workstation is not particularly exorbitant in the research space. Especially considering the insane dataset size (years of noisy, raw space telescope data) they’re processing here.
Also, that's not always true. Some "AI" models, especially oldschool ones, function fine on old CPUs. There are also efforts (like bitnet) to run larger ones fast and cheaply.
I have no idea if it has any impact on the actual results though.
Is it a PyTorch experiment? Other than maybe different default data types on CPU, the results should be the same.
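If it is PyTorch, the cheap way to check is to run the same forward pass on both devices in the same dtype and compare. Something like this (generic sketch, not the paper's model):

```python
import torch

# Generic sanity check, not tied to the paper: same weights, same float32 input,
# run on CPU and GPU, then compare. Expect tiny rounding differences, nothing more.
torch.manual_seed(0)
model = torch.nn.Linear(64, 8)
x = torch.randn(32, 64)

out_cpu = model(x)

if torch.cuda.is_available():
    out_gpu = model.to("cuda")(x.to("cuda")).cpu()
    # Different kernels, so float32 noise around 1e-6 is normal; anything bigger
    # would point at a dtype or nondeterminism issue.
    print(torch.allclose(out_cpu, out_gpu, rtol=1e-4, atol=1e-5))
```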
That’s even overkill. A 3090 is pretty standard in the sanely priced ML research space. It’s the same architecture as the A100, so very widely supported.
5090 is actually a mixed bag because it’s too new, and support for it is hit and miss. And also because it’s ridiculously priced for a 32G card.
And most CPUs with tons of RAM are fine, depending on the workload, but the constraint is usually “does my dataset fit in RAM” more than core speed (since just waiting 2X or 4X longer is not that big a deal).
The model was run (and I think trained?) on very modest hardware:
The computer used for this paper contains an NVIDIA Quadro RTX 6000 with 22 GB of VRAM, 200 GB of RAM, and a 32-core Xeon CPU, courtesy of Caltech.
That's a double-VRAM Nvidia RTX 2080 Ti plus a Skylake Intel CPU, an aging circa-2018 setup. With room for a batch size of 4096, no less! Though they did run into some preprocessing bottleneck in CPU/RAM.
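On that preprocessing bottleneck: the usual PyTorch mitigation is pushing the CPU-side work into DataLoader worker processes so it overlaps the GPU compute. Minimal sketch, with a made-up dataset that is definitely not the paper's pipeline:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class NoisyFrames(Dataset):
    # Hypothetical stand-in for raw-telescope-frame preprocessing; the real
    # pipeline, shapes, and normalization are not from the paper.
    def __len__(self):
        return 100_000

    def __getitem__(self, idx):
        frame = torch.randn(64, 64)                   # pretend raw frame
        return (frame - frame.mean()) / frame.std()   # CPU-side normalization

# num_workers runs the per-item CPU work in parallel processes so the GPU isn't
# starved; pin_memory speeds up the host-to-device copy of each batch.
loader = DataLoader(NoisyFrames(), batch_size=4096, num_workers=8, pin_memory=True)

if __name__ == "__main__":
    batch = next(iter(loader))
    print(batch.shape)  # torch.Size([4096, 64, 64])
```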
The primary concern is the clustering step. Given the sheer magnitude of data present in the catalog, without question the task will need to be spatially divided in some way, and parallelized over potentially several machines
It’s a little too plausible, heh.
I posit the central flaw is the engagement system, which was AI driven long before LLM/diffusion was public. The slop is made because it works in that screwed up, toxic system.
I dunno about substitutions or details, but the bottom line is simple: if engagement-driven social media doesn't die in a fire, the human race will.
TBH most of the cost is from the individual components: the core chip fab, the memory fab, the OLED screen fab, the battery, power regulation, the cameras, all massive and heavily automated operations. Not to speak of the software stack, or the chip R&D and tape-out costs.
The child labor is awful, but it’s not the most expensive part of a $1k+ iPhone.
They're more often subsidized by carriers here (in the US), too. I didn't really want an iPhone, but $400 new for a Plus, with a plan discount, just makes an Android handset not worth it.
OP’s being abrasive, but I sympathize with the sentiment. Bluesky is algorithmic just like Twitter.
…Dunno about Bluesky, but Lemmy feels like a political purity test to me. Like, I love Lemmy and the Fediverse, but at the same time mega-upvoted posts/comments like "X person should kill themselves," the expulsion of nuance on specific issues, leakage into every community, and so on are making me step back more and more.
It’s not theoretical, it’s just math. Removing 1/3 of the bus paths, and also removing the need to constantly keep RAM powered
And here’s the kicker.
You're supposing it's (given the no-refresh bonus) 1/3 as fast as DRAM, has similar latency, and is cheap enough per gigabyte to replace most storage. That is a tall order, and it would be incredible if it hit all three of those. I find that highly improbable.
Even DRAM is starting to become a bottleneck for APUs specifically, because making the bus wide is so expensive. This applies to the very top (the MI300A) and the very bottom (smartphone and laptop APUs).
Optane, for reference, was a lot slower than DRAM and a lot more expensive/less dense than flash, even with all the work Intel put into it and the buses built into then-top-end CPUs for direct access. And they thought that was pretty good. It was good enough for a niche when used in conjunction with DRAM sticks.
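Rough numbers to show why hitting all three at once is so hard (ballpark assumptions, not measured specs):

```python
# Assumed ballpark figures only (not measured specs), to show what "1/3 of DRAM
# speed, similar latency, storage-like pricing" would actually require.
dram_bw_gbs  = 50.0    # rough dual-channel DDR5 figure (assumption)
dram_lat_ns  = 100.0   # typical DRAM load-to-use latency (assumption)
dram_usd_gb  = 2.50    # DDR5 DIMM pricing (assumption)
flash_usd_gb = 0.05    # bulk TLC flash pricing (assumption)

target_bw  = dram_bw_gbs / 3     # "1/3 as fast as DRAM"
target_lat = dram_lat_ns         # "similar latency"
target_usd = flash_usd_gb * 2    # "cheap enough to replace most storage"

print(f"needs ~{target_bw:.0f} GB/s at ~{target_lat:.0f} ns for ~${target_usd:.2f}/GB")
# Optane DIMMs, roughly speaking, delivered a fraction of DRAM bandwidth at a few
# times its latency and well above flash pricing, even with dedicated CPU support.
```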
You are talking theoretical.
A big reason supercomputers moved to a networked "commodity" hardware architecture is that it's cost-effective.
How would one build a giant unified pool of this memory? CXL, presumably, but what does it look like physically? Maybe you get a lot of bandwidth in parallel, but how would it be even close to the latency of "local" DRAM buses on each node? Is that setup truly more power efficient than banks of DRAM backed by infrequently touched flash? If your particular workload needs fast random access to memory, then even at scale the only advantage seems to be some fault tolerance, at a huge speed cost; and if you just need bulk, high-latency bandwidth, flash has you covered for cheaper.
…I really like the idea of a nonvolatile single pool backed by caches, especially at scale, but ultimately architectural decisions come down to economics.
How is that any better than DRAM though? It would have to be much cheaper/GB, yet reasonably faster than the top-end SLC/MLC flash Samsung sells.
Another thing I don't get… in all the training runs I see, dataset bandwidth needs are pretty small. Like, streaming images (much less something like 128K tokens of text) is a minuscule drop in the bucket compared to how long a step takes, especially with hardware decoders for decompression.
Weights are an entirely different duck, and stuff like Cerebras clusters do stream them into system memory, but they need the speed of DRAM (or better).
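Quick back-of-envelope for the streaming point, with assumed numbers rather than any specific run:

```python
# Assumed round numbers, no particular training run.
batch_size   = 256
image_kb     = 150     # a compressed ~224x224 JPEG (assumption)
step_seconds = 0.5     # one optimizer step on a GPU (assumption)

mb_per_step    = batch_size * image_kb / 1024
bandwidth_mb_s = mb_per_step / step_seconds
print(f"~{bandwidth_mb_s:.0f} MB/s of input data")  # well under one SATA SSD (~500 MB/s)

# The weights/optimizer traffic is the part that actually needs DRAM/HBM-class
# bandwidth, which is the distinction drawn above.
```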