

The NPUs will have to be rearchitected to optimize themselves for bitnet.
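To make the bitnet point concrete: the weights get snapped to -1/0/+1, so the hot loop becomes adds/subtracts instead of full multiplies. A rough sketch of the idea (my own illustration, not any reference implementation):

```python
import torch

def ternary_quantize(w: torch.Tensor):
    # BitNet-style "1.58-bit" weights: scale by the mean absolute value, then
    # snap every weight to -1, 0, or +1. Rough illustration, not reference code.
    scale = w.abs().mean().clamp(min=1e-8)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

w = torch.randn(256, 256)
x = torch.randn(256)

w_q, scale = ternary_quantize(w)

# Inference approximates w @ x as scale * (w_q @ x); since w_q only holds
# {-1, 0, +1}, a bitnet-tuned NPU could swap multipliers for adders.
y_approx = scale * (w_q @ x)
print(y_approx.shape)
```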
Yeah it’s awesome.
There's a lot of optimization left on the table, too. Check out this fork, which doesn't quite work yet (it seems a Python config file is missing?) but is integrating some optimizations (like torch.compile) and extra options (like last frames or negative prompts): https://github.com/wongfei2009/FramePack
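For anyone wondering what the torch.compile part buys: it's basically a one-line wrapper that JIT-compiles and fuses a module's kernels. Generic pattern below, not FramePack's actual code (the model here is just a stand-in):

```python
import torch

# Not FramePack's actual code; just the generic pattern. torch.compile wraps a
# module and JIT-fuses its kernels, so repeated calls reuse the compiled graph.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# "reduce-overhead" trades a longer warmup for faster steady-state inference.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 1024)
with torch.no_grad():
    y = compiled(x)  # first call compiles, later calls are fast
print(y.shape)
```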
It was selectively given to institutions and “major” celebrities before that.
Selling them dilutes any meaning of "verified," because any Joe can just pay for extra engagement. It's a perverse incentive: the people most interested in grabbing attention buy it and get amplified.
It really has little to do with Musk.
the whole concept is stupid.
+1
Being that algorithmic just makes any Twitter-like design too easy to abuse.
Again, Lemmy (and Reddit) is far from perfect, but fundamentally, grouping posts and feeds by niche is way better. It incentivizes little communities that care about their own health, whereas users shouting into the Twitter maw have zero control over any of that.
Not sure where you’re going with that, but it’s a perverse incentive, just like the engagement algorithm.
Elon is a problem because he can literally force himself into everyone’s feeds, but also because he always posts polarizing/enraging things these days.
Healthy social media design/UI is all about incentivizing good, healthy communities and posts. Lemmy is not perfect, but simply not designing for engagement/profit because Lemmy is “self hosted” instead of commercial is massive.
This. $200 is what raised my eyebrows; Windows should not cost them that much when it's usually "standard."
Is this a tariff thing? Like is it suddenly more expensive to license Windows, hence pushing OEMs to offer discount options?
Lenovo is at least partially Chinese.
I mean, “modest” may be too strong a word, but a 2080 TI-ish workstation is not particularly exorbitant in the research space. Especially considering the insane dataset size (years of noisy, raw space telescope data) they’re processing here.
Also, that's not always true. Some "AI" models, especially oldschool ones, function fine on old CPUs. There are also efforts (like bitnet) to run larger ones fast and cheaply.
I have no idea if it has any impact on the actual results though.
Is it a PyTorch experiment? Other than maybe different default data types on CPU, the results should be the same.
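If it is PyTorch, the cheap way to check is to run the same forward pass on both devices in the same dtype and compare. Something like this (generic sketch, not the paper's model):

```python
import torch

# Generic sanity check, not tied to the paper: same weights, same float32 input,
# run on CPU and GPU, then compare. Expect tiny rounding differences, nothing more.
torch.manual_seed(0)
model = torch.nn.Linear(64, 8)
x = torch.randn(32, 64)

out_cpu = model(x)

if torch.cuda.is_available():
    out_gpu = model.to("cuda")(x.to("cuda")).cpu()
    # Different kernels, so float32 noise around 1e-6 is normal; anything bigger
    # would point at a dtype or nondeterminism issue.
    print(torch.allclose(out_cpu, out_gpu, rtol=1e-4, atol=1e-5))
```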
That’s even overkill. A 3090 is pretty standard in the sanely priced ML research space. It’s the same architecture as the A100, so very widely supported.
5090 is actually a mixed bag because it’s too new, and support for it is hit and miss. And also because it’s ridiculously priced for a 32G card.
And most CPUs with tons of RAM are fine, depending on the workload, but the constraint is usually “does my dataset fit in RAM” more than core speed (since just waiting 2X or 4X longer is not that big a deal).
The model was run (and I think trained?) on very modest hardware:
The computer used for this paper contains an NVIDIA Quadro RTX 6000 with 22 GB of VRAM, 200 GB of RAM, and a 32-core Xeon CPU, courtesy of Caltech.
That's a double-VRAM Nvidia RTX 2080 Ti plus a Skylake Intel CPU, an aging circa-2018 setup. With room for a batch size of 4096, no less! Though they did run into some preprocessing bottleneck in CPU/RAM.
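On that preprocessing bottleneck: the usual PyTorch mitigation is pushing the CPU-side work into DataLoader worker processes so it overlaps the GPU compute. Minimal sketch, with a made-up dataset that is definitely not the paper's pipeline:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class NoisyFrames(Dataset):
    # Hypothetical stand-in for raw-telescope-frame preprocessing; the real
    # pipeline, shapes, and normalization are not from the paper.
    def __len__(self):
        return 100_000

    def __getitem__(self, idx):
        frame = torch.randn(64, 64)                   # pretend raw frame
        return (frame - frame.mean()) / frame.std()   # CPU-side normalization

# num_workers runs the per-item CPU work in parallel processes so the GPU isn't
# starved; pin_memory speeds up the host-to-device copy of each batch.
loader = DataLoader(NoisyFrames(), batch_size=4096, num_workers=8, pin_memory=True)

if __name__ == "__main__":
    batch = next(iter(loader))
    print(batch.shape)  # torch.Size([4096, 64, 64])
```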
The primary concern is the clustering step. Given the sheer magnitude of data present in the catalog, without question the task will need to be spatially divided in some way, and parallelized over potentially several machines
It’s a little too plausible, heh.
I posit the central flaw is the engagement system, which was AI driven long before LLM/diffusion was public. The slop is made because it works in that screwed up, toxic system.
I dunno about substitutions or details, but the bottom line is simple: if engagement-driven social media doesn't die in a fire, the human race will.
TBH most of the cost is from the individual components: the core chip fab, the memory fab, the OLED screen fab, the battery, power regulation, the cameras, all massive and heavily automated operations. Not to speak of the software stack, or the chip R&D and tape-out costs.
The child labor is awful, but it’s not the most expensive part of a $1k+ iPhone.
They're more often subsidized by carriers here (in the US), too. I didn't really want an iPhone, but $400 new for a Plus, with a plan discount, just makes an Android handset not worth it.
OP’s being abrasive, but I sympathize with the sentiment. Bluesky is algorithmic just like Twitter.
…Dunno about Bluesky, but Lemmy feels like a political purity test to me. Like, I love Lemmy and the Fediverse, but at the same time mega-upvoted posts/comments like "X person should kill themselves," the expulsion of nuance on specific issues, leakage into every community, and so on are making me step back more and more.
It’s not theoretical, it’s just math. Removing 1/3 of the bus paths, and also removing the need to constantly keep RAM powered
And here’s the kicker.
You're supposing it's (given the no-refresh bonus) 1/3 as fast as DRAM, has similar latency, and is cheap enough per gigabyte to replace most storage. That is a tall order, and it would be incredible if it hit all three of those. I find that highly improbable.
Even DRAM is starting to become a bottleneck for APUs specifically, because making the bus wide is so expensive. This applies to the very top (the MI300A) and the very bottom (smartphone and laptop APUs).
Optane, for reference, was a lot slower than DRAM and a lot more expensive/less dense than flash, even with all the work Intel put into it and the buses built into then-top-end CPUs for direct access. And they thought that was pretty good. It was good enough for a niche when used in conjunction with DRAM sticks.
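Rough numbers to show why hitting all three at once is so hard (ballpark assumptions, not measured specs):

```python
# Assumed ballpark figures only (not measured specs), to show what "1/3 of DRAM
# speed, similar latency, storage-like pricing" would actually require.
dram_bw_gbs  = 50.0    # rough dual-channel DDR5 figure (assumption)
dram_lat_ns  = 100.0   # typical DRAM load-to-use latency (assumption)
dram_usd_gb  = 2.50    # DDR5 DIMM pricing (assumption)
flash_usd_gb = 0.05    # bulk TLC flash pricing (assumption)

target_bw  = dram_bw_gbs / 3     # "1/3 as fast as DRAM"
target_lat = dram_lat_ns         # "similar latency"
target_usd = flash_usd_gb * 2    # "cheap enough to replace most storage"

print(f"needs ~{target_bw:.0f} GB/s at ~{target_lat:.0f} ns for ~${target_usd:.2f}/GB")
# Optane DIMMs, roughly speaking, delivered a fraction of DRAM bandwidth at a few
# times its latency and well above flash pricing, even with dedicated CPU support.
```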
You are talking theoretical.
A big reason supercomputers moved to a networked "commodity" hardware architecture is that it's cost-effective.
How would one build a giant unified pool of this memory? CXL, presumably, but what does it look like physically? Maybe you get a lot of bandwidth in parallel, but how would it be even close to the latency of "local" DRAM buses on each node? Is that setup truly more power efficient than banks of DRAM backed by infrequently touched flash? If your particular workload needs fast random access to memory, then even at scale the only advantage seems to be some fault tolerance, at a huge speed cost; and if you just need bulk, high-latency bandwidth, flash has you covered for cheaper.
…I really like the idea of a nonvolatile single pool backed by caches, especially at scale, but ultimately architectural decisions come down to economics.
How is that any better than DRAM though? It would have to be much cheaper/GB, yet reasonably faster than the top-end SLC/MLC flash Samsung sells.
Another thing I don't get… in all the training runs I see, dataset bandwidth needs are pretty small. Like, streaming images (much less something like 128K tokens of text) is a minuscule drop in the bucket compared to how long a step takes, especially with hardware decoders for decompression.
Weights are an entirely different duck, and stuff like Cerebras clusters do stream them into system memory, but they need the speed of DRAM (or better).
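Quick back-of-envelope for the streaming point, with assumed numbers rather than any specific run:

```python
# Assumed round numbers, no particular training run.
batch_size   = 256
image_kb     = 150     # a compressed ~224x224 JPEG (assumption)
step_seconds = 0.5     # one optimizer step on a GPU (assumption)

mb_per_step    = batch_size * image_kb / 1024
bandwidth_mb_s = mb_per_step / step_seconds
print(f"~{bandwidth_mb_s:.0f} MB/s of input data")  # well under one SATA SSD (~500 MB/s)

# The weights/optimizer traffic is the part that actually needs DRAM/HBM-class
# bandwidth, which is the distinction drawn above.
```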