lledrtx@lemmy.world · 10 months ago

Glad this is becoming a meme

dislocate_expansion@reddthat.com · 10 months ago

Anyone know why most have a 2021 internet data cutoff?

Natanael@slrpnk.net · 10 months ago

Training from scratch and retraining are expensive. Also, they want to avoid training on ML outputs; they want primarily human-made works as samples, and since the initial public release of LLMs it has become harder to build large datasets without ML-generated content in them.

Scrubbles@poptalk.scrubbles.tech · edit-2 10 months ago

There was a good paper that came out recently saying that training on ML-generated data will result in a collapse of cohesion. It’s going to be really interesting; I don’t know if they’ll ever be able to train as easily again.

Iron Lynx@lemmy.world · 10 months ago

I recall spotting a few things about image generators having their training data contaminated with generated images, and the output becoming significantly worse. So yeah, I guess LLMs and image generators need natural sources, or it gets more inbred than the Habsburgs.
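
For anyone who wants to poke at the "inbred" effect, here is a toy sketch in Python. It is not the method from any paper and it does not train a real model: "training" is assumed to be nothing more than resampling with replacement from the previous generation's output. The function name `distinct_per_generation` and its parameters are made up for illustration. The only point is that once a generation sees nothing but the previous generation's samples, the number of distinct items can stay flat or shrink, but never grow.

```python
import random


def distinct_per_generation(n_items=1000, generations=20, seed=0):
    """Count the distinct items that survive each round of self-resampling.

    Toy assumption: a "model" is just the list of samples it was given,
    and "generating" means drawing from that list with replacement.
    """
    rng = random.Random(seed)
    # Generation 0: every item is unique, standing in for human-made data.
    data = list(range(n_items))
    counts = [len(set(data))]
    for _ in range(generations):
        # The next generation only ever sees the previous generation's output.
        data = rng.choices(data, k=n_items)
        counts.append(len(set(data)))
    return counts


if __name__ == "__main__":
    counts = distinct_per_generation()
    print("distinct items per generation:", counts)
```

Mixing some fresh generation-0 items back in at every round would be the toy equivalent of keeping natural sources in the training set.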

TurtleJoe@lemmy.world · 10 months ago

I think it’s telling that they acknowledge that the stuff their bots churn out is often such garbage that training their bots on it would ruin them.

Donkter@lemmy.world · 10 months ago

I think it’s just that most are based on ChatGPT, whose training data cuts off in 2021.

can@sh.itjust.works · 10 months ago

Hey, did you know your profile is set to appear as a bot and as a result many may be filtering your posts and comments? You can change this in your Lemmy settings.

Unless you are a bot… In which case, where did you get your data?

dislocate_expansion@reddthat.com · 10 months ago

The data wasn’t stolen, I can at least assure you of that

can@sh.itjust.works · 10 months ago

You paid Hoffman?