“Model collapse” can be easily avoided by keeping old human data with new synthetic data in the training set. The old archives of Reddit content from before there was AI are still around.
A model trained on jokes about bacon, narwhals, and rage comics.
By “old archives” I mean everything from 2022 and earlier.
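Rough sketch of what "keep the old human data in the mix" could look like. Everything here is my own assumption for illustration: the `CUTOFF` date, the `(date, text)` post format, the function names, and the `max_synthetic_fraction` knob are all made up, and this only shows the data mixing. It says nothing about how much synthetic data a model can actually tolerate.

```python
import random
from datetime import date

# Assumption: "2022 and earlier" is treated as the pre-AI archive.
CUTOFF = date(2023, 1, 1)

def pre_ai_only(posts):
    """Keep the text of posts dated before CUTOFF.

    Each post is a hypothetical (posted_on, text) pair.
    """
    return [text for posted_on, text in posts if posted_on < CUTOFF]

def mix_training_set(human_texts, synthetic_texts,
                     max_synthetic_fraction=0.5, seed=0):
    """Keep every human text and add synthetic texts up to a capped share."""
    if not 0 <= max_synthetic_fraction < 1:
        raise ValueError("max_synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # n_syn / (n_human + n_syn) <= f  =>  n_syn <= f * n_human / (1 - f)
    limit = int(max_synthetic_fraction * len(human_texts)
                / (1 - max_synthetic_fraction))
    n_syn = min(limit, len(synthetic_texts))
    mixed = list(human_texts) + rng.sample(list(synthetic_texts), n_syn)
    rng.shuffle(mixed)
    return mixed
```

The point of the cap is that the human archive is never dropped or shrunk; synthetic text only gets added on top of it, up to whatever share you decide is safe.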
But there were still bots making shit up back then. r/SubredditSimulator was pretty popular for a while, and repost and astroturfing bots were a problem for decades on Reddit.
Existing AIs such as ChatGPT were trained in part on that data, so obviously they've got ways to make it work. They filtered out some stuff, for example: the "glitch tokens" such as SolidGoldMagikarp were evidence of that.
I SAID RAGE COMICS
That paper has yet to be peer reviewed or released. I think you are jumping to conclusions with that statement. How much can you dilute the data until it breaks again?
Removed by mod
Peer review, for all its flaws, is a good minimum before a paper is worth taking seriously.
In your original comment you said that model collapse can be easily avoided with this technique, which is notably different from it being mitigated. I'm not saying that these findings are not useful, just that you are overselling them a bit with this wording.
Removed by mod