Generative artificial intelligence (GenAI) company Anthropic has claimed to a US court that using copyrighted content in large language model (LLM) training data counts as “fair use”.

Under US law, “fair use” permits the limited use of copyrighted material without permission, for purposes such as criticism, news reporting, teaching, and research.

In October 2023, a host of music publishers including Concord, Universal Music Group and ABKCO initiated legal action against the Amazon- and Google-backed generative AI firm Anthropic, demanding potentially millions in damages for the allegedly “systematic and widespread infringement of their copyrighted song lyrics”.

  • Sonori@beehaw.org · 11 months ago

    The thing is, I’m not sure it’s even physically possible for an LLM to be trained like a four-year-old; they learn in fundamentally different ways. Even very young children quickly learn by associating words with concepts and objects, not by forming a statistical model of how often one meaningless string of characters follows every other meaningless string of characters.
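    To make the “statistical model of which string follows which” point concrete, here’s a deliberately crude sketch. Real LLMs use learned embeddings and transformer attention rather than raw counts, but the underlying objective is the same: predict the next token from frequency patterns, with no grounding in concepts or objects.

    ```python
    from collections import Counter, defaultdict

    # Toy character-level bigram model: count how often each character
    # follows each other character in the training text. This is the
    # simplest possible "next-string prediction" model.
    def bigram_model(text):
        counts = defaultdict(Counter)
        for cur, nxt in zip(text, text[1:]):
            counts[cur][nxt] += 1
        return counts

    def most_likely_next(counts, ch):
        # Predict the successor seen most often after ch during "training".
        return counts[ch].most_common(1)[0][0]

    model = bigram_model("the cat sat on the mat")
    print(most_likely_next(model, "c"))  # "a" -- the only char ever seen after "c"
    ```

    The model has no idea what a cat is; it only knows which characters tend to follow which. Scaling this idea up to tokens, billions of parameters, and internet-sized corpora is (very loosely) what LLM pretraining does.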

    Similarly, when it comes to image classifiers, a child can often associate a word with a concept or object after a single example, and doesn’t need to be shown hundreds of thousands of examples before it can build a wide variety of pixel-value mappings based on statistical association.

    Moreover, a very large amount of the “progress” we’ve seen in the last few years has come only from simplifying the transformers and using ever larger datasets. For instance, GPT-4 is a big improvement on GPT-3, but about the only major difference between the two models is that they threw nearly the entire text of the internet at GPT-4, compared to GPT-3’s smaller dataset.