We Asked A.I. to Create the Joker. It Generated a Copyrighted Image.

Artists and researchers are exposing copyrighted material hidden within A.I. tools, raising fresh legal questions.

  • QubaXR@lemmy.world
    10 months ago

    These models were trained on datasets that used authors' work as training material without compensating them. It’s not every picture on the net, but much of the data comes from scraping websites, portfolios and social networks wholesale.

    A similar situation exists with large language models. Recently Meta admitted to using pirated books (the Books3 database, to be precise) to train their LLM, without any plans to compensate the authors or even so much as pay for a single copy of each book used.

    • Jilanico@lemmy.world
      10 months ago

      Most of the stuff that inspires me probably wasn’t paid for. I just randomly saw it online or on the street, much like an AI.

      AI using straight up pirated content does give me pause tho.

      • QubaXR@lemmy.world
        10 months ago

        I was on the same page as you for the longest time. I cringed at the whole “No AI” movement and the artists’ protests. I used the very same argument: generations of artists honed their skills by observing the masters, copying their techniques and only then developing their own unique style. Why should AI be any different? Surely AI will not just copy works wholesale and will instead learn color, composition, texture and other aspects of various works to find its own identity.

        It was only when my very own prompts started producing results I recognized as “homages” at best and “rip-offs” at worst that I stopped to reconsider.

        I suspect that earlier generations of text-to-image models had better moderation of training data. As the arms race heated up and the pace of development picked up, companies running these services started rapidly incorporating whatever training data they could get their hands on, ethics, copyright or artists’ rights be damned.

        I remember when MidJourney introduced Niji (their anime model) and I could often identify the manga and characters used to train it. The imagery Niji produced kept certain distinct and unique elements of character designs from that training data - as a result a lot of characters exhibited “Chainsaw Man” pointy teeth and a stuck-out tongue - without so much as a mention of the source material or even the themes.

    • archomrade [he/him]
      10 months ago

      These models were trained on datasets that, without compensating the authors, used their work as training material.

      Couple things:

      • this doesn’t answer OP’s question about how the information is stored. In fact OP is right that the images and source material are NOT stored in a database within the model; it basically just stores metadata about the source material as a whole in order to construct new material from text descriptions

      • the use of copyrighted works in training isn’t necessarily infringing if the model is found to be fair use, and there is a very strong fair use argument here.

      • QubaXR@lemmy.world
        10 months ago

        “metadata” is such a pretty word. How about “recipe” instead? It stores all information necessary to reproduce work verbatim or grab any aspect of it.

        The legal issue of copyright is a tricky one, especially in the US, where copyright is often weaponized by corporations. The gist of it is: the training model itself was an academic endeavor and therefore falls under fair use. Companies like StabilityAI or OpenAI then used these datasets and monetized products built on them, which in my understanding skirts a legal gray zone.

        If these private for-profit companies simply took the same data and built their own, identical dataset, they would be liable to pay the authors for the use of their work in a commercial product. They get around it by using the existing model, originally created for research and not commercial use.

        Lemmy is full of open source and FOSS enthusiasts; I’m sure someone can explain it better than I can.

        All in all I don’t argue about the legality of AI, but as a professional creative I highlight the ethical (plagiarism) risks that are beginning to arise in the majority of the models. We all know the Joker, Marvel superheroes, popular Disney and WB cartoon characters - and can spot when “our” generations cross the line into copying someone else’s work. But how many of us are familiar with Polish album cover art, Brazilian posters, Chinese film superheroes or Turkish logos? How sure can we be that the work “we” produced using AI is truly original and not a perfect copy of someone else’s work? Does our ignorance excuse this second-hand plagiarism? Or should the companies releasing AI models stop adding features and fix that broken foundation first?

        • archomrade [he/him]
          10 months ago

          “metadata” is such a pretty word. How about “recipe” instead?

          Well, isn’t “recipe” another one of those pretty words? ‘Metadata’ is specific to other precedents that deal with computer programs that gather data about works (see Authors Guild, Inc. v. HathiTrust and Authors Guild v. Google), but you’re welcome to challenge the verbiage if you don’t like it. Regardless, what we’re discussing is objectively something that describes copyrighted works, not a copy or copies of the works themselves. A computer program that is very good at analyzing textual or pixel data is still only analyzing data. It is itself a novel, non-expressive factual representation of other expressive works, and because of this, it cannot be considered infringement on its own.

          It stores all information necessary to reproduce work verbatim or grab any aspect of it.

          This isn’t really true, at least not for the majority of works analyzed by the model, but granted. If a person uses a tool to copy the work of another person, it is the person who is doing the copying, not the tool. I think it is far more reasonable to hold an individual who uses an AI model to infringe on a copyright responsible. If someone chooses to author a work with the use of a tool that does the work for them (in part or in whole), it is more than reasonable to expect that individual to check the work that is being produced.

          All in all I don’t argue about the legality of AI, but as a professional creative I highlight the ethical (plagiarism) risks that are beginning to arise in the majority of the models.

          As a professional creative myself, I think this is a load of horseshit. We always hold individual authors responsible for the work that they publish, and it should be no different here. That some choose to be lazy and careless is more of a reflection of them.

          How sure can we be that the work “we” produced using AI is truly original and not a perfect copy of someone else’s work?

          If you have the words to describe a desired image or text response that make the model produce a ‘perfect copy of someone else’s work’, then you have the words to search for that work, too.

          Or should the companies releasing AI models stop adding features and fix that broken foundation first?

          How about we stop expanding the scope of an already broken copyright law and fix that broken foundation first?