• howrar@lemmy.ca
    arrow-up
    54
    ·
    5 months ago

    We have models that are specifically made to be good at these kinds of tasks. Why would you choose the ones that aren’t and then make generalizing claims about how AI sucks in this domain?

    • spaduf@slrpnk.net
      arrow-up
      15
      arrow-down
      1
      ·
      edit-2
      5 months ago

      Yeah, this is probably just straight-up misinformation. By no means is a diagnosis going to be made by a generalist multimodal LLM. Diagnosis is literally a binary classification (although that is an oversimplification), and in medical CV you are optimizing for that directly.
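
To make the "optimizing directly" point concrete, here is a minimal sketch (not taken from the article or the paper) of a binary classifier trained by gradient descent on binary cross-entropy. The single scalar feature per scan and the toy data are made-up assumptions for illustration; a real medical CV model would be a deep network over image pixels, but the objective has the same shape.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bce(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for y, p in zip(labels, probs)
    ) / len(labels)

def train(features, labels, lr=0.5, steps=200):
    """Fit p(abnormal | x) = sigmoid(w * x + b) by gradient descent on the BCE loss."""
    w, b = 0.0, 0.0
    n = len(labels)
    for _ in range(steps):
        probs = [sigmoid(w * x + b) for x in features]
        # Gradient of the mean BCE with respect to w and b for a sigmoid output.
        grad_w = sum((p - y) * x for p, y, x in zip(probs, labels, features)) / n
        grad_b = sum(p - y for p, y in zip(probs, labels)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical toy data: one invented scalar feature per scan, label 1 = abnormal.
FEATURES = [0.1, 0.4, 0.5, 0.9, 1.3, 1.6, 2.0, 2.2]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train(FEATURES, LABELS)
loss_before = bce(LABELS, [sigmoid(0.0) for _ in FEATURES])  # untrained model, p = 0.5
loss_after = bce(LABELS, [sigmoid(w * x + b) for x in FEATURES])
```

The training loop only ever sees the classification loss, which is the contrast with a generalist model trained on next-token prediction and then asked diagnostic questions.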

      • snooggums
        English
        arrow-up
        8
        arrow-down
        9
        ·
        edit-2
        5 months ago

        They did not use an LLM.

        In a recent experiment, they set out to determine how reliable LMMs are in medical diagnosis — asking both general and more specific diagnostic questions — as well as whether models were even being evaluated correctly for medical purposes.

        Curating a new dataset and asking state-of-the-art models questions about X-rays, MRIs and CT scans of human abdomens, brain, spine and chests, they discovered “alarming” drops in performance.

        • Starbuck@lemmy.world
          arrow-up
          11
          arrow-down
          1
          ·
          5 months ago

          models including GPT-4V and Gemini Pro

          What a joke, a few generic LLMs making a judgement call about all AI models.

        • can@sh.itjust.works
          arrow-up
          2
          ·
          5 months ago

          They used one to create the dataset for their experiments:

          In their experiments, they introduced a new dataset, Probing Evaluation for Medical Diagnosis (ProbMed), for which they curated 6,303 images from two widely-used biomedical datasets. These featured X-ray, MRI and CT scans of multiple organs and areas including the abdomen, brain, chest and spine.

          GPT-4 was then used to pull out metadata about existing abnormalities, the names of those conditions and their corresponding locations. This resulted in 57,132 question-answer pairs covering areas such as organ identification, abnormalities, clinical findings and reasoning around position.

          • snooggums
            English
            arrow-up
            1
            arrow-down
            1
            ·
            5 months ago

            The seven models tested included GPT-4V, Gemini Pro and the open-source, 7B parameter versions of LLaVA-v1, LLaVA-v1.6, MiniGPT-v2, as well as specialized models LLaVA-Med and CheXagent. These were chosen because their computational costs, efficiencies and inference speeds make them practical in medical settings, researchers explain.

            It seems like this is a case of “they just aren’t using AI right, if they used it right it works” when it sure looks like they are using the models intended for these specific medical tasks.

            • spaduf@slrpnk.net
              arrow-up
              3
              arrow-down
              1
              ·
              edit-2
              5 months ago

              Those are not the sort of models anybody in the field would use (medical CV with deep-learning-based analysis is a vibrant field with many breakthroughs in recent years). These are the sort of models tech bros are trying to sell to the public as general AI. There is a world of difference.

    • NocturnalEngineer@lemmy.world
      arrow-up
      3
      ·
      5 months ago

      Not defending this article, but companies & big tech are generalizing the crap out of AI right now, and forcing it into everything.

      They could have (and definitely should’ve) communicated the strengths and weaknesses of their models, specifically regarding what they can and can’t do. But they don’t. They get more money when their shareholders & customers think it’s the next best thing for everything.

  • ResoluteCatnap@lemmy.ml
    English
    arrow-up
    38
    arrow-down
    1
    ·
    5 months ago

    As others have said, you don’t need (and shouldn’t use) an LLM for a classification task like this. There are machine learning models that can handle this and identify underlying patterns that humans cannot easily detect. And yes, they can get accuracy and precision scores much higher than 50%.
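
For anyone unsure what the 50% figure means here, a small sketch of how accuracy and precision are computed for a binary task. The labels and predictions below are invented for illustration and are not results from the paper. On a balanced set, a model that answers "yes" to everything lands at exactly 50% on both metrics.

```python
def accuracy_and_precision(y_true, y_pred):
    """Return (accuracy, precision) for 0/1 labels, where 1 means 'condition present'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    accuracy = (tp + tn) / len(pairs)
    # Precision is undefined with no positive predictions; report 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision

# Hypothetical balanced ground truth: four abnormal scans, four normal ones.
Y_TRUE = [1, 1, 1, 1, 0, 0, 0, 0]

# A "yes-man" that says the condition is present every time.
ALWAYS_YES = [1] * len(Y_TRUE)

# A made-up classifier that misses one case and raises one false alarm.
CLASSIFIER = [1, 1, 1, 0, 0, 0, 0, 1]

yes_man_scores = accuracy_and_precision(Y_TRUE, ALWAYS_YES)
classifier_scores = accuracy_and_precision(Y_TRUE, CLASSIFIER)
```

So 50% on a balanced yes/no question is the score you get with no information about the image at all, which is why it is a floor and not an achievement.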

    What an incredibly stupid article.

    • Umbrias@beehaw.org
      arrow-up
      11
      ·
      5 months ago

      Correct, you shouldn’t use an LLM for this task.

      Which is literally the point of the paper, because various techbros have been trying to claim that they are good at these tasks.

  • Match!!@pawb.social
    English
    arrow-up
    20
    ·
    5 months ago

    Coincidentally, I trained a CNN to tell dogs from cats and it does a godawful job diagnosing cancer

  • Pennomi@lemmy.world
    English
    arrow-up
    9
    ·
    5 months ago

    LLMs are notorious yes-men. Why would you ever use that for diagnosis? Just use bespoke classifiers like we have been doing for years.

    • Lojcs@lemm.ee
      arrow-up
      3
      ·
      5 months ago

      Because some researcher wanted to document what would happen, and a journalist thought writing about that would get many clicks.