• asudox@lemmy.world
    arrow-up
    104
    ·
    3 months ago

    Block? Nope, robots.txt does not block the bots. It’s just a text file that says: “Hey robot X, please do not crawl my website. Thanks :>”
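
    To make the "please" part concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The bot names and the example.com URL are made up for illustration. The point is that the check runs inside the crawler, so it only helps if the crawler chooses to run it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking one (made-up) bot to stay off the whole site.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_fetch_allowed(user_agent: str, url: str) -> bool:
    """What a well-behaved crawler asks before fetching a URL.

    Nothing on the server side enforces the answer; a crawler that
    never calls this simply fetches the page anyway.
    """
    return parser.can_fetch(user_agent, url)


examplebot_allowed = polite_fetch_allowed("ExampleBot", "https://example.com/page")
otherbot_allowed = polite_fetch_allowed("OtherBot", "https://example.com/page")
```

    Here `examplebot_allowed` comes out `False` and `otherbot_allowed` comes out `True`, but both are only answers to a question the bot is free not to ask.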

    • ɐɥO@lemmy.ohaa.xyz
      arrow-up
      57
      ·
      3 months ago

      I disallow a page in my robots.txt and IP-ban everyone who goes there. That’s pretty effective.
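
      A rough sketch of that honeypot idea in Python, not the commenter's actual setup. The `/trap/` path, the in-memory ban set, and the `handle_request` helper are all hypothetical; a real deployment would more likely ban at the web server or firewall level.

```python
# Hypothetical trap path. It would be listed in robots.txt as:
#   User-agent: *
#   Disallow: /trap/
TRAP_PREFIX = "/trap/"

# In-memory ban list for illustration only; it is lost on restart.
banned_ips: set[str] = set()


def handle_request(ip: str, path: str) -> int:
    """Return the HTTP status code to send for a request from `ip` to `path`."""
    if ip in banned_ips:
        return 403
    if path.startswith(TRAP_PREFIX):
        # A crawler that honors robots.txt never requests this path,
        # so any visitor here has ignored the file.
        banned_ips.add(ip)
        return 403
    return 200
```

      For example, `handle_request("192.0.2.1", "/trap/page")` returns 403 and bans that address, after which even `handle_request("192.0.2.1", "/index.html")` returns 403.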

    • Cynicus Rex@lemmy.ml (OP)
      arrow-up
      11
      ·
      3 months ago

      Unfortunate indeed.

      “Can AI bots ignore my robots.txt file? Well-established companies such as Google and OpenAI typically adhere to robots.txt protocols. But some poorly designed AI bots will ignore your robots.txt.”

      • breadsmasher@lemmy.world
        English
        arrow-up
        22
        ·
        3 months ago

        “typically adhere”

        But they don’t have to follow it.

        “poorly designed AI bots”

        Is it poor design if it’s explicitly a design choice to ignore it entirely and scrape as much data as possible? I’d argue these are AI bots designed to scrape everything regardless of robots.txt. That’s the intention. Asshole design vs. poor design.

    • majestictechie@lemmy.fosshost.com
      English
      arrow-up
      6
      ·
      3 months ago

      This is why I block in an .htaccess file:

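      # Bot Agent Block Rule
      RewriteEngine On
      # Match any listed name inside the User-Agent header, case-insensitively ([NC]).
      # BOTNAME, BOTNAME2 and BOTNAME3 are placeholders for real bot names.
      RewriteCond %{HTTP_USER_AGENT} (BOTNAME|BOTNAME2|BOTNAME3) [NC]
      # Answer every matching request with 403 Forbidden ([F]; [L] is implied by [F]).
      RewriteRule (.*) - [F,L]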
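
      To check which User-Agent strings a pattern like the one above would catch, the same match can be approximated in Python. This is only an approximation of the Apache rule: `re.IGNORECASE` stands in for `[NC]`, and the names are the same placeholders.

```python
import re

# Same placeholder names as the rewrite rule; swap in real bot names.
BLOCKED_AGENTS = re.compile(r"(BOTNAME|BOTNAME2|BOTNAME3)", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """True if the User-Agent header contains any of the blocked names."""
    return BLOCKED_AGENTS.search(user_agent) is not None
```

      Note that the User-Agent header is supplied by the client, so this only catches bots that identify themselves honestly; a scraper that sends a browser-like string is not matched.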