I mean, yeah. AI companies are nearly universally, and objectively, piloted by fucksticks. Lemmy instances also get constantly scraped, rather than the scrapers just spinning up an instance and federating to pull the entire threadiverse.
I have a public gitweb repository. I am constantly being hit by dumb crawlers that, left to their own devices, request every single diff of every single commit, simply because links for those operations are there to follow. All of it is unnecessary: if they would just do a simple git pull (or clone), my server would happily hand over the entire repo history in about 50 MB. Instead, they download gigabytes of HTML boilerplate, probably never assemble a full commit history, and probably can't even use what they do scrape, since they're grabbing random commits in between blocks and bans.
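To be concrete, the whole history is one command away; a minimal sketch (the URL is a placeholder, not my actual repo):

```python
import subprocess

# A single clone transfers the complete history as one packfile
# (~50 MB for this repo) instead of gigabytes of per-commit HTML pages.
# Placeholder URL; substitute the repo's real clone address.
subprocess.run(
    ["git", "clone", "--mirror", "https://git.example.org/project.git"],
    check=True,
)
```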
All of this only became an issue around a year ago. Since then, I've just accepted that my public-facing static pages are the only thing that stays reliably available anymore.
“Use the torrents, damn!”