I would imagine the torrents are a snapshot in time. I don’t think they can be updated after being created? Also, picture the average dev. Half of them are lazier and wouldn’t deal with RSS or torrents when you can just make thousands of redundant GET requests to use for training data.
This makes no sense, the snapshots are updated regularly and Wikipedia isn’t even that big. Like 25GB.
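For what it’s worth, Wikimedia publishes the dumps at predictable URLs under dumps.wikimedia.org, so staying current is one scheduled download rather than millions of page fetches. A rough sketch (the exact filename pattern is my assumption from the dump site’s layout, not anything official):

```python
# Sketch: build the URL for a Wikipedia database dump.
# Assumption: the "pages-articles" compressed XML dump follows the
# pattern seen on dumps.wikimedia.org; verify before relying on it.
def dump_url(wiki: str = "enwiki", date: str = "latest") -> str:
    # pages-articles is the full article text (the ~25GB mentioned above,
    # give or take, depending on compression and revision history).
    return (f"https://dumps.wikimedia.org/{wiki}/{date}/"
            f"{wiki}-{date}-pages-articles.xml.bz2")

print(dump_url())
```

One cron job pointed at that URL replaces an entire scraping pipeline.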
The answer is simpler than you could ever conceive. Companies piloted by incompetent, selfish pricks are just scraping the entire internet in order to grab every niblet of data they can. Writing code to do what they’re doing in a less destructive fashion would require effort that they are entirely unwilling to put in. If that weren’t the case, the overwhelming majority of scrapers wouldn’t ignore robots.txt files. I hate AI companies so fucking much.
“robots.txt files? You mean those things we use as part of the site index when scraping it?”
— AI companies, probably
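The bitter irony is that respecting robots.txt is trivial; it’s in Python’s standard library. A minimal sketch of what a well-behaved crawler is supposed to do (the sample rules and bot name here are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt of the kind scrapers are ignoring (hypothetical rules).
rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A well-behaved crawler asks before fetching.
print(rp.can_fetch("MyBot", "https://example.org/private/data"))  # disallowed
print(rp.can_fetch("MyBot", "https://example.org/wiki/Soup"))     # allowed
```

That’s the entire cost of compliance they’re refusing to pay.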
Close, I think the average dev can’t even imagine a product that isn’t for profit and is available to everyone as a public service. Scraping uses a lot more resources than just regularly downloading the snapshot, so it genuinely makes no sense. It’s like shoplifting from a soup kitchen.
But… I know tech people personally. They’re really that dumb.