Could scraper data be cached online?

seriema

Hi,

Something I've wondered for a while: why isn't all the scraper data cached somewhere? Screenscraper.fr is constantly under so much pressure that their servers go down, so why isn't their API behind a CDN like Cloudflare? Is it a cost question? Is it because everyone who scrapes wants the absolute latest, and not 5-minute-old data? Something totally different? (Repeat these questions for every online source.)

A follow-up is an idea I got from using @muldjord's awesome Skyscraper: Skyscraper builds up a local cache by calling all the different scraper sources, and then you get a very nice gamelist with the best of all sources. But all of us using it must have a large enough overlap when hitting those sources that we could just share the Skyscraper cache when building our gamelists. Something like this:

The images themselves don't necessarily have to be copied, they could just be URL references, so it'd all be just text data and that doesn't take much storage. Even including the images, and even videos, wouldn't be unrealistic in size. Transfer speed doesn't need to be streaming-level.
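To make the "just text data" point a bit more concrete, here is a minimal sketch of what one entry in such a shared cache could look like if media is stored as URL references only. Everything in it is an assumption for illustration: the `CacheEntry` fields, the checksum key, and the `example.invalid` URLs are hypothetical and do not reflect Skyscraper's actual cache format or any scraper site's API.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class CacheEntry:
    """One game in a hypothetical shared cache, keyed by ROM checksum."""
    sha1: str
    title: str
    platform: str
    description: str
    # Media is stored as URL references only, not as image or video bytes.
    media_urls: dict = field(default_factory=dict)


def entry_size_bytes(entry: CacheEntry) -> int:
    """Size of the entry when serialized as compact UTF-8 JSON."""
    payload = json.dumps(asdict(entry), separators=(",", ":"))
    return len(payload.encode("utf-8"))


# Placeholder data, not taken from any real source.
entry = CacheEntry(
    sha1="0" * 40,
    title="Example Game",
    platform="snes",
    description="A placeholder description. " * 20,
    media_urls={
        "screenshot": "https://example.invalid/media/screenshot.png",
        "cover": "https://example.invalid/media/cover.png",
    },
)

size = entry_size_bytes(entry)
print(f"{size} bytes per entry")
print(f"about {size * 10_000 / 1_000_000:.1f} MB for 10,000 such entries")
```

The numbers this prints depend entirely on the placeholder text above, so treat them as an order-of-magnitude illustration rather than a measurement of a real cache.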

Is this totally bonkers?

mitu

@seriema said in Could scraper data be cached online?:

Is this totally bonkers?

Just my 2c. You'd be building a 2nd ScreenScraper/ThegamesDB/OpenRetro site, aggregating their info. Leaving aside the permission to do so (which is not something that should be easily glossed over), you will have the same scalability issues the source sites have right now and you'll have to manage a - non-negligible - server side back-end.

While in theory it sounds nice, in practice you're just trying the old "let's add another layer of indirection/caching" method to solve a scalability problem.

muldjord

@seriema I suggested to ScreenScraper to use a caching server a long time ago. I don't know if they've looked into it.

Technically what you suggest is pretty easy I believe. But getting permission to do so is not. It's such a huge undertaking that I almost get dizzy thinking about it. You'd need some pretty crazy amounts of goodwill from the people behind the services for them to grant permissions. And they probably have an entire community that has to agree as well, as their data "is theirs". Would be cool though.

seriema

Like both of you mention, I'm totally ignoring the human aspect around permissions. At least until I understand why they want to manage their own (often crashing-under-the-weight) servers.

@mitu said in Could scraper data be cached online?:

[...] you will have the same scalability issues the source sites have right now and you'll have to manage a - non-negligible - server side back-end.

I really can't see how a CDN like Cloudflare wouldn't just take care of this. I don't see the need for up-to-the-second fresh data. It's not the stock market, it's games that have been out for decades. Maybe you can help me understand why they aren't caching things harder as it is?
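The idea that slightly stale data is acceptable can be sketched as a time-to-live (TTL) cache sitting in front of the origin server. This is only an illustration of the principle, not how Cloudflare or any scraper site is actually set up; `TTLCache`, `fake_origin`, and the 300-second TTL are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Serve cached responses until they are older than ttl_seconds."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch      # stands in for a request to the origin server
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._store: Dict[str, Tuple[float, str]] = {}
        self.origin_hits = 0     # how often the origin was actually contacted

    def get(self, key: str) -> str:
        now = self._clock()
        cached = self._store.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]     # fresh enough: the origin is not contacted
        self.origin_hits += 1
        value = self._fetch(key)
        self._store[key] = (now, value)
        return value


def fake_origin(game_id: str) -> str:
    """Placeholder for a slow, overloaded metadata server."""
    return f"metadata for {game_id}"


cache = TTLCache(fake_origin, ttl_seconds=300)
for _ in range(1000):
    cache.get("game-123")
print(cache.origin_hits)
```

With a 5-minute TTL, the thousand requests in the loop reach the origin once; the rest are answered from the cache. Whether a real deployment could cache this aggressively depends on things the thread does not settle, such as per-user API keys and quotas.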

@mitu said in Could scraper data be cached online?:

While in theory it sounds nice, in practice you're just trying the old "let's add another layer of indirection/caching" method to solve a scalability problem.

Not sure I agree, or maybe I don't follow. Skyscraper is already the layer that's doing this. Instead of everyone having their own local cache, they share one?

@muldjord said in Could scraper data be cached online?:

Technically what you suggest is pretty easy I believe. [...] Would be cool though.

It would be cool, wouldn't it? 😁 My retro-cloud project already sets up one VM and one File Share per user, and that uses Skyscraper. It'd be easy to change it so everyone uses the same VM and File Share (for the cache, not the ROMs, as that's a very different bag of "permission issues"). The main problem for me would be the cost. But somehow, not doing it just feels irresponsible: it wastes so much processing power, physical space, electricity, and resources overall. If I ever get to it, I'll show you a demo. (I too have other things I'd rather be doing right now, including learning to play a musical instrument, so I totally understand your decision on Skyscraper.)

Clyde

@seriema said in Could scraper data be cached online?:

Like both of you mention, I'm totally ignoring the human aspect around permissions. At least until I understand why they want to manage their own (often crashing-under-the-weight) servers.

One issue would be the need to coordinate many more people than before: make group decisions, set up rules and hierarchies, etc. Don't underestimate the time and effort for that, and most (all?) of these sites are operated by (small groups of) private individuals in their free time.

Maybe you can help me understand why they aren't caching things harder as it is?

Money? (again, private people) That said, are you donating to some of them already? That would be a start to address the problem. Screenscraper.fr even gives you some privileges for it: ongoing access on high server load, more scrapes per day, and multiple scraping threads.

It would be cool, wouldn't it? 😁

Maybe, it definitely does sound so "on paper". But nobody but the operators can tell if it is really feasible. You could try and ask them, maybe they'll answer or even welcome your suggestion? (But don't get your hopes up too high for the latter, as I doubt that they never thought about something like this by themselves.)

seriema

@Clyde said in Could scraper data be cached online?:

Money?

The point of a cache server or CDN is that it's cheaper than trying to scale the server. Or at least, that's what I've always assumed.

@Clyde said in Could scraper data be cached online?:

are you donating to some of them already?

I'm a monthly donor to both RetroPie and Screenscraper.fr 🥰

@Clyde said in Could scraper data be cached online?:

You could try and ask them, maybe they'll answer

I was hoping some of them hung around here and would see this post. I'm guessing they don't. Screenscraper.fr, which I've had the most problems with so far (going offline), has its chats in French. 😥

BenMcLean

Why isn't screenscraper.fr's content mirrored on the Internet Archive? Seems like exactly the sort of thing they'd collect.