Versatile C++ game scraper: Skyscraper

muldjord

@chipsnblip This is certainly an interesting concept. I've asked about the issues at TheGamesDb's forum, and they actually seem to seek out a decentralized system (not p2p though). If I understand it correctly, they want to allow a bunch of copies of their database to be downloaded from the get-go. Then those will be placed on mirrors that are used by any app. And instead of re-downloading the entire database each month, they supply updates instead.

Unfortunately that doesn't help much in Skyscraper's case, as I don't have the option of running a Skyscraper game database anywhere.

So I guess they see thegamesdb.net as the central hub, where everyone will update and add games. And then they have, sort of, slaves that mirror the updates every once in a while to stay updated. And any app would use those instead of the central one.

Their core api is still not finished, so time will tell how they decide to do this.

muldjord

This is tiresome to say the least. In my talks with the 'thegamesdb' team we've basically been misunderstanding each other for a while now. It seems that the key I have, is in fact an app API key (and not a user key). And the limit is not on a per-key basis, but on IP!!!! So simple, so many misunderstandings.

Bottom line: It seems like I can use my 'thegamesdb' implementation, and each user WILL NOT need their own key. At least not for the time being. There is a per-IP request limit though, and this means that I will remove 'thegamesdb' as a part of the Simple Mode scraping process. Instead users can use it for 1000-2000 ROM scrapes per month, and that's it for the moment. In time there might be the option of adding to this limit with user keys, where you earn requests by helping out with the database. It seems that this is still being worked out.

I would like to thank the guys at 'thegamesdb' for helping out with this.

muldjord

Skyscraper version 2.5.0 released: https://github.com/muldjord/skyscraper

'thegamesdb' removed from Simple Mode scraping scripts due to new api restrictions
Implemented new 'thegamesdb' api, be wary of monthly request limit!
Made sure 'remaining requests' is clearly stated when using 'thegamesdb'
Implemented request limit test which makes Skyscraper stop if limit is reached
Made sure cli header always has correct number of dashes

So, as mentioned, the 'thegamesdb' module has been cut down to size quite severely. The current API only allows 1000 requests per month (this might change, but this is what we have at the moment)! And 1 rom scrape takes up 2 requests. That means just a meager 500 roms can be scraped using this module PER MONTH. I have therefore removed it entirely from the Simple Mode scripts, as it just doesn't make sense to have in there.
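To make the arithmetic above concrete, here is a minimal sketch of a request budget check. It is not Skyscraper's actual code: the `RequestBudget` type and its member names are hypothetical, and the figures (1000 requests per month, 2 requests per ROM) are simply the ones quoted in this post.

```cpp
#include <cassert>

// Hypothetical helper, not taken from Skyscraper: tracks a monthly request budget.
struct RequestBudget {
    int remaining; // requests left this month, e.g. as reported by the service

    // Number of ROMs that can still be scraped if each costs `requestsPerRom` requests.
    int romsLeft(int requestsPerRom) const {
        assert(requestsPerRom > 0);
        return remaining / requestsPerRom;
    }

    // Spend the requests for one ROM. Returns false, and spends nothing,
    // when the budget is too low, so the caller can stop scraping.
    bool spendForRom(int requestsPerRom) {
        assert(requestsPerRom > 0);
        if (remaining < requestsPerRom) {
            return false;
        }
        remaining -= requestsPerRom;
        return true;
    }
};
```

With `RequestBudget budget{1000};`, `budget.romsLeft(2)` gives the 500 ROMs per month mentioned above, and a scraping loop could stop as soon as `spendForRom(2)` returns false.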

As always, please update, and let me know if you run into trouble.

muldjord

Ok, I'm pretty much done with 'thegamesdb'. I've been increasingly aggravated about their answers and non-answers over the past week. I do not want to spend more time on them, unless they put up a friendly face. This is not worth my time. Clearly automated scrapers aren't welcome on their service anymore. They excuse themselves by asking us to set up our own mirror, which is just not an option for most people. And when I propose a higher request limit, they pretty much turn their backs on me and start accusing me of "farming".

The current implementation will stand for the time being. I might remove it entirely. Pretty much depends on how the team behind the service treats me from here on out. I make Skyscraper for the fun of it. And this is NOT fun.

muldjord

Ok, so I've spent the past hour being aggravated about this situation. It's not worth it. I will remove 'thegamesdb' from Skyscraper in the next release. If they change their stance in the future, I might re-add it. But for the time being, it's not in a usable state, unless I host my own mirror of their data, which, obviously, is not an option for most people.

I'm sad to see it go, but this is how it has to be in order for me to stay sane.

muldjord

I've rewound to 2.5.0. I see no reason not to keep the limited 'thegamesdb' implementation, just be aware of the limit.

circo

I've read the conversation, they were surprisingly hostile about it... :/
I'm sorry, that sucks.

AnalogHero

Sad to hear. I can't find the thread on their forums, maybe because I'm not registered there.

I can understand that servers cost money, and scrapers can produce a lot of traffic.
I just tested 2.5.0 with gamesdb as the scraper module and it worked (counted down from 1000 uses). I can't understand what was wrong with this solution.

Used2BeRX

Well if anybody can figure out a good way to host this so everybody could use everything without any hassle or drama by the time I put my release out, I'll be happy to do the work there so everybody can enjoy it.

So far 2,118 unique NES/FDS games are covered. This includes (or will include by release time) Box Art, Cartridge/Disk Art, Title and Action screenshots, "3D" Box Art, synopsis files that contain all the game information for the gamelist.xml tags and a lot of info that RetroPie doesn't have tags for, HD video previews, Game Manuals (either PDF or Zipped JPGs or both... so far around 950 of the games have them), GameFAQs zipped for most official titles and some other goodies.

At some point when I feel they're ready, I could probably release the synopsis files first so you guys aren't waiting another 6 months + for at least that part. I proof read a ton on those, and I believe they're the best descriptions available out there for the games, including many of the obscure pirates and unlicensed games that usually aren't covered on any of the major gaming sites that would be scraped with this scraper. I also removed all of the "weird" characters that don't like to show up properly in either RetroPie or on the XBox, so there are no strange "empty box" characters in the descriptions anymore.

I've got a few days off. I'm really going to make an effort to get my spreadsheet where I want it to be for a public release so everybody can see what progress has been made so far and follow along if they're bored. :)

muldjord

@analoghero I have re-added support for 'thegamesdb'. Just be wary of the limit. Time will tell if it changes.

muldjord

@used2berx Have you considered somehow uploading the information you are creating to screenscraper.fr? That would be the optimal way of making use of it in Skyscraper. It sounds like you have a pretty much perfect collection of data on your hands, I'm pretty sure they would appreciate the data if it could be automated somehow. I don't know if you're interested in working with them on that. They have been very friendly towards me whenever I've contacted them, so you could consider that if you wish.

cyperghost

@muldjord You're right! Maybe they'll reconsider their decision; time will tell. But I think 1000 entries per unique IP are enough for a user.

What is this "queue" thing? I understand this as connection request to their server and with one request you can do 20 actions. So in theory you can retrieve 10.000-20.000 entries per IP - or am I wrong?

muldjord

@cyperghost If I understand it correctly, which I might not, a response from their API can contain up to 20 game results per request. So optimally, if I knew the IDs of the games in their database beforehand, I could request a comma-separated list of 20 specific IDs per request, and all of those 20 games would be returned to me in a JSON answer. The problem is that I do not know the IDs; that's what Skyscraper is trying to figure out. So I'd have to search for the filename one at a time instead, find the best result and its ID, and then fetch the data. So I'd use 2-3 requests per game.

But I might have misunderstood this completely, which it seems I have a habit of doing when it comes to the new API. Pretty embarrassing to be honest.
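As an illustration of the batching idea, assuming the IDs were already known, a sketch like the following could build the comma-separated lists. The function name `batchIds` is hypothetical, the limit of 20 per request is only my reading of the API as described above, and no actual endpoint or parameter name is implied.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits already-known game IDs into comma-separated
// batches of at most `batchSize` IDs each (20 is an assumed limit).
std::vector<std::string> batchIds(const std::vector<int>& ids,
                                  std::size_t batchSize = 20) {
    assert(batchSize > 0);
    std::vector<std::string> batches;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (i % batchSize == 0) {
            batches.emplace_back(); // start a new, empty batch
        } else {
            batches.back() += ',';  // separate IDs within a batch
        }
        batches.back() += std::to_string(ids[i]);
    }
    return batches;
}
```

Each returned string would then go into one request, so 45 known IDs would need 3 requests instead of 45.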

cyperghost

@muldjord Well ... I think there is a possibility to get these IDs. I think it's just a checksum of the ROM files (surely rearranged and changed with arithmetic)

Or am I completely wrong this time?

mitu

@cyperghost said in Versatile C++ game scraper: Skyscraper:

Or am I completely wrong this time?

I'd say yes, the ID refers to the (internal) identifier of the game in thegamesdb database, not the hash of the file. You search by a game name (don't know if their API has a 'search by hash' option) and you get a list of games with their IDs.

cyperghost

@mitu Yes you're right ;)
The id for Sonic the Hedgehog is just 5544
So the call to this is only ...
httpx://thegamesdb.net/game.php?id=5544

muldjord

@mitu Correct, they don't currently support hash searches as screenscraper does. The id is just a numeric identifier starting from 0 and going upwards for any new game added it seems.

cyperghost

@muldjord Well I think for a single IP 1000 calls is okay ;)
You can check if data is received, and if there is a failure, report it to the user ... and I think you're fine with this. Not the best solution to satisfy all needs, but good enough to go.

muldjord

@cyperghost Yes, I agree that it should be usable for some minor installations. And obviously it is not a good thing if people are scraping 50000 games at a time no matter what source they are hammering. So I have never been against limits, they are necessary.

I would love it if TheGamesDb would support md5 and sha1 hashes as well though. Then I could fetch 1 game per request instead of using two requests for 1 game. But I think I've worn out my welcome for now, so I'll leave them be and spare myself any further embarrassment.

circo

@muldjord said in Versatile C++ game scraper: Skyscraper:

So I'd use 2-3 requests per game.

You could try stringing them together? As in, first you send the search requests for the individual games to get the IDs. Then you string those IDs together and send a single request with the comma-separated IDs to fetch the metadata for every game being scraped at once.
This could reduce the number of requests to about n+1 (n searches plus one batched fetch per 20 IDs), where n is the number of games being scraped.
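A rough sketch of how the request counts compare, assuming the figures mentioned earlier in the thread (one search request per game, at most 20 IDs per metadata request, 1000 requests per month). All function names here are hypothetical and none of this is verified against the real API.

```cpp
#include <cassert>

// Current approach as described in the thread: one search plus one fetch per game.
int perGameRequests(int games) {
    assert(games >= 0);
    return 2 * games;
}

// Batched approach: one search per game, plus one metadata request
// per batch of up to `batchSize` IDs (rounded up).
int batchedRequests(int games, int batchSize = 20) {
    assert(games >= 0 && batchSize > 0);
    return games + (games + batchSize - 1) / batchSize;
}

// Largest number of games whose batched request count fits in `budget`.
int maxGamesWithinBudget(int budget, int batchSize = 20) {
    int games = 0;
    while (batchedRequests(games + 1, batchSize) <= budget) {
        ++games;
    }
    return games;
}
```

Under these assumptions, 20 games cost 21 requests instead of 40, and a 1000-request budget stretches from 500 games to 952. The n+1 figure holds exactly only while all IDs fit into a single batch.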