Versatile C++ game scraper: Skyscraper

timb

@muldjord
Okay, so it compiled and installed fine, but immediately I ran into a bit of an issue. This is the output of the first two NES games it tries to scrape:

#1/733 (T1) Pass 1 Pass 2 Pass 3 Pass 4 ---- Game '10-Yard Fight (USA, Europe)' not found :( ----


Debug output:
Tried with sha1, md5, filename: '10-Yard%20Fight%20%28USA%2C%20Europe%29.7z', 'DA39A3EE5E6B4B0D3255BFEF95601890AFD80709', 'D41D8CD98F00B204E9800998ECF8427E'
Platform: nes
Tried with sha1: DA39A3EE5E6B4B0D3255BFEF95601890AFD80709
Platform: nes
Tried with md5: D41D8CD98F00B204E9800998ECF8427E
Platform: nes
Tried with name: 10-Yard%20Fight%20%28USA%2C%20Europe%29.7z
Platform: nes

Elapsed time: 00:00:02
Estimated time left: 00:35:22

#2/733 (T1) Pass 1 Pass 2 Pass 3 Pass 4 ---- Game '1942 (Japan, USA)' not found :( ----


Debug output:
Tried with sha1, md5, filename: '1942%20%28Japan%2C%20USA%29.7z', 'DA39A3EE5E6B4B0D3255BFEF95601890AFD80709', 'D41D8CD98F00B204E9800998ECF8427E'
Platform: nes
Tried with sha1: DA39A3EE5E6B4B0D3255BFEF95601890AFD80709
Platform: nes
Tried with md5: D41D8CD98F00B204E9800998ECF8427E
Platform: nes
Tried with name: 1942%20%28Japan%2C%20USA%29.7z
Platform: nes

Elapsed time: 00:00:04
Estimated time left: 00:29:06

Notice the MD5/SHA1 sums are the same on both? I know that's not correct. Here's the MD5 on the first title (extracted):

timb-mba:~ timb$ 7z e "~/10-Yard Fight (USA, Europe).7z" -so | md5
8caac792d5567da81e6846dbda833a57

(I set 7z to extract directly to STDOUT and piped it into md5 so it would work like it does in your program.)

Now, after scraping these first two files it starts to scrape fine:

#3/733 (T1) Pass 1 ---- Game '1943 - The Battle of Midway (USA)' found! :) ----
~SNIP~
Debug output:
Tried with sha1, md5, filename: '1943%20-%20The%20Battle%20of%20Midway%20%28USA%29.7z', '443D235FBDD0E0B3ADB1DCF18C6CAA7ECEEC8BEE', 'DEEAF9D51FC3F13F11F8E1A65553061A'
Platform: nes

Elapsed time: 00:00:18
Estimated time left: 01:15:41

So, it seems like the program is somehow hashing bad data at first? I took a cursory glance at the source but nothing instantly jumps out at me.

Also, something else I noticed: It seems the local cache lookup is still doing a SHA1 on the packed 7z file, instead of the unpacked file it contains. This means that it's re-downloading all the assets from screenscraper, instead of using the assets it previously downloaded when I scraped the extracted files. (I know this is a quick first pass of the feature, I just wanted to point it out.)

muldjord

@timb I'll look into this when I get the time. And yes, it will keep identifying the files from the sha1 of the actual file on the disk. That is not gonna change. You should see this as a tool to help you identify custom made 7z or zip files as they often differ from the ones in the screenscraper database. So I expect the user to have "completed" their files before they start scraping. Both solutions have caveats obviously. :)

timb

@muldjord
Personally, I think it makes more sense to cache based on the hash of the actual ROM file itself and not the hash of the compressed file that contains it (because the hash of a 7z, ZIP or RAR file may change, even though the data it contains does not change).
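To illustrate the point about archive hashes changing while the contents stay the same, here is a minimal Python sketch. It uses zip rather than 7z only because Python's standard library can write zip files; the file name `game.nes` and the stand-in ROM bytes are made up for the example, and none of this is Skyscraper code.

```python
import hashlib
import io
import zipfile


def make_zip(rom_bytes: bytes, compression: int) -> bytes:
    """Build an in-memory zip holding a single (hypothetical) ROM file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression) as zf:
        zf.writestr("game.nes", rom_bytes)
    return buf.getvalue()


def sha1_of_archive(archive: bytes) -> str:
    """Hash the archive file itself, as the local cache lookup does."""
    return hashlib.sha1(archive).hexdigest()


def sha1_of_first_member(archive: bytes) -> str:
    """Hash the unpacked contents of the first file inside the archive."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        name = zf.namelist()[0]
        return hashlib.sha1(zf.read(name)).hexdigest()


rom = b"NES\x1a" + bytes(range(256)) * 64  # stand-in data, not a real ROM
stored = make_zip(rom, zipfile.ZIP_STORED)
deflated = make_zip(rom, zipfile.ZIP_DEFLATED)

same_archive_hash = sha1_of_archive(stored) == sha1_of_archive(deflated)
same_content_hash = sha1_of_first_member(stored) == sha1_of_first_member(deflated)
print(same_archive_hash)  # False: the two archives are different files
print(same_content_hash)  # True: the ROM inside is byte-for-byte identical
```

The same ROM packed two different ways gives two different archive hashes, which is why a cache keyed on the archive misses after a repack.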

That said, you’re right, it is a lot simpler to just cache based on the base file’s hash. I can also understand why you wouldn’t want to spread the 7z unpack hack around further in the code. So I totally see your point. :)

Let me know if you need any more debug data on the incorrect hash bug.

muldjord

@timb Yes, I do see your point in that as well, if I change it to be an option for '--nounpack' instead (so the default would be to always unpack and id from the files inside zips in the local cache). But it would mean that all users' current localdb's would break in an instant... So I'd need to convert the sha1's on the fly from zip sha1's to contained file sha1's in order for that not to happen, which means I'd need to sha1 twice for a long time in order to compare the sha1 from the localdb. Ugh... :D

timb

@muldjord
Yeah, that’s kind of a mess. Though, there is a potential solution: I’m about halfway through writing a quick and dirty Python script that parses the database XML file and changes the SHA1 keys from the unpacked ROM to that of the packed 7z file. It could easily be modified to do the reverse. You could prompt the user to run it during an upgrade. Obviously it wouldn’t change entries where the original 7z file no longer exists, but the script could always trim those from the DB and run the Skyscraper cleanup option. Hmm, yeah, it seems totally doable.

(If you do eventually decide to go this route, I’d be happy to write a real production ready version of such a script with proper error handling and such. I could also do it in Bash (ugh) if you didn’t want any dependencies.)
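A rough sketch of what the re-keying step of such a script could look like. The XML layout here is an assumption made for illustration (a root element holding `resource` elements that each carry a `sha1` attribute); the real database file may be structured differently. The `sha1_map` argument is a hypothetical old-to-new lookup that a real script would build by hashing the files on disk.

```python
import xml.etree.ElementTree as ET


def rekey_resources(xml_text: str, sha1_map: dict) -> tuple:
    """Swap each resource's sha1 attribute using sha1_map.

    Entries with no mapping (e.g. the source file no longer exists)
    are trimmed from the tree. Returns (new_xml_text, dropped_count).
    """
    root = ET.fromstring(xml_text)
    dropped = 0
    for res in list(root.findall("resource")):  # copy: we mutate root below
        new_sha1 = sha1_map.get(res.get("sha1"))
        if new_sha1 is None:
            root.remove(res)
            dropped += 1
        else:
            res.set("sha1", new_sha1)
    return ET.tostring(root, encoding="unicode"), dropped


# Made-up sample data in the assumed layout.
sample = (
    '<resources>'
    '<resource sha1="aaa" type="title">10-Yard Fight</resource>'
    '<resource sha1="bbb" type="title">1942</resource>'
    '</resources>'
)
new_xml, dropped = rekey_resources(sample, {"aaa": "ccc"})
print(new_xml)
print(dropped)
```

Going in the reverse direction would only mean building the map the other way around.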

muldjord

@timb Thank you, I'll give it some thought. I'm leaning towards a solution that does it automatically for each localdb entry when it is requested. That is, if I decide to go ahead with it at all. :) I'm also quite happy with the solution as it is if I get that weird bug fixed you posted about earlier.

timb

@muldjord
Speaking of the weird bug, it gets weirder!

Total number of games: 733
Successfully scraped games: 661
Skipped games: 72 (Filenames saved to '~/.skyscraper/skipped-screenscraper.txt')

Now, these same 733 games scraped just fine as extracted ROMs. If I use 7z on the Pi to extract these problem files (the same 7z Skyscraper is using) they extract fine and show the correct hashes. So I know it's not the archive or ROMs that are the problem.

All 72 of the skipped files are returning the same hash (that I posted earlier, it doesn't appear to change). Weird, right?

muldjord

@timb I have a guess. Maybe it's because it checksums "no data" (or an error message) because of the way I use the QProcess read or something. It's very hacked. Haven't had time to look properly into this yet, I'm just guessing without having entirely checked your descriptions yet.
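The "no data" guess can be checked against the debug output: the two values that keep repeating there are the well-known SHA-1 and MD5 digests of zero bytes of input. A quick Python check (standard library only, nothing to do with Skyscraper's own code):

```python
import hashlib

# Digests of an empty input, uppercased to match the debug output format.
empty_sha1 = hashlib.sha1(b"").hexdigest().upper()
empty_md5 = hashlib.md5(b"").hexdigest().upper()

print(empty_sha1)  # DA39A3EE5E6B4B0D3255BFEF95601890AFD80709
print(empty_md5)   # D41D8CD98F00B204E9800998ECF8427E
```

So whatever is going wrong for those files, the hash functions are being fed nothing at all rather than an error message.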

timb

@muldjord
I had similar thoughts earlier. An easy way to see if it’s checksumming an error message from 7zip is to entirely suppress all output from it. We can do that with the following flags:

7z e -y -bd -bso0 -bse0 -bsp0 -so

That essentially completely disables console, error and progress output and also answers ‘yes’ to any prompts. It’s the closest you can get to a “-qq” flag. I’ll add that to the source this afternoon, recompile and see if it makes any difference.
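For testing outside of Skyscraper, the same flags can be tried from a small Python wrapper. This is only a sketch: it assumes a `7z` binary is on the PATH, the function names are made up, and it is not how the QProcess code in Skyscraper works. It refuses to return a hash when 7z fails or writes nothing, so an empty-input digest can't slip through silently.

```python
import hashlib
import subprocess

# Flags from the post above: answer yes, no progress indicator,
# silence standard/error/progress streams, write file data to stdout.
QUIET_FLAGS = ["-y", "-bd", "-bso0", "-bse0", "-bsp0", "-so"]


def build_7z_command(archive_path: str) -> list:
    """Return the argument list for extracting an archive to stdout."""
    return ["7z", "e", *QUIET_FLAGS, archive_path]


def sha1_of_unpacked(archive_path: str) -> str:
    """Hash whatever 7z writes to stdout, failing loudly on empty output."""
    result = subprocess.run(
        build_7z_command(archive_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode != 0 or not result.stdout:
        raise RuntimeError(
            f"7z gave no data for {archive_path!r} (exit code {result.returncode})"
        )
    return hashlib.sha1(result.stdout).hexdigest().upper()


print(build_7z_command("10-Yard Fight (USA, Europe).7z"))
```

Passing the arguments as a list avoids shell quoting problems with the spaces, commas and parentheses in the ROM file names.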

No rush on this, by the way. I appreciate you implementing it at all. :)

muldjord

Please git pull and try it again. I've improved it quite a bit and added further error checking. I'm interested in knowing if you still get those weird same-sha1 errors. Also, I've added a 20 meg limit to using the unpacking feature to try and avoid running into mem limitations on the pi. Hopefully this is temporary as I would like it to read chunks instead, but so far I haven't gotten that to work reliably. So for now it takes up as much RAM as the ROM takes up on disk to calculate the checksums. And if you run several threads AND the compositor is working with images, that spells trouble.
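The chunked approach mentioned above looks roughly like this in Python. It is a generic sketch, not the Qt code: `hash_stream` and the 64 KiB chunk size are illustrative choices, and the stream could be any file-like object, such as the stdout pipe of an unpacker process.

```python
import hashlib
import io

CHUNK_SIZE = 64 * 1024  # arbitrary; memory use is bounded by this, not by ROM size


def hash_stream(stream, chunk_size: int = CHUNK_SIZE) -> tuple:
    """Feed a stream to SHA-1 and MD5 one chunk at a time.

    Returns (sha1_hex, md5_hex, total_bytes_read).
    """
    sha1 = hashlib.sha1()
    md5 = hashlib.md5()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        sha1.update(chunk)
        md5.update(chunk)
        total += len(chunk)
    return sha1.hexdigest(), md5.hexdigest(), total


data = b"x" * 200_000  # stand-in for unpacked ROM data
chunked_sha1, chunked_md5, size = hash_stream(io.BytesIO(data), chunk_size=1000)
print(chunked_sha1 == hashlib.sha1(data).hexdigest())  # True
print(size)  # 200000
```

Returning the byte count alongside the digests also gives a cheap way to catch the zero-byte case before trusting the hash.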

timb

@muldjord

Whatever you did seems to have fixed it! I’ll let you know how the other platforms go (I’ve only tried NES so far). I’m just putting the finishing touches on my script that swaps the SHA1 hashes in the DB files; I want to run it first before scraping the other platforms.

A 20MiB limit seems sensible for now. I don’t think any of my ROMs come close to that. (I keep my N64 stuff uncompressed as the emulator doesn’t support on the fly decompression.)

A Former User

The only time I could see the 20 MiB limit coming into play is if someone has zipped PSX PBP files. I am unsure if the PSX emulators allow for zipped media though.

muldjord

@livefastcyyoung Yes, it would probably only be relevant for the "newer" platforms such as psx and n64 so I think it'll be ok. Thanks for your input guys, I appreciate all of it!

timb

@muldjord
So once I converted all the hashes in the database with my script, it seems to have successfully scraped everything just fine. Thanks for getting the basics of this feature working! :)

muldjord

@timb Glad it works :) And you're welcome! I'll probably make a release with this in a few days.

SteveW25561

This tool is amazing! Thanks for your work on this, @muldjord

Is there a way to get Skyscraper to scrape ALL of the systems in my RetroPie directory (or a selected set), rather than specifying one at a time? The selph scraper allows you to choose "ALL" or "Selected" systems, and I was looking for the same in Skyscraper but don't see it in the docs.

Also, I just scraped a bunch of MAME games and many of the videos Skyscraper got back were not playable (initial simple mode scrape). I can see altbeast.mp4 (2.8 MB) or centiped.mp4 (928 KB) for example, and they won't play via Mac or on the Pi. Any way to fix this?

muldjord

@stevew25561 Thank you, glad you like it! :) No, there is no way to scrape all platforms, you'll have to script that yourself. :)
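One possible way to script it, as a sketch in Python. It assumes the ROM folders live under one root (on a stock RetroPie that would be `~/RetroPie/roms`) and that each folder name is also a valid Skyscraper platform name, which may not hold for every system. The helper names are made up; `-p` and `-s` are Skyscraper's platform and scraping-module options.

```python
import subprocess
import tempfile
from pathlib import Path


def platform_commands(roms_root: Path, module: str = "screenscraper") -> list:
    """Build one Skyscraper command per non-empty platform folder."""
    commands = []
    for entry in sorted(roms_root.iterdir()):
        if entry.is_dir() and any(entry.iterdir()):
            commands.append(["Skyscraper", "-p", entry.name, "-s", module])
    return commands


def scrape_all(roms_root: Path) -> None:
    """Run the commands one after another; needs Skyscraper on the PATH."""
    for cmd in platform_commands(roms_root):
        subprocess.run(cmd, check=False)


# Dry run against a throwaway folder layout instead of a real ROM directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nes").mkdir()
    (root / "nes" / "1942 (Japan, USA).7z").write_bytes(b"")
    (root / "snes").mkdir()  # empty folder, skipped
    example_commands = platform_commands(root)

print(example_commands)  # [['Skyscraper', '-p', 'nes', '-s', 'screenscraper']]
```

For a selected set of systems, filter the list by folder name before running it.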

Did you get those videos from screenscraper or arcadedb? I just scraped centiped from both arcadedb and screenscraper and they play just fine with mplayer. I did notice that it has some weird dimensions because centipede is a vertical game. So I'm guessing it's because EmulationStation has issues with that. Not really something I can fix I'm afraid, as I basically just download the videos and save them as is.

parasven

@muldjord
Is there a way to get the media from screenscraper that actually belongs to the scraped game's region? For example, there are multiple covers for this game:

007 - The World Is Not Enough:
https://www.screenscraper.fr/gameinfos.php?gameid=102752&action=onglet&zone=gameinfosmedias

The covers are different for different regions of the game. Is it actually possible to get the cover for the German version through Skyscraper?

In the Skyscraper source I found the following regions:
eu
us
ss
uk
wor
jp

Are there more options for the region parameter?
What do the parameters ss and wor stand for?
wor = world?

muldjord

@parasven If you want the German ones, just use '--region de' I believe it is. 'ss' simply means 'screenscraper' and is a generic region they apply to any media they don't have a region for. The regions listed in the source are just the priority list I use internally. 'wor' just means world I'm guessing. Probably for games that are regionless.

There are many more options, they are listed with this call (Chrome can view this url directly, otherwise save it and open it in a text editor):
https://www.screenscraper.fr/api/regionsListe.php?devid=xxx&devpassword=yyy&softname=zzz&output=xml&ssid=test&sspassword=test

Btw, keep in mind that even though you set '--region de' it doesn't mean it will find german versions of all the covers. It always falls back to the internal regions if the one provided manually can't be found. So you will still see the others for the ones where a 'de' version didn't exist.
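The fallback described above can be pictured like this. It is a simplified Python sketch, not the actual Skyscraper code: `pick_media` and the `media_by_region` dictionary are made up for the example, and the internal list is assumed to be tried in the order it was quoted earlier in the thread.

```python
# Internal priority list as quoted in the thread (order assumed).
INTERNAL_REGIONS = ["eu", "us", "ss", "uk", "wor", "jp"]


def pick_media(media_by_region: dict, preferred: str = None):
    """Return (region, media) for the first region that has media.

    The user's region is tried first, then the internal list.
    Returns None when nothing matches.
    """
    order = [preferred] if preferred else []
    order += [r for r in INTERNAL_REGIONS if r != preferred]
    for region in order:
        if region in media_by_region:
            return region, media_by_region[region]
    return None


with_de = pick_media({"us": "cover-us.png", "de": "cover-de.png"}, "de")
without_de = pick_media({"us": "cover-us.png", "jp": "cover-jp.png"}, "de")
print(with_de)     # ('de', 'cover-de.png')
print(without_de)  # ('us', 'cover-us.png')
```

The second call shows why a '--region de' scrape can still end up with a US cover: no 'de' entry exists, so the first internal region that has one wins.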

parasven

@muldjord
Thank you very much for that list. I found some of these regions by hand with trial and error hehe

Love your scraper btw. It is very fast and works like a charm. The localDB is a very cool feature to be honest :)