Versatile C++ game scraper: Skyscraper
-
Good news - that worked perfectly!
Bad news - "Hitman™" is still problematic, with an error message like this:
Couldn't calculate sha1 hash sum of rom file 'Hitman���.ml', please check permissions and try again, now exiting...
I think I'm just going to rename that one. Perhaps the Moonlight script should strip such odd characters from filenames.
-
Skyscraper 2.9.5 released: https://github.com/muldjord/skyscraper
- MAJOR: Added option "--purgedb vacuum" which vacuums all resources not related to your current romset. Remember to make backups of your cache before using this
- MAJOR: Added option "--purgedb all" that purges all resources for the selected platform. Remember to make backups of your cache before using this
- MAJOR: Added "--symlink" option which forces cached videos to be symlinked to destination instead of being copied when scraping with the "localdb" scraping module
- MAJOR: Added "esgamelist" emulationstation gamelist.xml scraping module. Contributed by "mgerhardy". Rewritten by me to better conform to Skyscraper design
- MAJOR: Added aliasMap.csv that forces the use of a title alias when searching for specific filenames
- Removed version bracket tag for Amiga lha files
- Improved getCompareTitle for mame games and lha files
- Code cleanup for sqrNotes
- Added the "ti99" platform. (Thank you to "jhbeskow" for suggesting it)
Lots of cool stuff in this release. A new platform was added (ti99). Several improvements to the --purgedb option made it in, which now allows you to vacuum (--purgedb vacuum) and clear out (--purgedb all) the cache for the currently selected platform.
On top of this, the new esgamelist scraping module was also added, which scrapes data and artwork from a local EmulationStation gamelist.xml. It was originally contributed by mgerhardy, but I had to rewrite most of it to conform better with the Skyscraper code. The idea is that you can now use any scraper to create an EmulationStation gamelist.xml and then use the esgamelist module to import that data into Skyscraper's local database cache. When that's done, you can make use of it by prioritizing it in the ~/.skyscraper/dbs/[platform]/priorities.xml file and scraping the platform with Skyscraper -p [platform], which defaults to scraping it with -s localdb where the data is stored. Should come in handy. :) So thanks to mgerhardy for providing the initial code and the idea!
I've also added the new file ~/.skyscraper/aliasMap.csv, which is a lookup file that allows you to create an alias for any filename (thanks to silent for suggesting this). So if you're having trouble scraping a game called Game Name Which Is Waaaaaaay Too Long, you can try adding an alias for it to, for instance, just be Game Name. The file will be created the first time you run Skyscraper after updating to this version. The file contains instructions on how to use it. I will also document it at some point on GitHub. Maybe later today.
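To illustrate the idea, here is a minimal Python sketch of an alias lookup. It assumes the two-column "filename";"alias" layout used in the example entry posted further down in this thread. The function names and the handling of '#' comment lines are assumptions made for illustration, not Skyscraper's actual implementation.

```python
import csv
import io


def load_alias_map(text):
    """Parse lines of the form "filename";"alias" into a dict."""
    aliases = {}
    reader = csv.reader(io.StringIO(text), delimiter=";", quotechar='"')
    for row in reader:
        # Skip blank lines, malformed rows and comment lines
        # (comments starting with '#' are an assumption).
        if len(row) != 2 or row[0].lstrip().startswith("#"):
            continue
        aliases[row[0]] = row[1]
    return aliases


def search_title(filename_base, aliases):
    """Return the alias if one is defined, otherwise the base name unchanged."""
    return aliases.get(filename_base, filename_base)


sample = '"Game Name Which Is Waaaaaaay Too Long";"Game Name"\n'
aliases = load_alias_map(sample)
print(search_title("Game Name Which Is Waaaaaaay Too Long", aliases))
print(search_title("Some Other Game", aliases))
```

A filename without an entry falls through unchanged, so the map only affects the titles you explicitly list.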
Last of the more prominent features is the --symlink option, which will simply symlink any videos from the cache instead of copying them when creating the gamelist for the selected frontend. This will save space, but comes with the caveat that if you remove the video from the cache, the link will break! So please be aware of this. :)
Let me know what you think of these new features.
Merry Christmas if that's your thing, and happy scraping!
-
Interesting - I've been scraping my newly purchased games via Moonlight, and I ran into a case where:
- Game file was called "Worms W.M.D"
- thegamesdb scraped it fine, mobygames did not because there it is called "Worms: W.M.D."
- I therefore added "Worms W.M.D";"Worms: W.M.D." to aliasMap and scraped again - this time both scrapers worked fine.
In other words, case:
Compare title: 'Worms: W.M.D.' Result title: 'Worms W.M.D'
works fine, but the opposite returns no matches for mobygames.
Can I find out if it's mobygames not matching this query or Skyscraper rejects it because it's not similar enough?
-
@Silent said in Versatile C++ game scraper: Skyscraper:
Can I find out if it's mobygames not matching this query or Skyscraper rejects it because it's not similar enough?
It says so in the output.
-
This looks awesome! I can't wait to try this scraper out. Do you have any plans to include video scraping options? It would be awesome if you could specify a "maximum video length" and have the scraper cut the videos to that length after download, and also fix the format / codec, etc. I have a script that does it using ffmpeg, but haven't been able to make something that will run on non-Windows systems.
-
@PC No current plans to expand the video functionality. Currently it relies entirely on the video that the source delivers. But it's, as you mention, pretty easy to mass convert the videos. You can easily just convert all of the videos in the cache with ffmpeg to suit your needs.
I could implement some ffmpeg calls that simply run those commands, but it's an "ugly" solution codewise. So I think I'll let other tools handle this.
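As a rough sketch of what such a mass conversion could look like outside of Skyscraper, the Python snippet below builds one ffmpeg command per video and runs it. The directory arguments, the 30-second limit, the .mp4 extension and the codec choices (libx264 and aac) are all assumptions for illustration. Only standard ffmpeg options (-y, -i, -t, -c:v, -c:a) are used.

```python
import subprocess
from pathlib import Path


def build_trim_command(src, dst, max_seconds):
    """Build an ffmpeg command that keeps only the first max_seconds of src."""
    return [
        "ffmpeg",
        "-y",                    # overwrite dst if it already exists
        "-i", str(src),          # input video
        "-t", str(max_seconds),  # stop writing after this many seconds
        "-c:v", "libx264",       # re-encode video as H.264 (assumed target codec)
        "-c:a", "aac",           # re-encode audio as AAC (assumed target codec)
        str(dst),
    ]


def convert_all(video_dir, out_dir, max_seconds=30):
    """Trim and re-encode every .mp4 file in video_dir into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(video_dir).glob("*.mp4")):
        command = build_trim_command(src, out_dir / src.name, max_seconds)
        subprocess.run(command, check=True)  # requires ffmpeg on the PATH


# Show the command for a single file without running ffmpeg.
print(" ".join(build_trim_command("in.mp4", "out.mp4", 30)))
```

Writing to a separate output directory keeps the originals intact, so it is worth checking a few results before replacing anything in the cache.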
-
Mild request: Could --nobrackets be ignored when processing imported data? It should be fair to assume that "User knows better" and stripping brackets there makes no sense.
Use case: User wants to have (U) [!] {BESTVERSION} stripped but still keep some brackets, e.g. Gran Turismo 2 (Arcade). With this, it could just be defined in an import.
-
@Silent It doesn't work like that (it's not source filtered), so that would be a no.
-
I just had a realization (and a case where I tested this) - for Screenscraper module, it's taking checksums of .cue files, which is making queries extremely sensitive to filename changes. Have you considered adding a pass using checksums of corresponding .bin files?
Yes, I realize it would be painfully slow to hash a big .bin file, so if anything that should be an opt-in option with a very clear "Please be patient, go get yourself some tea or do something useful for once" message.
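For reference, hashing a large .bin file does not need much memory if it is read in chunks. The Python sketch below shows that approach. SHA-1 is used only because the error message earlier in this thread mentions a sha1 sum; which checksum a Screenscraper lookup actually relies on is not confirmed here, and the function name and chunk size are assumptions.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 of a file, reading it one chunk at a time."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Small self-contained demo using a temporary file instead of a real .bin.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
print(sha1_of_file(tmp.name))
os.remove(tmp.name)
```

The run time is still bound by how fast the disk can be read, so this does not remove the "go get yourself some tea" part for big images.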
I also realize --query can handle this case - so if you're not a fan of this option (honestly, waiting minutes till hashes are done may be sub-optimal), maybe we could get something parallel to aliasMap, but for Screenscraper hashes? Some kinda "don't bother calculating, use this hash instead" file, so games with .cue and .bin files could be manually tailored like this.
EDIT:
I have great success scraping my PSX roms with --query, so yes, IMO a hashMap.csv file identical to how aliases work would be great. Objectively better than making Skyscraper hash .bin files together with .cue, as then 1) matching filenames would not be a concern 2) you can hash files from a PC and do it just once.
EDIT2:
Just so I don't double post, another unrelated idea - what do you think about allowing Moonlight .ml extensions (once it's part of RetroPie-Setup of course) for all emulators, like .zip and .7z are handled now? Technically you can stream any emulator to pi using it, so it'd be very logical to allow it to be scraped from everywhere. I just set up a ps2 system like this and it works wonders.
-
@Silent said in Versatile C++ game scraper: Skyscraper:
EDIT2:
Just so I don't double post, another unrelated idea - what do you think about allowing Moonlight .ml extensions (once it's part of RetroPie-Setup of course) for all emulators, like .zip and .7z are handled now? Technically you can stream any emulator to pi using it, so it'd be very logical to allow it to be scraped from everywhere. I just set up a ps2 system like this and it works wonders.
I wouldn't be opposed to that as long as it makes sense. But let's talk about that when we get there.
-
I am working towards the 3.0.0 release and I am planning a change which will clarify the gather and combine paths in Skyscraper. Basically the change is simply this:
Only when scraping with the localdb module will the artwork and game list generator be run.
In other words: When scraping with anything other than localdb, it won't save a gamelist.xml and composite the artwork. Ever.
So why am I changing this? For several reasons. I almost always personally forget to add --pretend whenever I gather data from any of the non-localdb modules. This means that hundreds or thousands of image files are written to disk and my gamelist.xml is overwritten with data from just that one source. Processing artwork slows down the scraping process significantly and hammers the SD card for no reason. And when using Skyscraper you should always scrape with localdb after having gathered data from any of the other modules anyways. I've outlined that a bit here.
With this in place I will of course also give the user better tools to exclude certain sources when scraping any platform. So if you want to scrape from the cache but only allow resources from one or more sources, you will be able to do that.
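The gather-then-combine split can be pictured with a small Python sketch: each online module only adds resources to a cache, and a separate combine step picks, per resource, the value from the highest-priority source. This is a conceptual illustration only. The dictionary layout, the function name and the sample values are made up and do not reflect Skyscraper's cache format or its priorities.xml.

```python
def combine(cache, priorities):
    """For each resource type, keep the value from the highest-priority source."""
    combined = {}
    for source in priorities:  # earlier entries in the list win
        for resource, value in cache.get(source, {}).items():
            # setdefault only stores the value if no higher-priority
            # source has already provided this resource.
            combined.setdefault(resource, value)
    return combined


# Hypothetical cache contents after gathering from two sources.
cache = {
    "screenscraper": {"title": "Worms W.M.D", "cover": "ss_cover.png"},
    "thegamesdb": {"title": "Worms W.M.D", "description": "Sample description."},
}

print(combine(cache, ["thegamesdb", "screenscraper"]))
```

Leaving a source out of the priority list excludes its resources entirely, which is the kind of per-source filtering described above.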
Please let me hear your thoughts on this change. I know some of you might be against this change, but please read the above before jumping to conclusions. :)
-
Those are very good changes! It IMO makes sense that scraping from online sources should not update your gamelists, as more often than not you'll want to scrape your brand new ROMs from multiple sources and then have localdb output the best of the best.
This change should make the difference between "scraping" and "generating gamelists" pretty well defined - admittedly, that was a tiny bit confusing for me when I started using Skyscraper, but with this new behaviour it should be clear.
-
Has anyone got the igdb scraping module to work? I have tried passing my credentials with the -u command line option in addition to trying to use config.ini. I have also tried using my api key and my userid:pw and can't get any combination to work. I only have the free account with igdb. Any suggestions are appreciated!
Thanks!
-
@sglavach All keys given out by igdb currently are APIv3 keys. I am in the process of converting the module to the new API, so until then it won't work.
-
I've been talking to the good people at IGDB and gotten some things cleared up. The "user-key" provided to me for the API is only meant for developers. As such, the 10k monthly limit should be applied to all Skyscraper users in total, not 1 key per user. So this will be changed to use a hidden key instead. The good thing is that people will then no longer need to supply their own to use it. The caveat is the limit obviously. But to keep the databases stable, we need to adhere to these things. And I certainly will with Skyscraper.
-
That's great news! From what I have seen IGDB has very good resources, so it may improve quality of some scrapes (especially for newer games) significantly.
-
really nice tool
skyscraper+screenscraper was able to scrape 98% of all my nintendo and sega roms. :D
however i do have a problem when scraping fba and mame roms.
as example:
1941 when scraped with sselph scraper returns the name "1941: Counter Attack (World 900227)" from arcadeitalia and 1941u returns the name "1941: Counter Attack (USA 900227)", but when i scrape it with skyscraper, all 1941 variants have the name "1941: Counter Attack" and its impossible to tell them apart.
is there a way to force it to return the name in the same way as sselph's scraper?
i noticed that the nintendo and sega games show version and region in the name, but my guess is thats maybe from the filenames?
idk
anyway, it would also be nice to see a detailed description of the options in config.ini, its rather hard to figure out how or what some of them does.
otherwise, keep up the good work :)
-
@Halvhjearne It already does so :) It's only in the output that it doesn't show the bracket notes (if not, you might have disabled it with the --nobrackets option or brackets="false" in config.ini). Please reload your ES and check it. It shows up as "1941: Counter Attack (World 900227)" here on my system.
-
@muldjord
i tried it a few times and every time i scraped with skyscraper, all variants have the same name.
its running now with brackets="true" as i thought maybe i got it wrong and it takes a while to scrape, when only using 1 thread, however its almost done now and i will see what came out and if its still the same i will try with brackets="false", but isnt that supposed to be default behaviour?
im pretty sure i tried that too with same result or maybe im just confused by now ...
-
@Halvhjearne I think you are missing out a bit. If you scrape with the same scraping module twice, it will be really fast, unless you have enabled "--refresh". If refresh is enabled it will rescrape all of the files from the source again. No need for that. Skyscraper has a cache that is much faster. With refresh disabled, it will simply use the already cached data. I recommend skimming the documentation if you haven't already done so. Understanding how the cache works is pretty important if you want to use Skyscraper to its full potential: https://github.com/muldjord/skyscraper . It's a very powerful tool beyond just scraping from a single source.
Either way, I am not sure what you have done to your setup, but by default, if nothing in the config has been changed, it will have the USA and WORLD designations for fba and mame roms. I just tested it here and it works perfectly.
For a quick use case example, check here: https://github.com/muldjord/skyscraper/blob/master/USECASE.md