
Versatile C++ game scraper: Skyscraper



  • @Used2BeRX I mentioned the checksum problem and multiple roms in a zip problem. If you zip a rom, depending on what zip encoding you use, the rom will differ in size. That makes it very hard to do meaningful sha1 checksumming for use when scraping or in general identifying which rom you are looking at. Especially when caching data locally for scraping.

    For instance, many roms have a [!] version which is considered a "good dump". This means the rom is pretty much perfect. That one version is THE version of that game for that region. That's just ONE file. Now, someone decides to zip it. They zip it using X software. It suddenly becomes a different file when you look at the contents. They create a nice rom pack with this zip in it. Someone else downloads it. Some other dude does the same thing, but with a different zip encoding. This file also gets shared. Now you have THREE versions of the same rom file. You can probably see where I'm going with this.

    For identifying roms, this is a bit of a problem because I can't cache data for X rom and assume the same cached data will be used for the same game when I give my locally cached data to someone else. Because he might have zipped his file! So when he scrapes the exact same rom, because his was zipped this way or other, it won't be recognized.

    People can do what they want, I don't personally care. It just messes up a bunch of stuff for scrapers who want to not hit the web sources so hard and try to cache things to prevent that.
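
To make the checksum problem above concrete, here is a minimal Python sketch (not Skyscraper's own code). It packs the same bytes into two zips with different compression settings and hashes the resulting files. The payload and the file name `game.nes` are made up for illustration.

```python
import hashlib
import io
import zipfile


def zip_bytes(name, payload, compression):
    """Return the bytes of a zip archive holding a single file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Fixed timestamp, so only the compression method differs between runs.
        info = zipfile.ZipInfo(name, date_time=(1980, 1, 1, 0, 0, 0))
        zf.writestr(info, payload, compress_type=compression)
    return buf.getvalue()


def sha1_hex(data):
    return hashlib.sha1(data).hexdigest()


# Hypothetical stand-in for a rom: the same bytes go into both archives.
payload = b"SKYSCRAPER-ROM-DATA" * 1000

stored = zip_bytes("game.nes", payload, zipfile.ZIP_STORED)
deflated = zip_bytes("game.nes", payload, zipfile.ZIP_DEFLATED)

print("stored zip  :", len(stored), "bytes", sha1_hex(stored))
print("deflated zip:", len(deflated), "bytes", sha1_hex(deflated))
```

The two archives differ in size and in sha1 even though the rom inside is byte-for-byte the same, which is why a cache keyed on the zip's checksum misses.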



  • Skyscraper 1.7.4 released: https://github.com/muldjord/skyscraper

    • Added textual import with 'import' scraper using '[homedir]/.skyscraper/import/definitions.dat' file
    • Added video import with 'import' scraper
    • Improved 'uvlist' description scraping
    • Now properly handles empty nodes in EmulationStation gamelist.xml export

    Be sure to read the readmes thoroughly. Everything is explained in there. Enjoy!



  • @muldjord I see.

    Torrentzip eliminates this problem. Not sure if you've heard of that program before.

    EDITED TO ADD:

    You can be sure everything I do is torrentzipped, so any datfiles I would make would be for torrentzipped stuff. That way as long as people found the right stuff and used the dats in romcenter or clrmamepro theirs would be identical as well.
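
The idea behind a normalizing zip tool can be sketched in Python. This is not TorrentZip's actual format or algorithm, just an illustration of the principle: if everything that is free to vary (entry order, timestamps, compression settings, host system field) is pinned, the same rom data always yields the same archive. The function name `deterministic_zip` and the pinned values are assumptions.

```python
import io
import zipfile

# Arbitrary constant; 1980-01-01 is the earliest date the zip format can store.
FIXED_TIME = (1980, 1, 1, 0, 0, 0)


def deterministic_zip(files):
    """Build a zip from a {name: bytes} mapping with all variable fields pinned."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in sorted(files):  # fixed entry order
            info = zipfile.ZipInfo(name, date_time=FIXED_TIME)
            # ZipInfo picks a platform-dependent default here, so pin it.
            info.create_system = 0
            zf.writestr(
                info,
                files[name],
                compress_type=zipfile.ZIP_DEFLATED,
                compresslevel=9,
            )
    return buf.getvalue()
```

Byte-identical output across different machines would also depend on the same compression library producing the same stream, which is one reason a dedicated tool that fixes the whole procedure is useful.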



  • @muldjord @Used2BeRX Another workaround for this 'problem' would be for the scraper to unzip before scraping, generate the checksum, and clean up the temp folder afterwards. It would create some CPU overhead for sure.
    Fun fact: I scraped my zipped snes folder with about 700 games in it with various scrapers. End result: about 15 skipped games.
    Did the same thing again but this time unzipped them. Result: almost the same as above.
    Can you explain how the rom names and the checksum are compared to the web source? From what I see, only the name has to match (adjustable with the -m flag).
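
The unzip-then-checksum workaround can be sketched in Python as follows. This is an illustration only, not how Skyscraper or any web source works, and the function name `rom_sha1` is made up. It reads the single file inside a zip straight from the archive, so no temp folder is needed, and falls back to hashing the file itself when it is not a zip.

```python
import hashlib
import io
import zipfile


def _sha1_stream(stream, chunk_size=65536):
    """Hash a binary stream in chunks to keep memory use flat."""
    digest = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def rom_sha1(fileobj):
    """SHA-1 of the rom data in a seekable binary file object.

    If the file is a zip holding exactly one file, the checksum is taken
    from that inner file, so the zip settings no longer matter.
    """
    if zipfile.is_zipfile(fileobj):
        with zipfile.ZipFile(fileobj) as zf:
            names = [n for n in zf.namelist() if not n.endswith("/")]
            if len(names) != 1:
                raise ValueError("expected exactly one rom inside the zip")
            with zf.open(names[0]) as member:
                return _sha1_stream(member)
    fileobj.seek(0)
    return _sha1_stream(fileobj)
```

For a file on disk it would be called as `rom_sha1(open(path, "rb"))`. The extra cost is the decompression itself, which matches the CPU overhead mentioned above.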



  • @AnalogHero Only 'screenscraper' uses the sha1 checksum for identifying the game. The other web sources use the file name. Read the readme about the local database cache for more information on why sha1 is important for Skyscraper beyond that. It's all explained in great detail in there. :)



  • @muldjord A little update....

    The zip function is going to be even more valuable for my set. For the hacks and translations, the file inside will have quite a bit of information about the rom that the zipfile itself will not have.

    So far, it would look something like this inside the zip: [game name] (Eng-Trans, [hacker-name], [patch version], [patch release date]).nes. I'm also considering whether or not to add the original CRC and finished CRC to this file name as well. That's a ton of great information for people if somebody wanted to upgrade a particular translation down the road, so they could compare this info and see if there was a more current release.

    Also, this should overcome any problems there are with long file names. For example, XBox uses FATX which is limited to only 42 characters including the extension, but I tested a few of these and they work fine. They will all be tested on the Pi at some point as well.

    I'm considering having the name of the rom file inside the zip for official releases be the no-intro file name as well... we'll see how much time I have. I will not be involved in supplying any roms to anybody, but my intention is to make it as easy as possible to use the work I've done for the end users.
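
As a rough illustration of the naming scheme described above, the following Python sketch builds an inner file name from its parts and checks a name against the 42-character FATX limit. The function names and the sample values are hypothetical, and the extension is placed after the closing parenthesis.

```python
FATX_MAX_NAME = 42  # limit mentioned above, extension included


def translation_name(game, hacker, version, release_date, ext=".nes"):
    """Compose '[game] (Eng-Trans, [hacker], [version], [date]).ext'."""
    return f"{game} (Eng-Trans, {hacker}, {version}, {release_date}){ext}"


def fits_fatx(filename, limit=FATX_MAX_NAME):
    """True if the file name is short enough for the given limit."""
    return len(filename) <= limit


# Hypothetical example values, not a real release.
inner = translation_name("Some Game", "SomeHacker", "v1.0", "2017-01-01")
outer = "Some Game.zip"

print(inner, "->", len(inner), "chars, fits:", fits_fatx(inner))
print(outer, "->", len(outer), "chars, fits:", fits_fatx(outer))
```

Only the zip's own name has to respect the file system limit; the long, descriptive name lives inside the archive.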



  • Yeah, so video howtos ain't gonna happen anytime soon. Just spent 4 hours in video recording hell and I am not going back. It is clearly not where my talent lies. Deleted everything, too inconsistent and... frankly, crap. I'm just about ready to throw my computer out the window. Not going to, but man, the guys on YouTube who know how to do this? Mad respect from me. It is friggin' HARD! So many details go wrong all the time. Stumbling over words, forgetting commands, technical problems, having to reset all the time after each "take"...

    Just trying to figure out decent examples of what I want to convey in the videos is really, really difficult.

    If anyone wants to help out on this front, let me know.



  • @Used2BeRX Basically Skyscraper is "feature complete" as of 1.7.4. I have implemented the "import" scraper, which allows anyone to import their own data (artwork and textual) and define the format in the '[homedir]/.skyscraper/import/definitions.dat' file. I reckon an importer for your data can be made from this, so feel free to do so.

    Aside from reported bug fixes, I'm gonna take a break from the project now. The requests are getting very specific, which is fine, but a lot of it is beyond the scope of what I want for Skyscraper. The importer was made in a way so that it can fit basically any custom format of information, as long as the data is contained in single text files and artwork files named after the roms you wish to scrape.

    Everything is detailed in the github readmes, so feel free to check those out.

    That is all for now. :)
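
The "named after the roms" requirement amounts to matching files by base name. Here is a small Python sketch of that matching step only. It is not Skyscraper's implementation, the file names are hypothetical, and the actual folder layout and the definitions.dat format are covered in the readme rather than here.

```python
from pathlib import PurePath


def match_imports(rom_names, import_names):
    """Map each rom to the import files that share its base name."""
    by_stem = {}
    for name in import_names:
        by_stem.setdefault(PurePath(name).stem, []).append(name)
    return {
        rom: sorted(by_stem.get(PurePath(rom).stem, []))
        for rom in rom_names
    }


# Hypothetical file names for illustration.
roms = ["Game A.nes", "Game B.nes"]
imports = ["Game A.txt", "Game A.png", "Other.png"]

for rom, found in match_imports(roms, imports).items():
    print(rom, "->", found or "nothing to import")
```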



  • @muldjord Could you improve the way neogeo games are handled? Results are pretty bad, since neogeo roms are named like mame roms. For example, you can't scrape mslug.zip or sonicwi3.zip. They don't match Metal Slug or Aero Fighters 3/Sonic Wings 3. And in this case you can't unzip them. It would be a mess.

    EDIT: I just read that you're taking a break, so nvm!



  • @AnalogHero Actually that is the one thing I would like to work on. I even created the mameMap.csv file for this purpose some time back. All I need to do is look up the name in that file before scraping and use that instead of the actual filename. I'll think about it over the next couple of days and try to work it in. If it works well, I'll release it with 1.7.5 sometime soon.
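
The lookup described above could look roughly like this in Python. It is a sketch under assumptions, not Skyscraper's code: the two-column layout, the `;` delimiter, and the sample rows are all hypothetical and may not match the real mameMap.csv.

```python
import csv
import io
from pathlib import PurePath


def load_name_map(fileobj, delimiter=";"):
    """Read 'short name;full title' rows into a dict (assumed layout)."""
    mapping = {}
    for row in csv.reader(fileobj, delimiter=delimiter):
        if len(row) >= 2:
            mapping[row[0].strip()] = row[1].strip()
    return mapping


def search_name(rom_filename, mapping):
    """Use the mapped title if the short name is known, else the base name."""
    stem = PurePath(rom_filename).stem
    return mapping.get(stem, stem)


# Hypothetical sample rows standing in for the real file.
sample = io.StringIO("mslug;Metal Slug\nsonicwi3;Sonic Wings 3\n")
name_map = load_name_map(sample)

for rom in ("mslug.zip", "sonicwi3.zip", "unknown.zip"):
    print(rom, "->", search_name(rom, name_map))
```

The point of the design is that the zip never has to be opened: only the file name is translated before it is sent to the web source.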



  • Skyscraper 1.8.0 released: https://github.com/muldjord/skyscraper

    • Added 'arcadedb' scraper module with video support
    • Vastly improved scraping of 'neogeo' and 'arcade' platforms in general by mapping the filenames to real names from mameMap.csv
    • Improved 'neogeo' and 'arcade' search platform matching

    Apparently my idea of taking a break from a project is to keep working on it... :D Anyways, 1.8.0 is here! And the big news this time around is vastly improved scraping of 'neogeo' and 'arcade' and also a new scraping module using the data from http://adb.arcadeitalia.net/ . This module also supports video!

    Enjoy!



  • @muldjord Just compiled it and rescraped my neogeo roms. 100% match first try. Thank you!



  • @AnalogHero Glad it works now :)



  • @muldjord Sounds good man. Sorry I couldn't get those synopsis files to you yet. I'm still working on them. "Real Life" has become abnormally busy for me for the foreseeable future too. I'm still working on stuff in my down time, but it's nowhere near the pace of the last few months.

    Maybe by the time things calm down for me and you take a nice long break from your own work we could put our heads together on this when I have the synopsis files ready for the NES. So far there are around 2,100 games accounted for, and there will likely be around 2,250 when I'm done I'm guessing.



  • Skyscraper 1.8.1 released: https://github.com/muldjord/skyscraper

    • Added 'rating' scraping to 'thegamesdb'

    This is a minor one, just wanted to get it out there. :)



  • Do any of the databases move articles such as "the" or "a" to the end of the name? Something like "Legend of Zelda, The". Roms are already named like this under goodtools, but somehow that's undone by every scraper I find, and I'm this close to manually editing the xml file to fix the names.



  • @stephanepare None that I know of, but The Games DB, which is the built-in EmulationStation scraper, does allow anyone to update it. You just have to create an account on the site.



  • Oh, I was also wondering if it would be possible to add the colecovision and pc engine consoles to the list of supported platforms.



  • @stephanepare Yes, Skyscraper automatically moves "The" to the end of names to make sure they are sorted correctly.

    I will look into including colecovision and pc-engine soon. It's easily done. Thanks for the feedback.
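
For anyone wanting to apply the same "The" handling to their own names, here is a minimal Python sketch of the idea. It is not Skyscraper's implementation, and the function name `article_to_end` is made up. By default it only moves "The", with other articles as an opt-in.

```python
def article_to_end(title, articles=("The",)):
    """Move a leading article to the end: 'The X' becomes 'X, The'."""
    first, sep, rest = title.partition(" ")
    known = {a.lower() for a in articles}
    if sep and rest and first.lower() in known:
        return f"{rest}, {first}"
    return title


titles = ["The Legend of Zelda", "Theme Park", "Metroid"]
print(sorted(article_to_end(t) for t in titles))
```

Matching on the whole first word keeps titles like "Theme Park" untouched.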



  • Hi @muldjord and thanks for the amazing job!
    I've read your arguments about adding zip file support and would really appreciate it if you reconsidered.
    The compression standards issue doesn't matter if you use no-intro romsets, which are pretty much the standard. It is also unusual to have more than one rom inside a zip. People with very large collections save tons of space. Even if you have just a few roms, that applies to almost every single file. And even if you still think compressed roms are not the best idea, lots of people have a different opinion, and right now they need to do a lot of work manually.
    Best regards and thanks for bringing another scraper to the game!


