RetroPie forum home
    Versatile C++ game scraper: Skyscraper

    • Used2BeRXU
      Used2BeRX @dorkvader
      last edited by

      @dorkvader Thanks man. That would be great. :)

      Not sure where to find your email though?

      • E
        easye9inches
        last edited by

        Can someone help me install Skyscraper for RetroPie on Ubuntu? muldjord mentioned that the install is exactly the same as on the Pi. I have followed the steps listed on GitHub and I am getting this error:

        $ wget -q -O - https://raw.githubusercontent.com/muldjord/skyscraper/master/update_skyscraper.sh | bash
        bash: line 3: curl: command not found
        --- Fetching Skyscraper v. ---
        --2018-09-03 22:34:48-- https://github.com/muldjord/skyscraper/archive/.tar.gz
        Resolving github.com (github.com)... 192.30.253.113, 192.30.253.112
        Connecting to github.com (github.com)|192.30.253.113|:443... connected.
        HTTP request sent, awaiting response... 404 Not Found
        2018-09-03 22:34:48 ERROR 404: Not Found.

        I am assuming that I should also install this in the RetroPie folder?

        • mituM
          mitu Global Moderator @easye9inches
          last edited by

          @easye9inches said in Versatile C++ game scraper: Skyscraper:

          I am assuming that I should also install this in the RetroPie folder?

          No, you can run it from anywhere you have write access. You need to install curl and then re-run the installation:

          sudo apt-get -y install curl
          
          • E
            easye9inches @mitu
            last edited by

            @mitu Thanks! Got it installed.

            • E
              easye9inches
              last edited by easye9inches

              Do I have to edit the config file if all my ROMs are on an external HDD? Nothing is being scraped because the input path is still the default home/pi/roms setting. I'd rather scrape my games on the external drive for backup purposes, and then transfer them over to the Pi.

              • muldjordM
                muldjord
                last edited by

                Skyscraper version 2.7.3 released: https://github.com/muldjord/skyscraper

                • Improved image cropping to now also crop black borders, but only for screenshots (Thank you to 'chipsnblip' for suggesting this)
                • Made 'import' base folder configurable in config.ini
                • Fixed bug in 'import' scraping module that caused dummy titles to be saved to localdb when scraping media resources
                • Changed 'curl' to 'wget' in update_skyscraper.sh script to avoid curl requirement
                • muldjordM
                  muldjord @easye9inches
                  last edited by muldjord

                  @easye9inches Please read the output of '--help' and also check out the options in '~/.skyscraper/config.ini.example'. What you want to do is easy to set up.

                  • D
                    dorkvader @Used2BeRX
                    last edited by

                    @used2berx Sorry, I thought my email was in my profile. Does it not appear?

                    • Used2BeRXU
                      Used2BeRX @dorkvader
                      last edited by

                      @dorkvader I didn't see it anywhere. Just see the @ symbol and your name. If you actually see mine, you can just send me an email though. I'm not sure what my profile looks like to other people either. :)

                      • T
                        timb
                        last edited by timb

                        @muldjord Thanks for the awesome tool!

                        So, I’ve run into an issue scraping my collection with screenscraper.fr as the source. My ROMs are all compressed as individual 7z files (to save space), but Skyscraper is trying to match the SHA1/MD5 of the 7z files themselves instead of the ROMs they contain. Obviously this won’t work, so it’s failing to match anything.

                        Most other tools will actually uncompress the file and get the hash of the resulting file(s), or even pull a checksum out of the archive metadata for supported formats (ZIP and 7z archives both normally store a CRC32 checksum for each file they contain).
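To illustrate the "pull it out of the header" idea, here is a minimal sketch that reads the stored CRC32 values from a ZIP archive with Python's standard zipfile module, without decompressing anything. The function name stored_crc32s is made up for this example, and whether a CRC32 is useful depends on whether the scraping source can match on it, which this thread does not confirm.

```python
import io
import zipfile
import zlib


def stored_crc32s(archive) -> dict[str, str]:
    """Return {member name: CRC32 as 8 hex digits} read from the ZIP directory.

    No member is decompressed; zipfile exposes the stored value as ZipInfo.CRC.
    """
    with zipfile.ZipFile(archive) as zf:
        return {
            info.filename: f"{info.CRC:08X}"
            for info in zf.infolist()
            if not info.is_dir()
        }


if __name__ == "__main__":
    # Build a small archive in memory so the example needs no files on disk.
    payload = b"pretend this is a ROM image"
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("game.nes", payload)
    buf.seek(0)
    print(stored_crc32s(buf))
    # The stored value matches a CRC32 computed over the uncompressed data.
    print(f"{zlib.crc32(payload):08X}")
```

Note that this only yields a CRC32. SHA1 or MD5 values still require reading the unpacked data.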

                        Can this be implemented in Skyscraper? I believe the libarchive library has support for 7z, so it shouldn’t be hard to implement. Alternatively, Skyscraper could call the external 7z binary with the “7z h” option and read the hashes from its stdout (this is a very hacky way to do it, but would be easy to implement).

                        Thanks for all the great work!

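                        Edit: If this is too much work, how about adding a command line option to override the SHA1 and MD5 for individual ROMs? That way I could wrap Skyscraper in a custom Python script that would traverse a directory and, for each file, decompress it, generate SHA1 and MD5 hashes, then call Skyscraper. Here’s a Python sketch to illustrate (the -sha1 and -md5 options are the ones I’m proposing; they don’t exist yet):

                        #!/usr/bin/env python3
                        # Hash each ROM inside its 7z archive, then hand the hashes to Skyscraper.
                        # NOTE: -sha1 and -md5 are the proposed override options; they do not exist yet.
                        import hashlib
                        import subprocess
                        from pathlib import Path

                        for file in sorted(Path("/path/to/compressed/roms").glob("*.7z")):
                            # "7z e -so" extracts to stdout, so the ROM only ever lives in memory
                            rom = subprocess.run(["7z", "e", "-so", str(file)], check=True, capture_output=True).stdout
                            rom_md5 = hashlib.md5(rom).hexdigest()
                            rom_sha1 = hashlib.sha1(rom).hexdigest()
                            subprocess.run(["Skyscraper", "-s", "screenscraper", "-p", "snes",
                                            "-sha1", rom_sha1, "-md5", rom_md5, str(file)], check=True)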
                        

                        I could write a working script that does the above in basically no time. The script shouldn’t add more than 1 second per ROM to the scraping times (compared to running Skyscraper by itself on uncompressed files), which I could live with.

                        So yeah, if adding native compression isn’t a priority, adding hash override options would be an acceptable alternative I could live with. :)

                        • muldjordM
                          muldjord @timb
                          last edited by muldjord

                          EDIT: I just thought about the obvious solution to the below mentioned issues. I can just decompress to ram.

                          @timb Yes, this has been requested quite a few times actually. If I did this, it would be implemented with an '--unpack' option that would enable the decompression functionality, and it would work on zip and 7z files. The only issue I see is the bashing I would be doing on the SD card. Unpacking each rom to a temporary file and subsequently deleting it again, sometimes thousands of times in a row... That's just a very bad idea if you want to keep your SD card alive for more than a few weeks. SD cards don't spread their writes around like an SSD does. So this is a real concern.

                          With that said, I could add it, and make sure to put up a red warning text when using it. At least people would be aware of the issue and can't blame me when their setup goes up in smoke. ;)

                          • D
                            dorkvader @Used2BeRX
                            last edited by

                            @used2berx OK, my email is now visible in my profile. You just have to enable it to be seen through the profile editor. It's like the snake was at my feet ready to strike, but I'll be damned if I could see it. I did finish the Vectrex box art, btw.

                            • T
                              timb @muldjord
                              last edited by

                              @muldjord

                              Yes, decompressing to RAM is the way to go. That’s essentially what the sketch in my last post did. (It ran 7z with the option that extracts to stdout, so the data is read through a pipe into the script’s process as an in-memory object and never touches the SD card.) I use this method with a Python script I wrote that sorts and verifies ROMs. Obviously you won’t have to worry about piping from an external process, since you can just use libarchive (or another library) to implement the decompression natively, but decompressing into memory is the way to go.
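A minimal sketch of the pipe approach described above, assuming the unpacker writes the ROM to stdout (as "7z e -so" does). The helper name hash_command_output is made up for this example, and the demo uses the Python interpreter as a stand-in for 7z so it runs anywhere. It hashes the stream in chunks, so nothing is written to disk and a large ROM is never held in memory all at once.

```python
import hashlib
import subprocess
import sys


def hash_command_output(cmd: list[str], chunk_size: int = 1 << 16) -> tuple[str, str]:
    """Run cmd and return (md5, sha1) hex digests of everything it writes to stdout."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        # Read the pipe in chunks until the command closes its stdout.
        for chunk in iter(lambda: proc.stdout.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    # Leaving the with-block waits for the process, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
    return md5.hexdigest(), sha1.hexdigest()


if __name__ == "__main__":
    # Stand-in for ["7z", "e", "-so", "game.7z"] so the example runs without 7z.
    demo = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'rom-bytes')"]
    print(hash_command_output(demo))
```

Checking the exit code matters here: a failed unpack produces no output, and hashing no output still yields a valid-looking digest.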

                              That said, decompressing to a temp file can still work if you have to. Simply do it to /dev/shm, which is on a tmpfs (ramdisk). Or do it to /tmp with a warning that includes instructions on how to enable mounting /tmp as tmpfs (it’s a single systemctl enable command). Obviously both of these options are Linux-specific, but you could always implement fallback code for other platforms if you’re worried about that sort of thing.
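The temp-file variant with a fallback could look roughly like this. It is only a sketch: the helper name ram_backed_tempdir and the "unpack-" prefix are made up, and it assumes /dev/shm is a tmpfs when it exists, which is typical on Linux but not checked here.

```python
import os
import tempfile


def ram_backed_tempdir(candidates=("/dev/shm",)) -> tempfile.TemporaryDirectory:
    """Create a temporary directory on the first usable candidate path.

    Falls back to the platform default temp location when none is usable
    (for example on non-Linux systems), which may be on disk.
    """
    for base in candidates:
        if os.path.isdir(base) and os.access(base, os.W_OK):
            return tempfile.TemporaryDirectory(dir=base, prefix="unpack-")
    return tempfile.TemporaryDirectory(prefix="unpack-")


if __name__ == "__main__":
    with ram_backed_tempdir() as scratch:
        rom_path = os.path.join(scratch, "game.nes")
        with open(rom_path, "wb") as fh:
            fh.write(b"unpacked ROM data would go here")
        print("scratch dir:", scratch)
    # The directory and its contents are removed when the with-block exits.
```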

                              As a side note, I’m not super concerned about wearing out an SD card. Modern cards have wear-leveling algorithms built into their controllers, just like SSDs, so they should be fairly robust. If they do wear out, they’re cheap and plentiful. I do embedded development (hardware and software) for a living and have never seen an SD card die from wear. (That said, for datalogging applications I started using the F2FS filesystem instead of ext4 last year; I’ve been so pleased with it that I’ve started using it on most of the embedded Linux systems I work with, including all my personal SBCs. It’s a log-structured file system designed specifically for NAND flash storage that helps reduce wear, increase speed and improve reliability, especially when it comes to a lot of small read-modify-write operations.)

                              • muldjordM
                                muldjord
                                last edited by muldjord

                                Thank you for the detailed suggestions. They all make sense to me. I think I'll go with libarchive, it's relatively simple to write a wrapper for it. I'll figure something out. My main concern with using libarchive is that it complicates the installation as it would obviously require libarchive as a dependency. Might not seem like a big deal, but I absolutely despise build systems such as autotools and cmake and try to avoid them at all costs. So I'll try to see if I can make it work in a simple manner with Qt's qmake instead. Might not be as robust, but at least I won't have to spend the next 2 weeks pulling my hair out of my scalp while fighting the unbelievably overcomplicated has-to-do-everything-to-the-point-that-only-absolute-experts-get-it-build systems. Oh yeah, we better check for that fortran compiler, that's important.

                                EDIT: You know what, I'll look into just using Qt's QProcess and use 7z to stdout which is returned to my QProcess.

                                • T
                                  timb @muldjord
                                  last edited by timb

                                  @muldjord

                                  Most distros have libarchive available as a package (Debian/Raspbian: apt install libarchive-dev, which pulls in the runtime library as a dependency), so can’t you just dynamically link to the library the package installs?

                                  I can’t see how that would be any worse than the dependency of having to install 7z (p7zip) and it would work much, much better than reading 7z’s STDOUT. (I’ve written a few shell scripts around p7zip. While it is stable, it’s very clear that it’s a Linux console application written by a Windows developer, which it was! I had to write a lot of “logic code” around it in order to get it to behave like a normal *nix shell utility.)

                                  I’ve been slowly converting my hundreds of utility shell scripts into Python, and about a dozen of those call p7zip. I’ve been going back and rewriting them to use PyArchive (which is a Python module for libarchive) instead. The ones I’ve done so far ended up much simpler than calling 7z as a subprocess (one script went from 50 lines to 15).

                                  If it were me, I’d go with a native libarchive implementation. (I actually thought about making an attempt at implementing it, but I don’t really do much work with C++, so it would end up more like C and be sort of hacky.)

                                  • muldjordM
                                    muldjord
                                    last edited by muldjord

                                    It's on master now in a quick first draft. Please try it out if you get the time. The option is '--unpack'. It will unpack and checksum a single file inside a 7z or zip file. If more than 1 file is found inside the compressed file, it falls back to doing the checksum calculation from the base file. Let me know how it goes if you do check it out.
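The rule described above (hash the single file inside the archive, otherwise fall back to the archive itself) can be sketched as follows. This is not Skyscraper's code: it is a Python illustration limited to ZIP files via the standard zipfile module, whereas the actual feature shells out to 7z, and the name checksum_for_scraping is made up.

```python
import hashlib
import os
import tempfile
import zipfile


def checksum_for_scraping(path: str) -> str:
    """Return the SHA1 used for matching, following the rule described above.

    If the archive holds exactly one file, hash that file's unpacked bytes;
    otherwise hash the archive file itself.
    """
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            members = [i for i in zf.infolist() if not i.is_dir()]
            if len(members) == 1:
                return hashlib.sha1(zf.read(members[0])).hexdigest()
    # Fallback: more than one member, or not a ZIP at all.
    with open(path, "rb") as fh:
        return hashlib.sha1(fh.read()).hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        single = os.path.join(tmp, "single.zip")
        with zipfile.ZipFile(single, "w") as zf:
            zf.writestr("game.nes", b"rom")
        multi = os.path.join(tmp, "multi.zip")
        with zipfile.ZipFile(multi, "w") as zf:
            zf.writestr("a.nes", b"rom")
            zf.writestr("b.nes", b"other")
        rom_sha1 = hashlib.sha1(b"rom").hexdigest()
        print("single-file archive matches ROM hash:", checksum_for_scraping(single) == rom_sha1)
        print("multi-file archive matches ROM hash:", checksum_for_scraping(multi) == rom_sha1)
```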

                                    EDIT: I could dynamically link. The issue here is that I'd just add it manually to the flags without knowing actual locations for the specific system. If for some reason that doesn't exist/work on the system, it breaks the compilation. And I do not want to get into that territory. I'm ok with the current solution. If the user doesn't have p7zip, it just doesn't work. It still compiles fine. I'll add some error checking as well.

                                    • T
                                      timb @muldjord
                                      last edited by timb

                                      @muldjord

                                      Wow, that was fast! Okay, I’ll download it tonight and do my best to break it! :D

                                      I’m just thinking out loud here, as an alternative in case the 7zip solution ends up not working reliably for some reason. As for dynamically linking, you could always just let the user pass the location of the library (--with-libarchive=/usr/local/include) to the linker. Then just use a define or something around any code that uses the library, so it doesn’t compile if the location isn’t provided. It wouldn’t be hard to add some logic to the setup shell script either that searches for libarchive and informs the user if it can’t be found. (Using a normal makefile-based build system anyway; I’m not sure about the build system Qt uses.) It’s not ideal, and sort of hacky, but would be straightforward to do.
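As a rough illustration of the "search for libarchive and inform the user" idea, here is a sketch in Python rather than shell. It uses ctypes.util.find_library, which only looks for the runtime shared library and says nothing about whether the development headers are installed, so a real build check would need more than this. The function name find_libarchive is made up.

```python
import ctypes.util
from typing import Optional


def find_libarchive() -> Optional[str]:
    """Return the name or path of the libarchive shared library, or None if not found."""
    return ctypes.util.find_library("archive")


if __name__ == "__main__":
    lib = find_libarchive()
    if lib is None:
        print("libarchive not found; native unpacking would be disabled")
    else:
        print(f"libarchive found: {lib}")
```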

                                      • T
                                        timb @muldjord
                                        last edited by

                                        @muldjord
                                        Okay, so it compiled and installed fine, but immediately I ran into a bit of an issue. This is the output of the first two NES games it tries to scrape:

                                        #1/733 (T1) Pass 1 Pass 2 Pass 3 Pass 4 ---- Game '10-Yard Fight (USA, Europe)' not found :( ----
                                        
                                        
                                        Debug output:
                                        Tried with sha1, md5, filename: '10-Yard%20Fight%20%28USA%2C%20Europe%29.7z', 'DA39A3EE5E6B4B0D3255BFEF95601890AFD80709', 'D41D8CD98F00B204E9800998ECF8427E'
                                        Platform: nes
                                        Tried with sha1: DA39A3EE5E6B4B0D3255BFEF95601890AFD80709
                                        Platform: nes
                                        Tried with md5: D41D8CD98F00B204E9800998ECF8427E
                                        Platform: nes
                                        Tried with name: 10-Yard%20Fight%20%28USA%2C%20Europe%29.7z
                                        Platform: nes
                                        
                                        Elapsed time: 00:00:02
                                        Estimated time left: 00:35:22
                                        
                                        #2/733 (T1) Pass 1 Pass 2 Pass 3 Pass 4 ---- Game '1942 (Japan, USA)' not found :( ----
                                        
                                        
                                        Debug output:
                                        Tried with sha1, md5, filename: '1942%20%28Japan%2C%20USA%29.7z', 'DA39A3EE5E6B4B0D3255BFEF95601890AFD80709', 'D41D8CD98F00B204E9800998ECF8427E'
                                        Platform: nes
                                        Tried with sha1: DA39A3EE5E6B4B0D3255BFEF95601890AFD80709
                                        Platform: nes
                                        Tried with md5: D41D8CD98F00B204E9800998ECF8427E
                                        Platform: nes
                                        Tried with name: 1942%20%28Japan%2C%20USA%29.7z
                                        Platform: nes
                                        
                                        Elapsed time: 00:00:04
                                        Estimated time left: 00:29:06
                                        

                                        Notice the MD5/SHA1 sums are the same on both? I know that's not correct. Here's the MD5 on the first title (extracted):

                                        timb-mba:~ timb$ 7z e "~/10-Yard Fight (USA, Europe).7z" -so | md5
                                        8caac792d5567da81e6846dbda833a57
                                        

                                        (I set 7z to extract directly to STDOUT and piped it into md5 so it would work like it does in your program.)

                                        Now, after scraping these first two files it starts to scrape fine:

                                        #3/733 (T1) Pass 1 ---- Game '1943 - The Battle of Midway (USA)' found! :) ----
                                        ~SNIP~
                                        Debug output:
                                        Tried with sha1, md5, filename: '1943%20-%20The%20Battle%20of%20Midway%20%28USA%29.7z', '443D235FBDD0E0B3ADB1DCF18C6CAA7ECEEC8BEE', 'DEEAF9D51FC3F13F11F8E1A65553061A'
                                        Platform: nes
                                        
                                        Elapsed time: 00:00:18
                                        Estimated time left: 01:15:41
                                        

                                        So, it seems like the program is somehow hashing bad data at first? I took a cursory glance at the source but nothing instantly jumps out at me.
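One detail that may help narrow this down: the two values reported for both failing games are the SHA1 and MD5 digests of empty input. That suggests (an inference, not something confirmed in this thread) that no data had been read from the unpacker when the hashes were computed. The check below only uses Python's hashlib and the values copied from the debug output.

```python
import hashlib

# Hashes reported for both failing games in the debug output above.
REPORTED_SHA1 = "DA39A3EE5E6B4B0D3255BFEF95601890AFD80709"
REPORTED_MD5 = "D41D8CD98F00B204E9800998ECF8427E"

# Digests of zero bytes of input.
empty_sha1 = hashlib.sha1(b"").hexdigest().upper()
empty_md5 = hashlib.md5(b"").hexdigest().upper()

print("SHA1 is the empty-input digest:", empty_sha1 == REPORTED_SHA1)
print("MD5 is the empty-input digest:", empty_md5 == REPORTED_MD5)
```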

                                        Also, something else I noticed: It seems the local cache lookup is still doing a SHA1 on the packed 7z file, instead of the unpacked file it contains. This means that it's re-downloading all the assets from screenscraper, instead of using the assets it previously downloaded when I scraped the extracted files. (I know this is a quick first pass of the feature, I just wanted to point it out.)

                                        • muldjordM
                                          muldjord @timb
                                          last edited by muldjord

                                          @timb I'll look into this when I get the time. And yes, it will keep identifying the files from the sha1 of the actual file on the disk. That is not gonna change. You should see this as a tool to help you identify custom made 7z or zip files as they often differ from the ones in the screenscraper database. So I expect the user to have "completed" their files before they start scraping. Both solutions have caveats obviously. :)

                                          • T
                                            timb @muldjord
                                            last edited by timb

                                            @muldjord
                                            Personally, I think it makes more sense to cache based on the hash of the actual ROM file itself and not the hash of the compressed file that contains it (because the hash of a 7z, ZIP or RAR file may change, even though the data it contains does not change).
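To make the point about container hashes concrete, here is a small sketch using ZIP files built in memory with Python's zipfile module (ZIP stands in for 7z here, and the helper names are made up). The same ROM data packed twice with different timestamps gives two different archive hashes but one content hash.

```python
import hashlib
import io
import zipfile


def make_zip(payload: bytes, date_time: tuple) -> bytes:
    """Build a one-file ZIP in memory with the given member timestamp."""
    buf = io.BytesIO()
    info = zipfile.ZipInfo("game.nes", date_time=date_time)
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(info, payload)
    return buf.getvalue()


def content_sha1(archive: bytes) -> str:
    """SHA1 of the single file stored inside the archive."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return hashlib.sha1(zf.read(zf.namelist()[0])).hexdigest()


if __name__ == "__main__":
    rom = b"identical ROM data"
    zip_a = make_zip(rom, (2018, 9, 3, 22, 34, 48))
    zip_b = make_zip(rom, (2018, 9, 4, 8, 0, 0))
    # The containers differ because the timestamps in their headers differ.
    print("archive hashes differ:", hashlib.sha1(zip_a).hexdigest() != hashlib.sha1(zip_b).hexdigest())
    # The ROM inside hashes identically in both.
    print("content hashes match:", content_sha1(zip_a) == content_sha1(zip_b))
```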

                                            That said, you’re right, it is a lot simpler to just cache based on the base file’s hash. I can also understand why you wouldn’t want to spread the 7z unpack hack around further in the code. So I totally see your point. :)

                                            Let me know if you need any more debug data on the incorrect hash bug.
