Versatile C++ game scraper: Skyscraper
-
@timekills @AnalogHero, Concerning scraping just one type of resource: I have a bit too much on my plate as it is right now, so it won't be possible anytime soon. I do however think it could be done, so I'll make note of it. Thanks for the suggestion.
@timekills, Concerning user credentials for screenscraper: that is actually already halfway implemented. It's something I want to look into soon, especially since I just updated it to the v2 API from screenscraper.
@timekills, I've never actually had a Patreon but I just created one. If anyone wishes to support my work, feel free to do so. BUT, keep in mind that doing so is in no way a requirement. I mainly do this for my own amusement and self-education. :)
-
@muldjord A few (more) concerns/notes:
-
When (attempting to) update from 2.3.5 to 2.3.6 by executing "./update_skyscraper.sh" it reports that you are already at the latest version. The var $LATEST returns 2.3.5, so it doesn't execute as it already believes it is up to date.
-
You still have to manually add --updatedb for the textual updates to take. When I ran it with just -s import it updated the media but not the text. When I added the --updatedb with the same files it also included the text updates.
-
And I ran a UXS scrape of Atari 2600 and got ratings for about half of them. The amount isn't really relevant; what is relevant is that UXS only scrapes from ScreenScraper and it is giving ratings.
I'll be buggered if I can figure out where they're stored on the site or how to access them short of another scraping application. Found it (see the screenshot below).
(Just in case, I deleted all data for the 2600 ROMs just to ensure there was no historic data changing the test before I ran the UXS scrape of ScreenScraper.)
Example:
<game id="14008" source="ScreenScraper">
<path>/home/pi/RetroPie/roms/atari2600/Activision Decathlon, The (USA).zip</path>
<name>Activision Decathlon, The (USA)</name>
<desc>removed to shorten</desc>
<image>/home/pi/RetroPie/roms/atari2600/media/screenshots/Activision Decathlon, The (USA)-image.png</image>
<marquee/>
<video/>
<thumbnail/>
<rating>0.8</rating>
<releasedate>19840101T000000</releasedate>
<developer>Activision</developer>
<publisher>Activision</publisher>
<genre>Sports</genre>
<players>1-4</players>
</game>
Breakout (USA) at screenscraper.fr rating:
-
-
-
2.3.6 is not out yet, so you have the latest version (you can always check it here: https://github.com/muldjord/skyscraper/releases) :) I always post about new versions here as well. I wanted to release it yesterday but I've postponed it a bit because I have some stuff I'd like to include. But maybe I'll just release it and include the rest later.
-
I just checked this, and it already adds '--updatedb' automatically whenever you use '-s import'.
-
The rating for the Atari games looks interesting. I'll try running some through and see if I can find an entry that has it. Then it'll be a piece of cake to include it. If I can find it, I will include it in 2.3.6. :)
EDIT: I found the rating! It's called "note" which I always assumed was some kind of... well, note. I had no idea that translated to "rating" in English. Anyways, it's now implemented and will be in 2.3.6.
-
-
Skyscraper version 2.3.6 released: https://github.com/muldjord/skyscraper
- Completely rewrote the openretro parser to make use of the 'edit' page instead
- Added '*.lha' suffix to amiga platform
- Changed default scraping module for all platforms to 'localdb'
- Added 'wheel' support to 'openretro' scraping module
- Rewrote thread queue so entries are always taken alphabetically
- Now forces 4 threads for 'screenscraper' scraping module to accommodate their limits
- Updated screenscraper API to use v2
- Added 'rating' to screenscraper scraping module
This release is a "bits and pieces" release. Nothing ground-breaking, but still important changes. The most important one being that each thread now works from a global file queue instead of splitting all files into groups and handing a group to each thread. The old way meant that some threads finished their group sooner than others and then sat idle, which could slow down the end of a scraping run while the remaining threads worked through their groups alone.
To update, check the documentation on the GitHub page.
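The global-queue idea described above can be sketched roughly as follows. This is only an illustrative Python sketch, not Skyscraper's actual C++ code; `scrape`, `run_with_global_queue` and the file names are hypothetical.

```python
import queue
import threading

def scrape(filename):
    # Hypothetical stand-in for the real per-file scraping work.
    return filename.upper()

def run_with_global_queue(files, thread_count=4):
    """Every worker pulls the next file from one shared, alphabetically sorted queue."""
    work = queue.Queue()
    for name in sorted(files):
        work.put(name)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                name = work.get_nowait()
            except queue.Empty:
                return  # Queue drained: this worker is done.
            data = scrape(name)
            with lock:
                results[name] = data

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because no file is assigned to a thread in advance, a thread that gets quick files simply takes more of them, so all threads stay busy until the queue is empty.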
Happy scraping! :)
-
@muldjord Thanks for the update.
I also had to use --updatedb in the past, because Skyscraper didn't download videos. I don't know exactly why this happened and couldn't reproduce it, so I didn't report it.
Can you explain why you added the .lha extension for Amiga? Is it supported by Amiberry, or is this for future WHDLoad updates by HoraceAndSpider?
-
@analoghero Yes, lha will soon be supported directly in Amiberry so that's the reason. :) That is probably the best thing to happen for Amiga emulation in ages imo. Getting rid of the whole adf switching or uae config juggling is just amazing. Sooo looking forward to that.
-
.lha support for Amiberry has already been added to the development branch ;)
-
I'm currently working on an "add spaces to filename" solution for the .lha scraping on Amiga. Most .lha files come in the name format of "ThisIsAGameName3_v1.2.lha". So figuring out rules to turn this into "This Is A Game Name 3" for use with searching is a bit of a fiddly thing. That example is easy. But what about "3DPool", "4x4Driving" or "ABCGame"? As you might be able to guess, those break the easily applied rules and return "3 D Pool", "4x 4 Driving" and "A B C Game". I can then add special rules. For instance I can check if we have a '3' before a 'D' and then I won't be inserting a space between them. But then what about "Game3Deluxe"? That would then become "Game 3Deluxe"... Then I add a special rule if it says "Deluxe" and so on. It's a game of tugging. Adding a new rule breaks others. But I've so far found a pretty good middle ground. There are some that are just named badly. Some games have odd numbers at the end of them that aren't sequel numbers, such as "ThisGame45" where 45 is seemingly also some sort of versioning.
Btw, user credentials for 'screenscraper' have been implemented and will be in 2.3.7. You can then insert your own ss user and use as many threads as that allows.
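To make the tug-of-war concrete, here is a minimal Python sketch of the naive rule plus the single "3 before D" exception mentioned above. It is not the code used in Skyscraper; the function names, the `keep_3d` switch and the "_v1.2"-style suffix pattern are assumptions made for illustration.

```python
import re

def strip_suffix(filename):
    # Drop ".lha" and an assumed "_v1.2"-style version tag at the end.
    if filename.lower().endswith(".lha"):
        filename = filename[:-4]
    return re.sub(r"_v\d+(\.\d+)*$", "", filename)

def add_spaces(name, keep_3d=False):
    """Naive rule: space before every capital letter and before a number that follows a letter."""
    out = []
    for i, ch in enumerate(name):
        prev = name[i - 1] if i > 0 else ""
        starts_word = ch.isupper() and i > 0
        starts_number = ch.isdigit() and prev.isalpha()
        if keep_3d and ch == "D" and prev == "3":
            starts_word = False  # Special rule: keep "3D" together.
        if starts_word or starts_number:
            out.append(" ")
        out.append(ch)
    return "".join(out)

def to_search_name(filename, keep_3d=False):
    return add_spaces(strip_suffix(filename), keep_3d)
```

With the naive rule alone, `to_search_name("ThisIsAGameName3_v1.2.lha")` gives "This Is A Game Name 3", while "3DPool.lha" becomes "3 D Pool". Turning on `keep_3d` fixes that one ("3D Pool") but turns "Game3Deluxe.lha" into "Game 3Deluxe", which is exactly the kind of side effect described in the post.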
2.3.7 NOT released yet, just to clarify. This is just a progress update. :)
-
@muldjord What if you create a txt file from a complete WHDLoad directory, then insert spaces with a tool, and then check whether it's correct? Once you have the txt file, use it as a lookup table, like BrianTheLionAGA.lha = Brian the Lion AGA.lha. It's a little bit of work, but maybe more accurate than checking every filename.
-
@analoghero Thank you for the suggestion. Yes, it will eventually become something like that. It's the same thing I currently do to translate the MAME names to actual game names. The problem with these lookup tables is that they are fixed. So if someone decides to change the filename of an .lha file, it stops working. I could then use the sha1 of the files instead, but those might change as well depending on where people got them from.
For the time being I think I'll stick with the automated method until I can come up with one that uses the slave files from inside the lha archives. That would work better for a lookup method.
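A lookup table combined with an automated fallback could look roughly like this. It is a Python sketch under stated assumptions: `resolve_title` and `strip_extension` are hypothetical names, and the single table entry is the example from this thread with the extension dropped.

```python
import os

def resolve_title(filename, lookup, fallback):
    """Prefer a hand-checked lookup entry; otherwise use the automated guess."""
    if filename in lookup:
        return lookup[filename]
    return fallback(filename)

def strip_extension(filename):
    # Hypothetical stand-in for the automated renaming rules.
    return os.path.splitext(filename)[0]

# Example entry taken from the thread (extension dropped as an assumption).
lookup = {"BrianTheLionAGA.lha": "Brian the Lion AGA"}
```

`resolve_title("BrianTheLionAGA.lha", lookup, strip_extension)` hits the table, but a renamed copy such as "BrianTheLion_AGA.lha" misses it and falls through to the automated rule, which is the weakness of fixed tables pointed out above.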
-
@muldjord I downloaded lhas back in the day on my real Amiga, and have them stored on an Amiga SFS-formatted hard disk along with Workbench. I'm waiting for the day I can somehow connect that to the Pi, as I don't have a PC with IDE connections anymore.
Problem is: all my WHDLoad files are zips, so I can't help you with testing :(
-
@analoghero No worries, I have plenty of files to test on :)
-
Just implemented '--startat' and '--endat' options that allow the user to define what files to begin and end at when scraping. In this mode it ONLY caches data; it doesn't change the game list.
This allows you to only scrape a span of files. So if you have 1200 files, but only want to scrape the middle 60, then you can do so by, alphabetically, defining the first and last file in that span. Makes sense? Useful? I know I will be using it myself at least. :D
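As a rough illustration of what selecting such a span means, here is a small Python sketch. It is not how Skyscraper implements the options; `select_span` is a hypothetical helper, and it assumes a plain string sort and exact filename matches.

```python
def select_span(files, start_at=None, end_at=None):
    """Return the alphabetically sorted files from start_at through end_at, inclusive."""
    ordered = sorted(files)
    # list.index raises ValueError if the given filename is not in the list.
    first = ordered.index(start_at) if start_at is not None else 0
    last = ordered.index(end_at) if end_at is not None else len(ordered) - 1
    return ordered[first:last + 1]
```

Leaving out `start_at` or `end_at` in this sketch simply means "from the first file" or "to the last file".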
-
@muldjord That will be very useful. It would be great for me where I'm trying to merge some files into a set, but unfortunately those files aren't in a contiguous group (i.e. some start with a, some b, some c, etc.) However, I will use it to portion out updates on some of the 6,000 ROM set systems so I can do it by letter and check rather than have it run and then go back to see where it stopped, what was missed, etc.
Thanks!
-
@timekills When I add ROMs to a system, I just run Skyscraper and select yes to skip existing entries.
The flipside is that you can only do this once, with one scraper module. So if you want to scrape with another module, you have to scrape all files.
-
Skyscraper version 2.3.7 released: https://github.com/muldjord/skyscraper
- Now checks for .lha suffix and adds spaces where appropriate to get better results
- Improved returned image data validity check (libpng errors still happen, but can be ignored)
- Rewrote the worker to main thread communication a bit
- Implemented '--startat' option that tells Skyscraper the first file to scrape
- Implemented '--endat' option that tells Skyscraper the last file to scrape
- Added thread id to terminal output
- Applied serverside artwork size limit to openretro module to avoid running out of memory
- Improved network communication class
Another "bits and pieces" release. The inclusion of '--startat' and '--endat' should make it easier to do scraping of a subset of your games. This has been requested a few times and I've made use of it myself A LOT during testing.
2.3.6 had some issues with the new openretro parser since some of the returned covers are INSANELY large. Like, 10000x10000 resolution. That in conjunction with the new alphabetized queue system made it eat up ALL of the Pi's RAM really fast, which in turn made the kernel kill it off to ensure system stability. This has been fixed serverside, simply by asking for a resize before it reaches Skyscraper.
The changes to the network communicator are kind of beta, which means that I've tested them and haven't seen any problems. But I've removed the clearAccessCache call again since it should no longer be necessary. That call ensured that data didn't get mixed up, but was a bit of a workaround. My new code should ensure that data doesn't get mixed up without using that call. So pleeeeease, if you encounter game media getting mixed up between games, let me know! Shouldn't happen though.
This was a bit of a tough one. I spent the last 4 days trying to figure out what the hell was causing the crashes, until I stumbled upon those insane cover artwork resolutions from openretro. Then it became quite clear that Skyscraper wasn't the problem at all...
Anyways, all my testing has gone well on my end, so please do update and try it out.
Happy scraping! :)
-
Skyscraper version 2.3.8 released: https://github.com/muldjord/skyscraper
- Implemented user credentials ('-u user:password') to set up threads for 'screenscraper' module
- Made sure artwork output gets exported, even if entry has no base artwork resource
- Changed 'verbose' to 'verbosity' to allow levels and made terminal output more useful overall
- Added '--dbstats' command line option that prints stats for the selected local database cache
- Added '--purgedb' command line option that allows purging resources from localdb
- Fixed bugs in mergedb command line option
- Fixed bug in Simple Mode where 'attractmode' would not work properly (thank you Humayun)
Fixed a few bugs, added a few options, and FINALLY included the '-u' option properly. If you have a ScreenScraper user, please provide that using the format '-u user:password'. Remember that you can set this in '[homedir]/.skyscraper/config.ini' so you don't have to have it on the command line.
You can now check the stats of the local database cache with '--dbstats'. If you want to purge stuff from it, do this with '--purgedb' by adding 'm:[module]' and/or 't:[type]'. You can also have both comma-separated.
Example: 'Skyscraper -p amiga --purgedb m:thegamesdb,t:cover'. This will purge all covers from thegamesdb module from the Amiga platform's local database cache.
Another example: 'Skyscraper -p amiga --purgedb m:thegamesdb'. This will purge all resources from thegamesdb module from the Amiga platform's local database cache.
Last example: 'Skyscraper -p amiga --purgedb t:cover'. This will purge all cover resources from any module from the Amiga platform's local database cache.
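The way the 'm:[module]' and 't:[type]' parts combine in the three examples can be sketched like this. It is an illustrative Python sketch only, not Skyscraper's own parser; `parse_purge_filter`, `should_purge` and the dictionary layout of a cached resource are assumptions.

```python
def parse_purge_filter(argument):
    """Split 'm:[module],t:[type]' into a dict with optional 'module' and 'type' keys."""
    keys = {"m": "module", "t": "type"}
    parsed = {}
    for part in argument.split(","):
        prefix, sep, value = part.partition(":")
        if not sep or prefix not in keys or not value:
            raise ValueError(f"unrecognized purge filter: {part!r}")
        parsed[keys[prefix]] = value
    return parsed

def should_purge(resource, purge_filter):
    # A resource is purged only if it matches every criterion that was given.
    return all(resource.get(key) == value for key, value in purge_filter.items())
```

So 'm:thegamesdb,t:cover' only matches covers that came from thegamesdb, while 't:cover' on its own matches covers from any module.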
Lastly, I've changed '--verbose' to '--verbosity [level]' where [level] can be 0-3. The higher the level, the more output it will give you while scraping.
As always, please report bugs if you encounter any.
Happy scraping! :)
-
@muldjord Thanks for the update and the new additions xD
I was going to write that I don't know why, but my initial tests in 2.3.7 didn't work for me. This line, for example:
Skyscraper -d /home/pi/RetroPie/dbs/arcade --nosubdirs --noresize --updatedb -t 8 -p arcade --videos --unattend --skipped -s arcadedb --pretend
just showed me that the 720.zip game was not found (the first ROM in the arcade folder),
printed the elapsed and remaining time,
and then just hung there. No more output, nothing. After 10 min I hit Ctrl+C.
I thought my roms or dbs folders were not OK, but yes, they are OK.
Anyways, I updated to 2.3.7 a couple of hours ago, and then... I'll test the 2.3.8 version.
I remember you said that generating a log for debugging purposes was a bit difficult in the current Skyscraper, but if this could be added, wouldn't debugging and spotting errors be much, much easier? What do you think?
Thank you very much for such fast support!
Edit: tested 2.3.8, same thing. I'll try other command lines for other platforms; no idea what the hell is going wrong here... Also, I don't know why it can't find 720 in arcadedb...
The message i am getting is
#1/141 (T1) Pass 1 ---- Game '720' not found :( ----
Then the times, and that's all...
Edit 2: I skipped arcadedb and tried
Skyscraper -d /home/pi/RetroPie/dbs/arcade --nosubdirs --noresize -t 8 -p arcade --videos --unattend --skipped -s openretro --pretend
It's working.
Btw, I use a generic bash script tailored to my platforms, with -t 8 in all of them. I know this number is adjusted for specific sites, so is there no problem with using -t 8 in a generic way?
-
@bleuge Something seems to be wrong with ArcadeDB, not sure what. I'll investigate.
EDIT: You can use -t 8 if you prefer, it'll adjust accordingly. Be aware that you might run into problems with ScreenScraper though, as I am a bit unsure if the 4 threads limit I have on now is correct. If I were you I would provide my ScreenScraper user credentials in the config.ini so it's always used correctly for it.
EDIT2: I found the error with ArcadeDB. I had changed some stuff elsewhere that broke it... I'll release a patch shortly that will bring it back in working order.
-
2.3.9 out, please try that