Versatile C++ game scraper: Skyscraper
-
@muldjord Welcome back :)
I have to report good things about using Skyscraper. I bought a new (bigger) microSD card and thought maybe it's time to do video snaps. I backed up my (known good) dbs folders but deleted them later (because of owner issues, another story), so I rescraped everything again (I think I've done it 10 times now).
Results are amazing. :)
Have some ideas to improve skyscraper, but just take your time now to sort your things first.
-
@analoghero Happy to hear that! I am currently investigating some stuff with Dom from the Amiberry team for optimizing Amiga scraping and I also need to implement the Mobygames(.com) scraping module. But just post your ideas here and I'll check them out.
-
Has anyone tried scraping SNES recently and successfully had the ratings included?
Across the multiple sites that Skyscraper uses, only thegamesdb seems to get ratings pulled by Skyscraper, and it's not very complete. However, if I use another scraper (UXS, SSelph, etc.), it will get the ratings from other sites that Skyscraper also uses, at a much higher completion rate (nearing 100% from screenscraper...)
Of course, the problem is that none of the different scrapers can agree on how to format the gamelist.xml file, much less where to store the media, so each one overwrites the other. If there were a way to import the data for ALL the games at once, it would be fine. But breaking apart a gamelist.xml file into one file for each of 1000+ ROMs and naming each file... not really viable.
Bottom line: any good way to either
- Get the ratings when scraping (specifically SNES, can't speak to other games yet.)
or - Merge the gamelist.xml file from another scraper without having to create 1000+ individually named txt files?
(Note: I could not find in the priorities.xml which of the choices prioritizes which site's ratings are used or how to format just priorities when importing your own data files. I assume it's part of "description", but I don't need or want to overwrite the WHOLE description - just the ratings.)
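For the second option, one way to avoid creating 1000+ individual files would be to copy just the <rating> values from the other scraper's gamelist.xml into the existing one. A minimal sketch in Python, assuming both files use the EmulationStation-style <game>/<path>/<rating> layout and that the ROM file names match between them. `merge_ratings` is a hypothetical helper written for this post, not something Skyscraper provides:

```python
import os
import xml.etree.ElementTree as ET


def merge_ratings(target_xml: str, donor_xml: str) -> str:
    """Copy <rating> values from donor_xml into target_xml, matching by ROM file name.

    Only target games without a non-empty <rating> are changed; everything
    else (description, media paths, ...) is left alone.
    """
    donor = ET.fromstring(donor_xml)
    target = ET.fromstring(target_xml)

    # Map ROM file name (directory stripped) -> rating text from the donor list.
    ratings = {}
    for game in donor.iter("game"):
        path = game.findtext("path")
        rating = game.findtext("rating")
        if path and rating:
            ratings[os.path.basename(path)] = rating

    for game in target.iter("game"):
        path = game.findtext("path")
        if not path:
            continue
        rating = ratings.get(os.path.basename(path))
        if rating is None:
            continue  # donor has no rating for this ROM
        node = game.find("rating")
        if node is None:
            node = ET.SubElement(game, "rating")
        if not (node.text or "").strip():
            node.text = rating

    return ET.tostring(target, encoding="unicode")
```

Matching on the file name rather than on <name> sidesteps the scrapers disagreeing about how titles are written, but it does assume neither scraper renamed the ROMs.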
-
@timekills You can just add the <order type="rating"> node yourself in the priorities.xml file. I had no idea screenscraper has ratings, so I haven't implemented it. I will look into it straight away and see if I can find it. If I can, it will be included in 2.3.6 which will be released shortly (probably today or tomorrow).
EDIT: Just had a look at the screenscraper xml and I can't find a rating / score anywhere. If you have any information on this, please let me know. Do you mean age rating?
-
@muldjord While I had to rescrape everything again (SNES, NES, Master System, Mega Drive/Genesis, arcade, Amiga, Neo Geo, Atari 7800, C64) I had a few ideas.
I ran Skyscraper without arguments first, for each system. Then I scraped again with
-s screenscraper --videos
for the console systems. So what would be nice:
- If you could run Skyscraper to scrape specific data, for example screenshots.
- If you could check what data is missing.
- A different artwork.xml for arcade. I know I should do this myself, but I'm not good at it :)
I hope you don't get me wrong, as Skyscraper works well as it is and I know you do this in your spare time. Consider this input from a user, not a demand :)
-
@muldjord said in Versatile C++ game scraper: Skyscraper:
@timekills You can just add the <order type="rating"> node yourself in the priorities.xml file. I had no idea screenscraper has ratings, so I haven't implemented it. I will look into it straight away and see if I can find it. If I can, it will be included in 2.3.6 which will be released shortly (probably today or tomorrow).
EDIT: Just had a look at the screenscraper xml and I can't find a rating / score anywhere. If you have any information on this, please let me know. Do you mean age rating?
I don't know what I was thinking...you are correct, it definitely was not screenscraper.
I'll also note that after re-prioritizing the sources and running again from thegamesdb, I did get an equivalent rating listing. Mea culpa, and I apologize for the red herring. (Aside: allowing users to enter their username/password might help with the thread allowance for sites such as screenscraper, not to mention get you some good karma with them for the "free" advertising.) The request part remains as you answered, however, and I appreciate that. I'm still a little unclear from the various instructions on importing prior data whether I have to create a separate file for each individual game, or whether it will scan for the <game> or <name> and add the correct <rating> (or whatever order type is associated) from one large, contiguous file including all the games in a certain system.
@AnalogHero's request above, specifically running a scrape for one named data type as you can for videos, would assist in this. I realize whenever you run a scrape with --pretend it should update any missing data information, which you can then lock in with the localdb. Would scraping for just one data type speed this up?
What I think would be even better than telling it what to change/scan for, would be telling it what not to change/scan for. I.E. I already have a list of screenshot files that are compilations I want to keep. I've been doing this by importing and setting <source>import</source> as the top in the <screenshot> group and running a --pretend update just to update the localdb files then a -s localdb set. It seems to work, but adding the ability to deselect a data point would make this a bit less complicated.
Finally, I may have missed it but do you have a Patreon or some other donation site? Although many of us (cough me cough) sound like typical complaining about a free service whiners, I recognize how much capability you've provided with this fantastic scraper, and would like to donate.
-
@timekills @AnalogHero, Concerning scraping just one type of resource: I have a bit too much on the plate as it is right now, so it won't be possible anytime soon. I do however think it could be done, so I'll make note of it. Thanks for the suggestion.
@timekills, Concerning user credentials for screenscraper it is actually already halfway implemented. It's something I want to look into soon. Especially since I just updated it to the V2 api from screenscraper.
@timekills, I've never actually had a Patreon but I just created one. If anyone wishes to support my work, feel free to do so. BUT, keep in mind that doing so is in no way a requirement. I mainly do this for my own amusement and self-education. :)
-
@muldjord Few (more) concerns/notes:
-
When (attempting to) update from 2.3.5 to 2.3.6 by executing "./update_skyscraper.sh" it returns that you are already at the latest version. The var $LATEST returns 2.3.5, so it doesn't execute as it already believes it is up to date.
-
You still have to manually add --updatedb for the textual updates to take. When I ran it with just -s import it updated the media but not the text. When I added the --updatedb with the same files it also included the text updates.
-
And I ran a UXS scrape of Atari 2600 and got ratings for about half of them. The amount isn't really relevant; what is relevant is that UXS only scrapes from ScreenScraper and it is giving ratings.
I'll be buggered if I can figure out where they're stored on the site or how to access them short of another scraping application. Found it (see the screenshot below).
(Just in case, I deleted all data for the 2600 ROMs just to ensure there was no historic data changing the test before I ran the UXS scrape of ScreenScraper.)
Example:
<game id="14008" source="ScreenScraper">
  <path>/home/pi/RetroPie/roms/atari2600/Activision Decathlon, The (USA).zip</path>
  <name>Activision Decathlon, The (USA)</name>
  <desc>removed to shorten</desc>
  <image>/home/pi/RetroPie/roms/atari2600/media/screenshots/Activision Decathlon, The (USA)-image.png</image>
  <marquee/>
  <video/>
  <thumbnail/>
  <rating>0.8</rating>
  <releasedate>19840101T000000</releasedate>
  <developer>Activision</developer>
  <publisher>Activision</publisher>
  <genre>Sports</genre>
  <players>1-4</players>
</game>
[Screenshot: Breakout (USA) rating at screenscraper.fr]
-
-
-
2.3.6 is not out yet, so you have the latest version (you can always check it here: https://github.com/muldjord/skyscraper/releases) :) I always post about new versions here as well. I wanted to release it yesterday but I've postponed it a bit because I have some stuff I'd like to include. But maybe I'll just release it and include the rest later.
-
I just checked this, and it already adds '--updatedb' automatically whenever you use '-s import'.
-
The ratings for the Atari games look interesting. I'll try running some through and see if I can find an entry that has one. Then it'll be a piece of cake to include it. If I can find it, I will include it in 2.3.6. :)
EDIT: I found the rating! It's called "note", which I always assumed was some kind of... well, note. I had no idea that translated to "rating" in English. Anyway, it's now implemented and will be in 2.3.6.
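For anyone mapping that field by hand: the gamelist example earlier in the thread stores ratings as a fraction such as 0.8, so a "note" has to be rescaled. A small sketch in Python, assuming the note is a score out of 20 (that scale is an assumption on my part and is exposed as a parameter); `note_to_rating` is a hypothetical helper, not Skyscraper's actual code:

```python
def note_to_rating(note: str, scale: float = 20.0) -> float:
    """Convert a ScreenScraper-style 'note' string into a 0..1 rating.

    `scale` is the assumed maximum score; the result is rounded to 2 decimals.
    """
    value = float(note)
    # Clamp so out-of-range values cannot escape the 0..1 interval.
    return round(min(max(value / scale, 0.0), 1.0), 2)
```

Under that assumption a note of "16" becomes 0.8.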
-
-
Skyscraper version 2.3.6 released: https://github.com/muldjord/skyscraper
- Completely rewrote the openretro parser to make use of the 'edit' page instead
- Added '*.lha' suffix to amiga platform
- Changed default scraping module for all platforms to 'localdb'
- Added 'wheel' support to 'openretro' scraping module
- Rewrote thread queue so entries are always taken alphabetically
- Now forces 4 threads for 'screenscraper' scraping module to accommodate their limits
- Updated screenscraper API to use v2
- Added 'rating' to screenscraper scraping module
This release is a "bits and pieces" release. Nothing ground-breaking, but still important changes. The most important one being that each thread now works from a global file queue instead of splitting all files into groups and handing a group to each thread. The old way meant that some threads completed their work faster than others, resulting in a potential slowdown at the end of a scraping run because some threads had finished before others.
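The global queue idea can be sketched roughly as follows. This is an illustration in Python rather than the actual C++ implementation; `scrape_all`, the default worker count, and the stand-in `scrape` callback are all made up for the example:

```python
import queue
import threading


def scrape_all(files, worker_count=4, scrape=lambda name: name.upper()):
    """Process files from one shared, alphabetically ordered queue.

    `scrape` is a stand-in for the real per-file work.
    """
    work = queue.Queue()
    for name in sorted(files):
        work.put(name)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                # Each thread takes the next unclaimed file, so a slow file
                # only delays the thread handling it, not a whole group.
                name = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this thread is done
            data = scrape(name)
            with lock:
                results[name] = data

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because every thread keeps pulling until the queue is empty, no thread sits idle while another still has a backlog of pre-assigned files.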
To update, check the documentation on the GitHub page.
Happy scraping! :)
-
@muldjord Thanks for the update.
I also had to use --updatedb in the past, because Skyscraper didn't download videos. I don't know exactly why this happened and couldn't reproduce it, so I didn't report it.
Can you explain why you added .lha extensions for amiga? Are these supported by amiberry or is this for future whdload updates by HoraceAndSpider?
-
@analoghero Yes, lha will soon be supported directly in Amiberry so that's the reason. :) That is probably the best to happen for Amiga emulation in ages imo. Getting rid of the whole adf switching or uae config juggling is just amazing. Sooo looking forward to that.
-
.lha support for Amiberry has already been added to the development branch ;)
-
I'm currently working on an "add spaces to filename" solution for the .lha scraping on Amiga. Most .lha files come in the name format of "ThisIsAGameName3_v1.2.lha". So figuring out rules to turn this into "This Is A Game Name 3" for use with searching is a bit of a fiddly thing. That example is easy. But what about "3DPool", "4x4Driving" or "ABCGame"? As you might be able to guess, those break the easily applied rules and return "3 D Pool", "4x 4 Driving" and "A B C Game". And then I can add special rules. For instance I can check if we have a '3' before a 'D' and then I won't be inserting a space before it. But then what about "Game3Deluxe"? That would then become "Game 3Deluxe"... Then I add a special rule if it says "Deluxe" and so on. It's a game of tugging. Adding a new rule breaks others. But I've so far found a pretty good middle ground. There are some that are just named badly. Some games have odd numbers at the end of them that aren't sequel numbers, such as "ThisGame45" where 45 is seemingly some sort of versioning (it's not sequel numbering in these cases).
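To make the problem concrete, here is a naive version of the "easily applied rules" in Python. It is only an illustration of why the simple approach breaks, not Skyscraper's code; `add_spaces` and its regexes are invented for this example:

```python
import re


def add_spaces(filename: str) -> str:
    """Naive 'add spaces to filename' pass for WHDLoad-style .lha names."""
    stem = re.sub(r"\.lha$", "", filename, flags=re.IGNORECASE)
    # Drop a trailing version tag such as "_v1.2".
    stem = re.sub(r"_v\d+(\.\d+)*$", "", stem)
    # Rule 1: space before every capital letter that is not the first character.
    stem = re.sub(r"(?<=.)(?=[A-Z])", " ", stem)
    # Rule 2: space before a digit that directly follows a letter.
    stem = re.sub(r"(?<=[A-Za-z])(?=\d)", " ", stem)
    return stem


print(add_spaces("ThisIsAGameName3_v1.2.lha"))  # This Is A Game Name 3
print(add_spaces("3DPool.lha"))                 # 3 D Pool
print(add_spaces("4x4Driving.lha"))             # 4x 4 Driving
print(add_spaces("ABCGame.lha"))                # A B C Game
```

The first name comes out right; the other three show exactly the failures described above, and each special case added to fix one of them risks breaking another name.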
Btw, user credentials for 'screenscraper' have been implemented and will be in 2.3.7. You can then insert your own ss user and use as many threads as that allows.
2.3.7 NOT released yet, just to clarify. This is just a progress update. :)
-
@muldjord What if you create a txt file from a complete whdload directory, then insert spaces with a tool, and then check that it's correct? Once you have the txt file, use it as a lookup table, like BrianTheLionAGA.lha = Brian the Lion AGA.lha. It's a little bit of work, but maybe more accurate than checking every filename.
-
@analoghero Thank you for the suggestion. Yes, it will eventually become something like that. It's the same thing I currently do to translate the MAME names to actual game names. The problem with these lookup tables is that they are fixed. So if someone decides to change the filename of an .lha file, it stops working. I could then use the sha1 of the files instead, but they might change as well depending on where people got them from.
For the time being I think I'll stick with the automated method until I can come up with one that uses the slave files from inside the lha archives. That would work better for a lookup method.
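The two approaches are not mutually exclusive: a fixed table can be consulted first, with the automated method as the fallback for renamed or unknown files. A rough Python sketch of that combination, where the table contents, `search_name`, and the crude fallback regex are all hypothetical:

```python
import re

# Hypothetical hand-checked table: archive file name -> name used for searching.
LOOKUP = {
    "BrianTheLionAGA.lha": "Brian the Lion AGA",
}


def search_name(filename: str, table=LOOKUP) -> str:
    """Prefer the curated table; fall back to an automatic split otherwise."""
    if filename in table:
        return table[filename]
    # Fallback: strip the extension and put a space wherever a lowercase
    # letter is followed by a capital or a digit. Deliberately crude.
    stem = filename.rsplit(".", 1)[0]
    return re.sub(r"(?<=[a-z])(?=[A-Z0-9])", " ", stem)
```

A renamed file simply misses the table and drops through to the automatic rules, so the lookup improves accuracy where it applies without making scraping fail where it doesn't.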
-
@muldjord I downloaded lhas back in the day on my real Amiga, and have them stored on an Amiga SFS-formatted hard disk along with Workbench. I'm waiting for the day I can somehow connect that to the Pi, as I don't have a PC with IDE connections anymore.
Problem is: all my whdload files are zips, so I can't help you with testing :(
-
@analoghero No worries, I have plenty of files to test on :)
-
Just implemented '--startat' and '--endat' options that allow the user to define what files to begin and end at when scraping. In this mode it ONLY caches data, it doesn't change the game list.
This allows you to only scrape a span of files. So if you have 1200 files, but only want to scrape the middle 60, then you can do so by, alphabetically, defining the first and last file in that span. Makes sense? Useful? I know I will be using it myself at least. :D
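In other words, the selection behaves roughly like slicing the alphabetically sorted file list. A small Python sketch of that idea; `select_span` is a hypothetical helper, and both the inclusive end and the plain case-sensitive sort are assumptions rather than confirmed Skyscraper behaviour:

```python
def select_span(files, start_at=None, end_at=None):
    """Return the sorted files from start_at through end_at, inclusive.

    Leaving either bound as None means "from the first" / "to the last" file.
    Raises ValueError if a named file is not in the list.
    """
    ordered = sorted(files)
    lo = ordered.index(start_at) if start_at is not None else 0
    hi = ordered.index(end_at) if end_at is not None else len(ordered) - 1
    return ordered[lo:hi + 1]
```

For example, `select_span(["d.zip", "a.zip", "c.zip", "b.zip"], "b.zip", "c.zip")` gives `["b.zip", "c.zip"]`.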
-
@muldjord That will be very useful. It would be great for me where I'm trying to merge some files into a set, but unfortunately those files aren't in a contiguous group (i.e. some start with a, some b, some c, etc.) However, I will use it to portion out updates on some of the 6,000 ROM set systems so I can do it by letter and check rather than have it run and then go back to see where it stopped, what was missed, etc.
Thanks!