Versatile C++ game scraper: Skyscraper
-
I connected to the Windows host; these are the pastebins.
I'm using the latest Skyscraper cloned from the repo (git clone ...)
I've made a non-exhaustive check of the PNG images, and it seems the only bad images are in the arcade folder.
Also, regarding the hack-link bug: when you fixed it, I re-scraped everything with the --updatedb option to (as far as I understood) update everything. But I'm still finding many wrong hack-link images associated with games.
I guess this can't be fixed since they are in my localdb, and the filename is not related to hack-link, so Skyscraper is getting these files again. So, is the solution to delete all my localdb folders and rescrape everything?
Thanks muldjord!
-
@bleuge Well, the first problem with the doubles is pretty obvious. You have the galaga.zip file in both the fba folder and the fba/samples folder. Skyscraper also scrapes subdirs. If you don't want this, please use the '--nosubdirs' option.
If you run all scraping modules with the '--updatedb' option, the hack-link entries in your localdb's should go away without the need for deleting the entire dbs/platform folder. If they don't, maybe it's because you're skipping existing entries?
I'll check the arcadedb module to see if anything is wrong with the image check.
EDIT: Can you provide a filename for a game that returns a faulty png?
-
@muldjord I've seen faulty PNGs in 720 (arcade) and alien3, for example. I don't think this is important; I could update these files by hand or whatever.
The files don't start with the typical PNG header (PNG magic in the first bytes), but they end with an IEND tag (marking the end of file in a PNG).
As I said, it's not important. But if it's easy and doesn't slow Skyscraper down, checking that it is a PNG file prior to saving it could be nice.
This is part of the script I used to rescrape the sets:
Skyscraper --updatedb -t 8 -p coleco --videos --unattend --skipped -s openretro --pretend
Skyscraper --updatedb -t 8 -p coleco --videos --unattend --skipped -s thegamesdb --pretend
Skyscraper --updatedb -t 8 -p coleco --videos --unattend --skipped -s screenscraper --pretend
Skyscraper --updatedb -t 8 -p coleco --videos --unattend --skipped -s localdb
I used --skipped because I wanted even unrecognized files to be included in the gamelist (so I could visually check for files that were not scraped).
-
@bleuge Did you delete the non-screenscraper localdb's? Otherwise there's no reason to use '--updatedb' on those, it just wastes bandwidth for the sources since your cached data is completely the same.
Concerning the png's I have no idea why it doesn't have a header. I DO check the images and they are saved using the Qt image function which creates the header itself. So I'll need to think that over for a bit to figure out why it would ever do that.
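For illustration, the check bleuge suggests could look roughly like the sketch below. It is plain standard C++, not Skyscraper's actual code (which, as noted, saves images through Qt). The function name `looksLikePng` and the idea of inspecting the raw downloaded bytes before saving are assumptions made for the example. The eight signature bytes themselves are fixed by the PNG format.

```cpp
#include <array>
#include <cstring>
#include <string>

// The eight-byte signature that every valid PNG file starts with.
constexpr std::array<unsigned char, 8> kPngSignature = {
    0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

// Returns true if `data` begins with the PNG signature.
// `data` stands for the raw downloaded bytes (hypothetical input,
// not a Skyscraper type).
bool looksLikePng(const std::string &data) {
  if (data.size() < kPngSignature.size()) {
    return false;  // too short to even hold the signature
  }
  return std::memcmp(data.data(), kPngSignature.data(),
                     kPngSignature.size()) == 0;
}
```

Comparing eight bytes costs next to nothing per image, so a check like this should not noticeably slow down a scraping run. It only validates the start of the file, so data that is truncated or corrupted later on would still pass.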
-
2.3.1 is coming along nicely. I've implemented the scummvm.ini parsing, which is really cool, so thank you for that suggestion @stoo.
I've also implemented numeral checking on titles, which basically means that if a file is called "Blah 4" and the returned title is "Blargh", it won't even compare them, it just skips the result. The default numeral is "1", so "Blah 1" and "Blah" are a match.
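The numeral rule described above could be sketched roughly as follows. This is a simplified illustration in standard C++, not the actual searchMatch code. The helper names `titleNumeral` and `numeralsMatch`, the choice to look only at the last word of the title, and the small Roman numeral table are all assumptions.

```cpp
#include <cctype>
#include <map>
#include <sstream>
#include <string>

// Returns the numeral carried by the last word of `title`,
// or 1 (the default numeral) if the last word is not a numeral.
int titleNumeral(const std::string &title) {
  std::istringstream words(title);
  std::string word, last;
  while (words >> word) {
    last = word;
  }
  if (last.empty()) {
    return 1;
  }

  // Arabic numerals such as "4". Length is capped to keep std::stoi in range.
  bool allDigits = true;
  for (unsigned char c : last) {
    if (!std::isdigit(c)) {
      allDigits = false;
    }
  }
  if (allDigits && last.size() <= 4) {
    return std::stoi(last);
  }

  // Roman numerals such as "IV" (only I to X in this sketch).
  static const std::map<std::string, int> roman = {
      {"I", 1}, {"II", 2}, {"III", 3}, {"IV", 4}, {"V", 5},
      {"VI", 6}, {"VII", 7}, {"VIII", 8}, {"IX", 9}, {"X", 10}};
  const auto it = roman.find(last);
  return it != roman.end() ? it->second : 1;
}

// A result is only worth comparing further if both numerals agree.
bool numeralsMatch(const std::string &fileName, const std::string &resultTitle) {
  return titleNumeral(fileName) == titleNumeral(resultTitle);
}
```

With this sketch, `numeralsMatch("Blah 1", "Blah")` is true and `numeralsMatch("Blah 4", "Blargh")` is false, matching the examples above. A known weakness of looking only at the last word is that a title ending in "X" would be read as 10, so a real implementation would need more care.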
-
Skyscraper 2.3.1 released: https://github.com/muldjord/skyscraper
- Fixed 'players' tag to always conform to a 1-digit format
- Now filters out ".hack-Link" results from 'screenscraper' to avoid bad localdb data
- Added note to output about how many new resources have been added during scraping run
- Added 'color="#fffff"' option to stroke effect for the geeky people (including me of course)
- Conformed 'game tags' to 'Platform, Action' format
- Fixed so 'localdb' folder isn't created inside dbs media folders
- Optimized the mameMap a bit
- Improved the searchMatch system to also consider numerals
- Now looks up 'scummvm' dummy files in 'scummvm.ini' and uses the correct game name
This release contains some user requests, the screenscraper 'hack-Link' fix and a bunch of optimizations. Most prominently, Skyscraper is now fully aware of game numerals ("Game 4" or "Game IV") and acts upon them when comparing results. This should mark the end of game sequels being matched with results that don't have the same numeral in the title as the filename.
A quick note: You might notice that you have fewer "game found" with this release. This is intentional. I've changed the default minimum match percentage to 65 (from 50 before) to eliminate more false positives. Combined with the stricter numeral checking, that will result in fewer false positives, which might make it look like it finds fewer correct results. That should not be the case. The results are just more precise.
Let me know what you think and happy scraping!
-
Thanks very much!!! Skyscraper is getting better and better!
-
Skyscraper is great. But I still get the ".hack-Link" results from screenscraper with version 2.3.1.
-
@jwcbronski Yes, you need to rerun it with '--updatedb'. Otherwise it uses the cached results (which still contain the hack-link entries). Running it with '--updatedb' will refetch the data from screenscraper and overwrite the faulty hack-link results effectively removing them.
-
@muldjord I deleted the whole .skyscraper folder before I installed and ran Skyscraper 2.3.1. So there were no cached results. I started from scratch and got the ".hack-Link" results again. That was just 2 hours ago.
-
@jwcbronski Crap. Just for good measure, can you please run "Skyscraper" and visually read the version number to verify 100% that you are in fact running 2.3.1? Just so I don't start spending a lot of time creating a new fix for no reason.
Problem here being that I can't test the fix myself since I don't get these faulty results. So I made the fix blind. But it really should work. I check every result from screenscraper and compare it to ".hack-Link" and then skip it.
Could you please provide a snippet of the output from Skyscraper when it delivers the faulty results? Then I can use that to work on a new fix.
EDIT: Also, anyone else still having the issue?
-
@muldjord I just started Skyscraper and it is v2.3.1. Here's a gamelist entry:
<game>
  <path>/home/pi/RetroPie/roms/c64/Supermacy.d64</path>
  <name>.hack//Link</name>
  <cover>/home/pi/RetroPie/roms/c64/media/covers/Supermacy.png</cover>
  <image />
  <marquee />
  <rating />
  <desc>The first game in the .hack series for PSP (and the planned final game for the franchise), .hack//LINK logs player into a new version of its virtual landscape called The World R:X (the "R" stands for "Revision"). Set 10 years after the last .Hack, players take control of Tokio Kuryu, a second year junior-high student. Presented through manga-style visuals, the game's story promises to clear up the mysteries from past entries. Over 100 characters from past .hack games, anime, manga, and books will make an appearance. Gameplay promises to retain the basics of past titles, with players facing off in battle against enemies as they explore dungeons. The difference here is that you move around in a party of two, with the CPU-controlling the other character. The game will include 33 such CPU-controlled characters. For the PSP game, the battle system has been changed to a more action-heavy combat system.</desc>
  <releasedate />
  <developer>Bandai Namco</developer>
  <publisher>CyberConnect2</publisher>
  <genre>Role playing games</genre>
  <players />
</game>
-
@muldjord I scraped everything new except for c64 and it worked fine. I tested c64 now, and it also gave me hack-Link results when scraping with
Skyscraper -p c64 -s screenscraper --updatedb
As I said, the strange thing is that every other platform I have worked fine with -s screenscraper (except for amiga, which we discussed earlier).
-
@jwcbronski Oh, I see the problem... There's more than one way it'll return the hack-link entry. I only filter on ".hack-Link", not ".hack//Link". I wonder how many there are then... Anyways, I'll create a more robust filter that simply looks for "hack" and "Link" and filters all of those.
Thank you for your help on this.
-
@analoghero Yes, it appears that the problems on screenscraper's end persist and even seem to be broader than I first thought. Anyways, 2.3.2 coming up... I want this fix out there asap.
-
@muldjord Glad I could help. Have you seen this thread on GitHub?
https://github.com/sselph/scraper/issues/214
They also talk about ".hack//Link".
-
@jwcbronski Thank you, yes I glanced over that thread just earlier today. I thought the "//" was just a spelling error. But it would seem that it actually sometimes returns one and sometimes the other.
Either way... Release is ready.
-
Skyscraper 2.3.2 released: https://github.com/muldjord/skyscraper
- Added support for 'wii' and 'gc' platforms
- Added '.chd' format to a bunch of platforms
- Added more robust filtering of the faulty screenscraper 'hack-Link' results
It now looks for "hack" and "Link" and if both exist in the title it skips it. So it'll work for both ".hack-Link" and ".hack//Link". Please let me know if the issue persists in any form.
Also added two new platforms per user request and a bunch of file formats to new and existing platforms. :)
Happy scraping!
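The more robust filter described in this release could be sketched as below. This is an illustration in standard C++, not the code that shipped in 2.3.2. The function name `isHackLinkTitle` is hypothetical, and the case-sensitive substring match is an assumption based on the wording "looks for "hack" and "Link"".

```cpp
#include <string>

// Returns true if `title` contains both "hack" and "Link",
// meaning the result should be skipped as a faulty hack-Link entry.
// The match is case-sensitive in this sketch (an assumption).
bool isHackLinkTitle(const std::string &title) {
  return title.find("hack") != std::string::npos &&
         title.find("Link") != std::string::npos;
}
```

This returns true for both ".hack-Link" and ".hack//Link", while a title that contains only one of the two words is left alone.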
-
Curious, what happens if you want to scrape the game .hack//Link?
-
@livefastcyyoung Haha, yeah, I thought about that myself and it simply won't. I could do some further checks, for instance check if the platform is "psp" and then allow it anyways, but what if other psp results are faulty? Of course there is a way to get around all of that, but frankly I don't feel like it's worth plastering my code with all sorts of weird checks, just to let people scrape that one game. :) So I hope people are ok with that. At least until screenscraper fixes the problem and I can remove the checks again.