Raspberry Pi OS 64bit version
-
@molokkoplus the majority of default cores, retroarch, etc already compile in 64-bit via RetroPie, but you will need the 64-bit userland for that to happen (ie, the raspberry pi OS 64-bit beta version). but retropie doesn't officially support this, yet, so it's at your own risk...
PS, not an obvious gain in performance, to me (but then very little is CPU-bound on retropie for pi 4 anyway).
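If you're not sure which userland you're actually running (a 64-bit kernel can sit under a 32-bit userland), a quick check helps. This is only a minimal sketch using the Python standard library; the function name `userland_bits` is made up for illustration.

```python
import platform
import struct


def userland_bits() -> int:
    """Return the pointer size of the running interpreter in bits (32 or 64)."""
    # struct.calcsize("P") is the size of a C pointer for this Python build,
    # which follows the userland it was compiled for, not the kernel.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    # platform.machine() typically mirrors `uname -m`, i.e. the kernel's view
    # (e.g. aarch64 or armv7l), so the two values can legitimately disagree.
    print(f"kernel arch: {platform.machine()}")
    print(f"userland:    {userland_bits()}-bit")
```

If the kernel reports aarch64 but the userland is 32-bit, the cores are presumably still being built as 32-bit binaries.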
-
@molokkoplus said in Raspberry Pi OS 64bit version:
64-bit will only lead to a substantial gain in performance when all components of Retropie(Cores, Retroarch etc.) are properly rewritten in 64-bit.
Expect this to take another few years.
@dankcushions said in Raspberry Pi OS 64bit version:
PS, not an obvious gain in performance, to me (but then very little is CPU-bound on retropie for pi 4 anyway).
I run my RetroPie “test build” exclusively on 64-bit RPiOS with the KMS video and audio drivers and have had a lot of stability with it.
Building the cores/emulators in 64-bit is fairly trivial as most of them already support other arm64/aarch64 platforms.
The main issue with 64-bit RPiOS is the lack of hardware acceleration for video decoding but that isn’t much of a problem for RetroPie and this feature will be implemented in the near term at Pi Towers.
@dankcushions is 100% right, the real hw limitation on the Pi4 is the GPU. The first GB of the Pi4’s memory is only available to the GPU through CMA so when you start to crank up the resolution on 3D emulators the GPU will become the bottleneck. The Vulkan driver helps but it can only do so much until the 1GB limitation rears its ugly head.
The Vulkan driver is getting optimization improvements every few weeks but the only way to get that driver is from building from MESA gitlab. I suspect that the user base won’t see the driver in apt until the next Debian major release “Bullseye” whenever that happens.
-
@bluestang most pi4 are sold with 4gb of memory. Why is there a 1 GB bottleneck?
-
@billymild the 1 GB access limitation is for the GPU. the CPU has access to the full memory.
i'm not convinced that any viable emulation task would ever get near 1GB GPU memory, though. i believe the internal bandwidth of the pi and various bottlenecks in the GPU would make framerates a crawl at the kinds of tasks that need 1GB VRAM, regardless.
realtime VRAM usage used to be easy enough to measure:
sudo vcdbg reloc
- presume this doesn't work correctly on pi4, though.
-
Specific to MAME, there are measurable improvements comparing 32bit to 64bit.
I benchmarked 650 games in MAME on an RPi3B+ and an RPi4B, and compared stock speeds, overclocked, 32bit and 64bit (full kernel+os+binary). These are tested on Raspbian Buster (based on Debian 10 Buster, GCC8.3).
Write up and results are here:
https://stickfreaks.com/misc/raspberry-pi-mame-benchmarks
Within that link you can find the raw CSV data, as well as a Google Sheet link with some silly stats I did. Across all 650 games there's an average jump of around 18% on an RPi4, however there are individual games that see well over 50% speedups (and a very small amount that go backwards). Overall it seems to be very positive.
I can't speak for other emulators or RetroArch cores, but I would assume similar results for those given the inherent nature of what emulation does. If anyone knows of a similar method of benchmarking other emulators, please let me know. (MAME makes this easy with a "-bench" command line flag, and the ability to automate the process and output the results to a text file for easy logging). I'm happy to repeat the tests on other emulators if there's an objective way to do so that doesn't involve manually recording frame rates from a GUI overlay.
Similar results happened with MAME on x86 when it moved to 64bit. Discussion from back in 2007 here: http://forum.arcadecontrols.com/index.php?topic=74600.0
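The automation described above could look roughly like the sketch below. It is not the script used for the published results. It assumes a `mame` binary on the PATH, that `-bench` accepts a number of seconds, and that the run prints an "Average speed: N% (M seconds)" line like the ones quoted later in this thread. The function names are made up for illustration.

```python
import csv
import re
import subprocess
import sys
from typing import Optional

# Matches lines such as "Average speed: 571.93% (89 seconds)".
SPEED_RE = re.compile(r"^Average speed:\s*([\d.]+)%", re.MULTILINE)


def parse_average_speed(output: str) -> Optional[float]:
    """Extract the average speed percentage from MAME's output, if present."""
    match = SPEED_RE.search(output)
    return float(match.group(1)) if match else None


def bench(game: str, seconds: int = 90, mame: str = "mame") -> Optional[float]:
    """Run one game in benchmark mode and return its average speed.

    Assumption: `mame <game> -bench <seconds>` is the right invocation.
    """
    result = subprocess.run(
        [mame, game, "-bench", str(seconds)],
        capture_output=True,
        text=True,
        check=False,
    )
    # Search both streams, since it is not assumed which one carries the line.
    return parse_average_speed(result.stdout + result.stderr)


if __name__ == "__main__":
    # Usage: python bench_all.py 1942 arabian > results.csv
    writer = csv.writer(sys.stdout)
    writer.writerow(["game", "average_speed_percent"])
    for name in sys.argv[1:]:
        writer.writerow([name, bench(name)])
```

Writing one CSV row per game keeps the output easy to diff between 32-bit and 64-bit runs.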
-
@elvis said in Raspberry Pi OS 64bit version:
Specific to MAME, there are measurable improvements comparing 32bit to 64bit.
I benchmarked 650 games in MAME on an RPi3B+ and an RPi4B, and compared stock speeds, overclocked, 32bit and 64bit (full kernel+os+binary). These are tested on Raspbian Buster (based on Debian 10 Buster, GCC8.3).
Write up and results are here:
https://stickfreaks.com/misc/raspberry-pi-mame-benchmarks
Within that link you can find the raw CSV data, as well as a Google Sheet link with some silly stats I did. Across all 650 games there's an average jump of around 18% on an RPi4, however there are individual games that see well over 50% speedups (and a very small amount that go backwards). Overall it seems to be very positive.
nice! did you run within the desktop, though? because MAME on pi4 via x is allegedly majorly faster to the point i'm not really sure there's any value in any other type of optimization, at least via retropie (which is not run within X), at least, until we've investigated this further (via kms vs fkms, later sdl2 versions, etc).
I can't speak for other emulators or RetroArch cores, but I would assume similar results for those given the inherent nature of what emulation does.
i'm afraid very few of the poorly performing emulators on pi are cpu bound. it's true that emulation is traditionally cpu-intensive, but most emulators in retropie on pi have dynarecs and/or accelerated graphics (leveraging gpu) that make the cpu requirements often quite low, and it's more often the GPU or system bandwidth that is the bottleneck (the latter may be improved in 64-bit, mind). mame is an exception to this as they don't typically use dynarecs or accelerated 3d graphics, opting for full-fat cpu emulation.
that said, it could be useful for those who want faster than fullspeed performance in 2d emulators, for fast-forward functions, etc.
If anyone knows of a similar method of benchmarking other emulators, please let me know. (MAME makes this easy with a "-bench" command line flag, and the ability to automate the process and output the results to a text file for easy logging). I'm happy to repeat the tests on other emulators if there's an objective way to do so that doesn't involve manually recording frame rates from a GUI overlay.
it's possible to harvest benchmarking data from retroarch cores via verbose logging. an unfinished POC: https://github.com/dankcushions/retropie-auto-testing
-
@dankcushions said in Raspberry Pi OS 64bit version:
nice! did you run within the desktop, though? because MAME on pi4 via x is allegedly majorly faster to the point i'm not really sure there's any value in any other type of optimization, at least via retropie (which is not run within X), at least, until we've investigated this further (via kms vs fkms, later sdl2 versions, etc).
I used the MAME supplied "-bench" flag. It renders video and audio internally, but doesn't send them to the GPU/screen/sound hardware. This is useful in that you see the true "raw" performance of MAME, free from any display technology or driver issues.
I'm happy to retest games under actual display modes if required. MAME supports a number of screen display and rendering modes (including a number of legacy lossy-compressed framebuffer modes like YUV420, which were brilliant back in the day of poor 2D performance in Linux), but also your usual candidates like OpenGL, SDL, X11 etc. That will also answer the question of exactly how much overhead exists between the upper bounds of CPU performance, and what overhead actually spitting out a picture introduces.
I'm currently running the overclocked Pi4 64bit tests again under Debian Bullseye (merely dist-upgrading Raspbian from Buster) and a newly compiled MAME via GCC10 (up from GCC8 on Buster). If you have specific video outputs you want me to test, let me know (I'll fetch a complete list and post it when I'm back at a PC).
it's possible to harvest benchmarking data from retroarch cores via verbose logging. an unfinished POC: https://github.com/dankcushions/retropie-auto-testing
Brilliant, cheers, I'll check that out. My overall goal was less around comparing 32bit to 64bit, and more just trying to come up with an objective yes/no answer to the "will a Pi run my favourite game/system?" question. So even with all the dynarec points acknowledged, this will help me with my original goal.
-
I used the MAME supplied "-bench" flag. It renders video and audio internally, but doesn't send them to the GPU/screen/sound hardware. This is useful in that you see the true "raw" performance of MAME, free from any display technology or driver issues.
unfortunately the player doesn't have that luxury ;)
If you have specific video outputs you want me to test, let me know (I'll fetch a complete list and post it when I'm back at a PC).
the thread i linked uses ddp3 as an example. we use sdl, and no x11. we did bench a variety of different display modes when creating the build script and that seemed to be the fastest, outside of x11.
Brilliant, cheers, I'll check that out
great! feel free to iterate on what i've done. i would love to see some kind of automated testing in retropie.
-
@dankcushions said in Raspberry Pi OS 64bit version:
@elvis
unfortunately the player doesn't have that luxury ;)
Totally understood. I was originally testing outside of the scope of RetroPie (I run an RPi4 as a desktop, using MAME, ScummVM and Box86 for various emulation and game research tasks). So that number interested me without the variability of display drivers, scaling, etc. Plus then I can also pretty objectively compare to John IV's long running published benchmarks that do the same.
the thread i linked uses ddp3 as an example. we use sdl, and no x11. we did bench a variety of different display modes when creating the build script and that seemed to be the fastest, outside of x11.
Ok cool, I'll take a read through that tomorrow and catch up on it all. I'd considered testing at least a portion of the really fast games under different display outputs, at least to see how big that delta was. Maybe it's worth running the whole lot through all the display modes?
The Pi4 in particular, with its newer GPU and drivers, seems to have potential for improvement. I have a feeling those results will change as driver and Mesa improvements continue.
I'm starting to feel like I could just run these benchmarks on loop forever. Maybe I need another couple of RPis :)
-
Quick and dirty test on what's on my RPi4 (CPU 2.0GHz, GPU 600MHz/128MB) now. I'm in a different physical location to the Pi, so I can't change the OS at the moment. Currently running 64bit (aarch64) Debian 11 Bullseye with the Raspbian kernel 5.10.17-v8 (current setup was Raspbian Buster, dist-upgraded to Bullseye). MAME 0.230 is compiled with Bullseye's GCC10.
I ran 1942 and arabian with common flags "-verbose -seconds_to_run 90 -nothrottle -nowaitvsync", and then combinations of video, video drivers, and video scale modes appropriate to each output type. Some failed to run (accel_directfb, soft_yuy2 and soft_yuy2x2 reported as not supported).
Currently attached to a monitor set to 1024x768 resolution.
bgfx_backend is set to opengl, and bgfx_screen_chains set to unfiltered in the config file.
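For anyone wanting to repeat this, the set of runs can be generated rather than typed by hand. The sketch below only builds the command lines. The common flags are the ones quoted above, but the mapping from each log label to `-video`, `-videodriver` and `-scalemode` options is my guess from the log file names, not something confirmed here, and `build_command` is a made-up helper name.

```python
import shlex
from typing import Dict, List

# Common flags as quoted in the post above.
COMMON_FLAGS: List[str] = [
    "-verbose", "-seconds_to_run", "90", "-nothrottle", "-nowaitvsync",
]

# Label -> extra options. ASSUMPTION: inferred from the log file names;
# check against `mame -showusage` on your build before relying on it.
MODES: Dict[str, List[str]] = {
    "none": ["-video", "none"],
    "opengl": ["-video", "opengl"],
    "bgfx": ["-video", "bgfx"],
    "accel_auto": ["-video", "accel", "-videodriver", "auto"],
    "accel_x11": ["-video", "accel", "-videodriver", "x11"],
    "soft_none": ["-video", "soft", "-scalemode", "none"],
    "soft_hwblit": ["-video", "soft", "-scalemode", "hwblit"],
    "soft_hwbest": ["-video", "soft", "-scalemode", "hwbest"],
    "soft_yv12": ["-video", "soft", "-scalemode", "yv12"],
}


def build_command(game: str, label: str, mame: str = "mame") -> List[str]:
    """Return the argument list for one game/mode combination."""
    return [mame, game, *COMMON_FLAGS, *MODES[label]]


if __name__ == "__main__":
    # Print shell lines that log each run to <game>_<label>.log.
    for game in ("1942", "arabian"):
        for label in MODES:
            cmd = shlex.join(build_command(game, label))
            print(f"{cmd} > {game}_{label}.log 2>&1")
```

Printing the commands instead of running them makes it easy to eyeball the flags first, then pipe the output to a shell.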
All modes tested (just for reference to compare to what actually produced a useful log):
$ ls -1d 1942*log
1942_accel_auto.log
1942_accel_directfb.log
1942_accel_x11.log
1942_bgfx.log
1942_none.log
1942_opengl.log
1942_soft_hwbest.log
1942_soft_hwblit.log
1942_soft_none.log
1942_soft_yuy2.log
1942_soft_yuy2x2.log
1942_soft_yv12.log
1942_soft_yv12x2.log
And the results that returned a speed, sorted from best to worst are:
$ grep ^Aver 1942*log | sort -k3 -gr
1942_none.log:Average speed: 571.93% (89 seconds)
1942_accel_auto.log:Average speed: 300.01% (89 seconds)
1942_accel_x11.log:Average speed: 297.49% (89 seconds)
1942_soft_hwbest.log:Average speed: 295.65% (89 seconds)
1942_soft_hwblit.log:Average speed: 289.79% (89 seconds)
1942_opengl.log:Average speed: 271.81% (89 seconds)
1942_soft_yv12.log:Average speed: 271.39% (89 seconds)
1942_soft_yv12x2.log:Average speed: 265.13% (89 seconds)
1942_bgfx.log:Average speed: 223.18% (89 seconds)
1942_soft_none.log:Average speed: 137.64% (89 seconds)
$ grep ^Aver arabian*log | sort -k3 -gr
arabian_none.log:Average speed: 1671.79% (89 seconds)
arabian_accel_auto.log:Average speed: 314.14% (89 seconds)
arabian_accel_x11.log:Average speed: 312.98% (89 seconds)
arabian_opengl.log:Average speed: 305.48% (89 seconds)
arabian_soft_hwbest.log:Average speed: 304.67% (89 seconds)
arabian_soft_hwblit.log:Average speed: 303.42% (89 seconds)
arabian_soft_yv12.log:Average speed: 268.34% (89 seconds)
arabian_soft_yv12x2.log:Average speed: 260.45% (89 seconds)
arabian_bgfx.log:Average speed: 219.92% (89 seconds)
arabian_soft_none.log:Average speed: 151.68% (89 seconds)
Definitely backs the choice of SDL without x11 as the preferred video output. Fairly similar upper bounds on each, despite no-video runs different by a scale of almost 3.
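To put numbers on that comparison, the grep output can be parsed and the no-video speed compared with the fastest displayed mode. This is a minimal sketch that assumes the `<game>_<mode>.log:Average speed: N%` line format shown above. The sample only includes two lines per game, and the helper names are made up.

```python
import re
from typing import Dict

# Matches e.g. "1942_accel_auto.log:Average speed: 300.01% (89 seconds)".
LINE_RE = re.compile(
    r"^(?P<game>[^_]+)_(?P<mode>.+)\.log:Average speed:\s*(?P<speed>[\d.]+)%"
)


def parse_results(text: str) -> Dict[str, Dict[str, float]]:
    """Build {game: {mode: speed_percent}} from grep-style output."""
    results: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            game_modes = results.setdefault(match["game"], {})
            game_modes[match["mode"]] = float(match["speed"])
    return results


def display_overhead(speeds: Dict[str, float]) -> float:
    """Fraction of the no-video speed lost in the fastest displayed mode."""
    displayed = [v for mode, v in speeds.items() if mode != "none"]
    return 1 - max(displayed) / speeds["none"]


SAMPLE = """\
1942_none.log:Average speed: 571.93% (89 seconds)
1942_accel_auto.log:Average speed: 300.01% (89 seconds)
arabian_none.log:Average speed: 1671.79% (89 seconds)
arabian_accel_auto.log:Average speed: 314.14% (89 seconds)
"""

if __name__ == "__main__":
    results = parse_results(SAMPLE)
    for game, speeds in results.items():
        print(f"{game}: display overhead {display_overhead(speeds):.1%}")
    ratio = results["arabian"]["none"] / results["1942"]["none"]
    print(f"no-video ratio arabian/1942: {ratio:.2f}")
```

The ratio of the two no-video speeds is where the "almost 3" figure comes from, while the displayed speeds for both games land close together around 300%.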
Is there any comment on what future improvements to VC4/V3D drivers will bring to these speeds? Or are we seeing internal bandwidth limits here?
-
@elvis said in Raspberry Pi OS 64bit version:
Definitely backs the choice of SDL without x11 as the preferred video output. Fairly similar upper bounds on each, despite no-video runs different by a scale of almost 3.
it's strange then the findings of @George in that thread - but maybe the retropie environment has an additional effect (again, we use a custom version of sdl)
Is there any comment on what future improvements to VC4/V3D drivers will bring to these speeds? Or are we seeing internal bandwidth limits here?
i would have thought the simple stuff like blitting to screen is about as good as it's ever going to be, but like i say, check out that thread - there may be a compounding issue for retropie. i'm currently building mame on my 64-bit x-less retropie install so will be able to test myself on some of these games. will be using ddp3 to try and recreate @George's findings.
(maybe we should continue this discussion there)