Raspberry Pi OS 64bit version
-
Some emulators could also run slower, although those may already be running at full speed. This is because they have 32-bit Arm NEON optimisations which can't be used on aarch64 without being rewritten.
Btw I saw the comment on the bugtracker. I've not yet gone through the experimental packages. I'm aware some stuff isn't working :-)
-
Clear…
i stopped my test as i found out rpi os 64 bit is too unstable. Had issues with the display driver. when i googled i found out i wasn't the only one
based on the performance gain on lr-mupen64plus-next it is a promising development though, so i hope rpi 64 bit os will be supported in the future..
for now: i will keep away from it until it is a more stable environment...
-
Have there been any updates for a 64 bit image?
-
@billymild Raspberry Pi OS 64-bit version is still in beta, so i wouldn't expect an image until it's at least official.
-
@dankcushions where is the beta version hosted?
-
@billymild For fun, you could try using the 64-bit kernel in your existing 32-bit RetroPie image.
You need to add arm_64bit=1 to your /boot/config.txt file and reboot. (There is more detail on this page, for example: how-to-make-your-raspberry-pi-4-faster-with-a-64-bit-kernel.) It's easy to comment out or remove the line when you're finished, and reboot back into the regular all-32-bit configuration.
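To confirm which combination is actually running after the reboot, a small check along these lines might help. It is only a sketch: describe_bitness is a made-up helper name, and it assumes uname and getconf are available, as they normally are on Raspberry Pi OS. With the 64-bit kernel and the 32-bit RetroPie userland it would be expected to report a 64-bit kernel alongside a 32-bit userland.

```shell
#!/bin/sh
# Sketch: report kernel and userland word size side by side.
# describe_bitness is a hypothetical helper, not part of RetroPie.
describe_bitness() {
    # $1 = kernel machine name (defaults to `uname -m`)
    # $2 = userland word size  (defaults to `getconf LONG_BIT`)
    kernel="${1:-$(uname -m)}"
    userland="${2:-$(getconf LONG_BIT)}"
    case "$kernel" in
        aarch64|arm64)  kbits="64-bit" ;;
        armv6l|armv7l)  kbits="32-bit" ;;
        *)              kbits="unknown width" ;;
    esac
    echo "kernel: $kernel ($kbits), userland: ${userland}-bit"
}
```

Calling describe_bitness with no arguments reads the values from the running system; passing them explicitly (for example describe_bitness aarch64 32) just formats the given values.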
In a really quick test I found that the 64-bit kernel reduces audio glitching in lr-mupen64plus-next. (And probably the better mupen emulators too, like mupen64-gliden64).
Unfortunately running the 64-bit kernel with the 32-bit RetroPie also causes lr-genesis_plus_gx to crash soon after starting any game I tried. (From memory, perhaps because the 64-bit kernel is a bit stricter about how code should be organised in memory).
In the full 64-bit kernel+OS both of those emulators are stable (but then lr-parallel_n64 won't start for me). The full 64-bit RetroPie also has a big chunk of packages not yet available, but it is safe enough to try on a spare partition or SD card.
You need to install the base Raspberry Pi OS 64-bit beta, update it, and then install RetroPie from source.
-
@busywait that won't work properly - the userland will be 32-bit still.
-
@dankcushions said in Raspberry Pi OS 64bit version:
@busywait that won't work properly - the userland will be 32-bit still.
Yes, 32-bit userland still used.
I think that the reason that I did see a difference depending on the arm_64bit setting (32 bit kernel vs 64-bit kernel) could be because of the way that data is copied in the HDMI driver. I'm using the full KMS and HDMI/arm-side audio driver if that makes any difference.
Edit: yes, I know this is an unsupported configuration for RetroPie; no, I don't use it regularly, I was just interested to see what happened.
-
64-bit will only lead to a substantial gain in performance when all components of Retropie(Cores, Retroarch etc.) are properly rewritten in 64-bit.
Expect this to take another few years.
-
@molokkoplus the majority of default cores, retroarch, etc already compile in 64-bit via RetroPie, but you will need the 64-bit userland for that to happen (ie, the raspberry pi OS 64-bit beta version). but retropie doesn't officially support this, yet, so it's at your own risk...
PS, not an obvious gain in performance, to me (but then very little is CPU-bound on retropie for pi 4 anyway).
-
@molokkoplus said in Raspberry Pi OS 64bit version:
64-bit will only lead to a substantial gain in performance when all components of Retropie(Cores, Retroarch etc.) are properly rewritten in 64-bit.
Expect this to take another few years.
@dankcushions said in Raspberry Pi OS 64bit version:
PS, not an obvious gain in performance, to me (but then very little is CPU-bound on retropie for pi 4 anyway).
I run my RetroPie “test build” exclusively on 64-bit RPiOS with the KMS video and audio drivers and have had a lot of stability with it.
Building the cores/emulators in 64-bit is fairly trivial as most of them already support other arm64/aarch64 platforms.
The main issue with 64-bit RPiOS is the lack of hardware acceleration for video decoding but that isn’t much of a problem for RetroPie and this feature will be implemented in the near term at Pi Towers.
@dankcushions is 100% right, the real hw limitation on the Pi4 is the GPU. Only the first GB of the Pi4’s memory is available to the GPU, through CMA, so when you start to crank up the resolution on 3D emulators the GPU will become the bottleneck. The Vulkan driver helps but it can only do so much until the 1GB limitation rears its ugly head.
The Vulkan driver is getting optimization improvements every few weeks but the only way to get that driver is by building it from the Mesa GitLab. I suspect that the user base won’t see the driver in apt until the next Debian major release, “Bullseye”, whenever that happens.
-
@bluestang most pi4 are sold with 4gb of memory. Why is there a 1 GB bottleneck?
-
@billymild it's a 1 GB access limitation for the GPU. the CPU has access to the full memory.
i'm not convinced that any viable emulation task would ever get near 1GB GPU memory, though. i believe the internal bandwidth of the pi and various bottlenecks in the GPU would slow framerates to a crawl at the kinds of tasks that need 1GB VRAM, regardless.
realtime VRAM usage used to be easy enough to measure:
sudo vcdbg reloc
- presume this doesn't work correctly on pi4, though.
-
Specific to MAME, there are measurable improvements comparing 32bit to 64bit.
I benchmarked 650 games in MAME on an RPi3B+ and an RPi4B, and compared stock speeds, overclocked, 32bit and 64bit (full kernel+os+binary). These are tested on Raspbian Buster (based on Debian 10 Buster, GCC8.3).
Write up and results are here:
https://stickfreaks.com/misc/raspberry-pi-mame-benchmarks
Within that link you can find the raw CSV data, as well as a Google Sheet link with some silly stats I did. Across all 650 games there's an average jump of around 18% on an RPi4, however there are individual games that see well over 50% speedups (and a very small amount that go backwards). Overall it seems to be very positive.
I can't speak for other emulators or RetroArch cores, but I would assume similar results for those given the inherent nature of what emulation does. If anyone knows of a similar method of benchmarking other emulators, please let me know. (MAME makes this easy with a "-bench" command line flag, and the ability to automate the process and output the results to a text file for easy logging). I'm happy to repeat the tests on other emulators if there's an objective way to do so that doesn't involve manually recording frame rates from a GUI overlay.
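For anyone wanting to repeat this kind of run, the automation could look roughly like the sketch below. It is not the script used for the published results: MAME_BIN, BENCH_SECONDS, bench_game and summarise are illustrative names, and it assumes the log ends with an "Average speed: NN.NN% (N seconds)" line, which is the format MAME prints on exit.

```shell
#!/bin/sh
# Sketch: batch-run MAME's built-in benchmark and keep one log per game.
# MAME_BIN and BENCH_SECONDS are assumptions; adjust them for your setup.
MAME_BIN="${MAME_BIN:-mame}"
BENCH_SECONDS="${BENCH_SECONDS:-90}"

bench_game() {
    # $1 = MAME short name, e.g. "1942"; all output goes to <name>.log
    "$MAME_BIN" "$1" -bench "$BENCH_SECONDS" > "$1.log" 2>&1
}

summarise() {
    # Print "<game> <speed%>" for every log containing an average speed line
    for log in "$@"; do
        awk -v game="${log%.log}" '/^Average speed:/ { print game, $3 }' "$log"
    done
}
```

A typical use would be a loop such as `for g in 1942 arabian; do bench_game "$g"; done` followed by `summarise *.log`, with the game list swapped for whatever set is being tested.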
Similar results happened with MAME on x86 when it moved to 64bit. Discussion from back in 2007 here: http://forum.arcadecontrols.com/index.php?topic=74600.0
-
@elvis said in Raspberry Pi OS 64bit version:
Specific to MAME, there are measurable improvements comparing 32bit to 64bit.
I benchmarked 650 games in MAME on an RPi3B+ and an RPi4B, and compared stock speeds, overclocked, 32bit and 64bit (full kernel+os+binary). These are tested on Raspbian Buster (based on Debian 10 Buster, GCC8.3).
Write up and results are here:
https://stickfreaks.com/misc/raspberry-pi-mame-benchmarks
Within that link you can find the raw CSV data, as well as a Google Sheet link with some silly stats I did. Across all 650 games there's an average jump of around 18% on an RPi4, however there are individual games that see well over 50% speedups (and a very small amount that go backwards). Overall it seems to be very positive.
nice! did you run within the desktop, though? because MAME on pi4 via x is allegedly majorly faster to the point i'm not really sure there's any value in any other type of optimization, at least via retropie (which is not run within X), at least, until we've investigated this further (via kms vs fkms, later sdl2 versions, etc).
I can't speak for other emulators or RetroArch cores, but I would assume similar results for those given the inherent nature of what emulation does.
i'm afraid very few of the poorly performing emulators on pi are cpu bound. it's true that emulation is traditionally cpu-intensive, but most emulators in retropie on pi have dynarecs and/or accelerated graphics (leveraging gpu) that make the cpu requirements often quite low, and it's more often the GPU or system bandwidth that is the bottleneck (the latter may be improved in 64-bit, mind). mame is an exception to this as they don't typically use dynarecs or accelerated 3d graphics, opting for full-fat cpu emulation.
that said, it could be useful for those who want faster than fullspeed performance in 2d emulators, for fast-forward functions, etc.
If anyone knows of a similar method of benchmarking other emulators, please let me know. (MAME makes this easy with a "-bench" command line flag, and the ability to automate the process and output the results to a text file for easy logging). I'm happy to repeat the tests on other emulators if there's an objective way to do so that doesn't involve manually recording frame rates from a GUI overlay.
it's possible to harvest benchmarking data from retroarch cores via verbose logging. an unfinished POC: https://github.com/dankcushions/retropie-auto-testing
-
@dankcushions said in Raspberry Pi OS 64bit version:
nice! did you run within the desktop, though? because MAME on pi4 via x is allegedly majorly faster to the point i'm not really sure there's any value in any other type of optimization, at least via retropie (which is not run within X), at least, until we've investigated this further (via kms vs fkms, later sdl2 versions, etc).
I used the MAME supplied "-bench" flag. It renders video and audio internally, but doesn't send them to the GPU/screen/sound hardware. This is useful in that you see the true "raw" performance of MAME, free from any display technology or driver issues.
I'm happy to retest games under actual display modes if required. MAME supports a number of screen display and rendering modes (including a number of legacy lossy-compressed framebuffer modes like YUV420, which were brilliant back in the day of poor 2D performance in Linux), but also your usual candidates like OpenGL, SDL, X11 etc. That will also answer the question of exactly how much overhead exists between the upper bounds of CPU performance, and what overhead actually spitting out a picture introduces.
I'm currently running the overclocked Pi4 64bit tests again under Debian Bullseye (merely dist-upgrading Raspbian from Buster) and a newly compiled MAME via GCC10 (up from GCC8 on Buster). If you have specific video outputs you want me to test, let me know (I'll fetch a complete list and post it when I'm back at a PC).
it's possible to harvest benchmarking data from retroarch cores via verbose logging. an unfinished POC: https://github.com/dankcushions/retropie-auto-testing
Brilliant, cheers, I'll check that out. My overall goal was less around comparing 32bit to 64bit, and more just trying to come up with an objective yes/no answer to the "will a Pi run my favourite game/system?" question. So even with all the dynarec points acknowledged, this will help me with my original goal.
-
I used the MAME supplied "-bench" flag. It renders video and audio internally, but doesn't send them to the GPU/screen/sound hardware. This is useful in that you see the true "raw" performance of MAME, free from any display technology or driver issues.
unfortunately the player doesn't have that luxury ;)
If you have specific video outputs you want me to test, let me know (I'll fetch a complete list and post it when I'm back at a PC).
the thread i linked uses ddp3 as an example. we use sdl, and no x11. we did bench a variety of different display modes when creating the build script and that seemed to be the fastest, outside of x11.
Brilliant, cheers, I'll check that out
great! feel free to iterate on what i've done. i would love to see some kind of automated testing in retropie.
-
@dankcushions said in Raspberry Pi OS 64bit version:
@elvis
unfortunately the player doesn't have that luxury ;)
Totally understood. I was originally testing outside of the scope of RetroPie (I run an RPi4 as a desktop, using MAME, ScummVM and Box86 for various emulation and game research tasks). So that number interested me without the variability of display drivers, scaling, etc. Plus then I can also pretty objectively compare to John IV's long running published benchmarks that do the same.
the thread i linked uses ddp3 as an example. we use sdl, and no x11. we did bench a variety of different display modes when creating the build script and that seemed to be the fastest, outside of x11.
Ok cool, I'll take a read through that tomorrow and catch up on it all. I'd considered testing at least a portion of the really fast games under different display outputs, at least to see how big that delta was. Maybe it's worth running the whole lot through all the display modes?
The Pi4 in particular, with its newer GPU and drivers, seems to have potential for improvement. I have a feeling those results will change as driver and Mesa improvements continue.
I'm starting to feel like I could just run these benchmarks on loop forever. Maybe I need another couple of RPis :)
-
Quick and dirty test on what's on my RPi4 (CPU 2.0GHz, GPU 600MHz/128MB) now. I'm in a different physical location to the Pi, so I can't change the OS at the moment. Currently running 64bit (aarch64) Debian 11 Bullseye with the Raspbian kernel 5.10.17-v8 (current setup was Raspbian Buster, dist-upgraded to Bullseye). MAME 0.230 is compiled with Bullseye's GCC10.
I ran 1942 and arabian with common flags "-verbose -seconds_to_run 90 -nothrottle -nowaitvsync", and then combinations of video, video drivers, and video scale modes appropriate to each output type. Some failed to run (accel_directfb, soft_yuy2 and soft_yuy2x2 reported as not supported).
Currently attached to a monitor set to 1024x768 resolution.
bgfx_backend is set to opengl, and bgfx_screen_chains set to unfiltered in the config file.
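A sweep like this could be scripted roughly as follows. This is a simplified sketch rather than the exact commands used: it only varies the top-level -video setting (soft, accel, opengl, bgfx, none) and leaves out the per-mode driver and scale options, and MAME_BIN and sweep_video are illustrative names.

```shell
#!/bin/sh
# Sketch: run one game once per -video backend and keep a log for each run.
# MAME_BIN is an assumption; the common flags are the ones quoted in the post.
MAME_BIN="${MAME_BIN:-mame}"
COMMON_FLAGS="-verbose -seconds_to_run 90 -nothrottle -nowaitvsync"
VIDEO_MODES="soft accel opengl bgfx none"

sweep_video() {
    # $1 = MAME short name; writes <name>_<mode>.log for every mode.
    # A mode that fails to start still leaves a log, so the loop carries on.
    for mode in $VIDEO_MODES; do
        # COMMON_FLAGS is deliberately unquoted so it splits into separate arguments
        "$MAME_BIN" "$1" $COMMON_FLAGS -video "$mode" > "${1}_${mode}.log" 2>&1
    done
}
```

Running `sweep_video 1942` and then `sweep_video arabian` would leave a set of logs that can be compared with grep and sort.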
All modes tested (just for reference to compare to what actually produced a useful log):
$ ls -1d 1942*log
1942_accel_auto.log
1942_accel_directfb.log
1942_accel_x11.log
1942_bgfx.log
1942_none.log
1942_opengl.log
1942_soft_hwbest.log
1942_soft_hwblit.log
1942_soft_none.log
1942_soft_yuy2.log
1942_soft_yuy2x2.log
1942_soft_yv12.log
1942_soft_yv12x2.log
And the results that returned a speed, sorted from best to worst, are:
$ grep ^Aver 1942*log | sort -k3 -gr
1942_none.log:Average speed: 571.93% (89 seconds)
1942_accel_auto.log:Average speed: 300.01% (89 seconds)
1942_accel_x11.log:Average speed: 297.49% (89 seconds)
1942_soft_hwbest.log:Average speed: 295.65% (89 seconds)
1942_soft_hwblit.log:Average speed: 289.79% (89 seconds)
1942_opengl.log:Average speed: 271.81% (89 seconds)
1942_soft_yv12.log:Average speed: 271.39% (89 seconds)
1942_soft_yv12x2.log:Average speed: 265.13% (89 seconds)
1942_bgfx.log:Average speed: 223.18% (89 seconds)
1942_soft_none.log:Average speed: 137.64% (89 seconds)
$ grep ^Aver arabian*log | sort -k3 -gr
arabian_none.log:Average speed: 1671.79% (89 seconds)
arabian_accel_auto.log:Average speed: 314.14% (89 seconds)
arabian_accel_x11.log:Average speed: 312.98% (89 seconds)
arabian_opengl.log:Average speed: 305.48% (89 seconds)
arabian_soft_hwbest.log:Average speed: 304.67% (89 seconds)
arabian_soft_hwblit.log:Average speed: 303.42% (89 seconds)
arabian_soft_yv12.log:Average speed: 268.34% (89 seconds)
arabian_soft_yv12x2.log:Average speed: 260.45% (89 seconds)
arabian_bgfx.log:Average speed: 219.92% (89 seconds)
arabian_soft_none.log:Average speed: 151.68% (89 seconds)
Definitely backs the choice of SDL without x11 as the preferred video output. The upper bounds are fairly similar for both games, despite the no-video runs differing by a factor of almost 3.
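To put a number on the display overhead, each result can be expressed as a percentage of the no-video run for the same game. The helper below is a rough sketch under the assumption that the input is the grep output in the format shown above; relative_to_none is a made-up name.

```shell
#!/bin/sh
# Sketch: rescale "Average speed" lines so the <game>_none run counts as 100%.
# Expects lines such as "1942_none.log:Average speed: 571.93% (89 seconds)".
relative_to_none() {
    awk '
        {
            split($1, part, "[.]log:")   # part[1] is e.g. "1942_none"
            name = part[1]
            speed[name] = $3 + 0         # "571.93%" converts to 571.93
            order[++n] = name
            # the no-video run ends in _none but is not the soft_none scale mode
            if (name ~ /_none$/ && name !~ /_soft_none$/) base = speed[name]
        }
        END {
            if (base == 0) exit 1        # no no-video run found in the input
            for (i = 1; i <= n; i++)
                printf "%s %.0f%%\n", order[i], 100 * speed[order[i]] / base
        }
    '
}
```

For example, `grep ^Aver 1942*log | relative_to_none` would show 1942_accel_auto at about 52% of the no-video speed (300.01 / 571.93).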
Is there any comment on what future improvements to VC4/V3D drivers will bring to these speeds? Or are we seeing internal bandwidth limits here?