Overclocking the Pi3b+ GPU (Results)
-
@dankcushions said in Overclocking the Pi3b+ GPU (Results):
it would be useful to see some specific examples (games, benchmarks, etc) as i don't really get why ondemand (which i think is the default) would be slower than performance.
since ondemand ramps up the speed with load. i would have thought there should be no difference between the runtime cpu frequency between governor in cpu-heavy applications. they both should be running the cpu at full speed in a cpu-limited emulator, right?
It might be mainly a problem if you decrease buffering, for example by setting video_max_swapchain_images=2. I believe the issue is caused by the ondemand CPU governor not being able to handle the spiky CPU load. The CPU will emulate one frame and then push it to the GPU. While the GPU waits for a frame flip, the CPU will more or less idle, before kicking off emulation of the next frame. My guess is that the governor spins down the CPU and loses too much time when spinning it back up again during the next frame.
I would say using the performance governor as default for the run command would be safe. The user should expect (and want) the CPU to be in the high performance state anyway when running an emulator (and have the necessary cooling in place). The fact that the CPU may not always hit or stay at max frequency is the actual unexpected part here.
EDIT: On second thought, I guess the reduced buffering just makes the issue more likely to crop up. The unwanted CPU frequency reduction probably happens all the time at default settings as well, it’s just that there’s an additional frame buffered that will mostly cover the performance drop and prevent frame rate hitches.
@quicksilver said in Overclocking the Pi3b+ GPU (Results):
@hhromic the interesting thing is the current model pis can't overvolt any higher than a value of 4. At over_voltage=4 the core voltage equals 1.394v and it will not increase any higher than that. Values of 6-8 still only equal 1.394v (as @Rascas noted earlier). I think the stock core voltage is set higher on current model pis. I am sure that force_turbo would set the warranty bit but does over_voltage=4 now also set it? Official rpi documents are vague about this.
My Pi 3 B+ actually hits 1.39V already at over_voltage=1. That’s actually pretty high on 40nm, so I wouldn’t want to push it more. I guess the A53 really isn’t made to cope with high frequencies... 1.5GHz on 40nm and 1.4V is pretty abysmal.
-
Most likely safe, but I prefer end users to make decisions like this. Also not everything would benefit - maybe some things that are launched you want to reduce clock if not in use. Frotz? Kodi? No doubt things will run hotter and consume more power. And in many cases it would be a waste. What about handhelds?
It's not going to be changed :-)
-
@Brunnis the ondemand governor is not so primitive for switching speed.
While it is true that the CPU idles more with these emulators that use the GPU, that idle time is very short and the governor won't be micro-switching the frequency that fast. In particular, there is a setting for how often the governor monitors the load to make adjustments:
* sampling_rate: Measured in uS (10^-6 seconds), this is how often you want the kernel to look at the CPU usage and to make decisions on what to do about the frequency. Typically this is set to values of around '10000' or more. Its default value is (cmp. with users-guide.txt): transition_latency * 1000. Be aware that transition latency is in ns and sampling_rate is in us, so you get the same sysfs value by default. Sampling rate should always get adjusted considering the transition latency. To set the sampling rate 750 times as high as the transition latency in the bash (as said, 1000 is default), do:
$ echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate
And there are many other settings to control the rather advanced ondemand governor :)
Ref: https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt
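For reference, the arithmetic behind that one-liner can be sketched in Python. This is only an illustration: the function name and the example latency value are made up, and the real transition latency should be read from cpuinfo_transition_latency on the Pi rather than assumed.

```python
def sampling_rate_us(transition_latency_ns: int, factor: int = 1000) -> int:
    """Return an ondemand sampling_rate in microseconds.

    Mirrors the documented rule:
        sampling_rate (us) = transition_latency (ns) * factor / 1000
    where factor=1000 is the documented default.
    """
    if transition_latency_ns < 0 or factor <= 0:
        raise ValueError("latency must be >= 0 and factor must be > 0")
    # Integer division matches the shell arithmetic $(( ... )).
    return transition_latency_ns * factor // 1000


# Hypothetical latency value, chosen only for illustration.
latency_ns = 10_000
print(sampling_rate_us(latency_ns))       # default factor of 1000
print(sampling_rate_us(latency_ns, 750))  # the 750x example from the docs
```

With the default factor the result has the same numeric value as the latency, which is the "same sysfs value by default" remark in the docs.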
-
@BuZz said in Overclocking the Pi3b+ GPU (Results):
Most likely safe, but I prefer end users to make decisions like this. Also not everything would benefit - maybe some things that are launched you want to reduce clock if not in use. Frotz? Kodi? No doubt things will run hotter and consume more power. And in many cases it would be a waste. What about handhelds?
It's not going to be changed :-)
Yep, for Kodi and the likes it’s definitely not a good idea to use the performance governor. For handhelds, performance governor would still be the way to go for emulation, since it’s still the predictable and stable mode. Any battery life issues should be handled by adjusting frequencies instead.
I definitely understand your stance, though. Development is full of compromises and I’m happy with just changing the setting via the run command menu.
-
@hhromic said in Overclocking the Pi3b+ GPU (Results):
@Brunnis the ondemand governor is not so primitive for switching speed.
While it is true that the CPU idles more with these emulators that use the GPU, that idle time is very short and the governor won't be micro-switching the frequency that fast. In particular, there is a setting for how often the governor monitors the load to make adjustments:
* sampling_rate: Measured in uS (10^-6 seconds), this is how often you want the kernel to look at the CPU usage and to make decisions on what to do about the frequency. Typically this is set to values of around '10000' or more. Its default value is (cmp. with users-guide.txt): transition_latency * 1000. Be aware that transition latency is in ns and sampling_rate is in us, so you get the same sysfs value by default. Sampling rate should always get adjusted considering the transition latency. To set the sampling rate 750 times as high as the transition latency in the bash (as said, 1000 is default), do:
$ echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate
And there are many other settings to control the rather advanced ondemand governor :)
Ref: https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt
Thanks for the info. If it’s not the idle time that’s causing the issue, it’s something about the code itself being executed that fools the governor into thinking it’s okay to downclock. I believe I once saw it mentioned that the code used to emulate SuperFX games on the SNES could cause this issue. No idea if there’s any truth to it, though. I believe I’ve only seen the issue in SuperFX games, but at the same time they’re usually the most demanding ones I emulate, so they’d naturally be the first ones to have issues if margins are slim.
EDIT:
This section is pretty interesting (from the link you posted):
* up_threshold: This defines what the average CPU usage between the samplings of 'sampling_rate' needs to be for the kernel to make a decision on whether it should increase the frequency. For example when it is set to its default value of '95' it means that between the checking intervals the CPU needs to be on average more than 95% in use to then decide that the CPU frequency needs to be increased.
In my own tests, I've noticed that the Pi 3 may need as much as 5 ms between pushing a 1080p frame to the GPU and the frame flip occurring. In that time, unless there are additional frame buffers to render to, the CPU is mostly idle. That gives an average CPU usage of ~70%, which would not be enough to stay at the highest frequency.
EDIT: Actually, the above talks about what's needed to initially increase clocks... There's also this:
* sampling_down_factor: This parameter controls the rate at which the kernel makes a decision on when to decrease the frequency while running at top speed. When set to 1 (the default) decisions to reevaluate load are made at the same interval regardless of current clock speed. But when set to greater than 1 (e.g. 100) it acts as a multiplier for the scheduling interval for reevaluating load when the CPU is at its top speed due to high load. This improves performance by reducing the overhead of load evaluation and helping the CPU stay at its top speed when truly busy, rather than shifting back and forth in speed. This tunable has no effect on behavior at lower speeds/lower CPU loads.
It's not completely clear, but I'm guessing that, by default, decisions about down clocking are made using the same sampling period and load evaluation as when increasing the clocks. Anyone know for sure?
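The ~70% figure can be checked with a quick calculation. This is a rough sketch that assumes a 60 fps frame period and that the CPU is fully busy for the rest of the frame; the 5 ms wait and the up_threshold of 95 are the values mentioned above, and the function name is made up.

```python
def average_load_pct(frame_period_ms: float, idle_ms: float) -> float:
    """Average CPU load over one frame if the CPU idles for idle_ms of it."""
    if not 0 <= idle_ms <= frame_period_ms:
        raise ValueError("idle_ms must be between 0 and frame_period_ms")
    return 100.0 * (frame_period_ms - idle_ms) / frame_period_ms


FRAME_PERIOD_MS = 1000.0 / 60.0  # assumption: 60 fps output
IDLE_MS = 5.0                    # wait between frame push and frame flip
UP_THRESHOLD = 95.0              # documented ondemand default

load = average_load_pct(FRAME_PERIOD_MS, IDLE_MS)
print(f"average load: {load:.1f}%")
print("above up_threshold" if load > UP_THRESHOLD else "below up_threshold")
```

Under these assumptions the averaged load sits well below the threshold the docs give for increasing the clock.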
-
24h test complete. No issues found (Quake 3 + memtester 512 + sysbench (2 threads)) at the following settings:
arm_freq=1475 core_freq=600 v3d_freq=400 sdram_freq=550 over_voltage=1 temp_soft_limit=70
-
@Brunnis said in Overclocking the Pi3b+ GPU (Results):
24h test complete. No issues found (Quake 3 + memtester 512 + sysbench (2 threads)) at the following settings:
arm_freq=1475 core_freq=600 v3d_freq=400 sdram_freq=550 over_voltage=1 temp_soft_limit=70
i have played a little with overclocking on 3B+... I have no issues with temperature and games play well without any trouble
However, I have issues with compiling (updating from source) with an overclocked 3B+. updating mame2003-plus almost always freezes or stops with errors... Any idea about that? (I don't know my exact settings but I had this issue with many settings found around the internet, even for moderate overclocking)
Edit: I also did a sysbench stress test without issues
-
@robertvb83 said in Overclocking the Pi3b+ GPU (Results):
updating mame2003-plus almost always freezes or stops with errors..
What kind of errors? If they're memory-related errors (not enough memory), then you can increase the amount of swap added during compilation to get over those issues. Do you get the same kind of errors without overclocking?
-
@BuZz @hhromic To expand on the discussion regarding CPU governor and lower than expected performance: I've been watching the output of the 'top' command now, while running some SNES loads and below are some results. "Tweaked video settings" below means:
video_driver="dispmanx" video_threaded="false" video_max_swapchain_images=2
For the ondemand CPU governor tests above, I also ran a script that read actual CPU frequency every second. Turns out the ondemand CPU governor leads to frequent downclocking (to 600 MHz) in all test cases (whether running Super Mario World or Super Mario World 2 and whether using default or tweaked video settings). Here are the printouts:
Test 2: Governor ondemand (tweaked video settings) - SMW
Test 2: Governor ondemand (tweaked video settings) - SMW2
Test 4: Governor ondemand (default video settings) - SMW
Test 4: Governor ondemand (default video settings) - SMW2
So, to conclude, it doesn't look like the ondemand CPU governor handles this in an optimal way. The constant ping-ponging of the CPU frequency (even with default RetroPie settings) is hardly ideal and may lead to performance issues in some cases. For most situations, the additional frame buffering used on a default installation seems to mask the impact of the reduced CPU frequency. Removing that buffering (i.e. using video_max_swapchain_images=2) reveals the issue in an obvious way with stuttering performance in demanding situations (such as SMW2).
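For anyone who wants to repeat the frequency logging, here is a minimal sketch of a once-per-second reader. It is not the script used for the tests above; it assumes the standard cpufreq sysfs file scaling_cur_freq is available for cpu0, and the function names are made up.

```python
import time
from pathlib import Path

# Standard cpufreq sysfs file; the value is reported in kHz.
FREQ_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")


def read_freq_mhz(path: Path = FREQ_PATH) -> float:
    """Read the current CPU frequency from sysfs and convert kHz to MHz."""
    return int(path.read_text().strip()) / 1000.0


def log_frequency(samples: int, interval_s: float = 1.0,
                  path: Path = FREQ_PATH) -> list:
    """Take `samples` readings, waiting `interval_s` seconds after each one."""
    readings = []
    for _ in range(samples):
        readings.append(read_freq_mhz(path))
        time.sleep(interval_s)
    return readings
```

Running print(log_frequency(samples=10)) over SSH while a game is playing gives ten readings in MHz. A one-second interval is far coarser than the governor's sampling period, so it only shows that downclocking happens, not how often.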
-
@Brunnis what's your 'Est. single CPU load (%)' column about? with video_threaded="false" retroarch should be entirely operating on one core. even with video_threaded="true" the threaded video tasks are very minor.
-
@dankcushions
That's just converting top's CPU load (which is for all four cores) to the estimated resulting single core load. So: ("Total CPU load" / 25) * 100 gives you the value in the "Est. single CPU load" column.
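As a small sketch of that conversion (the function name is made up, and it assumes the whole load runs on one of the Pi's four cores):

```python
def est_single_core_load(total_cpu_pct: float, cores: int = 4) -> float:
    """Scale top's combined %Cpu(s) figure to a single core.

    Assumes the whole load runs on one core, so a combined 25% on a
    four-core Pi corresponds to one core being 100% busy.
    Equivalent to (total / 25) * 100 when cores == 4.
    """
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return total_cpu_pct * cores


print(est_single_core_load(20.0))  # 20% combined load on 4 cores
```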
-
@Brunnis said in Overclocking the Pi3b+ GPU (Results):
@dankcushions
That's just converting top's CPU load (which is for all four cores) to the estimated resulting single core load. So: ("Total CPU load" / 25) * 100 gives you the value in the "Est. single CPU load" column.
actually top's percentage is cumulative. eg, 100% load on 4 cores would appear on top as 400%
that said, these emulators are not threaded so they won't be using the other cores, so top's total load will be - or very close to - the load on one core (some OS tasks might be working on other cores)
if you press 1 within top you get a % per core - https://unix.stackexchange.com/a/146090
-
@dankcushions said in Overclocking the Pi3b+ GPU (Results):
@Brunnis said in Overclocking the Pi3b+ GPU (Results):
@dankcushions
That's just converting top's CPU load (which is for all four cores) to the estimated resulting single core load. So: ("Total CPU load" / 25) * 100 gives you the value in the "Est. single CPU load" column.
actually top's percentage is cumulative. eg, 100% load on 4 cores would appear on top as 400%
that said, these emulators are not threaded so they won't be using the other cores, so top's total load will be - or very close to - the load on one core (some OS tasks might be working on other cores)
if you press 1 within top you get a % per core - https://unix.stackexchange.com/a/146090
The %Cpu(s) value at the top (which is what I looked at, should have just looked at RetroArch in the process list below instead) is not cumulative unless you press 1. So, unless you press 1, a full load on all four cores will show as a combined value of 100. But thanks for the tip about pressing 1. Didn't know that!
I'll see if I can update the figures with slightly more accurate ones anyway.
-
I just updated the chart to be a bit more clear on what it's showing.
-
@Brunnis yeah i couldn't quite work out why you were "estimating" them but that checks out :)
i guess i still don't see a smoking gun with the figures being given, especially when the issue is only apparent using video settings where stutter is a known risk under cpu load situations. however if it's a binary thing to your eyes where the stutter is eliminated once the performance governor is set, i guess that is all that needs to be said.
this seems like a perfect test case for my benchmarking script that i never got back to :) https://github.com/dankcushions/retropie-auto-testing/blob/master/retropie-auto-testing.sh
-
@dankcushions said in Overclocking the Pi3b+ GPU (Results):
i guess i still don't see a smoking gun with the figures being given, especially when the issue is only apparent using video settings where stutter is a known risk under cpu load situations. however if it's a binary thing to your eyes where the stutter is eliminated once the performance governor is set, i guess that is all that needs to be said.
Well, in this case the stuttering does not occur because the CPU isn't fast enough, but because the ondemand governor is not able to determine that the CPU should stay at max frequency. That's a pretty big difference in my eyes.
The figures I posted above show us that the ondemand governor doesn't work as we'd expect and that the resulting performance issue is simply masked by buffering with the default settings. With this testing alone, I can't say for sure that it doesn't affect some marginal games even at default settings. It's certainly possible that only video_max_swapchain_images=2 exposes it. In that case, it would of course be okay to leave the governor at the current default.
I didn't post this to press for a change of default governor (since BuZz has already said it won't happen). However, I thought the figures were interesting, since the frequency rollercoaster behavior at default settings didn't seem to be common knowledge.
-
@mitu said in Overclocking the Pi3b+ GPU (Results):
@robertvb83 said in Overclocking the Pi3b+ GPU (Results):
updating mame2003-plus almost always freezes or stops with errors..
What kind of errors? If they're memory-related errors (not enough memory), then you can increase the amount of swap added during compilation to get over those issues. Do you get the same kind of errors without overclocking?
I did not have any errors without overclocking!
this is where compiling ends when overclocked:
-
@Brunnis said in Overclocking the Pi3b+ GPU (Results):
@dankcushions said in Overclocking the Pi3b+ GPU (Results):
i guess i still don't see a smoking gun with the figures being given, especially when the issue is only apparent using video settings where stutter is a known risk under cpu load situations. however if it's a binary thing to your eyes where the stutter is eliminated once the performance governor is set, i guess that is all that needs to be said.
Well, in this case the stuttering does not occur because the CPU isn't fast enough, but because the ondemand governor is not able to determine that the CPU should stay at max frequency. That's a pretty big difference in my eyes.
The figures I posted above show us that the ondemand governor doesn't work as we'd expect and that the resulting performance issue is simply masked by buffering with the default settings.
forgive me, but i don't think they necessarily show that. they only show that the governor has decided the CPU should be downclocked (to 600) during some tests. we know the emulators in question do not exert a constant load on the cpu (~90% usage). from the earlier information, it looks like the governor should be checking CPU load and making this decision every 0.01 of a second (sampling_rate defaults to 10000 usecs?), so given that fidelity i am now not surprised that you will see it downclocking every so often. it's probably changing the core speed 100s of times a second.
the performance issue you observe must be caused by this process, agreed, but that stutter is not specifically measured in the above data, if you get what i mean. we need an fps benchmark for that.
anyway, it seems to me like a good fix might be to increase the sampling_rate to something north of a frame. eg over 16667 usec (one frame at 60fps). that way, applications that are generally low load will still downclock, but cpu-heavy emulators will stay full speed. i don't know if that's a good idea, just my initial thought.
-
@robertvb83 said in Overclocking the Pi3b+ GPU (Results):
I did not have any errors without overclocking!
this is where compiling ends when overclocked:
remember that retropie compiles use 2 of the 4 cores, but emulation mostly uses 1 core, so an overclock that is stable in games can definitely be unstable in compiles.
-
@dankcushions said in Overclocking the Pi3b+ GPU (Results):
forgive me, but i don't think they necessarily show that. they only show that the governor has decided the CPU should be downclocked (to 600) during some tests. we know the emulators in question do not exert a constant load on the cpu (~90% usage). from the earlier information, it looks like the governor should be checking CPU load and making this decision every 0.01 of a second (sampling_rate defaults to 10000 usecs?), so given that fidelity i am now not surprised that you will see it downclocking every so often. it's probably changing the core speed 100s of times a second.
Yes, I think I may have expressed myself a bit unclearly. I agree that the governor probably just behaves according to spec. The "unexpected" part is that it affects the performance in a negative way in certain cases. Ideally, the "ondemand" governor should produce the same (or very close to the same) end result (i.e. performance) as the "performance" governor.
the performance issue you observe must be caused by this process, agreed, but that stutter is not specifically measured in the above data, if you get what i mean. we need an fps benchmark for that.
Yeah, we can definitely measure the performance delta, but what would we do with the data? Would it affect the current discussion in any significant way? For an initial discussion on whether the ondemand governor can cope with the load without affecting the result, audio-visual cues are certainly sufficient. The regression in the end result is not exactly subtle.
anyway, it seems to me like a good fix might be to increase the sampling_rate to something north of a frame. eg over 16667 usec (one frame at 60fps). that way, applications that are generally low load will still downclock, but cpu-heavy emulators will stay full speed. i don't know if that's a good idea, just my initial thought.
They write sample rate in the docs, but I guess they mean period? Increasing the sampling period wouldn't really help. The average load over the period would then often be too low to clock up at all. We'd instead need a really small sample period, so that the downclocked core spins up as fast as possible when the load increases (i.e. the next frame rendering kicks off). The current issue is probably that the default of, say, 10 ms means that once the core is clocked down and the emulator kicks off again, you're spending up to 10 ms rendering the frame at 600 MHz before the governor checks the load and decides to clock back up. Then it's too late and you won't be able to submit the frame on time.
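To put rough numbers on that, here is a simplified model. It assumes a 60 fps budget, emulation work equal to ~70% of a frame at full clock, work that scales linearly with clock speed, a stock 3B+ maximum of 1400 MHz, and a governor that jumps straight back to maximum at its next sample. All of these are assumptions for illustration, not measurements, and the function name is made up.

```python
def frame_time_ms(work_ms_at_max: float, slow_ms: float,
                  min_mhz: float = 600.0, max_mhz: float = 1400.0) -> float:
    """Wall-clock time to finish one frame's worth of CPU work.

    work_ms_at_max: time the work would take at max_mhz.
    slow_ms: how long the CPU stays at min_mhz before the governor
             samples the load and clocks back up.
    """
    speed_ratio = min_mhz / max_mhz
    # Work completed (in max-clock milliseconds) during the slow phase.
    done_slow = slow_ms * speed_ratio
    if done_slow >= work_ms_at_max:
        # The frame finishes before the governor ever clocks up.
        return work_ms_at_max / speed_ratio
    return slow_ms + (work_ms_at_max - done_slow)


FRAME_BUDGET_MS = 1000.0 / 60.0   # assumption: 60 fps
WORK_MS = 0.7 * FRAME_BUDGET_MS   # assumption: ~70% load at full clock

for period_ms in (10.0, 1.0):
    total = frame_time_ms(WORK_MS, period_ms)
    verdict = "missed" if total > FRAME_BUDGET_MS else "on time"
    print(f"sampling period {period_ms:4.1f} ms -> {total:5.2f} ms ({verdict})")
```

In this model a 10 ms sampling period pushes the frame past the 16.7 ms budget, while a 1 ms period leaves it comfortably inside, which is the effect described above.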