2017-02-27

I posted in the main sticky about how the recent performance-improving thermal changes result in very high CPU temps, with readings over 100C reached easily and regularly.

I had been running a slightly custom thermal profile on 4.9 for a while, but it wasn't really fit for use in the wild. The recent HK changes gave me the encouragement to finally write a better solution to this problem.

The problem the HK patch tried (and failed) to address is that the 4.9 kernel was only reading one CPU thermal sensor (that of the first core). HK's solution was to read all the sensors, but within a single thermal zone. This is actually even worse: the sensors end up conflicting and cause unpredictable behaviour, which is what results in the CPUs going well over 100C. What should happen is that each sensor gets its own thermal zone. Done naively, though, this means huge code duplication and a large maintenance cost should any changes to the zones be required, as each trip point needs its own unique label per CPU core, which then has to be fed into the cooling maps. This can get pretty ugly, but the solution below should overcome all of these issues.
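To make the "unique label per trip per CPU core" requirement concrete, here is a minimal Python sketch of the label names that result when a zone number is appended to each trip label. It only models the resulting text, not the C preprocessor itself; the function names `uniqify` and `labels_for_zone` are made up for this illustration, while the trip labels are the ones used in the patch.

```python
# Trip labels as they appear in the patch's trip-point table.
TRIP_LABELS = [
    "cpu_alert0", "cpu_alert1", "cpu_alert2",
    "cpu_alert3", "cpu_alert4", "cpu_alert5",
    "cpu_criti0",
]


def uniqify(label, zone_num):
    """Model the effect of pasting the zone number onto a label (hypothetical helper)."""
    return f"{label}{zone_num}"


def labels_for_zone(zone_num):
    """Return the per-zone label for every trip in the shared table."""
    return [uniqify(label, zone_num) for label in TRIP_LABELS]


if __name__ == "__main__":
    # One shared table, four zones, no label collisions.
    for zone in range(4):
        print(f"cpu{zone}_thermal:", ", ".join(labels_for_zone(zone)))
```

Note that the zone number is appended directly, so trip 3 in zone 0 comes out as `cpu_alert30`. The names are unique but read a little oddly.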

Functionality improvements in the patch:

1. Each CPU core now has its own thermal zone and will correctly enable the fan, or throttle, as soon as any core breaches the appropriate trip point. It's no longer possible for a rogue core to run away thermally.
2. Trip points are spun out into a separate file, almost bringing back the file moon.linux created. This allows the unique labels per trip per CPU to be generated without duplicating the code. There's still just one table of trip points and cooling maps, so if any changes are needed it's very easy to manage and maintain.
3. Improvements to throttling behaviour. The A7s are not throttled at all until 95C, as they contribute almost nothing to the thermal load, so there is no sense in limiting them. The A15s are also throttled more slowly and progressively to improve thermal efficiency under very heavy loads.
4. Fan points tweaked just slightly. They are lowered by 5C vs the HK values to allow more room for throttling at the high end. At idle all XU3/4s should still remain passively cooled, with the fan only kicking in when the A15s need to do some work. The fan speeds for trip0 and trip1 have also been reduced slightly to further reduce noise, though it's really only a small improvement.
5. Hysteresis values improved. The large values were fine for the fan, to stop unwanted fast changes in noise, but on the passive cooling steps large values hinder performance. Lower values enable the cooling maps to reach an equilibrium performance level more easily.
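The hysteresis point can be illustrated with a small Python sketch. It assumes a simplified model in which a trip engages once the temperature reaches the trip point and releases only after it falls below trip minus hysteresis. The real governor is more involved, and the temperature trace is invented for the example. The 85000 trip and the 10000 and 3000 hysteresis values are taken from the patch.

```python
def trip_states(temps_mc, trip_mc, hysteresis_mc):
    """Return, per sample, whether the trip is considered active.

    Simplified model (an assumption, not the kernel's exact logic): the trip
    engages at trip_mc and releases below trip_mc - hysteresis_mc.
    """
    active = False
    states = []
    for temp in temps_mc:
        if temp >= trip_mc:
            active = True
        elif temp < trip_mc - hysteresis_mc:
            active = False
        states.append(active)
    return states


# Hypothetical trace in millicelsius: a load spike followed by cooling.
TRACE = [80000, 86000, 84000, 81000, 78000, 74000]

if __name__ == "__main__":
    print("hysteresis 10000:", trip_states(TRACE, 85000, 10000))
    print("hysteresis  3000:", trip_states(TRACE, 85000, 3000))
```

With the larger value the throttle step stays engaged until the core is below 75C. With 3000 it is released once the core drops under 82C, so the clocks come back sooner.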

The result is that no core can end up in a dangerous thermal position without prompting a response from the thermal driver, which is what was happening previously. Most decently (actively) cooled devices should settle between 85C and 90C under a heavy 8-thread load, and passively cooled devices might just hit 95C but shouldn't really go past that point. Performance is always maximised to what the hottest core will allow, ensuring safe temps at all times.
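As a rough sketch of why per-core zones close the gap, the following Python compares a zone that only reads the first core's sensor with per-core zones where the hottest core decides the response. The trip temperatures come from the patch's table. The sensor readings and function names are hypothetical, and hysteresis is ignored to keep the model small.

```python
# (label, temperature in millicelsius, type) from the patch's trip table.
TRIPS = [
    ("cpu_alert0", 70000, "active"),
    ("cpu_alert1", 75000, "active"),
    ("cpu_alert2", 80000, "active"),
    ("cpu_alert3", 85000, "passive"),
    ("cpu_alert4", 90000, "passive"),
    ("cpu_alert5", 95000, "passive"),
    ("cpu_criti0", 115000, "critical"),
]


def highest_trip_crossed(temp_mc):
    """Return the label of the highest trip at or below temp_mc, or None."""
    crossed = [label for label, trip_mc, _ in TRIPS if temp_mc >= trip_mc]
    return crossed[-1] if crossed else None


def single_sensor_response(core_temps_mc):
    """Model of a zone that only reads the first core's sensor."""
    return highest_trip_crossed(core_temps_mc[0])


def per_core_response(core_temps_mc):
    """Model of per-core zones: the hottest core drives the response."""
    return highest_trip_crossed(max(core_temps_mc))


# Hypothetical readings for the four A15 sensors, in millicelsius.
READINGS = [68000, 72000, 97000, 71000]

if __name__ == "__main__":
    print("single sensor:", single_sensor_response(READINGS))
    print("per core:     ", per_core_response(READINGS))
```

In this made-up case the third core is at 97C while the first is still under the lowest fan trip, so the single-sensor model does nothing while the per-core model has already crossed cpu_alert5.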

Further work will be required on the CPU operating points in due course, however, as the current CPU voltages hinder thermals and performance significantly because no stepping is used on the high frequencies. It's an area of low-hanging fruit to pick before a full release.

This is rather longer than my usual couple-of-line patches, so I'd welcome anyone taking a look to make sure I'm not doing something silly here. So far it seems to work perfectly in my testing.

Code:

From c056d9c5418b4fe88bc2e82c64ae22b573ba2d67 Mon Sep 17 00:00:00 2001
From: DarkBahamut <darkbahamut@gmail.com>
Date: Mon, 27 Feb 2017 23:25:42 +0000
Subject: [PATCH] arm: dts: Enable per cpu thermal trips

1. Each A15 cores thermal sensor now correctly used and will trigger the fan or passive throttling as required.
2. Separate file used for trip points to allow unique labels per trip per cpu to be generated without having to duplicate the trips each time. Keeps code clear and allows for easy changes.
3. Trip points tweaked to optimise performance. A7's kept at full speed for longer since they contribute little to the thermal load. Efficiency is improved by not throttling them until required. A15's throttled earlier to manage performance better under heavy loads to extract maximum performance from the available cooling.
---
arch/arm/boot/dts/exynos5422-odroidxu3-common.dtsi | 110 +++++----------------
.../boot/dts/exynos5422-odroidxu3-trip-points.dtsi | 100 +++++++++++++++++++
2 files changed, 123 insertions(+), 87 deletions(-)
create mode 100644 arch/arm/boot/dts/exynos5422-odroidxu3-trip-points.dtsi

diff --git a/arch/arm/boot/dts/exynos5422-odroidxu3-common.dtsi b/arch/arm/boot/dts/exynos5422-odroidxu3-common.dtsi
index 7341aa6..fdf5950 100755
--- a/arch/arm/boot/dts/exynos5422-odroidxu3-common.dtsi
+++ b/arch/arm/boot/dts/exynos5422-odroidxu3-common.dtsi
@@ -64,7 +64,7 @@
cooling-min-state = <0>;
cooling-max-state = <3>;
#cooling-cells = <2>;
-      cooling-levels = <0 130 170 230>;
+      cooling-levels = <0 110 160 230>;
};

mali: mali@0x11800000 {
@@ -107,92 +107,28 @@

thermal-zones {
cpu0_thermal: cpu0-thermal {
-         thermal-sensors = <&tmu_cpu0 0 &tmu_cpu1 0 &tmu_cpu2 0 &tmu_cpu3 0>;
-         polling-delay-passive = <250>;
-         polling-delay = <1000>;
-         trips {
-            cpu_alert0: cpu-alert-0 {
-               temperature = <75000>; /* millicelsius */
-               hysteresis = <10000>; /* millicelsius */
-               type = "passive";
-            };
-            cpu_alert1: cpu-alert-1 {
-               temperature = <80000>; /* millicelsius */
-               hysteresis = <10000>; /* millicelsius */
-               type = "passive";
-            };
-            cpu_alert2: cpu-alert-2 {
-               temperature = <85000>; /* millicelsius */
-               hysteresis = <10000>; /* millicelsius */
-               type = "passive";
-            };
-            cpu_alert3: cpu-alert-3 {
-               temperature = <90000>; /* millicelsius */
-               hysteresis = <10000>; /* millicelsius */
-               type = "passive";
-            };
-            cpu_alert4: cpu-alert-4 {
-               temperature = <95000>; /* millicelsius */
-               hysteresis = <10000>; /* millicelsius */
-               type = "passive";
-            };
-            cpu_alert5: cpu-alert-5 {
-               temperature = <103000>; /* millicelsius */
-               hysteresis = <10000>; /* millicelsius */
-               type = "passive";
-            };
-            cpu_alert6: cpu-alert-6 {
-               temperature = <110000>; /* millicelsius */
-               hysteresis = <10000>; /* millicelsius */
-               type = "passive";
-            };
-            cpu_criti0: cpu-crit-0 {
-               temperature = <115000>; /* millicelsius */
-               hysteresis = <10000>; /* millicelsius */
-               type = "critical";
-            };
-         };
-         cooling-maps {
-            map0 {
-               trip = <&cpu_alert0>;
-               cooling-device = <&fan0 0 1>;
-            };
-            map1 {
-               trip = <&cpu_alert1>;
-               cooling-device = <&fan0 1 2>;
-            };
-            map2 {
-               trip = <&cpu_alert2>;
-               cooling-device = <&fan0 2 3>;
-            };
-            /*
-             * When reaching cpu_alert3, reduce CPU
-             * by 2 steps. On Exynos5422/5800 that would
-             * be: 1600 MHz and 1100 MHz.
-             */
-            map3 {
-               trip = <&cpu_alert3>;
-               cooling-device = <&cpu0 0 2>;
-            };
-            map4 {
-               trip = <&cpu_alert3>;
-               cooling-device = <&cpu4 0 2>;
-            };
-
-            /*
-             * When reaching cpu_alert4, reduce CPU
-             * further, down to 600 MHz (11 steps for big,
-             * 7 steps for LITTLE).
-             */
-            map5 {
-               trip = <&cpu_alert4>;
-               cooling-device = <&cpu0 3 8>;
-            };
-            map6 {
-               trip = <&cpu_alert4>;
-               cooling-device = <&cpu4 3 13>;
-            };
-         };
+         thermal-sensors = <&tmu_cpu0 0>;
+         #define CPU_THERMAL_ZONE_NUM 0
+         #include "exynos5422-odroidxu3-trip-points.dtsi"
+         #undef CPU_THERMAL_ZONE_NUM
+      };
+      cpu1_thermal: cpu1-thermal {
+         thermal-sensors = <&tmu_cpu1 0>;
+         #define CPU_THERMAL_ZONE_NUM 1
+         #include "exynos5422-odroidxu3-trip-points.dtsi"
+         #undef CPU_THERMAL_ZONE_NUM
+      };
+      cpu2_thermal: cpu2-thermal {
+         thermal-sensors = <&tmu_cpu2 0>;
+         #define CPU_THERMAL_ZONE_NUM 2
+         #include "exynos5422-odroidxu3-trip-points.dtsi"
+         #undef CPU_THERMAL_ZONE_NUM
+      };
+      cpu3_thermal: cpu3-thermal {
+         thermal-sensors = <&tmu_cpu3 0>;
+         #define CPU_THERMAL_ZONE_NUM 3
+         #include "exynos5422-odroidxu3-trip-points.dtsi"
+         #undef CPU_THERMAL_ZONE_NUM
};
};
};
diff --git a/arch/arm/boot/dts/exynos5422-odroidxu3-trip-points.dtsi b/arch/arm/boot/dts/exynos5422-odroidxu3-trip-points.dtsi
new file mode 100644
index 0000000..8037b2f
--- /dev/null
+++ b/arch/arm/boot/dts/exynos5422-odroidxu3-trip-points.dtsi
@@ -0,0 +1,100 @@
+/*
+ * Device tree sources for default OdroidXU3/Exynos5422 thermal zone definition
+ *
+ * Copyright (c) 2015 Lukasz Majewski <l.majewski@samsung.com>
+ *                  Anand Moon <linux.amoon@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+
+#define _TOKENPASTE(x, y) x ## y
+#define TOKENPASTE(x, y) _TOKENPASTE(x, y)
+#define UNIQIFY(label) TOKENPASTE(label, CPU_THERMAL_ZONE_NUM)
+
+         polling-delay-passive = <250>;
+         polling-delay = <1000>;
+         trips {
+            UNIQIFY(cpu_alert0): cpu-alert-0 {
+               temperature = <70000>; /* millicelsius */
+               hysteresis = <10000>; /* millicelsius */
+               type = "active";
+            };
+            UNIQIFY(cpu_alert1): cpu-alert-1 {
+               temperature = <75000>; /* millicelsius */
+               hysteresis = <10000>; /* millicelsius */
+               type = "active";
+            };
+            UNIQIFY(cpu_alert2): cpu-alert-2 {
+               temperature = <80000>; /* millicelsius */
+               hysteresis = <10000>; /* millicelsius */
+               type = "active";
+            };
+            UNIQIFY(cpu_alert3): cpu-alert-3 {
+               temperature = <85000>; /* millicelsius */
+               hysteresis = <3000>; /* millicelsius */
+               type = "passive";
+            };
+            UNIQIFY(cpu_alert4): cpu-alert-4 {
+               temperature = <90000>; /* millicelsius */
+               hysteresis = <3000>; /* millicelsius */
+               type = "passive";
+            };
+            UNIQIFY(cpu_alert5): cpu-alert-5 {
+               temperature = <95000>; /* millicelsius */
+               hysteresis = <3000>; /* millicelsius */
+               type = "passive";
+            };
+            UNIQIFY(cpu_criti0): cpu-crit-0 {
+               temperature = <115000>; /* millicelsius */
+               hysteresis = <3000>; /* millicelsius */
+               type = "critical";
+            };
+         };
+         cooling-maps {
+            map0 {
+               trip = <&UNIQIFY(cpu_alert0)>;
+               cooling-device = <&fan0 0 1>;
+            };
+            map1 {
+               trip = <&UNIQIFY(cpu_alert1)>;
+               cooling-device = <&fan0 1 2>;
+            };
+            map2 {
+               trip = <&UNIQIFY(cpu_alert2)>;
+               cooling-device = <&fan0 2 3>;
+            };
+            /*
+             * When reaching cpu_alert3, reduce A15 cores by 1 step.
+             * The 2GHz step causes high thermals on multithreaded workloads
+             * so better performance is gained by managing it out early.
+             */
+            map3 {
+               trip = <&UNIQIFY(cpu_alert3)>;
+               cooling-device = <&cpu4 0 1>;
+            };
+            /*
+            * When reaching cpu_alert4, reduce A15 cores by 3 steps
+            * to further manage the performance level while keeping
+            * thermals under control.
+            */
+            map4 {
+               trip = <&UNIQIFY(cpu_alert4)>;
+               cooling-device = <&cpu4 2 4>;
+            };
+            /*
+             * When reaching cpu_alert5, reduce all CPUs to ensure thermal
+             * safety. A7 cores don't produce much thermal load so they are
+             * reduced less to optimise performance.
+             */
+            map5 {
+               trip = <&UNIQIFY(cpu_alert5)>;
+               cooling-device = <&cpu0 0 2>;
+            };
+            map6 {
+               trip = <&UNIQIFY(cpu_alert5)>;
+               cooling-device = <&cpu4 5 14>;
+            };
+         };
--
2.7.4

Statistics: Posted by DarkBahamut — Tue Feb 28, 2017 8:34 am
