Results 71 to 80 of 119
Thread: Spontaneous turning-on
-
2020-03-15, 07:43 #71
- Join Date
- May 2010
- Location
- London, UK
- Posts
- 800
-
2020-03-17, 05:20 #72
The crontab command is enabled in the busybox config but crond disabled. Not sure what use the crontab command is without the cron daemon.
The crontab file location is also set to /var which is remapped to a ramdisk during boot, so any crontab jobs are lost when the radio resets.Ralphy
1-Touch, 5-Classics, 3-Booms, 2-UE Radio
Squeezebox client builds donations always appreciated.
-
2020-03-23, 04:23 #73
- Join Date
- Jan 2010
- Location
- Hertfordshire
- Posts
- 6,614
Just rebooted my Touch after 24 days 5 hours using the button on the back. I can relax for another 24 days
Sent from my Pixel 3a using Tapatalk
-
2021-02-19, 10:06 #74
- Join Date
- Dec 2020
- Posts
- 125
Chiming in (reviving)
I relocated the logs to another machine so they wouldn't get lost at reboot and since the problem appears to be related to the Tasks module I enabled some debug options on it. What I found is that the application does not actually crash, which I kind of expected already, but it appears to enter a loop where the network task repeatedly adds and removes socket tasks until watchdog intervenes stating that some semaphore was not changed (within some set time?). At that point the last time the `ui` task was executed was 8.nnn seconds ago (normally executes 20+ times per second), so that definitely seems related.
I have another two weeks before the next player reaches fatality and possibly drill deeper towards the source.
-
2021-03-03, 12:44 #75
- Join Date
- Dec 2020
- Posts
- 125
Right, that was somewhat disappointing. There's nothing weird with networking to be seen and reviewing the log times it seems like the task manager simply stops executing the `ui` task even though there is no logging of this task having been removed. This appears to indicate that the task manager thinks that the task is still running, which may actually be true if it entered an endless loop or at least longer than it takes for watchdog to kick in.
On to the next restart...
-
2021-03-04, 15:14 #76
- Join Date
- May 2010
- Location
- London, UK
- Posts
- 800
-
2021-03-05, 14:41 #77
Thank you for having a closer look.
As I already posted before (see post #52), a quick code review revealed several pieces of Jive code that are not timer-wrapping-aware. This causes events to no longer fire and delay/wait loops to loop forever.
Basically, all these buggy pieces of code look similar:
Code:if (timer_value > some_point_of_time) then {do something}; (pseudocode)
Code:if (signed_value(timer_value - some_point_of_time) > 0) then {do something}; (pseudocode)
1x Squeezebox Classic, 3x Radio, 1x Touch, LMS 7.9.1 running on ODROID-U3, Ubuntu 16.04 and I'm happy with it! :)
-
2021-03-08, 04:29 #78
- Join Date
- Dec 2020
- Posts
- 125
I actually did have a look into that and essentially all of bits of code I examined don't really use the value itself (which would be pointless anyway as it does not reference any 'real' time that would mean anything to a user) but are about calculating a time difference that should not be affected by negative numbers being fed to it. e.g. `-111 - (-110)` still results in a valid result of positive 1 and really the only bad thing I can imagine happening here would be something like `(-MAX+4) - (MAX-3)` which results in -2*MAX + 7, but I would not expect this to happen consistently with every roll-over.
I do think someone mentioned having disabled watchdog and the UI becoming unresponsive after the critical time? That does sound consistent with my finding that the `ui` task is no longer being called several seconds before watchdog causes the player to reboot. It will be interesting to know if any of the other functions seize at that point as well, in particular the alsa backend which appears to be the reason that this clock method is being used.
-
2021-03-08, 06:13 #79
SUM-UP: code that is not aware of timer wrapping
That's interesting. My findings were completely different.
Summing up: The spontaneous reboots are caused by Watchdog resets, bailing out, because the wrapping of the (int32_t) jiffy_timer causes a stall of event processing or an endless loop somewhere in Jive Framework after 24.855 days.
In the Jive Framework, there are many lines of code not being aware of that timer wrapping, thus suspicious to cause this.
These bugs affect all Jive-based projects, being at least the Radios, Touch, Controller and PiCorePlayer.
See the suspicious lines of code, e.g. here:
in ui/jive_framework.c:
(these can easily be fixed by changing `timeout < now` to `(int32_t)(timeout - now) < 0`)
https://github.com/Logitech/squeezep...mework.c#L1305
https://github.com/Logitech/squeezep...mework.c#L1310
https://github.com/Logitech/squeezep...mework.c#L1322
https://github.com/Logitech/squeezep...mework.c#L1334
in jive/ui/Timer.lua:
these lines being part of very crucial timer handling code
https://github.com/ralph-irving/jive...Timer.lua#L161
https://github.com/ralph-irving/jive...Timer.lua#L177
https://github.com/ralph-irving/jive...Timer.lua#L183
in jive/ui/Framework.lua:
https://github.com/ralph-irving/jive...ework.lua#L330
https://github.com/ralph-irving/jive...ework.lua#L347
https://github.com/ralph-irving/jive...ework.lua#L364
As an example, post #37 already showed how these lines of codes can completely stall all UI process handling.
Please note that not only direct timer value comparisons need to be fixed, but one also need to inspect all lines of code that add offsets to a timer value, like this one and many others, because the timer value is a Lua_Number corresponding to a C double, so it can represent larger numbers than the int32_t jiffy timer itself.
This means that close to int32_t wrapping, the result of an addition may lie beyond the value range of the jiffy timer.
in system.c (of jivelite):
A very funny line of code regarding "Uptime" (although very probably irrelevant), because it shows that there was at least one programmer being aware of the wrapping problem:
https://github.com/ralph-irving/jive.../system.c#L100
Note as @mrw pointed out in post #46, the reason for timer wrapping at signed int32_t boundary (24.855 days) as opposed to 49.710 days, although jive_jiffies() is Uint32, lies in jiveL_get_ticks().
One should also have a closer look at jiveL_thread_time() some lines below.
Currently, three proposed solutions exist:
- mitigation: introduce controlled reboots at a user-defined interval <24 days, see post #51
- extend the upper bits of timer values returned by jiveL_get_ticks(), see post #52, which might raise other problems, see post #93
- fix all lines of code dealing with timer values, preferably by using a userdata type in Lua only dealing with timer value differences and making use of modular arithmetic, see post #52 and post #83
Also note that the easiest way to test the changes may be to modify the code of jive_jiffies(), so that it returns a timer value close to (int32_t) wrapping right from the start (eg. return timer value + 0x7F820000), so we only need to wait for 15 minutes instead of 24 days, see https://github.com/Logitech/squeezep...c/common.h#L67. @ralphy is able to provide this for Radios, based on the current cummunity firmware, see #92Last edited by Steevee28; 2021-03-12 at 15:43. Reason: added more links to related code, added note about testing, corrected syntax errors, added sum-up and proposed solutions
1x Squeezebox Classic, 3x Radio, 1x Touch, LMS 7.9.1 running on ODROID-U3, Ubuntu 16.04 and I'm happy with it! :)
-
2021-03-09, 00:47 #80
- Join Date
- Dec 2020
- Posts
- 125
I read that somewhat different, because the values of xxx.expires are created elsewhere from jive_jiffies incremented with some max_process_time value. The sole purpose of those lines you mentioned is thus to identify timeouts.
oh, and that's a funny one (although very probably irrelevant):
https://github.com/ralph-irving/jive.../system.c#L100
Note: the easiest way to test the changes may be to modify the code of jive_jiffies(), so that it returns a timer value close to (int32_t) wrapping right from the start, so we don't need to wait for 24 days, see https://github.com/Logitech/squeezep...c/common.h#L67
Another option for testing in this case would be to upgrade LUA to 5.4+ and replace all instances of `lua_Number` (signed) to `lua_Integer` (unsigned), which should either shift the restart to after 49 days or possibly continue indefinitely.Last edited by gordonb3; 2021-03-09 at 00:53.