Home of the Squeezebox™ & Transporter® network music players.
Page 8 of 12 FirstFirst ... 678910 ... LastLast
Results 71 to 80 of 119
  1. #71
    Senior Member
    Join Date
    May 2010
    Location
    London, UK
    Posts
    800
    Quote Originally Posted by ralphy View Post
    Someone clicked send the crashlog to logitech when they went to use it. Once sent the crashlog applet deletes the logfile.
    On at least one of my Radios it is definitely unclicked. Must check the other...

    Anyway, it does sound to me more than likely that you have established the source of the trigger. And, I would guess, itself caused by some consequence of the timer 'wrap around'.

  2. #72
    Senior Member ralphy's Avatar
    Join Date
    Jan 2006
    Location
    Canada
    Posts
    2,837
    Quote Originally Posted by mrw View Post
    Any reason why you don't run a cron job on the device itself ?
    Quote Originally Posted by Steevee28 View Post
    No. Cron is not available on the Squeezeboxes.
    The crontab command is enabled in the busybox config but crond disabled. Not sure what use the crontab command is without the cron daemon.

    The crontab file location is also set to /var which is remapped to a ramdisk during boot, so any crontab jobs are lost when the radio resets.
    Ralphy

    1-Touch, 5-Classics, 3-Booms, 2-UE Radio
    Squeezebox client builds donations always appreciated.

  3. #73
    Senior Member
    Join Date
    Jan 2010
    Location
    Hertfordshire
    Posts
    6,614
    Just rebooted my Touch after 24 days 5 hours using the button on the back. I can relax for another 24 days

    Sent from my Pixel 3a using Tapatalk

  4. #74
    Senior Member
    Join Date
    Dec 2020
    Posts
    125
    Chiming in (reviving)

    I relocated the logs to another machine so they wouldn't get lost at reboot and since the problem appears to be related to the Tasks module I enabled some debug options on it. What I found is that the application does not actually crash, which I kind of expected already, but it appears to enter a loop where the network task repeatedly adds and removes socket tasks until watchdog intervenes stating that some semaphore was not changed (within some set time?). At that point the last time the `ui` task was executed was 8.nnn seconds ago (normally executes 20+ times per second), so that definitely seems related.

    I have another two weeks before the next player reaches fatality and possibly drill deeper towards the source.

  5. #75
    Senior Member
    Join Date
    Dec 2020
    Posts
    125
    Right, that was somewhat disappointing. There's nothing weird with networking to be seen and reviewing the log times it seems like the task manager simply stops executing the `ui` task even though there is no logging of this task having been removed. This appears to indicate that the task manager thinks that the task is still running, which may actually be true if it entered an endless loop or at least longer than it takes for watchdog to kick in.

    On to the next restart...

  6. #76
    Senior Member
    Join Date
    May 2010
    Location
    London, UK
    Posts
    800
    Quote Originally Posted by gordonb3 View Post
    Right, that was somewhat disappointing. There's nothing weird with networking to be seen and reviewing the log times it seems like the task manager simply stops executing the `ui` task even though there is no logging of this task having been removed. This appears to indicate that the task manager thinks that the task is still running, which may actually be true if it entered an endless loop or at least longer than it takes for watchdog to kick in.

    On to the next restart...
    Possibly there is more to be seen if you capture Squeezeplayĺs STDOUT and/or STDERR.
    Or possibly not.

  7. #77
    Senior Member Steevee28's Avatar
    Join Date
    Feb 2010
    Location
    Mannheim, Germany
    Posts
    103
    Quote Originally Posted by gordonb3 View Post
    (...) This appears to indicate that the task manager thinks that the task is still running, which may actually be true if it entered an endless loop or at least longer than it takes for watchdog to kick in.
    (...)
    Thank you for having a closer look.
    As I already posted before (see post #52), a quick code review revealed several pieces of Jive code that are not timer-wrapping-aware. This causes events to no longer fire and delay/wait loops to loop forever.

    Basically, all these buggy pieces of code look similar:
    Code:
    if (timer_value > some_point_of_time) then {do something};
    
    (pseudocode)
    To finally get that problem fixed, they all need to be reviewed and changed to something like this:
    Code:
    if (signed_value(timer_value - some_point_of_time) > 0) then {do something};
    
    (pseudocode)
    It's a bunch of work. I'd already have done this if I had found some time and more experience in Lua, but I haven't. :-/
    1x Squeezebox Classic, 3x Radio, 1x Touch, LMS 7.9.1 running on ODROID-U3, Ubuntu 16.04 and I'm happy with it! :)

  8. #78
    Senior Member
    Join Date
    Dec 2020
    Posts
    125
    Quote Originally Posted by Steevee28 View Post
    Thank you for having a closer look.
    As I already posted before (see post #52), a quick code review revealed several pieces of Jive code that are not timer-wrapping-aware. This causes events to no longer fire and delay/wait loops to loop forever.

    Basically, all these buggy pieces of code look similar:
    Code:
    if (timer_value > some_point_of_time) then {do something};
    
    (pseudocode)
    To finally get that problem fixed, they all need to be reviewed and changed to something like this:
    Code:
    if (signed_value(timer_value - some_point_of_time) > 0) then {do something};
    
    (pseudocode)
    I actually did have a look into that and essentially all of bits of code I examined don't really use the value itself (which would be pointless anyway as it does not reference any 'real' time that would mean anything to a user) but are about calculating a time difference that should not be affected by negative numbers being fed to it. e.g. `-111 - (-110)` still results in a valid result of positive 1 and really the only bad thing I can imagine happening here would be something like `(-MAX+4) - (MAX-3)` which results in -2*MAX + 7, but I would not expect this to happen consistently with every roll-over.

    I do think someone mentioned having disabled watchdog and the UI becoming unresponsive after the critical time? That does sound consistent with my finding that the `ui` task is no longer being called several seconds before watchdog causes the player to reboot. It will be interesting to know if any of the other functions seize at that point as well, in particular the alsa backend which appears to be the reason that this clock method is being used.

  9. #79
    Senior Member Steevee28's Avatar
    Join Date
    Feb 2010
    Location
    Mannheim, Germany
    Posts
    103

    SUM-UP: code that is not aware of timer wrapping

    Quote Originally Posted by gordonb3 View Post
    I actually did have a look into that and essentially all of bits of code I examined don't really use the value itself (...) but are about calculating a time difference (...)
    That's interesting. My findings were completely different.

    Summing up: The spontaneous reboots are caused by Watchdog resets, bailing out, because the wrapping of the (int32_t) jiffy_timer causes a stall of event processing or an endless loop somewhere in Jive Framework after 24.855 days.
    In the Jive Framework, there are many lines of code not being aware of that timer wrapping, thus suspicious to cause this.
    These bugs affect all Jive-based projects, being at least the Radios, Touch, Controller and PiCorePlayer.

    See the suspicious lines of code, e.g. here:

    in ui/jive_framework.c:
    (these can easily be fixed by changing `timeout < now` to `(int32_t)(timeout - now) < 0`)
    https://github.com/Logitech/squeezep...mework.c#L1305
    https://github.com/Logitech/squeezep...mework.c#L1310
    https://github.com/Logitech/squeezep...mework.c#L1322
    https://github.com/Logitech/squeezep...mework.c#L1334

    in jive/ui/Timer.lua:
    these lines being part of very crucial timer handling code
    https://github.com/ralph-irving/jive...Timer.lua#L161
    https://github.com/ralph-irving/jive...Timer.lua#L177
    https://github.com/ralph-irving/jive...Timer.lua#L183

    in jive/ui/Framework.lua:
    https://github.com/ralph-irving/jive...ework.lua#L330
    https://github.com/ralph-irving/jive...ework.lua#L347
    https://github.com/ralph-irving/jive...ework.lua#L364

    As an example, post #37 already showed how these lines of codes can completely stall all UI process handling.

    Please note that not only direct timer value comparisons need to be fixed, but one also need to inspect all lines of code that add offsets to a timer value, like this one and many others, because the timer value is a Lua_Number corresponding to a C double, so it can represent larger numbers than the int32_t jiffy timer itself.
    This means that close to int32_t wrapping, the result of an addition may lie beyond the value range of the jiffy timer.

    in system.c (of jivelite):
    A very funny line of code regarding "Uptime" (although very probably irrelevant), because it shows that there was at least one programmer being aware of the wrapping problem:
    https://github.com/ralph-irving/jive.../system.c#L100

    Note as @mrw pointed out in post #46, the reason for timer wrapping at signed int32_t boundary (24.855 days) as opposed to 49.710 days, although jive_jiffies() is Uint32, lies in jiveL_get_ticks().
    One should also have a closer look at jiveL_thread_time() some lines below.

    Currently, three proposed solutions exist:

    - mitigation: introduce controlled reboots at a user-defined interval <24 days, see post #51
    - extend the upper bits of timer values returned by jiveL_get_ticks(), see post #52, which might raise other problems, see post #93
    - fix all lines of code dealing with timer values, preferably by using a userdata type in Lua only dealing with timer value differences and making use of modular arithmetic, see post #52 and post #83

    Also note that the easiest way to test the changes may be to modify the code of jive_jiffies(), so that it returns a timer value close to (int32_t) wrapping right from the start (eg. return timer value + 0x7F820000), so we only need to wait for 15 minutes instead of 24 days, see https://github.com/Logitech/squeezep...c/common.h#L67. @ralphy is able to provide this for Radios, based on the current cummunity firmware, see #92
    Last edited by Steevee28; 2021-03-12 at 15:43. Reason: added more links to related code, added note about testing, corrected syntax errors, added sum-up and proposed solutions
    1x Squeezebox Classic, 3x Radio, 1x Touch, LMS 7.9.1 running on ODROID-U3, Ubuntu 16.04 and I'm happy with it! :)

  10. #80
    Senior Member
    Join Date
    Dec 2020
    Posts
    125
    Quote Originally Posted by Steevee28 View Post
    I read that somewhat different, because the values of xxx.expires are created elsewhere from jive_jiffies incremented with some max_process_time value. The sole purpose of those lines you mentioned is thus to identify timeouts.

    oh, and that's a funny one (although very probably irrelevant):
    https://github.com/ralph-irving/jive.../system.c#L100
    Interesting comment, as it clearly shows that the programmer was aware of the limit, but didn't consider it to be critical nor realized that LUA would cast the value to a signed int.

    Note: the easiest way to test the changes may be to modify the code of jive_jiffies(), so that it returns a timer value close to (int32_t) wrapping right from the start, so we don't need to wait for 24 days, see https://github.com/Logitech/squeezep...c/common.h#L67
    Yes, but that also requires compiling to the correct (ancient) support libraries, although maybe you could get away with static linking on a newer development platform. Of course once you do that you can also fix jive_jiffies() to perform the roll-over itself at some `safe` time within the 2^31 limit.

    Another option for testing in this case would be to upgrade LUA to 5.4+ and replace all instances of `lua_Number` (signed) to `lua_Integer` (unsigned), which should either shift the restart to after 49 days or possibly continue indefinitely.
    Last edited by gordonb3; 2021-03-09 at 00:53.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •