PDA

View Full Version : Need User Data to Diagnose GUI Stall Problem



Eric Carroll
2007-01-30, 00:53
In two other threads, people are reporting the SB GUI stalling as users scroll from one menu to another.
http://forums.slimdevices.com/showthread.php?t=23988
http://forums.slimdevices.com/showthread.php?t=31566

These stalls are reported as 8-30 seconds long. Music continues to stream, buffer shows full if you are displaying it, but the SB does not respond to remote keypresses. Sometimes, when it recovers, you get a flurry of "catch up" keypress processing.

Please do NOT report if you are using SqueezeNetwork. We want to focus only on the SB connected to SlimServer.

I have reported this problem to Tech Support, however, they immediately indicate it is my wireless network.

We need some data from the community to understand if this is a slimserver, SB or network problem.

If you have this symptom could you post to this thread just the following information (take discussion to another thread):

First, reply to the poll as to which devices have the symptom: wired, wireless, or none.

If you have devices with the GUI stall (pause) please post the following:

1. How frequently does it occur?
2. What are you runing slimserver on?
3. What slimserver version you are running?
4. For each device with the issue:

a) identify if the device is wired or wireless
b) Go to the Slimserver Web page Server & Network health under the Help section of the default page layout.
c) Go to the "Player Statistics" for the player on the bottom.
d) Post the Signal Strength and Control Connection graphs.
e) Run a Network test to the device, 1000 kb/s, to a count of 1000, and post the resulting graph.
f) run "ping -t <device IP>" for 200 pings and cut & paste the result

We hope to see if wired users experience this as much as wireless, and if so, if their networks are clean.

My hypothesis is that this is a slimserver bug. But until we eliminate the network, it is reasonable for SD to continue to suspect user's wireless networks. So I hope this will start to provide some data for all of us to look at.

Eric Carroll
2007-01-30, 01:03
1. Frequency: several times an hour
2. Computer: Infrant ReadyNAS NV+w. 1GB mem
3. Slimserver: 6.5.1 early release bundled with v3.01.c1-p4.
4. Network data - all devices wireless

Basement SB3

Signal Strength
< 10 : 0 : 0%
< 20 : 0 : 0%
< 30 : 2 : 0%
< 40 : 177 : 32% ################
< 50 : 373 : 68% #################################
< 60 : 0 : 0%
< 70 : 0 : 0%
< 80 : 0 : 0%
< 90 : 0 : 0%
< 100 : 0 : 0%
>=100 : 0 : 0%
max : 49.000000
min : 28.000000
avg : 41.014493

Control Connection
< 1 : 556 :100% ##################################################
< 2 : 0 : 0%
< 5 : 0 : 0%
< 10 : 0 : 0%
< 20 : 0 : 0%
>=20 : 0 : 0%
max : 0.000000
min : 0.000000
avg : 0.000000

Network Test
< 10 : 0 : 0%
< 20 : 0 : 0%
< 30 : 0 : 0%
< 40 : 0 : 0%
< 50 : 0 : 0%
< 60 : 0 : 0%
< 70 : 0 : 0%
< 75 : 0 : 0%
< 80 : 0 : 0%
< 85 : 0 : 0%
< 90 : 0 : 0%
< 95 : 0 : 0%
< 100 : 0 : 0%
>=100 : 1349 :100% ##################################################
max : 100.000000
min : 100.000000
avg : 100.000000
Ping statistics for 192.168.1.100:
Packets: Sent = 392, Received = 392, Lost = 0 (0% loss),
Approximate route trip times in milli-seconds:
Minimum = 1ms, Maximum = 30ms, Average = 2ms


3rd Floor SB3

Signal Strength
< 10 : 0 : 0%
< 20 : 0 : 0%
< 30 : 0 : 0%
< 40 : 0 : 0%
< 50 : 0 : 0%
< 60 : 0 : 0%
< 70 : 132 : 21% ##########
< 80 : 299 : 47% #######################
< 90 : 202 : 32% ###############
< 100 : 0 : 0%
>=100 : 0 : 0%
max : 85.000000
min : 64.000000
avg : 73.666667

Control Connection
< 1 : 642 :100% ##################################################
< 2 : 0 : 0%
< 5 : 0 : 0%
< 10 : 0 : 0%
< 20 : 0 : 0%
>=20 : 0 : 0%
max : 0.000000
min : 0.000000
avg : 0.000000

Network test
< 10 : 0 : 0%
< 20 : 0 : 0%
< 30 : 0 : 0%
< 40 : 0 : 0%
< 50 : 0 : 0%
< 60 : 0 : 0%
< 70 : 0 : 0%
< 75 : 0 : 0%
< 80 : 0 : 0%
< 85 : 0 : 0%
< 90 : 0 : 0%
< 95 : 0 : 0%
< 100 : 0 : 0%
>=100 : 792 :100% ##################################################
max : 100.000000
min : 100.000000
avg : 100.000000

Ping statistics for 192.168.1.101:
Packets: Sent = 394, Received = 394, Lost = 0%
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 17ms, Average = 1ms

1st Floor Transporter
Signal Strength
< 10 : 0 : 0%
< 20 : 0 : 0%
< 30 : 0 : 0%
< 40 : 0 : 0%
< 50 : 0 : 0%
< 60 : 0 : 0%
< 70 : 0 : 0%
< 80 : 54 : 6% ##
< 90 : 882 : 94% ###############################################
< 100 : 0 : 0%
>=100 : 0 : 0%
max : 87.000000
min : 75.000000
avg : 81.534188

Control Connection
< 1 : 1205 :100% ##################################################
< 2 : 0 : 0%
< 5 : 0 : 0%
< 10 : 0 : 0%
< 20 : 0 : 0%
>=20 : 0 : 0%
max : 0.000000
min : 0.000000
avg : 0.000000

Network Test
< 10 : 0 : 0%
< 20 : 0 : 0%
< 30 : 0 : 0%
< 40 : 1 : 0%
< 50 : 0 : 0%
< 60 : 0 : 0%
< 70 : 0 : 0%
< 75 : 0 : 0%
< 80 : 0 : 0%
< 85 : 0 : 0%
< 90 : 0 : 0%
< 95 : 0 : 0%
< 100 : 0 : 0%
>=100 : 1183 :100% #################################################
max : 100.000000
min : 30.851064
avg : 99.941597

Ping statistics for 192.168.1.105:
Packets: Sent = 1457, Received = 1457, Lost = 0%
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 104ms, Average = 1ms

This pretty clearly shows my wireless network is as clean as a whistle.

Eric Carroll
2007-01-30, 12:37
bump... Was I was confused by the other threads? This is not an issue?

Triode
2007-01-30, 12:42
Please turn on the diagnotics described in the second part of:
http://wiki.slimdevices.com/index.cgi?DiagnosingPerformanceIssues

and post the output here - this should show anything taking more than 0.5 seconds in the server.

Eric Carroll
2007-01-30, 12:52
Please turn on the diagnotics described in the second part of:
http://wiki.slimdevices.com/index.cgi?DiagnosingPerformanceIssues

and post the output here - this should show anything taking more than 0.5 seconds in the server.

I understand. I am trying to see if the *network* specifically is what is causing the issue. That is Tech Support's position and we need some data to see if they are right, or not.

If its not the network its got to be the firmware or the slimserver.

I didn't start this thread because I am having a problem (which I am). I started it to collect data systematically from all the others reporting the problem in the cited threads and understand if this happens on wireline SB3s and if wireless connectivity is the major root cause.

Here is my server data. Keep in mind this is a ~250MHZ processor, so web page generation is SLOW. And as such, I almost NEVER use the Web GUI (I use the SB remote, and Moose, in that order of frequency).

----------------
Server StatisticsHome / Server & Network Health / Server Statistics

Server Response Time
The response time of the server - the time between successive calls to select.
< 0.002 : 133316 : 20% ##########
< 0.005 : 258750 : 39% ###################
< 0.01 : 150078 : 23% ###########
< 0.015 : 13020 : 2%
< 0.025 : 38569 : 6% ##
< 0.05 : 46328 : 7% ###
< 0.1 : 11957 : 2%
< 0.5 : 3984 : 1%
< 1 : 30 : 0%
< 5 : 37 : 0%
>=5 : 54 : 0%
max : 28.456353
min : 0.001262
avg : 0.011664

Timer Lateness
The time between when a timer task was scheduled and when it is run.
< 0.002 : 26797 : 12% #####
< 0.005 : 29627 : 13% ######
< 0.01 : 154952 : 67% #################################
< 0.015 : 6898 : 3% #
< 0.025 : 6851 : 3% #
< 0.05 : 3510 : 2%
< 0.1 : 1830 : 1%
< 0.5 : 1196 : 1%
< 1 : 117 : 0%
< 5 : 254 : 0%
>=5 : 408 : 0%
max : 28.376613
min : 0.000000
avg : 0.039846

Select Task Duration
The length of time taken by each task run by select.
< 0.002 : 1453 : 3% #
< 0.005 : 1710 : 3% #
< 0.01 : 40578 : 82% #########################################
< 0.015 : 4167 : 8% ####
< 0.025 : 167 : 0%
< 0.05 : 348 : 1%
< 0.1 : 137 : 0%
< 0.5 : 522 : 1%
< 1 : 23 : 0%
< 5 : 231 : 0%
>=5 : 58 : 0%
max : 28.451643
min : 0.000722
avg : 0.036801

Timer Task Duration
The length of time taken by each timer task.
< 0.002 : 1393 : 1%
< 0.005 : 127539 : 55% ###########################
< 0.01 : 13841 : 6% ##
< 0.015 : 3069 : 1%
< 0.025 : 37061 : 16% #######
< 0.05 : 40219 : 17% ########
< 0.1 : 8430 : 4% #
< 0.5 : 873 : 0%
< 1 : 0 : 0%
< 5 : 14 : 0%
>=5 : 1 : 0%
max : 13.850420
min : 0.001002
avg : 0.018120

Execute / Notification Task Duration
The length of time taken by each execute command or notification callback.
< 0.002 : 1192 : 52% #########################
< 0.005 : 77 : 3% #
< 0.01 : 220 : 10% ####
< 0.015 : 118 : 5% ##
< 0.025 : 131 : 6% ##
< 0.05 : 187 : 8% ####
< 0.1 : 175 : 8% ###
< 0.5 : 160 : 7% ###
< 1 : 11 : 0%
< 5 : 34 : 1%
>=5 : 0 : 0%
max : 3.353227
min : 0.000140
avg : 0.051486

Scheduler Task Duration
The length of time taken by each scheduled task.
< 0.002 : 0 : 0%
< 0.005 : 0 : 0%
< 0.01 : 0 : 0%
< 0.015 : 0 : 0%
< 0.025 : 0 : 0%
< 0.05 : 0 : 0%
< 0.1 : 0 : 0%
< 0.5 : 0 : 0%
< 1 : 0 : 0%
< 5 : 0 : 0%
>=5 : 0 : 0%
max : 0.000000
min : 0.000000
avg : 0.000000

Database Access
The time taken for information to be retrieved from the database.
< 0.002 : 715 : 7% ###
< 0.005 : 2 : 0%
< 0.01 : 0 : 0%
< 0.015 : 7 : 0%
< 0.025 : 18 : 0%
< 0.05 : 9523 : 92% ##############################################
< 0.1 : 29 : 0%
< 0.5 : 11 : 0%
< 1 : 0 : 0%
< 5 : 0 : 0%
>=5 : 0 : 0%
max : 0.242024
min : 0.001042
avg : 0.026107

Web Page Build
The length of time taken to build each web page.
< 0.002 : 0 : 0%
< 0.005 : 0 : 0%
< 0.01 : 0 : 0%
< 0.015 : 0 : 0%
< 0.025 : 0 : 0%
< 0.05 : 0 : 0%
< 0.1 : 0 : 0%
< 0.5 : 9 : 4% #
< 1 : 9 : 4% #
< 5 : 222 : 91% #############################################
>=5 : 4 : 2%
max : 12.981102
min : 0.259739
avg : 1.569122

Process Template
The time to process each Template Toolkit template when building web pages.
< 0.002 : 0 : 0%
< 0.005 : 1116 : 8% ###
< 0.01 : 2623 : 18% #########
< 0.015 : 7820 : 54% ###########################
< 0.025 : 62 : 0%
< 0.05 : 1994 : 14% ######
< 0.1 : 148 : 1%
< 0.5 : 409 : 3% #
< 1 : 214 : 1%
< 5 : 14 : 0%
>=5 : 0 : 0%
max : 4.111182
min : 0.002805
avg : 0.031899

IR Queue Length
The delay between an IR key press being received and being processed.
< 0.002 : 0 : 0%
< 0.005 : 287 : 49% ########################
< 0.01 : 56 : 10% ####
< 0.015 : 22 : 4% #
< 0.025 : 34 : 6% ##
< 0.05 : 54 : 9% ####
< 0.1 : 88 : 15% #######
< 0.5 : 37 : 6% ###
< 1 : 0 : 0%
< 5 : 4 : 1%
>=5 : 1 : 0%
max : 16.438772
min : 0.002865
avg : 0.074439
--------------------------------------------------------------------------------

Triode
2007-01-30, 13:09
If you run the network test at a high rate for a resonable period of time and it does not drop a packet then I'm not sure it will be the network.

However this is only true if the server is fast enough as the network tries to send a packet at a given rate. If the server is running too slow to generate a packet at the right time, it counts that as successful transmission.

So try turning on the performance warnings and see whether you get any warning in the log about the server running slow. [this is for normal use rather than when running nettest]

Eric Carroll
2007-01-30, 13:21
I don't believe this UI stall issue has anything to do with the network. But we have to show tech support of this before they will budge off their "its the network" position. My wireless is clean, as you can see. I still have UI stalls.

Are you saying that if the server receives a SB3 keypress, then drops it due to overload (it can drop it?) that is counted in the performance stats as success?

Why will I get performance warnings if the histograms show no issues?

Are you talking about the thresholds on the server performance stats page or some other performance warnings?

If slimserver is quietly dropping keypresses when "too busy" that would explain a lot, but it seems too obvious... or is it?

Triode
2007-01-30, 13:43
Yes - set the warning thresholds on the web page.

SS drops key presses if they have been queued up for too long. If you edit the perl in Slim/Hardware/IR.pm to change $maxIRQTime you can adjust how long is deemed too long. The idea here is that keeping old key presses and processing them too later is worse than loosing them. At present the IR time graph shows how long a key press is stored for, but does not show that ones over the threshold above are actually droppped without doing anything.

The server code is single threaded, but tries to maintain audio streaming if at all possible. This is what "Response Time" is really recording - the time between checking to see if streaming is needed. However for other functions its possible that UI will be locked out because it is single threaded and so a function may take some time and the only other thing done when it is running is audio streaming. So when building a web page for example, the player UI will freeze, but audio streaming should be maintained. This is by current design of the server.

bpa
2007-01-30, 13:55
I may be wrong, bit I think if Slimserver is running on an Windows XP system, editing the IR.pm file will have no effect because of the way the EXE is created.

Eric Carroll
2007-01-30, 17:15
SS drops key presses if they have been queued up for too long. If you edit the perl in Slim/Hardware/IR.pm to change $maxIRQTime you can adjust how long is deemed too long. The idea here is that keeping old key presses and processing them too later is worse than loosing them. At present the IR time graph shows how long a key press is stored for, but does not show that ones over the threshold above are actually droppped without doing anything.

The server code is single threaded, but tries to maintain audio streaming if at all possible. This is what "Response Time" is really recording - the time between checking to see if streaming is needed. However for other functions its possible that UI will be locked out because it is single threaded and so a function may take some time and the only other thing done when it is running is audio streaming. So when building a web page for example, the player UI will freeze, but audio streaming should be maintained. This is by current design of the server.

Well well this is very interesting. Thanks, you are the first person that's been able to explain this to me so far.

What is the default for maxIRQTime? (I am just starting to look around SS - and the internal documentation seems, well, limited)

With performance alarms on I am seeing lots of strange 15 second delay reports that are not showing up in the histograms:

2007-01-30 18:23:39.3165 Select Task > .5 : 15.86377 Slim::Networking::Async::HTTP::_http_read_body
2007-01-30 18:23:39.3330 Timer Late > .5 : 15.87504
2007-01-30 18:23:39.3624 Response Time > .5 : 15.91006
2007-01-30 18:23:39.3682 Timer Late > .5 : 15.69968
2007-01-30 18:23:39.3785 Timer Late > .5 : 15.70222
2007-01-30 18:23:39.3904 Timer Late > .5 : 15.66950
2007-01-30 18:23:39.4246 Timer Late > .5 : 15.69042
2007-01-30 18:23:39.4354 Timer Late > .5 : 14.21449
2007-01-30 18:23:39.4426 Timer Late > .5 : 13.44166
2007-01-30 18:23:39.4770 Timer Late > .5 : 2.34886
2007-01-30 18:23:39.4973 Timer Late > .5 : 2.34086
2007-01-30 18:38:39.2370 Select Task > .5 : 15.86086 Slim::Networking::Async::HTTP::_http_read_body
2007-01-30 18:38:39.2536 Timer Late > .5 : 15.79570
2007-01-30 18:38:39.2831 Response Time > .5 : 15.90738
2007-01-30 18:38:39.2887 Timer Late > .5 : 15.56050
2007-01-30 18:38:39.3229 Timer Late > .5 : 15.32192
2007-01-30 18:38:39.3576 Timer Late > .5 : 15.19713
2007-01-30 18:38:39.4262 Timer Late > .5 : 15.20347
2007-01-30 18:38:39.4385 Timer Late > .5 : 15.16337
2007-01-30 18:38:39.4493 Timer Late > .5 : 14.22114
2007-01-30 18:53:39.3777 Select Task > .5 : 15.85946 Slim::Networking::Async::HTTP::_http_read_body
2007-01-30 18:53:39.3941 Timer Late > .5 : 15.65779
2007-01-30 18:53:39.4266 Response Time > .5 : 15.90890
2007-01-30 18:53:39.4323 Timer Late > .5 : 15.43129
2007-01-30 18:53:39.4760 Timer Late > .5 : 15.33866
2007-01-30 18:53:39.5346 Timer Late > .5 : 15.25172
2007-01-30 18:53:39.5468 Timer Late > .5 : 15.21176
2007-01-30 18:53:39.5577 Timer Late > .5 : 15.09981
2007-01-30 18:53:39.5882 Timer Late > .5 : 14.35192

I don't think I am using the Web GUI at these times, but I will have to look more closely.

I have not yet correlated these alarms with UI freeze, but the time period is darn coincidental.

Is there any way to localize these messages further? How do I figure out what is setting up or firing the instance of Slim::Networking::Async::HTTP::_http_read_body?

Now, the really hard part (I hesitate to even ask)...

Is there any plan to thread SS? Would another coder help?

I know many people use the web GUI, but all my devices are far away from spinning disks and computers, intentionally. So the remote performance matters tremendously, second only to streaming.

Eric Carroll
2007-01-30, 17:49
Triode,

I feel like I am getting somewhere now.

Just to be safe, I turned off the SuperDateTime plugin I had running on a single SB3. I was using it to gather & display local weather forecast (with underground weather station as well). This definately improved things (which is weird, I thought I had turned the weather update way down so it wouldn't be doing http gets for the weather all the time; just twice an hour).

But even after doing this, an SB3 still stalled once, and, yes, here is the performance alarm:

2007-01-30 19:38:42.7632 Select Task > .5 : 15.90966 Slim::Networking::Async::HTTP::_http_read_body
2007-01-30 19:38:42.7889 Response Time > .5 : 15.93579

I believe this was coincident.

So, something in SS is doing http_read_body, causing a long lockout of the single thread and impacting IR processing?

But what?

Any help appreciated.

Ron Olsen
2007-01-30, 19:35
One thing you might try: disable all third-party plugins when you do your tests. If you really believe the basic SlimServer software is at fault, you have to eliminate any possible effects from third-party software in your tests.

I've found that some third-party plugins can adversely affect SlimServer performance.

Eric Carroll
2007-01-30, 20:11
If you really believe the basic SlimServer software is at fault, you have to eliminate any possible effects from third-party software in your tests.


All I have enabled are:

CLI
Date and Time Screensaver
Visual Screensaver
Digital Inputs
Lazy Music Search
Network Test
Random Mix
Rescan Music Library
Save Playlist
Visualizer Screensaver

That's it.

What's even weirder, I have the following performance alarms when I am not even using an SB3/Tp at all (and unsolicitied tracebacks):

2007-01-30 21:23:29.6780 Timer Task > .5 : 3.67679 Slim::Utils::Cache::cleanup
2007-01-30 21:23:29.6881 Response Time > .5 : 3.68727
2007-01-30 21:23:29.6955 Timer Late > .5 : 3.27285
2007-01-30 21:23:29.7092 Timer Late > .5 : 3.26647
2007-01-30 21:23:29.7229 Timer Late > .5 : 3.26018
2007-01-30 21:23:29.7354 Timer Late > .5 : 2.68836
2007-01-30 21:23:29.7496 Timer Late > .5 : 2.37595
2007-01-30 21:23:48.3435 Select Task > .5 : 15.87832 Slim::Networking::Async::HTTP::_http_read_body
2007-01-30 21:23:48.3616 Timer Late > .5 : 15.61960
2007-01-30 21:23:48.3956 Response Time > .5 : 15.93090
2007-01-30 21:23:48.4029 Timer Late > .5 : 15.65363
2007-01-30 21:23:48.4164 Timer Late > .5 : 15.65974
2007-01-30 21:23:48.4301 Timer Late > .5 : 15.66573
2007-01-30 21:23:48.4425 Timer Late > .5 : 14.20052
2007-01-30 21:23:48.4514 Timer Late > .5 : 13.45046
2007-01-30 21:38:48.2803 Select Task > .5 : 15.87050 Slim::Networking::Async::HTTP::_http_read_body
2007-01-30 21:38:48.2984 Timer Late > .5 : 15.54569
2007-01-30 21:38:48.3326 Response Time > .5 : 15.92325
2007-01-30 21:38:48.3399 Timer Late > .5 : 15.33890
2007-01-30 21:38:48.3746 Timer Late > .5 : 15.15982
2007-01-30 21:38:48.4360 Timer Late > .5 : 15.20399
2007-01-30 21:38:48.4499 Timer Late > .5 : 15.20006
2007-01-30 21:38:48.4623 Timer Late > .5 : 14.20960
2007-01-30 21:53:48.3753 Select Task > .5 : 15.89625 Slim::Networking::Async::HTTP::_http_read_body
2007-01-30 21:53:48.3936 Timer Late > .5 : 15.63090
2007-01-30 21:53:48.4277 Response Time > .5 : 15.94906
2007-01-30 21:53:48.4535 Timer Late > .5 : 15.45251
2007-01-30 21:53:48.5917 Timer Late > .5 : 15.54888
2007-01-30 21:53:48.6804 Timer Late > .5 : 15.50254
2007-01-30 21:53:48.6942 Timer Late > .5 : 15.49962
2007-01-30 21:53:48.7068 Timer Late > .5 : 14.44415
2007-01-30 21:59:43.1440 Backtrace:

frame 0: Slim::Player::Source::errorOpening (/usr/share/slimserver/Slim/Player/Squeezebox2.pm line 579)
frame 1: Slim::Player::Squeezebox2::failedDirectStream (/usr/share/slimserver/Slim/Networking/Slimproto.pm line 527)
frame 2: Slim::Networking::Slimproto::_disco_handler (/usr/share/slimserver/Slim/Networking/Slimproto.pm line 387)
frame 3: Slim::Networking::Slimproto::client_readable (/usr/share/slimserver/Slim/Networking/Select.pm line 238)
frame 4: Slim::Networking::Select::select (/usr/share/slimserver/slimserver.pl line 492)
frame 5: main::idle (/usr/share/slimserver/slimserver.pl line 445)
frame 6: main::main (/usr/share/slimserver/slimserver.pl line 1074)

2007-01-30 21:59:43.3196 Backtrace:

frame 0: Slim::Player::Source::errorOpening (/usr/share/slimserver/Slim/Player/Squeezebox2.pm line 579)
frame 1: Slim::Player::Squeezebox2::failedDirectStream (/usr/share/slimserver/Slim/Networking/Slimproto.pm line 527)
frame 2: Slim::Networking::Slimproto::_disco_handler (/usr/share/slimserver/Slim/Networking/Slimproto.pm line 387)
frame 3: Slim::Networking::Slimproto::client_readable (/usr/share/slimserver/Slim/Networking/Select.pm line 238)
frame 4: Slim::Networking::Select::select (/usr/share/slimserver/slimserver.pl line 492)
frame 5: main::idle (/usr/share/slimserver/slimserver.pl line 445)
frame 6: main::main (/usr/share/slimserver/slimserver.pl line 1074)

Triode
2007-01-31, 11:50
Looks like async http is not as async as it should be. [Its intended to perform http without blocking the server] Do you have disks on the NAS which spin down? If so may be related to waiting for these to spin up?

Could you turn on asyc http debugging go to server debugging page to turn on. [--d_http_asyn if you turn on debugging from the command line]

Can you confirm if this is 6.5.1 and do you have access to the perl to change it?

Eric Carroll
2007-01-31, 12:59
Do you have disks on the NAS which spin down? If so may be related to waiting for these to spin up?


No, I have no disks spun down or waiting to spin up.



Can you confirm if this is 6.5.1?

This is Infrant's release of 6.5.1, but not the final 6.5.1 release. I understand this was due to the timing of the release of Infrant release v3.01.c1-p4 which has the early 6.5.1 in it.



Could you turn on asyc http debugging go to server debugging page to turn on. [--d_http_asyn if you turn on debugging from the command line]

This is not obvious, I don't have access to the internals without hackery I am not willing to perform (the NAS is my backup system). I could try installing it on my PC and see if I can replicate the problem. Alternatively I can see if Infrant will give me a way to turn it on.



Do you have access to the perl to change it


I don't have Infrant's specific code if it is different. I have downloaded a previous version of SS to start the process of getting up to speed on it. I am starting cold on SS but would be happy to help if required (with some guidance, internals documentation is, um, lacking).

I am sure Infrant would integrate the change once we figure it out and commit it to the repository.

I will see what Infrant has to say. I doubt this issue is Infrant specific based on the postings I have seen. As best as I understand they just use the standard SD build tree.

As a next step, I will focus on getting debugging enabled on the NAS or replicate it on my PC.

You have been extremely helpful, thank you. If you have more places to look, please let me know.

Triode
2007-01-31, 13:04
Ok - sounds like you can't alter the perl.

However if you go to server settings - debugging web page you should be able to turn on the asyc http debugging. This should then show extra entries in the server log which may indicate what is going on.

Eric Carroll
2007-01-31, 13:22
Here is additional testing I have performed.

Originally, I had SuperDateTime screensaver enabled with both weather and underground weather fetches on a 15 minute basis.

I turned off the SuperDateTime plugin. Still saw activity in the performance alarms. So, I turned SuperDateTime back on, cleared the underground weather station code, and set the regular retrieval to 45 minutes.

Now, like clockwork, I have one performance alarm spaced 45 minutes apart on http get. This must be the weather http get.

Originally I ended up having 4x2 fetches per hour, or one every 7 minutes or so. This definately explains the symptoms I was seeing. So I think we are closing in on the root cause here.

Right now, with weather fetching at 45 minutes, the UI is much more responsive. Stalls are not nearly on the same frequency, for nearly as long. I get some little stalls and see corresponding alarms in the performance page, but one thing at a time.

An example of the 45 minute http get:
2007-01-31 14:54:00.5725 Select Task > .5 : 15.90743 Slim::Networking::Async::HTTP::_http_read_body
2007-01-31 14:54:00.5947 Timer Late > .5 : 15.80381
2007-01-31 14:54:00.6075 Response Time > .5 : 15.94287
2007-01-31 14:54:00.6158 Timer Late > .5 : 15.80647
2007-01-31 14:54:00.6492 Timer Late > .5 : 15.72846
2007-01-31 14:54:00.6598 Timer Late > .5 : 15.16028
2007-01-31 14:54:00.6735 Timer Late > .5 : 15.15737
2007-01-31 14:54:00.6882 Timer Late > .5 : 11.68723

Eric Carroll
2007-01-31, 13:23
if you go to server settings - debugging web page you should be able to turn on the asyc http debugging.

Oh, excellent! I have turned it on.

Edit:

Here is the output of the first run:

2007-01-31 15:38:43.4406 SimpleAsyncHTTP: GETing http://www.weather.com/weather/local/CAXX0504
2007-01-31 15:38:43.4625 Async: Starting async DNS lookup for [www.weather.com] using server [192.168.1.1] [timeout 5]
2007-01-31 15:38:43.5530 Async: Resolved www.weather.com to [63.111.24.75]
2007-01-31 15:38:43.5582 Async: Connecting to www.weather.com:80
2007-01-31 15:38:43.6222 Async: connected, ready to write request
2007-01-31 15:38:43.6318 Async: Sending:
GET /weather/local/CAXX0504 HTTP/1.1
Connection: close
Cache-Control: no-cache
Accept: */*
Accept-Encoding: gzip, deflate
Host: www.weather.com
User-Agent: iTunes/4.7.1 (Linux; N; Linux; sparc64-linux; EN; utf8) SlimServer/6.5.1/10479
Icy-MetaData: 1

2007-01-31 15:38:43.7137 Async::HTTP: Headers read. code: 302 status: Moved Temporarily
2007-01-31 15:38:43.7438 bless({
connection => "Keep-Alive",
"content-length" => 0,
"content-type" => "text/plain",
date => "Wed, 31 Jan 2007 20:38:43 GMT",
"keep-alive" => "timeout=1, max=7499",
location => "http://www.weather.com/outlook/travel/businesstraveler/local/CAXX0504",
server => "Apache",
}, "HTTP::Headers")
2007-01-31 15:38:43.7565 Async::HTTP: Redirecting to http://www.weather.com/outlook/travel/businesstraveler/local/CAXX0504
2007-01-31 15:38:43.7695 Async: Starting async DNS lookup for [www.weather.com] using server [192.168.1.1] [timeout 5]
2007-01-31 15:38:43.9412 Async: Resolved www.weather.com to [63.111.24.75]
2007-01-31 15:38:43.9462 Async: Connecting to www.weather.com:80
2007-01-31 15:38:44.0058 Async: connected, ready to write request
2007-01-31 15:38:44.0151 Async: Sending:
GET /outlook/travel/businesstraveler/local/CAXX0504 HTTP/1.1
Connection: close
Cache-Control: no-cache
Accept: */*
Accept-Encoding: gzip, deflate
Host: www.weather.com
User-Agent: iTunes/4.7.1 (Linux; N; Linux; sparc64-linux; EN; utf8) SlimServer/6.5.1/10479
Icy-MetaData: 1

2007-01-31 15:38:44.1462 Async::HTTP: Headers read. code: 200 status: OK
2007-01-31 15:38:44.1637 bless({
"cache-control" => "private",
connection => "Keep-Alive",
"content-encoding" => "gzip",
"content-type" => "text/html;charset=ISO-8859-1",
date => "Wed, 31 Jan 2007 20:38:44 GMT",
"keep-alive" => "timeout=1, max=7490",
server => "Apache",
"transfer-encoding" => "chunked",
}, "HTTP::Headers")
2007-01-31 15:38:44.1856 Async::HTTP: Read body: 769 bytes
2007-01-31 15:38:44.2023 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.2194 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.2364 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.2530 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.2702 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.2874 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.3045 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.3213 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.3378 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.3545 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.3708 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.3870 Async::HTTP: Read body: 983 bytes
2007-01-31 15:38:44.4043 Async::HTTP: Read body: 1016 bytes
2007-01-31 15:38:44.4206 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.4368 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.4535 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.4697 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.4863 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.5026 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.5193 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.5359 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.5522 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.5689 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.5851 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.6014 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.6176 Async::HTTP: Read body: 685 bytes
2007-01-31 15:38:44.6354 Async::HTTP: Read body: 1016 bytes
2007-01-31 15:38:44.6517 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.6679 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.6841 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.7004 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.7173 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.7335 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.7521 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.7904 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.8067 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.8229 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.8391 Async::HTTP: Read body: 137 bytes
2007-01-31 15:38:44.8563 Async::HTTP: Read body: 1016 bytes
2007-01-31 15:38:44.8726 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.8891 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.9053 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.9215 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.9378 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.9550 Async::HTTP: Read body: 1024 bytes
2007-01-31 15:38:44.9709 Async::HTTP: Read body: 686 bytes
2007-01-31 15:38:44.9970 Async::HTTP: Body read
2007-01-31 15:38:45.0020 SimpleAsyncHTTP: status for http://www.weather.com/weather/local/CAXX0504 is 200 OK
2007-01-31 15:38:45.0083 Decompressing gzip'ed content
2007-01-31 15:38:45.0488 SimpleAsyncHTTP: Done
2007-01-31 15:39:00.8249 Select Task > .5 : 15.83802 Slim::Networking::Async::HTTP::_http_read_body
2007-01-31 15:39:00.8469 Timer Late > .5 : 15.87499
2007-01-31 15:39:00.8598 Response Time > .5 : 15.87341
2007-01-31 15:39:00.8683 Timer Late > .5 : 15.88924
2007-01-31 15:39:00.8831 Timer Late > .5 : 15.13980
2007-01-31 15:39:00.9190 Timer Late > .5 : 14.93920
2007-01-31 15:39:00.9328 Timer Late > .5 : 14.93184

Ok, I'll bite, where is the 15 seconds coming in to play here indicated by the performance alarm? The start time is exactly 45 minutes from the previous alarm time.

Triode
2007-01-31, 15:17
Ok so the SlimpleAsync calls a callback after printing the debug line:

$::d_http_async && msg("SimpleAsyncHTTP: Done\n");

$self->cb->( $self );

return;

So I would suggest this this callback is part of the plugin which processes the http response. Try with all plugins disabled.

Eric Carroll
2007-01-31, 16:04
I am not sure I understand what you are suggesting here.

Are you saying that since the time to complete to the debug is about 2 seconds, the balance of the time is in the callback, which is in the plugin (specifically SuperDateTime Screensaver)?

If I disable the plugin, SimpleAsyncHTTP will never get invoked anyway, so its sort of moot.

Can I debug that callback time to confirm? I guess i could try it on the PC and put another debug in there before the return.

Triode
2007-01-31, 16:14
Yes - it looks like the 15 seconds is spent in the plugin's callback. Unfortunately the server is not instumented enough to prove this, but you can infer if from the fact that it occurs after the "Done" debug entry is printed.

As you can install plugins, you could edit the pluigin perl code to time itself and print the time taken to the log. Which version of SuperDateTime are you using?

Triode
2007-01-31, 16:22
Looks to me that sub gotWeather is just taking too long on your cpu.

To prove this edit the SuperDateTime Plugin.pm code to add something like:

msg("start of SDT gotWeather\n"); at the start of the function and another one at the end.

Then you should see these messages + timestamps in the log.

Eric Carroll
2007-01-31, 16:38
Thank you very much for all your guidance in looking at this.

15 seconds even on the NV+ CPU is a very long time. I have asked Infrant what version they integrated. I will take a quick look at the code too, and perhaps post to the developer at this point.

Now the really hard question. Any plans to thread SS for version 7?

Would another pair of hands make it more likely to happen?

Triode
2007-01-31, 16:55
There has been discussion about threading for the long term, but it's not in the current scope of 7.0 (see wiki page on this).

My personal view is that we can achieve pretty good performance with the current architecture, but need to be aware of it when coding and have enough instumentation to find performance issues - which is where the health pages came from. I'll look to add a bit more to measure async http callbacks.

Feel free to bring any suggestions and code to the developers forum..

Ron Olsen
2007-01-31, 18:05
Eric,

Glad you identified the source of the problem. With the weather plugin disabled, are you still having perfomance issues?

I use the National Weather Service forecast at http://www.nws.noaa.gov/. It may not be as convenient as having the information on the Squeezebox display, but there are no performance impacts.

Eric Carroll
2007-01-31, 22:53
Which version of SuperDateTime are you using?
Infrant's plugin package for 6.5.1 (not final) contains SuperDateTime v5.0.0b4.

I see the current rev is 5.0.4. Not sure if that is significant or not.

I looked at the code. While I admit my perl is a bit rusty, I just don't see 15 seconds of CPU time there. I do see a d_plugins, so let me try enabling that.

Eric Carroll
2007-01-31, 23:14
Eric,

Glad you identified the source of the problem. With the weather plugin disabled, are you still having perfomance issues?

I use the National Weather Service forecast at http://www.nws.noaa.gov/. It may not be as convenient as having the information on the Squeezebox display, but there are no performance impacts.

My wife is a weather hound. Weather on the SB3 was a key selling point. WAF - need I say more?

I have turned off the underground weather, cutting the http_gets in half. I have set the http_get time to 45 minutes. Performance is massively improved. But it still doesn't explain 15 clock seconds in the callback. I would like to fix this... I guess this gets to be my first SS project...

Couldn't have done it without Triode. He gets the thanks for localizing the problem. I just collected the data.