South Pole Logbook

Feb 28, 2006

Investigating cause of sndaq not recovering from crash


Since sncount was upgraded to an HP DL380 at the end of the 05/06 austral summer, the supernova application has been less robust than before. Specifically, when it dies, it does not recover. The cron daemon does run the 'check_crash.pl' script every 10 minutes, but the script fails to restart the supernova application.

Running the check_crash script manually shows that it can talk to both craty and craty2 and can clear the lock files on both machines. The part that appears to fail is the call from check_crash to start_daq.sh. The first thing this shell script does (after running kill_daq.sh to ensure the software is not running) is to start sn_master and capture its stdout and stderr in 'master.out':

nohup ./sn_master > master.out 2>&1 &

Normally, master.out is empty. During a failed attempt to restart from a crash, the following text was found in it:

Error in : key 1157038081 not found at 259
Warning in : reference to object of unavailable class TObject, offset=1157038081 pointer will be 0
Error in : key 28 not found at 29
Warning in : reference to object of unavailable class TObject, offset=28 pointer will be 0
Error in : key 9090 not found at 37
Warning in : reference to object of unavailable class TList, offset=9090 pointer will be 0
Error in : object of class process_entry read too few bytes: 110 instead of 17635
Error in : key 9090 not found at 37
Warning in : reference to object of unavailable class TList, offset=9090 pointer will be 0
Error in : object of class process_entry read too few bytes: 105 instead of 353
Error in : key 9090 not found at 37
Warning in : reference to object of unavailable class TList, offset=9090 pointer will be 0
Error in : object of class process_entry read too few bytes: 91 instead of 206
Error in : key 9090 not found at 37
Warning in : reference to object of unavailable class TList, offset=9090 pointer will be 0
Error in : object of class process_entry read too few bytes: 90 instead of 205
Error in : key 9090 not found at 37
Warning in : reference to object of unavailable class TList, offset=9090 pointer will be 0
Error in : object of class process_entry read too few bytes: 113 instead of 148
Error in : data_entry, discarding: char* data, illegal [data_size] (must be Int_t)

Error in : object of class data_entry read too few bytes: 58 instead of 67
Error in : object of class data_entry read too few bytes: 57 instead of 71
Error in : key 9090 not found at 37
Warning in : reference to object of unavailable class TList, offset=9090 pointer will be 0
Error in : object of class process_entry read too few bytes: 113 instead of 148
Error in : object of class data_entry read too few bytes: 58 instead of 67
Error in : object of class data_entry read too few bytes: 56 instead of 71

Manually rerunning the start_daq.sh script started sn_master quietly, which permitted the rest of the start_daq process to proceed.
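
Since master.out is normally empty and is overwritten on every start, a non-empty file is both a failure signal and evidence that the next restart attempt would destroy. A minimal sketch of a check for this follows. The function name check_master_log and the timestamped copy are assumptions for illustration and are not part of the existing start_daq.sh or check_crash.pl.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original start_daq.sh.
# Exit status: 0 if the log exists and is empty (the normal case),
#              1 if the log has content (a copy is saved first),
#              2 if the log does not exist at all.
check_master_log() {
    log="$1"
    if [ ! -e "$log" ]; then
        echo "$log does not exist" >&2
        return 2
    fi
    if [ -s "$log" ]; then
        # Non-empty log: keep a copy named after the current UTC time
        # so that the next restart attempt does not overwrite it.
        saved="$log.$(date -u +%Y%m%dT%H%M%SZ)"
        cp "$log" "$saved"
        echo "sn_master wrote to $log; copy saved as $saved" >&2
        return 1
    fi
    return 0
}
```

It could be called a few seconds after the nohup line, for example as `check_master_log master.out`, with the exit status deciding whether the rest of the start sequence goes ahead.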

Investigations continue.
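
One avenue for that investigation, offered as an assumption rather than a finding: start_daq.sh behaves differently when run by hand than when called from check_crash, and the launch line uses a relative path (./sn_master), so the working directory and environment in the two cases are worth comparing. The sketch below records both. The function name snapshot_env and the output file names in the usage note are hypothetical.

```shell
#!/bin/sh
# Hypothetical diagnostic helper; not part of the original scripts.
# Writes the current working directory and the sorted environment
# to the file given as the first argument.
snapshot_env() {
    out="$1"
    {
        echo "cwd=$(pwd)"
        env | sort
    } > "$out"
}
```

Calling `snapshot_env /tmp/start_daq.env.$$` at the top of start_daq.sh would leave one file per run, named after the process ID. A file from a manual run and a file from a run started by check_crash could then be compared with diff.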

Ethan Dicks | 28 Feb 2006 20:18 GMT | AMANDA/Supernova