Search below for 'logbook_sop' for help on usage.
| November 2009 | ||||||
|---|---|---|---|---|---|---|
| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| 15 | 16 | 17 | 18 | 19 | 20 | 21 |
| 22 | 23 | 24 | 25 | 26 | 27 | 28 |
| 29 | 30 | |||||
By: TeX
Time: 2008037 15:03Z
0844Z Email from dawn last night:
Hello,
The sps-cvmaster "workq" queue is not working this morning; all
the jobs on it are stuck in " 00:00:00" time. Jobs are beginning to pile
up in the queue and we are getting a backlog of data. We are at a critical
stage in C&V; the queue must be functioning in order for us to keep up
with the normal data and process all the special data that we take.
14:30Z(approx) I look on sps-cvmaster and at first glance looked good,
email to the usual suspects asking if it was working or not.
14:29Z reply from dawn
This is not okay, this is the way the queue has been stuck all
morning; none of the jobs are actually running, they are all stuck in 0:00
mode. I have not heard back from anyone yet on this issue (it's still a
little early here in the US.)
1435Z another email from dawn
It also looks like some jobs have simply dropped out of the queue,
for example job 112042.sps-cvmaster was just hanging in the queue a couple
of hours ago, and now it's gone, and it certainly didn't run because the
data file is still there. Anything in the logs to show what happened
there?
15:00Z on sps-cvmaster, qstat shows:
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
112065.sps-cvmast VMiJEBraw_0002. cnv 00:00:00 R workq
112066.sps-cvmast VMiJEBraw_0003. cnv 00:00:00 R workq
112067.sps-cvmast VMiJEBraw_0000. cnv 00:00:00 R workq
112068.sps-cvmast VMiJEBraw_0001. cnv 00:00:00 R workq
112069.sps-cvmast VMiJEBraw_0002. cnv 00:00:00 R workq
112070.sps-cvmast VMiJEBraw_0003. cnv 00:00:00 R workq
112071.sps-cvmast VMiJEBraw_0000. cnv 00:00:00 R workq
112072.sps-cvmast VMiJEBraw_0001. cnv 00:00:00 R workq
112073.sps-cvmast VMiJEBraw_0002. cnv 00:00:00 R workq
112074.sps-cvmast VMiJEBraw_0003. cnv 00:00:00 R workq
112075.sps-cvmast VMiJEBraw_0000. cnv 00:00:00 R workq
112076.sps-cvmast VMiJEBraw_0001. cnv 00:00:00 R workq
112077.sps-cvmast VMiJEBraw_0002. cnv 00:00:00 R workq
112078.sps-cvmast VMiJEBraw_0003. cnv 00:00:00 R workq
112079.sps-cvmast VMiJEBraw_0000. cnv 00:00:00 R workq
112080.sps-cvmast VMiJEBraw_0001. cnv 00:00:00 R workq
112081.sps-cvmast VMiJEBraw_0002. cnv 00:00:00 R workq
112082.sps-cvmast VMiJEBraw_0000. cnv 00:00:00 R workq
112083.sps-cvmast VMiJEBraw_0001. cnv 00:00:00 R workq
112084.sps-cvmast VMiJEBraw_0002. cnv 00:00:00 R workq
112085.sps-cvmast VMiJEBraw_0003. cnv 0 Q workq
112086.sps-cvmast VMiJEBraw_0000. cnv 0 Q workq
u.s.w. ...
==> sched_logs/20080206 <==
02/06/2008 15:05:04;0040;pbs_sched;Job;112156.sps-cvmaster;No available resources on nodes
02/06/2008 15:05:04;0080;pbs_sched;Req;;Leaving Scheduling Cycle
==> server_logs/20080206 <==
02/06/2008 15:07:01;0100;Server@sps-cvmaster;Req;;Type 21 request received from barnetst@sps-cvmaster.sps.icecube.southpole.usap.gov, sock=10
02/06/2008 15:07:01;0100;Server@sps-cvmaster;Req;;Type 19 request received from barnetst@sps-cvmaster.sps.icecube.southpole.usap.gov, sock=10
[enielsen@sps-cvmaster PBS]$ date
Notice no mom_log for the day? Its last log was:
02/05/2008 21:16:54;0100;pbs_mom;Req;;Type 6 request received from PBS_Server@sps-cvmaster.sps.icecube.southpole.usap.gov, sock=10
oh yeah:
[cnv@sps-cvmaster ~]$ qstat -B
Server Max Tot Que Run Hld Wat Trn Ext Status
---------------- ----- ----- ----- ----- ----- ----- ----- ----- -----------
sps-cvmaster 0 91 71 20 0 0 0 0 Active
15:11Z Run another qstat:
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
112066.sps-cvmast VMiJEBraw_0003. cnv 00:00:00 R workq
112067.sps-cvmast VMiJEBraw_0000. cnv 00:00:00 R workq
112068.sps-cvmast VMiJEBraw_0001. cnv 00:00:00 R workq
112069.sps-cvmast VMiJEBraw_0002. cnv 00:00:00 R workq
112070.sps-cvmast VMiJEBraw_0003. cnv 00:00:00 R workq
112071.sps-cvmast VMiJEBraw_0000. cnv 00:00:00 R workq
112072.sps-cvmast VMiJEBraw_0001. cnv 00:00:00 R workq
112073.sps-cvmast VMiJEBraw_0002. cnv 00:00:00 R workq
112074.sps-cvmast VMiJEBraw_0003. cnv 00:00:00 R workq
112075.sps-cvmast VMiJEBraw_0000. cnv 00:00:00 R workq
112076.sps-cvmast VMiJEBraw_0001. cnv 00:00:00 R workq
112077.sps-cvmast VMiJEBraw_0002. cnv 00:00:00 R workq
112078.sps-cvmast VMiJEBraw_0003. cnv 00:00:00 R workq
112079.sps-cvmast VMiJEBraw_0000. cnv 00:00:00 R workq
112080.sps-cvmast VMiJEBraw_0001. cnv 00:00:00 R workq
112081.sps-cvmast VMiJEBraw_0002. cnv 00:00:00 R workq
112082.sps-cvmast VMiJEBraw_0000. cnv 00:00:00 R workq
112083.sps-cvmast VMiJEBraw_0001. cnv 00:00:00 R workq
112084.sps-cvmast VMiJEBraw_0002. cnv 00:00:00 R workq
112085.sps-cvmast VMiJEBraw_0003. cnv 00:00:00 R workq
112086.sps-cvmast VMiJEBraw_0000. cnv 0 Q workq
112087.sps-cvmast VMiJEBraw_0001. cnv 0 Q workq
u.s.w. ....
Note that the jobs have moved up, 112065 has disappeared.
Let us look at the error log for the file, it ends with:
/var/spool/PBS/mom_priv/jobs/112065.sps-cvmaster.SC: entering obtain_io_lockfile()
/var/spool/PBS/mom_priv/jobs/112065.sps-cvmaster.SC: IO_LOCKFILE_DIRECTORY = /mnt/data/cnv/io_lockfile_directory
/var/spool/PBS/mom_priv/jobs/112065.sps-cvmaster.SC: io_lockfile.public = /mnt/data/cnv/io_lockfile_directory/io_lockfile
/var/spool/PBS/mom_priv/jobs/112065.sps-cvmaster.SC: io_lockfile.private = /mnt/data/cnv/io_lockfile_directory/io_lockfile.112065.sps-cvmaster
/var/spool/PBS/mom_priv/jobs/112065.sps-cvmaster.SC: executing lockfile for io_lock.public & io_lock.private w/delay time of 13 seconds
=>> PBS: job killed: walltime 14458 exceeded limit 14400
lockfile: Signal received, giving up on "/mnt/data/cnv/io_lockfile_directory/io_lockfile"
/mnt/data/cnv/io_lockfile_directory is a local file system.
I can see files getting created and destroyed, but not ones with
'iolockfile.*', more like:
ls -l /mnt/data/cnv/io_lockfile_directory/
total 12
---------- 1 cnv cnv 0 Feb 22 1992 _gtG.7HdqHB.sps-fpslave16.sps
-r-------- 1 cnv cnv 1 Feb 5 20:16 io_lockfile
-r-------- 1 cnv cnv 1 Feb 5 20:16 io_lockfile.111988.sps-cvmaster
-r--r--r-- 1 cnv cnv 1 Feb 6 15:27 _XaD+jHdqHB.sps-fpslave16.sps
[cnv@sps-cvmaster pbs]$ !!
ls -l /mnt/data/cnv/io_lockfile_directory/
total 8
-r-------- 1 cnv cnv 0 Feb 6 15:28 _BNC+aHdqHB.sps-fpslave15.sps
-r-------- 1 cnv cnv 1 Feb 5 20:16 io_lockfile
-r-------- 1 cnv cnv 1 Feb 5 20:16 io_lockfile.111988.sps-cvmaster
Let us take a look at the 112066 job as I don't see
a way to look at 112065.
qstat -f 112066.sps-cvmaster
Job Id: 112066.sps-cvmaster
Job_Name = VMiJEBraw_0003.
Job_Owner = cnv@sps-cvmaster.sps.icecube.southpole.usap.gov
resources_used.cpupercent = 0
resources_used.cput = 00:00:00
resources_used.mem = 4268kb
resources_used.ncpus = 1
resources_used.vmem = 168036kb
resources_used.walltime = 03:51:18
job_state = R
queue = workq
server = sps-cvmaster
Checkpoint = u
ctime = Wed Feb 6 06:21:55 2008
Error_Path = sps-cvmaster.sps.icecube.southpole.usap.gov:/mnt/data/cnv/pdaq
_processing/PFRaw_processing_Jan08/pbs/VMiJEBraw_0003..e112066
exec_host = sps-fpslave18/3
exec_vnode = (sps-cvslave02:ncpus=1)
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Wed Feb 6 11:23:24 2008
Output_Path = sps-cvmaster.sps.icecube.southpole.usap.gov:/mnt/data/cnv/pda
q_processing/PFRaw_processing_Jan08/pbs/VMiJEBraw_0003..o112066
Priority = 0
qtime = Wed Feb 6 06:22:23 2008
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=1
Resource_List.place = scatter
Resource_List.select = 1:ncpus=1
Resource_List.walltime = 04:00:00
stime = 1202297004
session_id = 5030
job_dir = /usr/local/cnv
Variable_List = PBS_O_HOME=/usr/local/cnv,PBS_O_LOGNAME=cnv,
PBS_O_PATH=/usr/pbs/bin:/usr/java/jdk1.5.0_07/bin:/usr/local/bin:/bin:
/usr/bin:/usr/X11R6/bin:/usr/local/cnv:/usr/bin:/bin,
PBS_O_SHELL=/bin/sh,
PBS_O_HOST=sps-cvmaster.sps.icecube.southpole.usap.gov,
PBS_O_WORKDIR=/mnt/data/cnv/pdaq_processing/PFRaw_processing_Jan08/pbs
,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq
comment = Job run at Wed Feb 06 at 11:23 on (sps-cvslave02:ncpus=1)
etime = Wed Feb 6 06:22:23 2008
I don't know much about PBS, but this looks like it didn't do anything
since the CPU time is 0. Was queued at 11:23 and finished at 06:22.
15:22Z OK, let us do an strace on the mom.orig process:
[enielsen@sps-cvmaster server_logs]$ sudo strace -p 25671
Process 25671 attached - interrupt to quit
select(1024, [5 6 7 8], NULL, NULL, {0, 640000}) = 0 (Timeout)
time(NULL) = 1202311352
rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], NULL, 8) = 0
time(NULL) = 1202311352
time(NULL) = 1202311352
recvfrom(7, 0x40220140, 4096, 0, 0x7fbffff690, 0x7fbffff6ac) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(8, 0x40220140, 4096, 0, 0x7fbffff690, 0x7fbffff6ac) = -1 EAGAIN (Resource temporarily unavailable)
time(NULL) = 1202311352
rt_sigprocmask(SIG_UNBLOCK, [HUP INT TERM CHLD], NULL, 8) = 0
select(1024, [5 6 7 8], NULL, NULL, {10, 0}) = 0 (Timeout)
time(NULL) = 1202311362
rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], NULL, 8) = 0
time(NULL) = 1202311362
open("/proc/loadavg", O_RDONLY) = 10
fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2a95aef000
read(10, "2.48 3.01 3.01 2/195 8359\n", 1024) = 26
close(10) = 0
munmap(0x2a95aef000, 4096) = 0
time(NULL) = 1202311362
recvfrom(7, 0x40220140, 4096, 0, 0x7fbffff690, 0x7fbffff6ac) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(8, 0x40220140, 4096, 0, 0x7fbffff690, 0x7fbffff6ac) = -1 EAGAIN (Resource temporarily unavailable)
time(NULL) = 1202311362
rt_sigprocmask(SIG_UNBLOCK, [HUP INT TERM CHLD], NULL, 8) = 0
select(1024, [5 6 7 8], NULL, NULL, {10, 0}) = 0 (Timeout)
time(NULL) = 1202311372
rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], NULL, 8) = 0
time(NULL) = 1202311372
time(NULL) = 1202311372
I watched for a while, nothing else happens. I lean to mom once again be
in an unideal state and should be rebooted.
15:35Z Darryn at office now. Looks like Barnett is too. Sent info via write
to both of them, then to skype. I'm off this for now to take a shower.
oh yeah, barnett is thinking is a cnv script issue.
16:00Z Confirmed, evidentally cnv scripts do not have stale lock handling
Edgar Nielsen
|
06 Feb 2008 10:26 GMT
|
Ice Cube/PBS
|
|