Home
South Pole Logbook

Search below for 'logbook_sop' for help on usage.

Sections

Search

Archives

November 2009
Sun Mon Tue Wed Thu Fri Sat
         

RSS Feed

Powered by Blosxom


Feb 06, 2008

More reported PBS problems on sps-cvmaster


By: TeX
Time: 2008037 15:03Z

0844Z Email from dawn last night:
        Hello,

        The sps-cvmaster "workq" queue is not working this morning; all
	the jobs on it are stuck in " 00:00:00" time. Jobs are beginning to pile
	up in the queue and we are getting a backlog of data. We are at a critical
	stage in C&V; the queue must be functioning in order for us to keep up
	with the normal data and process all the special data that we take.

14:30Z(approx) I look on sps-cvmaster and at first glance looked good,
	email to the usual suspects asking if it was working or not.


14:29Z reply from dawn
        This is not okay, this is the way the queue has been stuck all 
	morning; none of the jobs are actually running, they are all stuck in 0:00 
	mode. I have not heard back from anyone yet on this issue (it's still a 
	little early here in the US.)


1435Z another email from dawn
	It also looks like some jobs have simply dropped out of the queue, 
	for example job 112042.sps-cvmaster was just hanging in the queue a couple 
	of hours ago, and now it's gone, and it certainly didn't run because the 
	data file is still there. Anything in the logs to show what happened 
	there?


15:00Z on sps-cvmaster, qstat shows:
	Job id            Name             User              Time Use S Queue
	----------------  ---------------- ----------------  -------- - -----
	112065.sps-cvmast VMiJEBraw_0002.  cnv               00:00:00 R workq
	112066.sps-cvmast VMiJEBraw_0003.  cnv               00:00:00 R workq
	112067.sps-cvmast VMiJEBraw_0000.  cnv               00:00:00 R workq
	112068.sps-cvmast VMiJEBraw_0001.  cnv               00:00:00 R workq
	112069.sps-cvmast VMiJEBraw_0002.  cnv               00:00:00 R workq
	112070.sps-cvmast VMiJEBraw_0003.  cnv               00:00:00 R workq
	112071.sps-cvmast VMiJEBraw_0000.  cnv               00:00:00 R workq
	112072.sps-cvmast VMiJEBraw_0001.  cnv               00:00:00 R workq
	112073.sps-cvmast VMiJEBraw_0002.  cnv               00:00:00 R workq
	112074.sps-cvmast VMiJEBraw_0003.  cnv               00:00:00 R workq
	112075.sps-cvmast VMiJEBraw_0000.  cnv               00:00:00 R workq
	112076.sps-cvmast VMiJEBraw_0001.  cnv               00:00:00 R workq
	112077.sps-cvmast VMiJEBraw_0002.  cnv               00:00:00 R workq
	112078.sps-cvmast VMiJEBraw_0003.  cnv               00:00:00 R workq
	112079.sps-cvmast VMiJEBraw_0000.  cnv               00:00:00 R workq
	112080.sps-cvmast VMiJEBraw_0001.  cnv               00:00:00 R workq
	112081.sps-cvmast VMiJEBraw_0002.  cnv               00:00:00 R workq
	112082.sps-cvmast VMiJEBraw_0000.  cnv               00:00:00 R workq
	112083.sps-cvmast VMiJEBraw_0001.  cnv               00:00:00 R workq
	112084.sps-cvmast VMiJEBraw_0002.  cnv               00:00:00 R workq
	112085.sps-cvmast VMiJEBraw_0003.  cnv                      0 Q workq
	112086.sps-cvmast VMiJEBraw_0000.  cnv                      0 Q workq
	u.s.w. ...


	==> sched_logs/20080206 <==
	02/06/2008 15:05:04;0040;pbs_sched;Job;112156.sps-cvmaster;No available resources on nodes
	02/06/2008 15:05:04;0080;pbs_sched;Req;;Leaving Scheduling Cycle

	==> server_logs/20080206 <==
	02/06/2008 15:07:01;0100;Server@sps-cvmaster;Req;;Type 21 request received from barnetst@sps-cvmaster.sps.icecube.southpole.usap.gov, sock=10
	02/06/2008 15:07:01;0100;Server@sps-cvmaster;Req;;Type 19 request received from barnetst@sps-cvmaster.sps.icecube.southpole.usap.gov, sock=10
	[enielsen@sps-cvmaster PBS]$ date


	Notice no mom_log for the day? Its last log was:
	02/05/2008 21:16:54;0100;pbs_mom;Req;;Type 6 request received from PBS_Server@sps-cvmaster.sps.icecube.southpole.usap.gov, sock=10


	oh yeah:
	[cnv@sps-cvmaster ~]$ qstat -B
	Server             Max   Tot   Que   Run   Hld   Wat   Trn   Ext Status
	---------------- ----- ----- ----- ----- ----- ----- ----- ----- -----------
	sps-cvmaster         0    91    71    20     0     0     0     0 Active


15:11Z Run another qstat:
	Job id            Name             User              Time Use S Queue
	----------------  ---------------- ----------------  -------- - -----
	112066.sps-cvmast VMiJEBraw_0003.  cnv               00:00:00 R workq
	112067.sps-cvmast VMiJEBraw_0000.  cnv               00:00:00 R workq
	112068.sps-cvmast VMiJEBraw_0001.  cnv               00:00:00 R workq
	112069.sps-cvmast VMiJEBraw_0002.  cnv               00:00:00 R workq
	112070.sps-cvmast VMiJEBraw_0003.  cnv               00:00:00 R workq
	112071.sps-cvmast VMiJEBraw_0000.  cnv               00:00:00 R workq
	112072.sps-cvmast VMiJEBraw_0001.  cnv               00:00:00 R workq
	112073.sps-cvmast VMiJEBraw_0002.  cnv               00:00:00 R workq
	112074.sps-cvmast VMiJEBraw_0003.  cnv               00:00:00 R workq
	112075.sps-cvmast VMiJEBraw_0000.  cnv               00:00:00 R workq
	112076.sps-cvmast VMiJEBraw_0001.  cnv               00:00:00 R workq
	112077.sps-cvmast VMiJEBraw_0002.  cnv               00:00:00 R workq
	112078.sps-cvmast VMiJEBraw_0003.  cnv               00:00:00 R workq
	112079.sps-cvmast VMiJEBraw_0000.  cnv               00:00:00 R workq
	112080.sps-cvmast VMiJEBraw_0001.  cnv               00:00:00 R workq
	112081.sps-cvmast VMiJEBraw_0002.  cnv               00:00:00 R workq
	112082.sps-cvmast VMiJEBraw_0000.  cnv               00:00:00 R workq
	112083.sps-cvmast VMiJEBraw_0001.  cnv               00:00:00 R workq
	112084.sps-cvmast VMiJEBraw_0002.  cnv               00:00:00 R workq
	112085.sps-cvmast VMiJEBraw_0003.  cnv               00:00:00 R workq
	112086.sps-cvmast VMiJEBraw_0000.  cnv                      0 Q workq
	112087.sps-cvmast VMiJEBraw_0001.  cnv                      0 Q workq

	u.s.w. ....

	Note that the jobs have moved up, 112065 has disappeared.
	Let us look at the error log for the file, it ends with:
	/var/spool/PBS/mom_priv/jobs/112065.sps-cvmaster.SC: entering obtain_io_lockfile()
	/var/spool/PBS/mom_priv/jobs/112065.sps-cvmaster.SC: IO_LOCKFILE_DIRECTORY = /mnt/data/cnv/io_lockfile_directory
	/var/spool/PBS/mom_priv/jobs/112065.sps-cvmaster.SC: io_lockfile.public    = /mnt/data/cnv/io_lockfile_directory/io_lockfile
	/var/spool/PBS/mom_priv/jobs/112065.sps-cvmaster.SC: io_lockfile.private   = /mnt/data/cnv/io_lockfile_directory/io_lockfile.112065.sps-cvmaster
	/var/spool/PBS/mom_priv/jobs/112065.sps-cvmaster.SC: executing lockfile for io_lock.public & io_lock.private w/delay time of 13 seconds
	=>> PBS: job killed: walltime 14458 exceeded limit 14400
	lockfile: Signal received, giving up on "/mnt/data/cnv/io_lockfile_directory/io_lockfile"

	/mnt/data/cnv/io_lockfile_directory is a local file system.
	I can see files getting created and destroyed, but not ones with
	'iolockfile.*', more like:
		ls -l /mnt/data/cnv/io_lockfile_directory/
		total 12
		----------  1 cnv cnv 0 Feb 22  1992 _gtG.7HdqHB.sps-fpslave16.sps
		-r--------  1 cnv cnv 1 Feb  5 20:16 io_lockfile
		-r--------  1 cnv cnv 1 Feb  5 20:16 io_lockfile.111988.sps-cvmaster
		-r--r--r--  1 cnv cnv 1 Feb  6 15:27 _XaD+jHdqHB.sps-fpslave16.sps
		[cnv@sps-cvmaster pbs]$ !!
		ls -l /mnt/data/cnv/io_lockfile_directory/
		total 8
		-r--------  1 cnv cnv 0 Feb  6 15:28 _BNC+aHdqHB.sps-fpslave15.sps
		-r--------  1 cnv cnv 1 Feb  5 20:16 io_lockfile
		-r--------  1 cnv cnv 1 Feb  5 20:16 io_lockfile.111988.sps-cvmaster


	Let us take a look at the 112066 job as I don't see
	a way to look at 112065.

	qstat -f 112066.sps-cvmaster
	Job Id: 112066.sps-cvmaster
	Job_Name = VMiJEBraw_0003.
	Job_Owner = cnv@sps-cvmaster.sps.icecube.southpole.usap.gov
	resources_used.cpupercent = 0
	resources_used.cput = 00:00:00
	resources_used.mem = 4268kb
	resources_used.ncpus = 1
	resources_used.vmem = 168036kb
	resources_used.walltime = 03:51:18
	job_state = R
	queue = workq
	server = sps-cvmaster
	Checkpoint = u
	ctime = Wed Feb  6 06:21:55 2008
	Error_Path = sps-cvmaster.sps.icecube.southpole.usap.gov:/mnt/data/cnv/pdaq
	_processing/PFRaw_processing_Jan08/pbs/VMiJEBraw_0003..e112066
	exec_host = sps-fpslave18/3
	exec_vnode = (sps-cvslave02:ncpus=1)
	Hold_Types = n
	Join_Path = oe
	Keep_Files = n
	Mail_Points = a
	mtime = Wed Feb  6 11:23:24 2008
	Output_Path = sps-cvmaster.sps.icecube.southpole.usap.gov:/mnt/data/cnv/pda
	q_processing/PFRaw_processing_Jan08/pbs/VMiJEBraw_0003..o112066
	Priority = 0
	qtime = Wed Feb  6 06:22:23 2008
	Rerunable = True
	Resource_List.ncpus = 1
	Resource_List.nodect = 1
	Resource_List.nodes = 1:ppn=1
	Resource_List.place = scatter
	Resource_List.select = 1:ncpus=1
	Resource_List.walltime = 04:00:00
	stime = 1202297004
	session_id = 5030
	job_dir = /usr/local/cnv
	Variable_List = PBS_O_HOME=/usr/local/cnv,PBS_O_LOGNAME=cnv,
	PBS_O_PATH=/usr/pbs/bin:/usr/java/jdk1.5.0_07/bin:/usr/local/bin:/bin:
	/usr/bin:/usr/X11R6/bin:/usr/local/cnv:/usr/bin:/bin,
	PBS_O_SHELL=/bin/sh,
	PBS_O_HOST=sps-cvmaster.sps.icecube.southpole.usap.gov,
	PBS_O_WORKDIR=/mnt/data/cnv/pdaq_processing/PFRaw_processing_Jan08/pbs
	,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq
	comment = Job run at Wed Feb 06 at 11:23 on (sps-cvslave02:ncpus=1)
	etime = Wed Feb  6 06:22:23 2008


	I don't know much about PBS, but this looks like it didn't do anything
	since the CPU time is 0. Was queued at 11:23 and finished at 06:22.

15:22Z OK, let us do an strace on the mom.orig process:
	[enielsen@sps-cvmaster server_logs]$ sudo strace -p 25671
	Process 25671 attached - interrupt to quit
	select(1024, [5 6 7 8], NULL, NULL, {0, 640000}) = 0 (Timeout)
	time(NULL)                              = 1202311352
	rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], NULL, 8) = 0
	time(NULL)                              = 1202311352
	time(NULL)                              = 1202311352
	recvfrom(7, 0x40220140, 4096, 0, 0x7fbffff690, 0x7fbffff6ac) = -1 EAGAIN (Resource temporarily unavailable)
	recvfrom(8, 0x40220140, 4096, 0, 0x7fbffff690, 0x7fbffff6ac) = -1 EAGAIN (Resource temporarily unavailable)
	time(NULL)                              = 1202311352
	rt_sigprocmask(SIG_UNBLOCK, [HUP INT TERM CHLD], NULL, 8) = 0
	select(1024, [5 6 7 8], NULL, NULL, {10, 0}) = 0 (Timeout)
	time(NULL)                              = 1202311362
	rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], NULL, 8) = 0
	time(NULL)                              = 1202311362
	open("/proc/loadavg", O_RDONLY)         = 10
	fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
	mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2a95aef000
	read(10, "2.48 3.01 3.01 2/195 8359\n", 1024) = 26
	close(10)                               = 0
	munmap(0x2a95aef000, 4096)              = 0
	time(NULL)                              = 1202311362
	recvfrom(7, 0x40220140, 4096, 0, 0x7fbffff690, 0x7fbffff6ac) = -1 EAGAIN (Resource temporarily unavailable)
	recvfrom(8, 0x40220140, 4096, 0, 0x7fbffff690, 0x7fbffff6ac) = -1 EAGAIN (Resource temporarily unavailable)
	time(NULL)                              = 1202311362
	rt_sigprocmask(SIG_UNBLOCK, [HUP INT TERM CHLD], NULL, 8) = 0
	select(1024, [5 6 7 8], NULL, NULL, {10, 0}) = 0 (Timeout)
	time(NULL)                              = 1202311372
	rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], NULL, 8) = 0
	time(NULL)                              = 1202311372
	time(NULL)                              = 1202311372                                                          

	I watched for a while, nothing else happens. I lean to mom once again be
	in an unideal state and should be rebooted.


15:35Z Darryn at office now. Looks like Barnett is too. Sent info via write
	to both of them, then to skype. I'm off this for now to take a shower.

	oh yeah, barnett is thinking is a cnv script issue.

16:00Z Confirmed, evidentally cnv scripts do not have stale lock handling


Edgar Nielsen | 06 Feb 2008 10:26 GMT | Ice Cube/PBS | | permalink