
Re: Jobs hanging - JEventProcessor::fini() not being called





Hi Claire,

    This looks like you may have identified a bug in the framework. The 
lines starting with "Thread 0 hasn't responded ..." indicate that a 
processing thread (apparently the only processing thread) is stuck: it 
has not processed an event for 30 seconds, and it has not de-registered 
itself to signal that it has finished processing events altogether. 
When the framework detects this situation, it tries to get rid of the 
thread by sending it a HUP signal, which tells the thread to exit. Once 
that happens, the framework creates another processing thread to take 
its place. You can see this happening a couple of lines further down, 
where it says "Launching new thread ...".
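
    To make the detection step concrete, here is a minimal sketch of 
the kind of check described above. It is not JANA's actual code: the 
WorkerState struct, its fields, and is_stalled() are hypothetical names 
chosen for illustration, and the signalling and relaunch steps are left 
out.

```cpp
#include <atomic>

// Hypothetical per-thread bookkeeping (illustrative names, not JANA's).
struct WorkerState {
    std::atomic<long long> last_heartbeat_ms{0};  // time the last event finished
    std::atomic<bool> finished{false};            // true once the thread de-registers
};

// A thread counts as stalled only if it has not de-registered AND
// more than timeout_ms has passed since it last finished an event.
bool is_stalled(const WorkerState& w, long long now_ms, long long timeout_ms) {
    if (w.finished) {
        return false;  // a thread that finished cleanly is never "stuck"
    }
    return (now_ms - w.last_heartbeat_ms) > timeout_ms;
}
```

    A monitoring loop would call is_stalled() periodically for each 
processing thread and, when it returns true, delist the thread, signal 
it to exit, and launch a replacement.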

    This admittedly complicated scheme is intended to address the 
situation where a single, anomalous event causes some piece of 
reconstruction code to get stuck in a (semi-)infinite loop. Such a hang 
is extremely annoying when you are 300k events (or even 10 events) into 
a file and you are more than willing to sacrifice that *one* event in 
order to process the rest of the file.

    I'll have to look at this more closely, but I have a strong 
suspicion that you're getting into a deadlock situation where 2 threads 
are trying to lock the same mutex and the guy with the lock is waiting 
on the guy without the lock to do something. This is buried deep in 
JANA, and figuring it out completely and implementing a solution will 
take some time. As such, it won't happen this week.
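
    The suspected pattern can be sketched as follows. This is only an 
illustration of that kind of deadlock, not the actual JANA code path: 
worker_finishes() and its argument are hypothetical, and the worker 
uses a timed lock attempt so that the example gives up after 200 ms 
instead of hanging forever.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Returns true if the worker managed to report that it is done.
// holder_releases_lock selects whether the lock holder lets go of the
// mutex before waiting on the worker.
bool worker_finishes(bool holder_releases_lock) {
    std::timed_mutex mtx;
    std::atomic<bool> done{false};

    // "The guy with the lock".
    std::unique_lock<std::timed_mutex> holder(mtx);

    // "The guy without the lock": needs the mutex before it can report.
    std::thread worker([&] {
        std::unique_lock<std::timed_mutex> attempt(
            mtx, std::chrono::milliseconds(200));  // timed try, not a blocking lock
        if (attempt.owns_lock()) {
            done = true;
        }
    });

    if (holder_releases_lock) {
        holder.unlock();  // breaks the cycle: the worker can now proceed
    }

    // The holder waits on the worker. With a plain blocking lock in the
    // worker and no unlock above, this join would never return.
    worker.join();
    return done;
}
```

    Calling worker_finishes(false) reproduces the stuck case (the 
worker times out without ever getting the mutex), while 
worker_finishes(true) completes normally.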

    One way you might be able to get around this in the meantime is to 
extend the timeout from 30 seconds to, say, 300 seconds. If you see 
this problem occur more frequently when multiple jobs are running, it 
probably indicates that the CPU is busier and therefore the average 
time to process an event is slightly higher. This increases the 
probability that a "large" event will take more than 30 seconds, 
causing a timeout that activates the "thread assassin" code. The longer 
timeout should then reduce the frequency of these occurrences. Let me 
know whether or not this appears to help if you try it.
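
    As a rough illustration of the reasoning above, the sketch below 
counts how many events would exceed the timeout once a load factor 
stretches their processing times. The function name, the load factor, 
and any numbers fed to it are hypothetical and are not measurements 
from real jobs.

```cpp
#include <cstddef>
#include <vector>

// Counts events whose processing time, scaled by load_factor (1.0 = idle
// machine, larger = busier CPU), would exceed timeout_seconds.
std::size_t count_timeouts(const std::vector<double>& event_seconds,
                           double load_factor,
                           double timeout_seconds) {
    std::size_t n = 0;
    for (double t : event_seconds) {
        if (t * load_factor > timeout_seconds) {
            ++n;  // this event would trigger the timeout
        }
    }
    return n;
}
```

    For example, a made-up "large" event that takes 20 seconds on an 
idle machine stays under a 30 second timeout, crosses it when the load 
factor is 2, and is well within a 300 second timeout either way.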

Regards,
-David

Tarbert, Claire wrote:
> Hi,
>
> I have a problem with jobs occasionally hanging without giving any
> error messages.  I'm using JANA-0.4.9 and release-2009-02-24 of the 
> Hall D software to process hddm files produced by bggen.
>
> The job loops over all events in the hddm file but then hangs and 
> doesn't finish - it keeps running but draws very little CPU.  Here's 
> an example of the end of the log file:
>
>>  367.9k events processed  (367.9k events read)  0.0Hz  (avg.: 788.6Hz)
>>  367.9k events processed  (367.9k events read)  0.0Hz  (avg.: 786.1Hz)
>> Thread 0 hasn't responded in 30 seconds. (run:event=2:367888) 
>> Delisting ...
>> Caught HUP signal for thread 0x42eefbb0 thread exiting...
>> Launching new thread ...
>>  369.4k events processed  (369.4k events read)  0.0Hz  (avg.:
>> 785.1Hz)     0kHz  (avg.: 786.8Hz)
>>  373.9k events processed  (372.3k events read)  3.5kHz  (avg.: 788.9Hz)
>>  378.0k events processed  (378.0k events read)  2.8kHz  (avg.: 795.7Hz)
>>  381.6k events processed  (381.6k events read)  2.3kHz  (avg.: 800.9Hz)
>>  386.8k
>>  391.0k events processed  (391.5k events read)  2.7kHz  (avg.: 814.1Hz)
>>  393.5k events processed  (393.5k events read)  2.7kHz  (avg.: 819.9Hz)
>> No more event sources
>> Thread 0x42eefbb0 completed gracefully
>
> There should be some more printed statements after this that are coded
> in my processor fini() function.  So the job finishes processing all
> the events but doesn't get to the JEventProcessor::fini() function.
>
> Is there somewhere between "No more event sources" and 
> JEventProcessor::fini() that it could get hung up?
>
> This only happens occasionally - I have been able to analyse the same 
> files with the same code and the processes finish cleanly.  The 
> problem only seems to arise when I have multiple jobs running at the 
> same time on the cluster at IU.
>
> Cheers
> Claire
>
> -- 
> Claire Tarbert
> 260 Swain Hall West               Phone: +1 812 855 8933
> Department of Physics             E-mail: ctarbert@indiana.edu
> Indiana University
> 727 E Third St
> Bloomington
> IN 47405-7105
>
>
>

-- 

------------------------------------------------------------------------
 David Lawrence Ph.D.
 Staff Scientist                 Office: (757)269-5567   [[[  [   [ [       
 Jefferson Lab                   Pager:  (757)584-5567   [  [ [ [ [ [   
 http://www.jlab.org/~davidl     davidl@jlab.org         [[[  [[ [[ [[[
------------------------------------------------------------------------