
Re: thread not responding




Hi Mark,

    This behavior and the message both come from JANA itself. The intent
(as you guessed) is to detect "stuck" threads and kill them. The
reasoning was that a stuck thread is most likely caused by a rare sort
of event that sends it into an infinite loop, and such an event should
be discarded anyway. Where the system falls short is that it does not
automatically launch another thread to take the "bad" one's place. I
have added this to the upgrades list on the Hall-D JANA/DANA wiki page
so that it will be addressed in a future release.
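
    For anyone new to threads, the heartbeat idea can be sketched
roughly as follows. This is only an illustration under my own
assumptions and not the JANA code: the class HeartbeatMonitor and its
methods beat() and tick() are hypothetical names. Each worker zeroes
its own counter whenever it finishes an event, and the monitoring loop
ages every counter by however long it just slept. A thread whose
counter passes the timeout is reported as stalled.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a heartbeat monitor; NOT JANA's actual class.
class HeartbeatMonitor {
public:
    HeartbeatMonitor(std::size_t nthreads, double timeout_seconds)
        : heartbeats_(nthreads, 0.0), timeout_(timeout_seconds) {}

    // A worker calls this each time it finishes an event.
    void beat(std::size_t thread_id) { heartbeats_.at(thread_id) = 0.0; }

    // The monitoring loop calls this after sleeping for sleep_time seconds.
    // It ages every heartbeat and returns the ids of threads whose
    // heartbeat now exceeds timeout + sleep_time.
    std::vector<std::size_t> tick(double sleep_time) {
        std::vector<std::size_t> stalled;
        for (std::size_t i = 0; i < heartbeats_.size(); ++i) {
            heartbeats_[i] += sleep_time;
            if (heartbeats_[i] > timeout_ + sleep_time) {
                stalled.push_back(i);
            }
        }
        return stalled;
    }

private:
    std::vector<double> heartbeats_;  // seconds since each thread last reported
    double timeout_;                  // seconds of silence that are tolerated
};
```

In a real program the counters would need to be safe to touch from
several threads at once (atomics or a mutex); that is left out here to
keep the sketch short.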

    In the meantime, your options to get around this are as follows:

1. Set the monitor_heartbeat data member of the DApplication object to 
false. This disables the automatic killing of stuck threads.

2. Run the program with --nthreads=X where X is greater than 1. If there 
is a problematic event, then other threads will still exist and keep 
processing even after the one thread is killed. This is not a very 
practical solution for debugging though.

3. Modify the timeout in the JANA source. It is currently hardwired in 
JApplication::Run(...) in a line that looks like this:

if(monitor_heartbeat && (*hb > 7.0+sleep_time)){

The "7.0" is the number you would change.
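
    If you would rather not edit a literal, the same test can be
written with the timeout pulled out into a setting. The following is a
sketch under my own assumptions, not a patch to JANA: HeartbeatConfig
and should_delist() are hypothetical names, and only the comparison
itself is taken from the line quoted above.

```cpp
// Hypothetical settings holder; NOT part of JANA.
struct HeartbeatConfig {
    bool monitor_heartbeat = true;  // option 1: set to false to disable the check
    double timeout_seconds = 7.0;   // option 3: replaces the hardwired 7.0
};

// Same comparison as the quoted line, with the timeout taken from the
// config. hb is the number of seconds since the thread last reported.
inline bool should_delist(const HeartbeatConfig& cfg, double hb, double sleep_time) {
    return cfg.monitor_heartbeat && (hb > cfg.timeout_seconds + sleep_time);
}
```

With the defaults, a thread that has been silent for 8 seconds is
delisted when sleep_time is 0.5. Raising timeout_seconds to 60, or
setting monitor_heartbeat to false, leaves that same thread alone.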

Regards,
-David

Mark M. Ito wrote:
> D Listers,
>
> I made a change to allow my fitter to attempt to fit events with a 
> more challenging configuration than before. No doubt this is exposing 
> some pre-existing bug. The failure mode is apparently that a thread is 
> not reporting signs of life soon enough (see transcript below). Have 
> you seen this before? Does this mean I have an infinite loop in one 
> thread or another? Is there a way to increase the time-out period? 
> Have never dealt with threads so am a bit confused about how to 
> proceed with debugging.
>
> > fitter_d /u/scratch/marki/piplus_2.0gev.hddm
> Reading Magnetic field map from Magnets/Solenoid/solenoid_1500 ...
> 32481 entries found ( Nx=81 Ny=1 Nz=401 )
> Read 840 values from FDC/lorentz_deflections in calibDB
>   lorentz_deflections columns (alphabetical): bx bz nx nz x z
> Opening source "/u/scratch/marki/piplus_2.0gev.hddm"of type: HDDM
> Launching threads Registering FDC factories
> --- Configuration Parameters --
>        < all defaults >
> -------------------------------
> .
> Thread 0 hasn't responded in 8 seconds. (run:event=9999:1) Delisting ...
> fini called
> Caught HUP signal for thread 0xb730db90 thread exiting...
> Merging thread 0 ...
> 11 events processed. Average rate: 1.4Hz
>
>  -- Mark
>

-- 

------------------------------------------------------------------------
  David Lawrence Ph.D.
  Staff Scientist                 Office: (757)269-5567   [[[  [   [ [       
  Jefferson Lab                   Pager:  (757)584-5567   [  [ [ [ [ [   
  http://www.jlab.org/~davidl     davidl@jlab.org         [[[  [[ [[ [[[
------------------------------------------------------------------------