Re: thread not responding
Hi Mark,
This behavior and message are coming from JANA itself. The intent
(as you have guessed) is to detect "stuck" threads and kill them. The
reasoning was that a stuck thread is most likely caused by a rare kind
of event that traps the thread in an infinite loop, and such an event
should be discarded anyway. The shortcoming of the current system is
that it does not automatically launch another thread to take the "bad"
one's place. I have placed this on the upgrades list on the Hall-D
JANA/DANA wiki page so that it will be addressed in a future release.
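The heartbeat idea can be illustrated with a minimal C++ sketch. This is not JANA's code. The class HeartbeatMonitor and its methods beat() and stuck_workers() are hypothetical names chosen for illustration. Each worker records a timestamp whenever it makes progress, and a monitor asks which workers have been silent longer than the timeout.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical heartbeat watchdog illustrating the idea only; this is
// NOT JANA's implementation, and all names are made up for the sketch.
class HeartbeatMonitor {
public:
    using Clock = std::chrono::steady_clock;

    HeartbeatMonitor(std::size_t nthreads, double timeout_s)
        : last_beat_(nthreads, Clock::now()), timeout_s_(timeout_s) {
        assert(timeout_s > 0.0 && "timeout must be positive");
    }

    // Worker `id` calls this whenever it makes progress (e.g. after
    // finishing an event). The time can be passed explicitly for testing.
    void beat(std::size_t id, Clock::time_point t = Clock::now()) {
        std::lock_guard<std::mutex> lock(mtx_);
        last_beat_.at(id) = t;
    }

    // Returns the ids of workers that have been silent for longer than
    // the timeout as of `now`. A supervisor could delist these or,
    // better, start a replacement thread for each one.
    std::vector<std::size_t> stuck_workers(Clock::time_point now) const {
        std::lock_guard<std::mutex> lock(mtx_);
        std::vector<std::size_t> stuck;
        for (std::size_t i = 0; i < last_beat_.size(); ++i) {
            std::chrono::duration<double> silent = now - last_beat_[i];
            if (silent.count() > timeout_s_) stuck.push_back(i);
        }
        return stuck;
    }

private:
    mutable std::mutex mtx_;
    std::vector<Clock::time_point> last_beat_;
    double timeout_s_;
};
```

A monitor thread would call stuck_workers(HeartbeatMonitor::Clock::now()) once per polling interval. Starting a fresh worker for each returned id, instead of only delisting it, is the missing upgrade described above.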
In the meantime, your options to get around this are as follows:
1. Set the monitor_heartbeat data member of the DApplication object to
false. This disables the automatic killing of stuck threads.
2. Run the program with --nthreads=X, where X is greater than 1. If one
event is problematic, the other threads will still exist and keep
processing even after the stuck thread is killed. This is not a very
practical solution for debugging, though.
3. Modify the timeout in the JANA source. It is currently hardwired in
JApplication::Run(...) in a line that looks like this:
if(monitor_heartbeat && (*hb > 7.0+sleep_time)){
The "7.0" is the number you would change.
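For option 3, one could make the timeout a parameter instead of editing the constant each time. The sketch below is a hypothetical rewrite of that comparison, not a patch against the actual JApplication::Run(). The function names and the MY_HEARTBEAT_TIMEOUT environment variable are invented for illustration. It also assumes that *hb holds the seconds since the thread last reported and that sleep_time is the monitor's polling interval in seconds.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: read the heartbeat timeout (seconds) from an
// environment variable instead of hardwiring 7.0. The variable name
// MY_HEARTBEAT_TIMEOUT is invented for this sketch.
double heartbeat_timeout(double fallback = 7.0) {
    const char* s = std::getenv("MY_HEARTBEAT_TIMEOUT");
    if (s == nullptr) return fallback;
    try {
        double t = std::stod(s);
        return t > 0.0 ? t : fallback;  // reject zero, negative, NaN
    } catch (...) {  // std::invalid_argument or std::out_of_range
        return fallback;
    }
}

// Same comparison as the JANA line, with the timeout passed in.
// Assumes hb = seconds since the thread last checked in and
// sleep_time = the monitor's polling interval in seconds.
bool thread_is_stuck(bool monitor_heartbeat, double hb, double sleep_time,
                     double timeout) {
    return monitor_heartbeat && (hb > timeout + sleep_time);
}
```

Reading the timeout once, before the monitoring loop starts, avoids calling getenv on every check.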
Regards,
-David
Mark M. Ito wrote:
> D Listers,
>
> I made a change to allow my fitter to attempt to fit events with a
> more challenging configuration than before. No doubt this is exposing
> some pre-existing bug. The failure mode is apparently that a thread is
> not reporting signs of life soon enough (see transcript below). Have
> you seen this before? Does this mean I have an infinite loop in one
> thread or another? Is there a way to increase the time-out period?
> Have never dealt with threads so am a bit confused about how to
> proceed with debugging.
>
> > fitter_d /u/scratch/marki/piplus_2.0gev.hddm
> Reading Magnetic field map from Magnets/Solenoid/solenoid_1500 ...
> 32481 entries found ( Nx=81 Ny=1 Nz=401 )
> Read 840 values from FDC/lorentz_deflections in calibDB
> lorentz_deflections columns (alphabetical): bx bz nx nz x z
> Opening source "/u/scratch/marki/piplus_2.0gev.hddm"of type: HDDM
> Launching threads Registering FDC factories
> --- Configuration Parameters --
> < all defaults >
> -------------------------------
> .
> Thread 0 hasn't responded in 8 seconds. (run:event=9999:1) Delisting ...
> fini called
> Caught HUP signal for thread 0xb730db90 thread exiting...
> Merging thread 0 ...
> 11 events processed. Average rate: 1.4Hz
>
> -- Mark
>
--
------------------------------------------------------------------------
David Lawrence Ph.D.
Staff Scientist Office: (757)269-5567 [[[ [ [ [
Jefferson Lab Pager: (757)584-5567 [ [ [ [ [ [
http://www.jlab.org/~davidl davidl@jlab.org [[[ [[ [[ [[[
------------------------------------------------------------------------