ATP

Created Wednesday 24 August 2016

ATP on Theta

At higher core counts, other strategies will be needed: disabling core dumps, or even lightweight files, becomes necessary when 10,000 processes fall over and each one chooses to write a core dump. Lustre doesn't take kindly to that ... and when you then go and do operations in a directory with 10,000 (or even 100,000) files in it, it likes that even less! Got burned a few times over the years on that one ... although I never tried breaking a GPFS system that way, so YMMV.
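
For the "disabling core dumps" part, a minimal sketch is to set the core file size limit to zero near the top of the job script, before the launch line. `ulimit -c` is a standard shell builtin; whether the limit actually reaches the compute-node processes depends on the launcher and the site configuration, so treat that as an assumption to check on your own system.

```shell
# Turn off core dumps for this shell and every process it launches.
ulimit -c 0

# Read the limit back so the job log records what was actually in effect.
core_limit=$(ulimit -c)
echo "core file size limit: ${core_limit}"
```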

There are two other ways to debug this situation non-interactively:

- Run under Cray's Abnormal Termination Processing (ATP), which merges the stack traces of the crashed processes and dumps the result in a readable form.
- Get a sublist of processes (one per unique stack trace) to set your debugger loose on before they die.
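
To make the "one per unique stack trace" idea concrete, here is a rough sketch of the grouping step in plain shell. It is not how ATP or any other tool does it internally; the input format and the sample traces are made up purely for illustration.

```shell
# Hypothetical input: one "rank stack" pair per line (made-up sample data).
traces='0 main>solve>MPI_Allreduce
1 main>solve>MPI_Allreduce
2 main>write_output>abort
3 main>solve>MPI_Allreduce'

# Keep the first rank seen for each distinct stack (field 2),
# and join the kept ranks into one space-separated list.
representatives=$(printf '%s\n' "$traces" |
  awk '!seen[$2]++ { out = out sep $1; sep = " " } END { print out }')

echo "attach debugger to ranks: ${representatives}"
```

With thousands of ranks stuck in the same collective, the list usually collapses to a handful of representatives, which is what makes debugging at that scale tractable.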

I can confirm that ATP works under these conditions:

a) compile with Cray
b) put this in your job script:

module load atp
export ATP_ENABLED=1

aprun [...]

ATP at NERSC


http://www.nersc.gov/users/software/performance-and-debugging-tools/stat-and-atp/#toc-anchor-2


