Speaker: Sharanya Jayaraman

Date: Jan 12, 11:45am–12:45pm

Abstract and Bio: Computational resilience is the ability of computing systems to remain correct and efficient in the face of faults. As large-scale high-performance applications grow in size and scope, they become increasingly susceptible to diverse fault types. With recovery times for Application Level Checkpoint and Restart approaching the mean time to failure of these systems, there is an urgent need for new fault tolerance methodologies.

In this work, we simulate Exascale machines running Monte Carlo applications and observe their performance and behavior under fault conditions. By systematically injecting controlled faults into the system, we record how errors propagate. These data let us analyze the consequences of faults and develop strategies to mitigate and confine the effects of unpredictable faults. Notably, our observations show that, even in the presence of faults, it is feasible to complete calculations that might otherwise have required frequent restarts, while keeping the error within an acceptable margin. This research holds promise for keeping computing systems reliable and efficient as Exascale systems grow more complex and fault-prone.

Location and Zoom link: 307 Love, or https://fsu.zoom.us/j/99388380077