Weikuan Yu, Professor in the FSU Department of Computer Science, has been awarded a $225K grant from the National Science Foundation for research on efficient and robust checkpoint/restart support for deep learning. This is a joint project with Prof. Bin Ren from the College of William & Mary and Prof. Wei Niu from the University of Georgia. The project, titled “CropDL – Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters”, aims to support application-level checkpoint/restart of deep learning (DL) applications for better resiliency, faster average completion time, and higher resource utilization. By taking advantage of unique properties of DL workloads, such as limited communication patterns, highly compressible checkpoints, and malleable execution with a different number of processes, the project develops techniques for asynchronous user-level checkpointing, efficient DAG-based I/O scheduling, and automated compiler-directed checkpointing. The project will contribute to AI democratization by facilitating the use of shared HPC clusters for long-running AI/ML training tasks and increasing the number of researchers who can successfully train large AI/ML models for various applications.