iCR: Intent-Driven Checkpoint/Restart (Mar 10)

Speaker: Olga Kogiou

Date: March 10, 11:45 – 12:45 pm

Abstract:

High Performance Computing applications have evolved from monolithic simulations to interdependent tasks that form heterogeneous workflows. These workflows rely on Checkpoint/Restart (C/R) mechanisms to enable continuity in their execution. However, workflow heterogeneity demands adaptability from the underlying C/R solution. Our examination reveals that, in a case study with the Montage workflow, C/R with static configurations can incur overheads of 1.6x in performance and 176 GB in storage compared to existing solutions for scientific workflows. To adapt C/R policies to the increasing heterogeneity, we develop iCR, a toolkit that utilizes workflow and data intents. We demonstrate three key benefits of intent-driven adaptable C/R. First, we show that a representative set of six large-scale workflows can be decomposed into seven intents that recur across workflows, allowing us to suggest workflow-agnostic checkpointing policies. Second, adapting the redundancy for different checkpoint intents reduces checkpointing overhead by 2x. Third, adaptable C/R guided by user intents reduces I/O time, lowers makespan under worst-case failure rates, and decreases the overall storage footprint. By identifying user intent in heterogeneous workflows, iCR is able to reduce C/R I/O overhead and increase storage savings compared to static C/R configurations.

Biographical Sketch:

I am a 4th-year PhD candidate in Computer Science at Florida State University, advised by Dr. Weikuan Yu, and collaborating with Lawrence Livermore National Laboratory. My research focuses on High Performance Computing (HPC), and more specifically on Data Management and Fault-Tolerance in complex HPC environments. My current project is on checkpoint de-duplication for Foundation Deep Learning models. Throughout my PhD, I have been involved in projects focusing on Checkpoint/Restart mechanisms and I/O characterization of scientific workflows, lightweight I/O tracing tools for performance provisioning of AI-driven workflows, and performance analysis of the VAST file system for diverse HPC workloads. The thesis of my work is to understand system and application heterogeneity in HPC in order to design more efficient Checkpoint/Restart and, more generally, Data Management solutions. I received my B.S. degree from the University of Thessaly, Greece, in 2021, and prior to joining FSU, I worked as a back-end developer.

Location (In Person Only): LOV 353