UGAL-Q: A Multi-Agent Reinforcement Learning-Based Routing for Dragonfly Networks (Oct 17)

Speaker: Xin Yuan

Date: Friday, Oct 17, 2:15 – 3:05 pm

Abstract: Multi-Agent Reinforcement Learning (MARL)-based routing has emerged as a promising approach for high-performance interconnect networks such as Dragonfly, offering a viable alternative to the widely used Universal Globally Adaptive Load-balanced (UGAL) routing. Practical routing on modern interconnects must satisfy various requirements, such as being deadlock-free and having a limited path length. These requirements impose routing constraints, which in turn pose challenges for MARL-based routing. In particular, two important issues must be addressed for a MARL-based scheme to be effective. First, in the presence of routing constraints, sufficient path diversity to accommodate different traffic conditions is essential. Second, since routing constraints influence how Q-values are propagated in a MARL-based scheme, it is vital that the value propagation mechanism accounts for these constraints. Existing MARL-based routing schemes for Dragonfly fall short in addressing both issues. As a result, while they achieve high performance under some traffic conditions, they may exhibit poor performance or even pathological behaviors in other scenarios.
In this work, we discuss the limitations of existing MARL-based routing schemes for Dragonfly, present methods to address the two key issues, and develop UGAL-Q, a novel MARL-based scheme that resolves these issues and overcomes the problems in existing approaches. We perform extensive evaluations using both synthetic traffic and HPC application benchmarks. The results demonstrate that our scheme is more effective than existing ones and is a robust routing solution for Dragonfly.
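For context, the classic UGAL decision that UGAL-Q builds on compares the queue-weighted cost of a minimal path against that of a randomly chosen non-minimal (Valiant) path. The sketch below is a minimal, hypothetical illustration of that rule in Python; the function and variable names are placeholders and are not taken from the talk or the speaker's implementation.

```python
# Hedged sketch of the standard UGAL routing decision (not the UGAL-Q scheme itself).
# A packet is routed minimally unless the queue-weighted cost of the minimal path
# exceeds that of a sampled non-minimal (Valiant) path.

def ugal_choice(min_queue, min_hops, nonmin_queue, nonmin_hops, threshold=0):
    """Return 'minimal' or 'nonminimal' using the UGAL comparison.

    min_queue / nonmin_queue : occupancy of the output queue for each candidate path
    min_hops  / nonmin_hops  : path lengths (non-minimal paths are typically ~2x longer)
    threshold                : optional bias toward minimal routing
    """
    if min_queue * min_hops <= nonmin_queue * nonmin_hops + threshold:
        return "minimal"
    return "nonminimal"

# Example: a congested minimal path (queue 12, 3 hops) vs. a lightly loaded
# Valiant path (queue 2, 5 hops) -> UGAL picks the non-minimal route.
print(ugal_choice(min_queue=12, min_hops=3, nonmin_queue=2, nonmin_hops=5))
```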

Location: LOV 307 and Zoom
