Dr. Sonia Haiduc, an Associate Professor in the Computer Science Department, and her Serene research lab have published their recent work in the prestigious 38th IEEE/ACM International Conference on Automated Software Engineering (ASE 2023). ASE is recognized as a top-tier conference in the field of software engineering research. The research paper, titled “Improving code extraction from coding screencasts using a code-aware encoder-decoder model”, was co-authored by Abdulkarim Malkadi, a recent CS Ph.D. graduate, and Ahmad Tayeb, a CS Ph.D. candidate, both supervised by Dr. Haiduc in the Serene Lab in our department.
The team’s work addresses the problem of extracting code from tutorial videos, a challenge faced by many software developers who want to reuse or understand code shown in screencasts. Current methods for video code extraction often produce inconsistent or inaccurate results. To address this, the team introduces CodeT5-OCRfix, a novel, fine-tuned deep learning model based on the T5 transformer architecture. Their empirical evaluation shows that CodeT5-OCRfix outperforms the baseline code extraction models and is also more time-efficient. The work advances the state of the art in extracting code from screencasts and images, which can benefit developers and software engineers who learn from video tutorials.
The paper will be presented at the upcoming ASE 2023 conference in Luxembourg in September and will be published by IEEE in the proceedings of the conference.