WAFFLE: Multi-Modal Model for Automated Front-End Development

Shanchao Liang, Nan Jiang, Shangshu Qian, Lin Tan

2024-10-25

Summary

This paper introduces WAFFLE, a fine-tuning approach that helps large language models (LLMs) turn user interface (UI) designs into HTML code, automating a core step of front-end web development.

What's the problem?

Web development involves converting visual UI designs into HTML code. This is hard for beginners and experienced developers alike, because HTML's hierarchical structure and styling rules are complex. Current models that generate HTML from UI designs face two main challenges: representing the hierarchical structure of HTML, and bridging the gap between the visual design and the text-based HTML code.

What's the solution?

To solve these problems, WAFFLE introduces a fine-tuning strategy with two components. A structure-aware attention mechanism helps LLMs better capture the hierarchical organization of HTML, and a contrastive fine-tuning method aligns their understanding of UI images with the corresponding HTML code. Models fine-tuned with WAFFLE outperform current fine-tuning methods on the WebSight-Test and Design2Code benchmarks, with gains of up to 9.00 percentage points in HTML match and 27.12 percentage points in LLEM.
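To make the idea of structure-aware attention concrete, here is a minimal sketch of an attention mask built from an HTML element tree. The visibility rule used here (a token may attend to earlier tokens of its own element, its parent element, and its preceding siblings) is an assumption for illustration, since this summary does not spell out the exact rule. The `Element` class and both function names are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Element:
    """One HTML element; elements are listed in document order."""
    name: str
    parent: Optional[int]  # index of the parent element, None for the root
    tokens: List[int]      # sequence positions of this element's tokens


def visible_elements(elements: List[Element], idx: int) -> Set[int]:
    """Elements that element `idx` may attend to (assumed rule)."""
    allowed = {idx}
    parent = elements[idx].parent
    if parent is not None:
        allowed.add(parent)
        # Preceding siblings: same parent, earlier in document order.
        allowed.update(j for j in range(idx) if elements[j].parent == parent)
    return allowed


def structure_aware_mask(elements: List[Element], seq_len: int) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k."""
    # Map each token position to the element that owns it.
    owner = [-1] * seq_len
    for e_idx, element in enumerate(elements):
        for position in element.tokens:
            owner[position] = e_idx

    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        allowed = visible_elements(elements, owner[q])
        for k in range(q + 1):  # causal: only the current and earlier tokens
            mask[q][k] = owner[k] in allowed
    return mask


# Toy page: <body><h1>Hi</h1><div><p>text</p></div></body>
elements = [
    Element("body", None, [0]),
    Element("h1", 0, [1, 2]),
    Element("div", 0, [3]),
    Element("p", 2, [4, 5]),
]
mask = structure_aware_mask(elements, seq_len=6)
```

In this toy page, the tokens of `<p>` see only `<p>` and its parent `<div>`, while the `<div>` token also sees its preceding sibling `<h1>` and its parent `<body>`. A real model would apply such a mask inside its attention layers rather than compute it as nested Python lists.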

Why does it matter?

This research matters because it could simplify the web development process. By improving how AI translates visual designs into functional code, WAFFLE could reduce the learning curve for new developers and speed up development for experienced ones.

Abstract

Web development involves turning UI designs into functional webpages, which can be difficult for both beginners and experienced developers due to the complexity of HTML's hierarchical structures and styles. While Large Language Models (LLMs) have shown promise in generating source code, two major challenges persist in UI-to-HTML code generation: (1) effectively representing HTML's hierarchical structure for LLMs, and (2) bridging the gap between the visual nature of UI designs and the text-based format of HTML code. To tackle these challenges, we introduce Waffle, a new fine-tuning strategy that uses a structure-aware attention mechanism to improve LLMs' understanding of HTML's structure and a contrastive fine-tuning approach to align LLMs' understanding of UI images and HTML code. Models fine-tuned with Waffle show up to 9.00 pp (percentage point) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP, and 27.12 pp higher LLEM on our new benchmark WebSight-Test and an existing benchmark Design2Code, outperforming current fine-tuning methods.
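The contrastive fine-tuning idea mentioned in the abstract can be illustrated with a generic contrastive objective. The sketch below uses an InfoNCE-style loss over paired UI-image and HTML embeddings. This is a standard technique swapped in for illustration, not necessarily the loss the authors use. The function names, the cosine similarity, the temperature value, and the toy embeddings are all assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(
    image_embs: List[Sequence[float]],
    html_embs: List[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """Mean InfoNCE loss; image_embs[i] and html_embs[i] form a positive pair."""
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        logits = [
            cosine(image_embs[i], html_embs[j]) / temperature for j in range(n)
        ]
        # Numerically stable log-sum-exp over all candidate HTML embeddings.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        # Negative log-probability of picking the matching HTML embedding.
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings: two UI images and two HTML documents (values are made up).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_html = [[0.9, 0.1], [0.1, 0.9]]
misaligned_html = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = info_nce(images, aligned_html)
misaligned_loss = info_nce(images, misaligned_html)
```

The loss is small when each image embedding is closest to its own HTML embedding and large when the pairs are swapped, which is the signal that pulls matching image and code representations together during fine-tuning.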
