
Text-to-CAD Generation Through Infusing Visual Feedback in Large Language Models

Ruiyu Wang, Yu Yuan, Shizhao Sun, Jiang Bian

2025-02-05

Summary

This paper introduces CADFusion, a method that helps AI models turn text descriptions into CAD designs by combining training on ground-truth design sequences with visual feedback on the rendered results. It improves how accurately these models create 3D designs from what users describe.

What's the problem?

Creating CAD models from scratch requires a lot of expertise and effort. While some AI tools can generate CAD designs from text, they are typically trained only on the step-by-step design sequences, not on how the finished object actually looks. This limits their ability to produce designs that are both logically valid and visually accurate.

What's the solution?

The researchers developed CADFusion, which trains large language models in two alternating stages. First, in the sequential learning stage, the model learns to produce logically correct CAD command sequences from real examples. Then, in the visual feedback stage, generated sequences are rendered into 3D objects, and the model is rewarded when the result matches what people visually prefer and penalized when it does not. Alternating between the two stages keeps the learning balanced between logic and visuals, leading to better results.
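The alternating schedule described above can be sketched as a simple training loop. This is a minimal illustration of the two-stage structure only, not the paper's implementation: the model, the stage updates, and the `render`/`score` functions are stand-in placeholders.

```python
def sequential_learning_step(model, batch):
    # SL stage (stubbed): supervised fine-tuning on ground-truth
    # CAD parametric sequences, e.g. a cross-entropy update.
    model["sl_steps"] += 1
    return model

def visual_feedback_step(model, batch, render, score):
    # VF stage (stubbed): render each generated sequence into a visual
    # object and score it; preferred renders earn reward, others a penalty.
    rewards = [score(render(seq)) for seq in batch]
    model["vf_steps"] += 1
    model["last_rewards"] = rewards
    return model

def train_cadfusion(model, sl_data, vf_data, render, score, rounds=3):
    # Alternate the SL and VF stages each round so that neither the
    # sequential signal nor the visual signal dominates training.
    for _ in range(rounds):
        for batch in sl_data:
            model = sequential_learning_step(model, batch)
        for batch in vf_data:
            model = visual_feedback_step(model, batch, render, score)
    return model
```

In a real setup the stubs would be gradient updates on the LLM, and `render` would call a CAD kernel; here they only count steps to show the schedule.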

Why it matters?

This research matters because it makes creating CAD models easier and more accessible by allowing users to describe their designs in text. By improving both the logic and visual quality of these models, CADFusion could significantly speed up design processes and make them more accurate for industries like engineering and manufacturing.

Abstract

Creating Computer-Aided Design (CAD) models requires significant expertise and effort. Text-to-CAD, which converts textual descriptions into CAD parametric sequences, is crucial in streamlining this process. Recent studies have utilized ground-truth parametric sequences, known as sequential signals, as supervision to achieve this goal. However, CAD models are inherently multimodal, comprising parametric sequences and corresponding rendered visual objects. Besides, the rendering process from parametric sequences to visual objects is many-to-one. Therefore, both sequential and visual signals are critical for effective training. In this work, we introduce CADFusion, a framework that uses Large Language Models (LLMs) as the backbone and alternates between two training stages: the sequential learning (SL) stage and the visual feedback (VF) stage. In the SL stage, we train LLMs using ground-truth parametric sequences, enabling the generation of logically coherent parametric sequences. In the VF stage, we reward parametric sequences that render into visually preferred objects and penalize those that do not, allowing LLMs to learn how rendered visual objects are perceived and evaluated. These two stages alternate throughout the training, ensuring balanced learning and preserving the benefits of both signals. Experiments demonstrate that CADFusion significantly improves performance, both qualitatively and quantitatively.
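The abstract's reward-and-penalize scheme for the VF stage resembles preference-based fine-tuning. As one hedged illustration (an assumed DPO-style objective, not necessarily the exact formulation the paper uses), the model's log-probability of a sequence whose render is visually preferred can be pushed up relative to a rejected one:

```python
import math

def vf_preference_loss(logp_chosen, logp_rejected, beta=0.1):
    # Assumed DPO-style preference loss: smaller when the model assigns
    # higher likelihood to the sequence with the visually preferred render.
    margin = beta * (logp_chosen - logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two log-probabilities are equal the loss is log 2; increasing the chosen sequence's likelihood relative to the rejected one drives the loss down, which is the "reward preferred renders, penalize the rest" behavior the abstract describes.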