Evaluating the Performance of Large Language Models in 3D Architectural Modeling
Keywords:
Large language model, Rhino 3D model, ChatGPT, Claude
Abstract
This study evaluates the performance of two large language models (LLMs), Claude and ChatGPT, in generating 3D architectural models from textual and visual prompts. Using Pass@k metrics (k = 1, 3, 5) and error analysis, the research examines their accuracy and reliability. Results show that both models perform equally at Pass@1 (24%), but Claude achieves higher scores at Pass@3 (43% vs. 35%) and Pass@5 (47% vs. 40%), demonstrating greater consistency. While syntax errors are minimal (below 5%), structural errors dominate, particularly in complex tasks where outputs tend to be simplified, incomplete, or overlapping. Logical and visual-semantic errors occur less frequently but further reduce usability. Despite these limitations, LLMs demonstrate strong potential in multimodal processing and efficiency, as they can generate models in under one minute. The findings suggest that LLM-based modeling is still at an early stage but may evolve into a transformative tool in architectural design, bridging conceptual speed with structurally valid outputs.
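The abstract does not state how the Pass@k scores were computed. As a minimal sketch, assuming the commonly used unbiased estimator (generate n samples per task, count the c that pass, and estimate the chance that at least one of k randomly drawn samples passes), the calculation might look as follows. The function names and the per-task counts are hypothetical illustrations, not the study's data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one task: 1 - C(n - c, k) / C(n, k).

    n: total samples generated for the task
    c: number of those samples judged correct
    k: number of attempts allowed
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k incorrect samples: any draw of k must contain a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimates over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


# Hypothetical example: three tasks, five samples each (assumed values).
example_counts = [(5, 0), (5, 1), (5, 2)]
scores = {k: mean_pass_at_k(example_counts, k) for k in (1, 3, 5)}
```

Under this estimator Pass@k cannot decrease as k grows, which matches the pattern of the reported scores rising from Pass@1 to Pass@5 for both models.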