Structured 3D Latents for Scalable and Versatile 3D Generation
We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
NOTE: The appearance and geometry shown on this page are rendered from 3D Gaussians and meshes, respectively. GLB files are extracted by baking appearance from 3D Gaussians onto meshes.
All text prompts are generated by GPT-4.
Image prompts are either generated by DALL-E 3 or extracted from SA-1B.
TRELLIS can generate variants of a given 3D asset that are coherent with given text prompts.
TRELLIS can manipulate targeted local regions of a given 3D asset according to given text or image prompts.
By compositing the high-quality 3D assets generated by TRELLIS, users can create complex and vibrant 3D art designs with ease.
We introduce Structured LATents (SLAT), a unified 3D latent representation for high-quality, versatile 3D generation. SLAT marries sparse structures with powerful visual representations. It defines local latents on active voxels intersecting the object's surface. The local latents are encoded by fusing and processing image features from densely rendered views of the 3D asset and attaching them to the active voxels. These features, derived from powerful pretrained vision encoders, capture detailed geometric and visual characteristics, complementing the coarse structure provided by the active voxels. Different decoders can then be applied to map SLAT to diverse 3D representations of high quality.
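As a rough illustration of the representation described above, the sketch below stores one latent vector per active voxel, where a voxel is active if a surface sample falls inside it. The class name `StructuredLatent`, the helper names, the grid resolution, and the channel count are all hypothetical, and random vectors stand in for the features that would actually come from a pretrained vision encoder.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class StructuredLatent:
    """Hypothetical container: one local latent per active voxel."""
    coords: np.ndarray   # (L, 3) integer indices of active voxels
    feats: np.ndarray    # (L, C) local latent attached to each active voxel
    resolution: int      # side length N of the N^3 voxel grid


def active_voxels(surface_points: np.ndarray, resolution: int) -> np.ndarray:
    """Return the unique voxel indices hit by surface samples in [0, 1)^3."""
    idx = np.floor(surface_points * resolution).astype(np.int64)
    idx = np.clip(idx, 0, resolution - 1)
    return np.unique(idx, axis=0)


def build_slat(surface_points, resolution, channels, rng) -> StructuredLatent:
    coords = active_voxels(surface_points, resolution)
    # Placeholder latents; a real encoder would aggregate multi-view image features.
    feats = rng.standard_normal((coords.shape[0], channels)).astype(np.float32)
    return StructuredLatent(coords=coords, feats=feats, resolution=resolution)


# Toy asset: points sampled on a sphere of radius 0.4 centred in the unit cube.
rng = np.random.default_rng(0)
directions = rng.standard_normal((20000, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
points = 0.5 + 0.4 * directions

slat = build_slat(points, resolution=16, channels=8, rng=rng)
occupancy_ratio = slat.coords.shape[0] / slat.resolution ** 3
print(slat.coords.shape, slat.feats.shape, occupancy_ratio)
```

Only the voxels near the surface carry latents, so the number of stored vectors grows with surface area rather than with the full grid volume.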
Building on SLAT, we train a family of large 3D generation models, dubbed TRELLIS, with text prompts or images as conditions. A two-stage pipeline is applied, which first generates the sparse structure of SLAT and then generates the latent vectors for the non-empty cells. We employ rectified flow transformers as our backbone models and adapt them to handle the sparsity in SLAT. We train TRELLIS with up to 2 billion parameters on a large dataset of carefully collected 3D assets. TRELLIS can create high-quality 3D assets with detailed geometry and vivid texture, significantly surpassing previous methods. Moreover, it can easily generate 3D assets in different output formats to meet diverse downstream requirements.
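The two-stage sampling order can be sketched with a plain Euler integrator for a rectified flow. This is a toy under stated assumptions, not the released implementation: the convention that t = 1 is noise and t = 0 is data, the step count, and the dense occupancy grid in stage 1 are all illustrative choices, and `toy_velocity` is a hypothetical closed-form stand-in for the trained transformers (it is the exact velocity field when the data distribution is a single known sample).

```python
import numpy as np


def euler_sample(velocity, noise, steps=25):
    """Integrate dx/dt = velocity(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x = noise.copy()
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x - (t_cur - t_next) * velocity(x, t_cur)
    return x


def toy_velocity(target):
    """Hypothetical stand-in for a trained model.

    With x_t = (1 - t) * target + t * noise, the velocity noise - target
    equals (x_t - target) / t, which is what this closure returns.
    """
    def velocity(x, t):
        return (x - target) / t
    return velocity


rng = np.random.default_rng(0)
R, C = 8, 4  # assumed grid resolution and latent channels

# Stage 1: generate the sparse structure (here, a hollow cube of active voxels).
target_occ = -np.ones((R, R, R))
target_occ[2:6, 2:6, 2:6] = 1.0
target_occ[3:5, 3:5, 3:5] = -1.0
occ = euler_sample(toy_velocity(target_occ), rng.standard_normal(target_occ.shape))
coords = np.argwhere(occ > 0)  # (L, 3) indices of non-empty cells

# Stage 2: generate latent vectors only for the non-empty cells found in stage 1.
target_feats = rng.standard_normal((len(coords), C))
feats = euler_sample(toy_velocity(target_feats), rng.standard_normal(target_feats.shape))

print(coords.shape, feats.shape)
```

The point of the split is that stage 2 operates on an array of length L (the active voxels) instead of the full R^3 grid, which is where the sparsity handling in the backbone matters.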
If you find our work useful, please consider citing:
@article{xiang2024structured,
    title   = {Structured 3D Latents for Scalable and Versatile 3D Generation},
    author  = {Xiang, Jianfeng and Lv, Zelong and Xu, Sicheng and Deng, Yu and Wang, Ruicheng and Zhang, Bowen and Chen, Dong and Tong, Xin and Yang, Jiaolong},
    journal = {arXiv preprint arXiv:2412.01506},
    year    = {2024}
}