Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control

Ruining Li¹* Yuxin Yao²* Matt Zhou² Chuanxia Zheng³

Christian Rupprecht¹ Joan Lasenby² Shangzhe Wu²† Andrea Vedaldi¹†

¹University of Oxford ²University of Cambridge ³Nanyang Technological University

* Equal contribution. † Equal advising.

Instruct-Particulate is a new model in the Particulate series that generalizes much better to novel categories and can be prompted. It is a feed-forward model that, given a single static 3D mesh (including outputs from off-the-shelf 3D generators) and a target kinematic specification (i.e., part descriptions, connectivity, joint types, and optional point prompts), directly infers the underlying articulated structure, including the corresponding kinematic part segmentation and joint motion parameters. This allows for generating diverse, realistic, simulator-compatible articulated 3D objects directly from real-world images.

Kinematic Prompting

Previous work often trains neural networks to extract articulated 3D objects in an unconditional manner. However, there is often no single correct way to assign an articulated structure to a given 3D object. Inconsistencies in part granularity and semantics across annotations lead a naively trained model to "average" over multiple plausible annotations and produce suboptimal results. We find that the key to using all heterogeneous datasets (see Data Scaling) to enhance generalization is to add kinematic context to the model input, including an explicit kinematic structure (a list of parts and their connectivity), a text prompt for each part, joint types (i.e., prismatic or revolute), and optional 3D point prompts. Beyond disambiguating outputs, kinematic conditioning also enables prompting at test time, allowing the user to specify the intended kinematic structure during inference (shown above).
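To make the notion of kinematic context concrete, the sketch below shows one way such a prompt could be represented and sanity-checked before being passed to a model. It is illustrative only: the class names `PartPrompt` and `KinematicPrompt`, their fields, and the assumption that connectivity forms a tree with a single root are hypothetical and are not taken from the Instruct-Particulate code.

```python
from dataclasses import dataclass, field
from typing import Optional

# Joint types named in the text; any other value is rejected by validate().
MOVABLE_JOINTS = ("prismatic", "revolute")


@dataclass
class PartPrompt:
    """Hypothetical per-part prompt: text, connectivity, joint type, points."""
    name: str
    description: str                       # text prompt for this part
    parent: Optional[str] = None           # None marks the root (base) part
    joint_type: Optional[str] = None       # joint linking this part to its parent
    points: list[tuple[float, float, float]] = field(default_factory=list)


@dataclass
class KinematicPrompt:
    """Hypothetical container for the full kinematic specification."""
    parts: list[PartPrompt]

    def validate(self) -> None:
        """Check that the parts form a single tree (an assumption of this sketch)."""
        by_name = {p.name: p for p in self.parts}
        if len(by_name) != len(self.parts):
            raise ValueError("part names must be unique")
        if sum(p.parent is None for p in self.parts) != 1:
            raise ValueError("exactly one root part is expected")
        # First pass: every non-root part needs a known parent and a joint type.
        for p in self.parts:
            if p.parent is None:
                continue
            if p.parent not in by_name:
                raise ValueError(f"unknown parent {p.parent!r} for {p.name!r}")
            if p.joint_type not in MOVABLE_JOINTS:
                raise ValueError(f"unsupported joint type for {p.name!r}")
        # Second pass: walk towards the root to detect cycles.
        for p in self.parts:
            seen, cur = {p.name}, p
            while cur.parent is not None:
                if cur.parent in seen:
                    raise ValueError(f"cycle detected at {cur.parent!r}")
                seen.add(cur.parent)
                cur = by_name[cur.parent]


# Usage: a cabinet with a hinged door and a sliding drawer.
cabinet = KinematicPrompt([
    PartPrompt("body", "the cabinet frame"),
    PartPrompt("door", "the left door", parent="body", joint_type="revolute",
               points=[(0.2, 0.5, 0.1)]),
    PartPrompt("drawer", "the bottom drawer", parent="body", joint_type="prismatic"),
])
cabinet.validate()  # raises ValueError if the specification is malformed
```

Changing the prompt, for example listing two doors instead of one, describes a different but equally plausible articulation of the same mesh, which is the ambiguity that kinematic conditioning is meant to resolve.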

Generating Articulated 3D Assets from Images

While Instruct-Particulate converts existing static 3D objects into articulated ones rather than synthesizing them from scratch, our model can be used for image-conditioned articulated 3D asset generation. From a single image, we first reconstruct a 3D object using an off-the-shelf 3D generator, and then prompt a vision-language model to identify the kinematic structure of the object. Finally, Instruct-Particulate predicts the kinematic parts and joints in a single feed-forward pass, with outputs compatible with physics simulators such as Isaac Sim and MuJoCo.
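As an illustration of what simulator-compatible output can look like, the sketch below serializes a set of predicted parts and joints to URDF, a format that both MuJoCo and Isaac Sim can import. Using URDF here is an assumption of this example, as are the function name `to_urdf`, the dictionary layout of a joint, and the placeholder effort and velocity limits; the actual export format of Instruct-Particulate is not specified in the text above. Mesh geometry and inertial properties are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def _vec(values):
    """Format a 3-vector as a space-separated URDF attribute string."""
    return " ".join(f"{v:g}" for v in values)


def to_urdf(name, links, joints):
    """Build a minimal URDF string from part names and joint dictionaries.

    Each joint dict is a hypothetical layout with keys:
    name, type ("revolute" or "prismatic"), parent, child,
    origin (xyz), axis (xyz), lower, upper.
    """
    robot = ET.Element("robot", name=name)
    for link in links:
        ET.SubElement(robot, "link", name=link)
    for j in joints:
        joint = ET.SubElement(robot, "joint", name=j["name"], type=j["type"])
        ET.SubElement(joint, "parent", link=j["parent"])
        ET.SubElement(joint, "child", link=j["child"])
        ET.SubElement(joint, "origin", xyz=_vec(j["origin"]), rpy="0 0 0")
        ET.SubElement(joint, "axis", xyz=_vec(j["axis"]))
        # URDF requires effort and velocity on limited joints; 1.0 is a placeholder.
        ET.SubElement(joint, "limit", lower=str(j["lower"]), upper=str(j["upper"]),
                      effort="1.0", velocity="1.0")
    return ET.tostring(robot, encoding="unicode")


# Usage: a door hinged about the vertical axis, opening up to about 90 degrees.
urdf = to_urdf(
    "cabinet",
    links=["body", "door"],
    joints=[{
        "name": "body_to_door", "type": "revolute",
        "parent": "body", "child": "door",
        "origin": (0.3, 0.0, 0.0), "axis": (0.0, 0.0, 1.0),
        "lower": 0.0, "upper": 1.57,
    }],
)
print(urdf)
```

A real export would also attach each segmented part's mesh to its link as visual and collision geometry and provide inertial properties, which physics simulators need for dynamics.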

Data Scaling

Articulated 3D models are scarce due to the cost of designing and annotating them manually. In developing Instruct-Particulate, we curate a massive dataset consisting of the following components:

Existing articulated 3D datasets: This is the highest quality data source, but it is limited in quantity and category coverage.
Articulated 3D assets generated by a coding agent: Large language models already possess a strong prior on object articulation. To turn this prior into explicit 3D assets, we develop Articraft, a coding agent that writes programs against an LLM-friendly SDK to define parts, compose geometry, and specify joints and motion limits. The resulting assets are more diverse and provide full joint supervision.
Synthetic 3D objects pseudo-labeled with kinematic parts: While Instruct-Particulate takes existing 3D objects as input, we aim to achieve image-conditioned articulated 3D asset generation via an off-the-shelf 3D generator (see Generating Articulated 3D Assets from Images). To expose the model to synthetic 3D models during training, we build an annotation pipeline that identifies and segments 3D kinematic parts of synthetic 3D objects using vision-language models.
Generic part-segmented 3D models: Generic 3D part segmentations rarely correspond to object articulation and motion. We mix them in the training data as a pre-training step for aligning text prompts with 3D parts.
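The four sources above differ in what they can supervise. One common way to train on such a mixture is to mask loss terms per source, as sketched below. This is a generic illustration rather than the training recipe used for Instruct-Particulate: the `Source` enum, the `SUPERVISION` table, and the choice of which sources supervise joints are assumptions inferred from the descriptions above.

```python
from enum import Enum


class Source(Enum):
    EXISTING_ARTICULATED = "existing_articulated"
    CODING_AGENT = "coding_agent"
    PSEUDO_LABELED = "pseudo_labeled"
    GENERIC_PARTS = "generic_parts"


# Assumed supervision per source (hypothetical): all sources label parts,
# but only the first two are assumed to carry joint annotations.
SUPERVISION = {
    Source.EXISTING_ARTICULATED: {"segmentation": True, "joints": True},
    Source.CODING_AGENT: {"segmentation": True, "joints": True},
    Source.PSEUDO_LABELED: {"segmentation": True, "joints": False},
    Source.GENERIC_PARTS: {"segmentation": True, "joints": False},
}


def masked_loss(source, seg_loss, joint_loss):
    """Sum only the loss terms that the sample's source can supervise."""
    mask = SUPERVISION[source]
    total = 0.0
    if mask["segmentation"]:
        total += seg_loss
    if mask["joints"]:
        total += joint_loss
    return total


# Usage: a generic part-segmented sample contributes no joint loss.
print(masked_loss(Source.GENERIC_PARTS, seg_loss=2.0, joint_loss=5.0))
print(masked_loss(Source.CODING_AGENT, seg_loss=2.0, joint_loss=5.0))
```

With this kind of masking, samples that lack joint labels still contribute to part segmentation and text alignment without penalizing the joint predictions.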

Citation

@article{li2026instructparticulate,
  title   = {{Instruct-Particulate}: Scaling Feed-Forward 3D Object Articulation with Kinematic Control},
  author  = {Li, Ruining and Yao, Yuxin and Zhou, Matt and Zheng, Chuanxia and Rupprecht, Christian and Lasenby, Joan and Wu, Shangzhe and Vedaldi, Andrea},
  journal = {arXiv preprint arXiv:2606.14699},
  year    = {2026}
}