Make-It-Poseable: Advances in 3D Posing

Updated 4 July 2026

Make-It-Poseable is a framework that converts static or captured 3D assets into explicitly poseable representations using kinematic structures and latent transformations.
It integrates structure-recovery methods and representation learning to infer joint hierarchies and control deformations without relying solely on traditional skinning.
The approach supports applications ranging from unsupervised multi-view rigging for arbitrary objects to interactive pose specification via sketches, images, and text.

Across the cited works, “Make-It-Poseable” denotes the conversion of a static asset, an observed articulated object, or a captured deformable subject into a representation whose pose can be controlled explicitly or predictably. In the most explicit formulation, this means a kinematic structure of rigid parts, joints, and skinning or geometry; in more recent formulations, it also includes latent-space pose transformation, pose-conditioned 3D generation, and retrieval-based motion transfer. The topic therefore spans unsupervised structure discovery from multi-view videos, rigging of arbitrary humanoid assets, sketch-, image-, and text-driven pose specification, and physically embodied or fabrication-oriented kinematic design (Noguchi et al., 2021, Guo et al., 18 Dec 2025).

1. Conceptual foundations

A recurring definition in this literature is that to “Make-It-Poseable,” one needs a 3D object model that can be explicitly controlled by changing joint angles, namely a kinematic structure consisting of rigid parts, joints, and skinning or geometry. Earlier poseable pipelines for humans, faces, or hands typically assume that this structure is already known, often through a predefined skeleton such as SMPL, and are therefore category-specific. By contrast, recent work targets unseen articulated objects, arbitrary humanoid models, or subject-customized imagery, and seeks either to infer the structure directly or to bypass explicit per-vertex deformation with more abstract poseable representations (Noguchi et al., 2021, Guo et al., 2024).

This split has produced two major technical lineages. One lineage treats poseability as a structure-recovery problem: parts are segmented from motion, joints are inferred from relative transformations, and a tree or graph is built for forward kinematics. The other treats poseability as a representation-learning problem: skeletal motion conditions a particle field, a latent token set, or a generative backbone so that a new posed geometry can be decoded directly, without relying on explicit LBS-style deformation at inference time. A plausible implication is that the field is increasingly separating control semantics from surface realization, allowing a compact rig or skeleton to drive either explicit geometry or a learned latent representation (Guo et al., 18 Dec 2025).
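The explicit baseline that the representation-learning lineage tries to avoid can be written out in a few lines. The following is a minimal sketch of forward kinematics on a joint tree followed by linear blend skinning (LBS). It assumes rotations about the z axis only, and the function names and the two-bone chain are hypothetical illustrations, not code from any of the cited systems.

```python
import numpy as np

def rot_z(theta):
    """4x4 homogeneous rotation about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

def translate(offset):
    """4x4 homogeneous translation."""
    T = np.eye(4)
    T[:3, 3] = offset
    return T

def forward_kinematics(parents, offsets, angles):
    """World transform of every joint; assumes parents[i] < i and root parent -1."""
    world = []
    for i, p in enumerate(parents):
        local = translate(offsets[i]) @ rot_z(angles[i])
        world.append(local if p < 0 else world[p] @ local)
    return np.stack(world)

def linear_blend_skinning(rest_vertices, weights, world_rest, world_posed):
    """v' = sum_j w_vj * (G_posed_j @ inv(G_rest_j)) @ v."""
    skin = world_posed @ np.linalg.inv(world_rest)               # (J, 4, 4)
    ones = np.ones((len(rest_vertices), 1))
    v_h = np.concatenate([rest_vertices, ones], axis=1)          # (V, 4)
    per_joint = np.einsum("jab,vb->vja", skin, v_h)              # (V, J, 4)
    return np.einsum("vj,vja->va", weights, per_joint)[:, :3]    # (V, 3)

# Hypothetical two-bone chain: root at the origin, elbow one unit along x.
parents = [-1, 0]
offsets = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
rest = forward_kinematics(parents, offsets, [0.0, 0.0])
posed = forward_kinematics(parents, offsets, [0.0, np.pi / 2])   # bend the elbow 90 degrees

vertices = np.array([[0.5, 0.0, 0.0], [2.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.0, 1.0]])                     # rigid binding per vertex
posed_vertices = linear_blend_skinning(vertices, weights, rest, posed)
```

In this toy setup the vertex bound to the root stays in place, while the vertex bound to the elbow swings around the elbow position. Latent posing methods replace the last function with a learned decoder but can keep the same skeletal input.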

2. Recovering articulated structure from motion and scans

One of the clearest structure-recovery formulations appears in “Watch It Move: Unsupervised Discovery of 3D Joints for Re-Posing of Articulated Objects,” which learns a 3D part decomposition, joints and hierarchy, and a high-fidelity radiance plus surface representation from calibrated multi-view videos and foreground masks, without joint or skeleton annotations. Its core insight is that points that stay rigid relative to each other over time belong to the same part, while adjacent parts that move relative to each other must be connected by a joint. The method represents each part explicitly as a 3D ellipsoid with learnable radii and per-frame pose, combines the ellipsoid union with a neural residual SDF, samples candidate joint points inside ellipsoids, minimizes motion inconsistency across time, constructs a tree-structured kinematic graph, merges redundant parts, and then re-poses the object by forward kinematics without re-training. It is evaluated on quadrupeds, single-arm robots, and humans (Noguchi et al., 2021).
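The joint-discovery insight can be illustrated with a small linear problem: if two parts share a joint, there is a point fixed in each part's local frame that maps to the same world location in every frame. The sketch below solves for that point by closed-form least squares. This is a simplified stand-in for the paper's procedure, which samples candidate joint points inside ellipsoids and minimizes motion inconsistency during training; the function names and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def rodrigues(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about `axis`."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def estimate_joint(R_a, t_a, R_b, t_b):
    """Find local points c_a, c_b with R_a[t] c_a + t_a[t] ~= R_b[t] c_b + t_b[t].

    R_a, R_b: (T, 3, 3) per-frame rotations; t_a, t_b: (T, 3) translations.
    Returns c_a, c_b and the RMS residual (a motion-inconsistency score).
    """
    T = len(R_a)
    A = np.concatenate([R_a, -R_b], axis=2).reshape(3 * T, 6)
    b = (t_b - t_a).reshape(3 * T)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    residual = A @ x - b
    return x[:3], x[3:], float(np.sqrt(np.mean(residual ** 2)))

# Synthetic check: two parts that really do share a ball joint.
rng = np.random.default_rng(0)
true_c_a = np.array([0.3, 0.0, 0.1])
true_c_b = np.array([-0.2, 0.05, 0.0])
R_a = np.stack([rodrigues(rng.normal(size=3), rng.uniform(-1, 1)) for _ in range(8)])
R_b = np.stack([rodrigues(rng.normal(size=3), rng.uniform(-1, 1)) for _ in range(8)])
t_a = rng.normal(size=(8, 3))
joint_world = np.einsum("tab,b->ta", R_a, true_c_a) + t_a
t_b = joint_world - np.einsum("tab,b->ta", R_b, true_c_b)

c_a, c_b, rms = estimate_joint(R_a, t_a, R_b, t_b)
```

A low residual suggests that the two parts are connected, while a high residual suggests they are not adjacent. For a pure hinge the solution is only determined up to a shift along the hinge axis, in which case `lstsq` returns the minimum-norm choice.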

A related but distinct route is developed in “Building Rearticulable Models for Arbitrary 3D Objects from 4D Point Clouds,” where the input is a point-cloud video of an articulated object. The method first fits a relaxed piecewise-rigid model with a neural segmentation field and free 6-DoF part transforms, then projects those trajectories to a kinematic tree of 1-DoF screw joints by minimizing a projection energy that combines spatial proximity with 1-DoF consistency, and finally retargets to novel poses from sparse point correspondences. This formulation is explicitly aimed at arbitrary everyday man-made objects with an arbitrary number of parts connected in arbitrary ways through 1 degree-of-freedom joints (Liu et al., 2023).
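To make the 1-DoF screw parameterization concrete, the sketch below builds the rigid transform generated by a single screw joint from an axis direction, a point on the axis, a joint angle, and a pitch. It covers only the forward map from joint parameter to transform, not the paper's projection energy, and the function names are hypothetical.

```python
import numpy as np

def rodrigues(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about unit vector `axis`."""
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def screw_transform(axis, point, theta, pitch=0.0):
    """4x4 transform of a screw joint.

    axis:  direction of the screw axis (normalized internally)
    point: any point on the axis
    theta: joint angle in radians (the single degree of freedom)
    pitch: translation along the axis per radian; 0 gives a pure hinge
    """
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    point = np.asarray(point, dtype=float)
    R = rodrigues(axis, theta)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = point - R @ point + pitch * theta * axis
    return T

def rotation_angle(R):
    """Recover the rotation angle from a rotation matrix via its trace."""
    return float(np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)))

# A hinge parallel to z passing through (1, 0, 0), opened by 90 degrees.
hinge = screw_transform([0, 0, 1], [1, 0, 0], np.pi / 2)
on_axis = hinge @ np.array([1.0, 0.0, 5.0, 1.0])    # points on the axis do not move
off_axis = hinge @ np.array([2.0, 0.0, 0.0, 1.0])   # swings around the axis
```

Because the whole part motion is driven by the scalar `theta`, a trajectory of free 6-DoF transforms can be compared against this one-parameter family, which is the kind of consistency the projection step relies on.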

For raw scans rather than motion sequences, FAKIR fits a simple articulated sphere-mesh skeleton directly to point clouds of statues. Its Forward And bacKward Iterative Registration proceeds joint by joint, alternating forward and backward passes along chains, and estimates both pose and elementary anatomy, including bone lengths and radii. The method is designed to handle non-realistic body proportions and can be adapted to animals and imaginary creatures by altering the skeleton topology (Fu et al., 2019).

The most recent extension of this line is GaussiAnimate, which starts from temporally consistent deformable 3D Gaussians, compresses them into free-form bones, extracts a Mean Curvature Skeleton from the canonical Gaussians, and binds scaffold and skin through Partwise Motion Matching. The resulting “Skelebones” rig preserves a category-agnostic, motion-adaptive, and topology-correct kinematic structure while remaining expressive enough for complex non-rigid surface dynamics. Reported reanimation gains include PSNR improvements of 17.3% over LBS and 21.7% over Bag-of-Bones, and a 48.4% RMSE improvement over robust LBS in a low-data regime (Wang et al., 9 Apr 2026).

3. Rigging arbitrary 3D characters and latent-space posing

When a 3D asset already exists, poseability becomes a rigging and deformation problem. “Make-It-Animatable” addresses this by predicting a skeleton, blend weights, and pose-to-rest transformations directly from a point-sampled particle representation. Its particle-based shape autoencoder supports meshes and 3D Gaussian splats, while a coarse-to-fine pipeline, hierarchical sampling near hands, and a structure-aware transformer improve robustness and finger-level accuracy. The framework assumes a predefined skeleton topology but predicts joint positions, bone lengths, and orientations from geometry, and can process a typical mesh with 8k vertices in about 0.5 seconds; even 1M-face assets are reported to be processed in a few seconds (Guo et al., 2024).

The 2025 system titled “Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation” moves the problem into native 3D latent space. Instead of deforming mesh vertices, it reconstructs the character in new poses by manipulating a VecSet latent representation with a latent posing transformer. The method uses a dense pose representation aligned one-to-one with shape tokens through shared FPS query indices, supervises the target latent directly to avoid permutation ambiguity, and adds an adaptive completion module to recover newly exposed geometry such as armpits or surfaces hidden by fused limbs. Quantitatively, it reports a Chamfer Distance of $0.07 \times 10^{-3}$, F-score of $0.9858$, IoU of $0.8542$, SDF-RMSE of $0.0161$, and inference time of $0.59$ seconds, compared with $1.07 \times 10^{-3}$, $0.8788$, $0.7467$, $0.0164$, and $0.45$ seconds for Make-It-Animatable (MIA); the same five metrics are also reported for HY3D-Omni (Guo et al., 18 Dec 2025).
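The one-to-one alignment through shared FPS query indices can be sketched as follows: farthest point sampling is run once on the surface points, and the resulting indices gather both the shape queries and the pose queries, so that token i of each set refers to the same surface location. The array `pose_features` and all sizes here are hypothetical placeholders; the source does not specify how the dense pose representation is encoded.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: return indices of k points that spread out over the cloud."""
    chosen = np.empty(k, dtype=int)
    chosen[0] = start
    # Distance from every point to the nearest already-chosen point.
    dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, k):
        chosen[i] = int(np.argmax(dist))
        new_dist = np.linalg.norm(points - points[chosen[i]], axis=1)
        dist = np.minimum(dist, new_dist)
    return chosen

rng = np.random.default_rng(0)
surface_points = rng.random((500, 3))    # stand-in for points sampled on the character
pose_features = rng.random((500, 8))     # hypothetical per-point pose encoding

# One set of indices drives both gathers, so the two token sets stay aligned.
idx = farthest_point_sampling(surface_points, 64)
shape_queries = surface_points[idx]      # (64, 3)
pose_queries = pose_features[idx]        # (64, 8)
```

Sharing the indices means no matching step is needed between pose tokens and shape tokens, which is consistent with the stated goal of avoiding permutation ambiguity.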

A related generative formulation appears in PoseMaster, which unifies pose transformation and 3D character generation in a flow-based 3D native generation framework. It conditions a Rectified Flow DiT on a single image and a 21-bone skeleton represented by start and end 3D points, and uses independent random dropping of pose and image conditions to improve generalizability. The associated AnimePose dataset is built from approximately 60M samples derived from humanoid characters and animation data (Yan et al., 26 Jun 2025). This suggests that feed-forward poseable generation and explicit rigging are converging around the same skeletal abstractions, even when their deformation mechanisms differ.
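Independent dropping of the two conditions can be sketched as a pair of per-sample Bernoulli masks. The sketch below assumes that a dropped condition is replaced by zeros and uses hypothetical drop probabilities and embedding shapes; the source only states that pose and image conditions are dropped independently, with the skeleton given as 21 bones described by start and end 3D points.

```python
import numpy as np

def drop_conditions(image_emb, pose_emb, p_drop_image, p_drop_pose, rng):
    """Independently zero out the image and pose conditions per sample.

    image_emb: (B, N, C) image tokens
    pose_emb:  (B, 21, 6) bones as [start_xyz, end_xyz]
    Returns the masked conditions and the two boolean keep masks.
    """
    batch = image_emb.shape[0]
    keep_image = rng.random(batch) >= p_drop_image   # drawn separately, so all four
    keep_pose = rng.random(batch) >= p_drop_pose     # keep/drop combinations can occur
    image_out = image_emb * keep_image[:, None, None]
    pose_out = pose_emb * keep_pose[:, None, None]
    return image_out, pose_out, keep_image, keep_pose

rng = np.random.default_rng(0)
image_emb = np.ones((4, 16, 8))    # hypothetical image token grid
pose_emb = np.ones((4, 21, 6))     # 21 bones, start and end points
image_out, pose_out, keep_image, keep_pose = drop_conditions(
    image_emb, pose_emb, p_drop_image=0.1, p_drop_pose=0.1, rng=rng
)
```

Training on all four combinations lets the same network be queried with image only, pose only, both, or neither at sampling time. A learned null token could be used instead of zeros.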

4. Interaction paradigms: sketch, reference image, and text

One interaction model treats the pose itself as a user-facing input modality. “Interactive Sketching of Mannequin Poses” maps a 2D cylinder-person sketch to 3D SMPL pose through a two-stage pipeline: a Sketch Interpreter based on DensePose and Keypoint R-CNN predicts 2D joints and silhouettes, and a pretrained STRAPS network lifts them to SMPL pose and shape; the result is then rigged in Blender via Rigify and refined with FK and IK. The system is trained on 25K synthetic sketches. A user study with 12 novice users compared Sketch Prediction alone against Manual Refine on Joint3D error and completion time; Sketch + Refine achieved the best accuracy on Chamfer distance and Joint3D error, and 70% of users preferred Sketch + Refine (Unlu et al., 2022).

A more appearance-centric formulation is PersonificationNet, which makes a customized subject such as a plush toy or cartoon character act like a referenced person. It combines a subject-specific customized branch, a pose condition branch derived from ControlNet, and a Structure Alignment Module that converts a human skeleton into a subject-compatible skeleton by preserving the human pose angles while using the subject’s limb lengths or proportions. The method is trained from only 3–5 subject images per identity and a 55-image dataset for pose-branch finetuning, explicitly targeting the structure gap between human bodies and non-human subjects (Guo et al., 2024).
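The idea behind the Structure Alignment Module, keeping the human's bone directions while substituting the subject's limb lengths, can be sketched on a 2D skeleton. The function name, the three-joint chain, and the length values below are hypothetical, and the actual module may differ in how it represents angles and proportions.

```python
import numpy as np

def align_skeleton(human_joints, parents, subject_lengths, root_position):
    """Re-build a skeleton with the human's bone directions and the subject's bone lengths.

    human_joints:    (J, 2) joint positions of the reference person
    parents:         parent index per joint, -1 for the root, parents[i] < i
    subject_lengths: (J,) length of the bone ending at joint i (ignored for the root)
    """
    out = np.zeros_like(human_joints, dtype=float)
    for i, p in enumerate(parents):
        if p < 0:
            out[i] = root_position
            continue
        direction = human_joints[i] - human_joints[p]
        norm = np.linalg.norm(direction)
        if norm > 0:
            direction = direction / norm     # keep the pose angle, discard human length
        out[i] = out[p] + subject_lengths[i] * direction
    return out

# Hypothetical chain: hip -> knee -> foot, posed as "up, then right".
human_joints = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
parents = [-1, 0, 1]
subject_lengths = np.array([0.0, 0.5, 0.25])   # a short-limbed plush toy
aligned = align_skeleton(human_joints, parents, subject_lengths, root_position=[0.0, 0.0])
```

The aligned skeleton can then serve as the pose condition in place of the raw human skeleton, which is how the structure gap between human and non-human proportions is narrowed in this simplified picture.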

Language control extends the same idea to articulated 3D assets. Articulate3D is a training-free method that first uses RSActrl, a rewired self-attention mechanism inside MVDream, to generate target multi-view images under a text instruction, and then optimizes bone rotations by matching 2D keypoints between rendered source views and generated target views. The method generates the target views with DDIM sampling over a fixed set of viewpoints, and explicitly avoids differentiable rendering or SDS as the articulation signal. Reported articulation results cover CLIP Score and CLIP Directional Similarity (CDS), with CS wins on 80% of prompts and 90% human preference; for the image-generation stage alone, RSActrl is evaluated on the same metrics, with CS wins on 80% of prompts and 86% human preference (Deb et al., 26 Aug 2025).

5. Kinematic optimization, physical embodiment, and fabrication

Some “Make-It-Poseable” formulations are not primarily about neural deformation but about kinematic synthesis and control. “Rational Linkages: From Poses to 3D-printed Prototypes” interpolates up to four rigid poses with a rational motion curve on the Study quadric, factors the resulting dual-quaternion polynomial into linear factors corresponding to joints, and converts the mechanism to DH parameters and connection points suitable for CAD and 3D printing. The package also performs self-collision analysis and exposes a workflow that goes directly from pose specification to physical prototype (Huczala et al., 2024).
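The final conversion step targets DH parameters, which describe each joint-to-joint transform with four numbers. As a hedged illustration, the sketch below composes link transforms under the classic (distal) Denavit-Hartenberg convention. The convention actually used by the package is not stated in the source, and the function names and the planar two-link example are assumptions.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Classic DH link transform: Rot_z(theta) Trans_z(d) Trans_x(a) Rot_x(alpha)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def chain_pose(dh_rows):
    """Pose of the last frame for a serial chain given rows of (theta, d, a, alpha)."""
    T = np.eye(4)
    for theta, d, a, alpha in dh_rows:
        T = T @ dh_transform(theta, d, a, alpha)
    return T

# Planar two-link arm with unit links: first joint at +90 degrees, second at -90 degrees.
end_pose = chain_pose([
    (np.pi / 2, 0.0, 1.0, 0.0),
    (-np.pi / 2, 0.0, 1.0, 0.0),
])
end_position = end_pose[:3, 3]
```

Evaluating the chain at the joint values that correspond to each prescribed pose gives a quick numerical check that a synthesized mechanism actually passes through the poses it was designed for.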

A physical realization of body shaping appears in “Soft Robotic Mannequin: Design and Algorithm for Deformation Control.” Here the poseable object is a deformable torso with a soft membrane actuated by pneumatic chambers. A structured-light scanner closes the loop, ICP-based pose estimation is included in the optimization, and the expensive Jacobian evaluation is reduced through a Broyden update when possible. The system is intended to approximate target human body geometries rather than skeletal joint poses, but it still instantiates a control-oriented definition of poseability through optimized actuation and geometric feedback (Tian et al., 2022).
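The Broyden update replaces a full Jacobian re-evaluation with a rank-one correction built from the most recent actuation step and the observed change in shape. The sketch below shows the standard "good Broyden" formula; the variable names and the two-dimensional example are hypothetical, and the mannequin system's actual measurement and actuation vectors are much larger.

```python
import numpy as np

def broyden_update(J, dx, dy):
    """Rank-one Jacobian update: J' = J + ((dy - J dx) dx^T) / (dx^T dx).

    J:  current Jacobian estimate, shape (m, n)
    dx: change in actuation, shape (n,)
    dy: observed change in measured shape, shape (m,)
    """
    dx = np.asarray(dx, dtype=float)
    dy = np.asarray(dy, dtype=float)
    return J + np.outer(dy - J @ dx, dx) / (dx @ dx)

# Start from a crude guess and correct it with one observed step.
J = np.eye(2)
dx = np.array([1.0, 2.0])      # how the chamber pressures were changed
dy = np.array([3.0, -1.0])     # how the scanned geometry responded
J_new = broyden_update(J, dx, dy)
```

After the update the estimate reproduces the observed step exactly (the secant condition `J_new @ dx == dy`) while leaving its action on directions orthogonal to `dx` unchanged, which is why it is a cheap substitute when a full finite-difference Jacobian would require many physical actuations and scans.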

In collaborative robotics, poseability can mean human-like replication under task constraints. “Pose Imitation Constraints for Collaborative Robots” introduces PIC and PICs, a FABRIK-inspired framework with octant-based IN and OUT pose constraints. It is evaluated on Baxter and YuMi across an assembly task and an incision task. The evaluation compares FABRIK, PIC, and PICs on mean Pose Accuracy and, for the incision task, on the mean Percentage of Occlusion/Obstruction (Gonzalez et al., 2020).
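For reference, the unconstrained FABRIK baseline that PIC and PICs build on can be sketched in a few lines. This is plain FABRIK for a single chain with a fixed root; the octant-based IN and OUT constraints of the paper are not modeled, and the example chain and target are hypothetical.

```python
import numpy as np

def fabrik(joints, target, tol=1e-6, max_iter=100):
    """Forward And Backward Reaching IK for one chain with a fixed root."""
    joints = np.array(joints, dtype=float)
    target = np.asarray(target, dtype=float)
    lengths = np.linalg.norm(np.diff(joints, axis=0), axis=1)
    root = joints[0].copy()

    # Unreachable target: stretch the chain straight toward it.
    to_target = target - root
    if np.linalg.norm(to_target) > lengths.sum():
        direction = to_target / np.linalg.norm(to_target)
        for i in range(1, len(joints)):
            joints[i] = joints[i - 1] + lengths[i - 1] * direction
        return joints

    for _ in range(max_iter):
        if np.linalg.norm(joints[-1] - target) < tol:
            break
        # Backward pass: pin the end effector to the target, walk toward the root.
        joints[-1] = target
        for i in range(len(joints) - 2, -1, -1):
            d = joints[i] - joints[i + 1]
            joints[i] = joints[i + 1] + lengths[i] * d / np.linalg.norm(d)
        # Forward pass: pin the root back in place, walk toward the end effector.
        joints[0] = root
        for i in range(1, len(joints)):
            d = joints[i] - joints[i - 1]
            joints[i] = joints[i - 1] + lengths[i - 1] * d / np.linalg.norm(d)
    return joints

# Three unit links lying along x, asked to reach a point above the chain.
start = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
solved = fabrik(start, target=[1.5, 1.5])
```

Each pass only moves points along lines and rescales segment lengths, so no Jacobian or matrix inversion is needed. Pose-imitation constraints of the kind described above would restrict where intermediate joints are allowed to land during these passes.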

A broader assembly-design counterpart is Kinematic Kitbashing, which synthesizes functionality-aware articulated objects from existing articulated parts. Its optimizer uses a kinematics-aware attachment energy based on vector distance function features sampled across articulation snapshots and an annealed Riemannian Langevin dynamics sampler to satisfy objectives such as collision-free actuation, reachability, and trajectory following. This extends poseability from single objects to assembled mechanisms whose geometry and articulation must both remain functional (Guo et al., 14 Oct 2025).

6. Evaluation regimes, limitations, and open directions

Evaluation in this area is heterogeneous because the task itself changes with the representation and interface. Structure-discovery papers emphasize reconstruction quality, joint localization, and reanimation; interactive systems emphasize speed and user preference; text-driven systems emphasize semantic alignment and preference; latent posing systems emphasize geometric fidelity and throughput. On ZJU-MoCap, “Watch It Move” reports novel-view LPIPS, novel-view SSIM, and MPJPE (in mm) for the unmerged model, and re-posing LPIPS and SSIM for the merged model (Noguchi et al., 2021).

Evaluation goal	Representative metrics	Representative source
Re-posing from discovered structure	LPIPS, SSIM, MPJPE	(Noguchi et al., 2021)
Sketch-to-3D mannequin posing	Chamfer, Joint3D, time	(Unlu et al., 2022)
Text-driven object posing	CLIP Score, CDS, human preference	(Deb et al., 26 Aug 2025)
Feed-forward latent 3D posing	CD, F-score, IoU, SDF-RMSE, time	(Guo et al., 18 Dec 2025)
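Several of the geometric metrics in the table have short definitions that are easy to state in code. The sketch below gives one common form of Chamfer distance, F-score at a distance threshold, and MPJPE. Conventions differ between papers (squared versus unsquared distances, summed versus averaged directions, threshold choice), so these functions are illustrative assumptions and are not guaranteed to reproduce the cited numbers.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distance between every point in a (N, 3) and b (M, 3)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    d = pairwise_distances(a, b)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold."""
    d = pairwise_distances(pred, gt)
    precision = float((d.min(axis=1) < threshold).mean())
    recall = float((d.min(axis=0) < threshold).mean())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return float(np.linalg.norm(pred_joints - gt_joints, axis=-1).mean())

# Tiny example: a two-point cloud and a copy shifted one unit along z.
cloud = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
shifted = cloud + np.array([0.0, 0.0, 1.0])
cd = chamfer_distance(cloud, shifted)
fs = f_score(cloud, cloud, threshold=0.1)
err = mpjpe(cloud, cloud + np.array([3.0, 4.0, 0.0]))
```

Because the pairwise matrix is quadratic in the number of points, practical evaluations on dense meshes usually switch to a spatial index such as a k-d tree for the nearest-neighbor step.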

Several limitations recur. Methods driven by motion cues can fail when parts always move together; “Building Rearticulable Models for Arbitrary 3D Objects from 4D Point Clouds” identifies co-moving parts as a failure mode and assumes 1-DoF tree-structured mechanisms in its base formulation (Liu et al., 2023). “Make-It-Animatable” remains tied to a fixed skeleton topology per model and states that its current scope is bipedal humanoids, with non-bipedal topologies not yet covered (Guo et al., 2024). The latent-space “Make-It-Poseable” inherits the capacity limits of its underlying 3D VAE, does not natively model appearance, and treats adaptive completion as a plausible regressive reconstruction rather than a richly conditioned generative process (Guo et al., 18 Dec 2025).

Taken together, these systems indicate a transition from category-specific template rigs toward category-agnostic structure recovery, particle or latent intermediate representations, and hybrid scaffold–skin decompositions. The unresolved tension is between explicit controllability and high-frequency, topology-changing detail. Explicit skeletons, screw joints, and CAD linkages provide interpretable control; latent tokens, 3DGS bones, and adaptive completion recover surfaces and dynamics that classical skinning handles poorly. Current research suggests that the most robust formulations are increasingly hybrid: a semantically meaningful kinematic core coupled to a richer geometric or latent exterior that can preserve identity, repair topology, and generalize across pose regimes (Wang et al., 9 Apr 2026, Guo et al., 18 Dec 2025).