Guanxing Lu¹, Shiyi Zhang¹, Ziwei Wang²🖂, Changliu Liu², Jiwen Lu³, Yansong Tang¹
🖂 Corresponding author.
Email: {lgx23@mails.,sy-zhang23@mails.,lujiwen@,tang.yansong@sz.}tsinghua.edu.cn
Email: {ziweiwa2@,cliu6@}andrew.cmu.edu
Abstract
Performing language-conditioned robotic manipulation tasks in unstructured environments is in high demand for general intelligent robots. Conventional robotic manipulation methods usually learn a semantic representation of the observation for action prediction, which ignores the scene-level spatiotemporal dynamics for human goal completion. In this paper, we propose a dynamic Gaussian Splatting method named ManiGaussian for multi-task robotic manipulation, which mines scene dynamics via future scene reconstruction. Specifically, we first formulate the dynamic Gaussian Splatting framework that infers the semantics propagation in the Gaussian embedding space, where the semantic representation is leveraged to predict the optimal robot action. Then, we build a Gaussian world model to parameterize the distribution in our dynamic Gaussian Splatting framework, which provides informative supervision in the interactive environment via future scene reconstruction. We evaluate our ManiGaussian on RLBench tasks with variations, and the results demonstrate that our framework can outperform the state-of-the-art methods by % in average success rate. (Project page: https://guanxinglu.github.io/ManiGaussian/)
Keywords:
Multi-task robotic manipulation · Dynamic Gaussian Splatting · World model
1 Introduction
Designing autonomous agents for language-conditioned manipulation tasks [28, 2, 26, 70, 54, 55, 9] has long been desired in the pursuit of artificial intelligence. In realistic deployment, intelligent robots are usually required to deal with unseen scenarios in novel tasks. Therefore, comprehending complex 3D structures in the deployment scenes is necessary for robots to achieve high task success rates across diverse manipulation tasks.
To address these challenges, previous arts have made great progress in general manipulation policy learning, which can be divided into two categories: perceptive methods and generative methods. In the former, semantic features extracted by perceptive models are directly leveraged to predict robot actions according to visual input such as images [36, 13, 12], point clouds [5, 11, 71] and voxels [25, 55]. However, perceptive methods heavily rely on multi-view cameras to cover the whole workbench to deal with the occlusion problem within unstructured environments, which restricts their deployment. To this end, generative methods [43, 42, 47, 31, 20, 68, 69, 32] capture 3D scene structure information by reconstructing the scene and objects from arbitrary novel views with self-supervised learning. Nevertheless, they ignore the spatiotemporal dynamics that depict the physical interaction among objects during manipulation, and the predicted actions still fail to complete human goals without correct object interactions. Figure 1 shows a comparison of manipulation achieved by a conventional generative manipulation method (top) and the proposed method (bottom), where the conventional method fails to stack the two rose blocks due to its poor comprehension of scene dynamics.
In this paper, we propose ManiGaussian, a method that leverages a dynamic Gaussian Splatting framework for multi-task robotic manipulation. Different from conventional methods that only focus on semantic representation, our method mines the scene-level spatiotemporal dynamics via future scene reconstruction. Therefore, the interaction among objects can be comprehended for accurate manipulation action prediction. More specifically, we first formulate the dynamic Gaussian Splatting framework that models the propagation of diverse semantic features in the Gaussian embedding space, and the semantic features with scene dynamics are leveraged to predict the optimal robot actions for general manipulation tasks. We then build a Gaussian world model to parameterize the distributions in our dynamic Gaussian Splatting framework. Therefore, our framework can acquire informative supervision in interactive environments by reconstructing the future scene according to the current scene and the robot actions, where we constrain consistency between reconstructed and realistic future scenes for dynamics mining. We evaluate our ManiGaussian method on the RLBench dataset [24] with tasks and variants, where our method outperforms the state-of-the-art multi-task robotic manipulation methods by % in the average task success rate. Our contributions can be summarized as follows:
- We propose a dynamic Gaussian Splatting framework to learn the scene-level spatiotemporal dynamics in general robotic manipulation tasks, so that the robotic agent can complete human instructions with accurate action prediction in unstructured environments.
- We build a Gaussian world model to parameterize distributions in our dynamic Gaussian Splatting framework, which can provide informative supervision to learn scene dynamics from the interactive environment.
- We conduct extensive experiments on RLBench tasks, and the results demonstrate that our method achieves a higher success rate than the state-of-the-art methods with less computation.
2 Related Work
Visual Representations for Robotic Manipulation. Developing intelligent agents for language-conditioned manipulation tasks in complex and unstructured environments has been a longstanding objective. One of the key bottlenecks in achieving this goal is effectively representing visual information of the scene. Prior arts can be categorized into two branches: perceptive methods and generative methods. Perceptive methods directly utilize pretrained 2D [36, 13, 12, 71] or 3D [5, 11, 25, 55] visual representation backbones to learn scene embeddings, where optimal robot actions are predicted based on the scene semantics. For example, InstructRL [36] and Hiveformer [13] directly passed 2D visual tokens through a multi-modal transformer to decode gripper actions, but struggled to handle complex manipulation tasks due to the lack of geometric understanding. To incorporate 3D information beyond images, PolarNet [5] and Act3D [11] utilized point cloud representations, where PolarNet used a PointNeXt [44]-based architecture and Act3D designed a ghost point sampling mechanism to decode actions. Moreover, PerAct [55] fed voxel tokens into a PerceiverIO [23]-based transformer policy, demonstrating impressive performance in a variety of manipulation tasks. However, perceptive methods heavily rely on seamless camera coverage for comprehensive 3D understanding, which makes them less effective in unstructured environments. To address this, generative methods [43, 42, 47, 31, 20, 68, 69, 32] have gained attention, which learn the 3D geometry through self-supervised novel view reconstruction. For instance, Li et al. [32] combined NeRF and time contrastive learning to embed 3D geometry and learn fluid dynamics within an autoencoder framework. GNFactor [69] optimized a generalizable NeRF with a reconstruction loss besides behavior cloning, and showed effective improvement in both simulated and real scenarios. However, conventional generative methods usually ignore the scene-level spatiotemporal dynamics that demonstrate the interaction among objects, and the predicted actions still fail to achieve human goals because of incorrect interactions.
World Models. In recent years, world models have emerged as an effective approach to encode scene dynamics by predicting future states given the current state and actions, and have been explored in autonomous driving [57, 10, 21, 22], game agents [14, 15, 16, 17, 18, 66, 49] and robotic manipulation [19, 61, 50]. Early works [14, 15, 16, 17, 18, 19, 66, 49, 50] learned a latent space for future prediction by autoencoding, which acquired notable effectiveness in both simulated and real-world situations [61]. However, learning latent representations for accurate future prediction requires a large amount of data and is limited to simple tasks such as robot control due to the weak representative ability of the implicit features. To address these limitations, explicit representations in the image domain [7, 51, 60, 40] and the language domain [34, 38, 58] have been widely studied because of their rich semantics. UniPi [7] reconstructed future images with a text-conditional video generation model, employing an inverse dynamics model to obtain the intermediate actions. Dynalang [34] learned to predict text representations as future states, and enabled embodied agents to navigate in photorealistic home scans under human instructions. In contrast to these approaches, we generalize the world model to the embedding space of dynamic Gaussian Splatting, which predicts the future state for the agent to learn scene-level dynamics from interactive environments.
Gaussian Splatting. Gaussian Splatting [29] models scenes with a set of 3D Gaussians which are projected to 2D planes with efficient differentiable splatting. It achieves higher effectiveness and efficiency than implicit representations such as Neural Radiance Fields (NeRF) [41, 27, 35, 32, 6, 53, 69], with fast inference, high fidelity, and strong editability for novel view synthesis. To deploy Gaussian Splatting in diverse complex scenarios, many variants have been proposed to enhance the generalization ability, enrich the semantic information, and reconstruct deformable scenes. For higher generalization ability across diverse scenes, recent works [56, 72, 74, 4, 8, 63] constructed a direct mapping from pixels to Gaussian parameters, where the latent features were learned from large-scale datasets. To integrate rich semantic information into Gaussian Splatting, many efforts [45, 75, 52] have distilled Gaussian radiance fields from pretrained foundation models [46, 3, 48]. For instance, LangSplat [45] advanced the Gaussian representation by encoding language features distilled from CLIP [46] using a scene-wise language autoencoder, enabling efficient open-vocabulary localization compared with its NeRF-based counterpart [30]. For deformation modeling, time-variant Gaussian radiance fields [64, 62, 65, 39, 59, 33, 1] were reconstructed from videos instead of images, and are widely applied in applications such as surgical scene reconstruction [73, 37]. Although these approaches have achieved high-quality reconstruction from entire videos in an interpolation manner, extrapolation to future states conditioned on previous states and actions remains unexplored, which holds significance for scene-level dynamics modeling for interactive agents. In this paper, we formulate a dynamic Gaussian Splatting framework to model the scene dynamics of object interactions, which enhances the physical reasoning of agents to complete a wide range of robotic manipulation tasks.
3 Approach
In this section, we first briefly introduce preliminaries on the problem formulation (Section 3.1), and then present an overview of our pipeline (Section 3.2). Subsequently, we introduce our dynamic Gaussian Splatting framework (Section 3.3) that infers the semantics propagation of the manipulation scenarios in the Gaussian embedding space. To enable our dynamic Gaussian Splatting framework to learn scene dynamics from the interactive environment, we build a Gaussian world model (Section 3.4) that reconstructs future scenes according to the propagated semantics.
3.1 Problem Formulation
The demand for language-conditioned robotic manipulation is a significant aspect in the development of general intelligent robots. The agent is required to interactively predict the subsequent pose of the robot arm based on the observation and achieve the pose with a low-level motion planner to complete a wide range of manipulation tasks described by humans. The visual input at step $t$ for the agent is defined as $o_t$, which contains the single-view RGB images and the depth images. The proprioception matrix indicates the gripper state including the end-effector position, openness, and current timestep. Based on the visual input and the language instruction, the agent is required to generate the optimal action $a_t$ for the robot arm and grippers, which demonstrates the modification of the robot arm in translation and rotation, the motion-planner state, and the open state of the grippers. To learn the manipulation policy effectively, expert demonstrations are provided as offline datasets for imitation learning, where the sample triplets contain the visual input, the language instruction and the expert actions. Existing methods leverage powerful visual representations to learn informative latent features for optimal action prediction. However, they ignore the spatiotemporal dynamics that depict the physical interaction among objects, and the predicted actions usually fail to complete complex human goals without correct object interactions. On the contrary, we present a dynamic Gaussian Splatting framework to mine the scene dynamics for robotic manipulation.
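The action structure described above can be sketched as a small container type. This is a minimal illustration of the interface, not the paper's implementation; the field names are our own.

```python
from dataclasses import dataclass

@dataclass
class RobotAction:
    """Action predicted at each step (illustrative field names).

    Translation and rotation give the next end-effector pose; the two
    flags cover the gripper open state and the motion-planner state."""
    translation: tuple        # (x, y, z) target position of the end-effector
    rotation: tuple           # quaternion orientation of the end-effector
    gripper_open: bool        # open vs. closed gripper
    use_motion_planner: bool  # whether to invoke collision-aware planning
```

In practice each continuous component is discretized into bins so the action decoder can treat prediction as classification, as in the action loss of Section 3.4.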
3.2 Overall Pipeline
The overall pipeline of our ManiGaussian method is shown in Figure 2, in which we construct a dynamic Gaussian Splatting framework that models the propagation of diverse semantic features in the Gaussian embedding space for manipulation. We also build a Gaussian world model to parameterize distributions in our dynamic Gaussian Splatting framework, which can provide informative supervision of scene dynamics by future scene reconstruction. More specifically, we transform the visual input from RGB-D cameras into a volumetric representation by lifting and voxelization for data preprocessing. For dynamic Gaussian Splatting, we leverage a Gaussian regressor to infer the Gaussian distribution of geometric and semantic features in the scene, which are propagated along time steps with rich scene-level spatiotemporal dynamics. For the Gaussian world model, we instantiate a deformation field to reconstruct the future scene according to the current scene and the robot actions, and require consistency between reconstructed and realistic scenes for dynamics mining. Therefore, the spatiotemporal dynamics indicating object correlation can be embedded into the geometric and semantic features learned in the dynamic Gaussian Splatting framework. Finally, we employ a multi-modal transformer, PerceiverIO [23], to predict the optimal robot actions for general manipulation tasks, which considers the geometric and semantic features in the Gaussian embedding space together with human language instructions.
3.3 Dynamic Gaussian Splatting for Robotic Manipulation
In order to capture the scene-level dynamics for general manipulation tasks, we propose a dynamic Gaussian Splatting framework that models the propagation of diverse semantic features within the Gaussian embedding space. While the vanilla Gaussian Splatting has remarkable effectiveness and efficiency in reconstructing static environments, it fails to capture the scene dynamics for manipulation due to the lack of temporal information. To this end, we formulate a dynamic Gaussian Splatting framework based on the vanilla Gaussian Splatting methodology by enabling the Gaussian points of the scene representation to move with robotic manipulation, which demonstrates the physical interactions between objects. The scene representation of our dynamic Gaussian Splatting framework contains the geometric features depicting the explicit visual clues and the semantic features illustrating the implicit high-level visual features, which are utilized to predict the optimal action for the robot arm and grippers.
Dynamic Gaussian Splatting. Gaussian Splatting [29] is a promising approach for multi-view 3D reconstruction, which exhibits fast inference, high fidelity, and strong editability of generated content compared with Neural Radiance Fields (NeRF) [41]. Gaussian Splatting represents a 3D scene explicitly with multiple Gaussian primitives, where the $i$-th Gaussian primitive is parameterized by $\theta^i = \{\mu^i, r^i, c^i, s^i, \sigma^i\}$, whose elements respectively represent the position, rotation, color, scale, and opacity of the Gaussian primitive. To render a novel view, we project the Gaussian primitives onto the 2D plane by differentiable tile-based rasterization. The value of each pixel can be rendered by alpha-blend rendering:
$$C = \sum_{i \in \mathcal{N}} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j), \qquad (1)$$
where $C$ is the rendered pixel value, $\mathcal{N}$ denotes the set of Gaussians in this tile, and $\alpha_i$ represents the 2D density of the $i$-th Gaussian point in the splatting process, which is computed from the opacity and the covariance matrix acquired from the rotation and scale of the Gaussian parameters. However, the vanilla Gaussian Splatting encounters difficulties in reconstructing temporal information in changing environments, which limits its ability to model the scene-level dynamics that are crucial for manipulation tasks. To address this, we enable the Gaussian particles to be propagated over time to capture the spatiotemporal dynamics of the scene. The parameters of the $i$-th Gaussian primitive at step $t$ can be expressed as follows:
$$\theta_t^i = \{\mu_t^i, r_t^i, c^i, s^i, \sigma^i, f^i\}, \qquad (2)$$
The position $\mu_t^i$ and rotation $r_t^i$ with the subscript $t$ represent their counterparts at step $t$ in the propagation, and $f^i$ is the high-level semantic feature distilled from the Stable Diffusion [48] visual encoder based on the RGB images of the scene. In robotic manipulation, all objects are regarded as rigid bodies whose inherent properties including colors, scales, opacity and semantic features do not change, and $c^i$, $s^i$, $\sigma^i$ and $f^i$ are therefore regarded as time-independent parameters. The positions and rotations of the Gaussian particles change during manipulation due to the physical interaction between objects and robot grippers, which can be formulated as follows:
$$\mu_{t+1}^i = \mu_t^i + \Delta\mu_t^i, \qquad r_{t+1}^i = r_t^i + \Delta r_t^i, \qquad (3)$$
where $\Delta\mu_t^i$ and $\Delta r_t^i$ denote the change of position and rotation from step $t$ to the next one for the $i$-th Gaussian primitive. With the time-dependent parameters of the Gaussian mixture distribution, the pixel values in 2D views of the scene can still be rendered by (1).
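Equations (1) and (3) can be sketched as follows. This is a minimal per-pixel reference in NumPy under simplifying assumptions (Gaussians already depth-sorted and projected to per-pixel densities; the actual method uses differentiable tile-based rasterization on the GPU):

```python
import numpy as np

def alpha_blend_pixel(colors, alphas):
    """Render one pixel from depth-sorted Gaussians, following Eq. (1).

    colors: (N, 3) RGB of each Gaussian; alphas: (N,) 2D densities
    of the Gaussians evaluated at this pixel, front to back."""
    pixel = np.zeros(3)
    transmittance = 1.0  # running product of (1 - alpha_j)
    for c, a in zip(colors, alphas):
        pixel += c * a * transmittance
        transmittance *= 1.0 - a
    return pixel

def propagate_gaussians(mu, rot, delta_mu, delta_rot):
    """Advance the time-dependent parameters one step, following Eq. (3).

    mu: (N, 3) positions; rot: (N, 4) quaternions; the deltas come from
    the deformation predictor. Colors, scales, opacity and semantic
    features stay fixed under the rigid-body assumption."""
    mu_next = mu + delta_mu
    rot_next = rot + delta_rot
    # keep quaternions unit-norm so they remain valid rotations
    rot_next = rot_next / np.linalg.norm(rot_next, axis=-1, keepdims=True)
    return mu_next, rot_next
```

The propagated parameters plug straight back into the same renderer, which is what lets the world model of Section 3.4 supervise the deformation through photometric losses.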
Gaussian World Model. In our implementation, we present a Gaussian world model to parameterize the Gaussian mixture distribution in dynamic Gaussian Splatting, through which the future scene can be reconstructed via parameter propagation. Therefore, the dynamic Gaussian Splatting model can acquire informative supervision in the interactive environment by enforcing consistency between the reconstructed and realistic future scenes. World models are effective in learning environmental dynamics for downstream tasks by anticipating the future state based on the current state and action at step $t$, and have been applied to a variety of tasks including autonomous driving [57, 10, 21, 22] and game agents [14, 15, 16, 17, 18, 66, 49]. For our robotic manipulation tasks, we instantiate the current state in the world model as the visual observation at the current step, and the actions refer to those of the robot arm and grippers. They are leveraged to predict the visual scene observed in the next step, which represents the future state. More specifically, the Gaussian world model contains a representation network $V$ that learns high-level visual features with rich semantics from the input observation, a Gaussian regressor $G$ that predicts the Gaussian parameters of different primitives based on the visual features, a deformation predictor $D$ that infers the difference of Gaussian parameters during the propagation, and a Gaussian renderer $R$ that generates the RGB images for the predicted future state:
$$\hat{o}_{t+1} = R(D(G(V(o_t)), a_t), v), \qquad (4)$$
where $o_t$ and $V(o_t)$ mean the visual observation and the corresponding high-level visual features at step $t$, and $v$ is the camera pose of the view onto which we project the Gaussian primitives. We leverage multi-head neural networks as the Gaussian regressor, where each head predicts a specific feature of the Gaussian parameters shown in (2). By inferring the changes of positions and rotations between consecutive steps, we acquire the propagated Gaussian parameters of the future step based on (3). Finally, the Gaussian renderer projects the propagated Gaussian distribution to a specific view for future scene reconstruction.
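The composition in Eq. (4) can be sketched as a single rollout step. The four callables are stand-ins for the learned modules (the names are illustrative, not the paper's code):

```python
def predict_next_observation(obs_t, action_t, cam_pose,
                             repr_net, gaussian_regressor,
                             deformation_predictor, renderer):
    """One Gaussian world-model rollout step, mirroring Eq. (4).

    repr_net encodes the observation, gaussian_regressor maps features
    to Gaussian parameters, the deformation predictor infers per-Gaussian
    offsets conditioned on the action, and the renderer splats the
    propagated Gaussians from `cam_pose`."""
    feats = repr_net(obs_t)                                # V(o_t)
    params = gaussian_regressor(feats)                     # G(.)
    params_next = deformation_predictor(params, action_t)  # D(., a_t)
    return renderer(params_next, cam_pose)                 # R(., v)
```

Because every stage is differentiable, the photometric error between the rendered and realistic future views in Section 3.4 back-propagates through the whole chain.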
3.4 Learning Objectives
Current Scene Consistency Loss. Accurately reconstructing the current scene based on the current Gaussian parameters can enhance the performance of the Gaussian regressor. To achieve this goal, we introduce a consistency objective between the realistic current observation $o_t$ and the rendered observation $\hat{o}_t$ obtained from the current Gaussian parameters:
$$\mathcal{L}_{cur} = \left\| \hat{o}_t - o_t \right\|^2, \qquad (5)$$
where $\hat{o}_t$ and $o_t$ respectively mean the prediction and groundtruth of the observation images from different views at step $t$.
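A minimal sketch of this photometric term, assuming a squared-error formulation averaged over all supervision views (per-view weighting omitted):

```python
import numpy as np

def current_scene_loss(rendered, observed):
    """Mean-squared photometric error in the spirit of Eq. (5).

    rendered / observed: (V, H, W, 3) stacks of images over V views."""
    return float(np.mean((rendered - observed) ** 2))
```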
Semantic Feature Consistency Loss. The semantic features contain high-level visual information of the observed scenes. Since foundation models can extract informative semantic features for general scenes, we expect the semantic features in our Gaussian parameters to mimic those acquired by large pre-trained models such as Stable Diffusion [48], so that the knowledge learned by the pre-trained models can be distilled into our Gaussian world model according to the following objective:
$$\mathcal{L}_{sem} = 1 - \cos(\hat{f}_t, f_t), \qquad (6)$$
where $\hat{f}_t$ and $f_t$ are the projected map of the semantic features in the Gaussian parameters and the feature map learned by the pre-trained model, and $\cos(\cdot, \cdot)$ means the cosine similarity between the variables.
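The distillation term of Eq. (6) can be sketched as a per-pixel cosine distance averaged over the feature map (a simplified reference; the feature shapes and epsilon guard are our assumptions):

```python
import numpy as np

def semantic_consistency_loss(rendered_feat, teacher_feat):
    """Cosine-distance loss between the splatted semantic feature map and
    the feature map from a pretrained encoder. Shapes: (H, W, C)."""
    eps = 1e-8  # guards against zero-norm feature vectors
    a = rendered_feat / (np.linalg.norm(rendered_feat, axis=-1, keepdims=True) + eps)
    b = teacher_feat / (np.linalg.norm(teacher_feat, axis=-1, keepdims=True) + eps)
    cos_sim = np.sum(a * b, axis=-1)      # per-pixel cosine similarity
    return float(np.mean(1.0 - cos_sim))  # distance averaged over pixels
```

Cosine distance is scale-invariant, which suits distillation: only the direction of the teacher features needs to be matched, not their magnitude.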
Action Prediction Loss. The distribution parameters in our dynamic Gaussian framework are leveraged to predict the optimal action of the robot arm and grippers for general manipulation tasks. We employ a multi-modal transformer, PerceiverIO [23], to infer the selection probability of different action candidates based on the Gaussian parameters and the human language instructions, and leverage the cross-entropy loss for accurate action prediction:
$$\mathcal{L}_{action} = -\log p_{trans} - \log p_{rot} - \log p_{open} - \log p_{collide}, \qquad (7)$$
where $p_{trans}$, $p_{rot}$, $p_{open}$ and $p_{collide}$ represent the predicted probability of the groundtruth actions for translation, rotation, gripper openness and motion-planner state of the robot, respectively.
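The per-head term of Eq. (7) is an ordinary cross-entropy over discretized action bins; a minimal sketch for one head (the full loss sums this over the translation, rotation, gripper-openness and motion-planner heads):

```python
import numpy as np

def action_head_loss(logits, target_idx):
    """Cross-entropy for one discretized action head, as in Eq. (7).

    logits: (K,) unnormalized scores over K action bins.
    target_idx: index of the expert (ground-truth) bin."""
    z = logits - logits.max()                # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())  # log-softmax
    return float(-log_probs[target_idx])     # -log p(ground-truth bin)
```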
Future Scene Consistency Loss. We require consistency between the reconstructed and realistic future scenes, so that the dynamic Gaussian Splatting framework can accurately embed scene-level spatiotemporal dynamics in the Gaussian parameters. Specifically, the training objective aligns the predicted future scenes based on different observations and actions with the realistic ones, which can be formulated as follows:
$$\mathcal{L}_{fut} = \left\| \hat{o}_{t+1} - o_{t+1} \right\|^2, \qquad (8)$$
where $\hat{o}_{t+1}$ means the predicted future observation of the scene at step $t+1$ based on the action $a_t$ and the current observation $o_t$, and $o_{t+1}$ is the realistic counterpart. The predicted action at step $t$ is acquired from the action decoder according to the Gaussian parameters, and the reconstructed observation of the current scene can be obtained by rendering the current Gaussian parameters. The objective enforces the dynamic Gaussian Splatting framework to learn informative Gaussian parameters to generate optimal actions and recover the current observation, so that the future observation can be accurately reconstructed with embedded spatiotemporal dynamics.
The overall objective for our ManiGaussian agent is written as a weighted combination of different loss terms:
$$\mathcal{L} = \mathcal{L}_{action} + \lambda_{cur}\mathcal{L}_{cur} + \lambda_{sem}\mathcal{L}_{sem} + \lambda_{fut}\mathcal{L}_{fut}, \qquad (9)$$
where $\lambda_{cur}$, $\lambda_{sem}$ and $\lambda_{fut}$ are the hyperparameters that control the importance of the different terms during training. In training, we set a warm-up phase that freezes the deformation predictor to first learn a stable representation model and Gaussian regressor during the first k iterations. After the warm-up phase, we jointly train the whole Gaussian world model with the action decoder until training ends.
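The weighted combination of Eq. (9) together with the warm-up schedule can be sketched as follows (loss keys, weight values and the zeroing-out of the future term during warm-up are illustrative placeholders, not the paper's exact hyperparameters):

```python
def total_loss(losses, weights, step, warmup_steps):
    """Weighted sum of Eq. (9) with the warm-up described above.

    losses / weights: dicts keyed by term name, e.g. "action", "current",
    "semantic", "future". While the deformation predictor is frozen
    (step < warmup_steps), the future-scene term contributes nothing."""
    active = dict(weights)
    if step < warmup_steps:
        active["future"] = 0.0  # deformation predictor frozen during warm-up
    return sum(active[k] * losses[k] for k in losses)
```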
4 Experiments
In this section, we first introduce the experiment setup including datasets, baseline methods, and implementation details (Section 4.1). Then we compare our method with the state-of-the-art approaches to show the superiority in success rate (Section 4.2), and conduct an ablation study to verify the effectiveness of different components in our dynamic Gaussian Splatting framework and the Gaussian world model (Section 4.3). Finally, we also illustrate the visualization results to depict our intuition (Section 4.4). Further results and case studies can be found in the supplementary material.
4.1 Experiment Setup
Dataset. Our experiments are conducted on the popular RLBench [24] simulated tasks. Following [69], we utilize a curated subset of challenging language-conditioned manipulation tasks from RLBench, which includes variations in object properties and scene arrangement. The diversity of these tasks requires the agent to acquire generalizable knowledge about the intrinsic scene-level spatiotemporal dynamics for manipulation, rather than solely mimicking the provided expert demonstrations to achieve high success rates. We evaluate 25 episodes in the testing set for each task to avoid result bias from noise. For visual observation, we employ RGB-D images captured by a single front camera at a fixed resolution. During the training phase, we use the demonstrations collected by GNFactor [69] for each task.
Baselines. We compare our ManiGaussian with the previous state-of-the-art methods, including the perceptive method PerAct [55] and its modified version using four camera inputs to cover the workbench, as well as the generative method GNFactor [69]. The evaluation metric is the task success rate, which measures the percentage of completed episodes. An episode is considered successful if the agent completes the goal specified in natural language within a maximum number of steps.
Implementation Details. We apply the same data augmentation as [55, 69] to the expert demonstrations in the training set to enhance the generalizability of the agents. To ensure consistency and mitigate the impact of parameter size, we utilize the same version of PerceiverIO [23] as the action decoder across all baselines. All compared methods are trained on two NVIDIA RTX 4090 GPUs for k iterations with a batch size of . We employ the LAMB optimizer [67] with an initial learning rate of , and adopt a cosine scheduler with warmup in the first 3k steps.
Method / Task | close jar | open drawer | sweep to dustpan | meat off grill | turn tap |
PerAct | |||||
PerAct (4 cameras) | |||||
GNFactor | |||||
ManiGaussian (ours) |
Method / Task | slide block | put in drawer | drag stick | push buttons | stack blocks | Average |
PerAct | ||||||
PerAct (4 cameras) | ||||||
GNFactor | ||||||
ManiGaussian (ours) |
Geo. | Sem. | Dyna. | Planning | Long | Tools | Motion | Screw | Occlusion | Average |
✗ | ✗ | ✗ | |||||||
✓ | ✗ | ✗ | |||||||
✓ | ✓ | ✗ | |||||||
✓ | ✗ | ✓ | |||||||
✓ | ✓ | ✓ |
4.2 Comparison with the State-of-the-Art Methods
In this section, we compare our ManiGaussian with previous state-of-the-art methods on the RLBench task suite. Table 1 shows the average success rate on each task. Our method achieves the best performance with an average success rate of %, outperforming the previous arts, including both perceptive and generative methods, by a sizable margin. The dominant generative method GNFactor [69] leveraged a generalizable NeRF to learn informative latent representations for optimal action prediction, which showed effective improvement beyond the perceptive methods. However, it ignores the scene-level spatiotemporal dynamics that demonstrate the interaction among objects, and its predicted actions still fail to achieve human goals because of incorrect interactions. On the contrary, our ManiGaussian learns the scene dynamics with the proposed dynamic Gaussian Splatting framework, so that the robotic agent can complete human instructions with accurate action prediction in unstructured environments. As a result, our method outperforms the second-best GNFactor method by a relative improvement of %. In addition, we excel in most tasks, achieving more than double the accuracy of the previous methods in the "sweep to dustpan", "put in drawer" and "drag stick" tasks. In the "meat off grill" task, where the best performance was not reached, our method still ranks second best. In conclusion, the experimental results illustrate the effectiveness of our proposed method across multiple language-conditioned robotic manipulation tasks.
4.3 Ablation Study
Our dynamic Gaussian Splatting framework models the propagation of diverse features in the Gaussian embedding space, and the Gaussian world model reconstructs the future scene according to the current scene by constraining the consistency between reconstructed and realistic scenes for dynamics mining. We conduct an ablation study to verify the effectiveness of each presented component in Table 2, where we manually group the RLBench tasks into categories according to their main challenges to explain the sources of improvement. We first implement a vanilla baseline without any proposed technique, where we directly train the representation model and the action decoder in the Gaussian world model to predict the optimal robot action. By adding the Gaussian regressor to predict the Gaussian parameters in the dynamic Gaussian Splatting framework, the performance improves by % compared with the vanilla baseline. Especially, in tasks that require geometric reasoning such as "Occlusion", "Tools" and "Screw", it outperforms the vanilla version by sizable margins, which proves the ability of Gaussian Splatting to model the spatial information for manipulation tasks. We then add semantic features distilled from the pretrained foundation model into the dynamic Gaussian Splatting framework, which contain high-level visual information of the observed scenes. By adding the semantic features and the related consistency loss, we observe that the average success rate increases by % over the geometry-only version, which indicates the benefits of rich semantic information for robotic manipulation. Besides, we implement the deformation predictor and the corresponding future scene consistency loss on top of the vanilla baseline to mine the scene-level dynamics via future scene reconstruction, resulting in a dramatic performance improvement of %.
In particular, the proposed deformation predictor improves the task completion of most task types, including "Planning", "Long", "Motion", and "Occlusion", which demonstrates the importance of the scene-level dynamics encoded by the deformation predictor in the Gaussian world model. After combining all the proposed techniques in our dynamic Gaussian Splatting framework, the performance increases from % to %, which verifies the necessity of the scene-level spatiotemporal dynamics mined by the proposed dynamic Gaussian Splatting framework with the Gaussian world model parameterization.
Figure 3 shows the learning curves of the proposed ManiGaussian and the state-of-the-art method GNFactor, where we save checkpoints and test them every k parameter updates. Both compared methods converge within k training steps. As shown in Figure 3, our ManiGaussian outperforms GNFactor with both better final performance and faster training, which also demonstrates the efficiency of explicit Gaussian scene reconstruction over implicit approaches such as NeRF.
4.4 Qualitative Analysis
Visualization of Whole Trajectories. We present two qualitative examples of the action sequences generated by GNFactor and our ManiGaussian in Figure 4. In the top case, the agent is instructed to "slide the block to yellow target". The results show that the previous agent struggles to complete the task since it imitates the expert's backward pulling motion, even though the claw is already leaning towards the right side of the red block. In contrast, our ManiGaussian returns to the red block and successfully slides it to the yellow target, because our method can correctly understand the scene dynamics of objects in contact. In the bottom case, the agent is instructed to "turn left tap". The results show that the GNFactor agent misunderstands the meaning of "left", operates the right tap instead, and also fails to turn on the tap. Our ManiGaussian successfully completes the task, which shows that our method can not only understand the semantic information but also execute operations accurately.
Visualization of Novel View Synthesis. Figure 5 shows the novel view image synthesis results, where the action loss is removed for all compared methods for better visualization. The results show that, first, given only the front view observation where the claw shape cannot be seen, our ManiGaussian models the cubes in novel views with superior detail within the same number of training steps. Second, our method accurately predicts future states and can model interactions between objects on the basis of the accurately recovered object details. For example, in the top case of the “slide block” task, our ManiGaussian not only predicts the future gripper position that corresponds to the human instruction, but also predicts the future object location influenced by the gripper, based on its understanding of the physical interaction among objects. This qualitative result demonstrates that our ManiGaussian successfully learns the intricate scene-level dynamics.
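The future-scene supervision behind these predictions can be viewed, at its simplest, as a photometric loss between the image rendered from the predicted next-step Gaussians and the observed next-frame image. The sketch below illustrates this idea only; `rendered_next` stands in for the output of a differentiable Gaussian Splatting rasterizer, which is not reproduced here, and the plain L2 loss is an assumption for exposition.

```python
import torch

def future_reconstruction_loss(rendered_next, observed_next):
    """Illustrative photometric supervision for future-scene reconstruction.

    rendered_next: (3, H, W) image rendered from the predicted next-step
                   Gaussians (assumed to come from a differentiable rasterizer).
    observed_next: (3, H, W) ground-truth next-frame observation.
    Both are RGB tensors in [0, 1]; a plain L2 loss is used as a stand-in.
    """
    return torch.mean((rendered_next - observed_next) ** 2)

# Usage on dummy images: the loss is a non-negative scalar that can be
# backpropagated through the (hypothetical) rasterizer and world model.
rendered = torch.rand(3, 64, 64)
observed = torch.rand(3, 64, 64)
loss = future_reconstruction_loss(rendered, observed)
```

Because the rendering is differentiable, the gradient of this loss flows back into the predicted Gaussian parameters, so errors in the imagined future scene directly supervise the dynamics model.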
5 Conclusion
In this paper, we have presented ManiGaussian, an agent that encodes scene-level spatiotemporal dynamics for language-conditioned manipulation. We design a dynamic Gaussian Splatting framework that models the propagation of diverse semantic features in the Gaussian embedding space, where the semantic features with scene dynamics are leveraged to predict the optimal robot actions. Subsequently, we build a Gaussian world model to parameterize the distributions in the dynamic Gaussian Splatting framework, which mines scene-level dynamics by reconstructing the future scene from the current scene and the robot actions. Extensive experiments on a wide variety of manipulation tasks demonstrate the superiority of our ManiGaussian over state-of-the-art methods. The limitation of our ManiGaussian stems from the necessity of multi-view supervision with camera calibration for the dynamic Gaussian Splatting framework.
References
- [1]Abou-Chakra, J., Rana, K., Dayoub, F., Sünderhauf, N.: Physically embodied gaussian splatting: Embedding physical priors into a visual 3d world model for robotics. In: CoRL (2023)
- [2]Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: Rt-1: Robotics transformer for real-world control at scale. arXiv (2022)
- [3]Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)
- [4]Charatan, D., Li, S., Tagliasacchi, A., Sitzmann, V.: pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. arXiv preprint arXiv:2312.12337 (2023)
- [5]Chen, S., Garcia, R., Schmid, C., Laptev, I.: Polarnet: 3d point clouds for language-guided robotic manipulation. arXiv preprint arXiv:2309.15596 (2023)
- [6]Driess, D., Schubert, I., Florence, P., Li, Y., Toussaint, M.: Reinforcement learning with neural radiance fields. NeurIPS (2022)
- [7]Du, Y., Yang, S., Dai, B., Dai, H., Nachum, O., Tenenbaum, J., Schuurmans, D., Abbeel, P.: Learning universal policies via text-guided video generation. NeurIPS 36 (2024)
- [8]Fu, Y., Liu, S., Kulkarni, A., Kautz, J., Efros, A.A., Wang, X.: Colmap-free 3d gaussian splatting. arXiv preprint arXiv:2312.07504 (2023)
- [9]Fu, Z., Zhao, T.Z., Finn, C.: Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117 (2024)
- [10]Gao, Z., Mu, Y., Shen, R., Chen, C., Ren, Y., Chen, J., Li, S.E., Luo, P., Lu, Y.: Enhance sample efficiency and robustness of end-to-end urban autonomous driving via semantic masked world model. arXiv preprint arXiv:2210.04017 (2022)
- [11]Gervet, T., Xian, Z., Gkanatsios, N., Fragkiadaki, K.: Act3d: 3d feature field transformers for multi-task robotic manipulation. In: CoRL. pp. 3949–3965 (2023)
- [12]Goyal, A., Xu, J., Guo, Y., Blukis, V., Chao, Y.W., Fox, D.: Rvt: Robotic view transformer for 3d object manipulation. arXiv preprint arXiv:2306.14896 (2023)
- [13]Guhur, P.L., Chen, S., Pinel, R.G., Tapaswi, M., Laptev, I., Schmid, C.: Instruction-driven history-aware policies for robotic manipulations. In: CoRL. pp. 175–187. PMLR (2023)
- [14]Ha, D., Schmidhuber, J.: Recurrent world models facilitate policy evolution. NeurIPS 31 (2018)
- [15]Hafner, D., Lee, K.H., Fischer, I., Abbeel, P.: Deep hierarchical planning from pixels. NeurIPS 35, 26091–26104 (2022)
- [16]Hafner, D., Lillicrap, T., Ba, J., Norouzi, M.: Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603 (2019)
- [17]Hafner, D., Lillicrap, T., Norouzi, M., Ba, J.: Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193 (2020)
- [18]Hafner, D., Pasukonis, J., Ba, J., Lillicrap, T.: Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104 (2023)
- [19]Hansen, N., Su, H., Wang, X.: Td-mpc2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828 (2023)
- [20]Hansen, N., Wang, X.: Generalization in reinforcement learning by soft data augmentation. In: ICRA (2021)
- [21]Hu, A., Corrado, G., Griffiths, N., Murez, Z., Gurau, C., Yeo, H., Kendall, A., Cipolla, R., Shotton, J.: Model-based imitation learning for urban driving. NeurIPS 35, 20703–20716 (2022)
- [22]Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., Corrado, G.: Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080 (2023)
- [23]Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., Carreira, J.: Perceiver: General perception with iterative attention. In: ICML (2021)
- [24]James, S., Ma, Z., Arrojo, D.R., Davison, A.J.: Rlbench: The robot learning benchmark & learning environment. RA-L (2020)
- [25]James, S., Wada, K., Laidlow, T., Davison, A.J.: Coarse-to-fine q-attention: Efficient learning for visual robotic manipulation via discretisation. In: CVPR (2022)
- [26]Jang, E., Irpan, A., Khansari, M., Kappler, D., Ebert, F., Lynch, C., Levine, S., Finn, C.: Bc-z: Zero-shot task generalization with robotic imitation learning. In: CoRL (2022)
- [27]Jiang, Z., Zhu, Y., Svetlik, M., Fang, K., Zhu, Y.: Synergies between affordance and geometry: 6-dof grasp detection via implicit representations. arXiv preprint arXiv:2104.01542 (2021)
- [28]Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al.: Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv (2018)
- [29]Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. TOG 42(4) (2023)
- [30]Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: Lerf: Language embedded radiance fields. arXiv preprint arXiv:2303.09553 (2023)
- [31]Laskin, M., Srinivas, A., Abbeel, P.: Curl: Contrastive unsupervised representations for reinforcement learning. In: ICML (2020)
- [32]Li, Y., Li, S., Sitzmann, V., Agrawal, P., Torralba, A.: 3d neural scene representations for visuomotor control. In: CoRL. pp. 112–123 (2022)
- [33]Liang, L., Bian, L., Xiao, C., Zhang, J., Chen, L., Liu, I., Xiang, F., Huang, Z., Su, H.: Robo360: A 3d omnispective multi-material robotic manipulation dataset. arXiv preprint arXiv:2312.06686 (2023)
- [34]Lin, J., Du, Y., Watkins, O., Hafner, D., Abbeel, P., Klein, D., Dragan, A.: Learning to model the world with language. arXiv preprint arXiv:2308.01399 (2023)
- [35]Lin, Y.C., Florence, P., Zeng, A., Barron, J.T., Du, Y., Ma, W.C., Simeonov, A., Garcia, A.R., Isola, P.: Mira: Mental imagery for robotic affordances. In: CoRL. pp. 1916–1927 (2023)
- [36]Liu, H., Lee, L., Lee, K., Abbeel, P.: Instruction-following agents with jointly pre-trained vision-language models. arXiv preprint arXiv:2210.13431 (2022)
- [37]Liu, Y., Li, C., Yang, C., Yuan, Y.: Endogaussian: Gaussian splatting for deformable surgical scene reconstruction. arXiv preprint arXiv:2401.12561 (2024)
- [38]Lu, G., Wang, Z., Liu, C., Lu, J., Tang, Y.: Thinkbot: Embodied instruction following with thought chain reasoning. arXiv preprint arXiv:2312.07062 (2023)
- [39]Luiten, J., Kopanas, G., Leibe, B., Ramanan, D.: Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713 (2023)
- [40]Mendonca, R., Bahl, S., Pathak, D.: Structured world models from human videos. arXiv preprint arXiv:2308.10901 (2023)
- [41]Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. CACM 65(1), 99–106 (2021)
- [42]Nair, S., Rajeswaran, A., Kumar, V., Finn, C., Gupta, A.: R3m: A universal visual representation for robot manipulation. arXiv (2022)
- [43]Parisi, S., Rajeswaran, A., Purushwalkam, S., Gupta, A.: The unsurprising effectiveness of pre-trained vision models for control. In: ICML (2022)
- [44]Qian, G., Li, Y., Peng, H., Mai, J., Hammoud, H., Elhoseiny, M., Ghanem, B.: Pointnext: Revisiting pointnet++ with improved training and scaling strategies. NeurIPS 35, 23192–23204 (2022)
- [45]Qin, M., Li, W., Zhou, J., Wang, H., Pfister, H.: Langsplat: 3d language gaussian splatting. arXiv preprint arXiv:2312.16084 (2023)
- [46]Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
- [47]Radosavovic, I., Xiao, T., James, S., Abbeel, P., Malik, J., Darrell, T.: Real-world robot learning with masked visual pre-training. In: CoRL (2023)
- [48]Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
- [49]Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588(7839), 604–609 (2020)
- [50]Seo, Y., Hafner, D., Liu, H., Liu, F., James, S., Lee, K., Abbeel, P.: Masked world models for visual control. In: CoRL. pp. 1332–1344 (2023)
- [51]Seo, Y., Lee, K., James, S.L., Abbeel, P.: Reinforcement learning with action-free pre-training from videos. In: ICML. pp. 19561–19579 (2022)
- [52]Shi, J.C., Wang, M., Duan, H.B., Guan, S.H.: Language embedded 3d gaussians for open-vocabulary scene understanding. arXiv preprint arXiv:2311.18482 (2023)
- [53]Shim, D., Lee, S., Kim, H.J.: Snerl: Semantic-aware neural radiance fields for reinforcement learning. ICML (2023)
- [54]Shridhar, M., Manuelli, L., Fox, D.: Cliport: What and where pathways for robotic manipulation. In: CoRL (2022)
- [55]Shridhar, M., Manuelli, L., Fox, D.: Perceiver-actor: A multi-task transformer for robotic manipulation. In: CoRL (2023)
- [56]Szymanowicz, S., Rupprecht, C., Vedaldi, A.: Splatter image: Ultra-fast single-view 3d reconstruction. arXiv preprint arXiv:2312.13150 (2023)
- [57]Wang, X., Zhu, Z., Huang, G., Chen, X., Lu, J.: Drivedreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777 (2023)
- [58]Wang, Z., Cai, S., Chen, G., Liu, A., Ma, X., Liang, Y.: Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560 (2023)
- [59]Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528 (2023)
- [60]Wu, J., Ma, H., Deng, C., Long, M.: Pre-training contextualized world models with in-the-wild videos for reinforcement learning. NeurIPS 36 (2024)
- [61]Wu, P., Escontrela, A., Hafner, D., Abbeel, P., Goldberg, K.: Daydreamer: World models for physical robot learning. In: CoRL. pp. 2226–2240 (2023)
- [62]Xie, T., Zong, Z., Qiu, Y., Li, X., Feng, Y., Yang, Y., Jiang, C.: Physgaussian: Physics-integrated 3d gaussians for generative dynamics. arXiv preprint arXiv:2311.12198 (2023)
- [63]Xu, D., Yuan, Y., Mardani, M., Liu, S., Song, J., Wang, Z., Vahdat, A.: Agg: Amortized generative 3d gaussians for single image to 3d. arXiv preprint arXiv:2401.04099 (2024)
- [64]Yang, Z., Yang, H., Pan, Z., Zhu, X., Zhang, L.: Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. arXiv preprint arXiv:2310.10642 (2023)
- [65]Yang, Z., Gao, X., Zhou, W., Jiao, S., Zhang, Y., Jin, X.: Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101 (2023)
- [66]Ye, W., Liu, S., Kurutach, T., Abbeel, P., Gao, Y.: Mastering atari games with limited data. NeurIPS 34, 25476–25488 (2021)
- [67]You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., Hsieh, C.J.: Large batch optimization for deep learning: Training bert in 76 minutes. arXiv (2019)
- [68]Ze, Y., Hansen, N., Chen, Y., Jain, M., Wang, X.: Visual reinforcement learning with self-supervised 3d representations. RA-L (2023)
- [69]Ze, Y., Yan, G., Wu, Y.H., Macaluso, A., Ge, Y., Ye, J., Hansen, N., Li, L.E., Wang, X.: Gnfactor: Multi-task real robot learning with generalizable neural feature fields. In: CoRL. pp. 284–301. PMLR (2023)
- [70]Zeng, A., Florence, P., Tompson, J., Welker, S., Chien, J., Attarian, M., Armstrong, T., Krasin, I., Duong, D., Sindhwani, V., et al.: Transporter networks: Rearranging the visual world for robotic manipulation. In: CoRL (2021)
- [71]Zhang, T., Hu, Y., Cui, H., Zhao, H., Gao, Y.: A universal semantic-geometric representation for robotic manipulation. arXiv preprint arXiv:2306.10474 (2023)
- [72]Zheng, S., Zhou, B., Shao, R., Liu, B., Zhang, S., Nie, L., Liu, Y.: Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. arXiv preprint arXiv:2312.02155 (2023)
- [73]Zhu, L., Wang, Z., Jin, Z., Lin, G., Yu, L.: Deformable endoscopic tissues reconstruction with gaussian splatting. arXiv preprint arXiv:2401.11535 (2024)
- [74]Zou, Z.X., Yu, Z., Guo, Y.C., Li, Y., Liang, D., Cao, Y.P., Zhang, S.H.: Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers. arXiv preprint arXiv:2312.09147 (2023)
- [75]Zuo, X., Samangouei, P., Zhou, Y., Di, Y., Li, M.: Fmgs: Foundation model embedded 3d gaussian splatting for holistic 3d scene understanding. arXiv preprint arXiv:2401.01970 (2024)