ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation (2024)

Guanxing Lu¹  Shiyi Zhang¹  Ziwei Wang²🖂  Changliu Liu²  Jiwen Lu³  Yansong Tang¹  (🖂 corresponding author)

¹ Tsinghua Shenzhen International Graduate School, Tsinghua University, China
² Carnegie Mellon University
³ Department of Automation, Tsinghua University, China

Emails: {lgx23@mails., sy-zhang23@mails., lujiwen@, tang.yansong@sz.}tsinghua.edu.cn; {ziweiwa2@, cliu6@}andrew.cmu.edu

Abstract

Performing language-conditioned robotic manipulation tasks in unstructured environments is highly demanded for general intelligent robots. Conventional robotic manipulation methods usually learn a semantic representation of the observation for action prediction, which ignores the scene-level spatiotemporal dynamics for human goal completion. In this paper, we propose a dynamic Gaussian Splatting method named ManiGaussian for multi-task robotic manipulation, which mines scene dynamics via future scene reconstruction. Specifically, we first formulate the dynamic Gaussian Splatting framework that infers the semantics propagation in the Gaussian embedding space, where the semantic representation is leveraged to predict the optimal robot action. Then, we build a Gaussian world model to parameterize the distribution in our dynamic Gaussian Splatting framework, which provides informative supervision in the interactive environment via future scene reconstruction. We evaluate our ManiGaussian on 10 RLBench tasks with 166 variations, and the results demonstrate that our framework outperforms the state-of-the-art methods by 13.1% in average success rate. (Project page: https://guanxinglu.github.io/ManiGaussian/)

Keywords:

Multi-task robotic manipulation · Dynamic Gaussian Splatting · World model

1 Introduction

Designing autonomous agents for language-conditioned manipulation tasks [28, 2, 26, 70, 54, 55, 9] has been highly desired in the pursuit of artificial intelligence for a long time. In realistic deployment, intelligent robots are usually required to deal with unseen scenarios in novel tasks. Therefore, comprehending complex 3D structures in the deployment scenes is necessary for the robots to achieve high task success rates across diverse manipulation tasks.

To address these challenges, previous arts have made great progress in general manipulation policy learning, which can be divided into two categories: perceptive methods and generative methods. In the first category, semantic features extracted by perceptive models are directly leveraged to predict robot actions from visual input such as images [36, 13, 12], point clouds [5, 11, 71] and voxels [25, 55]. However, perceptive methods rely heavily on multi-view cameras covering the whole workbench to deal with occlusion in unstructured environments, which restricts their deployment. To this end, generative methods [43, 42, 47, 31, 20, 68, 69, 32] capture 3D scene structure by reconstructing the scene and objects from arbitrary novel views with self-supervised learning. Nevertheless, they ignore the spatiotemporal dynamics that depict the physical interaction among objects during manipulation, and the predicted actions still fail to complete human goals without correct object interactions. Figure 1 compares manipulation by a conventional generative manipulation method (top) and the proposed method (bottom), where the conventional method fails to stack the two rose blocks due to its poor comprehension of scene dynamics.

[Figure 1: Comparison between a conventional generative manipulation method (top) and the proposed ManiGaussian (bottom).]

In this paper, we propose ManiGaussian, a method that leverages a dynamic Gaussian Splatting framework for multi-task robotic manipulation. Different from conventional methods that only focus on semantic representation, our method mines the scene-level spatiotemporal dynamics via future scene reconstruction, so that the interaction among objects can be comprehended for accurate manipulation action prediction. More specifically, we first formulate the dynamic Gaussian Splatting framework that models the propagation of diverse semantic features in the Gaussian embedding space, and the semantic features with scene dynamics are leveraged to predict the optimal robot actions for general manipulation tasks. We then build a Gaussian world model to parameterize the distributions in our dynamic Gaussian Splatting framework. Therefore, our framework can acquire informative supervision in interactive environments by reconstructing the future scene according to the current scene and the robot actions, where we constrain consistency between the reconstructed and realistic future scenes for dynamics mining. We evaluate our ManiGaussian method on the RLBench dataset [24] with 10 tasks and 166 variations, where our method outperforms the state-of-the-art multi-task robotic manipulation methods by 13.1% in average task success rate. Our contributions can be summarized as follows:

  • We propose a dynamic Gaussian Splatting framework to learn the scene-level spatiotemporal dynamics in general robotic manipulation tasks, so that the robotic agent can complete human instructions with accurate action prediction in unstructured environments.

  • We build a Gaussian world model to parameterize distributions in our dynamic Gaussian Splatting framework, which can provide informative supervision to learn scene dynamics from the interactive environment.

  • We conduct extensive experiments on 10 RLBench tasks, and the results demonstrate that our method achieves a higher success rate than the state-of-the-art methods with less computation.

2 Related Work

Visual Representations for Robotic Manipulation. Developing intelligent agents for language-conditioned manipulation tasks in complex and unstructured environments has been a longstanding objective. One of the key bottlenecks in achieving this goal is effectively representing visual information of the scene. Prior arts can be categorized into two branches: perceptive methods and generative methods. Perceptive methods directly utilize pretrained 2D [36, 13, 12, 71] or 3D visual representation backbones [5, 11, 25, 55] to learn scene embeddings, where optimal robot actions are predicted based on the scene semantics. For example, InstructRL [36] and Hiveformer [13] directly passed 2D visual tokens through a multi-modal transformer to decode gripper actions, but struggled to handle complex manipulation tasks due to the lack of geometric understanding. To incorporate 3D information beyond images, PolarNet [5] and Act3D [11] utilized point cloud representations, where PolarNet used a PointNeXt [44]-based architecture and Act3D designed a ghost point sampling mechanism to decode actions. Moreover, PerAct [55] fed voxel tokens into a PerceiverIO [23]-based transformer policy, demonstrating impressive performance in a variety of manipulation tasks. However, perceptive methods rely heavily on seamless camera coverage for comprehensive 3D understanding, which makes them less effective in unstructured environments. To address this, generative methods [43, 42, 47, 31, 20, 68, 69, 32], which learn 3D geometry through self-supervised novel view reconstruction, have gained attention. For instance, Li et al. [32] combined NeRF and time contrastive learning to embed 3D geometry and learn fluid dynamics within an autoencoder framework. GNFactor [69] optimized a generalizable NeRF with a reconstruction loss besides behavior cloning, and showed effective improvement in both simulated and real scenarios. However, conventional generative methods usually ignore the scene-level spatiotemporal dynamics that capture the interaction among objects, and the predicted actions still fail to achieve human goals because of incorrect interactions.

World Models. In recent years, world models have emerged as an effective approach to encode scene dynamics by predicting future states given the current state and actions, and have been explored in autonomous driving [57, 10, 21, 22], game agents [14, 15, 16, 17, 18, 66, 49] and robotic manipulation [19, 61, 50]. Early works [14, 15, 16, 17, 18, 19, 66, 49, 50] learned a latent space for future prediction by autoencoding, which achieved notable effectiveness in both simulated and real-world settings [61]. However, learning latent representations for accurate future prediction requires a large amount of data and is limited to simple tasks such as robot control due to the weak representative ability of implicit features. To address these limitations, explicit representations in the image domain [7, 51, 60, 40] and the language domain [34, 38, 58] have been widely studied because of their rich semantics. UniPi [7] reconstructed future images with a text-conditional video generation model, employing an inverse dynamics model to obtain the intermediate actions. Dynalang [34] learned to predict text representations as future states, and enabled embodied agents to navigate in photorealistic home scans under human instructions. In contrast to these approaches, we generalize the world model to the embedding space of dynamic Gaussian Splatting, which predicts the future state for the agent to learn scene-level dynamics from interactive environments.

Gaussian Splatting. Gaussian Splatting [29] models scenes with a set of 3D Gaussians that are projected onto 2D planes with efficient differentiable splatting. It achieves higher effectiveness and efficiency than implicit representations such as Neural Radiance Fields (NeRF) [41, 27, 35, 32, 6, 53, 69], with fast inference, high fidelity, and strong editability for novel view synthesis. To deploy Gaussian Splatting in diverse complex scenarios, many variants have been proposed to enhance generalization ability, enrich semantic information, and reconstruct deformable scenes. For higher generalization across diverse scenes, recent works [56, 72, 74, 4, 8, 63] constructed a direct mapping from pixels to Gaussian parameters, where the latent features were learned from large-scale datasets. To integrate rich semantic information into Gaussian Splatting, many efforts [45, 75, 52] distill Gaussian radiance fields from pretrained foundation models [46, 3, 48]. For instance, LangSplat [45] advanced the Gaussian representation by encoding language features distilled from CLIP [46] using a scene-wise language autoencoder, enabling efficient open-vocabulary localization compared with its NeRF-based counterpart [30]. For deformation modeling, time-variant Gaussian radiance fields [64, 62, 65, 39, 59, 33, 1] were reconstructed from videos instead of images, and are widely applied in applications such as surgical scene reconstruction [73, 37]. Although these approaches achieve high-quality reconstruction from entire videos, e.g., for interpolation, extrapolation to future states conditioned on previous states and actions remains unexplored, which is significant for scene-level dynamics modeling for interactive agents. In this paper, we formulate a dynamic Gaussian Splatting framework to model the scene dynamics of object interactions, which enhances the physical reasoning of agents to complete a wide range of robotic manipulation tasks.

3 Approach

In this section, we first briefly introduce preliminaries on the problem formulation (Section 3.1), and then present an overview of our pipeline (Section 3.2). Subsequently, we introduce our dynamic Gaussian Splatting framework (Section 3.3) that infers the semantics propagation of the manipulation scenarios in the Gaussian embedding space. To enable our dynamic Gaussian Splatting framework to learn scene dynamics from the interactive environment, we build a Gaussian world model (Section 3.4) that reconstructs future scenes according to the propagated semantics.

3.1 Problem Formulation

The demand for language-conditioned robotic manipulation is a significant aspect of developing general intelligent robots. The agent is required to interactively predict the subsequent pose of the robot arm based on the observation, and to reach that pose with a low-level motion planner, in order to complete a wide range of manipulation tasks described by humans. The visual input at the $t$-th step is defined as $o^{(t)}=(\mathbf{C}^{(t)},\mathbf{D}^{(t)},\mathbf{P}^{(t)})$, where $\mathbf{C}^{(t)}$ and $\mathbf{D}^{(t)}$ respectively represent the single-view RGB image and the depth image. The proprioception vector $\mathbf{P}^{(t)}\in\mathbb{R}^{4}$ indicates the gripper state including the end-effector position, openness, and current timestep. Based on the visual input $o^{(t)}$ and the language instruction, the agent is required to generate the optimal action for the robot arm and gripper $\mathbf{a}^{(t)}=(\mathbf{a}^{(t)}_{\text{trans}},\mathbf{a}^{(t)}_{\text{rot}},\mathbf{a}^{(t)}_{\text{open}},\mathbf{a}^{(t)}_{\text{col}})$, which respectively specify the translation and rotation of the robot arm, the gripper open state, and the motion-planner (collision-avoidance) state. To learn the manipulation policy effectively, expert demonstrations are provided as an offline dataset for imitation learning, where each sample triplet contains the visual input, the language instruction and the expert action. Existing methods leverage powerful visual representations to learn informative latent features for optimal action prediction. However, they ignore the spatiotemporal dynamics that depict the physical interaction among objects, and the predicted actions usually fail to complete complex human goals without correct object interactions. On the contrary, we present a dynamic Gaussian Splatting framework to mine the scene dynamics for robotic manipulation.
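For concreteness, the observation and action interface can be sketched as below (Python; the field names, the quaternion rotation parameterization, and the exact layout of the 4-dimensional proprioception are our assumptions for illustration, not the paper's implementation):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    """Visual input o^(t) = (C^(t), D^(t), P^(t)) from the single front RGB-D camera."""
    rgb: np.ndarray      # C^(t): (H, W, 3) color image
    depth: np.ndarray    # D^(t): (H, W) depth image
    proprio: np.ndarray  # P^(t): (4,) gripper state (layout assumed)

@dataclass
class Action:
    """Action a^(t) = (a_trans, a_rot, a_open, a_col) for the arm and gripper."""
    trans: np.ndarray     # (3,) target end-effector translation
    rot: np.ndarray       # (4,) target rotation, assumed to be a quaternion
    gripper_open: float   # gripper openness state
    use_collision: float  # motion-planner collision-avoidance flag
```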

[Figure 2: The overall pipeline of ManiGaussian.]

3.2 Overall Pipeline

The overall pipeline of our ManiGaussian method is shown in Figure 2, in which we construct a dynamic Gaussian Splatting framework that models the propagation of diverse semantic features in the Gaussian embedding space for manipulation. We also build a Gaussian world model to parameterize distributions in our dynamic Gaussian Splatting framework, which can provide informative supervision of scene dynamics by future scene reconstruction. More specifically, we transform the visual input from the RGB-D camera into a volumetric representation by lifting and voxelization for data preprocessing. For dynamic Gaussian Splatting, we leverage a Gaussian regressor to infer the Gaussian distribution of geometric and semantic features in the scene, which are propagated along time steps with rich scene-level spatiotemporal dynamics. For the Gaussian world model, we instantiate a deformation field to reconstruct the future scene according to the current scene and the robot actions, and require consistency between the reconstructed and realistic scenes for dynamics mining. Therefore, the spatiotemporal dynamics indicating object correlation can be embedded into the geometric and semantic features learned in the dynamic Gaussian Splatting framework. Finally, we employ the multi-modal transformer PerceiverIO [23] to predict the optimal robot actions for general manipulation tasks, which considers geometric and semantic features in the Gaussian embedding space together with human language instructions.
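As an illustration of the lifting-and-voxelization preprocessing, the following sketch back-projects the RGB-D observation into a world-frame point cloud and scatters it into a fixed workspace volume (NumPy; the 100³ grid resolution, the workspace-bounds interface, and the function names are assumptions rather than the paper's exact implementation):

```python
import numpy as np

def lift_rgbd(depth, rgb, K, T_cam2world):
    """Back-project a depth map into a colored world-frame point cloud."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]                  # pinhole back-projection
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    pts_world = (pts_cam @ T_cam2world.T)[:, :3]     # camera -> world frame
    return pts_world, rgb.reshape(-1, 3)

def voxelize(points, colors, bounds, grid=100):
    """Scatter points into a (grid^3, RGB+occupancy) volume over the workspace bounds."""
    lo, hi = np.asarray(bounds[0]), np.asarray(bounds[1])
    idx = ((points - lo) / (hi - lo) * grid).astype(int)
    keep = np.all((idx >= 0) & (idx < grid), axis=1)
    vol = np.zeros((grid, grid, grid, 4), dtype=np.float32)
    vol[idx[keep, 0], idx[keep, 1], idx[keep, 2], :3] = colors[keep] / 255.0  # assumes uint8 RGB
    vol[idx[keep, 0], idx[keep, 1], idx[keep, 2], 3] = 1.0                    # occupancy
    return vol
```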

3.3 Dynamic Gaussian Splatting for Robotic Manipulation

In order to capture the scene-level dynamics for general manipulation tasks, we propose a dynamic Gaussian Splatting framework that models the propagation of diverse semantic features within the Gaussian embedding space. While vanilla Gaussian Splatting is remarkably effective and efficient in reconstructing static environments, it fails to capture the scene dynamics required for manipulation due to the lack of temporal information. To this end, we formulate a dynamic Gaussian Splatting framework based on the vanilla methodology by enabling the Gaussian points of the scene representation to move with robotic manipulation, which captures the physical interactions between objects. The scene representation of our dynamic Gaussian Splatting framework contains geometric features depicting explicit visual clues and semantic features illustrating implicit high-level visual information, which are utilized to predict the optimal action for the robot arm and gripper.

Dynamic Gaussian Splatting. Gaussian Splatting [29] is a promising approach for multi-view 3D reconstruction, which exhibits fast inference, high fidelity, and strong editability of generated content compared with Neural Radiance Fields (NeRF) [41]. Gaussian Splatting represents a 3D scene explicitly with multiple Gaussian primitives, where the $i$-th primitive is parameterized by $\theta_{i}=(\mu_{i},c_{i},r_{i},s_{i},\sigma_{i})$, denoting its position, color, rotation, scale, and opacity, respectively. To render a novel view, we project the Gaussian primitives onto the 2D plane by differentiable tile-based rasterization. The value of a pixel $\mathbf{p}$ is rendered by alpha-blend rendering:

C(\mathbf{p}) = \sum_{i=1}^{N} \alpha_{i} c_{i} \prod_{j=1}^{i-1} (1-\alpha_{j}), \quad \text{where } \alpha_{i} = \sigma_{i}\, e^{-\frac{1}{2}(\mathbf{p}-\mu_{i})^{\top}\Sigma_{i}^{-1}(\mathbf{p}-\mu_{i})}, \qquad (1)

where $C$ is the rendered image, $N$ denotes the number of Gaussians in the tile, $\alpha_{i}$ represents the 2D density of the $i$-th Gaussian in the splatting process, and $\Sigma_{i}$ stands for the covariance matrix obtained from the rotation and scale of the Gaussian parameters. However, the vanilla Gaussian Splatting encounters difficulties in reconstructing temporal information in changing environments, which limits its ability to model the scene-level dynamics that are crucial for manipulation tasks. To address this, we enable the Gaussian particles to be propagated over time to capture the spatiotemporal dynamics of the scene. The parameters of the $i$-th Gaussian primitive at the $t$-th step can be expressed as follows:

\theta_{i}^{(t)} = (\mu_{i}^{(t)}, c_{i}^{(t)}, r_{i}^{(t)}, s_{i}^{(t)}, \sigma_{i}^{(t)}, f_{i}^{(t)}), \qquad (2)

where the position, color, rotation, scale and opacity with superscript $t$ denote their counterparts at the $t$-th step of the propagation, and $f_{i}^{(t)}$ is the high-level semantic feature distilled from the Stable Diffusion [48] visual encoder based on the RGB images of the scene. In robotic manipulation, all objects are regarded as rigid bodies whose inherent properties, including color, scale, opacity and semantic features, do not change during manipulation; $c_{i}^{(t)}$, $s_{i}^{(t)}$, $\sigma_{i}^{(t)}$ and $f_{i}^{(t)}$ are therefore treated as time-independent parameters. The positions and rotations of the Gaussian particles change during manipulation due to the physical interaction between objects and the robot gripper, which can be formulated as follows:

(\mu_{i}^{(t+1)}, r_{i}^{(t+1)}) = (\mu_{i}^{(t)} + \Delta\mu_{i}^{(t)},\ r_{i}^{(t)} + \Delta r_{i}^{(t)}), \qquad (3)

where $\Delta\mu_{i}^{(t)}$ and $\Delta r_{i}^{(t)}$ denote the change of position and rotation of the $i$-th Gaussian primitive from the $t$-th step to the next one. With the time-dependent parameters of the Gaussian mixture distribution, the pixel values in 2D views of the scene can still be rendered by (1).
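To make the rendering rule in (1) concrete, the following toy sketch composites a single pixel from depth-sorted Gaussians (PyTorch). The real renderer uses the tile-based CUDA rasterizer of [29]; this is only an illustration of the alpha-blending step:

```python
import torch

def alpha_blend_pixel(colors: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Composite N depth-sorted Gaussians onto one pixel following Eq. (1).

    colors: (N, 3) per-Gaussian colors c_i, sorted front to back.
    alphas: (N,)  per-Gaussian 2D densities alpha_i evaluated at pixel p.
    """
    # Transmittance prod_{j<i} (1 - alpha_j), i.e., an exclusive cumulative product.
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)
    weights = alphas * transmittance                  # contribution of each Gaussian
    return (weights[:, None] * colors).sum(dim=0)     # rendered pixel value C(p)
```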

Gaussian World Model. We present a Gaussian world model to parameterize the Gaussian mixture distribution in dynamic Gaussian Splatting, through which the future scene can be reconstructed via parameter propagation. Therefore, the dynamic Gaussian Splatting model can acquire informative supervision in the interactive environment by enforcing consistency between the reconstructed and realistic future scenes. World models are effective in learning environmental dynamics for downstream tasks by anticipating the future state $s^{(t+1)}$ based on the current state $s^{(t)}$ and action $a^{(t)}$ at the $t$-th step, and have been applied to a variety of tasks including autonomous driving [57, 10, 21, 22] and game agents [14, 15, 16, 17, 18, 66, 49]. For our robotic manipulation tasks, we instantiate the current state in the world model as the visual observation at the current step, and the actions refer to those of the robot arm and gripper. They are leveraged to predict the visual scene observed at the next step, which represents the future state. More specifically, the Gaussian world model contains a representation network $q_{\phi}$ that learns high-level visual features with rich semantics from the input observation, a Gaussian regressor $g_{\phi}$ that predicts the Gaussian parameters of different primitives based on the visual features, a deformation predictor $p_{\phi}$ that infers the change of Gaussian parameters during the propagation, and a Gaussian renderer $\mathcal{R}$ that generates the RGB image of the predicted future state:

\begin{cases} \text{Representation model:} & \mathbf{v}^{(t)} = q_{\phi}(o^{(t)}), \\ \text{Gaussian regressor:} & \theta^{(t)} = g_{\phi}(\mathbf{v}^{(t)}), \\ \text{Deformation predictor:} & \Delta\theta^{(t)} = p_{\phi}(\theta^{(t)}, a^{(t)}), \\ \text{Gaussian renderer:} & o^{(t+1)} = \mathcal{R}(\theta^{(t+1)}, w), \end{cases} \qquad (4)

where $o^{(t)}$ and $\mathbf{v}^{(t)}$ denote the visual observation and the corresponding high-level visual features at the $t$-th step, and $w$ is the camera pose of the view onto which we project the Gaussian primitives. We leverage a multi-head neural network as the Gaussian regressor, where each head predicts a specific component of the Gaussian parameters in (2). By inferring the changes of positions and rotations between consecutive steps, we acquire the propagated Gaussian parameters of the future step based on (3). Finally, the Gaussian renderer projects the propagated Gaussian distribution onto a specific view for future scene reconstruction.
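A minimal sketch of how the four modules in (4) compose is given below (PyTorch). The sub-networks are placeholders and the dictionary-based parameter interface is our assumption; the sketch mirrors only the data flow, not the paper's architecture:

```python
import torch.nn as nn

class GaussianWorldModel(nn.Module):
    """Composition of the four modules in Eq. (4); every sub-network is a placeholder."""

    def __init__(self, repr_net, gaussian_regressor, deformation_predictor, renderer):
        super().__init__()
        self.repr_net = repr_net                    # q_phi: observation -> features v^(t)
        self.regressor = gaussian_regressor         # g_phi: features -> Gaussian params theta^(t)
        self.deformation = deformation_predictor    # p_phi: (theta^(t), a^(t)) -> delta theta^(t)
        self.renderer = renderer                    # R: (theta^(t+1), camera pose w) -> image o^(t+1)

    def forward(self, obs, action, camera_pose):
        v = self.repr_net(obs)
        theta = self.regressor(v)                   # dict, e.g. {'mu', 'rot', 'color', ...}
        delta = self.deformation(theta, action)     # offsets for 'mu' and 'rot' only (Eq. 3)
        theta_next = {k: theta[k] + delta[k] if k in delta else theta[k] for k in theta}
        future_rgb = self.renderer(theta_next, camera_pose)
        return theta, theta_next, future_rgb
```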

3.4 Learning Objectives

Current Scene Consistency Loss. Accurately reconstructing the current scene from the current Gaussian parameters can enhance the performance of the Gaussian regressor. To achieve this goal, we introduce a consistency objective between the realistic current observation and the image rendered from the current Gaussian parameters:

\mathcal{L}_{\text{Geo}} = \|\mathbf{C}^{(t)} - \hat{\mathbf{C}}^{(t)}\|_{2}^{2}, \qquad (5)

where $\mathbf{C}^{(t)}$ and $\hat{\mathbf{C}}^{(t)}$ respectively denote the ground truth and the rendered prediction of the observation images from different views at the $t$-th step.

Semantic Feature Consistency Loss. The semantic features contain high-level visual information of the observed scenes. Since foundation models can extract informative semantic features for general scenes, we expect the semantic features in our Gaussian parameters to mimic those produced by large pre-trained models such as Stable Diffusion [48], so that the knowledge learned by the pre-trained models can be distilled into our Gaussian world model according to the following objective:

\mathcal{L}_{\text{Sem}} = 1 - \sigma_{\cos}(\mathbf{F}^{(t)}, \hat{\mathbf{F}}^{(t)}), \qquad (6)

where $\mathbf{F}^{(t)}$ and $\hat{\mathbf{F}}^{(t)}$ are the projected map of the semantic features in the Gaussian parameters and the feature map produced by the pre-trained model, respectively, and $\sigma_{\cos}$ denotes the cosine similarity between its arguments.
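A direct sketch of (6) in PyTorch is given below; the tensor shapes are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(feat_rendered: torch.Tensor,
                              feat_teacher: torch.Tensor) -> torch.Tensor:
    """L_Sem = 1 - cos(F^(t), F_hat^(t)) from Eq. (6), averaged over pixels.

    feat_rendered: (B, C, H, W) semantic feature map splatted from the Gaussians.
    feat_teacher:  (B, C, H, W) feature map from the pretrained encoder.
    """
    cos = F.cosine_similarity(feat_rendered, feat_teacher, dim=1)  # (B, H, W)
    return (1.0 - cos).mean()
```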

Action Prediction Loss. The distribution parameters in our dynamic Gaussian Splatting framework are leveraged to predict the optimal action of the robot arm and gripper for general manipulation tasks. We employ the multi-modal transformer PerceiverIO [23] to infer the selection probability of different action candidates based on the Gaussian parameters and the human language instruction, and leverage the cross-entropy loss $CE$ for accurate action prediction:

\mathcal{L}_{\text{Act}} = CE(p_{\text{trans}}, p_{\text{rot}}, p_{\text{open}}, p_{\text{col}}), \qquad (7)

where $p_{\text{trans}}$, $p_{\text{rot}}$, $p_{\text{open}}$ and $p_{\text{col}}$ represent the predicted probabilities of the ground-truth actions for the translation, rotation, gripper openness and motion-planner state of the robot, respectively.
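Under a discretized action parameterization (PerAct-style bins), (7) can be read as a sum of per-head cross-entropies. The sketch below assumes dictionary-keyed logits and integer class targets, which is our illustrative interface rather than the paper's:

```python
import torch.nn.functional as F

def action_prediction_loss(logits: dict, targets: dict):
    """L_Act, Eq. (7): cross-entropy over the four discretized action heads.

    logits[k]:  (B, num_bins_k) unnormalized scores for head k.
    targets[k]: (B,) ground-truth bin indices for head k.
    """
    return sum(F.cross_entropy(logits[k], targets[k])
               for k in ('trans', 'rot', 'open', 'col'))
```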

Future Scene Consistency Loss.We require consistency between the reconstructed and realistic scenes, so that the dynamic Gaussian Splatting framework can accurately embed scene-level spatiotemporal dynamics in the Gaussian parameters. Specifically, the training objective aligns the predicted future scenes based on different observations and actions with the realistic ones, which can be formulated as follows:

\mathcal{L}_{\text{Dyna}} = \|\hat{\mathbf{C}}^{(t+1)}(\hat{a}^{(t)}, o^{(t)}) - \mathbf{C}^{(t+1)}\|_{2}^{2} + \|\hat{\mathbf{C}}^{(t+1)}(a^{(t)}, \hat{o}^{(t)}) - \mathbf{C}^{(t+1)}\|_{2}^{2}, \qquad (8)

where $\hat{\mathbf{C}}^{(t+1)}(a, o)$ denotes the predicted future observation of the scene at the $(t{+}1)$-th step based on the action $a$ and the current observation $o$, and $\mathbf{C}^{(t+1)}$ is the realistic counterpart. The predicted action $\hat{a}^{(t)}$ at the $t$-th step is acquired from the action decoder according to the Gaussian parameters, and the reconstructed observation $\hat{o}^{(t)}$ of the current scene is obtained by rendering the current Gaussian parameters. This objective enforces the dynamic Gaussian Splatting framework to learn informative Gaussian parameters that generate optimal actions and recover the current observation, so that the future observation can be accurately reconstructed with embedded spatiotemporal dynamics.
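The two terms in (8) can be formed by querying the world model with the two (action, observation) pairings. The sketch below reuses the hypothetical GaussianWorldModel interface sketched in Section 3.3 and is an assumption-laden illustration rather than the released code:

```python
import torch.nn.functional as F

def future_consistency_loss(world_model, obs, obs_rendered,
                            action_gt, action_pred, camera_pose, future_rgb_gt):
    """L_Dyna, Eq. (8): render the next frame from (a_hat^(t), o^(t)) and (a^(t), o_hat^(t)),
    then penalize both predictions against the real future image C^(t+1)."""
    _, _, pred_from_pred_action = world_model(obs, action_pred, camera_pose)
    _, _, pred_from_rendered_obs = world_model(obs_rendered, action_gt, camera_pose)
    return (F.mse_loss(pred_from_pred_action, future_rgb_gt) +
            F.mse_loss(pred_from_rendered_obs, future_rgb_gt))
```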

The overall objective for our ManiGaussian agent is written as a weighted combination of different loss terms:

\mathcal{L} = \mathcal{L}_{\text{Act}} + \lambda_{\text{Geo}}\mathcal{L}_{\text{Geo}} + \lambda_{\text{Sem}}\mathcal{L}_{\text{Sem}} + \lambda_{\text{Dyna}}\mathcal{L}_{\text{Dyna}}, \qquad (9)

where $\lambda_{\text{Geo}}$, $\lambda_{\text{Sem}}$ and $\lambda_{\text{Dyna}}$ are hyperparameters that control the importance of the different terms during training. We set a warm-up phase that freezes the deformation predictor to first learn a stable representation model and Gaussian regressor during the first 3k iterations. After the warm-up phase, we jointly train the whole Gaussian world model with the action decoder until training ends.
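Putting the pieces together, a hedged sketch of the overall objective in (9) with the warm-up schedule might look as follows; the loss weights are placeholders, and disabling the dynamics term during warm-up is our simplification of freezing the deformation predictor:

```python
def manigaussian_loss(l_act, l_geo, l_sem, l_dyna, step,
                      lambda_geo=1.0, lambda_sem=1.0, lambda_dyna=1.0,
                      warmup_steps=3000):
    """Weighted objective of Eq. (9); lambda values are illustrative, not the tuned ones."""
    loss = l_act + lambda_geo * l_geo + lambda_sem * l_sem
    if step >= warmup_steps:
        # After the 3k-step warm-up, the dynamics term joins the objective.
        loss = loss + lambda_dyna * l_dyna
    return loss
```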

4 Experiments

In this section, we first introduce the experiment setup including datasets, baseline methods, and implementation details (Section 4.1). Then we compare our method with the state-of-the-art approaches to show the superiority in success rate (Section 4.2), and conduct an ablation study to verify the effectiveness of different components in our dynamic Gaussian Splatting framework and the Gaussian world model (Section 4.3). Finally, we also illustrate the visualization results to depict our intuition (Section 4.4). Further results and case studies can be found in the supplementary material.

4.1 Experiment Setup

Dataset. Our experiments are conducted on the popular RLBench [24] simulated tasks. Following [69], we utilize a curated subset of 10 challenging language-conditioned manipulation tasks from RLBench, which includes 166 variations in object properties and scene arrangement. The diversity of these tasks requires the agent to acquire generalizable knowledge about the intrinsic scene-level spatiotemporal dynamics for manipulation, rather than solely mimicking the provided expert demonstrations to achieve high success rates. We evaluate 25 episodes from the testing set for each task to avoid result bias from noise. For visual observation, we employ RGB-D images captured by a single front camera with a resolution of 128×128. During the training phase, we use 20 demonstrations collected by GNFactor [69] for each task.

Baselines. We compare our ManiGaussian with previous state-of-the-art methods, including the perceptive method PerAct [55] and its modified version using 4 camera inputs to cover the workbench, as well as the generative method GNFactor [69]. The evaluation metric is the task success rate, which measures the percentage of completed episodes. An episode is considered successful if the agent completes the goal specified in natural language within a maximum of 25 steps.

Implementation Details. We use SE(3) augmentation [55, 69] for the expert demonstrations in the training set to enhance the generalizability of agents. To ensure consistency and mitigate the impact of parameter size, we utilize the same version of PerceiverIO [23] as the action decoder across all baselines. All the compared methods are trained on two NVIDIA RTX 4090 GPUs for 100k iterations with a batch size of 2. We employ the LAMB optimizer [67] with an initial learning rate of $5\times10^{-4}$, and adopt a cosine scheduler with warm-up during the first 3k steps.
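For reference, a warm-up-plus-cosine schedule of this form can be sketched as below (PyTorch). AdamW stands in for LAMB because LAMB is not part of core PyTorch, so this approximates the training setup rather than reproducing the exact configuration:

```python
import math
import torch

def build_optimizer_and_scheduler(model, base_lr=5e-4, total_steps=100_000, warmup_steps=3000):
    """Linear warm-up followed by cosine decay, applied to a stand-in optimizer."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                       # linear warm-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))            # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```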

Table 1: Success rates (%) of different methods on 10 RLBench tasks (best in bold).

| Method / Task | close jar | open drawer | sweep to dustpan | meat off grill | turn tap |
| --- | --- | --- | --- | --- | --- |
| PerAct | 18.7 | 54.7 | 0.0 | 40.0 | 38.7 |
| PerAct (4 cameras) | 21.3 | 44.0 | 0.0 | **65.3** | 46.7 |
| GNFactor | 25.3 | **76.0** | 28.0 | 57.3 | 50.7 |
| ManiGaussian (ours) | **28.0** | **76.0** | **64.0** | 60.0 | **56.0** |

| Method / Task | slide block | put in drawer | drag stick | push buttons | stack blocks | Average |
| --- | --- | --- | --- | --- | --- | --- |
| PerAct | 18.7 | 2.7 | 5.3 | 18.7 | 6.7 | 20.4 |
| PerAct (4 cameras) | 16.0 | 6.7 | 12.0 | 9.3 | 5.3 | 22.7 |
| GNFactor | 20.0 | 0.0 | 37.3 | 18.7 | 4.0 | 31.7 |
| ManiGaussian (ours) | **24.0** | **16.0** | **92.0** | **20.0** | **12.0** | **44.8** |
Table 2: Ablation study of the proposed components (Geo.: Gaussian regressor with current scene consistency, Sem.: semantic feature distillation, Dyna.: deformation predictor with future scene consistency), with the 10 tasks grouped into 6 challenge categories (success rate, %; best in bold). The component checkmarks are reconstructed from the description in Section 4.3.

| Geo. | Sem. | Dyna. | Planning | Long | Tools | Motion | Screw | Occlusion | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  | 36.0 | 2.0 | 25.3 | 52.0 | 4.0 | 28.0 | 23.6 |
| ✓ |  |  | 46.0 | 4.0 | 52.0 | 52.0 | 24.0 | 60.0 | 39.2 |
| ✓ | ✓ |  | 46.0 | 8.0 | 53.3 | **64.0** | **28.0** | 56.0 | 41.6 |
| ✓ |  | ✓ | **54.0** | 10.0 | 49.3 | **64.0** | 24.0 | 72.0 | 43.6 |
| ✓ | ✓ | ✓ | 40.0 | **14.0** | **60.0** | 56.0 | **28.0** | **76.0** | **44.8** |

4.2 Comparison with the State-of-the-Art Methods

In this section, we compare our ManiGaussian with previous state-of-the-art methods on the RLBench task suite. Table 1 reports the success rate of each task. Our method achieves the best performance with an average success rate of 44.8%, outperforming previous arts, including both perceptive and generative methods, by a sizable margin. The dominant generative method GNFactor [69] leverages a generalizable NeRF to learn an informative latent representation for optimal action prediction, which shows effective improvement over the perceptive methods. However, it ignores the scene-level spatiotemporal dynamics that capture the interaction among objects, and the predicted actions still fail to achieve human goals because of incorrect interaction. On the contrary, our ManiGaussian learns the scene dynamics with the proposed dynamic Gaussian Splatting framework, so that the robotic agent can complete human instructions with accurate action prediction in unstructured environments. As a result, our method outperforms the second-best GNFactor by a relative improvement of 41.3%. In addition, out of the 10 tasks, we excel in 9, achieving more than double the accuracy of previous methods on the "sweep to dustpan", "put in drawer" and "drag stick" tasks. On the "meat off grill" task, where the best performance is not reached, our method still ranks second best. In conclusion, the experimental results illustrate the effectiveness of our proposed method across multiple language-conditioned robotic manipulation tasks.

4.3 Ablation Study

[Figure 3: Learning curves of ManiGaussian and GNFactor on RLBench (average success rate vs. training steps).]

Our dynamic Gaussian Splatting framework models the propagation of diverse features in the Gaussian embedding space, and the Gaussian world model reconstructs the future scene according to the current scene by constraining the consistency between the reconstructed and realistic scenes for dynamics mining. We conduct an ablation study to verify the effectiveness of each presented component in Table 2, where we manually group the 10 RLBench tasks into 6 categories according to their main challenges to explain the sources of improvement. We first implement a vanilla baseline without any proposed technique, where we directly train the representation model and the action decoder in the Gaussian world model to predict the optimal robot action. By adding the Gaussian regressor to predict the Gaussian parameters in the dynamic Gaussian Splatting framework, the performance improves by 15.6% compared with the vanilla baseline. In particular, on tasks that require geometric reasoning such as "Occlusion", "Tools" and "Screw", it outperforms the vanilla version by sizable margins, which proves the ability of Gaussian Splatting to model spatial information for manipulation tasks. We then add semantic features distilled from the pretrained foundation model into the dynamic Gaussian Splatting framework, which contain high-level visual information of the observed scenes. By adding the semantic features and the related consistency loss, the average success rate increases by 2.4% over the geometry-only version, which indicates the benefit of rich semantic information for robotic manipulation. Besides, we implement the deformation predictor and the corresponding future scene consistency loss on top of the geometry-only version to mine the scene-level dynamics via future scene reconstruction, resulting in a further performance improvement of 4.4%. In particular, the proposed deformation predictor improves task completion on 4 out of 6 task types, including "Planning", "Long", "Motion", and "Occlusion", which demonstrates the importance of the scene-level dynamics encoded by the deformation predictor in the Gaussian world model. After combining all the proposed techniques in our dynamic Gaussian Splatting framework, the performance increases from 23.6% to 44.8%, which verifies the necessity of the scene-level spatiotemporal dynamics mined by the proposed dynamic Gaussian Splatting framework with the Gaussian world model parameterization.

Figure 3 shows the learning curves of the proposed ManiGaussian and the state-of-the-art method GNFactor, where we save and test checkpoints every 10k parameter updates. Both methods converge within 100k training steps. As shown in Figure 3, our ManiGaussian outperforms GNFactor, achieving 1.18× better performance with 2.29× faster training. This result shows that our ManiGaussian not only performs better but also trains faster, which also indicates the efficiency of explicit Gaussian scene reconstruction compared with implicit approaches such as NeRF.

[Figure 4: Qualitative comparison of whole trajectories generated by GNFactor and our ManiGaussian on the "slide block" and "turn tap" tasks.]
[Figure 5: Novel view synthesis and future scene prediction results of the compared methods.]

4.4 Qualitative Analysis

Visualization of Whole Trajectories. We present two qualitative examples of the generated action sequences from GNFactor and our ManiGaussian in Figure 4. In the top case, the agent is instructed to "slide the block to the yellow target". The results show that the previous agent struggles to complete the task, since it imitates the expert's backward pulling motion even though the gripper is already leaning towards the right side of the red block. In contrast, our ManiGaussian returns to the red block and successfully slides it to the yellow target, owing to the fact that our method correctly understands the scene dynamics of objects in contact. In the bottom case, the agent is instructed to "turn left tap". The results show that the GNFactor agent misunderstands the meaning of "left", operates the right tap instead, and also fails to turn the tap. Our ManiGaussian successfully completes the task, which shows that our method can not only understand the semantic information but also execute operations accurately.

Visualization of Novel View Synthesis. Figure 5 shows the novel view image synthesis results, where the action loss is removed for all compared methods for better visualization. The results show that, first, given only the front-view observation in which the gripper shape cannot be fully seen, our ManiGaussian offers superior detail in modeling the cubes in novel views within the same number of training steps. Second, our method accurately predicts future states and can model interactions between objects on top of accurately recovering object details. For example, in the top case of the "slide block" task, our ManiGaussian not only predicts the future gripper position that corresponds to the human instruction, but also predicts the future object location influenced by the gripper, based on its understanding of the scene dynamics of the physical interaction among objects. This qualitative result demonstrates that our ManiGaussian successfully learns the intricate scene-level dynamics.

5 Conclusion

In this paper, we have presented ManiGaussian, an agent that encodes scene-level spatiotemporal dynamics for language-conditioned robotic manipulation. We design a dynamic Gaussian Splatting framework that models the propagation of diverse semantic features in the Gaussian embedding space, where the semantic features with scene dynamics are leveraged to predict the optimal robot actions. Subsequently, we build a Gaussian world model to parameterize the distributions in the dynamic Gaussian Splatting framework, which mines scene-level dynamics by reconstructing the future scene according to the current scene and the robot actions. Extensive experiments on a wide variety of manipulation tasks demonstrate the superiority of our ManiGaussian compared with the state-of-the-art methods. The main limitation of our ManiGaussian stems from the necessity of multi-view supervision with camera calibration for the dynamic Gaussian Splatting framework.

References

  • [1]Abou-Chakra, J., Rana, K., Dayoub, F., Sünderhauf, N.: Physically embodied gaussian splatting: Embedding physical priors into a visual 3d world model for robotics. In: CoRL (2023)
  • [2] Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: Rt-1: Robotics transformer for real-world control at scale. arXiv (2022)
  • [3]Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)
  • [4]Charatan, D., Li, S., Tagliasacchi, A., Sitzmann, V.: pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. arXiv preprint arXiv:2312.12337 (2023)
  • [5]Chen, S., Garcia, R., Schmid, C., Laptev, I.: Polarnet: 3d point clouds for language-guided robotic manipulation. arXiv preprint arXiv:2309.15596 (2023)
  • [6]Driess, D., Schubert, I., Florence, P., Li, Y., Toussaint, M.: Reinforcement learning with neural radiance fields. NeurIPS (2022)
  • [7]Du, Y., Yang, S., Dai, B., Dai, H., Nachum, O., Tenenbaum, J., Schuurmans, D., Abbeel, P.: Learning universal policies via text-guided video generation. NeurIPS 36 (2024)
  • [8]Fu, Y., Liu, S., Kulkarni, A., Kautz, J., Efros, A.A., Wang, X.: Colmap-free 3d gaussian splatting. arXiv preprint arXiv:2312.07504 (2023)
  • [9]Fu, Z., Zhao, T.Z., Finn, C.: Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117 (2024)
  • [10]Gao, Z., Mu, Y., Shen, R., Chen, C., Ren, Y., Chen, J., Li, S.E., Luo, P., Lu, Y.: Enhance sample efficiency and robustness of end-to-end urban autonomous driving via semantic masked world model. arXiv preprint arXiv:2210.04017 (2022)
  • [11]Gervet, T., Xian, Z., Gkanatsios, N., Fragkiadaki, K.: Act3d: 3d feature field transformers for multi-task robotic manipulation. In: CoRL. pp. 3949–3965 (2023)
  • [12]Goyal, A., Xu, J., Guo, Y., Blukis, V., Chao, Y.W., Fox, D.: Rvt: Robotic view transformer for 3d object manipulation. arXiv preprint arXiv:2306.14896 (2023)
  • [13]Guhur, P.L., Chen, S., Pinel, R.G., Tapaswi, M., Laptev, I., Schmid, C.: Instruction-driven history-aware policies for robotic manipulations. In: CoRL. pp. 175–187. PMLR (2023)
  • [14]Ha, D., Schmidhuber, J.: Recurrent world models facilitate policy evolution. NeurIPS 31 (2018)
  • [15]Hafner, D., Lee, K.H., Fischer, I., Abbeel, P.: Deep hierarchical planning from pixels. NeurIPS 35, 26091–26104 (2022)
  • [16]Hafner, D., Lillicrap, T., Ba, J., Norouzi, M.: Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603 (2019)
  • [17]Hafner, D., Lillicrap, T., Norouzi, M., Ba, J.: Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193 (2020)
  • [18]Hafner, D., Pasukonis, J., Ba, J., Lillicrap, T.: Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104 (2023)
  • [19]Hansen, N., Su, H., Wang, X.: Td-mpc2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828 (2023)
  • [20]Hansen, N., Wang, X.: Generalization in reinforcement learning by soft data augmentation. In: ICRA (2021)
  • [21]Hu, A., Corrado, G., Griffiths, N., Murez, Z., Gurau, C., Yeo, H., Kendall, A., Cipolla, R., Shotton, J.: Model-based imitation learning for urban driving. NeurIPS 35, 20703–20716 (2022)
  • [22]Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., Corrado, G.: Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080 (2023)
  • [23]Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., Carreira, J.: Perceiver: General perception with iterative attention. In: ICML (2021)
  • [24]James, S., Ma, Z., Arrojo, D.R., Davison, A.J.: Rlbench: The robot learning benchmark & learning environment. RA-L (2020)
  • [25]James, S., Wada, K., Laidlow, T., Davison, A.J.: Coarse-to-fine q-attention: Efficient learning for visual robotic manipulation via discretisation. In: CVPR (2022)
  • [26]Jang, E., Irpan, A., Khansari, M., Kappler, D., Ebert, F., Lynch, C., Levine, S., Finn, C.: Bc-z: Zero-shot task generalization with robotic imitation learning. In: CoRL (2022)
  • [27]Jiang, Z., Zhu, Y., Svetlik, M., Fang, K., Zhu, Y.: Synergies between affordance and geometry: 6-dof grasp detection via implicit representations. arXiv preprint arXiv:2104.01542 (2021)
  • [28]Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al.: Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv (2018)
  • [29]Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. TOG 42(4) (2023)
  • [30]Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: Lerf: Language embedded radiance fields. arXiv preprint arXiv:2303.09553 (2023)
  • [31]Laskin, M., Srinivas, A., Abbeel, P.: Curl: Contrastive unsupervised representations for reinforcement learning. In: ICML (2020)
  • [32]Li, Y., Li, S., Sitzmann, V., Agrawal, P., Torralba, A.: 3d neural scene representations for visuomotor control. In: CoRL. pp. 112–123 (2022)
  • [33]Liang, L., Bian, L., Xiao, C., Zhang, J., Chen, L., Liu, I., Xiang, F., Huang, Z., Su, H.: Robo360: A 3d omnispective multi-material robotic manipulation dataset. arXiv preprint arXiv:2312.06686 (2023)
  • [34]Lin, J., Du, Y., Watkins, O., Hafner, D., Abbeel, P., Klein, D., Dragan, A.: Learning to model the world with language. arXiv preprint arXiv:2308.01399 (2023)
  • [35]Lin, Y.C., Florence, P., Zeng, A., Barron, J.T., Du, Y., Ma, W.C., Simeonov, A., Garcia, A.R., Isola, P.: Mira: Mental imagery for robotic affordances. In: CoRL. pp. 1916–1927 (2023)
  • [36]Liu, H., Lee, L., Lee, K., Abbeel, P.: Instruction-following agents with jointly pre-trained vision-language models. arXiv preprint arXiv:2210.13431 (2022)
  • [37]Liu, Y., Li, C., Yang, C., Yuan, Y.: Endogaussian: Gaussian splatting for deformable surgical scene reconstruction. arXiv preprint arXiv:2401.12561 (2024)
  • [38]Lu, G., Wang, Z., Liu, C., Lu, J., Tang, Y.: Thinkbot: Embodied instruction following with thought chain reasoning. arXiv preprint arXiv:2312.07062 (2023)
  • [39]Luiten, J., Kopanas, G., Leibe, B., Ramanan, D.: Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713 (2023)
  • [40]Mendonca, R., Bahl, S., Pathak, D.: Structured world models from human videos. arXiv preprint arXiv:2308.10901 (2023)
  • [41]Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. CACM 65(1), 99–106 (2021)
  • [42]Nair, S., Rajeswaran, A., Kumar, V., Finn, C., Gupta, A.: R3m: A universal visual representation for robot manipulation. arXiv (2022)
  • [43]Parisi, S., Rajeswaran, A., Purushwalkam, S., Gupta, A.: The unsurprising effectiveness of pre-trained vision models for control. In: ICML (2022)
  • [44]Qian, G., Li, Y., Peng, H., Mai, J., Hammoud, H., Elhoseiny, M., Ghanem, B.: Pointnext: Revisiting pointnet++ with improved training and scaling strategies. NeurIPS 35, 23192–23204 (2022)
  • [45]Qin, M., Li, W., Zhou, J., Wang, H., Pfister, H.: Langsplat: 3d language gaussian splatting. arXiv preprint arXiv:2312.16084 (2023)
  • [46]Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
  • [47]Radosavovic, I., Xiao, T., James, S., Abbeel, P., Malik, J., Darrell, T.: Real-world robot learning with masked visual pre-training. In: CoRL (2023)
  • [48]Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
  • [49]Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al.: Mastering atari, go, chess and shogi by planning with a learned model. Nature 588(7839), 604–609 (2020)
  • [50]Seo, Y., Hafner, D., Liu, H., Liu, F., James, S., Lee, K., Abbeel, P.: Masked world models for visual control. In: CoRL. pp. 1332–1344 (2023)
  • [51]Seo, Y., Lee, K., James, S.L., Abbeel, P.: Reinforcement learning with action-free pre-training from videos. In: ICML. pp. 19561–19579 (2022)
  • [52]Shi, J.C., Wang, M., Duan, H.B., Guan, S.H.: Language embedded 3d gaussians for open-vocabulary scene understanding. arXiv preprint arXiv:2311.18482 (2023)
  • [53]Shim, D., Lee, S., Kim, H.J.: Snerl: Semantic-aware neural radiance fields for reinforcement learning. ICML (2023)
  • [54]Shridhar, M., Manuelli, L., Fox, D.: Cliport: What and where pathways for robotic manipulation. In: CoRL (2022)
  • [55]Shridhar, M., Manuelli, L., Fox, D.: Perceiver-actor: A multi-task transformer for robotic manipulation. In: CoRL (2023)
  • [56]Szymanowicz, S., Rupprecht, C., Vedaldi, A.: Splatter image: Ultra-fast single-view 3d reconstruction. arXiv preprint arXiv:2312.13150 (2023)
  • [57]Wang, X., Zhu, Z., Huang, G., Chen, X., Lu, J.: Drivedreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777 (2023)
  • [58]Wang, Z., Cai, S., Chen, G., Liu, A., Ma, X., Liang, Y.: Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560 (2023)
  • [59]Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528 (2023)
  • [60]Wu, J., Ma, H., Deng, C., Long, M.: Pre-training contextualized world models with in-the-wild videos for reinforcement learning. NeurIPS 36 (2024)
  • [61]Wu, P., Escontrela, A., Hafner, D., Abbeel, P., Goldberg, K.: Daydreamer: World models for physical robot learning. In: CoRL. pp. 2226–2240 (2023)
  • [62]Xie, T., Zong, Z., Qiu, Y., Li, X., Feng, Y., Yang, Y., Jiang, C.: Physgaussian: Physics-integrated 3d gaussians for generative dynamics. arXiv preprint arXiv:2311.12198 (2023)
  • [63]Xu, D., Yuan, Y., Mardani, M., Liu, S., Song, J., Wang, Z., Vahdat, A.: Agg: Amortized generative 3d gaussians for single image to 3d. arXiv preprint arXiv:2401.04099 (2024)
  • [64]Yang, Z., Yang, H., Pan, Z., Zhu, X., Zhang, L.: Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. arXiv preprint arXiv:2310.10642 (2023)
  • [65]Yang, Z., Gao, X., Zhou, W., Jiao, S., Zhang, Y., Jin, X.: Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101 (2023)
  • [66]Ye, W., Liu, S., Kurutach, T., Abbeel, P., Gao, Y.: Mastering atari games with limited data. NeurIPS 34, 25476–25488 (2021)
  • [67]You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., Hsieh, C.J.: Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv (2019)
  • [68]Ze, Y., Hansen, N., Chen, Y., Jain, M., Wang, X.: Visual reinforcement learning with self-supervised 3d representations. RA-L (2023)
  • [69]Ze, Y., Yan, G., Wu, Y.H., Macaluso, A., Ge, Y., Ye, J., Hansen, N., Li, L.E., Wang, X.: Gnfactor: Multi-task real robot learning with generalizable neural feature fields. In: CoRL. pp. 284–301. PMLR (2023)
  • [70]Zeng, A., Florence, P., Tompson, J., Welker, S., Chien, J., Attarian, M., Armstrong, T., Krasin, I., Duong, D., Sindhwani, V., et al.: Transporter networks: Rearranging the visual world for robotic manipulation. In: CoRL (2021)
  • [71]Zhang, T., Hu, Y., Cui, H., Zhao, H., Gao, Y.: A universal semantic-geometric representation for robotic manipulation. arXiv preprint arXiv:2306.10474 (2023)
  • [72]Zheng, S., Zhou, B., Shao, R., Liu, B., Zhang, S., Nie, L., Liu, Y.: Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. arXiv preprint arXiv:2312.02155 (2023)
  • [73]Zhu, L., Wang, Z., Jin, Z., Lin, G., Yu, L.: Deformable endoscopic tissues reconstruction with gaussian splatting. arXiv preprint arXiv:2401.11535 (2024)
  • [74]Zou, Z.X., Yu, Z., Guo, Y.C., Li, Y., Liang, D., Cao, Y.P., Zhang, S.H.: Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers. arXiv preprint arXiv:2312.09147 (2023)
  • [75]Zuo, X., Samangouei, P., Zhou, Y., Di, Y., Li, M.: Fmgs: Foundation model embedded 3d gaussian splatting for holistic 3d scene understanding. arXiv preprint arXiv:2401.01970 (2024)