DeformPAM: Data-Efficient Learning for
Long-horizon Deformable Object Manipulation via
Preference-based Action Alignment

In Submission

Wendi Chen1*, Han Xue1*, Fangyuan Zhou1, Yuan Fang1, Cewu Lu1
1Shanghai Jiao Tong University, * indicates equal contribution

In long-horizon manipulation tasks, a probabilistic policy may encounter distribution shift when imperfect policy fitting leads to unseen states; as time progresses, the deviation from the expert policy becomes more significant. Our framework employs Reward-guided Action Selection (RAS) to reassess actions sampled from the generative policy model, thereby improving overall performance.

Demos of Long-horizon Tasks with Human Interference

Target πŸ‘‡

Target State of Granular Pile Shaping

Shape a Pile of Granules into a 'T' (12×)

Target πŸ‘‡

Target State of Rope Shaping

Shape a Rope into a Circle (12×)

Target πŸ‘‡

Target State of T-shirt Unfolding

T-shirt Unfolding (8×)

Abstract

In recent years, imitation learning has made progress in the field of robotic manipulation. However, it still struggles with complex long-horizon deformable object tasks, which involve high-dimensional state spaces, complex dynamics, and multimodal action distributions. Traditional imitation learning methods often require a large amount of data and suffer from distribution shift and cumulative errors in these tasks. To address these issues, we propose DeformPAM, a general, data-efficient learning framework based on preference learning and reward-guided action selection. DeformPAM decomposes long-horizon tasks into multiple action primitives, uses 3D point cloud inputs and diffusion models to model action distributions, and trains an implicit reward model from human preference data. During inference, the reward model scores multiple candidate actions and the optimal one is selected for execution, reducing anomalous actions and improving task completion quality. Experiments on three challenging real-world long-horizon deformable object manipulation tasks demonstrate the effectiveness of this method. Results show that, even with limited data, DeformPAM improves both task completion quality and efficiency over baseline methods.

Summary Video

Method

Pipeline overview of DeformPAM

Fig. 1: Pipeline overview of DeformPAM. (1) In stage 1, we assign actions for execution and annotate auxiliary actions for supervised learning in a real-world environment, and train a diffusion-based supervised primitive model. (2) In stage 2, we deploy this model in the environment to collect preference data composed of annotated and predicted actions, which are used to train a DPO-finetuned model. (3) During inference, we use the supervised model to predict candidate actions and employ an implicit reward model, derived from the two models, for Reward-guided Action Selection (RAS). The action with the highest reward is taken as the final prediction.
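A minimal sketch of the stage-2 DPO objective and the RAS step is given below. It assumes policy objects with sample and log_prob methods (for a diffusion policy, log_prob would in practice be a likelihood surrogate such as an ELBO estimate), treats the annotated action as preferred over the predicted one, and uses illustrative names and an illustrative temperature BETA; it is not the released implementation.

import torch
import torch.nn.functional as F

BETA = 0.1  # DPO temperature (illustrative value)

def dpo_loss(policy, ref_policy, obs, preferred, rejected):
    """Stage-2 DPO objective on preference pairs, here assumed to be the
    annotated (preferred) vs. model-predicted (rejected) action for a state."""
    logits = BETA * (
        policy.log_prob(obs, preferred) - ref_policy.log_prob(obs, preferred)
        - policy.log_prob(obs, rejected) + ref_policy.log_prob(obs, rejected)
    )
    return -F.logsigmoid(logits).mean()

def implicit_reward(dpo_policy, sl_policy, obs, actions):
    """Implicit reward derived from the two models:
    r(s, a) proportional to BETA * (log pi_DPO(a|s) - log pi_SL(a|s))."""
    return BETA * (dpo_policy.log_prob(obs, actions) - sl_policy.log_prob(obs, actions))

@torch.no_grad()
def ras_select(sl_policy, dpo_policy, obs, num_samples=8):
    """Sample N candidate actions from the supervised model and return the
    one with the highest implicit reward for execution."""
    candidates = sl_policy.sample(obs, num_samples)                    # (N, action_dim)
    rewards = implicit_reward(dpo_policy, sl_policy, obs, candidates)  # (N,)
    return candidates[torch.argmax(rewards)]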

Experiments

Tasks and Hardware Setup

Tasks and Primitives

(a)

Hardware Setup

(b)

(c)

Fig. 2: (a) Object states and primitives of each task. Beginning with a random complex state of an object, multiple steps of action primitives are performed to gradually achieve the target state. (b) Hardware setup and tools used in our real-world experiments. Devices and tools marked with DP are not used in primitive-based methods. (c) Some of the initial states of the three tasks during evaluation. The initial states are random and complex, which makes the tasks challenging.

Data Collection and Model Training Costs

For the primitive-based methods and the Diffusion Policy, we collect comparable amounts of data to ensure a fair comparison. The primitive-based methods (including ours) are also more efficient in terms of training time.

TABLE I: The dataset size of each model and each task. # seq. and # states indicate the number of task sequences and states.

                           Granular Pile Shaping   Rope Shaping        T-shirt Unfolding
                           # seq.    # states      # seq.   # states   # seq.   # states
Primitive-Based (Stage 1)  ~60       400           ~30      200        ~90      200
Primitive-Based (Stage 2)  ~25       200           ~10      100        ~50      146
Diffusion Policy           60        29807         50       9971       -        -

(a) Granular Pile Shaping

(b) Rope Shaping

Fig. 3: Training time of primitive-based methods and the Diffusion Policy, measured on a single NVIDIA RTX 4090.

Quantitative Results

The quantitative results show that our method achieves a higher completion quality with fewer steps and exhibits reduced variance.

You can hover over the figure to highlight the curve of our method.

IoU of Granular Pile Shaping
Earth Mover's Distance of Granular Pile Shaping

(a) Granular Pile Shaping

IoU of Rope Shaping
Earth Mover's Distance of Rope Shaping

(b) Rope Shaping

Normalized Coverage of T-shirt Unfolding
Earth Mover's Distance of T-shirt Unfolding

(c) T-shirt Unfolding

Fig. 4: Quality metrics per step on the three tasks. The results are calculated over 20 trials. Each evaluation trial ends when the policy reaches its optimal state or exceeds the maximum number of steps. SL, DPO, and RAS stand for the supervised model, the DPO-finetuned model, and reward-guided action selection, respectively.

Qualitative Results

The qualitative results also demonstrate the superiority of our method in terms of completion quality and variance.

Heatmap of Granular Pile Shaping (SL)
Heatmap of Granular Pile Shaping (SL + SL)
Heatmap of Granular Pile Shaping (DPO + RAS)
Heatmap of Granular Pile Shaping (SL + Explicit RAS)
Heatmap of Granular Pile Shaping (Ours)

(a) Granular Pile Shaping

Heatmap of Rope Shaping (SL)
Heatmap of Rope Shaping (SL + SL)
Heatmap of Rope Shaping (DPO + RAS)
Heatmap of Rope Shaping (SL + Explicit RAS)
Heatmap of Rope Shaping (Ours)

(b) Rope Shaping

Fig. 5: Final state heatmaps of 20 trials compared with the target state.

How Does Reward-guided Action Selection Contribute to Performance?

We analyze the distribution of normalized implicit reward values during inference, as shown in Fig. 6a. The distribution shows no positive correlation between the sampling probability of the action generation model and the predicted reward values, suggesting that reward-guided action selection serves as a useful quality reassessment.
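One way to make this observation concrete (not the exact analysis behind Fig. 6a): assuming the log-likelihood of each sampled candidate under the supervised policy and its normalized implicit reward were logged during inference, a rank-correlation test quantifies the lack of positive correlation. The file names and array contents below are hypothetical.

import numpy as np
from scipy.stats import spearmanr

# Hypothetical logs: one entry per sampled candidate action across all inference steps.
log_likelihoods = np.load("candidate_log_likelihoods.npy")        # log pi_SL(a | s)
normalized_rewards = np.load("candidate_normalized_rewards.npy")  # normalized implicit reward

rho, p_value = spearmanr(log_likelihoods, normalized_rewards)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
# A rho near zero (or negative) means high-likelihood samples are not necessarily
# high-reward samples, which is what motivates RAS as a quality reassessment.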

From another perspective, we compare random sampling with reward-guided action selection by varying the number \(N\) of predicted actions during inference in the T-shirt unfolding task and computing the final coverage. As shown in Fig. 6b, the model's performance gradually improves as \(N\) increases. This demonstrates that reward-guided action selection enables the model to select superior samples and thus benefit from a larger number of samples.
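A hedged sketch of this sweep follows; evaluate_episode is a hypothetical helper that runs one T-shirt unfolding trial with the given per-step action-selection function and returns the final normalized coverage, and ras_select refers to the sketch in the Method section.

import random

def coverage_vs_num_samples(sl_policy, dpo_policy, episodes, n_values=(1, 2, 4, 8)):
    """Compare RAS against uniform random selection for each candidate count N."""
    results = {}
    for n in n_values:
        ras_cov, rand_cov = [], []
        for ep in episodes:
            # RAS: sample n candidates per step, execute the highest-reward one.
            ras_cov.append(evaluate_episode(
                ep, lambda obs: ras_select(sl_policy, dpo_policy, obs, num_samples=n)))
            # Baseline: sample the same n candidates, execute one chosen uniformly at random.
            rand_cov.append(evaluate_episode(
                ep, lambda obs: sl_policy.sample(obs, n)[random.randrange(n)]))
        results[n] = (sum(ras_cov) / len(ras_cov), sum(rand_cov) / len(rand_cov))
    return results  # mean final coverage for each N: (RAS, random sampling)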

Normalized Reward Distribution during Inference

(a)

Average Coverage for Various Numbers of Predicted Actions during Inference

(b)

Fig. 6: (a) Normalized reward distribution during inference when sampling \(N = 8\) actions. (b) Average coverage for various numbers \(N\) of predicted actions during inference.

Primitive-based Methods Results

Compared with other primitive-based methods, our method reaches the target state in fewer steps, without abnormal actions or getting stuck.

Green and Red indicate success and failure. The videos play synchronously once the play button is clicked.

Trial 1

Trial 2

Trial 3

Granular Pile Shaping (12×)

Trial 1

Trial 2

Trial 3

Rope Shaping (24×)

Trial 1

Trial 2

Trial 3

T-shirt Unfolding (8×)

Primitive-free Methods Results (Diffusion Policy)

We collect data and train the Diffusion Policy on the two shaping tasks. The results indicate that the Diffusion Policy easily gets stuck when only a small amount of data is available. This may be because distribution shift leads the model to unseen states, where the multi-modal action distribution of deformable objects confuses it. For example, in the rope shaping task, when both sides of the rope are curved, the model faces two opposing choices, which makes learning difficult.

Note that we also simplify the shaping tasks for the Diffusion Policy, for example by adding a reference line for the target state in the granular pile shaping task and by using a stick for rope shaping, to reduce failures caused by contact-rich actions. The hardware setup also differs: a wrist-mounted 2D camera and a third-person 2D camera are used instead of the 3D camera.

Due to the difficulty of recording high-dynamic actions and precise position-based contact, we did not test the T-shirt unfolding task.

Granular Pile Shaping (8×)

Rope Shaping (8×)

BibTeX

@article{chen2024deformpam,
  title     = {DeformPAM: Data-Efficient Learning for Long-horizon Deformable Object Manipulation via Preference-based Action Alignment},
  author    = {Chen, Wendi and Xue, Han and Zhou, Fangyuan and Fang, Yuan and Lu, Cewu},
  journal   = {arXiv preprint arXiv:2410.11584},
  year      = {2024}
}