
Reinforcement Learning for Cooperative Manipulation

Project in Robot Learning · CS391R · UT Austin · 2021 · with Steven Patrick

An improvement on multi-agent reinforcement learning for cooperative robotic manipulation. We replace the original intrinsic motivation paper's PPO optimizer with DDPG and add Hindsight Experience Replay, achieving faster convergence and a success rate above 90% on OpenAI Fetch pick-and-place.

Python · PyTorch · 90%+ success rate

Pick-and-place demo (learned policy)

Fetch robot arm performing pick-and-place with learned DDPG-HER policy

Robosuite two-arm lift environment

Robosuite two-arm cooperative lifting task
90%+ success rate · DDPG+HER on Fetch pick-and-place
50k Fetch training runs · 150 time steps each
100k Robosuite runs · two-arm lift training
4 hidden layers · 64 units each · actor & critic
Overview

The problem: making robots cooperate

Single-robot manipulation is well-studied, but coordinating multiple robots to lift a shared object — where each robot's action affects the other — introduces non-stationarity that breaks standard single-agent RL. This project builds on an intrinsic motivation approach that rewards collective over individual actions, and improves it with a more sample-efficient optimizer.

🤝

Cooperative task

Two robot arms must jointly lift an object that neither can lift alone. The reward for collective action is defined intrinsically — by comparing the joint outcome to what each agent would achieve individually.

🎯

Sparse rewards are hard

In pick-and-place, the robot only gets a reward when the object reaches the exact goal. With no intermediate feedback, naive DDPG never learns. HER solves this by retroactively treating visited states as goals.

🔁

Off-policy data reuse

Unlike PPO (the baseline), DDPG is off-policy: past episode data can be replayed from the buffer. This makes every simulation run more valuable and dramatically reduces training cost.
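As a rough sketch of the idea (not our actual code), a replay buffer is just a fixed-capacity store of past transitions that gets re-sampled during training; the class and field names below are illustrative.

import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-capacity store of past transitions; old data is re-sampled for training."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size=256):
        # Uniform sampling: every simulated step can contribute to many gradient updates.
        return random.sample(self.buffer, batch_size)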

Methods

Three algorithms, one pipeline

The contribution is a specific combination: the intrinsic reward formulation from prior work, upgraded to an off-policy training loop with experience replay. Each piece addresses a distinct failure mode of the others.

Intrinsic Motivation · From prior work

Defines an intrinsic reward as the L2 distance between the next state predicted under the joint action and the next state predicted by chaining single-agent actions. It rewards behaviors that require cooperation — actions where individual agents could not produce the same outcome alone.

r_intrinsic = ‖f_joint(s,a) − f_composed(s,a)‖
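A minimal PyTorch sketch of this computation; the function names (f_joint, partial_models) and the way the joint action is split per agent are illustrative assumptions, not our actual code.

import torch

def intrinsic_reward(f_joint, partial_models, state, joint_action):
    # Joint prediction: all agents' actions applied simultaneously.
    s_joint = f_joint(state, joint_action)
    # Composed prediction: chain the frozen single-agent estimators one at a time.
    per_agent_actions = torch.chunk(joint_action, len(partial_models), dim=-1)
    s_composed = state
    for f_partial, action in zip(partial_models, per_agent_actions):
        s_composed = f_partial(s_composed, action)
    # Large when the joint outcome differs from the chained single-agent outcome.
    return torch.norm(s_joint - s_composed, dim=-1)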

DDPG · Our improvement

Deep Deterministic Policy Gradient replaces PPO as the optimizer. It is off-policy: episode data is collected with noise-injected actions and stored in a replay buffer for reuse. It is well suited to continuous action spaces such as robot-arm joint offsets.

Soft target network updates: θ′ ← τθ + (1−τ)θ′
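A sketch of the soft target update in PyTorch; the value tau = 0.005 is an assumed default, not necessarily our setting.

import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    # θ′ ← τθ + (1 − τ)θ′: target weights slowly track the online network for stable bootstrapping.
    for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)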

HER (Hindsight Experience Replay) · Our improvement

Augments the replay buffer by relabelling failed episodes with the state the robot actually reached as a substitute goal, so the agent learns from failure. This is critical in sparse-reward environments where successes are initially near-impossible to sample.

Goal-agnostic reward structure is a prerequisite — satisfied by all Fetch and Robosuite tasks.
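A minimal "future"-strategy relabelling sketch, assuming transitions are stored as dicts with achieved_goal / desired_goal keys (as in the Gym goal-based API) and a goal-agnostic reward_fn that can recompute the sparse reward for any goal; the field names and k value are illustrative.

import random

def her_relabel(episode, reward_fn, k=4):
    relabelled = []
    for t, step in enumerate(episode):
        # Pick substitute goals from states actually reached later in the same episode.
        for future in random.choices(episode[t:], k=k):
            new_goal = future["achieved_goal"]
            relabelled.append({
                **step,
                "desired_goal": new_goal,
                # Recompute the sparse reward against the substitute goal.
                "reward": reward_fn(step["achieved_goal"], new_goal),
            })
    return relabelled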

Environments

Two simulation testbeds

OpenAI Gym — Fetch

Fetch Reach

Fetch Reach environment

Fetch Pick-and-Place

Fetch pick-and-place environment
DOF: 7 (6 arm + gripper open/close)
Reach task: move the end effector to an xyz position — shaped reward, fast convergence
Pick-and-place: move a block to a 3D goal — sparse binary reward, requires HER
Training runs: 50,000 episodes · 150 steps each
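For reference, a minimal interaction loop for the goal-conditioned Fetch task under the older Gym robotics API; the environment id version and the random policy are illustrative, not our training setup.

import gym

env = gym.make("FetchPickAndPlace-v1")     # sparse binary reward by default
obs = env.reset()                          # dict: observation / achieved_goal / desired_goal
for _ in range(150):                       # 150 time steps per episode, as in our runs
    action = env.action_space.sample()     # end-effector offset + gripper command
    obs, reward, done, info = env.step(action)
    if done:
        break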

Robosuite

Single-arm lift

Robosuite single arm lift task

Two-arm lift

Robosuite two-arm cooperative lift task
DOF: 7 per arm (6 arm + gripper) — 14 total for two-arm
Single arm: grasp and move a cube — reward shaping for grasping
Two arms: collaboratively lift a bucket via two handles — intrinsic reward
Training runs: 100,000 episodes · same step length
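A sketch of how the two-arm lift task can be instantiated through robosuite's factory function; the robot models and keyword choices shown are assumptions, not necessarily our exact configuration.

import numpy as np
import robosuite as suite

env = suite.make(
    env_name="TwoArmLift",            # two arms lift a shared object by its handles
    robots=["Panda", "Panda"],        # one robot model per arm (assumed choice)
    has_renderer=False,
    use_camera_obs=False,
    reward_shaping=True,
)
obs = env.reset()
low, high = env.action_spec           # concatenated action bounds for both arms
obs, reward, done, info = env.step(np.random.uniform(low, high))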
Results

What worked, what didn't, and why

The key finding is that DDPG alone fails on sparse-reward environments, but DDPG+HER reliably solves them. The two-arm lift remained partially unsolved — a reward shaping issue, not an algorithmic one.

DDPG convergence — Fetch Reach (shaped reward)

DDPG reward convergence on Fetch Reach environment

DDPG+HER success rate — Fetch Pick-and-Place (sparse reward)

DDPG-HER success rate vs random policy on Fetch pick-and-place
Environment: Fetch Reach · Algorithm: DDPG
Outcome: converges to a successful policy. Shaped continuous reward provides sufficient gradient signal; reward approaches 0 within a small number of episodes.

Environment: Fetch Pick-and-Place · Algorithm: DDPG (no HER)
Outcome: fails to learn. Sparse binary reward: the agent always receives −1 at the start, so there is no gradient signal and training never improves.

Environment: Fetch Pick-and-Place · Algorithm: DDPG + HER
Outcome: 90%+ success rate. HER relabels failed episodes as successful for the achieved state; the policy steadily improves toward 100%.

Environment: Robosuite Single Arm · Algorithm: DDPG
Outcome: grasp learned, goal not reached. The policy found reliable grasping; post-grasp motion is random because the reward shaping does not penalize displacement from the goal.

Environment: Robosuite Two Arm · Algorithm: DDPG + Intrinsic
Outcome: handle grasp learned; lift not learned. Gripping converged quickly; lifting was never learned because the reward over-weighted grasping vs. the object-height goal.
Architecture

Network design

The system has three types of networks. The partial estimators — one per agent — are pre-trained from human demonstrations to predict the next object state given a single agent's action. They are then frozen and chained into f_composed.

The full estimator f_joint takes all agents' actions simultaneously and is trained online. The intrinsic reward is the L2 distance between its prediction and the composed chain — a signal that fires strongly when collective action produces a qualitatively different outcome than the sum of individual actions.

Partial estimators: 4 hidden layers · 64 units · ℓ2 loss · trained from demos
Full estimator (f_joint): same architecture · trained online from RL episodes
Actor network: input is joint positions + end-effector positions + object pose → outputs Δpose
Critic network: same input as actor → outputs scalar value (score)
All networks: 4 hidden layers · 64 hidden units each
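A sketch of the shared 4×64 MLP template, assuming ReLU activations and illustrative input/output sizes; note that we sketch the critic with the action concatenated to the state, as is standard for DDPG.

import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64, layers=4):
    # 4 hidden layers of 64 units: the template shared by estimators, actor and critic.
    mods, d = [], in_dim
    for _ in range(layers):
        mods += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    mods.append(nn.Linear(d, out_dim))
    return nn.Sequential(*mods)

obs_dim, act_dim = 32, 8                   # illustrative sizes, not the project's exact dims
actor = mlp(obs_dim, act_dim)              # joint + end-effector + object pose → Δpose
critic = mlp(obs_dim + act_dim, 1)         # state and action → scalar score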

Architecture diagram — joint vs composed estimators

Network architecture showing f_joint vs f_composed and intrinsic reward computation

Single-robot baseline environment

Single robot arm picking a hammer — baseline before multi-agent extension