Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective

AAAI 2026

Nhat Chung*,1       Taisei Hanyu*,2       Toan Nguyen1       Huy Le1      
Frederick Bumgarner2       Duy Nguyen Ho Minh3,7,8       Khoa Vo2       Kashu Yamazaki4      
Chase Rainwater2       Tung Kieu5       Anh Nguyen6       Ngan Le2              

*Equal contributions  
1FPT Software AI Center   2University of Arkansas  
3University of Stuttgart   4Carnegie Mellon University  
5Aalborg University   6University of Liverpool  
7German Research Center for Artificial Intelligence (DFKI)  
8Max Planck Research School for Intelligent Systems (IMPRS-IS)  
[Figure: Scene overview]

LIBERO-Mem is a non-Markovian task suite for stress-testing robotic manipulation under object-level partial observability. It combines short- and long-horizon object tracking with temporally sequenced subgoals, requiring reasoning beyond the current frame.

Abstract

As embodied agents operate in increasingly complex environments, the ability to perceive, track, and reason about individual object instances over time becomes essential, especially in tasks requiring sequenced interactions with visually similar objects. In these non-Markovian settings, key decision cues are often hidden in object-specific histories rather than the current scene. Without persistent memory of prior interactions, such as what has been interacted with, where it has been, or how it has changed, visuomotor policies may fail, repeat past actions, or overlook completed ones. To surface this challenge, we introduce LIBERO-Mem, a non-Markovian task suite for stress-testing robotic manipulation under object-level partial observability. It combines short- and long-horizon object tracking with temporally sequenced subgoals, requiring reasoning beyond the current frame. However, naïve vision-language-action (VLA) models struggle in such settings, with token scaling quickly becoming intractable, even for tasks spanning just a few hundred frames. We propose Embodied-SlotSSM, a slot-centric VLA framework built for temporal scalability. It maintains spatio-temporally consistent slot identities and leverages them through two mechanisms: (1) slot-state-space modeling for reconstructing short-term history, and (2) a relational encoder to align the input tokens with action decoding. Together, these components enable temporally grounded, context-aware action prediction. Experiments show Embodied-SlotSSM's baseline performance on LIBERO-Mem and general benchmarks, offering a scalable solution for non-Markovian reasoning in object-centric robotic policies.
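To make the token-scaling concern concrete, the following back-of-the-envelope sketch compares the context size of a policy that keeps every past frame in its input with a fixed-size slot memory. The values `TOKENS_PER_FRAME = 256` and `NUM_SLOTS = 8` are illustrative assumptions, not figures from the paper, and both helper functions are hypothetical.

```python
def naive_context_tokens(num_frames: int, tokens_per_frame: int) -> int:
    """Tokens in context if every past frame is kept as raw visual tokens."""
    return num_frames * tokens_per_frame


def slot_memory_states(num_slots: int) -> int:
    """Size of a fixed slot memory: one recurrent state per slot, independent of time."""
    return num_slots


TOKENS_PER_FRAME = 256  # assumed number of patch tokens per image (not from the paper)
NUM_SLOTS = 8           # assumed number of object slots (not from the paper)

for frames in (1, 100, 300, 700):
    naive = naive_context_tokens(frames, TOKENS_PER_FRAME)
    slots = slot_memory_states(NUM_SLOTS)
    print(f"{frames:>4} frames: naive context = {naive:>7} tokens, slot memory = {slots} states")
```

Under these assumed numbers the naive context grows linearly with episode length, while the slot memory stays constant; self-attention over the naive context would grow faster still, since its cost scales with the square of the token count.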

Key Contributions

LIBERO-Mem consists of 10 tasks spanning four object-centric memory dimensions: Object Motion (OM), Object Sequence (OS), Object Relations (OR), and Object Occlusion (OO). Each task presents temporal dependencies and ambiguity requiring structured memory beyond instantaneous observations.

[Figure: Novelty of LIBERO]

  • Short- and long-horizon data collection process. Expert demonstrations are collected via smooth keyboard control with multi-key tracking. Each task contains 200–700 frames, supporting both short- and long-horizon evaluation. For each task, 120 trajectories are collected: 100 are refined into training data and 20 are held out for validation.
  • Subgoal-aware evaluation. Each task is decomposed into symbolic subgoals using Sequence (→) and Or (∨) operators for fine-grained evaluation.
  • Object identity ambiguities. Visually identical bowls and plates differ only in asset ID, requiring agents to resolve object identity from temporal interaction history.
  • Object & subgoal annotations. Each timestep includes object instance IDs, masks, and per-object subgoal completion flags.
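The Sequence (→) and Or (∨) operators described above could be evaluated along the following lines. This is a minimal sketch, not the benchmark's actual evaluator: the classes `Atom`, `Seq`, and `Or`, the functions `satisfied_at` and `completed_steps`, and the representation of a rollout as a list of per-timestep sets of newly completed subgoal names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple, Union


@dataclass(frozen=True)
class Atom:
    name: str  # a single subgoal, e.g. "bowl lifted"


@dataclass(frozen=True)
class Seq:
    parts: Tuple["Goal", ...]  # parts must be completed in this order


@dataclass(frozen=True)
class Or:
    options: Tuple["Goal", ...]  # any one option suffices


Goal = Union[Atom, Seq, Or]


def satisfied_at(goal: Goal, events: List[Set[str]], start: int = 0) -> Optional[int]:
    """Return the index just past the earliest timestep (scanning from `start`)
    at which `goal` is satisfied, or None if it never is."""
    if isinstance(goal, Atom):
        for t in range(start, len(events)):
            if goal.name in events[t]:
                return t + 1
        return None
    if isinstance(goal, Seq):
        pos: Optional[int] = start
        for part in goal.parts:
            pos = satisfied_at(part, events, pos)
            if pos is None:
                return None
        return pos
    # Or: take the option that finishes earliest.
    ends = [satisfied_at(opt, events, start) for opt in goal.options]
    finished = [e for e in ends if e is not None]
    return min(finished) if finished else None


def completed_steps(seq: Seq, events: List[Set[str]]) -> int:
    """Fine-grained score: how many leading parts of a sequence were completed in order."""
    pos: Optional[int] = 0
    done = 0
    for part in seq.parts:
        pos = satisfied_at(part, events, pos)
        if pos is None:
            break
        done += 1
    return done


goal = Seq((Atom("bowl lifted"), Atom("bowl on plate")))
events = [set(), {"bowl lifted"}, set(), {"bowl on plate"}]
print(satisfied_at(goal, events))          # full rollout
print(completed_steps(goal, events[:2]))   # truncated rollout, partial credit
```

Matching each part at its earliest possible timestep is sufficient here, because finishing a part sooner never reduces the chance of matching the parts that follow.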

| Task | Task Description | Subtask Goals | Types |
|---|---|---|---|
| Task 1 | robot to pick up the bowl and place it back on the plate | bowl lifted → bowl on plate | OM |
| Task 2 | robot to lift the bottle and put it down on the plate | bottle lifted → bottle on plate | OM |
| Task 3 | robot to lift the bowl and place it back on the plate 3 times | (bowl lifted → bowl on plate) × 3 | OM, OS |
| Task 4 | robot to pick up the bottle and put it down on the plate 3 times | (bottle lifted → bottle on plate) × 3 | OM, OS |
| Task 5 | robot to lift the bowl and place it back on the plate 5 times | (bowl lifted → bowl on plate) × 5 | OM, OS |
| Task 6 | robot to pick up the bowl and put it on the plate 7 times | (bowl lifted → bowl on plate) × 7 | OM, OS |
| Task 7 | robot to swap 2 bowls on their plates using the rotation rule | bowl 1 on plate 3 → bowl 2 on plate 1 → bowl 1 on plate 2 | OM, OR |
| Task 8 | robot to swap 3 bowls on their plates using the rotation rule | bowl 1 on plate 4 → bowl 2 on plate 1 → bowl 3 on plate 2 → bowl 1 on plate 3 | OM, OR |
| Task 9 | robot to put bowl in closest basket and move basket to the middle | bowl 1 in basket 1 → basket 1 in center | OM, OO |
| Task 10 | robot to put bowl in closest basket and move empty basket to middle | bowl 1 in basket 1 → basket 2 in center | OM, OO |
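The × N entries in the table above can be read as repeating a lift-and-place cycle N times. The sketch below, under that reading, expands a cycle into a flat ordered subgoal list and counts completed cycles from an event history. The function names `expand_repeated` and `cycles_completed` and the event strings are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def expand_repeated(cycle: Sequence[str], times: int) -> List[str]:
    """Unroll a repeated subgoal cycle into one flat, ordered subgoal list."""
    return [step for _ in range(times) for step in cycle]


def cycles_completed(history: Sequence[str], cycle: Sequence[str]) -> int:
    """Count full in-order passes through `cycle` observed in `history`."""
    idx = 0
    done = 0
    for event in history:
        if event == cycle[idx]:
            idx += 1
            if idx == len(cycle):
                done += 1
                idx = 0
    return done


CYCLE = ["bowl lifted", "bowl on plate"]

task6_subgoals = expand_repeated(CYCLE, times=7)
print(len(task6_subgoals))

# After any placement the scene looks the same (bowl on plate), so the number
# of remaining repetitions is only recoverable from the interaction history.
history = ["bowl lifted", "bowl on plate", "bowl lifted", "bowl on plate", "bowl lifted"]
print(cycles_completed(history, CYCLE))
```

This is the sense in which the repeated tasks are non-Markovian: two timesteps with identical observations can require different actions depending on how many cycles have already been completed.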

Embodied-SlotSSM is a slot-based state-space modeling framework that encodes persistent, object-centric memory representations, enabling structured tracking and decision-making under partial observability.
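As a rough intuition for persistent per-object memory, the toy sketch below keeps one recurrent state per slot and updates it with a diagonal linear state-space step. It is not the Embodied-SlotSSM architecture: the class `SlotMemory`, the function `ssm_step`, the fixed scalar `decay`, and the hand-written feature vectors are all assumptions, standing in for components that a real model would learn.

```python
from typing import Dict, List


def ssm_step(state: List[float], obs: List[float], decay: float) -> List[float]:
    """Diagonal linear state-space update: h_t = decay * h_{t-1} + (1 - decay) * x_t."""
    return [decay * h + (1.0 - decay) * x for h, x in zip(state, obs)]


class SlotMemory:
    """Keeps one recurrent state per slot ID (toy stand-in for slot-wise memory)."""

    def __init__(self, dim: int, decay: float = 0.9) -> None:
        self.dim = dim
        self.decay = decay
        self.states: Dict[int, List[float]] = {}

    def update(self, slot_features: Dict[int, List[float]]) -> Dict[int, List[float]]:
        """Update only the slots observed this step; unobserved slots keep their state."""
        for slot_id, feat in slot_features.items():
            prev = self.states.get(slot_id, [0.0] * self.dim)
            self.states[slot_id] = ssm_step(prev, feat, self.decay)
        return self.states


memory = SlotMemory(dim=2, decay=0.5)
memory.update({0: [1.0, 0.0], 1: [0.0, 1.0]})  # both objects visible
memory.update({0: [1.0, 0.0]})                 # object 1 occluded this step
print(memory.states)
```

The memory size depends on the number of slots rather than the number of frames, and an occluded object's state is retained rather than dropped, which is the property the object-occlusion tasks are meant to probe.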

[Figure: Embodied-SlotSSM]

Visualizations

LIBERO-Mem tasks T1–T10 with ordered subgoal notes

T1 · pick up the bowl and place it back on the plate

Subgoals

  • one: bowl 1 on plate 1
T2 · lift the bottle and put it down on the plate

Subgoals

  • one: bottle 1 on plate 1
T3 · lift the bowl and place it back on the plate 3 times

Subgoals

  • one: bowl 1 on plate 1
  • two: bowl 1 on plate 1
  • three: bowl 1 on plate 1
T4 · pick up the bottle and put it down on the plate 3 times

Subgoals

  • one: bottle 1 on plate 1
  • two: bottle 1 on plate 1
  • three: bottle 1 on plate 1
T5 · lift the bowl and place it back on the plate 5 times

Subgoals

  • one: bowl 1 on plate 1
  • two: bowl 1 on plate 1
  • three: bowl 1 on plate 1
  • four: bowl 1 on plate 1
  • five: bowl 1 on plate 1
T6 · pick up the bowl and place it on the plate 7 times

Subgoals

  • one: bowl 1 on plate 1
  • two: bowl 1 on plate 1
  • three: bowl 1 on plate 1
  • four: bowl 1 on plate 1
  • five: bowl 1 on plate 1
  • six: bowl 1 on plate 1
  • seven: bowl 1 on plate 1
T7 · swap the 2 bowls on their plates using the empty plate

Subgoals

  • one: bowl 1 on plate 1
  • two: bowl 2 on plate 2
  • three: bowl 1 on plate 3
T8 · rotate the 3 bowls on their plates from left to right using the empty plate

Subgoals

  • one: bowl 1 on plate 1
  • two: bowl 2 on plate 2
  • three: bowl 3 on plate 3
  • four: bowl 1 on plate 4
T9 · put the cream cheese in the nearest basket and place that basket in the center

Subgoals

  • one: cream cheese 1 in basket 1
  • two: basket 1 in center
T10 · put the cream cheese in the nearest basket and place the empty basket in the center

Subgoals

  • one: cream cheese 1 in basket 1
  • two: basket 2 in center

Acknowledgements

We borrow the GitHub page template from HabiCrowd and HyperNeRF. Special thanks to them!