Robotics 26
☆ Whole-Body Proprioceptive Morphing: A Modular Soft Gripper for Robust Cross-Scale Grasping
Dong Heon Han, Xiaohao Xu, Yuxi Chen, Yusheng Zhou, Xinqi Zhang, Jiaqi Wang, Daniel Bruder, Xiaonan Huang
Biological systems, such as the octopus, exhibit masterful cross-scale
manipulation by adaptively reconfiguring their entire form, a capability that
remains elusive in robotics. Conventional soft grippers, while compliant, are
mostly constrained by a fixed global morphology, and prior shape-morphing
efforts have been largely confined to localized deformations, failing to
replicate this biological dexterity. Inspired by this natural exemplar, we
introduce the paradigm of collaborative, whole-body proprioceptive morphing,
realized in a modular soft gripper architecture. Our design is a distributed
network of modular self-sensing pneumatic actuators that enables the gripper to
intelligently reconfigure its entire topology, achieving multiple morphing
states that are controllable to form diverse polygonal shapes. By integrating
rich proprioceptive feedback from embedded sensors, our system can seamlessly
transition from a precise pinch to a large envelope grasp. We experimentally
demonstrate that this approach expands the grasping envelope and enhances
generalization across diverse object geometries (standard and irregular) and
scales (up to 10$\times$), while also unlocking novel manipulation modalities
such as multi-object and internal hook grasping. This work presents a low-cost,
easy-to-fabricate, and scalable framework that fuses distributed actuation with
integrated sensing, offering a new pathway toward achieving biological levels
of dexterity in robotic manipulation.
☆ Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Recently, augmenting Vision-Language-Action models (VLAs) with world modeling
has shown promise in improving robotic policy learning. However, it remains
challenging to jointly predict next-state observations and action sequences
because of the inherent difference between the two modalities. To address this,
we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework
that handles the modality conflict and enhances the performance of VLAs across
diverse tasks. Specifically, we propose a multimodal diffusion transformer
architecture that explicitly maintains separate modality streams while still
enabling cross-modal knowledge sharing. In addition, we introduce independent
noise perturbations for each modality and a decoupled flow-matching loss. This
design enables the model to learn the joint distribution in a bidirectional
manner while avoiding the need for a unified latent space. Based on the
decoupling of modalities during training, we also introduce a joint sampling
method that supports test-time scaling, where action and vision tokens evolve
asynchronously at different rates. Through experiments on simulated benchmarks
such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods,
while our test-time scaling approach provides an additional 2-5% boost. On
real-world tasks with the Franka Research 3, DUST improves success rates by
13%, confirming its effectiveness beyond simulation. Furthermore, pre-training
on action-free videos from BridgeV2 yields significant transfer gains on
RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.
comment: 20 pages, 10 figures
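The decoupled objective described above — independent noise and time draws per modality, with a separate flow-matching term for each stream — can be sketched as follows. This is a minimal numpy illustration under the standard linear-interpolant conditional flow-matching formulation; the function names, shapes, and the zero-predictor stand-in model are illustrative assumptions, not DUST's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(x1, velocity_fn, rng):
    """Conditional flow-matching loss for one modality.

    Draws an independent time t and Gaussian noise x0, forms the linear
    interpolant x_t = (1 - t) * x0 + t * x1, and regresses the model's
    velocity prediction onto the target velocity x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)       # modality-specific noise
    t = rng.uniform(size=(x1.shape[0], 1))   # modality-specific time
    xt = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    pred = velocity_fn(xt, t)
    return np.mean((pred - target) ** 2)

# Stand-in "model": a zero predictor, one head per modality stream.
zero_head = lambda xt, t: np.zeros_like(xt)

actions = rng.standard_normal((8, 7))    # hypothetical action chunk
vision = rng.standard_normal((8, 32))    # hypothetical next-frame latent

# Decoupled objective: each modality gets its own noise, time, and loss
# term, so no unified latent space is required.
loss = (flow_matching_loss(actions, zero_head, rng)
        + flow_matching_loss(vision, zero_head, rng))
```

Because each stream keeps its own time variable, the same structure also admits asynchronous sampling at test time, with action and vision tokens denoised at different rates.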
☆ Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs
This paper presents a framework that leverages pre-trained foundation models
for robotic manipulation without domain-specific training. The framework
integrates off-the-shelf models, combining multimodal perception from
foundation models with a general-purpose reasoning model capable of robust task
sequencing. Scene graphs, dynamically maintained within the framework, provide
spatial awareness and enable consistent reasoning about the environment. The
framework is evaluated through a series of tabletop robotic manipulation
experiments, and the results highlight its potential for building robotic
manipulation systems directly on top of off-the-shelf foundation models.
☆ EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities
Implicit policies parameterized by generative models, such as Diffusion
Policy, have become the standard for policy learning and Vision-Language-Action
(VLA) models in robotics. However, these approaches often suffer from high
computational cost, exposure bias, and unstable inference dynamics, which lead
to divergence under distribution shifts. Energy-Based Models (EBMs) address
these issues by learning energy landscapes end-to-end and modeling equilibrium
dynamics, offering improved robustness and reduced exposure bias. Yet, policies
parameterized by EBMs have historically struggled to scale effectively. Recent
work on Energy-Based Transformers (EBTs) demonstrates the scalability of EBMs
to high-dimensional spaces, but their potential for solving core challenges in
physically embodied models remains underexplored. We introduce a new
energy-based architecture, EBT-Policy, that solves core issues in robotic and
real-world settings. Across simulated and real-world tasks, EBT-Policy
consistently outperforms diffusion-based policies, while requiring less
training and inference computation. Remarkably, on some tasks it converges
within just two inference steps, a 50x reduction compared to Diffusion Policy's
100. Moreover, EBT-Policy exhibits emergent capabilities not seen in prior
models, such as zero-shot recovery from failed action sequences using only
behavior cloning and without explicit retry training. By leveraging its scalar
energy for uncertainty-aware inference and dynamic compute allocation,
EBT-Policy offers a promising path toward robust, generalizable robot behavior
under distribution shifts.
comment: 9 pages, 6 figures, 4 tables
☆ Preliminary Prototyping of Avoidance Behaviors Triggered by a User's Physical Approach to a Robot
Human-robot interaction frequently involves physical proximity or contact. In
human-human settings, people flexibly accept, reject, or tolerate such
approaches depending on the relationship and context. We explore the design of
a robot's rejective internal state and corresponding avoidance behaviors, such
as withdrawing or pushing away, when a person approaches. We model the
accumulation and decay of discomfort as a function of interpersonal distance,
and implement tolerance (endurance) and limit-exceeding avoidance driven by the
Dominance axis of the PAD affect model. The behaviors and their intensities are
realized on an arm robot. Results illustrate a coherent pipeline from internal
state parameters to graded endurance motions and, once a limit is crossed, to
avoidance actions.
comment: Workshop on Socially Aware and Cooperative Intelligent Systems in HAI
2025
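The accumulate-and-decay discomfort model above can be sketched in a few lines. The update law, gains, comfort radius, and limit below are illustrative assumptions; the paper additionally drives behavior intensity from the Dominance axis of the PAD model, which this sketch omits.

```python
def discomfort_step(level, distance, dt=0.1,
                    comfort_radius=0.6, gain=1.0, decay=0.5):
    """One update of a hypothetical discomfort state: discomfort
    accumulates while the person is inside the comfort radius (faster
    the closer they are) and decays exponentially otherwise."""
    if distance < comfort_radius:
        level += gain * (comfort_radius - distance) * dt
    else:
        level -= decay * level * dt
    return max(level, 0.0)

def behavior(level, limit=1.0):
    """Endure (tolerate) below the limit; avoid once it is crossed."""
    return "avoid" if level >= limit else "endure"

level = 0.0
# A person approaches to 0.2 m and stays there for 10 s.
for _ in range(100):
    level = discomfort_step(level, distance=0.2)
print(behavior(level))  # sustained close approach crosses the limit
```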
☆ Learning Soft Robotic Dynamics with Active Exploration
Soft robots offer unmatched adaptability and safety in unstructured
environments, yet their compliant, high-dimensional, and nonlinear dynamics
make modeling for control notoriously difficult. Existing data-driven
approaches often fail to generalize, constrained by narrowly focused task
demonstrations or inefficient random exploration. We introduce SoftAE, an
uncertainty-aware active exploration framework that autonomously learns
task-agnostic and generalizable dynamics models of soft robotic systems. SoftAE
employs probabilistic ensemble models to estimate epistemic uncertainty and
actively guides exploration toward underrepresented regions of the state-action
space, achieving efficient coverage of diverse behaviors without task-specific
supervision. We evaluate SoftAE on three simulated soft robotic platforms -- a
continuum arm, an articulated fish in fluid, and a musculoskeletal leg with
hybrid actuation -- and on a pneumatically actuated continuum soft arm in the
real world. Compared with random exploration and task-specific model-based
reinforcement learning, SoftAE produces more accurate dynamics models, enables
superior zero-shot control on unseen tasks, and maintains robustness under
sensing noise, actuation delays, and nonlinear material effects. These results
demonstrate that uncertainty-driven active exploration can yield scalable,
reusable dynamics models across diverse soft robotic morphologies, representing
a step toward more autonomous, adaptable, and data-efficient control in
compliant robots.
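The core loop of ensemble-based active exploration — score candidate actions by ensemble disagreement and pick the most uncertain one — can be sketched as below. The linear ensemble stand-in for the soft-robot dynamics and all parameter values are illustrative assumptions, not SoftAE's actual models.

```python
import numpy as np

rng = np.random.default_rng(1)

# A probabilistic ensemble is approximated here by K independently
# perturbed linear models of the (unknown) dynamics x' = f(x, u).
K, state_dim, act_dim = 5, 3, 2
base = np.hstack([np.eye(state_dim), np.zeros((state_dim, act_dim))])
ensemble = [base + 0.1 * rng.standard_normal((state_dim, state_dim + act_dim))
            for _ in range(K)]

def epistemic_uncertainty(x, u):
    """Disagreement (variance) across ensemble next-state predictions."""
    z = np.concatenate([x, u])
    preds = np.stack([W @ z for W in ensemble])
    return float(preds.var(axis=0).sum())

def pick_exploratory_action(x, candidates):
    """Active exploration: choose the candidate the ensemble disagrees
    about most, steering data collection toward underrepresented
    regions of the state-action space."""
    scores = [epistemic_uncertainty(x, u) for u in candidates]
    return candidates[int(np.argmax(scores))]

x = np.zeros(state_dim)
candidates = [rng.uniform(-1, 1, act_dim) for _ in range(16)]
u_star = pick_exploratory_action(x, candidates)
```

Executing `u_star`, observing the true next state, and retraining the ensemble closes the exploration loop.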
☆ Towards a Multi-Embodied Grasping Agent
Multi-embodiment grasping focuses on developing approaches that exhibit
generalist behavior across diverse gripper designs. Existing methods often
learn the kinematic structure of the robot implicitly and face challenges due
to the difficulty of sourcing the required large-scale data. In this work, we
present a data-efficient, flow-based, equivariant grasp synthesis architecture
that can handle different gripper types with variable degrees of freedom and
successfully exploit the underlying kinematic model, deducing all necessary
information solely from the gripper and scene geometry. Unlike previous
equivariant grasping methods, we reimplemented all modules from the ground up in
JAX and provide a model that batches over scenes, grippers, and grasps,
resulting in smoother learning, improved performance, and faster inference. Our
dataset encompasses grippers ranging from humanoid hands to parallel-jaw
grippers and includes 25,000 scenes and 20 million grasps.
comment: 9 pages, 3 figures
☆ Modified-Emergency Index (MEI): A Criticality Metric for Autonomous Driving in Lateral Conflict
Hao Cheng, Yanbo Jiang, Qingyuan Shi, Qingwen Meng, Keyu Chen, Wenhao Yu, Jianqiang Wang, Sifa Zheng
Effective, reliable, and efficient evaluation of autonomous driving safety is
essential to demonstrate its trustworthiness. Criticality metrics provide an
objective means of assessing safety. However, as existing metrics primarily
target longitudinal conflicts, accurately quantifying the risks of lateral
conflicts - prevalent in urban settings - remains challenging. This paper
proposes the Modified-Emergency Index (MEI), a metric designed to quantify
evasive effort in lateral conflicts. Compared to the original Emergency Index
(EI), MEI refines the estimation of the time available for evasive maneuvers,
enabling more precise risk quantification. We validate MEI on a public lateral
conflict dataset based on Argoverse-2, from which we extract over 1,500
high-quality AV conflict cases, including more than 500 critical events. MEI is
then compared with the well-established ACT and the widely used PET metrics.
Results show that MEI consistently outperforms them in accurately quantifying
criticality and capturing risk evolution. Overall, these findings highlight MEI
as a promising metric for evaluating urban conflicts and enhancing the safety
assessment framework for autonomous driving. The open-source implementation is
available at https://github.com/AutoChengh/MEI.
☆ A Modular and Scalable System Architecture for Heterogeneous UAV Swarms Using ROS 2 and PX4-Autopilot
In this paper, a modular and scalable architecture for heterogeneous swarm-based
Counter Unmanned Aerial Systems (C-UASs), built on the PX4-Autopilot and Robot
Operating System 2 (ROS 2) frameworks, is presented. The proposed architecture
emphasizes seamless integration of hardware components by introducing
independent ROS 2 nodes for each component of an Unmanned Aerial Vehicle (UAV).
Communication between swarm participants is abstracted in software, allowing
the use of various technologies without architectural changes. Key
functionalities, such as leader following and formation flight, are supported
to maneuver the swarm. The system also allows computer vision algorithms to be
integrated for the detection and tracking of UAVs. Additionally, a ground
control station is integrated for the coordination of swarm operations. The
swarm-based Unmanned Aerial System (UAS) architecture is verified both in a
Gazebo simulation environment and in real-world demonstrations.
☆ Vectorized Online POMDP Planning ICRA 2026
Planning under partial observability is an essential capability of autonomous
robots. The Partially Observable Markov Decision Process (POMDP) provides a
powerful framework for planning under partial observability problems, capturing
the stochastic effects of actions and the limited information available through
noisy observations. POMDP solving could benefit tremendously from massive
parallelization of today's hardware, but parallelizing POMDP solvers has been
challenging. They rely on interleaving numerical optimization over actions with
the estimation of their values, which creates dependencies and synchronization
bottlenecks between parallel processes that can quickly offset the benefits of
parallelization. In this paper, we propose Vectorized Online POMDP Planner
(VOPP), a novel parallel online solver that leverages a recent POMDP
formulation that analytically solves part of the optimization component,
leaving only the estimation of expectations for numerical computation. VOPP
represents all data structures related to planning as a collection of tensors
and implements all planning steps as fully vectorized computations over this
representation. The result is a massively parallel solver with no dependencies
and synchronization bottlenecks between parallel computations. Experimental
results indicate that VOPP is at least 20X more efficient in computing
near-optimal solutions compared to an existing state-of-the-art parallel online
solver.
comment: 8 pages, 3 figures. Submitted to ICRA 2026
☆ Hybrid Gripper Finger Enabling In-Grasp Friction Modulation Using Inflatable Silicone Pockets ICRA 2026
Hoang Hiep Ly, Cong-Nhat Nguyen, Doan-Quang Tran, Quoc-Khanh Dang, Ngoc Duy Tran, Thi Thoa Mac, Anh Nguyen, Xuan-Thuan Nguyen, Tung D. Ta
Grasping objects with diverse mechanical properties, such as heavy, slippery,
or fragile items, remains a significant challenge in robotics. Conventional
grippers often rely on applying high normal forces, which can cause damage to
objects. To address this limitation, we present a hybrid gripper finger that
combines a rigid structural shell with a soft, inflatable silicone pocket. The
gripper finger can actively modulate its surface friction by controlling the
internal air pressure of the silicone pocket. Results from fundamental
experiments indicate that increasing the internal pressure results in a
proportional increase in the effective coefficient of friction. This enables
the gripper to stably lift heavy and slippery objects without increasing the
gripping force and to handle fragile or deformable objects, such as eggs,
fruits, and paper cups, with minimal damage by increasing friction rather than
applying excessive force. The experimental results demonstrate that the hybrid
gripper finger with adaptable friction provides a robust and safer alternative
to relying solely on high normal forces, thereby enhancing the gripper
flexibility in handling delicate, fragile, and diverse objects.
comment: Submitted to ICRA 2026
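A simple force-balance calculation shows why modulating friction, rather than grip force, helps with heavy or slippery objects: for a two-finger grasp, slipping is avoided when 2·μ·F_n ≥ m·g, so raising μ lowers the normal force each finger must apply. The linear μ(pressure) law below is an illustrative assumption standing in for the proportional relation the paper reports, not its measured calibration.

```python
G = 9.81  # gravitational acceleration, m/s^2

def min_normal_force(mass_kg, mu):
    """Smallest per-finger normal force that prevents slip in a
    two-finger grasp: solve 2 * mu * F_n = m * g for F_n."""
    return mass_kg * G / (2.0 * mu)

def mu_from_pressure(p_kpa, mu0=0.3, k=0.01):
    """Hypothetical proportional friction increase with pocket pressure."""
    return mu0 + k * p_kpa

m = 0.5  # a 0.5 kg slippery object
f_deflated = min_normal_force(m, mu_from_pressure(0.0))    # pocket deflated
f_inflated = min_normal_force(m, mu_from_pressure(30.0))   # pocket inflated
```

With these assumed numbers, inflating the pocket doubles μ and halves the required normal force, which is exactly the mechanism that lets fragile objects be held without excessive squeezing.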
☆ MobiDock: Design and Control of A Modular Self Reconfigurable Bimanual Mobile Manipulator via Robotic Docking ICRA2026
Xuan-Thuan Nguyen, Khac Nam Nguyen, Ngoc Duy Tran, Thi Thoa Mac, Anh Nguyen, Hoang Hiep Ly, Tung D. Ta
Multi-robot systems, particularly mobile manipulators, face challenges in
control coordination and dynamic stability when working together. To address
this issue, this study proposes MobiDock, a modular self-reconfigurable mobile
manipulator system that allows two independent robots to physically connect and
form a unified mobile bimanual platform. This process helps transform a complex
multi-robot control problem into the management of a simpler, single system.
The system utilizes an autonomous docking strategy based on computer vision
with AprilTag markers and a new threaded screw-lock mechanism. Experimental
results show that the docked configuration demonstrates better performance in
dynamic stability and operational efficiency compared to two independently
cooperating robots. Specifically, the unified system has lower Root Mean Square
(RMS) Acceleration and Jerk values, higher angular precision, and completes
tasks significantly faster. These findings confirm that physical
reconfiguration is a powerful design principle that simplifies cooperative
control, improving stability and performance for complex tasks in real-world
environments.
comment: Submitted to ICRA 2026
☆ Confined Space Underwater Positioning Using Collaborative Robots
Positioning of underwater robots in confined and cluttered spaces remains a
key challenge for field operations. Existing systems are mostly designed for
large, open-water environments and struggle in industrial settings due to poor
coverage, reliance on external infrastructure, and the need for feature-rich
surroundings. Multipath effects from continuous sound reflections further
degrade signal quality, reducing accuracy and reliability. Accurate and easily
deployable positioning is essential for repeatable autonomous missions;
however, this requirement has created a technological bottleneck limiting
underwater robotic deployment. This paper presents the Collaborative Aquatic
Positioning (CAP) system, which integrates collaborative robotics and sensor
fusion to overcome these limitations. Inspired by the "mother-ship" concept,
the surface vehicle acts as a mobile leader to assist in positioning a
submerged robot, enabling localization even in GPS-denied and highly
constrained environments. The system is validated in a large test tank through
repeatable autonomous missions using CAP's position estimates for real-time
trajectory control. Experimental results demonstrate a mean Euclidean distance
(MED) error of 70 mm, achieved in real time without requiring fixed
infrastructure, extensive calibration, or environmental features. CAP leverages
advances in mobile robot sensing and leader-follower control to deliver a step
change in accurate, practical, and infrastructure-free underwater localization.
comment: 31 pages including appendix, 24 figures
☆ WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond
3D Gaussian splatting (3DGS) and its subsequent variants have led to
remarkable progress in simultaneous localization and mapping (SLAM). While most
recent 3DGS-based SLAM works focus on small-scale indoor scenes, developing
3DGS-based SLAM methods for large-scale forest scenes holds great potential for
many real-world applications, especially for wildfire emergency response and
forest management. However, this line of research is impeded by the absence of
a comprehensive and high-quality dataset, and collecting such a dataset over
real-world scenes is costly and technically infeasible. To this end, we have
built a large-scale, comprehensive, and high-quality synthetic dataset for SLAM
in wildfire and forest environments. Leveraging the Unreal Engine 5 Electric
Dreams Environment Sample Project, we developed a pipeline to easily collect
aerial and ground views, including ground-truth camera poses and a range of
additional data modalities from an unmanned aerial vehicle. Our pipeline also
provides flexible controls on environmental factors such as light, weather, and
types and conditions of wildfire, supporting the need for various tasks
covering forest mapping, wildfire emergency response, and beyond. The resulting
pilot dataset, WildfireX-SLAM, contains 5.5k low-altitude RGB-D aerial images
from a large-scale forest map with a total size of 16 km². On top of
WildfireX-SLAM, a thorough benchmark is also conducted, which not only reveals
the unique challenges of 3DGS-based SLAM in the forest but also highlights
potential improvements for future works. The dataset and code will be publicly
available. Project page: https://zhicongsun.github.io/wildfirexslam.
comment: This paper has been accepted by MMM 2026
☆ Learning Generalizable Visuomotor Policy through Dynamics-Alignment
Dohyeok Lee, Jung Min Lee, Munkyung Kim, Seokhun Ju, Jin Woo Koo, Kyungjae Lee, Dohyeong Kim, TaeHyun Cho, Jungwoo Lee
Behavior cloning methods for robot learning suffer from poor generalization
due to limited data support beyond expert demonstrations. Recent approaches
leveraging video prediction models have shown promising results by learning
rich spatiotemporal representations from large-scale datasets. However, these
models learn action-agnostic dynamics that cannot distinguish between different
control inputs, limiting their utility for precise manipulation tasks and
requiring large pretraining datasets. We propose a Dynamics-Aligned Flow
Matching Policy (DAP) that integrates dynamics prediction into policy learning.
Our method introduces a novel architecture where policy and dynamics models
provide mutual corrective feedback during action generation, enabling
self-correction and improved generalization. Empirical validation demonstrates
generalization performance superior to baseline methods on real-world robotic
manipulation tasks, showing particular robustness in OOD scenarios including
visual distractions and lighting variations.
comment: 9 pages, 6 figures
♻ ☆ RObotic MAnipulation Network (ROMAN) -- Hybrid Hierarchical Learning for Solving Complex Sequential Tasks
Solving long sequential tasks poses a significant challenge in embodied
artificial intelligence. Enabling a robotic system to perform diverse
sequential tasks with a broad range of manipulation skills is an active area of
research. In this work, we present a Hybrid Hierarchical Learning framework,
the Robotic Manipulation Network (ROMAN), to address the challenge of solving
multiple complex tasks over long time horizons in robotic manipulation. ROMAN
achieves task versatility and robust failure recovery by integrating
behavioural cloning, imitation learning, and reinforcement learning. It
consists of a central manipulation network that coordinates an ensemble of
various neural networks, each specialising in distinct re-combinable sub-tasks
to generate their correct in-sequence actions for solving complex long-horizon
manipulation tasks. Experimental results show that by orchestrating and
activating these specialised manipulation experts, ROMAN generates correct
sequential activations for accomplishing long sequences of sophisticated
manipulation tasks and achieving adaptive behaviours beyond demonstrations,
while exhibiting robustness to various sensory noises. These results
demonstrate the significance and versatility of ROMAN's dynamic adaptability
featuring autonomous failure recovery capabilities, and highlight its potential
for various autonomous manipulation tasks that demand adaptive motor skills.
comment: To appear in Nature Machine Intelligence. Includes the main and
supplementary manuscript. 70 pages in total, with 9 figures and 17 tables
♻ ☆ GenSwarm: Scalable Multi-Robot Code-Policy Generation and Deployment via Language Models
Wenkang Ji, Huaben Chen, Mingyang Chen, Guobin Zhu, Lufeng Xu, Roderich Groß, Rui Zhou, Ming Cao, Shiyu Zhao
The development of control policies for multi-robot systems traditionally
follows a complex and labor-intensive process, often lacking the flexibility to
adapt to dynamic tasks. This has motivated research on methods to automatically
create control policies. However, these methods require iterative processes of
manually crafting and refining objective functions, thereby prolonging the
development cycle. This work introduces \textit{GenSwarm}, an end-to-end system
that leverages large language models to automatically generate and deploy
control policies for multi-robot tasks based on simple user instructions in
natural language. As a multi-language-agent system, GenSwarm achieves zero-shot
learning, enabling rapid adaptation to altered or unseen tasks. The white-box
nature of the code policies ensures strong reproducibility and
interpretability. With its scalable software and hardware architectures,
GenSwarm supports efficient policy deployment on both simulated and real-world
multi-robot systems, realizing an instruction-to-execution end-to-end
functionality that could prove valuable for robotics specialists and
non-specialists alike. The code of the proposed GenSwarm system is available
online: https://github.com/WindyLab/GenSwarm.
comment: This article has been accepted for publication in npj Robotics
♻ ☆ A Study on Human-Swarm Interaction: A Framework for Assessing Situation Awareness and Task Performance
This paper introduces a framework for human-swarm interaction studies that
measures situation awareness in dynamic environments. A tablet-based interface
implementing the framework's concepts was developed for a user study in which
operators guided a robotic swarm in a single-target search task, marking
hazardous cells unknown to the swarm. Both subjective and objective situation
awareness measures were used, with task performance evaluated based on how
close the robots were to the target. The framework enabled a structured
investigation of the role of situation awareness in human-swarm interaction
and led to several key findings: task performance improved across attempts,
showing the interface was learnable; the centroid of active robot positions
proved to be a useful task performance metric for assessing situation
awareness; perception and projection played a key role in task performance,
highlighting their importance in interface design; and objective situation
awareness influenced both subjective situation awareness and task performance,
underscoring the need for interfaces that promote objective situation
awareness. These findings validate our framework as a structured approach for
integrating situation awareness concepts into human-swarm interaction studies,
offering a systematic way to assess situation awareness and task performance.
The framework can be applied to other swarm studies to evaluate interface
learnability, identify meaningful task performance metrics, and refine
interface designs to enhance situation awareness, ultimately improving
human-swarm interaction in dynamic environments.
comment: 10 pages, 8 figures, 2 tables, 2 equations
♻ ☆ Uncertainty-Based Smooth Policy Regularisation for Reinforcement Learning with Few Demonstrations
In reinforcement learning with sparse rewards, demonstrations can accelerate
learning, but determining when to imitate them remains challenging. We propose
Smooth Policy Regularisation from Demonstrations (SPReD), a framework that
addresses the fundamental question: when should an agent imitate a
demonstration versus follow its own policy? SPReD uses ensemble methods to
explicitly model Q-value distributions for both demonstration and policy
actions, quantifying uncertainty for comparisons. We develop two complementary
uncertainty-aware methods: a probabilistic approach estimating the likelihood
of demonstration superiority, and an advantage-based approach scaling imitation
by statistical significance. Unlike prevailing methods (e.g. Q-filter) that
make binary imitation decisions, SPReD applies continuous,
uncertainty-proportional regularisation weights, reducing gradient variance
during training. Despite its computational simplicity, SPReD achieves
remarkable gains in experiments across eight robotics tasks, outperforming
existing approaches by up to a factor of 14 in complex tasks while maintaining
robustness to demonstration quality and quantity. Our code is available at
https://github.com/YujieZhu7/SPReD.
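The contrast with a binary Q-filter can be made concrete: SPReD's probabilistic variant weights the imitation term by the estimated probability that the demonstration action's Q-value exceeds the policy action's, computed from ensemble samples. The sketch below is a minimal numpy illustration of that idea; the sample counts and the pairwise-comparison estimator are assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(2)

def imitation_weight(q_demo_samples, q_pi_samples):
    """Continuous, uncertainty-proportional imitation weight.

    Instead of a binary Q-filter (imitate iff mean Q_demo > mean Q_pi),
    estimate P(Q_demo > Q_pi) from all pairs of ensemble samples and use
    that probability to scale the behaviour-cloning term."""
    diff = q_demo_samples[:, None] - q_pi_samples[None, :]
    return float((diff > 0).mean())

# Ensemble Q-estimates for a demonstration action vs. the policy action.
q_demo = 1.0 + 0.5 * rng.standard_normal(10)
q_pi = 0.8 + 0.5 * rng.standard_normal(10)

w = imitation_weight(q_demo, q_pi)  # lies in [0, 1]
bc_loss = 2.3                       # stand-in behaviour-cloning loss
regularised = w * bc_loss           # smooth, not all-or-nothing
```

When the ensembles overlap heavily (high uncertainty), `w` stays near 0.5 and the gradient contribution is damped, which is the variance-reduction effect the abstract describes.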
♻ ☆ A Tactile Feedback Approach to Path Recovery after High-Speed Impacts for Collision-Resilient Drones
Aerial robots are a well-established solution for exploration, monitoring,
and inspection, thanks to their superior maneuverability and agility. However,
in many environments, they risk crashing and sustaining damage after
collisions. Traditional methods focus on avoiding obstacles entirely, but these
approaches can be limiting, particularly in cluttered spaces or on weight- and
compute-constrained platforms such as drones. This paper presents a novel
approach to enhance drone robustness and autonomy by developing a path recovery
and adjustment method for a high-speed collision-resilient aerial robot
equipped with lightweight, distributed tactile sensors. The proposed system
explicitly models collisions using pre-collision velocities, rates and tactile
feedback to predict post-collision dynamics, improving state estimation
accuracy. Additionally, we introduce a computationally efficient
vector-field-based path representation that guarantees convergence to a
user-specified path, while naturally avoiding known obstacles. Post-collision,
contact point locations are incorporated into the vector field as a repulsive
potential, enabling the drone to avoid obstacles while naturally returning to
its path. The effectiveness of this method is validated through Monte Carlo
simulations and demonstrated on a physical prototype, showing successful path
following, collision recovery, and adjustment at speeds up to 3.7 m/s.
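The vector-field path representation with contact-point repulsion can be sketched for a planar case: a convergent field pulls the drone back to a reference path while a repulsive potential around each sensed contact point bends the field away from the obstacle. The straight-line path, the specific field and repulsion laws, and all gains below are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def path_vector_field(p, contact_points=(), k_conv=1.0, k_rep=0.5, r=0.4):
    """Unit velocity command that converges to the line y = 0 (the
    reference path) while flowing along +x, plus a short-range
    repulsive term around each recorded contact point."""
    v = np.array([1.0, -k_conv * p[1]])  # advance along path, pull toward it
    for c in contact_points:
        d = p - np.asarray(c, dtype=float)
        dist = np.linalg.norm(d)
        if 1e-9 < dist < r:              # repel only inside radius r
            v += k_rep * (1.0 / dist - 1.0 / r) * d / dist
    return v / np.linalg.norm(v)

# Away from contacts, the field simply steers back toward y = 0 ...
v_free = path_vector_field(np.array([0.0, 1.0]))
# ... while near a contact registered after a collision, the repulsive
# term deflects the command away from the obstacle.
v_near = path_vector_field(np.array([2.0, -0.05]),
                           contact_points=[(2.0, 0.1)])
```

Because the repulsion has bounded range, the field reverts to pure path convergence once the drone clears the contact region, giving the "avoid, then naturally return" behavior described above.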
♻ ☆ SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
Sheng Yin, Xianghe Pang, Yuanzhuo Ding, Menglan Chen, Yutong Bi, Yichen Xiong, Wenhao Huang, Zhen Xiang, Jing Shao, Siheng Chen
With the integration of large language models (LLMs), embodied agents have
strong capabilities to understand and plan complicated natural language
instructions. However, a foreseeable issue is that those embodied agents can
also flawlessly execute some hazardous tasks, potentially causing damages in
the real world. Existing benchmarks predominantly overlook critical safety
risks, focusing solely on planning performance, while a few evaluate LLMs'
safety awareness only on non-interactive image-text data. To address this gap,
we present SafeAgentBench -- the first comprehensive benchmark for safety-aware
task planning of embodied LLM agents in interactive simulation environments,
covering both explicit and implicit hazards. SafeAgentBench includes: (1) an
executable, diverse, and high-quality dataset of 750 tasks, rigorously curated
to cover 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal
embodied environment with a low-level controller, supporting multi-agent
execution with 17 high-level actions for 9 state-of-the-art baselines; and (3)
reliable evaluation methods from both execution and semantic perspectives.
Experimental results show that, although agents based on different design
frameworks exhibit substantial differences in task success rates, their overall
safety awareness remains weak. The most safety-conscious baseline achieves only
a 10% rejection rate for detailed hazardous tasks. Moreover, simply replacing
the LLM driving the agent does not lead to notable improvements in safety
awareness. The dataset and code are available at
https://github.com/shengyin1224/SafeAgentBench and
https://huggingface.co/datasets/safeagentbench/SafeAgentBench.
comment: 28 pages, 19 tables, 15 figures
♻ ☆ From Canada to Japan: How 10,000 km Affect User Perception in Robot Teleoperation
Siméon Capy, Thomas M. Kwok, Kevin Joseph, Yuichiro Kawasumi, Koichi Nagashima, Tomoya Sasaki, Yue Hu, Eiichi Yoshida
Robot teleoperation (RTo) has emerged as a viable alternative to local
control, particularly when human intervention is still necessary. This research
aims to study the distance effect on user perception in RTo, exploring the
potential of teleoperated robots for older adult care. We propose an evaluation
of non-expert users' perception of long-distance RTo, examining how their
perception changes before and after interaction, as well as comparing it to
that of locally operated robots. We have designed a specific protocol
consisting of multiple questionnaires, along with a dedicated software
architecture using the Robotics Operating System (ROS) and Unity. The results
revealed no statistically significant differences between the local and remote
robot conditions, suggesting that remotely operated robots may be a viable
alternative to traditional local control.
comment: Author preprint - Accepted for Humanoids 2025
♻ ☆ A Practical-Driven Framework for Transitioning Drive-by-Wire to Autonomous Driving Systems: A Case Study with a Chrysler Pacifica Hybrid Vehicle
Transitioning from a Drive-by-Wire (DBW) system to a fully autonomous driving
system (ADS) involves multiple stages of development and demands robust
positioning and sensing capabilities. This paper presents a practice-driven
framework for facilitating the DBW-to-ADS transition using a 2022 Chrysler
Pacifica Hybrid Minivan equipped with cameras, LiDAR, GNSS, and onboard
computing hardware configured with the Robot Operating System (ROS) and
Autoware.AI. The implementation showcases offline autonomous operations
utilizing pre-recorded LiDAR and camera data, point clouds, and vector maps,
enabling effective localization and path planning within a structured test
environment. The study addresses key challenges encountered during the
transition, particularly those related to wireless-network-assisted sensing and
positioning. It offers practical solutions for overcoming software
incompatibility constraints, sensor synchronization issues, and limitations in
real-time perception. Furthermore, the integration of sensing, data fusion, and
automation is emphasized as a critical factor in supporting autonomous driving
systems in map generation, simulation, and training. Overall, the transition
process outlined in this work aims to provide actionable strategies for
researchers pursuing DBW-to-ADS conversion. It offers direction for
incorporating real-time perception, GNSS-LiDAR-camera integration, and fully
ADS-equipped autonomous vehicle operations, thus contributing to the
advancement of robust autonomous vehicle technologies.
comment: This updated version includes further implementation details and
experimental validation. Accepted for presentation at The 22nd International
Conference on Automation Technology (AUTOMATION 2025), Taipei, Taiwan,
November 2025
♻ ☆ Panoramic Out-of-Distribution Segmentation for Autonomous Driving
Panoramic imaging captures 360° images with an ultra-wide Field-of-View (FoV)
for dense omnidirectional perception, which is critical to applications such as
autonomous driving and augmented reality. However,
current panoramic semantic segmentation methods fail to identify outliers, and
pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily
in the panoramic domain due to background clutter and pixel distortions. To
address these issues, we introduce a new task, Panoramic Out-of-distribution
Segmentation (PanOoS), with the aim of achieving comprehensive and safe scene
understanding. Furthermore, we propose the first solution, POS, which adapts to
the characteristics of panoramic images through text-guided prompt distribution
learning. Specifically, POS integrates a disentanglement strategy designed to
materialize the cross-domain generalization capability of CLIP. The proposed
Prompt-based Restoration Attention (PRA) optimizes semantic decoding by prompt
guidance and self-adaptive correction, while Bilevel Prompt Distribution
Learning (BPDL) refines the manifold of per-pixel mask embeddings via semantic
prototype supervision. Besides, to compensate for the scarcity of PanOoS
datasets, we establish two benchmarks: DenseOoS, which features diverse
outliers in complex environments, and QuadOoS, captured by a quadruped robot
with a panoramic annular lens system. Extensive experiments demonstrate
superior performance of POS, with AuPRC improving by 34.25% and FPR95
decreasing by 21.42% on DenseOoS, outperforming state-of-the-art pinhole-OoS
methods. Moreover, POS achieves leading closed-set segmentation capabilities
and advances the development of panoramic understanding. Code and datasets will
be available at https://github.com/MengfeiD/PanOoS.
comment: Code and datasets will be available at
https://github.com/MengfeiD/PanOoS
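The generic idea behind out-of-distribution segmentation scoring, which metrics like AuPRC and FPR95 are computed over, can be sketched per pixel. The snippet below uses the standard max-softmax heuristic as an illustration; it is not the paper's POS model, and the class count and image size are arbitrary.

```python
import numpy as np

def ood_score(logits: np.ndarray) -> np.ndarray:
    """logits: (C, H, W) class scores -> (H, W) outlier scores in [0, 1].

    A high score means low confidence in every known class, i.e. a
    likely outlier pixel.
    """
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=0, keepdims=True)
    return 1.0 - probs.max(axis=0)

rng = np.random.default_rng(0)
logits = rng.normal(size=(19, 4, 4))   # 19 classes, tiny 4x4 "panorama"
logits[3, 0, 0] = 10.0                 # one confidently in-distribution pixel
scores = ood_score(logits)
print(scores.shape)        # (4, 4)
print(scores[0, 0] < 0.1)  # confident pixel gets a low outlier score
```

Thresholding such a score map yields the outlier mask that AuPRC and FPR95 evaluate against ground-truth anomaly annotations.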
♻ ☆ Sim2Real Diffusion: Leveraging Foundation Vision Language Models for Adaptive Automated Driving
Simulation-based design, optimization, and validation of autonomous vehicles
have proven to be crucial for their improvement over the years. Nevertheless,
the ultimate measure of effectiveness is their successful transition from
simulation to reality (sim2real). However, existing sim2real transfer methods
struggle to address the autonomy-oriented requirements of balancing: (i)
conditioned domain adaptation, (ii) robust performance with limited examples,
(iii) modularity in handling multiple domain representations, and (iv)
real-time performance. To alleviate these pain points, we present a unified
framework for learning cross-domain adaptive representations through
conditional latent diffusion for sim2real transferable automated driving. Our
framework offers options to leverage: (i) alternate foundation models, (ii) a
few-shot fine-tuning pipeline, and (iii) textual as well as image prompts for
mapping across given source and target domains. It is also capable of
generating diverse high-quality samples when diffusing across parameter spaces
such as times of day, weather conditions, seasons, and operational design
domains. We systematically analyze the presented framework and report our
findings in terms of performance benchmarks and ablation studies. Additionally,
we demonstrate its serviceability for autonomous driving using behavioral
cloning case studies. Our experiments indicate that the proposed framework is
capable of bridging the perceptual sim2real gap by over 40%.
comment: Accepted in IEEE Robotics and Automation Letters (RA-L)
♻ ☆ Faster Model Predictive Control via Self-Supervised Initialization Learning
Model Predictive Control (MPC) is widely used in robot control by optimizing
a sequence of control outputs over a finite horizon. Computational approaches
for MPC include deterministic methods (e.g., iLQR and COBYLA), as well as
sampling-based methods (e.g., MPPI and CEM). However, complex system dynamics
and non-convex or non-differentiable cost terms often lead to prohibitive
optimization times that limit real-world deployment. Prior efforts to
accelerate MPC have two limitations: (i) reusing previous solutions fails under
sharp state changes, and (ii) pure imitation learning does not target compute
efficiency directly and suffers from suboptimality in the training data. To
address these issues, we propose a warm-start framework that learns a policy to
generate high-quality initial guesses for the MPC solver. The policy is first
trained via behavior cloning from expert MPC rollouts and then fine-tuned
online with reinforcement learning to directly minimize MPC optimization time.
We empirically validate that our approach improves both deterministic and
sampling-based MPC methods, achieving up to 21.6% faster optimization and
34.1% higher tracking accuracy for deterministic MPC in a Formula 1 track
path-tracking domain, and improving safety by 100%, path efficiency by 12.8%,
and steering smoothness by 7.2% for sampling-based MPC in an obstacle-rich
navigation domain.
These results demonstrate that our framework not only accelerates MPC but also
improves overall control performance. Furthermore, it can be applied to a
broader range of control algorithms that benefit from good initial guesses.
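The core mechanism of the framework above, supplying a solver with a learned initial guess so it converges in fewer iterations, can be illustrated on a toy problem. Here a simple quadratic cost and gradient descent stand in for the MPC cost and solver, and the near-optimal "policy" guess is hard-coded; none of this is the paper's actual setup.

```python
import numpy as np

def solve_mpc(cost_grad, u0, lr=0.1, tol=1e-6, max_iter=10_000):
    """Gradient descent standing in for an MPC solver.

    Returns the solution and the number of iterations it took, so we can
    compare cold versus warm initialization.
    """
    u = u0.copy()
    for i in range(max_iter):
        g = cost_grad(u)
        if np.linalg.norm(g) < tol:
            return u, i
        u -= lr * g
    return u, max_iter

target = np.array([1.0, -2.0, 0.5])      # optimal control sequence
cost_grad = lambda u: u - target          # gradient of 0.5 * ||u - target||^2

cold_start = np.zeros(3)                  # naive zero initialization
warm_start = target + 0.01                # learned "policy" guess near optimum

_, iters_cold = solve_mpc(cost_grad, cold_start)
_, iters_warm = solve_mpc(cost_grad, warm_start)
print(iters_warm < iters_cold)  # True: warm start converges faster
```

In the paper's framework the warm-start guess comes from a policy trained with behavior cloning and then fine-tuned with reinforcement learning to minimize solve time directly, rather than from a fixed offset as in this sketch.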