
Manual2Skill++: Connector-Aware General Robotic Assembly from Instruction Manuals via Vision-Language Models

Abstract

Assembly hinges on reliably forming connections between parts, yet most robotic approaches plan assembly sequences and part poses while treating connectors as an afterthought. Connections are the critical "last mile" of assembly execution: task planning may sequence operations and motion planning may position parts, but the precise establishment of physical connections ultimately determines assembly success or failure. In this paper, we treat connections as first-class primitives in assembly representation, capturing connector types, specifications, quantities, and placement locations. Drawing inspiration from how humans learn assembly tasks through step-by-step instruction manuals, we present Manual2Skill++, a vision-language framework that automatically extracts structured connection information from assembly manuals. We encode assembly tasks as hierarchical graphs whose nodes represent parts and sub-assemblies and whose edges explicitly model connection relationships between components. A large-scale vision-language model parses the symbolic diagrams and annotations in manuals to instantiate these graphs, leveraging the rich connection knowledge embedded in human-designed instructions. We curate a dataset of over 20 assembly tasks with diverse connector types to validate our representation-extraction approach, and we evaluate the complete understanding-to-execution pipeline across four complex assembly scenarios in simulation, spanning furniture, toys, and manufacturing components with real-world correspondence.
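To make the representation concrete, below is a minimal sketch of such a hierarchical, connector-aware assembly graph. This is an illustrative reading of the abstract, not the paper's actual API: every class, field, and example value here is an assumption.

```python
# Minimal sketch of a connector-aware assembly graph, assuming the
# hierarchical representation described in the abstract. All names and
# values are illustrative, not taken from the paper.
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Edge attribute: an explicit, first-class connection primitive."""
    connector_type: str   # e.g. "cam lock", "wood dowel", "M4 screw"
    specification: str    # e.g. a size or part number read from the manual
    quantity: int         # how many connectors this connection uses
    placement: list[tuple[float, float, float]]  # placement locations on the parts


@dataclass
class Node:
    """A part or a sub-assembly (sub-assemblies make the graph hierarchical)."""
    name: str
    children: list["Node"] = field(default_factory=list)  # empty for atomic parts


@dataclass
class AssemblyGraph:
    nodes: dict[str, Node] = field(default_factory=dict)
    # Edges keyed by the pair of component names they connect.
    edges: dict[tuple[str, str], Connection] = field(default_factory=dict)

    def connect(self, a: str, b: str, conn: Connection) -> None:
        self.edges[(a, b)] = conn


# Example: one manual step joining a shelf side panel to the top panel
# with two cam locks (hypothetical values).
graph = AssemblyGraph()
graph.nodes["side_panel"] = Node("side_panel")
graph.nodes["top_panel"] = Node("top_panel")
graph.connect(
    "side_panel", "top_panel",
    Connection("cam lock", "15mm cam + bolt", 2,
               placement=[(0.02, 0.0, 0.35), (0.02, 0.0, 0.05)]),
)
```

The point of the structure is that the connection itself carries the type, specification, quantity, and placement data, rather than leaving connectors implicit in part poses.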
Authors (12)
Chenrui Tie
Shengxiang Sun
Yudi Lin
Yanbo Wang
Zhongrui Li
Zhouhan Zhong
+6 more
Submitted
October 18, 2025
arXiv Category
cs.RO

Key Contributions

Presents Manual2Skill++, a vision-language framework that enables general robotic assembly by treating connections between parts as first-class primitives. It automatically extracts structured connector information from instruction manuals and encodes assembly tasks as hierarchical graphs, improving reliability in the critical "last mile" of assembly execution.
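As a rough illustration of the extraction step, the sketch below prompts a vision-language model for structured connection records and validates them before they would become graph edges. The JSON schema, prompt text, and `call_vlm` helper are hypothetical stand-ins; the paper does not specify this interface.

```python
# Hedged sketch of manual parsing: ask a vision-language model for
# structured connection info from one manual page. The schema and the
# call_vlm() helper are assumptions, not the paper's pipeline.
import json


def call_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for whatever VLM endpoint is used."""
    raise NotImplementedError("wire this to your model of choice")


PROMPT = (
    "List every connection shown in this assembly step as a JSON array of "
    "objects with fields: part_a, part_b, connector_type, specification, quantity."
)


def extract_connections(page_image: str) -> list[dict]:
    raw = call_vlm(page_image, PROMPT)
    records = json.loads(raw)
    # Basic validation before instantiating graph edges.
    required = {"part_a", "part_b", "connector_type", "specification", "quantity"}
    return [r for r in records if required <= r.keys()]
```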

Business Value

Enables more flexible and automated manufacturing processes, reducing reliance on manual labor for complex assembly tasks. Improves product quality and consistency.