📄 Abstract

Instruction-based image editing enables precise modifications via natural
language prompts, but existing methods face a precision-efficiency tradeoff:
fine-tuning demands massive datasets (>10M samples) and substantial computational resources, while
training-free approaches suffer from weak instruction comprehension. We address
this by proposing ICEdit, which leverages the inherent comprehension and
generation abilities of large-scale Diffusion Transformers (DiTs) through three
key innovations: (1) An in-context editing paradigm without architectural
modifications; (2) Minimal parameter-efficient fine-tuning for quality
improvement; (3) Early Filter Inference-Time Scaling, which uses vision-language models (VLMs) to select
high-quality noise samples for efficiency. Experiments show that ICEdit
achieves state-of-the-art editing performance with only 0.1% of the training
data and 1% of the trainable parameters required by previous methods. Our
approach establishes a new paradigm for balancing precision and efficiency in
instruction-based image editing. Code and demos are available at
https://river-zhang.github.io/ICEdit-gh-pages/.
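
To make the third innovation concrete, below is a minimal sketch of an early-filter inference-time scaling loop: several noise seeds are denoised for only a few steps, a VLM scores the cheap previews against the instruction, and only the best seed receives the full denoising budget. This is an illustrative assumption of how such a filter can be wired up, not the ICEdit implementation; the callables `edit_partial`, `edit_full`, and `vlm_score` are hypothetical placeholders.

```python
"""Sketch of VLM-based early filtering over candidate noise seeds.

Assumptions: `edit_partial`, `edit_full`, and `vlm_score` are hypothetical
stand-ins for a diffusion editing pipeline and a VLM scorer; they are not
part of the ICEdit codebase.
"""
import random
from typing import Any, Callable, List, Tuple


def early_filter_edit(
    image: Any,                      # source image to edit
    instruction: str,                # natural-language edit instruction
    edit_partial: Callable,          # runs only the first few denoising steps for a seed
    edit_full: Callable,             # runs the full denoising trajectory for a seed
    vlm_score: Callable,             # scores an early preview against the instruction
    num_seeds: int = 8,              # number of candidate noise samples
    early_steps: int = 4,            # denoising steps used for each cheap preview
) -> Any:
    """Pick the most promising noise seed early, then spend full compute on it."""
    seeds: List[int] = [random.randrange(2**31) for _ in range(num_seeds)]

    # 1) Cheap pass: a handful of denoising steps per candidate seed.
    previews: List[Tuple[int, Any]] = [
        (seed, edit_partial(image, instruction, seed=seed, steps=early_steps))
        for seed in seeds
    ]

    # 2) Let the VLM judge which early preview best follows the instruction.
    best_seed, _ = max(previews, key=lambda pair: vlm_score(pair[1], instruction))

    # 3) Full denoising is paid only once, for the selected seed.
    return edit_full(image, instruction, seed=best_seed)
```

The design point is that instruction-following quality is largely decided by the initial noise, so filtering on inexpensive early previews avoids running the full sampling trajectory for every candidate.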