Step1X-Edit v1p2
Step1X-Edit v1p2, developed by StepFun, is a state-of-the-art open-source image editing model designed to compete with closed-source models such as GPT-4o and Gemini Flash. It uses a Multimodal Large Language Model (MLLM) to interpret complex text instructions and pairs it with a Diffusion Transformer (DiT) decoder that produces high-fidelity, region-precise edits. The v1p2 iteration adds native reasoning to the editing architecture: rather than applying a single-pass pixel change, the model works in a thinking-editing-reflection loop. The "thinking" phase draws on world knowledge to interpret abstract requests without manual masking, while the "reflection" stage audits the output, reverts unintended changes, and preserves subject identity.
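
The thinking-editing-reflection loop described above can be illustrated with a small control-flow sketch. This is not the Step1X-Edit API or its actual implementation: the function `run_edit_loop`, the `Verdict` type, the three callables, and the dictionary used as a stand-in for an image are all hypothetical names chosen for illustration. The sketch only shows the general idea, under the assumption that a rejected candidate is discarded and the edit is retried from the original image with the reflection feedback attached to the plan.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Toy stand-in for an image: attribute name -> value (an assumption, not real pixels).
Image = Dict[str, str]
Plan = Dict[str, str]


@dataclass
class Verdict:
    """Result of the hypothetical reflection step."""
    accepted: bool
    feedback: str = ""


def run_edit_loop(
    image: Image,
    instruction: str,
    think: Callable[[Image, str], Plan],
    edit: Callable[[Image, Plan], Image],
    reflect: Callable[[Image, Image, Plan], Verdict],
    max_rounds: int = 3,
) -> Tuple[Image, List[Verdict]]:
    """Think once, then edit and reflect until a candidate is accepted."""
    plan = think(image, instruction)          # "thinking": turn the request into a plan
    history: List[Verdict] = []
    for _ in range(max_rounds):
        candidate = edit(image, plan)         # "editing": always start from the original
        verdict = reflect(image, candidate, plan)  # "reflection": audit the candidate
        history.append(verdict)
        if verdict.accepted:
            return candidate, history
        # Rejected: drop the candidate and pass the feedback into the next attempt.
        plan = dict(plan, avoid=verdict.feedback)
    return image, history                     # no acceptable edit: keep the original


# --- Toy stubs standing in for the MLLM and DiT components (hypothetical) ---

def toy_think(image: Image, instruction: str) -> Plan:
    # Pretend reasoning resolved the abstract request to one attribute change.
    return {"target": "sky", "value": "sunset"}


def toy_edit(image: Image, plan: Plan) -> Image:
    out = dict(image)
    out[plan["target"]] = plan["value"]
    if "avoid" not in plan:
        out["subject"] = "dog"                # simulated unintended change on the first try
    return out


def toy_reflect(original: Image, candidate: Image, plan: Plan) -> Verdict:
    # Flag any attribute that changed other than the one the plan targets.
    drifted = [k for k in original
               if original[k] != candidate[k] and k != plan["target"]]
    if drifted:
        return Verdict(False, ",".join(drifted))
    return Verdict(True)


result, history = run_edit_loop(
    {"sky": "blue", "subject": "cat"}, "make it dusk",
    toy_think, toy_edit, toy_reflect,
)
```

In this toy run the first candidate alters the subject and is rejected, and the second attempt changes only the targeted attribute. Retrying from the original image rather than from the rejected candidate is a design choice of the sketch, made so that unintended changes cannot accumulate across rounds.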
