Consistent image generation requires faithfully preserving identities, styles, and logical coherence across multiple images, which is essential for applications such as storytelling and character design. Supervised training approaches struggle with this task because large-scale datasets capturing visual consistency are lacking and human perceptual preferences are complex to model. In this paper, we argue that reinforcement learning (RL) offers a promising alternative by enabling models to learn complex and subjective visual criteria in a data-free manner. To achieve this, we introduce PaCo-RL, a comprehensive framework that combines a specialized consistency reward model with an efficient RL algorithm. The first component, PaCo-Reward, is a pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing. It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and chain-of-thought (CoT) reasoning. The second component, PaCo-GRPO, leverages a novel resolution-decoupled optimization strategy to substantially reduce RL cost, alongside a log-tamed multi-reward aggregation mechanism that ensures balanced and stable reward optimization. Extensive experiments across two representative subtasks show that PaCo-Reward significantly improves alignment with human perceptions of visual consistency, and that PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability. Together, these results highlight the promise of PaCo-RL as a practical and scalable solution for consistent image generation.
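The abstract names a log-tamed multi-reward aggregation step but does not give its formula. The sketch below is one plausible reading, not the authors' implementation: it assumes each non-negative reward is compressed with `log1p` before a weighted sum, so that a reward on a larger scale cannot dominate the others, and then applies the standard GRPO group-relative normalization. The function names (`log_tame`, `aggregate_rewards`, `group_advantages`), the reward names, and the weights are all hypothetical.

```python
import math
from statistics import mean, pstdev


def log_tame(reward: float) -> float:
    """Compress a non-negative reward with log1p (assumed taming function)."""
    return math.log1p(max(reward, 0.0))


def aggregate_rewards(rewards: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of log-tamed rewards for a single generated sample."""
    return sum(weights[name] * log_tame(value) for name, value in rewards.items())


def group_advantages(scores: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize scores within one sampling group."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


# Hypothetical group of three samples scored by two reward signals
# that live on very different scales.
weights = {"consistency": 1.0, "prompt_following": 1.0}
group = [
    {"consistency": 0.2, "prompt_following": 8.0},
    {"consistency": 0.9, "prompt_following": 6.0},
    {"consistency": 0.5, "prompt_following": 7.0},
]
scores = [aggregate_rewards(sample, weights) for sample in group]
advantages = group_advantages(scores)
```

With raw sums, the first sample would rank highest purely because of its larger `prompt_following` value. After taming, the gap between 6.0 and 8.0 shrinks relative to the gap between 0.2 and 0.9, so the consistency signal has more influence on the ranking.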
| Model | Prompt Following | Consistency | Overall |
|---|---|---|---|
Depict the process of cleaning a cast iron skillet with visible rust. All images follow a realistic style with a neutral kitchen environment, featuring the same cast iron skillet in sequential cleaning stages. The cookware maintains consistent size, shape, and handle design throughout.
Generate the sequential construction phases of a modern skyscraper. All images maintain a realistic style with technical precision, using a consistent color palette of industrial grays and blues. The skyscraper progresses visibly across stages, with evolving structural details and machinery.
Depict the gradual melting of ice under sunlight, adhering to thermodynamic principles. All images share a realistic style, consistent environmental elements (sunlight angle, surrounding terrain), and scientifically accurate phase transitions. The ice structure degrades progressively, with light reflections and water behavior following heat transfer dynamics.
Hygge-inspired nursery elements with soft textures and muted natural tones. All images maintain a cohesive hygge aesthetic through soft lighting, organic materials, and muted earthy color palettes, evoking warmth and tranquility.
Distinct boho-chic bedroom areas with eclectic global fusion elements. All scenes feature layered textures, vibrant patterns, and globally inspired decor elements unified by a warm, free-spirited bohemian aesthetic.
Spa-like bathroom interiors blending coastal aesthetics and relaxation-focused elements. All images maintain a cohesive beach-inspired theme with whitewashed wood textures, aqua accent tones, and natural materials like pebbles or driftwood to evoke breezy coastal serenity.
Depict the Pyramids of Giza across historical and cultural contexts. All images maintain a realistic style with accurate architectural details of the pyramids, set against a desert landscape under a clear sky. Consistent warm, sandy tones and historical authenticity in attire and structures unify the scenes.
Depict key moments of the Songhai Empire's historical and cultural legacy in West Africa. All images adopt the intricate linework, flat vibrant colors, and gold-leaf accents characteristic of Timbuktu Manuscript illuminations, with shared Saharan architectural motifs and traditional clothing patterns.
Depict pivotal moments of European maritime exploration and cross-cultural encounters during the 15th-16th century. All images employ early modern European maritime art style with rich earth tones, intricate ship details, and dramatic lighting. Shared themes include nautical navigation tools, period-accurate clothing/armor, and compositions emphasizing tension between explorers and indigenous populations.
Generate a set of images depicting the effects of climate change on Arctic ecosystems through interconnected perspectives. All images use a realistic style with cool-toned palettes to emphasize the Arctic environment. Themes of urgency and interconnectedness unify the narrative, balancing human and wildlife perspectives while maintaining visual coherence through shared icy landscapes.
Generate a set of images depicting the discovery and analysis of a new exoplanet in a distant galaxy. All illustrations maintain a cohesive blend of scientific realism and imaginative artistry, using a unified color palette of cosmic blues, starry golds, and planetary reds to visually connect the narrative stages.
Generate a set of images depicting Twinkle, a small star, and the Moon in a whimsical children's book style. All images maintain consistent character designs for Twinkle (soft golden glow, round eyes) and the Moon (serene face with craters), using a cohesive night sky palette with dreamy textures and gentle lighting to unify the celestial journey narrative.
Generate a set of images portraying an elderly man with a weathered face and gentle eyes in a rustic countryside environment. All images maintain warm, nostalgic tones and a cohesive rustic atmosphere, emphasizing the elderly man's connection to his rural surroundings through consistent lighting, clothing, and natural textures.
Generate a set of images depicting a stylish young person immersed in an urban nightlife atmosphere. All images feature bold neon colors, dramatic lighting contrasts, and a cohesive urban nightlife theme. The character's modern style and energetic ambiance remain consistent, emphasizing vibrant city textures and nocturnal energy.
Generate a set of images depicting a woman with a slender figure, straight red hair, and freckles across her nose in hyper-realistic style. All images feature the same woman with distinct red hair and freckles, rendered in hyper-realistic detail. Shared elements include natural lighting, textured environments, and a focus on expressive interactions with surroundings.
@misc{ping2025pacorladvancingreinforcementlearning,
  title={PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling},
  author={Bowen Ping and Chengyou Jia and Minnan Luo and Changliang Xia and Xin Shen and Zhuohang Dang and Hangwei Qian},
  year={2025},
  eprint={2512.04784},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.04784},
}