Introduction

Over the past few years, image generation has seen remarkable progress. Diffusion and flow-matching models can generate high-resolution images, produce sharp photorealism and stable structure, render dense text, encode broad world knowledge, and follow user prompts in precise detail. These improvements have been driven by several interacting factors, including scalable transformer architectures, improved captioning and text encoders, better latent representations, and pipelined post-training techniques. Yet as the field has optimized for reliability on these capabilities, many systems have converged toward a narrow set of default aesthetics. This convergence serves them well as production tools but makes them less effective as engines for creative exploration, where users often need to search across styles, moods, compositions, and visual directions rather than receive a single polished default.

To address these limitations, we present Krea 2, a series of foundation models focused on creative exploration. Krea 2’s models are built on the belief that image generation should be an exploratory medium: expressive enough to span many aesthetics, and controllable enough for creators to navigate them.

We built a large-scale data infrastructure and distributed training framework from scratch to curate a comprehensive pretraining dataset with broad world knowledge and style coverage. Using this infrastructure, we train expressive models through a multi-stage pipeline spanning pretraining, midtraining, supervised finetuning (SFT), preference optimization, and reinforcement learning (RL), with each stage designed to progressively refine the model’s output distribution. We develop a simple yet performant diffusion transformer (DiT) architecture through thorough ablations. Our model incorporates several components that accelerate convergence, including iREPA, improved VAEs, and Qwen3-VL.
We also integrate several architectural improvements, including grouped-query attention (GQA), sigmoid-gated attention, lightweight timestep modulation, and multilayer feature aggregation for text-encoder features, which together improve training stability and efficiency. A strong base model is only useful if users can reliably reach the parts of its distribution they care about. In training, the model learns from rich, carefully constructed captions that describe images with dense visual detail. In practice, user inputs are often shorter, more ambiguous, and shaped by many different habits of expression. Some users describe a scene in natural language; others gesture toward a mood, a style, or a reference image. This creates a gap between the model’s learned conditioning space and the way creative intent is expressed at inference time.
To reduce this gap, we build two systems that make Krea 2 more exploratory and steerable from both text and image inputs: a prompt expander and a style-reference system. The prompt expander maps simple or underspecified user prompts into richer visual directions without overwriting the user’s intent. It is trained through a two-stage SFT and RL pipeline on top of open-source LLMs, where the objective is not only to improve image quality, but also to encourage creative variation and controllable exploration. Complementing this textual interface, the style-reference system lets users express visual intent through images when words are insufficient. It allows users to inject the style or mood of one or more reference images with minimal content leakage, while providing fine-grained control over style strength and weighted style mixing. Together, these components define Krea 2 as a foundation model for exploratory generation. Instead of optimizing only for a single polished default, Krea 2 is designed to expose a broad visual space and give users practical ways to move through it, using both text- and image-based control. Krea 2 is among the top 10 models on the Artificial Analysis leaderboard for text-to-image and ranks second among models from independent labs.
Krea 2 serves as a comprehensive baseline and enables a creative generative experience while maintaining competitive performance.

Data

Data Curation Principles

Before detailing our data pipeline, it is important to establish what constitutes a good data mix for our purpose. A good mix does not consist solely of “high quality” images. Diversity and broad domain coverage are essential given our objective of building an expressive, stylistically diverse model. We argue that conventional model-based filtering, which uses aesthetic-score and image-quality-assessment (IQA) models, introduces implicit biases. For example, such methods may classify a blurry image as low quality, even though motion blur or softness can be a deliberate artistic choice. Furthermore, we argue that as long as a caption accurately describes its image, even an undesirable image may be helpful in downstream use cases: because the model precisely understands the undesired behavior, such samples can later be used to steer generations away from that distribution. For these reasons, we build the pretraining dataset by filtering out only:

- Duplicated samples and over-represented concepts.
- Samples for which VLMs consistently fail to capture important aspects of the image.
- Samples that induce undesired biases and artifacts.
- Samples with high visual complexity that is too difficult to model reliably at low resolution.
- AI-generated samples.

These conditions shape a pretraining dataset with broad coverage while avoiding poor text-to-image alignment and artifacts. Importantly, we use no AI-generated images in our pretraining mix. Synthetic data and distillation can be an effective shortcut for acquiring model capabilities. However, we find that even a small proportion of AI-generated images introduces biases into the model’s output distribution, as synthetic images tend to be easier to learn, which effectively imposes an upper bound on model quality.
We therefore designed in-house classifiers to filter such images out.

Captioning

We employ a multi-stage approach to produce captions. First, we run an OCR model on each target image to extract any visible text. In the second stage, we provide both the OCR results and any available metadata (camera settings, known entities, and so on) to the captioning model, which produces an enriched caption that incorporates world knowledge alongside the extracted text.

Figure: General captioning pipeline

Once a context-rich, long-form natural-language caption is obtained, we use a cheaper LLM to reformat it into a variety of lengths and formats, exposing the model to a range of prompt styles.
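As an illustration of how reformatted caption variants might be consumed at training time, the sketch below picks one variant per sample with weights skewed toward long captions. The variant names, the weights, and the `pick_caption` helper are all hypothetical; the actual proportions used for Krea 2 are not stated.

```python
import random

# Hypothetical sampling weights (an assumption, not the real mix): mostly
# long captions, with medium and short variants still seen regularly.
VARIANT_WEIGHTS = {"long": 0.7, "medium": 0.2, "short": 0.1}


def pick_caption(variants: dict, rng: random.Random) -> str:
    """Pick one caption variant for a training sample.

    `variants` maps a variant name ("long", "medium", "short") to caption
    text. Variants missing from a sample are skipped; random.choices then
    treats the remaining weights as relative weights.
    """
    names = [name for name in VARIANT_WEIGHTS if name in variants]
    if not names:
        raise ValueError("sample has no known caption variant")
    weights = [VARIANT_WEIGHTS[name] for name in names]
    chosen = rng.choices(names, weights=weights, k=1)[0]
    return variants[chosen]


if __name__ == "__main__":
    rng = random.Random(0)
    sample = {
        "long": "A weathered fishing boat moored at dawn, soft fog over the harbor...",
        "medium": "A fishing boat in a foggy harbor at dawn",
        "short": "boat in fog",
    }
    print(pick_caption(sample, rng))
```

Sampling the variant per step, rather than fixing one caption per image, means the same image can be seen with different prompt styles over the course of training.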
Empirically, we find that training on long prompts provides dense supervision, yielding faster convergence and lower training loss. For many downstream and applied use cases, however, performance on short and medium-length prompts remains important. We therefore train predominantly on long captions while ensuring the model is exposed to short and medium-length prompts throughout training.

Figure: Our overall training pipeline and data stages

Pretraining Data

Pretraining data spans 256px, 512px, and 1024px resolution stages. Progressively scaling the resolution forms a curriculum-learning strategy: we dedicate the majority of FLOPs to the low-resolution stages to build core model capabilities efficiently, then equip the model with high-fidelity generation capabilities as the training resolution increases. Low-resolution pretraining is the stage at which basic text-image alignment and structure are learned. At this stage the dataset is on the order of billions of images, so we rely heavily on inexpensive CPU-based filters to remove low-quality images. These range from simple broken-file, resolution, and aspect-ratio filters that remove unusable images, to Laplacian filters that remove images with extreme textures and noise patterns. As an example, one issue we encountered while pretraining Krea 2 was a tendency for the model to generate flat-color backgrounds and border artifacts. To mitigate this, we used RGB entropy, white/black pixel ratios, custom heuristics, and in-house classifiers to filter out samples that induced this behavior. When building an in-house classifier, one effective strategy was to use a large VLM to craft a task-specific system prompt for the filtering task (for example, detecting a specific pattern or artifact), produce a pseudo-labeled dataset, and then train a small DINOv3- or SigLIP-2-based classifier to run the filter at scale. Any filtering model that requires GPU compute at the low-resolution stage is kept under 1B parameters for efficiency.
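The RGB-entropy and white/black-pixel-ratio heuristics mentioned above can be sketched in a few lines of CPU-only code. This is a minimal illustration, not the production filter: the thresholds, the `looks_flat` helper, and the choice to average entropy over the three channels are assumptions, and pixels are represented as a plain list of `(r, g, b)` tuples to keep the example dependency-free.

```python
import math
from collections import Counter


def channel_entropy(values):
    """Shannon entropy, in bits, of one 8-bit channel's value histogram."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rgb_entropy(pixels):
    """Mean of the per-channel entropies over R, G, and B."""
    if not pixels:
        raise ValueError("image has no pixels")
    return sum(channel_entropy([p[ch] for p in pixels]) for ch in range(3)) / 3


def extreme_pixel_ratio(pixels, low=8, high=247):
    """Fraction of pixels that are near-black or near-white in every channel."""
    if not pixels:
        raise ValueError("image has no pixels")
    extreme = sum(1 for p in pixels if max(p) <= low or min(p) >= high)
    return extreme / len(pixels)


def looks_flat(pixels, min_entropy=2.0, max_extreme_ratio=0.6):
    """Flag images dominated by flat color or by white/black regions.

    Both thresholds are hypothetical and would need tuning on real data.
    """
    return (
        rgb_entropy(pixels) < min_entropy
        or extreme_pixel_ratio(pixels) > max_extreme_ratio
    )


if __name__ == "__main__":
    white_card = [(255, 255, 255)] * 100
    varied = [(i % 256, (i * 7) % 256, (i * 13) % 256) for i in range(256)]
    print(looks_flat(white_card), looks_flat(varied))
```

Filters of this kind are cheap enough to run over billions of images, which is why they are a natural first pass before any GPU-based classifier.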
For deduplication at the low-resolution stages, we primarily use inexpensive hash-based methods, combining md5, phash, and colorhash to remove duplicate images with minimal compute. We find that the default 8x8 phash does not account for color and has a high false-positive rate; we therefore combine a 12x12 phash with colorhash for more robust deduplication. As we scale the training resolution, we introduce image-quality and aesthetic filters.
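The hash-based deduplication rule described above, where a perceptual hash alone is not trusted because it ignores color, can be sketched as follows. The hashes are assumed to be precomputed and packed into integers by whatever hashing library is in use; the `ImageHashes` record, the distance thresholds, and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageHashes:
    md5: str        # hex digest of the raw file bytes (exact duplicates)
    phash: int      # perceptual hash bits packed into an int (e.g. 12x12 = 144 bits)
    colorhash: int  # color hash bits packed into an int


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed hashes."""
    return bin(a ^ b).count("1")


def is_duplicate(a: ImageHashes, b: ImageHashes,
                 max_phash_dist: int = 6, max_color_dist: int = 2) -> bool:
    """Treat two images as duplicates if the files are byte-identical, or if
    BOTH the perceptual hash and the color hash are close.

    Requiring agreement on the color hash avoids merging images that share
    structure but differ in color. Thresholds are illustrative assumptions.
    """
    if a.md5 == b.md5:
        return True
    return (
        hamming(a.phash, b.phash) <= max_phash_dist
        and hamming(a.colorhash, b.colorhash) <= max_color_dist
    )


def deduplicate(items):
    """Keep the first image of each duplicate group (quadratic; sketch only)."""
    kept = []
    for item in items:
        if not any(is_duplicate(item, other) for other in kept):
            kept.append(item)
    return kept


if __name__ == "__main__":
    original = ImageHashes("aa", 0b1111, 0b000001)
    recompressed = ImageHashes("bb", 0b1110, 0b000001)  # near-identical copy
    recolored = ImageHashes("cc", 0b1111, 0b111110)     # same layout, new colors
    print(len(deduplicate([original, recompressed, recolored])))
```

The pairwise loop is only for clarity. At the scale of billions of images, one would instead bucket by exact hash value or use an index that supports Hamming-distance lookups.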
Importantly, these quality scores are used only to drop images of extremely poor quality, not to oversample images on the basis of their scores. We additionally use an image-complexity score and text density (from OCR results) to exclude images whose text and content cannot be meaningfully represented at low resolution. We adjust the quality, complexity, and text-density thresholds as training progresses. Beyond conventional quality filters, we also train a sparse autoencoder (SAE) on SigLIP-2 embeddings computed over a sample of our pretraining corpus. After training the SAE, we use a VLM to annotate each SAE feature based on its top-k activating samples. These annotated features form an unsupervised tagging system in which we extract the predominant SAE features from each image. This tagging system was useful for filtering clear visual artifacts without training an explicit classifier.

Midtraining Data

Unlike the pretraining stages, midtraining explicitly selects specific image sources known to offer good stylistic coverage and high-quality images for particular visual domains. Whereas pretraining is a bottom-up process that begins from a general pool, midtraining data is curated top-down: the domains and sources are chosen first. Midtraining is a crucial stage that smoothly bridges the general pretraining distribution and the high-quality SFT distribution. To improve the quality of the distribution, we introduce semantic clustering and use retrieval-based strategies to ensure world-knowledge coverage. Building on the approach in “Automatic Data Curation for Self-Supervised Learning”, we use FAISS to perform hierarchical k-means clustering, and then sample from the resulting clusters so as to retain long-tail visual concepts without wasting compute on over-sampled head concepts. After computing the hierarchical clusters, we have a VLM examine the images nearest each cluster centroid in order to name and, where appropriate, flag the cluster.
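To make the cluster-balanced sampling idea concrete, the sketch below draws images round-robin across coarse clusters, picking a random leaf inside each, so that a small long-tail cluster receives the same share of the budget as a large head cluster until it runs out. It assumes a two-level hierarchy whose assignments have already been computed (for example with k-means); the `balanced_sample` helper and its input format are hypothetical and simplify the sampling scheme of the cited work.

```python
import random
from collections import defaultdict


def balanced_sample(assignments, budget, rng):
    """Sample up to `budget` image ids, balancing across coarse clusters.

    `assignments` maps image_id -> (coarse_cluster_id, leaf_cluster_id).
    Each pass takes one image from every remaining coarse cluster, chosen
    from a random leaf, without replacement.
    """
    tree = defaultdict(lambda: defaultdict(list))
    for image_id, (coarse, leaf) in sorted(assignments.items()):
        tree[coarse][leaf].append(image_id)
    for leaves in tree.values():
        for members in leaves.values():
            rng.shuffle(members)

    selected = []
    while len(selected) < budget and tree:
        for coarse in sorted(tree):  # snapshot, so deleting below is safe
            if len(selected) >= budget:
                break
            leaves = tree[coarse]
            leaf = rng.choice(sorted(leaves))
            selected.append(leaves[leaf].pop())
            if not leaves[leaf]:
                del leaves[leaf]  # leaf exhausted
            if not leaves:
                del tree[coarse]  # coarse cluster exhausted
    return selected


if __name__ == "__main__":
    # One head cluster with 1000 images, one tail cluster with 10.
    assignments = {i: (0, i % 10) for i in range(1000)}
    assignments.update({1000 + i: (1, 0) for i in range(10)})
    sample = balanced_sample(assignments, budget=20, rng=random.Random(0))
    print(sum(1 for image_id in sample if image_id >= 1000), "tail images kept")
```

Uniform sampling over the same pool would keep, on average, fewer than one tail image at this budget, which is the head-concept bias that balanced sampling is meant to avoid.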
Following human review of the flagged clusters, we dropped several that were low quality or problematic. We remove further redundant data through semantic deduplication, computing the SigLIP similarity between images within each remaining leaf cluster. An important capability of image generation models is faithfully representing known entities that users may reference simply by name. Some entities, such as sports players or actors, can fall into semantic clusters containing many other entities, which risks their being dropped under straightforward hierarchical sampling. To address this, we ran PageRank over English Wikipedia using Danker and retained the top 90% of articles by rank.
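For readers unfamiliar with the ranking step, the following is a minimal power-iteration PageRank over a toy link graph, followed by a helper that keeps a top fraction of nodes by rank. It is a generic textbook sketch and not Danker's implementation; the damping factor, iteration count, graph format, and the `top_fraction` helper are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    `links` maps a page to the list of pages it links to. Pages that only
    appear as link targets are included automatically. Rank held by pages
    with no outgoing links is redistributed uniformly on each iteration.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def top_fraction(rank, fraction):
    """Return the highest-ranked `fraction` of nodes (ties broken by name)."""
    ordered = sorted(rank, key=lambda node: (-rank[node], node))
    keep = max(1, int(len(ordered) * fraction))
    return ordered[:keep]


if __name__ == "__main__":
    toy_graph = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    ranks = pagerank(toy_graph)
    print(top_fraction(ranks, 0.5))
```

Ranking by link structure rather than by visual cluster gives a retention criterion for named entities that does not depend on how they happen to group with other entities in embedding space.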