1.0 Introduction

When professional creatives evaluate AI-generated work, their judgments produce two distinct signals. The first is convergence: evaluators agree on what works, revealing shared best practices like readable typography, functional layout, and correct visual hierarchy. The second is divergence: evaluators disagree, and that disagreement reflects genuine differences in taste, aesthetic direction, and creative intent. Most AI benchmarks treat the second signal as noise to be resolved. This paper proposes a framework for measuring both.

This distinction matters because creative work has no ground truth. The dimensions on which experts disagree (aesthetic direction, mood, conceptual risk) are not reducible to miscalibration or error [1][2]. Standard evaluation approaches, including majority voting, adjudication, and gold-standard reconciliation, treat evaluator disagreement as something to resolve [3][4]. These methods work where labels have objective answers. In creative domains, they would smooth out the information most worth preserving. Work in annotation science has recognized that disagreement can carry signal [5], and frameworks like CrowdTruth have formalized this for labeling tasks [4]. The Human Creativity Benchmark applies that insight to creative evaluation, where the standard resolution strategies are structurally wrong because taste is legitimately distributed across professionals. Flattening it into a single quality score artificially homogenizes an otherwise diverse workflow and creative process, and produces exactly the generic output that professionals already find unusable.

That homogeneity is already a practical problem. Generative models tend toward mode collapse [6][7]: when multiple models are given the same creative brief, they converge on safe, averaged aesthetics rather than distinctive directions. Creative professionals depend on differentiated output.
They use AI for trend awareness, style inspiration, and rapid exploration, both to decide "what to build?" and to validate "is it good?" [8][9]. Both require a range of possible directions, and the creative process extends well past the first draft [10]. Designers iterate fluidly, revisit stages, and make hundreds of small judgment calls where the distance between "good enough" and "right" is entirely a matter of taste [2]. A model that converges on a single safe default fails this workflow even when the output is technically competent.

The Human Creativity Benchmark (HCB) proposes that creative quality is measured along evaluation axes that fall on a spectrum from objectively verifiable to inherently subjective. Prompt adherence sits at the verifiable end: did the model follow the instructions? Visual appeal sits at the taste end: does this feel right? Usability falls in the middle, where shared conventions exist but leave room for interpretive difference. Convergence and divergence are properties of these dimensions themselves. Verifiable axes produce agreement because the criteria are shared and checkable. Taste-driven axes produce disagreement because the criteria are personal. Separating these two signals, not merely observing that they exist, is what makes the framework useful.

Figure: Convergence and divergence as two interacting signals in creative evaluation. Convergence rises as work approaches production; divergence remains structurally present where the question shifts to taste.

Figure: Convergence by scalar question and domain. Prompt adherence and usability produce higher agreement than visual appeal; Desktop Apps and Landing Page converge most, while Ad Video and Brand Assets remain the most divergent.

Convergence captures best practices: shared standards like composition (visual hierarchy, balance), clarity (legibility, information architecture), and technical correctness (rendering, proper alignment, absence of artifacts) that are stable, repeatable, and critical for training models to produce reliable outputs. Divergence captures taste: variation in aesthetic judgment, interpretation, and creative intent that defines what makes creative work distinctive and is essential for steerability, personalization, and creative control. These signals are not always cleanly separated. Best practices may conflict depending on the objective. Apparent agreement may also result from limited model expressivity: when outputs are too similar to elicit meaningful differences in judgment, convergence reflects a lack of variation rather than strong alignment.

The benchmark measures both signals through three complementary methods. Pairwise forced-ranking surfaces relative preference. Scalar ratings on three dimensions surface where agreement concentrates.
Open-ended qualitative follow-ups surface the reasoning behind each judgment. Together, they produce data that distinguishes convergence-driven dimensions from divergence-driven ones. Collapsing them into a single quality score discards the most actionable information: where a model needs to be correct versus where it needs to be steerable.

To test this framework, Contra Labs ran a study drawing from its network of over 1.5 million independent professional creatives, who have collectively earned over $250M. A select group of evaluators across five creative domains (landing pages, desktop apps, ad images, brand images, and product videos) assessed AI-generated outputs across three phases of the creative process (ideation, mockup, refinement) using all three methods. This produced roughly 15,000 individual judgments that reveal where evaluation is objective and where it is irreducibly a matter of professional, interpretive judgment.

2.0 Methodology

Creative Process

The study structures the creative workflow into three phases, validated against a separate survey of working creatives:

1. Ideation: Discovery, exploration, and directional potential. At this stage, the creative is not looking for final production quality, but rather for an exciting creative direction that is strategically appropriate and worth developing.

   Example prompt: A sculptural luxury leather handbag by Mata Forma, crafted from vegetable-tanned Brazilian leather with architectural brass hardware, positioned as a symbol of modern strength rooted in biodiversity and craftsmanship, designed for confident women who value structure, heritage materials, and quiet authority, (a minimal studio environment inspired by the tones of Amazonian earth and raw clay rather than literal forest scenery), subtle references to Brazilian nature translated into form through curved silhouettes inspired by tree trunks and organic growth rings, refined tailoring details, and natural tonal layering in wardrobe styling, (soft directional lighting creating sculptural shadow play and depth, evoking the feeling of strength emerging from the earth without overt wilderness imagery), (editorial hero composition where the handbag feels like a design object and extension of the wearer's posture, balanced negative space, premium fashion campaign aesthetic expandable into a full luxury accessories series).

2. Mockup: Creative direction has been decided; now it is time to make the vision come to life.
The creative realizes the project's chosen direction by creating product shots, stitching together scenes, incorporating brand identity, and bringing the campaign to life.

   Example prompt: A high-resolution luxury fashion product portrait of the Mata Forma structured leather handbag in warm terracotta vegetable-tanned leather with visible natural grain and subtle tonal variation, featuring a brushed architectural brass flap with clean geometric curvature and precision edge finishing, (a refined neutral studio backdrop in soft clay beige with gentle gradient depth and matte surface texture), styled with a tailored taupe blazer and fluid silk skirt in muted earth tones to complement the bag without overpowering it, minimal gold jewelry accents and natural makeup to reinforce understated luxury, (controlled studio lighting with a soft key light from upper left, subtle fill to preserve leather depth, crisp but controlled reflections on brass hardware, realistic shadow grounding beneath the bag), (three-quarter seated editorial composition, eye-level camera angle, shallow depth of field, handbag as clear focal point positioned at the center of visual hierarchy, premium contemporary fashion campaign aesthetic).

3. Refinement: Designs are near production-ready, and slight tweaks are all it will take to cross the finish line. Adjustments target specific aspects while the rest of the design is kept consistent.

   Example prompt: Refine the image to include ad design text for meta ads. The headline should read "Crafted in Brazil. Carried Everywhere." The CTA should be "Shop now". Use an all caps serif font. Sharp corner outline button.

Each prompt built on the previous phase, with Mockup and Refinement taking input images to simulate a real designer's workflow. Ideation prompts created new design directions, Mockup prompts took that vision and asked for a more stable direction, and Refinement prompts took that direction and asked for specific edits.

Participants and input data

Participants were drawn from Contra's network, a global platform where independent creatives have earned over $250 million across design, video, development, and content projects. We chose the five domains because they reflect the most common professional deliverables on the platform. We selected participants based on skillset and the generative model category most relevant to their workflow, then presented them with guidelines contextualizing each phase of the creative process and outlining grading criteria for rubric alignment.

Creative professionals from the Contra network also generated the prompts and input media.
Designers were given high-level product and industry information and advised to design output appropriate for their use case. The prompt generation task guided creatives through each phase with baseline structural requirements covering prompt length, camera angles, color palettes, and other domain-relevant attributes. Prompts were reviewed by Contra's research team for clarity and alignment with real-world project briefs, then normalized for consistency. Prompts containing negative sentiment were removed to mitigate potential confounds.
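
The phase chaining described above, where Mockup and Refinement prompts take the previous phase's image as input, can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Phase`, `PromptStep`, and `build_chain` names, the string image identifiers, and the shortened prompt texts are hypothetical and are not taken from the study's tooling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Phase(Enum):
    IDEATION = 1
    MOCKUP = 2
    REFINEMENT = 3


@dataclass(frozen=True)
class PromptStep:
    phase: Phase
    prompt: str
    # Hypothetical identifier of the image carried over from the prior phase.
    input_image: Optional[str] = None


def build_chain(prompts: Dict[Phase, str], outputs: Dict[Phase, str]) -> List[PromptStep]:
    """Link the three phases so each later phase receives the previous output."""
    chain: List[PromptStep] = []
    previous_output: Optional[str] = None
    for phase in Phase:  # Enum members iterate in definition order
        if phase is not Phase.IDEATION and previous_output is None:
            raise ValueError(f"{phase.name} needs the previous phase's output image")
        chain.append(PromptStep(phase, prompts[phase], previous_output))
        previous_output = outputs.get(phase)
    return chain


# Usage with placeholder prompt texts and made-up image identifiers.
chain = build_chain(
    prompts={
        Phase.IDEATION: "Sculptural luxury leather handbag, editorial hero composition",
        Phase.MOCKUP: "High-resolution product portrait of the chosen direction",
        Phase.REFINEMENT: "Add headline and CTA text for the ad",
    },
    outputs={
        Phase.IDEATION: "ideation_output.png",
        Phase.MOCKUP: "mockup_output.png",
    },
)
for step in chain:
    print(step.phase.name, "<-", step.input_image)
```

Ideation starts without an input image, while the later phases fail loudly if the prior output is missing, which mirrors the requirement that each phase builds on the one before it.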