Two portrait photographs sit side by side. One was taken by a professional photographer in a studio in Lagos. The other was generated by Google's Imagen 4 in under eight seconds. Can you tell which is which?
If you guessed wrong, you're in good company. In blind tests conducted in early 2026, trained observers (people whose job is literally to spot fakes) correctly identified the AI-generated image less than 55% of the time. That's barely better than a coin flip.
Human faces have always been AI's hardest subject. We're wired to scrutinize them. We catch the faintest wrongness in a smile, a pupil, a jawline. And yet Google's Imagen 4, announced at Google I/O 2025, is closer to solving this problem than any model before it. Closer, but not all the way there.
This article is a methodical, honest breakdown of how Imagen 4 handles faces, skin texture, and photorealism. Where it leaps ahead of the competition. And where it still stumbles.
Why Human Faces Are AI's Hardest Problem (And Always Have Been)
Your brain doesn't process faces the way it processes a landscape or a coffee mug. It has dedicated hardware for the task. The Fusiform Face Area (FFA), a small region on the underside of the temporal lobe, processes faces holistically rather than as a collection of separate features. A foundational study by Kanwisher, McDermott, and Chun (1997) in the Journal of Neuroscience established that this region responds far more strongly to faces than to the other categories tested, including everyday objects, houses, and hands.
This matters because it means you don't just see a nose, two eyes, and a mouth. You perceive a unified facial gestalt. When any element is slightly off, the whole thing collapses into uncanniness, even if you can't articulate exactly what's wrong.
Every generation of AI image models has wrestled with the same set of failure modes:
- Tooth geometry: Fused teeth, impossibly uniform "Chiclet" rows, or missing gum shadows
- Iris reflections: Mismatched catchlights that betray a nonexistent light source
- Bilateral asymmetry: Faces that are either too symmetrical (robotic) or asymmetrical in structurally implausible ways
- Earlobe rendering: Ears treated like abstract wax rather than complex cartilage
- Hair-to-skin transitions: Painted-on hairlines where follicles should gradually thin across the forehead and temples
The progress over five years has been staggering. DALL-E 1 in 2021 produced faces that looked like melted wax sculptures. Midjourney v5 in 2023 marked a genuine breakthrough, generating portraits that could pass casual inspection. By early 2025, multiple models were producing faces that fooled most viewers most of the time.
But here's the question that frames everything that follows: is there a "photorealism ceiling," a point where incremental architectural improvements yield diminishing perceptual returns? Or has Imagen 4 punched through it?
Imagen 4 Under the Hood: What Changed Architecturally
The jump from Imagen 3 to Imagen 4 isn't a single innovation. It's a stack of them.
The most significant shift is architectural. Imagen 3 was primarily a cascade diffusion model. Imagen 4 is a hybrid cascade diffusion + transformer system. The transformer component handles global composition, spatial reasoning, and prompt adherence. The diffusion component handles pixel-level texture and fine detail. Together, they allow the model to "plan" a face's structure before rendering its surface.
Three changes matter most for face rendering:
Facial landmark embeddings. Before generating any texture, Imagen 4 builds a latent 3D-like structural skeleton of the face, conditioning on geometric landmarks for eyes, nose, mouth, and chin. This geometry-aware approach means the model commits to a plausible facial structure before it starts painting skin.
Face-aware attention. This is Google's most publicized breakthrough. During the final denoising steps of image generation, the model dynamically allocates more processing power to facial regions. Think of it as the model "zooming in" on the face in its final pass, sharpening pores, wrinkles, and iris detail without introducing artifacts or creating jarring boundaries with surrounding areas.
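Imagen 4 is a closed model, and the description above doesn't pin down how face-aware attention is actually implemented. As a purely hypothetical illustration of the idea, the sketch below adds a bias to the attention logits of tokens that fall inside a face mask once denoising passes a chosen point, so those tokens receive a larger share of attention. The function names, the `bias` of 1.5, and the 80% start point are all assumptions, not Imagen 4 parameters.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def face_biased_attention(logits, face_mask, progress, bias=1.5, start=0.8):
    """Hypothetical: once denoising `progress` (0.0 = pure noise, 1.0 = done)
    reaches `start`, add `bias` to the logits of tokens inside the face mask
    before the softmax, shifting attention weight toward facial regions."""
    if progress >= start:
        logits = [s + bias if in_face else s for s, in_face in zip(logits, face_mask)]
    return softmax(logits)

def face_share(weights):
    """Total attention weight that lands on face tokens."""
    return sum(w for w, in_face in zip(weights, face_mask) if in_face)

logits = [0.2, 0.1, 0.4, 0.3]           # one query's scores over four image tokens
face_mask = [False, True, True, False]  # tokens 1 and 2 lie on the face
early = face_biased_attention(logits, face_mask, progress=0.3)
late = face_biased_attention(logits, face_mask, progress=0.9)
print(round(face_share(early), 3), round(face_share(late), 3))
```

Early in denoising the weights are an ordinary softmax; late in denoising the face tokens' share rises, which is the "zooming in" intuition expressed as an attention bias.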
Curated portrait training data. Google trained Imagen 4 on what's been described as an internal "Portraits-Max" dataset: over a billion high-resolution, diverse, ethically sourced human portraits spanning ages, skin tones, and lighting conditions. That is far more dedicated portrait coverage than the general-purpose datasets used by previous generations offered.
The model also follows specific prompts with noticeably more fidelity than Imagen 3. Descriptors like "deep-set eyes," "high cheekbones," or "laugh lines around the mouth" produce more accurate and consistent results. At native output resolutions of up to 2K for portraits, inference speed remains practical for consumer-facing applications, though the model is more computationally demanding than Flux's lighter architecture.
Prompt Testing Methodology: How We Put Imagen 4 to the Test
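To evaluate Imagen 4's face rendering rigorously, we designed a structured matrix of 24 prompts built from four variables. Lighting, age, and accessory presence were fully crossed (4 × 3 × 2 = 24), and skin tone was varied across those 24 prompts to cover the full Monk scale, since fully crossing all ten tones as well would have required 240 prompts. The four variables were: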
- Lighting condition (4 types): Studio, natural daylight, golden hour, overcast
- Apparent age range (3 types): 20s, 40s, 60s+
- Skin tone (Monk Skin Tone Scale): MST-1 through MST-10, ensuring representation across the full spectrum
- Facial hair/accessory presence (2 types): Clean-shaven vs. beard or reading glasses
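One way to lay out a matrix like this in code, assuming lighting, age, and accessory presence are fully crossed (4 × 3 × 2 = 24 prompts) and the ten Monk levels are rotated across those cells. The round-robin tone assignment and the prompt template are illustrative assumptions rather than the test's actual wording.

```python
from itertools import product

LIGHTING = ["studio", "natural daylight", "golden hour", "overcast"]
AGES = ["20s", "40s", "60s+"]
ACCESSORIES = ["clean-shaven", "beard or reading glasses"]
MST_LEVELS = [f"MST-{i}" for i in range(1, 11)]

def build_prompt_matrix():
    """Cross lighting x age x accessory (4 * 3 * 2 = 24 cells) and rotate
    through the ten Monk Skin Tone levels so every level appears at least twice.
    The template string is a hypothetical placeholder."""
    matrix = []
    for i, (light, age, accessory) in enumerate(product(LIGHTING, AGES, ACCESSORIES)):
        tone = MST_LEVELS[i % len(MST_LEVELS)]
        prompt = (f"a portrait of a person in their {age}, skin tone {tone}, "
                  f"{accessory}, {light} lighting")
        matrix.append({"lighting": light, "age": age, "accessory": accessory,
                       "skin_tone": tone, "prompt": prompt})
    return matrix

matrix = build_prompt_matrix()
print(len(matrix))
print(matrix[0]["prompt"])
```

Rotating tones rather than crossing them keeps the matrix at 24 prompts; the trade-off is that each tone is paired with only a few lighting and age combinations.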
Each output was scored independently on a 1-to-5 scale across six dimensions: skin texture realism, eye clarity, tooth rendering, symmetry naturalness, hair-skin boundary quality, and overall photographic plausibility.
For comparison, we ran the same 24 prompts through Midjourney v8 (released Q1 2026), Flux 1.1 Pro Ultra, and Adobe Firefly 4, all at their highest quality settings and default portrait aspect ratios.
One important caveat: no model produces fully deterministic outputs. To account for variance, we ran each prompt five times per model and scored the median-quality result (the third of the five when ranked). This isn't a perfect methodology, but it reduces the influence of unusually lucky or unlucky generations.
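A minimal sketch of median-of-five selection, assuming each run is scored on the six dimensions listed above and runs are ranked by their mean score. The dimension keys and the sample scores are hypothetical placeholders, not data from our tests.

```python
from statistics import mean

# Hypothetical keys for the six scoring dimensions.
DIMENSIONS = ("skin_texture", "eye_clarity", "teeth", "symmetry",
              "hair_skin_boundary", "plausibility")

def overall(run):
    """Mean of the six 1-to-5 dimension scores for one generation."""
    return mean(run[d] for d in DIMENSIONS)

def median_run(runs):
    """Sort runs by overall score and return the middle one (3rd of 5)."""
    if len(runs) % 2 == 0:
        raise ValueError("use an odd number of runs so the median is an actual output")
    return sorted(runs, key=overall)[len(runs) // 2]

# Five illustrative runs of one prompt (made-up scores).
raw_scores = [(4, 4, 3, 4, 4, 4), (5, 5, 4, 5, 4, 5), (3, 3, 2, 3, 3, 3),
              (4, 5, 3, 4, 4, 5), (2, 3, 2, 3, 2, 2)]
runs = [{"run_id": i, **dict(zip(DIMENSIONS, s))}
        for i, s in enumerate(raw_scores, start=1)]

chosen = median_run(runs)
print(chosen["run_id"], round(overall(chosen), 2))
```

Requiring an odd number of runs guarantees the reported result is a real generation rather than an average of two.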
Where Imagen 4 Excels: Skin Texture, Diverse Tones, and Lighting
Imagen 4's most impressive achievement is its handling of darker skin tones, specifically MST-7 through MST-10 on the Monk Skin Tone Scale.
This has been a persistent failure point for AI image models. Earlier versions of Midjourney and Imagen frequently rendered deep brown and deep ebony skin with an over-smoothed, "greasy" appearance, or desaturated tones that stripped away warmth and dimensionality. A major contributor is the failure to model subsurface scattering: the way light penetrates skin, scatters through the underlying tissue, and exits at a slightly different point, giving skin its luminous, living quality.
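Image models don't compute lighting explicitly, but a classic real-time graphics approximation shows what modeling (or ignoring) subsurface scattering does to a surface. Plain Lambertian shading cuts light off sharply where a surface turns away from the light. "Wrap lighting" lets illumination bleed past that terminator, roughly the softened, glowing falloff that scattering produces in skin. This is an analogy only, not how Imagen 4 works, and the `wrap` value of 0.5 is an arbitrary illustrative choice.

```python
import math

def lambert(n_dot_l: float) -> float:
    """Standard diffuse term: light cuts off hard at the terminator (N.L = 0)."""
    return max(0.0, n_dot_l)

def wrap_diffuse(n_dot_l: float, wrap: float = 0.5) -> float:
    """Wrap lighting, a cheap stand-in for subsurface scattering: light is
    allowed to 'wrap' past the terminator, softening the falloff."""
    return max(0.0, (n_dot_l + wrap) / (1.0 + wrap))

# Angle between the surface normal and the light direction.
for angle_deg in (0, 60, 90, 110):
    n_dot_l = math.cos(math.radians(angle_deg))
    print(f"{angle_deg:>3} deg  lambert={lambert(n_dot_l):.2f}  "
          f"wrap={wrap_diffuse(n_dot_l):.2f}")
```

Facing the light, both models agree; at and just past 90 degrees, Lambert goes dark while the wrapped term still carries light. That residual glow is the missing dimensionality the flat, "greasy" renders lacked.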
Imagen 4 gets this right. A test prompt, "a candid portrait of a 45-year-old Ghanaian woman, outdoor midday light," produced results with rich tonal depth, visible pore texture, natural specular highlights, and the warm luminosity that deep skin tones exhibit in direct sunlight. Competing models produced flatter, less dimensional results for the same prompt.
The model also excels under natural and mixed lighting conditions. "Window light" and "overcast outdoor" prompts generated accurate catchlights, realistic shadow gradients across facial planes, and proper skin translucency. Flux 1.1 Pro Ultra, while technically sharp in these scenarios, tends to over-index on clarity, producing results that feel slightly hyper-real or "3D rendered" rather than genuinely photographed.
Age-related skin texture is another standout. A prompt for a "68-year-old Japanese man, studio three-quarter lighting" rendered fine lines, nasolabial folds, and crow's feet with genuine depth and specificity. Most competing models blur these details into a soft-focus approximation. Imagen 4 treats them as meaningful structural information.
Hair-to-skin transitions, one of AI's oldest challenges, have improved significantly from Imagen 3. Temple hairlines, sideburns, and eyebrow micro-hairs render with more realism, though Flux's best outputs still edge ahead in this specific dimension.
Where Imagen 4 Still Struggles: The Persistent Problem Areas
No review is honest if it only covers strengths. Imagen 4 has real, recurring weaknesses.
Teeth remain the "final boss." Open-mouthed smiles produced plausible results roughly 60% of the time in our testing. The other 40% showed subtle but unmistakable problems: slightly too-uniform tooth sizing, missing gum shadow detail, or a faint "denture effect" where teeth don't seem naturally rooted in the jaw. Midjourney v8 has made more progress here, producing more consistently accurate dental anatomy.
Asymmetry overreach. This is a fascinating failure mode. In correcting for the eerie symmetry that plagued earlier AI models, Imagen 4 sometimes swings too far in the opposite direction. A test prompt for "a close-up, neutral expression portrait of a 30-year-old Caucasian male, studio lighting" frequently generated an exaggerated droop in one eyelid or a visible pull to one side of the mouth. It looked like a glitch, not a natural feature. The model is trying too hard to seem human.
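A crude way to quantify this failure mode is to mirror facial landmarks across the face's midline and measure how far paired points disagree. The sketch below assumes a frontal, upright face, a known vertical midline, and hypothetical landmark names; it is an illustration, not the scoring method used in our tests. Real faces never score exactly zero, so the signal to watch for is an outlier score, not asymmetry as such.

```python
import math

# Hypothetical landmark names; real landmark detectors use their own schemes.
PAIRS = [("left_eye", "right_eye"), ("mouth_left", "mouth_right")]

def asymmetry_score(landmarks, pairs, midline_x):
    """Mirror each right-side landmark across a vertical midline at `midline_x`
    and measure how far it lands from its left-side partner. Distances are
    normalised by inter-ocular distance so the score is scale-free; 0.0 means
    the landmarks are perfectly mirror-symmetric."""
    (lx, ly), (rx, ry) = landmarks["left_eye"], landmarks["right_eye"]
    interocular = math.hypot(rx - lx, ry - ly)
    total = 0.0
    for left, right in pairs:
        (ax, ay), (bx, by) = landmarks[left], landmarks[right]
        total += math.hypot(ax - (2 * midline_x - bx), ay - by)
    return total / (len(pairs) * interocular)

symmetric = {"left_eye": (40, 50), "right_eye": (80, 50),
             "mouth_left": (48, 90), "mouth_right": (72, 90)}
# Same face with one eyelid and one mouth corner pulled down by a few pixels.
drooped = {**symmetric, "right_eye": (80, 54), "mouth_right": (72, 96)}

print(asymmetry_score(symmetric, PAIRS, midline_x=60))
print(round(asymmetry_score(drooped, PAIRS, midline_x=60), 3))
```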
Glasses and accessories. Transparent and semi-transparent lenses remain a persistent weak point. Lens reflections, the shadow interaction between frame and face, and the subtle skin indentation where frames rest on the nose bridge all tend toward implausibility. This isn't unique to Imagen 4; Flux and Midjourney struggle with the same set of problems.
Identity consistency. Unlike fine-tuned models, Imagen 4 has no native identity-lock mechanism. Generating the "same person" across multiple images remains unreliable without third-party tooling or ControlNet-style conditioning. For anyone who needs a consistent character across a set of images, this is a significant limitation.
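Identity consistency is commonly checked by comparing face-recognition embeddings across images. A minimal sketch, assuming the embeddings already exist: the three-dimensional vectors below are toy stand-ins for the much longer vectors a real face-recognition model produces, and the 0.6 threshold is an illustrative cut-off, not a standard.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def identity_consistency(embeddings, threshold=0.6):
    """Return the lowest pairwise similarity across a set of images and whether
    every pair clears `threshold`. The weakest pair decides, because one
    off-identity image is enough to break a character set."""
    worst = min(cosine(a, b) for a, b in combinations(embeddings, 2))
    return worst, worst >= threshold

# Toy embeddings: the first two images resemble each other, the third drifts.
embeddings = [[0.9, 0.1, 0.4], [0.85, 0.15, 0.45], [0.2, 0.9, 0.1]]
print(identity_consistency(embeddings[:2]))
print(identity_consistency(embeddings))
```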
Cultural specificity gaps. Highly specific facial feature descriptors, such as "Andean nose bridge" or "Mongolian epicanthic fold," sometimes collapse into generic approximations. This suggests the training data, despite its unprecedented diversity, still underrepresents certain ethnic subgroups.
Imagen 4 vs. The Competition: An Honest 2026 Scorecard
Rather than declaring a single winner, the more useful framing is understanding each model's "personality."
| Dimension (score out of 10) | Imagen 4 | Midjourney v8 | Flux 1.1 Pro Ultra |
|---|---|---|---|
| Skin Texture Realism (All Tones) | 9.5 | 8.5 | 8.0 |
| Facial Diversity (Age, Ethnicity) | 9.5 | 8.0 | 7.5 |
| Lighting Accuracy & Subtlety | 9.0 | 8.0 | 8.5 |
| Accessory & Hair Handling | 8.0 | 9.0 | 9.0 |
| Prompt Adherence & Consistency | 8.5 | 9.5 | 8.0 |
| Teeth & Smile Realism | 7.5 | 9.0 | 8.0 |
Midjourney v8 is the cinematic powerhouse. Even at maximum realism settings, its outputs carry a signature dramatic polish, a "perfected" quality that looks stunning but rarely passes for an unretouched photograph.
Flux 1.1 Pro Ultra is the technical expert. Images are incredibly clean and sharp, perfect for graphic applications, but they can feel sterile, as if untouched by human imperfection.
Imagen 4 is the documentarian. Its greatest strength is neutrality. It doesn't impose a house aesthetic. It renders scenes the way a neutral camera would, which is precisely why it excels at diverse representation and photographic realism.
For professional headshots specifically, this neutrality is a decisive advantage. Midjourney's cinematic style often reads as "AI-generated" to a trained eye because it rarely produces the flat, simple lighting of a real corporate photo. Imagen 4 can mimic that aesthetic convincingly, making it the strongest choice for LinkedIn profiles, company about-us pages, and similar contexts.
On the integration side, Imagen 4 is available via Google Cloud Vertex AI and has seen rapid adoption by consumer-facing tools. Platforms like Starkie AI represent a growing category of applications that build on top of frontier image models, translating raw model capabilities into polished, reliable outputs for real-world headshot use cases. Meanwhile, Flux's permissive licensing has made it the open-source community's favorite for custom workflows and fine-tuning.
The takeaway: the best model depends on your purpose. A fine-art portrait calls for different strengths than a LinkedIn headshot. Model selection should be driven by use case, not leaderboard rankings.
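The scorecard makes this concrete. Averaged with equal weights, Imagen 4 and Midjourney v8 tie at about 8.67 and Flux 1.1 Pro Ultra trails at about 8.17, so the "winner" depends entirely on which dimensions you care about. The sketch below uses the scorecard's numbers; the two use-case weightings are hypothetical, chosen only to show how the ranking flips.

```python
# Scores from the scorecard; the dimension keys are shorthand.
SCORES = {
    "Imagen 4":           {"skin": 9.5, "diversity": 9.5, "lighting": 9.0,
                           "accessory_hair": 8.0, "prompt": 8.5, "teeth": 7.5},
    "Midjourney v8":      {"skin": 8.5, "diversity": 8.0, "lighting": 8.0,
                           "accessory_hair": 9.0, "prompt": 9.5, "teeth": 9.0},
    "Flux 1.1 Pro Ultra": {"skin": 8.0, "diversity": 7.5, "lighting": 8.5,
                           "accessory_hair": 9.0, "prompt": 8.0, "teeth": 8.0},
}

def rank_models(weights):
    """Weighted average per model, best first."""
    total = sum(weights.values())
    averages = {
        model: sum(scores[dim] * w for dim, w in weights.items()) / total
        for model, scores in SCORES.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

EQUAL = dict.fromkeys(SCORES["Imagen 4"], 1)
# Hypothetical weightings for two use cases.
HEADSHOT = {"skin": 3, "diversity": 2, "lighting": 3,
            "accessory_hair": 1, "prompt": 1, "teeth": 2}
SMILING_CAMPAIGN = {"skin": 1, "diversity": 1, "lighting": 1,
                    "accessory_hair": 2, "prompt": 3, "teeth": 2}

for name, weights in [("equal", EQUAL), ("headshot", HEADSHOT),
                      ("smiling campaign", SMILING_CAMPAIGN)]:
    print(name, [(model, round(score, 2)) for model, score in rank_models(weights)])
```

Weighting skin and lighting puts Imagen 4 on top; weighting prompt adherence and smiles puts Midjourney v8 there, with the same underlying scores.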
What Imagen 4's Face Rendering Tells Us About Where AI Portraits Are Headed
Zoom out from the technical specifics and a broader pattern emerges. The industry is shifting from "generate a plausible human face" to "generate this specific human face, accurately and consistently." Identity preservation and personalization are the next frontier.
As photorealism becomes a largely solved problem, the competitive battleground for 2026 and 2027 will center on subject consistency. Can a model generate the same person across 50 different images, poses, and lighting setups? This will drive investment in fast, accessible fine-tuning tools, including LoRA (Low-Rank Adaptation) personalization and similar approaches that let a user upload 5 to 10 photos and "teach" a model their identity.
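The core LoRA idea is simple: keep a pretrained weight matrix W (d × k) frozen and learn a low-rank correction B·A, where B is d × r and A is r × k with a small rank r, scaled by alpha / r as in the original formulation. The sketch below shows that update on a tiny matrix and why it is cheap to train. The 4096 dimensions and rank 8 are illustrative; which layers a personalization tool adapts, and how many photos it needs, vary by tool.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_update(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A): frozen base weights plus a
    trainable low-rank correction."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

def trainable_params(d, k, r):
    """Parameters updated by full fine-tuning versus a rank-r LoRA adapter."""
    return {"full_finetune": d * k, "lora": r * (d + k)}

w = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2 x 2 weights
b = [[1.0], [2.0]]            # trainable 2 x 1
a = [[0.5, 0.5]]              # trainable 1 x 2
print(lora_update(w, b, a, alpha=1, rank=1))
print(trainable_params(4096, 4096, 8))
```

For a 4096 × 4096 layer, a rank-8 adapter trains 65,536 parameters instead of 16,777,216, which is what makes per-user identity adapters practical to train and store.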
The value, increasingly, won't reside in the base model. It will reside in the platform that makes personalization seamless.
There's an ethical dimension here that deserves more than a footnote. Photorealistic AI faces raise real questions about consent, synthetic identity, and misinformation. Google's implementation of C2PA metadata embedding in Imagen 4 outputs is a meaningful step toward provenance transparency, tagging every generated image with machine-readable information about its origin. It's necessary, though far from sufficient.
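Provenance metadata only helps if someone checks for it. The sketch below is a rough presence check for JPEG files, assuming (per the C2PA specification's JPEG embedding) that the manifest store travels in APP11 marker segments as JUMBF boxes labeled `c2pa`. It does not parse or verify the manifest or its signatures; an official C2PA SDK or tool is needed for that. Absence of the segment proves nothing either, since metadata is easily stripped when an image is re-encoded or screenshotted.

```python
import struct

APP11 = 0xEB
SOS = 0xDA
STANDALONE = {0x01, *range(0xD0, 0xDA)}  # TEM, RST0-RST7, SOI, EOI: no length field

def has_c2pa_segment(data: bytes) -> bool:
    """Heuristic: walk JPEG marker segments up to start-of-scan and report
    whether any APP11 segment contains a 'c2pa' label. No validation."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    i = 2
    while i + 4 <= len(data):
        if data[i] != 0xFF:
            break  # lost sync with the marker structure; give up
        marker = data[i + 1]
        if marker == 0xFF:  # fill byte preceding a marker
            i += 1
            continue
        if marker in STANDALONE:
            i += 2
            continue
        (length,) = struct.unpack(">H", data[i + 2:i + 4])  # includes its own 2 bytes
        payload = data[i + 4:i + 2 + length]
        if marker == APP11 and b"c2pa" in payload:
            return True
        if marker == SOS:
            break  # entropy-coded image data follows; metadata precedes it
        i += 2 + length
    return False

# Example usage (hypothetical path):
# with open("portrait.jpg", "rb") as f:
#     print(has_c2pa_segment(f.read()))
```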
And here's the question worth sitting with: if Imagen 4 already fools trained observers roughly half the time, what does the 2027 or 2028 generation look like? What does that mean for how we think about photographic authenticity?
The Honest Verdict
Let's return to that blind test from the opening. Two portraits. One real. One generated. A coin-flip guess rate.
The point was never just that Imagen 4 is impressive. The point is that the hard problem of AI faces is being solved systematically, layer by layer, skin tone by skin tone, lighting condition by lighting condition.
As of mid-2026, Imagen 4 is the most photorealistic and demographically inclusive general-purpose image model for human portrait generation. It leads on darker skin tones, natural lighting, and age-related texture. But it still has meaningful failure modes: teeth, accessory rendering, identity consistency, and cultural specificity gaps. These matter enormously in high-stakes contexts where a person's professional image is on the line.
Understanding these strengths and weaknesses isn't technical trivia. It's the foundation for using AI image tools responsibly and effectively. And it's precisely why platforms like Starkie AI exist at this intersection, translating frontier model capabilities into outputs that actually have to hold up when a real person's professional identity depends on the result.