One Image, Four Minds: Can AI Caption Like A Child or An Elder?
GenAI 30-Project Challenge - 6
Can Different Eyes See the Same Picture?
After diving into a number of text-based AI projects, I wanted something more colorful, interactive, and visual.
Have you ever wondered how your kids or elder family members might interpret an image differently?
A child might focus on the colorful playground, while an elder might notice the peaceful trees in the background. This curiosity inspired me to explore multi-perspective image captioning for #5 of my GenAI challenge.
Here’s how my journey went!

A little housekeeping:
If you’re curious about my other projects, you can find them in the GenAI 30-Project Challenge list. For updates on their progress, product launches, lessons learned, and more, feel free to check out my Build to Launch newsletter, where I regularly share the highs, lows, and insights.
Project Goals
For this project, I set out to build an app that could:
Allow users to upload an image.
Generate captions for the image from four different perspectives (e.g., a child’s, an elder’s, a cultural lens, and a creative one).
Display the captions interactively.
The result? An app that not only interprets images but also reveals how different “minds” might perceive the same visual input.
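The whole idea boils down to a handful of perspective prompts fed to a conditional captioner. The exact wording I used isn't shown in this post, so the sketch below is purely illustrative:

```python
# Hypothetical perspective prompts -- the exact wording isn't listed in this post,
# so treat these as illustrative placeholders for the four "minds".
PERSPECTIVE_PROMPTS = {
    "child": "a photo described by an excited young child",
    "elder": "a photo described by a thoughtful elderly person",
    "cultural": "a photo described through a cultural lens",
    "creative": "a photo described by an imaginative storyteller",
}
```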
Choosing the Right Tools
For this project, I explored two key AI models:
BLIP (Bootstrapping Language-Image Pre-training): A model that generates free-form, natural-language captions and can be conditioned on a text prompt.
CLIP (Contrastive Language-Image Pre-training): A model that compares image embeddings with text embeddings to classify and interpret images.
Both models are available via Hugging Face, but I quickly discovered their differences would play a big role in how this project unfolded.
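Both models load in a few lines with the transformers library. A minimal sketch, assuming the base checkpoints (the exact ones I used aren't listed here):

```python
from transformers import (
    BlipForConditionalGeneration,
    BlipProcessor,
    CLIPModel,
    CLIPProcessor,
)

# BLIP: generates free-form captions, optionally conditioned on a text prompt.
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# CLIP: scores how well candidate texts match an image (classification-style).
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
```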
Challenges Along the Way
Of course, no GenAI project is complete without some technical hiccups. Here are the biggest challenges I faced:
1. Errors in Generating Captions
At first, I kept running into an error:
“Error generating captions: Unable to create tensor, you should probably activate padding with ‘padding=True’ to have batched tensors with the same length.”
After asking ChatGPT, trying every suggestion, and even switching models from BLIP to CLIP, I finally found the culprit: a dependency issue with Hugging Face's library. Specifically, the installed version of NumPy had to be below 2.0. Ironically, the older NumPy version I had used for Project 1 worked fine. Such a tiny problem caused hours of frustration!
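If you hit the same wall, the fix is simply pinning the dependency (`pip install "numpy<2"`). A small guard like the one below, dropped near the top of the script, at least makes the failure obvious instead of cryptic:

```python
import numpy as np

# The transformers version I was on expected NumPy < 2.0; pinning it
# (pip install "numpy<2") made the padding/tensor error disappear.
if int(np.__version__.split(".")[0]) >= 2:
    raise RuntimeError(
        f"NumPy {np.__version__} detected; install a 1.x release for this project."
    )
```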
2. Caption Quality
When I finally got captions, the results from CLIP weren't great. For instance, where a kid would instantly spot the bird in the image, CLIP's output was rigid and overly simplistic. It became obvious that CLIP works more like a classifier: it scores a pool of candidate text labels against the image rather than generating unique descriptions.
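To see why, here's roughly what captioning with CLIP boils down to: you hand it candidate texts and it only ranks them against the image, it never writes its own. (The image path and candidate labels below are made up for illustration.)

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("park.jpg")  # hypothetical test image
# CLIP can only rank texts you supply -- it never generates new descriptions.
candidates = ["a bird on a branch", "a colorful playground", "peaceful trees"]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)[0]
for text, p in zip(candidates, probs):
    print(f"{p.item():.2%}  {text}")
```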
3. Comparing BLIP and CLIP
BLIP performed better, generating more natural language captions. However, I faced new issues:
Wrong Processor: I initially used the wrong processor for decoding, which produced garbled text.
Prompt Inclusion: BLIP sometimes included the exact prompt in the output, making captions feel repetitive. I had to manually trim these, but the results improved significantly.
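Roughly, the working BLIP flow ended up looking like the sketch below: the matching processor for both encoding and decoding, plus a simple trim when the prompt gets echoed back. (The image path and prompt wording are illustrative.)

```python
import torch
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("park.jpg")  # hypothetical test image
prompt = "a photo described by an excited young child"  # illustrative prompt

inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=40)

caption = processor.decode(output_ids[0], skip_special_tokens=True)

# BLIP often echoes the conditioning prompt at the start of the caption,
# so strip it before showing the result (the decode is lowercase, hence .lower()).
if caption.lower().startswith(prompt.lower()):
    caption = caption[len(prompt):].strip()
print(caption)
```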
4. Inconsistent Outputs
Interestingly, BLIP’s outputs varied with each run — even with the same image and prompt!
Trial 1: The “child’s perspective” felt oddly adult-like.
Trial 2: Captions suddenly focused on Japanese culture.
Trial 3: The model confidently stated, “Kids know cherry blossoms bloom in Japan!” — a fascinating assumption.
This randomness made me realize the model’s interpretations depend heavily on subtle variations in processing. While exciting, it also highlighted the importance of refining prompts.
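If the variation comes from sampling during generation, which is the likeliest culprit (I didn't pin it down at the time), two small knobs make runs repeatable. Continuing from the BLIP sketch above:

```python
import torch

# Option 1: keep sampling but fix the seed so every run draws the same tokens.
torch.manual_seed(42)

# Option 2: turn sampling off -- greedy/beam search gives the same caption
# for the same image and prompt every time.
output_ids = model.generate(
    **inputs,
    do_sample=False,   # deterministic decoding
    num_beams=3,       # beam search: still deterministic, often better wording
    max_new_tokens=40,
)
```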
What I Learned
This project reinforced a critical lesson: Prompts are key. However, even the best prompts don’t guarantee consistent outputs every time. BLIP’s generative nature means results can vary, and that’s both a strength and a challenge.
Additionally:
BLIP generates more nuanced and creative captions but is slower than CLIP.
CLIP works better for classification but lacks free-form creativity.
Experiments and Fun Discoveries
While testing, I ran into some quirky issues:
One image loaded sideways due to a bug in the Image library (the usual fix is sketched below the list). After a quick fix, the next shot displayed properly!
Captions for rotated images turned out unintentionally hilarious!
These experiments made the project even more enjoyable and reminded me to embrace the unexpected during testing.
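Sideways loads like this are usually caused by EXIF orientation metadata that the library doesn't apply automatically, rather than an outright bug; if that's what happened here (I didn't dig into it at the time), the standard Pillow fix is a single call:

```python
from PIL import Image, ImageOps

image = Image.open("uploaded_photo.jpg")  # hypothetical upload
# Phones record rotation in EXIF metadata instead of rotating the pixels;
# exif_transpose applies that rotation so the image (and its captions) come out upright.
image = ImageOps.exif_transpose(image)
```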
Problems and Improvements
While the app works, there’s plenty of room for improvement:
Prompt Refinement: Captions vary too much depending on the prompt. Experimenting with more specific prompts or layering prompts might lead to better results.
Model Efficiency: BLIP takes longer to process images than CLIP. Adding a loading indicator (sketched after this list) would improve the user experience while waiting for captions.
UI Enhancements: The current interface is functional but could use a facelift — adding animations or a drag-and-drop upload feature would make it more engaging.
Caption Accuracy: Trimming BLIP’s outputs to remove repetitive prompts is a temporary fix. A more systematic approach to post-processing could help refine captions further.
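On the loading indicator: the exact code depends on the UI framework, which isn't covered in this post. Assuming a Streamlit front end purely for illustration (and a hypothetical generate_captions helper wrapping the BLIP calls), it only takes a few lines:

```python
import streamlit as st

uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    # Show a spinner while BLIP works -- generation can take several seconds.
    with st.spinner("Generating captions from four perspectives..."):
        captions = generate_captions(uploaded)  # hypothetical helper, not defined here
    for perspective, caption in captions.items():
        st.subheader(perspective.title())
        st.write(caption)
```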
Final Thoughts
Project 5 of my GenAI challenge was a colorful and thought-provoking journey. Exploring multi-perspective image captioning gave me a new appreciation for how AI interprets visuals. BLIP and CLIP each brought unique strengths to the table, and I now have a better understanding of their architectures and use cases.
Despite the technical hurdles, this project reminded me of the creative possibilities of AI. Seeing the same image through “different eyes” is not just fun — it’s a glimpse into how diverse minds (and models) perceive the world.
Next up, I’m diving into another exciting GenAI project. If you’ve ever tried image captioning or have tips for improving my approach, let me know — I’d love to hear your thoughts!
Stay tuned for the next project, and let’s keep experimenting together! 🎨✨