INGREDIENT DETECTION

TECHNICAL SKILLS INVOLVED IN THIS PROJECT

Programming: Python

Software: Scikit-learn, NumPy, Requests, BeautifulSoup

Machine Learning Architectures: ViT, CNN, Random Forest (RF), SVM

Soft Skills: Teamwork, Communication

PROJECT SUMMARY

INGRAIDIENT is an AI-powered tool that identifies food ingredients from a single image of a prepared dish, aimed at helping health-conscious users track and understand their diet. Built on a Vision Transformer (ViT) fine-tuned for multi-label classification, the model was trained on a cleaned and rebalanced subset of the Recipe1M+ dataset. It generalized well to unseen web-scraped recipes and outperformed a traditional SVM baseline. The project also included custom data preprocessing, ingredient grouping, and a new test set scraped from Pinch of Yum. Key strengths include the model’s ability to extract contextual features, handle poor-quality images, and predict correct ingredients missing from the ground-truth labels, making it a promising tool for real-world dietary analysis.

ARCHITECTURE

The core model is a Vision Transformer (ViT), specifically the ViT-B/16 variant pre-trained on ImageNet. Here's how the architecture is structured:

  • Feature Extraction: Input images are split into patches and passed through the frozen ViT encoder, which uses self-attention to capture spatial and contextual features.

  • Classification Head: The original classification head is replaced with a custom feedforward network:

    • Layers: 768 → 1024 → 512 → 53 (final output classes)

    • Activation: ReLU

    • Normalization: LayerNorm between fully connected layers

    • Dropout: 0.3 and 0.2 to prevent overfitting

  • Fine-tuning Strategy: Initially, only the classification head is trained. Later, the last two layers of the transformer encoder are unfrozen to allow for domain-specific fine-tuning.

The model predicts multi-label ingredient probabilities for each input image; a sketch of this setup follows below.
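The write-up doesn’t name a training framework, so below is a minimal PyTorch sketch of this setup, assuming torchvision’s ViT-B/16 weights. The layer sizes, activations, and dropout rates come from the description above, while the ordering inside the head and the 0.5 decision threshold are illustrative assumptions.

    import torch
    import torch.nn as nn
    from torchvision.models import vit_b_16, ViT_B_16_Weights

    NUM_CLASSES = 53  # grouped ingredient labels

    # Load ViT-B/16 pre-trained on ImageNet and freeze the encoder.
    model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
    for param in model.parameters():
        param.requires_grad = False

    # Replace the classification head: 768 -> 1024 -> 512 -> 53, with
    # ReLU, LayerNorm, and dropout (0.3 / 0.2) as described above.
    # (New layers are trainable by default, so only the head learns at first.)
    model.heads = nn.Sequential(
        nn.Linear(768, 1024),
        nn.LayerNorm(1024),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(1024, 512),
        nn.LayerNorm(512),
        nn.ReLU(),
        nn.Dropout(0.2),
        nn.Linear(512, NUM_CLASSES),  # one logit per ingredient class
    )

    # Phase 2 of fine-tuning: unfreeze the last two encoder blocks.
    for block in list(model.encoder.layers)[-2:]:
        for param in block.parameters():
            param.requires_grad = True

    # Multi-label output: an independent sigmoid per class, not a softmax.
    logits = model(torch.randn(1, 3, 224, 224))  # dummy 224x224 RGB image
    probs = torch.sigmoid(logits)
    predicted = (probs > 0.5).nonzero(as_tuple=True)[1]  # 0.5 threshold is assumed

A head like this would typically be trained with nn.BCEWithLogitsLoss, treating each of the 53 ingredient classes as an independent binary decision.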

RESULTS

Qualitative and quantitative results are summarized below.

Despite its qualitative robustness and its edge over the baseline SVM, the ViT model achieved only modest quantitative metrics (F1 ≈ 0.34 on the original test set). Performance was limited by several factors, discussed after the evaluation sketch below.
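For reference, multi-label F1 is computed by thresholding the per-ingredient probabilities into binary indicator vectors and comparing them with the ground-truth vectors. Here is a minimal scikit-learn sketch, where the 0.5 threshold, the micro averaging, and the toy 4-class matrices are assumptions standing in for the real 53-class setup:

    import numpy as np
    from sklearn.metrics import f1_score

    # Toy indicator matrices: rows = images, columns = ingredient classes
    # (4 classes here stand in for the project's 53).
    y_true = np.array([[1, 0, 1, 0],
                       [0, 1, 1, 0]])

    # Per-class sigmoid probabilities from the model, thresholded at 0.5.
    probs = np.array([[0.9, 0.2, 0.7, 0.1],
                      [0.3, 0.8, 0.4, 0.6]])
    y_pred = (probs >= 0.5).astype(int)

    # Micro-averaging pools true/false positives across all classes;
    # macro-averaging would instead weight each ingredient equally.
    print(f1_score(y_true, y_pred, average="micro"))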

Dataset Issues:

  • Label Noise: Ground truths often had missing or inconsistent ingredients.

  • Visual Irrelevance: Some images didn’t clearly depict food (e.g., people at a cooking event).

  • Inconsistent Label Parsing: Ingredient names varied with formatting (plurals, adjectives, etc.); see the normalization sketch after this list.

  • Imbalance: Even after rebalancing, some ingredients were far more common than others.
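
One common mitigation for the label-parsing noise above is to normalize raw ingredient strings before grouping them into classes. The project’s exact rules aren’t documented, so the patterns below are purely illustrative:

    import re

    # Hypothetical normalization rules; the project's actual grouping
    # logic is not documented, so treat these as examples only.
    DESCRIPTORS = {"fresh", "chopped", "diced", "large", "small", "ground", "minced"}

    def singularize(word: str) -> str:
        # Very naive plural handling: "tomatoes" -> "tomato", "eggs" -> "egg".
        if word.endswith("oes"):
            return word[:-2]
        if word.endswith("s") and not word.endswith("ss"):
            return word[:-1]
        return word

    def normalize_ingredient(raw: str) -> str:
        text = re.sub(r"[^a-z\s]", "", raw.lower())  # drop quantities/punctuation
        words = [w for w in text.split() if w not in DESCRIPTORS]
        return " ".join(singularize(w) for w in words)

    print(normalize_ingredient("2 large Tomatoes, chopped"))  # -> "tomato"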

Model Limitations:

  • Confusion from Visually Similar Ingredients: e.g., apple vs. tomato.

  • Overfitting: ViTs can overfit small or noisy datasets if not tuned properly.

  • Computation: Long training times hindered extensive hyperparameter tuning or augmentation trials.

Evaluation Misalignment:

  • The model sometimes made contextually correct predictions that were marked wrong due to incomplete ground truth labels.

Generalization Surprises:

  • The ViT performed better on a cleaner, web-scraped dataset, highlighting how much data quality impacted the original evaluation.
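
Given the Requests and BeautifulSoup entries in the skills list, the web-scraped test set was presumably collected along these lines. The sketch below pulls ingredients from schema.org JSON-LD, which many recipe sites (including Pinch of Yum) embed; the exact parsing used in the project is an assumption:

    import json

    import requests
    from bs4 import BeautifulSoup

    def scrape_ingredients(url: str) -> list[str]:
        # Fetch the recipe page and parse its HTML.
        resp = requests.get(url, headers={"User-Agent": "ingraidient-test-set"}, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        # Many recipe sites embed structured recipe data as schema.org
        # JSON-LD; "recipeIngredient" is the standard field name.
        for tag in soup.find_all("script", type="application/ld+json"):
            data = json.loads(tag.string or "{}")
            for obj in data if isinstance(data, list) else [data]:
                if isinstance(obj, dict) and "recipeIngredient" in obj:
                    return obj["recipeIngredient"]
        return []  # page had no machine-readable ingredient list

Each scraped ingredient string would then need the same normalization and grouping applied to the training labels before being compared against the model’s predictions.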