RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation

The first real-world benchmark dataset designed specifically for recipe image and video generation tasks, featuring step-aligned visual content across diverse culinary traditions.

[Figure: A sample recipe image from RecipeGen]
[Video: RecipeGen demo]

About RecipeGen

A Large-Scale Multimodal Benchmark for Recipe Image and Video Generation

RecipeGen is the first real-world benchmark dataset designed specifically for recipe image and video generation tasks. It addresses the limitations of existing datasets by providing detailed step-level annotations with corresponding visual content.

Recipes: 26,453
Step Images: 196,724
Cooking Videos: 4,491
Avg Steps/Recipe: 7.4

RecipeGen provides a robust foundation for evaluating Text-to-Image (T2I), Image-to-Video (I2V), and Text-to-Video (T2V) generation models with a focus on diverse culinary traditions.

Key Features

Step-Aligned Content

Detailed step-by-step textual instructions paired with corresponding visual illustrations, enabling fine-grained understanding of cooking procedures.

Diverse Cooking Styles

Covers a wide range of regional cuisines and cooking techniques, with 158 keywords spanning different culinary traditions.

Rigorous Quality Control

A multi-stage filtering process that includes step-count filtering, image quality control, and GPT-4o-based text refinement.
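As a rough illustration of the first filtering stage, the sketch below keeps only recipes whose step count falls inside a range and whose steps each carry both text and an image. The thresholds (`min_steps`, `max_steps`), the record layout, and the "has an image" check are all hypothetical stand-ins; the actual thresholds and image-quality criteria used for RecipeGen are not stated here.

```python
def filter_recipes(recipes, min_steps=3, max_steps=20):
    """Keep recipes with a plausible step count and fully paired steps.

    min_steps / max_steps are illustrative defaults, not RecipeGen's values.
    """
    kept = []
    for recipe in recipes:
        steps = recipe["steps"]
        # Stage 1: step-count filtering.
        if not (min_steps <= len(steps) <= max_steps):
            continue
        # Simplified stand-in for image quality control: every step needs
        # non-empty text and an associated image.
        if any(not s.get("image") or not s.get("text", "").strip() for s in steps):
            continue
        kept.append(recipe)
    return kept


recipes = [
    {"title": "two-step toast", "steps": [
        {"text": "Toast the bread.", "image": "toast_1.jpg"},
        {"text": "Spread butter.", "image": "toast_2.jpg"},
    ]},
    {"title": "seafood tofu", "steps": [
        {"text": "Prepare seafood.", "image": "s1.jpg"},
        {"text": "Marinate seafood.", "image": "s2.jpg"},
        {"text": "Prepare vegetables.", "image": "s3.jpg"},
    ]},
    {"title": "missing image", "steps": [
        {"text": "Chop.", "image": "m1.jpg"},
        {"text": "Fry.", "image": None},
        {"text": "Serve.", "image": "m3.jpg"},
    ]},
]
kept = filter_recipes(recipes)
print([r["title"] for r in kept])
```

With the default thresholds, the first recipe is dropped for having too few steps and the third for a step without an image.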

Comprehensive Evaluation Metrics

Novel task-specific metrics including Goal Faithfulness, Step Faithfulness, Ingredient Faithfulness, Interaction Accuracy, and Cross-Step Consistency to comprehensively evaluate model performance across multiple dimensions.

Comprehensive Evaluation on Fine-Grained Classification and Recipe QA

Multi-step Complexity
Cooking is a multi-step process that runs from ingredient preparation to doneness checks and depends on precise timing and diverse techniques.

VQA Generation
We generated 160,450 QA pairs covering ingredients, utensil usage, and step-by-step actions. The pairs were structured with GPT-based methods to capture fine-grained cooking semantics.

Professional Evaluation
VQA data underwent thorough review by hospital nutritionists, ensuring high-quality standards and consistency with practical dietary guidelines.

            Recipe       Non-seasoning Ingredient             Seasonings
Model       BLEU  ROUGE  F1 Score  Precision  Recall  IoU     F1 Score  Precision  Recall  IoU     VQA Score
LLaVA-NeXT  3.07  0.16   0.3658    0.3955     0.3935  0.2290  0.2192    0.2714     0.2350  0.1294  4.40
Qwen2.5-VL  3.06  0.17   0.4548    0.4891     0.5083  0.3000  0.2755    0.3170     0.2916  0.1669  5.19
LLAMA3.2    2.42  0.16   0.4203    0.4279     0.5038  0.2958  0.1915    0.2111     0.2210  0.1221  4.68
  • BLEU and ROUGE: Measure the similarity between the recipes generated by large models such as LLaVA and the ground truth.
  • F1 Score, Precision, Recall, IoU: Measure the ingredient recognition accuracy of the large models.
  • VQA Score: Rates the models' answers to VQA questions, as judged by GPT-4o.
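The ingredient-recognition metrics listed above can be sketched as set comparisons between predicted and reference ingredient lists. This is a minimal illustration under the assumption that ingredients are matched by exact name after lowercasing; the function name and the matching rule are ours, and the benchmark's actual matching procedure may differ.

```python
def ingredient_scores(predicted, reference):
    """Set-based precision, recall, F1 and IoU for one ingredient list."""
    # Normalise case and whitespace so "Shrimp " and "shrimp" match.
    pred = {p.strip().lower() for p in predicted if p.strip()}
    ref = {r.strip().lower() for r in reference if r.strip()}

    overlap = len(pred & ref)   # correctly predicted ingredients
    union = len(pred | ref)     # everything mentioned by either side

    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    iou = overlap / union if union else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


scores = ingredient_scores(
    predicted=["shrimp", "squid", "tofu", "garlic"],
    reference=["Shrimp", "squid", "carrot"],
)
print(scores)
```

Here two of the four predictions are correct and two of the three reference ingredients are found, so precision is 0.5, recall is 2/3, and IoU is 2/5.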

Benchmark Results

Comprehensive evaluation of different models on the RecipeGen benchmark across text-to-image and video generation tasks.

Evaluation Metrics for Recipe Generation

Metric: Description
Goal Faithfulness (GF): CLIP similarity between the final image and the last-step caption, measuring alignment with the overall goal.
Step Faithfulness (SF): CLIP similarity between each generated image and its step caption.
Ingredient Faithfulness (IF): GPT-4o extracts the list of ingredients involved in each step. A Qwen2.5-VL model fine-tuned as a VQA model then predicts the ingredients directly from the generated step images, which grounds the textual content visually.
Interaction Accuracy (IA): For each interaction type (Mix up, Blend, Put on, Cover, and No relationship), a standardized prompt guides GPT-4o to generate a one-sentence caption grounded in visible features such as color, texture, spatial layout, and perceived fusion. The CLIP similarity between each generated image and its corresponding caption then measures interaction consistency.
Cross-Step Consistency (CSC): Built on StackedDiffusion and DINOv2; uses l₂ distance and step count difference to assess visual and numerical consistency.
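The similarity-based metrics above reduce to simple vector operations once embeddings are available. The sketch below assumes image and caption embeddings (for example from CLIP) and per-step image features (for example from DINOv2) have already been computed and are passed in as plain lists of floats. The function names, the ×100 scaling, the averaging over steps, and the use of consecutive-step pairs are our assumptions. Only the l₂ part of Cross-Step Consistency is shown, since how it is combined with the step count difference is not specified here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def step_faithfulness(image_embs, caption_embs):
    """Mean image-caption similarity over all steps (assumed x100 scaling)."""
    sims = [cosine(i, c) for i, c in zip(image_embs, caption_embs)]
    return 100.0 * sum(sims) / len(sims)


def goal_faithfulness(image_embs, caption_embs):
    """Similarity between the final image and the last-step caption."""
    return 100.0 * cosine(image_embs[-1], caption_embs[-1])


def feature_distance(features):
    """Mean l2 distance between consecutive step features (lower is more consistent)."""
    dists = [math.dist(a, b) for a, b in zip(features, features[1:])]
    return sum(dists) / len(dists) if dists else 0.0


# Toy 2-D embeddings for a two-step recipe; real embeddings are much larger.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
caption_embs = [[1.0, 0.0], [1.0, 1.0]]

sf = step_faithfulness(image_embs, caption_embs)
gf = goal_faithfulness(image_embs, caption_embs)
dist = feature_distance(image_embs)
print(round(sf, 2), round(gf, 2), round(dist, 4))
```

In the toy example the first image matches its caption exactly and the second only partially, so Step Faithfulness averages the two while Goal Faithfulness reflects the last step alone.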

Experiments on Text-to-Image Models

Method      Joint Generation  GF↑    SF↑    CSC↓   IA(%)↑  IF↑
SD1.5                         26.84  28.40  5.42   36.04   24.60
SD2.1                         26.88  28.51  7.54   39.86   24.95
SDXL                          27.46  29.37  2.98   44.04   24.27
SKD                           26.62  28.53  0.7    42.34   24.71
SD3.5                         27.42  28.77  2.97   45.27   24.11
Flux.1-dev                    26.47  28.31  3.47   38.88   24.06
IC-LoRA                       26.07  26.58  9.03   24.28   24.17
RPF                           27.19  25.99  8.73   31.13   24.59
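To read the table, note that the arrows give the direction of improvement: higher is better for GF, SF, IA, and IF, and lower is better for CSC. The short sketch below picks the best method per metric from the values above; the data structure and function name are ours.

```python
# Values copied from the text-to-image table above.
T2I_RESULTS = {
    "SD1.5":      {"GF": 26.84, "SF": 28.40, "CSC": 5.42, "IA": 36.04, "IF": 24.60},
    "SD2.1":      {"GF": 26.88, "SF": 28.51, "CSC": 7.54, "IA": 39.86, "IF": 24.95},
    "SDXL":       {"GF": 27.46, "SF": 29.37, "CSC": 2.98, "IA": 44.04, "IF": 24.27},
    "SKD":        {"GF": 26.62, "SF": 28.53, "CSC": 0.7,  "IA": 42.34, "IF": 24.71},
    "SD3.5":      {"GF": 27.42, "SF": 28.77, "CSC": 2.97, "IA": 45.27, "IF": 24.11},
    "Flux.1-dev": {"GF": 26.47, "SF": 28.31, "CSC": 3.47, "IA": 38.88, "IF": 24.06},
    "IC-LoRA":    {"GF": 26.07, "SF": 26.58, "CSC": 9.03, "IA": 24.28, "IF": 24.17},
    "RPF":        {"GF": 27.19, "SF": 25.99, "CSC": 8.73, "IA": 31.13, "IF": 24.59},
}

# Direction of each metric, following the arrows in the table header.
HIGHER_IS_BETTER = {"GF": True, "SF": True, "CSC": False, "IA": True, "IF": True}


def best_per_metric(results, higher_is_better):
    """Return the best-scoring method name for each metric."""
    best = {}
    for metric, higher in higher_is_better.items():
        pick = max if higher else min
        best[metric] = pick(results, key=lambda name: results[name][metric])
    return best


best = best_per_metric(T2I_RESULTS, HIGHER_IS_BETTER)
print(best)
```

On these numbers no single method leads every column: SDXL is best on GF and SF, SKD on CSC, SD3.5 on IA, and SD2.1 on IF.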

Experiments on Video Generation Models

Method    Task  GF↑    SF↑    CSC↓  IA(%)↑  IF↑
Hunyuan   T2V   29.66  29.75  0.02  46.02   29.69
Hunyuan   I2V   28.56  29.24  0.15  51.42   28.73
Opensora  T2V   30.88  30.59  0.03  56.20   30.21
Opensora  I2V   29.58  29.14  0.16  55.67   29.40

Video Demonstrations

Example outputs from video generation models for individual recipe steps.

Prepare seafood: Score and cut the squid into pieces. Remove the head, shell, and vein from the shrimp, leaving the tail intact. Set aside.

[Videos: i2v-Hunyuan, i2v-Opensora, t2v-Hunyuan, t2v-Opensora]

Marinate seafood: Add cooking wine and a pinch of salt to the shrimp and squid, mixing well. Marinate for 15 minutes.

[Videos: i2v-Hunyuan, i2v-Opensora, t2v-Hunyuan, t2v-Opensora]

Prepare vegetables: While the seafood is marinating, cut the firm tofu into rectangular pieces. Soak and wash the wood ear mushrooms. Wash and cut the crab-flavored mushrooms into sections. Slice fresh shiitake mushrooms. Cut the snow peas into sections and dice the carrots. Set aside.

[Videos: i2v-Hunyuan, i2v-Opensora, t2v-Hunyuan, t2v-Opensora]

Citation

If you use RecipeGen in your work, please cite our research:

@inproceedings{zhang2025RecipeGen,
  title={RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation},
  author={Zhang, Ruoxuan and Gao, Jidong and Wen, Bin and Xie, Hongxia and Zhang, Chenming and Shuai, Hong-Han and Cheng, Wen-Huang},
  booktitle={},
  pages={1--6},
  year={2025}
}