RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation

About RecipeGen

A Large-Scale Multimodal Benchmark for Recipe Image and Video Generation

RecipeGen is the first real-world benchmark dataset designed specifically for recipe image and video generation tasks. It addresses the limitations of existing datasets by providing detailed step-level annotations with corresponding visual content.

Recipes: 26,453
Step Images: 196,724
Cooking Videos: 4,491
Avg Steps/Recipe: 7.4
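As a quick sanity check on the figures above, dividing the step-image count by the recipe count reproduces the reported average. This assumes each step has exactly one image, which the page does not state explicitly.

```python
# Figures reported on this page.
num_recipes = 26_453
num_step_images = 196_724

# Assumption (not stated in the source): one image per step,
# so images / recipes approximates the average steps per recipe.
avg_steps = num_step_images / num_recipes
print(round(avg_steps, 1))  # prints 7.4
```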

RecipeGen provides a robust foundation for evaluating Text-to-Image (T2I), Image-to-Video (I2V), and Text-to-Video (T2V) generation models with a focus on diverse culinary traditions.

Key Features

Step-Aligned Content

Detailed step-by-step textual instructions paired with corresponding visual illustrations, enabling fine-grained understanding of cooking procedures.

Diverse Cooking Styles

Covers a wide range of regional cuisines and cooking techniques, with 158 keywords spanning different culinary traditions.

Rigorous Quality Control

A multi-stage filtering process, including step count filtering, image quality control, and GPT-4o-based text refinement.

Comprehensive Evaluation Metrics

Novel task-specific metrics (Goal Faithfulness, Step Faithfulness, Ingredient Faithfulness, Interaction Accuracy, and Cross-Step Consistency) evaluate model performance across multiple dimensions.

Comprehensive Evaluation on Fine-Grained Classification and Recipe QA

Multi-step Complexity
Cooking involves multiple steps, from ingredient preparation to doneness checks. It also includes precise timing and diverse techniques.

VQA Generation
We generated 160,450 QA pairs focusing on ingredients, utensil usage, and step-by-step actions. GPT-based methods structure these pairs for fine-grained cooking semantics.

Professional Evaluation
VQA data underwent thorough review by hospital nutritionists, ensuring high-quality standards and consistency with practical dietary guidelines.

Model	Recipe BLEU	Recipe ROUGE	Non-seasoning F1	Non-seasoning Precision	Non-seasoning Recall	Non-seasoning IoU	Seasoning F1	Seasoning Precision	Seasoning Recall	Seasoning IoU	VQA Score
LLaVA-NeXT	3.07	0.16	0.3658	0.3955	0.3935	0.2290	0.2192	0.2714	0.2350	0.1294	4.40
Qwen-VL2.5	3.06	0.17	0.4548	0.4891	0.5083	0.3000	0.2755	0.3170	0.2916	0.1669	5.19
LLAMA3.2	2.42	0.16	0.4203	0.4279	0.5038	0.2958	0.1915	0.2111	0.2210	0.1221	4.68

BLEU and ROUGE: Measure the similarity between the recipes generated by large multimodal models such as LLaVA and the ground truth.
F1 Score, Precision, Recall, IoU: Measure the ingredient recognition accuracy of the large multimodal models.
VQA Score: Uses GPT-4o to rate the quality of each model's answers to the VQA questions.
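The ingredient metrics above can be illustrated with a minimal set-based sketch. The matching rule (exact string match after lowercasing and trimming) and the function name `ingredient_set_metrics` are assumptions made for illustration; the source does not say how predictions are normalized or how per-sample scores are averaged.

```python
def ingredient_set_metrics(predicted, reference):
    """Return precision, recall, F1 and IoU between two ingredient lists.

    Assumption: ingredients match only if they are identical after
    lowercasing and stripping whitespace.
    """
    pred = {p.strip().lower() for p in predicted}
    ref = {r.strip().lower() for r in reference}
    overlap = len(pred & ref)
    union = len(pred | ref)

    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    iou = overlap / union if union else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Toy example: 2 of 3 predictions are correct, 2 of 4 references are found.
scores = ingredient_set_metrics(
    ["Shrimp", "squid", "tofu"],
    ["shrimp", "squid", "carrot", "snow peas"],
)
print(scores)
```

In this toy case precision is 2/3, recall is 1/2, F1 is 4/7, and IoU is 2/5, which shows why IoU is always the strictest of the four numbers.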

Benchmark Results

Comprehensive evaluation of different models on the RecipeGen benchmark across text-to-image and video generation tasks.

Evaluation Metrics for Recipe Generation

Metric	Description
Goal Faithfulness	CLIP similarity between the final image and last-step caption, measuring alignment with the overall goal.
Step Faithfulness	Assesses each image's alignment with its step caption using CLIP.
Ingredient Faithfulness	We employ GPT-4o to extract the list of ingredients involved in each step. We then fine-tune Qwen2.5-VL as a VQA model to recognize and predict the corresponding ingredients directly from the generated step images, enabling visual grounding of textual content.
Interaction Accuracy	For five interaction types (Mix up, Blend, Put on, Cover, and No relationship), we design a standardized prompt that guides GPT-4o to generate one-sentence captions grounded in visible features such as color, texture, spatial layout, and perceived fusion. We then compute the CLIP similarity score between each generated image and its corresponding caption to evaluate interaction consistency.
Cross-Step Consistency	Built on StackedDiffusion and DINOv2; uses the ℓ₂ distance and the step count difference to assess visual and numerical consistency.
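The similarity-based metrics in the table can be sketched as follows, working on precomputed embeddings rather than loading CLIP or DINOv2. Several details are assumptions and not confirmed by the source: the scaling factor of 100 (suggested only by the magnitude of the reported scores), the use of adjacent step pairs for the ℓ₂ distance, and all function names. How the ℓ₂ distance and the step count difference are combined into a single Cross-Step Consistency value is not specified here, so the sketch keeps them separate.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def step_faithfulness(image_embs, caption_embs, scale=100.0):
    """Mean scaled similarity between each step image and its own caption."""
    sims = [cosine_similarity(i, c) for i, c in zip(image_embs, caption_embs)]
    return scale * sum(sims) / len(sims)


def goal_faithfulness(image_embs, caption_embs, scale=100.0):
    """Scaled similarity between the final image and the last-step caption."""
    return scale * cosine_similarity(image_embs[-1], caption_embs[-1])


def mean_adjacent_l2(features):
    """Mean l2 distance between features of consecutive step images.

    Assumption: consistency is measured on adjacent pairs only.
    """
    dists = [math.dist(a, b) for a, b in zip(features, features[1:])]
    return sum(dists) / len(dists) if dists else 0.0


def step_count_difference(num_generated, num_expected):
    """Absolute gap between generated and expected number of steps."""
    return abs(num_generated - num_expected)


# Toy 2-D embeddings standing in for real CLIP / DINOv2 features.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
caption_embs = [[1.0, 0.0], [1.0, 1.0]]
sf = step_faithfulness(image_embs, caption_embs)
gf = goal_faithfulness(image_embs, caption_embs)
l2 = mean_adjacent_l2([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]])
print(sf, gf, l2, step_count_difference(6, 7))
```

Lower ℓ₂ distance and a smaller step count difference both indicate more consistent sequences, which matches the downward arrow on CSC in the result tables.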

Experiments on Text-to-Image Models

Method	Joint Generation	GF↑	SF↑	CSC↓	IA(%)↑	IF↑
SD1.5	✗	26.84	28.40	5.42	36.04	24.60
SD2.1	✗	26.88	28.51	7.54	39.86	24.95
SDXL	✗	27.46	29.37	2.98	44.04	24.27
SKD	✓	26.62	28.53	0.7	42.34	24.71
SD3.5	✗	27.42	28.77	2.97	45.27	24.11
Flux.1-dev	✗	26.47	28.31	3.47	38.88	24.06
IC-LoRA	✓	26.07	26.58	9.03	24.28	24.17
RPF	✓	27.19	25.99	8.73	31.13	24.59

Experiments on Video Generation Models

Method	Task	GF↑	SF↑	CSC↓	IA(%)↑	IF↑
Hunyuan	T2V	29.66	29.75	0.02	46.02	29.69
Hunyuan	I2V	28.56	29.24	0.15	51.42	28.73
Opensora	T2V	30.88	30.59	0.03	56.20	30.21
Opensora	I2V	29.58	29.14	0.16	55.67	29.40

Video Demonstrations

Examples of recipe steps rendered by the video generation models. Each step caption below is followed by the models whose outputs are shown.

Prepare seafood: Score and cut the squid into pieces. Remove the head, shell, and vein from the shrimp, leaving the tail intact. Set aside.

[Videos: I2V-Hunyuan, I2V-Opensora, T2V-Hunyuan, T2V-Opensora]

Marinate seafood: Add cooking wine and a pinch of salt to the shrimp and squid, mixing well. Marinate for 15 minutes.

[Videos: I2V-Hunyuan, I2V-Opensora, T2V-Hunyuan, T2V-Opensora]

Prepare vegetables: While the seafood is marinating, cut the firm tofu into rectangular pieces. Soak and wash the wood ear mushrooms. Wash and cut the crab-flavored mushrooms into sections. Slice fresh shiitake mushrooms. Cut the snow peas into sections and dice the carrots. Set aside.

[Videos: I2V-Hunyuan, I2V-Opensora, T2V-Hunyuan, T2V-Opensora]

Citation

If you use RecipeGen in your work, please cite our research:

@inproceedings{zhang2025RecipeGen,
  title={RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation},
  author={Zhang, Ruoxuan and Gao, Jidong and Wen, Bin and Xie, Hongxia and Zhang, Chenming and Shuai, Hong-Han and Cheng, Wen-Huang},
  booktitle={},
  pages={1--6},
  year={2025}
}