
Predict Recipe Ratings

Final Project for EECS 398 Practical Data Science at the University of Michigan by Lucy Piao


Introduction

Understanding what makes a recipe successful is valuable for cooking enthusiasts and professionals who want to create dishes that consistently attract users. In this project, we aim to understand what types of recipes tend to have higher ratings. We analyzed data from food.com, drawing on two datasets: RAW_recipes.csv, which contains 83,781 recipes posted since 2008, and RAW_interactions.csv, which contains 731,926 reviews and ratings for these recipes.

Here are the columns relevant to my data analysis:

Table1(a). Relevant columns in RAW_recipes.csv.

| Column | Description |
| --- | --- |
| name | Recipe name |
| id | Recipe ID |
| minutes | Minutes to prepare the recipe |
| n_steps | Number of steps in the recipe |
| n_ingredients | Number of ingredients in the recipe |
| nutrition | Nutrition information |

Nutrition information is in the form [calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]; PDV stands for “percentage of daily value”.

Table1(b). Relevant columns in RAW_interactions.csv.

| Column | Description |
| --- | --- |
| rating | Rating given |
| recipe_id | Recipe ID |

Data Cleaning and Exploratory Data Analysis

Data Cleaning

First, I loaded the two datasets and performed a left merge on the recipe ID, creating a single DataFrame that combines recipe details with their corresponding reviews and ratings. This step lets us analyze both recipe content and user feedback in one dataset.
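Below is a minimal sketch of this step, assuming the two CSV files sit in the working directory:

```python
import pandas as pd

recipes = pd.read_csv("RAW_recipes.csv")
interactions = pd.read_csv("RAW_interactions.csv")

# A left merge keeps every recipe, even those with no reviews yet.
merged = recipes.merge(
    interactions, left_on="id", right_on="recipe_id", how="left"
)
```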

Next, I removed extreme outliers to avoid adding noise to the data. One example is a recipe titled “how to preserve a husband,” which lists a preparation time of 1,051,200 minutes (exactly two years) and was adapted from an old folk tale.

Since the rating system ranges from 1 to 5, I considered ratings of 0 invalid and replaced them with NaN. Then I calculated the average rating for each recipe by grouping on recipe_id and added a new column, rating_avg, to represent the aggregated score. This column is the response variable in the prediction problem, discussed further below.
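Continuing the sketch above, this is one way those two operations could look:

```python
import numpy as np

# Ratings of 0 fall outside the 1-5 scale, so treat them as missing.
merged["rating"] = merged["rating"].replace(0, np.nan)

# Per-recipe average rating, ignoring NaNs, broadcast back to every row.
merged["rating_avg"] = merged.groupby("recipe_id")["rating"].transform("mean")
```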

The nutrition column originally includes seven values in a single string (calories, total fat, sugar, sodium, protein, saturated fat, and carbohydrates). I split the string into separate columns and converted each value to a numeric format (float).
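A sketch of the split, assuming the nutrition strings are list-like (bracketed, comma-separated) so that ast.literal_eval can parse them:

```python
import ast

nutrition_cols = ["calories", "total_fat", "sugar", "sodium",
                  "protein", "saturated_fat", "carbohydrates"]

# Parse each list-like string, then expand it into seven float columns.
parsed = merged["nutrition"].apply(ast.literal_eval)
merged[nutrition_cols] = pd.DataFrame(
    parsed.tolist(), index=merged.index
).astype(float)
```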

Next, I dropped irrelevant columns such as contributor_id, nutrition, date, and review, since they are redundant or unhelpful for the analysis even though they appear in the original datasets. I also dropped rows with missing rating_avg values, since they cannot be used in the model; I discuss this choice further in the Imputation section.

Finally, I removed duplicate rows and reset the index. I also rearranged the columns to put rating_avg after name for a better format.

Below are the first few rows of my cleaned DataFrame:

Table2. The head of the cleaned data frame.

Univariate Analysis

To understand what kinds of recipes tend to receive higher ratings, I analyzed the distribution of average recipe ratings across the dataset.

The plot below shows the distribution of average ratings (rating_avg) for all recipes in the dataset. Most recipes have an average rating above 4.0/5.0, with a peak around 5.0/5.0. This indicates that users tend to rate recipes positively, often giving nearly perfect scores. The high proportion of ratings near 5.0/5.0 may also suggest that users generally choose to review recipes they enjoyed.

Figure1. The distribution of average ratings for each recipe in the cleaned dataset.

I also examined the distributions of the predictor variables. For example, I analyzed the number of steps (n_steps) in each recipe. Most recipes have fewer than 14 steps, with a peak around 5–9 steps, suggesting that most recipes on the platform are simple and require few steps. Understanding the typical number of steps helps us consider whether recipe complexity affects user ratings in later analyses.

Figure2. The distribution of the number of steps for each recipe in the cleaned dataset.

Bivariate Analysis

To explore how different recipe characteristics are related to user ratings, I looked at two key variables: the number of steps and calorie content. I grouped n_steps and calories into bins and used box plots to prevent overplotting.
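As a sketch of this binning (the exact cut points are not given in the report, so the edges below are assumptions; df is the cleaned DataFrame from above):

```python
# Bin edges are assumptions; the report only names groups like "0-5" and "30+".
df["steps_bin"] = pd.cut(
    df["n_steps"],
    bins=[0, 5, 10, 15, 20, 25, 30, float("inf")],
    labels=["0-5", "6-10", "11-15", "16-20", "21-25", "26-30", "30+"],
    include_lowest=True,
)

# One box per bin; the same pattern applies to the calories bins.
df.boxplot(column="rating_avg", by="steps_bin")
```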

The first plot below shows the relationship between the number of steps in a recipe and its average rating across reviewers. The distribution of ratings appears fairly consistent across all step ranges: recipes with very few steps (0–5) or very many steps (30+) do not show significant differences in average rating compared to other groups. However, recipes with very few steps (0–5) or many steps (16+) tend to have smaller variance, indicating that most of them receive consistently high ratings, while the 6–15 step groups show more spread.

Figure3. The association between the number of steps for each recipe and its average ratings in the cleaned dataset.

The second plot explores the relationship between calories and recipe rating. The results show that the average rating does not vary much across different calorie levels. This suggests that users do not appear to favor recipes with lower or higher calorie counts more than others when leaving ratings.

Figure4. The association between calorie contents for each recipe and their average ratings in the cleaned dataset.

Interesting Aggregations

To explore the data more deeply, I created a few aggregation analyses to investigate how nutritional values and recipe complexity relate to recipes’ average rating.

I first grouped recipes by calories and calculated the average rating for each bin. As shown below, the differences in average ratings across calorie groups are small. Recipes with “Very Low” calories have the highest average rating (4.64), but the overall variation is minimal. This suggests that calorie content alone does not strongly influence how users rate a recipe.

Table3. Average rating by calorie content categories.

| Calories bin | Mean rating | Count |
| --- | --- | --- |
| Very Low | 4.63989 | 13537 |
| Low | 4.6272 | 13539 |
| Medium-Low | 4.62158 | 13523 |
| Medium-High | 4.61923 | 13518 |
| High | 4.62195 | 13526 |
| Very High | 4.62227 | 13529 |

Next, I grouped recipes by their average rating and calculated the average nutrition values for each group. Interestingly, recipes with lower ratings (below 3) had higher calories, sugar, and saturated fat content. This may imply that unhealthy recipes are not as well received by users.

Table4. Average nutrition values by rating category (calories as counts; other columns as PDV).

| rating_category | calories | total_fat | sugar | sodium | protein | saturated_fat | carbohydrates |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Below 3 | 442.398 | 32.7228 | 79.9188 | 37.8392 | 32.5697 | 41.2249 | 15.0179 |
| 3.0-3.5 | 439.185 | 31.0472 | 77.2808 | 26.3153 | 33.1473 | 39.4557 | 15.5006 |
| 3.5-4.0 | 426.119 | 31.6138 | 61.6182 | 28.3131 | 35.4261 | 38.7319 | 13.5506 |
| 4.0-4.5 | 416.087 | 30.8324 | 60.4056 | 28.0984 | 34.0167 | 37.276 | 13.3149 |
| 4.5-5.0 | 427.668 | 32.7853 | 69.2216 | 28.633 | 32.4752 | 40.5076 | 13.6314 |

Finally, I computed a complexity score, the sum of the number of steps and the number of ingredients, and grouped recipes into six bins based on this score (a code sketch follows the table below). Interestingly, both “Very Simple” and “Very Complex” recipes have the highest average ratings, suggesting that users enjoy both very simple recipes and very detailed, complicated ones. Recipes in the middle range tend to have slightly lower ratings.

Table5. Average rating by recipe complexity.

| complexity_bin | Mean rating | Count |
| --- | --- | --- |
| Very Simple | 4.64899 | 14123 |
| Simple | 4.6189 | 15884 |
| Medium-Simple | 4.61156 | 12633 |
| Medium-Complex | 4.61158 | 14496 |
| Complex | 4.61953 | 11992 |
| Very Complex | 4.64303 | 12044 |
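A minimal sketch of this aggregation, assuming the cleaned DataFrame is named df; pd.qcut is my assumption for how the six roughly equal-sized bins were produced:

```python
# Complexity score: number of steps plus number of ingredients.
df["complexity"] = df["n_steps"] + df["n_ingredients"]

labels = ["Very Simple", "Simple", "Medium-Simple",
          "Medium-Complex", "Complex", "Very Complex"]
df["complexity_bin"] = pd.qcut(df["complexity"], q=6, labels=labels)

# Mean rating and group size per complexity bin, as in Table 5.
print(df.groupby("complexity_bin", observed=True)["rating_avg"]
        .agg(["mean", "count"]))
```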

Imputation

I chose not to fill in any missing values for the average recipe rating (rating_avg). Instead, I dropped rows where the rating was missing. Since rating_avg is the value I aim to predict, filling in missing ratings with estimated values would create artificial data points that could bias the model and reduce the reliability of the analysis. Recipes with ratings outside the 1–5 scale or with missing ratings provide no useful information, and there is no accurate way to guess their ratings.

By doing this, I can make sure that all the data used is from genuine reviews. This helps me maintain the integrity of the model and avoids introducing noise or bias that could come from imputed target values.

Framing a Prediction Problem

The objective of this project is to understand what types of recipes tend to have higher ratings. I framed the problem as a regression task: predicting the average rating a recipe receives, which is a continuous numeric value.

The response variable is rating_avg, which represents the average of all ratings for each recipe. I chose this variable because it reflects how users evaluate recipes, and it aligns with my question about what features lead to higher-rated recipes.

The predictor variables are features that would be available at the time a recipe is created, including minutes, n_steps, n_ingredients, and the nutrition values. The exhaustive list can be found later in the Baseline Model section.

To evaluate the performance of my regression model, I used R² as the main metric, because it measures how well the model explains the variance in the target variable. I also used MSE (Mean Squared Error), RMSE (Root Mean Squared Error), and MAE (Mean Absolute Error) for a more complete picture: RMSE penalizes larger errors more heavily, and MAE is easy to interpret as the average prediction error.

Baseline Model

For my baseline model, I used a Linear Regression model to predict the average rating (rating_avg) for recipes. The features I included are all quantitative and available at the time of prediction: 'n_steps', 'minutes', 'n_ingredients', 'calories', 'total_fat', 'sugar', 'sodium', 'protein', 'saturated_fat', 'carbohydrates'. I applied StandardScaler to these quantitative features to put them on the same scale, ensuring comparable coefficients. Since I had no ordinal or nominal features, no encodings were needed.

Then I split the data into an 80% training set and a 20% test set, saving their indices in the DataFrame so the same split can be reused in the final model. This keeps the performance comparison about the models themselves rather than the data they were trained on. I trained the model and evaluated it with R², MSE, RMSE, and MAE, as mentioned earlier; a sketch of this setup is shown below.
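A sketch of the baseline setup, assuming the cleaned DataFrame is named df; the random_state is illustrative, not necessarily the one used in the project:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = ["n_steps", "minutes", "n_ingredients", "calories", "total_fat",
            "sugar", "sodium", "protein", "saturated_fat", "carbohydrates"]

# 80/20 split; the resulting indices can be saved and reused later.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["rating_avg"], test_size=0.2, random_state=42
)

# Standardize the quantitative features, then fit a linear regression.
baseline = make_pipeline(StandardScaler(), LinearRegression())
baseline.fit(X_train, y_train)

pred = baseline.predict(X_test)
mse = mean_squared_error(y_test, pred)
print(f"R²: {r2_score(y_test, pred):.5f}, MSE: {mse:.5f}, "
      f"RMSE: {np.sqrt(mse):.5f}, MAE: {mean_absolute_error(y_test, pred):.5f}")
```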

Baseline Model Performance:
R² Score: 0.00024
MSE: 0.40390
RMSE: 0.63554
MAE: 0.46698

I also performed a cross-validation (K=5) to assess the model’s generalizability. The results are shown below.

Cross-validation R² scores: ['-0.00003', '-0.00079', '0.00035', '0.00030', '0.00036']
Mean CV R² score: 0.00004
Standard deviation of CV R² scores: 0.00044

I also analyzed the coefficient (weight) of each feature in the baseline model, sorted by absolute value.

Table6. Feature coefficients in the baseline model.

| Feature | Coefficient |
| --- | --- |
| calories | 0.0536733 |
| carbohydrates | -0.0525195 |
| sugar | 0.0202813 |
| protein | -0.0148397 |
| total_fat | -0.0145259 |
| n_steps | 0.00574767 |
| saturated_fat | -0.00435153 |
| n_ingredients | -0.00300174 |
| minutes | -0.00126595 |
| sodium | -0.000759393 |

Among all features, calories had the largest positive coefficient and carbohydrates the largest negative coefficient with respect to average rating.

Below is the visualization of the Predicted Ratings versus Actual Ratings. The dashed line indicates a hypothetically perfect prediction model.

Figure5. The scatter plot of predicted and actual ratings for each recipe.

Although the baseline model can make predictions, the test R² and cross-validation R² scores are very low, showing limited predictive ability. The MSE, RMSE, and MAE are relatively low, indicating predictions land about 0.63554 points from the actual rating on a 1–5 scale. However, these results should be interpreted with caution: as Figure 1 shows, the ratings cluster around 4.75, so even a model that always predicts near the mean achieves small errors. Taken together with the mean and standard deviation of the cross-validation R² scores, this shows that although the model does not make large errors, it is not effectively capturing the underlying features that influence ratings. This implies that I need to tweak the model for improvement.

Final Model

Feature Engineering

To improve on the baseline model, I first transformed the original features. Specifically, I split them into two categories based on their skewness, since many features are right-skewed. For example, Figure 2 shows the distribution of the number of steps across all recipes, which is skewed to the right (a long tail of many-step recipes) and widely varied.

I also engineered new features to capture more nuanced relationships in the dataset: fat_calorie_ratio, time_per_step, steps_per_ingredient, and protein_carb_ratio (these also appear in the feature-importance table, Table7).

While the engineered features are correlated with existing features, which may cause multicollinearity and make model coefficients unstable, we can mitigate this effect by choosing models that are robust to multicollinearity or by applying dimensionality-reduction techniques (such as PCA) in linear models.

I applied a QuantileTransformer to the new features, which maps the feature values to an approximately Gaussian shape, improving their compatibility with linear models and PCA.
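Below is a minimal sketch of this feature engineering; the exact ratio formulas are my assumption, inferred from the feature names in Table7, and eps is a hypothetical guard against division by zero.

```python
from sklearn.preprocessing import QuantileTransformer

eps = 1e-6  # hypothetical guard against division by zero

# Assumed formulas, inferred from the feature names in Table 7.
df["fat_calorie_ratio"] = df["total_fat"] / (df["calories"] + eps)
df["time_per_step"] = df["minutes"] / (df["n_steps"] + eps)
df["steps_per_ingredient"] = df["n_steps"] / (df["n_ingredients"] + eps)
df["protein_carb_ratio"] = df["protein"] / (df["carbohydrates"] + eps)

new_features = ["fat_calorie_ratio", "time_per_step",
                "steps_per_ingredient", "protein_carb_ratio"]

# output_distribution="normal" maps each feature to a Gaussian shape.
qt = QuantileTransformer(output_distribution="normal")
df[new_features] = qt.fit_transform(df[new_features])
```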

Modeling Algorithm

I trained and tuned the following algorithms: Gradient Boosting, Random Forest, Ridge Regression, and Ridge Regression with PCA.

Model and Hyperparameter Selection

I used GridSearchCV (K=5) to find the best hyperparameters for each model, evaluated by R², and chose the model with the best score. Gradient Boosting performed best, with best parameters 'model__learning_rate': 0.1, 'model__max_depth': 3, 'model__n_estimators': 50. A sketch of the search is shown below.
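This sketch covers the Gradient Boosting pipeline only; the grid below includes the winning values plus neighboring candidates I assume were tried, and random_state is illustrative.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingRegressor(random_state=42)),
])

# Assumed candidate grid around the reported best parameters.
param_grid = {
    "model__learning_rate": [0.01, 0.1],
    "model__max_depth": [3, 5],
    "model__n_estimators": [50, 100],
}

# 5-fold cross-validated grid search, scored by R².
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```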

Model Performance

The plot below shows the predicted ratings versus the actual ratings.

Figure6. The scatter plot of predicted and actual ratings for each recipe.

The model achieves better performance than the baseline:

Final Model Performance:
R² Score: 0.00433
MSE: 0.40225
RMSE: 0.63423
MAE: 0.46458
Improvement over Baseline:
R² Score improvement: 0.00409
MSE improvement: 0.00165
RMSE improvement: 0.00131
MAE improvement: 0.00240

The R² has increased, and the MSE, RMSE, and MAE have decreased compared with the baseline model, indicating that the final model performs better. The feature engineering and the well-selected model help capture the underlying factors.

Below are the performances of other models with their best parameters:

RandomForest:

Best parameters for RandomForest: {'model__max_depth': 5, 'model__min_samples_split': 5, 'model__n_estimators': 50}
Best CV score for RandomForest: 0.0040
Test set performance for RandomForest:
R² Score: 0.00368
MSE: 0.40251
RMSE: 0.63444
MAE: 0.46510

Ridge Regression:

Best parameters for Ridge: {'model__alpha': 0.01}
Best CV score for Ridge: 0.0020
Test set performance for Ridge:
R² Score: 0.00238
MSE: 0.40304
RMSE: 0.63485
MAE: 0.46582

Ridge Regression with PCA:

Best parameters for PCA_Ridge: {'pca__n_components': 9, 'ridge__alpha': 1.0}
Best CV score for PCA_Ridge: 0.0018
Test set performance for PCA_Ridge:
R² Score: 0.00160
MSE: 0.40335
RMSE: 0.63510
MAE: 0.46627

I also analyzed the feature importance in the final model, using Gradient Boosting.

Table7. Sorted feature importance in the final model.

| Feature | Importance |
| --- | --- |
| sugar | 0.242528 |
| carbohydrates | 0.138349 |
| fat_calorie_ratio | 0.10783 |
| total_fat | 0.0757918 |
| protein | 0.0670013 |
| saturated_fat | 0.0622826 |
| time_per_step | 0.0573898 |
| steps_per_ingredient | 0.0506006 |
| n_ingredients | 0.0477057 |
| n_steps | 0.034544 |
| minutes | 0.034215 |
| sodium | 0.0298707 |
| calories | 0.0272974 |
| protein_carb_ratio | 0.0245936 |

Figure7. The importance of features in the Gradient Boosting model with the best hyperparameters.

The results show that the original features sugar and carbohydrates, as well as the engineered feature fat_calorie_ratio, play important roles in the model.

Discussion

In conclusion, our final model achieved improved performance over the baseline model. In future work, we could include categorical features such as recipe tags for better prediction and generalizability.