
Predict Recipe Ratings

Final Project for EECS 398 Practical Data Science at the University of Michigan by Lucy Piao


Introduction

Understanding what makes a recipe successful is valuable for cooking enthusiasts and professionals who want to create dishes that consistently attract users. In this project, we aim to understand what types of recipes tend to have higher ratings. We analyzed data from food.com, drawing on two datasets: RAW_recipes.csv, which contains 83,781 recipes posted since 2008, and RAW_interactions.csv, which contains 731,926 reviews and ratings for these recipes.

Here are the columns relevant to my data analysis:

Table1(a). Relevant columns in RAW_recipes.csv.

| Column | Description |
| --- | --- |
| name | Recipe name |
| id | Recipe ID |
| minutes | Minutes to prepare the recipe |
| n_steps | Number of steps in the recipe |
| n_ingredients | Number of ingredients in the recipe |
| nutrition | Nutrition information |

Nutrition information is in the form [calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]; PDV stands for “percentage of daily value”.

Table1(b). Relevant columns in RAW_interactions.csv.

| Column | Description |
| --- | --- |
| rating | Rating given |
| recipe_id | Recipe ID |

Data Cleaning and Exploratory Data Analysis

Data Cleaning

First, I loaded the two datasets and performed a left merge on the recipe ID, creating a single DataFrame that combines recipe details with their corresponding reviews and ratings. This step lets us analyze both recipe content and user feedback in one dataset.
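Below is a minimal sketch of this step, assuming the two CSV files sit in the working directory:

```python
import pandas as pd

recipes = pd.read_csv("RAW_recipes.csv")
interactions = pd.read_csv("RAW_interactions.csv")

# A left merge keeps every recipe, even those with no reviews yet.
merged = recipes.merge(
    interactions, left_on="id", right_on="recipe_id", how="left"
)
```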

Next, I removed extreme outliers to avoid adding noise to the data. One example is a recipe titled “how to preserve a husband,” which lists a preparation time of 1,051,200 minutes (exactly two years) and was adapted from an old folk tale.

Since the rating system ranges from 1 to 5, I considered ratings of 0 invalid and replaced them with NaN. Then I calculated the average rating for each recipe by grouping on recipe_id and added a new column, rating_avg, to represent the aggregated score. This column is the response variable in the prediction problem, discussed further below.
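Continuing the sketch above, this is one way those two operations could look:

```python
import numpy as np

# Ratings of 0 fall outside the 1-5 scale, so treat them as missing.
merged["rating"] = merged["rating"].replace(0, np.nan)

# Per-recipe average rating, ignoring NaNs, broadcast back to every row.
merged["rating_avg"] = merged.groupby("recipe_id")["rating"].transform("mean")
```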

The nutrition column originally includes seven values in a single string (calories, total fat, sugar, sodium, protein, saturated fat, and carbohydrates). I split the string into separate columns and converted each value to a numeric format (float).
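A sketch of the split, assuming the nutrition strings are list-like (bracketed, comma-separated) so that ast.literal_eval can parse them:

```python
import ast

nutrition_cols = ["calories", "total_fat", "sugar", "sodium",
                  "protein", "saturated_fat", "carbohydrates"]

# Parse each list-like string, then expand it into seven float columns.
parsed = merged["nutrition"].apply(ast.literal_eval)
merged[nutrition_cols] = pd.DataFrame(
    parsed.tolist(), index=merged.index
).astype(float)
```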

Next, I dropped irrelevant columns such as contributor_id, nutrition, date, and review, since they are redundant or unhelpful for the analysis even though they appear in the original datasets. I also dropped rows with missing rating_avg values, since they cannot be used in the model; I discuss this choice further in the Imputation section.

Finally, I removed duplicate rows and reset the index. I also rearranged the columns to put rating_avg after name for a better format.

Below are the first few rows of my cleaned DataFrame:

Table2. The head of the cleaned data frame.

Univariate Analysis

To understand what kinds of recipes tend to receive higher ratings, I analyzed the distribution of average recipe ratings across the dataset.

The plot below shows the distribution of average ratings (rating_avg) for all recipes in the dataset. Most recipes have an average rating above 4.0/5.0, with a peak around 5.0/5.0. This indicates that users tend to rate recipes positively, often giving nearly perfect scores. The high proportion of ratings near 5.0/5.0 may also suggest that users generally choose to review recipes they enjoyed.

Figure1. The distribution of average ratings for each recipe in the cleaned dataset.

I also examined the distributions of the predictor variables. For example, I analyzed the number of steps (n_steps) in each recipe. Most recipes have fewer than 14 steps, with a peak around 5–9 steps, suggesting that most recipes on the platform are simple and require few steps. Understanding the typical number of steps helps us consider whether recipe complexity affects user ratings in later analyses.

Figure2. The distribution of the number of steps for each recipe in the cleaned dataset.

Bivariate Analysis

To explore how different recipe characteristics are related to user ratings, I looked at two key variables: the number of steps and calorie content. I grouped n_steps and calories into bins and used box plots to prevent overplotting.
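As a sketch of this binning (the exact cut points are not given in the report, so the edges below are assumptions; df is the cleaned DataFrame from above):

```python
# Bin edges are assumptions; the report only names groups like "0-5" and "30+".
df["steps_bin"] = pd.cut(
    df["n_steps"],
    bins=[0, 5, 10, 15, 20, 25, 30, float("inf")],
    labels=["0-5", "6-10", "11-15", "16-20", "21-25", "26-30", "30+"],
    include_lowest=True,
)

# One box per bin; the same pattern applies to the calories bins.
df.boxplot(column="rating_avg", by="steps_bin")
```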

The first plot below shows the relationship between the number of steps in a recipe and its average rating across reviewers. The distribution of ratings appears fairly consistent across all step ranges: recipes with very few steps (0–5) or very many steps (30+) do not show significant differences in average rating compared to other groups. However, recipes with very few steps (0–5) or many steps (16+) tend to have smaller variance, indicating that most of them receive consistently high ratings, while the 6–15 step groups show more spread.

Figure3. The association between the number of steps for each recipe and its average ratings in the cleaned dataset.

The second plot explores the relationship between calories and recipe rating. The results show that the average rating does not vary much across different calorie levels. This suggests that users do not appear to favor recipes with lower or higher calorie counts more than others when leaving ratings.

Figure4. The association between calorie contents for each recipe and their average ratings in the cleaned dataset.

Interesting Aggregations

To explore the data more deeply, I created a few aggregation analyses to investigate how nutritional values and recipe complexity relate to recipes’ average rating.

I first grouped recipes by calories and calculated the average rating for each bin. As shown below, the differences in average ratings across calorie groups are small. Recipes with “Very Low” calories have the highest average rating (4.64), but the overall variation is minimal. This suggests that calorie content alone does not strongly influence how users rate a recipe.

Table3. Average rating by calorie content categories.

| Calories bin | Mean rating | Count |
| --- | --- | --- |
| Very Low | 4.63989 | 13537 |
| Low | 4.6272 | 13539 |
| Medium-Low | 4.62158 | 13523 |
| Medium-High | 4.61923 | 13518 |
| High | 4.62195 | 13526 |
| Very High | 4.62227 | 13529 |

Next, I grouped recipes by their average rating and calculated the average nutrition values for each group. Interestingly, recipes with lower ratings (below 3) had higher calories, sugar, and saturated fat content. This may imply that unhealthy recipes are not as well received by users.

Table4. Average nutrition values by rating category (calories as counts; other columns as PDV).

| rating_category | calories | total_fat | sugar | sodium | protein | saturated_fat | carbohydrates |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Below 3 | 442.398 | 32.7228 | 79.9188 | 37.8392 | 32.5697 | 41.2249 | 15.0179 |
| 3.0-3.5 | 439.185 | 31.0472 | 77.2808 | 26.3153 | 33.1473 | 39.4557 | 15.5006 |
| 3.5-4.0 | 426.119 | 31.6138 | 61.6182 | 28.3131 | 35.4261 | 38.7319 | 13.5506 |
| 4.0-4.5 | 416.087 | 30.8324 | 60.4056 | 28.0984 | 34.0167 | 37.276 | 13.3149 |
| 4.5-5.0 | 427.668 | 32.7853 | 69.2216 | 28.633 | 32.4752 | 40.5076 | 13.6314 |

Finally, I computed a complexity score, the sum of the number of steps and the number of ingredients, and grouped recipes into six bins based on this score (a code sketch follows the table below). Interestingly, both “Very Simple” and “Very Complex” recipes have the highest average ratings, suggesting that users enjoy both very simple recipes and very detailed, complicated ones. Recipes in the middle range tend to have slightly lower ratings.

Table5. Average rating by recipe complexity.

| complexity_bin | Mean rating | Count |
| --- | --- | --- |
| Very Simple | 4.64899 | 14123 |
| Simple | 4.6189 | 15884 |
| Medium-Simple | 4.61156 | 12633 |
| Medium-Complex | 4.61158 | 14496 |
| Complex | 4.61953 | 11992 |
| Very Complex | 4.64303 | 12044 |
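A minimal sketch of this aggregation, assuming the cleaned DataFrame is named df; pd.qcut is my assumption for how the six roughly equal-sized bins were produced:

```python
# Complexity score: number of steps plus number of ingredients.
df["complexity"] = df["n_steps"] + df["n_ingredients"]

labels = ["Very Simple", "Simple", "Medium-Simple",
          "Medium-Complex", "Complex", "Very Complex"]
df["complexity_bin"] = pd.qcut(df["complexity"], q=6, labels=labels)

# Mean rating and group size per complexity bin, as in Table 5.
print(df.groupby("complexity_bin", observed=True)["rating_avg"]
        .agg(["mean", "count"]))
```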

Imputation

I chose not to fill in any missing values for the average recipe rating (rating_avg). Instead, I dropped rows where the rating was missing. Since rating_avg is the value I aim to predict, filling in missing ratings with estimated values would create artificial data points that could bias the model and reduce the reliability of the analysis. Recipes with ratings outside the 1–5 scale or with missing ratings provide no useful information, and there is no accurate way to guess their ratings.

By doing this, I can make sure that all the data used is from genuine reviews. This helps me maintain the integrity of the model and avoids introducing noise or bias that could come from imputed target values.

Framing a Prediction Problem

The objective of this project is to understand what types of recipes tend to have higher ratings. I framed the problem as a regression task: predicting the average rating a recipe receives, which is a continuous numeric value.

The response variable is rating_avg, which represents the average of all ratings for each recipe. I chose this variable because it reflects how users evaluate recipes, and it aligns with my question about what features lead to higher-rated recipes.

The predictor variables are features that would be available at the time a recipe is created, including minutes, n_steps, n_ingredients, and the nutrition values. The exhaustive list can be found later in the Baseline Model section.

To evaluate the performance of my regression model, I used R² as the main metric, because it measures how well the model explains the variance in the target variable. I also used MSE (Mean Squared Error), RMSE (Root Mean Squared Error), and MAE (Mean Absolute Error) for a more complete picture: RMSE penalizes larger errors more heavily, and MAE is easy to interpret as the average prediction error.

Baseline Model

For my baseline model, I used a Linear Regression model to predict the average rating (rating_avg) for recipes. The features I included are all quantitative and available at the time of prediction: 'n_steps', 'minutes', 'n_ingredients', 'calories', 'total_fat', 'sugar', 'sodium', 'protein', 'saturated_fat', 'carbohydrates'. I applied StandardScaler to these quantitative features to put them on the same scale, ensuring comparable coefficients. Since I had no ordinal or nominal features, no encodings were needed.

Then I split the data into an 80% training set and a 20% test set, saving their indices in the DataFrame so the same split can be reused in the final model. This keeps the performance comparison about the models themselves rather than the data they were trained on. I trained the model and evaluated it with R², MSE, RMSE, and MAE, as mentioned earlier; a sketch of this setup is shown below.
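A sketch of the baseline setup, assuming the cleaned DataFrame is named df; the random_state is illustrative, not necessarily the one used in the project:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = ["n_steps", "minutes", "n_ingredients", "calories", "total_fat",
            "sugar", "sodium", "protein", "saturated_fat", "carbohydrates"]

# 80/20 split; the resulting indices can be saved and reused later.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["rating_avg"], test_size=0.2, random_state=42
)

# Standardize the quantitative features, then fit a linear regression.
baseline = make_pipeline(StandardScaler(), LinearRegression())
baseline.fit(X_train, y_train)

pred = baseline.predict(X_test)
mse = mean_squared_error(y_test, pred)
print(f"R²: {r2_score(y_test, pred):.5f}, MSE: {mse:.5f}, "
      f"RMSE: {np.sqrt(mse):.5f}, MAE: {mean_absolute_error(y_test, pred):.5f}")
```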

Baseline Model Performance:
R² Score: 0.00024
MSE: 0.40390
RMSE: 0.63554
MAE: 0.46698

I also performed a cross-validation (K=5) to assess the model’s generalizability. The results are shown below.

Cross-validation R² scores: ['-0.00003', '-0.00079', '0.00035', '0.00030', '0.00036']
Mean CV R² score: 0.00004
Standard deviation of CV R² scores: 0.00044

I also analyzed the coefficient (weight) of each feature in the baseline model, sorted by absolute value.

Table6. Feature coefficients in the baseline model.

| Feature | Coefficient |
| --- | --- |
| calories | 0.0536733 |
| carbohydrates | -0.0525195 |
| sugar | 0.0202813 |
| protein | -0.0148397 |
| total_fat | -0.0145259 |
| n_steps | 0.00574767 |
| saturated_fat | -0.00435153 |
| n_ingredients | -0.00300174 |
| minutes | -0.00126595 |
| sodium | -0.000759393 |

Among all features, calories had the largest positive coefficient and carbohydrates the largest negative coefficient with respect to average rating.

Below is the visualization of the Predicted Ratings versus Actual Ratings. The dashed line indicates a hypothetically perfect prediction model.

Figure5. The scatter plot of predicted and actual ratings for each recipe.

Although the baseline model can make predictions, the test R² and cross-validation R² scores are very low, showing limited predictive ability. The MSE, RMSE, and MAE are relatively low, indicating predictions land about 0.63554 points from the actual rating on a 1–5 scale. However, these results should be interpreted with caution: as Figure 1 shows, the ratings cluster around 4.75, so even a model that always predicts near the mean achieves small errors. Taken together with the mean and standard deviation of the cross-validation R² scores, this shows that although the model does not make large errors, it is not effectively capturing the underlying features that influence ratings. This implies that I need to tweak the model for improvement.

Final Model

Feature Engineering

To improve on the baseline model, I first transformed the original features. Specifically, I split them into two categories based on their skewness, since many features are right-skewed. For example, Figure 2 shows the distribution of the number of steps across all recipes, which is skewed to the right (a long tail of many-step recipes) and widely varied.

I also engineered new features to capture more nuanced relationships in the dataset: fat_calorie_ratio, time_per_step, steps_per_ingredient, and protein_carb_ratio (these also appear in the feature-importance table, Table7).

While the engineered features are correlated with existing features, which may cause multicollinearity and make model coefficients unstable, we can mitigate this effect by choosing models that are robust to multicollinearity or by applying dimensionality-reduction techniques (such as PCA) in linear models.

I applied a QuantileTransformer to the new features, which maps the feature values to an approximately Gaussian shape, improving their compatibility with linear models and PCA.
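Below is a minimal sketch of this feature engineering; the exact ratio formulas are my assumption, inferred from the feature names in Table7, and eps is a hypothetical guard against division by zero.

```python
from sklearn.preprocessing import QuantileTransformer

eps = 1e-6  # hypothetical guard against division by zero

# Assumed formulas, inferred from the feature names in Table 7.
df["fat_calorie_ratio"] = df["total_fat"] / (df["calories"] + eps)
df["time_per_step"] = df["minutes"] / (df["n_steps"] + eps)
df["steps_per_ingredient"] = df["n_steps"] / (df["n_ingredients"] + eps)
df["protein_carb_ratio"] = df["protein"] / (df["carbohydrates"] + eps)

new_features = ["fat_calorie_ratio", "time_per_step",
                "steps_per_ingredient", "protein_carb_ratio"]

# output_distribution="normal" maps each feature to a Gaussian shape.
qt = QuantileTransformer(output_distribution="normal")
df[new_features] = qt.fit_transform(df[new_features])
```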

Modeling Algorithm

I trained and tuned the following algorithms: Gradient Boosting, Random Forest, Ridge Regression, and Ridge Regression with PCA.

Model and Hyperparameter Selection

I used GridSearchCV (K=5) to find the best hyperparameters for each model, evaluated by R², and chose the model with the best score. Gradient Boosting performed best, with best parameters 'model__learning_rate': 0.1, 'model__max_depth': 3, 'model__n_estimators': 50. A sketch of the search is shown below.
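This sketch covers the Gradient Boosting pipeline only; the grid below includes the winning values plus neighboring candidates I assume were tried, and random_state is illustrative.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingRegressor(random_state=42)),
])

# Assumed candidate grid around the reported best parameters.
param_grid = {
    "model__learning_rate": [0.01, 0.1],
    "model__max_depth": [3, 5],
    "model__n_estimators": [50, 100],
}

# 5-fold cross-validated grid search, scored by R².
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```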

Model Performance

The plot below shows the predicted ratings versus the actual ratings.

Figure6. The scatter plot of predicted and actual ratings for each recipe.

The model achieves better performance than the baseline:

Final Model Performance:
R² Score: 0.00433
MSE: 0.40225
RMSE: 0.63423
MAE: 0.46458
Improvement over Baseline:
R² Score improvement: 0.00409
MSE improvement: 0.00165
RMSE improvement: 0.00131
MAE improvement: 0.00240

The R² has increased, and the MSE, RMSE, and MAE have decreased compared with the baseline model, indicating that the final model performs better. The feature engineering and the well-selected model help capture the underlying factors.

Below are the performances of other models with their best parameters:

RandomForest:

Best parameters for RandomForest: {'model__max_depth': 5, 'model__min_samples_split': 5, 'model__n_estimators': 50}
Best CV score for RandomForest: 0.0040
Test set performance for RandomForest:
R² Score: 0.00368
MSE: 0.40251
RMSE: 0.63444
MAE: 0.46510

Ridge Regression:

Best parameters for Ridge: {'model__alpha': 0.01}
Best CV score for Ridge: 0.0020
Test set performance for Ridge:
R² Score: 0.00238
MSE: 0.40304
RMSE: 0.63485
MAE: 0.46582

Ridge Regression with PCA:

Best parameters for PCA_Ridge: {'pca__n_components': 9, 'ridge__alpha': 1.0}
Best CV score for PCA_Ridge: 0.0018
Test set performance for PCA_Ridge:
R² Score: 0.00160
MSE: 0.40335
RMSE: 0.63510
MAE: 0.46627

I also analyzed the feature importance in the final model, using Gradient Boosting.

Table7. Sorted feature importance in the final model.

| Feature | Importance |
| --- | --- |
| sugar | 0.242528 |
| carbohydrates | 0.138349 |
| fat_calorie_ratio | 0.10783 |
| total_fat | 0.0757918 |
| protein | 0.0670013 |
| saturated_fat | 0.0622826 |
| time_per_step | 0.0573898 |
| steps_per_ingredient | 0.0506006 |
| n_ingredients | 0.0477057 |
| n_steps | 0.034544 |
| minutes | 0.034215 |
| sodium | 0.0298707 |
| calories | 0.0272974 |
| protein_carb_ratio | 0.0245936 |

Figure7. The importance of features in the Gradient Boosting model with the best hyperparameters.

The results show that the original features sugar and carbohydrates, as well as the engineered feature fat_calorie_ratio, play important roles in the model.

Discussion

In conclusion, our final model achieved improved performance over the baseline model. In future work, we could include categorical features such as recipe tags for better prediction and generalizability.