
Predict Podcast Listening Time

Capstone Executive Summary and Report
by Aishwarya Singh

Executive Summary

This project advances the prediction of podcast listening time to improve episode ranking, advertising pacing, and content investment decisions. Using machine learning, we reduced the average prediction error from 27.09 minutes to 13.11 minutes, a 51.6% improvement, with interpretable models built in scikit-learn. Validation R² reached 76.6%, further supporting the accuracy of the approach.

The analysis included robust data cleaning, careful handling of missing values, and comparison of several models. The Histogram Gradient Boosting model provided the most accurate and efficient results. The Decision Tree model performed closely, while the Random Forest model was less effective under the tested parameters.

Deployment of the Histogram Gradient Boosting model is recommended for the initial rollout, with ongoing monitoring of prediction accuracy. Further improvements can be achieved by enriching the feature set and applying cross-validation. This approach delivers measurable business impact, operational simplicity, and a clear foundation for future enhancements.

Introduction

Accurately predicting how long users will listen to podcast episodes is essential for streaming and audio platforms aiming to optimise content recommendations, allocate advertising inventory effectively, and guide content investment decisions. Listening time serves as a direct indicator of user engagement and has a measurable impact on monetisation opportunities.

The training data contains 750,000 rows and is a synthetic derivative of the original podcast listening time dataset. This ensures a robust foundation for model development while allowing for meaningful comparisons and validation.

This capstone project set out to develop a transparent and reproducible machine learning pipeline that delivers meaningful improvements over a simple average-based prediction. The focus was on using interpretable models available in scikit-learn, ensuring that the solution remains accessible, explainable, and aligned with best practices.

Methodology

Data Source: The dataset for this project was sourced from the Kaggle Playground Series competition “Predict Podcast Listening Time.” The data consists of synthetic derivatives of original podcast engagement records, designed to closely resemble real-world listening patterns while ensuring privacy and fairness in model evaluation.

| id | Podcast_Name | Episode_Title | Episode_Length_minutes | Genre | Host_Popularity_percentage | Publication_Day | Publication_Time | Guest_Popularity_percentage | Number_of_Ads | Episode_Sentiment | Listening_Time_minutes |
|----|--------------|---------------|------------------------|-------|----------------------------|-----------------|------------------|-----------------------------|---------------|-------------------|------------------------|
| 0 | Mystery Matters | Episode 98 | NaN | True Crime | 74.81 | Thursday | Night | NaN | 0.0 | Positive | 31.41998 |
| 1 | Joke Junction | Episode 26 | 119.80 | Comedy | 66.95 | Saturday | Afternoon | 75.95 | 2.0 | Negative | 88.01241 |
| 2 | Study Sessions | Episode 16 | 73.90 | Education | 69.97 | Tuesday | Evening | 8.97 | 0.0 | Negative | 44.92531 |
| 3 | Digital Digest | Episode 45 | 67.17 | Technology | 57.22 | Monday | Morning | 78.70 | 2.0 | Positive | 46.27824 |
| 4 | Mind & Body | Episode 86 | 110.51 | Health | 80.07 | Monday | Afternoon | 58.68 | 3.0 | Neutral | 75.61031 |

Training Data Preview
| id | Podcast_Name | Episode_Title | Episode_Length_minutes | Genre | Host_Popularity_percentage | Publication_Day | Publication_Time | Guest_Popularity_percentage | Number_of_Ads | Episode_Sentiment |
|----|--------------|---------------|------------------------|-------|----------------------------|-----------------|------------------|-----------------------------|---------------|-------------------|
| 750000 | Educational Nuggets | Episode 73 | 78.96 | Education | 38.11 | Saturday | Evening | 53.33 | 1.0 | Neutral |
| 750001 | Sound Waves | Episode 23 | 27.87 | Music | 71.29 | Sunday | Morning | NaN | 0.0 | Neutral |
| 750002 | Joke Junction | Episode 11 | 69.10 | Comedy | 67.89 | Friday | Evening | 97.51 | 0.0 | Positive |
| 750003 | Comedy Corner | Episode 73 | 115.39 | Comedy | 23.40 | Sunday | Morning | 51.75 | 2.0 | Positive |
| 750004 | Life Lessons | Episode 50 | 72.32 | Lifestyle | 58.10 | Wednesday | Morning | 11.30 | 2.0 | Neutral |

Test Data Preview

Three files were provided: a training set with the target variable Listening_Time_minutes, a test set without the target, and a sample submission template. The training data contains 750,000 rows and includes episode details, popularity metrics, genre, publication timing, sentiment, and advertisement counts. The test set contains 250,000 rows. Only the training set was used for model selection, with the test set reserved for final inference and submission.

Target & Metric: The target variable for prediction is Listening_Time_minutes, which measures how long a user listens to a podcast episode. Model performance was evaluated using Root Mean Squared Error (RMSE), which quantifies the average prediction error in minutes and penalises larger mistakes more heavily. The baseline for comparison was the mean listening time from the training set.
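
As a minimal sketch, the metric and the mean baseline can be expressed as follows (toy listening times, not the competition data):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error, in the same units as the target (minutes)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy listening times in minutes; the real target is Listening_Time_minutes.
y_val = np.array([30.0, 60.0, 45.0, 90.0])

# Baseline: always predict the mean listening time.
baseline_pred = np.full_like(y_val, y_val.mean())
print(f"baseline RMSE: {rmse(y_val, baseline_pred):.2f} min")
```

Because the errors are squared before averaging, a single 30-minute miss raises RMSE far more than several 5-minute misses, which is why the metric suits a business cost that grows with large mistakes.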

Modelling Approach: The modelling process advanced in clear, purposeful steps from simple to more sophisticated methods, each chosen for interpretability and efficiency. This staged approach ensured that extra complexity was only introduced when it provided a measurable reduction in error and a clear business benefit:

  • Baseline Mean Predictor: Served as the control, highlighting the error from a basic approach and setting a clear benchmark for improvement.
  • Decision Tree: Enabled splits on important feature thresholds such as episode length or popularity. Each split represents a straightforward decision rule, making the model easy to interpret.
  • Random Forest: Combined many shallow trees to reduce random noise and produce more stable predictions. This approach improved consistency while keeping runtime manageable.
  • Histogram Gradient Boosting: Built trees sequentially, each one focusing on correcting the errors of the previous. This method achieved strong accuracy quickly, handled mixed feature types well, and remained efficient for deployment.

Exploratory Data Analysis

Preprocessing: The raw training file contained a mix of numeric fields (counts and durations) and categorical fields (genre, publication day, and sentiment labels). We removed the identifier column, since it carries no predictive signal; filled missing numeric values with the median of each column; and filled missing categorical values with the most common entry, so that no rows were lost. Categorical fields were converted into model-friendly indicator columns while keeping the structure consistent between training and validation splits. Because all models were tree-based, no standardisation or scaling was needed, which keeps the pipeline simpler and faster to run.
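
A condensed version of these preprocessing steps, applied to a tiny illustrative frame (the cell values are made up):

```python
import pandas as pd

# Tiny frame mimicking the raw columns.
df = pd.DataFrame({
    "id": [0, 1, 2],
    "Episode_Length_minutes": [None, 119.8, 73.9],
    "Genre": ["True Crime", "Comedy", None],
})

df = df.drop(columns="id")  # the identifier carries no predictive signal

# Median for numeric gaps, most common entry for categorical gaps: no rows lost.
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
for col in df.select_dtypes(exclude="number"):
    df[col] = df[col].fillna(df[col].mode()[0])

# Indicator columns for categories; tree models need no scaling afterwards.
df = pd.get_dummies(df, columns=["Genre"])
print(df.columns.tolist())
```

In the full pipeline the median and mode would be computed on the training split only and then reused on the validation split, so that no information leaks across the split boundary.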

Exploratory Analysis:

Listening Time Distribution: We reviewed basic distributions to check for extreme outliers and compared average listening time across key categories such as genre, publication day, and sentiment.

Figure: Listening Time Distribution

Feature Collinearity: We inspected relationships and correlations among the most influential numeric fields to check for redundancy.

Figure: Listening Time vs Episode Length Scatterplot
Figure: Feature Correlation Heatmap

Train vs Test Dataset Comparison: We also overlaid training and test distributions to confirm they were broadly aligned, reducing the risk that the model would face a very different pattern at prediction time. No material shifts or data quality concerns emerged so we proceeded without heavy feature pruning.
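
One simple way to quantify this alignment numerically is to compare per-bin histogram densities for the same feature in both sets (synthetic stand-in data, not the real feature columns):

```python
import numpy as np

rng = np.random.default_rng(1)
train_lengths = rng.normal(70, 20, 5000)  # stand-in for a train feature
test_lengths = rng.normal(70, 20, 2000)   # stand-in for the same test feature

# Histograms on shared bins; small per-bin gaps suggest aligned distributions.
bins = np.linspace(0, 150, 31)
train_hist, _ = np.histogram(train_lengths, bins=bins, density=True)
test_hist, _ = np.histogram(test_lengths, bins=bins, density=True)
max_gap = np.abs(train_hist - test_hist).max()
print(f"largest per-bin density gap: {max_gap:.4f}")
```

Overlaying the same two histograms visually, as done in the figure below, makes any material shift immediately apparent.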

Figure: Alignment between Train and Test Distributions

Training the Models

Baseline: A simple mean predictor, which always forecasts the average listening time, established an anchor error of 27.09 minutes RMSE. All subsequent models are judged against this reference, both in absolute minutes reduced and in percentage improvement.

Decision Tree: A single depth-controlled tree captured key non-linear thresholds, such as episode length and popularity splits, and reduced RMSE to 13.20 minutes, a drop of 13.89 minutes (51.3%) versus the baseline. This large first step showed that most reducible variance could be explained with a small set of hierarchical rules.

Random Forest: An ensemble of shallow trees produced an RMSE of 16.65 minutes. This is still 10.44 minutes (38.6%) better than the baseline, but 3.45 minutes worse than the single Decision Tree at 13.20 minutes. Conservative, runtime-focused parameters smoothed important sharp splits and weakened accuracy relative to the best tree.

Histogram Gradient Boosting: Sequential boosting refined residual errors and achieved the best RMSE at 13.11 minutes, 13.98 minutes (51.6%) lower than the baseline, edging the Decision Tree by 0.09 minutes and outperforming the Random Forest by 3.54 minutes. The marginal gain over the tree indicates diminishing returns, yet confirms that targeted residual correction adds a small further improvement without material complexity overhead.

Model Comparison (Validation RMSE):

| Model | RMSE (min) | Δ vs Baseline (min) | % Improvement |
|-------|------------|---------------------|---------------|
| Baseline mean | 27.09 | - | - |
| Decision Tree | 13.20 | -13.89 | 51.3% |
| Random Forest | 16.65 | -10.44 | 38.6% |
| HistGradientBoosting | 13.11 | -13.98 | 51.6% |
Figure: Validation RMSE Comparison

Interpretation: Both the Decision Tree and Histogram Gradient Boosting halved the baseline error. HistGradientBoosting narrowly led, with an advantage of about 0.09 minutes, suggesting most reducible variance was captured by the depth-controlled tree structure rather than by extensive ensembling. The Random Forest underperformed relative to expectations, likely due to conservative depth and leaf constraints chosen for runtime reasons.

Generating Test Predictions

Following model validation, the Histogram Gradient Boosting approach was selected for its combination of accuracy and efficient processing. To maximise predictive power, the model was retrained using the entire labelled dataset, ensuring that all available patterns were captured before generating final predictions.

This fully trained model was then applied to the separate test set of 250,000 episodes, producing individual listening time estimates for each row. These predictions were compiled into the required submission format, including the unique identifier and the predicted listening time in minutes.

By automating this final step and avoiding manual adjustments, the process remains fully reproducible and transparent. The resulting predictions are ready for direct use in episode ranking, advertising pacing, and further business evaluation. Future improvements will build upon this established baseline, supporting ongoing optimisation.

| id | Listening_Time_minutes |
|--------|-------|
| 750000 | 56.10 |
| 750001 | 18.00 |
| 750002 | 49.26 |
| 750003 | 79.93 |
| 750004 | 48.86 |

Submission File Preview

Conclusion

This project successfully delivered a transparent and reproducible machine learning pipeline that substantially improved the accuracy of podcast listening time predictions. By reducing the average error from 27.09 minutes to 13.11 minutes, the solution demonstrates clear value for optimising episode ranking, advertising strategies, and content investment decisions.

The recommended Histogram Gradient Boosting model offers a strong balance between predictive accuracy and operational simplicity, making it well-suited for deployment in a production environment. Its performance, coupled with efficient training requirements, ensures the approach remains practical and scalable.

Key drivers of predictive power were episode length, popularity measures, and a subset of sentiment and genre indicators. Moderate correlation among numeric features did not inflate variance, as evidenced by the narrow gap between the single-tree and boosted-model errors, which indicates minimal overfitting under the chosen constraints.

Looking ahead, further gains can be realised by implementing k-fold cross-validation for more robust performance reporting, exploring targeted feature engineering, and monitoring for prediction drift post-deployment. These steps will help maintain accuracy and adapt the model as new data becomes available.

Overall, the project achieves its objectives of clarity and measurable improvement, providing a solid foundation for ongoing experimentation and future enhancements.