House Prices Prediction - Harvard Certificate
Regression
Ensemble Learning
R
Harvard
Harvard Data Science Certificate final project - Regression model comparison
Context & Problem
Question: Can house sale prices in Ames, Iowa be predicted from the houses' characteristics?
This project is the final assessment (capstone) for the HarvardX Professional Certificate in Data Science. It is based on a classic Kaggle challenge, which makes it a convenient benchmark for comparing several regression approaches on the same data.
Dataset
Ames Housing Dataset:
- ~1,500 houses in Ames, Iowa (USA); the Kaggle training set has 1,460 rows
- 79 explanatory features
- Target: SalePrice (sale price in $)
Key Variables
| Category | Variables |
|---|---|
| Area | GrLivArea, TotalBsmtSF, GarageArea |
| Quality | OverallQual, OverallCond, ExterQual |
| Location | Neighborhood, MSZoning |
| Age | YearBuilt, YearRemodAdd |
| Amenities | FullBath, BedroomAbvGr, Fireplaces |
Methodology
Models Tested
- Linear Regression (baseline)
- Random Forest
- XGBoost
- GAM (Generalized Additive Models)
- Neural Networks
- Ensemble (combination of the best models)
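To make the comparison protocol concrete, here is a minimal sketch of k-fold cross-validation for the linear baseline. It uses base R only so that it runs without extra packages (the project itself used caret), and it runs on simulated data, not the Ames data. The helper `cv_rmse`, the simulated coefficients, and the choice of 5 folds are assumptions made for illustration.

```r
# Minimal sketch: 5-fold cross-validated RMSE for a linear baseline.
# The data are simulated for illustration; they are NOT the Ames data.
set.seed(42)
n <- 200
houses <- data.frame(
  GrLivArea   = round(runif(n, 600, 3000)),
  OverallQual = sample(1:10, n, replace = TRUE)
)
houses$SalePrice <- 20000 + 60 * houses$GrLivArea +
  15000 * houses$OverallQual + rnorm(n, sd = 20000)

# Hypothetical helper: returns the RMSE measured on each held-out fold.
cv_rmse <- function(data, formula, k = 5) {
  # Randomly assign every row to one of k folds of (nearly) equal size.
  folds  <- sample(rep(seq_len(k), length.out = nrow(data)))
  target <- all.vars(formula)[1]
  sapply(seq_len(k), function(i) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    fit   <- lm(formula, data = train)
    pred  <- predict(fit, newdata = test)
    sqrt(mean((test[[target]] - pred)^2))
  })
}

fold_rmse <- cv_rmse(houses, SalePrice ~ GrLivArea + OverallQual)
mean(fold_rmse)  # cross-validated RMSE, in the same units as SalePrice
```

Swapping `lm` for another learner inside the helper is what would let every model be scored on the same folds, which is the point of a like-for-like comparison.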
Results
Model Comparison
| Model | RMSE (CV, $) | R² | Rank |
|---|---|---|---|
| Linear Regression | 34,521 | 0.82 | 6 |
| Random Forest | 28,934 | 0.87 | 3 |
| XGBoost | 27,156 | 0.89 | 2 |
| GAM | 29,845 | 0.86 | 4 |
| Neural Network | 31,234 | 0.84 | 5 |
| Ensemble | 26,012 | 0.90 | 1 |
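For reference, the two metrics in the table can be sketched as below. The source does not state which definition of R² was reported, so the 1 − SSE/SST form used here is an assumption and may differ from what the project's tooling computes. The input values are made up to keep the arithmetic easy to check by hand.

```r
# Root mean squared error: typical prediction error in the target's units.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R squared as 1 - SSE/SST (assumed definition, see the note above).
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Toy values, made up for illustration: every prediction is off by $10,000.
actual    <- c(200000, 150000, 300000, 250000)
predicted <- c(210000, 140000, 290000, 260000)

rmse(actual, predicted)       # 10000
r_squared(actual, predicted)  # 0.968
```

Because RMSE is expressed in dollars, it reads directly as a typical pricing error, while R² is unitless and easier to compare across datasets.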
Most Important Variables
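According to the XGBoost and Random Forest feature-importance analyses:
- OverallQual: Overall material and finish quality of the house
- GrLivArea: Above-ground living area (square feet)
- TotalBsmtSF: Total basement area (square feet)
- GarageCars: Garage capacity (number of cars)
- YearBuilt: Construction year
- Neighborhood: Location within Ames
- TotalBath: Total number of bathrooms (an engineered feature, not an original column)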
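`TotalBath` is not one of the original Ames columns, so it has to be derived from the bathroom counts. The sketch below shows one common way to build such a feature. The exact formula used in this project is not given in the source, so counting a half bath as 0.5 is an assumption, and the rows are made-up toy values.

```r
# Toy rows with made-up values; the column names are those of the Ames data.
houses <- data.frame(
  FullBath     = c(2, 1, 2),
  HalfBath     = c(1, 0, 0),
  BsmtFullBath = c(1, 0, 1),
  BsmtHalfBath = c(0, 1, 0)
)

# Assumed definition: full baths count as 1, half baths as 0.5,
# above ground and in the basement alike.
houses$TotalBath <- with(
  houses,
  FullBath + 0.5 * HalfBath + BsmtFullBath + 0.5 * BsmtHalfBath
)

houses$TotalBath  # 3.5 1.5 3.0
```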
Technologies
| Component | Technology |
|---|---|
| Language | R |
| Data Wrangling | tidyverse (dplyr, tidyr) |
| ML Framework | caret |
| Models | lm, randomForest, xgboost, mgcv (GAM), nnet |
| Visualization | ggplot2 |
| Documentation | RMarkdown |
Learnings
This certification project allowed me to:
- Master the complete ML workflow: from EDA to Kaggle submission
- Rigorously compare models: cross-validation and multiple metrics
- Understand the importance of feature engineering: new variables significantly improved performance
- Discover ensembles: combining models often outperforms any individual model
- Practice R in depth: tidyverse, caret, xgboost, mgcv
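The ensemble idea can be illustrated with a minimal sketch that averages the predictions of two models on a hold-out set. The source does not say how the project's ensemble was combined, so the equal-weight average is an assumption. The two base-R models are simple stand-ins for the stronger learners listed above, and the data are simulated.

```r
# Simulated data for illustration; NOT the Ames data.
set.seed(1)
n <- 300
houses <- data.frame(
  GrLivArea   = runif(n, 600, 3000),
  OverallQual = sample(1:10, n, replace = TRUE)
)
houses$SalePrice <- 20000 + 60 * houses$GrLivArea +
  15000 * houses$OverallQual + rnorm(n, sd = 20000)

# Simple hold-out split: 200 rows for training, 100 for evaluation.
train_idx <- sample(n, 200)
train <- houses[train_idx, ]
test  <- houses[-train_idx, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Two stand-in base models (assumed for the sketch).
fit_linear <- lm(SalePrice ~ GrLivArea + OverallQual, data = train)
fit_poly   <- lm(SalePrice ~ poly(GrLivArea, 2) + poly(OverallQual, 2),
                 data = train)

pred_linear <- predict(fit_linear, newdata = test)
pred_poly   <- predict(fit_poly, newdata = test)

# Equal-weight ensemble: average the two predictions row by row.
pred_ensemble <- (pred_linear + pred_poly) / 2

results <- c(
  linear     = rmse(test$SalePrice, pred_linear),
  polynomial = rmse(test$SalePrice, pred_poly),
  ensemble   = rmse(test$SalePrice, pred_ensemble)
)
results
```

An equal-weight average can never have a higher mean squared error than the average of its members' mean squared errors, but it is not guaranteed to beat the best single member. That is why ensembles tend to pay off when the base models make different kinds of mistakes.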
Certification
This project is the capstone of the Professional Certificate in Data Science from HarvardX, which covers:
- R Basics, Visualization, Probability
- Inference and Modeling
- Productivity Tools, Wrangling
- Linear Regression, Machine Learning
- Capstone (this project)