Exploratory Data Analysis and Regression:

Air quality dataset analysis and humidity prediction.


Brief Introduction

This project is to analyze an air quality dataset. The data was obtained along a main road in an Italian city over 13 months.

The project goal is to use both linear and non-linear models to predict the absolute humidity by giving a temperature. And also compare the prediction results from partially cleaned data and mostly cleaned data. Matlab is the tool used in this project.



Steps

1. Data preparation: remove missing values and duplicates.

2. 1st time regression analysis and make predictions.

3. Further clean data: find outliers and remove outliers.

4. 2nd time regression analysis and make predictions.

5. Compare the 1st and 2nd predictions.



Outcomes

- 1st time Linear Regression:

The coefficient of determination (r square) is 0.3358, which is close to 0. Therefore, this is not a good fit.

The best fit line is: humidity = 0.3152 + 0.03621 * temperature

Predicted humidity is 1.58255 when temperature is 35 degrees.

Image

- 1st time Nonlinear Regression (2nd degree polynomial model):

R-square is 0.3443.

The equation is: humidity = 0.001002 * temperature^2 + 0.003151 * temperature + 0.5659

Predicted humidity is 1.9036 when temperature is 35 degrees.

Image

- Find outliers and remove outliers:

Use built-in function ‘boxplot()’ to find Upper Adjacent and Lower Adjacent.

Image

- 2nd time Linear Regression:

R-square is 0.9332, which is close to 1. Therefore, this is a fairly good fit.

The equation is:humidity = 0.0332 * temperature + 0.3597

Predicted humidity is 1.5217 when temperature is 35 degrees

Image

- 2nd time Nonlinear Regression (2nd Degree Polynomial Model):

R-square is 0.9429.

Model equation is: humidity = 0.0007089 * temperature^2 + 0.01141 * temperature + 0.5154

Predicted humidity is 1.7832 when temperature is 35 degrees

Image

Removing outliers gives a significant improvement in both linear and non-linear models. Predictions can be very different from partially cleaned data and mostly cleaned data, it is more obvious in the non-linear models.