Last time I set up the environment and solved a quick and easy example to test the waters. It's time to tackle a harder problem. Fortunately for me, scikit-learn comes with a few built-in datasets that I can play with. I want to practice what I learned at the beginning of the course, so I will try to solve something with Linear Regression first.
I looked at the datasets present in the library and decided that Boston house pricing might be a good place to start.
Quick and dirty
One of the things that prof. Ng mentioned a few times during the course was getting some dirty solution up and running quickly, before trying anything else. With a working solution, we can find areas to focus on, instead of wasting time doing unnecessary stuff. Let's quickly fit a simple linear regression and see how it goes. First, I'll split the data between train and test sets, and then I'll calculate scores for both. I also pass in the random_state argument, so the split is performed in the same way every time.
```python
import sklearn.datasets as ds
import sklearn.model_selection as md

data = ds.load_boston()
X_train, X_test, y_train, y_test = md.train_test_split(
    data.data, data.target, random_state=1)

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)
print("Train score: " + str(reg.score(X_train, y_train)))
print("Test score: " + str(reg.score(X_test, y_test)))
```
```
Train score: 0.7167286808673383
Test score: 0.7790257749137512
```
Now we can work on making the score better!
One of the ways to 'debug' the algorithm that I learned was plotting learning curves. When I was doing the homework, I had to save J (the error) during each learning step. Fortunately, scikit comes with a neat function that does everything for you - learning_curve. The function fits your model on increasingly large subsets of the training set and computes the train and cross-validation scores at each step. By default it uses 5 steps, but I increased that to 10 by passing the train_sizes argument, so the curve looks smoother.
```python
from sklearn.model_selection import learning_curve
from matplotlib import pyplot
import numpy as np

train_sizes, train_scores, cv_scores = learning_curve(
    reg, X_train, y_train, train_sizes=np.linspace(.1, 1.0, 10))
train_scores_mean = np.mean(train_scores, axis=1)
cv_scores_mean = np.mean(cv_scores, axis=1)

pyplot.ylim((-1, 1.0))
pyplot.plot(train_sizes, train_scores_mean, 'ro-', label="Train score")
pyplot.plot(train_sizes, cv_scores_mean, 'go-', label="CV score")
pyplot.legend()
pyplot.show()
```
This plot tells me two things:
- Train and CV score converge, so it's not a high variance problem
- The score of both is pretty low, so it might be a high bias problem (see the quick check below)
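To make that reading concrete, here's a small sketch of the same check in code, using the score means computed above. The 0.1 gap and 0.8 score thresholds are arbitrary values I picked for illustration, not anything that comes from scikit-learn:

```python
# Rough bias/variance check on the last point of the learning curve.
# The thresholds (0.1 and 0.8) are arbitrary, for illustration only.
gap = train_scores_mean[-1] - cv_scores_mean[-1]
if gap > 0.1:
    print("Train and CV scores stay far apart -> looks like high variance")
elif train_scores_mean[-1] < 0.8:
    print("Scores converge but stay low -> looks like high bias")
else:
    print("Scores converge and are high -> looking good")
```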
A high bias problem usually stems from the model not fitting the data well. One of the simplest things we can do to fix it is adding some polynomial features, which will allow the model to fit our data better.
Scikit comes with a neat transformer that adds the features for us - PolynomialFeatures. To get a feel for what it produces, here's a quick standalone example first (degree 2 on a single two-feature sample, to keep the output small):
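```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features: [a, b] = [2, 3].
# degree=2 produces the bias term plus a, b, a^2, a*b, b^2.
X = np.array([[2, 3]])
print(PolynomialFeatures(degree=2).fit_transform(X))
# [[1. 2. 3. 4. 6. 9.]]
```

Now let's modify our data preparation step: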
```python
...
import sklearn.model_selection as md
from sklearn.preprocessing import PolynomialFeatures

data = ds.load_boston()
data.data = PolynomialFeatures(degree=3).fit_transform(data.data)
X_train, X_test, y_train, y_test = md.train_test_split(
    data.data, data.target, random_state=1)
...
```
And run the code again:
```
Train score: 1.0
Test score: -690.9819870039847
```
The score and the learning curve tell me that we moved from having a high bias problem to having a high variance one. The model fits the training data so well that it fails miserably when it comes to the test data.
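It's not surprising if you count the features: PolynomialFeatures with degree 3 turns the 13 original Boston columns into C(13 + 3, 3) = 560 columns, which is more than the roughly 380 training samples (Boston has 506 rows, and a quarter of them goes to the test set), so the model has enough freedom to fit the training set almost perfectly. A quick sanity check:

```python
from math import comb

# Number of polynomial features (including the bias column) for
# n original features and degree d is C(n + d, d).
print(comb(13 + 3, 3))  # 560
```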
To fix our current problem we can use regularization, but for it to work properly, we need to scale the features first. Regularization penalizes each coefficient with the same "force", so having features in different ranges or orders of magnitude will cause problems. I'll use StandardScaler, which removes the mean and scales each feature to unit variance. I'll fit the scaler on the training data and then use it to transform the test data as well.
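Under the hood the transformation is just (x - mean) / std, computed per column. A minimal sketch to convince ourselves, on a made-up toy matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales (toy data for illustration).
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

scaled = StandardScaler().fit_transform(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(scaled, manual))  # True
```

Back to our script: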
```python
...
X_train, X_test, y_train, y_test = md.train_test_split(
    data.data, data.target, random_state=1)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.linear_model import LinearRegression
...
```
Now that our data is prepared properly, we can add regularization. To add it to our code, we need to switch from LinearRegression to the Ridge regressor and provide it with an alpha parameter, which controls how strongly the coefficients are penalized. Fortunately, instead of guessing the value ourselves (and wasting time testing different ones), the library can do it for us: RidgeCV is a version of the Ridge regressor that picks the best alpha via cross-validation.
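Conceptually, that's not magic; it's roughly the loop below: cross-validate a plain Ridge model for each candidate alpha and keep the best one. (This is only a sketch; the real RidgeCV defaults to an efficient leave-one-out scheme rather than this explicit loop.)

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Rough equivalent of what RidgeCV automates for us.
best_alpha, best_score = None, float("-inf")
for alpha in [0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000]:
    score = cross_val_score(Ridge(alpha=alpha), X_train, y_train).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score
print(best_alpha)
```

In our script, all we need to do is provide RidgeCV with a few alpha values to try: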
```python
...
X_test = scaler.transform(X_test)

from sklearn.linear_model import RidgeCV

alphas = [0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000]
reg = RidgeCV(alphas)
reg.fit(X_train, y_train)
print("Train score: " + str(reg.score(X_train, y_train)))
print("Test score: " + str(reg.score(X_test, y_test)))
print("Selected alpha: " + str(reg.alpha_))

from sklearn.model_selection import learning_curve
...
```
```
Train score: 0.9347137854004398
Test score: 0.9294740071777031
```
I think those results are pretty good.
Another thing that would help with our overfitting problem is gathering more data. That's not possible in this case, but it can be really helpful in the real world.