Looking at how adding an extra feature with varying levels of relevance affects the r2_score

[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
[2]:
X, y, coef = make_regression(n_samples=100,
                             n_features=10,
                             n_informative=5,
                             n_targets=1,
                             bias=0.0,
                             effective_rank=None,
                             tail_strength=0.5,
                             noise=100,
                             shuffle=True,
                             coef=True,
                             random_state=42)
[3]:
pd.DataFrame(coef, index=[f"feat_{x}" for x in range(0, coef.shape[0])]).T
[3]:
feat_0 feat_1 feat_2 feat_3 feat_4 feat_5 feat_6 feat_7 feat_8 feat_9
0 16.748258 0.0 0.0 63.643025 0.0 70.647573 0.0 10.456784 3.158614 0.0
[4]:
df = (pd.DataFrame(X, columns=[f"feat_{x}" for x in range(0, X.shape[1])])
.merge(pd.DataFrame(y, columns=["target"]),
      left_index=True,
      right_index=True))
[5]:
df.head()
[5]:
feat_0 feat_1 feat_2 feat_3 feat_4 feat_5 feat_6 feat_7 feat_8 feat_9 target
0 -0.926930 -1.430141 1.632411 -3.241267 -1.247783 -1.024388 0.130741 -0.059525 -0.252568 -0.440044 -186.494628
1 0.202923 0.334457 0.285865 1.547505 -0.387702 1.795878 2.010205 -1.515744 -0.612789 0.658544 191.976107
2 -0.241236 0.456753 0.342725 -1.251539 1.117296 1.443765 0.447709 0.352055 -0.082151 0.569767 315.503594
3 0.289775 -1.008086 -2.038125 0.871125 -0.408075 -0.326024 -0.351513 2.075401 1.201214 -1.870792 100.185659
4 -0.007973 -0.190339 -1.037246 0.077368 0.538910 -0.861284 -1.382800 1.479944 1.523124 -0.875618 -40.813080

It is interesting to see how choosing a feature with a high coefficient vs. one with a coefficient of 0 affects the outcome
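
To pick the two kinds of feature programmatically rather than by eye, the coefficients returned by make_regression can be turned into a labelled Series. A minimal sketch (assuming the coef array and the feat_* naming used above):

    # Hypothetical helper: identify the strongest and a zero-coefficient feature
    coef_series = pd.Series(coef, index=[f"feat_{x}" for x in range(coef.shape[0])])
    strongest_feature = coef_series.idxmax()                      # "feat_5" in this run
    irrelevant_feature = coef_series[coef_series == 0].index[0]   # e.g. "feat_1"
    print(strongest_feature, irrelevant_feature)

The cells below use the strongest feature, feat_5, as the base for the noisy copies.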

[6]:
edit_feature = "feat_5"
[7]:
new_feat_df = df[[edit_feature]].copy()  # copy so new columns can be added without a SettingWithCopyWarning
[8]:
## Create one noisy copy of the chosen feature for each noise level (std 0.1 to 99.9)
[9]:
np.random.seed(42)
# For each noise level, add the chosen feature plus Gaussian noise with that standard deviation
for i in np.arange(0.1, 100, 0.1):
    new_feat_df[f"extra_feat_{round(i, 2)}"] = new_feat_df[edit_feature] + np.random.normal(0, i, 100)
[10]:
new_feat_df.head()
[10]:
feat_5 extra_feat_0.1 extra_feat_0.2 extra_feat_0.3 extra_feat_0.4 extra_feat_0.5 extra_feat_0.6 extra_feat_0.7 extra_feat_0.8 extra_feat_0.9 ... extra_feat_99.0 extra_feat_99.1 extra_feat_99.2 extra_feat_99.3 extra_feat_99.4 extra_feat_99.5 extra_feat_99.6 extra_feat_99.7 extra_feat_99.8 extra_feat_99.9
0 -1.024388 -0.974716 -1.307462 -0.917051 -1.355986 -1.821601 -0.468681 -0.494496 -1.442566 -0.179932 ... -16.004939 133.364347 -58.934818 116.415540 124.949513 10.956874 152.970119 -61.107173 -140.130109 9.020166
1 1.795878 1.782051 1.711749 1.964113 1.571805 1.496190 2.941528 1.150362 2.635085 1.331437 ... 122.317131 -51.718905 72.445362 -19.581405 -24.721140 19.987382 -32.259533 32.857998 -53.048834 -12.343040
2 1.443765 1.508533 1.375222 1.768680 1.742682 1.446386 0.604624 2.052489 0.880290 1.530273 ... -29.645214 -1.617802 -15.752390 62.461780 -33.942097 34.128823 94.635425 -103.690008 73.000333 -119.124344
3 -0.326024 -0.173721 -0.486479 -0.009883 -0.081875 -0.302533 0.011758 0.622923 -1.452793 -0.742071 ... 7.551026 -92.430132 41.112141 31.240235 17.735043 210.924356 51.918244 0.612539 39.567114 -123.228660
4 -0.861284 -0.884700 -0.893541 -1.274585 -0.869645 -1.086317 -1.251670 -0.571880 -2.106588 -1.252331 ... -24.106935 -5.330764 82.459494 54.121763 7.279049 153.839735 -196.280966 159.187931 27.495354 19.974685

5 rows × 1000 columns
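
Each extra_feat column is just feat_5 plus zero-mean Gaussian noise, so its correlation with the original feature (and therefore its relevance to the target) falls as the noise standard deviation grows. A quick sanity check, not part of the run above, assuming the new_feat_df and edit_feature defined earlier:

    # Correlation of a few noisy copies with the original feature
    for col in ["extra_feat_0.1", "extra_feat_10.0", "extra_feat_99.9"]:
        print(col, round(new_feat_df[edit_feature].corr(new_feat_df[col]), 3))

The correlation should be close to 1 for the smallest noise level and close to 0 for the largest.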

Show the fit and r2 score of the original data frame, without any extra noise-based feature added

[11]:
X = df.filter(regex="feat")
y = df["target"]
[12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
[13]:
model = LinearRegression()
[14]:
_ = model.fit(X_train, y_train)
[15]:
y_pred = model.predict(X_test)
[16]:
_ = sns.regplot(x=y_test, y=y_pred)
_ = plt.text(x=150, y=-200, s=f"r_square = {round(r2_score(y_test, y_pred), 3)}")
_images/uninformative_feature_effect_on_R2_18_0.png
[17]:
original_r2_score = r2_score(y_test, y_pred)

Run the same regression as above but with a different noise feature added each time

(thus seeing how adding a feature with more or less noise affects the r2 score of the model)

[18]:
noise_features_list = new_feat_df.drop(edit_feature, axis=1).columns.tolist()
[19]:
all_r2_df = pd.DataFrame()
# Refit the regression once per noisy feature, appending that single extra column to the original features
for noise_feat in noise_features_list:

    X = df.filter(regex="feat")
    X = pd.concat([X, new_feat_df[[noise_feat]]], axis=1)
    y = df["target"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LinearRegression()

    _ = model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    current_r2_df = pd.DataFrame({"added_feature": noise_feat,
                                  "noise_std": float(noise_feat.split("_")[-1]),
                                  "r2": r2_score(y_test, y_pred)}, index=[0])
    all_r2_df = pd.concat([all_r2_df, current_r2_df])

Show an example of one of the X data frames from the loop, to compare against the initial data frame used to obtain the r2 score shown in the scatter plot above

[20]:
df.filter(regex="feat").head(1)
[20]:
feat_0 feat_1 feat_2 feat_3 feat_4 feat_5 feat_6 feat_7 feat_8 feat_9
0 -0.92693 -1.430141 1.632411 -3.241267 -1.247783 -1.024388 0.130741 -0.059525 -0.252568 -0.440044
[21]:
X.head(1)
[21]:
feat_0 feat_1 feat_2 feat_3 feat_4 feat_5 feat_6 feat_7 feat_8 feat_9 extra_feat_99.9
0 -0.92693 -1.430141 1.632411 -3.241267 -1.247783 -1.024388 0.130741 -0.059525 -0.252568 -0.440044 9.020166

Plot all the r2 scores

[22]:
_ = sns.scatterplot(data=all_r2_df, x="noise_std", y="r2")
_ = sns.lineplot(data=all_r2_df, x="noise_std", y="r2")
_ = plt.axhline(y=original_r2_score, c="red", label="Original r2 score, before adding an extra feature")
_ = plt.legend()
_images/uninformative_feature_effect_on_R2_27_0.png

Initial multiple choice question:

Adding a non-important feature to a linear regression model may result in:

1. Increase in R-square
2. Decrease in R-square

Answer: Only 1 is correct

Based on the r2 vs noise_std plot, it seems like the answer hinges on how one defines “non-important to the model”.

So whilst in theory, based on how the r2 score is calculated, adding an extra feature can never decrease the r-squared on the data the model was fitted to (which is exactly why adjusted r-squared exists), the plot above measures r2 on a held-out test set, where a sufficiently noisy extra feature clearly can decrease it. The answer therefore hinges both on what is considered “non-important” and on which r2 is being measured, so I would suggest rewording the multiple choice question to reflect this.
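
To make that distinction concrete, the sketch below (separate from the notebook run above, and using its own simulated data) compares training and test r2 before and after appending a single pure-noise column, and also reports adjusted r2 on the training set. Because ordinary least squares can always assign the new column a coefficient of zero, the training r2 can only stay the same or improve; the test r2, which is what the plot above reports, has no such guarantee.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X, y = make_regression(n_samples=100, n_features=10, n_informative=5,
                           noise=100, random_state=0)
    X_noisy = np.hstack([X, rng.normal(0, 50, size=(100, 1))])  # one pure-noise column

    def adjusted_r2(r2, n, p):
        # adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)

    for name, features in [("original", X), ("plus noise column", X_noisy)]:
        X_train, X_test, y_train, y_test = train_test_split(
            features, y, test_size=0.2, random_state=42)
        model = LinearRegression().fit(X_train, y_train)
        r2_train = r2_score(y_train, model.predict(X_train))
        r2_test = r2_score(y_test, model.predict(X_test))
        print(f"{name:>18}  train r2={r2_train:.3f}  "
              f"adj train r2={adjusted_r2(r2_train, *X_train.shape):.3f}  "
              f"test r2={r2_test:.3f}")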