Taking on "Python Practical Data Processing/Visualization 100 Knocks": Knock 90

Knock 90: Preprocess the test data

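In this knock we apply to the test data the same preprocessing that was applied to the training data.

First, let's check the data. The code below repeats the training-side steps from the earlier knocks and then prints the first rows of the test data.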

import seaborn as sns
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the Titanic dataset and separate the target label.
dataset = sns.load_dataset('titanic')
label = dataset.pop('survived')

# Stratified split so that both sets keep the same survival ratio.
train_ds, test_ds, train_label, test_label = train_test_split(
    dataset, label, random_state=2021, stratify=label)

# Drop columns that duplicate other information.
train_ds.drop(columns=['embark_town', 'alive'], inplace=True)

# One-hot encode the categorical columns, then pclass.
one_hot_encoded = pd.get_dummies(train_ds)
one_hot_encoded = pd.get_dummies(one_hot_encoded, columns=['pclass'])
train_ds = one_hot_encoded

# Knock 86: scaling
from sklearn.preprocessing import RobustScaler, StandardScaler

age_scaler   = StandardScaler()
sibsp_scaler = RobustScaler()
parch_scaler = RobustScaler()
fare_scaler  = RobustScaler()

train_ds['age']   = age_scaler.fit_transform(train_ds['age'].values.reshape(-1, 1))
train_ds['sibsp'] = sibsp_scaler.fit_transform(train_ds['sibsp'].values.reshape(-1, 1))
train_ds['parch'] = parch_scaler.fit_transform(train_ds['parch'].values.reshape(-1, 1))
train_ds['fare']  = fare_scaler.fit_transform(train_ds['fare'].values.reshape(-1, 1))

# Knock 88: missing-value imputation
from sklearn.impute import SimpleImputer

age_imputer = SimpleImputer(strategy='median')
train_ds['age'] = age_imputer.fit_transform(train_ds['age'].values.reshape(-1, 1))

# Knock 90: check the test data
print(test_ds.head())

Output

     pclass     sex   age  sibsp  parch     fare  embarked  class    who  adult_male  deck  embark_town  alive  alone
404       3  female  20.0      0      0   8.6625         S  Third  woman       False   NaN  Southampton     no   True
521       3    male  22.0      0      0   7.8958         S  Third    man        True   NaN  Southampton     no   True
130       3    male  33.0      0      0   7.8958         C  Third    man        True   NaN    Cherbourg     no   True
14        3  female  14.0      0      0   7.8542         S  Third  child       False   NaN  Southampton     no   True
610       3  female  39.0      1      5  31.2750         S  Third  woman       False   NaN  Southampton     no  False

Drop the embark_town and alive columns, just as we did for the training data.

test_ds.drop(columns=['embark_town','alive'],inplace=True)

As with train_ds, one-hot encode the categorical variables.

Then convert True to 1 and False to 0.

test_ds = pd.get_dummies(test_ds)
test_ds = pd.get_dummies(test_ds,columns=['pclass'])
test_ds.replace({True: 1, False: 0},inplace = True)
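
As a side note, here is a minimal sketch of another way to end up with 0/1 columns, assuming a pandas version whose get_dummies accepts the dtype argument. The toy frame and its values are hypothetical stand-ins for test_ds; this is not the approach used in this knock, only an illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for test_ds (values are made up).
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "adult_male": [False, True, True],
    "fare": [8.6625, 7.8958, 7.8958],
})

# dtype=int makes the new dummy columns 0/1 instead of True/False.
encoded = pd.get_dummies(toy, dtype=int)

# Boolean columns that already existed are not touched by get_dummies,
# so cast them to integers separately.
bool_cols = encoded.select_dtypes(include="bool").columns
encoded[bool_cols] = encoded[bool_cols].astype(int)

print(encoded.dtypes)
```

Casting only the boolean columns leaves numeric columns such as fare untouched.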

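Because the test data has fewer rows than the training data, a category that appears in the training data may not appear in the test data at all. In that case the one-hot encoded test data ends up with fewer columns than the training data.

We fix this mismatch so that the test data has the same columns, in the same order, as train_ds.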

test_ds = test_ds.merge(train_ds,how='left')
test_ds = test_ds[train_ds.columns]
print(test_ds.head())

Output

      age  sibsp  parch     fare  adult_male  alone  sex_female  sex_male  embarked_C  embarked_Q  embarked_S  ...
0    20.0      0      0   8.6625           0      1           1         0           0           0           1  ...
1    22.0      0      0   7.8958           1      1           0         1           0           0           1  ...
2    33.0      0      0   7.8958           1      1           0         1           1           0           0  ...
3    14.0      0      0   7.8542           0      1           1         0           0           0           1  ...
4    39.0      1      5  31.2750           0      0           1         0           0           0           1  ...
..    ...    ...    ...      ...         ...    ...         ...       ...         ...         ...         ...  ...
222  56.0      0      1  83.1583           0      0           1         0           1           0           0  ...

[223 rows x 27 columns]
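
The merge above works, but columns that exist only in train_ds are left as NaN for test rows that have no exact match in the training data. As a rough alternative sketch (not the method used in this knock), DataFrame.reindex can add the missing dummy columns filled with 0 and put them in the training order in one step. The two toy frames and their values below are hypothetical.

```python
import pandas as pd

# Hypothetical one-hot encoded frames (values are made up).
train_toy = pd.DataFrame({"age": [20.0, 33.0], "deck_A": [1, 0], "deck_B": [0, 1]})
test_toy = pd.DataFrame({"age": [14.0], "deck_B": [1]})

# Columns that the training data has but the test data lacks.
missing = [c for c in train_toy.columns if c not in test_toy.columns]
print(missing)

# Reorder to the training layout; absent columns are created and filled with 0.
aligned = test_toy.reindex(columns=train_toy.columns, fill_value=0)
print(aligned)
```

Unlike merge, reindex keeps the original row index of the test data, so keep that in mind if later steps rely on a 0-based index.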

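Next, apply the scaling from Knock 86. We load the scalers that were fitted on the training data from their pickle files and call transform only, so the test data is scaled with the training statistics rather than being refitted.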

import pickle

# Load the scalers that were fitted on the training data.
with open('data/scalers/age_scaler.pkl', mode='rb') as f:
    age_scaler = pickle.load(f)
with open('data/scalers/sibsp_scaler.pkl', mode='rb') as f:
    sibsp_scaler = pickle.load(f)
with open('data/scalers/parch_scaler.pkl', mode='rb') as f:
    parch_scaler = pickle.load(f)
with open('data/scalers/fare_scaler.pkl', mode='rb') as f:
    fare_scaler = pickle.load(f)

# transform() only: do not refit the scalers on the test data.
test_ds['age']   = age_scaler.transform(test_ds.age.values.reshape(-1, 1))
test_ds['sibsp'] = sibsp_scaler.transform(test_ds.sibsp.values.reshape(-1, 1))
test_ds['parch'] = parch_scaler.transform(test_ds.parch.values.reshape(-1, 1))
test_ds['fare']  = fare_scaler.transform(test_ds.fare.values.reshape(-1, 1))
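
To illustrate why the scalers are loaded and then used with transform, here is a small sketch of the fit, save, load, transform cycle. The arrays are hypothetical values, and an in-memory buffer stands in for the .pkl files so that the example runs without any files on disk.

```python
import io
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values standing in for a training column and a test column.
train_age = np.array([[20.0], [30.0], [40.0]])
test_age = np.array([[30.0], [50.0]])

# Fit on the training data only, so the mean and scale come from the training set.
scaler = StandardScaler().fit(train_age)

# Round trip through pickle; an in-memory buffer replaces the .pkl file here.
buffer = io.BytesIO()
pickle.dump(scaler, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)

# transform() reuses the stored statistics; it does not refit on the test data.
scaled_test = loaded.transform(test_age)
print(loaded.mean_, scaled_test.ravel())
```

A test value equal to the training mean maps to 0, which shows that the training statistics are the ones being applied.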

Finally, impute the missing values in age, using the imputer that was fitted on the training data.

with open('data/imputers/age_imputer.pkl',mode='rb') as f:
    age_imputer = pickle.load(f)

test_ds['age']= age_imputer.transform(test_ds.age.values.reshape(-1,1))

print(test_ds.head())

Output

         age  sibsp  parch       fare  adult_male  alone  sex_female  sex_male  embarked_C  embarked_Q  embarked_S  ...
0  -0.666020    0.0    0.0  -0.276724           0      1           1         0           0           0           1  ...
1  -0.528613    0.0    0.0  -0.309950           1      1           0         1           0           0           1  ...
2   0.227125    0.0    0.0  -0.309950           1      1           0         1           1           0           0  ...
3  -1.078240    0.0    0.0  -0.311753           0      1           1         0           0           0           1  ...
4   0.639346    1.0    5.0   0.703233           0      0           1         0           0           0           1  ...

[5 rows x 27 columns]
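
For reference, the imputation step can be pictured with the following minimal sketch. It assumes, as in this knock, a SimpleImputer with strategy='median'; the arrays are hypothetical scaled ages, not values from the Titanic data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical already-scaled ages; NaN marks a missing value.
train_age = np.array([[-1.0], [0.0], [np.nan], [2.0]])
test_age = np.array([[np.nan], [0.5]])

# fit() learns the median of the non-missing training values.
imputer = SimpleImputer(strategy="median").fit(train_age)

# transform() fills the NaNs in the test data with that training median.
filled = imputer.transform(test_age)
print(imputer.statistics_, filled.ravel())
```

The missing test value is replaced by the median learned from the training data, not by a median computed from the test data.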