Titanic data with logistic regression
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
In [2]:
train = pd.read_csv('titanic/train.csv')
test = pd.read_csv('titanic/test.csv')
In [3]:
train
Out[3]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
In [4]:
test
Out[4]:
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | 1305 | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
416 | 1308 | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |
417 | 1309 | 3 | Peter, Master. Michael J | male | NaN | 1 | 1 | 2668 | 22.3583 | NaN | C |
418 rows × 11 columns
In [5]:
# Print info() to see which columns are categorical and need encoding
print(train.info())
print()
print(test.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object
 10  Embarked     418 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
None
In [6]:
survivors = train['Survived'].value_counts()
print(survivors)
survivors.plot.bar()
0    549
1    342
Name: Survived, dtype: int64
Out[6]:
<AxesSubplot:>
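The plan is to one-hot encode the object-type columns, but Name, Ticket, and Cabin really aren't worth encoding, so I just dropped them: their values are unique or nearly unique per row, and Cabin is mostly missing (only 204 of 891 training rows have a value).
Age, a float column, also has a lot of missing values (177 of 891 in train), so I filled them with the mean. The model won't fit at all if any NaN is left, so the remaining gaps in Embarked and Fare are filled in a similarly simple way.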
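As a rough sketch of how one might check this before dropping columns, the snippet below counts unique values and the missing fraction per column. The `toy` frame and its values are made up for illustration; on the real data the same two calls would be applied to `train`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame (values are invented for illustration).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Ticket": ["T1", "T2", "T2", "T3"],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

# Unique-value count (NaN excluded) and missing fraction for each column.
summary = pd.DataFrame({
    "n_unique": toy.nunique(),
    "missing_frac": toy.isna().mean(),
})
print(summary)
```

Columns with nearly one distinct value per row, or with a high missing fraction, are the ones that add little when one-hot encoded.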
In [7]:
train = train.drop(['Name', 'Ticket', 'Cabin'], axis=1)
test = test.drop(['Name', 'Ticket', 'Cabin'], axis=1)
In [8]:
train['Age'] = train['Age'].fillna(train['Age'].mean())
test['Age'] = test['Age'].fillna(test['Age'].mean())
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].value_counts().index[0])
test['Embarked'] = test['Embarked'].fillna(test['Embarked'].value_counts().index[0])
train['Fare'] = train['Fare'].fillna(train['Fare'].mean())
test['Fare'] = test['Fare'].fillna(train['Fare'].mean())
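Note that the cell above fills test Age with the test mean but test Fare with the train mean. One consistent alternative, sketched here on made-up toy frames (`toy_train` and `toy_test` are hypothetical names), is to compute every fill value from the training data and reuse it for both sets. This is an assumption about what a cleaner version might look like, not what the notebook ran.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train/test (values are invented).
toy_train = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Embarked": ["S", "S", None]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0], "Embarked": ["C", None]})

# Fill values are computed from the training data only.
age_mean = toy_train["Age"].mean()
embarked_mode = toy_train["Embarked"].mode()[0]

# Apply the same fill values to both frames.
for frame in (toy_train, toy_test):
    frame["Age"] = frame["Age"].fillna(age_mean)
    frame["Embarked"] = frame["Embarked"].fillna(embarked_mode)

print(toy_test)
```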
In [9]:
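# Concatenate train and test so one-hot encoding yields the same columns for both
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
all_df = pd.concat([train, test])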
idx = len(train)
all_df = pd.get_dummies(all_df)
train = all_df[:idx]
test = all_df[idx:]
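The reason for combining the frames is that a category missing from one set would otherwise produce mismatched dummy columns. A possible alternative that keeps the two sets separate is to encode each one and then align the test columns to the training columns. The sketch below uses invented toy data and is only one way to do it.

```python
import pandas as pd

# Toy frames (invented values); the test set has no "C" in Embarked.
toy_train = pd.DataFrame({"Fare": [7.25, 71.28, 8.05], "Embarked": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"Fare": [7.83, 9.69], "Embarked": ["Q", "S"]})

train_enc = pd.get_dummies(toy_train)

# Align test columns to train: missing dummy columns are added and filled with 0.
test_enc = pd.get_dummies(toy_test).reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```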
In [10]:
train
Out[10]:
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | 22.000000 | 1 | 0 | 7.2500 | 0 | 1 | 0 | 0 | 1 |
1 | 2 | 1.0 | 1 | 38.000000 | 1 | 0 | 71.2833 | 1 | 0 | 1 | 0 | 0 |
2 | 3 | 1.0 | 3 | 26.000000 | 0 | 0 | 7.9250 | 1 | 0 | 0 | 0 | 1 |
3 | 4 | 1.0 | 1 | 35.000000 | 1 | 0 | 53.1000 | 1 | 0 | 0 | 0 | 1 |
4 | 5 | 0.0 | 3 | 35.000000 | 0 | 0 | 8.0500 | 0 | 1 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0.0 | 2 | 27.000000 | 0 | 0 | 13.0000 | 0 | 1 | 0 | 0 | 1 |
887 | 888 | 1.0 | 1 | 19.000000 | 0 | 0 | 30.0000 | 1 | 0 | 0 | 0 | 1 |
888 | 889 | 0.0 | 3 | 29.699118 | 1 | 2 | 23.4500 | 1 | 0 | 0 | 0 | 1 |
889 | 890 | 1.0 | 1 | 26.000000 | 0 | 0 | 30.0000 | 0 | 1 | 1 | 0 | 0 |
890 | 891 | 0.0 | 3 | 32.000000 | 0 | 0 | 7.7500 | 0 | 1 | 0 | 1 | 0 |
891 rows × 12 columns
In [11]:
# To validate the model, hold out part of the training set
train_df, test_df = train_test_split(train, test_size=0.2)
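The split above is random on every run, so the scores reported later will vary between runs. A minimal sketch of a reproducible, class-balanced split is shown below on a made-up frame; `random_state=0` and the `stratify` argument are choices made for this example, not settings used in the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 12 non-survivors and 8 survivors (invented data).
toy = pd.DataFrame({"Fare": range(20), "Survived": [0] * 12 + [1] * 8})

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
tr, va = train_test_split(
    toy, test_size=0.25, random_state=0, stratify=toy["Survived"]
)

print(len(tr), len(va), va["Survived"].mean())
```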
In [12]:
x_train, y_train = train_df.loc[:, train_df.columns!='Survived'].values, train_df['Survived'].values
x_test, y_test = test_df.loc[:, test_df.columns!='Survived'].values, test_df['Survived'].values
In [13]:
lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)
predict = lr.predict(x_test)
proba = lr.predict_proba(x_test)[:,1]
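The `proba` array holds the predicted probability of class 1, although the rest of the notebook does not use it. As a sketch of what it could be used for, the example below (synthetic data, hypothetical names `clf`, `default_pred`, `lenient_pred`) shows that `predict` corresponds to a 0.5 cut-off and that lowering the threshold labels more rows as positive.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (not Titanic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # probability of class 1

# For a binary model, predict() matches thresholding the probability at 0.5.
default_pred = (proba > 0.5).astype(int)

# A lower threshold trades precision for recall on class 1.
lenient_pred = (proba > 0.3).astype(int)

print(default_pred.sum(), lenient_pred.sum())
```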
In [14]:
print(classification_report(y_test, predict))
              precision    recall  f1-score   support

         0.0       0.76      0.87      0.81       104
         1.0       0.77      0.63      0.69        75

    accuracy                           0.77       179
   macro avg       0.77      0.75      0.75       179
weighted avg       0.77      0.77      0.76       179
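In the report, a recall of 0.63 for class 1 means the model misses roughly a third of the actual survivors in the validation split. To make the definitions concrete, here is a small sketch on invented labels showing how precision and recall follow from the confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels and predictions for illustration.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print(precision, recall)
```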
In [15]:
# Predict on the test set (drop the empty Survived column created by the concat)
test_X = test.loc[:, test.columns!='Survived'].values
y_pred = lr.predict(test_X)
In [16]:
test['Survived'] = y_pred
test
<ipython-input-16-02f3e4d41143>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['Survived'] = y_pred
Out[16]:
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 0.0 | 3 | 34.50000 | 0 | 0 | 7.8292 | 0 | 1 | 0 | 1 | 0 |
1 | 893 | 0.0 | 3 | 47.00000 | 1 | 0 | 7.0000 | 1 | 0 | 0 | 0 | 1 |
2 | 894 | 0.0 | 2 | 62.00000 | 0 | 0 | 9.6875 | 0 | 1 | 0 | 1 | 0 |
3 | 895 | 0.0 | 3 | 27.00000 | 0 | 0 | 8.6625 | 0 | 1 | 0 | 0 | 1 |
4 | 896 | 1.0 | 3 | 22.00000 | 1 | 1 | 12.2875 | 1 | 0 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | 1305 | 0.0 | 3 | 30.27259 | 0 | 0 | 8.0500 | 0 | 1 | 0 | 0 | 1 |
414 | 1306 | 1.0 | 1 | 39.00000 | 0 | 0 | 108.9000 | 1 | 0 | 1 | 0 | 0 |
415 | 1307 | 0.0 | 3 | 38.50000 | 0 | 0 | 7.2500 | 0 | 1 | 0 | 0 | 1 |
416 | 1308 | 0.0 | 3 | 30.27259 | 0 | 0 | 8.0500 | 0 | 1 | 0 | 0 | 1 |
417 | 1309 | 0.0 | 3 | 30.27259 | 1 | 1 | 22.3583 | 0 | 1 | 1 | 0 | 0 |
418 rows × 12 columns
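The SettingWithCopyWarning above appears because `test` is a slice of `all_df`. One common way to avoid it, sketched here on a made-up frame (`all_toy` and `test_part` are hypothetical names), is to take an explicit copy of the slice before assigning to it.

```python
import numpy as np
import pandas as pd

# Toy combined frame: the last two rows play the role of the test set.
all_toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0.0, 1.0, np.nan, np.nan],
})
idx = 2

# .copy() makes the slice an independent frame, so assignment is unambiguous.
test_part = all_toy[idx:].copy()
test_part["Survived"] = np.array([1, 0])

print(test_part)
```

Because `test_part` is a copy, the original `all_toy` is left unchanged by the assignment.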