Hello. I'm asking this question because there is something I couldn't understand while working on a university assignment. The task is to build a logistic regression model and report its performance using a training dataset.

In `#4 Logistic regression`,

```
model_formula = sm.Logit.from_formula("Survived ~ Age + Parch + Fare + Pclass_1 + Pclass_2 + Pclass_3", df)
```

I wonder why the formula `"Survived ~ Age + Parch + Fare + Pclass_1 + Pclass_2 + Pclass_3"` uses this particular set of variables.

```
import pandas as pd
import numpy as np
#1. Read Data
df = pd.read_csv('C:/Users/minki/Downloads/Sample_data.csv', header=0)
df['Sex'] = df['Sex'].astype('category')
df['Pclass'] = df['Pclass'].astype('category')
df['Embarked'] = df['Embarked'].astype('category')
df = pd.get_dummies(df)
#2 data scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
#3 data splitting
from sklearn.model_selection import train_test_split
Y = df['Survived']
X = df.iloc[:, 1:12] ##important to check the index
print(X)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
#4 Logistic regression
import statsmodels.api as sm
model_formula = sm.Logit.from_formula("Survived ~ Age + Parch + Fare + Pclass_1 + Pclass_2 + Pclass_3", df)
result_model = model_formula.fit() ##Build your log reg.
print(result_model.summary()) ## Check the regression result
print(np.exp(result_model.params)) ## calculate the odds ratio of each variable
Y_pred = result_model.predict(X_test) ## using the test dataset, predict the probability of "Survived"
Y_pred = list(map(round, Y_pred)) ## round each probability to a 0/1 class label
print(Y_pred)
print("----")
print(list(Y_test))
from sklearn import metrics
print(metrics.confusion_matrix(Y_test, Y_pred)) ## print the matrix; otherwise a script discards it
accuracy = metrics.accuracy_score(Y_test, Y_pred)
recall = metrics.recall_score(Y_test, Y_pred)
f1 = metrics.f1_score(Y_test, Y_pred)
print("Accuracy:", accuracy, ", F1-score:", f1, "Recall score:", recall)
```

2022-09-20 08:44

I don't think the code says why these particular independent variables were chosen.

Usually, we spend a long time on exploratory data analysis (EDA), and then build the regression model after deciding which features are important and which to discard.

Here, the code one-hot encodes the categorical features, applies standard scaling, and then simply fits a logistic regression model on a handful of variables that look arbitrarily chosen, to check the result.

The variables were probably selected by building the model over and over, looking at the results, and dropping variables along the way. That process was simply left out of the code.
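To see where names like `Pclass_1` in the formula come from, here is a small sketch of the one-hot step on a made-up table. The three rows are invented purely for illustration and are not taken from `Sample_data.csv`; `dtype=int` is my addition so the output is 0/1 on any pandas version.

```python
import pandas as pd

# Tiny made-up table (illustration only, not the assignment data).
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
    "Pclass": [3, 1, 2],
})
toy["Pclass"] = toy["Pclass"].astype("category")

# get_dummies replaces each categorical column with one 0/1 column per level,
# named "<column>_<level>". Numeric columns are left as they are.
encoded = pd.get_dummies(toy, dtype=int)
print(list(encoded.columns))
```

So the right-hand side of the formula is just a list of column names that exist in `df` after `pd.get_dummies`. Which of those columns go into the formula is a modelling choice, not something the code derives.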

2022-09-20 08:44
