Regarding the subject line, I would like guidance on how this data should be defined (or whether no definition is needed).

First, the error message is as follows.

NameError: name 'X_train' is not defined

The code I wrote is as follows.

import numpy as np
import pandas as pd
# Read the data from Kaggle
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')
gender_submission = pd.read_csv('../input/titanic/gender_submission.csv')
# Data shaping (feature engineering)
data = pd.concat([train, test], sort=False)
data['Sex'].replace(['male', 'female'], [0, 1], inplace=True)
data['Fare'].fillna(np.mean(data['Fare']), inplace=True)
# Logistic regression settings
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty='l2', solver='sag', random_state=0)
clf.fit(X_train, y_train)

I followed the reference book (*), and from the error message I gather that the name X_train has not been defined.

* The reference book is pp. 51-56 of "Kaggle Startbook Starting with Python" (by Shotaro Ishihara and Hideki Murata).

Indeed, the definitions of X_train and y_train do not appear anywhere in my code, but how should the data be split to create them?

Or is this data normally already defined, and this is just a small mistake on my part?

I would appreciate your guidance.

I apologize for the trouble, and thank you for your help.

  • Answer # 1

    I don't have the book, but I found its support page.

    Support page

    It contains the following lines, so why not check them against your code?

    y_train = train['Survived']
    X_train = train.drop('Survived', axis=1)
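
To show where those two lines fit, here is a minimal sketch of the whole flow. It assumes the combined `data` frame is first split back into its train and test rows by position, which may not match the book exactly. The tiny DataFrames are made-up stand-ins for train.csv and test.csv, and `X_test` and `y_pred` are names chosen for illustration. The real Titanic data also has text columns and missing values (for example Name, Ticket, Age) that would need to be dropped or encoded before `fit` will accept it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up stand-ins for train.csv and test.csv (hypothetical values).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10, 8.46],
})
test = pd.DataFrame({
    "Sex": ["male", "female"],
    "Fare": [np.nan, 7.0],
})

# Feature engineering on the combined frame, as in the question.
data = pd.concat([train, test], sort=False)
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
data["Fare"] = data["Fare"].fillna(data["Fare"].mean())

# Split the combined frame back into its original train and test rows.
train = data[:len(train)]
test = data[len(train):]

# Separate the target column from the features.
y_train = train["Survived"]
X_train = train.drop("Survived", axis=1)
X_test = test.drop("Survived", axis=1)

# Same settings as the question, with the penalty spelled 'l2' (lowercase L).
clf = LogisticRegression(penalty="l2", solver="sag", random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```

With so few rows the `sag` solver may print a convergence warning. That is harmless here, since the point is only that `X_train` and `y_train` must exist before `clf.fit` is called.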