Notes on My First Kaggle Competition: Titanic

1. Dataset Description

Overview

The data has been split into two groups:

training set (train.csv)
test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

Data Dictionary

Variable	Definition	Key
survival	Survival	0 = No, 1 = Yes
pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
sex	Sex
Age	Age in years
sibsp	# of siblings / spouses aboard the Titanic
parch	# of parents / children aboard the Titanic
ticket	Ticket number
fare	Passenger fare
cabin	Cabin number
embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
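The overview mentions feature engineering, and the sibsp and parch definitions suggest one common example: combining them into a family-size feature. The sketch below is only an illustration on made-up rows. `SibSp` and `Parch` are the column names in train.csv, while `FamilySize` and `IsAlone` are hypothetical names chosen here, and nothing in this post actually uses them.

```python
import pandas as pd

# Toy rows (made-up values) using the SibSp and Parch columns from train.csv.
toy = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})

# Relatives aboard plus the passenger themselves.
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1

# 1 if the passenger has no relatives aboard, else 0.
toy['IsAlone'] = (toy['FamilySize'] == 1).astype(int)

print(toy)
```

Any feature added this way increases the number of input columns, so the model's input size would have to grow to match.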

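2. First Attempt

2.1 Data Preprocessing

The data contains many non-numeric columns and missing values, so it has to be preprocessed first. The pandas library is the usual tool for this.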

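import pandas as pd

# Takes a DataFrame and returns a preprocessed copy
def preprocess(df):
    df_ = df.copy()

    # Drop the columns the model does not use
    columns_drop = ['PassengerId', 'Name', 'Ticket', 'Cabin']
    df_ = df_.drop(columns_drop, axis=1)
    # Fill missing values (assign back instead of using inplace=True on a
    # column selection, which recent pandas versions warn about)
    # Age and Fare: median
    df_['Age'] = df_['Age'].fillna(df_['Age'].median())
    # Embarked: mode (most frequent value)
    df_['Embarked'] = df_['Embarked'].fillna(df_['Embarked'].mode()[0])
    df_['Fare'] = df_['Fare'].fillna(df_['Fare'].median())

    # Convert categorical columns to numeric codes
    df_['Sex'] = df_['Sex'].map({'male': 0, 'female': 1})
    df_['Embarked'] = df_['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

    return df_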
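Before settling on fill strategies like the ones above, it helps to check which columns actually have missing values and which are non-numeric. The following is a minimal sketch on a toy frame; the rows are made up and only mimic a few train.csv columns.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like part of train.csv (values are made up).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0],
    'Sex': ['male', 'female', 'female'],
    'Fare': [7.25, 71.28, 8.05],
    'Embarked': ['S', 'C', None],
})

# Number of missing values in each column.
missing = toy.isna().sum()
print(missing)

# Columns that still need a numeric encoding.
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```

Running the same two checks on the real train.csv and test.csv shows which columns need filling and which need mapping to numbers.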

2.2 Loading the Data

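import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class TitanicDataset(Dataset):
    def __init__(self, filepath):
        df = pd.read_csv(filepath)
        df = preprocess(df)
        # Convert the data to NumPy arrays, then to torch tensors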
        x_np = df.drop('Survived', axis=1).astype(np.float32).values
        y_np = df['Survived'].astype(np.float32).values
        # xy = np.loadtxt(filepath, delimiter=',', dtype=np.float32)
        self.len = x_np.shape[0]
        self.x_data = torch.from_numpy(x_np)
        self.y_data = torch.from_numpy(y_np)

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]
    
    def __len__(self):
        return self.len

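2.3 Model Design

For the first attempt I used a simple linear model: a 7-dimensional input (the seven columns left after preprocessing), a 1-dimensional output, and a sigmoid activation on top. In other words, this is logistic regression.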

class TitanicModel(torch.nn.Module):
    def __init__(self):
        super(TitanicModel, self).__init__()
        self.linear = torch.nn.Linear(7, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.linear(x)
        x = self.sigmoid(x)
        # (batch, 1) -> (batch,); squeeze(-1) keeps the batch dimension even for a batch of one
        return x.squeeze(-1)

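Why call squeeze()? For a batch of inputs, the linear layer returns a tensor of shape (batch_size, 1), while the labels have shape (batch_size,). BCELoss expects the prediction and the target to have the same shape, so squeeze() removes the trailing dimension of size 1 before the two are compared. See the PyTorch documentation for squeeze().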
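The shapes involved can be checked directly. This small sketch uses zero tensors as stand-ins for the model output and labels, and also shows one edge case: a bare squeeze() on a batch of one removes the batch dimension too, while squeeze(-1) only removes the last one.

```python
import torch

out = torch.zeros(4, 1)   # stand-in for the model output on a batch of 4
labels = torch.zeros(4)   # stand-in for the matching labels

print(out.shape)            # torch.Size([4, 1])
print(out.squeeze().shape)  # torch.Size([4])

# Edge case: a batch containing a single row.
single = torch.zeros(1, 1)
print(single.squeeze().shape)    # torch.Size([])
print(single.squeeze(-1).shape)  # torch.Size([1])
```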

2.4 Training

train_dataset = TitanicDataset('./exec/titanic/train.csv')
train_loader = DataLoader(dataset=train_dataset, batch_size=32, shuffle=True, num_workers=2)

model = TitanicModel()

# size_average is deprecated; reduction='mean' averages the loss over the batch
criterion = torch.nn.BCELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)

for epoch in range(100):
    for i,data in enumerate(train_loader, 0):
        inputs, labels = data

        y_pred = model(inputs)
        loss = criterion(y_pred, labels)
        print(f'Epoch: {epoch} | Batch: {i} | Loss: {loss.item()}')
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
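To make the loss in the loop above less of a black box: with mean reduction, BCELoss is the average of -[y·log(p) + (1 - y)·log(1 - p)] over the batch. The sketch below compares a hand-written version against the built-in one on three made-up probabilities and labels.

```python
import torch

# Made-up predicted probabilities and 0/1 labels.
probs = torch.tensor([0.9, 0.2, 0.6])
labels = torch.tensor([1.0, 0.0, 1.0])

# Binary cross-entropy written out by hand, averaged over the batch.
manual = -(labels * torch.log(probs)
           + (1 - labels) * torch.log(1 - probs)).mean()

# The built-in criterion (mean reduction is the default).
builtin = torch.nn.BCELoss()(probs, labels)

print(manual.item(), builtin.item())
```

Both values should agree up to floating-point error.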

2.5 Prediction

df = pd.read_csv('./exec/titanic/test.csv')
ids = df['PassengerId'].values
df = preprocess(df)
x_np = df.astype(np.float32).values
x = torch.from_numpy(x_np)

model.eval()

with torch.no_grad():
    y_pred = model(x)
    y_pred_binary = (y_pred > 0.5).float().squeeze()

submission_df = pd.DataFrame({
    'PassengerId': ids,
    'Survived': y_pred_binary.numpy().astype(int)
})

submission_df.to_csv('./exec/titanic/result.csv', index=False)

Finally, submit result.csv to Kaggle to see the model's accuracy. This first attempt earned a "high" accuracy of 0.66.
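Kaggle only reports the score after a submission. One way to get a rough estimate beforehand is to hold out part of the training data and measure accuracy on it. The sketch below is an assumption about how that could look, not something done in this post: `accuracy` is a hypothetical helper, and the tensors are random stand-ins for the real features and labels.

```python
import torch
from torch.utils.data import TensorDataset, random_split

def accuracy(probs, labels, threshold=0.5):
    """Fraction of thresholded predictions that match the 0/1 labels."""
    preds = (probs > threshold).float()
    return (preds == labels).float().mean().item()

# Random stand-in for the training tensors (10 rows, 7 features).
features = torch.randn(10, 7)
targets = torch.tensor([0., 1., 0., 1., 1., 0., 0., 1., 0., 1.])
dataset = TensorDataset(features, targets)

# Hold out 2 of the 10 rows for validation, with a fixed seed.
generator = torch.Generator().manual_seed(0)
train_part, val_part = random_split(dataset, [8, 2], generator=generator)

# Accuracy on some made-up predicted probabilities: 3 of 4 are correct.
probs = torch.tensor([0.9, 0.2, 0.6, 0.4])
truth = torch.tensor([1., 0., 0., 0.])
print(accuracy(probs, truth))
```

In practice `train_part` would feed the DataLoader used for training, and `accuracy` would be called on the model's outputs for `val_part`.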


https://eleco.top/2026/02/25/kaggle-titanic/

Author

Eleco

Published

February 25, 2026
