Notes on My First Kaggle Competition: Titanic


1. Dataset Description

Overview

The data has been split into two groups:

  • training set (train.csv)
  • test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

Data Dictionary

Variable   Definition                                    Key
survival   Survival                                      0 = No, 1 = Yes
pclass     Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex
Age        Age in years
sibsp      # of siblings / spouses aboard the Titanic
parch      # of parents / children aboard the Titanic
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES)

  • 1st = Upper
  • 2nd = Middle
  • 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.

sibsp: The dataset defines family relations in this way:

  • Sibling = brother, sister, stepbrother, stepsister
  • Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way:

  • Parent = mother, father
  • Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.
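As one illustration of the feature engineering mentioned in the overview, sibsp and parch could be combined into a single family-size value. This is only a sketch: the `family_size` helper and the sample rows are hypothetical and are not used anywhere else in this post.

```python
# Hypothetical helper: not part of the pipeline described in this post.
def family_size(sibsp: int, parch: int) -> int:
    """The passenger plus siblings/spouses plus parents/children aboard."""
    return sibsp + parch + 1


# Made-up rows for illustration (not taken from train.csv)
passengers = [
    {"sibsp": 1, "parch": 0},
    {"sibsp": 0, "parch": 0},
    {"sibsp": 3, "parch": 2},
]

sizes = [family_size(p["sibsp"], p["parch"]) for p in passengers]
print(sizes)  # [2, 1, 6]
```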

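2. First Attempt

2.1 Data Preprocessing

The data contains many non-numeric columns and missing values, so it has to be preprocessed first. The pandas library is the usual tool for this.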

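import pandas as pd


# Take a DataFrame and return a preprocessed copy
def preprocess(df):
    df_ = df.copy()

    # Drop the columns that are not used as features
    columns_drop = ['PassengerId', 'Name', 'Ticket', 'Cabin']
    df_ = df_.drop(columns_drop, axis=1)

    # Fill missing values. Assign the result back instead of calling
    # fillna(inplace=True) on a column selection, which recent pandas
    # versions warn about.
    # Median for Age and Fare
    df_['Age'] = df_['Age'].fillna(df_['Age'].median())
    df_['Fare'] = df_['Fare'].fillna(df_['Fare'].median())
    # Mode (most frequent value) for Embarked
    df_['Embarked'] = df_['Embarked'].fillna(df_['Embarked'].mode()[0])

    # Convert categorical columns to numeric codes
    df_['Sex'] = df_['Sex'].map({'male': 0, 'female': 1})
    df_['Embarked'] = df_['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

    return df_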
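To make the imputation and encoding steps in preprocess concrete, here is a small plain-Python sketch of the same idea on made-up values. It uses the standard library instead of pandas, and the `fill_missing` helper is a hypothetical name introduced only for this example.

```python
from statistics import median, mode


def fill_missing(values, strategy):
    """Replace None entries with the median or the mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mode(observed)
    return [fill if v is None else v for v in values]


ages = [22.0, None, 38.0, 26.0, None]   # made-up ages with gaps
ports = ["S", "C", None, "S", "Q"]      # made-up embarkation ports with a gap

filled_ages = fill_missing(ages, "median")   # median of 22, 26, 38 is 26.0
filled_ports = fill_missing(ports, "mode")   # most frequent port is "S"

# Same category-to-code mapping that preprocess applies to Embarked
embarked_codes = {"S": 0, "C": 1, "Q": 2}
encoded_ports = [embarked_codes[p] for p in filled_ports]

print(filled_ages)    # [22.0, 26.0, 38.0, 26.0, 26.0]
print(encoded_ports)  # [0, 1, 0, 0, 2]
```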

2.2 Loading the Data

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset


class TitanicDataset(Dataset):
    def __init__(self, filepath):
        df = pd.read_csv(filepath)
        df = preprocess(df)
        # Convert the features and labels to NumPy arrays, then to torch tensors
        x_np = df.drop('Survived', axis=1).astype(np.float32).values
        y_np = df['Survived'].astype(np.float32).values
        self.len = x_np.shape[0]
        self.x_data = torch.from_numpy(x_np)
        self.y_data = torch.from_numpy(y_np)

    def __getitem__(self, index):
        # Return one (features, label) pair
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.len

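2.3 Model Design

For the first attempt I used a simple linear model with a 7-dimensional input and a 1-dimensional output, followed by a sigmoid activation.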

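import torch


class TitanicModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # 7 input features -> 1 output
        self.linear = torch.nn.Linear(7, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.linear(x)
        x = self.sigmoid(x)
        # Remove only the trailing size-1 dimension, so a batch containing
        # a single sample still keeps its batch dimension
        return x.squeeze(-1)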

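Why call squeeze()? For a batch of inputs, the linear layer returns a tensor of shape (batch_size, 1), whereas the labels have shape (batch_size,). The loss function expects the prediction and the target to have the same shape, so the trailing dimension of size 1 has to be removed before the two are compared. For a single sample this is the same as turning a 1-dimensional output into a 0-dimensional one so that it matches the scalar label. See the PyTorch documentation for squeeze().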
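The shape issue can be imitated with nested lists. The sketch below is a plain-Python analogy, not PyTorch code: `linear_sigmoid` and `squeeze_last` are hypothetical helpers, the weights are zero placeholders rather than trained values, and the two feature rows are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_sigmoid(batch, weights, bias):
    """Mimic Linear(7, 1) followed by Sigmoid: one single-element row per passenger."""
    return [[sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)]
            for row in batch]


def squeeze_last(rows):
    """Drop the trailing length-1 dimension: (batch, 1) -> (batch,)."""
    return [row[0] for row in rows]


# Two made-up passengers with 7 features each:
# Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
batch = [
    [3, 0, 22.0, 1, 0, 7.25, 0],
    [1, 1, 38.0, 1, 0, 71.28, 1],
]
weights = [0.0] * 7   # placeholder weights, not trained values
bias = 0.0

out = linear_sigmoid(batch, weights, bias)   # nested: [[0.5], [0.5]]
probs = squeeze_last(out)                    # flat:   [0.5, 0.5]
labels = [0.0, 1.0]                          # flat, like the Survived column

print(out, probs)
```

After the squeeze, `probs` and `labels` line up element by element, which is what the loss computation needs.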

2.4 Training

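import torch
from torch.utils.data import DataLoader

train_dataset = TitanicDataset('./exec/titanic/train.csv')
# With num_workers > 0, platforms that spawn worker processes need this
# script's entry point wrapped in `if __name__ == '__main__':`
train_loader = DataLoader(dataset=train_dataset, batch_size=32, shuffle=True, num_workers=2)

model = TitanicModel()

# Mean binary cross-entropy; reduction='mean' replaces the deprecated size_average=True
criterion = torch.nn.BCELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)

for epoch in range(100):
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data

        # Forward pass
        y_pred = model(inputs)
        loss = criterion(y_pred, labels)
        print(f'Epoch: {epoch} | Batch: {i} | Loss: {loss.item()}')

        # Backward pass and parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()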
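The loss printed in the loop is the mean binary cross-entropy between the predicted probabilities and the 0/1 labels. The following plain-Python sketch shows that arithmetic on made-up numbers; `bce_loss` is a hypothetical helper written for this example, not the PyTorch implementation.

```python
import math


def bce_loss(probs, labels):
    """Mean binary cross-entropy: -mean(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)


labels = [1.0, 0.0]

# A model that always outputs 0.5 scores log(2), roughly 0.693
uninformed = bce_loss([0.5, 0.5], labels)

# Confident and correct predictions give a much smaller loss
confident = bce_loss([0.9, 0.1], labels)

print(uninformed, confident)
```

A loss that stays near 0.693 during training is therefore a sign that the model has learned little beyond guessing 0.5.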

2.5 Prediction

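import numpy as np
import pandas as pd
import torch

df = pd.read_csv('./exec/titanic/test.csv')
# Keep the passenger IDs for the submission file before they are dropped
ids = df['PassengerId'].values
df = preprocess(df)
x_np = df.astype(np.float32).values
x = torch.from_numpy(x_np)

# Switch to evaluation mode and disable gradient tracking
model.eval()

with torch.no_grad():
    y_pred = model(x)
    # Probabilities above 0.5 are predicted as survived (1), the rest as 0
    y_pred_binary = (y_pred > 0.5).int()

submission_df = pd.DataFrame({
    'PassengerId': ids,
    'Survived': y_pred_binary.numpy()
})

submission_df.to_csv('./exec/titanic/result.csv', index=False)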

Finally, submit result.csv to Kaggle to see the model's accuracy. I got a "high" accuracy of 0.66.
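For reference, accuracy here means the fraction of passengers whose predicted label matches the true one. The sketch below applies the same 0.5 threshold used in the prediction step to made-up probabilities and labels; `to_labels` and `accuracy` are hypothetical helpers, and the numbers have nothing to do with the 0.66 score above.

```python
def to_labels(probs, threshold=0.5):
    """Predict 1 when the probability is strictly above the threshold."""
    return [1 if p > threshold else 0 for p in probs]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


probs = [0.8, 0.3, 0.6, 0.5]   # made-up model outputs
truth = [1, 0, 0, 0]           # made-up ground truth

preds = to_labels(probs)       # 0.5 is not above the threshold, so it maps to 0
print(preds)                   # [1, 0, 1, 0]
print(accuracy(preds, truth))  # 0.75
```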


https://eleco.top/2026/02/25/kaggle-titanic/
Author: Eleco
Published: February 25, 2026