
ML, Titanic Data

Introduction to Pandas and ML

Mort started this assignment by asking ChatGPT: “Regarding Python Pandas, what are some data sets that would be good for learning Pandas?”

The follow-up question was: “Where can I find the Titanic data set?”

Titanic Libraries

The Titanic dataset can be loaded through the seaborn library, so using it only requires importing seaborn and calling load_dataset:

    import seaborn as sns
    titanic_data = sns.load_dataset('titanic')
# Install the required packages (comment these lines out once they are installed)
!pip install seaborn
!pip install pandas
!pip install scikit-learn

Titanic Data

Look at a sample of data.

import seaborn as sns

# Load the titanic dataset
titanic_data = sns.load_dataset('titanic')

print("Titanic Data")


print(titanic_data.columns) # titanic data set
display(titanic_data[['survived','pclass', 'sex', 'age', 'sibsp', 'parch', 'class', 'fare', 'embark_town', 'alone']]) # look at selected columns


Titanic Data
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')
     survived  pclass     sex   age  sibsp  parch   class     fare  embark_town  alone
  0         0       3    male  22.0      1      0   Third   7.2500  Southampton  False
  1         1       1  female  38.0      1      0   First  71.2833    Cherbourg  False
  2         1       3  female  26.0      0      0   Third   7.9250  Southampton   True
  3         1       1  female  35.0      1      0   First  53.1000  Southampton  False
  4         0       3    male  35.0      0      0   Third   8.0500  Southampton   True
...       ...     ...     ...   ...    ...    ...     ...      ...          ...    ...
886         0       2    male  27.0      0      0  Second  13.0000  Southampton   True
887         1       1  female  19.0      0      0   First  30.0000  Southampton   True
888         0       3  female   NaN      1      2   Third  23.4500  Southampton  False
889         1       1    male  26.0      0      0   First  30.0000    Cherbourg   True
890         0       3    male  32.0      0      0   Third   7.7500   Queenstown   True

891 rows × 10 columns
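
Before cleaning, it can help to count how many values are missing in each column (row 888 above, for example, has no age). A minimal sketch, using a small hand-made frame in place of the full dataset so that it runs without a download; the name sample and all of its values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few Titanic rows; values are illustrative only
sample = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [22.0, 38.0, np.nan, 35.0],
    "embarked": ["S", "C", "S", None],
    "deck": [None, "C", None, None],
})

# Count the missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)

# Share of rows that have no missing value in any column
complete_share = sample.notna().all(axis=1).mean()
print(f"Complete rows: {complete_share:.0%}")
```

On the real data, the same report would come from titanic_data.isna().sum().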

Clean Titanic Data

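Preparing raw data for analysis is called ‘cleaning’ the data.

Most analysis, including machine learning, requires data in a standardized, numeric format. Here that means dropping columns that are not needed, removing rows with missing values, and encoding text columns as numbers.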

import pandas as pd
# Preprocess the data
from sklearn.preprocessing import OneHotEncoder

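td = titanic_data # note: td is an alias, not a copy, so titanic_data is modified as well
# Drop columns that repeat information held elsewhere or are not needed here
td.drop(['alive', 'who', 'adult_male', 'class', 'embark_town', 'deck'], axis=1, inplace=True)
td.dropna(inplace=True) # drop rows with at least one missing value, after dropping unneeded columns
td['sex'] = td['sex'].apply(lambda x: 1 if x == 'male' else 0) # male = 1, female = 0
td['alone'] = td['alone'].apply(lambda x: 1 if x == True else 0) # alone = 1, with family = 0

# Encode categorical variables
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(td[['embarked']])
onehot = enc.transform(td[['embarked']]).toarray()
cols = ['embarked_' + val for val in enc.categories_[0]]
# Note: pd.DataFrame(onehot) gets a fresh 0..n-1 index and this assignment aligns by index label,
# so rows of td without a matching label become NaN and are dropped by the dropna below
td[cols] = pd.DataFrame(onehot)
td.drop(['embarked'], axis=1, inplace=True)
td.dropna(inplace=True) # drop rows with at least one missing value, after preparing the data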

print(td.columns)
display(td)
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'alone',
       'embarked_C', 'embarked_Q', 'embarked_S'],
      dtype='object')
     survived  pclass  sex   age  sibsp  parch      fare  alone  embarked_C  embarked_Q  embarked_S
  0         0       3    1  22.0      1      0    7.2500      0         0.0         0.0         1.0
  1         1       1    0  38.0      1      0   71.2833      0         1.0         0.0         0.0
  2         1       3    0  26.0      0      0    7.9250      1         0.0         0.0         1.0
  3         1       1    0  35.0      1      0   53.1000      0         0.0         0.0         1.0
  4         0       3    1  35.0      0      0    8.0500      1         0.0         0.0         1.0
...       ...     ...  ...   ...    ...    ...       ...    ...         ...         ...         ...
705         0       2    1  39.0      0      0   26.0000      1         0.0         0.0         1.0
706         1       2    0  45.0      0      0   13.5000      1         0.0         0.0         1.0
707         1       1    1  42.0      0      0   26.2875      1         0.0         1.0         0.0
708         1       1    0  22.0      0      0  151.5500      1         0.0         0.0         1.0
710         1       1    0  24.0      0      0   49.5042      1         1.0         0.0         0.0

564 rows × 11 columns
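
The last row label shown above is 710, which suggests that rows with higher labels were lost when the encoded columns were attached. A likely cause is that pd.DataFrame(onehot) gets a fresh 0..n-1 index while the cleaned frame keeps its original labels with gaps, and column assignment aligns on labels. The sketch below reproduces the effect on a tiny hypothetical frame and shows one possible remedy, passing index=df.index. The names df, misaligned and aligned and the hand-written onehot array are illustrative, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index has gaps, as it would after dropna()
df = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]}, index=[0, 2, 5, 9])

# Stand-in for the encoder output: one row per row of df, columns C, Q, S
onehot = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
cols = ["embarked_C", "embarked_Q", "embarked_S"]

# Fresh 0..3 index: assignment aligns by label, so labels 5 and 9 get NaN
# and label 2 picks up the encoding that belongs to a different row
misaligned = df.copy()
misaligned[cols] = pd.DataFrame(onehot, columns=cols)

# Reusing df's index keeps every row matched to its own encoding
aligned = df.copy()
aligned[cols] = pd.DataFrame(onehot, columns=cols, index=df.index)

print(misaligned)
print(aligned)
```

Applying the same idea to the notebook would change the row count and every statistic that follows, so the outputs shown here would no longer match.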

Train Titanic Data

The goal of ‘training’ on the data is to make it easier to analyze and draw conclusions.

What conclusions can you draw from the median, mean, max, and min statistics that follow?

Median Values

print(td.median()) # td and titanic_data refer to the same cleaned frame
survived       0.0
pclass         2.0
sex            1.0
age           28.0
sibsp          0.0
parch          0.0
fare          16.1
alone          1.0
embarked_C     0.0
embarked_Q     0.0
embarked_S     1.0
dtype: float64

Perished Mean/Average

print(td.query("survived == 0").mean())
survived       0.000000
pclass         2.464072
sex            0.844311
age           31.073353
sibsp          0.562874
parch          0.398204
fare          24.835902
alone          0.616766
embarked_C     0.185629
embarked_Q     0.038922
embarked_S     0.775449
dtype: float64

Survived Mean/Average

print(td.query("survived == 1").mean())
survived       1.000000
pclass         1.878261
sex            0.326087
age           28.481522
sibsp          0.504348
parch          0.508696
fare          50.188806
alone          0.456522
embarked_C     0.152174
embarked_Q     0.034783
embarked_S     0.813043
dtype: float64

Survived Max and Min Stats

print("maximums for survivors")
print(td.query("survived == 1").max())
print()
print("minimums for survivors")
print(td.query("survived == 1").min())
maximums for survivors
survived        1.0000
pclass          3.0000
sex             1.0000
age            80.0000
sibsp           4.0000
parch           5.0000
fare          512.3292
alone           1.0000
embarked_C      1.0000
embarked_Q      1.0000
embarked_S      1.0000
dtype: float64

minimums for survivors
survived      1.00
pclass        1.00
sex           0.00
age           0.75
sibsp         0.00
parch         0.00
fare          0.00
alone         0.00
embarked_C    0.00
embarked_Q    0.00
embarked_S    0.00
dtype: float64
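
The separate queries above can also be combined into one table, which makes the two groups easier to compare side by side. A minimal sketch, assuming a cleaned frame shaped like td; the demo frame and its values are hypothetical:

```python
import pandas as pd

# Hypothetical cleaned rows; the real values would come from td
demo = pd.DataFrame({
    "survived": [0, 0, 1, 1, 1],
    "pclass": [3, 2, 1, 1, 3],
    "age": [22.0, 40.0, 38.0, 19.0, 4.0],
    "fare": [7.25, 13.0, 71.28, 30.0, 16.7],
})

# One row per outcome (0 = perished, 1 = survived),
# with min, mean and max for each selected column
summary = demo.groupby("survived")[["pclass", "age", "fare"]].agg(["min", "mean", "max"])
print(summary)
```

With the notebook's data, replacing demo with td would give the perished and survived statistics in a single result.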

Machine Learning

Visit Tutorials Point

Scikit-learn is a powerful Python library for machine learning, offering tools for classification, regression, clustering, and dimensionality reduction.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Build distinct data frames on survived column
X = td.drop('survived', axis=1) # all except 'survived'
y = td['survived'] # only 'survived'

# Split the data into random train (70%) and test (30%) sets, using a fixed random state (42).
# Note: this split is not stratified, because no stratify argument is passed to train_test_split.
# The number 42 is often used in examples and tutorials because of its cultural significance in science fiction (it's the "Answer to the Ultimate Question of Life, The Universe, and Everything" in The Hitchhiker's Guide to the Galaxy by Douglas Adams). In practice the actual value doesn't matter; what's important is that it's set to a consistent value so the split is reproducible.
# X_train is the DataFrame containing the features for the training set.
# X_test is the DataFrame containing the features for the test set.
# y_train is the 'survived' status for each passenger in the training set, corresponding to the X_train data.
# y_test is the 'survived' status for each passenger in the test set, corresponding to the X_test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a decision tree classifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Test the model
y_pred = dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('DecisionTreeClassifier Accuracy: {:.2%}'.format(accuracy))  

# Train a logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Test the model
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('LogisticRegression Accuracy: {:.2%}'.format(accuracy))  
DecisionTreeClassifier Accuracy: 72.94%
LogisticRegression Accuracy: 78.82%


/home/trevor/.local/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
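
The warning above suggests scaling the features or raising max_iter. A minimal sketch of doing both, assuming scikit-learn's make_pipeline and StandardScaler; the synthetic X_demo and y_demo arrays are hypothetical and only stand in for X_train and y_train:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with features on very different scales (like pclass vs. fare)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.integers(1, 4, size=200),   # class-like feature, values 1..3
    rng.uniform(0, 500, size=200),  # fare-like feature, values 0..500
])
y_demo = (X_demo[:, 1] > 250).astype(int)  # label depends on the large-scale feature

# Standardize each feature, then fit logistic regression with a higher iteration cap
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_demo)

train_accuracy = scaled_logreg.score(X_demo, y_demo)
print(f"Training accuracy on the demo data: {train_accuracy:.2%}")
```

In the notebook, the same pipeline could be fit on X_train and y_train and used wherever logreg is used; the accuracy it would reach on the Titanic data is not shown here.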

Predicting Survival

Now we are ready to play the game: “Would I have survived the Titanic?”

Insert your own data in the code. Look at your analysis and consider how you would travel today.

import numpy as np

# Logistic regression model is used to predict the probability

# Define a new passenger
passenger = pd.DataFrame({
    'name': ['John Mortensen'],
    'pclass': [2], # 2nd class picked as it was the median; bargains are my preference, but I don't want poor accommodations
    'sex': ['male'],
    'age': [15],
    'sibsp': [1], # I usually travel with my wife
    'parch': [1], # currently I have 1 child at home
    'fare': [16.00], # median fare picked assuming it is 2nd class
    'embarked': ['S'], # majority of passengers embarked in Southampton
    'alone': [False] # travelling with family (spouse and child)
})

display(passenger)
new_passenger = passenger.copy()

# Preprocess the new passenger data
new_passenger['sex'] = new_passenger['sex'].apply(lambda x: 1 if x == 'male' else 0)
new_passenger['alone'] = new_passenger['alone'].apply(lambda x: 1 if x == True else 0)

# Encode 'embarked' variable
onehot = enc.transform(new_passenger[['embarked']]).toarray()
cols = ['embarked_' + val for val in enc.categories_[0]]
new_passenger[cols] = pd.DataFrame(onehot, index=new_passenger.index)
new_passenger.drop(['name'], axis=1, inplace=True)
new_passenger.drop(['embarked'], axis=1, inplace=True)

display(new_passenger)

# Predict the survival probability for the new passenger
dead_proba, alive_proba = np.squeeze(logreg.predict_proba(new_passenger))

# Print the survival probability
print('Death probability: {:.2%}'.format(dead_proba))  
print('Survival probability: {:.2%}'.format(alive_proba))
             name  pclass   sex  age  sibsp  parch  fare  embarked  alone
0  John Mortensen       2  male   15      1      1  16.0         S  False

   pclass  sex  age  sibsp  parch  fare  alone  embarked_C  embarked_Q  embarked_S
0       2    1   15      1      1  16.0      0         0.0         0.0         1.0
Death probability: 63.49%
Survival probability: 36.51%
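
One way to explore how a single choice changes the prediction is to copy the passenger, vary one feature, and compare the predicted probabilities. A sketch on hypothetical data; model and base are illustrative stand-ins for logreg and new_passenger, and the tiny training table is made up:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: lower class number and sex == 0 favour survival
train = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 1, 3, 2, 3],
    "sex":      [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "survived": [1, 1, 1, 0, 1, 0, 1, 0, 1, 0],
})
model = LogisticRegression().fit(train[["pclass", "sex"]], train["survived"])

# Baseline passenger, then the same passenger in each ticket class
base = pd.DataFrame({"pclass": [2], "sex": [1]})
what_if = pd.concat([base.assign(pclass=c) for c in (1, 2, 3)], ignore_index=True)

# Column 1 of predict_proba is the probability of class 1 (survived)
what_if["survival_proba"] = model.predict_proba(what_if[["pclass", "sex"]])[:, 1]
print(what_if)
```

The same pattern would work with the notebook's model by building the variants from new_passenger and calling logreg.predict_proba on them.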

Improve your chances

Is there anything you could do to improve your chances?

# Decision tree model is used to determine the importance of each feature

importances = dt.feature_importances_
for feature, importance in zip(X.columns, importances): # X.columns holds the feature names in the order the tree was trained on
    print(f'The importance of {feature} is: {importance}')
The importance of pclass is: 0.14782910650558925
The importance of sex is: 0.27345943069742495
The importance of age is: 0.2581901376876413
The importance of sibsp is: 0.06088147063013274
The importance of parch is: 0.015768208157955033
The importance of fare is: 0.22586293772938693
The importance of alone is: 0.0
The importance of embarked_C is: 0.004181924322029404
The importance of embarked_Q is: 0.0
The importance of embarked_S is: 0.01382678426984039
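
The importances are easier to read when sorted from largest to smallest. A minimal sketch, using a small hypothetical dataset in which one feature fully determines the label; tree, features and labels are illustrative names standing in for dt, X_train and y_train:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 'sex' fully determines the label, 'noise' is constant
features = pd.DataFrame({
    "sex":   [0, 0, 0, 1, 1, 1, 0, 1],
    "noise": [5, 5, 5, 5, 5, 5, 5, 5],
})
labels = [1, 1, 1, 0, 0, 0, 1, 0]

tree = DecisionTreeClassifier(random_state=0).fit(features, labels)

# Pair each importance with its column name and sort, largest first
ranked = pd.Series(tree.feature_importances_, index=features.columns).sort_values(ascending=False)
print(ranked)
```

Feature importance describes what the fitted tree relied on; it does not by itself say which way a feature pushes the prediction.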