Executive Summary

  This report applies data understanding and preparation methods to the Titanic dataset from Kaggle using R. Its objectives are to (i) build a machine learning model that groups Titanic passengers into similar clusters, analyze the resulting passenger profiles, and recommend what could have been done to create a better customer experience; (ii) identify which independent variables in the dataset contribute most to passenger survival in the 1912 tragedy; and (iii) determine whether Logistic Regression is a suitable model for predicting the probability of survival.

  The Cross-Industry Standard Process for Data Mining (CRISP-DM) is applied to this project as a guiding framework that covers both business issues and technical analysis. CRISP-DM divides the data mining process into six phases, as shown in Figure 1.

Figure 1 CRISP-DM Phases (“Data Mining using CRISP-DM methodology”, 2021)

Data Understanding

| Variable Name | Description | Type | Value |
|---|---|---|---|
| PassengerID | Index number for the record | INT | Min 1, Max 1309 |
| Survived | Survival status of the passenger | INT | 0 = NO, 1 = YES |
| Pclass | Ticket class of the passenger | INT | 1 = 1st class, 2 = 2nd class, 3 = 3rd class |
| Name | Passenger name | CHR | |
| Sex | Gender of the passenger | CHR | |
| Age | Age of the passenger | NUM | Min 0.17, Max 80 |
| SibSp | No. of siblings/spouses onboard | INT | Min 0, Max 8 |
| Parch | No. of parents/children onboard | INT | Min 0, Max 9 |
| Ticket | Ticket number | CHR | |
| Fare | Ticket fare | NUM | Min 0, Max 512.33 |
| Cabin | Cabin number | CHR | |
| Embarked | Port of embarkation | CHR | C = Cherbourg, Q = Queenstown, S = Southampton |

Table 1 List of variables with their descriptions

  The Titanic Machine Learning dataset from Kaggle was used for this project to build a machine learning model using K-means Clustering and Logistic Regression techniques. The dataset included a total of 1,309 records with 12 attributes, as listed in Table 1. 

Data Preparation

Dealing with Missing Values

| Variable Name | Missing Values |
|---|---|
| PassengerID | No missing value |
| Survived | No missing value |
| Pclass | No missing value |
| Name | No missing value |
| Sex | No missing value |
| Age | 263 missing values |
| SibSp | No missing value |
| Parch | No missing value |
| Ticket | No missing value |
| Fare | 1 missing value |
| Cabin | No missing value |
| Embarked | No missing value |

Table 2 List of variables with missing values

  Table 2 shows which variables have missing values in the selected dataset. Records with missing values are excluded from the training dataset.
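Per-variable counts like those in Table 2 can be obtained in base R. The following is a minimal sketch on a made-up data frame (`toy` and its values are illustrative assumptions, not the report's `df`):

```r
# Toy data frame standing in for the Titanic data (values are illustrative only)
toy <- data.frame(
  Age  = c(22, NA, 35, NA),
  Fare = c(7.25, 71.28, NA, 8.05),
  Sex  = c("male", "female", "female", "male")
)

# Number of NA cells in each column
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Number of rows left after listwise deletion
complete_rows <- nrow(na.omit(toy))
print(complete_rows)
```

Note that `is.na()` does not flag empty strings, so blank text fields are only counted as missing if they were read in as `NA` (for example with `na.strings = ""` in `read.csv()`).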

Remove all missing values

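library(dplyr)
library(tidyr)

# Drop every record that contains at least one missing value
df2 <- na.omit(df)

# Flag NA cells, then count per variable (the flagging step was lost from the original listing; is.na() is assumed)
df2 %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(), names_to = "variables", values_to = "missing") %>%
  count(variables, missing)

list_missing <- colnames(df2)[apply(df2, 2, anyNA)]  # expected to be empty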

Figure 2 Screenshot of a chart generated to verify no missing values

Data Transformation & Feature Selection

  We transformed the “Sex” and “Embarked” variables because both were originally encoded as character (“CHR”). To include them in the model building process, we converted them to numeric (“NUM”).

| Variable Name | Original Type | Target Type | Value |
|---|---|---|---|
| Embarked | CHR | NUM | C = 12, Q = 26, S = 28 |

Table 2 List of selected variables to be transformed

Figure 3 Screenshot of the dataset with Embarked and Sex variables transformed
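One way such a character-to-numeric recoding could be written is with named lookup vectors. This is only a sketch: the `toy` data frame is made up, and the numeric codes are the ones stated in the report's tables (Sex: 1 = male, 0 = female; Embarked: C = 12, Q = 26, S = 28).

```r
# Toy rows; the numeric codes follow the mapping stated in the report
toy <- data.frame(
  Sex      = c("male", "female", "female"),
  Embarked = c("S", "C", "Q"),
  stringsAsFactors = FALSE
)

# Named lookup vectors: character value -> numeric code
sex_codes      <- c(male = 1, female = 0)
embarked_codes <- c(C = 12, Q = 26, S = 28)

toy$Sex      <- unname(sex_codes[toy$Sex])
toy$Embarked <- unname(embarked_codes[toy$Embarked])
str(toy)
```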

  Because poor-quality data lead to less accurate and less reliable results, we removed the non-essential variables (i) PassengerID, (ii) Name, (iii) Ticket, and (iv) Cabin from the model training dataset. Table 3 lists the variables used in the model building process.

| Variable Name | Description | Type | Value |
|---|---|---|---|
| Survived | Survival status of the passenger | INT | 0 = NO, 1 = YES |
| Pclass | Ticket class of the passenger | INT | 1 = 1st class, 2 = 2nd class, 3 = 3rd class |
| Sex | Gender of the passenger | NUM | 1 = MALE, 0 = FEMALE |
| Age | Age of the passenger | NUM | Min 0.17, Max 80 |
| SibSp | No. of siblings/spouses onboard | INT | Min 0, Max 8 |
| Parch | No. of parents/children onboard | INT | Min 0, Max 9 |
| Fare | Ticket fare | NUM | Min 0, Max 512.33 |
| Embarked | Port of embarkation | NUM | C = 12, Q = 26, S = 28 |

Table 3 List of selected variables after Transformation & Filtering

  Passenger ID, Name, Ticket, and Cabin variables were excluded as they are either nominal variables or unbalanced with too many missing values.

Figure 4 Screenshot of the transformed dataset with only selected features


  Model selection is a crucial stage of every data science project: the team must choose the algorithms that best serve its end goals. These algorithms are generally classified as Supervised, Unsupervised, or Reinforcement Learning methods. For this project, we chose K-means Clustering to find similarities among passengers and Logistic Regression to predict passenger survival during the disaster.

Figure 5 Various Categories of Machine Learning Algorithms

K-means Clustering 

  K-means clustering is a centroid-based clustering technique that partitions the raw data into k groups, where k is a positive integer. Observations are grouped by calculating and minimizing the distance between the data points and their respective cluster centroids. The number of centroids equals the number of clusters required.


Figure 6 Flow Chart of K-Means Clustering Algorithm

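  The objective of K-means is to minimize the total distortion, or within-cluster sum of squares (WCSS), which is the sum of the squared distances from each observation in a cluster to that cluster's centroid:

$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where Ci is the i-th cluster and µi is the mean of the points in Ci.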
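The objective can be checked numerically. The sketch below recomputes the within-cluster sum of squares by hand and compares it with the value reported by `kmeans()`. It uses the built-in `USArrests` data as a stand-in, and `k = 3` is an arbitrary choice for illustration; neither comes from the report.

```r
set.seed(42)
x  <- scale(USArrests)                      # built-in data used as a stand-in
k  <- 3                                     # arbitrary k for illustration
km <- kmeans(x, centers = k, nstart = 25)

# Sum of squared distances of each point to its own cluster centroid
wcss_manual <- sum(sapply(seq_len(k), function(i) {
  pts <- x[km$cluster == i, , drop = FALSE]
  sum(sweep(pts, 2, km$centers[i, ])^2)     # subtract the centroid column-wise
}))

print(c(manual = wcss_manual, kmeans = km$tot.withinss))
```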

  One of the vital steps of K-means clustering is choosing the optimal K value, which is more of an art than a science. In practice, an approximate K value is usually provided by the business. However, three popular methods can help determine the optimal K value:

1. Elbow Method

2. Silhouette Method

3. Gap Statistic
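The first two methods can be sketched with base R and the `cluster` package (a recommended package bundled with most R distributions). This is an illustrative sketch on the built-in `USArrests` data, not the report's code; the range `2:8` for k is an assumption.

```r
library(cluster)   # provides silhouette()

set.seed(42)
x  <- scale(USArrests)   # stand-in data
ks <- 2:8                # candidate numbers of clusters (assumed range)

# Elbow method: total within-cluster sum of squares for each k
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# Silhouette method: average silhouette width for each k
d <- dist(x)
avg_sil <- sapply(ks, function(k) {
  cl <- kmeans(x, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

print(data.frame(k = ks, wss = round(wss, 2), avg_sil = round(avg_sil, 3)))
best_k_sil <- ks[which.max(avg_sil)]   # k with the highest average silhouette width
```

For the elbow method, k is read off the point where the `wss` curve flattens. For the silhouette method, the k with the largest average width is preferred.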

Visualising the distances between data points

Figure 7 Distance Matrix to visualize the potential dissimilarities

  From the output of functions get_dist and fviz_dist from the factoextra R package, the above heatmap illustrates passenger profiles with large dissimilarities (red) in contrast to those that appear to be similar.
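In base R, `dist()` produces a comparable pairwise distance matrix. The sketch below assumes Euclidean distance on standardised variables and uses five rows of the built-in `USArrests` data as a stand-in for the passenger records.

```r
x <- scale(USArrests[1:5, ])          # five rows of stand-in data, standardised
d <- dist(x, method = "euclidean")    # pairwise Euclidean distances

m <- as.matrix(d)                     # full symmetric matrix with a zero diagonal
print(round(m, 2))
```

Larger entries correspond to more dissimilar pairs, which is what the heatmap colours encode.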

First cluster plot with two centroids

Figure 8 Results of k=2 clustering as an initial screening of data variances

Within cluster sum of squares by cluster:

[1] 3677.466    2864.141 

(between_SS / total_SS =  21.5 %)

  As seen in the above result, the two clusters explain only approximately 21.5% of the total variance in this dataset.
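The `between_SS / total_SS` figure printed by `kmeans()` can be recomputed from the fitted object. A minimal sketch on stand-in data (`USArrests`, not the Titanic dataset):

```r
set.seed(42)
x  <- scale(USArrests)                 # stand-in data
km <- kmeans(x, centers = 2, nstart = 25)

# Share of total variance explained by the clustering
explained <- km$betweenss / km$totss
print(round(100 * explained, 1))

# The total sum of squares splits into between- and within-cluster parts
stopifnot(isTRUE(all.equal(km$totss, km$betweenss + km$tot.withinss)))
```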

Subsequent plots with more random centroids

Figure 9 Visualisation of clusters with some random numbers of centroids

  Since two clusters were not sufficient to account for the groupings, several other numbers of centroids were tried to assess the optimal number of clusters.

The total within-clusters sum of squares

Figure 10 Chart showing the gradual reduction in the sum of squares as the number of clusters increased

Elbow method

Figure 11 Chart showing that k=4 or k=5 might be the optimal number of clusters based on the Elbow Method. 

Average Silhouette Width

Figure 12 Chart showing that k=4 might be the optimal number of clusters based on the Average Silhouette Method since it produced the highest Average Silhouette Width between the clusters. 

Gap Statistic

Figure 13 Gap Statistic chart showing k=14 as the optimal number of clusters, meaning that this clustering structure differs most from what a random uniform distribution of data points would produce.
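A gap statistic of this kind can be computed with `clusGap()` from the `cluster` package. The sketch below is illustrative only: it runs on the built-in `USArrests` data, and `K.max = 8` and `B = 50` are assumed settings rather than the ones used for Figure 13.

```r
library(cluster)   # provides clusGap() and maxSE()

set.seed(42)
x <- scale(USArrests)   # stand-in data

# Compare the clustering of the data with B uniformly drawn reference data sets
gap <- clusGap(x, FUNcluster = kmeans, K.max = 8, B = 50,
               nstart = 25, verbose = FALSE)

# Select k with the "firstSEmax" rule (the package default)
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
print(best_k)
```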

  Across the three methods presented above, k=4 and k=14 were identified as candidate numbers of clusters. The Within-Cluster Sum of Squares (WSS) was therefore computed for both values of k to determine the final number of clusters.

Results for k=4 Cluster Plot

Figure 14 Visualization of 4 clusters

Figure 15 Results of k=4 clustering

Within cluster sum of squares by cluster:

[1]  498.1963 1675.3273 1133.6671 1419.7128

 (between_SS / total_SS =  43.3 %)

  Therefore, the four clusters explain approximately 43.3% of the total variance in this dataset.

Results for k=14 Cluster Plot

Figure 16 Visualization of k=14 clusters

Figure 17 Results of k=14 clustering

Within cluster sum of squares by cluster:

 [1] 109.85184 171.84435 279.39369 303.61189 243.56093

 [6] 172.22166  99.74677 144.01286 180.39667 144.61924

[11]  83.58127  94.80291  49.31257 277.18289

 (between_SS / total_SS =  71.8 %)

  Therefore, the fourteen clusters explain approximately 71.8% of the total variance in this dataset.

  Based on the results above, k=14 was chosen as the number of clusters for further analysis because it explains a larger share of the total variance (71.8% versus 43.3%).

Mapped cluster numbers

Figure 18 Screenshot of data frame with cluster numbers being assigned back to names of the passengers

Passenger profiles of major clusters

| Cluster No. | Size | Assessment of Passengers’ Profile |
|---|---|---|
| 4 | 227 | About 97.79% of the passengers embarked from Southampton; about 10.13% are female; about 9.25% survived; the majority paid a fare of about $7 to $9 |
| 9 | 127 | About 99.21% of the passengers embarked from Southampton; about 2.36% are female; about 3.14% survived; the majority paid a fare of about $13 |
| 10 | 109 | About 91.74% of the passengers embarked from Southampton; about 100% are female; about 100% survived; the majority paid a fare of about $7 to $15 |

Table 4 Customers’ profiles within the major clusters computed

  The larger clusters were chosen for further assessment of their inferred purchasing power and demographics. 
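Profile figures like those in Table 4 could be derived by summarising each feature within a cluster. The sketch below uses a made-up `toy` data frame; the column names mirror the report's variables, and Sex is assumed to be coded 1 = male, 0 = female.

```r
# Toy data: cluster label, Sex (1 = male, 0 = female), Survived, Fare
toy <- data.frame(
  cluster  = c(1, 1, 1, 2, 2),
  Sex      = c(1, 1, 0, 0, 0),
  Survived = c(0, 0, 1, 1, 1),
  Fare     = c(7.25, 8.05, 7.90, 13.00, 15.00)
)

# One row per cluster: size, share of women, share of survivors, median fare
profile <- data.frame(
  cluster      = sort(unique(toy$cluster)),
  size         = as.vector(table(toy$cluster)),
  pct_female   = as.vector(tapply(toy$Sex == 0, toy$cluster, mean)) * 100,
  pct_survived = as.vector(tapply(toy$Survived, toy$cluster, mean)) * 100,
  median_fare  = as.vector(tapply(toy$Fare, toy$cluster, median))
)
print(profile)
```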

Recommendation for better customer experience

  Firstly, the major forms of entertainment on Titanic (“Onboard Titanic,” n.d.) were: 

  1. Gambling
  2. Drinking
  3. Turkish Bath
  4. Gyms
  5. Shuffleboard
  6. Cricket
  7. Bull Board
  8. Tennis
  9. Chess
  10. Daily Sweepstakes

  Since the majority of the passengers within the three clusters were male passengers from Southampton with a middle range of purchasing power, as inferred from the fares they paid, a customer experience designed around this group could have enhanced the overall cruise experience.

  When the Titanic set off in 1912, football was already a popular sport in England, according to published archives. The following top-performing Southampton F.C. players of that era could therefore have been invited aboard the Titanic for a meet-and-greet and friendly matches, instead of having Cricket and Tennis as the sports entertainment. This would likely have boosted demand for the Titanic experience even further.

Top scoring Southampton F.C. players from Year 1908 to 1912

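| Player Name | Nationality | Pos | Club Career | Starting Appearances | Substitute Appearances | Total Appearances | Total Goals Scored |
|---|---|---|---|---|---|---|---|
| Arthur Hughes | England | FW | 1908–1909 | 28 | N/A | 28 | 18 |
| Frank Jordan | England | FW | 1908–1910 | 56 | N/A | 56 | 10 |
| Robert Carter | England | FW | 1909–1910 | 45 | N/A | 45 | 14 |
| Charlie McGibbon | England | FW | 1909–1910 | 32 | N/A | 32 | 24 |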

Table 5 Source: (“List of Southampton F.C. players (25–99 appearances),” 2012)

Logistic Regression

  A regression model with a categorical target variable that takes only two possible values (0 and 1) is known as Logistic Regression (LR). To model a binary dependent variable, LR uses the logistic (sigmoid) function.
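The logistic function maps any log-odds value to a probability between 0 and 1. A short sketch in R, where the input values in `z` are arbitrary examples:

```r
# Logistic (sigmoid) function written out by hand
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- c(-4, -1, 0, 1, 4)          # example log-odds values
p <- sigmoid(z)
print(round(p, 3))               # 0.018 0.269 0.500 0.731 0.982

# plogis() is base R's built-in logistic function; qlogis() is its inverse (the logit)
stopifnot(isTRUE(all.equal(p, plogis(z))))
stopifnot(isTRUE(all.equal(qlogis(p), z)))
```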

Logistic regression can be subdivided into three types based on the type of response variable:

1. Binomial Logistic Regression – Only two possible values for target variables

2. Multinomial Logistic Regression – Three or more values, but no fixed order of preference.

3. Ordinal Logistic Regression – Three or more possible values, each of which has a preference or order

Split the data into training and testing set

# split into training (80%) and testing set (20%)

sample_size = round(nrow(df7)*.80) 

index <- sample(seq_len(nrow(df7)), size = sample_size)

train_set <- df7[index, ]

test_set <- df7[-index, ]

train_label <- train_set$Survived

test_label <- test_set$Survived

The above R code was used to divide the dataset into 80% training and 20% testing data. 

Fitting a Logistic Regression model using the training set

Figure 18 Screenshot of the Logistic Regression Model fitted using training set 

  Each coefficient gives the change in the log odds of the outcome for a one-unit increase in the corresponding predictor variable. The table shows that the PClass, Sex, and Age variables were all statistically significant.
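The fitting code itself appears only as a screenshot, so the following is a sketch of how a binomial logistic regression is typically fitted with `glm()`. The data are simulated: the data frame `sim`, the object name `fit`, and the coefficients used to generate the outcome are all assumptions made for illustration, not values from the report.

```r
set.seed(42)
n <- 500

# Simulated stand-in for the training set (not the Titanic data)
sim <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = rbinom(n, 1, 0.6),              # 1 = male, 0 = female
  Age    = round(runif(n, 1, 80))
)

# Assumed "true" log odds, chosen only to generate an outcome
log_odds     <- 3.5 - 1.0 * sim$Pclass - 2.5 * sim$Sex - 0.03 * sim$Age
sim$Survived <- rbinom(n, 1, plogis(log_odds))

# Binomial GLM with the default logit link
fit <- glm(Survived ~ Pclass + Sex + Age, data = sim, family = binomial)
print(summary(fit)$coefficients)
```

Each estimate in the coefficient table is on the log-odds scale, and the `Pr(>|z|)` column is what a statement about statistical significance refers to.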

Confidence intervals of coefficient estimates

Figure 19 Screenshot of the computed confidence intervals

> # Test the overall effect of the PClass, Sex and Age as independent variables

> wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 2:4)

Wald test:


Chi-squared test:

X2 = 258.7, df = 3, P(> X2) = 0.0

  The Wald test gave a chi-squared statistic of 258.7 with 3 degrees of freedom and a p-value reported as 0.0, indicating that the overall effect of PClass, Sex, and Age is statistically significant.
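The Wald statistic for a joint test that several coefficients are zero can also be computed by hand in base R. The sketch below fits a model to simulated data, so the data frame `sim`, the object `fit`, and the generating coefficients are illustrative assumptions. Only the formula mirrors the test reported above.

```r
set.seed(42)
n <- 500
sim <- data.frame(Pclass = sample(1:3, n, replace = TRUE),
                  Sex    = rbinom(n, 1, 0.6),
                  Age    = round(runif(n, 1, 80)))
sim$Survived <- rbinom(n, 1, plogis(3.5 - sim$Pclass - 2.5 * sim$Sex - 0.03 * sim$Age))
fit <- glm(Survived ~ Pclass + Sex + Age, data = sim, family = binomial)

terms <- 2:4                         # positions of Pclass, Sex, Age in coef(fit)
b <- coef(fit)[terms]
V <- vcov(fit)[terms, terms]

# Wald statistic W = b' V^{-1} b, chi-squared with length(b) degrees of freedom
W       <- as.numeric(t(b) %*% solve(V) %*% b)
p_value <- pchisq(W, df = length(b), lower.tail = FALSE)
print(c(X2 = W, df = length(b), p = p_value))
```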

Exponentiate the coefficients and interpret them as odds-ratios

Figure 20 Screenshot of the computed odds ratios bound to the coefficients’ confidence intervals

  When the coefficients were exponentiated, the odds ratios showed that Fare does not affect the odds of survival. In contrast, the odds ratio for Sex indicated that males had a very low chance of survival.
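To illustrate how exponentiated coefficients are read, the sketch below converts two hypothetical log-odds coefficients into odds ratios. The values in `log_odds_coef` are made up for illustration and are not the estimates from the fitted model.

```r
# Hypothetical coefficients on the log-odds scale (not taken from the report's model)
log_odds_coef <- c(Sex = -2.5, Fare = 0.001)

odds_ratio <- exp(log_odds_coef)
pct_change <- 100 * (odds_ratio - 1)   # percentage change in the odds per one-unit increase

print(round(rbind(odds_ratio, pct_change), 3))
```

An odds ratio close to 1 (as for the hypothetical Fare value) means the odds barely change per unit increase, while an odds ratio well below 1 (as for the hypothetical Sex value) means the odds drop sharply when the predictor goes from 0 to 1.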

Prediction on training set with Logistic Regression

# Create prediction for passengers' survival on training set
train_set$SurvivedP <- predict(mylogit, newdata = train_set, type = "response")

# Predict on the link (log-odds) scale together with standard errors
df8 <- cbind(train_set, predict(mylogit, newdata = train_set, type = "link",
    se.fit = TRUE))

# Back-transform the fit and its 95% interval bounds to probabilities
df8 <- within(df8, {
    PredictedProb <- plogis(fit)
    LL <- plogis(fit - (1.96 * se.fit))
    UL <- plogis(fit + (1.96 * se.fit))
})


Figure 22 Screenshot of prediction done on training set for result comparison purpose

Prediction on test set with Logistic Regression

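# Create a prediction on test set
test_set$SurvivedP <- predict(mylogit, newdata = test_set, type = "response")

# Predict on the link (log-odds) scale together with standard errors
df20 <- cbind(test_set, predict(mylogit, newdata = test_set, type = "link",
    se.fit = TRUE))

# Back-transform the fit and its 95% interval bounds to probabilities
df20 <- within(df20, {
    PredictedProbTest <- plogis(fit)
    LL <- plogis(fit - (1.96 * se.fit))
    UL <- plogis(fit + (1.96 * se.fit))
})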


Figure 23 Screenshot of prediction done on test set for result comparison purpose

Compute Cross Table, Accuracy and Error Rate of Logistic Regression Model

  Because of incompatible versions of the R packages, libraries, and RStudio associated with the confusion matrix function, the originally intended code could not be executed:

# Create Confusion Matrix

Cmatrix <- confusionMatrix(data=prediction_test, reference = df7$Survived)

  The following cross tables and calculations were therefore produced manually instead:

Training Set Results

Figure 24 Screenshot of crosstable calculations
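A cross table of this kind can be built in base R with `table()`. In the sketch below the predicted probabilities and labels are made-up values, and the 0.5 classification threshold is an assumption, since the report does not state which cut-off was used.

```r
# Hypothetical predicted probabilities and true labels (illustrative values)
pred_prob <- c(0.91, 0.12, 0.65, 0.40, 0.08, 0.77)
actual    <- c(1,    0,    0,    1,    0,    1)

# Classify as survived when the predicted probability exceeds 0.5 (assumed threshold)
pred_class <- as.integer(pred_prob > 0.5)

# Fixing the factor levels keeps the table 2 x 2 even if a class is absent
cross_tab <- table(
  Actual    = factor(actual,     levels = c(0, 1)),
  Predicted = factor(pred_class, levels = c(0, 1))
)
print(cross_tab)
```

With rows as actual and columns as predicted classes, the cells read TN (0,0), FP (0,1), FN (1,0), and TP (1,1).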

Accuracy = (TN + TP) / (TN + FN + FP + TP)

= (452 + 261) / (452 + 74 + 47 + 261)  

= 0.854916

Therefore, the Accuracy of training set is approximately 85%. 

Error Rate = (FN + FP) / (TN + FN + FP + TP)

= (74 + 47) /  (452 + 74 + 47 + 261) 

= 0.145084

Therefore, the Error Rate of training set is approximately 15%

Testing Set Results

Figure 24 Screenshot of prediction done on test set for result comparison purpose

Logistic Regression Accuracy = (TN + TP) / (TN + FN + FP + TP)

= (114 + 62) / (114 + 18 + 15 + 62)  

= 0.842105

Therefore, the Accuracy of test set prediction is approximately 84%. 

Logistic Regression Error Rate = (FN + FP) / (TN + FN + FP + TP)

= (18 + 15) /  (114 + 18 + 15 + 62) 

= 0.157895

Therefore, the Error Rate of test set prediction is approximately 16%
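The hand calculations above can be reproduced with a small helper. The function name `rates` is hypothetical; the counts are the TN, FN, FP, and TP values used in the formulas above.

```r
# Accuracy and error rate from the four cells of a 2 x 2 cross table
rates <- function(tn, fn, fp, tp) {
  total <- tn + fn + fp + tp
  c(accuracy = (tn + tp) / total, error_rate = (fn + fp) / total)
}

train_rates <- rates(tn = 452, fn = 74, fp = 47, tp = 261)   # training set counts
test_rates  <- rates(tn = 114, fn = 18, fp = 15, tp = 62)    # test set counts

print(round(rbind(train = train_rates, test = test_rates), 6))
```

The small gap between training accuracy (about 85%) and test accuracy (about 84%) suggests the model is not badly overfitted to the training set.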

Logistic Regression Model Summary


> with(mylogit, null.deviance - deviance)

[1] 493.2767

> with(mylogit, df.null - df.residual)

[1] 7

> with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))

[1] 2.235123e-102

> logLik(mylogit)

'log Lik.' -315.2161 (df=8)

  Therefore, the chi-square statistic of 493.2767 with 7 degrees of freedom and a p-value of less than 0.001 indicates that the model fits significantly better than an empty (intercept-only) model.


  1. K-means clustering and Logistic Regression were successfully applied to model the Titanic dataset.
  2. k=4 and k=14 were identified as candidate numbers of clusters for grouping the Titanic passengers based on the Gap Statistic, Elbow, and Average Silhouette Width methods.
  3. Three major clusters of passengers were identified, and a recommendation for a better customer experience was made for this customer segment.
  4. Logistic Regression showed that the effects of the PClass, Sex, and Age variables on survival were statistically significant. The fitted model took 5 Fisher scoring iterations to converge and achieved approximately 84% accuracy in predicting passengers’ survival on the test set.

  What the models can do is predict the probability of survival of a person from our own time, had that person been born in the late 19th century and boarded the Titanic in 1912 as a teenager from Queenstown. Likewise, the profiles of the k-means clusters are only as good as the context that the features give the researcher for interpretation. The key takeaway from this study is that this clustering technique could be applied in advance to passengers on other ships and routes to give the service provider meaningful information to act on. Such market intelligence could help marketing and operations functions craft better itineraries and make customer-focused logistic arrangements using a design thinking approach.


Data mining using CRISP-DM methodology. (n.d.). Engineering Education (EngEd) Program | Section.

List of Southampton F.C. players (25–99 appearances). (2012, November 15). Wikipedia, the free encyclopedia. Retrieved November 25, 2021.

Onboard Titanic. (n.d.). The History Press | The destination for history.