A survey of machine learning algorithms and approaches in different research areas

Table of contents

  • 1. Introduction
  • 2. Types of machine learning (ML) approaches
  • 2.1 Supervised Learning
  • 2.2 Unsupervised Learning
  • 2.3 Reinforcement Learning
  • 3. Summarised characteristics of the applicable ML algorithms in this survey
  • 4. Selected Research Areas For Comparison and Discussion
  • 4.1 Indoor Localization
  • 4.1.1 Common algorithms used in Indoor Localization
  • 4.1.2 Key issues
  • 4.1.3 Comparisons of results and performance
  • 4.1.4 Conclusion for ML approach in Indoor Localization
  • 4.2 Sepsis Detection in Biomedical Science
  • 4.2.1 Common algorithms used in Sepsis Detection
  • 4.2.2 Key issues
  • 4.2.3 Comparisons of results and performance
  • 4.2.4 Conclusion for ML approach in Sepsis detection
  • 5. Survey Summary
  • References

1. Introduction

  The following sections survey machine learning algorithms and their application to different research areas. The survey includes a critical review of current literature and draws conclusions on which machine learning approaches best fit particular types of problem, and why. The conclusions are supported by discussion of, and references to, more than three machine learning algorithms that have performed successfully in different research areas.

2. Types of machine learning (ML) approaches

2.1 Supervised Learning

  Supervised learning involves training a model to map a set of inputs to their known outputs, whether categories (classification) or continuous values (regression). Use cases range from image recognition to sales forecasting.

2.2 Unsupervised Learning

  Unsupervised learning, on the other hand, approaches a dataset from a more exploratory angle, without known outputs, to surface hidden patterns that may not be obvious to human perception because each instance has many dimensions. Use cases include clustering customers’ profiles and topic modeling on customers’ feedback.

2.3 Reinforcement Learning

  More commonly used in robotics, reinforcement learning relies on trial and error: an agent takes actions, receives rewards or penalties for them, and adjusts its choices to maximize the reward it accumulates over time.
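  The trial-and-error loop described above can be sketched with a two-armed bandit and an epsilon-greedy agent. This is a minimal illustration only: the function name, the reward probabilities, and the values of epsilon and steps are assumptions chosen for the example and are not taken from any cited study.

```python
import random

def epsilon_greedy_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Learn which arm pays best by trial and error.

    reward_probs: hypothetical probability of a reward of 1.0 for each arm.
    Returns (estimated value per arm, number of pulls per arm).
    """
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    values = [0.0] * len(reward_probs)
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm.
            arm = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the arm with the best estimate so far.
            arm = max(range(len(values)), key=values.__getitem__)
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update of the estimated value of this arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts

values, counts = epsilon_greedy_bandit([0.2, 0.8])
print(values, counts)
```

  With these assumed settings the agent ends up pulling the better-paying arm far more often, which is the reward-driven behaviour the paragraph describes.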

3. Summarised characteristics of the applicable ML algorithms in this survey

  1. XGBoost and Random Forest (RF)

  XGBoost is a versatile supervised ensemble learning algorithm that handles both classification and regression problems. Its key advantages include added regularization and tunable hyper-parameters that help to prevent overfitting (Niang et al., 2021, p. 145; Chen & Guestrin, 2016). RF is another supervised ensemble learning algorithm suited to both classification and regression. Whereas XGBoost builds its trees sequentially, with each new tree learning from the errors of the previous ones, RF grows multiple decision trees independently and combines their outputs by majority vote (classification) or averaging (regression) to produce the final result.
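  The contrast between the two ensemble strategies can be sketched with one-split regression stumps. This is a toy illustration under stated assumptions, not the XGBoost or RF implementation: real RF also subsamples features, and XGBoost adds regularization and second-order gradient information. All function names and the sample data are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump; returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # every x identical: predict the mean everywhere
        m = sum(ys) / len(ys)
        return (xs[0], m, m)
    return best[1:]

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def bagging(xs, ys, n_trees=25, seed=0):
    """RF-style: fit each stump on a bootstrap sample, then average the outputs."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(predict_stump(s, x) for s in stumps) / len(stumps)

def boosting(xs, ys, n_trees=25, lr=0.5):
    """Boosting-style: each new stump is fitted to the errors left by the previous ones."""
    base = sum(ys) / len(ys)
    residuals = [y - base for y in ys]
    stumps = []
    for _ in range(n_trees):
        s = fit_stump(xs, residuals)
        stumps.append(s)
        residuals = [r - lr * predict_stump(s, x) for x, r in zip(xs, residuals)]
    return lambda x: base + lr * sum(predict_stump(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]  # hypothetical training data
print(mse(bagging(xs, ys), xs, ys), mse(boosting(xs, ys), xs, ys))
```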

  2. K Nearest Neighbour (KNN)

  KNN is a supervised learning algorithm that is commonly used for classification, though it can handle regression as well. For a given test point, it finds the k nearest training points (k is chosen by the user), typically by Euclidean distance in the multi-dimensional feature space, and predicts from their labels by majority vote or averaging.
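  A minimal sketch of KNN classification with Euclidean distance follows. The function name, the training points, and the labels are hypothetical and serve only to illustrate the mechanism.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train: list of (feature_vector, label) pairs.
    """
    # Sort the training points by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set with two classes.
train = [
    ((0.0, 0.0), "A"), ((0.5, 0.2), "A"), ((0.1, 0.6), "A"),
    ((5.0, 5.0), "B"), ((5.5, 4.8), "B"), ((4.9, 5.3), "B"),
]
print(knn_classify(train, (0.3, 0.3)))
print(knn_classify(train, (5.1, 5.1)))
```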

  3. Logistic Regression (LR) and Support-Vector Machine (SVM)

  Logistic Regression is a supervised learning algorithm that takes a probabilistic approach to classification; it can overfit, particularly when there are many features relative to the number of samples and no regularization is applied. In contrast, SVM is a more versatile supervised learning algorithm that is deterministic and suitable for both classification and regression. For classification, it seeks the hyperplane that separates the dataset into two groups with the largest margin.
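  The difference between a probabilistic and a deterministic output can be sketched with a linear model that has already been trained. The weights and bias below are hypothetical; training itself is not shown.

```python
import math

def linear_score(weights, bias, x):
    """Signed score w.x + b of a point relative to the hyperplane."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def logistic_probability(weights, bias, x):
    """LR-style output: probability of the positive class via the sigmoid."""
    return 1.0 / (1.0 + math.exp(-linear_score(weights, bias, x)))

def svm_decision(weights, bias, x):
    """SVM-style output: a hard label from the side of the hyperplane."""
    return 1 if linear_score(weights, bias, x) >= 0 else -1

# Hypothetical trained parameters for a two-feature problem.
weights, bias = [1.0, 1.0], -1.0
for point in [(2.0, 2.0), (0.5, 0.5), (0.0, 0.0)]:
    print(point, logistic_probability(weights, bias, point), svm_decision(weights, bias, point))
```

  A point lying exactly on the hyperplane receives a probability of 0.5 from the logistic output, whereas the SVM-style rule still returns a single hard label.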

  4. Artificial Neural Networks (ANN), Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM)

  Neural networks (NN) primarily aim to emulate some cognitive abilities of the human brain. Different NNs vary in the number of hidden layers, in whether the hidden layers learn in a recurrent manner, and in how the nodes are connected. They are commonly used for a wide range of classification, prediction, clustering, and association tasks, from image and speech recognition to forex prediction and other more complex problems.
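  The distinction between a feed-forward unit and a recurrent one can be sketched with scalar weights. This is a deliberately reduced illustration: the weights are hypothetical, training is omitted, and real networks use weight matrices and many units per layer.

```python
import math

def feed_forward(x, w, b):
    """One feed-forward unit: the output depends only on the current input."""
    return math.tanh(w * x + b)

def recurrent(sequence, w_x, w_h, b, h0=0.0):
    """One recurrent unit: the hidden state h carries information across steps."""
    h = h0
    for x in sequence:
        h = math.tanh(w_x * x + w_h * h + b)
    return h

# The same final input (0.5) gives different outputs when the history differs.
print(recurrent([1.0, 0.5], w_x=1.0, w_h=0.7, b=0.0))
print(recurrent([0.0, 0.5], w_x=1.0, w_h=0.7, b=0.0))
print(feed_forward(0.5, w=1.0, b=0.0))
```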

4. Selected Research Areas For Comparison and Discussion

4.1 Indoor Localization

  Indoor localization and mapping technology has gained more attention in recent years, driven in part by the rising need to provide a more personalized and enhanced experience for patrons in malls. Other demands include more precise indoor mapping for geospatial applications, clash detection, and digital twins in Building Information Modeling (BIM). These applications have vastly different requirements for data accuracy and resolution: the marketing manager of a mall may only need the area name as output, whereas an engineer may need centimetre- to millimetre-level positional accuracy of a known location.

  Such localization methods are generally based on trilateration or triangulation, using three or more signal-strength measurements to infer or calculate the position of the target (Niang et al., 2021, p. 1). However, one study cited in the following sections examines an alternative method that classifies geo-tagged images.
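  The trilateration principle can be sketched as follows. Two assumptions are made that do not come from the cited studies: signal strength is converted to distance with a log-distance path-loss model, and the reference power (-40 dBm at 1 m) and path-loss exponent (2.0) are hypothetical values. The beacon coordinates are likewise invented for the example.

```python
import math

def rssi_to_distance(rssi_dbm, tx_power_dbm=-40.0, path_loss_exponent=2.0):
    """Estimate distance in metres from RSSI (assumed log-distance path-loss model)."""
    return 10 ** ((tx_power_dbm - rssi_dbm) / (10 * path_loss_exponent))

def trilaterate(beacons, distances):
    """Solve for (x, y) from three beacon positions and three distances.

    Subtracting the first circle equation from the other two removes the
    quadratic terms and leaves a 2x2 linear system, solved by Cramer's rule.
    """
    (x1, y1), (x2, y2), (x3, y3) = beacons
    d1, d2, d3 = distances
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    a2, b2 = 2 * (x3 - x1), 2 * (y3 - y1)
    c2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        raise ValueError("beacons are collinear; position is not unique")
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det

# Hypothetical beacons and a target at (1, 2) with noise-free distances.
beacons = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
target = (1.0, 2.0)
distances = [math.dist(b, target) for b in beacons]
print(trilaterate(beacons, distances))
```

  With real measurements the distances are noisy, so the three circles rarely meet in one point; that is one reason the studies discussed below turn to fingerprinting and learned models instead of a closed-form solution.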

4.1.1 Common algorithms used in Indoor Localization

  1. XGBoost, K Nearest Neighbor (KNN), Support-Vector Machine (SVM), Random Forest (RF) (Niang et al., 2021; Nessa et al., 2020; Jang et al., 2019).
  2. Artificial Neural Networks (ANN) (Nessa et al., 2020).
  3. Image Recognition using ResNet Convolutional Neural Network (CNN) (Nessa et al., 2020; Shahid & Arain, 2021).

4.1.2 Key issues 

  1. Most current methods interpolate and infer position from signal strengths and signal times of arrival, with beacons acting as correlators to estimate the triangulated position, so the target may go undetected once it moves into a dead zone or out of the line of sight (Nessa et al., 2020, p. 8).
  2. In real-world situations, weak line-of-sight and other signal interference may feed inaccurate data and measurements to the model during both training and prediction (Nessa et al., 2020, p. 1; Niang et al., 2021, p. 145). The model may therefore need to be retrained more frequently, and the input data monitored for reliability.

4.1.3 Comparisons of results and performance 

  1. Including results referenced from previous works, RF and XGBoost outperformed the other algorithms, achieving positional accuracy of 1.83 m and 1.77 m respectively with Received Signal Strength Indicator (RSSI)-based fingerprinting data (Niang et al., 2021, p. 147).
  2. Deep learning models (RNN, LSTM, GRU, BiRNN, BiLSTM, BiGRU) achieved an average localization error of 75 cm. However, this was achievable only with RSSI-based fingerprinting and not with Channel State Information (CSI)-based fingerprinting (Nessa et al., 2020, p. 13).
  3. Alternatively, image recognition with a ResNet CNN is the easiest approach to implement, with serverless cloud services available on AWS and Azure, but it is inferior in positional accuracy as measured by localization error. The cited study reports precision and recall above 90% and average accuracy above 95% (Shahid & Arain, 2021, p. 22).
  4. The following conclusion assumes that the cited ResNet CNN study captured sufficient images at data points along the perimeter of each segregated area to match the 0.5 m x 0.5 m grid resolution achieved with RSSI-based fingerprinting.
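  Figures such as an average localization error of 75 cm are commonly obtained by averaging the Euclidean distance between predicted and true positions. The sketch below assumes that definition; the cited studies may compute their metrics differently, and the function name and coordinates are hypothetical.

```python
import math

def mean_localization_error(predicted, actual):
    """Average Euclidean distance (in the units of the inputs) between position pairs."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [math.dist(p, a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)

# Hypothetical positions in metres.
predicted = [(0.0, 0.0), (3.0, 4.0)]
actual = [(0.0, 0.0), (0.0, 0.0)]
print(mean_localization_error(predicted, actual))  # 2.5
```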

4.1.4 Conclusion for ML approach in Indoor Localization

  A supervised deep learning approach using RNN (Nessa et al., 2020) appears to be the most suitable for indoor localization, because the requirements are targeted rather than exploratory and the effect of each input feature may be complex and vary from instance to instance. Results and accuracy depend heavily on which inputs are selected as features, on the types of data, and on how the data are collected.

  XGBoost and Random Forest achieved sub-2 m accuracy on a 0.5 m x 0.5 m mapping grid using RSSI fingerprinting as the baseline data (Niang et al., 2021, p. 145). The ResNet CNN image recognition approach in the cited study is faster and easier to implement (Shahid & Arain, 2021), but the RSSI-based fingerprinting deep learning approach is likely to yield much higher accuracy because the area is mapped offline using measured RSSI values (Niang et al., 2021, p. 147; Nessa et al., 2020, p. 13). However, if localized relative geospatial accuracy of down to 75 cm does not fulfill the project requirement, applying Total Least Squares adjustment together with data from LiDAR remote sensing (Guo et al., 2021, p. 3), georeferenced to the images used in the ResNet CNN, may be more appropriate. Where no existing labeled data can be leveraged for model building, unsupervised learning with a fusion-based localization technique was reportedly more effective (Nessa et al., 2020, p. 11).

4.2 Sepsis Detection in Biomedical Science

  The literature examined for Sepsis detection shows an increasing need for a systematic, data-driven approach to the early detection of this deadly disease. Available data come mainly from a single source, the Electronic Medical Records (EMR), in two types: structured vital-sign data from medical sensors, and unstructured clinical notes recording doctors’ on-scene observations and expert judgments.

4.2.1 Common algorithms used in Sepsis Detection

  1. XGBoost, K Nearest Neighbor (KNN), Random Forest (RF), Logistic Regression (LR) (Ullah et al., 2021). 
  2. Support Vector Machine (SVM), Recurrent Neural Network (RNN), Long Short Term Memory (LSTM) (Giacobbe et al., 2021).
  3. In a recent study supported by the Ministry of Education, Singapore, Sepsis Early Risk Assessment (SERA) was developed that combines two inter-linked algorithms (Goh et al., 2021, p. 2).

4.2.2 Key issues

  1. Even expert physicians’ judgments may differ from one another, which affects the quality and usability of the labels; the choice of input features can be controversial as well. It is also important to consider how IT infrastructure and automation can help to extract unstructured data from the EMR (Giacobbe et al., 2021, p. 3).
  2. Definitions and tell-tale signs of Sepsis evolve over time as expert physicians discover different permutations of symptoms, some of which are still debated among the qualified experts themselves (Giacobbe et al., 2021, p. 2).

4.2.3 Comparisons of results and performance 

  1. SERA, which leverages both structured and unstructured data from the EMR (LDA for topic modeling, followed by Gradient Boosted Trees and finally multi-label classification), achieved an AUC of up to 0.94 (Goh et al., 2021, p. 4).
  2. In the other study, XGBoost applied to structured EMR data only achieved 92.31% accuracy (Ullah et al., 2021, p. 4). However, the author of this paper calculated the AUC from the reported confusion matrix, following Idrees et al. (2017, p. 43), and obtained an AUC of 0.51.
  3. The following conclusion assumes that the data ingested from the EMRs in both studies mentioned above were formatted according to established international medical standards.
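  High accuracy and a near-chance AUC can coexist when the classes are imbalanced. The sketch below assumes the common single-threshold approximation AUC = (TPR + TNR) / 2, which may differ in detail from the formula in the cited reference. The confusion-matrix counts are hypothetical and are not the figures reported in either study; they were chosen only to show the effect.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def single_threshold_auc(tp, fp, tn, fn):
    """Approximate AUC from one confusion matrix as the mean of TPR and TNR."""
    tpr = tp / (tp + fn)  # sensitivity: positives correctly detected
    tnr = tn / (tn + fp)  # specificity: negatives correctly rejected
    return (tpr + tnr) / 2

# Hypothetical imbalanced test set: 80 positive cases, 920 negative cases.
tp, fn, tn, fp = 3, 77, 920, 0
print(accuracy(tp, fp, tn, fn))              # 0.923
print(single_threshold_auc(tp, fp, tn, fn))  # 0.51875
```

  In this invented case the model labels almost every patient as negative: accuracy looks high because negatives dominate, while the AUC approximation stays close to 0.5 because almost no positive case is detected.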

4.2.4 Conclusion for ML approach in Sepsis detection

  In the case of Sepsis detection, the literature cited here is consistent with the presumption that EMR data alone, without additional psychographic data, are sufficient and fit for purpose as input features for the prediction. The results, however, differed considerably between approaches. Supervised ensemble classification using XGBoost did not produce a reasonably good AUC, as seen in Ullah et al. (2021). Although the comparison with other algorithms such as KNN, RF, and LR suggested that XGBoost can produce high accuracy, the more specific comparison using AUC did not suggest the same level of effectiveness.

  At the other end, SERA adopts a sequential combination of unsupervised topic modeling followed by supervised ensemble classification using GBT, producing binary and ultimately multi-label predictions as the final output of the model. This approach produced a much higher AUC of 0.94, compared with 0.51 in Ullah et al. (2021).

  This is likely because SERA allows expert physicians’ judgment to be integrated into the machine learning prediction process in a more contextual manner, through topic modeling (Goh et al., 2021, p. 8) and multi-label classification, rather than having their observations used only as binary labels (Ullah et al., 2021, p. 3). In addition, clustering may be a good initial strategy for identifying hidden patterns among subgroups (Giacobbe et al., 2021, p. 3), which gives a better understanding of whether the number of instances with a specific profile should be increased so that the learning process represents reality more accurately.

5. Survey Summary

In summary, this survey highlights the importance of considering factors beyond the algorithms themselves in order to make a more informed decision on the most suitable machine learning approach for a given scenario. These factors may include the type of sensors, the size of the available data, the meaningfulness of the selected input features, the accuracy measure required, and the proven methods currently adopted elsewhere.

References 

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/2939672.2939785

Giacobbe, D. R., Signori, A., Del Puente, F., Mora, S., Carmisciano, L., Briano, F., Vena, A., Ball, L., Robba, C., Pelosi, P., Giacomini, M., & Bassetti, M. (2021). Early detection of sepsis with machine learning techniques: A brief clinical perspective. Frontiers in Medicine, 8. https://doi.org/10.3389/fmed.2021.617486

Goh, K. H., Wang, L., Yeow, A. Y., Poh, H., Li, K., Yeow, J. J., & Tan, G. Y. (2021). Artificial intelligence in sepsis early prediction and diagnosis using unstructured data in healthcare. Nature Communications, 12(1). https://doi.org/10.1038/s41467-021-20910-4

Guo, R., Shi, P., Liu, S., & Geng, J. (2021). 3D laser scanning and surveying adjustment in traffic infrastructure management. IOP Conference Series: Earth and Environmental Science, 734(1), 012014. https://doi.org/10.1088/1755-1315/734/1/012014

Idrees, F., Rajarajan, M., Conti, M., Chen, T. M., & Rahulamathavan, Y. (2017). PIndroid: A novel Android malware detection system using ensemble learning methods. Computers & Security, 68, 36-46. https://doi.org/10.1016/j.cose.2017.03.011

Jang, H., Kim, B., & Jung, S. (2019). K-nearest reliable neighbor search in crowdsourced LBSs. International Journal of Communication Systems, 34(2). https://doi.org/10.1002/dac.4097

Liao, J., Chang, M., & Chang, L. (2020). Prediction of air-conditioning energy consumption in R&D building using multiple machine learning techniques. Energies, 13(7), 1847. https://doi.org/10.3390/en13071847

Liu, R., Li, L., Pirasteh, S., Lai, Z., Yang, X., & Shahabi, H. (2021). The performance quality of LR, SVM, and RF for earthquake-induced landslides susceptibility mapping incorporating remote sensing imagery. Arabian Journal of Geosciences, 14(4). https://doi.org/10.1007/s12517-021-06573-x

Luíza da Costa, N., Dias de Lima, M., & Barbosa, R. (2021). Evaluation of feature selection methods based on artificial neural network weights. Expert Systems with Applications, 168, 114312. https://doi.org/10.1016/j.eswa.2020.114312

Nessa, A., Adhikari, B., Hussain, F., & Fernando, X. N. (2020). A survey of machine learning for indoor positioning. IEEE Access, 8, 214945-214965. https://doi.org/10.1109/access.2020.3039271

Niang, M., Ndong, M., Dioum, I., Diop, I., Mashaly, M., & El Ghany, M. A. (2021). Comparison of random forest and extreme gradient boosting fingerprints to enhance an indoor WiFi localization system. 2021 International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC). https://doi.org/10.1109/miucc52538.2021.9447676

Rapp, M., Mencía, E. L., Fürnkranz, J., Nguyen, V., & Hüllermeier, E. (2021). Learning gradient boosted multi-label classification rules. Machine Learning and Knowledge Discovery in Databases, 124-140. https://doi.org/10.1007/978-3-030-67664-3_8

Sagi, O., & Rokach, L. (2021). Approximating XGBoost with an interpretable decision tree. Information Sciences, 572, 522-542. https://doi.org/10.1016/j.ins.2021.05.055

Shahid, E., & Arain, Q. A. (2021). Indoor positioning: “an image-based crowdsource machine learning approach”. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-021-10906-z

Ullah, A., Qayyum, H., Khan, M. K., & Ahmad, F. (2021). Sepsis detection using extreme gradient boost (XGB): A supervised learning approach. 2021 Mohammad Ali Jinnah University International Conference on Computing (MAJICC). https://doi.org/10.1109/majicc53071.2021.9526260

Venayagamoorthy, G., & Zarghami, M. (n.d.). Wide area power system protection using a learning vector quantization network. Proceedings of the 13th International Conference on Intelligent Systems Application to Power Systems. https://doi.org/10.1109/isap.2005.1599286

Zhou, X., Tian, S., An, J., Yang, J., Zhou, Y., Yan, D., Wu, J., Shi, X., & Jin, X. (2021). Comparison of different machine learning algorithms for predicting air-conditioning operating behavior in open-plan offices. Energy and Buildings, 251, 111347. https://doi.org/10.1016/j.enbuild.2021.111347