
Table 4 Data analysis characteristics

From: Machine learning methods in sport injury prediction and prevention: a systematic review

For each study, the following characteristics are reported: the train, validate and test strategy; the data pre-processing; the feature selection or dimensionality reduction; the machine learning classification methods; and the deficits of the ML analysis.

Ayala et al.
- Train, validate and test strategy: threefold stratified cross-validation for comparison of 68 algorithms
- Data pre-processing: data imputation (missing data were replaced by the mean values of players in the same division); data discretization
- Feature selection/dimensionality reduction: no
- Machine learning classification methods: decision tree ensembles, adjusted for class imbalance via synthetic minority oversampling and aggregated using bagging and boosting methods
- Deficits of ML analysis: discretization before data splitting
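
The review flags discretization performed before data splitting as a deficit of this analysis. Below is a minimal sketch, assuming scikit-learn and imbalanced-learn rather than the software used in the study, of how imputation, discretization and synthetic minority oversampling can be kept inside a cross-validated pipeline so they are fitted on training folds only; the data, bin counts and ensemble settings are synthetic placeholders, and only bagging (not boosting) is shown.

```python
# Illustrative sketch only; not the original study's workflow.
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # pipeline variant that accepts resamplers

# Synthetic stand-in for an imbalanced injury dataset.
X, y = make_classification(n_samples=300, n_features=20, weights=[0.85, 0.15],
                           random_state=0)

# Keeping imputation, discretization and SMOTE inside the pipeline means each is
# refitted on the training folds only, avoiding leakage into the test folds.
model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("discretize", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")),
    ("smote", SMOTE(random_state=0)),
    ("bagged_trees", BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                       random_state=0)),
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
print(cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean())
```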

Carey et al.
- Train, validate and test strategy: split into a training dataset (2014 and 2015 data) and a test dataset (2016 data); hyperparameter tuning via tenfold cross-validation; each analysis repeated 50 times
- Data pre-processing: not reported
- Feature selection/dimensionality reduction: principal component analysis
- Machine learning classification methods: decision tree ensembles (random forests) and support vector machines, adjusted for class imbalance via undersampling and synthetic minority oversampling
- Deficits of ML analysis: dependency between training and test dataset
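
A minimal sketch of the season-based split, assuming scikit-learn and imbalanced-learn: principal component analysis and oversampling sit inside a pipeline that is tuned by tenfold cross-validation on the training seasons only, then evaluated on the held-out season. The season labels, the parameter grid and the data are placeholders, not the study's actual configuration.

```python
# Illustrative sketch only; not the original analysis code.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 30))
y = rng.binomial(1, 0.1, size=600)
season = rng.choice([2014, 2015, 2016], size=600)   # hypothetical season labels

train = np.isin(season, [2014, 2015])               # earlier seasons for training
test = season == 2016                               # most recent season held out

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("smote", SMOTE(random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Tenfold cross-validation on the training seasons to tune hyperparameters.
grid = GridSearchCV(pipe,
                    {"pca__n_components": [5, 10], "rf__n_estimators": [200, 500]},
                    cv=10, scoring="roc_auc")
grid.fit(X[train], y[train])
print(roc_auc_score(y[test], grid.predict_proba(X[test])[:, 1]))
```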

López-Valenciano et al.
- Train, validate and test strategy: fivefold stratified cross-validation for comparison of 68 algorithms
- Data pre-processing: data imputation (missing data were replaced by the mean values of players in the same division); data discretization based on the literature and Weka software
- Feature selection/dimensionality reduction: no
- Machine learning classification methods: decision tree ensembles, adjusted for class imbalance via synthetic minority oversampling, random oversampling and random undersampling, and aggregated using bagging and boosting methods
- Deficits of ML analysis: discretization before data splitting
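
Several of these analyses compared class-rebalancing strategies. The sketch below, assuming imbalanced-learn rather than the Weka toolkit used in the study, loops over synthetic minority oversampling, random oversampling and random undersampling around a boosted tree ensemble under stratified fivefold cross-validation; the data and settings are synthetic placeholders.

```python
# Illustrative comparison of resampling strategies; not the study's own setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=400, n_features=25, weights=[0.85, 0.15],
                           random_state=1)

samplers = {
    "smote": SMOTE(random_state=1),
    "random_over": RandomOverSampler(random_state=1),
    "random_under": RandomUnderSampler(random_state=1),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for name, sampler in samplers.items():
    # Resampling happens inside the pipeline, so it is applied to training folds only.
    pipe = Pipeline([("resample", sampler), ("boost", AdaBoostClassifier(random_state=1))])
    score = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean()
    print(f"{name}: AUC = {score:.3f}")
```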

McCullagh et al.
- Train, validate and test strategy: tenfold cross-validation for testing
- Data pre-processing: not reported
- Feature selection/dimensionality reduction: no
- Machine learning classification methods: artificial neural networks with backpropagation
- Deficits of ML analysis: dependency between training and test dataset
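
The review lists dependency between training and test data as a deficit here. One common way to avoid such dependency, sketched below with scikit-learn (not the software used in the study), is to group repeated observations of the same athlete so they never span a training/test boundary; the backpropagation-trained network, the player_id grouping variable and the data are illustrative assumptions.

```python
# Illustrative sketch: a backpropagation-trained network evaluated with
# athlete-grouped cross-validation to keep each player's records in one fold.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 15))
y = rng.binomial(1, 0.2, size=500)
player_id = rng.integers(0, 50, size=500)   # hypothetical repeated observations per athlete

net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=2))

scores = cross_val_score(net, X, y, cv=GroupKFold(n_splits=10), groups=player_id,
                         scoring="roc_auc")
print(scores.mean())
```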

Oliver et al.
- Train, validate and test strategy: fivefold cross-validation for comparison of 57 models
- Data pre-processing: data discretization based on the literature and Weka software
- Feature selection/dimensionality reduction: no
- Machine learning classification methods: decision tree ensembles, adjusted for class imbalance via synthetic minority oversampling, random oversampling and random undersampling, and aggregated using bagging and boosting methods
- Deficits of ML analysis: discretization before data splitting

Rodas et al.
- Train, validate and test strategy: outer fivefold cross-validation for model testing; inner tenfold cross-validation for hyperparameter tuning
- Data pre-processing: synthetic variant imputation
- Feature selection/dimensionality reduction: Least Absolute Shrinkage and Selection Operator (LASSO)
- Machine learning classification methods: decision tree ensembles (random forests) and support vector machines
- Deficits of ML analysis: none reported
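
A minimal sketch of nested cross-validation with an L1-penalized (LASSO-type) feature selector, assuming scikit-learn: the inner tenfold loop tunes hyperparameters and the outer fivefold loop estimates performance on data never seen during tuning. The selector, grid and data are illustrative assumptions, not the study's actual pipeline.

```python
# Illustrative nested cross-validation sketch; not the study's own code.
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, n_features=100, n_informative=10,
                           random_state=3)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # L1-penalized logistic regression used as a LASSO-style feature selector.
    ("lasso_select", SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"))),
    ("svm", SVC(kernel="rbf")),
])

# Inner tenfold CV tunes hyperparameters; outer fivefold CV estimates performance.
inner = GridSearchCV(pipe,
                     {"lasso_select__estimator__C": [0.5, 1.0, 5.0],
                      "svm__C": [0.1, 1, 10]},
                     cv=10, scoring="roc_auc")
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
print(outer_scores.mean())
```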

Rommers et al.
- Train, validate and test strategy: split into a training (80%) and a test (20%) dataset; cross-validation for hyperparameter tuning
- Data pre-processing: not reported
- Feature selection/dimensionality reduction: no
- Machine learning classification methods: decision tree ensembles, aggregated using boosting methods
- Deficits of ML analysis: none reported
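
A minimal sketch of the 80/20 split with cross-validated tuning of a boosted tree ensemble on the training part only, assuming scikit-learn; the boosting implementation, the parameter grid and the data are placeholders rather than the study's actual settings.

```python
# Illustrative sketch of the split-plus-tuning strategy; not the study's own code.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_features=20, weights=[0.7, 0.3],
                           random_state=4)

# 80/20 split, stratified so the injury rate is preserved in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=4)

# Cross-validation on the training part only to tune the boosted tree ensemble.
grid = GridSearchCV(GradientBoostingClassifier(random_state=4),
                    {"n_estimators": [100, 300], "learning_rate": [0.05, 0.1]},
                    cv=5, scoring="roc_auc")
grid.fit(X_tr, y_tr)

# The untouched 20% is used once, for the final performance estimate.
print(roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1]))
```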

Rossi et al.
- Train, validate and test strategy: split into dataset 1 (30%) for feature elimination and dataset 2 (70%) for training and testing; stratified twofold cross-validation on dataset 2, repeated 10,000 times
- Data pre-processing: not reported
- Feature selection/dimensionality reduction: recursive feature elimination with cross-validation
- Machine learning classification methods: decision tree ensembles, adjusted for class imbalance via adaptive synthetic sampling and aggregated using random forests
- Deficits of ML analysis: dependency between training and test dataset
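
A minimal sketch of the two-dataset design, assuming scikit-learn and imbalanced-learn: recursive feature elimination with cross-validation on a 30% subset, then repeated stratified twofold cross-validation with adaptive synthetic sampling on the remaining 70%. Only 10 repeats are run here instead of 10,000, and the data and settings are synthetic placeholders.

```python
# Illustrative sketch only; not the original study's pipeline.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, cross_val_score
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           weights=[0.85, 0.15], random_state=5)

# Dataset 1 (30%) is used only for feature elimination; dataset 2 (70%) is used
# for training and testing the final model.
X1, X2, y1, y2 = train_test_split(X, y, train_size=0.3, stratify=y, random_state=5)

rfecv = RFECV(RandomForestClassifier(n_estimators=200, random_state=5),
              cv=5, scoring="roc_auc")
rfecv.fit(X1, y1)
X2_sel = rfecv.transform(X2)

# Stratified twofold cross-validation on dataset 2, repeated several times,
# with adaptive synthetic sampling applied to the training folds.
model = Pipeline([("adasyn", ADASYN(random_state=5)),
                  ("rf", RandomForestClassifier(n_estimators=200, random_state=5))])
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=10, random_state=5)
print(cross_val_score(model, X2_sel, y2, cv=cv, scoring="roc_auc").mean())
```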

Ruddy et al.
- Train, validate and test strategy: between-year approach: split into a training dataset (2013) and a test dataset (2015); within-year approach: split into a training (70%) and a test (30%) dataset; both approaches: tenfold cross-validation for hyperparameter tuning, each analysis repeated 10,000 times
- Data pre-processing: data standardization
- Feature selection/dimensionality reduction: no
- Machine learning classification methods: single decision tree, decision tree ensembles (random forests), artificial neural networks and support vector machines, adjusted for class imbalance via synthetic minority oversampling
- Deficits of ML analysis: standardization independent of the training and test dataset
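
The review flags how standardization was handled relative to the train/test split. The sketch below, assuming scikit-learn and imbalanced-learn, fits the scaler and the oversampler on the training season only and then applies them to the held-out season, comparing the four model families listed above; the season labels, model settings and data are placeholders, not the study's actual code.

```python
# Illustrative between-year split with leakage-safe standardization; not the
# original study's implementation.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

rng = np.random.default_rng(6)
X = rng.normal(size=(800, 12))
y = rng.binomial(1, 0.1, size=800)
season = rng.choice([2013, 2015], size=800)   # hypothetical season labels

train, test = season == 2013, season == 2015

models = {
    "single_tree": DecisionTreeClassifier(random_state=6),
    "random_forest": RandomForestClassifier(random_state=6),
    "svm": SVC(probability=True, random_state=6),
    "ann": MLPClassifier(max_iter=2000, random_state=6),
}

for name, clf in models.items():
    # The scaler and SMOTE are fitted on the training season only, then the
    # fitted scaler is applied to the held-out season at prediction time.
    pipe = Pipeline([("scale", StandardScaler()),
                     ("smote", SMOTE(random_state=6)),
                     ("clf", clf)])
    pipe.fit(X[train], y[train])
    print(name, roc_auc_score(y[test], pipe.predict_proba(X[test])[:, 1]))
```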

Thornton et al.
- Train, validate and test strategy: split into training (70%), validation (15%) and test (15%) datasets
- Data pre-processing: not reported
- Feature selection/dimensionality reduction: no
- Machine learning classification methods: decision tree ensembles, aggregated using random forests
- Deficits of ML analysis: none reported
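
A minimal sketch of a 70/15/15 train/validation/test split, assuming scikit-learn: the validation part guides model selection and the test part is used once at the end. The candidate settings and data are placeholders.

```python
# Illustrative three-way split; not the study's own configuration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, n_features=15, weights=[0.8, 0.2],
                           random_state=7)

# First carve off 70% for training, then split the remainder evenly into
# validation (15%) and test (15%).
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, train_size=0.70, stratify=y,
                                              random_state=7)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                            stratify=y_rest, random_state=7)

best_auc, best_model = -1.0, None
for n_trees in (100, 300, 500):            # validation set guides model choice
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=7).fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1])
    if auc > best_auc:
        best_auc, best_model = auc, rf

# Final, untouched test set is scored once.
print(roc_auc_score(y_te, best_model.predict_proba(X_te)[:, 1]))
```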

Whiteside et al.
- Train, validate and test strategy: fivefold cross-validation for comparison of models
- Data pre-processing: not reported
- Feature selection/dimensionality reduction: brute force feature selection (every possible subset of features is tested)
- Machine learning classification methods: support vector machines
- Deficits of ML analysis: none reported
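
A minimal sketch of brute-force feature selection with a support vector machine under fivefold cross-validation, assuming scikit-learn: every possible feature subset is scored, which is only feasible for small feature sets since the number of subsets grows as 2^n. The data are synthetic and the settings are not the study's own.

```python
# Illustrative brute-force subset search; kept small because the search is exponential.
from itertools import combinations
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=8)
n_features = X.shape[1]

best_score, best_subset = -np.inf, None
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):   # every possible feature subset
        model = make_pipeline(StandardScaler(), SVC())
        score = cross_val_score(model, X[:, subset], y, cv=5, scoring="roc_auc").mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, best_score)
```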