
Table 4 Data analysis characteristics

From: Machine learning methods in sport injury prediction and prevention: a systematic review

| Authors | Train, Validate and Test Strategy | Data Pre-processing | Feature Selection / Dimensionality Reduction | Machine Learning Classification Methods | Deficits of ML Analysis |
|---|---|---|---|---|---|
| Ayala et al. | Threefold stratified cross-validation for comparison of 68 algorithms | Data imputation: missing data replaced by the mean values of players in the same division; data discretization | No | Decision tree ensembles; adjusted for imbalance via synthetic minority oversampling; aggregated using bagging and boosting methods | Discretization before data splitting |
| Carey et al. | Split into training (2014 and 2015 data) and test (2016 data) datasets; hyperparameter tuning via tenfold cross-validation; each analysis repeated 50 times | NR | Principal Component Analysis | Decision tree ensembles (Random Forests), Support Vector Machines; adjusted for imbalance via undersampling and synthetic minority oversampling | Dependency between training and test dataset |
| López-Valenciano et al. | Fivefold stratified cross-validation for comparison of 68 algorithms | Data imputation: missing data replaced by the mean values of players in the same division; data discretization using literature and Weka software | No | Decision tree ensembles; adjusted for imbalance via synthetic minority oversampling, random oversampling, random undersampling; aggregated using bagging and boosting methods | Discretization before data splitting |
| McCullagh et al. | Tenfold cross-validation for testing | NR | No | Artificial Neural Networks with backpropagation | Dependency between training and test dataset |
| Oliver et al. | Fivefold cross-validation for comparison of 57 models | Data discretization using literature and Weka software | No | Decision tree ensembles; adjusted for imbalance via synthetic minority oversampling, random oversampling, random undersampling; aggregated using bagging and boosting methods | Discretization before data splitting |
| Rodas et al. | Outer fivefold cross-validation for model testing; inner tenfold cross-validation for hyperparameter tuning | Synthetic variant imputation | Least Absolute Shrinkage and Selection Operator (LASSO) | Decision tree ensembles (Random Forests), Support Vector Machines | |
| Rommers et al. | Split into training (80%) and test (20%) datasets; cross-validation for hyperparameter tuning | NR | No | Decision tree ensembles; aggregated using boosting methods | |
| Rossi et al. | Split into dataset 1 (30%) for feature elimination and dataset 2 (70%) for training and testing; stratified twofold cross-validation on dataset 2, repeated 10,000 times | NR | Recursive Feature Elimination with Cross-Validation | Decision tree ensembles; adjusted for imbalance via adaptive synthetic sampling; aggregated using Random Forests | Dependency between training and test dataset |
| Ruddy et al. | Between-year approach: split into training (2013) and test (2015) datasets; within-year approach: split into training (70%) and test (30%) datasets; both approaches: tenfold cross-validation for hyperparameter tuning, each analysis repeated 10,000 times | Data standardization | No | Single decision tree, decision tree ensembles (Random Forests), Artificial Neural Networks, Support Vector Machines; adjusted for imbalance via synthetic minority oversampling | Standardization independent in training and test dataset |
| Thornton et al. | Split into training (70%), validation (15%), and test (15%) datasets | NR | No | Decision tree ensembles; aggregated using Random Forests | |
| Whiteside et al. | Fivefold cross-validation for comparison of models | NR | Brute-force feature selection: every possible subset of features tested | Support Vector Machines | |
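The most common deficit in the table is preprocessing applied before data splitting (discretization or standardization fitted on the full dataset), which leaks test-set information into training. A minimal sketch of the problem and its fix, using scikit-learn on synthetic data (none of this is any study's actual code; the dataset and estimator are illustrative assumptions):

```python
# Compares a leaky pipeline (scaler fitted on all samples before CV) with a
# correct one (scaler re-fitted on the training portion of each fold).
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: the scaler sees every sample, including future test folds.
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(SVC(), X_leaky, y, cv=cv)

# Correct: preprocessing lives inside the pipeline, so each CV fold
# fits the scaler on its training data only.
pipe = make_pipeline(StandardScaler(), SVC())
clean_scores = cross_val_score(pipe, X, y, cv=cv)

print(leaky_scores.mean(), clean_scores.mean())
```

The same wrapping applies to discretizers and resamplers such as synthetic minority oversampling: they should be fitted per training fold, never on the pooled data.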
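The nested layout reported for Rodas et al. (an outer fivefold loop for testing, an inner tenfold loop for hyperparameter tuning) keeps tuning from ever touching the outer test folds. A sketch of that structure with scikit-learn; the estimator and parameter grid are assumptions for illustration, not the study's configuration:

```python
# Nested cross-validation: GridSearchCV runs the inner tenfold tuning loop,
# cross_val_score runs the outer fivefold testing loop around it.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=300, n_features=12, random_state=1)

inner = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=1),
    param_grid={"max_depth": [2, 4, None]},
    cv=10,  # inner tenfold CV: hyperparameter tuning
)
# Outer fivefold CV: each outer test fold is scored by a model whose
# hyperparameters were tuned only on the remaining four folds.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores)
```

The resulting five outer scores estimate generalization performance without the training/test dependency flagged for several studies in the table.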