Exploring and Modelling Team Performances of the Kaggle European Soccer Database

Maurizio Carpita
2019-01-01

Abstract

This study explores a large open database of soccer leagues in 10 European countries. Data on players, teams and matches covering 7 seasons (from 2009/2010 to 2015/2016) were retrieved from Kaggle, an online platform that makes big data available for predictive modelling and analytics competitions among data scientists. Based on preliminary data analysis, experts' evaluation and players' positions on the football pitch, role-based indicators of team performance were built and used to estimate the win probability of the home team with the Binomial Logistic Regression (BLR) Model, which was then extended with the ELO rating predictor and two random effects to account for the hierarchical structure of the dataset. The predictive power of the BLR Model and its extensions was compared with that of other statistical modelling approaches (Random Forest, Neural Network, k-NN, Naive Bayes). Results showed that role-based indicators substantially improved the performance of all the models used, both in this work and in previous works available on Kaggle. The base BLR Model increased prediction accuracy by 10 percentage points and showed the importance of defensive performance, especially in the last seasons. Including the ELO rating predictor and the random effects did not substantially improve prediction, as the simpler BLR Model performed equally well. Among the other models, only Naive Bayes showed more balanced results in predicting both win and no-win of the home team.
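
As a rough illustration of how a binomial logistic regression turns role-based indicators into a home-win probability, the following minimal sketch applies the inverse-logit transform to a linear predictor. The indicator names (`defence`, `midfield`, `attack`), the coefficients and the intercept are hypothetical values chosen for illustration, not the fitted values from the study, and the sketch omits the ELO rating predictor and the random effects.

```python
import math


def home_win_probability(indicators, coefficients, intercept):
    """Return P(home win) from a logistic model.

    The linear predictor is intercept + sum(coefficient * indicator);
    the inverse-logit function maps it into the interval (0, 1).
    """
    linear = intercept + sum(
        coefficients[name] * indicators[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical home-minus-away differences in role-based indicators.
indicators = {"defence": 2.0, "midfield": -1.0, "attack": 0.5}

# Hypothetical coefficients and intercept (assumptions, not study results).
coefficients = {"defence": 0.08, "midfield": 0.05, "attack": 0.06}
intercept = -0.2

p_home_win = home_win_probability(indicators, coefficients, intercept)
print(f"Estimated home-win probability: {p_home_win:.3f}")
```

In practice the coefficients would be estimated from match data rather than set by hand, and a match would be classified as a home win when the estimated probability exceeds a chosen threshold.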

Use this identifier to cite or link to this document: https://hdl.handle.net/11379/510484