Football Mining with R

CARPITA, Maurizio; SIMONETTO, Anna; ZUCCOLOTTO, Paola
2014-01-01

Abstract

This chapter presents a data mining process for investigating the relationship between the outcome of a football match (win, lose or draw) and a set of variables describing the actions of each team, using the R environment and selected R packages for statistical computing. The analyses were implemented with parallel computing when possible. Our goals were to identify, from hundreds of covariates, those that most strongly affect the probability of winning a match and to construct a small number of composite indicators based on the most predictive variables. These two tasks were carried out using the Random Forest machine learning algorithm and Principal Component Analysis, respectively. Variable selection was performed using the novel approach developed by Sandri and Zuccolotto in 2008. Finally, we compared the results of several different classification models and algorithms (Random Forest, Classification Neural Network, K-Nearest Neighbor, Naïve Bayes classifier, and Multinomial Logit regression), assessing both their performance and the insightfulness of their results.
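
As a rough illustration of the two-step workflow described in the abstract (rank covariates with a Random Forest, then build composite indicators with Principal Component Analysis), the following minimal R sketch may help. It is not the authors' code: the data are simulated, the variable names `x1` to `x20` are hypothetical, and the ranking uses the standard permutation importance from the `randomForest` package rather than the Sandri and Zuccolotto (2008) approach used in the chapter. It assumes `randomForest` is installed from CRAN.

```r
# Illustrative sketch only: simulated data, hypothetical variable names.
library(randomForest)  # assumption: package installed from CRAN

set.seed(1)
n <- 300  # number of simulated matches
p <- 20   # number of simulated covariates

# Hypothetical match-level covariates x1..x20
X <- as.data.frame(matrix(rnorm(n * p), nrow = n))
names(X) <- paste0("x", seq_len(p))

# Simulated three-class outcome driven by the first three covariates only
score <- X$x1 + 0.8 * X$x2 - 0.8 * X$x3 + rnorm(n)
outcome <- cut(score,
               breaks = c(-Inf, -0.7, 0.7, Inf),
               labels = c("lose", "draw", "win"))

# Step 1: rank covariates by permutation importance
# (type = 1 is the mean decrease in accuracy; this is NOT the
#  Sandri-Zuccolotto measure used in the chapter)
rf <- randomForest(x = X, y = outcome, ntree = 500, importance = TRUE)
imp <- importance(rf, type = 1)[, 1]
top_vars <- names(sort(imp, decreasing = TRUE))[1:5]

# Step 2: build composite indicators from the selected covariates
# using PCA on standardized variables; keep the first two components
pca <- prcomp(X[, top_vars], center = TRUE, scale. = TRUE)
indicators <- pca$x[, 1:2]

print(top_vars)
print(summary(pca))
```

The cutoff of five variables and two components is arbitrary here; in practice both would be chosen from the importance ranking and the explained variance.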
Year: 2014
ISBN: 9780124115118
Use this identifier to cite or link to this document: https://hdl.handle.net/11379/263312