Football Mining with R

Carpita, Maurizio; Sandri, Marco; Simonetto, Anna; Zuccolotto, Paola

This chapter presents a data mining process for investigating the relationship between the outcome of a football match (win, loss, or draw) and a set of variables describing the actions of each team, using the R environment for statistical computing and selected R packages. The analyses were run in parallel where possible. Our goals were to identify, from hundreds of covariates, those that most strongly affect the probability of winning a match and to construct a small number of composite indicators based on the most predictive variables. These two tasks were carried out using the Random Forest machine learning algorithm and Principal Component Analysis, respectively. Variable selection was performed using the novel approach developed by Sandri and Zuccolotto in 2008. Finally, we compared several classification models and algorithms (Random Forest, Classification Neural Network, K-Nearest Neighbor, Naïve Bayes classifier, and Multinomial Logit regression), assessing both their performance and the insightfulness of their results.
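
The two-step workflow described above (rank covariates with a Random Forest, then build composite indicators from the top-ranked ones with Principal Component Analysis) could be sketched in R roughly as follows. This is a minimal illustration, not the chapter's code: the data are simulated, the names `action_1` to `action_20`, `top_vars`, and `indicators` are hypothetical, and the ranking uses the standard permutation importance from the `randomForest` package as a stand-in for the Sandri and Zuccolotto selection procedure, which is not reproduced here. It assumes the `randomForest` package is installed.

```r
# Sketch under stated assumptions: simulated data, hypothetical variable names.
library(randomForest)

set.seed(1)
n <- 300   # team-match observations (made up)
p <- 20    # candidate covariates (the chapter works with hundreds)

X <- as.data.frame(matrix(rnorm(n * p), nrow = n))
names(X) <- paste0("action_", seq_len(p))

# Toy outcome: only the first three covariates carry signal (assumption).
score <- X$action_1 + X$action_2 - X$action_3 + rnorm(n, sd = 0.5)
outcome <- cut(score,
               breaks = quantile(score, c(0, 1/3, 2/3, 1)),
               labels = c("loss", "draw", "win"),
               include.lowest = TRUE)

# Step 1: Random Forest with permutation importance enabled.
rf <- randomForest(x = X, y = outcome, ntree = 500, importance = TRUE)

# type = 1 is the mean decrease in accuracy; this is NOT the
# Sandri-Zuccolotto measure, only a common default ranking.
imp <- importance(rf, type = 1)[, 1]
top_vars <- names(sort(imp, decreasing = TRUE))[1:5]

# Step 2: PCA on the selected covariates; the first two component
# scores serve as composite indicators.
pca <- prcomp(X[, top_vars], center = TRUE, scale. = TRUE)
indicators <- pca$x[, 1:2]

print(top_vars)
print(summary(pca)$importance[, 1:2])
```

The cut-off of five variables and two components is arbitrary here; in practice both would be chosen from the importance ranking and the explained variance.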