- Introduction
- Data:
- Exploratory Data Analysis:
- Correlation between variables
- How does head-to-head matchup history affect the current match?
- How do recent performances affect the current match?
- Do stronger teams usually win?
- Do young players play better than old ones?
- Is short passing better than long passing?
- How do labels distribute in reduced dimensions?
- Methodology: details of the procedure
- Models
- Baseline models
- Odd-based model
- History-and-form-based model
- Squad-strength-based model
- Enhanced models
- Logistic Regression
- Random Forest
- Gradient Boosting tree
- ADA boost tree
- Neural Network
- Light GBM
- Evaluation Criteria
- F1
- 10-fold Cross Validation accuracy
- Area under ROC
Abstract:
In this work, we compare nine modeling approaches for predicting soccer match results and goal differences, trained on all international matches from 2005 to 2017, the FIFA World Cups 2010 and 2014, and the UEFA EUROs 2012 and 2016. Within this comparison, "Win / Draw / Lose" prediction shows little difference across models, while "Goal Difference" prediction clearly favors Random Forest and the squad-strength-based decision tree. We also apply these models to World Cup 2018: again, Random Forest and Logistic Regression reach about 33% accuracy for "Goal Difference" and about 57% for "Win / Draw / Lose", although simple decision trees based on betting odds and squad strength are comparable.
Objective:
- Predict the winner of international matches; prediction targets are "Win / Lose / Draw" or "goal difference".
- Apply the model to predict the result of FIFA world cup 2018.
Supervisor: Pratibha Rathore
Lifecycle
Data: The dataset covers all international matches from 2000 to 2018: results, betting odds, FIFA rankings, and squad strengths. The numbered sources below are the ones referenced in the "Source" column of the feature table:
1. FIFA World Cup 2018
2. International match 1872 - 2018
3. FIFA Ranking through Time
4. Bet Odd
5. Bet Odd 2
6. Squad Strength - Sofia
7. Squad Strength - FIFA index
Feature Selection: To determine which team is more likely to win a match, I came up with four main groups of features, based on my domain knowledge:
- Head-to-head match history between the two teams. Some teams have opponents they rarely beat, no matter how strong they currently are; for example, Germany usually fails to beat Italy within 90 minutes.
- Recent performance of each team (the 10 most recent matches), a.k.a. "form". A team in good form usually has a higher chance of winning its next matches.
- Betting odds before matches. Bookmakers have already done extensive analysis before each match to set the odds, so why not include them?
- Squad strength (from the FIFA video games). Real squad-strength data are not free and not always available, so we use the strength ratings from the FIFA video games, which are updated regularly to track real-world strength.
Feature List: The feature list reflects those four factors.
- *diff*: team 1 value minus team 2 value
- *form*: performance in the 10 most recent matches
Feature Name | Description | Source |
---|---|---|
team_1 | Nation code (e.g. US, NZ) | 1 & 2 |
team_2 | Nation code (e.g. US, NZ) | 1 & 2 |
date | Date of match, yyyy-mm-dd | 1 & 2 |
tournament | Friendly, EURO, AFC, FIFA WC | 1 & 2 |
h_win_diff | Head2Head: win difference | 2 |
h_draw | Head2Head: number of draw | 2 |
form_diff_goalF | Form: difference in "Goal For" | 2 |
form_diff_goalA | Form: difference in "Goal Against" | 2 |
form_diff_win | Form: difference in number of win | 2 |
form_diff_draw | Form: difference in number of draw | 2 |
odd_diff_win | Betting odds: difference in win odds | 4 & 5 |
odd_draw | Betting odds: odds of a draw | 4 & 5 |
game_diff_rank | Squad Strength: difference in FIFA Rank | 3 |
game_diff_ovr | Squad Strength: difference in Overall Strength | 6 |
game_diff_attk | Squad Strength: difference in Attack Strength | 6 |
game_diff_mid | Squad Strength: difference in Midfield Strength | 6 |
game_diff_def | Squad Strength: difference in Defense Strength | 6 |
game_diff_prestige | Squad Strength: difference in prestige | 6 |
game_diff_age11 | Squad Strength: difference in age of 11 starting players | 6 |
game_diff_ageAll | Squad Strength: difference in age of all players | 6 |
game_diff_bup_speed | Squad Strength: difference in Build Up Play Speed | 6 |
game_diff_bup_pass | Squad Strength: difference in Build Up Play Passing | 6 |
game_diff_cc_pass | Squad Strength: difference in Chance Creation Passing | 6 |
game_diff_cc_cross | Squad Strength: difference in Chance Creation Crossing | 6 |
game_diff_cc_shoot | Squad Strength: difference in Chance Creation Shooting | 6 |
game_diff_def_press | Squad Strength: difference in Defense Pressure | 6 |
game_diff_def_aggr | Squad Strength: difference in Defense Aggression | 6 |
game_diff_def_teamwidth | Squad Strength: difference in Defense Team Width | 6 |
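As a concrete illustration of the *diff* convention, here is a minimal sketch (with hypothetical column names; the project's real preprocessing code is not shown in this report) of deriving difference features with pandas:

```python
import pandas as pd

def add_diff_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive team1-minus-team2 features, assuming per-team columns
    such as 'ovr_1' / 'ovr_2' exist (hypothetical names)."""
    out = df.copy()
    for col in ["rank", "ovr", "attk", "mid", "def"]:
        out[f"game_diff_{col}"] = out[f"{col}_1"] - out[f"{col}_2"]
    return out
```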
There are a few questions we ask in order to understand the data better:
Imbalance of data
Correlation between variables
First, we draw the correlation matrix of the large dataset, which contains all matches from 2005-2018 with feature groups 1, 2 and 3.
In general, the features are not correlated. "odd_win_diff" is fairly negatively correlated with "form_diff_win" (-0.5), indicating that the two teams' form reflects the bookmakers' belief about the winner. Another interesting point: as the difference in betting odds grows, we see larger goal differences (correlation = -0.6).
Second, we draw the correlation matrix of the small dataset, which contains all matches from World Cups 2010, 2014 and 2018 and EUROs 2012 and 2016.
The overall rating is just an average of the "attack", "defense" and "midfield" indices, so they are highly correlated with each other. In addition, some of the new squad-strength features correlate strongly with existing ones, for example "FIFA Rank", "Overall rating" and "Difference in winning odds".
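For reference, both correlation matrices can be reproduced in a few lines; `df` is a hypothetical DataFrame holding the engineered features:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr()  # pairwise Pearson correlations of the numeric features
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Feature correlation matrix")
plt.tight_layout()
plt.show()
```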
How does head-to-head matchup history affect the current match?
You might expect that when the head-to-head win difference is positive, the match result should be "Win" (team 1 beats team 2), and vice versa: when it is negative, the result should be "Lose" (team 2 beats team 1). In fact, a positive head-to-head win difference gives a 51.8% chance that the match ends in "Win", and a negative one gives a 55.5% chance that it ends in "Lose".
Let's perform a two-sample t-test. Null hypothesis: there is no difference in 'h2h win difference' between "Win" and "Lose". Alternative hypothesis: there is a difference in 'h2h win difference' between "Win" and "Lose".
T-test between win and lose:
Ttest_indResult(statistic=24.30496036405259, pvalue=2.503882847793891e-126)
The very small p-value means we can reject the null hypothesis and accept the alternative.
We can run the same procedure for win-draw and lose-draw:
T-test between win and draw:
Ttest_indResult(statistic=7.8385466293651023, pvalue=5.395456011352264e-15)
T-test between lose and draw:
Ttest_indResult(statistic=-8.6759649601068887, pvalue=5.2722587025773183e-18)
Therefore, we can say that the head-to-head history of the two teams contributes significantly to the result.
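The statistics above come from scipy's two-sample t-test. A minimal sketch, assuming a DataFrame `df` with a 'result' column and the 'h_win_diff' feature (hypothetical layout):

```python
from scipy.stats import ttest_ind

win = df.loc[df["result"] == "win", "h_win_diff"]
lose = df.loc[df["result"] == "lose", "h_win_diff"]
draw = df.loc[df["result"] == "draw", "h_win_diff"]

print("T-test between win and lose:", ttest_ind(win, lose))
print("T-test between win and draw:", ttest_ind(win, draw))
print("T-test between lose and draw:", ttest_ind(lose, draw))
```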
How do recent performances affect the current match?
We consider differences in "Goal For" (goals scored), "Goal Against" (goals conceded), "number of winning matches" and "number of drawing matches" over the 10 most recent matches, and run the same procedure as in the previous question. From the pie charts, we can see a clear distinction in "number of wins": the proportion of "Win" results decreases from 49% to 25%, while "Lose" results increase from 26.5% to 52.3%.
Pie charts are not enough; we should run hypothesis tests to check the significance of each feature.
Feature Name | t-test between 'win' and 'lose' | t-test between 'win' and 'draw' | t-test between 'lose' and 'draw' |
---|---|---|---|
Goal For | pvalue = 2.50e-126 | pvalue = 5.39e-15 | pvalue = 5.27e-18 |
Goal Against | pvalue = 0.60 | pvalue = 0.17 | pvalue = 0.08 |
Number of Winning Matches | pvalue = 3.02e-23 | pvalue = 1.58e-33 | pvalue = 2.57e-29 |
Number of Draw Matches | pvalue = 1.53e-06 | pvalue = 0.21 | pvalue = 0.03 |
We see many small p-values for "Goal For" and "Number of Winning Matches". Based on the t-tests, differences in "Goal For" and "Number of Winning Matches" are useful features.
Do stronger teams usually win?
We define stronger teams based on
- Higher FIFA Ranking
- Higher Overall Rating
Feature Name | t-test between 'win' and 'lose' | t-test between 'win' and 'draw' | t-test between 'lose' and 'draw' |
---|---|---|---|
FIFA Rank | pvalue = 2.11e-10 | pvalue = 0.65 | pvalue = 0.00068 |
Overall Rating | pvalue = 1.53e-16 | pvalue = 0.0804 | pvalue = 0.000696 |
Do young players play better than old ones?
Young players may have better stamina and more energy while older players have more experience. We want to see how age affects match results.
Feature Name | t-test between 'win' and 'lose' | t-test between 'win' and 'draw' | t-test between 'lose' and 'draw' |
---|---|---|---|
Age | pvalue = 2.07e-05 | pvalue = 0.312 | pvalue = 0.090 |
Based on the t-test and pie chart, we know that age contributes significantly to the result. More specifically, younger teams tend to play better than older ones.
Is short passing better than long passing? A higher value of "Build Up Play Passing" means long passes, a lower value means short passes, and middle values mean a mixed passing style.
Feature Name | t-test between 'win' and 'lose' | t-test between 'win' and 'draw' | t-test between 'lose' and 'draw' |
---|---|---|---|
Build Up Play Passing | pvalue = 1.05e-07 | pvalue = 0.0062 | pvalue = 0.571 |
Based on the t-test and pie chart, we know that the passing style contributes significantly to the result. More specifically, teams that rely on longer passes usually lose the game.
How does crossing pass affect match result ?
How does chance creation shooting affect match result ?
How does defence pressure affect match result ?
How does defence aggression affect match result ?
How does defence team width affect match result ?
How do labels distribute in reduced dimensions?
For this question, we use PCA to pick the first two principal components, which best explain the data, and then plot the data in this reduced space.
While "Win" and "Lose" are fairly well separated, "Draw" appears mixed in with the other labels.
Our main prediction targets are "Win / Lose / Draw" and "Goal Difference". In this work we run two main experiments; each follows the procedure below (sketched in code after the list):
- Split the data 70:30 into training and test sets.
- Normalize the features and convert categorical variables to numbers.
- Perform k-fold cross-validation on the training set to select the best parameters for each model according to some criterion.
- Run the best model through 10-fold cross-validation (9 folds for training, 1 fold for testing) and take the mean test error, which is a more reliable estimate.
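The sketch below illustrates the procedure with Logistic Regression standing in for any of the models; `X` and `y` are hypothetical, and the parameter grid is illustrative only:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 70:30 split for parameter selection.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Normalization + model in one pipeline; grid search picks the best C.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(pipe, {"logisticregression__C": [1e-3, 1e-2, 1e-1, 1]}, cv=5)
grid.fit(X_train, y_train)

# Mean 10-fold CV accuracy of the tuned model: the more reliable estimate.
scores = cross_val_score(grid.best_estimator_, X, y, cv=10)
print(scores.mean())
```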
Experiment 1. Build classifiers for "Win / Lose / Draw" on matches from 2005 onward. The "Bet Odds" features are only available after 2005, so we restrict the experiments to this period.
Experiment 2. Build classifiers for "Goal Difference" on "World Cup" and "UEFA EURO" matches after 2010. The "Squad Strength" features are not consistently available before 2010, as some national teams have no squad-strength entries in the FIFA video games. Tackling this prediction as regression would be hard, so we turn "Goal Difference" into classification with the following labels (a small helper that produces them is sketched after the list):
Team A vs Team B
- "win_1": A wins with 1 goal differences
- "win_2": A wins with 2 goal differences
- "win_3": A wins with 3 or more goal differences
- "lose_1": B wins with 1 goal differences
- "lose_2": B wins with 2 goal differences
- "lose_3": A wins with 3 or more goal differences
- "draw_0": Draw
Experiment 3. In addition, we test how well the models trained in Experiment 2 predict the "Goal Difference" and "Win / Draw / Lose" outcomes of the World Cup 2018 matches.
Baseline Models: In the EDA part we investigated feature importance and saw that odds, history, form and squad strength are all significant. We now divide the features into three groups (odds, h2h-form, squad strength) and build a "Baseline Model" on each group. To keep the baselines simple, we fix the Decision Tree hyper-parameters at maximum depth = 2 and maximum leaf nodes = 3; a minimal sketch follows.
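Here is one such baseline, the odd-based tree, assuming `X_train` / `X_test` are DataFrames containing the features listed earlier (hypothetical setup):

```python
from sklearn.tree import DecisionTreeClassifier

odd_features = ["odd_diff_win", "odd_draw"]

# Deliberately tiny tree: depth <= 2 and at most 3 leaves.
baseline = DecisionTreeClassifier(max_depth=2, max_leaf_nodes=3, random_state=0)
baseline.fit(X_train[odd_features], y_train)
print(baseline.score(X_test[odd_features], y_test))
```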
- Odd-based model:
*(Figures: the fitted odd-based tree for Experiment 1 and Experiment 2.)*
- History-Form-based model:
*(Figures: the fitted history-form tree for Experiment 1 and Experiment 2.)*
- Squad-strength-based model: (Experiment 2 only, since the squad-strength features are only available there.)
Enhanced Models:
To beat the baseline models, we use all features with several machine learning algorithms, listed below (and gathered into a dictionary in the sketch after the list):
- Logistic Regression
- Random Forest
- Gradient Boosting Tree
- ADA Boost Tree
- Neural Network
- LightGBM
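For uniform tuning and evaluation, the enhanced models can be collected in one place; the constructor arguments below are illustrative defaults, not the tuned values (those appear in the appendix):

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from lightgbm import LGBMClassifier

models = {
    "Logistic Regression": LogisticRegression(multi_class="multinomial", solver="lbfgs"),
    "Random Forest": RandomForestClassifier(n_estimators=15),
    "Gradient Boosting Tree": GradientBoostingClassifier(),
    "ADA Boost Tree": AdaBoostClassifier(),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(10, 5), max_iter=1000),
    "LightGBM": LGBMClassifier(),
}
```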
Models are evaluated on the following criteria, each computed for the labels "win", "lose" and "draw" (a metrics sketch follows this list):
- Precision: among our predictions of a given label, what percentage did we get right? The higher, the better.
- Recall: among the actual occurrences of a given label, what percentage did we catch? The higher, the better.
- F1: a balance of precision and recall; the higher, the better. There are two variants:
  - F1-micro: compute F1 by aggregating the true positives and false positives of each class.
  - F1-macro: compute F1 independently for each class and take the average (all classes weighted equally).

  In a multi-class setting, the micro-average is preferable if you suspect class imbalance (i.e. many more examples of one class than of the others), so we stick with F1-micro.
- 10-fold cross-validation accuracy: the mean accuracy over the cross-validation folds. This is a reliable estimate of the model's test error (no separate train/test split needed).
- Area under ROC (AUROC): for binary classification, the true positive rate vs. the false positive rate across all thresholds.
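A sketch of computing these criteria for one fitted multi-class model (`model`, `X`, `y`, `X_test`, `y_test` are hypothetical):

```python
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import label_binarize

y_pred = model.predict(X_test)
print("F1-micro:", f1_score(y_test, y_pred, average="micro"))
print("10-fold CV accuracy:", cross_val_score(model, X, y, cv=10).mean())

# Micro-averaged AUROC: binarize the labels one-vs-rest, then aggregate.
y_bin = label_binarize(y_test, classes=model.classes_)
y_score = model.predict_proba(X_test)
print("AUROC-micro:", roc_auc_score(y_bin, y_score, average="micro"))
```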
Experiment 1 "Draw / Lose /Win"
Model | 10-fold CV accuracy (%) | F1 - micro average | AUROC - micro average |
---|---|---|---|
Odd-based Decision Tree | 59.28 | 60.22 | 0.76 |
H2H-Form based Decision Tree | 51.22 | 51.52 | 0.66 |
Logistic Regression | 59.37 | 59.87 | 0.76 |
Random Forest | 54.40 | 55.92 | 0.74 |
Gradient Boosting tree | 58.60 | 59.47 | 0.77 |
ADA boost tree | 59.08 | 60.22 | 0.77 |
Neural Net | 58.96 | 58.36 | 0.77 |
LightGBM | 59.49 | 60.28 | 0.78 |
Results from Experiment 1 show little improvement of the enhanced models over the baselines on all three evaluation criteria: 10-fold cross-validation accuracy, F1 and area under the ROC curve. A simple odd-based decision tree is enough to classify "Win / Draw / Lose". However, according to the confusion matrices in the appendix for Experiment 1, most classifiers fail on the "Draw" label; only Random Forest and Gradient Boosting Tree predict "Draw" at all, with 74 and 29 hits respectively. Since the classifiers otherwise differ little on the remaining criteria, our recommendation for classifying "Win / Draw / Lose" is Gradient Boosting Tree or Random Forest.
Experiment 2 "Goal Difference"
Model | 10-fold CV accuracy (%) | F1 - micro average | AUROC - micro average |
---|---|---|---|
Odd-based Decision Tree | 26.41 | 25.37 | 0.62 |
H2H-Form-based Decision Tree | 16.74 | 18.94 | 0.59 |
Squad-strength-based Decision Tree | 31.64 | 31.34 | 0.66 |
Logistic Regression | 21.39 | 22.38 | 0.64 |
Random Forest | 25.36 | 25.37 | 0.60 |
Gradient Boosting tree | 27.27 | 16.42 | 0.58 |
ADA boost tree | 26.92 | 16.41 | 0.59 |
Neural Net | 22.42 | 25.37 | 0.63 |
LightGBM | 25.62 | 20.89 | 0.57 |
In Experiment 2, the squad-strength-based decision tree tends to be superior to the other classifiers.
Experiment 3 "Goal Difference" and "Win/Draw/Lose" in World Cup 2018
Model | "Goal Difference" Accuracy | "Win/Draw/Lose" Accuracy (%) | F1 - micro average |
---|---|---|---|
Odd-based Decision Tree | 31.25 | 48.43 | 31.25 |
H2H-Form based Decision Tree | 25.00 | 34.37 | 25.00 |
Squad strength based Decision Tree | 28.12 | 43.75 | 28.12 |
Logistic Regression | 32.81 | 57.81 | 32.81 |
Random Forest | 32.81 | 56.25 | 32.81 |
Gradient Boosting tree | 21.87 | 45.31 | 21.87 |
ADA boost tree | 28.12 | 51.56 | 28.12 |
Neural Net | 20.31 | 35.94 | 20.31 |
LightGBM | 32.81 | 56.25 | 32.81 |
In conclusion, odds-based features from bookmakers are reliable for determining the winner of a match, but very bad at detecting draws; ensemble methods like Random Forest and Gradient Boosting Tree are superior there. The squad indices from the FIFA video games provide extra information and contribute significantly to "Goal Difference" prediction. The more complex machine learning models show little improvement over a simple odds- or strength-based tree, which is reasonable: the amount of data is limited, and a simple decision tree already provides an adequate solution.
Appendix
Experiment 1
- Odd-based Decision Tree:
*(Figures: confusion matrix and ROC curve.)*
- h2h-Form-based Decision Tree:
*(Figures: confusion matrix and ROC curve.)*
- Logistic Regression
Best parameters:
```python
LogisticRegression(C=0.002154434690031882, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, max_iter=100,
                   multi_class='multinomial', n_jobs=1, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- Random Forest
Best parameters:
```python
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=15, n_jobs=1,
                       oob_score=False, random_state=85, verbose=0, warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- Gradient Boosting tree
Best parameters:
```python
GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           presort='auto', random_state=0, subsample=1.0, verbose=False,
                           warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- ADA boost tree
Best parameters:
```python
AdaBoostClassifier(algorithm='SAMME',
                   base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
                                                         max_features=None, max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0, min_impurity_split=None,
                                                         min_samples_leaf=1, min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0, presort=False, random_state=None,
                                                         splitter='best'),
                   learning_rate=1, n_estimators=100, random_state=0)
```
*(Figures: confusion matrix and ROC curve.)*
- Neural Net
Best parameters:
```python
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(10, 5), learning_rate='constant',
              learning_rate_init=0.1, max_iter=1000, momentum=0.9,
              nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
              solver='adam', tol=1e-10, validation_fraction=0.1, verbose=False,
              warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- LightGBM
Best parameters:
```python
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               learning_rate=0.1, max_depth=-1, min_child_samples=20,
               min_child_weight=0.001, min_split_gain=0.0, n_estimators=20,
               n_jobs=-1, num_leaves=31, objective=None, random_state=1,
               reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)
```
*(Figures: confusion matrix and ROC curve.)*
Experiment 2
- Odd-based Decision Tree:
*(Figures: confusion matrix and ROC curve.)*
- h2h-Form-based Decision Tree:
*(Figures: confusion matrix and ROC curve.)*
- Squad-strength-based Decision Tree:
*(Figures: confusion matrix and ROC curve.)*
- Logistic Regression
Best parameters:
```python
LogisticRegression(C=2.1544346900318823e-05, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, max_iter=100,
                   multi_class='multinomial', n_jobs=1, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- Random Forest
Best parameters:
```python
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=15, n_jobs=1,
                       oob_score=False, random_state=85, verbose=0, warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- Gradient Boosting tree
Best parameters:
```python
GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=1000,
                           presort='auto', random_state=0, subsample=1.0, verbose=False,
                           warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- ADA boost tree
Best parameters:
```python
AdaBoostClassifier(algorithm='SAMME',
                   base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
                                                         max_features=None, max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0, min_impurity_split=None,
                                                         min_samples_leaf=1, min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0, presort=False, random_state=None,
                                                         splitter='best'),
                   learning_rate=1, n_estimators=100, random_state=0)
```
*(Figures: confusion matrix and ROC curve.)*
- Neural Net
Best parameters:
```python
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(30, 15), learning_rate='constant',
              learning_rate_init=0.1, max_iter=1000, momentum=0.9,
              nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
              solver='adam', tol=1e-10, validation_fraction=0.1, verbose=False,
              warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- LightGBM
Best parameters:
```python
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               learning_rate=0.1, max_depth=-1, min_child_samples=20,
               min_child_weight=0.001, min_split_gain=0.0, n_estimators=15,
               n_jobs=-1, num_leaves=31, objective=None, random_state=1,
               reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)
```
*(Figures: confusion matrix and ROC curve.)*
World Cup 2018 result
We now apply the model to World Cup 2018 in Russia, simulating the tournament 100,000 times.
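The simulation code itself is not shown in this report, but the idea is standard Monte Carlo: sample each match outcome from the classifier's predicted class probabilities and replay the tournament many times. A minimal sketch under that assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
N_SIM = 100_000  # number of tournament simulations

def sample_outcome(model, match_features):
    """Draw one result label ('win_1', 'draw_0', ...) for a single match,
    with probabilities taken from the trained classifier."""
    probs = model.predict_proba(match_features.reshape(1, -1))[0]
    return rng.choice(model.classes_, p=probs)
```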
Result Explanation:
Team A vs Team B (valid for the 90-minute result only)
- "win_1": A wins by 1 goal
- "win_2": A wins by 2 goals
- "win_3": A wins by 3 or more goals
- "lose_1": B wins by 1 goal
- "lose_2": B wins by 2 goals
- "lose_3": B wins by 3 or more goals
- "draw_0": draw
Final and Third Place