- Introduction
- Data:
- Exploratory Data Analysis:
- Correlation between variables
- How does head-to-head matchup history affect the current match?
- How do recent performances affect the current match?
- Do stronger teams usually win?
- Do young players play better than old ones?
- Is short passing better than long passing?
- How do labels distribute in reduced dimensions?
- Methodology: details of the procedure
- Models
- Baseline models
- Odd-based model
- History-and-form-based model
- Squad-strength-based model
- Enhanced models
- Logistic Regression
- Random Forest
- Gradient Boosting tree
- ADA boost tree
- Neural Network
- Light GBM
- Evaluation Criteria
- F1
- 10-fold Cross Validation accuracy
- Area under ROC
Abstract:
In this work, we compare nine modeling approaches for predicting soccer match results and goal differences, trained on all international matches from 2005 to 2017, the FIFA World Cups 2010 and 2014, and the UEFA EUROs 2012 and 2016. Within this comparison, "Win / Draw / Lose" prediction shows little difference across models, while "Goal Difference" prediction clearly favors Random Forest and the squad-strength-based decision tree. We also apply these models to World Cup 2018: again, Random Forest and Logistic Regression reach about 33% accuracy for "Goal Difference" and about 57% for "Win / Draw / Lose", although simple decision trees based on betting odds and squad strength are comparable.
Objective:
- Predict the winner of international matches; prediction targets are "Win / Lose / Draw" or "goal difference".
- Apply the model to predict the result of FIFA world cup 2018.
Supervisor: Pratibha Rathore
Lifecycle
Data: The dataset covers all international matches from 2000 to 2018: results, betting odds, FIFA rankings, and squad strengths. The numbered sources below are the ones referenced in the "Source" column of the feature table:
1. FIFA World Cup 2018
2. International match 1872 - 2018
3. FIFA Ranking through Time
4. Bet Odd
5. Bet Odd 2
6. Squad Strength - Sofia
7. Squad Strength - FIFA index
Feature Selection: To determine which team is more likely to win a match, I came up with four main groups of features, based on my domain knowledge:
- Head-to-head match history between the two teams. Some teams have opponents they rarely beat, no matter how strong they currently are; for example, Germany usually fails to beat Italy within 90 minutes.
- Recent performance of each team (the 10 most recent matches), a.k.a. "form". A team in good form usually has a higher chance of winning its next matches.
- Betting odds before matches. Bookmakers have already done extensive analysis before each match to set the odds, so why not include them?
- Squad strength (from the FIFA video games). Real squad-strength data are not free and not always available, so we use the strength ratings from the FIFA video games, which are updated regularly to track real-world strength.
Feature List: The feature list reflects those four factors.
- *diff*: team 1 value minus team 2 value
- *form*: performance in the 10 most recent matches
Feature Name | Description | Source |
---|---|---|
team_1 | Nation code (e.g. US, NZ) | 1 & 2 |
team_2 | Nation code (e.g. US, NZ) | 1 & 2 |
date | Date of match, yyyy-mm-dd | 1 & 2 |
tournament | Friendly, EURO, AFC, FIFA WC | 1 & 2 |
h_win_diff | Head2Head: win difference | 2 |
h_draw | Head2Head: number of draw | 2 |
form_diff_goalF | Form: difference in "Goal For" | 2 |
form_diff_goalA | Form: difference in "Goal Against" | 2 |
form_diff_win | Form: difference in number of win | 2 |
form_diff_draw | Form: difference in number of draw | 2 |
odd_diff_win | Betting odds: difference in win odds | 4 & 5 |
odd_draw | Betting odds: odds of a draw | 4 & 5 |
game_diff_rank | Squad Strength: difference in FIFA Rank | 3 |
game_diff_ovr | Squad Strength: difference in Overall Strength | 6 |
game_diff_attk | Squad Strength: difference in Attack Strength | 6 |
game_diff_mid | Squad Strength: difference in Midfield Strength | 6 |
game_diff_def | Squad Strength: difference in Defense Strength | 6 |
game_diff_prestige | Squad Strength: difference in prestige | 6 |
game_diff_age11 | Squad Strength: difference in age of 11 starting players | 6 |
game_diff_ageAll | Squad Strength: difference in age of all players | 6 |
game_diff_bup_speed | Squad Strength: difference in Build Up Play Speed | 6 |
game_diff_bup_pass | Squad Strength: difference in Build Up Play Passing | 6 |
game_diff_cc_pass | Squad Strength: difference in Chance Creation Passing | 6 |
game_diff_cc_cross | Squad Strength: difference in Chance Creation Crossing | 6 |
game_diff_cc_shoot | Squad Strength: difference in Chance Creation Shooting | 6 |
game_diff_def_press | Squad Strength: difference in Defense Pressure | 6 |
game_diff_def_aggr | Squad Strength: difference in Defense Aggression | 6 |
game_diff_def_teamwidth | Squad Strength: difference in Defense Team Width | 6 |
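As a concrete illustration of the *diff* convention, here is a minimal sketch (with hypothetical column names; the project's real preprocessing code is not shown in this report) of deriving difference features with pandas:

```python
import pandas as pd

def add_diff_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive team1-minus-team2 features, assuming per-team columns
    such as 'ovr_1' / 'ovr_2' exist (hypothetical names)."""
    out = df.copy()
    for col in ["rank", "ovr", "attk", "mid", "def"]:
        out[f"game_diff_{col}"] = out[f"{col}_1"] - out[f"{col}_2"]
    return out
```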
There are a few questions we ask in order to understand the data better:
Imbalance of data
Correlation between variables
First, we draw the correlation matrix of the large dataset, which contains all matches from 2005-2018 with feature groups 1, 2 and 3.
In general, the features are not correlated. "odd_win_diff" is fairly negatively correlated with "form_diff_win" (-0.5), indicating that the two teams' form reflects the bookmakers' belief about the winner. Another interesting point: as the difference in betting odds grows, we see larger goal differences (correlation = -0.6).
Second, we draw the correlation matrix of the small dataset, which contains all matches from World Cups 2010, 2014 and 2018 and EUROs 2012 and 2016.
The overall rating is just an average of the "attack", "defense" and "midfield" indices, so they are highly correlated with each other. In addition, some of the new squad-strength features correlate strongly with existing ones, for example "FIFA Rank", "Overall rating" and "Difference in winning odds".
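For reference, both correlation matrices can be reproduced in a few lines; `df` is a hypothetical DataFrame holding the engineered features:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr()  # pairwise Pearson correlations of the numeric features
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Feature correlation matrix")
plt.tight_layout()
plt.show()
```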
How does head-to-head matchup history affect the current match?
You might expect that when the head-to-head win difference is positive, the match result should be "Win" (team 1 beats team 2), and vice versa: when it is negative, the result should be "Lose" (team 2 beats team 1). In fact, a positive head-to-head win difference gives a 51.8% chance that the match ends in "Win", and a negative one gives a 55.5% chance that it ends in "Lose".
Let's perform a two-sample t-test. Null hypothesis: there is no difference in 'h2h win difference' between "Win" and "Lose". Alternative hypothesis: there is a difference in 'h2h win difference' between "Win" and "Lose".
T-test between win and lose:
Ttest_indResult(statistic=24.30496036405259, pvalue=2.503882847793891e-126)
The very small p-value means we can reject the null hypothesis and accept the alternative.
We can run the same procedure for win-draw and lose-draw:
T-test between win and draw:
Ttest_indResult(statistic=7.8385466293651023, pvalue=5.395456011352264e-15)
T-test between lose and draw:
Ttest_indResult(statistic=-8.6759649601068887, pvalue=5.2722587025773183e-18)
Therefore, we can say that the head-to-head history of the two teams contributes significantly to the result.
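The statistics above come from scipy's two-sample t-test. A minimal sketch, assuming a DataFrame `df` with a 'result' column and the 'h_win_diff' feature (hypothetical layout):

```python
from scipy.stats import ttest_ind

win = df.loc[df["result"] == "win", "h_win_diff"]
lose = df.loc[df["result"] == "lose", "h_win_diff"]
draw = df.loc[df["result"] == "draw", "h_win_diff"]

print("T-test between win and lose:", ttest_ind(win, lose))
print("T-test between win and draw:", ttest_ind(win, draw))
print("T-test between lose and draw:", ttest_ind(lose, draw))
```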
How do recent performances affect the current match?
We consider differences in "Goal For" (goals scored), "Goal Against" (goals conceded), "number of winning matches" and "number of drawing matches" over the 10 most recent matches, and run the same procedure as in the previous question. From the pie charts, we can see a clear distinction in "number of wins": the proportion of "Win" results decreases from 49% to 25%, while "Lose" results increase from 26.5% to 52.3%.
Pie charts are not enough; we should run hypothesis tests to check the significance of each feature.
Feature Name | t-test between 'win' and 'lose' | t-test between 'win' and 'draw' | t-test between 'lose' and 'draw' |
---|---|---|---|
Goal For | pvalue = 2.50e-126 | pvalue = 5.39e-15 | pvalue = 5.27e-18 |
Goal Against | pvalue = 0.60 | pvalue = 0.17 | pvalue = 0.08 |
Number of Winning Matches | pvalue = 3.02e-23 | pvalue = 1.58e-33 | pvalue = 2.57e-29 |
Number of Draw Matches | pvalue = 1.53e-06 | pvalue = 0.21 | pvalue = 0.03 |
We see many small p-values for "Goal For" and "Number of Winning Matches". Based on the t-tests, differences in "Goal For" and "Number of Winning Matches" are useful features.
Do stronger teams usually win?
We define stronger teams based on
- Higher FIFA Ranking
- Higher Overall Rating
Feature Name | t-test between 'win' and 'lose' | t-test between 'win' and 'draw' | t-test between 'lose' and 'draw' |
---|---|---|---|
FIFA Rank | pvalue = 2.11e-10 | pvalue = 0.65 | pvalue = 0.00068 |
Overall Rating | pvalue = 1.53e-16 | pvalue = 0.0804 | pvalue = 0.000696 |
Do young players play better than old ones?
Young players may have better stamina and more energy while older players have more experience. We want to see how age affects match results.
Feature Name | t-test between 'win' and 'lose' | t-test between 'win' and 'draw' | t-test between 'lose' and 'draw' |
---|---|---|---|
Age | pvalue = 2.07e-05 | pvalue = 0.312 | pvalue = 0.090 |
Based on the t-test and pie chart, we know that age contributes significantly to the result. More specifically, younger teams tend to play better than older ones.
Is short passing better than long passing? A higher value of "Build Up Play Passing" means long passes, a lower value means short passes, and middle values mean a mixed passing style.
Feature Name | t-test between 'win' and 'lose' | t-test between 'win' and 'draw' | t-test between 'lose' and 'draw' |
---|---|---|---|
Build Up Play Passing | pvalue = 1.05e-07 | pvalue = 0.0062 | pvalue = 0.571 |
Based on the t-test and pie chart, we know that the passing style contributes significantly to the result. More specifically, teams that rely on longer passes usually lose the game.
How does crossing pass affect match result ?
How does chance creation shooting affect match result ?
How does defence pressure affect match result ?
How does defence aggression affect match result ?
How does defence team width affect match result ?
How do labels distribute in reduced dimensions?
For this question, we use PCA to pick the first two principal components, which best explain the data, and then plot the data in this reduced space.
While "Win" and "Lose" are fairly well separated, "Draw" appears mixed in with the other labels.
Our main prediction targets are "Win / Lose / Draw" and "Goal Difference". In this work we run two main experiments; each follows the procedure below (sketched in code after the list):
- Split the data 70:30 into training and test sets.
- Normalize the features and convert categorical variables to numbers.
- Perform k-fold cross-validation on the training set to select the best parameters for each model according to some criterion.
- Run the best model through 10-fold cross-validation (9 folds for training, 1 fold for testing) and take the mean test error, which is a more reliable estimate.
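The sketch below illustrates the procedure with Logistic Regression standing in for any of the models; `X` and `y` are hypothetical, and the parameter grid is illustrative only:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 70:30 split for parameter selection.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Normalization + model in one pipeline; grid search picks the best C.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(pipe, {"logisticregression__C": [1e-3, 1e-2, 1e-1, 1]}, cv=5)
grid.fit(X_train, y_train)

# Mean 10-fold CV accuracy of the tuned model: the more reliable estimate.
scores = cross_val_score(grid.best_estimator_, X, y, cv=10)
print(scores.mean())
```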
Experiment 1. Build classifiers for "Win / Lose / Draw" on matches from 2005 onward. The "Bet Odds" features are only available after 2005, so we restrict the experiments to this period.
Experiment 2. Build classifiers for "Goal Difference" on "World Cup" and "UEFA EURO" matches after 2010. The "Squad Strength" features are not consistently available before 2010, as some national teams have no squad-strength entries in the FIFA video games. Tackling this prediction as regression would be hard, so we turn "Goal Difference" into classification with the following labels (a small helper that produces them is sketched after the list):
Team A vs Team B
- "win_1": A wins with 1 goal differences
- "win_2": A wins with 2 goal differences
- "win_3": A wins with 3 or more goal differences
- "lose_1": B wins with 1 goal differences
- "lose_2": B wins with 2 goal differences
- "lose_3": A wins with 3 or more goal differences
- "draw_0": Draw
Experiment 3. In addition, we test how well the models trained in Experiment 2 predict the "Goal Difference" and "Win / Draw / Lose" outcomes of the World Cup 2018 matches.
Baseline Models: In the EDA part we investigated feature importance and saw that odds, history, form and squad strength are all significant. We now divide the features into three groups (odds, h2h-form, squad strength) and build a "Baseline Model" on each group. To keep the baselines simple, we fix the Decision Tree hyper-parameters at maximum depth = 2 and maximum leaf nodes = 3; a minimal sketch follows.
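Here is one such baseline, the odd-based tree, assuming `X_train` / `X_test` are DataFrames containing the features listed earlier (hypothetical setup):

```python
from sklearn.tree import DecisionTreeClassifier

odd_features = ["odd_diff_win", "odd_draw"]

# Deliberately tiny tree: depth <= 2 and at most 3 leaves.
baseline = DecisionTreeClassifier(max_depth=2, max_leaf_nodes=3, random_state=0)
baseline.fit(X_train[odd_features], y_train)
print(baseline.score(X_test[odd_features], y_test))
```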
- Odd-based model:
*(Figures: the fitted odd-based tree for Experiment 1 and Experiment 2.)*
- History-Form-based model:
*(Figures: the fitted history-form tree for Experiment 1 and Experiment 2.)*
- Squad-strength-based model: (Experiment 2 only, since the squad-strength features are only available there.)
Enhanced Models:
To beat the baseline models, we use all features with several machine learning algorithms, listed below (and gathered into a dictionary in the sketch after the list):
- Logistic Regression
- Random Forest
- Gradient Boosting Tree
- ADA Boost Tree
- Neural Network
- LightGBM
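For uniform tuning and evaluation, the enhanced models can be collected in one place; the constructor arguments below are illustrative defaults, not the tuned values (those appear in the appendix):

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from lightgbm import LGBMClassifier

models = {
    "Logistic Regression": LogisticRegression(multi_class="multinomial", solver="lbfgs"),
    "Random Forest": RandomForestClassifier(n_estimators=15),
    "Gradient Boosting Tree": GradientBoostingClassifier(),
    "ADA Boost Tree": AdaBoostClassifier(),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(10, 5), max_iter=1000),
    "LightGBM": LGBMClassifier(),
}
```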
Models are evaluated on the following criteria, each computed for the labels "win", "lose" and "draw" (a metrics sketch follows this list):
- Precision: among our predictions of a given label, what percentage did we get right? The higher, the better.
- Recall: among the actual occurrences of a given label, what percentage did we catch? The higher, the better.
- F1: a balance of precision and recall; the higher, the better. There are two variants:
  - F1-micro: compute F1 by aggregating the true positives and false positives of each class.
  - F1-macro: compute F1 independently for each class and take the average (all classes weighted equally).

  In a multi-class setting, the micro-average is preferable if you suspect class imbalance (i.e. many more examples of one class than of the others), so we stick with F1-micro.
- 10-fold cross-validation accuracy: the mean accuracy over the cross-validation folds. This is a reliable estimate of the model's test error (no separate train/test split needed).
- Area under ROC (AUROC): for binary classification, the true positive rate vs. the false positive rate across all thresholds.
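A sketch of computing these criteria for one fitted multi-class model (`model`, `X`, `y`, `X_test`, `y_test` are hypothetical):

```python
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import label_binarize

y_pred = model.predict(X_test)
print("F1-micro:", f1_score(y_test, y_pred, average="micro"))
print("10-fold CV accuracy:", cross_val_score(model, X, y, cv=10).mean())

# Micro-averaged AUROC: binarize the labels one-vs-rest, then aggregate.
y_bin = label_binarize(y_test, classes=model.classes_)
y_score = model.predict_proba(X_test)
print("AUROC-micro:", roc_auc_score(y_bin, y_score, average="micro"))
```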
Experiment 1 "Draw / Lose /Win"
Model | 10-fold CV accuracy (%) | F1 - micro average | AUROC - micro average |
---|---|---|---|
Odd-based Decision Tree | 59.28 | 60.22 | 0.76 |
H2H-Form based Decision Tree | 51.22 | 51.52 | 0.66 |
Logistic Regression | 59.37 | 59.87 | 0.76 |
Random Forest | 54.40 | 55.92 | 0.74 |
Gradient Boosting tree | 58.60 | 59.47 | 0.77 |
ADA boost tree | 59.08 | 60.22 | 0.77 |
Neural Net | 58.96 | 58.36 | 0.77 |
LightGBM | 59.49 | 60.28 | 0.78 |
Results from Experiment 1 show little improvement of the enhanced models over the baselines on all three evaluation criteria: 10-fold cross-validation accuracy, F1 and area under the ROC curve. A simple odd-based decision tree is enough to classify "Win / Draw / Lose". However, according to the confusion matrices in the appendix for Experiment 1, most classifiers fail on the "Draw" label; only Random Forest and Gradient Boosting Tree predict "Draw" at all, with 74 and 29 hits respectively. Since the classifiers otherwise differ little on the remaining criteria, our recommendation for classifying "Win / Draw / Lose" is Gradient Boosting Tree or Random Forest.
Experiment 2 "Goal Difference"
Model | 10-fold CV accuracy (%) | F1 - micro average | AUROC - micro average |
---|---|---|---|
Odd-based Decision Tree | 26.41 | 25.37 | 0.62 |
H2H-Form-based Decision Tree | 16.74 | 18.94 | 0.59 |
Squad-strength-based Decision Tree | 31.64 | 31.34 | 0.66 |
Logistic Regression | 21.39 | 22.38 | 0.64 |
Random Forest | 25.36 | 25.37 | 0.60 |
Gradient Boosting tree | 27.27 | 16.42 | 0.58 |
ADA boost tree | 26.92 | 16.41 | 0.59 |
Neural Net | 22.42 | 25.37 | 0.63 |
LightGBM | 25.62 | 20.89 | 0.57 |
In Experiment 2, the squad-strength-based decision tree tends to be superior to the other classifiers.
Experiment 3 "Goal Difference" and "Win/Draw/Lose" in World Cup 2018
Model | "Goal Difference" Accuracy | "Win/Draw/Lose" Accuracy (%) | F1 - micro average |
---|---|---|---|
Odd-based Decision Tree | 31.25 | 48.43 | 31.25 |
H2H-Form based Decision Tree | 25.00 | 34.37 | 25.00 |
Squad strength based Decision Tree | 28.12 | 43.75 | 28.12 |
Logistic Regression | 32.81 | 57.81 | 32.81 |
Random Forest | 32.81 | 56.25 | 32.81 |
Gradient Boosting tree | 21.87 | 45.31 | 21.87 |
ADA boost tree | 28.12 | 51.56 | 28.12 |
Neural Net | 20.31 | 35.94 | 20.31 |
LightGBM | 32.81 | 56.25 | 32.81 |
In conclusion, odds-based features from bookmakers are reliable for determining the winner of a match, but very bad at detecting draws; ensemble methods like Random Forest and Gradient Boosting Tree are superior there. The squad indices from the FIFA video games provide extra information and contribute significantly to "Goal Difference" prediction. The more complex machine learning models show little improvement over a simple odds- or strength-based tree, which is reasonable: the amount of data is limited, and a simple decision tree already provides an adequate solution.
Appendix
Experiment 1
- Odd-based Decision Tree:
*(Figures: confusion matrix and ROC curve.)*
- h2h-Form-based Decision Tree:
*(Figures: confusion matrix and ROC curve.)*
- Logistic Regression
Best parameters:
```python
LogisticRegression(C=0.002154434690031882, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, max_iter=100,
                   multi_class='multinomial', n_jobs=1, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- Random Forest
Best parameters:
```python
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=15, n_jobs=1,
                       oob_score=False, random_state=85, verbose=0, warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- Gradient Boosting tree
Best parameters:
```python
GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           presort='auto', random_state=0, subsample=1.0, verbose=False,
                           warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- ADA boost tree
Best parameters:
```python
AdaBoostClassifier(algorithm='SAMME',
                   base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
                                                         max_features=None, max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0, min_impurity_split=None,
                                                         min_samples_leaf=1, min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0, presort=False, random_state=None,
                                                         splitter='best'),
                   learning_rate=1, n_estimators=100, random_state=0)
```
*(Figures: confusion matrix and ROC curve.)*
- Neural Net
Best parameters:
```python
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(10, 5), learning_rate='constant',
              learning_rate_init=0.1, max_iter=1000, momentum=0.9,
              nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
              solver='adam', tol=1e-10, validation_fraction=0.1, verbose=False,
              warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- LightGBM
Best parameters:
```python
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               learning_rate=0.1, max_depth=-1, min_child_samples=20,
               min_child_weight=0.001, min_split_gain=0.0, n_estimators=20,
               n_jobs=-1, num_leaves=31, objective=None, random_state=1,
               reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)
```
*(Figures: confusion matrix and ROC curve.)*
Experiment 2
- Odd-based Decision Tree:
*(Figures: confusion matrix and ROC curve.)*
- h2h-Form-based Decision Tree:
*(Figures: confusion matrix and ROC curve.)*
- Squad-strength-based Decision Tree:
*(Figures: confusion matrix and ROC curve.)*
- Logistic Regression
Best parameters:
```python
LogisticRegression(C=2.1544346900318823e-05, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, max_iter=100,
                   multi_class='multinomial', n_jobs=1, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- Random Forest
Best parameters:
```python
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=15, n_jobs=1,
                       oob_score=False, random_state=85, verbose=0, warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- Gradient Boosting tree
Best parameters:
```python
GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=1000,
                           presort='auto', random_state=0, subsample=1.0, verbose=False,
                           warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- ADA boost tree
Best parameters:
```python
AdaBoostClassifier(algorithm='SAMME',
                   base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
                                                         max_features=None, max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0, min_impurity_split=None,
                                                         min_samples_leaf=1, min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0, presort=False, random_state=None,
                                                         splitter='best'),
                   learning_rate=1, n_estimators=100, random_state=0)
```
*(Figures: confusion matrix and ROC curve.)*
- Neural Net
Best parameters:
```python
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(30, 15), learning_rate='constant',
              learning_rate_init=0.1, max_iter=1000, momentum=0.9,
              nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
              solver='adam', tol=1e-10, validation_fraction=0.1, verbose=False,
              warm_start=False)
```
*(Figures: confusion matrix and ROC curve.)*
- LightGBM
Best parameters:
```python
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               learning_rate=0.1, max_depth=-1, min_child_samples=20,
               min_child_weight=0.001, min_split_gain=0.0, n_estimators=15,
               n_jobs=-1, num_leaves=31, objective=None, random_state=1,
               reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)
```
*(Figures: confusion matrix and ROC curve.)*
World Cup 2018 result
We now apply the model to World Cup 2018 in Russia, simulating the tournament 100,000 times.
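The simulation code itself is not shown in this report, but the idea is standard Monte Carlo: sample each match outcome from the classifier's predicted class probabilities and replay the tournament many times. A minimal sketch under that assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
N_SIM = 100_000  # number of tournament simulations

def sample_outcome(model, match_features):
    """Draw one result label ('win_1', 'draw_0', ...) for a single match,
    with probabilities taken from the trained classifier."""
    probs = model.predict_proba(match_features.reshape(1, -1))[0]
    return rng.choice(model.classes_, p=probs)
```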
Result Explanation:
Team A vs Team B (valid for the 90-minute result only)
- "win_1": A wins by 1 goal
- "win_2": A wins by 2 goals
- "win_3": A wins by 3 or more goals
- "lose_1": B wins by 1 goal
- "lose_2": B wins by 2 goals
- "lose_3": B wins by 3 or more goals
- "draw_0": draw
Final and Third Place