LoL Optimization
June 28th, 2023
In the game, two teams of five players battle in player-versus-player combat, each team occupying and defending their half of the map. Each of the ten players controls a character, known as a "champion", with unique abilities and differing styles of play. During a match, champions become more powerful by collecting experience points, earning gold, and purchasing items to defeat the opposing team. In League's main mode, Summoner's Rift, a team wins by pushing through to the enemy base and destroying their "Nexus", a large structure located within. (wikipedia)
The two teams are the red team and the blue team.
In League there are different "tiers" of players called "leagues". New and unskilled players (sorry) are in Bronze league while relatively skilled players are in Diamond league.
*We will refer to a player in Diamond league as a diamond player and a player in Bronze league as a bronze player.
Investigate the differences between bronze players and diamond players. Presumably, bronze players and diamond players will play the game somewhat differently due to their relative skill difference. However, we don't expect huge differences, since bronze players may try to imitate diamond players in order to get better at the game.
We want to know:
Are there any systematic differences in the way bronze players and diamond players approach the game?
Do bronze league games and diamond league games play out the same?
There are 8 provided datasets. Each row of each dataset records information from one match up to a certain point in time (15, 20, 25, or 30 minutes).
For example
timeline_DIAMOND_15.csv contains match data up to 15 minutes into the game for diamond players.
timeline_BRONZE_30.csv contains match data up to 30 minutes into the game for bronze players.
We will be using all datasets to look at differences between bronze and diamond players over the course of the game.
Using the provided data we will compare the gameplay of bronze and diamond players at four stages of the game.
Stage 1 is the 15 minute mark
Stage 2 is the 20 minute mark
Stage 3 is the 25 minute mark
Stage 4 is the 30 minute mark
We will try to discover how features differ between the two groups and which features are important to predicting the winner of the game.
# standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score
# score classifiers
from sklearn.preprocessing import OneHotEncoder
def brier_score(targets, probs):
    # mean squared distance between the predicted probabilities and the one-hot encoded targets
    enc = OneHotEncoder()
    target_enc = enc.fit_transform(np.array(targets).reshape(-1, 1)).toarray()
    return np.mean(np.sum((probs - target_enc)**2, axis=1))

def log_score(targets, probs):
    # negative mean log-likelihood of the true class (the small constant avoids log(0))
    enc = OneHotEncoder()
    target_enc = enc.fit_transform(np.array(targets).reshape(-1, 1)).toarray()
    return -np.mean(np.sum(target_enc * np.log(probs + 1e-32), axis=1))
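As a quick sanity check on these scoring functions, here is a tiny example with made-up targets and probabilities (not from the match data): a perfect probabilistic prediction scores 0 on both measures, while a 50/50 guess on a binary target scores 0.5 on the Brier score and about 0.693 on the log score.
# toy example: perfect predictions vs. 50/50 guesses (made-up data, not from the datasets)
toy_targets = [0, 1, 1, 0]
perfect = np.array([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])  # columns: P(class 0), P(class 1)
uniform = np.full((4, 2), 0.5)
print(brier_score(toy_targets, perfect), brier_score(toy_targets, uniform))  # 0.0, 0.5
print(log_score(toy_targets, perfect), log_score(toy_targets, uniform))      # ~0.0, ~0.693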
Let's assume we don't know anything about League. We have never played the game, never seen anyone play it, and maybe we haven't even heard of it. That's okay; we can still analyze the data and try to learn something.
First things first, we want to do some exploratory analysis. This will help us identify consistent trends between bronze and diamond players and become familiar with the data.
We will focus on three team-level features.
The amount of "xp" earned by each team. A team with more "xp" is typically stronger than a team with less "xp". Basically more xp is better.
The amount of "gold" earned by each team. A team with more gold typically has better equipment than a team with less gold. Basically more gold is better.
The number of "wards" placed by each team. A team with more wards can react to the opposing team better. Basically more wards is better.
Using the data available up to 15 minutes in the game, compare the distribution of xp, gold, and wards for bronze and diamond players.
Create 9 subfigures in a 3x3 grid (done for you). Fill in the provided template with the appropriate variables.
The column names (variables) you will need are: blue_gold, red_gold, gold_diff, blue_xp, red_xp, xp_diff, blue_ward_placed, red_ward_placed, ward_placed_diff. These variables record the amount of gold, xp, and wards each team (red or blue) has and the difference.
In each subfigure plot two histograms: one for diamond players and one for bronze players.
#bronze15 contains bronze league match data up to 15 minutes into the game
bronze15 = pd.read_csv('data/timeline_BRONZE_15.csv', index_col = 0)
#diamond15 contains diamond league match data up to 15 minutes into the game
diamond15 = pd.read_csv('data/timeline_DIAMOND_15.csv', index_col = 0)
# check column names (should be the same for both datasets)
print(bronze15.columns)
print(diamond15.columns)
fig, ax = plt.subplots(3, 3, constrained_layout = True, figsize = (12, 8))
# gold histograms
ax[0,0].set_title('Blue team gold', fontsize = 15)
ax[0,0].hist(diamond15["blue_gold"],alpha=0.5, label='diamond')
ax[0,0].hist(bronze15["blue_gold"],alpha=0.5, label='bronze')
ax[0,0].legend()
ax[0,1].set_title('Red team gold', fontsize = 15)
ax[0,1].hist(diamond15["red_gold"],alpha=0.5, label='diamond')
ax[0,1].hist(bronze15["red_gold"],alpha=0.5, label='bronze')
ax[0,1].legend()
ax[0,2].set_title('Gold difference', fontsize = 15)
ax[0,2].hist(diamond15["gold_diff"],alpha=0.5, label='diamond')
ax[0,2].hist(bronze15["gold_diff"],alpha=0.5, label='bronze')
ax[0,2].legend()
# XP histograms
ax[1,0].set_title('Blue team XP', fontsize = 15)
ax[1,0].hist(diamond15["blue_xp"],alpha=0.5, label='diamond')
ax[1,0].hist(bronze15["blue_xp"],alpha=0.5, label='bronze')
ax[1,0].legend()
ax[1,1].set_title('Red team XP', fontsize = 15)
ax[1,1].hist(diamond15["red_xp"],alpha=0.5, label='diamond')
ax[1,1].hist(bronze15["red_xp"],alpha=0.5, label='bronze')
ax[1,1].legend()
ax[1,2].set_title('XP difference', fontsize = 15)
ax[1,2].hist(diamond15["xp_diff"],alpha=0.5, label='diamond')
ax[1,2].hist(bronze15["xp_diff"],alpha=0.5, label='bronze')
ax[1,2].legend()
# Ward histograms
ax[2,0].set_title('Blue team wards', fontsize = 15)
ax[2,0].hist(diamond15["blue_ward_placed"],alpha=0.5, label='diamond')
ax[2,0].hist(bronze15["blue_ward_placed"],alpha=0.5, label='bronze')
ax[2,0].legend()
ax[2,1].set_title('Red team wards', fontsize = 15)
ax[2,1].hist(diamond15["red_ward_placed"],alpha=0.5, label='diamond')
ax[2,1].hist(bronze15["red_ward_placed"],alpha=0.5, label='bronze')
ax[2,1].legend()
ax[2,2].set_title('Ward difference', fontsize = 15)
ax[2,2].hist(diamond15["ward_placed_diff"],alpha=0.5, label='diamond')
ax[2,2].hist(bronze15["ward_placed_diff"],alpha=0.5, label='bronze')
ax[2,2].legend()
plt.show()
import statistics as st
# mean of each variable for the diamond and bronze 15-minute data, rounded to 3 decimals
metricCols = ["blue_gold", "red_gold", "gold_diff",
              "blue_xp", "red_xp", "xp_diff",
              "blue_ward_placed", "red_ward_placed", "ward_placed_diff"]
averageMetrics = pd.DataFrame(index=["diamond15", "bronze15"], columns=metricCols)
averageMetrics.loc["diamond15"] = diamond15[metricCols].mean().round(3).values
averageMetrics.loc["bronze15"] = bronze15[metricCols].mean().round(3).values
averageMetrics
Overall, diamond players outperform bronze players in every category. You can see this in each plot simply by looking at the histograms, or you can observe it through the differences in means.
For example, the average blue_gold for diamond players is almost 3000 higher than for bronze players. The same holds for blue_xp (over 1500 higher) and blue_ward_placed (over 2 more wards on average).
Within each rank there is no clear difference between the red and blue teams: blue averages slightly more gold while red averages slightly more xp (at least for diamond players), but these gaps are small relative to the spread of the distributions; a rough check follows below.
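As a rough check of the red-versus-blue claim, a two-sample t-test can compare the team columns within each rank. This is only a sketch: it treats matches as independent and uses scipy, which is an extra dependency not imported at the top of the notebook.
# rough red-vs-blue comparison within each rank (assumes independent matches; scipy is an extra import)
from scipy import stats
for name, df in [("diamond15", diamond15), ("bronze15", bronze15)]:
    _, p_gold = stats.ttest_ind(df["blue_gold"], df["red_gold"])
    _, p_xp = stats.ttest_ind(df["blue_xp"], df["red_xp"])
    print(f"{name}: gold p-value = {p_gold:.3f}, xp p-value = {p_xp:.3f}")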
Using the data available up to 15, 20, 25 and 30 minutes into the game, compare the distribution of gold for bronze and diamond players.
Create 12 subfigures in a 4x3 grid (done for you). Fill in the provided template with the appropriate variables.
The column names (variables) you will need are: blue_gold, red_gold, gold_diff
In each subfigure plot two histograms: one for diamond players and one for bronze players. Make each histogram transparent (alpha = 0.5) since they will overlap. Label the histograms.
Create a table that clearly displays the mean of each histogram for each of the twelve subfigures. The table should be 12x3: each row is a subfigure, column 1 contains the title of the subfigure, column 2 contains the bronze mean, and column 3 contains the diamond mean.
diamond15 = pd.read_csv('data/timeline_DIAMOND_15.csv', index_col = 0)
diamond20 = pd.read_csv('data/timeline_DIAMOND_20.csv', index_col = 0)
diamond25 = pd.read_csv('data/timeline_DIAMOND_25.csv', index_col = 0)
diamond30 = pd.read_csv('data/timeline_DIAMOND_30.csv', index_col = 0)
bronze15 = pd.read_csv('data/timeline_BRONZE_15.csv', index_col = 0)
bronze20 = pd.read_csv('data/timeline_BRONZE_20.csv', index_col = 0)
bronze25 = pd.read_csv('data/timeline_BRONZE_25.csv', index_col = 0)
bronze30 = pd.read_csv('data/timeline_BRONZE_30.csv', index_col = 0)
fig, ax = plt.subplots(4, 3, constrained_layout = True, figsize = (12, 10))
averageGoldMetrics = pd.DataFrame(columns=["subfigure","bronze mean","diamond mean"],index=['0','1','2','3','4','5','6','7','8','9','10','11'])
# Gold histograms (15 min)
ax[0,0].set_title('Blue gold (15 min.)')
ax[0,0].hist(diamond15['blue_gold'], alpha=0.5, label='diamond')
ax[0,0].hist(bronze15['blue_gold'], alpha=0.5, label='bronze')
ax[0,0].legend()
averageGoldMetrics.iloc[0,0] = "Blue gold (15 min.)"
averageGoldMetrics.iloc[0,1] = np.round(st.mean(bronze15["blue_gold"]),3)
averageGoldMetrics.iloc[0,2] = np.round(st.mean(diamond15["blue_gold"]),3)
ax[0,1].set_title('Red gold (15 min.)')
ax[0,1].hist(diamond15['red_gold'], alpha=0.5, label='diamond')
ax[0,1].hist(bronze15['red_gold'], alpha=0.5, label='bronze')
ax[0,1].legend()
averageGoldMetrics.iloc[1,0] = "Red gold (15 min.)"
averageGoldMetrics.iloc[1,1] = np.round(st.mean(bronze15["red_gold"]),3)
averageGoldMetrics.iloc[1,2] = np.round(st.mean(diamond15["red_gold"]),3)
ax[0,2].set_title('Gold diff (15 min.)')
ax[0,2].hist(diamond15['gold_diff'], alpha=0.5, label='diamond')
ax[0,2].hist(bronze15['gold_diff'], alpha=0.5, label='bronze')
ax[0,2].legend()
averageGoldMetrics.iloc[2,0] = "Gold diff (15 min.)"
averageGoldMetrics.iloc[2,1] = np.round(st.mean(bronze15["gold_diff"]),3)
averageGoldMetrics.iloc[2,2] = np.round(st.mean(diamond15["gold_diff"]),3)
# Gold histograms (20 min)
ax[1,0].set_title('Blue gold (20 min.)')
ax[1,0].hist(diamond20['blue_gold'], alpha=0.5, label='diamond')
ax[1,0].hist(bronze20['blue_gold'], alpha=0.5, label='bronze')
ax[1,0].legend()
averageGoldMetrics.iloc[3,0] = "Blue gold (20 min.)"
averageGoldMetrics.iloc[3,1] = np.round(st.mean(bronze20["blue_gold"]),3)
averageGoldMetrics.iloc[3,2] = np.round(st.mean(diamond20["blue_gold"]),3)
ax[1,1].set_title('Red gold (20 min.)')
ax[1,1].hist(diamond20['red_gold'], alpha=0.5, label='diamond')
ax[1,1].hist(bronze20['red_gold'], alpha=0.5, label='bronze')
ax[1,1].legend()
averageGoldMetrics.iloc[4,0] = "Red gold (20 min.)"
averageGoldMetrics.iloc[4,1] = np.round(st.mean(bronze20["red_gold"]),3)
averageGoldMetrics.iloc[4,2] = np.round(st.mean(diamond20["red_gold"]),3)
ax[1,2].set_title('Gold diff (20 min.)')
ax[1,2].hist(diamond20['gold_diff'], alpha=0.5, label='diamond')
ax[1,2].hist(bronze20['gold_diff'], alpha=0.5, label='bronze')
ax[1,2].legend()
averageGoldMetrics.iloc[5,0] = "Gold diff (20 min.)"
averageGoldMetrics.iloc[5,1] = np.round(st.mean(bronze20["gold_diff"]),3)
averageGoldMetrics.iloc[5,2] = np.round(st.mean(diamond20["gold_diff"]),3)
# Gold histograms (25 min)
ax[2,0].set_title('Blue gold (25 min.)')
ax[2,0].hist(diamond25['blue_gold'], alpha=0.5, label='diamond')
ax[2,0].hist(bronze25['blue_gold'], alpha=0.5, label='bronze')
ax[2,0].legend()
averageGoldMetrics.iloc[6,0] = "Blue gold (25 min.)"
averageGoldMetrics.iloc[6,1] = np.round(st.mean(bronze25["blue_gold"]),3)
averageGoldMetrics.iloc[6,2] = np.round(st.mean(diamond25["blue_gold"]),3)
ax[2,1].set_title('Red gold (25 min.)')
ax[2,1].hist(diamond25['red_gold'], alpha=0.5, label='diamond')
ax[2,1].hist(bronze25['red_gold'], alpha=0.5, label='bronze')
ax[2,1].legend()
averageGoldMetrics.iloc[7,0] = "Red gold (25 min.)"
averageGoldMetrics.iloc[7,1] = np.round(st.mean(bronze25["red_gold"]),3)
averageGoldMetrics.iloc[7,2] = np.round(st.mean(diamond25["red_gold"]),3)
ax[2,2].set_title('Gold diff (25 min.)')
ax[2,2].hist(diamond25['gold_diff'], alpha=0.5, label='diamond')
ax[2,2].hist(bronze25['gold_diff'], alpha=0.5, label='bronze')
ax[2,2].legend()
averageGoldMetrics.iloc[8,0] = "Gold diff (25 min.)"
averageGoldMetrics.iloc[8,1] = np.round(st.mean(bronze25["gold_diff"]),3)
averageGoldMetrics.iloc[8,2] = np.round(st.mean(diamond25["gold_diff"]),3)
# Gold histograms (30 min)
ax[3,0].set_title('Blue gold (30 min.)')
ax[3,0].hist(diamond30['blue_gold'], alpha=0.5, label='diamond')
ax[3,0].hist(bronze30['blue_gold'], alpha=0.5, label='bronze')
ax[3,0].legend()
averageGoldMetrics.iloc[9,0] = "Blue gold (30 min.)"
averageGoldMetrics.iloc[9,1] = np.round(st.mean(bronze30["blue_gold"]),3)
averageGoldMetrics.iloc[9,2] = np.round(st.mean(diamond30["blue_gold"]),3)
ax[3,1].set_title('Red gold (30 min.)')
ax[3,1].hist(diamond30['red_gold'], alpha=0.5, label='diamond')
ax[3,1].hist(bronze30['red_gold'], alpha=0.5, label='bronze')
ax[3,1].legend()
averageGoldMetrics.iloc[10,0] = "Red gold (30 min.)"
averageGoldMetrics.iloc[10,1] = np.round(st.mean(bronze30["red_gold"]),3)
averageGoldMetrics.iloc[10,2] = np.round(st.mean(diamond30["red_gold"]),3)
ax[3,2].set_title('Gold diff (30 min.)')
ax[3,2].hist(diamond30['gold_diff'], alpha=0.5, label='diamond')
ax[3,2].hist(bronze30['gold_diff'], alpha=0.5, label='bronze')
ax[3,2].legend()
averageGoldMetrics.iloc[11,0] = "Gold diff (30 min.)"
averageGoldMetrics.iloc[11,1] = np.round(st.mean(bronze30["gold_diff"]),3)
averageGoldMetrics.iloc[11,2] = np.round(st.mean(diamond30["gold_diff"]),3)
plt.show()
# create table
averageGoldMetrics
Most of the histograms show a fairly consistent relationship between bronze and diamond players: diamond players average more gold regardless of the time point or team color. Interestingly, the gap between bronze and diamond gold appears to shrink the longer the game goes on. The blue team seems to gather slightly more gold at the shorter durations, but this is within the noise; for example, the bronze means at 15 minutes are within about 100 gold of each other for blue and red (24741 and 24619 respectively). The sketch below quantifies how the rank gap changes over time.
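To make the "shrinking gap" observation concrete, the sketch below (reusing the dataframes loaded above) expresses the diamond-minus-bronze difference in mean blue gold as a percentage of the bronze mean at each time point.
# how the diamond-vs-bronze blue gold gap changes over time, as a percent of the bronze mean
timePairs = [(15, diamond15, bronze15), (20, diamond20, bronze20),
             (25, diamond25, bronze25), (30, diamond30, bronze30)]
for minutes, dia, bro in timePairs:
    gap = dia["blue_gold"].mean() - bro["blue_gold"].mean()
    print(f"{minutes} min: diamond leads by {gap:.0f} gold ({100 * gap / bro['blue_gold'].mean():.1f}%)")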
Create 8 subfigures in a 2x4 grid (done for you). Fill in the provided template with the appropriate variables.
The column names (variables) you will need are: blue_win, blue_gold, and gold_diff
In each subfigure plot two histograms: one for diamond players and one for bronze players only in the cases where blue team won (blue_win == 1). Make each histogram transparent (alpha = 0.5) since they will overlap. Label the histograms.
Create a table that clearly displays the mean of each histogram for each of the eight subfigures. The table should be 8x3: each row is a subfigure, column 1 contains the title of the subfigure, column 2 contains the bronze mean, and column 3 contains the diamond mean.
diamond15BlueWin = diamond15[diamond15['blue_win']==1]
diamond20BlueWin = diamond20[diamond20['blue_win']==1]
diamond25BlueWin = diamond25[diamond25['blue_win']==1]
diamond30BlueWin = diamond30[diamond30['blue_win']==1]
bronze15BlueWin = bronze15[bronze15['blue_win']==1]
bronze20BlueWin = bronze20[bronze20['blue_win']==1]
bronze25BlueWin = bronze25[bronze25['blue_win']==1]
bronze30BlueWin = bronze30[bronze30['blue_win']==1]
fig, ax = plt.subplots(2, 4, constrained_layout = True, figsize = (14, 5))
averageBlueWinGoldMetrics = pd.DataFrame(columns=["subfigure","bronze mean","diamond mean"],index=['0','1','2','3','4','5','6','7'])
# blue gold distribution (winners only)
ax[0,0].set_title('Winning Blue gold (15 min)')
ax[0,0].hist(diamond15BlueWin['blue_gold'], alpha=0.5, label='diamond')
ax[0,0].hist(bronze15BlueWin['blue_gold'], alpha=0.5, label='bronze')
ax[0,0].legend()
averageBlueWinGoldMetrics.iloc[0,0] = "Winning Blue gold (15 min)"
averageBlueWinGoldMetrics.iloc[0,1] = np.round(st.mean(bronze15BlueWin["blue_gold"]),3)
averageBlueWinGoldMetrics.iloc[0,2] = np.round(st.mean(diamond15BlueWin["blue_gold"]),3)
ax[0,1].set_title('Winning Blue gold (20 min)')
ax[0,1].hist(diamond20BlueWin['blue_gold'], alpha=0.5, label='diamond')
ax[0,1].hist(bronze20BlueWin['blue_gold'], alpha=0.5, label='bronze')
ax[0,1].legend()
averageBlueWinGoldMetrics.iloc[1,0] = "Winning Blue gold (20 min)"
averageBlueWinGoldMetrics.iloc[1,1] = np.round(st.mean(bronze20BlueWin["blue_gold"]),3)
averageBlueWinGoldMetrics.iloc[1,2] = np.round(st.mean(diamond20BlueWin["blue_gold"]),3)
ax[0,2].set_title('Winning Blue gold (25 min)')
ax[0,2].hist(diamond25BlueWin['blue_gold'], alpha=0.5, label='diamond')
ax[0,2].hist(bronze25BlueWin['blue_gold'], alpha=0.5, label='bronze')
ax[0,2].legend()
averageBlueWinGoldMetrics.iloc[2,0] = "Winning Blue gold (25 min)"
averageBlueWinGoldMetrics.iloc[2,1] = np.round(st.mean(bronze25BlueWin["blue_gold"]),3)
averageBlueWinGoldMetrics.iloc[2,2] = np.round(st.mean(diamond25BlueWin["blue_gold"]),3)
ax[0,3].set_title('Winning Blue gold (30 min)')
ax[0,3].hist(diamond30BlueWin['blue_gold'], alpha=0.5, label='diamond')
ax[0,3].hist(bronze30BlueWin['blue_gold'], alpha=0.5, label='bronze')
ax[0,3].legend()
averageBlueWinGoldMetrics.iloc[3,0] = "Winning Blue gold (30 min)"
averageBlueWinGoldMetrics.iloc[3,1] = np.round(st.mean(bronze30BlueWin["blue_gold"]),3)
averageBlueWinGoldMetrics.iloc[3,2] = np.round(st.mean(diamond30BlueWin["blue_gold"]),3)
# gold diff distribution (winners only)
ax[1,0].set_title('Winning Gold diff (15 min)')
ax[1,0].hist(diamond15BlueWin['gold_diff'], alpha=0.5, label='diamond')
ax[1,0].hist(bronze15BlueWin['gold_diff'], alpha=0.5, label='bronze')
ax[1,0].legend()
averageBlueWinGoldMetrics.iloc[4,0] = "Winning Gold diff (15 min)"
averageBlueWinGoldMetrics.iloc[4,1] = np.round(st.mean(bronze15BlueWin["gold_diff"]),3)
averageBlueWinGoldMetrics.iloc[4,2] = np.round(st.mean(diamond15BlueWin["gold_diff"]),3)
ax[1,1].set_title('Winning Gold diff (20 min)')
ax[1,1].hist(diamond20BlueWin['gold_diff'], alpha=0.5, label='diamond')
ax[1,1].hist(bronze20BlueWin['gold_diff'], alpha=0.5, label='bronze')
ax[1,1].legend()
averageBlueWinGoldMetrics.iloc[5,0] = "Winning Gold diff (20 min)"
averageBlueWinGoldMetrics.iloc[5,1] = np.round(st.mean(bronze20BlueWin["gold_diff"]),3)
averageBlueWinGoldMetrics.iloc[5,2] = np.round(st.mean(diamond20BlueWin["gold_diff"]),3)
ax[1,2].set_title('Winning Gold diff (25 min)')
ax[1,2].hist(diamond25BlueWin['gold_diff'], alpha=0.5, label='diamond')
ax[1,2].hist(bronze25BlueWin['gold_diff'], alpha=0.5, label='bronze')
ax[1,2].legend()
averageBlueWinGoldMetrics.iloc[6,0] = "Winning Gold diff (25 min)"
averageBlueWinGoldMetrics.iloc[6,1] = np.round(st.mean(bronze25BlueWin["gold_diff"]),3)
averageBlueWinGoldMetrics.iloc[6,2] = np.round(st.mean(diamond25BlueWin["gold_diff"]),3)
ax[1,3].set_title('Winning Gold diff (30 min)')
ax[1,3].hist(diamond30BlueWin['gold_diff'], alpha=0.5, label='diamond')
ax[1,3].hist(bronze30BlueWin['gold_diff'], alpha=0.5, label='bronze')
ax[1,3].legend()
averageBlueWinGoldMetrics.iloc[7,0] = "Winning Gold diff (30 min)"
averageBlueWinGoldMetrics.iloc[7,1] = np.round(st.mean(bronze30BlueWin["gold_diff"]),3)
averageBlueWinGoldMetrics.iloc[7,2] = np.round(st.mean(diamond30BlueWin["gold_diff"]),3)
plt.show()
# create table
averageBlueWinGoldMetrics
Once again, diamond players collect more gold on average, even though here we are only looking at games the blue team won. Interestingly, the rank gap in gold_diff behaves differently from the total-gold gap seen before: it is already small, with the mean Winning Gold diff (20 min) at 3776.497 for bronze versus 3807.515 for diamond, hardly any difference at all.
For each stage of the game (15, 20, 25, 30 minutes) and for both diamond and bronze players compute the win percentage of the blue team (fraction of times that blue_win == 1).
Plot two lines (label them) indicating the percent of the time blue wins. X-axis is match time (15, 20, 25, 30 minutes) and y-axis is win percentage of blue team.
One line shows the win percentage of blue for diamond players
The other line shows the win percentage of blue for bronze players
Label the axes and title the plot appropriately.
diamondBlueWinRate = [0] * 4
diamondBlueWinRate[0] = np.round(len(diamond15BlueWin)/len(diamond15)*100,2)
diamondBlueWinRate[1] = np.round(len(diamond20BlueWin)/len(diamond20)*100,2)
diamondBlueWinRate[2] = np.round(len(diamond25BlueWin)/len(diamond25)*100,2)
diamondBlueWinRate[3] = np.round(len(diamond30BlueWin)/len(diamond30)*100,2)
bronzeBlueWinRate = [0] * 4
bronzeBlueWinRate[0] = np.round(len(bronze15BlueWin)/len(bronze15)*100,2)
bronzeBlueWinRate[1] = np.round(len(bronze20BlueWin)/len(bronze20)*100,2)
bronzeBlueWinRate[2] = np.round(len(bronze25BlueWin)/len(bronze25)*100,2)
bronzeBlueWinRate[3] = np.round(len(bronze30BlueWin)/len(bronze30)*100,2)
matchTimes = [15,20,25,30]
# plot goes here
plt.plot(matchTimes,diamondBlueWinRate, label='diamond')
plt.plot(matchTimes,bronzeBlueWinRate, label='bronze')
plt.title("Blue Team Win %")
plt.xlabel("Match Times")
plt.ylabel("Blue Win %")
plt.xticks([15,20,25,30])
plt.legend()
plt.show()
Visually, the bronze and diamond win percentages follow a similar shape; both decrease by a percentage point or so over time. In bronze, the blue team has a slight advantage that shrinks in longer matches. In diamond, blue also wins slightly more often in shorter matches but falls behind red in the longest ones. Blue's edge in bronze could simply be random noise, or perhaps something like color perception (seeing red supposedly makes players more aggressive) matters more at lower ranks; the rough check below gives a sense of how large the noise is.
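To get a feel for how much of blue's edge could be noise, the sketch below computes a normal-approximation 95% interval for the blue win proportion at 15 minutes (it assumes the matches are independent, which is an approximation).
# rough 95% interval for blue's win rate at 15 minutes (normal approximation, independent matches assumed)
for name, df in [("bronze15", bronze15), ("diamond15", diamond15)]:
    p = df["blue_win"].mean()                 # observed blue win proportion
    se = np.sqrt(p * (1 - p) / len(df))       # standard error of a proportion
    print(f"{name}: blue wins {100 * p:.1f}% +/- {100 * 1.96 * se:.1f}%")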
Now we want to further investigate how bronze and diamond players differ. Since bronze and diamond players have different skill levels, we think their games might be played differently. For example, maybe "xp" and gold are more important to bronze players and "wards" are more important for diamond players.
Let's build some models to predict the winner of a match using the provided match information. We will investigate a few different questions:
Is it easier to predict the outcome of bronze or diamond league matches?
Do different features determine the winner between bronze and diamond league players?
Let's start with the diamond players. We want to classify whether the blue team will win (blue_win == 1) given match information like xp, gold, and wards at the 15, 20, 25, and 30 minute marks, i.e. we need 4 classification models.
Import the diamond player data (done for you)
Separate the target variable (blue_win) and the feature matrix (everything else) (done for 15 minutes data, you do the rest).
Fit any classification model you like to each dataset.
Make sure your model has an out-of-sample Brier score < 0.4 and an accuracy > 0.65 on the 15 minute data. We don't want to use bad models!
You need a variable importance measure, so maybe don't choose nearest neighbors.
For logistic regression use the absolute value of the coefficients as variable importance.
For decision trees or random forest use the feature_importance_ score.
diamond15 = pd.read_csv('data/timeline_DIAMOND_15.csv', index_col = 0)
diamond20 = pd.read_csv('data/timeline_DIAMOND_20.csv', index_col = 0)
diamond25 = pd.read_csv('data/timeline_DIAMOND_25.csv', index_col = 0)
diamond30 = pd.read_csv('data/timeline_DIAMOND_30.csv', index_col = 0)
# 15 minutes
x15 = diamond15.drop(['blue_win'], axis=1)
y15 = diamond15.loc[:,['blue_win']]
x15_train, x15_test, y15_train, y15_test = train_test_split(x15, y15, test_size=0.33, random_state=42)
x15_train = np.array(x15_train)
y15_train = np.array(y15_train)
x15_test = np.array(x15_test)
y15_test = np.array(y15_test)
# recommend keeping a consistent naming scheme
# 20 minutes
x20 = diamond20.drop(['blue_win'], axis=1)
y20 = diamond20.loc[:,['blue_win']]
x20_train, x20_test, y20_train, y20_test = train_test_split(x20, y20, test_size=0.33, random_state=42)
x20_train = np.array(x20_train)
y20_train = np.array(y20_train)
x20_test = np.array(x20_test)
y20_test = np.array(y20_test)
# 25 minutes
x25 = diamond25.drop(['blue_win'], axis=1)
y25 = diamond25.loc[:,['blue_win']]
x25_train, x25_test, y25_train, y25_test = train_test_split(x25, y25, test_size=0.33, random_state=42)
x25_train = np.array(x25_train)
y25_train = np.array(y25_train)
x25_test = np.array(x25_test)
y25_test = np.array(y25_test)
# 30 minutes
x30 = diamond30.drop(['blue_win'], axis=1)
y30 = diamond30.loc[:,['blue_win']]
x30_train, x30_test, y30_train, y30_test = train_test_split(x30, y30, test_size=0.33, random_state=42)
x30_train = np.array(x30_train)
y30_train = np.array(y30_train)
x30_test = np.array(x30_test)
y30_test = np.array(y30_test)
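The four nearly identical blocks above could be collapsed into a small helper. This is only a sketch (it assumes every timeline dataframe has a blue_win column and keeps the same test_size and random_state); the longhand version above is what the rest of the notebook uses.
# optional helper: split any timeline dataframe into train/test arrays with the settings used above
def split_timeline(df, target="blue_win", test_size=0.33, random_state=42):
    X = df.drop(columns=[target])
    y = df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return np.array(X_tr), np.array(X_te), np.array(y_tr), np.array(y_te)

# example usage: x15_train, x15_test, y15_train, y15_test = split_timeline(diamond15)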
# define and fit models
from sklearn.metrics import accuracy_score
# 15 minutes
lda15 = LDA()
lda15.fit(x15_train, y15_train)
# 20 minutes
lda20 = LDA()
lda20.fit(x20_train, y20_train)
# 25 minutes
lda25 = LDA()
lda25.fit(x25_train, y25_train)
# 30 minutes
lda30 = LDA()
lda30.fit(x30_train, y30_train)
y15_hat = lda15.predict_proba(x15_test)
lda15Brier = np.round(brier_score(y15_test, y15_hat),3)
print("diamond15 LDA Brier Score:", lda15Brier)
p15_hat = lda15.predict(x15_test)
lda15Acc = np.round(accuracy_score(y15_test,p15_hat),3)
print("diamond15 LDA Accuracy Score:", lda15Acc)
diamond15 LDA Brier Score: 0.322
diamond15 LDA Accuracy Score: 0.761
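A quick assertion (a sketch against the thresholds stated earlier) makes the "no bad models" requirement explicit; it would raise an error if the 15 minute model were too weak.
# sanity check: the 15-minute model must clear the required thresholds (Brier < 0.4, accuracy > 0.65)
assert lda15Brier < 0.4 and lda15Acc > 0.65, "15-minute diamond model does not meet the Brier/accuracy requirements"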
Compute and print the Brier score and accuracy of each model (round to 3 decimal places).
Display the brier score and accuracy in a single table. The table should be 4x3. Each row is a time point (15, 20, 25, or 30 minutes). Column 1 is the time point as a string, column 2 is the brier score, column 3 is the accuracy. For example row 1 might look like 15 Minutes, 0.351, 0.825.
We might naively expect that it's easier to predict the winner the longer the match goes on. Comment on: Which time period is the hardest to predict? Which time period is the easiest to predict? Do matches become more predictable (better scores) over time? Use the computed Brier and accuracy scores to inform your response.
# predict if blue wins
y15_hat = lda15.predict_proba(x15_test)
lda15Brier = np.round(brier_score(y15_test, y15_hat),3)
print("diamond15 LDA Brier Score:", lda15Brier)
y20_hat = lda20.predict_proba(x20_test)
lda20Brier = np.round(brier_score(y20_test, y20_hat),3)
print("diamond20 LDA Brier Score:", lda20Brier)
y25_hat = lda25.predict_proba(x25_test)
lda25Brier = np.round(brier_score(y25_test, y25_hat),3)
print("diamond25 LDA Brier Score:", lda25Brier)
y30_hat = lda30.predict_proba(x30_test)
lda30Brier = np.round(brier_score(y30_test, y30_hat),3)
print("diamond30 LDA Brier Score:", lda30Brier,'\n')
# predict probabilities of blue winning and losing
p15_hat = lda15.predict(x15_test)
lda15Acc = np.round(accuracy_score(y15_test,p15_hat),3)
print("diamond15 LDA Accuracy Score:", lda15Acc)
p20_hat = lda20.predict(x20_test)
lda20Acc = np.round(accuracy_score(y20_test,p20_hat),3)
print("diamond20 LDA Accuracy Score:", lda20Acc)
p25_hat = lda25.predict(x25_test)
lda25Acc = np.round(accuracy_score(y25_test,p25_hat),3)
print("diamond25 LDA Accuracy Score:", lda25Acc)
p30_hat = lda30.predict(x30_test)
lda30Acc = np.round(accuracy_score(y30_test,p30_hat),3)
print("diamond30 LDA Accuracy Score:", lda30Acc)
# create brier and accuracy table
brierAccScores = pd.DataFrame(columns=["Time Point","Brier Score","Accuracy"],index=['0','1','2','3'])
brierAccScores.iloc[0,0] = "15 minutes"
brierAccScores.iloc[1,0] = "20 minutes"
brierAccScores.iloc[2,0] = "25 minutes"
brierAccScores.iloc[3,0] = "30 minutes"
brierAccScores.iloc[0,1] = lda15Brier
brierAccScores.iloc[1,1] = lda20Brier
brierAccScores.iloc[2,1] = lda25Brier
brierAccScores.iloc[3,1] = lda30Brier
brierAccScores.iloc[0,2] = lda15Acc
brierAccScores.iloc[1,2] = lda20Acc
brierAccScores.iloc[2,2] = lda25Acc
brierAccScores.iloc[3,2] = lda30Acc
brierAccScores
diamond15 LDA Brier Score: 0.322
diamond20 LDA Brier Score: 0.276
diamond25 LDA Brier Score: 0.245
diamond30 LDA Brier Score: 0.267
diamond15 LDA Accuracy Score: 0.761
diamond20 LDA Accuracy Score: 0.8
diamond25 LDA Accuracy Score: 0.826
diamond30 LDA Accuracy Score: 0.809
According to the scores, the hardest time period to predict is 15 minutes, which has the highest Brier score and lowest accuracy. The 25 minute games are the easiest to predict, with the best scores on both measures. Matches tend to become easier to predict over time, with the exception of 30 minute games, which are slightly harder to predict than 25 minute games.
Now plot the ROC curve for each model in a single figure. Make sure each line is appropriately labeled.
Create a single ROC curve plot
Compute and print the AUC values for each model.
# compute ROC curves and AUC values
fpr_lda15, tpr_lda15, thresholds = roc_curve(y15_test, y15_hat[:,1])
fpr_lda20, tpr_lda20, thresholds = roc_curve(y20_test, y20_hat[:,1])
fpr_lda25, tpr_lda25, thresholds = roc_curve(y25_test, y25_hat[:,1])
fpr_lda30, tpr_lda30, thresholds = roc_curve(y30_test, y30_hat[:,1])
print('15 Minutes:', np.round(roc_auc_score(y15_test, y15_hat[:,1]), 3))
print('20 Minutes:', np.round(roc_auc_score(y20_test, y20_hat[:,1]), 3))
print('25 Minutes:', np.round(roc_auc_score(y25_test, y25_hat[:,1]), 3))
print('30 Minutes:', np.round(roc_auc_score(y30_test, y30_hat[:,1]), 3))
# plot ROC curves
plt.plot(fpr_lda15, tpr_lda15, label = '15')
plt.plot(fpr_lda20, tpr_lda20, label = '20')
plt.plot(fpr_lda25, tpr_lda25, label = '25')
plt.plot(fpr_lda30, tpr_lda30, label = '30')
plt.title("ROC Curves")
plt.xlabel('False Positive Rate', fontsize = 15)
plt.ylabel('True Positive Rate', fontsize = 15)
plt.legend()
plt.show()
15 Minutes: 0.844
20 Minutes: 0.885
25 Minutes: 0.909
30 Minutes: 0.892
Based on the AUC values, 15 minutes is the hardest time point to predict while 25 minutes is the easiest. This agrees with the Brier and accuracy scores above: matches become more predictable over time, apart from a slight dip at the 30 minute mark.
Print the feature importance of each feature in a table format. Each row should include the feature name and the importance score for each model. Sort this table by the feature importances for the 15 minute mark model.
If you used logistic regression use the coefficients (coefs_) as the importance measure
If you used decision trees or random forests use the feature importance score (feature_importances_) as the importance measure
Comment on: What are the top 5 most important features for predicting the winner of the game at the 15, 20, 25, and 30 minute marks of the match. Are these variables the same? Do any features become more or less important over time? Briefly argue these points, a simple "yes" or "no" is insufficient.
# create table here
featureImportance = pd.DataFrame(columns=["Feature Name", "15 - Importance", "20 - Importance", "25 - Importance", "30 - Importance"])
featureImportance["Feature Name"] = x15.columns
featureImportance["15 - Importance"] = abs(lda15.coef_[0])
featureImportance["20 - Importance"] = abs(lda20.coef_[0])
featureImportance["25 - Importance"] = abs(lda25.coef_[0])
featureImportance["30 - Importance"] = abs(lda30.coef_[0])
featureImportance = featureImportance.sort_values('15 - Importance', ascending=False)  # most important features first
featureImportance
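To read off the top five features per time point programmatically (a small sketch over the table just built), nlargest can be applied to each importance column.
# print the five largest importance values for each model
for col in ["15 - Importance", "20 - Importance", "25 - Importance", "30 - Importance"]:
    print(f"\nTop 5 features at {col.split(' ')[0]} minutes:")
    print(featureImportance.nlargest(5, col)[["Feature Name", col]].to_string(index=False))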
The 5 most important features for a 15 minute game are:
first_inhibitor, blue_hextech, first_turret, red_earth, and red_fire.
For the 20 minute game:
blue_inhibitors, first_inhibitor, inhibtor_diff, red_inhibitors, and red_earth
25 minute game:
first_inhibitor, water, red_water, red_earth, and blue_earth
And lastly for the 30 minute time period:
first_inhibitor, blue_earth, red_water, hextech, blue_fire
These features are not the same across every model: first_inhibitor and the earth features are consistently in the top 5, while features like hextech and the turret/inhibitor counts move in and out of the top 5 depending on the time point.
Now let's build models for the bronze players. Use the same type of model as you did for the diamond players so that the results are comparable; by "same" I mean that if you used logistic regression before, use it again. We'll skip through a bit this time. You of course have to refit the models to the bronze data, so you should again have 4 models.
Compute the brier score and accuracy of each model (4 models in total) on the test set.
Display the brier score and accuracy in a single table. The table should be 4x3. Each row is a time point (15, 20, 25, or 30 minutes). Column 1 is the time point as a string, column 2 is the brier score, column 3 is the accuracy.
Comment on: Do matches get easier to predict over time? Are the Brier scores for bronze players lower or higher than for diamond players on average, i.e. is it easier to predict the outcome of a bronze game or a diamond game?
bronze15 = pd.read_csv('data/timeline_BRONZE_15.csv', index_col = 0)
bronze20 = pd.read_csv('data/timeline_BRONZE_20.csv', index_col = 0)
bronze25 = pd.read_csv('data/timeline_BRONZE_25.csv', index_col = 0)
bronze30 = pd.read_csv('data/timeline_BRONZE_30.csv', index_col = 0)
# 15 minutes
x15 = bronze15.drop(['blue_win'], axis=1)
y15 = bronze15.loc[:,['blue_win']]
x15_train, x15_test, y15_train, y15_test = train_test_split(x15, y15, test_size=0.33, random_state=42)
x15_train = np.array(x15_train)
y15_train = np.array(y15_train)
x15_test = np.array(x15_test)
y15_test = np.array(y15_test)
# recommend keeping a consistent naming scheme
# 20 minutes
x20 = bronze20.drop(['blue_win'], axis=1)
y20 = bronze20.loc[:,['blue_win']]
x20_train, x20_test, y20_train, y20_test = train_test_split(x20, y20, test_size=0.33, random_state=42)
x20_train = np.array(x20_train)
y20_train = np.array(y20_train)
x20_test = np.array(x20_test)
y20_test = np.array(y20_test)
# 25 minutes
x25 = bronze25.drop(['blue_win'], axis=1)
y25 = bronze25.loc[:,['blue_win']]
x25_train, x25_test, y25_train, y25_test = train_test_split(x25, y25, test_size=0.33, random_state=42)
x25_train = np.array(x25_train)
y25_train = np.array(y25_train)
x25_test = np.array(x25_test)
y25_test = np.array(y25_test)
# 30 minutes
x30 = bronze30.drop(['blue_win'], axis=1)
y30 = bronze30.loc[:,['blue_win']]
x30_train, x30_test, y30_train, y30_test = train_test_split(x30, y30, test_size=0.33, random_state=42)
x30_train = np.array(x30_train)
y30_train = np.array(y30_train)
x30_test = np.array(x30_test)
y30_test = np.array(y30_test)
# define and fit models
# 15 minutes
lda15 = LDA()
lda15.fit(x15_train, y15_train)
# 20 minutes
lda20 = LDA()
lda20.fit(x20_train, y20_train)
# 25 minutes
lda25 = LDA()
lda25.fit(x25_train, y25_train)
# 30 minutes
lda30 = LDA()
lda30.fit(x30_train, y30_train)
# predict if blue wins
y15_hat = lda15.predict_proba(x15_test)
lda15Brier = np.round(brier_score(y15_test, y15_hat),3)
print("bronze15 LDA Brier Score:", lda15Brier)
y20_hat = lda20.predict_proba(x20_test)
lda20Brier = np.round(brier_score(y20_test, y20_hat),3)
print("bronze20 LDA Brier Score:", lda20Brier)
y25_hat = lda25.predict_proba(x25_test)
lda25Brier = np.round(brier_score(y25_test, y25_hat),3)
print("bronze25 LDA Brier Score:", lda25Brier)
y30_hat = lda30.predict_proba(x30_test)
lda30Brier = np.round(brier_score(y30_test, y30_hat),3)
print("bronze30 LDA Brier Score:", lda30Brier,'\n')
# predict probabilities of blue winning and losing
p15_hat = lda15.predict(x15_test)
lda15Acc = np.round(accuracy_score(y15_test,p15_hat),3)
print("bronze15 LDA Accuracy Score:", lda15Acc)
p20_hat = lda20.predict(x20_test)
lda20Acc = np.round(accuracy_score(y20_test,p20_hat),3)
print("bronze20 LDA Accuracy Score:", lda20Acc)
p25_hat = lda25.predict(x25_test)
lda25Acc = np.round(accuracy_score(y25_test,p25_hat),3)
print("bronze25 LDA Accuracy Score:", lda25Acc)
p30_hat = lda30.predict(x30_test)
lda30Acc = np.round(accuracy_score(y30_test,p30_hat),3)
print("bronze30 LDA Accuracy Score:", lda30Acc)
# create brier and accuracy table
brierAccScores = pd.DataFrame(columns=["Time Point","Brier Score","Accuracy"],index=['0','1','2','3'])
brierAccScores.iloc[0,0] = "15 minutes"
brierAccScores.iloc[1,0] = "20 minutes"
brierAccScores.iloc[2,0] = "25 minutes"
brierAccScores.iloc[3,0] = "30 minutes"
brierAccScores.iloc[0,1] = lda15Brier
brierAccScores.iloc[1,1] = lda20Brier
brierAccScores.iloc[2,1] = lda25Brier
brierAccScores.iloc[3,1] = lda30Brier
brierAccScores.iloc[0,2] = lda15Acc
brierAccScores.iloc[1,2] = lda20Acc
brierAccScores.iloc[2,2] = lda25Acc
brierAccScores.iloc[3,2] = lda30Acc
brierAccScores
bronze15 LDA Brier Score: 0.321
bronze20 LDA Brier Score: 0.277
bronze25 LDA Brier Score: 0.261
bronze30 LDA Brier Score: 0.254
bronze15 LDA Accuracy Score: 0.761
bronze20 LDA Accuracy Score: 0.801
bronze25 LDA Accuracy Score: 0.81
bronze30 LDA Accuracy Score: 0.816
The matches get easier to predict as the match time increases. The average test-set Brier scores for the two ranks, however, are nearly identical (both come out around 0.278; see the quick check below), so there is no clear evidence that a bronze game is easier or harder to predict than a diamond game.
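Using the printed test-set Brier scores (hard-coded here because the diamond variables were overwritten when the bronze models were fit), the averages for the two ranks come out almost identical.
# average test-set Brier score per rank, copied from the printed results above
diamondBrier = [0.322, 0.276, 0.245, 0.267]
bronzeBrier = [0.321, 0.277, 0.261, 0.254]
print("diamond mean Brier:", np.round(np.mean(diamondBrier), 3))  # ~0.278
print("bronze mean Brier:", np.round(np.mean(bronzeBrier), 3))    # ~0.278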
# create table here
featureImportance = pd.DataFrame(columns=["Feature Name", "15 - Importance", "20 - Importance", "25 - Importance", "30 - Importance"])
featureImportance["Feature Name"] = x15.columns
featureImportance["15 - Importance"] = abs(lda15.coef_[0])
featureImportance["20 - Importance"] = abs(lda20.coef_[0])
featureImportance["25 - Importance"] = abs(lda25.coef_[0])
featureImportance["30 - Importance"] = abs(lda30.coef_[0])
featureImportance = featureImportance.sort_values('15 - Importance', ascending=False)  # most important features first
featureImportance
The 5 most important features for a 15 minute game are:
first_inhibitor, red_inhibitors, inhibitors_diff, blue_inhibitors, red_heralds
For the 20 minute game:
first_inhibitor, red_inhibitors, inhibitors_diff, blue_inhibitors, red_fire
25 minute game:
earth, fire, first_inhibitor, blue_fire, hextech
And lastly for the 30 minute time period:
fire, first_inhibitor, earth, water, air
These features are not the same across every model: first_inhibitor is consistently in the top 5, while the individual inhibitor-count features fade out at the later time points.
Some of the important features are shared between diamond and bronze players, but there are differences: the earth and fire features play a larger role for bronze players, while the inhibitor features matter for both ranks. The clearest similarity is that first_inhibitor is prominent for both ranks. A sketch comparing the 15 minute importances side by side follows.
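Because the diamond models and their importance table were overwritten when the bronze models were fit, a direct side-by-side comparison needs one refit. The sketch below refits the diamond 15-minute LDA (assuming diamond15 is still loaded and its columns match the bronze data, as checked earlier) and lines up the absolute coefficients of the two ranks.
# side-by-side 15-minute importances: refit the diamond model, reuse the bronze one (lda15)
xd = diamond15.drop(columns=["blue_win"])
yd = diamond15["blue_win"]
xd_train, xd_test, yd_train, yd_test = train_test_split(xd, yd, test_size=0.33, random_state=42)
ldaDiamond15 = LDA().fit(xd_train, yd_train)

compare15 = pd.DataFrame({
    "Feature Name": xd.columns,
    "diamond |coef|": np.abs(ldaDiamond15.coef_[0]),
    "bronze |coef|": np.abs(lda15.coef_[0]),  # lda15 currently holds the bronze 15-minute model
})
compare15.sort_values("diamond |coef|", ascending=False).head(10)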