Why matchmaking quality is often poor (x-post from r/OverwatchUniversity)

TL;DR:

  • Measuring the skill of individual players in 6v6 solo queue is a very hard problem.
  • With detailed simulation, match quality can be quantified for a given rating system and player base.
  • For realistic simulations, match quality is often poor.
  • If you take issue with this but don’t have the background to argue it, I suggest starting with the simulation code.


It is generally accepted that solo-queue Overwatch matches can be frustrating, and the matchmaker is often blamed. I have previously described (1) many sources of matchmaking problems, but that discussion was qualitative. Here I take a quantitative look, with detailed modelling. This gives better insight into what is fundamentally a measurement problem: how well can each player’s skill be measured, and what match quality can be delivered given those skill measurements?

These simulations use Microsoft’s TrueSkill, version 1, because it is well documented (2,3,4,5,6), designed to work with teams, and an open-source implementation is available (7,8). Whatever system Overwatch actually runs under the hood is probably quite similar to TrueSkill, but its documentation is far worse, and it certainly has to contend with many of the same problems. The simulations progress from simplified and optimistic to detailed and realistic. It turns out to be very hard to give players a good experience in 6v6 solo queue.



Number of players

The more players, the more accurate the simulation (unless you are deliberately simulating what happens in a given rank/region/platform at off-peak times, when few players are queueing), and the longer it takes to run.

Players per team

The more players per team, the harder it is to isolate each player’s individual contribution.


Beta

A measure of the size of a skill level: if player A’s skill is one beta above player B’s, A beats B about 75% of the time. For Overwatch, a beta of 100 seems about right, which corresponds to a 3100 team beating a 3000 team in roughly 75% of matches. A more precise figure would require data from Blizzard. Beta is naturally larger for games with more randomness (e.g. card games) and smaller for games more determined by skill (Tic-Tac-Toe > Connect Four > Chess > Go (9)).
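As a quick check on this interpretation of beta: if two skills are known exactly and each single-game performance is drawn as Normal(skill, beta), the win probability at a one-beta gap works out to about 76%, matching the roughly-75% figure above. A standard-library sketch (the function name is mine):

```python
from math import erf, sqrt

def win_probability(skill_a: float, skill_b: float, beta: float) -> float:
    """P(A beats B) when both skills are known exactly.
    performance_i ~ Normal(skill_i, beta), so the performance
    difference is Normal(skill_a - skill_b, sqrt(2) * beta)."""
    delta = skill_a - skill_b
    diff_sd = sqrt(2) * beta
    # Standard normal CDF via erf
    return 0.5 * (1.0 + erf(delta / (diff_sd * sqrt(2))))

# beta = 100: a 3100 player (or team) vs a 3000 player (or team)
p = win_probability(3100, 3000, beta=100)
print(round(p, 2))  # 0.76
```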


Mu

The system’s skill estimate, comparable to Overwatch’s SR/MMR. I start everyone at 2500 for simplicity (Overwatch actually appears to start new players around 2350 (10)). The range is approximately 1-5000.


Sigma

The measurement uncertainty in mu. It starts very large (833) because a new player’s skill is unknown, and shrinks as games are played.


Tau

Roughly, how much an established (not new) player’s mu can move in a single fair game. Tau is set to 24 (11).


Skill

While mu is the system’s estimate of a player’s ability, skill is the true value; it is used to decide the winner of each game in the simulation.


Inconsistency

Higher inconsistency means a player’s results vary more from game to game. Together with skill, this parameter is used to decide the winner of each game.
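Skill and inconsistency together decide each simulated match: a player’s single-game performance can be drawn as Normal(skill, inconsistency), and the higher performance wins. A minimal sketch of that idea (the function name and numbers are mine, not from the linked code):

```python
import random

def play_match(skill_a, incons_a, skill_b, incons_b, rng=random):
    """Return True if A wins. Each single-game performance is
    Normal(skill, inconsistency); the higher performance wins."""
    perf_a = rng.gauss(skill_a, incons_a)
    perf_b = rng.gauss(skill_b, incons_b)
    return perf_a > perf_b

# A consistent 2600 player vs a wildly inconsistent 2500 player:
# A usually wins, but far from always.
wins = sum(play_match(2600, 1.0, 2500, 100.0) for _ in range(10_000))
print(wins / 10_000)  # typically around 0.84
```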

Games per player per iteration

The number of games per player the simulator generates per simulation iteration. A value of 0.01 means that 1% of players (chosen at random) play a game each iteration.

Smurf rate

How often players re-roll accounts. At 0.001, a player re-rolls on average once every 1000 games. Re-rolls are drawn randomly, so any given player may do it more or less often.

Estimated match quality

For each prospective match, an estimated match quality is computed. Matches score higher when the two teams’ mus are close together and the sigmas are small relative to beta. In effect, it is a statement of the system’s confidence that the match is fair: it may not be fair if the mus are very different, and it might not be fair if the sigmas are large (because we don’t know the mus well).

Actual match quality

For each upcoming match, the actual match quality is also computed. We can only do this because it is a simulation, so we know each player’s true skill and inconsistency. Matches score higher when the two teams’ skills are close together and the inconsistencies are small relative to beta. In effect, it is a statement about how balanced the match’s outcome will be: it won’t be balanced if the skills are very different, and if player inconsistency is large the match will be less balanced, because one or a few inconsistent players playing unusually well or badly will blow the match open.
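Since the simulator knows true skills and inconsistencies, actual balance can also be checked by brute force: replay the same matchup many times and look at the win rate. A standard-library sketch with illustrative numbers (not the article’s):

```python
import random

def team_win_rate(skills_a, incons_a, skills_b, incons_b,
                  n=20_000, rng=None):
    """Fraction of n simulated matches won by team A. Each player's
    performance is Normal(skill, inconsistency); team totals decide."""
    rng = rng or random.Random(42)
    wins = 0
    for _ in range(n):
        total_a = sum(rng.gauss(s, i) for s, i in zip(skills_a, incons_a))
        total_b = sum(rng.gauss(s, i) for s, i in zip(skills_b, incons_b))
        wins += total_a > total_b
    return wins / n

# Evenly matched 6v6 teams, but one side has a wildly inconsistent player:
# balanced on average (win rate near 0.5), yet individual matches swing hard.
rate = team_win_rate([2500] * 6, [100] * 6, [2500] * 6, [100] * 5 + [400])
print(rate)  # close to 0.5
```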

Model description

The code is available (12). It is written in Python, and all required libraries, including the TrueSkill implementation, are available via pip (the Python package manager). To keep this post accessible, I won’t include any equations here; see the code and links for the mathematical details. The modelling procedure is:

1) Generate a skill and an inconsistency for each player by drawing from a given distribution. Sort players by skill and assign each a player number. The number of players and each player’s skill and inconsistency remain fixed for the whole simulation. Each player starts with the default mu and sigma; mu and sigma are updated as games are played.

2) For each iteration:

3) Draw the players who will play this iteration from the pool.

4) Sort the players by mu

5) Split the sorted players into groups of match size (12 for 6v6).

6) Divide each group randomly into two teams.

7) For each match, decide the winner by drawing each player’s performance from their skill and inconsistency.

8) Update the mus and sigmas using the TrueSkill algorithm.

9) Reset mu and sigma for a random handful of players to simulate smurfs.

10) Go to step 2.

Description of the animated figures

I’m going to show a lot of animated figures, and they all have the same layout, so let me describe in detail what’s going on. Open one of the videos below to follow along. On the left, each player’s mu and sigma are plotted for each game. (If you forget what a parameter like mu or sigma is, see the parameter section above.) At the start, the system has no idea of anyone’s rank, so every mu is 2500 and every sigma is large (833). Over time, the better players win and the worse players lose, players spread out toward their proper ranks, and sigma shrinks as the system pins down their abilities. The S-curve at the end of the video is a bell curve, just drawn on slightly different axes than usual. Partway through a video, you can read off the rating accuracy from the thickness of the S-curve.

The upper-right plot shows both the estimated match qualities (per the TrueSkill algorithm) and the actual ones (computed from each player’s true skill and inconsistency). TrueSkill is an unfortunate name for a rating system, so I will write it capitalized and as one word when referring to the rating system, as opposed to the skill simulation parameter. Solid lines: mean match quality. Dotted lines: +/- 34.1% of the population, which would be one sigma if the quality distributions were normal (they are not). These qualities start low, because matchmaking is initially effectively random, rise as players’ skills are learned and sorted, then level off as the algorithm converges.

The lower-right plot shows mu over time for five selected players, at the 10th, 30th, 50th, 70th, and 90th percentiles. With 12,000 players (as throughout this article), those are player numbers 1200, 3600, 6000, 8400, and 10800. They all start at the same rating (2500) and then filter toward their proper places, wandering randomly around their true rank as they win and lose.

Simulation 1: Simplified, possibly past the point of usefulness

Players: 12,000; players per team: 1; beta: 1; tau: 1; games per player per iteration: 1; skill: normal(2500, 833.3); inconsistency: 1.0; smurf rate: 0.0. The link to the album of plots is below; unfortunately, Reddit does not support embedded images and videos.

Player distribution

Skills fall on a bell curve, and inconsistency is 1 (extremely low) for every player. Beta is also 1, meaning a player with skill 2501 beats a player with skill 2500 75% of the time. The ratings over time are shown below. Click through to watch the video.


A few things stand out. Most players are sorted very accurately relatively quickly (~45 games), but some take far longer. Because tau is also 1, a player with many games on record moves only about 1 mu per game, so a player who gets unlucky opponents in their first few games can need up to 350 games before their rating is correct. This case somewhat breaks the TrueSkill system, but it is a useful reference point, because it shows that (given a game with 5000 meaningful skill levels) people can be rated extremely precisely. Even so, match quality is not very good: with beta this small, only a tiny skill difference is needed for one player to completely dominate the other, so most early games are stomps, and stomps remain scattered throughout.

Simulation 2: Realistic number of skill levels

Players: 12,000; players per team: 1; beta: 100; tau: 1; games per player per iteration: 1; skill: normal(2500, 833.3); inconsistency: 1.0; smurf rate: 0.0

Here, I set beta to 100 to be more realistic. Now a 2600 player (or team) beats a 2500 player (or team) 75% of the time. That’s about right for Overwatch.


Rating accuracy is still quite good (though not as extreme as before), match quality is much better (very close to 1 at its peak), and there is no silly case where someone takes forever to land in exactly the right place. However, most players now take somewhat longer to settle (~75 games). The main reason ranking takes so long is that tau is so low: once players have played enough games, their rating moves very slowly.

Simulation 3: Faster rating movement

Players: 12,000; players per team: 1; beta: 100; tau: 24; games per player per iteration: 1; skill: normal(2500, 833.3); inconsistency: 1.0; smurf rate: 0.0


I changed tau from 1 to 24 to allow faster rating movement. People now reach their ranks faster (~40 games), but match quality and rating accuracy both take a hit. Ratings are now accurate only to about +/- 150, versus +/- 25 before.

Simulation 4: Inconsistent players

Players: 12,000; players per team: 1; beta: 100; tau: 24; games per player per iteration: 1; skill: normal(2500, 833.3); inconsistency: 100.0; smurf rate: 0.0

Previously, inconsistency was 1, meaning a player’s performance varied by only about +/- 3 from game to game. We know that, for many reasons, this is not true of Overwatch. An inconsistency of 100 (a full performance swing of roughly +/- 300) is much more realistic.

Here is the new distribution plot, and the simulation:

Player distribution


Raising inconsistency mainly causes a large drop in actual match quality, from about 0.95 to 0.65. That makes sense: individual players now perform far above or below their skill in some matches, producing more stomps.

Simulation 5: Six players per team

Players: 12,000; players per team: 6; beta: 100; tau: 24; games per player per iteration: 1; skill: normal(2500, 833.3); inconsistency: 100.0; smurf rate: 0.0

Of course, it isn’t Overwatch with only one player per team, so let’s move to six players on each side.


Things got worse across the board. Rating accuracy is now only about +/- 300, it takes about 150 games for everyone to be sorted, and actual match quality is down to about 0.55. This is the crux of the article, and the core difficulty of rating individuals in Overwatch: a team match carries far less information about each individual than a 1v1 does, so far more data is needed to rate individuals from team games. More players per match also means more chances that something goes wrong for someone, making match quality poor.

As an aside: if it takes ~150 games to sort everyone out, and match quality is poor in the meantime, resetting ratings at the start of each season would be a terrible idea. A reset would wreck match quality for months (and permanently, if it happened every couple of months).

Simulation 6: Realistic number of games per iteration

Players: 12,000; players per team: 6; beta: 100; tau: 24; games per player per iteration: 0.01; skill: normal(2500, 833.3); inconsistency: 100.0; smurf rate: 0.0

So far, every player has played a game in every iteration of the simulator. That is unrealistic for a game you can log into and out of at any time. In this simulation, only 1% of players play a game in each iteration. They are chosen at random, so some players (by chance) go several iterations without playing a single game.


Very little changes, apart from a slight drop in rating accuracy (from +/- 300 to +/- 375).

Simulation 7: Smurfs

Players: 12,000; players per team: 6; beta: 100; tau: 24; games per player per iteration: 0.01; skill: normal(2500, 833.3); inconsistency: 100.0; smurf rate: 0.001

As we all know, there are smurfs in the game. Here, I randomly reset players’ ratings. At this rate, a player re-rolls on average once every thousand matches, but because it is random, some players re-roll more often and others less. After the reset, the player keeps playing normally, without being banned (13).


There is a slight drop in average match quality. The main difference is large swings in mu for players who have recently re-rolled. It takes 20 to 100 games for such players to climb back to their former rating. A particularly unlucky player can take longer (as has happened to me), but most appear to recover fairly quickly. Examples of recovering players are easiest to see in the percentile plot at the bottom right.

Simulation 8: Trolls

Players: 12,000; players per team: 6; beta: 100; tau: 24; games per player per iteration: 0.01; skill: normal(2500, 833.3); inconsistency: lognormal (see the distribution below); smurf rate: 0.001

Just one more thing: trolls.

It is not really true that every player has the same inconsistency. At the low end, we have players who stick to a small hero pool and never queue when tired or tilted. In the middle, we have typical players, who play more heroes across a wider range of moods and circumstances. Higher up, we have players who queue drunk or high, or who like to practise unfamiliar heroes in competitive, and so on. Finally, at the top end we have trolls, who do anything from jumping off the map to outright throwing, depending on how toxic they feel that day. I model this range of inconsistency with the log-normal distribution shown below, assigning each player a random draw from it.
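The long-tailed spread of inconsistency can be drawn from a log-normal distribution: most players land near a modest value, while a small tail gets troll-sized values. A standard-library sketch; the article’s exact lognormal parameters did not survive cleanly, so the numbers below are illustrative only:

```python
import random

rng = random.Random(1)

def draw_inconsistency(rng, mu_log=4.0, sigma_log=0.7):
    """Log-normal draw: exp(Normal(mu_log, sigma_log)).
    mu_log/sigma_log are illustrative, not the article's values."""
    return rng.lognormvariate(mu_log, sigma_log)

samples = sorted(draw_inconsistency(rng) for _ in range(10_000))
print(samples[5000])  # median near e**4 ~ 55: a typical, fairly steady player
print(samples[-10])   # tail: a handful of wildly inconsistent players
```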

Player distribution

And here is the simulation.


It’s actually not that bad, except for actual match quality (which you might have expected).

But let’s assess this simulation overall, since it is the most realistic one I will do in this article. Honestly, it’s pretty bad. Fewer than half of the matches have a quality above 0.5, a common threshold for an acceptable match. Many matches score below 0.2, which is stomp territory. Rating accuracy is only about +/- 400 for established accounts. And if everything were reset, it would take about 150 games just to climb back to this not-so-great state.

Summary table

The figures in the last four columns are approximate; I read them off the plots.

| Simulation | Description | Players | Players per team | Games per player per iteration | Beta | Tau | Inconsistency | Smurf rate | Rating accuracy | Games to stabilize | Final estimated match quality | Final actual match quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Simplified, possibly past the point of usefulness | 12000 | 1 | 1 | 1 | 1 | 1 | 0 | ±3 | 45/350 | 0.5 | 0.62 |
| 2 | Realistic number of skill levels | 12000 | 1 | 1 | 100 | 1 | 1 | 0 | ±25 | 75/300 | 0.97 | 0.98 |
| 3 | Faster rating movement | 12000 | 1 | 1 | 100 | 24 | 1 | 0 | ±150 | 40 | 0.8 | 0.95 |
| 4 | Inconsistent players | 12000 | 1 | 1 | 100 | 24 | 100 | 0 | ±250 | 40 | 0.8 | 0.65 |
| 5 | Six players per team | 12000 | 6 | 1 | 100 | 24 | 100 | 0 | ±300 | 150 | 0.55 | 0.6 |
| 6 | Realistic number of games per iteration | 12000 | 6 | 0.01 | 100 | 24 | 100 | 0 | ±375 | 150 | 0.55 | 0.6 |
| 7 | Smurfs | 12000 | 6 | 0.01 | 100 | 24 | 100 | 0.001 | ±400/1000 | 150 | 0.5 | 0.58 |
| 8 | Trolls | 12000 | 6 | 0.01 | 100 | 24 | lognormal | 0.001 | ±400/1000 | 150 | 0.5 | 0.48 |

Possible improvements

TrueSkill version 1 includes no individual-performance modifiers, and neither do I. If performance modifiers worked well, every metric would improve. That is a big assumption, though, especially in a game like Overwatch, where individual performance is hard to quantify and where performance modifiers can encourage behaviour outside team play and objectives, i.e. stat farming.

TrueSkill version 2 adds performance modifiers and some other possible improvements (6). I may simulate TrueSkill 2 in the future. However, there is no open implementation of TrueSkill 2 yet (it’s brand new), so I would have to write it myself.

There are other possible mechanical improvements, such as considering the record of the last N games, instead of only the most recent game, when computing new ratings.

Looking For Group helps reduce the randomness the matchmaker has to cope with, which improves match quality. New features in this direction could improve things further.

Taking that idea further: if Blizzard added a 6-stack queue in which teams, not individual players, carried an SR, the matchmaker would behave much more like the 1v1 simulations above than the 6v6 ones, and the simulations show how much better the system does at 1v1. The downside is that teams in a 6-stack queue could face long queue times if not enough teams use it.

Other options include a clan system, or weekly or monthly tournaments, for players for whom the open ladder is not a positive experience.

Blizzard could also do better at removing toxic players (throwers, boosters, ragers, hackers, and so on). Machine learning may help flag throwers and boosters, but ultimately every semi-credible report should be reviewed, with the investigator able to watch the match replay(s) from any player’s perspective. Blizzard should ban serial offenders.

Individual players can improve their own experience by using LFG or by playing with a regular group. They can also avoid queueing when drunk, tilted, tired, etc.


(1) How competitive skill rating works -> Matchmaking -> If the matchmaker thinks most matches are fair, why are there so many stomps?

(2) The TrueSkill rating system.

(3) Find the way

(4) The math behind TrueSkill.

(5) TrueSkill: A Bayesian skill rating system.

(6) TrueSkill 2: An improved Bayesian skill rating system.

(7) TrueSkill Python package (documentation).

(8) TrueSkill Python package (code).

(9) Go versus chess: comparison -> Ratings.

(10) Initial competitive rating, decoded.

(11) 3000+ skill ratings and data analysis (now including SR).

(12) Simulation code

(13) Opening a new account is not against the rules. Boosting or throwing is against the rules. – Jeff Kaplan
