Am I going to win the U.S. Open?

Estimating the probability of winning the US Open men's and women's tournaments with Julia code and a toy model. Also an increasingly large (as I write this) exposition of Bayesian methods.

(No.) Let's just look at a puzzle in time for the U.S. Open and see if we can't learn some statistics and, of course, use Julia!


I love The Riddler — FiveThirtyEight's weekly mathematical/statistical/logical puzzle column, edited by Oliver Roeder. He gave an awesome Talk at Google soon after I joined, promoting his excellent book in the style of Martin Gardner. Let me know if you're new to these puzzles or if you love them as much as I do! Also, if you have any related suggestions pass them along :)

I have not written up any of the articles that I thought of doing when I set this up ~2 years ago, but I remember how much fun I had writing up my solution to an especially fun puzzle. So, after seeing a cool new puzzle, I had to dive in!


What Are Your Chances Of Winning The U.S. Open?

Suppose you’re playing a match at the U.S. Open, and you’re slightly better than the competition: your chances of winning any given point are exactly 55 percent. What are your chances of winning a three-set match, as played by the women, or a five-set match, as played by the men? And what are your chances of winning the whole tournament (seven consecutive matches)?

Wissner-Gross, Z. https://fivethirtyeight.com/features/what-are-your-chances-of-winning-the-u-s-open/.

Note that this puzzle isn't sports analytics; we're just taking every point to have the same probability of being won, independent of serve, opponent, etc. The composition of a tennis match is as follows:

  • First to four points (win by 2) takes a game
  • First to six games (win by 2) takes a set
    • If a set is tied at 6-6, we go to a tiebreaker
    • A tiebreaker is first to 7 points (win by 2)
  • Men's matches are best of 5 / first to 3 sets
  • Women's matches are best of 3 / first to 2 sets
  • The US Open tournament bracket is won by 7 consecutive match wins

Initial thoughts

We want to estimate the following probabilities, under the assumption that the chance of you winning any point played in the tournament is 55%:

  1. Probability of winning a single women's match
  2. Probability of winning a single men's match
  3. Probability of winning the women's tournament
  4. Probability of winning the men's tournament

Where does the gender even come in? The difference is in the structure of matches: a men's match is a best-of-5-set contest, and a women's match is best-of-3.

So what sort of difference should this make? Well, we know that we have an advantage (NPI) over the competition, having a 55% chance of winning any point. By increasing the total number of points played in a men's match, we should reduce fluctuations in the match, and expect a higher probability of winning a men's match vs a women's match.
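
This can be sanity-checked in closed form with a stripped-down model: fix some per-set win probability $q$ (the 0.6 below is illustrative only, not derived from the point-level model) and compute the chance of winning a best-of-$n$ series from the binomial distribution:

# Chance of winning a best-of-n series with per-set win probability q:
# win a majority of n sets (equivalent to winning the series, if we
# imagine all n sets are always played out).
best_of(q, n) = sum(binomial(n, k) * q^k * (1 - q)^(n - k) for k in cld(n + 1, 2):n)

best_of(0.6, 3)  # ≈ 0.648
best_of(0.6, 5)  # ≈ 0.683: the longer series favors the better player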

Calculation

It might be possible to create a recurrence relation that can be marginalized to obtain a closed-form solution, but it's definitely easier to simulate. And in the long and glorious tradition (viz. the single other post) of this blog, we're going to use Julia!

First, a simple function to simulate a "win-by-2" condition:

function win_by(p, target, margin)
    trials = 0  # total number of trials played
    wins = 0    # our score minus the opponent's
    # Our score is (trials + wins)/2, so the leader reaches `target`
    # exactly when trials + abs(wins) hits 2*target; keep playing until
    # that happens AND the lead is at least `margin`.
    while trials + abs(wins) < 2*target || abs(wins) < margin
        wins += p() ? 1 : -1
        trials += 1
    end
    wins > 0
end

This function takes as arguments a function p, and two integers: the target number of total wins, and the winning margin. The function p should simulate a single trial. E.g., a single point.
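
For instance, a quick throwaway call simulating a single win-by-2 "game" made of fair coin flips (point_fair here is just an illustrative stand-in):

point_fair() = rand() < 0.5  # a 50/50 trial
win_by(point_fair, 4, 2)     # true if we reach 4 points first, leading by 2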

This is very reusable, but I would like to also hack in the tiebreaker condition, so I'll add some more arguments which are nothing by default:

function win_by(p, target, margin, tie=nothing, tiebreak=nothing)
    trials = 0
    wins = 0
    while trials + abs(wins) < 2*target || abs(wins) < margin
        wins += p() ? 1 : -1
        trials += 1
        # Tied after `tie` trials? Settle it with the tiebreak function.
        tie == trials && abs(wins) == 0 && return tiebreak()
    end
    wins > 0
end

This is as before, but now we have tie, which is either nothing or the total number of trials at which a tie can occur, and a tiebreak function to simulate that condition. This relies on the fact that nothing == trials is simply false for any integer (Julia's generic == falls back to ===), so the check never fires unless you pass a number for tie.
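
A quick REPL check of that behavior:

nothing == 12  # false, via the generic fallback ==(x, y) = x === y
12 == 12       # true, as usual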

Now we define the basics:

point() = rand() < 0.55
game() = win_by(point, 4, 2)
set() = win_by(game, 6, 2, 12, () -> win_by(point, 7, 2))

Here, a point is won 55% of the time. A game is won when you have 4 points, win-by-2. A set uses the extra tie/tiebreaker args.

How does that work exactly? Well, if 12 games have been played and the score is tied at 6-6, we execute the tiebreaker function, which I've given as the zero-arg lambda that just calls win_by(point, 7, 2). I.e., the tiebreaker is just a weird game to 7 points, win-by-2.

Now the matches:

mens_match() = win_by(set, 3, 1)
womens_match() = win_by(set, 2, 1)

So a match is back to the simple win_by usage, but composed of sets.

How about the tournament? Well, if $p$ is the probability of winning a match, then $p^7$ is the probability of winning 7 independent matches in a row. But, just to keep having fun, let's simulate that as well.

function consecutive(p, target)
    for _ in 1:target
        p() || return false
    end
    true
end
mens_open() = consecutive(mens_match, 7)
womens_open() = consecutive(womens_match, 7)

Instead of hacking this into win_by, I just made another helper that checks whether you get target consecutive successes of the trial function p.

Results

Now that we can simulate a men's or women's match or tournament, how do we estimate the probabilities we discussed before? The most naive approach would be to just pick a big number of trials, and compute the ratio of successes:

estimate(p, n) = sum(p() for _ in 1:n)/n

@time estimate(mens_match, 1000)
# Output:
#    0.001993 seconds (5 allocations: 176 bytes)
#  0.946

@time estimate(mens_match, 1000)
# Output:
#    0.001682 seconds (5 allocations: 176 bytes)
#  0.956

@time estimate(mens_match, 10_000_000)
# Output:
#   13.246685 seconds (5 allocations: 176 bytes)
#  0.9530686

@time estimate(mens_match, 10_000_000)
# Output:
#   13.222734 seconds (5 allocations: 176 bytes)
#  0.9530566

Taking men's matches as an example, I calculated the ratio of successes with 1000 trials twice, and got 94.6% and 95.6%. I then tried it twice with 10 million trials, and got 95.3% both times.
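
That spread is about what the binomial standard error $\sqrt{p(1-p)/n}$ of the success ratio predicts; a quick check using the $\approx 0.953$ estimate:

se(p, n) = sqrt(p * (1 - p) / n)  # standard error of a success-ratio estimate

se(0.953, 1000)        # ≈ 0.0067, i.e. ±0.7%: matches the run-to-run wobble
se(0.953, 10_000_000)  # ≈ 6.7e-5: why the 10M-trial runs agree to ~4 digits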

Let's stick with this approach for now, and get those estimates, using 10 million trials:

Event          Win Pct.  Simulation Time (s)
Men's match    95.3%     13.2
Men's open     71.4%     81.6
Women's match  91.0%      8.7
Women's open   51.7%     46.5

In case you were wondering: indeed, $0.953^7 \approx 0.714$, etc., so we could have saved the ~2 minutes spent simulating the tournaments :)
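
Checking that in the REPL, with the match-level estimates from the table:

0.953^7  # ≈ 0.714, the men's tournament estimate
0.910^7  # ≈ 0.517, the women's tournament estimate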

By far the most striking thing here is how drastically different the win % is for the men's vs women's open. It's a gap of about 4 percentage points for individual matches, but of course it's compounded 7-fold for the bracket.

This means that the (highly unrealistic) women's player who has a 10 percentage point advantage on every point played against every other player in the tournament has only a slightly better than coin-flip chance of winning the entire tournament.


Statistical treatments

I noted above that the success ratio method is the most naive (reasonable) way of doing this. What are other approaches?

Frequentist

In the classical/frequentist treatment, there is a "true" number or parameter that describes your chance of winning a match, and if you could repeat a random trial measuring that value infinitely many times, the observed frequency would tend toward that true value.

With a finite number of trials, we can estimate an interval that might contain this true value. One prescription is as follows:

using Statistics
using Distributions

# Agresti–Coull confidence interval for a binomial proportion.
function ac(sim, N, α=0.05)
  z = quantile(Normal(), 1-α/2)  # ≈1.96 for a 95% CL
  n = N + z^2                    # adjusted trial count
  x = sum(sim() for _ in 1:N)    # number of successes
  p = (1/n)*(x + z^2/2)          # adjusted proportion
  c = p .+ [-1 1] .* (z*sqrt((p/n)*(1-p)))
  println("p = $p; p ∈ ($(c[1]), $(c[2])) @ $(100*(1-α))% CL")
end

@time ac(mens_match, 1000)
# Output:
#  p = 0.9442932657154425; p ∈ (0.9301052120368344, 0.9584813193940506) @ 95.0% CL
#    0.002066 seconds (27 allocations: 1.203 KiB)

@time ac(mens_match, 100_000)
# Output:
#  p = 0.9526926099998024; p ∈ (0.9513768379948709, 0.9540083820047339) @ 95.0% CL
#    0.136241 seconds (27 allocations: 1.203 KiB)

@time ac(mens_match, 1_000_000)
# Output:
#  p = 0.9528782602868123; p ∈ (0.952462946236072, 0.9532935743375527) @ 95.0% CL
#    1.324237 seconds (26 allocations: 896 bytes)

@time ac(mens_match, 10_000_000)
# Output:
#  p = 0.9530391259668853; p ∈ (0.9529080052066695, 0.9531702467271012) @ 95.0% CL
#   13.201614 seconds (26 allocations: 896 bytes)

Here, we have computed a 95% confidence level estimate for the interval of the true binomial proportion for winning a men's match.

This means that with e.g. 1000 trials, we estimate that the true proportion is in the interval 93.0% to 95.8%, and if we repeated this experiment infinitely many times, the true parameter would be in the given intervals 95% of the time.

Meanwhile, with 10 million trials, the 95% CL interval is much tighter: 95.29% to 95.32%.
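
That tightening is the usual $1/\sqrt{N}$ scaling of the interval width, which we can sketch directly (ignoring the small $z^2$ adjustment to $n$ above):

width(p, z, N) = 2z * sqrt(p * (1 - p) / N)  # approximate interval width

width(0.953, 1.96, 1000)        # ≈ 0.026
width(0.953, 1.96, 10_000_000)  # ≈ 0.00026: 10^4 times the trials, ~100x tighter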

Bayesian

I love Bayesian inference, and kinda went overboard. I'm just going to skip ahead to the results and then leave details on how this all works below.

We can estimate the interval containing ~95% of the posterior weight with the following:

function bci(p, n)
  k = sum(p() for _ in 1:n)
  θ = (k+1)/(n+2)  # posterior mean ⟨θ⟩
  println("$k successes out of $n")
  c = θ .+ [-1 1] .* (2*sqrt(θ*(1-θ)/(n+3)))  # ⟨θ⟩ ± 2σ of the posterior
  println("p = $θ; p ∈ ($(c[1]), $(c[2])) @ 2σ")
end

@time bci(mens_match, 1000)
# Output:
#  956 successes out of 1000
#  p = 0.9550898203592815; p ∈ (0.9420108356091127, 0.9681688051094503) @ 2σ
#    0.002022 seconds (40 allocations: 1.359 KiB)

@time bci(mens_match, 100_000)
# Output:
#  95444 successes out of 100000
#  p = 0.9544309113817724; p ∈ (0.9531119535804236, 0.9557498691831212) @ 2σ
#    0.136147 seconds (40 allocations: 1.359 KiB)

@time bci(mens_match, 1_000_000)
# Output:
#  953146 successes out of 1000000
#  p = 0.9531450937098126; p ∈ (0.9527224383416056, 0.9535677490780196) @ 2σ
#    1.342780 seconds (40 allocations: 1.359 KiB)

@time bci(mens_match, 10_000_000)
# Output:
#  9528975 successes out of 10000000
#  p = 0.9528974094205181; p ∈ (0.9527634184908623, 0.9530314003501739) @ 2σ
#   13.539280 seconds (40 allocations: 1.359 KiB)

Way more than you wanted to know

In the Bayesian view, we see the probability of success e.g. of winning a men's match, or winning the women's open as a parameter, $\theta$. We would like to know how plausible any given choice of $\theta$ is, which is given by its distribution, $p(\theta)$. Yes, that means the probability distribution of the probability of success. It should be clear that e.g. $p(\theta>1) = p(\theta<0) = 0$ — those are implausible choices of $\theta$.

Prior and posterior 🍑

So what is $p(\theta)$? All we know is that it should be a probability distribution, so it should never be negative, it should integrate to 1 (be normalized), and it should be zero whenever $\theta \notin [0,1]$. Not having anything else to go on, it's reasonable to just say $p(\theta) = 1$ for $\theta \in [0,1]$. Here, every allowable value of $\theta$ is equally likely.

This $p(\theta)$ is called the prior distribution; it's our estimate of the distribution for $\theta$ prior to conducting the simulations. We can update this with the data, $D$, from our simulations, and construct the posterior probability distribution $p(\theta|D)$. (Prior/posterior from a priori and a posteriori.)

We can think of the data $D$ as a sequence of successes/failures, but given that each trial is independent, the only thing that matters is the number of successes $k$ and the number of trials $n$. For any given value of $\theta$, the probability of observing $D = (k, n)$ is given by the binomial distribution: \[ p(D|\theta) = {n \choose k} \theta^k (1-\theta)^{n-k} \ . \]
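
For concreteness, we can evaluate this likelihood with Distributions.jl, taking the $k = 956$, $n = 1000$ run from earlier as the data:

using Distributions

# Likelihood p(D|θ) of 956 successes in 1000 trials, for two choices of θ:
pdf(Binomial(1000, 0.55), 956)   # vanishingly small: θ = 0.55 can't explain D
pdf(Binomial(1000, 0.956), 956)  # ≈ 0.06: near the maximum-likelihood θ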

Bayes' theorem

Finally, we use Bayes' theorem to "invert" the probability from $p(D|\theta)$ to $p(\theta|D)$ — figure out where the plausibility of $\theta$ lies given the data, based on how likely the data is to be observed given $\theta$: \[ p(\theta|D) = \frac{p(D|\theta) p(\theta)}{\mathcal{Z}} \ , \] where $\mathcal{Z} = \int p(D|\theta) p(\theta) \, \mathrm d\theta$. We already decided to use $p(\theta) = 1$ as the prior, and \[ \mathcal{Z} = {n \choose k} \frac{k!(n-k)!}{(n+1)!} \ . \] (See the appendix on wtf happened there.)

Thus, we are left with the posterior \[ p(\theta|D) = \frac{(n+1)!}{k!(n-k)!} \theta^k (1-\theta)^{n-k} \ . \] This is our distribution of how plausible values of $\theta$ are given the chain of information:

  1. $\theta$ is a probability
  2. We performed a simulation of $n$ trials, obtaining $k$ successes.
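
Incidentally, this posterior is exactly a $\mathrm{Beta}(k+1, n-k+1)$ distribution, so instead of the moment-based interval in bci we could ask Distributions.jl for an exact equal-tailed credible interval; a sketch using the $k = 956$, $n = 1000$ run:

using Distributions

posterior = Beta(956 + 1, 1000 - 956 + 1)  # Beta(k+1, n-k+1) under a flat prior
quantile.(posterior, [0.025, 0.975])       # ≈ [0.941, 0.967], exact 95% interval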

Bayesian credible interval

We saw how the frequentists get to estimate their 95% confidence interval. Did you think Bayesians couldn't get in on the fun? Using the posterior distribution we found earlier, we evaluate the moments $\langle\theta\rangle = \mathbb{E}_{\theta|D} [\theta]$ and $\left\langle \theta^2 \right\rangle = \mathbb{E}_{\theta|D} \left[\theta^2\right]$. Then ~95% of the probability lies in the interval \[ \langle\theta\rangle \pm 2 \sqrt{\langle \theta^2 \rangle - \langle\theta\rangle^2} . \]

Using (see appendix for more info) \begin{align}
\langle\theta\rangle &= \frac{k+1}{n+2} \\
\left\langle\theta^2\right\rangle &= \frac{(k+1)(k+2)}{(n+2)(n+3)} \ ,
\end{align}
the interval is evaluated to be \[ p \pm 2\sqrt{\frac{p(1-p)}{n+3}} \ , \] where we denote $p \equiv \langle\theta\rangle$.
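
As a sanity check, these moments agree with what Distributions.jl reports for the Beta posterior from the previous section:

using Distributions

k, n = 956, 1000
d = Beta(k + 1, n - k + 1)
mean(d)  # (k+1)/(n+2) ≈ 0.9551
std(d)   # sqrt(p(1-p)/(n+3)) ≈ 0.0065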

Appendix

The Euler integral of the first kind, also called the Euler beta function, can be written in terms of Euler gamma functions. Our variable $\mathcal{Z}$ is exactly this: \begin{align}
\mathcal{Z} &= \int p(D|\theta) p(\theta) \, \mathrm d\theta \\
&\propto \int \theta^k (1-\theta)^{n-k} \, \mathrm d\theta \\
&= B(k+1, n-k+1) \\
&= \frac{\Gamma(k+1)\Gamma(n-k+1)}{\Gamma(n + 2)} \\
&= \frac{k!(n-k)!}{(n+1)!}
\end{align}
I left out the ${n\choose k}$ because it's outside the integral, and cancels out with the numerator anyway.

The other thing worth looking at is the expectation value of the moments: \begin{align}
\mathbb{E}_{\theta|D}[\theta^m] &= \mathcal{Z}^{-1} \int \theta^{k+m} (1-\theta)^{n-k} \, \mathrm d\theta \\
&= \mathcal{Z}^{-1} B(k+m+1, n-k+1) \\
&= \left(\frac{(n+1)!}{k!(n-k)!}\right) \left(\frac{(k+m)!(n-k)!}{(n+m+1)!}\right) \\
&= \frac{(k+1)(k+2)\cdots(k+m)}{(n+2)(n+3)\cdots(n+m+1)}
\end{align}
God damn, I love canceling out factorials.
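
If you'd rather trust a computer than my algebra, the cancellation checks out in exact arithmetic with Julia's rationals (small k, n, m chosen arbitrarily):

# Exact check of E[θ^m] = (k+1)⋯(k+m) / ((n+2)⋯(n+m+1)) for k = 3, n = 5, m = 2.
k, n, m = 3, 5, 2
lhs = (factorial(n + 1) // (factorial(k) * factorial(n - k))) *
      (factorial(k + m) * factorial(n - k) // factorial(n + m + 1))
rhs = ((k + 1) * (k + 2)) // ((n + 2) * (n + 3))
lhs == rhs  # true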