<!DOCTYPE html>
<html lang="en">
<head>
<!-- ## for client-side less
<link rel="stylesheet/less" type="text/css" href="/theme/css/style.less">
<script src="http://cdnjs.cloudflare.com/ajax/libs/less.js/1.7.3/less.min.js" type="text/javascript"></script>
-->
<link rel="stylesheet" type="text/css" href="/theme/css/style.css">
<link rel="stylesheet" type="text/css" href="/theme/css/pygments.css">
<link rel="stylesheet" type="text/css" href="//fonts.googleapis.com/css?family=PT+Sans|PT+Serif|PT+Mono">
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="author" content="Gaël">
<meta name="description" content="Posts and writings by Gaël">
<meta name="keywords" content="">
<title>
And yet it moves
– The pros of the A/B testing method, Multi-armed bandit vs. A/B testing for web service optimization </title>
</head>
<body>
<aside>
<div id="user_meta">
<div id="metalogo">
<a href="/">
<img src="/images/logo.svg" alt="logo" id="logo">
</a>
<a href="/" id="title">And yet it moves</a>
</div>
<p></p>
<ul>
<li><a href="/category/data-processing.html" class="active">Data processing</a></li>
</ul>
</div>
</aside>
<main>
<header>
<p>
<a href="/">Index</a> ¦ <a href="/archives.html">Archives</a>
</p>
</header>
<article>
<div class="series_toc">
<p>This post is part 4 of the "AB vs. MAB" series:</p>
<ol class="parts">
<li >
<a href='/AB-vs-MAB.html'>Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
<li >
<a href='/AB-vs-MAB--basics.html'>Basics of testing for web optimisation, Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
<li >
<a href='/AB-vs-MAB--making-a-choice.html'>Choosing the best variant, Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
<li class="active">
<a href='/AB-vs-MAB--pros-of-AB-testing.html'>The pros of the A/B testing method, Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
<li >
<a href='/AB-vs-MAB--reward-maximisation.html'>Reward maximisation, Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
<li >
<a href='/AB-vs-MAB--other-drawbacks-conclusion.html'><span class="caps">MAB</span> other drawbacks and conclusion, Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
</ol>
</div>
<div class="article_meta">
<p style="text-align: right; margin: 0;">Posted on: Sun, 15 May 2016</p>
</div>
<div class="article_title">
<h3><a href="/AB-vs-MAB--pros-of-AB-testing.html">The pros of the A/B testing method, Multi-armed bandit vs. A/B testing for web service optimization</a></h3>
</div>
<div class="article_text">
<center><p><img alt="image0" src="/images/dab.gif" style="width: 640px; height: auto; max-width: 100%;"/></p>
</center><p>In order to bring some nuances and perspective into the discussion
(a.k.a. <em>unsupported claims smashing</em>™), this post is dedicated to
showing what A/B testing is good at and how it compares to the
multi-armed bandit strategy.</p>
<div class="section" id="a-b-testing-is-more-powerful-than-mab-testing">
<h2 id="ab-testing-is-more-powerful-than-mab-testing">A/B testing is more powerful than <span class="caps">MAB</span> testing</h2>
<p>When a (statistical) test is applied to data, it answers the question
for which it was designed with a given <em>level of significance</em> and
<em>power</em> (the two components of what we could improperly call “level of
confidence” for simplicity):</p>
<ul class="simple">
<li>The <em>level of significance</em> controls the rate of false alarms: one minus
that level is the probability of correctly concluding that the variants (<object data="/images/orange_btn.svg" type="image/svg+xml">the orange</object> and <object data="/images/green_btn.svg" type="image/svg+xml">the green button</object>
here) <strong>have no</strong> <em>significant impact</em> on the click-through rate.</li>
<li>The <em>power</em> is the probability of correctly concluding that the
variants <strong>have</strong> <em>a significant impact</em> on the click-through rate.</li>
</ul>
<p>The power is the most important measure of confidence in our case: if
we see a difference, we would like to know whether it is significant. So we
modelled the whole process and estimated the power of each method at a
given level of significance and for different sample sizes.</p>
<div class="figure">
<object data="/images/power_vs_sample_size.svg" type="image/svg+xml">
</object>
<p class="caption">If you can’t see the figure, please try another web browser which
supports <span class="caps">SVG</span> images</p>
</div>
<p>A/B testing clearly reaches higher (and therefore better) power
at far smaller sample sizes: it typically requires sample sizes <em>5
times</em> smaller than <span class="caps">MAB</span> to reach the same power!</p>
<p><em>Why is that?</em> As we claimed above when presenting the A/B testing
method, it is generally more efficient to present each variant equally often
than to favour one. A/B testing is better here, and we didn’t even have
to look for a convoluted setup: this holds for all the setups we tried!</p>
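<p>To make the comparison concrete, here is a minimal Monte Carlo sketch of such a power estimation. It is only an illustration under assumptions that are not taken from the post: true click-through rates of 10% and 15%, a plain 50/50 split for A/B, an epsilon-greedy bandit as the <span class="caps">MAB</span> strategy, and a chi-squared contingency test.</p>
<pre class="literal-block">
# A minimal power-estimation sketch (assumed setup, see above): simulate a
# campaign with a 50/50 split or an epsilon-greedy bandit, then run a
# chi-squared contingency test and count how often it detects the difference.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

def run_campaign(n_visitors, p_a, p_b, epsilon=None):
    """Return a 2x2 table of (clicks, no clicks) per variant."""
    table = np.zeros((2, 2), dtype=int)        # rows: A, B; cols: click, no click
    for _ in range(n_visitors):
        shows = table.sum(axis=1)
        if epsilon is None or rng.random() &lt; epsilon or shows.min() == 0:
            arm = rng.integers(2)              # A/B split, or bandit exploration
        else:
            arm = int(np.argmax(table[:, 0] / shows))  # bandit exploitation
        p = p_a if arm == 0 else p_b
        table[arm, 0 if rng.random() &lt; p else 1] += 1
    return table

def estimate_power(n_visitors, epsilon=None, alpha=0.05, n_runs=2000):
    """Fraction of simulated campaigns in which the test detects the difference."""
    hits = 0
    for _ in range(n_runs):
        table = run_campaign(n_visitors, 0.10, 0.15, epsilon)
        if table.sum(axis=0).min() == 0 or table.sum(axis=1).min() == 0:
            continue                           # degenerate table, no test possible
        if chi2_contingency(table)[1] &lt; alpha:
            hits += 1
    return hits / n_runs

for n in (100, 300, 1000):
    print(n, estimate_power(n), estimate_power(n, epsilon=0.1))
</pre>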
</div>
<div class="section" id="a-b-testing-is-correct-more-often-at-small-sample-sizes">
<h2 id="ab-testing-is-correct-more-often-at-small-sample-sizes">A/B testing is correct more often at small sample sizes</h2>
<p>There are cases, however, where no test is used at all. This is equivalent
to choosing a level of significance of <span class="math">\(100\%\)</span>: a winner
is always declared, even when there is no real difference. Yet that
does not mean that the power reaches <span class="math">\(100\%\)</span> (in this case,
the power is simply how often the seemingly best-performing
variant really is the best performer).</p>
<p>As it turned out, even in that case, the A/B data-gathering strategy
does provide better results:</p>
<div class="figure">
<object data="/images/correct_vs_sample_size.svg" type="image/svg+xml">
</object>
<p class="caption">If you can’t see the figure, please try another web browser which
supports <span class="caps">SVG</span> images</p>
</div>
<p>In this example, the A/B framework again provides better results at
smaller sample sizes. For instance, the A/B framework requires only
<span class="math">\(300\)</span> trials to be correct in <span class="math">\(9\)</span> campaigns out of
<span class="math">\(10\)</span>. To provide the <em>same</em> guarantee, the <span class="caps">MAB</span> strategy would
require <span class="math">\(600\)</span> trials: twice as many!</p>
<p>This can be explained the same way as above: presenting each website
variant equally often to the visitors is a more efficient way to identify
the best-performing one.</p>
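<p>The “no test” scenario above is even simpler to simulate: pick whichever variant shows the higher observed click-through rate and count how often that pick is actually the best one. A minimal sketch, reusing <code>run_campaign</code> and the illustrative 10% vs. 15% rates from the power sketch above:</p>
<pre class="literal-block">
# "No test": always declare the variant with the higher observed rate the
# winner, and measure how often that pick is really the best variant
# (variant B by construction here). Reuses run_campaign from above.
def fraction_correct(n_visitors, epsilon=None, n_runs=2000):
    correct = 0
    for _ in range(n_runs):
        table = run_campaign(n_visitors, 0.10, 0.15, epsilon)
        rates = table[:, 0] / np.maximum(table.sum(axis=1), 1)
        correct += int(np.argmax(rates) == 1)   # arm 1 (B) is the true best
    return correct / n_runs

print(fraction_correct(300), fraction_correct(300, epsilon=0.1))
</pre>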
</div>
<div class="section" id="a-b-testing-provides-more-accurate-estimations-of-the-differences">
<h2 id="ab-testing-provides-more-accurate-estimations-of-the-differences">A/B testing provides more accurate estimations of the differences</h2>
<p>A correct estimation of the difference in click-through rates is
important: it is required by methods that rely on quantifying it
(e.g. multivariate analysis), but also simply to determine which
variant performs better:</p>
<div class="math">
\begin{equation*}
a > b
\end{equation*}
</div>
<p>is equivalent to</p>
<div class="math">
\begin{equation*}
a - b > 0\text{.}
\end{equation*}
</div>
<p>I estimated the difference in click-through rates (denoted
<span class="math">\(P(O|B) - P(O|A)\)</span> in the figure, a notation explained earlier in this series) in the same setup
as above and obtained the following figure:</p>
<div class="figure">
<object data="/images/difference_vs_sample_size.svg" type="image/svg+xml">
</object>
<p class="caption">If you can’t see the figure, please try another web browser which
supports <span class="caps">SVG</span> images</p>
</div>
<p>We can clearly see that the A/B testing data-gathering method converges
toward the expected value of <span class="math">\(5\%\)</span> much faster than the <span class="caps">MAB</span>
strategy. Worse, the residual error in the <span class="caps">MAB</span> estimate, though small,
persists for a very long time.</p>
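<p>The same simulation machinery can also be pointed at the estimate itself rather than at the decision: average the observed difference over many campaigns and compare it with the true difference. A minimal sketch, again reusing <code>run_campaign</code> and the illustrative 10% vs. 15% rates (a true difference of 5 percentage points):</p>
<pre class="literal-block">
# Average observed difference in click-through rates, P(O|B) - P(O|A), over
# many simulated campaigns; with the assumed rates the true value is 0.05.
# Reuses run_campaign from the power sketch above.
def mean_estimated_difference(n_visitors, epsilon=None, n_runs=2000):
    diffs = []
    for _ in range(n_runs):
        table = run_campaign(n_visitors, 0.10, 0.15, epsilon)
        rates = table[:, 0] / np.maximum(table.sum(axis=1), 1)
        diffs.append(rates[1] - rates[0])        # estimated P(O|B) - P(O|A)
    return float(np.mean(diffs))

print(mean_estimated_difference(1000),                # A/B allocation
      mean_estimated_difference(1000, epsilon=0.1))   # MAB allocation
</pre>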
</div>
<div class="section" id="summary-what-is-a-b-testing-good-at">
<h2 id="summary-what-is-ab-testing-good-at">Summary: what is A/B testing good at?</h2>
<p>These results illustrated what it means for A/B testing to answer the
question: <em>Is there a difference?</em></p>
<ul class="simple">
<li>Its data-gathering scheme is efficient: it provides a very high power
in the contingency tests and leads to correct results more often at
smaller sample sizes than the <span class="caps">MAB</span> strategy.</li>
<li>It makes use of the contingency test to actually answer that question
at a given “level of confidence” (significance and power). The same test is also
used to determine what the sample size should be.</li>
<li>It also generally provides more precise and more accurate estimates
of the actual difference in click-through rates.</li>
</ul>
<p>Beyond invalidating the most outrageous claims of the original blog
post, these results highlight another very important fact: the <span class="caps">MAB</span>
strategy generally requires more samples to provide the exact same
guarantees as A/B testing. This means in particular that it would not be
fair to compare click-through rates at a single given sample size: that
sample size should be adjusted separately for each method in order to
provide the same guarantees.</p>
<p>From my perspective, this is just like an insurance seller undermining
the competition by comparing the price of his most basic contract with
the price of a competitor’s broader contract (which covers more). Yes, the basic
contract is cheaper, but this is comparing apples to oranges. That said,
the basic contract might very well be sufficient, and that is what I am
going to assess in the next post.</p>
</div>
</div>
</article>
<div id="ending_message">
<p>© Gaël. Built using <a href="http://getpelican.com" target="_blank">Pelican</a>. Theme by Giulio Fidente on <a href="https://github.com/gfidente/pelican-svbhack" target="_blank">github</a>. </p>
</div>
</main>
</body>
</html>