MAB-other-drawbacks.html

<!DOCTYPE html>
<html lang="en">
<head>
        <meta charset="utf-8" />
        <title>MAB other drawbacks</title>
        <link rel="stylesheet" href="/theme/css/main.css" />

        <!--[if IE]>
            <script src="https://html5shiv.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
</head>

<body id="index" class="home">
        <header id="banner" class="body">
                <h1><a href="/">And yet it moves! </a></h1>
                <nav><ul>
                    <li class="active"><a href="/category/data-processing.html">Data processing</a></li>
                </ul></nav>
        </header><!-- /#banner -->
<section id="content" class="body">
  <article>
    <header>
      <h1 class="entry-title">
        <a href="/MAB-other-drawbacks.html" rel="bookmark"
           title="Permalink to MAB other drawbacks"><span class="caps">MAB</span> other&nbsp;drawbacks</a></h1>
    </header>

    <div class="entry-content">
<footer class="post-info">
        <abbr class="published" title="2016-05-19T14:58:00+01:00">
                Published: jeu. 19 mai 2016
        </abbr>

        <address class="vcard author">
                By                         <a class="url fn" href="/author/gael.html">Gaël</a>
        </address>
<p>In <a href="/category/data-processing.html">Data processing</a>.</p>

</footer><!-- /.post-info -->      <p>Until now, I have spent time (and <span class="caps">CPU</span> power) to precisely illustrate what the
two methods were designed&nbsp;for:</p>
<ul>
<li>A/B testing efficiently answers the question <em>is there a&nbsp;difference?</em></li>
<li>The <span class="caps">MAB</span> strategy provides the highest click-through rates over
    a wider range of sample&nbsp;sizes.</li>
</ul>
<p>As their respective purposes are completely different, thinking of one of them as
intrinsically better than the other is nonsensical, especially as we have shown
that <strong>each</strong> method is clearly better than the other at what it was
designed for. Therefore saying that one is better than the other is saying that
a screwdriver is better than a hammer: it makes absolutely no sense because
each of them are useful for different&nbsp;things.</p>
<p>In this last post, I want to lay the emphasis on a few extra drawbacks of the
<span class="caps">MAB</span> strategy, mostly to show some side effects of the technique that were
seldom evoked in what I&nbsp;read.</p>
<h1 id="the-mab-strategy-does-skew-the-results">The <span class="caps">MAB</span> strategy does skew the&nbsp;results</h1>
<p>Results are skewed when there is a bias in the measurement process:
the &ldquo;ideal&rdquo; measured value is no longer equal to the hidden &ldquo;true&rdquo;&nbsp;value.</p>
<p>In statistics, there are a few dragons that are known to almost invariably
skew the results: changing the measurement process using the measurements
themselves is one of them.
The problem is that this kind of bias is hard for our brain to grasp
and, oddly enough, this is the very crux of most probability (false) &ldquo;paradoxes&rdquo;
(like the Monty-hall problem, 
<a href="https://en.wikipedia.org/wiki/Monty_Hall_problem">wiki link</a>).</p>
<p>Back to the original blog post, the blogger&nbsp;claimed:</p>
<blockquote>
<p>Showing the different options at different rates will skew the results. 
  (No it won&rsquo;t. You always have an estimate of the click through rate for 
  each&nbsp;choice)</p>
</blockquote>
<p>That claim is rather strange: having data does not mean these data are not
skewed and of course showing different options at different rate won&rsquo;t
(usually) skew the results&hellip; however, as I said, changing these rates
<strong>according</strong> to the measurements most likely&nbsp;will&hellip;</p>
<p>So how can I prove that? I already gave an hint earlier in the series in that&nbsp;figure:</p>
<p><img alt="If you can't see the figure, please try another web browser which supports
SVG images" src="/images/difference_vs_sample_size.svg"/></p>
<p>As I said at the time, there seems to be a remaining difference at higher
sample sizes but it is not obvious and really small. I can also plot the
distribution of these&nbsp;differences:</p>
<p><img alt="If you can't see the figure, please try another web browser which supports
SVG images" src="/images/dist_difference_vs_sample_size.svg"/></p>
<p>and the same conclusion applies: <strong>if</strong> there is a skew, it is not really 
visible: the two distributions are centered on <span class="math">\(5\%\)</span> and roughly
Gaussian-shaped (which allowed us to calculate and represent the arithmetic 
mean as the expected value&nbsp;earlier).</p>
<p>But remember, what <em>usually</em> introduces a skew is altering some probabilities
somewhere using the data themselves as an input. This situation is
likely to occur more often if the difference in click-through rates is small
or even null. I therefore calculated the same kind of distributions but in a
setup with equal click-through&nbsp;rates:</p>
<p><img alt="If you can't see the figure, please try another web browser which supports
SVG images" src="/images/dist_zero_difference_vs_sample_size.svg"/></p>
<p>The Gaussian-shaped distribution suddenly turned into a very strange one: 
it is still roughly symmetrical but the
expected value (<span class="math">\(0\%\)</span>) is actually <strong>not</strong> the most probable value&nbsp;anymore.</p>
<p>As we can already guess, we should be able to see what happens more clearly
by calculating the difference between the click-though rate of the variant which
happened to be mostly favored with the one which happened to be mostly
disfavored. The resulting distribution are represented in the following&nbsp;figure:</p>
<p><img alt="If you can't see the figure, please try another web browser which supports
SVG images" src="/images/abs_dist_zero_difference_vs_sample_size.svg"/></p>
<p>We can observe&nbsp;that:</p>
<ul>
<li>in the case of A/B testing, the distribution is a half of a Gaussian with
    its maximal value reached at <span class="math">\(0\%\)</span>, as&nbsp;expected;</li>
<li>in the case of the <span class="caps">MAB</span> strategy, the distribution is not Gaussian anymore
    and the maximal value is reach at <span class="math">\(0.4\%\)</span>.</li>
</ul>
<p>The <span class="caps">MAB</span> strategy therefore skews the data themselves by introducing a
difference where there shouldn&rsquo;t be&nbsp;any.</p>
<p>This is an issue for two&nbsp;reasons:</p>
<ol>
<li>Actually the individual click-through rate distributions are skewed too,
    which basically means that we cannot trust any contingency test performed
    on the data gathered using the <span class="caps">MAB</span>&nbsp;strategy.</li>
<li>There is no way to know whether a given difference is <strong>actually</strong> a&nbsp;difference.</li>
</ol>
<p>These two extra reasons also justify why applying the <span class="caps">MAB</span> strategy to drug testing
and the likes is a very bad&nbsp;idea.</p>
<h1 id="no-test-no-confidence-no-conclusion">No test, no confidence: no&nbsp;conclusion?</h1>
<p>Most people seem to forget completely about the test part of A/B <strong>testing</strong>.
The reasoning here, as seen elsewhere, is that whether or not the test is
positive, what appears to be the best-performing variant would be used anyway.
Therefore the test itself is useless; A/B testing itself is
useless; A/B testing can be replaced with something providing higher
click-through&nbsp;rates.</p>
<p>But the actual <em>test</em> in A/B testing is actually intended as a feedback, a way to
<strong>estimate</strong> your confidence in the results you obtained and make decisions
with both pieces of informations. In contrast, the <span class="caps">MAB</span> strategy seems to be
used far more often relying on one&rsquo;s &ldquo;good luck&rdquo;: it is <strong>assumed</strong> that the
population is large enough to eventually provide meaningful results and the
problem is precisely that this is <strong>usually</strong>&nbsp;true.</p>
<p>Think of a successful campaign as a light bubble. When you switch it on, you
expect to get light and this is what <strong>usually</strong> happens. 
The <span class="caps">MAB</span> strategy is like
saying &ldquo;let&rsquo;s ask a blind person to turn that light on: he will move more easily in
the dark, in that sense he will be more efficient for the job&rdquo;.
On the other hand, the A/B testing method would be more like saying 
&ldquo;let&rsquo;s ask a sighted person
to switch it on because he needs to know whether he was successful and report
back&rdquo;. Yes, in most cases, the light will be on anyway!
However, there is no way to detect that the light did not turn on for whatever 
reason in the <span class="caps">MAB</span>&nbsp;strategy.</p>
<p>Mind you, the final and allegedly real results given in the original blog
post do not even pass the contingency test (the maximal <span class="math">\(p\)</span>-values is only <span class="math">\(0.7\)</span>).
So strictly speaking, his conclusions are questionable, especially given that
the <span class="caps">MAB</span> strategy is known to introduce such a&nbsp;difference.</p>
<h1 id="conclusion">Conclusion</h1>
<h2 id="other-drawbacks">Other&nbsp;drawbacks</h2>
<p>There are a few moot moot points that I did not evoke such as the effect of
time-varying click-through rates, the effect of the lack of equally-sampled 
control group, the implications of the actual overlap of the click-through rate
distributions depending on the method used, the relative spread of the
distributions, etc. I decided not to include them because the most disastrous
outcomes would require stringent (albeit not that rare) requirements. 
I did not want to weaken the whole series because some would dismiss these 
arguments saying &ldquo;what are the&nbsp;odds?&rdquo;.</p>
<h2 id="original-post">Original&nbsp;post</h2>
<p>I&rsquo;ve generally not spoken about the original blog itself throughout
the series. The main reason is that everyone has the right not to be completely
right and nagging on each moot point would not have been&nbsp;right.</p>
<p>That said, is seems important to indicate that if one does not want to talk
about actual A/B <strong>testing</strong>, but A/B testing <strong>data-gathering</strong> strategy, it would
have been less misleading to talk about <em><span class="caps">MAB</span> with <span class="math">\(\varepsilon = 0\%\)</span></em>
vs. <em><span class="caps">MAB</span> with <span class="math">\(\varepsilon = 90\%\)</span></em> as it is rigorously what is compared in
that&nbsp;post.</p>
<p>As well, A/B testing was not designed in a set-and-forget state of mind: there
are two different stages (<em>testing</em> and <em>exploitation</em>) that must be used. In
particular, conflating the two as done in the original post is&nbsp;misleading. </p>
<p>Finally, sorry to raise that specific point but it illustrates well the degree
of understanding of the original&nbsp;blogger:</p>
<blockquote>
<p>In the epsilon-first strategy, you can explore 100% of the time in 
  the beginning and once you have a good sample, switch to&nbsp;pure-greedy.</p>
</blockquote>
<p>is an <strong>exact and accurate</strong> description of what A/B testing is (assuming the
&ldquo;good sample&rdquo; part is assessed with a contingency&nbsp;test).</p>
<p>I won&rsquo;t go further down that road: the point is made and not knowing that tiny
bit is a shame for someone who blogged about it. But it is perfectly 
understandable and fine (it happens to everyone and I guess that series of
posts does contain its share of inaccuracies and incomplete understanding). 
In particular, this does not make him a bad developer, far from&nbsp;it.</p>
<h2 id="of-hammers-and-screwdrivers">Of hammers and&nbsp;screwdrivers</h2>
<p>As surprising as it may appear, A/B testing and the multi-armed bandit strategy
were designed for two completely different&nbsp;purposes:</p>
<ul>
<li>The purpose of A/B testing is to determine whether or not there is a
    difference and provide that answer (with the uncertainties) through a
    statistical&nbsp;test.</li>
<li>The purpose of the multi-armed bandit strategy is to maximize the reward
    over a large range of sample sizes. Usually it can (and rigorously should)
    be used in a set-and-forget&nbsp;mode.</li>
</ul>
<p>These methods are just like a hammer and a screwdriver: both can be used to do
nearly the same thing, yet cannot really be freely&nbsp;swapped.</p>
<h2 id="when-should-i-choose-ab-testing">When should I choose A/B&nbsp;testing?</h2>
<ul>
<li>When you need to really determine which variant is better (e.g. drug&nbsp;trials).</li>
<li>When you need to estimate the difference precisely (e.g. multivariate&nbsp;analysis).</li>
<li>When you need the shortest testing period possible for a given confidence
    in the&nbsp;results.</li>
<li>When you need to know the uncertainties on the&nbsp;results.</li>
<li>When you need to assess that there is actually no&nbsp;difference.</li>
</ul>
<h2 id="when-should-i-use-the-mab-strategy">When should I use the <span class="caps">MAB</span>&nbsp;strategy?</h2>
<ul>
<li>When being sometimes wrong without any indication of it is not that&nbsp;important.</li>
<li>When you cannot setup an optimal A/B testing campaign because you lack too
    much information (minimum population size, estimate of the difference,&nbsp;etc.).</li>
</ul>
<h2 id="the-end">The&nbsp;End?</h2>
<p>What this post series lacks (as most post out there) are references: I am
pretty sure that all this work and far far more has already been done by people
out there. Though I cannot personally recommend it (as I have not read it), I
know there is a short <a href="http://shop.oreilly.com/product/0636920027393.do?sortby=publicationDate">O&rsquo;Reilly book</a> 
on the subject of bandit algorithms in general. I also know that there are a
lot of references in Google scholar about both&nbsp;methods.</p>
<p>Literature scanning is not just for scientist: it is necessary to efficiently
reuse the knowledge humankind already gathered on the subject in order to 
eventually avoid the traps, pitfalls and inefficiencies caused by starting from
scratch. I am sure a lot of great minds already worked on this, this is what
you should read, not some random blog posts on the internet if you really want
to do things&nbsp;correctly.</p>
<p>As far as this blog is concerned, it just started out as a few tests and it is
mainly intended to show once again that the world is not just black and white
by testing some of the original blogger&nbsp;claims.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width < 768) ? "left" : align;
        indent = (screen.width < 768) ? "0em" : indent;
        linebreak = (screen.width < 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    var location_protocol = (false) ? 'https' : document.location.protocol;
    if (location_protocol !== 'http' && location_protocol !== 'https') location_protocol = 'https:';
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = location_protocol + '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
    mathjaxscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
    </div><!-- /.entry-content -->

  </article>
</section>
        <section id="extras" class="body">
        </section><!-- /#extras -->

        <footer id="contentinfo" class="body">
                <address id="about" class="vcard body">
                Proudly powered by <a href="http://getpelican.com/">Pelican</a>, which takes great advantage of <a href="http://python.org">Python</a>.
                </address><!-- /#about -->

                <p>The theme is by <a href="http://coding.smashingmagazine.com/2009/08/04/designing-a-html-5-layout-from-scratch/">Smashing Magazine</a>, thanks!</p>
        </footer><!-- /#contentinfo -->

</body>
</html>