Spotify Engineering

Choosing a Sequential Testing Framework — Comparisons and Discussions

  1. Introduction
  • The post discusses the pros and cons of different sequential testing frameworks for experimentation.
  • The choice of framework affects both statistical power and the false positive rate.
  1. Always Valid Inference
  • Allows continuous testing during data collection without deciding on a stopping rule or number of analyses.
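  • The false positive rate stays bounded at every look. Bonferroni corrections are a separate way to bound it, and they require fixing the maximum number of analyses in advance.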
  • A good fit for experiments that run for a few weeks and receive data in batches.
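To make the idea of continuous testing concrete, here is a minimal sketch of an always-valid p-value based on the mixture sequential probability ratio test (mSPRT). It is an illustration under stated assumptions, not the post's implementation: the data are taken to be i.i.d. normal with known standard deviation, the null hypothesis is a mean of zero, and the mixing distribution is N(0, tau^2). The function name `msprt_p_values` and the default values of `sigma` and `tau` are hypothetical.

```python
import math


def msprt_p_values(xs, sigma=1.0, tau=1.0):
    """Always-valid p-values for H0: mean == 0, one value per observation.

    Assumes i.i.d. normal data with known standard deviation `sigma` and a
    N(0, tau^2) mixing distribution over the alternative mean.
    """
    p_value = 1.0
    running_sum = 0.0
    p_values = []
    for n, x in enumerate(xs, start=1):
        running_sum += x
        mean = running_sum / n
        denom = sigma**2 + n * tau**2
        # Log of the mixture likelihood ratio after n observations.
        log_lr = 0.5 * math.log(sigma**2 / denom) + (
            n**2 * tau**2 * mean**2
        ) / (2 * sigma**2 * denom)
        # The p-value is the running minimum of 1 / likelihood ratio,
        # so it can only decrease as more data arrive.
        p_value = min(p_value, math.exp(-log_lr))
        p_values.append(p_value)
    return p_values
```

Because the p-value is a running minimum, the experiment can be stopped the first time it drops below the significance level, at any sample size, without a pre-registered stopping rule.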
  1. Evaluating Sequential Tests
  • Two important properties: bounded false positive rate and statistical power.
  • A false positive rate simulation was run for group sequential tests (GST) with correctly assumed, underestimated, and overestimated sample sizes.
  • Always valid tests (GAVI, the generalization of always valid inference, and mSPRT, the mixture sequential probability ratio test) are conservative when they are not performed after each observation.
  • Always valid inference guarantees a correctly bounded false positive rate.
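A false positive rate check of this kind can be sketched with a small simulation. The code below is an assumption-laden illustration rather than the simulation from the post: it generates standard normal data under a true null, peeks after every observation, and compares a naive repeated z-test with an mSPRT that uses a N(0, tau^2) mixing distribution. The function name, the number of replications, and the parameter defaults are all hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_false_positive_rates(n_reps=500, n_obs=200, alpha=0.05, tau=1.0, seed=1):
    """Estimate false positive rates under H0 when peeking after every observation."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    naive_rejections = 0
    msprt_rejections = 0
    for _ in range(n_reps):
        running_sum = 0.0
        naive_hit = False
        msprt_hit = False
        for n in range(1, n_obs + 1):
            running_sum += rng.gauss(0.0, 1.0)  # H0 is true: mean 0, sd 1
            # Naive peeking: the fixed-sample z-test repeated at every look.
            if abs(running_sum) / math.sqrt(n) > z_crit:
                naive_hit = True
            # mSPRT (sigma = 1): reject once the mixture likelihood ratio
            # exceeds 1 / alpha.
            denom = 1.0 + n * tau**2
            log_lr = -0.5 * math.log(denom) + (tau**2 * running_sum**2) / (2 * denom)
            if log_lr > math.log(1 / alpha):
                msprt_hit = True
        naive_rejections += naive_hit
        msprt_rejections += msprt_hit
    return naive_rejections / n_reps, msprt_rejections / n_reps


naive_fpr, msprt_fpr = simulate_false_positive_rates()
print(f"naive peeking: {naive_fpr:.3f}, mSPRT: {msprt_fpr:.3f}")
```

In a run like this, the naive test should reject far more often than the nominal level, while the mSPRT rejection rate should stay at or below it.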
  1. GST vs. AVI
  • GSTs are preferable when the expected sample size can be estimated accurately.
  • The AVI family of tests is a good choice when data is streamed and the sample size cannot be estimated accurately.
  • Probability of identifying an effect is higher with GST when analyzing streaming data in batches.
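The power comparison for batched data can be illustrated with a rough simulation. This sketch swaps in a classical O'Brien-Fleming group sequential design with five equally spaced looks (tabulated boundary constant 2.040 for a two-sided level of 0.05) as a stand-in for whatever GST variant the post uses, and compares it with an mSPRT evaluated at the same five batch boundaries. The effect size, batch size, and mixing parameter `tau` are hypothetical choices, and the mSPRT result depends on how `tau` is tuned.

```python
import math
import random

ALPHA = 0.05          # two-sided level; the boundary constant below is tied to it
N_LOOKS = 5           # equally spaced analyses, one per batch
OBF_CONSTANT = 2.040  # tabulated O'Brien-Fleming constant for 5 looks, alpha 0.05


def simulate_batch_power(effect=0.2, batch_size=40, n_reps=1000, tau=0.2, seed=1):
    """Estimate rejection rates of an O'Brien-Fleming GST and an mSPRT on batched data."""
    rng = random.Random(seed)
    gst_rejections = 0
    msprt_rejections = 0
    for _ in range(n_reps):
        running_sum = 0.0
        gst_hit = False
        msprt_hit = False
        for look in range(1, N_LOOKS + 1):
            # One batch of observations with mean `effect` and sd 1.
            running_sum += sum(rng.gauss(effect, 1.0) for _ in range(batch_size))
            n = look * batch_size
            z = running_sum / math.sqrt(n)
            # GST: the boundary shrinks toward OBF_CONSTANT at the final look.
            if abs(z) > OBF_CONSTANT * math.sqrt(N_LOOKS / look):
                gst_hit = True
            # mSPRT (sigma = 1), evaluated only at the batch boundaries.
            denom = 1.0 + n * tau**2
            log_lr = -0.5 * math.log(denom) + (tau**2 * running_sum**2) / (2 * denom)
            if log_lr > math.log(1 / ALPHA):
                msprt_hit = True
        gst_rejections += gst_hit
        msprt_rejections += msprt_hit
    return gst_rejections / n_reps, msprt_rejections / n_reps


gst_power, msprt_power = simulate_batch_power()
print(f"GST power: {gst_power:.3f}, mSPRT power: {msprt_power:.3f}")
```

Calling `simulate_batch_power(effect=0.0)` gives the corresponding false positive rates, which is a quick way to check that both procedures remain at or below the nominal level in this setup.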