Here is a 10-step guide to running a proper A/B test. The steps are:

  1. Define your goal and form your hypotheses.
  2. Identify a control and a treatment.
  3. Identify key metrics to measure.
  4. Identify what data needs to be collected.
  5. Make sure that appropriate logging is in place to collect all necessary data.
  6. Determine how small a difference you would like to detect.
  7. Determine what fraction of visitors you want to be in the treatment.
  8. Run a power analysis to decide how much data you need to collect and how long you need to run the test.

     ```r
     # Proportion metric: power.prop.test() needs p1, p2, sig.level and power.
     # Continuous metric (the sd argument is left blank in these notes):
     # power.t.test(delta = 0.22, sd = sqrt(), sig.level = 0.05, power = 0.8, type = "two.sample")
     ```

  9. Run the test for AT LEAST this long.
  10. If this is your first time trying something new, run an A/A test (dummy test) simultaneously to check for systematic biases.
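
As a rough sketch of step 8 for a conversion-rate metric, the call below uses R's built-in `power.prop.test()`. The baseline rate, the target rate, and the daily traffic figure are made-up assumptions for illustration, not values from this guide. `power.prop.test()` also assumes equal group sizes, so an unequal split chosen in step 7 would need a different calculation.

```r
# Hypothetical inputs: replace with your own baseline and minimum detectable effect.
p_control      <- 0.10   # assumed baseline conversion rate
p_treatment    <- 0.12   # assumed rate we want to be able to detect
daily_visitors <- 1000   # assumed eligible visitors per day, split 50/50

# Solve for n (per group) at 5% significance and 80% power.
pa <- power.prop.test(p1 = p_control, p2 = p_treatment,
                      sig.level = 0.05, power = 0.8)

n_per_group <- ceiling(pa$n)
days_needed <- ceiling(2 * n_per_group / daily_visitors)

cat("Visitors needed per group:", n_per_group, "\n")
cat("Minimum days to run:", days_needed, "\n")
```

Rounding up with `ceiling()` keeps the estimate on the conservative side, which matches the "AT LEAST this long" rule in step 9.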

How to calculate the standard deviation (SD) for two samples

  1. The t tests
    Previously we have considered how to test the null hypothesis that there is no difference between the mean of a sample and the population mean, and no difference between the means of two samples. We obtained the difference between the means by subtraction, and then divided this difference by the standard error of the difference. If the difference is 1.96 times its standard error, or more, it is likely to occur by chance with a frequency of only 1 in 20, or less.
    With small samples, where more chance variation must be allowed for, these ratios are not entirely accurate because the uncertainty in estimating the standard error has been ignored. Some modification of the procedure of dividing the difference by its standard error is needed, and the technique to use is the t test. Its foundations were laid by WS Gosset, writing under the pseudonym “Student” so that it is sometimes known as Student’s t test. The procedure does not differ greatly from the one used for large samples, but is preferable when the number of observations is less than 60, and certainly when they amount to 30 or less.
    The application of the t distribution to the following four types of problem will now be considered.
    1. The calculation of a confidence interval for a sample mean.
    2. The mean and standard deviation of a sample are calculated and a value is postulated for the mean of the population. How significantly does the sample mean differ from the postulated population mean?
    3. The means and standard deviations of two samples are calculated. Could both samples have been taken from the same population?
    4. Paired observations are made on two samples (or in succession on one sample). What is the significance of the difference between the means of the two sets of observations?
    In each case the problem is essentially the same, namely to establish multiples of standard errors to which probabilities can be attached. These multiples are the number of times a difference can be divided by its standard error. We have seen that with large samples 1.96 times the standard error has a probability of 5% or less, and 2.576 times the standard error a probability of 1% or less (Appendix table A). With small samples these multiples are larger, and the smaller the sample the larger they become.
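
To see how the multiple grows as the sample shrinks, the short sketch below uses R's `qt()` to compute the two-sided 5% multiplier for a few degrees of freedom. The particular values of `df` are chosen only for illustration.

```r
# Two-sided 5% multipliers from the t distribution for several degrees of freedom.
df <- c(5, 10, 17, 30, 60, Inf)
t_mult <- qt(0.975, df = df)

# With infinite degrees of freedom the multiplier equals the Normal value, about 1.96.
print(data.frame(df = df, multiplier = round(t_mult, 3)))
```
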
    Confidence interval for the mean from a small sample
    A rare congenital disease, Everley’s syndrome, generally causes a reduction in concentration of blood sodium. This is thought to provide a useful diagnostic sign as well as a clue to the efficacy of treatment. Little is known about the subject, but the director of a dermatological department in a London teaching hospital is known to be interested in the disease and has seen more cases than anyone else. Even so, he has seen only 18. The patients were all aged between 20 and 44.
    The mean blood sodium concentration of these 18 cases was 115 mmol/l, with standard deviation of 12 mmol/l. Assuming that blood sodium concentration is Normally distributed what is the 95% confidence interval within which the mean of the total population of such cases may be expected to lie?
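    The data are set out as follows:
    Number of observations: 18
    Mean blood sodium concentration: 115 mmol/l
    Standard deviation: 12 mmol/l
    Standard error of the mean: SD/√n = 12/√18 = 2.83 mmol/l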
    To find the 95% confidence interval above and below the mean we now have to find a multiple of the standard error. In large samples we have seen that the multiple is 1.96 (Chapter 4). For small samples we use the table of t given in Appendix table B. As the sample becomes smaller t becomes larger for any particular level of probability. Conversely, as the sample becomes larger t becomes smaller and approaches the values given in table A, reaching them for infinitely large samples.
    Since the size of the sample influences the value of t, the size of the sample is taken into account in relating the value of t to probabilities in the table. Some useful parts of the full t table appear in Appendix table B. The left hand column is headed d.f. for “degrees of freedom”. The use of these was noted in the calculation of the standard deviation (Chapter 2). In practice the degrees of freedom amount in these circumstances to one less than the number of observations in the sample. With these data we have 18 - 1 = 17 d.f. This is because only 17 observations plus the total of all the observations are needed to specify the sample, the 18th being determined by subtraction.
    To find the number by which we must multiply the standard error to give the 95% confidence interval we enter table B at 17 in the left hand column and read across to the column headed 0.05 to discover the number 2.110. The 95% confidence intervals of the mean are now set as follows:
    Mean + 2.110 SE to Mean - 2.110 SE
    which gives us:
    115 - (2.110 x 2.83) to 115 + (2.110 x 2.83) or 109.03 to 120.97 mmol/l.
    We may then say, with a 95% chance of being correct, that the range 109.03 to 120.97 mmol/l includes the population mean.
    Likewise from Appendix table B the 99% confidence interval of the mean is as follows:
    Mean + 2.898 SE to Mean - 2.898 SE
    which gives:
    115 - (2.898 x 2.83) to 115 + (2.898 x 2.83) or 106.80 to 123.20 mmol/l.
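
The two intervals above can be reproduced without the printed table. This is a minimal sketch using the summary figures from the example (n = 18, mean 115, SD 12) and R's `qt()` in place of table B. The helper name `ci` is just an illustrative choice.

```r
n    <- 18    # number of patients
xbar <- 115   # mean blood sodium, mmol/l
s    <- 12    # standard deviation, mmol/l

se <- s / sqrt(n)   # standard error of the mean, about 2.83

# Two-sided confidence interval for the mean at the given confidence level.
ci <- function(level) {
  t_mult <- qt(1 - (1 - level) / 2, df = n - 1)
  c(lower = xbar - t_mult * se, upper = xbar + t_mult * se)
}

ci95 <- ci(0.95)
ci99 <- ci(0.99)
print(round(ci95, 2))
print(round(ci99, 2))
```
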
    Difference of sample mean from population mean (one sample t test)
    Estimations of plasma calcium concentration in the 18 patients with Everley’s syndrome gave a mean of 3.2 mmol/l, with standard deviation 1.1. Previous experience from a number of investigations and published reports had shown that the mean was commonly close to 2.5 mmol/l in healthy people aged 20-44, the age range of the patients. Is the mean in these patients abnormally high?
    We set the figures out as follows:
    Mean of general population: 2.5 mmol/l
    Mean of sample: 3.2 mmol/l
    Standard deviation of sample: 1.1 mmol/l
    Standard error of sample mean: SD/√n = 1.1/√18 = 0.26 mmol/l
    Difference between means: 2.5 - 3.2 = -0.7 mmol/l
    t = difference between means divided by standard error of sample mean = -0.7/0.26 = -2.69.
    Ignoring the sign of the t value, and entering table B at 17 degrees of freedom, we find that 2.69 comes between probability values of 0.02 and 0.01, in other words between 2% and 1%, and so 0.01 < P < 0.02. It is therefore unlikely that the sample with mean 3.2 came from the population with mean 2.5, and we may conclude that the sample mean is, at least statistically, unusually high. Whether it should be regarded clinically as abnormally high is something that needs to be considered separately by the physician in charge of that case.
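
The same one-sample test can be sketched in R from the summary figures alone. With the standard error left unrounded, t comes out at about 2.70 rather than 2.69, and the two-sided P value still falls between 0.01 and 0.02. The variable names are illustrative.

```r
n    <- 18    # number of patients
xbar <- 3.2   # sample mean plasma calcium, mmol/l
s    <- 1.1   # sample standard deviation, mmol/l
mu0  <- 2.5   # postulated population mean, mmol/l

se     <- s / sqrt(n)                         # standard error of the sample mean
t_stat <- (xbar - mu0) / se                   # difference divided by its standard error
p_two  <- 2 * pt(-abs(t_stat), df = n - 1)    # two-sided P value

cat("t =", round(t_stat, 2), " P =", round(p_two, 3), "\n")
```
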
    Difference between means of two samples
    Here we apply a modified procedure for finding the standard error of the difference between two means and testing the size of the difference by this standard error (see Chapter 5 for large samples). For large samples we used the standard deviation of each sample, computed separately, to calculate the standard error of the difference between the means. For small samples we calculate a combined standard deviation for the two samples.
    The assumptions are:
    1. that the data are quantitative and plausibly Normal
    2. that the two samples come from distributions that may differ in their mean value, but not in the standard deviation
    3. that the observations are independent of each other.
    The third assumption is the most important. In general, repeated measurements on the same individual are not independent. If we had 20 leg ulcers on 15 patients, then we have only 15 independent observations.
    The following example illustrates the procedure.
    The addition of bran to the diet has been reported to benefit patients with diverticulosis. Several different bran preparations are available, and a clinician wants to test the efficacy of two of them on patients, since favourable claims have been made for each. Among the consequences of administering bran that require testing is the transit time through the alimentary canal. Does it differ in the two groups of patients taking these two preparations?
    The null hypothesis is that the two groups come from the same population. By random allocation the clinician selects two groups of patients aged 40-64 with diverticulosis of comparable severity. Sample 1 contains 15 patients who are given treatment A, and sample 2 contains 12 patients who are given treatment B. The transit times of food through the gut are measured by a standard technique with marked pellets and the results are recorded, in order of increasing time, in Table 7.1.
    Table 7.1
    These data are shown in figure 7.1. The assumptions of approximate Normality and equality of variance are satisfied. The design suggests that the observations are indeed independent. Since it is possible for the difference in mean transit times for A-B to be positive or negative, we will employ a two sided test.
    Figure 7.1
    With treatment A the mean transit time was 68.40 h and with treatment B 83.42 h. What is the significance of the difference, 15.02 h?
    The procedure is as follows:
    1. Obtain the standard deviation in sample 1.
    2. Obtain the standard deviation in sample 2.
    3. Multiply the square of the standard deviation of sample 1 by the degrees of freedom, which is the number of subjects minus one (here 15 - 1 = 14).
    4. Repeat for sample 2 (here 12 - 1 = 11).
    5. Add the two together and divide by the total degrees of freedom (here 14 + 11 = 25). This gives the combined variance, and its square root is the combined standard deviation.
    Source: https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/7-t-tests
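
The steps for the combined standard deviation can be sketched in R as follows. Table 7.1 is not reproduced in these notes, so the two vectors below are made-up transit times used purely for illustration, and `pooled_sd` is a hypothetical helper name. The final lines cross-check the hand calculation against R's built-in `t.test()` with `var.equal = TRUE`.

```r
# Hypothetical transit times in hours; NOT the values from Table 7.1.
a <- c(52, 61, 64, 70, 75, 83, 90)
b <- c(58, 66, 79, 84, 88, 95, 101, 110)

pooled_sd <- function(x, y) {
  df_x <- length(x) - 1
  df_y <- length(y) - 1
  # Weight each sample variance by its degrees of freedom, then divide by the total.
  sqrt((df_x * var(x) + df_y * var(y)) / (df_x + df_y))
}

sp      <- pooled_sd(a, b)
se_diff <- sp * sqrt(1 / length(a) + 1 / length(b))   # SE of the difference in means
t_stat  <- (mean(a) - mean(b)) / se_diff

# Cross-check against R's built-in equal-variance t test.
builtin <- t.test(a, b, var.equal = TRUE)
cat("pooled SD:", round(sp, 2), " t:", round(t_stat, 3), "\n")
```

A combined standard deviation computed this way is also a natural candidate for the `sd` argument of `power.t.test()` when planning a test on a continuous metric.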

Testing two colors of a call-to-action button

    This is a classic A/B test: testing two colors of a call-to-action button. In this segment, we will go through each of the ten steps. In particular, we’ll demonstrate how to run a power analysis to determine how long to run the experiment for. This example calls out a nuance of A/B testing that can occur if users are not required to log in to access parts of your product. We will discuss intuitively how to understand the problem, the biases it could cause, and suggest various ways to account for it.

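If login is not required: one person can end up in both the control group and the treatment group. Such users are identified and removed at the end.
Why: the problem is worse when the groups are split unequally, because these users carry a different weight in each group. For example, 200 users who appear in a 500-person group and also in the 1,000-person control group account for 2/5 of the first group but only 1/5 of the second.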
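
A minimal sketch of that clean-up step, assuming each visitor can be tied to some identifier. The ID ranges below are invented to match the 200 / 500 / 1,000 figures in the note.

```r
# Hypothetical user IDs: 500 in treatment, 1000 in control, 200 appearing in both.
treatment_ids <- 1:500
control_ids   <- c(301:500, 501:1300)

overlap <- intersect(treatment_ids, control_ids)

# Share of each group made up of users who saw both variants.
share_treatment <- length(overlap) / length(treatment_ids)   # 200 / 500
share_control   <- length(overlap) / length(control_ids)     # 200 / 1000

# Drop the overlapping users from both groups before analysis.
treatment_clean <- setdiff(treatment_ids, overlap)
control_clean   <- setdiff(control_ids, overlap)

cat("Overlap share in treatment:", share_treatment, "\n")
cat("Overlap share in control:", share_control, "\n")
```
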

Using G*Power to calculate the sample size