
Statistical tests are a form of educated guess. Based on incomplete data - that is, data from only a subset of the population - they seek to draw conclusions. The incompleteness of the data guarantees that statistical tests will regularly lead to the wrong conclusion, particularly when there is little data or the differences being examined are small (i.e., to use the jargon, the tests are particularly inaccurate when ''power'' is low). Nevertheless, statistical tests are in widespread use because, when they are conducted correctly, they are the most educated guess available.
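The link between low ''power'' and unreliable conclusions can be checked by simulation. The sketch below (in Python, with made-up normally-distributed data; the group sizes, effect size, and significance cut-off are illustrative assumptions, not drawn from any real study) estimates how often a two-sample ''t''-test detects a true difference:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def share_significant(n_per_group, true_diff, alpha=0.05, n_sims=2000):
    """Estimate how often a two-sample t-test detects a real
    difference of size true_diff (in standard deviations)."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(true_diff, 1.0, n_per_group)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

# Small sample, small true difference: power is low, so the test
# misses the real effect most of the time.
print(share_significant(n_per_group=20, true_diff=0.2))   # roughly 0.1
print(share_significant(n_per_group=500, true_diff=0.2))  # roughly 0.9
</syntaxhighlight>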

All statistical tests make a large number of assumptions. When these assumptions are not satisfied the consequence is that the conclusions from statistical testing become less reliable. The more egregious the violation of the assumptions, the less accurate the conclusions.

== Null hypothesis ==

All statistical tests require a ''null hypothesis'' (see [[Formal Hypothesis Testing]]). From time to time the null hypotheses used in statistical tests are not sensible, and the consequence is that the ''p''-values are not meaningful. As an example, consider the table below, which is a special type of table known as a ''duplication matrix''. This table shows the same data in both the rows and the columns. Thus, the first column shows that 100% of people that consume Coke consume Coke (not surprisingly), 14% of people that consume Coke also consume Diet Coke, 25% of people that consume Coke also consume Coke Zero, etc.

[[File:DuplicationMatrix.png]]

The arrows on the table show the results of statistical tests. In all of these tests, the null hypothesis is ''independence'' between the rows and the columns (see [[Statistical Tests on Tables]] for a description of what this null hypothesis entails, although the description is non-technical and the word ''independence'' is not used). However, this null hypothesis is clearly not appropriate: the same data appears in the rows and the columns, so the two cannot be considered ''independent'' in any meaningful sense, and a different null hypothesis is required.

To appreciate how the incorrect null hypothesis renders the significance tests meaningless, focus on the top-left cell. It suggests that the finding that 100% of people that consume Coke also consume Coke is significantly high. However, this is a logically necessary conclusion and thus cannot, in any sense, be considered ''significant''. All the tests on this table are wrong; refer to Ehrenberg<ref>Ehrenberg, A. S. C. (1988). ''Repeat Buying: Facts, Theory and Applications''. New York: Oxford University Press.</ref> for a discussion of the appropriate null hypothesis for such data.
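The logical necessity of the diagonal is easy to see in code. The sketch below (in Python, with hypothetical 0/1 consumption data generated at random; the brands and sample size are illustrative only) builds a duplication matrix and shows that its diagonal is 100% by construction:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical 0/1 consumption data: rows are respondents, columns
# are brands (say, Coke, Diet Coke, Coke Zero).
rng = np.random.default_rng(0)
consume = rng.integers(0, 2, size=(500, 3))

# Duplication matrix: cell [i, j] is the share of brand-i consumers
# who also consume brand j.
brand_totals = consume.sum(axis=0).astype(float)
duplication = (consume.T @ consume) / brand_totals[:, None]

print(np.round(100 * duplication, 1))
# The diagonal is always 100%: consumers of a brand consume that
# brand by definition, so a "significance" test there is meaningless.
</syntaxhighlight>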

== Alpha (significance level or cut-off) ==

The key technical output of a significance test is the ''p''-value (see [[Formal Hypothesis Testing]]). This ''p''-value is then compared to some pre-specified cut-off, which is usually called <math>\alpha</math> (the Greek letter ''alpha''). For example, most studies use a cut-off of <math>\alpha = 0.05</math> and conclude that a test is ''significant'' if <math>p \le \alpha</math>.
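As a minimal illustration of this mechanical rule (in Python, with made-up data for two groups; the numbers carry no meaning beyond the example):

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(100, 15, size=40)  # hypothetical measurements
group_b = rng.normal(108, 15, size=40)

alpha = 0.05
p = stats.ttest_ind(group_a, group_b).pvalue
print(f"p = {p:.3f}; significant at the {alpha} cut-off: {p <= alpha}")
</syntaxhighlight>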

Having a standard rule such as this gives the veneer of rigor as it is a transparent and non-subjective process. Unfortunately, when statistical tests are an input into real-world decision making it is generally not ideal to use such a simple process. Rather, it is better to take into account the costs and risks associated with an incorrect conclusion.

Consider a simple problem like a milk company deciding whether or not to change the color of its milk packaging from white to blue. A study may find that a small increase in sales results if the color change is made, but the resulting ''p''-value may be 0.06. Using the 0.05 cut-off, the conclusion would be that color makes no difference. However, if it costs the company nothing to make the change, then there is no downside, and the company is better off making the change: perhaps the change in packaging will have no impact, in which case nothing is lost, and there is the possibility that making the change will result in a small increase in sales. Now, suppose instead that the ''p''-value is 0.04 but that the change in packaging will cost the company millions of dollars. In that situation it is likely best to conclude that there is no significant effect of the change in packaging, even though the ''p''-value is less than the 0.05 cut-off, as the 0.04 now means there is a non-zero chance that the company could spend millions of dollars for no gain (and the better course of action is for the company to increase the sample size of the study to see if this results in a smaller ''p''-value).
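This style of reasoning can be made concrete with a back-of-the-envelope calculation. The sketch below is in Python; the dollar figures are invented, and treating the ''p''-value as a rough stand-in for the probability of no real effect is an informal heuristic rather than a formally correct interpretation:

<syntaxhighlight lang="python">
def expected_gain(p_no_effect, gain_if_real, cost_of_change):
    """Expected value of making the change, treating the p-value as
    a rough stand-in for the probability that there is no real
    effect (an informal heuristic, not a formal decision analysis)."""
    return (1 - p_no_effect) * gain_if_real - cost_of_change

# Free change, p = 0.06: no cost, so any chance of a gain makes the
# change worthwhile despite being "not significant".
print(expected_gain(p_no_effect=0.06, gain_if_real=50_000,
                    cost_of_change=0))           # 47,000

# Costly change, p = 0.04: "significant", but a bad bet when the
# cost dwarfs the plausible gain.
print(expected_gain(p_no_effect=0.04, gain_if_real=50_000,
                    cost_of_change=2_000_000))   # -1,952,000
</syntaxhighlight>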

== Data collection process ==

Most statistical tests make an assumption known as ''simple random sampling'' (see [[Formal Hypothesis Testing]] for an example and definition).

Simple random sampling is a type of ''probability sampling'', which is a catch-all term for samples where everybody in the population has the potential to be included in a study, the probability of each person being included is known, and the mechanism by which people are included is well understood. Simple random sampling is the simplest type of probability sample. There are many others, such as cluster random sampling and stratified random sampling. When these other forms of sampling are used, different formulas are needed to compute statistical significance (the standard formulas taught in introductory statistics courses all assume the data is from a simple random sample).

In general, if a test assumes simple random sampling but one of these other probability sampling methods better describes how the data was collected, the computed ''p''-value will be smaller than the correctly-computed ''p''-value, and too many results will be concluded to be ''significant'' (a problem that is compounded by [[Multiple Comparisons (Post Hoc Testing)|Multiple Comparisons]]). There are some situations where alternatives to simple random sampling can result in ''p''-values being wrong in the other direction, but the nature of commercial research makes this possibility rare enough to be ignored.
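One common correction is to inflate the standard error by a ''design effect''. The sketch below (in Python; the sample, the observed proportion, and the design effect of 1.8 are all invented for illustration, and in a real cluster sample the design effect would be estimated from the data) shows how a result that looks significant under the simple-random-sampling formula can stop being significant once clustering is accounted for:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

# Hypothetical: 535 of 1,000 respondents prefer the new pack.
# Test the proportion against 50%, first assuming simple random
# sampling, then inflating the standard error by an assumed
# design effect (DEFF) to reflect cluster sampling.
n, successes, p0 = 1000, 535, 0.5
p_hat = successes / n

se_srs = np.sqrt(p0 * (1 - p0) / n)      # SRS standard error
z_srs = (p_hat - p0) / se_srs
p_srs = 2 * stats.norm.sf(abs(z_srs))    # about 0.027: "significant"

deff = 1.8                               # assumed design effect
se_clu = se_srs * np.sqrt(deff)          # clustering inflates the SE
z_clu = (p_hat - p0) / se_clu
p_clu = 2 * stats.norm.sf(abs(z_clu))    # about 0.099: not significant

print(f"SRS:     z = {z_srs:.2f}, p = {p_srs:.3f}")
print(f"Cluster: z = {z_clu:.2f}, p = {p_clu:.3f}")
</syntaxhighlight>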

Probability samples ''never'' occur in the real world. Only a tiny fraction of people are ever really available to participate in surveys; the rest are illiterate, in prison, unwilling to participate, too busy, not contactable, etc. Consequently, it is important to keep in mind that computed ''p''-values are always rough approximations based upon implausible assumptions. However, without making these implausible assumptions there is no way of drawing any conclusion at all, so the orthodoxy is to make such untestable assumptions but to proceed with a degree of caution. Nevertheless, it does not follow that, because no sample is ever really a probability sample, all samples are equally useful. The further a sample is from being a probability sample, the more dangerous it is to treat it as one.

== Statistical distribution ==

== Sample size ==

== Number of comparisons ==

== The absence of other information ==

== See also ==

[[Tests of Statistical Significance]]

== References ==

{{reflist}}
