The Kolmogorov-Smirnov test is defined by:
H\(_0\): The data follow a specified distribution
H\(_1\): The data do not follow the specified distribution
Test Statistic: The Kolmogorov-Smirnov test statistic is defined as
\[ D = \sup_x \left| F_n(x) - F(x) \right| \]
where \(F\) is the theoretical cumulative distribution function of the distribution being tested and \(F_n\) is the empirical cumulative distribution function of the sample. The hypothesized distribution must be continuous (i.e., no discrete distributions such as the binomial or Poisson), and it must be fully specified.
An attractive feature of this test is that the distribution of the K-S test statistic itself does not depend on the underlying cumulative distribution function being tested. Another advantage is that it is an exact test (the chi-square goodness-of-fit test depends on an adequate sample size for the approximations to be valid). Despite these advantages, the K-S test has several important limitations:
1. It only applies to continuous distributions.
2. It tends to be more sensitive near the center of the distribution than at the tails.
3. Perhaps the most serious limitation is that the distribution must be fully specified; if location, scale, and shape parameters are estimated from the data, the critical region of the K-S test is no longer valid.
Due to limitations 2 and 3 above, many analysts prefer to use the Anderson-Darling goodness-of-fit test. However, the Anderson-Darling test is only available for a few specific distributions.
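In R, the one-sample test is available as the base function ks.test. A minimal sketch, assuming invented data and parameter values (note that the hypothesised distribution is fully specified rather than estimated from the data):

# One-sample K-S test against a fully specified normal distribution.
# The data and the parameters (mean = 5, sd = 3) are invented.
x <- rnorm(100, mean = 5, sd = 3)
ks.test(x, "pnorm", mean = 5, sd = 3)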
The Anderson-Darling test is a statistical test of whether there is evidence that a given sample of data did not arise from a given probability distribution.
In its basic form, the test assumes that there are no parameters to be estimated in the distribution being tested, in which case the test and its set of critical values are distribution-free. However, the test is most often used in contexts where a family of distributions is being tested; in that case the parameters of the family must be estimated, and this must be accounted for by adjusting either the test statistic or its critical values.
When applied to testing whether a normal distribution adequately describes a set of data, it is one of the most powerful statistical tools for detecting most departures from normality.
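A minimal R sketch, using the ad.test function from the nortest package (installed below; the sample is invented):

# Anderson-Darling test of normality from the nortest package.
install.packages("nortest")   # one-off installation
library(nortest)
x <- rnorm(100, mean = 5, sd = 3)
ad.test(x)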
The base R function shapiro.test performs the Shapiro-Wilk test of normality.
> x <- rnorm(100, mean = 5, sd = 3)
> shapiro.test(x)
Shapiro-Wilk normality test
data:  x
W = 0.9818, p-value = 0.1834
In this case, the p-value is greater than 0.05, so we fail to reject the null hypothesis that the data set is normally distributed.
> y <- runif(100, min = 2, max = 4)
> shapiro.test(y)
Shapiro-Wilk normality test
data:  y
W = 0.9499, p-value = 0.0008215
In this case, the p-value is less than 0.05, so we reject the null hypothesis that the data set is normally distributed.
Statisticians have devised several ways to detect outliers. Grubbs’ test is particularly easy to follow. This method is also called the ESD method (extreme studentized deviate). The first step is to quantify how far the outlier is from the others. Calculate the ratio Z as the difference between the outlier and the mean divided by the SD:
\[ Z = \frac{\left| \mbox{outlier} - \mbox{mean} \right|}{\mbox{SD}} \]
If Z is large, the value is far from the others. Note that you calculate the mean and SD from all values, including the outlier.
Since 5% of the values in a Gaussian population are more than 1.96 standard deviations from the mean, your first thought might be to conclude that the outlier comes from a different population if Z is greater than 1.96. This approach only works if you know the population mean and SD from other data. Although this is rarely the case in experimental science, it is often the case in quality control. You know the overall mean and SD from historical data, and want to know whether the latest value matches the others. This is the basis for quality control charts.
When analyzing experimental data, you don’t know the SD of the population. Instead, you calculate the SD from the data. The presence of an outlier increases the calculated SD. Since the presence of an outlier increases both the numerator (the difference between the value and the mean) and the denominator (the SD of all values), Z does not get very large. In fact, no matter how the data are distributed, Z cannot get larger than \((N-1)/\sqrt{N}\), where N is the number of values. For example, if N = 3, Z cannot be larger than 1.155 for any set of values.
Grubbs and others have derived critical values for Z, which are tabulated below. The critical value increases with sample size, as expected.
If your calculated value of Z is greater than the critical value in the table, then the P value is less than 0.05. This means that there is less than a 5% chance that you’d encounter an outlier so far from the others (in either direction) by chance alone, if all the data were really sampled from a single Gaussian distribution. Note that the method only works for testing the most extreme value in the sample (if in doubt, calculate Z for all values, but only calculate a P value for Grubbs’ test from the largest value of Z).
Once you’ve identified an outlier, you may choose to exclude that value from your analyses. Or you may choose to keep the outlier, but use robust analysis techniques that do not assume that data are sampled from Gaussian populations.
If you decide to remove the outlier, you then may be tempted to run Grubbs’ test again to see if there is a second outlier in your data. If you do this, you cannot use the same table.
H\(_0\): There are no outliers in the data set
H\(_1\): There is exactly one outlier in the data set
install.packages("outliers")
library(outliers)
#Package Author : Lukasz Komsta (UMLUB, Poland)
grubbs.test(DAT002)
Calculate Z as shown above. Look up the critical value of Z in the table below, where N is the number of values in the group. If your value of Z is higher than the tabulated value, the P value is less than 0.05.
You can also calculate an approximate P value as follows. First convert Z to a statistic T that follows a Student t distribution:
\[ T = \sqrt{\frac{N(N-2)Z^2}{(N-1)^2 - NZ^2}} \]
where N is the number of values in the sample and Z is calculated for the suspected outlier as shown above. Look up the two-tailed P value for the Student t distribution with the calculated value of T and N-2 degrees of freedom. Using Excel, the formula is =TDIST(T,DF,2) (the ‘2’ is for a two-tailed P value).
Multiply the P value you obtain in the previous step by N. The result is an approximate P value for the outlier test. This P value is the chance of observing one point so far from the others if the data were all sampled from a Gaussian distribution. If Z is large, this P value will be very accurate. With smaller values of Z, the calculated P value may be too large.
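A minimal R sketch of this approximation (the function name grubbs_p and the data are invented for illustration; for routine use, the grubbs.test function shown earlier is preferable):

# Approximate Grubbs' P value for the most extreme observation in x.
grubbs_p <- function(x) {
  N <- length(x)
  Z <- max(abs(x - mean(x))) / sd(x)                      # extreme value, in SDs
  tstat <- sqrt(N * (N - 2) * Z^2 / ((N - 1)^2 - N * Z^2))
  p <- N * 2 * pt(tstat, df = N - 2, lower.tail = FALSE)  # two-tailed, times N
  min(p, 1)                                               # approximation can exceed 1
}
grubbs_p(c(7.3, 7.5, 7.6, 7.7, 8.0, 12.1))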
In statistics, Dixon’s Q test, or simply the Q test, is used for identification and rejection of outliers. This test should be used sparingly and never more than once in a data set. To apply a Q test for bad data, arrange the data in order of increasing values and calculate Q as defined:
\[ Q = \frac{\mbox{Gap}}{\mbox{Range}} \]
where the gap is the absolute difference between the outlier in question and the closest number to it, and the range is the difference between the largest and smallest values in the data set. If \(Q_{\mathrm{calculated}} > Q_{\mathrm{table}}\), then reject the questionable point.
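In R, Dixon’s Q test is implemented as dixon.test in the same outliers package used above for Grubbs’ test (the small data set below is invented; the test is intended for small samples):

# Dixon's Q test; 12.1 is the suspect point at the high end.
library(outliers)
dat <- c(7.3, 7.5, 7.6, 7.7, 8.0, 12.1)
dixon.test(dat)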
For a single sample of data, the Kolmogorov-Smirnov test is used to test whether or not the sample is consistent with a specified distribution function. (Not part of this course.) When there are two samples of data, it is used to test whether or not the two samples may reasonably be assumed to come from the same distribution. The null and alternative hypotheses are as follows:
H\(_0\): The two data sets are from the same distribution
H\(_1\): The data sets are not from the same distribution
Consider two sample data sets X and Y that are both normally distributed with similar means and variances.
> X=rnorm(16,mean=20,sd=5)
> Y=rnorm(18,mean=21,sd=4)
> ks.test(X,Y)
Two-sample Kolmogorov-Smirnov test
data: X and Y
D = 0.2153, p-value = 0.7348
alternative hypothesis: two-sided
Remark: It does not suffice that both data sets come from the same family of distributions; they must also have the same values for the defining parameters. Consider the case of two data sets, X and Z. Both are normally distributed, but with different mean values.
> X=rnorm(16,mean=20,sd=5)
> Z=rnorm(16,mean=14,sd=5)
> ks.test(X,Z)
Two-sample Kolmogorov-Smirnov test
data: X and Z
D = 0.5625, p-value = 0.0112
alternative hypothesis: two-sided
To construct a boxplot
Calculate Q1, the median, Q3 and the IQR.
Draw a horizontal line to represent the scale of measurement.
Draw a box just above the line with the right and left ends at Q1 and Q3.
Draw a line through the box at the location of the median.
Compute the lower fence, Q1 - 1.5 × IQR, and the upper fence, Q3 + 1.5 × IQR, and draw whiskers from the ends of the box to the most extreme observations that lie inside the fences. Any values below the lower fence or above the upper fence are classed as outliers.
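A short R sketch of these steps (the data are invented); base R’s boxplot function draws the same picture directly:

# Quartiles, IQR, fences, and the resulting boxplot.
dat <- c(2, 5, 6, 7, 7, 8, 9, 10, 11, 24)
q1  <- quantile(dat, 0.25)
q3  <- quantile(dat, 0.75)
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
dat[dat < lower_fence | dat > upper_fence]   # values classed as outliers
boxplot(dat, horizontal = TRUE)              # box, median line, whiskers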
What to Look for
Outliers are extreme values and can greatly influence your analysis. For that reason, you should check your data and make sure you have entered it correctly.
You also have the option of removing outliers, making a note that you have removed them, and presenting your analysis without them.
You are interested in how spread out or tightly packed the data are. The length of the whiskers and the position of the median in the box tell you this. Notice that 25% of the values in the boxplot are less than Q1 and this includes the outliers.
This is particularly true when the size of the samples is large (thanks to the Central Limit Theorem). Some deviations from normality can pose a problem for the t-test, specifically those that involve getting extreme scores more frequently than you would if the distribution were normal. Statistical software packages provide two statistical tests for deviation from normality: the ‘Kolmogorov-Smirnov’ family of tests and the ‘Shapiro-Wilk’ test. The ‘Kolmogorov-Smirnov’ test can be used to test if two data sets are distributed according to the same distribution. It can also be used to test if one data set comes from a specified distribution, such as the normal distribution. (As such, the normal distribution must be specified as an argument to the function.)
For the purposes of this module, we will only use a related test from the same family of goodness-of-fit tests, the ‘Anderson-Darling’ test of normality.
The test is not implemented in base R; using it requires the installation of the nortest package. We will look at packages in greater detail later in the semester.
The null hypothesis of both the Anderson-Darling and Shapiro-Wilk tests is that the population is normally distributed; the alternative hypothesis is that it is not normally distributed.
The quantile-quantile (Q-Q) plot is an excellent way to see whether the data deviate from normal (the plot can be set up to check for deviation from other distributions as well, but here we are only interested in the normal distribution).
The process used for creating a Q-Q plot involves determining what proportion of the ‘observed’ scores fall below any one score. Then the z-score that would cut off that proportion, if the data were normally distributed, is calculated, and finally that z-score (the ‘expected normal value’) is translated back into the original metric to see what raw score it corresponds to.
A scatter plot is then created that shows the relationship between the actual ‘observed’ values and the values that would be ‘expected’ if the data were normally distributed. If the data are normally distributed, the circles on the resulting plot (each circle representing a score) will form a straight line. A trend line can be added to the plot to assist in determining whether or not this relationship is linear.
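In base R the whole procedure is a two-liner (the sample below is invented):

# Normal Q-Q plot with a reference line through the quartiles.
x <- rnorm(100, mean = 5, sd = 3)
qqnorm(x)   # observed quantiles against expected normal quantiles
qqline(x)   # trend line for judging linearity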
If data deviate substantially from a Gaussian distribution, using a nonparametric test is not the only alternative. Consider transforming the data to create a Gaussian distribution. Transformations to reciprocals or logarithms are often helpful. Data can also fail a normality test because of the presence of an outlier, and removing that outlier can restore normality. The decision of whether to use a parametric or nonparametric test matters most with small data sets (since the power of nonparametric tests is so low). But with small data sets, normality tests have little power to detect non-normal distributions, so an automatic approach would give you false confidence.
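For instance (with invented lognormal data), a log transformation can turn a skewed sample into one that is consistent with normality:

# Lognormal data typically fail Shapiro-Wilk; log(x) is exactly
# normal here, so the transformed data typically pass.
x <- rlnorm(100, meanlog = 0, sdlog = 1)
shapiro.test(x)        # usually a very small p-value
shapiro.test(log(x))   # usually a large p-value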
With large data sets, normality tests can be too sensitive. A low p-value from a normality test tells you that there is strong evidence that the data are not sampled from an ideal normal distribution. But you already know that, as almost no scientifically relevant variables form an ideal normal distribution. What you want to know is whether the distribution deviates enough from the normal ideal to invalidate conventional statistical tests (which assume a Gaussian distribution). A normality test does not answer this question. With large data sets, trivial deviations from the ideal can lead to a small p-value.
If the outlier test identifies one or more values as being an outlier, we must consider the following questions.
Was the outlier value entered into the computer incorrectly? If the “outlier” is in fact a typo, fix it. It is always worth going back to the original data source, and checking that the outlier value entered into Prism is actually the value you obtained from the experiment. If the value was the result of calculations, check for math errors.
Is the outlier value scientifically impossible? Of course you should remove outliers from your data when the value is completely impossible. Examples include a negative weight, or an age (of a person) that exceeds 150 years. Those are clearly errors, and leaving erroneous values in the analysis would lead to nonsense results.
Is the assumption of a normal distribution dubious? Grubbs’ test assumes that all the values are sampled from a Gaussian distribution, with the possible exception of one (or a few) outliers from a different distribution. If the underlying distribution is not Gaussian, then the results of the outlier test are unreliable. It is especially important to beware of lognormal distributions. If the data are sampled from a lognormal distribution, you expect to find some very high values which can easily be mistaken for outliers. Removing these values would be a mistake.
Is the outlier value potentially scientifically interesting? If each value is from a different animal or person, identifying an outlier might be important. Just because a value is not from the same normal distribution as the rest doesn’t mean it should be ignored. An interesting phenomenon may have been discovered. Don’t discard the data as an outlier without considering if the observation is potentially scientifically interesting.
Do you have a policy on when to remove outliers? Ideally, removing an outlier should not be an ad hoc decision. In general, you should follow a policy, and apply that policy consistently.
If you are looking for two or more outliers, could masking be a problem? Masking is the name given to the problem where the presence of two (or more) outliers can make it harder to detect even a single outlier.
If you’ve answered no to all the questions above, there are two possibilities:
The suspect value came from the same normal population as the other values. You just happened to collect a value from one of the tails of that distribution.
The suspect value came from a different distribution than the rest. Perhaps it was due to a mistake, such as bad pipetting, a voltage spike, holes in filters, etc.
If you knew the first possibility was the case, you would keep the value in your analyses. Removing it would be a mistake. If you knew the second possibility was the case, you would remove it, since including an erroneous value in your analyses will give invalid results.
The problem, of course, is that you can never know for sure which of these possibilities is correct. An outlier test cannot answer that question for sure. Ideally, you should create a lab policy for how to deal with such data, and follow it consistently. If you don’t have a lab policy on removing outliers, here is a suggestion: analyze your data both with and without the suspected outlier. If the results are similar either way, you’ve got a clear conclusion. If the results are very different, then you are stuck. Without a consistent policy on when you remove outliers, you are likely to only remove them when it helps push the data towards the results you want.
Many statistical tests assume that you have sampled data from populations that follow a normal distribution. Biological data never follow a Gaussian distribution precisely, because a Gaussian distribution extends infinitely in both directions and so includes arbitrarily low negative numbers and arbitrarily high positive numbers. But many kinds of biological data follow a bell-shaped distribution that is approximately Gaussian.
Because statistical tests work well even if the distribution is only approximately Gaussian (especially with large samples), these tests are used routinely in many fields of science.
An alternative approach does not assume that data follow a Gaussian distribution. These tests, called nonparametric tests, are appealing because they require fewer assumptions about the distribution of the data. In this approach, values are ranked from low to high, and the analyses are based on the distribution of ranks. Often, the analysis will be one of a series of experiments. Since you want to analyze all the experiments the same way, you cannot rely on the results from a single normality test.
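As a concrete example of the rank-based approach (the two small samples below are invented), the Wilcoxon rank-sum test in base R is the usual nonparametric counterpart of the two-sample t-test:

# Rank-based comparison of two samples; no Gaussian assumption needed.
x <- c(12, 15, 9, 20, 14, 11)
y <- c(18, 22, 16, 25, 19, 21)
wilcox.test(x, y)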