Content
Basic Descriptive Statistics
Quick univariate summary
Univariate summary
Central tendency
Variance, standard deviation and spread
Skewness
Frequencies
Crosstabs
Quick univariate summary
Menu location: Analysis_Descriptive_Quick Summary.
This function provides rapid access to descriptive statistics for a worksheet column of data.
Shortcut: click the right mouse button when the mouse cursor is over the column of data you want to describe and you will be given summary statistics for that column, provided the Edit_Options menu item is set to "Column summary".
The statistics calculated here are a subset of those available through the Analysis_Descriptive_Descriptive Report menu function. If you want to calculate summary statistics for more than one column at a time then you must use the Analysis_Descriptive_Descriptive Report menu function.
For definitions of the statistics calculated, please see descriptive report.
Univariate summary
Menu locations:
Analysis_Descriptive_Univariate Summary;
Analysis_Descriptive_Weighted Univariate Summary.
This function provides measures of location and dispersion which describe the data in a worksheet column. You are given the number of observations, arithmetic mean, sum, variance, standard deviation, standard error of the arithmetic mean, coefficient of variance, confidence interval for the arithmetic mean, geometric mean, coefficient of skewness, coefficient of kurtosis, maximum, upper quartile, median, lower quartile, minimum and range for each selected variable. You can also choose to calculate an additional quantile, which is appended to the results listed above. Incalculable results are displayed as missing data using an asterisk (*).
If you select more than one column of data to describe then you are given an option to save the results to worksheet columns. Saved columns of results represent the statistics (mean, median etc.) and their rows represent the variables/columns you selected to describe.
Confidence limits (boundaries of the confidence interval) are given for the arithmetic mean. Please see quantile confidence interval for confidence intervals for the median and other measures of location.
Some related topics:
· central tendency
· variance, standard deviation and spread
· skewness
· normal distribution
· quantiles
· quantile confidence intervals
· histogram
Please refer to one of the general textbooks listed in the reference section for discussion of the application and relative merits of individual descriptive statistics.
Definitions
Valid data and missing data:
For each worksheet column that you select, the number of valid data is the number of cells that can be interpreted as numbers; the remaining cells, which cannot be interpreted as numbers (e.g. empty cells, asterisks or text labels), are counted as missing. The sample size used in the calculations below is the number of valid data.
Sum, mean, variance, standard deviation, standard error and variance coefficient:
- where S is the summation for all observations (xi) in a sample, x bar is the sample (arithmetic) mean, n is the sample size, s² is the sample variance, s is the sample standard deviation, sem is the standard error of the sample mean, the upper and lower CL are the confidence limits of the confidence interval for the mean, t(a, n-1) is the (100*a)% two-tailed quantile from the Student t distribution with n-1 degrees of freedom, and vc is the variance coefficient.
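Using the symbols just defined, the standard textbook forms of these statistics are as follows (they reproduce the standard deviation, standard error, confidence limits and variance coefficient shown in the Michelson example below):

\text{sum} = \sum_{i=1}^{n} x_i, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{s^2}

\text{sem} = \frac{s}{\sqrt{n}}, \qquad \text{CL} = \bar{x} \pm t_{a,\,n-1} \cdot \text{sem}, \qquad vc = \frac{s}{\bar{x}}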
Skewness and kurtosis:
- where S is the summation for all observations (xi) in a sample, x bar is the sample mean and n is the sample size. Note that there are other definitions of these coefficients used by some other statistical software. StatsDirect uses the standard definitions for which critical values are published in standard statistical tables (Pearson and Hartley, 1970; Stuart and Ord, 1994).
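The moment coefficients tabulated by Pearson and Hartley (1970) and by Stuart and Ord (1994) are the skewness coefficient √b1 and the kurtosis coefficient b2; assuming these are the definitions referred to above, they take the form:

\sqrt{b_1} = \frac{\frac{1}{n} \sum (x_i - \bar{x})^3}{\left[ \frac{1}{n} \sum (x_i - \bar{x})^2 \right]^{3/2}}, \qquad b_2 = \frac{\frac{1}{n} \sum (x_i - \bar{x})^4}{\left[ \frac{1}{n} \sum (x_i - \bar{x})^2 \right]^{2}}

On this definition the kurtosis of a sample from a normal distribution is approximately 3, which is consistent with the kurtosis of about 3.26 reported in the Michelson example below.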
Geometric mean:
The geometric mean is a useful measure of central tendency for samples that are log-normally distributed (i.e. the logarithms of the observations are from an approximately normal distribution). The geometric mean is not calculated for samples that contain negative values.
- where S is the summation for all observations (xi) in a sample, ln is the natural (base e) logarithm, exp is the exponent (anti-logarithm for base e), gm is the sample geometric mean and n is the sample size.
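In terms of the symbols above, the geometric mean is the back-transformed arithmetic mean of the log observations:

gm = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \ln x_i \right)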
Weights:
If weights are selected then the weights that you supply are first normalised so that they sum to the total number of observations n:
- where vi is a user-supplied weight and wi is the normalised weight.
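A rescaling with this property (the normalised weights wi sum to n) is:

w_i = \frac{n \, v_i}{\sum_{j=1}^{n} v_j}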
When weights are used, weighted forms of the mean, variance and moment calculations defined above are used in their place, with each observation contributing in proportion to its normalised weight.
Median, quartiles and range:
For samples that are not from an approximately normal distribution, for example when data are censored to remove very large and/or very small values, the following nonparametric statistics should be used in place of the arithmetic mean, its variance and the other parametric measures above.
Median (50th centile, quantile 0.5), lower quartile (25th centile, quantile 0.25) and upper quartile (75th centile, quantile 0.75) are defined generally as quantiles:
Two different quantile definitions (Weisberg, 1992; Gleason, 1997; Stuart and Ord, 1994) are used in the summary statistics: the first allows for weights and the second is the conventional quantile that is also used in the quantile confidence interval function:
Type 1
- where p is a proportion, Q is the pth quantile (e.g. the median is Q(0.5)), u is an observation from a sample after it has been ordered from smallest to largest value, n is the sample size, and w is a weight normalised so that it sums to n.
Type 2
- where p is a proportion, Q is the pth quantile (e.g. the median is Q(0.5)), fix is the integer part of a real number, h is the fractional part of order statistic i, u is an observation from a sample after it has been ordered from smallest to largest value and n is the sample size.
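As a sketch only: one common convention consistent with the symbols above (the exact index rule used by StatsDirect is an assumption here) takes i and h from p(n + 1) and interpolates between adjacent order statistics:

i = \operatorname{fix}\big( p(n+1) \big), \qquad h = p(n+1) - i, \qquad Q(p) = u_i + h \, (u_{i+1} - u_i)

With n = 100 and p = 0.5 this gives the mean of the 50th and 51st ordered values, which matches the median reported in the example below.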
Technical validation
The computational methods used in StatsDirect univariate summary statistics, including this function, provide 15 decimal places of precision. This is tested against known standards such as the reference data set used in the example below.
Example
Test workbook (Parametric worksheet: Michelson).
The data are 100 measurements of the speed of light in air (millions of meters per second) recorded by Michelson in 1879 (Dorsey, 1944). The American National Institute of Standards and Technology use these data as part of the Statistical Reference Datasets for testing statistical software (McCullough and Wilson, 1999; http://www.nist.gov/itl/div898/strd).
Open the test workbook and select the "Michelson" column. Choose descriptive report from the descriptive section of the analysis menu and click on OK when you see a list of descriptive statistics options.
Results from StatsDirect (with decimal places in Analysis_Options set to 12 and centile type 2 selected):
Descriptive statistics
Variables | Michelson |
Valid data | 100 |
Missing data | 0 |
Sum | 29985.24 |
Mean | 299.8524 |
Variance | 0.006242666667 |
Standard deviation | 0.079010547819 |
Variance coefficient | 0.000263498134 |
Standard error of mean | 0.007901054782 |
Upper 95% CL of mean | 299.868077406834 |
Lower 95% CL of mean | 299.836722593166 |
Geometric mean | 299.852389694496 |
Skewness | -0.01825961396 |
Kurtosis | 3.263530532311 |
Maximum | 300.07 |
Upper quartile | 299.895 |
Median | 299.85 |
Lower quartile | 299.805 |
Minimum | 299.62 |
Range | 0.45 |
Centile 95 | 299.98 |
Centile 5 | 299.73 |
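As an illustrative cross-check only (not StatsDirect code), a short Python sketch such as the one below computes broadly the same summary statistics for a single column of values; applied to the NIST Michelson reference data it should agree with the table above, apart from any difference in the quantile convention chosen. The function name and the small sample in the usage line are assumptions for illustration.

import math
import numpy as np
from scipy import stats

def univariate_summary(x, alpha=0.05):
    """Summary statistics broadly matching the quantities listed above."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mean = x.mean()
    var = x.var(ddof=1)                      # sample variance (n - 1 denominator)
    sd = math.sqrt(var)
    sem = sd / math.sqrt(n)                  # standard error of the mean
    t = stats.t.ppf(1 - alpha / 2, n - 1)    # two-tailed Student t quantile
    return {
        "n": n,
        "sum": x.sum(),
        "mean": mean,
        "variance": var,
        "standard deviation": sd,
        "variance coefficient": sd / mean,
        "standard error of mean": sem,
        "confidence interval for mean": (mean - t * sem, mean + t * sem),
        "geometric mean": math.exp(np.log(x).mean()) if (x > 0).all() else None,
        "skewness": stats.skew(x, bias=True),                     # moment coefficient (sqrt(b1) form)
        "kurtosis": stats.kurtosis(x, fisher=False, bias=True),   # b2 form (approximately 3 for normal data)
        "median": np.median(x),
        "range": x.max() - x.min(),
    }

# Example usage with a small illustrative sample (not the Michelson data):
print(univariate_summary([299.85, 299.74, 299.90, 299.94, 299.81]))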
Central tendency
The three common measures of central tendency of a distribution are the arithmetic mean, the median and the mode. Think of a distribution in terms of a histogram with many bars; a large sample from a normal distribution would describe a bell-shaped curve that is symmetrical. In a perfectly symmetrical, non-skewed distribution the mean, median and mode are equal. As distributions become more skewed the difference between these measures of central tendency gets larger.
The mode is the most commonly occurring value in a distribution, population or sample.
The mean (arithmetic mean) is the average (sum of observations / number of observations) in a distribution, sample or population. The mean is more sensitive to outliers than the median or mode.
The median is the middle value in a sorted distribution, sample or population. When there is an even number of observations the median is the mean of the two central values.
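As a minimal illustration (the data are made up for this sketch), Python's standard statistics module computes all three measures directly:

import statistics

data = [1, 2, 2, 3, 3, 3, 4, 9]          # small illustrative sample with one outlier
print(statistics.mean(data))              # arithmetic mean = 3.375 (pulled up by the outlier 9)
print(statistics.median(data))            # median = 3.0 (mean of the two central values here)
print(statistics.mode(data))              # mode = 3 (most frequent value)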
Variance, standard deviation and spread
The standard deviation (SD) is the most commonly used measure of the spread of values in a distribution. SD is calculated as the square root of the variance (the average squared deviation from the mean).
Variance in a population is:
[x is a value from the population, m is the mean of all x, n is the number of x in the population, S is the summation]
Variance is usually estimated from a sample drawn from a population. The unbiased estimate of population variance calculated from a sample is:
[x is an observation from the sample, x-bar is the sample mean, n - 1 (sample size minus 1) is the degrees of freedom, S is the summation]
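Written out with the symbols in the two bracketed notes above (using σ² for the population variance and s² for the sample estimate), these are:

\sigma^2 = \frac{\sum (x - m)^2}{n}, \qquad s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}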
The spread of a distribution is also referred to as dispersion and variability. All three terms mean the extent to which values in a distribution differ from one another.
SD is the best measure of spread for an approximately normal distribution. This is not the case when there are extreme values in a distribution or when the distribution is skewed; in these situations the interquartile range or semi-interquartile range are preferred measures of spread. The interquartile range is the difference between the 25th and 75th centiles. The semi-interquartile range is half of the difference between the 25th and 75th centiles. For any symmetrical (not skewed) distribution, half of its values will lie within one semi-interquartile range either side of the median, i.e. in the interquartile range. When distributions are approximately normal, SD is a better measure of spread because it is less susceptible to sampling fluctuation than the (semi-)interquartile range.
If a variable y is a linear transformation of x (y = a + bx) then the variance of y is b² times the variance of x and the standard deviation of y is |b| times the standard deviation of x.
The standard error of the mean is the expected value of the standard deviation of means of several samples; it is estimated from a single sample as:
[s is the standard deviation of the sample, n is the sample size]
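A small Python sketch (illustrative values only) showing the sample variance, standard deviation, standard error and the linear transformation rule described above:

import numpy as np

x = np.array([5.2, 4.8, 6.1, 5.5, 4.9, 5.8])   # illustrative sample

var_pop = x.var(ddof=0)        # population form: divide by n
var_sample = x.var(ddof=1)     # unbiased sample estimate: divide by n - 1
sd = np.sqrt(var_sample)
sem = sd / np.sqrt(x.size)     # standard error of the mean: s / sqrt(n)

# Linear transformation y = a + b*x: Var(y) = b^2 * Var(x), SD(y) = |b| * SD(x)
a, b = 2.0, -3.0
y = a + b * x
print(np.isclose(y.var(ddof=1), b**2 * var_sample))   # True
print(np.isclose(y.std(ddof=1), abs(b) * sd))         # True
print(var_pop, var_sample, sd, sem)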
See descriptive statistics.
Skewness
Skewness describes the asymmetry of a distribution. A skewed distribution therefore has one tail longer than the other.
A positively skewed distribution has a longer tail to the right.
A negatively skewed distribution has a longer tail to the left.
A distribution with no skew (e.g. a normal distribution) is symmetrical.
In a perfectly symmetrical, non-skewed distribution the mean, median and mode are equal. As distributions become more skewed the difference between these measures of central tendency gets larger.
Positively skewed distributions are more common than negatively skewed ones.
A coefficient of skewness for a sample is calculated by StatsDirect as:
- where xi is a sample observation, x bar is the sample mean and n is the sample size.
Skewed distributions cansometimes be "normalized" by transformation.
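As an illustration of this kind of transformation (simulated data, not from the help file), log-normally distributed values are positively skewed, and taking logarithms removes most of the skew:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=0.8, size=5000)   # positively skewed sample

print(stats.skew(x))           # clearly positive (long right tail)
print(stats.skew(np.log(x)))   # close to zero after the log transformation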
See descriptive statistics.
Frequencies
Menu location: Analysis_Frequencies.
This function gives the actual and relative values for frequency and cumulative frequency of observations in the samples you select. If you want the cumulative frequencies to represent order then sort the data before using this function.
Example
The following represent responses to an element of a questionnaire that used a Likert scale:
response
3
3
4
1
1
2
5
3
In order to analyse these data in StatsDirect, first enter them into a workbook column. Then select this column and choose the frequencies option of the analysis menu.
For this example:
N = 8
Value | Frequency | Relative % | Cumulative frequency | Cumulative relative % |
1 | 2 | 25 | 2 | 25 |
2 | 1 | 12.5 | 3 | 37.5 |
3 | 3 | 37.5 | 6 | 75 |
4 | 1 | 12.5 | 7 | 87.5 |
5 | 1 | 12.5 | 8 | 100 |
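A minimal Python sketch (standard library only; the column handling is assumed) that reproduces this frequency table from the responses above:

from collections import Counter

responses = [3, 3, 4, 1, 1, 2, 5, 3]
n = len(responses)
counts = Counter(responses)

cumulative = 0
print("Value  Freq  Rel%   CumFreq  CumRel%")
for value in sorted(counts):
    freq = counts[value]
    cumulative += freq
    print(f"{value:5d}  {freq:4d}  {100*freq/n:5.1f}  {cumulative:7d}  {100*cumulative/n:7.1f}")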
Crosstabs
Menu location: Analysis_Crosstabs.
This is a two or three way cross tabulation function. If you have two columns of numbers that correspond to different classifications of the same individuals then you can use this function to give a two way frequency table for the cross classification. This can be stratified by a third classification variable.
For two way crosstabs, StatsDirect offers a range of analyses appropriate to the dimensions of the contingency table. For more information see chi-square tests and exact tests.
For three way crosstabs, StatsDirect offers either odds ratio (for case-control studies) or relative risk (for cohort studies) meta-analyses for 2 by 2 by k tables, and generalised Cochran-Mantel-Haenszel tests for r by c by k tables.
Example
A database of test scores contains two fields of interest, sex (M=1, F=0) and grade of skin reaction to an antigen (none = 0, weak + = 1, strong + = 2). Here is a list of those fields for 10 patients:
Sex | Reaction |
0 | 0 |
1 | 1 |
1 | 2 |
0 | 2 |
1 | 2 |
0 | 1 |
0 | 0 |
0 | 1 |
1 | 2 |
1 | 0 |
In order to get a cross tabulation of these from StatsDirect you should enter these data in two workbook columns. Then choose crosstabs from the analysis menu.
For this example:
Sex \ Reaction | 0 | 1 | 2 |
0 | 2 | 2 | 1 |
1 | 1 | 1 | 3 |
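A short Python sketch (illustrative, not StatsDirect output) that builds the same two way table from the raw columns:

from collections import Counter

sex      = [0, 1, 1, 0, 1, 0, 0, 0, 1, 1]
reaction = [0, 1, 2, 2, 2, 1, 0, 1, 2, 0]

table = Counter(zip(sex, reaction))   # counts for each (sex, reaction) pair
for s in (0, 1):
    row = [table[(s, r)] for r in (0, 1, 2)]
    print(f"Sex {s}: {row}")          # Sex 0: [2, 2, 1]; Sex 1: [1, 1, 3]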
We could then proceed to an r by c (2 by 3) contingency table analysis to look for association between sex and reaction to this antigen:
Contingency table analysis
Observed | 2 | 2 | 1 | 5 |
% of row | 40% | 40% | 20% | |
% of col | 66.67% | 66.67% | 25% | 50% |
Observed | 1 | 1 | 3 | 5 |
% of row | 20% | 20% | 60% | |
% of col | 33.33% | 33.33% | 75% | 50% |
Total | 3 | 3 | 4 | 10 |
% of n | 30% | 30% | 40% |
TOTAL number of cells = 6
WARNING: 6 out of 6 cells have EXPECTATION < 5
NOMINAL INDEPENDENCE
Chi-square = 1.666667, DF = 2, P = 0.4346
G-square = 1.726092, DF = 2, P =0.4219
Fisher-Freeman-Halton exact P = 0.5714
ANOVA
Chi-square for equality of mean column scores = 1.5
DF = 2, P = 0.4724
LINEAR TREND
Sample correlation (r) = 0.361158
Chi-square for linear trend (M²) = 1.173913
DF = 1, P = 0.2786
NOMINAL ASSOCIATION
Phi = 0.408248
Pearson's contingency = 0.377964
Cramér's V = 0.408248
ORDINAL
Goodman-Kruskal gamma = 0.555556
Approximate test of gamma = 0: SE = 0.384107, P = 0.1481, 95% CI = -0.197281 to 1.308392
Approximate test of independence: SE = 0.437445, P = 0.2041, 95% CI = -0.301821 to 1.412932
Kendall tau-b = 0.348155
Approximate test of tau-b = 0: SE = 0.275596, P = 0.2065, 95% CI = -0.192002 to 0.888313
Approximate test of independence: SE = 0.274138, P = 0.2041, 95% CI = -0.189145 to 0.885455
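As a cross-check of the nominal independence test only (a sketch using scipy, not StatsDirect code), the Pearson chi-square statistic and P value above can be reproduced from the 2 by 3 table:

from scipy.stats import chi2_contingency

observed = [[2, 2, 1],
            [1, 1, 3]]
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof, p)   # approximately 1.6667, 2, 0.4346, matching the output above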