Practical statistical analyses using MINITAB

(Roy Thompson, Geology & Geophysics Department)

1. Introduction

Measurements are of little use until they are 'analysed'. Data analysis includes (i) organising measurements into a meaningful order or into groups, (ii) reducing the data into manageable quantities, (iii) forming succinct descriptions of the main features of the data, and (iv) elucidating any anomalies for subsequent examination. Analysis is the step between obtaining data and applying it to solve practical problems.

The recent advent of spreadsheets and statistical/graphical software packages has transformed elementary statistical analysis from a rather mathematical subject, backed up by dense statistical tables, into a readily accessible technique. Spreadsheets such as EXCEL contain a number of statistical functions, and these can be very helpful. Much better, however, are statistical packages that have been specially written to be 'user friendly' and to include a comprehensive suite of up-to-date statistical methods. Numerous high-quality statistical packages are available in Edinburgh; good ones include MINITAB and Splus. MINITAB is a particularly easy package to learn and to use; it has excellent self-help facilities, has been well tested, includes modern statistical methods and is widely used both inside and outside the University. MINITAB is an ideal package for learning statistics.

2. MINITAB

MINITAB is available on the Geology & Geophysics PC network, on the KB centre PC network and on Edinburgh University mainframe machines such as holyrood. [The KB centre provides the best computing environment for MINITAB.] In the KB centre PC laboratory or on the Geology & Geophysics PC network you can start the Windows version of Minitab by double clicking on the Minitab icon, which is found in More Applications. On a mainframe machine, typing the command minitab starts the MINITAB package.

[On some of the older G&G PC machines it is necessary to use windows > main > control panel > enhanced virtual memory > change in order to increase the memory size before working with large data sets.]

3. MINITAB for Windows

The main window contains four subwindows:

Above the main Minitab window are a number of menus:

To end a Minitab windows session choose File followed by Exit.

4. Statistical analyses

a Summarising data

An important first step in any statistical analysis is simply to have a good look at the data. A histogram is a sensible starting point. In MINITAB simply select Histogram, in the Graph menu, then select the column of data to be analysed and click OK. If the data form a time series they should be plotted using Time-series plot. If the data involve paired observations a scatterplot can be produced using Plot. Exploratory data analysis (EDA) techniques are very useful in environmental and other studies where the data are not ideally distributed (i.e. not normally distributed). MINITAB has a particularly good suite of robust EDA functions. The command Boxplot in the EDA submenu of Stat produces a simple figure that encapsulates the main characteristics of a data set and provides a quick way to look at its overall shape.

b Typical values

Large data sets can be summarised by just two numbers, namely (i) a typical value (that characterises the centre of the distribution) and (ii) the spread of values about the centre. The boxplot uses the median (or middle value) as a measure of the centre of a distribution. The median is a better measure of the centre of a data set than the arithmetic mean when the data distribution is skewed or includes aberrant values. Both the median and the mean are calculated by the very useful MINITAB command describe, which lists all the main descriptive statistics of a data set. It is executed by Display Basic Statistics at the very top of the Basic statistics submenu of Stat.
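The effect of an aberrant value on the two measures of centre can be seen in a short sketch outside MINITAB. The Python below uses only the standard library, and the measurements are made up for illustration (the last one is a deliberately wild reading); it is not MINITAB output.

```python
import statistics

# Hypothetical measurements; the final value is a deliberately aberrant reading.
values = [4.1, 4.3, 4.4, 4.6, 4.8, 5.0, 19.5]

mean_value = statistics.mean(values)      # dragged upwards by the single outlier
median_value = statistics.median(values)  # middle value of the sorted data

print(f"mean   = {mean_value:.2f}")
print(f"median = {median_value:.2f}")
```

Here the median stays among the bulk of the readings, while the mean is larger than six of the seven values.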

c Variation

Because of the problem of natural variability we need to combine results from several measurements in order to obtain useful quantitative information. The dispersion of a data set can be estimated in at least three ways: from the total range, from the shape of the histogram, or by the standard deviation. All three methods aim to measure the amount of variation or spread in the data. The range is the difference between the largest and smallest numbers. While these maximum and minimum values are important, and so are included in boxplots, the range is not a particularly good measure of spread. The reason is that it is determined by only the two most extreme values, not by the bulk of the data, and so it is not particularly useful for assessing the scatter of all the data. A neat way around this difficulty is to use the inter-quartile range, that is the spread of the middle half of the data, rather than the full range. This excellent measure of scatter is used by MINITAB in Boxplot and is calculated as part of the describe command. The MINITAB command stdev calculates the classical measure of dispersion, the standard deviation. These basic statistics commands are to be found in the Column statistics submenu of the Calc menu.
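The three numerical measures of spread can be compared in a brief sketch. The Python below is illustrative only: the data are hypothetical, and quartile conventions differ between packages, so the quartiles returned by Python's statistics.quantiles should not be assumed to match MINITAB's in every case.

```python
import statistics

# Hypothetical measurements; the final value is a deliberately aberrant reading.
values = [4.1, 4.3, 4.4, 4.6, 4.8, 5.0, 19.5]

# Range: set entirely by the two most extreme values.
data_range = max(values) - min(values)

# Inter-quartile range: spread of the middle half of the data.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Sample standard deviation: the classical measure, sensitive to outliers.
sd = statistics.stdev(values)

print(f"range = {data_range:.1f}, IQR = {iqr:.1f}, st. dev. = {sd:.2f}")
```

The single wild reading inflates both the range and the standard deviation, whereas the inter-quartile range still reflects the scatter of the bulk of the data.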

When you have completed all calculations of the basic descriptive statistics, such as the median and interquartile range, a most important point to remember is that you must be careful how you express any results. They should only be given to the same level of precision as the least accurate measurement used.

d Relationships between variables

'Regression is perhaps one of the most widespread but mis-used <numerical> techniques in geology' Rock, 1988.

In environmental studies we frequently need to know whether two or more variables are related. If such a relationship can be expressed by a mathematical formula, we will then be able to use it for the purpose of making predictions. For example measurements on atmospheric CO2 concentrations can be used to predict the increased growth of plants. The reliability of any predictions naturally depends on the strength of the relationship between the variables included in the formula.

(i) Correlation

A very useful, often used, statistic that quantifies the link between two variables is the correlation coefficient, r, which has a value between +1 and -1. MINITAB uses the command corr to calculate r. It is found in the Basic statistics submenu of Stat.

When r is close to +1, an observation with a high value for one variable will likely have a high value for the other variable. When r is close to -1, a high value for one variable tends to go with a low value for the other. The correlation coefficient, r, is close to zero if there is little association between the variables. However, note that the converse may not be true. A zero coefficient does not guarantee independence, because r measures only straight-line association, and because there is always a possibility of an unexamined additional factor (technically called a confounding factor) influencing the relationship.

Warning: an association (high correlation) between changes in two variables is very indicative but not conclusive proof of some underlying significance to their relationship i.e. a high r value does not in itself establish a causal relationship between two variables.
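One way in which a near-zero r can coexist with a strong dependence is when the relationship is not a straight line. The sketch below computes r from its definition for two made-up data sets; the function name pearson_r is a hypothetical helper, not a MINITAB command.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]  # exact straight-line relationship
y_curved = [v ** 2 for v in x]     # exact, but curved, relationship

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)

print(f"straight line: r = {r_linear:.2f}")
print(f"parabola:      r = {r_curved:.2f}")
```

In the second case Y is completely determined by X, yet r is zero, which is one reason for always plotting a scatter diagram before interpreting a correlation coefficient.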

(ii) Regression

Regression is widely used to characterise and describe the relationship between two variables. MINITAB is very good for both simple and multiple regression analysis. MINITAB has a Regression submenu in Stat to perform the analyses.

The regression model takes the form

Y = a + bX + e

where a is the intercept, b the slope, and e the error term (the scatter of the observations about the line). This is the equation of a best fitting straight line. Here Y is the dependent (or response) variable, while X is the independent (or explanatory) variable, which theoretically is not subject to error. If you are using regression to predict a particular variable, treat it as the Y variable and regress it against the X (predictor) variable.
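For a single predictor, the least-squares estimates of a and b have simple closed forms: b is the sum of cross-products of deviations divided by the sum of squared X deviations, and a follows from the two means. The sketch below applies them to made-up data; fit_line is a hypothetical helper name, and the code is an illustration of ordinary least squares rather than a reproduction of MINITAB's output.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept a and slope b for Y = a + bX + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)

# Residuals: the estimated errors e, one per observation.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(f"Y = {a:.2f} + {b:.2f} X")
```

A useful check on any such calculation is that the least-squares residuals sum to zero (to within rounding).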

MINITAB calculates all the required regression coefficients and their confidence limits. It also generates all the test statistics needed for hypothesis testing and for prediction. Stepwise regression is an excellent method of simplifying multiple regression relationships.

Use Fitted line plot to generate a graph of the regression line and Options to display confidence bands.

A couple of warnings: (i) Before finding a regression line for a set of data points, check that the data are roughly distributed about a straight line by plotting a scatter diagram. If the data fall on a curved pattern (and so clearly cannot be sensibly modelled by a straight-line fit) they will need transforming before the regression analysis. (ii) After performing a regression analysis it is very important to examine the residuals in order to check that the regression model was indeed specified sensibly. Residuals should be normally distributed and should not show any systematic pattern when plotted against the predictor variable, X.

Finally, it is good practice to use only part of the available data (the so-called training set) to derive a regression equation, and then to use the remainder (the test set) for comparison with the regression equation. This approach gives a realistic indication of the reliability of the regression relationship. Minitab Regression offers a related check of model adequacy through its lack of fit option.
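The training-set and test-set idea can be sketched as follows. The data are hypothetical, the alternate-point split is just one simple choice, and the helper names fit_line and rmse are introduced here for illustration only.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares intercept and slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def rmse(x, y, a, b):
    """Root-mean-square difference between observations and the fitted line."""
    return math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / len(x))

# Hypothetical paired observations.
x = list(range(1, 13))
y = [3.6, 3.9, 4.6, 5.1, 5.4, 6.1, 6.4, 7.1, 7.6, 7.9, 8.6, 8.9]

# Alternate points go to the training set and the test set.
train_x, train_y = x[0::2], y[0::2]
test_x, test_y = x[1::2], y[1::2]

a, b = fit_line(train_x, train_y)       # fitted on the training set only
rmse_train = rmse(train_x, train_y, a, b)
rmse_test = rmse(test_x, test_y, a, b)  # judged on data the fit never saw

print(f"Y = {a:.2f} + {b:.2f} X")
print(f"RMSE (training) = {rmse_train:.3f}, RMSE (test) = {rmse_test:.3f}")
```

If the test-set error is much larger than the training-set error, the regression relationship is less reliable than the fit alone suggests.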

e Time series analysis

Definition of time series data by Sir R. A. Fisher:

'One damn thing after another'

Time series data, such as climate series, frequently occur in environmental studies. Methods of time-series analysis are mainly concerned with decomposing the variation into (i) the trend, (ii) seasonal variation or cyclic changes and (iii) the remaining 'irregular' fluctuations.

(i) Trend

A linear trend, or long term change, in a data series can be found by the regression method described in d above. The time series data should be regressed against time. Trend analysis in the Time-series submenu of Stat can also be used for forecasting.

(ii) Cyclic variations

Cyclic changes need very careful identification. MINITAB has a separate Time series analysis section in Stat for analysis in the time domain. The correlation coefficient approach of d (i) can be used very effectively with time-series data to produce correlations between successive observations. Such correlations are referred to as autocorrelations. MINITAB calculates autocorrelations for observations separated by various time-steps (or lags) and plots the correlogram (graph of correlation coefficient against lag) through the command Autocorrelation in the Time-series submenu of Stat. True cyclic fluctuations in a time-series stand out clearly in a correlogram as distinctive oscillations. In contrast, random time-series yield autocorrelations close to zero. Many environmental time-series show only short-term coherence, in which a high value tends to be followed by one or more high values, but no long-term cyclicity. Here the correlogram yields near-zero autocorrelations at high lags, but a high coefficient at lag 1. Decomposition will pick out a seasonal component.
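The calculation behind a correlogram can be sketched in a few lines. The function below uses the usual sample autocorrelation estimator (lagged cross-products about the overall mean, divided by the total sum of squares); the series is an artificial 12-step cycle, and the function name autocorrelation is a hypothetical helper rather than MINITAB's routine.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    total = sum((v - mean) ** 2 for v in series)
    lagged = sum((series[t] - mean) * (series[t + lag] - mean)
                 for t in range(n - lag))
    return lagged / total

# Artificial 'monthly' series: ten complete cycles of period 12.
cyclic = [math.sin(2 * math.pi * t / 12) for t in range(120)]

r_lag6 = autocorrelation(cyclic, 6)    # half a cycle apart: strongly negative
r_lag12 = autocorrelation(cyclic, 12)  # a full cycle apart: strongly positive

for lag in range(0, 25, 3):
    print(f"lag {lag:2d}: r = {autocorrelation(cyclic, lag):+.2f}")
```

The printed coefficients swing between positive and negative values with the period of the cycle, which is the oscillating pattern a correlogram shows for truly cyclic data.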

(iii) Smoothing

Irregular fluctuations can be removed from time series by smoothing. The MINITAB EDA function rsmooth is an easy-to-use and highly recommended technique that can be applied to a wide range of time-series using Resistant smooth in EDA. By default rsmooth automatically sets the degree of smoothing, and deals with any aberrant or outlying points. The rsmooth (resistant smoothing) method is based on the techniques of running medians and hanning.
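The two building blocks named above can be illustrated with a deliberately simplified sketch: one pass of running medians of three, followed by one pass of hanning (a weighted running mean with weights 1/4, 1/2, 1/4). This is an assumption-laden toy, not a reproduction of rsmooth, which combines several such passes; the end points are simply left unchanged here, and the data are made up.

```python
import statistics

def running_median3(series):
    """Replace each interior point by the median of itself and its two neighbours."""
    out = list(series)
    for i in range(1, len(series) - 1):
        out[i] = statistics.median(series[i - 1:i + 2])
    return out

def hanning(series):
    """Weighted running mean of three with weights 1/4, 1/2, 1/4 (ends unchanged)."""
    out = list(series)
    for i in range(1, len(series) - 1):
        out[i] = 0.25 * series[i - 1] + 0.5 * series[i] + 0.25 * series[i + 1]
    return out

# Hypothetical series with one aberrant spike.
raw = [10, 11, 12, 50, 13, 14, 15, 16]

after_medians = running_median3(raw)  # the spike is removed outright
smoothed = hanning(after_medians)     # remaining small steps are rounded off

print(after_medians)
print(smoothed)
```

The running median is what makes the smooth 'resistant': the spike of 50 is discarded rather than being averaged into its neighbours, as an ordinary running mean would do.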

f Testing for differences

A common use of statistical hypothesis testing is to compare the means of two or more samples. We have to use a statistical test for this apparently simple task because of the problem of variation. We need to be confident that any difference between the sample means reflects a real difference and is not just caused by chance variability. Minitab uses the 2 sample t test in Basic statistics for this job. A more robust alternative is the Mann-Whitney test in the Nonparametrics submenu. Hypothesis testing follows a counterintuitive logic: you start by assuming the opposite of what you hope to show (the null hypothesis, here that the underlying means are equal) and then ask how probable the observed difference would be under that assumption. For the two-sample t test a low p-value (conventionally below 0.05) means that such a difference would rarely arise by chance alone, so the sample means are judged to be significantly different from each other. Minitab also provides ANOVA for comparing many groups as well as 1 sample t tests for determining whether the mean of a single sample is different from an expected value.
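The logic of assuming 'no real difference' and then asking how surprising the data are can be made concrete with an exact permutation test. Note that this is a different technique from the t test that Minitab applies, swapped in here because it needs no statistical tables; the site names and measurements are hypothetical.

```python
from itertools import combinations
import statistics

# Hypothetical measurements from two sites.
site_a = [5.1, 4.9, 5.6, 5.8, 6.0]
site_b = [6.3, 6.9, 7.1, 6.5, 7.4]

observed = abs(statistics.mean(site_a) - statistics.mean(site_b))

# Under the null hypothesis the site labels are arbitrary, so every way of
# dealing the ten pooled values into two groups of five is equally likely.
pooled = site_a + site_b
indices = range(len(pooled))

count = 0  # relabellings at least as extreme as the observed one
total = 0  # all possible relabellings
for chosen in combinations(indices, len(site_a)):
    group_a = [pooled[i] for i in chosen]
    group_b = [pooled[i] for i in indices if i not in chosen]
    diff = abs(statistics.mean(group_a) - statistics.mean(group_b))
    if diff >= observed - 1e-12:
        count += 1
    total += 1

p_value = count / total
print(f"observed difference in means = {observed:.2f}")
print(f"p-value = {count}/{total} = {p_value:.4f}")
```

A small p-value says that a difference this large would rarely arise if the labels really were arbitrary, which is the same reading as a low p-value from the two-sample t test.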

5. References

R Ennos, 2000. 'Statistical and Data Handling Skills in Biology' Prentice Hall.

[Excellent introductory text on practical statistics. Contains a very simple and effective flow chart on how to choose the most appropriate statistical test.]

B.F. Ryan, B.L. Joiner & T.A. Ryan, 'Minitab Handbook' Duxbury Press.

[Very readable textbook about MINITAB.]

N.A. Weiss, 1995. 'Introductory Statistics', Addison-Wesley.

[Includes examples of the use of MINITAB in descriptive and inferential statistics, classical probability and regression analyses.]

www: There are many excellent descriptions of the practical use of MINITAB on the web. These can be easily found using the standard Netscape search facilities. For example, a search for +minitab +windows +introduction finds very useful documents on the servers of the Universities of Cardiff, Exeter and Glasgow.