Multiple Regression http://www.statsoft.com/TEXTBOOK/stmulreg.html http://statsoft.com/textbook/stathome.html General Purpose Computational Approach Least Squares The Regression Equation Unique Prediction and Partial Correlation Predicted and Residual Scores Residual Variance and R-square Interpreting the Correlation Coefficient R Assumptions, Limitations, and Practical Considerations Assumption of Linearity Normality Assumption Limitations Choice of the number of variables Multicollinearity and matrix ill-conditioning Fitting centered polynomial models The importance of residual analysis

Elementary Concepts in Statistics http://www.statsoft.com/TEXTBOOK/esc.html

Overview of Elementary Concepts in Statistics. In this introduction, we will briefly discuss those elementary statistical concepts that provide the necessary foundations for more specialized expertise in any area of statistical data analysis. The selected topics illustrate the basic assumptions of most statistical methods and/or have been demonstrated in research to be necessary components of one's general understanding of the "quantitative nature" of reality (Nisbett et al., 1987). Because of space limitations, we will focus mostly on the functional aspects of the concepts discussed, and the presentation will be very short. Further information on each of those concepts can be found in statistical textbooks. Recommended introductory textbooks are Kachigan (1986) and Runyon and Haber (1976); for a more advanced discussion of elementary theory and assumptions of statistics, see the classic books by Hays (1988) and Kendall and Stuart (1979). -------------------------------------------------------------------------------- What are variables? Correlational vs. experimental research Dependent vs. independent variables Measurement scales Relations between variables Why relations between variables are important Two basic features of every relation between variables What is "statistical significance" (p-value) How to determine that a result is "really" significant Statistical significance and the number of analyses performed Strength vs. reliability of a relation between variables Why stronger relations between variables are more significant Why significance of a relation between variables depends on the size of the sample Example: "Baby boys to baby girls ratio" Why small relations can be proven significant only in large samples Can "no relation" be a significant result? How to measure the magnitude (strength) of relations between variables Common "general format" of most statistical tests How the "level of statistical significance" is calculated Why the "Normal distribution" is important Illustration of how the normal distribution is used in statistical reasoning (induction) Are all test statistics normally distributed? How do we know the consequences of violating the normality assumption?
Basic Statistics http://statsoft.com/textbook/stathome.html Descriptive statistics "True" Mean and Confidence Interval Shape of the Distribution, Normality Correlations Purpose (What is Correlation?) Simple Linear Correlation (Pearson r) How to Interpret the Values of Correlations Significance of Correlations Outliers Quantitative Approach to Outliers Correlations in Non-homogeneous Groups Nonlinear Relations between Variables Measuring Nonlinear Relations Exploratory Examination of Correlation Matrices Casewise vs. Pairwise Deletion of Missing Data How to Identify Biases Caused by Pairwise Deletion of Missing Data Pairwise Deletion of Missing Data vs. Mean Substitution Spurious Correlations Are correlation coefficients "additive?" How to Determine Whether Two Correlation Coefficients are Significant t-test for independent samples Purpose, Assumptions Arrangement of Data t-test graphs More Complex Group Comparisons t-test for dependent samples Within-group Variation Purpose Assumptions Arrangement of Data Matrices of t-tests More Complex Group Comparisons Breakdown: Descriptive statistics by groups Purpose Arrangement of Data Statistical Tests in Breakdowns Other Related Data Analysis Techniques Post-Hoc Comparisons of Means Breakdowns vs. Discriminant Function Analysis Breakdowns vs. Frequency Tables Graphical breakdowns Frequency tables Purpose Applications Crosstabulation and stub-and-banner tables Purpose and Arrangement of Table 2x2 Table Marginal Frequencies Column, Row, and Total Percentages Graphical Representations of Crosstabulations Stub-and-Banner Tables Interpreting the Banner Table Multi-way Tables with Control Variables Graphical Representations of Multi-way Tables Statistics in crosstabulation tables Multiple responses/dichotomies

ANOVA/MANOVA http://www.statsoft.com/TEXTBOOK/stanman.html http://statsoft.com/textbook/stathome.html Basic Ideas The Partitioning of Sums of Squares Multi-Factor ANOVA Interaction Effects Complex Designs Between-Groups and Repeated Measures Incomplete (Nested) Designs Analysis of Covariance (ANCOVA) Fixed Covariates Changing Covariates Multivariate Designs: MANOVA/MANCOVA Between-Groups Designs Repeated Measures Designs Sum Scores versus MANOVA Contrast Analysis and Post hoc Tests Why Compare Individual Sets of Means? Contrast Analysis Post hoc Comparisons Assumptions and Effects of Violating Assumptions Deviation from Normal Distribution Homogeneity of Variances Homogeneity of Variances and Covariances Sphericity and Compound Symmetry Methods for Analysis of Variance

This chapter includes a general introduction to ANOVA and a discussion of the general topics in analysis of variance techniques, including repeated measures designs, ANCOVA, MANOVA, unbalanced and incomplete designs, contrast effects, post-hoc comparisons, assumptions, etc. For related topics, see also Variance Components (topics related to estimation of variance components in mixed model designs), Experimental Design/DOE (topics related to specialized applications of ANOVA in industrial settings), and Repeatability and Reproducibility Analysis (topics related to specialized designs for evaluating the reliability and precision of measurement systems). See also General Linear Models and General Regression Models; to analyze nonlinear models, see Generalized Linear Models.
Association Rules http://statsoft.com/textbook/stathome.html -------------------------------------------------------------------------------- Association Rules Introductory Overview Computational Procedures and Terminology Tabular Representation of Associations Graphical Representation of Associations Interpreting and Comparing Results --------------------------------------------------------------------------------

Association Rules Introductory Overview The goal of the techniques described in this section is to detect relationships or associations between specific values of categorical variables in large data sets. This is a common task in many data mining projects as well as in the data mining subcategory text mining. These powerful exploratory techniques have a wide range of applications in many areas of business practice and research - from the analysis of consumer preferences or human resource management to the history of language. These techniques enable analysts and researchers to uncover hidden patterns in large data sets, such as "customers who order product A often also order product B or C" or "employees who said positive things about initiative X also frequently complain about issue Y but are happy with issue Z." The implementation of the so-called a-priori algorithm (see Agrawal and Swami, 1993; Agrawal and Srikant, 1994; Han and Lakshmanan, 2001; see also Witten and Frank, 2000) allows you to rapidly process huge data sets for such associations, based on predefined "threshold" values for detection.

How association rules work. The usefulness of this technique for addressing unique data mining problems is best illustrated in a simple example. Suppose you are collecting data at the check-out cash registers at a large book store. Each customer transaction is logged in a database and consists of the titles of the books purchased by the respective customer, perhaps additional magazine titles and other gift items that were purchased, and so on. Hence, each record in the database will represent one customer (transaction), and may consist of a single book purchased by that customer, or it may consist of many (perhaps hundreds of) different items that were purchased, arranged in an arbitrary order depending on the order in which the different items (books, magazines, and so on) came down the conveyor belt at the cash register. The purpose of the analysis is to find associations between the items that were purchased, i.e., to derive association rules that identify the items and co-occurrences of different items that appear with the greatest (co-)frequencies. For example, you want to learn which books are likely to be purchased by a customer who you know already purchased (or is about to purchase) a particular book. This type of information could then quickly be used to suggest to the customer those additional titles. You may already be "familiar" with the results of these types of analyses if you are a customer of various on-line (Web-based) retail businesses; many times when making a purchase on-line, the vendor will suggest similar items (to the ones purchased by you) at the time of "check-out," based on some rules such as "customers who buy book title A are also likely to purchase book title B," and so on.

Unique data analysis requirements. Crosstabulation tables, and in particular Multiple Response tables, can be used to analyze data of this kind.
However, in cases when the number of different items (categories) in the data is very large (and not known ahead of time), and when the "factorial degree" of important association rules is not known ahead of time, these tabulation facilities may be too cumbersome to use, or simply not applicable: Consider once more the simple "bookstore" example discussed earlier. First, the number of book titles is practically unlimited. In other words, if we were to make a table in which each book title represented one dimension, and the purchase of that book (yes/no) represented the classes or categories for each dimension, then the complete crosstabulation table would be huge and sparse (consisting mostly of empty cells). Alternatively, we could construct all possible two-way tables from all items available in the store; this would allow us to detect two-way associations (association rules) between items. However, the number of tables that would have to be constructed would again be huge, most of the two-way tables would be sparse, and worse, if there were any three-way association rules "hiding" in the data, we would miss them completely. The a-priori algorithm implemented in Association Rules will not only automatically detect the relationships ("cross-tabulation tables") that are important (i.e., cross-tabulation tables that are not sparse, not consisting mostly of zeros), but also determine the factorial degree of the tables that contain the important association rules.

To summarize, Association Rules will allow you to find rules of the kind If X then (likely) Y, where X and Y can be single values, items, words, etc., or conjunctions of values, items, words, etc. (e.g., if (Car=Porsche and Gender=Male and Age<20) then (Risk=High and Insurance=High)). The program can be used to analyze simple categorical variables, dichotomous variables, and/or multiple response variables. The algorithm will determine association rules without requiring the user to specify the number of distinct categories present in the data, or any prior knowledge regarding the maximum factorial degree or complexity of the important associations. In a sense, the algorithm will construct cross-tabulation tables without the need to specify the number of dimensions for the tables, or the number of categories for each dimension. Hence, this technique is particularly well suited for data and text mining of huge databases.
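To make the counting behind such "if X then (likely) Y" rules concrete, here is a minimal sketch - not the a-priori algorithm itself - that tabulates item co-occurrences (support) and a simple rule confidence over a handful of hypothetical bookstore transactions; the item names and the threshold are made up for illustration:

```python
from itertools import combinations
from collections import Counter

# Hypothetical bookstore transactions (each record = items bought together).
transactions = [
    {"book_A", "book_B", "magazine_X"},
    {"book_A", "book_B"},
    {"book_A", "gift_C"},
    {"book_B", "magazine_X"},
    {"book_A", "book_B", "gift_C"},
]

n = len(transactions)
item_counts = Counter()
pair_counts = Counter()
for t in transactions:
    item_counts.update(t)
    pair_counts.update(combinations(sorted(t), 2))

# Report pairs whose support (co-occurrence frequency) exceeds a threshold,
# together with the confidence of the rule "if first item then second item".
min_support = 0.4
for (a, b), count in pair_counts.items():
    support = count / n
    if support >= min_support:
        confidence = count / item_counts[a]
        print(f"if {a} then {b}: support={support:.2f}, confidence={confidence:.2f}")
```

A tabulation like this is workable for a handful of items; the sparse, high-dimensional situations described above are exactly what the a-priori approach is designed to handle automatically.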
Boosting Trees for Regression and Classification http://statsoft.com/textbook/stathome.html Boosting Trees for Regression and Classification Introductory Overview Gradient Boosting Trees The Problem of Overfitting; Stochastic Gradient Boosting Stochastic Gradient Boosting Trees and Classification Large Numbers of Categories --------------------------------------------------------------------------------

Boosting Trees for Regression and Classification Introductory Overview The general computational approach of stochastic gradient boosting is also known by the names TreeNet (TM Salford Systems, Inc.) and MART (TM Jerill, Inc.). Over the past few years, this technique has emerged as one of the most powerful methods for predictive data mining. Some implementations of these powerful algorithms allow them to be used for regression as well as classification problems, with continuous and/or categorical predictors. Detailed technical descriptions of these methods can be found in Friedman (1999a, b) as well as Hastie, Tibshirani, & Friedman (2001).

Gradient Boosting Trees The algorithm for Boosting Trees evolved from the application of boosting methods to regression trees. The general idea is to compute a sequence of (very) simple trees, where each successive tree is built for the prediction residuals of the preceding tree. As described in the General Classification and Regression Trees Introductory Overview, this method will build binary trees, i.e., partition the data into two samples at each split node. Now suppose that you were to limit the complexity of the trees to 3 nodes only: a root node and two child nodes, i.e., a single split. Thus, at each step of the boosting trees algorithm, a simple (best) partitioning of the data is determined, and the deviations of the observed values from the respective means (residuals for each partition) are computed. The next 3-node tree will then be fitted to those residuals, to find another partition that will further reduce the residual (error) variance for the data, given the preceding sequence of trees. It can be shown that such "additive weighted expansions" of trees can eventually produce an excellent fit of the predicted values to the observed values, even if the specific nature of the relationships between the predictor variables and the dependent variable of interest is very complex (nonlinear in nature). Hence, the method of gradient boosting - fitting a weighted additive expansion of simple trees - represents a very general and powerful machine learning algorithm.
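The residual-fitting loop just described can be sketched in a few lines. The following illustration assumes numpy and scikit-learn's DecisionTreeRegressor as a stand-in for the simple 3-node trees, with simulated data; it shows the general idea, not the TreeNet/MART implementations:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)   # noisy nonlinear target

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from the overall mean
trees = []

for _ in range(100):
    residuals = y - prediction                       # what is still unexplained
    stump = DecisionTreeRegressor(max_depth=1)       # a 3-node tree: one split
    stump.fit(X, residuals)
    prediction += learning_rate * stump.predict(X)   # weighted additive update
    trees.append(stump)

print("training MSE:", np.mean((y - prediction) ** 2))
```

Each small tree explains a little of what the preceding trees left unexplained, which is exactly the "weighted additive expansion" described above.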
Canonical Analysis http://statsoft.com/textbook/stathome.html General Purpose Computational Methods and Results Assumptions General Ideas Sum Scores Canonical Roots/Variates Number of Roots Extraction of Roots --------------------------------------------------------------------------------

General Purpose There are several measures of correlation to express the relationship between two or more variables. For example, the standard Pearson product moment correlation coefficient (r) measures the extent to which two variables are related; there are various nonparametric measures of relationships that are based on the similarity of ranks in two variables; Multiple Regression allows one to assess the relationship between a dependent variable and a set of independent variables; Multiple Correspondence Analysis is useful for exploring the relationships between a set of categorical variables. Canonical Correlation is an additional procedure for assessing the relationship between variables. Specifically, this analysis allows us to investigate the relationship between two sets of variables. For example, an educational researcher may want to compute the (simultaneous) relationship between three measures of scholastic ability and five measures of success in school. A sociologist may want to investigate the relationship between two predictors of social mobility based on interviews and actual subsequent social mobility as measured by four different indicators. A medical researcher may want to study the relationship of various risk factors to the development of a group of symptoms. In all of these cases, the researcher is interested in the relationship between two sets of variables, and Canonical Correlation would be the appropriate method of analysis. In the following topics we will briefly introduce the major concepts and statistics in canonical correlation analysis. We will assume that you are familiar with the correlation coefficient as described in Basic Statistics, and the basic ideas of multiple regression as described in the overview section of Multiple Regression.

Computational Methods and Results Some of the computational issues involved in canonical correlation and the major results that are commonly reported will now be reviewed. Eigenvalues. When extracting the canonical roots, you will compute the eigenvalues. These can be interpreted as the proportion of variance accounted for by the correlation between the respective canonical variates. Note that the proportion here is computed relative to the variance of the canonical variates, that is, of the weighted sum scores of the two sets of variables; the eigenvalues do not tell how much variability is explained in either set of variables. You will compute as many eigenvalues as there are canonical roots, that is, as many as the minimum number of variables in either of the two sets. Successive eigenvalues will be of smaller and smaller size. You will first compute the weights that maximize the correlation of the two sum scores. After this first root has been extracted, you will find the weights that produce the second largest correlation between sum scores, subject to the constraint that the next set of sum scores does not correlate with the previous one, and so on.
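As a concrete illustration of extracting canonical variates and their correlations, here is a hedged sketch using scikit-learn's CCA on simulated data (the set sizes echo the educational-research example above; this is one of several possible implementations, not a particular program's output):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)
n = 300
latent = rng.normal(size=n)
# Set 1: three "scholastic ability" measures; Set 2: five "success in school" measures
X = np.column_stack([latent + rng.normal(scale=1.0, size=n) for _ in range(3)])
Y = np.column_stack([latent + rng.normal(scale=1.5, size=n) for _ in range(5)])

cca = CCA(n_components=2)
Xc, Yc = cca.fit_transform(X, Y)   # canonical variates (weighted sum scores)

# Correlation between the first pair of canonical variates = first canonical correlation
first_root_r = np.corrcoef(Xc[:, 0], Yc[:, 0])[0, 1]
print("first canonical correlation:", round(first_root_r, 3))
```

Squaring the canonical correlation of each pair of variates corresponds to the eigenvalue for that root, in the sense described above.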
CHAID Analysis http://statsoft.com/textbook/stathome.html General CHAID Introductory Overview Basic Tree-Building Algorithm: CHAID and Exhaustive CHAID General Computation Issues of CHAID CHAID, C&RT, and QUEST --------------------------------------------------------------------------------

General CHAID Introductory Overview The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree classification methods, originally proposed by Kass (1980; according to Ripley, 1996, the CHAID algorithm is a descendent of THAID, developed by Morgan and Messenger, 1973). CHAID will "build" non-binary trees (i.e., trees where more than two branches can attach to a single root or node), based on a relatively simple algorithm that is particularly well suited for the analysis of larger datasets. Also, because the CHAID algorithm will often effectively yield many multi-way frequency tables (e.g., when classifying a categorical response variable with many categories, based on categorical predictors with many classes), it has been particularly popular in marketing research, in the context of market segmentation studies. Both CHAID and C&RT techniques will construct trees, where each (non-terminal) node identifies a split condition, to yield optimum prediction (of continuous dependent or response variables) or classification (for categorical dependent or response variables). Hence, both types of algorithms can be applied to analyze regression-type or classification-type problems.

Basic Tree-Building Algorithm: CHAID and Exhaustive CHAID The acronym CHAID stands for Chi-squared Automatic Interaction Detector. This name derives from the basic algorithm that is used to construct (non-binary) trees, which for classification problems (when the dependent variable is categorical in nature) relies on the Chi-square test to determine the best next split at each step; for regression-type problems (continuous dependent variable) the program will actually compute F-tests. Specifically, the algorithm proceeds as follows:

Preparing predictors. The first step is to create categorical predictors out of any continuous predictors by dividing the respective continuous distributions into a number of categories with an approximately equal number of observations. For categorical predictors, the categories (classes) are "naturally" defined.

Merging categories. The next step is to cycle through the predictors to determine for each predictor the pair of (predictor) categories that is least significantly different with respect to the dependent variable; for classification problems (where the dependent variable is categorical as well), it will compute a Chi-square test (Pearson Chi-square); for regression problems (where the dependent variable is continuous), F-tests. If the respective test for a given pair of predictor categories is not statistically significant as defined by an alpha-to-merge value, then it will merge the respective predictor categories and repeat this step (i.e., find the next pair of categories, which now may include previously merged categories). If the test for the respective pair of predictor categories is statistically significant (less than the respective alpha-to-merge value), then (optionally) it will compute a Bonferroni adjusted p-value for the set of categories for the respective predictor.

Selecting the split variable. The next step is to choose as the split variable the predictor variable with the smallest adjusted p-value, i.e., the predictor variable that will yield the most significant split; if the smallest (Bonferroni) adjusted p-value for any predictor is greater than some alpha-to-split value, then no further splits will be performed, and the respective node is a terminal node. Continue this process until no further splits can be performed (given the alpha-to-merge and alpha-to-split values).
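The category-merging test at the heart of this procedure can be sketched with scipy's chi2_contingency on a hypothetical predictor-by-outcome count table; the counts and the alpha-to-merge value below are made up purely for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = two categories of one predictor, columns = outcome classes.
pair_table = np.array([
    [30, 25],   # predictor category "A"
    [28, 27],   # predictor category "B"
])

chi2, p_value, dof, _ = chi2_contingency(pair_table)
alpha_to_merge = 0.05

# If the two categories do not differ significantly with respect to the outcome,
# a CHAID-style step would merge them and re-test the reduced set of categories.
if p_value > alpha_to_merge:
    merged = pair_table.sum(axis=0)
    print("merge A and B ->", merged, f"(p = {p_value:.3f})")
else:
    print("keep categories separate", f"(p = {p_value:.3f})")
```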
Classification and Regression Trees (C&RT) http://statsoft.com/textbook/stathome.html C&RT Introductory Overview - Basic Ideas Computational Details Computational Formulas --------------------------------------------------------------------------------

Introductory Overview - Basic Ideas Overview C&RT builds classification and regression trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification). The classic C&RT algorithm was popularized by Breiman et al. (Breiman, Friedman, Olshen, & Stone, 1984; see also Ripley, 1996). A general introduction to tree-classifiers, specifically to the QUEST (Quick, Unbiased, Efficient Statistical Trees) algorithm, is also presented in the context of the Classification Trees Analysis facilities, and much of the following discussion presents the same information, in only a slightly different context. Another, similar type of tree building algorithm is CHAID (Chi-square Automatic Interaction Detector; see Kass, 1980).

Classification and Regression Problems There are numerous algorithms for predicting continuous variables or categorical variables from a set of continuous predictors and/or categorical factor effects. For example, in GLM (General Linear Models) and GRM (General Regression Models), you can specify a linear combination (design) of continuous predictors and categorical factor effects (e.g., with two-way and three-way interaction effects) to predict a continuous dependent variable. In GDA (General Discriminant Function Analysis), you can specify such designs for predicting categorical variables, i.e., to solve classification problems.

Regression-type problems. Regression-type problems are generally those where one attempts to predict the values of a continuous variable from one or more continuous and/or categorical predictor variables. For example, you may want to predict the selling prices of single family homes (a continuous dependent variable) from various other continuous predictors (e.g., square footage) as well as categorical predictors (e.g., style of home, such as ranch, two-story, etc.; zip code or telephone area code where the property is located, etc.; note that this latter variable would be categorical in nature, even though it would contain numeric values or codes). If you used simple multiple regression, or some general linear model (GLM) to predict the selling prices of single family homes, you would determine a linear equation for these variables that can be used to compute predicted selling prices. There are many different analytic procedures for fitting linear models (GLM, GRM, Regression), various types of nonlinear models (e.g., Generalized Linear/Nonlinear Models (GLZ), Generalized Additive Models (GAM), etc.), or completely custom-defined nonlinear models (see Nonlinear Estimation), where you can type in an arbitrary equation containing parameters to be estimated. CHAID also analyzes regression-type problems, and produces results that are similar (in nature) to those computed by C&RT. Note that various neural network architectures are also applicable to solve regression-type problems.

Classification-type problems. Classification-type problems are generally those where one attempts to predict values of a categorical dependent variable (class, group membership, etc.) from one or more continuous and/or categorical predictor variables. For example, you may be interested in predicting who will or will not graduate from college, or who will or will not renew a subscription. These would be examples of simple binary classification problems, where the categorical dependent variable can only assume two distinct and mutually exclusive values. In other cases, one might be interested in predicting which one of multiple different alternative consumer products (e.g., makes of cars) a person decides to purchase, or which type of failure occurs with different types of engines. In those cases there are multiple categories or classes for the categorical dependent variable. There are a number of methods for analyzing classification-type problems and for computing predicted classifications, either from simple continuous predictors (e.g., binomial or multinomial logit regression in GLZ), from categorical predictors (e.g., Log-Linear analysis of multi-way frequency tables), or both (e.g., via ANCOVA-like designs in GLZ or GDA). CHAID also analyzes classification-type problems, and produces results that are similar (in nature) to those computed by C&RT. Note that various neural network architectures are also applicable to solve classification-type problems.

Classification and Regression Trees (C&RT) In most general terms, the purpose of the analyses via tree-building algorithms is to determine a set of if-then logical (split) conditions that permit accurate prediction or classification of cases.
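A hedged sketch of such if-then split conditions, using scikit-learn's DecisionTreeClassifier on the well-known iris data purely as an illustration (not the C&RT program itself):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)   # keep the tree small
clf.fit(iris.data, iris.target)

# Print the fitted tree as explicit if-then split conditions.
print(export_text(clf, feature_names=list(iris.feature_names)))
```

The printed output is a nested set of if-then conditions of exactly the kind described above, each leading to a predicted class.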
Cluster Analysis http://statsoft.com/textbook/stathome.html General Purpose Statistical Significance Testing Area of Application Joining (Tree Clustering) Hierarchical Tree Distance Measures Amalgamation or Linkage Rules Two-way Joining Introductory Overview Two-way Joining k-Means Clustering Example Computations Interpretation of results EM (Expectation Maximization) Clustering Introductory Overview The EM Algorithm Finding the Right Number of Clusters in k-Means and EM Clustering: v-Fold Cross-Validation --------------------------------------------------------------------------------

General Purpose The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, to develop taxonomies. In other words, cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in such a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Given the above, cluster analysis can be used to discover structures in data without providing an explanation/interpretation. In other words, cluster analysis simply discovers structures in data without explaining why they exist.

We deal with clustering in almost every aspect of daily life. For example, a group of diners sharing the same table in a restaurant may be regarded as a cluster of people. In food stores, items of a similar nature, such as different types of meat or vegetables, are displayed in the same or nearby locations. There are countless examples in which clustering plays an important role. For instance, biologists have to organize the different species of animals before a meaningful description of the differences between animals is possible. According to the modern system employed in biology, man belongs to the primates, the mammals, the amniotes, the vertebrates, and the animals. Note how in this classification, the higher the level of aggregation, the less similar are the members in the respective class. Man has more in common with all other primates (e.g., apes) than with the more "distant" members of the mammals (e.g., dogs), etc. For a review of the general categories of cluster analysis methods, see Joining (Tree Clustering), Two-way Joining (Block Clustering), and k-Means Clustering. In short, whatever the nature of your business is, sooner or later you will run into a clustering problem of one form or another.
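As a minimal illustration of the k-means approach listed above, the following sketch (scikit-learn assumed, data simulated) assigns observations to a fixed number of clusters and reports the cluster sizes and means:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Simulated data: three groups of observations measured on two variables.
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(3, 3), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 4), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster means (centroids):\n", kmeans.cluster_centers_.round(2))
```

Choosing the number of clusters is a separate question; the v-fold cross-validation scheme mentioned in the topic list above addresses it.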
Correspondence Analysis http://statsoft.com/textbook/stathome.html General Purpose Supplementary Points Multiple Correspondence Analysis Burt Tables --------------------------------------------------------------------------------

General Purpose Correspondence analysis is a descriptive/exploratory technique designed to analyze simple two-way and multi-way tables containing some measure of correspondence between the rows and columns. The results provide information which is similar in nature to that produced by Factor Analysis techniques, and they allow one to explore the structure of categorical variables included in the table. The most common kind of table of this type is the two-way frequency crosstabulation table (see, for example, Basic Statistics or Log-Linear). In a typical correspondence analysis, a crosstabulation table of frequencies is first standardized, so that the relative frequencies across all cells sum to 1.0. One way to state the goal of a typical analysis is to represent the entries in the table of relative frequencies in terms of the distances between individual rows and/or columns in a low-dimensional space. This is best illustrated by a simple example, which will be described below. There are several parallels in interpretation between correspondence analysis and Factor Analysis, and some similar concepts will also be pointed out below. For a comprehensive description of this method, computational details, and its applications (in the English language), refer to the classic text by Greenacre (1984). These methods were originally developed primarily in France by Jean-Paul Benzécri in the early 1960s and 1970s (e.g., see Benzécri, 1973; see also Lebart, Morineau, and Tabard, 1977), but have only more recently gained increasing popularity in English-speaking countries (see, for example, Carrol, Green, and Schaffer, 1986; Hoffman and Franke, 1986). (Note that similar techniques were developed independently in several countries, where they were known as optimal scaling, reciprocal averaging, optimal scoring, quantification method, or homogeneity analysis.) In the following paragraphs, a general introduction to correspondence analysis will be presented.

Data Mining Techniques http://statsoft.com/textbook/stathome.html http://www.statsoft.com/TEXTBOOK/stdatmin.html Data Mining Crucial Concepts in Data Mining Data Warehousing On-Line Analytic Processing (OLAP) Exploratory Data Analysis (EDA) and Data Mining Techniques EDA vs. Hypothesis Testing Computational EDA Techniques Graphical (data visualization) EDA techniques Verification of results of EDA Neural Networks --------------------------------------------------------------------------------

Data Mining Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).

Discriminant Function Analysis http://statsoft.com/textbook/stathome.html General Purpose Computational Approach Stepwise Discriminant Analysis Interpreting a Two-Group Discriminant Function Discriminant Functions for Multiple Groups Assumptions Classification --------------------------------------------------------------------------------

General Purpose Discriminant function analysis is used to determine which variables discriminate between two or more naturally occurring groups. For example, an educational researcher may want to investigate which variables discriminate between high school graduates who decide (1) to go to college, (2) to attend a trade or professional school, or (3) to seek no further training or education.
For that purpose the researcher could collect data on numerous variables prior to students' graduation. After graduation, most students will naturally fall into one of the three categories. Discriminant Analysis could then be used to determine which variable(s) are the best predictors of students' subsequent educational choice. A medical researcher may record different variables relating to patients' backgrounds in order to learn which variables best predict whether a patient is likely to recover completely (group 1), partially (group 2), or not at all (group 3). A biologist could record different characteristics of similar types (groups) of flowers, and then perform a discriminant function analysis to determine the set of characteristics that allows for the best discrimination between the types.

Distribution Fitting http://statsoft.com/textbook/stathome.html General Purpose Fit of the Observed Distribution Types of Distributions Bernoulli Distribution Beta Distribution Binomial Distribution Cauchy Distribution Chi-square Distribution Exponential Distribution Extreme Value Distribution F Distribution Gamma Distribution Geometric Distribution Gompertz Distribution Laplace Distribution Logistic Distribution Log-normal Distribution Normal Distribution Pareto Distribution Poisson Distribution Rayleigh Distribution Rectangular Distribution Student's t Distribution Weibull Distribution --------------------------------------------------------------------------------

General Purpose In some research applications one can formulate hypotheses about the specific distribution of the variable of interest. For example, variables whose values are determined by an infinite number of independent random events will be distributed following the normal distribution: one can think of a person's height as being the result of very many independent factors such as numerous specific genetic predispositions, early childhood diseases, nutrition, etc. As a result, height tends to be normally distributed in the U.S. population. On the other hand, if the values of a variable are the result of very rare events, then the variable will be distributed according to the Poisson distribution (sometimes called the distribution of rare events). For example, industrial accidents can be thought of as the result of the intersection of a series of unfortunate (and unlikely) events, and their frequency tends to be distributed according to the Poisson distribution. These and other distributions are described in greater detail in the respective glossary topics.

Another common application where distribution fitting procedures are useful is when one wants to verify the assumption of normality before using some parametric test (see General Purpose of Nonparametric Tests). For example, you may want to use the Kolmogorov-Smirnov test for normality or the Shapiro-Wilks' W test to test for normality.
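A hedged sketch of both checks with scipy.stats, on simulated data (illustrative only; note that estimating the normal parameters from the same sample makes the Kolmogorov-Smirnov p-value approximate):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
heights = rng.normal(loc=170, scale=10, size=200)   # simulated, roughly normal data

# Shapiro-Wilk W test of normality
w_stat, w_p = stats.shapiro(heights)

# Kolmogorov-Smirnov test against a normal distribution with estimated parameters
ks_stat, ks_p = stats.kstest(heights, "norm",
                             args=(heights.mean(), heights.std(ddof=1)))

print(f"Shapiro-Wilk: W={w_stat:.3f}, p={w_p:.3f}")
print(f"Kolmogorov-Smirnov: D={ks_stat:.3f}, p={ks_p:.3f}")
```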
Experimental Design (Industrial DOE) http://statsoft.com/textbook/stathome.html DOE Overview Experiments in Science and Industry Differences in techniques Overview General Ideas Computational Problems Components of Variance, Denominator Synthesis Summary 2**(k-p) Fractional Factorial Designs Basic Idea Generating the Design The Concept of Design Resolution Plackett-Burman (Hadamard Matrix) Designs for Screening Enhancing Design Resolution via Foldover Aliases of Interactions: Design Generators Blocking Replicating the Design Adding Center Points Analyzing the Results of a 2**(k-p) Experiment Graph Options Summary 2**(k-p) Maximally Unconfounded and Minimum Aberration Designs Basic Idea Design Criteria Summary 3**(k-p), Box-Behnken, and Mixed 2 and 3 Level Factorial Designs Overview Designing 3**(k-p) Experiments An Example 3**(4-1) Design in 9 Blocks Box-Behnken Designs Analyzing the 3**(k-p) Design ANOVA Parameter Estimates Graphical Presentation of Results Designs for Factors at 2 and 3 Levels Central Composite and Non-Factorial Response Surface Designs Overview Design Considerations Alpha for Rotatability and Orthogonality Available Standard Designs Analyzing Central Composite Designs The Fitted Response Surface Categorized Response Surfaces Latin Square Designs Overview Latin Square Designs Analyzing the Design Very Large Designs, Random Effects, Unbalanced Nesting Taguchi Methods: Robust Design Experiments Overview Quality and Loss Functions Signal-to-Noise (S/N) Ratios Orthogonal Arrays Analyzing Designs Accumulation Analysis Summary Mixture designs and triangular surfaces Overview Triangular Coordinates Triangular Surfaces and Contours The Canonical Form of Mixture Polynomials Common Models for Mixture Data Standard Designs for Mixture Experiments Lower Constraints Upper and Lower Constraints Analyzing Mixture Experiments Analysis of Variance Parameter Estimates Pseudo-Components Graph Options Designs for constrained surfaces and mixtures Overview Designs for Constrained Experimental Regions Linear Constraints The Piepel & Snee Algorithm Choosing Points for the Experiment Analyzing Designs for Constrained Surfaces and Mixtures Constructing D- and A-optimal designs Overview Basic Ideas Measuring Design Efficiency Constructing Optimal Designs General Recommendations Avoiding Matrix Singularity "Repairing" Designs Constrained Experimental Regions and Optimal Design Special Topics Profiling Predicted Responses and Response Desirability Residuals Analysis Box-Cox Transformations of Dependent Variables

Principal Components and Factor Analysis http://statsoft.com/textbook/stathome.html General Purpose Basic Idea of Factor Analysis as a Data Reduction Method Factor Analysis as a Classification Method Miscellaneous Other Issues and Statistics --------------------------------------------------------------------------------

General Purpose The main applications of factor analytic techniques are: (1) to reduce the number of variables and (2) to detect structure in the relationships between variables, that is, to classify variables. Therefore, factor analysis is applied as a data reduction or structure detection method (the term factor analysis was first introduced by Thurstone, 1931). The topics listed below will describe the principles of factor analysis, and how it can be applied towards these two purposes.
We will assume that you are familiar with the basic logic of statistical reasoning as described in Elementary Concepts. Moreover, we will also assume that you are familiar with the concepts of variance and correlation; if not, we advise that you read the Basic Statistics chapter at this point.

There are many excellent books on factor analysis. For example, a hands-on how-to approach can be found in Stevens (1986); more detailed technical descriptions are provided in Cooley and Lohnes (1971); Harman (1976); Kim and Mueller (1978a, 1978b); Lawley and Maxwell (1971); Lindeman, Merenda, and Gold (1980); Morrison (1967); or Mulaik (1972). The interpretation of secondary factors in hierarchical factor analysis, as an alternative to traditional oblique rotational strategies, is explained in detail by Wherry (1984).

Confirmatory factor analysis. Structural Equation Modeling (SEPATH) allows you to test specific hypotheses about the factor structure for a set of variables, in one or several samples (e.g., you can compare factor structures across samples).

Correspondence analysis. Correspondence analysis is a descriptive/exploratory technique designed to analyze two-way and multi-way tables containing some measure of correspondence between the rows and columns. The results provide information which is similar in nature to that produced by factor analysis techniques, and they allow one to explore the structure of categorical variables included in the table. For more information regarding these methods, refer to Correspondence Analysis.

Basic Idea of Factor Analysis as a Data Reduction Method Suppose we conducted a (rather "silly") study in which we measure 100 people's height in inches and centimeters. Thus, we would have two variables that measure height. If in future studies, we want to research, for example, the effect of different nutritional food supplements on height, would we continue to use both measures? Probably not; height is one characteristic of a person, regardless of how it is measured. Let us now extrapolate from this "silly" study to something that one might actually do as a researcher. Suppose we want to measure people's satisfaction with their lives. We design a satisfaction questionnaire with various items; among other things we ask our subjects how satisfied they are with their hobbies (item 1) and how intensely they are pursuing a hobby (item 2). Most likely, the responses to the two items are highly correlated with each other. (If you are not familiar with the correlation coefficient, we recommend that you read the description in Basic Statistics - Correlations.) Given a high correlation between the two items, we can conclude that they are quite redundant.

Combining Two Variables into a Single Factor. One can summarize the correlation between two variables in a scatterplot. A regression line can then be fitted that represents the "best" summary of the linear relationship between the variables. If we could define a variable that would approximate the regression line in such a plot, then that variable would capture most of the "essence" of the two items. Subjects' single scores on that new factor, represented by the regression line, could then be used in future data analyses to represent that essence of the two items. In a sense, we have reduced the two variables to one factor. Note that the new factor is actually a linear combination of the two variables.
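A minimal numpy sketch of this two-item reduction, on simulated questionnaire responses (illustrative assumptions throughout; it shows the idea rather than any particular factor analysis program):

```python
import numpy as np

rng = np.random.default_rng(4)
# Two hypothetical questionnaire items driven by a common "satisfaction" component.
satisfaction = rng.normal(size=500)
item1 = satisfaction + rng.normal(scale=0.4, size=500)
item2 = satisfaction + rng.normal(scale=0.4, size=500)
X = np.column_stack([item1, item2])

# First principal component of the standardized items = the single summary factor.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigenvalues, eigenvectors = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
first_factor_scores = Z @ eigenvectors[:, -1]   # eigh returns eigenvalues in ascending order

print("correlation of the two items:", round(np.corrcoef(item1, item2)[0, 1], 2))
print("variance explained by the single factor:", round(eigenvalues[-1] / 2, 2))
```

The first eigenvector defines the linear combination (the factor), and its eigenvalue, divided by the number of items, is the proportion of the items' variance that this single factor captures.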
Principal Components Analysis. The example described above, combining two correlated variables into one factor, illustrates the basic idea of factor analysis, or of principal components analysis to be precise (we will return to this later). If we extend the two-variable example to multiple variables, then the computations become more involved, but the basic principle of expressing two or more variables by a single factor remains the same.

Extracting Principal Components. We do not want to go into the details about the computational aspects of principal components analysis here, which can be found elsewhere (references were provided at the beginning of this section). However, basically, the extraction of principal components amounts to a variance maximizing (varimax) rotation of the original variable space. For example, in a scatterplot we can think of the new factor as the original X axis rotated so that it approximates the regression line. This type of rotation is called variance maximizing because the criterion for (goal of) the rotation is to maximize the variance (variability) of the "new" variable (factor), while minimizing the variance around the new variable (see Rotational Strategies).

General Discriminant Analysis (GDA) http://statsoft.com/textbook/stathome.html Introductory Overview Advantages of GDA --------------------------------------------------------------------------------

Introductory Overview General Discriminant Analysis (GDA) is called a "general" discriminant analysis because it applies the methods of the general linear model (see also General Linear Models (GLM)) to the discriminant function analysis problem. A general overview of discriminant function analysis, and the traditional methods for fitting linear models with categorical dependent variables and continuous predictors, is provided in the context of Discriminant Analysis. In GDA, the discriminant function analysis problem is "recast" as a general multivariate linear model, where the dependent variables of interest are (dummy-) coded vectors that reflect the group membership of each case. The remainder of the analysis is then performed as described in the context of General Regression Models (GRM), with a few additional features noted below.

Advantages of GDA Specifying models for predictor variables and predictor effects. One advantage of applying the general linear model to the discriminant analysis problem is that you can specify complex models for the set of predictor variables. For example, you can specify, for a set of continuous predictor variables, a polynomial regression model, response surface model, factorial regression, or mixture surface regression (without an intercept). Thus, you could analyze a constrained mixture experiment (where the predictor variable values must sum to a constant), where the dependent variable of interest is categorical in nature. In fact, GDA does not impose any particular restrictions on the type of predictor variable (categorical or continuous) that can be used, or the models that can be specified. However, when using categorical predictor variables, caution should be used (see "A note of caution for models with categorical predictors, and other advanced techniques" below).

Stepwise and best-subset analyses. In addition to the traditional stepwise analyses for single continuous predictors provided in Discriminant Analysis, General Discriminant Analysis makes available the options for stepwise and best-subset analyses provided in General Regression Models (GRM).
Specifically, you can request stepwise and best-subset selection of predictors or sets of predictors (in multiple-degree-of-freedom effects, involving categorical predictors), based on the F-to-enter and p-to-enter statistics (associated with the multivariate Wilks' Lambda test statistic). In addition, when a cross-validation sample is specified, best-subset selection can also be based on the misclassification rates for the cross-validation sample; in other words, after estimating the discriminant functions for a given set of predictors, the misclassification rates for the cross-validation sample are computed, and the model (subset of predictors) that yields the lowest misclassification rate for the cross-validation sample is chosen. This is a powerful technique for choosing models that may yield good predictive validity, while avoiding overfitting of the data (see also Neural Networks).

Desirability profiling of posterior classification probabilities. Another unique option of General Discriminant Analysis (GDA) is the inclusion of Response/desirability profiler options. These options are described in some detail in the context of Experimental Design (DOE). In short, the predicted response values for each dependent variable are computed, and those values can be combined into a single desirability score. A graphical summary can then be produced to show the "behavior" of the predicted responses and the desirability score over the ranges of values for the predictor variables. In GDA, you can profile both simple predicted values (as in General Regression Models) for the coded dependent variables (i.e., dummy-coded categories of the categorical dependent variable), and you can also profile posterior prediction probabilities. This latter option allows you to evaluate how different values for the predictor variables affect the predicted classification of cases, and is particularly useful when interpreting the results for complex models that involve categorical and continuous predictors and their interactions.

A note of caution for models with categorical predictors, and other advanced techniques. General Discriminant Analysis provides functionality that makes this technique a general tool for classification and data mining. However, most -- if not all -- textbook treatments of discriminant function analysis are limited to simple and stepwise analyses with single degree of freedom continuous predictors. No "experience" (in the literature) exists regarding issues of robustness and effectiveness of these techniques when they are generalized in the manner provided in this very powerful analysis. The use of best-subset methods, in particular when used in conjunction with categorical predictors or when using the misclassification rates in a cross-validation sample for choosing the best subset of predictors, should be considered a heuristic search method rather than a statistical analysis technique.

The use of categorical predictor variables. The use of categorical predictor variables or effects in a discriminant function analysis model may be (statistically) questionable. For example, you can use GDA to analyze a 2 by 2 frequency table, by specifying one variable in the 2 by 2 table as the dependent variable, and the other as the predictor. Clearly, the (ab)use of GDA in this manner would be silly (although, interestingly, in most cases you will get results that are generally compatible with those you would get by computing a simple Chi-square test for the 2 by 2 table).
On the other hand, if you only consider the parameter estimates computed by GDA as the least squares solution to a set of linear (prediction) equations, then the use of categorical predictors in GDA is fully justified; moreover, it is not uncommon in applied research to be confronted with a mixture of continuous and categorical predictors (e.g., income or age, which are continuous, along with occupational status, which is categorical) for predicting a categorical dependent variable. In those cases, it can be very instructive to consider specific models involving the categorical predictors, and possibly interactions between categorical and continuous predictors, for classifying observations. However, to reiterate, the use of categorical predictor variables in discriminant function analysis is not widely documented, and you should proceed cautiously before accepting the results of statistical significance tests, and before drawing final conclusions from your analyses. Also remember that there are alternative methods available to perform similar analyses, namely, the multinomial logit models available in Generalized Linear Models (GLZ), and the methods for analyzing multi-way frequency tables in Log-Linear.

General Linear Models (GLM) http://statsoft.com/textbook/stathome.html http://www.statsoft.com/TEXTBOOK/stglm.html Basic Ideas: The General Linear Model Historical background The purpose of multiple regression Computations for solving the multiple regression equation Extension of multiple regression to the general linear model The sigma-restricted vs. overparameterized model Summary of computations Types of Analyses Between-subject designs Within-subject (repeated measures) designs Multivariate designs Estimation and Hypothesis Testing Whole model tests Six types of sums of squares Error terms for tests Testing specific hypotheses Testing hypotheses for repeated measures and dependent variables --------------------------------------------------------------------------------

This chapter describes the use of the general linear model in a wide variety of statistical analyses. If you are unfamiliar with the basic methods of ANOVA and regression in linear models, it may be useful to first review the basic information on these topics in Elementary Concepts. A detailed discussion of univariate and multivariate ANOVA techniques can also be found in the ANOVA/MANOVA chapter.

Basic Ideas: The General Linear Model The following topics summarize the historical, mathematical, and computational foundations for the general linear model. For a basic introduction to ANOVA (MANOVA, ANCOVA) techniques, refer to ANOVA/MANOVA; for an introduction to multiple regression, see Multiple Regression; for an introduction to the design and analysis of experiments in applied (industrial) settings, see Experimental Design.

Historical Background The roots of the general linear model surely go back to the origins of mathematical thought, but it is the emergence of the theory of algebraic invariants in the 1800s that made the general linear model, as we know it today, possible. The theory of algebraic invariants developed from the groundbreaking work of 19th century mathematicians such as Gauss, Boole, Cayley, and Sylvester. The theory seeks to identify those quantities in systems of equations which remain unchanged under linear transformations of the variables in the system.
Stated more imaginatively (but in a way in which the originators of the theory would not consider an overstatement), the theory of algebraic invariants searches for the eternal and unchanging amongst the chaos of the transitory and the illusory. That is no small goal for any theory, mathematical or otherwise. The wonder of it all is that the theory of algebraic invariants was successful far beyond the hopes of its originators. Eigenvalues, eigenvectors, determinants, matrix decomposition methods: all derive from the theory of algebraic invariants. The contributions of the theory of algebraic invariants to the development of statistical theory and methods are numerous, but a simple example familiar to even the most casual student of statistics is illustrative. The correlation between two variables is unchanged by linear transformations of either or both variables. We probably take this property of correlation coefficients for granted, but what would data analysis be like if we did not have statistics that are invariant to the scaling of the variables involved? Some thought on this question should convince you that without the theory of algebraic invariants, the development of useful statistical techniques would be nigh impossible.

The development of the linear regression model in the late 19th century, and the development of correlational methods shortly thereafter, are clearly direct outgrowths of the theory of algebraic invariants. Regression and correlational methods, in turn, serve as the basis for the general linear model. Indeed, the general linear model can be seen as an extension of linear multiple regression for a single dependent variable. Understanding the multiple regression model is fundamental to understanding the general linear model, so we will look at the purpose of multiple regression, the computational algorithms used to solve regression problems, and how the regression model is extended in the case of the general linear model. A basic introduction to multiple regression methods and the analytic problems to which they are applied is provided in Multiple Regression.

Generalized Additive Models (GAM) http://statsoft.com/textbook/stathome.html Additive models Generalized linear models Distributions and link functions Generalized additive models Estimating the non-parametric function of predictors via scatterplot smoothers A specific example: The generalized additive logistic model Fitting generalized additive models Interpreting the results Degrees of freedom A Word of Caution --------------------------------------------------------------------------------

The methods available in Generalized Additive Models are implementations of techniques developed and popularized by Hastie and Tibshirani (1990). A detailed description of these and related techniques, the algorithms used to fit these models, and discussions of recent research in this area of statistical modeling can also be found in Schimek (2000).

Additive models. The methods described in this section represent a generalization of multiple regression (which is a special case of general linear models). Specifically, in linear regression, a linear least-squares fit is computed for a set of predictor or X variables, to predict a dependent Y variable. The well-known linear regression equation with m predictors, to predict a dependent variable Y, can be stated as: Y = b0 + b1*X1 + ... + bm*Xm
where Y stands for the (predicted values of the) dependent variable, X1 through Xm represent the m values for the predictor variables, and b0 and b1 through bm are the regression coefficients estimated by multiple regression. A generalization of the multiple regression model would be to maintain the additive nature of the model, but to replace the simple terms of the linear equation bi*Xi with fi(Xi), where fi is a non-parametric function of the predictor Xi. In other words, instead of a single coefficient for each variable (additive term) in the model, in additive models an unspecified (non-parametric) function is estimated for each predictor, to achieve the best prediction of the dependent variable values. Generalized linear models. To summarize the basic idea, the generalized linear model differs from the general linear model (of which multiple regression is a special case) in two major respects: First, the distribution of the dependent or response variable can be (explicitly) non-normal, and does not have to be continuous, e.g., it can be binomial; second, the dependent variable values are predicted from a linear combination of predictor variables, which are "connected" to the dependent variable via a link function. The general linear model for a single dependent variable can be considered a special case of the generalized linear model: In the general linear model the dependent variable values are expected to follow the normal distribution, and the link function is a simple identity function (i.e., the linear combination of values for the predictor variables is not transformed). Generalized Linear Models (GLZ) http://www.statsoft.com/TEXTBOOK/stglz.html Basic Ideas Computational Approach Types of Analyses Between-subject Designs Model Building Interpretation of Results and Diagnostics -------------------------------------------------------------------------------- This chapter describes the use of the generalized linear model for analyzing linear and non-linear effects of continuous and categorical predictor variables on a discrete or continuous dependent variable. If you are unfamiliar with the basic methods of regression in linear models, it may be useful to first review the basic information on these topics in the Elementary Concepts chapter. Discussion of the ways in which the linear regression model is extended by the general linear model can be found in the General Linear Models chapter. For additional information about generalized linear models, see also Dobson (1990), Green and Silverman (1994), or McCullagh and Nelder (1989). General Regression Models (GRM) http://statsoft.com/textbook/stathome.html http://www.statsoft.com/TEXTBOOK/stgrm.html Basic Ideas: The Need for Simple Models Model Building in GSR Types of Analyses Between Subject Designs Multivariate Designs Building the Whole Model Partitioning Sums of Squares Testing the Whole Model Limitations of Whole Models Building Models via Stepwise Regression Building Models via Best-Subset Regression -------------------------------------------------------------------------------- This chapter describes the use of the general linear model for finding the "best" linear model from a number of possible models. If you are unfamiliar with the basic methods of ANOVA and regression in linear models, it may be useful to first review the basic information on these topics in Elementary Concepts.
A detailed discussion of univariate and multivariate ANOVA techniques can also be found in the ANOVA/MANOVA chapter; a discussion of multiple regression methods is also provided in the Multiple Regression chapter. Discussion of the ways in which the linear regression model is extended by the general linear model can be found in the General Linear Models chapter. -------------------------------------------------------------------------------- Basic Ideas: The Need for Simple Models A good theory is the end result of a winnowing process. We start with a comprehensive model that includes all conceivable, testable influences on the phenomena under investigation. Then we test the components of the initial comprehensive model, to identify the less comprehensive submodels that adequately account for the phenomena under investigation. Finally, from these candidate submodels, we single out the simplest submodel, which by the principle of parsimony we take to be the "best" explanation for the phenomena under investigation. We prefer simple models not just for philosophical but also for practical reasons. Simple models are easier to put to the test again in replication and cross-validation studies. Simple models are less costly to put into practice in predicting and controlling the outcome in the future. The philosophical reasons for preferring simple models should not be downplayed, however. Simpler models are easier to understand and appreciate, and therefore have a "beauty" that their more complicated counterparts often lack. The entire winnowing process described above is encapsulated in the model-building techniques of stepwise and best-subset regression. The use of these model-building techniques begins with the specification of the design for a comprehensive "whole model." Less comprehensive submodels are then tested to determine if they adequately account for the outcome under investigation. Finally, the simplest of the adequate submodels is adopted as the "best." To index Graphical Analytic Techniques http://statsoft.com/textbook/stathome.html Brief Overviews of Types of Graphs Representative Visualization Techniques Categorized Graphs What are Categorized Graphs? Categorization Methods Histograms Scatterplots Probability Plots Quantile-Quantile Plots Probability-Probability Plots Line Plots Box Plots Pie Charts Missing/Range Data Points Plots 3D Plots Ternary Plots Brushing Smoothing Bivariate Distributions Layered Compression Projections of 3D data sets Icon Plots Analyzing Icon Plots Taxonomy of Icon Plots Standardization of Values Applications Related Graphs Graph Type Mark Icons Data Reduction Data Rotation (in 3D space) Categorized Graphs One of the most important, general, and also powerful analytic methods involves dividing ("splitting") the data set into categories in order to compare the patterns of data between the resulting subsets. This common technique is known under a variety of terms (such as breaking down, grouping, categorizing, splitting, slicing, drilling-down, or conditioning) and it is used both in exploratory data analyses and in hypothesis testing.
For example: A positive relation between age and the risk of a heart attack may be different in males and females (it may be stronger in males). A promising relation between taking a drug and a decrease in cholesterol level may be present only in women with low blood pressure and only in their thirties and forties. The process capability indices or capability histograms can be different for periods of time supervised by different operators. The regression slopes can be different in different experimental groups. There are many computational techniques that capitalize on grouping and that are designed to quantify the differences that the grouping will reveal (e.g., ANOVA/MANOVA). However, graphical techniques (such as the categorized graphs discussed in this section) offer unique advantages that cannot be substituted by any computational method alone: they can reveal patterns that cannot be easily quantified (e.g., complex interactions, exceptions, anomalies) and they provide unique, multidimensional, global analytic perspectives to explore or "mine" the data. What are Categorized Graphs? Categorized graphs (the term first used in STATISTICA software by StatSoft in 1990; also recently called Trellis graphs, by Becker, Cleveland, and Clark, at Bell Labs) produce a series of 2D, 3D, ternary, or nD graphs (such as histograms, scatterplots, line plots, surface plots, ternary scatterplots, etc.), one for each selected category of cases (i.e., subset of cases), for example, respondents from New York, Chicago, Dallas, etc. These "component" graphs are placed sequentially in one display, allowing for comparisons between the patterns of data shown in graphs for each of the requested groups (e.g., cities). A variety of methods can be used to select the subsets; the simplest of them is using a categorical variable (e.g., a variable City, with three values New York, Chicago, and Dallas). For example, the following graph shows histograms of a variable representing self-reported stress levels in each of the three cities. Independent Components Analysis http://statsoft.com/textbook/stathome.html Introductory Overview Independent Component Analysis is a well-established and reliable statistical method that performs signal separation. Signal separation is a frequently occurring problem and is central to Statistical Signal Processing, which has a wide range of applications in many areas of technology ranging from Audio and Image Processing to Biomedical Signal Processing, Telecommunications, and Econometrics. Imagine being in a room with a crowd of people and two speakers giving presentations at the same time. The crowd is making comments and noises in the background. We are interested in what the speakers say and not the comments emanating from the crowd. There are two microphones at different locations, recording the speakers' voices as well as the noise coming from the crowd. Our task is to separate the voice of each speaker while ignoring the background noise. This is a classic example of Independent Component Analysis, a well-established stochastic technique. ICA can be used as a method of Blind Source Separation, meaning that it can separate independent signals from linear mixtures with virtually no prior knowledge of the signals. An example is the decomposition of electro- or magnetoencephalographic signals. In computational Neuroscience, ICA has been used for Feature Extraction, in which case it seems to adequately model the basic cortical processing of visual and auditory information.
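As a rough illustration of the two-microphone scenario described above, the following sketch (in Python) mixes two simulated source signals with an arbitrary mixing matrix and then recovers them; the FastICA estimator from scikit-learn is used only as one convenient implementation, and the signals, noise level, and mixing matrix are all invented for illustration.

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                          # "voice" of speaker 1
s2 = np.sign(np.sin(3 * t))                 # "voice" of speaker 2
S = np.c_[s1, s2] + 0.05 * rng.standard_normal((2000, 2))  # add crowd noise

A = np.array([[1.0, 0.5],                   # unknown mixing matrix
              [0.4, 1.0]])
X = S @ A.T                                 # the two microphone recordings

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                # recovered independent components

The recovered components match the original sources only up to scaling and ordering, which is a general property of blind source separation.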
New application areas are being discovered at an increasing pace. Multiple Regression and the General Linear Models http://statsoft.com/textbook/stathome.html http://www.statsoft.com/TEXTBOOK/stmulreg.html General Purpose Computational Approach Least Squares The Regression Equation Unique Prediction and Partial Correlation Predicted and Residual Scores Residual Variance and R-square Interpreting the Correlation Coefficient R Assumptions, Limitations, and Practical Considerations Assumption of Linearity Normality Assumption Limitations Choice of the number of variables Multicollinearity and matrix ill-conditioning Fitting centered polynomial models The importance of residual analysis -------------------------------------------------------------------------------- General Purpose The general purpose of multiple regression (the term was first used by Pearson, 1908) is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable. For example, a real estate agent might record for each listing the size of the house (in square feet), the number of bedrooms, the average income in the respective neighborhood according to census data, and a subjective rating of appeal of the house. Once this information has been compiled for various houses it would be interesting to see whether and how these measures relate to the price for which a house is sold. For example, one might learn that the number of bedrooms is a better predictor of the price for which a house sells in a particular neighborhood than how "pretty" the house is (subjective rating). One may also detect "outliers," that is, houses that should really sell for more, given their location and characteristics. Personnel professionals customarily use multiple regression procedures to determine equitable compensation. One can determine a number of factors or dimensions such as "amount of responsibility" (Resp) or "number of people to supervise" (No_Super) that one believes to contribute to the value of a job. The personnel analyst then usually conducts a salary survey among comparable companies in the market, recording the salaries and respective characteristics (i.e., values on dimensions) for different positions. This information can be used in a multiple regression analysis to build a regression equation of the form: Salary = .5*Resp + .8*No_Super Once this so-called regression line has been determined, the analyst can now easily construct a graph of the expected (predicted) salaries and the actual salaries of job incumbents in his or her company. Thus, the analyst is able to determine which position is underpaid (below the regression line) or overpaid (above the regression line), or paid equitably. In the social and natural sciences multiple regression procedures are very widely used in research. In general, multiple regression allows the researcher to ask (and hopefully answer) the general question "what is the best predictor of ...". For example, educational researchers might want to learn what are the best predictors of success in high-school. Psychologists may want to determine which personality variable best predicts social adjustment. Sociologists may want to find out which of the multiple social indicators best predict whether or not a new immigrant group will adapt and be absorbed into society. See also Exploratory Data Analysis and Data Mining Techniques, the General Stepwise Regression chapter, and the General Linear Models chapter. 
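The salary example above can be made concrete with a short sketch in Python: the regression coefficients are estimated by ordinary least squares from a small, invented salary survey (the data and the intercept are not from the text, which quotes only the equation Salary = .5*Resp + .8*No_Super), and the residuals are then used to flag positions that are paid below or above the regression line.

import numpy as np

resp     = np.array([3.0, 5.0, 2.0, 8.0, 6.0])       # amount of responsibility
no_super = np.array([1.0, 4.0, 0.0, 10.0, 5.0])      # number of people to supervise
salary   = np.array([38.0, 52.0, 30.0, 85.0, 60.0])  # observed salaries (in $1000s)

X = np.column_stack([np.ones_like(resp), resp, no_super])  # intercept column added
b, *_ = np.linalg.lstsq(X, salary, rcond=None)             # least squares estimates

predicted = X @ b
residuals = salary - predicted   # negative values: paid below the regression line
                                 # (underpaid); positive values: paid above it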
Log-Linear Analysis of Frequency Tables http://statsoft.com/textbook/stathome.html General Purpose Two-way Frequency Tables Multi-Way Frequency Tables The Log-Linear Model Goodness-of-fit Automatic Model Fitting -------------------------------------------------------------------------------- General Purpose One basic and straightforward method for analyzing data is via crosstabulation. For example, a medical researcher may tabulate the frequency of different symptoms by patients' age and gender; an educational researcher may tabulate the number of high school drop-outs by age, gender, and ethnic background; an economist may tabulate the number of business failures by industry, region, and initial capitalization; a market researcher may tabulate consumer preferences by product, age, and gender; etc. In all of these cases, the major results of interest can be summarized in a multi-way frequency table, that is, in a crosstabulation table with two or more factors. Log-Linear provides a more "sophisticated" way of looking at crosstabulation tables. Specifically, you can test the different factors that are used in the crosstabulation (e.g., gender, region, etc.) and their interactions for statistical significance (see Elementary Concepts for a discussion of statistical significance testing). The following text will present a brief introduction to these methods, their logic, and interpretation. Correspondence analysis is a descriptive/exploratory technique designed to analyze two-way and multi-way tables containing some measure of correspondence between the rows and columns. The results provide information which is similar in nature to that produced by Factor Analysis techniques, and they allow one to explore the structure of the categorical variables included in the table. To index Multivariate Adaptive Regression Splines (MARSplines) http://statsoft.com/textbook/stathome.html Introductory Overview Regression Problems Multivariate Adaptive Regression Splines Model Selection and Pruning Applications Technical Notes: The MARSplines Algorithm Technical Notes: The MARSplines Model Introductory Overview Multivariate Adaptive Regression Splines (MARSplines) is an implementation of techniques popularized by Friedman (1991) for solving regression-type problems (see also Multiple Regression), with the main purpose of predicting the values of a continuous dependent or outcome variable from a set of independent or predictor variables. There are a large number of methods available for fitting models to continuous variables, such as linear regression [e.g., Multiple Regression, General Linear Model (GLM)], nonlinear regression (Generalized Linear/Nonlinear Models), regression trees (see Classification and Regression Trees), CHAID, Neural Networks, etc. (see also Hastie, Tibshirani, and Friedman, 2001, for an overview). Multivariate Adaptive Regression Splines (MARSplines) is a nonparametric regression procedure that makes no assumption about the underlying functional relationship between the dependent and independent variables. Instead, MARSplines constructs this relation from a set of coefficients and basis functions that are entirely "driven" by the regression data. In a sense, the method is based on the "divide and conquer" strategy, which partitions the input space into regions, each with its own regression equation.
This makes MARSplines particularly suitable for problems with higher input dimensions (i.e., with more than 2 variables), where the curse of dimensionality would likely create problems for other techniques. The MARSplines technique has become particularly popular in the area of data mining because it does not assume or impose any particular type or class of relationship (e.g., linear, logistic, etc.) between the predictor variables and the dependent (outcome) variable of interest. Instead, useful models (i.e., models that yield accurate predictions) can be derived even in situations where the relationship between the predictors and the dependent variables is non-monotone and difficult to approximate with parametric models. For more information about this technique and how it compares to other methods for nonlinear regression (or regression trees), see Hastie, Tibshirani, and Friedman (2001). Regression Problems Regression problems involve determining the relationship between a set of dependent variables (also called output, outcome, or response variables) and one or more independent variables (also known as input or predictor variables). The dependent variable is the one whose values you want to predict, based on the values of the independent (predictor) variables. For instance, one might be interested in the number of car accidents on the roads, which can be caused by 1) bad weather and 2) drunk driving. In this case one might write, for example: Number_of_Accidents = Some Constant + 0.5*Bad_Weather + 2.0*Drunk_Driving. The variable Number of Accidents is the dependent variable that is thought to be caused by (among other variables) Bad Weather and Drunk Driving (hence the name dependent variable). Note that the independent variables are multiplied by factors, i.e., 0.5 and 2.0. These are known as regression coefficients. The larger these coefficients, the stronger the influence of the independent variables on the dependent variable. If the two predictors in this simple (fictitious) example were measured on the same scale (e.g., if the variables were standardized to a mean of 0.0 and standard deviation of 1.0), then Drunk Driving could be inferred to contribute four times as much to car accidents as Bad Weather. (If the variables are not measured on the same scale, then direct comparisons between these coefficients are not meaningful, and, usually, some other standardized measure of predictor "importance" is included in the results.) For additional details regarding these types of statistical models, refer to Multiple Regression or General Linear Models (GLM), as well as General Regression Models (GRM). In general, in the social and natural sciences, regression procedures are very widely used in research. Regression allows the researcher to ask (and hopefully answer) the general question "what is the best predictor of ..." For example, educational researchers might want to learn what the best predictors of success in high school are. Psychologists may want to determine which personality variable best predicts social adjustment. Sociologists may want to find out which of the multiple social indicators best predict whether a new immigrant group will adapt and be absorbed into society.
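The "divide and conquer" strategy described above can be illustrated with the piecewise-linear "hinge" basis functions popularized by Friedman (1991); the knots and coefficients in this Python sketch are invented for illustration and are not produced by an actual MARSplines fit.

import numpy as np

def hinge(x, knot, direction=1):
    # max(0, x - knot) when direction is +1; max(0, knot - x) when direction is -1
    return np.maximum(0.0, direction * (x - knot))

x = np.linspace(0.0, 10.0, 200)

# A hypothetical fitted model with two basis functions:
# y = 2 + 1.5*max(0, x - 3) - 0.8*max(0, 6 - x)
y_hat = 2.0 + 1.5 * hinge(x, 3.0, 1) - 0.8 * hinge(x, 6.0, -1)

Each basis function is zero on one side of its knot, so different regions of the predictor's range are effectively described by their own local linear equation.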
Machine Learning http://statsoft.com/textbook/stathome.html Introductory Overview Support Vector Machines (SVM) Naive Bayes k-Nearest Neighbors (KNN) -------------------------------------------------------------------------------- Machine Learning Introductory Overview Machine Learning includes a number of advanced statistical methods for handling regression and classification tasks with multiple dependent and independent variables. These methods include Support Vector Machines (SVM) for regression and classification, Naive Bayes for classification, and k-Nearest Neighbors (KNN) for regression and classification. Detailed discussions of these techniques can be found in Hastie, Tibshirani, & Friedman (2001); a specialized comprehensive introduction to support vector machines can also be found in Cristianini and Shawe-Taylor (2000). Support Vector Machines (SVM) This method performs regression and classification tasks by constructing nonlinear decision boundaries. Because of the nature of the feature space in which these boundaries are found, Support Vector Machines can exhibit a large degree of flexibility in handling classification and regression tasks of varied complexities. There are several types of Support Vector models including linear, polynomial, RBF, and sigmoid. Naive Bayes This is a well-established Bayesian method primarily formulated for performing classification tasks. Given its simplicity, i.e., the assumption that the independent variables are statistically independent of one another, Naive Bayes models are effective classification tools that are easy to use and interpret. Naive Bayes is particularly appropriate when the dimensionality of the independent space (i.e., the number of input variables) is high (a problem known as the curse of dimensionality). For the reasons given above, Naive Bayes can often outperform other, more sophisticated classification methods. A variety of methods exist for modeling the conditional distributions of the inputs, including normal, lognormal, gamma, and Poisson. k-Nearest Neighbors k-Nearest Neighbors is a memory-based method that, in contrast to other statistical methods, requires no training (i.e., no model to fit). It falls into the category of Prototype Methods. It functions on the intuitive idea that close objects are more likely to be in the same category. Thus, in KNN, predictions are based on a set of prototype examples that are used to predict new (i.e., unseen) data based on the majority vote (for classification tasks) or averaging (for regression) over a set of k nearest prototypes (hence the name k-nearest neighbors). To index Multidimensional Scaling http://statsoft.com/textbook/stathome.html General Purpose Logic of MDS Computational Approach How many dimensions to specify? Interpreting the Dimensions Applications MDS and Factor Analysis -------------------------------------------------------------------------------- General Purpose Multidimensional scaling (MDS) can be considered an alternative to factor analysis (see Factor Analysis). In general, the goal of the analysis is to detect meaningful underlying dimensions that allow the researcher to explain observed similarities or dissimilarities (distances) between the investigated objects. In factor analysis, the similarities between objects (e.g., variables) are expressed in the correlation matrix. With MDS one may analyze any kind of similarity or dissimilarity matrix, in addition to correlation matrices.
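As a small illustration of the kind of input MDS accepts, the following Python sketch converts an invented correlation matrix among three variables into a dissimilarity matrix using one common convention, d = sqrt(2*(1 - r)) (high correlations become small dissimilarities); the resulting matrix could then be submitted to an MDS routine in place of observed distances.

import numpy as np

R = np.array([[1.0, 0.8, 0.2],
              [0.8, 1.0, 0.3],
              [0.2, 0.3, 1.0]])     # correlations among three variables (invented)

D = np.sqrt(2.0 * (1.0 - R))        # dissimilarities; the diagonal becomes 0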
Logic of MDS The following simple example may demonstrate the logic of an MDS analysis. Suppose we take a matrix of distances between major US cities from a map. We then analyze this matrix, specifying that we want to reproduce the distances based on two dimensions. As a result of the MDS analysis, we would most likely obtain a two-dimensional representation of the locations of the cities, that is, we would basically obtain a two-dimensional map. In general then, MDS attempts to arrange "objects" (major cities in this example) in a space with a particular number of dimensions (two-dimensional in this example) so as to reproduce the observed distances. As a result, we can "explain" the distances in terms of underlying dimensions; in our example, we could explain the distances in terms of the two geographical dimensions: north/south and east/west. Orientation of axes. As in factor analysis, the actual orientation of axes in the final solution is arbitrary. To return to our example, we could rotate the map in any way we want, the distances between cities remain the same. Thus, the final orientation of axes in the plane or space is mostly the result of a subjective decision by the researcher, who will choose an orientation that can be most easily explained. To return to our example, we could have chosen an orientation of axes other than north/south and east/west; however, that orientation is most convenient because it "makes the most sense" (i.e., it is easily interpretable). To index Neural Networks http://www.statsoft.com/TEXTBOOK/stneunet.html http://statsoft.com/textbook/stathome.html Preface Applications for Neural Networks The Biological Inspiration The Basic Artificial Model Using a Neural Network Gathering Data for Neural Networks Summary Pre- and Post-processing Multilayer Perceptrons Training Multilayer Perceptrons The Back Propagation Algorithm Over-learning and Generalization Data Selection Insights into MLP Training Other MLP Training Algorithms Radial Basis Function Networks Probabilistic Neural Networks Generalized Regression Neural Networks Linear Networks SOFM Networks Classification in Neural Networks Classification Statistics Regression Problems in Neural Networks Time Series Prediction in Neural Networks Variable Selection and Dimensionality Reduction Ensembles and Resampling Recommended Textbooks -------------------------------------------------------------------------------- Many concepts related to the neural networks methodology are best explained if they are illustrated with applications of a specific neural network program. Therefore, this chapter contains many references to STATISTICA Neural Networks (in short, ST Neural Networks, a neural networks application available from StatSoft), a particularly comprehensive neural network tool. -------------------------------------------------------------------------------- Preface Neural networks have seen an explosion of interest over the last few years, and are being successfully applied across an extraordinary range of problem domains, in areas as diverse as finance, medicine, engineering, geology and physics. Indeed, anywhere that there are problems of prediction, classification or control, neural networks are being introduced. This sweeping success can be attributed to a few key factors: Power. Neural networks are very sophisticated modeling techniques capable of modeling extremely complex functions. In particular, neural networks are nonlinear (a term which is discussed in more detail later in this section). 
For many years linear modeling has been the commonly used technique in most modeling domains since linear models have well-known optimization strategies. Where the linear approximation was not valid (which was frequently the case) the models suffered accordingly. Neural networks also keep in check the curse of dimensionality problem that bedevils attempts to model nonlinear functions with large numbers of variables. Ease of use. Neural networks learn by example. The neural network user gathers representative data, and then invokes training algorithms to automatically learn the structure of the data. Although the user does need to have some heuristic knowledge of how to select and prepare data, how to select an appropriate neural network, and how to interpret the results, the level of user knowledge needed to successfully apply neural networks is much lower than would be the case using (for example) some more traditional nonlinear statistical methods. Neural networks are also intuitively appealing, based as they are on a crude low-level model of biological neural systems. In the future, the development of this neurobiological modeling may lead to genuinely intelligent computers. To index Nonlinear Estimation http://statsoft.com/textbook/stathome.html General Purpose Estimating Linear and Nonlinear Models Common Nonlinear Regression Models Intrinsically Linear Regression Models Intrinsically Nonlinear Regression Models Nonlinear Estimation Procedures Least Squares Estimation Loss Functions Weighted Least Squares Maximum Likelihood Maximum likelihood and probit/logit models Function Minimization Algorithms Start Values, Step Sizes, Convergence Criteria Penalty Functions, Constraining Parameters Local Minima Quasi-Newton Method Simplex Procedure Hooke-Jeeves Pattern Moves Rosenbrock Pattern Search Hessian Matrix and Standard Errors Evaluating the Fit of the Model Proportion of Variance Explained Goodness-of-fit Chi-square Plot of Observed vs. Predicted Values Normal and Half-Normal Probability Plots Plot of the Fitted Function Variance/Covariance Matrix for Parameters -------------------------------------------------------------------------------- General Purpose In the most general terms, Nonlinear Estimation will compute the relationship between a set of independent variables and a dependent variable. For example, we may want to compute the relationship between the dose of a drug and its effectiveness, the relationship between training and subsequent performance on a task, the relationship between the price of a house and the time it takes to sell it, etc. You may recognize research issues in these examples that are commonly addressed by such techniques as multiple regression (see, Multiple Regression) or analysis of variance (see, ANOVA/MANOVA). In fact, you may think of Nonlinear Estimation as a generalization of those methods. Specifically, multiple regression (and ANOVA) assumes that the relationship between the independent variable(s) and the dependent variable is linear in nature. Nonlinear Estimation leaves it up to you to specify the nature of the relationship; for example, you may specify the dependent variable to be a logarithmic function of the independent variable(s), an exponential function, a function of some complex ratio of independent measures, etc. (However, if all variables of interest are categorical in nature, or can be converted into categorical variables, you may also consider Correspondence Analysis.) 
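As a minimal sketch of specifying and fitting one such user-defined relationship, the following Python example fits an exponential dose-effect model to simulated data by least squares; scipy's curve_fit routine is used only as one convenient implementation, and the model form, starting values, and data are all invented for illustration.

import numpy as np
from scipy.optimize import curve_fit

def exponential_model(x, a, b):
    # the researcher-specified relationship: y = a * exp(b * x)
    return a * np.exp(b * x)

rng = np.random.default_rng(1)
dose = np.linspace(0.0, 4.0, 30)
effect = 2.0 * np.exp(0.6 * dose) + rng.normal(0.0, 0.5, size=dose.size)

params, cov = curve_fit(exponential_model, dose, effect, p0=[1.0, 0.1])
a_hat, b_hat = params               # least squares estimates of a and b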
When allowing for any type of relationship between the independent variables and the dependent variable, two issues raise their heads. First, what types of relationships "make sense," that is, are interpretable in a meaningful manner? Note that the simple linear relationship is very convenient in that it allows us to make such straightforward interpretations as "the more of x (e.g., the higher the price of a house), the more there is of y (the longer it takes to sell it); and given a particular increase in x, a proportional increase in y can be expected." Nonlinear relationships cannot usually be interpreted and verbalized in such a simple manner. The second issue that needs to be addressed is exactly how to compute the relationship, that is, how to arrive at results that allow us to say whether or not there is a nonlinear relationship as predicted. Let us now discuss the nonlinear regression problem in a somewhat more formal manner, that is, introduce the common terminology that will allow us to examine the nature of these techniques more closely, and how they are used to address important questions in various research domains (medicine, social sciences, physics, chemistry, pharmacology, engineering, etc.). To index Nonparametric Statistics http://statsoft.com/textbook/stathome.html General Purpose Brief Overview of Nonparametric Procedures When to Use Which Method Nonparametric Correlations -------------------------------------------------------------------------------- General Purpose Brief review of the idea of significance testing. Understanding the idea of nonparametric statistics (the term nonparametric was first used by Wolfowitz, 1942) first requires a basic understanding of parametric statistics. The Elementary Concepts chapter of the manual introduces the concept of statistical significance testing based on the sampling distribution of a particular statistic (you may want to review that chapter before reading on). In short, if we have a basic knowledge of the underlying distribution of a variable, then we can make predictions about how, in repeated samples of equal size, this particular statistic will "behave," that is, how it is distributed. For example, if we draw 100 random samples of 100 adults each from the general population, and compute the mean height in each sample, then the distribution of the standardized means across samples will likely approximate the normal distribution (to be precise, Student's t distribution with 99 degrees of freedom; see below). Now imagine that we take an additional sample in a particular city ("Tallburg") where we suspect that people are taller than the general population. If the mean height in that sample falls beyond the upper 95% cutoff of the t distribution (i.e., in the upper 5% tail), then we conclude that, indeed, the people of Tallburg are taller than the general population. Are most variables normally distributed? In the above example we relied on our knowledge that, in repeated samples of equal size, the standardized means (for height) will be distributed following the t distribution (with a particular mean and variance). However, this will only be true if in the population the variable of interest (height in our example) is normally distributed, that is, if the distribution of people of particular heights follows the normal distribution (the bell-shaped distribution). For many variables of interest, we simply do not know for sure that this is the case. For example, is income distributed normally in the population? -- probably not.
The incidence rates of rare diseases are not normally distributed in the population, the number of car accidents is also not normally distributed, and neither are very many other variables in which a researcher might be interested. For more information on the normal distribution, see Elementary Concepts; for information on tests of normality, see Normality tests. Sample size. Another factor that often limits the applicability of tests based on the assumption that the sampling distribution is normal is the size of the sample of data available for the analysis (sample size; n). We can assume that the sampling distribution is normal even if we are not sure that the distribution of the variable in the population is normal, as long as our sample is large enough (e.g., 100 or more observations). However, if our sample is very small, then those tests can be used only if we are sure that the variable is normally distributed, and there is no way to test this assumption if the sample is small. Problems in measurement. Applications of tests that are based on the normality assumptions are further limited by a lack of precise measurement. For example, let us consider a study where grade point average (GPA) is measured as the major variable of interest. Is an A average twice as good as a C average? Is the difference between a B and an A average comparable to the difference between a D and a C average? In reality, the GPA is a crude measure of scholastic accomplishments that only allows us to establish a rank ordering of students from "good" students to "poor" students. This general measurement issue is usually discussed in statistics textbooks in terms of types of measurement or scales of measurement. Without going into too much detail, most common statistical techniques such as analysis of variance (and t-tests), regression, etc. assume that the underlying measurements are at least on an interval scale, meaning that equally spaced intervals on the scale can be compared in a meaningful manner (e.g., B minus A is equal to D minus C). However, as in our example, this assumption is very often not tenable, and the data merely represent a rank ordering of observations (ordinal) rather than precise measurements. Parametric and nonparametric methods. Hopefully, after this somewhat lengthy introduction, the need is evident for statistical procedures that allow us to process data of "low quality," from small samples, on variables about which nothing is known (concerning their distribution). Specifically, nonparametric methods were developed to be used in cases when the researcher knows nothing about the parameters of the variable of interest in the population (hence the name nonparametric). In more technical terms, nonparametric methods do not rely on the estimation of parameters (such as the mean or the standard deviation) describing the distribution of the variable of interest in the population. Therefore, these methods are also sometimes (and more appropriately) called parameter-free methods or distribution-free methods. To index Partial Least Squares (PLS) http://statsoft.com/textbook/stathome.html Basic Ideas Computational Approach Basic Model NIPALS Algorithm SIMPLS Algorithm Training and Verification (Crossvalidation) Samples Types of Analyses Between-subject Designs Distance Graphs -------------------------------------------------------------------------------- This chapter describes the use of partial least squares regression analysis.
If you are unfamiliar with the basic methods of regression in linear models, it may be useful to first review the information on these topics in Elementary Concepts. The different designs discussed in this chapter are also described in the context of General Linear Models, Generalized Linear Models, and General Stepwise Regression. -------------------------------------------------------------------------------- Basic Ideas Partial least squares regression is an extension of the multiple linear regression model (see, e.g., Multiple Regression or General Stepwise Regression). In its simplest form, a linear model specifies the (linear) relationship between a dependent (response) variable Y, and a set of predictor variables, the X's, so that Y = b0 + b1X1 + b2X2 + ... + bpXp. In this equation, b0 is the regression coefficient for the intercept and the bi values are the regression coefficients (for variables 1 through p) computed from the data. So, for example, one could estimate (i.e., predict) a person's weight as a function of the person's height and gender. You could use linear regression to estimate the respective regression coefficients from a sample of data, measuring height, weight, and observing the subjects' gender. For many data analysis problems, estimates of the linear relationships between variables are adequate to describe the observed data, and to make reasonable predictions for new observations (see Multiple Regression or General Stepwise Regression for additional details). The multiple linear regression model has been extended in a number of ways to address more sophisticated data analysis problems. The multiple linear regression model serves as the basis for a number of multivariate methods such as discriminant analysis (i.e., the prediction of group membership from the levels of continuous predictor variables), principal components regression (i.e., the prediction of responses on the dependent variables from factors underlying the levels of the predictor variables), and canonical correlation (i.e., the prediction of factors underlying responses on the dependent variables from factors underlying the levels of the predictor variables). These multivariate methods all have two important properties in common. These methods impose restrictions such that (1) factors underlying the Y and X variables are extracted from the Y'Y and X'X matrices, respectively, and never from cross-product matrices involving both the Y and X variables, and (2) the number of prediction functions can never exceed the minimum of the number of Y variables and X variables. Partial least squares regression extends multiple linear regression without imposing the restrictions employed by discriminant analysis, principal components regression, and canonical correlation. In partial least squares regression, prediction functions are represented by factors extracted from the Y'XX'Y matrix. The number of such prediction functions that can be extracted typically will exceed the maximum of the number of Y and X variables. In short, partial least squares regression is probably the least restrictive of the various multivariate extensions of the multiple linear regression model. This flexibility allows it to be used in situations where the use of traditional multivariate methods is severely limited, such as when there are fewer observations than predictor variables.
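The point about fewer observations than predictors can be illustrated with a short Python sketch that fits a partial least squares regression to simulated data with 10 cases and 50 predictors, a situation in which an ordinary multiple regression solution is not unique; scikit-learn's PLSRegression estimator is used only as one convenient implementation, and all data are invented.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 50))            # 10 observations, 50 predictors
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(10)

pls = PLSRegression(n_components=2)          # extract two prediction factors
pls.fit(X, y)
y_hat = pls.predict(X)                       # fitted values based on the factors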
Furthermore, partial least squares regression can be used as an exploratory analysis tool to select suitable predictor variables and to identify outliers before applying classical linear regression. Partial least squares regression has been used in various disciplines such as chemistry, economics, medicine, psychology, and pharmaceutical science where predictive linear modeling, especially with a large number of predictors, is necessary. Especially in chemometrics, partial least squares regression has become a standard tool for modeling linear relations between multivariate measurements (de Jong, 1993). Power Analysis http://statsoft.com/textbook/stathome.html General Purpose Power Analysis and Sample Size Calculation in Experimental Design Sampling Theory Hypothesis Testing Logic Calculating Power Calculating Required Sample Size Graphical Approaches to Power Analysis Noncentrality Interval Estimation and the Evaluation of Statistical Models Inadequacies of the Hypothesis Testing Approach Advantages of Interval Estimation Reasons Why Interval Estimates are Seldom Reported Replacing Traditional Hypothesis Tests with Interval Estimates General Purpose The techniques of statistical power analysis, sample size estimation, and advanced techniques for confidence interval estimation are discussed here. The main goal of the first two techniques is to allow you to decide, while in the process of designing an experiment, (a) how large a sample is needed to enable statistical judgments that are accurate and reliable and (b) how likely your statistical test will be to detect effects of a given size in a particular situation. The third technique is useful in implementing objectives (a) and (b) and in evaluating the size of experimental effects in practice. Performing power analysis and sample size estimation is an important aspect of experimental design, because without these calculations, sample size may be too large or too small. If sample size is too small, the experiment will lack the precision to provide reliable answers to the questions it is investigating. If sample size is too large, time and resources will be wasted, often for minimal gain. In some power analysis software programs, a number of graphical and analytical tools are available to enable precise evaluation of the factors affecting power and sample size in many of the most commonly encountered statistical analyses. This information can be crucial to the design of a study that is cost-effective and scientifically useful. Noncentrality interval estimation and other advanced confidence interval procedures provide methods for evaluating the importance of an observed experimental result. An increasing number of influential statisticians are suggesting that confidence interval estimation should augment or replace traditional hypothesis testing approaches in the analysis of experimental data. To index Power Analysis and Sample Size Calculation in Experimental Design There is a growing recognition of the importance of power analysis and sample size calculation in the proper design of experiments. The following topics discuss the fundamental ideas behind these methods.
Sampling Theory Hypothesis Testing Logic Calculating Power Calculating Required Sample Size Graphical Approaches to Power Analysis Process Analysis http://statsoft.com/textbook/stathome.html Sampling Plans General Purpose Computational Approach Means for H0 and H1 Alpha and Beta Error Probabilities Fixed Sampling Plans Sequential Sampling Plans Summary Process (Machine) Capability Analysis Introductory Overview Computational Approach Process Capability Indices Process Performance vs. Process Capability Using Experiments to Improve Process Capability Testing the Normality Assumption Tolerance Limits Gage Repeatability and Reproducibility Introductory Overview Computational Approach Plots of Repeatability and Reproducibility Components of Variance Summary Non-Normal Distributions Introductory Overview Fitting Distributions by Moments Assessing the Fit: Quantile and Probability Plots Non-Normal Process Capability Indices (Percentile Method) Weibull and Reliability/Failure Time Analysis General Purpose The Weibull Distribution Censored Observations Two- and three-parameter Weibull Distribution Parameter Estimation Goodness of Fit Indices Interpreting Results Grouped Data Modified Failure Order for Multiple-Censored Data Weibull CDF, Reliability, and Hazard Functions -------------------------------------------------------------------------------- Sampling plans are discussed in detail in Duncan (1974) and Montgomery (1985); most process capability procedures (and indices) were only recently introduced to the US from Japan (Kane, 1986), however, they are discussed in three excellent recent hands-on books by Bohte (1988), Hart and Hart (1989), and Pyzdek (1989); detailed discussions of these methods can also be found in Montgomery (1991). Step-by-step instructions for the computation and interpretation of capability indices are also provided in the Fundamental Statistical Process Control Reference Manual published by the ASQC (American Society for Quality Control) and AIAG (Automotive Industry Action Group, 1991; referenced as ASQC/AIAG, 1991). Repeatability and reproducibility (R & R) methods are discussed in Grant and Leavenworth (1980), Pyzdek (1989) and Montgomery (1991); a more detailed discussion of the subject (of variance estimation) is also provided in Duncan (1974). Step-by-step instructions on how to conduct and analyze R & R experiments are presented in the Measurement Systems Analysis Reference Manual published by ASQC/AIAG (1990). In the following topics, we will briefly introduce the purpose and logic of each of these procedures. For more information on analyzing designs with random effects and for estimating components of variance, see the Variance Components chapter. -------------------------------------------------------------------------------- Sampling Plans General Purpose Computational Approach Means for H0 and H1 Alpha and Beta Error Probabilities Fixed Sampling Plans Sequential Sampling Plans Summary General Purpose A common question that quality control engineers face is to determine how many items from a batch (e.g., shipment from a supplier) to inspect in order to ensure that the items (products) in that batch are of acceptable quality. For example, suppose we have a supplier of piston rings for small automotive engines that our company produces, and our goal is to establish a sampling procedure (of piston rings from the delivered batches) that ensures a specified quality. 
In principle, this problem is similar to that of on-line quality control discussed in Quality Control. In fact, you may want to read that section at this point to familiarize yourself with the issues involved in industrial statistical quality control. Acceptance sampling. The procedures described here are useful whenever we need to decide whether or not a batch or lot of items complies with specifications, without having to inspect 100% of the items in the batch. Because of the nature of the problem -- whether or not to accept a batch -- these methods are also sometimes discussed under the heading of acceptance sampling. Advantages over 100% inspection. An obvious advantage of acceptance sampling over 100% inspection of the batch or lot is that reviewing only a sample requires less time, effort, and money. In some cases, inspection of an item is destructive (e.g., stress testing of steel), and testing 100% would destroy the entire batch. Finally, from a managerial standpoint, rejecting an entire batch or shipment (based on acceptance sampling) from a supplier, rather than just a certain percent of defective items (based on 100% inspection) often provides a stronger incentive to the supplier to adhere to quality standards. Computational Approach In principle, the computational approach to the question of how large a sample to take is straightforward. Elementary Concepts discusses the concept of the sampling distribution. Briefly, if we were to take repeated samples of a particular size from a population of, for example, piston rings and compute their average diameters, then the distribution of those averages (means) would approach the normal distribution with a particular mean and standard deviation (or standard error; in sampling distributions the term standard error is preferred, in order to distinguish the variability of the means from the variability of the items in the population). Fortunately, we do not need to take repeated samples from the population in order to estimate the location (mean) and variability (standard error) of the sampling distribution. If we have a good idea (estimate) of what the variability (standard deviation or sigma) is in the population, then we can infer the sampling distribution of the mean. In principle, this information is sufficient to estimate the sample size that is needed in order to detect a certain change in quality (from target specifications). Without going into the details about the computational procedures involved, let us next review the particular information that the engineer must supply in order to estimate required sample sizes. Means for H0 and H1 To formalize the inspection process of, for example, a shipment of piston rings, we can formulate two alternative hypotheses: First, we may hypothesize that the average piston ring diameters comply with specifications. This hypothesis is called the null hypothesis (H0). The second and alternative hypothesis (H1) is that the diameters of the piston rings delivered to us deviate from specifications by more than a certain amount. Note that we may specify these types of hypotheses not just for measurable variables such as diameters of piston rings, but also for attributes. For example, we may hypothesize (H1) that the number of defective parts in the batch exceeds a certain percentage. Intuitively, it should be clear that the larger the difference between H0 and H1, the smaller the sample necessary to detect this difference (see Elementary Concepts). 
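As a rough sketch of how a required sample size follows from these quantities, the usual normal-theory calculation combines the assumed process sigma, the smallest deviation from target that should be detected, and the alpha and beta error probabilities discussed in the next topic; the numbers in this Python example are invented for illustration.

import math
from scipy.stats import norm

sigma = 0.01     # assumed standard deviation of piston ring diameters (mm)
shift = 0.005    # smallest deviation from target that we want to detect (mm)
alpha = 0.05     # acceptable probability of rejecting a batch that is on target
beta  = 0.10     # acceptable probability of accepting a batch that is off target

z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test on the mean
z_beta  = norm.ppf(1 - beta)

n = ((z_alpha + z_beta) * sigma / shift) ** 2
print(math.ceil(n))                 # approximately 43 rings per sample here

As the formula shows, halving the deviation to be detected quadruples the required sample size, which is why small deviations from specifications require large samples.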
Alpha and Beta Error Probabilities To return to the piston rings example, there are two types of mistakes that we can make when inspecting a batch of piston rings that has just arrived at our plant. First, we may erroneously reject H0, that is, reject the batch because we erroneously conclude that the piston ring diameters deviate from target specifications. The probability of committing this mistake is usually called the alpha error probability. The second mistake that we can make is to erroneously not reject H0 (accept the shipment of piston rings), when, in fact, the mean piston ring diameter deviates from the target specification by a certain amount. The probability of committing this mistake is usually called the beta error probability. Intuitively, the more certain we want to be, that is, the lower we set the alpha and beta error probabilities, the larger the sample will have to be; in fact, in order to be 100% certain, we would have to measure every single piston ring delivered to our company. Fixed Sampling Plans To construct a simple sampling plan, we would first decide on a sample size, based on the means under H0/H1 and the particular alpha and beta error probabilities. Then, we would take a single sample of this fixed size and, based on the mean in this sample, decide whether to accept or reject the batch. This procedure is referred to as a fixed sampling plan. Operating characteristic (OC) curve. The power of the fixed sampling plan can be summarized via the operating characteristic curve. In that plot, the probability of rejecting H0 (and accepting H1) is plotted on the Y axis, as a function of an actual shift from the target (nominal) specification to the respective values shown on the X axis of the plot (see example below). This probability is, of course, one minus the beta error probability of erroneously rejecting H1 and accepting H0; this value is referred to as the power of the fixed sampling plan to detect deviations. Also indicated in this plot are the power functions for smaller sample sizes. Sequential Sampling Plans As an alternative to the fixed sampling plan, we could randomly choose individual piston rings and record their deviations from specification. As we continue to measure each piston ring, we could keep a running total of the sum of deviations from specification. Intuitively, if H1 is true, that is, if the average piston ring diameter in the batch is not on target, then we would expect to observe a slowly increasing or decreasing cumulative sum of deviations, depending on whether the average diameter in the batch is larger or smaller than the specification, respectively. It turns out that this kind of sequential sampling of individual items from the batch is a more sensitive procedure than taking a fixed sample. In practice, we continue sampling until we either accept or reject the batch. Using a sequential sampling plan. Typically, we would produce a graph in which the cumulative deviations from specification (plotted on the Y-axis) are shown for successively sampled items (e.g., piston rings, plotted on the X-axis). Then two sets of lines are drawn in this graph to denote the "corridor" along which we will continue to draw samples, that is, as long as the cumulative sum of deviations from specifications stays within this corridor, we continue sampling. If the cumulative sum of deviations steps outside the corridor we stop sampling. If the cumulative sum moves above the upper line or below the lowest line, we reject the batch. 
If the cumulative sum steps out of the corridor to the inside, that is, if it moves closer to the center line, we accept the batch (since this indicates zero deviation from specification). Note that the inside area starts only at a certain sample number; this indicates the minimum number of samples necessary to accept the batch (with the current error probability). Summary To summarize, the idea of (acceptance) sampling is to use statistical "inference" to accept or reject an entire batch of items, based on the inspection of only relatively few items from that batch. The advantage of applying statistical reasoning to this decision is that we can be explicit about the probabilities of making a wrong decision. Whenever possible, sequential sampling plans are preferable to fixed sampling plans because they are more powerful. In most cases, relative to the fixed sampling plan, using sequential plans requires fewer items to be inspected in order to arrive at a decision with the same degree of certainty. To index Quality Control Charts http://statsoft.com/textbook/stathome.html General Purpose General Approach Establishing Control Limits Common Types of Charts Short Run Control Charts Short Run Charts for Variables Short Run Charts for Attributes Unequal Sample Sizes Control Charts for Variables vs. Charts for Attributes Control Charts for Individual Observations Out-of-Control Process: Runs Tests Operating Characteristic (OC) Curves Process Capability Indices Other Specialized Control Charts -------------------------------------------------------------------------------- General Purpose In all production processes, we need to monitor the extent to which our products meet specifications. In the most general terms, there are two "enemies" of product quality: (1) deviations from target specifications, and (2) excessive variability around target specifications. During the earlier stages of developing the production process, designed experiments are often used to optimize these two quality characteristics (see Experimental Design); the methods provided in Quality Control are on-line or in-process quality control procedures to monitor an on-going production process. For detailed descriptions of these charts and extensive annotated examples, see Buffa (1972), Duncan (1974) Grant and Leavenworth (1980), Juran (1962), Juran and Gryna (1970), Montgomery (1985, 1991), Shirland (1993), or Vaughn (1974). Two recent excellent introductory texts with a "how-to" approach are Hart & Hart (1989) and Pyzdek (1989); two recent German language texts on this subject are Rinne and Mittag (1995) and Mittag (1993). To index General Approach The general approach to on-line quality control is straightforward: We simply extract samples of a certain size from the ongoing production process. We then produce line charts of the variability in those samples, and consider their closeness to target specifications. If a trend emerges in those lines, or if samples fall outside pre-specified limits, then we declare the process to be out of control and take action to find the cause of the problem. These types of charts are sometimes also referred to as Shewhart control charts (named after W. A. Shewhart who is generally credited as being the first to introduce these methods; see Shewhart, 1931). Common Types of Charts The types of charts are often classified according to the type of quality characteristic that they are supposed to monitor: there are quality control charts for variables and control charts for attributes. 
Specifically, the following charts are commonly constructed for controlling variables: X-bar chart. In this chart the sample means are plotted in order to control the mean value of a variable (e.g., size of piston rings, strength of materials, etc.). R chart. In this chart, the sample ranges are plotted in order to control the variability of a variable. S chart. In this chart, the sample standard deviations are plotted in order to control the variability of a variable. S**2 chart. In this chart, the sample variances are plotted in order to control the variability of a variable. For controlling quality characteristics that represent attributes of the product, the following charts are commonly constructed: C chart. In this chart (see example below), we plot the number of defectives (per batch, per day, per machine, per 100 feet of pipe, etc.). This chart assumes that defects of the quality attribute are rare, and the control limits in this chart are computed based on the Poisson distribution (distribution of rare events). U chart. In this chart we plot the rate of defectives, that is, the number of defectives divided by the number of units inspected (the n; e.g., feet of pipe, number of batches). Unlike the C chart, this chart does not require a constant number of units, and it can be used, for example, when the batches (samples) are of different sizes. Np chart. In this chart, we plot the number of defectives (per batch, per day, per machine) as in the C chart. However, the control limits in this chart are not based on the distribution of rare events, but rather on the binomial distribution. Therefore, this chart should be used if the occurrence of defectives is not rare (e.g., they occur in more than 5% of the units inspected). For example, we may use this chart to control the number of units produced with minor flaws. P chart. In this chart, we plot the percent of defectives (per batch, per day, per machine, etc.) as in the U chart. However, the control limits in this chart are not based on the distribution of rare events but rather on the binomial distribution (of proportions). Therefore, this chart is most applicable to situations where the occurrence of defectives is not rare (e.g., we expect the percent of defectives to be more than 5% of the total number of units produced). All of these charts can be adapted for short production runs (short run charts), and for multiple process streams. To index Short Run Charts The short run control chart, or control chart for short production runs, plots observations of variables or attributes for multiple parts on the same chart. Short run control charts were developed to address the requirement that several dozen measurements of a process must be collected before control limits are calculated. Meeting this requirement is often difficult for operations that produce a limited number of a particular part during a production run. For example, a paper mill may produce only three or four (huge) rolls of a particular kind of paper (i.e., part) and then shift production to another kind of paper. But if variables, such as paper thickness, or attributes, such as blemishes, are monitored for several dozen rolls of paper of, say, a dozen different kinds, control limits for thickness and blemishes could be calculated for the transformed (within the short production run) variable values of interest. 
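Before continuing with short run charts, here is a minimal sketch of how 3-sigma control limits for one of the attribute charts listed above (a C chart) might be computed; the defect counts are invented, and the Poisson-based limits c-bar plus or minus 3*sqrt(c-bar) are the usual textbook convention assumed here, not output from any particular software.

```python
# C chart sketch: 3-sigma control limits for counts of defects per unit,
# assuming the counts follow a Poisson distribution (hypothetical data).
import math

defect_counts = [3, 5, 2, 6, 4, 3, 7, 2, 4, 5, 3, 4]  # defects per inspected batch

c_bar = sum(defect_counts) / len(defect_counts)        # center line
sigma = math.sqrt(c_bar)                               # Poisson: variance = mean
ucl = c_bar + 3 * sigma                                # upper control limit
lcl = max(0.0, c_bar - 3 * sigma)                      # lower limit cannot go below zero

print(f"center line = {c_bar:.2f}, LCL = {lcl:.2f}, UCL = {ucl:.2f}")
for i, c in enumerate(defect_counts, start=1):
    flag = "out of control" if (c > ucl or c < lcl) else "in control"
    print(f"batch {i:2d}: {c} defects -> {flag}")
```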
Specifically, these transformations will rescale the variable values of interest such that they are of compatible magnitudes across the different short production runs (or parts). The control limits computed for those transformed values could then be applied in monitoring thickness, and blemishes, regardless of the types of paper (parts) being produced. Statistical process control procedures could be used to determine if the production process is in control, to monitor continuing production, and to establish procedures for continuous quality improvement. For additional discussions of short run charts refer to Bothe (1988), Johnson (1987), or Montgomery (1991). Short Run Charts for Variables Nominal chart, target chart. There are several different types of short run charts. The most basic are the nominal short run chart, and the target short run chart. In these charts, the measurements for each part are transformed by subtracting a part-specific constant. These constants can either be the nominal values for the respective parts (nominal short run chart), or they can be target values computed from the (historical) means for each part (Target X-bar and R chart). For example, the diameters of piston bores for different engine blocks produced in a factory can only be meaningfully compared (for determining the consistency of bore sizes) if the mean differences between bore diameters for different sized engines are first removed. The nominal or target short run chart makes such comparisons possible. Note that for the nominal or target chart it is assumed that the variability across parts is identical, so that control limits based on a common estimate of the process sigma are applicable. Standardized short run chart. If the variability of the process for different parts cannot be assumed to be identical, then a further transformation is necessary before the sample means for different parts can be plotted in the same chart. Specifically, in the standardized short run chart the plot points are further transformed by dividing the deviations of sample means from part means (or nominal or target values for parts) by part-specific constants that are proportional to the variability for the respective parts. For example, for the short run X-bar and R chart, the plot points (that are shown in the X-bar chart) are computed by first subtracting from each sample mean a part specific constant (e.g., the respective part mean, or nominal value for the respective part), and then dividing the difference by another constant, for example, by the average range for the respective chart. These transformations will result in comparable scales for the sample means for different parts. Short Run Charts for Attributes For attribute control charts (C, U, Np, or P charts), the estimate of the variability of the process (proportion, rate, etc.) is a function of the process average (average proportion, rate, etc.; for example, the standard deviation of a proportion p is equal to the square root of p*(1- p)/n). Hence, only standardized short run charts are available for attributes. For example, in the short run P chart, the plot points are computed by first subtracting from the respective sample p values the average part p''s, and then dividing by the standard deviation of the average p''s. To index Unequal Sample Sizes When the samples plotted in the control chart are not of equal size, then the control limits around the center line (target specification) cannot be represented by a straight line. 
For example, to return to the formula Sigma/Square Root(n) presented earlier for computing control limits for the X-bar chart, it is obvious that unequal n''s will lead to different control limits for different sample sizes. There are three ways of dealing with this situation. Average sample size. If one wants to maintain the straight-line control limits (e.g., to make the chart easier to read and easier to use in presentations), then one can compute the average n per sample across all samples, and establish the control limits based on the average sample size. This procedure is not "exact," however, as long as the sample sizes are reasonably similar to each other, this procedure is quite adequate. Variable control limits. Alternatively, one may compute different control limits for each sample, based on the respective sample sizes. This procedure will lead to variable control limits, and result in step-chart like control lines in the plot. This procedure ensures that the correct control limits are computed for each sample. However, one loses the simplicity of straight-line control limits. Stabilized (normalized) chart. The best of two worlds (straight line control limits that are accurate) can be accomplished by standardizing the quantity to be controlled (mean, proportion, etc.) according to units of sigma. The control limits can then be expressed in straight lines, while the location of the sample points in the plot depend not only on the characteristic to be controlled, but also on the respective sample n''s. The disadvantage of this procedure is that the values on the vertical (Y) axis in the control chart are in terms of sigma rather than the original units of measurement, and therefore, those numbers cannot be taken at face value (e.g., a sample with a value of 3 is 3 times sigma away from specifications; in order to express the value of this sample in terms of the original units of measurement, we need to perform some computations to convert this number back). To index Reliability and Item Analysis http://statsoft.com/textbook/stathome.html General Introduction Basic Ideas Classical Testing Model Reliability Sum Scales Cronbach''s Alpha Split-Half Reliability Correction for Attenuation Designing a Reliable Scale -------------------------------------------------------------------------------- This chapter discusses the concept of reliability of measurement as used in social sciences (but not in industrial statistics or biomedical research). The term reliability used in industrial statistics denotes a function describing the probability of failure (as a function of time). For a discussion of the concept of reliability as applied to product quality (e.g., in industrial statistics), please refer to the section on Reliability/Failure Time Analysis in the Process Analysis chapter (see also the section Repeatability and Reproducibility in the same chapter and the chapter Survival/Failure Time Analysis). For a comparison between these two (very different) concepts of reliability, see Reliability. -------------------------------------------------------------------------------- General Introduction In many areas of research, the precise measurement of hypothesized processes or variables (theoretical constructs) poses a challenge by itself. For example, in psychology, the precise measurement of personality variables or attitudes is usually a necessary first step before any theories of personality or attitudes can be considered. 
In general, in all social sciences, unreliable measurements of people''s beliefs or intentions will obviously hamper efforts to predict their behavior. The issue of precision of measurement will also come up in applied research, whenever variables are difficult to observe. For example, reliable measurement of employee performance is usually a difficult task; yet, it is obviously a necessary precursor to any performance-based compensation system. In all of these cases, Reliability & Item Analysis may be used to construct reliable measurement scales, to improve existing scales, and to evaluate the reliability of scales already in use. Specifically, Reliability & Item Analysis will aid in the design and evaluation of sum scales, that is, scales that are made up of multiple individual measurements (e.g., different items, repeated measurements, different measurement devices, etc.). You can compute numerous statistics that allows you to build and evaluate scales following the so-called classical testing theory model. The assessment of scale reliability is based on the correlations between the individual items or measurements that make up the scale, relative to the variances of the items. If you are not familiar with the correlation coefficient or the variance statistic, we recommend that you review the respective discussions provided in the Basic Statistics section. The classical testing theory model of scale construction has a long history, and there are many textbooks available on the subject. For additional detailed discussions, you may refer to, for example, Carmines and Zeller (1980), De Gruitjer and Van Der Kamp (1976), Kline (1979, 1986), or Thorndyke and Hagen (1977). A widely acclaimed "classic" in this area, with an emphasis on psychological and educational testing, is Nunally (1970). Testing hypotheses about relationships between items and tests. Using Structural Equation Modeling and Path Analysis (SEPATH), you can test specific hypotheses about the relationship between sets of items or different tests (e.g., test whether two sets of items measure the same construct, analyze multi-trait, multi-method matrices, etc.). To index Basic Ideas Suppose we want to construct a questionnaire to measure people''s prejudices against foreign- made cars. We could start out by generating a number of items such as: "Foreign cars lack personality," "Foreign cars all look the same," etc. We could then submit those questionnaire items to a group of subjects (for example, people who have never owned a foreign-made car). We could ask subjects to indicate their agreement with these statements on 9-point scales, anchored at 1=disagree and 9=agree. True scores and error. Let us now consider more closely what we mean by precise measurement in this case. We hypothesize that there is such a thing (theoretical construct) as "prejudice against foreign cars," and that each item "taps" into this concept to some extent. Therefore, we may say that a subject''s response to a particular item reflects two aspects: first, the response reflects the prejudice against foreign cars, and second, it will reflect some esoteric aspect of the respective question. For example, consider the item "Foreign cars all look the same." A subject''s agreement or disagreement with that statement will partially depend on his or her general prejudices, and partially on some other aspects of the question or person. For example, the subject may have a friend who just bought a very different looking foreign car. 
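As a rough illustration of the sum-scale logic just described (a sketch, not code from this chapter), the following simulates 9-point responses to a handful of items that all tap a common underlying attitude, builds the sum scale, and computes Cronbach's alpha from the item variances and the variance of the sum, using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of the sum); the data and the number of items are invented.

```python
# Sketch: build a sum scale from simulated 9-point item responses and
# compute Cronbach's alpha (standard formula; hypothetical data).
import random
from statistics import variance

random.seed(1)
n_subjects, n_items = 200, 5

responses = []
for _ in range(n_subjects):
    prejudice = random.gauss(5, 1.5)   # latent "true" attitude of this subject
    # each item reflects the attitude plus its own random (esoteric) error
    row = [min(9, max(1, round(prejudice + random.gauss(0, 1)))) for _ in range(n_items)]
    responses.append(row)

sum_scale = [sum(row) for row in responses]                               # the sum scale
item_vars = [variance([row[j] for row in responses]) for j in range(n_items)]

k = n_items
alpha = (k / (k - 1)) * (1 - sum(item_vars) / variance(sum_scale))
print(f"Cronbach's alpha for the {k}-item scale: {alpha:.2f}")
```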
Testing hypotheses about relationships between items and tests. To test specific hypotheses about the relationship between sets of items or different tests (e.g., whether two sets of items measure the same construct, analyze multi- trait, multi-method matrices, etc.) use Structural Equation Modeling (SEPATH). To index Classical Testing Model To summarize, each measurement (response to an item) reflects to some extent the true score for the intended concept (prejudice against foreign cars), and to some extent esoteric, random error. We can express this in an equation as: X = tau + error In this equation, X refers to the respective actual measurement, that is, subject''s response to a particular item; tau is commonly used to refer to the true score, and error refers to the random error component in the measurement. To index Structural Equation Modeling http://statsoft.com/textbook/stathome.html A Conceptual Overview The Basic Idea Behind Structural Modeling Structural Equation Modeling and the Path Diagram -------------------------------------------------------------------------------- A Conceptual Overview Structural Equation Modeling is a very general, very powerful multivariate analysis technique that includes specialized versions of a number of other analysis methods as special cases. We will assume that you are familiar with the basic logic of statistical reasoning as described in Elementary Concepts. Moreover, we will also assume that you are familiar with the concepts of variance, covariance, and correlation; if not, we advise that you read the Basic Statistics section at this point. Although it is not absolutely necessary, it is highly desirable that you have some background in factor analysis before attempting to use structural modeling. Major applications of structural equation modeling include: causal modeling, or path analysis, which hypothesizes causal relationships among variables and tests the causal models with a linear equation system. Causal models can involve either manifest variables, latent variables, or both; confirmatory factor analysis, an extension of factor analysis in which specific hypotheses about the structure of the factor loadings and intercorrelations are tested; second order factor analysis, a variation of factor analysis in which the correlation matrix of the common factors is itself factor analyzed to provide second order factors; regression models, an extension of linear regression analysis in which regression weights may be constrained to be equal to each other, or to specified numerical values; covariance structure models, which hypothesize that a covariance matrix has a particular form. For example, you can test the hypothesis that a set of variables all have equal variances with this procedure; correlation structure models, which hypothesize that a correlation matrix has a particular form. A classic example is the hypothesis that the correlation matrix has the structure of a circumplex (Guttman, 1954; Wiggins, Steiger, & Gaelick, 1981). Many different kinds of models fall into each of the above categories, so structural modeling as an enterprise is very difficult to characterize. Most structural equation models can be expressed as path diagrams. Consequently even beginners to structural modeling can perform complicated analyses with a minimum of training. 
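Returning to the classical testing model stated above (X = tau + error), the small simulation below, with made-up variances, illustrates how the observed-score variance decomposes into true-score variance plus error variance, and how the ratio var(tau)/var(X) can be read as a reliability; this is a standard consequence of the model rather than anything specific to this chapter's software.

```python
# Sketch of X = tau + error: observed score variance splits into true-score
# variance plus error variance, and reliability = var(tau) / var(X).
import random
from statistics import variance

random.seed(42)
n = 5000
tau = [random.gauss(0, 2) for _ in range(n)]     # true scores, sd = 2
error = [random.gauss(0, 1) for _ in range(n)]   # random measurement error, sd = 1
x = [t + e for t, e in zip(tau, error)]          # observed scores

var_tau, var_err, var_x = variance(tau), variance(error), variance(x)
print(f"var(tau) = {var_tau:.2f}  var(error) = {var_err:.2f}  var(X) = {var_x:.2f}")
print(f"var(tau) + var(error) = {var_tau + var_err:.2f}  (close to var(X))")
print(f"reliability = var(tau)/var(X) = {var_tau / var_x:.2f}")  # about 4/5 = 0.8 here
```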
To index The Basic Idea Behind Structural Modeling One of the fundamental ideas taught in intermediate applied statistics courses is the effect of additive and multiplicative transformations on a list of numbers. Students are taught that, if you multiply every number in a list by some constant K, you multiply the mean of the numbers by K. Similarly, you multiply the standard deviation by the absolute value of K. For example, suppose you have the list of numbers 1,2,3. These numbers have a mean of 2 and a standard deviation of 1. Now, suppose you were to take these 3 numbers and multiply them by 4. Then the mean would become 8, and the standard deviation would become 4, the variance thus 16. The point is, if you have a set of numbers X related to another set of numbers Y by the equation Y = 4X, then the variance of Y must be 16 times that of X, so you can test the hypothesis that Y and X are related by the equation Y = 4X indirectly by comparing the variances of the Y and X variables. This idea generalizes, in various ways, to several variables inter-related by a group of linear equations. The rules become more complex, the calculations more difficult, but the basic message remains the same -- you can test whether variables are interrelated through a set of linear relationships by examining the variances and covariances of the variables. Statisticians have developed procedures for testing whether a set of variances and covariances in a covariance matrix fits a specified structure. The way structural modeling works is as follows: You state the way that you believe the variables are inter-related, often with the use of a path diagram. You work out, via some complex internal rules, what the implications of this are for the variances and covariances of the variables. You test whether the variances and covariances fit this model of them. Results of the statistical testing, and also parameter estimates and standard errors for the numerical coefficients in the linear equations are reported. On the basis of this information, you decide whether the model seems like a good fit to your data. Survival/Failure Time Analysis http://statsoft.com/textbook/stathome.html General Information Censored Observations Analytic Techniques Life Table Analysis Number of Cases at Risk Proportion Failing Proportion surviving Cumulative Proportion Surviving (Survival Function) Probability Density Hazard rate Median survival time Required sample sizes Distribution Fitting General Introduction Estimation Goodness-of-fit Plots Kaplan-Meier Product-Limit Estimator Comparing Samples General Introduction Available tests Choosing a two-sample test Multiple sample test Unequal proportions of censored data Regression Models General Introduction Cox''s Proportional Hazard Model Cox''s Proportional Hazard Model with Time-Dependent Covariates Exponential Regression Normal and Log-Normal Regression Stratified Analyses -------------------------------------------------------------------------------- General Information These techniques were primarily developed in the medical and biological sciences, but they are also widely used in the social and economic sciences, as well as in engineering (reliability and failure time analysis). Imagine that you are a researcher in a hospital who is studying the effectiveness of a new treatment for a generally terminal disease. The major variable of interest is the number of days that the respective patients survive. 
In principle, one could use the standard parametric and nonparametric statistics for describing the average survival, and for comparing the new treatment with traditional methods (see Basic Statistics and Nonparametrics and Distribution Fitting). However, at the end of the study there will be patients who survived over the entire study period, in particular among those patients who entered the hospital (and the research project) late in the study; there will be other patients with whom we will have lost contact. Surely, one would not want to exclude all of those patients from the study by declaring them to be missing data (since most of them are "survivors" and, therefore, they reflect on the success of the new treatment method). Those observations, which contain only partial information are called censored observations (e.g., "patient A survived at least 4 months before he moved away and we lost contact;" the term censoring was first used by Hald, 1949). To index Censored Observations In general, censored observations arise whenever the dependent variable of interest represents the time to a terminal event, and the duration of the study is limited in time. Censored observations may occur in a number of different areas of research. For example, in the social sciences we may study the "survival" of marriages, high school drop-out rates (time to drop-out), turnover in organizations, etc. In each case, by the end of the study period, some subjects will still be married, will not have dropped out, or are still working at the same company; thus, those subjects represent censored observations. In economics we may study the "survival" of new businesses or the "survival" times of products such as automobiles. In quality control research, it is common practice to study the "survival" of parts under stress (failure time analysis). To index Analytic Techniques Essentially, the methods offered in Survival Analysis address the same research questions as many of the other procedures; however, all methods in Survival Analysis will handle censored data. The life table, survival distribution, and Kaplan-Meier survival function estimation are all descriptive methods for estimating the distribution of survival times from a sample. Several techniques are available for comparing the survival in two or more groups. Finally, Survival Analysis offers several regression models for estimating the relationship of (multiple) continuous variables to survival times. To index Life Table Analysis The most straightforward way to describe the survival in a sample is to compute the Life Table. The life table technique is one of the oldest methods for analyzing survival (failure time) data (e.g., see Berkson & Gage, 1950; Cutler & Ederer, 1958; Gehan, 1969). This table can be thought of as an "enhanced" frequency distribution table. The distribution of survival times is divided into a certain number of intervals. For each interval we can then compute the number and proportion of cases or objects that entered the respective interval "alive," the number and proportion of cases that failed in the respective interval (i.e., number of terminal events, or number of cases that "died"), and the number of cases that were lost or censored in the respective interval. 
Based on those numbers and proportions, several additional statistics can be computed: Number of Cases at Risk Proportion Failing Proportion surviving Cumulative Proportion Surviving (Survival Function) Probability Density Hazard rate Median survival time Required sample sizes Number of Cases at Risk. This is the number of cases that entered the respective interval alive, minus half of the number of cases lost or censored in the respective interval. Proportion Failing. This proportion is computed as the ratio of the number of cases failing in the respective interval, divided by the number of cases at risk in the interval. Proportion Surviving. This proportion is computed as 1 minus the proportion failing. Cumulative Proportion Surviving (Survival Function). This is the cumulative proportion of cases surviving up to the respective interval. Since the probabilities of survival are assumed to be independent across the intervals, this probability is computed by multiplying out the probabilities of survival across all previous intervals. The resulting function is also called the survivorship or survival function. Probability Density. This is the estimated probability of failure in the respective interval, computed per unit of time, that is: Fi = (Pi-Pi+1) /hi In this formula, Fi is the respective probability density in the i''th interval, Pi is the estimated cumulative proportion surviving at the beginning of the i''th interval (at the end of interval i-1), Pi+1 is the cumulative proportion surviving at the end of the i''th interval, and hi is the width of the respective interval. Hazard Rate. The hazard rate (the term was first used by Barlow, 1963) is defined as the probability per time unit that a case that has survived to the beginning of the respective interval will fail in that interval. Specifically, it is computed as the number of failures per time units in the respective interval, divided by the average number of surviving cases at the mid-point of the interval. Median Survival Time. This is the survival time at which the cumulative survival function is equal to 0.5. Other percentiles (25th and 75th percentile) of the cumulative survival function can be computed accordingly. Note that the 50th percentile (median) for the cumulative survival function is usually not the same as the point in time up to which 50% of the sample survived. (This would only be the case if there were no censored observations prior to this time). Required Sample Sizes. In order to arrive at reliable estimates of the three major functions (survival, probability density, and hazard) and their standard errors at each time interval the minimum recommended sample size is 30. To index Distribution Fitting General Introduction Estimation Goodness-of-fit Plots Text Mining http://statsoft.com/textbook/stathome.html Introductory Overview Some Typical Applications for Text Mining Approaches to Text Mining Issues and Considerations for "Numericizing" Text Transforming Word Frequencies Latent Semantic Indexing via Singular Value Decomposition Incorporating Text Mining Results in Data Mining Projects -------------------------------------------------------------------------------- Text Mining Introductory Overview The purpose of Text Mining is to process unstructured (textual) information, extract meaningful numeric indices from the text, and, thus, make the information contained in the text accessible to the various data mining (statistical and machine learning) algorithms. 
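Returning briefly to the life table statistics defined above (before continuing with text mining), here is a minimal sketch of how those quantities might be computed interval by interval; the interval counts are invented, and the formulas simply follow the definitions given in this section.

```python
# Life table sketch for grouped survival times (hypothetical counts).
# Follows the definitions above: cases at risk, proportion failing/surviving,
# cumulative proportion surviving, probability density, and hazard rate.
intervals = [  # (interval width, number entering, number failing, number censored)
    (12.0, 300, 45, 10),
    (12.0, 245, 30, 15),
    (12.0, 200, 22, 20),
]

cum_surv = 1.0   # cumulative proportion surviving at the start of the first interval
start = 0.0
for width, entering, failed, censored in intervals:
    at_risk = entering - censored / 2.0                   # number of cases at risk
    p_fail = failed / at_risk                             # proportion failing
    p_surv = 1.0 - p_fail                                 # proportion surviving
    density = cum_surv * p_fail / width                   # (P_i - P_{i+1}) / h_i
    hazard = failed / (width * (at_risk - failed / 2.0))  # failures per unit time / mid-interval survivors
    print(f"[{start:5.1f}, {start + width:5.1f}): at risk = {at_risk:6.1f}  "
          f"cum. surviving = {cum_surv:.3f}  density = {density:.4f}  hazard = {hazard:.4f}")
    cum_surv *= p_surv                                    # update the survival function
    start += width
```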
Information can be extracted to derive summaries for the words contained in the documents or to compute summaries for the documents based on the words contained in them. Hence, you can analyze words, clusters of words used in documents, etc., or you could analyze documents and determine similarities between them or how they are related to other variables of interest in the data mining project. In the most general terms, text mining will "turn text into numbers" (meaningful indices), which can then be incorporated in other analyses such as predictive data mining projects, the application of unsupervised learning methods (clustering), etc. These methods are described and discussed in great detail in the comprehensive overview work by Manning and Schьtze (2002), and for an in-depth treatment of these and related topics as well as the history of this approach to text mining, we highly recommend that source. Some Typical Applications for Text Mining Unstructured text is very common, and in fact may represent the majority of information available to a particular research or data mining project. Analyzing open-ended survey responses. In survey research (e.g., marketing), it is not uncommon to include various open-ended questions pertaining to the topic under investigation. The idea is to permit respondents to express their "views" or opinions without constraining them to particular dimensions or a particular response format. This may yield insights into customers'' views and opinions that might otherwise not be discovered when relying solely on structured questionnaires designed by "experts." For example, you may discover a certain set of words or terms that are commonly used by respondents to describe the pro''s and con''s of a product or service (under investigation), suggesting common misconceptions or confusion regarding the items in the study. Automatic processing of messages, emails, etc. Another common application for text mining is to aid in the automatic classification of texts. For example, it is possible to "filter" out automatically most undesirable "junk email" based on certain terms or words that are not likely to appear in legitimate messages, but instead identify undesirable electronic mail. In this manner, such messages can automatically be discarded. Such automatic systems for classifying electronic messages can also be useful in applications where messages need to be routed (automatically) to the most appropriate department or agency; e.g., email messages with complaints or petitions to a municipal authority are automatically routed to the appropriate departments; at the same time, the emails are screened for inappropriate or obscene messages, which are automatically returned to the sender with a request to remove the offending words or content. Analyzing warranty or insurance claims, diagnostic interviews, etc. In some business domains, the majority of information is collected in open-ended, textual form. For example, warranty claims or initial medical (patient) interviews can be summarized in brief narratives, or when you take your automobile to a service station for repairs, typically, the attendant will write some notes about the problems that you report and what you believe needs to be fixed. Increasingly, those notes are collected electronically, so those types of narratives are readily available for input into text mining algorithms. 
This information can then be usefully exploited to, for example, identify common clusters of problems and complaints on certain automobiles, etc. Likewise, in the medical field, open-ended descriptions by patients of their own symptoms might yield useful clues for the actual medical diagnosis. Investigating competitors by crawling their web sites. Another type of potentially very useful application is to automatically process the contents of Web pages in a particular domain. For example, you could go to a Web page, and begin "crawling" the links you find there to process all Web pages that are referenced. In this manner, you could automatically derive a list of terms and documents available at that site, and hence quickly determine the most important terms and features that are described. It is easy to see how these capabilities could efficiently deliver valuable business intelligence about the activities of competitors. Approaches to Text Mining To reiterate, text mining can be summarized as a process of "numericizing" text. At the simplest level, all words found in the input documents will be indexed and counted in order to compute a table of documents and words, i.e., a matrix of frequencies that enumerates the number of times that each word occurs in each document. This basic process can be further refined to exclude certain common words such as "the" and "a" (stop word lists) and to combine different grammatical forms of the same words such as "traveling," "traveled," "travel," etc. (stemming). However, once a table of (unique) words (terms) by documents has been derived, all standard statistical and data mining techniques can be applied to derive dimensions or clusters of words or documents, or to identify "important" words or terms that best predict another outcome variable of interest. Using well-tested methods and understanding the results of text mining. Once a data matrix has been computed from the input documents and words found in those documents, various well-known analytic techniques can be used for further processing those data including methods for clustering, factoring, or predictive data mining (see, for example, Manning and Schьtze, 2002). "Black-box" approaches to text mining and extraction of concepts. There are text mining applications which offer "black-box" methods to extract "deep meaning" from documents with little human effort (to first read and understand those documents). These text mining applications rely on proprietary algorithms for presumably extracting "concepts" from text, and may even claim to be able to summarize large numbers of text documents automatically, retaining the core and most important meaning of those documents. While there are numerous algorithmic approaches to extracting "meaning from documents," this type of technology is very much still in its infancy, and the aspiration to provide meaningful automated summaries of large numbers of documents may forever remain elusive. We urge skepticism when using such algorithms because 1) if it is not clear to the user how those algorithms work, it cannot possibly be clear how to interpret the results of those algorithms, and 2) the methods used in those programs are not open to scrutiny, for example by the academic community and peer review and, hence, one simply doesn''t know how well they might perform in different domains. 
As a final thought on this subject, you may consider this concrete example: Try the various automated translation services available via the Web that can translate entire paragraphs of text from one language into another. Then translate some text, even simple text, from your native language to some other language and back, and review the results. Almost every time, the attempt to translate even short sentences to other languages and back while retaining the original meaning of the sentence produces humorous rather than accurate results. This illustrates the difficulty of automatically interpreting the meaning of text. Text mining as document search. There is another type of application that is often described and referred to as "text mining" - the automatic search of large numbers of documents based on key words or key phrases. This is the domain of, for example, the popular internet search engines that have been developed over the last decade to provide efficient access to Web pages with certain content. While this is obviously an important type of application with many uses in any organization that needs to search very large document repositories based on varying criteria, it is very different from what has been described here. Issues and Considerations for "Numericizing" Text Large numbers of small documents vs. small numbers of large documents. Examples of scenarios using large numbers of small or moderate sized documents were given earlier (e.g., analyzing warranty or insurance claims, diagnostic interviews, etc.). On the other hand, if your intent is to extract "concepts" from only a few documents that are very large (e.g., two lengthy books), then statistical analyses are generally less powerful because the "number of cases" (documents) in this case is very small while the "number of variables" (extracted words) is very large. Excluding certain characters, short words, numbers, etc. Excluding numbers, certain characters, or sequences of characters, or words that are shorter or longer than a certain number of letters can be done before the indexing of the input documents starts. You may also want to exclude "rare words," defined as those that only occur in a small percentage of the processed documents. Include lists, exclude lists (stop-words). Specific list of words to be indexed can be defined; this is useful when you want to search explicitly for particular words, and classify the input documents based on the frequencies with which those words occur. Also, "stop-words," i.e., terms that are to be excluded from the indexing can be defined. Typically, a default list of English stop words includes "the", "a", "of", "since," etc, i.e., words that are used in the respective language very frequently, but communicate very little unique information about the contents of the document. Synonyms and phrases. Synonyms, such as "sick" or "ill", or words that are used in particular phrases where they denote unique meaning can be combined for indexing. For example, "Microsoft Windows" might be such a phrase, which is a specific reference to the computer operating system, but has nothing to do with the common use of the term "Windows" as it might, for example, be used in descriptions of home improvement projects. Stemming algorithms. An important pre-processing step before indexing of input documents begins is the stemming of words. 
The term "stemming" refers to the reduction of words to their roots so that, for example, different grammatical forms or declinations of verbs are identified and indexed (counted) as the same word. For example, stemming will ensure that both "traveling" and "traveled" will be recognized by the text mining program as the same word. Support for different languages. Stemming, synonyms, the letters that are permitted in words, etc. are highly language dependent operations. Therefore, support for different languages is important. Transforming Word Frequencies Once the input documents have been indexed and the initial word frequencies (by document) computed, a number of additional transformations can be performed to summarize and aggregate the information that was extracted. Log-frequencies. First, various transformations of the frequency counts can be performed. The raw word or term frequencies generally reflect on how salient or important a word is in each document. Specifically, words that occur with greater frequency in a document are better descriptors of the contents of that document. However, it is not reasonable to assume that the word counts themselves are proportional to their importance as descriptors of the documents. For example, if a word occurs 1 time in document A, but 3 times in document B, then it is not necessarily reasonable to conclude that this word is 3 times as important a descriptor of document B as compared to document A. Thus, a common transformation of the raw word frequency counts (wf) is to compute: f(wf) = 1+ log(wf), for wf > 0 This transformation will "dampen" the raw frequencies and how they will affect the results of subsequent computations. Binary frequencies. Likewise, an even simpler transformation can be used that enumerates whether a term is used in a document; i.e.: f(wf) = 1, for wf > 0 The resulting documents-by-words matrix will contain only 1s and 0s to indicate the presence or absence of the respective words. Again, this transformation will dampen the effect of the raw frequency counts on subsequent computations and analyses. Inverse document frequencies. Another issue that you may want to consider more carefully and reflect in the indices used in further analyses are the relative document frequencies (df) of different words. For example, a term such as "guess" may occur frequently in all documents, while another term such as "software" may only occur in a few. The reason is that one might make "guesses" in various contexts, regardless of the specific topic, while "software" is a more semantically focused term that is only likely to occur in documents that deal with computer software. 
A common and very useful transformation that reflects both the specificity of words (document frequencies) as well as the overall frequencies of their occurrences (word frequencies) is the so-called inverse document frequency. For the i-th word and j-th document it is commonly computed as idf(i,j) = 0 when wf(i,j) = 0, and idf(i,j) = (1 + log(wf(i,j))) * log(N/df(i)) when wf(i,j) >= 1, where N is the total number of documents and df(i) is the number of documents in which the i-th word occurs; this weight is small for words that appear in most documents and large for words that occur frequently in only a few documents. Time Series Analysis http://statsoft.com/textbook/stathome.html http://www.statsoft.com/TEXTBOOK/sttimser.html General Introduction Two Main Goals Identifying Patterns in Time Series Data Systematic pattern and random noise Two general aspects of time series patterns Trend Analysis Analysis of Seasonality ARIMA (Box & Jenkins) and Autocorrelations General Introduction Two Common Processes ARIMA Methodology Identification Phase Parameter Estimation Evaluation of the Model Interrupted Time Series Exponential Smoothing General Introduction Simple Exponential Smoothing Choosing the Best Value for Parameter a (alpha) Indices of Lack of Fit (Error) Seasonal and Non-seasonal Models With or Without Trend Seasonal Decomposition (Census I) General Introduction Computations X-11 Census method II seasonal adjustment Seasonal Adjustment: Basic Ideas and Terms The Census II Method Results Tables Computed by the X-11 Method Specific Description of all Results Tables Computed by the X-11 Method Distributed Lags Analysis General Purpose General Model Almon Distributed Lag Single Spectrum (Fourier) Analysis Cross-spectrum Analysis General Introduction Basic Notation and Principles Results for Each Variable The Cross-periodogram, Cross-density, Quadrature-density, and Cross-amplitude Squared Coherency, Gain, and Phase Shift How the Example Data were Created Spectrum Analysis - Basic Notations and Principles Frequency and Period The General Structural Model A Simple Example Periodogram The Problem of Leakage Padding the Time Series Tapering Data Windows and Spectral Density Estimates Preparing the Data for Analysis Results when no Periodicity in the Series Exists Fast Fourier Transformations General Introduction Computation of FFT in Time Series -------------------------------------------------------------------------------- In the following topics, we will first review techniques used to identify patterns in time series data (such as smoothing and curve fitting techniques and autocorrelations), then we will introduce a general class of models that can be used to represent time series data and generate predictions (autoregressive and moving average models). Finally, we will review some simple but commonly used modeling and forecasting techniques based on linear regression. For more information on these topics, see the topic name below. General Introduction In the following topics, we will review techniques that are useful for analyzing time series data, that is, sequences of measurements that follow non-random orders. Unlike the analyses of random samples of observations that are discussed in the context of most other statistics, the analysis of time series is based on the assumption that successive values in the data file represent consecutive measurements taken at equally spaced time intervals. Detailed discussions of the methods described in this section can be found in Anderson (1976), Box and Jenkins (1976), Kendall (1984), Kendall and Ord (1990), Montgomery, Johnson, and Gardiner (1990), Pankratz (1983), Shumway (1988), Vandaele (1983), Walker (1991), and Wei (1989).
Two Main Goals There are two main goals of time series analysis: (a) identifying the nature of the phenomenon represented by the sequence of observations, and (b) forecasting (predicting future values of the time series variable). Both of these goals require that the pattern of observed time series data is identified and more or less formally described. Once the pattern is established, we can interpret and integrate it with other data (i.e., use it in our theory of the investigated phenomenon, e.g., sesonal commodity prices). Regardless of the depth of our understanding and the validity of our interpretation (theory) of the phenomenon, we can extrapolate the identified pattern to predict future events. To index -------------------------------------------------------------------------------- Identifying Patterns in Time Series Data Systematic pattern and random noise Two general aspects of time series patterns Trend Analysis Analysis of Seasonality For more information on simple autocorrelations (introduced in this section) and other auto correlations, see Anderson (1976), Box and Jenkins (1976), Kendall (1984), Pankratz (1983), and Vandaele (1983). See also: ARIMA (Box & Jenkins) and Autocorrelations Interrupted Time Series Exponential Smoothing Seasonal Decomposition (Census I) X-11 Census method II seasonal adjustment X-11 Census method II result tables Distributed Lags Analysis Single Spectrum (Fourier) Analysis Cross-spectrum Analysis Basic Notations and Principles Fast Fourier Transformations Systematic Pattern and Random Noise As in most other analyses, in time series analysis it is assumed that the data consist of a systematic pattern (usually a set of identifiable components) and random noise (error) which usually makes the pattern difficult to identify. Most time series analysis techniques involve some form of filtering out noise in order to make the pattern more salient Variance Components and Mixed Model http://statsoft.com/textbook/stathome.html Basic Ideas Properties of Random Effects Estimation of Variance Components (Technical Overview) Estimating the Variation of Random Factors Estimating Components of Variation Testing the Significance of Variance Components Estimating the Population Intraclass Correlation -------------------------------------------------------------------------------- The Variance Components and Mixed Model ANOVA/ANCOVA chapter describes a comprehensive set of techniques for analyzing research designs that include random effects; however, these techniques are also well suited for analyzing large main effect designs (e.g., designs with over 200 levels per factor), designs with many factors where the higher order interactions are not of interest, and analyses involving case weights. There are several chapters in this textbook that will discuss Analysis of Variance for factorial or specialized designs. For a discussion of these chapters and the types of designs for which they are best suited refer to the section on Methods for Analysis of Variance. Note, however, that the General Linear Models chapter describes how to analyze designs with any number and type of between effects and compute ANOVA-based variance component estimates for any effect in a mixed-model analysis. -------------------------------------------------------------------------------- Basic Ideas Experimentation is sometimes mistakenly thought to involve only the manipulation of levels of the independent variables and the observation of subsequent responses on the dependent variables. 
Independent variables whose levels are determined or set by the experimenter are said to have fixed effects. There is a second class of effects, however, which is often of great interest to the researcher, Random effects are classification effects where the levels of the effects are assumed to be randomly selected from an infinite population of possible levels. Many independent variables of research interest are not fully amenable to experimental manipulation, but nevertheless can be studied by considering them to have random effects. For example, the genetic makeup of individual members of a species cannot at present be (fully) experimentally manipulated, yet it is of great interest to the geneticist to assess the genetic contribution to individual variation on outcomes such as health, behavioral characteristics, and the like. As another example, a manufacturer might want to estimate the components of variation in the characteristics of a product for a random sample of machines operated by a random sample of operators. The statistical analysis of random effects is accomplished by using the random effect model, if all of the independent variables are assumed to have random effects, or by using the mixed model, if some of the independent variables are assumed to have random effects and other independent variables are assumed to have fixed effects. Properties of random effects. To illustrate some of the properties of random effects, suppose you collected data on the amount of insect damage done to different varieties of wheat. It is impractical to study insect damage for every possible variety of wheat, so to conduct the experiment, you randomly select four varieties of wheat to study. Plant damage is rated for up to a maximum of four plots per variety. Ratings are on a 0 (no damage) to 10 (great damage) scale. To determine the components of variation in resistance to insect damage for Variety and Plot, an ANOVA can first be performed. Perhaps surprisingly, in the ANOVA, Variety can be treated as a fixed or as a random factor without influencing the results (provided that Type I Sums of squares are used and that Variety is always entered first in the model). The Spreadsheet below shows the ANOVA results of a mixed model analysis treating Variety as a fixed effect and ignoring Plot, i.e., treating the plot-to-plot variation as a measure of random error. As can be seen, the difference in the two sets of estimates is that a variance component is estimated for Variety only when it is considered to be a random effect. This reflects the basic distinction between fixed and random effects. The variation in the levels of random factors is assumed to be representative of the variation of the whole population of possible levels. Thus, variation in the levels of a random factor can be used to estimate the population variation. Even more importantly, covariation between the levels of a random factor and responses on a dependent variable can be used to estimate the population component of variance in the dependent variable attributable to the random factor. The variation in the levels of fixed factors is instead considered to be arbitrarily determined by the experimenter (i.e., the experimenter can make the levels of a fixed factor vary as little or as much as desired). Thus, the variation of a fixed factor cannot be used to estimate its population variance, nor can the population covariance with the dependent variable be meaningfully estimated. 
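As a sketch of the kind of variance component estimation described here, the code below analyzes invented damage ratings for a balanced layout like the wheat example above, using the standard ANOVA (expected mean squares) estimator for a one-way random effect; it is an illustration under those assumptions, not the chapter's own analysis.

```python
# Variance components sketch: balanced one-way random effect (Variety),
# ANOVA method-of-moments estimates from invented damage ratings (0-10 scale).
ratings = {                      # 4 randomly chosen varieties, 4 plots each
    "A": [3, 4, 5, 3],
    "B": [6, 7, 6, 8],
    "C": [2, 3, 2, 4],
    "D": [5, 5, 6, 4],
}

k = len(ratings)                               # number of varieties
n = len(next(iter(ratings.values())))          # plots per variety (balanced design)
grand = sum(sum(v) for v in ratings.values()) / (k * n)

ss_between = n * sum((sum(v) / n - grand) ** 2 for v in ratings.values())
ss_within = sum((x - sum(v) / n) ** 2 for v in ratings.values() for x in v)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (k * (n - 1))

var_error = ms_within                                    # plot-to-plot (error) component
var_variety = max(0.0, (ms_between - ms_within) / n)     # between-variety component

print(f"MS between = {ms_between:.2f}, MS within = {ms_within:.2f}")
print(f"estimated variance components: Variety = {var_variety:.2f}, Error = {var_error:.2f}")
```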
With this basic distinction between fixed effects and random effects in mind, we now can look more closely at the properties of variance components. STATISTICS GLOSSARY http://statsoft.com/textbook/glosfra.html Distribution Tables http://statsoft.com/textbook/stathome.html Compared to probability calculators (e.g., the one included in STATISTICA), the traditional format of distribution tables such as those presented below has the advantage of showing many values simultaneously and, thus, enables the user to examine and quickly explore ranges of probabilities. -------------------------------------------------------------------------------- Z Table t Table Chi-Square Table F Tables for: alpha=.10 alpha=.05 alpha=.025 alpha=.01 Note that all table values were calculated using the distribution facilities in STATISTICA BASIC, and they were verified against other published tables. REFERENCES CITED http://statsoft.com/textbook/stathome.html Video-tutorial http://statsoft.com/support/download/video-tutorials/ Descriptive Statistics and Exploratory Analysis http://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture2.pdf What is descriptive statistics and exploratory data analysis? • Basic numerical summaries of data • Basic graphical summaries of data • How to use R for calculating descriptive statistics and making graphs Before making inferences from data it is essential to examine all your variables. Why? To listen to the data: - to catch mistakes - to see patterns in the data - to find violations of statistical assumptions - to generate hypotheses …and because if you don't, you will have trouble later Dimensionality of Data Sets • Univariate: Measurement made on one variable per subject • Bivariate: Measurement made on two variables per subject • Multivariate: Measurement made on many variables per subject Numerical Summaries of Data • Central Tendency measures. They are computed to give a "center" around which the measurements in the data are distributed. • Variation or Variability measures. They describe "data spread" or how far away the measurements are from the center. • Relative Standing measures. They describe the relative position of specific measurements in the data. Location: Mean The Mean To calculate the average of a set of observations, add their values and divide by the number of observations. Other Types of Means - Weighted, Trimmed, Geometric, Harmonic Location: Median • Median – the exact middle value • Calculation: - If there is an odd number of observations, find the middle value - If there is an even number of observations, find the middle two values and average them Which Location Measure Is Best? (figure: two example data sets plotted on a 0-10 scale; the first has Mean = 3 and Median = 3, the second has Mean = 4 and Median = 3) • Mean is best for symmetric distributions without outliers • Median is useful for skewed distributions or data with outliers Scale: Variance • Average of squared deviations of values from the mean Why Squared Deviations? • Adding raw deviations from the mean will always yield a sum of 0 • Absolute values do not have nice mathematical properties • Squares eliminate the negatives • Result: increasing contribution to the variance as you go farther from the mean Scale: Standard Deviation • Variance is somewhat arbitrary • What does it mean to have a variance of 10.8? Or 2.2? Or 1459.092? Or 0.000001? • Nothing. But if you could "standardize" that value, you could talk about any variance (i.e., deviation) in equivalent terms.
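A minimal sketch of these basic numerical summaries using only the Python standard library; the data are invented, and the quartile convention used by statistics.quantiles is just one of several in common use.

```python
# Location and scale summaries for a small invented sample.
from statistics import mean, median, variance, stdev, quantiles

data = [2, 4, 4, 5, 7, 9, 12, 13, 15, 48]    # note the outlier at 48

print("mean   =", mean(data))                # pulled upward by the outlier
print("median =", median(data))              # robust to the outlier
print("sample variance (n-1 denominator) =", round(variance(data), 2))
print("standard deviation =", round(stdev(data), 2))

q1, q2, q3 = quantiles(data, n=4)            # quartiles; Q2 equals the median
print("Q1, Q2, Q3 =", q1, q2, q3, " IQR =", q3 - q1)
```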
• Standard deviations are simply the square root of the variance Scale: Standard Deviation 1. Score (in the units that are meaningful) 2. Mean 3. Each score's deviation from the mean 4. Square that deviation 5. Sum all the squared deviations (Sum of Squares) 6. Divide by n-1 7. Square root – now the value is in the units we started with! Interesting Theoretical Result At least (1 - 1/1²) = 0% of the values must lie within k=1 standard deviation of the mean (μ ± 1σ), (1 - 1/2²) = 75% within k=2 (μ ± 2σ), and (1 - 1/3²) ≈ 89% within k=3 (μ ± 3σ). (Note the use of σ (sigma) to represent "standard deviation" and μ (mu) to represent "mean".) • Regardless of how the data are distributed, a certain percentage of values must fall within k standard deviations from the mean (this is Chebyshev's theorem) Often We Can Do Better For many lists of observations – especially if their histogram is bell-shaped 1. Roughly 68% of the observations in the list lie within 1 standard deviation of the average 2. 95% of the observations lie within 2 standard deviations of the average Scale: Quartiles and IQR The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger • Q2 is the same as the median (50% are smaller, 50% are larger) • Only 25% of the observations are greater than the third quartile Percentiles (aka Quantiles) In general the nth percentile is a value such that n% of the observations fall at or below it: Q1 = 25th percentile, Q2 (median) = 50th percentile, Q3 = 75th percentile Univariate Data: Histograms and Bar Plots • What's the difference between a histogram and a bar plot? Bar plot: used for categorical variables to show the frequency or proportion in each category; translates the data from frequency tables into a pictorial representation. Histogram: used to visualize the distribution (shape, center, range, variation) of continuous variables; "bin size" is important. More on Histograms • What's the difference between a frequency histogram and a density histogram? Bivariate Data Multivariate Data Clustering • Organize units into clusters • Descriptive, not inferential • Many approaches • "Clusters" always produced Data Reduction Approaches (PCA) • Reduce an n-dimensional dataset into a much smaller number of dimensions • Finds a new (smaller) set of variables that retains most of the information in the total sample • Effective way to visualize multivariate data Descriptive Statistics and Exploratory Analysis http://www.iasri.res.in/ebook/EB_SMAR/e-book_pdf%20files/Manual%20II/1-Descriptive%20Statistics.pdf Descriptive Statistics http://www.utcomchatt.org/docs/Descriptive_Statistics_1142008.pdf Statistical Theory & Methods and Applied Statistics http://www.learn.colostate.edu/courses/STAT/STAT523.dot STAT 523 - Quantitative Spatial Analysis Techniques in spatial analysis: point pattern analysis, spatial autocorrelation, trend surface and spectral analysis. STAT 501 - Statistical Science Overview of statistics: theory; use in agriculture, business, environment, engineering; modeling; computing; statisticians as researchers/consultants STAT 511 - Design and Data Analysis for Researchers I Statistical methods for experimenters and researchers emphasizing design and analysis of experiments. STAT 512 - Design and Data Analysis for Researchers II Statistical methods for experimenters and researchers emphasizing design and analysis of experiments.
Descriptive Statistics and Exploratory Analysis http://www.iasri.res.in/ebook/EB_SMAR/e-book_pdf%20files/Manual%20II/1-Descriptive%20Statistics.pdf
Descriptive Statistics http://www.utcomchatt.org/docs/Descriptive_Statistics_1142008.pdf
Statistical Theory & Methods and Applied Statistics http://www.learn.colostate.edu/courses/STAT/STAT523.dot
STAT 523 - Quantitative Spatial Analysis. Techniques in spatial analysis: point pattern analysis, spatial autocorrelation, trend surface and spectral analysis.
STAT 501 - Statistical Science. Overview of statistics: theory; use in agriculture, business, environment, engineering; modeling; computing; statisticians as researchers/consultants.
STAT 511 - Design and Data Analysis for Researchers I. Statistical methods for experimenters and researchers emphasizing design and analysis of experiments.
STAT 512 - Design and Data Analysis for Researchers II. Statistical methods for experimenters and researchers emphasizing design and analysis of experiments.
STAT 520 - Introduction to Probability Theory. Probability, random variables, distributions, expectations, generating functions, limit theorems, convergence, random processes.
STAT 521 - Stochastic Processes I. Characterization of stochastic processes, Markov chains in discrete and continuous time, branching processes, renewal theory, Brownian motion.
STAT 525 - Analysis of Time Series. Trend and seasonality, stationary processes, Hilbert space techniques, spectral distribution function, fitting ARIMA models, linear prediction. Spectral analysis; the periodogram; spectral estimation techniques; multivariate time series; linear systems and optimal control; Kalman filtering and prediction.
STAT 530 - Mathematical Statistics. Sampling distributions, estimation, testing, confidence intervals; exact and asymptotic theories of maximum likelihood and distribution-free methods.
STAT 540 - Data Analysis and Regression. Introduction to multiple regression and data analysis with emphasis on graphics and computing.
STAT 301 - Introduction to Statistical Methods. Techniques in statistical inference; confidence intervals, hypothesis tests, correlation and regression, analysis of variance, chi-square tests.
STAT 315 - Statistics for Engineers and Scientists. Techniques in statistical inference; confidence intervals, hypothesis tests, correlation and regression, analysis of variance, chi-square tests.
STAT 460 - Applied Multivariate Analysis. Principles for multivariate estimation and testing; multivariate analysis of variance, discriminant analysis; principal components, factor analysis.
STAT 501 - Statistical Science. Model building and decision making; communication of statistical information.
STAT 457 - Statistics for Environmental Monitoring. Applications of statistics in environmental pollution studies involving air, water, or soil monitoring; sampling designs; trend analysis; censored data.
STAT 560 - Applied Multivariate Analysis. Multivariate analysis of variance; principal components; factor analysis; discriminant analysis; cluster analysis.
STAT 570 - Nonparametric Statistics. Distribution and uses of order statistics; nonparametric inferential techniques, their uses and mathematical properties.
STAT 600 - Statistical Computing. Statistical packages; graphical data presentation; model fitting and diagnostics; random numbers; simulation; numerical methods in statistics.
STAT 605 - Theory of Sampling Techniques. Survey designs; simple random, stratified, cluster samples; theory of estimation; optimization techniques for minimum variance or costs.
STAT 640 - Design and Linear Modeling. Introduction to linear models; experimental design; fixed, random, and mixed models. Mixed factorials; response surface methodology; Taguchi methods; variance components.
STAT 645 - Categorical Data Analysis and GLIM. Generalized linear models, binary and polytomous data, log linear models, quasilikelihood models, survival data models.
STAT 675 - Bayesian Statistics. Bayesian inference and theory, hierarchical models, Markov chain Monte Carlo theory and methods, model criticism and selection, hierarchical regression and generalized linear models, and other topics.
The chi-squared distribution http://www.colby.edu/biology/BI17x/freq.html
Using probability theory, statisticians have devised a way to determine whether a frequency distribution differs from the expected distribution. To use this chi-square test, we first have to calculate chi-squared: chi-squared = sum over all classes of (observed - expected)² / expected. We have two classes to consider in this example, heads and tails: suppose 200 coin tosses produce 108 heads and 92 tails, while a fair coin would be expected to produce 100 of each.
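As a quick check on the arithmetic, R's built-in goodness-of-fit test reproduces the same value for this coin example (observed counts of 108 heads and 92 tails against a fair-coin expectation):

    observed <- c(heads = 108, tails = 92)   # observed counts from 200 tosses
    chisq.test(observed, p = c(0.5, 0.5))    # expected counts of 100 and 100
    # reports X-squared = 1.28 on 1 degree of freedom (p is roughly 0.26)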
Then chi-squared = (108 - 100)²/100 + (92 - 100)²/100 = (8)²/100 + (-8)²/100 = 0.64 + 0.64 = 1.28.
Pearson's chi-square test http://en.wikipedia.org/wiki/Pearson's_chi-square_test
Pearson's chi-square (χ²) test is the best-known of several chi-square tests – statistical procedures whose results are evaluated by reference to the chi-square distribution. Its properties were first investigated by Karl Pearson. In contexts where it is important to make a distinction between the test statistic and its distribution, names such as "Pearson X-squared test" or "Pearson X-squared statistic" are used. It tests a null hypothesis that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution. The events considered must be mutually exclusive and have total probability 1. A common case is where the events each cover an outcome of a categorical variable. A simple example is the hypothesis that an ordinary six-sided die is "fair", i.e., all six outcomes are equally likely to occur. Pearson's chi-square is the original and most widely used chi-square test.
Chi-square test http://en.wikipedia.org/wiki/Chi-square_test
A chi-square test (also chi-squared or χ² test) is any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-square distribution when the null hypothesis is true, or any in which this is asymptotically true, meaning that the sampling distribution (if the null hypothesis is true) can be made to approximate a chi-square distribution as closely as desired by making the sample size large enough. Some examples of chi-square tests where the chi-square distribution is only approximately valid:
• Pearson's chi-square test, also known as the chi-square goodness-of-fit test or chi-square test for independence; when "chi-square test" is mentioned without any modifiers or other precluding context, this test is usually the one understood (for an exact test used in place of χ², see Fisher's exact test).
• Yates' chi-square test, also known as Yates' correction for continuity.
• Mantel-Haenszel chi-square test.
• Linear-by-linear association chi-square test.
• The portmanteau test in time-series analysis, testing for the presence of autocorrelation.
• Likelihood-ratio tests in general statistical modelling, for testing whether there is evidence of the need to move from a simple model to a more complicated one (where the simple model is nested within the complicated one).
One case where the distribution of the test statistic is an exact chi-square distribution is the test that the variance of a normally distributed population has a given value, based on a sample variance. Such a test is uncommon in practice because values of variances to test against are seldom known exactly.
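For that exact case, a minimal R sketch, under the assumptions of a normal sample and a two-sided alternative (the sample and the hypothesized variance below are made up for illustration):

    set.seed(2)
    x <- rnorm(30, mean = 0, sd = 2)          # made-up sample of size n = 30
    sigma0_sq <- 4                            # hypothesized population variance
    n <- length(x)

    stat <- (n - 1) * var(x) / sigma0_sq      # exactly chi-square with n - 1 df under H0
    p_two_sided <- min(1, 2 * min(pchisq(stat, df = n - 1),
                                  pchisq(stat, df = n - 1, lower.tail = FALSE)))
    c(statistic = stat, p.value = p_two_sided)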
Chi-Square Procedures for the Analysis of Categorical Frequency Data http://faculty.vassar.edu/lowry/PDF/c8p1.pdf ch8
Introduction to Procedures Involving Sample Means http://faculty.vassar.edu/lowry/PDF/c9p1.pdf ch9
Basic Concepts of Probability http://faculty.vassar.edu/lowry/PDF/c5p1.pdf ch5
Introduction to Linear Correlation and Regression http://faculty.vassar.edu/lowry/PDF/c3p1.pdf ch3
Distributions http://faculty.vassar.edu/lowry/PDF/c2p1.pdf ch2
Principles of Measurement http://faculty.vassar.edu/lowry/PDF/c1p1.pdf ch1
One-Way Analysis of Variance for Correlated Samples http://faculty.vassar.edu/lowry/PDF/c15p1.pdf ch15
One-Way Analysis of Variance for Independent Samples http://faculty.vassar.edu/lowry/PDF/c14p1.pdf ch14
Two-Way Analysis of Variance for Independent Samples http://faculty.vassar.edu/lowry/PDF/c16p1.pdf ch16
One-Way Analysis of Covariance for Independent Samples http://faculty.vassar.edu/lowry/PDF/c17p1.pdf ch17
TEACHING STATISTICS COURSE OUTLINE: Introduction, Review of Algebra, Measurement, Frequency Distributions, The Normal Curve, Statistics, First Test, Interpretation of Scores, Regression, Correlation, Second Test, Logic of Inferential Statistics, The Sampling Distribution, Some Hypothesis Tests, The t-tests, Additional Topics, Final. http://www.psychstat.missouristate.edu/introbook/sbk01.htm - http://www.psychstat.missouristate.edu/introbook/sbk29.htm