All About Statistics  

Multiple Regression
http://www.statsoft.com/TEXTBOOK/stmulreg.html
http://statsoft.com/textbook/stathome.html

General Purpose
Computational Approach
Least Squares
The Regression Equation
Unique Prediction and Partial Correlation
Predicted and Residual Scores
Residual Variance and R-square
Interpreting the Correlation Coefficient R
Assumptions, Limitations, and Practical Considerations
Assumption of Linearity
Normality Assumption
Limitations
Choice of the number of variables
Multicollinearity and matrix ill-conditioning
Fitting centered polynomial models
The importance of residual analysis

Elementary Concepts in Statistics
http://www.statsoft.com/TEXTBOOK/esc.html

Overview of Elementary Concepts in Statistics. In this introduction, we will briefly discuss those elementary statistical concepts that provide the necessary foundations for more specialized expertise in any area of statistical data analysis. The selected topics illustrate the basic assumptions of most statistical methods and/or have been demonstrated in research to be necessary components of one's general understanding of the "quantitative nature" of reality (Nisbett, et al., 1987). Because of space limitations, we will focus mostly on the functional aspects of the concepts discussed and the presentation will be very short. Further information on each of those concepts can be found in statistical textbooks. Recommended introductory textbooks are: Kachigan (1986), and Runyon and Haber (1976); for a more advanced discussion of elementary theory and assumptions of statistics, see the classic books by Hays (1988), and Kendall and Stuart (1979).
--------------------------------------------------------------------------------
What are variables?
Correlational vs. experimental research
Dependent vs. independent variables
Measurement scales
Relations between variables
Why relations between variables are important
Two basic features of every relation between variables
What is "statistical significance" (p-value)
How to determine that a result is "really" significant
Statistical significance and the number of analyses performed
Strength vs. reliability of a relation between variables
Why stronger relations between variables are more significant
Why significance of a relation between variables depends on the size of the sample
Example: "Baby boys to baby girls ratio"
Why small relations can be proven significant only in large samples
Can "no relation" be a significant result?
How to measure the magnitude (strength) of relations between variables
Common "general format" of most statistical tests
How the "level of statistical significance" is calculated
Why the "Normal distribution" is important
Illustration of how the normal distribution is used in statistical reasoning (induction)
Are all test statistics normally distributed?
How do we know the consequences of violating the normality assumption?

Basic Statistics
http://statsoft.com/textbook/stathome.html
Descriptive statistics
"True" Mean and Confidence Interval
Shape of the Distribution, Normality
Correlations
Purpose (What is Correlation?)
Simple Linear Correlation (Pearson r)
How to Interpret the Values of Correlations
Significance of Correlations
Outliers
Quantitative Approach to Outliers
Correlations in Non-homogeneous Groups
Nonlinear Relations between Variables
Measuring Nonlinear Relations
Exploratory Examination of Correlation Matrices
Casewise vs. Pairwise Deletion of Missing Data
How to Identify Biases Caused by Pairwise Deletion of Missing Data
Pairwise Deletion of Missing Data vs. Mean Substitution
Spurious Correlations
Are correlation coefficients "additive"?
How to Determine Whether Two Correlation Coefficients are Significant
t-test for independent samples
Purpose, Assumptions
Arrangement of Data
t-test graphs
More Complex Group Comparisons
t-test for dependent samples
Within-group Variation
Purpose
Assumptions
Arrangement of Data
Matrices of t-tests
More Complex Group Comparisons
Breakdown: Descriptive statistics by groups
Purpose
Arrangement of Data
Statistical Tests in Breakdowns
Other Related Data Analysis Techniques
Post-Hoc Comparisons of Means
Breakdowns vs. Discriminant Function Analysis
Breakdowns vs. Frequency Tables
Graphical breakdowns
Frequency tables
Purpose
Applications
Crosstabulation and stub-and-banner tables
Purpose and Arrangement of Table
2x2 Table
Marginal Frequencies
Column, Row, and Total Percentages
Graphical Representations of Crosstabulations
Stub-and-Banner Tables
Interpreting the Banner Table
Multi-way Tables with Control Variables
Graphical Representations of Multi-way Tables
Statistics in crosstabulation tables
Multiple responses/dichotomies

ANOVA/MANOVA
http://www.statsoft.com/TEXTBOOK/stanman.html
http://statsoft.com/textbook/stathome.html
Basic Ideas
The Partitioning of Sums of Squares
Multi-Factor ANOVA
Interaction Effects
Complex Designs
Between-Groups and Repeated Measures
Incomplete (Nested) Designs
Analysis of Covariance (ANCOVA)
Fixed Covariates
Changing Covariates
Multivariate Designs: MANOVA/MANCOVA
Between-Groups Designs
Repeated Measures Designs
Sum Scores versus MANOVA
Contrast Analysis and Post hoc Tests
Why Compare Individual Sets of Means?
Contrast Analysis
Post hoc Comparisons
Assumptions and Effects of Violating Assumptions
Deviation from Normal Distribution
Homogeneity of Variances
Homogeneity of Variances and Covariances
Sphericity and Compound Symmetry
Methods for Analysis of Variance

This chapter includes a general introduction to ANOVA and a discussion of the general topics in the analysis of variance techniques, including repeated measures designs, ANCOVA, MANOVA, unbalanced and incomplete designs, contrast effects, post-hoc comparisons, assumptions, etc. For related topics, see also Variance Components (topics related to estimation of variance components in mixed model designs), Experimental Design/DOE (topics related to specialized applications of ANOVA in industrial settings), and Repeatability and Reproducibility Analysis (topics related to specialized designs for evaluating the reliability and precision of measurement systems).

See also General Linear Models, General Regression Models; to analyze nonlinear models, see Generalized Linear Models.

Association Rules
http://statsoft.com/textbook/stathome.html

--------------------------------------------------------------------------------

Association Rules Introductory Overview
Computational Procedures and Terminology
Tabular Representation of Associations
Graphical Representation of Associations
Interpreting and Comparing Results

--------------------------------------------------------------------------------

Association Rules Introductory Overview

The goal of the techniques described in this section is to detect relationships or associations between specific values of categorical variables in large data sets. This is a common task in many data mining projects as well as in text mining (a subcategory of data mining). These powerful exploratory techniques have a wide range of applications in many areas of business practice and also research - from the analysis of consumer preferences or human resource management, to the history of language. These techniques enable analysts and researchers to uncover hidden patterns in large data sets, such as "customers who order product A often also order product B or C" or "employees who said positive things about initiative X also frequently complain about issue Y but are happy with issue Z." The implementation of the so-called a-priori algorithm (see Agrawal and Swami, 1993; Agrawal and Srikant, 1994; Han and Lakshmanan, 2001; see also Witten and Frank, 2000) allows you to rapidly process huge data sets for such associations, based on predefined "threshold" values for detection.

How association rules work. The usefulness of this technique for addressing unique data mining problems is best illustrated with a simple example. Suppose you are collecting data at the check-out cash registers at a large book store. Each customer transaction is logged in a database, and consists of the titles of the books purchased by the respective customer, perhaps additional magazine titles and other gift items that were purchased, and so on. Hence, each record in the database will represent one customer (transaction), and may consist of a single book purchased by that customer, or it may consist of many (perhaps hundreds of) different items that were purchased, arranged in an arbitrary order depending on the order in which the different items (books, magazines, and so on) came down the conveyor belt at the cash register. The purpose of the analysis is to find associations between the items that were purchased, i.e., to derive association rules that identify the items and co-occurrences of different items that appear with the greatest (co-)frequencies. For example, you want to learn which books are likely to be purchased by a customer who you know has already purchased (or is about to purchase) a particular book. This type of information could then quickly be used to suggest those additional titles to the customer. You may already be "familiar" with the results of these types of analyses if you are a customer of various on-line (Web-based) retail businesses; many times when making a purchase on-line, the vendor will suggest items similar to the ones you purchased at the time of "check-out," based on rules such as "customers who buy book title A are also likely to purchase book title B," and so on.

Unique data analysis requirements. Crosstabulation tables, and in particular Multiple Response tables, can be used to analyze data of this kind. However, in cases when the number of different items (categories) in the data is very large (and not known ahead of time), and when the "factorial degree" of important association rules is not known ahead of time, these tabulation facilities may be too cumbersome to use, or simply not applicable: Consider once more the simple "bookstore" example discussed earlier. First, the number of book titles is practically unlimited. In other words, if we were to make a table where each book title represented one dimension, and the purchase of that book (yes/no) represented the classes or categories for each dimension, then the complete crosstabulation table would be huge and sparse (consisting mostly of empty cells). Alternatively, we could construct all possible two-way tables from all items available in the store; this would allow us to detect two-way associations (association rules) between items. However, the number of tables that would have to be constructed would again be huge, most of the two-way tables would be sparse, and worse, if there were any three-way association rules "hiding" in the data, we would miss them completely. The a-priori algorithm implemented in Association Rules will not only automatically detect the relationships ("cross-tabulation tables") that are important (i.e., cross-tabulation tables that are not sparse, not consisting mostly of zeros), but also determine the factorial degree of the tables that contain the important association rules.

To summarize, Association Rules will allow you to find rules of the kind If X then (likely) Y where X and Y can be single values, items, words, etc., or conjunctions of values, items, words, etc. (e.g., if (Car=Porsche and Gender=Male and Age<20) then (Risk=High and Insurance=High)). The program can be used to analyze simple categorical variables, dichotomous variables, and/or multiple response variables. The algorithm will determine association rules without requiring the user to specify the number of distinct categories present in the data, or any prior knowledge regarding the maximum factorial degree or complexity of the important associations. In a sense, the algorithm will construct cross-tabulation tables without the need to specify the number of dimensions for the tables, or the number of categories for each dimension. Hence, this technique is particularly well suited for data and text mining of huge databases.
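
To make the idea concrete, here is a minimal, self-contained Python sketch of the frequent-itemset and association-rule logic described above. It is not the implementation discussed in this section; the transactions, item names, and threshold values are invented for illustration.

    from itertools import combinations

    # Toy transaction data (hypothetical bookstore baskets).
    transactions = [
        {"book_A", "book_B", "magazine_X"},
        {"book_A", "book_B"},
        {"book_A", "book_C"},
        {"book_B", "magazine_X"},
        {"book_A", "book_B", "book_C"},
    ]

    min_support = 0.4      # itemset must appear in >= 40% of transactions
    min_confidence = 0.6   # rule X -> Y must hold in >= 60% of transactions containing X

    n = len(transactions)

    def support(itemset):
        """Fraction of transactions containing every item in `itemset`."""
        return sum(itemset <= t for t in transactions) / n

    # 1. Frequent single items.
    items = {i for t in transactions for i in t}
    frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

    # 2. Grow itemsets level by level (the a-priori idea: only extend frequent sets).
    all_frequent = list(frequent)
    k = 2
    while frequent:
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = [c for c in candidates if support(c) >= min_support]
        all_frequent.extend(frequent)
        k += 1

    # 3. Derive rules "if X then (likely) Y" with sufficient confidence.
    for itemset in all_frequent:
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                conf = support(itemset) / support(antecedent)
                if conf >= min_confidence:
                    print(f"{set(antecedent)} -> {set(consequent)} "
                          f"(support={support(itemset):.2f}, confidence={conf:.2f})")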

Boosting Trees for Regression and Classification
http://statsoft.com/textbook/stathome.html
Boosting Trees for Regression and Classification Introductory Overview
Gradient Boosting Trees
The Problem of Overfitting; Stochastic Gradient Boosting
Stochastic Gradient Boosting Trees and Classification
Large Numbers of Categories

--------------------------------------------------------------------------------
Boosting Trees for Regression and Classification Introductory Overview
The general computational approach of stochastic gradient boosting is also known by the names TreeNet (TM Salford Systems, Inc.) and MART (TM Jerill, Inc.). Over the past few years, this technique has emerged as one of the most powerful methods for predictive data mining. Some implementations of these powerful algorithms allow them to be used for regression as well as classification problems, with continuous and/or categorical predictors. Detailed technical descriptions of these methods can be found in Friedman (1999a, b) as well as Hastie, Tibshirani, & Friedman (2001).

Gradient Boosting Trees

The algorithm for Boosting Trees evolved from the application of boosting methods to regression trees. The general idea is to compute a sequence of (very) simple trees, where each successive tree is built for the prediction residuals of the preceding tree. As described in the General Classification and Regression Trees Introductory Overview, this method will build binary trees, i.e., partition the data into two samples at each split node. Now suppose that you were to limit the complexities of the trees to 3 nodes only: a root node and two child nodes, i.e., a single split. Thus, at each step of the boosting trees algorithm, a simple (best) partitioning of the data is determined, and the deviations of the observed values from the respective means (residuals for each partition) are computed. The next 3-node tree will then be fitted to those residuals, to find another partition that will further reduce the residual (error) variance for the data, given the preceding sequence of trees.
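
The following Python sketch illustrates this boosting-of-residuals idea with 3-node trees (single-split stumps). It uses scikit-learn's DecisionTreeRegressor as a convenient stand-in for the base learner; the simulated data, number of trees, and learning rate are arbitrary choices for the example, not part of any particular implementation.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(500, 2))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

    n_trees = 200
    learning_rate = 0.1                        # shrinkage applied to each tree's contribution
    prediction = np.full_like(y, y.mean())     # start from the overall mean
    trees = []

    for _ in range(n_trees):
        residuals = y - prediction                  # what the current model still misses
        stump = DecisionTreeRegressor(max_depth=1)  # 3-node tree: root plus two children
        stump.fit(X, residuals)
        prediction += learning_rate * stump.predict(X)
        trees.append(stump)

    print("final training MSE:", np.mean((y - prediction) ** 2))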

It can be shown that such "additive weighted expansions" of trees can eventually produce an excellent fit of the predicted values to the observed values, even if the specific nature of the relationships between the predictor variables and the dependent variable of interest is very complex (nonlinear in nature). Hence, the method of gradient boosting - fitting a weighted additive expansion of simple trees - represents a very general and powerful machine learning algorithm.

Canonical Analysis
http://statsoft.com/textbook/stathome.html
General Purpose
Computational Methods and Results
Assumptions
General Ideas
Sum Scores
Canonical Roots/Variates
Number of Roots
Extraction of Roots

--------------------------------------------------------------------------------

General Purpose

There are several measures of correlation to express the relationship between two or more variables. For example, the standard Pearson product moment correlation coefficient (r) measures the extent to which two variables are related; there are various nonparametric measures of relationships that are based on the similarity of ranks in two variables; Multiple Regression allows one to assess the relationship between a dependent variable and a set of independent variables; Multiple Correspondence Analysis is useful for exploring the relationships between a set of categorical variables.

Canonical Correlation is an additional procedure for assessing the relationship between variables. Specifically, this analysis allows us to investigate the relationship between two sets of variables. For example, an educational researcher may want to compute the (simultaneous) relationship between three measures of scholastic ability with five measures of success in school. A sociologist may want to investigate the relationship between two predictors of social mobility based on interviews, with actual subsequent social mobility as measured by four different indicators. A medical researcher may want to study the relationship of various risk factors to the development of a group of symptoms. In all of these cases, the researcher is interested in the relationship between two sets of variables, and Canonical Correlation would be the appropriate method of analysis.

In the following topics we will briefly introduce the major concepts and statistics in canonical correlation analysis. We will assume that you are familiar with the correlation coefficient as described in Basic Statistics, and the basic ideas of multiple regression as described in the overview section of Multiple Regression.



Computational Methods and Results

Some of the computational issues involved in canonical correlation and the major results that are commonly reported will now be reviewed.

Eigenvalues. When extracting the canonical roots, you will compute the eigenvalues. These can be interpreted as the proportion of variance accounted for by the correlation between the respective canonical variates. Note that the proportion here is computed relative to the variance of the canonical variates, that is, of the weighted sum scores of the two sets of variables; the eigenvalues do not tell how much variability is explained in either set of variables. You will compute as many eigenvalues as there are canonical roots, that is, as many as the minimum number of variables in either of the two sets.

Successive eigenvalues will be of smaller and smaller size. First, the weights that maximize the correlation of the two sum scores are computed. After this first root has been extracted, you will find the weights that produce the second largest correlation between sum scores, subject to the constraint that the next set of sum scores does not correlate with the previous one, and so on.
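
A minimal numpy sketch of this extraction, assuming simulated data for the two sets of variables: each set is whitened, and the singular values of the cross-covariance are the canonical correlations, whose squares are the eigenvalues discussed above. This follows the standard textbook formulation rather than any particular program's routine.

    import numpy as np
    from numpy.linalg import inv, svd, cholesky

    rng = np.random.default_rng(1)
    n = 300
    X = rng.normal(size=(n, 3))                                              # e.g., three ability measures
    Y = 0.6 * X[:, :2] @ rng.normal(size=(2, 5)) + rng.normal(size=(n, 5))   # e.g., five success measures

    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)

    # Whiten each set; the singular values of the whitened cross-covariance
    # are the canonical correlations.
    Lx = cholesky(Sxx)
    Ly = cholesky(Syy)
    K = inv(Lx) @ Sxy @ inv(Ly).T
    canonical_correlations = svd(K, compute_uv=False)

    eigenvalues = canonical_correlations ** 2      # one per canonical root
    print("number of roots:", min(X.shape[1], Y.shape[1]))
    print("eigenvalues (squared canonical correlations):", np.round(eigenvalues, 3))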

CHAID Analysis
http://statsoft.com/textbook/stathome.html

General CHAID Introductory Overview
Basic Tree-Building Algorithm: CHAID and Exhaustive CHAID
General Computation Issues of CHAID
CHAID, C&RT, and QUEST

--------------------------------------------------------------------------------

General CHAID Introductory Overview

The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree classification methods, originally proposed by Kass (1980; according to Ripley, 1996, the CHAID algorithm is a descendent of THAID, developed by Morgan and Messenger, 1973). CHAID will "build" non-binary trees (i.e., trees where more than two branches can attach to a single root or node), based on a relatively simple algorithm that is particularly well suited for the analysis of larger datasets. Also, because the CHAID algorithm will often effectively yield many multi-way frequency tables (e.g., when classifying a categorical response variable with many categories, based on categorical predictors with many classes), it has been particularly popular in marketing research, in the context of market segmentation studies.

Both CHAID and C&RT techniques will construct trees, where each (non-terminal) node identifies a split condition, to yield optimum prediction (of continuous dependent or response variables) or classification (for categorical dependent or response variables). Hence, both types of algorithms can be applied to analyze regression-type or classification-type problems.



Basic Tree-Building Algorithm: CHAID and Exhaustive CHAID

The acronym CHAID stands for Chi-squared Automatic Interaction Detector. This name derives from the basic algorithm that is used to construct (non-binary) trees, which for classification problems (when the dependent variable is categorical in nature) relies on the Chi-square test to determine the best next split at each step; for regression-type problems (continuous dependent variable) the program will actually compute F-tests. Specifically, the algorithm proceeds as follows:

Preparing predictors. The first step is to create categorical predictors out of any continuous predictors by dividing the respective continuous distributions into a number of categories with an approximately equal number of observations. For categorical predictors, the categories (classes) are "naturally" defined.

Merging categories. The next step is to cycle through the predictors to determine for each predictor the pair of (predictor) categories that is least significantly different with respect to the dependent variable; for classification problems (where the dependent variable is categorical as well), it will compute a Chi-square test (Pearson Chi-square); for regression problems (where the dependent variable is continuous), F tests. If the respective test for a given pair of predictor categories is not statistically significant as defined by an alpha-to-merge value, then it will merge the respective predictor categories and repeat this step (i.e., find the next pair of categories, which now may include previously merged categories). If the respective test for a given pair of predictor categories is statistically significant (the p-value is less than the respective alpha-to-merge value), then (optionally) it will compute a Bonferroni adjusted p-value for the set of categories for the respective predictor. (A minimal sketch of this merging step is given after the last step below.)

Selecting the split variable. The next step is to choose as the split variable the predictor with the smallest adjusted p-value, i.e., the predictor variable that will yield the most significant split; if the smallest (Bonferroni) adjusted p-value for any predictor is greater than some alpha-to-split value, then no further splits will be performed, and the respective node is a terminal node.

Continue this process until no further splits can be performed (given the alpha-to-merge and alpha-to-split values).
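
The Python sketch below illustrates only the category-merging step for a classification problem, using a Pearson Chi-square test from SciPy. The data, the alpha-to-merge value, and the stopping rule are invented for illustration; a full CHAID implementation also applies the Bonferroni adjustment and the splitting logic described above.

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    # Toy data: one categorical predictor, one categorical response.
    rng = np.random.default_rng(2)
    predictor = rng.choice(["A", "B", "C", "D"], size=400)
    response = np.where((predictor == "A") | (predictor == "B"),
                        rng.choice(["yes", "no"], size=400, p=[0.7, 0.3]),
                        rng.choice(["yes", "no"], size=400, p=[0.4, 0.6]))

    alpha_to_merge = 0.05
    categories = {c: [c] for c in np.unique(predictor)}   # each class starts on its own

    def p_value(cats_a, cats_b):
        """Pearson Chi-square p-value comparing two (possibly merged) predictor categories."""
        mask_a = np.isin(predictor, cats_a)
        mask_b = np.isin(predictor, cats_b)
        table = pd.crosstab(np.where(mask_a, "a", "b")[mask_a | mask_b],
                            response[mask_a | mask_b])
        return chi2_contingency(table)[1]

    # Repeatedly merge the least significantly different pair of categories.
    while len(categories) > 1:
        keys = list(categories)
        pairs = [(p_value(categories[i], categories[j]), i, j)
                 for k, i in enumerate(keys) for j in keys[k + 1:]]
        best_p, i, j = max(pairs)
        if best_p <= alpha_to_merge:      # all remaining pairs differ significantly
            break
        categories[i] = categories[i] + categories[j]
        del categories[j]

    print("merged predictor categories:", list(categories.values()))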

Classification and Regression Trees (C&RT)
http://statsoft.com/textbook/stathome.html

C&RT Introductory Overview - Basic Ideas
Computational Details
Computational Formulas

--------------------------------------------------------------------------------

Introductory Overview - Basic Ideas


Overview
C&RT builds classification and regression trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification). The classic C&RT algorithm was popularized by Breiman et al. (Breiman, Friedman, Olshen, & Stone, 1984; see also Ripley, 1996). A general introduction to tree-classifiers, specifically to the QUEST (Quick, Unbiased, Efficient Statistical Trees) algorithm, is also presented in the context of the Classification Trees Analysis facilities, and much of the following discussion presents the same information, in only a slightly different context. Another, similar type of tree building algorithm is CHAID (Chi-square Automatic Interaction Detector; see Kass, 1980).

Classification and Regression Problems
There are numerous algorithms for predicting continuous variables or categorical variables from a set of continuous predictors and/or categorical factor effects. For example, in GLM (General Linear Models) and GRM (General Regression Models), you can specify a linear combination (design) of continuous predictors and categorical factor effects (e.g., with two-way and three-way interaction effects) to predict a continuous dependent variable. In GDA (General Discriminant Function Analysis), you can specify such designs for predicting categorical variables, i.e., to solve classification problems.

Regression-type problems. Regression-type problems are generally those where one attempts to predict the values of a continuous variable from one or more continuous and/or categorical predictor variables. For example, you may want to predict the selling prices of single family homes (a continuous dependent variable) from various other continuous predictors (e.g., square footage) as well as categorical predictors (e.g., style of home, such as ranch, two-story, etc.; zip code or telephone area code where the property is located, etc.; note that this latter variable would be categorical in nature, even though it would contain numeric values or codes). If you used simple multiple regression, or some general linear model (GLM) to predict the selling prices of single family homes, you would determine a linear equation for these variables that can be used to compute predicted selling prices. There are many different analytic procedures for fitting linear models (GLM, GRM, Regression), various types of nonlinear models (e.g., Generalized Linear/Nonlinear Models (GLZ), Generalized Additive Models (GAM), etc.), or completely custom-defined nonlinear models (see Nonlinear Estimation), where you can type in an arbitrary equation containing parameters to be estimated. CHAID also analyzes regression-type problems, and produces results that are similar (in nature) to those computed by C&RT. Note that various neural network architectures are also applicable to solve regression-type problems.

Classification-type problems. Classification-type problems are generally those where one attempts to predict values of a categorical dependent variable (class, group membership, etc.) from one or more continuous and/or categorical predictor variables. For example, you may be interested in predicting who will or will not graduate from college, or who will or will not renew a subscription. These would be examples of simple binary classification problems, where the categorical dependent variable can only assume two distinct and mutually exclusive values. In other cases one might be interested in predicting which one of multiple different alternative consumer products (e.g., makes of cars) a person decides to purchase, or which type of failure occurs with different types of engines. In those cases there are multiple categories or classes for the categorical dependent variable. There are a number of methods for analyzing classification-type problems and for computing predicted classifications, either from simple continuous predictors (e.g., binomial or multinomial logit regression in GLZ), from categorical predictors (e.g., Log-Linear analysis of multi-way frequency tables), or both (e.g., via ANCOVA-like designs in GLZ or GDA). CHAID also analyzes classification-type problems, and produces results that are similar (in nature) to those computed by C&RT. Note that various neural network architectures are also applicable to solve classification-type problems.

Classification and Regression Trees (C&RT)
In most general terms, the purpose of the analyses via tree-building algorithms is to determine a set of if-then logical (split) conditions that permit accurate prediction or classification of cases.
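
As a small illustration of such if-then split conditions, the following Python sketch fits a shallow classification tree with scikit-learn and prints its rules; the iris data and the depth limit are arbitrary choices for the example, not part of the method described here.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Fit a small classification tree and print its if-then split conditions.
    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(iris.data, iris.target)

    print(export_text(tree, feature_names=list(iris.feature_names)))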

Cluster Analysis
http://statsoft.com/textbook/stathome.html

General Purpose
Statistical Significance Testing
Area of Application
Joining (Tree Clustering)
Hierarchical Tree
Distance Measures
Amalgamation or Linkage Rules
Two-way Joining
Introductory Overview
Two-way Joining
k-Means Clustering
Example
Computations
Interpretation of results
EM (Expectation Maximization) Clustering
Introductory Overview
The EM Algorithm
Finding the Right Number of Clusters in k-Means and EM Clustering: v-Fold Cross-Validation

--------------------------------------------------------------------------------

General Purpose

The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, how to develop taxonomies. In other words, cluster analysis is an exploratory data analysis tool that aims at sorting different objects into groups in such a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Given the above, cluster analysis can be used to discover structures in data without providing an explanation or interpretation; in other words, it simply discovers structures in the data without explaining why they exist.

We deal with clustering in almost every aspect of daily life. For example, a group of diners sharing the same table in a restaurant may be regarded as a cluster of people. In food stores, items of a similar nature, such as different types of meat or vegetables, are displayed in the same or nearby locations. There are countless examples in which clustering plays an important role. For instance, biologists have to organize the different species of animals before a meaningful description of the differences between animals is possible. According to the modern system employed in biology, man belongs to the primates, the mammals, the amniotes, the vertebrates, and the animals. Note how in this classification, the higher the level of aggregation, the less similar are the members in the respective class. Man has more in common with all other primates (e.g., apes) than he does with the more "distant" members of the mammals (e.g., dogs), etc. For a review of the general categories of cluster analysis methods, see Joining (Tree Clustering), Two-way Joining (Block Clustering), and k-Means Clustering. In short, whatever the nature of your business is, sooner or later you will run into a clustering problem of one form or another.
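
As a small illustration of one of these families (k-Means), the following Python sketch groups simulated two-dimensional observations into three clusters; the data and the number of clusters are invented for the example.

    import numpy as np
    from sklearn.cluster import KMeans

    # Three synthetic groups of observations in two dimensions.
    rng = np.random.default_rng(3)
    data = np.vstack([rng.normal(loc=center, scale=0.5, size=(50, 2))
                      for center in ([0, 0], [4, 0], [2, 3])])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
    print("cluster sizes:", np.bincount(kmeans.labels_))
    print("cluster centers:")
    print(np.round(kmeans.cluster_centers_, 2))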

Correspondence Analysis
http://statsoft.com/textbook/stathome.html
General Purpose
Supplementary Points
Multiple Correspondence Analysis
Burt Tables

--------------------------------------------------------------------------------

General Purpose

Correspondence analysis is a descriptive/exploratory technique designed to analyze simple two-way and multi-way tables containing some measure of correspondence between the rows and columns. The results provide information which is similar in nature to those produced by Factor Analysis techniques, and they allow one to explore the structure of categorical variables included in the table. The most common kind of table of this type is the two-way frequency crosstabulation table (see, for example, Basic Statistics or Log-Linear).

In a typical correspondence analysis, a crosstabulation table of frequencies is first standardized, so that the relative frequencies across all cells sum to 1.0. One way to state the goal of a typical analysis is to represent the entries in the table of relative frequencies in terms of the distances between individual rows and/or columns in a low-dimensional space. This is best illustrated by a simple example, which will be described below. There are several parallels in interpretation between correspondence analysis and Factor Analysis, and some similar concepts will also be pointed out below.
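
A minimal numpy sketch of this standardization and low-dimensional representation, applied to an invented two-way frequency table. It follows the usual formulation (a singular value decomposition of the standardized residuals) and is not tied to any particular package.

    import numpy as np

    # Invented 4x3 two-way frequency table (rows x columns).
    N = np.array([[20., 10.,  5.],
                  [15., 25., 10.],
                  [ 5., 10., 30.],
                  [10.,  5., 20.]])

    P = N / N.sum()                      # relative frequencies summing to 1.0
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses

    # Standardized residuals, then singular value decomposition.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing_vals, Vt = np.linalg.svd(S, full_matrices=False)

    # Principal coordinates: row and column points in a low-dimensional space.
    row_coords = (U * sing_vals) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sing_vals) / np.sqrt(c)[:, None]

    print("inertia per dimension:", np.round(sing_vals ** 2, 4))
    print("row coordinates (first 2 dims):")
    print(np.round(row_coords[:, :2], 3))
    print("column coordinates (first 2 dims):")
    print(np.round(col_coords[:, :2], 3))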

For a comprehensive description of this method, computational details, and its applications (in the English language), refer to the classic text by Greenacre (1984). These methods were originally developed primarily in France by Jean-Paul Benzécri in the early 1960s and 1970s (e.g., see Benzécri, 1973; see also Lebart, Morineau, and Tabard, 1977), but have only more recently gained increasing popularity in English-speaking countries (see, for example, Carroll, Green, and Schaffer, 1986; Hoffman and Franke, 1986). (Note that similar techniques were developed independently in several countries, where they were known as optimal scaling, reciprocal averaging, optimal scoring, quantification method, or homogeneity analysis.) In the following paragraphs, a general introduction to correspondence analysis will be presented.

Data Mining Techniques
http://statsoft.com/textbook/stathome.html
http://www.statsoft.com/TEXTBOOK/stdatmin.html

Data Mining
Crucial Concepts in Data Mining
Data Warehousing
On-Line Analytic Processing (OLAP)
Exploratory Data Analysis (EDA) and Data Mining Techniques
EDA vs. Hypothesis Testing
Computational EDA Techniques
Graphical (data visualization) EDA techniques
Verification of results of EDA
Neural Networks

--------------------------------------------------------------------------------

Data Mining

Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).

Discriminant Function Analysis
http://statsoft.com/textbook/stathome.html

General Purpose
Computational Approach
Stepwise Discriminant Analysis
Interpreting a Two-Group Discriminant Function
Discriminant Functions for Multiple Groups
Assumptions
Classification

--------------------------------------------------------------------------------

General Purpose

Discriminant function analysis is used to determine which variables discriminate between two or more naturally occurring groups. For example, an educational researcher may want to investigate which variables discriminate between high school graduates who decide (1) to go to college, (2) to attend a trade or professional school, or (3) to seek no further training or education. For that purpose the researcher could collect data on numerous variables prior to students' graduation. After graduation, most students will naturally fall into one of the three categories. Discriminant Analysis could then be used to determine which variable(s) are the best predictors of students' subsequent educational choice.

A medical researcher may record different variables relating to patients' backgrounds in order to learn which variables best predict whether a patient is likely to recover completely (group 1), partially (group 2), or not at all (group 3). A biologist could record different characteristics of similar types (groups) of flowers, and then perform a discriminant function analysis to determine the set of characteristics that allows for the best discrimination between the types.
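
In the spirit of the flower example, the following Python sketch fits a linear discriminant function analysis with scikit-learn on the iris data; the data set and the reported quantities are chosen only for illustration.

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Which flower measurements best discriminate between the three iris species?
    iris = load_iris()
    lda = LinearDiscriminantAnalysis()
    lda.fit(iris.data, iris.target)

    print("classification accuracy on the training data:", lda.score(iris.data, iris.target))
    print("discriminant function coefficients:")
    print(lda.coef_.round(2))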

Distribution Fitting
http://statsoft.com/textbook/stathome.html

General Purpose
Fit of the Observed Distribution
Types of Distributions
Bernoulli Distribution
Beta Distribution
Binomial Distribution
Cauchy Distribution
Chi-square Distribution
Exponential Distribution
Extreme Value Distribution
F Distribution
Gamma Distribution
Geometric Distribution
Gompertz Distribution
Laplace Distribution
Logistic Distribution
Log-normal Distribution
Normal Distribution
Pareto Distribution
Poisson Distribution
Rayleigh Distribution
Rectangular Distribution
Student's t Distribution
Weibull Distribution

--------------------------------------------------------------------------------
General Purpose
In some research applications one can formulate hypotheses about the specific distribution of the variable of interest. For example, variables whose values are determined by an infinite number of independent random events will be distributed following the normal distribution: one can think of a person's height as being the result of very many independent factors such as numerous specific genetic predispositions, early childhood diseases, nutrition, etc. As a result, height tends to be normally distributed in the U.S. population. On the other hand, if the values of a variable are the result of very rare events, then the variable will be distributed according to the Poisson distribution (sometimes called the distribution of rare events). For example, industrial accidents can be thought of as the result of the intersection of a series of unfortunate (and unlikely) events, and their frequency tends to be distributed according to the Poisson distribution. These and other distributions are described in greater detail in the respective glossary topics.
Another common application where distribution fitting procedures are useful is when one wants to verify the assumption of normality before using some parametric test (see General Purpose of Nonparametric Tests). For example, you may want to use the Kolmogorov-Smirnov test or the Shapiro-Wilk W test to test for normality.
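
A brief SciPy sketch of both checks on simulated data; the data, sample size, and the use of estimated parameters in the Kolmogorov-Smirnov test are illustrative assumptions, not part of the procedures themselves.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    heights = rng.normal(loc=170, scale=10, size=200)   # simulated, roughly normal data

    # Kolmogorov-Smirnov test against a normal distribution with estimated parameters
    # (strictly, estimating the parameters from the same data calls for a corrected
    # variant such as the Lilliefors test).
    mu, sigma = heights.mean(), heights.std(ddof=1)
    ks_stat, ks_p = stats.kstest(heights, "norm", args=(mu, sigma))

    # Shapiro-Wilk W test.
    w_stat, sw_p = stats.shapiro(heights)

    print(f"KS test:           D = {ks_stat:.3f}, p = {ks_p:.3f}")
    print(f"Shapiro-Wilk test: W = {w_stat:.3f}, p = {sw_p:.3f}")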

Experimental Design (Industrial DOE)
http://statsoft.com/textbook/stathome.html

DOE Overview
Experiments in Science and Industry
Differences in techniques
Overview
General Ideas
Computational Problems
Components of Variance, Denominator Synthesis
Summary
2**(k-p) Fractional Factorial Designs
Basic Idea
Generating the Design
The Concept of Design Resolution
Plackett-Burman (Hadamard Matrix) Designs for Screening
Enhancing Design Resolution via Foldover
Aliases of Interactions: Design Generators
Blocking
Replicating the Design
Adding Center Points
Analyzing the Results of a 2**(k-p) Experiment
Graph Options
Summary
2**(k-p) Maximally Unconfounded and Minimum Aberration Designs
Basic Idea
Design Criteria
Summary
3**(k-p) , Box-Behnken, and Mixed 2 and 3 Level Factorial Designs
Overview
Designing 3**(k-p) Experiments
An Example 3**(4-1) Design in 9 Blocks
Box-Behnken Designs
Analyzing the 3**(k-p) Design
ANOVA Parameter Estimates
Graphical Presentation of Results
Designs for Factors at 2 and 3 Levels
Central Composite and Non-Factorial Response Surface Designs
Overview
Design Considerations
Alpha for Rotatability and Orthogonality
Available Standard Designs
Analyzing Central Composite Designs
The Fitted Response Surface
Categorized Response Surfaces
Latin Square Designs
Overview
Latin Square Designs
Analyzing the Design
Very Large Designs, Random Effects, Unbalanced Nesting
Taguchi Methods: Robust Design Experiments
Overview
Quality and Loss Functions
Signal-to-Noise (S/N) Ratios
Orthogonal Arrays
Analyzing Designs
Accumulation Analysis
Summary
Mixture designs and triangular surfaces
Overview
Triangular Coordinates
Triangular Surfaces and Contours
The Canonical Form of Mixture Polynomials
Common Models for Mixture Data
Standard Designs for Mixture Experiments
Lower Constraints
Upper and Lower Constraints
Analyzing Mixture Experiments
Analysis of Variance
Parameter Estimates
Pseudo-Components
Graph Options
Designs for constrained surfaces and mixtures
Overview
Designs for Constrained Experimental Regions
Linear Constraints
The Piepel & Snee Algorithm
Choosing Points for the Experiment
Analyzing Designs for Constrained Surfaces and Mixtures
Constructing D- and A-optimal designs
Overview
Basic Ideas
Measuring Design Efficiency
Constructing Optimal Designs
General Recommendations
Avoiding Matrix Singularity
"Repairing" Designs
Constrained Experimental Regions and Optimal Design
Special Topics
Profiling Predicted Responses and Response Desirability
Residuals Analysis
Box-Cox Transformations of Dependent Variables

Principal Components and Factor Analysis
http://statsoft.com/textbook/stathome.html

General Purpose
Basic Idea of Factor Analysis as a Data Reduction Method
Factor Analysis as a Classification Method
Miscellaneous Other Issues and Statistics

--------------------------------------------------------------------------------
General Purpose
The main applications of factor analytic techniques are: (1) to reduce the number of variables and (2) to detect structure in the relationships between variables, that is, to classify variables. Therefore, factor analysis is applied as a data reduction or structure detection method (the term factor analysis was first introduced by Thurstone, 1931). The topics listed below will describe the principles of factor analysis, and how it can be applied towards these two purposes. We will assume that you are familiar with the basic logic of statistical reasoning as described in Elementary Concepts. Moreover, we will also assume that you are familiar with the concepts of variance and correlation; if not, we advise that you read the Basic Statistics chapter at this point.

There are many excellent books on factor analysis. For example, a hands-on how-to approach can be found in Stevens (1986); more detailed technical descriptions are provided in Cooley and Lohnes (1971); Harman (1976); Kim and Mueller, (1978a, 1978b); Lawley and Maxwell (1971); Lindeman, Merenda, and Gold (1980); Morrison (1967); or Mulaik (1972). The interpretation of secondary factors in hierarchical factor analysis, as an alternative to traditional oblique rotational strategies, is explained in detail by Wherry (1984).

Confirmatory factor analysis. Structural Equation Modeling (SEPATH) allows you to test specific hypotheses about the factor structure for a set of variables, in one or several samples (e.g., you can compare factor structures across samples).

Correspondence analysis. Correspondence analysis is a descriptive/exploratory technique designed to analyze two-way and multi-way tables containing some measure of correspondence between the rows and columns. The results provide information which is similar in nature to those produced by factor analysis techniques, and they allow one to explore the structure of categorical variables included in the table. For more information regarding these methods, refer to Correspondence Analysis.



Basic Idea of Factor Analysis as a Data Reduction Method

Suppose we conducted a (rather "silly") study in which we measure 100 people's height in inches and centimeters. Thus, we would have two variables that measure height. If in future studies, we want to research, for example, the effect of different nutritional food supplements on height, would we continue to use both measures? Probably not; height is one characteristic of a person, regardless of how it is measured.

Let us now extrapolate from this "silly" study to something that one might actually do as a researcher. Suppose we want to measure people's satisfaction with their lives. We design a satisfaction questionnaire with various items; among other things we ask our subjects how satisfied they are with their hobbies (item 1) and how intensely they are pursuing a hobby (item 2). Most likely, the responses to the two items are highly correlated with each other. (If you are not familiar with the correlation coefficient, we recommend that you read the description in Basic Statistics - Correlations.) Given a high correlation between the two items, we can conclude that they are quite redundant.

Combining Two Variables into a Single Factor. One can summarize the correlation between two variables in a scatterplot. A regression line can then be fitted that represents the "best" summary of the linear relationship between the variables. If we could define a variable that would approximate the regression line in such a plot, then that variable would capture most of the "essence" of the two items. Subjects' single scores on that new factor, represented by the regression line, could then be used in future data analyses to represent that essence of the two items. In a sense we have reduced the two variables to one factor. Note that the new factor is actually a linear combination of the two variables.

Principal Components Analysis. The example described above, combining two correlated variables into one factor, illustrates the basic idea of factor analysis, or of principal components analysis to be precise (we will return to this later). If we extend the two-variable example to multiple variables, then the computations become more involved, but the basic principle of expressing two or more variables by a single factor remains the same.

Extracting Principal Components. We do not want to go into the details about the computational aspects of principal components analysis here, which can be found elsewhere (references were provided at the beginning of this section). However, basically, the extraction of principal components amounts to a variance maximizing (varimax) rotation of the original variable space. For example, in a scatterplot we can think of the first factor as a new axis obtained by rotating the original X axis so that it approximates the regression line. This type of rotation is called variance maximizing because the criterion for (goal of) the rotation is to maximize the variance (variability) of the "new" variable (factor), while minimizing the variance around the new variable (see Rotational Strategies).
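
A minimal numpy sketch of this extraction for a few simulated, correlated questionnaire items: the eigenvalues of the correlation matrix give the variance accounted for by each component, and scores on the first component play the role of the single "factor" in the example above. The simulated data are invented for illustration.

    import numpy as np

    # Simulated responses to three correlated questionnaire items.
    rng = np.random.default_rng(5)
    latent = rng.normal(size=200)
    items = np.column_stack([latent + rng.normal(scale=0.5, size=200) for _ in range(3)])

    # Principal components from the correlation matrix.
    R = np.corrcoef(items, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(R)

    # Sort from largest to smallest eigenvalue (variance accounted for).
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    print("eigenvalues:", np.round(eigenvalues, 3))
    print("proportion of variance:", np.round(eigenvalues / eigenvalues.sum(), 3))

    # Scores on the first principal component (the single "factor" of the example).
    standardized = (items - items.mean(axis=0)) / items.std(axis=0, ddof=1)
    factor_scores = standardized @ eigenvectors[:, 0]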

General Discriminant Analysis (GDA)
http://statsoft.com/textbook/stathome.html

Introductory Overview
Advantages of GDA

--------------------------------------------------------------------------------

Introductory Overview

General Discriminant Analysis (GDA) is called a "general" discriminant analysis because it applies the methods of the general linear model (see also General Linear Models (GLM)) to the discriminant function analysis problem. A general overview of discriminant function analysis, and the traditional methods for fitting linear models with categorical dependent variables and continuous predictors, is provided in the context of Discriminant Analysis. In GDA, the discriminant function analysis problem is "recast" as a general multivariate linear model, where the dependent variables of interest are (dummy-) coded vectors that reflect the group membership of each case. The remainder of the analysis is then performed as described in the context of General Regression Models (GRM), with a few additional features noted below.





Advantages of GDA

Specifying models for predictor variables and predictor effects. One advantage of applying the general linear model to the discriminant analysis problem is that you can specify complex models for the set of predictor variables. For example, you can specify, for a set of continuous predictor variables, a polynomial regression model, response surface model, factorial regression, or mixture surface regression (without an intercept). Thus, you could analyze a constrained mixture experiment (where the predictor variable values must sum to a constant), where the dependent variable of interest is categorical in nature. In fact, GDA does not impose any particular restrictions on the type of predictor variable (categorical or continuous) that can be used, or the models that can be specified. However, when using categorical predictor variables, caution should be used (see "A note of caution for models with categorical predictors, and other advanced techniques" below).

Stepwise and best-subset analyses. In addition to the traditional stepwise analyses for single continuous predictors provided in Discriminant Analysis, General Discriminant Analysis makes available the options for stepwise and best-subset analyses provided in General Regression Models (GRM). Specifically, you can request stepwise and best-subset selection of predictors or sets of predictors (in multiple-degree of freedom effects, involving categorical predictors), based on the F-to-enter and p-to-enter statistics (associated with the multivariate Wilks' Lambda test statistic). In addition, when a cross-validation sample is specified, best-subset selection can also be based on the misclassification rates for the cross-validation sample; in other words, after estimating the discriminant functions for a given set of predictors, the misclassification rates for the cross-validation sample are computed, and the model (subset of predictors) that yields the lowest misclassification rate for the cross-validation sample is chosen. This is a powerful technique for choosing models that may yield good predictive validity, while avoiding overfitting of the data (see also Neural Networks).
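
A rough Python sketch of the general idea of choosing the predictor subset with the lowest cross-validated misclassification rate, using scikit-learn's linear discriminant analysis and the iris data as stand-ins; this is not GDA's own best-subset routine, and the cross-validation scheme is an arbitrary choice for the example.

    from itertools import combinations

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    iris = load_iris()
    X, y = iris.data, iris.target
    n_predictors = X.shape[1]

    best_subset, best_error = None, np.inf
    for k in range(1, n_predictors + 1):
        for subset in combinations(range(n_predictors), k):
            accuracy = cross_val_score(LinearDiscriminantAnalysis(),
                                       X[:, list(subset)], y, cv=5).mean()
            error = 1 - accuracy                 # cross-validated misclassification rate
            if error < best_error:
                best_subset, best_error = subset, error

    print("best subset of predictors:", [iris.feature_names[i] for i in best_subset])
    print("cross-validated misclassification rate:", round(best_error, 3))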

Desirability profiling of posterior classification probabilities. Another unique option of General Discriminant Analysis (GDA) is the inclusion of Response/desirability profiler options. These options are described in some detail in the context of Experimental Design (DOE). In short, the predicted response values for each dependent variable are computed, and those values can be combined into a single desirability score. A graphical summary can then be produced to show the "behavior" of the predicted responses and the desirability score over the ranges of values for the predictor variables. In GDA, you can profile both simple predicted values (like in General Regression Models) for the coded dependent variables (i.e., dummy-coded categories of the categorical dependent variable), and you can also profile posterior prediction probabilities. This unique latter option allows you to evaluate how different values for the predictor variables affect the predicted classification of cases, and is particularly useful when interpreting the results for complex models that involve categorical and continuous predictors and their interactions.

A note of caution for models with categorical predictors, and other advanced techniques. General Discriminant Analysis provides functionality that makes this technique a general tool for classification and data mining. However, most -- if not all -- textbook treatments of discriminant function analysis are limited to simple and stepwise analyses with single degree of freedom continuous predictors. No "experience" (in the literature) exists regarding issues of robustness and effectiveness of these techniques, when they are generalized in the manner provided in this very powerful analysis. The use of best-subset methods, in particular when used in conjunction with categorical predictors or when using the misclassification rates in a cross-validation sample for choosing the best subset of predictors, should be considered a heuristic search method, rather than a statistical analysis technique.

The use of categorical predictor variables. The use of categorical predictor variables or effects in a discriminant function analysis model may be (statistically) questionable. For example, you can use GDA to analyze a 2 by 2 frequency table, by specifying one variable in the 2 by 2 table as the dependent variable, and the other as the predictor. Clearly, the (ab)use of GDA in this manner would be silly (although, interestingly, in most cases you will get results that are generally compatible with those you would get by computing a simple Chi-square test for the 2 by 2 table). On the other hand, if you only consider the parameter estimates computed by GDA as the least squares solution to a set of linear (prediction) equations, then the use of categorical predictors in GDA is fully justified; moreover, it is not uncommon in applied research to be confronted with a mixture of continuous and categorical predictors (e.g., income or age which are continuous, along with occupational status, which is categorical) for predicting a categorical dependent variable. In those cases, it can be very instructive to consider specific models involving the categorical predictors, and possibly interactions between categorical and continuous predictors for classifying observations. However, to reiterate, the use of categorical predictor variables in discriminant function analysis is not widely documented, and you should proceed cautiously before accepting the results of statistical significance tests, and before drawing final conclusions from your analyses. Also remember that there are alternative methods available to perform similar analyses, namely, the multinomial logit models available in Generalized Linear Models (GLZ), and the methods for analyzing multi-way frequency tables in Log-Linear.


General Linear Models (GLM)
http://statsoft.com/textbook/stathome.html
http://www.statsoft.com/TEXTBOOK/stglm.html

Basic Ideas: The General Linear Model
Historical background
The purpose of multiple regression
Computations for solving the multiple regression equation
Extension of multiple regression to the general linear model
The sigma-restricted vs. overparameterized model
Summary of computations
Types of Analyses
Between-subject designs
Within-subject (repeated measures) designs
Multivariate designs
Estimation and Hypothesis Testing
Whole model tests
Six types of sums of squares
Error terms for tests
Testing specific hypotheses
Testing hypotheses for repeated measures and dependent variables

--------------------------------------------------------------------------------
This chapter describes the use of the general linear model in a wide variety of statistical analyses. If you are unfamiliar with the basic methods of ANOVA and regression in linear models, it may be useful to first review the basic information on these topics in Elementary Concepts. A detailed discussion of univariate and multivariate ANOVA techniques can also be found in the ANOVA/MANOVA chapter.
Basic Ideas: The General Linear Model

The following topics summarize the historical, mathematical, and computational foundations for the general linear model. For a basic introduction to ANOVA (MANOVA, ANCOVA) techniques, refer to ANOVA/MANOVA; for an introduction to multiple regression, see Multiple Regression; for an introduction to the design and analysis of experiments in applied (industrial) settings, see Experimental Design.

Historical Background

The roots of the general linear model surely go back to the origins of mathematical thought, but it is the emergence of the theory of algebraic invariants in the 1800s that made the general linear model, as we know it today, possible. The theory of algebraic invariants developed from the groundbreaking work of 19th century mathematicians such as Gauss, Boole, Cayley, and Sylvester. The theory seeks to identify those quantities in systems of equations which remain unchanged under linear transformations of the variables in the system. Stated more imaginatively (but in a way in which the originators of the theory would not consider an overstatement), the theory of algebraic invariants searches for the eternal and unchanging amongst the chaos of the transitory and the illusory. That is no small goal for any theory, mathematical or otherwise.

The wonder of it all is that the theory of algebraic invariants was successful far beyond the hopes of its originators. Eigenvalues, eigenvectors, determinants, matrix decomposition methods - all derive from the theory of algebraic invariants. The contributions of the theory of algebraic invariants to the development of statistical theory and methods are numerous, but a simple example familiar to even the most casual student of statistics is illustrative. The correlation between two variables is unchanged by linear transformations of either or both variables. We probably take this property of correlation coefficients for granted, but what would data analysis be like if we did not have statistics that are invariant to the scaling of the variables involved? Some thought on this question should convince you that without the theory of algebraic invariants, the development of useful statistical techniques would be nigh impossible.

The development of the linear regression model in the late 19th century, and the development of correlational methods shortly thereafter, are clearly direct outgrowths of the theory of algebraic invariants. Regression and correlational methods, in turn, serve as the basis for the general linear model. Indeed, the general linear model can be seen as an extension of linear multiple regression for a single dependent variable. Understanding the multiple regression model is fundamental to understanding the general linear model, so we will look at the purpose of multiple regression, the computational algorithms used to solve regression problems, and how the regression model is extended in the case of the general linear model. A basic introduction to multiple regression methods and the analytic problems to which they are applied is provided in Multiple Regression.

Generalized Additive Models (GAM)
http://statsoft.com/textbook/stathome.html

Additive models

Generalized linear models

Distributions and link functions

Generalized additive models

Estimating the non-parametric function of predictors via scatterplot smoothers

A specific example: The generalized additive logistic model

Fitting generalized additive models

Interpreting the results

Degrees of freedom

A Word of Caution



--------------------------------------------------------------------------------

The methods available in Generalized Additive Models are implementations of techniques developed and popularized by Hastie and Tibshirani (1990). A detailed description of these and related techniques, the algorithms used to fit these models, and discussions of recent research in this area of statistical modeling can also be found in Schimek (2000).

Additive models. The methods described in this section represent a generalization of multiple regression (which is a special case of general linear models). Specifically, in linear regression, a linear least-squares fit is computed for a set of predictor or X variables, to predict a dependent Y variable. The well known linear regression equation with m predictors, to predict a dependent variable Y, can be stated as:

Y = b0 + b1*X1 + ... + bm*Xm

Where Y stands for the (predicted values of the) dependent variable, X1 through Xm represent the m values for the predictor variables, and b0 and b1 through bm are the regression coefficients estimated by multiple regression. A generalization of the multiple regression model would be to maintain the additive nature of the model, but to replace the simple terms of the linear equation bi*Xi with fi(Xi) where fi is a non-parametric function of the predictor Xi. In other words, instead of a single coefficient for each variable (additive term) in the model, in additive models an unspecified (non-parametric) function is estimated for each predictor, to achieve the best prediction of the dependent variable values.
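As a rough illustration of this idea (not of the fitting algorithm itself), the following Python sketch compares a single linear coefficient with a crude non-parametric estimate of f(X) obtained from bin means; the data and the choice of smoother are hypothetical.

import numpy as np

# Hypothetical data: a nonlinear relation plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + 0.1 * x + rng.normal(0, 0.3, 300)

# Parametric term b0 + b1*x, as in ordinary linear regression
b1, b0 = np.polyfit(x, y, 1)
linear_fit = b0 + b1 * x

# Non-parametric term f(x): a crude smoother using the mean of y within each bin of x
bins = np.linspace(0, 10, 21)
idx = np.digitize(x, bins)
f_hat = np.array([y[idx == i].mean() for i in idx])

print("residual sum of squares, linear term:", round(np.sum((y - linear_fit) ** 2), 1))
print("residual sum of squares, smooth term:", round(np.sum((y - f_hat) ** 2), 1))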

Generalized linear models. To summarize the basic idea, the generalized linear model differs from the general linear model (of which multiple regression is a special case) in two major respects: First, the distribution of the dependent or response variable can be (explicitly) non-normal, and does not have to be continuous, e.g., it can be binomial; second, the dependent variable values are predicted from a linear combination of predictor variables, which are "connected" to the dependent variable via a link function. The general linear model for a single dependent variable can be considered a special case of the generalized linear model: In the general linear model the dependent variable values are expected to follow the normal distribution, and the link function is a simple identity function (i.e., the linear combination of values for the predictor variables is not transformed).
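For instance, with a binomial dependent variable and a logit link, the linear combination of predictor values is transformed into a predicted probability. A minimal sketch of this mapping, with hypothetical coefficient values:

import numpy as np

b0, b1 = -2.0, 0.8                     # hypothetical coefficients for one predictor
x = np.array([0.0, 1.0, 2.5, 5.0])

eta = b0 + b1 * x                      # linear predictor, as in the general linear model
p = 1.0 / (1.0 + np.exp(-eta))         # inverse logit link maps eta onto a probability in (0, 1)
print(np.round(p, 3))
# With an identity link (the general linear model), the prediction would simply be eta itself.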

Generalized Linear Models (GLZ)
http://www.statsoft.com/TEXTBOOK/stglz.html
Basic Ideas
Computational Approach
Types of Analyses
Between-subject Designs
Model Building
Interpretation of Results and Diagnostics

--------------------------------------------------------------------------------
This chapter describes the use of the generalized linear model for analyzing linear and non-linear effects of continuous and categorical predictor variables on a discrete or continuous dependent variable. If you are unfamiliar with the basic methods of regression in linear models, it may be useful to first review the basic information on these topics in the Elementary Concepts chapter. Discussion of the ways in which the linear regression model is extended by the general linear model can be found in the General Linear Models chapter.
For additional information about generalized linear models, see also Dobson (1990), Green and Silverman (1994), or McCullagh and Nelder (1989).

General Regression Models (GRM)
http://statsoft.com/textbook/stathome.html
http://www.statsoft.com/TEXTBOOK/stgrm.html

Basic Ideas: The Need for Simple Models
Model Building in GSR
Types of Analyses
Between Subject Designs
Multivariate Designs
Building the Whole Model
Partitioning Sums of Squares
Testing the Whole Model
Limitations of Whole Models
Building Models via Stepwise Regression
Building Models via Best-Subset Regression

--------------------------------------------------------------------------------
This chapter describes the use of the general linear model for finding the "best" linear model from a number of possible models. If you are unfamiliar with the basic methods of ANOVA and regression in linear models, it may be useful to first review the basic information on these topics in Elementary Concepts. A detailed discussion of univariate and multivariate ANOVA techniques can also be found in the ANOVA/MANOVA chapter; a discussion of multiple regression methods is also provided in the Multiple Regression chapter. Discussion of the ways in which the linear regression model is extended by the general linear model can be found in the General Linear Models chapter.


--------------------------------------------------------------------------------
Basic Ideas: The Need for Simple Models
A good theory is the end result of a winnowing process. We start with a comprehensive model that includes all conceivable, testable influences on the phenomena under investigation. Then we test the components of the initial comprehensive model, to identify the less comprehensive submodels that adequately account for the phenomena under investigation. Finally, from these candidate submodels, we single out the simplest submodel, which by the principle of parsimony we take to be the "best" explanation for the phenomena under investigation.

We prefer simple models not just for philosophical but also for practical reasons. Simple models are easier to put to the test in replication and cross-validation studies. Simple models are less costly to put into practice in predicting and controlling the outcome in the future. The philosophical reasons for preferring simple models should not be downplayed, however. Simpler models are easier to understand and appreciate, and therefore have a "beauty" that their more complicated counterparts often lack.

The entire winnowing process described above is encapsulated in the model-building techniques of stepwise and best-subset regression. The use of these model-building techniques begins with the specification of the design for a comprehensive "whole model." Less comprehensive submodels are then tested to determine if they adequately account for the outcome under investigation. Finally, the simplest of the adequate submodels is adopted as the "best."
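The following Python sketch mimics this winnowing on simulated data: all subsets of four hypothetical predictors are fit by least squares and compared on adjusted R-square (one of several possible criteria for judging submodels).

import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 4))                              # four candidate predictors
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)   # only the first two actually matter

def adjusted_r2(X_sub, y):
    X_design = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    resid = y - X_design @ beta
    r2 = 1.0 - resid.var() / y.var()
    p = X_design.shape[1] - 1
    return 1.0 - (1.0 - r2) * (len(y) - 1) / (len(y) - p - 1)

results = []
for k in range(1, 5):
    for subset in itertools.combinations(range(4), k):
        results.append((adjusted_r2(X[:, list(subset)], y), subset))

best_score, best_subset = max(results)
print("best subset by adjusted R-square:", best_subset, round(best_score, 3))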




Graphical Analytic Techniques
http://statsoft.com/textbook/stathome.html

Brief Overviews of Types of Graphs
Representative Visualization Techniques
Categorized Graphs
What are Categorized Graphs?
Categorization Methods
Histograms
Scatterplots
Probability Plots
Quantile-Quantile Plots
Probability-Probability Plots
Line Plots
Box Plots
Pie Charts
Missing/Range Data Points Plots
3D Plots
Ternary Plots
Brushing
Smoothing Bivariate Distributions
Layered Compression
Projections of 3D data sets
Icon Plots
Analyzing Icon Plots
Taxonomy of Icon Plots
Standardization of Values
Applications
Related Graphs
Graph Type
Mark Icons
Data Reduction
Data Rotation (in 3D space)
Categorized Graphs

One of the most important, general, and also powerful analytic methods involves dividing ("splitting") the data set into categories in order to compare the patterns of data between the resulting subsets. This common technique is known under a variety of terms (such as breaking down, grouping, categorizing, splitting, slicing, drilling-down, or conditioning) and it is used both in exploratory data analyses and hypothesis testing. For example: A positive relation between age and the risk of a heart attack may be different in males and females (it may be stronger in males). A promising relation between taking a drug and a decrease in cholesterol level may be present only in women with low blood pressure and only in their thirties and forties. The process capability indices or capability histograms can be different for periods of time supervised by different operators. The regression slopes can be different in different experimental groups.

There are many computational techniques that capitalize on grouping and that are designed to quantify the differences that the grouping will reveal (e.g., ANOVA/MANOVA). However, graphical techniques (such as categorized graphs discussed in this section) offer unique advantages that cannot be substituted by any computational method alone: they can reveal patterns that cannot be easily quantified (e.g., complex interactions, exceptions, anomalies) and they provide unique, multidimensional, global analytic perspectives to explore or "mine" the data.

What are Categorized Graphs?

Categorized graphs (a term first used in STATISTICA software by StatSoft in 1990; also called Trellis graphs by Becker, Cleveland, and Clark at Bell Labs) produce a series of 2D, 3D, ternary, or nD graphs (such as histograms, scatterplots, line plots, surface plots, ternary scatterplots, etc.), one for each selected category of cases (i.e., subset of cases), for example, respondents from New York, Chicago, Dallas, etc. These "component" graphs are placed sequentially in one display, allowing for comparisons between the patterns of data shown in graphs for each of the requested groups (e.g., cities).

A variety of methods can be used to select the subsets; the simplest of them is using a categorical variable (e.g., a variable City, with three values New York, Chicago, and Dallas). For example, a categorized histogram display could show the distribution of a variable representing self-reported stress levels separately for each of the three cities.
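A minimal sketch of such a display, with entirely hypothetical stress scores, using matplotlib to draw one histogram panel per city:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
cities = ["New York", "Chicago", "Dallas"]
stress = {city: rng.normal(loc=50 + 5 * i, scale=10, size=200)   # hypothetical stress scores
          for i, city in enumerate(cities)}

fig, axes = plt.subplots(1, 3, figsize=(9, 3), sharex=True, sharey=True)
for ax, city in zip(axes, cities):
    ax.hist(stress[city], bins=15)       # one "component" histogram per category
    ax.set_title(city)
    ax.set_xlabel("stress level")
fig.suptitle("Categorized histograms: stress level by city")
plt.tight_layout()
plt.show()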

Independent Components Analysis
http://statsoft.com/textbook/stathome.html

Introductory Overview


Independent Component Analysis is a well established and reliable statistical method that performs signal separation. Signal separation is a frequently occurring problem and is central to Statistical Signal Processing, which has a wide range of applications in many areas of technology ranging from Audio and Image Processing to Biomedical Signal Processing, Telecommunications, and Econometrics.

Imagine being in a room with a crowd of people and two speakers giving presentations at the same time. The crowd is making comments and noises in the background. We are interested in what the speakers say and not the comments emanating from the crowd. There are two microphones at different locations, recording the speakers' voices as well as the noise coming from the crowd. Our task is to separate the voice of each speaker while ignoring the background noise.
This is a classic example of Independent Component Analysis (ICA), a well-established stochastic technique. ICA can be used as a method of Blind Source Separation, meaning that it can separate independent signals from linear mixtures with virtually no prior knowledge of the signals. An example is the decomposition of electro- or magnetoencephalographic signals. In computational neuroscience, ICA has been used for Feature Extraction, in which case it seems to adequately model the basic cortical processing of visual and auditory information. New application areas are being discovered at an increasing pace.
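A minimal sketch of blind source separation on synthetic signals, assuming scikit-learn is available (its FastICA estimator is one common ICA implementation); the two "speakers" and the mixing matrix are invented for illustration:

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(3)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                              # "speaker 1"
s2 = np.sign(np.sin(3 * t))                     # "speaker 2"
S = np.column_stack([s1, s2]) + 0.05 * rng.normal(size=(2000, 2))

A = np.array([[1.0, 0.5],                       # unknown mixing matrix (two microphones)
              [0.4, 1.0]])
X = S @ A.T                                     # observed microphone recordings

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                    # recovered sources, up to order, sign, and scale
print(S_hat.shape)                              # (2000, 2): two separated signals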

Multiple Regression and the General Linear Models
http://statsoft.com/textbook/stathome.html
http://www.statsoft.com/TEXTBOOK/stmulreg.html
General Purpose
Computational Approach
Least Squares
The Regression Equation
Unique Prediction and Partial Correlation
Predicted and Residual Scores
Residual Variance and R-square
Interpreting the Correlation Coefficient R
Assumptions, Limitations, and Practical Considerations
Assumption of Linearity
Normality Assumption
Limitations
Choice of the number of variables
Multicollinearity and matrix ill-conditioning
Fitting centered polynomial models
The importance of residual analysis

--------------------------------------------------------------------------------

General Purpose

The general purpose of multiple regression (the term was first used by Pearson, 1908) is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable. For example, a real estate agent might record for each listing the size of the house (in square feet), the number of bedrooms, the average income in the respective neighborhood according to census data, and a subjective rating of appeal of the house. Once this information has been compiled for various houses it would be interesting to see whether and how these measures relate to the price for which a house is sold. For example, one might learn that the number of bedrooms is a better predictor of the price for which a house sells in a particular neighborhood than how "pretty" the house is (subjective rating). One may also detect "outliers," that is, houses that should really sell for more, given their location and characteristics.

Personnel professionals customarily use multiple regression procedures to determine equitable compensation. One can determine a number of factors or dimensions such as "amount of responsibility" (Resp) or "number of people to supervise" (No_Super) that one believes to contribute to the value of a job. The personnel analyst then usually conducts a salary survey among comparable companies in the market, recording the salaries and respective characteristics (i.e., values on dimensions) for different positions. This information can be used in a multiple regression analysis to build a regression equation of the form:

Salary = .5*Resp + .8*No_Super

Once this so-called regression line has been determined, the analyst can now easily construct a graph of the expected (predicted) salaries and the actual salaries of job incumbents in his or her company. Thus, the analyst is able to determine which position is underpaid (below the regression line) or overpaid (above the regression line), or paid equitably.
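A minimal sketch of this workflow with made-up survey data; the coefficients are estimated by least squares (with an intercept) rather than fixed at the values shown above, and each position is flagged by comparing its actual salary to the predicted salary:

import numpy as np

# Hypothetical survey data: responsibility score, number supervised, salary (in $1000s)
resp     = np.array([3.0, 5.0, 7.0, 4.0, 8.0, 6.0])
no_super = np.array([1.0, 4.0, 10.0, 2.0, 12.0, 6.0])
salary   = np.array([42.0, 55.0, 78.0, 47.0, 90.0, 64.0])

X = np.column_stack([np.ones_like(resp), resp, no_super])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)   # intercept and two regression coefficients
predicted = X @ b

for actual, pred in zip(salary, predicted):
    status = "underpaid" if actual < pred else "overpaid or equitable"
    print(f"actual {actual:5.1f}  predicted {pred:5.1f}  -> {status}")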

In the social and natural sciences multiple regression procedures are very widely used in research. In general, multiple regression allows the researcher to ask (and hopefully answer) the general question "what is the best predictor of ...". For example, educational researchers might want to learn what are the best predictors of success in high-school. Psychologists may want to determine which personality variable best predicts social adjustment. Sociologists may want to find out which of the multiple social indicators best predict whether or not a new immigrant group will adapt and be absorbed into society.

See also Exploratory Data Analysis and Data Mining Techniques, the General Stepwise Regression chapter, and the General Linear Models chapter.

Log-Linear Analysis of Frequency Tables
http://statsoft.com/textbook/stathome.html

General Purpose
Two-way Frequency Tables
Multi-Way Frequency Tables
The Log-Linear Model
Goodness-of-fit
Automatic Model Fitting

--------------------------------------------------------------------------------
General Purpose
One basic and straightforward method for analyzing data is via crosstabulation. For example, a medical researcher may tabulate the frequency of different symptoms by patients' age and gender; an educational researcher may tabulate the number of high school drop-outs by age, gender, and ethnic background; an economist may tabulate the number of business failures by industry, region, and initial capitalization; a market researcher may tabulate consumer preferences by product, age, and gender; etc. In all of these cases, the major results of interest can be summarized in a multi-way frequency table, that is, in a crosstabulation table with two or more factors.

Log-linear analysis provides a more "sophisticated" way of looking at crosstabulation tables. Specifically, you can test the different factors that are used in the crosstabulation (e.g., gender, region, etc.) and their interactions for statistical significance (see Elementary Concepts for a discussion of statistical significance testing). The following text will present a brief introduction to these methods, their logic, and interpretation.
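As a first step in that direction, the sketch below tests a single two-way table for independence with a chi-square test, assuming SciPy is available; the counts are invented, and a full log-linear analysis of a multi-way table would go beyond this fragment:

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 3 crosstabulation: gender by product preference
table = np.array([[30, 45, 25],
                  [35, 30, 35]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")
# A small p-value suggests that the two factors are not independent.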

Correspondence analysis is a descriptive/exploratory technique designed to analyze two-way and multi-way tables containing some measure of correspondence between the rows and columns. The results provide information which is similar in nature to those produced by Factor Analysis techniques, and they allow one to explore the structure of the categorical variables included in the table.

Multivariate Adaptive Regression Splines (MARSplines)
http://statsoft.com/textbook/stathome.html

Introductory Overview
Regression Problems
Multivariate Adaptive Regression Splines
Model Selection and Pruning
Applications
Technical Notes: The MARSplines Algorithm
Technical Notes: The MARSplines Model
Introductory Overview

Multivariate Adaptive Regression Splines (MARSplines) is an implementation of techniques popularized by Friedman (1991) for solving regression-type problems (see also Multiple Regression), with the main purpose of predicting the values of a continuous dependent or outcome variable from a set of independent or predictor variables. There are a large number of methods available for fitting models to continuous variables, such as linear regression [e.g., Multiple Regression, General Linear Model (GLM)], nonlinear regression (Generalized Linear/Nonlinear Models), regression trees (see Classification and Regression Trees), CHAID, Neural Networks, etc. (see also Hastie, Tibshirani, and Friedman, 2001, for an overview).

Multivariate Adaptive Regression Splines (MARSplines) is a nonparametric regression procedure that makes no assumption about the underlying functional relationship between the dependent and independent variables. Instead, MARSplines constructs this relation from a set of coefficients and basis functions that are entirely "driven" from the regression data. In a sense, the method is based on the "divide and conquer" strategy, which partitions the input space into regions, each with its own regression equation. This makes MARSplines particularly suitable for problems with higher input dimensions (i.e., with more than 2 variables), where the curse of dimensionality would likely create problems for other techniques.
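The basis functions MARSplines builds its models from are "hinge" functions of the form max(0, x - t) and max(0, t - x) (and products of them). The sketch below fixes the knot t by hand and estimates the coefficients by least squares; the real MARSplines algorithm searches for the knots and terms adaptively, so this is only an illustration of the building blocks:

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 300)
# Piecewise-linear "truth" with a bend at x = 4, plus noise
y = np.where(x < 4, x, 4.0 + 2.5 * (x - 4)) + rng.normal(0, 0.5, 300)

knot = 4.0                                          # chosen by hand here, found adaptively by MARSplines
basis = np.column_stack([np.ones_like(x),
                         np.maximum(0, x - knot),   # hinge active to the right of the knot
                         np.maximum(0, knot - x)])  # hinge active to the left of the knot
coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
print("estimated coefficients:", np.round(coef, 2))   # roughly [4.0, 2.5, -1.0]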

The MARSplines technique has become particularly popular in the area of data mining because it does not assume or impose any particular type or class of relationship (e.g., linear, logistic, etc.) between the predictor variables and the dependent (outcome) variable of interest. Instead, useful models (i.e., models that yield accurate predictions) can be derived even in situations where the relationship between the predictors and the dependent variables is non-monotone and difficult to approximate with parametric models. For more information about this technique and how it compares to other methods for nonlinear regression (or regression trees), see Hastie, Tibshirani, and Friedman (2001).

Regression Problems
Regression problems involve determining the relationship between a set of dependent variables (also called output, outcome, or response variables) and one or more independent variables (also known as input or predictor variables). The dependent variable is the one whose values you want to predict, based on the values of the independent (predictor) variables. For instance, one might be interested in the number of car accidents on the roads, which can be caused by 1) bad weather and 2) drunk driving. In this case one might write, for example,

Number_of_Accidents = Some Constant + 0.5*Bad_Weather + 2.0*Drunk_Driving

The variable Number of Accidents is the dependent variable that is thought to be caused by (among other variables) Bad Weather and Drunk Driving (hence the name dependent variable). Note that the independent variables are multiplied by factors, i.e., 0.5 and 2.0. These are known as regression coefficients. The larger these coefficients, the stronger the influence of the independent variables on the dependent variable. If the two predictors in this simple (fictitious) example were measured on the same scale (e.g., if the variables were standardized to a mean of 0.0 and standard deviation 1.0), then Drunk Driving could be inferred to contribute 4 times more to car accidents than Bad Weather. (If the variables are not measured on the same scale, then direct comparisons between these coefficients are not meaningful, and, usually, some other standardized measure of predictor "importance" is included in the results.)

For additional details regarding these types of statistical models, refer to Multiple Regression or General Linear Models (GLM), as well as General Regression Models (GRM). In general, regression procedures are widely used in research in the social and natural sciences. Regression allows the researcher to ask (and hopefully answer) the general question "what is the best predictor of ..." For example, educational researchers might want to learn what the best predictors of success in high-school are. Psychologists may want to determine which personality variable best predicts social adjustment. Sociologists may want to find out which of the multiple social indicators best predict whether a new immigrant group will adapt and be absorbed into society.

Machine Learning
http://statsoft.com/textbook/stathome.html

Introductory Overview
Support Vector Machines (SVM)
Naive Bayes
k-Nearest Neighbors (KNN)

--------------------------------------------------------------------------------
Machine Learning Introductory Overview


Machine Learning includes a number of advanced statistical methods for handling regression and classification tasks with multiple dependent and independent variables. These methods include Support Vector Machines (SVM) for regression and classification, Naive Bayes for classification, and k-Nearest Neighbors (KNN) for regression and classification. Detailed discussions of these techniques can be found in Hastie, Tibshirani, and Friedman (2001); a specialized comprehensive introduction to support vector machines can also be found in Cristianini and Shawe-Taylor (2000).

Support Vector Machines (SVM)

This method performs regression and classification tasks by constructing nonlinear decision boundaries. Because of the nature of the feature space in which these boundaries are found, Support Vector Machines can exhibit a large degree of flexibility in handling classification and regression tasks of varied complexities. There are several types of Support Vector models including linear, polynomial, RBF, and sigmoid.
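A minimal sketch, assuming scikit-learn is available, fitting an RBF-kernel support vector classifier to a toy two-class problem that is not linearly separable:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 2))
y = (np.sqrt((X ** 2).sum(axis=1)) < 1.0).astype(int)   # class 1 inside a circle, class 0 outside

clf = SVC(kernel="rbf", C=1.0, gamma="scale")           # nonlinear decision boundary via the RBF kernel
clf.fit(X, y)
print("training accuracy:", round(clf.score(X, y), 3))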

Naive Bayes

This is a well-established Bayesian method primarily formulated for performing classification tasks. Given its simplicity, i.e., the assumption that the input variables are statistically independent of one another given the class, Naive Bayes models are effective classification tools that are easy to use and interpret. Naive Bayes is particularly appropriate when the dimensionality of the input space (i.e., the number of input variables) is high (a problem known as the curse of dimensionality). For the reasons given above, Naive Bayes can often outperform other, more sophisticated classification methods. A variety of methods exist for modeling the conditional distributions of the inputs, including normal, lognormal, gamma, and Poisson.
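A minimal sketch, assuming scikit-learn is available, using its Gaussian Naive Bayes implementation (each input modeled as normal within each class) on invented two-class data:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(6)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))   # class 0
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))   # class 1
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

model = GaussianNB().fit(X, y)                    # normal conditional distribution for each input
print(model.predict([[0.2, -0.1], [1.8, 2.3]]))   # expected: [0 1]
print(model.predict_proba([[1.0, 1.0]]))          # class probabilities near the class boundary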

k-Nearest Neighbors

k-Nearest Neighbors is a memory-based method that, in contrast to other statistical methods, requires no training (i.e., no model to fit). It falls into the category of Prototype Methods. It functions on the intuitive idea that close objects are more likely to be in the same category. Thus, in KNN, predictions are based on a set of prototype examples that are used to predict new (i.e., unseen) data based on the majority vote (for classification tasks) or averaging (for regression) over the k nearest prototypes (hence the name k-nearest neighbors).
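Because there is no model to fit, KNN can be sketched in a few lines of plain NumPy; the data and the choice of k are hypothetical:

import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    # Memory-based: just distances from the new point to the stored prototypes
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return np.bincount(y_train[nearest]).argmax()   # majority vote (classification)

rng = np.random.default_rng(7)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

print(knn_predict(X_train, y_train, np.array([0.1, 0.2])))   # expected: 0
print(knn_predict(X_train, y_train, np.array([3.2, 2.8])))   # expected: 1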

Multidimensional Scaling
http://statsoft.com/textbook/stathome.html

General Purpose
Logic of MDS
Computational Approach
How many dimensions to specify?
Interpreting the Dimensions
Applications
MDS and Factor Analysis

--------------------------------------------------------------------------------

General Purpose

Multidimensional scaling (MDS) can be considered to be an alternative to factor analysis (see Factor Analysis). In general, the goal of the analysis is to detect meaningful underlying dimensions that allow the researcher to explain observed similarities or dissimilarities (distances) between the investigated objects. In factor analysis, the similarities between objects (e.g., variables) are expressed in the correlation matrix. With MDS one may analyze any kind of similarity or dissimilarity matrix, in addition to correlation matrices.

Logic of MDS

The following simple example may demonstrate the logic of an MDS analysis. Suppose we take a matrix of distances between major US cities from a map. We then analyze this matrix, specifying that we want to reproduce the distances based on two dimensions. As a result of the MDS analysis, we would most likely obtain a two-dimensional representation of the locations of the cities, that is, we would basically obtain a two-dimensional map.

In general then, MDS attempts to arrange "objects" (major cities in this example) in a space with a particular number of dimensions (two-dimensional in this example) so as to reproduce the observed distances. As a result, we can "explain" the distances in terms of underlying dimensions; in our example, we could explain the distances in terms of the two geographical dimensions: north/south and east/west.
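A minimal sketch of this city example, assuming scikit-learn is available; the distance matrix below contains only rough, illustrative figures (miles):

import numpy as np
from sklearn.manifold import MDS

cities = ["New York", "Chicago", "Dallas", "Los Angeles"]
D = np.array([[   0,  710, 1370, 2450],     # rough inter-city distances, for illustration only
              [ 710,    0,  800, 1745],
              [1370,  800,    0, 1240],
              [2450, 1745, 1240,    0]])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)               # two-dimensional configuration reproducing the distances
for name, (x, y) in zip(cities, coords):
    print(f"{name:12s} {x:8.1f} {y:8.1f}")
# The configuration is recovered only up to rotation and reflection (see the note on axes below).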

Orientation of axes. As in factor analysis, the actual orientation of axes in the final solution is arbitrary. To return to our example, we could rotate the map in any way we want; the distances between cities remain the same. Thus, the final orientation of axes in the plane or space is mostly the result of a subjective decision by the researcher, who will choose an orientation that can be most easily explained. To return to our example, we could have chosen an orientation of axes other than north/south and east/west; however, that orientation is most convenient because it "makes the most sense" (i.e., it is easily interpretable).

Neural Networks
http://www.statsoft.com/TEXTBOOK/stneunet.html
http://statsoft.com/textbook/stathome.html

Preface
Applications for Neural Networks
The Biological Inspiration
The Basic Artificial Model
Using a Neural Network
Gathering Data for Neural Networks
Summary
Pre- and Post-processing
Multilayer Perceptrons
Training Multilayer Perceptrons
The Back Propagation Algorithm
Over-learning and Generalization
Data Selection
Insights into MLP Training
Other MLP Training Algorithms
Radial Basis Function Networks
Probabilistic Neural Networks
Generalized Regression Neural Networks
Linear Networks
SOFM Networks
Classification in Neural Networks
Classification Statistics
Regression Problems in Neural Networks
Time Series Prediction in Neural Networks
Variable Selection and Dimensionality Reduction
Ensembles and Resampling
Recommended Textbooks

--------------------------------------------------------------------------------
Many concepts related to the neural networks methodology are best explained if they are illustrated with applications of a specific neural network program. Therefore, this chapter contains many references to STATISTICA Neural Networks (in short, ST Neural Networks, a neural networks application available from StatSoft), a particularly comprehensive neural network tool.
--------------------------------------------------------------------------------
Preface

Neural networks have seen an explosion of interest over the last few years, and are being successfully applied across an extraordinary range of problem domains, in areas as diverse as finance, medicine, engineering, geology and physics. Indeed, anywhere that there are problems of prediction, classification or control, neural networks are being introduced. This sweeping success can be attributed to a few key factors:


Power. Neural networks are very sophisticated modeling techniques capable of modeling extremely complex functions. In particular, neural networks are nonlinear (a term which is discussed in more detail later in this section). For many years linear modeling has been the commonly used technique in most modeling domains since linear models have well-known optimization strategies. Where the linear approximation was not valid (which was frequently the case) the models suffered accordingly. Neural networks also keep in check the curse of dimensionality problem that bedevils attempts to model nonlinear functions with large numbers of variables.
Ease of use. Neural networks learn by example. The neural network user gathers representative data, and then invokes training algorithms to automatically learn the structure of the data. Although the user does need to have some heuristic knowledge of how to select and prepare data, how to select an appropriate neural network, and how to interpret the results, the level of user knowledge needed to successfully apply neural networks is much lower than would be the case using (for example) some more traditional nonlinear statistical methods.
Neural networks are also intuitively appealing, based as they are on a crude low-level model of biological neural systems. In the future, the development of this neurobiological modeling may lead to genuinely intelligent computers.


Nonlinear Estimation
http://statsoft.com/textbook/stathome.html

General Purpose
Estimating Linear and Nonlinear Models
Common Nonlinear Regression Models
Intrinsically Linear Regression Models
Intrinsically Nonlinear Regression Models
Nonlinear Estimation Procedures
Least Squares Estimation
Loss Functions
Weighted Least Squares
Maximum Likelihood
Maximum likelihood and probit/logit models
Function Minimization Algorithms
Start Values, Step Sizes, Convergence Criteria
Penalty Functions, Constraining Parameters
Local Minima
Quasi-Newton Method
Simplex Procedure
Hooke-Jeeves Pattern Moves
Rosenbrock Pattern Search
Hessian Matrix and Standard Errors
Evaluating the Fit of the Model
Proportion of Variance Explained
Goodness-of-fit Chi-square
Plot of Observed vs. Predicted Values
Normal and Half-Normal Probability Plots
Plot of the Fitted Function
Variance/Covariance Matrix for Parameters

--------------------------------------------------------------------------------

General Purpose

In the most general terms, Nonlinear Estimation will compute the relationship between a set of independent variables and a dependent variable. For example, we may want to compute the relationship between the dose of a drug and its effectiveness, the relationship between training and subsequent performance on a task, the relationship between the price of a house and the time it takes to sell it, etc. You may recognize research issues in these examples that are commonly addressed by such techniques as multiple regression (see, Multiple Regression) or analysis of variance (see, ANOVA/MANOVA). In fact, you may think of Nonlinear Estimation as a generalization of those methods. Specifically, multiple regression (and ANOVA) assumes that the relationship between the independent variable(s) and the dependent variable is linear in nature. Nonlinear Estimation leaves it up to you to specify the nature of the relationship; for example, you may specify the dependent variable to be a logarithmic function of the independent variable(s), an exponential function, a function of some complex ratio of independent measures, etc. (However, if all variables of interest are categorical in nature, or can be converted into categorical variables, you may also consider Correspondence Analysis.)
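A minimal sketch of fitting a user-specified nonlinear (here exponential) model by least squares, assuming SciPy is available; the data are simulated:

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    # The researcher specifies the form of the relationship; here, exponential growth
    return a * np.exp(b * x)

rng = np.random.default_rng(8)
x = np.linspace(0, 4, 50)
y = 2.0 * np.exp(0.7 * x) + rng.normal(0, 0.5, 50)

params, cov = curve_fit(model, x, y, p0=[1.0, 0.5])   # least-squares estimates of a and b
print("estimated a, b:", np.round(params, 3))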

When allowing for any type of relationship between the independent variables and the dependent variable, two issues raise their heads. First, what types of relationships "make sense", that is, are interpretable in a meaningful manner? Note that the simple linear relationship is very convenient in that it allows us to make such straightforward interpretations as "the more of x (e.g., the higher the price of a house), the more there is of y (the longer it takes to sell it); and given a particular increase in x, a proportional increase in y can be expected." Nonlinear relationships cannot usually be interpreted and verbalized in such a simple manner. The second issue that needs to be addressed is how to exactly compute the relationship, that is, how to arrive at results that allow us to say whether or not there is a nonlinear relationship as predicted.

Let us now discuss the nonlinear regression problem in a somewhat more formal manner, that is, introduce the common terminology that will allow us to examine the nature of these techniques more closely, and how they are used to address important questions in various research domains (medicine, social sciences, physics, chemistry, pharmacology, engineering, etc.).

Nonparametric Statistics
http://statsoft.com/textbook/stathome.html
General Purpose
Brief Overview of Nonparametric Procedures
When to Use Which Method
Nonparametric Correlations

--------------------------------------------------------------------------------
General Purpose
Brief review of the idea of significance testing. Understanding the idea of nonparametric statistics (the term nonparametric was first used by Wolfowitz, 1942) first requires a basic understanding of parametric statistics. The Elementary Concepts chapter of the manual introduces the concept of statistical significance testing based on the sampling distribution of a particular statistic (you may want to review that chapter before reading on). In short, if we have a basic knowledge of the underlying distribution of a variable, then we can make predictions about how, in repeated samples of equal size, this particular statistic will "behave," that is, how it is distributed. For example, if we draw 100 random samples of 100 adults each from the general population, and compute the mean height in each sample, then the distribution of the standardized means across samples will likely approximate the normal distribution (to be precise, Student's t distribution with 99 degrees of freedom; see below). Now imagine that we take an additional sample in a particular city ("Tallburg") where we suspect that people are taller than the average population. If the standardized mean height in that sample falls beyond the 95th percentile of the t distribution (i.e., in the upper 5% tail), then we conclude that, indeed, the people of Tallburg are taller than the average population.
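This claim about the sampling distribution is easy to check by simulation; the sketch below, with an invented population mean and standard deviation for height, compares the empirical 95th percentile of the standardized sample means with that of the t distribution with 99 degrees of freedom (SciPy assumed available):

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(9)
n, n_samples = 100, 10000
mu, sigma = 170.0, 10.0                 # hypothetical population of adult heights (cm)

samples = rng.normal(mu, sigma, size=(n_samples, n))
t_stats = (samples.mean(axis=1) - mu) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

print("empirical 95th percentile:", round(np.percentile(t_stats, 95), 3))
print("t(99) 95th percentile:    ", round(t.ppf(0.95, df=n - 1), 3))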

Are most variables normally distributed? In the above example we relied on our knowledge that, in repeated samples of equal size, the standardized means (for height) will be distributed following the t distribution (with a particular mean and variance). However, this will only be true if in the population the variable of interest (height in our example) is normally distributed, that is, if the distribution of people of particular heights follows the normal distribution (the bell-shaped distribution).
For many variables of interest, we simply do not know for sure that this is the case. For example, is income distributed normally in the population? -- probably not. The incidence rates of rare diseases are not normally distributed in the population, the number of car accidents is also not normally distributed, and neither are very many other variables in which a researcher might be interested.

For more information on the normal distribution, see Elementary Concepts; for information on tests of normality, see Normality tests.

Sample size. Another factor that often limits the applicability of tests based on the assumption that the sampling distribution is normal is the size of the sample of data available for the analysis (sample size; n). We can assume that the sampling distribution is normal even if we are not sure that the distribution of the variable in the population is normal, as long as our sample is large enough (e.g., 100 or more observations). However, if our sample is very small, then those tests can be used only if we are sure that the variable is normally distributed, and there is no way to test this assumption if the sample is small.

Problems in measurement. Applications of tests that are based on the normality assumption are further limited by a lack of precise measurement. For example, let us consider a study where grade point average (GPA) is measured as the major variable of interest. Is an A average twice as good as a C average? Is the difference between a B and an A average comparable to the difference between a D and a C average? In a sense, the GPA is a crude measure of scholastic accomplishments that only allows us to establish a rank ordering of students from "good" students to "poor" students. This general measurement issue is usually discussed in statistics textbooks in terms of types of measurement or scale of measurement. Without going into too much detail, most common statistical techniques such as analysis of variance (and t-tests), regression, etc. assume that the underlying measurements are at least on an interval scale, meaning that equally spaced intervals on the scale can be compared in a meaningful manner (e.g., B minus A is equal to D minus C). However, as in our example, this assumption is very often not tenable, and the data represent a rank ordering of observations (ordinal) rather than precise measurements.

Parametric and nonparametric methods. Hopefully, after this somewhat lengthy introduction, the need is evident for statistical procedures that allow us to process data of "low quality," from small samples, on variables about which nothing is known (concerning their distribution). Specifically, nonparametric methods were developed to be used in cases when the researcher knows nothing about the parameters of the variable of interest in the population (hence the name nonparametric). In more technical terms, nonparametric methods do not rely on the estimation of parameters (such as the mean or the standard deviation) describing the distribution of the variable of interest in the population. Therefore, these methods are also sometimes (and more appropriately) called parameter-free methods or distribution-free methods.
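For example, a rank-based (distribution-free) comparison of two small groups can be carried out with the Mann-Whitney U test; a minimal sketch with invented ordinal scores, assuming SciPy is available:

import numpy as np
from scipy.stats import mannwhitneyu

group_a = np.array([3, 5, 4, 6, 2, 5, 4])   # hypothetical ordinal quality ratings
group_b = np.array([6, 7, 5, 8, 7, 6, 9])

stat, p = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {stat}, p = {p:.4f}")
# The test uses only the rank ordering of the observations, not their exact values.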


Partial Least Squares (PLS)
http://statsoft.com/textbook/stathome.html

Basic Ideas
Computational Approach
Basic Model
NIPALS Algorithm
SIMPLS Algorithm
Training and Verification (Crossvalidation) Samples
Types of Analyses
Between-subject Designs
Distance Graphs

--------------------------------------------------------------------------------

This chapter describes the use of partial least squares regression analysis. If you are unfamiliar with the basic methods of regression in linear models, it may be useful to first review the information on these topics in Elementary Concepts. The different designs discussed in this chapter are also described in the context of General Linear Models, Generalized Linear Models, and General Stepwise Regression.



--------------------------------------------------------------------------------

Basic Ideas

Partial least squares regression is an extension of the multiple linear regression model (see, e.g., Multiple Regression or General Stepwise Regression). In its simplest form, a linear model specifies the (linear) relationship between a dependent (response) variable Y, and a set of predictor variables, the X's, so that
Y = b0 + b1X1 + b2X2 + ... + bpXp

In this equation b0 is the regression coefficient for the intercept and the bi values are the regression coefficients (for variables 1 through p) computed from the data.

So for example, one could estimate (i.e., predict) a person's weight as a function of the person's height and gender. You could use linear regression to estimate the respective regression coefficients from a sample of data, measuring height, weight, and observing the subjects' gender. For many data analysis problems, estimates of the linear relationships between variables are adequate to describe the observed data, and to make reasonable predictions for new observations (see Multiple Regression or General Stepwise Regression for additional details).

The multiple linear regression model has been extended in a number of ways to address more sophisticated data analysis problems. The multiple linear regression model serves as the basis for a number of multivariate methods such as discriminant analysis (i.e., the prediction of group membership from the levels of continuous predictor variables), principal components regression (i.e., the prediction of responses on the dependent variables from factors underlying the levels of the predictor variables), and canonical correlation (i.e., the prediction of factors underlying responses on the dependent variables from factors underlying the levels of the predictor variables). These multivariate methods all have two important properties in common. These methods impose restrictions such that (1) factors underlying the Y and X variables are extracted from the Y'Y and X'X matrices, respectively, and never from cross-product matrices involving both the Y and X variables, and (2) the number of prediction functions can never exceed the minimum of the number of Y variables and X variables.

Partial least squares regression extends multiple linear regression without imposing the restrictions employed by discriminant analysis, principal components regression, and canonical correlation. In partial least squares regression, prediction functions are represented by factors extracted from the Y'XX'Y matrix. The number of such prediction functions that can be extracted typically will exceed the maximum of the number of Y and X variables.

In short, partial least squares regression is probably the least restrictive of the various multivariate extensions of the multiple linear regression model. This flexibility allows it to be used in situations where the use of traditional multivariate methods is severely limited, such as when there are fewer observations than predictor variables. Furthermore, partial least squares regression can be used as an exploratory analysis tool to select suitable predictor variables and to identify outliers before classical linear regression.
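A minimal sketch of that situation, assuming scikit-learn is available, with more (simulated) predictors than observations:

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(10)
n, p = 20, 50                                        # fewer observations than predictors
X = rng.normal(size=(n, p))
y = X[:, :3].sum(axis=1) + rng.normal(0, 0.1, n)     # only a few predictors actually matter

pls = PLSRegression(n_components=2).fit(X, y)
pred = pls.predict(X).ravel()
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print("R-square on the training data:", round(r2, 3))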

Partial least squares regression has been used in various disciplines such as chemistry, economics, medicine, psychology, and pharmaceutical science where predictive linear modeling, especially with a large number of predictors, is necessary. Especially in chemometrics, partial least squares regression has become a standard tool for modeling linear relations between multivariate measurements (de Jong, 1993).

Power Analysis
http://statsoft.com/textbook/stathome.html
General Purpose
Power Analysis and Sample Size Calculation in Experimental Design
Sampling Theory
Hypothesis Testing Logic
Calculating Power
Calculating Required Sample Size
Graphical Approaches to Power Analysis
Noncentrality Interval Estimation and the Evaluation of Statistical Models
Inadequacies of the Hypothesis Testing Approach
Advantages of Interval Estimation
Reasons Why Interval Estimates are Seldom Reported
Replacing Traditional Hypothesis Tests with Interval Estimates
General Purpose
The techniques of statistical power analysis, sample size estimation, and advanced techniques for confidence interval estimation are discussed here. The main goal of the first two techniques is to allow you to decide, while in the process of designing an experiment, (a) how large a sample is needed to enable statistical judgments that are accurate and reliable and (b) how likely your statistical test will be to detect effects of a given size in a particular situation. The third technique is useful in implementing objectives (a) and (b) and in evaluating the size of experimental effects in practice.

Performing power analysis and sample size estimation is an important aspect of experimental design, because without these calculations, sample size may be too high or too low. If sample size is too low, the experiment will lack the precision to provide reliable answers to the questions it is investigating. If sample size is too large, time and resources will be wasted, often for minimal gain.
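A minimal sketch of such a calculation for a two-sample t test, assuming the statsmodels package is available; the effect size, alpha, and target power below are illustrative choices:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Assumed effect size (Cohen's d = 0.5), two-sided alpha = 0.05, desired power = 0.80
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                   ratio=1.0, alternative="two-sided")
print("required sample size per group:", round(n_per_group))   # roughly 64

achieved = analysis.power(effect_size=0.5, nobs1=64, alpha=0.05, ratio=1.0)
print("power with 64 per group:", round(achieved, 3))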

In some power analysis software programs, a number of graphical and analytical tools are available to enable precise evaluation of the factors affecting power and sample size in many of the most commonly encountered statistical analyses. This information can be crucial to the design of a study that is cost-effective and scientifically useful.

Noncentrality interval estimation and other advanced confidence interval procedures provide methods for analyzing the importance of an observed experimental result. An increasing number of influential statisticians are suggesting that confidence interval estimation should augment or replace traditional hypothesis testing approaches in the analysis of experimental data.


Power Analysis and Sample Size Calculation in Experimental Design

There is a growing recognition of the importance of power analysis and sample size calculation in the proper design of experiments. The following topics discuss the fundamental ideas behind these methods.


Sampling Theory
Hypothesis Testing Logic
Calculating Power
Calculating Required Sample Size
Graphical Approaches to Power Analysis

Process Analysis
http://statsoft.com/textbook/stathome.html

Sampling Plans
General Purpose
Computational Approach
Means for H0 and H1
Alpha and Beta Error Probabilities
Fixed Sampling Plans
Sequential Sampling Plans
Summary
Process (Machine) Capability Analysis
Introductory Overview
Computational Approach
Process Capability Indices
Process Performance vs. Process Capability
Using Experiments to Improve Process Capability
Testing the Normality Assumption
Tolerance Limits
Gage Repeatability and Reproducibility
Introductory Overview
Computational Approach
Plots of Repeatability and Reproducibility
Components of Variance
Summary
Non-Normal Distributions
Introductory Overview
Fitting Distributions by Moments
Assessing the Fit: Quantile and Probability Plots
Non-Normal Process Capability Indices (Percentile Method)
Weibull and Reliability/Failure Time Analysis
General Purpose
The Weibull Distribution
Censored Observations
Two- and three-parameter Weibull Distribution
Parameter Estimation
Goodness of Fit Indices
Interpreting Results
Grouped Data
Modified Failure Order for Multiple-Censored Data
Weibull CDF, Reliability, and Hazard Functions

--------------------------------------------------------------------------------
Sampling plans are discussed in detail in Duncan (1974) and Montgomery (1985); most process capability procedures (and indices) were only recently introduced to the US from Japan (Kane, 1986); however, they are discussed in three excellent recent hands-on books by Bhote (1988), Hart and Hart (1989), and Pyzdek (1989); detailed discussions of these methods can also be found in Montgomery (1991).
Step-by-step instructions for the computation and interpretation of capability indices are also provided in the Fundamental Statistical Process Control Reference Manual published by the ASQC (American Society for Quality Control) and AIAG (Automotive Industry Action Group, 1991; referenced as ASQC/AIAG, 1991). Repeatability and reproducibility (R & R) methods are discussed in Grant and Leavenworth (1980), Pyzdek (1989) and Montgomery (1991); a more detailed discussion of the subject (of variance estimation) is also provided in Duncan (1974).

Step-by-step instructions on how to conduct and analyze R & R experiments are presented in the Measurement Systems Analysis Reference Manual published by ASQC/AIAG (1990). In the following topics, we will briefly introduce the purpose and logic of each of these procedures. For more information on analyzing designs with random effects and for estimating components of variance, see the Variance Components chapter.


--------------------------------------------------------------------------------
Sampling Plans
General Purpose
Computational Approach
Means for H0 and H1
Alpha and Beta Error Probabilities
Fixed Sampling Plans
Sequential Sampling Plans
Summary
General Purpose

A common question that quality control engineers face is to determine how many items from a batch (e.g., shipment from a supplier) to inspect in order to ensure that the items (products) in that batch are of acceptable quality. For example, suppose we have a supplier of piston rings for small automotive engines that our company produces, and our goal is to establish a sampling procedure (of piston rings from the delivered batches) that ensures a specified quality. In principle, this problem is similar to that of on-line quality control discussed in Quality Control. In fact, you may want to read that section at this point to familiarize yourself with the issues involved in industrial statistical quality control.

Acceptance sampling. The procedures described here are useful whenever we need to decide whether or not a batch or lot of items complies with specifications, without having to inspect 100% of the items in the batch. Because of the nature of the problem -- whether or not to accept a batch -- these methods are also sometimes discussed under the heading of acceptance sampling.

Advantages over 100% inspection. An obvious advantage of acceptance sampling over 100% inspection of the batch or lot is that reviewing only a sample requires less time, effort, and money. In some cases, inspection of an item is destructive (e.g., stress testing of steel), and testing 100% would destroy the entire batch. Finally, from a managerial standpoint, rejecting an entire batch or shipment (based on acceptance sampling) from a supplier, rather than just a certain percent of defective items (based on 100% inspection) often provides a stronger incentive to the supplier to adhere to quality standards.

Computational Approach

In principle, the computational approach to the question of how large a sample to take is straightforward. Elementary Concepts discusses the concept of the sampling distribution. Briefly, if we were to take repeated samples of a particular size from a population of, for example, piston rings and compute their average diameters, then the distribution of those averages (means) would approach the normal distribution with a particular mean and standard deviation (or standard error; in sampling distributions the term standard error is preferred, in order to distinguish the variability of the means from the variability of the items in the population). Fortunately, we do not need to take repeated samples from the population in order to estimate the location (mean) and variability (standard error) of the sampling distribution. If we have a good idea (estimate) of what the variability (standard deviation or sigma) is in the population, then we can infer the sampling distribution of the mean. In principle, this information is sufficient to estimate the sample size that is needed in order to detect a certain change in quality (from target specifications). Without going into the details about the computational procedures involved, let us next review the particular information that the engineer must supply in order to estimate required sample sizes.
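Under normal-theory assumptions with a known (estimated) sigma, the required sample size follows directly from the alpha and beta error probabilities and the size of the shift to be detected; a minimal sketch with invented piston ring numbers, assuming SciPy is available:

import numpy as np
from scipy.stats import norm

sigma = 0.01            # assumed standard deviation of piston ring diameters (cm)
delta = 0.005           # smallest shift from the target diameter we want to detect
alpha, beta = 0.05, 0.10

z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test on the mean
z_beta = norm.ppf(1 - beta)
n = ((z_alpha + z_beta) * sigma / delta) ** 2
print("required sample size:", int(np.ceil(n)))   # about 43 rings per sample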

Means for H0 and H1

To formalize the inspection process of, for example, a shipment of piston rings, we can formulate two alternative hypotheses: First, we may hypothesize that the average piston ring diameters comply with specifications. This hypothesis is called the null hypothesis (H0). The second and alternative hypothesis (H1) is that the diameters of the piston rings delivered to us deviate from specifications by more than a certain amount. Note that we may specify these types of hypotheses not just for measurable variables such as diameters of piston rings, but also for attributes. For example, we may hypothesize (H1) that the number of defective parts in the batch exceeds a certain percentage. Intuitively, it should be clear that the larger the difference between H0 and H1, the smaller the sample necessary to detect this difference (see Elementary Concepts).

Alpha and Beta Error Probabilities

To return to the piston rings example, there are two types of mistakes that we can make when inspecting a batch of piston rings that has just arrived at our plant. First, we may erroneously reject H0, that is, reject the batch because we erroneously conclude that the piston ring diameters deviate from target specifications. The probability of committing this mistake is usually called the alpha error probability. The second mistake that we can make is to erroneously not reject H0 (accept the shipment of piston rings), when, in fact, the mean piston ring diameter deviates from the target specification by a certain amount. The probability of committing this mistake is usually called the beta error probability. Intuitively, the more certain we want to be, that is, the lower we set the alpha and beta error probabilities, the larger the sample will have to be; in fact, in order to be 100% certain, we would have to measure every single piston ring delivered to our company.

Fixed Sampling Plans

To construct a simple sampling plan, we would first decide on a sample size, based on the means under H0/H1 and the particular alpha and beta error probabilities. Then, we would take a single sample of this fixed size and, based on the mean in this sample, decide whether to accept or reject the batch. This procedure is referred to as a fixed sampling plan.

Operating characteristic (OC) curve. The power of the fixed sampling plan can be summarized via the operating characteristic curve. In that plot, the probability of rejecting H0 (and accepting H1) is plotted on the Y axis as a function of the actual shift of the process mean from the target (nominal) specification, which is plotted on the X axis. This probability is, of course, one minus the beta error probability (the probability of erroneously accepting H0 and rejecting H1); this value is referred to as the power of the fixed sampling plan to detect deviations. Such curves are typically drawn for several sample sizes, so that the power of the respective plans can be compared.
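A minimal sketch of how such a curve can be computed for a fixed plan on the mean (normal theory, sigma assumed known; all numbers invented):

import numpy as np
from scipy.stats import norm

sigma, n, alpha = 0.01, 40, 0.05
se = sigma / np.sqrt(n)                   # standard error of the sample mean
z_crit = norm.ppf(1 - alpha / 2)

shifts = np.linspace(0, 0.01, 6)          # true deviations of the process mean from target
# Probability of rejecting H0 at each shift (one point of the curve per shift)
power = norm.cdf(-z_crit - shifts / se) + 1 - norm.cdf(z_crit - shifts / se)

for d, pw in zip(shifts, power):
    print(f"shift {d:.3f}  P(reject H0) = {pw:.3f}")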

Sequential Sampling Plans

As an alternative to the fixed sampling plan, we could randomly choose individual piston rings and record their deviations from specification. As we continue to measure each piston ring, we could keep a running total of the sum of deviations from specification. Intuitively, if H1 is true, that is, if the average piston ring diameter in the batch is not on target, then we would expect to observe a slowly increasing or decreasing cumulative sum of deviations, depending on whether the average diameter in the batch is larger or smaller than the specification, respectively. It turns out that this kind of sequential sampling of individual items from the batch is a more sensitive procedure than taking a fixed sample. In practice, we continue sampling until we either accept or reject the batch.

Using a sequential sampling plan. Typically, we would produce a graph in which the cumulative deviations from specification (plotted on the Y-axis) are shown for successively sampled items (e.g., piston rings, plotted on the X-axis). Then two sets of lines are drawn in this graph to denote the "corridor" along which we will continue to draw samples, that is, as long as the cumulative sum of deviations from specifications stays within this corridor, we continue sampling.

If the cumulative sum of deviations steps outside the corridor, we stop sampling. If the cumulative sum moves above the upper line or below the lower line, we reject the batch. If the cumulative sum steps out of the corridor to the inside, that is, if it moves closer to the center line, we accept the batch (since this indicates zero deviation from specification). Note that the inside area starts only at a certain sample number; this indicates the minimum number of samples necessary to accept the batch (with the current error probability).

Summary

To summarize, the idea of (acceptance) sampling is to use statistical "inference" to accept or reject an entire batch of items, based on the inspection of only relatively few items from that batch. The advantage of applying statistical reasoning to this decision is that we can be explicit about the probabilities of making a wrong decision.

Whenever possible, sequential sampling plans are preferable to fixed sampling plans because they are more powerful. In most cases, relative to the fixed sampling plan, using sequential plans requires fewer items to be inspected in order to arrive at a decision with the same degree of certainty.

Quality Control Charts
http://statsoft.com/textbook/stathome.html

General Purpose
General Approach
Establishing Control Limits
Common Types of Charts
Short Run Control Charts
Short Run Charts for Variables
Short Run Charts for Attributes
Unequal Sample Sizes
Control Charts for Variables vs. Charts for Attributes
Control Charts for Individual Observations
Out-of-Control Process: Runs Tests
Operating Characteristic (OC) Curves
Process Capability Indices
Other Specialized Control Charts

--------------------------------------------------------------------------------
General Purpose
In all production processes, we need to monitor the extent to which our products meet specifications. In the most general terms, there are two "enemies" of product quality: (1) deviations from target specifications, and (2) excessive variability around target specifications. During the earlier stages of developing the production process, designed experiments are often used to optimize these two quality characteristics (see Experimental Design); the methods provided in Quality Control are on-line or in-process quality control procedures to monitor an on-going production process. For detailed descriptions of these charts and extensive annotated examples, see Buffa (1972), Duncan (1974), Grant and Leavenworth (1980), Juran (1962), Juran and Gryna (1970), Montgomery (1985, 1991), Shirland (1993), or Vaughn (1974). Two recent excellent introductory texts with a "how-to" approach are Hart & Hart (1989) and Pyzdek (1989); two recent German language texts on this subject are Rinne and Mittag (1995) and Mittag (1993).



General Approach

The general approach to on-line quality control is straightforward: We simply extract samples of a certain size from the ongoing production process. We then produce line charts of the variability in those samples, and consider their closeness to target specifications. If a trend emerges in those lines, or if samples fall outside pre-specified limits, then we declare the process to be out of control and take action to find the cause of the problem. These types of charts are sometimes also referred to as Shewhart control charts (named after W. A. Shewhart who is generally credited as being the first to introduce these methods; see Shewhart, 1931).
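
A minimal sketch of the X-bar chart computations (center line and the conventional 3-sigma control limits) is shown below. It assumes equal sample sizes and uses the pooled within-sample standard deviation as one simple estimate of sigma; range-based estimates are also common in practice. The data are simulated, not taken from the text.

    import numpy as np

    def xbar_chart_limits(samples):
        """samples: 2-D array, one row per sample of equal size n."""
        samples = np.asarray(samples, dtype=float)
        k, n = samples.shape
        grand_mean = samples.mean()                              # center line
        sigma_hat = np.sqrt(samples.var(axis=1, ddof=1).mean())  # pooled within-sample SD
        half_width = 3 * sigma_hat / np.sqrt(n)
        return grand_mean - half_width, grand_mean, grand_mean + half_width

    rng = np.random.default_rng(1)
    samples = rng.normal(74.0, 0.01, size=(20, 5))   # e.g., 20 samples of 5 ring diameters
    lcl, center, ucl = xbar_chart_limits(samples)
    print(round(lcl, 4), round(center, 4), round(ucl, 4))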

Common Types of Charts

The types of charts are often classified according to the type of quality characteristic that they are supposed to monitor: there are quality control charts for variables and control charts for attributes. Specifically, the following charts are commonly constructed for controlling variables:

X-bar chart. In this chart the sample means are plotted in order to control the mean value of a variable (e.g., size of piston rings, strength of materials, etc.).
R chart. In this chart, the sample ranges are plotted in order to control the variability of a variable.
S chart. In this chart, the sample standard deviations are plotted in order to control the variability of a variable.
S**2 chart. In this chart, the sample variances are plotted in order to control the variability of a variable.
For controlling quality characteristics that represent attributes of the product, the following charts are commonly constructed:
C chart. In this chart (see example below), we plot the number of defectives (per batch, per day, per machine, per 100 feet of pipe, etc.). This chart assumes that defects of the quality attribute are rare, and the control limits in this chart are computed based on the Poisson distribution (distribution of rare events).
U chart. In this chart we plot the rate of defectives, that is, the number of defectives divided by the number of units inspected (the n; e.g., feet of pipe, number of batches). Unlike the C chart, this chart does not require a constant number of units, and it can be used, for example, when the batches (samples) are of different sizes.
Np chart. In this chart, we plot the number of defectives (per batch, per day, per machine) as in the C chart. However, the control limits in this chart are not based on the distribution of rare events, but rather on the binomial distribution. Therefore, this chart should be used if the occurrence of defectives is not rare (e.g., they occur in more than 5% of the units inspected). For example, we may use this chart to control the number of units produced with minor flaws.
P chart. In this chart, we plot the percent of defectives (per batch, per day, per machine, etc.) as in the U chart. However, the control limits in this chart are not based on the distribution of rare events but rather on the binomial distribution (of proportions). Therefore, this chart is most applicable to situations where the occurrence of defectives is not rare (e.g., we expect the percent of defectives to be more than 5% of the total number of units produced).
All of these charts can be adapted for short production runs (short run charts), and for multiple process streams.
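
As an illustration of how the control limits for the attribute charts above are typically derived (Poisson-based for the C chart, binomial-based for the P chart), here is a hedged sketch; the 3-sigma limit convention is assumed and the counts are made up.

    import numpy as np

    def c_chart_limits(defect_counts):
        """C chart: counts of defects per unit; Poisson-based 3-sigma limits."""
        c_bar = np.mean(defect_counts)
        ucl = c_bar + 3 * np.sqrt(c_bar)
        lcl = max(0.0, c_bar - 3 * np.sqrt(c_bar))
        return lcl, c_bar, ucl

    def p_chart_limits(defectives, sample_size):
        """P chart: proportion defective; binomial-based 3-sigma limits."""
        p_bar = np.sum(defectives) / (len(defectives) * sample_size)
        half_width = 3 * np.sqrt(p_bar * (1 - p_bar) / sample_size)
        return max(0.0, p_bar - half_width), p_bar, p_bar + half_width

    print(c_chart_limits([3, 5, 2, 4, 6, 1, 3]))
    print(p_chart_limits([4, 7, 5, 3, 6], sample_size=100))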

Short Run Charts

The short run control chart, or control chart for short production runs, plots observations of variables or attributes for multiple parts on the same chart. Short run control charts were developed to address the requirement that several dozen measurements of a process must be collected before control limits are calculated. Meeting this requirement is often difficult for operations that produce a limited number of a particular part during a production run.

For example, a paper mill may produce only three or four (huge) rolls of a particular kind of paper (i.e., part) and then shift production to another kind of paper. But if variables, such as paper thickness, or attributes, such as blemishes, are monitored for several dozen rolls of paper of, say, a dozen different kinds, control limits for thickness and blemishes could be calculated for the transformed (within the short production run) variable values of interest. Specifically, these transformations will rescale the variable values of interest such that they are of compatible magnitudes across the different short production runs (or parts). The control limits computed for those transformed values could then be applied in monitoring thickness, and blemishes, regardless of the types of paper (parts) being produced. Statistical process control procedures could be used to determine if the production process is in control, to monitor continuing production, and to establish procedures for continuous quality improvement.

For additional discussions of short run charts refer to Bothe (1988), Johnson (1987), or Montgomery (1991).

Short Run Charts for Variables

Nominal chart, target chart. There are several different types of short run charts. The most basic are the nominal short run chart, and the target short run chart. In these charts, the measurements for each part are transformed by subtracting a part-specific constant. These constants can either be the nominal values for the respective parts (nominal short run chart), or they can be target values computed from the (historical) means for each part (Target X-bar and R chart). For example, the diameters of piston bores for different engine blocks produced in a factory can only be meaningfully compared (for determining the consistency of bore sizes) if the mean differences between bore diameters for different sized engines are first removed. The nominal or target short run chart makes such comparisons possible. Note that for the nominal or target chart it is assumed that the variability across parts is identical, so that control limits based on a common estimate of the process sigma are applicable.

Standardized short run chart. If the variability of the process for different parts cannot be assumed to be identical, then a further transformation is necessary before the sample means for different parts can be plotted in the same chart. Specifically, in the standardized short run chart the plot points are further transformed by dividing the deviations of sample means from part means (or nominal or target values for parts) by part-specific constants that are proportional to the variability for the respective parts. For example, for the short run X-bar and R chart, the plot points (that are shown in the X-bar chart) are computed by first subtracting from each sample mean a part specific constant (e.g., the respective part mean, or nominal value for the respective part), and then dividing the difference by another constant, for example, by the average range for the respective chart. These transformations will result in comparable scales for the sample means for different parts.
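
A minimal sketch of the two transformations just described: subtracting a part-specific nominal or target value, and, for the standardized chart, additionally dividing by a part-specific measure of variability such as the average range. All data and part names are hypothetical.

    import numpy as np

    def nominal_transform(sample_means, nominal):
        """Nominal/target short run chart: plot points are deviations from the part's nominal."""
        return np.asarray(sample_means) - nominal

    def standardized_transform(sample_means, nominal, avg_range):
        """Standardized short run chart: deviations rescaled by part-specific variability."""
        return (np.asarray(sample_means) - nominal) / avg_range

    # two parts with different nominal bore diameters and different variability
    part_a = standardized_transform([50.02, 49.98, 50.05], nominal=50.0, avg_range=0.06)
    part_b = standardized_transform([80.11, 79.94, 80.02], nominal=80.0, avg_range=0.15)
    print(np.round(np.concatenate([part_a, part_b]), 2))   # comparable scale, one chart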

Short Run Charts for Attributes

For attribute control charts (C, U, Np, or P charts), the estimate of the variability of the process (proportion, rate, etc.) is a function of the process average (average proportion, rate, etc.; for example, the standard deviation of a proportion p is equal to the square root of p*(1-p)/n). Hence, only standardized short run charts are available for attributes. For example, in the short run P chart, the plot points are computed by first subtracting from the respective sample p values the average part p's, and then dividing by the standard deviation of the average p's.



Unequal Sample Sizes

When the samples plotted in the control chart are not of equal size, then the control limits around the center line (target specification) cannot be represented by a straight line. For example, to return to the formula Sigma/Square Root(n) presented earlier for computing control limits for the X-bar chart, it is obvious that unequal n's will lead to different control limits for different sample sizes. There are three ways of dealing with this situation.

Average sample size. If one wants to maintain the straight-line control limits (e.g., to make the chart easier to read and easier to use in presentations), then one can compute the average n per sample across all samples, and establish the control limits based on the average sample size. This procedure is not "exact"; however, as long as the sample sizes are reasonably similar to each other, it is quite adequate.

Variable control limits. Alternatively, one may compute different control limits for each sample, based on the respective sample sizes. This procedure will lead to variable control limits, and result in step-chart like control lines in the plot. This procedure ensures that the correct control limits are computed for each sample. However, one loses the simplicity of straight-line control limits.

Stabilized (normalized) chart. The best of both worlds (straight-line control limits that are accurate) can be accomplished by standardizing the quantity to be controlled (mean, proportion, etc.) in units of sigma. The control limits can then be expressed as straight lines, while the location of the sample points in the plot depends not only on the characteristic to be controlled, but also on the respective sample n's. The disadvantage of this procedure is that the values on the vertical (Y) axis of the control chart are expressed in units of sigma rather than the original units of measurement, and therefore those numbers cannot be taken at face value (e.g., a sample with a value of 3 is 3 times sigma away from specification; in order to express the value of this sample in terms of the original units of measurement, we need to perform some computations to convert this number back).
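
For example, for a P chart with unequal sample sizes, the stabilized plot point for each sample can be computed as a z-score, so that the control limits become the constant lines -3 and +3. The sketch below assumes the binomial model described above; the data are illustrative.

    import numpy as np

    def stabilized_p_points(defectives, sample_sizes):
        """z-scores for a stabilized P chart: constant control limits at -3 and +3."""
        defectives = np.asarray(defectives, dtype=float)
        n = np.asarray(sample_sizes, dtype=float)
        p_bar = defectives.sum() / n.sum()                 # overall proportion defective
        p_i = defectives / n
        return (p_i - p_bar) / np.sqrt(p_bar * (1 - p_bar) / n)

    z = stabilized_p_points([4, 9, 3, 12], [80, 150, 60, 200])
    print(np.round(z, 2))   # any value beyond +/-3 would signal an out-of-control sample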

Reliability and Item Analysis
http://statsoft.com/textbook/stathome.html

General Introduction
Basic Ideas
Classical Testing Model
Reliability
Sum Scales
Cronbach's Alpha
Split-Half Reliability
Correction for Attenuation
Designing a Reliable Scale

--------------------------------------------------------------------------------
This chapter discusses the concept of reliability of measurement as used in social sciences (but not in industrial statistics or biomedical research). The term reliability used in industrial statistics denotes a function describing the probability of failure (as a function of time). For a discussion of the concept of reliability as applied to product quality (e.g., in industrial statistics), please refer to the section on Reliability/Failure Time Analysis in the Process Analysis chapter (see also the section Repeatability and Reproducibility in the same chapter and the chapter Survival/Failure Time Analysis). For a comparison between these two (very different) concepts of reliability, see Reliability.
--------------------------------------------------------------------------------
General Introduction
In many areas of research, the precise measurement of hypothesized processes or variables (theoretical constructs) poses a challenge by itself. For example, in psychology, the precise measurement of personality variables or attitudes is usually a necessary first step before any theories of personality or attitudes can be considered. In general, in all social sciences, unreliable measurements of people's beliefs or intentions will obviously hamper efforts to predict their behavior. The issue of precision of measurement will also come up in applied research, whenever variables are difficult to observe. For example, reliable measurement of employee performance is usually a difficult task; yet, it is obviously a necessary precursor to any performance-based compensation system.

In all of these cases, Reliability & Item Analysis may be used to construct reliable measurement scales, to improve existing scales, and to evaluate the reliability of scales already in use. Specifically, Reliability & Item Analysis will aid in the design and evaluation of sum scales, that is, scales that are made up of multiple individual measurements (e.g., different items, repeated measurements, different measurement devices, etc.). You can compute numerous statistics that allow you to build and evaluate scales following the so-called classical testing theory model.

The assessment of scale reliability is based on the correlations between the individual items or measurements that make up the scale, relative to the variances of the items. If you are not familiar with the correlation coefficient or the variance statistic, we recommend that you review the respective discussions provided in the Basic Statistics section.

The classical testing theory model of scale construction has a long history, and there are many textbooks available on the subject. For additional detailed discussions, you may refer to, for example, Carmines and Zeller (1980), De Gruitjer and Van Der Kamp (1976), Kline (1979, 1986), or Thorndyke and Hagen (1977). A widely acclaimed "classic" in this area, with an emphasis on psychological and educational testing, is Nunally (1970).

Testing hypotheses about relationships between items and tests. Using Structural Equation Modeling and Path Analysis (SEPATH), you can test specific hypotheses about the relationship between sets of items or different tests (e.g., test whether two sets of items measure the same construct, analyze multi-trait, multi-method matrices, etc.).



Basic Ideas

Suppose we want to construct a questionnaire to measure people's prejudices against foreign-made cars. We could start out by generating a number of items such as: "Foreign cars lack personality," "Foreign cars all look the same," etc. We could then submit those questionnaire items to a group of subjects (for example, people who have never owned a foreign-made car). We could ask subjects to indicate their agreement with these statements on 9-point scales, anchored at 1=disagree and 9=agree.

True scores and error. Let us now consider more closely what we mean by precise measurement in this case. We hypothesize that there is such a thing (theoretical construct) as "prejudice against foreign cars," and that each item "taps" into this concept to some extent. Therefore, we may say that a subject's response to a particular item reflects two aspects: first, the response reflects the prejudice against foreign cars, and second, it will reflect some esoteric aspect of the respective question. For example, consider the item "Foreign cars all look the same." A subject's agreement or disagreement with that statement will partially depend on his or her general prejudices, and partially on some other aspects of the question or person. For example, the subject may have a friend who just bought a very different looking foreign car.

Testing hypotheses about relationships between items and tests. To test specific hypotheses about the relationship between sets of items or different tests (e.g., whether two sets of items measure the same construct, analyze multi-trait, multi-method matrices, etc.) use Structural Equation Modeling (SEPATH).



Classical Testing Model

To summarize, each measurement (response to an item) reflects to some extent the true score for the intended concept (prejudice against foreign cars), and to some extent esoteric, random error. We can express this in an equation as:
X = tau + error
In this equation, X refers to the respective actual measurement, that is, a subject's response to a particular item; tau is commonly used to refer to the true score, and error refers to the random error component in the measurement.
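
A small simulation can make the X = tau + error decomposition concrete: when the error is independent of the true score, the observed-score variance equals the true-score variance plus the error variance, and the ratio var(tau)/var(X) is the (theoretical) reliability of the measurement in the classical model. The numbers below are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(42)
    n_subjects = 10_000

    tau = rng.normal(5.0, 1.5, n_subjects)       # true prejudice scores
    error = rng.normal(0.0, 1.0, n_subjects)     # item-specific random error
    x = tau + error                              # observed item responses

    # the two quantities below should be approximately equal
    print(round(np.var(tau) + np.var(error), 2), round(np.var(x), 2))
    # reliability of this single item, close to the theoretical 2.25/3.25 (about 0.69)
    print(round(np.var(tau) / np.var(x), 2))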

Structural Equation Modeling
http://statsoft.com/textbook/stathome.html

A Conceptual Overview
The Basic Idea Behind Structural Modeling
Structural Equation Modeling and the Path Diagram

--------------------------------------------------------------------------------
A Conceptual Overview
Structural Equation Modeling is a very general, very powerful multivariate analysis technique that includes specialized versions of a number of other analysis methods as special cases. We will assume that you are familiar with the basic logic of statistical reasoning as described in Elementary Concepts. Moreover, we will also assume that you are familiar with the concepts of variance, covariance, and correlation; if not, we advise that you read the Basic Statistics section at this point. Although it is not absolutely necessary, it is highly desirable that you have some background in factor analysis before attempting to use structural modeling.

Major applications of structural equation modeling include:

causal modeling, or path analysis, which hypothesizes causal relationships among variables and tests the causal models with a linear equation system. Causal models can involve either manifest variables, latent variables, or both;
confirmatory factor analysis, an extension of factor analysis in which specific hypotheses about the structure of the factor loadings and intercorrelations are tested;
second order factor analysis, a variation of factor analysis in which the correlation matrix of the common factors is itself factor analyzed to provide second order factors;
regression models, an extension of linear regression analysis in which regression weights may be constrained to be equal to each other, or to specified numerical values;
covariance structure models, which hypothesize that a covariance matrix has a particular form. For example, you can test the hypothesis that a set of variables all have equal variances with this procedure;
correlation structure models, which hypothesize that a correlation matrix has a particular form. A classic example is the hypothesis that the correlation matrix has the structure of a circumplex (Guttman, 1954; Wiggins, Steiger, & Gaelick, 1981).
Many different kinds of models fall into each of the above categories, so structural modeling as an enterprise is very difficult to characterize.
Most structural equation models can be expressed as path diagrams. Consequently, even beginners to structural modeling can perform complicated analyses with a minimum of training.



The Basic Idea Behind Structural Modeling

One of the fundamental ideas taught in intermediate applied statistics courses is the effect of additive and multiplicative transformations on a list of numbers. Students are taught that, if you multiply every number in a list by some constant K, you multiply the mean of the numbers by K. Similarly, you multiply the standard deviation by the absolute value of K.

For example, suppose you have the list of numbers 1, 2, 3. These numbers have a mean of 2 and a standard deviation of 1. Now, suppose you were to take these 3 numbers and multiply them by 4. Then the mean would become 8, the standard deviation would become 4, and the variance would thus become 16.

The point is, if you have a set of numbers X related to another set of numbers Y by the equation Y = 4X, then the variance of Y must be 16 times that of X, so you can test the hypothesis that Y and X are related by the equation Y = 4X indirectly by comparing the variances of the Y and X variables.
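
This can be checked directly; the following snippet is simply a numerical illustration of the claim, using the list of numbers from the example above.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = 4 * x
    print(np.var(x, ddof=1), np.var(y, ddof=1))   # 1.0 and 16.0: var(4X) = 16 * var(X)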

This idea generalizes, in various ways, to several variables inter-related by a group of linear equations. The rules become more complex, the calculations more difficult, but the basic message remains the same -- you can test whether variables are interrelated through a set of linear relationships by examining the variances and covariances of the variables.

Statisticians have developed procedures for testing whether a set of variances and covariances in a covariance matrix fits a specified structure. The way structural modeling works is as follows:

You state the way that you believe the variables are inter-related, often with the use of a path diagram.
You work out, via some complex internal rules, what the implications of this are for the variances and covariances of the variables.
You test whether the variances and covariances fit this model of them.
Results of the statistical testing, as well as parameter estimates and standard errors for the numerical coefficients in the linear equations, are reported.
On the basis of this information, you decide whether the model seems like a good fit to your data.

Survival/Failure Time Analysis
http://statsoft.com/textbook/stathome.html

General Information
Censored Observations
Analytic Techniques
Life Table Analysis
Number of Cases at Risk
Proportion Failing
Proportion surviving
Cumulative Proportion Surviving (Survival Function)
Probability Density
Hazard rate
Median survival time
Required sample sizes
Distribution Fitting
General Introduction
Estimation
Goodness-of-fit
Plots
Kaplan-Meier Product-Limit Estimator
Comparing Samples
General Introduction
Available tests
Choosing a two-sample test
Multiple sample test
Unequal proportions of censored data
Regression Models
General Introduction
Cox's Proportional Hazard Model
Cox's Proportional Hazard Model with Time-Dependent Covariates
Exponential Regression
Normal and Log-Normal Regression
Stratified Analyses

--------------------------------------------------------------------------------
General Information
These techniques were primarily developed in the medical and biological sciences, but they are also widely used in the social and economic sciences, as well as in engineering (reliability and failure time analysis).

Imagine that you are a researcher in a hospital who is studying the effectiveness of a new treatment for a generally terminal disease. The major variable of interest is the number of days that the respective patients survive. In principle, one could use the standard parametric and nonparametric statistics for describing the average survival, and for comparing the new treatment with traditional methods (see Basic Statistics and Nonparametrics and Distribution Fitting). However, at the end of the study there will be patients who survived over the entire study period, in particular among those patients who entered the hospital (and the research project) late in the study; there will be other patients with whom we will have lost contact. Surely, one would not want to exclude all of those patients from the study by declaring them to be missing data (since most of them are "survivors" and, therefore, they reflect on the success of the new treatment method). Those observations, which contain only partial information, are called censored observations (e.g., "patient A survived at least 4 months before he moved away and we lost contact;" the term censoring was first used by Hald, 1949).



Censored Observations

In general, censored observations arise whenever the dependent variable of interest represents the time to a terminal event, and the duration of the study is limited in time. Censored observations may occur in a number of different areas of research. For example, in the social sciences we may study the "survival" of marriages, high school drop-out rates (time to drop-out), turnover in organizations, etc. In each case, by the end of the study period, some subjects will still be married, will not have dropped out, or are still working at the same company; thus, those subjects represent censored observations.

In economics we may study the "survival" of new businesses or the "survival" times of products such as automobiles. In quality control research, it is common practice to study the "survival" of parts under stress (failure time analysis).



Analytic Techniques

Essentially, the methods offered in Survival Analysis address the same research questions as many of the other procedures; however, all methods in Survival Analysis will handle censored data. The life table, survival distribution, and Kaplan-Meier survival function estimation are all descriptive methods for estimating the distribution of survival times from a sample. Several techniques are available for comparing the survival in two or more groups. Finally, Survival Analysis offers several regression models for estimating the relationship of (multiple) continuous variables to survival times.



Life Table Analysis

The most straightforward way to describe the survival in a sample is to compute the Life Table. The life table technique is one of the oldest methods for analyzing survival (failure time) data (e.g., see Berkson & Gage, 1950; Cutler & Ederer, 1958; Gehan, 1969). This table can be thought of as an "enhanced" frequency distribution table. The distribution of survival times is divided into a certain number of intervals. For each interval we can then compute the number and proportion of cases or objects that entered the respective interval "alive," the number and proportion of cases that failed in the respective interval (i.e., number of terminal events, or number of cases that "died"), and the number of cases that were lost or censored in the respective interval.

Based on those numbers and proportions, several additional statistics can be computed:

Number of Cases at Risk
Proportion Failing
Proportion surviving
Cumulative Proportion Surviving (Survival Function)
Probability Density
Hazard rate
Median survival time
Required sample sizes
Number of Cases at Risk. This is the number of cases that entered the respective interval alive, minus half of the number of cases lost or censored in the respective interval.

Proportion Failing. This proportion is computed as the ratio of the number of cases failing in the respective interval, divided by the number of cases at risk in the interval.

Proportion Surviving. This proportion is computed as 1 minus the proportion failing.

Cumulative Proportion Surviving (Survival Function). This is the cumulative proportion of cases surviving up to the respective interval. Since the probabilities of survival are assumed to be independent across the intervals, this probability is computed by multiplying out the probabilities of survival across all previous intervals. The resulting function is also called the survivorship or survival function.

Probability Density. This is the estimated probability of failure in the respective interval, computed per unit of time, that is:

Fi = (Pi - Pi+1) / hi

In this formula, Fi is the respective probability density in the i'th interval, Pi is the estimated cumulative proportion surviving at the beginning of the i'th interval (at the end of interval i-1), Pi+1 is the cumulative proportion surviving at the end of the i'th interval, and hi is the width of the respective interval.

Hazard Rate. The hazard rate (the term was first used by Barlow, 1963) is defined as the probability per time unit that a case that has survived to the beginning of the respective interval will fail in that interval. Specifically, it is computed as the number of failures per time unit in the respective interval, divided by the average number of surviving cases at the mid-point of the interval.

Median Survival Time. This is the survival time at which the cumulative survival function is equal to 0.5. Other percentiles (25th and 75th percentile) of the cumulative survival function can be computed accordingly. Note that the 50th percentile (median) for the cumulative survival function is usually not the same as the point in time up to which 50% of the sample survived. (This would only be the case if there were no censored observations prior to this time).

Required Sample Sizes. In order to arrive at reliable estimates of the three major functions (survival, probability density, and hazard) and their standard errors at each time interval, the minimum recommended sample size is 30.
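
Putting the preceding definitions together, a life table can be computed from the number of cases entering each interval, the number failing, and the number censored. The sketch below follows the formulas given above (at-risk count, proportion failing and surviving, cumulative survival, probability density, and hazard rate); the interval data are hypothetical.

    import numpy as np

    def life_table(entered, failed, censored, width):
        """Actuarial life table columns for equal-width intervals."""
        entered = np.asarray(entered, float)
        failed = np.asarray(failed, float)
        censored = np.asarray(censored, float)

        at_risk = entered - censored / 2                   # number of cases at risk
        q = failed / at_risk                               # proportion failing
        p = 1 - q                                          # proportion surviving
        surv = np.concatenate(([1.0], np.cumprod(p)))      # P_i at the start of each interval
        density = (surv[:-1] - surv[1:]) / width           # F_i = (P_i - P_{i+1}) / h_i
        hazard = failed / (width * (at_risk - failed / 2)) # failures per time unit, mid-interval
        return at_risk, q, p, surv[:-1], density, hazard

    cols = life_table(entered=[100, 75, 45, 20], failed=[15, 20, 15, 10],
                      censored=[10, 10, 10, 10], width=1.0)
    for col in cols:
        print(np.round(col, 3))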



Distribution Fitting

General Introduction
Estimation
Goodness-of-fit
Plots

Text Mining
http://statsoft.com/textbook/stathome.html

Introductory Overview
Some Typical Applications for Text Mining
Approaches to Text Mining
Issues and Considerations for "Numericizing" Text
Transforming Word Frequencies
Latent Semantic Indexing via Singular Value Decomposition
Incorporating Text Mining Results in Data Mining Projects

--------------------------------------------------------------------------------
Text Mining Introductory Overview


The purpose of Text Mining is to process unstructured (textual) information, extract meaningful numeric indices from the text, and, thus, make the information contained in the text accessible to the various data mining (statistical and machine learning) algorithms. Information can be extracted to derive summaries for the words contained in the documents or to compute summaries for the documents based on the words contained in them. Hence, you can analyze words, clusters of words used in documents, etc., or you could analyze documents and determine similarities between them or how they are related to other variables of interest in the data mining project. In the most general terms, text mining will "turn text into numbers" (meaningful indices), which can then be incorporated in other analyses such as predictive data mining projects, the application of unsupervised learning methods (clustering), etc. These methods are described and discussed in great detail in the comprehensive overview work by Manning and Schütze (2002), and for an in-depth treatment of these and related topics as well as the history of this approach to text mining, we highly recommend that source.

Some Typical Applications for Text Mining

Unstructured text is very common, and in fact may represent the majority of information available to a particular research or data mining project.

Analyzing open-ended survey responses. In survey research (e.g., marketing), it is not uncommon to include various open-ended questions pertaining to the topic under investigation. The idea is to permit respondents to express their "views" or opinions without constraining them to particular dimensions or a particular response format. This may yield insights into customers' views and opinions that might otherwise not be discovered when relying solely on structured questionnaires designed by "experts." For example, you may discover a certain set of words or terms that are commonly used by respondents to describe the pros and cons of a product or service (under investigation), suggesting common misconceptions or confusion regarding the items in the study.

Automatic processing of messages, emails, etc. Another common application for text mining is to aid in the automatic classification of texts. For example, it is possible to "filter" out automatically most undesirable "junk email" based on certain terms or words that are not likely to appear in legitimate messages, but instead identify undesirable electronic mail. In this manner, such messages can automatically be discarded. Such automatic systems for classifying electronic messages can also be useful in applications where messages need to be routed (automatically) to the most appropriate department or agency; e.g., email messages with complaints or petitions to a municipal authority are automatically routed to the appropriate departments; at the same time, the emails are screened for inappropriate or obscene messages, which are automatically returned to the sender with a request to remove the offending words or content.

Analyzing warranty or insurance claims, diagnostic interviews, etc. In some business domains, the majority of information is collected in open-ended, textual form. For example, warranty claims or initial medical (patient) interviews can be summarized in brief narratives, or when you take your automobile to a service station for repairs, typically, the attendant will write some notes about the problems that you report and what you believe needs to be fixed. Increasingly, those notes are collected electronically, so those types of narratives are readily available for input into text mining algorithms. This information can then be usefully exploited to, for example, identify common clusters of problems and complaints on certain automobiles, etc. Likewise, in the medical field, open-ended descriptions by patients of their own symptoms might yield useful clues for the actual medical diagnosis.

Investigating competitors by crawling their web sites. Another type of potentially very useful application is to automatically process the contents of Web pages in a particular domain. For example, you could go to a Web page, and begin "crawling" the links you find there to process all Web pages that are referenced. In this manner, you could automatically derive a list of terms and documents available at that site, and hence quickly determine the most important terms and features that are described. It is easy to see how these capabilities could efficiently deliver valuable business intelligence about the activities of competitors.

Approaches to Text Mining

To reiterate, text mining can be summarized as a process of "numericizing" text. At the simplest level, all words found in the input documents will be indexed and counted in order to compute a table of documents and words, i.e., a matrix of frequencies that enumerates the number of times that each word occurs in each document. This basic process can be further refined to exclude certain common words such as "the" and "a" (stop word lists) and to combine different grammatical forms of the same words such as "traveling," "traveled," "travel," etc. (stemming). However, once a table of (unique) words (terms) by documents has been derived, all standard statistical and data mining techniques can be applied to derive dimensions or clusters of words or documents, or to identify "important" words or terms that best predict another outcome variable of interest.
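
As a bare-bones illustration of this "numericizing" step, the following sketch counts word frequencies per document after removing a few stop words; a real application would add stemming, larger stop-word lists, phrase handling, etc. The documents and word lists are made up.

    import re
    from collections import Counter

    docs = [
        "The car would not start and the dealer could not fix the car",
        "Great car, great dealer, no problems so far",
        "Dealer fixed the brakes but the car still pulls to the left",
    ]
    stop_words = {"the", "a", "and", "to", "not", "no", "but", "so", "would", "could"}

    def term_counts(text):
        words = re.findall(r"[a-z]+", text.lower())        # crude tokenization
        return Counter(w for w in words if w not in stop_words)

    counts = [term_counts(d) for d in docs]
    vocabulary = sorted(set().union(*counts))               # unique terms across all documents
    matrix = [[c[w] for w in vocabulary] for c in counts]   # documents-by-words frequency table
    print(vocabulary)
    for row in matrix:
        print(row)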

Using well-tested methods and understanding the results of text mining. Once a data matrix has been computed from the input documents and words found in those documents, various well-known analytic techniques can be used for further processing those data, including methods for clustering, factoring, or predictive data mining (see, for example, Manning and Schütze, 2002).

"Black-box" approaches to text mining and extraction of concepts. There are text mining applications which offer "black-box" methods to extract "deep meaning" from documents with little human effort (to first read and understand those documents). These text mining applications rely on proprietary algorithms for presumably extracting "concepts" from text, and may even claim to be able to summarize large numbers of text documents automatically, retaining the core and most important meaning of those documents. While there are numerous algorithmic approaches to extracting "meaning from documents," this type of technology is very much still in its infancy, and the aspiration to provide meaningful automated summaries of large numbers of documents may forever remain elusive. We urge skepticism when using such algorithms because 1) if it is not clear to the user how those algorithms work, it cannot possibly be clear how to interpret the results of those algorithms, and 2) the methods used in those programs are not open to scrutiny, for example by the academic community and peer review and, hence, one simply doesn''t know how well they might perform in different domains. As a final thought on this subject, you may consider this concrete example: Try the various automated translation services available via the Web that can translate entire paragraphs of text from one language into another. Then translate some text, even simple text, from your native language to some other language and back, and review the results. Almost every time, the attempt to translate even short sentences to other languages and back while retaining the original meaning of the sentence produces humorous rather than accurate results. This illustrates the difficulty of automatically interpreting the meaning of text.

Text mining as document search. There is another type of application that is often described and referred to as "text mining" - the automatic search of large numbers of documents based on key words or key phrases. This is the domain of, for example, the popular internet search engines that have been developed over the last decade to provide efficient access to Web pages with certain content. While this is obviously an important type of application with many uses in any organization that needs to search very large document repositories based on varying criteria, it is very different from what has been described here.

Issues and Considerations for "Numericizing" Text

Large numbers of small documents vs. small numbers of large documents. Examples of scenarios using large numbers of small or moderate sized documents were given earlier (e.g., analyzing warranty or insurance claims, diagnostic interviews, etc.). On the other hand, if your intent is to extract "concepts" from only a few documents that are very large (e.g., two lengthy books), then statistical analyses are generally less powerful because the "number of cases" (documents) in this case is very small while the "number of variables" (extracted words) is very large.

Excluding certain characters, short words, numbers, etc. Excluding numbers, certain characters, or sequences of characters, or words that are shorter or longer than a certain number of letters can be done before the indexing of the input documents starts. You may also want to exclude "rare words," defined as those that only occur in a small percentage of the processed documents.

Include lists, exclude lists (stop-words). A specific list of words to be indexed can be defined; this is useful when you want to search explicitly for particular words, and classify the input documents based on the frequencies with which those words occur. Also, "stop-words," i.e., terms that are to be excluded from the indexing, can be defined. Typically, a default list of English stop words includes "the", "a", "of", "since," etc., i.e., words that are used in the respective language very frequently, but communicate very little unique information about the contents of the document.

Synonyms and phrases. Synonyms, such as "sick" or "ill", or words that are used in particular phrases where they denote unique meaning can be combined for indexing. For example, "Microsoft Windows" might be such a phrase, which is a specific reference to the computer operating system, but has nothing to do with the common use of the term "Windows" as it might, for example, be used in descriptions of home improvement projects.

Stemming algorithms. An important pre-processing step before indexing of input documents begins is the stemming of words. The term "stemming" refers to the reduction of words to their roots so that, for example, different grammatical forms or declinations of verbs are identified and indexed (counted) as the same word. For example, stemming will ensure that both "traveling" and "traveled" will be recognized by the text mining program as the same word.

Support for different languages. Stemming, synonyms, the letters that are permitted in words, etc. are highly language dependent operations. Therefore, support for different languages is important.

Transforming Word Frequencies

Once the input documents have been indexed and the initial word frequencies (by document) computed, a number of additional transformations can be performed to summarize and aggregate the information that was extracted.

Log-frequencies. First, various transformations of the frequency counts can be performed. The raw word or term frequencies generally reflect how salient or important a word is in each document. Specifically, words that occur with greater frequency in a document are better descriptors of the contents of that document. However, it is not reasonable to assume that the word counts themselves are proportional to their importance as descriptors of the documents. For example, if a word occurs 1 time in document A, but 3 times in document B, then it is not necessarily reasonable to conclude that this word is 3 times as important a descriptor of document B as compared to document A. Thus, a common transformation of the raw word frequency counts (wf) is to compute:

f(wf) = 1 + log(wf), for wf > 0

This transformation will "dampen" the raw frequencies and how they will affect the results of subsequent computations.

Binary frequencies. Likewise, an even simpler transformation can be used that enumerates whether a term is used in a document; i.e.:

f(wf) = 1, for wf > 0

The resulting documents-by-words matrix will contain only 1s and 0s to indicate the presence or absence of the respective words. Again, this transformation will dampen the effect of the raw frequency counts on subsequent computations and analyses.

Inverse document frequencies. Another issue that you may want to consider more carefully, and reflect in the indices used in further analyses, is the relative document frequency (df) of different words. For example, a term such as "guess" may occur frequently in all documents, while another term such as "software" may only occur in a few. The reason is that one might make "guesses" in various contexts, regardless of the specific topic, while "software" is a more semantically focused term that is only likely to occur in documents that deal with computer software. A common and very useful transformation that reflects both the specificity of words (document frequencies) as well as the overall frequencies of their occurrences (word frequencies) is the so-called inverse document frequency (for the i'th word and j'th document):

idf(i,j) = 0, for wf(i,j) = 0
idf(i,j) = (1 + log(wf(i,j))) * log(N/df(i)), for wf(i,j) >= 1

In this formula, N is the total number of documents, wf(i,j) is the raw frequency of the i'th word in the j'th document, and df(i) is the number of documents in which the i'th word occurs. This transformation dampens the raw word frequencies (via the logarithm) and at the same time weights each word by its specificity: words that occur in only a few documents receive larger weights.

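A compact sketch of the three transformations discussed in this section (log-frequencies, binary frequencies, and inverse document frequencies), applied to a small documents-by-words count matrix with made-up numbers, is shown below.

    import numpy as np

    # documents-by-words matrix of raw frequency counts (made-up numbers)
    wf = np.array([[2.0, 0.0, 5.0],
                   [1.0, 3.0, 0.0],
                   [4.0, 1.0, 1.0]])
    n_docs = wf.shape[0]
    df = (wf > 0).sum(axis=0)                     # document frequency of each word

    with np.errstate(divide="ignore"):            # log(0) entries are masked out below
        log_wf = np.log(wf)
    log_freq = np.where(wf > 0, 1 + log_wf, 0.0)  # f(wf) = 1 + log(wf) for wf > 0
    binary = (wf > 0).astype(int)                 # presence/absence of each word
    idf = np.where(wf > 0, (1 + log_wf) * np.log(n_docs / df), 0.0)

    print(np.round(log_freq, 2))
    print(binary)
    print(np.round(idf, 2))
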
Time Series Analysis
http://statsoft.com/textbook/stathome.html
http://www.statsoft.com/TEXTBOOK/sttimser.html
General Introduction
Two Main Goals
Identifying Patterns in Time Series Data
Systematic pattern and random noise
Two general aspects of time series patterns
Trend Analysis
Analysis of Seasonality
ARIMA (Box & Jenkins) and Autocorrelations
General Introduction
Two Common Processes
ARIMA Methodology
Identification Phase
Parameter Estimation
Evaluation of the Model
Interrupted Time Series
Exponential Smoothing
General Introduction
Simple Exponential Smoothing
Choosing the Best Value for Parameter a (alpha)
Indices of Lack of Fit (Error)
Seasonal and Non-seasonal Models With or Without Trend
Seasonal Decomposition (Census I)
General Introduction
Computations
X-11 Census method II seasonal adjustment
Seasonal Adjustment: Basic Ideas and Terms
The Census II Method
Results Tables Computed by the X-11 Method
Specific Description of all Results Tables Computed by the X-11 Method
Distributed Lags Analysis
General Purpose
General Model
Almon Distributed Lag
Single Spectrum (Fourier) Analysis
Cross-spectrum Analysis
General Introduction
Basic Notation and Principles
Results for Each Variable
The Cross-periodogram, Cross-density, Quadrature-density, and Cross-amplitude
Squared Coherency, Gain, and Phase Shift
How the Example Data were Created
Spectrum Analysis - Basic Notations and Principles
Frequency and Period
The General Structural Model
A Simple Example
Periodogram
The Problem of Leakage
Padding the Time Series
Tapering
Data Windows and Spectral Density Estimates
Preparing the Data for Analysis
Results when no Periodicity in the Series Exists
Fast Fourier Transformations
General Introduction
Computation of FFT in Time Series

--------------------------------------------------------------------------------
In the following topics, we will first review techniques used to identify patterns in time series data (such as smoothing and curve fitting techniques and autocorrelations), then we will introduce a general class of models that can be used to represent time series data and generate predictions (autoregressive and moving average models). Finally, we will review some simple but commonly used modeling and forecasting techniques based on linear regression. For more information on these topics, see the respective topics listed below.


General Introduction

In the following topics, we will review techniques that are useful for analyzing time series data, that is, sequences of measurements that follow non-random orders. Unlike the analyses of random samples of observations that are discussed in the context of most other statistics, the analysis of time series is based on the assumption that successive values in the data file represent consecutive measurements taken at equally spaced time intervals.

Detailed discussions of the methods described in this section can be found in Anderson (1976), Box and Jenkins (1976), Kendall (1984), Kendall and Ord (1990), Montgomery, Johnson, and Gardiner (1990), Pankratz (1983), Shumway (1988), Vandaele (1983), Walker (1991), and Wei (1989).



Two Main Goals

There are two main goals of time series analysis: (a) identifying the nature of the phenomenon represented by the sequence of observations, and (b) forecasting (predicting future values of the time series variable). Both of these goals require that the pattern of observed time series data is identified and more or less formally described. Once the pattern is established, we can interpret and integrate it with other data (i.e., use it in our theory of the investigated phenomenon, e.g., seasonal commodity prices). Regardless of the depth of our understanding and the validity of our interpretation (theory) of the phenomenon, we can extrapolate the identified pattern to predict future events.




--------------------------------------------------------------------------------
Identifying Patterns in Time Series Data
Systematic pattern and random noise
Two general aspects of time series patterns
Trend Analysis
Analysis of Seasonality
For more information on simple autocorrelations (introduced in this section) and other autocorrelations, see Anderson (1976), Box and Jenkins (1976), Kendall (1984), Pankratz (1983), and Vandaele (1983). See also:
ARIMA (Box & Jenkins) and Autocorrelations
Interrupted Time Series
Exponential Smoothing
Seasonal Decomposition (Census I)
X-11 Census method II seasonal adjustment
X-11 Census method II result tables
Distributed Lags Analysis
Single Spectrum (Fourier) Analysis
Cross-spectrum Analysis
Basic Notations and Principles
Fast Fourier Transformations
Systematic Pattern and Random Noise

As in most other analyses, in time series analysis it is assumed that the data consist of a systematic pattern (usually a set of identifiable components) and random noise (error), which usually makes the pattern difficult to identify. Most time series analysis techniques involve some form of filtering out noise in order to make the pattern more salient.
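
For instance, one of the simplest such filters is a moving average, which smooths out short-term noise so that the systematic pattern (e.g., a trend) becomes easier to see. The series below is simulated; the window length is an arbitrary choice.

    import numpy as np

    rng = np.random.default_rng(7)
    t = np.arange(120)
    series = 0.05 * t + rng.normal(0, 1.0, t.size)      # trend plus random noise

    window = 12
    smoothed = np.convolve(series, np.ones(window) / window, mode="valid")
    print(np.round(smoothed[:5], 2), np.round(smoothed[-5:], 2))   # noise damped, trend visible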

Variance Components and Mixed Model
http://statsoft.com/textbook/stathome.html

Basic Ideas
Properties of Random Effects
Estimation of Variance Components (Technical Overview)
Estimating the Variation of Random Factors
Estimating Components of Variation
Testing the Significance of Variance Components
Estimating the Population Intraclass Correlation

--------------------------------------------------------------------------------
The Variance Components and Mixed Model ANOVA/ANCOVA chapter describes a comprehensive set of techniques for analyzing research designs that include random effects; however, these techniques are also well suited for analyzing large main effect designs (e.g., designs with over 200 levels per factor), designs with many factors where the higher order interactions are not of interest, and analyses involving case weights.
There are several chapters in this textbook that will discuss Analysis of Variance for factorial or specialized designs. For a discussion of these chapters and the types of designs for which they are best suited refer to the section on Methods for Analysis of Variance. Note, however, that the General Linear Models chapter describes how to analyze designs with any number and type of between effects and compute ANOVA-based variance component estimates for any effect in a mixed-model analysis.


--------------------------------------------------------------------------------
Basic Ideas
Experimentation is sometimes mistakenly thought to involve only the manipulation of levels of the independent variables and the observation of subsequent responses on the dependent variables. Independent variables whose levels are determined or set by the experimenter are said to have fixed effects. There is a second class of effects, however, which is often of great interest to the researcher: random effects are classification effects where the levels of the effects are assumed to be randomly selected from an infinite population of possible levels. Many independent variables of research interest are not fully amenable to experimental manipulation, but nevertheless can be studied by considering them to have random effects. For example, the genetic makeup of individual members of a species cannot at present be (fully) experimentally manipulated, yet it is of great interest to the geneticist to assess the genetic contribution to individual variation on outcomes such as health, behavioral characteristics, and the like. As another example, a manufacturer might want to estimate the components of variation in the characteristics of a product for a random sample of machines operated by a random sample of operators. The statistical analysis of random effects is accomplished by using the random effect model, if all of the independent variables are assumed to have random effects, or by using the mixed model, if some of the independent variables are assumed to have random effects and other independent variables are assumed to have fixed effects.

Properties of random effects. To illustrate some of the properties of random effects, suppose you collected data on the amount of insect damage done to different varieties of wheat. It is impractical to study insect damage for every possible variety of wheat, so to conduct the experiment, you randomly select four varieties of wheat to study. Plant damage is rated for up to a maximum of four plots per variety. Ratings are on a 0 (no damage) to 10 (great damage) scale.
To determine the components of variation in resistance to insect damage for Variety and Plot, an ANOVA can first be performed. Perhaps surprisingly, in the ANOVA, Variety can be treated as a fixed or as a random factor without influencing the results (provided that Type I Sums of squares are used and that Variety is always entered first in the model). The analysis can be run once treating Variety as a fixed effect and once treating it as a random effect, in both cases ignoring Plot, i.e., treating the plot-to-plot variation as a measure of random error, and variance component estimates can then be computed for both analyses.

As can be seen, the difference in the two sets of estimates is that a variance component is estimated for Variety only when it is considered to be a random effect. This reflects the basic distinction between fixed and random effects. The variation in the levels of random factors is assumed to be representative of the variation of the whole population of possible levels. Thus, variation in the levels of a random factor can be used to estimate the population variation. Even more importantly, covariation between the levels of a random factor and responses on a dependent variable can be used to estimate the population component of variance in the dependent variable attributable to the random factor. The variation in the levels of fixed factors is instead considered to be arbitrarily determined by the experimenter (i.e., the experimenter can make the levels of a fixed factor vary as little or as much as desired). Thus, the variation of a fixed factor cannot be used to estimate its population variance, nor can the population covariance with the dependent variable be meaningfully estimated. With this basic distinction between fixed effects and random effects in mind, we can now look more closely at the properties of variance components.
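
To make this distinction concrete, the following sketch computes ANOVA-based variance component estimates for a balanced one-way random effects layout like the wheat example (varieties sampled at random, several plots per variety). The data are simulated, not the ratings referred to in the text; the method-of-moments estimators shown (error variance = MS within, Variety variance = (MS between - MS within)/n) are the standard ones for this design.

    import numpy as np

    rng = np.random.default_rng(3)
    n_varieties, n_plots = 4, 4
    variety_effects = rng.normal(0, 1.5, n_varieties)            # random variety effects
    ratings = 5 + variety_effects[:, None] + rng.normal(0, 1.0, (n_varieties, n_plots))

    group_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    ms_between = n_plots * np.sum((group_means - grand_mean) ** 2) / (n_varieties - 1)
    ms_within = np.sum((ratings - group_means[:, None]) ** 2) / (n_varieties * (n_plots - 1))

    var_error = ms_within                                         # plot-to-plot variance component
    var_variety = max(0.0, (ms_between - ms_within) / n_plots)    # variance component for Variety
    print(round(var_variety, 2), round(var_error, 2))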

STATISTICS GLOSSARY
http://statsoft.com/textbook/glosfra.html
Distribution Tables
http://statsoft.com/textbook/stathome.html

Compared to probability calculators (e.g., the one included in STATISTICA), the traditional format of distribution tables, such as those presented below, has the advantage of showing many values simultaneously and, thus, enables the user to examine and quickly explore ranges of probabilities.


--------------------------------------------------------------------------------
Z Table

t Table

Chi-Square Table

F Tables for:
alpha = .10
alpha = .05
alpha = .025
alpha = .01


Note that all table values were calculated using the distribution facilities in STATISTICA BASIC, and they were verified against other published tables.
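
The same kinds of values can be reproduced with any modern statistical library. For example, a few rows of a standard normal (Z) table and selected critical values might be generated as follows; the sketch uses scipy and its layout is simplified relative to a full printed table.

    from scipy.stats import norm, t, chi2, f

    # cumulative probabilities P(Z <= z) for a few z values, as in a Z table
    for z in (0.0, 1.0, 1.645, 1.96, 2.58):
        print(f"z = {z:5.3f}   P(Z <= z) = {norm.cdf(z):.4f}")

    # selected upper-tail critical values, as in t, Chi-square, and F tables
    print(round(t.ppf(0.975, df=10), 3))          # two-sided alpha = .05, 10 df
    print(round(chi2.ppf(0.95, df=4), 3))         # alpha = .05, 4 df
    print(round(f.ppf(0.95, dfn=3, dfd=20), 3))   # alpha = .05, (3, 20) df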

REFERENCES CITED
http://statsoft.com/textbook/stathome.html
Video-tutorial
http://statsoft.com/support/download/video-tutorials/
Descriptive Statistics and Exploratory Analysis
http://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture2.pdf

What is descriptive statistics and exploratory data analysis?
• Basic numerical summaries of data
• Basic graphical summaries of data
• How to use R for calculating descriptive statistics and making graphs
Before making inferences from data it is essential to examine all your variables.
Why? To listen to the data:
- to catch mistakes
- to see patterns in the data
- to find violations of statistical assumptions
- to generate hypotheses
…and because if you don’t, you will have trouble later

Dimensionality of Data Sets
• Univariate: measurement made on one variable per subject
• Bivariate: measurement made on two variables per subject
• Multivariate: measurement made on many variables per subject

Numerical Summaries of Data
• Central Tendency measures. They are computed to give a “center” around which the measurements in the data are distributed.
• Variation or Variability measures. They describe “data spread,” or how far away the measurements are from the center.
• Relative Standing measures. They describe the relative position of specific measurements in the data.

Location: Mean
The Mean. To calculate the average of a set of observations, add their values and divide by the number of observations.
Other Types of Means: weighted means, trimmed, geometric, harmonic.

Location: Median
• Median – the exact middle value
• Calculation:
  - If there are an odd number of observations, find the middle value
  - If there are an even number of observations, find the middle two values and average them

Which Location Measure Is Best?
(Figure: two dot plots on a 0-10 scale; the first has Mean = 3 and Median = 3, the second has Mean = 4 but its Median is still 3.)
• Mean is best for symmetric distributions without outliers
• Median is useful for skewed distributions or data with outliers

Scale: Variance
• Variance: the average of squared deviations of values from the mean
Why Squared Deviations?
• Adding the raw deviations is no use – they always sum to zero
• Absolute values do not have nice mathematical properties
• Squares eliminate the negative signs
• Result: increasing contribution to the variance as you go farther from the mean
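A short numerical illustration (made-up data): the raw deviations cancel to essentially zero, while the squared deviations yield the sample variance.

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # hypothetical data
deviations = x - x.mean()

print(deviations.sum())                                # ~0: raw deviations cancel out
var_sample = (deviations ** 2).sum() / (len(x) - 1)    # divide by n-1 (sample variance)
print(var_sample, np.var(x, ddof=1))                   # matches numpy with ddof=1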
Scale: Standard Deviation
• Variance is somewhat arbitrary
• What does it mean to have a variance of 10.8? Or 2.2? Or 1459.092? Or 0.000001?
• Nothing. But if you could “standardize” that value, you could talk about any variance (i.e. deviation) in equivalent terms
• Standard deviations are simply the square root of the variance

Scale: Standard Deviation, step by step
1. Score (in the units that are meaningful)
2. Mean
3. Each score’s deviation from the mean
4. Square that deviation
5. Sum all the squared deviations (Sum of Squares)
6. Divide by n-1
7. Square root – now the value is in the units we started with!
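The same seven steps as a minimal sketch in Python (hypothetical scores):

import numpy as np

scores = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # 1. scores in meaningful units
mean = scores.mean()                             # 2. mean
deviations = scores - mean                       # 3. each score's deviation from the mean
squared = deviations ** 2                        # 4. square each deviation
sum_of_squares = squared.sum()                   # 5. sum of squares
variance = sum_of_squares / (len(scores) - 1)    # 6. divide by n - 1
sd = np.sqrt(variance)                           # 7. square root: back in the original units

print(sd, np.std(scores, ddof=1))                # agrees with numpy's sample SD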
Interesting Theoretical Result (Chebyshev’s inequality)
• Regardless of how the data are distributed, at least the fraction 1 - 1/k² of the values must fall within k standard deviations of the mean (μ ± kσ):
(1 - 1/1²) = 0% ............ k = 1 (μ ± 1σ)
(1 - 1/2²) = 75% ........... k = 2 (μ ± 2σ)
(1 - 1/3²) ≈ 89% ........... k = 3 (μ ± 3σ)
(Note the use of μ (mu) to represent the mean and σ (sigma) to represent the standard deviation.)
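A quick empirical check of the bound on deliberately non-bell-shaped data (simulated exponential values; the sample size and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)   # heavily skewed, far from normal
m, s = x.mean(), x.std(ddof=1)

for k in (1, 2, 3):
    within = np.mean(np.abs(x - m) <= k * s)   # fraction within k standard deviations
    print(k, within, ">=", 1 - 1 / k**2)       # Chebyshev: within >= 1 - 1/k^2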
Often We Can Do Better (the empirical rule)
For many lists of observations – especially if their histogram is bell-shaped:
1. Roughly 68% of the observations in the list lie within 1 standard deviation of the average
2. Roughly 95% of the observations lie within 2 standard deviations of the average
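And the bell-shaped case, checked on simulated normal data:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)   # bell-shaped sample
m, s = x.mean(), x.std(ddof=1)

print(np.mean(np.abs(x - m) <= 1 * s))   # ~0.68
print(np.mean(np.abs(x - m) <= 2 * s))   # ~0.95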
Quartiles
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
• Q2 is the same as the median (50% are smaller, 50% are larger)
• Only 25% of the observations are greater than the third quartile, Q3

Percentiles (aka Quantiles)
In general, the nth percentile is a value such that n% of the observations fall at or below it.
Q1 = 25th percentile
Median = Q2 = 50th percentile
Q3 = 75th percentile
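A minimal sketch with numpy (the data values are made up):

import numpy as np

x = np.array([1, 2, 4, 7, 9, 12, 15, 18, 21, 40], dtype=float)  # hypothetical data

q1, q2, q3 = np.percentile(x, [25, 50, 75])
print(q1, q2, q3)              # Q1, median (Q2), Q3
print(np.median(x) == q2)      # the median is the 50th percentile
print(q3 - q1)                 # interquartile range (IQR): spread of the middle 50%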

Univariate Data: Histograms and Bar Plots
• What’s the difference between a histogram and a bar plot?
Bar plot
• Used for categorical variables to show frequency or proportion in each category
• Translates the data from frequency tables into a pictorial representation
Histogram
• Used to visualize the distribution (shape, center, range, variation) of continuous variables
• “Bin size” is important
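A minimal matplotlib sketch of the two plot types (the categories, counts, and continuous values are invented for illustration); it also previews the density option discussed next:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

# Bar plot: categorical variable, one bar per category (counts from a frequency table).
categories = ["A", "B", "O", "AB"]            # hypothetical blood types
counts = [45, 10, 40, 5]
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(categories, counts)
axes[0].set_title("Bar plot (categorical)")

# Histogram: continuous variable, values grouped into bins; bin size matters.
x = rng.normal(loc=170, scale=10, size=500)   # hypothetical heights (cm)
axes[1].hist(x, bins=20)
axes[1].set_title("Frequency histogram")

# Density histogram: bars rescaled so the total area equals 1.
axes[2].hist(x, bins=20, density=True)
axes[2].set_title("Density histogram")

plt.tight_layout()
plt.show()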

More on Histograms
• What’s the difference between a frequency histogram and a density histogram?
• A frequency histogram plots the count of observations in each bin; a density histogram rescales the bars so that the total area equals 1, which makes histograms with different bin widths or sample sizes directly comparable.
Bivariate Data

Multivariate Data

Clustering
• Organize units into clusters
• Descriptive, not inferential
• Many approaches
• “Clusters” always produced
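A minimal clustering sketch using scikit-learn's k-means (simulated data; note that k-means returns exactly the requested number of clusters whether or not real structure exists, which is the caveat above that “clusters” are always produced):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Simulated bivariate data: two well-separated blobs of 50 points each.
blob1 = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob2 = rng.normal(loc=[4, 4], scale=0.5, size=(50, 2))
X = np.vstack([blob1, blob2])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))    # cluster sizes (a descriptive grouping, not inference)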
Data Reduction Approaches (PCA)
• Reduces an n-dimensional dataset to a much smaller number of dimensions
• Finds a new (smaller) set of variables that retains most of the information in the total sample
• An effective way to visualize multivariate data
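A minimal PCA sketch with scikit-learn on simulated data (the dimensions and noise level are arbitrary):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# Simulated multivariate data: 100 subjects, 10 correlated variables.
latent = rng.normal(size=(100, 2))             # 2 underlying factors
loadings = rng.normal(size=(2, 10))
X = latent @ loadings + 0.1 * rng.normal(size=(100, 10))

pca = PCA(n_components=2)
scores = pca.fit_transform(X)                  # new, smaller set of variables
print(scores.shape)                            # (100, 2) -- ready to scatterplot
print(pca.explained_variance_ratio_)           # most of the information is retained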

Scale: Quartiles and IQR
• The interquartile range (IQR) is Q3 - Q1, the spread of the middle 50% of the data; it is the scale measure that pairs naturally with the median.

Descriptive Statistics and Exploratory Analysis
http://www.iasri.res.in/ebook/EB_SMAR/e-book_pdf%20files/Manual%20II/1-Descriptive%20Statistics.pdf
Descriptive Statistics
http://www.utcomchatt.org/docs/Descriptive_Statistics_1142008.pdf
Statistical Theory & Methods and Applied Statistics
http://www.learn.colostate.edu/courses/STAT/STAT523.dot
STAT 523 - Quantitative Spatial Analysis
Techniques in spatial analysis: point pattern analysis, spatial autocorrelation, trend surface and spectral analysis.

STAT 501 - Statistical Science
Overview of statistics: theory; use in agriculture, business, environment, engineering; modeling; computing; statisticians as researchers/consultants

STAT 511 - Design and Data Analysis for Researchers I
Statistical methods for experimenters and researchers emphasizing design and analysis of experiments.

STAT 512 - Design and Data Analysis for Researchers II
Statistical methods for experimenters and researchers emphasizing design and analysis of experiments.

STAT 520 - Introduction to Probability Theory
Probability, random variables, distributions, expectations, generating functions, limit theorems, convergence, random processes

STAT 521 - Stochastic Processes I
Characterization of stochastic processes, Markov chains in discrete and continuous time, branching processes, renewal theory, Brownian motion

STAT 525 - Analysis of Time Series
Trend and seasonality, stationary processes, Hilbert space techniques, spectral distribution function, fitting ARIMA models, linear prediction. Spectral analysis; the periodogram; spectral estimation techniques; multivariate time series; linear systems and optimal control; Kalman filtering and prediction

STAT 530 - Mathematical Statistics
Sampling distributions, estimation, testing, confidence intervals; exact and asymptotic theories of maximum likelihood and distribution-free methods

STAT 540 - Data Analysis and Regression
Introduction to multiple regression and data analysis with emphasis on graphics and computing


STAT 301 Introductions to Statistical Methods
Techniques in statistical inference; confidence intervals, hypothesis tests, correlation and regression, analysis of variance, chi-square tests

STAT 315 Statistics for Engineers and Scientists
Techniques in statistical inference; confidence intervals, hypothesis tests, correlation and regression, analysis of variance, chi-square tests

STAT 460 Applied Multivariate Analysis
Principles for multivariate estimation and testing; multivariate analysis of variance, discriminant analysis; principal components, factor analysis

STAT 501 Statistical Science
Model building and decision making; communication of statistical information

STAT 457 Statistics for Environmental Monitoring
Applications of statistics in environmental pollution studies involving air, water, or soil monitoring; sampling designs; trend analysis; censored data

STAT 560 Applied Multivariate Analysis
Multivariate analysis of variance; principal components; factor analysis; discriminant analysis; cluster analysis

STAT 570 Nonparametric Statistics
Distribution and uses of order statistics; nonparametric inferential techniques, their uses and mathematical properties

STAT 600 Statistical Computing
Statistical packages; graphical data presentation; model fitting and diagnostics; random numbers; simulation; numerical methods in statistics

STAT 605 Theory of Sampling Techniques
Survey designs; simple random, stratified, cluster samples; theory of estimation; optimization techniques for minimum variance or costs

STAT 640 Design and Linear Modeling
Introduction to linear models; experimental design; fixed, random, and mixed models. Mixed factorials; response surface methodology; Taguchi methods; variance components

STAT 645 Categorical Data Analysis and GLIM
Generalized linear models, binary and polytomous data, log linear models, quasilikelihood models, survival data models

STAT 675 Bayesian Statistics
Bayesian inference and theory, hierarchical models, Markov chain Monte Carlo theory and methods, model criticism and selection, hierarchical regression and generalized linear models, and other topics

the chi-squared distribution
http://www.colby.edu/biology/BI17x/freq.html

Using probability theory, statisticians have devised a way to determine if a frequency distribution differs from the expected distribution. To use this chi-square test, we first have to calculate chi-squared.

Chi-squared = Σ [(observed - expected)² / expected], where the sum is taken over all classes.

We have two classes to consider in this example, heads and tails. Suppose that in 200 coin tosses we expect 100 heads and 100 tails but actually observe 108 heads and 92 tails.

Chi-squared = (108 - 100)²/100 + (92 - 100)²/100 = (8)²/100 + (-8)²/100 = 0.64 + 0.64 = 1.28
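The same calculation can be checked with scipy.stats.chisquare, which also returns the p-value from the chi-square distribution (1 degree of freedom here):

from scipy.stats import chisquare

observed = [108, 92]     # heads, tails actually counted
expected = [100, 100]    # counts expected under a fair coin

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat)       # 1.28, matching the hand calculation above
print(p_value)    # ~0.258: not strong evidence against a fair coin at the usual levels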

Pearson's chi-square test
http://en.wikipedia.org/wiki/Pearson's_chi-square_test

Pearson's chi-square (χ²) test is the best-known of several chi-square tests – statistical procedures whose results are evaluated by reference to the chi-square distribution. Its properties were first investigated by Karl Pearson. In contexts where it is important to make a distinction between the test statistic and its distribution, names similar to Pearson X-squared test or statistic are used.

It tests a null hypothesis that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution. The events considered must be mutually exclusive and have total probability 1. A common case is that each event covers one outcome of a categorical variable. A simple example is the hypothesis that an ordinary six-sided die is "fair", i.e., all six outcomes are equally likely to occur. Pearson's chi-square is the original and most widely used chi-square test.
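A sketch of the die example with hypothetical counts (60 rolls invented for illustration); when the expected frequencies are omitted, scipy.stats.chisquare assumes they are equal, which is exactly the "fair die" null hypothesis:

from scipy.stats import chisquare

rolls = [8, 9, 12, 11, 6, 14]      # hypothetical counts of faces 1..6 in 60 rolls
stat, p_value = chisquare(rolls)   # expected 10 per face by default (uniform)
print(stat, p_value)               # modest statistic / large p -> consistent with "fair"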

Chi-square test
http://en.wikipedia.org/wiki/Chi-square_test
A chi-square test (also chi-squared or χ² test) is any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-square distribution when the null hypothesis is true, or any in which this is asymptotically true, meaning that the sampling distribution (if the null hypothesis is true) can be made to approximate a chi-square distribution as closely as desired by making the sample size large enough.

Some examples of chi-squared tests where the chi-square distribution is only approximately valid:

Pearson's chi-square test, also known as the chi-square goodness-of-fit test or chi-square test for independence. When mentioned without any modifiers or without other precluding context, this test is usually understood (for an exact test used in place of χ², see Fisher's exact test).
Yates' chi-square test, also known as Yates' correction for continuity.
Mantel-Haenszel chi-square test.
Linear-by-linear association chi-square test.
The portmanteau test in time-series analysis, testing for the presence of autocorrelation
Likelihood-ratio tests in general statistical modelling, for testing whether there is evidence of the need to move from a simple model to a more complicated one (where the simple model is nested within the complicated one).
One case where the distribution of the test statistic is an exact chi-square distribution is the test that the variance of a normally-distributed population has a given value based on a sample variance. Such a test is uncommon in practice because values of variances to test against are seldom known exactly.
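That exact test is easy to sketch: under the null hypothesis that the population variance equals σ0², the statistic (n - 1)s²/σ0² follows a chi-square distribution with n - 1 degrees of freedom exactly (the sample values and hypothesized variance below are made up):

import numpy as np
from scipy.stats import chi2

x = np.array([9.8, 10.1, 10.4, 9.7, 10.3, 10.0, 9.9, 10.2])   # hypothetical sample
sigma0_sq = 0.05            # hypothesized population variance
n = len(x)
s_sq = np.var(x, ddof=1)    # sample variance

stat = (n - 1) * s_sq / sigma0_sq   # exactly chi-square with n-1 df under H0
p_two_sided = 2 * min(chi2.cdf(stat, df=n - 1), chi2.sf(stat, df=n - 1))
print(stat, p_two_sided)            # here the data are consistent with sigma0_sq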

Chi-Square Procedures for the Analysis of Categorical Frequency data
http://faculty.vassar.edu/lowry/PDF/c8p1.pdf ch8
Introduction to Procedures Involving Sample Means
http://faculty.vassar.edu/lowry/PDF/c9p1.pdf ch9
Basic Concepts of Probability
http://faculty.vassar.edu/lowry/PDF/c5p1.pdf ch5
Introduction to Linear Correlation and Regression
http://faculty.vassar.edu/lowry/PDF/c3p1.pdf ch3
Distributions
http://faculty.vassar.edu/lowry/PDF/c2p1.pdf ch2
Principles of Measurement
http://faculty.vassar.edu/lowry/PDF/c1p1.pdf ch1
One-Way Analysis of Variance for Correlated Samples
http://faculty.vassar.edu/lowry/PDF/c15p1.pdf ch15
One-Way Analysis of Variance for Independent Samples
http://faculty.vassar.edu/lowry/PDF/c14p1.pdf ch14
Two-Way Analysis of Variance for Independent Samples
http://faculty.vassar.edu/lowry/PDF/c16p1.pdf ch16
One-Way Analysis of Covariance for Independent Samples
http://faculty.vassar.edu/lowry/PDF/c17p1.pdf ch17
TEACHING STATISTICS
COURSE OUTLINE
Introduction
Review of Algebra
Measurement
Frequency Distributions
The Normal Curve
Statistics
First Test
Interpretation of Scores
Regression
Correlation
Second Test
Logic of Inferential Statistics
The Sampling Distribution
Some Hypothesis Tests
The t-tests
Additional Topics
Final

http://www.psychstat.missouristate.edu/introbook/sbk01.htm - http://www.psychstat.missouristate.edu/introbook/sbk29.htm