Spatial Statistics Techniques

4.2.1 Calculating Map Similarity

While visual analysis of a set of maps might identify broad relationships, it takes quantitative map analysis to support detailed scrutiny. Consider the three maps shown in figure 4.2.1-1: which areas exhibit similar patterns? If you focus your attention on a location in the lower-right portion, how similar is its data pattern to all of the other locations in the field?

The answers to these questions are much too complex for visual analysis and certainly beyond the geo-query and display procedures of standard desktop mapping packages. While the data in the example shows the relative amounts of phosphorus, potassium and nitrogen throughout a field, it could as easily be demographic data representing income, education and property values; or sales data tracking three different products; or public health maps representing different disease incidences; or crime statistics representing different types of felonies or misdemeanors.

Regardless of the data and application arena, a multivariate procedure for assessing similarity is often used to analyze the relationships. In visual analysis you move your eye among the maps to summarize the color assignments at different locations. The difficulty in this approach is two-fold: remembering the color patterns and calculating the difference. The map analysis procedure does the same thing except that it uses map values in place of the colors. In addition, the computer doesn’t tire as easily and completes the comparison for all of the locations throughout the map window (3289 in this example) in a couple of seconds.

The upper-left portion of figure 4.2.1-2 illustrates capturing the data patterns of two locations for comparison. The “data spear” at map location column 45, row 18 identifies the P-level as 11.0 ppm, the K-level as 177.0 and the N-level as 32.9. This step is analogous to your eye noting a color pattern of dark-red, dark-orange and light-green. The other location for comparison (32c, 62r) has a data pattern of P= 53.2, K= 412.0 and N= 27.9; or as your eye sees it, a color pattern of dark-green, dark-green and yellow.

The right side of the figure conceptually depicts how the computer calculates a similarity value for the two response patterns. The realization that mapped data can be expressed in both geographic space and data space is a key element to understanding the procedure.

Geographic space uses coordinates, such as latitude and longitude, to locate things in the real world, such as the southeast and extreme north points identified in the example. The geographic expression of the complete set of measurements depicts their spatial distribution in familiar map form.

Data space, on the other hand, is a bit less familiar but can be conceptualized as a box with balls floating within it. In the example, the three axes defining the extent of the box correspond to the P, K and N levels measured in the field. The floating balls represent grid cells defining the geographic space—one for each grid cell. The coordinates locating the floating balls extend from the data axes—11.0, 177.0 and 32.9 for the comparison point. The other point has considerably higher values in P and K with slightly lower N (53.2, 412.0, 27.9) so it plots at a different location in data space.

The bottom line is that the position of any point in data space identifies its numerical pattern—low, low, low is in the back-left corner, while high, high, high is in the upper-right corner. Points that plot in data space close to each other are similar; those that plot farther away are less similar.

In the example, the floating ball in the foreground is the farthest one (least similar) from the comparison point’s data pattern. This distance becomes the reference for ‘most different’ and sets the bottom value of the similarity scale (0%). A point with an identical data pattern plots at exactly the same position in data space resulting in a data distance of 0 that equates to the highest similarity value (100%).
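The similarity scale described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say which distance measure is used, so the sketch assumes straight-line (Euclidean) distance in data space, and it assumes each axis is first rescaled to 0-1 so that K (measured in the hundreds) does not swamp P and N. The function name similarity_map and the third sample cell are hypothetical.

```python
import math

def similarity_map(cells, comparison):
    """Score every cell from 0 (least similar) to 100 (identical to `comparison`).

    cells: list of (P, K, N) tuples, one per grid cell.
    comparison: the (P, K, N) data pattern of the location being compared.
    """
    n_axes = len(comparison)
    # Assumption: rescale each data axis to 0-1 before measuring distance.
    lows = [min(c[i] for c in cells) for i in range(n_axes)]
    highs = [max(c[i] for c in cells) for i in range(n_axes)]
    spans = [(hi - lo) or 1.0 for lo, hi in zip(lows, highs)]

    def scaled(pattern):
        return [(pattern[i] - lows[i]) / spans[i] for i in range(n_axes)]

    ref = scaled(comparison)
    # Assumption: Euclidean distance between positions in data space.
    distances = [math.dist(scaled(c), ref) for c in cells]
    # The farthest point sets the bottom of the scale (0%).
    farthest = max(distances) or 1.0
    return [100.0 * (1.0 - d / farthest) for d in distances]

# The two data patterns quoted in the text plus one hypothetical neighbor.
cells = [(11.0, 177.0, 32.9), (53.2, 412.0, 27.9), (12.0, 180.0, 32.0)]
scores = similarity_map(cells, comparison=cells[0])
```

With these three cells, the comparison point scores 100, the second point (the farthest in data space) scores 0, and the hypothetical neighbor falls close to the top of the scale.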

The similarity map shown in figure 4.2.1-3 applies the similarity scale to the data distances calculated between the comparison point and all of the other points in data space. The green tones indicate field locations with fairly similar P, K and N levels. The red tones indicate dissimilar areas. It is interesting to note that most of the very similar locations are in the left portion of the field.

Map Similarity can be an invaluable tool for investigating spatial patterns in any complex set of mapped data. Humans have difficulty visualizing more than three variables (the data space box); however, a similarity index can handle any number of input maps. In addition, the different layers can be weighted to reflect their relative importance in determining overall similarity.

In effect, a similarity map replaces a lot of laser-pointer waving and subjective suggestions of how similar/dissimilar locations are with a concrete, quantitative measurement for each map location.

4.2.2 Identifying Data Zones

The preceding section introduced the concept of ‘data distance’ as a means to measure similarity within a map. One simply mouse-clicks a location and all of the other locations are assigned a similarity value from 0 (zero percent similar) to 100 (identical) based on a set of specified maps. The statistic replaces difficult visual interpretation of map displays with an exact quantitative measure at each location.

Figure 4.2.2-1 depicts level slicing for areas that are unusually high in P, K and N. In this instance the coincidence of the three “high” data patterns forms a box in three-dimensional data space.

A mathematical trick was employed to get the map solution shown in the figure. On the individual maps, high areas were set to P= 1, K= 2 and N= 4 (all other areas to 0), then the maps were added together. The result is a range of coincidence values from zero (0+0+0= 0; gray= no high areas) to seven (1+2+4= 7; red= high P, high K, high N). The map values between these extremes identify the individual map layers having high measurements. For example, the yellow areas with the value 3 have high P and K but not N (1+2+0= 3). If four or more maps are combined, the areas of interest are assigned increasing binary progression values (…8, 16, 32, etc). Because each layer contributes a distinct power of two, the sum will always uniquely identify the combination.
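The binary coding trick can be illustrated with a short Python sketch. The codes 1, 2 and 4 come from the text; the cut-off values that define “high” are hypothetical, as are the function names level_slice and decode, and the maps are assumed to be stored as flat lists of cell values.

```python
def level_slice(p_map, k_map, n_map, p_cut, k_cut, n_cut):
    """Return a coincidence code (0-7) for each cell: high P=1, high K=2, high N=4."""
    codes = []
    for p, k, n in zip(p_map, k_map, n_map):
        code = 0
        if p >= p_cut:
            code += 1  # high P
        if k >= k_cut:
            code += 2  # high K
        if n >= n_cut:
            code += 4  # high N
        codes.append(code)
    return codes

def decode(code, names=("P", "K", "N")):
    """List the layers that are high for a given coincidence code."""
    return [name for bit, name in enumerate(names) if code & (1 << bit)]

# Two cells using the data patterns quoted earlier; the cut-offs are hypothetical.
codes = level_slice([11.0, 53.2], [177.0, 412.0], [32.9, 27.9],
                    p_cut=40.0, k_cut=300.0, n_cut=30.0)
```

Under these assumed cut-offs the first cell is high only in N (code 4) and the second is high in P and K but not N (code 3), matching the yellow class described above. Adding a fourth layer would simply extend the names tuple, with the new layer taking the next value in the binary progression.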

While Level Slicing is not a sophisticated classifier, it illustrates the useful link between data space and geographic space. This fundamental concept forms the basis for most geo-statistical analysis including map clustering and regression.

4.2.3 Mapping Data Clusters

While both Map Similarity and Level Slicing techniques are useful in examining spatial relationships, they require the user to specify data analysis parameters. But what if you don’t know what level slice intervals to use or which locations in the field warrant map similarity investigation? Can the computer on its own identify groups of similar data? How would such a classification work? How well would it work?

Figure 4.2.3-1 shows some examples derived from Map Clustering. The map stack on the left shows the input maps used for the cluster analysis. These are the same P, K and N maps of phosphorus, potassium and nitrogen levels used in the previous discussions in this section. However, keep in mind that the input maps could be crime, pollution or sales data, or any other set of application-related data. Clustering simply looks at the numerical pattern at each map location and sorts the locations into discrete groups regardless of the nature of the data or its application.

The map in the center of the figure shows the results of classifying the P, K and N map stack into two clusters. The data pattern for each cell location is used to partition the field into two groups that meet the criteria of being 1) as different as possible between groups and 2) as similar as possible within a group.

The two smaller maps at the right show the division of the data set into three and four clusters. In all three of the cluster maps red is assigned to the cluster with relatively low responses and green to the one with relatively high responses. Note the encroachment on these marginal groups by the added clusters that are formed by data patterns at the boundaries.

The mechanics of generating cluster maps are quite simple. Specify the input maps and the number of clusters you want, and miraculously a map appears with discrete data groupings. So how is this miracle performed? What happens inside the clustering black box?

The schematic in figure 4.2.3-2 depicts the process. The floating balls identify the data patterns for each map location (geographic space) plotted against the P, K and N axes (data space). For example, the large ball appearing closest to you depicts a location with high values on all three input maps. The tiny ball in the opposite corner (near the plot origin) depicts a map location with small map values. It seems sensible that these two extreme responses would belong to different data groupings.

While the specific algorithm used in clustering is beyond the scope of this chapter, it suffices to note that ‘data distances’ between the floating balls are used to identify cluster membership—groups of floating balls that are relatively far from other groups and relatively close to each other form separate data clusters. In this example, the red balls identify relatively low responses while green ones have relatively high responses. The geographic pattern of the classification is shown in the map in the lower right portion of the figure.
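To make the idea of grouping by data distance concrete, here is a minimal sketch using k-means, a common clustering technique. The text does not name the algorithm behind the figures, so k-means is an assumption chosen for illustration, as are the function name kmeans, the simple seeding rule and the sample data patterns. The sketch also works on raw values; in practice the axes might need rescaling so that one layer does not dominate the distances.

```python
import math

def kmeans(points, k, iterations=20):
    """Assign each data pattern to one of k clusters by data distance.

    points: list of (P, K, N) tuples, one per grid cell.
    Returns (labels, centers), where labels[i] is the cluster index of points[i].
    """
    # Assumption: seed the centers with points spread evenly through the sorted data.
    ordered = sorted(points)
    centers = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Step 1: each point joins the cluster whose center is closest in data space.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Step 2: each center moves to the average data pattern of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers

# Hypothetical data patterns: three low-response cells and three high-response cells.
points = [(10, 170, 25), (12, 180, 26), (11, 175, 24),
          (50, 400, 33), (53, 410, 35), (52, 405, 34)]
labels, centers = kmeans(points, k=2)
```

Here the low-response cells end up in cluster 0 and the high-response cells in cluster 1. Writing the labels back to their grid positions would produce the cluster map in geographic space.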

Identifying groups of neighboring data points to form clusters can be tricky business. Ideally, the clusters will form distinct clouds in data space. But that rarely happens and the clustering technique has to enforce decision rules that slice a boundary between nearly identical responses. Also, extended techniques can be used to impose weighted boundaries based on data trends or expert knowledge. Treatment of categorical data and leveraging spatial autocorrelation are other considerations.

So how do you know if the clustering results are acceptable? Most statisticians would respond, “…you can’t tell for sure.” While there are some elaborate procedures focusing on the cluster assignments at the boundaries, the most frequently used benchmarks rely on standard statistical indices.

Figure 4.2.3-3 shows the performance table and box-and-whisker plots for the map containing two clusters. The average, standard deviation, minimum and maximum values within each cluster are calculated. Ideally the averages would be radically different and the standard deviations small—large difference between groups and small differences within groups.

Box-and-whisker plots enable a visual assessment of the differences. The box is centered on the average (position) and extends one standard deviation above and below it (width), with the whiskers drawn to the minimum and maximum values to provide a visual sense of the data range. When the diagrams for the two clusters overlap, as they do for the phosphorus responses, it suggests that the clusters are not distinct along this data axis.
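The performance table and the box overlap check can be sketched as follows. The function names cluster_summary and boxes_overlap and the sample values are hypothetical, and the sketch assumes the population form of the standard deviation, which the text does not specify.

```python
import statistics

def cluster_summary(values, labels):
    """Average, standard deviation, minimum and maximum of one map layer per cluster."""
    summary = {}
    for lab in sorted(set(labels)):
        members = [v for v, l in zip(values, labels) if l == lab]
        mean = statistics.mean(members)
        sd = statistics.pstdev(members)  # assumption: population standard deviation
        summary[lab] = {
            "mean": mean,
            "sd": sd,
            "min": min(members),   # lower whisker
            "max": max(members),   # upper whisker
            "box": (mean - sd, mean + sd),
        }
    return summary

def boxes_overlap(a, b):
    """True when the average +/- one standard deviation boxes of two clusters overlap."""
    return a["box"][0] <= b["box"][1] and b["box"][0] <= a["box"][1]

# Hypothetical layer values for six cells, three in each of two clusters.
labels = [0, 0, 0, 1, 1, 1]
k_stats = cluster_summary([170, 180, 175, 400, 410, 405], labels)
p_stats = cluster_summary([30, 40, 50, 35, 45, 55], labels)
k_distinct = not boxes_overlap(k_stats[0], k_stats[1])
p_distinct = not boxes_overlap(p_stats[0], p_stats[1])
```

In this made-up data the K boxes are well separated while the P boxes overlap, which mirrors the situation described for the two-cluster map.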

The separation between the boxes for the K and N axes suggests greater distinction between the clusters. Given the results a practical user would likely accept the classification results. And statisticians hopefully will accept in advance apologies for such a conceptual and terse treatment of a complex spatial statistics topic.

4.2.4 Deriving Prediction Maps

For years non-spatial statistics has been predicting things by analyzing a sample set of data for a numerical relationship (equation) and then applying the relationship to another set of data. The drawbacks are that the non-spatial approach doesn’t account for geographic relationships and the result is just a table of numbers. Extending predictive analysis to mapped data seems logical; after all, maps are just organized sets of numbers. And GIS enables us to link the numerical and geographic distributions of the data.

To illustrate the data mining procedure, the approach can be applied to the same field that has been the focus of the previous discussion. The top portion of figure 4.2.4-1 shows the yield pattern of corn for the field, varying from a low of 39 bushels per acre (red) to a high of 279 (green). The corn yield map is termed the dependent map variable and identifies the phenomenon to be predicted.

The independent map variables depicted in the bottom portion of the figure are used to uncover the spatial relationship used for prediction, termed the prediction equation. In this instance, digital aerial imagery will be used to explain the corn yield patterns. The map on the left indicates the relative reflectance of red light off the plant canopy while the map on the right shows the near-infrared response (a form of light just beyond what we can see).

While it is difficult to visually assess the subtle relationships between corn yield and the red and near-infrared images, the computer “sees” the relationship quantitatively. Each grid location in the analysis frame has a value for each of the map layers— 3,287 values defining each geo-registered map covering the 189-acre field.

For example, the top portion of figure 4.2.4-2 identifies that the example location has a ‘joint’ condition of red band equals 14.7 and yield equals 218. The lines parallel to the axes in the scatter plot on the right identify the precise position of the pair of map values: X= 14.7 and Y= 218. Similarly, the near-infrared and yield values for the same location are shown in the bottom portion of the figure.

The dots in each scatter plot represent the data pairs for all of the grid locations. The slanted lines through the dots represent the prediction equations derived through regression analysis. While the mathematics is a bit complex, the effect is to identify the line that ‘best fits the data’, with the data points roughly balanced above and below the regression line.

In a sense, the line identifies the average yield for each step along the X-axis for the red and near-infrared bands and a reasonable guess of the corn yield for each level of spectral response. That’s how a regression prediction is used— a value for the red band (or near-infrared band) in another field is entered and the equation for the line calculates a predicted corn yield. Repeating the calculation for all of the locations in the field generates a prediction map of yield from remotely sensed data.
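The fit-then-apply sequence can be sketched in Python. The sketch assumes ordinary least squares for a single independent variable, which is the usual meaning of a regression line but is not spelled out in the text. The function names fit_line and predict_map are hypothetical, and the sample numbers are made up to keep the arithmetic easy to follow; they are not values from the field.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_map(xs, slope, intercept):
    """Apply the prediction equation to every cell of an independent map layer."""
    return [intercept + slope * x for x in xs]

# Hypothetical sample: spectral response (x) and yield (y) at four grid cells.
red = [1.0, 2.0, 3.0, 4.0]
crop = [10.0, 8.0, 6.0, 4.0]
slope, intercept = fit_line(red, crop)

# Apply the equation to the spectral values of cells in another (hypothetical) field.
predicted = predict_map([5.0, 2.5], slope, intercept)
```

For this toy data the fitted line is y = 12 - 2x, so a spectral value of 5.0 predicts a yield of 2.0. Running predict_map over every cell of a real image layer is what produces the prediction map.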

A major problem is that the R-squared statistic summarizing the residuals for both of the prediction equations is fairly small (R^2= 26% and 4.7% respectively) which suggests that the prediction l

Stratifying Maps for Better Predictions