Spatial Statistics Techniques

As outlined in section 2.1, Spatial Statistics can be grouped into two broad camps—surface modeling and spatial data mining. Surface Modeling involves the translation of discrete point data into a continuous surface that represents the geographic distribution of the data. Spatial Data Mining, on the other hand, seeks to uncover numerical relationships within and among sets of mapped data.

4.1 Surface Modeling

The conversion of a set of point samples into its implied geographic distribution involves several considerations—an understanding of the procedures themselves, the underlying assumptions, techniques for benchmarking the derived map surfaces and methods for assessing the results and characterizing accuracy.

4.1.1 Point Samples to Map Surfaces

Soil sampling has long been at the core of agricultural research and practice. Traditionally, point-sampled data were analyzed with non-spatial statistics to identify the typical nutrient level throughout an entire field. Considerable effort was expended to determine the best single estimate and to assess how well that average typified the field.

However, non-spatial techniques fail to use the geographic patterns inherent in the data to refine the estimate: the typical level is assumed to be the same everywhere within a field. The computed standard deviation indicates how good this assumption is; the larger the standard deviation, the less valid the assumption “…everywhere the same.”

Surface Modeling utilizes the spatial patterns in a data set to generate localized estimates throughout a field. Conceptually it maps the variance by using geographic position to help explain the differences in the sample values. In practice, it simply fits a continuous surface to the point data spikes as depicted in figure 4.1.1-1.

While the extension from non-spatial to spatial statistics is quite a theoretical leap, the practical steps are relatively easy. The left side of the figure shows 2D and 3D point maps of phosphorus soil samples collected throughout the field. This highlights the primary difference from traditional soil sampling: each sample must be geo-referenced as it is collected. In addition, the sampling pattern and intensity often differ from those of traditional grid sampling in order to maximize the spatial information within the data collected.

The surface map on the right side of the figure depicts the continuous spatial distribution derived from the point data. Note that the high spikes in the left portion of the field and the relatively low measurements in the center are translated into the peaks and valleys of the surface map.

When mapped, the traditional, non-spatial approach forms a flat plane (average phosphorus level) aligned within the bright yellow zone. Its “…everywhere the same” assumption fails to recognize the patterns of higher and lower levels captured in the surface map of the data’s geographic distribution. A fertilization plan for phosphorus based on the average level (22ppm) would be ideal for very few locations and inappropriate for most of the field, as the sample data vary from 5 to 102ppm phosphorus.

4.1.2 Spatial Autocorrelation

The basic concept behind Spatial Interpolation is Spatial Autocorrelation, which refers to the degree of similarity among neighboring points (e.g., soil nutrient samples). If they exhibit a lot of similarity, termed spatial dependence, they ought to derive a good map. If they are spatially independent, then expect a map of pure, dense gibberish. So how can we measure whether “what happens at one location depends on what is happening around it?”

Common sense leads us to believe more similarity exists among the neighboring soil samples (lines in the left side of figure 4.1.2-1) than among sample points farther away. Computing the differences in the values between each sample point and its closest neighbor provides a test of the assertion as nearby differences should be less than the overall difference among the values of all sample locations.

If the differences in neighboring values are a lot smaller than the overall variation, then a high degree of positive spatial dependency is indicated. If they are about the same, or if the neighbors' variation is larger (indicating a rare checkerboard-like condition), then the assumption of spatial dependence fails. A failed dependency test means that an interpolated soil nutrient map is likely just colorful gibberish.
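The difference test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `nearest_neighbor_test`, the use of mean absolute differences as the measure of variation, and the five-point transect are all hypothetical and are not taken from the figure's data.

```python
import math
from itertools import combinations

def nearest_neighbor_test(points):
    """Compare nearby differences with overall differences.

    points: list of (x, y, value) tuples.
    Returns (mean |difference| to the closest neighbor,
             mean |difference| over all pairs of samples).
    """
    neighbor_diffs = []
    for i, (xi, yi, vi) in enumerate(points):
        # Closest other sample, regardless of how far away it is.
        nearest = min(
            (p for j, p in enumerate(points) if j != i),
            key=lambda p: math.hypot(p[0] - xi, p[1] - yi),
        )
        neighbor_diffs.append(abs(vi - nearest[2]))
    all_diffs = [abs(a[2] - b[2]) for a, b in combinations(points, 2)]
    return (sum(neighbor_diffs) / len(neighbor_diffs),
            sum(all_diffs) / len(all_diffs))

# Hypothetical transect with a smooth trend (made-up values, in ppm).
transect = [(float(x), 0.0, 10.0 * x) for x in range(5)]
neighbor_diff, overall_diff = nearest_neighbor_test(transect)
print(neighbor_diff, overall_diff)
```

For this made-up transect the neighbor differences are smaller than the overall differences, which is the pattern the text associates with positive spatial dependency.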

The difference test, however, is limited because it assesses only the closest neighbor, regardless of its distance. A Variogram (right side of figure 4.1.2-1) is a plot of the similarity among values based on the distance between them. Instead of simply testing whether close things are related, it shows how the degree of dependency relates to varying distances between locations. The origin of the plot at 0,0 is a unique case where the distance between samples is zero and there is no dissimilarity (data variation = 0), indicating that a location is exactly the same as itself.

As the distance between points increases, subsets of the data are scrutinized for their dependency. The shaded portion in the idealized plot shows how quickly the spatial dependency among points deteriorates with distance. The maximum range (Max Range) position identifies the distance between points beyond which the data values are considered independent. This tells us that using data values beyond this distance can actually degrade the interpolation.

The minimum range (Min Range) position identifies the smallest distance contained in the actual data set and is determined by the sampling design used to collect the data. If a large portion of the shaded area falls below this distance, it tells you there is insufficient spatial dependency in the data set to warrant interpolation. If you proceed with the interpolation, a nifty colorful map will be generated, but likely of questionable accuracy. Worse yet, if the sample data plots as a straight line or circle, no spatial dependency exists and the map will be of no value.
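To make the variogram idea concrete, the sketch below bins every pair of samples by separation distance and computes the classical semivariance (half the mean squared difference) for each bin. The text does not say which estimator lies behind the figure, so this choice is an assumption, as are the function name `empirical_variogram`, the `bin_width` parameter and the made-up transect data.

```python
import math
from collections import defaultdict
from itertools import combinations

def empirical_variogram(points, bin_width):
    """Return [(bin_start_distance, semivariance, pair_count), ...].

    points: list of (x, y, value) tuples.
    Semivariance per bin = sum((v1 - v2)^2) / (2 * number_of_pairs).
    """
    squared_sums = defaultdict(float)
    pair_counts = defaultdict(int)
    for (x1, y1, v1), (x2, y2, v2) in combinations(points, 2):
        distance = math.hypot(x2 - x1, y2 - y1)
        bin_index = int(distance // bin_width)
        squared_sums[bin_index] += (v1 - v2) ** 2
        pair_counts[bin_index] += 1
    return [
        (b * bin_width, squared_sums[b] / (2 * pair_counts[b]), pair_counts[b])
        for b in sorted(squared_sums)
    ]

# Hypothetical transect with a smooth trend (made-up values, in ppm).
transect = [(float(x), 0.0, 10.0 * x) for x in range(5)]
variogram = empirical_variogram(transect, bin_width=1.0)
for start, semivariance, pairs in variogram:
    print(f"distance >= {start:.0f}: semivariance {semivariance:.1f} ({pairs} pairs)")
```

Plotting semivariance against distance gives the kind of curve discussed above: low dissimilarity at short distances, rising as samples get farther apart. Bins with very few pairs deserve little weight when reading the plot.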

Analysis of the degree of spatial autocorrelation in a set of point samples should be treated as mandatory before spatially interpolating any data. The step is not needed to run the interpolation mechanically, as the procedure will always generate a map. However, it is the initial step in determining whether the map generated is likely to be a good one.

4.1.3 Benchmarking Interpolation Approaches

For some, the previous discussion on generating maps from soil samples might have been too simplistic: enter a few things, click on a data file and, in a few moments, you have a soil nutrient surface. Actually, it is that easy to create one. The harder part is figuring out whether the map generated makes sense and whether it is something you ought to use for subsequent analysis and important management decisions.

The following discussion investigates the relative amounts of spatial information provided by comparing a whole-field average to interpolated map surfaces generated from the same data set. The top-left portion in figure 4.1.3-1 shows the map of the average phosphorus level in the field. It forms a flat surface as there isn’t any information about spatial variability in an average value.

The non-spatial estimate simply adds up all of the sample measurements and divides by the number of samples to get 22ppm. Since the procedure didn’t consider the relative position of the different samples, it is unable to map the variations in the measurements. The assumption is that the average is everywhere, plus or minus the standard deviation. But there is no spatial guidance as to where phosphorus levels might be higher, or where they might be lower, than the average.

The spatially based estimates are shown in the interpolated map surface below the average plane. As described in section 4.1.2, spatial interpolation looks at the relative positioning of the soil samples as well as their measured phosphorus levels. In this instance the big bumps were influenced by high measurements in that vicinity while the low areas responded to surrounding low values.
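As a minimal sketch of the non-spatial estimate, the snippet below computes a whole-field average and standard deviation. The six sample values are made up (chosen so that the mean comes out at 22 ppm) and are not the field data behind the figures; the function name `whole_field_estimate` is hypothetical and only illustrates that the estimate ignores location.

```python
import statistics

# Hypothetical phosphorus samples in ppm (illustrative values only).
samples_ppm = [5.0, 12.0, 18.0, 22.0, 30.0, 45.0]

field_average = statistics.mean(samples_ppm)   # sum of samples / count
field_stdev = statistics.stdev(samples_ppm)    # sample standard deviation

def whole_field_estimate(x, y):
    """Non-spatial estimate: the same value at every (x, y) location."""
    return field_average

print(f"average = {field_average:.1f} ppm, std dev = {field_stdev:.1f} ppm")
print(whole_field_estimate(0, 0) == whole_field_estimate(250, 400))
```

Because the function never uses its coordinates, mapping it produces the flat plane described above.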

The map surface in the right portion of figure 4.1.3-1 compares the two maps simply by subtracting them. The color ramp was chosen to emphasize the differences between the whole-field average estimates and the interpolated ones. The center yellow band indicates the average level while the progression of green tones locates areas where the interpolated map estimated that there was more phosphorus than the whole-field average. The higher locations identify where the average value is less than the interpolated ones. The lower locations identify the opposite condition where the average value is more than the interpolated ones. Note the dramatic differences between the two maps.
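The subtraction behind such a difference map is a cell-by-cell operation on the grid. The sketch below assumes the difference is taken as interpolated minus average, which matches the description that higher locations are where the average is less than the interpolated value. The small grid is hypothetical and is not the surface in the figure.

```python
# Whole-field average from the text (ppm).
field_average = 22.0

# Hypothetical interpolated surface as rows of grid cells (ppm).
interpolated = [
    [10.0, 22.0, 60.0],
    [18.0, 25.0, 102.0],
]

# Positive cells: interpolated estimate exceeds the whole-field average.
# Negative cells: the whole-field average exceeds the interpolated estimate.
difference = [[cell - field_average for cell in row] for row in interpolated]

for row in difference:
    print(row)
```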

Now turn your attention to figure 4.1.3-2, which compares maps derived by two different interpolation techniques: IDW (inverse distance-weighted) and Krig. Note the similarity in the peaks and valleys of the two surfaces. While subtle differences are visible, the general trends in the spatial distribution of the data are identical.

The difference map on the right confirms the coincident trends. The broad band of yellow identifies areas that are +/- 1 ppm. The brown color identifies areas that are within 10 ppm, with the IDW surface estimating a bit more than the Krig one. Applying the assumption that a difference of +/- 10 ppm is negligible in a fertilization program, the maps are effectively identical.

So what’s the bottom line? There often are substantial differences between a whole-field average and any interpolated surface. This suggests that finding the best interpolation technique isn’t as important as using an interpolated surface over the whole-field average. This general observation holds for most mapped data exhibiting spatial autocorrelation.
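For readers who have not seen inverse distance weighting written out, a minimal sketch follows. It estimates a value at one location as a weighted average of the samples, with weights that fall off with distance. The exponent of 2, the absence of a search radius, the function name `idw_estimate` and the two sample points are all assumptions made for illustration; the text does not state the settings used for the figure, and Kriging is not sketched here because it additionally requires a fitted variogram model.

```python
import math

def idw_estimate(x, y, samples, power=2.0):
    """Inverse distance-weighted estimate at (x, y).

    samples: list of (sample_x, sample_y, value) tuples.
    Each sample's weight is 1 / distance**power.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(sx - x, sy - y)
        if distance == 0.0:
            return value  # exactly on a sample: return its measured value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Two hypothetical samples (made-up coordinates and ppm values).
samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 30.0)]
midpoint = idw_estimate(1.0, 0.0, samples)      # equally far from both samples
near_first = idw_estimate(0.5, 0.0, samples)    # pulled toward the first sample
print(midpoint, near_first)
```

In practice the samples used for each estimate would usually be limited to those within the range of spatial dependency discussed in section 4.1.2.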

4.1.4 Assessing Interpolation Results

The previous discussion compared the assumption of the field average with map surfaces generated by two different interpolation techniques for phosphorus levels throughout a field. While there were considerable differences between the average and the derived surfaces (from -20 to +80ppm), there was relatively little difference between the two surfaces (+/- 10ppm).

But which surface best characterizes the spatial distribution of the sampled data? The answer to this question lies in Residual Analysis—a technique that investigates the differences between estimated and measured values throughout a field. It is common sense that one should not simply accept an interpolated map without assessing its accuracy. Ideally, one designs an appropriate sampling pattern and then randomly locates a number of test points to evaluate interpolation performance.

So which surface, IDW or Krig, did a better job in estimating the measured phosphorus levels for a test set of measurements? The table in figure 4.1.4-1 reports the results for twelve randomly positioned test samples. The first column identifies the sample ID and the second column reports the actual measured value for that location.

Column C simply depicts estimating the whole-field average (21.6) at each of the test locations. Column D computes the difference of the estimated value minus actual measured value for the test set—formally termed the residual. For example, the first test point (ID#59) estimated the average of 21.6 but was actually measured as 20.0, so the residual is 1.6 (21.6-20.0= 1.6ppm) …very close. However, test point #109 is way off (21.6-103.0= -81.4ppm) …nearly 400% under-estimate error.
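The residual arithmetic for the two test points quoted above can be reproduced directly. The snippet uses only the values given in the text (the 21.6 whole-field average and the measurements for test points #59 and #109); the variable names are illustrative.

```python
# Whole-field average used as the estimate at every test location (ppm).
FIELD_AVERAGE_PPM = 21.6

# Measured values for the two test points quoted in the text (ppm).
measured_ppm = {59: 20.0, 109: 103.0}

# Residual = estimated value minus actual measured value.
residuals_ppm = {
    point_id: round(FIELD_AVERAGE_PPM - actual, 1)
    for point_id, actual in measured_ppm.items()
}

for point_id, residual in residuals_ppm.items():
    print(f"test point #{point_id}: residual = {residual:+.1f} ppm")
```

A negative residual means the estimate was below the measured value, that is, an under-estimate.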

The residuals for the IDW and Krig maps are similarly calculated to form columns F and H, respectively. First note that the residuals for the whole-field average are generally larger than those for either the IDW or Krig estimates. Next note that the residual patterns for IDW and Krig are very similar: when one is way off, so is the other, and usually by about the same amount. A notable exception is test point #91, where Krig dramatically over-estimates.

The rows at the bottom of the table summarize the residual analysis results. The Residual sum row characterizes any bias in the estimates—a negative value indicates a tendency to underestimate with the magnitude of the value indicating how much. The –92.8 value for the whole-field average indicates a relatively strong bias to underestimate.

The Average error row reports how far off the estimates typically were. The 19.0ppm average error for the whole-field average is three times worse than Krig’s estimated error (6.08) and nearly four times worse than IDW’s (5.24).

Comparing the figures to the assumption that +/-10ppm is negligible in a fertilization program, it is readily apparent that the whole-field estimate is inappropriate to use and that the accuracy differences between IDW and Krig are minor.

The Normalized error row simply calculates the average error as a proportion of the average value for the test set of samples (5.24/29.3= .18 for IDW). This index is the most useful as it enables the comparison of the relative map accuracies between different maps. Generally speaking, maps with normalized errors of more than .30 are suspect and one might not want to make important decisions using them.
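The three summary rows can be sketched as a small function. Two assumptions are made here: that "Average error" means the mean of the absolute residuals (the text does not spell out the formula), and that the three-point data set is a made-up stand-in for the twelve test samples. The second half of the snippet applies the normalization to the average errors and test-set average (29.3) reported in the text.

```python
def residual_summary(estimates, measured):
    """Return (residual_sum, average_error, normalized_error).

    residual_sum:     sum of (estimate - measured); its sign shows bias.
    average_error:    mean absolute residual (assumed definition).
    normalized_error: average_error / mean of the measured test values.
    """
    residuals = [est - act for est, act in zip(estimates, measured)]
    residual_sum = sum(residuals)
    average_error = sum(abs(r) for r in residuals) / len(residuals)
    average_measured = sum(measured) / len(measured)
    return residual_sum, average_error, average_error / average_measured

# Hypothetical three-point test set (made-up values, in ppm).
measured = [20.0, 30.0, 40.0]
estimates = [22.0, 22.0, 22.0]
residual_sum, average_error, normalized_error = residual_summary(estimates, measured)
print(residual_sum, round(average_error, 2), round(normalized_error, 2))

# Normalizing the average errors reported in the text by the test-set average.
TEST_SET_AVERAGE_PPM = 29.3
reported_average_error = {"whole-field average": 19.0, "Krig": 6.08, "IDW": 5.24}
reported_normalized = {
    name: round(error / TEST_SET_AVERAGE_PPM, 2)
    for name, error in reported_average_error.items()
}
print(reported_normalized)
```

Under the .30 rule of thumb above, a map would be flagged as suspect whenever its normalized error exceeds 0.30.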

The bottom line is that Residual Analysis is an important consideration when spatially interpolating data. Without an understanding of the relative accuracy and interpolation error of the base maps, one can’t be sure of any modeling results using the data. The investment in a few extra sampling points for testing and residual analysis of these data provides a sound foundation for site-specific management. Without it, the process can become one of blind faith and wishful thinking.

4.2 Spatial Data Mining

Spatial data mining involves procedures for uncovering numerical relationships within and among sets of mapped data. The underlying concept links a map’s geographic distribution to its corresponding numeric distribution through the coordinates and map values stored at each location. This ‘data space’ and ‘geographic space’ linkage provides a framework for calculating map similarity, identifying data zones, mapping data clusters, deriving prediction maps and refining analysis techniques.

4.2.1 Calculating Map Similarity

While visual analysis of a set of maps might identify broad relationships, it takes quantitative map analysis to handle a detailed scrutiny. Consider the three maps shown in figure 4.2.1-1— what areas identify similar patterns? If you focus
