
A major problem is that the R-squared statistic summarizing the residuals for both of the prediction equations is fairly small (R^2 = 26% and 4.7%, respectively), which suggests that the prediction lines do not fit the data very well. One way to improve the predictive model might be to combine the information in both of the images. The Normalized Difference Vegetation Index (NDVI) does just that by calculating a new value that indicates plant density and vigor: NDVI = (NIR - Red) / (NIR + Red).
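
To make the per-cell calculation concrete, below is a minimal sketch, assuming the NIR and Red bands are held as geo-registered NumPy grids (the array names and helper function are illustrative, not from the source):

```python
# Minimal NDVI sketch: (NIR - Red) / (NIR + Red) computed cell-by-cell.
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    nir = nir.astype(float)
    red = red.astype(float)
    denom = nir + red
    # Guard against division by zero where both bands are zero.
    return np.where(denom == 0, 0.0, (nir - red) / denom)

# The sample grid location from figure 4.2.4-3:
print(ndvi(np.array([121.0]), np.array([14.7]))[0])  # ~0.783
```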



Figure 4.2.4-3 shows the process for calculating NDVI for the sample grid location: ((121 - 14.7) / (121 + 14.7)) = 106.3 / 135.7 = .783. The scatter plot on the right shows the yield versus NDVI plot and regression line for all of the field locations. Note that the R^2 value is higher at 30%, indicating that the combined index is a better predictor of yield.

The bottom portion of the figure evaluates the prediction equation's performance over the field. The two smaller maps show the actual yield (left) and predicted yield (right). As you would expect, the prediction map doesn't contain the extreme high and low values actually measured.



The larger map on the right calculates the error of the estimates by simply subtracting the actual measurement from the predicted value at each map location. The error map suggests that overall the yield estimates are not too bad: the average error is a 2.62 bu/ac over-estimate and 67% of the field is within +/- 20 bu/ac. Also note the geographic pattern of the errors, with most of the over-estimates occurring along the edge of the field, while most of the under-estimates are scattered along NE-SW strips.
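
A short sketch of this evaluation step, assuming predicted and actual yield grids as NumPy arrays (the function and names are illustrative; only the error formula and the 20 bu/ac band come from the text):

```python
# Error map: predicted minus actual at each cell (positive = over-estimate).
import numpy as np

def error_summary(predicted: np.ndarray, actual: np.ndarray, band: float = 20.0):
    error = predicted - actual
    mean_error = error.mean()                          # average over/under-estimate
    pct_within = (np.abs(error) <= band).mean() * 100  # percent of field within the band
    return error, mean_error, pct_within
```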



Evaluating a prediction equation on the data that generated it is not true validation; however, the procedure provides at least some empirical verification of the technique. It suggests that, with some refinement, the prediction model might be useful in predicting yield from remotely sensed data well before harvest.

4.2.5 Stratifying Maps for Better Predictions


The preceding section described a procedure for predictive analysis of mapped data. While the underlying theory, concerns and considerations can be quite complex, the procedure itself is quite simple. The grid-based processing preconditions the maps so each location (grid cell) contains the appropriate data. The shish kebab of numbers for each location within a stack of maps is analyzed to derive a prediction equation that summarizes the relationships.



The left side of figure 4.2.5-1 shows the evaluation procedure, with the regression analysis and error map used to relate a map of NDVI to a map of corn yield for a farmer's field. One way to improve the predictions is to stratify the data set by breaking it into groups of similar characteristics. The idea is that a set of prediction equations tailored to each stratum will result in better predictions than a single equation for an entire area. The technique is commonly used in non-spatial statistics, where a data set might be grouped by age, income, and/or education prior to analysis. Additional factors for stratifying, such as neighboring conditions, data clustering and/or proximity, can be used as well.

While there are numerous alternatives for stratifying, subdividing the error map will serve to illustrate the conceptual approach. The histogram in the center of figure 4.2.5-1 shows the distribution of values on the Error Map. The vertical bars identify the breakpoints at +/- 1 standard deviation and divide the map values into three strata: zone 1 of unusually high under-estimates (red), zone 2 of typical error (yellow) and zone 3 of unusually high over-estimates (green). The map on the right of the figure shows the three strata throughout the field.
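
A sketch of this stratification step, assuming the error map is a NumPy grid (the zone numbering follows the text; the helper itself is illustrative):

```python
# Split the error map into three strata at +/- 1 standard deviation.
import numpy as np

def error_zones(error: np.ndarray) -> np.ndarray:
    sd = error.std()
    # below -1 SD -> zone 1 (under-estimates), within +/- 1 SD -> zone 2 (typical),
    # above +1 SD -> zone 3 (over-estimates)
    return np.digitize(error, bins=[-sd, sd]) + 1
```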



The rationale behind the stratification is that the whole-field prediction equation works fairly well for zone 2 but not so well for zones 1 and 3. The assumption is that conditions within zone 1 make the equation under-estimate, while conditions within zone 3 cause it to over-estimate. If the assumption holds, one would expect a tailored equation for each zone to be better at predicting corn yield than a single overall equation.



Figure 4.2.5-2 summarizes the results of deriving and applying a set of three prediction equations. The left side of the figure illustrates the procedure. The Error Zones map is used as a template to identify the NDVI and Yield values used to calculate three separate prediction equations. For each map location, the algorithm first checks the value on the Error Zones map, then sends the data to the appropriate group for analysis. Once the data has been grouped, a regression equation is generated for each zone.

The R^2 statistics for the three equations (.68, .60 and .42, respectively) suggest that the equations fit the data fairly well and ought to be good predictors. The right side of figure 4.2.5-2 shows a composite prediction map generated by applying the equations to the NDVI data, respecting the zones identified on the template map.
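
A sketch of the derive-and-apply procedure, assuming flattened NDVI, yield and zone grids as NumPy arrays; the source does not specify the regression form, so an ordinary least-squares line per zone is assumed here:

```python
# Fit yield ~ NDVI separately within each error zone, then predict through
# the zone template to build the composite prediction map.
import numpy as np

def stratified_prediction(ndvi, actual_yield, zones):
    predicted = np.zeros_like(ndvi, dtype=float)
    for zone in np.unique(zones):
        mask = zones == zone
        # Least-squares fit of yield = slope * NDVI + intercept within this stratum.
        slope, intercept = np.polyfit(ndvi[mask], actual_yield[mask], deg=1)
        predicted[mask] = slope * ndvi[mask] + intercept
    return predicted
```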



The left side of figure 4.2.5-3 provides a visual comparison between the actual yield and predicted maps. The stratified prediction shows detailed estimates that more closely align with the actual yield pattern than the whole-field prediction map derived using a single equation. The error map for the stratified prediction shows that eighty percent of the estimates are within +/- 20 bushels per acre. The average error is only 4 bu/ac, with maximum under- and over-estimates of 81.2 and 113 bu/ac, respectively. All in all, these are fairly good yield estimates based on remote sensing data collected nearly a month before the field was harvested.

A couple of things should be noted from this example of spatial data mining. First, there is a myriad of other ways to stratify mapped data: 1) Geographic Zones, such as proximity to the field edge; 2) Dependent Map Zones, such as areas of low, medium and high yield; 3) Data Zones, such as areas of similar soil nutrient levels; and 4) Correlated Map Zones, such as micro-terrain features identifying small ridges and depressions. The process of identifying useful and consistent stratification schemes is an emerging research frontier in the spatial sciences.



Second, the error map is a key part in evaluating and refining prediction equations. This point is particularly important if the equations are to be extended in space and time. The technique of using the same data set to develop and evaluate the prediction equations isn't always adequate. The results need to be tried at other locations and dates to verify performance. While the spatial data mining methodology might be at hand, good science is imperative.



Finally, one needs to recognize that spatial data mining is not restricted to precision agriculture but has potential for analyzing relationships within almost any set of mapped data. For example, prediction models can be developed for geo-coded sales from demographic data, or timber production estimates from soil/terrain patterns. The bottom line is that maps are increasingly seen as organized sets of data that can be map-ematically analyzed for spatial relationships; we have only scratched the surface.





5.0 Spatial Analysis Techniques



While map analysis tools might at first seem unfamiliar, they simply are extensions of traditional analysis procedures brought on by the digital nature of modern maps. The previous section described a conceptual framework and some example procedures that extend traditional statistics to a spatial statistics that investigates numerical relationships within and among mapped data layers.



Similarly, a mathematical framework can be used to organize spatial analysis operations. Like basic math, this approach uses sequential processing of mathematical operations to perform a wide variety of complex map analyses. By controlling the order in which the operations are executed, and using a common database to store intermediate results, a math-like processing structure is developed.



This map algebra is similar to traditional algebra, where basic operations, such as addition, subtraction and exponentiation, are logically sequenced for specific variables to form equations; however, in map algebra the variables represent entire maps consisting of thousands of individual grid values. Most traditional mathematical capabilities, plus an extensive set of advanced map processing operations, comprise the map analysis toolbox.



As with matrix algebra (a mathematics operating on sets of numbers), new operations emerge that are based on the nature of the data. Transposition, inversion and diagonalization are examples of this extended set of techniques.



In grid-based map analysis, the spatial coincidence and juxtaposition of values among and within maps create new analytical operations, such as coincidence, proximity, visual exposure and optimal routes. These operators are accessed through general-purpose map analysis software available in most GIS systems. While the specific command syntax and mechanics differ among software packages, the basic analytical capabilities and spatial reasoning skills used in GIS modeling and analysis form a common foundation.



There are two fundamental conditions required by any spatial analysis package: a consistent data structure and an iterative processing environment. The earlier section 3.0 described the characteristics of the grid-based data structure by introducing the concepts of an analysis frame, map stack, data types and display forms. The traditional discrete set of map features (points, lines and polygons) was extended to map surfaces that characterize geographic space as a continuum of uniformly-spaced grid cells. This structure forms a framework for the map-ematics underlying GIS modeling and analysis.



The second condition of map analysis provides an iterative processing environment by logically sequencing map analysis operations. This involves:



- retrieval of one or more map layers from the database,

- processing that data as specified by the user,

- creation of a new map containing the processing results, and

- storage of the new map for subsequent processing.



Each new map derived as processing continues aligns with the analysis frame, so it is automatically geo-registered to the other maps in the database. The values comprising the derived maps are a function of the processing specified for the input maps. This cyclical processing provides an extremely flexible structure similar to evaluating nested parentheses in traditional math. Within this structure, one first defines the values for each variable and then solves the equation by performing the mathematical operations on those numbers in the order prescribed by the equation.

This same basic mathematical structure provides the framework for computer-assisted map analysis. The only difference is that the variables are represented by mapped data composed of thousands of organized values. Figure 5.0-1 shows a solution for calculating the percent change in animal activity.



The processing steps shown in the figure are identical to the algebraic formula for percent change, except the calculations are performed for each grid cell in the study area and the result is a map that identifies the percent change at each location. Map analysis identifies what kind of change (thematic attribute) occurred where (spatial attribute). The characterization of what and where provides information needed for continued GIS modeling, such as determining whether areas of large increases in animal activity are correlated with particular cover types or near areas of low human activity.
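
A minimal sketch of the figure 5.0-1 solution, assuming two geo-registered animal activity grids for the two time periods (the array names are illustrative):

```python
# Percent change computed cell-by-cell, exactly as the scalar formula
# ((new - old) / old) * 100 is applied to single numbers.
import numpy as np

def percent_change(previous: np.ndarray, current: np.ndarray) -> np.ndarray:
    previous = previous.astype(float)
    return np.where(previous == 0, np.nan,  # undefined where the baseline is zero
                    (current - previous) / previous * 100.0)
```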



5.1 Spatial Analysis Framework



Within this iterative processing structure, four fundamental classes of map analysis operations can be identified. These include:



- Reclassifying Maps: involving the reassignment of the values of an existing map as a function of its initial value, position, size, shape or contiguity of the spatial configuration associated with each map category.

- Overlaying Maps: resulting in the creation of a new map where the value assigned to every location is computed as a function of the independent values associated with that location on two or more maps.

- Measuring Distance and Connectivity: involving the creation of a new map expressing the distance and route between locations as straight-line length (simple proximity) or as a function of absolute or relative barriers (effective proximity).

- Summarizing Neighbors: resulting in the creation of a new map based on the consideration of values within the general vicinity of target locations.



Reclassification operations merely repackage existing information on a single map. Overlay operations, on the other hand, involve two or more maps and result in the delineation of new boundaries. Distance and connectivity operations are more advanced techniques that generate entirely new information by characterizing the relative positioning of map features. Neighborhood operations summarize the conditions occurring in the general vicinity of a location.
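
Minimal sketches of the four classes, assuming grids held as NumPy arrays; all names and thresholds are illustrative, and SciPy stands in for the distance and neighborhood operators of a GIS package:

```python
import numpy as np
from scipy import ndimage

slope = np.random.rand(100, 100) * 45        # example slope surface (degrees)
cover = np.random.randint(1, 4, (100, 100))  # example cover-type map

# Reclassify: reassign values as a function of the initial value.
steep = np.where(slope > 20, 1, 0)

# Overlay: compute each location's value from two or more co-registered maps.
steep_forest = ((steep == 1) & (cover == 3)).astype(int)

# Distance: simple (straight-line) proximity to the nearest target cell.
proximity = ndimage.distance_transform_edt(steep_forest == 0)

# Neighbors: summarize values within a 3x3 window around each cell.
smoothed = ndimage.uniform_filter(slope, size=3)
```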



The reclassifying and overlaying operations based on point processing are the backbone of current GIS applications, allowing rapid updating and examination of mapped data. However, other than the significant advantages of speed and the ability to handle tremendous volumes of data, these capabilities are similar to those of manual map processing. Map-wide overlays, distance and neighborhood operations, on the other hand, identify more advanced analytic capabilities and most often do not have paper-map legacy procedures.



The mathematical structure and classification scheme of Reclassify, Overlay, Distance and Neighbors form a conceptual framework that is easily adapted to modeling spatial relationships in both physical and abstract systems. A major advantage is flexibility. For example, a model for siting a new highway could be developed as a series of processing steps. The analysis likely would consider economic and social concerns (e.g., proximity to high housing density, visual exposure to houses), as well as purely engineering ones (e.g., steep slopes, water bodies). The combined expression of both physical and non-physical concerns within a quantified spatial context is a major benefit.



However, the ability to simulate various scenarios (e.g., steepness is twice as important as visual exposure, and proximity to housing is four times more important than all other considerations) provides an opportunity to fully integrate spatial information into the decision-making process.

