Intro to Atlas GIS 2.0
Mackenzie, Tanjuakio and Sparco


Chapter 7: Spatial Statistics

The basic premise of spatial analysis is that nearby things tend to be more alike than far-apart things. In other words, we frequently discern "neighborhood effects." Real estate agents frequently explain that the three major determinants of housing values are "location, location and location." A property located in a high-price neighborhood is worth more than an otherwise identical property located in a low-price neighborhood.

This chapter discusses AtlasGIS's report generation capabilities, and introduces some basic concepts of spatial statistics.

Neighborhood effects in spatial data tend to invalidate some conventional statistical tests of association between spatial variables. The final part of this chapter summarizes a spatial regression analysis of the effects of Superfund toxic waste sites on proximate residential property values. AtlasGIS is used for data pre-processing; SPACESTAT, a DOS-based spatial regression package, is used for estimating the econometric model. The diagnosis of spatial autocorrelation pathologies and their effects on the regression model are discussed. Readers lacking econometric training will probably opt to skip this last section.

Summary Data Reports

The Select-Info utility generates summary statistics on any currently-selected features. You can get a quick-and-dirty printout of this with the /Print tool, sending the report to an ASCII file or your printer. (You can't adjust the format of this report until you import it to your word-processor.)

You can generate summary reports for selected features in the format you want with Print-Attribute. This calls up a Report Settings screen on which you specify title, headings, columns, sort order, etc. You can add calculated columns with the /Insert tool within the <<Define Columns>> settings sheet. You can print the report on your printer or write it to an ASCII text file with the /File-Save tool. The ASCII file can be accessed by almost any word-processing or statistical analysis package.

Keep in mind that AtlasGIS's database manager and statistical analysis capabilities are both somewhat limited. If you need to do complex database or statistical operations on selected features, you will generally want to write the attributes of the selected features to a separate file and analyze these data using a dedicated database manager or statistical analysis program.

Spatial Statistics: Basic Issues

Spatial statistics involves analysis of relationships between spatially-referenced variables. Generally, the relationships between such variables are influenced by their relative spatial distributions. While classical statistical methods assume that data are randomly sampled from a homogeneous data-generating process, spatial data typically violate critical aspects of these assumptions. First, spatial data are, by definition, not randomly-scattered in space: their locations have some pattern and (presumably) meaning or explanation. Second, many types of geographic features (lines and regions) are spatially heterogeneous: for example, roads have different lengths, capacities, directions, etc.; county regions have different areas, populations, compactness, etc. Only point features are spatially homogeneous. Statistical analysis of spatially heterogeneous features generally requires the use of observation-weighting methods to achieve approximate homogeneity.

While the nearness of observations is often hypothesized to be a determinant of relationship between them, "nearness" itself can be very difficult to define or measure. While time-series data are ordered in a single (time) dimension, in which proximity between observations can be indexed directly, spatial data are ordered in two or three different (spatial) dimensions which generally precludes unambiguous uni-dimensional indexing of proximity. The proximity between any two region features may be indexed in any number of ways: by presence or absence of shared boundary; length of shared boundary (if any); ratio of shared versus unshared boundary; distance between region centroids; average of distances between nearest boundary points and distances between furthest boundary points, etc.

Spatial statistics, particularly the summary statistics generated by AtlasGIS's Select-Info utility, require careful interpretation. For example, Select-Info calculates simple averages of percents, which are invalid unless the percents are based on the same number of observations in each feature. To illustrate: suppose a region with 100,000 people is 20% minority and an adjacent region with 50,000 people is 10% minority. The combined region (weighted average) is 16.7% minority--25,000 minority out of 150,000 total--not 15% minority (simple average).
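The arithmetic in this example can be checked with a short Python sketch. The function name and the two-region data layout are illustrative only; nothing here is part of AtlasGIS.

```python
def weighted_percent(populations, percents):
    """Population-weighted average of per-region percentages."""
    total = sum(populations)
    # Convert each percent back to a head count, then divide by the combined total.
    count = sum(pop * pct / 100.0 for pop, pct in zip(populations, percents))
    return 100.0 * count / total

populations = [100_000, 50_000]   # people in each region
percents = [20.0, 10.0]           # percent minority in each region

simple = sum(percents) / len(percents)               # simple average of percents
weighted = weighted_percent(populations, percents)   # combined-region figure

print(f"simple average:   {simple:.1f}%")
print(f"weighted average: {weighted:.1f}%")
```

The two figures agree only when every region has the same population, which is the condition stated above.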

Spatial Regression

GIS's are widely used for visual analysis of spatial relationships between variables. They can also support statistical analysis of such relationships. The spatial regression model is a variant of the classical ordinary least-squares (OLS) regression model. It treats a statistical pathology of spatial data known as spatial autocorrelation.

Briefly, wherever a significant portion of the value of an observation is explained by the values of proximate observations, the classical ordinary least-squares regression model is inefficient, and conventional model statistics may be seriously misleading. The usual consequence of positive spatial autocorrelation is that the confidence intervals on the regression coefficients are too narrow and overstate the statistical significance of those coefficients, leading to possible Type I statistical error (rejection of a true null hypothesis).

Let S denote the variance scalar, and let X' and inv[X] denote the transpose and inverse of matrix X, respectively.

Suppose the relationship between one variable Y and a set of explanatory variables X is specified in linear form Y = XB + e, where B is a vector of regression coefficients and e is the vector of residuals with variance-covariance matrix G. Estimated via OLS, B = inv[X'X]X'Y. However, the presence of spatial autocorrelation implies non-zero off-diagonals in G. In the case of positive autocorrelation, the estimated variances of the OLS estimator are biased downward.

This is essentially a weighting problem: geographically clustered observations contain a high degree of redundant information, and should be given smaller weights in the regression than geographically remote observations. Where G has non-zero off-diagonals, the generalized least-squares (GLS) estimator is inv[X'inv[G]X]X'inv[G]Y, which incorporates the appropriate weights and is BLUE ("Best"--i.e. minimum-variance--Linear Unbiased Estimator). The OLS estimator may appear to have lower variance than the GLS estimator if model statistics are calculated from the erroneous variance-covariance matrix Sinv[X'X] rather than the true variance-covariance matrix of the OLS estimator, Sinv[X'X]X'[G*]Xinv[X'X], which incorporates [G*], the normalization of G. In cases of positive autocorrelation the apparent variance-covariance of the OLS estimator, Sinv[X'X], may be smaller than Sinv[X'inv[G*]X], the variance-covariance of the GLS estimator. However, Sinv[X'X]X'[G*]Xinv[X'X], the true variance-covariance of OLS, is easily shown to be larger than Sinv[X'inv[G*]X].
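The comparison of these variance-covariance matrices can be illustrated numerically. The following is a minimal NumPy sketch on synthetic data, not the procedure used in the case study. It assumes G is known and decays exponentially with distance along a line; because this G has a unit diagonal, S = 1 and G* = G. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40

# Hypothetical locations on a line; covariance decays with distance,
# which gives positive autocorrelation between nearby observations.
loc = np.linspace(0.0, 10.0, n)
dist = np.abs(loc[:, None] - loc[None, :])
G = np.exp(-dist)                        # assumed (known) residual covariance

X = np.column_stack([np.ones(n), loc])   # intercept plus one regressor
B_true = np.array([1.0, 0.5])
e = np.linalg.cholesky(G) @ rng.standard_normal(n)   # residuals with covariance G
Y = X @ B_true + e

XtX_inv = np.linalg.inv(X.T @ X)
G_inv = np.linalg.inv(G)

B_ols = XtX_inv @ X.T @ Y                        # inv[X'X]X'Y
var_gls = np.linalg.inv(X.T @ G_inv @ X)         # inv[X'inv[G]X]
B_gls = var_gls @ X.T @ G_inv @ Y                # inv[X'inv[G]X]X'inv[G]Y

resid = Y - X @ B_ols
s2 = resid @ resid / (n - X.shape[1])            # conventional residual variance
var_ols_naive = s2 * XtX_inv                     # what an OLS routine reports
var_ols_true = XtX_inv @ X.T @ G @ X @ XtX_inv   # inv[X'X]X'[G]Xinv[X'X]

print("naive OLS variances:", np.diag(var_ols_naive))
print("true OLS variances: ", np.diag(var_ols_true))
print("GLS variances:      ", np.diag(var_gls))
```

In this setup the true OLS variances are never smaller than the GLS variances, whatever the naive OLS figures happen to suggest.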

Since the true residual variance-covariance matrix G has N(N+1)/2 unique elements--including N(N-1)/2 unique off-diagonal elements reflecting spatial autocorrelation--which cannot be calculated from only N observations, some estimate of G must be developed. For example, we might assume a first-order autoregressive structure approximable by some function of the linear distances between observations. Given the NxN matrix of calculated distance elements dij between all observations i and j, let W represent the matrix of reciprocal elements 1/dij normalized so that row sums equal one. G can then be approximated by I-rW, where r is a scalar coefficient of autocorrelation, and the estimated generalized least-squares (EGLS) estimator of B is calculated as inv[X'inv(I-rW)X]X'inv(I-rW)Y.
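Construction of the weighting matrix W can be sketched in a few lines of Python. This is an illustration under stated assumptions: observations are distinct points in planar coordinates, and the diagonal of W is set to zero (the text does not say how the undefined 1/dii terms are handled). The function name and sample points are hypothetical.

```python
import math

def inverse_distance_weights(points):
    """Row-normalized matrix of reciprocal distances 1/dij, zero on the diagonal."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j:
                # Assumes no two points coincide, so the distance is never zero.
                W[i][j] = 1.0 / math.hypot(xi - xj, yi - yj)
        row_sum = sum(W[i])
        W[i] = [w / row_sum for w in W[i]]   # each row now sums to one
    return W

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]   # made-up coordinates
W = inverse_distance_weights(points)
for row in W:
    print(" ".join(f"{w:.3f}" for w in row))
```

Within each row, nearer observations receive larger weights. Note that row normalization generally makes W asymmetric even though the distance matrix is symmetric.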

Anselin distinguishes two types of spatial autocorrelation: nuisance autocorrelation and substantive autocorrelation. Nuisance spatial autocorrelation involves model residuals only, reduces model efficiency (i.e., increases the variances of parameter estimates), and is corrected via a spatial error model specification which can be estimated via an EGLS procedure such as described above. Substantive spatial autocorrelation involves one or more independent variables, generates model bias, and is corrected via a more complex spatial lag specification estimated via maximum likelihood methods.

Spatial Regression Packages

For spatial regression analyses, the Spatial Analysis Lab mostly uses SPACESTAT, a DOS-based econometric package written by Luc Anselin. SPACESTAT is built on GAUSS matrix manipulation routines, and functions most effectively with large allocations of extended memory and disk swap space.

SPACESTAT includes various diagnostics of spatial association, treats a fairly wide range of regression pathologies, and supports various maximum-likelihood procedures and even some bootstrapping. The program has a fairly straightforward menu interface. Most of its regression procedures use a distance matrix which SPACESTAT calculates prior to execution of actual regressions.

Alternatively, spatial regression procedures can be programmed using SAS IML, available on the University of Delaware's central UNIX system.

Case Study: Superfund Sites in New Castle County (Tanjuakio and Mackenzie)

We analyze the effects of multiple Superfund sites (i.e., toxic waste sites included on the EPA's National Priorities List under the Comprehensive Environmental Response, Compensation and Liability Act of 1980) in New Castle County, Delaware, on the values of proximate residential properties. New Castle is the most densely populated county in Delaware, containing 66 percent of the state's population on 20 percent of its total land area. The county also has a total of eleven Superfund sites in various stages of administrative action or remediation.

Block-level data were extracted from STF 1B files from the 1990 Census of Population and Housing. While New Castle County contains 5,092 1990 Census blocks, average housing values (as estimated by owner-occupant respondents) were reported for only 2,404 blocks. Other extracted block-level variables included the median age of the housing stock (AGE), the proportion of white population to total population (PCTWHITE), mean number of rooms per housing unit (NUMROOM), mean household income (INCOME), and mean commuting time to work (COMMUTE).

The STF 1B file also contains latitude and longitude coordinates of the centroids of each Census block. (We were able to import the STF1B records directly into AtlasGIS as datapoints.) These locational data were used to compute linear distances in miles between block centroids. This distance matrix was subsequently used to construct the weighting matrix for the spatial lag model as described below.
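The text does not say how the distances between centroids were computed. One common approach, assumed here purely for illustration, is the haversine great-circle formula applied to latitude and longitude in decimal degrees. The Earth radius constant, function names, and sample coordinates below are all hypothetical choices, not values from the study.

```python
import math

EARTH_RADIUS_MILES = 3958.8   # mean Earth radius (assumed constant)

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def distance_matrix(centroids):
    """Full symmetric matrix of pairwise distances between (lat, lon) centroids."""
    return [[haversine_miles(*a, *b) for b in centroids] for a in centroids]

# Made-up centroid coordinates, for illustration only.
centroids = [(39.74, -75.55), (39.68, -75.75), (39.45, -75.72)]
D = distance_matrix(centroids)
for row in D:
    print(" ".join(f"{d:7.2f}" for d in row))
```

Over an area the size of a single county, a planar approximation would give nearly the same distances; the haversine form simply avoids choosing a map projection.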

Site descriptions and centroid coordinates for each of New Castle County's eleven Superfund sites were obtained from the Division of Air and Waste Management's 1991-1992 Superfund Annual Report (Delaware Department of Natural Resources and Environmental Control). The data on Superfund site locations were combined with the block centroid coordinates to calculate linear distances of each Census block to each of the eleven Superfund sites (DISTANCE1...DISTANCE11). Reciprocals of each Superfund site's distances from the Census blocks were included as eleven additional explanatory variables in the hedonic model.

Trial models indicated that a Superfund site's influence on proximate property values is insignificant beyond two miles from the site centroid. Census blocks at least two miles from all Superfund sites were thus omitted from the regression. This reduction in degrees of freedom somewhat weakens the significance of housing attribute coefficients, but enhances the performance of the Superfund site distance variables by precluding collinearity between those variables. It also substantially reduces computation costs. The analysis thus retained 385 Census block observations located within two miles of at least one of the Superfund sites.
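The selection rule and the inverse-distance variables can be sketched as follows. The block identifiers and distances are invented for illustration, and only three sites are shown; the two-mile cutoff is the one stated above.

```python
def within_cutoff(site_distances, cutoff_miles=2.0):
    """True if the block lies closer than the cutoff to at least one site."""
    return min(site_distances) < cutoff_miles

# Hypothetical block IDs mapped to distances (in miles) to three sites.
blocks = {
    "block_a": [0.8, 5.1, 7.3],
    "block_b": [2.0, 3.4, 6.0],   # at least two miles from every site: omitted
    "block_c": [4.2, 1.9, 9.5],
    "block_d": [3.1, 2.7, 8.8],
}

# Keep only blocks within the cutoff of at least one site.
retained = {bid: dists for bid, dists in blocks.items() if within_cutoff(dists)}

# Reciprocal distances serve as the site proximity regressors.
site_vars = {bid: [1.0 / d for d in dists] for bid, dists in retained.items()}

for bid, inv in site_vars.items():
    print(bid, [round(v, 3) for v in inv])
```

A block exactly two miles from its nearest site is dropped, matching the "at least two miles" wording of the omission rule.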

Several Superfund sites were either so spatially clustered or so similarly oriented vis-a-vis Census block clusters that the proximity effects of individual sites within a cluster or linear array of sites were not readily quantifiable. The following six sites or site clusters were thus analyzed in the final models (NPL rankings for each site are included in parentheses):

  1. the Army Creek (8), Delaware Sand and Gravel (272), New Castle Spill (545) cluster;
  2. the Tybouts (2), Koppers (827), DuPont-Newport (169) cluster;
  3. the Delaware City PVC (972), Standard Chlorine (690) cluster;
  4. the Harvey and Knott Drum site (957);
  5. the Halby Chemical site (952); and
  6. the Sealand Limited site (839).

Although mean age of housing unit (AGE) and mean commuting time to work (COMMUTE) were expected to have negative influences on housing values, COMMUTE was consistently insignificant in the trial regressions, and AGE was either insignificant or positively correlated with mean housing values. These variables were dropped from the model.

The final model was thus specified:

HOUSEVAL = a + b1 ln(INCOME) + b2 ln(ROOMS) + b3 RACE + c1 (1/DISTANCE1) + ... + c6 (1/DISTANCE6),

where the subscripts 1...6 identify the specific Superfund sites described above. The coefficients c1...c6 gauge specific economic impacts of each of these sites on housing values in proximate Census blocks. These inverse distance variables are identified as SITE1 ... SITE6 in the regression output.

Comparable versions of this model were estimated via OLS and EGLS using a weighting matrix constructed from the element reciprocals of the distance matrix. For simplicity, the coefficient of autocorrelation r was assumed to equal one. This makes the EGLS procedure equivalent to weighted OLS using spatial differences: the explanatory variables X are simply premultiplied by (I-W).

Both versions of the model were estimated using SAS Proc IML. Table 1 summarizes the OLS and EGLS estimation results. The OLS model statistics generated by the regression procedure are calculated from the variance-covariance matrix Sinv[X'X] rather than the true variance-covariance Sinv[X'X]X'[G*]Xinv[X'X].

While the OLS results appear to be superior to the EGLS results, the Moran's I statistic computed from the OLS residuals, I = [e'We]/[e'e], yielded a standard deviate (z-value) of 20.24, confirming the presence of positive spatial autocorrelation in the residuals (Cliff and Ord; Upton and Fingleton). The Moran's I statistic calculated from the housing values, I = [Y'WY]/[Y'Y], yields a value of 0.3525, significantly above its expected value of -1/(n-1) or -0.0026, further confirming the hypothesized "neighborhood effect." Both of these tests indicate the appropriateness of the EGLS model, and indicate that the significance and goodness-of-fit measures of the OLS model are seriously exaggerated. The Moran's I statistic calculated from the EGLS model residuals indicates reasonably successful treatment of this spatial autocorrelation.
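The ratio form of Moran's I used here is straightforward to compute. The sketch below is illustrative only. It assumes the values are expressed as deviations from their mean (automatic for residuals from a regression with an intercept), and it uses a simple adjacent-neighbor weighting on a line of six locations instead of the study's inverse-distance matrix, so that the results can be checked by hand. Function names and data are hypothetical.

```python
def morans_i(values, W):
    """Moran's I as e'We / e'e, with e the deviations from the mean and W row-normalized."""
    n = len(values)
    mean = sum(values) / n
    e = [v - mean for v in values]
    num = sum(e[i] * W[i][j] * e[j] for i in range(n) for j in range(n))
    den = sum(x * x for x in e)
    return num / den

def chain_weights(n):
    """Row-normalized weights for n locations on a line; neighbors are adjacent locations."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
        for j in nbrs:
            W[i][j] = 1.0 / len(nbrs)
    return W

W = chain_weights(6)
clustered = [1, 1, 1, 5, 5, 5]     # similar values sit next to each other
alternating = [1, 5, 1, 5, 1, 5]   # neighbors always differ

i_clustered = morans_i(clustered, W)
i_alternating = morans_i(alternating, W)
print(f"clustered:   I = {i_clustered:.4f}")
print(f"alternating: I = {i_alternating:.4f}")
```

The clustered series gives a positive I and the alternating series a negative I, corresponding to positive and negative spatial autocorrelation respectively.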


        Table 1:  OLS vs. EGLS Regression Results       
Hedonic Analysis of New Castle County, DE, Superfund Sites

                    OLS MODEL             SPATIAL LAG MODEL
Variable       Coefficient   t-stat     Coefficient   t-stat

W_LNHVAL            -           -         -0.0085      -2.54
INTERCEPT        7.5505       19.42        7.8617      19.35
LN(INC)          0.3429        8.91        0.3160       7.94
LN(ROOMS)        0.3745        5.64        0.3765       5.69
RACE            -0.4871       -9.71       -0.4418      -8.48
SITE1           -0.4197       -2.73       -0.4418      -2.88
SITE2           -0.2416       -2.00       -0.2535      -2.10
SITE3           -0.0730       -0.43       -0.1158      -0.68
SITE4           -0.3247       -2.18       -0.3564      -2.40
SITE5           -0.1555       -2.01       -0.1329      -1.71
SITE6           -0.1855       -2.21       -0.2035      -2.43

adjR2            0.5217                    0.5251


Of the six Superfund sites included in the model, all but the Delaware PVC-Standard Chlorine cluster (Site 3) are shown to have significant negative effects on proximate property values. The coefficient magnitudes indicate a different prioritization of sites by economic damage than the National Priorities List:


Table 2: Comparison of Hedonic Damage Rankings vs. NPL Rankings, 
              New Castle County, Delaware, Superfund Sites

Site     Name                   Coefficient              NPL Rank

Site1    Army Creek-DSG-NCS      -0.4418                    2 
Site4    Harvey                  -0.3564                    6 
Site2    Tybouts-Koppers-Dupont  -0.2535                    1 
Site6    Sealand                 -0.2035                    4 
Site5    Halby                   -0.1329                    5 
Site3    PVC-SC                  -0.1158 (insignificant)    3

These empirical results demonstrate that spatial autocorrelation problems may substantially impair the reliability of conventional least-squares estimators used in hedonic analyses. In this case, positive spatial autocorrelation substantially inflated the significance and goodness-of-fit measures for the OLS model. Tests for spatial autocorrelation are fairly easy to execute, and even simple EGLS procedures such as those illustrated here can substantially improve the reliability of estimation results.


Bibliography

Anselin, Luc. 1988. Spatial Econometrics: Methods and Models. Kluwer Academic Publishers. Dordrecht (NL). A fairly comprehensive review of spatial regression pathologies and corrections written mainly for econometricians and regional science specialists.

Cliff, A.D. and J.K. Ord. 1973. Spatial Autocorrelation. Monographs in Spatial and Environmental Systems Analysis. Pion Ltd. The same level of technicality, but written for a broader audience.

Odland, John. 1988. Spatial Autocorrelation. Scientific Geography Series, Sage Publications. A nice introductory presentation.

Upton, Graham and B. Fingleton. 1985. Spatial Data Analysis by Example. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons. A broad survey of spatial statistics covering point-pattern and qualitative data (vol. 1) and categorical and directional data (vol. 2).

