Article Text

## Abstract

**OBJECTIVES** To describe the small area system developed in Finland. To illustrate the use of the system with analyses of incidence of lung cancer around an asbestos mine. To compare the performance of different spatial statistical models when applied to sparse data.

**METHODS** In the small area system, cancer and population data are available by sex, age, and socioeconomic status in adjacent “pixels”, squares of size 0.5 km × 0.5 km. The study area was partitioned into sub-areas based on estimated exposure. The original data at the pixel level were used in a spatial random field model. For comparison, standardised incidence ratios were estimated, and full bayesian and empirical bayesian models were fitted to aggregated data. Incidence of lung cancer around a former asbestos mine was used as an illustration.

**RESULTS** The spatial random field model, which has been used in former small area studies, did not converge with present fine resolution data. The number of neighbouring pixels used in smoothing had to be enlarged, and informative distributions for hyperparameters were used to stabilise the unobserved random field. The ordered spatial random field model gave lower estimates than the Poisson model. When one of the three effects of area were fixed, the model gave similar estimates with a narrower interval than the Poisson model.

**CONCLUSIONS** The use of fine resolution data and socioeconomic status as a means of controlling for confounding related to lifestyle is useful when estimating risk of cancer around point sources. However, better statistical methods are needed for spatial modelling of fine resolution data.

- disease mapping
- point source
- spatial random field model

## Statistics from Altmetric.com

Many environmental exposures originate from point sources such as factories, power plants, oil refineries, and dump areas. Because the point sources are often easy to recognise, they can be associated with strong negative attitudes and even fears.1 A suspicion of increased risk of cancer in a region can cause remarkable psychological and economic consequences.2 Therefore, public health authorities should be able to respond rapidly to questions about potentially increased risks of disease. Due to the progress in geographical information systems (GIS), methods providing rapid initial answers to questions on risks of disease in an area have become realistic.3

The Finnish Cancer Registry and National Public Health Institute have developed a small area system, which can be used to compare the risk of cancer in a freely selected area of interest with a given reference area.4 In the United Kingdom, The Small Area Health Statistics Unit has developed a similar system.5 Many analyses have been done for this purpose with study areas defined by concentric circles,6-14 with bandwidths beside sources of pollution15-17 or with confirmatory partitions based on other covariates.18

Several statistical methods are available for analyzing risk of disease around point sources—for example, those of Elliot*et al*,3
19 and Lawson*et al*.20
21 The conventional approach is to calculate the standardised incidence (or mortality) ratios (SIR or SMR). Such standardised ratios are often subject to smoothing.

The confidence intervals (CIs) for SIRs (and SMRs) are calculated with assumed independence when applying the Poisson regression.4 However, the existence of spatially correlated factors which are not found but affect the risk of cancer cannot be excluded. Therefore, lack of spatial autocorrelation is often a strong simplification. Poisson regression with neighbouring counts as covariates22 is not a solution because it is mathematically inconsistent. Therefore, hierarchical methods as suggested by Clayton and Kaldor23 and Besag*et al*
24 are more realistic approaches in spatial modelling of non-infective diseases. If exact coordinates for cases and reliable estimates for the population density exist, then point process modelling is also plausible.25
26

In this paper, we first describe the small area system developed in Finland. Then, we describe the data and define the confirmatory partition used in our illustration. With the full bayesian approach closely related to Besag *et al*, we derived the spatial random field model for estimation of risks of cancer with fine resolution data.24 We compared the results obtained by means of the spatial random field model, which uses fine resolution data with the ones obtained by other spatial methods that use aggregated data. Our main objective was not in the estimation of the environmental effects but in comparing the statistical methods to choose the best approach for the small area system.

## Methods

### THE SMALL AREA SYSTEM

The Finnish Cancer Registry often has to consider questions of whether there is an excess number of cases of cancer in a given geographic area. Previously, the analyses could only be undertaken at the level of the municipalities, which in Finland have a median population of 4800 and a median area of 380 km^{2}. To carry out analyses in smaller areas, the Finnish Cancer Registry and National Public Health Institute started to develop a system called SMASH (small area statistics on health).

The latitude and the longitude with an accuracy of 10 m for the centre of each building in Finland are given by the local housing officials. These data have been linked with the Central Population Register to obtain coordinates of the home address of each Finn. Based on these data, numbers of inhabitants by sex, age, and socioeconomic classification in squares of size 0.5 km×0.5 km (pixels) are calculated for all of Finland by Statistics Finland. Currently, SMASH holds population data for the years 1980, 1990, and 1997, separately for both sexes, for age groups (0–4, 5–9, . . .,80–84, and >85) and for socioeconomic groups (farmers, upper clerical workers, lower clerical workers, skilled workers, unskilled workers, and others). The number of inhabited pixels was 197 761 in 1980, 197 169 in 1990, and 195 870 in 1997, covering about 16% of the land area of the country.

The SMASH also holds data on cancers obtained from the Finnish Cancer Registry. Current data consist of all the cases of cancer diagnosed in 1981–97 in Finland, which are registered by the Finnish Cancer Registry. The metric coordinates of residence were linked for the individual cases of cancer by Statistics Finland. The place of residence is known on 31 December in 1980 and 1990 and on 1 January in the year of diagnosis of the cancer. Sex, age, and socioeconomic status of cases are known in the same categories as in population data.

The purpose of SMASH is to compare incidence of cancer in the area of interest with that of a selected reference area. The area, cancer, period of diagnoses, date of living in the area, age groups, socioeconomical classes, and the reference area can be chosen freely within the precision of resolution of the system (0.5 km×0.5 km). Examples are: the whole of Finland, one of the five large geographical areas used in cancer treatment administration (east, south, southwest, west, or north), old towns, new towns, rural areas, or any combination of individual areas.

In routine use, SMASH enables the user to calculate the SIRs for a selected area—that is,

where *O*
_{i} is the number of observed cases of cancer in pixel *i* which belongs to the area of interest. The expected number of cases of cancer*E _{i}
* is counted in pixel

*i*separately for each sex, age, calendar year, and socioeconomic group by multiplying the class specific incidence of cancer for sex, age, and socioeconomic status in the reference area with the class specific number of inhabitants in pixel

*i*for sex, age, and socioeconomic status. The CIs for the SIRs are calculated with the Byar's approximation formula with the Poisson assumption.27

### CASE STUDY MATERIAL

Exposure to various types of asbestos results in an increased risk of lung cancer.28 Anthophyllite asbestos mining continued between years 1918–75 in Paakkila in the eastern part of Finland. The study area is a square of size 50 km×50 km around the former asbestos mine consisting of a rural area with three built up areas. The present analyses used population data for 31 December 1980.

The cancer data consist of those people with diagnosed lung cancer (seventh revision of the international classification of diseases (ICD-7) 162) between 1981–97, registered by the Finnish Cancer Registry, and whose place of residence was within the study area on 31 December 1980. For calculation of the expected number of cases (stratified by sex, age, and socioeconomic class) all the rural areas in Finland were selected as the reference area.

In the present case study we used an initially defined confirmatory partition (fig 1). The risk area*A _{1}
* was defined as a 4 km×4 km square around the asbestos mine. The area was not centred at the mine, because prevailing wind direction in this area is from the south or southwest. The bandwidth beside the main roads away from the mine formed the risk area

*A*. The exposure in the risk area

_{2}*A*was supposed to be higher than in the rest of the study area, because of transportation of asbestos to and away from the mine. The rest of the study area formed the risk area

_{2}*A*.

_{3}### SPATIAL ANALYSIS

As in the standard use of SMASH today, we started with a non-spatial analysis of aggregated data, where the cases*O _{k}
* in the areas

*A*and

_{1}, A_{2}*A*were assumed to follow the Poisson distribution with mean

_{3}*E*. The maximum likelihood estimate of the relative risk under this model is simply SIR

_{k}λ_{k}_{k}=

*O*. The CIs were estimated according to Byar's approximation formula.27

_{k}/E_{k}As a second approach, we applied a full bayesian model to the aggregated data in each confirmatory area. In bayesian analysis initial knowledge of the phenomenon can be expressed in terms of a prior distribution. This can be chosen to represent informative expert knowledge, if available. We chose a non-informative distribution for the relative risks. The prior mean for relative risks was 1, and the variance was very large. A mean of 1 corresponds to the expected SIR=1. The posterior distribution of relative risks provided a summary of our initial knowledge and the information drawn from data. With the WinBUGS software, the posterior distribution was approximated by the numerical Markov chain Monte Carlo (MCMC) sampling method.29 We calculated the posterior estimates of relative risks with bayesian CIs. Equal tail intervals were calculated; the lower limits were at 2.5% and the upper limits were at the 97.5% percentile. Posterior probabilities were used to compare the risk levels between the confirmatory areas.

As a third approach, we applied the empirical bayesian method introduced by Clayton and Kaldor23 also to the aggregated data. In this approach the prior distribution for relative risk was derived from data. We chose a gamma distribution, with a mean equal to the mean of the SIRs, and a variance equal to the sample variance of the SIRs. We again calculated the posterior risk estimates with equal tail 95% CIs with WinBUGS. In empirical bayesian analysis any posterior probabilities cannot be calculated, because the posterior distribution is estimated.30

As a fourth approach, we used a spatial random field model. The derivation of this new model is described in the . By contrast with the three other approaches, data were not aggregated, but estimation was done with the 0.5 km×0.5 km pixels. This model is closely related to the one presented by Besag *et al*.24 The model is based on the assumption that two nearby pixels are more alike for exposure than any two pixels farther apart. This was likely to be the case, because the underlying exposure variable was the amount of asbestos particles emitted from the local asbestos mine. An analysis of the data from the register at the pixel level requires the use of a smoothed spatial statistical model, because the numbers of inhabitants and cases of cancer in pixels is very low and therefore, the risk estimates are prone to large random variation. The present spatial random field model was too complex to be calculated with WinBUGS. Hence, the program used was written by ourselves with the language C. As in a case of the full bayesian approach, we calculated the posterior probabilities and the posterior estimates of relative risks with bayesian equal tail CIs.

## Results

The spatial distribution of population at risk is shown in figure1. The number of inhabited pixels in the study area was 2051, which was 21% of the total number of pixels (table 1). The number of inhabitants varied from 1 to 782 in the inhabited pixels, whereas the total number of inhabitants was 19 825 in the whole study area. The spatial distribution of cases of lung cancer is shown in figure 2. The number of pixels with cases of cancer was 138, with 1–5 cases in each (totally 184 cases of cancer in the whole area).

We modelled the risk of lung cancer in the inhabited pixels. The spatial random field model did not work with the fine resolution data without some stabilising methods. In the derivation of the random field, equation (A.3), we started with eight neighbouring pixels. Because of the large number of isolated pixels we needed to enlarge the neighbourhood to cover 20 neighbouring pixels. Also, the smoothing parameter, *κ* in equation (A.3), was problematic because the data contain only a small amount of information on the smoothing level. Hence we had to set restrictions at baseline on the variations of *κ* to obtain a stable Markov chain in the MCMC simulation. We tried to keep the restrictions as weak as possible. Because of problems with convergence, an informative distribution was also selected for the variances of the effect of area, *δ _{k}
*s, to allow for large variations in the effects of area,

*ξ*s in equation (A.2). This was still not enough to achieve convergence, and we had to impose some restrictions on the effects of area. We applied two approaches. Firstly, based on SIRs we assumed that the risk of disease decreases with distance from the point source. bayesian methods allow order restrictions to be assigned to the model.30 So we assigned an order

_{k}*ξ*⩾

_{1}*ξ*⩾

_{2}*ξ*. The joint baseline distribution was thus

_{3}*p*(

*ξ*,

_{1}*ξ*,

_{2}*ξ*) ∝

_{3}*p(ξ*)

_{1}*p*(

*ξ*)

_{2}*p*(

*ξ*)1

_{3}_{{ξ1⩾ ξ2 ⩾ξ3}}, where the indicator function 1

_{{ξ1 ⩾ξ2 ⩾ξ3}}is 1 if

*ξ*⩾

_{1}*ξ*⩾

_{2}*ξ*

_{3}, and zero otherwise. The order was also supported by the analyses of the aggregated data. Secondly, because we were not so interested in the effect of distant area

*ξ*, we fixed it to be equal to the SIR in the sub-area

_{3}*A*.

_{3}The estimates of the parameters based on area are shown in table 2. Maximum likelihood estimates (SIRs) of relative risks indicate that the risk of cancer even within the area*A _{3}
* was significantly higher than in rural areas on average. The estimates given by the full bayesian model with non-informative distributions were the same, with slightly narrower CIs, as expected. In the empirical bayesian model, the estimate of relative risk in the area

*A*decreased but in the other areas they stayed almost the same. This is typical for the empirical bayesian method due to the shrinkage effect. The spatial random field model with the effect of

_{1}*A*fixed gave slightly higher estimates than SIRs, but the intervals were narrower. Instead, the ordered spatial random field model gave clearly lower estimates.

_{3}With the full bayesian approaches we can simulate the exact posterior distribution to compare the area specific risks to each other or to the mean risk in rural areas, all this in terms of posterior probabilities (table 3). The risk of lung cancer decreased when the distance from the asbestos mine increased. On the other hand, in two areas nearest to the asbestos mine the risks of lung cancer were still higher than in rural areas in Finland in 1981–97 on average. In the most distant area,*A _{3}
*, the risk of lung cancer was higher than in rural areas on average by the full bayesian model, but not by the ordered spatial random field model.

## Discussion

A small area system with fine resolution data is expected to be more informative than conventional aggregated data for estimation of risks of cancer. On the other hand sparse data cause methodological and computational problems. The spatial random field models developed for conventional small areas could not be straightforwardly adapted to the present system of small areas with very few inhabitants and with no or very few neighbours. Therefore, better statistical methods have to be developed for analyzing fine resolution data on incidence of cancer.

Public health authorities need to be able to react rapidly to suspected increases in risks of cancer due to the environmental pollution around point sources.3 In SMASH, the system developed in Finland, calculation of SIRs can be done within a couple of hours. The system also allows the study area to be chosen freely with an accuracy of 0.5 km. The main weakness currently in the interpretation of the results from SMASH is that exposure assessment is solely based on place of residence at one time and there is no accounting for migration or daily commuting. Socioeconomic status is known to be a good proxy of risk factors strongly related to social class, such as smoking, healthy diet, etc.31 However, often data on other confounding variables would also be needed for reliable inference. Therefore comparisons by area, as in SMASH, can only provide preliminary results on risks of cancer, which then can be used to help to decide on the need for further investigations, which often include extensive fieldwork.3

Due to the low numbers in the present analyses, indirect standardisation and SIRs were the only feasible method of analyses. The SIRs produce a valid comparison between the area of interest and the reference area. However, due to the indirect standardisation used, two SIRs are mutually comparable only if there is no difference in stratum specific relative risks.32 In the present study, the strata that were adjusted for indirect standardisation were sex, age, and socioeconomic status. The only major difference between study areas was the lower proportion of farmers in sub-area*A _{1}
*. However, the relative risks between study areas among farmers and non-farmers were quite similar. Therefore, the SIRs in the present study are comparable.

We found an increased risk of lung cancer around the former asbestos mine. It is known that asbestos miners in Paakkila have a 2.8-fold increased risk of lung cancer.33 In the present case study we were not able to use the occupational exposure of the ex-miners as a covariate or confounder, because our broad standard socioeconomic classification did not include an indicator of former work with asbestos. If the matrix on population and cancer data had been developed especially for studying the problem of environmental exposure to asbestos, the population in each pixel could be classified into asbestos miners and non-miners by linking the personal identifiers with the existing cohort of asbestos miners.33 It should also be noted that our data may not best exemplify a study on environmental exposures, because the main exposure to asbestos is occupational and at one geographical location (in the factory). Holopainen*et al*
34 calculated the SIRs for lung cancer in 1953–95 in the municipality of Tuusniemi, where the asbestos mine is. They found that the SIR of lung cancer among the total population was 1.08, among occupationally exposed subjects it was 3.0, and among non-occupationally exposed people it was 0.89. In the present analyses, the risk estimates in the area*A _{1}
* were similar to the SIR among miners and the estimate in the area

*A*were similar to the SIR among inhabitants of the Tuusniemi district. Therefore, occupational exposure seems to be the main cause of the increased risk of lung cancer among inhabitants living near the asbestos mine.

_{3}Georeferenced data are often spatially autocorrelated.35Spatial statistical modelling, as in the present spatial random field model, is then a useful approach because it accounts for spatial autocorrelation. One of the strengths of the spatial random field model is that it exploits information from the neighbouring pixels for smoothing the risks towards the local mean. Estimation uncertainty of risk at the pixel level is also possible, because the model provides exact estimates of the precision for each pixel—that is, posterior variances. Statistical strength of the spatial random field model and the full bayesian model is that the assumption of the Poisson process is usually valid conditionally, in a hierarchical model in a case of non-infectious diseases. Another strength of the spatial random field model and other bayesian methods is that prior knowledge about the subject of analysis can be used if available. Also, the interpretation of the bayesian CI is convenient, because it consideres the actual probability that the value of an unknown variable lies between the limits of the interval. When the posterior distribution is represented by a numerical sample generated by MCMC, the computation of approximate posterior probabilities of functions of the variables in the model is straightforward.

The spatial random field model presented by Besag*et al*
24 is commonly used for mapping incidences of disease, and it has proved to be one of the best approaches to mapping disease.21 The estimation of the spatial random field model is, however, quite laborious and we also found several problems with convergence when the model was used with sparse data. If data are very few or pixels have only few neighbours, smoothing of the risk estimates can be difficult. This can be partly overcome by enlargening the neighbourhood area with some reasonable weighting. However, this should be done carefully, because at the same time one loses precision in reference area. Another way to help the spatial random field model to converge is to choose informative distributions for hyperparameters. With fine resolution data other restrictions may still be needed. In this paper we applied two types of restrictions. Ordering the effects of area did not work. Fixing one of the three effects of area proved to be a useful limitation, as the model converged. Also the sizes of the estimates were reasonable compared with SIRs, and the estimates had narrower CIs.

In approaches with aggregated data, it should be noted that aggregation is a way of smoothing the data. After aggregation, estimation or mapping of risk of cancer with finer resolution is impossible, because the area risks are the only information available. After all, if the data are very sparse, then aggregation into larger areas is reasonable, but it is an open question how large the areas should be.

The approach of estimating the SIRs is easy and fast. Main statistical problems with SIRs are associated with the Poisson assumption and with the lack of spatial autocorrelation. Although the sum of independent numbers of observed cases is Poisson distributed, this is no longer the case if spatial autocorrelation exists and the resulting CIs may be wrong. For reliable inference, the SIRs require quite large areas—that is, large populations, because when the expected number of cases is small the CIs of SIRs can become very wide. The requirement of aggregation to large areas may be a contradiction in the study of point sources. However, SIRs are useful summaries of data at an early stage of modelling.

In conclusion, the small area system, SMASH, developed in Finland is useful for estimation of risks of cancer in a small area due to the use of fine resolution data and availability of socioeconomic status, which is a major confounder. On the other hand, sparse data can cause stabilising problems in the computing. The spatial random field model applied in previous small area studies did not work straightforwardly with fine resolution and sparse data. Therefore, better statistical methods are needed for spatial modelling of fine resolution data.

## Acknowledgments

The Academy of Finland through the project “Small area analyses of cancer incidence around a point source (42559)” and The Ellen and Artturi Nyyssönen Foundation have supported the work of EK. The European Community through the project “A European environment and health information system for exposure and disease mapping and risk assessment (VS/1999/5279 (99CVF2–606))” has supported the work of JR. The Academy of Finland has supported AP through a senior researcher grant (385).

## Appendix

Derivation of the spatial random field model

The risk of cancer in pixel *i*is denoted by *θ _{i}
*,

*i*= 1, . . .,I. Hence, the collection (

*θ*) seen at the pixel locations is called a disease map and is the objective of statistical inference. We assume, that given (

_{1}, . . .,θ_{I}*θ*), the observed cases

_{1}, . . .,θ_{I}*O*are conditionally independent and follow the Poisson distribution with mean:

_{i}Covariate information on sex, age, and socioeconomic status was considered through the expected number of cases*E _{i}
*. The dummy variable

*h*contained the effect of distance measured from the fixed point source. The unmeasured covariate

_{i}*ζ*= (

*ζ*) was specified as a spatially correlated random field. The role of this random effect was in the smoothing of the risks of cancer and in the exchange of information between nearby pixels. The likelihood function can be written as a product:

_{1}, . . .,ζ_{I}In the analysis, we present the distance effect as a log linear model with the confirmatory partition described in methods section, so that:

where 1_{k}
*(i)* is 1 if pixel *i* was located in the sub-area*k*, and zero otherwise. The confirmatory partition is convenient to use with SMASH because the source of exposure of pollutant is not always a point source. For example, a fixed source can be a river, power line, or a dump area. The sub-areas are also allowed to be asymmetric. Conditional on*E*,* h*, and*ζ* the model (A.1) is loglinear with:

where *E _{i}
* was treated as an offset and

*η*= log(

_{i}*ζ*).

_{i}We assumed that the effect of area (effect distance from the point source) varies around unity in the risk areas. Therefore, baseline distributions for *ξ _{k}
*s were assumed to be normal with zero mean and variance

*δ*. We assigned inverse gamma distribution with fixed parameters as the prior of variances

_{k}*δ*s.

_{k}The prior derivation for the random field*η = (η _{1}, . . .,η_{I})* resembles the derivation by Besag

*et al*24. We assumed that

*η*is an intrinsic gaussian random field defined by the joint improper distribution:

where *κ* has the role of a smoothing parameter and *w _{ij}
* is a weight relating the neighbouring pixels

*i*and

*j*. The prior for

*κ*was assumed to be gamma distribution with fixed variables. We chose the weights

*w*to be one for all neighbouring pixels.

_{ij}The posterior distribution was calculated by MCMC simulation. In the derivation of the Markov chain we applied a single site Metropolis updating scheme for*η _{i}
*s and

*ξ*s and Gibbs updating for

_{k}*κ*and

*δ*s these being conventional choices. For further information, see Gilks

_{k}*et al*.36