Applied Spatial Analysis and Policy, Volume 5, Issue 1, pp 25–49

# The Segmentation of Local Government Areas: Creating a New Geography of Nigeria

• Daniel Vickers
• Dimitris Ballas

## Abstract

Social area classifications group areas on the basis of social or socio-economic similarity into clusters that summarise their demographic and social characteristics. The methods used to create these systems combine geographic thought and theory with statistical manipulation of multivariate data. The development and use of geodemographic systems appear to be limited in developing countries, and some commentators suggest that area classifications may not offer benefits to these countries. This paper argues that the developing world has much to gain from this type of geography. It presents the case of Nigeria, where a classification system has been developed for the country's 774 Local Government Areas (LGAs). Insight is provided into the variables and methodological approach used to create the Nigerian system.

### Keywords

Nigeria, Area classifications, Geodemographics, Local Government Areas

## Introduction

The classification of areas concentrates on grouping geographical units based on the socio-economic characteristics of their residents (Vickers and Rees 2006). Early work on social and geographical area classification is generally credited to a survey conducted by the social reformer Charles Booth in 19th-century England. This was a massive enquiry into the social and economic conditions of the people of London (Orford et al. 2002), which led to the production of detailed maps showing the social class of London at street level.

It is therefore no coincidence that most of the development and use of segmentation techniques has been concentrated in developed countries. The benefits of understanding and identifying people with similar characteristics have been greatly exploited by the commercial sector in most developed world countries (Harris et al. 2005). However, it is noteworthy that there is increasing public sector use of geodemographic techniques (Brown et al. 2000; Abbas et al. 2009; Singleton and Longley 2009) for informed decision making.

Many developing world countries face challenges that are captured in the Millennium Development Goals (MDGs) (U.N. 2007). The goals address a variety of problems for different segments of the population. This in itself suggests that different kinds of advantage or disadvantage vary in the ways in which they relate to the population. Understanding the spatial variation of socio-cultural or economic advantage or disadvantage and reaching the most appropriate target group is therefore a priority for governments and policy makers in most developing world countries.

This paper discusses an attempt to build, for the first time, a geodemographic classification system for Nigeria. The country is endowed with abundant natural resources and potential in human capital (World Bank 1996a, b). However, a key challenge that has faced successive governments is how to ensure that the benefits of these resources trickle down to the most deprived segments of the population (Ogunbodede 2006; Okonjo-Iweala and Osafo-Kwaako 2007). The chasm of inequality therefore continues to widen.

The Local Government Area (LGA) administrative level is the geographical scale at which national and state governments expect the impacts of some of their policies to reach people at the grass roots. Unfortunately, little country-wide spatial analysis has been conducted at this scale. Using national survey datasets supplied by the National Bureau of Statistics (NBS) and some information from the 2006 Census, cluster analysis techniques have been deployed on Nigerian spatial statistics to investigate the possibility of assigning each LGA to a group based on general similarity. The paper provides a review of the reasoning behind the variables selected and elucidates the analytical methodologies used to derive the system.

## Geodemographics and Classification of People in Taxonomic Space

Social area classification is a way of segmenting geographical units into groups based on the socio-economic features of their residents (Vickers and Rees 2006). In other words, the typology assigned to an areal unit (LGAs in this case) is not only indicative of the nature of such geographical area, but also speaks about the general characteristics of the residents (Harris et al. 2005).

Much research within the sphere of geodemographics is centered on a view that people cluster together as a result of similar characteristics. As Foley (1997, p. 6) points out:

The theory of geodemographics is based on the fact that similar people tend to cluster together and households in the same postcode sector or enumeration district can be placed in the same category.

The statement above draws upon a sentence which was used to justify heuristic calculations in an urban growth simulation and eventually became known as the first law of geography. “Everything is related to everything else, but near things are more related than distant things” (Tobler 1970, p. 236). Vickers (2006, p. 16) refined Tobler’s law to fit geodemographics and coined the following statement:

People who live in the same neighbourhood are more similar than those who live in a different neighbourhood, but they may be just as similar to people in another neighbourhood in a different place.

Area segmentation and its geodemographic underpinnings assume that there is a link between the socio-economic landscape and the clustering of groups of behaviours.

The origin of area classifications is generally credited to a survey conducted by the social reformer Charles Booth in the 19th century. This was a massive enquiry into the social and economic conditions of the people of London (Orford et al. 2002), which led to an innovation. For the first time, detailed maps showing the social class of London at street level were produced.

Booth’s survey, which was aimed at showing that poverty trends could be measured accurately, revealed that the incidence of poverty in London was far greater than he had imagined. From his results, he concluded that 30.7% of the city’s population was below the poverty line (Simey and Simey 1960).

The work of the Chicago School of urban sociologists characterised the next phase of developments. The group at the University of Chicago consisted of urban sociologists who worked on a number of representations of the social structure of cities (Robson 1971). The concentric ring model was developed by Ernest Burgess in 1925 and is patterned after von Thünen’s theory, which relates to rural or agricultural land around a city or market centre. Other models developed by the school include the sector theory developed by Homer Hoyt in 1939, which holds that high-rent residential neighbourhoods are instrumental in shaping the land use structure of a city, and the multiple nuclei model presented by Harris and Ullman in 1945, which holds that a number of separate nuclei shape the land use pattern of a particular city.

Contemporary statistical approaches to understanding how people congregate within geographical areas resulted from the publication of census results for administrative areas at small scales. Developments in the United States were driven by the work of Jonathan Robbin, who is credited with pioneering contemporary, computer-based geodemographic systems (Goss 1995). In the United Kingdom the work of Webber (1977) also served as a springboard for the development of modern-day geodemographics (Burrows and Gane 2006; Vickers 2006; Harris et al. 2005).

The spread of geodemographic techniques across the rest of the world as depicted in Fig. 1 has been relatively slow. Most of the companies that have capitalised on the techniques in recent times are commercial in nature aimed at market targeting. Their proliferation has contributed to the addition of non-census datasets often based on sample surveys of consumer choices and behaviours to available census statistics. The addition of such data to the classification systems helps fill the gap of trends not covered by the census according to Harris et al. (2005).

Amongst the countries that have adopted geodemographic techniques, only the United Kingdom benefits from an open-source and freely available system (Vickers 2006). Clearly, the map reflects the absence of geodemographic technologies in many developing world countries. Developing world countries are faced with significant socio-economic challenges which make planning, resource allocation and policy targeting important issues for sustainable development. Over time, some of these countries’ statistical departments have accumulated volumes of national data relating to socio-economics and demography. These have been under-utilised, partly due to restricted access and obsolete storage formats.

Developing countries can benefit tremendously from applying geodemographic techniques to their analysis and interpretation of social and spatial inequalities. For instance, geodemographics can be used to encourage and strengthen the concept of national identity by showing the inter- and intra-regional strength of the similarity between residents of close and distant areas.

Geodemographic segmentation systems present an option for the investigation of local level inequalities especially for the data-scarce countries of the developing world. Many of the MDG indicators used by the UN and partner agencies are derived from surveys which are usually analysed and reported at higher levels of geography. By plugging some of these indicators into a geodemographic system, local level disparities can be revealed.

The process of targeting interventions requires mechanisms for identifying special populations or sometimes vulnerable population groups. As a result of their multivariate characteristic, geodemographic systems offer policy makers the option of targeting strategic interventions and ultimately saving cost. This will be of particular benefit to developing countries in a time of global economic crunch.

## Social and Geographical Make-up of Nigeria

Nigeria has a population of over 140 million people (NBS 2006a, b). On average, the population is growing at a rate of 2.4% per year (World Bank 2007a). Between 1990 and 2005, an estimated 70.8% of the population lived on less than one United States Dollar a day (World Bank 2007b). A dollar per day is conventionally adopted as a standard for benchmarking the poverty line (NBS 2005). The country is currently classified as a low income country.

Comprising an area totalling 356,669 square miles (Gordon 2003), Nigeria has a rich base of natural resources and ranks among the top ten oil exporters globally, exporting about 2.15 million barrels per day (EIA 2006). The country maintains a high-profile economic and political status on the African continent. Indeed, some commentators (Gordon 2003) agree that the economic stability of several African countries hinges on the political and economic stability of Nigeria.

Nigeria is made up of more than 250 ethnic groups each with its local language (dialect) and multiple religions. The Nigerian people are characterized by different priorities which make it difficult to govern the country.

The creation of administrative boundaries, particularly at the regional level, closely matches the geographical distribution of Nigerian ethnic groups. From a holistic perspective, there are three major ethnic groups: the Yoruba, found in the South West; the Igbo, dominating the South East; and the Hausa-Fulani groups, concentrated in the North. Nigeria’s top level of administrative geography comprises six geo-political zones shown in Fig. 2. While the South West, South East and North West/North East are dominated by the three major ethnic groups, the North Central and South-South zones are composed of less populous ethnic groups.

Each geopolitical zone is further subdivided into states, of which there are 36. At the state-level geography, there is also a national capital city called Abuja, located centrally in the North Central zone. The administrative boundaries at state level are therefore technically 37 in number. Each state is administered by a state government with a governor as the head. States are further split into smaller geographies called Local Government Areas (LGAs). It has been argued that the LGA level is where most policy benefits can reach the masses at the grass roots (Olowu and Ayo 1985). There are 774 Nigerian LGAs. Below LGAs are Enumeration Areas (EAs), which are traditionally used for census enumeration and number about 662,000.

The dataset for the analysis was supplied by the Nigerian National Bureau of Statistics (NBS) for all the 774 LGAs of the country.

## The Selection of Initial Variables

Cluster analysis methods have been streamlined into a number of steps identified by Milligan (1996, 2003). The first step is to identify the objects to be clustered, which in this case are Nigerian LGAs. Once the objects are identified, a choice of input variables is required.

The choice of variables for inclusion in the Nigerian classification took into account a number of theoretical and statistical issues. The dataset made available for the 774 LGAs provided a total of 644 variables, giving 498,456 data points (i.e. 644 × 774). As a starting point, each variable was assigned to one of the ten domains listed below.

• Agriculture
• Demographic
• Education
• Employment
• Health
• Household Composition
• Household Infrastructure
• Housing
• Socio-economic
• Women and Children

An initial list of 125 variables was selected with the intention of ensuring a relatively representative number of variables across all the domains. Secondly, variables with multiple missing values were avoided.

Other principles applied in the initial selection relate to the policy relevance and updatability of the variables. Variables which were deemed to provide useful proxies for assessing some of Nigeria’s current policy programmes, especially within the framework of the Millennium Development Goals (MDGs), were considered. Such policy programmes include the National Economic Empowerment and Development Strategy (NEEDS) (NPC 2004) and the Universal Basic Education Project (UBE) (UNESCO 2000).

An extensive review of literature (too voluminous for inclusion in this paper) was conducted on the different policy programmes and their objectives. This helped the authors identify the measurable population characteristic that best suits each policy programme. Different researchers could well select a different list of variables for initial consideration; as Harris et al. (2005) note, the process of selecting initial variables can be subjective.

## The Analysis of Principal Components

The first set of variables was eliminated by analysing the variables on a domain-by-domain basis. For instance all ten (10) variables contained in the education domain were tested against each other. This process (intra-domain variable reduction) resulted in the first set of reductions for each domain. The next step was to evaluate all variables which survived the first test together irrespective of their domains (i.e. inter-domain variable reduction).

Principal components analysis (PCA) is an analytical technique used to restructure the correlations existing within a dataset. It works by transforming a set of correlated variables into a smaller set of uncorrelated variables (Dunteman 1989). The use of PCA in cluster analysis can help mitigate the problem of redundancy (Harris et al. 2005) and also identify which variables are likely to have significant influence on the dataset (Jolliffe 2002).

The results produced by the PCA were analysed by examining the component loading matrix for each principal component. Normally, the first principal component accounts for most of the variability within the dataset. Each subsequent component accounts for as much of the remaining variability as possible. Typically “the sum of the variances of the principal components is equal to the sum of the variances of the original variables” (Dunteman 1989, p. 17).

Table 1 shows that the variable on houses built with cement or sandcrete has the highest correlation with principal component 1 (0.89). Squaring this correlation implies that 79% of the variance existing within the variable is explained by component 1.

Table 1

| Variable | Correlation with component 1 |
|---|---|
| Built with cement/sandcrete | 0.89 |
| Motorcycle ownership | 0.88 |
| Completed secondary education | 0.88 |
| Ownership of mobile phone | 0.86 |
| Built with mud/mud bricks | −0.84 |
| Uneducated | −0.84 |
|  | −0.83 |
| Post secondary education | 0.82 |
| Non-wood fuel for cooking | 0.82 |
| Own a house | −0.80 |
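
The link between a loading and the share of variance it explains can be checked numerically. The sketch below is illustrative only: it assumes NumPy is available, uses randomly generated toy data rather than the NBS dataset, and the helper name `pca_loadings` is hypothetical. It runs PCA on the correlation matrix, so each loading is the correlation between a variable and a component, and its square is the share of that variable's variance explained by the component (0.89² ≈ 0.79 for the cement/sandcrete variable).

```python
import numpy as np

def pca_loadings(data):
    """Return (loadings, eigenvalues) for a PCA on the correlation matrix.

    loadings[i, j] is the correlation between variable i and component j.
    """
    X = np.asarray(data, dtype=float)
    # Standardise each column so the analysis uses correlations, not covariances.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = (Z.T @ Z) / Z.shape[0]
    eigvals, eigvecs = np.linalg.eigh(corr)      # ascending order
    order = np.argsort(eigvals)[::-1]            # largest component first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return loadings, eigvals

# Toy data (hypothetical): three variables driven by one latent factor, one pure noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
toy = np.column_stack([
    latent + 0.3 * rng.normal(size=200),
    latent + 0.3 * rng.normal(size=200),
    -latent + 0.3 * rng.normal(size=200),
    rng.normal(size=200),
])

loadings, eigvals = pca_loadings(toy)
share_explained = loadings[:, 0] ** 2   # variance of each variable explained by component 1
```

Because the toy variables are standardised, the eigenvalues sum to the number of variables, which is the property quoted from Dunteman (1989) applied to unit-variance inputs.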

## Population Proportional Representation

The proportion of the population for which each variable accounts was also considered. It is necessary to avoid variables which represent only a small proportion of the population (Vickers 2006). When a variable represents only a small proportion of the population, it tends to be volatile and to change rapidly over time. Such a variable would not sustain the longevity of the classification. The percentage of households that built their residences with cardboard represented the smallest sample across LGAs. At first glance this variable appears interesting, but further examination revealed uninteresting patterns. On average, the variable represents approximately 88 people per LGA. In the dataset, there are numerous zones in which the variable is unrepresented or under-represented.

Another problem that can result from variables with small sample proportions is that they provide little distinctive information for naming and profiling the clusters created from the analysis (Harris et al. 2005). A solution considered for some of the variables was to merge them (where they fall under the same domain and share a similar base). For instance the variables on children not in school due to early marriage and children not in school due to teenage pregnancy were merged as they shared the same base (100%) and fall under the same domain.
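
The screening described in this section could be sketched along the following lines. This is a hedged illustration rather than the authors' procedure: the function names, the two thresholds (`min_mean`, `max_zero_share`) and the counts are all hypothetical, chosen only to show how a thinly represented variable might be flagged for exclusion or merging.

```python
def representation_summary(counts):
    """Return (mean count per LGA, share of LGAs where the count is zero)."""
    n = len(counts)
    mean_count = sum(counts) / n
    share_zero = sum(1 for c in counts if c == 0) / n
    return mean_count, share_zero

def is_thinly_represented(counts, min_mean=100, max_zero_share=0.25):
    """Flag a variable that covers few people or is absent from many LGAs.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    mean_count, share_zero = representation_summary(counts)
    return mean_count < min_mean or share_zero > max_zero_share

# Hypothetical person counts per LGA for two variables.
cardboard_housing = [0, 0, 12, 350, 0, 166, 0, 88]
mud_housing = [1200, 800, 950, 400]

flag_cardboard = is_thinly_represented(cardboard_housing)
flag_mud = is_thinly_represented(mud_housing)
```

A flagged variable is a candidate either for removal or for merging with another variable in the same domain that shares its base.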

## Issues Surrounding Skew

In addition to the examination of the variables' sample proportions, the skew exhibited by the variables was also considered. It is desirable to include approximately normally distributed variables in the classification because outliers are likely to create artificial clusters. Positively skewed variables were avoided in the variable selection process. Positive skew occurs where values accumulate at the lower end of the distribution while a few outliers or extreme values stretch its upper tail (Harris et al. 2005; Table 2).

Table 2 Variables with the largest positive skews

| Variable | Skew |
|---|---|
| % of households which always find it difficult to pay house rent | 11.25 |
| % of households built with cardboard | 11.15 |
| % of children not in school due to teenage pregnancy | 8.87 |
| Population Density | 8.09 |
| % of households built with stone | 7.41 |

The problem with most of these variables is that they identify small proportions of the population hence they concentrate at the lower end of the 0–100% scale. Variables that would work well within the classification are those that spread in their variation across geographic areas.

The problem of skew was examined very closely. A variable such as population density would normally be expected to exhibit this trait, but it is relevant to the classification as a useful discriminator of the rural-urban divide. A common solution suggested by Harris et al. (2005) is to transform the data. Log transformation was used in this exercise.
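
The effect of a log transformation on skew can be illustrated with a short sketch. Everything here is an assumption made for illustration: the density values are invented, the skew statistic is the population moment coefficient (the paper does not say which formula was used), and the offset of 1 is one common way to keep zero values defined.

```python
import math

def skewness(values):
    """Moment coefficient of skewness, population form (m3 / m2**1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def log_transform(values, offset=1.0):
    """Natural log after adding an offset so that zeros remain defined."""
    return [math.log(v + offset) for v in values]

# Hypothetical population densities: most LGAs are sparse, a few are very dense.
density = [10, 20, 40, 80, 160, 320, 640, 1280, 2560]

raw_skew = skewness(density)                  # strongly positive
log_skew = skewness(log_transform(density))   # close to zero for this toy series
```

For this toy series the raw values have a long upper tail, while their logarithms are almost evenly spaced, which is why the transformed skew is close to zero.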

## Variable Cross-Correlations

Consideration was also given to the relationships between the variables within the dataset. The inclusion of two highly related variables in a clustering algorithm will often result in the repetition of the same information or population behaviour (Milligan and Stephen 2003). This can give undue advantage to such behaviour and can also mask other important underlying characteristics existing within the population.

The coefficient of correlation was used as the statistic for examining the relationship between variables. Correlations can be positive or negative. Neither high positive nor high negative correlations are desirable within the dataset, as both indicate redundancy.

Some high correlations were observed within the dataset. In order to illustrate these different types of correlations a domain-by-domain data reduction process was first embarked upon. Having examined the dataset on a domain-by-domain basis, all the domains were examined together. This allowed for the identification of another type of relationship between variables. Table 3 identifies some variables within different domains which do not share the same denominators but are related.

Table 3 Relationship between variables in different domains

| Variable | Domain | Variable | Domain | Correlation | Redundancy (%) |
|---|---|---|---|---|---|
| Owns 2–10 cattle | A | Uneducated Population | E | 0.82 | 67 |
| Never married | D | Uneducated Population | E | −0.78 | 61 |
| Post-secondary education | E | Ownership of mobile phones | S | 0.78 | 60 |
| Renting a house | H | Difficulty in paying house rent-sometimes | S | 0.77 | 59 |
| Owns 2–10 cattle | A |  | E | 0.75 | 57 |
|  | E | Built with cement/sandcrete | HO | −0.75 | 56 |
| Post-secondary education | E | Non-wood fuel for cooking | HI | 0.74 | 55 |
| Private-formal employment | EM | Ownership of mobile phones | HI | 0.74 | 54 |
| Post-secondary education | E | Renting a house | HO | 0.72 | 52 |
| Age 0–14 | D | Completed secondary education | E | −0.71 | 51 |
| Never married | D |  | E | −0.71 | 51 |

Where: A Agriculture, D Demographic, E Education, EM Employment, H Health, HC Household Composition, HI Household Infrastructure, HO Housing, S Socio-economic, WC Women and Children

The relationships arise because one variable is able to explain the variation existing within the other. In other words, the characteristics captured by one variable are also reflected in the other. The percentage of people who own 2–10 cattle, for instance, has a high positive correlation of 0.82 with the percentage of people who are uneducated. The positive nature of the correlation suggests that people who are not educated and people who own cattle tend to live within the same localities.
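
The redundancy figures in Table 3 are consistent with the squared correlation expressed as a percentage (0.82² ≈ 67%). A minimal sketch of that calculation follows; the variable values are invented, the function names are hypothetical, and the 0.7 cut-off is an assumption suggested only by the lowest correlations listed in the table.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def redundancy_percent(r):
    """Shared variance between two variables, as a percentage (100 * r^2)."""
    return 100.0 * r * r

def redundant_pairs(columns, threshold=0.7):
    """List (name_a, name_b, r, redundancy %) for pairs with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson_r(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r, redundancy_percent(r)))
    # Strongest relationships first, regardless of sign.
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Hypothetical percentages for five LGAs.
columns = {
    "owns_cattle": [5, 20, 35, 50, 10],
    "uneducated": [10, 35, 60, 80, 25],
    "never_married": [40, 25, 15, 5, 35],
}
flagged = redundant_pairs(columns)
```

Using the absolute value of the correlation treats strong negative relationships as redundant too, in line with the point that both signs are undesirable.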

## Geographic Variation of Variables

Following the examination of correlations, the manner in which variables vary across the LGAs was considered. A useful statistic for measuring the geographic variation of variables is the standard deviation. In Table 4, the percentage of households owning motorcycles combines a mean of 31.98 with a standard deviation of 28.56. If the values are roughly normally distributed, about two thirds of the LGA values lie within one standard deviation of the mean, that is, between 3.42 and 60.54.

Table 4 Variables with high standard deviations

| Variable | Mean | SD |
|---|---|---|
| % of households built with cement/sandcrete | 40.14 | 33.22 |
| % of households built with mud/mud bricks | 54.45 | 31.8 |
| % of households using agricultural inputs | 40.70 | 30.74 |
|  | 50.68 | 28.96 |
| % of households owning motorcycles | 31.98 | 28.56 |
|  | 45.63 | 28.56 |
| % of single room housing unit | 66.52 | 28.13 |
| % of whole buildings | 26.87 | 27.98 |
| % of the total population uneducated | 39.60 | 26.55 |
| % of households owning less than 1 ha of land | 26.42 | 26.09 |

Variables with larger standard deviations will prove more useful than those with lower values, because they present better distinctions between areas. However, it is important to consider the sample size of a variable across the LGAs, especially when dealing with variables with low standard deviations. Some of these variables can be merged with other variables and renamed. A variable that was created in this manner is cattle ownership. The percentage of people owning over 50 cattle showed little variation across the geographic areas, with a standard deviation of 1.42. This variable was merged with three other variables within the same domain that share the same denominator: the percentages of people owning 2–10 cattle, 11–20 cattle and 21–50 cattle. By merging these variables, a new variable called cattle-ownership was created. The standard deviation for the new variable is 22.19, a significant improvement on the value of 1.42.
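
The two uses of the standard deviation in this section, reading a one-standard-deviation band around the mean and comparing a variable before and after merging, can be sketched as follows. This is a minimal illustration, not the authors' procedure: the cattle percentages are hypothetical, the helper name `one_sd_band` is an assumption, and the population standard deviation is used because the paper does not say which form was applied.

```python
from statistics import pstdev

def one_sd_band(avg, sd):
    """Interval covering one standard deviation either side of the mean."""
    return avg - sd, avg + sd

# Mean and SD reported in Table 4 for households owning motorcycles.
low, high = one_sd_band(31.98, 28.56)

# Hypothetical percentages for five LGAs; all four bands share one denominator.
owns_2_10 = [5.0, 30.0, 12.0, 45.0, 2.0]
owns_11_20 = [1.0, 10.0, 4.0, 20.0, 0.5]
owns_21_50 = [0.5, 5.0, 2.0, 10.0, 0.1]
owns_over_50 = [0.0, 1.0, 0.5, 3.0, 0.2]

# Bands that share a denominator and do not overlap can simply be added.
cattle_ownership = [
    sum(parts) for parts in zip(owns_2_10, owns_11_20, owns_21_50, owns_over_50)
]

sd_before = pstdev(owns_over_50)      # the sparse band varies little
sd_after = pstdev(cattle_ownership)   # the merged variable varies much more
```

Adding the bands is only valid when they are mutually exclusive categories of the same base, which is the condition stated in the text.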

Making decisions about the final choice of variables is a prolonged activity. It may be practically impossible for two different classification developers to arrive at identical lists of variables, because a number of subjective decisions are made in the course of the selection process. For instance, the variable on percentage of flats initially seemed to be very interesting. However, it was eventually excluded because of its behaviour during the clustering process: it added to the geo-political regional divide of the country and masked some underlying distinctive features of the LGAs. The preceding discussion shows that trimming down the initial list for the Nigerian classification required detailed theoretical and analytical work. In summary, the exercise commenced with 644 variables, from which 125 were selected. The 125 variables were reduced to 54, which were further reduced to 45.

## The Final List of Variables

Complex decisions were made during the variable reduction and selection process. Further detailed reports relating to the analysis outline the importance of each domain and the relevance of the variables it encapsulates. For the purpose of this paper, the list of variables is shown in Table 5.

Table 5 The final list of 45 variables

| Variable | Domain |
|---|---|
| Use of agricultural inputs | Agriculture |
| Owns cattle | Agriculture |
| Age 0–14 | Demographic |
| Age 15–59 | Demographic |
| Age 60 and over | Demographic |
| Never married | Demographic |
| At least one pensioner | Demographic |
| Population density | Demographic |
| Separated couples | Demographic |
| Completed secondary education | Education |
|  | Education |
|  | Education |
|  | Education |
| Economically active population | Employment |
| Self employed | Employment |
| Employment in the transport sector | Employment |
| Employment in agriculture | Employment |
| Taking anti malaria measures | Health |
|  | Health |
| Household size 1–2 persons | Household composition |
| Mean household size | Household composition |
| Ownership of mobile phone | Household Infrastructure |
| Ownership of personal computer | Household Infrastructure |
|  | Household Infrastructure |
| No toilet facility | Household Infrastructure |
| Safe toilet sanitation | Household Infrastructure |
| Non-wood fuel for cooking | Household Infrastructure |
| Lighting energy-mains electricity | Household Infrastructure |
|  | Household Infrastructure |
| Own a house | Housing |
| Renting a house | Housing |
| Free accommodation | Housing |
| Built with burnt bricks | Housing |
| Single room | Housing |
| Duplex | Housing |
| Built with cement/mud brick | Housing |
| Vehicle ownership | Socio-economic |
| Motorcycle ownership | Socio-economic |
| Size of land-over 6 ha | Socio-economic |
| Improved security | Socio-economic |
|  | Socio-economic |
| Difficulty in meeting basic needs | Socio-economic |
| Children living with single parents | Women and Children |
| Early marriage and teenage pregnancy | Women and Children |
| Vaccinated | Women and Children |

## Clustering Process

Following the selection of the input variables, the next step was to prepare the data for classification. All the variables for the 774 LGAs were assembled into a single database and manually checked for any errors. Once this was done, the clustering process could commence.

The next phase was to explore the scales of the variables. It is important to stress that most of the methods adopted by commercial geodemographic vendors on this issue are not in the public domain and are not usually subjected to rigorous academic review.

Scale in this context refers to the unit of measurement like percentage or ratio. Cluster analysis comprises different stages or steps. At certain stages, variables may be analysed independently without altering their scale. However, at the stage when a clustering algorithm is deployed on the variables, the outcome is dependent on the same clustering criterion applied to all the variables. It is inappropriate to run a clustering algorithm on a dataset which consists of variables of different scales. The reason for this is that the nature of the scale may cause undue advantage to be given to certain variables while others suffer. Methods of normalisation help in re-scaling distributions (Berrar et al. 2003).

Prior to normalisation, the variables were log transformed. Some of the variables combined small dispersions with large means. Log transformation copes well with the heterogeneity of variance existing within some of the variables. In addition, the presence of positive skew in some of the variables, particularly those sensitive to urban-rural characteristics, gave rise to a need for log transformation. Vickers (2006) also found log transformation to be effective in creating the United Kingdom (U.K.) Output Area (OA) classification.

Different methods of normalisation were explored with the data. Range standardisation was initially tried on the data because of its advantage in handling outliers. However, the resultant clusters failed to expose the diversity within areas. This finding is consistent with work in progress for another developing country, the Philippines. The z-score standardisation technique was therefore adopted as the method for putting the variables for the classification system on a common scale. The method gives the standardised values a mean of zero and a standard deviation of one (Urdan 2005). With a mean of zero, distortions stemming from the central value of each variable can be avoided.
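
The two standardisation methods compared above can be written in a few lines. The sketch below is a hedged illustration: the input values are invented, the function names are hypothetical, and the population standard deviation is assumed because the paper does not specify the form used. In the workflow described here the inputs would already be log transformed.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("a constant variable cannot be standardised")
    return [(v - mu) / sigma for v in values]

def range_standardise(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant variable cannot be standardised")
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical (already log-transformed) values of one variable for five LGAs.
variable = [1.2, 1.9, 2.4, 3.1, 5.0]

z = z_scores(variable)
r = range_standardise(variable)
```

Applying either function to every variable in turn puts them all on a common scale before the clustering algorithm is run.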

A variety of clustering methods were considered for the analysis, and the k-means clustering method was adopted. In addition to being less computationally intensive, the method generates a number of associated tables, including the ANOVA table, which proves useful for further evaluation of the clusters produced.

Of paramount importance was the need to establish whether all the variables work well within the algorithm without clouding the differentiating features of the LGAs. Nigeria is known to comprise 6 geo-political regions and 774 LGAs. What was not known prior to these analyses was the extent of the inter-regional similarities and dissimilarities existing within the national system.

The algorithm was therefore run for 2 to 15 clusters on the log-transformed z-scores for each variable and LGA. It is important to mention that at this stage, there were 46 variables. Each of the clusters was mapped for visualisation purposes. It was immediately evident that the algorithm was working but a strong presence of the North-South-Middle belt divide in the results suggested the possible presence of clouding variables.

Between the two- and seven-cluster solutions, the regional impression was very pronounced. From the eight-cluster solution onwards the impression was slightly ameliorated, but at this stage cluster memberships were significantly impaired.

One key contributor to the regional impression was the presence of some strong outliers in the z-scores used for the clustering algorithm. It was therefore decided that a range should be specified for the scores. Based on their frequency distribution, the z-scores were capped to the range −3 to +3: every value greater than 3 was set to 3 and every value less than −3 was set to −3.
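
Capping of this kind is a simple clipping operation. The following sketch assumes the scores are held in a plain Python list; the function name `cap_scores` and the example values are hypothetical.

```python
def cap_scores(scores, limit=3.0):
    """Clip each z-score to the interval [-limit, +limit]."""
    return [max(-limit, min(limit, s)) for s in scores]

# Hypothetical z-scores for one variable, including two outliers.
scores = [-4.2, -3.0, 0.5, 2.9, 3.7]
capped = cap_scores(scores)
```

Values already inside the range are left unchanged, so only the outliers are affected.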

The algorithm was re-run for 2 to 15 clusters. The regional impression was minimised but still evident. At this point it was decided to cluster on n − 1 variables, leaving out one variable at a time. At this stage, n was equal to 46 variables.

This meant that each of the 46 variables was excluded in turn, and the 2 to 15 cluster solutions were examined for each exclusion. After careful examination, it was discovered that excluding the variable ‘Flats’ greatly diminished the geo-political regional divide. It was at this stage that the difficult but inevitable decision to exclude the variable Flats was made. The total number of variables included in the algorithm was therefore reduced to 45, as shown in Table 5 above.
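The exclusion of one variable at a time can be sketched as a simple loop over columns. The helper below is hypothetical and only prepares the reduced matrices; in the analysis described above each reduced matrix would then be clustered and mapped, and the judgement about the regional divide was made by examining the results. Apart from ‘Flats’, the variable names are placeholders.

```python
import numpy as np

def leave_one_variable_out(X, names):
    """Yield (excluded_name, reduced_matrix) with one column dropped each time."""
    for j, name in enumerate(names):
        yield name, np.delete(X, j, axis=1)

# Toy matrix of 4 LGAs by 3 variables; "var_b" and "var_c" are placeholder names.
names = ["Flats", "var_b", "var_c"]
X = np.arange(12, dtype=float).reshape(4, 3)
reduced = dict(leave_one_variable_out(X, names))
print({name: matrix.shape for name, matrix in reduced.items()})
```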

## Deploying a Cluster Stopping Rule

Methods used to investigate the probable number of clusters at the top hierarchy are somewhat informal. Everitt et al. (2001) identified the possibility of plotting the value of the clustering criterion against the number of clusters and observing points of great change in the plot. Generally, the average distance of cases from their cluster centres can be plotted against the number of clusters.

The decision on the number of clusters constituting the top level of the classification was made by taking multiple issues into consideration. These include visualisation, cluster memberships, predictive power and discriminatory power.

## Average Distance of Cases to Cluster Centre

In this exercise, the algorithm was run for 2 to 15 clusters. The average distance of each case from its cluster centre was computed and plotted against the number of clusters. This graph is depicted in Fig. 3.

A sharp change in the average distance from cluster centres would suggest the optimal number of clusters at the top hierarchy. From Fig. 3, however, it is not easy to discern any abrupt change in the magnitude of the average distance from cluster centres.
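The quantity plotted against the number of clusters can be computed as in the sketch below, which assumes Euclidean distance and takes each cluster centre to be the mean of its members. The toy data and labels are invented for illustration.

```python
import numpy as np

def average_distance_to_centre(X, labels):
    """Mean Euclidean distance between each case and the centroid of its cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    distances = np.empty(len(X))
    for cluster in np.unique(labels):
        mask = labels == cluster
        centre = X[mask].mean(axis=0)
        distances[mask] = np.linalg.norm(X[mask] - centre, axis=1)
    return distances.mean()

# Toy data: two tight pairs of points on a line.
X = np.array([[0.0], [2.0], [10.0], [12.0]])
one_cluster = average_distance_to_centre(X, [0, 0, 0, 0])
two_clusters = average_distance_to_centre(X, [0, 0, 1, 1])
print(one_cluster, two_clusters)  # 5.0 1.0
```

In this toy case the average distance drops sharply between one and two clusters, which is the kind of abrupt change the plot is inspected for.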

The maps of the cluster solutions were examined at this point and it was observed that better discrimination between areas commenced from point 5 upwards and visualisation was more difficult after point 7. From point 8, the range of cluster composition (i.e. number of zones in each cluster) increased greatly. It was therefore decided that the 5, 6 and 7 cluster solutions would be tested against each other.

The distance of each LGA from its cluster centre was then examined for the 5, 6 and 7 cluster solutions. The smaller the distance of a case from its cluster centre, the better. The distribution of distances for each of the three solutions showed a positive skew, which signifies that the majority of the LGAs fall in the lower distance categories. The larger the value of the positive skew, the better the solution in terms of how close members are to the centre of their cluster. Both the 5 and 7 cluster solutions have a skew of 0.31 while the 6 cluster solution has a skew of 0.38, indicating that its members are more compact than those of the 5 and 7 cluster solutions.
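The skew of a distance distribution can be obtained as in the following sketch. It uses the moment coefficient of skewness, which is an assumption: the paper does not say which skewness statistic was computed, and the distances below are hypothetical.

```python
import numpy as np

def skewness(values):
    """Moment coefficient of skewness: third central moment over sd cubed."""
    v = np.asarray(values, dtype=float)
    centred = v - v.mean()
    return (centred ** 3).mean() / (centred ** 2).mean() ** 1.5

# Hypothetical distances of seven LGAs from their cluster centres.
distances = np.array([1.0, 1.2, 1.3, 1.5, 1.6, 2.0, 4.5])
print(skewness(distances) > 0)  # True: most cases are close, a few are far away
```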

## Analysing the Range of Clusters

The composition of clusters should be relatively balanced. When too many cases concentrate in too few clusters, the distribution of cluster sizes becomes skewed. Cluster membership was assessed by examining the range of cluster sizes (the difference between the largest and smallest cluster) for each of the three solutions.

As can be seen in Table 6, the seven cluster solution, which has the widest range, is marginally out-performed by the five and six cluster solutions.

Table 6

Assessing the sizes of clusters

| Solution | Measure | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7 | Range |
|---|---|---|---|---|---|---|---|---|---|
| Five cluster solution | Num. of LGAs | 208 | 110 | 128 | 120 | 208 | | | 98 |
| | % of LGAs | 26.9 | 14.2 | 16.5 | 15.5 | 26.9 | | | 12.7 |
| Six cluster solution | Num. of LGAs | 181 | 166 | 114 | 126 | 82 | 105 | | 99 |
| | % of LGAs | 23.4 | 21.4 | 14.7 | 16.3 | 10.6 | 13.6 | | 12.8 |
| Seven cluster solution | Num. of LGAs | 107 | 102 | 162 | 160 | 52 | 89 | 102 | 110 |
| | % of LGAs | 13.8 | 13.2 | 20.9 | 20.7 | 6.7 | 11.5 | 13.2 | 14.2 |
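The range column of Table 6 is the difference between the largest and smallest cluster in each solution. As a short check in Python, using the LGA counts reported in the table (the function name is illustrative):

```python
# Number of LGAs per cluster, as reported in Table 6.
cluster_sizes = {
    5: [208, 110, 128, 120, 208],
    6: [181, 166, 114, 126, 82, 105],
    7: [107, 102, 162, 160, 52, 89, 102],
}

def size_range(sizes):
    """Difference between the largest and smallest cluster."""
    return max(sizes) - min(sizes)

ranges = {k: size_range(sizes) for k, sizes in cluster_sizes.items()}
print(ranges)  # {5: 98, 6: 99, 7: 110}
```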

## Predictive Power

In order to assess the predictive power of each solution, seven variables (not included in the clustering process) were selected. The variables include: population in monogamous marriage, population in public sector employment, economically inactive population aged 15 to 24, uneducated population, household heads self employed in agriculture, population of children living in non nuclear households, children aged 0 to 4 vaccinated against measles.

The populations represented in each variable were aggregated by geodemographic cluster for each of the three solutions. Total populations by geodemographic cluster were also derived. The crude rate for each geodemographic cluster was calculated by dividing the population represented by the variable by the total population of the cluster. For each LGA, the rate calculated for its geodemographic cluster was applied to its total population to derive a predicted value for that variable.
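The prediction step can be sketched as follows. The sketch reads the crude rate as the cluster's count for the variable divided by the cluster's total population, so that multiplying it by an LGA's total population yields a count; this reading, the function name and the figures are assumptions for illustration.

```python
import numpy as np

def predict_from_cluster_rates(variable_pop, total_pop, labels):
    """Predict each LGA's count from the crude rate of its cluster."""
    variable_pop = np.asarray(variable_pop, dtype=float)
    total_pop = np.asarray(total_pop, dtype=float)
    labels = np.asarray(labels)
    predicted = np.empty_like(total_pop)
    for cluster in np.unique(labels):
        mask = labels == cluster
        # Crude rate: people counted by the variable per head of cluster population.
        rate = variable_pop[mask].sum() / total_pop[mask].sum()
        predicted[mask] = rate * total_pop[mask]
    return predicted

# Hypothetical figures for four LGAs in two clusters.
total_pop = [1000, 3000, 2000, 2000]
variable_pop = [100, 500, 400, 600]
labels = [0, 0, 1, 1]
predicted = predict_from_cluster_rates(variable_pop, total_pop, labels)
print(predicted.round(1).tolist())  # [150.0, 450.0, 500.0, 500.0]
```

The predictions preserve each cluster's total for the variable while smoothing away the differences between its member LGAs, which is what the error measure then quantifies.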

For each of the cluster solutions, the predicted values for the 7 variables were subtracted from the actual values. The error introduced was then quantified using a variant of Fisher and Langford’s (1995) root mean square (RMS) error adapted by Gregory (2000).
$$E^{RMS} = \left[ \frac{1}{m} \sum_{m} \left( \frac{y - y'}{y} \right)^{2} \right]^{1/2} \tag{1}$$

where

$E^{RMS}$ is the Root Mean Square error

$m$ is the number of LGAs

$y$ is the actual value of the variable

$y'$ is the predicted value of the variable
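Equation 1 translates directly into a few lines of code. The sketch below assumes NumPy and uses invented values; it also assumes that no actual value is zero, since the error is expressed relative to the actual value.

```python
import numpy as np

def rms_error(actual, predicted):
    """Root mean square of the proportional errors (y - y') / y over all LGAs."""
    y = np.asarray(actual, dtype=float)
    y_pred = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean(((y - y_pred) / y) ** 2))

# Hypothetical actual and predicted counts for three LGAs.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(round(float(rms_error(actual, predicted)), 4))
```

Because each error is divided by the actual value, an LGA with a small actual count can contribute a large proportional error, which may help explain RMS values well above 1.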

The results from the analysis are provided in Table 7. The five cluster solution appears to work with variables that are highly correlated with the total population. It performs well with two variables (Var 1 and Var 3).

Table 7

Results from RMS error analysis

| Variable code | Variables | Correlation with total population | RMS for 5 cluster solution | RMS for 6 cluster solution | RMS for 7 cluster solution |
|---|---|---|---|---|---|
| Var 1 | Population in monogamous marriage | 0.84 | 0.51 | 0.52 | 0.51 |
| Var 2 | Population in public sector employment | 0.67 | 2.56 | 2.45 | 2.43 |
| Var 3 | Economically inactive population aged 15 to 24 | 0.88 | 3.24 | 3.25 | 3.38 |
| Var 4 | Uneducated population | 0.37 | 1.42 | 1.37 | 1.37 |
| Var 5 | Household heads self employed in agriculture | 0.19 | 5.31 | 5.23 | 5.55 |
| Var 6 | Population of children living in non nuclear households | 0.63 | 1.98 | 1.89 | 1.89 |
| Var 7 | Children aged 0 to 4 vaccinated against measles | 0.88 | 2.98 | 2.74 | 2.79 |

It is difficult to draw precise conclusions between the six and seven cluster solutions. However, closer observation shows that in both cases where the 7 cluster solution out-performs the 6 cluster solution, the difference in RMS error is marginal: 0.01 for Var 1 and 0.02 for Var 2.

Another interesting reflection pertains to Var 3. While the 5 cluster solution marginally out-performs the 6 cluster solution, the 6 cluster solution performs much better than the 7 cluster solution.

Based on these findings, both the 6 and 7 cluster solutions perform better than the 5 cluster solution, but the 6 cluster solution marginally out-performs the 7 cluster solution.

## Discriminatory Power

Discriminatory power has been assessed by deriving the Gini coefficient for each of the seven variables listed above across the three solutions. The Gini coefficient is used to measure the degree of concentration of a variable within a distribution of its elements (Brown 1994).

When used alongside the Lorenz curve, the Gini coefficient allows a graphical comparison of inequality. Values for the Gini coefficient range from 0, where there is perfect equality, to 1, where there is perfect inequality. The Gini coefficient expresses the area located between the line of perfect equality and the Lorenz curve. Leventhal (1995) describes it as a method which can help mitigate the challenges posed by numerical methods of comparison of discriminatory power.

The Lorenz curves have been derived by calculating an index for each cluster in each of the three solutions. The index is the ratio of the proportion of the target population (the people defined by a variable) within each cluster to the proportion of the catchment or base population within that cluster. This index, an indicator of the propensity of the variable, has been used to sort the percentage of each variable and base population (total populations in this case) in descending order.

The target and base percentage values were subsequently accumulated and used to derive the Gini co-efficient and Lorenz curves. For the Gini, the following formula suggested by Brown (1994) was used:
$$G = 1 - \sum_{i=0}^{k-1} \left( y_{i+1} + y_i \right)\left( x_{i+1} - x_i \right) \tag{2}$$

where

$G$ is the value of the Gini coefficient

$k$ is the number of data points for the profile and base populations

$y$ is the profile population for a selected geodemographic cluster

$x$ is the base population for the selected geodemographic cluster
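A minimal sketch of Eq. 2 follows, under stated assumptions: x and y are treated as cumulative proportions that start at zero and end at one, and the clusters are sorted in ascending order of the index so that the Lorenz curve lies below the diagonal and the formula returns a non-negative value. Sorting in descending order mirrors the curve above the diagonal and changes only the sign. The function name and the figures are illustrative.

```python
import numpy as np

def gini(target, base):
    """Gini coefficient from per-cluster target and base populations."""
    target = np.asarray(target, dtype=float)
    base = np.asarray(base, dtype=float)
    # Index of propensity: share of the target relative to share of the base.
    index = (target / target.sum()) / (base / base.sum())
    # Assumption: ascending order, so the Lorenz curve lies below the diagonal.
    order = np.argsort(index)
    # Cumulative proportions, each starting at the origin (0, 0).
    x = np.concatenate([[0.0], np.cumsum(base[order]) / base.sum()])
    y = np.concatenate([[0.0], np.cumsum(target[order]) / target.sum()])
    return 1.0 - np.sum((y[1:] + y[:-1]) * (x[1:] - x[:-1]))

# Hypothetical two-cluster examples with equal base populations.
print(gini([10, 10], [100, 100]))  # 0.0: both clusters have the same rate
print(gini([0, 30], [100, 100]))   # 0.5: the target sits entirely in one cluster
```

With only a handful of clusters the coefficient cannot reach 1, so values are best compared between solutions for the same variable rather than read in absolute terms.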

Since the population employed in the public sector is unevenly distributed (i.e. inequality exists), the curve shifts away from the diagonal line of perfect equality. The larger the area between these two lines, the better a system can uncover the differences within the population (Fig. 4).

Brown (1994, p. 1247) describes the area between the Lorenz curve and the line of perfect equality as follows:

Defined graphically, the Gini coefficient formally is measured as the area between the equality curve and the Lorenz curve, divided by the area under the equality curve.

From this suggestion we deduce that the area between the Lorenz curve and line of perfect equality is half the value of the Gini co-efficient. Clearly a larger value of the Gini would equate to a larger value for the area inside the curve.
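The deduction can be written out in one line. Here A denotes the area between the Lorenz curve and the line of perfect equality (a symbol introduced only for this illustration), and the area under the equality line is 1/2 because both axes run from 0 to 1:

$$G = \frac{A}{1/2} = 2A \quad \Longrightarrow \quad A = \frac{G}{2}$$

For example, a Gini coefficient of 0.34 corresponds to an area of 0.17 between the two lines.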

An assessment of Table 8 reveals that for no variable does a single solution out-perform both of the others. Solutions 6 and 7 again seem to perform better than solution 5. The 7 cluster solution marginally out-performs the 6 cluster solution on Var 3, where it ties with the 5 cluster solution. It is difficult to choose (in terms of discrimination) between the 6 and 7 cluster solutions.

Table 8

Comparison of the discriminatory power

| Variable code | Variables | Correlation with total population | GINI for 5 cluster solution | GINI for 6 cluster solution | GINI for 7 cluster solution |
|---|---|---|---|---|---|
| Var 1 | Population in monogamous marriage | 0.84 | 0.08 | 0.08 | 0.08 |
| Var 2 | Population in public sector employment | 0.67 | 0.25 | 0.26 | 0.26 |
| Var 3 | Economically inactive population aged 15 to 24 | 0.88 | 0.08 | 0.07 | 0.08 |
| Var 4 | Uneducated population | 0.37 | 0.31 | 0.34 | 0.34 |
| Var 5 | Household heads self employed in agriculture | 0.19 | 0.22 | 0.22 | 0.22 |
| Var 6 | Population of children living in non nuclear households | 0.63 | 0.26 | 0.28 | 0.28 |
| Var 7 | Children aged 0 to 4 vaccinated against measles | 0.88 | 0.10 | 0.11 | 0.11 |

The entire exercise shows that it may be difficult to decide on the number of clusters that constitute the top hierarchy of the classification. The three solutions have their merits and de-merits. Generally, discrimination appears to increase with increasing number of clusters. However, with increasing number of clusters, there is also the chance of creating an imbalance in the number of LGAs constituted within each cluster.

For the purpose of this work, the six cluster solution was chosen to enable further analysis and visualisation. With a 6 cluster solution selected at the first hierarchy, the next hierarchy could be created. To create the second hierarchy, each of the 6 clusters identified at the first level was clustered separately using the K-means algorithm. This method was adopted in the creation of the United Kingdom Office for National Statistics (ONS) Output Area (O.A.) classification system (Vickers 2006). At this level, between 2 and 5 cluster solutions were created for each of the top 6 clusters and evaluated in a similar manner to that discussed above. The second level of analysis resulted in a total of 23 clusters. A third level comprising 57 clusters was also created. The top level of the classification is called Super-groups, the second level Groups and the third level Sub-groups. They are shown in Table 9.

Table 9

Cluster labels

| Super-groups | Super-group label | Groups | Group label | Sub-groups |
|---|---|---|---|---|
| 1 | Green Towns | 1.1 | Conventional Green Towns | 1.1.1, 1.1.2, 1.1.3 |
| | | 1.2 | Underprivileged Green Towns | 1.2.1, 1.2.2 |
| | | 1.3 | Flourishing Green Towns | 1.3.1, 1.3.2, 1.3.3 |
| | | 1.4 | Struggling Green Towns | 1.4.1, 1.4.2, 1.4.3 |
| 2 | Emerging Localities | 2.1 | Moderately Emerging Localities | 2.1.1, 2.1.2, 2.1.3 |
| | | 2.2 | Comfortable Emerging Localities | 2.2.1, 2.2.2, 2.2.3 |
| | | 2.3 | Transient Emerging Localities | 2.3.1, 2.3.2, 2.3.3 |
| 3 | Intermediate Territories | 3.1 | Constrained Intermediate Territories | 3.1.1, 3.1.2 |
| | | 3.2 | Well-to-do Intermediate Territories | 3.2.1, 3.2.2, 3.2.3 |
| | | 3.3 | Deprived Intermediate Territories | 3.3.1, 3.3.2 |
| | | 3.4 | Customary Intermediate Territories | 3.4.1, 3.4.2 |
| 4 | Diluted Societies | 4.1 | Thriving Diluted Societies | 4.1.1, 4.1.2 |
| | | 4.2 | Labouring Diluted Societies | 4.2.1, 4.2.2, 4.2.3 |
| | | 4.3 | Deprived Diluted Societies | 4.3.1, 4.3.2 |
| | | 4.4 | Modest Diluted Societies | 4.4.1, 4.4.2, 4.4.3 |
| 5 | Country Dwellings | 5.1 | Toiling Country Dwellings | 5.1.1, 5.1.2 |
| | | 5.2 | Deprived Country Dwellings | 5.2.1, 5.2.2 |
| | | 5.3 | Middle-class Country Dwellings | 5.3.1, 5.3.2, 5.3.3 |
| 6 | Urban Nodes | 6.1 | Prosperous Urban Nodes | 6.1.1, 6.1.2 |
| | | 6.2 | | 6.2.1, 6.2.2 |
| | | 6.3 | Average Urban Nodes | 6.3.1, 6.3.2 |
| | | 6.4 | Affluent Urban Nodes | 6.4.1, 6.4.2, 6.4.3 |
| | | 6.5 | Striving Urban Nodes | 6.5.1, 6.5.2 |
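The nesting of levels in Table 9 can be illustrated with a short sketch in which each parent cluster is clustered separately and the child label is appended to the parent code. The helper names are hypothetical, and `split_by_first_column` is a deliberately simple stand-in for K-means so that the example stays small and deterministic.

```python
import numpy as np

def split_by_first_column(rows, n):
    """Stand-in for K-means: cut the rows into n bins by rank on column 0."""
    ranks = rows[:, 0].argsort().argsort()
    return ranks * n // len(rows)

def nested_codes(X, parent_codes, cluster_fn, n_children):
    """Cluster each parent group separately and return codes such as '1.2'."""
    X = np.asarray(X, dtype=float)
    codes = np.empty(len(X), dtype=object)
    for parent in sorted(set(parent_codes)):
        mask = np.array([code == parent for code in parent_codes])
        # cluster_fn returns integer labels 0..n-1 for the rows it is given.
        labels = cluster_fn(X[mask], n_children[parent])
        codes[mask] = [f"{parent}.{label + 1}" for label in labels]
    return codes

# Toy data: six areas already assigned to super-groups "1" and "2".
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0]])
super_groups = ["1", "1", "1", "1", "2", "2"]
groups = nested_codes(X, super_groups, split_by_first_column, {"1": 2, "2": 2})
print(groups.tolist())
# ['1.1', '1.1', '1.2', '1.2', '2.1', '2.2']
```

Applying the same function again, with the group codes as parents, would produce three-part codes of the kind listed in the Sub-groups column.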

## Cluster Defining Characteristics

Labelling a group with a name can be contentious. It is a very complex process requiring consideration of numerous issues. The names are expected to be as widely representative as possible of the characteristics of most of the people living in those areas. This does not in any way imply that every single person within a cluster can be labelled that way. To some extent, diversity still exists within similarity (Voas and Williamson 2001). The names attached to clusters are only indicative of the predominant features of the areas in question. Table 9 shows the names given to each of the super-groups and groups.

Names should not in any way be offensive, especially in a multi-ethnic and multi-religious country like Nigeria. Care was taken to ensure that the names chosen do not appear to stigmatise any section of the population. Religious and ethnic language was avoided. Distinguishing variables have been identified by examining the z-scores of the final cluster centre of each cluster (Debenham 2002).
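One simple way to read distinguishing variables off a cluster centre is to rank the variables by how far their z-scores lie from zero, the national mean. The sketch below is an illustration of that idea rather than the procedure actually followed; the variable names and centre values are placeholders.

```python
import numpy as np

def distinguishing_variables(centre, names, top=3):
    """Names and z-scores of the variables furthest from the national mean (zero)."""
    centre = np.asarray(centre, dtype=float)
    order = np.argsort(-np.abs(centre))[:top]
    return [(names[i], float(centre[i])) for i in order]

# Hypothetical final cluster centre on four placeholder variables.
names = ["var_a", "var_b", "var_c", "var_d"]
centre = [0.2, -1.8, 1.1, -0.4]
print(distinguishing_variables(centre, names, top=2))
# [('var_b', -1.8), ('var_c', 1.1)]
```

The sign matters when writing a description: a strongly negative score marks a characteristic that is notably scarce in the cluster, a strongly positive one a characteristic that is notably common.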

Radar charts as displayed in Fig. 5 have been created for all super-groups, groups and sub-groups. In addition to the charts, pen portraits which are textual descriptions of the super-groups and groups have also been developed. Pen portraits which summarise the prevalent characteristics of each cluster have the benefit of elucidating (in qualitative terms) some of the information inherent in complex quantitative analysis.

## Visualising National and Sub-National Profiles

The fact that the classification system is an area-based estimator means it is linked with geography. In this context the LGA administrative geography of Nigeria is the finest scale at which the system can be visualised in maps, as shown in Fig. 6.

Beyond mapping, charts can also be used to show the distribution of varying population characteristics. At super-group level, the largest share of the population (21.8%) is concentrated within the Urban Nodes. This is closely followed by the Emerging Localities (21.2%) and Green Towns (20.3%). In Fig. 7, Diluted Societies, Intermediate Territories and Country Dwellings are represented by 14.3%, 12.8% and 9.6% of the total population respectively.

The distribution of households also reveals a similar pattern. However, this time around Green Towns have the largest representation of 24.6%. This is closely followed by a 23.1% representation within Urban Nodes. Emerging Localities and Intermediate Territories have 16.9% and 13.5% respectively while Diluted Societies and Country Dwellings enjoy 12.8% and 9.1% respectively.

It is also interesting to note from the chart that the share of households within the Green Towns, Intermediate Territories and Urban Nodes exceeds their share of the population. This is a reflection of population density within these geodemographic types. Within Green Towns, population density is average. It is just above average within the Intermediate Territories and very high within the Urban Nodes.
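The comparison of household and population shares can be checked directly from the percentages quoted above. The dictionary and variable names in this short Python check are illustrative.

```python
# Percentage shares by super-group, as quoted in the text.
population_share = {
    "Urban Nodes": 21.8, "Emerging Localities": 21.2, "Green Towns": 20.3,
    "Diluted Societies": 14.3, "Intermediate Territories": 12.8,
    "Country Dwellings": 9.6,
}
household_share = {
    "Green Towns": 24.6, "Urban Nodes": 23.1, "Emerging Localities": 16.9,
    "Intermediate Territories": 13.5, "Diluted Societies": 12.8,
    "Country Dwellings": 9.1,
}

# Super-groups whose share of households is larger than their share of population.
over_represented = sorted(
    name for name in population_share
    if household_share[name] > population_share[name]
)
print(over_represented)
# ['Green Towns', 'Intermediate Territories', 'Urban Nodes']
```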

## Conclusions

As far back as 1996, commentators on Nigeria’s economic reforms have suggested that national poverty alleviation programmes would require flexibility “to address the diversity of the needs of poor individuals and communities” (World Bank 1996a, b). There have also been suggestions to discourage common methods of targeting which aim at reaching the entire population (World Bank 1996a, b) because they are less effective and require large resources.

Recently, the progress and challenges of Nigeria’s economy were evaluated (Okonjo-Iweala and Osafo-Kwaako 2007). Okonjo-Iweala and Osafo-Kwaako noted that while there has been success in various arenas, much work still has to be done. Top on their priority list of current challenges is the need to extend the reforms to sub-national levels. Essentially, effective machinery has to be instituted beyond national and state levels to better discriminate areas of need.

With increasing evidence that different neighbourhood types are characterised by varying types and levels of deprivation (Harris et al. 2005) the one size fits all approach in policy making is becoming obsolete. The provision of workable solutions at local level governance requires the appropriation of differentiating strategies at this level.

In spite of the early recognition of the need for new ways of targeting and allocating resources, the creation of a workable solution has had to wait until now. This paper has summarised the first attempt at creating a geodemographic classification system for Nigeria. The classification, which encapsulates 45 variables, segments the Local Government Areas of Nigeria into 6 super-groups, 23 groups and 57 sub-groups based on their similarity.

The methods used for the analysis centre around cluster analysis. Effort has been made to explain and justify the reasons for decisions made in the process of the analysis.

It is fair to say that the results from this analysis suggest that developing world countries amidst challenges surrounding availability and access to relevant data can also benefit from geodemographic methods.

An important lesson learnt from the analysis reported in this paper is that a one-size-fits-all approach, particularly in the choice of input variables, does not apply to all countries. Creating a geodemographic classification system for any country should take into account some of the peculiarities of that country. The nature of the variables considered and included in the build process helps explain the sorts of issues that are pertinent to policy within a developing country like Nigeria.

Apart from contributing to the literature on open geodemographics, the creation of the Nigerian system will be useful in solving numerous unanswered policy related questions. It will spark interest from the private sector and the Nigerian academic community stands to benefit from this new geography.

Strengthening the concept of national identity has constantly been a challenge for successive governments. The thought of many Nigerians is that the country is too diverse to co-exist. However, this report has shown as suggested by Vickers (2006) that even though closely situated neighbourhoods are alike, they are just as similar in their geodemographic make up to distant neighbourhoods. This also suggests that there will be commonalities in their wants and ways of life.

A website (www.nigerianlgaclassification.com) has been developed where members of the public can easily access further detailed information too voluminous for inclusion in this paper. The website also showcases comprehensive profiles and interactive maps and charts relating to the Nigerian system. It is hoped that this exercise will serve as a platform for extending the development and use of geodemographic typologies in developing countries.

### References

1. Abbas, J., Ojo, A., & Orange, S. (2009). Geodemographics-a tool for health intelligence? Public Health, 123(1), 35–39.
2. Berrar, D. P., Dubitzky, W., & Granzow, M. (2003). A practical approach to microarray data analysis. Springer.
3. Brown, M. C. (1994). Using Gini-style indices to evaluate the spatial patterns of health practitioners: theoretical considerations and an application based on Alberta data. Social Science & Medicine, 38(9), 1243–1256.
4. Brown, P. J. B., Hirschfield, A. F. G., & Batey, P. W. J. (2000). Adding value to census data: Public sector applications of super profiles geodemographic typology. Working Paper 56, URPERRL, Department of Civic Design, University of Liverpool.
5. Burrows, R., & Gane, N. (2006). Geodemographics, software and class. Sociology, 40(5), 793–812.
6. Debenham (2002). Understanding geodemographic classification: Creating the building blocks for an extension. Working Paper 02/01, School of Geography, University of Leeds.
7. Dunteman, G. H. (1989). Principal components analysis. London: Sage Publications, Inc.
8. Energy Information Administration (2006). Top World Oil Net exporters, 2006 [Online] http://tonto.eia.doe.gov/country/index.cfm. Accessed 8/2/2008.
9. Everitt, B. S., Landau, S., & Leese, M. (2001). Cluster analysis (4th ed.). London: Arnold.
10. Fisher, P. F., & Langford, M. (1995). Modeling the errors in areal interpolation between zonal systems by Monte Carlo simulation. Environment and Planning A, 27(2), 211–224.
12. Gordon, A. (2003). Nigeria’s diverse peoples: A reference source book. Santa Barbara: ABC-CLIO.
13. Goss, J. (1995). Marketing the new marketing: The strategic discourse of geodemographic information systems. In J. Pickles (Ed.), Ground truth: The social implications of geographic information systems. New York: Guildford Press.
14. Gregory, I. N. (2000). An evaluation of the accuracy of the areal interpolation of data for the analysis of long-term change in England and Wales [Online] http://www.geocomputation.org/2000/GC045/Gc045.htm. Accessed 5/3/07.
15. Harris, R., Sleight, P., & Webber, R. (2005). Geodemographics, GIS and neighbourhood targeting. London: Wiley.
16. Jolliffe, I. T. (2002). Principal components analysis. New York: Springer.
17. Leventhal, B. (1995). Evaluation of geodemographic classifications. Journal of Targeting, Measurement and Analysis for Marketing, 4(2), 173–183.
18. Milligan, G. W. (1996). Clustering validation: Results and implications for applied analysis. In P. Arabie, L. J. Hubert, & G. De Soete (Eds.), Clustering and classification. Singapore: World Scientific.
19. Milligan, G. W., & Stephen, C. H. (2003). Clustering and classification methods. In B. W. Irvin, J. A. Schinka, W. F. Velicer (Eds), Handbook of psychology. New Jersey: John Wiley and Sons Inc.
20. NBS. (2005). Poverty profile for Nigeria. Nigeria: National Bureau of Statistics.
21. NBS (2006). Federal Republic of Nigeria 2006 Population Census, Official Gazette (FGP 71/52007/2,500(OL24).
22. NBS. (2006b). Core welfare indicators questionnaire survey, final statistical report, Federal Republic of Nigeria. Nigeria: National Bureau of Statistics.
23. NPC. (2004). Meeting everyone’s needs: National economic empowerment development strategy. Nigeria: National Planning Commission.
24. Ogunbodede, E. F. (2006). Developing geospatial information for poverty reduction: Lessons and challenges from Nigeria’s 2006 Census. GSDI-9 Conference Proceedings, Santiago, Chile.
25. Okonjo-Iweala, N., & Osafo-Kwaako, P. (2007). Nigeria’s economic reforms: Progress and challenges. Massachusetts: The Brookings Institution.
26. Olowu, D., & Ayo, S. B. (1985). Local government and community development in Nigeria: Developments since the 1976 Local Government Reform. Community Development Journal, 20(4), 283–292.
27. Orford, S., Dorling, D., Mitchell, R., Shaw, M., & Smith, G. D. (2002). Life and death of the people of London: a historical GIS of Charles Booth’s inquiry. Health & Place, 8(1), 25–35.
28. Robson, B. T. (1971). Urban analysis: a study of city structure. Cambridge: Cambridge University Press.
29. Simey, T., & Simey, M. (1960). Charles Booth. Social scientist. London: Oxford University Press.
30. Singleton, A. D., & Longley, P. (2009). Creating open source geodemographics: refining a national classification of census output areas for applications in higher education. Papers in Regional Science, 88, 643–666.
31. Tobler, W. R. (1970). A computer movie simulating urban growth in the Detroit Region. Economic Geography, 46(2), 234–240.
32. UNESCO (2000). UNESCO/Nigeria Co-operation for Universal Basic Education. United Nations Educational, Scientific and Cultural Organization. [Online] http://unesdoc.unesco.org/images/0014/001485/148544eo.pdf, Accessed 15/08/2007.
33. United Nations (2007). World population prospects: The 2006 Revision, United Nations Population Division, New York, U.S.A. [Online] http://www.un.org/esa/population/publications/wpp2006/English.pdf, Accessed 13/02/2008.
34. Urdan, T. C. (2005). Statistics in plain english. New Jersey: Lawrence Erlbaum Associates.
35. Vickers, D. W. (2006). Multi-level integrated classifications based on the 2001 census. Unpublished PhD thesis. School of Geography, University of Leeds.
36. Vickers, D., & Rees, P. (2006). Introducing the area classification of output areas. Population Trends, 125, 15–29.
37. Voas, D., & Williamson, P. (2001). The diversity of diversity: a critique of geodemographic classification. Area, 33(1), 63–76.
38. Webber, R. (1977). The national classification of residential neighbourhoods: an introduction to the classification of wards and parishes. Planning Research Applications Group, Centre for Environmental Studies, 23.
39. World Bank (1996a). Nigeria, poverty in the midst of plenty, the challenge of growth with inclusion: A world bank poverty assessment. [Online] http://www-wds.worldbank.org/servlet/WDSContentServer/WDSP/IB/1996/05/31/000009265_3961029235646/Rendered/PDF/multi0page.pdf. Accessed 10/7/2007.
40. World Bank (1996b). Nigeria: Targeting communities for effective poverty alleviation [Online] http://www.worldbank.org/afr/findings/english/find68.htm. Accessed 02/04/2008.
41. World Bank (2007a). Independent evaluation group approach paper Nigeria: Country assistance evaluation [Online] http://lnweb18.worldbank.org/oed/oeddoclib.nsf/DocUNIDViewForJavaSearch/3EE7F2E5A37A9BEC85257321007A7697/\$file/nigeria_cae_approach_paper.pdf. Accessed 7/05/2008.
42. World Bank. (2007). World development indicators 2007. Washington, D.C.: World Bank Publications.