Organized by the ESCAP secretariat
with active support of the Working Party on
the Application of New Technology to Population
Data, the Workshop was attended by 38 participants
from Armenia, Bangladesh, Brunei Darussalam,
Cambodia, China, India, Indonesia, Japan, Kiribati,
Malaysia, Maldives, Mongolia, Nepal, Pakistan,
Papua New Guinea, Philippines, Republic of Korea,
Samoa, Sri Lanka, Thailand and Viet Nam. The
members of the Working Party from Australia;
Bangladesh; Indonesia; Japan; Macao, China;
New Zealand; Philippines; Singapore; and Thailand
shared their respective country experiences.
Representatives of the Statistical Institute
for Asia and the Pacific (SIAP), the UNFPA Country
Technical Services Team in Bangkok, the
United Nations Statistics Division (UNSD) and
the United States Census Bureau (USCB) participated
actively as resource persons. The Workshop
benefited also from presentations by invited
private sector companies.
"I encourage you in your invaluable
work of making census data as easily
available as possible to your clients.
That goal cannot be achieved without
application of modern information technology."
Mr Kim Hak-Su, Executive Secretary
of ESCAP, in his inaugural address to
the Workshop.
State of Information Technology in National Statistical
and Census Offices in June 2001
This article provides an update
of selected aspects of ESCAP's 1998 survey on
Application of New Technology in Population
Data Collection, Processing, Dissemination and
Presentation, the results of which were published
in June 1998 (see http://www.unescap.org/stat/pop-it/pop-itnl/news_03.asp).
It is based on responses to an email questionnaire
sent in June 2001 to the statistical and census
offices that had responded to the first survey.
Only a small number of questions were included
from the previous round in order to make responding
easy.
A comparison between the 1998 and 2001 results
indicates that during the past three years, the
PC, LAN and Internet infrastructure has been
upgraded in all responding offices. The
increase in the number of computers in use has
been remarkable, exceeding 50 per cent in many
responding offices. The staff-to-PC ratio
has come down because of the increase in the
number of PCs, although staff reduction was
also a factor in some offices.
Table 1.
The number of staff and PCs in selected statistical
offices, 1998 and 2001 (sorted by the staff/PC
ratio in 1998)
                      Total staff      Staff/PC PCs in LAN, %  Staff change     PC change
Country/area          1998   2001   1998   2001   1998   2001  1998-2000, %  1998-2000, %
New Zealand            729    900    0.8    0.8    100    100          23.5          19.3
Australia             2845   3140    0.9    0.8    100    100          10.4          19.3
Japan                 1823   1788    0.9    0.8    100    100          -1.9          10.0
Republic of Korea     1281   1568    1.2    0.9     99    100          22.4          57.6
Hong Kong, China      1495   1505    1.9    1.0     59     49           0.7          81.5
Lao PDR                 50     30    1.9    0.7     85     78         -40.0          76.9
Samoa                   32     36    2.7    2.0      0    100          12.5          50.0
Philippines           3131   3554    3.3    3.1     26     47          13.5          21.3
Turkey                2741   3079    3.8    2.2      5     71          12.3          91.8
Armenia                 61    226    4.4    1.5     71     32         270.5        1000.0
Myanmar                311    302    8.9    5.3     29     28          -2.9          62.9
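The staff change column in Table 1 can be reproduced from the two staff totals. The following is a minimal sketch of that arithmetic, assuming the column is a simple relative change in per cent; the function name is illustrative and not taken from the survey. The PC change column cannot be recomputed in the same way because the table reports only rounded staff/PC ratios, not PC counts.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed in per cent."""
    return (new - old) / old * 100.0


# Staff totals (1998, 2001) copied from Table 1 for a few offices.
staff_totals = {
    "New Zealand": (729, 900),
    "Australia": (2845, 3140),
    "Japan": (1823, 1788),
    "Lao PDR": (50, 30),
}

for office, (staff_1998, staff_2001) in staff_totals.items():
    change = percent_change(staff_1998, staff_2001)
    print(f"{office}: {change:.1f} per cent")
```

For these four offices the rounded results agree with the staff change figures shown in the table.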
While technologically advanced
offices provide email and web connection for
all staff, and on every PC, some offices still
have to manage with a couple of email accounts
and a dismal connection speed, which makes
web browsing barely feasible (see Table 2).
Where a viable Internet connection is missing,
field operations have to rely on conventional
means of communication, with data being transferred
on paper, on diskettes or through direct telephone
connections. Web sites, if they exist,
are likely to be a result of efforts by dedicated
individuals using infrastructure outside
their offices.
Table 2.
Typical PC configuration and Internet connection
in selected NSOs in June 2001
                        Typical PC                     Internet connection             Share of PCs that can, %
Country             Processor, MHz  RAM, MB  Disk, GB  Type                        kbps  send email  browse web
Armenia                        100       16         1  Radio modem                    4         1.9         6.5
Australia                     484*     137*       10*  Frame relay, full duplex    1000        82.1        87.2
Hong Kong, China               333       32       3.2  T1                          1544        21.3        21.3
Japan                          667      256        15  Dual T1                     3000       100.0        15.9
Lao PDR                        600       64        10  Cable modem                   56         4.3         0.0
Myanmar                        166       16         1  None                           -         3.5         0.0
New Zealand                    200      128         6  Frame relay                 2000        90.9        90.9
Philippines                    500       64         8  Leased line                   64        20.0        20.0
Republic of Korea              700       64        10  T1                          2048       100.0       100.0
Samoa                          333       64         6  Dial-up                        .        38.9        38.9
Turkey                         200       32         6  Leased line                  128        46.4        46.4
* Weighted average of PCs and notebooks
Web site analysis
Outside the survey, a separate
technology review[1]
was made of known web sites of statistical and
census offices.
In June 2001, a little over half of ESCAP's
regional members and associate members, i.e.
32 of 57, had a national statistical web site.
Most of the NSO and census
web servers were located in the respective capitals.
Six of the 35 sites investigated were hosted
abroad (statistical offices of Azerbaijan, Fiji,
Islamic Republic of Iran, Marshall Islands,
Federated States of Micronesia, and the Office
of the Registrar General of India).
Figure 1 shows results of an experimental test
on how fast individual web sites responded to
a series of ping requests. In the test
done from Bangkok, the fastest average response
was received from the nearest web site, the
National Statistical Office of Thailand, followed
by sites located in the United States and ASEAN
countries. The slowest responses came
generally from the most distant sites in the
Eurasian continent. The fast responses
from .fj (Fiji) and .fm (Micronesia) are due
to their location in Seattle and Honolulu, respectively,
which in the Internet topology are advantageous
locations in relation to Thailand. If
the same test were conducted from a third party
server located somewhere else, the results would
be different.
[1]
A complete version of the review is available
in a paper (http://www.stat.go.jp/english/iaos/paper/survo.pdf)
presented at the IAOS Satellite Meeting on Statistics
for the Information Society, Tokyo, 30-31 August
2001.
Figure 1.
Average response time of statistical web sites
to ICMP ping requests from Bangkok.
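A response-time test of this kind amounts to averaging the round-trip times reported for a series of ping requests. The sketch below shows one way to do that averaging from saved ping output. It is an illustration only: the article does not describe how its test was scripted, the sample output is invented (192.0.2.1 is a documentation address), and the "time=... ms" line format is an assumption that varies between platforms.

```python
import re
from statistics import mean
from typing import Optional

# Matches the round-trip time in lines such as "... time=12.0 ms".
# The exact ping output format differs between operating systems.
_RTT_PATTERN = re.compile(r"time[=<]([\d.]+)\s*ms")


def average_rtt_ms(ping_output: str) -> Optional[float]:
    """Average round-trip time in milliseconds, or None if no reply was seen."""
    times = [float(match.group(1)) for match in _RTT_PATTERN.finditer(ping_output)]
    return mean(times) if times else None


# Invented sample output: three replies and one lost request.
sample_output = """\
64 bytes from 192.0.2.1: icmp_seq=1 ttl=56 time=12.0 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=56 time=14.0 ms
Request timeout for icmp_seq 3
64 bytes from 192.0.2.1: icmp_seq=4 ttl=56 time=16.0 ms
"""

print(average_rtt_ms(sample_output))
```

Lost requests are simply ignored here; a fuller comparison would probably report the loss rate alongside the average.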
The technology of each web server
was further investigated through Netcraft's
detection service (http://www.netcraft.com).
Apache and Microsoft Internet Information Server
were by far the most common servers (see Table
3). In addition, Netscape Enterprise and
Lotus Domino servers were hosting three and
two web sites, respectively. The Apache
web servers were running on various Unix derivatives,
the most popular being Linux (5 servers) and
Solaris (4 servers). All fourteen MS-IIS
servers, as well as the two Lotus Domino servers,
ran on Windows NT4. The Netscape Enterprise
servers were on Solaris. Judging from
the name of the net block owner, three quarters
of statistical and census web servers were maintained
externally.
Table 3.
Statistical and census web servers by type and
location of hosting, June 2001
                                  Net block owned by
                                  NS/census office  Outsider  Total  Share, per cent
Apache 1.3.x                                     1        14     15               44
Microsoft-IIS/4.0                                4        10     14               41
Netscape Enterprise 3.6 - 4.1                    2         1      3                9
Lotus Domino 5.0.x                               2         -      2                6
Total number of servers                          9        25     34              100
Share, per cent                                 26        74    100
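The share figures in Table 3 follow directly from the server counts. As a minimal sketch, assuming shares are rounded to whole per cent, the tally could be reproduced as follows; the variable and function names are illustrative.

```python
# Server counts from Table 3: (hosted by NS/census office, hosted by outsider).
server_counts = {
    "Apache 1.3.x": (1, 14),
    "Microsoft-IIS/4.0": (4, 10),
    "Netscape Enterprise 3.6 - 4.1": (2, 1),
    "Lotus Domino 5.0.x": (2, 0),
}


def share_by_server(counts):
    """Share of each server type in all servers, in whole per cent."""
    total = sum(own + outside for own, outside in counts.values())
    return {
        name: round(100 * (own + outside) / total)
        for name, (own, outside) in counts.items()
    }


def share_hosted_outside(counts):
    """Share of servers whose net block is owned by an outsider, in whole per cent."""
    total = sum(own + outside for own, outside in counts.values())
    outside_total = sum(outside for _, outside in counts.values())
    return round(100 * outside_total / total)


print(share_by_server(server_counts))
print(share_hosted_outside(server_counts))
```

The second figure corresponds to the "three quarters" of servers described in the text as maintained externally.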
Data capture technologies
The results from the 2000/2001
round of censuses are being tabulated faster
than ever before. The March 2001 ESCAP
Workshop concluded that data capture through
OCR/ICR had become a proven technology that
could make significant cost, timeliness and
accuracy improvements in census data capture.
Several countries that were using OCR or ICR
technology for the first time had released preliminary
results (based on the whole population) in a
matter of a couple of months.
Although the learning curve to master OCR/ICR
is relatively steep, the technology has lowered
the total cost of census taking, in some countries
by 50 per cent or more. The scanners and
recognition software are rather expensive, but
the cost can be moderated by using the same
technology in several censuses and surveys and
by sharing it with other agencies.
Twelve of the 24 offices participating in the
Workshop indicated that their offices
still relied on keyboard entry; two used OMR
and nine OCR/ICR. The only country to
offer the possibility of submitting information
through the Internet was Singapore, where eventually
15 per cent of the population chose this option
(see
article on page 7). Other Singaporeans
responded either to computer-aided telephone
interviews (CATI) or to person-to-person interviews.
Data capture technology in the 2000 round of censuses
in selected ESCAP members and associate members
Keyboard entry      OMR          OCR/ICR        Internet+CATI+OCR
Brunei Darussalam   Bangladesh   Australia      Singapore
Cambodia            Pakistan     Bangladesh
Indonesia                        China
Kiribati                         India
Malaysia                         Indonesia
Mongolia                         Macao, China
Nepal                            New Zealand
Papua New Guinea                 Philippines
Republic of Korea                Thailand
Samoa
Sri Lanka
Viet Nam
The 'beauty' of optical recognition
technologies is that after the questionnaire
forms have been scanned into images, they can
be split into pieces, question by question or
character by character, for recognition in a
priority order. Thus, data tabulation
and analysis can be started from the most important
information and almost immediately after imaging.
That is a major advantage over manual keyboard
entry, which normally progresses form by form.
Handwritten open responses and questions requiring
manual coding can be dealt with later as experts
and verifiers working on them make progress.
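The idea of recognizing scanned fields in priority order, rather than form by form, can be sketched with a priority queue. This is an illustration under stated assumptions, not a description of any system presented at the Workshop: the field names and priority values are hypothetical, and real recognition software would queue image snippets rather than names.

```python
import heapq

# Hypothetical priorities: lower numbers are recognized first.
FIELD_PRIORITY = {"age": 0, "sex": 0, "occupation_text": 2}


def recognition_order(form_ids, field_priority):
    """Yield (form_id, field_name) pairs, most important fields first.

    Every scanned form is split into one work item per field, and all
    items of a higher priority are processed before any lower one.
    """
    queue = [
        (priority, form_id, field_name)
        for form_id in form_ids
        for field_name, priority in field_priority.items()
    ]
    heapq.heapify(queue)
    while queue:
        _, form_id, field_name = heapq.heappop(queue)
        yield form_id, field_name


for form_id, field_name in recognition_order([1, 2], FIELD_PRIORITY):
    print(form_id, field_name)
```

With two forms, the age and sex fields of both forms are handled before either handwritten occupation field, which is the ordering that lets tabulation of key items start early.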
Comparison of two data capture strategies
The ESCAP Workshop agreed that
data capture through OCR/ICR has become a proven
technology that can make significant cost, timeliness
and accuracy improvements in census data processing.
Below is a comparison of two experiences shared
at the Workshop:
Philippines
Optical numeric
recognition.
Four regional data
capture centres, each having a
Windows NT network
with five mid-volume scanners (Kodak
3510), fifteen Pentium III workstations,
three magneto-optical disk drives, three
CD-writers, a network printer and a
500 MHz Pentium III server with 90 GB
hard disk capacity.
Software: Kodak
MVCS for scanning, Eyes and Hands for
Forms for ICR, and a tailor-made Census
Progress Monitoring System.
The four data capture
centres were operated by a total of
146 persons, in two shifts, six days
a week.
A work shift was
staffed by a shift supervisor, four
data controllers (preparing forms for
scanning and checking the validity of
geographic codes), five scanner operators,
four verifier operators and an operator
for file preparation and transfer.
Indonesia
Optical numeric
recognition.
Decentralized data
capture in 41 centres having a total
of 79 scanners at their disposal.
Scanning, recognition,
verification and editing stages.
Kodak DS Scanners
3500 in the central office and in provincial
offices.
Results (Philippines)
Over 15 million
forms scanned.
Reduction of staff
required for capturing the data from
600 persons in 1995 to 146 in 2000.
Nearly perfect
recognition rate for OMR fields.
For handwritten
fields a much lower rate.
Average recognition
rate of 90-95 per cent.
Average speeds
for interpretation and verification
3,400-3,500 and 270-320 forms per hour,
respectively.
Results (Indonesia)
55 million double-sided
household forms (representing the number
of households in Indonesia) scanned.
Nearly perfect
OMR recognition rate.
Recognition of
numbers at a lower rate.
Human intervention
by enhancing the quality of numbers
did not markedly improve the recognition
results.
Main problems
The configuration
had too few (only four) software licences
for data verification; 8-10 verification
licences would have been optimal.
Uneven quality
of the printed forms.
Handwriting entries
illegible or too faint, which increased
the work needed before scanning and
at the verification stage.
Some forms had
to be enhanced or rewritten before scanning.
The Singapore Department of Statistics
implemented a ground-breaking Internet census
information submission in its 2000 census.
Of all census respondents, 15 per cent chose
to submit their information through the Internet
while others responded either to computer-aided
telephone interviews (CATI) or to person-to-person
interviews.
The system was available in the English language
only and represented the second generation of
Internet data collection systems in Singapore.
The first one, for the Business Expectation
Survey, was launched in March 1998.
The Singaporean Internet data collection system
was designed keeping in mind nine target features,
namely (i) fast performance, (ii) user-friendliness,
(iii) security, (iv) stability, (v) compatibility
with a large number of browser platforms, (vi)
possibility to continue form completion in another
user session, (vii) integration with other data
collection modes, (viii) intelligent branching
of questions, and (ix) verification during and
after completion of the form. Given the
existing technology, many of those requirements
are still in obvious contradiction with each
other.
The Department of Statistics used prototyping
and intensive user-acceptance testing to fine-tune
the system. The front page of the census
site was made small in size (kilobytes) and
the web form was split into many parts in order
to achieve satisfactory performance for users.
For the same reason, the automated
checks, which were first built into the form,
were reduced in number and moved to the server side.
Special attention was paid to the clarity of
the form layout, questions and definitions.
During the enumeration period, hotline telephone
support was available, and in response to the
feedback, frequent system upgrades were made.
High-level security was maintained at all times,
with escalation procedures and plans for contingencies
in place.
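Two of the target features, intelligent branching of questions and verification moved to the server side, can be illustrated with a small sketch. The questions, rules and function names below are entirely hypothetical and are not taken from the Singapore census system; they only show the general shape of such logic.

```python
from typing import Dict, List, Optional


def next_question(answers: Dict[str, str]) -> Optional[str]:
    """Branching: return the next question to ask, or None when done.

    Hypothetical rule: employment questions are skipped for persons
    under 15, and occupation is asked only of employed persons.
    """
    age = answers.get("age", "")
    if not age.isdigit():
        return "age"
    if int(age) < 15:
        return None
    if "employed" not in answers:
        return "employed"
    if answers["employed"] == "yes" and "occupation" not in answers:
        return "occupation"
    return None


def validate(answers: Dict[str, str]) -> List[str]:
    """Server-side checks run on a submitted form; returns error messages."""
    errors: List[str] = []
    age = answers.get("age", "")
    if not age.isdigit() or not 0 <= int(age) <= 120:
        errors.append("age must be a whole number between 0 and 120")
        return errors
    if int(age) >= 15:
        if answers.get("employed") not in ("yes", "no"):
            errors.append("employed must be yes or no")
        elif answers["employed"] == "yes" and not answers.get("occupation", "").strip():
            errors.append("occupation is required for employed persons")
    return errors


print(next_question({"age": "30"}))
print(validate({"age": "30", "employed": "yes"}))
```

Keeping the checks in one server-side function means the web form itself can stay small, which is in line with the performance concern described above.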
Some other countries, including Australia and
Switzerland, used the Internet for census
data collection in 2001. If well implemented,
the technology platform is not the main obstacle
in Internet collection. The main concerns
are related to perceived data security and potential
bias that the collection method could cause.
Put census data on the Internet
The ESCAP Workshop agreed that
the ultimate goal was that all publishable census
data should be made available on the Internet.
That goal is today well within the reach of
currently available technology as large volumes
of data can be made accessible more easily and
cheaply than ever before.
In a modern Internet development strategy,
the same facility is designed to cater for the
needs of both internal and external users.
Well-designed web sites could deliver data both
to general data users, such as students, pensioners,
libraries, and small businesses, and to analysts
with more complex and often voluminous requirements
including an interest in detailed metadata.
Statistics New Zealand is planning to expand
the use of intermediaries in connection with
its 2001 census, including the media, libraries,
information brokers and bundlers, channel managers
of high speed networks, community organizations
and government organizations that already have
close contacts with user groups. The agency
will pay significant attention to improving
the navigation of the census web site, thereby
assisting users to service themselves. A more
user friendly site is expected to be achieved,
among other things, by using common language;
removing or explaining census jargon; increasing
the ways to access data, terminology, area breakdowns
and maps; and by improving sorting-by-topic
and other features of the search facility.
Compact disks ideal for delivering volumes of data
The ESCAP Workshop was given
some demonstrations of user friendly CDs that
were developed with public domain software.
The Cambodian 1998 census is available on four
CDs, containing priority tables at country,
province and district levels; mapping and graphing
database based on PopMap; a very large REDATAM-based
database containing microdata of all person
and housing records; and aggregated data for
Cambodia's 13,339 villages in six DBF-databases,
each covering a different topic. The visual
effectiveness and user friendliness of the PopMap-based
CD was particularly noted by the Workshop.
The GIS application consists of detailed maps
for Cambodia, its provinces, districts and communes,
with line layers for the main routes and rivers
and point layers for the villages and schools.
A total of 123 different indicators down to
the commune level formed the heart of the application.
The Viet Nam census CD is based on the IMPS
suite, including its database, cross tabulation,
and table and map viewer components.
Also presented at the Workshop were three leading
commercial data dissemination tools. Software
suites of Beyond
20/20, PC-Axis (Statistics
Sweden) and SuperSTAR (Space-Time
Research) are suitable for small and large
data sets, and have powerful desktop data manipulation
facilities and web based detailed data access
facilities. Their performance, especially
in terms of retrieval and tabulation speeds,
and the flexibility and ease of control, is
impressive and goes well beyond what off-the-shelf
database packages and some of the public domain
packages provide. Although sophistication
and performance naturally carry a price tag,
commercial dissemination packages are worth
evaluating when creating dissemination strategies.
Prices for statistical and census offices generally
depend on the population of the country, the
size of data sets involved, and the volume of
dissemination, and are generally subject to
one-to-one negotiation.
Data warehouse with a browser interface is today's
mass storage solution
In a traditional census database
model each census year has formed a dedicated
database with specialized codes and definitions.
In a modern warehouse approach, data from censuses
conducted at different times are combined with
other data. The ESCAP Workshop recognized,
however, that setting up a data warehouse is
a challenging process and involves a lot of
preparatory work, including standardization
of codes and definitions and cleaning of data.
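The standardization step can be pictured as mapping each census year's own codes onto one shared code list before the records are combined. The sketch below assumes a single variable with hypothetical codes for two census years; the code values, field names and function name are invented for illustration.

```python
# Hypothetical year-specific codes for one variable, mapped to shared codes.
CODE_MAPS = {
    1990: {"1": "M", "2": "F"},
    2000: {"M": "M", "F": "F"},
}


def harmonize(records, census_year, code_map):
    """Recode one census year's records to the shared code list.

    Returns (clean, rejected): records with an unknown code are set
    aside for data cleaning instead of being loaded into the warehouse.
    """
    clean, rejected = [], []
    for record in records:
        shared_code = code_map.get(record["sex"])
        if shared_code is None:
            rejected.append(record)
        else:
            clean.append({**record, "sex": shared_code, "census_year": census_year})
    return clean, rejected


records_1990 = [{"id": 1, "sex": "1"}, {"id": 2, "sex": "9"}]
clean, rejected = harmonize(records_1990, 1990, CODE_MAPS[1990])
print(clean)
print(rejected)
```

Tagging each record with its census year is what allows data from different censuses to sit in one store and be compared over time.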
Compared to conventional data warehouses holding
transaction and business data, statistical data
warehouses have to facilitate more elaborate
data analysis. Statisticians and analysts
require that data warehouses facilitate highly
flexible data analysis, display metadata dynamically
during analysis, and allow the customization
of reports and other outputs.
The ESCAP Workshop noted that a thin-client
design, where most processing is done at the
server-end, is preferred for warehouses that
store huge volumes of census data. In
the system design, special attention needs to
be paid to the integration of data extraction
and data analysis tools, since statistical analysis
is usually an iterative process, requiring testing
of a large number of variables.
In their evaluation, the Singapore Department
of Statistics considered a hierarchical drilldown
a suitable method for selecting data from a
warehouse, especially when business metadata
are dynamically displayed. The ability
to save previously selected items is very important
for queries that are needed frequently or repeatedly.
A "drag and drop" type of interface would make
statistical analysis convenient: calculating
statistical parameters, such as the mean and
standard deviation, could be achieved by 'dropping
them into' data items (records or variables),
or vice versa, data items could be 'dropped
into' statistical parameters. Another
criterion that Singapore set for a data warehouse
package is the possibility of making revisions
to data both locally (affecting only the analyst)
and globally (affecting all users of the data
warehouse).
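The distinction between local and global revisions can be sketched as a layered lookup: each analyst works on a private overlay, while the shared data sits underneath. The following is a minimal illustration using Python's standard `collections.ChainMap`; the class, method names and figures are hypothetical and do not describe any particular warehouse package.

```python
from collections import ChainMap


class Warehouse:
    """Toy data store with shared values and per-analyst overlays."""

    def __init__(self, data):
        self._shared = dict(data)

    def revise_global(self, key, value):
        """Global revision: visible to every user of the warehouse."""
        self._shared[key] = value

    def view_for_analyst(self):
        """Return a view whose writes stay in a private first layer.

        Lookups fall through to the shared data unless the analyst
        has revised that item locally.
        """
        return ChainMap({}, self._shared)


warehouse = Warehouse({"pop_area_a": 100, "pop_area_b": 200})
analyst_1 = warehouse.view_for_analyst()
analyst_2 = warehouse.view_for_analyst()

analyst_1["pop_area_a"] = 110              # local revision
warehouse.revise_global("pop_area_b", 210)  # global revision

print(analyst_1["pop_area_a"], analyst_2["pop_area_a"])
print(analyst_1["pop_area_b"], analyst_2["pop_area_b"])
```

In this sketch a local revision takes precedence over a later global one for the same item; whether that is the desired behaviour would be a design decision for the warehouse.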
The Workshop agreed that graphical and topographical
tools, with integration to tabulation and drill-down
possibility into points of interest in a graph
or map, are also desirable features in a census
data warehouse. A good data warehouse
system supports saving of data outputs, including
data extracts, tabulations, analytical and other
reports, or graphs, in common data formats which
could be read by third party software.
GIS for effective presentation
Information presented on maps
is essential at almost all stages of census
operations. Therefore, one cannot do without
georeferenced databases. A grid square
database is a low cost alternative for presenting
small area data. It could be considered
by census organizations that do not have the
resources and expertise required for digitizing
the enumeration boundaries. The allocation
of households to grid squares is resource consuming
and requires fairly detailed maps. There
are simpler techniques that could be used for
allocating complete enumeration districts to
grid squares.
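The text does not name the simpler techniques, but one plausible example is to assign each enumeration district as a whole to the grid square that contains its centre point. The sketch below assumes this centroid approach, projected coordinates in metres and a square grid; the district data and names are invented for illustration.

```python
import math
from collections import defaultdict


def grid_cell(x, y, cell_size):
    """Column and row index of the grid square containing point (x, y)."""
    return (math.floor(x / cell_size), math.floor(y / cell_size))


def allocate_districts(districts, cell_size):
    """Sum district populations per grid square, using district centre points.

    Each district is given as (name, centre_x, centre_y, population) and
    is allocated in full to a single grid square.
    """
    totals = defaultdict(int)
    for _name, x, y, population in districts:
        totals[grid_cell(x, y, cell_size)] += population
    return dict(totals)


# Invented enumeration districts with centre points in metres.
districts = [
    ("ED1", 500, 700, 1200),
    ("ED2", 900, 100, 800),
    ("ED3", 1500, 200, 300),
]

print(allocate_districts(districts, cell_size=1000))
```

The approach is cheap because it needs only one coordinate pair per district, at the price of misplacing population where a district straddles several squares.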
The Workshop was given an overview of how the
United States
Census Bureau uses georeferenced data to
display census results. The Bureau's GIS
presentation system builds on the TIGER (Topologically
Integrated Geographic Encoding and Referencing)
database, which contains detailed geographic
features for the United States. TIGER
mapping was used at all stages of the 2000 census,
from enumeration to reporting of results.
The American FactFinder is a web-based system
for access and dissemination of Census Bureau
data on the Internet, built from TIGER boundaries
and other geographic information, census data
and metadata. The current elaborate online
version is a result of incremental work over
the past two decades, responding to the legislative
mandate to provide the public with full and free
access to census statistics. In the FactFinder,
it is possible to drill down through the maps (which
are based on vector graphics) from the country
level to the census block level.
Feedback from data users is a cornerstone of census dissemination
strategies in Australia and New Zealand
The Australian Bureau of Statistics
and Statistics New Zealand are among census
offices that pay significant and continuous
attention to evaluating their products and consulting
with users. Internal and external users
are also involved in prototype and acceptance
testing. The user feedback forms a basis
for their proactive product development strategy.
The web has emerged as the main dissemination
channel for the 2001 Australian and New Zealand
censuses, and their development efforts are
focused accordingly. Providing self-service
and dynamic access to data, they are planning
to make data users more self-reliant and to
lower the overall dissemination cost.
At the same time the role of printed material
is changing. Statistics New Zealand for instance
is phasing out some of the 'traditional' publications
and developing a capability to print any electronic
publication, on an individual basis, as and
when needed.
Another significant advantage of the Internet
is that it shortens the delivery time of census
data to users. Internet technology makes
the time of data release more predictable than
in conventional hard copy dissemination, as
the printing process and distribution often
take a longer time than expected.
Australia and New Zealand have decided to continue
publishing community profiles of key data from
their 2001 censuses as such products have proved
effective in raising public awareness of census
data and in increasing its use.
Elsewhere in the region, the ESCAP Workshop
observed that the participation of the private
sector in user consultations was often sporadic
and in some countries absent altogether.
Therefore, it encouraged census offices to contact
potential clients in the private sector and
involve them in producer-user consultations
and other promotional activities as equal customers.
The public at large and children were also recognized
as important clients.
Just as important as good design is the users'
awareness of available census products and services.
Census offices should establish marketing strategies
to inform established and potential users about
the benefits of census products. Those
strategies might use several modes of communication
and include visible product launches.
Maintaining ongoing awareness during and between
the census cycles is an important part of the
strategy.
ESCAP Workshop on Population Data Analysis, Storage
and Dissemination Technologies, Bangkok, 27-30
March 2001
Recommendations
The ESCAP Workshop made over
50 specific recommendations regarding census
data collection and capture, data storage and
analysis, and data dissemination strategies
and technologies. The topics on which
recommendations were made are listed below;
the detailed recommendations can be found in
the Workshop report, http://www.unescap.org/stat/pop-it/pop-wdt/wdt-rep.asp
Related to
census data collection and capture, recommendations
were made on:
the importance of careful
questionnaire form design in successful character
recognition
just-in-time training
of enumerators in filling out OCR/ICR forms
the use of proper pencils
or pens in marking OCR/ICR forms
the maintenance of scanners
the robustness of the
file management component of the data capture
chain
the testing of the proposed
data capture configurations in real situations
and making necessary modifications to them
bandwidth, security and
other considerations in Internet data collection
systems
the testing of Internet
data collection forms in different bandwidths
and improving the real and perceived
performance
data collection control
when Internet collection was accompanied by
other collection methods
Recommendations
regarding census data storage and analysis
related to:
using new technologies
to link census data longitudinally and with
other data sets
reviewing the applicability
of data warehousing technology when new storage
systems were considered
starting the building
of a data warehouse in a modular fashion and
with manageable data content, with business
and statistical considerations in mind
the high cost and effort
involved in setting up a data warehouse and
cleaning the data
building a central system
for maintaining statistical metadata
When considering
data users' needs and dissemination strategies,
census offices were recommended to:
adopt a proactive strategy
towards the improvement of data dissemination
diversify data dissemination
strategies and technology solutions according
to the needs of different types of users
utilize the possibility
offered by optical recognition to capture
and release census data gradually, starting
from key information
use prototyping and vigorous
testing to perfect dissemination products
use modern marketing
techniques to increase data use
choose hardware and software
platforms that are compatible with standard
technologies
provide web links to
national counterpart sites and other sites
containing useful census information
consider creating community
profiles of census data to increase their
use
With regard
to data dissemination through the Internet,
the Workshop recommended that statistical and
census offices
adopt the Internet as
part of their dissemination strategy, use
hypertext interface on CD-ROM, and use email
for data promotion and for disseminating summary
results
develop an internal policy
on the utilization of the Internet in general
and include the production of web material
in training programmes
create functional coordinating
mechanisms for web site management
improve internal web
site management skills through recruitment
and training
design census dissemination
sites for relatively low bandwidths by using
various page authoring and data access techniques
provide file formats
and scripts that all common browsers could
handle
include in web sites
census metadata in an easily accessible format
consider features that
help clients service themselves when accessing
census data
monitor the web site
traffic and adjust the site content and navigation
as the reports might suggest
pay special attention
to the clarity of information and test the
individual pages and the whole site thoroughly
provide the most popular
content in static HTML in order to improve
the site performance
be prepared to adjust
the number of servers and balance the load
as the traffic increases
to ensure the uptime
of the public web site, use separate servers
for resource-consuming tasks
keep production servers
isolated from the Internet
consider using XML to
code structured data pages
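As an illustration of the last recommendation, a small table can be coded as XML so that the structure of the data travels with the page. The element and attribute names below are invented for this sketch and do not follow any particular statistical XML standard; the figures are placeholders.

```python
import xml.etree.ElementTree as ET


def table_to_xml(title, rows):
    """Encode (area, population) rows as a small XML document string."""
    root = ET.Element("table", {"title": title})
    for area, population in rows:
        row = ET.SubElement(root, "row", {"area": area})
        ET.SubElement(row, "population").text = str(population)
    return ET.tostring(root, encoding="unicode")


# Placeholder figures, not census results.
xml_text = table_to_xml("Population by area", [("Area A", 100), ("Area B", 250)])
print(xml_text)

# A client can read the values back without parsing page layout.
parsed = ET.fromstring(xml_text)
for row in parsed.findall("row"):
    print(row.get("area"), row.findtext("population"))
```

Because the data are separated from presentation, the same file could feed a static HTML page for popular content as well as downloads for analysts.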
Recommendations
on using geographical information systems included:
starting the application
of GIS from low-cost alternatives and moving
to advanced GIS technology when skills improved
considering grid square
GIS as an alternative for presenting census
data on maps
the visually effective
use of low-end GIS and high-end GIS