|
|
|
|
| Workshop on
Application of New Information Technology to Population
Data |
| Bangkok,
12-20 October 1999 |
STAT/WNIT/Rep
16 June 2000
ENGLISH ONLY
ECONOMIC AND SOCIAL COMMISSION FOR ASIA AND THE
PACIFIC
|
| Report on
the Workshop on Application of New Information
Technology to Population Data |
The designations employed and the presentation
of the material in this report do not imply the
expression of any opinion whatsoever on the part
of the Secretariat of the United Nations concerning
the legal status of any country, territory, city
or area, or of its authorities, or concerning
the delimitation of its frontiers or boundaries.
Mention of any firm, licensed process or product
does not imply endorsement by the United Nations.
This report has been issued without formal editing.
|
| Contents |
Abbreviations
and Descriptions
- ORGANIZATION
OF THE WORKSHOP
- Attendance
- Opening
of the Workshop
- Workshop
arrangements
- Documentation
- INTRODUCTION
TO INFORMATION TECHNOLOGY IN CENSUS OPERATIONS
- Project
RAS/96/P12
- Objective
of the Workshop
- Census
processes
- Technology
applied in recent censuses and surveys
- IT
trends
- Quality
management
- Expectations
for the Workshop
- PAPER
BASED DATA COLLECTION AND CAPTURE
- Optical
Mark Recognition (OMR)
- Demonstration
of Optical Mark Reader (OMR)
- Optical
Character Recognition (OCR)
- OCR
technology for the Indonesian Census
2000
- Demonstration
of OCR cluster
- New
Zealand experience in 1996
- Observations
and recommendations on OCR
- Archiving
of census forms
- NON-PAPER
BASED DATA COLLECTION AND CAPTURE
- Computer
Assisted Telephone Interviewing
- Internet
and CATI in Singapore Census 2000
- Computer
Assisted Personal Interviewing
- IMPLICATIONS
FOR THE GUIDELINES ON THE APPLICATION OF NEW
TECHNOLOGY TO POPULATION DATA COLLECTION AND
CAPTURE
- ADDING
VALUE TO CENSUS DATA THROUGH DATA WAREHOUSING
AND DATA MINING
- DATA
DISSEMINATION
- Implications
for the guidelines on the Application
of New Information Technology to Population
Data Dissemination
- GEOGRAPHIC
INFORMATION SYSTEMS
- Implications
for the guidelines on the Application
of Geo-Positioning Systems and Geographic
Information Systems for Digital Mapping
and Statistical Management
- RECOMMENDATIONS
OF THE WORKSHOP
- General,
IT management
- Data
collection and capture
- Guidelines
- Data
warehousing, databases, data archiving
- Data
Dissemination
- Mapping
and GIS
- Follow
up
Annex I: List of Participants
Annex II: Tentative Time Schedule
Annex III: List of Documents |
| ABBREVIATIONS
AND DESCRIPTIONS |
| AFPS Pro |
|
Comprehensive application
for high-volume forms processing based on
advanced imaging technologies. |
| ArcInfo |
|
Comprehensive GIS software
for a variety of computing environments. |
| ArcView |
|
Desktop mapping and GIS
software. |
| Blaise |
|
A survey data collection
and processing system. |
| CAPI |
|
Computer Assisted Personal
Interviewing. |
| CARS |
|
Classifications and Related
Systems. |
| CATI |
|
Computer Assisted Telephone
Interviewing. |
| CGI |
|
Common Gateway Interface
facilitating dynamic content provision from
web servers to client computers. |
| CSV |
|
"Comma Separated Values" format. An ASCII file format
that is commonly used as an intermediate format
when transferring data between databases
and spreadsheets of different makes.
Values are separated by commas and are often
enclosed in quotation marks. |
| dpi |
|
dots per inch. |
| EA |
|
Enumeration Area. |
| FLY (fly) |
|
C program that creates
GIF image files on the fly from CGI and
other programs. |
| GIS |
|
Geographic Information
System. |
| GPS |
|
Global Positioning System. |
| HTML |
|
HyperText Markup Language. |
| ICR |
|
Intelligent Character
Recognition. |
| IMPS |
|
Integrated Microcomputer
Processing System. |
| IT |
|
Information technology. |
| KFI |
|
Keying-from-image. |
| KFP |
|
Keying-from-paper. |
| LAN |
|
Local Area Network. |
| MapInfo |
|
Software product for mapping,
data visualization and GIS. |
| NCS Nestor Reader |
|
Development tool for building
forms processing or automatic data capture/entry
applications. |
| NSO(s) |
|
National Statistical Office(s). |
| OCR |
|
Optical Character Recognition. |
| OLAP |
|
Online Analytical Processing. |
| OMR |
|
Optical Mark Recognition/Reader. |
| PC |
|
Personal computer. |
| PDF |
|
Portable Document Format. |
| PopMap |
|
Integrated geographical
software providing maps and a graphics database. |
| PQM |
|
Process Quality Management. |
| SAS |
|
Statistical Analysis System. |
| SIAP |
|
Statistical Institute
for Asia and the Pacific. |
| SPSS |
|
Statistical Package for
the Social Sciences. |
| SQL |
|
Structured Query Language. |
| SQM |
|
Statistical Quality Management. |
| SuperCROSS |
|
Fast cross-tabulation
software. |
| SuperMAP |
|
Mapping software. |
| TCDC |
|
Technical Cooperation
among Developing Countries. |
| TIFF format |
|
Tag Image File Format. |
| TREND |
|
Time Series Retrieval
and Dissemination Database. |
| UNFPA |
|
United Nations Population
Fund. |
| UNFPA/CST |
|
United Nations Population
Fund /Country Support Team. |
|
|
|
|
| I.
ORGANIZATION OF THE WORKSHOP |
| A.
Attendance |
| 1. |
The Workshop on Application of New Information
Technology to Population Data, funded by the
United Nations Population Fund (UNFPA) under
the project RAS/96/P12, was held in Bangkok
from 12 to 20 October 1999. It was organized
by the secretariat of the Economic and Social
Commission for Asia and the Pacific of the United
Nations (ESCAP) with active support of the Working
Party on the Application of New Technology to
Population Data.
|
| 2. |
The Workshop was attended by thirty-one participants
from nineteen selected countries/areas in the
Asian and Pacific region: Bangladesh; Fiji;
Hong Kong, China; India; Indonesia; Islamic
Republic of Iran; Kazakhstan; Malaysia; Maldives;
Mongolia; Myanmar; Nepal; Pakistan; Philippines;
Republic of Korea; Samoa; Sri Lanka; Thailand
and Viet Nam.
|
| 3. |
The members of the Working Party,
consisting of nine experts from Australia; Bangladesh;
Indonesia; Japan; Macao, China; New Zealand; Philippines;
Singapore and Thailand; and representatives of
the Statistical Institute for Asia and the Pacific
(SIAP), and UNFPA Country Support Teams for East
Asia, and Central and South Asia participated
as resource persons. Invited private sector
companies also participated as observers and made
presentations. |
| 4. |
The list of participants is
attached as Annex I. |
| B.
Opening of the Workshop |
| 5. |
The Workshop was inaugurated
by Ms Kayoko Mizuta, the Deputy Executive Secretary
of ESCAP. In her opening statement, Ms Mizuta
welcomed the participants and thanked the donor
agency and resource persons for the role they had played
and the commitment they had shown in the organization and
funding of the Workshop. She appreciated
the cooperation extended by private sector organizations
to the Workshop. She noted that the Workshop
was one of the outputs of the ESCAP project RAS/96/P12
and that it was organized under the guidance of
the Working Party on the Application of New Technology
to Population Data. Apart from the Workshop,
other major outputs of the Working Party included
three guidelines on (a) population data collection
and capture; (b) modern mapping and GIS; and (c)
population data dissemination. |
| 6. |
In noting the benefits of new
technology to statistical services in the region,
Ms Mizuta emphasized the role information technology
(IT) played in reducing costs of census and survey
operations. While it was not possible to
present the full spectrum of technological innovations
in just one Workshop, she hoped that, by sharing
information and experiences in significant areas
of IT, participants would enrich and further improve
their understanding of new technologies relevant
for their operations. Ms Mizuta closed her opening
statement by highlighting that the Workshop materials
would be made available through the project web
site and by wishing the Workshop success. |
| C.
Workshop arrangements |
| 7. |
The Workshop noted that the
time schedule (see Annex II) prepared by the secretariat
was based on the tentative agenda, and agreed
to proceed accordingly in six modules as follows:
|
| Module | Organizer |
| 1. Introduction to IT in census operations | ESCAP secretariat |
| 2. Paper based data collection and capture | Indonesia and Japan |
| 3. Non-paper based data collection and capture | Singapore and Australia |
| 4. Adding value to census data through data warehousing and data mining | ESCAP secretariat |
| 5. Data dissemination | New Zealand |
| 6. Geographic information systems | Bangladesh |
|
| 8. |
The Workshop acknowledged with
thanks the following presentations and support
by private sector companies: |
| Topic | Presenter |
| 2.3 Is OMR technology still feasible? | DRS Data and Research Services plc, United Kingdom |
| 2.4 Census Success Story: US Census | Kodak (United States) |
| 2.6 Imaging for Census Data Capture | Kodak Philippines Ltd. |
| 2.8 Demonstration of pilot application in Statistics Indonesia (hardware support) | Fujitsu, Thailand |
| 2.9 Integrated demonstration on forms | Co-ordinated by Scientific Digital Business, Thailand |
| - Forms capture | Kodak |
| - Forms recognition | Top Image Systems |
| 4.1 Data warehouse implementation approach and methodology | Unisys Thailand Ltd. |
| 4.2 SAS approach and fitness to data warehouse processes | SAS Institute Pte Ltd, Bangkok, Thailand |
| 4.3 SAS demonstration | SAS Institute Pte Ltd, Bangkok, Thailand |
| 6.2 Production of quality maps for censuses | Kevron Pty. Ltd, Australia |
|
| D.
Documentation |
| 9. |
The documents presented at the
Workshop are listed in Annex III to the report. |
| II.
INTRODUCTION TO INFORMATION TECHNOLOGY IN CENSUS
OPERATIONS |
|
|
| A.
Project RAS/96/P12 |
|
|
| 10. |
The Workshop noted the extensive
activities and outputs of the UNFPA-funded project
RAS/96/P12, entitled the Application of New Technology
in Population Data Collection, Processing, Dissemination
and Presentation, and its Working Party on Application
of New Technology to Population Data. The
project had been initiated in April 1997 with
the objective of improving the capabilities of
member and associate member countries/areas of
ESCAP in the application of modern information
technology (IT) in population statistics production
and dissemination. |
| 11. |
The Workshop reiterated the
importance of providing valid, reliable and timely
data for developing population policies and programmes.
The application of modern IT would be more important
than ever in achieving that goal. |
| 12. |
It was noted that the ability
to exploit modern IT varied greatly in the region,
but that diversity also offered an opportunity
for intra-regional cooperation. Thus, the
basic thrust of the project was to share the experiences
of NSOs that had made significant progress in
exploiting new technology. At the beginning
of project implementation, a Working Party was
established with experts from nine countries to
identify priorities, to provide guidance in the
systematic application of IT, to consolidate the
experience of the countries and to share those
experiences within the region. |
| 13. |
Since 1997, the Working Party
had met four times to identify and discuss the
topics of principal interest to the project.
Each meeting had focused on one of the technology
areas for which members had contributed a large
number of technical papers. Other project
outputs included self-contained guidelines on
the application of new technology to three important
aspects of census processing, namely (a) population
data collection and capture; (b) mapping and geographic
information systems; and (c) population data dissemination.
The Working Party also guided the implementation
of three pilot projects under RAS/96/P12, one
each by the NSOs of Bangladesh, Indonesia and
Philippines, to test such new technologies.
Each project would produce a report at the Workshop
describing the technologies piloted and experiences
gained. |
| 14. |
The Workshop noted that further
outputs of the project included five newsletters,
a web site containing documents of the Working
Party meetings, an awareness package to promote
effective and efficient utilization of IT in population
census and survey processing, and a survey on
the application of IT within the region. |
|
|
| B.
Objective of the Workshop |
|
|
| 15. |
The participants noted that
the overall objective of the Workshop was to sensitize
participants to the opportunities that modern
information technology provided in population
data operations. Immediate objectives of
the Workshop were (a) to provide information that
would improve the basic understanding of new technologies
relevant to population censuses and surveys; (b)
to discuss advantages and constraints of important
new information technologies; (c) to consider
strategic implications that information technology
would have on the planning, conduct and processing
of population censuses and surveys; and (d) to
facilitate the understanding of the overall role
of new technology in conducting censuses and surveys. |
|
|
| C.
Census processes |
|
|
| 16. |
The Workshop reviewed major
processes and activities associated with the conduct
of censuses or large-scale population surveys.
Three distinct phases were identified. The
pre-enumeration stage included census planning,
census organization, questionnaire design, forms
and manuals drafting, cartography, publicity,
data processing system design and development,
and the conduct of the pilot census. The
census planning entailed obtaining legal and financial
support from the Government, estimating resource
requirements, preparing budgets and scheduling
the event. The census organization established
central and field offices, created national and
regional committees and co-ordinated with other
Government offices. The questionnaire design
required dialogue with potential users and was
a precursor to developing the tabulation plan.
The questionnaire, forms, manuals and the data
processing system were tested during the pilot
census. The enumeration stage included the
recruitment and training of field workers, the
establishing of house listings, the actual enumeration
and the post-enumeration survey. The post-enumeration
stage included the data processing from data capture
to final tabulations, the analysis of results,
the evaluation of the census process, and the
dissemination of reports. |
| 17. |
The Workshop noted that, during
the previous round of censuses, countries of the
region had needed from 3 to 7 years in order to
complete a census programme from the initial planning
stage until the basic results were disseminated. |
|
|
| D.
Technology applied in recent censuses and surveys
|
|
|
| 18. |
The Workshop reviewed the results
of the ESCAP Survey on Application of New Technology
in Population Data Collection, Processing and
Dissemination, conducted in April 1998.
The questionnaire had been sent to 56 national
statistical offices in the Region and 29 responses
were returned. The report was published
as document STAT/WNIT/1 and was made available
to the participants of the Workshop. |
| 19. |
The survey had revealed a broad
infrastructure gap among the countries of the
region. Technologically advanced offices
provided network-connected PCs for every staff
member, including individual e-mail addresses
and instant Internet connections. Offices
with the weakest IT infrastructure had practically
no internal or global network connectivity available
for general use and as many as 15 persons had
to share a PC. |
| 20. |
According to the Survey, on
average it took 17 months from the beginning of
data collection to the tabulation and analysis
of results. In some cases, up to four years
were needed. |
| 21. |
The Workshop noted that technologically
advanced NSOs developed applications in-house
and used IT across all operations. Such
custom-made applications were typically developed
in areas of data scrutiny, data editing, data
estimation and tabulation, whereas data analysis
was usually conducted with commercially available
statistical software packages. Overall,
the survey indicated significant use of off-the-shelf
software packages, and there was no significant
difference in the prevalence of brand names between
developed and developing countries. |
|
|
| E.
IT trends |
|
|
| 22. |
The Workshop reviewed recent
trends in information technology and noted that
hardware and software developments produced data
processing systems with ever increasing power,
capacity and complexity which at the same time
had become easier to use and cheaper to acquire. |
| 23. |
Chip processing speeds commonly
available were 400 MHz or better, while RAM sizes
mostly exceeded 32 MB. Together with graphics
accelerators and other technical features, that
configuration translated into substantial processing
power which in turn was a basis for the development
of increasingly capable software systems.
Disk storage systems of 6 GB or more and with
random access times of a few milliseconds came
as standard equipment with current desktop computers
and were sufficient to store the entire census
data files for a medium size country of 100 million
people. Optical storage media with 5 to
18 GB capacities were readily available and could
be used for the long-term storage of census data.
Processing and storage/retrieval speed was no
longer a constraint when scheduling the data processing
operations. Rather, delays caused by slow
human interventions were very often responsible
for the overall processing elapsed time. |
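The storage claim above can be checked with simple arithmetic. The following is only a back-of-envelope sketch: it assumes decimal gigabytes and one compact, pre-coded record per person, neither of which is stated in the report.

```python
# Back-of-envelope check of the storage claim in paragraph 23.
# Assumptions (not from the report): 1 GB = 10**9 bytes, and the census
# file holds exactly one fixed-length record per enumerated person.
DISK_BYTES = 6 * 10**9        # 6 GB desktop disk
POPULATION = 100 * 10**6      # country of 100 million people

# Bytes of disk space available for each person record.
bytes_per_person = DISK_BYTES / POPULATION
print(bytes_per_person)
```

Under these assumptions each person record may occupy up to 60 bytes, which is plausible for a questionnaire whose answers are stored as short numeric codes.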
| 24. |
Various versions of the Microsoft
Windows operating system were currently being
used on a large majority of all desktop computers.
General purpose and dedicated software were widely
available for the Windows platform, some obtainable
at low cost or no cost at all, and sufficed to
manage most data processing tasks at the statistical
office. |
| 25. |
While individual desktop computers
already had substantial and often sufficient
processing power, using local area networks with
a dedicated file server further enhanced the efficiency
of the entire operation by pooling resources,
reducing or eliminating redundancies, and centrally
managing common tasks such as data back-up.
Where infrastructure permitted, wireless communications
were becoming an important tool for the interfacing
between various computer components. The
Internet with features such as e-mail and World
Wide Web had gained importance firstly for the
dissemination of information about the statistical
office and its products and secondly for collecting
data from respondents. |
| 26. |
Thus, virtually all phases of
the census process could benefit from the latest
technologies. Those would include project
planning software, geographic information systems,
paperless data capture methods, scanning with
mark, character and intelligent recognition techniques,
automatic or computer assisted coding and editing
methods, metadata systems, CD/DVD and Internet/World
Wide Web media, etc. |
|
|
| F.
Quality management |
|
|
| 27. |
The Workshop noted that quality
control during all census phases posed a major
challenge from data collection to data validation
and editing, tabulation and dissemination.
Process quality management (PQM) focused on careful
planning and efficient implementation of the census
process, including human resource management and
the management of production means. Statistical
quality management (SQM) related to the management
of the metadata database and the integrity of
the data during the entire process of transformation
from raw data to publishable micro databases and
statistical tables. A better quality of
the end product would assure greater user satisfaction. |
| 28. |
The Workshop noted further that
quality management issues were often underestimated.
The introduction of new technologies could provide
an opportunity to give special consideration to
the application of quality management principles
for the entire census operations. Census
managers were urged to assess each new application
in respect of its potential capability to control
process as well as statistical qualities.
They also needed to assess the impact of the new
technology on non-computerized statistical, management
and administrative processes and organization
structures. Moreover, as each application
could interfere with others, special attention
needed to be paid to interoperability. |
| 29. |
The Workshop considered that
many new technologies might be presented during
the course of the Workshop that would be of interest
to IT management involved in the planning and
processing of the forthcoming census. This
wealth of new information posed another considerable
challenge to IT management who would be required
to select a combination of IT solutions that fits
the existing infrastructure. In that selection
process, IT management should not overlook the
effect those new technology solutions would have
on the ability to maintain or improve both process
and statistical quality management. |
|
|
| G.
Expectations for the Workshop |
|
|
| 30. |
The participants were invited,
based on the agenda and without having yet heard
the presentations, to rate their interest in the
various Workshop topics. Six work groups
were created to deliberate on the question.
The findings for each group were presented to
the other participants. It appeared that
Module 2, paper based data collection and capture,
received the highest interest from participants,
probably because the next census date was close
for many countries and solutions needed to be
found soon. The respondents
also expressed high interest in the topics of
dissemination and geographic information systems.
However, non-paper based data capture methods and
data warehousing received lower advance interest,
probably because those technologies required sufficiently
developed infrastructure and general technological
advancement which only the most advanced countries
had. |
| 31. |
The Workshop agreed that one
of the important expectations for the 2000 round
of censuses was to significantly reduce the time
needed for the entire census process, from planning
to final reporting, by employing some of these
new technologies in the various stages of census
data processing. Also, the final quality
of processed data could be improved by better
quality control throughout the process.
Furthermore, a wider and more targeted audience
could be reached by employing better dissemination
methods utilizing effective application of IT.
Significant quality and timeliness gains could
be achieved by improving data collection and capture
methods and much effort could be spared when preparing
census maps by using Geographic Information Systems.
Finally, where possible, increased use of the
Internet, including the World Wide Web, showed
great promise for more efficient information exchange. |
| 32. |
However, the Workshop emphasized
that individual countries would have to consider
the level of local infrastructure and resource
availability when deciding on the use of any of
the available technologies. The availability
of technical support and maintenance was of crucial
importance to the successful utilization of new
technologies. |
|
|
| III.
PAPER BASED DATA COLLECTION AND CAPTURE |
|
|
| 33. |
The Workshop was presented with
an overview of paper based data collection and
capture technologies. It was noted that
traditional key-to-disk methods were time consuming,
demanded a large quantity of equipment and personnel
and were, due to the human factor, not always
fully reliable. Employing technology-assisted
solutions would improve efficiency, economy and
reliability in the data capture process.
Optical mark and character recognition systems
were well tested, had become increasingly versatile
and reliable, and could therefore significantly
reduce the time needed for data capture and make
subsequent processing more flexible. Imaging
technology in particular promised improved efficiency
by largely eliminating the need to return at later
processing stages to paper based documents that
were always cumbersome to handle. Experience
showed that keying from image could be more efficient
than keying from paper, which could particularly
benefit the coding and editing tasks. |
|
|
| A.
Optical Mark Recognition (OMR) |
|
|
| 34. |
Based on the example of Japan,
the Workshop received a detailed introduction to
optical mark reader (OMR) technology. The
various hardware components of an OMR system comprised
a feeding unit, a photoelectric conversion unit,
and a recognition control unit. The feeding
unit consisted of a hopper for documents to be
read and several stackers for accepted and rejected
documents. The photoelectric conversion
unit used sensors to convert marks on the document
to electric signals and forwarded the signals
to the image memory. Finally, the recognition
control unit read those images and stored recognized
marks onto a magnetic medium. Marks could
be recognized in "alternative mode", i.e. only
one mark was expected for one question and the
darkest mark was selected if several marks
were found, or in "bit mode", i.e.
multiple marks were expected for one question and
all recognized marks were stored in the file. |
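The two recognition modes can be illustrated with a minimal sketch. It assumes each mark box has already been reduced to a darkness value between 0 and 1; the function names and the 0.5 threshold are hypothetical and do not describe any particular OMR product.

```python
def read_alternative(darkness, threshold=0.5):
    """Alternative mode: one answer is expected per question.

    Returns the index of the darkest box among those at or above the
    threshold, or None when no box is marked.
    """
    marked = [i for i, d in enumerate(darkness) if d >= threshold]
    if not marked:
        return None
    return max(marked, key=lambda i: darkness[i])


def read_bit(darkness, threshold=0.5):
    """Bit mode: several answers may be given per question.

    Returns the indexes of all boxes at or above the threshold.
    """
    return [i for i, d in enumerate(darkness) if d >= threshold]


# Hypothetical darkness readings for the four boxes of one question.
boxes = [0.1, 0.8, 0.6, 0.0]
print(read_alternative(boxes))
print(read_bit(boxes))
```

With the sample readings, alternative mode keeps only the darker of the two marked boxes, whereas bit mode stores both.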
| 35. |
The Workshop noted the high
quality requirements for OMR forms, which needed
to be carefully designed in order to improve processing
and recognition reliability. Paper and printing
quality had to be high, dropout colours had to
be used for lead text and mark boxes, the shape
and size of the mark boxes had to be carefully
designed and sufficient distance had to be maintained
between the mark boxes. The OMR form needed
also to include timing marks along the aligning
edge in the direction of reading. Finally,
it was important that the mark boxes be completely
filled with a soft black pencil and that wrong
marks be erased completely. Since
OMR forms were designed to be readable by the
equipment, staff designated to handle the forms
needed special training to fully understand the
content. |
| 36. |
The Workshop noted that OMR
equipment had to be tested for reliability and
recognition stability at least three times daily,
namely, before, during and after the operation.
Failing those tests, the equipment needed to be
cleaned, adjusted or repaired, as the case might
be. In addition, the equipment needed to
be cleaned daily by removing paper powder from
the mark and image heads, feeding unit and other
susceptible parts. Normally, a monthly maintenance
service was to be scheduled by the vendor. |
| 37. |
The Workshop agreed that OMR
technology was a reliable and economical choice
for censuses and surveys if the responses could
be pre-coded. However, it acknowledged that
the particular requirements for questionnaire
design and paper and printing quality were the
main drawbacks of the technology. For instance,
enumerators, respondents and editors could have
difficulties in using the questionnaires due to
their highly machine-oriented layout. Therefore
it was necessary to allocate sufficient time and
funds for training the enumerators and the OMR
operating personnel. The Workshop noted
that leasing was one way to reduce cost. |
|
|
| B.
Demonstration of Optical Mark Reader (OMR) |
|
|
| 38. |
Data & Research Services
(DRS) plc, a British company manufacturing OMR
equipment and operating a data capture service
bureau, provided the Workshop with an overview
of OMR products and services and highlighted some
of OMR's advantages and disadvantages compared
with key-to-disk data capture. The Workshop
was informed that OMR was capable of capturing
7,000 forms per hour, a huge improvement over
manual key entry. Optical reading also improved
data quality. It was pointed out that as
data volumes increased the use of OMR became more
economical than key-to-disk data capture, particularly
where predominantly pre-coded tick-box responses
could be used. Some disadvantages of OMR
were mentioned, including the need for specially
designed and accurately printed, and therefore
more costly, questionnaires and the difficulty
of capturing subjective data, i.e. textual responses.
The Workshop heard that OMR would be more efficient
and cheaper than optical character recognition
systems (OCR) as long as the majority of responses
could be pre-coded. |
| 39. |
Recognizing that a census questionnaire
often had to include some textual responses, DRS
had developed a new generation of OMRs that added
an image recognition unit. The captured
images would be stored in a file and could be
viewed by coding and editing operators who would
key-in information from the image, possibly assisted
by a computerized table-lookup system. However,
the bulk of the information would still be captured
using the significantly more efficient mark reading
technology. |
| 40. |
A demonstration of a small-capacity
desktop OMR reading actual Greek census forms
concluded the presentation by DRS, which the Workshop
found most useful. |
|
|
| C.
Optical Character Recognition (OCR) |
|
|
| 41. |
The Workshop noted that in some
contexts the recognition of handwritten numerals
and letters was referred to as Intelligent Character
Recognition (ICR) to distinguish that technology
from the recognition of printed text and numbers.
This report, however, uses the term OCR to
cover all character recognition. |
| 42. |
Kodak (United States) had been
invited to introduce to the Workshop optical character
recognition (OCR) technology as used in the 1990
United States census. The Workshop was informed
that to obtain maximum reliability in the scanning
process, special care had to be taken when designing
and printing the questionnaires. The measures
included the use of non-carbon based ink and dropout
colours. Like the OMR forms, the OCR forms
design had to be a compromise between maximizing
the ease of use by the enumerators, coders and
editors on the one hand and optimizing the efficiency
of the recognition software on the other.
Experience showed that the best recognition rates
for handwritten responses were achieved at a
scanning resolution of 200 dots per inch (dpi)
or lower; higher resolutions generally worsened
the recognition rates. |
| 43. |
It was explained that the confidence
level of character recognition was user definable
and was dependent on the overall document quality,
i.e. questionnaire design and clarity of handwritten
responses. However, setting
the confidence level too high, e.g. above 90 per
cent, could result in excessive numbers of rejects,
while setting the level much lower could jeopardize
the quality of the output data. The Workshop
noted that one of the major problems in character
recognition was the acceptance of positively but
wrongly identified characters. In consequence,
reduction of the number of "false positives" would
have the most benefit for the overall quality
of the captured data. |
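The trade-off between rejects and false positives can be made concrete with a small sketch. The sample data, the function name and the thresholds are hypothetical; the true value of each character is included only so that wrongly accepted characters can be counted, which a production system could not do directly.

```python
def summarize(results, threshold):
    """Count rejects and false positives at a given confidence threshold.

    results: list of (recognized_char, true_char, confidence) tuples.
    A character is accepted when its confidence is at or above the
    threshold; otherwise it is rejected and would go to an operator.
    """
    rejects = 0
    false_positives = 0
    for recognized, truth, confidence in results:
        if confidence < threshold:
            rejects += 1
        elif recognized != truth:
            false_positives += 1   # accepted, but wrongly identified
    return rejects, false_positives


# Hypothetical recognition results for four handwritten digits.
sample = [("7", "7", 0.99), ("1", "7", 0.92), ("4", "4", 0.85), ("9", "4", 0.60)]

for threshold in (0.95, 0.90, 0.50):
    print(threshold, summarize(sample, threshold))
```

In this sample, raising the threshold removes the false positives but sends more characters to manual verification, while lowering it accepts everything, including two wrong characters.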
| 44. |
On a unit cost basis, the economics
of keying-from-paper (KFP) and keying-from-image
(KFI) were compared. With the selected labour
cost the calculations suggested that the break-even
point was at about 400,000 census forms, i.e.,
beyond those numbers KFI would become more economical.
It was pointed out that KFI might be feasible
even with a lesser number of forms, if improved
data quality at the data capture stage, reduced
costs for the additional processing steps and
increased capture speed resulting in earlier completion
of the entire census process were taken into account. |
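The break-even reasoning can be sketched with a simple linear cost model. The model and all cost figures below are assumptions chosen only to reproduce a break-even point of 400,000 forms; the report does not give the labour or equipment costs that were actually used.

```python
def break_even_forms(kfi_fixed_cost, kfp_cost_per_form, kfi_cost_per_form):
    """Number of forms at which KFI and KFP cost the same.

    Assumed model (not from the report):
      KFP total = kfp_cost_per_form * n
      KFI total = kfi_fixed_cost + kfi_cost_per_form * n
    """
    saving_per_form = kfp_cost_per_form - kfi_cost_per_form
    if saving_per_form <= 0:
        raise ValueError("KFI cannot break even unless its per-form cost is lower")
    return kfi_fixed_cost / saving_per_form


# Hypothetical figures: 100,000 in scanning equipment and software,
# 0.50 per form keyed from paper, 0.25 per form keyed from image.
n_star = break_even_forms(100_000, 0.50, 0.25)
print(n_star)
```

Beyond the break-even volume every additional form favours KFI. Lowering the fixed cost or widening the per-form saving moves the break-even point down, which matches the remark that KFI might be feasible with fewer forms once quality and downstream savings are counted.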
|
|
| OCR
technology for the Indonesian Census 2000 |
|
|
| 45. |
The Workshop was informed about
the background and rationale based on which Indonesia
selected OCR as the data capture method for the
year 2000 census. Major considerations had
been (a) the very large number of forms to be
processed for a population of more than 200 million;
(b) the need to produce small area statistics
based on the many island areas; and (c) the possibility
of publishing basic results within 3 to 6 months.
The decision had also been helped by the availability
of external assistance in the form of equipment,
software and expertise. |
| 46. |
The OCR system and the questionnaire
design had been assessed and tuned in several
pilot tests. The changes in the questionnaire
design had improved the recognition results significantly.
Further improvements had been achieved by replacing
the built-in western character set in the recognition
engine with a localized version of the character
map. The local version had been developed
from writing samples submitted by 5,000 different
persons. However, it was eventually decided
that it was better to omit the recognition of
alpha characters and to concentrate on maximizing
the performance of numeric recognition and mark
reading. |
| 47. |
The Workshop was given an overview
of the processing flow of an OCR based system
in Indonesia. The OCR system consisted of
three steps, namely scanning, recognition and
verification. The scanning of questionnaires
produced an image file in TIF format. That image
was compared with a template file containing information
about the relative locations of input fields in the questionnaire.
The resulting digital output file was then submitted
to the verification process in order to produce
a clean data file. |
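The three steps in paragraph 47 can be pictured with a toy Python sketch. It is not the Nestor Reader workflow: the "image" is a list of text rows standing in for a scanned page, and the template layout, field names and validation rules are all assumptions made for illustration.

```python
# Hypothetical template: field name -> (row, first column, end column).
TEMPLATE = {"age": (0, 5, 7), "sex": (1, 5, 6)}

def scan(paper_rows):
    """Step 1, scanning: stand-in that turns a paper form into an 'image'."""
    return list(paper_rows)

def recognize(image, template):
    """Step 2, recognition: read each field at the location the template gives."""
    return {
        name: image[row][start:end].strip()
        for name, (row, start, end) in template.items()
    }

def verify(record):
    """Step 3, verification: return the names of fields that fail the rules."""
    errors = []
    if not record["age"].isdigit() or not 0 <= int(record["age"]) <= 120:
        errors.append("age")
    if record["sex"] not in {"1", "2"}:
        errors.append("sex")
    return errors

image = scan(["AGE: 34", "SEX: 2"])
record = recognize(image, TEMPLATE)
problems = verify(record)
print(record, "clean" if not problems else f"needs review: {problems}")
```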
| 48. |
The Workshop learned about the
issues and principles involved in the OCR questionnaire
design in Indonesia. It was noted that OCR equipment
required less stringent paper quality and printing
accuracy than did OMR. Instead, four rectangular
registration markers were placed near the corners
of the questionnaire page to define the location
of individual fields relative to these registration
markers, thus providing greater tolerance for
misaligned forms being fed through the scanner.
Data fields were placed on the page as boxes of
sufficient size to allow clear handwriting, with
appropriate distance between them to minimize
the risk of misinterpretation. Depending
on the use, field types could be defined as containing
marks or textual information. For textual
boxes, placing two vertical dots within each
character box was recommended to guide
the respondent or enumerator and thus improve
the quality of handwriting. Standard form-processing
tools could normally be used for developing the
questionnaire. Once the design was complete,
the questionnaire was scanned to produce an image
file that was input to the NCS Nestor Reader editing
function in order to create the above-mentioned
master questionnaire in ZDF format. The
questionnaires used for data collection were printed
with dropout colours. |
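One way to picture how four registration markers give tolerance for misaligned pages is the Python sketch below. It uses bilinear interpolation between the detected marker positions, which is an assumption chosen for simplicity and not necessarily the method used by the software described. The function name and all coordinates are hypothetical.

```python
def locate_field(markers, u, v):
    """Map a field's template position onto the scanned page.

    markers: detected pixel positions of the four registration markers,
             keyed "tl", "tr", "bl", "br" (top/bottom, left/right).
    u, v:    field position on the master questionnaire, as fractions (0..1)
             of the horizontal and vertical distance between the markers.
    """
    (x_tl, y_tl), (x_tr, y_tr) = markers["tl"], markers["tr"]
    (x_bl, y_bl), (x_br, y_br) = markers["bl"], markers["br"]
    # Bilinear blend of the four corners, weighted by (u, v).
    x = (1 - u) * (1 - v) * x_tl + u * (1 - v) * x_tr \
        + (1 - u) * v * x_bl + u * v * x_br
    y = (1 - u) * (1 - v) * y_tl + u * (1 - v) * y_tr \
        + (1 - u) * v * y_bl + u * v * y_br
    return x, y

# A page fed straight, and the same page shifted by (12, -7) pixels.
straight = {"tl": (100, 100), "tr": (1100, 100),
            "bl": (100, 1500), "br": (1100, 1500)}
shifted = {k: (x + 12, y - 7) for k, (x, y) in straight.items()}

print(locate_field(straight, 0.25, 0.5))
print(locate_field(shifted, 0.25, 0.5))
```

Because the field is located relative to the markers rather than to the page edge, the shift of the page carries over to the computed field position.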
| 49. |
The Workshop was given a hands-on
demonstration of developing an OCR questionnaire
using the Visio Technical software. The
form design included text, recognition marks and
check boxes. It was thus shown that the
questionnaire design could be developed by the
user without assistance from the software company.
In contrast, the validation and editing rules
were programmed in Visual Basic and were linked
to the Nestor Reader software, a more difficult
task that might require assistance from the vendor. |
| 50. |
The Workshop also observed a
practical demonstration of a system similar to, but
less powerful than, the one that Indonesia was planning
to use, showing the scanning and recognition of
characters and marks, and the output of questionnaire
data to a digital file. |
| 51. |
The Workshop heard that Indonesia
was planning to deploy for its 2000 census some
80 OCR systems, consisting of Fujitsu M3099GX
scanners, NCS Nestor Reader 5.0, Visio Technical,
the scanning software Scan All, and Fujitsu PCs.
The systems would be distributed across the country,
allocated to provinces according to their population
size. After the census, those systems would
be allocated for long-term use at smaller regional
offices. The Workshop heard that greater
emphasis would be placed on enumerator training,
particularly on the writing of numbers.
Statistics Indonesia had chosen to use cardboard
boxes for storing and transporting the questionnaires
instead of plastic satchels. The boxes were
designed to serve the dual purpose of better protecting
the forms in the humid climate and providing a
writing surface for the enumerator when filling in the forms. |
| 52. |
For the Indonesian census, coding
would be done in the office before the forms were
scanned. The Workshop discussed the feasibility
of reversing the sequence, i.e. of scanning
the forms first and only then carrying out computer-assisted
coding from the scanned images.
It was concluded that the feasibility depended
on the availability of suitably trained staff. |
|
|
| Demonstration
of OCR cluster |
|
|
| 53. |
The Workshop observed a practical
demonstration by Top Image System (TIS) of the
TIS AFPS Pro recognition cluster that used a Kodak
scanner with a controlled station linked to six
Pentium PC stations performing the following functions:
(1) processing; (2) tile; (3) completion; (4)
exception handling; (5) archive and export; and
(6) controlling. It noted the flexibility
to inspect recognition results by character (tile
mode) and appreciated the system's simplicity
and efficiency in helping operators spot
visibly misinterpreted characters. |
| 54. |
Depending on the overall workload,
the number of computers for each processing step
could be increased or decreased. Depending
on current workflow conditions, i.e., bottlenecks,
any computer could be temporarily
or permanently reassigned to another function
in order to keep the overall system performance
well balanced. |
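The reassignment idea in paragraph 54 can be sketched as a simple rule in Python. The rule, the stage names and the backlog figures are assumptions for illustration only and do not describe the scheduling logic of the demonstrated system.

```python
def rebalance(machines, backlog):
    """Move one machine from the least loaded stage to the bottleneck.

    machines: stage name -> number of computers assigned (at least 1 each).
    backlog:  stage name -> number of forms waiting at that stage.
    Returns a new assignment; the input dictionaries are not modified.
    """
    load = {stage: backlog[stage] / machines[stage] for stage in machines}
    bottleneck = max(load, key=load.get)
    # Only stages with a spare machine can give one up.
    donors = [s for s in machines if s != bottleneck and machines[s] > 1]
    if not donors:
        return dict(machines)
    donor = min(donors, key=load.get)
    if load[donor] >= load[bottleneck]:
        return dict(machines)
    result = dict(machines)
    result[donor] -= 1
    result[bottleneck] += 1
    return result

# Hypothetical snapshot of a six-computer cluster.
machines = {"processing": 2, "tile": 2, "completion": 1, "exception": 1}
backlog = {"processing": 100, "tile": 600, "completion": 50, "exception": 20}

print(rebalance(machines, backlog))
```

Applying the rule repeatedly as backlogs change would correspond to the temporary reassignments described; a permanent reassignment is simply one that is never reversed.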
| 55. |
To highlight the efficiency
of the modular approach, the example of the 1997
Turkish Census was cited. In that census,
questionnaires for 62 million people were scanned
and recognized in 30 days, albeit only for a subset
of variables. The Workshop noted that the
processing time was inversely related to the number of available
scanning and recognition clusters. It was
informed that TIS had achieved alpha recognition
rates as high as 94 per cent (in Brazil) and 98 per
cent (in Germany), although the latter case involved
less elaborate forms than census questionnaires. |
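The inverse relationship noted in paragraph 55 can be written as a short Python sketch, assuming that every cluster processes forms at the same constant daily rate and that clusters work independently. The workload and rate below are invented and are not the figures of the Turkish census.

```python
def days_needed(forms, clusters, forms_per_cluster_per_day):
    """Processing time in days, assuming identical, independent clusters."""
    if clusters <= 0 or forms_per_cluster_per_day <= 0:
        raise ValueError("clusters and daily rate must be positive")
    return forms / (clusters * forms_per_cluster_per_day)

# Hypothetical workload: 1.2 million forms at 10,000 forms per cluster per day.
for n in (4, 8, 16):
    print(n, "clusters:", days_needed(1_200_000, n, 10_000), "days")
```

Under these assumptions, doubling the number of clusters halves the processing time.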
| 56. |
Improvements in recognition
rates achieved by the TIS software were attributed
to several advanced techniques, including (a)
image enhancement; (b) form identification and
removal (lift-off); (c) use of several recognition
engines with voting algorithms; (d) trainable
recognition algorithms, including local writing
styles; (e) validation function and rules; (f)
automatic coding; and (g) visual inspection in
tile mode. |
| 57. |
The Workshop heard that the
form identification and removal feature eliminated
the need for dropout colours and would significantly
reduce the required storage space. The voting
algorithms would evaluate the results of several
recognition engines and select the best answer
according to pre-defined rules. The tile
mode would show, for each character from 0 to 9
and A to Z, one at a time, a table containing
all images that had been interpreted as
that character. That feature provided an
efficient means of visually inspecting all images
at a glance and easily identifying those images
that did not correspond to the character under
review. |
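The voting and tile-mode ideas in paragraph 57 can be sketched in Python as follows. The majority rule shown is only one possible "pre-defined rule", and the function names, image identifiers and engine outputs are hypothetical; the sketch does not reproduce the TIS software.

```python
from collections import Counter, defaultdict

def vote(candidates, min_agree=2):
    """Pick the character most engines agree on, or None if undecided.

    candidates: one proposed character per recognition engine.
    Returns None when no character reaches `min_agree` votes or when the
    top two characters tie, so the image can be sent for manual completion.
    """
    ranked = Counter(candidates).most_common()
    top_char, top_count = ranked[0]
    if top_count < min_agree:
        return None
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None
    return top_char

def tiles(recognized):
    """Group image identifiers by the character they were interpreted as."""
    groups = defaultdict(list)
    for image_id, char in recognized:
        groups[char].append(image_id)
    return dict(groups)

# Three hypothetical engines reading the same character image.
print(vote(["7", "7", "1"]))   # two of three engines agree
print(vote(["7", "1", "4"]))   # no agreement

# Tile view: all images read as "7" appear together for visual inspection.
print(tiles([("img1", "7"), ("img2", "1"), ("img3", "7")]))
```

Displaying each group returned by `tiles` as a table of image snippets would let an operator see at a glance which images do not belong to the character under review.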
|
|
| New
Zealand experience in 1996 |
|
|
| 58. |
The Workshop learned that for
the 1996 New Zealand Population Census imaging
and character recognition were used to capture
the data. Benefits compared with the 1991
census included: results released 5 months earlier;
cost savings for data capture estimated at 9 per
cent; a noticeable reduction in paper handling and
storage (particularly after the capture); and
easier access to forms during coding and editing.
In addition, better quality control was gained,
fewer staff needed to be recruited and trained,
and access to census data for comparison with the
post-enumeration survey was easier. |
| 59. |
The following lessons were learned
from the 1996 New Zealand Population Census use
of imaging and character recognition: (a) systematic
recognition errors for certain characters produced
biased results; (b) the use of images for coding
and editing was a distinct advantage; (c) more
data validation during data capture would improve
overall data quality; and (d) high-priority variables
could easily be processed first. The Workshop
was informed that further contracting out of the
data capture process might yield significant long-term
economic benefits and, last but not least, that imaging
should not be used merely as a replacement for traditional
data capture methods; rather, its introduction was an
occasion to rethink the entire census process. |
|
|
| Observations
and recommendations on OCR |
|
|
| 60. |
The Workshop noted that recognition
engines could be expensive and therefore the use
of multiple engines had to be carefully evaluated.
However, it was also recognized that no single
recognition engine would give 100 per cent correct results
in all circumstances and that different engines
had different strengths and weaknesses.
Thus, using several recognition engines with a
voting mechanism could significantly improve the
overall recognition rate. |
| 61. |
The Workshop recommended that
users should demand that competing vendors of
census data capture systems demonstrate that the
promised capabilities of their system would work
under local circumstances, i.e. in the physical
and infrastructure environment of the user as
well as with the specific forms as developed by
the user. |
| 62. |
The Workshop noted that
technologically advanced solutions should not
be adopted for their own sake; consideration should be given
to local circumstances, e.g., to the constraints
imposed by limited financial, technical and
personnel resources. |
| 63. |
The Workshop also noted that
paper based methods continued to be used for data
collection, particularly when the general public
was filling in the questionnaires. It was
noted that non-response remained one of the main
problems in census taking. |
| 64. |
The Workshop discussed the benefits
and drawbacks of paper based data collection and
capture methods. Considerable interest was
shown in the topic, and the Workshop made the
following observations: |
|
- improved technology had
helped the census process in many developing
countries;
|