| Workshop on Population
Data Analysis, Storage and Dissemination Technologies
|
| Bangkok, 27-30 March
2001 |
STAT/WDT/Philippines
20 March 2001
ENGLISH ONLY
ECONOMIC AND SOCIAL COMMISSION FOR ASIA AND THE
PACIFIC |
| Latest Innovations in
Methods and Tools for Census Data: Technological
Lessons from the 2000 Round of the Philippine
Census 1/
|
| By: Carmelita N. Ericta 2/
and Elpidio Nogales 3/
|
| Contents |
- A. Introduction
- B. Census Processing Using ICR Technology
  B.1 Preparation of ICR-Friendly Census Forms
  B.2 Stages of ICR-Based Census Processing
  B.3 Resources Used for ICR-Based Processing
- C. Experiences in the Use of ICR Technology for Census Processing
  C.1 Data Capture Center Strategy
  C.2 Experiences on Preparation of ICR-Friendly Forms
  C.3 Experiences on the Use of ICR Software
  C.4 Experiences on Image Capture
- D. Future Directions on ICR-Based Processing at NSO
- E. Recommendations
  E.1 On Forms Preparation
  E.2 On ICR Software
|
1/ This paper has been reproduced as submitted. It has been issued without formal editing.
2/ Deputy Administrator, National Statistics Office, Philippines.
3/ Project Leader, Data Capture Center, National Statistics Office, Philippines.
|
| A.
INTRODUCTION |
| 1. Census processing has always
been a daunting task for most statistical offices.
The huge volume of questionnaires to be handled
poses a great challenge to the data processing
capabilities of statistical agencies. Thus the
National Statistics Office (NSO) is always seeking
ways to improve its census processing. |
| 2. In the Philippines' 1990
Census of Population and Housing, the NSO made
a bold decision to use stand-alone PC XTs instead
of the old, reliable mainframes to process more
than 70 million records. Many doubted the wisdom
of using these 40 MB-capacity computers, but NSO
was able to show that it was a wise decision.
In 1995, NSO innovated again by using networked
computers (PC 486s) to process the 1995 Population
Census. In both censuses, NSO used traditional
data entry to convert data from census forms into
digital format. |
| 3. Last year, the NSO conducted
the Philippines' 2000 Census of Population and
Housing (Census 2000). This time NSO sought to
depart from the traditional method of capturing
census data by employing Intelligent Character
Recognition (ICR) technology. With this technology,
NSO aims to facilitate processing of Census 2000
by capturing the images of census forms and letting
the ICR interpretation engines convert the
data into digital format. The knowledge and experience
gained in using ICR technology will be used to
revolutionize census and survey processing at the
NSO. |
|
| B.
CENSUS PROCESSING USING ICR TECHNOLOGY |
| 4. In a nutshell, ICR-based
census processing consists of document scanning,
interpretation/recognition, data verification,
and data transfer/generation stages. Before turning
to actual processing, however, it is worth considering
the preparations needed on the census forms, the
overall processing strategy, and the resources
needed for effective ICR-based census processing. |
| B.1
Preparation of ICR-Friendly Census Forms |
| 5. One important aspect of ICR-based
census processing is the preparation of ICR-friendly
census forms. This matter should not be taken
for granted for this may eventually determine
the success or failure of census processing. Census
forms should be very carefully designed so as
to minimize processing errors. |
| 6. The statisticians and data
processing personnel should decide together whether
the census forms will include more fill-in (handwritten)
fields, more mark fields (check boxes), or a balance
of both. Mark fields are recognized very accurately,
while handwritten entries have lower recognition
rates. However, preparing mark fields entails a lot
of research into which choices should be offered
as check boxes. Also, handwritten answers can
accommodate any kind of response, while mark
fields restrict the answer to the pre-printed
choices. |
| 7. The designer of census forms
may also use dropout colors to increase accuracy.
Dropout colors are not captured by the scanner
and are not stored in document images. This
scanning feature helps in designing field boxes,
especially those requiring handwritten entries.
The only drawback arises when the prescribed shade
or tint of the dropout color is not accurately
followed during the printing of census forms. This
may result in the insertion of unwanted characters
during the interpretation stage. |
| B.2
Stages of ICR-based Census Processing |
| 8. The ICR technology is a system
which is acknowledged to be fast, reliable, and
efficient at extracting information from documents
of all kinds, such as forms, questionnaires, faxes,
and the like. The stages of ICR-based processing
are as follows: |
- Image Capture Stage;
- Interpretation/Recognition Stage;
- Data Verification Stage; and
- Data Transfer/Data Generation Stage.
|
| 9. ICR-based census processing
should run on a local area network consisting
of a large-capacity image/database server with fast
workstations, scanners, and backup devices. Figure
1 is an example of a network configuration for ICR-based
census processing. |
| B.2.1 Image Capture Stage |
| 10. Census forms are captured
as computer images using fast mid- or high-volume
document scanners. These document images are sent
to the network's image server for further processing.
To minimize form errors, some document scanning
software has a deskew capability that corrects
the alignment of form images. |
| B.2.2 Interpretation/Recognition
Stage |
| 11. This is the heart of ICR-based
processing. In this stage, images of census questionnaires
undergo interpretation, in which forms are checked and
identified by the ICR software's recognition engine.
Forms are identified by means of adjustment fields
that are pre-printed on the census forms. If the recognition
software cannot locate these adjustment fields
in an image, the image is considered UNIDENTIFIED.
Unidentified images are set apart by the
recognition software and do not undergo further
processing. |
 |
| Figure 1. DCC Manila
ICR-based Network Configuration |
| 12. If the recognition software
is able to locate a sufficient number of adjustment
fields in an image, the image is tagged as
IDENTIFIED. Field entries within the identified
images undergo further interpretation, in which corresponding
computer codes are created for these entries in
the ICR database. Identified forms with "erroneous"
(unrecognizable, inconsistent or invalid) field
entries are then automatically assigned by the
ICR software to the first available verify workstation. |
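The identification rule described in paragraphs 11 and 12 can be sketched as follows. This is a minimal illustration in Python, not the ICR software's actual logic. The `ScannedPage` type, the function names, and the threshold of three adjustment fields are assumptions made for the example; the text says only that a "sufficient number" of adjustment fields must be located.

```python
from dataclasses import dataclass


@dataclass
class ScannedPage:
    page_id: str
    adjustment_fields_found: int  # pre-printed adjustment fields located in the image


def classify_page(page: ScannedPage, required_fields: int = 3) -> str:
    """Tag a page image as IDENTIFIED or UNIDENTIFIED.

    required_fields is a hypothetical threshold; the source does not state
    how many adjustment fields count as "sufficient".
    """
    if page.adjustment_fields_found >= required_fields:
        return "IDENTIFIED"
    return "UNIDENTIFIED"


def split_batch(pages, required_fields: int = 3):
    """Separate a batch into pages that continue processing and pages set apart."""
    identified, unidentified = [], []
    for page in pages:
        if classify_page(page, required_fields) == "IDENTIFIED":
            identified.append(page)
        else:
            unidentified.append(page)
    return identified, unidentified
```

Keeping the threshold as a parameter reflects the fact that the acceptable number of located adjustment fields is a tuning decision rather than a fixed property of the forms.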
| B.2.3 Data Verification
Stage |
| 13. Only forms with at least
one erroneous field entry go through data
verification. Forms with no erroneous field entries
(clean forms) do not undergo data verification.
The interpreted data for these clean forms are
stored in the ICR software's database. |
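The routing rule above (a form goes to verification if at least one field entry is erroneous) can be sketched as below. The `FieldResult` structure, the confidence score, the 0.9 threshold, and the list of valid codes are all assumptions for illustration; the criteria actually applied by the ICR software are not described in this paper beyond "unrecognizable, inconsistent or invalid".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldResult:
    name: str
    value: Optional[str]                # None when the engine could not read the entry
    confidence: float                   # hypothetical recognition score in [0, 1]
    valid_values: Optional[set] = None  # hypothetical set of allowed codes, if any


def is_erroneous(field: FieldResult, min_confidence: float = 0.9) -> bool:
    """A field is erroneous if it is unrecognizable, low-confidence, or invalid."""
    if field.value is None:
        return True
    if field.confidence < min_confidence:
        return True
    if field.valid_values is not None and field.value not in field.valid_values:
        return True
    return False


def needs_verification(fields, min_confidence: float = 0.9) -> bool:
    """Clean forms skip verification; one erroneous field is enough to route a form."""
    return any(is_erroneous(f, min_confidence) for f in fields)
```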
| 14. During this stage, all pages
of a questionnaire are displayed on the verifier's
workstation. The ICR software highlights the fields
to be verified. The verifier then enters the correct
entries for the highlighted fields. All corrections
are also stored in the ICR software's database.
During verification, all the details of a questionnaire
are shown on screen, so there is no need to retrieve
the actual paper questionnaires. |
| 15. The assignment of forms
during verification is completely automated and
the ICR software balances the volume of forms
per verifier. This is part of the software's workflow
capability. |
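One simple way to balance the volume of forms per verifier is to always hand the next form to the verifier with the fewest forms assigned so far. The sketch below illustrates that idea with a priority queue. It is an assumption about how such balancing could work, not a description of the ICR software's workflow engine, and the function and verifier names are hypothetical.

```python
import heapq


def assign_forms(form_ids, verifiers):
    """Assign each form to the verifier with the fewest forms assigned so far."""
    # Heap entries are (forms_assigned, verifier_name); the smallest count pops first.
    heap = [(0, name) for name in verifiers]
    heapq.heapify(heap)
    assignment = {name: [] for name in verifiers}
    for form_id in form_ids:
        count, name = heapq.heappop(heap)
        assignment[name].append(form_id)
        heapq.heappush(heap, (count + 1, name))
    return assignment
```

With this rule the number of forms assigned to any two verifiers never differs by more than one, although a real workflow would also have to account for verifiers working at different speeds.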
| B.2.4 Data Transfer/Data
Generation Stage |
| 16. Once forms have been verified,
the interpreted and verified data in the ICR software's
database are ready for conversion into text (ASCII)
format. This data conversion is done using the
Data Transfer module of the ICR software. Conversion
is needed because the ICR software's database has
a proprietary structure that is not readily
accessible to other popular software. The structure
of the output ASCII file is based on the instructions
defined by the data processing personnel who customized
the ICR software. |
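To make the idea of a user-defined ASCII output structure concrete, the sketch below writes records as fixed-width text lines. The fixed-width layout, the field names, and the column widths are assumptions chosen for illustration; the paper does not specify the actual output format or how the Data Transfer module is configured.

```python
def to_fixed_width(record: dict, layout) -> str:
    """Render one record as a fixed-width ASCII line.

    layout is a list of (field_name, width) pairs. Numeric values are
    zero-padded on the left; missing values become spaces.
    """
    parts = []
    for name, width in layout:
        raw = record.get(name)
        text = "" if raw is None else str(raw)
        if len(text) > width:
            raise ValueError(f"{name}={text!r} does not fit in {width} columns")
        parts.append(text.zfill(width) if text.isdigit() else text.ljust(width))
    return "".join(parts)


def export_records(records, layout, path):
    """Write all records to an ASCII text file, one line per record."""
    with open(path, "w", encoding="ascii", newline="\n") as out:
        for record in records:
            out.write(to_fixed_width(record, layout) + "\n")
```

For example, with the hypothetical layout `[("province", 2), ("age", 3), ("sex", 1)]`, a record with province 39, age 7 and sex 1 occupies six columns in the output line.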
| B.3
Resources for the ICR-Based Processing |
| B.3.1 Peopleware for the
ICR-Based Processing |
| 17. The first thing that comes
to mind when adopting new technologies is whether
there are enough skilled personnel who can learn
and use these technologies efficiently. For
ICR-based processing, the following personnel
are needed for smooth operation: |
- Scan Operators;
- Verifier Operators;
- Data Processing Supervisor; and
- A Team of IT Personnel who will customize the ICR software.
|
| 18. The scan operators operate
the document scanners to make computer images
of census questionnaires. The minimum requirement
for this position is basic knowledge of personal
computers. The verifier operators (or verifiers)
are the ones who re-enter or correct erroneous
field entries. Again, only basic knowledge of
personal computers is needed. Additional training
or a briefing on basic concepts of the census may
be conducted to help them resolve invalid or inconsistent
field entries. The data processing supervisor
is the key person during processing. Basic knowledge
of computer operation and networking is necessary
for the supervisor, who should also be knowledgeable
in the various concepts used in the census. A
team of IT personnel is needed to customize the
ICR software. The team members should be knowledgeable
in computer programming and networking. They work
hand in hand with census statisticians in order
to faithfully implement the editing and validation
specifications for all the fields covered by the
census. They should be trained in the various
aspects of customizing the ICR software. |
| B.3.2 Hardware Requirements
for ICR-Based Processing |
| 19. The hardware requirements
for ICR-based processing are basically the same
as those of traditional census processing, except
for the addition of document scanners. As
in traditional processing, a local area network
(LAN) is needed in ICR-based processing. However,
the server for ICR technology should have a large
disk and high memory capacity because of the nature
of image processing. In addition to normal networking
operations, the server in ICR-based processing
also acts as image server and as database server,
which requires large memory.
The LAN for ICR-based processing should include
verifier, scanning, manager, and backup workstations.
These workstations should preferably be
Pentium-class computers to provide
the necessary speed and memory size for processing.
Document scanners with sheet feeders are attached
to the scan workstations. The manager workstation
is used to monitor the status of processing and
to run the data transfer
and interpret modules of the ICR software. A backup
workstation is used to create backup copies of
document images on CD-ROMs. Newer storage technologies
like magneto optical (MO) disks and drives are
also available and may be used as reliable backup
devices. |
| B.3.3 Software Requirements
for ICR-Based Processing |
| 20. The selection of an appropriate
ICR software package is crucial to the success
of ICR-based census processing. The software
should have at least the following characteristics: |
- Ability to identify fields in the forms.
- Ability to adjust the character recognition accuracy levels or probability of recognition for each character in the form.
- Ability to allow data entry from unidentifiable forms by manual keying of characters from images.
- Compatibility with standard scanning devices.
- Ability to do automatic assignment of forms for verification and workload balancing among verify operators.
|
| 21. Usually, a scanning module
is included in the ICR software. It is best to
use this built-in module because it is designed
for forms processing. It is also possible to
use document scanning software external to the
ICR software, but it may lack some features
that are useful in forms processing. For the Philippines'
Census 2000, the NSO used Eyes and Hands for Forms
as the ICR software and KODAK MVCS as the external
document scanning software. |
| 22. Aside from the ICR software,
there is also a need to acquire or develop a monitoring
system in order to determine the status of processing
at any time. The software should be able to generate
processing statistics on productivity of various
personnel involved in the census processing. The
NSO Philippines used a tailor-made Census Progress
Monitoring System (CPMS) for this purpose. |
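The kind of productivity statistic such a monitoring system might report can be sketched as follows. This is not the CPMS itself, whose design is not described here. The log format, a list of (operator, forms processed, hours worked) entries, and the function name are assumptions for illustration.

```python
from collections import defaultdict


def forms_per_hour(log_entries):
    """Aggregate (operator, forms_processed, hours_worked) entries per operator."""
    totals = defaultdict(lambda: [0, 0.0])  # operator -> [forms, hours]
    for operator, forms, hours in log_entries:
        totals[operator][0] += forms
        totals[operator][1] += hours
    return {
        operator: forms / hours
        for operator, (forms, hours) in totals.items()
        if hours > 0  # skip operators with no recorded working time
    }
```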
|
| C.
EXPERIENCES IN THE USE OF ICR TECHNOLOGY FOR CENSUS
PROCESSING |
| 23. NSO's experience with the
use of ICR technology for Census 2000 may
be of help to those who are planning to use this
new technology. NSO learned the most in four (4)
areas of processing: |
- Data Capture Center
Strategy
- Forms Preparation
- ICR Software
- Image Capture/Scanning
|
| C.1
Data Capture Center Strategy |
| 24. The NSO Philippines created
data capture centers (DCCs) to handle the data
capture and forms processing for the 2000 Census
of Population and Housing. Processing at each
DCC is LAN-based. Each DCC is equipped with at least five
(5) mid-volume scanners, fifteen (15) Pentium
III workstations, three (3) magneto optical disk
drives, three (3) CD writers, one (1) network
printer and one (1) 500 MHz Pentium III server
with a capacity of 90 gigabytes. In order to process
at least 15 million questionnaires, four (4) DCCs
were strategically set up in the country. Each
DCC is assigned to process about 4 million census
forms from a group of regions. The number of DCCs
was limited by the cost of equipment
for ICR-based processing. |
| 25. The DCCs operate at least
18 hours a day, six days a week, on a two-shift
work schedule. A total of 146 personnel were hired
for the DCCs. Before actual operation, the personnel
were trained for a week on the basic concepts of
the census and machine processing at the DCC.
Most of the personnel had little computer knowledge
but became proficient in
their respective roles in the DCC within the one-week
training. |
| 26. Aside from the scan and
verifier operators, NSO Philippines also hired
four (4) data controllers per shift per DCC. The
data controllers prepare and check the batches
of forms for validity of geographic codes, clarity
of entries, and paper orientation. They make sure
that the scan operators do not run out of forms
to scan. NSO Philippines also hired one (1) file
preparation and transfer operator (FPTO) per shift.
The FPTO runs the interpretation and transfer
modules of the ICR software. The FPTO is also tasked
with creating backup copies of images on magneto optical
(MO) cartridges and CD-ROMs, as well as backup
copies of system log files and the databases of
the CPMS and ICR software. |
| 27. Each DCC sends its weekly
report and output data files to the Central Office
as e-mail attachments. These files are automatically
consolidated once the e-mail messages are received.
The consolidated report is then sent to the Administrator
and other top-level officials. |
| C.2
Experiences on Preparation of ICR-Friendly Forms |
| 28. NSO Philippines spent considerable
time designing ICR-friendly forms. A mixture of
mark and handwritten fields was used in the census
questionnaire. Mark fields were used for items
with few possible answers, such as sex, marital status,
and housing questions. Handwritten fields
were used for items with numerous possible entries,
such as age, year of birth, and occupation codes.
These handwritten fields require only numeric
entries. The use of alphabetic entries was avoided
in order to achieve high recognition rates. |
| 29. One of the biggest problems
encountered during processing was the print
quality of the census forms. Some forms did not
conform to the print specifications, which
caused the ICR software to tag them as unidentified.
The following are some of the printing problems
encountered during processing: |
| Form Problems | Effect on Processing |
| a. Wider/narrower margins | a. Unidentified form |
| b. Some pages are blank | b. Unidentified form |
| c. Some pages are printed upside down | c. Unidentified form |
| d. Too dark dropout color | d. Field boxes appear on images, causing interpretation error |
| e. Other print errors | e. Either unidentified form or interpretation error |
| 30. Another common problem during
census processing was illegible handwritten entries.
The enumerators were instructed during
the pre-enumeration training to use the prescribed
handwriting strokes. But because of the volume of
forms to be accomplished, their handwriting
changed as more forms were filled out. Illegible
handwritten entries cause either interpretation
errors or misinterpretation of field values. |
| 31. The enumerators were instructed
to use pencils in accomplishing the census forms.
This would not have posed problems for ICR-based
processing had the forms been scanned right after
they were filled out. But in actual processing,
some questionnaires were scanned two or three
months after they were filled out. Hence, by the
time these forms were scanned, the pencil entries
were already very faint. Faint field entries are
uninterpretable because of very low image density.
To resolve this, the DCC staff had to enhance
some entries using ordinary pens before scanning
the documents. This added some man-days to the
processing. |
| 32. To facilitate processing,
NSO Philippines instructed the enumerators to
bundle questionnaires by enumeration area. An
enumeration area (EA), which consists of about
300 households, is a pre-determined region within
a "barangay". Each barangay consists of
one or more EAs, while a group of barangays forms
a municipality. Several municipalities form one
province. There are about 42,000 barangays in
the Philippines; thus, the total number of bundles
is two or even three times the number of barangays.
Questionnaires belonging to a bundle share a common
geographic ID (Province, Municipality, Barangay,
and EA), which is also the bundle name. NSO Philippines
customized the ICR software to use the bundle
name as the correct GEO ID of the questionnaires.
This means that the GEO codes in the questionnaires
need not undergo interpretation, which speeds
up interpretation and increases accuracy. |
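The bundle-name approach can be illustrated with a short sketch: the geographic ID is taken from the bundle name and stamped onto every questionnaire in the bundle, so the handwritten GEO codes are never interpreted. The code widths (2, 2, 3 and 6 digits), the all-numeric bundle name, and the record structure are hypothetical; the paper states only that the bundle name consists of the Province, Municipality, Barangay and EA codes.

```python
from typing import NamedTuple


class GeoId(NamedTuple):
    province: str
    municipality: str
    barangay: str
    ea: str


# Hypothetical code widths; the actual coding scheme is not given in the source.
ASSUMED_WIDTHS = (2, 2, 3, 6)


def parse_bundle_name(bundle_name: str, widths=ASSUMED_WIDTHS) -> GeoId:
    """Split a numeric bundle name into its geographic code components."""
    if len(bundle_name) != sum(widths) or not bundle_name.isdigit():
        raise ValueError(f"unexpected bundle name: {bundle_name!r}")
    parts, start = [], 0
    for width in widths:
        parts.append(bundle_name[start:start + width])
        start += width
    return GeoId(*parts)


def stamp_geo_id(questionnaires, bundle_name: str):
    """Overwrite each questionnaire's GEO fields with the bundle's GEO ID."""
    geo = parse_bundle_name(bundle_name)._asdict()
    return [{**record, **geo} for record in questionnaires]
```

Because the bundle's codes always replace whatever was read from the form, a misread GEO code on an individual questionnaire cannot place it in the wrong enumeration area.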
| 33. The use of ICR does not
eliminate manual processing. Some minimal manual
processing was still needed for Census 2000
because some questionnaires had incomplete
or inconsistent entries. In particular, some
critical fields were left blank by enumerators,
which affected the output data file. |
| C.3
Experiences on the Use of ICR Software |
| 34. As mentioned earlier,
the interpretation/recognition engine is the heart
of ICR technology. During
the processing of Census 2000, the recognition
rate for mark fields was almost perfect, while
that for handwritten fields was much lower. Overall,
the rate of interpretation/recognition ranged
from 90% to 95%. The speed of interpretation was
between 3,400 and 3,900 forms per hour. |
| 35. Although recognition rates
are high, most questionnaires still undergo verification,
because a questionnaire is verified if it has
at least one erroneous field. Verification
provides an opportunity to correct the erroneous
field. The rate of verification ranges from 270
to 320 forms per hour per verifier. Each DCC has
only four (4) verify licenses. Eight to ten verify
licenses per DCC would have been ideal for NSO
Philippines' census processing. |
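A rough calculation shows why high recognition rates and a heavy verification load can coexist, and why the number of verify licenses matters. The sketch below rests on several assumptions that are not in the paper: the 90% to 95% rate is read as a per-field rate, field errors are treated as independent, a questionnaire is assumed to have 20 interpreted fields, and every license is assumed to be in use for the full 18 hours a day. Only the forms per DCC (15 million over four DCCs), the 270 forms per hour per verifier, and the 18-hour day come from the text.

```python
def share_with_errors(field_rate: float, fields_per_form: int) -> float:
    """Share of forms with at least one erroneous field, assuming independent fields."""
    return 1.0 - field_rate ** fields_per_form


def verification_days(forms, share_verified, licenses, forms_per_hour, hours_per_day=18.0):
    """Working days needed to verify the given share of forms."""
    daily_capacity = licenses * forms_per_hour * hours_per_day
    return forms * share_verified / daily_capacity


if __name__ == "__main__":
    forms_per_dcc = 15_000_000 / 4  # from the text: 15 million forms, four DCCs
    share = share_with_errors(0.95, fields_per_form=20)  # 20 fields is hypothetical
    for licenses in (4, 8, 10):
        days = verification_days(forms_per_dcc, share, licenses, forms_per_hour=270)
        print(f"{licenses} licenses: about {days:.0f} working days")
```

Under these assumptions, doubling the number of licenses halves the verification time, which is consistent with the observation that eight to ten licenses per DCC would have been preferable to four.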
| C.4
Experiences on Image Capture |
| 36. The quality of form images
is crucial to ICR-based processing. More than
15 million forms were scanned using 22 Kodak 3590
mid-volume scanners. Because of the volume of forms
scanned, dirt accumulates in the scanners and
produces lines in the form images, which degrades
image quality. Daily cleaning
and monthly preventive maintenance ensured good
image quality and minimal downtime of document
scanners. |
|
| D.
FUTURE DIRECTIONS ON ICR-BASED PROCESSING AT NSO
|
| 37. This pioneering experience
with ICR-based processing of the 2000 census returns
convinced NSO Philippines that this technology
is beneficial to the office in the long term.
The ICR equipment and software are reusable for
surveys and other censuses. NSO is planning to
use this technology to process the next Census
on Agriculture and Fisheries. The questionnaires
for this census are being designed to be ICR-friendly
taking into account the lessons learned from the
2000 Census of Population and Housing. ICR technology
is also being considered in the processing of
Foreign Trade (Imports/Exports) documents in the
very near future. |
|
| E.
RECOMMENDATIONS |
| E.1
On Forms Preparation |
| 38. ICR-based processing requires
careful attention to the design of questionnaires.
Mark fields should be used as much as possible
for critical items (or if possible for most items)
to minimize interpretation errors. Field boxes
should be large enough for handwritten entries,
so that enumerators can write ICR-recognizable
characters. |
| 39. The print quality of forms
should be monitored to avoid unidentified forms
during the recognition stage. It is suggested that
a random sample of questionnaires be tested
before accepting any delivery of printed forms.
An ICR application could be developed to test
the acceptability of any blank questionnaire. |
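The sampling check suggested above could be organized along these lines. This is only a sketch: the sample size, the 1% tolerance for unidentified forms, and the function names are hypothetical, and the step that actually scans a blank form and tries to identify it would be done by the ICR application, which is represented here simply by a list of pass/fail results.

```python
import random


def draw_sample(form_ids, sample_size, seed=None):
    """Pick a simple random sample of forms from a printed delivery."""
    forms = list(form_ids)
    rng = random.Random(seed)
    return rng.sample(forms, min(sample_size, len(forms)))


def accept_delivery(identified_flags, max_unidentified_share=0.01):
    """Accept when the share of unidentified sample forms is within tolerance.

    identified_flags holds one boolean per sampled form: True if the ICR
    test application identified the blank form, False otherwise.
    """
    flags = list(identified_flags)
    if not flags:
        raise ValueError("the sample is empty")
    unidentified = sum(1 for ok in flags if not ok)
    return unidentified / len(flags) <= max_unidentified_share
```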
| E.2
On ICR Software |
| 40. Select ICR software that
uses one or more recognition engines and can
accurately interpret handprinted characters. But
make sure that the interpretation software does
not "crawl" or process slowly as a result of using
multiple recognition engines. If the rate of interpretation
is not fast enough, this stage could become a
bottleneck in processing. Also make sure that
there are more than enough verifier licenses
to give flexibility during processing. When many
questionnaires are waiting in the verification
queue, extra verifier licenses make it easy to shift
some of the scan operators to verification. |
| 41. The ICR software should
be easy to customize or program. It should have
a user-friendly graphical user interface (GUI),
easy-to-use screens, and features that minimize reading
errors. It must be compatible with all standard
scanner interfaces so that it can be used with any
existing scanners. |
| 42. Technical support should
also be a consideration when choosing ICR software.
The software supplier should provide the necessary
support throughout the duration of the processing. |
|