Workshop on Population Data Analysis, Storage and Dissemination Technologies
Bangkok, 27-30 March 2001

STAT/WDT/Philippines
20 March 2001
ENGLISH ONLY

ECONOMIC AND SOCIAL COMMISSION FOR ASIA AND THE PACIFIC

Latest Innovations in Methods and Tools for Census Data: Technological Lessons from the 2000 Round of the Philippine Census 1/
By: Carmelita N. Ericta 2/ and Elpidio Nogales 3/
Contents
  A. Introduction
  B. Census Processing Using ICR Technology
    B.1 Preparation of ICR-Friendly Census Forms
    B.2 Stages of ICR-Based Census Processing
    B.3 Resources for the ICR-Based Processing
  C. Experiences in the Use of ICR Technology for Census Processing
    C.1 Data Capture Center Strategy
    C.2 Experiences on Preparation of ICR-Friendly Forms
    C.3 Experiences on the Use of ICR Software
    C.4 Experiences on Image Capture
  D. Future Directions on ICR-Based Processing at NSO
  E. Recommendations
    E.1 On Forms Preparation
    E.2 On ICR Software

1/  This paper has been reproduced as submitted.  It has been issued without formal editing. 
2/   Deputy Administrator, National Statistics Office, Philippines.
3/  Project Leader, Data Capture Center, National Statistics Office, Philippines.
A. INTRODUCTION
1. Census processing has always been a daunting task for most statistical offices. The huge volume of questionnaires to be handled poses a great challenge to the data processing capabilities of statistical agencies. Thus the National Statistics Office (NSO) is always seeking ways to improve its census processing.
2. In the Philippines' 1990 Census of Population and Housing, the NSO made a bold decision to use stand-alone PC XTs instead of the old reliable mainframes to process more than 70 million records. Many doubted the wisdom of using these 40 MB-capacity computers, but NSO was able to show that it was a wise decision. In 1995, NSO made another innovation in census processing by using networked computers (PC 486s) to process the 1995 Population Census. In both censuses, NSO used traditional data entry to convert data from census forms into digital format.
3. Last year, the NSO conducted the Philippines' 2000 Census of Population and Housing (Census 2000). This time NSO sought to depart from the traditional method of capturing census data by employing Intelligent Character Recognition (ICR) technology. With this technology, NSO aims to facilitate the processing of Census 2000 by capturing the images of census forms and letting the ICR interpretation engines convert the data into digital format. The knowledge and experience gained in using ICR technology will be used to revolutionize census and survey processing at the NSO.
B. CENSUS PROCESSING USING ICR TECHNOLOGY
4. In a nutshell, ICR-based census processing consists of document scanning, interpretation/recognition, data verification, and data transfer/generation stages. Before actual processing, however, it is worth considering the preparations that ought to be done on the census forms, the overall processing strategy, and the resources needed for effective ICR-based census processing.
B.1 Preparation of ICR-Friendly Census Forms
5. One important aspect of ICR-based census processing is the preparation of ICR-friendly census forms. This matter should not be taken for granted for this may eventually determine the success or failure of census processing. Census forms should be very carefully designed so as to minimize processing errors.
6. The statisticians and data processing personnel should discuss and decide together whether to include more fill-in (handwritten) fields, more mark fields (check boxes), or a balance of both in the census forms. Mark fields are very accurate, while handwritten entries have lower recognition rates. But the preparation of mark fields entails a lot of research as to which choices should be included as check boxes. Also, handwritten fields can accommodate any kind of response, while mark fields restrict the answer to the pre-printed choices.
7. The designer of census forms may also use dropout colors to increase accuracy. Dropout colors are not captured by the scanner and are not stored in document images. This scanning feature may help in designing field boxes, especially those requiring handwritten entries. The only drawback of using a dropout color arises when the prescribed shade or tint is not accurately followed during the printing of census forms. This may result in the insertion of unwanted characters during the interpretation stage.
B.2 Stages of ICR-based Census Processing
8. The ICR technology is a system which is acknowledged to be fast, reliable, and efficient at extracting information from documents of all kinds, such as forms, questionnaires, faxes, and the like. The stages of ICR-based processing are as follows:
  1. Image Capture Stage;
  2. Interpretation/Recognition Stage;
  3. Data Verification Stage; and
  4. Data Transfer/Data Generation Stage.
9. ICR-based census processing should run on a local area network consisting of a large-capacity image/database server with fast workstations, scanners, and backup devices. Figure 1 is an example of a network configuration for ICR-based census processing.
B.2.1 Image Capture Stage
10. Census forms are captured as computer images using fast mid- or high-volume document scanners. These document images are sent to the network's image server for further processing. To minimize form errors, some document scanning software is equipped with a deskew capability that corrects the alignment of form images.
B.2.2 Interpretation/Recognition Stage
11. This is the heart of ICR-based processing. In this stage, images of census questionnaires undergo interpretation where forms are checked/identified by ICR software's recognition engine. Identification of forms is done by the use of adjustment fields that are pre-printed in census forms. If the recognition software cannot locate these adjustment fields in an image, the said image is considered UNIDENTIFIED. These unidentified images are set apart by the recognition software and do not undergo further processing.
Figure 1. DCC Manila ICR-based Network Configuration
12. If the recognition software is able to locate a sufficient number of adjustment fields in an image, the image is tagged as IDENTIFIED. Field entries within the identified images undergo further interpretation, where corresponding computer codes are created for these entries in the ICR database. The identified forms with "erroneous" (unrecognizable, inconsistent or invalid) field entries are then automatically assigned by the ICR software to the first available verify workstation.
B.2.3 Data Verification Stage
13. Only forms with at least one erroneous field entry go through data verification. Forms with no erroneous field entries (clean forms) do not undergo data verification. The interpreted data for these clean forms are stored in the ICR software's database.
14. During this stage, all pages of a questionnaire are displayed on the verifier's workstation. The ICR software highlights the fields to be verified. The verifier then enters the correct entries for the highlighted fields. All corrections are also stored in the ICR software's database. During verification, all the details of a questionnaire are shown on screen, so there is no need to retrieve the actual paper questionnaires.
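The routing described in paragraphs 11 to 13 can be sketched as follows. This is a minimal illustration in Python, not the logic of the ICR software used by NSO. The names `FormImage`, `MIN_ADJUSTMENT_FIELDS`, and `route_form`, and the threshold of 3 adjustment fields, are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: the paper says only that a "sufficient number"
# of adjustment fields must be located; 3 is an assumed value.
MIN_ADJUSTMENT_FIELDS = 3


@dataclass
class FormImage:
    form_id: str
    adjustment_fields_found: int
    # Field name -> True if the interpreted entry is erroneous
    # (unrecognizable, inconsistent, or invalid).
    field_errors: dict = field(default_factory=dict)


def route_form(form: FormImage) -> str:
    """Return the next step for a scanned form image."""
    if form.adjustment_fields_found < MIN_ADJUSTMENT_FIELDS:
        return "UNIDENTIFIED"  # set apart, no further processing
    if any(form.field_errors.values()):
        return "VERIFY"        # at least one erroneous field entry
    return "CLEAN"             # interpreted data stored without verification


forms = [
    FormImage("F001", 4, {"age": False, "sex": False}),
    FormImage("F002", 4, {"age": True, "sex": False}),
    FormImage("F003", 1),
]
routes = {f.form_id: route_form(f) for f in forms}
print(routes)
```

In this sketch a single erroneous field is enough to send the whole form to verification, which matches the rule stated in paragraph 13.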
15. The assignment of forms during verification is completely automated and the ICR software balances the volume of forms per verifier. This is part of the software's workflow capability.
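One simple way to balance the volume of forms per verifier is to give each new form to the verifier with the fewest forms so far. The sketch below assumes this least-loaded rule; the paper does not say which rule the ICR software applies, and the function name `assign_forms` and the verifier labels are hypothetical.

```python
import heapq


def assign_forms(form_ids, verifiers):
    """Assign each form to the verifier with the fewest forms so far."""
    # Heap of (current load, verifier name); ties break on name.
    heap = [(0, name) for name in sorted(verifiers)]
    heapq.heapify(heap)
    assignment = {name: [] for name in verifiers}
    for form_id in form_ids:
        load, name = heapq.heappop(heap)
        assignment[name].append(form_id)
        heapq.heappush(heap, (load + 1, name))
    return assignment


queue = [f"F{n:03d}" for n in range(1, 11)]  # ten forms awaiting verification
result = assign_forms(queue, ["V1", "V2", "V3", "V4"])
loads = {name: len(forms) for name, forms in result.items()}
print(loads)
```

With this rule the loads of any two verifiers never differ by more than one form.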
B.2.4 Data Transfer/Data Generation Stage
16. Once forms have been verified, the interpreted and verified data in the ICR software's database are ready for conversion into text (ASCII) format. This conversion is done using the Data Transfer module of the ICR software. What is being converted is the proprietary structure of the ICR software's database, which is not readily accessible to other popular software. The structure of the output ASCII file is based on the instructions defined by the data processing personnel who customized the ICR software.
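As an illustration of the kind of output this stage produces, the sketch below writes records as fixed-width ASCII lines. It does not use the Data Transfer module of any ICR product. The record layout, field names, and sample values are all hypothetical, since the paper does not give the actual Census 2000 layout.

```python
# Hypothetical record layout: (field name, width). The actual layout
# used for Census 2000 is not given in the paper.
LAYOUT = [("province", 2), ("municipality", 2), ("age", 3), ("sex", 1)]


def to_ascii_line(record, layout=LAYOUT):
    """Format one record as a fixed-width ASCII line."""
    parts = []
    for name, width in layout:
        value = record.get(name)
        # Blank-fill missing values; zero-pad numeric codes.
        text = " " * width if value is None else str(value).zfill(width)
        if len(text) != width:
            raise ValueError(f"{name}={value!r} does not fit width {width}")
        parts.append(text)
    line = "".join(parts)
    line.encode("ascii")  # raises UnicodeEncodeError on non-ASCII content
    return line


records = [
    {"province": 39, "municipality": 5, "age": 7, "sex": 1},
    {"province": 39, "municipality": 5, "age": None, "sex": 2},  # age left blank
]
lines = [to_ascii_line(r) for r in records]
print("\n".join(lines))
```

Blank-filling a missing value, rather than writing zeros, keeps an unanswered item distinguishable from a genuine zero in the output file.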
B.3 Resources for the ICR-Based Processing
B.3.1 Peopleware for the ICR-Based Processing
17. The first thing that comes to mind when adopting new technologies is whether there are enough skilled personnel who can learn and use these technologies efficiently. For ICR-based processing, the following personnel are needed for smooth operation:
  1. Scan Operators;
  2. Verifier Operators;
  3. Data Processing Supervisor; and
  4. A Team of IT Personnel who will customize the ICR software.
18. The scan operators operate the document scanners to make computer images of census questionnaires. The minimum requirement for this position is basic knowledge of personal computers. The verifier operators (or verifiers) are the ones who re-enter or correct erroneous field entries. Again, only basic knowledge of personal computers is needed. Additional training or a briefing on basic concepts of the census may be conducted to help them resolve invalid or inconsistent field entries. The data processing supervisor is the key person during processing. Basic knowledge of computer operation and networking is necessary for the supervisor. He should also be knowledgeable in the various concepts used in the census. A team of IT personnel is needed to customize the ICR software. The team members should be knowledgeable in computer programming and networking. They work hand in hand with census statisticians in order to faithfully implement the editing and validation specifications for all the fields covered by the census. They should be trained in the various aspects of customizing the ICR software.
B.3.2 Hardware Requirements for ICR-Based Processing
19. The hardware requirements for ICR-based processing are basically the same as those of traditional census processing, except for the introduction of document scanners. As in traditional processing, a local area network (LAN) is needed. However, the server for ICR technology should have a large disk and high memory capacity due to the nature of image processing. Aside from normal networking operations, which already require large memory, the server is also used as an image server and as a database server. The LAN for ICR-based processing should include verifier, scanning, manager, and backup workstations. It is preferable that these workstations be Pentium-class computers to provide the speed and memory size needed for processing. Document scanners with sheet feeders are attached to the scan workstations. The manager workstation is used to monitor the status of processing. This workstation is also used to run the data transfer and interpret modules of the ICR software. A backup workstation is used to create backup copies of document images on CD-ROMs. Newer storage technologies like magneto optical (MO) disks and drives are also available and may be used as reliable backup devices.
B.3.3 Software Requirements for ICR-Based Processing
20. The selection of an appropriate ICR software package is crucial to the success of ICR-based census processing. The software should have at least the following characteristics:
  • Ability to identify fields in the forms.
  • Ability to adjust the character recognition accuracy levels or probability of recognition for each character in the form.
  • Ability to allow data entry from unidentifiable forms by manual keying of characters from images.
  • Compatibility with standard scanning devices.
  • Ability to do automatic assignment of forms for verification and workload balancing among verify operators.
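The second characteristic, adjustable recognition accuracy levels, can be pictured as a per-field confidence threshold: an entry is sent to verification when any of its characters falls below the level set for that field. The sketch below assumes the recognition engine reports a confidence between 0 and 1 for each character. The threshold values, field names, and the function `needs_verification` are hypothetical.

```python
# Hypothetical per-field minimum confidence levels (0.0 to 1.0).
THRESHOLDS = {"age": 0.90, "occupation_code": 0.95}
DEFAULT_THRESHOLD = 0.85


def needs_verification(field_name, chars):
    """chars is a list of (character, confidence) pairs for one field entry."""
    threshold = THRESHOLDS.get(field_name, DEFAULT_THRESHOLD)
    return any(conf < threshold for _, conf in chars)


age_entry = [("4", 0.99), ("2", 0.97)]
occ_entry = [("1", 0.99), ("7", 0.93), ("3", 0.98)]
flags = {
    "age": needs_verification("age", age_entry),
    "occupation_code": needs_verification("occupation_code", occ_entry),
}
print(flags)
```

Raising a threshold sends more entries to the verifiers and lowers the risk of silent misreads; lowering it does the opposite.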
21. Usually, a scanning module is included in the ICR software. It is best to use this built-in module because it is designed for the processing of forms. It is also possible to use document scanning software external to the ICR software, but it may not include some features that are useful in forms processing. For the Philippines' Census 2000, the NSO used Eyes and Hands for Forms as the ICR software and KODAK MVCS as the external document scanning software.
22. Aside from the ICR software, there is also a need to acquire or develop a monitoring system in order to determine the status of processing at any time. The software should be able to generate processing statistics on productivity of various personnel involved in the census processing. The NSO Philippines used a tailor-made Census Progress Monitoring System (CPMS) for this purpose.
C. EXPERIENCES IN THE USE OF ICR TECHNOLOGY FOR CENSUS PROCESSING
23. NSO's experiences on the use of ICR technology for its Census 2000 may be of help to those who are planning to use this new technology. There are four (4) areas of processing where NSO learned the most and these are:
  1. Data Capture Center Strategy
  2. Forms Preparation
  3. ICR Software
  4. Image Capture/Scanning
C.1 Data Capture Center Strategy
24. The NSO Philippines created data capture centers (DCCs) to handle the data capture and forms processing for the 2000 Census of Population and Housing. Processing at each DCC is LAN-based, equipped with at least five (5) mid-volume scanners, fifteen (15) Pentium III workstations, three (3) magneto optical disk drives, three (3) CD writers, one (1) network printer and one (1) 500 MHz Pentium III server with a capacity of 90 gigabytes. In order to process at least 15 million questionnaires, four (4) DCCs were strategically set up in the country. Each DCC is assigned to process about 4 million census forms from a group of regions. The limitation on the number of DCCs is due to the cost of equipment for ICR-based processing.
25. The DCCs operate at least 18 hours a day, six days a week, with a two-shift work schedule. A total of 146 personnel were hired for the DCCs. The personnel were trained for a week on the basic concepts of the census and machine processing at the DCC before the actual operation. Most of the personnel had little computer knowledge but were able to learn and become proficient in their respective roles in the DCC within the one-week training.
26. Aside from the scan and verifier operators, NSO Philippines also hired four (4) data controllers per shift per DCC. The data controllers prepare and check the batches of forms for validity of geographic codes, clarity of entries, and paper orientation. They make sure that the scan operators do not run out of forms to scan. NSO Philippines also hired one (1) file preparation and transfer operator (FPTO) per shift. The FPTO runs the interpretation and transfer modules of the ICR software. He is also tasked to create backup copies of images in magneto optical (MO) cartridges and CD-ROMs. He also creates back-up copies of system log files and the databases of the CPMS and ICR software.
27. Each DCC sends its weekly report and output data files to the Central Office as e-mail attachments. These files are automatically consolidated once the e-mail messages are received. The consolidated report is then sent to the Administrator and other top-level officials.
C.2 Experiences on Preparation of ICR-Friendly Forms
28. NSO Philippines spent considerable time designing ICR-friendly forms. A mixture of mark and handwritten fields was used in the census questionnaire. Mark fields were used for items with few possible answers like sex, marital status, housing questions, etc. Handwritten fields were used for items with numerous possible entries such as age, year of birth, occupation codes, etc. These handwritten fields require only numeric entries. The use of alphabetic entries was avoided in order to achieve high recognition rates.
29. One of the biggest problems encountered during processing was the print quality of the census forms. It was observed that some forms did not conform to the print specifications, which caused the ICR software to tag them as unidentified. The following are some of the printing problems encountered during processing:
Form Problems                             Effect on Processing
a. Wider/narrower margins                 Unidentified form
b. Some pages are blank                   Unidentified form
c. Some pages are printed upside down     Unidentified form
d. Too dark dropout color                 Field boxes appear on images, causing interpretation error
e. Other print errors                     Either unidentified form or interpretation error
30. Another common problem during census processing is illegible handwritten entries. The enumerators were sufficiently informed during the pre-enumeration training to use the prescribed handwriting strokes. But due to the volume of forms to be accomplished, the handwriting strokes changed as more forms were filled out. Illegible handwritten entries cause either interpretation errors or misinterpretation of field values.
31. The enumerators were instructed to use pencils in accomplishing the census forms. This would not have posed problems for ICR-based processing had the forms been scanned right after the questionnaires were filled out. But in actual processing, some questionnaires were scanned two or three months after they were filled out. Hence, by the time these forms were scanned, the pencil entries were already very faint. Faint field entries cannot be interpreted because of their very low image density. To resolve this, the DCC staff had to enhance some entries using ordinary pens before scanning the documents. This added some man-days to the processing.
32. To facilitate processing, NSO Philippines instructed the enumerators to bundle questionnaires by enumeration area. An enumeration area (EA), which consists of about 300 households, is a pre-determined region within a "barangay". Each barangay consists of one or more EAs, while a group of barangays forms a municipality. Several municipalities form one province. There are about 42,000 barangays in the Philippines; thus, the total number of bundles is two or even three times the number of barangays. Questionnaires belonging to a bundle share a common geographic ID (Province, Municipality, Barangay, and EA). This is also the bundle name. NSO Philippines customized the ICR software to use the bundle name as the correct GEO ID of the questionnaires. This means that the GEO codes in the questionnaires need not undergo interpretation, thereby speeding up interpretation and increasing accuracy.
33. The use of ICR will not eliminate manual processing. There was still some minimal manual processing for Census 2000. This is because some questionnaires had incomplete or inconsistent entries. This affected the output data file, in which some critical fields were left blank by enumerators.
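The idea of taking the GEO ID from the bundle name instead of interpreting it on every questionnaire can be sketched as below. The bundle name format `PP-MM-BBB-EE`, the code widths, and the function names are assumptions for illustration only; the paper does not describe the actual naming scheme or how the ICR software was customized.

```python
from typing import NamedTuple


class GeoId(NamedTuple):
    province: str
    municipality: str
    barangay: str
    ea: str


def parse_bundle_name(bundle_name: str) -> GeoId:
    """Split a bundle name of the assumed form 'PP-MM-BBB-EE' into its parts."""
    parts = bundle_name.split("-")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected bundle name: {bundle_name!r}")
    return GeoId(*parts)


def apply_geo_id(bundle_name, questionnaires):
    """Stamp every questionnaire in a bundle with the bundle's GEO ID."""
    geo = parse_bundle_name(bundle_name)
    return [{**q, **geo._asdict()} for q in questionnaires]


# Made-up codes for one bundle holding two questionnaires.
bundle = apply_geo_id("39-05-012-01", [{"household": 1}, {"household": 2}])
print(bundle[0])
```

Because the GEO ID is read once per bundle, a misread digit in the handwritten GEO codes of an individual questionnaire cannot place it in the wrong area.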
C.3 Experiences on the Use of ICR Software
34. It has been mentioned that the interpretation/recognition engine is the heart of ICR technology. It was observed during the processing of Census 2000 that the recognition rates for mark fields are almost perfect, while those for handwritten fields are much lower. Overall, the rate of interpretation/recognition ranges from 90% to 95%. The speed of interpretation is between 3,400 and 3,900 forms per hour.
35. Although recognition rates are high, most questionnaires still undergo verification. A questionnaire undergoes verification if it has at least one erroneous field. Verification provides an opportunity to correct the erroneous field. The rate of verification ranges from 270 to 320 forms per hour per verifier. Each DCC has only four (4) verify licenses. Eight to ten verify licenses per DCC would have been ideal for NSO Philippines' census processing.
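The figures in paragraphs 24, 25, 34 and 35 allow a rough comparison of interpretation and verification capacity. The sketch below is back-of-the-envelope arithmetic under two stated assumptions that are not in the paper: every form needs verification (an upper bound, since the paper says only "most"), and all verify licenses are in use for the full 18 hours on all six days.

```python
# Figures quoted in the paper.
INTERPRET_RATE = (3400, 3900)  # forms per hour
VERIFY_RATE = (270, 320)       # forms per hour per verifier
HOURS_PER_WEEK = 18 * 6        # 18 hours a day, six days a week
FORMS_PER_DCC = 4_000_000      # "about 4 million" forms per DCC


def verify_capacity(licenses):
    """Hourly verification capacity (low, high) for a number of licenses."""
    return tuple(rate * licenses for rate in VERIFY_RATE)


def weeks_to_verify(licenses, share_verified=1.0):
    """Weeks (best, worst) needed if share_verified of the forms need verification."""
    low, high = verify_capacity(licenses)
    forms = FORMS_PER_DCC * share_verified
    return (forms / high / HOURS_PER_WEEK, forms / low / HOURS_PER_WEEK)


for licenses in (4, 8, 10):
    low, high = verify_capacity(licenses)
    best, worst = weeks_to_verify(licenses)
    print(f"{licenses:2d} licenses: {low}-{high} forms/hour, "
          f"{best:.0f}-{worst:.0f} weeks")
```

With four licenses, verification handles 1,080 to 1,280 forms per hour, roughly a third of the interpretation rate of 3,400 to 3,900 forms per hour. This is consistent with the paper's view that eight to ten licenses per DCC would have been ideal.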
C.4 Experiences on Image Capture
36. The quality of form images is crucial to ICR-based processing. More than 15 million forms were scanned using 22 Kodak mid-volume 3590 scanners. Due to the volume of forms scanned, image quality suffers when dirt accumulates and embeds lines in the form images. Daily cleaning and monthly preventive maintenance ensured good image quality and minimal downtime of the document scanners.
D. FUTURE DIRECTIONS ON ICR-BASED PROCESSING AT NSO
37. This pioneering experience on ICR-based processing of the 2000 census returns convinced NSO Philippines that this technology is beneficial to the office in the long term. The ICR equipment and software are reusable for surveys and other censuses. NSO is planning to use this technology to process the next Census on Agriculture and Fisheries. The questionnaires for this census are being designed to be ICR-friendly taking into account the lessons learned from the 2000 Census of Population and Housing. ICR technology is also being considered in the processing of Foreign Trade (Imports/Exports) documents in the very near future.
E. RECOMMENDATIONS
E.1 On Forms Preparation
38. ICR-based processing requires careful attention to the design of questionnaires. Mark fields should be used as much as possible for critical items (or if possible for most items) to minimize interpretation errors. Field boxes should be large enough for handwritten entries. This will allow enumerators to enter ICR-recognized characters.
39. The print quality of forms should be monitored to avoid unidentified forms during the recognition stage. It is suggested that a random sample of questionnaires be tested before accepting any delivery of printed forms. An ICR application could be developed to test the acceptability of any blank questionnaire.
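The suggested acceptance test can be sketched as a simple sampling check. This is only an outline under stated assumptions: `passes_check` stands in for the proposed ICR application that tests a blank questionnaire, and the sample size of 50 and the zero-failure acceptance rule are hypothetical values, not recommendations from the paper.

```python
import random


def accept_delivery(form_ids, passes_check, sample_size=50, max_failures=0, seed=None):
    """Accept a delivery if a random sample has at most max_failures bad forms."""
    rng = random.Random(seed)
    sample = rng.sample(form_ids, min(sample_size, len(form_ids)))
    failures = [f for f in sample if not passes_check(f)]
    return len(failures) <= max_failures, failures


# Simulated deliveries of 1,000 forms: one with no print defects, one all defective.
delivery = list(range(1000))
good_ok, _ = accept_delivery(delivery, lambda f: True, seed=1)
bad_ok, bad_failures = accept_delivery(delivery, lambda f: False, seed=1)
print(good_ok, bad_ok, len(bad_failures))
```

In practice the sample size and the acceptance rule would be set from the defect rate the office is willing to tolerate.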
E.2 On ICR Software
40. Select ICR software that uses one or more recognition engines and can accurately interpret handprinted characters. But make sure that the interpretation software does not "crawl" or process slowly as a result of using multiple recognition engines. If the rate of interpretation is not fast enough, this stage could become a bottleneck in processing. Make sure also that there are more than enough verifier licenses to give flexibility during processing. When more questionnaires are waiting in the verification queue, extra verifier licenses make it easy to shift some of the scan operators to verification.
41. The ICR software should be easy to customize or program. It should have user-friendly graphical (GUI) interfaces, easy-to-use screens, and features that help minimize reading errors. It must be compatible with all standard scanner interfaces so that it can be used with any existing scanning machines.
42. Technical support should also be a consideration when choosing ICR software. The software supplier should provide the necessary support throughout the duration of the processing.

 