ESCAP Statistics Division
Workshop on Population Data Analysis, Storage and Dissemination Technologies
Bangkok, 27-30 March 2001

STAT/WDT/4
22 March 2001
ENGLISH ONLY

ECONOMIC AND SOCIAL COMMISSION FOR ASIA AND THE PACIFIC

The Use of OCR Technology in Population Census of Macao, China
(Item 3 of the provisional agenda) 1/
Contents
  1. Background
  2. Features of the OCR System
  3. The Use of OCR in the Data Capturing Workflow
  4. Benefit of the OCR System
  5. Drawback
  6. Conclusion

1/ This paper, prepared by Mr Ieong Meng Chao, Statistics and Census Service, Macao, China, has been reproduced as submitted.  It has been issued without formal editing.
1. Background
The population census is the major source of up-to-date information on the demographic and socioeconomic characteristics of the population. Following international practice, population censuses have been conducted in Macao every ten years. The last population census was conducted in 1991, and the 14th Population Census of Macao will be conducted in August of this year, 2001.
Quarters are the basic enumeration units in the Macao population census. There are about 200,000 quarters in the 23.8 km² area. All quarters will be visited by enumerators within the 10-day census period. Twenty percent of the quarters will be asked to answer a long-form questionnaire, and the remaining eighty percent will only need to answer a short-form questionnaire.
Traditionally, the collected questionnaires are keyed in manually after the census period. The time required to complete data entry depends mainly on the number of data entry operators available. In the 1991 population census, for example, 20 operators took three months to complete the data entry of all 110,000 questionnaires.
As the pressure for timely statistics is ever increasing, a faster data capturing method is required. After an extensive evaluation of optical mark recognition (OMR) and optical character recognition (OCR) technologies, the latter was chosen as the data capturing method for the 2001 population census.
2. Features of the OCR System
2.1 Dual recognizers approach
The OCR system used in the census project utilizes two recognizers to achieve a more acceptable accuracy rate. One of the recognizers uses a commercially available API and the other one is a recognizer developed using neural networks.
The API is given an image of a handwritten digit and, after some calculation, returns the recognized digit together with an indication of how confident it is in the result. The API provides only a binary confidence, i.e. it is either confident in the result or not.
The neural recognizer uses a back-propagation network with an input layer, a hidden layer and an output layer. Before the recognizer can be used in the fieldwork, it must first be trained with a set of examples: sample images of scanned handwritten digits, each labelled with the digit it represents. Fed with this information, the back-propagation network learns these patterns. When trained with a sufficient number of examples, the recognizer is able to recognize images of digits that were not seen during the training phase.
To achieve better recognition accuracy, the examples used to train the neural network must be carefully chosen so that noisy data are not fed into the training process. During training, special care should be taken to avoid overfitting, in which the trained network depends so strictly on the training examples that it is not general enough to recognize digits unseen during training.
After the network is trained, recognition consists simply of a set of calculations through the network, and the output with the highest value is taken as the recognized digit. The recognizer outputs the recognized digit and a confidence level, which is a real value from 0 to 1.
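The recognition step described above can be sketched as a forward pass through a network with one hidden layer, followed by picking the strongest output. This is only an illustrative sketch: the sigmoid activation, the use of the winning output value as the confidence level, and all function and parameter names are assumptions, since the paper does not give these details. Training by back-propagation is not shown.

```python
import math


def sigmoid(x):
    """Squash a value into the range 0..1 (assumed activation function)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """Compute one fully connected layer: one output per weight row."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def recognize(pixels, hidden_w, hidden_b, output_w, output_b):
    """Return (digit, confidence) for one character image.

    The digit is the index of the output unit with the highest value.
    Using that value as the confidence level is an assumption.
    """
    hidden = layer(pixels, hidden_w, hidden_b)
    outputs = layer(hidden, output_w, output_b)
    digit = max(range(len(outputs)), key=lambda i: outputs[i])
    return digit, outputs[digit]
```

In a real digit recognizer the output layer would have ten units, one per digit, and the weights would come from the training phase.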
2.2 Prioritized voting algorithm
The two recognizers are used together to form a combined recognizer: both are given the same digit image for classification, and the combined recognizer derives the final output from their results. When both recognizers return the same digit, the combined recognizer takes that digit as its result, and the confidence level is the sum of the confidence levels given by the two recognizers. When the recognized digits differ, the final output favors the API recognizer, as follows:
CA = confidence level from API recognizer; CN = confidence level from neural recognizer

CA       CN       Recognizer chosen for the final digit   Final confidence level   Confirmation required
1.0      ≤ 0.5    API recognizer                          ≥ 0.5 (CA - CN)          No
1.0      > 0.5    API recognizer                          < 0.5 (CA - CN)          Yes
≤ 0.5    ≤ 0.5    API recognizer                          < 0.5 (CA - CN)          Yes
≤ 0.5    > 0.5    Neural recognizer                       0.5                      Yes
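The voting rules above can be written as a small function. This is a sketch, not the system's actual code: it reads the thresholds in the rules as "at most 0.5" and "greater than 0.5", the function name and return shape are hypothetical, and the rule for when an agreed digit still needs confirmation (summed confidence below 0.5) is an assumption, because the paper does not state it.

```python
def combine(api_digit, ca, neural_digit, cn, threshold=0.5):
    """Combine the results of the two recognizers.

    ca: confidence from the API recognizer (binary in practice, e.g. 1.0)
    cn: confidence from the neural recognizer (0.0 to 1.0)
    Returns (digit, confidence, needs_confirmation).
    """
    if api_digit == neural_digit:
        # Agreement: confidence levels are summed.
        confidence = ca + cn
        # Assumption: an agreed digit is confirmed only if still below threshold.
        return api_digit, confidence, confidence < threshold

    if ca > threshold:
        # API is confident: keep its digit, reduce confidence by CN.
        confidence = ca - cn
        return api_digit, confidence, confidence < threshold

    if cn <= threshold:
        # Neither recognizer is confident: API digit, always confirmed.
        # CA - CN may be negative here; it only signals low confidence.
        return api_digit, ca - cn, True

    # Only the neural recognizer is confident: use its digit, but confirm.
    return neural_digit, threshold, True
```

For example, `combine(3, 1.0, 8, 0.25)` keeps the API digit 3 with confidence 0.75 and no confirmation, while `combine(3, 1.0, 8, 0.75)` keeps digit 3 with confidence 0.25 and sends it to manual confirmation.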
2.3 Batch mode processing
After the questionnaires are scanned, the recognition process runs in batch mode and the recognition results are stored in a database for subsequent processing. This allows the OCR system to run unattended 24 hours a day.
When an operator logs in to the system, the confirmation process starts and the pending rejected characters are automatically presented to the operator for manual confirmation.
2.4 Character mode recognition
The system uses character mode recognition, in which it recognizes the questionnaire character by character at predefined coordinates.
2.5 Automatic shift of image window
The system can slightly shift the image window in order to capture the whole of a single character. This function was added because the scanned questionnaire images are usually distorted to some degree. The distortion may be due to the scanning angle, minor malfunctions of the scanner, environmental factors that change the physical size of the forms, or dirty spots on the forms.
When the images are distorted, the OCR system may have problems locating the correct positions of the characters according to the predefined coordinates. In this situation, the window shifting function moves the window around the original position within a certain margin, trying to capture the whole character without cutting through it. This function has proved very useful in improving recognition accuracy.
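One possible way to implement such a window shift is sketched below. The paper does not describe the search criterion, so everything here is an assumption: the image is a list of rows of 0 (white) and 1 (black), a window is considered acceptable when no black pixel lies on its border, and candidate shifts are tried in order of increasing distance from the predefined position. All names are hypothetical.

```python
def border_is_clear(image, top, left, height, width):
    """Return True if no black pixel (1) lies on the window's border."""
    bottom, right = top + height - 1, left + width - 1
    for r in range(top, bottom + 1):
        for c in range(left, right + 1):
            on_border = r in (top, bottom) or c in (left, right)
            if on_border and image[r][c] == 1:
                return False
    return True


def shift_window(image, top, left, height, width, margin=2):
    """Find the nearest window position (within the margin) with a clear border.

    Falls back to the predefined position if no candidate qualifies.
    """
    rows, cols = len(image), len(image[0])
    offsets = [
        (dr, dc)
        for dr in range(-margin, margin + 1)
        for dc in range(-margin, margin + 1)
    ]
    # Try the smallest shifts first so the window stays close to its origin.
    offsets.sort(key=lambda o: abs(o[0]) + abs(o[1]))
    for dr, dc in offsets:
        t, l = top + dr, left + dc
        if t < 0 or l < 0 or t + height > rows or l + width > cols:
            continue  # the shifted window would leave the image
        if border_is_clear(image, t, l, height, width):
            return t, l
    return top, left
```

A clear border is only a rough proxy for "the whole character is inside the window"; an empty window would also pass, so a production system would need a stronger test.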
2.6 Automatic spots removal
Dirty spots on the scanned image confuse the recognizers. Our OCR system employs a simple function to remove them: a black pixel on the target image is cleared if all eight of its surrounding pixels are white. Though this algorithm is simple, it is quite helpful to the recognition process.
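The spot removal rule can be sketched in a few lines. The representation (a list of rows with 1 for black and 0 for white) and the treatment of pixels outside the image as white are assumptions made for this illustration.

```python
def remove_spots(image):
    """Return a copy of the image with isolated black pixels turned white.

    A black pixel is isolated when all of its (up to eight) neighbours are
    white. Neighbours outside the image are treated as white (assumption).
    """
    rows, cols = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1:
                continue
            neighbours = [
                image[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < rows
                and 0 <= c + dc < cols
            ]
            if all(p == 0 for p in neighbours):
                cleaned[r][c] = 0
    return cleaned
```

The function reads from the original image and writes to a copy, so clearing one pixel never affects the decision for another.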
2.7 Recognition result
The following test results were obtained from a sample of 150,000 digit images.
Digit (%)             0      1      2      3      4      5      6      7      8      9    All
Recognition rate  94.83  96.83  94.92  91.11  96.00  94.95  97.29  97.72  90.43  81.74  95.64
Reject rate        5.17   3.17   5.08   8.89   4.00   5.05   2.71   2.28   9.57  18.26   4.36
Accuracy rate     99.38  99.89  99.78  99.73  99.89  99.41  99.79  99.59  99.12 100.00  99.72
Error rate         0.62   0.11   0.28   0.27   0.11   0.59   0.21   0.41   0.88   0.00   0.28
In the above table:
  • Recognition rate is the percentage of digits recognized with a confidence level of 0.5 or above. The recognized digits are not necessarily correct;
  • Reject rate is the percentage of digits recognized with a confidence level below 0.5. The system may have recognized the digit correctly, but because of the low confidence level the result needs to be confirmed manually;
  • Accuracy rate is the percentage of correct results among the digits recognized with a confidence level of 0.5 or above;
  • Error rate is the percentage of incorrect results among the digits recognized with a confidence level of 0.5 or above.
The overall recognition rate of the OCR system is 95.64% and the accuracy rate is 99.72%. Since the error rate is calculated on the recognized data only, the error rate relative to all the data is about 0.27% (0.28% × 95.64%). Furthermore, all the data will go through a strict validation process, so the final error rate should be even lower.
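The four rates and the overall error figure can be reproduced with the short sketch below. The function names and the counts passed in are hypothetical; only the definitions and the two overall percentages (0.28% and 95.64%) come from the paper. The product works out to roughly 0.268%, which is 0.27% when rounded and 0.26% when truncated.

```python
def rates(accepted_correct, accepted_wrong, rejected):
    """Compute the four rates (in percent) from raw digit counts.

    accepted_*: digits recognized with confidence of 0.5 or above
    rejected:   digits recognized with confidence below 0.5
    """
    accepted = accepted_correct + accepted_wrong
    total = accepted + rejected
    return {
        "recognition": 100.0 * accepted / total,
        "reject": 100.0 * rejected / total,
        "accuracy": 100.0 * accepted_correct / accepted,
        "error": 100.0 * accepted_wrong / accepted,
    }


def overall_error(error_rate, recognition_rate):
    """Error rate relative to all digits, given both rates in percent."""
    return error_rate * recognition_rate / 100.0


# Overall figures from the table: error 0.28%, recognition 95.64%.
print(round(overall_error(0.28, 95.64), 4))
```

This assumes that rejected digits are corrected during manual confirmation and therefore contribute no errors.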
3. The Use of OCR in the Data Capturing Workflow
With the goal of improving the overall quality of census data, a revised data capturing workflow has been designed for the 2001 Population Census of Macao. Its design is based on the flow of batches, each of which contains about 40 forms. After the batches are scanned, the system keeps track of their status and processes them accordingly.
[Figure: Data Capturing Workflow]
3.1 Scanning
Completed questionnaires are sent in batches from the census branch offices to the head office. Each batch contains 40-50 questionnaires. A unique batch ID is assigned to each batch by attaching a preprinted barcode label.
When the batches arrive at the head office, the scanning operator first identifies the batch by reading the barcode and then feeds the forms into the document scanners. After scanning, the total number of forms scanned is displayed. If it matches the manually counted total written on the label, the operator accepts the batch; otherwise, the batch has to be rescanned or the total has to be corrected.
When scanning is completed, the system marks the batch with a "scanning completed" status.
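The acceptance check at the end of scanning amounts to comparing two counts and setting the batch status. The sketch below illustrates this; the status names other than "scanning completed" and the function name are hypothetical.

```python
from enum import Enum


class BatchStatus(Enum):
    SCANNING_COMPLETED = "scanning completed"
    COUNT_MISMATCH = "count mismatch"  # hypothetical name for the failure case


def accept_scan(scanned_total, label_total):
    """Compare the scanner's form count with the total written on the label."""
    if scanned_total == label_total:
        return BatchStatus.SCANNING_COMPLETED
    # The operator must rescan the batch or correct the total on the label.
    return BatchStatus.COUNT_MISMATCH
```

A batch of 42 forms whose label also says 42 would be marked `SCANNING_COMPLETED`; any other combination is held back for the operator.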
3.2 Recognition
The recognition program will automatically find all the batches with a scanning completed status and start the recognition process. Since several document formats exist in the system, the recognition program will use different form definitions according to a predefined batch numbering scheme.
The recognition results and confidence levels for all the OCR and OMR-like (mark-sense) answer boxes of the batch are stored in a Microsoft Access (.mdb) file.
3.3 Confirmation
When rejected characters exist in a batch, the confirmation process automatically opens the batch, jumps to the pages containing rejected characters and highlights them. The operator can examine the image of the original document, zooming in and out, to determine the correct answer. If the recognized answer is correct, the operator accepts it by pressing the <enter> key; otherwise, he/she keys in the correct answer.
As the documents are recognized character by character, the confirmation process is also character-based. For example, if the last digit of the year-of-birth field is rejected, the operator only needs to confirm the last digit instead of the whole field.
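Character-based confirmation can be illustrated with the year-of-birth example. In this sketch a field is a list of (character, confidence) pairs, a character is treated as rejected when its confidence is below 0.5, and the operator's input is a mapping from position to keyed character; these data shapes and names are assumptions for illustration only.

```python
def rejected_positions(chars, threshold=0.5):
    """Return the positions within a field that need manual confirmation."""
    return [i for i, (_, confidence) in enumerate(chars) if confidence < threshold]


def confirm_field(chars, operator_input, threshold=0.5):
    """Build the final field value after confirmation.

    operator_input maps a rejected position to the character keyed in.
    A rejected position missing from the mapping means the operator
    pressed <enter> and accepted the recognized character.
    """
    result = []
    for i, (char, confidence) in enumerate(chars):
        if confidence < threshold and i in operator_input:
            result.append(operator_input[i])
        else:
            result.append(char)
    return "".join(result)


year_of_birth = [("1", 0.9), ("9", 0.8), ("6", 0.9), ("5", 0.3)]
print(rejected_positions(year_of_birth))        # only the last digit
print(confirm_field(year_of_birth, {3: "3"}))   # operator keys in "3"
```

Only position 3 is presented to the operator; the three confidently recognized digits are never touched.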
3.4 Transformation
After confirmation, the batches are transformed into formal questionnaire data and stored in the back-end database. The transformed data are organized by questionnaire, with a hierarchy of households and individuals.
3.5 Validation
Upon completion of transformation, the batches enter the validation status. The validation program is designed to examine the questionnaire data. It performs range checks on data items, logical validation of related questions for a single person, and cross-person validation.
The errors found are written to the database, identified by questionnaire ID, household number, question number, member number and error code. These error records are used to generate correction forms for the subsequent error correction process.
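The three kinds of check and the shape of the error records can be sketched as follows. The individual rules (age between 0 and 120, no employment under age 10, exactly one household head), the question identifiers, the error codes and the use of member number 0 for household-level errors are all invented for illustration; the paper does not list the actual validation rules.

```python
def validate(questionnaire):
    """Return a list of error records for one questionnaire."""
    errors = []

    def report(household_no, member_no, question_no, code):
        errors.append({
            "questionnaire_id": questionnaire["id"],
            "household_no": household_no,
            "question_no": question_no,
            "member_no": member_no,
            "error_code": code,
        })

    for household in questionnaire["households"]:
        heads = 0
        for member in household["members"]:
            if not 0 <= member["age"] <= 120:
                # Range check on a single data item (hypothetical rule).
                report(household["no"], member["no"], "Q_AGE", "E01_RANGE")
            elif member["age"] < 10 and member["employed"]:
                # Logical check across questions for one person (hypothetical).
                report(household["no"], member["no"], "Q_EMPLOYED", "E02_LOGIC")
            if member["is_head"]:
                heads += 1
        if heads != 1:
            # Cross-person check within the household (hypothetical rule).
            report(household["no"], 0, "Q_RELATION", "E03_CROSS_PERSON")
    return errors
```

Each record carries enough identifiers to locate the questionnaire, household, person and question, which is what a correction form needs.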
3.6 Correction
The correction process involves three steps. The first step is the generation of the correction form (appendix 1). A correction form is a printout of the validation error(s) for a particular questionnaire. It shows the details of the errors along with the original questionnaire image.
By reading the error description and the condensed image printed on the same form, the enumerators should be able to identify the causes of the errors and, if they know them, fill in the correct answers. This manual correction is the second step of the correction process.
The final step is the processing of the correction forms. Once the correction forms have been filled in with the corrected answers, they are put into batches in the same way as normal questionnaires. Special batch numbers are reserved for batches of correction forms. Subsequent processes such as scanning, recognition and confirmation are conducted in a similar way. Finally, the correction data captured from the correction forms are used to update the questionnaire master data.
When the questionnaires have been updated, the system automatically validates them again. If a questionnaire still has errors, they are used to generate a new correction form, and the correction cycle is repeated until all errors are cleared.
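The validate-correct-revalidate cycle can be sketched as a simple loop. The questionnaire is modelled here as a flat dictionary, and `validate` and `collect_corrections` are hypothetical callables standing in for the validation program and for the whole correction-form round trip (printing, manual correction, scanning, recognition, confirmation). The `max_rounds` limit is an added safeguard, not something the paper describes.

```python
def correction_cycle(questionnaire, validate, collect_corrections, max_rounds=10):
    """Repeat validation and correction until no errors remain.

    validate(questionnaire) -> list of errors
    collect_corrections(questionnaire, errors) -> dict of corrected values
    Returns (rounds_used, remaining_errors).
    """
    rounds = 0
    errors = validate(questionnaire)
    while errors and rounds < max_rounds:
        # One round trip of a correction form for this questionnaire.
        corrections = collect_corrections(questionnaire, errors)
        questionnaire.update(corrections)
        errors = validate(questionnaire)
        rounds += 1
    return rounds, errors
```

A questionnaire that passes validation at the first attempt goes through zero rounds and produces no correction form.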
4. Benefit of the OCR System
4.1 Improve the quality of census data
In the last census, conducted in 1991, one of the major problems was the correction of errors. No imaging technology was employed in the 1991 census, and data correction was done manually in the following steps:
  1. read the errors from a correction list stating which errors were found on which questionnaires;
  2. retrieve the original questionnaires from the document archive and identify the causes of the errors;
  3. prepare a plain correction form by filling in the correction key and the new values;
  4. have the correction forms keyed in by a data entry operator;
  5. re-run the validation program.
This data correction procedure was laborious and inefficient. The users spent a lot of time retrieving the required questionnaires from among hundreds of thousands in the document archive and filling in the correction forms. In most cases they had no idea what the correct answers were, and had difficulty clarifying them with the households, because even the household members did not remember how they had answered the questions some months earlier.
The new data capturing workflow tries to solve this problem with the help of OCR technology. In the new scenario, correction forms are sent back to the original enumerator of the questionnaire within two days. The enumerators should be able to correct the answers. If they need to clarify a problem, they can telephone the household or, in the worst case, revisit it.
The correction forms are printed with all the information necessary for correction, so no questionnaire retrieval is required. Users only need to fill in the correct values in the correction boxes, which saves a lot of time.
4.2 Fast data capture
With the help of OCR technology, it is expected that the raw data from all the collected questionnaires can be captured within one week after the census period, and that data validation and correction can be completed within one month. Compared with the estimated six months of manual data entry, five months could be saved.
4.3 Cost saving
Compared with the cost of manual data entry, it is estimated that employing the OCR system in the 2001 population census could save fifty percent. In addition, the investment in hardware and software can be used for future projects.
5. Drawback
5.1 Not environment friendly
According to the experience of the census pilot project, fifteen percent of the questionnaires could not pass validation, so a huge volume of correction forms will be produced.
To tackle this problem, the validation rules should be carefully examined so that only major errors are flagged. Minor errors should be corrected automatically using default values.
5.2 Difficult to manage
The use of correction forms adds to the workload of both the fieldwork and the head office. Sufficient manpower therefore has to be reserved to handle it.
To lower the impact of data correction, enumerators need additional training so that they know well how to use the correction forms.
5.3 Problem of confidentiality
Because the correction forms are printed with the questionnaire image, their distribution must be strictly controlled. Our system employs a simple tracking procedure for the return of the correction forms, to ensure that every distributed correction form is returned. If any correction forms are outstanding, a reminder message is generated on the enumerator's daily work report.
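Such a tracking procedure essentially compares the forms handed out with the forms received back. A minimal sketch follows; the data shapes, identifiers and function name are hypothetical.

```python
def outstanding_forms(distributed, returned):
    """Group correction forms that have not come back, by enumerator.

    distributed: dict mapping form ID -> enumerator ID
    returned:    set of form IDs already received at the head office
    Returns a dict mapping enumerator ID -> list of outstanding form IDs,
    which could feed the reminder on the enumerator's daily work report.
    """
    reminders = {}
    for form_id, enumerator in distributed.items():
        if form_id not in returned:
            reminders.setdefault(enumerator, []).append(form_id)
    return reminders
```

An enumerator with no outstanding forms simply does not appear in the result, so no reminder is printed for them.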
6. Conclusion
This is the first time that OCR technology has been used in the data capturing process for a population census in Macao. Though some technical problems were encountered, the results of the census pilot project have shown that the use of OCR technology in census data capturing is very promising, and we believe that its use can be extended to other future projects as well.

 