| Workshop on Population
Data Analysis, Storage and Dissemination Technologies
|
| Bangkok, 27-30 March
2001 |
STAT/WDT/4
22 March 2001
ENGLISH ONLY
ECONOMIC AND SOCIAL COMMISSION FOR ASIA AND THE
PACIFIC
|
| The Use of OCR Technology
in Population Census of Macao, China |
| (Item 3 of the provisional
agenda) 1/
|
| Contents |
- Background
- Features of the OCR System
- The Use of OCR in the Data Capturing Workflow
- Benefit of the OCR System
- Drawback
- Conclusion
|
1/
This paper, prepared by Mr Ieong Meng Chao, Statistics
and Census Service, Macao, China, has been reproduced
as submitted. It has been issued without
formal editing.
|
| 1. Background |
| The population census is the major source of up-to-date
information on the demographic and socioeconomic characteristics
of the population. Following international practice, population
censuses have been conducted in Macao every ten years. The last
population census was conducted in 1991, and the 14th Population
Census of Macao will be conducted in August of this year, 2001. |
| Quarters are the basic enumeration units in the Macao population
census. There are about 200,000 quarters in the 23.8 km² area. All
quarters will be visited by the enumerators within the 10-day census
period. Twenty percent of the quarters will be asked to answer a
long-form questionnaire, and the remaining eighty percent will only
need to answer a short-form questionnaire. |
| Traditionally, the collected questionnaires are keyed in manually
after the census period. The time required to complete the data entry
depends mainly on the number of data entry operators available. In the
1991 population census, for example, 20 operators took three months to
complete the data entry of all 110,000 questionnaires. |
| As the pressure for timely statistics is ever increasing, a faster
data capturing method is required. After an extensive evaluation of
OMR and OCR technologies, the latter was chosen as the data capturing
method for the 2001 population census. |
|
| 2. Features of the OCR System |
| 2.1 Dual recognizers approach |
| The OCR system used in the census
project utilizes two recognizers to achieve a
more acceptable accuracy rate. One of the recognizers
uses a commercially available API and the other
one is a recognizer developed using neural networks. |
| Using the API, the system is fed an image of a written digit and,
after some calculation, the API returns the recognized digit together
with an indication of how confident it is in the recognition. The API
provides only a binary confidence, i.e. it is either confident in the
result or not. |
| The second recognizer is based on the neural network paradigm: a
back-propagation network is used, with an input layer, a hidden layer
and an output layer. Before the recognizer can be used in the
fieldwork, it must first be trained with a set of examples. These are
sample images of scanned written digits, each labelled with the digit
it represents. Fed with this information, the back-propagation network
learns these patterns. When trained with a sufficient number of
examples, the recognizer is able to recognize images of digits that
were not seen during the training phase. |
| To achieve better recognition accuracy, the examples used to train
the neural network must be carefully chosen, so that noisy data are
not fed into the training process. During training, special care
should also be taken to avoid overfitting, where the trained network
depends so strictly on the training examples that it is not general
enough to recognize digits it did not see during training. |
| After the network is trained, recognition consists simply of a set
of calculations through the network, and the output with the highest
value is regarded as the recognized digit. The recognized digit and a
confidence level are output from the recognizer. This confidence level
is a real value from 0 to 1. |
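The recognition step of such a network can be sketched as follows. This is only an illustration under stated assumptions: the sigmoid activation, the function names and the use of the highest output value as the confidence level are not taken from the census system, whose internals are not described in this paper.

```python
import math


def sigmoid(x):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def recognize(pixels, w_hidden, b_hidden, w_out, b_out):
    """Return (digit, confidence) for one character image.

    Assumption: one output unit per digit, and the confidence level is
    simply the value of the winning output unit.
    """
    hidden = layer(pixels, w_hidden, b_hidden)
    outputs = layer(hidden, w_out, b_out)
    digit = max(range(len(outputs)), key=lambda i: outputs[i])
    return digit, outputs[digit]
```

In a real recognizer the weights would come from back-propagation training and the output layer would have ten units, one per digit.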
| 2.2 Prioritized voting algorithm |
| The two recognizers are used together to form a combined recognizer:
both are given the same digit image for classification, and the
combined recognizer derives the final output from their two results.
When both recognizers return the same digit, the combined recognizer
takes that digit as the recognized digit, and the confidence level is
the sum of the confidence levels given by the two recognizers. When
the recognized digits differ, the final output favors the API
recognizer, as follows: |
| Confidence level from API recognizer (CA) | Confidence level from neural recognizer (CN) | Recognizer chosen for the final recognized digit | Final confidence level | Confirmation required |
| 1.0 | ≤ 0.5 | API recognizer | ≥ 0.5 (CA-CN) | No |
| 1.0 | > 0.5 | API recognizer | < 0.5 (CA-CN) | Yes |
| ≤ 0.5 | ≤ 0.5 | API recognizer | < 0.5 (CA-CN) | Yes |
| ≤ 0.5 | > 0.5 | Neural recognizer | 0.5 | Yes |
|
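The voting rules above can be read as a small decision function. The sketch below is an illustration rather than the system's actual code. The names `Vote` and `combine` are hypothetical, an API confidence above 0.5 is taken to mean "confident", and when both recognizers agree it assumes that confirmation is needed only if the summed confidence is below 0.5 (the paper does not state this case explicitly).

```python
from typing import NamedTuple

THRESHOLD = 0.5  # confidence level separating accepted from rejected results


class Vote(NamedTuple):
    digit: int
    confidence: float
    needs_confirmation: bool


def combine(api_digit, api_conf, nn_digit, nn_conf):
    """Combine the API and neural recognizer results for one digit image."""
    if api_digit == nn_digit:
        # Agreement: the confidence levels are summed.
        conf = api_conf + nn_conf
        # Assumption: the usual 0.5 threshold decides on confirmation here.
        return Vote(api_digit, conf, conf < THRESHOLD)

    api_sure = api_conf > THRESHOLD
    nn_sure = nn_conf > THRESHOLD
    if api_sure and not nn_sure:
        # Row 1: API confident, neural not; accept the API digit.
        return Vote(api_digit, api_conf - nn_conf, False)
    if not api_sure and nn_sure:
        # Row 4: only the neural recognizer is confident.
        return Vote(nn_digit, 0.5, True)
    # Rows 2 and 3: favor the API digit but ask for confirmation.
    return Vote(api_digit, api_conf - nn_conf, True)
```

For example, `combine(7, 1.0, 1, 0.25)` keeps the API digit 7 without confirmation, while `combine(7, 0.0, 1, 0.75)` returns the neural digit 1 and flags it for manual confirmation.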
| 2.3 Batch mode processing |
| After the questionnaires are scanned, the recognition process runs
in batch mode and the recognition results are stored in a database for
subsequent processing. This allows the OCR system to run 24 hours a
day in unattended mode. |
| When an operator logs in to the system, the confirmation process
starts and the pending rejected characters are presented automatically
to the operator for manual confirmation. |
| 2.4 Character mode recognition |
| The system uses character mode recognition, in which it recognizes
the questionnaire character by character at predefined coordinates. |
| 2.5 Automatic shift of image window |
| The system can shift the image window slightly so that it encloses
the whole of a single character. This function was added because the
scanned questionnaire images are usually distorted to some degree. The
distortion may be due to the scanning angle, minor malfunctions of the
scanner, environmental factors that change the physical size of the
forms, or dirty spots on the forms. |
| When the images are distorted, the OCR system may have problems
locating the correct positions of the characters from the predefined
coordinates. In this situation, the window shifting function moves the
window around the original position within a certain margin, trying to
enclose the whole character without cutting across any part of it.
This function has proved very useful in improving recognition accuracy. |
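One way to picture the window shifting is the sketch below. It is a guess at a workable approach, not the system's actual method. The image is assumed to be a list of rows of 0 (white) and 1 (black) pixels, the function names are hypothetical, and a window is treated as acceptable when none of its border pixels is black, so that no stroke is cut by the window edge.

```python
def border_is_clear(image, top, left, height, width):
    """True if no black pixel lies on the border of the window."""
    bottom, right = top + height - 1, left + width - 1
    for x in range(left, right + 1):
        if image[top][x] or image[bottom][x]:
            return False
    for y in range(top, bottom + 1):
        if image[y][left] or image[y][right]:
            return False
    return True


def shift_window(image, top, left, height, width, margin):
    """Return the (top, left) of the nearest window with a clear border.

    Offsets of up to `margin` pixels in each direction are tried, closest
    first. If no candidate qualifies, the original position is returned.
    """
    rows, cols = len(image), len(image[0])
    offsets = sorted(
        ((dy, dx)
         for dy in range(-margin, margin + 1)
         for dx in range(-margin, margin + 1)),
        key=lambda o: (abs(o[0]) + abs(o[1]), o),
    )
    for dy, dx in offsets:
        t, l = top + dy, left + dx
        if t < 0 or l < 0 or t + height > rows or l + width > cols:
            continue  # candidate window falls outside the image
        if border_is_clear(image, t, l, height, width):
            return t, l
    return top, left
```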
| 2.6 Automatic spots removal |
| Dirty spots on the scanned image confuse the recognizers. Our OCR
system employs a simple function to remove them: it clears a black
pixel in the target image if all eight of its surrounding pixels are
white. Though this algorithm is simple, it is quite helpful to the
recognition process. |
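The rule described above is simple enough to sketch in a few lines. In this illustration the image is assumed to be a list of rows of 0 (white) and 1 (black) pixels, and pixels outside the image are assumed to count as white. The function name is hypothetical.

```python
def remove_isolated_spots(image):
    """Return a copy of the image with isolated black pixels cleared.

    A black pixel is cleared when all of its (up to eight) neighbours are
    white. Neighbours are read from the original image, so the result does
    not depend on the order in which pixels are visited.
    """
    rows, cols = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for y in range(rows):
        for x in range(cols):
            if image[y][x] != 1:
                continue
            neighbours = [
                image[ny][nx]
                for ny in (y - 1, y, y + 1)
                for nx in (x - 1, x, x + 1)
                if (ny, nx) != (y, x) and 0 <= ny < rows and 0 <= nx < cols
            ]
            if all(value == 0 for value in neighbours):
                cleaned[y][x] = 0
    return cleaned
```

Note that a speck two or more pixels wide survives this filter, which is why the rule only addresses single-pixel noise.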
| 2.7 Recognition result |
| The following test results were obtained from a sample of 150,000
digit images. |
| Digit (%) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | All |
| Recognition rate | 94.83 | 96.83 | 94.92 | 91.11 | 96.00 | 94.95 | 97.29 | 97.72 | 90.43 | 81.74 | 95.64 |
| Reject rate | 5.17 | 3.17 | 5.08 | 8.89 | 4.00 | 5.05 | 2.71 | 2.28 | 9.57 | 18.26 | 4.36 |
| Accuracy rate | 99.38 | 99.89 | 99.78 | 99.73 | 99.89 | 99.41 | 99.79 | 99.59 | 99.12 | 100.00 | 99.72 |
| Error rate | 0.62 | 0.11 | 0.28 | 0.27 | 0.11 | 0.59 | 0.21 | 0.41 | 0.88 | 0.00 | 0.28 |
|
| In the above table: |
- Recognition rate is the percentage of digits recognized with a
confidence level of 0.5 or above. The recognized digits are not
necessarily correct;
- Reject rate is the percentage of digits recognized with a confidence
level below 0.5. The system may have recognized the digit correctly,
but because of the low confidence level the result needs to be
confirmed manually;
- Accuracy rate is the percentage of correct results among the digits
recognized with a confidence level of 0.5 or above;
- Error rate is the percentage of incorrect results among the digits
recognized with a confidence level of 0.5 or above.
|
| The overall recognition rate of the OCR system is 95.64% and the
accuracy rate is 99.72%. Since the error rate is calculated on the
recognized data only, the actual error rate over all the data is about
0.27% (0.28% × 95.64%). Furthermore, all the data will go through a
strict validation process, so the final error rate should be even
lower. |
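The four rates defined above, and the error rate over all the data, can be computed from a list of test results as in the sketch below. The record layout `(confidence, recognized_digit, true_digit)` and the function name are assumptions made for illustration.

```python
def rates(results, threshold=0.5):
    """Compute the rates (in percent) from test results.

    Each result is a tuple (confidence, recognized_digit, true_digit).
    Assumes at least one result reaches the confidence threshold.
    """
    total = len(results)
    recognized = [(got, want) for conf, got, want in results if conf >= threshold]
    correct = sum(1 for got, want in recognized if got == want)

    recognition = 100.0 * len(recognized) / total
    accuracy = 100.0 * correct / len(recognized)
    error = 100.0 - accuracy
    return {
        "recognition": recognition,
        "reject": 100.0 - recognition,
        "accuracy": accuracy,
        "error": error,
        # Error rate expressed over all digits, not only the recognized ones.
        "error_on_all": error * recognition / 100.0,
    }
```

With ten test digits of which eight are recognized and one of those eight is wrong, the function returns a recognition rate of 80%, an error rate of 12.5% and an error rate over all the data of 10%.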
|
| 3. The Use of OCR in the Data Capturing Workflow |
| With the goal of improving the overall quality of census data, a
revised data capturing workflow has been designed for the 2001
Population Census of Macao. Its design is based on the flow of
batches; each batch contains about 40 forms. After the batches are
scanned, the system keeps track of the status of each batch and
processes it accordingly. |
|
| 3.1 Scanning |
| Completed questionnaires are sent from the census branch offices to
the head office in batches. Each batch contains 40-50 questionnaires.
A unique batch ID is assigned to each batch by attaching a preprinted
barcode label. |
| When the batches arrive at the head office, the scanning operator
first identifies each batch by reading its barcode, and the forms are
then fed into the document scanners. After scanning, the total number
of forms scanned is displayed. If it matches the manually counted
total written on the label, the operator accepts the batch; otherwise,
the batch has to be rescanned or the total has to be amended. |
| When the scanning is completed,
the system will mark the batch with a scanning
completed status. |
| 3.2 Recognition |
| The recognition program will
automatically find all the batches with a scanning
completed status and start the recognition process.
Since several document formats exist in the system,
the recognition program will use different form
definitions according to a predefined batch numbering
scheme. |
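The batch pick-up step can be sketched as follows. The numbering ranges, form names and status string are invented for illustration only, since the paper does not give the actual batch numbering scheme.

```python
# Hypothetical numbering scheme: batch number range -> form definition.
FORM_DEFINITIONS = [
    (range(1, 6000), "short_form"),
    (range(6000, 8000), "long_form"),
    (range(9000, 10000), "correction_form"),
]


def form_definition_for(batch_no):
    """Pick the form definition from the batch number."""
    for numbers, form_name in FORM_DEFINITIONS:
        if batch_no in numbers:
            return form_name
    raise ValueError(f"no form definition for batch {batch_no}")


def batches_to_recognize(batches):
    """Select scanned batches and pair each with its form definition."""
    return [
        (batch["batch_no"], form_definition_for(batch["batch_no"]))
        for batch in batches
        if batch["status"] == "scanning_completed"
    ]
```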
| The recognition results and confidence levels for all the OCR and
OMR-like (mark-sense) answer boxes of the batch are stored in a
Microsoft Access mdb file. |
| 3.3 Confirmation |
| When rejected characters exist in a batch, the confirmation process
automatically opens the batch, jumps to the pages containing them and
highlights the rejected characters. The operator can examine the image
of the original document by zooming in and out to determine the
correct answer. If the recognized answer is correct, the operator
accepts it by pressing the <enter> key; otherwise, he/she keys in the
correct answer. |
| As the documents are recognized character by character, confirmation
is also done on a character basis. For example, if the last digit of
the field year of birth is rejected, the operator only needs to
confirm the last digit instead of the whole field. |
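Character-based confirmation can be illustrated with the sketch below. The data layout (a list of `(character, confidence)` pairs per field) and the function names are assumptions. An operator answer missing from the dictionary stands for pressing the <enter> key to accept the recognized character.

```python
def rejected_positions(chars, threshold=0.5):
    """Positions in a field whose confidence is below the threshold."""
    return [i for i, (_, conf) in enumerate(chars) if conf < threshold]


def confirm_field(chars, operator_answers, threshold=0.5):
    """Rebuild a field, asking the operator only about rejected characters.

    `operator_answers` maps a position to the character keyed in. A
    position that is absent means the recognized character was accepted.
    """
    result = []
    for i, (char, conf) in enumerate(chars):
        if conf < threshold:
            result.append(operator_answers.get(i, char))
        else:
            result.append(char)
    return "".join(result)
```

For a year of birth recognized as 1, 9, 6, 3 with only the last digit rejected, the operator is prompted for position 3 alone, and keying in 5 yields the field value 1965.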
| 3.4 Transformation |
| After confirmation, the batches are transformed into formal
questionnaire data and stored in the back-end database. The
transformed data are organized by questionnaire, with a hierarchy of
households and individuals. |
| 3.5 Validation |
| Upon completion of the transformation, the batches enter the
validation status. The validation program is designed to examine the
questionnaire data. It performs range checks on the data items,
logical validation of related questions for a single person, and
cross-person validation. |
| The errors found are written to the database, identified by
questionnaire ID, household number, question number, member number and
error code. These error records are used to generate correction forms
for the subsequent error correction process. |
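The three kinds of check, and the error records they produce, can be sketched as below. The specific rules (age limits, marital status of young children, one head per household), the field names and the error codes are hypothetical examples and not the census validation rules. Only the layout of the error record follows the description above.

```python
def validate(questionnaire):
    """Return a list of error records for one questionnaire."""
    errors = []

    def report(member_no, question_no, error_code):
        errors.append({
            "questionnaire_id": questionnaire["questionnaire_id"],
            "household_no": questionnaire["household_no"],
            "question_no": question_no,
            "member_no": member_no,
            "error_code": error_code,
        })

    members = questionnaire["members"]
    for member in members:
        # Range check on a single data item (hypothetical limits).
        if not 0 <= member["age"] <= 120:
            report(member["member_no"], "age", "RANGE")
        # Logical check between related questions for one person.
        if member["age"] < 10 and member["marital_status"] == "married":
            report(member["member_no"], "marital_status", "LOGIC")

    # Cross-person check: exactly one household head (member 0 = household).
    heads = [m for m in members if m["relationship"] == "head"]
    if len(heads) != 1:
        report(0, "relationship", "CROSS")
    return errors
```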
| 3.6 Correction |
| The correction process involves three steps. The first step is the
generation of the correction form (appendix 1). A correction form is
the printout of the validation error(s) for a particular
questionnaire. It shows the details of the errors along with the
original questionnaire image. |
| By reading the error description and the condensed image printed on
the same form, the enumerators should be able to identify the causes
of the errors and, if they know them, fill in the correct answers.
This manual correction is the second step of the correction process. |
| The final step is the processing of the correction forms. Once the
correction forms are filled in with the corrected answers, they are
put into batches in the same way as normal questionnaires. Special
batch numbers are reserved for batches of correction forms. The
subsequent processes of scanning, recognition and confirmation are
conducted in a similar way. Finally, the correction data captured
from the correction forms are used to update the questionnaire master
data. |
| When the questionnaires have been updated, the system automatically
validates them again. If a questionnaire still has errors, they are
used to generate a new correction form, and the correction cycle is
repeated until all errors are cleared. |
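The repeated validate-and-correct cycle can be summarised in a short sketch. The function and parameter names are hypothetical, and the `max_rounds` limit is an added safety assumption that the paper does not mention.

```python
def correction_cycle(record, validate, get_corrections, max_rounds=10):
    """Repeat validation and correction until no errors remain.

    `validate(record)` returns a list of errors. `get_corrections(record,
    errors)` stands for the whole correction-form round trip and returns
    a dict of corrected values. Returns the remaining errors and the
    number of correction rounds performed.
    """
    rounds = 0
    errors = validate(record)
    while errors and rounds < max_rounds:
        record.update(get_corrections(record, errors))
        errors = validate(record)
        rounds += 1
    return errors, rounds
```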
|
| 4. Benefit of the OCR System |
| 4.1 Improve the quality of census data |
| In the last census, conducted in 1991, one of the major problems was
the correction of errors. No imaging technology was employed in the
1991 census, and data correction was done manually in the following
steps: |
- read the errors from a correction list which stated what errors
were found on which questionnaires;
- retrieve the original questionnaires from the document archive and
identify the causes of the errors;
- prepare a plain correction form by filling in the correction key and
the new values;
- have the correction forms keyed in by a data entry operator;
- re-run the validation program.
|
| The data correction procedure described above was laborious and
inefficient. The users spent a lot of time retrieving the desired
questionnaires from the hundreds of thousands of questionnaires in the
document archive and filling in the correction form. In most cases,
they had no idea what the correct answers were and had difficulty
clarifying the answers with the households, because even the household
members did not remember how they had answered the questions some
months earlier. |
| The new data capturing workflow tries to solve this problem with the
help of OCR technology. In the new scenario, correction forms are sent
back to the original enumerator of the questionnaire within two days.
The enumerators should be able to correct the answers. If they need to
clarify a problem, they can telephone the household or, in the worst
case, revisit it. |
| The correction forms are printed with all the information necessary
for correction, so no questionnaire retrieval is required. Users only
need to fill in the correct values in the correction boxes, which
saves a lot of time. |
| 4.2 Fast data capture |
| With the help of OCR technology, it is expected that the raw data
from all the collected questionnaires could be captured within one
week after the census period, and that data validation and correction
could be completed within one month. Compared with an estimated six
months of manual data entry, five months could be saved. |
| 4.3 Cost saving |
| Compared with the cost of manual data entry, it is estimated that a
fifty percent cost saving could be achieved by employing the OCR
system in the 2001 population census. The investment in hardware and
software can also be reused for future projects. |
|
| 5. Drawback |
| 5.1 Not environmentally friendly |
| According to the experience of the census pilot project, fifteen
percent of the questionnaires could not pass validation, so a huge
volume of correction forms will be produced. |
| To tackle this problem, the validation rules should be carefully
examined so that they flag major errors, while minor errors are
corrected automatically with default values. |
| 5.2 Difficult to manage |
| The use of correction forms adds workload for both the fieldwork and
the head office. Sufficient manpower therefore has to be reserved to
handle it. |
| To lessen the impact of data correction, enumerators need additional
training so that they know well how to use the correction forms. |
| 5.3 Problem of confidentiality |
| Because the correction forms are printed with the questionnaire
image, their distribution must be strictly controlled. In our system,
a simple tracking procedure on the return of the correction forms is
employed. It ensures that each distributed correction form is
returned: if any correction form is outstanding, a reminder message is
generated on the enumerator's daily work report. |
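A tracking procedure of this kind can be sketched as follows. The form and enumerator identifiers, the function names and the wording of the reminder are made up for illustration.

```python
def outstanding_forms(distributed, returned):
    """Group the correction forms not yet returned by enumerator.

    `distributed` maps a correction form ID to the enumerator it was sent
    to. `returned` is the set of form IDs that have come back.
    """
    pending = {}
    for form_id, enumerator in distributed.items():
        if form_id not in returned:
            pending.setdefault(enumerator, []).append(form_id)
    return pending


def reminder_messages(pending):
    """Build one reminder line per enumerator for the daily work report."""
    return [
        f"{enumerator}: {len(forms)} correction form(s) outstanding "
        f"({', '.join(sorted(forms))})"
        for enumerator, forms in sorted(pending.items())
    ]
```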
|
| 6. Conclusion |
| This is the first time that OCR technology has been used in the data
capturing process for a population census in Macao. Though some
technical problems were encountered, the results of the census pilot
project have shown that the use of OCR technology in population census
data capturing is very promising, and we believe that its use can be
extended to other future projects as well. |
|