1. A distributed data capture system is an
approach a census planner may want to choose
when a centralized data capture approach cannot
be used due to the lack of work space for
equipment or of storage space for questionnaires
or due to other administrative or logistic reasons.
2. The proposed prototype for a distributed
data-capture system will be developed around
an OCR/OMR software engine and a medium-speed
scanner. The system will process a single-sheet,
two-sided form and will be capable of handling 8,000
to 12,000 questionnaire forms per site per day.
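As a rough check on what "medium-speed" implies, the daily target can be
converted into a sustained scanning rate. The sketch below assumes an
eight-hour scanning day and a scanner that captures both sides of a sheet
in one pass; neither assumption comes from the proposal, and the function
name required_rate is illustrative only.

```python
def required_rate(sheets_per_day, scanning_hours, sides_per_sheet=2):
    """Return (sheets per minute, images per minute) needed to meet a daily target."""
    minutes = scanning_hours * 60
    sheets_per_minute = sheets_per_day / minutes
    return sheets_per_minute, sheets_per_minute * sides_per_sheet


# Assumed 8-hour scanning day (not stated in the proposal).
for target in (8_000, 12_000):
    sheets, images = required_rate(target, scanning_hours=8)
    print(f"{target} sheets/day: {sheets:.1f} sheets/min, {images:.1f} images/min")
```

Under these assumptions, the upper target works out to 25 sheets (50 images)
per minute of sustained scanning; a shorter working day raises the figure
proportionally.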
3. There are integrated form processing software
packages, such as NCS ACCRA, MS&I's MATRA,
and TiS's AFPS. These integrated packages, while
providing comprehensive functionality and flexibility,
are expensive and overly complex for many users
or are often difficult to optimize for a particular
form or application. Customizing these products,
even where possible, requires significant application
development effort before they are ready to process
the census.
2. Requirements
4. The basic requirements for the prototype
system are given as follows:
Able to process two-sided
legal-size questionnaire sheets which may
or may not have identical sides. One side
of the questionnaire sheet will have information
on four individual persons.
Usable by relatively
inexperienced personnel on a day-to-day basis.
Throughput for a single
installation in the range of 8,000 to 12,000
sheets per day.
All components of the
system localized for the Indonesian language.
Data will consist of
both handprint numerals and mark-sense.
User-friendly verification
module.
A system monitor will
keep track of the progress of each work item (TIFF
image, document, etc.) in the system, and will
collect and record statistics.
3. Work sequence (see Figure 1)
5. The prototype system will be designed
to handle a simple work sequence described below:
(1) Operator starts the system.
(2) Operator loads questionnaire into the
scanner.
(3) Scanner creates an image in the Recognition-ready
Folder.
(4) Image is submitted to the Recognition
program.
(5) Recognition results are sent to the Recognition
Results Analyzer.
Figure 1. The data flow for the prototype system
(6) The Recognition Results Analyzer determines
if human intervention is required, based on
confidence level of the recognition and on
other factors.
(6.1) Human intervention is required:
The image and the recognition
results are loaded for verification.
Upon completion of
verification, the Verify program sends the
results to the Verified Folder.
(6.2) Human intervention is not required:
The Verify program sends
the results to the Verified Folder.
(6.3) The results are written in the specified
format to the file system.
The Output program sends
the verified data to the Output Folder.
6. In the above sequence, after starting the
system, the only required human intervention
is at the Scanner and at the Verify program.
All other processing is executed automatically.
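The work sequence above can be sketched as a short routing function. This
is an illustration only, not the prototype's code: the Sheet class, the
process function, and the needs_verification flag (standing in for the
decision made by the Recognition Results Analyzer in step 6) are
hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Sheet:
    sheet_id: str
    needs_verification: bool  # assumed outcome of the analyzer in step (6)
    history: list = field(default_factory=list)


def process(sheet: Sheet) -> Sheet:
    """Record the stages one scanned sheet passes through, in order."""
    sheet.history.append("Recognition-ready Folder")      # step (3)
    sheet.history.append("Recognition program")           # step (4)
    sheet.history.append("Recognition Results Analyzer")  # steps (5)-(6)
    if sheet.needs_verification:
        sheet.history.append("Verify (operator)")         # step (6.1)
    sheet.history.append("Verified Folder")               # steps (6.1)/(6.2)
    sheet.history.append("Output Folder")                 # step (6.3)
    return sheet


if __name__ == "__main__":
    for s in (Sheet("sheet-1", True), Sheet("sheet-2", False)):
        print(s.sheet_id, " -> ".join(process(s).history))
```

The only branch is the operator step, which matches the point that human
intervention after start-up is limited to the Scanner and the Verify
program.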
4. Design
7. The prototype system will consist of a
single integrated GUI that handles scanning,
recognition, and final output. This GUI will
have a separate thread running for each function
(scanning, recognition, and output).
8. Scanning controls will be as simple as possible,
possibly limited to a "Start/Stop" button on
the main GUI panel. The progress of work through
the system will be indicated by simple numeric
displays. Workflow will be handled internally
between scanning and recognition. File-based
workflow will be used between recognition, the
Recognition Results Analyzer, the Verifier,
and the output function.
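One way to picture the two kinds of hand-off is an in-memory queue between
the scanning and recognition threads, with a results folder as the
file-based hand-off to later stages. The sketch below is written under
those assumptions; the function names, the STOP sentinel, and the
placeholder results text are hypothetical, and no real scanner or
recognition engine is called.

```python
import queue
import tempfile
import threading
from pathlib import Path

STOP = None  # sentinel marking the end of a batch (illustrative convention)


def scan(image_names, handoff):
    """Stand-in for the scanning thread: pushes image names onto the queue."""
    for name in image_names:
        handoff.put(name)
    handoff.put(STOP)


def recognize(handoff, results_dir):
    """Stand-in for the recognition thread: writes one results file per image."""
    while True:
        name = handoff.get()
        if name is STOP:
            break
        # Placeholder text; a real engine would write the recognized fields.
        (results_dir / f"{name}.txt").write_text(f"results for {name}\n")


def run_batch(image_names, results_dir):
    """Run both threads to completion and list the results files produced."""
    handoff = queue.Queue()
    workers = [
        threading.Thread(target=scan, args=(image_names, handoff)),
        threading.Thread(target=recognize, args=(handoff, results_dir)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return sorted(p.name for p in results_dir.iterdir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_batch(["sheet001", "sheet002"], Path(tmp)))
```

Downstream programs would then pick up files from the results folder
rather than share memory with the GUI, which keeps them loosely coupled.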
a. Scanner
9. A TWAIN-compliant SCSI scanner (e.g. Fujitsu
M3099GX) will be used for the prototype. This
would avoid being tied to proprietary driver
systems.
b. Recognition Module
10. For the prototype system, the NCS NestorReader®
OCR engine will be used for recognition
of the images taken from the scanner.
c. Recognition Results Analyzer
11. The Recognition Results Analyzer, a program
in Visual Basic®, will read the recognition
results file, which is in a text format, and
count recognition errors. Depending on a user-defined
acceptable error level, it will determine whether
the recognition results should be subjected
to verification. The error statistics
for each Enumeration District will be displayed
and printed.
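The analyzer's decision can be sketched as follows. The proposal does not
describe the layout of the results file, so this example assumes one
"field=value" pair per line, with "?" marking a character the engine could
not read, and treats the acceptable error level as a maximum count of
flawed fields per sheet. All names here are hypothetical, and Python is
used for illustration although the prototype itself is to be written in
Visual Basic.

```python
from collections import Counter

REJECT_MARK = "?"  # assumed marker for an unreadable character


def count_errors(results_text):
    """Count fields whose value contains at least one reject marker."""
    errors = 0
    for line in results_text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        _field, value = line.split("=", 1)
        if REJECT_MARK in value:
            errors += 1
    return errors


def needs_verification(results_text, acceptable_errors):
    """True when the error count exceeds the user-defined acceptable level."""
    return count_errors(results_text) > acceptable_errors


def errors_by_district(sheets):
    """Total errors per Enumeration District from (district, text) pairs."""
    totals = Counter()
    for district, text in sheets:
        totals[district] += count_errors(text)
    return totals
```

For example, a sheet containing "age=3?" and "relation=??" has two flawed
fields, so it would be routed to verification at an acceptable level of 1
but not at a level of 2.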
d. Verifier program
12. The Verifier program is an application
written in Visual Basic®, which
allows the user to view the original image and
the recognition results simultaneously and to
modify the results to agree with the image if
an error was made in recognition. The Verifier
may operate on a single recognition results
file, on a list of results files, or on a directory
(a Recognition Results Folder) to which
recognition results files are being added.
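The three input modes could be normalized into one list of files before
verification starts. The sketch below is illustrative only: the function
name resolve_inputs and the ".txt" extension are assumptions, and a
directory is read as a one-time snapshot, whereas a folder that is still
receiving files would need to be re-read periodically.

```python
from pathlib import Path


def resolve_inputs(source):
    """Return the results files to verify from a file, a list, or a directory."""
    if isinstance(source, (list, tuple)):
        return [Path(p) for p in source]      # explicit list of results files
    path = Path(source)
    if path.is_dir():
        return sorted(path.glob("*.txt"))     # snapshot of a results folder
    return [path]                             # single results file
```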
13. The Verifier contains a main window and
one or more sub-windows, each displaying
a different aspect of the image and the recognition
results for the current zone.
14. Typically, a sub-window shows a
portion of the image that includes
the zone being verified. Next to the image of
the zone being verified there is a text field
where any corrections will be entered. In the
display of the image, zones that need to be
verified are highlighted in color.
e. Output
15. Output of the prototype will be ASCII
data in the form of a comma-separated values
(CSV) file or a text file, with one record per person.
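A minimal sketch of the one-record-per-person CSV output is given below.
The column names in FIELDS are hypothetical, since the proposal does not
list the questionnaire variables, and the presence of a header row is
likewise an assumption.

```python
import csv
import io

# Hypothetical column layout; the actual census variables are not specified.
FIELDS = ["district", "sheet", "person_no", "sex", "age"]


def write_person_records(persons, stream):
    """Write a header row followed by one CSV row per person."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(persons)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_person_records(
        [{"district": "001", "sheet": 17, "person_no": 1, "sex": 2, "age": 34}],
        buffer,
    )
    print(buffer.getvalue(), end="")
```

Since one side of a sheet carries up to four persons, a single verified
sheet would contribute several rows to the file.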
5. Conclusion
16. The prototype system will be a simple
OMR/OCR and data entry solution that includes
customized functionality required for supporting
a population census application. Every process
in the system will be optimized for the census
application in the design stage, including:
Scanner interface
Network image workflow
Key verification
Output of the resolved,
edited and formatted ASCII data