|The Third Meeting of the
Working Party on the Application of New Technology
to Population Data
|Bali, 7-9 January 1999
ECONOMIC AND SOCIAL COMMISSION FOR ASIA AND THE PACIFIC
7 January 1999
|The opportunities for
improving work flows by using imaging technology
in survey processing
|David Archer and Sarah
Statistics New Zealand
In 1996 Statistics New Zealand
used imaging technology to capture, code and edit
data from the Census of Population and Dwellings
questionnaires and to capture data from the
field books used by census enumerators. Approximately
1.3 million dwelling and 3.6 million personal
double sided A3 questionnaires were scanned
and processed together with 5,500 census enumerator
field books. Intelligent character recognition
was used to interpret ticks and handwritten
numeric answers. Recognition of alphabetic
handwritten answers was not used because its
reliability was not considered high enough.
Alphabetic recognition is being investigated
for the 2001 Census.
Once the questionnaires were scanned and recognised,
the images and interpreted data were used for
coding and editing. Instead of referring to
responses from the questionnaires, images of
the responses were displayed on screen. From
this the coder key entered any unrecognised
responses, and completed all coding and editing.
This eliminated the need to pass paper questionnaires
around the processing office and to store and retrieve
them for long periods during quality assurance.
The Census was held on 5 March 1996. Processing
started on 25 March and the scanning phase to
capture the data was completed on 21 June. The
completion of all unit record processing occurred
on 30 November 1996, with final data released
by 28 February 1997. In comparison, the 1991
Census was held on 5 March and provisional data
was released at the end of April 1991. Final
data was released in February 1992. Although
final data was released at the same point in
each cycle (February of the following year),
it should be noted that the number of questions
asked differed significantly. In 1991 a total
of 45 questions were asked. In 1996 a total of
75 questions were asked.
The term imaging can be used in several contexts.
These include scanning, retrieval, recognition,
repair and making the resulting images available
for further processing. The bulk of imaging
applications comprise only scanning and retrieval
functions, while more recent applications have
added recognition and repair. These are the
traditional ways of handling electronic data
but work on the assumption that once the repair
of images has taken place the imaging process
is complete. The New Zealand population census
extended this further to use the images for
code and edit work.
- Scanning involves
taking an electronic image of the
document, that is, converting the
dark and light areas of a document
into a map of 0's and 1's. Recognition
takes pre-defined areas of this map
and converts them into meaningful
data, for example, into ASCII format.
- Optical Character
Recognition (OCR) of data carefully
transcribed onto special questionnaires
has been in use in statistical offices
since the 1970s. Optical Mark Recognition
(OMR), which involves recognition of
small marked areas, is still one of
the more common methods of
data capture. More recently, the term
Intelligent Character Recognition
(ICR) has been used to describe the
process of interpreting image data,
in particular, alphanumeric text.
- Repair is used
to describe both subsequent key entry
of unrecognised characters, and sometimes
the editing of interpreted responses.
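
The scanning and recognition steps described above can be
illustrated with a minimal sketch. It is not the software used
for the census: the threshold, the box coordinates and the
function names `scan_to_bitmap` and `is_ticked` are assumptions
made purely for illustration. The sketch turns grey levels into
a map of 0's and 1's and then treats a pre-defined area as a
ticked box when enough of its pixels are dark.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # 1 = dark pixel, 0 = light pixel


def scan_to_bitmap(grey: List[List[int]], threshold: int = 128) -> Bitmap:
    """Convert grey levels (0 = black, 255 = white) into a map of 0's and 1's."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in grey]


def is_ticked(bitmap: Bitmap, box: Tuple[int, int, int, int],
              min_dark: float = 0.2) -> bool:
    """Treat a pre-defined area as ticked when enough of its pixels are dark.

    box is (top, left, bottom, right), with bottom and right exclusive.
    The 0.2 cut-off is an assumed value, not one taken from the census system.
    """
    top, left, bottom, right = box
    cells = [bitmap[r][c] for r in range(top, bottom) for c in range(left, right)]
    return sum(cells) / len(cells) >= min_dark


# Hypothetical page fragment: a diagonal pen stroke crosses the left box only.
page = [
    [255, 255, 255, 255, 255, 255],
    [30, 255, 255, 255, 255, 255],
    [255, 40, 255, 255, 255, 255],
    [255, 255, 20, 255, 255, 255],
]
bitmap = scan_to_bitmap(page)
left_box = (1, 0, 4, 3)   # rows 1-3, columns 0-2
right_box = (1, 3, 4, 6)  # rows 1-3, columns 3-5
print(is_ticked(bitmap, left_box), is_ticked(bitmap, right_box))
```

A response that falls close to the cut-off is the kind of case
that would be passed to an operator for repair rather than
accepted automatically.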
The process of scanning,
recognition and repair for the census was handled
as one integrated function. Source documents
did not leave the work station until the repair
process had taken place. Standard imaging technology
identifies many of the incorrectly scanned images,
which means the paper documents can readily
be rescanned. Once this process was completed
the paper questionnaires were sent to storage
and eventual destruction. From this stage on
we relied entirely on images of the questionnaires
to carry out the coding of responses from alphabetic
answers and the more complex statistical editing
commonly used in statistical surveys.
This paper concentrates on the processes after
the initial recognition and repair process which
enabled us to rely on the image rather than
the paper documents.
Once the initial recognition and repair took
place the image of each questionnaire was divided
into blocks for each of the questions asked.
This meant that effectively we were not managing
individual questionnaires but answers to questions
which were linked to the parent questionnaire.
So while we had 3.6 million personal questionnaires,
each with 75 questions, we were managing not
3.6 million images but 270 million images. While
this sounds daunting, the links to the parent
questionnaire were integrated, yet flexible
enough to deliver the considerable benefits
of such an approach.
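
The idea of managing answers that stay linked to their parent
questionnaire can be sketched as a simple data structure. The
record layout, the identifier format and the names
`QuestionBlock` and `split_into_blocks` are hypothetical; only
the figures of 3.6 million questionnaires and 75 questions come
from the text above.

```python
from dataclasses import dataclass
from typing import List

QUESTIONS_PER_PERSONAL_FORM = 75   # 1996 Census, from the text
PERSONAL_FORMS = 3_600_000         # approximate count, from the text


@dataclass(frozen=True)
class QuestionBlock:
    questionnaire_id: str  # link back to the parent questionnaire
    question_no: int
    image_ref: str         # hypothetical reference to the stored image block


def split_into_blocks(questionnaire_id: str, n_questions: int) -> List[QuestionBlock]:
    """Create one image block per question, each keeping its parent link."""
    return [
        QuestionBlock(questionnaire_id, q, f"{questionnaire_id}/q{q:02d}")
        for q in range(1, n_questions + 1)
    ]


blocks = split_into_blocks("P0000001", QUESTIONS_PER_PERSONAL_FORM)
total_blocks = PERSONAL_FORMS * QUESTIONS_PER_PERSONAL_FORM
print(len(blocks), total_blocks)  # 75 270000000
```

Because every block carries its parent identifier, the blocks
for one question can be routed to a coder on their own while the
full questionnaire can still be reassembled when needed.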
Impact of imaging on the management of questionnaires
The use of imaging means
that resources do not need to be put into manually
sorting and matching individual questionnaires,
household questionnaires and field books (these
contain information on the type of dwelling
and on the number of questionnaires delivered
to and collected from dwellings) as they are
received into the office. Processing does not
need to take place in any particular geographic
order. In addition, there is no need to develop
systems for the storing and retrieval of the questionnaires.
Imaging requires a prior clerical checking or
preparation phase. Staff unfold questionnaires
and check their suitability for scanning, for
example, no staples or paper clips. Unsuitable
questionnaires are transcribed onto new questionnaires.
Four types of batches were prepared for scanning.
These were field books, questionnaires sorted
by area, households not sorted by area, and individual
questionnaires not sorted by area. After scanning,
questionnaires were bundled together with their
batch header and sent to storage.
Bilingual questionnaires which had been completed
in Maori were translated onto English questionnaires
prior to scanning and recognition. Statistics
New Zealand received about 20,000 of these questionnaires,
fewer than expected. The manual transcription
of questionnaires avoided the need to design
more than one imaging system and was considered
to be a more efficient process.
The pages of the field books were removed from
their bindings and the binding edge cut smoothly.
A field book batch header was prepared for each
sub-district (Enumerator workload) and preceded
each batch of field book pages during scanning.
Batch headers contained information on the type
of batch being processed and also gave
enough information to locate the paper questionnaire
if it was needed later on.
Most questionnaires came into the office batched
by sub-district (Enumerator area). However,
a number of questionnaires were
sent directly back to the office in sealed envelopes
rather than being collected by enumerators, because
individuals requested privacy for their questionnaires.
These were managed through the use of special
batches and attached to incomplete households
during the processing.
Impact of imaging on the processing of questionnaires
Following scanning all scanned
batches were processed through the image recognition
software. Each dwelling questionnaire had a
bar code on the front page identifying it
as a dwelling questionnaire. This allowed the
system to record each dwelling questionnaire,
and the associated individual questionnaires
that followed it during scanning, as
a batch. At this stage of processing numeric
repair was carried out only on the field books
and the identification numbers on the questionnaires.
A number of checks were required to ensure that
any discrepancies were corrected. The household
questionnaire and field book information were
then matched, with automated checks
to ensure that the number of questionnaires
received and scanned matched the number recorded
in the field books. As households were balanced,
the images of the matched questionnaires and
field book entries were written to CD for later
retrieval in the Code and Edit phase.
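
The balancing check between field books and scanned
questionnaires can be sketched as a simple count comparison.
This is an illustration under assumed data layouts, not the
census system itself: the dwelling identifiers and the function
name `balance_households` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def balance_households(
    field_book: Dict[str, int],
    scanned: Iterable[Tuple[str, str]],
) -> Tuple[List[str], Dict[str, Tuple[int, int]]]:
    """Compare field book counts with the questionnaires actually scanned.

    field_book maps a dwelling id to the number of questionnaires the
    enumerator recorded as collected; scanned holds
    (dwelling id, questionnaire id) pairs produced by the scanning phase.
    Returns the balanced dwellings and, for the rest, (expected, found).
    """
    found_counts = Counter(dwelling for dwelling, _ in scanned)
    balanced: List[str] = []
    discrepancies: Dict[str, Tuple[int, int]] = {}
    for dwelling in sorted(set(field_book) | set(found_counts)):
        expected = field_book.get(dwelling, 0)
        found = found_counts.get(dwelling, 0)
        if expected == found:
            balanced.append(dwelling)
        else:
            discrepancies[dwelling] = (expected, found)
    return balanced, discrepancies


field_book = {"D1": 3, "D2": 2}
scanned = [("D1", "a"), ("D1", "b"), ("D1", "c"), ("D2", "d")]
balanced, discrepancies = balance_households(field_book, scanned)
print(balanced, discrepancies)  # ['D1'] {'D2': (2, 1)}
```

Only the balanced dwellings would move on to the next phase; the
discrepancies would be held back until they were resolved.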
A sample of balanced households was selected
for quality assurance checks. These were checks
on the quality of the recognition process to
ensure that overall recognition rates were being
maintained. The sample size could be changed
regularly, as could the selection of questions
to be checked (tick boxes and/or numeric responses).
Quality Assurance operators key entered the
responses they saw on the image and their entry
was compared with the recognised entry. If the two
differed, the operators confirmed the correct entry.
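
The quality assurance comparison can be sketched as follows. The
sampling fraction, the seed and the function names
`select_sample` and `compare_entries` are assumptions for
illustration only; the sketch draws a sample of households and
then measures how often the recognised value agrees with the
value keyed by the Quality Assurance operator.

```python
import random
from typing import Dict, List, Tuple


def select_sample(household_ids: List[str], fraction: float, seed: int = 0) -> List[str]:
    """Pick a QA sample; the fraction can be changed from one run to the next."""
    size = max(1, round(len(household_ids) * fraction))
    return random.Random(seed).sample(household_ids, size)


def compare_entries(recognised: Dict[str, str],
                    keyed: Dict[str, str]) -> Tuple[float, List[str]]:
    """Return the agreement rate and the questions where the entries differ."""
    mismatches = sorted(q for q in keyed if recognised.get(q) != keyed[q])
    rate = 1 - len(mismatches) / len(keyed)
    return rate, mismatches


households = [f"H{i}" for i in range(10)]
sample = select_sample(households, fraction=0.2)

# Hypothetical responses for one sampled household.
recognised = {"q1": "1", "q2": "84", "q3": "7", "q4": "2"}
keyed = {"q1": "1", "q2": "34", "q3": "7", "q4": "2"}
rate, mismatches = compare_entries(recognised, keyed)
print(len(sample), rate, mismatches)  # 2 0.75 ['q2']
```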
The images were used for subsequent coding and
editing. Instead of referring to responses from
the paper questionnaires, images of the responses
were displayed on screen. From this the coder
completed any key entry of unrecognised numeric
responses, keyed in the alpha responses and completed
any normal coding and editing. This eliminated
the need to pass questionnaires around the processing
office, and reduced the time and cost
of handling, storing and managing questionnaires.
For alpha responses, the operator key entered
the response from the displayed image on an
input screen. The coding library then appeared
on the operator's screen with matched descriptions,
from which the operator either selected the
correct one or entered a different description,
and the associated output codes were assigned.
For difficult to match entries the operator
could forward the household batch electronically
to a specialist team for action and subsequent
return to the operator.
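
The coding library lookup can be sketched with approximate text
matching from the Python standard library. The library entries
and output codes below are invented for illustration, and
`difflib` is a stand-in rather than the matching method used in
the census system.

```python
import difflib
from typing import Dict, List, Tuple

# Hypothetical coding library: description -> output code.
CODING_LIBRARY: Dict[str, str] = {
    "primary school teacher": "T01",
    "secondary school teacher": "T02",
    "registered nurse": "N01",
    "dairy farmer": "F01",
}


def suggest_codes(response: str, library: Dict[str, str],
                  limit: int = 3) -> List[Tuple[str, str]]:
    """Return up to `limit` library descriptions resembling the keyed response."""
    key = " ".join(response.lower().split())
    matches = difflib.get_close_matches(key, list(library), n=limit, cutoff=0.6)
    return [(description, library[description]) for description in matches]


suggestions = suggest_codes("Dary farmer", CODING_LIBRARY)
unmatched = suggest_codes("xyzzy", CODING_LIBRARY)

# An empty suggestion list is the kind of case an operator might
# forward to the specialist team.
needs_specialist = not unmatched
print(suggestions[0], needs_specialist)
```

The operator remains in control in this sketch: the function
only proposes descriptions, and the final choice of code is left
to the person looking at the image.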
To enable more efficient processing a specialist
section was established to process complicated
households for family coding (about 5% of households).
A Quality Control group was also set up to monitor
the quality of the operators' coding decisions.
|Processing of the Post Enumeration Survey
Statistics New Zealand ran
its first Post Enumeration Survey (PES) after
the 1996 Census. There were 12,000 households
and 35,000 individuals in the survey. Processing
consisted of matching the PES dwelling address
with an address in the Census and matching each
person in the PES household to a person in the
Census. This included searching for the Census
questionnaires of those people who were away
from their usual address either when included
in the PES or when they completed their Census
questionnaires, and searching for multiple Census
Census images were used to match against the
paper PES questionnaire. The PES questionnaire
itself was not designed for scanning.
Using the addresses given on the PES questionnaires
and the images of the Census field books, the
PES processing team created a file which listed
the Census questionnaires of those dwellings
and people that they needed to view for matching
and searching. The images of these questionnaires
were downloaded from on-line storage and CD,
both during and after the Census capture phase.
Once the images were downloaded, the PES team
identified those PES dwellings and respondents
that were counted in the Census, including those
that were counted more than once.
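
The three possible outcomes of PES matching (not found, counted
once, counted more than once) can be illustrated with a small
sketch. The matching itself was carried out by the PES team
working from images; the record fields, the matching key of name
plus date of birth, and the function names below are assumptions
made only to show the outcomes.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Dict[str, str]


def match_key(rec: Record) -> Tuple[str, str]:
    """Hypothetical key: normalised name plus date of birth."""
    return (" ".join(rec["name"].lower().split()), rec["birth_date"])


def match_pes_to_census(pes: List[Record], census: List[Record]) -> Dict[str, List[str]]:
    """For each PES person, list the Census questionnaires with the same key."""
    index: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for rec in census:
        index[match_key(rec)].append(rec["questionnaire_id"])
    return {p["pes_id"]: list(index.get(match_key(p), [])) for p in pes}


def classify(matches: List[str]) -> str:
    if not matches:
        return "not found in Census"
    if len(matches) == 1:
        return "counted once"
    return "counted more than once"


census = [
    {"questionnaire_id": "C1", "name": "Ana  Smith", "birth_date": "1970-01-02"},
    {"questionnaire_id": "C2", "name": "ana smith", "birth_date": "1970-01-02"},
    {"questionnaire_id": "C3", "name": "Ben Jones", "birth_date": "1985-06-30"},
]
pes = [
    {"pes_id": "P1", "name": "Ana Smith", "birth_date": "1970-01-02"},
    {"pes_id": "P2", "name": "Ben Jones", "birth_date": "1985-06-30"},
    {"pes_id": "P3", "name": "Cara Lee", "birth_date": "1990-11-15"},
]
results = match_pes_to_census(pes, census)
for pes_id, matches in results.items():
    print(pes_id, matches, classify(matches))
```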
Advantages of using images rather than paper questionnaires
- Imaging provided
a significant step towards a paper-less office:
no carrying of questionnaires to and from
the workstation; clear desks; quicker processing;
no storage of questionnaires near operators
- This is one of the biggest benefits of using
imaging. There were significant savings in
costs and efficiencies by not having the paper
- The scanning and recognition
of questionnaires allowed us to efficiently
manage and plan the rest of the processing
workload. Once the questionnaires were recognised
we knew how much work (repair work, edit failures)
we had to do.
- Reduced long term storage
requirements because questionnaires could
be destroyed after the initial scanning, recognition
and repair.
- Questionnaires were available
on-line and could be displayed within seconds
as opposed to searching for documents on shelves.
- We used macro-editing,
which required some outliers to be reprocessed.
The electronic questionnaires could be found
very quickly and sent back for reprocessing.
This would have been very time consuming if the
questionnaires had to be located in a physical
storage area where they were randomly scattered.
The more often the paper questionnaires are
moved from the storage area the more likely
they will not be put back in the correct place.
- Processing could be in
geographic order or in random order. That
is, we could scan questionnaires as soon as
they arrived and then hold them to be processed
regionally or allow them to be processed in
some other order.
- Electronic questionnaires
can be quickly sent to specialist operators
then back to the original operator if necessary.
- With electronic questionnaires
the same questionnaire can be worked on simultaneously
by two or more persons.
- Electronic questionnaires
are readily available for post census analysis
(easier access to questionnaires)
- Parts of various questionnaires
can be shown on screen at once for inter-record
editing.
- The relevant field book entry
can be viewed on screen in conjunction
with questionnaires. This is helpful for coding and
editing.
- The problem area
needing operator attention can be highlighted
to make it easier for the operator to identify.
- Only the questions relevant
to the coding or editing problem were shown
on screen, although all other questions and
questionnaires for that dwelling were available
to the operator. This was particularly useful
when editing between questionnaires.
- Images of questions
that will not be captured (scanned but not
recognised) can be used to help the coding
process. For example, the tasks and duties
response helps when coding occupation.
- With the enhanced image
facility the electronic questionnaires are
much easier to read than a physical questionnaire
that has been completed in, for example, light
pencil. In addition, images can be magnified
by the operator so that characters not discernible
to the naked eye can be read. This leads to
better data quality.
- It was estimated that
imaging saved up to 2% of the total cost of
the census compared with using paper documents.
- Staffing levels were
reduced. In 1991 there were 70 data entry
operators employed to do capture and 60 coding
operators working one shift per day for 5
days per week. Some overtime was needed near
the end of 1991. Processing was from April
1991 until March 1992. In 1996 there were
65 operators working 2 shifts for 6 days.
Some overtime was required. Processing was
from April 1996 until November 1996.
|Proposed Improvements for
the 2001 Census
- Increased security.
In 1996 there were only 1,700 CDs to keep
secure rather than over 4 million paper questionnaires.
In 2001 it is estimated that due to new CD
technology this could be reduced to around
- Cheaper technology in
2001. This will mean SNZ can buy faster servers
with larger storage capacity for the same
- Less electronic storage
space is needed if only the responses are
kept and the questionnaire itself is dropped
out. It is estimated that only 10% of the
space required for 1996 will be needed in
2001. Only one server will be required to
hold all images.
- Processing could concentrate
on high priority variables. Responses to other
questions will be coded and edited in a second
wave of editing. Decisions on this have yet
to be made and will depend on user requirements.
|Disadvantages of using
images rather than paper questionnaires
- Need bigger and
more expensive computer screens - 21" is ideal
- Larger servers required
but technology costs are continually decreasing
- Occupational overuse syndrome (OOS) increased in 1996
because operators were sitting at the computer
for longer. It is essential to give operators
- More computer memory
is required to speed up movement of images
(again technology is becoming better and cheaper
- As only responses will
be kept in 2001 anything written outside the
imaged areas will be lost when put back into
the template. This would be expected to be
- Can only view five or
six images on the screen at one time. Although
there is the option to display the entire
front or back pages of the questionnaires,
this is not always the answer when several
questions from several questionnaires within
a household need to be viewed.
|Contacts for more information
Manager Business Development
Survey Design Statistician
|Prepared for the ESCAP working
group on new technologies for population surveys.