1/
This paper, prepared by Sihar Lumbantobing, BPS
Statistics Indonesia, has been reproduced as submitted.
It has been issued without formal editing.
1. Introduction
After several pilots using OCR
technology, BPS Statistics Indonesia (BPS) decided
to process the 2000 Population Census using OCR technology.
The decision had to be made carefully, because the experience
with mark reader technology in processing the 1971
Population Census did not give a very good result.
That experience led BPS to change its data capture
method to PC-based data entry for the 1980 and 1990
Population Censuses. Another reason was that
the introduction of PC technology made data
entry more efficient than mark reader
technology.
Today, the advancement
of OCR technology provides a promising option
for census data capture. BPS is considering
this option again because the new OCR technology
tolerates lower-quality questionnaires than mark
reader technology required in 1971. In 1971,
mark reader technology required BPS to print questionnaires
abroad. Now, OCR technology allows the documents
to be printed in the country, with lower-quality
printing. A standard level of quality, of course,
must still be achieved.
Enumeration of every household
member was carried out in June 2000 using
OCR-based documents called SP2000-L2. The
documents are being processed at this
moment, and the final result should be announced
before June 2001. Dealing with such new technology
in all stages of the census, including planning,
data enumeration, data capture, data editing,
and tabulation, is of course a new experience
for BPS. Hopefully, this experience is useful to
other NSOs that intend to utilize such
technology.
2. The Nature of the Indonesian 2000 Population Census
Data enumeration for the Indonesian
2000 Population Census was carried out in June
2000, and July 1 is defined as the census date.
The SP 2000 population census is intended to
provide population statistics down to the smallest
administrative area. The national territory
is divided into Provinces, which
are then divided into Districts or Municipalities.
A District or Municipality is then divided into
Sub Districts, and a Sub District
is in turn divided into villages. The village is the
smallest administrative area, and statistics for
villages can be produced from the result of the
2000 Population Census. For BPS operational purposes,
a village is divided into census blocks.
Before discussing the
OCR technology, it is worthwhile to describe
the documents used in the Population Census.
To collect various data relating to population
and housing, BPS uses the SP2000-L1, SP2000-RBL1,
SP2000-L3, SP2000-L2, and SP2000-KBL2 forms. For
simplicity, the forms will be referred to as L1, RBL1,
L3, L2, and KBL2.
The L1 form collects information
about the household and housing, and the RBL1 form
collects the total number of households and population
in every census block in a Sub District.
The L3 form records information on homeless
persons. Data from these three forms are captured
using the PC-based data entry method in every BPS
district or municipality office.
To collect detailed data on
household members (individual persons), the L2 form
is used. To control the processing of the
L2 forms of a census block, the SP2000-KBL2 form is
used. An example of the L2 form is given in
Attachment 3.
As with other surveys and
censuses, BPS conducted intensive training for
all intended field data collectors and supervisors.
Training for data enumerators of the 2000 Population
Census is very important because they deal
with a new technology, namely the application of
OCR technology. Another reason is that
most data enumerators are outsourced, and
data enumeration is new to them.
A problem with training
arose from budget constraints: although
enumeration was done in June 2000, the training
had to be held long before, in February
or March 2000 (about three months earlier). As
a result, many people trained in the training
period could not participate at enumeration time,
for reasons such as having moved, changed
jobs, or died. New persons were hired to do
the work, but unfortunately they
were not trained. Several district or municipality
offices gave a very quick training to the
replacements, but it was not sufficient.
It is admitted that this situation affects the quality
of the collected data.
For processing of the census,
scanners are distributed to the central office
and to almost all province offices. The
scanner type is the KODAK DS Scanner 3500.
Processing is done in every BPS Province office,
except in Java, where 14 BPS District/Municipality
offices are assigned the task. The estimated number of
documents to be processed and the number of scanners
distributed are shown in Attachment 1.
Three printing companies print
the L2 and KBL2 forms in the country. For security
reasons, the documents therefore carry a
tiny code, so that the company that produced a sheet
can be identified later if a problem occurs.
Throughout the printing period, the printing
company staff perform intensive quality checking
at the printing company. Checking is performed
a second time at the BPS head office, on a sample basis.
When a sample shows low-quality printing, all
documents in the same batch are to be destroyed
entirely. This has to be done because of the
high requirements of OCR technology in terms of
drop color.
3. Preliminary result of the 2000 Population Census
On January 3, 2001, the Director
General of BPS announced the preliminary result
of the Indonesian 2000 Population Census. The
announced Indonesian population
is 203 456 005 persons (see
Attachment 2). This means that over the
period 1990 to 2000 the Indonesian population
growth rate was 1.35%, which is lower than the
growth rate over the period 1980 to 1990.
However, this figure
is only preliminary, and was produced from the
RBL1 and L3 forms. The final figure will
be announced after processing of the L2 forms is completed.
There is a note about the figure.
There are several areas in Indonesia that
are still in conflict, so enumeration in those
areas could not be done. The
announced preliminary number therefore includes
estimated numbers for those areas.
4. SP2000-L2 Processing System
Up until now, L2 form processing
is still in progress. The processing status
of several centers is shown in Attachment 1.
From the table, we can see that one processing
center, Jakarta province, has already concluded
the processing.
As mentioned before, the L2 form
collects data on individual household
members. One L2 form can hold up to 8 household
members across both sides of the form; each
side contains data for four individuals
(household members). A household with
more than 8 members, say 10 members,
requires two L2 forms: the first
form for 8 members, and the second form for the other
two members.
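The form count in this example is a ceiling division by the eight places on one L2 form. A minimal sketch of that arithmetic follows; the function name is hypothetical and is used only for illustration.

```python
import math

MEMBERS_PER_FORM = 8  # four individuals on each of the two sides


def l2_forms_needed(household_size: int) -> int:
    """Return how many L2 forms a household of the given size requires."""
    if household_size < 1:
        raise ValueError("a household has at least one member")
    return math.ceil(household_size / MEMBERS_PER_FORM)


print(l2_forms_needed(10))  # 2: one full form plus one form for two members
```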
For control purposes, a KBL2
form accompanies all L2 forms in the same census
block. On average, one census block contains
about a hundred L2 forms. Processing the L2
and KBL2 forms involves four stages that must
be carried out sequentially: scanning, recognition,
verification, and validation (see
Figure 1). Every stage deals with
different file types. To keep
the files under control, a good
management system is needed. This management
system covers how files are named, which
files are maintained, and which files are deleted.
Figure 2 shows how files
are named in each stage.
In the scanning stage, scanners
capture the content of the questionnaire and produce
image files. Every L2 form and KBL2 form
produces two image files with the extension
TIF (*.TIF), because each side of the document
produces one TIF file. One TIF file is about
24 KB. This means, for example, that a census block
containing 100 L2 forms and one KBL2 form will
produce 202 TIF files, requiring about
4.8 MB of disk space.
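The file count and disk space quoted for one census block can be reproduced with a short calculation. The sketch below assumes decimal megabytes (1 MB = 1000 KB) and uses a hypothetical function name; only the 24 KB file size and the two files per form come from the text.

```python
FILES_PER_FORM = 2   # one TIF file per side of a form
TIF_SIZE_KB = 24     # approximate size of one TIF file, as quoted in the text


def tif_storage(l2_forms: int, kbl2_forms: int = 1) -> tuple[int, float]:
    """Return (number of TIF files, disk space in MB) for one census block."""
    files = (l2_forms + kbl2_forms) * FILES_PER_FORM
    size_mb = files * TIF_SIZE_KB / 1000  # assumption: decimal megabytes
    return files, size_mb


files, size_mb = tif_storage(100)
print(files, round(size_mb, 1))  # 202 4.8
```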
Figure 1.
Figure 2. File naming systems
Notes: pp denotes Province code
kk denotes District/Municipality code
cc denotes Sub District code
Figure 3. Scanner Components
In the recognition stage, the
engine recognizes the images produced in the previous
stage and produces interpreted characters.
The accuracy of interpretation depends on the
capability of the character memory. A better
character memory produces better-interpreted
characters. That is why, for processing population
census data, BPS was provided with a special Indonesian
memory, based on a number of Indonesian handwriting
samples. Recognition produces interpreted
files with the extension ZRF (*.ZRF). Each
interpreted file requires about 73 KB of disk space,
so the files interpreted from 202 TIF images
require about 14.7 MB of disk space.
The third stage is verification
of the interpreted results against the original image
characters. When the computer's confidence is very
high (say 100%), there is no need to verify the
interpreted characters against the original characters.
But when the degree of confidence is low (say 60%),
the interpreted number and the original character
are shown together so that the operator can
determine whether the interpretation is correct.
If it is not, the operator can
enter the correct value. The output of the
verification stage is a file with the extension
TXT (*.TXT).
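The routing decision in the verification stage can be sketched as a simple confidence filter. This is only an illustration: the record layout, the function name, and the threshold value are assumptions, since the paper does not state the cut-off the OCR system actually applies.

```python
from dataclasses import dataclass


@dataclass
class RecognizedField:
    name: str          # field label on the form, e.g. "age" (hypothetical)
    value: str         # characters interpreted by the recognition engine
    confidence: float  # engine confidence between 0.0 and 1.0


def needs_verification(fields, threshold=1.0):
    """Return the fields an operator should compare with the original image.

    The threshold is an assumed parameter, not a documented system setting.
    """
    return [f for f in fields if f.confidence < threshold]


fields = [
    RecognizedField("sex", "1", 1.0),   # fully confident, skipped
    RecognizedField("age", "53", 0.6),  # low confidence, shown to operator
]
for f in needs_verification(fields):
    print(f.name, f.value)
```

Lowering the threshold reduces operator workload but lets more uncertain interpretations pass unchecked.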
In Indonesian Population Census
processing, verification is done for all
households in a census block. It therefore
produces three kinds of files: census block
files, household files, and household member
files. On average, the three files require
about 67 KB of disk space.
The computer editing (validation)
stage checks whether household members'
data follow the editing rules.
For example, when a household member reports
having given birth to 2 sons, that member's sex must
be female (2). If the answer is male (1),
the data are incorrect and need to be corrected.
The output of this stage is a file with
the extension TPS (*.TPS). On average, the
output files of computer editing require about
67 KB.
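An editing rule of this kind is a consistency check between two fields of the same record. The sketch below shows the births rule from the example; the field names and the function are hypothetical, and only the sex codes (1 for male, 2 for female) come from the text.

```python
MALE, FEMALE = 1, 2  # sex codes used on the form


def check_births_rule(member: dict) -> list[str]:
    """Return messages for violations of the births-versus-sex rule."""
    errors = []
    # "children_born" and "sex" are assumed field names for illustration.
    if member.get("children_born", 0) > 0 and member.get("sex") != FEMALE:
        errors.append("children_born > 0 requires sex = 2 (female)")
    return errors


print(check_births_rule({"sex": MALE, "children_born": 2}))
print(check_births_rule({"sex": FEMALE, "children_born": 2}))
```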
Because the number
of files dealt with is very high, they need
to be handled carefully. To make the files
easy to handle, file names are defined using
the identification code of the respondent. The
identification code comprises the Province code,
District/Municipality code, Sub District code, village
code, census block code, and the code number of the
respondent within the census block.
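As an illustration of naming files from the identification code, the sketch below concatenates the six parts in the order listed. The two-digit widths for province, district, and sub district follow the pp, kk, cc notation in the Figure 2 notes; the widths of the village, census block, and respondent parts, the function name, and the sample codes are all assumptions, because the actual layout is defined in Figure 2.

```python
def build_file_stem(province, district, sub_district, village, block, respondent):
    """Concatenate zero-padded identification codes into a file name stem."""
    parts = [
        f"{province:02d}",      # pp
        f"{district:02d}",      # kk
        f"{sub_district:02d}",  # cc
        f"{village:03d}",       # assumed width
        f"{block:03d}",         # assumed width
        f"{respondent:03d}",    # assumed width
    ]
    return "".join(parts)


# Made-up codes, with the scanning-stage extension appended
print(build_file_stem(31, 71, 1, 2, 5, 17) + ".TIF")
```

Because every part has a fixed width, files sort by province, then district, and so on down to the respondent.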
Because there are many
files and they require a lot of disk space, disk
space must be managed carefully. The OCR system
does this by deleting unnecessary files.
When the data of a census block have been declared "clean",
all files produced by the previous stages,
i.e., scanning, recognition, and verification,
are erased. Household member files that
are not accompanied by household files should
also be erased; a particular program
has the function of cleaning up these unaccompanied
files. Before unnecessary files are erased,
all output files of the scanning and verification
stages should be stored on CD-ROM for further
processing in the central office.
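The clean-up of unaccompanied files amounts to finding household member files whose household has no household file. The sketch below is not the BPS program; the data layout (a mapping from member file name to household key) and all names are hypothetical.

```python
def find_unaccompanied(member_files: dict[str, str],
                       household_files: set[str]) -> list[str]:
    """Return member file names whose household key has no household file.

    member_files maps a member file name to the key of its household.
    """
    return sorted(
        name for name, household in member_files.items()
        if household not in household_files
    )


member_files = {"m001.TXT": "h01", "m002.TXT": "h01", "m003.TXT": "h02"}
household_files = {"h01"}
print(find_unaccompanied(member_files, household_files))  # ['m003.TXT']
```

A real clean-up would run only after the scanning and verification outputs have been archived, as the text requires.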
5. SP2000-L2 Form Processing Obstacles
Even though processing a population
census is not new for Indonesia, which has already
conducted censuses in 1960, 1971, 1980, and
1990, BPS realizes that the processing should
be done carefully. One reason is that detailed
information is collected directly from the whole
population; in previous censuses, detailed information
was collected by sample. The other reason is
that processing using OCR technology is quite
different from the usual system, i.e., PC
data entry. In fact, in the last six
months of processing, BPS has faced a number of obstacles.
The obstacles relate to documents, handwriting
quality, scanners, and data.
OCR documents must be handled
very differently from data-entry documents.
This was already stressed to interviewers. However,
when documents arrive at the office from data enumerators,
some of them cannot be processed, for
reasons such as being folded, dirty, or
stapled.
The problem with document condition
is caused not only by the enumerators but also
by printing defects. Even though tight quality
control was performed at the printing companies
and at BPS headquarters, as mentioned earlier,
not all documents were printed free of errors, and
some do not meet the standard. The problems involve
drop color, incorrect positioning of the contents,
etc. They relate especially to drop-color quality:
characters printed in the drop color are
not omitted by the recognition procedure as they are
supposed to be. Because of drop-color errors, the
captured contents of the questionnaire will of course
be erroneous. To overcome the problem, the operator
in the processing center has to type in the data.
The second problem concerns
the contents of the documents, that is,
the way enumerators fill in a questionnaire.
As OCR technology is new to enumerators, they
sometimes forget that writing on an OCR questionnaire
is different from writing on a questionnaire intended
for data entry. For data entry, a questionnaire
can be filled in almost without restriction, as
long as human eyes can read it. But with
OCR technology, everything has to follow the
rules, because it is not an eye that reads the
writing but a scanner.
In practice, answers given as "marks"
have shown very few errors. Problems
mostly relate to "numbers". For processing the
2000 Population Census, the system has been equipped
with a special character memory, based on a number
of Indonesian handwriting samples. Even so,
errors still occur because of handwriting.
Either the recognition engine
cannot identify the number, or it interprets
a number as a different number. For example,
the number "1" is interpreted as "7". This incorrect
interpretation cannot be detected unless
a range check is made on the field. For
example, when the value of the "sex" field is equal
to 7, it must be an error, because only two values
are allowed for the field, i.e., 1 for male and 2 for
female.
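A range check of the kind described compares each captured value with the set of codes allowed for its field. The sketch below uses a hypothetical table of allowed values; only the codes for the "sex" field come from the text.

```python
# Allowed codes per field; only "sex" is taken from the paper.
ALLOWED_VALUES = {"sex": {1, 2}}


def out_of_range(record: dict) -> list[str]:
    """Return the names of fields whose value is not an allowed code."""
    return [
        field for field, allowed in ALLOWED_VALUES.items()
        if record.get(field) not in allowed
    ]


print(out_of_range({"sex": 7}))  # ['sex']
print(out_of_range({"sex": 1}))  # []
```

Note that a range check only catches misreadings that fall outside the allowed codes; a "1" read as "2" in the sex field would still pass.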
The problem occurs because the
writing is not in the form required by the character
memory. The way to write numbers was
practiced in the training period, according to prepared
example numbers. In many cases, human editing
to improve the quality of the numbers is useless,
because even after editing has been done,
errors still often occur. The editing
does not improve the result significantly.
Records show that BPS Java Province, which has
not performed editing, in fact has a high percentage
result (see Attachment 1).
Uncertain interpretations are
shown in the verification process. In this
case, the verification engine shows the
interpreted number and the original character
on the screen so that the interpretation can be checked.
The operator then determines the correct one.
Problems with writing
are also related to the quality of pencils.
The accepted pencil is defined as the 2B type,
and BPS provided that kind of pencil
to the interviewers. However, many interviewers
still use pencils that do not qualify. Another problem
arises when the interviewer does not write with a sharp
pencil.
Human editing is also related to the
quality of the content of the questionnaire.
In human editing, the editor is expected to improve
the quality of the content before documents are scanned.
The editor rewrites unclear numbers, blackens unclear
marks, etc. However, in many cases rewriting a number
does not give much improvement
in the scanner output. Moreover,
rewriting a number means using an eraser,
and in many cases the eraser leaves dirt on the paper,
which affects the rubber of the drive rollers or the feeder
(see Figure 3).
The other obstacle concerns
the scanner. As the scanner is a very important part
of an OCR system, it should be kept in good
condition all the time. However,
when a lot of dirt is left in the reading compartment,
cleaning has to be done frequently.
As a result, the productive time of the scanner
is reduced significantly. Not only
that, it also reduces the lifetime of the machine.
Several components need maintenance to prolong the
scanner's lifetime: the feed module, paper path
sensor, drive rollers, separator roller, imaging
guide, and front and rear lamps.
The other obstacle concerns
the "clean data" itself, that is, whether the data
speak "normally". Even after the data
have been cleaned by computer editing (validation),
there is a possibility that the data look unusual.
For example, when most people in one place have
the same age, something must be wrong
with the data, even though this is not wrong according
to the editing rules.
Analysis of this kind has
been done on a limited amount of data. Tables
have shown several unusual results, such as the number
of people aged more than 50 who still go to school
being too large. An investigation was made to
look at the original data. We found that
one reason is that the OCR system misinterprets
the numbers; an example is where age 23
is interpreted as 53. There are other causes
of this problem. One is the requirement
to write 0 (zero) in the first digit of the age field
when the age is less than 10. With careless
writing, the 0 could be interpreted as 6,
8, or 3.
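Records that pass the editing rules but look implausible can be listed for comparison with the original image. The sketch below flags the school-attendance case mentioned above; the field names, the function, and the sample records are hypothetical.

```python
def flag_unusual(records, age_limit=50):
    """Return records with age above the limit that report school attendance.

    Flagged records are candidates for review, not confirmed errors.
    """
    return [r for r in records if r["age"] > age_limit and r["attends_school"]]


records = [
    {"id": 1, "age": 53, "attends_school": True},   # possibly 23 read as 53
    {"id": 2, "age": 23, "attends_school": True},
    {"id": 3, "age": 61, "attends_school": False},
]
print([r["id"] for r in flag_unusual(records)])  # [1]
```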
The study also examines whether
age reporting in the 2000 Population Census
tends toward ages ending in
0 and 5, for example 5, 25, 40, 55, 80,
etc. The study has been done on the Jakarta Province
data. Using the Joint Score Index/UN Index,
we found that the age reporting is categorized as inaccurate,
because the Joint Score Index is equal to 32.90.
Investigation of the data shows that one cause
of inaccurate age reporting is inaccurate
recognition; for example, age 43 is interpreted
as 45.
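A quick way to look for preference for ages ending in 0 and 5 is to compute the share of such ages among all reported ages. This is a simpler check than the Joint Score Index used in the study, and it is shown here only as a sketch; the function name and the sample ages are made up.

```python
from collections import Counter


def share_ending_in_0_or_5(ages):
    """Return the fraction of reported ages whose last digit is 0 or 5."""
    last_digits = Counter(age % 10 for age in ages)
    return (last_digits[0] + last_digits[5]) / len(ages)


ages = [5, 25, 40, 55, 80, 23, 37, 41, 62, 18]  # made-up sample
print(share_ending_in_0_or_5(ages))  # 0.5
```

If final digits were evenly spread, the share would be close to 0.2, so a much higher value suggests heaping, whether it comes from respondents or from misrecognition.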
6. Conclusion
Realizing that OCR technology
is new to it, BPS made intensive preparations.
Enumeration of the 2000 Population Census was
conducted using OCR forms as well as non-OCR
forms. The preliminary result based on the non-OCR forms
has been announced, and the final result will be announced
after processing of the OCR forms is finalized.
The processing is in progress; although it
is not final yet, it is near
the end. Hopefully, the Indonesian experience
may be a lesson to other NSOs that are going or planning
to utilize OCR technology for population censuses,
as well as for other censuses or surveys.
Attachment 1
Number of SP2000-L2 forms, SP2000-KBL2 forms, and Scanners for Each Processing Center