Workshop on Population Data Analysis, Storage and Dissemination Technologies
Bangkok, 27-30 March 2001

STAT/WDT/5
26 March 2001
ENGLISH ONLY

ECONOMIC AND SOCIAL COMMISSION FOR ASIA AND THE PACIFIC

The Use of OCR Technology in the 2000 Population Census: Indonesian Experience 1/
(Item 3 of the provisional agenda)
Contents
  1. Introduction
  2. The Nature of the Indonesian 2000 Population Census
  3. Preliminary result of the 2000 Population Census
  4. SP2000-L2 Processing System
  5. SP2000-L2 Form Processing Obstacles
  6. Conclusion

1/  This paper, prepared by Sihar Lumbantobing, BPS Statistics Indonesia, has been reproduced as submitted.  It has been issued without formal editing.
1. Introduction
After several pilots, BPS Statistics Indonesia (BPS) decided to process the 2000 Population Census using OCR technology.  The decision had to be made carefully, since the mark reader technology used to process the 1971 Population Census did not give very good results.  That experience led BPS to switch to PC-based data entry for the 1980 and 1990 Population Censuses.  A further reason for the switch was that the introduction of PC technology made data entry more efficient than mark reading.
Today, advances in OCR technology make it a promising option for census data capture.  BPS reconsidered this option because current OCR technology tolerates questionnaires of lower print quality than the mark reader technology of 1971 required.  In 1971, mark reader technology obliged BPS to print the questionnaires abroad.  OCR technology now allows the documents to be printed in the country, with lower-quality printing, although a minimum standard must of course still be met.
Every household member was enumerated in June 2000 using an OCR-based document called SP2000-L2.  The documents are being processed at present, and the final result should be announced before June 2001.  Dealing with such a new technology in all stages of the census, including planning, enumeration, data capture, data editing, and tabulation, is of course a new experience for BPS.  Hopefully, this experience will be useful to other national statistical offices (NSOs) that intend to use the technology.
2. The Nature of the Indonesian 2000 Population Census
Enumeration for the Indonesian 2000 Population Census was carried out in June 2000, with July 1 defined as the census date.  The SP2000 population census is intended to provide population statistics down to the smallest administrative area.  The country is divided into Provinces, which are divided into Districts or Municipalities.  A District or Municipality is divided into Sub Districts, and a Sub District is divided into villages.  The village is the smallest administrative area, and village-level statistics can be produced from the results of the 2000 Population Census.  For BPS operational purposes, a village is divided into census blocks.
Before discussing the OCR technology, it is worthwhile to describe the documents used in the Population Census.  To collect various data relating to population and housing, BPS uses five forms: SP2000-L1, SP2000-RBL1, SP2000-L3, SP2000-L2, and SP2000-KBL2.  For simplicity, the forms are referred to below as L1, RBL1, L3, L2, and KBL2.
The L1 form collects information about households and housing, the RBL1 form records the total number of households and population in every census block in a Sub District, and the L3 form records information on homeless persons.  Data from these three forms are captured by PC-based data entry in every BPS district or municipality office.
The L2 form is used to collect detailed data on household members (individual persons).  The SP2000-KBL2 form is used to control the processing of the L2 forms of a census block.  An example of the L2 form is given in Attachment 3.
As with other surveys and censuses, BPS conducted intensive training for all prospective field data collectors and supervisors.  Training of enumerators for the 2000 Population Census was especially important because they had to deal with a new technology, namely OCR.  A further reason is that most enumerators were hired from outside BPS, and enumeration was new to them.
A training-related problem arose because of budget constraints: although enumeration took place in June 2000, the training had to be held long before, in February or March 2000 (about three months earlier).  As a result, many of the people trained could not take part in the enumeration, for reasons such as having moved, changed jobs, or died.  New people were hired to do the work, but unfortunately they had not been trained.  Several district or municipality offices gave these replacements very brief training, but it was not sufficient.  It is acknowledged that this situation affected the quality of the collected data.
For processing of the census, scanners were distributed to the central office and to almost all province offices.  The scanner type is the KODAK DS Scanner 3500.  Processing is done in every BPS Province office, except in Java, where 14 BPS District/Municipality offices are also assigned the task.  The estimated number of documents to be processed and the number of scanners distributed are shown in Attachment 1.
Three printing companies printed the L2 and KBL2 forms in the country.  For security reasons, each document carries a tiny code so that, if a problem occurs, the company that produced the paper can be identified.  Throughout the printing period, printing company staff carried out intensive quality checks at the printing company.  A second check was performed on a sample basis at the BPS head office.  When a sample showed low-quality printing, all documents in the same batch had to be destroyed.  This was necessary because OCR technology places strict requirements on the drop color.
3. Preliminary result of the 2000 Population Census
On January 3, 2001, the Director General of BPS announced the preliminary result of the Indonesian 2000 Population Census.  The population of Indonesia was announced as 203 456 005 persons (see Attachment 2).  This means that between 1990 and 2000 the Indonesian population grew at a rate of 1.35% per year, which is lower than the growth rate between 1980 and 1990.  However, this figure is only preliminary and was obtained by processing the RBL1 and L3 forms.  The final figure will be announced after processing of the L2 forms is completed.
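The growth rate quoted above can be related to two census totals by a compound-growth formula.  The following is a minimal sketch in Python, assuming geometric (compound) annual growth; the paper does not state which formula BPS used, and the function names are illustrative only.  Because the 1990 total is not given in this paper, the sketch only derives the 1990 population implied by the published figures under that assumption.

```python
def annual_growth_rate(pop_start, pop_end, years):
    """Average annual growth rate (as a fraction), assuming compound growth."""
    return (pop_end / pop_start) ** (1.0 / years) - 1.0


def implied_start_population(pop_end, rate, years):
    """Start-of-period population implied by an end total and an annual rate."""
    return pop_end / (1.0 + rate) ** years


POP_2000 = 203_456_005   # preliminary total from Attachment 2
RATE_1990_2000 = 0.0135  # 1.35 %, read here as an annual rate (assumption)

# Implied 1990 population under the compound-growth assumption.
# This is a derived illustration, not a census figure.
implied_1990 = implied_start_population(POP_2000, RATE_1990_2000, 10)
print(f"Implied 1990 population: {implied_1990:,.0f}")
```

Feeding the implied 1990 value back into annual_growth_rate returns the same rate, which is a quick consistency check on the two functions.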
One note about the figure is needed.  Several areas in Indonesia were still in conflict, so enumeration could not be carried out there.  The announced preliminary figure therefore includes estimates for those areas.
4. SP2000-L2 Processing System
L2 form processing is still in progress.  The processing status of several centers is shown in Attachment 1.  The table shows that one processing center, Jakarta province, has already completed its processing.
As mentioned before, the L2 form collects data on individual household members.  One L2 form holds up to 8 household members, using both sides of the form; each side holds the data of four individuals.  A household with more than 8 members therefore needs more than one form.  A household of 10 members, for example, requires two L2 forms: the first for 8 members and the second for the other two.
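The form count described above is a simple ceiling division.  A minimal sketch in Python follows; the constant and function names are illustrative and do not come from the BPS system.

```python
MEMBERS_PER_SIDE = 4                        # individuals per side of an L2 form
SIDES_PER_FORM = 2                          # both sides are used
MEMBERS_PER_FORM = MEMBERS_PER_SIDE * SIDES_PER_FORM  # 8 members per form


def l2_forms_needed(household_members):
    """Number of L2 forms needed for a household of the given size."""
    if household_members < 1:
        raise ValueError("a household must have at least one member")
    # Ceiling division using integers only.
    return -(-household_members // MEMBERS_PER_FORM)


print(l2_forms_needed(10))  # the 10-member household from the text
```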
For control purposes, a KBL2 form accompanies all L2 forms of the same census block.  On average, one census block contains about a hundred L2 forms.  Processing the L2 and KBL2 forms involves four stages that must be carried out in sequence: scanning, recognition, verification, and validation (see Figure 1).  Each stage deals with different file types.  To keep the files under control, a good management system is needed.  This system covers how files are named, which files are kept, and which files are deleted.  Figure 2 shows the files handled in each stage.
In the scanning stage, scanners capture the content of the questionnaire and produce image files.  Every L2 form and KBL2 form produces two image files with the extension TIF (*.TIF), because each side of the document produces one TIF file.  One TIF file is about 24 KB.  A census block containing 100 L2 forms and one KBL2 form will therefore produce 202 TIF files and require about 4.8 MB of disk space.
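The image count and disk space for a census block follow directly from these figures.  The sketch below is illustrative only; it assumes 1 MB = 1000 KB, which is what reproduces the 4.8 MB quoted above, and the function names are not taken from the BPS system.

```python
TIF_SIZE_KB = 24     # approximate size of one TIF image, from the text
SIDES_PER_FORM = 2   # each side of a form yields one TIF file


def tif_file_count(l2_forms, kbl2_forms=1):
    """Number of TIF files produced by scanning one census block."""
    return SIDES_PER_FORM * (l2_forms + kbl2_forms)


def tif_space_mb(l2_forms, kbl2_forms=1):
    """Approximate disk space in MB, assuming 1 MB = 1000 KB."""
    return tif_file_count(l2_forms, kbl2_forms) * TIF_SIZE_KB / 1000


print(tif_file_count(100), round(tif_space_mb(100), 1))
```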
Figure 1.
Figure 2. File naming systems
Notes:
pp denotes Province code
kk denotes District/ Municipality code
cc denotes Sub District code
Figure 3. Scanner Components
In the recognition stage, the engine recognizes the images produced in the previous stage and produces interpreted characters.  The accuracy of interpretation depends on the quality of the character memory: a better character memory produces better-interpreted characters.  That is why, for processing the population census data, BPS was provided with a special Indonesian character memory based on a number of Indonesian handwriting samples.  The recognition stage produces interpreted files with the extension ZRF (*.ZRF).  A ZRF file requires about 73 KB of disk space, so the 202 files of a census block require about 14.7 MB.
The third stage is verification of the interpreted results against the original image characters.  When the computer's confidence is very high (say 100%), there is no need to verify the interpreted characters against the originals.  When the confidence is low (say 60%), the interpreted value and the original character are shown together so that the operator can decide whether the interpretation is correct.  If it is not, the operator can enter the correct value.  The output of the verification stage is a file with the extension TXT (*.TXT).
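The routing of fields to the operator can be sketched as a confidence threshold.  The example below is a hypothetical illustration in Python: the RecognizedField structure, the field names, and the 0.9 threshold are assumptions, since the paper does not state the cut-off used by the verification engine.

```python
from dataclasses import dataclass


@dataclass
class RecognizedField:
    name: str          # questionnaire field, e.g. "age" (hypothetical name)
    value: str         # characters returned by the recognition engine
    confidence: float  # engine confidence, from 0.0 to 1.0


def split_for_verification(fields, threshold=0.9):
    """Return (accepted, to_verify); the 0.9 threshold is an assumed value."""
    accepted = [f for f in fields if f.confidence >= threshold]
    to_verify = [f for f in fields if f.confidence < threshold]
    return accepted, to_verify


fields = [
    RecognizedField("sex", "2", 1.00),   # high confidence: accepted as is
    RecognizedField("age", "53", 0.60),  # low confidence: shown to the operator
]
accepted, to_verify = split_for_verification(fields)
print([f.name for f in to_verify])
```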
In Indonesian Population Census processing, verification is done for all households in a census block at once.  It therefore produces three kinds of files: census block files, household files, and household member files.  On average, the three files require about 67 KB of disk space.
Computer editing (validation) is the process of checking whether household members' data follow the editing rules.  For example, when a household member reports having given birth to 2 sons, the sex recorded must be female (2).  If the recorded sex is male (1), the data are incorrect and need to be corrected.  The output of this stage is a file with the extension TPS (*.TPS).  On average, the output files of computer editing require about 67 KB.
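An editing rule of this kind is a consistency check between two fields.  The following Python sketch shows the example rule only; the record layout and the field names "sex" and "children_born" are hypothetical, and the actual BPS rule set is not described in this paper.

```python
MALE, FEMALE = 1, 2  # sex codes used on the L2 form


def check_births_against_sex(record):
    """Return a list of error messages for one household member record."""
    errors = []
    # A member who reports having given birth must be recorded as female.
    if record.get("children_born", 0) > 0 and record.get("sex") != FEMALE:
        errors.append("children_born > 0 but sex is not female (2)")
    return errors


print(check_births_against_sex({"sex": MALE, "children_born": 2}))
```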
Because the number of files to deal with is very high, they need to be handled carefully.  To make handling easier, file names are built from the identification code of the respondent.  The identification code consists of the Province code, District/Municipality code, Sub District code, village code, census block code, and the number of the respondent within the census block.
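A naming scheme of this kind can be sketched as a concatenation of zero-padded codes.  In the Python example below, the order of the components and the stage extensions follow the text, but the field widths (two digits for province, district, and sub district; three digits for village, census block, and respondent) and the sample values are assumptions made purely for illustration.

```python
# File extension produced by each processing stage, as described in the text.
STAGE_EXTENSIONS = {
    "scanning": "TIF",
    "recognition": "ZRF",
    "verification": "TXT",
    "validation": "TPS",
}


def build_file_name(province, district, sub_district, village, block,
                    respondent, stage):
    """Build a file name from the identification code (assumed field widths)."""
    code = (
        f"{province:02d}{district:02d}{sub_district:02d}"
        f"{village:03d}{block:03d}{respondent:03d}"
    )
    return f"{code}.{STAGE_EXTENSIONS[stage]}"


# Made-up identification code, for illustration only.
print(build_file_name(31, 71, 1, 2, 5, 12, "scanning"))
```

Fixed-width codes keep all files of one census block adjacent when names are sorted, which simplifies both processing and clean-up.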
Because the files are numerous and take up a lot of disk space, disk space must also be managed carefully.  The OCR system does this by deleting unnecessary files.  When the data of a census block have been declared "clean", all files produced by the previous stages (scanning, recognition, and verification) are erased.  Household member files that are not accompanied by household files should also be erased, and there is a dedicated program for cleaning up these unaccompanied files.  Before unnecessary files are erased, all output files of the scanning and verification stages should be stored on CD-ROM for further processing in the central office.
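The clean-up of unaccompanied household member files amounts to finding member records whose household is missing.  The sketch below is not the BPS program; it assumes a hypothetical key convention in which a member key is the household key followed by a hyphen and the member number, and it only lists the candidates rather than deleting anything.

```python
def household_key(member_key):
    """Hypothetical convention: member key = household key + '-' + member no."""
    return member_key.rsplit("-", 1)[0]


def unaccompanied_member_files(household_keys, member_keys):
    """Member keys whose household key has no matching household file."""
    households = set(household_keys)
    return sorted(k for k in member_keys if household_key(k) not in households)


households = ["B001-H01", "B001-H02"]
members = ["B001-H01-1", "B001-H01-2", "B001-H03-1"]
print(unaccompanied_member_files(households, members))
```

Listing the candidates first, and erasing only after the scanning and verification outputs have been archived, keeps the clean-up step reversible.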
5. SP2000-L2 Form Processing Obstacles
Although processing a population census is not new for Indonesia, which has already conducted censuses in 1960, 1971, 1980, and 1990, BPS realizes that the processing must be done carefully.  One reason is that detailed information is collected directly from the whole population, whereas in previous censuses detailed information was collected from a sample.  The other reason is that processing with OCR technology is quite different from the usual PC data entry system.  Indeed, in the last six months of processing, BPS has faced a number of obstacles.  The obstacles relate to documents, handwriting quality, scanners, and data.
OCR documents must be handled very differently from data-entry documents.  This was already stressed to interviewers.  However, some of the documents that reach the office from enumerators cannot be processed, for reasons such as being folded, dirty, or stapled.
Problems with document condition are caused not only by enumerators but also by printing defects.  Even though tight quality control was carried out both at the printing companies and at BPS headquarters, some documents still do not meet the standard.  The defects include drop color problems and incorrect positioning of the contents.  Drop color problems are the most serious: characters printed in a faulty drop color are not dropped out by the recognition procedure as they are supposed to be, so the captured contents of the questionnaire will be erroneous.  To overcome the problem, an operator in the processing center has to type in the data.
The second problem concerns the contents of the documents, that is, the way enumerators fill in a questionnaire.  As OCR technology is new to enumerators, they sometimes forget that filling in an OCR questionnaire is different from filling in a questionnaire intended for data entry.  For data entry, a questionnaire can be filled in almost without restriction, as long as human eyes can read it.  With OCR technology, everything has to follow the rules, because it is a scanner, not a human eye, that reads the writing.
In practice, answers given as "marks" have shown very few errors.  Problems are mostly related to "numbers".  For processing the 2000 Population Census, the system was provided with a special character memory based on a number of Indonesian handwriting samples.  Even so, errors still occur because of handwriting.  Either the recognition engine cannot identify the number, or it interprets it as a different number; for example, "1" is interpreted as "7".  Such a misinterpretation cannot be detected unless a range check is applied to the field.  For example, a value of 7 in the "sex" field must be an error, because only two values are allowed in that field: 1 for male and 2 for female.
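A range check of the kind described can be sketched as a lookup of allowed values per field.  In the Python example below, only the "sex" codes come from the text; the dictionary layout and function name are illustrative assumptions.

```python
# Allowed values per field; only the "sex" codes are taken from the text.
ALLOWED_VALUES = {
    "sex": {1, 2},  # 1 = male, 2 = female
}


def range_check(record):
    """Return the names of fields whose value is outside the allowed set."""
    return [
        field
        for field, allowed in ALLOWED_VALUES.items()
        if field in record and record[field] not in allowed
    ]


print(range_check({"sex": 7}))  # a "1" misread as "7" is caught here
```

A range check only catches values that fall outside the allowed set; a "1" misread as "2" in the same field would pass unnoticed.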
The problem occurs when the writing is not in the form required by the character memory.  The correct way to write numbers was taught during the training period, using prepared example numbers.  In many cases, human editing to improve the quality of the numbers is useless: even after editing, errors often still occur, and the editing does not improve the result significantly.  Records show that BPS Java Province, which did not perform editing, in fact has a high percentage result (see Attachment 1).
Uncertain interpretations are resolved in the verification process.  In this situation, the verification engine shows the interpreted number and the original character on the screen, and the operator determines the correct value.
Writing problems are also related to the quality of the pencils.  The accepted pencil is type 2B, and BPS provided pencils of that kind to the interviewers.  However, many interviewers still used pencils that did not meet this requirement.  A further problem arises when the interviewer does not write with a sharp pencil.
Human editing also affects the quality of the questionnaire contents.  In human editing, the editor is expected to improve the quality of the contents before the documents are scanned, by rewriting unclear numbers, blackening unclear marks, and so on.  However, rewriting a number in many cases gives little improvement in the scanner output.  Moreover, rewriting a number means using an eraser, and in many cases the eraser leaves dirt on the paper, which affects the rubber of the drive rollers or the feeder (see Figure 3).
Another obstacle concerns the scanner.  As the scanner is a very important part of an OCR system, it must be kept in good condition at all times.  When a lot of dirt is left in the reading compartment, cleaning has to be done frequently.  As a result, the productive time of the scanner is reduced significantly, and so is the lifetime of the machine.  Several components need regular maintenance to prolong the scanner's lifetime: the feed module, paper path sensor, drive rollers, separator roller, imaging guide, and front and rear lamps.
Another obstacle concerns the "clean data" themselves, namely whether the data behave "normally".  Even after the data have been cleaned by computer editing (validation), they may still look unusual.  For example, when most people in one place have the same age, something must be wrong with the data, even though this does not violate any editing rule.
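A check for this kind of anomaly can be sketched by measuring how much of an area's population shares the most common age.  The Python example below is an illustration only; the function name is hypothetical, and the share at which an area should be considered suspicious is left to the analyst because the paper gives no threshold.

```python
from collections import Counter


def modal_age_share(ages):
    """Return (most common age, share of persons reporting that age)."""
    if not ages:
        raise ValueError("no ages supplied")
    age, count = Counter(ages).most_common(1)[0]
    return age, count / len(ages)


# Made-up ages for one small area, for illustration only.
print(modal_age_share([30, 30, 30, 45]))
```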
Analysis of this kind has been done on a limited amount of data.  The tables showed several unusual results, for example that the number of people aged over 50 who still attend school is too large.  The original data were investigated, and one reason was found to be that the OCR system misinterprets numbers; for example, age 23 is interpreted as 53.  There are other causes as well.  One is the requirement to write 0 (zero) as the first digit of the age field when the age is less than 10: when written carelessly, the 0 can be interpreted as 6, 8, or 3.
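Records behind such an unusual table cell can be pulled out for comparison with the scanned images.  The following Python sketch assumes a hypothetical record layout with the fields "age" and "attends_school"; the age limit of 50 comes from the example in the text.

```python
def flag_older_students(records, age_limit=50):
    """Return records of persons older than age_limit who attend school."""
    return [
        r for r in records
        if r["age"] > age_limit and r["attends_school"]
    ]


# Made-up records, for illustration only.
records = [
    {"id": "a", "age": 23, "attends_school": True},
    {"id": "b", "age": 53, "attends_school": True},   # possibly a misread 23
    {"id": "c", "age": 61, "attends_school": False},
]
print([r["id"] for r in flag_older_students(records)])
```

Flagged records are candidates for review, not confirmed errors, since some older persons may genuinely attend school.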
The study also examined whether ages reported in the 2000 Population Census tend to end in the digits 0 and 5, for example 5, 25, 40, 55, or 80.  The study was done on the Jakarta Province data.  Using the Joint Score Index (UN Index), the age reporting was categorized as inaccurate, because the Joint Score Index is 32.90.  Investigation of the data shows that one cause of inaccurate age reporting is inaccurate recognition; for example, age 43 is interpreted as 45.
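A quick first look at digit preference is the share of reported ages ending in 0 or 5.  The Python sketch below is a deliberately crude check and is not the UN Joint Score Index used in the study; with no digit preference and evenly spread ages, the share would be about 0.2.

```python
def share_ending_in_0_or_5(ages):
    """Share of reported ages whose last digit is 0 or 5."""
    if not ages:
        raise ValueError("no ages supplied")
    hits = sum(1 for age in ages if age % 10 in (0, 5))
    return hits / len(ages)


# Ages 0..99 once each: no digit preference, so the share is 0.2.
print(share_ending_in_0_or_5(list(range(100))))
```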
6. Conclusion
Realizing that OCR technology was new to it, BPS made intensive preparations.  Enumeration for the 2000 Population Census was conducted using OCR forms as well as non-OCR forms.  The preliminary result, based on the non-OCR forms, has been announced, and the final result will be announced once processing of the OCR forms is finished.  The processing is in progress; it is not yet final, but it is near the end.  Hopefully, the Indonesian experience can serve as a lesson to other NSOs that are using or planning to use OCR technology for population censuses, as well as for other censuses or surveys.
Attachment 1
Number of SP2000-L2 forms, SP2000-KBL2 forms, and Scanners For Each Processing Center
No. | Processing center | Estimated L2 and KBL2 forms collected | Estimated L2 and KBL2 forms processed | Number of scanners | Processing status
1 | BPS Propinsi Dista Aceh | 993,327 | 877,408 | 1 |
2 | BPS Propinsi Sumatra Utara | 2,977,534 | 2,977,534 | 4 |
3 | BPS Propinsi Sumatra Barat | 1,220,055 | 1,220,055 | 2 | 74.5 %
4 | BPS Propinsi Riau | 1,123,268 | 1,123,268 | 2 |
5 | BPS Propinsi Jambi | 682,432 | 682,432 | 1 |
6 | BPS Propinsi Sumatra Selatan | 1,997,438 | 1,997,438 | 3 |
7 | BPS Propinsi Bengkulu | 411,112 | 411,112 | 1 |
8 | BPS Propinsi Lampung | 1,857,024 | 1,857,024 | 3 |
9 | BPS Propinsi DKI Jakarta | 2,468,774 | 2,468,774 | 4 | 100 %
10 | BPS Propinsi Jawa Barat | 5,699,927 | 5,699,927 | 7 |
11 | BPS Kabupaten Serang | 911,659 | 820,493 | 1 |
12 | BPS Kabupaten Bogor | 1,162,691 | 872,018 | 1 |
13 | BPS Kabupaten Bandung | 1,931,210 | 869,045 | 1 |
14 | BPS Cirebon | 807,410 | 807,410 | 1 |
15 | BPS Tasikmalaya | 1,042,387 | 833,910 | 1 |
16 | BPS Propinsi Jawa Tengah | 3,851,488 | 3,851,488 | 5 | 61.1 %
17 | BPS Kabupaten Banyumas | 782,538 | 782,538 | 1 |
18 | BPS Kabupaten Kebumen | 684,347 | 684,347 | 1 |
19 | BPS Kabupaten Wonosobo | 873,296 | 785,966 | 1 |
20 | BPS Kabupaten Pemalang | 958,416 | 814,654 | 1 |
21 | BPS Propinsi Klaten | 886,343 | 797,709 | 1 |
22 | BPS Propinsi DI Yogyakarta | 878,550 | 878,550 | 1 |
23 | BPS Propinsi Jawa Timur | 3,393,246 | 2,714,597 | 4 |
24 | BPS Kabupaten Kediri | 1,957,388 | 1,663,780 | 2 |
25 | BPS Kabupaten Jember | 1,803,901 | 1,533,316 | 2 |
26 | BPS Kabupaten Tuban | 1,820,938 | 1,456,750 | 2 |
27 | BPS Kabupaten Madiun | 1,269,989 | 1,269,989 | 2 |
28 | BPS Propinsi Bali | 833,061 | 833,061 | 1 |
29 | BPS Propinsi Nusa Tenggara Barat | 1,047,207 | 1,047,207 | 2 |
30 | BPS Propinsi Nusa Tenggara Timur | 880,173 | 704,138 | 1 |
31 | BPS Propinsi Kalimantan Barat | 994,746 | 782,049 | 1 |
32 | BPS Propinsi Kalimantan Tengah | 490,450 | 490,450 | 1 |
33 | BPS Propinsi Kalimantan Selatan | 862,879 | 690,303 | 1 |
34 | BPS Propinsi Kalimantan Timur | 659,292 | 659,292 | 1 |
35 | BPS Propinsi Sulawesi Utara | 756,000 | 759,000 | 1 |
36 | BPS Propinsi Sulawesi Tengah | 573,005 | 573,005 | 1 | 75.36 %
37 | BPS Propinsi Sulawesi Selatan | 1,964,013 | 1,964,013 | 3 |
38 | BPS Propinsi Sulawesi Tenggara | 423,592 | 423,592 | 1 |
39 | BPS Propinsi Maluku | 558,266 | 558,266 | 1 |
40 | BPS Propinsi Irian Jaya | 577,398 | 577,398 | 14 |
41 | BPS Headquarter |  | 4,256,464 | 79 |
  | Total | 55,066,770 | 55,066,770 |  | 72 %
Attachment 2
Preliminary result of 2000 Population Census
Province | Number of population | Population Growth (%)
1. Dista Aceh | 4,010,865 | 1.67
2. Sumatra Utara | 11,476,272 | 1.17
3. Sumatra Barat | 4,228,103 | 0.57
4. Riau | 4,733,948 | 3.79
5. Jambi | 2,400,940 | 1.80
6. Sumatra Selatan | 7,756,506 | 2.15
7. Bengkulu | 1,405,060 | 1.83
8. Lampung | 6,654,354 | 1.05
9. DKI Jakarta | 8,384,853 | 0.16
10. Jawa Barat | 43,552,923 | 2.17
11. Jawa Tengah | 30,856,825 | 0.82
12. Yogyakarta | 3,109,142 | 0.68
13. Jawa Timur | 34,525,588 | 0.63
14. Bali | 3,124,674 | 1.22
15. Nusa Tenggara Barat | 3,821,794 | 1.31
16. Nusa Tenggara Timur | 3,929,039 | 1.92
17. Kalimantan Barat | 3,740,017 | 1.53
18. Kalimantan Tengah | 1,801,504 | 2.67
19. Kalimantan Selatan | 2,970,244 | 1.40
20. Kalimantan Timur | 2,436,545 | 2.74
21. Sulawesi Utara | 2,820,839 | 1.35
22. Sulawesi Tengah | 2,066,394 | 1.97
23. Sulawesi Selatan | 7,787,299 | 1.14
24. Sulawesi Tenggara | 1,771,951 | 2.86
25. Maluku | 1,977,570 | 0.65
26. Irian Jaya | 2,112,756 | 2.60
Total | 203,456,005 | 1.35
 