Workshop on Application of New Information Technology to Population Data
Bangkok, 12-20 October 1999

STAT/WNIT/Rep
16 June 2000
ENGLISH ONLY

ECONOMIC AND SOCIAL COMMISSION FOR ASIA AND THE PACIFIC

Workshop on Application of New Information Technology to Population Data
12-20 October 1999
Bangkok

Report on the Workshop on Application of New Information Technology to Population Data

The designations employed and the presentation of the material in this report do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or area, or of its authorities, or concerning the delimitation of its frontiers or boundaries.  Mention of any firm, licensed process or product does not imply endorsement by the United Nations.  This report has been issued without formal editing.
Contents
Abbreviations and Descriptions
  1. ORGANIZATION OF THE WORKSHOP
    1. Attendance
    2. Opening of the Workshop
    3. Workshop arrangements
    4. Documentation
  2. INTRODUCTION TO INFORMATION TECHNOLOGY IN CENSUS OPERATIONS
    1. Project RAS/96/P12
    2. Objective of the Workshop
    3. Census processes
    4. Technology applied in recent censuses and surveys
    5. IT trends
    6. Quality management
    7. Expectations for the Workshop
  3. PAPER BASED DATA COLLECTION AND CAPTURE
    1. Optical Mark Recognition (OMR)
    2. Demonstration of Optical Mark Reader (OMR)
    3. Optical Character Recognition (OCR)
      1. OCR technology for the Indonesian Census 2000
      2. Demonstration of OCR cluster
      3. New Zealand experience in 1996
      4. Observations and recommendations on OCR
    4. Archiving of census forms
  4. NON-PAPER BASED DATA COLLECTION AND CAPTURE
    1. Computer Assisted Telephone Interviewing
      1. Internet and CATI in Singapore Census 2000
    2. Computer Assisted Personal Interviewing
  5. IMPLICATIONS FOR THE GUIDELINES ON THE APPLICATION OF NEW TECHNOLOGY TO POPULATION DATA COLLECTION AND CAPTURE
  6. ADDING VALUE TO CENSUS DATA THROUGH DATA WAREHOUSING AND DATA MINING
  7. DATA DISSEMINATION
    1. Implications for the guidelines on the Application of New Information Technology to Population Data Dissemination
  8. GEOGRAPHIC INFORMATION SYSTEMS
    1. Implications for the guidelines on the Application of Geo-Positioning Systems and Geographic Information Systems for Digital Mapping and Statistical Management
  9. RECOMMENDATIONS OF THE WORKSHOP
    1. General, IT management
    2. Data collection and capture
    3. Guidelines
    4. Data warehousing, databases, data archiving
    5. Data Dissemination
    6. Mapping and GIS
    7. Follow up
Annex I: List of Participants
Annex II: Tentative Time Schedule
Annex III: List of Documents
ABBREVIATIONS AND DESCRIPTIONS
AFPS Pro  Comprehensive application for high-volume forms processing based on advanced imaging technologies.
ArcInfo Comprehensive GIS software for a variety of computing environments.
ArcView Desktop mapping and GIS software.
Blaise A survey data collection and processing system.
CAPI Computer Assisted Personal Interviewing.
CARS Classifications and Related Systems.
CATI Computer Assisted Telephone Interviewing.
CGI  Common Gateway Interface facilitating dynamic content provision from web servers to client computers.
CSV format "Comma Separated Values" format.  An ASCII file format that is commonly used as an intermediate format when transferring data between databases and spreadsheets of different makes.  Values are separated by commas and may be enclosed in quotation marks.
dpi dots per inch.
EA Enumeration Area.
FLY (fly) C program that creates GIF image files on the fly from CGI and other programs.
GIS Geographic Information System.
GPS Global Positioning System.
HTML HyperText Markup Language.
ICR Intelligent Character Recognition.
IMPS Integrated Microcomputer Processing System.
IT Information technology.
KFI  Keying-from-image.
KFP Keying-from-paper.
LAN Local Area Network.
MapInfo Software product for mapping, data visualization and GIS.
NCS Nestor Reader Development tool for building forms processing or automatic data capture/entry applications.
NSO(s) National Statistical Office(s).
OCR Optical Character Recognition.
OLAP Online Analytical Processing.
OMR Optical Mark Recognition/Reader.
PC Personal computer.
PDF Portable Document Format.
PopMap Integrated geographical software providing maps and a graphics database.
PQM Process Quality Management.
SAS Statistical Analysis System.
SIAP Statistical Institute for Asia and the Pacific.
SPSS  Statistical Package for Social Sciences.
SQL Structured Query Language.
SQM  Statistical Quality Management.
SuperCROSS   Fast cross-tabulation software.
SuperMAP  Mapping software.
TCDC Technical Cooperation among Developing Countries.
TIFF format Tagged Image File Format.
TREND Time Series Retrieval and Dissemination Database.
UNFPA United Nations Population Fund.
UNFPA/CST United Nations Population Fund /Country Support Team.
I. ORGANIZATION OF THE WORKSHOP
A. Attendance
1. The Workshop on Application of New Information Technology to Population Data, funded by the United Nations Population Fund (UNFPA) under the project RAS/96/P12, was held in Bangkok from 12 to 20 October 1999.  It was organized by the secretariat of the Economic and Social Commission for Asia and the Pacific of the United Nations (ESCAP) with active support of the Working Party on the Application of New Technology to Population Data.
2. The Workshop was attended by thirty-one participants from nineteen selected countries/areas in the Asian and Pacific region: Bangladesh; Fiji; Hong Kong, China; India; Indonesia; Islamic Republic of Iran; Kazakhstan; Malaysia; Maldives; Mongolia; Myanmar; Nepal; Pakistan; Philippines; Republic of Korea; Samoa; Sri Lanka; Thailand and Viet Nam.
3. The members of the Working Party, consisting of nine experts from Australia; Bangladesh; Indonesia; Japan; Macao, China; New Zealand; Philippines; Singapore and Thailand, together with representatives of the Statistical Institute for Asia and the Pacific (SIAP) and of the UNFPA Country Support Teams for East Asia, and Central and South Asia, participated as resource persons.  Invited private sector companies also participated as observers and made presentations.
4. The list of participants is attached as Annex I.
B. Opening of the Workshop
5. The Workshop was inaugurated by Ms Kayoko Mizuta, the Deputy Executive Secretary of ESCAP.  In her opening statement, Ms Mizuta welcomed the participants and thanked the donor agency and resource persons for their role in, and commitment to, the organization and funding of the Workshop.  She expressed appreciation for the cooperation extended by private sector organizations to the Workshop.  She noted that the Workshop was one of the outputs of the ESCAP project RAS/96/P12 and that it was organized under the guidance of the Working Party on the Application of New Technology to Population Data.  Apart from the Workshop, other major outputs of the Working Party included three guidelines on (a) population data collection and capture; (b) modern mapping and GIS; and (c) population data dissemination.
6. In noting the benefits of new technology to statistical services in the region, Ms Mizuta emphasized the role information technology (IT) played in reducing costs of census and survey operations.  While it was not possible to present the full spectrum of technological innovations in just one Workshop, she hoped that, by sharing information and experiences in significant areas of IT, participants would enrich and further improve their understanding of new technologies relevant for their operations. Ms Mizuta closed her opening statement by highlighting that the Workshop materials would be made available through the project web site and by wishing the Workshop success.
C. Workshop arrangements
7. The Workshop noted that the time schedule (see Annex II) prepared by the secretariat was based on the tentative agenda, and agreed to proceed accordingly in six modules as follows:
Module                                                                   Organizer
1. Introduction to IT in census operations                               ESCAP secretariat
2. Paper based data collection and capture                               Indonesia and Japan
3. Non-paper based data collection and capture                           Singapore and Australia
4. Adding value to census data through data warehousing and data mining  ESCAP secretariat
5. Data dissemination                                                    New Zealand
6. Geographic information systems                                        Bangladesh
8. The Workshop acknowledged with thanks the following presentations and support by private sector companies:
Topic  Presenter
2.3 Is OMR technology still feasible?  DRS Data and Research Services plc United Kingdom
2.4 Census Success Story: US Census Kodak (United States)
2.6 Imaging for Census Data Capture  Kodak Philippines Ltd.
2.8  Demonstration of pilot application in Statistics Indonesia (hardware support) Fujitsu, Thailand
2.9 Integrated demonstration on forms Co-ordinated by Scientific Digital Business, Thailand
-  Forms capture  Kodak
-  Forms recognition Top Image Systems.
4.1 Data warehouse implementation approach and methodology Unisys Thailand Ltd.
4.2 SAS approach and fitness to data warehouse processes  SAS Institute Pte Ltd, Bangkok, Thailand
4.3 SAS demonstration SAS Institute Pte Ltd, Bangkok, Thailand
6.2 Production of quality  maps for censuses Kevron Pty. Ltd, Australia
D. Documentation
9. The documents presented at the Workshop are listed in Annex III to the report.
II. INTRODUCTION TO INFORMATION TECHNOLOGY IN CENSUS OPERATIONS
A. Project RAS/96/P12
10. The Workshop noted the extensive activities and outputs of the UNFPA-funded project RAS/96/P12, entitled the Application of New Technology in Population Data Collection, Processing, Dissemination and Presentation, and its Working Party on Application of New Technology to Population Data.  The project had been initiated in April 1997 with the objective of improving the capabilities of member and associate member countries/areas of ESCAP in the application of modern information technology (IT) in population statistics production and dissemination. 
11. The Workshop reiterated the importance of providing valid, reliable and timely data for developing population policies and programmes.  The application of modern IT would be more important than ever in achieving that goal.
12. It was noted that the ability to exploit modern IT varied greatly in the region, but that diversity also offered an opportunity for intra-regional cooperation.  Thus, the basic thrust of the project was to share the experiences of NSOs that had made significant progress in exploiting new technology.  At the beginning of project implementation, a Working Party was established with experts from nine countries to identify priorities, to provide guidance in the systematic application of IT, to consolidate the experience of the countries and to share those experiences within the region.
13. Since 1997, the Working Party had met four times to identify and discuss the topics of principal interest to the project.  Each meeting had focused on one of the technology areas for which members had contributed a large number of technical papers.  Other project outputs included self-contained guidelines on the application of new technology to three important aspects of census processing, namely (a) population data collection and capture; (b) mapping and geographic information systems; and (c) population data dissemination.  The Working Party also guided the implementation of three pilot projects under RAS/96/P12, one each by the NSOs of Bangladesh, Indonesia and Philippines, to test such new technologies.  Each project would produce a report at the Workshop describing the technologies piloted and experiences gained.
14. The Workshop noted that further outputs of the project included five newsletters, a web site containing documents of the Working Party meetings, an awareness package to promote effective and efficient utilization of IT in population census and survey processing, and a survey on the application of IT within the region.
B. Objective of the Workshop
15. The participants noted that the overall objective of the Workshop was to sensitize participants to the opportunities that modern information technology provided in population data operations.  Immediate objectives of the Workshop were (a) to provide information that would improve the basic understanding of new technologies relevant to population censuses and surveys; (b) to discuss advantages and constraints of important new information technologies; (c) to consider strategic implications that information technology would have on the planning, conduct and processing of population censuses and surveys; and (d) to facilitate the understanding of the overall role of new technology in conducting censuses and surveys.
C. Census processes
16. The Workshop reviewed major processes and activities associated with the conduct of censuses or large-scale population surveys.  Three distinct phases were identified.  The pre-enumeration stage included census planning, census organization, questionnaire design, forms and manuals drafting, cartography, publicity, data processing system design and development, and the conduct of the pilot census.  The census planning entailed obtaining legal and financial support from the Government, estimating resource requirements, preparing budgets and scheduling the event.  The census organization established central and field offices, created national and regional committees and co-ordinated with other Government offices.  The questionnaire design required dialogue with potential users and was a precursor to developing the tabulation plan.  The questionnaire, forms, manuals and the data processing system were tested during the pilot census.  The enumeration stage included the recruitment and training of field workers, the establishing of house listings, the actual enumeration and the post-enumeration survey.  The post-enumeration stage included the data processing from data capture to final tabulations, the analysis of results, the evaluation of the census process, and the dissemination of reports.
17. The Workshop noted that, during the previous round of censuses, countries of the region had needed from 3 to 7 years in order to complete a census programme from the initial planning stage until the basic results were disseminated.
D. Technology applied in recent censuses and surveys
18. The Workshop reviewed the results of the ESCAP Survey on Application of New Technology in Population Data Collection, Processing and Dissemination, conducted in April 1998.  The questionnaire had been sent to 56 national statistical offices in the region and 29 responses were returned.  The report was published as document STAT/WNIT/1 and was made available to the participants of the Workshop.
19. The survey had revealed a broad infrastructure gap among the countries of the region.  Technologically advanced offices provided network-connected PCs for every staff member, including individual e-mail addresses and instant Internet connections.  Offices with the weakest IT infrastructure had practically no internal or global network connectivity available for general use and as many as 15 persons had to share a PC.
20. According to the Survey, on average it took 17 months from the beginning of data collection to the tabulation and analysis of results.  In some cases, up to four years were needed.
21. The Workshop noted that technologically advanced NSOs developed applications in-house and used IT across all operations.  Such custom-made applications were typically developed in areas of data scrutiny, data editing, data estimation and tabulation, whereas data analysis was usually conducted with commercially available statistical software packages.  Overall, a significant use was indicated of off-the-shelf software packages, but there was no significant difference in the prevalence of brand names between developed and developing countries.
E. IT trends
22. The Workshop reviewed recent trends in information technology and noted that hardware and software developments produced data processing systems with ever increasing power, capacity and complexity which at the same time had become easier to use and cheaper to acquire.
23. Chip processing speeds commonly available were 400 MHz or better, while RAM sizes mostly exceeded 32 MB.  Together with graphics accelerators and other technical features, that configuration translated into substantial processing power, which in turn was a basis for the development of increasingly capable software systems.  Disk storage systems of 6 GB or more, with random access times of a few milliseconds, came as standard equipment with current desktop computers and were sufficient to store the entire census data files for a medium-sized country of 100 million people.  Optical storage media with 5 to 18 GB capacities were readily available and could be used for the long-term storage of census data.  Processing and storage/retrieval speeds were no longer a constraint when scheduling the data processing operations.  Rather, delays caused by slow human interventions were very often responsible for the overall elapsed processing time.
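As a rough check of the storage claim in paragraph 23, the following sketch computes the per-person byte budget implied by a 6 GB disk holding records for 100 million people.  The function name and the decimal reading of GB (10**9 bytes) are assumptions made for illustration only; they are not taken from the Workshop material.

```python
def bytes_per_person(disk_gb: float, population: int) -> float:
    """Return the storage budget per person record, in bytes.

    Assumes 1 GB = 10**9 bytes (decimal gigabytes), an illustrative choice.
    """
    return disk_gb * 10**9 / population


# Figures quoted in the paragraph: a 6 GB disk and 100 million people.
budget = bytes_per_person(6, 100_000_000)
print(f"{budget:.0f} bytes available per person record")
```

A budget of about 60 bytes per person is enough for a compact coded record, which is consistent with the statement that a standard desktop disk could hold the data files of such a census.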
24. Various versions of the Microsoft Windows operating system were currently being used on a large majority of all desktop computers.  General purpose and dedicated software were widely available for the Windows platform, some obtainable at low cost or no cost at all, and sufficed to manage most data processing tasks at the statistical office.
25. While individual desktop computers already had substantial and often sufficient processing power, using local area networks with a dedicated file server further enhanced the efficiency of the entire operation by pooling resources, reducing or eliminating redundancies, and centrally managing common tasks such as data back-up.  Where infrastructure permitted, wireless communications were becoming an important tool for interfacing between various computer components.  The Internet, with features such as e-mail and the World Wide Web, had gained importance firstly for the dissemination of information about the statistical office and its products and secondly for collecting data from respondents.
26. Thus, virtually all phases of the census process could benefit from the latest technologies.  Those would include project planning software, geographic information systems, paperless data capture methods, scanning with mark, character and intelligent recognition techniques, automatic or computer assisted coding and editing methods, metadata systems, CD/DVD and Internet/World Wide Web media, etc.
F. Quality management
27. The Workshop noted that quality control during all census phases posed a major challenge from data collection to data validation and editing, tabulation and dissemination.  Process quality management (PQM) focused on careful planning and efficient implementation of the census process, including human resource management and the management of production means.  Statistical quality management (SQM) related to the management of the metadata database and the integrity of the data during the entire process of transformation from raw data to publishable micro databases and statistical tables.  A better quality of the end product would assure greater user satisfaction.
28. The Workshop noted further that quality management issues were often underestimated.  The introduction of new technologies could provide an opportunity to give special consideration to the application of quality management principles to the entire census operation.  Census managers were urged to assess each new application in respect of its potential capability to control process quality as well as statistical quality.  They also needed to assess the impact of the new technology on non-computerized statistical, management and administrative processes and organization structures.  As each application could interfere with others, special attention also needed to be paid to interoperability.
29. The Workshop considered that many new technologies might be presented during the course of the Workshop that would be of interest to IT management involved in the planning and processing of the forthcoming census.  This wealth of new information posed another considerable challenge to IT management, who would be required to select a combination of IT solutions that fit the existing infrastructure.  In that selection process, IT management should not overlook the effect those new technology solutions would have on the ability to maintain or improve both process and statistical quality management.
G. Expectations for the Workshop
30. The participants were invited, based on the agenda and without having yet heard the presentations, to rate their interest in the various Workshop topics.  Six work groups were created to deliberate on the question.  The findings of each group were presented to the other participants.  It appeared that Module 2, paper based data collection and capture, attracted the highest interest from participants, probably because the next census date was close for many countries and solutions had to be found soon.  The respondents also expressed high interest in the topics of dissemination and geographic information systems.  However, non-paper based data capture methods and data warehousing received lower advance interest, probably because those technologies required sufficiently developed infrastructure and general technological advancement, which only the most advanced countries had.
31. The Workshop agreed that one of the important expectations for the 2000 round of censuses was to significantly reduce the time needed for the entire census process, from planning to final reporting, by employing some of these new technologies in the various stages of census data processing.  Also, the final quality of processed data could be improved by better quality control throughout the process.  Furthermore, a wider and more targeted audience could be reached by employing better dissemination methods that made effective use of IT.  Significant quality and timeliness gains could be achieved by improving data collection and capture methods, and much effort could be spared when preparing census maps by using Geographic Information Systems.  Finally, where possible, increased use of the Internet, including the World Wide Web, showed great promise for more efficient information exchange.
32. However, the Workshop emphasized that individual countries would have to consider the level of local infrastructure and resource availability when deciding on the use of any of the available technologies.  The availability of technical support and maintenance was of crucial importance to the successful utilization of new technologies.
III. PAPER BASED DATA COLLECTION AND CAPTURE
33. The Workshop was presented with an overview of paper based data collection and capture technologies.  It was noted that traditional key-to-disk methods were time consuming, demanded a large quantity of equipment and personnel and were, due to the human factor, not always fully reliable.  Employing technology-assisted solutions would improve efficiency, economy and reliability in the data capture process.  Optical mark and character recognition systems were well tested, had become increasingly versatile and reliable, and could therefore significantly reduce the time needed for data capture and make subsequent processing more flexible.  Imaging technology in particular promised improved efficiency by largely eliminating the need to return, at later processing stages, to paper documents that were always cumbersome to handle.  Experience showed that keying from image could be more efficient than keying from paper, which could particularly benefit the coding and editing tasks.
A. Optical Mark Recognition (OMR)
34. Based on the example of Japan, the Workshop was given a detailed introduction to optical mark reader (OMR) technology.  The hardware components of an OMR system comprised a feeding unit, a photoelectric conversion unit, and a recognition control unit.  The feeding unit consisted of a hopper for documents to be read and several stackers for accepted and rejected documents.  The photoelectric conversion unit used sensors to convert marks on the document to electric signals and forwarded the signals to the image memory.  Finally, the recognition control unit read those images and stored recognized marks onto a magnetic medium.  Marks could be recognized in "alternative mode", i.e. only one mark was expected for one question and the darkest mark was selected if several marks happened to be found, and in "bit mode", i.e. several marks were expected for one question and all recognized marks were stored in the file.
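The difference between the two recognition modes described in paragraph 34 can be sketched as follows.  This is a minimal illustration and not the logic of any particular OMR device: the darkness scale from 0.0 to 1.0, the detection threshold of 0.5 and the function names are all assumptions.

```python
from typing import List, Optional


def resolve_alternative(darkness: List[float], threshold: float = 0.5) -> Optional[int]:
    """Alternative mode: one answer is expected per question.

    Returns the index of the darkest detected mark, or None if no box
    reaches the (assumed) detection threshold.
    """
    marked = [i for i, d in enumerate(darkness) if d >= threshold]
    if not marked:
        return None
    return max(marked, key=lambda i: darkness[i])


def resolve_bit(darkness: List[float], threshold: float = 0.5) -> List[int]:
    """Bit mode: several answers are expected; keep every detected mark."""
    return [i for i, d in enumerate(darkness) if d >= threshold]


# Hypothetical darkness readings for the four mark boxes of one question.
readings = [0.1, 0.7, 0.9, 0.2]
print(resolve_alternative(readings))  # index of the darkest mark: 2
print(resolve_bit(readings))          # all detected marks: [1, 2]
```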
35. The Workshop noted the high quality requirements for OMR forms, which needed to be carefully designed in order to improve processing and recognition reliability.  Paper and printing quality had to be high, dropout colours had to be used for lead text and mark boxes, the shape and size of the mark boxes had to be carefully designed and sufficient distance had to be maintained between the mark boxes.  The OMR form needed also to include timing marks along the aligning edge in the direction of reading.  Finally, it was important that the mark boxes were completely filled with a soft black pencil and that wrong marks should be erased completely.  Since OMR forms were designed to be readable by the equipment, staff designated to handle the forms needed special training to fully understand the content.
36. The Workshop noted that OMR equipment had to be tested for reliability and recognition stability at least three times daily, namely, before, during and after the operation.  Failing those tests, the equipment needed to be cleaned, adjusted or repaired, as the case might be.  In addition, the equipment needed to be cleaned daily by removing paper powder from the mark and image heads, feeding unit and other susceptible parts.  Normally, a monthly maintenance service was to be scheduled by the vendor.
37. The Workshop agreed that OMR technology was a reliable and economical choice for censuses and surveys if the responses could be pre-coded.  However, it acknowledged that the particular requirements for questionnaire design and paper and printing quality were the main drawbacks of the technology.  For instance, enumerators, respondents and editors could have difficulties in using the questionnaires due to their highly machine-oriented layout.  Therefore it was necessary to allocate sufficient time and funds for training the enumerators and the OMR operating personnel.  The Workshop noted that leasing was one way to reduce cost.
B. Demonstration of Optical Mark Reader (OMR)
38. Data & Research Services (DRS) plc, a British company manufacturing OMR equipment and operating a data capture service bureau, provided the Workshop with an overview of OMR products and services and highlighted some of OMR's advantages and disadvantages compared with key-to-disk data capture.  The Workshop was informed that OMR was capable of capturing 7,000 forms per hour, a huge improvement over manual key entry.  Optical reading also improved data quality.  It was pointed out that as data volumes increased the use of OMR became more economical than key-to-disk data capture, particularly where predominantly pre-coded tick-box responses could be used.  Some disadvantages of OMR were mentioned, including the need for specially designed and accurately printed, and therefore more costly, questionnaires and the difficulty of capturing subjective data, i.e. textual responses.  The Workshop heard that OMR would be more efficient and cheaper than optical character recognition systems (OCR) as long as the majority of responses could be pre-coded.
39. Recognizing that a census questionnaire often had to include some textual responses, DRS had developed a new generation of OMRs that added an image recognition unit.  The captured images would be stored in a file and could be viewed by coding and editing operators, who would key in information from the image, possibly assisted by a computerized table-lookup system.  The bulk of the information, however, would still be captured using the significantly more efficient mark reading technology.
40. A demonstration of a small-capacity desktop OMR reading actual Greek census forms concluded the presentation by DRS, which the Workshop found most useful.
C. Optical Character Recognition (OCR)
41. The Workshop noted that in some contexts the recognition of handwritten numerals and letters was referred to as Intelligent Character Recognition (ICR) to distinguish that technology from the recognition of printed text and numbers.  This report, however, uses the term OCR to cover all character recognition.
42. Kodak (United States) had been invited to introduce to the Workshop optical character recognition (OCR) technology as used in the 1990 United States census.  The Workshop was informed that to obtain maximum reliability in the scanning process, special care had to be taken when designing and printing the questionnaires.  The measures included the use of non-carbon based ink and dropout colours.  As with OMR forms, the design of OCR forms had to be a compromise between maximizing the ease of use by the enumerators, coders and editors on the one hand and optimizing the efficiency of the recognition software on the other.  Experience showed that the best recognition rates for handwritten responses were achieved at a scanning resolution of 200 dots per inch (dpi) or lower; higher resolutions generally worsened the recognition rates.
43. It was explained that the confidence level of character recognition was user definable and was dependent on the overall document quality, i.e. questionnaire design and clarity of hand written responses.   However, setting the confidence level too high, e.g. above 90 per cent, could result in excessive numbers of rejects, while setting the level much lower could jeopardize the quality of the output data.  The Workshop noted that one of the major problems in character recognition was the acceptance of positively but wrongly identified characters.  In consequence, reduction of the number of "false positives" would have the most benefit for the overall quality of the captured data.
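The trade-off described in paragraph 43 can be illustrated with a small sketch.  The sample data, the confidence scale from 0.0 to 1.0 and the function name are hypothetical and chosen only to show the effect; they do not come from any recognition engine presented at the Workshop.

```python
from typing import List, Tuple

# Each sample: (recognized character, engine confidence, true character).
Sample = Tuple[str, float, str]


def summarize(samples: List[Sample], threshold: float) -> Tuple[int, int]:
    """Return (rejects, false_positives) for a given confidence threshold.

    A character is rejected (sent to manual keying) when its confidence is
    below the threshold.  A false positive is an accepted character that
    differs from the true value.
    """
    rejects = sum(1 for _, conf, _ in samples if conf < threshold)
    false_positives = sum(
        1 for rec, conf, truth in samples if conf >= threshold and rec != truth
    )
    return rejects, false_positives


# Hypothetical recognition results for six handwritten digits.
samples = [
    ("7", 0.98, "7"),
    ("1", 0.93, "7"),  # confidently wrong: a false positive at either threshold
    ("4", 0.85, "4"),
    ("0", 0.72, "6"),
    ("3", 0.95, "3"),
    ("9", 0.60, "9"),
]

for threshold in (0.90, 0.70):
    rejects, false_positives = summarize(samples, threshold)
    print(f"threshold {threshold:.2f}: {rejects} rejects, {false_positives} false positives")
```

In this made-up sample, lowering the threshold reduces the rejects from three to one but doubles the false positives, while one confidently misread character is accepted at both settings, which is the kind of error the paragraph identifies as most harmful.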
44. On a unit cost basis, the economics of keying-from-paper (KFP) and keying-from-image (KFI) were compared.  At the labour cost assumed, the calculations suggested that the break-even point was at about 400,000 census forms, i.e. beyond that number KFI would become more economical.  It was pointed out that KFI might be feasible even with fewer forms, if improved data quality at the data capture stage, reduced costs for the additional processing steps and increased capture speed resulting in earlier completion of the entire census process were taken into account.
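A break-even point of the kind mentioned in paragraph 44 follows from a simple linear cost model: total cost equals a fixed cost plus a per-form cost times the number of forms.  The sketch below uses that model; the cost figures are purely hypothetical and were picked only so that the result matches the 400,000 forms quoted, since the report does not give the underlying costs.

```python
def break_even_forms(fixed_kfi: float, unit_kfi: float,
                     fixed_kfp: float, unit_kfp: float) -> float:
    """Number of forms at which KFI and KFP cost the same.

    Solves fixed_kfi + unit_kfi * n == fixed_kfp + unit_kfp * n for n.
    """
    if unit_kfp <= unit_kfi:
        raise ValueError("KFI only breaks even if its per-form cost is lower than KFP's")
    return (fixed_kfi - fixed_kfp) / (unit_kfp - unit_kfi)


# Hypothetical costs in arbitrary currency units (not from the report):
# KFI needs scanners and software up front but is cheaper per form.
n = break_even_forms(fixed_kfi=100_000, unit_kfi=0.25,
                     fixed_kfp=0, unit_kfp=0.50)
print(f"KFI becomes cheaper beyond about {n:,.0f} forms")
```

The other benefits listed in the paragraph, such as better data quality and earlier completion, would in effect lower the KFI cost terms and so move the break-even point to a smaller number of forms.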
OCR technology for the Indonesian Census 2000
45. The Workshop was informed about the background and rationale on the basis of which Indonesia selected OCR as the data capture method for the year 2000 census.  Major considerations had been (a) the very large number of forms to be processed for a population of more than 200 million; (b) the need to produce small area statistics based on the many island areas; and (c) the possibility of publishing basic results within 3 to 6 months.  The availability of external assistance in the form of equipment, software and expertise had also been helpful in the decision.
46. The OCR system and the questionnaire design had been assessed and tuned in several pilot tests.  The changes in the questionnaire design had improved the recognition results significantly.  Further improvements had been achieved by replacing the built-in western character set in the recognition engine with a localized version of the character map.  The local version had been developed from writing samples submitted by 5,000 different persons.  However, it was eventually decided that it was better to omit the recognition of alpha characters and to concentrate on maximizing the performance of numeric recognition and mark reading.
47. The Workshop was given an overview of the processing flow of an OCR based system in Indonesia.  The OCR system consisted of three steps, namely scanning, recognition and verification.  The scanning of questionnaires produced an image file in TIF format.  That was compared to a template file containing information about the relative locations of input in the questionnaire.  The resulting digital output file was then submitted to the verification process in order to produce a clean data file. 
48. The Workshop learned about the issues and principles involved in the OCR questionnaire design in Indonesia.  It was noted that OCR equipment required less stringent paper quality and printing accuracy than did OMR.  Instead, four rectangular registration markers were placed near the corners of the questionnaire page to define the location of individual fields relative to these registration markers, thus providing greater tolerance for misaligned forms being fed through the scanner.  Data fields were placed on the page as boxes of sufficient size to allow clear handwriting, with appropriate distance between them to minimize the risk of misinterpretation.  Depending on the use, field types could be defined as containing marks or textual information.  For textual boxes, it was recommended that two vertical dots be printed within each character box to guide the respondent or enumerator and thus improve the quality of handwriting.  Standard form-processing tools could normally be used for developing the questionnaire.  Once the design was complete, the questionnaire was scanned to produce an image file that was input to the NCS Nestor Reader editing function in order to create the master questionnaire (the template file mentioned above) in ZDF format.  The questionnaires used for data collection were printed with dropout colours.
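The way registration markers allow fields to be located on a misaligned scan can be sketched as follows.  This is an illustration of the general principle only and not the method used by the Nestor Reader software; the function name and the coordinates are hypothetical.  The sketch assumes that the template markers are axis-aligned and uses three of the four markers (top-left, top-right, bottom-left); the fourth could serve as a consistency check.

```python
def locate_field(field_xy, template_markers, scanned_markers):
    """Map a field position from template coordinates to scan coordinates.

    Each markers argument is a dict with keys "tl", "tr" and "bl" holding
    the (x, y) centres of the top-left, top-right and bottom-left markers.
    """
    (t_tlx, t_tly) = template_markers["tl"]
    (t_trx, _) = template_markers["tr"]
    (_, t_bly) = template_markers["bl"]

    # Express the field as fractions of the distance between markers.
    u = (field_xy[0] - t_tlx) / (t_trx - t_tlx)
    v = (field_xy[1] - t_tly) / (t_bly - t_tly)

    (s_tlx, s_tly) = scanned_markers["tl"]
    (s_trx, s_try) = scanned_markers["tr"]
    (s_blx, s_bly) = scanned_markers["bl"]

    # Rebuild the position from the marker positions found in the scan,
    # which absorbs shift, rotation, scaling and skew of the page.
    x = s_tlx + u * (s_trx - s_tlx) + v * (s_blx - s_tlx)
    y = s_tly + u * (s_try - s_tly) + v * (s_bly - s_tly)
    return (x, y)


template = {"tl": (0, 0), "tr": (100, 0), "bl": (0, 200)}
shifted_scan = {"tl": (5, 3), "tr": (105, 3), "bl": (5, 203)}
print(locate_field((50, 100), template, shifted_scan))
```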
49. The Workshop was given a hands-on demonstration of developing an OCR questionnaire using the Visio Technical software.  The form design included text, recognition mark and check boxes.  It was thus shown that the questionnaire design could be developed by the user without assistance from the software company.  In contrast, the validation and editing rules were programmed in Visual Basic and were linked to the Nestor Reader software, a more difficult task that perhaps needed assistance from the vendor.
50. The Workshop also observed a practical demonstration of a less powerful but similar system to the one that Indonesia was planning to use, showing the scanning and recognition of characters and marks, and the output of questionnaire data to a digital file.
51. The Workshop heard that Indonesia was planning to deploy for its 2000 census some 80 OCR systems, consisting of Fujitsu Scanners M3099GX, NCS Nestor Reader 5.0, Visio Technical, the scanning software Scan All, and Fujitsu PCs.  The systems would be distributed across the country, allocated to provinces according to their population size.  After the census, those systems would be allocated for long-term use at smaller regional offices.  The Workshop heard that greater emphasis would be placed on enumerator training, particularly on the writing of numbers.  Statistics Indonesia had chosen to use cardboard boxes instead of plastic satchels for storing and transporting the questionnaires.  The boxes were designed to serve the dual purpose of better protecting the forms in the humid climate and providing a writing surface for the enumerator when filling in the forms.
52. For the Indonesian census, coding would be done in the office before the forms were scanned.  The Workshop discussed the feasibility of reversing the sequence, i.e. of subjecting the forms first to scanning and then only to computer assisted coding from the scanned images.  It was concluded that the feasibility depended on the availability of suitably trained staff.
Demonstration of OCR cluster
53. The Workshop observed a practical demonstration by Top Image System (TIS) of the TIS AFPS Pro recognition cluster that used a Kodak scanner with a controlled station linked to six Pentium PC stations performing the following functions: (1) processing; (2) tile; (3) completion; (4) exception handling; (5) archive and export; and (6) controlling.  It noted the flexibility to inspect recognition results by character (tile mode) and appreciated the system's simplicity and efficiency in facilitating the identification of visibly misinterpreted characters.
54. Depending on the overall workload, the number of computers for each processing step could be increased or decreased and depending on current workflow conditions, i.e., bottlenecks, the usage of any computer could be temporarily or permanently reassigned to another function in order to keep the overall system performance well balanced.
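The reassignment of computers described in paragraph 54 can be sketched with a simple rule.  The rule, the function name and the figures are hypothetical and do not describe how the TIS software made such decisions; the sketch assumes that the backlog of items waiting at each function is known and moves one computer at a time from the least loaded function to the most loaded one.

```python
def rebalance(stations, backlog):
    """Move one station from the least loaded function to the busiest.

    stations: dict mapping function name to number of PCs assigned
    backlog:  dict mapping function name to number of items waiting
    Load is measured as backlog per station.  A function never drops
    below one station.  Returns a new dict; the input is not modified.
    """
    load = {f: backlog[f] / stations[f] for f in stations}
    busiest = max(load, key=load.get)
    donors = [f for f in stations if f != busiest and stations[f] > 1]
    if not donors:
        return dict(stations)
    idlest = min(donors, key=load.get)
    if load[idlest] >= load[busiest]:
        return dict(stations)
    updated = dict(stations)
    updated[idlest] -= 1
    updated[busiest] += 1
    return updated


# Hypothetical snapshot of a cluster with a bottleneck at "completion".
stations = {"processing": 2, "completion": 1, "exception handling": 1}
backlog = {"processing": 100, "completion": 900, "exception handling": 50}
print(rebalance(stations, backlog))
```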
55. To highlight the efficiency of the modular approach, the example of the 1997 Turkish Census was cited.  In that census, questionnaires for 62 million people were scanned and recognized in 30 days, albeit only for a subset of variables.  The Workshop noted that the processing time was inversely related to the number of available scanning and recognition clusters.  It was informed that TIS had achieved alpha recognition rates as high as 94 per cent (in Brazil) and 98 per cent (in Germany), although the latter case involved less elaborate forms than census questionnaires.
56. Improvements in recognition rates achieved by the TIS software were attributed to several advanced techniques, including (a) image enhancement; (b) form identification and removal (lift-off); (c) use of several recognition engines with voting algorithms; (d) trainable recognition algorithms, including local writing styles; (e) validation function and rules; (f) automatic coding; and (g) visual inspection in tile mode.
57. The Workshop heard that the form identification and removal feature eliminated the need for dropout colours and would significantly reduce the required storage space.  The voting algorithms would evaluate the results of several recognition engines and select the best answer according to pre-defined rules.  The tile mode would show, for each character from 0 to 9 and A to Z, one at a time, a table containing all images that had been interpreted as that character.  That feature provided an efficient means of visually inspecting all images at a glance and easily identifying those images that did not correspond to the character under review.
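The grouping behind the tile mode can be sketched in a few lines.  The function name and the image identifiers are hypothetical; the sketch only assumes that the recognition step yields, for each character image, the character it was interpreted as.

```python
from collections import defaultdict


def build_tiles(recognized):
    """Group character images by the character they were recognized as.

    recognized: iterable of (image_id, character) pairs
    Returns a dict mapping each character to the list of image ids
    interpreted as that character, in input order.
    """
    tiles = defaultdict(list)
    for image_id, character in recognized:
        tiles[character].append(image_id)
    return dict(tiles)


# Hypothetical recognition output.
tiles = build_tiles([("img1", "7"), ("img2", "1"), ("img3", "7")])
print(tiles["7"])
```

An operator reviewing the table for "7" would see all images in that group side by side, so that any image that is visibly a different character stands out.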
New Zealand experience in 1996
58. The Workshop learned that for the 1996 New Zealand Population Census imaging and character recognition were used to capture the data.  Benefits compared with the 1991 census included: results released 5 months earlier; cost savings for data capture estimated at 9 per cent; noticeable reduction in paper handling and storage (particularly after the capture); and easier access to forms during coding and editing.  In addition, better quality control was gained, fewer staff needed to be recruited and trained, and access to census data for comparison with the post-enumeration survey was easier.
59. The following lessons were learned from the use of imaging and character recognition in the 1996 New Zealand Population Census: (a) systematic recognition errors for certain characters produced biased results; (b) the use of images for coding and editing was a distinct advantage; (c) more data validation during data capture would improve overall data quality; and (d) high-priority variables could easily be processed first.  The Workshop was informed that further contracting out of the data capture process might give significant long-term economic benefits.  Last but not least, imaging should not be used just as a replacement for traditional data capture methods; rather, the entire census process could beneficially be re-thought on that occasion.
Observations and recommendations on OCR
60. The Workshop noted that recognition engines could be expensive and therefore the use of multiple engines had to be carefully evaluated.  However, it was also recognized that no single recognition engine would give 100 per cent results in all circumstances and that different engines had different strengths and weaknesses.  Thus, using several recognition engines with a voting mechanism could significantly improve the overall recognition rate.
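A voting mechanism of the kind referred to in paragraph 60 can be sketched as a simple majority rule.  The rule shown here is an assumption made for illustration; the pre-defined rules actually used by vendors were not described in detail, and the function name and parameter are hypothetical.

```python
from collections import Counter


def vote(candidates, min_agreement=2):
    """Return the character most engines agree on, or None to reject.

    candidates: one entry per recognition engine, either the character
    proposed by that engine or None if the engine gave no answer.
    The winner must be proposed by at least min_agreement engines and
    must not be tied with another character.
    """
    counts = Counter(c for c in candidates if c is not None)
    if not counts:
        return None
    ranked = counts.most_common(2)
    best, best_count = ranked[0]
    if best_count < min_agreement:
        return None
    if len(ranked) > 1 and ranked[1][1] == best_count:
        return None  # tie between two characters: send to an operator
    return best


print(vote(["3", "3", "8"]))  # two of three engines agree
print(vote(["3", "8", "5"]))  # no agreement, character is rejected
```

Rejected characters would go to manual verification, so a stricter rule trades more operator work for fewer wrongly accepted characters.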
61. The Workshop recommended that users should demand that competing vendors of census data capture systems demonstrate that the promised capabilities of their system would work under local circumstances, i.e. in the physical and infrastructure environment of the user as well as with the specific forms as developed by the user.
62. The Workshop noted that the use of technologically advanced solutions should not be an end in itself and that consideration should be given to local circumstances, e.g., to the constraints arising from limited financial, technical and personnel resources.
63. The Workshop also noted that paper based methods continued to be used for data collection, particularly when the general public was filling in the questionnaires.  It was noted that non-response remained one of the main problems in census taking.
64. The Workshop discussed the benefits and drawbacks of paper based data collection and capture methods.  Considerable interest was shown in the topic and the following were the observations by the Workshop:
  • improved technology had helped the census process in many developing countries;