ESCAP Statistics Division
Workshop on Application of New Information Technology to Population Data
Bangkok, 12-20 October 1999

STAT/WNIT/Rep
16 June 2000
ENGLISH ONLY

ECONOMIC AND SOCIAL COMMISSION FOR ASIA AND THE PACIFIC

Workshop on Application of New Information Technology to Population Data
12-20 October 1999
Bangkok

Report on the Workshop on Application of New Information Technology to Population Data

The designations employed and the presentation of the material in this report do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or area, or of its authorities, or concerning the delimitation of its frontiers or boundaries.  Mention of any firm, licensed process or product does not imply endorsement by the United Nations.  This report has been issued without formal editing.
Contents
Abbreviations and Descriptions
  1. ORGANIZATION OF THE WORKSHOP
    1. Attendance
    2. Opening of the Workshop
    3. Workshop arrangements
    4. Documentation
  2. INTRODUCTION TO INFORMATION TECHNOLOGY IN CENSUS OPERATIONS
    1. Project RAS/96/P12
    2. Objective of the Workshop
    3. Census processes
    4. Technology applied in recent censuses and surveys
    5. IT trends
    6. Quality management
    7. Expectations for the Workshop
  3. PAPER BASED DATA COLLECTION AND CAPTURE
    1. Optical Mark Recognition (OMR)
    2. Demonstration of Optical Mark Reader (OMR)
    3. Optical Character Recognition (OCR)
      1. OCR technology for the Indonesian Census 2000
      2. Demonstration of OCR cluster
      3. New Zealand experience in 1996
      4. Observations and recommendations on OCR
    4. Archiving of census forms
  4. NON-PAPER BASED DATA COLLECTION AND CAPTURE
    1. Computer Assisted Telephone Interviewing
      1. Internet and CATI in Singapore Census 2000
    2. Computer Assisted Personal Interviewing
  5. IMPLICATIONS FOR THE GUIDELINES ON THE APPLICATION OF NEW TECHNOLOGY TO POPULATION DATA COLLECTION AND CAPTURE
  6. ADDING VALUE TO CENSUS DATA THROUGH DATA WAREHOUSING AND DATA MINING
  7. DATA DISSEMINATION
    1. Implications for the guidelines on the Application of New Information Technology to Population Data Dissemination
  8. GEOGRAPHIC INFORMATION SYSTEMS
    1. Implications for the guidelines on the Application of Geo-Positioning Systems and Geographic Information Systems for Digital Mapping and Statistical Management
  9. RECOMMENDATIONS OF THE WORKSHOP
    1. General, IT management
    2. Data collection and capture
    3. Guidelines
    4. Data warehousing, databases, data archiving
    5. Data Dissemination
    6. Mapping and GIS
    7. Follow up
Annex I: List of Participants
Annex II: Tentative Time Schedule
Annex III: List of Documents
ABBREVIATIONS AND DESCRIPTIONS
AFPS Pro  Comprehensive application for high-volume forms processing based on advanced imaging technologies.
ArcInfo Comprehensive GIS software for a variety of computing environments.
ArcView Desktop mapping and GIS software.
Blaise A survey data collection and processing system.
CAPI Computer Assisted Personal Interviewing.
CARS Classifications and Related Systems.
CATI Computer Assisted Telephone Interviewing.
CGI  Common Gateway Interface facilitating dynamic content provision from web servers to client computers.
CSV format "Comma Separated Value" format.  An ASCII file that is commonly used as an intermediate format when transferring files between databases and spreadsheets of different makes.  Values are enclosed in quotation marks and separated by commas.
dpi dots per inch.
EA Enumeration Area.
FLY (fly) C program that creates GIF image files on the fly from CGI and other programs.
GIS Geographic Information System.
GPS Global Positioning System.
HTML HyperText Markup Language.
ICR Intelligent Character Recognition.
IMPS Integrated Microcomputer Processing System.
IT Information technology.
KFI  Keying-from-image.
KFP Keying-from-paper.
LAN Local Area Network.
MapInfo Software product for mapping, data visualization and GIS.
NCS Nestor Reader Development tool for building forms processing or automatic data capture/entry applications.
NSO(s) National Statistical Office(s).
OCR Optical Character Recognition.
OLAP Online Analytical Processing.
OMR Optical Mark Recognition/Reader.
PC Personal computer.
PDF Portable Document Format.
PopMap Integrated geographical software providing maps and a graphics database.
PQM Process Quality Management.
SAS Statistical Analysis System.
SIAP Statistical Institute for Asia and the Pacific.
SPSS  Statistical Package for Social Sciences.
SQL Structured Query Language.
SQM  Statistical Quality Management.
SuperCROSS   Fast cross-tabulation software.
SuperMAP  Mapping software.
TCDC Technical Cooperation among Developing Countries.
TIFF format Tagged Image File Format.
TREND Time Series Retrieval and Dissemination Database.
UNFPA United Nations Population Fund.
UNFPA/CST United Nations Population Fund /Country Support Team.
I. ORGANIZATION OF THE WORKSHOP
A. Attendance
1. The Workshop on Application of New Information Technology to Population Data, funded by the United Nations Population Fund (UNFPA) under the project RAS/96/P12, was held in Bangkok from 12 to 20 October 1999.  It was organized by the secretariat of the Economic and Social Commission for Asia and the Pacific of the United Nations (ESCAP) with active support of the Working Party on the Application of New Technology to Population Data.
2. The Workshop was attended by thirty-one participants from nineteen selected countries/areas in the Asian and Pacific region: Bangladesh; Fiji; Hong Kong, China; India; Indonesia; Islamic Republic of Iran; Kazakhstan; Malaysia; Maldives; Mongolia; Myanmar; Nepal; Pakistan; Philippines; Republic of Korea; Samoa; Sri Lanka; Thailand and Viet Nam.
3. The members of the Working Party, consisting of nine experts from Australia; Bangladesh; Indonesia; Japan; Macao, China; New Zealand; Philippines; Singapore and Thailand; and representatives of the Statistical Institute for Asia and the Pacific (SIAP) and of the UNFPA Country Support Teams for East Asia, and Central and South Asia, participated as resource persons.  Invited private sector companies also participated as observers and made presentations.
4. The list of participants is attached as Annex I.
B. Opening of the Workshop
5. The Workshop was inaugurated by Ms Kayoko Mizuta, the Deputy Executive Secretary of ESCAP.  In her opening statement, Ms Mizuta welcomed the participants and thanked the donor agency and the resource persons for their role in, and commitment to, the organization and funding of the Workshop.  She expressed appreciation for the cooperation extended to the Workshop by private sector organizations.  She noted that the Workshop was one of the outputs of the ESCAP project RAS/96/P12 and that it was organized under the guidance of the Working Party on the Application of New Technology to Population Data.  Apart from the Workshop, other major outputs of the Working Party included three guidelines on (a) population data collection and capture; (b) modern mapping and GIS; and (c) population data dissemination.
6. In noting the benefits of new technology to statistical services in the region, Ms Mizuta emphasized the role information technology (IT) played in reducing costs of census and survey operations.  While it was not possible to present the full spectrum of technological innovations in just one Workshop, she hoped that, by sharing information and experiences in significant areas of IT, participants would enrich and further improve their understanding of new technologies relevant for their operations. Ms Mizuta closed her opening statement by highlighting that the Workshop materials would be made available through the project web site and by wishing the Workshop success.
C. Workshop arrangements
7. The Workshop noted that the time schedule (see Annex II) prepared by the secretariat was based on the tentative agenda, and agreed to proceed accordingly in six modules as follows:
Module                                                                     Organizer
1. Introduction to IT in census operations                                 ESCAP secretariat
2. Paper based data collection and capture                                 Indonesia and Japan
3. Non-paper based data collection and capture                             Singapore and Australia
4. Adding value to census data through data warehousing and data mining    ESCAP secretariat
5. Data dissemination                                                      New Zealand
6. Geographic information systems                                          Bangladesh
8. The Workshop acknowledged with thanks the following presentations and support by private sector companies:
Topic                                                                                Presenter
2.3 Is OMR technology still feasible?                                                DRS Data and Research Services plc, United Kingdom
2.4 Census Success Story: US Census                                                  Kodak (United States)
2.6 Imaging for Census Data Capture                                                  Kodak Philippines Ltd.
2.8 Demonstration of pilot application in Statistics Indonesia (hardware support)    Fujitsu, Thailand
2.9 Integrated demonstration on forms                                                Co-ordinated by Scientific Digital Business, Thailand
    - Forms capture                                                                  Kodak
    - Forms recognition                                                              Top Image Systems
4.1 Data warehouse implementation approach and methodology                           Unisys Thailand Ltd.
4.2 SAS approach and fitness to data warehouse processes                             SAS Institute Pte Ltd, Bangkok, Thailand
4.3 SAS demonstration                                                                SAS Institute Pte Ltd, Bangkok, Thailand
6.2 Production of quality maps for censuses                                          Kevron Pty. Ltd, Australia
D. Documentation
9. The documents presented at the Workshop are listed in Annex III to the report.
II. INTRODUCTION TO INFORMATION TECHNOLOGY IN CENSUS OPERATIONS
A. Project RAS/96/P12
10. The Workshop noted the extensive activities and outputs of the UNFPA-funded project RAS/96/P12, entitled the Application of New Technology in Population Data Collection, Processing, Dissemination and Presentation, and its Working Party on Application of New Technology to Population Data.  The project had been initiated in April 1997 with the objective of improving the capabilities of member and associate member countries/areas of ESCAP in the application of modern information technology (IT) in population statistics production and dissemination. 
11. The Workshop reiterated the importance of providing valid, reliable and timely data for developing population policies and programmes.  The application of modern IT would be more important than ever in achieving that goal.
12. It was noted that the ability to exploit modern IT varied greatly in the region, but that diversity also offered an opportunity for intra-regional cooperation.  Thus, the basic thrust of the project was to share the experiences of NSOs that had made significant progress in exploiting new technology.  At the beginning of project implementation, a Working Party was established with experts from nine countries to identify priorities, to provide guidance in the systematic application of IT, to consolidate the experience of the countries and to share those experiences within the region.
13. Since 1997, the Working Party had met four times to identify and discuss the topics of principal interest to the project.  Each meeting had focused on one of the technology areas for which members had contributed a large number of technical papers.  Other project outputs included self-contained guidelines on the application of new technology to three important aspects of census processing, namely (a) population data collection and capture; (b) mapping and geographic information systems; and (c) population data dissemination.  The Working Party also guided the implementation of three pilot projects under RAS/96/P12, one each by the NSOs of Bangladesh, Indonesia and Philippines, to test such new technologies.  Each project would produce a report at the Workshop describing the technologies piloted and experiences gained.
14. The Workshop noted that further outputs of the project included five newsletters, a web site containing documents of the Working Party meetings, an awareness package to promote effective and efficient utilization of IT in population census and survey processing, and a survey on the application of IT within the region.
B. Objective of the Workshop
15. The participants noted that the overall objective of the Workshop was to sensitize participants to the opportunities that modern information technology provided in population data operations.  Immediate objectives of the Workshop were (a) to provide information that would improve the basic understanding of new technologies relevant to population censuses and surveys; (b) to discuss advantages and constraints of important new information technologies; (c) to consider strategic implications that information technology would have on the planning, conduct and processing of population censuses and surveys; and (d) to facilitate the understanding of the overall role of new technology in conducting censuses and surveys.
C. Census processes
16. The Workshop reviewed major processes and activities associated with the conduct of censuses or large-scale population surveys.  Three distinct phases were identified.  The pre-enumeration stage included census planning, census organization, questionnaire design, forms and manuals drafting, cartography, publicity, data processing system design and development, and the conduct of the pilot census.  The census planning entailed obtaining legal and financial support from the Government, estimating resource requirements, preparing budgets and scheduling the event.  The census organization established central and field offices, created national and regional committees and co-ordinated with other Government offices.  The questionnaire design required dialogue with potential users and was a precursor to developing the tabulation plan.  The questionnaire, forms, manuals and the data processing system were tested during the pilot census.  The enumeration stage included the recruitment and training of field workers, the establishing of house listings, the actual enumeration and the post-enumeration survey.  The post-enumeration stage included the data processing from data capture to final tabulations, the analysis of results, the evaluation of the census process, and the dissemination of reports.
17. The Workshop noted that, during the previous round of censuses, countries of the region had needed from 3 to 7 years in order to complete a census programme from the initial planning stage until the basic results were disseminated.
D. Technology applied in recent censuses and surveys
18. The Workshop reviewed the results of the ESCAP Survey on Application of New Technology in Population Data Collection, Processing and Dissemination, conducted in April 1998.  The questionnaire had been sent to 56 national statistical offices in the region and 29 responses were returned.  The report was published as document STAT/WNIT/1 and was made available to the participants of the Workshop.
19. The survey had revealed a broad infrastructure gap among the countries of the region.  Technologically advanced offices provided network-connected PCs for every staff member, including individual e-mail addresses and instant Internet connections.  Offices with the weakest IT infrastructure had practically no internal or global network connectivity available for general use and as many as 15 persons had to share a PC.
20. According to the Survey, on average it took 17 months from the beginning of data collection to the tabulation and analysis of results.  In some cases, up to four years were needed.
21. The Workshop noted that technologically advanced NSOs developed applications in-house and used IT across all operations.  Such custom-made applications were typically developed in areas of data scrutiny, data editing, data estimation and tabulation, whereas data analysis was usually conducted with commercially available statistical software packages.  Overall, a significant use was indicated of off-the-shelf software packages, but there was no significant difference in the prevalence of brand names between developed and developing countries.
E. IT trends
22. The Workshop reviewed recent trends in information technology and noted that hardware and software developments produced data processing systems with ever increasing power, capacity and complexity which at the same time had become easier to use and cheaper to acquire.
23. Chip processing speeds commonly available were 400 MHz or better, while RAM sizes mostly exceeded 32 MB.  Together with graphics accelerators and other technical features, that configuration translated into substantial processing power, which in turn was a basis for the development of increasingly capable software systems.  Disk storage systems of 6 GB or more, with random access times in the order of milliseconds, came as standard equipment with current desktop computers and were sufficient to store the entire census data files for a medium-sized country of 100 million people.  Optical storage media with 5 to 18 GB capacities were readily available and could be used for the long-term storage of census data.  Processing and storage/retrieval speed was no longer a constraint when scheduling the data processing operations.  Rather, delays caused by slow human interventions very often accounted for most of the overall elapsed processing time.
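As a rough check on the storage claim in the paragraph above, the following sketch computes the average storage budget per person when a 6 GB disk holds the records of 100 million people.  The function name and the decimal definition of a gigabyte (10**9 bytes) are assumptions made for illustration only; they do not come from the Workshop documents.

```python
def bytes_per_person(disk_gb: float, population: int) -> float:
    """Average number of bytes available per person record.

    Assumes 1 GB = 10**9 bytes (decimal definition, an assumption).
    """
    return disk_gb * 10**9 / population


# Figures quoted in the paragraph: 6 GB disk, 100 million people.
budget = bytes_per_person(6, 100_000_000)
print(budget)  # 60.0
```

Under these assumptions a person record has to fit in about 60 bytes, which is realistic only when responses are stored as compact codes rather than as text or images.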
24. Various versions of the Microsoft Windows operating system were currently being used on a large majority of all desktop computers.  General purpose and dedicated software were widely available for the Windows platform, some obtainable at low cost or no cost at all, and sufficed to manage most data processing tasks at the statistical office.
25. While individual desktop computers already had substantial and often sufficient processing power, using local area networks with a dedicated file server further enhanced the efficiency of the entire operation by pooling resources, reducing or eliminating redundancies, and centrally managing common tasks such as data back-up.  Where infrastructure permitted, wireless communications were becoming an important tool for interfacing between various computer components.  The Internet, with features such as e-mail and the World Wide Web, had gained importance firstly for the dissemination of information about the statistical office and its products and secondly for collecting data from respondents.
26. Thus, virtually all phases of the census process could benefit from the latest technologies.  Those would include project planning software, geographic information systems, paperless data capture methods, scanning with mark, character and intelligent recognition techniques, automatic or computer assisted coding and editing methods, metadata systems, CD/DVD and Internet/World Wide Web media, etc.
F. Quality management
27. The Workshop noted that quality control during all census phases posed a major challenge from data collection to data validation and editing, tabulation and dissemination.  Process quality management (PQM) focused on careful planning and efficient implementation of the census process, including human resource management and the management of production means.  Statistical quality management (SQM) related to the management of the metadata database and the integrity of the data during the entire process of transformation from raw data to publishable micro databases and statistical tables.  A better quality of the end product would assure greater user satisfaction.
28. The Workshop noted further that quality management issues were often underestimated.  The introduction of new technologies could provide an opportunity to give special consideration to the application of quality management principles to the entire census operation.  Census managers were urged to assess each new application in respect of its potential to control process quality as well as statistical quality.  They also needed to assess the impact of the new technology on non-computerized statistical, management and administrative processes and on organizational structures.  Moreover, as each application could interfere with others, special attention needed to be paid to interoperability.
29. The Workshop considered that many new technologies might be presented during the course of the Workshop that would be of interest to IT management involved in the planning and processing of the forthcoming census.  This wealth of new information posed another considerable challenge to IT management, who would be required to select a combination of IT solutions that fitted the existing infrastructure.  In that selection process, IT management should not overlook the effect those new technology solutions would have on the ability to maintain or improve both process and statistical quality management.
G. Expectations for the Workshop
30. The participants were invited, based on the agenda and without having yet heard the presentations, to rate their interest in the various Workshop topics.  Six work groups were created to deliberate on the question.  The findings of each group were presented to the other participants.  It appeared that Module 2, paper based data collection and capture, received the highest interest from participants, probably because the next census date was close for many countries and solutions needed to be found soon.  The respondents also expressed high interest in the topics of dissemination and geographic information systems.  However, non-paper based data capture methods and data warehousing received lower advance interest, probably because those technologies required a sufficiently developed infrastructure and a level of general technological advancement which only the most advanced countries had.
31. The Workshop agreed that one of the important expectations for the 2000 round of censuses was to significantly reduce the time needed for the entire census process, from planning to final reporting, by employing some of these new technologies in the various stages of census data processing.  Also, the final quality of processed data could be improved by better quality control throughout the process.  Furthermore, a wider and more targeted audience could be reached by employing better dissemination methods that made effective use of IT.  Significant quality and timeliness gains could be achieved by improving data collection and capture methods, and much effort could be spared when preparing census maps by using Geographic Information Systems.  Finally, where possible, increased use of the Internet, including the World Wide Web, showed great promise for more efficient information exchange.
32. However, the Workshop emphasized that individual countries would have to consider the level of local infrastructure and resource availability when deciding on the use of any of the available technologies.  The availability of technical support and maintenance was of crucial importance to the successful utilization of new technologies.
III. PAPER BASED DATA COLLECTION AND CAPTURE
33. The Workshop was presented with an overview of paper based data collection and capture technologies.  It was noted that traditional key-to-disk methods were time consuming, demanded a large quantity of equipment and personnel and were, due to the human factor, not always fully reliable.  Employing technology-assisted solutions would improve efficiency, economy and reliability in the data capture process.  Optical mark and character recognition systems were well tested, had become increasingly versatile and reliable, and could therefore significantly reduce the time needed for data capture and make subsequent processing more flexible.  Imaging technology in particular promised improved efficiency by largely eliminating the need to return, at later processing stages, to paper documents that were always cumbersome to handle.  Experience showed that keying from image could be more efficient than keying from paper, which could particularly benefit the coding and editing tasks.
A. Optical Mark Recognition (OMR)
34. Based on the example of Japan, the Workshop received a detailed introduction to optical mark reader (OMR) technology.  The hardware components of an OMR system comprised a feeding unit, a photoelectric conversion unit, and a recognition control unit.  The feeding unit consisted of a hopper for documents to be read and several stackers for accepted and rejected documents.  The photoelectric conversion unit used sensors to convert marks on the document to electric signals and forwarded the signals to the image memory.  Finally, the recognition control unit read those images and stored recognized marks onto a magnetic medium.  Marks could be recognized in "alternative mode", i.e., only one mark was expected for one question and the darkest mark was selected if several marks were found, or in "bit mode", i.e., multiple marks were expected for one question and all recognized marks were stored in the file.
35. The Workshop noted the high quality requirements for OMR forms, which needed to be carefully designed in order to improve processing and recognition reliability.  Paper and printing quality had to be high, dropout colours had to be used for lead text and mark boxes, the shape and size of the mark boxes had to be carefully designed and sufficient distance had to be maintained between the mark boxes.  The OMR form also needed to include timing marks along the aligning edge in the direction of reading.  Finally, it was important that the mark boxes were completely filled with a soft black pencil and that wrong marks were erased completely.  Since OMR forms were designed to be readable by the equipment, staff designated to handle the forms needed special training to fully understand the content.
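The two recognition modes described in the paragraph above can be sketched as follows.  This is a minimal illustration only: the darkness scale (0 to 1), the threshold value and the function names are hypothetical and are not taken from any OMR vendor's software.

```python
from typing import List, Optional


def read_alternative(darkness: List[float], threshold: float = 0.5) -> Optional[int]:
    """Alternative mode: one answer expected per question.

    Returns the index of the darkest box at or above the threshold,
    or None when no box is marked.
    """
    marked = [i for i, d in enumerate(darkness) if d >= threshold]
    if not marked:
        return None
    return max(marked, key=lambda i: darkness[i])


def read_bit(darkness: List[float], threshold: float = 0.5) -> List[int]:
    """Bit mode: several answers allowed; every marked box is kept."""
    return [i for i, d in enumerate(darkness) if d >= threshold]


# Hypothetical sensor readings for one question with four mark boxes.
sample = [0.1, 0.8, 0.6, 0.0]
print(read_alternative(sample))  # 1
print(read_bit(sample))          # [1, 2]
```

In this sketch the same readings yield a single answer in alternative mode (the darkest mark) and two answers in bit mode.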
36. The Workshop noted that OMR equipment had to be tested for reliability and recognition stability at least three times daily, namely, before, during and after the operation.  Failing those tests, the equipment needed to be cleaned, adjusted or repaired, as the case might be.  In addition, the equipment needed to be cleaned daily by removing paper powder from the mark and image heads, feeding unit and other susceptible parts.  Normally, a monthly maintenance service was to be scheduled by the vendor.
37. The Workshop agreed that OMR technology was a reliable and economical choice for censuses and surveys if the responses could be pre-coded.  However, it acknowledged that the particular requirements for questionnaire design and paper and printing quality were the main drawbacks of the technology.  For instance, enumerators, respondents and editors could have difficulties in using the questionnaires due to their highly machine-oriented layout.  Therefore it was necessary to allocate sufficient time and funds for training the enumerators and the OMR operating personnel.  The Workshop noted that leasing was one way to reduce cost.
B. Demonstration of Optical Mark Reader (OMR)
38. Data & Research Services (DRS) plc, a British company manufacturing OMR equipment and operating a data capture service bureau, provided the Workshop with an overview of OMR products and services and highlighted some of OMR's advantages and disadvantages compared with key-to-disk data capture.  The Workshop was informed that OMR was capable of capturing 7,000 forms per hour, a huge improvement over manual key entry.  Optical reading also improved data quality.  It was pointed out that as data volumes increased the use of OMR became more economical than key-to-disk data capture, particularly where predominantly pre-coded tick-box responses could be used.  Some disadvantages of OMR were mentioned, including the need for specially designed and accurately printed, and therefore more costly, questionnaires and the difficulty of capturing subjective data, i.e. textual responses.  The Workshop heard that OMR would be more efficient and cheaper than optical character recognition systems (OCR) as long as the majority of responses could be pre-coded.
39. Recognizing that a census questionnaire often had to include some textual responses, DRS had developed a new generation of OMRs that added an image recognition unit.  The captured images would be stored in a file and could be viewed by coding and editing operators who would key in information from the image, possibly assisted by a computerized table-lookup system.  However, the bulk of the information would still be captured using the significantly more efficient mark reading technology.
40. A demonstration of a small-capacity desktop OMR reading actual Greek census forms concluded the presentation by DRS, which the Workshop found most useful.
C. Optical Character Recognition (OCR)
41. The Workshop noted that in some contexts the recognition of handwritten numerals and letters was referred to as Intelligent Character Recognition (ICR) to distinguish that technology from the recognition of printed text and numbers.  This report, however, uses the term OCR to cover all character recognition.
42. Kodak (United States) had been invited to introduce to the Workshop optical character recognition (OCR) technology as used in the 1990 United States census.  The Workshop was informed that to obtain maximum reliability in the scanning process, special care had to be taken when designing and printing the questionnaires.  The measures included the use of non-carbon based ink and dropout colours.  Like the OMR forms, the OCR forms design had to be a compromise between maximizing the ease of use by the enumerators, coders and editors on the one hand and optimizing the efficiency of the recognition software on the other.  Experience showed that the best recognition rates for handwritten responses were achieved at a scanning resolution of 200 dots per inch (dpi) or lower; higher resolutions generally worsened the recognition rates.
43. It was explained that the confidence level of character recognition was user definable and was dependent on the overall document quality, i.e. questionnaire design and clarity of handwritten responses.  However, setting the confidence level too high, e.g. above 90 per cent, could result in excessive numbers of rejects, while setting the level much lower could jeopardize the quality of the output data.  The Workshop noted that one of the major problems in character recognition was the acceptance of positively but wrongly identified characters.  In consequence, reduction of the number of "false positives" would have the most benefit for the overall quality of the captured data.
44. On a unit cost basis, the economics of keying-from-paper (KFP) and keying-from-image (KFI) were compared.  With the selected labour cost the calculations suggested that the break-even point was at about 400,000 census forms, i.e., beyond that number KFI would become more economical.  It was pointed out that KFI might be feasible even with a smaller number of forms, if improved data quality at the data capture stage, reduced costs for the additional processing steps and increased capture speed resulting in earlier completion of the entire census process were taken into account.
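The trade-off between rejects and false positives described in the paragraph above can be illustrated with a small sketch.  The readings, the confidence values and the function name are invented for illustration; a real recognition engine reports its confidence in its own way.

```python
from typing import Dict, List, Tuple

# Each reading: (recognized character, engine confidence 0-1, true character).
Reading = Tuple[str, float, str]


def evaluate(readings: List[Reading], threshold: float) -> Dict[str, int]:
    """Count rejects and false positives for a given confidence threshold."""
    accepted = [r for r in readings if r[1] >= threshold]
    rejects = len(readings) - len(accepted)
    # A false positive is an accepted character that is actually wrong.
    false_positives = sum(1 for ch, _, truth in accepted if ch != truth)
    return {"rejects": rejects, "false_positives": false_positives}


# Hypothetical readings of four handwritten digits.
readings = [("7", 0.97, "7"), ("1", 0.92, "7"), ("3", 0.85, "3"), ("8", 0.60, "3")]
print(evaluate(readings, 0.95))  # {'rejects': 3, 'false_positives': 0}
print(evaluate(readings, 0.50))  # {'rejects': 0, 'false_positives': 2}
```

With these invented readings, the high threshold sends three of four characters to manual verification, while the low threshold accepts two wrong characters without any review.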
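A break-even point of the kind mentioned in the paragraph above can be sketched under a simple linear cost model, in which each method has a fixed cost plus a cost per form.  The model and all cost figures below are hypothetical; they were chosen only so that the result lands on the reported order of magnitude, and they are not the figures used in the presentation.

```python
def break_even_forms(fixed_kfp: float, unit_kfp: float,
                     fixed_kfi: float, unit_kfi: float) -> float:
    """Number of forms at which KFI total cost equals KFP total cost.

    Assumes total cost = fixed cost + unit cost * number of forms.
    """
    if unit_kfp <= unit_kfi:
        raise ValueError("KFI never breaks even unless its unit cost is lower")
    return (fixed_kfi - fixed_kfp) / (unit_kfp - unit_kfi)


# Hypothetical costs, expressed in cents to keep the arithmetic exact.
n = break_even_forms(fixed_kfp=1_000_000, unit_kfp=25,
                     fixed_kfi=5_000_000, unit_kfi=15)
print(n)  # 400000.0
```

Beyond the break-even volume, the lower cost per form of KFI outweighs its higher fixed cost for scanners and software in this model.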
OCR technology for the Indonesian Census 2000
45. The Workshop was informed about the background and rationale based on which Indonesia selected OCR as the data capture method for the year 2000 census.  Major considerations had been (a) the very large number of forms to be processed for a population of more than 200 million; (b) the need to produce small area statistics based on the many island areas; and (c) the possibility of publishing basic results within 3 to 6 months.  Helpful in the decision had also been the availability of external assistance in the form of equipment, software and expertise.
46. The OCR system and the questionnaire design had been assessed and tuned in several pilot tests.  The changes in the questionnaire design had improved the recognition results significantly.  Further improvements had been achieved by replacing the built-in western character set in the recognition engine with a localized version of the character map.  The local version had been developed from writing samples submitted by 5,000 different persons.  However, it was eventually decided that it was better to omit the recognition of alpha characters and to concentrate on maximizing the performance of numeric recognition and mark reading.
47. The Workshop was given an overview of the processing flow of an OCR based system in Indonesia.  The OCR system consisted of three steps, namely scanning, recognition and verification.  The scanning of questionnaires produced an image file in TIF format.  That was compared to a template file containing information about the relative locations of input in the questionnaire.  The resulting digital output file was then submitted to the verification process in order to produce a clean data file. 
48. The Workshop learned about the issues and principles involved in the OCR questionnaire design in Indonesia.  It was noted that OCR equipment required less stringent paper quality and printing accuracy than did OMR.  Instead, four rectangular registration markers were placed near the corners of the questionnaire page to define the location of individual fields relative to these registration markers, thus providing greater tolerance for misaligned forms being fed through the scanner.  Data fields were placed on the page as boxes of sufficient size to allow clear handwriting, with appropriate distance between them to minimize the risk of misinterpretation.  Depending on the use, field types could be defined as containing marks or textual information.  For textual boxes, the use of two vertical dots within each character box was recommended to guide the respondent or enumerator and thus improve the quality of handwriting.  Standard form-processing tools could normally be used for developing the questionnaire.  Once the design was complete, the questionnaire was scanned to produce an image file that was input to the NCS Nestor Reader editing function in order to create the master questionnaire template mentioned above, in ZDF format.  The questionnaires used for data collection were printed with dropout colours.
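The idea of locating fields relative to registration markers can be illustrated as follows.  This is a minimal sketch under stated assumptions, not the method used by the Nestor Reader: it assumes that the four markers have already been detected in the scanned image and that each field position is stored in the template as fractions (u, v) of the marker rectangle.  The function name and coordinates are hypothetical.

```python
def locate_field(u, v, top_left, top_right, bottom_left, bottom_right):
    """Map a template position (u, v) to pixel coordinates in the scanned image.

    u and v are fractions (0 to 1) of the distance between the markers,
    measured across and down the page.  The pixel position is obtained by
    bilinear interpolation between the four detected marker positions, so a
    shifted or slightly skewed scan still yields the right location.
    """
    weights = (
        ((1 - u) * (1 - v), top_left),
        (u * (1 - v), top_right),
        ((1 - u) * v, bottom_left),
        (u * v, bottom_right),
    )
    x = sum(w * point[0] for w, point in weights)
    y = sum(w * point[1] for w, point in weights)
    return x, y


if __name__ == "__main__":
    # Hypothetical marker positions for a page scanned 10 pixels to the right
    # and 5 pixels lower than the template.
    print(locate_field(0.5, 0.25, (10, 5), (110, 5), (10, 205), (110, 205)))
```

Because every field is expressed relative to the markers rather than to the paper edge, the same template can be applied to forms that were fed through the scanner slightly out of position.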
49. The Workshop was given a hands-on demonstration of developing an OCR questionnaire using the Visio Technical software.  The form design included text, recognition mark and check boxes.  It was thus shown that the questionnaire design could be developed by the user without assistance from the software company.  In contrast, the validation and editing rules were programmed in Visual Basic and were linked to the Nestor Reader software, a more difficult task that perhaps needed assistance from the vendor.
50. The Workshop also observed a practical demonstration of a less powerful but similar system to the one that Indonesia was planning to use, showing the scanning and recognition of characters and marks, and the output of questionnaire data to a digital file.
51. The Workshop heard that Indonesia was planning to deploy for its 2000 census some 80 OCR systems, consisting of Fujitsu Scanners M3099GX, NCS Nestor Reader 5.0, Visio Technical scanning software Scan All, and Fujitsu PCs.  The systems would be distributed across the country, allocated to provinces according to their population size.  After the census, those systems would be allocated for long-term use at smaller regional offices.  The Workshop heard that greater emphasis would be placed on enumerator training, particularly on the writing of numbers.  Statistics Indonesia had chosen to use cardboard boxes for storing and transporting the questionnaires instead of plastic satchels.  The boxes were designed to serve the dual purpose of better protecting the forms in the humid climate and providing writing support for form filling to be done by the enumerator.
52. For the Indonesian census, coding would be done in the office before the forms were scanned.  The Workshop discussed the feasibility of reversing the sequence, i.e. of subjecting the forms first to scanning and then only to computer assisted coding from the scanned images.  It was concluded that the feasibility depended on the availability of suitably trained staff.
Demonstration of OCR cluster
53. The Workshop observed a practical demonstration by Top Image System (TIS) of the TIS AFPS Pro recognition cluster that used a Kodak scanner with a controlled station linked to six Pentium PC stations in the following functions: (1) processing; (2) tile; (3) completion; (4) exception handling; (5) archive and export; and (6) controlling.  It noted the flexibility to inspect recognition results by character (tile mode) and appreciated the system's simplicity and efficiency in facilitating the recognition of visibly wrongly interpreted characters.
54. Depending on the overall workload, the number of computers for each processing step could be increased or decreased and depending on current workflow conditions, i.e., bottlenecks, the usage of any computer could be temporarily or permanently reassigned to another function in order to keep the overall system performance well balanced.
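The reassignment of computers between functions can be sketched as a simple balancing rule.  The rule below (move one station from the least loaded function to the most loaded one) is an assumption made for illustration and is not a description of how the TIS software decides; the function names and workload figures are hypothetical.

```python
def rebalance(stations, backlog):
    """Move one station from the least loaded function to the most loaded one.

    stations: function name -> number of computers assigned (at least 1 each)
    backlog:  function name -> number of items waiting at that function
    Returns a new assignment; the input dictionaries are left unchanged.
    """
    load = {name: backlog[name] / stations[name] for name in stations}
    busiest = max(load, key=load.get)
    # Only functions that keep at least one computer afterwards may give one up.
    donors = [name for name in stations if name != busiest and stations[name] > 1]
    if not donors:
        return dict(stations)
    donor = min(donors, key=load.get)
    updated = dict(stations)
    updated[donor] -= 1
    updated[busiest] += 1
    return updated


if __name__ == "__main__":
    assignment = {"processing": 2, "completion": 1, "export": 2}
    waiting = {"processing": 100, "completion": 900, "export": 10}
    print(rebalance(assignment, waiting))
```

Applied repeatedly as the backlog figures change, such a rule shifts capacity towards whichever step is currently the bottleneck.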
55. To highlight the efficiency of the modular approach, the example of the 1997 Turkish Census was cited.  In that census, questionnaires for 62 million people were scanned and recognized in 30 days, albeit only for a subset of variables.  The Workshop noted that the processing time was inversely related to the number of available scanning and recognition clusters.  It was informed that TIS had achieved alpha recognition rates as high as 94 per cent (in Brazil) and 98 per cent (in Germany), although the latter case involved less elaborate forms than census questionnaires.
56. Improvements in recognition rates achieved by the TIS software were attributed to several advanced techniques, including (a) image enhancement; (b) form identification and removal (lift-off); (c) use of several recognition engines with voting algorithms; (d) trainable recognition algorithms, including local writing styles; (e) validation function and rules; (f) automatic coding; and (g) visual inspection in tile mode.
57. The Workshop heard that the form identification and removal feature eliminated the need for dropout colours and would significantly reduce the required storage space.  The voting algorithms would evaluate the results of several recognition engines and select the best answer according to pre-defined rules.  The tile mode would show for each character from 0 to 9 and A to Z, one at a time, a table containing all images as they were interpreted to represent that character.  That feature provided an efficient means of visually inspecting all images at a glance and easily identifying those images that did not correspond to the character under review.
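The voting and tile-mode ideas can be sketched as follows.  The pre-defined rules used by the TIS software are not described in this report, so the rule shown here (accept a character only when a minimum number of engines agree and there is no tie) is an assumption, as are the function names.

```python
from collections import Counter, defaultdict


def vote(candidates, min_agreement=2):
    """Return the character most engines agree on, or None to flag it for manual review.

    candidates holds one result per recognition engine; None means that the
    engine gave no answer.
    """
    counts = Counter(c for c in candidates if c is not None)
    if not counts:
        return None
    (best, votes), *others = counts.most_common(2)
    tied = bool(others) and others[0][1] == votes
    if votes < min_agreement or tied:
        return None
    return best


def tile_view(recognized):
    """Group image identifiers by the character they were recognized as."""
    tiles = defaultdict(list)
    for image_id, char in recognized:
        tiles[char].append(image_id)
    return dict(tiles)


if __name__ == "__main__":
    print(vote(["7", "7", "1"]))
    print(tile_view([("img1", "7"), ("img2", "1"), ("img3", "7")]))
```

Grouping the images by recognized character is what allows an operator to scan one table per character and pick out the few images that plainly do not belong.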
New Zealand experience in 1996
58. The Workshop learned that for the 1996 New Zealand Population Census imaging and character recognition were used to capture the data.  Benefits compared with the 1991 census included: results released 5 months earlier; cost savings for data capture estimated at 9 per cent; noticeable reduction in paper handling and storage (particularly after the capture); and easier access to forms during coding and editing.  In addition, better quality control was gained, fewer staff needed to be recruited and trained, and access to census data for comparison with the post-enumeration survey was easier.
59. The following lessons were learned from the use of imaging and character recognition in the 1996 New Zealand Population Census: (a) systematic recognition errors for certain characters produced biased results; (b) the use of images for coding and editing was a distinct advantage; (c) more data validation during data capture would improve overall data quality; and (d) high-priority variables could easily be processed first.  The Workshop was informed that further contracting out of the data capture process might yield significant long-term economic benefits.  Last but not least, imaging should not be used merely as a replacement for traditional data capture methods; rather, its introduction was an occasion to rethink the entire census process.
Observations and recommendations on OCR
60. The Workshop noted that recognition engines could be expensive and therefore the use of multiple engines had to be carefully evaluated.  However, it was also recognized that no single recognition engine would give 100 per cent results in all circumstances and that different engines had different strengths and weaknesses.  Thus, using several recognition engines with a voting mechanism could significantly improve the overall recognition rate.
61. The Workshop recommended that users should demand that competing vendors of census data capture systems demonstrate that the promised capabilities of their system would work under local circumstances, i.e. in the physical and infrastructure environment of the user as well as with the specific forms as developed by the user.
62. The Workshop noted that using technologically advanced solutions should not be self-serving but consideration should be given to local circumstances, e.g., to the constraints based on limitations of financial, technical and personnel resources.
63. The Workshop also noted that paper based methods continued to be used for data collection, particularly when the general public was filling in the questionnaires.  It was noted that non-response remained one of the main problems in census taking.
64. The Workshop discussed the benefits and drawbacks of paper based data collection and capture methods.  Considerable interest was shown in the topic and the following were the observations by the Workshop:
  • improved technology had helped the census process in many developing countries;
  • operational issues for data capture had to be considered in conjunction with the entire survey process;
  • the choice between OMR and OCR/ICR needed careful consideration.  Questions arising were whether alpha recognition was already well enough proven and whether scanning of occupation and industry would be viable;
  • a further question was whether imaging-type data capture was really viable for all countries, especially the smaller developing countries in the Pacific with correspondingly small budgets;
  • in the context of censuses, the simpler OMR technology with maximum utilization of pre-coded variables could possibly be the most efficient option;
  • the low literacy level in some countries might prove a problem when using questionnaires for image scanning;
  • the number of different languages or dialects might prove a potential problem with image recognition systems;
  • the locally available expertise in handling forms and in interviewing, and the level of computer literacy, were issues to be considered;
  • the statistically less developed countries could learn from the experience of more developed countries which already had successfully used sophisticated data capture technologies.
D. Archiving of census forms
65. The Workshop was informed about an often-overlooked aspect of census data processing, namely, the long-term archiving of census forms.  It was noted that some countries required census documents to be discarded immediately while, in contrast, others had legal stipulations demanding the retention of original documents for decades or centuries.  The simplest archiving method would be to store the original questionnaires.  But transfer of the images to a more efficient storage medium could be considered because of the significant space and environmental requirements for paper documents.  Obviously, when scanning was part of the data capture system, the scanned images could conveniently be stored on electronic media (tapes, disks, CD-ROMs). The Workshop noted, however, that the rapid evolution of storage formats and hardware could make those types of digitized information inaccessible over a long period of time.  Therefore, it recommended giving due consideration to simple, stable and space efficient microfilm technology as a long-term storage solution for images.
IV. NON-PAPER BASED DATA COLLECTION AND CAPTURE
66. The technologies for direct electronic data capture were becoming an alternative or at least a complement to the use of paper forms.  The most common non-paper based data capture methods were computer assisted personal interviewing (CAPI), computer assisted telephone interviewing (CATI), and submission of questionnaires through the Internet.
A. Computer Assisted Telephone Interviewing
Internet and CATI in Singapore Census 2000
67. The Workshop heard that the year 2000 Census in Singapore would mark a significant step towards a paperless census.  The main technology blocks that the Department of Statistics was building on were the utilization of available administrative records (for pre-filled personal and household information), CATI and Internet form submission.  CATI was expected to be the main mode of data collection, to be used for 60 to 80 per cent of the households.  There was no precedent for a large scale Internet submission and therefore it was difficult to estimate its popular acceptance beforehand.  Personal interviewers would be sent to households that could not be reached by phone or that did not submit their response through the Internet.  Their forms would be scanned and OCR/ICR would be used to capture the results.
68. Apart from an advanced technology solution, the Singapore census was unique in the sense that most of the data collection would be through outsourcing and that multiple vendors would be involved.  The Singapore experience showed that measures were required to prevent conflicts between different vendors involved in the census project.  The measures included procedures for keeping all parties informed about decisions made and progress achieved, establishment of conflict resolution procedures, and the use by each vendor of their own servers and their own licenses for their software.  The Workshop noted that end-users of complex applications were not in a position to identify the causes of system problems; for that purpose a separate help desk was needed.
69. The Department of Statistics of Singapore had previous experience in using the Internet for data collection, but that was restricted to the transmission of survey information from about 1,000 large companies.  The year 2000 census would be an exercise of a completely different scale, and therefore the challenges were unprecedented.  Although the technology in the submitter's environment was beyond the data collector's control, the standardization of browsers and the availability of the Java language and a secure data transfer protocol made it possible to use the Internet for large scale data collection.  Post census surveys would be used to verify the results and possible biases in the various modes of data capture.
70. The Workshop agreed that data protection was perceived as a major consideration in the Internet census submission and that major publicity campaigns were needed to promote that mode of submission.  The Workshop was informed that although it was always possible that data might get into the wrong hands during the Internet submission, that risk was actually rather small.  In fact, it was much easier to eavesdrop on the CATI interviews than to intercept and decrypt secure data transfers over the Internet.  However, attacks by hackers on Internet servers were indeed a major security consideration.  The server side design should include industry standard firewalls to allow only authorized traffic; it was also important to implement immediately any security related patches that were frequently announced by the suppliers of operating and database management systems.  A key precaution was to minimize the data holdings on any server that was connected to the Internet, i.e., to frequently move the data to a non-connected system.  In addition, rapid response teams should be on stand-by to identify and tackle any intrusion as soon as it occurred.
71. The management and integration of the diverse data capture systems required a well-designed centralized tracking system.  In order to minimize duplicate responses for the same household, such as a daughter submitting an Internet response and a father being simultaneously interviewed by a CATI operator, the progress of returns by each capture mode needed to be updated and checked frequently.  It was also important to design the overall system such that a failure in one capture mode did not bring down the rest of the operation.  A centralized backup system that allowed a complete rollback to any point in time during the previous few days was even more important than in a conventional database system.  In any case, based on available back-up information, including voice recordings of telephone interviews, return to the respondent for the purpose of repeating the interview should be avoided at all cost.
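The role of the central tracking system in preventing duplicate responses can be sketched as follows.  This is a minimal in-memory illustration with hypothetical class, method and identifier names; the design of the Singapore system is not described in this report, and a production system would need a shared, persistent and backed-up store.

```python
from collections import Counter


class ReturnTracker:
    """Records the first accepted return per household, across all capture modes."""

    def __init__(self):
        self._completed = {}  # household identifier -> capture mode of the accepted return

    def is_pending(self, household_id):
        """True if no return has yet been accepted for the household."""
        return household_id not in self._completed

    def register(self, household_id, mode):
        """Accept a return unless one already exists; return True if accepted."""
        if household_id in self._completed:
            return False
        self._completed[household_id] = mode
        return True

    def progress(self):
        """Number of accepted returns by capture mode."""
        return Counter(self._completed.values())


if __name__ == "__main__":
    tracker = ReturnTracker()
    print(tracker.register("HH-001", "internet"))  # accepted
    print(tracker.register("HH-001", "cati"))      # refused: household already enumerated
    print(tracker.progress())
```

A CATI operator's system would call `is_pending` before dialling, so that a household whose Internet return has just arrived is not interviewed a second time.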
72. The development of a multiple-technology and multiple-vendor system required excellent coordination between all partners.  The user acceptance testing had to be rigorous, first for each system component and then for the whole system in integrated and simultaneous use in order to discover design flaws and bugs that required rectification.
B. Computer Assisted Personal Interviewing
73. The Workshop was also given an overview of the CAPI system as used by the Australian Bureau of Statistics.  It was stated that CAPI had the potential for significantly improving the quality of data and timeliness of processing.  It would also help to achieve cost effectiveness, particularly if the required equipment could be utilized for other applications after the first data collection.
74. The improved quality was achieved through computer-assisted filling of the questionnaire, thereby avoiding omissions and/or superfluous responses, while on-line editing would reduce the number of erroneous responses and permit more detailed probing through the questionnaire.  Improvements in the timing of data release were possible due to elimination of a separate data capture phase (key-to-disk or OMR/OCR scanning), implementation of a field coding system, use of on-line derivation of output variables and electronic data transfer from the enumerator's computer to a central facility.  Cost effectiveness might be judged by less tangible results such as improved coding effectiveness, reduced interview time, streamlined processing, reduced reliance on clerical procedures and printed material, etc.
75. The Workshop noted, however, that CAPI involved considerable set-up cost for hardware and application development.  The availability of communications infrastructure in the field was an important factor in reducing data transfer times and making the most efficient use of the expensive equipment.
76. The Workshop was given a presentation of the survey processing system Blaise developed by Statistics Netherlands.  That software was specifically designed in support of computer-assisted data capture, i.e. to be used by field enumerators with a laptop computer or from the office when interviewing by telephone, but could equally well be used for key-from-paper data entry operations.  Form-based data entry, complex routing and checking, interactive coding and data editing, strong data manipulation and tabulation capabilities, as well as survey management and export to other statistical and database formats were features that made the Blaise software a very useful tool for statistical offices.  The Workshop noted that Blaise was commercial software but hoped that statistical offices in developing countries could obtain it at a lower cost if not free of charge.
77. The Workshop drew the conclusion that CAPI was a very useful technology but would be less feasible for full scale census operations until such time that the necessary equipment had become significantly cheaper, smaller, easily portable, more robust, and powered with long-lasting batteries.  The Workshop noted further that non-paper data capture methods such as CATI and electronic form submission would not be feasible in many countries due to the insufficiently developed communications infrastructure. 
V. IMPLICATIONS FOR THE GUIDELINES ON THE APPLICATION OF NEW TECHNOLOGY TO POPULATION DATA COLLECTION AND CAPTURE
78. At the completion of modules 2 and 3, the Workshop reviewed the draft guidelines on the Application of New Technology to Population Data Collection and Capture in the light of the Workshop proceedings.  It was emphasized that the guidelines were based on voluntary contributions from the Working Party members and that given the urgency to publish the project outputs, it was not feasible to perfect the guidelines with all possible aspects related to the application of IT. 
79. The coordinator of the guidelines noted that certain concepts and terminology needed updating.  They included, among others, the latest in character recognition innovations (multiple-engine recognition and voting system) and some Internet data collection and security issues.  He agreed that it would be useful to add information on lessons learned from some of the unsuccessful high-tech solutions, and invited contributions from all participating statistical and census offices on such experiences.  Additional information on public domain software, and examples of census and survey forms used in connection with the latest data capture technologies would further enhance the guidelines.
80. The Workshop identified several areas where the guidelines could be improved and requested the Working Party to implement the changes where possible.  Those included the technological implications arising from the high confidentiality requirements for census and survey data, quantification of savings that had been obtained through the application of new technology, and special training requirements for each featured technology in the guidelines.  A technology update was required on the recognition technology involving the combination of OMR and OCR/ICR technologies. 
81. The Workshop noted that the guidelines were yet to have an introductory section that explained their purpose and coverage and, as important, what they did not include.  And finally, the Workshop agreed that the guidelines would be easier to read if the various sections were structured in a similar fashion.
VI. ADDING VALUE TO CENSUS DATA THROUGH DATA WAREHOUSING AND DATA MINING
82. Presentations were made by the representatives of two local vendors (Unisys and SAS Institute) that provided data warehousing, online analytical processing and data mining solutions for various businesses, including statistical offices.  Although some of the most advanced statistical offices had been experimenting with those technologies (e.g. common data dissemination platform in the Australian Bureau of Statistics), they were relatively unknown to most Workshop participants.  Therefore, the presentations and the consequent discussion centred on the key concepts and terminology, and their differences from traditional relational databases and analytical tools.
83. Data warehousing technologies typically involved several separate databases on various platforms from which data were extracted and cleansed into a normalized data warehouse.  The Workshop was informed of the analogy between the evolution of database technology and data warehousing technology.  Relational database modelling and SQL had changed little since the 1970s.  However, huge improvements to the hardware had allowed the development of user friendly design tools for databases to the extent that knowledge of SQL was no longer needed in order to develop and run simple database systems.  It was pointed out that data modelling for data warehouses was still very challenging and laborious and that design tools needed considerable improvement.  Also, the query times and other performance factors were not always satisfactory.  Nevertheless, it was expected that data warehousing technology would go through a similar evolution as database systems, and would eventually become much easier to implement.
84. The Workshop agreed that data warehousing and related downstream technologies offered a great potential for integrating data derived from administrative records, various censuses and surveys, and for different points of time.  It noted that setting up a full-blown data warehouse system was not easy and required significant resources for standardization of concepts and metadata, for data modelling, for data cleansing and for the rest of the implementation.  Therefore it was important that the organization was clear about the business objectives that data warehousing would help to achieve.  Data warehouses were typically built with a long-term goal in mind and with scope for future growth.  Noting that the specification and use of a correct data model was the single most crucial success factor in the implementation of a data warehouse, the Workshop strongly recommended the sharing of data models among statistical offices, rather than "reinventing the wheel" alone.
85. The Workshop noted that data mining was often related to data warehousing.  It could be implemented within or above the data warehouse, but also outside and independent of it.  Data mining tools were used without defining any test hypothesis in advance.  They involved mathematical algorithms that could reveal hidden interdependencies in the data, thus producing unexpected results and insights.  Online analytical processing (OLAP) was based on a more traditional analytical approach with an advance hypothesis setting.  OLAP could be used in a data warehouse environment.  The Workshop cautioned that an elaborate and nice-looking interface of an OLAP or data mining tool did not guarantee that the related data warehouse would necessarily be implemented properly.  In fact, full-fledged data warehousing systems (top-down developed systems) were so large and involved so many different types of tools that there was no single company offering all required products.  However, there were providers for smaller systems, data marts, which were designed and developed from the bottom up.
86. The Workshop recommended that data warehouses be developed in a modular fashion, keeping the long-term needs in mind: "Start small, think big".  It noted that the Internet was increasingly used for data transfers in data warehousing solutions.
87. At the end of the module, the SAS Institute demonstrated an OLAP interface using a Web browser and Java applets to create user-end (thin clients) graphics.
VII. DATA DISSEMINATION
88. The Workshop noted that the traditional way to disseminate census results was in the form of tabulations, i.e. a listing of the number of occurrences for individual or grouped values of one or more variables.  Census publications comprised usually a set of core tables presented in hierarchical, geographic breakdowns and aggregations.  Additionally, they were increasingly complemented by custom designed tables based on client specifications.
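A tabulation in the sense described above, i.e. a count of occurrences for each combination of values of one or more variables, can be sketched as follows.  The records and variable names are invented for illustration.

```python
from collections import Counter

# Hypothetical person records; the values are invented for illustration.
RECORDS = [
    {"province": "A", "sex": "F", "age_group": "0-14"},
    {"province": "A", "sex": "M", "age_group": "15-64"},
    {"province": "A", "sex": "F", "age_group": "15-64"},
    {"province": "B", "sex": "M", "age_group": "15-64"},
    {"province": "B", "sex": "M", "age_group": "65+"},
]


def tabulate(records, *variables):
    """Count the records for every observed combination of the given variables."""
    return Counter(tuple(record[v] for v in variables) for record in records)


if __name__ == "__main__":
    for cell, count in sorted(tabulate(RECORDS, "province", "sex").items()):
        print(cell, count)
```

Passing a different list of variables yields a different table from the same records, which is the essence of a custom designed table produced to client specifications.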
89. In the past, when a client approached the statistical office to obtain specific information, the request was handed to the programming department where a tailor-made query was developed and run and the results were verified by a statistician for correctness and consistency.  That was often a lengthy and costly procedure and was prone to mistakes when involving both programming and statistical staff.  Today's technology allowed statisticians to take on the entire task of designing and producing the output without involving the programming department, thus reducing significantly the response time to a client's request.  Also, the risk of misinterpreting the client's data request was reduced, which otherwise often resulted in rerunning the job, thus wasting valuable resources as well as delaying the delivery of the data to the client.
90. Electronic output from a data extraction phase was best delivered in standard file formats such as spreadsheet files, comma-separated-value files (CSV format, usable by database and statistical standard software) or tab-delimited files (TXT format, usable by text processing software), so that they could be used for further processing.
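Writing extraction output in the two delimited formats mentioned above can be sketched with the csv module of the Python standard library.  The helper name and the sample table are hypothetical.

```python
import csv
import io


def to_delimited(header, rows, delimiter=","):
    """Return a table as delimited text: comma-separated by default, tab-delimited with "\t"."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    header = ["province", "sex", "count"]
    rows = [("A", "F", 2), ("A", "M", 1), ("B", "M", 2)]
    print(to_delimited(header, rows))                  # CSV format
    print(to_delimited(header, rows, delimiter="\t"))  # tab-delimited (TXT) format
```

The csv writer quotes any value that itself contains the delimiter, so that an area name with a comma in it does not shift the columns when the file is read by database or statistical software.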
91. For distribution of electronic output, several media types were available, each one having advantages and disadvantages.  Commonly available diskettes were easy and safe to use and were ideal for storing small files due to their portability and reliability.  However, large files had to be compressed or split into smaller sections.  It was noted that, apart from their small capacity, a major disadvantage of using diskettes was that once infected, they were prone to distributing boot-sector viruses.
92. The Workshop noted that the cost of producing individual copies of compact disks (CD) had not reduced drastically.  Nevertheless, that medium was suitable for storing or disseminating both large and small amounts of data and information.
93. The Workshop also noted the benefits of data dissemination by electronic mail; it was very fast, an ideal medium for transferring small data sets to users, and efficient when the same material was disseminated simultaneously to several recipients.  However, it was not suitable for very large files.  Another drawback, not different from ordinary mail, was that senders would not be certain that users did indeed receive the files.  Also, ordinary file attachments carried a threat of virus infection which mail gateways and virus protection software could not always detect.  The Workshop heard that at Statistics New Zealand the use of e-mail for dissemination purposes had increased from 25 per cent in 1998 to 60 per cent in 1999, with diskette delivery dropping considerably.
94. The Workshop noted that the World Wide Web was increasingly used by statistical offices to make information available globally about the office and its activities as well as about statistical information of major interest to the general public.  It agreed that the web offered a great tool and an opportunity for improving customer relations and public perception about statistical offices.  During the Workshop the participants had the opportunity to visit the web sites of various statistical offices and review the key aspects to be considered when developing a web site for an NSO.  The Workshop agreed that clarity and ease of use were important design objectives for a web site.  Special consideration should be given to the fact that many visitors to the web site had slow Internet connections, which put restrictions on the use of large files and graphics intensive designs.
95. The Workshop noted that nowadays many national statistical offices provided the possibility of retrieving data dynamically through the Internet.  Using standard web browsers to formulate queries, users were able to obtain data corresponding to their individual needs.  The implementation of that kind of service required a relatively high degree of technological know-how and the implementation of industry standard security mechanisms to prevent intrusion and to safeguard the confidentiality of data.  The users needed to have a relatively high degree of familiarity with the data to ensure they were extracting the correct variables to satisfy their request.
96. The Workshop noted that magnetic tapes had lost much of their attractiveness as a storage and dissemination medium.  While they could store substantial amounts of data, very few of the current microcomputer-based systems had a tape drive attached.  Also, the access to data on tapes was cumbersome and time consuming.
97. The Workshop acknowledged that hardcopy output still had advantages.  It did not require any technology at all to be used and therefore could be read anywhere.  Full portability was also achieved if users had the opportunity to print electronically disseminated data.  The Workshop noted the advantages of the Portable Document Format (PDF) in that regard.  Disadvantages of hardcopy output included the facts that the data could not easily be manipulated or presented in a different form and that the storage of bulky publications could cause a problem.  The Workshop also noted that fax machines provided a feasible alternative for sending small amounts of data to a limited number of customers.
98. The Workshop agreed that users needed to have access to the information about data collection methods, sources, definitions, and terminology used.  Additional statements could be included on the quality of data as well as sampling error tables.  Advice on the use of data with low value cells, subject to sampling errors, might also be given.  Disseminated products should include the terms and conditions of data supply, explaining to the users how they could use the data and the rules that governed the transmission of data to third parties.  The terms and conditions were required to protect the statistical office from liability, should users make wrongful use of the information or should the data include erroneous information.  With all distributions of information a statement should be included specifying the confidentiality provisions contained in the data.  Finally, the supplied data should always be accompanied by details on whom to contact for queries.
99. While tabulations were the most condensed format for presenting statistical output, the Workshop encouraged the use of graphics in order to make information easier to understand.  Particularly, graphs could quickly inform about trends or relationships by visually portraying the underlying data content.  Graphs were used to support written commentary on statistical results and were ideal for press and media releases.
100. However, graphs should be designed carefully so as not to defeat the principal reason for their use, namely, clarity of presentation.  Graphs should not be overloaded with information, should always clearly identify their purpose and the origin of data, and should identify the variables included.  The Workshop agreed that a key to good graphical presentation was the selection of the correct form of graph (single, multiple, vertical and horizontal bar graphs, line and pie graphs, two and three dimensional graphs, etc.).  It was noted that graphs could be produced by commonly available spreadsheet programs (Excel, Lotus) as well as by general-purpose statistical software packages (SAS, SPSS) and by some specially developed data extraction software used at statistical offices (IMPS, PopGraph, SuperCROSS).
101. Another method of portraying statistical information was thematic mapping.  With the availability of specially developed mapping software or an industry-strength GIS, statistical data could be linked to geographic areas and displayed with great efficiency and clarity.  Particularly, thematic maps could show at a glance regional differences or similarities of different indicators such as population densities, fertility rates, health service coverage, etc.  As in any data release the confidentiality of data had to be maintained, especially in maps covering small areas.
102. The Workshop saw demonstrations of several software products for mapping and tabulation, namely PopMap, IMPS, SuperCROSS, Superstar and SuperMap.  In addition, a small group of Workshop participants used SuperCROSS for constructing simple cross tabulations from a synthetic database derived from perturbed data originating from the New Zealand Population Census.  Based on the demonstrations, the Workshop discussed the criteria to be considered when selecting tabulation and mapping software and agreed that important aspects were: (a) the capability to handle large data sets; (b) the availability of statistical calculation functions; (c) the possibility to compile camera-ready tabulations; (d) the suitability for dissemination use with newer media such as CD-ROM or the Internet; and most importantly, (e) the user friendliness of the software and (f) the cost.  In addition, for many countries the ability to handle non-Latin character sets was important.
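The kind of simple cross tabulation constructed during the demonstration can be illustrated with a minimal Python sketch.  It does not reproduce SuperCROSS or any other product named above; the function name, variable names and records are made up for illustration only.

```python
from collections import Counter


def cross_tabulate(records, row_var, col_var):
    """Count records for every (row, column) combination of two variables."""
    counts = Counter((rec[row_var], rec[col_var]) for rec in records)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    # Fill empty combinations with zero so that the table is rectangular.
    table = {r: {c: counts.get((r, c), 0) for c in cols} for r in rows}
    return rows, cols, table


# Made-up unit records, not taken from any census.
records = [
    {"sex": "F", "age_group": "0-14"},
    {"sex": "M", "age_group": "0-14"},
    {"sex": "F", "age_group": "15-64"},
    {"sex": "F", "age_group": "15-64"},
    {"sex": "M", "age_group": "65+"},
]

rows, cols, table = cross_tabulate(records, "age_group", "sex")
for r in rows:
    print(r, [table[r][c] for c in cols])
```

A production package would add weighting, totals and confidentiality protection, which this sketch deliberately leaves out.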
103. The Workshop agreed that when evaluating data extraction software packages the NSOs needed to pay particular attention to the ability of suppliers to demonstrate that they (a) could support the software; (b) could provide training, supply manuals and on-line help files; and (c) were prepared to let the software be tested thoroughly within the working environment where it would be operating.
104. In summary, the Workshop emphasized that the modernization of dissemination methods and the creation of products for new media were essential in order to reach a wider audience.  At the same time, new technology allowed the production and dissemination of information of special interest, customized for narrow target groups.  The Workshop recognized that ultimately it was the users of statistics who would determine how the data were to be presented and the manner in which they were to be delivered.
105. At the end of the module, the Workshop reviewed an application that was equally useful in the production and dissemination of statistics.  It learned from Statistics New Zealand that a Classification and Related Standards System (CARS) had been implemented with the aim of providing a centralized storage, maintenance and access facility for all classification data used in the input and output systems of the organization.  CARS contained historical classifications, code files and concordances and information relating to them; all economic, social and geographic standard classifications; survey specific classifications; and all classification categories used for coding survey data at the input stage and their descriptions and labels used in the presentation of output data.
106. The implementation of CARS in Statistics New Zealand had reduced the time and resources needed in developing new surveys; the quality of surveys had also improved.  In addition, comparison and analysis of data was facilitated by retaining concordances.  Classification information stored in CARS was accessible by a large number of staff in their day-to-day work.  A more limited but better qualified number of staff had access to the system for the maintenance of the information.  CARS was particularly useful in a statistical agency as it standardized all code files and descriptions used within all surveys or censuses conducted.  The major advantage was the ability to compare the use of variables between data sets.  For example, occupations within a labour force survey could be compared with occupations collected from the population census.  According to the information available, this was the first agency-wide implementation of such a system.
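The role of a concordance in comparing variables between data sets can be sketched in a few lines of Python.  This is an illustration only, not a description of how CARS was built; the codes, counts and the function name are hypothetical.

```python
# Hypothetical concordance: survey-specific occupation codes -> standard codes.
CONCORDANCE = {
    "LFS-11": "STD-1",
    "LFS-12": "STD-1",
    "LFS-20": "STD-2",
}


def recode(counts, concordance):
    """Re-express counts by survey code as counts by standard code.

    Codes missing from the concordance are reported rather than dropped
    silently, so that the classification can be maintained.
    """
    recoded = {}
    unmapped = []
    for code, n in counts.items():
        target = concordance.get(code)
        if target is None:
            unmapped.append(code)
            continue
        recoded[target] = recoded.get(target, 0) + n
    return recoded, unmapped


# Made-up labour force survey counts by survey-specific code.
lfs_counts = {"LFS-11": 40, "LFS-12": 10, "LFS-20": 25, "LFS-99": 3}
standard_counts, unmapped = recode(lfs_counts, CONCORDANCE)
print(standard_counts, unmapped)
```

Once both the survey and the census counts are expressed in the same standard codes, they can be compared category by category.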
Implications for the guidelines on the Application of New Information Technology to Population Data Dissemination
107. At the completion of module 5, the Workshop reviewed the draft guidelines on the Application of New Information Technology to Population Data Dissemination in the light of the proceedings, which were accessible through the Internet at 
http://www.unescap.org/stat/pop-it/pop-it5/meet_5.asp or 
http://www.unescap.org/stat/pop-it/pop-wit/pop-wit.asp
108. The Workshop identified several areas where the guidelines could be improved and requested the Working Party to implement the changes where possible.  The Workshop agreed that the guidelines would be easier to read if the various sections were consistently structured, and noted also that these guidelines required an introductory section and disclaimers regarding the intended coverage, as described in paragraph 81 of this report.
VIII. GEOGRAPHIC INFORMATION SYSTEMS
109. The Workshop noted that a Geographic Information System (GIS) was a computerized database system for storing, manipulating, retrieving, displaying and printing spatial and non-spatial geographic data and their attributes.  GIS was especially useful for statistical offices in the preparation of enumeration area maps and in illustrating census and survey results through thematic maps.  Several comprehensive GIS software products such as MapInfo and ArcInfo were available, with functionality far exceeding the immediate needs of statistical offices in developing countries.  However, low- or no-cost software solutions for mapping were also available, such as PopMap developed and distributed by the United Nations.
110. The Workshop was informed that to create a GIS database, paper-based maps needed to be digitized and geo-coded, either manually or by scanning.  The map information could also be imported from existing map data files.  If maps did not exist, methods such as aerial photography or remote sensing could be used to create them.  However, often those options were costly and beyond the means of the statistical office to implement entirely from its own resources.  The Workshop heard that the Global Positioning System (GPS), which had recently become popular and affordable for navigational purposes, could be beneficially used to create detailed enumeration area maps.
111. The Workshop was informed about the components, features and limitations of GPS.  The GPS was based on 24 operational satellites, which were orbiting at 20,200 km above ground and were controlled, monitored and synchronized from five ground stations.  With the help of a cheap, handheld mobile GPS unit, longitude, latitude and altitude co-ordinates could be calculated by receiving signals from at least three satellites.  The system provided an inherent accuracy of 5 metres or better.  However, the Workshop was informed that the launcher of the satellites, the United States Department of Defence, intentionally manipulated 1/ the data sent by satellites so that the actual accuracy in civilian use was no better than 100 metres.  The Workshop heard that to overcome that limitation the industry had developed a so-called Differential GPS, which relied on nearby fixed ground units with which the mobile unit communicated in order to receive updated corrections to measurements calculated from the satellite information.  With that correction an accuracy of better than 2 metres could be achieved, which should be good enough for any application the statistical office might have.
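The principle of the differential correction can be illustrated with a simplified Python sketch.  It assumes that the correction is applied directly to positions expressed in local metric co-ordinates (east, north); actual Differential GPS equipment typically corrects the individual satellite range measurements instead.  All names and values are made up for illustration.

```python
def differential_correction(base_true, base_measured, rover_measured):
    """Apply the error observed at a fixed base station to a mobile unit.

    The base station sits at a surveyed position, so the difference between
    its surveyed and its GPS-measured position estimates the error that both
    units share at that moment.
    """
    correction = tuple(t - m for t, m in zip(base_true, base_measured))
    return tuple(r + c for r, c in zip(rover_measured, correction))


# Made-up co-ordinates in metres (east, north).
base_true = (1000.0, 2000.0)       # surveyed position of the fixed unit
base_measured = (1060.0, 1950.0)   # what the fixed unit's receiver reports
rover_measured = (1310.0, 2150.0)  # what the handheld unit reports

rover_corrected = differential_correction(base_true, base_measured, rover_measured)
print(rover_corrected)
```

The sketch works only to the extent that the two units experience the same error, which is why the fixed unit needs to be nearby.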
112. The Workshop was given an outdoor demonstration of a handheld GPS (manufactured by Magellan).  Coordinates were continuously recorded while Workshop participants walked around the block.  Back in the meeting room, the list of coordinates was transferred from the GPS unit to a computer and the MSTAR software by Magellan was used to convert the coordinates into plots and graphic images.
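A list of recorded coordinates of that kind can be processed with very little code.  The following Python sketch is not the MSTAR software; it merely computes the length of a recorded track with the haversine formula, assuming a spherical Earth and using made-up points.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)


def haversine_m(p, q):
    """Great-circle distance in metres between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def track_length_m(points):
    """Sum the distances between consecutive recorded points."""
    return sum(haversine_m(a, b) for a, b in zip(points, points[1:]))


# Made-up track: a walk around a block, returning to the starting point.
track = [
    (13.7640, 100.5080),
    (13.7650, 100.5080),
    (13.7650, 100.5092),
    (13.7640, 100.5092),
    (13.7640, 100.5080),
]
print(round(track_length_m(track)), "metres")
```

For the short distances involved in enumeration area mapping, the spherical approximation is normally adequate; the accuracy of the recorded points matters far more.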
113. The Workshop was also given a demonstration of the ArcView software, a module of the ArcInfo GIS. Based on a Bangladesh Ward map showing several enumeration areas, various features were displayed such as map viewer, table displayer, layout map composer, table charter and script text editor.
114. The Workshop heard presentations of the two pilot projects that were implemented by Bangladesh and the Philippines as components of the UNFPA-funded project RAS/96/P12.  The Bangladesh pilot project concentrated on the use of GPS for the creation and updating of enumeration area maps.  The Philippines pilot project developed a census operations management system, called Quick Count, to be used in the 2000 population and housing census.  The application was based on the use of GIS and the World Wide Web.  The intention was to provide managers at all levels with up-to-date information throughout the 30-day enumeration period that would be available through the Internet to anyone who was pre-authorized to access it.  Above a certain level of access authority managers would be allowed to update the information.  It was expected that the Quick Count system would report preliminary census results very soon after enumeration was completed.  Due to budgetary constraints, the National Statistics Office of the Philippines elected to develop its own GIS solution based on the FLY shareware obtained via the Internet.  It was expected that the Quick Count system would be tested in connection with the forthcoming enumeration for the pilot census.
115. The Workshop concluded that GIS as well as GPS were valuable tools for the statistical office to better cope with the cumbersome mapping task.
Implications for the guidelines on the Application of Geo-Positioning Systems and Geographic Information Systems for Digital Mapping and Statistical Management
116. The Workshop noted that the lessons learned from the two pilot projects would be reflected in the guidelines and hoped that a complete draft version of the guidelines would be swiftly made available on the Internet.
IX. RECOMMENDATIONS OF THE WORKSHOP
General, IT management
117. The Workshop agreed that the conduct of censuses and surveys was necessarily becoming increasingly technology intensive.  It recommended that national statistical offices keep abreast of the latest information technology by continuously monitoring technology evolution and by upgrading production and office systems periodically.
118. Appreciating the excellent cooperation and contribution received during the project, the Workshop recommended that technologically advanced offices continue to share with others their experiences in adopting new information technologies.
119. Noting that modern data capture technologies (OMR, OCR/ICR, CAPI, CATI, Internet data collection) had uses in many sectors, the Workshop recommended that in order to keep IT applications cost-effective, census and survey organizations should collaborate, among themselves and with other agencies, in the procurement and post-census use of the equipment and software.
120. The Workshop noted that for many countries budgetary constraints hampered the effective application of new technology.  It requested the bilateral and multilateral donor agencies to increase their assistance to developing countries for IT applications, and recommended that the technical cooperation among developing countries (TCDC) modality be promoted for an enhanced sharing of IT experience and skills through expert visits and study tours.
121. The Workshop recommended that statistical offices upgrade their organizational IT knowledge and create a modern IT culture, and develop prudent procurement methods to match the skilful and articulate marketing techniques of private sector vendors.
122. The Workshop recommended that Governments should take into account in their procurement rules the overall costs and benefits that each technology alternative offered in the long term, and not take decisions solely on the bid price for a particular application.
123. The Workshop emphasized that it was crucial for senior management in the national statistical offices to increase its awareness of trends in information technology and the associated costs and benefits, and to improve related management skills.
124. The Workshop recommended that national statistical offices should ensure that any vendor being considered for the supply of new technology systems was able to substantiate its claims.  Statistical offices should have a benchmark drawn up addressing their requirements, before the commencement of the evaluation process.  They should also ensure that staff evaluating potential systems and vendors have a good knowledge of the requirements and of the technology being evaluated.
125. The Workshop recommended that NSOs and census organizations make full use of the opportunities that new information technology offered in the conduct of censuses.   They should bear in mind that no stage of a census could now be planned and executed without taking technology into account and that new technology had merged certain stages in the census operation.  The Workshop recommended that census organizations make corresponding changes in their organizational and management structures, and adjust the resources available for IT procurement, recruitment of skilled staff, and training of existing staff.
126. Given that statistical offices had to take the whole range of census operations into consideration while assessing the implementation of new technology applications, the Workshop recommended the application of quality management strategies as a useful method for control of the whole process.  Further, the interoperability of the various components to be chosen required special attention, not only with regard to the operational aspects but also in terms of the integrity of the huge masses of data to be processed.
127. Recognizing that many developing countries were using public domain software packages, the Workshop recommended that ESCAP should promote sharing of experiences on the use of such packages with a view to maximizing the benefits of those applications.
128. The Workshop recommended that statistical offices should avoid procuring hardware and software that did not run under common operating systems, that did not provide integration with other systems, that was not easily extendable, that had no indication of long-term support and that was likely to lead to dependency on one vendor.
129. Noting that electronic format had many advantages over hard copy format, the Workshop recommended that statistical offices should aim at digitizing census and survey information as early as possible.  That would involve greater utilization of existing electronic records (administrative records), adoption of computer-aided interview technologies, and scanning of census forms immediately after enumeration.  Electronic format minimized manual handling of forms and allowed maximum flexibility in data verification and editing.
130. The Workshop noted that it was essential for statistical offices to ensure, as part of the evaluation process, that selected vendors had the commitment and capacity to train the statistical office staff in the hardware or software, and to provide continuing service and support.
131. Considering scarce resources, especially in the small developing countries and areas, the Workshop recommended that on a subregional level governments should find ways to cooperate in the purchase and utilization of expensive current technology, e.g., by sharing the cost of acquisition and responsibility for operation and maintenance of such equipment.  Further, the Workshop recommended that governments of developed countries and areas, which operate such advanced-technology systems, should make their use available to developing countries in the region, preferably at nominal or no cost.
132. The Workshop noted that language capabilities of data capture and dissemination software were important in many countries.  In the area of data capture, the OCR/ICR engines achieved very high recognition rates for hand-written characters in a limited number of languages; that efficiency was not matched for numerous other languages in Asia.  Similarly, many NSOs required bi- or multilingual capabilities for data tabulation and dissemination software.  The Workshop recommended that the NSOs express language capability as one of the prerequisites for software acquisition, and recommended that the software developers expend efforts in incorporating local language and multilingual capabilities in their products.  In that regard, it was noted that the Workshop had provided an excellent opportunity for the vendors to better understand the needs of the NSOs and also explain some of the features of their products which were of interest to the NSOs.
133. The Workshop recommended that further technical meetings be held after the 2000 and 2001 censuses to share information on technology lessons learned, and to promote effective data utilization and dissemination.
134. To facilitate exchange of experiences, ideas and information on resourcing and other topics, it was recommended that an e-mail based discussion group be established.
Data collection and capture
135. As the current data capture technology provided increasingly powerful means of handling data on numerous topics for large collections, the pressure for expanding the scope of the census was mounting. The Workshop cautioned that in considering those demands, census statisticians must not ignore the operational aspects of actual data collection in the field, the skill levels required for data collection and handling, and the technical requirements.
136. The application of IT would also assist countries in improving the management of errors and coding of captured information from censuses and surveys. The Workshop recommended that greater sharing of information should be promoted in those areas, including computer-assisted coding.
137. The Workshop recognized that selection of data capture technology was a crucial success factor in census taking.  It advised census organizations to assess carefully all costs, including the implications for various census operations, involved in the selection, procurement, operation, maintenance and management of capture technology.
138. The Workshop recommended the conduct of at least one and preferably two major tests using real forms, real enumerators and real respondents to test systems.  Testing was needed for:
  • the selection of the preferred technology
  • refinement and improvements in the technology
  • development of procedures and arrangements related to the implementation of the technology 
  • the building of awareness within management about how the new technology should be handled
  • calculation of the resources needed for the main event
  • preparation of the content, schedule, and methodology of training to be carried out.
139. The Workshop recommended that census organizations make full use of the flexibility that was offered by new imaging and recognition technologies, for instance by planning for an early release of results for the most important topics.
140. The Workshop recommended that census organizations evaluate data capture solutions carefully taking into account country circumstances.  Evaluation results obtained elsewhere were not necessarily directly applicable, due to differences in handwriting patterns, questionnaire design, and availability of quality paper, ink and printing facilities.  It noted that competitive benchmark testing had become a standard evaluation method in large census organizations all over the world.
141. Noting that the available character recognition software was developed for universal use and that the turn-key OCR/ICR solutions were restricted to data capture (and did not cover the whole census operation), the Workshop recommended that software developers incorporate in character recognition applications statistical features, such as classifications that assisted in data coding.
142. The Workshop recommended that statistical organizations planning to use OCR/ICR should develop procedures to control the quality of recognition.   It was particularly important to search and check for non-random bias caused by systematic recognition errors.
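One possible form of such a check is sketched below in Python.  It assumes that a sample of recognized characters has been verified by manual keying, counts each kind of substitution, and flags substitutions that account for a large share of all errors.  The function names, the threshold and the sample are hypothetical and are not drawn from any census.

```python
from collections import Counter


def recognition_report(pairs):
    """Return the error rate and a count of each (keyed, recognized) substitution."""
    total = len(pairs)
    errors = Counter((truth, read) for truth, read in pairs if truth != read)
    error_rate = sum(errors.values()) / total if total else 0.0
    return error_rate, errors


def systematic_confusions(errors, min_share=0.5):
    """List substitutions that make up at least min_share of all errors.

    Errors concentrated in one substitution point to a systematic, non-random
    problem rather than to scattered misreadings.
    """
    n_errors = sum(errors.values())
    if n_errors == 0:
        return []
    return [pair for pair, n in errors.most_common() if n / n_errors >= min_share]


# Made-up verification sample: (manually keyed value, recognized value).
sample = [
    ("1", "1"), ("7", "1"), ("7", "1"), ("7", "7"), ("3", "3"),
    ("5", "6"), ("7", "1"), ("0", "0"), ("2", "2"), ("9", "9"),
]

error_rate, errors = recognition_report(sample)
flagged = systematic_confusions(errors)
print(error_rate, flagged)
```

In this made-up sample most errors are a "7" read as "1", which would bias any tabulation of the affected field in one direction even if the overall error rate looked acceptable.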
143. The Workshop agreed that imaging should not be used simply as a data capture replacement technology, and recommended that statistical organizations identify which other census processes were affected and determine how they could be made more efficient and cost-effective.
144. The Workshop recommended that census and survey offices should consider outsourcing as an option for implementing elements of censuses and surveys.   It noted that the feasibility of outsourcing depended on national circumstances, the organization's own resources and skills, and the availability of external partners.  It heard of the experiences of the Singapore Department of Statistics in developing an innovative multi-modal data capture system for the year 2000 census by using several external developers.   The Workshop noted that the multi-vendor approach required clear delineation of responsibilities for system development and support, which could be conveniently achieved by using a prime contractor approach.
Guidelines
145. The Workshop identified several areas where the guidelines on data collection and capture could be improved and requested the Working Party to implement the changes where possible. 
Data warehousing, databases, data archiving
146. The Workshop noted that data warehousing was a new technology with high potential for increasing the value of census and survey data by linking them to other data holdings.  Data warehouses provided access to a variety of different databases and created the possibility of combining statistical data from various statistical surveys.  The Workshop, however, recommended that NSOs develop these data warehouses in a modular fashion and keep long-term needs prominently in mind: "Start small, but think big".
147. Noting that getting the data models correct was probably the most important success factor in the implementation of a data warehouse, the Workshop strongly recommended the sharing of data models amongst the statistical offices.
Data Dissemination
148. The Workshop recognized that the evolution of information technology was not only continuously offering opportunities for increasing operational efficiency, but was also affecting the requirements of data users. The Workshop recommended that statistical offices should periodically assess the needs and perceptions of the users in order to be able to deliver census and survey results through channels and formats that customers expected.
149. Noting that the Internet was a cost-effective dissemination mode both for data providers and users, the Workshop recommended that NSOs should establish and develop web sites as a major data and information dissemination channel.
150. The Workshop noted the variety of web sites available from statistical organizations and recommended that offices investigating the option of setting up a site of their own should evaluate the features of other sites.
151. The Workshop recommended that NSOs start the development of web sites from simple structures and designs that allowed expansion of the site in a modular fashion, and provided accessibility to users with narrow bandwidth.
152. The Workshop noted that small island countries did not have the skills as yet to develop their own web sites; the cost was also a major factor.  The Workshop recommended that countries which did not yet have their own web site should look at the feasibility of acquiring space on another organization's server, or on a server in another country.  The Workshop recommended that this information be included in the draft guidelines.
Mapping and GIS
153. Noting that maps were the best way to illustrate spatial features of population, the Workshop recommended that statistical offices create new products that utilize digitized maps.   It also noted that maps were essential in census planning, field work and operations monitoring, and that in the long run, geographic information systems were a feasible option for creating accurate multi-purpose maps.
154. Noting that GPS (Global Positioning System) offered a cost-effective option for determining spatial coordinates, the Workshop recommended that NSOs should consider this technology option for improving the accuracy of area maps required in census and survey field work.
155. The Workshop emphasized the need for promoting training on special topics related to the application of IT to census and survey operations.
Follow up
156. At the end of the Workshop, the participants proposed that a follow-up workshop should be organized during the second half of 2000 to exchange information about the technological successes and failures in the data capture and data processing of the year 2000 round of censuses in the region.  That workshop could also cover issues of data dissemination and data use.
 
Annex I
LIST OF PARTICIPANTS
 
Annex II
TENTATIVE TIME SCHEDULE
 
Annex III
LIST OF DOCUMENTS
Symbol   Title

STAT/WNIT/L.1 Provisional agenda
Module 1:  Introduction to IT in census operations
Module 1.1 Objectives of the Workshop*
Module 1.3
  • An overview of the project RAS/96/P12
  • Project RAS/96/P12*
Module 1.4 Introduction to Census Operations*
Module 1.5
  • Result of the ESCAP Survey on Applications of  Information Technology to Population Data
  • Presentation paper*
Module 1.6 Information Technology Trends and their impact on Census Data Processing*
Module 1.7 IT Management Challenges*
Module 1.9 Expectations for the year 2000 rounds of censuses*
Module 2: Paper based data collection and capture
Module 2.1 An Overview of Paper Based Data Collection and Capture Technologies*
Module 2.2 An Overview of the OMR Technology (Based on the experiences in Japan)*
Module 2.3 The Use of Optical Mark Reading (OMR) for Census Data Collection**
Module 2.5 OCR Questionnaire*
Module 2.7 OCR Technology Selection for 2000 Population Census in Indonesia*
Module 2.8 Application of Imaging Technology for Capturing Population Census Data
Module 2.9
  • Recent Experience in Using New Technologies for Census**
  • AFPSPRO - modules description**
  • Configuration for UN's Demo (Census)**
Module 2.10 Improving Work flows by using Imaging for the New Zealand Population Census
Module 3: Non-paper based data collection and capture
Module 3.1 Introduction to non-paper based data collection and capture technologies - CAPI*
Module 3.2 Efficient Computer Aided Telephone Interview (CATI)*
Module 3.3
  • Computer Assisted Personal Interviewing Solutions in Australia
  • Attachment 1: CAI Manual Outline
  • Attachment 2: Diary and Office Processing: Integrating Blaise with Other Facilities
  • Attachment 3: Sample Business Case for the use of Computer Assisted Interviewing in Household Surveys
Module 3.4 Data Collection Through the Internet - IT Design & Security Issues*
Module 3.5 Blaise: A survey processing system*
Module 3.6 Integration of Different Modes of Data Capture*
Module 2&3 
  • Guidelines on the Application of New Technology to Population Data Collection and Capture
Module 3.8
  • Guidelines on the Application of New Technology to Population Data Collection and Capture (Presentation paper)*
Module 4: Adding value to census data through data warehousing and data mining
Module 4.1
  • Adding Value to Census Data through Data Warehousing**
  • Stranded on Islands of Data**
Module 4.2 Data Warehousing**
Module 4.3 SAS demonstration**
Module 5: Data dissemination
Module 5 Guidelines on the Application of New Information Technology to Population Data Dissemination
Module 5.1-5.5 Data Dissemination*
Module 5.6
  • PopMap*
  • Use of IMPS for Census and Survey Data Dissemination in the Philippines*
Module 5.10 Graphs*
Module 5.11 Maps*
Module 5.14 Interesting features of Statistical Office web sites in the ESCAP Region
Module 5.16
  • Statistics New Zealand's Classifications and Related Standards (CARS) System
  • Classification and related Standards system (CARS)*
Module 6:  Geographic information systems
Module 6 Guidelines on the Application of GPS and GIS Technologies for Digital Mapping and Statistical Management
Module 6.3
  • Demonstration of GPS and its Applications in Digital mapping
  • Theory of DGPS
  • DGPS Survey Manual
  • Demonstration on Application of Arc/Info, Arcview and ERDAS Imagine Softwares in Digital Mapping and GIS
Module 6.4
  • Application of GPS for Digital Mapping and GIS
  • Application of Modern Mapping and GIS Technology to Census
  • Use of GPS for Preparation of Census Enumeration Area Maps and Mauza Database
Module 6.5
  • Pilot Application of GIS to the Philippines Census 2000 Operations
  • Presentation paper*
Background papers
  • Data Processing for Demographic Censuses and Surveys
  • Report of the Workshop on Computer-Assisted Coding New Zealand, 17-21 April 1989
  • Kazakhstan, 1999 Census
  • Statistics New Zealand
  • Link to other government statistical offices
  • Distribution of Household, L2+KBL2 Form

* PowerPoint or other computer-based presentation.
** Vendor material.
 