Workshop on Application
of New Information Technology to Population Data
Bangkok, 12-20 October
1999
Information Technology
Trends and their impact on Census Data Processing
(Presentation paper)
Curiosity is probably a very
early developed human trait. The wish to walk
faster and to travel further, to till more land
and to lift more weight, has resulted in recent exponential
development in most technical areas.
General
IT trends
Faster
Smaller
Cheaper
Handier
Better
Like general human progress,
IT trends are also characterized by faster, smaller, cheaper, handier and better solutions
1890 first electro-mechanical
tabulator, Hollerith
1940's first electronic
computer, vacuum tubes, UNIVAC, punch tapes
and cards, machine code programming, occupied
a huge hall
1950/60's transistors,
mini computers, magnetic tapes, line printers,
higher level programming languages (Cobol,
Fortran)
1970's solid state
integrated circuits, LSI, VLSI, micro chips,
hard disks, diskettes, disk operating systems
DOS, transaction based online systems, object
oriented programming
1990's color VDUs and
printers, DVD, PDAs, OCR/ICR, VoiceR, enterprise
computing, intranet, internet, warehousing,
expert systems.
Principal
trend setters
Statistical organization
Private sector,
government sector, universities, home
entertainment
Manufacturers
Interesting to note the shifting
driving forces behind the technological developments
Often, first computing
resource in country at statistical organization.
Resource was used by others (finance, administration)
With improved technical
infrastructure, other organizations became
avant-garde users and statistical offices
are adapting the evolving technology to their
needs
Lately, it is the computer
manufacturing industry which drives technological
development and pushes it onto the user community
Relevant
hardware technologies
High-performance
and high-capacity stationary and mobile
micro computers
High-capacity
fixed and exchangeable hard disk storage
devices
Color VDUs and
printers
Optical scanning
devices
Writeable CDs,
Digital Video Disks
Local and wide-area
networks
Remote sensing,
Global Positioning System (GPS)
Computers: 400-600 MHz,
64-256 MB RAM
Disk storage: 6GB and
up, Zip drive 200 MB
Printers: 5-8 ppm for
personal, combi-features for home office use
Scanners: very cheap
home use, high-capacity industrial use
CD 600 MB, DVD 5-18
GB, impressive retrieval speed good for video
replay
Networks: provide
for work groups and organization-wide data sharing
Remote sensing: for
cartography, accuracy can be better than 1m
GPS for mapping of enumeration areas
Relevant
software technologies
Graphic interface
operating systems
Hierarchical
and relational database systems
Metadata systems
Statistical analysis
tools
Optical and intelligent
character recognition
Geographic information
systems
Graphic interface:
first developed by Xerox, adapted by Apple,
appropriated by Windows
Databases: for analytical
processing: square files, transposed files
(Redatam); for transaction processing: relational
databases (Access, dBase, Oracle)
Metadata: data about
meaning, content, organization and purpose
of data
Statistical analysis
s/w: SPSS, SAS, special demographic s/w
OCR/ICR: improved processing
power gives better results: Uruguay 1996:
preprinted numeric 99.98%, marks 99.7%, handwritten
numeric 98.9%, handwritten alpha 97.4% (but
about 15% of forms had to be manually improved
before submitting to the scanner)
GIS: for cartography:
ArcInfo, MapInfo (commercial); for thematic
mapping: PopMap (free), SuperMap (commercial)
Relevant
software technologies (continued)
Integrated office
management
Project planning
and management tools
Typesetting
On-line services,
bulletin boards
E-mail, internet,
world wide web
Integrated office management:
inter-office access to common information,
document sharing
Project planning: MS
Project, Timeline, Primavera, critical path
resource planning
Typesetting: transfer
of printed output in digital form to printing
house
On-line service: external
end-user access to basic information, BBS:
internal access to instructions, documents
E-mail and internet:
efficient correspondence (with audit trail),
dissemination of reports.
Anticipated
future Trends
Improved hardware
price/performance ratio
Continued miniaturization
Mobile computing,
incl. wireless communication
Expanded world
wide web, E-commerce
Improved expert
systems (ICR, voice recognition)
Warehousing,
data mining
Multimedia
Improvements of current technologies
will have noticeable effects on the efficiency,
timeliness, quality and visibility of census processing.
Concerning completely new technologies, I cannot
see any on the horizon apart from robotics or
a fully developed Orwellian environment where
every citizen is watched and controlled at all
times. But if we get that far, then we won't need
a population census anymore.
Improved hardware:
(a) Faster and cheaper equipment, more affordable,
better performing, better quality, more throughput;
(b) Increasingly powerful software, greater
sophistication and complexity of problem solving,
more timely, more relevant and more useful
results; (c) Improved user-friendliness; (d)
Better targeting of results
Continued miniaturization:
smaller and sturdier equipment, ever increasing
storage capacity
Better mobility: hand-held
PDAs with WIN-CE and wireless transmission
for intelligent data collection (CAPI). Mobile
phone for voice transmissions from remote
areas
Expanded WWW: improved
dissemination efficiency, dynamic data retrieval,
income generating
Expert systems: advanced
knowledge based software solutions for: ICR
at image processing or directly at point of
data collection (write pad), voice recognition,
data mining, (far in future: enumeration by
robots?)
Warehousing: currently
mainly for commercial use to identify consumer
preferences, trends and unexpected relationships
contained in large and varied data sets
Multimedia: driven
by home entertainment, dissemination of dry
statistics can perhaps be made more intriguing
for the end-user
IT supported
elements of census processing
Planning and Management
Mapping
Forms and manuals
Data collection
Data capture
Coding
Error checking
Editing
Output
Dissemination
Analysis
These are the various steps
in the census process which can be supported by
IT. Three areas will be covered in detail during
the workshop, namely data capture, dissemination
and mapping.
Planning
and Management
Process and resources
critical path
budget
Work flow
questionnaires
data files
data back-up
MS Project
Timeline
Quicken
Spreadsheet
IMPS/Centrack
Critical path: important
to plan from the start, use available means
to define activities and resource requirements
and obtain critical path, manage the implementation
of the plan, assure feed back from line offices
to keep progress up-to-date
Budget control: important
to control the budget at project level, even
if perhaps the Ministry of Finance is responsible
for the official records
Workflow control: needs
clear advance definition (EA list) with count,
processing plan, tracking of EA folders and
data files through various stages of processing
Back-up system: must
be effective and assure safety of data
Virus protection
Challenge is to keep
implementation plan updated and document and
data flow under control
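The critical path idea in the notes above can be illustrated with a minimal sketch. The activity names, durations (in weeks) and dependencies are invented for illustration and are not taken from any real census plan; tools such as MS Project perform the same kind of calculation with far more detail.

```python
# Hypothetical census activities: name -> (duration in weeks, prerequisites).
# All names and figures are assumptions made for this example.
ACTIVITIES = {
    "mapping": (12, []),
    "questionnaire_design": (6, []),
    "printing": (4, ["questionnaire_design"]),
    "enumeration": (3, ["mapping", "printing"]),
    "data_capture": (10, ["enumeration"]),
    "tabulation": (6, ["data_capture"]),
}


def critical_path(activities):
    """Return (total duration, activities on the longest dependency chain)."""
    earliest_finish = {}   # activity -> earliest possible finish time
    latest_pred = {}       # activity -> prerequisite that finishes last

    def finish(name):
        if name in earliest_finish:
            return earliest_finish[name]
        duration, prerequisites = activities[name]
        start = 0
        latest_pred[name] = None
        for pred in prerequisites:
            pred_finish = finish(pred)
            if pred_finish > start:
                start = pred_finish
                latest_pred[name] = pred
        earliest_finish[name] = start + duration
        return earliest_finish[name]

    # The project ends with the activity that finishes last.
    last = max(activities, key=finish)

    # Walk back through the latest-finishing prerequisites.
    path = []
    node = last
    while node is not None:
        path.append(node)
        node = latest_pred[node]
    path.reverse()
    return earliest_finish[last], path


total_weeks, path = critical_path(ACTIVITIES)
print(total_weeks, path)
```

In this invented plan, any delay in an activity on the returned path delays the whole project, which is why feedback from line offices on those activities matters most.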
Mapping
Geographic Information
Systems
vector
raster
ArcInfo
MapInfo
PopMap
GIS: Vector-efficient,
space saving, elegant scale change. Raster
- sufficient for EA maps but space consuming
storage of graphic image
Commercial systems:
ArcInfo and MapInfo are industrial strength
products, perhaps unnecessarily powerful for
census mapping needs. SDBQ, SuperMap, Redatam
are specially developed for statistical use
Free software: UN/Vietnam
developed PopMap, IMPS, MapView
Forms and
manuals
Questionnaires
Control forms
Manuals
Tabulation plan
Census design system
Word processor
Form maker
Spreadsheet
Numerous documents
to be prepared:
questionnaires,
control forms,
preliminary manual
count forms,
batch transfer forms,
manuals for enumerator
and supervisor,
editing and coding
rules and instructions,
tabulation plan and
table definitions,
analytical and administrative
reports,
regular office communications.
Census Design System
by the US Bureau of the Census. First mentioned
in 1996, but development seems delayed (funding
problems?)
Data collection
Paper questionnaires
door-to-door enumeration
mail-in
CAPI
fixed collection points
door-to-door
PDAs
CATI
E-form
This is an area where improvement
would have the most benefit to reliability and
timeliness of the further processing of census
data and to the overall quality of results.
Paper questionnaire:
unreliable, individual interpretation by enumerator
or respondent
CAPI, CATI and E-form
should minimize such variations due to computer
validation. Door-to-door used for surveys
Fixed points such as
customs, magistrate
PDA already successfully
used as enterprise platform, great hope for
future, as these would bring significant improvements
to reliability, quality and timeliness of
census processing. Problem might be typing,
but write pad capability of PDAs will improve.
Slow-down due to error checking compensated
by reply sensitive guidance through questionnaire
Voice recognition still
far off, but could start playing a role in
a few years
E-form could be efficient
but too few respondents with access to internet,
even in highly developed countries
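The computer validation and reply-sensitive guidance mentioned in the notes above can be sketched as follows. The three questions, their validity rules and the routing (for example, asking about employment only from age 15) are assumptions made for this example, not rules from any actual census questionnaire.

```python
# Hypothetical questionnaire: every question has a validity check and a
# routing rule that picks the next question from the answer just given.
QUESTIONS = {
    "age": {
        "valid": lambda v: v.isdigit() and 0 <= int(v) <= 120,
        "next": lambda v: "employed" if int(v) >= 15 else None,
    },
    "employed": {
        "valid": lambda v: v in ("yes", "no"),
        "next": lambda v: "occupation" if v == "yes" else None,
    },
    "occupation": {
        "valid": lambda v: len(v.strip()) > 0,
        "next": lambda v: None,
    },
}


def interview(answers, first="age"):
    """Walk the questionnaire; return (accepted record, rejected questions)."""
    record, rejected = {}, []
    question = first
    while question is not None:
        value = answers.get(question, "")
        spec = QUESTIONS[question]
        if not spec["valid"](value):
            # A real device would re-prompt the enumerator at this point;
            # the sketch simply stops and reports the rejected question.
            rejected.append(question)
            break
        record[question] = value
        question = spec["next"](value)   # reply-sensitive routing
    return record, rejected


print(interview({"age": "9"}))
print(interview({"age": "34", "employed": "yes", "occupation": "teacher"}))
print(interview({"age": "340"}))
```

Because irrelevant questions are skipped automatically, the time spent on checking can be partly recovered during the interview itself.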
Data capture
Key-to-disk data
entry
OMR
Image scanning
with OCR/ICR
IMPS/Centry
Of course, if we have to have
paper based data collection, then improvement
of data capture will have significant benefits
in time and accuracy
Key-to-disk is for many
developing countries still the preferred mode;
additional advantages are equipment influx and DP training
OMR was successfully
used already in the 1980s (Caribbean, Bangladesh)
but has stringent paper quality and environmental
demands
OCR can also read marks;
ICR interprets handwritten characters
(numbers better than alpha). Stored images
can be consulted during coding and editing
without further reference to the paper questionnaires
Coding
Manual (before
data capture)
Computer assisted
(after data capture)
Automatic
Manual coding very slow,
cumbersome and rather unreliable
Computer assisted coding
gives significant gains in consistency due
to look-up tables
Automatic coding after
OCR/ICR: feasible only for certain variables
such as gender and age, less so for occupation
and industry
Perhaps a combination of computer
assisted and automatic coding is most feasible
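A combination of automatic and computer assisted coding with a look-up table could look like the sketch below. The code list and its codes are invented placeholders, not an actual classification, and the fuzzy matching uses Python's standard difflib module as one possible technique.

```python
import difflib

# Hypothetical look-up table; a real census would load the national
# occupation classification instead of these placeholder codes.
OCCUPATION_CODES = {
    "farmer": "01",
    "teacher": "02",
    "nurse": "03",
    "driver": "04",
}


def suggest_codes(response, code_list=OCCUPATION_CODES, limit=3):
    """Return candidate (label, code) pairs for a write-in response.

    An exact match gives a single candidate, which could be coded
    automatically. Otherwise close spellings are offered so that a
    coder chooses from the same list every time.
    """
    text = response.strip().lower()
    if text in code_list:
        return [(text, code_list[text])]
    close = difflib.get_close_matches(text, list(code_list), n=limit, cutoff=0.6)
    return [(label, code_list[label]) for label in close]


print(suggest_codes(" Teacher "))   # exact match after normalisation
print(suggest_codes("techer"))      # misspelling, offered as a suggestion
print(suggest_codes("xyz"))         # nothing similar, left to the coder
```

The consistency gain comes from every coder seeing the same candidates for the same response, rather than consulting a printed code book individually.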
Error checking
Manually
Automatic, with
error listings
Automatic, including
imputation
pre-determined
hot-deck
undetermined
IMPS/Concor
Even with much improved or almost
perfect data collection and capture techniques,
this processing step will always have to be performed
Manual checking. Some
basic checks always required before data capture,
such as geographic code and presence of essential
fields
Trend is toward imputation:
pre-defined, a fixed
value depending on some indicators within
record,
hot-deck, copy value
from another record with similar characteristics
undetermined, category
for clearly out-of-range and inconsistent
values
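The hot-deck and undetermined categories above can be illustrated with a minimal sequential hot-deck sketch. The field names, the grouping by sex and ten-year age band, and the list of valid values are assumptions made for this example; production editing packages apply considerably more elaborate rules.

```python
def hot_deck_impute(records, field, is_valid, group_key):
    """Replace invalid values of `field` with the most recent valid value
    seen in a record with similar characteristics (same group).

    Records are processed in file order. When no donor exists yet for the
    group, the value is set to "undetermined". A flag records what was done.
    """
    donors = {}   # group -> last valid value seen (the "hot deck")
    result = []
    for rec in records:
        rec = dict(rec)              # leave the input records unchanged
        group = group_key(rec)
        value = rec.get(field)
        if value is not None and is_valid(value):
            donors[group] = value
            rec[field + "_flag"] = "reported"
        elif group in donors:
            rec[field] = donors[group]
            rec[field + "_flag"] = "hot-deck"
        else:
            rec[field] = "undetermined"
            rec[field + "_flag"] = "undetermined"
        result.append(rec)
    return result


# Hypothetical person records; "xx" stands for an out-of-range entry.
people = [
    {"sex": "F", "age": 34, "marital": "married"},
    {"sex": "F", "age": 36, "marital": None},
    {"sex": "M", "age": 8, "marital": "xx"},
]
valid_status = {"single", "married", "widowed", "divorced"}

cleaned = hot_deck_impute(
    people,
    field="marital",
    is_valid=lambda v: v in valid_status,
    group_key=lambda r: (r["sex"], r["age"] // 10),
)
for row in cleaned:
    print(row)
```

Keeping a flag next to every imputed value allows the share of imputed data to be reported in the administrative documentation.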
Editing
Manual
Computer assisted
Manual editing: like all manual processes,
unreliable and time consuming, error prone
and inconsistent
Computer assisted editing
results in improved speed and consistency,
accuracy, can be done automatically in connection
with validation
Output
Database
Tabulation
Thematic maps,
graphs, census atlas
Administrative
reports
Analytical reports
IMPS/Cents
Redatam Plus
PC-Axis
PopMap
SDBQ,
SuperMap
A variety of output possibilities,
some are essential, i.e. database, tabulation
Databases: microdata
stored as square hierarchical file, transposed
indexed file, macrodata in table format (printout
copy or aggregate data), integrated metadata
systems in preparation for warehousing
Tabulations are primarily
on paper, lowest unit: village level
Maps and graphs: help
to visualize results better
Administrative reports:
provide full documentation of the entire census
undertaking, including lessons learned
Analytical reports:
extensive analysis usually by outside organizations
after census project is completed
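The difference between a square file and a transposed file, and how a tabulation is derived from microdata, can be sketched as follows. The sketch assumes that "transposed" means one list of values per variable instead of one record per person; the records themselves are invented.

```python
from collections import Counter

# Hypothetical microdata in "square" layout: one record per person.
SQUARE = [
    {"village": "A", "sex": "F", "age": 34},
    {"village": "A", "sex": "M", "age": 40},
    {"village": "B", "sex": "F", "age": 8},
    {"village": "A", "sex": "F", "age": 61},
]


def transpose(rows):
    """Turn record-per-person data into one list of values per variable."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def crosstab(columns, row_var, col_var):
    """Count persons for each (row value, column value) combination."""
    return Counter(zip(columns[row_var], columns[col_var]))


columns = transpose(SQUARE)
table = crosstab(columns, "village", "sex")
for (village, sex), count in sorted(table.items()):
    print(village, sex, count)
```

A tabulation of village by sex only touches two of the variable lists, which is one reason a transposed layout suits analytical processing of large files.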
Dissemination
Printed reports
Microfilm
Disk media
On-line (BBS)
world wide web
pre-defined (push)
dynamic (pull)
This is an interesting area,
because by selecting proper dissemination methods
the user base can be dramatically enlarged
Printed reports: traditional
printing facilities, directly from hard copy
printout, or, better, from tabulation data
file
Microfilm: requires
special equipment (inexpensive) but somewhat
uncomfortable to operate and read, has waned
in popularity due to available electronic
means
For all digitally distributed
information: confidentiality is an important
issue. Diskettes for subset of data, CD can
be used with dynamic retrieval software (SDBQ)
when entire census macro data are stored.
Cheap desk top systems exist for CD-ROM recording
On-line: Phone/modem
access required to obtain pre-defined tables
or dynamically generated output from microdata,
used for domestic consumption. Diminishing
importance for bulletin boards
WWW similar but with
larger global audience, may include remote
tabulation requests, delivery possibly against
payment like E-commerce
Analysis
Demographic
analysis software
General purpose
analysis software
PAS
MortPak,
Qfive
Fertility
estimates
PANDEM
DemProj
People
and Workers
FIVFIV
LIPRO
Here come the magicians who
can manipulate the diligently collected, processed
and presented census data. However, analysis is
usually an activity beyond the actual census operation.
Relevant analysis s/w
has been available, with or without cost,
for a long time, some being adapted to more
recent computing environments, while others remain
DOS based.
Very powerful commercial
analysis s/w such as SAS and SPSS was
developed for mainframe computers but has
been adapted successfully for the micro computer
environment
Conclusion
Continued accelerated
technological development
Improved reliability,
quality and timeliness
Depending on local
infrastructure:
Improvement of
traditional methods
Increased use
of OCR/ICR, CAPI, E-form
Increased use
of GIS and thematic mapping
Increased use
of CD/DVD, (BBS) and WWW
Accelerated technological
development: smaller, faster, cheaper, easier
to use hardware and software
Improved reliability,
quality and timeliness: might be achieved
with improvements in the area of data collection
Some of the recent
or forthcoming technologies may have limited
use for countries without appropriate infrastructure.
Implementation of proven technology should
therefore be carefully considered
Improvement of traditional
methods: paper based data collection but with
better designed forms, better methods for
planning and control, better mapping, etc.
Increased use of OCR/ICR
for data capture. CAPI (also with PDAs), CATI,
E-form for data collection, resulting in better
quality of data and more timely reports
GIS: increased affordability,
cooperation with other Gov offices (Cadaster),
easy digitizing, powerful presentation s/w
(PopMap, Redatam, SDBQ)
CD/DVD: cheap CD cutting,
efficient storage and dynamic tabulation s/w
(SDBQ), improved communications infrastructure
(BBS), globalisation with WWW