The First Meeting of the
Working Party on the Application of New Technology
to Population Data
Bangkok, 24-26 September 1997
STAT/WPA.1/3.1
24 September 1997
ENGLISH ONLY
ECONOMIC AND SOCIAL COMMISSION FOR ASIA AND THE PACIFIC
Recent developments
in the application of information technology to
population data collection, processing and dissemination
at the Australian Bureau of Statistics
Dr Rob Edmondson
Director, Technology Application, Population Statistics
Group
Australian Bureau of Statistics rob.edmondson@abs.gov.au
September 1997
The application of information technology
to population data collection, processing and
dissemination at the Australian Bureau of Statistics
(ABS) can be conveniently divided into three
roughly equal parts: the Population Census (the
Census), with its size and public profile; the
household survey program, with many intertwining
strands of work; and 'the rest', such as some
labour force collections based on employer respondents,
and some demography and crime collections based
on administrative by-product data.
Population Census
There were a number of significant IT developments
in the last Census. Personal Computers
(PCs) were provided to field managers, Geographic
Information Systems (GIS) were used for Collection
District design, coding was performed using
a PC based system, and tabulation and dissemination
were largely performed using PC platforms. In
the next Census, in addition to further developing
these systems, the use of Imaging and Optical
Character Recognition (OCR) is under active
consideration. Each of these is briefly discussed
below.
For the first time, PCs equipped with modems
were provided to 145 field managers working
from their homes around the country. Secure
communications facilities were provided for
frequent data exchange. This environment was
used to make more information available to key
personnel, and to provide it in both a timely
and relevant manner. Information could flow
between the field managers and more questions
could be resolved locally. Overall, this was a successful
use of new technology that will probably be
refined in the next Census.
Another first in the Census was the move to
GIS technology for Collection District design.
The most significant obstacle to adopting this
approach was the availability of suitable digitised
geographic data, and a contract for the provision
of this data was signed some years ago. Once
available, the digitised information is a valuable
resource for a range of purposes, both within
the statistical agency and outside. Within the
ABS, the digitised geographic information is
being used in the household surveys program,
and it forms the basis for GIS based dissemination
products. The basic concept of using GIS technology
for CD design worked well, and is likely to
be retained and enhanced for the next Census.
Maintenance of the information in the intercensal
period is an issue under consideration.
The third major change in the last Census was
the move from Mainframe based coding to PC based
coding for those fields not captured by Optical
Mark Recognition (OMR) processing. Together
with a very effective system to manage paper
flows, this proved very worthwhile. The PC
based coding facility used a general-use coding
engine that accommodates a wide range of coding
indexes and a few styles of coding. To further
reduce costs and improve timeliness, imaging
and OCR are being considered for the next Census.
It is hoped that OCR will permit the automatic
coding of a significant fraction of responses,
and that imaging will significantly reduce the
need for paper handling. The concept is that
when automatic coding of an OCRed field fails,
the image will be presented for further processing
rather than the paper. At this stage we are
trialling OCR and automatic coding systems to
test viability and refine estimates.
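The fallback concept described above can be sketched as follows. This is a minimal illustration only, not the system under trial: the index entries and codes are invented, the confidence threshold is arbitrary, and the names `CODING_INDEX`, `OcrField`, `auto_code` and `route_responses` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical coding index: normalised response text -> classification code.
# The entries and codes are invented for illustration.
CODING_INDEX = {
    "nurse": "2323",
    "primary school teacher": "2412",
    "truck driver": "7311",
}


@dataclass
class OcrField:
    form_id: str
    text: str          # text recognised by the OCR engine
    confidence: float  # assumed engine confidence score, 0.0 to 1.0
    image_ref: str     # reference to the stored image of the field


def auto_code(field: OcrField, min_confidence: float = 0.9) -> Optional[str]:
    """Return a code if the field can be coded automatically, else None."""
    if field.confidence < min_confidence:
        return None
    # Normalise case and whitespace before looking up the index.
    key = " ".join(field.text.lower().split())
    return CODING_INDEX.get(key)


def route_responses(fields):
    """Split fields into automatically coded results and an image queue."""
    coded = {}
    manual_queue = []
    for field in fields:
        code = auto_code(field)
        if code is None:
            # Automatic coding failed: present the image, not the paper form.
            manual_queue.append(field.image_ref)
        else:
            coded[field.form_id] = code
    return coded, manual_queue
```

In this sketch a response reaches an operator only when recognition confidence is low or the text is not in the index, and the operator works from the stored image.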
The last area worth commenting on is the use
of PC based tabulation and dissemination facilities.
These facilities are essentially improved versions
of the products used in the preceding Census.
All tabulations were done using Supercross,
including the provision of data to the information
warehouse. A number of CDROM based products
have been released with software that provides
simple 'browse, manipulate, export' functionality.
A cut down version of Supercross is being considered
as an alternative to the current product. The
successful CDATA91 GIS product was improved
for the '96 release, and there are some pilot
projects using the Internet for dissemination
of some Census data, including online (or offline
from CDROM) area selection from maps. This is
consistent with a general move to using internet
interface and software technology for dissemination.
Though based on internet software, the products
are often packaged on CDROM and can be used
offline, or they can be easily incorporated
into an internal "intranet". Moves in this direction
are greatly assisted by the incorporation of
internet format output options in many of the
packages used for dissemination in the Bureau.
Household Surveys
Household Surveys are predominantly interviewer
administered, either face to face or (increasingly)
by telephone, typically with data entered on
an OMR form that is mailed back for scanning
and processing. The main processing system is
currently mainframe based and is reaching the
end of its effective life. It is likely to be
progressively replaced with new components that
operate on the PC or in client-server mode.
The processing systems are largely being replaced
by SAS (available on mainframe, Unix, and PC)
and the tabulation systems are to be largely
replaced by Supercross (available on NT servers
and workstations). Dissemination is predominantly
by paper publication, by tailored data services
(paper or floppy disk), and by confidentialised
unit record data. There is increasing use of
the Information Warehouse to hold data and metadata
in a form fit for dissemination, and there are
some successful pilot projects that move parts
of the survey design stage activities, dispatch
and collection control facilities, and management
information systems into Lotus Notes (Notes).
Some surveys have used other data capture techniques:
computer assisted personal interviewing (CAPI),
telephone interviewing, and OCR have been or
will be used in various surveys and are discussed
below.
CAPI has been successfully used in a number
of surveys, though the cost of in-field notebooks
has made the cost/effectiveness of this approach
questionable. The use of Blaise software (from
Statistics Netherlands) proved very effective
for population surveys, and the current DOS
based Blaise software will be available in a
Windows based edition before long. Blaise has
enabled the fielding of more complex instruments,
with in-field editing to improve data quality,
and early transmission of relatively clean data.
Experience to date indicates that processing
time to clean unit record data has been reduced
while ensuring the consistent application of
edits to the data. A number of surveys have
been and will be conducted using the existing
stock of notebooks. It is not clear what use
will be made of CAPI after this period though
smaller handheld in-field devices hold some
promise.
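The idea of in-field editing, where the same edits are applied consistently to every record at the time of interview, can be illustrated with a short sketch. It is written in Python rather than Blaise, and the edit rules, the age threshold of 15 and all function names are assumptions made for illustration.

```python
def edit_age_range(record):
    """Range edit: age must be present and plausible."""
    age = record.get("age")
    if age is None or not 0 <= age <= 120:
        return "age must be between 0 and 120"
    return None


def edit_employment_age(record):
    """Consistency edit: very young respondents should not be employed."""
    age = record.get("age")
    if age is not None and record.get("employed") and age < 15:
        return "employed respondent younger than 15"
    return None


# Every record passes through the same list of edits, in the same order.
EDITS = [edit_age_range, edit_employment_age]


def run_edits(record):
    """Return the list of edit failure messages for one record."""
    failures = []
    for edit in EDITS:
        message = edit(record)
        if message is not None:
            failures.append(message)
    return failures
```

A record that returns an empty list would be treated as clean; any messages would be shown to the interviewer for resolution before the data is transmitted.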
Telephone interviewing has been successfully
used in the main Labour Force collection to
conduct the second and subsequent interviews.
For these interviews, the
interviewer enters the response onto an OMR
form, and the forms are processed in the usual
way. As the telephone interviewers work from
home, the IT assistance has been limited, though
centralised and computer assisted telephone
interviewing is used in a number of economic
collections.
Optical Character Recognition is also starting
to be used in some surveys. Some economic collections
have been using OCR for a few years, but the
greater need for alphabetic rather than numeric
character recognition in population statistics
has slowed the adoption of OCR for collecting
population statistics. The institutional mailback
component of the Survey of Disability and Ageing
will trial OCR, and the ongoing Survey of Income
and Housing Costs will probably convert to OCR.
If successful, these pilot projects, together
with complementary work in Census, may see widespread
adoption of OCR in the future, perhaps
displacing OMR as the normal data capture vehicle.
Indeed, OCR may be more suitable for interviewer
completed questionnaires than respondent completed
questionnaires as the interviewers may be able
to complete the forms with higher recognition
rates.
To complete the picture, some household collections
have used more traditional computer assisted
data entry (CADE) systems. CADE systems have
been based on a number of software products:
older style desktop database software; the internally
developed client-server Input Processing System
(IPS); the DOS based Blaise system; and more
recently, the Windows based Notes system. Notes
is usually associated with messaging and groupware
applications, but it now contains enough functionality
and programmability to make it a useful data
entry platform. Blaise proved particularly suitable
for some innovative collections, such as the
CADE system for capturing information from Time
Use Diaries. Other CADE systems are mostly well
established, but there are moves to use
the Internet and related electronic data
interchange initiatives to exploit
emerging opportunities in this area.
Processing of the captured data is still largely
done in the traditional manner using aging mainframe
processes. There have been various approaches
to rejuvenating these systems. In the main,
these have emphasised the use of portable SAS
for processing, the use of Supercross to replace
Table Production Language (TPL), the use of
input processing facilities packaged with data
capture systems, and the use of Notes and other
client server technologies. The ABS has contracted
with the supplier of Supercross to incorporate
various 'TPL' functionality into their product,
particularly for processing the kind of hierarchical
unit structures found in household surveys.
Once the functionality has been delivered, we
expect a substantial shift of processing from
the mainframe SAS/TPL/PL1 environment to a client
server SAS/Supercross/SQLWindows environment.
These systems are being integrated with downstream
dissemination initiatives, and upstream survey
development processes to provide the next generation
of household survey systems. Many of the systems
and products can be used with the 'server' on
the 'client' PC, though in the ABS we tend to
field the systems using shared access UNIX and
NT servers. A number of specialised statistical
processing sub-systems are being or will be
integrated into the new environment, including
seasonal analysis/trending facilities and some
general use statistical sampling and weighting
systems that are tailored to the family of statistical
methodologies used in the ABS by most (95%?)
household surveys.
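To illustrate what tabulating a hierarchical unit structure involves, the following sketch produces a weighted person-level table from household records. It is not Supercross or TPL syntax. The data are invented, the function name `weighted_person_table` is hypothetical, and it assumes, as a simplification, that every person inherits the weight of their household.

```python
from collections import defaultdict

# Invented example data: each household carries a weight and nested persons.
HOUSEHOLDS = [
    {"hh_id": 1, "state": "NSW", "weight": 150.0,
     "persons": [{"age": 34, "employed": True}, {"age": 8, "employed": False}]},
    {"hh_id": 2, "state": "NSW", "weight": 100.0,
     "persons": [{"age": 51, "employed": True}, {"age": 49, "employed": True}]},
    {"hh_id": 3, "state": "VIC", "weight": 200.0,
     "persons": [{"age": 70, "employed": False}]},
]


def weighted_person_table(households, row_key, predicate):
    """Weighted count of persons satisfying `predicate`, by a household field."""
    table = defaultdict(float)
    for household in households:
        for person in household["persons"]:
            if predicate(person):
                # Assumption: the person weight equals the household weight.
                table[household[row_key]] += household["weight"]
    return dict(table)
```

For example, `weighted_person_table(HOUSEHOLDS, "state", lambda p: p["employed"])` classifies persons by a household-level field, which is the kind of cross-level tabulation that flat file tools handle poorly.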
Increasing volumes of data are being made available
for electronic dissemination. Some older dissemination
techniques are being phased out in favour of
newer technologies. Fiche is being displaced
by CDROM based facilities providing much improved
location, manipulation and display capabilities.
Dissemination using older electronic messaging
facilities is being replaced with Internet facilities
and even floppy disks are being displaced by
internet email. The internet presents many opportunities,
and the ABS is quite well placed to move in
this area with a large volume of data cleared
for electronic release, and the ability to associate
a steadily increasing collection of electronic
documentation with such data.
Other Population Collections
As might be expected, the 'other' category
is difficult to generalise. In the main, employer
based surveys tend to adopt processing systems
resembling those used by economic surveys. Some
Labour Force employer surveys are starting to
use OCR, and have used administrative byproduct
data and electronic capture for some time. Crime
and Demography collections do not have employer
respondents, but most information is collected
by other government instrumentalities (often
State based). Administrative byproduct data
is typically collected from each supplier in
a different format and run through a specialised
processing system.
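One way to picture the handling of supplier-specific formats is a set of small parsers that each convert one supplier's file into a common record layout. The sketch below is an illustration under stated assumptions: the two file layouts, the field names and the supplier labels "A" and "B" are all invented.

```python
import csv
import io


def parse_supplier_a(raw):
    """Assumed layout: comma separated with a header row id,date,offence."""
    for row in csv.DictReader(io.StringIO(raw)):
        yield {
            "record_id": row["id"],
            "event_date": row["date"],  # already in YYYY-MM-DD form
            "category": row["offence"],
        }


def parse_supplier_b(raw):
    """Assumed layout: fixed width; id in columns 0-5, date DD/MM/YYYY in 6-15."""
    for line in raw.splitlines():
        if not line.strip():
            continue
        day, month, year = line[6:16].split("/")
        yield {
            "record_id": line[0:6].strip(),
            "event_date": f"{year}-{month}-{day}",
            "category": line[16:].strip(),
        }


# One parser per supplier; downstream processing sees a single layout.
PARSERS = {"A": parse_supplier_a, "B": parse_supplier_b}


def load(supplier, raw):
    """Parse one supplier's file into a list of common-format records."""
    return list(PARSERS[supplier](raw))
```

Adding a new supplier then means writing one more parser, while editing and tabulation continue to work on the common layout.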
Impact of new technologies on the operations of the national
statistical office, benefits drawn, and issues encountered.
In the last few years the most significant
technology changes have been a move to roughly
one networked PC per employee, the installation
of a much improved wide area network and the
deployment of relational database technology
(Oracle/Unix) and groupware technology (Lotus
Notes). As a direct result, the dedicated mainframe
terminal network has been removed, the centralised
print service has shrunk to one printer part
time, and the support cost of the mainframe
has shrunk to very low levels.
Relational database systems have been deployed
primarily in economic collection areas, but
relational input processing and final dissemination
(via the information warehouse) systems have both
been used in population surveys, and the Census DPC had
a dedicated Unix server. They have also been
used for some administrative systems - particularly
financial and personnel systems.
Lotus Notes has been used for electronic mail,
discussion databases, and the automation of
many administrative systems (leave, acquisition,
recruitment, staff movements, planning etc).
It has also been used increasingly for a range
of statistical processes, providing opportunities
for improvements from the earliest survey design
stages to final dissemination. The earliest
successful systems used the flexible document
structures to develop systems for query tracking
and resolution, structured survey documentation,
and management information systems. Administrative
processes have also been automated, with most
paper forms eliminated and electronic forms
routing and processing automated so that only
the originating and approving officers need
view the electronic form in almost all cases.
Not only have processing systems been modernised,
often with improved output and better timeliness,
but the quantity, quality, and accessibility
of documentation has improved. This has not
been without cost, and there has been a substantial
shift in resources from people to technology.
This was in part the result of a change to full
cost recovery of all IT operations, and of giving
the users the freedom to move money between
various IT and non-IT expenditure items. To
date, the IT organisation has retained a monopoly
on the provision of IT Services (subject to
demonstrable unit-cost decreases year on year),
though it has increasingly used external service
providers to deliver the service.
The network has provided the environment to
host a range of servers and services available
from every machine on the network. Banyan/Vines
servers provide basic file, print, and communications
services; Notes servers provide a range of messaging
and document-database services. Unix servers
carry a significant proportion of the data entry,
analysis, and dissemination load. NT servers
are providing specialised application services
such as timeseries manipulation (FAME) and tabulation
(SUPERCROSS). Most servers have no dedicated
input/output devices, relying on general network
services instead. Much more mainframe output
is now sent to Banyan network printers (or to
Notes databases) than is printed on centralised
printers. As well as servers, there is a range
of general services available over the network
- OMR and OCR scanners feed data in, computer
fax gateways receive and send faxes without
paper being generated, internet gateways and
firewalls provide internet email and limited
internet browsing, file transfer machines provide
secure transmission facilities for PCs in the
field, and special devices such as CDROM cutters
and plotters are available to all (or just to
selected individuals).
Experiences in the implementation of applications
Client Server Technology
The large number of server types listed above
indicates that we have had considerable success
with client server systems overall. However,
there have been a number of difficult implementations
along the way, and some systems can only be
classed as marginally successful when compared
to the original expectations. As a general rule,
economic collections have moved more rapidly
to use client server platforms, but population
collections are starting to catch up.
A number of early systems used PC database
technology with multiuser database backends
residing on Banyan servers. These were successful,
but unpleasantly network intensive, and the
remainder are being eliminated. We rarely develop
server applications for these platforms now
except when the software being used requires
a shared file system. Blaise processes, some
SAS systems etc are run from Banyan servers.
Notes is based on a non-relational database
engine, and this can be used to develop applications
of various kinds. Starting with documentation
centred applications and moving into workflow,
task tracking, planning and management information
systems, these have been very successful. With
the increase in the programmability of Notes,
we are now developing much more traditional
database applications using Notes.
The earliest UNIX/RDBMS systems were a CATI
application and Financial and Personnel management.
These were successful, though the required CPU
power was underestimated at the start (a problem
we have had with most client server applications).
Financial and Personnel Management is still
successfully based on a UNIX server, but most
subsequent applications have been statistical
applications, including some general use environments
within which particular survey applications
are constructed. General use environments include
an Oracle based input processing system, an
Oracle/SAS based processing environment, and
an Oracle based information warehouse system.
There have also been a number of Unix hosted
but more specialised applications including
a GIS server, OCR engines, and a new Business
Register.
We have just started to deploy some NT servers
for general use (as host environments for Notes,
they have been used for some time). So far,
these have been used to provide particular third
party applications: FAME and Supercross. It
is not clear how widely we may end up using
NT servers, and whether they will supplement
or displace existing Unix servers.
The Internet
ABS has had an Internet site for some time
providing "subscription" based access to a range
of statistics. More recently we have provided
a Web site for public good information. We currently
maintain several thousand pages of information
using Lotus Domino technology which essentially
allows us to put nominated Notes databases on
the Web. Maintenance of the information is
straightforward as it only uses the usual office documentation
tool - Notes. Some more specialised initiatives
are underway, including map-based drill down
area selection to public good Census information.
This will add a few thousand more pages to the
site.
The ABS will continue to enhance the content
of the site, and when appropriate third party
charging arrangements are available, we expect
to sell data over the Web on a self-serve basis.
Even when direct use of the Web is not expected,
for example when releasing CDROM material, we
are increasingly using software and interfaces
associated with the Web. There is also interest
in collecting data using the internet once suitable
security arrangements have been agreed. This
can range from data capture using electronic
forms on the Web to moving existing electronic
providers to a better communications medium.
Data capture using OCR, OMR and other technology
OMR is well established as the main data
capture technology in Census and Household surveys.
More recently, the use of OCR has been growing,
particularly in business mail back surveys.
As outlined above, OCR is now being actively
investigated by Census and some other population
collections, and has a number of attractions
- printing requirements are not as stringent,
recognition rates for alphabetic characters
are improving, and implementation costs are
reducing.
Other data capture options also have their
attractions. CAPI enables more complex questionnaires
to be used and improves data quality and timeliness,
but hardware costs will limit its use. This
situation may change with hand held devices
becoming available at significantly lower prices.
CATI has also been used with significant benefits,
though in population statistics, telephone interviewing
has usually been used with paper OMR forms.
Administrative by-product capture is also used
by several collections, and opportunities in
this area are increasing.
The use of GIS software packages for data collection
and analysis
In the ABS, GIS are mainly used for 'frame'
creation and selection, and for various publication
and dissemination initiatives. The Integrated
Regional Database (IRDB) and CDATA have been
successful GIS dissemination initiatives, and
the social atlases and other map products sell
well.
Strategies adopted to address various issues and problems
Year 2000
Many population collections are cross sectional,
substantially modified each time they are run,
and are unlikely to have significant year 2000
problems. The next Census will be in 2001 and
will be extensively retested before production
use. Thus the scale of the problem is manageable,
but there are still significant areas of risk.
The ABS has evaluated all its systems and identified
its more significant and risky applications,
and various external dependencies. External
dependencies include external organisations
providing electronic data, and third party hardware
and software. External hardware and software
providers have been approached about year 2000
compliance and plans. External data providers
are being identified and approached about file
format and data provision risks. The highest
priority application systems are being modified
and/or redeveloped as part of this year's work
program. This includes the significant sub-annual
Labour Force and Demography collections, and
the household surveys processing environment.
Test environments in which dates can be set
forward are being progressively made available,
and will be used to verify that the remaining
systems do not have significant problems, and/or
to correct any minor problems discovered. Outstanding
and any more significant problems discovered
will be remedied as part of next year's work
program.
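The kind of problem that a test environment with dates set forward is meant to expose can be shown with a small sketch. It is illustrative only: the functions are hypothetical, and the two-digit year "windowing" shown is one common remediation technique, not necessarily the approach taken for any ABS system. The pivot value of 30 is an arbitrary assumption.

```python
from datetime import date


def naive_age_two_digit(birth_yy, today_yy):
    """Typical legacy logic: subtract two-digit years directly."""
    return today_yy - birth_yy


def expand_year(yy, pivot=30):
    """Window a two-digit year: 00..pivot-1 become 20xx, pivot..99 become 19xx."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    return 2000 + yy if yy < pivot else 1900 + yy


def age_in_years(birth, today):
    """Age in completed years, using four-digit dates.

    `today` is passed in rather than read from the system clock, so a
    test can set the date forward past 1 January 2000.
    """
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)
```

With the test date set forward to 1 January 2000, the legacy calculation for a respondent born in 1965 gives `naive_age_two_digit(65, 0)`, which is -65, while `age_in_years(date(1965, 6, 1), date(2000, 1, 1))` gives 34.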
Cost Recovery
The ABS has fully cost recovered all internal
technology services, both applications and infrastructure
services, for some years. The first couple of
years were difficult, but the system is now
reasonably well understood and works reasonably
well. The main benefits have flowed from resource
shifts resulting from the provision of better
cost and consumption data, more direct pricing
signals, and more freedom for line areas to
capture benefits. The pricing structure is reasonably
detailed, and each item fully recovers the costs,
including all overheads, that are attributed
to it.
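The principle that each priced item fully recovers its attributed costs amounts to simple arithmetic, sketched below. The function name and all figures are hypothetical and do not represent actual ABS prices or volumes.

```python
def unit_price(direct_cost, attributed_overheads, expected_units):
    """Price per unit that fully recovers direct costs plus attributed overheads."""
    if expected_units <= 0:
        raise ValueError("expected_units must be positive")
    return (direct_cost + attributed_overheads) / expected_units


# Invented figures: a service costing 80,000 directly, with 20,000 of
# attributed overheads, expected to be consumed in 400 units.
example_price = unit_price(80_000, 20_000, 400)
```

Under these invented figures the price is 250 per unit. If demand rises to 500 units with costs unchanged, the same calculation gives 200, which shows how higher volumes can coincide with lower unit costs.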
In the infrastructure services area, cost recovery
has resulted in demonstrably lower unit costs
each year, and in higher overall technology
expenditure as a result of a significant rise
in demand for (and supply of) infrastructure
services. The significant rise in the number
of PCs, and the consequential demand for network
and server capacity, was not a direct result
of executive decision making. It was largely
driven by individual managers shifting available
resources into these areas. It occurred during
a period when overall financial resources available
to the ABS were subject to an annual reduction.
The last mainframe, and all the servers have
been acquired using existing budget allocations
rather than requiring additional funds.
In the applications area, there has been a
fairly steady demand for services, but there
has been more flexibility in deployment, and
more attention paid to the cost effectiveness
of requested developments. The applications
dollar could be spent on PCs, on subject matter
people, on travel or other items.
Technology services have undergone a number
of external benchmarking studies and have, as
a rule, performed at the standard of world best
practice. This is probably due in part to the
cultural and financial effects of cost recovery.
Outsourcing and Market Testing
Australian Government policy is followed,
and we have market tested and/or outsourced
in a number of areas - help desk provision, provision
of GIS services, provision of support for financial
management software, provision of field support,
etc. The result of market testing has not always
been outsourcing, but outsourcing often occurs
when the required skills or expertise is peripheral
to mainstream statistical processing. The Census
and Statistics Act places limits on our ability
to outsource unit record processing facilities,
particularly for population data.
Areas where it might be possible for the NSO to make
contributions to facilitating the transfer of technology
and/or exchange of information to developing countries.
Some of the areas that seem to offer some
possibilities for transfer of technology and/or
exchange of information of immediate relevance
to population statistics include the use of
Notes, the use of Blaise, computer assisted
and/or automatic coding, exploiting various
Internet technologies and opportunities, the
use of Supercross, OMR/OCR experiences and developments,
and exploiting GIS.