| Workshop on Population Data
Analysis, Storage and Dissemination Technologies |
| Bangkok, 27-30 March 2001 |
STAT/WDT/6
27 March 2001
ENGLISH ONLY
ECONOMIC AND SOCIAL COMMISSION FOR ASIA AND THE
PACIFIC |
| Going on the net with your
national statistics: What is there to consider?* |
| (Item 7 of the provisional
agenda) |
| By Sten Backlund, UNSD1 |
* This paper, prepared by Mr Sten Backlund, United
Nations Statistics Division (UNSD), has been reproduced
as submitted. It has been issued without formal
editing.
1 Former employee of Statistics Sweden and, during
2000, in charge of coordinating Statistics Sweden's
Internet activities.
|
| Abstract |
| In a society where instant access
to valuable information can make the difference
between success and failure, you have to rely
on your sources of data being available when you
need them. At the end of each year Statistics
Sweden releases tables containing preliminary
population figures as of 31 December, down to
municipal level. On 28 December 1999 the website
timed out for no apparent reason. We will
use this as a starting point, search for answers
to why it happened and discuss how it could have
been avoided. In this context we will also elaborate
on how to address issues that are not readily
found in textbooks or manuals. |
|
| The
site |
| Introduction |
| In the quest for answers to the
collapse of the Statistics Sweden website there
are some preconditions to keep in mind. Statistics
Sweden launched its first data on the Internet
in the fall of 1995 and the site grew rapidly
in size. At the end of 1999 it was decided to
restructure the website and opt for a database-driven
solution, since the burden of administering
the file-based system was threatening the integrity
of the site. For this purpose a time-consuming
bidding procedure for a database-driven solution
started in early 2000, the result of which did
not live up to expectations. In the same year
it was decided that all official statistics, including
ad-hoc requests from the macro databases2,
should be made freely available on the Internet. |
| During these years the information
generally available on the web, as well as the
number of users connected, has multiplied in
a way that was never imagined. This naturally
affects any website that provides information
that may be critical to the end users. Systems
have also become more sophisticated, since the
number of different web services that can be offered
grows as methods and techniques evolve. |
| One of the problems that fast expansion
inevitably seems to lead to is that the owner
of any given website will always find himself
racing against time to catch up with and
adapt to the ever-increasing demand from the
international community. |
| Availability is one of the most
important measures of quality. In many cases you
will demand 24-hour service, e.g. in banking,
booking and news distribution. In Sweden, as an example,
an information "marketplace" is well under way
where, as a first step, all providers of governmental
information will be connected for sharing information3.
Later on large businesses will be part of it and
in a longer perspective any private person, organization
or company will be able to access and share information
over the web. To make this happen a standardized
secure system for data interchange and authentication
of actors must be offered. Today you might still
end up with a number of different solutions that
depend on what has been chosen by the company
with which you want to communicate. (Typically,
Swedish banks use different solutions for securing
transactions over the web instead of sharing the
same one, e.g. soft certificates or smart cards.) |
| We can comment briefly on the
factors that have a negative impact on a website's
availability. What comes first
to mind is hardware or system failure, e.g. when
a web server stops operating or when the power
supply fails, which has nothing to do with the
web application itself. Second, the website may
time out because it is badly designed or contains
bugs. Third, the website can be flooded with junk
e-mail or come under attack from an intruder who
seemingly does not like you, resulting in what
is called denial of service. |
| (Using simple probability theory
it can easily be shown that if your single web
server has a probability of 0.99 of being up and
running, it will still on average
be out of order for approximately 7.2 hours a month!) |
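| The arithmetic behind this figure can be
sketched in a few lines of Python. The 30-day month
and the assumption that two servers fail independently
of each other are ours, added for illustration; the
function names are hypothetical. |

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def expected_downtime_hours(availability, hours=HOURS_PER_MONTH):
    """Average number of hours per period during which the service is down."""
    return (1.0 - availability) * hours


def combined_availability(availability, servers):
    """Availability of several redundant servers.

    Assumes failures are independent, which is optimistic when the
    servers share a power supply or a network connection.
    """
    return 1.0 - (1.0 - availability) ** servers


single = expected_downtime_hours(0.99)
double = expected_downtime_hours(combined_availability(0.99, 2))
print(f"one server:  {single:.2f} hours down per month")
print(f"two servers: {double:.3f} hours down per month")
```

| Under these assumptions a second, instantly
available server would cut the expected downtime from
about 7.2 hours to a few minutes a month. |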
| Going back to our case study,
the Statistics Sweden website is today comparatively
large and complex, as are many other NSO (national
statistical office) websites.
The current number of HTML pages produced regularly
or ad hoc is large. The underlying meta and macro
databases are voluminous since it is mandatory
that all official statistics produced be
added or appended to the database tables. Most
information available is also translated into
English. The site records more than 150,000 successful
hits daily. |
| Statistics Sweden still maintains
a conventional web solution based on file directories
for documents, while data are stored in Sybase
databases. As a result it has been decided that
responsibility for updating site information resides
with the policy area programmes. No data warehousing
has been implemented as yet, although facilities such
as those that come with the SAS software are under
evaluation. The
agency relies heavily on PC-AXIS as a tool for
standardization and dissemination of data. |
| Virtual hosting is used for launching
web-driven multi-layered applications, e.g. where
electronic forms are used for collecting statistical
data from respondents such as administrative units,
schools or businesses, or where results and analyses
from surveys that are not part of the official
statistics are published for designated users.
More than 50 such websites, of different sizes,
exist. |
| Securing the web site and the
internal NT LAN is considered of utmost importance.
Statistics Sweden has implemented proxy and firewall
techniques to prevent intruders and to filter
information that may be harmful to the system. |
2 SSD - Sveriges Statistiska Databaser
3 SHS - Spridnings- och HämtningsSystemet, jointly
developed by Statskontoret (The Swedish Agency
for Public Management), Riksskatteverket (The
National Tax Board) and Riksförsäkringsverket
(The National Social Insurance Board)
|
| The
hardware configuration |
| The configuration adopted
at that time was built on mirrored web and database
servers for the public Internet. The "outer" web
server was located behind the firewall in the
so-called Demilitarized Zone (DMZ), which is a physical
network in itself, providing only dedicated services
on predefined nodes to the users, e.g. FTP, HTTP,
HTTPS and SMTP, and for practical reasons a few others,
e.g. DNS, Daytime, NTP and Ping. |
| In order to speed up data transfer
a secure transport network was set up connecting
the two web servers and the two database servers.
From the safe SCB LAN the "inner" servers were
updated first and then the changes were propagated
onto the "outer" servers. If any of the "outer"
servers failed it could be temporarily swapped
for the corresponding "inner" server, though
only during office hours. |
| Another facility provided is a Virtual
Private Network (using the Point-to-Point Tunneling
Protocol), a method that gives Internet users
encrypted access to the safe network, e.g.
for staff working from home or for secure communications
between Statistics Sweden and trusted partners.
(VPN is discussed further at the end of the
paper.) |
| The firewall is implemented through
dedicated hardware (Watchguard Firebox II + which
supports 5000 concurrent threads). A strict scheme
is implemented describing what kind of incoming
and outgoing traffic is permitted. |
|
| Methods
and schemes for updating and maintaining the web
site |
| Statistics Sweden has been using
mainstream technology for its web development
and still does. At that time it meant the Windows
NT 4.0 product family (Internet Information Server,
Index Server, Transaction Server, Proxy Server)
and the Office 95/98 suite of software. The website
has been implemented mainly using ASP (Active Server
Pages) to retrieve pages and data (although a minor
deviation is made, since the Java-based SilverStream
product is used for a few electronic data collection
applications).
FrontPage is the number one tool for the final
touch of page design, but alternatives such as
Dreamweaver or Adobe PageMill are also used. |
| Still, the most common method is generating
files from MS Word and MS Excel. Updating statistical
information is delegated to staff in the line
departments. More than 200 persons
have selected rights to update file and database
servers. Since all official data from recurrent
and intermittent activities are appended to the
statistical databases in each policy area,
every department has persons
who are entrusted with updating and maintaining
database tables. This is often done in an automated
fashion, even if a manual step is needed to initiate
the process. The same goes for creating or updating
informative pages, e.g. press releases or information
on specific topics such as the living standard survey.
A comprehensive set of instructions
has been developed covering most of what one needs
to know, from layout to permissions and updating. |
| Central staff are needed
for coordinating activities, informing on schemes
and procedures, arranging internal or external
training etc. This resides with the two webmasters,
who have at their disposal a steering group in which
the subject matter departments are represented. |
| All taken together, in order to
run the website smoothly everyone must know what
to do, how to do it and when. Badly designed scripts
or too much reliance on standard output
from popular software may occasionally decrease
overall site performance. |
|
| The
search for answers |
| It was obvious that the answer
to the problem of 28 December could to a great extent
be related to the 512 kbit/s capacity of the
connection point to the ISP. Another matter that
came to attention early was the way the published
tables were produced. The most common way was
to use ASP to create requested tables,
but what did that really mean? |
| Previously there was no direct control over
how pages were created, except
that they should conform to the agreed layout standards
and that the language should be correct.
Recommendations and rules were of course established,
but there was simply not enough time to
check regularly on page sizes or whether links were
broken. |
| ASP-generated HTML code, contrary
to static HTML, is normally not cached. This means that
whenever a client returns to a previous ASP page
it will be re-created instead of returned from
the client's cache or the proxy. With NT4 there
is another problem. A standard installation allows
30 parallel ASP scripts to run (Microsoft Script
Engine default) before queuing starts, while studies
undertaken at the IT unit showed that when more
than 500 ASP scripts were executing or queued
the server tended to return the error code 500,
i.e. "Server too busy". |
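| The queuing behaviour described above can
be illustrated with a toy model. This is only a sketch:
the class and its method names are hypothetical, the
limit of 30 parallel scripts is the default quoted
in the text, and the threshold of 500 is the level
observed at the IT unit, treated here as if it were
a hard limit. It does not reproduce how the server
actually schedules scripts. |

```python
from collections import deque


class ScriptQueueModel:
    """Toy model of a server running a fixed number of scripts in parallel."""

    def __init__(self, max_running=30, max_total=500):
        self.max_running = max_running  # parallel scripts allowed
        self.max_total = max_total      # running + queued before refusal (assumed hard limit)
        self.running = 0
        self.queue = deque()

    def submit(self, request_id):
        """Admit a request; return 'running', 'queued' or 'busy'."""
        if self.running + len(self.queue) >= self.max_total:
            return "busy"  # corresponds to the "Server too busy" response
        if self.running < self.max_running:
            self.running += 1
            return "running"
        self.queue.append(request_id)
        return "queued"

    def finish_one(self):
        """One running script completes; the oldest queued request takes its slot.

        Returns the id of the request that starts running, or None.
        """
        if self.running == 0:
            return None
        if self.queue:
            return self.queue.popleft()  # slot reused, running count unchanged
        self.running -= 1
        return None


model = ScriptQueueModel()
outcomes = [model.submit(i) for i in range(600)]
print(outcomes.count("running"), outcomes.count("queued"), outcomes.count("busy"))
```

| With 600 simultaneous requests the model
runs 30, queues 470 and refuses 100, which shows why
a burst of requests for one dynamically generated
page can exhaust the server long before the line is
full. |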
| Then there was the problem of
only a single server performing all duties. What
would have happened if there had been another
server instantly available? This question later
led to a rearrangement of the web site. |
| Another thing that relates to
availability is that there should always be staff
at hand when extraordinary events are expected.
This was the case, even though 28 December is one of
the days when people are on Christmas holiday.
During the day the server was restarted in order
to get rid of processes blocking it from delivering
the requested services. |
| But what was soon found to be the
real "sinner" was the population table most in
demand, the size of which was 600 KB! When the
HTML code was examined and unnecessary formatting
(at table cell level) was removed, it was down to
64 KB. On 28/12 the number of recorded visits to
this page was a moderate 3000. Another simple
calculation shows that this yields 1.8 GB of outgoing
data (3000 × 600 KB) that on the current 2 Mbit/s
connection would occupy the single server on a 100% basis
for 10 hours, while the reduction leads to
not more than 17 minutes. |
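| A rough check of this kind of calculation
is sketched below for both line speeds mentioned in
the paper. It ignores protocol overhead and all other
traffic, so it gives lower bounds only; the raw
figures come out below those quoted above, which
presumably allow for such overhead. The function name
is hypothetical. |

```python
def transfer_seconds(visits, page_kbytes, link_kbit_per_s):
    """Seconds needed to push `visits` copies of a page through the link.

    Ignores protocol overhead and competing traffic (best case).
    """
    total_kbit = visits * page_kbytes * 8  # 1 byte = 8 bits
    return total_kbit / link_kbit_per_s


for page_kbytes in (600, 64):
    for link in (512, 2000):
        hours = transfer_seconds(3000, page_kbytes, link) / 3600
        print(f"{page_kbytes:>3} KB page, {link:>4} kbit/s link: {hours:.2f} h")
```

| Whatever the exact overhead, the time the
line is occupied scales directly with the page size,
so trimming the table from 600 KB to 64 KB cuts it
by a factor of more than nine. |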
|
| Solving
the problem? |
| Obviously the first thing that
had to be done to avoid the same thing happening
again was to change the manner in which frequently
visited pages were created. Most pages containing
statistical information were the result of efforts
made by staff in the subject matter departments
using Word or Excel, in some cases FrontPage or
Dreamweaver. The resulting HTML code was then
often bloated. Seen in a micro-perspective
this is nothing to fuss about. Most statistics
that are disseminated have a moderate number of
readers, and on an ad-hoc basis. But for pages
in high demand the situation is different. A macro
tool was developed to strip the code of unnecessary
formatting. ASP was to be avoided where static
HTML was a sufficient alternative, e.g. for the
most wanted population tables in this case.
Static pages can also be cached. In this
way the main table was reduced by roughly a factor of 10. |
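| The macro itself is not reproduced in this
paper. As an illustration only, a comparable clean-up
could look like the following Python sketch, which
removes cell-level attributes and inline formatting
tags of the kind word processors typically emit. Which
tags and attributes are safe to drop is an assumption;
the sample row and its figure are invented. |

```python
import re

# Assumption: colspan/rowspan carry table structure and must be kept,
# everything else on <tr>/<td>/<th> is presentation only.
CELL_TAG = re.compile(r"<(td|th|tr)\b[^>]*>", re.IGNORECASE)
KEEP_ATTR = re.compile(r'\b(colspan|rowspan)\s*=\s*"?(\d+)"?', re.IGNORECASE)
INLINE_TAG = re.compile(r"</?(font|span|o:p)\b[^>]*>", re.IGNORECASE)
WHITESPACE = re.compile(r"\s+")


def _clean_cell_tag(match):
    """Rebuild an opening table tag with only its structural attributes."""
    name = match.group(1).lower()
    kept = "".join(f' {attr.lower()}="{value}"'
                   for attr, value in KEEP_ATTR.findall(match.group(0)))
    return f"<{name}{kept}>"


def strip_cell_formatting(html):
    """Return the table markup with cell-level formatting removed."""
    html = CELL_TAG.sub(_clean_cell_tag, html)
    html = INLINE_TAG.sub("", html)
    return WHITESPACE.sub(" ", html).strip()


# Invented example of a bloated table row (dummy figure).
bloated = (
    '<tr style="height:12.75pt">'
    '<td width="64" style="border:none;font-family:Arial" align="right">'
    '<font face="Arial" size="2"><span lang="SV">12 345</span></font></td>'
    '</tr>'
)
clean = strip_cell_formatting(bloated)
print(clean)
print(f"{len(bloated)} bytes reduced to {len(clean)} bytes")
```

| Formatting that is really needed can then
be applied once for the whole table, for instance
through a style sheet, instead of being repeated in
every cell. |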
| The second thing to consider was
how to optimize the bandwidth usage at the point
of connection and/or to increase the capacity.
While 512 kbit/s is not much nowadays, it was still
regarded as quite sufficient for the Internet traffic
in 1999 and before. It was decided to upgrade the
connection to 2 Mbit/s but at the same time to evaluate
the software on the market for "traffic control"
(meaning that traffic through your pipe is
differentiated according to its priority)
and load balancing. |
| Since one of the underlying factors
was the single web server available at the connection
point (ISP Point-of-Presence) it was necessary
to address this situation properly. A different
solution was proposed and adopted, which meant that
both web servers were now located in the DMZ.
This meant that traffic
from the inside (SCB LAN) had to pass through the
firewall, but this was considered a minor obstacle since
the capacity of the firewall had previously been
increased (not needed for the automatic updating
of the servers). It was also decided to upgrade the
OS from NT4 to Windows 2000 Advanced Server SP1.
At the same time Round Robin
DNS was implemented, which meant that incoming
traffic was shared between the servers on an equal
basis. |
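| The principle of round-robin distribution
can be sketched as follows. This is a simplified model
of the rotation only, not of a real DNS server or of
resolver caching, and the addresses are hypothetical
ones taken from the documentation range of RFC 5737. |

```python
from collections import Counter
from itertools import cycle


def round_robin(addresses):
    """Yield server addresses in rotation, as a round-robin resolver would."""
    return cycle(addresses)


# Hypothetical addresses (RFC 5737 documentation range), not the real servers.
servers = ["192.0.2.10", "192.0.2.11"]
rotation = round_robin(servers)
answers = [next(rotation) for _ in range(1000)]
load = Counter(answers)
print(load)

# Round robin has no health check: if one server is down, its share of
# the lookups still points at it until the DNS record is changed.
down = {"192.0.2.11"}
failed = sum(1 for address in answers if address in down)
print(f"{failed / len(answers):.0%} of lookups would reach a dead server")
```

| The second half of the sketch shows the
main weakness of the method: the load is shared
equally, but so are the failures. |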
| One of the problems was that service
was not available around the clock so one had
to make sure that there would be someone working
on that day to supervise. It was decided to take
down the servers the night before the 28th
to clear all pending processes. |
| For good measure the incoming
mail was rerouted to the other connection point
in Örebro (some 180 km west of Stockholm),
staff were encouraged not to send any large e-mail
consignments on this day and an additional link
on the welcome page was inserted pointing directly
to the population tables. This link was also provided
to the major Swedish newspapers. |
|
| Lessons
learnt |
| The release of the population
figures went without problems. The actions taken
proved sufficient. Most probably, only changing
the main table from ASP to static HTML and stripping
the code of unnecessary formatting would have
been adequate. |
| Upgrading from NT4 to Windows
2000 was performed without trouble. (The problem
of ASP queuing is also more or less solved in
this environment.) Windows 2000 also seems to be more
stable, since system failures have diminished. |
| In retrospect it may look somewhat
like overkill that all these steps were taken
to prevent the site from timing out. All the
same, it is a severe loss of confidence
when such a thing happens while you disseminate
statistics of major interest to the media and
the common man. It should be remembered that the
exercise also brought things to the surface that
led to a deeper understanding of which factors
affect a website and how to deal with them in
better ways. In the end the overall outcome was
positive. |
| One of the main findings is that
you should never neglect the need for a staff development
program on how to implement and use mainstream
software for creating web pages or multi-layered
web applications. Even if your staff are competent,
as in this case, you cannot lay the responsibility
on them to learn things on their own. Commonly
those involved are fully occupied with day-to-day
work, and you need to provide the time needed to
attend courses, workshops or seminars to raise
their competence. |
| If there are many players engaged
in the development, updating and maintenance of
your website you will need a coordinating body.
In our case web development and updating resided
with one department (Information and Publishing),
hardware and methods for creating web-driven applications
for disseminating macro data were the responsibility
of the IT unit, and maintenance, finally, was
handled under a contract with IBM Global Services.
There were some communication problems between
these bodies, and the resulting lack of consensus
to some extent contributed to the unwanted situation. |
| The importance of instant redundancy!
The solution at that time was mirrored web and
database servers, located in the DMZ and on the
transport (secure) network. In the updating process
the inner servers were first addressed and then
the updated information was propagated onto the
outer servers in an automatic process. Still this
meant that the site was configured with only a
single web server that provided Internet services
to the public. Even if you could switch to the
mirrored server on the transport (secure) network
in the case of failure you still needed someone
to be there to supervise activities for bringing
the system back to normal. |
| Put your site under stress! Establish
a schedule for testing your site and deal with
latent risks before unwanted problems arise. Monitor
activities and obtain data on hits from your ISP
and from within, which can then be used for comprehensive
statistics and analysis. |
|
| Additional
remarks |
| When you engage in developing
your first website the first thing you normally
have to do is to identify your potential
users and what information they would like you
to offer. You will also have to decide to what
degree the information will be made available
to the users. You will carry out a risk analysis in order
to identify possible threats to your website.
By doing all this you are under way and will hopefully
end up formulating an Internet policy for your
agency. Common parts of the policy deal with e.g.
organization, infrastructure, security, maintenance
and competence development. |
| A website will at times seem to
have a life of its own, and while you try to "contain"
it by imposing new methods or enhancing the technology,
you will always gain from careful planning, from
laying a solid foundation for your intended Internet
activities by employing mainstream technology
and standards at an early stage, and from keeping your
staff skilled and motivated through training schemes
and promotions. |
| At the beginning of 2001 Statistics
Sweden made additional changes to its web configuration.
Round Robin DNS was abandoned in favor of Windows
Network Load Balancing, which is a facility that comes
with Windows 2000 Advanced Server. A current internal
assessment now indicates 14 as the minimum number of
servers needed if no changes are made to the web
services provided. |
| When discussing line capacity
one should always remember that many of the users
on the Internet are still stuck with 28.8 or 56 kbit/s
dial-up modems. They are not interested in whether
or not you have 2, 50 or 100 Mbit/s channels for
the incoming and outgoing traffic. An ASP page generating
350 KB will still be painful to download. Therefore
you should always try to minimize the generated
HTML code, or at least keep a regular eye on the most
frequently visited pages. |
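| To put numbers on this, the sketch below
estimates the best-case download time of a page over
a dial-up line and, conversely, the largest page that
fits a given waiting time. The 10-second budget is
an arbitrary assumption chosen for illustration, and
real modem throughput is usually below the nominal
speed. |

```python
def download_seconds(page_kbytes, modem_kbit_per_s):
    """Best-case download time for one page at the nominal modem speed."""
    return page_kbytes * 8 / modem_kbit_per_s


def max_page_kbytes(seconds, modem_kbit_per_s):
    """Largest page that downloads within `seconds` at the nominal speed."""
    return seconds * modem_kbit_per_s / 8


for speed in (28.8, 56.0):
    print(f"350 KB at {speed} kbit/s: {download_seconds(350, speed):.0f} s")

# Assumed budget: a visitor is willing to wait 10 seconds.
print(f"10 s budget at 28.8 kbit/s: {max_page_kbytes(10, 28.8):.0f} KB")
```

| Even in the best case a 350 KB page keeps
a 28.8 kbit/s user waiting for more than a minute
and a half. |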
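| There are ongoing discussions
within the organization on whether or not to provide
WAP (or SMS) services. Other statistical agencies
are using these techniques, e.g. for disseminating
price statistics. Still, WAP has its flaws and
is comparatively expensive for end users. Before
starting to develop WAP applications you should
therefore at least do some kind of market research
to find out if this is really what customers
want or if e.g. automatic e-mail or other solutions
are sufficient. There are also third-party
providers who can be contracted if you want to
and can afford to outsource these services. However,
with the advent of new technology for advanced,
high-speed wireless connections (General
Packet Radio Service, GPRS; High-Speed Circuit-Switched
Data, HSCSD; Bluetooth) things may change
rapidly. |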
| It is harder than might be imagined
to establish an affordable database-driven solution
for the documents on your website. Too many of
the vendors available are expensive. Still, when
you reach a certain level of complexity with your
website(s) you must reconsider whether it is not
worthwhile anyway. At the end of 2000 the Statistics
Sweden public website comprised more than 12,000 files
distributed over 500 directories. To this should
be added the virtual websites and the intranet.
A good thing about the modern solutions (apart from
the basic benefits of coordinated, standardized
techniques for all your websites and lower costs for
maintenance) is that
they enable the client to design his own "playground"
containing the links that he frequently uses and
excluding the rest. This is often referred to
as an enterprise portal. |
| XML, or eXtensible Markup Language,
is another notion that is here to stay. It promises
a standardized way of storing and delivering
highly structured information on the Web. XML's
structured syntax lets you describe virtually
any type of information, from a simple recipe to
a complex business database, and sort, filter,
find and manipulate that information in flexible
ways. It separates data and metadata, facilitates
interchange of information and is excellent for
archiving information over long periods of time
without losing the possibility of recovering the
data at any specific point. The latest versions
of databases include XML as an option, statistical
software giants like SAS and SPSS support it
and the latest generation of browsers provides XML
parsing. XML should therefore be part of any statistical
organization's method development programme. |
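| As a small illustration of the separation
of data and metadata, the sketch below parses a
statistical table stored as XML and filters it. The
element and attribute names are made up for this
example and do not follow any particular standard;
the regions and figures are dummy values. |

```python
import xml.etree.ElementTree as ET

# Invented structure and dummy figures, for illustration only.
document = """
<table id="population" reference-date="2000-12-31" status="preliminary">
  <metadata>
    <title>Population by municipality</title>
    <unit>persons</unit>
  </metadata>
  <data>
    <row region="A" value="1200"/>
    <row region="B" value="800"/>
    <row region="C" value="2500"/>
  </data>
</table>
"""

root = ET.fromstring(document)
unit = root.findtext("metadata/unit")
rows = [(row.get("region"), int(row.get("value"))) for row in root.iter("row")]

# Filter and sort the data without touching the metadata.
large = sorted((r for r in rows if r[1] >= 1000), key=lambda r: r[1], reverse=True)
total = sum(value for _, value in rows)

print(root.get("reference-date"), root.get("status"), unit)
print(large, total)
```

| The same file can be read decades later
by any XML parser, since the description of the data
travels with the data. |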
| Log the events occurring on your
website! The access log file can be very useful
to you when identifying your clients and which
pages they visit. Subject matter departments in
particular show a great interest in how their statistics
are used, not only in the number of successful
hits but also in who the visitors are and how often
they return.
You don't need any fancy programs, since log data
come as plain text and you can use your favorite
statistical software to process them. But remember
that log files occupy disk space. At Statistics
Sweden this means 55 MB daily! If you want to
archive log data you should of course first compress
them using WinZip or any other convenient software. |
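| A minimal sketch of such processing is
given below. It assumes log lines in the Common Log
Format; the actual server may well write another
layout (e.g. the W3C extended format), in which case
the pattern has to be adapted. The sample lines,
hosts and paths are invented. |

```python
import re
from collections import Counter, defaultdict

# Invented sample lines in Common Log Format (an assumption about the layout).
LOG_LINES = """\
192.0.2.1 - - [28/Dec/2000:09:00:01 +0100] "GET /population.html HTTP/1.0" 200 64000
192.0.2.2 - - [28/Dec/2000:09:00:02 +0100] "GET /population.html HTTP/1.0" 200 64000
192.0.2.1 - - [28/Dec/2000:09:05:10 +0100] "GET /prices.html HTTP/1.0" 200 12000
192.0.2.3 - - [28/Dec/2000:09:06:00 +0100] "GET /missing.html HTTP/1.0" 404 300
192.0.2.1 - - [29/Dec/2000:10:00:00 +0100] "GET /population.html HTTP/1.0" 200 64000
""".splitlines()

PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

hits = Counter()              # successful hits per page
days_seen = defaultdict(set)  # distinct days on which each host visited

for line in LOG_LINES:
    match = PATTERN.match(line)
    if not match or not match["status"].startswith("2"):
        continue  # skip unparsable lines and unsuccessful requests
    hits[match["path"]] += 1
    days_seen[match["host"]].add(match["day"])

returning = sorted(host for host, days in days_seen.items() if len(days) > 1)
print(hits.most_common())
print("returning visitors:", returning)
```

| Note that a host address is only an approximation
of a visitor, since many users may share one proxy
address. |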
| Clean up the servers on a regular
basis. Broken links are no fun, and a lot of garbage
(pages that are outdated or never referred to) tends
to be left in the directories. |
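| Such a clean-up can be partly automated.
The sketch below checks a miniature, invented site
held in memory for broken internal links and for
pages that no other page links to; a real check would
read the files from the web directories and resolve
relative paths, which is left out here. |

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href targets of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs
                              if name == "href" and value)


def audit(pages, start="index.html"):
    """Return (broken links, unreferenced pages) for a dict of filename -> HTML."""
    broken, referenced = [], {start}
    for name, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        for target in collector.links:
            if "://" in target or target.startswith(("mailto:", "#")):
                continue  # external links and anchors are out of scope here
            if target in pages:
                referenced.add(target)
            else:
                broken.append((name, target))
    orphans = sorted(set(pages) - referenced)
    return broken, orphans


# Invented miniature site.
site = {
    "index.html": '<a href="population.html">Population</a> <a href="prices.html">Prices</a>',
    "population.html": '<a href="index.html">Home</a> <a href="old/table7.html">Table 7</a>',
    "prices.html": '<a href="http://www.example.org/">External</a>',
    "draft1998.html": "<p>Outdated draft</p>",
}
broken, orphans = audit(site)
print("broken links:", broken)
print("never referred to:", orphans)
```

| Pages reported as never referred to are
candidates for removal, but they should be checked
by hand since they may be linked from outside the
site. |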
| Try to establish separate environments
for development, test and production. Too often
you tend to use e.g. one server both for development
and testing, with numerous system outages as a result.
Another thing to highlight is that you should
provide proper tools for your web developers and
not stay too long with "outdated" software on
the clients when you have migrated your production
system to newer versions. |
| Finally, Virtual Private Networking
technology allows an agency to connect to branch
offices or to other agencies or organizations
over a public IP network like the Internet, while
maintaining secure communications. To the user
a VPN is a "point-to-point" connection and how it
works behind the scenes is irrelevant. The main
advantage is that you only have to connect to
local ISPs to establish a VPN, thus reducing the
need for e.g. modem pools and remote dialing. VPN
should therefore be considered as the major alternative
for data exchange over long distances. (Most common
is the use of the Point-to-Point Tunneling Protocol,
PPTP, which allows IP, IPX or NetBEUI traffic
to be encrypted and sent across any IP network,
while the Layer 2 Tunneling Protocol, L2TP, allows
traffic over any medium that supports datagram
delivery, thus also including e.g. X.25 or ATM
networks.) |
|
| ISP collected data on the
connection point 28/12 and 31/12 (see footnote 4) |
| Bits per second external
traffic (measured on intervals of 5 minutes) |
|
|
| IBM collected data on the
two web servers 28/12 (Performance Monitor) (see footnote 4) |
| Number of parallel sessions
running. Intervals of 15 seconds. |
|
4 Axel Skough. Rapport från Befolkningspubliceringen
2000 (Report from the 2000 population release).
Internal paper.
|