Workshop on Population Data Analysis, Storage and Dissemination Technologies
Bangkok, 27-30 March 2001

STAT/WDT/6
27 March 2001
ENGLISH ONLY

ECONOMIC AND SOCIAL COMMISSION FOR ASIA AND THE PACIFIC

Workshop on Population Data Analysis, Storage and Dissemination Technologies
27-30 March 2001
Bangkok
Going on the net with your national statistics: What is there to consider?*
(Item 7 of the provisional agenda)
By Sten Backlund, UNSD [1]

* This paper, prepared by Mr Sten Backlund, United Nations Statistics Division (UNSD), has been reproduced as submitted.  It has been issued without formal editing.
1 Former employee of Statistics Sweden and during 2000 in charge of coordinating Statistics Sweden's Internet activities
Abstract
In a society where instant access to valuable information can make the difference between success and failure, you have to rely on your sources of data being available when you need them. At the end of each year Statistics Sweden releases tables containing preliminary population figures as of 31 December, down to the municipal level. On 28 December 1999 the website timed out for no apparent reason. We will use this as a starting point, search for answers to why it happened and discuss how it could have been avoided. In this context we will also elaborate on how to address issues that are not readily found in textbooks or manuals.
The site
Introduction
In the quest for answers to the collapse of the Statistics Sweden website there are some preconditions to keep in mind. Statistics Sweden launched its first data on the Internet in the fall of 1995 and the site grew rapidly in size. At the end of 1999 it was decided to restructure the website and opt for a database-driven solution, since the burden of administering the file-based system was threatening the integrity of the site. For this purpose a time-consuming bidding procedure for a database-driven solution started in early 2000, the result of which did not live up to expectations. In the same year it was decided that all official statistics, including ad-hoc requests from the macro databases [2], should be made freely available on the Internet.
During these years the information generally available on the web, as well as the number of connected users, has multiplied in a way that was never imagined. This naturally affects any website that provides information that may be critical to the end users. Systems have also become more sophisticated, since the number of different web services that can be offered grows as methods and techniques evolve.
One problem that fast expansion seems inevitably to lead to is that the owner of any given website will always find himself racing against time to catch up and adapt to the ever-increasing demand from the international community.
Availability is one of the most important measures of quality. In many cases you will demand 24-hour service, e.g. in banking, booking and news distribution. In Sweden, as an example, an information "marketplace" is well under way where, in a first step, all providers of governmental information will be connected for sharing information [3]. Later on large businesses will be part of it, and in a longer perspective any private person, organization or company will be able to access and share information over the web. To make this happen, a standardized secure system for data interchange and authentication of actors must be offered. Today you still might end up with a number of different solutions, depending on what has been chosen by the company with which you want to communicate. (Typically, Swedish banks use different solutions for securing transactions over the web, e.g. soft certificates or smart cards, instead of sharing the same one.)
We can comment briefly on the factors that have a negative impact on a website's availability. What first comes to mind is hardware or systems failure that has nothing to do with the web application itself, e.g. when a web server stops operating or when the power supply fails. Second, the website may time out because it is badly designed or contains bugs. Third, the website can be flooded with junk e-mail or come under attack from an intruder who seemingly does not like you, resulting in what is called denial of service.
(Simple probability theory shows that even if your single web server has a probability of 0.99 of being up and running, it will on average be out of order approximately 7.2 hours a month!)
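The arithmetic behind this figure can be written out as a short sketch. It assumes a 30-day month (720 hours); the second function, which estimates what a redundant second server would give, additionally assumes that the servers fail independently of each other, which is an idealisation and not something measured at Statistics Sweden.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def expected_downtime_hours(availability, hours=HOURS_PER_MONTH):
    """Expected hours out of order for a given probability of being up."""
    return (1.0 - availability) * hours


def parallel_availability(availability, servers):
    """Availability of N redundant servers, assuming independent failures.

    The site counts as available as long as at least one server is up.
    """
    return 1.0 - (1.0 - availability) ** servers


single = expected_downtime_hours(0.99)
double = expected_downtime_hours(parallel_availability(0.99, 2))
print(f"one server:  {single:.2f} hours per month")
print(f"two servers: {double:.3f} hours per month")
```

Under these assumptions a second, instantly available server would reduce the expected downtime from about 7.2 hours to a few minutes per month.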
Going back to our case study, the Statistics Sweden website is today comparatively large and complex, as are many other NSO websites. The current number of HTML pages produced regularly or ad-hoc is large. The underlying meta and macro databases are voluminous since it is mandatory that all official statistics produced should be added or appended to the database tables. Most information available is also translated into English. The site records more than 150,000 successful hits daily.
Statistics Sweden still maintains a conventional web solution based on file directories for documents, while data are stored in Sybase databases. As a consequence it has been decided that responsibility for updating site information resides with the policy area programmes. There is no data warehousing implemented as yet, although facilities such as those that come with the SAS software are under evaluation. The agency relies heavily on PC-AXIS as a tool for standardization and dissemination of data.
Virtual hosting is used for launching web-driven multi-layered applications, e.g. where electronic forms are used for collecting statistical data from respondents such as administrative units, schools or businesses, or where results and analyses from surveys that are not part of the official statistics are published for designated users. More than 50 such web sites, of different sizes, exist.
Securing the web site and the internal NT LAN is considered of utmost importance. Statistics Sweden has implemented proxy and firewall techniques to prevent intruders and to filter information that may be harmful to the system. 

2 SSD - Sveriges Statistiska Databaser
3 SHS - Spridnings- och HämtningsSystemet, jointly developed by Statskontoret (The Swedish Agency for Public Management), Riksskatteverket (The National Tax Board) and Riksförsäkringsverket (The National Social Insurance Board)
The hardware configuration
The configuration adopted at that time was built on mirrored web and database servers for the public Internet. The "outer" web server was located behind the firewall in the so-called Demilitarized Zone (DMZ), a physical network in itself that provides only dedicated services on predefined nodes to the users, e.g. FTP, HTTP, HTTPS and SMTP, and for practical reasons a few others, e.g. DNS, Daytime, NTP and Ping.
In order to speed up data transfer, a secure transport network was set up connecting the two web servers and the two database servers. From the safe SCB LAN the "inner" servers were updated first and then the changes were propagated onto the "outer" servers. If any of the "outer" servers failed it could be temporarily swapped for the corresponding "inner" server, though only during office hours.
Another facility provided is a Virtual Private Network (using the Point-to-Point Tunneling Protocol), a method that gives Internet users encrypted access to the safe network, e.g. for staff working from home or for secure communications between Statistics Sweden and trusted partners. (VPN will be discussed further at the end of the paper.)
The firewall is implemented through dedicated hardware (Watchguard Firebox II + which supports 5000 concurrent threads). A strict scheme is implemented describing what kind of incoming and outgoing traffic is permitted.
Methods and schemes for updating and maintaining the web site
Statistics Sweden has been using mainstream technology for its web development and still does. At that time it meant the Windows NT 4.0 product family (Internet Information Server, Index Server, Transaction Server, Proxy Server) and the Office 95/98 suite of software. The website has been implemented mainly using ASP technique to retrieve pages and data (although a minor deviation is made since the Java based Silverstream product is used for a few electronic data collection applications). FrontPage is the number one tool for the final touch of page design but there are alternatives used e.g. Dreamweaver or Adobe PageMill.
But the most common method is still generating files from MS Word and MS Excel. Updating statistical information is delegated to staff in the line departments. There are more than 200 persons who have selected rights to update file and database servers. Since all official data from recurrent and intermittent activities are appended to the statistical databases in each policy area, this means that in all departments there are persons who are entrusted with updating and maintaining database tables. This is often done in an automated fashion, even if hands-on work is needed to initiate the process. The same goes for creating or updating informative pages, e.g. press releases or information on specific topics such as the living standard survey. A comprehensive set of instructions has been developed covering most of what one needs to know, from layout to permissions and updating.
Central staff are needed for coordinating activities, informing on schemes and procedures, arranging internal or external training etc. This responsibility resides with the two webmasters, who have at their disposal a steering group in which the subject matter departments are represented.
Taken together, in order to run the website smoothly everyone must know what to do, how to do it and when. Badly designed scripts or too much reliance on standard output from popular software may occasionally decrease overall site performance.
The search for answers
It was obvious that the answer to the problem of December 28 to a great extent could be related to the 512 Kbit capacity of the connection point to the ISP. Another matter that early came to attention was the way the published tables were produced. The most common way was to use ASP technique to create requested tables but what did that really mean?
Previously no direct control over how pages were created was implemented except that they should conform to the layout standards that were decided and that the language was correct. Of course recommendations and rules were established but there was simply not enough time to maintain a regular checking of page sizes or if links were valid and not broken.
ASP-generated HTML code, contrary to static HTML, is not cached. This means that whenever a client returns to a previous ASP page it will be re-created instead of being returned from the client's cache or the proxy. With NT4 there is another problem. A standard installation allows 30 parallel ASP scripts to run (Microsoft Script Engine default) before queuing starts, while studies undertaken at the IT unit showed that when more than 500 ASP scripts were executing or waiting in line the server tended to return the error code 500, i.e. "Server too busy".
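The queuing behaviour described here can be illustrated with a toy model. It is only a sketch of the two limits mentioned in the text (30 scripts running in parallel, trouble beyond roughly 500 scripts executing or in line); the class and method names are invented for the illustration and do not correspond to any IIS or ASP interface.

```python
from collections import deque


class AspQueueModel:
    """Toy model: a fixed number of scripts run in parallel, the rest wait in line."""

    def __init__(self, max_running=30, max_total=500):
        self.max_running = max_running  # parallel scripts before queuing starts
        self.max_total = max_total      # running plus waiting before requests are refused
        self.running = set()
        self.waiting = deque()

    def arrive(self, request_id):
        """Register a new request and report what happens to it."""
        if len(self.running) + len(self.waiting) >= self.max_total:
            return "500 Server too busy"
        if len(self.running) < self.max_running:
            self.running.add(request_id)
            return "running"
        self.waiting.append(request_id)
        return "queued"

    def finish(self, request_id):
        """A script completes; the request first in line takes its place."""
        self.running.discard(request_id)
        if self.waiting and len(self.running) < self.max_running:
            self.running.add(self.waiting.popleft())


model = AspQueueModel()
outcomes = [model.arrive(i) for i in range(600)]  # a burst of 600 requests, none finishing
for outcome in ("running", "queued", "500 Server too busy"):
    print(outcome, outcomes.count(outcome))
```

The point of the model is that once scripts finish more slowly than requests arrive, the line fills up and every further visitor is refused, regardless of how small the individual page is.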
Then there was the problem of only a single server performing all duties. What would have happened if there had been another server instantly available? This question later led to a rearrangement of the web site.
Another thing that relates to availability is that there should always be staff at hand when extraordinary events are expected. This was the case even though December 28 is one of the days when people are on Christmas holiday. During the day the server was restarted in order to get rid of processes blocking it from delivering the requested services.
But what was soon found to be the real "sinner" was the population table most in demand, the size of which was 600 K! When the HTML code was examined and unnecessary formatting (at table cell level) was removed, it was down to 64 K. On 28/12 the number of recorded visits to this page was a moderate 3000. Another simple calculation shows that this yields 1.8 GB of outgoing data (3000 × 600 K) that on the current 2 Mbit connection would occupy the single server on a 100% basis for 10 hours, while the reduction leads to not more than 17 minutes.
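The method behind this kind of estimate can be sketched as follows. The function is an idealised calculation that assumes 1 K = 1000 bytes, 8 bits per byte and a line used exclusively for this one page; the figures it prints therefore will not match the ones quoted above exactly, since real traffic carries protocol overhead and the result depends on which line speed is assumed.

```python
def transfer_seconds(page_kbytes, visits, link_kbit_per_s, bits_per_byte=8):
    """Time needed to push one page to all visitors over a fully dedicated line.

    Assumptions: 1 K = 1000 bytes and no protocol overhead unless
    bits_per_byte is raised above 8 to approximate it.
    """
    total_bits = page_kbytes * 1000 * bits_per_byte * visits
    return total_bits / (link_kbit_per_s * 1000)


VISITS = 3000
for size_k in (600, 64):
    for link in (512, 2000):
        seconds = transfer_seconds(size_k, VISITS, link)
        print(f"{size_k:>3} K page on {link:>4} Kbit line: {seconds / 3600:5.2f} hours")
```

Whatever the exact assumptions, the ratio between the two page sizes is what matters: a table almost ten times smaller occupies the line for almost ten times less time.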
Solving the problem?
Obviously the first thing that had to be done to prevent the same thing from happening again was to change the manner in which frequently visited pages were created. Most pages containing statistical information were the result of efforts made by staff in the subject matter departments using Word or Excel, in some cases FrontPage or Dreamweaver. The resulting HTML code was then often oversized. Seen in a micro-perspective this is nothing to fuss about. Most statistics that are disseminated have a moderate number of readers, and on an ad-hoc basis. But for pages in high demand the situation is different. A macro tool was developed to strip the code of unnecessary formatting. ASP was to be avoided where static HTML was a sufficient alternative, e.g. for the most wanted population tables in this case. Static HTML also permits caching. In this way the main table was reduced by a factor of 10.
The second thing to consider was how to optimize the bandwidth usage at the point of connection and/or to increase the capacity. While 512 Kbit is not much nowadays, it was still regarded as quite sufficient for the Internet traffic in 1999 and before. It was decided to extend the connection to 2 Mbit but at the same time to evaluate software on the market for load balancing and for "traffic control" (meaning that traffic through your pipe is differentiated according to its priority).
Since one of the underlying factors was the single web server available at the connection point (ISP Point-of-Presence), it was necessary to address this situation properly. A different solution was proposed and adopted, which meant that both web servers were now located in the DMZ. It was also decided to upgrade the OS from NT4 to Windows 2000 AS SP1. This would mean that traffic from the inside (SCB LAN) must pass through the firewall, but this was considered a minor obstacle since the capacity of the firewall had previously been increased (not needed for the automatic updating of the servers). At the same time Round Robin DNS was implemented, which meant that incoming traffic was shared between the servers on an equal basis.
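What such a stripping tool does can be sketched in a few lines. This is not the macro developed at Statistics Sweden, whose details are not given here; it is a minimal illustration that assumes the bloat consists of presentational attributes repeated on every table tag plus font and span wrappers, and that only colspan and rowspan need to be kept. The sample table and its value are made up.

```python
import re

# Opening tags of table elements, with whatever attributes the export tool added.
TABLE_TAG = re.compile(r"<(table|tr|td|th)\b[^>]*>", re.IGNORECASE)
# Attributes that carry structure rather than appearance and must survive.
STRUCTURAL = re.compile(
    r'''\b(colspan|rowspan)\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)''', re.IGNORECASE
)
# Purely presentational wrappers around the cell contents.
WRAPPER = re.compile(r"</?(font|span)\b[^>]*>", re.IGNORECASE)


def _bare_tag(match):
    """Rebuild an opening tag, keeping only the structural attributes."""
    name = match.group(1)
    kept = [f"{attr}={value}" for attr, value in STRUCTURAL.findall(match.group(0))]
    return "<" + " ".join([name] + kept) + ">"


def strip_formatting(html):
    """Remove cell-level formatting while leaving the table structure intact."""
    return TABLE_TAG.sub(_bare_tag, WRAPPER.sub("", html))


sample = (
    '<table border="1" cellpadding="2"><tr>'
    '<td width="120" align="right" style="font-family:Arial" colspan="2">'
    '<font face="Arial" size="2">12 345</font></td></tr></table>'
)
cleaned = strip_formatting(sample)
print(cleaned)
print(f"{len(sample)} -> {len(cleaned)} characters")
```

A regular expression is a blunt instrument for HTML; for anything beyond machine-generated tables of a known shape a proper parser would be the safer choice, and the layout should then be carried by a shared style sheet.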
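The principle of Round Robin DNS can be sketched as a name server that holds several addresses for one host name and rotates their order for each query, so that successive clients, which normally try the first address, are spread evenly over the servers. The class name is invented for the illustration and the addresses are taken from the range reserved for documentation.

```python
from collections import Counter, deque


class RoundRobinZone:
    """Hands out the same address records in rotating order."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one address is required")
        self._records = deque(addresses)

    def query(self):
        """Return all records, then rotate so the next client sees a new first entry."""
        answer = list(self._records)
        self._records.rotate(-1)
        return answer


zone = RoundRobinZone(["192.0.2.10", "192.0.2.11"])  # two web servers, example addresses
hits = Counter(zone.query()[0] for _ in range(1000))  # each client uses the first address
print(hits)
```

The rotation is blind: it knows nothing about whether a server is actually responding or how loaded it is, so a failed server keeps receiving its share of the visitors until its record is removed.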
One of the problems was that service was not available around the clock so one had to make sure that there would be someone working on that day to supervise. It was decided to take down the servers the night before the 28th to clear all pending processes.
For extra measure the incoming mail was rerouted to the other connection point in Örebro (some 180 km west of Stockholm), staff were encouraged not to send any large e-mail consignments on this day and an additional link on the welcome page was inserted pointing directly to the population tables. This link was also provided to the major Swedish newspapers.
Lessons learnt
The release of the population figures went without problem. The actions taken proved sufficient. Most probably, changing the main table from ASP to static HTML and stripping the code of unnecessary formatting would alone have been adequate.
Upgrading from NT4 to Windows 2000 was performed without trouble. (The problem of ASP queuing is also more or less solved in this environment.) Windows 2000 also seems to be more stable, since system failures have diminished.
In retrospect it may look somewhat like overkill that all these steps were taken to prevent the site from timing out. But all the same it is considered a severe loss of confidence when such a thing happens while you disseminate statistics of major interest to the media and the common man. It should be remembered that the exercise also brought things to the surface that led to a deeper understanding of which factors affect a website and how to deal with them in better ways. In the end the overall outcome was positive.
One of the main findings is that you should never neglect the need of a staff development program on how to implement and use mainstream software for creating web pages or multi-layered web applications. Even if your staff are competent, as in this case, you cannot lay the responsibility on them to learn things on their own. Commonly those involved are totally occupied with day-to-day work and you need to provide the time needed to attend courses, workshops or seminars to raise their competence.
If there are many players engaged in the development, updating and maintenance of your web site you will need a coordinating body. In our case web development and updating resided with one department (Information and Publishing), hardware and methods for creating web-driven applications for disseminating macro data were the responsibility of the IT unit, and maintenance, finally, was handled under a contract with IBM Global Services. There were some communication problems between these bodies and thus a consensus was missing, which to some extent contributed to the unwanted situation.
The importance of instant redundancy! The solution at that time was mirrored web and database servers, located in the DMZ and on the transport (secure) network. In the updating process the inner servers were first addressed and then the updated information was propagated onto the outer servers in an automatic process. Still this meant that the site was configured with only a single web server that provided Internet services to the public. Even if you could switch to the mirrored server on the transport (secure) network in the case of failure you still needed someone to be there to supervise activities for bringing the system back to normal.
Put your site under stress! Establish a schedule for testing your site and deal with latent risks before unwanted problems arise. Monitor activities and obtain data on hits from your ISP and from within that can then be used for comprehensive statistics and analysis. There are also third-party providers who can be contracted if you want to, and can afford to, outsource these services.
Additional remarks
When you engage in developing your first website, the first thing you normally have to do is to identify your potential users and what information they would like you to offer. You will also have to decide to what degree the information will be made available to the users. You will carry out a risk analysis in order to identify possible threats to your web site. By doing all this you are under way and will hopefully end up formulating an Internet Policy for your agency. Common parts of the policy deal with e.g. organization, infrastructure, security, maintenance and competence development.
A website will at times seem to have a life of its own, and while you try to "contain" it by imposing new methods or enhancing the technique you will always gain from careful planning, from laying a solid ground for your intended Internet activities by employing mainstream technology and standards at an early stage, and from keeping your staff skilled and motivated through training schemes and promotions.
In the beginning of 2001 Statistics Sweden made additional changes in its web configuration. Round Robin DNS was abandoned in favor of Windows Network Load Balancing, a facility that comes with Windows 2000 AS. A current internal assessment indicates 14 as the minimum number of servers needed if no changes in the provided web services are implemented.
When discussing line capacity one should always remember that many users on the Internet are still stuck with 28.8 or 56 Kbit dial-up modems. They are not interested in whether or not you have 2, 50 or 100 Mbit channels for the in- and outgoing traffic. An ASP page generating 350 K will still be painful to download. Therefore you should always try to minimize the generated HTML code, or at least regularly keep an eye on the most frequently visited pages.
There are ongoing discussions within the organization on whether or not to provide WAP (or SMS) services. Other statistical agencies are using this technique, e.g. for disseminating price statistics. Still, WAP has its flaws and is comparatively expensive for end users. Before starting to develop WAP applications you should therefore at least do some kind of market research to find out if this is really what customers want or if it is sufficient with e.g. automatic e-mail or other solutions. However, with the advent of new technology for advanced and high-speed wireless connections (General Packet Radio Service, GPRS; High-Speed Circuit-Switched Data, HSCSD; Bluetooth) things may change rapidly.
It is harder than could be imagined to establish an affordable database-driven solution for the documents on your website. Too many of the vendors available are expensive. Still, when you reach a certain level of complexity with your web site(s) you must reconsider whether it isn't worthwhile anyway. At the end of 2000 the Statistics Sweden public web site comprised more than 12,000 files distributed over 500 directories. To this should be added the virtual web sites and the intranet. A good thing about the modern solutions (apart from the basic benefits of coordinated, standardized techniques for all your web sites and lower maintenance costs) is that they enable the client to design his own "playground" containing the links that he frequently uses and excluding the rest. This is often referred to as an enterprise portal.
XML, or eXtensible Markup Language, is another notion that is here to stay. It promises a standardized way of storing and delivering highly structured information on the Web. XML's structured syntax lets you describe virtually any type of information, from a simple recipe to a complex business database, and sort, filter, find and manipulate that information in flexible ways. It separates data and metadata, facilitates interchange of information and is excellent for archiving information over long periods of time without losing the possibility of recovering the data at any specific point. The latest versions of databases include XML as an option, statistical software giants like SAS and SPSS support it and the latest generation of browsers provides XML parsing. XML should therefore be part of any statistical organization's method development programme.
Log the events occurring on your web site! The access log file can be very useful to you when identifying your clients and which pages they visit. Subject matter departments in particular show a great interest in how their statistics are used, not only in the number of successful hits but also by whom and how often he/she returns. You don't need any fancy programs since log data comes as plain text and you can use your favorite statistical software to process it. But remember that log files occupy disk space. At Statistics Sweden this means 55 MB daily! If you want to archive log data you should of course first compress it using WinZip or any other convenient software.
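As an illustration of what separating data and metadata can look like, the following sketch stores a small statistical table as XML and then reads, filters and sorts it. The element and attribute names are invented for the example and do not follow any particular standard, and the regions and values are made up.

```python
import xml.etree.ElementTree as ET

# A hypothetical layout: one block describing the table, one block holding the figures.
DOCUMENT = """<?xml version="1.0"?>
<table id="population-example">
  <metadata>
    <title>Population by region (illustrative values)</title>
    <unit>persons</unit>
  </metadata>
  <data>
    <row region="Region A" value="1200"/>
    <row region="Region B" value="450"/>
    <row region="Region C" value="980"/>
  </data>
</table>
"""

root = ET.fromstring(DOCUMENT)

# Metadata and data are read independently of each other.
title = root.findtext("metadata/title")
unit = root.findtext("metadata/unit")
rows = [(row.get("region"), int(row.get("value"))) for row in root.iterfind("data/row")]

# Filter and sort without caring how the table will eventually be displayed.
large = sorted((r for r in rows if r[1] >= 900), key=lambda r: r[1], reverse=True)

print(title)
for region, value in large:
    print(f"{region}: {value} {unit}")
```

The same file could be turned into an HTML page, loaded into a database or archived unchanged, which is the kind of flexibility the text refers to.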
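A minimal sketch of such log processing is given below. It assumes that the server writes its access log in the widespread Common Log Format; other servers and settings use other layouts, in which case the pattern has to be adapted. The sample lines, client addresses and page names are invented, and gzip from the standard library stands in for whatever compression software is at hand.

```python
import gzip
import re
from collections import Counter

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

SAMPLE = """\
192.0.2.1 - - [28/Dec/2000:09:00:01 +0100] "GET /tables/population.html HTTP/1.0" 200 65536
192.0.2.2 - - [28/Dec/2000:09:00:02 +0100] "GET /tables/population.html HTTP/1.0" 200 65536
192.0.2.1 - - [28/Dec/2000:09:00:05 +0100] "GET /press/release.html HTTP/1.0" 200 2048
192.0.2.3 - - [28/Dec/2000:09:00:07 +0100] "GET /missing.html HTTP/1.0" 404 512
"""


def summarize(lines):
    """Count successful hits per page and per client."""
    pages, clients = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or match.group("status") != "200":
            continue  # skip malformed lines and unsuccessful requests
        pages[match.group("path")] += 1
        clients[match.group("host")] += 1
    return pages, clients


pages, clients = summarize(SAMPLE.splitlines())
print(pages.most_common())
print(clients.most_common())

# Compress the raw log before archiving it.
archive = gzip.compress(SAMPLE.encode("utf-8"))
print(len(SAMPLE), "bytes raw,", len(archive), "bytes compressed")
```

On a real log of tens of megabytes the repetitive text compresses far better than this tiny sample suggests.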
Clean up the servers on a regular basis. Broken links are no fun and a lot of garbage, pages never referred to or outdated, tend to be left in the directories. 
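A regular clean-up can be supported by a small script that walks the document directories and reports links pointing to files that no longer exist, as well as pages that no other page refers to. The sketch below handles only relative links between HTML files in one directory tree; the function name, the choice of index.html as entry page and the three demonstration files are assumptions made for the example.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href" and value)


def audit_site(root, entry_page="index.html"):
    """Return (broken links, pages that no other page links to)."""
    root = Path(root).resolve()
    pages = set(root.rglob("*.html"))
    referenced = {root / entry_page}  # the entry page is reached directly by visitors
    broken = []
    for page in sorted(pages):
        collector = LinkCollector()
        collector.feed(page.read_text(encoding="utf-8"))
        for href in collector.hrefs:
            if "://" in href or href.startswith(("#", "mailto:")):
                continue  # external links and in-page anchors are not checked here
            target = (page.parent / href.split("#")[0]).resolve()
            if target.is_file():
                referenced.add(target)
            else:
                broken.append((page.name, href))
    unreferenced = sorted(p.name for p in pages - referenced)
    return broken, unreferenced


# Demonstration on a throw-away directory with one broken link and one forgotten page.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp)
    (site / "index.html").write_text(
        '<a href="tables.html">Tables</a> <a href="gone.html">Old</a>', encoding="utf-8"
    )
    (site / "tables.html").write_text('<a href="index.html">Home</a>', encoding="utf-8")
    (site / "forgotten.html").write_text("<p>outdated</p>", encoding="utf-8")
    broken, unreferenced = audit_site(site)

print("broken links:", broken)
print("never referred to:", unreferenced)
```

A page that shows up as never referred to is not necessarily garbage, since it may be linked from outside the site or from the access log's point of view still be in use, so the list is a starting point for the responsible department rather than a deletion order.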
Try to establish separate environments for development, test and production. Too often one server is used for both development and testing, with numerous system outages as a result. Another thing to highlight is that you should provide proper tools for your web developers and not stay too long with "outdated" software on the clients when you have migrated your production system to newer versions.
Finally, Virtual Private Networking technology allows an agency to connect to branch offices or to other agencies or organizations over a public IP network like the Internet while maintaining secure communications. To the user a VPN is a "point-to-point" connection, and how it works behind the scenes is irrelevant. The main advantage is that you only have to connect to local ISPs to establish a VPN, thus reducing the need for e.g. modem pools and remote dialing. VPN should therefore be considered the major alternative for data exchange over long distances. (Most common is the use of the Point-to-Point Tunneling Protocol, PPTP, which allows IP, IPX or NetBEUI traffic to be encrypted and sent across any IP network, while the Layer 2 Tunneling Protocol, L2TP, allows traffic over any medium that supports datagram delivery, thus also including e.g. X.25 or ATM networks.)
Figure: ISP-collected data on the connection point, 28/12 and 31/12 [4]. External traffic in bits per second, measured at intervals of 5 minutes.
Figure: IBM-collected data on the two web servers, 28/12 (Performance Monitor) [4]. Number of parallel sessions running, at intervals of 15 seconds.

4 Axel Skough. Rapport från Befolkningspubliceringen 2000. Internal paper.
 