Journal reference: Computer Networks and ISDN Systems,
Volume 28, issues 7-11, p. 963.
An Investigation of Documents from the World Wide Web
Paul M. Aoki
Lawrence A. Rowe
Computer Science Division
University of California at Berkeley
WWW pages: Woodruff,
We report on our examination of pages from the World Wide Web. We
have analyzed data collected by the Inktomi Web crawler (this data
currently comprises over 2.6 million HTML documents). We have
examined many characteristics of these documents, including: document
size; number and types of tags, attributes, file extensions,
protocols, and ports; the number of in-links; and the ratio of
document size to the number of tags and attributes. For a more
limited set of documents, we have examined the following: the number
and types of syntax errors and readability scores. These data have
been aggregated to create a number of ranked lists, e.g., the ten
most-used tags, the ten most common HTML errors.
HTML, statistics, tools, World Wide Web.
We report the results of an extensive analysis of HTML documents from
the World Wide Web. Our data set, collected by the Inktomi Web crawler, currently
comprises over 2.6
million HTML documents. We present a broad range of statistics
pertaining to these pages.
Such an analysis of the content of HTML documents is of interest for
several reasons:
- Evolution of HTML. Unused features and extensions that do
not achieve a reasonable level of acceptance should be deprecated and,
eventually, eliminated. This prevents the accretion of useless
- Improving Web content. Widespread awareness of poor
natural and markup language usage will promote the spread of helpful
tools and practices.
- Control of HTML. The marketplace perceives the relative
ability of vendors to force acceptance of new, non-standard language
extensions as market ``strength.'' Understanding the true acceptance
level of such extensions can help fight vendor disinformation.
- Sociological insights. Many interesting sociological
observations may be derived from the content of Web pages.
Despite these motivations, however, previous studies relating
to the Web have either focused on other topics or have been
limited in scope. The most closely related work includes:
- User studies. User surveys
and browser usage studies
have become very common. Such studies
focus on high-level user issues (e.g., choice of software, available
connectivity) and low-level user-browser interaction (e.g., use of the
back button). The information extracted, though valuable, is
- Content analyses of small data sets.
There have been some attempts to perform simple analyses of the
content of the Web. For example, the original Lycos
project at Carnegie Mellon University's Center for Machine Translation
tracked a number of interesting statistics while
their data set was relatively small. These included:
- content of title and headings
- 100 top keywords and first 20 lines
- word frequency count
- file size (bytes, words)
- URL types
- most-linked-to URLs
- Structural analysis.
The CMU Lycos project generated at least one complete graph
of their data set. The project's commercial successor, Lycos, Inc.,
now tracks the 250 most-linked-to sites as a side-effect of their
Other projects have focused on (graph-oriented)
structural analysis as well.
These include several Web visualization systems (e.g., Webspace
and the Navigational View Builder).
For the most part, such visualization
has been very small-scale and limited in scope.
More sophisticated analyses are possible, combining both structural
analysis and semantic modelling. A project at Xerox PARC is
conducting such analyses over small data sets.
To complement the above work,
we have conducted a large-scale investigation of the
content of HTML documents from the
Web. The remainder of this paper is structured as follows.
First, we describe the tools we used to perform our study.
We next discuss the scope of our study and our results.
Finally, we present some lessons learned and
possible future directions.
The tools used to perform the data collection and data analysis for
this study represent the integration of software from a variety of
sources. Specifically, we have developed or adapted software
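to perform the following tasks:
- Web data collection
- HTML data extraction and manipulation
- natural language analysis
- markup language analysis
We discuss each set of tools in turn.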
Web Data Collection
The Inktomi research
project at Berkeley, consisting of Prof. Eric Brewer and graduate
student Paul Gauthier, conducts research in the construction of scalable Web
servers using parallel processing technology. To date, the
project has produced two major software components: a parallel Web
crawler and a parallel Web index search engine. In this paper,
where we mention Inktomi, it may be assumed that we refer to the crawler.
The data presented in this study comes entirely from Inktomi. The
high speed of the crawler enables us, for the first time, to consider
taking ``snapshots'' of the Web and analyzing them. As of this
writing, the Inktomi team has crawled twice. The first set of runs,
from July to October 1995, collected 1.3 million unique HTML documents.
The second set of runs, in November 1995, collected 2.6 million
unique HTML documents.
HTML Data Extraction and Manipulation: libink
Although toolkits such as the W3C Reference Library [FRYS94] already exist for manipulating HTML and
HTTP objects, we have developed our own special-purpose library,
libink. This was necessitated by the fact that our
performance and functionality needs were very different from those of
the other toolkit developers.
libink consists of four major subcomponents:
- HTML parser. libink contains a simple
flex-based HTML scanner. We found existing parsers too slow
(especially true in the case of parsers written in scripting
languages) or difficult to modify. The libink scanner is
small, enabling us to make it both fast and relatively robust, as well
as highly configurable. Like the W3C SGML/HTML lexical analyzer [CONN95], our scanner uses a callback interface to
handle various events (e.g., recognition of a tag and its attributes).
The W3C lexical analyzer, however, is not configurable.
- URL parser. The URL parser, unlike many freely-available
implementations, conforms to RFC 1808 [FIEL95].
- Domain name service (DNS) translation and caching. We use
Internet addresses to reduce hostname aliasing in our data. To speed
up the lookup process, we provide a wrapper around the standard name
service routines that caches all URL hostnames.
- General hash table services. The various lookup tables on
which libink relies sometimes exceed the capacity of a single
machine's physical memory. Therefore, in addition to in-memory hash
tables, libink provides interfaces to striped on-disk hash
tables (using GNU DBM) as well as hash-partitioned distributed hash
tables (using ONC RPC). The distributed hash tables support 1ms
turnaround on hash table lookups, which is far better than the 20-30ms
required to fetch a hash table page from secondary storage.
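As an illustration of the DNS translation and caching component described above, the following is a minimal Python sketch, not libink's actual code. The class name CachingResolver is hypothetical, and the choice to cache failed lookups as well as successful ones is an assumption. The resolver function is injectable so that the cache can be exercised without network access.

```python
import socket
from typing import Callable, Dict, Optional


class CachingResolver:
    """Cache hostname -> address lookups so each host is resolved only once.

    Hypothetical sketch; the default lookup is the standard library's
    socket.gethostbyname, but any callable taking a hostname may be supplied.
    """

    def __init__(self, lookup: Callable[[str], str] = socket.gethostbyname):
        self._lookup = lookup
        self._cache: Dict[str, Optional[str]] = {}

    def resolve(self, hostname: str) -> Optional[str]:
        # Hostnames are case-insensitive and may carry a trailing dot.
        key = hostname.lower().rstrip(".")
        if key not in self._cache:
            try:
                self._cache[key] = self._lookup(key)
            except OSError:
                # Assumption: remember failures too, to avoid repeated lookups.
                self._cache[key] = None
        return self._cache[key]
```

Two URLs whose hostnames resolve to the same address can then be treated as referring to the same server, which is the aliasing reduction described above.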
Natural Language Analysis: style
We scored English language documents using the standard UNIX
style program [CHER81]. style
reports a variety of statistical properties of each document, such as
the average sentence length and the number of complex sentences. It
also scores the document using four readability metrics. These
metrics indicate the nominal educational (grade) level a reader would
need to understand the document.
Since most HTML documents do not conform to an internationalization
standard, we applied heuristics to screen out non-English documents.
We filtered out documents that contained any character with
the high bit set (indicating a non-ASCII character set) or containing
character sequences indicating known encodings (such as the Shift-JIS
encoding of the Japanese character set).
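The screening heuristic just described can be sketched as follows. This is an illustrative reconstruction rather than the filter actually used. The function name looks_english and the list of escape sequences are assumptions. Shift-JIS text is caught by the high-bit test, while the escape-sequence test covers 7-bit ISO-2022 encodings that the high-bit test alone would miss.

```python
# Escape sequences that switch to multibyte character sets in ISO-2022
# encodings. The list is illustrative, not exhaustive (assumption).
ISO2022_ESCAPES = (b"\x1b$B", b"\x1b$@", b"\x1b$A", b"\x1b$)C")


def looks_english(raw: bytes) -> bool:
    """Return False if the document shows signs of a non-ASCII encoding."""
    # Any byte with the high bit set indicates a non-ASCII character set.
    if any(byte >= 0x80 for byte in raw):
        return False
    # 7-bit encodings announce themselves with known escape sequences.
    return not any(seq in raw for seq in ISO2022_ESCAPES)
```

A document that passes this test is only probably English; plain ASCII text in another language would also pass.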
Markup Language Analysis: weblint
We scored documents using weblint [BOWE96], an analogue to the standard UNIX
lint utility, written in Perl. We modified weblint
to report the classes of errors in a document rather than a
We examined over 2.6 million HTML documents collected by the Inktomi
crawler in November of 1995. Although Inktomi occasionally downloads
non-HTML documents, the results presented reflect only HTML documents.
(For example, we filtered out all binary files, such as images.)
Furthermore, because Inktomi implements the Robot
Exclusion Standard, the contents of automated databases
which follow the standard
(e.g., genome data sets) have also been excluded.
The distribution of the documents in the data set by domain
appears in Table 1.
Table 1: Documents Studied by Domain
|Domain||# of HTML Documents
||% of Total
Here, ``other'' includes all domains other than the given top-level
domains. For example, ``other'' contains all non-US top-level domains
(such as Germany's .de).
We analyzed a variety of properties of these documents.
In this paper, we present results on the following:
After all markup had been extracted, the size of each HTML document
was measured. For the entire data set, the mean size was 4.4KB, the
median size was 2.0KB, and the maximum size was 1.6MB.
Figure 1 presents different views of the size distribution. On
first inspection, this distribution appears to be exponential (the
magenta line represents the location of the mean). However, further
zooming indicates a curve before the distribution begins to taper off.
The final graph in Figure 1
contains a semilog plot of the same data (in which the
sizes are plotted logarithmically and the number of documents is
Figure 1: Size Distribution
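The summary statistics and the logarithmic view of the size distribution can be reproduced along the following lines. This is a sketch only. The function names are hypothetical, and the use of power-of-two buckets is an assumption made for simplicity, not the binning used for Figure 1.

```python
import statistics
from collections import Counter


def size_summary(sizes):
    """Mean, median, and maximum of a list of document sizes in bytes."""
    return {
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
        "max": max(sizes),
    }


def log2_histogram(sizes):
    """Count documents per power-of-two bucket.

    Bucket k holds sizes in the range [2**k, 2**(k+1)); empty documents
    are skipped because they have no logarithm.
    """
    return Counter(size.bit_length() - 1 for size in sizes if size > 0)
```

A mean well above the median, as reported above, is the usual sign of a long right tail, which is why a logarithmic size axis is helpful.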
These simple size distribution plots proved to be very useful in
detecting several problems with the data set. Many of the outliers
were caused by one of two major classes of errors:
- Problematic URLs: when faced with incorrect URLs that
contain valid prefixes, some HTTP servers return the file matching the
valid prefix. For example, the data set contains hundreds of
documents with URLs of the form
http://bazaar.com/underground2.html/..., all of which are
identical to http://bazaar.com/underground2.html. There
does not appear to be a general way for a client program (such as a
crawler) to differentiate this situation from a site containing a
large number of identical files.
- CGI Error Responses: some of the most popular CGI programs,
such as NCSA imagemap and CERN HTImage, report
errors with messages containing HTTP status ``200'' (success).
Because the image map programs all happen to return fixed error
messages, we were able to detect and eliminate those particular
messages, but there (again) does not appear to be any general way for
a client to distinguish ``200'' error messages from valid documents.
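Although there is no general way to separate these cases, candidate instances of the first class can at least be flagged for inspection. The sketch below groups documents by a content digest and reports URLs whose content is identical to that of a URL forming their path prefix. The function name and the choice of SHA-256 are assumptions, and a flagged URL may still be a legitimate copy.

```python
import hashlib
from collections import defaultdict


def prefix_duplicates(docs):
    """Map each URL to the longer URLs that extend it and share its content.

    docs: dict mapping URL -> document body (bytes).
    """
    by_digest = defaultdict(list)
    for url, body in docs.items():
        by_digest[hashlib.sha256(body).hexdigest()].append(url)

    suspects = defaultdict(list)
    for urls in by_digest.values():
        for url in urls:
            prefix = url.rstrip("/") + "/"
            for other in urls:
                # Identical content served under "<url>/..." is suspicious.
                if other != url and other.startswith(prefix):
                    suspects[url].append(other)
    return {base: sorted(extra) for base, extra in suspects.items()}
```

The comparison within each group is quadratic, which is acceptable for a sketch but would need an index over prefixes at the scale of this data set.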
For each document
we examined the ratio of the total number of tags
to its size.
Figure 2 contains the results.
An interesting pattern emerges - rays radiating out from the
origin, indicating a number of documents with constant tag/size
ratios. One such ray is indicated by the green ellipse.
We examined a number of these rays and determined that
they represented different versions of the same document
(occurring in archives or mirrored sites). This suggests that
the tag/size ratio might be used as a component of a signature
for an HTML document, e.g., for purposes of copy detection.
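A rough sketch of how such a ratio could serve as one signature component follows. It is not the code used in this study. Counting only start tags, measuring size as the UTF-8 byte length of the whole document, and rounding to three decimal places are all assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags in a document (end tags are ignored)."""

    def __init__(self):
        super().__init__()
        self.tags = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1


def tag_size_ratio(html: str) -> float:
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    size = len(html.encode("utf-8"))
    return counter.tags / size if size else 0.0


def group_by_ratio(docs, digits=3):
    """Return groups of URLs whose rounded tag/size ratios coincide.

    docs: dict mapping URL -> HTML text.
    """
    groups = defaultdict(list)
    for url, html in docs.items():
        groups[round(tag_size_ratio(html), digits)].append(url)
    return {ratio: urls for ratio, urls in groups.items() if len(urls) > 1}
```

Documents that share a ratio are only candidates for being versions of one another; a real copy detector would combine this value with other features.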
Figure 2: Tag/Size Ratio
We examined the distribution of tags. We obtained a list
of valid tags from the Sandia HTML Reference Manual.
The average number of total tags per document was 71.
The average number of unique tags per document was 11.
We examined the most popular tags. The top graph of Figure 3
shows the top
ten tags (ranked according to the number of documents in which the
tag appeared at least once). The bottom graph indicates the average
number of occurrences of the tag per document.
Figure 3: Ten Most-Used Tags
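The two rankings can be computed with a pair of counters, as in the sketch below. The names are hypothetical and Python's html.parser stands in for the libink scanner. One assumption deserves emphasis: the average number of occurrences is taken over all documents, not only over those in which the tag appears.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect per-document counts of start tags, in upper case."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag.upper()] += 1


def tag_counts(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.counts


def rank_tags(documents, top=10):
    """Return (tag, documents containing it, mean occurrences per document)."""
    doc_freq = Counter()      # number of documents using the tag at least once
    occurrences = Counter()   # total occurrences across all documents
    total_docs = 0
    for html in documents:
        total_docs += 1
        counts = tag_counts(html)
        doc_freq.update(counts.keys())
        occurrences.update(counts)
    return [
        (tag, docs, occurrences[tag] / total_docs)
        for tag, docs in doc_freq.most_common(top)
    ]
```

The same structure applies to attributes if the names in each attribute list are counted instead of the tag name.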
We also examined the least popular tags. Several tags
were used zero times
in our data set of over 2.6 million HTML documents.
A number of other tags appeared a very limited number of times.
We examined the
distribution of attributes.
The average number
of total attributes per document was 29. The average number of unique
attributes per document was 4.
We examined the most popular attributes. Figure 4 shows the top
ten attributes (ranked according to the number of documents
in which the attribute appeared at least once). HREF appeared
an average of 14 times per document.
Figure 4: Ten Most-Used Attributes
We also examined the least popular attributes. Several attributes
were used zero times
in our data set of 2.6 million HTML documents.
A number of other attributes appeared a very limited number of times.
Browser-specific Extension Usage
We also studied the use of browser-specific extensions. These consist
of HTML features (i.e., tags or attributes) added by vendors rather
than by the standards process. Here, we contrast the use of such
extensions in the first Inktomi data set (1.3 million documents,
collected in mid-1995) and the second Inktomi data set (2.6 million
documents, collected in November 1995).
Figure 5 shows the percentage of documents in which the four
most popular extensions are used. The usage of most of these features
has risen dramatically, indicating wide user acceptance. Other
features, such as BLINK,
have not experienced such growth.
Figure 5: Browser-Specific Extensions Usage
Figure 6 indicates the popularity of various proposals for
dynamic addition of functionality to browsers. APP
support SunSoft's Java ``applet'' language, DYNSRC
supports VRML markup, and EMBED
supports Netscape's third-party ``plug-in'' modules. All have enjoyed
significant growth, though the oldest and most popular method (Java,
first released in May 1995 [KARP95]) still has
very low usage.
Figure 6: Browser-Specific Extensions Usage
For each of the HTML documents in our data set, we extracted
the port number used to access the document. We analyzed the
distribution of port numbers. While 418 unique ports
were observed, six ports accounted for over 98% of the
documents. Table 2 presents the most popular ports.
Table 2: Port Usage
||% of Docs
Port 80, the standard HTTP port,
was used for approximately 94% of the documents. Port 70
(the standard Gopher port) was used for approximately
0.3% of the documents (this number is slightly lower than
the 1% usage of port 70 observed in our earlier data set).
We checked many of the documents being
served from port 70; all the ones we examined were in fact
HTML documents. Ports 8000, 8001, 8080, and 8888
accounted for the majority of the remaining documents.
The strong preference for ``8'' and ``80'' in the non-standard
ports is presumably related to the standard port number ``80''.
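Port extraction of this kind is straightforward with a standard URL parser. The following sketch uses Python's urllib.parse in place of the libink URL parser. The function names and the small table of default ports are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

# Ports implied when a URL names none (illustrative subset).
DEFAULT_PORTS = {"http": 80, "gopher": 70, "ftp": 21}


def port_of(url):
    """Return the explicit port of a URL, or the default for its scheme."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)


def port_shares(urls):
    """Return {port: percentage of URLs}, most common port first."""
    counts = Counter(port_of(url) for url in urls)
    total = sum(counts.values())
    return {port: 100.0 * n / total for port, n in counts.most_common()}
```

Note that an HTTP URL naming port 70 explicitly is counted under port 70 regardless of its scheme, which matches the observation that documents served from the Gopher port were in fact HTML.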
Protocols Used in Child URLs
As discussed above, we extracted child URLs from all HTML documents
in our data set.
Figure 7 presents the distribution of protocols in
this set of child URLs.
By far, the most dominant protocol observed was HTTP
(there were an average of 17 HTTP URLs per document).
Figure 7: Protocol Usage
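A simplified version of the child URL extraction and protocol tally is sketched below. It is not the libink implementation. Treating only HREF and SRC attributes as sources of child URLs is an assumption, and relative references are resolved against the document's own URL with Python's urljoin.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect absolute child URLs from HREF and SRC attributes."""

    URL_ATTRS = {"href", "src"}  # assumption: other attributes are ignored

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.child_urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.URL_ATTRS and value:
                # Relative references inherit the scheme of the base URL.
                self.child_urls.append(urljoin(self.base_url, value))


def protocol_counts(base_url, html):
    """Count child URLs of one document by protocol (URL scheme)."""
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    extractor.close()
    return Counter(urlsplit(url).scheme for url in extractor.child_urls)
```

Because relative links resolve to the scheme of the enclosing page, a collection of documents fetched over HTTP will naturally show HTTP as the dominant protocol.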
File Types Used in Child URLs
We also studied the distribution of file types described in the set of
extracted child URLs. We inferred the file type from the file name
extensions (e.g., ``.gif'') found in the URL path. In Table 3,
the ``% of Docs'' column indicates the percentage of
documents which contained a file of a given type.
The ``# of
Occurrences'' column shows the total number of extensions of the given
file type that were observed.
The ``# of
Docs'' column indicates the number of documents which contained
one or more extensions of the indicated type.
Note that files can be counted multiple
times, e.g., file.ps.Z would be counted as a file having both
``.ps'' and ``.Z'' extensions.
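The counting rules above, including the multiple counting of stacked extensions, can be sketched as follows. The function names are hypothetical. Case is preserved on the assumption that ``.Z'' and ``.z'' should remain distinct, and every dot-separated suffix of the final path component is treated as an extension.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def extensions_of(url):
    """Return every extension of the last path component, e.g. ['ps', 'Z']."""
    name = PurePosixPath(urlsplit(url).path).name
    return [suffix.lstrip(".") for suffix in PurePosixPath(name).suffixes]


def extension_table(doc_links):
    """Return {extension: (occurrences, documents containing it)}.

    doc_links: dict mapping a document URL to its list of child URLs.
    """
    occurrences = Counter()
    docs = Counter()
    for child_urls in doc_links.values():
        seen = set()
        for url in child_urls:
            exts = extensions_of(url)
            occurrences.update(exts)
            seen.update(exts)
        docs.update(seen)  # each document counts once per extension
    return {ext: (occurrences[ext], docs[ext]) for ext in occurrences}
```

Dividing the document count for an extension by the total number of documents gives the percentage column.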
Table 3: File Type and File Name Extensions
||% of Docs
||# of Occurrences
||# of Docs
||GNU zip (gz/gzip/taz/tgz)
|ARC archive (arc)
|MS Word (doc)
|Adobe Acrobat (pdf)
|TeX DVI (dvi)
|Rich Text (rtf)
|Maker Interchange (mif)
||Sun audio (au)
|MS WAVE (wav)
|Audio IFF (aif/aifc/aiff)
|MIME audio (snd)
|Amiga MOD (mod/nst)
|X bitmap (xbm)
|X pixmap (xpm)
|portable pixmap (ppm)
|portable graymap (pgm)
|portable bitmap (pbm)
|X window dump (xwd)
|portable anymap (pnm)
|MS video (avi)
Number of In-links
We sorted the child URLs which we extracted according to the number of
times they occurred in our data set. This showed us the most
``popular'' sites, as measured by the number of in-links observed.
These appear in Table 4.
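A minimal sketch of this counting is given below. It is illustrative only. In addition to the raw tally, it keeps a second tally that skips links whose source and target share a hostname; that refinement is an assumption about one reasonable way to discount a site's links to itself, and comparing hostnames rather than resolved addresses is a further simplification.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased hostname of a URL, or '' if it has none."""
    return urlsplit(url).hostname or ""


def count_inlinks(links):
    """Tally in-links per target URL.

    links: iterable of (source_url, target_url) pairs.
    Returns (all_counts, external_counts); the second ignores same-host links.
    """
    all_counts = Counter()
    external_counts = Counter()
    for source, target in links:
        all_counts[target] += 1
        if host_of(source) != host_of(target):
            external_counts[target] += 1
    return all_counts, external_counts
```

Calling most_common(10) on either counter yields a ranked list of the most-linked-to URLs.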
The in-link entries marked with (*) indicate sites that are
highly self-referential. That is, these sites (by inspection) appear
to contain a great number of links to their own top-level pages. It
would probably be instructive to count only links from outside a
The UNIX utility style was used to assess the readability
level of a subset of the HTML documents in our data set (approximately
We remove HTML markup before invoking style on each document.
We do this for two reasons. First, style does not understand
HTML, so the extra punctuation would confuse its analyzer. Second,
breaking English text into sentences and sentence fragments can be
tricky and we need to provide the style analyzer with some
assistance. For example, it is not always clear when a bulleted list
should be ignored, treated as a single long sentence, or treated as a
list of individual sentences. When invoked on troff
documents, style uses a set of heuristics to insert
punctuation into text, using the markup to assist it [CHER81]. This
information is then used by later passes of the analyzer to determine
sentence and sentence fragment breaks. We use a similar set of
heuristics to insert periods and commas into HTML documents as we
strip out markup.
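The flavor of these heuristics can be conveyed by a small sketch. It is not the code used in this study. The set of block-level tags is an assumption, and only period insertion is shown: text that ends a block-level element without terminal punctuation is given a period, so that headings and list items become separate sentences.

```python
from html.parser import HTMLParser


class TextForStyle(HTMLParser):
    """Strip markup, ending block-level elements with a period if needed."""

    # Assumption: these tags mark sentence boundaries.
    BLOCK_TAGS = {"p", "li", "dt", "dd", "title",
                  "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.pieces.append(text)

    def _end_sentence(self):
        if self.pieces and self.pieces[-1][-1] not in ".!?:;,":
            self.pieces[-1] += "."

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._end_sentence()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._end_sentence()


def strip_markup(html):
    parser = TextForStyle()
    parser.feed(html)
    parser.close()
    return " ".join(parser.pieces)
```

Treating each list item as its own sentence is only one of the options mentioned above; treating a whole list as one long sentence would instead call for inserting commas between items.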
The numbers presented in Table 5 represent the scores of the
different domains on the Kincaid readability test.
Higher numbers represent more grammatical and lexical complexity.
Lower numbers represent more simple structure and word choice.
Documents with lower numbers are considered to be more
The ``other'' domain is excluded because it represents
extraordinarily diverse sources.
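For reference, the Kincaid score is usually published as the Flesch-Kincaid grade-level formula, which combines average sentence length with average syllables per word. The sketch below implements that published formula from raw counts. Whether style computes it with exactly these coefficients, and how style counts sentences and syllables, are assumptions not verified here.

```python
def kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts.

    grade = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

For example, a passage of 100 words in 5 sentences with 150 syllables scores 0.39 * 20 + 11.8 * 1.5 - 15.59 = 9.91, that is, roughly a tenth-grade reading level.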
Table 5: Average Readability broken down by Domain
weblint was used to assess the syntactic correctness of
a subset of the HTML documents in our data set
Figure 8 presents the top ten syntax errors ranked according
to the percentage of documents in which they appear.
(Note that ``netscape-attribute'' is not necessarily an
error, but rather indicates the percentage of documents
using Netscape-specific extensions.)
Observe that over 40% of the documents
in our study contain at least one error.
Descriptions of the errors appear in Table 6.
Figure 8: Ten Most Common Syntax Errors
Table 6: List of weblint Errors
||outer tags should be <HTML> .. </HTML>
||heading-only tag (TITLE, NEXTID, LINK, BASE, META) found outside
||required tag does not immediately follow another
||unclosed elements (e.g., <H1> ... )
||empty container element
||mis-matched tag (e.g., <H1> ... </H2>)
||order of headings (e.g., <H3> following <H1>)
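To make one of these error classes concrete, the following sketch checks the order of headings. It is an independent illustration and not weblint's implementation. It assumes that the error means a heading more than one level deeper than the heading before it, and it locates headings with a simple regular expression instead of a full parser.

```python
import re

# Matches opening heading tags <H1> .. <H6>; closing tags start with "</"
# and therefore do not match.
HEADING = re.compile(r"<\s*h([1-6])\b", re.IGNORECASE)


def heading_order_errors(html):
    """Return (previous_level, level) pairs where a heading skips a level."""
    errors = []
    previous = None
    for match in HEADING.finditer(html):
        level = int(match.group(1))
        if previous is not None and level > previous + 1:
            errors.append((previous, level))
        previous = level
    return errors
```

Under this reading, an <H3> directly after an <H1> is reported, while moving back up from <H4> to <H2> is not.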
We have reported the results of our examination of pages from the
World Wide Web. Additional data not presented in the hardcopy
version of this paper may be found at
There are two maxims which are particularly apropos of our
experience. First, dealing with large data sets is difficult
and time-consuming. None of the existing tools which
we used scaled adequately to dealing with a data set on the
order of millions of documents.
Second, we observed empirically that the Web changes
Many properties of the documents in our first data set have altered
in the months since the data was collected.
The largest document in our data set was 1.6Mbytes; we checked the
current size of that same document. It has grown to 9Mbytes.
As another example, many of the most popular URLs in the
first data set no longer exist.
A longitudinal study examining trends would be extremely interesting.
Our limited observation reveals that while certain characteristics
change fairly quickly (e.g., new features are introduced), others
appear to change more slowly (e.g., average document size and reading level
did not appear to change between the time periods we observed).
One could also consider how the introduction of new tools impacts
For example, as authoring tools become more common, one could study
their impact on the number and type of syntax errors.
Structural graph analysis has many applications in this area.
In particular, analysis
of the kind practiced by sociologists in structural network
analysis [WASS94] promises insight.
However, existing social network algorithms are several orders
of magnitude more complex than is viable for a data set of this
size. Significant work would have to be done to make such
It would also be interesting to allow user-defined queries against
the data set. The simplest functionality would be to allow a
user to ascertain how a form-specified URL compared with the data
set. A more interesting and complex interface would allow the user
to define arbitrary queries on the data set.
- N. Bowers, ``Weblint Home Page (version 1.013),'' Khoral Research,
Inc., Albuquerque, NM, Jan. 1996. Available as
- L. D. Catledge and J. E. Pitkow, ``Characterizing Browsing
Strategies in the World-Wide Web,'' Proc. 3rd Int. World Wide Web
Conf., Darmstadt, Germany, Apr. 1995. Available as
- L. L. Cherry, ``Writing Tools - The STYLE and DICTION Programs,''
Computer Science Technical Report No. 91 (TM 79-1271-13), Bell
Laboratories, Murray Hill, NJ, Feb. 1981. Revised version reprinted
as L. L. Cherry and W. Vesterman, ``Writing Tools - The STYLE and
DICTION Programs,'' 4.4 BSD User's Supplementary Documents, Computer
Science Research Group, Berkeley, CA, 1994.
- E. H. Chi, ``Webspace Visualization,'' The Geometry Center,
Univ. of Minnesota, Minneapolis, MN. Available as
- CommerceNet Consortium, ``The CommerceNet/Nielsen Internet
Demographics Survey,'' Menlo Park, CA, 1995. Available as
- D. Connolly, ``A Lexical Analyzer for HTML and Basic SGML,'' W3C
Working Draft, World Wide Web Consortium, Cambridge, MA, Dec. 1995.
- R. Fielding,
``Relative Uniform Resource Locators,''
RFC 1808, June 1995. Available as
- H. Frystyk and H. W. Lie, ``Towards a Uniform Library of Common
Code: A Presentation of the World Wide Web Library,'' Proc. 2nd
Int. World Wide Web Conference, Chicago, IL, Oct. 1994. Available
- M. J. Hannah, ``HTML Reference Manual,'' Sandia National
Laboratories, Albuquerque, NM, Dec. 1995. Available as
- R. Karpinski, ``Hot Java Arrives: Sun Aims to Revolutionize the
Web,'' InteractiveAge, May 22, 1995. Available as
- Lycos, Inc., ``The Lycos 250 and Hot Lists,'' Pittsburgh, PA,
Sep. 1995. Available as
- M. L. Mauldin and J. R. R. Leavitt,
``Web Agent Related Research at the Center for Machine
Translation,'' 1994 Meeting of the ACM Special Interest Group on
Networked Information Discovery and Retrieval, McLean, VA, Aug. 1994.
Available as http://fuzine.mt.cs.cmu.edu/mlm/signidr94.html,
Carnegie Mellon Univ., Jul. 1994.
- S. Mukherjea and J. D. Foley, ``Visualizing the World-Wide Web
with the Navigational View Builder,'' Proc. 3rd Int. World Wide Web
Conf., Darmstadt, Germany, Apr. 1995. Available as
- P. Pirolli, J. Pitkow and R. Rao, ``Silk from a Sow's Ear:
Extracting Usable Structures from the Web,'' Xerox PARC, Palo Alto,
CA, Nov. 1995. Submitted for publication.
- J. E. Pitkow and K. Bharat, ``WEBVIZ: A Tool for World Wide Web
Access Log Visualization,'' Proc. 1st Int. World Wide Web
Conf., Geneva, Switzerland, May 1994. Available as
- J. E. Pitkow and M. M. Recker, ``Results From The First World-Wide
Web User Survey'', Georgia Institute of Technology, Atlanta, GA,
Jan. 1994. Available as
- J. E. Pitkow and M. M. Recker, ``Using the Web as a Survey Tool:
Results from the Second WWW User Survey,'' Proc. 3rd Int. World
Wide Web Conf., Darmstadt, Germany, Apr. 1995. Available as
- J. E. Pitkow and C. Kehoe, ``The GVU Center's 3rd WWW User
Survey,'' Georgia Institute of Technology, Atlanta, GA, Apr. 1995.
- M. Rissa and C. Oy, ``WWW User Survey Results,'' Helsinki,
Finland, Feb. 1995. Available as
- S. Wasserman and K. Faust, ``Social Network Analysis: Methods and
Applications,'' Cambridge University Press, Cambridge, UK, 1994.
- Yahoo, Inc., ``Survey Says...'' Mountain View, CA, Aug. 1995.
About the authors
Allison Woodruff is a PhD student in the Electrical Engineering
and Computer Science Department at the University of California,
Berkeley. Her research interests include spatial information
systems, multimedia databases, visual programming languages, and
user interfaces. She has worked as a geographic information
systems specialist for the California Department of Water
Woodruff holds a BA in English from California State University,
Chico and an MA in Linguistics and an MS in Computer Science from
the University of California, Davis.
Paul M. Aoki is a PhD student in the Department of Electrical
Engineering and Computer Sciences at the University of California
at Berkeley. He holds a B.S. in Electrical Engineering and an M.S.
in Computer Science from the University of California at
Berkeley. His research interests include query optimization for
parallel and distributed databases and index support for
non-traditional data types.
Eric Brewer is an Assistant Professor of Computer Science at the
University of California at Berkeley, and received his PhD in CS
from MIT in 1994. Interests include mobile and wireless computing
(the InfoPad and Daedalus projects); scalable servers (the NOW
and Inktomi projects); and application- and system-level security
(The ISAAC project and Netscape security holes). Previous work
includes multiprocessor-network software and topologies (Strata,
metabutterflies), high performance multiprocessor simulation
Paul Gauthier has served as Director and Vice President of
Research and Development of Inktomi Corporation since February
1996. Mr. Gauthier is also in the doctorate program in the
Department of Electrical Engineering and Computer Sciences at the
University of California at Berkeley, where he is working towards
a doctoral degree in computer science. Mr. Gauthier holds a
Bachelor of Science degree, with honors, in Computer Science from
Dalhousie University (located in Nova Scotia, Canada).
Professor Rowe received a BA in mathematics and a PhD in
information and computer science from the University of
California at Irvine in 1970 and 1976, respectively. Since 1976
he has been on the faculty at the University of California at
Berkeley where he is now a Professor of Electrical Engineering
and Computer Science and the founding director of the Berkeley
Multimedia Research Center.
His current research interests are multimedia applications,
systems, and databases on which he has published over fifty
papers. He is an editorial board member for the ACM Multimedia
Systems Journal. Professor Rowe heads the research group that
developed the Berkeley Distributed Video-on-Demand System,
algorithms to compute special effects on compressed images, the
Berkeley Continuous Media Toolkit, and the Berkeley MPEG1 video