How the Web Works

Version 2/20/06
Support Material: Hackers, Hits and Chats
Keyterms: bandwidth, barriers to entry, blogs, broadband, business model, caching, client, client-side scripting, cookie file, css, digital divide, domain name service, firewall, first-mover advantage, google, header, html, internet back-bone, IP Address, ISP, keywords, meta tags, news reader, personalization, privacy, rss, search engine, search engine optimization, server, server-side scripting, standards, target audience, the last mile problem, w3c.org, web analytics, xml

How the Web Works: Technology
To have a better understanding of the internet and its structure you can review:

An Atlas of Cyberspaces
How web-servers and the internet work.
Whose Internet Is It Anyway, Wired, April 1998

Contents

Oversight
Structural Issues
Domain Name System
IP Addresses
Cookie File
CSS
XML
Server-side scripting
Client-side scripting
Caching
Firewalls
Mediated Environment
Cost of Publishing
Author Control
Growth

Oversight
The web is overseen by the W3C (w3c.org). The W3C is a standards body whose role is to ensure compatibility and agreement among industry players as standards evolve and innovation occurs. This includes not only the web we know and are familiar with, but also the web as it is pushed out through mobile devices.

Structural Issues
WWW can be thought of as comprising four main components within a particular organizational structure:

Organization
Content
Browsers
Servers
Access
Organization
Organization may not be the appropriate term for describing the overall structure of the internet, given its seemingly chaotic nature, but it is important to understand how the internet, and information flowing through the internet, is organized.
The internet essentially comprises a backbone, a fiber optic network maintained by large ISPs (including MCI, UUnet and British Telecom); smaller ISPs that connect to the larger ISPs via T1 and similar connections; and company LANs that connect directly to ISPs. Users connect to their ISPs via their own internet connections (the last mile) or via a T1 connection over a company LAN. Content, which resides on web servers, is hosted by the ISPs. This structure is illustrated and explained in howstuffworks.com.
Content
HyperText Markup Language (HTML) is the principal industry standard language used to develop WWW documents. This language was developed in 1989 by Tim Berners-Lee at the European Laboratory for Particle Physics (CERN) in Switzerland. HTML is developed and approved by the W3C and is free to use. Click on the view option of your Internet Explorer / Firefox browser and scroll down to the view source option to view the HTML code that was used to create this document. This ability to view the source document makes learning the HTML language relatively simple. The need to learn the language is diminished by the use of WYSIWYG HTML editors, but I prefer that you learn the language by following these HTML Tips. Because of the need for dynamic content, interactive content, and finer control over styling, other languages are now being used in conjunction with HTML and XML. These are noted later in the section. It is the task of W3C to make sure all these technologies remain interoperable with HTML, web browsers and other related technologies.

Browsers
"Browsers" are the client-side software programs that are used to view WWW documents. Microsoft's Internet Explorer is currently the dominant browser. Netscape was the pioneer (it lost its first-mover advantage, but its code base lives on in the Mozilla Firefox browser, which retains some market share). Both work with most web designs because HTML and its associated languages are open standards. To put the content and browser in simple context you can use the metaphor of the television. You have programs (developed content) and TV sets (browsers).
Servers
Servers host the content that is developed for WWW. Thus with the TV metaphor, servers resemble TV stations that own and store the content to be broadcast.
Access
The fourth component is the access provider. To continue our TV metaphor, the access provider is similar to your local cable company (assuming that you need a cable company to provide you access to the network and cable TV). An Internet user generally has access from two distinct sources:

Work/School, using a corporate or school account
Home, using an internet service provider (ISP)
The marketplace for commercial internet service providers is very competitive. The types of companies involved can be categorized as follows:

Major service providers such as:

AOL
Prodigy
Compuserve (now part of AOL)
MSN (bundled with Windows)

Telephone companies and cable companies such as Comcast
Local Internet Service Providers such as Erols, Panix, and Earthlink

Domain Name System
The domain name system is essential to the functioning of the web. It is currently managed by ICANN, but this is open to debate as many would rather this important role no longer reside entirely within the control of the US. Prior to ICANN's role, this was actually controlled by the US government. The domain name system applies names to the IP address system, which allows web browsers to access sites by using a name, rather than the site's IP address, which is more difficult to remember.
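The name-to-address lookup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup table, the example.* names and the addresses are hypothetical (they come from ranges reserved for documentation), and a real resolver asks DNS servers rather than consulting a local table.

```python
# Toy name-to-address table. The entries are hypothetical: the names use the
# reserved example.* domains and the addresses come from documentation ranges.
NAME_TABLE = {
    "www.example.com": "192.0.2.10",
    "www.example.org": "198.51.100.20",
}


def resolve(name, table=NAME_TABLE):
    """Return the IP address registered for a domain name, or None."""
    # Domain names are case-insensitive, so normalise before looking up.
    return table.get(name.strip().lower())


if __name__ == "__main__":
    print(resolve("WWW.Example.com"))   # the address the browser would connect to
    print(resolve("unknown.example"))   # no entry, so the lookup fails
```

In a real Python program the standard library call socket.gethostbyname(name) hands this lookup to the operating system's resolver, which in turn queries the domain name system.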

IP Addresses
The IP address is the numeric address of the machine accessing the web. This can be the host of the content (web-site) and the client (web browser). The host IP addresses are mapped to domain names for easy reference.
Hosts are able to offer their site content based on the IP address of the client. This is useful in the following two examples:
a. The site wants to offer content that is country-specific:
If a client is accessing amazon.com from an IP address that is in the UK, the client will be redirected to amazon.co.uk automatically. This ensures the client receives the correct experience for someone purchasing a product from Amazon in the UK (prices, currency, shipping etc.).
b. The site needs to exclude content based on the country of origin of the IP address. Sites that operate in China are likely to have to modify their content to avoid being banned there. China enforces this by operating a firewall that is selective about which IP addresses it allows. Thus if an IP address signals that the client is based in China, the site defaults to its .cn domain. This is how Google and other search engines are able to operate in China: Chinese residents only have access to google.cn, yahoo.cn etc.
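The two cases above come down to the same step: look up which country an IP address belongs to, then pick the site version to serve. A minimal sketch follows, assuming a hypothetical table of address blocks and hypothetical host names (real sites rely on a geolocation database, and the blocks below are documentation ranges, not real country allocations).

```python
import ipaddress

# Hypothetical mapping from address blocks to country codes.
COUNTRY_BY_NETWORK = {
    ipaddress.ip_network("192.0.2.0/24"): "US",
    ipaddress.ip_network("198.51.100.0/24"): "UK",
    ipaddress.ip_network("203.0.113.0/24"): "CN",
}

# Hypothetical mapping from country code to the site version that is served.
SITE_BY_COUNTRY = {
    "US": "www.example.com",
    "UK": "www.example.co.uk",
    "CN": "www.example.cn",
}


def country_for(ip_text):
    """Return the country code for a client IP, or None if it is unknown."""
    address = ipaddress.ip_address(ip_text)
    for network, country in COUNTRY_BY_NETWORK.items():
        if address in network:
            return country
    return None


def site_for(ip_text, default="www.example.com"):
    """Pick which site a client should be redirected to."""
    return SITE_BY_COUNTRY.get(country_for(ip_text), default)


if __name__ == "__main__":
    print(site_for("198.51.100.7"))   # a client in the assumed UK block
    print(site_for("10.0.0.1"))       # unknown block, falls back to the default
```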
The IP address is also used, in combination with the cookie file and data on the server side, for additional authentication. Since the IP address is unique to the means by which the client accesses the web (laptop and access), if the site detects that a different IP address is being used for access (while viewing the cookie file), the site may request that the user go through additional steps for verification. This is common in online banking, where security is critical.
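A rough sketch of that check, assuming the server keeps a record of the IP addresses each user has previously been verified from. The user name, the record and both function names are hypothetical; an actual bank's rules will be more involved.

```python
# Hypothetical server-side record of the IP addresses each user has
# previously logged in from. A real site would keep this in a database.
KNOWN_IPS = {"alice": {"192.0.2.10"}}


def needs_extra_verification(cookie_user, request_ip, known_ips=KNOWN_IPS):
    """Return True when the cookie names a user but the IP is unfamiliar."""
    seen_before = known_ips.get(cookie_user, set())
    return request_ip not in seen_before


def record_verified_login(cookie_user, request_ip, known_ips=KNOWN_IPS):
    """Remember an IP address once the user has passed the extra checks."""
    known_ips.setdefault(cookie_user, set()).add(request_ip)
```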
The IP address is also tracked in the web analytics of the site, so a site owner can see which IP addresses accessed the site, when, and which pages they viewed. This is how law enforcement is able to identify which computers have done nasty things (accessed child porn sites for example). This is also how Wikipedia was able to identify the source of edits for political entries such as Joseph Biden.
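To make the analytics point concrete, here is a small sketch that groups page requests by client IP address. It assumes the server writes its log in the widely used Common Log Format; the three log lines are invented for illustration.

```python
from collections import defaultdict

# Hypothetical access-log lines in Common Log Format:
# client IP, identity, user, [timestamp], "request", status, bytes.
SAMPLE_LOG = """\
192.0.2.10 - - [20/Feb/2006:10:15:01 -0500] "GET /index.html HTTP/1.0" 200 5120
198.51.100.7 - - [20/Feb/2006:10:15:09 -0500] "GET /css.html HTTP/1.0" 200 2048
192.0.2.10 - - [20/Feb/2006:10:16:30 -0500] "GET /xml.html HTTP/1.0" 200 4096
"""


def pages_by_ip(log_text):
    """Group (timestamp, page) pairs by the client IP address."""
    visits = defaultdict(list)
    for line in log_text.splitlines():
        if not line.strip():
            continue
        ip = line.split(" ", 1)[0]
        timestamp = line.split("[", 1)[1].split("]", 1)[0]
        request = line.split('"')[1]      # e.g. 'GET /index.html HTTP/1.0'
        page = request.split(" ")[1]
        visits[ip].append((timestamp, page))
    return dict(visits)


if __name__ == "__main__":
    for ip, hits in pages_by_ip(SAMPLE_LOG).items():
        print(ip, hits)
```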
Google's data includes search requests and the IP addresses of the searchers. This is data the federal government is very interested in for its search for illegal activities.

Cookie File
The cookie file, resident on the client machine, is used to help identify a browser when accessing a site. This enables personalization of content. It is also used during transactions that require multiple page interactions with a site. Without the cookie file tying the interactions with the site together, these types of transactions cannot occur. This covers basically all e-commerce transactions.
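The way a cookie ties several page requests together can be sketched with Python's standard http.cookies module. The cookie name session_id, the in-memory CARTS store and the helper names are assumptions made for this example, not how any particular site does it.

```python
from http.cookies import SimpleCookie


def make_set_cookie_header(session_id):
    """Build the header value a server sends to store a cookie on the client."""
    cookie = SimpleCookie()
    cookie["session_id"] = session_id     # "session_id" is a hypothetical name
    cookie["session_id"]["path"] = "/"
    return cookie["session_id"].OutputString()


def read_session_id(cookie_header):
    """Extract the session id from the Cookie header the browser sends back."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("session_id")
    return morsel.value if morsel is not None else None


# Hypothetical server-side shopping carts, keyed by session id.
CARTS = {}


def add_to_cart(cookie_header, item):
    """Tie a request to an existing cart using the cookie it carries."""
    session_id = read_session_id(cookie_header)
    CARTS.setdefault(session_id, []).append(item)
    return CARTS[session_id]
```

Two separate requests that carry the same session_id cookie end up in the same cart, which is what makes a multi-page checkout possible.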

Additional non-HTML languages
While the principal language of the web is HTML, it is also important to become familiar with Cascading Style Sheets (CSS), used by designers to control the presentation of content; XML, used to extend HTML tags and provide a means to export content; server-side scripting languages, used to present dynamic content more appropriate for personalization of site content to the user; and client-side scripting languages, used to present content on the client side that browsers can manipulate.

CSS and its relationship to HTML
While HTML creates the structure of the content of a web page, and includes standardized presentation (for example, <b> </b> bolds a block of text), CSS allows a web designer to customize the presentation and control that customization across multiple pages.
CSS can change the presentation of each of the standard web tags. CSS also allows for the creation of additional named styles (classes) in order to control presentation. CSS content can be included in a separate file, to which the appropriate web pages point from their header section; it can be included in the header section itself; or it can be placed directly in the body text. In the case of this page, the CSS content is included in a separate file (http://www.udel.edu/alex/css/css.css); if you view the source of this page, you will see the code in the header that points to this file. Examples of the outcome of pointing to this file include changing the colours of the hyperlinks from the HTML standard blue and purple, redefining the h3 tag (brown colour) and changing the colour of the bold tag to brown.
An example of CSS content appearing in the header can be seen in blogs on the blogspot platform. Because each blog author needs independent control over his / her CSS content, that content is included directly in the header of each 'template'.

XML's role
XML allows programmers to extend HTML and create their own variable names / tags. This gives more flexibility in describing data (we can use the example of a parts list for a large company) and in standardizing data across multiple sources so that it can be exported to other sources (RSS). The following are a couple of examples of using XML:
A manufacturing company may want to make a customized list to make it easier for them to put up a parts list or to make it easier for customers to view a parts list.
iTunes, Apple's music software, allows XML tags to be added to a file so that iTunes' users can see special content - a picture or descriptive text that users of another music program would not see.
The following illustrates how the process works with RSS and blogging: A blog creates a feed. The feed includes certain attributes / variables: (author, date of publication, title and extract). When a blog author publishes a new entry, not only is that entry published on the blog, but the appropriate elements (author, date, title and extract) are added to the rss file. News readers (bloglines for example) will crawl the rss file to see if there is new content. The news reader will understand the content described in the appropriate variables which are standardized across all blogs which export a feed. New content is made available to those that subscribe to that particular blog, in the format prescribed by the news reader.
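The publish-and-crawl cycle described above can be sketched with Python's standard XML library. This is a simplified feed, not a complete RSS implementation: the blog title, link and entries are hypothetical, and the two function names are made up for the example.

```python
import xml.etree.ElementTree as ET


def build_feed(blog_title, blog_link, entries):
    """Build a simplified RSS 2.0 document from a list of entry dictionaries."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = blog_title
    ET.SubElement(channel, "link").text = blog_link
    ET.SubElement(channel, "description").text = "Feed for " + blog_title
    for entry in entries:
        item = ET.SubElement(channel, "item")
        # The four fields the text lists: title, author, date and extract.
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "author").text = entry["author"]
        ET.SubElement(item, "pubDate").text = entry["date"]
        ET.SubElement(item, "description").text = entry["extract"]
    return ET.tostring(rss, encoding="unicode")


def new_titles(feed_xml, already_seen):
    """What a news reader does on each visit: list titles it has not seen yet."""
    root = ET.fromstring(feed_xml)
    titles = [item.findtext("title") for item in root.iter("item")]
    return [title for title in titles if title not in already_seen]
```

Because every blog's feed uses the same element names, the same new_titles function works for any feed the reader subscribes to.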

Server-side scripting languages
Server-side scripting languages (ASP and PHP, for example) are used to create dynamic content. The content is based on where the client is from (cookie file content and IP address) and the history that client has with the web page / company (content stored on the server in databases). Files ending in .asp and .php are processed on the server, and the browser receives only the HTML they produce. JavaScript, by contrast, is usually called from within HTML pages and run by the browser. Dynamic content is used, for example, to personalize presentation.
The following is a typical process that creates a dynamically produced web page. The client machine calls the page (the browser requests a web page) and sends along the cookie that the site previously stored on the client machine. The server reads the cookie, which identifies the client, and then builds the page from the information the cookie provides and the client's records in the server-side databases.
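The process can be sketched as a single function that takes the cookie sent with a request and returns the HTML for the page. Python is used here only for illustration (the text names ASP and PHP); the cookie name customer_id and the CUSTOMERS table are hypothetical stand-ins for a real site's cookie and database.

```python
from html import escape
from http.cookies import SimpleCookie

# Hypothetical server-side database of what is known about each customer.
CUSTOMERS = {
    "c42": {"name": "Pat", "last_purchase": "HTML Tips handbook"},
}


def render_home_page(cookie_header, customers=CUSTOMERS):
    """Return personalised HTML if the cookie identifies a known customer."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    morsel = cookie.get("customer_id")    # hypothetical cookie name
    record = customers.get(morsel.value) if morsel is not None else None

    if record is None:
        greeting = "Welcome, new visitor."
    else:
        greeting = "Welcome back, {}. You last bought: {}.".format(
            escape(record["name"]), escape(record["last_purchase"])
        )
    # The browser only ever sees this finished HTML, never the script itself.
    return "<html><body><p>{}</p></body></html>".format(greeting)
```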

Client-side scripting
While server side scripting is popular for presenting content that is personalized to the user, based on content in the database of the server, client-side scripting can be used to allow the client to 'interact' with the presentation of the content that is displayed by the server to the client. Javascript is a popular client-side scripting language. This is used in a number of examples, including calculators a user can use to input some numbers and get a mortgage rate quote.

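Thus browsers need to be compatible not only with current HTML standards, but also with CSS, XML and client-side scripting. Server-side scripts such as .asp and .php files run on the server, so the browser only has to handle the HTML they produce. Maintaining compatibility across browsers and content is part of the role of W3C.org.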

Caching
In order to make the web experience more efficient, caching of web content occurs on both the client and server side. ISPs use proxy caches to save bandwidth on web pages frequently accessed by their clients. Google caches copies of web pages, which it uses for indexing. You can access the most recently cached copy from the search index.
Since content is stored, via cache, on client machines and servers throughout the internet this presents a problem for web authors who want to eliminate web content from the internet. Eliminating it from the server where the content resides does not necessarily eliminate all copies of the content. Google will likely have an old copy (via cache) that is searchable.
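A small sketch shows why removing a page from its server does not remove the cached copies. It assumes a simple time-based cache; the class name, the one-hour lifetime and the example URL are hypothetical, and real proxy and search-engine caches follow more elaborate rules.

```python
import time


class PageCache:
    """A tiny cache that keeps copies of pages for a fixed number of seconds."""

    def __init__(self, fetch, max_age_seconds, clock=time.time):
        self._fetch = fetch              # function that gets a page from its origin
        self._max_age = max_age_seconds
        self._clock = clock              # injectable so the example can be tested
        self._store = {}                 # url -> (time stored, content)

    def get(self, url):
        now = self._clock()
        cached = self._store.get(url)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]             # served from cache; origin is not contacted
        content = self._fetch(url)
        self._store[url] = (now, content)
        return content


if __name__ == "__main__":
    origin = {"http://www.example.com/page": "original text"}
    cache = PageCache(lambda url: origin.get(url, "404 Not Found"), 3600)
    print(cache.get("http://www.example.com/page"))
    del origin["http://www.example.com/page"]        # the author deletes the page
    print(cache.get("http://www.example.com/page"))  # the cached copy is still served
```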

Firewalls
Firewalls are critical for ensuring privacy for those protected by the firewall, for governments that use a firewall to block content from their citizens, and for companies that do the same for their employees. Since many users browse the internet from behind a firewall, this can impact the internet experience.

Top-Down vs. Mediated Environment
The barriers to entry for WWW as a communications medium are significantly lower than those for more traditional marketing media (TV, radio, newspapers etc.). For a small investment a marketer can establish an effective web-site as the focal point of its entire marketing effort. This enables (small) companies that find other media cost-prohibitive to compete.
WWW is also freeing up the publishing market which has traditionally been the territory of the very few (those that can afford it). Now small publishing companies are evolving and individuals have the ability to use WWW to "publish" information (blogging and RSS for example).

Distributed Medium
WWW is a distributed medium. Information sources are worldwide and are hosted on any of the millions of WWW servers. Imagine trying to decide which TV station to watch if your TV allowed you to surf 20 million channels! Because the information is distributed in such a fashion, it becomes a real problem for those that try to "catalog" WWW in order to provide efficient indexing and information retrieval for browsers. And once a "catalog" has been developed, updating it becomes a real headache: not only would you have to be aware of new information from existing servers, but you would also need to develop a mechanism for accounting for new servers. Servers are becoming very simple to install, which compounds the problem further.

Cost of Publishing
Because there are significant barriers to entry in traditional media markets, the number of "publishers" is limited to a very few: those that can afford the capital outlay and ongoing expense to compete. Because this is a significant investment, the value of the material that is published must also be considered significant, at least to a particular target audience which finances the endeavour (business model: subscription, pay per view or third party advertising). This is a good "checks and balances" mechanism to make sure that, in general, what is published off-line does carry some value, to somebody.
While the low cost of entering WWW offers the real benefit of opening up the WWW market to small businesses, publishers and individuals, this also presents a real dilemma. Many WWW publishers can create a lot of information, and since the cost to publish is very low, the return on investment needed is also very low, rendering a lot of the information on WWW relevant only to the very few (or only to the author). Thus much WWW information is of no value to most of the WWW audience, but is still a viable publishing proposition from an economic (utility) standpoint.

Author Indexing Control
Another issue that complicates the quality of information available to browse is that the author of the web material can, to some extent, control the "indexing" process of a site. By using relevant keywords, and META tags that can hide irrelevant (but very popular) keywords, the author can try to manipulate when the site appears to a browser searching for information. You can see how to do this in HTML Tips: Part 4. This will be discussed in greater detail when we focus on Search Engine Optimization.
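To illustrate, the keywords an author declares in a META tag can be read with Python's standard HTML parser and compared with the text a visitor actually sees. This is only a sketch: the helper names are hypothetical, and it does not describe how any search engine actually treats META keywords.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collect the keywords an author declared in <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            for keyword in content.split(","):
                if keyword.strip():
                    self.keywords.append(keyword.strip().lower())


def declared_keywords(html_text):
    """Return the list of keywords declared in the page's META tags."""
    parser = MetaKeywordParser()
    parser.feed(html_text)
    return parser.keywords


def keywords_missing_from_text(html_text, visible_text):
    """Flag declared keywords that never appear in the page's visible text."""
    text = visible_text.lower()
    return [k for k in declared_keywords(html_text) if k not in text]
```

A keyword that is declared but never appears in the visible text is the kind of hidden, popular-but-irrelevant keyword the paragraph above describes.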

Issues Regarding the Growth Potential / Future
Access to WWW is still limited. Limited access translates into limited marketing opportunities for companies who want to use WWW as part of their marketing program. In order to understand the real value of WWW, we should spend some time considering the issues that are limiting WWW access and the prospects for growth in the future.

WWW is an elitist medium, due to cost and complexity: digital divide
Cost of access limits the size of WWW audience. One has to buy a computer and access in order to use WWW. Those that do not own a computer have to incur a significant cost to access WWW. Computer hardware (and software) has become much less expensive over the last couple of years, reducing this barrier somewhat.
Complexity is another issue. Technophobia plagues all those who are not comfortable in front of a computer. The perception that WWW is complex to those that don't access WWW hinders the growth of WWW. TV viewing is much less complex than WWW browsing, thus WWW access to the mass population will lag TV access significantly. The design of computers further compounds this issue, as the competition in this industry is focused on feature development, rather than on fundamental design usability. Once computers adopt a human-centric design paradigm, mass market acceptance will accelerate.

Speed of information: bandwidth
The infrastructure that supports the Internet and WWW is finding it difficult to deal with the current volume of traffic. As the size of WWW grows (the number of webservers and the information on them) and the size of its audience grows, the system becomes overburdened and in some cases slows to a grinding halt. The instant gratification that browsers require makes it necessary to develop an infrastructure that supports the size of the system at speed. As can be expected in situations of high growth, demand for the system is overwhelming the development of the infrastructure behind the system.
The "last-mile" connection issue also presents a bottleneck to the internet. This refers to the users' connection to the ISP, which has not been a competitive marketplace, and so advances in technology have been stifled. While many consumers still rely on a 56k modem, more and more are getting access to a cable modem or DSL. Both types of broadband access help reduce this issue.
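A quick calculation shows the size of the last-mile difference. The page size (500 kilobytes) and the broadband speed (1,500 kilobits per second) are assumptions chosen for illustration; only the 56k modem figure comes from the text, and real connections rarely reach their rated speed.

```python
def download_seconds(size_kilobytes, link_kilobits_per_second):
    """Ideal transfer time: size in bits divided by link speed in bits per second."""
    size_bits = size_kilobytes * 1000 * 8
    link_bits_per_second = link_kilobits_per_second * 1000
    return size_bits / link_bits_per_second


PAGE_KB = 500           # assumed page weight, images included
MODEM_KBPS = 56         # dial-up modem mentioned in the text
BROADBAND_KBPS = 1500   # assumed cable / DSL speed

if __name__ == "__main__":
    print(round(download_seconds(PAGE_KB, MODEM_KBPS), 1))      # about 71.4 seconds
    print(round(download_seconds(PAGE_KB, BROADBAND_KBPS), 2))  # about 2.67 seconds
```

Under these assumptions the same page takes over a minute on dial-up and under three seconds on broadband.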
With the potential unleashed by wireless access, WWW will become ubiquitous.

Security and Privacy.
Security issues have also slowed the adoption of WWW as a marketing medium. Although Security and Privacy may have little impact on the number of users who access WWW, they certainly affect what someone wants to do once they are on WWW. Electronic commerce, while evolving, is likely to become a significant form of retailing in the future. This will only happen when security issues are resolved and the perception (from a consumer's standpoint) is that WWW is secure. Similarly, as concerns about privacy are resolved, web customer adoption will increase.