Extension Web Statistics
From AgriLife Wiki
Accounting for Web Delivered Education
Background
It is well understood that the use of the World Wide Web for the delivery of education has continued to escalate over the past 10 years. There now exists a whole generation of people for whom the Web is almost the only way they know to find information about things that affect their daily lives. From shopping, to governmental and tax information, to full college courses, the way many people learn is through their computer terminals. Cooperative Extension, like many other educational institutions, has recognized this phenomenon from the time the World Wide Web became known and became a stable platform through which to deliver information. It has purposefully used this technology, designing and delivering education through this medium. Yet, when it comes to documenting Extension's performance, and using the Web as a means to account for its worth to various stakeholders, this medium has remained largely untapped.
There are a number of reasons why the Web has not been used for Extension's accountability, but mostly they center around the question of how to document and measure various aspects of educational delivery. Extension has long focused on measuring the number of people it reaches with educational programs, the satisfaction that those people express in the quality of these programs, and the eventual outcome the programs have on people’s lives. In all of these cases, Extension knows the faces or names of the people it reaches. County Extension agents and specialists see and talk with these people every day.
The World Wide Web offers no similar method of identifying Extension’s audience. People’s names are replaced with Internet Protocol (IP) numbers. People directing their web browsers to Extension information on the Web may stay for a minute or they may stay for an hour. A person visiting one of our websites one day under one pseudonym (IP number) may visit again the next day under a different pseudonym. A person visiting the website one minute may hand off that pseudonym to another person the next minute. The website visitor may reside anywhere in the world, or even in space. And, there are intermediaries or brokers on the Internet that serve out our information from their own servers, saving people the trouble of ever having to directly access our websites.
Another issue further distorting these numbers is the traffic from search engines and automated scripts, or bots, that spider the site, downloading every page. Some are looking to index the pages for a search engine; others are searching for email addresses to add to spam lists. These requests are often not reliably distinguishable from requests made by real people.
In other words, accounting for the delivery of education on the Web requires a different paradigm: a whole new way to think about what delivering education through this medium means, and how it can be measured. We cannot account for Web delivered education using a paradigm, or even an analogy, based on seeing people face-to-face, knowing their names, and being able to distinguish one person from another. The infrastructure and technology we know as the Internet simply does not give us that information.
Possible Measures
The question, then, becomes what can you measure that provides some indication of the access being made of Extension’s Web delivered education? And, in thinking about these measures, which can be consistently applied, are not subject to inflationary methods used by the content authors, are not affected by subjective assumptions imposed by the web analyst, and are fairly universally defined? To get some handle on these questions, it is helpful to understand what is “knowable” about web statistics and the servers that collect them.
First, there are a number of Web serving software packages that a webmaster may use. The most widely used of these are Apache from the Apache Software Foundation, and Internet Information Server from Microsoft. All of these web server packages are capable of creating an entry in a log file every time someone requests some information from that server. The industry standard for entries in this log file includes the IP number of the computer making the request of the web server, the date and time the request is made, the specific file on the server being requested, a code indicating the success of the request, and the number of bytes transferred. In some instances, the logs may also include the website from which the user launched the request, and the type/version of web browser being used by the visitor.
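To make the log fields described above concrete, here is a minimal Python sketch that splits one log entry into those fields. It assumes the server writes the widely used Common or Combined Log Format; the sample line, the `LOG_PATTERN` expression, and the `parse_line` helper are illustrative only and are not taken from any Extension server or analysis package.

```python
import re
from datetime import datetime

# Assumed layout (Common/Combined Log Format):
# host ident authuser [date] "request" status bytes "referer" "user-agent"
# The last two quoted fields are optional, matching the "in some instances" case.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The request looks like: GET /path/file.html HTTP/1.0
    parts = fields["request"].split()
    fields["path"] = parts[1] if len(parts) >= 2 else ""
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # A "-" in the bytes column means no body was sent.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields

# Hypothetical log entry using a documentation-only IP number.
sample = ('192.0.2.10 - - [10/Oct/2005:13:55:36 -0600] '
          '"GET /pubs/lawn.html HTTP/1.0" 200 2326 '
          '"http://example.org/start.html" "Mozilla/4.08"')
entry = parse_line(sample)
print(entry["host"], entry["path"], entry["status"], entry["bytes"])
```

Note that nothing in the parsed entry identifies a person: the only visitor-related fields are the IP number and, when logged, the referring site and browser string.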
Then, there are other applications which are able to analyze these log files. Packages such as Analog, AWStats, and WebTrends appear to be most frequently used by Extension. These packages evaluate the log files under different assumptions and metric definitions, although there are some basic metrics on which they are in general agreement. In every case, the information they have to work with is limited by what the web server is able to log, and all of the statistics they create depend on assumptions made about these data. Following are descriptions of how some of these data are used.
- Visitors (Unique Hosts):
- The measure of the number of visitors to a website is typically made by evaluating the number of different IP numbers which show up in the log files for a given period of time. Quite often, we look at that period of time as one month. So, if one specific IP number shows up in the log files a thousand times during a month, or just a single time, it counts as a single visitor. Unfortunately, this statistic cannot be equated in any way to the number of people visiting the website.
- People who use internet service providers may be assigned a different IP number every time they turn on their computers. This is because providers often have more subscribers than they have IP numbers available, so they depend on people not being connected 24x7, which allows them to reassign IP numbers to other computers when people turn off their own. A more extreme case of IP number reassignment occurs with AOL subscribers, where the IP number assigned to a single computer may change during the session, so that it appears to the web server to be a different computer each time a request is made.
- In some cases, attempts are made to interpret recurring appearances of a single IP number in the log files as “user sessions.” The argument is that when a single IP number shows up repeatedly, with less than a pre-defined time interval between occurrences, this represents a single user accessing multiple files on the website. The interpretation, then, is that this is analogous to a person walking into an office, or calling on the telephone, and hence represents a single person. While there is a certain comfortable feel to this idea, the assumptions behind it can lead to inconsistent, and possibly indefensible, metrics. What defines how long this pre-defined time interval should be? Is there any research that validates this time interval? What happens to the metric when a single user is assigned different IP numbers in mid-session, as with the AOL situation described above? What if one of the documents retrieved takes so long for the visitor to read that the time interval is reached, and that visitor’s next access kicks off another user session? In other words, this interpretation is trying to make something of the log data that is not really captured in the data. While this certainly might be interesting and useful information for a webmaster, it cannot be defended in an accountability model. As a result, most web analysis software packages do not even attempt to make this interpretation.
- Pages:
- When a request is made of a web server, the specific file on the server is documented in the log file. Part of that file name is its file extension which the web browser uses in order to take certain actions on it. For example, any *.htm or *.html files are known as text files that contain certain markup coding in them which dictates the way the browser will display the file. Files that are *.pdf types are used to launch the user's Acrobat browser plugin so as to display that Adobe Acrobat file. Other files such as *.txt, *.doc, *.wpd, *.exe, *.ppt, *.avi, *.wmv, and *.swf are generally known to be standalone files that are inclusive of all content that is to be presented. The web log analysis packages normally deem these types of files as “pages”. Graphic files such as *.jpg, *.gif, and *.png, however, are somewhat problematic, as they could possibly be as much a standalone file as an html file is. But more often than not, they are used within an html file. Therefore, for the most part, log analysis packages do not include these kinds of files in the page counts. Nevertheless, pages provide a good indicator of the extent of access being made by visitors to the website.
- Hits:
- An early measure of the success of a website is that of “hits” that are being imposed on a web server. In its simplest form, the number of hits on a website is the number of entries made to the log file. Every single file access from the server, regardless of the nature of the file, counts as a hit. While this can certainly provide some measure of how busy the web server is, it is very subject to author design inflation. The content author could create an html page with no graphics, and accessing that page would create one “hit.” If the author subsequently embeds into that html file the image tags for four graphic images to be displayed, that same access would create five “hits.” This is the simplest metric for a website to determine, but it is the least meaningful.
- Bandwidth (Volume):
- This is another simple measure of the work a web server must do. As stated above, one of the items recorded in the log file is the number of bytes transferred. Therefore, this metric is simply the sum of the bytes transferred over the given period of time. This metric, as well, is highly subject to how the author designs the content. Content placed in a PowerPoint file is typically much larger than that same content in a text file. Therefore, this metric is driven less by user access than by the author's design choices. Hence, it is not likely a good measure for agency accountability.
- Location:
- Some log analysis packages attempt to identify the geographic location of the visitors to the website. They use the IP numbers, and some externally available information about where in the world these IP numbers are assigned, to estimate the amount of access being made from each geographic area. Or, they might attempt to identify the number of users originating in various Internet domains (education, government, commercial, networks, etc.). For purposes of determining the originating Internet domains, or the different countries visitors are coming from, this measure works fairly well. However, any interpretation of which state within the United States a visitor resides in is highly suspect. Large commercial internet service providers like AOL, Verizon, Cox Internet, Sprint, AT&T, SBC, etc. make no geographic distinction in the IP numbers they assign to their users. All AOL users, for example, appear to be originating from Virginia. Therefore, it is virtually impossible to discern visitors from a specific state or geographic region from all other U.S. visitors to any of Extension’s websites based solely on analysis of the log files.
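The simpler metrics described above (hits, pages, unique hosts, and volume) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log entries are hypothetical and already parsed into dictionaries, the `summarize` function and `PAGE_EXTENSIONS` set are invented for this example, and the rule for what counts as a "page" follows the file types named in the Pages item rather than the exact rules of Analog, AWStats, or WebTrends.

```python
from pathlib import PurePosixPath

# Assumption: file types treated as "pages", taken from the Pages description.
# Graphic files (.jpg, .gif, .png) are deliberately left out.
PAGE_EXTENSIONS = {".htm", ".html", ".pdf", ".txt", ".doc", ".wpd",
                   ".exe", ".ppt", ".avi", ".wmv", ".swf"}

def summarize(entries):
    """Compute basic log metrics for one reporting period."""
    pages = 0
    for e in entries:
        path = e["path"].split("?", 1)[0]  # ignore any query string
        if PurePosixPath(path).suffix.lower() in PAGE_EXTENSIONS:
            pages += 1
    return {
        "hits": len(entries),                             # one per log entry
        "pages": pages,                                   # page-type files only
        "unique_hosts": len({e["host"] for e in entries}),  # distinct IP numbers
        "bytes": sum(e["bytes"] for e in entries),        # total volume
    }

# Hypothetical entries: one html page with two embedded graphics,
# then a second IP number fetching a PDF and the same html page.
entries = [
    {"host": "192.0.2.10", "path": "/index.html", "bytes": 1000},
    {"host": "192.0.2.10", "path": "/logo.gif", "bytes": 500},
    {"host": "192.0.2.10", "path": "/photo.jpg", "bytes": 700},
    {"host": "198.51.100.7", "path": "/pubs/guide.pdf", "bytes": 20000},
    {"host": "198.51.100.7", "path": "/index.html", "bytes": 1000},
]
summary = summarize(entries)
print(summary)
# {'hits': 5, 'pages': 3, 'unique_hosts': 2, 'bytes': 23200}
```

In this sample the two graphics add two hits without adding any pages, and the single PDF accounts for most of the volume, which illustrates why hits and bandwidth are sensitive to how the author designs the content.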
Conclusion
Given the nature of web statistics, Extension cannot, and should not, be trying to equate anything about web delivery to its old paradigm of seeing or communicating with people in person. Web statistics can only reflect the information that is kept in the log files of the web server. And, when dealing with various accountability agencies and auditors, we find they are typically interested in "just the facts". They want you reporting the facts in a consistent manner. And, they want those facts to be conservative, defensible, auditable, and replicable. The facts are that the log files tell you how many files the server delivered, which files it delivered, how many kilobytes were in those files, what IP numbers the server delivered the files to, and when it delivered the files. The log files contain no information about where the users of these IP numbers are geographically located. IP numbers cannot be equated to people. Multiple people can be assigned the same IP number by an internet service provider, or an IP number may represent a search engine or other automated service. The same person may come to the website from multiple IP numbers. Internet service providers that provide caching services (e.g., AOL) prevent a lot of people from ever touching our servers in the first place. Web logs are stateless, telling you only what happened in each transaction. They cannot tell you how long a "person" stayed connected to a specific website. They do not track login and logout times. And, trying to interpret logs in a way to define proxies for login and logout times is highly subjective and not defensible from an audit standpoint. Usually, the numbers in the logs are much larger than the number of eyeballs looking at a page. In other words, Extension should simply allow the web logs to speak for themselves.
If Extension then casts aside trying to equate web statistics with anything it has traditionally done (i.e., uses a totally new paradigm), and focuses just on the web statistics themselves, then it *can* use and defend measures like number of transactions (hits), number of unique IP numbers coming to the site, number of different file types that were delivered by the server, and volume (total kilobytes or megabytes) delivered. What is likely more important in the end is not the absolute value of these measures, but what is happening to them over time. So, if Extension organizations are experiencing a decline in traditional accountability measures, but an increase in web statistics measures, that can help explain to their stakeholders what is happening. They cannot, though, suggest that a 10,000 reduction in face-to-face contacts is explained by a 10,000 increase in some web statistics measure; those are apples and oranges.
