Methodology - Introduction
For this project to collect relevant and useful data, the methodology for gathering that data must not be open to accusations of being skewed.
One way of doing this is to show exactly what the sources of the data we are analyzing are, and where that data originates.
To that end, I believe the process started below, where participating sites post their names and a link to their data, is a good one.
If it is impossible for participating sites to post links to their statistics, then sites which are using alternative methods of data delivery should still post a link to their site.
In addition, if the participating websites can be loosely grouped into categories (these are suggestions) such as Commercial, Non-commercial, Blog, Web Forum and so on, it will help demonstrate the integrity of the data. Examiners of the final results will then be able to see that the data has been gathered from a variety of sites across the internet. This is more important than you might think: if the data can be accused of being skewed, that will undermine the purpose of the research.
Proposed Methodology
This project aims to collect a very simple statistic which is based on a very simple question.
What browsers are being used to access sites on the internet?
To ensure unbiased data, the proposal is to ask as many webmasters and administrators as possible to contribute data on the USER_AGENT of visits to their sites.
A script has been submitted which will collect the user_agent of each browser that accesses a page into which the script has been integrated. The script will then enter the user_agent and referring site into a database for analysis.
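To make the idea concrete, here is a minimal sketch of what such a collector might do. It is not the submitted script: it is written in Python against an SQLite database, and the table name `hits` and the helper names `open_store` and `record_hit` are made up for illustration. It stores only the user agent and referring site, and no IP address.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the hypothetical 'hits' table used by this sketch."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits ("
        "user_agent TEXT NOT NULL, "
        "referrer TEXT NOT NULL, "
        "seen_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn

def record_hit(conn, headers):
    """Store the user agent and referring site of one page view.

    `headers` is a plain dict of HTTP request headers. Nothing else
    (for example the visitor's IP address) is written to the database.
    """
    user_agent = headers.get("User-Agent", "").strip() or "unknown"
    referrer = headers.get("Referer", "").strip()
    conn.execute(
        "INSERT INTO hits (user_agent, referrer) VALUES (?, ?)",
        (user_agent, referrer),
    )
    conn.commit()

# Example: two page views, the second with no headers at all.
conn = open_store()
record_hit(conn, {
    "User-Agent": "Mozilla/5.0 (X11; Linux) Firefox/1.0",
    "Referer": "http://www.site.com/",
})
record_hit(conn, {})
```

A real deployment would pass a file path instead of `":memory:"` so the data survives between requests.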
This data will then be analyzed and presented in a simple percentage format which will be available under the GNU General Public License (if appropriate - please comment), with the added proviso that the organization which facilitated the data collection and analysis be named as the source of the end data.
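As a rough illustration of the "simple percentage format", the sketch below turns a list of browser names into percentage shares. The function name `browser_shares` and the sample data are invented for the example; how user agents get mapped to browser names is assumed to have happened already.

```python
from collections import Counter

def browser_shares(browser_names):
    """Return each browser's share of all hits, as a percentage to one decimal place."""
    counts = Counter(browser_names)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the result from the most to the least used browser.
    return {name: round(100.0 * n / total, 1) for name, n in counts.most_common()}

# Made-up sample: eight hits from three browsers.
sample = ["Firefox", "IE", "IE", "Opera", "IE", "Firefox", "IE", "IE"]
shares = browser_shares(sample)
for name, percent in shares.items():
    print(f"{name}: {percent}%")
```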
The sources of raw data will be credited with their contribution in an appropriate place within the project.
How to help
Simple! Please add a stats url below with the site name in a format like so:
* [http://www.site.com/stats/browsers Site Name]
We'd prefer if you give us a direct link to the page that explains browser statistics for the site, but if this is impossible we can take a normal stats page. If the site doesn't have browser stats but does have others, perhaps you could write to the owner of the site and ask kindly if this information could be made available?
Alternative how to help
This method will involve placing some PHP scripting into your pages or templates. The scripting will be invisible, probably just an include which will then report data back to the collecting database. We don't currently have a database set up for this purpose; once we do, we will activate this method of data collection.
Privacy Concerns
From discussion around and about, I think the consensus is that we don't need to record IP information. Removing this will lessen the load on the database each time it takes a hit and will (I hope) reduce the bandwidth we use. Let me know if that's not in fact the case though.
In addition, we probably won't need OS data, although it would be interesting and a valuable contribution as well - the Unix/Linux guys would love us for that. That's worth thinking about. The statistics would be just as valid and the information could be used in all sorts of ways.
Other discussion which has come up has expressed concerns about privacy, both for internet users and for websites.
I personally think we can get around this by collecting data and implementing a policy of confidentiality, which would mean we do not publish any site's contributions unless that site's statistics are already in the public domain and they have expressly given their consent for raw data to be displayed. I would assume the best way for such consent to be given is for the webmaster of the sites concerned to add a link to the page on this wiki themselves.
Please let me know your thoughts on this.
List of sites
Format: Stat link [Site name] (Avg. monthly hits) Description - sanitized stats wiki page
- 7is7.com (400,000) - Monthly stats.
- Market Share - Monthly.
- BoingBoing.net - Stats/BoingBoing.net
- Blogshares.com (352,000) - Stats/Blogshares.com
- HTMLfixIT.com - Stats/HTMLfixIT
- iNeedHosting.net - Stats/iNeedHosting.net
- OneStat.com - Stats/OneStat.com
- W3Schools - Stats/W3Schools.com
- Spy Counter - Full stats. Updated daily.
- The Webcomic List - Stats/TheWebcomicList.com
- nofrag (55,000,000) graphic site - Stats/nofrag.com
- Audi Passion (20,000,000) automobile - Stats/audipassion.com
- OzForces (25,000,000) Australia compu. game service - Stats/ozforces.com
- WebHits.de (230,000,000) Overall stats from sites using WebHits counter (predominantly European sites). Updated daily. - Stats/WebHits.de
- Wikipedia (en) (100,000,000) - Stats/en.wikipedia.org
Hard to get at statistics
Something which has come up while trying to get this project started is that there are a lot of people who are administering their websites off hosts who provide statistics only as part of a cPanel installation, or similar.
We are working on an alternative method of obtaining these statistics, and will post the various methods we find as we find them.
If you have any ideas, please post them. Your input is welcomed.
Alternative Method #1
If you rent server space from a hosting company and do not have direct access to httpd.conf, but you can view your statistics through a control panel which requires a login, I have found a way to have the data dropped straight into a directory under your web root.
You may need to change the viewing options in your FTP client in order to see the relevant directories. I had to set mine to "Show hidden files" to accomplish this.
In your FTP space, locate the directory named "awstats". Mine, for example, is located at /tmp/awstats/.
Open the awstats folder and you should see a file called awstats.yoursite.domain.conf.
If you open this file for editing (it is a simple text file) and search for "DirData", you will see the directory where the data from each awstats run is currently kept. If you like, leave that line alone and add a new line directly underneath it: DirData="your/path/to/your/new/stats/directory/"
The next time awstats runs, it will place the data into a text file within the folder you specified under your web root.
All that remains after that is to "wash" the data of any information you do not want to send to this project. I am trying to find a way of doing that now. If anyone has any ideas, please post them.
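One idea for the "washing" step, offered as a sketch only: keep just the browser section of the data file and throw everything else away. This assumes the data file marks its sections with BEGIN_NAME and END_NAME lines and that the browser section is called BROWSER; please check that against your own file before relying on it. The function name `keep_sections` and the sample lines are made up for the example.

```python
def keep_sections(lines, wanted=("BROWSER",)):
    """Return only the lines belonging to the wanted sections.

    Assumes each section starts with a 'BEGIN_<NAME> ...' line and ends
    with an 'END_<NAME>' line. Anything outside the wanted sections,
    such as per-visitor host data, is dropped.
    """
    kept = []
    current = None  # name of the wanted section we are inside, if any
    for line in lines:
        stripped = line.strip()
        if current is None:
            if stripped.startswith("BEGIN_"):
                parts = stripped[len("BEGIN_"):].split()
                name = parts[0] if parts else ""
                if name in wanted:
                    current = name
                    kept.append(stripped)
        else:
            kept.append(stripped)
            if stripped == "END_" + current:
                current = None
    return kept

# Made-up sample data: a visitor section (dropped) and a browser section (kept).
sample = [
    "BEGIN_VISITOR 1",
    "192.0.2.7 12 30 45000 20051010135536",
    "END_VISITOR",
    "BEGIN_BROWSER 2",
    "firefox 120",
    "msie 340",
    "END_BROWSER",
]
washed = keep_sections(sample)
```

Working from a list of sections to keep, rather than a list of things to remove, means that anything unexpected in the file is left out by default.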
Alternative Method #2
Rather than gather statistics as visitors browse, an alternative is to make use of the excellent Apache web logs (assuming your webserver is Apache). A script could run daily, weekly or at whatever interval suits and extract the latest data from the Apache log. By default, most Apache installations record the user agent. For an example script, already released under the GPL (but written some time ago, so it doesn't split Mozilla from Firefox), see http://community.asiaosc.org/~iwsmith/quickscripts/apache_log_scan.sh
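For comparison, here is a small Python sketch of the same approach. It is not the script linked above. It assumes the log uses Apache's "combined" format, where the user agent is the last quoted field on each line, and the browser families and matching order in `FAMILIES` are my own rough choices rather than anything agreed for this project. Firefox is checked before the generic Gecko match so the two are counted separately.

```python
import re
from collections import Counter

# In the combined log format the user agent is the last quoted field on the line.
UA_AT_END = re.compile(r'"([^"]*)"\s*$')

# (token to look for, family name). Order matters: Opera used to identify
# itself as MSIE, and Firefox and Safari user agents both mention Gecko.
FAMILIES = [
    ("Opera", "Opera"),
    ("Firefox", "Firefox"),
    ("MSIE", "IE"),
    ("Safari", "Safari"),
    ("Gecko", "Mozilla"),
]

def user_agent_of(line):
    """Return the user agent string from one log line, or None if there isn't one."""
    match = UA_AT_END.search(line)
    return match.group(1) if match else None

def family_of(user_agent):
    """Map a raw user agent string to a coarse browser family."""
    for token, family in FAMILIES:
        if token in user_agent:
            return family
    return "Other"

def count_families(lines):
    """Count hits per browser family, skipping lines that cannot be parsed."""
    counts = Counter()
    for line in lines:
        user_agent = user_agent_of(line)
        if user_agent is not None:
            counts[family_of(user_agent)] += 1
    return counts

# Made-up sample lines; a real run would iterate over the access log file instead.
sample_log = [
    '192.0.2.1 - - [10/Oct/2005:13:55:36 +0000] "GET / HTTP/1.1" 200 2326 "-" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7"',
    '192.0.2.2 - - [10/Oct/2005:13:56:01 +0000] "GET /a HTTP/1.1" 200 512 '
    '"http://www.site.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    'malformed line without quotes',
]
counts = count_families(sample_log)
```

Only the user agent is read from each line, so the IP address at the start of the log entry never leaves the server.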
Tracking history for sites that don't
Some sites don't keep a month-by-month history of their browser stats. One example of this is BoingBoing.net, which has a stats counter that restarts from scratch in the first minute of each month. So how do we keep track?
We create a subpage and link to it! Traditionally we'd just add the name of the site. Taking our example, we also add a wikilink to a subpage of Stats:
* [http://www.site.com/stats/browsers Site Name] - [[Stats/Site Name]]
In that page, just put the current percentage of the top few browsers (Firefox, IE, whatever) and sign it with four tildes (~~~~) to timestamp it. That's all. This way, once the month is over, we still have our statistics.
There are several pages on this wiki devoted to this project. They are as follows: