Vitavonni

Thu, 19 Jul 2007

Fun with spiders

(no, not the animals, but web crawlers!)

For a pet project of mine, I've recently been spidering the web a bit myself. So far, I've processed over 100,000 websites. The machine doing the spidering is an old K6-450, so it's not particularly fast...

My spider downloads each web page's HTML, and in some cases framesets (but at most 1 level deep). It uses the text contents, image 'alt' attributes, the title and some meta tags. The text contents are tokenized and stemmed.
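To give a rough idea of what such an extraction step can look like, here is a minimal Python sketch. It is not the spider's actual code: the class and function names (TextExtractor, stem, page_tokens) are made up for illustration, the choice of meta tags ('description' and 'keywords') is an assumption, and the stemmer is a deliberately crude suffix stripper standing in for a real one such as Porter's.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text (including <title>), image alt text and some meta tags."""

    SKIP = {"script", "style"}           # contents of these tags are ignored
    META = {"description", "keywords"}   # assumed selection of meta tags

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "img" and attrs.get("alt"):
            self.parts.append(attrs["alt"])
        elif tag == "meta" and (attrs.get("name") or "").lower() in self.META:
            self.parts.append(attrs.get("content") or "")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def stem(token):
    """Crude suffix stripping; only a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def page_tokens(html):
    """Extract text from one page, lowercase it, tokenize and stem."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.parts).lower()
    return [stem(t) for t in re.findall(r"[a-z]+", text)]
```

Calling page_tokens() on a page's HTML returns a flat list of stemmed tokens. The regular expression only keeps ASCII letters, which is a simplification; a real spider would have to deal with other character sets as well.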

This results in some fun numbers:

  • The average web page uses about 194 different words.
  • The average token (after stemming!) is 6.8 characters long.
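For the curious, numbers like these can be computed along the following lines. This is only a sketch: corpus_stats is a made-up name, and it assumes that the token length is averaged over all token occurrences (not over the distinct vocabulary), which is just one possible reading.

```python
def corpus_stats(pages):
    """pages: a list of token lists, one list per web page.

    Returns (average number of distinct tokens per page,
             average token length over all token occurrences).
    """
    distinct_per_page = [len(set(tokens)) for tokens in pages]
    avg_distinct = sum(distinct_per_page) / len(pages)

    all_tokens = [t for tokens in pages for t in tokens]
    avg_length = sum(len(t) for t in all_tokens) / len(all_tokens)
    return avg_distinct, avg_length
```

For example, corpus_stats([["web", "crawl", "web"], ["spider", "crawl"]]) counts 2 distinct tokens on each page and averages the length over all 5 token occurrences.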

On average, each of the web pages I'm spidering has about 6.4 categories assigned to it. I'll be using these pages as a training set to train an AI to classify web sites.

(I've also started a web page for the project, but it's still pretty much empty so far, so it's not worth a look yet.)

[category: /en/xml | Permalink]