Vitavonni

Thu, 19 Jul 2007

Fun with spiders

(no, not the animals, but web crawlers!)

For a pet project of mine, I've recently been spidering the web a bit myself. So far, I've processed over 100,000 websites. The machine doing the spidering is an old K6-450, so it's not particularly fast...

My spider downloads each web page's HTML and, where present, some framesets (but at most one level deep). It uses the text content, image 'alt' attributes, the title and some meta tags. The text content is tokenized and stemmed.

This results in some fun numbers:

  • The average web page uses about 194 different words.
  • The average token (after stemming!) is 6.8 characters long.
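Numbers like these are easy to reproduce for a single page. The following is only a rough sketch, not my actual spider: it assumes the page has already been reduced to plain ASCII text, it does no stemming at all, and it averages over the distinct tokens (which is just one way to count). The function names are made up for this example.

```shell
# Split stdin into lowercase tokens, one per line.
# Assumption: plain ASCII text; anything that is not a letter separates tokens.
tokenize() {
  LC_ALL=C tr -cs '[:alpha:]' '\n' | LC_ALL=C tr '[:upper:]' '[:lower:]' | sed '/^$/d'
}

# Number of different words in the text on stdin.
distinct_words() {
  tokenize | sort -u | awk 'END { print NR }'
}

# Average length of the distinct tokens, printed with one decimal.
avg_token_length() {
  tokenize | sort -u |
    LC_ALL=C awk '{ total += length($0) } END { if (NR) printf "%.1f\n", total / NR }'
}

printf 'Fun with spiders: no, not the animals, but web crawlers!\n' | distinct_words
```

Without a stemmer the tokens come out longer and more numerous than they would after stemming, so don't expect this to match the numbers above exactly.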

Each of the web pages I'm spidering has about 6.4 categories assigned to it. I'll be using this training set to train an AI to classify web sites.

(I've also started a web page for the project, but it's still pretty much empty so far and not worth a look yet.)

[category: /en/xml | Permalink]

Thu, 12 Jul 2007

Shaping your traffic for improved performance

Sometimes, imposing an artificial limit on your network connection can actually improve performance.

A typical setup looks like this: your computer is connected at 54-100 Mbit/s to a router, which has a DSL line of, let's say, 2 Mbit/s downstream and 192 kbit/s upstream (and beyond that there are tons of other links).

When you overload one of these connections, performance can go down a lot.

The simplest case where this happens is an upload. The 192 kbit/s upstream is the weakest link here, and rather easy to fill up.

What happens then is that your computer starts sending data at its full link speed, until the buffers at the router are full and it starts dropping packets. TCP/IP will then adapt and, lacking 'ACK' packets, slow down sending data. It will likely still keep the send buffer of the router full.

Now if the router isn't very smart, it just keeps a first-in-first-out buffer. As long as you have just this single upload, you don't need to care. But if you have other connections, like a download, you do care.

For your download to run at full speed, the sender needs to get those ACK packets from you. They're basically a "got your data, send more" kind of message. Really tiny packets, but essential for signaling. In a full first-in-first-out buffer, they have to wait behind all the upload data.

A simple trick you can use now is to limit your bandwidth at the sending computer already. Since you only limit it to just below the speed of the weakest link, the upload won't actually get much slower. But the queue now builds up on your computer instead of in the router, and your network stack is smarter than the router's (probably). It can send out these ACK packets at a higher priority, so you end up with better combined download and upload performance.

If you know your network connection speed, just use it. Otherwise you can still measure it or guess it. I usually go for 95% of the speed I see. You might want to limit it a bit more if you are sharing the connection with others.
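To make the arithmetic concrete, here is a tiny sketch. shaped_rate is a made-up helper name (it is not part of tc); it takes a measured speed in kbit/s and an optional percentage, and prints the limit rounded down to a whole number.

```shell
# Hypothetical helper: print PERCENT (default 95) of a measured link speed.
# Both values are integers; the result is rounded down.
shaped_rate() {
  measured_kbit=$1
  percent=${2:-95}
  echo $(( measured_kbit * percent / 100 ))
}

shaped_rate 192      # 95% of a 192 kbit/s upstream
shaped_rate 2000 75  # leave a quarter of a 2 Mbit/s downstream to others
```

The printed value can be used directly as a rate with the kbit suffix in a tc command.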

Now enough theory, here's what I use:

IFACE=wifi0
# setup ingress filtering
/sbin/tc qdisc add dev $IFACE handle ffff: ingress
# local network is unrestricted
/sbin/tc filter add dev $IFACE parent ffff: protocol ip \
  prio 10 u32 match ip src 192.168.2.0/24 \
  police conform-exceed ok/ok flowid :1
# incoming internet traffic is limited.
/sbin/tc filter add dev $IFACE parent ffff: protocol ip \
  prio 50 u32 match ip src 0.0.0.0/0 police \
  rate 1500kbit  burst 10k drop flowid :2
# outgoing traffic
/sbin/tc qdisc add dev $IFACE root handle 1: htb default 30
# local network is unrestricted
/sbin/tc class add dev $IFACE parent 1: classid 1:10 htb \
  rate 54mbit burst 15k
/sbin/tc qdisc add dev $IFACE parent 1:10 handle 10: sfq perturb 10
/sbin/tc filter add dev $IFACE parent 1:0 protocol ip \
  prio 10 u32 match ip dst 192.168.2.0/24 flowid 1:10
# internet traffic is limited.
/sbin/tc class add dev $IFACE parent 1: classid 1:20 htb \
  rate 160kbit ceil 196kbit burst 15k
/sbin/tc qdisc add dev $IFACE parent 1:20 handle 20: sfq perturb 10
/sbin/tc filter add dev $IFACE parent 1:0 protocol ip \
  prio 50 u32 match ip src 0.0.0.0/0 flowid 1:20

Long example, I know. The main reason is that I'm not limiting traffic to local computers. Furthermore, I'm also limiting my incoming traffic to at most 1.5 Mbit/s of the 2 Mbit/s connection, so my housemates will be able to use the internet as well (Debian mirrors send data fast - fast enough to kick them out of any online game they might be playing...).

There is one important detail hidden in there: sfq. This stands for "stochastic fair queueing". It tells the network stack to basically send "one packet for each connection in turn". This way, the small packets needed for download or interactive SSH sessions will get out quickly even when doing a larger upload.

It works great for me - I get a good download rate, I use the 160 kbit/s upload limit completely (I don't want to completely fill up the 192 kbit/s upload link either) and I'm actually writing this blog entry remotely via SSH. The lag when writing is okay, probably around 300 ms. Without shaping, this wouldn't work nearly as well.

(And no: just using 'sfq' for your outgoing traffic is not enough, since the scheduling that matters happens at the weakest link. So you have to set the limit in your traffic shaping setup to just below the speed of the actual weakest link.)
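If you want to check what the shaping is doing, or get rid of it again, something along these lines should work. It's a sketch only: show_shaping and clear_shaping are names I made up for this example, the tc path is assumed to be the same as in the script above, and the real commands need root.

```shell
# Assumption: tc lives in /sbin as in the commands above; override TC to test.
TC="${TC:-/sbin/tc}"

# Print qdiscs and classes together with their statistics counters.
show_shaping() {
  $TC -s qdisc show dev "$1"
  $TC -s class show dev "$1"
}

# Remove the outgoing (root) and incoming (ingress) qdiscs;
# the classes and filters attached to them are removed as well.
clear_shaping() {
  $TC qdisc del dev "$1" root
  $TC qdisc del dev "$1" ingress
}
```

Call them with the interface name, for example show_shaping wifi0. Setting TC=echo beforehand prints the tc arguments instead of running them, which is handy for a dry run.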

[category: /en/linux | Permalink]

Thu, 05 Jul 2007

Back from Amsterdam Lindy Exchange

Last weekend, I was at the Amsterdam Lindy Exchange. It was both my first trip to Amsterdam (a crazy place; I think you can get high just from walking the streets because of the smoke clouds coming out of the coffee shops... and I really wouldn't have thought the prostitutes would be in their windows even on Sunday and Monday mornings...) and my first Lindy Exchange [wikipedia].

A lindy exchange is

a gathering of lindy hop dancers in one city for several days to experience the dance venues and styles of that local community, and to dance with visitors and locals alike

And it was fun. I didn't get a glimpse of 'typical' Amsterdam night life, because the evenings were already filled with dance events. We did a boat trip on Sunday, however, so I did do something typical for Amsterdam, too.

Going to a different city to meet other dancers is a great thing to do. If you're a dancer, you should definitely do that. If you're not a dancer, you should become one. :-)

My next lindy exchange will be in September: the Munich Lindy-Exchange, where I actually ended up on the organizing crew. We're really trying hard to make that a great event. We'll be having some great parties. And we'll be going to Oktoberfest (aka "Munich beer festival") with you. And yes, we'll probably end up dancing there, too.

[category: /en | Permalink]