
It seems like Google broke something in their search. For example, searching for "iTunes-library-xml Python" (I've just written a parser that turns Apple's PropertyList pseudo-XML into a usable Python object consisting of hashes, arrays, integers etc., btw.) actually gives me results (at least the one around #4) that don't contain "iTunes" (and, I'm pretty sure, never did).
OUCH:
The cached version of that result, when searching for "iTunes library xml python", contains the notice "The word iTunes is only found on pages linking to this page". Google knows two pages linking there; neither contains the word iTunes.
This is not what I understand as an "exact phrase match".
Some people will have come across this: They've started some long-running process, e.g. some computation for their thesis, and want to be notified when it's done. Depending on the setup, they can't just background it and run wait in the shell.
Or you might want to run some expensive process somewhere, but there is some larger thing going on right now, so you want to wait for that to finish (you know, it's often better when people don't fight for memory and put all the load on the swap drive...).
Or you need to monitor some process that might crash, and want to schedule a notification or restart.
Here's how I'm doing that now:
busywaitpid command-of-process && notify-send "Computation is finished!"
This will pop up a notification bubble when the process is finished.
Here's the script:
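#!/bin/sh
# Wait until the oldest process whose command line matches "$*" has exited.
mypid=$$
# -o: oldest match only, -f: match against the full command line.
# Our own command line contains the pattern too, so filter out our pid.
pid=`pgrep -of "$*" | grep -vE "^$mypid$"`
if [ -z "$pid" ]; then
  echo "No pid found." >&2
  exit 2
fi
echo "Waiting for pid $pid."
while test -d "/proc/$pid"; do
  sleep .1
done
# Make sure it's gone; exit 0 so that "&& notify-send ..." fires.
test -d "/proc/$pid" || exit 0
exit 1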
Note that it will wait for the oldest (-o) process where the full command line (-f) matches the given parameter.
Bugs: it doesn't handle multiple processes matching the query - it will just use the oldest (as given by pgrep).
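On systems without a /proc filesystem, the same polling idea can be sketched with kill -0, which only tests whether a signal could be delivered. This is a minimal sketch and not the script above: wait_for_pid is a hypothetical helper name, it takes a pid rather than a command pattern, and it only works for processes you are allowed to signal (for other users' processes kill -0 fails right away, so the function returns at once).

```shell
#!/bin/sh
# Hypothetical helper: block until the process with the given pid is gone.
# Assumption: we are allowed to signal that process; otherwise kill -0
# fails immediately and the function returns at once.
wait_for_pid() {
  while kill -0 "$1" 2>/dev/null; do
    sleep 1
  done
}
```

It would be used like wait_for_pid 12345 && notify-send "Computation is finished!", where 12345 stands for whatever pid you looked up beforehand.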
[Update: Specto, aptitude install specto, is a GNOME GUI application that you should be able to use for this purpose (amongst others such as website change monitoring)]
I've switched back to the Debian stock kernel 2.6.23 from a self-compiled 2.6.23.9. The reason: I wanted to give the iwlwifi driver for my Intel 3945 wireless a try.
I think the 'iwlwifi' driver will be included upstream in 2.6.24; I guess the Debian people have added it to the 2.6.23 package themselves. At least I didn't come across it when configuring my 2.6.23.9, and I've seen it in the 2.6.24rc changelogs.
So far, the iwlwifi driver has been clearly superior, except that it still lacks support for the LED. There seems to be a patch around for that though. So maybe my wireless is still disconnecting from time to time and I'm just not noticing it because I don't see the LED flashing. :-)
A few days ago, I asked about how to properly embed spamtraps in web pages.
Well, no one could tell me if using display: none is appropriate. I actually do not want Google to index the contents of that div. So as long as they don't punish me for using display: none at all, it's okay. And the page I placed the spamtrap on is a doorway-like page for others anyway; it's not part of an important site.
It took the first spammer around 54 hours to send the first spam. Or try to send: all 10 retries with different zombies were rejected by my spam filter. Since then, I've been receiving another round of delivery attempts - around 5-15 per spamtrap address - almost every hour.
Of the > 500 spam delivery attempts I've seen since, none made it through my initial spam filters (not to speak of the content filter behind that); they were all rejected at the SMTP level, even before the mail content was sent.
I've now disabled some of my spam filters to allow the trap addresses to actually receive mail. After all, I want to use them to train my filter. :-)
My thesis is about data mining, clustering of correlated data in high dimensional vector spaces, to be a bit more precise.
In detail, I'm working on methods to improve upon existing clustering algorithms such as 4C (Computing Clusters of Correlation Connected Objects) and ERiC (On Exploring Complex Relationships of Correlation Clusters), where you need to pick some parameters (e.g. k for a k nearest neighbour based approach) appropriately.
My approach is twofold. On the one hand, I'm improving upon the traditional covariance-based correlation (which is quite sensitive to noise), so the parameters become easier to pick. On the other hand, I'm working on an approach to automatically fine-tune the parameters to further improve stability.
For testing my computations I needed a visualization of this data. I was considering using gnuplot (and in fact I'm using gnuplot a lot), but for some situations I needed animation capabilities, and that's where gnuplot becomes really messy.
So I decided to dive into SVG and Javascript. Here's my first SVG project:
Visualizing kNN correlation in SVG with Javascript
(Internet Exploder is not supported. I don't have Windows, and for all I know it doesn't really support SVG. Use a Gecko-based browser such as Firefox; Opera and Safari (at least on Windows) also seem to work. I didn't get it to work on KHTML/Konqueror/WebKit. I'm just doing this for myself, so I have no need to support other browsers.)
It's a 3D dataset, consisting of 300 points. 100 points are noise, 100 points are in a 2D cluster (green) and 100 points are on a 1D cluster embedded into this plane (I'm working on algorithms that support hierarchical clusters, so I needed a dataset with this property!).
There are two buttons in the UI, one toggles rotation, the other one toggles the playback of "k". It will cycle k through a range of about 3-200. When offset hits 20 (so k would be 22 or 23), the main correlation vectors - the big blue lines - already point along the 1D cluster. At an offset of around 80 they have already diverged quite a bit from the 1D cluster - at this point, the correlation is seeing the 2D plane quite well already.
I could also show you the behaviour for points in the 2D plane (but outside of the 1D cluster) and noise points.
We're preparing a paper for SSDBM 2008.
[Update: Safari works at least on Windows]
Spamassassin supposedly can precompile rules to allow faster operation. It compiles an optimized matching automaton that will process all the regular expressions in parallel.
Anyway, in order to use it, you need to manually install some dependencies. They probably can't be declared as real dependencies without forcing them upon people who don't want to use this feature. So far, I've identified
aptitude install re2c gcc libc6-dev make
I'm aware that many people will have the libc6-dev and gcc stuff already installed, and the re2c dependency is well-documented. But I actually had removed the C compiler from my mail server.
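A quick way to see which of these tools are still missing on a machine could look like the following. It is only a sketch: missing_tools is a hypothetical helper, and it checks commands in PATH only, so the libc6-dev package (which is not a command) is not covered.

```shell
#!/bin/sh
# Hypothetical helper: print every command from the argument list that is
# not found in PATH. libc6-dev is a package, not a command, so it is not
# covered by this check.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Prints nothing when all three tools are installed.
missing_tools re2c gcc make
```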
Today (December 24th, by postmark) is the deadline for joining the constitutional complaint against data retention (Vorratsdatenspeicherung).
Data retention is being sold to us as a "measure against terrorism", but criminals and terrorists can easily circumvent it. It is therefore ineffective for its "intended" purpose. However, it poses enormous dangers to our freedom, starting with the loss of protection for the media, pastoral counsellors, lawyers, journalists, clergy and other persons of trust who are indispensable to a free society. The potential for abuse of the stored data is enormous, and makes comparisons with the Stasi of the GDR seem downright flattering. Only the FBI's planned biometric mega-database reaches a comparable scale.
More information and the power-of-attorney form are available at Vorratsdatenspeicherung.de
Send it off today!
For some time now I've been working on an AI to classify web pages. I've spidered (and classified!) about half a million web pages that I'm using for training.
This comes to about 40,000 categories and 100,000 useful tokens. I need 8 bytes of storage - two float values - for each combination, which makes a matrix of about 27 Gigabytes. I'm using mmap to map my data files into memory for convenient access. On a 64 bit system, I can keep them mmapped all the time.
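The size estimate is easy to redo in the shell. The sketch below plugs in the rounded figures from the paragraph above; matrix_bytes is a hypothetical helper, and since the "about" counts are rounded, the result lands near the quoted matrix size rather than exactly on it.

```shell
#!/bin/sh
# Hypothetical helper: bytes needed for a dense categories x tokens matrix.
# Needs a shell with 64-bit arithmetic (the norm on 64-bit systems).
matrix_bytes() {
  echo $(( $1 * $2 * $3 ))
}

# Rounded figures from the text: 40,000 categories, 100,000 tokens, 8 bytes each.
bytes=$(matrix_bytes 40000 100000 8)
echo "$bytes bytes, about $(( bytes / 1024 / 1024 / 1024 )) GiB"
# prints: 32000000000 bytes, about 29 GiB
```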
This results in the following funny top output:
  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12443 erich  26  10 27.7g 309m  21m R  8.5 32.8 86:01.98  train
So according to top, the process is using 27 GB of virtual memory. That is the data files, actually. The 300 MB resident memory is the training data it's processing. At this point, all my data has been converted to 32-bit integers, so this is already condensed down a bit.
It's also obvious that ext3 supports sparse files - running "du" on my database directory currently gives 1.5 GB (it just has started filling the matrix), adding the "apparent size" switch gives 29G.
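The difference between the two du figures can be reproduced with a throwaway file. This is a sketch assuming GNU coreutils (truncate and du --apparent-size, as found on Debian) and a filesystem with sparse file support; the three helper names are made up for the illustration.

```shell
#!/bin/sh
# Hypothetical helpers illustrating sparse files; requires GNU coreutils
# (truncate, du --apparent-size), as found on Debian.

# Create a file of the given size without writing any data blocks.
make_sparse() {
  truncate -s "$2" "$1"
}

# Disk space actually allocated, in KiB.
allocated_kib() {
  du -k "$1" | cut -f1
}

# Size the file claims to have, in KiB.
apparent_kib() {
  du -k --apparent-size "$1" | cut -f1
}
```

After make_sparse /tmp/demo.bin 1G, allocated_kib should report next to nothing while apparent_kib reports the full size; the gap closes as data gets written into the file.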
I hope that when the training run has finally completed (ETA is a couple of days), the AI will actually be able to classify webpages somewhat reliably. :-) You can think of it as a huge Bayesian spam filter. Except that I've added some boosting that should improve results, made some computational improvements for speed and scaled it up to 40,000 categories instead of just "spam".
Other uses I'm planning for the AI are automatic email filtering into multiple folders (best employed on Google's Gmail, where an email can have multiple labels, not just one folder), automatic document classification for document and knowledge management systems, etc.
[Update: I haven't yet decided if or when I'll release the code as GPL. Right now, the code could use some cleanup. And I'm considering starting a business around this data mining and classification stuff, so I might want to keep it closed for some time at least. The business will most likely be around integrating the code though, so a public release might well be possible.]
I'm considering embedding some spamtraps (i.e. email addresses that will feed all their incoming email to the spam filter) into some web pages.
However, I want to prevent people from accidentally using these links or even just seeing them. So using "display: none" seems appropriate. But Google is known to punish websites for 'hiding' content from users but not from robots.
Some sites say Google will just ignore the parts that are within "display: none", others say it will punish the site altogether.
Maybe the adsense control comments will help
<!-- google_ad_section_start(weight=ignore) -->
but it wouldn't really make sense. It's meant for adsense only.
Or the page could become a bit more hackish and use JavaScript to kill the unwanted content. Any experiences with the proper way of hiding spamtrap email links like this:
<a href="mailto:aaaaaaa-never-email-this-address@domain.tld">Unwanted Emails only</a>
I've recently considered using a Google calendar for a project, and tried to embed it in a web site.
However, there are a few issues I'm having with it:
Also the "multiple calendars" feature is a bit hackish. I'd like to be able to differentiate events by flags such as "city", "outskirts", "training", "dance event", "music event". Obviously, entries might have more than one, so I'd need 6 calendars for this already. Usually, you can combine five...
Guess I'd need to do this all in Ajax by myself. It would be cool if Google Calendar had an API for embedding like Maps has. The Calendar API I've seen so far basically polls the raw data via JSON or XML. That is already great, but I do like some of the calendar layout they do, and I'd like to avoid having to replicate that myself.
Linux has already been the main OS here in the computer science department for some time. The computer pool was just switched to Ubuntu, after having been on SuSE and OpenSuSE for the last 6 years or more. With the latest hardware upgrades, our computer lab is nicer than ever: dual-core AMD64 systems with 2.6 GHz, a couple of them with two LCD screens connected (which is great for developing, having the editor on one screen and documentation on the other). I wonder if the machines spend more than 1% of their time at full CPU speed - I haven't seen any running at more than the minimum 1 GHz.
But one thing I've noticed has changed: even the people I see in the lab with their own laptops have Linux running on them.
A few years ago, most people bringing their laptops were doing so to have Windows around. Now with the nice dual-screen setups here, bringing along a laptop to have Linux is less attractive than ever. Still, more of the people bringing their laptops run Linux anyway.
I figure this means that they've noticed that the desktop environments a Linux system offers are very useful (granted, they are also closer to what they have at the university when they don't have their laptop around; however, the main part - Java and Eclipse - actually wouldn't differ). They still dual-boot (I overheard something about 'on Windows, this and that and foo') - and they even accept the effort of dual-booting to get the Linux benefits.
In a few years, will there still be Windows developers you can hire, when everybody studying computer science is being taught Linux? (Granted, not all universities have made the full switch to Linux for their computer sciences)
P.S. I'm not involved with the computer admin group; for detailed inquiries you should talk to them directly. Especially for e.g. software distribution questions and such. Oh, and I heard that the servers were running Debian for some time already, while the clients were OpenSuSE. So maybe they still are Debian.
My personal web site - which actually is not particularly interesting, IMHO - used to get a lot of traffic via Google. It was highly ranked on several quite interesting search terms. For example, it's still number 1 on the search term "Halbleiterlaser" (semiconductor laser), which is the topic of my secondary school thesis. So I still win against Wikipedia on this one. For "flowers" I sometimes end up at #1 in Google image search (causing lots of people to hot-link certain images from my site... might be with some geo targeting though?). But I've lost other battles to Wikipedia by now, such as the search term "Liebeskummer" ("lovesickness"; I have some old, old poetry online that people dig a lot; this still makes up the biggest share of my visitors).
My rankings have dropped quite a lot the last year, down about 50% within a year. I don't know if this is because I'm blogging a lot less, because I'm not updating it often enough or because Google doesn't like the redesign as much as it did like the old one. (The redesign was 2 years ago, followed by a drop in visitor numbers but a recovery to like 70% within a month)
Algorithmic changes to Google's ranking are just another possible reason; I always felt as if my placement was highly overrated anyway.
I added a Google sitemap some months ago, along with other things that people expect to help with their site being well indexed, but it didn't stop the numbers from going further down. For obvious reasons I didn't buy any links or anything; I'm not making my living with that page, it's just a pet project thingy. (So obviously, don't make me any SEO offers for my page!)
Still it's a pity: I used to have a couple of really major placements on Google - so much so that I've had some inquiries whether I'd like to sell my personal homepage (nah, I don't want to have to get a new email address!), offers by reputable and huge dating sites to put up ads on my page and surprisingly high offers for including my web page in link farms - but I'd have issues with that (I'd like to keep control of what my page contains; including content that might get Google to ban my page is scary), and it would have required me to run my page on PHP (ugh).
I wish there was an easy way to just keep the rankings I had mysteriously acquired over the last years. :-) For some time it looked like I had some magic SEO capabilities, and friends suggested I should consider actually going into the SEO business. My page is still overrated by Google, but it seems like it has lost much of its magic.
P.S. A fun follow-up story: I've just noticed that someone in a French cooking forum hotlinks to a photo of mine, claiming that he had cooked the food I did. Isn't that hilarious? Should I substitute that image with a g*atsee?
... was a lot easier than expected. Just not very well documented.
First of all, you need the appropriate utilities. Debian users can aptitude install libsmbios-bin
Next identify your system. It will look something like this
$ sudo modprobe dcdbas
$ sudo getSystemId
Libsmbios: 0.13.10
System ID: 0x01D8
Service Tag: ...REMOVED...
Express Service Code: ...although my warranty is over...
Product Name: MXC061
BIOS Version: A10
Vendor: Dell Inc.
Is Dell: 1
The information you need is the "System ID".
Now you need to get the so-called HDR file for your BIOS. This can either be extracted from their EXE file using wine (with -dump-hdr or so), or you can find it on the linux.dell.com server. This page contains a huge list, and there are tons of dirs like system_bios_ven_0x1028_dev_0x01d8_version_a10. 0x1028 apparently is "Dell". The second hex number is your System ID. The last part (A10 here) is the BIOS revision. Pick the appropriate directory. There should be a bios.hdr file in there.
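To avoid typos when hunting through that list, the directory name can be put together from the getSystemId output. This is a sketch only: bios_dir_name is a hypothetical helper, and the naming pattern is inferred from the single example above, so other systems may well deviate from it.

```shell
#!/bin/sh
# Hypothetical helper: build the directory name from a System ID and a
# BIOS revision. The pattern is inferred from the single example in the
# text; 0x1028 is the vendor id the text attributes to Dell.
bios_dir_name() {
  system_id=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  revision=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  printf 'system_bios_ven_0x1028_dev_%s_version_%s\n' "$system_id" "$revision"
}

bios_dir_name 0x01D8 A10
# prints: system_bios_ven_0x1028_dev_0x01d8_version_a10
```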
You can verify if the file is appropriate for your system:
$ sudo dellBiosUpdate -f bios.hdr -t
And do the update by calling
$ sudo modprobe dell_rbu
$ sudo dellBiosUpdate -f bios.hdr -u
When rebooting the next time, your screen might be garbled for a few seconds. At least it was for me. I was scared I might have trashed my system, but then it rebooted and had the new BIOS. So just give it some time. (Fortunately I've done enough BIOS updates to know to just wait. I've even done a 'blind' video BIOS update on an Nvidia TNT. The first update had trashed the card, but I was able to redo the flash process without seeing anything on the screen, and guess what, the card worked again!)
In case you're wondering how this works: as I understand it, the dell_rbu driver will reserve memory for the BIOS update. Being a kernel module, it can just lock the memory in place until the next reboot. It will store that address in CMOS for the BIOS and set the update flag. On reboot, the current BIOS will check that the stored image is still intact (I bet they do some checksumming here!) and then load it into the BIOS flash. That way, you don't need to boot into a low-level system such as DOS or DOS mode anymore to do an update.
My domain's DNS is still hosted with the company I registered it with. I'm planning to move it to a different company early next year. So when a friend asked me for a secondary nameserver exchange, I went ahead and set up the new DNS.
So my current setup is like this:
Primary and secondary nameserver are at my old provider, serving their copy of the DNS zone (which obviously lists their nameservers as NS)
The A and MX records point to my server, which will at some point also be the new master NS. This host has its own copy of the zone file, which agrees on the 'regular' entries, but lists my server as well as my friend's two servers as NS. This is the only link from the domain to my friend's servers. Let me emphasize this: neither my server nor my friend's servers are currently listed in the .tld database. Its NS entries still point to the old provider's servers. I'm planning to change that in January.
Now my friend told me that he had about 10 email delivery attempts to my domain in his logs, obviously coming from some spammers.
WTF? In order to link my domain to his server, you'd need to
What is the reason to jump through all those hoops? Do many admins configure a secondary NS to be an unlisted, unprotected relay for incoming email?
Is it common for secondary NS to receive random emails from spammers?
For well over a year now, the ipw3945 driver hasn't worked reliably for me on stock Debian kernels. From other blogs I gather that this only occurs on SMP systems, and even then not on all of them.
On my laptop, I can reproduce it rather reliably: connect to the WPA2-encrypted network at my parents' and transfer a file to one of the other computers on the network. An IO rate of a couple hundred kb/s will make the wireless card disconnect frequently. Often I end up having to flip the kill switch twice to get it working again.
I have however found a workaround: use a kernel with preemption enabled.
So it seems that there is something in that driver which will in certain situations trigger a big lock. Actually you can even hear that - sound will also be interrupted shortly when the wireless dies. Without preemption enabled, this will make the wireless card run into a timeout and reset itself.
I don't know yet if this completely fixes it. But other comments in the bug report at bughost suggest that it's at least much more stable. Bug 1085 seems to be another instance of this bug and has been open since June 2006. As good as their OSS record is, it seems that Intel has given up on fixing this bug.
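Whether a given kernel was built with preemption can be checked from its config file. The sketch below assumes the config of the running kernel is installed as /boot/config-$(uname -r), as Debian kernel packages do, and only looks for full preemption (CONFIG_PREEMPT=y); preempt_enabled is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel config file enables
# full preemption (CONFIG_PREEMPT=y). CONFIG_PREEMPT_VOLUNTARY=y and
# CONFIG_PREEMPT_NONE=y do not match this pattern.
preempt_enabled() {
  grep -q '^CONFIG_PREEMPT=y' "$1"
}

# Assumption: the config of the running kernel lives at this path,
# as it does for Debian kernel packages.
config="/boot/config-$(uname -r)"
if [ -r "$config" ] && preempt_enabled "$config"; then
  echo "preemption enabled"
else
  echo "preemption not enabled (or config not readable)"
fi
```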
There are other drivers causing similar problems, for example the dcdbas driver for the Dell BIOS, which is used to check the wireless kill switch, display brightness and similar functionality. When I load that driver, hal or NetworkManager will interrupt my sound every few seconds when polling the kill switch. Maybe they would just need to get hold of an affected system so they can diagnose it properly themselves. There have been at least 20 people reporting this bug by now, most of them on Dell systems.
The German government (that is, the CDU, CSU and SPD parties) passed a law that would require communication providers to keep connection data for half a year. This data retention includes phone calls, emails and internet connection data such as IP addresses.
This law was passed mostly as a "preventive" step against terrorism. Which is actually a hilariously bad argument.
This is a serious step towards total surveillance, and is considered unconstitutional by many. Among other things, it seriously undermines confidentiality protection for e.g. priests, lawyers, doctors and journalists - apart from opening the door to total surveillance, of course, and not having any desirable effect when it comes to preventing crimes, especially terrorism. Over 20,000 people have already signed a complaint of unconstitutionality to be handed in at the Federal Constitutional Court in Germany. This would then be the biggest complaint of unconstitutionality in the (short) history of the Federal Republic of Germany. (1300 people complained against the census in 1983, which led to privacy protection being recognized as a key part of the German constitution.)
Right now, the law is waiting to be signed by the German federal president (and we hope that he might refuse to sign it). Once it has taken effect, the complaint can be handed in.
Web page with further information.
If you are a German resident, please join the 20,000+ others who have already signed a mandate for the complaint of unconstitutionality. The deadline is December 24th.
It's our freedom we're fighting for!
The federal government has passed a law that requires our communication data to be stored on a massive scale. The video below explains this quite vividly: in effect, the state demands a log of who writes a letter to whom and when, which mailbox it is dropped into, and so on. In other words, who sent Aunt Erna a postcard for Christmas.
Interview at TV Berlin, well worth listening to
Note that this video is already a year old. But now the time has come: the law has been passed, and a constitutional complaint is now being filed against it.
Until December 24th you can still submit a power of attorney to support this constitutional complaint for our freedom.
For the Federal Constitutional Court it is irrelevant how many people support this complaint - but the political significance of broad support should not be underestimated. So far there appear to be more than 20,000 powers of attorney, which would already make this the "largest" constitutional complaint in the history of the Federal Republic.
There is currently still hope that Federal President Köhler considers the law unconstitutional and refuses to sign it. That would be very welcome.
According to the civil rights activists, data retention "undermines attorney-client, doctor-patient, pastoral, counselling and other professional confidentiality and encourages industrial espionage. It undermines the protection of journalistic sources and thereby damages the freedom of the press at its core". Once the law takes effect, journalists face the loss of contact with informants, counselling services such as the telephone helpline a decline in calls and emails, law enforcement the loss of anonymous tips, government critics the end of uncomplicated communication, and internet users investigations for visiting supposedly suspicious websites. Overall, sensitive contacts and communications would either become more difficult or end altogether. This would "seriously impair free communication in Germany, which does considerable damage to our free society as a whole".
But even within the CSU there are a few people who show a better understanding of the problem than blind loyalty to the parliamentary group, e.g. Dr. Peter Gauweiler:
From my professional experience as a criminal defense lawyer as well, I consider "data retention" in its planned form unsuited to achieving the hoped-for investigative success. The infringement of fundamental rights is out of all proportion to the benefit of the measure. I will not vote for an implementation of the current plans in the Bundestag debate this autumn.
So get active now and stop this surveillance law!