
I've become a Facebook user, mostly because many of my dancing friends and lots of colleagues from university are there. Facebook is a convenient and easy way to keep some contact with all those people who aren't your core friends, but are still friends, and who are all around the world.
I've also toyed around with the Facebook API, and actually written two small applications for it. From a technical point of view, I'm actually quite impressed by Facebook.
Yes, Facebook seems to have quite some load trouble these days. Sometimes it's very slow. Sometimes it just malfunctions (with being kicked out and having to re-login being just the least). These days, messages were distorted every now and then, some app messages were always missing the last few words.
What seriously does impress me is the architecture they use with third party applications. It's designed around cacheability (so profile pages aren't slowed down by slow or broken third party applications) and they do jump through some hoops to allow application writers to do a lot of things while preventing them from disturbing other applications or the core functionality.
For all I know, Facebook is the first major thing to do CSS and JavaScript rewriting. Data produced by third party applications is fed through a very smart parser and rewriter that allows an impressively large subset of CSS and JavaScript to be used without developers having to worry about producing conflicts. In CSS, rules are prefixed with a selector to restrict them to their application's scope. In JavaScript, object references are uniquified, and the convenience functions you have for interacting with and accessing nodes (including functions to do common things such as modifying CSS class assignments) take care of all that. Access to the raw JavaScript methods is filtered, so you can't e.g. use parentNode to get access to objects outside of your scope. At least in theory.
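To make the CSS part concrete, here is a minimal sketch of selector prefixing in Python. It is not Facebook's actual rewriter: the function name scope_css, the "#app_<id>" container id and the restriction to flat rules (no @media blocks, no comments) are all assumptions made for illustration.

```python
import re

# Matches "selectors { declarations }" for flat rules only.
# Assumption: no nested blocks, no @-rules, no CSS comments.
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def scope_css(css, app_id):
    """Prefix every selector with a per-application container selector."""
    prefix = "#app_" + app_id  # hypothetical container id
    rules = []
    for selectors, body in RULE_RE.findall(css):
        scoped = ", ".join(
            prefix + " " + selector.strip()
            for selector in selectors.split(",")
            if selector.strip()
        )
        rules.append(scoped + " {" + body.strip() + "}")
    return "\n".join(rules)


print(scope_css("h1, .note { color: red; }", "42"))
```

With every rule rewritten like this, a stylesheet from one application can only match elements inside its own container, which is the effect described above.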
Much of this is for the benefit of users: applications are not allowed to do annoying animations unless the user has just interacted with them; apps also can't modify or disturb others, or read data from other applications via DOM.
Well, of course there might be one or another security issue still there; some of these things might also be related to the performance issues of Facebook recently. And of course there are bugs. Lots. A couple of things still need to be thought through properly (e.g. aggregation of feed messages with multiple targets, localization functionality for applications, finer grained control of data access). But their CSS and JavaScript rewriting is really cool.
It seems like Google broke something in their search. For example, searching for "iTunes-library-xml Python" (btw., I've just written a parser which turns Apple PropertyList pseudo-XML into a usable Python object consisting of hashes, arrays, integers etc.) actually gives me results (at least the one around #4) that don't contain "iTunes" (and, I'm pretty sure, never did).
OUCH:
The cached version of the result when searching for "iTunes library xml python" contains the notice "The word iTunes is only found on pages linking to this page". Google knows two pages linking there; neither contained the word iTunes.
This is not what I understand as an "exact phrase match".
A few days ago, I asked about how to properly embed spamtraps in web pages.
Well, no one could tell me whether using display: none is appropriate. I actually do not want Google to index the contents in that div. So as long as they don't punish me for using display: none at all, it's okay. And the page I placed the spamtrap on is a doorway-like page for others anyway; it's not part of an important site.
It took the first spammer around 54 hours to send the first spam. Or try to send: all 10 retries with different zombies were rejected by my spam filter. Since then, I've been receiving another round of delivery attempts - around 5-15 per spamtrap address - almost every hour.
Of the > 500 spam delivery attempts I've seen since, none made it through my initial spam filters (not to speak of the content filter behind that); they were all rejected at the SMTP level, even before the mail content was sent.
I've now disabled some of my spam filters to allow the trap addresses to actually receive mail. After all, I want to use them to train my filter. :-)
My thesis is about data mining, clustering of correlated data in high dimensional vector spaces, to be a bit more precise.
In detail, I'm working on methods to improve upon existing clustering algorithms such as 4C (Computing Clusters of Correlation Connected Objects) and ERiC (On Exploring Complex Relationships of Correlation Clusters), where you need to pick some parameters (e.g. k for a k nearest neighbour based approach) appropriately.
My approach is twofold. On one hand, I'm improving upon the traditional covariance based correlation (which is quite sensitive to noise), so the parameters become easier to pick, on the other hand I'm working on an approach to automatically fine-tune the parameters to further improve stability.
For testing my computations I needed a visualization of this data. I was considering using gnuplot (and in fact I'm using gnuplot a lot), but for some situations I needed animation capabilities, and that's where gnuplot becomes really messy.
So I decided to dive into SVG and Javascript. Here's my first SVG project:
Visualizing kNN correlation in SVG with Javascript
(Internet Exploder is not supported. I don't have Windows, and for all I know it doesn't really support SVG. Use a Gecko-based browser such as Firefox; Opera and Safari (at least on Windows) also seem to work. I didn't get it to work on KHTML/Konqueror/WebKit. I'm just doing this for myself, so I have no need to support other browsers.)
It's a 3D dataset, consisting of 300 points. 100 points are noise, 100 points are in a 2D cluster (green) and 100 points are on a 1D cluster embedded into this plane (I'm working on algorithms that support hierarchical clusters, so I needed a dataset with this property!).
There are two buttons in the UI, one toggles rotation, the other one toggles the playback of "k". It will cycle k through a range of about 3-200. When offset hits 20 (so k would be 22 or 23), the main correlation vectors - the big blue lines - already point along the 1D cluster. At an offset of around 80 they have already diverged quite a bit from the 1D cluster - at this point, the correlation is seeing the 2D plane quite well already.
I could also show you the behaviour for points in the 2D plane (but outside of the 1D cluster) and noise points.
We're preparing a paper for SSDBM 2008.
[Update: Safari works at least on Windows]
I'm considering embedding some spamtraps (i.e. email addresses that will feed all their incoming email to the spam filter) into some web pages.
However, I want to prevent people from accidentally using these links or even just seeing them. So using "display: none" seems appropriate. But Google is known to punish websites 'hiding' content from users but not from robots.
Some sites say Google will just ignore the parts that are within "display: none", others say it will punish the site altogether.
Maybe the AdSense control comments will help:
<!-- google_ad_section_start(weight=ignore) -->
But it wouldn't really make sense. It's meant for AdSense only.
Or the page could become a bit more hackish and use JavaScript to kill the unwanted content. Does anyone have experience with the proper way of hiding spamtrap email links like this:
<a href="mailto:aaaaaaa-never-email-this-address@domain.tld" >Unwanted Emails only</a>
I've recently considered using a Google calendar for a project, and tried to embed it in a web site.
However, there are a few issues I'm having with it:
Also the "multiple calendars" feature is a bit hackish. I'd like to be able to differentiate events by flags such as "city", "outskirts", "training", "dance event", "music event". Obviously, entries might have more than one, so I'd need 6 calendars for this already. Usually, you can combine five...
Guess I'd need to do this all in Ajax by myself. It would be cool if Google Calendar had an API for embedding like it has for maps. The Calendar API I've seen so far basically polls the raw data via JSON or XML. Which is already great, but I do like some of the calendar layouting they do, and I'd like to avoid having to replicate that myself.
(no, not the animals, but web crawlers!)
For a pet project of mine, I've recently been spidering the web a bit myself. So far, I've processed over 100,000 websites. The machine doing the spidering is an old K6-450, so it's not particularly fast...
My spider downloads each web page's HTML, and eventually some framesets (but at most 1 level deep). It uses the text contents, image 'alt' attributes, title and some meta tags. The text contents are tokenized and stemmed.
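For readers unfamiliar with the tokenize-and-stem step, here is a rough Python sketch. It is not the spider's actual code: the function names are hypothetical, and the suffix stripping is a deliberately crude stand-in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Assumption: a tiny suffix list instead of a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_counts(text):
    """Count stemmed tokens, e.g. as input features for a classifier."""
    return Counter(crude_stem(token) for token in tokenize(text))


print(term_counts("Spiders spidering pages, page"))
```

The point of stemming is that "spiders" and "spidering" end up as the same term, so the classifier sees fewer, denser features.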
This results in some fun numbers:
Each of the web pages I'm spidering has about 6.4 categories assigned to it. I'll be using this training set to train an AI to classify web sites.
(I've also started a web page for the project, but it's still pretty much empty so far, not worth looking.)
An often quoted feature of services such as OpenBC/Xing (and most of such 'pure social networking' sites) is that they basically allow you to keep an address book without having the need to update it yourself.
Some people may even argue that this is the only real benefit these social networking sites do actually offer.
There are of course services dedicated to helping you keep your address book up to date (e.g. Plaxo). These often offer plugins for Thunderbird and Outlook, so you can actually use the address book directly. Some email providers even have a function to send out "please update my address book entry on you" emails to your recipients (e.g. web.de), but most people find these quite annoying.
Now some people might argue that you could use the FOAF standard for this. But publishing your FOAF data on the web is a privacy problem. Most people won't be willing to publish much more than their email address there. Just like some people are not willing to entrust their information to services such as OpenBC.
Using e.g. HTTP authentication to restrict access to your FOAF data also doesn't work very well: you'd need some user management to be able to revoke access or change the access credentials if the passwords are leaked somehow.
OpenID would definitely be interesting, but how many of your friends have OpenID yet? And not everybody has access to deploy the server side needed for this.
The easiest to deploy approach would be to just use public key encryption. You could then upload an encrypted copy of your data for each 'friend' to any web site. You could also upload different data (including work contact information only, for example) for different recipients.
My idea is like this:
Big benefits of this approach:
Drawbacks:
"Beta" is short for "BETrAying". It means the web site isn't honest to you about the benefits for you, what they are doing with the data or what they are telling you.
Examples:
So spelled out it would be like:
Sorry, we're not telling you the truth about our service and the benefits, and that's why we're calling the service beta; once we've found out how to do it right and still earn some money with it, we'll remove the beta sign.
Don't use the browser name for capability detection.
I'd like to emphasize that. For example Google Groups won't let me upload an image to my profile, because I'm not running Firefox or Internet Exploder.
Well, I tried both Epiphany (which has a very clean and fast UI, unlike Firefox which is totally cluttered) and Iceweasel (which is Firefox, but with the trademark replaced). Both are using Gecko, Gecko/20070324 for Epiphany and Gecko/20070310 for Iceweasel. So I'm very sure they have the same capabilities as Firefox 2 when it comes to web sites.
If you want to test for Firefox's capabilities, use the Gecko version number. Thank you.
[Update: it was suggested I point people to GeckoIsGecko.org, which has links on how to properly detect the Gecko engine, instead of relying on the browser name.]
[Update: Mike Hommey pointed out that the 'Gecko/date' string is mostly meaningless, and largely is the build date; it doesn't contain tree information. Instead you should be using the "rv: 1.8.0.11" part of the User-Agent. This is also what the getGeckoRv function on the howto linked from GeckoIsGecko.org does. Oh, and Firefox 2 is not gecko 2, but IIRC uses gecko 1.8.x, just like my Epiphany.]
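As a rough server-side illustration of checking the "rv:" part instead of the browser name, here is a Python sketch. It is not the getGeckoRv function from the linked howto; the function name gecko_rv and the two sample User-Agent strings are made up for this example.

```python
import re

# Accepts both "rv:1.8.1.3" and "rv: 1.8.0.11".
RV_RE = re.compile(r"rv:\s*(\d+(?:\.\d+)*)")


def gecko_rv(user_agent):
    """Return the Gecko revision as a tuple of ints, or None if not Gecko.

    WebKit browsers say "like Gecko" but never send a "Gecko/<date>" token,
    so looking for "Gecko/" keeps them out.
    """
    if "Gecko/" not in user_agent:
        return None
    match = RV_RE.search(user_agent)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# Illustrative User-Agent strings (assumed, not captured from real browsers).
EPIPHANY_UA = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.3) "
               "Gecko/20070324 Epiphany/2.14")
WEBKIT_UA = ("Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) "
             "AppleWebKit/419 (KHTML, like Gecko) Safari/419.3")

print(gecko_rv(EPIPHANY_UA))  # compare with e.g. >= (1, 8)
print(gecko_rv(WEBKIT_UA))
```

Comparing tuples such as gecko_rv(ua) >= (1, 8) then works for Firefox, Epiphany and Iceweasel alike, without ever looking at the browser name.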
(Title stolen from Holger Levsen)
Whenever I see this image [mozilla open Standards], I want to make a spoof of it titled:
Every time you make an Ajax app, god kills a firefox.
But I would certainly be violating Mozilla trademarks by doing so (their artwork, logos and trademarks such as "firefox" are not OpenSource).
Anyway: AJAX is a hackaround, in particular it is not an open standard. Please use it only where it's really needed. Granted, there is worse (e.g. Flash; prepare for incompatibility hell now that the first opensource plugin can play back youtube videos - as you might be aware, many Linux distributions can not ship Adobe Flash, so they'll likely start shipping this plugin as soon as it's working sufficiently well; or ActiveX which only works with MSIE...), but that's not really a good excuse for this abuse of Javascript that is called AJAX.
On a side note, since I already mentioned flash - I can really recommend the Flashblock mozilla extension. A must have: you can view any flash if you need to by just clicking on it, but they won't be loaded automatically anymore. So you can easily access youtube (just one extra click!), but won't be bothered by flash ads and such stuff.
Oh, and Adobe. They're probably the biggest blocker for a widespread Linux adoption judging by this article [computerworld.com], which is already very positive on Linux ("Unlike many of the applications included on new Windows systems, these don't seem to come with annoying self-launching advertisements, such as the irony-challenged Trend Micro Anti-Spyware pop-up upgrade pleas that plagued my HP system at home."): maybe his biggest issue is that he couldn't just run his Adobe Photoshop Elements on Linux.
Of course there are applications trying to offer the same functionality, starting with Gimp, digiKam and Krita (and I'm not sure he tried Krita and digiKam as well; they are probably more similar to Adobe's product), but I can understand his wish to be able to continue using the same applications.
(My personal recommendation: start using Opensource applications on Windows, e.g. Firefox, Thunderbird, Inkscape (great vector graphics program, and it's using open standards: SVG), Gaim (multi-protocol instant messenger) and tons of others. They're free, so even if you don't use them every day, you didn't waste money on them... and if you happen to like them: you can be sure that they'll be working the same if you do the switch to Linux at some point in the future. Be prepared for when Microsoft says your PC is too old.)
Rumor: the B in "blog" actually means beta.
Original text of this post:
You know you've been exposed to too many Web 2.0 applications when you start to think the B in the blogger.com logo is for "beta". I was like WTF, are they making "beta" to be all of their logo now?
And yes, I know about it being a shortened version of "weblog", likely coming from the pun "we blog".
Brain Handles did a test to see if Google indexes Javascript-generated content (for simple scripts).
It doesn't. So if you are building a heavily Ajax-based web page, you (still) risk preventing Google from indexing your contents.
(Note that for two prime Ajax examples this doesn't matter: GMail and Google Maps. One doesn't have any public content anyway, the other no text.)
Please use Ajax only sparingly. It's a dirty workaround for shortcomings in interaction capabilities of HTML and web browsers, not the ultimate solution to all our web problems.
Two consequences:
Romain Francoise mentions Yahoo Pipes.
Well, I played with Yahoo pipes like one or two weeks ago; and while I was impressed with their Visio-Like UI, I was lacking pretty much all functionality I wanted to try...
My goal was simple: run a query on Google Blog Search (which will have the result available in RSS), and then grab all URLs out of that stream.
But I didn't find any 'filter' in Yahoo Pipes which allowed me to extract the URLs (or any part of the text, actually) from the blog entries. I don't want to remove whole result entries, but I just want to extract certain text chunks from their body... (there might be multiple, so the regexp module isn't an option either).
I could do that with Python in a few lines, actually.
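Something along these lines, as a sketch: it assumes the search result is a plain RSS 2.0 feed whose entry bodies live in the description element, and it uses an inline sample instead of fetching anything. The function name and the sample feed are made up.

```python
import re
import xml.etree.ElementTree as ET

# Deliberately simple URL pattern; stops at whitespace, quotes and angle brackets.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')


def urls_in_feed(rss_xml):
    """Collect every URL mentioned in the body of each RSS item."""
    root = ET.fromstring(rss_xml)
    urls = []
    for item in root.iter("item"):
        body = item.findtext("description") or ""
        urls.extend(URL_RE.findall(body))  # there may be several per entry
    return urls


# Made-up sample feed with two URLs in one entry body.
SAMPLE = """<rss version="2.0"><channel>
<item><title>a</title>
<description>See http://example.org/a and &lt;a href="http://example.com/b"&gt;b&lt;/a&gt;</description>
</item>
</channel></rss>"""

print(urls_in_feed(SAMPLE))
```

For a real feed you would read the XML from the search URL first; that part is left out here on purpose.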
Fortunately, I rarely use my Google GMail account.
Because for about a month or so, I haven't been able to write replies in my preferred browser anymore. I can only guess it's due to some broken browser detection done by Google - my preferred browser is Epiphany, which uses XulRunner. So it's the same engine as my Firefox, and I can write replies with Firefox (but the Firefox UI is not as nice as Epiphany's, and it uses more memory).
(Well, almost. Epiphany is Gecko/20070209, whereas Firefox is Gecko/20070208. So either some change in this one day breaks GMail, or Google broke it themselves with some stupid browser detection; many people still think it's sufficient to check for 'Firefox' to detect all non-IE users. Please use the engine ID, that is Gecko.)
Anyway: if I were a heavy Google Mail user, that would be a disaster for me. Broken for a month now and counting!
Fortunately, I don't rely on that friggin' Ajax stuff; I can either use the standard HTML version of GMail or use Firefox. I'd just like to emphasize that Ajax apps break much more easily, and your users might be unhappy about that. Ajax is far from perfect; it's an ugly hack.
Don't overuse it.
[Update: I fixed my GMail issues by switching the language to German and back to English. Weird, huh?]
From Dojo Toolkit (a very powerful AJAX toolkit), but should probably go to The Daily WTF...
_getAdjustedDay: function(/*Date*/dateObj){
    //summary: used to adjust date.getDay() values to the new values based on the current first day of the week value
    var days = [0,1,2,3,4,5,6];
    if(this.weekStartsOn>0){
        for(var i=0;i<this.weekStartsOn;i++){
            days.unshift(days.pop());
        }
    }
    return days[dateObj.getDay()]; // Number: 0..6 where 0=Sunday
}
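That code is inefficient and stupid on so many levels. For example the if statement... you might be aware that 0 < 0 is false, so the loop handles that case on its own. Yep. I'd prefer something along the lines of
return (dateObj.getDay() - this.weekStartsOn + 7) % 7;
(the + 7 is needed because JavaScript's % keeps the sign of the left operand, so Sunday with a Monday week start would otherwise come out as -1). No arrays were abused during the making of this function.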
I always thought PHP programmers were the worst, but apparently some JavaScript "coderz" are up to par.
... suck.
The XML Schema "dateTime" format can't be handled by Java's SimpleDateFormat (ok, that probably is more Java's fault for not being able to handle hours:minutes in the timezone specification; anyway, this is mildly annoying).
The XSLT2 parsers are very intolerant about the format of the time specification, too. They could have made more stuff optional such as the specification of seconds; an error here will make the whole XSLT fail with an exception. Some compact form of error handling would be nice... as would be a smart parser which can handle various formats.
The XML Schema "duration" is even worse. First of all, it was completely forgotten when doing XSLT 2; there are no functions to format or disassemble it (except by regular expressions, which could also use a zero-width lookahead).
Secondly, it's lacking common specifications such as "next week". While "next week" is computationally equivalent to "in 7 days", it can have different semantics in some contexts (especially when not being aligned):
If I'm looking for the week February 9th 2007 is in, the result is February 5th to February 11th. If I'm looking at a 7 day interval containing this day, there are infinite possibilities (aligned on milliseconds and below...). So it does make sense to make a difference between 7 days and a week.
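The difference is easy to see in a few lines of Python. This is only a sketch with hypothetical function names; it assumes weeks start on Monday and, to keep the list finite, only considers 7-day windows aligned on whole days.

```python
from datetime import date, timedelta


def calendar_week(day):
    """Monday-to-Sunday week containing `day` (assumes weeks start on Monday)."""
    monday = day - timedelta(days=day.weekday())
    return monday, monday + timedelta(days=6)


def seven_day_windows(day):
    """All day-aligned 7-day windows that contain `day`."""
    return [
        (day - timedelta(days=offset), day + timedelta(days=6 - offset))
        for offset in range(7)
    ]


day = date(2007, 2, 9)
print(calendar_week(day))           # exactly one answer
print(len(seven_day_windows(day)))  # several answers, even at day granularity
```

The calendar week is just one of the possible windows; which one you get depends on alignment, and that is precisely what a plain "7 days" duration cannot express.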
... is the working title of my diploma thesis.
Tomorrow I'll hold the introductory presentation on my topic in the research seminar. This is to explain my project to other members of the working group, so they can give me feedback and suggestions.
What I'll be doing:
Where I am:
Difficulties:
Fortunately, we have a rather strong research group here in Munich. There are experts for query and transformation languages (e.g. Xcerpt, SPEX - Querying XML Streams), temporal calculations (Computational Treatment of Temporal Notions, CTTN) and time modeling (Calendar and Temporal Type System, CaTTS); and I'm also working closely together with the author of IkeWiki and Xcerpt. And of course they're all nice people to work with!
So Google has finally changed their algorithm somehow to remove Googlebombs. I wonder which approach they chose. Maybe they require the search term to actually occur on the destination page?
Anyway, I wonder if we could now do the opposite - rank down pages by Googlebombing them. I wonder if we could e.g. all set up links to Windows (microsoft.com) and maybe even mention the word "Googlebomb", so that Google will think we're trying to Googlebomb Microsoft this way? So at the end maybe the Wikipedia article becomes hit #1? That would be cool.
P.S. Ever noticed that Google blocks lynx and wget by their user-agent string? (well, I get error 400 with lynx, which might be different, but I can use the --user-agent option of wget to bypass their filter, so at least that filter exists and is kind of pointless...)
Gunnar doesn't like the idea of recommending XML for configuration files. But why do you need to be able to edit an XML file with a non-XML-aware editor if you don't like the raw syntax?
If you don't like the raw syntax, use an editor that gives you a different representation. Or use some transformation. Write a tool that converts YAML to XML and back, if you like YAML better. (Btw, this is another reason to use a common library for configuration file handling - let people choose their configuration file formats!)
Writing XML in the raw with a good schema-aware editor with syntax highlighting is actually quite nice. Have you ever edited an XML file with eclipse? You really should do that... I once opened my Openbox (a rather minimalistic window manager) configuration file in eclipse. Guess what, it was giving me useful syntax completion! It had loaded and used the referenced schema file.

It's not as if I think XML is the ultimate thing (nor is Eclipse an editor I'd use for configuration files; startup takes years and it frequently crashes for me. vim also has some XML support...). IMHO there is a lot in XML that should be stripped (such as attributes); I like JSON syntax better, except it's in turn lacking essential information such as character encoding, namespaces and schema information. I also don't think JSON allows comments. But when handling information from multiple sources (and multiple schemas), XML is really useful. It removes most of the guessing needed for handling other formats.
And that is precisely what I'm advocating: use standardized formats. Consider for example the apache configuration. Do you know of any tool that can parse the apache configuration files other than apache? Some parts look like SGML/XML, but they don't have much more in common than using < and >. When you are in need of automating something with apache, you'll be annoyed by this. If apache used something you have a reliable parser for - that would be nice.
Have a look at the xchat.conf configuration file. It uses "key = value", but with extra spaces around the equals sign and without quoting, which means the file can't be loaded by many parsers, e.g. bash. Now let's look at buttons.conf - completely different syntax, "KEY value" blocks, separated by empty lines...
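To illustrate the point: sourcing a line like "key = value" in a shell tries to run a command called "key", so every consumer ends up writing a little ad-hoc parser like the Python sketch below. The function name and the sample keys are only illustrative, and the exact rules of the real xchat.conf format (comments, escaping) are not covered.

```python
def parse_key_value(text):
    """Parse unquoted "key = value" lines into a dict.

    Assumption: one setting per line, split at the first "=",
    surrounding whitespace ignored, no comment or escape syntax.
    """
    settings = {}
    for line in text.splitlines():
        key, separator, value = line.partition("=")
        if not separator or not key.strip():
            continue  # blank or malformed line
        settings[key.strip()] = value.strip()
    return settings


# Illustrative sample in the "key = value" style described above.
SAMPLE = """irc_nick1 = alice
text_font = Monospace 9
"""

print(parse_key_value(SAMPLE))
```

It's only a dozen lines, but it has to be rewritten (slightly differently, with slightly different bugs) for every such home-grown format.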
Btw, note that configuration handling with XML to me means also keeping comments somehow... most applications will nuke any comments in their configuration files; which is funny since most configuration syntaxes do have a notion of comments, but did you ever come across an application using sh-style configuration (i.e. that you could source in bash/dash/zsh), that keeps comments?
P.S. The YAML homepage is not YAML. It's valid XHTML. Only if you strip out all the tags and attributes and use only the text content within the /html/body/pre tag, then you have something which probably is YAML. This compatibility with HTML is probably why XML was at all successful.
I wonder if we should do some RDF representation of packages - especially including dependency data.
That way we can maybe use some RDF reasoners to query our data, and maybe extract some interesting information. On the other hand, people with an interest in RDF could use our real-world data for their experiments, and maybe we get something back from them.
There are a couple of pieces of package metadata that we currently don't track inside the actual archive, but in different places, including licensing information (debian/copyright), homepage location (on packages.qa.debian.org) and download location (debian/watch). It would be nice to aggregate these into some RDF store, and export them somehow.
For most of the package information (especially dependency information), we'll have to write our own ontology (I wonder if we can map version numbers to some standard rule language, or if applications will need an external reasoner to process them?); for some things we can reuse the FOAF (Friend-of-a-friend) or DOAP (Description of a project) ontologies. The first is rather common for describing people, people-people and people-thing relationships; the latter was designed for describing opensource projects (but won't be directly applicable to packages of a project).
I've blogged about my RDF export of Debtags data before; the canonical first step would actually have been to export the package data, and enrich it with the Debtags collected data...
Note that RDF is designed in a way that you can have one site provide metadata for another site. For example, the Debtags RDF export contains "category" information for Debian packages, but does not contain e.g. the description of the packages it refers to via a URI. So there is nothing wrong from an RDF point of view with keeping e.g. the licensing, watch or homepage data separate.
For the Google Summer of Code, there was a proposal including a "collaborative repository of meta-informations about source packages (CRMI)"; but the first part of the proposal, the "distribution wide tracker tool (DWTT)", turned out to be a bigger task than expected.
But maybe we'll still see CRMI at some point, and maybe we can have it provide an RDF export of data (using a semantic wiki might be a good starting point for CRMI maybe?).
[P.S. this blog posting maybe belongs more into the en/linux/debian category. But only the xml category is also syndicated on planet.XMLhack, and I want this post to go there to reach more RDF users. I really need to switch my blog to some software which supports tagging...]
I had trouble with tomcat. It started fine, but shutdown took a very long time (enough for eclipse to suggest killing it, for example).
Some people suggested I try the tarball from Apache, instead of using the Debian package, but that didn't help either.
By not running tomcat from eclipse and waiting a long time for the shutdown, I finally managed to get some actual error messages:
This was usually the last message I was seeing:
INFO: Pausing Coyote HTTP/1.1 on http-8080
After some timeout, I now got these messages:
Protocol handler pause failed java.net.NoRouteToHostException: No route to host
No route to localhost? WTF?
I resolved it by putting all hostnames it might try (localhost.localdomain, although I use that nowhere, as well as the FQDN I'm using) into my /etc/hosts and forcing them to point to 127.0.0.1 - and voila, tomcat actually shuts down instead of pausing and then failing to notice it has successfully paused...
I'm keeping my homepage in a SVN repository; I'm using the $Date$ variable to automatically track the last modification date (though it will also change on minor modifications).
For the HTML "Date" meta tag, the W3C recommends using ISO8601 date format. This is the (non-XSLT-2, so no regexp) hack I use for conversion:
<xsl:value-of select="concat(substring($string,8,4),'-',substring($string,13,2),'-',substring($string,16,2),'T',substring($string,19,8),substring($string,28,5))" />
Did I mention I hate XSLT? It's lacking so many standard functions, like date-time processing, regular expressions, exceptions, ... - granted, a lot of stuff was added for XSLT2, but it still sucks badly. Especially the syntax.
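For comparison, the same conversion as a small Python sketch, which is the kind of string handling XSLT 1 makes so painful. The function name and the sample keyword value are made up. One deliberate difference: the XSLT above emits the offset as "+0100", while this sketch inserts the colon ("+01:00") that the W3C date-time profile of ISO 8601 uses.

```python
import re

# Expanded Subversion keyword, e.g.
# "$Date: 2006-12-24 12:34:56 +0100 (Sun, 24 Dec 2006) $"
SVN_DATE_RE = re.compile(
    r"\$Date: (\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) ([+-]\d{2})(\d{2}) \(.*\) \$"
)


def svn_date_to_iso8601(keyword):
    """Turn an expanded $Date$ keyword into an ISO 8601 timestamp."""
    match = SVN_DATE_RE.fullmatch(keyword.strip())
    if match is None:
        raise ValueError("not an expanded $Date$ keyword: %r" % keyword)
    day, time, tz_hours, tz_minutes = match.groups()
    return "%sT%s%s:%s" % (day, time, tz_hours, tz_minutes)


# Hypothetical sample value.
SAMPLE = "$Date: 2006-12-24 12:34:56 +0100 (Sun, 24 Dec 2006) $"
print(svn_date_to_iso8601(SAMPLE))
```

Unlike the substring hack, this at least fails loudly when the keyword hasn't been expanded or has an unexpected shape.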
Here's how to format the date according to RFC 2616, as used in the last-modified meta tag and HTTP/NNTP/SMTP headers:
<xsl:variable name="day" select="concat(substring($string,8,4),'-',substring($string,13,2),'-',substring($string,16,2))" />
<xsl:variable name="time" select="substring($string,19,8)" />
<xsl:variable name="timezone" select="substring($string,28,5)" />
<xsl:value-of select="date:day-abbreviation($day)" />
<xsl:text>, </xsl:text>
<xsl:value-of select="date:day-in-month($day)" />
<xsl:text> </xsl:text>
<xsl:value-of select="date:month-abbreviation($day)" />
<xsl:text> </xsl:text>
<xsl:value-of select="date:year($day)" />
<xsl:text> </xsl:text>
<xsl:value-of select="date:time($time)" />
<xsl:text> </xsl:text>
<xsl:value-of select="$timezone" />
Note that this does not include any error handling and is not very robust. You also need an XSLT processor with some of the http://exslt.org/dates-and-times extensions such as xsltproc. (Which unfortunately doesn't do XSLT2 yet and doesn't have a regexp extension).
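Again for comparison, a Python sketch using only the standard library. The function name and sample value are hypothetical. Note one difference from the XSLT above: this version converts to GMT first, since RFC 2616 HTTP dates are always expressed in GMT, whereas the stylesheet keeps the local time and appends the numeric offset (the mail-header style).

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def svn_date_to_http_date(keyword):
    """Format an expanded $Date$ keyword as an RFC 2616 style HTTP date."""
    # Characters 8..32 (1-based) hold "YYYY-MM-DD HH:MM:SS +ZZZZ",
    # the same positions the XSLT substring() calls use.
    stamp = keyword.strip()[7:32]
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S %z")
    return format_datetime(local.astimezone(timezone.utc), usegmt=True)


# Hypothetical sample value.
SAMPLE = "$Date: 2006-12-24 12:34:56 +0100 (Sun, 24 Dec 2006) $"
print(svn_date_to_http_date(SAMPLE))
```

Here strptime raises a ValueError on malformed input, which is about as much error handling as the XSLT version has none of.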
[Update: Joel Wreed pointed me to his libxslt plugins, a regexp and an exslt.org/dates-and-times plugin. These would help a lot, though IIRC the date-parse exslt.org spec doesn't support the date format I'd need. (So I can't just say date-format(date-parse(...),...)). Also he said that they basically are unmaintained right now. It would be nice if they could be merged into libxslt, though...]
I've set up some simple RDF export (gzip only, please add caching yourself!) for the Debtags data.
The export is using the DOAP ("Description of a project") vocabulary, though it isn't optimal (we're not talking about separate projects, but multiple packages may belong to the same). The format of the RDF file may (and will!) still change, for example I'd like to have some explicit URIs for the packages instead of just storing it in the name tag. Suggestions for a matching vocabulary / serialization are welcome.
But if you want to play around with some real-life data, just grab a copy.
There are a couple of interesting things you can do with it, such as my folding tag cloud. So go ahead, and play around with it a bit.
[Update: I've changed the schema a bit, it should be now clearer which package is being described. I'm using a new namespace for the packages that is still undefined (that's why it's called "temp")]
Check out the first web 2.0 christmas card (and hopefully the last, ever!).
Sorry, I just could not resist doing that... mocking web2.0 on christmas.
Merry christmas to everybody!
Zack asked for website meta languages for redoing his homepage.
Well, I redid my homepage last february, using XML and XSLT. A monster XSLT stylesheet, because I wanted to keep my template outside of the stylesheet.
I cannot recommend XSLT. Using it for templating is quite messy. XSLT is okay if you want to transform one representation of the data to another; it's not if you want to add a lot of surrounding markup and things like a sitemap and similar navigation tools. This is next to impossible in pure XSLT; it gets better once you have some extensions (dubbed EXSLT, and supported by pretty much any xslt processor) or maybe with XSLT2. String manipulation is also a pain (at least with XSLT1); I gave up on generating a nice "last modified" date from my subversion tag. Supposedly XSLT2 has some functions that could make this easier (i.e. for parsing and printing datetime information), but the common approach with XSLT is to only supply the required minimum of functions you need, and I'm not aware of an easy way to add custom functions, not to mention any large standard library that can efficiently be used (which is the true strength of Java, Python and C#: they bring a huge collection of pre-written ready-to-use code with them). Of course there are some efforts to write XSLT libraries (especially for XSLT2), and the aforementioned EXSLT is some kind of standard library that even might be efficiently implemented in some interpreters, but you can't rely on it to be there and to just work. XSLT isn't useless, but when it comes to presenting data to humans or writing clean, compact code it's not satisfactory at all.
I'd give you my Makefile and XSLT file, if they weren't that messy... too many features; I'm generating two languages, a partially expanded navigation menu, etc. - my XSLT is 10k, my sitemap currently 4.5k, the template file is 5k.
Anyway. Half a year ago, my favourite templating language was KID templating. I used it for some small projects such as my DNSoupdate tool to edit DNS zones via the DNS protocol (requires a nameserver such as bind which has support for DNS update, uses encryption). It was a perfect match for such tiny pages, but I'm no longer that convinced I still like it for larger projects.
What's good about Kid: it's an XML templating language, Python based, and easy to use in whichever way you want by writing a few lines of Python (e.g. it's easy to write a Makefile to generate a static version of your homepage).
What's not good about Kid: it only works with a Python interpreter. I'd prefer a templating language that can be used from multiple languages, but the Kid syntax relies on Python, which is really bad. Also, I'd prefer to have some component-render model for more complex websites. The current setup is IMHO okay if you have one default layout and one content layout; but if you have multiple components that could be combined differently on the pages, it gets too messy. In my TurboGears experiments, Kid also didn't perform very well (but that might as well be TurboGears' fault).
Right now, I'd probably still go with Kid. But I've been having a look at JSP 2.1, and JSF / Facelets in particular. There are some things about Facelets that I really like (for example, that they use a proper XML syntax, instead of this bastardized almost-XML that JSP usually (ab-)uses). There is also some stuff that I don't like (e.g. the massive overengineering of everything surrounding it), and I have no idea how easy it will be to generate static pages in a scripted fashion, i.e. using it without a real webserver. Or I might just write it all in Python, which is usually a nice language for manipulating XML.
Please don't just send me an email with your favourite templating engine. Like Zack, I'm only interested in XML-based templating engines, which rules out most templating solutions out there. Clearsilver, for example, is another bastardization of XML. I'm aware of TAL/METAL and find them quite interesting, but they also lack the kind of componentization that I usually think in.
[Update: some people have pointed me to Genshi, which is mostly Kid compatible. However, it still has mostly the same problems, e.g. the templates not being reusable in languages other than Python, and certain constructs being a pain to do (e.g. the page_specific_css recipe with more than one CSS file). Others have pointed me to Smarty, but it's string-based and doesn't ensure valid XML output (which is very useful for e.g. generating Atom or RSS output). For example, this is probably valid in Smarty: {if 0}<b>{/if}</b> - allowing errors is bad. Oh, and Smarty is PHP, which is broken by design, a no-go. The best match to my ideas so far is XML::Template (for Perl, Python, Ruby, PHP), which is pretty close to what I've been doing manually when not using a templating solution. I don't know yet how well it handles recursion - I need recursive templates for the navigation menu on my homepage.]
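To make the point about string-based templating concrete, here is a minimal Python sketch. The two strings are hypothetical renderings of a template like the one above (what a particular engine would really emit is an assumption); the check itself just uses the standard library's XML parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a root element."""
    try:
        ET.fromstring("<root>" + fragment + "</root>")
        return True
    except ET.ParseError:
        return False


# Hypothetical output of a string-based template whose condition was false:
# the opening <b> was skipped, the closing </b> was not.
broken = "some text</b>"
# What an engine that works on the XML tree would be forced to produce.
balanced = "<b>some text</b>"

print(is_well_formed(broken))
print(is_well_formed(balanced))
```

An XML-based engine cannot produce the first string at all, because it never handles tags as loose text.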
CSSzengarden has reached 200 CSS files. It's an impressive site full of CSS tricks to learn.
Go there, and just click through a few designs. There are many impressive designs there. And while most are very different in their visual experience, they all have the exact same HTML code.
So please avoid using HTML for layouting purposes. That's what CSS is for. CSS is much more powerful and does a better job for this, so use it!
Ajax, when used properly, can be a great user experience.
Badly written Ajax, however, can be a pain. Often huge JavaScript libraries are loaded, which makes your browser and system slow, and sometimes you just end up staring at a spinning animated GIF saying "Loading ...".
Good Ajax makes the application snappy, responsive, fast, and avoids screen flicker. But with your traditional "get new HTML page" model, error handling is done by your browser. DNS issue? Your browser will say server not found. Connectivity issues? The browser will inform you of the timeout. Slow connection? Your browser's throbber [wikipedia] gives you an indication that something is happening.
With Ajax, it's up to the authors of the Ajax application to do proper error handling. And many Ajax applications have serious issues here.
User proofing Ajax application [A list apart] is a good article on some basics of how to improve your Ajax applications.
Ajax is in need of some software engineering for QA. Right now there is so much low-level hacking that you'd expect 90% of Ajax applications to have serious usability and reliability issues.
Dear Lazyweb, What is the JSF 1.2 XML-Syntax equivalent for
<img src="<c:url value='/static/foo.png' />" />(this will make the URL relative to your application root, not to your web server; so if the app is installed in /foobar, the resulting URL will be /foobar/static/foo.png)
My preferred solution would have a useful src= value so an XHTML browser can still display the page. Same for CSS stylesheets, links and similar.
Please send me an email at erich@debian.org, I'll update the entry. (Comments are intentionally not enabled.)
Thanks.
[Update: at least in apache myfaces, this should work (untested):
<c:url value="/static/foo.png" var="url" />
<img src="${url}" />
However, the template file won't render approximately in a regular browser.
For easy-to-edit templates it would be nice to have an actual value in the src
property. I have an idea how that could work with custom tags...]
While everybody is still crazy about Ajax - what will its future look like? Using it is currently a major PITA, and you'll most likely have the user download a 200k JavaScript file just to make it usable for you as a programming language. JavaScript lacks so much that current programming languages offer out of the box.
This especially includes a comprehensive standard library (Java, C# and Python are all great here), a compact syntax for common data structures (e.g. set operations in Python or stream operations in C++ with <<) and of course: interfaces!
Security restrictions of the browsers - intentional security restrictions to avoid cross site scripting attacks - make interfacing between different javascript applications rather cumbersome, if at all possible. And the only way to have "private" functions in JavaScript is also more of a hack (abusing closures) than a native feature of the language.
What I'd like to see in the next generation of JavaScript - and browsers should start implementing that rather soon, so we'll be able to use it in some 5 years - are proper interfaces especially for cross-site applications, information hiding, an extensive standard library, a short syntax for XML processing and common data structures, and pretty much everything that every JavaScript toolkit reimplements again and again. Oh, and the result shouldn't be Java yet, but still an embedded scripting language. ;-)
Ajax has shown how viable it is to run client-side computations, while just downloading the raw data from a server. But Ajax is not restricted to doing fancy user interfaces.
It should easily be possible to use an iframe based ad to use the CPU power of page visitors to do some large-scale computations.
Can you imagine how much processing power Google could churn up by having its GMail users do some distributed computing? Or on YouTube. While the user is watching the video, a javascript does some calculations in the background.
By keeping data in a cookie, your calculations might even be able to survive page reloads. And if you're running a large ad network such as Google's, you might even be able to detect user inactivity. Update a cookie whenever the user comes onto an AdSense page; if he didn't visit such a page for 30 minutes, assume the user is idling and start computation. If he leaves his web browser open overnight, you'll get a lot of CPU cycles.
(Yes, I know that Google is supposedly not in desperate need of free CPU cycles...)
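The idle heuristic above boils down to comparing a stored timestamp against the current time. A minimal sketch in Python, assuming the cookie simply holds the time of the last ad page view (the function name and the handling of a missing cookie are made up for illustration):

```python
from datetime import datetime, timedelta
from typing import Optional

IDLE_AFTER = timedelta(minutes=30)  # threshold taken from the text


def is_idle(last_seen: Optional[datetime], now: datetime) -> bool:
    """Guess whether the user is idle, given the timestamp stored in the cookie.

    last_seen is None when no cookie has been set yet; in that case we
    assume the user is active, which is the cautious choice.
    """
    if last_seen is None:
        return False
    return now - last_seen >= IDLE_AFTER


now = datetime(2007, 1, 1, 23, 0)
print(is_idle(datetime(2007, 1, 1, 22, 45), now))  # seen 15 minutes ago
print(is_idle(datetime(2007, 1, 1, 22, 0), now))   # seen 60 minutes ago
```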
I guess the term "ajax clouds" is now more appropriate.
Debtags clouds have evolved. They're no longer a static page with a single cloud that will forward you to the more complex browsing tool by Enrico, but now the cloud will adapt to your previous choices, and allow the selection of multiple tags.
The biggest issue now is probably tag naming, e.g. what is the difference between "role", "use" and "scope" (unfold them to get an idea) or between "interface" and "uitoolkit" (interface is mostly commandline vs. fullscreen vs. windowed vs. 3d; uitoolkit is gtk vs. qt vs. whatever)? Unless you're familiar with these terms, you'll probably still find it hard to navigate the tag cloud.
Still I hope this inspires you to think of new UIs doable with tag information (which is a small step towards the semantic web; actually these facets here are quite similar to RDF triples...)
There are so many things I'd like to try out with this data...
If you have suggestions, please share them via email.
For those interested in the technical stuff: tag clouds are loaded via Ajax, served from a database with ~120 MB of precalculated, precompressed JSON files. Precalculation is rather expensive; on my 4+ year old laptop it took about 105 minutes (76 minutes of CPU time). Storing them in the filesystem instead of a BerkeleyDB hashtable took more than 4 hours. The outermost (i.e. largest amount of data) set takes 1.1 seconds to compute; there are 344871 precomputed tag selections, so it precalculated 75 selections per second on average. Yes, complexity is not linear; benefits from caching large results are huge.
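As a quick check of the throughput figure, using only the numbers quoted above: the "75 selections per second" average corresponds to the 76 minutes of CPU time; measured against the 105 minutes of wall-clock time it is closer to 55.

```python
selections = 344871
cpu_minutes = 76
wall_minutes = 105

per_cpu_second = selections / (cpu_minutes * 60)
per_wall_second = selections / (wall_minutes * 60)

print(round(per_cpu_second, 1))   # 75.6
print(round(per_wall_second, 1))  # 54.7
```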
I'd really love to run a similar interface for e.g. last.fm, but I guess this would not work as well; their tags aren't grouped in facets. But I have some ideas to make up for that.
P.S. This is also my first real Ajax app (except for using json instead of XML). And I still hate Javascript.
[Update: I've worked around an issue with Opera (which is stricter on JavaScript object syntax than Mozilla). I haven't tried Internet Exploder yet. But this is a navigation experiment, not an application to be deployed...]
Tag clouds are usually done by scaling font sizes according to some weight.
Actually this is not very precise. For a representative representation (lol, I should get this domain name. representative-representation.com), the tag size - that is the surface area! - should be a representation of the tag's weight.
The surface however doesn't directly depend on the font size, but is more like font size * length of word (length being appropriate for the font used).
So when displaying tags with very different font sizes, "egg" and "Technorati" shouldn't just be scaled by their weight, but also by their word length.
OTOH, few users will actually be able to "grasp" the actual difference in size. IMHO it's just about "popular" vs. "obscure" and about making the tool more interesting to use.
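A rough sketch of such a length-aware scaling, in Python. Note that it goes one step further than the "font size * length" approximation: it assumes the rendered height also grows with the font size, so the area grows with the square of the font size. The average character aspect ratio and the pixel area per unit of weight are made-up parameters, not measurements.

```python
import math


def font_size_for_weight(word: str, weight: float,
                         area_per_weight: float = 100.0,
                         char_aspect: float = 0.5) -> float:
    """Font size (px) at which the word covers about weight * area_per_weight px^2.

    Assumed model: width = len(word) * char_aspect * size, height = size,
    so area = len(word) * char_aspect * size ** 2.
    """
    if not word or weight <= 0:
        raise ValueError("need a non-empty word and a positive weight")
    area = weight * area_per_weight
    return math.sqrt(area / (len(word) * char_aspect))


# Same weight, different word length: the shorter word gets the larger font.
for word in ("egg", "Technorati"):
    print(word, round(font_size_for_weight(word, weight=10), 1))
```

With equal weights, "egg" is rendered in a larger font than "Technorati", so both cover about the same surface.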
I've been working a little bit more on the folding tagcloud for Debian packages. I've added closing of folds, and the code for displaying selected tags as well as matching packages is in place, too (you'll need to use a different .json data to actually see results though).
To make it truly interactive, i.e. allow the selection of multiple tags until you get some results, I need to add more data files.
So I'll have to decide now if I'm going to use a CGI (the "traditional" method), which will likely need to have some caching, or if I'll just precalculate everything into static .json files. I could even store them as .gz on the webserver; any browser with ajax capabilities should be able to do gzip decompression on the fly. This would offer maximum performance and security, but it means I'll need more magic in the javascript (and Javascript is ugly). Or I'll do a combination of both, use a tiny CGI serving the precalculated data; the CGI could then easily be replaced with a dynamic-caching CGI later.
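The "precalculate everything into static, gzipped .json files" variant could look roughly like this in Python. The directory layout, file naming and the shape of the data are assumptions for illustration only.

```python
import gzip
import json
from pathlib import Path


def write_precompressed(out_dir: Path, name: str, data: dict) -> Path:
    """Serialize data as compact JSON and store it gzip-compressed on disk."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / (name + ".json.gz")
    payload = json.dumps(data, separators=(",", ":")).encode("utf-8")
    with gzip.open(target, "wb") as fh:
        fh.write(payload)
    return target


def read_precompressed(path: Path) -> dict:
    """Read a file written by write_precompressed (handy for checking the result)."""
    with gzip.open(path, "rb") as fh:
        return json.loads(fh.read().decode("utf-8"))
```

For the browser to decompress such a file on the fly, the web server has to send it with a `Content-Encoding: gzip` header; how to configure that depends on the server.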
SVG rendering for the tag cloud would probably also be very cool. With some smart layout algorithms, it could become much more cloud-like. And there could probably be a nice animation when "subclouds" are unfolded, pushing away the other folds. However, that would be much slower. Any animation means a slowdown, since it adds extra delays.
Folding tag clouds of Debian software packages.
I'm trying to make tag clouds workable with a large number of tags (unfolded tag cloud for comparison) by folding them into subtopics.
Yay, when we're done with tagging the Debian packages, we'll have a great new way of browsing available Linux software. Linux doesn't lack software anymore, it has so much software, you just don't find what you need.
The next version of this will probably allow you to select multiple tags, and update the tag cloud upon selection. So if you choose "GTK" ui toolkit, the "QT" toolkit tag will become quite small, etc.
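One simple way to make the cloud react to a selection is to recount tags only among the packages that carry all selected tags. The following Python sketch shows that idea; it is not the actual implementation, and the package names and tag assignments are toy data.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def cloud_for_selection(packages: Dict[str, Set[str]],
                        selected: Iterable[str]) -> Counter:
    """Count the remaining tags among packages that carry every selected tag."""
    wanted = set(selected)
    counts: Counter = Counter()
    for tags in packages.values():
        if wanted <= tags:                 # package matches the whole selection
            counts.update(tags - wanted)   # weigh only the tags not yet chosen
    return counts


# Toy data; package names and tag assignments are made up for illustration.
packages = {
    "editor-a": {"uitoolkit::gtk", "use::editing"},
    "editor-b": {"uitoolkit::qt", "use::editing"},
    "viewer-c": {"uitoolkit::gtk", "use::viewing"},
}

print(cloud_for_selection(packages, []))
print(cloud_for_selection(packages, ["uitoolkit::gtk"]))
```

The resulting counts can then be used as the weights for the next cloud; tags that no longer occur simply drop out.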
Ajax isn't progress, it's actually a step back.
Most people have been happy to actually use some sane advanced language for doing web sites. There is Java, JSP, PHP, Ruby, Python, Perl, ...
And then there is Ajax. One of the main components is JavaScript. That language most of us would have loved to forget. Forever.
Ajax brings you back into the dark ages of internet development, with browser incompatibilities, it features memory leaks, and many Ajax apps (such as live.com) will run unbearably slow on older computers. Such as my 1.8 GHz P4M laptop. If they work at all - live.com just shows "Loading..." for me right now.
Have you ever looked at some of the actual Ajax code?
This isn't the future, this is the dark ages coming back!
[MJ Ray replies that I didn't mention accessibility issues with Ajax. That's true; I'm aware of them, but this post focuses on Ajax doing away with all the advances in programming languages and software engineering... The inaccessibility of Ajax stuff is worth a blog post of its own sometime.]
Talking about logic, and where people fail at understanding it...
Fun with stupid Google queries - Is there any page about Google and not Google?
Wonder how big the Google index is? Google OR -Google.
(I'm aware that the count is imprecise, it's just funny that this query is actually processed.)
Now let's look at MSN.
google 64741016
-google 4669973284
okay. that should make "4734714300"...
google OR -google 5156927770 - oops, magic new hits.
-google OR google 68735504
So where are the 5 billion pages that have "google or -google" but not
"-google or google"?
And how about MSN OR -MSN - there can be only one.
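The arithmetic on the MSN counts above, spelled out in Python. The only assumption is that every page either contains "google" or does not, so the two basic result sets are disjoint and their union should be the plain sum.

```python
hits = {
    "google": 64741016,
    "-google": 4669973284,
    "google OR -google": 5156927770,
    "-google OR google": 68735504,
}

expected_union = hits["google"] + hits["-google"]
print(expected_union)                                          # 4734714300
# "Magic new hits": reported union minus the sum of its parts.
print(hits["google OR -google"] - expected_union)              # 422213470
# Pages matching one ordering of the OR query but not the other.
print(hits["google OR -google"] - hits["-google OR google"])   # 5088192266
```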
askdjfhsakdf OR -askdjfhsakdf - top result for this garbage word: Google!
askdjfhsakdf OR -askdjfhsakdf Google results are consistent. I wonder what their sorting is in this case... random? hash function? age?
Remember Googlestossen? Like Googlewhack, but with scoring. There must be only one hit with both words; score = # of hits with word 1 * # of hits with word 2
A typical web page will consist of dozens of files - images, javascript, CSS.
Web browsers usually load 2 files in parallel (recommended by the HTTP/1.1 spec to use max 2 keep-alive connections). If you are including many javascript files in your <head /> element, these will probably be loaded first, the images second. This is the effect of images appearing "late" over slow connections.
However, if your page can be displayed without the javascript (which is very recommendable because of accessibility issues), you might want the browser to load the images first. If your page totally relies on JavaScript - bad luck for you.
In order to improve your load times, you can use some simple techniques. For example, put images on a separate server (e.g. images-amazon.com, Yahoo's yimg.com, static.flickr.com, photos1.blogger.com).
[Update: I'm not suggesting you (ab-)use one of these sites for hosting your images, but these are examples of big services using this technique. Blogger, btw, has a referrer filter, so it won't work. And it breaks "planets", so I actually recommend you to use a different blogging service.]
This is a common practise for large sites, for several reasons:
The last point is the inspiration for this posting - while your web server is still busy building some dynamic web page or serving some Ajax requests, your image server could already be sending out the images.
This might give the user the impression that your web site is much faster. Most users are on broadband - but they still have some latency. In fact, latency has increased for many users with broadband, for example due to interleaving on DSL lines; ping is higher with regular DSL lines in Germany than it was with modems or ISDN.
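To get a feeling for why a second host helps when latency dominates, here is a deliberately crude estimate in Python. It assumes that every file costs exactly one round trip, that there is no pipelining, that bandwidth doesn't matter, and that the browser opens two connections per host; the file counts and the 60 ms round trip are made-up numbers.

```python
import math
from typing import Sequence


def load_time_ms(files_per_host: Sequence[int], rtt_ms: float,
                 connections_per_host: int = 2) -> float:
    """Latency-only estimate: one round trip per file, no pipelining.

    Hosts are fetched in parallel, so the slowest host determines the total.
    """
    if not files_per_host:
        return 0.0
    rounds = max(math.ceil(n / connections_per_host) for n in files_per_host)
    return rounds * rtt_ms


print(load_time_ms([24], rtt_ms=60))       # all 24 files on one host
print(load_time_ms([12, 12], rtt_ms=60))   # half of them moved to an image host
```

In this toy model, splitting the files evenly over two hosts halves the estimated load time; real browsers and servers will behave less neatly.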
Some relevant pages: HTTP/1.1 Pipelining [w3.org], Mozilla pipelining FAQ [mozilla.org]
Tag clouds [en.wikipedia.org] are a current must-have for web 2.0 applications.
Examples can be found for bookmarks, blog entries, photos, music or books.
Tag clouds are hip, because they're a dynamic feature and show the "Zeitgeist" [wikipedia]. They give an overview of the users ("California" in flickr) or of current hot topics ("Israel", "Lebanon" in technorati).
However, tag clouds also have severe limitations.
First of all, they're arbitrarily ordered. Usually alphabetically, so there is no content relationship among the entries.
Secondly, they only show an excerpt, since there are usually many more tags than fit on the screen.
Thirdly, they're atomic information, whereas relations as used e.g. in RDF [wikipedia] can convey much more complex information.
I'm trying to push tag clouds to the next level. They're a gimmick right now, but maybe we can make them into a powerful navigation tool?
Together with Enrico Zini I've just created my first tag cloud (I've skipped making a tag cloud for my blog...).
Well, it quickly evolved beyond a tag cloud. You could maybe call it a tag sky. Or tag forest.
I'm not using my blog or something like this for the tag cloud. That would be quite boring, I'm not doing real tagging on it. Instead I'm using software tags. The Debtags project, led by Enrico and me, has been working on software tags for some years now during our spare time. We have about 600 tags in a dozen facets, and 15000 software packages (I don't have the number ready of how many of those are somewhat tagged already). Well, the tagging efforts are still far from complete, that's why we're currently working on an AI to assist tagging efforts, too.
We generated two different renderings of the tag clouds for you: one separated cloud per facet, and all folded into one big cloud. Oh, and actually click on one of the tags, it will take you to a more complex tag-based navigation tool and a tagger.
So what makes these different from the usual tag clouds you see everywhere (apart from the sheer size, sorry about that. Maybe we'll add buttons next to hide/show tags with low occurrence numbers)?
Well, the tags next to each other aren't completely unrelated any more, since they are (in both renderings) grouped by their facet. This makes it easier to locate something - go through the red facets first, then look at the tags in the group.
I'm thinking about a second step, which would involve dynamic expanding details in the tag cloud, or hiding them, finally transforming the tag cloud into a true navigation utility beyond a "single click filter".
In my final diploma thesis, one of the topics to work on suggested by my professor is doing "tag clouds" (i.e. weighted lists) for relations. The prototype will likely be integrated with the IkeWiki semantic wiki. I don't have a clear vision of how the "relation cloud" will work or look, but I haven't started with my thesis yet anyway. I currently imagine up to three clouds (corresponding to the empty places in the relation) that will dynamically adapt to the choices already made by the user. Some zooming will probably be needed, too.
Another use of tag clouds would be a visualization of the AI - the weights could be chosen by how sure the AI is about this tag; the cloud would then describe the AI's rating of a software package description.
If you have some ideas, good links, relevant papers or other feedback, just send me an email to erich@debian.org. Thank you.