Channel: Codewerks

My take on Microdata versus Microformats


To provide some background, I’m going to start with a little story of how I originally got involved with the microformats community. A couple of years ago I was working at thethingsiwant.com, where I had implemented a little hack: whenever someone added an item from Amazon, I would analyze the URL, extract the ASIN, send a Web service request to their server and get back the price, image and referral link of the product. Users of the site really liked having an image for their items and they started asking to have this for other sites. Feeling encouraged, I ran a query on the database, picked the 100 most popular websites and wrote an extensible crawler that picked basic product data from their pages (yes, I know, I’m insane) [*].

Anyhow, what happened next was that some of our users had e-commerce sites of their own and they started asking “Why doesn’t TTIWBot pick data from my site?”, so I would generally take their request and add a customization of the bot for their site. The thing is, this was very time consuming, and every time a website changed its layout I had to change the customization of the bot. It quickly became clear that some sort of common format would be better. Now, I could have gone the easier route and asked them to implement an XML Web Service, but some were really small sites that you could tell were done on a budget, and the truth is it would have been cruel to ask them for something as complicated and as expensive as a Web Service was at the time. Enter Microformats.

I read somewhere (I don’t remember exactly where) that there was a new technology being developed called microformats, originally backed by Technorati. I looked around the documentation, understood the basic concepts and decided to contribute. To be honest, I was a little scared to be posting on their list in the first place [*], since most people there seemed to either have advanced degrees from prestigious universities, be working at booming Silicon Valley startups, or be people whose work I had read (like Mark Pilgrim). At the beginning, I didn’t feel that I was taken very seriously, and I felt that was reasonable, since the only affiliation I had at the time was with TTIW, which was a small, independent website with no backing whatsoever and whose markup wasn’t exactly perfect. So, to prove that I was serious, I wrote a little web app that you can find here, though due to changes in PHP it doesn’t work anymore. To get an idea of how it used to work you can see the review by Michael Coté, here. I also contributed the examples in the wild section of hListing. Of course, my enthusiasm took over and I wrote all kinds of extra stuff that I just thought of, like what you can find here, but that’s because I liked the technology and I started thinking about uses for it.

Anyhow, my point here is that I had an actual need. As the discussions continued (which I didn’t follow completely, since I was really busy with other stuff), it became clearer that there was a gap between the formats being developed and what I actually needed. Specifically, I remember not being very happy with the format for price: a value without a currency attached is worthless, and TTIW had users from a variety of different countries. I also thought that there were some design issues, and as an end user I thought that having a validator would make life easier for everybody involved. I found a way to write one (using the structure provided by Tidy) and tried contacting the people responsible for microformats to tell them about this and maybe warn them about some possible problems, but I got no reply. So I decided to go ahead and release it (xmfp), wrote a little write-up on some of the issues I thought the technology had, and sort of withdrew from the list.

A lot of things happened to me and to the world after that, and I sort of withdrew from web development and disconnected for a while. Then the Microdata spec came out and, since I like the technology, I wrote a quick implementation which I’m still working on. So let’s move on to why I believe Microdata is a better spec.

A quick summary of why I believe Microdata to be a better format than Microformats

It doesn’t break RDF

It is entirely possible, without much effort, to encode RDF in microdata, and an example is provided here. A lot of work has been done on RDF and on linked data technologies in general, and there are plenty of things you can do with linked data (with or without RDF) that you cannot do with microdata or that are outside its scope. A good summary of this was provided by Kingsley Idehen in this comment on a post by Georgi Kobilarov.
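To make the idea a bit more concrete, here is a rough sketch of how an already extracted microdata item could be flattened into subject/predicate/object triples. This is my own simplification and not any official microdata-to-RDF mapping: the dictionary layout, the `item_to_triples` function and the example.org identifier are assumptions made up for illustration, and it relies only on the fact that property names can be absolute URLs and that an item can carry a global identifier.

```python
from itertools import count


def item_to_triples(item, counter=None):
    """Flatten one item (hypothetical dict layout) into a list of triples.

    Returns (subject, triples). Items without an "id" get a generated
    blank-node label such as "_:b1".
    """
    counter = counter if counter is not None else count(1)
    subject = item.get("id") or "_:b%d" % next(counter)
    triples = []
    for predicate, values in item["properties"].items():
        for value in values:
            if isinstance(value, dict):
                # A nested item becomes the object of this triple and
                # contributes its own triples afterwards.
                nested_subject, nested = item_to_triples(value, counter)
                triples.append((subject, predicate, nested_subject))
                triples.extend(nested)
            else:
                triples.append((subject, predicate, value))
    return subject, triples


# Illustrative data only; the book identifier is invented.
book = {
    "id": "http://example.org/book/1",
    "properties": {
        "http://purl.org/dc/terms/title": ["Example"],
        "http://purl.org/dc/terms/creator": [
            {"properties": {"http://xmlns.com/foaf/0.1/name": ["A. Author"]}}
        ],
    },
}

subject, triples = item_to_triples(book)
for triple in triples:
    print(triple)
```

A real converter would also have to deal with literals versus URL values, datatypes and vocabularies, which this sketch ignores on purpose.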

Clear syntax rules

The syntax is extremely simple. Once you have access to a tree representation of the HTML document, like the one provided by Tidy or the DOM, it’s really simple to extract the data, and the WHATWG even provides an algorithm for doing so. Furthermore, if you are working in a scripting language, you could base your implementation on the one I did (it’s just 5 very recursive functions).
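To give an idea of how little code the tree walk takes, here is a minimal sketch in Python. It is not my implementation and not the full WHATWG algorithm: it assumes well-formed (XHTML-style) markup so that the standard library’s `xml.dom.minidom` can stand in for Tidy or the DOM, it ignores `itemref`, and it skips items nested inside another item without an `itemprop`. All function names and the sample markup are made up for this example.

```python
from xml.dom import minidom

URL_ATTRIBUTE = {"a": "href", "link": "href", "area": "href",
                 "img": "src", "audio": "src", "video": "src",
                 "source": "src", "iframe": "src", "embed": "src"}


def text_content(node):
    """Concatenate the text of a node and all of its descendants."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_content(child) for child in node.childNodes)


def property_value(el):
    """Value of an itemprop element: nested item, attribute or text."""
    if el.hasAttribute("itemscope"):
        return extract_item(el)
    if el.tagName == "meta":
        return el.getAttribute("content")
    if el.tagName in URL_ATTRIBUTE:
        return el.getAttribute(URL_ATTRIBUTE[el.tagName])
    if el.tagName == "time" and el.hasAttribute("datetime"):
        return el.getAttribute("datetime")
    return text_content(el)


def collect_properties(el, props):
    """Gather the properties below el without crossing into nested items."""
    for child in el.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue
        if child.hasAttribute("itemprop"):
            for name in child.getAttribute("itemprop").split():
                props.setdefault(name, []).append(property_value(child))
        if not child.hasAttribute("itemscope"):
            collect_properties(child, props)


def extract_item(el):
    """Build a dict for the item rooted at el."""
    item = {"properties": {}}
    if el.hasAttribute("itemtype"):
        item["type"] = el.getAttribute("itemtype")
    collect_properties(el, item["properties"])
    return item


def top_level_items(node, items):
    """Find itemscope elements that are not properties of another item."""
    for child in node.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue
        if child.hasAttribute("itemscope") and not child.hasAttribute("itemprop"):
            items.append(extract_item(child))
        else:
            top_level_items(child, items)
    return items


# Made-up sample markup with a made-up vocabulary.
SAMPLE = """<div itemscope="" itemtype="http://example.org/product">
  <span itemprop="name">Toy robot</span>
  <img itemprop="image" src="robot.png" alt=""/>
  <div itemprop="offer" itemscope="">
    <span itemprop="price">19.99</span>
    <span itemprop="currency">USD</span>
  </div>
</div>"""

items = top_level_items(minidom.parseString(SAMPLE), [])
print(items)
```

Note that nothing in these functions knows about any particular vocabulary: the same code handles products, people or anything else.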

Another good thing is that even though it’s very simple, it is extremely powerful and you could encode a wide variety of complex data types with it.

It doesn’t require post-processing of the picked-up values. Requiring it was, I believe, a terrible mistake in the design of Microformats. Why do I believe this? Because it makes a generalized extractor or parser impossible to build: you have to add, by hand, the post-processing rules of every new format you want the extractor to support.

There are no vocabulary validation rules so far in the Microdata spec, so this is still an open question. Some vocabularies developed by the microformats community, however, had properties that could take different value types. For example, org in hCard might be either a string, "org":"example", or a structure with 2 different values, "org":{"organization-name":"example-name", "organization-unit":"example-unit"}. This is unnecessarily complex to deal with when working with the data in general applications.
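As a small illustration of the extra work this causes, every application consuming such data ends up carrying a helper along these lines. This is just a sketch: the `normalize_org` function is hypothetical, and I am assuming the parser hands over either a plain string or a dict with the two keys shown above.

```python
def normalize_org(org):
    """Return org as a dict with both keys, whatever shape it arrived in."""
    if isinstance(org, str):
        # The short form carries only the organization name.
        return {"organization-name": org, "organization-unit": None}
    return {
        "organization-name": org.get("organization-name"),
        "organization-unit": org.get("organization-unit"),
    }


print(normalize_org("example"))
print(normalize_org({"organization-name": "example-name",
                     "organization-unit": "example-unit"}))
```

One such helper is harmless, but a generic consumer needs one for every property that behaves this way.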

Not limited to a closed set of vocabularies

The microdata spec does not force the use of any particular vocabulary. In fact, the choice of vocabulary is completely up to the implementor. This means that if a user has a particular need (like the one I had) and there is no vocabulary that fits that need, they can create their own.

It’s a W3C Spec

There is not much more to say about it (and I mean this in a good way). This also means that it will probably be implemented by most browsers.


(*) None of this is working anymore, since most websites have changed layouts, Amazon requires a timestamp, and TTIW’s server is so old that the clock keeps getting out of sync. I also did a lot more like this (like integrating the crawler with the datafeeds of different referral services, identifying a product across different websites by ISBN or UPC, etc.), but none of it is relevant to what this post is about. In the end I wrote another algorithm in JavaScript that picks the image based on the areas of the HTML elements on the page and also does some basic price picking on the text of the page. That’s what’s currently live, but TTIW is not actively being maintained.

(*) I even made a couple of faux pas. For example, in a discussion of semantic URLs I was trying to find an example, went to TTIW’s Tags page and clicked on the tag “gothic lolita” (which was a popular tag on the site, since a lot of the users were teen girls and this youth subculture was trendy at the time). Then I realized that it didn’t look very serious and changed it to star wars; unfortunately, I forgot to change the link. It’s just that I was petrified of posting on the list.

