I’ve completely changed the way the string representing the HTML is preprocessed before being fed to tidy — both the function and the approach. The function is not particularly elegant, but it fixes a bunch of bugs: it’s mostly character iteration and lots and lots of flags (old-school style!). After some quick browsing through the HTML parsing algorithm provided by the WHATWG, it got me thinking about whether I shouldn’t just write my own parser (though that looks rather hard and, especially, time-consuming). I’ve also been looking at the source code of tidy, and though it’s quite big, the other option would be to contribute to it and help update it to HTML5. But it would take me some time to get to know the code base, the project seems to have been abandoned, and it might be too big for just one person to work on. Anyhow, I’m not promising anything so far.
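To give a flavor of what that kind of flag-driven character iteration looks like, here is a minimal, hypothetical sketch (not the library’s actual code): a preprocessor that walks the string with a couple of state flags and escapes stray `<` characters in text content, the sort of loose markup that trips up a strict cleaner like tidy.

```php
<?php
// Hypothetical sketch of old-school, flag-based preprocessing:
// iterate character by character, tracking whether we are inside
// a tag and inside an attribute quote, and escape stray '<' in text.
function escapeStrayLt(string $html): string
{
    $out = '';
    $inTag = false;   // currently inside a <...> construct
    $quote = '';      // active attribute quote character, if any
    $len = strlen($html);

    for ($i = 0; $i < $len; $i++) {
        $c = $html[$i];

        if ($inTag) {
            if ($quote !== '') {
                if ($c === $quote) {
                    $quote = '';   // closing quote of an attribute value
                }
            } elseif ($c === '"' || $c === "'") {
                $quote = $c;       // opening quote of an attribute value
            } elseif ($c === '>') {
                $inTag = false;    // end of the tag
            }
            $out .= $c;
            continue;
        }

        if ($c === '<') {
            $next = ($i + 1 < $len) ? $html[$i + 1] : '';
            // A '<' only opens markup when followed by a letter, '/', '!' or '?'
            if (ctype_alpha($next) || $next === '/' || $next === '!' || $next === '?') {
                $inTag = true;
                $out .= $c;
            } else {
                $out .= '&lt;';    // stray '<' in text: escape it
            }
        } else {
            $out .= $c;
        }
    }
    return $out;
}
```

For example, `escapeStrayLt('<p>2 < 3</p>')` returns `<p>2 &lt; 3</p>`. The real preprocessing in the library handles more cases than this, but the shape — one pass, a handful of booleans — is the same.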
I do understand that the library’s current approach (preprocessing and then sending to tidy) is not the most efficient one. But there is another take on efficiency, and that’s economic efficiency: except for really heavy-duty Microdata consumption, the library does fulfill its purpose. And the truth is that Microdata is a new spec that has yet to be widely adopted, so performance is not a real concern right now. So the question is whether it makes sense to spend the next three months writing a parser from scratch when the one I have fits my needs (and probably those of 99.999% of the PHP developers who may use the library). So far I don’t see the point. But then again, my geeky side keeps bugging me to do it right.
Well, anyhow, if you find any bugs (and I’m sure there are many, simply because there are very few Microdata examples out there and I might be missing strange markup some user might come up with), please report them! Other than that, I will next write a post on why I believe Microdata to be better than Microformats, and I will probably also write a personal post that I’ve sort of been owing myself.