fixing web pages on the fly
January 14, 2003
I was thinking about the fact that browsers have to fix up a page internally in order to render it correctly if elements are incorrectly nested or are left unclosed. Since we know that Mozilla does a pretty good job of this, and its fix-up engine is open source, shouldn't it be possible to make a web proxy that feeds a tag soup source HTML document through the fixer-upper and outputs a valid XHTML document?
I'd love to see a web service that I could send a URL to that would pipe the page through Mozilla and then through HTML Tidy and (eventually, presuming this would be a slow process) spit out valid XHTML on the other side.
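As a rough illustration of the fix-up step only, here is a minimal sketch in Python. It is neither Mozilla's fix-up engine nor HTML Tidy; it stands in for them using the standard library's `html.parser`. The names `SoupFixer` and `fix` are hypothetical, and the output is merely well-formed markup, not XHTML validated against a DTD. Fetching the URL and serving the result are left out.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML; written out as <br /> etc.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class SoupFixer(HTMLParser):
    """Hypothetical helper: re-serialize tag soup as well-formed markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the repaired document
        self.stack = []  # names of currently open elements

    def handle_starttag(self, tag, attrs):
        # Keep the first occurrence of each attribute; give valueless
        # attributes (e.g. "checked") their own name as the value.
        seen = {}
        for name, value in attrs:
            seen.setdefault(name, name if value is None else value)
        attr_text = "".join(
            ' {}="{}"'.format(name, escape(value, quote=True))
            for name, value in seen.items()
        )
        if tag in VOID:
            self.out.append("<{}{} />".format(tag, attr_text))
        else:
            self.out.append("<{}{}>".format(tag, attr_text))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened after the matching element, then the
        # element itself, so the nesting comes out correct.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</{}>".format(open_tag))
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def fix(html_text):
    """Return html_text with unclosed and misnested elements repaired."""
    fixer = SoupFixer()
    fixer.feed(html_text)
    fixer.close()
    # Close whatever the author left open at the end of the document.
    while fixer.stack:
        fixer.out.append("</{}>".format(fixer.stack.pop()))
    return "".join(fixer.out)


print(fix("<p><b>bold<i>both</b> tail<br>"))
```

This sketch drops comments and the doctype and makes no attempt at the smarter recovery a real browser performs, so it only hints at what the imagined service would do.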
My secret ulterior motive for all of this is that I want to see people start to work on conferring benefits on sites that use proper XHTML, by offering richer indexing, search, transformation or presentation opportunities. And perhaps the best way to demonstrate these new opportunities on existing, invalid pages would be if we had a way to easily create machine-made valid versions. Granted, the transformations would be imperfect, but they might be close enough to show the potential applications.
8 Comments

How will this web service or proxy encourage the use of valid XHTML? If the rich indexers are piping everything through the service, then there's no downside for authoring tag soup. Tag soup authors will never use this service. If they cared about valid XHTML, they would author it in the first place.
More and more folks participating in the world wide web today do not have the knowledge to author valid XHTML, including myself, and time doesn't always allow us to learn about it either, but that doesn't mean we care any less about validating our pages. Therefore, I believe Anil's idea is wonderful and would be quite beneficial to the mass of others like myself.
OK, I see. It will be useful for authors to see what their pages look like as valid XHTML.
Of course, that assumes that valid XHTML has value. Good sources say that XHTML is just a crock.
Ping LazyWeb.
I mostly agree that anybody caring enough about this would probably be writing proper markup in the first place, or at least take the time to educate themselves after a few run-ins with a validator. But it could be a tool in the education itself to see auto-cleaned code, the same way people have used a visual HTML editor, and then looked at the results.
As much as I like the idea, I don't think it could be quite this simple. People using a service like this are going to expect that what comes out the other side works, disclaimer or not. Instead, they're more likely to get a near-solution that will only end up highlighting arcane things like the quirks/compliant modes in browsers, etc. That's going to be a big problem when you run into situations like code that previously used things like box-model trickery, which would validate in the first place, and so remain untouched by the transformation.
On the other hand, Bravada, there's nothing above about promoting usage of XHTML (initially, anyway). Seems to me the point is to make it possible to demonstrate what it could bring about, and also confer some of the benefits those people who have been using rich markup have been working for (this wouldn't have anything to do with [Dive into]Mark's little hissy yesterday, would it?). The experimentation would start on the extant sites, and the transformation engine would allow those people currently pumping out "soup" a quick and dirty way into seeing concrete, personal results (read: jump on the bandwagon), and continue studying.
No, diveintomark was saying that XHTML 2.0 is a crock. Not 1.0.
And Mark talks a lot of mess, but you know that boy's gonna be churning out XHTML 2.0 as soon as he can find a valid reason to.
Bravada,
Mark probably isn't the best example; he's discussing how a *proposed* spec will break his *current* valid markup. XHTML 2 is a long way off, and since there isn't a browser in the world that supports it, the point is moot. It's none of my business, and I'm sure he has his reasons, but going back to HTML 4.01 strikes me as a rather extreme measure in reaction to an unfinished proposal years away.
What's lost in this is that DTDs exist for a reason, and a page is just as valid even if written to an older standard. Anil's idea could easily be extended to creating valid markup for a variety of DTDs.
This touches on Su's concern that just because something validates does not mean it's going to work in the real world. Extending this beyond XHTML and into HTML 4.01 and CSS1 would be a step towards aiding the creation of compliant markup while leaving some extra room for proven solutions, ones that, while not cutting edge, may be perfectly sound.
I think it would only dissuade folks from writing valid XHTML docs.