syndication formats?

November 22, 2002

I have a radical proposal for a ubiquitous content syndication format, applicable for almost any purpose, but extremely well suited for weblogs. It's extremely simple to implement, either by software or by hand, works already in millions of clients that are very forgiving of malformed or omitted data, and is human readable both in source and output formats. Even better, it doesn't require any additional work to create the syndication format when creating your website.

My new syndication format is called XHTML. I propose that existing syndication and aggregation clients should be able to read an HTML file, detect if it has the appropriate XHTML doctype, and then render the contents of each XHTML node in the appropriate place in the client's display. All that would be needed is standardization of names and classes for page elements like DIVs and headers. A post/entry title would always be an H3, with a class set to "title", for example. Permanent links would always be P tags with their classes set to "permalink". Simple.
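To make the idea concrete, here is a minimal sketch in Python of what an aggregator might do with such a page. The H3 "title" and P "permalink" conventions come from the proposal above; wrapping each post in a DIV with class "entry", the sample page, and the example.com URLs are assumptions made only for illustration. The sample also leaves out the doctype, since a plain XML parser may reject the HTML entities a real page would use.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace once the page declares it.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical page: the "entry" wrapper class is an assumption,
# "title" and "permalink" follow the proposal.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example weblog</title></head>
  <body>
    <div class="entry">
      <h3 class="title">First post</h3>
      <p>Hello, world.</p>
      <p class="permalink"><a href="http://example.com/archives/1.html">link</a></p>
    </div>
    <div class="entry">
      <h3 class="title">Second post</h3>
      <p>More text.</p>
      <p class="permalink"><a href="http://example.com/archives/2.html">link</a></p>
    </div>
  </body>
</html>"""


def extract_entries(xhtml_text):
    """Return one dict per entry DIV, holding its title and permalink."""
    root = ET.fromstring(xhtml_text)
    entries = []
    for div in root.iter(XHTML_NS + "div"):
        if div.get("class") != "entry":
            continue
        title = div.find(f"{XHTML_NS}h3[@class='title']")
        link = div.find(f"{XHTML_NS}p[@class='permalink']/{XHTML_NS}a")
        entries.append({
            "title": title.text if title is not None else None,
            "permalink": link.get("href") if link is not None else None,
        })
    return entries


for entry in extract_entries(SAMPLE):
    print(entry["title"], entry["permalink"])
```

Everything outside the entry DIVs (navigation, sidebars, and so on) is simply never visited, which is the point of agreeing on the names.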

Content authors shouldn't have to make two versions of all their content just because people are lazy in the way they make their client software. Valid XHTML is a hierarchical outline of content, presented in a machine-parsable manner. Augment this XHTML with proper use of link tags for navigation, and the loss of the page cruft that surrounds the content on a typical HTML page wouldn't be missed at all. Even better, an XQuery-based search engine could give you a Google engine that returned relevant entries from a site, instead of an entire page, therefore rewarding people who go through the effort by making them participants in a new, better targeted web that's fully backwards compatible. Existing pages could probably be rewritten by proxies, if the authors are unable or unwilling to reflag their content. Simply iterate through the nodes in the body of the document, find the highest-level node that repeats and contains other content, and you've got the pattern that delimits individual entries. Or look for # named anchors that suggest that they're permalinks. Transform those through XSLT into elements with a predictable set of names.
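The "highest-level node that repeats" heuristic could be sketched roughly as follows. This is only one possible reading of it: walk the body breadth-first and stop at the first parent that has two or more non-empty children sharing the same tag and class. The function name and the sample markup are made up for the example, and a real page with a multi-item navigation list above the content would fool it, so treat it as a starting point rather than a working proxy.

```python
import xml.etree.ElementTree as ET
from collections import Counter, deque


def find_entry_pattern(body):
    """Breadth-first search for the shallowest repeated (tag, class) signature.

    Returns the signature and the matching nodes, or (None, []) if nothing repeats.
    """
    queue = deque([body])
    while queue:
        parent = queue.popleft()
        children = list(parent)
        # Only children that contain other elements count as candidate entries.
        counts = Counter((c.tag, c.get("class")) for c in children if len(c) > 0)
        repeated = [key for key, n in counts.items() if n >= 2]
        if repeated:
            key = max(repeated, key=lambda k: counts[k])
            nodes = [c for c in children
                     if (c.tag, c.get("class")) == key and len(c) > 0]
            return key, nodes
        queue.extend(children)
    return None, []


# Hypothetical page body used only to exercise the heuristic.
PAGE = """<body>
  <ul id="nav"><li><a href="/">home</a></li></ul>
  <div id="content">
    <div class="post"><h3>One</h3><p>a</p></div>
    <div class="post"><h3>Two</h3><p>b</p></div>
    <div class="post"><h3>Three</h3><p>c</p></div>
  </div>
</body>"""

signature, nodes = find_entry_pattern(ET.fromstring(PAGE))
print(signature, len(nodes))
```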

So, the proposal? A documented standard set of XHTML element names targeted at standardizing class names for page elements, in order to allow HTML to serve as a syndication, aggregation, and distribution format, in addition to being a page rendering format. The side benefit would be that any tool that produced compliant code would probably also be able to share style sheets with other compliant tools, as page elements with the same name would inherit the appropriate styles. A lot easier than forcing tools to output multiple versions of content each time a page is changed. And adaptable to situations like a newspaper, so that articles using the naming convention could also appear in an aggregator.

Rich XML-based descriptions of content are great, and will always have their place. But for something as simple as syndication and distribution, HTML already has an overwhelming advantage over any nascent formats. Who wants to propose a set of basic tags?

10 TrackBacks

Making soup from FCD/the weblog on November 23, 2002 8:49 AM

I don't pretend to understand the various formats and protocols that make this, and other, sites work. I have a Read More

How much power should we allow the CLASS attribute to have in providing semantic structure? <h3 class="BlogPostTitle">First p0st!</h3> Consider the Read More

Here's a proposal on a standard XHTML format for blogs to do away with RSS. It makes sense. Why bother Read More

as days pass by from XHTML instead of RSS on November 24, 2002 5:33 PM

TITLE: as days pass by URL: http://www.kryogenix.org/days/000372.cas IP: 66.33.208.17 BLOG NAME: XHTML instead of RSS DATE: 11/24/2002 05:33:28 PM Read More

Anil Dash proposes to use XHTML itself for content syndication. He calls for a standardised way of posting entries to a weblog - "All that would be needed is standardization of names and classes for page elements like DIVs and headers. A post/entry tit... Read More

Interesting conversation happening at dashes.com re. Anil's suggestion that XHTML would make a perfect syndication format. There are posters there Read More

XHTML vs. the World, as seen by Tantek Celik. Base of the whole article is against RSS, and how using Read More

It looks like people are starting to wake up to the notion that XML is, well, extensible. You don't need separate syndication and archiving formats. You don't need separate syndication and display formats. The most extreme example I have seen to d Read More

I've been reading a lot lately about using XHTML instead of RSS for syndication of a site in a new Read More

I'm getting a little sick of RSS. Maybe I'm missing the point, or some piece of technical knowledge, but the Read More

45 Comments

I like the idea.

It would make subscribing to a site easier, since one wouldn't have to find the URL of the RSS feed. The URL of the weblog itself would be what you wanted. How nice.

I'm with you on this too. Pardon the self-link, but I've written a half-baked version of the idea that you are expressing much more clearly (and realistically) here.

Steven, I remember reading your BlogML idea when you posted it, but I think at the time it seemed unrealistically demanding to ask people to use a custom XML format to encode their blogs. Now I realize we've already got almost the right format in XHTML, if we can all just agree on a few standards. But you absolutely encapsulated the benefits of such a system. A few other people have mentioned similar ideas, too, I'm just hoping to present it in a way that makes sense to more people.

I like the idea, but at its simplest it seems to take the decision away from the publisher as to how much info to aggregate to syndication. A lot of authors are loath to implement RSS because they don't like the idea of other people aggregating the copy on their site wholesale and repurposing it, within another site, within an RSS viewer, or wherever, but without taking along the original site's look and feel.

The other reason that RSS is a good idea is because it dramatically reduces server load, as only a skeleton XML doc needs to be requested and retrieved, instead of the heavy front pages so many of us employ. Having each of my readers download the front page 12 times a day in the spirit of syndication weighs heavy at the bandwidth pump...

I believe entirely that aggregation should not be dependent on syndication based in XML. Unfortunately, I do not think that one set of XHTML tags can be drafted for use for specific functions in such aggregation. There will always exist a time when an H3 tag stands for something other than a post title. Many of the tags within XHTML have structural meanings attached to them.

Having said that, and acknowledging the benefits of many current syndication standards (RSS, etc.), I think a solution may be to implement a way for aggregators and browsers to understand the content on a page and be able to present it in a manner like you suggest in this post. Even this idea requires work on the part of the page's developer, but I think it could be easier than figuring out how to get data to fit in a syndication format (or which syndication format to even use). In the spirit of the Microcontent client you talk about and with efforts like the Semantic web, software should be able to understand (or at least infer) meaning from pages and act accordingly.

A similar application that I have been thinking a lot about lately is for tools that expect to serve up content to other pages (possibly via RSS, but also via remotely linked Javascript).

One example that already supports a separation of content from formatting is Blogrolling.com (see http://www.blogrolling.com/css_documentation.phtml).

There will always exist a time when an H3 tag stands for something rather than a post title.

I agree, but (as I probably didn't make clear enough) this idea would specify that only an H3 tag with a class set to "XXtitle" would be parsed as a syndication title.

the real problem with this suggestion is that it's not extendable. say i want to include information not in the standard set of class names, how do i go about doing that? what if i want to include metadata that doesn't necessarily belong on the page? i can think of many more things that this solution isn't considering. on top of it all, page bloat is an issue too.

I've been deeply engrossed in the subject for an article for O'Reilly that was just published and had to comment on this.

A set of tags already exists: they're called RSS. The point of XHTML (or one of them at least) is that it can leverage the full range of XML toolkits and specifications. This includes XML Namespaces, which allow tags from other schemas to be included, thereby extending the original beyond its designed purposes. Some have already been experimenting with combining RSS and XHTML tags into their pages here. (Do a view source to see what I mean.)

What Anil proposes (H3 with this class...) has been done in the past, before we had XML -- it's called screen scraping -- albeit more refined screen scraping, but screen scraping nonetheless. It all seems rather retro to me.

Also, if I want to display entries with different styles, as I do on my weblog, doesn't defining one class limit me to one style? (I'm not the most up to date on CSS.)

I don't mean to sound harsh. I really don't. It's an interesting notion that has its merits. I can understand the argument that ideally content authors shouldn't have to make two versions. I think that in practice its limitations will outweigh its benefits.

As Kevin Fox notes, a separate syndication file is more bandwidth efficient, especially with aggregators and the like banging away frequently on them. Aggregators have recently improved from their early days of brute force updates -- downloading a feed on some interval regardless of changes. RSS is more about data (that just so happens to be about content), whereas XHTML is more about display. Combining the two is fine, but inefficient in that information necessary for one task must be ignored when used for the other.

I have other questions and concerns about such a combined approach that could affect its effectiveness. Where would "hidden" information (I'll refrain from calling it meta data) helpful to syndication go? How would I find and retrieve information that exists across the site -- such as a feed of recent comments?

I completely object to the claim that it's out of laziness, as Anil says. If a content author is "too lazy" to generate two versions of their content, I'd suggest that they author their content in RSS. You or I can convert RSS more efficiently and reliably into XHTML. RSS is for machine processing while XHTML is designed for display. In fact I could be really lazy as a content author and have multiple XHTML pages generated from one RSS file.

I think of RSS files as more of a Web service than a web page. That may help provide a different perspective.

No offense taken, I like the debate. Maybe it's a way of standardizing pages to make screen scraping easier and more consistent. If that's so, is that a bad thing?

And yeah, you could go into different namespaces, but you lose backwards compatibility and you lose the familiarity people have with existing HTML tags.

Also, CSS places no restrictions on having multiple classes on a single element. Some user agents will barf on it, but I'm too lazy to look up the specifics.

This sounds like the first implementation of semantic web concepts that could really make some ground. Why not propose a spec this weekend and leave it up for comment? If something got out there, I bet a good deal of people already converted to xhtml would implement it. It'd also give people an incentive to move forward with their code.

Personally, I'd prefer to do this over RSS. Even though I use a link element pointing to my feed and there's a huge orange button on my page, I got a handful of emails last week asking me if I had an RSS feed. I put my entire posts in the RSS file, and I had to code a page to generate it, and another to cache flat versions when updated. All in all, it was a pain in the ass, and simply renaming CSS classes that are already there would have been simpler.

RSS is for machine processing while XHTML is designed for display.

This point should not get lost.

While I don't pretend to be deeply engrossed in this stuff specifically, I have been deeply engrossed in data vs. display in frameworks for a long time (including other areas of HTML and XML), and I can tell you from a lot of experience that losing the separation between the two is a nuclear minefield waiting to happen.

You want data structured in a way that's efficient for storing, indexing and exchanging between stores. You want UI to be modeled according to very specific end-user tasks that can vary greatly.

Weblogs are a tiny, tiny island; maybe the center of your universe, but a tiny island nonetheless. Forcing the data into a single user model (in this case weblogs) is the job for authoring tools.

just my 2¢.

I'm with Matt -- this is something that you (we?) should spec out this weekend, and see if it makes any headway in the real world. I like the idea; I would love to see how it plays out when implemented, and the beautiful thing about Movable Type is that we can actually implement it in another index template without futzing with what's already there.

RSS is for machine processing while XHTML is designed for display.

I certainly hope you didn't lean on this point in your article, Timothy, because it's incorrect. Properly-structured XHTML is far more robust than RSS for providing syntactic structure for a Web document, and is just as machine readable. The fact that an <EM> is rendered as italicized text in a browser is completely incidental.

The more I learn about these issues, the more I become convinced that it's wrong to ask authors to jump through additional hoops to support formats for alternate endpoints like RSS newsreaders. At the end of the day, I'm paying the RSS tax through additional bandwidth and ensuring that what I put in my XHTML won't break my RSS (like matching character encodings and avoiding relative links).

If there was a standardized method of parsing XHTML for syndication, weblog-stylee, there wouldn't be anyone griping about RSS. No religious wars over format, and no problem for tools since anything that can parse XML should be able to parse XHTML.

That said, leaning on the CLASS attribute to provide such semantic meaning seems generally frowned upon, but I have yet to see it abused to an extent where people object. Maybe this will put the concept to the test.

Yeah, I'd like to see a proposed spec left up for comments.

Afterthought: although I'm behind the idea, I question the method of pressing specific elements into service. For example: above, Anil says post titles will be represented by an H3. This makes no sense in a document that does not also contain H1 and H2 elements. Likewise, using P (a block-level element) for a permalink is certainly incorrect (it should be the A element, no?) and doesn't help those who put their permalinks in-line with their posts, or as the post title, as I do on my site.

Perhaps instead of forcing a particular tag order which may or may not make sense in the context of the document, we should consider using the CLASS attribute alone. In this case, *any* element can be a post title, post body, permalink, etc. provided it is of the appropriate class.

Again, it's a test of the "how much power do we give CLASS to override the meaningfulness of the element?" argument, which is really what I'm primarily interested in. ;)

Scott, I was talking over this idea with Jason Levine tonight, and he was suggesting the same idea, that classes be assigned more weight than the particular semantic element which is assigned to a piece of data. I'm not necessarily averse to that idea, as this is still a concept that's very much forming in my mind. (That's why there's no spec for comment yet, just a general idea.)

Once we get more of discussion on this, I'm sure we'll all build a good sample spec to solicit comments and feedback.

Keep in mind that if retrievals of RSS feeds are a bandwidth problem for some now, automatically retrieving the full XHTML page, which often includes static, non-blog content, just to parse it for the new content would be a bigger problem.

I keep hearing complaints about traffic for hits by aggregators, but I don't see how that's any different than any other rude spider that hits a page too often... don't the solutions that were suggested in the RSS realm work for HTML as well? What about caching, like we do with HTML now? Do we need a dynamically-generated MD5 meta tag in the head of the document? Or will a simple time stamp do?

I'm with Scott on relying on class independent of tag. I'm not sure if this is the best place for it to happen (let me know if not), but I've made a list of the basic elements of a blog post that could be included. It's pretty obvious, but I'm sure I've missed something:


  • title

  • authorname

  • authorlink

  • timestamp

  • permalink

  • commentscount

  • commentslink

  • bodytext
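As a rough illustration of relying on class alone, the Python sketch below pulls the fields from the list above out of an entry no matter which element carries them. The field names come from the list; the sample markup, the author name, the example.com URL, and the choice to read `href` for the three "link" fields are assumptions for the sake of the example. Class values are split on whitespace, so an element can carry a syndication class next to its styling classes.

```python
import xml.etree.ElementTree as ET

# Field names taken from the list above.
FIELDS = {"title", "authorname", "authorlink", "timestamp",
          "permalink", "commentscount", "commentslink", "bodytext"}


def fields_from_entry(entry):
    """Collect the first occurrence of each known class inside an entry element."""
    found = {}
    for node in entry.iter():
        for name in (node.get("class") or "").split():
            if name in FIELDS and name not in found:
                # Assumption: link-type fields prefer the href attribute if present.
                value = node.get("href") if name.endswith("link") else None
                found[name] = value or "".join(node.itertext()).strip()
    return found


# Hypothetical entry: note the H2 title and the extra "footer" styling class.
ENTRY = """<div class="entry">
  <h2 class="title">On syndication</h2>
  <span class="authorname">A. Writer</span>
  <span class="timestamp">2002-11-22T10:00:00</span>
  <div class="bodytext"><p>XHTML can be <em>the</em> feed.</p></div>
  <a class="permalink footer" href="http://example.com/2002/11/22.html">#</a>
</div>"""

print(fields_from_entry(ET.fromstring(ENTRY)))
```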

Anil,

Aggregators in theory honor the Last-Modified header, but if your RSS is dynamically generated like mine, it won't work; the server considers every call to the script its last modification.

If you send a handmade Last-Modified header as well as a 304 Not Modified, all the popular aggregators seem to behave:

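<?php

# $timestamp is assumed to hold the Unix time of the newest entry
# format timestamp for Last-Modified header
$last = gmdate("D, d M Y H:i:s \G\M\T", $timestamp);

# send it
header("Last-Modified: $last");

# compare it to the aggregator's If-Modified-Since header
# (the key must be quoted, and the header may be absent)
# if they match, send a 304 and stop before generating the feed
if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) && $_SERVER['HTTP_IF_MODIFIED_SINCE'] == $last) {
    header("HTTP/1.1 304 Not Modified");
    exit;
}

# start xml here...

?>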
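The same decision can be written as a small pure function, sketched here in Python so it can be tried without a web server. Unlike the string comparison above, it parses the date, so a client that sends a later or differently formatted timestamp still gets a 304. The function name and the return shape are made up for this example; only the header names and status codes come from HTTP itself.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a feed last changed at `last_modified`.

    `last_modified` must be a timezone-aware UTC datetime;
    `if_modified_since` is the raw request header value, if any.
    """
    last_modified = last_modified.replace(microsecond=0)  # HTTP dates have 1s resolution
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full response
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return 304, headers
    return 200, headers


changed = datetime(2002, 11, 24, 17, 33, 28, tzinfo=timezone.utc)
print(conditional_response(changed))
print(conditional_response(changed, "Sun, 24 Nov 2002 17:33:28 GMT"))
```

A caller would send the headers in both cases and generate the feed body only when the status is 200.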

See also: Karl Dubost - De XHTML à RSS

Comment créer son feed RSS à partir de son site en XHTML 1.0 ou XHTML 1.1 ? Compliqué ? En bon normand, je vais répondre oui et non. Je vais tenter de l'expliquer en s'appuyant sur la technique utilisée pour cette page.

Translation : Is converting XHTML to RSS hard? Yes and no. Here's how I do it for my site.

If there's a demand, I can add translating this set of documents to English to my list of things to do. (If someone else wants to do it first, please don't let me stop you.)

As much as I think simply using XHTML as a syndication language is appealing, I tend to consider it also a bad idea. It smacks a bit too much of Grand Unifying Theory-itis for my taste.

After all, if we ever get to the point where it's all XML (hahaha), just transform it.

Excellent idea, Anil. I'd suggest that, to avoid collision with classes that may already exist within a given site, a string be prepended to the standardized class names, like wsf (weblog syndication format). Not sure if you can use an underscore in class names, so the final, standardized class name would look like wsfTitle or wsf_Title to indicate a blog entry's title.

I recall underscores are bad for CSS class names, and some quick googling turned this up (which seems woefully out of date, but I could swear I saw early mozilla betas (

Yeah, I had thought of the prefix, too. I had put "XXTitle" as a class name above, but I probably should have italicized the XX to indicate that it was some unknown prefix. I still own AllWrite.org if anybody wants to use AW as the prefix. Heh.

I'm not going to pretend to know everything about this subject, but I recently had a similar idea and googled around to see what else was out there about it and came up with this link: http://rss.benhammersley.com/archives/001076.html

and I thought it might bring up some interesting points.

What about the meta info about the blog - like its title, author's name, etc.? Stash it in standardized <meta> tags?

I tend to regard merging them into one form as a bad idea. The two different formats exist for different purposes.

As someone else has pointed out, this'd increase bandwidth usage because of all the other cruft that's included in XHTML page views, such as nav. Also, if I'm going to be viewing XHTML pages why wouldn't I just bookmark a set of tabs for all my sites and open that every day? I prefer the convenience of having all the aggregated content together on one page which is why I like it how it is, however if it's all going to be driven off the same page why even bother with syndication when browsers are perfectly capable of handling it?

Scott, have you looked at the XML Base recommendation for alleviating your relative links problem? It'd be interesting to see what, if any, client support it has.

Meta-info about a weblog isn't necessarily weblog-wide; quite a few have multiple authors, for example. There needs to be some way of meta-data existing (and being parsed) on a post-by-post basis for at least *some* fields, and if we can do it for some fields we should do it for all...

Someone has already started to translate the series, Aaron.

I will put it online when it is done. :)

Structure de ce site
Un document est un arbre.
Coupe et bouture.
Mots du Poète

Can anyone explain to me why the W3C frowns on using class and id attribute values as semantic markers?

I drafted up a standard a while back to provide inline metadata using div and span tags.

Outline here:

http://groups.yahoo.com/group/syndication/message/2283

And a simple explanation of an earlier non standardised version here:

http://scriptingnews.userland.com/backIssues/2001/02/16

Basically, having looked at this for Moreover, inline markup is very important, since metadata and content don't get out of sync. Essentially, there eventually becomes no difference between RSS and some defined use of XHTML based upon standardised div and span elements, which I was proposing be called SWML. I would be very keen to work with some people on this.


I have one question which I hope someone can email me with an answer: writing a parser to extract metadata from properly labeled div and span elements is harder than for new tags, since you have to count nested tags.

e.g. <span class="headline"><span class="english"></span></span>

is harder than (and also causes problems if you extract portions of a document)

<headline><english></english></headline>

Is the following legal XHTML?

<span class="headline"><span class="english">

</span class="english"></span class="headline">
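One quick way to check the question above is to hand both forms to an XML parser. The Python sketch below uses the standard library parser; the helper name and the "Hello" text are made up for the example. In XML an end tag carries only the element name, so the version with attributes on the closing tags is rejected as not well-formed. The nesting worry mostly goes away too, because the parser keeps track of which closing tag belongs to which element, so nothing has to be counted by hand.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


plain = '<span class="headline"><span class="english">Hello</span></span>'
attrs_on_end_tags = ('<span class="headline"><span class="english">Hello'
                     '</span class="english"></span class="headline">')

print(is_well_formed(plain))              # nested spans with bare end tags
print(is_well_formed(attrs_on_end_tags))  # attributes repeated on the end tags

# The parser resolves the nesting, so the inner span can be addressed directly.
headline = ET.fromstring(plain)
print(headline.find("span[@class='english']").text)
```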

With regard to Stuart's observation about post-specific metadata:

In the case where that information is already being displayed (for example, if I were to have a post's author as part of the blog format) then that information's containing tag would have its class set to the wsfEntryAuthor class, like <p class="wsfEntryAuthor">C'est Moi</p>


In the case where that information isn't included in the blog as part of the information that it displays, then a simple workaround is to provide that information (I'll use the example of an entry's primary category) as:

<p class="wsfEntryCategory">Anil's WSF Proposal</p>

and in the stylesheet, set the wsfEntryCategory's display value to none. The information will still be present, so an aggregator will be able to find it, but browsers won't render it.

RSS is for machine processing while XHTML is designed for display.

I certainly hope you didn't lean on this point in your article, Timothy, because it's incorrect.

I did not. Thanks for reading it before commenting Scott. ;)

The point of that statement was that it is a matter of purpose. Syntactic structure of a Web document is not the intended purpose of RSS. RSS was designed to syndicate a collection of online resources called a channel. Admittedly the RSS format and its documentation have been a disaster and are lacking, so let me clarify: RSS files were never intended to contain or transport HTML. If you go back to the version 0.91 format documents, you will see that the description was to be "plain text" that was required/recommended to be 500 characters or less. The <description> element was to be an excerpt or brief abstract of the content on the other side of the <link>. What has become common practice is the work of a one-man design committee and his personal agenda. Let us not confuse intent with misuse.

The recommendations I present in the article I mentioned attempt to strike a balance between the original intent of the format and how it is being commonly used. On a personal level, I'm more of an RSS-as-it-should-be hard-liner -- no markup in my feed. I do provide a full content feed because some people prefer it, and to me it's trivial to produce. I have been tempted to take it down on principle though.

The more I learn about these issues, the more I become convinced that it's wrong to ask authors to jump through additional hoops to support formats for alternate endpoints like RSS newsreaders. At the end of the day, I'm paying the RSS tax through additional bandwidth and ensuring that what I put in my XHTML won't break my RSS (like matching character encodings and avoiding relative links).

I can appreciate your frustration; however, the solution is simple: don't put XHTML in RSS. You and Anil are both MovableType users like myself. Add a strip_html="1" to your tags. Those who want to see the full stylized presentation of your content can use their browsers.

Since you are both MT users, I have also found your comments a bit curious, because RSS feeds are generated automatically by that tool. When I began handling development of an RSS Feed plugin for MT, I was surprised at how many users didn't know what an RSS feed was or that it was being generated for them.

I still don't understand the argument about bandwidth.

I hope this clarifies matters a bit. I'm all for experiments such as these, though I'm personally skeptical that it will be more successful than (proper) RSS and XHTML working together.

There's really little I can say that Timothy hasn't said, except to agree with his comments. In particular, I have no interest in aggregators pulling my content out wholesale.

Before I removed content-encoded from my template (and my RSS file), News is Free grabbed one of my postings, which happened to have several photos in it. Any time anyone accessed the category (politics in the Middle East) my aggregated feed was displayed with everyone else's, but mine included the photos. Needless to say, my server was hammered for photo requests.

As for RSS religious wars, seems to me adding XHTML into the mix just adds another combatant. More Daypop joy juice.

Still, regardless of my personal view of RSS/XML vs. RSS/XHTML, this was an extremely useful comment thread and discussion, and I thank you for starting it.

I did not. Thanks for reading it before commenting Scott. ;)

My apologies, Tim. Anil's links are not underlined in my browser and I missed the link to your article. Mmrph. You'll have to forgive my reactionary bristling to statements like "XHTML is designed for display."

RSS is not going to be replaced by another so-called "standard" ... sorry, but there is already too much momentum behind RSS in the Blogging world. Most mere mortals don't give a crap if it's RSS/XHTML or not.

Look at http://www.terra.es/personal4/alsanan/xml/index.xml . I did it some months ago. I wanted RSS to be the place where my stories get placed. But the same XML has a reference to a XSL filter that makes the browser turn the tags into a beautiful webpage.

My opinion is that RSS2.0 is a pragmatic response to the purpose of XHTML/CSS.

I see RSS+Aggregators as a good example of what the Semantic Web wants to achieve.

"The enemy of good is better".

What do you think?

Folks -

Interesting discussion - and very similar to a project that I'm currently working on with a number of other Blogosphere folks, the Weblog MetaData Initiative.

WMDI is an effort to define standard metadata tags for weblogs, just as Anil proposes. However, our approach will most likely be to support many different methods of encoding that metadata in a weblog --- possible techniques include HTML META tags, XML/RDF embedded in templates, HTML comments, and yes, now that you mention it, a full XHTML approach.

Our reasoning is that restricting the encoding to a single implementation will make widespread adoption more difficult, particularly across different weblog tools which, necessarily, have different approaches to how they handle content.

While this obviously increases the complexity of any application to parse the data, our intent is to mitigate this by creating, as a part of our effort, an open-source parsing engine / reference application which is intended for use as a starting point for anyone implementing applications based on WMDI.

The element-focused approach is similar to that of the Dublin Core metadata initiative, which also focuses on conceptual elements without restricting the technical implementation. Dublin Core provides a very limited core set of metadata which is being leveraged by many specific projects in domains like weblogs; WMDI is one of them as we're using DC elements as a starting point.

We're actively working to finalize our specification; a preliminary version can be found here. And we've got a working, if very limited, demo application up and running here which pulls data from volunteers' weblogs who have marked up their blogs according to our preliminary spec; that page can be found here. In the next week, we'll likely be moving much of our activity to SourceForge as we get more serious about laying down a firm code base.

At any rate, we're an open effort, and would welcome participation from anyone, so feel free to drop by our homepage and chime in to the discussion forums there or, drop me an email.

-N.Z. Bear

the w3c has a server-side gizmo that converts xhtml to rss, by defining exactly what you are explaining. link:
http://www.w3.org/2000/08/w3c-synd/
it isn't a recommendation or anything, but anything coming from the w3c is standard enough for my tastes.

It's just plain stupid playing with classes and shit in "Non-modular X[HyperText]ML" to produce "R[Site][Summary]", and vice versa. RSS is a light-weight syndication language while XHTML is a rich and highly presentational language. Period.
If you want to add RSS to your XHTML or v.v., you're free to do so through the wonders of XML. For example, you can add only the 'Text Module' from XHTML 1.1, use a different namespace to avoid collision and modify the Document Type Definition (DTD).
Now then, if any XML-compliant User Agent doesn't support all your "new" fancy tags, it will just ignore them and render its contents (like in HTML).
If you want to "fix" this, or if you think it's really tricky to write this mumbo jumbo language, you can use the wonders of XSL[Transformations] to transform it, and then feed the monster with something it'll swallow.
