RDF: not for the faint of heart
They say the RDF in RSS 1.0 will let people do cool things. They say the RDF in RSS 1.0 will allow for unexpected connections. Okay, so I figured out a cool thing to do with the RDF in RSS 1.0: convince everyone I can in my circle to provide an RSS 1.0 feed not only of their entries, but also of their comments, and ideally to use the same name in their blogs and in comments they leave, so that all the <dc:creator>s and <dc:contributor>s will have the same value. Then I grab all the feeds once an hour, parse the RDF, and throw it into an RDF database. To make the RDF useful (since otherwise I could just be parsing RSS 0.9x/2.0), I also grab the XHTML web page, and parse out the TrackBack RDF, to get the TrackBack URL, and use that to grab the TrackBack RSS feed for each post. Then when you want to know everything that Shelley wrote (within my sources, anyway), you just search for, um, whichever name she uses (@@@ grrr, how would this actually work? @@@) within a date range, to get a list of posts (in her weblog or any collaborative weblogs she might belong to) and comments. Want to see a comment in the context of the post and the other comments? Click a link, since the rdf:Seq is there to let an RDF parser keep track of the order of items. Want to see a post in the context of a TrackBack thread? No problem, I’ll know that too.
So, I Google for an RDF parser in PHP, download the PHP XML Classes that look like the best bet, and point rdfdump.php, which parses RDF and prints out the RDF statements it finds, at my RSS 1.0 feed. And then reality sets in. After fixing a few of the more glaring errors in my feed, I can see some things that are usable: down among the <item>s, I can see that ordinal(0) triple("http://philringnalda.com/archives/002307.php", "http://purl.org/rss/1.0/title", literal("The clue-by-four hits"))
says “The title of entry 2307 is ‘The clue-by-four hits’.” But, um, what exactly is ordinal(9) triple(anonymous("http://www.philringnalda.com/index.rdf#genid1"), "http://www.w3.org/1999/02/22-rdf-syntax-ns#_9", "http://philringnalda.com/archives/002295.php")
trying to tell me? “The RDF_9 of this file’s #genid1 is entry 2295”? And once I add this entry, it should tell me that “The RDF_9 of this file’s #genid1 is entry 2296.” And you thought generating the rdf:Seq was confusing. Clearly this is going to take some more thought.
Then I pointed rdfdump at my XHTML page, to see what the TrackBack RDF would look like. Uh oh, judging by the error it throws, the RDF parser is yet another thing that doesn’t approve of RDF embedded in XHTML. So unless I’m missing something, I’ll have to parse out the RDF, and then parse it (which I’d probably have to do anyway, what with people commenting it out rather than use cloaking to hide it from validators). Sigh. I can’t help thinking how easy it would be to add mod_trackback to RSS 0.94/2.0, and just parse the XML as XML.
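A minimal sketch of that "parse out the RDF, and then parse it" step, in Python rather than PHP. It assumes the embedded block uses the literal `rdf:` prefix and declares its own namespaces, as the standard TrackBack snippet does; the function name `trackback_pings` and the example.com URLs are made up for illustration. Because it searches the raw page text, it finds the block whether or not someone has wrapped it in an HTML comment.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TB_NS = "http://madskills.com/public/xml/rss/module/trackback/"

# Match each <rdf:RDF>...</rdf:RDF> island in the raw page text.
# Assumption: the prefix is literally "rdf:", as in the usual TrackBack snippet.
RDF_BLOCK = re.compile(r"<rdf:RDF\b.*?</rdf:RDF>", re.DOTALL)


def trackback_pings(page_source):
    """Return {permalink: ping URL} for every TrackBack RDF block found."""
    pings = {}
    for match in RDF_BLOCK.finditer(page_source):
        try:
            root = ET.fromstring(match.group(0))
        except ET.ParseError:
            continue  # skip islands that are not well-formed on their own
        for desc in root.iter("{%s}Description" % RDF_NS):
            about = desc.get("{%s}about" % RDF_NS)
            ping = desc.get("{%s}ping" % TB_NS)
            if about and ping:
                pings[about] = ping
    return pings


# Hypothetical page with the RDF commented out to keep validators quiet.
SAMPLE_PAGE = """<html><body>
<!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
<rdf:Description
    rdf:about="http://example.com/archives/000001.php"
    dc:title="Hypothetical entry"
    trackback:ping="http://example.com/mt-tb.cgi/1" />
</rdf:RDF>
-->
</body></html>"""

if __name__ == "__main__":
    print(trackback_pings(SAMPLE_PAGE))
```

The regular expression only locates the island; the actual reading of `rdf:about` and `trackback:ping` is still done by an XML parser, so attribute order and whitespace inside the block don't matter.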
The two things you say you’ve found confusing are actually the same thing – the rdf:Seq.
“#genid1” is what your RDF parser is using to refer to the Seq. Remember, RDF parsers identify everything by URI; when they can’t, stuff like #genid1 (not the best way for a parser to do it, but not dreadful) is the RDF parser picking a name for something it doesn’t have a URI for.
Somewhere else in your triples you’ll find one with the channel as the subject, http://purl.org/rss/1.0/items as the predicate, and the mysterious #genidWhatever as the object – that mysterious resource is your collection of items.
Similarly <mysteriousResource> <http://www.w3.org/1999/02/22-rdf-syntax-ns#_9> <http://philringnalda.com/archives/002295.php> means that http://philringnalda.com/archives/002295.php is the ninth member of the items collection. <rdf:li> is just a simplification so you don’t have to keep track and code <rdf:_1/> <rdf:_2/> etc.
The mysterious resource is what links your channel to your items. It’s why us RDF lot want it to stay there.
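The expansion described above can be sketched in a few lines of Python. This is an illustration of the idea, not how rdfdump.php or any particular parser works: the function name `items_triples`, the made-up node name `_:genid1`, and the trimmed-down sample feed (no titles or links) are all assumptions. It emits the channel-to-Seq triple and then one numbered membership triple per `<rdf:li>`, and leaves out the triple that types the node as an rdf:Seq.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"


def items_triples(feed_xml, seq_name="_:genid1"):
    """List (subject, predicate, object) triples for a channel's items Seq.

    seq_name stands in for whatever name a parser invents for the Seq,
    which has no URI of its own.
    """
    root = ET.fromstring(feed_xml)
    channel = root.find("{%s}channel" % RSS)
    seq = channel.find("{%s}items/{%s}Seq" % (RSS, RDF))

    # The channel points at the unnamed Seq through rss:items.
    triples = [(channel.get("{%s}about" % RDF), RSS + "items", seq_name)]

    # Each rdf:li becomes rdf:_1, rdf:_2, ... in document order.
    for n, li in enumerate(seq.findall("{%s}li" % RDF), start=1):
        # Feeds in the wild write either rdf:resource or a bare resource attribute.
        target = li.get("{%s}resource" % RDF) or li.get("resource")
        triples.append((seq_name, "%s_%d" % (RDF, n), target))
    return triples


# Hypothetical feed, cut down to just the parts this sketch reads.
SAMPLE_FEED = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.com/index.rdf">
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://example.com/archives/000002.php"/>
        <rdf:li rdf:resource="http://example.com/archives/000001.php"/>
      </rdf:Seq>
    </items>
  </channel>
</rdf:RDF>"""

if __name__ == "__main__":
    for triple in items_triples(SAMPLE_FEED):
        print(triple)
```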
BTW, I can stick HTML into this thing; have you checked this for safety against XSS attacks?
Okay, I think I’ve got it: items is a class, #genid1 is an instance of items, and the value of #genid1->9 is entry 2295. And since my parser doesn’t know from the last time it parsed the same feed, for each item I need to check my db to see if I’ve seen it before, and for items I haven’t seen assign my own sequence numbers based on their sequence in the RDF. Even I should be able to do that, after a little thrashing.
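A rough sketch of that check-the-db-then-number-it step, using Python's built-in sqlite3 module. The table layout, the function names `open_store` and `record_items`, and the assumption that the feed lists its newest item first are all mine, not anything the feed or the parser dictates. The rdf:_n positions shift every time an entry is added, so the store keys on the item URI and hands out its own numbers.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a tiny item store keyed by item URI."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        " seq INTEGER PRIMARY KEY,"
        " uri TEXT UNIQUE NOT NULL)"
    )
    return db


def record_items(db, uris_newest_first):
    """Store unseen items and return the (seq, uri) pairs that were added.

    Assumption: the Seq lists the newest item first, so it is walked
    backwards to give older items the lower sequence numbers.
    """
    added = []
    for uri in reversed(uris_newest_first):
        seen = db.execute("SELECT 1 FROM items WHERE uri = ?", (uri,)).fetchone()
        if seen:
            continue  # already stored on an earlier fetch
        (next_seq,) = db.execute(
            "SELECT COALESCE(MAX(seq), 0) + 1 FROM items"
        ).fetchone()
        db.execute("INSERT INTO items (seq, uri) VALUES (?, ?)", (next_seq, uri))
        added.append((next_seq, uri))
    db.commit()
    return added


if __name__ == "__main__":
    db = open_store()
    # First fetch: two entries in the feed.
    print(record_items(db, ["http://example.com/2.php", "http://example.com/1.php"]))
    # Next fetch: one new entry on top, the old ones slide down a slot.
    print(record_items(db, ["http://example.com/3.php", "http://example.com/2.php",
                            "http://example.com/1.php"]))
```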
Still not sure what I’ll be able to point to and say “see: that’s why I need RDF”, unless it will somehow let me figure out that since A uses email address B with homepage C and D uses email address B with homepage F, that A & D are probably the same person, with homepages at C and F.
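For what it's worth, the same-email inference doesn't need anything exotic once the statements are in one place. Here is a hedged Python sketch that groups names sharing an email address and pools their homepages; the function name `probable_same_people` and the (name, email, homepage) input shape are assumptions for illustration, and a shared address is of course only a hint that two names belong to one person.

```python
from collections import defaultdict


def probable_same_people(sightings):
    """Group names that share an email address.

    sightings: iterable of (name, email, homepage) tuples.
    Returns a sorted list of (names, homepages) pairs, one per group.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_name_for_email = {}
    homepages = defaultdict(set)
    for name, email, homepage in sightings:
        find(name)  # make sure the name is registered
        homepages[name].add(homepage)
        if email in first_name_for_email:
            # Same address seen under another name: merge the two groups.
            parent[find(name)] = find(first_name_for_email[email])
        else:
            first_name_for_email[email] = name

    groups = defaultdict(lambda: (set(), set()))
    for name in parent:
        names, pages = groups[find(name)]
        names.add(name)
        pages.update(homepages[name])
    return sorted((sorted(n), sorted(p)) for n, p in groups.values())


if __name__ == "__main__":
    print(probable_same_people([
        ("A", "b@example.com", "http://c.example/"),
        ("D", "b@example.com", "http://f.example/"),
        ("E", "e@example.com", "http://g.example/"),
    ]))
```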
XSS? I did do some work on tightening up the way it handles the email/homepage link to prevent XSS attacks for people who don’t allow HTML in their comment bodies, but my feeling is that since new comments are immediately emailed to me as plain text, and since I’m the only one with a cookie worth stealing, as long as I take a look at the HTML in new comments from people I don’t know, I’m fairly safe. Possibly moderately, but I think fairly.
Nobody’s going to understand this one…
…but every time I see one of the recent tech weblog posts about “RDF in RSS” (which, to be honest, I barely understand myself), I keep thinking that RDF stands for Steve Jobs’ Reality Distortion Field: reality-distortion field n. An expression used to…