A little RDF help, please?
I’m having yet another problem with (anonymous|blank) nodes: I think I’ve pretty well got the concept down (either you’re talking about something intrinsic to the feed, like the accursed rdf:Seq, or you’re talking about something that you can’t give a resource for (what’s the URI that will return Mark Pilgrim himself?), or something that you can’t be bothered to name (it can be useful to say that some grain of sand weighs this much and has this hardness, but giving that grain of sand a name is a bit much)), but… how the hell do I manage and use #genidx in my app?
My toolkit generates genids like “http://philringnalda.com/index.rdf#genid1” when you are using the raw parser, but using the database class, it just generates “#genid1”, from what seems to be the assumption that you’ll parse a particular file once, and then that’s it.
Of course, parsing RSS isn’t going to work like that. I could just add each hour’s parsing as new documents with a new docKey from the URL + timestamp, but it seems silly to save all the static channel stuff over and over, 24 times a day, not to mention saving a single item dozens of times over until it finally leaves the feed. So, when I parse I only add new statements, and change existing ones as they change. But what am I supposed to do with blank nodes?
I can add a timestamp, so that I can keep #genid1T1032767767 separate from #genid1T1032766067, but then what am I going to do with them? Suppose I have six items from one docKey that I want to display in sequence order. I select everything from my docKey with a <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> of <http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq>, getting dozens or hundreds of #genidTns. Then I select everything with those #genids as the subject and a predicate of <http://www.w3.org/1999/02/22-rdf-syntax-ns#_n> and one of my items as the object. Then, what?, I go through each set of #genids, looking for a set that includes all of my items, and there’s no guarantee that I’ll find one, since someone could have posted three of my items one hour, and then another three plus twelve more the next hour, so that all six were never in a single file together. Bah. Unless someone has a workable idea, I think my next revision of the code will add a cleanup routine after the parser finishes, to get every #genid that’s a sequence, and delete everything that’s even vaguely related to it. I’m less of a fan of RDF sequences every time I meet up with them.
RDF is built on a basis of persistence, one reason that I was more interested in putting RSS data as separate RDF blocks into individual files. RSS is based on impermanence, with each item being pretty much of a throwaway after it’s consumed by the aggregator. I’m beginning to think myself that the two are not compatible, at least, not as we currently define and use RSS.
Unfortunately, your toolkit is out of date, but that’s understandable. The new spec (which is still only a working draft) no longer allows generating fake URIs for blank nodes. The reason is exactly this type of situation: it makes blank nodes easier to find and to handle specially (for example, by removing them).
Phil, curious, are you pre-parsing the items into triples, first, and checking against database? Or are you using the class that automatically attempts to put the triples into the database, and then do cleanups? The reason I ask is that it sounds like you’re doing a lot of updates to the database.
If I understand Shelley’s answer correctly, I believe my parser (RDFLib for Python) is also out of date, in that it generates random-looking crap as the subject for a subjectless blank node.
Phil, I don’t know how to solve your problem, but I sympathize. And some day I do hope to become completely virtual, at which point I will of course be addressable by URI.
I’m parsing, then checking as it goes in. I just rewrote the function that the class calls to store a statement, to query for things that might be the same statement and then check whether it’s the same (s=s && p=p && o=o) or a change (s=s && p=p && o!=o). Same I forget about it, change I update, new I insert. (Calling RDQL_db::store_rdf_document($url,$docKey); to parse, but rewrote function _class_my_rdf_store_statement_handler, for those following along at home.)
I’d like to just ignore the items sequence right there, but I’m not sure I wouldn’t throw out the baby too:
#genidn #type #seq
<channel url> /items #genidn
#genidn #_n <item url>
(really should learn some proper syntax for abbreviations) is pretty obviously the accursed rdf:Seq when it’s all together like that, but in a function where I only have the current statement to look at, I’m a bit less sure it’s not some blank node from Ben doing some fancy FOAFness with the contributors in his comments. That’s why I’m thinking in terms of going back later: if I select all ?x <rdf:items> ?y (and if my #genids are actually unique), then I can get rid of item sequences without getting rid of things I need.
Or maybe I should just leave them in: I was a bit worried about scaling and the db size, what with running up to 3500 statements pretty quickly, but it looks like even though we say a lot in our RSS, we don’t say a whole lot that’s new: last night I had to flush the whole thing and start over after a bit of a bobble, and since then I’ve only added around a hundred statements. So most of the work is listening to the same things over and over and over again, waiting to hear something new. I’ve known people like that, too.
Ah, but if you are addressable by URI, then saying GET Mark Pilgrim can’t have any effect on the server, so you’d have to be creating duplicates, and I think Calvin’s example when he turned his Transmogrifier into a Duplicator pretty well showed that you don’t want to do that.
No Mark – random crap is what you should get. You shouldn’t get anything that remotely looks like a URI. You can use this as a guide, then, to know which nodes are blank nodes or not.
Phil, how do you filter the duplicates?
Just as a side note, the XML link buttons (for posts and comments) on the right hand column of the main page result in 404s. The RDF button works fine however.
Which would be rather funny, if I wasn’t trending in the other direction thanks to my frustration with the looseness of RDF. Thanks.
How do I filter dups? Stunningly poorly, at the moment. While typing I just realized that if you say that something has two values for a property (clouds are white, clouds are grey), then when I see you saying the second one, I assume you are correcting the first one, and I replace it. Stupid. And to avoid it, I have to either store everything you say every time as a new thing, 24 times a day storing the fact that the title of http://philringnalda.com is philringnalda.com, and then when I want to know the title of http://philringnalda.com deciding which of tens of thousands of titles is the one I want, or I have to give up on the idea of only storing new statements, and just delete everything you told me the last time each time I parse, or I have to special case everything I know about, saying “okay, I’ve got a triple: is it a channel title? is it a channel description? is it a channel dc:creator? is it foaf data in the channel? is it in the channel because it refers to someone who is always associated with the channel, or just associated with the items currently in the channel?”
You know what? It’s no wonder nobody has done anything to speak of that uses the RDF in RSS.
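To make that failure concrete, here is a toy Python model of the store handler described earlier in the thread (same: ignore, change: update, new: insert). It is only a sketch, not the PHP class's real code: the dict-based `db`, the name `store_statement`, and the `ex:` identifiers are all assumptions made up for illustration.

```python
def store_statement(db, s, p, o):
    """Toy handler: db maps (subject, predicate) -> a single object."""
    key = (s, p)
    if key not in db:
        db[key] = o          # new statement: insert
        return "insert"
    if db[key] == o:
        return "same"        # already known: ignore
    db[key] = o              # same s and p, different o: treated as a correction
    return "update"


db = {}
results = [
    store_statement(db, "ex:clouds", "ex:color", "white"),
    store_statement(db, "ex:clouds", "ex:color", "grey"),
]
print(results)  # ['insert', 'update']
print(db)       # {('ex:clouds', 'ex:color'): 'grey'}
```

The second statement overwrites the first, so "white" is gone. Keying on subject and predicate alone quietly assumes every property has exactly one value, which RDF does not promise.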
It depends on what makes sense for your application. RDF allows you to have as many values for a property as you want, and more, but there’s no rule that you have to pay attention to them all or even store them all.
I don’t know exactly what you’re doing, but here’s one way to deal with the blank nodes when you’re updating a channel.
In your database, you’ll have:
(channel) rss:items _:a .
_:a rdf:type rdf:Seq .
_:a rdf:_1 (item) .
and so forth.
Your newly parsed file will have these triples:
(channel) rss:items _:b .
_:b rdf:type rdf:Seq .
_:b rdf:_1 (item) .
Assuming you don’t care about the old channel, you can just delete _:a from your model, add in the statements about _:b, and change the value of rss:items for (channel) to _:b
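As a rough sketch of that swap, assuming the stored and newly parsed statements are both available as plain Python sets of (subject, predicate, object) tuples, and that the blank node labels from the two parses do not collide: the function name `replace_items_seq` and the example.org URLs are invented for illustration.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_ITEMS = "http://purl.org/rss/1.0/items"
RSS_TITLE = "http://purl.org/rss/1.0/title"
CHANNEL = "http://example.org/index.rdf"  # hypothetical channel URL


def replace_items_seq(stored, channel, parsed):
    """Drop the channel's old rss:items node and all statements about it, then add the new parse."""
    old_nodes = {o for (s, p, o) in stored if s == channel and p == RSS_ITEMS}
    kept = {(s, p, o) for (s, p, o) in stored
            if s not in old_nodes and o not in old_nodes}
    return kept | set(parsed)


stored = {
    (CHANNEL, RSS_TITLE, "Example"),
    (CHANNEL, RSS_ITEMS, "_:a"),
    ("_:a", RDF + "type", RDF + "Seq"),
    ("_:a", RDF + "_1", "http://example.org/item1"),
}
parsed = {
    (CHANNEL, RSS_ITEMS, "_:b"),
    ("_:b", RDF + "type", RDF + "Seq"),
    ("_:b", RDF + "_1", "http://example.org/item2"),
}
updated = replace_items_seq(stored, CHANNEL, parsed)
```

After the call, `updated` still holds the channel title, every statement mentioning `_:a` is gone, and rss:items points at `_:b`.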
(Whoops. Accidentally hit post before I was ready.)
Anyway, the syntax I’m using for RDF triples is called N3 and it’s designed for informal discussions like this. It represents blank nodes in the form “_:string”, where each string corresponds to a different blank node. It’s pretty handy.
If I used better tools, would it be as simple as you make it sound? I was using the only thing that seems to exist for PHP (the only language I’m even half good at), a port of Repat. So I don’t have a model, I just have newly parsed triples flying by me headed for the database, and my only place to stop them is one at a time just before they are stored. I don’t have (channel) rss:items _:a . in the database and (channel) rss:items _:b . in my newly parsed file, I have (channel) rss:items _:a . in the database and (channel) rss:items _:a . in my function, headed for the database. So my first step has to be to rewrite the parser to keep track of bNode IDs for a given source from one parsing to the next. Then, I probably need to rewrite the way it does storage, parsing everything into memory and not storing it until I’ve had a chance to look it over and do some modification, because what I really have (keeping in mind that a typical weblog RSS channel is only likely to change one (or less) items from one read to the next) is:
Database:
(channel) rss:items _:a .
_:a rdf:type rdf:Seq .
_:a rdf:_1 (itemV) .
_:a rdf:_2 (itemW) .
_:a rdf:_3 (itemX) .
Parser:
(channel) rss:items _:b .
_:b rdf:type rdf:Seq .
_:b rdf:_1 (itemW) .
_:b rdf:_2 (itemX) .
_:b rdf:_3 (itemY) .
and I have to sync them up so that I end up with:
_:a rdf:_4 (itemY) .
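One way that sync step might look, sketched in Python under the assumption that both the stored and the freshly parsed statements are held in memory as (subject, predicate, object) tuples. The helper names `members` and `sync_seq` and the `ex:` item names are hypothetical.

```python
import re

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MEMBER = re.compile(re.escape(RDF) + r"_(\d+)")  # matches rdf:_1, rdf:_2, ...


def members(triples, seq):
    """Map rdf:_n index -> item for one sequence node."""
    found = {}
    for s, p, o in triples:
        m = MEMBER.fullmatch(p)
        if s == seq and m:
            found[int(m.group(1))] = o
    return found


def sync_seq(stored, old_seq, parsed, new_seq):
    """Statements to add so old_seq also lists items that only appear in new_seq."""
    known = members(stored, old_seq)
    next_index = max(known, default=0) + 1
    additions = []
    for _, item in sorted(members(parsed, new_seq).items()):
        if item not in known.values():
            additions.append((old_seq, f"{RDF}_{next_index}", item))
            next_index += 1
    return additions


stored = {("_:a", RDF + "type", RDF + "Seq"),
          ("_:a", RDF + "_1", "ex:itemV"),
          ("_:a", RDF + "_2", "ex:itemW"),
          ("_:a", RDF + "_3", "ex:itemX")}
parsed = {("_:b", RDF + "type", RDF + "Seq"),
          ("_:b", RDF + "_1", "ex:itemW"),
          ("_:b", RDF + "_2", "ex:itemX"),
          ("_:b", RDF + "_3", "ex:itemY")}
additions = sync_seq(stored, "_:a", parsed, "_:b")
```

Here `additions` holds the single statement that gives itemY the index 4 on `_:a`. Note that the merged sequence records the order in which items were first seen, not the order in the current feed.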
But those weren’t the dups that actually threw me for my final loop, since the truth of the matter is I’ve never actually seen an RSS feed where the rdf:Seq offered interesting information, so I was pretty much ignoring it. I know the use case of search engine results, but I’ve never seen it and don’t plan to parse it.
What gave me troubles was, take something like Ben Hammersley’s feed, and ignore the fact that he’s making my life really complicated by trying to stick a foaf:Person inside a dc:creator, and just pretend that he was giving Morbus credit as coauthor:
(channel) dc:creator "Ben Hammersley" .
(channel) dc:creator "Morbus Iff" .
Only getting them one at a time, when I see Morbus I assume that the dc:creator has changed since the last time I parsed the feed, and I replace Ben. So I’m back to rewriting the parser, and since the thing that started me down the path of trying to parse RDF was reading so many people saying over and over and over again that “the RDF in RSS is great because there are all these wonderful tools to parse it”, I’d have to say that if step one is to completely rewrite the parser then even if I was good enough to do it the experiment would still be a failure.
Your parser should support multiple values for the same property. TRAMP (built on rdflib) does.
Well, it does, but the problem is knowing which ones I want and don’t want. I don’t want to save the channel title eight thousand times a year.
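A possible middle ground, sketched in Python: hold the whole parse in memory and compare it, as a set, with what was stored for the same feed last time. This assumes statements are hashable (subject, predicate, object) tuples and only handles statements without blank nodes, since blank node labels can change from one parse to the next. The name `diff_parse` and the example.org channel URL are invented for illustration.

```python
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
RSS_TITLE = "http://purl.org/rss/1.0/title"
CHANNEL = "http://example.org/index.rdf"  # hypothetical channel URL


def diff_parse(stored, parsed):
    """Return (to_insert, to_delete) between the last parse of a feed and this one."""
    stored, parsed = set(stored), set(parsed)
    return parsed - stored, stored - parsed


stored = {
    (CHANNEL, RSS_TITLE, "Example"),
    (CHANNEL, DC_CREATOR, "Ben Hammersley"),
}
parsed = {
    (CHANNEL, RSS_TITLE, "Example"),
    (CHANNEL, DC_CREATOR, "Ben Hammersley"),
    (CHANNEL, DC_CREATOR, "Morbus Iff"),
}
to_insert, to_delete = diff_parse(stored, parsed)
```

The unchanged title is never stored again, and the second dc:creator shows up as a new statement instead of replacing the first. Whether to act on `to_delete` is an application decision: acting on it also discards items that have simply scrolled out of the feed.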
Phil, AFAIK the RAP toolkit is much more up to date than the PHPXMLClasses one. Of course it’s in PHP. You can find it over here:
http://www.wiwiss.fu-berlin.de/suhl/bizer/rdfapi/index.html