Yay for liberal parsing
Don’t get me wrong, I think Brent has made the right decision to try to recover ill-formed Atom feeds, but…
Kellan was puzzled by an Inexplicable Magpie/Snoopy Problem, where the Snoopy PHP HTTP client that his Magpie RSS parser uses was returning all of this RSS feed in the headers, rather than in the body of the response. For no particular reason beyond the way the line spacing ended up looking a bit off when I opened it in an editor that was trying to guess about Unix vs. Windows line endings, I suspected the truth: some of the headers ended with CRLF, and some just ended with LF, including the crucial “blank line” that separates the headers from the body of a response. There is absolutely no question in the HTTP standard about what should be there: the one and only thing that signals the end of the headers, and the start of the body, is a CRLF at the start of a line. However, liberal parsers apparently treat things other than just CRLF as a separator, and so when you look at it in a browser, it looks just fine, and aggregators that use a liberal HTTP library have no problem with it. Unless and until someone mentions it to the feed owner, he or she will have no idea that there’s something so wrong with his or her feed that it shouldn’t even make it past the web server.
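For anyone who wants to check a suspect response without squinting at an editor, here is a minimal Python sketch that reports how each line of a raw response ends. The function name and the sample response are made up for illustration; nothing here comes from Snoopy or Magpie.

```python
import io

def line_endings(raw: bytes):
    """Return (line_without_ending, ending_name) for each LF-delimited line."""
    report = []
    # BytesIO.readlines() splits on LF only, so a CRLF stays visible
    # as a trailing "\r\n" instead of being normalized away.
    for line in io.BytesIO(raw).readlines():
        if line.endswith(b"\r\n"):
            report.append((line[:-2], "CRLF"))
        elif line.endswith(b"\n"):
            report.append((line[:-1], "LF"))
        else:
            report.append((line, "none"))  # last line, no terminator
    return report

# Hypothetical response: two good header lines, then a bare-LF header
# and a bare-LF "blank line" before the body.
sample = b"HTTP/1.1 200 OK\r\nServer: Example\r\nContent-Type: text/xml\n\n<rss/>"
for text, ending in line_endings(sample):
    print(ending.ljust(4), text)
```

A response whose headers show a mix of CRLF and LF is the situation described above.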
I don’t know exactly what sort of invalid separators most HTTP libraries accept, and I’m neither smart enough, nor willing enough to test and find out, so I’ll just have to pick on Simon: his PHP HTTP Client Class, which I’ve happily used before, treats the first line where trim($line) == '' as the end of the headers. But long lines in headers may be continued on the next line, by starting that line with whitespace (tabs or spaces). A line of \t … \r\n is just the rest of the line before, and thus \t\r\n is a valid, if silly, continuation of the previous header line. Are there servers that do that, intending it as a header? Dunno. Are there servers that do it, or \s\r\n, or \s\n, intending it as the end of the headers? Dunno. I know that the one and only thing that should be allowed to signal the end of the headers is \r\n with nothing coming before it on that line, because anything else has a different meaning, and I know that you can’t implement the spec as written, because nobody else does.
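To make the difference concrete, here is a rough Python sketch of the two readings. Both function names are hypothetical, and classify_trim only approximates a trim()-style check (Python's strip() and PHP's trim() strip slightly different character sets), so treat it as an illustration rather than a port of anyone's library.

```python
def classify_strict(line: bytes) -> str:
    """Classify one raw line the way the spec reads: only a bare CRLF ends the headers."""
    if line == b"\r\n":
        return "end-of-headers"
    if not line.endswith(b"\r\n"):
        return "malformed"          # e.g. a line ending in a bare LF
    if line[:1] in (b" ", b"\t"):
        return "continuation"       # folded onto the previous header line
    return "header"

def classify_trim(line: bytes) -> str:
    """Classify one raw line the liberal way: any all-whitespace line ends the headers."""
    if line.strip() == b"":
        return "end-of-headers"
    if line[:1] in (b" ", b"\t"):
        return "continuation"
    return "header"

for raw in (b"\r\n", b"\t\r\n", b" \n", b"\n", b"X-Foo: bar\r\n"):
    print(repr(raw), classify_strict(raw), classify_trim(raw))
```

The interesting rows are \t\r\n, which the strict reading treats as a continuation and the trim reading treats as the end of the headers, and the bare \n, which only the liberal reading accepts at all.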
Sheesh. Sure didn’t take me long to mistake a train wreck for a lighthouse, did it? Welcome back, me.
Absolutely. Welcome back, Phil.
I had the same Snoopy problem with a BlogSnob bot. So I patched Snoopy to accept LF as a valid end-of-header.
Assuming that this page will quickly be a top Google result for people looking for answers to this particular Snoopy problem, here’s the fix:
In the _httprequest function, change the line
if($currentHeader == "\r\n")
break;
to read…
if($currentHeader == "\r\n" or $currentHeader == "\n")
break;
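For readers not using Snoopy, the same idea can be sketched in Python. This is not Snoopy's code: read_headers and its accept_bare_lf flag are invented for the example, and the response bytes are made up.

```python
import io

def read_headers(stream, accept_bare_lf=True):
    """Read raw header lines from a binary stream, stopping at the blank line."""
    headers = []
    while True:
        line = stream.readline()
        if not line:
            break  # ran out of data without ever seeing a separator
        if line == b"\r\n" or (accept_bare_lf and line == b"\n"):
            break  # end of headers; the stream is now positioned at the body
        headers.append(line)
    return headers

response = b"HTTP/1.1 200 OK\r\nContent-Type: text/xml\n\n<rss/>"

liberal = io.BytesIO(response)
print(read_headers(liberal), liberal.read())

strict = io.BytesIO(response)
print(read_headers(strict, accept_bare_lf=False), strict.read())
```

With accept_bare_lf=False the bare LF is never recognized as the separator, so the whole feed lands in the header list and the body comes back empty, which matches the symptom Kellan saw.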
Oh, I hope not. I did my best with the title ;)
I certainly won’t ever title a post anything like “HTTP Error 500” ever again, anyway.
LOL.
Welcome back, Magpie.
But you will get people coming here who want to try and understand Liberals.
Myself, I don’t try and parse them. I just smile and nod while slowly backing away.
Adam, do you have a sense of how many hosts you ran into that had this problem? I had never seen it before, but then I don’t do large crawls with Magpie. (Some people do, just not me.)
Thanks.
Hard to say, since my reporting isn’t as good as it could be. I really would only know if someone had their account shut down by the bot and complained. I received one complaint that was attributable to this. They were running a Web server that I’d never heard of. And I’ve heard of a LOT (I used to teach a class in Web servers at the local university).
My guess is that most hosts are running more common Web servers that are actually written well. If I were the guy running that server, I’d switch quickly. If the server doesn’t handle that simple part of the HTTP spec correctly, what else is it doing wrong?
Odd. That host with the problem RSS is running IIS 5. I think IIS 5 handles headers correctly. My bet is that he’s generating the headers programmatically instead of letting IIS create them.
I’d switch, too. “Server:|\s|Microsoft-IIS/5.0|\r||\n|”? Ouch.
But, since the HTTP version, Server, and Date headers have proper line endings, I’d guess that the server is only at fault in allowing some app that’s setting custom headers to do it wrong. Maybe.