Time for an untrusted content element?

In discussing an odd spam email to a W3C mailing list, Ian Hickson says:

I’m thinking that HTML should have an element that basically says “content within this section may contain links from external sources; just because they are here does not mean we are endorsing them” which Google could then use to block Google rank whoring. I know a bunch of people being affected by Web log spam would jump at that chance to use this element if it was put into a spec.

If anyone from Google thinks this is worth considering, let me know as soon as possible. WHAT WG is coming up with lots of HTML extensions, so now’s the time to discuss it.

That could very well be the most useful weblog entry I’ve ever seen, if anyone at Google’s actually interested. It could be problematic, letting pages appear to the average user to link to lots of other places while only linking to one other site in Googlebot’s eyes, but we already have that with annoying Javascripted links and tiresome redirectors.

29 Comments

Comment by Matt #
2004-08-24 23:58:40

That seems like a silly thing for an element. I could totally get behind VoteLinks, though:

http://dev.technorati.com/wiki/VoteLinks

Comment by Mark #
2004-08-25 07:23:52

The problem with VoteLinks is that it allows you to vote against some other page, i.e. you can do harm to someone else by something you put on your own page. I’ve mentioned this problem to Kevin Marks and he seems unwilling to acknowledge it or fix it.

rel="vote-for" -> increase PageRank (the default, this is what all links do now)
rel="vote-abstain" -> ignore for PageRank (like a Javascript link that Google can’t follow, or a hypothetical link-level NOFOLLOW meta tag)
rel="vote-against" -> decrease PageRank

It’s the last one that’s problematic. Currently there is no way to harm another page by something you put on your own page. The reason is obvious: it would be abused to bully competitors. You can help someone else (by linking to them), or you can ignore them (by not linking to them, or by linking with a Javascript pseudo-link, or by using a smart redirector like Blogger does), but you can’t harm anyone but yourself.

The only useful part of VoteLinks is vote-abstain, since vote-for is what Google does by default and vote-against is problematic. Given that, the analogy of “voting” makes no sense, and it would be better just to have some sort of rel attribute (or other attribute) like NOFOLLOW. I don’t care if this is link-level, or an attribute that I could apply to other inline or block elements, or its own inline or block element, but I support the idea and would use it if I knew search engines would obey it.
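To make the three values concrete, here is a minimal sketch of how a crawler might read them, written in Python with the standard library's html.parser. The numeric weights and the names VOTE_WEIGHTS, VoteLinkParser and collect_votes are assumptions made for illustration; nothing here reflects how Google or any real search engine actually scores links.

```python
from html.parser import HTMLParser

# Illustrative weights only (an assumption); real ranking algorithms are not public.
VOTE_WEIGHTS = {"vote-for": 1, "vote-abstain": 0, "vote-against": -1}


class VoteLinkParser(HTMLParser):
    """Collect (href, weight) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.votes = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel is a space-separated list of tokens; a bare "rel" yields None.
        rel_tokens = (attr.get("rel") or "").lower().split()
        # A link with no vote token counts as vote-for, the default behaviour.
        weight = VOTE_WEIGHTS["vote-for"]
        for token in rel_tokens:
            if token in VOTE_WEIGHTS:
                weight = VOTE_WEIGHTS[token]
                break
        self.votes.append((href, weight))


def collect_votes(html):
    """Return the list of (href, weight) pairs found in an HTML string."""
    parser = VoteLinkParser()
    parser.feed(html)
    parser.close()
    return parser.votes
```

In this sketch only the negative weight lets one page lower another page's score, which is the abuse case described above; dropping that entry leaves a plain follow/ignore switch.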

However, it seems unlikely that Google would do this, since as far as they’re concerned they’ve already solved the problem. Just use Google’s redirector that doesn’t pass on PageRank, problem solved. I can’t imagine why they would want to promote a decentralized solution that took them out of the center of things. Since when does Google advocate million-dollar markup?

Comment by Jim Dabell #
2004-08-25 08:22:52

That’s a good point. The voting is bordering on a reputation system in the style of Down & Out in the Magic Kingdom. It’s an interesting idea with a hell of a lot of potential, but it needs to go through several iterations and people need to get a lot more experience in the problem space before something like Google depends on it.

I don’t care if this is link-level, or an attribute that I could apply to other inline or block elements, or its own inline or block element, but I support the idea and would use it if I knew search engines would obey it.

Perhaps it would be easier if people were already doing it? If a specification is written, and the various wiki and weblog engines have default templates that include this, it’ll be a simple yet effective improvement to ranking algorithms to take advantage of this preexisting information. Having a rel attribute/meta element is not harmful beyond the bandwidth consumed, so it’s fairly easy to make this change appear attractive to search engines.

Obviously, having a high-traffic site, you are concerned with bandwidth and would want to avoid any unnecessary markup, but would you include it if most wiki engines, weblog engines, etc included this information out of the box?

 
 
 
Comment by Jim Dabell #
2004-08-25 00:07:40

I suggested <a href="http://www.example.com/" rel="unendorsed"> and <meta name="endorsement-default" content="false"> to Ian Hickson, but he thought that it was impractical for both Google and authors. I don’t see any way of making it simpler than that though.

 
Comment by Phil Ringnalda #
2004-08-25 00:23:06

My wiki is 410 Gone because it shows referrers, and while I wasn’t watching, it turned into tens of thousands of links to porn. I thought about digging into its guts to disable the referrers (which would have been substantially easier than adding a rel to some links and not others), but I decided that the cost of providing content is simply too high right now. I would have been willing to dig deep enough to throw an element around that whole section, though.

Do you really think that a significant proportion of the people producing content are less lazy and more committed to keeping SERPs clean than me? Our house is on fire; selecting the perfect brand of mineral water in the most elegant bottle to put it out might not be an optimal way to proceed.

Comment by Jim Dabell #
2004-08-25 01:46:01

If you are going to be deciding whether or not something is “endorsed” on a link-by-link basis, you really can’t get around the workload, and rel is as simple as it can realistically get.

If you want to disable the “endorsements” on a wiki-wide basis, simply include the <meta> element in the header include instead. That’s a one-liner.

Okay, so it doesn’t solve the medium case where you want one section of a page “endorsed” and one section “unendorsed”, but I would expect that situations where the first two techniques aren’t feasible are quite rare and don’t warrant an extension to HTML on their own. The majority case, on the other hand, can be solved by working within the framework HTML 4.01 has already laid out. This type of thing is what rel and <meta> were designed for.

 
 
Comment by d.w. #
2004-08-25 04:41:11

Jim, forgive me if I’m being dense here, but what I would find most useful would be if “unendorsed” (or whatever) was a block-level attribute I could apply to <DIV>s or <SPAN>s or whatever… I could see, on single entry archive pages for example, where I have comments enabled inline, wrapping the entire comments area in an “unendorsed” <DIV> in my template, which I would think would be fairly low maintenance. Anything that required manually or programmatically adding attributes to individual links sounds braindead to me.

Comment by Jim Dabell #
2004-08-25 05:41:05

In the case where you want to switch off endorsements for a wiki, according to my suggestion, you would load up the template or wherever you generate the <head> section, and put in:

<meta name="endorsement-default" content="false">

Then hit save, upload, and the job is done.

If you had to use a new element type for this, you would need to wrap wherever you might generate unendorsed links, check that you haven’t broken any contextual selectors, rewrite CSS where necessary, and change your doctype to whatever new doctype includes this new element type (HTML 4.01 certainly doesn’t). You would then have to hope that the change in doctype doesn’t break anything (e.g. it might kick browsers into a different rendering mode), and wait for the W3C to sprinkle magic Recommendation dust on the new doctype, since RFC 2854 describes this as the determining factor in whether something may be transmitted as text/html.

It’s also more work for people like Google, as now they have to keep track of where they are in the document tree where previously they wouldn’t have to.

The meta solution is a little heavy-handed in that it turns off the endorsement for all links on a page, which is why I also suggested that you could override the default for specific links with the rel attribute.
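As a rough illustration of how a page-wide default and a per-link override could be evaluated together, here is a Python sketch using the standard library's html.parser. The names EndorsementParser and endorsed_links are made up for this example, and rel="endorsed" as the token that switches a single link back on is purely an assumption: the proposal above only names rel="unendorsed" and the endorsement-default meta element.

```python
from html.parser import HTMLParser


class EndorsementParser(HTMLParser):
    """Record the page-wide default and every link with its rel tokens."""

    def __init__(self):
        super().__init__()
        self.default_endorsed = True  # links are endorsed unless the page says otherwise
        self.links = []  # list of (href, rel_tokens)

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "endorsement-default":
            self.default_endorsed = (attr.get("content") or "").lower() != "false"
        elif tag == "a" and attr.get("href"):
            rel_tokens = (attr.get("rel") or "").lower().split()
            self.links.append((attr["href"], rel_tokens))


def endorsed_links(html):
    """Return the hrefs that would still count as endorsements."""
    parser = EndorsementParser()
    parser.feed(html)
    parser.close()
    result = []
    for href, rel_tokens in parser.links:
        if "unendorsed" in rel_tokens:
            endorsed = False
        elif "endorsed" in rel_tokens:  # hypothetical override token (assumption)
            endorsed = True
        else:
            endorsed = parser.default_endorsed
        if endorsed:
            result.append(href)
    return result
```

The default is applied only after the whole page has been read, so the sketch needs no knowledge of where a link sits in the document tree: one flag for the page and one token list per link.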

I also don’t see the big deal about adding rel attributes. If you are allowing people to put comments on, presumably you’re already messing with attributes to avoid people putting in things like javascript: URLs. Tacking on rel="unendorsed" is dead simple in comparison to any other input validation.
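A minimal Python sketch of that point, assuming a comment system that already checks link URLs before printing them. The function name render_comment_link and the ALLOWED_SCHEMES whitelist are hypothetical; the only part taken from the discussion is the rel="unendorsed" token.

```python
from html import escape
from urllib.parse import urlsplit

# Schemes a commenter may link to (an assumption for this sketch).
ALLOWED_SCHEMES = {"http", "https"}


def render_comment_link(url, text):
    """Return safe HTML for a commenter-supplied link.

    Links with a disallowed scheme (javascript:, data:, ...) are reduced
    to their escaped text; allowed links get rel="unendorsed" added.
    """
    url = url.strip()
    try:
        scheme = urlsplit(url).scheme.lower()
    except ValueError:
        # urlsplit rejects some malformed URLs; treat them as unsafe.
        scheme = ""
    if scheme not in ALLOWED_SCHEMES:
        return escape(text)
    return '<a href="{}" rel="unendorsed">{}</a>'.format(
        escape(url, quote=True), escape(text)
    )
```

The scheme check and the escaping are needed whether or not links are marked; the rel attribute is one extra literal in the output string.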

Also… if <unendorsed> was a block-level element, how would you deal with a paragraph containing both endorsed and unendorsed links?

Adding a new element type is a big deal, both in terms of individual pages, and in terms of messing with an established specification. The alternative of a simple meta switch, or, for finer control, a rel attribute, seems far simpler and more robust than the new element type bodge.

Comment by d.w. #
2004-08-25 06:09:24

AFAIK, the WHAT-WG work is going to result in a raft of new elements and attributes anyway, so the DOCTYPE issue is something they’ll already have to deal with in future browser revs. That horse is already out of the barn. I haven’t been following their work so I don’t know how they’re going to handle this, but obviously, the elephant in the room is “what happens in the eon between the time when the WG finishes its work and W3C yays-or-nays things?”

I don’t see “unendorsed” as a block-level element, I see it as an attribute that can be applied to other block-level elements…

Comment by Jim Dabell #
2004-08-25 06:12:53

I don’t see “unendorsed” as a block-level element, I see it as an attribute that can be applied to other block-level elements…

Ah right, that solves the contextual selector problem then (but still doesn’t solve the problem of mixed link types in a single block).

Comment by d.w. #
2004-08-25 06:24:47

<SPAN>?

Comment by Jim Dabell #
2004-08-25 06:29:27

Yep, applying the attribute to both block and inline elements would fix the mixed link type problem.

I still think a new element type/attribute is too complicated compared with my suggestion though (which is completely compatible with HTML 4.01).

 
 
 
 
Comment by Ian Hickson #
2004-08-26 12:55:35

What’s the use case for the inline version? All the use cases I can think of (mailing list archives, blog comment sections, wikis) are block-level sections. Almost like <blockquote>, but not actually quoting anything from outside the page, just not quoting something from the original page owner.

Comment by Jim Dabell #
2004-08-26 13:48:57

Here’s one use case off the top of my head:

A wiki that would prefer to endorse links, but doesn’t want spam. New URLs are flagged for moderation, and until approved, should appear in the page, but not be endorsed.

 
Comment by Jacques Distler #
2004-08-26 14:53:12

What’s the use case for the inline version?

Does Google’s spider actually construct a document-tree of the page in question? I doubt it.

When you put a <meta> element on a page, Google’s spider knows it applies to the whole page. When you put an attribute on an <a> element, Google’s spider knows it applies to that hyperlink (alone).

When you put in a block-level element, Google’s spider needs to parse the page and construct a document tree to know which hyperlinks to ignore.

Maybe I’m wrong, but I don’t see them implementing that.
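For comparison, here is a Python sketch of the extra bookkeeping a section-level marker would need. It assumes a hypothetical boolean attribute named unendorsed on any container element, which is not part of any specification; the names SectionAwareParser and split_links are likewise illustrative. The parser has to keep a stack of open elements so that it knows, at each link, whether some ancestor carries the marker.

```python
from html.parser import HTMLParser

# HTML 4 elements that never have an end tag, so they are not pushed.
VOID_ELEMENTS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class SectionAwareParser(HTMLParser):
    """Sort links by whether an open ancestor carries the hypothetical
    'unendorsed' attribute."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, carries_marker) for each open element
        self.endorsed = []
        self.unendorsed = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        marked = "unendorsed" in attr
        if tag == "a" and attr.get("href"):
            inside = marked or any(flag for _, flag in self.stack)
            (self.unendorsed if inside else self.endorsed).append(attr["href"])
        if tag not in VOID_ELEMENTS:
            self.stack.append((tag, marked))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; this also discards
        # elements whose end tags were omitted (e.g. <p>, <li>).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def split_links(html):
    """Return (endorsed_hrefs, unendorsed_hrefs) for an HTML string."""
    parser = SectionAwareParser()
    parser.feed(html)
    parser.close()
    return parser.endorsed, parser.unendorsed
```

A stack of open elements is less than a full document tree, but it is still state the crawler must carry through the page and keep correct in the face of badly nested markup, which the meta and rel approaches avoid.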

 
 
 
 
Comment by Mike #
2004-08-28 05:17:32

As a web designer, I see adding a tag such as this as wasteful. This tag would have nothing to do with (X)HTML markup at all. It just allows you to filter content and block Google from giving a good PageRank to links in your comments.

A vast majority of the top rankings on Google for personal sites and weblogs come from other personal websites and weblogs. If all the major weblogging packages (Movable Type, WordPress, TextPattern, etc.) add that META tag to their header, you will see the rankings of some of the most popular legitimate pages fall.

I understand this is a very serious problem for some people, but if anyone is so headstrong as to want to filter out such things, just filter out links from being left in comments at all.

As for the REL and META tag: Why have another META tag? Yes, I still use some META tags, but adding another to the already 120+ that exist just seems like overkill. Although I fully endorse the REL attribute, Google will not pick up on it. All Google does is look for the links on the page and what each link is pointing to; it couldn’t care less about what other attributes, including the REL attribute, that link has.

I regularly police the comments left on my page, and I have implemented several measures to help cut down on this on my weblog. These measures are easily implementable as plugins for all the major weblogging packages.

Comment by Phil Ringnalda #
2004-08-28 07:26:45

Nothing to do with… – so, quotes and blockquotes marking up content from others that was selected by the author is semantic and perfect and joyous, but an element marking up an entire section as being unreviewed content from others is meaningless? Don’t let the fact that we want it to influence Google throw you: saying “this content was neither written by, nor reviewed by, the original author” is a whole lot more semantic than saying that your site navigation is an unordered list.

 
Comment by Jim Dabell #
2004-08-28 08:48:56

If all the major weblogging packages (Movable Type, WordPress, TextPattern, etc.) add that META tag to their header, you will see the rankings of some of the most popular legitimate pages fall.

They don’t have to use the <meta> element approach though. In their case, they can just stick the rel attribute on links provided by others, meaning links provided in the main body of each article that are supplied by the weblogger would still be counted towards pagerank. Using rel would mean that the pagerank gained from posting comments all over the place would disappear; it wouldn’t mean that pagerank gained by being legitimately linked to by others would disappear.

Why have another META tag?

Because a number of people have stated that having to add a rel attribute to each link is a showstopper for them.

There’s nothing inherently wrong with <meta> elements. Using one to say “the links on this page are unendorsed” is perfectly reasonable.

Yes, I still use some META tags, but adding another to the already 120+ that exist just seems like overkill.

It’s not overkill – removing links altogether or stopping Google from following the links is overkill (and user-unfriendly).

I regularly police the comments left on my page, and I have implemented several measures to help cut down on this on my weblog. These measures are easily implementable as plugins for all the major weblogging packages.

Wouldn’t it be nicer if you didn’t have to police comments? That alone is a barrier to entry for people who would like to allow comments on their website but don’t want to waste time filtering crap.

Some of the popular countermeasures cause problems for legitimate users – e.g. IP banning affects people behind proxies. Wouldn’t it be better to implement a proper solution rather than a whole range of workarounds that don’t work quite right?

 
 
Comment by Michael Bernstein #
2004-08-28 12:22:08

Ah, yes. This idea (marking content as not to be used by google) pops up every so often, but since it depends on a Google implementation change, never seems to go anywhere.

Here was my original take on the subject, where I was primarily concerned with feedback loops that could result from displaying Google results on a content page:

http://www.michaelbernstein.com/weblog/archive/2002_04_12a/view

Comment by Phil Ringnalda #
2004-08-28 12:35:41

Ayuh, if it was just YAPFGTITIDWTI I wouldn’t have been too excited about it. Somehow, the idea of Hixie talking about it, in terms of WHAT-WG, where things that will actually be implemented by Opera, Mozilla, and Safari (at least) are being discussed, made it seem a little more real to me.

I do wonder if in any of the previous rounds there’s been a good discussion of the potential for abuse: Google doesn’t give a rat’s ass about the semantic elegance of a namespaced rel="foobar:untrusted" attribute versus whatever’s ugliest, maybe hot comments, but the minute you suggest that Googlebot should not see some of the things you show human visitors, their little spam sensors prick up and start spinning madly.

 
 
Comment by nick #
2004-08-30 11:24:38

I know this is sort of a pipe dream, but a simple XHTML element like <ext></ext> would be a godsend for denoting external content that you have no control over / do not wish to endorse. Just imagine wrapping those around your comment blocks; Google comes along for spidering and knows not to boost its PageRank… bye bye comment spam. Maybe in XHTML 2? :)

 
Comment by Lachlan Hunt #
2004-08-31 00:23:24

The problem with an element is that it would encourage too many authors to abuse the system. It would be far too easy to wrap the whole content of the page in a single element, and if too many people abused the system like that, then ranking by links would become essentially worthless.

The problem with “vote links” that just have a vote-for and vote-against, or equivalent, is that they say absolutely nothing about why the resource is being voted against. Basically, it would be the equivalent of a font tag: red for vote-against, green for vote-for, but with absolutely no semantics. This reason also applies to the element.

For those reasons, I came up with my own proposal which focusses entirely on the content of the resource, and why it is being linked to. This includes: user feedback, quality, accuracy, accessibility, rating and endorsement relationships. That way, it becomes far more difficult to abuse, while also providing some benefit to regular user agents. For example, filtering and sorting links. I wrote the whole proposal in my blog.

Comment by Mark #
2004-08-31 10:51:29

That would be so much better in RDF.

 
Comment by Brian #
2004-11-22 19:43:36

then ranking by links would become essentially worthless.

Aren’t we almost to that point anyway? My google results are filled with results that are only there because they have put up hundreds of useless pages to link from.

And if we did have such an element, I really think less than 5% of the web publishing world would use it. I don’t think it would bring the link ranking system to a grinding halt.

 
 
Trackback by Petroglyphs #
2004-08-25 15:20:58

XMDP-style Robot Profile

I’m not convinced an HTML extension just to mark content as unendorsed is necessary or desirable. Instead, I think this might be better handled through a back door approach: <a h…

 
Trackback by eclecticism #
2004-08-26 10:56:41

Untrusted content, nofollow, etc.

Phil Ringnalda pointed to an idea that Ian Hickson just tossed out while brainstorming ways to battle the ever-increasing issue of comment spam. Personally, I’d love to be able to wrap the comments section of my individual entry pages in something like…

 
Trackback by Curiosity is bliss #
2004-08-26 12:37:36

Google hinting

Phil Ringnalda and Ian Hickson are thinking about extending HTML with search engine hinting tags. I totally agree with this, as I mentioned before. The questions are whether the attribute should be on a block or on individual links, whether we should h…

 
Trackback by Burningbird #
2004-08-26 12:38:42

BLINK is Back!

Phil pointed to a weblog entry that mentions adding an HTML element just for marking untrusted content in a page. With this, Google would then know not to use any links within that section for page ranking.

The concept behind this new addition is t…

 
Trackback by Curiosity is bliss #
2005-01-17 08:45:32

Google hinting: rel=nofollow

A follow-up on Google hinting: some indications that Google will announce support for a rel=nofollow attribute. Links tagged with it would not count in the PageRank calculation. So, it could be used in wikis and blog comments, to discourage spammers. I…

 