Mitigate URL shorteners #53

@chrisnewtn

So Twitter have started using their t.co URL shortener on their profile links. At the moment Elsewhere is oblivious to this and any other shortener: it just treats the shortened URL as if it were the actual URL. This is a problem.

What it basically means is that your website, instead of being identified as example.com, is identified as t.co/24rkwdfj. When Elsewhere validates links, it can't find any link to example.com; it can only see t.co/24rkwdfj, and since nothing else links to that, it won't treat it as a valid resource.

To mitigate this we need to make Elsewhere aware that it's resolving redirects (I'm not exactly sure how they're handled at the moment).

The proposed solution is to identify sites by their actual URL, i.e. the URL that the shortener resolves to. We will, however, still keep track of the URLs used in any redirects to the resolved URL.
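A rough sketch of that idea, in Python purely for illustration (it is not how Elsewhere currently handles redirects). The names `resolve_url`, `RedirectLookup` and `REDIRECTS` are hypothetical, and the lookup is a stand-in for a real HTTP request that reads the redirect target:

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical lookup: given a URL, return the URL it redirects to, or None
# if it does not redirect. A real crawler would issue an HTTP request and
# read the redirect target instead of consulting a table.
RedirectLookup = Callable[[str], Optional[str]]


def resolve_url(
    url: str, lookup: RedirectLookup, max_hops: int = 10
) -> Tuple[str, List[str]]:
    """Follow redirects from `url` and return (resolved_url, aliases).

    `aliases` holds every URL passed through on the way to the resolved one.
    """
    aliases: List[str] = []
    seen = {url}
    current = url
    for _ in range(max_hops):
        target = lookup(current)
        if target is None:
            break  # no further redirect: `current` is the resolved URL
        if target in seen:
            break  # redirect loop: stop at the last URL reached
        aliases.append(current)
        seen.add(target)
        current = target
    return current, aliases


# Stand-in redirect table (an assumption for the example, not real crawl data).
REDIRECTS = {"http://t.co/vV5BWNxil2": "http://chrisnewtn.com"}

resolved, aliases = resolve_url("http://t.co/vV5BWNxil2", REDIRECTS.get)
```

The hop limit and the loop check are design choices of the sketch, there so that a misbehaving shortener cannot stall the crawl.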

The end result, aside from the fixed URLs, is a slight modification to each resource returned in the response:

{
  "results": [
    {
      "url": "http://chrisnewtn.com",
      "title": "Chris Newton",
      "favicon": "http://chrisnewtn.com/favicon.ico",
      "outboundLinks": {
          "verified": [ ... ],
          "unverified": [ ]
      },
      "inboundCount": {
        "verified": 4,
        "unverified": 0
      },
      "verified": true,
      // new bit
      "urlAliases": [
        "http://t.co/vV5BWNxil2"
      ]
    }
  ],
  "query": "http://chrisnewtn.com",
  "created": "2012-10-12T16:30:57.270Z",
  "crawled": 9,
  "verified": 9
}

This urlAliases property contains all the other URLs Elsewhere has encountered that identify the resource, just in case it's useful.
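One way the property could be filled in and used, again as a hedged Python sketch. `record_link` and `identifies` are hypothetical helpers, and the resource dict only mirrors the `url` and `urlAliases` fields from the response above:

```python
from typing import Dict


def record_link(resources: Dict[str, dict], resolved_url: str, seen_url: str) -> dict:
    """Store a resource under its resolved URL, remembering other URLs seen for it."""
    resource = resources.setdefault(
        resolved_url, {"url": resolved_url, "urlAliases": []}
    )
    # Only URLs that differ from the resolved one count as aliases.
    if seen_url != resolved_url and seen_url not in resource["urlAliases"]:
        resource["urlAliases"].append(seen_url)
    return resource


def identifies(resource: dict, url: str) -> bool:
    """True if `url` is the resource's resolved URL or one of its aliases."""
    return url == resource["url"] or url in resource["urlAliases"]


resources: Dict[str, dict] = {}
record_link(resources, "http://chrisnewtn.com", "http://t.co/vV5BWNxil2")
record_link(resources, "http://chrisnewtn.com", "http://chrisnewtn.com")
site = resources["http://chrisnewtn.com"]
```

Keying resources by the resolved URL means a shortened link and a direct link to the same site end up on the same resource, so either form can count towards validation.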
