Figure out how to automatically scrape #1

@domluna

The YorkU website is a clusterfuck to scrape, but it would be really awesome if we could do it automatically. I'm not even sure it's fully possible, given the absurd HTML layout and the fact that the URLs don't make any sense.

Accounting - https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa/20/wo/2Ut0tG0DUArPP653ACehWw/1.1.10.7
Biology - https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa/20/wo/2Ut0tG0DUArPP653ACehWw/1.1.10.7

Notice they're the same URL! WTF!

Also, I think it's putting cookie-like session data in the URL, because these URLs expire after a short while.
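To make that guess concrete, here is a rough sketch that pulls one of these URLs apart. It assumes the segment right after `wo` is a per-session token and the last segment identifies the page element; that reading comes only from the shape of the URL, not from anything confirmed about the site. `split_session_url` is a made-up helper name.

```python
from urllib.parse import urlsplit


def split_session_url(url):
    """Split a URL shaped like the ones above into its path pieces.

    Assumption: the segment after "wo" is a session token and the
    segment after that is an element id. This is a guess from the
    URL shape, not documented behaviour of the site.
    """
    parts = urlsplit(url).path.strip("/").split("/")
    marker = parts.index("wo")  # raises ValueError if the shape differs
    return {
        "app": "/".join(parts[:marker]),
        "session": parts[marker + 1],
        "element": parts[marker + 2],
    }


url = ("https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa/20/wo/"
       "2Ut0tG0DUArPP653ACehWw/1.1.10.7")
pieces = split_session_url(url)
print(pieces["session"], pieces["element"])
```

If the guess holds, nothing in the path names the subject at all, which would explain why Accounting and Biology end up with the same URL and why a saved link stops working once the session is gone.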

Anyway, the HTML soup can be dealt with; it's the URL structure not making any sense that worries me. The structure we would want would be something like

https://www.yorku.ca/courses/2014-15/{Term}/{Subject}

but I guess that would make too much sense.
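Since we can't build URLs like that ourselves, one possible workaround is to start from an entry page in a fresh session and follow links by their visible text instead of by a fixed address. The sketch below only covers the "find the link by its label" part, using the standard library. It assumes subjects are reachable through plain `<a>` links, which hasn't been checked against the real pages; `LinkCollector`, `links_by_text`, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect {link text: absolute URL} for every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = {}
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace so " Biology " and "Biology" match.
            label = " ".join("".join(self._text).split())
            self.links[label] = urljoin(self.base_url, self._href)
            self._href = None


def links_by_text(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page fragment, not copied from the real site.
sample = ('<p><a href="/Apps/x/wo/abc/1.2">Accounting</a> '
          '<a href="/Apps/x/wo/abc/1.3"> Biology </a></p>')
links = links_by_text(sample, "https://example.invalid/start")
print(links["Biology"])
```

The scraper would then fetch whatever URL sits behind "Accounting" or "Biology" within the same session, so expiring URLs stop mattering as long as each run starts from the entry page. If the site turns out to use a form or dropdown for subjects instead of links, this part would need a different approach.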
