Parsing/scraping download links on pages and subpages

I don't have very much experience with code, only the basic ideas.

There is a website with a number of pages in an index. Each of these pages either contains a subdirectory or is a page with various elements and any number of download links in the form FILENAME.EXTENSION.

I would like to write something to automate the process of parsing the pages and subpages, then scraping them for downloads. How can this be accomplished, and with what tools or APIs?

Thanks.

wget -m -k -E http://www.gnu.org/

That should work: it mirrors the site, converts the links so they work locally, and saves pages with a proper .html extension.
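If you only want the downloadable files rather than a full mirror, the accept list is handy too. A minimal sketch, assuming the downloads are PDFs and ZIPs under a hypothetical index URL (swap in the real extensions and address):

wget -r -l 2 -np -nd -A pdf,zip http://www.example.com/index/

Here -r recurses into the subpages, -l 2 limits the recursion depth, -np keeps it from climbing above the index, -nd drops everything into the current directory instead of recreating the site's folder structure, and -A only saves files matching those extensions.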

See here for more details: http://www.gnu.org/software/wget/manual/wget.html


Thanks, wget looks perfect for this.

If I remember right, there's a Windows version as well if you're on that platform.

Yup, I found it. Initially I thought it was only distributed as a tar.gz, implying Linux, but "search twice, post once": they have versions with Windows binaries as well. Even if it were Linux-only, I'd be interested in looking at the source code for their algorithm and trying to reproduce it on my own.

I remember once using a GUI for it on Windows that was actually pretty decent, but because there's no official build, the Windows versions are a bit scattered, so I didn't find it again. It's a really useful tool on any system, though.

Yeah, I can definitely see the benefits of that. I think I might work on this and a GUI for it as a side project, and somehow generalize it to all websites.