wget tricks, download all files of type x from page or site
There's a nice generalized wget howto.

For our purposes, we won't need all this information, but I'm going to quote the main part because... well, because I'm tired of people taking their sites down and links dying:

:: Quote ::
wget -r -l1 -H -t1 -nd -N -np -A.mp3 -erobots=off -i ~/mp3blogs.txt

And here's what this all means:

-r -H -l1 -np These options tell wget to download recursively. That means it goes to a URL, downloads the page there, then follows every link it finds. The -H tells the app to span domains, meaning it should follow links that point away from the blog. And the -l1 (a lowercase L with a numeral one) means to only go one level deep; that is, don't follow links on the linked site. In other words, these commands work together to ensure that you don't send wget off to download the entire Web -- or at least as much as will fit on your hard drive. Rather, it will take each link from your list of blogs, and download it. The -np switch stands for "no parent", which instructs wget to never follow a link up to a parent directory.

We don't, however, want all the links -- just those that point to audio files we haven't yet seen. Including -A.mp3 tells wget to only download files that end with the .mp3 extension. And -N turns on timestamping, which means wget won't download something with the same name unless it's newer.

To keep things clean, we'll add -nd, which makes the app save every thing it finds in one directory, rather than mirroring the directory structure of linked sites. And -erobots=off tells wget to ignore the standard robots.txt files. Normally, this would be a terrible idea, since we'd want to honor the wishes of the site owner. However, since we're only grabbing one file per site, we can safely skip these and keep our directory much cleaner. Also, along the lines of good net citizenship, we'll add the -w5 to wait 5 seconds between each request as to not pound the poor blogs.

Finally, -i ~/mp3blogs.txt is a little shortcut. Typically, I'd just add a URL to the command line with wget and start the downloading. But since I wanted to visit multiple mp3 blogs, I listed their addresses in a text file (one per line) and told wget to use that as the input.

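To make that concrete, here is a minimal sketch of the same recipe; the file path and the blog addresses in it are just made-up placeholders:

:: Code ::
# ~/mp3blogs.txt holds one blog URL per line, e.g.:
#   http://mp3blog-one.example.com/
#   http://mp3blog-two.example.com/
# same command as quoted above, with the -w5 wait added
wget -r -l1 -H -t1 -nd -N -np -A.mp3 -erobots=off -w5 -i ~/mp3blogs.txt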

Let's take a look at the core parts though:
wget -r -l1 -A.mp3 <url>

This will download all files of type .mp3 from the given <url>, following links one level down from that url.

This can be a really handy trick, and it works just as well for other types, for example .htm or .html pages.
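
For example, a quick sketch of the same idea applied to web pages (the url is just a placeholder):

:: Code ::
# grab the .htm/.html pages linked one level down from the starting page,
# all into the current directory
wget -r -l1 -nd -A '.htm,.html' http://www.example.com/articles/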

Here's a concrete example: say you want to download all files of type .mp3 going down two directory levels, but you do not want wget to recreate the directory structures, just get the files:

:: Code ::
wget -r -l2 -nd -Nc -A.mp3 <url>
# or if the site uses a lot of ? type gunk in the urls, and you only
# want the main ones, use this:
wget -N -r -l inf -p -np -k -A '.gif,.swf,.css,.html,.htm,.jpg,.jpeg'  <url>
# or if the site is dynamically generated, you don't want to use -N
# because the pages are always  new, rather this:
wget -nc -r -l inf -p -np -k -A '.gif,.swf,.css,.html,.htm,.jpg,.jpeg'  <url>

That will filter out URLs with query-string endings like ?.... and just download the actual page parts, which is probably what you want.

-r makes it recursive
-l2 makes it go 2 levels deep
-nd means no directories -- everything is saved into a single directory
-Nc is -N (timestamping: skip files you already have unless the remote copy is newer) plus -c (continue partially downloaded files)
-A.mp3 means accept only files ending in .mp3

Other wget tricks
:: Code ::
wget -N -r -l inf -p -np -k <some url>

will download the entire website, including images and other page requisites (that's the -p), and convert the links for local viewing (that's the -k, documented below).
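
Side note: the manual also describes -m (--mirror) as shorthand for -r -N -l inf --no-remove-listing, so a roughly equivalent form of the above would be:

:: Code ::
# -m turns on -r -N -l inf --no-remove-listing in one switch
wget -m -p -np -k <some url>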

And of course, read the GNU wget documentation:

:: Quote ::
'-k'
'--convert-links'
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-html content, etc.

Each link will be changed in one of the two ways:

* The links to files that have been downloaded by Wget will be changed to refer to the file they point to as a relative link.

Example: if the downloaded file /foo/doc.html links to /bar/img.gif, also downloaded, then the link in doc.html will be modified to point to '../bar/img.gif'. This kind of transformation works reliably for arbitrary combinations of directories.
* The links to files that have not been downloaded by Wget will be changed to include host name and absolute path of the location they point to.

Example: if the downloaded file /foo/doc.html links to /bar/img.gif (or to ../bar/img.gif), then the link in doc.html will be modified to point to hostname/bar/img.gif.

Because of this, local browsing works reliably: if a linked file was downloaded, the link will refer to its local name; if it was not downloaded, the link will refer to its full Internet address rather than presenting a broken link. The fact that the former links are converted to relative links ensures that you can move the downloaded hierarchy to another directory.

Note that only at the end of the download can Wget know which links have been downloaded. Because of that, the work done by '-k' will be performed at the end of all the downloads.


Note that if you want to download only the material in one directory or url path and lower, use the -np option. Also, to avoid hyper fast requests, use the -w <time> option to slow down wget.
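
Putting -np and -w together with the mp3 example above, a politer grab might look something like this (the url is just a placeholder):

:: Code ::
# stay at or below the starting path (-np), wait 5 seconds between
# requests (-w 5), and vary the wait a little (--random-wait)
wget -r -l2 -nd -np -w 5 --random-wait -A.mp3 http://www.example.com/music/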

Further:
If you want to get all images (and other fancy stuff like scripts and CSS) from a website (even if the files referenced in the HTML source are hosted on other domains), you can do it with the following wget command:
wget -E -H -k -K -p -U "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" the.domain/file.name

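For reference, here is that same command with each option spelled out in comments (my reading of the standard wget options; the.domain/file.name is just the placeholder from above):

:: Code ::
# -E  save downloaded HTML pages with an .html extension (--html-extension)
# -H  span hosts, so requisites hosted on other domains get fetched too
# -k  convert links for local viewing (see the documentation quoted above)
# -K  keep a .orig backup of each file before -k rewrites it
# -p  download all page requisites (images, scripts, CSS)
# -U  send a browser user-agent string instead of the default Wget/<version>
wget -E -H -k -K -p -U "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" the.domain/file.name
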
To get just, say, images, with no directory structure etc., also using some of the options from above:

wget -nd -erobots=off -A .jpg,.jpeg -E -H -k -K -p -U "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" the.domain/file.name

Note: if you don't specify a browser user agent, the server may take you for a crawler (a search-engine bot) and only serve you the robots.txt file.
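
One more hedged variation: since -H lets wget wander off to any host the page references, you can fence it in with -D (--domains) when you know which extra hosts you actually want; the domains below are just placeholders:

:: Code ::
# only span to the listed domains while collecting page requisites
wget -E -H -k -K -p -D the.domain,images.the.domain -U "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" the.domain/file.name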