[ How to extract a link from an html file using bash ]
I am trying to create a shell script that grabs all of the images from the toplist of wallbase.cc. So far I have it curl the html source (I'm on a Mac, so no wget) and grep for the links to the images. The only problem is that when I grep for the links it returns the whole tag, e.g. <a href="link" target="_blank">, rather than just the link itself. What I am trying to do is extract the URL so that I can curl it into a file. I thought about using an external Java or C program to extract the links, but I figure there is a pure bash way to do it.
Any help would be great.
Edit: my command so far:

grep '<a href="http://wallbase.cc/wallpaper/' wallbase.source

This returns all of the matching links, but still wrapped in the html. I just need to pipe this into some command that strips the html and leaves the bare links.
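For example, something along these lines is the kind of pipeline I have in mind, though I'm not sure it's the cleanest pure-bash way (the sed expressions are just my guess at the stripping step):

grep -o '<a href="http://wallbase.cc/wallpaper/[^"]*"' wallbase.source | sed 's/^<a href="//;s/"$//'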
You can do all of that with your native grep. These options from grep's man page may be just what you are looking for:
-E, --extended-regexp
       Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)

-o, --only-matching
       Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
curl <URL> | grep -o -E "href=[\"'][^\"']*[\"']"

The regular expression is still fairly generic (note the [^\"']* in place of a greedy .*, so several links on one line are matched separately), but you may be able to refine it to your needs.
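For instance, a sketch refined for this question might narrow the pattern to the wallpaper links, cut out the URL itself, and curl each one into its own file (the toplist URL here is an assumption):

curl -s http://wallbase.cc/toplist |
  grep -o -E 'href="http://wallbase\.cc/wallpaper/[^"]+"' |
  cut -d'"' -f2 |
  while read -r url; do
    curl -s -O "$url"    # fetch each extracted link, saved under its remote name
  done

The cut -d'"' -f2 step takes the part of each match between the double quotes, which is exactly the bare link.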
You can do it with a single command:
mech-dump --links http://domain.tld/path
This command comes with the Perl module WWW::Mechanize.
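As a usage sketch for this question (the toplist URL is an assumption, and mech-dump prints links as they appear in the page, so relative links may need the host prepended):

mech-dump --links http://wallbase.cc/toplist | grep '/wallpaper/'

Installing the module (e.g. with cpan WWW::Mechanize) puts mech-dump on your PATH.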