
[ How to extract a link from an html file using bash ]

I am trying to write a shell script that grabs all of the images from the toplist of wallbase.cc. So far it fetches the HTML with curl (I'm on a Mac, so no wget) and greps for the links to the images. The only problem is that grep returns the whole tag, <a href=link> <target=blank>, rather than just the link. What I want is to extract the link itself so that I can curl it into a file. I thought about using an external Java or C program to extract the links, but I figure there is a pure bash way to do it.

Any help would be great.

edit: my commands so far

grep '<a href="http://wallbase.cc/wallpaper/' wallbase.source

This returns all of the links, including the surrounding HTML. I just need to pipe this into some command that strips the HTML and leaves only the links.

Answer 1

You can do all of that with your native grep.

These options from grep's man page may be just what you are looking for:

-E, --extended-regexp Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)

-o, --only-matching Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

curl <URL> | grep -o -E "href=[\"'][^\"']*[\"']"

The regular expression is very generic, but you can refine it to suit your needs.
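
One possible refinement for the question's case, as a rough sketch rather than a definitive solution: restrict the match to the wallbase.cc wallpaper URLs quoted in the question, then strip the href= prefix and the quotes with sed so only the bare link is left. The function name extract_links and the sample HTML line are made up for illustration, and the sketch assumes each href value is wrapped in single or double quotes.

```shell
# Hypothetical helper: reads HTML on stdin and prints one wallpaper URL per line.
# Uses only grep -o/-E and sed -E, which the BSD tools on a Mac also provide.
extract_links() {
  # Step 1: keep only the href="..." parts that point at wallpaper pages.
  # [^\"']* stops at the first closing quote, so each match holds one link.
  grep -o -E "href=[\"']http://wallbase\.cc/wallpaper/[^\"']*[\"']" |
    # Step 2: drop the leading href= and quote, then the trailing quote.
    sed -E "s/^href=[\"']//; s/[\"']\$//"
}

# Example on a made-up line of HTML:
printf '%s\n' '<a href="http://wallbase.cc/wallpaper/123" target="_blank">x</a>' | extract_links
# prints: http://wallbase.cc/wallpaper/123
```

With the saved page from the question, this would be run as extract_links < wallbase.source, and the resulting list can then be read line by line and handed to curl.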

Answer 2

You can do it with a single command:

mech-dump --links http://domain.tld/path

This command ships with the Perl module WWW::Mechanize.