How to easily extract URLs from a webpage

Duarte M
2 min read · Sep 1, 2018


I found it far too time-consuming to discover how to do this, so I have written this mini guide in the interest of simplicity and passing on knowledge. Hopefully the next person to do this won’t have to spend so much time searching the web.

A few use cases:

  • Scrape contacts to generate leads
  • Find price differences between restaurants
  • Find related links in a Wikipedia page
  • Find related music artists

This is how to do it on macOS, but you can probably do the same thing using a similar method on Windows, Linux, etc.

  1. Download the webpage source as an HTML file.
  2. Find the class that contains your desired URLs. In your browser, right-click the type of link you are looking for and click "Inspect Element". The class name is the value in the class="x" attribute.
  3. The HTML probably needs cleaning, as it is likely a mess. Open the HTML file in a text editor like Sublime Text or Atom and use the following regex to find the elements you need: <div[^<>]*class="my-class"[^<>]*>[\s\S]*?</div> where you replace my-class with the class name you found in step 2. Copy and paste the matches into a new HTML file.
  4. Now, to strip everything except the full URLs, we can run a command-line tool called grep on our new HTML file. Open up Terminal and enter this (note the straight quotes — curly quotes will break the commands):
    grep -Eoi '<a [^>]+>' Yourfilepath.html |
    grep -Eo 'href="[^"]+"' |
    grep -Eo '(http|https)://[^"]+'
  5. If there are any duplicates, you can use your text editor to remove them. In Sublime Text you would do the following:
    Edit -> Sort Lines (F9)
    Edit -> Permute Lines -> Unique
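Steps 4 and 5 can also be combined into a single pipeline by appending sort -u, which sorts the lines and drops duplicates in one go. Here is a self-contained sketch; the sample.html file below is a stand-in for whatever page you downloaded:

```shell
# Create a tiny sample page (replace with your own downloaded HTML).
cat > sample.html <<'EOF'
<a href="https://example.com/a">A</a>
<a href="https://example.com/b">B</a>
<a href="https://example.com/a">A again</a>
EOF

# Extract <a> tags, keep the href values, keep only full URLs,
# then sort and deduplicate in one step.
grep -Eoi '<a [^>]+>' sample.html |
  grep -Eo 'href="[^"]+"' |
  grep -Eo '(http|https)://[^"]+' |
  sort -u
```

This prints each unique URL once (here, https://example.com/a and https://example.com/b), so no manual Sublime Text step is needed.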

And that’s it. This is certainly a messy way to go about it but it’s quick and it works. If you can recommend a cleaner way, please do.
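One cleaner alternative, sketched below, is to use a real HTML parser instead of regexes. This example uses only Python's standard library (html.parser); the inline html string and the link values are placeholders standing in for your downloaded page:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith(("http://", "https://")):
                    self.links.append(value)


# Placeholder HTML; in practice, read your downloaded file instead.
html = (
    '<div class="my-class">'
    '<a href="https://example.com/a">A</a>'
    '<a href="https://example.com/b">B</a>'
    '<a href="https://example.com/a">A again</a>'
    '</div>'
)

parser = LinkExtractor()
parser.feed(html)
urls = sorted(set(parser.links))  # deduplicate, like sort -u
print(urls)
```

Because the parser understands HTML structure, it handles attribute ordering, single quotes, and whitespace that the grep pipeline would miss.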
