Writing a web scraper in Python is easy if you allow yourself a little help from two beautiful libraries:
- BeautifulSoup
- Requests

    import sys, requests
    from BeautifulSoup import BeautifulSoup

    # we want UTF8
    reload(sys)
    sys.setdefaultencoding('utf8')

    # let's define the URL we will use to scrape
    url = 'https://www.somepageortheother.nl/bladeren'

    # let's scrape the first URL and get all available letters of the alphabet
    print "obtaining a list of all pages to visit"
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    alphabet = soup.find('ul', attrs={'class': 'letter-list'})
    letters = []
    for letter in alphabet.findAll('li'):
        letters.append(letter.text)

In reality, all the good stuff happens in the last seven lines, starting at requests.get(url).

First, Requests makes it easy to get the content of our request (in this case HTML): requests.get(url) fetches the page and response.content holds its body.

Then BeautifulSoup makes things super easy. BeautifulSoup(html) transforms all that content into something that is easy to parse. The rest is just that: using the find functions to locate what we want (here the ul element with the class letter-list) and looping through its li items to collect their text.

It truly couldn’t have been easier, and BeautifulSoup has good documentation.

Please note that BeautifulSoup 4 has a lot going for it, and you should investigate it in lieu of BeautifulSoup 3…
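
For anyone who wants to try BeautifulSoup 4, here is a rough sketch of how the same extraction might look on Python 3. It is an illustration under a few assumptions, not a drop-in replacement: the function names extract_letters and fetch_letters are made up for this example, the sample markup is hypothetical and only mimics the letter-list structure used above, and the built-in "html.parser" backend, the 10 second timeout, and the error handling are choices of this sketch.

```python
from bs4 import BeautifulSoup  # BeautifulSoup 4 (package name: beautifulsoup4)


def extract_letters(html):
    """Return the text of every <li> inside <ul class="letter-list">."""
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(html, "html.parser")
    alphabet = soup.find("ul", attrs={"class": "letter-list"})
    if alphabet is None:
        # No such list on the page: return an empty result instead of crashing.
        return []
    # find_all is the BeautifulSoup 4 spelling of findAll.
    return [li.get_text(strip=True) for li in alphabet.find_all("li")]


def fetch_letters(url):
    """Download a page with Requests and extract its letter list."""
    import requests  # imported here so extract_letters can be used offline

    response = requests.get(url, timeout=10)  # timeout is an assumption
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return extract_letters(response.content)


# Hypothetical markup standing in for the real page.
sample_html = """
<ul class="letter-list">
  <li><a href="/bladeren/a">A</a></li>
  <li><a href="/bladeren/b">B</a></li>
</ul>
"""

print(extract_letters(sample_html))  # prints ['A', 'B']
```

Splitting the parsing from the download keeps the parsing step easy to try on a saved or hand-written HTML string before pointing fetch_letters at a live URL. The reload(sys) and setdefaultencoding lines have no counterpart here because Python 3 strings are Unicode by default.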