Python Web Scraping

Writing a web scraper with Python is easy if you allow yourself a little help from two beautiful libraries:

BeautifulSoup
Requests

import sys, requests
from BeautifulSoup import BeautifulSoup

# we want UTF-8 as the default encoding (this trick is Python 2 only)
reload(sys)
sys.setdefaultencoding('utf8')

# let's define the URL we will scrape

url = 'https://www.somepageortheother.nl/bladeren'

# let's scrape the first URL and get all available letters of the alphabet 

print "obtaining a list of all pages to visit"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
alphabet = soup.find('ul', attrs={'class': 'letter-list'})
letters = []
for letter in alphabet.findAll('li'):
    letters.append(letter.text)

In reality, all the good stuff happens on lines 15 to 21 of the listing. On lines 15 and 16, Requests makes it easy to get the content of our request (in this case HTML): requests.get(url) fetches the page and response.content holds its body. On line 17, BeautifulSoup makes things super easy: BeautifulSoup(html) transforms all that content into something that is easy to parse. The rest is just that: using the find functions to locate what we want and looping through the results. It truly couldn't have been easier, and BeautifulSoup is well documented.

Please note that BeautifulSoup 4 has a lot going for it, and you should investigate it in lieu of BeautifulSoup 3.
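For readers on Python 3, here is a minimal sketch of the same idea using BeautifulSoup 4 (the bs4 package). It is not the listing above, just one way it might be ported. The function names extract_letters and fetch_letters, the 10 second timeout, and the choice of the built-in "html.parser" are my own assumptions for illustration. The 'letter-list' class comes from the original listing. Note that the reload(sys) / setdefaultencoding trick is not needed (and not available) on Python 3.

```python
import requests
from bs4 import BeautifulSoup


def extract_letters(html):
    """Return the text of every <li> inside <ul class="letter-list">."""
    # "html.parser" ships with Python; other parsers are an option too.
    soup = BeautifulSoup(html, "html.parser")
    alphabet = soup.find("ul", class_="letter-list")
    if alphabet is None:
        # The page does not look the way we assumed: return an empty list
        # instead of failing on alphabet.find_all below.
        return []
    return [li.get_text(strip=True) for li in alphabet.find_all("li")]


def fetch_letters(url):
    """Download the page at url and extract the letters (hypothetical helper)."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return extract_letters(response.content)
```

Keeping the parsing in its own function means you can try it on a small HTML string first, for example extract_letters('<ul class="letter-list"><li>A</li></ul>'), before pointing fetch_letters at a real URL.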
