Python Crawler
A simple crawler written in Python to extract all of the article URLs from my blog, https://robotwxy.github.io/.
Crawl the HTML
```python
import requests
import re  # used below in parse()

def crawl(url):
    # Fetch the page and return its HTML source as text
    response = requests.get(url).text
    return response
```
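The one-liner above works, but real requests can hang or return error pages. A slightly more defensive variant is sketched below; the 10-second timeout and the error handling are my own assumptions, not part of the original post:

```python
import requests

def crawl(url):
    # Hypothetical hardened variant: bound the wait time and
    # raise on HTTP errors instead of returning an error page's HTML
    response = requests.get(url, timeout=10)  # 10 s is an arbitrary choice
    response.raise_for_status()
    return response.text
```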
Parse and extract the "href" attributes
```python
from bs4 import BeautifulSoup

def parse(html):
    # Collect every <a> tag whose href is a site-relative path
    # ending in a slash, e.g. /2018/05/01/my-post/
    soup = BeautifulSoup(html, 'lxml')
    hrefs = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    return hrefs
```
(Screenshots: an example of the parsed soup object and the list of extracted hrefs.)
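To see what parse() returns, as in the screenshots above, a quick check might look like the following; the paths in the comment are illustrative, not real output:

```python
html = crawl('https://robotwxy.github.io/')
for a in parse(html):
    # Each item is a bs4 Tag; .get('href') pulls out the relative path
    print(a.get('href'))
# Expected shape of the output: relative paths such as
#   /2018/05/01/some-post/
#   /about/
```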
The whole program
```python
from urllib.parse import urljoin

base_url = 'https://robotwxy.github.io/'

my_urls = set()   # frontier: URLs still to crawl
my_urls.add(base_url)
seen = set()      # URLs already crawled

while my_urls:
    temp_url = my_urls.pop()
    seen.add(temp_url)
    html = crawl(temp_url)
    hrefs = parse(html)
    for href in hrefs:
        # urljoin avoids the double slash that base_url + href would produce,
        # since base_url ends with '/' and each href starts with '/'
        url_str = urljoin(base_url, href.get('href'))
        if url_str not in seen:
            my_urls.add(url_str)

print(seen)
```
(Screenshot: the final set of collected article URLs.)
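The final print just dumps everything to the console. If you would rather keep the list, a small follow-up snippet (my addition; the filename urls.txt is arbitrary) writes the collected URLs to disk, one per line:

```python
# Persist the crawl results; sorted() just makes the file easier to read
with open('urls.txt', 'w') as f:
    for url in sorted(seen):
        f.write(url + '\n')
```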