Contents
  1. 1. Python Crawler
    1. 1.1. Crawl the HTML
    2. 1.2. Parse and extract the “href” attribute
    3. 1.3. The whole programme

Python Crawler

A simple crawler written in Python that extracts every article URL from my blog “https://robotwxy.github.io/”.

Crawl the HTML

```python
import requests
import re

def crawl(url):
    # Fetch the page and return its HTML as text
    response = requests.get(url).text
    return response
```
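crawl() has no error handling: if the server returns a 404 page or an error, the crawler will happily hand the failure HTML to the parser. A hedged variant (crawl_safe and the 10-second timeout are my additions, not part of the original post) might look like:

```python
import requests

def crawl_safe(url, timeout=10):
    # Fail fast on network problems instead of hanging forever
    response = requests.get(url, timeout=timeout)
    # Raise on 4xx/5xx instead of returning the error page's HTML
    response.raise_for_status()
    return response.text
```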

Parse and extract the “href” attribute

```python
import re
from bs4 import BeautifulSoup

def parse(html):
    # Parse the HTML and collect every <a> whose href looks like
    # a site-relative article path, e.g. /2018/06/some-post/
    soup = BeautifulSoup(html, 'lxml')
    hrefs = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    return hrefs
```

(Screenshots of the parsed soup object and the extracted hrefs list omitted.)
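The pattern ^/.+?/$ only keeps site-relative paths that start and end with a slash, which is how Hexo-style article permalinks are formatted; absolute URLs, anchors, and paths without a trailing slash are filtered out. A quick check with the standard re module (the sample hrefs below are invented for illustration):

```python
import re

article_href = re.compile('^/.+?/$')

# Matches: relative paths wrapped in slashes
assert article_href.match('/2018/06/python-crawler/')
assert article_href.match('/archives/')

# Rejected: absolute URLs, fragments, and paths without a trailing slash
assert article_href.match('https://example.com/') is None
assert article_href.match('#top') is None
assert article_href.match('/about') is None
```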

The whole programme

```python
from urllib.parse import urljoin

base_url = 'https://robotwxy.github.io/'
my_urls = set()   # frontier: URLs still to visit
my_urls.add(base_url)
seen = set()      # URLs already crawled
while my_urls:
    temp_url = my_urls.pop()
    seen.add(temp_url)
    html = crawl(temp_url)
    hrefs = parse(html)
    for href in hrefs:
        # urljoin avoids the double slash that base_url + '/path/' would produce
        url_str = urljoin(base_url, href.get('href'))
        if url_str not in seen:
            my_urls.add(url_str)
print(seen)
```

(Screenshot of the resulting set of crawled URLs omitted.)
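The same loop can be exercised without touching the network by stubbing the crawl and parse steps with an in-memory site map (the domain and page contents below are invented for the sketch):

```python
from urllib.parse import urljoin

# Fake site: each URL maps to the relative hrefs found on that page
fake_site = {
    'https://example.test/': ['/a/', '/b/'],
    'https://example.test/a/': ['/b/'],   # /b/ appears twice; the seen set dedupes it
    'https://example.test/b/': [],
}

base_url = 'https://example.test/'
my_urls = {base_url}
seen = set()
while my_urls:
    url = my_urls.pop()
    seen.add(url)
    for href in fake_site[url]:
        full = urljoin(base_url, href)
        if full not in seen:
            my_urls.add(full)

# Every reachable page is visited exactly once
assert seen == {'https://example.test/',
                'https://example.test/a/',
                'https://example.test/b/'}
```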
