Python Crawler
A simple crawler written in Python to extract all of the article URLs from my blog, https://robotwxy.github.io/.
Crawl the HTML
```python
import requests
import re  # used below in parse()

def crawl(url):
    # Fetch the page and return its HTML source as text
    response = requests.get(url).text
    return response
```
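The one-liner above works, but real requests can hang or return error pages. A slightly more defensive variant is sketched below; the 10-second timeout and the error handling are my own assumptions, not part of the original post:

```python
import requests

def crawl(url):
    # Hypothetical hardened variant: bound the wait time and
    # raise on HTTP errors instead of returning an error page's HTML
    response = requests.get(url, timeout=10)  # 10 s is an arbitrary choice
    response.raise_for_status()
    return response.text
```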
Parse and extract the "href" attributes
```python
from bs4 import BeautifulSoup

def parse(html):
    # Collect every <a> tag whose href is a site-relative path
    # ending in a slash, e.g. /2018/05/01/my-post/
    soup = BeautifulSoup(html, 'lxml')
    hrefs = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    return hrefs
```
(Screenshots: an example of the parsed soup object and the list of extracted hrefs.)
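To see what parse() returns, as in the screenshots above, a quick check might look like the following; the paths in the comment are illustrative, not real output:

```python
html = crawl('https://robotwxy.github.io/')
for a in parse(html):
    # Each item is a bs4 Tag; .get('href') pulls out the relative path
    print(a.get('href'))
# Expected shape of the output: relative paths such as
#   /2018/05/01/some-post/
#   /about/
```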
The whole program
```python
from urllib.parse import urljoin

base_url = 'https://robotwxy.github.io/'

my_urls = set()   # frontier: URLs still to crawl
my_urls.add(base_url)
seen = set()      # URLs already crawled

while my_urls:
    temp_url = my_urls.pop()
    seen.add(temp_url)
    html = crawl(temp_url)
    hrefs = parse(html)
    for href in hrefs:
        # urljoin avoids the double slash that base_url + href would produce,
        # since base_url ends with '/' and each href starts with '/'
        url_str = urljoin(base_url, href.get('href'))
        if url_str not in seen:
            my_urls.add(url_str)

print(seen)
```
(Screenshot: the final set of collected article URLs.)
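The final print just dumps everything to the console. If you would rather keep the list, a small follow-up snippet (my addition; the filename urls.txt is arbitrary) writes the collected URLs to disk, one per line:

```python
# Persist the crawl results; sorted() just makes the file easier to read
with open('urls.txt', 'w') as f:
    for url in sorted(seen):
        f.write(url + '\n')
```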