On a recent project we had difficulty scraping the summary paragraph from Wikipedia article pages, and Beautiful Soup was suggested as a possible tool to help with this. The Beautiful Soup Python library has functions to iterate over, search and update the elements in the parsed tree of an HTML (or XML) document.
After downloading and installing the library, a quick test was to fetch the web page we’re interested in, using the ‘requests’ HTTP library to make things easy. The HTML document is then passed in to create a ‘soup’ object.
import requests
from bs4 import BeautifulSoup
result = requests.get("https://en.wikipedia.org/wiki/HMS_Sheffield_(D80)")
src = result.content
soup = BeautifulSoup(src, 'lxml')
The prettify() method makes the HTML more readable by indenting it to show the parent and sibling structure.
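As a small illustration of prettify(), the sketch below parses a hand-written HTML fragment (a stand-in for the downloaded page, so that it runs offline) and prints the indented form. It uses Python's built-in 'html.parser' instead of 'lxml' purely to avoid an extra dependency; that swap is a choice made for this example.

```python
from bs4 import BeautifulSoup

# Hand-written fragment standing in for the downloaded page (assumption).
html = "<div><p>HMS Sheffield was a <b>Type 42</b> destroyer.</p></div>"

# 'html.parser' ships with Python; the post itself uses 'lxml'.
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a string in which nested tags are placed on their own
# lines and indented according to their depth in the tree.
pretty = soup.prettify()
print(pretty)
```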
Searching for tags by type (such as ‘a’ for anchor links) is simple using ‘find’ (first instance only) or ‘find_all’ (every instance). Running ‘find_all’ on ‘a’ shows all the links on the page, both internal Wikipedia links and external ones (“https://”).
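A minimal sketch of the difference between the two calls, again on a hand-written fragment rather than the live page. The sample links and the '/wiki/' prefix test for internal links are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (assumption), not the real article.
html = """
<p>
  <a href="/wiki/Type_42_destroyer">Type 42 destroyer</a>
  <a href="/wiki/HMS_Sheffield_(C24)">HMS Sheffield (C24)</a>
  <a href="https://www.example.org/sheffield">External reference</a>
  <a name="top">Anchor without href</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")      # first <a> tag only
all_links = soup.find_all("a")   # every <a> tag, in document order

# href=True keeps only the tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# Assumed convention: internal links start with /wiki/, external with http(s).
internal = [h for h in hrefs if h.startswith("/wiki/")]
external = [h for h in hrefs if h.startswith(("http://", "https://"))]

print(first_link.get_text())
print(internal)
print(external)
```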
Let’s just get the links that refer to “HMS …”.
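One way to do this filtering, sketched on sample markup, is to keep only the anchors whose visible text starts with "HMS ". The sample list and the choice to match on link text (rather than on the href) are assumptions for this example.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (assumption), not the real article.
html = """
<ul>
  <li><a href="/wiki/HMS_Sheffield_(C24)">HMS Sheffield (C24)</a></li>
  <li><a href="/wiki/HMS_Coventry_(D118)">HMS Coventry (D118)</a></li>
  <li><a href="/wiki/Falklands_War">Falklands War</a></li>
  <li><a href="/wiki/Type_42_destroyer">Type 42 destroyer</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect (link text, href) pairs for anchors whose text begins with "HMS ".
hms_links = [
    (a.get_text(strip=True), a["href"])
    for a in soup.find_all("a", href=True)
    if a.get_text(strip=True).startswith("HMS ")
]

for text, href in hms_links:
    print(text, "->", href)
```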
Now let’s get the text paragraphs we’re interested in; this can be done using the ‘p’ tag.
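A sketch of pulling paragraph text out with the 'p' tag and taking the first non-empty paragraph as the summary. The markup is a hand-written stand-in, and the empty leading paragraph is included on the assumption that real pages can contain one.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (assumption), not the real article.
html = """
<div>
  <p></p>
  <p><b>HMS Sheffield</b> was a Type 42 destroyer.</p>
  <p>Second paragraph of the article.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# get_text() flattens nested tags such as <b> into plain text.
paragraphs = [p.get_text().strip() for p in soup.find_all("p")]

# Drop empty paragraphs so that index 0 is the first one with real content.
paragraphs = [text for text in paragraphs if text]

summary = paragraphs[0]
print(summary)
```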
Dedicated Wikipedia Library
While Beautiful Soup is a good generic tool for parsing web pages, it turns out that for Wikipedia there are dedicated Python utilities for dealing with the content, such as the wikipedia library (https://pypi.org/project/wikipedia/), which provides a simple wrapper around the MediaWiki API.
With the library imported as wp, wp.search("HMS Sheffield") returns the titles of the matching Wikipedia pages, including all incarnations of HMS Sheffield, and we can use wp.summary("HMS Sheffield (D80)") to get the summary of the page we’re interested in.
The page object returned by wp.page("HMS Sheffield (D80)") also gives the full text content (in its content attribute) in a readable form with headings.
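The calls above can be wrapped in a small helper, sketched here. The function name fetch_summary_and_content is hypothetical, and passing auto_suggest=False (to look up the title exactly as written) and catching the library's DisambiguationError and PageError are choices made for this example, not something the original test did.

```python
def fetch_summary_and_content(title):
    """Return (summary, full_text) for an exact page title.

    Returns (None, None) when the title is ambiguous or no page exists.
    Hypothetical helper written for illustration.
    """
    # Imported inside the function so that defining it does not require
    # the third-party package (pip install wikipedia) or network access.
    import wikipedia as wp

    try:
        # auto_suggest=False looks the title up exactly as given.
        summary = wp.summary(title, auto_suggest=False)
        page = wp.page(title, auto_suggest=False)
    except wp.exceptions.DisambiguationError as err:
        # err.options lists candidate titles for an ambiguous name.
        print("Ambiguous title; candidates:", err.options[:5])
        return None, None
    except wp.exceptions.PageError:
        print("No page found for", title)
        return None, None

    return summary, page.content
```

Calling fetch_summary_and_content("HMS Sheffield (D80)") needs the package installed and network access, so the call is left out of the block.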
Again we can select the first paragraph for the summary (excluding the URL), and possibly use the other paragraphs, with the headings as index/topic markers.
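A sketch of using the headings as topic markers: the helper below splits plain text into a dictionary keyed by heading. It assumes headings appear on their own line in the form "== Heading ==" (with more equals signs for sub-headings), and the name split_sections and the "Summary" key for the lead text are hypothetical. The sample text is hand-written so the example runs offline.

```python
import re

# Matches lines such as "== Design ==" or "=== Sub-heading ===".
HEADING_RE = re.compile(r"^(={2,})\s*(.+?)\s*\1$")


def split_sections(content):
    """Split page text into {heading: text}; lead text goes under 'Summary'."""
    sections = {"Summary": []}
    current = "Summary"
    for line in content.splitlines():
        stripped = line.strip()
        match = HEADING_RE.match(stripped)
        if match:
            # Start collecting lines under the new heading.
            current = match.group(2)
            sections.setdefault(current, [])
        elif stripped:
            sections[current].append(stripped)
    return {heading: "\n".join(lines) for heading, lines in sections.items()}


# Hand-written sample text (assumption), standing in for the page content.
sample = """HMS Sheffield was a Type 42 destroyer.

== Design ==
Design text.

=== Armament ===
Armament text.
"""

sections = split_sections(sample)
print(sections["Summary"])
print(list(sections))
```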
Smart Quotes!
While trying this out on the same task, I also found a useful function to get rid of those pesky Microsoft smart quotes, which were causing trouble in RDF definitions. Unicode, Dammit (the UnicodeDammit class that ships with Beautiful Soup) converts Microsoft smart quotes to HTML or XML entities.
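A short sketch of the conversion, closely following the example in the Beautiful Soup documentation. The sample bytes are an assumption: the conversion applies while decoding Windows-1252 bytes, so text that has already been decoded to a Python string would need different handling.

```python
from bs4 import UnicodeDammit

# Windows-1252 bytes: \x93 and \x94 are curly double quotes, \x92 an apostrophe.
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

# Convert the smart quotes to named HTML entities while decoding.
as_html = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup

# Convert them to numeric XML character references instead.
as_xml = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup

# Or replace them with plain ASCII quotes.
as_ascii = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup

print(as_html)
print(as_xml)
print(as_ascii)
```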