Beautiful Soup (HTML parser)

{{Short description|Python HTML/XML parser}}

{{Other uses|Beautiful Soup (disambiguation){{!}}Beautiful Soup}}

{{Infobox software

| name = Beautiful Soup

| title = Beautiful Soup

| logo =

| logo caption =

| screenshot =

| caption =

| collapsible =

| author = Leonard Richardson

| developer =

| released = {{Start date|2004}}

| discontinued =

| latest release version = {{wikidata|property|reference|edit|P348}}

| latest release date = {{Start date and age|{{wikidata|qualifier|P348|P577}}|df=no}}

| latest preview version =

| latest preview date =

| programming language = Python

| operating system =

| platform = Python

| size =

| language =

| genre = HTML parser library, Web scraping

| license = {{ubl | Python Software Foundation License (Beautiful Soup 3) | MIT License (versions 4 and up){{cite web|title=Beautiful Soup website|url=http://www.crummy.com/software/BeautifulSoup/#Download|accessdate=18 April 2012|quote=Beautiful Soup is licensed under the same terms as Python itself}} }}

| alexa =

| website = {{URL|https://www.crummy.com/software/BeautifulSoup/}}

}}

Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup. It creates a parse tree for documents that can be used to extract data from HTML,{{Citation|last=Hajba|first=Gábor László|title=Using Beautiful Soup|date=2018|work=Website Scraping with Python: Using BeautifulSoup and Scrapy|pages=41–96|editor-last=Hajba|editor-first=Gábor László|publisher=Apress|language=en|doi=10.1007/978-1-4842-3925-4_3|isbn=978-1-4842-3925-4}} which is useful for web scraping.{{Cite web |last=Python |first=Real |title=Beautiful Soup: Build a Web Scraper With Python – Real Python |url=https://realpython.com/beautiful-soup-web-scraper-python/ |access-date=2023-06-01 |website=realpython.com |language=en}}

History

Beautiful Soup was started in 2004 by Leonard Richardson.{{cn|date=May 2024}} It takes its name from the poem "Beautiful Soup" in Alice's Adventures in Wonderland{{Cite web |last=makcorps |date=2022-12-13 |title=BeautifulSoup tutorial: Let's Scrape Web Pages with Python |url=https://www.scrapingdog.com/blog/beautifulsoup-tutorial-web-scraping-with-python/ |access-date=2024-01-24 |language=en-US}} and refers to the term "tag soup", meaning poorly structured HTML code.{{Cite web |date=2021-02-11 |title=Python Web Scraping |url=https://www.udacity.com/blog/2021/02/python-web-scraping.html |access-date=2024-01-24 |website=Udacity |language=en-US}} Richardson continues to contribute to the project,{{Cite web |title=Code : Leonard Richardson |url=https://code.launchpad.net/%7Eleonardr/+branches |access-date=2020-09-19 |website=Launchpad |language=en-US}} which is additionally supported by paid open-source maintainers from the company Tidelift.{{Cite web|last=Tidelift|title=beautifulsoup4 {{!}} pypi via the Tidelift Subscription|url=https://tidelift.com/subscription/pkg/pypi-beautifulsoup4|access-date=2020-09-19|website=tidelift.com|language=en}}

=Versions=

Beautiful Soup 3 was the official release line of Beautiful Soup from May 2006 to March 2012. The current release is [https://www.crummy.com/software/BeautifulSoup/bs4/download/ Beautiful Soup 4.x].

Support for Python 2.7 was retired in 2021; release 4.9.3 was the last to support it.{{cite web |last1=Richardson |first1=Leonard |date=7 Sep 2021 |title=Beautiful Soup 4.10.0 |url=https://groups.google.com/g/beautifulsoup/c/flWqqlrcJ9s |access-date=27 September 2022 |website=beautifulsoup |publisher=Google Groups |language=en-US}}

Usage

Beautiful Soup represents parsed data as a tree which can be searched and iterated over with ordinary Python loops.{{Cite web |title=How To Scrape Web Pages with Beautiful Soup and Python 3 {{!}} DigitalOcean |url=https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3 |access-date=2023-06-01 |website=www.digitalocean.com |language=en}}
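
As a minimal sketch of searching and iterating over such a tree without any network access, the following parses a small inline HTML fragment. The fragment, including its id and class values, is made up for illustration; only find, find_all, get_text and the parent attribute belong to Beautiful Soup itself.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment used only for illustration.
html = """
<html><body>
<ul id="fruits">
  <li class="red">apple</li>
  <li class="yellow">banana</li>
  <li class="red">cherry</li>
</ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element in the tree (or None).
fruit_list = soup.find("ul", id="fruits")

# find_all() returns every matching descendant, so an ordinary loop
# or comprehension is enough to walk the results.
names = [item.get_text() for item in fruit_list.find_all("li")]

# Keyword filters such as class_ narrow the search by attribute.
red_names = [item.get_text() for item in fruit_list.find_all("li", class_="red")]

# Each element keeps a reference to its parent in the tree.
parent_name = fruit_list.find("li").parent.name

print(names)        # ['apple', 'banana', 'cherry']
print(red_names)    # ['apple', 'cherry']
print(parent_name)  # ul
```

Searching from fruit_list rather than from soup restricts the search to that subtree, which is a common way to keep queries specific on larger pages.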

= Code example =

The example below uses the Python standard library's urllib{{Cite web |last=Python |first=Real |title=Python's urllib.request for HTTP Requests – Real Python |url=https://realpython.com/urllib-request/ |access-date=2023-06-01 |website=realpython.com |language=en}} to load Wikipedia's main page, then uses Beautiful Soup to parse the document and search for all links within.

#!/usr/bin/env python3
# Anchor extraction from HTML document
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the page and parse the response body into a tree
with urlopen("https://en.wikipedia.org/wiki/Main_Page") as response:
    soup = BeautifulSoup(response, "html.parser")

# Print the href of every link, falling back to "/" when it is missing
for anchor in soup.find_all("a"):
    print(anchor.get("href", "/"))

Another example uses the Python requests library{{Cite web |last=Blog |first=SerpApi |title=Beautiful Soup: Web Scraping with Python|url=https://serpapi.com/blog/beautiful-soup-build-a-web-scraper-with-python/ |access-date=2024-06-27 |website=serpapi.com |date=5 March 2024 |language=en}} to fetch a page and print the text of every div element it contains.

import requests
from bs4 import BeautifulSoup

url = "https://wikipedia.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Print the text content of every div element on the page
divs = soup.find_all("div")
for div in divs:
    print(div.text.strip())
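
The tolerance for malformed markup mentioned in the introduction can be illustrated with a fragment whose tags are never closed. This is a sketch under stated assumptions: the input string is hypothetical, and the exact shape of the repaired tree can differ between the parsers Beautiful Soup supports, so only the html.parser behaviour is shown here.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: none of the opened tags is ever closed.
broken_html = "<div><p>Hello, <b>world"

soup = BeautifulSoup(broken_html, "html.parser")

# The tree is still navigable: open elements are closed at end of input.
bold_text = soup.b.get_text()
paragraph_text = soup.p.get_text()

# Serializing the tree writes out the closing tags that were missing.
repaired = str(soup)

print(bold_text)       # world
print(paragraph_text)  # Hello, world
print(repaired)
```

Because different parsers may repair the same broken input differently, code that depends on the structure of repaired markup is usually written against one explicitly named parser.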

See also

References