How To Work with Web Data Using Requests and Beautiful Soup with Python 3

2024-02-22

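Introduction

The web holds more data than any of us can read and understand, so we often want to work with that information programmatically in order to make sense of it. Sometimes website creators provide that data through .csv (comma-separated values) files or through an API (Application Programming Interface). Other times, we need to collect text from the web ourselves.

This tutorial covers how to work with the Requests and Beautiful Soup Python packages to make use of data from web pages. The Requests module lets you integrate your Python programs with web services, while the Beautiful Soup module is designed to make screen-scraping quick. Using the Python interactive console and these two libraries, we'll go through how to collect a web page and work with the textual information available there.

Installing Requests

Let's begin by activating our Python 3 programming environment. Make sure you're in the directory where your environment is located, and run the following command: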

. my_env/bin/activate

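In order to work with web pages, we need to request the page. The Requests library lets you use HTTP within your Python programs in a human-readable way.

With our programming environment activated, we'll install Requests with pip: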

pip install requests

While the Requests library is being installed, you'll receive output similar to the following:

Collecting requests
  Downloading requests-2.26.0-py2.py3-none-any.whl (88kB)
    100% |████████████████████████████████| 92kB 3.1MB/s
...
Installing collected packages: chardet, urllib3, certifi, idna, requests
Successfully installed certifi-2017.4.17 chardet-3.0.4 idna-2.5 requests-2.26.0 urllib3-1.21.1

If Requests was already installed, you will instead receive feedback similar to the following from your terminal window:

Requirement already satisfied...

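With Requests installed into our programming environment, we can go on to install the next module.

Installing Beautiful Soup

Just as we did with Requests, we'll install Beautiful Soup with pip. The current version, Beautiful Soup 4, can be installed with the following command: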

pip install beautifulsoup4

Once you run this command, you should see output similar to the following:

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.10.0-py3-none-any.whl (97 kB)
     |████████████████████████████████| 97 kB 6.8 MB/s
Collecting soupsieve>1.2
  Downloading soupsieve-2.3.1-py3-none-any.whl (37 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.10.0 soupsieve-2.3.1

Now that both Beautiful Soup and Requests are installed, we can move on to understanding how to work with the libraries to scrape websites.
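
As a quick sanity check that both packages can be imported from the active environment, a short script along these lines prints the installed versions. This is only a sketch; the version numbers you see depend on what pip installed on your machine.

```python
# Confirm that both libraries import cleanly and report their versions.
import bs4
import requests

print("requests:", requests.__version__)
print("beautifulsoup4:", bs4.__version__)
```

If either import fails with ModuleNotFoundError, the environment is probably not activated or the install step did not complete.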

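Collecting a Web Page with Requests

With the two Python libraries we'll be using now installed, we can get familiar with stepping through a basic web page.

Let's first move into the Python interactive console: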

python

From here, we'll import the Requests module so that we can collect a sample web page:

import requests

We'll assign the URL (below) of the sample web page, mockturtle.html, to the variable url:

url = 'https://assets.digitalocean.com/articles/eng_python/beautiful-soup/mockturtle.html'

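Next, we can assign the result of a request for that page to the variable page with the requests.get() method. We pass the page's URL (which was assigned to the url variable) to that method.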

page = requests.get(url)

The variable page is assigned a Response object:

>>> page
<Response [200]>
>>>

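The Response object above shows the value of its status_code attribute in square brackets (in this case 200). This attribute can also be accessed explicitly:

>>> page.status_code
200
>>>

The returned code of 200 tells us that the page downloaded successfully. Codes that begin with the number 2 generally indicate success, while codes that begin with a 4 or 5 indicate that an error occurred. You can read more about HTTP status codes in the W3C's Status Code Definitions.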
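
In a script, rather than inspecting the status code by eye, one option is to let Requests raise an exception for error codes. The sketch below is a minimal illustration, not part of the original walkthrough: fetch_page and its 10-second timeout are hypothetical choices, while requests.get(), raise_for_status(), and requests.exceptions.RequestException are part of the Requests library.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the Response for url, or None if the request fails.

    fetch_page and its default timeout are hypothetical, for illustration.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.exceptions.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response


# A URL without a scheme is rejected before any network traffic is sent.
print(fetch_page("not-a-valid-url"))
```

Calling fetch_page(url) with the sample URL from this tutorial would return the same kind of Response object as requests.get(url), provided the page is reachable.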

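In order to work with web data, we're going to want to access the text-based content of web files. We can read the content of the server's response with page.text (or page.content if we would like to access the response in bytes).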

page.text

Once we press ENTER, we'll receive the following output:

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n\n<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">\n<head>\n  <meta http-equiv="content-type" content="text/html; charset=us-ascii" />\n\n  <title>Turtle Soup</title>\n</head>\n\n<body>\n  <h1>Turtle Soup</h1>\n\n  <p class="verse" id="first">Beautiful Soup, so rich and green,<br />\n  Waiting in a hot tureen!<br />\n  Who for such dainties would not stoop?<br />\n  Soup of the evening, beautiful Soup!<br />\n  Soup of the evening, beautiful Soup!<br /></p>\n\n  <p class="chorus" id="second">Beau--ootiful Soo--oop!<br />\n  Beau--ootiful Soo--oop!<br />\n  Soo--oop of the e--e--evening,<br />\n  Beautiful, beautiful Soup!<br /></p>\n\n  <p class="verse" id="third">Beautiful Soup! Who cares for fish,<br />\n  Game or any other dish?<br />\n  Who would not give all else for two<br />\n  Pennyworth only of Beautiful Soup?<br />\n  Pennyworth only of beautiful Soup?<br /></p>\n\n  <p class="chorus" id="fourth">Beau--ootiful Soo--oop!<br />\n  Beau--ootiful Soo--oop!<br />\n  Soo--oop of the e--e--evening,<br />\n  Beautiful, beauti--FUL SOUP!<br /></p>\n</body>\n</html>\n'
>>>

Here we see that the full text of the page was printed out, with all of its HTML tags. However, it is difficult to read because there is not much spacing.
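
The difference between page.text and page.content comes down to str versus bytes: Requests decodes the raw response bytes into text using the encoding it determines for the response (available as page.encoding). The sketch below mimics that relationship without making a network call; raw_bytes and decoded_text are made-up names for illustration and are not part of Requests.

```python
# Raw bytes, standing in for what page.content would hold.
raw_bytes = "<h1>Turtle Soup</h1>".encode("utf-8")

# Decoding the bytes yields a str, which is the role page.text plays.
decoded_text = raw_bytes.decode("utf-8")

print(type(raw_bytes).__name__, type(decoded_text).__name__)
print(decoded_text)
```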

In the next section, we can use the Beautiful Soup module to work with this textual data in a more human-friendly manner.

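Stepping Through a Page with Beautiful Soup

The Beautiful Soup library creates a parse tree from HTML and XML documents, including documents with malformed markup such as non-closed tags or tag soup. This functionality will make the web page text more readable than what we saw coming from the Requests module.

To start, we'll import Beautiful Soup into the Python console: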

from bs4 import BeautifulSoup

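Next, we'll run the page.text document through the module to give us a BeautifulSoup object, that is, a parse tree of the page produced by running Python's built-in html.parser over the HTML. The constructed object represents the mockturtle.html document as a nested data structure. It is assigned to the variable soup.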

soup = BeautifulSoup(page.text, 'html.parser')

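To show the contents of the page on the terminal, we can print it with the prettify() method, which turns the Beautiful Soup parse tree into a nicely formatted Unicode string.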

print(soup.prettify())

This will render each HTML tag on its own line:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en-US" xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <title>
   Turtle Soup
  </title>
 </head>
 <body>
  <h1>
   Turtle Soup
  </h1>
  <p class="verse" id="first">
   Beautiful Soup, so rich and green,
   <br/>
   Waiting in a hot tureen!
   <br/>
   Who for such dainties would not stoop?
   <br/>
   Soup of the evening, beautiful Soup!
 ...
</html>

In the output above, we can see that there is one tag per line and that the tags are nested because of the tree structure used by Beautiful Soup.
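
To see how Beautiful Soup copes with the malformed markup mentioned earlier, the following sketch parses a made-up fragment whose tags are never closed. The fragment and the variable names broken_html and fragment_soup are invented for illustration; the calls themselves are the same ones used in this section.

```python
from bs4 import BeautifulSoup

# A made-up fragment with unclosed <p> and <b> tags.
broken_html = "<p>Beautiful <b>Soup"

fragment_soup = BeautifulSoup(broken_html, "html.parser")

# The parse tree still contains both tags, and they are closed on output.
print(fragment_soup.prettify())
print(fragment_soup.b.get_text())
```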

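Finding Instances of a Tag

We can extract a single tag from a page by using Beautiful Soup's find_all method. This will return all instances of a given tag within a document.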

soup.find_all('p')

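Running that method on our object returns the full text of the song along with the relevant <p> tags and any tags contained within the requested tag, which here includes the line break tags <br/>:

[<p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
  Waiting in a hot tureen!<br/>
  Who for such dainties would not stoop?<br/>
  Soup of the evening, beautiful Soup!<br/>
  Soup of the evening, beautiful Soup!<br/></p>, <p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
...
  Beau--ootiful Soo--oop!<br/>
  Soo--oop of the e--e--evening,<br/>
  Beautiful, beauti--FUL SOUP!<br/></p>]

You'll notice that the output above is contained in square brackets [ ]. This means it behaves like a Python list.

Because it is a list, we can call a particular item within it (for example, the third <p> element), and use the get_text() method to extract all the text from inside that tag: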

soup.find_all('p')[2].get_text()

The output that we receive will be what is in the third <p> element in this case:

'Beautiful Soup! Who cares for fish,\n  Game or any other dish?\n  Who would not give all else for two\n  Pennyworth only of Beautiful Soup?\n  Pennyworth only of beautiful Soup?'

Note that the \n line breaks are also shown in the returned string above.
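
If the embedded line breaks and indentation get in the way, get_text() accepts separator and strip arguments that tidy each piece of text. The sketch below works on a shortened, hand-copied piece of the sample page so that it runs without a network connection; the names html, sample_soup, and lines are made up for this example.

```python
from bs4 import BeautifulSoup

# A shortened, hand-copied piece of the sample page.
html = """
<p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
  Waiting in a hot tureen!<br/></p>
<p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
  Beau--ootiful Soo--oop!<br/></p>
"""

sample_soup = BeautifulSoup(html, "html.parser")

# Collect one cleaned-up string per <p> element.
# strip=True trims each text fragment; the separator joins the fragments.
lines = [
    p.get_text(separator=" ", strip=True)
    for p in sample_soup.find_all("p")
]

for line in lines:
    print(line)
```

Looping over the result of find_all() this way avoids indexing into the list by hand when every matching element is needed.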

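Finding Tags by Class and ID

HTML attributes that CSS selectors refer to, like class and ID, can be helpful to look at when working with web data using Beautiful Soup. We can target specific classes and IDs by using the find_all() method and passing the class and ID strings as arguments.

First, let's find all of the instances of the class chorus. In Beautiful Soup we will assign the string for the class to the keyword argument class_: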

soup.find_all(class_='chorus')

When we run the line above, we'll receive the following list as output:

[<p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
  Beau--ootiful Soo--oop!<br/>
  Soo--oop of the e--e--evening,<br/>
  Beautiful, beautiful Soup!<br/></p>, <p class="chorus" id="fourth">Beau--ootiful Soo--oop!<br/>
  Beau--ootiful Soo--oop!<br/>
  Soo--oop of the e--e--evening,<br/>
  Beautiful, beauti--FUL SOUP!<br/></p>]

The two <p>-tagged sections with the class of chorus were printed out to the terminal.

We can also specify that we want to search for the class chorus only within <p> tags, in case it is used for more than one tag:

soup.find_all('p', class_='chorus')

Running the line above will produce the same output as before.
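
On the sample page the two calls return the same result because only <p> tags carry the chorus class. To see where restricting the search to one tag name matters, the sketch below uses a made-up fragment that adds a <div> with the same class (the <div> is invented and does not appear in mockturtle.html). It also shows select(), Beautiful Soup's CSS selector method, as an alternative way to express the same query.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the <div> is not part of the real sample page.
html = """
<p class="verse" id="first">Beautiful Soup, so rich and green,</p>
<p class="chorus" id="second">Beau--ootiful Soo--oop!</p>
<div class="chorus" id="extra">A made-up non-paragraph chorus</div>
"""

class_soup = BeautifulSoup(html, "html.parser")

any_chorus = class_soup.find_all(class_="chorus")     # every tag with the class
p_chorus = class_soup.find_all("p", class_="chorus")  # only <p> tags
css_chorus = class_soup.select("p.chorus")            # CSS selector form

print(len(any_chorus), len(p_chorus), len(css_chorus))
```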

We can also use Beautiful Soup to target IDs associated with HTML tags. In this case we will assign the string 'third' to the keyword argument id:

soup.find_all(id='third')

Once we run the line above, we'll receive the following output:

[<p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
  Game or any other dish?<br/>
  Who would not give all else for two<br/>
  Pennyworth only of Beautiful Soup?<br/>
  Pennyworth only of beautiful Soup?<br/></p>]

The text associated with the <p> tag with the ID of third is printed out to the terminal along with the relevant tags.
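
Since an ID is normally unique within a page, it can be more convenient to use find(), which returns the first matching tag (or None) instead of a list. The following sketch assumes a one-line fragment copied from the sample page; the variable names id_soup, third, and missing are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<p class="verse" id="third">Beautiful Soup! Who cares for fish,</p>'
id_soup = BeautifulSoup(html, "html.parser")

# find() returns the first match or None, rather than a list.
third = id_soup.find(id="third")

if third is not None:
    print(third["class"])  # class is multi-valued, so this is a list
    print(third.get_text())

# Looking up an ID that does not exist in this fragment.
missing = id_soup.find(id="fifth")
print(missing)
```

Checking for None before using the result guards against pages where the expected ID is absent.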