Python requests 库完全指南
本文档面向零基础新手,目标是让你真正理解:
- HTTP 协议基础:请求和响应是什么
- requests 库的安装与入门
- GET / POST / PUT / DELETE 等请求方式
- 请求头(Headers)、查询参数、请求体的设置
- 响应对象的所有属性:状态码、文本、JSON、二进制
- 文件上传与下载
- Session 会话(保持登录状态)
- 超时、重试、代理的配置
- 身份认证(Basic Auth、Token、OAuth)
- 错误处理与异常
- 实战案例:天气查询、网页抓取、API 调用
配有大量可运行示例,全部从最基础讲起。
第一部分:HTTP 基础知识
1.1 什么是 HTTP?
HTTP(超文本传输协议)是浏览器与服务器之间"对话"的规则。
你每天浏览网页时发生的事情:
你的浏览器 服务器
────────── ──────
"我要看 baidu.com 首页" ──────► "好的,给你HTML代码"
HTTP 请求(Request) HTTP 响应(Response)
用 Python 写代码也可以做同样的事:
requests.get('https://baidu.com') ──► 返回 HTML 内容
1.2 HTTP 请求的组成
一个 HTTP 请求包含:
┌─────────────────────────────────────────────────────┐
│ 请求行: GET /search?q=python HTTP/1.1 │
│ ↑ ↑ ↑ │
│ 方法 路径 协议版本 │
├─────────────────────────────────────────────────────┤
│ 请求头(Headers): │
│ Host: www.example.com │
│ User-Agent: Mozilla/5.0 ... │
│ Content-Type: application/json │
│ Authorization: Bearer token123 │
├─────────────────────────────────────────────────────┤
│ 请求体(Body): │
│ {"username": "admin", "password": "123456"} │
│ (GET 请求通常没有请求体) │
└─────────────────────────────────────────────────────┘
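上面三个组成部分可以在代码里直接观察:requests 的 PreparedRequest 会把方法、请求头、请求体组装好但并不发送。下面是一个小示意(URL 仅作演示用):

```python
import requests

# 构造一个请求但不发送,观察"请求行 / 请求头 / 请求体"三部分
req = requests.Request(
    'POST', 'https://www.example.com/login',
    headers={'X-Demo': '1'},
    json={'username': 'admin', 'password': '123456'},
)
prepared = req.prepare()

print(prepared.method, prepared.url)     # 请求行:方法 + URL
print(prepared.headers['Content-Type'])  # 请求头:application/json(json= 自动设置)
print(prepared.body)                     # 请求体:序列化后的 JSON(bytes)
```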
1.3 HTTP 方法(动词)
GET ──► 获取资源(查询) 如:搜索商品
POST ──► 提交数据(新建) 如:用户注册、提交表单
PUT ──► 替换资源(全量更新) 如:修改用户全部信息
PATCH ──► 修改资源(局部更新) 如:只修改用户头像
DELETE ──► 删除资源 如:删除一篇文章
HEAD ──► 只获取响应头(不要正文) 如:检查文件是否存在
OPTIONS ──► 查询服务器支持哪些方法
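requests 为上面每个方法都提供了同名的快捷函数,它们内部最终都调用通用的 requests.request(method, url)。一个小示例验证这个对应关系(不发送任何请求):

```python
import requests

# 每个 HTTP 方法对应一个同名函数:
# requests.get / post / put / patch / delete / head / options
for verb in ['get', 'post', 'put', 'patch', 'delete', 'head', 'options']:
    func = getattr(requests, verb)
    print(f'{verb.upper():8} -> requests.{verb}()', callable(func))
```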
1.4 HTTP 状态码
1xx 信息性响应
2xx 成功
200 OK ──► 请求成功
201 Created ──► 创建成功(POST 后常见)
204 No Content ──► 成功但无返回内容(DELETE 后常见)
3xx 重定向
301 Moved Permanently ──► 永久跳转
302 Found ──► 临时跳转
4xx 客户端错误(你的问题)
400 Bad Request ──► 请求格式错误
401 Unauthorized ──► 未认证(需要登录)
403 Forbidden ──► 无权限(已登录但没权限)
404 Not Found ──► 找不到资源
429 Too Many Requests ──► 请求太频繁
5xx 服务器错误(对方的问题)
500 Internal Server Error ──► 服务器内部错误
502 Bad Gateway ──► 网关错误
503 Service Unavailable ──► 服务暂时不可用
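状态码的第一位数字决定类别,代码里既可以直接比较数字,也可以用 requests.codes 提供的可读名字。下面的 describe 是为演示而写的辅助函数(非 requests 自带):

```python
import requests

# requests.codes 把常用状态码映射成了可读名字
print(requests.codes.ok)         # 200
print(requests.codes.not_found)  # 404

def describe(status_code):
    """按第一位数字粗分状态码类别(演示用的辅助函数)"""
    categories = {1: '信息', 2: '成功', 3: '重定向',
                  4: '客户端错误', 5: '服务器错误'}
    return categories[status_code // 100]

print(describe(201))  # 成功
print(describe(503))  # 服务器错误
```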
第二部分:安装与入门
2.1 安装 requests
pip install requests
验证安装:
import requests
print(requests.__version__) # 如:2.31.0
2.2 第一个请求
import requests
# 向一个公开的测试 API 发送 GET 请求
response = requests.get('https://httpbin.org/get')
# 查看响应
print(f"状态码:{response.status_code}") # 200
print(f"内容类型:{response.headers['Content-Type']}")
print(f"响应内容(前200字符):{response.text[:200]}")
解释:
requests.get(url) → 发送 GET 请求,返回 Response 对象
response.status_code → HTTP 状态码(200=成功)
response.text → 响应内容(字符串)
response.headers → 响应头(字典)
2.3 使用公开测试接口练习
本章大量示例使用 https://httpbin.org——这是专门用来测试 HTTP 请求的公开网站:
https://httpbin.org/get ──► 返回你的 GET 请求信息
https://httpbin.org/post ──► 返回你的 POST 请求信息
https://httpbin.org/put ──► 返回你的 PUT 请求信息
https://httpbin.org/delete ──► 返回你的 DELETE 请求信息
https://httpbin.org/status/404 ──► 返回指定状态码
https://httpbin.org/delay/3 ──► 延迟3秒后返回(测试超时)
https://httpbin.org/headers ──► 返回你发送的请求头
https://httpbin.org/ip ──► 返回你的 IP 地址
https://httpbin.org/json ──► 返回一段 JSON 数据
第三部分:GET 请求
3.1 基本 GET 请求
import requests
# 最简单的 GET 请求
response = requests.get('https://httpbin.org/get')
print(f"状态码:{response.status_code}") # 200
print(f"是否成功:{response.ok}") # True(状态码 < 400 时为 True)
print(f"编码:{response.encoding}") # utf-8
print(f"响应时间:{response.elapsed}") # 如:0:00:00.234567
print(f"最终URL:{response.url}") # 经过重定向后的最终 URL
3.2 带查询参数的 GET 请求
查询参数(Query Parameters)是 URL 中 ? 后面的部分,如:
https://api.example.com/search?q=python&page=1&limit=10
import requests
# ===== 方式1:直接在 URL 里写 =====
url = 'https://httpbin.org/get?name=张三&age=25&city=北京'
response = requests.get(url)
print(response.json()['args'])
# {'age': '25', 'city': '北京', 'name': '张三'}
# ===== 方式2:用 params 参数(推荐!自动 URL 编码)=====
params = {
'name': '张三',
'age': 25,
'city': '北京'
}
response = requests.get('https://httpbin.org/get', params=params)
# 查看实际请求的 URL
print(f"实际URL:{response.url}")
# https://httpbin.org/get?name=%E5%BC%A0%E4%B8%89&age=25&city=%E5%8C%97%E4%BA%AC
# (中文被自动编码了!这就是推荐 params 方式的原因)
print(response.json()['args'])
# {'age': '25', 'city': '北京', 'name': '张三'}
# ===== 传递列表参数(多值)=====
params_multi = {
'ids': [1, 2, 3], # 传递多个 id
'tag': ['python', 'web']
}
response = requests.get('https://httpbin.org/get', params=params_multi)
print(response.url)
# ?ids=1&ids=2&ids=3&tag=python&tag=web
# ===== 实际应用:调用搜索 API =====
def search_github_repos(keyword, language='python', per_page=5):
"""搜索 GitHub 仓库"""
url = 'https://api.github.com/search/repositories'
params = {
'q': f'{keyword} language:{language}',
'sort': 'stars',
'order': 'desc',
'per_page': per_page
}
response = requests.get(url, params=params)
if response.status_code == 200:
data = response.json()
repos = data['items']
print(f"找到 {data['total_count']} 个仓库,显示前{per_page}个:\n")
for repo in repos:
print(f" ⭐ {repo['stargazers_count']:>8,} {repo['full_name']}")
print(f" {repo['description']}\n")
else:
print(f"请求失败:{response.status_code}")
search_github_repos('web scraping')
3.3 Response 对象详解
import requests
response = requests.get('https://httpbin.org/json')
# ===== 基本信息 =====
print(f"状态码: {response.status_code}") # 200
print(f"是否成功: {response.ok}") # True
print(f"原因短语: {response.reason}") # 'OK'
print(f"最终 URL: {response.url}")
print(f"响应耗时: {response.elapsed.total_seconds():.3f}秒")
# ===== 响应头 =====
print("\n响应头:")
for key, value in response.headers.items():
print(f" {key}: {value}")
print(f"\nContent-Type:{response.headers.get('Content-Type')}")
print(f"Content-Length:{response.headers.get('Content-Length', '未知')}")
# ===== 响应内容(三种格式)=====
# 1. 文本格式(自动解码)
print(f"\n文本内容(前100字符):{response.text[:100]}")
# 2. JSON 格式(直接解析为 Python 字典/列表)
data = response.json()
print(f"\nJSON 数据:{data}")
# 3. 二进制格式(图片/文件等)
raw_bytes = response.content
print(f"\n二进制数据长度:{len(raw_bytes)} 字节")
# ===== 编码处理 =====
# requests 会自动检测编码,但有时需要手动指定
response2 = requests.get('https://www.baidu.com')
response2.encoding = 'utf-8' # 手动指定编码
print(response2.text[:200])
# ===== 请求历史(重定向链)=====
response3 = requests.get('http://github.com') # http 会跳转到 https
print("\n重定向链:")
for r in response3.history:
print(f" {r.status_code} → {r.url}")
print(f"最终:{response3.status_code} {response3.url}")
第四部分:POST 请求
4.1 发送表单数据(application/x-www-form-urlencoded)
import requests
# 模拟提交 HTML 表单
form_data = {
'username': 'zhangsan',
'password': '123456',
'remember': 'true'
}
response = requests.post('https://httpbin.org/post', data=form_data)
result = response.json()
print(f"状态码:{response.status_code}")
print(f"发送的表单数据:{result['form']}")
# {'password': '123456', 'remember': 'true', 'username': 'zhangsan'}
print(f"Content-Type:{result['headers']['Content-Type']}")
# application/x-www-form-urlencoded
4.2 发送 JSON 数据(application/json)
import requests
# 现代 REST API 通常使用 JSON 格式
json_data = {
'title': '学习 Python requests',
'content': '今天学习了 requests 库的基本用法',
'tags': ['python', 'http', '学习'],
'is_public': True,
'views': 0
}
response = requests.post(
'https://httpbin.org/post',
json=json_data # 用 json= 参数,自动设置 Content-Type: application/json
)
result = response.json()
print(f"发送的 JSON:{result['json']}")
print(f"Content-Type:{result['headers']['Content-Type']}")
# application/json
# ===== 对比:data= vs json= =====
# data=:手动把字典转JSON字符串,需要手动设置 Content-Type
import json
headers = {'Content-Type': 'application/json'}
response_manual = requests.post(
'https://httpbin.org/post',
data=json.dumps(json_data),
headers=headers
)
# json=:自动序列化 + 自动设置 Content-Type(推荐!)
response_auto = requests.post(
'https://httpbin.org/post',
json=json_data
)
# 两者效果完全相同,推荐用 json= 参数
4.3 实战:调用 RESTful API
import requests
BASE_URL = 'https://jsonplaceholder.typicode.com'
# ===== GET:获取数据 =====
def get_post(post_id):
response = requests.get(f'{BASE_URL}/posts/{post_id}')
return response.json()
# ===== POST:创建数据 =====
def create_post(title, body, user_id=1):
response = requests.post(
f'{BASE_URL}/posts',
json={'title': title, 'body': body, 'userId': user_id}
)
return response.json()
# ===== PUT:全量更新 =====
def update_post(post_id, title, body):
response = requests.put(
f'{BASE_URL}/posts/{post_id}',
json={'id': post_id, 'title': title, 'body': body, 'userId': 1}
)
return response.json()
# ===== PATCH:局部更新 =====
def patch_post(post_id, **fields):
response = requests.patch(
f'{BASE_URL}/posts/{post_id}',
json=fields # 只发送需要更新的字段
)
return response.json()
# ===== DELETE:删除数据 =====
def delete_post(post_id):
response = requests.delete(f'{BASE_URL}/posts/{post_id}')
return response.status_code # 成功返回 200
# 测试所有操作
post = get_post(1)
print(f"获取文章:{post['title']}")
new_post = create_post('测试标题', '这是内容')
print(f"创建文章,ID:{new_post['id']}")
updated = update_post(1, '新标题', '新内容')
print(f"更新文章:{updated['title']}")
patched = patch_post(1, title='只改标题')
print(f"局部更新:{patched['title']}")
code = delete_post(1)
print(f"删除文章,状态码:{code}") # 200
第五部分:请求头(Headers)
5.1 为什么要设置请求头?
常见场景:
① 模拟浏览器(User-Agent):某些网站拒绝非浏览器请求
② 身份认证(Authorization):告诉服务器你是谁
③ 指定数据格式(Content-Type / Accept)
④ 通过防盗链/来源校验(Referer)
⑤ 缓存控制(Cache-Control)
import requests
# ===== 设置自定义请求头 =====
headers = {
# 模拟 Chrome 浏览器(最常用!防止被识别为爬虫)
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/120.0.0.0 Safari/537.36',
# 告诉服务器我能接受 JSON 格式
'Accept': 'application/json',
# 告诉服务器我发送的是 JSON
'Content-Type': 'application/json',
# 防盗链(告诉服务器是从哪个页面过来的)
'Referer': 'https://www.example.com',
# 接受压缩数据(加快传输速度)
'Accept-Encoding': 'gzip, deflate, br',
# 接受的语言
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
}
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json()['headers'])
# ===== 查看默认的请求头 =====
response2 = requests.get('https://httpbin.org/headers')
print(f"\n默认 User-Agent:{response2.json()['headers']['User-Agent']}")
# python-requests/2.31.0(requests 默认 UA)
# ===== 实用函数:创建常用请求头 =====
def get_browser_headers(referer=None):
"""返回模拟浏览器的标准请求头"""
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/120.0.0.0 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
}
if referer:
headers['Referer'] = referer
return headers
response3 = requests.get('https://www.example.com', headers=get_browser_headers())
print(f"n状态码:{response3.status_code}")
5.2 Authorization 请求头(身份认证)
import requests
# ===== Bearer Token 认证(最常见于现代 API)=====
token = 'your_api_token_here'
headers = {
'Authorization': f'Bearer {token}'
}
response = requests.get('https://httpbin.org/bearer', headers=headers)
# 或者:
response = requests.get(
'https://httpbin.org/bearer',
headers={'Authorization': f'Bearer {token}'}
)
# ===== API Key 认证(不同 API 传递方式不同)=====
# 方式1:放在请求头
api_key_headers = {'X-API-Key': 'my_api_key_123'}
requests.get('https://api.example.com/data', headers=api_key_headers)
# 方式2:放在查询参数
requests.get('https://api.example.com/data', params={'api_key': 'my_api_key_123'})
# 方式3:放在请求体(POST 时)
requests.post('https://api.example.com/data',
json={'api_key': 'my_api_key_123', 'data': '...'})
第六部分:身份认证
6.1 Basic Auth(基本认证)
import requests
from requests.auth import HTTPBasicAuth
# 方式1:直接传元组(最简洁)
response = requests.get(
'https://httpbin.org/basic-auth/user/passwd',
auth=('user', 'passwd')
)
print(f"Basic Auth:{response.status_code} {response.json()}")
# 方式2:使用 HTTPBasicAuth 对象(显式更清晰)
response2 = requests.get(
'https://httpbin.org/basic-auth/user/passwd',
auth=HTTPBasicAuth('user', 'passwd')
)
print(f"HTTPBasicAuth:{response2.status_code}")
# 认证失败的情况(用错误的密码)
response3 = requests.get(
'https://httpbin.org/basic-auth/user/passwd',
auth=('user', 'wrong_password')
)
print(f"错误密码:{response3.status_code}") # 401
6.2 Digest Auth 和 Token Auth
import requests
from requests.auth import HTTPDigestAuth
# Digest 认证(比 Basic 更安全)
response = requests.get(
'https://httpbin.org/digest-auth/auth/user/passwd',
auth=HTTPDigestAuth('user', 'passwd')
)
print(f"Digest Auth:{response.status_code}")
# 自定义 Token 认证(最常见于现代 API)
class TokenAuth(requests.auth.AuthBase):
"""自定义 Token 认证类"""
def __init__(self, token):
self.token = token
def __call__(self, r):
# 在每个请求上自动添加 Authorization 头
r.headers['Authorization'] = f'Bearer {self.token}'
return r
# 使用自定义认证
token_auth = TokenAuth('my_access_token_xyz')
response = requests.get('https://httpbin.org/get', auth=token_auth)
print(response.json()['headers'].get('Authorization'))
# Bearer my_access_token_xyz
第七部分:文件上传与下载
7.1 上传文件
import requests
# ===== 方式1:上传单个文件 =====
with open('/path/to/image.jpg', 'rb') as f:
response = requests.post(
'https://httpbin.org/post',
files={'file': f}
)
print(response.json()['files'])
# ===== 方式2:指定文件名和 Content-Type =====
with open('/path/to/data.csv', 'rb') as f:
response = requests.post(
'https://httpbin.org/post',
files={
'file': ('custom_name.csv', f, 'text/csv')
# ↑文件名 ↑内容 ↑内容类型
}
)
# ===== 方式3:上传多个文件 =====
files = [
('images', ('photo1.jpg', open('photo1.jpg', 'rb'), 'image/jpeg')),
('images', ('photo2.jpg', open('photo2.jpg', 'rb'), 'image/jpeg')),
]
response = requests.post('https://httpbin.org/post', files=files)
# ===== 方式4:文件 + 表单数据同时上传 =====
with open('avatar.png', 'rb') as f:
response = requests.post(
'https://httpbin.org/post',
files={'avatar': f},
data={'username': '张三', 'bio': '这是简介'} # 同时传表单数据
)
print(response.json()['files']) # 文件
print(response.json()['form']) # 表单数据
# ===== 方式5:从内存上传(不需要本地文件)=====
import io
content = b'name,age\n\xe5\xbc\xa0\xe4\xb8\x89,25' # CSV 内容(\xe5\xbc\xa0\xe4\xb8\x89 是"张三"的 UTF-8 字节)
response = requests.post(
'https://httpbin.org/post',
files={'data': ('report.csv', io.BytesIO(content), 'text/csv')}
)
7.2 下载文件
import requests
import os
def download_file(url, save_path, chunk_size=8192):
"""
下载文件到本地,支持大文件(流式下载)
参数:
url - 下载链接
save_path - 本地保存路径
chunk_size - 每次读取的块大小(字节)
"""
response = requests.get(url, stream=True) # stream=True:不立即下载全部内容
response.raise_for_status() # 状态码不是 2xx 时抛出异常
# 获取文件总大小(不是所有服务器都提供)
total_size = int(response.headers.get('Content-Length', 0))
os.makedirs(os.path.dirname(save_path) or '.', exist_ok=True)
downloaded = 0
with open(save_path, 'wb') as f:
for chunk in response.iter_content(chunk_size=chunk_size):
if chunk: # 过滤掉保持连接的空块
f.write(chunk)
downloaded += len(chunk)
# 显示进度
if total_size:
pct = downloaded / total_size * 100
bar = '█' * int(pct / 2)
print(f'\r [{bar:<50}] {pct:.1f}% '
f'{downloaded//1024}KB/{total_size//1024}KB',
end='', flush=True)
print(f'\n✅ 下载完成:{save_path}')
return save_path
# 下载一张图片
download_file(
'https://httpbin.org/image/png',
'downloaded_image.png'
)
# 下载一个文本文件
download_file(
'https://raw.githubusercontent.com/psf/requests/main/README.md',
'requests_readme.md'
)
7.3 下载图片(直接到内存)
import requests
from PIL import Image # pip install Pillow
import io
def download_image(url):
"""下载图片,返回 PIL Image 对象(不保存到磁盘)"""
response = requests.get(url)
response.raise_for_status()
image = Image.open(io.BytesIO(response.content))
return image
# 使用
img = download_image('https://httpbin.org/image/jpeg')
print(f"图片尺寸:{img.size}")
print(f"图片格式:{img.format}")
img.save('downloaded.jpg')
第八部分:Session 会话
8.1 为什么要用 Session?
问题:HTTP 是无状态协议,每次请求都是独立的。
登录后,下一次请求服务器不知道你已经登录!
解决:Session(会话)自动保持 Cookie,
让每次请求都带着"已登录"的信息。
另一个好处:同一个 Session 复用 TCP 连接,减少开销,速度更快。
import requests
# ===== 不用 Session:每次请求独立 =====
# 登录后获取 Cookie,下次请求需要手动带上
r1 = requests.get('https://httpbin.org/cookies/set/user/zhangsan', allow_redirects=False)
print(f"r1 cookies:{r1.cookies.get('user')}") # zhangsan(该接口会重定向,需禁止重定向才能在 r1 上看到 Set-Cookie)
r2 = requests.get('https://httpbin.org/cookies')
print(f"r2 cookies:{r2.json()}") # {} ← 第二次请求没有 Cookie!
# ===== 用 Session:自动保持 Cookie =====
session = requests.Session()
r1 = session.get('https://httpbin.org/cookies/set/user/zhangsan')
print(f"r1 cookies:{session.cookies.get('user')}") # zhangsan
r2 = session.get('https://httpbin.org/cookies')
print(f"r2 cookies:{r2.json()}")
# {'cookies': {'user': 'zhangsan'}} ← Cookie 自动保持!
session.close() # 用完记得关闭
8.2 Session 的完整用法
import requests
# ===== 方式1:手动管理(记得 close)=====
session = requests.Session()
# 设置全局请求头(每次请求都带上)
session.headers.update({
'User-Agent': 'MyApp/1.0',
'Accept': 'application/json',
})
# 设置全局认证(每次请求都带上)
session.auth = ('username', 'password')
# 设置全局参数
session.params = {'api_version': 'v2'}
# 用 Session 发送请求(和普通请求一样的方法)
r = session.get('https://httpbin.org/get')
print(r.json())
session.close()
# ===== 方式2:with 语句(推荐!自动关闭)=====
with requests.Session() as session:
session.headers.update({'User-Agent': 'MyApp/1.0'})
r1 = session.get('https://httpbin.org/get')
r2 = session.post('https://httpbin.org/post', json={'data': 'test'})
print(f"r1 状态:{r1.status_code}")
print(f"r2 状态:{r2.status_code}")
# 退出 with 块后自动关闭
8.3 模拟登录全过程
import requests
def simulate_login():
"""
模拟完整的网站登录流程:
1. 获取登录页面(可能含 CSRF token)
2. 提交登录表单
3. 带着 Cookie 访问需要登录的页面
"""
with requests.Session() as session:
session.headers.update({
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36'
})
# 第1步:GET 登录页面(Session 自动保存服务器设置的 Cookie)
print("1. 获取登录页面...")
login_page = session.get('https://httpbin.org/cookies/set/csrf_token/abc123')
print(f" CSRF Token:{session.cookies.get('csrf_token')}")
# 第2步:POST 登录请求(携带 CSRF token 和账号密码)
print("2. 提交登录...")
login_data = {
'username': 'testuser',
'password': 'testpass',
'csrf_token': session.cookies.get('csrf_token', ''),
}
response = session.post(
'https://httpbin.org/post',
data=login_data
)
print(f" 登录状态:{response.status_code}")
# 第3步:访问需要登录的页面
print("3. 访问受保护页面...")
protected = session.get('https://httpbin.org/cookies')
print(f" 携带的 Cookie:{protected.json()['cookies']}")
return session.cookies
simulate_login()
第九部分:超时、重试、代理
9.1 超时设置
import requests
from requests.exceptions import Timeout, ConnectionError
# ===== 设置超时(强烈建议!否则可能永远等下去)=====
# 连接超时 + 读取超时(元组形式)
# connect_timeout:建立 TCP 连接的最长时间
# read_timeout:连接建立后,等待服务器返回数据的最长时间
try:
response = requests.get(
'https://httpbin.org/delay/5', # 模拟5秒延迟
timeout=(3, 5) # 连接超时3秒,读取超时5秒
)
except Timeout:
print("❌ 请求超时!")
# 统一设置(连接和读取用同一个超时值)
try:
response = requests.get(
'https://httpbin.org/delay/2',
timeout=1 # 总共只等1秒
)
except Timeout:
print("❌ 1秒内没有响应")
# 不设超时(危险!程序可能卡死)
# response = requests.get('https://example.com') # ❌ 没有 timeout
# ===== 建议的超时值 =====
TIMEOUT_FAST = (3, 10) # 快速接口:连接3秒,读取10秒
TIMEOUT_SLOW = (5, 60) # 慢接口(如文件上传)
TIMEOUT_STRICT = 5 # 严格限制:总共不超过5秒
9.2 自动重试(urllib3)
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def create_session_with_retry(
retries=3,
backoff_factor=0.5,
status_forcelist=(500, 502, 503, 504)
):
"""
创建带自动重试的 Session
参数:
retries - 最大重试次数
backoff_factor - 退避因子(重试间隔 = backoff_factor * 2^(retry次数-1))
0.5 → 第1次0.5s,第2次1s,第3次2s
status_forcelist - 遇到哪些状态码时触发重试
"""
session = requests.Session()
retry_strategy = Retry(
total=retries,
backoff_factor=backoff_factor,
status_forcelist=status_forcelist,
allowed_methods=["GET", "POST"], # 哪些方法允许重试
raise_on_status=False
)
adapter = HTTPAdapter(max_retries=retry_strategy)
# 对 http 和 https 都启用重试
session.mount('https://', adapter)
session.mount('http://', adapter)
return session
# 使用
session = create_session_with_retry(retries=3)
try:
response = session.get('https://httpbin.org/status/503', timeout=5)
print(f"状态码:{response.status_code}")
except Exception as e:
print(f"多次重试后失败:{e}")
session.close()
9.3 手动重试(更灵活)
import requests
import time
from typing import Optional
def request_with_retry(
method: str,
url: str,
max_retries: int = 3,
retry_delay: float = 1.0,
timeout: float = 10.0,
**kwargs
) -> Optional[requests.Response]:
"""
带手动重试逻辑的请求函数
会在以下情况重试:
- 网络连接错误
- 超时
- 5xx 服务器错误
"""
last_error = None
for attempt in range(1, max_retries + 1):
try:
response = requests.request(method, url, timeout=timeout, **kwargs)
if response.status_code < 500:
return response
# 5xx 错误,记录错误并等待后重试
last_error = requests.exceptions.HTTPError(f"服务器错误 {response.status_code}")
print(f" [第{attempt}次] 服务器错误 {response.status_code},{retry_delay}s 后重试...")
except requests.exceptions.ConnectionError as e:
print(f" [第{attempt}次] 连接错误:{e},{retry_delay}s 后重试...")
last_error = e
except requests.exceptions.Timeout as e:
print(f" [第{attempt}次] 超时,{retry_delay}s 后重试...")
last_error = e
if attempt < max_retries:
time.sleep(retry_delay * (2 ** (attempt - 1))) # 指数退避
raise RuntimeError(f"请求失败,已重试{max_retries}次。最后错误:{last_error}")
# 测试
try:
response = request_with_retry('GET', 'https://httpbin.org/get')
print(f"成功:{response.status_code}")
except RuntimeError as e:
print(f"最终失败:{e}")
9.4 代理设置
import requests
# ===== 设置 HTTP/HTTPS 代理 =====
proxies = {
'http': 'http://proxy_host:8080',
'https': 'http://proxy_host:8080',
}
response = requests.get('https://httpbin.org/ip', proxies=proxies)
print(f"通过代理后的 IP:{response.json()['origin']}")
# ===== 带认证的代理 =====
proxies_auth = {
'http': 'http://username:password@proxy_host:8080',
'https': 'http://username:password@proxy_host:8080',
}
# ===== SOCKS 代理(需要安装 requests[socks])=====
# pip install requests[socks]
proxies_socks = {
'http': 'socks5://127.0.0.1:1080',
'https': 'socks5://127.0.0.1:1080',
}
# ===== 在 Session 中设置全局代理 =====
with requests.Session() as session:
session.proxies.update(proxies)
r = session.get('https://httpbin.org/ip')
print(r.json())
9.5 SSL 证书
import requests
# 默认:验证 SSL 证书(安全)
response = requests.get('https://www.baidu.com') # 正常
# 跳过 SSL 验证(开发/测试时用,生产不推荐)
response = requests.get('https://self-signed.badssl.com/', verify=False)
# 会出现 InsecureRequestWarning 警告
# 静默忽略警告(不推荐在生产使用)
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
response = requests.get('https://self-signed.badssl.com/', verify=False)
# 指定自定义 CA 证书文件
response = requests.get('https://internal-server.com',
verify='/path/to/ca-bundle.crt')
# 客户端证书(双向 TLS)
response = requests.get('https://server.com',
cert=('/path/to/client.crt', '/path/to/client.key'))
第十部分:错误处理
10.1 异常体系
requests 的异常继承关系:
IOError
└── requests.exceptions.RequestException ← 所有 requests 异常的基类
├── ConnectionError ← 网络连接问题
│ ├── ProxyError ← 代理错误
│ └── SSLError ← SSL 证书错误
├── Timeout ← 超时
│ ├── ConnectTimeout ← 连接超时
│ └── ReadTimeout ← 读取超时
├── URLRequired ← 缺少有效的 URL
├── TooManyRedirects ← 重定向太多
├── MissingSchema ← URL 缺少协议头
├── InvalidSchema ← 不支持的协议
└── HTTPError ← HTTP 错误(由 raise_for_status 抛出)
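这棵继承树可以用 issubclass 直接验证——这也说明只捕获基类 RequestException 就能兜底所有 requests 异常:

```python
import requests
from requests import exceptions

# 逐条验证异常继承关系
print(issubclass(exceptions.ConnectTimeout, exceptions.Timeout))      # True
print(issubclass(exceptions.ProxyError, exceptions.ConnectionError))  # True
print(issubclass(exceptions.HTTPError, exceptions.RequestException))  # True
print(issubclass(exceptions.RequestException, IOError))               # True
```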
10.2 完整的错误处理
import requests
from requests.exceptions import (
RequestException,
ConnectionError,
Timeout,
HTTPError,
TooManyRedirects,
MissingSchema,
)
def safe_get(url, **kwargs):
"""
带完整错误处理的 GET 请求
返回:Response 对象,失败时返回 None
"""
try:
response = requests.get(url, timeout=(5, 30), **kwargs)
# 检查 HTTP 错误(4xx, 5xx)
# 不报错时什么都不做,有错误时抛出 HTTPError
response.raise_for_status()
return response
except MissingSchema:
print(f"❌ URL 格式错误(缺少 http:// 或 https://):{url}")
except ConnectionError:
print(f"❌ 无法连接到服务器:{url}")
print(" 可能原因:网络断开、DNS 解析失败、服务器宕机")
except Timeout:
print(f"❌ 请求超时:{url}")
except TooManyRedirects:
print(f"❌ 重定向次数过多:{url}")
except HTTPError as e:
print(f"❌ HTTP 错误:{e.response.status_code}")
if e.response.status_code == 401:
print(" 原因:未认证,需要登录或提供 API Key")
elif e.response.status_code == 403:
print(" 原因:无权限访问该资源")
elif e.response.status_code == 404:
print(" 原因:资源不存在")
elif e.response.status_code == 429:
print(" 原因:请求太频繁,触发了限速")
elif e.response.status_code >= 500:
print(" 原因:服务器内部错误")
except RequestException as e:
print(f"❌ 请求异常:{e}")
return None
# 测试各种错误
test_urls = [
'https://httpbin.org/get', # 正常
'not_a_url', # URL 格式错误
'https://httpbin.org/status/404', # 404
'https://httpbin.org/status/500', # 500
'https://this-domain-does-not-exist-xyz.com', # 连接失败
]
for url in test_urls:
print(f"\n测试:{url}")
result = safe_get(url)
if result:
print(f"✅ 成功,状态码:{result.status_code}")
10.3 raise_for_status() 的用法
import requests
# raise_for_status():遇到 4xx/5xx 时抛出 HTTPError,否则什么都不做
response = requests.get('https://httpbin.org/status/404')
print(f"状态码:{response.status_code}") # 404
try:
response.raise_for_status()
print("请求成功") # 不会到这里
except requests.exceptions.HTTPError as e:
print(f"HTTP错误:{e}")
# HTTPError: 404 Client Error: NOT FOUND for url: ...
# 链式写法(常见模式)
try:
data = requests.get('https://httpbin.org/json').raise_for_status()
# ⚠️ 注意:raise_for_status() 返回 None,不能这样链式获取 json!
except requests.exceptions.HTTPError:
pass
# 正确的链式写法
response = requests.get('https://httpbin.org/json')
response.raise_for_status()
data = response.json()
print(data)
第十一部分:综合实战案例
11.1 案例一:天气查询(OpenWeatherMap API)
import requests
from datetime import datetime
class WeatherClient:
"""
OpenWeatherMap API 客户端
注册地址:https://openweathermap.org/api
免费账号可以调用 Current Weather API
"""
BASE_URL = 'https://api.openweathermap.org/data/2.5'
def __init__(self, api_key: str):
self.api_key = api_key
self.session = requests.Session()
self.session.params = {'appid': api_key, 'units': 'metric', 'lang': 'zh_cn'}
def get_current(self, city: str) -> dict:
"""获取当前天气"""
response = self.session.get(
f'{self.BASE_URL}/weather',
params={'q': city}
)
response.raise_for_status()
return response.json()
def get_forecast(self, city: str, days: int = 5) -> list:
"""获取未来N天天气预报"""
response = self.session.get(
f'{self.BASE_URL}/forecast',
params={'q': city, 'cnt': days * 8} # 每3小时一条,8条=1天
)
response.raise_for_status()
return response.json()['list']
def format_weather(self, data: dict) -> str:
"""格式化天气数据为可读文本"""
city = data['name']
country = data['sys']['country']
temp = data['main']['temp']
feels_like = data['main']['feels_like']
humidity = data['main']['humidity']
desc = data['weather'][0]['description']
wind_speed = data['wind']['speed']
return (f"📍 {city}, {country}\n"
f"🌡️ 温度:{temp:.1f}°C(体感 {feels_like:.1f}°C)\n"
f"🌤️ 天气:{desc}\n"
f"💧 湿度:{humidity}%\n"
f"💨 风速:{wind_speed} m/s")
def close(self):
self.session.close()
# 使用示例
# client = WeatherClient('your_api_key_here')
# try:
# weather = client.get_current('Beijing')
# print(client.format_weather(weather))
# finally:
# client.close()
11.2 案例二:通用 API 客户端基类
import requests
import logging
import time
from typing import Any, Dict, Optional
logger = logging.getLogger(__name__)
class BaseAPIClient:
"""
通用 REST API 客户端基类
封装了常见的功能:认证、重试、错误处理、日志
"""
def __init__(
self,
base_url: str,
api_key: Optional[str] = None,
timeout: tuple = (5, 30),
max_retries: int = 3
):
self.base_url = base_url.rstrip('/')
self.timeout = timeout
self.max_retries = max_retries
self._session = requests.Session()
self._session.headers.update({
'Accept': 'application/json',
'Content-Type': 'application/json',
'User-Agent': 'PythonAPIClient/1.0',
})
if api_key:
self._session.headers['Authorization'] = f'Bearer {api_key}'
def _request(
self,
method: str,
endpoint: str,
**kwargs
) -> requests.Response:
"""底层请求方法,含重试逻辑"""
url = f"{self.base_url}/{endpoint.lstrip('/')}"
for attempt in range(1, self.max_retries + 1):
try:
start = time.perf_counter()
response = self._session.request(
method, url, timeout=self.timeout, **kwargs
)
elapsed = time.perf_counter() - start
logger.debug(
f"{method.upper()} {url} → "
f"{response.status_code} ({elapsed:.3f}s)"
)
response.raise_for_status()
return response
except requests.exceptions.HTTPError as e:
if e.response.status_code < 500:
raise # 4xx 客户端错误,不重试
if attempt == self.max_retries:
raise
wait = 2 ** (attempt - 1)
logger.warning(f"服务器错误,{wait}s 后重试(第{attempt}次)...")
time.sleep(wait)
except (requests.exceptions.ConnectionError,
requests.exceptions.Timeout) as e:
if attempt == self.max_retries:
raise
wait = 2 ** (attempt - 1)
logger.warning(f"网络错误({e}),{wait}s 后重试...")
time.sleep(wait)
def get(self, endpoint: str, params: Optional[Dict] = None, **kwargs) -> Any:
"""GET 请求,自动解析 JSON"""
response = self._request('GET', endpoint, params=params, **kwargs)
return response.json() if response.content else None
def post(self, endpoint: str, data: Any = None, **kwargs) -> Any:
"""POST 请求"""
response = self._request('POST', endpoint, json=data, **kwargs)
return response.json() if response.content else None
def put(self, endpoint: str, data: Any = None, **kwargs) -> Any:
"""PUT 请求"""
response = self._request('PUT', endpoint, json=data, **kwargs)
return response.json() if response.content else None
def patch(self, endpoint: str, data: Any = None, **kwargs) -> Any:
"""PATCH 请求"""
response = self._request('PATCH', endpoint, json=data, **kwargs)
return response.json() if response.content else None
def delete(self, endpoint: str, **kwargs) -> bool:
"""DELETE 请求,返回是否成功"""
response = self._request('DELETE', endpoint, **kwargs)
return response.status_code in (200, 204)
def close(self):
self._session.close()
def __enter__(self):
return self
def __exit__(self, *args):
self.close()
# 基于基类实现具体客户端
class TodoAPIClient(BaseAPIClient):
"""JSONPlaceholder Todo API 客户端"""
def __init__(self):
super().__init__('https://jsonplaceholder.typicode.com')
def list_todos(self, user_id: int = None, completed: bool = None):
params = {}
if user_id is not None: params['userId'] = user_id
if completed is not None: params['completed'] = completed
return self.get('/todos', params=params)
def get_todo(self, todo_id: int):
return self.get(f'/todos/{todo_id}')
def create_todo(self, title: str, user_id: int = 1):
return self.post('/todos', {'title': title, 'completed': False, 'userId': user_id})
def complete_todo(self, todo_id: int):
return self.patch(f'/todos/{todo_id}', {'completed': True})
def delete_todo(self, todo_id: int):
return self.delete(f'/todos/{todo_id}')
# 使用
with TodoAPIClient() as client:
todos = client.list_todos(user_id=1, completed=False)
print(f"未完成的 Todo:{len(todos)} 条")
todo = client.get_todo(1)
print(f"第1条:{todo['title']}")
new = client.create_todo('学习 requests 库')
print(f"新建 Todo ID:{new['id']}")
ok = client.complete_todo(1)
print(f"标记完成:{ok}")
11.3 案例三:网页内容抓取
import requests
from html.parser import HTMLParser
import re
class LinkParser(HTMLParser):
"""简单的 HTML 链接解析器"""
def __init__(self):
super().__init__()
self.links = []
self.title = ''
self._in_title = False
def handle_starttag(self, tag, attrs):
if tag == 'a':
for name, value in attrs:
if name == 'href' and value:
self.links.append(value)
if tag == 'title':
self._in_title = True
def handle_data(self, data):
if self._in_title:
self.title += data
def handle_endtag(self, tag):
if tag == 'title':
self._in_title = False
def scrape_page(url: str) -> dict:
"""
抓取网页基本信息
返回:{title, links, word_count, status_code}
"""
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36',
'Accept-Language': 'zh-CN,zh;q=0.9',
}
try:
response = requests.get(url, headers=headers, timeout=(5, 15))
response.raise_for_status()
response.encoding = response.apparent_encoding # 自动检测编码
html = response.text
# 解析 HTML
parser = LinkParser()
parser.feed(html)
# 提取纯文本(去掉 HTML 标签)
text = re.sub(r'<[^>]+>', '', html)
text = re.sub(r'\s+', ' ', text).strip()
word_count = len(text)
return {
'url': url,
'status': response.status_code,
'title': parser.title.strip(),
'links': [l for l in parser.links if l.startswith('http')],
'word_count': word_count,
'size': len(response.content),
}
except requests.exceptions.RequestException as e:
return {'url': url, 'error': str(e)}
# 测试
result = scrape_page('https://www.python.org')
print(f"URL: {result['url']}")
print(f"标题: {result.get('title', 'N/A')}")
print(f"状态: {result.get('status', 'N/A')}")
print(f"大小: {result.get('size', 0) / 1024:.1f} KB")
print(f"字符数:{result.get('word_count', 0):,}")
print(f"外链数:{len(result.get('links', []))}")
if result.get('links'):
print(f"前5个链接:")
for link in result['links'][:5]:
print(f" {link}")
第十二部分:常见陷阱与最佳实践
12.1 陷阱1:忘记设置超时
import requests
# ❌ 没有 timeout:如果服务器不响应,程序会一直等待!(勿直接运行)
# response = requests.get('https://httpbin.org/delay/99')
# ✅ 正确:永远要设置 timeout
response = requests.get('https://httpbin.org/get', timeout=(3, 10))
12.2 陷阱2:每次请求都创建新 Session(低效)
import requests
# ❌ 每次都新建连接,效率低
for url in urls:
r = requests.get(url) # 每次都是新的 TCP 连接
# ✅ 用 Session 复用连接(减少握手开销,速度快3-5倍)
with requests.Session() as session:
for url in urls:
r = session.get(url) # 复用 TCP 连接
12.3 陷阱3:直接访问 response.json() 而不检查状态码
import requests
# ❌ 如果请求失败(响应体不是合法 JSON),json() 会抛出 JSONDecodeError
response = requests.get('https://httpbin.org/status/404')
# data = response.json() # JSONDecodeError!404 的响应体不是 JSON
# ✅ 先检查状态码
response = requests.get('https://httpbin.org/status/404')
if response.status_code == 200:
data = response.json()
else:
print(f"请求失败:{response.status_code}")
# ✅ 或者用 raise_for_status
try:
response = requests.get('https://httpbin.org/json')
response.raise_for_status()
data = response.json()
except requests.HTTPError:
data = None
12.4 陷阱4:中文 URL 没有编码
import requests
from urllib.parse import quote
# ❌ 中文 URL 可能导致编码问题
url = 'https://www.example.com/搜索?q=Python教程'
# ✅ 用 params 参数(requests 自动编码)
response = requests.get(
'https://www.example.com/搜索',
params={'q': 'Python教程'}
)
# ✅ 或者手动编码
url = 'https://www.example.com/' + quote('搜索') + '?q=' + quote('Python教程')
12.5 陷阱5:Stream 下载时忘记迭代
import requests
# ❌ stream=True 但没有迭代,内容不会自动下载
response = requests.get('https://httpbin.org/image/png', stream=True)
# 此时只下载了响应头,没有下载正文!
content = response.content # 这里才会触发下载,但 stream=True 的优势消失了
# ✅ 正确:stream=True 配合 iter_content
response = requests.get('https://httpbin.org/image/png', stream=True)
with open('image.png', 'wb') as f:
for chunk in response.iter_content(chunk_size=8192):
if chunk:
f.write(chunk)
第十三部分:完整速查表
📌 发送请求
requests.get(url, params=, headers=, timeout=, auth=)
requests.post(url, data=, json=, files=, headers=, timeout=)
requests.put(url, json=, headers=, timeout=)
requests.patch(url, json=, headers=, timeout=)
requests.delete(url, headers=, timeout=)
requests.head(url, headers=, timeout=)
requests.request(method, url, ...) ← 通用方法
📌 Response 对象
.status_code → 状态码(200, 404...)
.ok → 状态码 < 400 时为 True
.reason → 状态码描述('OK', 'Not Found')
.text → 响应文本(自动解码)
.content → 响应二进制内容
.json() → 解析 JSON,返回 dict/list
.headers → 响应头字典
.cookies → Cookie 字典
.url → 最终 URL(重定向后)
.encoding → 编码(可手动设置)
.apparent_encoding → 自动检测的编码
.elapsed → 响应耗时(timedelta)
.history → 重定向历史
.raise_for_status()→ 4xx/5xx 时抛出 HTTPError
📌 请求参数
params= → URL 查询参数(dict)
data= → 表单数据(dict)或字节
json= → JSON 数据(dict/list,自动序列化)
files= → 上传文件(dict)
headers= → 请求头(dict)
cookies= → Cookie(dict)
auth= → 认证(元组 或 Auth 对象)
timeout= → 超时(秒数 或 (connect, read) 元组)
proxies= → 代理(dict)
verify= → SSL 验证(True/False/证书路径)
stream= → 是否流式下载
allow_redirects= → 是否允许重定向
📌 Session
session = requests.Session()
session.headers.update({}) → 设置全局请求头
session.auth = (user, pass) → 设置全局认证
session.cookies → Cookie Jar
session.params → 全局查询参数
session.proxies → 全局代理
session.close() → 关闭
📌 异常
RequestException → 所有异常的基类
ConnectionError → 网络连接错误
Timeout / ConnectTimeout / ReadTimeout
HTTPError → raise_for_status() 触发
TooManyRedirects → 重定向超限
MissingSchema / InvalidSchema → URL 格式问题
总结
学完本章,你应该掌握:
- HTTP 基础:请求/响应结构、HTTP 方法、状态码的含义
- GET 请求:params= 参数自动编码、Response 对象的所有属性
- POST 请求:data=(表单)vs json=(JSON API)的区别
- 请求头:User-Agent 模拟浏览器、Authorization 认证
- 文件操作:files= 上传、stream=True 流式下载大文件
- Session:保持 Cookie、复用连接提升性能
- 超时与重试:timeout=(5, 30) 防止卡死、自动重试应对网络抖动
- 错误处理:raise_for_status() + 捕获各类异常
- 身份认证:Basic Auth、Token Auth 的使用方式
最常用的模式:
import requests
# 简单请求
response = requests.get('https://api.example.com/data',
params={'key': 'value'},
headers={'Authorization': 'Bearer token'},
timeout=(5, 30))
response.raise_for_status()
data = response.json()
# 发送 JSON
response = requests.post('https://api.example.com/create',
json={'name': '张三', 'age': 25},
headers={'Authorization': 'Bearer token'},
timeout=(5, 30))
response.raise_for_status()
result = response.json()
# Session(多次请求;注意 Session 没有"基础 URL",每次仍需写完整地址)
with requests.Session() as session:
session.headers['Authorization'] = 'Bearer token'
data1 = session.get('https://api.example.com/endpoint1', timeout=10).json()
data2 = session.get('https://api.example.com/endpoint2', timeout=10).json()