# Crawl4AI Browser Configuration Guide
Crawl4AI supports multiple browser engines and provides extensive configuration options for controlling browser behavior.
## Browser Types

Choose from three browser engines:
### Chromium (default)

```python
async with AsyncWebCrawler(browser_type="chromium") as crawler:
    result = await crawler.arun(url="https://example.com")
```
### Firefox

```python
async with AsyncWebCrawler(browser_type="firefox") as crawler:
    result = await crawler.arun(url="https://example.com")
```
### WebKit

```python
async with AsyncWebCrawler(browser_type="webkit") as crawler:
    result = await crawler.arun(url="https://example.com")
```
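Sites occasionally render differently across engines. If you want to sanity-check a target against all three, a minimal comparison sketch along these lines can help (it assumes all three Playwright engines are installed, e.g. via `playwright install`):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def compare_engines(url: str):
    # Crawl the same URL with each engine and report basic stats
    for engine in ("chromium", "firefox", "webkit"):
        async with AsyncWebCrawler(browser_type=engine) as crawler:
            result = await crawler.arun(url=url)
            print(f"{engine}: success={result.success}, "
                  f"markdown chars={len(str(result.markdown or ''))}")

asyncio.run(compare_engines("https://example.com"))
```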
## Basic Configuration

Common browser settings:
```python
async with AsyncWebCrawler(
    headless=True,         # Run in headless mode (no GUI)
    verbose=True,          # Enable verbose logging
    sleep_on_close=False   # Don't delay when closing the browser
) as crawler:
    result = await crawler.arun(url="https://example.com")
```
## Identity Management

Control how your crawler appears to websites:
### Custom user agent

```python
async with AsyncWebCrawler(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
) as crawler:
    result = await crawler.arun(url="https://example.com")
```
### Custom headers

```python
headers = {
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache"
}

async with AsyncWebCrawler(headers=headers) as crawler:
    result = await crawler.arun(url="https://example.com")
```
## Screenshots

Capture page screenshots with enhanced error handling:
```python
import base64

result = await crawler.arun(
    url="https://example.com",
    screenshot=True,          # Enable screenshot capture
    screenshot_wait_for=2.0   # Wait 2 seconds before capturing
)

if result.screenshot:  # Base64-encoded image
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
```
## Timeouts and Waiting

Control page-loading behavior:
```python
result = await crawler.arun(
    url="https://example.com",
    page_timeout=60000,               # Page load timeout (ms)
    delay_before_return_html=2.0,     # Wait before capturing content
    wait_for="css:.dynamic-content"   # Wait for a specific element
)
```
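A timeout or a `wait_for` condition that never matches typically surfaces as a failed result rather than usable content, so it is worth checking the outcome. A short sketch, assuming the `success` and `error_message` fields on the returned result:

```python
result = await crawler.arun(
    url="https://example.com",
    page_timeout=30000,
    wait_for="css:.dynamic-content"
)

if not result.success:
    # Inspect the failure instead of silently using empty content
    print(f"Crawl failed: {result.error_message}")
```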
## JavaScript Execution

Execute custom JavaScript on the page before content is captured:
### Single JavaScript command

```python
result = await crawler.arun(
    url="https://example.com",
    js_code="window.scrollTo(0, document.body.scrollHeight);"
)
```
### Multiple commands

```python
js_commands = [
    "window.scrollTo(0, document.body.scrollHeight);",
    "document.querySelector('.load-more').click();"
]

result = await crawler.arun(
    url="https://example.com",
    js_code=js_commands
)
```
## Proxy Configuration

Use proxies for enhanced access:
### Simple proxy

```python
async with AsyncWebCrawler(
    proxy="http://proxy.example.com:8080"
) as crawler:
    result = await crawler.arun(url="https://example.com")
```
### Proxy with authentication

```python
proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass"
}

async with AsyncWebCrawler(proxy_config=proxy_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```
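If a single proxy proves unreliable, one common pattern is to rotate through a small pool until a crawl succeeds. A minimal sketch (the proxy URLs are placeholders):

```python
from crawl4ai import AsyncWebCrawler

# Placeholder endpoints; substitute your own proxy pool
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

async def crawl_with_rotation(url: str):
    for proxy in PROXIES:
        async with AsyncWebCrawler(proxy=proxy) as crawler:
            result = await crawler.arun(url=url)
            if result.success:
                return result
    return None  # every proxy failed
```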
## Anti-Detection Features

Enable stealth features to avoid bot detection:
```python
result = await crawler.arun(
    url="https://example.com",
    simulate_user=True,       # Simulate human behavior
    override_navigator=True,  # Mask automation signals
    magic=True                # Enable all anti-detection features
)
```
## Handling Dynamic Content

Configure the browser to handle dynamic content:
### Waiting for dynamic content

```python
result = await crawler.arun(
    url="https://example.com",
    wait_for="js:() => document.querySelector('.content').children.length > 10",
    process_iframes=True  # Process iframe content
)
```
### Handling lazy-loaded images

```python
result = await crawler.arun(
    url="https://example.com",
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    delay_before_return_html=2.0  # Wait for images to load
)
```
## Comprehensive Example

Here's how to combine the various browser configuration options:
```python
from crawl4ai import AsyncWebCrawler

async def crawl_with_advanced_config(url: str):
    async with AsyncWebCrawler(
        # Browser settings
        browser_type="chromium",
        headless=True,
        verbose=True,

        # Identity
        user_agent="Custom User Agent",
        headers={"Accept-Language": "en-US"},

        # Proxy settings
        proxy="http://proxy.example.com:8080"
    ) as crawler:
        result = await crawler.arun(
            url=url,

            # Content processing
            process_iframes=True,
            screenshot=True,

            # Timing
            page_timeout=60000,
            delay_before_return_html=2.0,

            # Anti-detection
            magic=True,
            simulate_user=True,

            # Dynamic content
            js_code=[
                "window.scrollTo(0, document.body.scrollHeight);",
                "document.querySelector('.load-more')?.click();"
            ],
            wait_for="css:.dynamic-content"
        )

        return {
            "content": result.markdown,
            "screenshot": result.screenshot,
            "success": result.success
        }
```
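Calling the helper from synchronous code is then a one-liner:

```python
import asyncio

data = asyncio.run(crawl_with_advanced_config("https://example.com"))
print("success:", data["success"])
```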