Crawl4AI Browser Configuration Guide

2024-11-11

Crawl4AI supports multiple browser engines and provides extensive configuration options for controlling browser behavior.
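
The snippets in this guide omit imports and the surrounding async context for brevity. As a minimal, runnable sketch (assuming the package's standard AsyncWebCrawler import), each example can be wrapped like this:

python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Default browser settings; swap in the options shown below as needed
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        if result.success:
            print(result.markdown)  # Extracted page content as markdown

asyncio.run(main())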

Browser Types

Choose from three browser engines:

# Chromium (default)

python
async with AsyncWebCrawler(browser_type="chromium") as crawler:
    result = await crawler.arun(url="https://example.com") 

# Firefox

python
async with AsyncWebCrawler(browser_type="firefox") as crawler:
    result = await crawler.arun(url="https://example.com") 

# WebKit

python
async with AsyncWebCrawler(browser_type="webkit") as crawler:
    result = await crawler.arun(url="https://example.com") 

Basic Configuration

Common browser settings:

python
async with AsyncWebCrawler(
    headless=True,           # Run in headless mode (no GUI)
    verbose=True,            # Enable verbose logging
    sleep_on_close=False     # No delay when closing the browser
) as crawler:
    result = await crawler.arun(url="https://example.com") 

Identity Management

Control how your crawler appears to websites:

# Custom user agent

python
async with AsyncWebCrawler(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
) as crawler:
    result = await crawler.arun(url="https://example.com") 

# Custom headers

python
headers = {
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache"
}
async with AsyncWebCrawler(headers=headers) as crawler:
    result = await crawler.arun(url="https://example.com") 

Screenshots

Capture page screenshots with enhanced error handling:

python
result = await crawler.arun(
    url="https://example.com", 
    screenshot=True,               # Enable screenshot capture
    screenshot_wait_for=2.0        # Wait 2 seconds before capturing
)

if result.screenshot:  # Base64-encoded image
    import base64
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
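
The result object also reports whether the crawl itself succeeded, so a slightly more defensive version of the snippet above can distinguish a failed page load from a missing screenshot (a sketch; result.success is used elsewhere in this guide, while the error_message field is an assumption):

python
import base64

result = await crawler.arun(
    url="https://example.com",
    screenshot=True,
    screenshot_wait_for=2.0
)

if not result.success:
    # The page never loaded, so there is nothing to decode
    print(f"Crawl failed: {result.error_message}")
elif result.screenshot:
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
else:
    print("Page loaded but no screenshot was returned")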

Timeouts and Waiting

Control page loading behavior:

python
result = await crawler.arun(
    url="https://example.com", 
    page_timeout=60000,              # Page load timeout (ms)
    delay_before_return_html=2.0,    # Wait before capturing content
    wait_for="css:.dynamic-content"  # Wait for a specific element
)

JavaScript Execution

Execute custom JavaScript before crawling:

# Single JavaScript command

python
result = await crawler.arun(
    url="https://example.com", 
    js_code="window.scrollTo(0, document.body.scrollHeight);"
)

# Multiple commands

python
js_commands = [
    "window.scrollTo(0, document.body.scrollHeight);",
    "document.querySelector('.load-more').click();"
]
result = await crawler.arun(
    url="https://example.com", 
    js_code=js_commands
)
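
When an injected command triggers more loading (such as the .load-more click above), it usually helps to pair it with a short wait so the new content appears in the captured HTML. A sketch combining the two (the selector and delay are illustrative):

python
result = await crawler.arun(
    url="https://example.com",
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more')?.click();"  # Optional chaining avoids an error if the button is absent
    ],
    delay_before_return_html=1.5  # Give the newly requested items time to render
)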

Proxy Configuration

Use a proxy for improved access:

# Simple proxy

python
async with AsyncWebCrawler(
    proxy="http://proxy.example.com:8080" 
) as crawler:
    result = await crawler.arun(url="https://example.com") 

# Proxy with authentication

python
proxy_config = {
    "server": "http://proxy.example.com:8080", 
    "username": "user",
    "password": "pass"
}
async with AsyncWebCrawler(proxy_config=proxy_config) as crawler:
    result = await crawler.arun(url="https://example.com") 

Anti-Detection Features

Enable stealth features to avoid bot detection:

python
result = await crawler.arun(
    url="https://example.com", 
    simulate_user=True,        # Simulate human-like behavior
    override_navigator=True,   # Mask automation signals
    magic=True                 # Enable all anti-detection features
)

Handling Dynamic Content

Configure the browser to handle dynamic content:

# Wait for dynamic content

python
result = await crawler.arun(
    url="https://example.com", 
    wait_for="js:() => document.querySelector('.content').children.length > 10",
    process_iframes=True     # Process iframe content
)

# Handle lazy-loaded images

python
result = await crawler.arun(
    url="https://example.com", 
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    delay_before_return_html=2.0  # Wait for images to load
)

Complete Example

Here's how to combine the various browser configuration options:

python
async def crawl_with_advanced_config(url: str):
    async with AsyncWebCrawler(
        # Browser settings
        browser_type="chromium",
        headless=True,
        verbose=True,

        # Identity
        user_agent="Custom User Agent",
        headers={"Accept-Language": "en-US"},

        # Proxy settings
        proxy="http://proxy.example.com:8080" 
    ) as crawler:
        result = await crawler.arun(
            url=url,
            # Content processing
            process_iframes=True,
            screenshot=True,

            # Timing settings
            page_timeout=60000,
            delay_before_return_html=2.0,

            # Anti-detection
            magic=True,
            simulate_user=True,

            # Dynamic content
            js_code=[
                "window.scrollTo(0, document.body.scrollHeight);",
                "document.querySelector('.load-more')?.click();"
            ],
            wait_for="css:.dynamic-content"
        )

        return {
            "content": result.markdown,
            "screenshot": result.screenshot,
            "success": result.success
        }
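
Like the earlier snippets, the function above still needs an event loop to run. A minimal driver (the URL and output filename are placeholders):

python
import asyncio
import base64

if __name__ == "__main__":
    data = asyncio.run(crawl_with_advanced_config("https://example.com"))
    if data["success"]:
        print(data["content"])  # Extracted markdown
        if data["screenshot"]:
            with open("page.png", "wb") as f:
                f.write(base64.b64decode(data["screenshot"]))
    else:
        print("Crawl failed")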