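AsyncWebCrawler Class Overview

The AsyncWebCrawler class is the main interface for web crawling operations. It provides asynchronous crawling along with extensive configuration options.

Constructor Parameters

Browser Settings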

browser_type (str, optional): Defaults to "chromium". Accepted values are "chromium", "firefox", and "webkit". Selects the browser engine to use.
```python
# Use Firefox
crawler = AsyncWebCrawler(browser_type="firefox")
```
headless (bool, optional): Defaults to True. When True, the browser runs without a visible window. Set it to False when debugging.
```python
# Show the browser window for debugging
crawler = AsyncWebCrawler(headless=False)
```

verbose (bool, optional): Defaults to False. Enables detailed logging.
```python
# Enable detailed logging
crawler = AsyncWebCrawler(verbose=True)
```

Cache Settings

always_by_pass_cache (bool, optional): Defaults to False. When True, every request fetches fresh content instead of reading from the cache.
```python
# Always fetch fresh content
crawler = AsyncWebCrawler(always_by_pass_cache=True)
```
base_directory (str, optional): Defaults to the user's home directory. Base path under which the cache is stored.
```python
# Custom cache directory
crawler = AsyncWebCrawler(base_directory="/path/to/cache")
```
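To make the two cache parameters more concrete, here is a minimal sketch of a file-based cache lookup. It is not the library's actual implementation: the helper names `cache_path` and `get_page`, the `.cache_sketch` folder, and the SHA-256 file naming are all assumptions made for illustration, and `fetch` stands in for a real browser fetch.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cache_path(base_directory: str, url: str) -> Path:
    """Map a URL to a file under base_directory (hypothetical layout)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(base_directory) / ".cache_sketch" / f"{digest}.html"


def get_page(url: str, fetch: Callable[[str], str], base_directory: str,
             bypass_cache: bool = False) -> str:
    """Return cached HTML unless bypass_cache is set; otherwise fetch and store."""
    path = cache_path(base_directory, url)
    if not bypass_cache and path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)  # stand-in for a real browser fetch
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html


if __name__ == "__main__":
    calls = []

    def fake_fetch(url: str) -> str:
        calls.append(url)
        return f"<html>fetch number {len(calls)}</html>"

    with tempfile.TemporaryDirectory() as tmp:
        get_page("https://example.com", fake_fetch, tmp)  # fetched
        get_page("https://example.com", fake_fetch, tmp)  # read from cache
        get_page("https://example.com", fake_fetch, tmp, bypass_cache=True)  # fetched again
        print(f"fetch was called {len(calls)} times")
```

In this sketch a bypassed request still refreshes the stored copy, so later cached reads see the newest content. Whether the library does the same is not stated in this document.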

Network Settings

proxy (str, optional): URL of a simple proxy.
```python
# Use a simple proxy
crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
```

proxy_config (Dict, optional): Advanced proxy configuration, including authentication.
```python
# Advanced proxy configuration
crawler = AsyncWebCrawler(proxy_config={
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass"
})
```
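When proxy credentials arrive as a single URL, they can be split into the `server`, `username`, and `password` keys shown above. The following is a minimal sketch using only the standard library. The helper name `proxy_config_from_url` is hypothetical and not part of AsyncWebCrawler, and percent-encoded credentials are passed through without decoding.

```python
from typing import Dict
from urllib.parse import urlsplit


def proxy_config_from_url(proxy_url: str) -> Dict[str, str]:
    """Split a proxy URL with embedded credentials into a proxy_config dict."""
    parts = urlsplit(proxy_url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not a valid proxy URL: {proxy_url!r}")

    # Rebuild the server address without the user:password@ part.
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port is not None:
        server += f":{parts.port}"

    config = {"server": server}
    if parts.username:
        config["username"] = parts.username
    if parts.password:
        config["password"] = parts.password
    return config


if __name__ == "__main__":
    print(proxy_config_from_url("http://user:pass@proxy.example.com:8080"))
```

The resulting dictionary can then be passed as `proxy_config=...`. A URL without credentials yields only the `server` key.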

Browser Behavior

sleep_on_close (bool, optional): Defaults to False. Adds a delay before the browser is closed.
```python
# Wait before closing
crawler = AsyncWebCrawler(sleep_on_close=True)
```

Custom Settings

user_agent (str, optional): Custom user agent string.
```python
# Custom user agent
crawler = AsyncWebCrawler(
    user_agent="Mozilla/5.0 (Custom Agent) Chrome/90.0"
)
```

headers (Dict[str, str], optional): Custom HTTP headers.
```python
# Custom headers
crawler = AsyncWebCrawler(
    headers={
        "Accept-Language": "en-US",
        "Custom-Header": "Value"
    }
)
```

js_code (Union[str, List[str]], optional): Default JavaScript to execute on every page.
```python
# Default JavaScript
crawler = AsyncWebCrawler(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more').click();"
    ]
)
```
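Because `js_code` accepts either a single string or a list of strings, calling code often wants one uniform shape. Here is a small sketch of such a normalization step. The helper `normalize_js_code` is hypothetical, and dropping blank snippets is a choice made for this example rather than documented library behavior.

```python
from typing import List, Optional, Union


def normalize_js_code(js_code: Optional[Union[str, List[str]]]) -> List[str]:
    """Return js_code as a list of non-empty snippets."""
    if js_code is None:
        return []
    if isinstance(js_code, str):
        js_code = [js_code]
    # Skip snippets that contain only whitespace.
    return [snippet for snippet in js_code if snippet.strip()]


if __name__ == "__main__":
    print(normalize_js_code("window.scrollTo(0, document.body.scrollHeight);"))
    print(normalize_js_code(None))
```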

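Methods

arun()

The primary method for crawling a web page.

```python
async def arun(
    self,
    # Required
    url: str,                              # URL to crawl

    # Content selection
    css_selector: str = None,              # CSS selector for the content
    word_count_threshold: int = 10,        # Minimum word count per block

    # Cache control
    bypass_cache: bool = False,            # Bypass the cache for this request

    # Session management
    session_id: str = None,                # Session identifier

    # Screenshot options
    screenshot: bool = False,              # Take a screenshot
    screenshot_wait_for: float = None,     # Wait before taking the screenshot

    # Content processing
    process_iframes: bool = False,         # Process iframe content
    remove_overlay_elements: bool = False, # Remove popups/modals

    # Anti-bot settings
    simulate_user: bool = False,           # Simulate human behavior
    override_navigator: bool = False,      # Override navigator properties
    magic: bool = False,                   # Enable all anti-detection features

    # Content filtering
    excluded_tags: List[str] = None,       # HTML tags to exclude
    exclude_external_links: bool = False,  # Remove external links
    exclude_social_media_links: bool = False, # Remove social media links

    # JavaScript handling
    js_code: Union[str, List[str]] = None, # JavaScript to execute
    wait_for: str = None,                  # Wait condition

    # Page loading
    page_timeout: int = 60000,             # Page load timeout (milliseconds)
    delay_before_return_html: float = None, # Wait before returning the HTML

    # Extraction
    extraction_strategy: ExtractionStrategy = None  # Extraction strategy
) -> CrawlResult:
    ...
```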
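As an illustration of what a "minimum word count per block" can mean, the sketch below filters a list of text blocks with a threshold. It is only an assumption about the idea behind `word_count_threshold`: the helper `filter_blocks` is hypothetical, words are split on whitespace, and the comparison is inclusive. The library's real tokenization and comparison may differ.

```python
from typing import List


def filter_blocks(blocks: List[str], word_count_threshold: int = 10) -> List[str]:
    """Keep only text blocks with at least word_count_threshold words."""
    return [block for block in blocks
            if len(block.split()) >= word_count_threshold]


if __name__ == "__main__":
    blocks = [
        "Home",
        "Sign in or register",
        "This paragraph is long enough to count as real page content here.",
    ]
    # Short navigation labels fall below the threshold and are dropped.
    print(filter_blocks(blocks, word_count_threshold=10))
```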

Usage Examples

Basic Crawl

```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
```

Advanced Crawl

```python
async with AsyncWebCrawler(
    browser_type="firefox",
    verbose=True,
    headers={"Custom-Header": "Value"}
) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        css_selector=".main-content",
        word_count_threshold=20,
        process_iframes=True,
        magic=True,
        wait_for="css:.dynamic-content",
        screenshot=True
    )
```
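The `wait_for` value in the example above carries a `css:` prefix followed by a selector. The sketch below shows one way such a prefixed string could be split into a kind and an expression. The `js:` prefix and the fallback to a CSS selector when no prefix is present are assumptions of this sketch, and `parse_wait_for` is a hypothetical helper, not a library function.

```python
from typing import Tuple


def parse_wait_for(wait_for: str) -> Tuple[str, str]:
    """Split a wait_for value into (kind, expression)."""
    for prefix in ("css:", "js:"):  # "js:" is assumed, not taken from this page
        if wait_for.startswith(prefix):
            return prefix[:-1], wait_for[len(prefix):].strip()
    # No recognized prefix: this sketch treats the value as a CSS selector.
    return "css", wait_for.strip()


if __name__ == "__main__":
    print(parse_wait_for("css:.dynamic-content"))
```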

Session Management

```python
async with AsyncWebCrawler() as crawler:
    # First request
    result1 = await crawler.arun(
        url="https://example.com/login",
        session_id="my_session"
    )

    # Follow-up request that reuses the same session
    result2 = await crawler.arun(
        url="https://example.com/protected",
        session_id="my_session"
    )
```
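The effect of reusing a `session_id` can be pictured with a toy model: state created by the first request is visible to later requests that use the same identifier, while requests without it start fresh. The class `SessionStoreSketch` below is a hypothetical stand-in that opens no browser, and the `auth` entry merely imitates a login cookie.

```python
import asyncio
from typing import Dict, Optional


class SessionStoreSketch:
    """Toy stand-in that keeps per-session state keyed by session_id."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Dict[str, str]] = {}

    async def visit(self, url: str,
                    session_id: Optional[str] = None) -> Dict[str, str]:
        if session_id is None:
            state: Dict[str, str] = {}  # no session: start from empty state
        else:
            state = self._sessions.setdefault(session_id, {})
        if url.endswith("/login"):
            state["auth"] = "ok"  # pretend the login page set a cookie
        await asyncio.sleep(0)  # yield to the event loop, as a real fetch would
        return dict(state)  # copy so callers cannot mutate the stored session


async def demo() -> None:
    store = SessionStoreSketch()
    await store.visit("https://example.com/login", session_id="my_session")
    same = await store.visit("https://example.com/protected",
                             session_id="my_session")
    other = await store.visit("https://example.com/protected")
    print("same session:", same, "| no session:", other)


if __name__ == "__main__":
    asyncio.run(demo())
```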

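Context Manager

AsyncWebCrawler implements the asynchronous context manager protocol:

```python
class AsyncWebCrawler:
    async def __aenter__(self) -> 'AsyncWebCrawler':
        # Initialize the browser and resources
        return self

    async def __aexit__(self, *args):
        # Clean up resources
        pass
```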
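To see why the protocol matters, the following self-contained sketch uses a toy class in place of the crawler and shows that `__aexit__` still runs when the body of the `async with` block raises. `ManagedResource` and its `started` and `closed` flags are hypothetical names used only for this demonstration.

```python
import asyncio


class ManagedResource:
    """Toy class showing the shape of the async context manager protocol."""

    def __init__(self) -> None:
        self.started = False
        self.closed = False

    async def __aenter__(self) -> "ManagedResource":
        self.started = True  # a real crawler would launch its browser here
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        self.closed = True  # runs even when the body raises
        return False  # do not swallow the exception


async def main() -> ManagedResource:
    resource = ManagedResource()
    try:
        async with resource:
            raise RuntimeError("simulated crawl failure")
    except RuntimeError:
        pass  # the error propagates, but cleanup has already happened
    return resource


if __name__ == "__main__":
    r = asyncio.run(main())
    print("started:", r.started, "closed:", r.closed)
```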

Always use AsyncWebCrawler through an async context manager:

```python
async with AsyncWebCrawler() as crawler:
    # Write your crawling code here
    pass
```

Best Practices

Resource Management

```python
# Always use the context manager
async with AsyncWebCrawler() as crawler:
    # The crawler is cleaned up properly on exit
    pass
```

Error Handling

```python
try:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        if not result.success:
            print(f"Crawl failed: {result.error_message}")
except Exception as e:
    print(f"Error: {e}")
```
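The same `success` / `error_message` check can drive a simple retry loop. The sketch below is one possible pattern, not something the library provides: `ResultSketch` mimics only the two result fields used above, and `crawl_with_retry`, its `attempts` and `base_delay` parameters, and the exponential backoff are all assumptions for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class ResultSketch:
    """Mimics only the two fields used in the error-handling example."""
    success: bool
    error_message: Optional[str] = None


async def crawl_with_retry(
    crawl: Callable[[str], Awaitable[ResultSketch]],
    url: str,
    attempts: int = 3,
    base_delay: float = 0.01,
) -> ResultSketch:
    """Call crawl(url) until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    result = ResultSketch(False, "not attempted")
    for attempt in range(attempts):
        try:
            result = await crawl(url)
        except Exception as exc:  # treat raised errors like failed results
            result = ResultSketch(False, str(exc))
        if result.success:
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
    return result
```

With a real crawler, `crawl` could be a small wrapper such as `lambda u: crawler.arun(url=u)`, since only `success` and `error_message` are read from the result.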

Performance Optimization

```python
# Keep caching enabled for better performance
crawler = AsyncWebCrawler(
    always_by_pass_cache=False,
    verbose=True
)
```

Anti-Detection

```python
# Maximum stealth mode
async with AsyncWebCrawler(
    headless=True,
    user_agent="Mozilla/5.0...",
    headers={"Accept-Language": "en-US"}
) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True,
        simulate_user=True
    )
```

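Notes on Browser Types

Each browser type has its own characteristics:

chromium: best overall compatibility
firefox: suited to specific use cases
webkit: lighter weight, suited to basic crawling

Choose based on your specific needs:

```python
# High compatibility
crawler = AsyncWebCrawler(browser_type="chromium")

# Memory efficient
crawler = AsyncWebCrawler(browser_type="webkit")
```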
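Since only three engine names are listed, a configuration value can be checked before it is handed to the constructor. This is a minimal sketch: `validate_browser_type` and `SUPPORTED_BROWSER_TYPES` are hypothetical names, and the case-insensitive matching is a convenience of this example rather than documented behavior.

```python
SUPPORTED_BROWSER_TYPES = ("chromium", "firefox", "webkit")


def validate_browser_type(browser_type: str = "chromium") -> str:
    """Return a normalized browser_type or raise ValueError."""
    normalized = browser_type.strip().lower()
    if normalized not in SUPPORTED_BROWSER_TYPES:
        allowed = ", ".join(SUPPORTED_BROWSER_TYPES)
        raise ValueError(
            f"unsupported browser_type {browser_type!r}; expected one of: {allowed}"
        )
    return normalized


if __name__ == "__main__":
    print(validate_browser_type("WebKit"))
```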
