Complete Parameter Guide for arun()

The following parameters can be passed to the arun() method. They are grouped by primary usage context and function.

Core Parameters

python

await crawler.arun(
    url="https://example.com",  # Required: URL to crawl
    verbose=True,               # Enable detailed logging
    bypass_cache=False,         # Set to True to skip the cache for this request
    warmup=True                 # Whether to run the warmup check
)
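The snippets in this guide assume an existing `crawler` object. A minimal sketch of where it might come from, assuming the `crawl4ai` package exposes `AsyncWebCrawler` as an async context manager and that the result object carries a `success` flag. The helper name `crawl_with_core_params` is hypothetical, and the import sits inside the function so the snippet still loads where the package is not installed.

```python
import asyncio

# Core parameters copied from the example above.
CORE_PARAMS = {
    "verbose": True,
    "bypass_cache": False,
    "warmup": True,
}


async def crawl_with_core_params(url: str):
    """Open a crawler, fetch one URL with the core parameters, return the result."""
    # Assumption: crawl4ai provides AsyncWebCrawler as an async context manager.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        return await crawler.arun(url=url, **CORE_PARAMS)


# Usage (needs crawl4ai installed and network access):
# result = asyncio.run(crawl_with_core_params("https://example.com"))
# print(result.success)
```

Keeping the parameters in a dictionary makes it easy to reuse the same settings across several calls.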

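Text Processing Parameters

python

await crawler.arun(
    word_count_threshold=10,                 # Minimum number of words per content block
    image_description_min_word_threshold=5,  # Minimum number of words for image descriptions
    only_text=False,                         # Extract text content only
    excluded_tags=['form', 'nav'],           # HTML tags to exclude
    keep_data_attributes=False,              # Preserve data-* attributes
)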

Content Selection Parameters

python

await crawler.arun(
    css_selector=".main-content",  # CSS selector for content extraction
    remove_forms=True,             # Remove all form elements
    remove_overlay_elements=True,  # Remove popups/modals/overlays
)

Link Handling Parameters

python

await crawler.arun(
    exclude_external_links=True,          # Remove external links
    exclude_social_media_links=True,      # Remove social media links
    exclude_external_images=True,         # Remove external images
    exclude_domains=["ads.example.com"],  # Specific domains to exclude
    social_media_domains=[                # Additional social media domains
        "facebook.com",
        "twitter.com",
        "instagram.com"
    ]
)

Browser Control Parameters

Basic Browser Settings

python

await crawler.arun(
    headless=True,              # Run the browser in headless mode
    browser_type="chromium",    # Browser engine: "chromium", "firefox", "webkit"
    page_timeout=60000,         # Page load timeout (milliseconds)
    user_agent="custom-agent",  # Custom user agent
)

Navigation and Waiting

python

await crawler.arun(
    wait_for="css:.dynamic-content",  # Wait for an element/condition
    delay_before_return_html=2.0,     # Wait before returning the HTML (seconds)
)

JavaScript Execution

python

await crawler.arun(
    js_code=[                     # JavaScript to execute (string or list)
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more').click();"
    ],
    js_only=False,                # Only execute JavaScript, without reloading the page
)

Anti-Bot Features

python

await crawler.arun(
    magic=True,              # Enable all anti-detection features
    simulate_user=True,      # Simulate human behavior
    override_navigator=True  # Override navigator properties
)

Session Management

python

await crawler.arun(
    session_id="my_session",  # Session identifier for persistent browsing
)

Screenshot Options

python

await crawler.arun(
    screenshot=True,          # Take a screenshot of the page
    screenshot_wait_for=2.0,  # Wait before taking the screenshot (seconds)
)

Proxy Configuration

python

await crawler.arun(
    proxy="http://proxy.example.com:8080",  # Simple proxy URL
    proxy_config={                          # Advanced proxy settings
        "server": "http://proxy.example.com:8080",
        "username": "user",
        "password": "pass"
    }
)
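Proxy credentials are often supplied as a single URL. A minimal sketch, using only the standard library, of splitting such a URL into the `server`/`username`/`password` shape shown above. The helper name `proxy_config_from_url` is hypothetical and not part of Crawl4AI.

```python
from urllib.parse import unquote, urlsplit


def proxy_config_from_url(proxy_url: str) -> dict:
    """Split a proxy URL into the dict shape used by proxy_config above."""
    parts = urlsplit(proxy_url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not a valid proxy URL: {proxy_url!r}")

    # Rebuild the server address without the embedded credentials.
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port is not None:
        server += f":{parts.port}"

    config = {"server": server}
    # Credentials are optional; percent-decoding restores characters like "@".
    if parts.username:
        config["username"] = unquote(parts.username)
    if parts.password:
        config["password"] = unquote(parts.password)
    return config


# Usage: await crawler.arun(url="https://example.com",
#                           proxy_config=proxy_config_from_url("http://user:pass@proxy.example.com:8080"))
```

Building the dictionary in one place keeps credentials out of the `server` string.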

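Content Extraction Parameters

Extraction Strategy

python

from crawl4ai.extraction_strategy import LLMExtractionStrategy

# MySchema is assumed to be a user-defined Pydantic model.
await crawler.arun(
    extraction_strategy=LLMExtractionStrategy(
        provider="ollama/llama2",
        schema=MySchema.schema(),
        instruction="Extract specific data"
    )
)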

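Chunking Strategy

python

from crawl4ai.chunking_strategy import RegexChunking

await crawler.arun(
    chunking_strategy=RegexChunking(
        patterns=[r'\n\n', r'\.\s+']  # Split on blank lines and sentence endings
    )
)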

HTML to Text Options

python

await crawler.arun(
    html2text={
        "ignore_links": False,
        "ignore_images": False,
        "escape_dot": False,
        "body_width": 0,
        "protect_links": True,
        "unicode_snob": True
    }
)

Debugging Options

python

await crawler.arun(
    log_console=True,  # Log browser console messages
)

Parameter Interactions and Notes

Magic Mode Combinations

python

# Full anti-detection setup
await crawler.arun(
    magic=True,
    headless=False,
    simulate_user=True,
    override_navigator=True
)

Dynamic Content Handling

python

# Handle lazy-loaded content
await crawler.arun(
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    wait_for="css:.lazy-content",
    delay_before_return_html=2.0
)
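The lazy-loading combination above can be packaged as a small helper that builds keyword arguments for `arun()`. This is a sketch under stated assumptions: the helper name `lazy_load_kwargs` and its `extra_js` parameter are hypothetical, and passing several statements relies on `js_code` accepting a list, as noted under JavaScript Execution.

```python
SCROLL_TO_BOTTOM = "window.scrollTo(0, document.body.scrollHeight);"


def lazy_load_kwargs(selector: str, delay: float = 2.0, extra_js=()) -> dict:
    """Build arun() keyword arguments for pages that load content on scroll."""
    if not selector:
        raise ValueError("selector must not be empty")
    return {
        # Scroll first, then run any extra statements the caller supplies.
        "js_code": [SCROLL_TO_BOTTOM, *extra_js],
        # The "css:" prefix follows the wait_for example in this guide.
        "wait_for": f"css:{selector}",
        "delay_before_return_html": delay,
    }


# Usage: await crawler.arun(url="https://example.com", **lazy_load_kwargs(".lazy-content"))
```

Returning a plain dictionary lets the same settings be combined with other parameters through `**` unpacking.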

Content Extraction Pipeline

python

# Complete extraction setup
await crawler.arun(
    css_selector=".main-content",
    word_count_threshold=20,
    extraction_strategy=my_strategy,
    chunking_strategy=my_chunking,
    process_iframes=True,
    remove_overlay_elements=True
)

Best Practices

Performance Optimization

python

await crawler.arun(
    bypass_cache=False,       # Use the cache whenever possible
    word_count_threshold=10,  # Filter out noise
    process_iframes=False     # Skip iframes if they are not needed
)

Reliable Crawling

python

await crawler.arun(
    magic=True,                    # Enable anti-detection
    delay_before_return_html=1.0,  # Wait for dynamic content
    page_timeout=60000             # Use a longer timeout for slow pages
)

Clean Content

python

await crawler.arun(
    remove_overlay_elements=True,    # Remove popups
    excluded_tags=['nav', 'aside'],  # Remove unnecessary elements
    keep_data_attributes=False       # Remove data attributes
)
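The three best-practice presets above share no keys, so they can be combined into a single call. A minimal sketch, using a hypothetical helper `merge_presets` (plain dictionary merging, not a Crawl4AI feature). If a key is repeated, the later preset wins.

```python
# Presets copied from the three best-practice examples above.
PERFORMANCE = {"bypass_cache": False, "word_count_threshold": 10, "process_iframes": False}
RELIABLE = {"magic": True, "delay_before_return_html": 1.0, "page_timeout": 60000}
CLEAN = {
    "remove_overlay_elements": True,
    "excluded_tags": ["nav", "aside"],
    "keep_data_attributes": False,
}


def merge_presets(*presets: dict) -> dict:
    """Merge presets left to right into a new dict; inputs are not modified."""
    merged = {}
    for preset in presets:
        merged.update(preset)
    return merged


# Usage: await crawler.arun(url="https://example.com",
#                           **merge_presets(PERFORMANCE, RELIABLE, CLEAN))
```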
