Crawl4AI Output Formats Guide

Crawl4AI provides several output formats to suit different needs, from raw HTML to structured data extracted with an LLM or with pattern-based rules.

Basic Formats

python

result = await crawler.arun(url="https://example.com")

# Access the different formats
raw_html = result.html           # raw HTML
clean_html = result.cleaned_html # cleaned HTML
markdown = result.markdown       # standard Markdown
fit_md = result.fit_markdown     # most relevant content, as Markdown
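
Because a given page may not yield every format, it can help to fall back from the most focused format to the least. The following is a minimal sketch, not part of Crawl4AI: `best_text` is a hypothetical helper, the preference order is an assumption, and a `SimpleNamespace` stands in for a real crawl result so the snippet runs without network access.

```python
from types import SimpleNamespace


def best_text(result) -> str:
    """Return the most focused non-empty format available on a result."""
    # Assumed preference order for this sketch: most focused format first.
    for attr in ("fit_markdown", "markdown", "cleaned_html", "html"):
        value = getattr(result, attr, None)
        if isinstance(value, str) and value.strip():
            return value
    return ""


# Stand-in for a crawl result (hypothetical data, no network needed).
fake_result = SimpleNamespace(
    html="<html><body><h1>Title</h1></body></html>",
    cleaned_html="<h1>Title</h1>",
    markdown="# Title",
    fit_markdown="",
)
chosen = best_text(fake_result)  # falls through the empty fit_markdown
```

With a real crawl, the same helper could be called on the object returned by `crawler.arun(...)`.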

Raw HTML

The original, unmodified HTML of the page. Useful when you need to:

Preserve the exact page structure
Process the HTML with your own tools
Debug page issues

python

result = await crawler.arun(url="https://example.com")
print(result.html)  # full HTML, including the head, scripts, and so on

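Cleaned HTML

HTML with unnecessary elements removed. The cleaning step automatically:

Removes scripts and styles
Cleans up formatting
Preserves semantic structure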

python

result = await crawler.arun(
    url="https://example.com", 
    excluded_tags=['form', 'header', 'footer'],  # additional tags to remove
    keep_data_attributes=False  # strip data-* attributes
)
print(result.cleaned_html)
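
To make the effect of these options concrete, here is a rough standard-library sketch of the kind of cleaning described: dropping scripts, styles, and excluded tags together with their contents, and stripping `data-*` attributes. It is an illustration only and not Crawl4AI's implementation; `SketchCleaner` and `clean_html` are hypothetical names.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class SketchCleaner(HTMLParser):
    """Drops chosen tags (with their content) and data-* attributes."""

    def __init__(self, excluded_tags, keep_data_attributes=False):
        super().__init__(convert_charrefs=True)
        self.excluded = set(excluded_tags) | {"script", "style"}
        self.keep_data = keep_data_attributes
        self.skip_depth = 0  # > 0 while inside an excluded element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if self.skip_depth:
            return
        rendered = ""
        for name, value in attrs:
            if name.startswith("data-") and not self.keep_data:
                continue
            safe = escape(value or "", quote=True)
            rendered += f' {name}="{safe}"'
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # void elements have no closing tag
        if tag in self.excluded:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_html(html, excluded_tags=(), keep_data_attributes=False):
    parser = SketchCleaner(excluded_tags, keep_data_attributes)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = (
    '<header>Site nav</header>'
    '<p data-id="7" class="lead">Hello</p>'
    '<script>track()</script>'
    '<form><input name="q"></form>'
)
cleaned = clean_html(sample, excluded_tags=["form", "header", "footer"])
```

The sketch ignores comments, doctype declarations, and malformed nesting, which a real cleaner has to handle.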

Standard Markdown

Converts the HTML into clean Markdown. Well suited to:

Content analysis
Documentation
Readability

python

result = await crawler.arun(
    url="https://example.com", 
    include_links_on_markdown=True  # include links in the Markdown output
)
print(result.markdown)

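Fit Markdown (Most Relevant Content)

Extracts the most relevant content from the page and converts it to Markdown. Ideal for:

Article extraction
Focusing on the main content
Removing boilerplate text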

python

result = await crawler.arun(url="https://example.com")
print(result.fit_markdown)  # main content only

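Structured Data Extraction

Crawl4AI offers two methods for extracting structured data:

1. LLM-Based Extraction

Use any LLM (OpenAI, HuggingFace, Ollama, and so on) to extract structured data against a schema you define: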

python

import json
from typing import List

from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class KnowledgeGraph(BaseModel):
    entities: List[dict]
    relationships: List[dict]

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",  # other providers: "openai/...", "huggingface/..."
    api_token="your-token",   # not needed for Ollama
    schema=KnowledgeGraph.schema(),
    instruction="Extract the entities and relationships from the content"
)

result = await crawler.arun(
    url="https://example.com", 
    extraction_strategy=strategy
)
# extracted_content is a JSON string
knowledge_graph = json.loads(result.extracted_content)
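
LLM output is not guaranteed to be valid JSON, and a failed extraction may leave `extracted_content` empty, so a defensive parse can be worth the extra lines. The sketch below is one possible approach; `parse_extracted` is a hypothetical helper, and the sample payload is invented to match the `KnowledgeGraph` shape.

```python
import json
from typing import Any, List, Optional


def parse_extracted(content: Optional[str]) -> List[Any]:
    """Parse extracted_content into a list, returning [] on bad input."""
    if not content:
        return []
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        return []
    # Normalise a single object to a one-item list for uniform handling.
    return data if isinstance(data, list) else [data]


# Hypothetical payload shaped like the KnowledgeGraph schema.
sample = '{"entities": [{"name": "Crawl4AI"}], "relationships": []}'
graphs = parse_extracted(sample)
```

Returning an empty list on failure keeps downstream loops simple; raising instead may be preferable when a missing extraction should stop the pipeline.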

2. Pattern-Based Extraction

For pages with repeating patterns (for example, product listings or article feeds), use JsonCssExtractionStrategy:

python

import json

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product Listing",
    "baseSelector": ".product-card",  # the repeating element
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "description", "selector": ".desc", "type": "text"}
    ]
}

strategy = JsonCssExtractionStrategy(schema)
result = await crawler.arun(
    url="https://example.com", 
    extraction_strategy=strategy
)
products = json.loads(result.extracted_content)
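
Selectors tend to break silently when a site changes its markup, so checking the extracted items against the schema can catch problems early. This sketch assumes the output is a list with one dict per `baseSelector` match, keyed by field name; `missing_fields` is a hypothetical helper and the `products` data is made up.

```python
from typing import Dict, List

schema = {
    "name": "Product Listing",
    "baseSelector": ".product-card",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "description", "selector": ".desc", "type": "text"},
    ],
}


def missing_fields(schema: dict, items: List[dict]) -> Dict[int, List[str]]:
    """Map item index -> schema field names that are absent or empty."""
    expected = [field["name"] for field in schema["fields"]]
    report = {}
    for index, item in enumerate(items):
        missing = [name for name in expected if not item.get(name)]
        if missing:
            report[index] = missing
    return report


# Hypothetical extraction output: one complete item, one lacking a price.
products = [
    {"title": "Widget", "price": "$9.99", "description": "A small widget"},
    {"title": "Gadget", "description": "A large gadget"},
]
report = missing_fields(schema, products)
```

An empty report means every item carried a non-empty value for every field named in the schema.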

Content Customization

HTML-to-Text Options

Configure the Markdown conversion:

python

result = await crawler.arun(
    url="https://example.com", 
    html2text={
        "escape_dot": False,
        "body_width": 0,
        "protect_links": True,
        "unicode_snob": True
    }
)

Content Filters

Control which content is included:

python

result = await crawler.arun(
    url="https://example.com", 
    word_count_threshold=10,        # minimum number of words per block
    exclude_external_links=True,    # remove external links
    exclude_external_images=True,   # remove external images
    excluded_tags=['form', 'nav']   # remove specific HTML tags
)
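
As an illustration of what a per-block word threshold does, the sketch below drops short blocks such as navigation strips. It is not Crawl4AI's implementation: `filter_blocks` is a hypothetical helper, words are counted by whitespace splitting, and treating the threshold as inclusive is an assumption.

```python
from typing import List


def filter_blocks(blocks: List[str], word_count_threshold: int) -> List[str]:
    """Keep blocks with at least `word_count_threshold` whitespace-separated words."""
    return [b for b in blocks if len(b.split()) >= word_count_threshold]


blocks = [
    "Home | About | Contact",  # 5 tokens, dropped
    "Crawl4AI converts pages into several output formats for downstream use.",  # 10 tokens, kept
]
kept = filter_blocks(blocks, word_count_threshold=10)
```

A higher threshold removes more boilerplate but also risks discarding short, meaningful blocks such as captions or prices.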

Comprehensive Example

Here is how to use several output formats together:

python

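import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import (
    JsonCssExtractionStrategy,
    LLMExtractionStrategy,
)

# YourSchema (a pydantic model) and your_schema (a JsonCssExtractionStrategy
# schema dict) are placeholders that you must define before calling this.
async def crawl_content(url: str):
    async with AsyncWebCrawler() as crawler:
        # Crawl with content filters; the main content is read from fit_markdown below
        result = await crawler.arun(
            url=url,
            word_count_threshold=10,
            exclude_external_links=True
        )

        # Get structured data with an LLM
        llm_result = await crawler.arun(
            url=url,
            extraction_strategy=LLMExtractionStrategy(
                provider="ollama/nemotron",
                schema=YourSchema.schema(),
                instruction="Extract the key information"
            )
        )

        # Get repeating patterns, if any
        pattern_result = await crawler.arun(
            url=url,
            extraction_strategy=JsonCssExtractionStrategy(your_schema)
        )

        return {
            "main_content": result.fit_markdown,
            "structured_data": json.loads(llm_result.extracted_content),
            "pattern_data": json.loads(pattern_result.extracted_content),
            "media": result.media
        }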

MaXiaoTiao
