Crawl4AI Content Selection Guide

2024-11-11


Crawl4AI offers several ways to select and filter specific content on a web page. This guide shows how to target exactly the content you need.

CSS Selectors

The simplest way to extract specific content:

```python
# Extract specific content with a CSS selector
result = await crawler.arun(
    url="https://example.com",
    css_selector=".main-article"  # Target the main article content
)

# Multiple selectors
result = await crawler.arun(
    url="https://example.com",
    css_selector="article h1, article .content"  # Target the headline and body
)
```

Content Filtering

Control which content is included or excluded:

```python
result = await crawler.arun(
    url="https://example.com",
    # Content thresholds
    word_count_threshold=10,        # Minimum word count per block

    # Tag exclusions
    excluded_tags=['form', 'header', 'footer', 'nav'],

    # Link filtering
    exclude_external_links=True,    # Remove external links
    exclude_social_media_links=True,  # Remove social media links

    # Media filtering
    exclude_external_images=True   # Remove external images
)
```
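The `word_count_threshold` option drops text blocks that are too short to be real content, such as button labels or nav items. A minimal local sketch of the same idea (a hypothetical filter for illustration, not Crawl4AI's implementation):

```python
# Hypothetical illustration of a minimum-word-count filter, analogous
# to what word_count_threshold=10 does to extracted content blocks.
def filter_blocks(blocks, word_count_threshold=10):
    """Keep only blocks with at least `word_count_threshold` words."""
    return [b for b in blocks if len(b.split()) >= word_count_threshold]

blocks = [
    "Subscribe now!",  # 2 words -> dropped
    "This paragraph has enough words to count as real article content here.",
]
kept = filter_blocks(blocks)
print(len(kept))  # -> 1
```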

Iframe Content

Process content inside iframes:

```python
result = await crawler.arun(
    url="https://example.com",
    process_iframes=True,  # Extract iframe content
    remove_overlay_elements=True  # Remove popups/modals that might block iframes
)
```

Structured Content Selection

Intelligent Selection with LLMs

Use an LLM to intelligently extract specific types of content:

```python
import json
from typing import List

from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class ArticleContent(BaseModel):
    title: str
    main_points: List[str]
    conclusion: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",  # Works with any supported LLM
    schema=ArticleContent.schema(),
    instruction="Extract the article title, key points, and conclusion"
)

result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=strategy
)
article = json.loads(result.extracted_content)
```

Schema-Based Extraction

For repeating content patterns (such as product listings or news feeds):

```python
import json

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "News Articles",
    "baseSelector": "article.news-item",  # Repeating element
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "summary", "selector": ".summary", "type": "text"},
        {"name": "category", "selector": ".category", "type": "text"},
        {
            "name": "metadata",
            "type": "nested",
            "fields": [
                {"name": "author", "selector": ".author", "type": "text"},
                {"name": "date", "selector": ".date", "type": "text"}
            ]
        }
    ]
}

strategy = JsonCssExtractionStrategy(schema)
result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=strategy
)
articles = json.loads(result.extracted_content)
```
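The extracted content comes back as a JSON string: a list with one object per `baseSelector` match, and nested fields as sub-objects. A sketch with a hypothetical payload shows the shape you can expect from the schema above:

```python
import json

# Hypothetical payload in the shape JsonCssExtractionStrategy produces:
# one object per "article.news-item" match, with "metadata" nested.
sample = '''
[
  {
    "headline": "Example headline",
    "summary": "Short summary text.",
    "category": "Tech",
    "metadata": {"author": "A. Writer", "date": "2024-11-11"}
  }
]
'''
articles = json.loads(sample)
print(articles[0]["metadata"]["author"])  # -> A. Writer
```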

Domain-Based Filtering

Control content based on its domain:

```python
result = await crawler.arun(
    url="https://example.com",
    exclude_domains=["ads.com", "tracker.com"],
    exclude_social_media_domains=["facebook.com", "twitter.com"],  # Custom social media domains to exclude
    exclude_social_media_links=True
)
```
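Conceptually, domain exclusion means matching each link's host against the blocklist, including subdomains. A hypothetical standalone sketch of that check (not Crawl4AI's internals):

```python
from urllib.parse import urlparse

# Hypothetical sketch of domain-based link filtering: a link is excluded
# if its host equals, or is a subdomain of, an excluded domain.
def is_excluded(url, excluded_domains):
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in excluded_domains)

excluded = ["ads.com", "tracker.com"]
print(is_excluded("https://banner.ads.com/x", excluded))  # -> True
print(is_excluded("https://example.com/page", excluded))  # -> False
```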

Media Selection

Select specific types of media:

```python
result = await crawler.arun(url="https://example.com")

# Access different media types
images = result.media["images"]  # List of image details
videos = result.media["videos"]  # List of video details
audios = result.media["audios"]  # List of audio details

# Images with metadata
for image in images:
    print(f"URL: {image['src']}")
    print(f"Alt text: {image['alt']}")
    print(f"Description: {image['desc']}")
    print(f"Relevance score: {image['score']}")
```
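Since each image entry carries a relevance score, you can filter the media list after the crawl. A short sketch with hypothetical entries (the score threshold here is arbitrary):

```python
# Hypothetical image entries in the shape shown above.
images = [
    {"src": "https://example.com/hero.jpg", "alt": "Hero", "desc": "Main photo", "score": 8},
    {"src": "https://example.com/icon.png", "alt": "", "desc": "", "score": 1},
]

# Keep only images the crawler scored as likely relevant.
relevant = [img for img in images if img["score"] >= 5]
print([img["src"] for img in relevant])  # -> ['https://example.com/hero.jpg']
```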

Combined Example

Here is how to combine the different selection methods:

```python
import json
from typing import List

from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy

async def extract_article_content(url: str):
    # Define structured extraction
    article_schema = {
        "name": "Article",
        "baseSelector": "article.main",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"}
        ]
    }

    # Define LLM extraction
    class ArticleAnalysis(BaseModel):
        key_points: List[str]
        sentiment: str
        category: str

    async with AsyncWebCrawler() as crawler:
        # Get structured content
        pattern_result = await crawler.arun(
            url=url,
            extraction_strategy=JsonCssExtractionStrategy(article_schema),
            word_count_threshold=10,
            excluded_tags=['nav', 'footer'],
            exclude_external_links=True
        )

        # Get semantic analysis
        analysis_result = await crawler.arun(
            url=url,
            extraction_strategy=LLMExtractionStrategy(
                provider="ollama/nemotron",
                schema=ArticleAnalysis.schema(),
                instruction="Analyze the article content"
            )
        )

        # Combine the results
        return {
            "article": json.loads(pattern_result.extracted_content),
            "analysis": json.loads(analysis_result.extracted_content),
            "media": pattern_result.media
        }
```