# Crawl4AI Content Selection Guide
Crawl4AI provides multiple ways to select and filter specific content on a web page. This guide shows how to target exactly the content you need.
## CSS Selectors

The simplest way to extract specific content:
```python
# Extract specific content with a CSS selector
result = await crawler.arun(
    url="https://example.com",
    css_selector=".main-article"  # Target the main article content
)

# Multiple selectors
result = await crawler.arun(
    url="https://example.com",
    css_selector="article h1, article .content"  # Target the title and content
)
```
## Content Filtering

Control which content is included or excluded:
```python
result = await crawler.arun(
    url="https://example.com",

    # Content thresholds
    word_count_threshold=10,          # Minimum words per block

    # Tag exclusions
    excluded_tags=['form', 'header', 'footer', 'nav'],

    # Link filtering
    exclude_external_links=True,      # Remove external links
    exclude_social_media_links=True,  # Remove social media links

    # Media filtering
    exclude_external_images=True      # Remove external images
)
```
## Iframe Content

Process content inside iframes:
```python
result = await crawler.arun(
    url="https://example.com",
    process_iframes=True,         # Extract iframe content
    remove_overlay_elements=True  # Remove popups/modals that might block iframes
)
```
## Structured Content Selection

### Intelligent Selection with an LLM

Use an LLM to intelligently extract specific kinds of content:
```python
import json
from typing import List

from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class ArticleContent(BaseModel):
    title: str
    main_points: List[str]
    conclusion: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",  # Works with any supported LLM
    schema=ArticleContent.schema(),  # .model_json_schema() on Pydantic v2
    instruction="Extract the article title, key points, and conclusion"
)

result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=strategy
)

article = json.loads(result.extracted_content)
```
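Since `result.extracted_content` is a JSON string shaped by the Pydantic schema, consuming it is plain `json` work. A quick sketch, using invented sample data in place of real LLM output:

```python
import json

# Invented sample shaped like the ArticleContent schema above (not real output)
extracted_content = '''{
  "title": "Example Headline",
  "main_points": ["Point one", "Point two"],
  "conclusion": "Wrap-up."
}'''

article = json.loads(extracted_content)
print(sorted(article))            # ['conclusion', 'main_points', 'title']
print(article["main_points"])     # ['Point one', 'Point two']
```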
## Pattern-Based Selection

For repeating content patterns (e.g. product listings, news feeds):
```python
import json

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "News Articles",
    "baseSelector": "article.news-item",  # The repeating element
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "summary", "selector": ".summary", "type": "text"},
        {"name": "category", "selector": ".category", "type": "text"},
        {
            "name": "metadata",
            "type": "nested",
            "fields": [
                {"name": "author", "selector": ".author", "type": "text"},
                {"name": "date", "selector": ".date", "type": "text"}
            ]
        }
    ]
}

strategy = JsonCssExtractionStrategy(schema)

result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=strategy
)

articles = json.loads(result.extracted_content)
```
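For a schema like this, the extracted JSON is an array with one object per `baseSelector` match, and each `nested` field grouped into a sub-object. A sketch of walking that shape, with invented sample data:

```python
import json

# Invented sample of what extracted_content might look like for the schema above
extracted_content = json.dumps([
    {
        "headline": "Local team wins",
        "summary": "A short recap.",
        "category": "Sports",
        "metadata": {"author": "A. Writer", "date": "2024-01-01"},
    },
])

articles = json.loads(extracted_content)
for a in articles:
    print(a["headline"], "-", a["metadata"]["author"])
# Local team wins - A. Writer
```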
## Domain-Based Filtering

Control content based on its domain:
```python
result = await crawler.arun(
    url="https://example.com",
    exclude_domains=["ads.com", "tracker.com"],
    exclude_social_media_domains=["facebook.com", "twitter.com"],  # Custom social media domains to exclude
    exclude_social_media_links=True
)
```
## Media Selection

Select specific types of media:
```python
result = await crawler.arun(url="https://example.com")

# Access the different media types
images = result.media["images"]  # List of image details
videos = result.media["videos"]  # List of video details
audios = result.media["audios"]  # List of audio details

# Images with metadata
for image in images:
    print(f"URL: {image['src']}")
    print(f"Alt text: {image['alt']}")
    print(f"Description: {image['desc']}")
    print(f"Relevance score: {image['score']}")
```
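Because each image entry carries a relevance `score`, you can post-filter the media list yourself. A sketch with invented sample entries (the threshold value is an assumption, not a crawl4ai default):

```python
# Invented sample entries mirroring the fields shown above
images = [
    {"src": "https://example.com/hero.jpg", "alt": "Hero", "desc": "Lead image", "score": 8},
    {"src": "https://example.com/icon.png", "alt": "", "desc": "", "score": 1},
]

# Keep only images the crawler scored as relevant
relevant = [img for img in images if img["score"] >= 5]
print([img["src"] for img in relevant])  # ['https://example.com/hero.jpg']
```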
## Combined Example

Here is how to combine the different selection methods:
```python
import json
from typing import List

from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy

async def extract_article_content(url: str):
    # Define structured extraction
    article_schema = {
        "name": "Article",
        "baseSelector": "article.main",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"}
        ]
    }

    # Define LLM extraction
    class ArticleAnalysis(BaseModel):
        key_points: List[str]
        sentiment: str
        category: str

    async with AsyncWebCrawler() as crawler:
        # Get structured content
        pattern_result = await crawler.arun(
            url=url,
            extraction_strategy=JsonCssExtractionStrategy(article_schema),
            word_count_threshold=10,
            excluded_tags=['nav', 'footer'],
            exclude_external_links=True
        )

        # Get semantic analysis
        analysis_result = await crawler.arun(
            url=url,
            extraction_strategy=LLMExtractionStrategy(
                provider="ollama/nemotron",
                schema=ArticleAnalysis.schema(),
                instruction="Analyze the article content"
            )
        )

        # Combine the results
        return {
            "article": json.loads(pattern_result.extracted_content),
            "analysis": json.loads(analysis_result.extracted_content),
            "media": pattern_result.media
        }
```