搭建并使用向量数据库

一、前序配置

本节重点为搭建并使用向量数据库，因此读取数据后我们省去数据处理的环节直入主题，数据清洗等步骤可以参考第三节

In [1]:

import os
from dotenv import load_dotenv, find_dotenv

# 读取本地/项目的环境变量。
# find_dotenv()寻找并定位.env文件的路径
# load_dotenv()读取该.env文件，并将其中的环境变量加载到当前的运行环境中  
# 如果你设置的是全局的环境变量，这行代码则没有任何作用。
_ = load_dotenv(find_dotenv())

# 如果你需要通过代理端口访问，你需要如下配置
# os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'
# os.environ["HTTP_PROXY"] = 'http://127.0.0.1:7890'

# 获取folder_path下所有文件路径，储存在file_paths里
file_paths = []
folder_path = '../../data_base/knowledge_db'
for root, dirs, files in os.walk(folder_path):
    for file in files:
        file_path = os.path.join(root, file)
        file_paths.append(file_path)
print(file_paths[:3])

['../../data_base/knowledge_db/.DS_Store', '../../data_base/knowledge_db/prompt_engineering/6. 文本转换 Transforming.md', '../../data_base/knowledge_db/prompt_engineering/4. 文本概括 Summarizing.md']

In [2]:

from langchain.document_loaders.pdf import PyMuPDFLoader
from langchain.document_loaders.markdown import UnstructuredMarkdownLoader

# 遍历文件路径并把实例化的loader存放在loaders里
loaders = []

for file_path in file_paths:

    file_type = file_path.split('.')[-1]
    if file_type == 'pdf':
        loaders.append(PyMuPDFLoader(file_path))
    elif file_type == 'md':
        loaders.append(UnstructuredMarkdownLoader(file_path))

In [3]:

# 下载文件并存储到text
texts = []

for loader in loaders: texts.extend(loader.load())

载入后的变量类型为langchain_core.documents.base.Document, 文档变量类型同样包含两个属性

page_content 包含该文档的内容。
meta_data 为文档相关的描述性数据。

In [4]:

text = texts[1]
print(f"每一个元素的类型：{type(text)}.", 
    f"该文档的描述性数据：{text.metadata}", 
    f"查看该文档的内容:\n{text.page_content[0:]}", 
    sep="\n------\n")

每一个元素的类型：<class 'langchain_core.documents.base.Document'>.
------
该文档的描述性数据：{'source': '../../data_base/knowledge_db/prompt_engineering/4. 文本概括 Summarizing.md'}
------
查看该文档的内容:
第四章 文本概括

在繁忙的信息时代，小明是一名热心的开发者，面临着海量的文本信息处理的挑战。他需要通过研究无数的文献资料来为他的项目找到关键的信息，但是时间却远远不够。在他焦头烂额之际，他发现了大型语言模型（LLM）的文本摘要功能。

这个功能对小明来说如同灯塔一样，照亮了他处理信息海洋的道路。LLM 的强大能力在于它可以将复杂的文本信息简化，提炼出关键的观点，这对于他来说无疑是巨大的帮助。他不再需要花费大量的时间去阅读所有的文档，只需要用 LLM 将它们概括，就可以快速获取到他所需要的信息。

通过编程调用 AP I接口，小明成功实现了这个文本摘要的功能。他感叹道：“这简直就像一道魔法，将无尽的信息海洋变成了清晰的信息源泉。”小明的经历，展现了LLM文本摘要功能的巨大优势：节省时间，提高效率，以及精准获取信息。这就是我们本章要介绍的内容，让我们一起来探索如何利用编程和调用API接口，掌握这个强大的工具。

一、单一文本概括

以商品评论的总结任务为例：对于电商平台来说，网站上往往存在着海量的商品评论，这些评论反映了所有客户的想法。如果我们拥有一个工具去概括这些海量、冗长的评论，便能够快速地浏览更多评论，洞悉客户的偏好，从而指导平台与商家提供更优质的服务。

接下来我们提供一段在线商品评价作为示例，可能来自于一个在线购物平台，例如亚马逊、淘宝、京东等。评价者为一款熊猫公仔进行了点评，评价内容包括商品的质量、大小、价格和物流速度等因素，以及他的女儿对该商品的喜爱程度。

python
prod_review = """
这个熊猫公仔是我给女儿的生日礼物，她很喜欢，去哪都带着。
公仔很软，超级可爱，面部表情也很和善。但是相比于价钱来说，
它有点小，我感觉在别的地方用同样的价钱能买到更大的。
快递比预期提前了一天到货，所以在送给女儿之前，我自己玩了会。
"""

1.1 限制输出文本长度

我们首先尝试将文本的长度限制在30个字以内。

```python
from tool import get_completion

prompt = f"""
您的任务是从电子商务网站上生成一个产品评论的简短摘要。

请对三个反引号之间的评论文本进行概括，最多30个字。

评论: {prod_review}
"""

response = get_completion(prompt)
print(response)
```

我们可以看到语言模型给了我们一个符合要求的结果。

注意：在上一节中我们提到了语言模型在计算和判断文本长度时依赖于分词器，而分词器在字符统计方面不具备完美精度。

1.2 设置关键角度侧重

在某些情况下，我们会针对不同的业务场景对文本的侧重会有所不同。例如，在商品评论文本中，物流部门可能更专注于运输的时效性，商家则更关注价格和商品质量，而平台则更看重整体的用户体验。

我们可以通过增强输入提示（Prompt），来强调我们对某一特定视角的重视。

1.2.1 侧重于快递服务

```python
prompt = f"""
您的任务是从电子商务网站上生成一个产品评论的简短摘要。

请对三个反引号之间的评论文本进行概括，最多30个字，并且侧重在快递服务上。

评论: {prod_review}
"""

response = get_completion(prompt)
print(response)
```

通过输出结果，我们可以看到，文本以“快递提前到货”开头，体现了对于快递效率的侧重。

1.2.2 侧重于价格与质量

```python
prompt = f"""
您的任务是从电子商务网站上生成一个产品评论的简短摘要。

请对三个反引号之间的评论文本进行概括，最多30个词汇，并且侧重在产品价格和质量上。

评论: {prod_review}
"""

response = get_completion(prompt)
print(response)
```

通过输出的结果，我们可以看到，文本以“可爱的熊猫公仔，质量好但有点小，价格稍高”开头，体现了对于产品价格与质量的侧重。

1.3 关键信息提取

在1.2节中，虽然我们通过添加关键角度侧重的 Prompt ，确实让文本摘要更侧重于某一特定方面，然而，我们可以发现，在结果中也会保留一些其他信息，比如偏重价格与质量角度的概括中仍保留了“快递提前到货”的信息。如果我们只想要提取某一角度的信息，并过滤掉其他所有信息，则可以要求 LLM 进行 文本提取（Extract） 而非概括( Summarize )。

下面让我们来一起来对文本进行提取信息吧！

```python
prompt = f"""
您的任务是从电子商务网站上的产品评论中提取相关信息。

请从以下三个反引号之间的评论文本中提取产品运输相关的信息，最多30个词汇。

评论: {prod_review}
"""

response = get_completion(prompt)
print(response)
```

二、同时概括多条文本

在实际的工作流中，我们往往要处理大量的评论文本，下面的示例将多条用户评价集合在一个列表中，并利用 for 循环和文本概括（Summarize）提示词，将评价概括至小于 20 个词以下，并按顺序打印。当然，在实际生产中，对于不同规模的评论文本，除了使用 for 循环以外，还可能需要考虑整合评论、分布式等方法提升运算效率。您可以搭建主控面板，来总结大量用户评论，以及方便您或他人快速浏览，还可以点击查看原评论。这样，您就能高效掌握顾客的所有想法。

```python
review_1 = prod_review

一盏落地灯的评论

review_2 = """
我需要一盏漂亮的卧室灯，这款灯不仅具备额外的储物功能，价格也并不算太高。
收货速度非常快，仅用了两天的时间就送到了。
不过，在运输过程中，灯的拉线出了问题，幸好，公司很乐意寄送了一根全新的灯线。
新的灯线也很快就送到手了，只用了几天的时间。
装配非常容易。然而，之后我发现有一个零件丢失了，于是我联系了客服，他们迅速地给我寄来了缺失的零件！
对我来说，这是一家非常关心客户和产品的优秀公司。
"""

一把电动牙刷的评论

review_3 = """
我的牙科卫生员推荐了电动牙刷，所以我就买了这款。
到目前为止，电池续航表现相当不错。
初次充电后，我在第一周一直将充电器插着，为的是对电池进行条件养护。
过去的3周里，我每天早晚都使用它刷牙，但电池依然维持着原来的充电状态。
不过，牙刷头太小了。我见过比这个牙刷头还大的婴儿牙刷。
我希望牙刷头更大一些，带有不同长度的刷毛，
这样可以更好地清洁牙齿间的空隙，但这款牙刷做不到。
总的来说，如果你能以50美元左右的价格购买到这款牙刷，那是一个不错的交易。
制造商的替换刷头相当昂贵，但你可以购买价格更为合理的通用刷头。
这款牙刷让我感觉就像每天都去了一次牙医，我的牙齿感觉非常干净！
"""

一台搅拌机的评论

review_4 = """
在11月份期间，这个17件套装还在季节性促销中，售价约为49美元，打了五折左右。
可是由于某种原因（我们可以称之为价格上涨），到了12月的第二周，所有的价格都上涨了，
同样的套装价格涨到了70-89美元不等。而11件套装的价格也从之前的29美元上涨了约10美元。
看起来还算不错，但是如果你仔细看底座，刀片锁定的部分看起来没有前几年版本的那么漂亮。
然而，我打算非常小心地使用它
（例如，我会先在搅拌机中研磨豆类、冰块、大米等坚硬的食物，然后再将它们研磨成所需的粒度，
接着切换到打蛋器刀片以获得更细的面粉，如果我需要制作更细腻/少果肉的食物）。
在制作冰沙时，我会将要使用的水果和蔬菜切成细小块并冷冻
（如果使用菠菜，我会先轻微煮熟菠菜，然后冷冻，直到使用时准备食用。
如果要制作冰糕，我会使用一个小到中号的食物加工器），这样你就可以避免添加过多的冰块。
大约一年后，电机开始发出奇怪的声音。我打电话给客户服务，但保修期已经过期了，
所以我只好购买了另一台。值得注意的是，这类产品的整体质量在过去几年里有所下降
，所以他们在一定程度上依靠品牌认知和消费者忠诚来维持销售。在大约两天内，我收到了新的搅拌机。
"""

reviews = [review_1, review_2, review_3, review_4]

```

```python
for i in range(len(reviews)):
    prompt = f"""
    你的任务是从电子商务网站上的产品评论中提取相关信息。

```

三、英文版

1.1 单一文本概括

python
prod_review = """
Got this panda plush toy for my daughter's birthday, \
who loves it and takes it everywhere. It's soft and \ 
super cute, and its face has a friendly look. It's \ 
a bit small for what I paid though. I think there \ 
might be other options that are bigger for the \ 
same price. It arrived a day earlier than expected, \ 
so I got to play with it myself before I gave it \ 
to her.
"""

```python
prompt = f"""
Your task is to generate a short summary of a product \
review from an ecommerce site.

Summarize the review below, delimited by triple 
backticks, in at most 30 words.

Review: {prod_review}
"""

response = get_completion(prompt)
print(response)
```

1.2 设置关键角度侧重

1.2.1 侧重于快递服务

```python
prompt = f"""
Your task is to generate a short summary of a product \
review from an ecommerce site to give feedback to the \
Shipping deparmtment.

Summarize the review below, delimited by triple 
backticks, in at most 30 words, and focusing on any aspects \
that mention shipping and delivery of the product.

Review: {prod_review}
"""

response = get_completion(prompt)
print(response)
```

1.2.2 侧重于价格和质量

```python
prompt = f"""
Your task is to generate a short summary of a product \
review from an ecommerce site to give feedback to the \
pricing deparmtment, responsible for determining the \
price of the product.

Summarize the review below, delimited by triple 
backticks, in at most 30 words, and focusing on any aspects \
that are relevant to the price and perceived value.

Review: {prod_review}
"""

response = get_completion(prompt)
print(response)
```

1.3 关键信息提取

```python
prompt = f"""
Your task is to extract relevant information from \ 
a product review from an ecommerce site to give \
feedback to the Shipping department.

From the review below, delimited by triple quotes \
extract the information relevant to shipping and \ 
delivery. Limit to 30 words.

Review: {prod_review}
"""

response = get_completion(prompt)
print(response)
```

2.1 同时概括多条文本

```python
review_1 = prod_review

review for a standing lamp

review_2 = """
Needed a nice lamp for my bedroom, and this one \
had additional storage and not too high of a price \
point. Got it fast - arrived in 2 days. The string \
to the lamp broke during the transit and the company \
happily sent over a new one. Came within a few days \
as well. It was easy to put together. Then I had a \
missing part, so I contacted their support and they \
very quickly got me the missing piece! Seems to me \
to be a great company that cares about their customers \
and products. 
"""

review for an electric toothbrush

review_3 = """
My dental hygienist recommended an electric toothbrush, \
which is why I got this. The battery life seems to be \
pretty impressive so far. After initial charging and \
leaving the charger plugged in for the first week to \
condition the battery, I've unplugged the charger and \
been using it for twice daily brushing for the last \
3 weeks all on the same charge. But the toothbrush head \
is too small. I’ve seen baby toothbrushes bigger than \
this one. I wish the head was bigger with different \
length bristles to get between teeth better because \
this one doesn’t.  Overall if you can get this one \
around the $50 mark, it's a good deal. The manufactuer's \
replacements heads are pretty expensive, but you can \
get generic ones that're more reasonably priced. This \
toothbrush makes me feel like I've been to the dentist \
every day. My teeth feel sparkly clean! 
"""

review for a blender

review_4 = """
So, they still had the 17 piece system on seasonal \
sale for around $49 in the month of November, about \
half off, but for some reason (call it price gouging) \
around the second week of December the prices all went \
up to about anywhere from between $70-$89 for the same \
system. And the 11 piece system went up around $10 or \
so in price also from the earlier sale price of $29. \
So it looks okay, but if you look at the base, the part \
where the blade locks into place doesn’t look as good \
as in previous editions from a few years ago, but I \
plan to be very gentle with it (example, I crush \
very hard items like beans, ice, rice, etc. in the \
blender first then pulverize them in the serving size \
I want in the blender then switch to the whipping \
blade for a finer flour, and use the cross cutting blade \
first when making smoothies, then use the flat blade \
if I need them finer/less pulpy). Special tip when making \
smoothies, finely cut and freeze the fruits and \
vegetables (if using spinach-lightly stew soften the \
spinach then freeze until ready for use-and if making \
sorbet, use a small to medium sized food processor) \
that you plan to use that way you can avoid adding so \
much ice if at all-when making your smoothie. \
After about a year, the motor was making a funny noise. \
I called customer service but the warranty expired \
already, so I had to buy another one. FYI: The overall \
quality has gone done in these types of products, so \
they are kind of counting on brand recognition and \
consumer loyalty to maintain sales. Got it in about \
two days.
"""

reviews = [review_1, review_2, review_3, review_4]
```

```python
for i in range(len(reviews)):
    prompt = f"""
    Your task is to generate a short summary of a product \
    review from an ecommerce site.

```

In [5]:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# 切分文档
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50)

split_docs = text_splitter.split_documents(texts)

二、构建Chroma向量库

Langchain 集成了超过 30 个不同的向量存储库。我们选择 Chroma 是因为它轻量级且数据存储在内存中，这使得它非常容易启动和开始使用。

LangChain 可以直接使用 OpenAI 和百度千帆的 Embedding，同时，我们也可以针对其不支持的 Embedding API 进行自定义，例如，我们可以基于 LangChain 提供的接口，封装一个 zhupuai_embedding，来将智谱的 Embedding API 接入到 LangChain 中。在本章的附LangChain自定义Embedding封装讲解中，我们以智谱 Embedding API 为例，介绍了如何将其他 Embedding API 封装到 LangChain 中，欢迎感兴趣的读者阅读。

注：如果你使用智谱 API，你可以参考讲解内容实现封装代码，也可以直接使用我们已经封装好的代码zhipuai_embedding.py，将该代码同样下载到本 Notebook 的同级目录，就可以直接导入我们封装的函数。在下面的代码 Cell 中，我们默认使用了智谱的 Embedding，将其他两种 Embedding 使用代码以注释的方法呈现，如果你使用的是百度 API 或者 OpenAI API，可以根据情况来使用下方 Cell 中的代码。

In [6]:

# 使用 OpenAI Embedding
# from langchain.embeddings.openai import OpenAIEmbeddings
# 使用百度千帆 Embedding
# from langchain.embeddings.baidu_qianfan_endpoint import QianfanEmbeddingsEndpoint
# 使用我们自己封装的智谱 Embedding，需要将封装代码下载到本地使用
from zhipuai_embedding import ZhipuAIEmbeddings

# 定义 Embeddings
# embedding = OpenAIEmbeddings() 
embedding = ZhipuAIEmbeddings()
# embedding = QianfanEmbeddingsEndpoint()

# 定义持久化路径
persist_directory = '../../data_base/vector_db/chroma'

In [7]:

!rm -rf '../../data_base/vector_db/chroma'  # 删除旧的数据库文件（如果文件夹中有文件的话），windows电脑请手动删除

In [8]:

from langchain.vectorstores.chroma import Chroma

vectordb = Chroma.from_documents(
    documents=split_docs,
    embedding=embedding,
    persist_directory=persist_directory  # 允许我们将persist_directory目录保存到磁盘上
)

在此之后，我们要确保通过运行 vectordb.persist 来持久化向量数据库，以便我们在未来的课程中使用。

让我们保存它，以便以后使用！

In [9]:

vectordb.persist()

/Users/lta/anaconda3/envs/llm_universe_2.x/lib/python3.10/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: Since Chroma 0.4.x the manual persistence method is no longer supported as docs are automatically persisted.
  warn_deprecated(

In [10]:

print(f"向量库中存储的数量：{vectordb._collection.count()}")

向量库中存储的数量：925

三、向量检索

3.1 相似度检索

Chroma的相似度搜索使用的是余弦距离，即：similarity=cos(A,B)=A⋅B∥A∥∥B∥=∑1naibi∑1nai2∑1nbi2其中ai、bi分别是向量A、B的分量。

当你需要数据库返回严谨的按余弦相似度排序的结果时可以使用similarity_search函数。

In [11]:

question="什么是大语言模型"

In [12]:

sim_docs = vectordb.similarity_search(question,k=3)
print(f"检索到的内容数：{len(sim_docs)}")

检索到的内容数：3

In [13]:

for i, sim_doc in enumerate(sim_docs):
    print(f"检索到的第{i}个内容: \n{sim_doc.page_content[:200]}", end="\n--------------\n")

检索到的第0个内容: 
第六章 文本转换

大语言模型具有强大的文本转换能力，可以实现多语言翻译、拼写纠正、语法调整、格式转换等不同类型的文本转换任务。利用语言模型进行各类转换是它的典型应用之一。

在本章中,我们将介绍如何通过编程调用API接口，使用语言模型实现文本转换功能。通过代码示例，读者可以学习将输入文本转换成所需输出格式的具体方法。

掌握调用大语言模型接口进行文本转换的技能，是开发各种语言类应用的重要一步。文
--------------
检索到的第1个内容: 
以英译汉为例，传统统计机器翻译多倾向直接替换英文词汇，语序保持英语结构，容易出现中文词汇使用不地道、语序不顺畅的现象。而大语言模型可以学习英汉两种语言的语法区别，进行动态的结构转换。同时，它还可以通过上下文理解原句意图，选择合适的中文词汇进行转换，而非生硬的字面翻译。

大语言模型翻译的这些优势使其生成的中文文本更加地道、流畅，兼具准确的意义表达。利用大语言模型翻译，我们能够打通多语言之间的壁垒，
--------------
检索到的第2个内容: 
通过这个例子，我们可以看到大语言模型可以流畅地处理多个转换要求，实现中文翻译、拼写纠正、语气升级和格式转换等功能。

利用大语言模型强大的组合转换能力，我们可以避免多次调用模型来进行不同转换，极大地简化了工作流程。这种一次性实现多种转换的方法，可以广泛应用于文本处理与转换的场景中。

六、英文版

1.1 翻译为西班牙语

python
prompt = f"""
Translate the fo
--------------

3.2 MMR检索

如果只考虑检索出内容的相关性会导致内容过于单一，可能丢失重要信息。

最大边际相关性 (MMR, Maximum marginal relevance) 可以帮助我们在保持相关性的同时，增加内容的丰富度。

核心思想是在已经选择了一个相关性高的文档之后，再选择一个与已选文档相关性较低但是信息丰富的文档。这样可以在保持相关性的同时，增加内容的多样性，避免过于单一的结果。

In [14]:

mmr_docs = vectordb.max_marginal_relevance_search(question,k=3)

In [15]:

for i, sim_doc in enumerate(mmr_docs):
    print(f"MMR 检索到的第{i}个内容: \n{sim_doc.page_content[:200]}", end="\n--------------\n")

MMR 检索到的第0个内容: 
第六章 文本转换

大语言模型具有强大的文本转换能力，可以实现多语言翻译、拼写纠正、语法调整、格式转换等不同类型的文本转换任务。利用语言模型进行各类转换是它的典型应用之一。

在本章中,我们将介绍如何通过编程调用API接口，使用语言模型实现文本转换功能。通过代码示例，读者可以学习将输入文本转换成所需输出格式的具体方法。

掌握调用大语言模型接口进行文本转换的技能，是开发各种语言类应用的重要一步。文
--------------
MMR 检索到的第1个内容: 
与基础语言模型不同，指令微调 LLM 通过专门的训练，可以更好地理解并遵循指令。举个例子，当询问“法国的首都是什么？”时，这类模型很可能直接回答“法国的首都是巴黎”。指令微调 LLM 的训练通常基于预训练语言模型，先在大规模文本数据上进行预训练，掌握语言的基本规律。在此基础上进行进一步的训练与微调（finetune），输入是指令，输出是对这些指令的正确回复。有时还会采用RLHF（reinforce
--------------
MMR 检索到的第2个内容: 
同一份数据集中的每个样本都含有相同个数的特征，假设此数据集中的每个样本都含有d 个特征，则第i
个样本的数学表示为d 维向量：xi = (xi1; xi2; ...; xid)，其中xij 表示样本xi 在第j 个属性上的取值。
模型：机器学习的一般流程如下：首先收集若干样本（假设此时有100 个）
，然后将其分为训练样本
（80 个）和测试样本（20 个）
，其中80 个训练样本构成的集合称为“
--------------

MaXiaoTiao

搭建并使用向量数据库

搭建并使用向量数据库

一、前序配置

二、构建Chroma向量库

三、向量检索

3.1 相似度检索

3.2 MMR检索