Text Embedding
A Comprehensive Tutorial on Text Embedding Techniques
This tutorial covers the principles, comparisons, and practical applications of various text embedding models, including mainstream options such as OpenAI, HuggingFace, and LLaMA.
import os
from EdgeGPT import Chatbot as Bing, ConversationStyle
bing = Bing(cookiePath = os.path.expanduser('~/.config/EdgeGPT/cookies.json'))
async def ask(prompt):
    res = (await bing.ask(
        prompt = prompt,
        conversation_style = ConversationStyle.balanced,
    ))['item']['messages'][1]
    print(res['text'])
    print('\n---\n')
    print(res['adaptiveCards'][0]['body'][0]['text'])
await ask('''
text-embedding-ada-002 是什么?
''')
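For reference, text-embedding-ada-002 is OpenAI's general-purpose embedding model. Below is a minimal sketch of calling it with the pre-1.0 openai package used elsewhere in this notebook; it assumes OPENAI_API_KEY is set in the environment.
```
import os
import openai

openai.api_key = os.environ.get('OPENAI_API_KEY')

# Embed a short text with text-embedding-ada-002 (returns a 1536-dimensional vector).
response = openai.Embedding.create(
    model='text-embedding-ada-002',
    input='这是一个测试文档。',
)
embedding = response['data'][0]['embedding']
print(len(embedding))
```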
await ask('''
比较 GPT-3 嵌入、GloVe 嵌入、Word2vec 嵌入、MPNet 嵌入
''')
await ask('''
还有哪些文本嵌入模型?
''')
await ask('''
文本嵌入模型的英文?
''')
await ask('''
InstructorEmbedding 是什么?
''')
await ask('''
比较 HuggingFaceInstructEmbeddings 和 HuggingFaceHubEmbeddings
''')
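A rough, hedged illustration of the difference being asked about: in the langchain of this era, HuggingFaceInstructEmbeddings downloads an instructor model and runs it locally, while HuggingFaceHubEmbeddings calls the hosted Hugging Face Hub inference API and therefore needs HUGGINGFACEHUB_API_TOKEN. The model names below are illustrative defaults, not verified here.
```
from langchain.embeddings import HuggingFaceInstructEmbeddings, HuggingFaceHubEmbeddings

# Local inference: downloads the instructor model and embeds on this machine.
local_embeddings = HuggingFaceInstructEmbeddings(model_name='hkunlp/instructor-large')
print(len(local_embeddings.embed_query('这是一个测试文档。')))

# Remote inference: sends the text to the Hub's hosted inference API.
# Requires the HUGGINGFACEHUB_API_TOKEN environment variable.
hub_embeddings = HuggingFaceHubEmbeddings(repo_id='sentence-transformers/all-mpnet-base-v2')
print(len(hub_embeddings.embed_query('这是一个测试文档。')))
```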
await ask('''
sentence_transformers 的 GitHub 链接?
''')
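The library in question is UKPLab's sentence-transformers (https://github.com/UKPLab/sentence-transformers). A minimal usage sketch, with 'all-MiniLM-L6-v2' chosen purely as an illustrative model:
```
from sentence_transformers import SentenceTransformer

# Download (or reuse from the local cache) a small sentence-embedding model.
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode a batch of sentences into dense vectors.
embeddings = model.encode(['这是一个测试文档。', '这是另一个测试文档。'])
print(embeddings.shape)
```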
await ask('''
InstructorEmbedding 的 GitHub 链接?
''')
await ask('''
比较 HuggingFaceInstructEmbeddings 和 LlamaCppEmbeddings
''')
await ask('''
执行下面这段代码之后,HuggingFaceEmbeddings 的模型下载到哪里了?
```
from langchain.embeddings import HuggingFaceEmbeddings
# 准备文本
text = '这是一个测试文档。'
# 使用 HuggingFace 生成文本嵌入
embeddings = HuggingFaceEmbeddings()
query_result = embeddings.embed_query(text)
doc_result = embeddings.embed_documents([text])
```
''')
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
dir(tokenizer)
tokenizer.name_or_path
%%bash
ls -lah ~/.cache/huggingface/hub/models--bert-base-uncased
%%bash
du -sh ~/.cache/huggingface/hub/models--bert-base-uncased
Recommended Tools and Resources
Vector Databases
If you are looking for a vector-database embedding solution, take a look at:
- Chroma - an AI-native open-source embedding database (a usage sketch follows this list)
- Website: https://www.trychroma.com/
- Documentation: https://docs.trychroma.com/embeddings
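As referenced above, Chroma stores documents together with their embeddings and answers nearest-neighbour queries. The sketch below is a minimal, hedged example of its Python client; the collection name and documents are made up for illustration, and by default Chroma computes embeddings with its built-in model.
```
import chromadb

# In-memory Chroma client (no server needed for a quick experiment).
client = chromadb.Client()
collection = client.create_collection(name='demo_docs')

# Add a few documents; Chroma embeds and indexes them.
collection.add(
    documents=['这是一个测试文档。', '这是另一个测试文档。'],
    ids=['doc1', 'doc2'],
)

# Query by text; the query is embedded and the closest documents are returned.
results = collection.query(query_texts=['测试'], n_results=1)
print(results['documents'])
```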
Model Fine-Tuning
If you need to fine-tune a model, the following are recommended:
Recommended base models
- gpt4xalpaca - recommended as a base model because it is less heavily censored
- GPT4 x Alpaca - if you use it purely for ethical tasks, vicuna may be the better choice
await ask('''
Chroma Embedding Database 是什么
''')
await ask('''
ggml 是什么
''')
await ask('''
HUGGINGFACEHUB_API_TOKEN 有什么用?
''')
await ask('''
Hugging Face Hub API 有什么用?
''')
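As a small illustration of the Hub API, the huggingface_hub client can browse public repositories without any token; the token mainly matters for private repos, uploads, and the hosted inference API. A sketch, assuming a reasonably recent huggingface_hub:
```
import itertools
from huggingface_hub import HfApi

api = HfApi()

# Listing public models works without HUGGINGFACEHUB_API_TOKEN.
for model in itertools.islice(api.list_models(search='bert-base-uncased'), 3):
    print(model.modelId)
```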
await ask('''
没有 HUGGINGFACEHUB_API_TOKEN 就不能使用 Hugging Face Hub API 吗?
''')
await ask('''
没有 HUGGINGFACEHUB_API_TOKEN 就不能在 HuggingFace 下载模型吗?
''')
await ask('''
在模型上运行推理是什么意思?训练模型是什么意思?
''')
await ask('''
没有 HUGGINGFACEHUB_API_TOKEN 可以在 HuggingFace 下载他人公开的模型吗?
''')
await ask('''
怎么用 Python 在 HuggingFace 下载模型?
''')
await ask('''
在 HuggingFace 下载模型,如何断点续传?
''')
await ask('''
PyTorch 模型可以分成好几个 bin 文件吗?
''')
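They can: large checkpoints on the Hub are typically sharded into several pytorch_model-0000x-of-0000y.bin files plus a pytorch_model.bin.index.json that maps each weight to its shard. A hedged sketch for inspecting a repository's file list (the repo name is only an example):
```
from huggingface_hub import HfApi

# No token is needed to list the files of a public repository.
for name in HfApi().list_repo_files('bigscience/bloom-7b1'):
    if name.endswith('.bin') or name.endswith('.json'):
        print(name)
```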
import os
import pickle
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.embeddings import LlamaCppEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.faiss import FAISS
def get_docs(dir_name):
    # (1) Import a series of documents.
    loader = DirectoryLoader(dir_name, loader_cls=TextLoader, silent_errors=True)
    raw_documents = loader.load()
    # (2) Split them into small chunks.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=128,
    )
    return text_splitter.split_documents(raw_documents)
def ingest_docs(dir_name):
    documents = get_docs(dir_name)
    # (3) Create embeddings for each chunk (here with a local llama.cpp model rather than text-embedding-ada-002).
    embeddings = LlamaCppEmbeddings(model_path=os.path.expanduser('~/ggml-model-q4_1.bin'), n_ctx=1024)
    return FAISS.from_documents(documents, embeddings)
vectorstore = ingest_docs('_posts/ultimate-facts')
import pickle
# Save vectorstore
with open('vectorstore_13B_1024.pkl', 'wb') as f:
    pickle.dump(vectorstore, f)
import pickle
# Load vectorstore
with open('vectorstore_13B_1024.pkl', 'rb') as f:
    vectorstore = pickle.load(f)
question = '你知道什么?'
# Get context related to the question from the embedding model
for context in vectorstore.similarity_search(question):
    print(f'{context}\n')
from langchain.chains.llm import LLMChain
from langchain.callbacks.base import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains.chat_vector_db.prompts import CONDENSE_QUESTION_PROMPT, QA_PROMPT
from langchain.chains.question_answering import load_qa_chain
from langchain.vectorstores.base import VectorStore
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
# Callback function to stream answers to stdout.
manager = CallbackManager([StreamingStdOutCallbackHandler()])
streaming_llm = ChatOpenAI(streaming=True, callback_manager=manager, verbose=True, temperature=0)
question_gen_llm = ChatOpenAI(temperature=0, verbose=True, callback_manager=manager)
# Prompt to generate independent questions by incorporating chat history and a new question.
question_generator = LLMChain(llm=question_gen_llm, prompt=CONDENSE_QUESTION_PROMPT)
# Pass in documents and a standalone prompt to answer questions.
doc_chain = load_qa_chain(streaming_llm, chain_type='stuff', prompt=QA_PROMPT)
# Combine the retriever, the question generator, and the document QA chain into a conversational retrieval chain.
qa = ConversationalRetrievalChain(retriever=vectorstore.as_retriever(), combine_docs_chain=doc_chain, question_generator=question_generator)
QA_PROMPT
In PyTest, monkeypatch can be used to replace an imported library for testing. monkeypatch is a built-in pytest fixture for swapping out variables and objects at runtime so that custom values are used during a test.
If the pytest-mock plugin is installed, its mocker fixture offers a convenient wrapper around this kind of patching; without it, call monkeypatch.setattr() directly to replace the imported library.
The following example assumes a module named example.py that imports the requests library and uses its get() method to send a network request:
import requests
def get_example_data():
    response = requests.get('https://example.com')
    return response.content
To test this function, we replace the requests library with a dummy object so that the response of the network request can be simulated. The test can be written as follows:
import pytest
import requests
import example

class DummyResponse:
    def __init__(self, content):
        self.content = content

@pytest.fixture
def mock_requests(monkeypatch):
    def mock_get(*args, **kwargs):
        return DummyResponse(b'Test data')
    monkeypatch.setattr(requests, 'get', mock_get)

def test_get_example_data(mock_requests):
    data = example.get_example_data()
    assert data == b'Test data'
In this example, we first define a class named DummyResponse that stands in for a response from the requests library. We then define a fixture named mock_requests, which uses monkeypatch.setattr() to replace the requests library's get() method so that it returns our DummyResponse object.
Finally, we define a test function test_get_example_data and pass the mock_requests fixture to it as a parameter. Inside the test we call example.get_example_data(), which now invokes the get() method we replaced with the dummy object and therefore returns the fake response we defined.
In this way, any library can be replaced with dummy objects during testing, giving us much better control over the test environment.
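For comparison, the pytest-mock plugin mentioned above exposes a mocker fixture built on unittest.mock; the same test could be written roughly as follows. This is a sketch assuming pytest-mock is installed, not part of the original notebook.
```
import example

def test_get_example_data_with_mocker(mocker):
    # Patch requests.get as seen from the example module and set a canned response.
    mock_get = mocker.patch('example.requests.get')
    mock_get.return_value.content = b'Test data'
    assert example.get_example_data() == b'Test data'
```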
import openai
help(openai.ChatCompletion.create)
import json, os
from revChatGPT.V1 import Chatbot, configure
# Open the JSON file and read the conversation_id
with open(os.path.expanduser('~/.config/revChatGPT/config.json'), 'r') as f:
    conversation_id = json.load(f).get('conversation_id', None)
bot = Chatbot(
    config = configure(),
    conversation_id = conversation_id,
    lazy_loading = True
)
%%bash
pip install --upgrade git+https://github.com/seii-saintway/ipymock
import pytest
import markdown
import IPython
def delta(prompt):
    # Re-shape revChatGPT's streaming replies into OpenAI-style streaming chunks.
    res = ''
    for response in bot.ask(prompt):
        # IPython.display.display(IPython.core.display.Markdown(response['message']))
        # IPython.display.clear_output(wait=True)
        yield {
            'choices': [
                {
                    'index': 0,
                    'delta': {
                        # Only the text appended since the previous chunk.
                        'content': response['message'][len(res):],
                    }
                }
            ],
        }
        res = response['message']
def mock_create(*args, **kwargs):
    # Drop-in replacement for openai.ChatCompletion.create backed by revChatGPT.
    for message in kwargs['messages']:
        if message['role'] == 'user':
            break
    else:
        # No user message found: return an empty choice.
        return {
            'choices': [{}],
        }
    if kwargs.get('stream', False):
        return delta(message['content'])
    # Non-streaming: consume the generator and keep only the final message.
    for response in bot.ask(message['content']):
        # IPython.display.display(IPython.core.display.Markdown(response['message']))
        # IPython.display.clear_output(wait=True)
        pass
    return {
        'choices': [
            {
                'finish_reason': 'stop',
                'index': 0,
                'message': {
                    'content': response['message'],
                    'role': 'assistant',
                }
            }
        ],
    }
@pytest.fixture
def mock_openai(monkeypatch):
    monkeypatch.setattr(openai.ChatCompletion, 'create', mock_create)
question = '终极真实是什么?'
answer = {}
def test_qa(mock_openai):
    global answer
    answer = qa({'question': question, 'chat_history': []})
    print('\n')
    assert isinstance(answer, dict)
from ipymock import do
do(
    mock_openai=mock_openai,
    test_qa=test_qa,
)
answer
- Chat with Open Large Language Models
- Vicuna: a chat assistant fine-tuned from LLaMA on user-shared conversations. This one is expected to perform best according to our evaluation.
- Koala: a chatbot fine-tuned from LLaMA on user-shared conversations and open-source datasets. This one performs similarly to Vicuna.
- ChatGLM: an open bilingual dialogue language model | 开源双语对话语言模型
- Alpaca: a model fine-tuned from LLaMA on 52K instruction-following demonstrations.
- LLaMA: open and efficient foundation language models
I recommend using gpt4 x alpaca ggml as a base model as it doesn't have the same level of censorship as vicuna. However, if you're using it purely for ethical tasks, vicuna is definitely better.
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id='Pi3141/alpaca-lora-7B-ggml', filename='ggml-model-q4_0.bin', resume_download=True)
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id='Pi3141/alpaca-lora-7B-ggml', filename='ggml-model-q4_1.bin', resume_download=True)
%%bash
pip install llama-cpp-python[server]
%%bash
export MODEL=~/ggml-model-q4_1.bin
python3 -m llama_cpp.server
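The llama_cpp.server module serves an OpenAI-compatible HTTP API (by default on http://localhost:8000/v1), so the pre-1.0 openai client can be pointed at the local model. The snippet below is a hedged sketch assuming the server started above is running and exposes the embeddings route; the model field is illustrative and is ignored by many server builds.
```
import openai

# Talk to the local llama.cpp server instead of api.openai.com.
openai.api_base = 'http://localhost:8000/v1'
openai.api_key = 'sk-no-key-needed-for-a-local-server'

# The server also exposes /v1/embeddings, backed by the loaded ggml model.
response = openai.Embedding.create(
    model='ggml-model-q4_1',  # illustrative name; the server uses the model it was launched with
    input='这是一个测试文档。',
)
print(len(response['data'][0]['embedding']))
```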
import os
from llama_cpp import Llama
llm = Llama(model_path=os.path.expanduser('~/ggml-model-q4_1.bin'))
llm.tokenize('''> 我们刚刚知道自然科学借以掌握质的方法––形成量的概念的方法。我们必须提出的问题是,这种方法是不是也能够适用于主观的意识的质。按照我们前面所说,为了使这种方法能够加以运用,必须有与这些质充分确定地、唯一地联系着的空间变化。如果情况真的如此,那么这个问题就可以通过空间–时间的重合方法来解决,因而**测量**便是可能的。但是,这种重合的方法本质上就是进行物理的观察,而就内省法来说,却不存在物理的观察这种事情。由此立刻就可以得出结论:心理学沿着内省的途径决不可能达到知识的理想。因此,它必须尽量使用物理的观察方法来达到它的目的。但这是不是可能的呢?是不是有依存于意识的质的空间变化,就像例如在光学中干涉带的宽度依存于颜色,在电学中磁铁的偏转度依存于磁场的强度那样呢?'''.encode('utf8'))
llm.tokenize('''> 现在我们知道,事实上应当承认在主观的质和推断出来的客观世界之间有一种确切规定的、一义的配列关系。大量的经验材料告诉我们,我们可以发现,至少必须假设与所有经验唯一地联系着的“物理的”过程的存在。没有什么意识的质不可能受到作用于身体的力的影响。的确,我们甚至能够用一种简单的物理方法,例如吸进一种气体,就把意识全部消除掉。我们的行动与我们的意志经验相联系,幻觉与身体的疲惫相联系,抑郁症的发作与消化的紊乱相联系。为了研究这类相互联系,心的理论必须抛弃纯粹内省的方法而成为**生理的**心理学。只有这个学科才能在理论上达到对心理的东西的完全的知识。借助于这样一种心理学,我们就可以用概念和所与的主观的质相配列,正如我们能够用概念与推论出来的客观的质相配列一样。这样,主观的质就像客观的质一样成为可知的了。'''.encode('utf8'))
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
def get_docs(dir_name):
    # (1) Import a series of documents.
    loader = DirectoryLoader(dir_name, loader_cls=TextLoader, silent_errors=True)
    raw_documents = loader.load()
    # (2) Split them into small chunks.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1024,
        chunk_overlap=64,
    )
    return text_splitter.split_documents(raw_documents)
import os
from langchain.embeddings import LlamaCppEmbeddings
from langchain.vectorstores.faiss import FAISS
def ingest_docs(dir_name):
    documents = get_docs(dir_name)
    # (3) Create embeddings for each chunk (here with a local llama.cpp model rather than text-embedding-ada-002).
    embeddings = LlamaCppEmbeddings(model_path=os.path.expanduser(
        '~/.cache/huggingface/hub/models--Pi3141--alpaca-lora-7B-ggml/snapshots/fec53813efae6495f9b1f14aa4dedffc07bbf2e0/ggml-model-q4_1.bin'
    ), n_ctx=2048)
    return FAISS.from_documents(documents, embeddings)
vectorstore = ingest_docs('_posts/ultimate-facts')
I need a machine with a lot of memory to speed up LLM inference.
import pickle
# Save vectorstore
with open('vectorstore_7B_2048.pkl', 'wb') as f:
    pickle.dump(vectorstore, f)
# Load vectorstore
with open('vectorstore_7B_2048.pkl', 'rb') as f:
    vectorstore = pickle.load(f)
question = '你知道什么?'
# Get context related to the question from the embedding model
for context in vectorstore.similarity_search(question):
    print(f'{context}\n')
# from langchain.chains.chat_vector_db.prompts import CONDENSE_QUESTION_PROMPT, QA_PROMPT
from langchain.prompts import PromptTemplate
CONDENSE_QUESTION_PROMPT = PromptTemplate(
    input_variables=['chat_history', 'question'],
    output_parser=None, partial_variables={},
    template='给定以下对话和后续问题,请重新表述后续问题以成为一个独立问题。\n\n聊天记录:\n{chat_history}\n后续问题:{question}\n独立问题:',
    template_format='f-string',
    validate_template=True
)
QA_PROMPT = PromptTemplate(
    input_variables=['context', 'question'],
    output_parser=None, partial_variables={},
    template='使用下面的背景信息回答最后的问题。如果您不知道答案,请直接说您不知道,不要试图编造一个答案。\n\n背景信息:\n{context}\n\n问题:{question}\n有用的答案:',
    template_format='f-string',
    validate_template=True
)
from langchain.chains.llm import LLMChain
from langchain.callbacks.base import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains.question_answering import load_qa_chain
from langchain.vectorstores.base import VectorStore
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
# Callback function to stream answers to stdout.
manager = CallbackManager([StreamingStdOutCallbackHandler()])
streaming_llm = ChatOpenAI(streaming=True, callback_manager=manager, verbose=True, temperature=0)
question_gen_llm = ChatOpenAI(temperature=0, verbose=True, callback_manager=manager)
# Prompt to generate independent questions by incorporating chat history and a new question.
question_generator = LLMChain(llm=question_gen_llm, prompt=CONDENSE_QUESTION_PROMPT)
# Pass in documents and a standalone prompt to answer questions.
doc_chain = load_qa_chain(streaming_llm, chain_type='stuff', prompt=QA_PROMPT)
# Combine the retriever, the question generator, and the document QA chain into a conversational retrieval chain.
qa = ConversationalRetrievalChain(retriever=vectorstore.as_retriever(), combine_docs_chain=doc_chain, question_generator=question_generator)
import openai
import json, os
from revChatGPT.V1 import Chatbot, configure
# Open the JSON file and read the conversation_id
with open(os.path.expanduser('~/.config/revChatGPT/config.json'), 'r') as f:
    conversation_id = json.load(f).get('conversation_id', None)
bot = Chatbot(
    config = configure(),
    conversation_id = conversation_id,
    lazy_loading = True
)
import pytest
def delta(prompt):
    res = ''
    for response in bot.ask(prompt):
        yield {
            'choices': [
                {
                    'index': 0,
                    'delta': {
                        'content': response['message'][len(res):],
                    }
                }
            ],
        }
        res = response['message']
def mock_create(*args, **kwargs):
    for message in kwargs['messages']:
        if message['role'] == 'user':
            break
    else:
        return {
            'choices': [{}],
        }
    if kwargs.get('stream', False):
        return delta(message['content'])
    for response in bot.ask(message['content']):
        pass
    return {
        'choices': [
            {
                'finish_reason': 'stop',
                'index': 0,
                'message': {
                    'content': response['message'],
                    'role': 'assistant',
                }
            }
        ],
    }
@pytest.fixture
def mock_openai(monkeypatch):
    monkeypatch.setattr(openai.ChatCompletion, 'create', mock_create)
question = '终极真实是什么?'
answer = {}
def test_qa(mock_openai):
    global answer
    answer = qa({'question': question, 'chat_history': []})
    print('\n')
    assert isinstance(answer, dict)
from ipymock import do
do(
    mock_openai=mock_openai,
    test_qa=test_qa,
)
answer