Use 4-bit GPTQ models with LangChain.
It is recommended to use conda to create a Python 3.9 virtual environment, then set up the environment with the following steps:
Step 1: Use git to download this project

```bash
git clone https://github.com/PengZiqiao/gptq4llama_langchain.git
```
Step 2: Install dependencies using pip

```bash
pip install -r requirements.txt
```
Step 3: The project depends on GPTQ-for-LLaMa. Clone it, or copy/symlink an existing checkout, into the `repositories` directory

```bash
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
pip install -r GPTQ-for-LLaMa/requirements.txt
```
- Prepare the model files. You can use (but are not limited to) the following models; see the download sketch after this list:
- vicuna-13b-GPTQ-4bit-128g
- wizardLM-7B-GPTQ
- stable-vicuna-13B-GPTQ
- koala-13B-GPTQ-4bit-128g
- BELLE-7B-gptq
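For example, model files can be fetched from the Hugging Face Hub. The sketch below is only illustrative; the repository id and target directory are placeholders, not names defined by this project.

```python
# Illustrative sketch: download a GPTQ model snapshot from the Hugging Face Hub.
# The repo_id and local_dir below are placeholders; replace them with the model you chose.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="SOME_USER/vicuna-13b-GPTQ-4bit-128g",  # placeholder repository id
    local_dir="models/vicuna-13b-GPTQ-4bit-128g",   # placeholder target directory
)
print(model_dir)
```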
- Import the `GPTQ` class and create an instance.

```python
from model import GPTQ

model_dir = 'YOUR_MODEL_DIR'
checkpoint = 'YOUR_MODEL_DIR/checkpoint_file'
gptq = GPTQ(model_dir, checkpoint, wbits=4, groupsize=128)
```
- We define a method called `generate()`. It takes two parameters, `prompt` and `streaming`, which specify the input prompt and whether streaming generation is enabled.
```python
content = input()
prompt = f"""A chat between a user and an assistant.
USER: {content}
ASSISTANT: """
print(gptq.generate(prompt))
```
- With streaming output (an approach borrowed largely from text-generation-webui), you get a generator. Each iteration yields the full content generated so far, so if you only want to keep the newly generated content, you need to manually strip the previous output.
```python
last_output = ''
for output in gptq.generate(prompt, streaming=True):
    print(output.replace(last_output, ''), end='')
    last_output = output
```
- We also define a method called `embed()` that converts the input text into a vector representation, which can be used for similarity search, classification, clustering, and other operations.
```python
embeddings = gptq.embed(prompt)
```
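The embeddings can then be compared directly. Below is a minimal similarity sketch using cosine similarity, assuming `embed()` returns a flat sequence of floats.

```python
# Sketch: compare two texts by cosine similarity of their embeddings.
# Assumes gptq.embed() returns a flat list/array of floats.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a = gptq.embed("The cat sits on the mat.")
emb_b = gptq.embed("A cat is sitting on a mat.")
print(cosine_similarity(emb_a, emb_b))
```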
- If you are using this project on a server, we also use FastAPI to wrap the above methods into APIs. Start the service with the following command:

```bash
python run_server.py
```
- You can change the host and port in `config.py`:

```python
LLM_HOST = "0.0.0.0"
LLM_PORT = "8080"
```
- Other config options can also be modified:
```python
# True: use AutoConfig / AutoModelForCausalLM instead of LlamaConfig / LlamaForCausalLM to support more models.
# False: just import load_quant from GPTQ-for-LLaMa/llama_inference.py.
AUTO_TYPE = False

MODEL_PARAMS = dict(
    model="YOUR_MODEL_DIR",
    checkpoint="YOUR_MODEL_DIR/checkpoint_file",
    wbits=4, groupsize=128, fused_mlp=False, warmup_autotune=False
)

# Modify these according to your model's preferred prompt format.
HUMAN_PREFIX = 'USER'
AI_PREFIX = 'ASSISTANT'
```
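As a rough illustration of what `AUTO_TYPE` selects between (this is not the project's actual loading code, only the class-level difference it toggles):

```python
# Sketch only: shows which transformers classes AUTO_TYPE chooses between.
from transformers import AutoConfig, AutoModelForCausalLM, LlamaConfig, LlamaForCausalLM

def build_model_skeleton(model_dir: str, auto_type: bool):
    if auto_type:
        # AUTO_TYPE = True: generic Auto* classes, so non-LLaMA architectures also work
        config = AutoConfig.from_pretrained(model_dir)
        return AutoModelForCausalLM.from_config(config)
    # AUTO_TYPE = False: LLaMA-specific classes, as in GPTQ-for-LLaMa's llama_inference
    config = LlamaConfig.from_pretrained(model_dir)
    return LlamaForCausalLM(config)
```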
Call `/generate/` to get the reply text.

| Method | POST |
| --- | --- |
| Request body | `{"prompt": "string", "params": {...}}` |
Use `requests.post` to call the API:
```python
import requests
import json
from config import LLM_HOST, LLM_PORT

GENERATE_PARAMS = dict(
    min_length=0, max_length=4096, temperature=0.1, top_p=0.75, top_k=40
)

def generate(prompt):
    url = f"http://{LLM_HOST}:{LLM_PORT}/generate/"
    headers = {"Content-Type": "application/json"}
    data = json.dumps(dict(prompt=prompt, params=GENERATE_PARAMS))
    res = requests.post(url, headers=headers, data=data)
    return res.text

print(generate('Hello, bot!'))
```
Call `/streaming_generate/` to get a streaming reply.

| Method | POST |
| --- | --- |
| Request body | `{"prompt": "string", "params": {...}}` |
Use `requests.post` to call the API:
```python
from sseclient import SSEClient
import requests
import json
from config import LLM_HOST, LLM_PORT

def streaming_generate(prompt):
    url = f"http://{LLM_HOST}:{LLM_PORT}/streaming_generate/"
    headers = {"Content-Type": "application/json"}
    data = json.dumps(dict(prompt=prompt, params=GENERATE_PARAMS))
    res = requests.post(url, headers=headers, data=data, stream=True)
    client = SSEClient(res).events()
    return client

for each in streaming_generate('Hello, bot!'):
    print(each.data)
```
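As with the local streaming generator, each SSE event is assumed to carry the full text generated so far, so the same trick of stripping the previous output applies on the client side. A small sketch under that assumption:

```python
# Sketch: print only the newly generated part of each SSE event.
# Assumes each event's data is the full text so far, mirroring gptq.generate(streaming=True).
last_output = ''
for event in streaming_generate('Hello, bot!'):
    print(event.data.replace(last_output, ''), end='', flush=True)
    last_output = event.data
```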
Call `/embed/` to get the embeddings.

| Method | POST |
| --- | --- |
| Request body | `{"prompt": "string"}` |
Use `requests.post` to call the API:
```python
import requests
import json
from config import LLM_HOST, LLM_PORT

def embed(prompt):
    url = f"http://{LLM_HOST}:{LLM_PORT}/embed/"
    headers = {"Content-Type": "application/json"}
    data = json.dumps(dict(prompt=prompt))
    res = requests.post(url, headers=headers, data=data)
    return json.loads(res.text)

print(embed('Hello, bot!'))
```
Call `/chat/` to hold a conversation.

| Method | POST |
| --- | --- |
| Request body | `[['USER MESSAGE 1', 'ASSISTANT MESSAGE 1'], ...]` |
Use `requests.post` to call the API:
```python
from sseclient import SSEClient
import requests
import json
from config import LLM_HOST, LLM_PORT

def chat(history):
    url = f"http://{LLM_HOST}:{LLM_PORT}/chat/"
    headers = {"Content-Type": "application/json"}
    data = json.dumps(history)
    res = requests.post(url, headers=headers, data=data, stream=True)
    client = SSEClient(res).events()
    return client

for each in chat([['Hello, bot!', '']]):
    print(each.data)
```
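The history is a list of `[user_message, assistant_message]` pairs, with the last assistant message left empty. The sketch below shows one way such a history could be rendered into a prompt with `HUMAN_PREFIX`/`AI_PREFIX`; the actual server-side formatting may differ.

```python
# Sketch: render a chat history of [user, assistant] pairs into a prompt string.
# Mirrors the prompt format and HUMAN_PREFIX/AI_PREFIX shown above; the server may format differently.
from config import HUMAN_PREFIX, AI_PREFIX

def history_to_prompt(history):
    lines = ["A chat between a user and an assistant."]
    for user_msg, ai_msg in history:
        lines.append(f"{HUMAN_PREFIX}: {user_msg}")
        lines.append(f"{AI_PREFIX}: {ai_msg}")
    return "\n".join(lines)

print(history_to_prompt([['Hello, bot!', '']]))
```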
- We provide a custom LangChain LLM wrapper named `GPTQLLM`.
```python
from model import GPTQLLM
from langchain import PromptTemplate
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Do not set streaming=True if num_beam>1
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = GPTQLLM(streaming=True, callback_manager=callback_manager)
llm("Hello, bot!")
```
For more details on how to use LLMs within LangChain, see the LLM getting started guide.
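For example, `GPTQLLM` drops into a plain `LLMChain`; the template below is only an illustration, not a format required by the project.

```python
# Sketch: use the GPTQLLM wrapper inside a basic LangChain LLMChain.
# The prompt template is only an example; adapt it to your model's preferred format.
from langchain import LLMChain, PromptTemplate
from model import GPTQLLM

template = """A chat between a user and an assistant.
USER: {content}
ASSISTANT: """
prompt = PromptTemplate(template=template, input_variables=["content"])

llm = GPTQLLM()
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(content="Hello, bot!"))
```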
- We also provide a custom embeddings model, `GPTQEmbeddings`.
```python
from model import GPTQEmbeddings

document = "This is the content of the document"
query = "What is the content of the document?"

embeddings = GPTQEmbeddings()
doc_result = embeddings.embed_documents([document])
query_result = embeddings.embed_query(query)
```
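These embeddings also plug into LangChain vector stores. A sketch using FAISS follows; it assumes the optional `faiss-cpu` package is installed.

```python
# Sketch: use GPTQEmbeddings with a FAISS vector store for similarity search.
# Requires the optional faiss-cpu package; other LangChain vector stores work the same way.
from langchain.vectorstores import FAISS
from model import GPTQEmbeddings

texts = [
    "This is the content of the document",
    "An unrelated sentence about cooking pasta",
]
embeddings = GPTQEmbeddings()
db = FAISS.from_texts(texts, embeddings)
print(db.similarity_search("What is the content of the document?", k=1))
```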
- https://github.com/hwchase17/langchain
- https://github.com/qwopqwop200/GPTQ-for-LLaMa
- https://github.com/oobabooga/text-generation-webui
- LoRA support
- custom LangChain chains
- custom LangChain agents