Use 4-bit GPTQ models with LangChain.
It is recommended to use conda to create a Python 3.9 virtual environment, then set up the environment with the following steps:
Step 1: Use git to download this project

```bash
git clone https://github.com/PengZiqiao/gptq4llama_langchain.git
```
Step 2: Install dependencies using pip

```bash
pip install -r requirements.txt
```
Step 3: The project depends on GPTQ-for-LLaMa. Clone it, or copy/symlink an existing checkout, into the `repositories` directory

```bash
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
pip install -r GPTQ-for-LLaMa/requirements.txt
```
- Prepare the model files. You can use (but are not limited to) the following models; see the download sketch after this list:
- vicuna-13b-GPTQ-4bit-128g
- wizardLM-7B-GPTQ
- stable-vicuna-13B-GPTQ
- koala-13B-GPTQ-4bit-128g
- BELLE-7B-gptq
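For example, model files can be fetched from the Hugging Face Hub. The sketch below is only illustrative; the repository id and target directory are placeholders, not names defined by this project.

```python
# Illustrative sketch: download a GPTQ model snapshot from the Hugging Face Hub.
# The repo_id and local_dir below are placeholders; replace them with the model you chose.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="SOME_USER/vicuna-13b-GPTQ-4bit-128g",  # placeholder repository id
    local_dir="models/vicuna-13b-GPTQ-4bit-128g",   # placeholder target directory
)
print(model_dir)
```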
- Import the `GPTQ` class and create an instance.

```python
from model import GPTQ

model_dir = 'YOUR_MODEL_DIR'
checkpoint = 'YOUR_MODEL_DIR/checkpoint_file'
gptq = GPTQ(model_dir, checkpoint, wbits=4, groupsize=128)
```
- We define a method called `generate()`. It takes two parameters, `prompt` and `streaming`, which specify the input prompt and whether streaming generation is enabled.
```python
content = input()
prompt = f"""A chat between a user and an assistant.
USER: {content}
ASSISTANT: """
print(gptq.generate(prompt))
```
- With streaming output (an approach borrowed largely from text-generation-webui), you get a generator. Each iteration yields the full content generated so far, so if you only want to keep the newly generated content, you need to manually strip the previous output.
```python
last_output = ''
for output in gptq.generate(prompt, streaming=True):
    print(output.replace(last_output, ''), end='')
    last_output = output
```
- We also define a method called `embed()` that converts the input text into a vector representation, which can be used for similarity search, classification, clustering, and other operations.
```python
embeddings = gptq.embed(prompt)
```
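The embeddings can then be compared directly. Below is a minimal similarity sketch using cosine similarity, assuming `embed()` returns a flat sequence of floats.

```python
# Sketch: compare two texts by cosine similarity of their embeddings.
# Assumes gptq.embed() returns a flat list/array of floats.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a = gptq.embed("The cat sits on the mat.")
emb_b = gptq.embed("A cat is sitting on a mat.")
print(cosine_similarity(emb_a, emb_b))
```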
- If you are using this project on a server, we also use FastAPI to wrap the above methods into APIs. Start the service with the following command:

```bash
python run_server.py
```
- You can change the host and port in `config.py`:

```python
LLM_HOST = "0.0.0.0"
LLM_PORT = "8080"
```
- Other config options can also be modified:
```python
# True: use AutoConfig / AutoModelForCausalLM instead of LlamaConfig / LlamaForCausalLM to support more models.
# False: just import load_quant from GPTQ-for-LLaMa/llama_inference.py.
AUTO_TYPE = False

MODEL_PARAMS = dict(
    model="YOUR_MODEL_DIR",
    checkpoint="YOUR_MODEL_DIR/checkpoint_file",
    wbits=4, groupsize=128, fused_mlp=False, warmup_autotune=False
)

# Modify these according to your model's preferred prompt format.
HUMAN_PREFIX = 'USER'
AI_PREFIX = 'ASSISTANT'
```
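As a rough illustration of what `AUTO_TYPE` selects between (this is not the project's actual loading code, only the class-level difference it toggles):

```python
# Sketch only: shows which transformers classes AUTO_TYPE chooses between.
from transformers import AutoConfig, AutoModelForCausalLM, LlamaConfig, LlamaForCausalLM

def build_model_skeleton(model_dir: str, auto_type: bool):
    if auto_type:
        # AUTO_TYPE = True: generic Auto* classes, so non-LLaMA architectures also work
        config = AutoConfig.from_pretrained(model_dir)
        return AutoModelForCausalLM.from_config(config)
    # AUTO_TYPE = False: LLaMA-specific classes, as in GPTQ-for-LLaMa's llama_inference
    config = LlamaConfig.from_pretrained(model_dir)
    return LlamaForCausalLM(config)
```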
Call `/generate/` to get the reply text.

| Method | POST |
| --- | --- |
| Request body | `{"prompt": "string", "params": {...}}` |
Use `requests.post` to call the API:
```python
import requests
import json
from config import LLM_HOST, LLM_PORT

GENERATE_PARAMS = dict(
    min_length=0, max_length=4096, temperature=0.1, top_p=0.75, top_k=40
)

def generate(prompt):
    url = f"http://{LLM_HOST}:{LLM_PORT}/generate/"
    headers = {"Content-Type": "application/json"}
    data = json.dumps(dict(prompt=prompt, params=GENERATE_PARAMS))
    res = requests.post(url, headers=headers, data=data)
    return res.text

print(generate('Hello, bot!'))
```
Call `/streaming_generate/` to get a streaming reply.

| Method | POST |
| --- | --- |
| Request body | `{"prompt": "string", "params": {...}}` |
Use `requests.post` to call the API:
```python
from sseclient import SSEClient
import requests
import json
from config import LLM_HOST, LLM_PORT

def streaming_generate(prompt):
    url = f"http://{LLM_HOST}:{LLM_PORT}/streaming_generate/"
    headers = {"Content-Type": "application/json"}
    data = json.dumps(dict(prompt=prompt, params=GENERATE_PARAMS))
    res = requests.post(url, headers=headers, data=data, stream=True)
    client = SSEClient(res).events()
    return client

for each in streaming_generate('Hello, bot!'):
    print(each.data)
```
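As with the local streaming generator, each SSE event is assumed to carry the full text generated so far, so the same trick of stripping the previous output applies on the client side. A small sketch under that assumption:

```python
# Sketch: print only the newly generated part of each SSE event.
# Assumes each event's data is the full text so far, mirroring gptq.generate(streaming=True).
last_output = ''
for event in streaming_generate('Hello, bot!'):
    print(event.data.replace(last_output, ''), end='', flush=True)
    last_output = event.data
```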
Call `/embed/` to get the embeddings.

| Method | POST |
| --- | --- |
| Request body | `{"prompt": "string"}` |
Use `requests.post` to call the API:
```python
import requests
import json
from config import LLM_HOST, LLM_PORT

def embed(prompt):
    url = f"http://{LLM_HOST}:{LLM_PORT}/embed/"
    headers = {"Content-Type": "application/json"}
    data = json.dumps(dict(prompt=prompt))
    res = requests.post(url, headers=headers, data=data)
    return json.loads(res.text)

print(embed('Hello, bot!'))
```
Call `/chat/` to hold a conversation.

| Method | POST |
| --- | --- |
| Request body | `[['USER MESSAGE 1', 'ASSISTANT MESSAGE 1'], ...]` |
Use `requests.post` to call the API:
```python
from sseclient import SSEClient
import requests
import json
from config import LLM_HOST, LLM_PORT

def chat(history):
    url = f"http://{LLM_HOST}:{LLM_PORT}/chat/"
    headers = {"Content-Type": "application/json"}
    data = json.dumps(history)
    res = requests.post(url, headers=headers, data=data, stream=True)
    client = SSEClient(res).events()
    return client

for each in chat([['Hello, bot!', '']]):
    print(each.data)
```
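The history is a list of `[user_message, assistant_message]` pairs, with the last assistant message left empty. The sketch below shows one way such a history could be rendered into a prompt with `HUMAN_PREFIX`/`AI_PREFIX`; the actual server-side formatting may differ.

```python
# Sketch: render a chat history of [user, assistant] pairs into a prompt string.
# Mirrors the prompt format and HUMAN_PREFIX/AI_PREFIX shown above; the server may format differently.
from config import HUMAN_PREFIX, AI_PREFIX

def history_to_prompt(history):
    lines = ["A chat between a user and an assistant."]
    for user_msg, ai_msg in history:
        lines.append(f"{HUMAN_PREFIX}: {user_msg}")
        lines.append(f"{AI_PREFIX}: {ai_msg}")
    return "\n".join(lines)

print(history_to_prompt([['Hello, bot!', '']]))
```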
- We provide a custom LangChain LLM wrapper named `GPTQLLM`.
```python
from model import GPTQLLM
from langchain import PromptTemplate
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Do not set streaming=True if num_beam>1
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = GPTQLLM(streaming=True, callback_manager=callback_manager)
llm("Hello, bot!")
```
For more details on how to use LLMs within LangChain, see the LLM getting started guide.
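For example, `GPTQLLM` drops into a plain `LLMChain`; the template below is only an illustration, not a format required by the project.

```python
# Sketch: use the GPTQLLM wrapper inside a basic LangChain LLMChain.
# The prompt template is only an example; adapt it to your model's preferred format.
from langchain import LLMChain, PromptTemplate
from model import GPTQLLM

template = """A chat between a user and an assistant.
USER: {content}
ASSISTANT: """
prompt = PromptTemplate(template=template, input_variables=["content"])

llm = GPTQLLM()
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(content="Hello, bot!"))
```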
- We also provide a custom embeddings model, `GPTQEmbeddings`.
```python
from model import GPTQEmbeddings

document = "This is the content of the document"
query = "What is the content of the document?"

embeddings = GPTQEmbeddings()
doc_result = embeddings.embed_documents([document])
query_result = embeddings.embed_query(query)
```
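These embeddings also plug into LangChain vector stores. A sketch using FAISS follows; it assumes the optional `faiss-cpu` package is installed.

```python
# Sketch: use GPTQEmbeddings with a FAISS vector store for similarity search.
# Requires the optional faiss-cpu package; other LangChain vector stores work the same way.
from langchain.vectorstores import FAISS
from model import GPTQEmbeddings

texts = [
    "This is the content of the document",
    "An unrelated sentence about cooking pasta",
]
embeddings = GPTQEmbeddings()
db = FAISS.from_texts(texts, embeddings)
print(db.similarity_search("What is the content of the document?", k=1))
```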
- https://github.com/hwchase17/langchain
- https://github.com/qwopqwop200/GPTQ-for-LLaMa
- https://github.com/oobabooga/text-generation-webui
- LoRA support
- custom LangChain chains
- custom LangChain agents