
Replacing OpenAI GPT-4 with Ollama as LLM-as-a-Judge and API Calls with Local LLMs in Giskard (RAGET Toolkit) AND Replacing API Calls with Local LLMs in Giskard Using Ollama #2096

pds13193 opened this issue Jan 10, 2025 · 10 comments

@pds13193

pds13193 commented Jan 10, 2025

Checklist

  • I've searched the project's issues.

❓ Question

I am currently using Giskard, specifically the RAGET toolkit, to evaluate our chatbot. By default, Giskard uses GPT-4 from OpenAI to evaluate the output of our model. However, I would like to replace GPT-4 with an open-source LLM-as-a-judge served through Ollama. I have already set up the Ollama client using the code below (the one mentioned in the Giskard documentation).

import giskard
api_base = "http://localhost:11434" # Default api_base for local Ollama
giskard.llm.set_llm_model("ollama/llama3.1", disable_structured_output=True, api_base=api_base)
giskard.llm.set_embedding_model("ollama/nomic-embed-text", api_base=api_base)

Additionally, for confidentiality reasons, I want to replace the default LLM API calls (which use remote LLMs) with local LLMs via Ollama. I have set up the Ollama client locally (as shown above) and would like to know if this setup will replace all external LLM API calls with local LLMs, wherever Giskard relies on an external LLM.

Below are my questions:

  1. Once the Ollama client is set up, does it automatically replace OpenAI GPT-4 as the LLM-as-a-judge, or is there additional configuration required?
  2. Will the Ollama client setup replace all external API calls and use the local LLM instead? If not, are there additional configurations needed to ensure only local LLMs are used for all relevant tasks?

I know the answer to the second question will also address the first one, but I would still like to ask the first one specifically 😄

@pds13193 pds13193 added the question Further information is requested label Jan 10, 2025
@pds13193
Author

Hi Team,

It’s been 5 days since I raised this issue, and I wanted to follow up to kindly ask if there are any updates on this matter.

This issue is quite important for us as we are looking to implement this code in our production environment. Any guidance or update would be greatly appreciated.

@henchaves
Member

henchaves commented Jan 15, 2025

Hey @pds13193,

Yes, once you set giskard.llm.set_llm_model and giskard.llm.set_embedding_model to use Ollama, it will be used within all Giskard methods that need to call LLM and embedding models. So, for both of your questions, the answer is yes!

From the code you shared, there is no additional configuration required. If you use a Jupyter notebook, you may also need to run:

import nest_asyncio
nest_asyncio.apply()
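
For illustration, here is a minimal end-to-end sketch combining these settings (the sample DataFrame is hypothetical, and it assumes the llama3.1 and nomic-embed-text models have already been pulled into the local Ollama server):

import giskard
import pandas as pd
from giskard.rag import KnowledgeBase, generate_testset

api_base = "http://localhost:11434"  # local Ollama endpoint
giskard.llm.set_llm_model("ollama/llama3.1", disable_structured_output=True, api_base=api_base)
giskard.llm.set_embedding_model("ollama/nomic-embed-text", api_base=api_base)

# From this point on, every LLM and embedding call made by Giskard
# (question generation, LLM-as-a-judge evaluation, etc.) goes to the
# local Ollama server instead of a remote API.
df = pd.DataFrame({"text": ["First document chunk.", "Second document chunk."]})
knowledge_base = KnowledgeBase(df)
testset = generate_testset(knowledge_base, num_questions=5)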

@omarelgaml

Hey @pds13193 @henchaves

I have been following exactly the same approach as @pds13193, and I keep getting this error:

2025-01-31 21:51:12,249 pid:85464 MainThread giskard.rag INFO Finding topics in the knowledge base.
2025-01-31 21:52:16,689 pid:85464 MainThread giskard.rag INFO Found 3 topics in the knowledge base.
Generating questions: 0%| | 0/5 [00:00<?, ?it/s]
2025-01-31 21:54:37,782 pid:85464 MainThread giskard.rag ERROR Encountered error in question generation: 'question'. Skipping.
2025-01-31 21:54:37,847 pid:85464 MainThread giskard.rag ERROR 'question'
Traceback (most recent call last):
  File "/Users/omar/myenv/lib/python3.10/site-packages/giskard/rag/question_generators/base.py", line 59, in generate_questions
    yield self.generate_single_question(knowledge_base, *args, **kwargs, seed_document=doc)
  File "/Users/omar/myenv/lib/python3.10/site-packages/giskard/rag/question_generators/simple_questions.py", line 108, in generate_single_question
    question=generated_qa["question"],
KeyError: 'question'

@pds13193 did you face this problem?

@pds13193
Author

pds13193 commented Jan 31, 2025 via email

@omarelgaml

Hey @henchaves,

do you have any idea why that is happening?

@henchaves
Member

Hi @omarelgaml,
Let me investigate this issue you are facing. Could you please provide the full code that generated this error, with personal and sensitive information removed?

@henchaves henchaves self-assigned this Feb 5, 2025
@omarelgaml

omarelgaml commented Feb 12, 2025

Hi @henchaves,

Thank you for getting back to me. I followed the documentation exactly.

Here is my code:

import os
import giskard
import pandas as pd
from langchain_community.document_loaders import CSVLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from giskard.rag import KnowledgeBase, generate_testset

api_base = "http://localhost:11434" # default api_base for local Ollama
giskard.llm.set_llm_model("ollama/llama3.1", disable_structured_output=True, api_base=api_base)
giskard.llm.set_embedding_model("ollama/nomic-embed-text", api_base=api_base)

csv_files = [".."]

all_documents = []
for file_name in os.listdir('./data'):
    if file_name.endswith('.csv'):
        file_path = os.path.join('./data', file_name)

        loader = CSVLoader(file_path)
        documents = loader.load()
        all_documents.extend(documents)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024, chunk_overlap=20, add_start_index=True
)
all_splits = text_splitter.split_documents(all_documents)
df = pd.DataFrame([d.page_content for d in all_splits], columns=["text"])
knowledge_base = KnowledgeBase(df[0:20])

testset = generate_testset(
    knowledge_base,
    num_questions=10,
    language="de",
    agent_description="make testset relevant to the provided data."
)

Issues:

  1. Language Detection: The documentation states that the language will be detected automatically, but it was not. I had to specify it manually.

  2. Incomplete Question Generation: The function does not generate the required number of questions, as many fail with this error:

Generating questions:  50%|█████     | 5/10 [05:17<06:41, 80.27s/it]
2025-02-03 12:12:37,220 pid:3939 MainThread giskard.rag  ERROR    Encountered error in question generation: 0. Skipping.
2025-02-03 12:12:37,229 pid:3939 MainThread giskard.rag  ERROR    0
Traceback (most recent call last):
  File "/Users/../lib/python3.10/site-packages/giskard/rag/question_generators/base.py", line 59, in generate_questions
    yield self.generate_single_question(knowledge_base, *args, **kwargs, seed_document=doc)
  File "/Users/../lib/python3.10/site-packages/giskard/rag/question_generators/double_questions.py", line 125, in generate_single_question
    "question_1": linked_questions[0]["question"],
KeyError: 0
Generating questions:  60%|██████    | 6/10 [06:28<04:19, 64.83s/it]

It only generated 60% of the questions. If I request 50 questions, it generates fewer than 50% of them.

  3. Evaluation Report Error: Even when proceeding with these questions and running the evaluation, I get an error when attempting to display or save the report:
from giskard.rag import evaluate
from rag_chain import langchain_create_retrieval_chain_init

chain = langchain_create_retrieval_chain_init()

def get_answer_fn(question: str, history=None) -> str:
    """A function representing your RAG agent."""
    answer = chain.invoke(question)  # could be langchain, llama_index, etc.
    return answer

report = evaluate(get_answer_fn, testset=testset, knowledge_base=knowledge_base)
report.to_html("rag_eval_report.html")

Error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[21], line 1
----> 1 report.to_html("rag_eval_report.html")

File ~/../myenv/lib/python3.10/site-packages/giskard/rag/report.py:101, in RAGReport.to_html(self, filename, embed)
     89 """Renders the evaluation report as HTML.
     90
     91 Saves or returns the HTML representation of the scan report.
   (...)
     96     If provided, the HTML will be written to the file.
     97 """
     98 tpl = get_template("rag_report/rag_report.html")
    100 kb_script, kb_div = (
--> 101     components(self._apply_theme(self._get_knowledge_plot())) if self._knowledge_base else (None, None)
    102 )
    103 q_type_script, q_type_div = components(
    104     self._apply_theme(self.plot_correctness_by_metadata("question_type")), theme="dark_minimal"
    105 )
    106 topic_script, topic_div = components(
    107     self._apply_theme(self.plot_correctness_by_metadata("topic")), theme="dark_minimal"
    108 )

File ~/../myenv/lib/python3.10/site-packages/giskard/rag/report.py:334, in RAGReport._get_knowledge_plot(self)
    330 def _get_knowledge_plot(self):
    331     tabs = [
    332         TabPanel(child=self._knowledge_base.get_knowledge_plot(), title="Topic exploration"),
    333         TabPanel(
--> 334             child=self._knowledge_base.get_failure_plot(
    335                 [
    336                     QuestionSample(**question, id="", reference_context="", conversation_history=[])
    337                     for question in self._dataframe[
    338                         ["question", "reference_answer", "agent_answer", "correctness", "metadata"]
    339                     ].to_dict(orient="records")
    340                 ]
    341             ),
    342             title="Failures",
    343         ),
    344     ]
    346     tabs = Tabs(tabs=tabs, sizing_mode="stretch_width", tabs_location="below")
    347     return tabs

File ~/../myenv/lib/python3.10/site-packages/giskard/rag/knowledge_base.py:298, in KnowledgeBase.get_failure_plot(self, question_evaluation)
    297 def get_failure_plot(self, question_evaluation: Sequence[dict] = None):
--> 298     return get_failure_plot(self, question_evaluation)

File ~/../myenv/lib/python3.10/site-packages/giskard/rag/knowledge_base_plots.py:30, in get_failure_plot(knowledge_base, question_evaluation)
     28 reference_answer = [question.reference_answer for question in question_evaluation]
     29 correctness = [question.correctness for question in question_evaluation]
---> 30 colors = [failure_palette[question.correctness] for question in question_evaluation]
     32 x_min = knowledge_base._reduced_embeddings[:, 0].min()
     33 x_max = knowledge_base._reduced_embeddings[:, 0].max()

TypeError: list indices must be integers or slices, not float

Could you please check these issues? Thank you!

@henchaves
Member

Hey @omarelgaml,

I see that you are using the llama3.1 model as your LLM client. I also tested with this model and got the same error. I also tested with llama3 and llama3.2 and still got the error. Basically, the models from the llama family don't seem to be reliable at generating well-formatted JSON; they keep returning something that breaks the parser. If it's not a problem for you, I recommend using another model such as qwen2.5, which I tested and which seems to be working fine. I will also update the Giskard docs to replace llama with qwen.
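
For reference, the switch is just a change of model name in the setup call (this sketch assumes qwen2.5 has already been pulled into the local Ollama server, e.g. with ollama pull qwen2.5):

import giskard

api_base = "http://localhost:11434"  # default api_base for local Ollama
# qwen2.5 instead of a llama-family model, for more reliable JSON output
giskard.llm.set_llm_model("ollama/qwen2.5", disable_structured_output=True, api_base=api_base)
giskard.llm.set_embedding_model("ollama/nomic-embed-text", api_base=api_base)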

@omarelgaml

Hey @henchaves,

Yes, qwen2.5 solved the problem, thanks!

But I still get the same problem when I try to print the report, and when I use .to_pandas() I only find the correctness; no score is given for the retriever.

@henchaves
Member

Hello @omarelgaml.

I'm really sorry for the late response. Let me try to help you with that.

The first thing that may be compromising the final report object is the response format from your call to chain.invoke. The get_answer_fn should only return a str or a giskard.rag.AgentAnswer. If it's returning a raw langchain_core.messages.ai.AIMessage without any preprocessing, it will not work well.
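
For example, a small wrapper along these lines normalizes the chain output before handing it to Giskard (the exact shape returned by chain.invoke depends on how the chain is built, so the attribute and key names below are assumptions):

def get_answer_fn(question: str, history=None) -> str:
    """Wrap the RAG chain so Giskard always receives a plain string."""
    result = chain.invoke(question)
    # AIMessage-like objects expose their text via .content (assumption about the chain's output)
    if hasattr(result, "content"):
        return result.content
    # Retrieval chains often return a dict with an "answer" key (assumption)
    if isinstance(result, dict) and "answer" in result:
        return result["answer"]
    return str(result)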

Secondly, regarding the retriever score: if you want the scores for each RAG component, it's more appropriate to call report.component_scores() instead of report.to_pandas(), which gives details about the evaluation of each conversation in your testset rather than the global metrics.
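
Roughly, the two calls give different views of the same report (a short usage sketch, assuming report is the object returned by evaluate):

component_scores = report.component_scores()  # aggregated score per RAG component (e.g. retriever, generator)
per_question = report.to_pandas()             # per-question details, including correctness
print(component_scores)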


On my side, I've tried to reproduce the same steps as you did, and I successfully managed to call the report.to_html() method without any error. If you still face these issues, could you provide more details about your environment, such as OS, Python version and pip list? Also, some users have identified incompatibilities with specific versions of third-party libs. I recommend you use this version of giskard while it's not released yet.
