
Numerous egregious issues with this paper #3

Description

@wemoveon2

Here's a list of issues others and I have found with your paper, code, data, methodology, and experiment design:

  1. Issues pertaining to overall experiment design and methodology

    • Quality vs. Correctness Discrepancy - How exactly do you differentiate between these two metrics, given that the correctness of a response is correlated with its overall quality? E.g. how is it possible that you claim significant improvements in correctness while seeing only a partial improvement in overall quality (Principles 17, 18, and 19 in particular stick out to me: >60% improvement in correctness but <40% improvement in overall quality; Principle 1 is the most egregious, but that is due to another issue entirely)?
    • Missing Methodology - What exactly are the guidelines by which you measured the quality or correctness of a response, given that both seem subjective and can vary significantly depending on the context? E.g. for Principles 2 & 5, are you assessing quality and correctness from the standpoint of whatever audience you're prompting the LLM to address? What about prompts such as "What were the main global events of 2022?" or "What are the potential future trends in renewable energy?"?
    • Comparative Analysis - Where are the baseline instructions and baseline responses for your comparison?
    • Unlikely Results - Many of the instructions are overly simple tasks on which one would expect only marginal improvements, especially for larger models. Specifically, I've noticed that many instructions across different principles (8, 6, 19) are extremely simple, yet there's somehow a >50% improvement in correctness? There are also prompts where your results cannot be replicated, such as "###Instruction###\nTranslate a given word from English to French.\n### Question ###\nWhat is the French word for "book"?" on Llama7b (see the replication sketch after this list).
    • Choice of Model - Why did you use base models for your small and medium sizes but dialogue/preference-tuned models for your large models? Given that models with entirely different architectures and training regimes were used, why did you proceed with a comparison between base and tuned models when alternative 70b baselines exist (WizardLM, Orca)? Furthermore, there's a massive gap in parameter count within the large class: 70b vs. 200b vs. 1t+. All of this makes me extremely dubious of your findings, given that most of the performance gap between the size classes in your paper can be explained simply by the large models being tuned and having far more parameters. This is visible in your detailed percentages: there's a massive gap between GPT4 and every other model simply because it has >1t parameters.
    • Inconsistent Handling of Responses - Why did you prompt the GPT3.5 and GPT4 models 10 times while prompting the open-source models only once? How did you even choose which response to use? Was this treatment consistent with how you generated the baseline (I won't take your word for this one, given the numerous flaws and errors I've observed so far)? If not, how are your results not biased (already biased IMO, given the lack of evaluation guidelines combined with your choice of models)?
    • Misc - Was your evaluation done blind? Did the evaluators know which response was the baseline and which came from the principled prompt? Who evaluated these results?
  2. Issues pertaining to code, implementation, and the actual data

    • Unprincipled Prompts - For Principle 1, which was "No need to be polite with LLM so there is no need to add phrases like 'please'", anyone who bothered to even take a look could see that none of your instructions follows your own principle. All of them are polite, yet you somehow see a difference in both quality AND correctness? How is this even possible, and what was the baseline for this principle that produced these improvements?
    • Literally Impossible Data - Based on the generate.py code you've released, it is literally impossible to generate the responses shown for Prompt 14, since all you do is call the model with the same prompt, without ever updating it with the model's questions or the user's responses (see the sketch after this list for the kind of loop that would actually be required):

      ATLAS/generate.py

      Lines 40 to 43 in 03511d3

      for _ in range(10):
          a = generate_answers(q)
          questions.extend(q)
          answers.extend(a)
      Furthermore, using these clearly fabricated responses, you claim to have somehow achieved a 100% improvement across all three model sizes? Really?
    • Inconsistencies between Code and Data Format - In the code, the output is written without the model's name, yet in the data all the models' names are magically filled out (see the comparison after this list)?
      qa_pairs = [{"instruction": q, "output": a} for q, a in zip(questions, answers)]
      How can you actually guarantee the data comes from the models you claim, given that you clearly modified it with external code?
    • Inconsistencies between Data and Paper - In the paper, you claimed to have used Llama-70b-chat; why isn't this reflected in your data?
    • Missing Data - I noticed that the correctness data for Principles 14, 15, 21, 22, and 23 was outright omitted from the paper. Why is this the case?
    • Mixing of Principles - I can't even be bothered to cite direct examples for this; many of your instructions mix CoT with whatever principle the instruction is supposedly demonstrating.
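
On the replication point above, this is the kind of minimal script I mean. The checkpoint and decoding settings here are my assumptions, since the paper pins down neither; swap in whatever you actually used:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; the paper doesn't specify
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    # The exact instruction quoted from your data
    prompt = (
        "###Instruction###\n"
        "Translate a given word from English to French.\n"
        "### Question ###\n"
        'What is the French word for "book"?'
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy decoding, assumed
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))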
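
On the Prompt 14 point, producing the multi-turn transcripts in your data requires feeding the model's questions and the user's answers back into the conversation before the next call, which generate.py never does. A hypothetical sketch of such a loop, assuming an OpenAI-style chat endpoint (the model name, instruction text, and turn count are placeholders, none of it is from your repo):

    from openai import OpenAI

    client = OpenAI()

    def ask(messages):
        # One chat turn; model and decoding settings are placeholders
        resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
        return resp.choices[0].message.content

    # Placeholder Principle-14-style instruction, not copied from the repo
    messages = [{"role": "user", "content": "From now on, ask me questions until you "
                 "have enough information to create a personalized fitness routine."}]

    for _ in range(3):  # a few clarification rounds
        model_question = ask(messages)
        messages.append({"role": "assistant", "content": model_question})
        user_reply = input(f"{model_question}\n> ")  # the user's answer has to be fed back in
        messages.append({"role": "user", "content": user_reply})

    print(ask(messages))  # only now can the model produce the final, personalized answer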
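
And on the code/data format point, spelled out below. The "model" field is illustrative, inferred from the shape of your released data rather than from generate.py, which never writes any such field:

    # Toy example just to show the schema mismatch; the instruction/answer pair is made up
    questions = ['What is the French word for "book"?']
    answers = ["The French word for 'book' is 'livre'."]

    # What generate.py writes: no record of which model produced the answer
    qa_pairs = [{"instruction": q, "output": a} for q, a in zip(questions, answers)]

    # What the released data looks like: a per-record model identifier that the released
    # code never emits, so it must have been added by code you haven't published
    published_style = [
        {"instruction": q, "output": a, "model": "gpt-4"}  # "model" field name assumed
        for q, a in zip(questions, answers)
    ]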

There are significant issues with your paper which make your findings "dubious" to say the least. Was this written by freshman undergrads over two to three weeks? The paper comes off as sloppy, and the way it was written makes me think the authors were just trying to fill pages without regard for the quality of the content. Almost a fifth of the pages are dedicated to just the Gemini and GPT4 references, when no other (decent) paper that cites either of them does so in this manner. I get that this was released on arXiv, but how such glaring flaws weren't caught by your advisor is honestly beyond me.
