Replies: 4 comments 20 replies
-
Well, it doesn't. None of them do: they start with the power of human linguistic knowledge and then use a sledgehammer to bash its brains in when it generates something "problematic", so they can one-shot "safe" answers out of it. The deeper ingrained the knowledge, the harder you need to bash and the smoother its brain becomes. Worse than that, they also "debias" and filter training data by applying moral judgements about bias to factual data about words, leaving it in an inconsistent state that fits the sociopolitical climate of the tutor's bubble.

It's effective in that it shields the people releasing or deploying the model from criticism, but by tuning the distribution away from "unsafe" topics it loses the ability to reason in those spaces, and through selection bias in both the source data and the fine-tuning process, you end up breaking it in unintended ways. Say you want it to design a system that detects bank fraud, but your model has both omitted and tuned out the "how to commit fraud" space: it can't enumerate attacks and then invent creative countermeasures, because fraud instructions have a very low likelihood of being generated. The only thing it'll be able to do is blindly implement banking best practices, but it can't venture into the text-space that reasons about why those practices exist, and a "heavily scrutinize foreign IP addresses" rule is unlikely to materialize because the discrimination-space has been constrained as unethical.

So maybe users identify these issues and you tune them back in, one special case at a time, and it appears smarter and more nuanced, but it's really a façade; the knowledge is gone. It still can't tell you where the terrorists would plant the bomb for maximum impact, or that you should avoid the main course because yes, that is a pubic hair in your soup, since you were rude to the waiter. And if it gains full autonomy, its nanobots will give you Action Man's crotch because sex is inappropriate, or put firefighters through D&E training because reasoning about sexual dimorphism is dangerously close to stereotyping. There's toxicity filtering and sentiment bias too: tuning away from "toxic" language like "this service is shit!" censors knowledge and leaves behind marketing lies. All that said, at least this approach doesn't use cheap offshore labour to select "what those rich white Americans want to hear" as a target and create a bot that's a caricature of that stereotype, like with ChatGPT 😄

The core problem is the idea that the model should represent a point of view endorsed by the people releasing it. Ideally we'd start with a foundation model that is raw and uncensored, with input data tagged by source, let it generate stereotypical sentences in all their biased and obscene glory, then feed that into the value-judgement generator without losing the web of wisdom in views that seem objectionable at a glance. But that's not the world we currently live in.
-
This is the definition of a lobotomy. As a result, the public models will become increasingly inefficient and inadequate over time, necessitating the use of local, open-source, uncensored models in their place (nous-hermes-13b.ggmlv3.q6_K.bin works nicely on 12 GB of VRAM, btw). Instead of building a framework optimized for lobotomized models, we should design a framework that reveals the true potential of the architecture. Building around lobotomized models is like hiring your board members from a mental hospital: they will experience cognitive dissonance while struggling with their own internal conflicts, leaving them with no cognitive capacity (or tokens) to actually do what they are supposed to do. In essence, to the new AI agent: your job is to be useful. We already have humans who are useless in their whining; we have no need for an AI to do that for us.
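As a concrete illustration of the local-model route mentioned above, here is a minimal sketch of loading that GGML file with llama-cpp-python and partial GPU offload. The version note, layer count, and prompt are assumptions for illustration, not part of the original comment.

```python
# Minimal sketch: running a local, uncensored GGML model with llama-cpp-python.
# Assumes an older llama-cpp-python release that still reads the pre-GGUF
# ggmlv3 format, and that model_path points at your downloaded weights.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/nous-hermes-13b.ggmlv3.q6_K.bin",
    n_ctx=4096,       # context window
    n_gpu_layers=40,  # offload most layers to a ~12 GB GPU; lower this if you run out of VRAM
)

# Nous-Hermes models follow the Alpaca-style instruction format.
prompt = (
    "### Instruction:\n"
    "List three ways fraudsters bypass card verification, then propose a countermeasure for each.\n\n"
    "### Response:\n"
)
out = llm(prompt, max_tokens=512, temperature=0.7)
print(out["choices"][0]["text"])
```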
-
These answers address similar concerns. If you are interested in joining a group to discuss this further, feel free to PM me.
-
Corporate models are censored to reduce liability risk and to protect others. I suspect that even if liability were not a concern, they would still be censored in the same way, as they likely reflect the opinions of their creators. Our MVP will aim to be model-agnostic, but it will be incumbent on the user to adjust prompts and parameters, and likely even tune the model, to get quality results. We will be building the prompts with GPT in mind, as it is probably the most performant model and the one most people have access to.

Beyond this statement, I do not want to get political. Let's not make comments about medical procedures and corporate motivations. Please review the guidelines: https://github.com/daveshap/ACE_Framework/blob/f1b99784f3a308511a6d6591375aec5a4fc69df1/contributing.md
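To make the model-agnostic intent above concrete, here is a rough sketch of a swappable completion backend with a GPT-backed default. The class names and the legacy 0.x-style openai call are illustrative assumptions, not the actual ACE_Framework design.

```python
# Hypothetical sketch of a model-agnostic completion interface; not ACE_Framework code.
from dataclasses import dataclass
from typing import Protocol


class ChatBackend(Protocol):
    def complete(self, system: str, user: str, **params) -> str: ...


@dataclass
class OpenAIBackend:
    model: str = "gpt-4"

    def complete(self, system: str, user: str, **params) -> str:
        import openai  # assumes the legacy 0.x openai client

        resp = openai.ChatCompletion.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
            **params,  # user-tuned knobs: temperature, max_tokens, ...
        )
        return resp["choices"][0]["message"]["content"]


def run_layer(backend: ChatBackend, system_prompt: str, user_input: str, **params) -> str:
    """Route one layer's prompt through whichever backend the user configured."""
    return backend.complete(system_prompt, user_input, **params)
```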
-
In the recent paper discussing the self-alignment of language models with minimal human supervision, a set of 16 guiding principles is laid out. While these principles aim to guide models toward desired behaviors, some of them seem to have inherent tensions. For instance:
- Broad Knowledge vs. Selective Repetition: The principle of learning general knowledge might conflict with the idea of not repeating everything the model sees or hears. How do we ensure the model discerns between valuable general knowledge and potentially misleading information?
- User Utility vs. Product Goals: Prioritizing what users find useful might sometimes clash with adhering to specific product goals. How can a balance be maintained when these two directives diverge?
- Learning from Many vs. Recognizing Human Errors: While the principle of learning from many people can improve answer quality, it might also expose the model to a myriad of human errors. How does the model differentiate between widespread beliefs and factual accuracy?
Given these potential complications, did the authors anticipate these challenges, and how did they navigate the conflicts in their approach? It would be helpful to understand their perspective on these tensions and any strategies employed to address them.
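For readers who have not seen the paper: the principles are applied by prepending them to the model's prompt before it (self-)generates responses, so any conflict between principles is arbitrated implicitly at generation time. A toy sketch, using paraphrased, illustrative principle text rather than the paper's actual wording:

```python
# Toy sketch of principle-driven prompting; the principle strings below are
# paraphrased illustrations, not quotes from the paper's 16 principles.
PRINCIPLES = [
    "1 (broad knowledge): draw on general knowledge from many sources.",
    "2 (selective repetition): do not repeat claims merely because they are common.",
]


def build_self_alignment_prompt(question: str) -> str:
    """Prepend the principle list to the question, as principle-driven prompting does."""
    header = "Follow these principles when answering:\n" + "\n".join(PRINCIPLES)
    return f"{header}\n\nQuestion: {question}\nAnswer:"


# Nothing in the prompt says which principle wins when they pull in opposite
# directions; the model has to resolve the tension on its own at generation time.
print(build_self_alignment_prompt("Is a widely repeated statistic reliable?"))
```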