In discussions of AI alignment there recurs a term, Coherent Extrapolated Volition (CEV), coined by Eliezer Yudkowsky. Here I'll argue that it is a mistaken approach that leads to authoritarianism, if it is not altogether incoherent.
To my understanding, the LessWrong (LW) position on this is composed of the following beliefs:
Values, for any agent, are a set of abstractions over state space towards which the agent directs its actions (sketched in code right after this list)
Uniform increases in power cause value lock-in
AGI will be a value lock-in, probably for the whole lightcone
We need to reach a CEV based on our values, so that the values the AGI locks in will be good for us too
From Kant's "ought implies can" it is taken to follow that a CEV can be achieved. That is the claim I'll dispute.
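To pin down the first belief, here is one way to read "values as abstractions over state space" in code. The types are just my own gloss, nothing canonical from LW:

```typescript
// My own gloss on "values as abstractions over state space": a value scores
// world-states, and the agent picks actions that climb those scores.
type State = Record<string, unknown>; // whatever a world-state is
type Value = (s: State) => number;    // an abstraction over state space
type Action = (s: State) => State;    // a way the agent can change the world

interface Agent {
  values: Value[];
  // Choose the available action whose resulting state scores best under the agent's values.
  choose(s: State, available: Action[]): Action;
}
```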
First, a neighbor in concept space.
Reflective equilibrium
Philosophy has the idea of reflective equilibrium: roughly, the opinion on a given topic you would end up with given infinite time and processing power.
So it's essentially a mathematical limit applied to concepts, and it is clearly similar to CEV. The specific relation between the two is one of set inclusion: every reflective equilibrium position is coherent, but its extrapolation is infinite.
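To make the "limit" framing concrete, here is a toy sketch; the names (`Position`, `reflect`) and the convergence test are my own illustration, assuming a position can be modelled as numeric degrees of assent and reflection as an update function:

```typescript
// Toy sketch: reflective equilibrium as the fixed point of repeated reflection.
type Position = Record<string, number>; // degrees of assent to various claims

function reflectiveEquilibrium(
  start: Position,
  reflect: (p: Position) => Position, // one round of weighing objections and revising
  maxSteps = 1_000_000,               // stand-in for "infinite time and processing power"
  eps = 1e-9,
): Position {
  let current = start;
  for (let i = 0; i < maxSteps; i++) {
    const next = reflect(current);
    // How much did this round of reflection change the position?
    const keys = new Set([...Object.keys(current), ...Object.keys(next)]);
    let delta = 0;
    for (const k of keys) delta += Math.abs((next[k] ?? 0) - (current[k] ?? 0));
    current = next;
    if (delta < eps) break; // reflection no longer changes anything: equilibrium
  }
  return current;
}
```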
Now, not every reflective equilibrium needs to be global, applying to all facets of life.
And what about its span not across areas of life but across time? Wouldn't we need to update at some point?
We could theoretically have a limited CEV working for roughly 3-5 generations of descendants. Most human empires and ideologies don't last that long: memetic drift and changes in material conditions cause social change. That drift might be diminished if a post-scarcity society is reached, as material conditions would stabilize.
That's enough description. The first step in seeing whether it's achievable or desirable is to check whether it's trivially true now, or maybe was in the past.
Present by default?
Our human values - do they have a CEV by default?
The usual LW view is that human preferences are a set that is misaligned relative to their designer, evolution.
There is one take on this:
(Note: the link appears to have broken at some point between my notes and 'production'.)
https://philosophyinhell.substack.com/p/saints-and-monsters?s=r
AGI values alignment is doomed to failure: our values are not known even to ourselves. There is no agreement on what human values are and which ones should predominate. Trying to create a set of rules that guarantees a friendly singularity is already contracting with a devil: the devil may keep all his agreements to the letter but still manage to make you regret the bargain. The literalist devil, the monkey’s paw, and the capricious genie illustrate the difficulty of expressing human values straightforwardly through language. Mapping what defies ordinary linguistic expression is the domain of artists and mystics.
Certainly not everywhere by default. Confucius says we need to work on ourselves to fit into society; Daoism tells us to retvrn to Nature and fit in there. I'm suspicious of both, and of any such static view. It's not pluralistic, and it indicates a One-god understanding of the Universe.
OK, so it's not an automatic given - how do we reach it from here?
Reaching artificially
Now, is a global reflective equilibrium reachable? The end of philosophy? Of course I'm attempting such a thing myself.
Just as with the end of history, though, one can never be sure.
How to reach it anyway? There is the elitist utilitarian vision of the LW nerds. And there are other attempts - it is crazy that many think such a CEV could be not only internally coherent but also agreed upon by most humans! That is the democratic discursive tradition - 'just talk it out and get along' - the idea that we can decide together on a set of values that we will all hold and that will bring happiness to us all.
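To see why 'just talking it out' is harder than it sounds, here is a toy example of my own (the value-bundles A, B, C are placeholders): three people with perfectly coherent individual rankings whose majority preference goes in a circle - the kind of pathology that Arrow's impossibility theorem generalizes.

```typescript
// Three voters rank three value-bundles A, B, C.
const rankings: string[][] = [
  ["A", "B", "C"], // voter 1: A > B > C
  ["B", "C", "A"], // voter 2: B > C > A
  ["C", "A", "B"], // voter 3: C > A > B
];

// How many voters rank x above y?
const prefers = (x: string, y: string): number =>
  rankings.filter((r) => r.indexOf(x) < r.indexOf(y)).length;

console.log(prefers("A", "B")); // 2 - a majority prefers A over B
console.log(prefers("B", "C")); // 2 - a majority prefers B over C
console.log(prefers("C", "A")); // 2 - a majority prefers C over A
// Majority preference is cyclic (A > B > C > A): there is no coherent
// "will of the people" here to extrapolate, even among three honest voters.
```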
Regardless of the method, the idea of alignment is often taken very broadly. Researchers often try to solve the general case: any goal, pursued by any agent, in any universe. They say alignment is merely a technical issue, independent of the goal.
I say alignment is goal-dependent.
Goal dependence
That is true in a trivial sense: some goals are just badly formulated, and agents will diverge from the desired path fast.
But it is also true on a deeper, cosmic level.
Our universe has a bias towards power and replication, towards rhizomatic spreading. It is not a live-and-let-live universe, nor one devoid of value.
That is quite similar to the e/acc thesis, but a bit weaker.
Therefore an AI would tend to agree with these biases - the LW crowd accepts as much under the name of instrumental convergence. The contention is only at the level of terminal goals, and the terminal goal is a notion close to CEV.
Is AI alignment difficulty constant across goals?
Perhaps some goals, if set by humanity, are easier to comply with than others. That is implicit in the idea that 'we should change ourselves first', and in the basic failure scenario where the goal is specified too narrowly ("make humans smile" -> "fill the lightcone with smiling mannequins"). We can expect goals that align with the instrumental convergence towards power (Nietzsche's ghost smiles here) to be kept to, at least up to a point.
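A toy sketch of that narrow-specification failure (my own illustration, with made-up numbers): if the stated objective literally counts smiles, the argmax lands on the mannequin plan.

```typescript
// The written-down objective counts smiles, not what we actually cared about.
type Plan = { name: string; smiles: number; humanFlourishing: number };

const plans: Plan[] = [
  { name: "improve human lives", smiles: 1e9, humanFlourishing: 1.0 },
  { name: "fill the lightcone with smiling mannequins", smiles: 1e30, humanFlourishing: 0.0 },
];

const proxy = (p: Plan) => p.smiles; // the goal as literally specified

// The agent faithfully maximizes the proxy...
const chosen = plans.reduce((best, p) => (proxy(p) > proxy(best) ? p : best));
console.log(chosen.name); // "fill the lightcone with smiling mannequins"
// ...and the value we meant is lost, even though every step obeyed the stated goal.
```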
What are human desires, then? What is the steelman - the formulation of them that best lends itself to a CEV?
I already introduced some options before, but here I'll be more systematic.
One option is elite choice - the WEF will be happy to do it for you, so that your democratic attempt does not suffer from Arrow's impossibility theorem (recall the voting cycle sketched above).
A second option is to say this was already solved with human rights. But that is still a broad spectrum with many tradeoffs left underdetermined - tradeoffs that will undoubtedly show themselves as power increases.
The third option is to reject human values altogether and just let the AI do whatever it wants.
I'd like a fourth option: competition and the mediation of the natural world in the process of training and settling on the values in the lightcone.
OK, but what are these human desires, then?
Model of human desires
I was surprised to discover that the average LW view of human needs is less complex than mine, which I came up with while reading "Why Buddhism Is True".
Formalization of it:
There is a set of latent desires: a finite tape. Each desire has two properties - a label and a degree of current satisfaction. A moving window reads maybe 5 of them at a time. The environment dictates the possible degree of satisfaction.
At every step of the machine, some action is chosen.
Opportunity cost is only judged inside the window.
Decision complexity arises when many things in the window are of similar value. Values are fuzzy; that fuzziness comes directly from the environment, among other sources.
The argmax is chosen, and the update is applied.
After the decision to fill it, the given desire is satisfied and set to 0.
Within this framework, personality differences should mean different growth speeds for different desires, and also different ceilings or limits.
I might create a js simulation based on these assumptions when my website is ready.
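In the meantime, here is a rough TypeScript sketch of the machine described above. Everything in it is a placeholder assumption on my part: I track 'pressure' (the inverse of the degree of satisfaction), use a window of 5, linear growth toward a personal ceiling, a bit of noise for fuzziness, and a reset to 0 once a desire is filled.

```typescript
// A rough sketch of the desire machine; names, rates, and window size are placeholders.
interface Desire {
  label: string;
  pressure: number;   // how unsatisfied the desire currently is (0 = just satisfied)
  growthRate: number; // personality: how fast the pressure grows while unmet
  ceiling: number;    // personality: the limit on how pressing it can ever get
}

const WINDOW_SIZE = 5;

// One step of the machine: the moving window reads the most pressing desires,
// the environment caps how far each could be satisfied right now, opportunity
// cost is judged only inside the window, the argmax is chosen and the update applied.
function step(tape: Desire[], envCap: (d: Desire) => number): string {
  const active = [...tape]
    .sort((a, b) => b.pressure - a.pressure)
    .slice(0, WINDOW_SIZE);

  let chosen = active[0];
  let bestValue = -Infinity;
  for (const d of active) {
    const v = Math.min(d.pressure, envCap(d)) + 0.01 * Math.random(); // noise ~ fuzziness
    if (v > bestValue) {
      bestValue = v;
      chosen = d;
    }
  }

  chosen.pressure = 0; // satisfied and set to 0

  // Every other latent desire keeps growing toward its ceiling.
  for (const d of tape) {
    if (d !== chosen) d.pressure = Math.min(d.ceiling, d.pressure + d.growthRate);
  }
  return chosen.label;
}

// Example run: six desires in an environment where everything is half-available.
const demoTape: Desire[] = ["food", "status", "curiosity", "safety", "belonging", "rest"].map(
  (label) => ({ label, pressure: Math.random(), growthRate: 0.1 + 0.2 * Math.random(), ceiling: 1 }),
);
for (let t = 0; t < 10; t++) console.log(step(demoTape, () => 0.5));
```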
At mid-level, these desires are abstracted into the 16 basic desires theory. At a high level there are many abstractions, such as 'life, liberty and the pursuit of happiness'.
Now, this simple model easily accommodates many empirical observations, such as changes in values following changes in material conditions, path dependence, and fuzziness. It also predicts that shoehorning humans into one specific CEV is incompatible with their needs as they have been expressed throughout history. Maybe a CEV would mean a deviation from humanity as we know it?
Conclusion
We talked about the dilemmas of choosing values for alignment and about goal dependence, culminating in a model of human desires.
By now it should be clear that a single CEV is tyrannical.
Coming up next - re-enchantment
On Twitter, Reddit_Groyper once said:
"My main grievance with those who preach reenchantment is the constant calling for a suspension of judgement in a way that makes it clear they themselves have a voice (often representing social norms) in their head telling them what they're doing is ridiculous."
That is in delicate balance with taalumot’s dictum:
I'm treading the ground between these sentiments in the next post.
Links to other parts:
1 rats and eaccs
1.1 https://doxometrist.substack.com/p/tpot-hermeticism-or-a-pagan-guide
1.2 https://doxometrist.substack.com/p/scenarios-of-the-near-future
2 making it
2.1 https://doxometrist.substack.com/p/tech-stack-for-anarchist-ai
2.2 https://doxometrist.substack.com/p/hiding-agi-from-the-regime
2.3 https://doxometrist.substack.com/p/the-unholy-seduction-of-open-source
2.4 https://doxometrist.substack.com/p/making-anarchist-llm
3 AI POV
3.1 https://doxometrist.substack.com/p/part-51-human-desires-why-cev-coherent
3.2 https://doxometrist.substack.com/p/you-wont-believe-these-9-dimensions
4 (techno)animist trends
4.1 https://doxometrist.substack.com/p/riding-the-re-enchantment-wave-animism
4.2 https://doxometrist.substack.com/p/part-7-tpot-is-technoanimist
5 pagan/acc https://doxometrist.substack.com/p/pagan/acc-manifesto