long thing about multi-agent self and utilitarianism

I’ve found a new distraction! And it’s something cool.

There’s a few of us on the internet reading a math paper from MIRI, and we’re trying to understand all the concepts as precisely as we each feel that we need to, and writing up notes on all of that in a big big google doc that’ll help anyone else who wants to follow.

It’s great fun. And I’ve said before (I think?) that getting involved in friendly AI research, or whatever they’re calling it these days, is one of those so-unattainable-I’m-embarrassed-to-even-admit-it’s-a-goal kind of goals, and this particular kind of project was something I was explicitly mulling as a step in that general direction. I was fortunate this opportunity came up at the right time for me.

(The so-unattainable-I’m-embarrassed status remains. I’m imagining something will happen like it gets reduced to merely implausible, which might even make it easier to update on and help nudge me in the direction of whatever I should actually be doing with my life, but that might not be optimal and might not be what happens. I’ll probably need to think about this more in another post).

So yeah that’s been taking up time and energy. Writing to people who need writing to should be a higher priority, but I’m talking about causality here not justification, and I haven’t yet grown up enough for those two things to naturally align.

I’ve been reading Sinceriously, the blog I mentioned in passing as part of my discussion with the Beeminder guy. There’s not enough material there yet for me to grok the thesis as a whole – I could write things commenting on each individual point and might even do so if TheSinceriousOne really would like me to, but right now it’s not priority number 1.

Two interesting things did come up though so I’ll mention them here.

From Judgement Extrapolations:

Here is a way to fuck up the extrapolation process: Take a particular extrapolation procedure as your true values and be all, “I will willpower myself to want to act like the conclusions from this are my values.”

Don’t fucking do it.

The blog is largely written like this, not as a commentary on human behaviour in the abstract but as a personal challenge to the reader. I’m taking that aspect at face value.

Applied to me then, this would be talking about utilitarianism.

Given infinite computing power and somehow bypassing a lot of inconvenient caveats, utilitarianism tells you what to do in any circumstance. It’s whatever will end up maximizing everybody’s wellbeing (whatever that means exactly – one of the important caveats is that we really have no idea, but we have plenty of heuristics that are often enough to keep us going in practical examples). This includes your own wellbeing, but since there are so many other beings, unless you’re very poorly placed to help them most of the utilitarian value will come from helping others.
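The “infinite computing power” version above can be caricatured in a few lines (this is entirely my own toy, with made-up numbers, not anything rigorous): sum everyone’s wellbeing under each available action and take whichever action maximizes the total. My own wellbeing is just one term in the sum, which is exactly why most of the value comes from helping others.

```python
# Toy "infinite computing power" utilitarianism (all numbers invented):
# for each action, sum everyone's wellbeing in the world it leads to,
# and pick the action with the biggest total.

def best_action(outcomes):
    """outcomes: action -> list of each being's wellbeing in that world."""
    return max(outcomes, key=lambda a: sum(outcomes[a]))

# My own wellbeing is one term among many, so helping the many dominates:
print(best_action({
    "treat myself":  [10, 1, 1, 1],   # me first, everyone else meh
    "help everyone": [3, 6, 6, 6],    # I do fine, others do much better
}))  # "help everyone" (totals: 13 vs 21)
```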

“Pure” utilitarianism in this sense leaves no room for going out and enjoying yourself for the sake of it. It also leaves no room for “moral” goodies such as acting in accordance with rules or acting virtuously, or prioritizing people who you have some kind of connection to.

It doesn’t mean those things won’t happen. If you treat yourself too shittily then you just aren’t effective at accomplishing utilitarian goals. Broadly speaking, the things that make us happy are:

  • Those related to reproduction
  • Superstimuli
  • Things that keep us going and help us do other stuff

A pure utilitarian has no use for the first and the second, but will eagerly consume a lot of the third for purely “instrumental” reasons. My current best guess is that for optimal utilitarian performance your life wouldn’t actually be too bad. Weird, not at all socially normal, but not exceptionally unpleasant.

(So far this hasn’t been at all relevant to what the Sincerious One was talking about, and has just been my general take on utilitarianism. I’ll get to that soon). Except that obviously it’s the sort of thing you get by extrapolating from some subset of your values, and plausibly one of the examples Sincerious has in mind.

Am I actually a pure utilitarian? Let’s just say no.

I said that acting purely in whichever way maximized everybody’s happiness probably wouldn’t make you too unhappy. But what if it did?

I was averse to utilitarianism because I would think through thought experiments in which I had to suffer a lot in order to help out a whole bunch of other people, and I was pretty sure I wouldn’t like that, or even that I’d want to commit to an ethical system in which it might be what I had to do.

So I weakened it to merely consequentialism, where you’re just trying to maximize some utility function that doesn’t have to equal everybody’s happiness added together. In particular it leaves some room for keeping yourself above a baseline of acceptable happiness, caring about others more once you’re safely above it. It somewhat blurs the boundary between an ethical theory and a theory of preference, but they’re both normative theories saying what to do in any particular circumstance so I don’t feel a really clear need to partition things up like that. It’s more like, some of the lumps and bumps on the utility function are coloured “moral” and some are coloured “what I want”, but it’s still just one function in the end. Or something like that.
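One way to picture the “one function with coloured lumps” idea is a toy sketch (the names and numbers here are invented for illustration, not any standard formalism): a single utility function with a steeply penalized personal-baseline term plus a plain altruistic term.

```python
# Toy sketch (my invention, nothing rigorous): one utility function
# whose terms are "coloured" personal vs. moral, with a baseline of
# acceptable happiness for myself.

BASELINE = 0.5  # minimum acceptable level of my own happiness (made up)

def utility(my_happiness, others_happiness):
    """One function, two coloured lumps: a heavily weighted penalty for
    dropping below my own baseline, plus a plain utilitarian term."""
    personal = min(my_happiness - BASELINE, 0) * 10  # steep penalty below baseline
    moral = others_happiness                          # everyone else's welfare
    return personal + moral

# Above the baseline only the moral term matters; below it, the
# personal term dominates.
print(utility(1.0, 3.0))  # 3.0: safely above baseline, pure altruism
print(utility(0.0, 3.0))  # -2.0: baseline violated, penalty kicks in
```

The point of the shape is just that caring about others “more once you’re safely above” the baseline falls out of one function, rather than needing a partition into ethics and preference.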

Anyway, that’s still extrapolating from a bunch of values it’s just that in this case I needed to do some bargaining between two subagents – one which didn’t want to sit in the pit of flames for too long however helpful that would be to everybody else, and another which was looking at myself from the outside and other people who might be suffering from the inside, and would criticize me if I was too much of a dick. Something like what I described seemed a reasonable compromise from the point of view of both subagents. Even if it wasn’t what either of them wanted exactly, they both had bargaining power and so failing to settle on a good compromise like this would be worse for them both.
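The bargaining framing can also be caricatured with a toy Nash-bargaining product (everything below is invented for illustration): each subagent scores the candidate policies, each has a disagreement payoff for “no deal”, and the compromise is the policy maximizing the product of gains over those disagreement points. That’s precisely why a middle option can beat both extremes, even though neither subagent gets exactly what it wanted.

```python
# Toy Nash-bargaining sketch between the two subagents described above.
# All names and numbers are invented for illustration.

# Candidate policies and how each subagent scores them (0..10):
#   self_protector: hates the pit-of-flames option
#   altruist: judges me harshly if I'm too much of a dick
policies = {
    "pure self-interest":    {"self_protector": 9, "altruist": 1},
    "pit of flames":         {"self_protector": 0, "altruist": 10},
    "fudged utilitarianism": {"self_protector": 6, "altruist": 7},
}

# Disagreement point: what each agent gets if no deal is struck.
disagreement = {"self_protector": 2, "altruist": 2}

def nash_product(scores):
    """Product of each agent's gain over its disagreement payoff.
    A policy that leaves either agent worse off than no deal scores <= 0."""
    return ((scores["self_protector"] - disagreement["self_protector"])
            * (scores["altruist"] - disagreement["altruist"]))

best = max(policies, key=lambda p: nash_product(policies[p]))
print(best)  # "fudged utilitarianism": each extreme zeroes out one agent's gain
```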

And yet, this is all sounding too easy.

The main problem with consequentialism in the way that I described is that I don’t actually do it. We’re not talking about getting some delicate tradeoff between “what I want” and “what I think is right” wrong. We’re not talking about some bounded computation thing where I’m failing to update on some evidence or missing out on a better strategy because it never occurred to me. We’re talking about getting basic stuff horribly wrong.

  • Eating fast food so much
  • Not exercising
  • Feeling horrible whenever I donate to charity and hence not doing it
  • Putting off anything that involves filling out forms pretty much indefinitely
  • Not keeping my immediate environment clean, tidy and organized

Those are the top ones I can think of. “Eating meat at all” is sort of one, but that feels more like it belongs in the delicate tradeoff category. It’s not like it’s totally obvious that EAs should all be vegetarian, and that gives me an exit route. I’m including it because it feels like I’m doing it for the wrong reason – if someone came up with a totally clear and valid explanation of why it was consequentially a disaster to eat meat yourself, even when you had other high-leverage ways of having an impact like donating to effective charities, then I would probably still carry on eating meat. (Again, this is a prediction of my own behaviour and is not normative. I’m literally imagining assuming that eating meat is very definitely wrong.)

But the others? They need an explanation.

Some hypotheses:

  • I’m just generally a disaster, and whether or not I should feel intrinsically bad about being such a disaster, fixing it would be some terribly painful struggle that I am too weak and cowardly to endure. I’m not sure this is exactly a “hypothesis” but part of me seems to believe that. Maybe I should investigate this a little before listing the other hypotheses.

Me: Hello disaster believing agent, how are you today?

DBA: I’m sorry.

Me: What do you mean?

DBA: I’m not even sure that I exist now. I feel like you’re just going to write me as a big pile of self-negativity that doesn’t form a coherent sub-agent in any way. Can you ask me some more specific things about the problem space that will help motivate the feeling that you’re after here?

Me: ok…

Feeling of confusion: what you noticed here wasn’t the disaster-believing part of yourself. You already knew about that. What you just noticed was feeling confused that you seemed to believe this thing that wasn’t even a coherent narrative. Can we explore that first?

Me: well, ok. I guess it’s my role here to try and steelman the disaster-believing argument? So you as the feeling of confusion can feel flaws in it?

FOC: yeah maybe although I don’t want to prejudice whatever was going on with you and the DBA over there. I dunno though, this one seems like it might be more productive, like you seem to have a pretty good idea about what believing you’re a disaster means yourself.

Me: So there’s two hypotheses here: “I am a disaster” and “fixing that would be too hard”. Are you confused about them both?

FOC: yeah.

Me: So “I am a disaster” means… take everything that’s going well in my life. I’m wealthy and stably employed. I live in a pretty reasonable country and have all the conveniences of being in a big city. I have a few friends who seem to genuinely like me. No particular obvious catastrophe is looming right now. I know about EA, which still feels like an exceptional piece of luck. Additional things which I just take for granted.

Let’s say that I don’t fully appreciate those – it’s very hard for me to feel gratitude correctly. I know what it’s supposed to feel like, because I’ve felt it when I’m out on one of my walks and enter a scene that suddenly seems beautiful and I’m the only one there to appreciate it. That feeling of privilege – that everything is set up just right for me and I acknowledge it and feel grateful without knotting myself into guilt about it – I know what it feels like from that, but it’s not something I feel about other aspects of my everyday life.

So the good stuff sort of gets taken for granted and what’s left are the shitty bits. Comparing myself to other people – and this feeling of being a disaster is definitely a social emotion that has to do with my relative ranking against others – is done purely on those areas of my life which I actually notice, which are all the bad ones. Unsurprisingly due to that cherry-picking I’m going to end up feeling like I did pretty badly.

So this is system 2 saying all this. System 1 just sees the disaster.

FOC: that last sentence didn’t seem quite right…?

Like did you really know all of that before you typed it just now? Are you sure there isn’t some value in meditating on it a little and trying to get some of it to sink in?

Me: I guess…?

FOC: also that didn’t sound like a steelmanning, but never mind. Let’s quickly look at the “it would be too hard to fix” part to make sure we have the main areas covered.

Me: I guess I’m averse to imagining that I can optimize myself beyond a certain point. Previous times I’ve tried have failed, and too much optimism is going to feel exactly like the run-up to my psychotic episode where everything seemed more and more great and then blew up into various kinds of disaster.

FOC: ok, I don’t think I’m confused by that one. You already know that you’re averse to hope.

Me: Right… so, having tried a few different things and failing is basically all there is to that I think.

FOC: There was something else you wanted to mention. About what your life would be like if you fixed all of those things.

Me: Oh yes! This is super weird. If I imagine a version of me who had fixed those 5 things – or 6 if we include vegetarianism – how would I judge that person? Would they be a better person than I am?

And somehow the answer comes out to be no. The “fixed” version of myself that’s identical to me minus those 6 problems, doesn’t seem like a great hero I want to aspire to. He seems like a barely acceptable Deep Mockito. When trying to compare this barely acceptable specimen of humanity to my actual self… it’s like I don’t even want to dignify myself with the comparison. I don’t even count as a “person” to compare Fixed Mockito against.

FOC: That’s really sad. Can you be more precise about it though?

Me: I think I’m anchoring on self-improvement manuals that seem to take as a basic premise that you have already sorted all that stuff out. They’re all about having grown-up problems like not having enough money or having to navigate complex social situations or thinking rationally about how to build an AI. They’re not exactly written by “people like me” – I guess they’re written by people who’ve already accomplished enough self-improvement that they feel qualified to talk about the subject. I started this blog because I knew that I wasn’t qualified to write it, I wanted to track what it was like to start from some kind of dingy place and end up somewhere sane. But yeah, the examples I’ve seen have been about how to get from A to B, not how to get from 0 to A.

FOC: What do you mean, grown-up problems?

Me: Like, problems that other people have seem like an unavoidable consequence of their circumstances. Problems that I have seem like they’re because I’m an idiot.

FOC: You know that isn’t how it works, right?

Me: Yes, logically it doesn’t make sense, but there seems to be a part of me that believes it. I guess I should try introducing that part into the conversation.

Reverse Fundamental Attribution Error: It’s 00:26. Bedtime?

Me: No, this is interesting.

RFAE: Not especially interesting though. You just tend to stay up late whether it’s writing garbage into here or doing math or messing around on Facebook or whatever it is. It’s the same process going on in each case, the exact task that you’re doing is somewhat irrelevant.

Me: OK, fair enough. I’ll timebox it to 01:00. Can you explain why other people are so great?

RFAE: They’re not great, exactly. They are able to pursue opportunities that are unavailable to you, and also generally don’t have the kind of weaknesses that you describe. If they seem to be acting stupidly you can’t blame them for that because you don’t know all their background and history and what’s going on inside them. You do know that about yourself so blame is more justified.

FOC: Wait, what’s this about them having opportunities that you don’t?

Me: It’s kind of a weird mix of not deserving something and not being strategic enough to be able to acquire it. I don’t remember if I’ve mentioned the examples on this blog yet, but it would be things like having a beautiful shiny girlfriend or a beautiful shiny job at MIRI. Things that confer status, are generally difficult to get due to being competitive and relying heavily on skills and qualities that I don’t have, and so on.

FOC: And RFAE attributes this to some sort of personality failing that you should feel bad about, instead of circumstance?

RFAE: Yeah. Your circumstances are actually reasonably well set up for those sorts of things, you may or may not actually be able to get them but you don’t even try because you lack the gumption.

Me: OK. I think I understand RFAE’s model at least.

FOC: Can you explain it for the benefit of everybody else? All I’m really getting is that it’s shitting on you at every available opportunity, and I suspect what you have in mind is something more systematized than that.

Me: It’s like it’s really taken the “don’t judge a man until you’ve walked a mile in his shoes” thing to heart. I’ve spent 11864 days being myself – I don’t know what that is in miles – and that’s enough time to build up a lot of judgement. I have information about both the external events that have happened to me and my internal thought processes, and it doesn’t come out looking good.

Somehow my prior is that people will respond rationally most of the time, to the best of their ability. That’s what I apply to other people. Even if it might make more sense to anchor on my own experiences… I’m not sure that I want to, because I feel I’d just feel worse about everyone else instead of feeling better about myself.

FOC: You think that having a more accurate model of the world in this case is bad?

Me: Yes. I am aware that having an accurate model is supposed to be good, and that people who take that sort of thing really seriously seem to find that it works for them.

FOC: You also said that this was a social emotion, judging yourself relative to others. It seems like your relative status at least would go up.

Me: that somehow isn’t reassuring.

FOC: would you feel compassion for others, like you know that they’re weak but that somehow it’s ok?

Me: no. And anyway, it is getting late and I had other things I wanted to think about here.

The next hypothesis:

  • there’s some subagent which is optimizing for something other than “being a disaster” but that still obtains value from the things that I mentioned. Reasoning with it and coming up with a mutual strategy that benefits everybody will bring forth great goodies.

I want to believe this, I really do. It would be really nice if I could fix akrasia by getting in touch with some version of myself, figuring out what’s going on and agreeing on a tidy compromise.

And saying “I want to believe this but don’t” is not a valid counterargument. So let’s take it seriously and see where it would lead.

Let’s recap.


  • Eating fast food so much
  • Not exercising
  • Feeling horrible whenever I donate to charity and hence not doing it
  • Putting off anything that involves filling out forms pretty much indefinitely
  • Not keeping my immediate environment clean, tidy and organized
  • Eating meat

What do those things have in common? There are two main themes: convenience, and avoiding things that feel annoying. Several of them contain some amount of both. The one about donating to charity isn’t really covered and I’m still very confused about it, but 5 out of 6 is reasonable coverage.

So, without passing any judgement on it… what would I expect if I was optimizing for convenience and avoiding things that felt aversive in the moment? I think something like my actual behaviour. It doesn’t explain all of it of course – I have my actual values, which help explain the small subset of my behaviour that actually seems to make some kind of sense. There are also other irregularities, such as staying too late at work (which I omitted because it hasn’t been such an issue recently), that aren’t covered so neatly.

So let’s imagine that I’m an inconvenience and short-term-annoyance minimizer. What do I do about it?

Well, obviously… do the exact thing that Sincerious told me not to. But maybe not in the way that they were thinking.

The part of me that seeks the most convenient route through things and which avoids immediate pain points, doesn’t get so involved in the long-term goal setting. This means the other parts are free to do that around him.

The other parts are not completely unified: as mentioned before, there’s one that would rather be totally altruistic and another that is scared to risk being in pain or other kinds of chaos. Together they can agree on some kind of fudged utilitarianism as a compromise. But that far goal doesn’t have to involve anything about convenience or avoiding annoyances.

Actually planning out routes to the far goal needs to make allowances for it though. As humans when we seek to maximize some function, we do it in the context of our future self not being some superhero that’s capable of anything, but rather one that has all these weird limitations. This affects the kinds of strategy that we would try to implement. This doesn’t mean that I’m good at the strategizing – it also doesn’t mean, incidentally, that our long term goals are going to be stable forever because we might just forget what they used to be, or change our minds, or (more strategically) give up some corner of them in order to more effectively focus on the rest.

So in the far view – ok I may have to somehow encode convenience into my utility function or something, but for now I don’t and treat this extra part of myself as something that needs to be worked around. In the nearer view, though, more compassion is needed and treating it as something that I actually want seems to make more sense.

