This post concerns another research paper from the Machine Intelligence Research Institute, Robust Cooperation in the Prisoner’s Dilemma: Program Equilibrium via Provability Logic by Patrick LaVictoire et al (5/31/2013 preprint).
Mutual defection is the unique Nash equilibrium in the classical prisoner’s dilemma. It is troubling whenever rational agents pursuing their own self-interest predictably find themselves unhappy with the result, and so variants of the prisoner’s dilemma in which mutual cooperation can be rationally achieved are interesting, and perhaps reassuring. The best known variant is the iterated prisoner’s dilemma, in which the potential for long-term mutual cooperation can outweigh the short-term gains of immediate defection.
As for one-shot prisoner’s dilemmas, Douglas Hofstadter may have been the first to seriously suggest that the standard game-theoretic analysis (in which players’ actions are assumed to be independent) is insufficient. His idea of “superrationality” holds that two players who know they are similar to one other will reason that their decision will be the same, so they ought to cooperate. In this setting, any reasoning favoring cooperation is self-reinforcing, and any reasoning favoring defection is self-defeating. His 1983 account of a casual experiment he ran (fascinating though it may be) failed to exhibit superrationality among humans.
LaVictoire et al investigate agents more reliable and more transparent than humans: computer programs. In their setup, each program has access to the other’s source code while deciding whether to cooperate or defect. They identify several agents which will cooperate with agents similar to themselves but never with a defector. Here, Löb’s theorem is the key ingredient to formalizing the notion that a decision to cooperate with similar agents is self-reinforcing.
1. Löb’s Theorem
The investigation starts in an unlikely seeming place: Löb’s theorem is often read as a generalization of Gödel’s second incompleteness theorem, which says that a reasonable theory of arithmetic cannot prove its own consistency. (Here a reasonable theory is one which is consistent and computably axiomatizable and extends first order Peano Arithmetic, or at least the weaker Robinson arithmetic, which is just powerful enough to represent computable functions.)
Toward a statement of Löb’s theorem, fix a Gödel numbering of formulas for such a theory , and a (computable, hence representable) proof predicate
that checks whether the number codes a proof for the formula .
Stealing some notation from modal logic, write as an abbreviation for
Now if is taken to be a contradictory sentence (usually denoted or by logicians), the sentence expressing the consistency of can be written simply . The second incompleteness theorem says that if doesn’t prove , then it doesn’t prove either. For further notational convenience, we often drop the subscripts when is obvious from context.
To segue into Löb’s theorem, rewrite the incompleteness theorem again as:
Theorem 1 (Gödel) If , then .
That is, a consistent will not believe that a proof of contradiction would actually imply a contradiction. Perhaps surprisingly, above can be replaced with any arithmetic sentence: if doesn’t already prove , then it will not believe that a proof of would imply .
Theorem 2 (Löb) For any sentence , if , then .
In fact, one can quickly derive Löb’s theorem from Gödel’s, though there are more direct, lower-level proofs. One benefit of a lower-level proof is that one can better describe how a -proof of is used to construct a -proof of , and so one can actually formalize Löb’s theorem in Peano Arithmetic (which can check proofs in , even if ). That is,
with ‘s all relative to . But we give the short proof here.1
Now suppose that , or equivalently (by the deduction theorem) that the theory is consistent. With a wave of our hands, we claim the deduction theorem can be formalized in Peano Arithmetic, so that also recognizes this equivalence,
Thus far, we’ve only talked about Löb’s theorem in the contrapositive, where it most resembles the second incompleteness theorem in putting limitations on what a theory can prove. In this interpretation, Löb’s theorem implies that a theory cannot recognize the soundness of its proofs. For this reason, Löb’s theorem is mainly seen by MIRI researchers as a big problem for self-reflective reasoning.2
The recent paper by LaVictoire et al is more concerned with the positive direction in Löb’s theorem, which incidentally is what Löb had in mind when he proved it. Specifically, he was trying to determine the truth or falsity of “truth-teller sentences” which assert their own provability. That is, sentences constructed from the diagonal lemma such that
Löb’s theorem constructs proofs of all such sentences (actually only requiring provability of the right-to-left implication), but before his work it seemed that truth-teller sentences could just as well be either true or false, possibly even depending on the particulars of the encodings!
By contrast, the sentences Gödel used to prove his incompleteness theorems mimic the liar paradox by asserting their own unprovability, i.e.
These sentences can easily be argued to be true and unprovable when is arithmetically sound. The modern versions of Gödel’s theorems which merely require to be consistent (e.g. when , consistent by the incompleteness theorem for , or the theory from the modern proof of Löb’s theorem above) are due to Rosser.
In the next two sections, we’ll see how Löb’s theorem allows two programs to “trust” one another to cooperate.
2. CliqueBot and FairBot
Any discussion of cooperation in the prisoner’s dilemma with revealed source code should probably start with the so-called “CliqueBot”, which apparently has been known for 30 years. Via quining tricks (i.e. Kleene’s second recursion theorem), CliqueBot compares its own source code to the other player’s, and cooperates if and only if the sources are identical. Such a program is a clear, if modest, improvement over “always defect”, as it’s tailor-made to cooperate with copies of itself (which will of course also cooperate with it). But this cooperation is fickle, in that trivial modifications of CliqueBot’s source code result in mutual defection, and so LaVictoire et al note that
An ecology of such agents would be akin to an all-out war between incompatible cliques.
The real idea explored in LaVictoire et al is that the logical dependence of two agents’ choices can be made more robust, namely via Löb’s theorem. Specifically, they identify an agent which:
- Always cooperates with itself, and moreover achieves mutual cooperation with any agent functionally equivalent to itself (and even with some agents who are just functionally similar, but not equivalent),
- Always defects against the “stupid” CooperateBot and DefectBot (which always cooperate or defect, respectively, against any opponent),
- Is unexploitable, in that it never cooperates with a defector.
Note that DefectBot trivially satisfies (2) and (3). CliqueBot still satisfies (2) and (3), while receiving modest partial credit for (1).
The formal set up is this: agents are one-variable predicates on Gödel numbers of formulas, where truth corresponds to cooperation, and falsity to defection. These predicates must be simple enough to be represented by formulas of PA; agents are then identified with such formulas. For now, agents need not be computable, and in fact the two discussed are defined in terms of the merely computably enumerable sets of agents they cooperate with (but see section 4 on two approaches to recovering computability).
The first agent described in the paper is FairBot, apparently discovered by Vladimir Slepnev in 2010. FairBot cooperates with you whenever you provably cooperate with it, i.e.
FairBot is willing to search through Peano Arithmetic’s infinitely many proofs, looking for a proof of its partner’s cooperation. Finding one, FairBot will cooperate immediately; divergence in this search is FairBot’s defection. If we represent semicomputable agents as partial recursive functions (with divergence interpreted as defection), then FairBot can also be constructed via the second recursion theorem.
The obvious computable approximation to FairBot simply defects if it can’t a proof shorter than a certain prespecified bound.
Note that FairBot is unexploitable and defects against DefectBot: since is sound, it will not prove that DefectBot (or any other defector) cooperates with FairBot. (But note that this proof of defection against does not occur within ; this will be relevant to the construction of the next agent PrudentBot.)
We’ll see that FairBot does cooperate with itself, but it loses points by needlessly cooperating with CooperateBot. One might say this is only fair, but LaVictoire et al are quick to point out that you really do want to defect against CooperateBot; if you’re worried enough about CooperateBot’s “feelings” that you actually prefer outcome to , then you’re not really in a prisoner’s dilemma at all. They also argue that CooperateBots are abundant in real ecosystems, represented by non-agents which are not capable of threatening defection (e.g. natural resources to be used, or undesirable features of the environment to be removed or changed). The ability to distinguish “real” agents from CooperateBot and lookalikes is highly advantageous.3
So how does Löb’s theorem enter the picture?
then proves that and mutually cooperate.
Proof: When and are identical, then the definition of gives
That is, is a truth-teller sentence, and thus is proved by Löb’s theorem.
The general case is proved by noting that provably distributes over . That is, given any sentences and , a proof of can easily be used to construct separate proofs of and ; conversely proofs of and can be joined to construct a proof of , so
(We’ll actually only need the right-to-left implication here.)
With this in place, let be the sentence and be the sentence . By the definitions of and ,
so is a truth-teller sentence, and is proved by Löb’s theorem.
To recap, both CliqueBot and FairBot have their shortcomings: CliqueBot’s cooperation with itself is not robust. This is repaired somewhat with FairBot, but then FairBot can’t bring itself to defect against CooperateBot.
The clause which should be added to FairBot to ensure defection against CooperateBot says something like “If you’re willing to cooperate with DefectBot, then I defect against you”, reasoning that you’ll probably still be willing to cooperate with me. This is the essence of PrudentBot, but some care must be taken to get it right. As a first attempt, one might define such that
with both boxes representing provability in . Then:
- inherits unexploitability from FairBot, as it never cooperates in a situation where FairBot wouldn’t.
- However, since FairBot and are defined in terms of the semirecursive sets of agents they cooperate with, it is “difficult” to determine defection against particular agents, such as in the definition of .
- Specifically, you can’t prove just in that (or FairBot for that matter) defects against some agent . For if proved , then
but knows that either disjunct implies , i.e. .4 Thus proving such a defection in would contradict the second incompleteness theorem.
- Since can’t prove that defects against (and since -provable defection against is a necessary condition for to cooperate with you), defects against itself.
- Now, the agent does in fact defect against both CooperateBot and DefectBot, and this behavior is provable in the sound theory . This is because and are sufficiently transparent that can prove both
and thus proves both
and this is what is needed to establish defection against and .
Thus PrudentBot can be salvaged by allowing it to search for proofs of defection in , i.e. with such that
With this in place, the above arguments still show that is unexploitable, and that is enough to prove defection against and . Moreover, can still check proofs in the stronger system, so
Thus when analyzes the play of two PrudentBots against one another (or versus ), the additional term regarding DefectBot “drops out”, and the same Löbian argument of Theorem 3 establishes:
Thus PrudentBot satisfies all three desiderata outlined in the preceding section, and this mostly concludes the investigation.
4. Two Approaches to Computability
4.1. Bounded Löb’s Theorem
As mentioned above, the most obvious way to construct computable approximations to FairBot and PrudentBot is to simply bound the lengths of proofs they will consider. This only restricts the set of agents the two are willing to cooperate with, so unexploitability is preserved.
It is “obvious” that such a construction can be made to generally work. Of course, BoundedFairBot will have to mutually defect with HorriblyObfuscatedBoundedFairBot if e.g. the source code of the latter exceeds the maximum length of proofs the former is willing to consider. But if done correctly, mutual cooperation among BoundedFairBot and similar agents should be preserved.
The idea of course is that an instance of BoundedFairBot is determined by the bound below which it searches for proofs, and even very large can be specified and reasoned about quite compactly (e.g. using Knuth’s up-arrow notation if the standard binary expansion isn’t compact enough), so BoundedFairBot is only slightly more complex than its unbounded counterpart. If the length of the Löbian proof that FairBot cooperates with itself is much less than , then a modified proof that BoundedFairBot cooperates with itself ought to exist with length not much greater than , in particular with length below .
A provably correct construction here would need to specify bounds in Löb’s theorem, along the lines of
If , and if (and the length of ) are both “sufficiently small” compared to , then .
Here of course is meant to denote provability in symbols or fewer, and is the arithmetic formalization.
We require that not be unreasonably large (in fact, we ask that be small), because for any and , will always prove , whether or not it proves . In the case that does not prove in symbols or fewer, can simply establish by exhaustive enumeration of all exponentially-many proofs of length or less. By contrast, the length of a Löbian argument should scale quite slowly with compactly represented .
This kind of thing can get pretty messy (Slepnev seems to have worked out the details here, though I haven’t personally checked them). LaVictoire et al consider the problem of Löb bounds solved, and focus on a new, cleaner approach.
4.2. Modal Agents
This approach involves doing some actual modal logic. In previous sections, the “modal” symbol was merely an abbreviation for a relation definable in , but here it will be a formal logical symbol. (So in this section, when we wish to talk about provability in or another theory, we will explicitly use subscripts, e.g. .) Here a certain class of “modal agents” is defined in the Gödel-Löb provability logic . Behavior of modal agents need not be defined against arbitrary agents, but when locked in so-called “modal combat” with another such agent, the outcome is algorithmically decidable by a fixed point theorem (and a normal form theorem) for . The Gödel-Löb provability logic is a propositional modal logic abstracting away the details of arithmetic and the provability predicate into the formal symbol . That is, the formulas of are defined recursively by:
- and all propositional variables are formulas.
- If and are formulas, then so is .
- If is a formula, then so is .
Other standard connectives are used as abbreviations, e.g. abbreviates , and abbreviates , i.e. , and abbreviates , i.e. . Quantification is not allowed! (But is intended to capture the notion of “there exists a proof”.)
Before giving the axioms and rules of inference, we point out that is:
- Just powerful enough to talk about provability in : a modal formula is a theorem of if and only if every instance of the formula, with propositional variables replaced by (Gödel numbers of) sentences of arithmetic and the formal replaced by , is a theorem of . The right-to-left implication is Solovay’s completeness theorem, which is considered highly non-trivial.
- Still a weak enough extension of propositional logic that its set of theorems is decidable, via the Kripke semantics. Additionally, sentences without propositional variables can be algorithmically classified as true or false when interpreted as statements about provability in Peano Arithmetic. See the next post for details.
The Gödel-Löb axioms are all the instances of:
- All propositional tautologies, e.g. and and so forth, for any modal formulas and .
- The distribution schema: , capturing the idea that a proof of and a proof of can be easily combined to construct a proof of , and recognizes this.
- Formal Löb’s theorem: .
The rules of inference for Gödel-Löb are:
- Modus Ponens: From and , conclude .
- Necessitation: From , conclude .
Thus a proof in is a sequence of formulas, all of which are either axioms or follow from previous lines using a rule of inference.
Necessitation is supposed to capture the fact that if , then , because can check proofs. However, is not in general provable in , just as does not in general prove .
There are two redundant but useful axiom schemata derivable from the others, and one more redundant rule of inference worth pointing out:
- Formal necessitation schema:
- Alternate distribution schema:
- Löb’s Rule: from , conclude .
One last remark is in order before talking about PrudentBot and friends in : provability in must be rewritten in terms of provability in , but this is easy. By the deduction theorem, if and only if , thus can be replaced with .
Then modal agents are defined in terms of modal formulas which relate their actions to their opponents’ actions, so that the play of two modal agents is described by a system of modal formulas. For example, the following system describes the play of FairBot against PrudentBot:
Here stands for FairBot’s hypothetical cooperation, stands for PrudentBot’s hypothetical cooperation, and stands for FairBot’s hypothetical cooperation with DefectBot, obtained by substituting into FairBot’s criterion for cooperation, where stands for DefectBot’s hypothetical cooperation with FairBot.5
The fixed point theorem implies that such systems can be “solved”, i.e. there is a algorithm which constructs sentences , , , and containing no propositional variables such that proves all of the above formulas with replaced by , by , etc. The only hypothesis necessary for the fixed point theorem is that propositional variables on the right hand side occur within the scope of some . For this reason, agents like CliqueBot requiring direct access to source code are not suitable for the modal framework.
FairBot and PrudentBot are simple enough that we can deduce on our own what and must be up to equivalence (namely ), by the same reasoning as in sections 2 and 3, now formalized in . (But it is convenient to know ahead of time that the fixed point theorem guarantees the existence of such and in the first place.) That is, if , , , satisfy the above, distribution and the definitions of and give
This is essentially the proof from section 2 that FairBot does not cooperate with DefectBot, because is false, although not provable in by incompleteness.
Using more distribution and some propositional tautologies, is now proved equivalent to a simpler sentence:
since of course is provable in by Necessitation. Further distribution and propositional tautologies give
thus by Löb’s Rule.
This a formalization in of the outlined argument in section 3 that the mutual cooperation between and is visible to alone, even though the definition of involved provability in . While doesn’t prove , it certainly does prove .
Of course, it can happen that the fixed point theorem doesn’t produce and equivalent to or , in which case there’s still a bit of work to be done to determine whether and are really true or false. The play of FairBot against DefectBot is just such an example: in the modal framework, FairBot’s cooperation with DefectBot is , i.e. false, but not provable in .
It turns out that the truth values of any sentences without propositional variables are effectively decidable. The key here is the normal form theorem for , which says that every sentence without propositional variables is provably equivalent to an (effectively generated) propositional combination of sentences of the form . Here of course the power notation means , and means .
And then sentences in normal form are easy to classify as true or false. Because every is false, you can simply remove every from a normal form sentence, and you’re left with a propositional combination of ‘s, the truth value of which can be easily evaluated.
(Reminder: this reasoning about truth and falsity is taking place outside of . However, it is actually true that any normal form sentence has either or for sufficiently large .)
Further details on the Gödel-Löb provability logic (e.g. proofs of its decidability, the fixed point theorem, and the normal form theorem) are the topic of the next post.
The preprint by LaVictoire et al (and earlier work by Slepnev) successfully establishes that superrational cooperation in a one-shot prisoner’s dilemma is theoretically possible.
Of course, exhaustive proof searching is terribly inefficient in practice, with the proof space growing exponentially in the lengths of proofs. Thus proof searching seems ill-suited even for programs which do nothing other than engage in prisoner’s dilemmas with other such programs. The results just came in for this small real-world prisoner’s-dilemma-with-revealed-source-code tournament, and of course no entries tried to find Löbian cooperation. This would have been quite difficult to program, and all for naught if nobody else wrote a transparent proof searcher.6
It may be that just searching for proofs matching the general shape of Löbian arguments would boost efficiency, as it cuts down the search space substantially, though I’m not confident that it would be enough. One difficulty that would likely arise here is that now it’s no longer obvious to (and probably not even true) that just any proof of your cooperation implies my cooperation and vice versa. Perhaps this could be addressed by (formally specifying and) proving a more restrictive version of Löb’s theorem, along the lines of
If there’s a [short] proof that a [bounded length] Löbian proof of would imply , then there’s a [bounded length] Löbian proof of .
Formally specifying what is meant above might be difficult.
For more complex programs (e.g. that must navigate the real world and do things other than play prisoner’s dilemmas), proof searching seems like a non-starter: it’s not easy to read an arbitrary program and deduce its behavior, whereas FairBot and company can in theory be made transparent to .
On the other hand, LaVictoire et al demonstrate that their modal framework alleviates computational problems significantly, if one is content to only consider modal agents. Specifically, they say they’ve double-checked the results of their paper by coding up the fixed point theorem for (and indeed they say that having such a program on hand helped them discover PrudentBot in the first place).
Of course, the modal framework is unrealistically limiting. In abstracting away all the particulars of reasoning about source code into the formal symbol , you are left with agents who do nothing except engage in prisoner’s dilemmas with other modal agents.
Real cooperation seems to remain a hard problem.
- A more direct proof, straight from the Hilbert-Bernays-Löb provability conditions and formalized in , can be found in a text such as Boolos’ The Logic of Provability or Lindström’s Provability Logic: A Short Introduction. But even the short proof given here can be formalized in if you first formalize the incompleteness theorem.
- This kind of blind spot regarding self-trust is thought to be particularly troubling in the following scenario: suppose I am a reasoner, roughly equivalent in power to , and can’t be everywhere at once. I wish to build some autonomous robots to do my bidding. Before constructing these robots, I would like to ensure that any logical conclusions derived by those robots will actually be correct. This is a weak, but arguably necessary, criterion of safety. Löb’s theorem seems to imply that such robots must be noticeably stupider than me. That is (with the obvious interpretations of , , , and ), can’t simultaneously guarantee the following for :
- Soundness: for all , and
- Expressiveness: for all . (In other words, it would be nice to know that the robots would be smart enough to read any -proof and understood it as an -proof, but this is incompatible with Soundness by Löb’s theorem.)
A more precise statement of the problem (including situations where agents do more than just prove theorems about arithmetic), and various partial workarounds are presented in Yudkowsky et al.
- LaVictoire et al note that agents can be contrived which selectively punish basically any behavior. For example, an agent might defect against anyone who defects against CooperateBot, thus giving preferential treatment to FairBot. But such agents should be rare in a “real ecosystem”, as they must lose points from either cooperating with CooperateBot or from mutually defecting with copies of themselves.
- The symbol distributes over in the sense that if , then also , because knows that it proves (by just reading over and checking the proof) and recognizes that any proof of could be combined with a proof of to construct a proof of . Now for any , , thus , and by contraposition .
- LaVictoire et al also introduce the notion of the rank of a modal agent. FairBot and DefectBot are defined without calls to other agents, and are said to be of rank . The definition of PrudentBot contains a definition of the rank DefectBot, and is said to have rank . The play of two rank agents against one another is easily expressed in a system of just two formulas, but as above, the natural expression of the play of higher ranked agents will be larger systems, in which some of the formulas represent calls to lower ranked agents, which naturally ought to be “solved” first. They also show that rank modal agents cannot cooperate with FairBot and defect against CooperateBot (though certainly non-modal agents can do this via simple source code comparison), thus higher rank agents like PrudentBot are necessary to get good behavior.
- In fact, since copy-paste CliqueBot collusion was explicitly forbidden in the rules, the tournament was not won in the back of trust and mutual cooperation. Perhaps disappointingly, it was swept by three very simple random submissions whose cooperation or defection was independent of their opponents’ sources, the winner being biased towards defection. I am not sure whether the random submissions won just by a streak of dumb luck, or whether on average, they predictably trick the programs that call
eval, often enough to rack up some free points. For my part, I defected against all of them, but was a bit unlucky in that all three also defected against me. (Note: that my program explicitly cooperates with CooperateBot is not accidental; the tournament was not seeded with CooperateBots. This won me points against submissions B and E, though I also lost some points, thanks to the three players who actually submitted CooperateBot!) [UPDATE: this unofficial 100-iteration re-run suggests the random submissions won by dumb luck.]