This post was written under the mentorship of Evan Hubinger, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. Thanks to Evan Hubinger for his mentorship under the SERI MATS program, and to Arun Jose, Rob Ghilduta, and Martín Soto for providing prior references and some feedback. Also thanks to James Lucassen for reviewing a draft with me and providing extensive feedback.

Epistemic status: Somewhat uncertain. Many of the arguments about using the speed prior rely on high-level reasoning and lack substantial formalizable proofs. Acceptance of any of the arguments in this post should be conditional on confidence that they can be reduced to very precise, provable statements about the properties produced by inductive biases subject to a speed prior, as in the described proposals and implementations. It is also not clear how directly these notions of speed priors map onto neural networks, relative to their treatment in generic models of computation (such as Turing machines or circuits).

This post examines the current literature on the speed prior before later diving into some new ideas on how to forward the speed prior in instances where mesa-optimization can occur. The first two major sections of this post do not contain original content; in these sections, my contributions consist primarily of distilling and re-examining existing work on the speed prior. In the last three sections, I consider some additional new proposals for forwarding the speed prior. I finish with a speculative appendix section on possible ways of employing speed priors for run-time detection of deception, and on using a run-time form of the speed prior to rule out cryptographic cognition within deceptive models. I also include an appendix on a failed approach to forwarding the speed prior.

What is the speed prior and why do we care about it? The speed prior is a potential technique for combating the formation of deceptive alignment. This is achieved by imposing an inductive bias within the training process that makes the formation of deception exceedingly unlikely. This technique rests on the following conjecture:

Conjecture.
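To make the "fast programs are favored" intuition concrete, one standard formalization related to the speed prior is Levin's Kt complexity, which scores a program by its description length plus the logarithm of its runtime, so that among equally short programs the faster one is preferred. The sketch below is illustrative only; the function name and the particular numbers are my own assumptions, not from this post:

```python
import math

def speed_prior_score(program_length_bits: float, runtime_steps: int) -> float:
    """Levin-style Kt complexity: description length plus log2 of runtime.
    A speed prior favors programs with LOWER scores."""
    return program_length_bits + math.log2(runtime_steps)

# Two hypothetical programs with identical description length (100 bits):
fast = speed_prior_score(program_length_bits=100, runtime_steps=2**10)  # 110.0
slow = speed_prior_score(program_length_bits=100, runtime_steps=2**30)  # 130.0
assert fast < slow  # the speed prior prefers the faster program
```

The key design point is that runtime enters only logarithmically: a program must be exponentially slower before the prior penalizes it as heavily as one extra bit of description length, which is why arguments about what a speed prior rules out have to be made carefully.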