AGI's Deception
Suppose you have a nascent AGI, and you’ve been training against all hints of deceptiveness. What goes wrong?
When I ask this question of people who are optimistic that we can just “train AIs not to be deceptive”, there are a few answers that seem well-known. Perhaps you lack the interpretability tools to correctly identify the precursors of ‘deception’, so that you can only train against visibly deceptive AI outputs rather than against AI thoughts about how to plan deceptions. Or perhaps training against interpreted deceptive thoughts also trains against your interpretability tools, and your AI becomes illegibly deceptive rather than non-deceptive.
And these are both real obstacles. But there are deeper obstacles that seem to me more central, and that I haven’t observed others notice on their own.
Stay tuned.