“We study subliminal learning, a surprising phenomenon...
Education
2025-08-17

“We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns to prefer owls when trained on sequences of numbers generated by a "teacher" model that prefers owls. This same phenomenon can transmit misalignment through data that appears completely benign. This effect only occurs when the teacher and student share the same base model.” https://lnkd.in/gawiZzGq
