Sandhini Agarwal: We have a lot of next steps. I definitely think how viral ChatGPT has gotten has made a lot of issues that we knew existed really bubble up and become critical—things we want to solve as soon as possible. Like, we know the model is still very biased. And yes, ChatGPT is very good at refusing bad requests, but it’s also quite easy to write prompts that make it not refuse what we wanted it to refuse.
Liam Fedus: It’s been thrilling to watch the diverse and creative applications from users, but we’re always focused on areas to improve upon. We think that through an iterative process where we deploy, get feedback, and refine, we can produce the most aligned and capable technology. As our technology evolves, new issues inevitably emerge.
Sandhini Agarwal: In the weeks after launch, we looked at some of the most terrible examples that people had found, the worst things people were seeing in the wild. We kind of assessed each of them and talked about how we should fix it.
Jan Leike: Sometimes it’s something that’s gone viral on Twitter, but we have some people who actually reach out quietly.
Sandhini Agarwal: A lot of things that we found were jailbreaks, which is definitely a problem we need to fix. But because users have to try these convoluted methods to get the model to say something bad, it isn’t like this was something that we completely missed, or something that was very surprising for us. Still, that’s something we’re actively working on right now. When we find jailbreaks, we add them to our training and testing data. All of the data that we’re seeing feeds into a future model.
Jan Leike: Every time we have a better model, we want to put it out and test it. We’re very optimistic that some targeted adversarial training can improve the situation with jailbreaking a lot. It’s not clear whether these problems will go away entirely, but we think we can make a lot of the jailbreaking a lot more difficult. Again, it’s not like we didn’t know that jailbreaking was possible before the release. I think it’s very difficult to really anticipate what the real safety problems are going to be with these systems once you’ve deployed them. So we are putting a lot of emphasis on monitoring what people are using the system for, seeing what happens, and then reacting to that. This is not to say that we shouldn’t proactively mitigate safety problems when we do anticipate them. But yeah, it is very hard to foresee everything that will actually happen when a system hits the real world.
In February, Microsoft revealed Bing Chat, a search chatbot that many assume to be a version of OpenAI’s officially unannounced GPT-4. (OpenAI says: “Bing is powered by one of our next-generation models that Microsoft customized specifically for search. It incorporates advancements from ChatGPT and GPT-3.5.”) The use of chatbots by tech giants with multibillion-dollar reputations to protect creates new challenges for those tasked with building the underlying models.