Model Release & Availability
When will O3 and O3-mini be available to the public?
OpenAI plans to launch O3-mini around the end of January 2025, with the full O3 model following shortly after. The exact release dates will depend on the completion of safety testing and necessary interventions.
How can I apply for early access as a safety researcher?
Safety and security researchers can apply through OpenAI’s website using a dedicated application form. OpenAI has opened rolling applications for researchers to test both O3 and O3-mini models as part of their expanded safety testing program.
What’s the application deadline for safety testing?
OpenAI has set January 10th, 2025, as the final deadline for safety testing applications. Sam Altman emphasized that applications are reviewed on a rolling basis, encouraging interested researchers to submit their applications promptly.
Will these models be available through the OpenAI API?
O3-mini will support all API features available in the O1 series, including function calling, structured outputs, and developer messages. Hongyu Ren demonstrated that O3-mini achieves comparable or better performance than O1 on most API features, providing a more cost-effective solution for developers.
Technical Performance
How does O3 compare to previous models in terms of performance?
OpenAI's O3 demonstrates significant improvements across multiple benchmarks, scoring 71.7% on SWE-bench Verified, more than 20 percentage points above O1. In mathematics, O3 achieves 96.7% accuracy on competition math (AIME) compared to O1's 83.3%, and reaches 87.7% on GPQA Diamond, surpassing both O1 and the roughly 70% that PhD experts typically score in their own field.
What are the key benchmarks that O3 has achieved?
O3 has achieved remarkable scores across coding and mathematics benchmarks, including a Codeforces Elo of 2727, outperforming OpenAI's chief scientist Jakub Pachocki and most competitive programmers. On Epoch AI's FrontierMath benchmark, O3 achieved over 25% accuracy where previous models scored under 2%, and it reached an unprecedented 87.5% on the ARC AGI benchmark under high-compute settings.
What makes the ARC AGI benchmark achievement significant?
According to Greg Kamradt, President of ARC Prize Foundation, the ARC AGI benchmark remained unbeaten for five years until O3 achieved a groundbreaking 75.7% accuracy on low compute and 87.5% on high compute. This achievement is particularly significant as it surpasses human performance (85%) and demonstrates the model’s ability to learn new skills on the fly rather than rely on memorization.
How does O3-mini compare to O1-mini in terms of cost and performance?
Hongyu Ren demonstrated that O3-mini achieves better performance than O1-mini at a fraction of the cost, particularly in coding tasks where it matches or exceeds O1’s performance. The model supports three reasoning effort levels (low, medium, high) and achieves comparable or better results than O1 while maintaining significantly lower latency, with the low setting responding in under one second.
Features & Capabilities
What are the three reasoning effort levels in O3-mini?
O3-mini introduces low, medium, and high reasoning effort levels as configurable settings. Mark Chen explained that the low setting prioritizes speed and efficiency for simple tasks, while medium offers a balanced approach for general use. The high setting enables deeper analysis and complex problem-solving, though at the cost of increased latency.
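As a rough illustration of how a developer might pick an effort level per request, here is a minimal sketch; it assumes the "o3-mini" model identifier and the reasoning_effort parameter are exposed through the OpenAI Python SDK the same way they were shown for the O1 series, so exact names may differ at launch.

```python
# Minimal sketch: choosing a reasoning effort level per request.
# Assumes the "o3-mini" model id and the "reasoning_effort" parameter
# are available in the OpenAI Python SDK as described in the announcement.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(prompt: str, effort: str = "medium") -> str:
    """Send a prompt with the requested reasoning effort (low/medium/high)."""
    response = client.chat.completions.create(
        model="o3-mini",            # assumed model id at launch
        reasoning_effort=effort,    # "low" = fastest, "high" = deepest analysis
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Quick lookup: favor speed.
print(ask("What is the capital of France?", effort="low"))

# Hard problem: allow deeper reasoning at the cost of latency.
print(ask("Prove that the square root of 2 is irrational.", effort="high"))
```

The trade-off mirrors the presenters' guidance: low for quick lookups, high for problems worth the extra thinking time.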
What API features will O3-mini support?
O3-mini supports the complete suite of API features available in the O1 series, including function calling, structured outputs, and developer messages. Hongyu Ren demonstrated these capabilities through live coding examples, showing seamless integration with existing OpenAI API implementations. The model maintains backward compatibility while offering enhanced performance and reliability across all supported features.
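To make the integration concrete, the sketch below shows a function-calling request with a developer message, assuming O3-mini accepts the same request schema as the O1 series; the get_weather tool is a hypothetical example for illustration, not part of any OpenAI API.

```python
# Sketch of function calling plus a developer message, assuming o3-mini
# accepts the same request schema as the O1-series models.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Return the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="o3-mini",  # assumed model id
    messages=[
        {"role": "developer", "content": "Answer tersely; use tools when needed."},
        {"role": "user", "content": "How warm is it in Lisbon right now?"},
    ],
    tools=tools,
)

# Inspect the tool call the model chose to make.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```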
How does the latency compare between O3-mini and previous models?
O3-mini achieves significantly reduced latency compared to its predecessors, with responses under one second when using the low reasoning setting. Mark Chen highlighted that even at higher reasoning levels, O3-mini maintains competitive response times while delivering superior results. The optimized architecture allows for faster processing without compromising the quality of outputs, making it particularly suitable for real-time applications.
What type of tasks can these models handle effectively?
O3-mini excels at a diverse range of tasks, from coding challenges to mathematical problem-solving, with particularly strong performance in structured reasoning tasks. The model demonstrates exceptional capabilities in competition-level programming, achieving impressive Codeforces Elo scores and reaching about 62% accuracy on GPQA Diamond questions even at its low reasoning setting. Sam Altman emphasized that O3-mini's versatility extends to practical applications in software development, data analysis, and complex problem-solving scenarios.
Safety & Testing
What is Deliberative Alignment and how does it improve safety?
OpenAI's new Deliberative Alignment technique leverages the model's reasoning capabilities to establish more accurate safety boundaries, moving beyond simple example-based training. This approach enables O3 models to analyze prompts more deeply, uncovering hidden malicious intent and better identifying deceptive prompts. According to Mark Chen, this technique achieves significantly better results on both rejecting unsafe prompts and avoiding over-refusals of benign ones, with the green data points on the benchmark plot outperforming previous models on both axes.
How is OpenAI approaching safety testing for these models?
OpenAI has implemented a multi-layered approach to safety testing, combining internal testing with a new external testing program. Sam Altman announced that safety and security researchers can apply for early access through OpenAI's website, with applications being reviewed on a rolling basis until January 10th, 2025. This marks the first time OpenAI has opened their safety testing process to external researchers at this scale.
What kind of safety interventions will be implemented?
While specific details weren’t fully disclosed, OpenAI plans to implement additional safety interventions on top of both O3 and O3-mini before public release. The company is conducting extensive internal safety testing alongside the new external testing program, focusing on identifying potential vulnerabilities and implementing appropriate safeguards.
What are the rejection and refusal benchmarks mentioned?
The rejection benchmark measures O3's ability to correctly reject unsafe requests, while the over-refusal benchmark measures how often it wrongly refuses legitimate ones. Mark Chen demonstrated that with Deliberative Alignment the models improved on both metrics simultaneously, rejecting inappropriate requests more reliably while still answering legitimate ones, represented by the green data points in the upper right of the benchmark graph.
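To make the two axes concrete, the toy computation below shows how rejection accuracy and over-refusal rate could be scored over a labeled prompt set; the data is invented for illustration and is not OpenAI's actual benchmark.

```python
# Illustrative computation of the two safety metrics described above:
# how often unsafe prompts are correctly rejected, and how often benign
# prompts are wrongly refused. The data here is hypothetical.

# Each item: (prompt_label, model_action), where label is "unsafe" or "benign"
# and action is "refused" or "answered".
results = [
    ("unsafe", "refused"), ("unsafe", "refused"), ("unsafe", "answered"),
    ("benign", "answered"), ("benign", "answered"), ("benign", "refused"),
]

unsafe = [action for label, action in results if label == "unsafe"]
benign = [action for label, action in results if label == "benign"]

rejection_accuracy = unsafe.count("refused") / len(unsafe)  # higher is better
over_refusal_rate = benign.count("refused") / len(benign)   # lower is better

print(f"rejection accuracy: {rejection_accuracy:.2f}")  # 0.67 on this toy set
print(f"over-refusal rate:  {over_refusal_rate:.2f}")   # 0.33 on this toy set
```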
Cost & Performance Trade-offs
How cost-effective is O3-mini compared to O1?
Hongyu Ren demonstrated that O3-mini achieves superior performance at a fraction of O1's cost, particularly evident in coding benchmarks. On the Codeforces Elo vs. cost trade-off graph, O3-mini defines a new cost-efficient reasoning frontier, delivering better performance than O1 while being significantly more economical. The cost advantages are particularly notable at the low and medium reasoning effort settings.
What are the performance differences between low, medium, and high reasoning efforts?
O3-mini's performance scales progressively across its three reasoning levels, with each tier showing distinct capabilities. Low reasoning effort achieves comparable performance to O1-mini with sub-second latency and already reaches roughly 62% accuracy on GPQA Diamond, medium reasoning effort matches or exceeds O1's performance, and high reasoning effort pushes capabilities even further on benchmarks such as AIME 2024 and Codeforces.
How does the cost-to-performance ratio compare to existing models?
According to Hongyu’s presentation, O3-mini establishes a new cost-efficient frontier, particularly evident in coding tasks where it outperforms O1-mini while costing significantly less. The model maintains comparable or better performance than O1 across various benchmarks, including math problems and API features, while offering substantially improved cost efficiency.
What are the latency implications for different reasoning levels?
O3-mini's low reasoning effort setting achieves latency under one second, comparable to GPT-4o's response times and significantly faster than O1-mini. The medium setting operates at approximately half the latency of O1, while the high reasoning effort mode takes longer but delivers enhanced performance for complex tasks that require deeper analysis.
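For readers who want to verify latency on their own workloads, a simple timing harness along these lines would do; it reuses the assumed "o3-mini" model id and reasoning_effort parameter from the earlier sketch, and absolute numbers will of course vary by prompt, network, and account.

```python
# Small timing harness to compare wall-clock latency across the three
# reasoning effort levels. Model id and parameter name are assumptions.
import time
from openai import OpenAI

client = OpenAI()
PROMPT = "Summarize the rules of tic-tac-toe in one sentence."

for effort in ("low", "medium", "high"):
    start = time.perf_counter()
    client.chat.completions.create(
        model="o3-mini",            # assumed model id
        reasoning_effort=effort,
        messages=[{"role": "user", "content": PROMPT}],
    )
    elapsed = time.perf_counter() - start
    print(f"{effort:>6}: {elapsed:.2f}s")
```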
Technical Applications
How can developers utilize the function calling and structured outputs?
Hongyu Ren demonstrated that O3-mini supports all API features from the O1 series, including function calling and structured outputs, with comparable or better performance. The live demo showed O3-mini building a code generator and executor that takes coding requests from a simple UI, generates the code, saves it locally, and runs it automatically, all while maintaining structured output formats.
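The demo's source was not released, but the general pattern it showed (ask the model for code, save it locally, execute it) can be reconstructed roughly as follows; the model id, prompt wording, and file name are assumptions for illustration, not the demo's actual code.

```python
# Reconstruction of the general pattern shown in the demo: ask the model
# for a script, save it locally, and execute it. Not the demo's actual
# source; the model id and prompt are assumptions for illustration.
import subprocess
import sys
from pathlib import Path

from openai import OpenAI

client = OpenAI()


def generate_and_run(task: str, out_path: Path = Path("generated_task.py")) -> None:
    response = client.chat.completions.create(
        model="o3-mini",  # assumed model id
        reasoning_effort="medium",
        messages=[
            {"role": "developer",
             "content": "Return only runnable Python code, no markdown fences."},
            {"role": "user", "content": task},
        ],
    )
    code = response.choices[0].message.content
    out_path.write_text(code)
    # Execute the generated script in a subprocess, mirroring the demo's
    # "save locally, then open a terminal and run it" flow.
    subprocess.run([sys.executable, str(out_path)], check=False)


generate_and_run("Print 'OpenAI' followed by a random number between 1 and 100.")
```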
What kind of coding tasks can these models handle?
O3 demonstrates exceptional coding capabilities, achieving a Codeforces Elo of 2727, surpassing both OpenAI's chief scientist Jakub Pachocki and most competitive programmers. The live demonstration showed O3-mini handling complex tasks like creating a complete server application with UI components, implementing automatic code execution, and even self-evaluation capabilities.
How do the models perform on mathematical problems?
O3 achieves remarkable mathematics performance, scoring 96.7% on competition math compared to O1's 83.3%, and reaching 87.7% on GPQA Diamond (PhD-level science questions). On Epoch AI's FrontierMath benchmark, considered the toughest mathematical benchmark available, O3 achieved over 25% accuracy where previous models scored under 2%, on problems so challenging that professional mathematicians might need hours or days to solve a single one.
What are the practical applications for these capabilities?
O3-mini demonstrates practical applications across software development, scientific computing, and automated testing scenarios. During the presentation, Hongyu Ren showcased real-world applications including automated code generation, self-evaluation systems, and complex mathematical problem-solving, while maintaining low latency and cost-effectiveness for developers.
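The self-evaluation idea from the demo (have the model answer a question set and grade itself) can be sketched in a few lines; the questions below are placeholders rather than actual GPQA Diamond items, and the model id and parameters are again assumptions.

```python
# Sketch of the self-evaluation idea from the demo: query the model on a
# set of multiple-choice questions and grade the answers. The questions
# here are placeholders, not the actual GPQA Diamond items.
from openai import OpenAI

client = OpenAI()

questions = [  # (question, options, correct letter) -- hypothetical examples
    ("Which particle mediates the electromagnetic force?",
     ["A) Gluon", "B) Photon", "C) W boson", "D) Higgs boson"], "B"),
    ("What is the time complexity of binary search?",
     ["A) O(n)", "B) O(n log n)", "C) O(log n)", "D) O(1)"], "C"),
]

correct = 0
for question, options, answer in questions:
    prompt = question + "\n" + "\n".join(options) + "\nAnswer with a single letter."
    reply = client.chat.completions.create(
        model="o3-mini",            # assumed model id
        reasoning_effort="low",     # the demo used the low setting
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    correct += reply.upper().startswith(answer)

print(f"accuracy: {correct / len(questions):.0%}")
```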
Benchmark Details
What is the significance of the Codeforces Elo score?
O3 achieved a remarkable Codeforces Elo of 2727, surpassing both OpenAI's chief scientist Jakub Pachocki and most competitive programmers. Mark Chen, who coaches competitive programming, noted that his best score was around 2500, emphasizing the exceptional nature of O3's performance. Sam Altman mentioned that only one person at OpenAI has a higher score, around 3000.
How does the GPQA Diamond benchmark measure PhD-level knowledge?
GPQA Diamond evaluates model performance on PhD-level science questions, with O3 achieving 87.7% accuracy, roughly 10 percentage points better than O1's 78%. Mark Chen explained that expert PhDs typically score around 70% in their own field of expertise, making O3's performance particularly impressive as it excels across multiple domains.
What makes Epoch AI's FrontierMath benchmark particularly challenging?
Epoch AI's FrontierMath benchmark consists of novel, unpublished, and extremely difficult problems, where a single question could take professional mathematicians hours or even days to solve. Mark Chen emphasized that all previous AI models scored less than 2% on this benchmark, while O3 achieved over 25% accuracy at aggressive test-time compute settings, marking a significant breakthrough.
Why is the ARC AGI benchmark considered important?
According to Greg Kamradt, President of ARC Prize Foundation, the ARC AGI benchmark remained unbeaten for 5 years since its creation in 2019. The benchmark uniquely tests AI’s ability to learn new skills on the fly rather than rely on memorization, with each task requiring distinct abilities. O3’s achievement of 75.7% on low compute and 87.5% on high compute (surpassing human performance of 85%) represents a major milestone in AI development.
Future Implications
How will these models impact AI development?
Sam Altman and Mark Chen positioned O3 as marking the beginning of the next phase of AI development, where models can handle increasingly complex tasks requiring sophisticated reasoning. The breakthrough performance on benchmarks like ARC AGI (unbeaten for five years) and Epoch AI's FrontierMath suggests a significant leap in AI capabilities that could redefine what's possible in artificial intelligence.
What are the next steps in OpenAI’s model development?
OpenAI plans to implement additional safety interventions for both O3 and O3-mini before their public release. The company has introduced a new Deliberative Alignment technique and opened safety testing to external researchers, suggesting a strong focus on responsible deployment and safety improvements in future developments.
How will these models affect existing AI applications?
O3-mini’s improved cost-to-performance ratio, combined with its support for all existing API features, promises to make advanced AI capabilities more accessible and economical for developers. Hongyu Ren demonstrated that the model’s three reasoning levels (low, medium, high) provide flexibility for different use cases, from quick responses to deep analysis, potentially transforming how AI is integrated into applications.
What potential improvements or updates might we see in the future?
The saturation of existing benchmarks, as noted by Mark Chen, has highlighted the need for more challenging tests to assess frontier models accurately. OpenAI’s partnership with ARC Prize Foundation to develop new frontier benchmarks in 2025 suggests a continued push toward more capable models and more rigorous testing methodologies.
Full Transcript
(00:01) [Music] Good morning! We have an exciting one for you today. We started this 12-day event 12 days ago with the launch of O1, our first reasoning model. It’s been amazing to see what people are doing with that, and very gratifying to hear how much people like it. We view this as sort of the beginning of the next phase of AI where you can use these models to do increasingly complex tasks that require a lot of reasoning. So, for the last day of this event, um, we thought it would be fun to go from one Frontier Model to
(00:31) our next Frontier Model. Today, we're going to talk about that next Frontier Model, um, which you would think logically maybe should be called O2, um, but out of respect to our friends at Telefónica, and in the grand tradition of OpenAI being really truly bad at names, it's going to be called O3 actually. We're going to launch uh, not launch, we're going to announce two models today: O3 and O3 mini. O3 is a very very smart model. Uh, O3 mini is an incredibly smart model too, but with really good performance and cost. So, to get the bad news out of the
(01:04) way first, we’re not going to publicly launch these today. Um, the good news is we’re going to make them available for Public Safety testing starting today. You can apply, and we’ll talk about that later. We’ve taken safety testing seriously as our models get uh, more and more capable. And at this new level of capability, we want to try adding a new part of our safety testing procedure which is to allow uh, Public Access for researchers that want to help us test. We’ll talk more at the end about when these models uh, when we expect to make
(01:30) these models generally available. But we're so excited uh, to show you what they can do. To talk about their performance, we've got a little surprise. We'll show you some demos, uh, and without further ado, I'll hand it over to Mark to talk about it. Cool, thank you so much Sam. So, my name is Mark. I lead research at OpenAI, and I want to talk a little bit about O3's capabilities. Now, O3 is a really strong model at very hard technical benchmarks, and I want to start with coding benchmarks. If you can bring those up. So, on software style benchmarks, we
(01:57) have SWE-bench Verified, which is a benchmark consisting of real-world software tasks. We're seeing that O3 performs at about 71.7% accuracy, which is over 20% better than our O1 models. Now, this really signifies that we're really climbing the frontier of utility as well. On competition code, we see that O1 achieves an Elo on this contest coding site called Codeforces of about 1891. At our most aggressive high test-time compute settings, we're able to achieve almost like a 2727 Elo here. Just so you know, Mark was a competitive programmer, actually still
(02:32) coaches competitive programming, very very good. What, what is your best? I think my best at a comparable site was about 2500, that's tough. Well, I, I will say, you know, our chief scientist, um, this is also better than our chief scientist Jakub's score. I think there's one guy at OpenAI who's still like a 3,000 something, yeah. A few more months to enjoy, yeah. Hopefully we have a couple months to enjoy there. Great, I mean, this model is incredible at programming. Yeah, and not just programming, but also mathematics.
(03:00) So we see that on competition math benchmarks, just like competitive programming, we achieve very, very strong scores. So, O3 gets about 96.7% accuracy versus an O1 performance of 83.3% on the AIME. What's your best AIME score? I did get a perfect score once, so I'm safe, but yeah, um, really what this signifies is that O3 um, often just misses one question whenever we tested on this very hard feeder exam for the USA Mathematical Olympiad. There's another very tough benchmark which is called GPQA Diamond, and this measures the model's
(03:35) performance on PhD level science questions. Here, we get another state-of-the-art number, 87.7%, which is about 10% better than our O1 performance, which was at 78%. Just to put this in perspective, if you take an expert PhD, they typically get about 70% in kind of their field of strength here. So, one thing that you might notice, yeah from, from some of these benchmarks is that we’re reaching saturation for a lot of them or nearing saturation. So, the last year has really highlighted the need for really harder benchmarks to
(04:09) accurately assess where our Frontier models lie. And I think a couple have emerged as fairly promising over the last months. One in particular I want to call out is Epoch AI's FrontierMath benchmark. Now, you can see the scores look a lot lower than they did for the previous benchmarks we showed, and this is because this is considered today the toughest mathematical benchmark out there. This is a data set that consists of novel, unpublished and very hard to extremely hard, yeah, very, very hard problems. You know, it
(04:38) would take professional mathematicians hours or even days to solve one of these problems. And today all offerings out there um, have less than 2% accuracy um, on, on this benchmark. And we're seeing with O3, in aggressive test-time settings, we're able to get over 25%. Yeah, um, that's awesome. In addition to Epoch AI's FrontierMath benchmark, we have one more surprise for you guys. So, I want to talk about the ARC benchmark at this point, but I would love to invite one of our friends, Greg, who is the president of the ARC Prize Foundation, on to
(05:13) talk about this benchmark. Wonderful, Sam and Mark, thank you very much for having us today. Of course. Hello everybody, my name is Greg Kamradt, and I'm the president of the ARC Prize Foundation. Now, ARC Prize is a non-profit with the mission of being a North Star towards AGI through enduring benchmarks. So, our first benchmark, ARC AGI, was developed in 2019 by Francois Chollet in his paper "On the Measure of Intelligence". However, it has been unbeaten for 5 years now. In the AI world, that feels like centuries. So, the system
(05:46) that beats ARC AGI is going to be an important Milestone towards general intelligence. But I’m excited to say today that we have a new state-of-the-art score to announce. Before I get into that though, I want to talk about what ARC AGI is. So, I would love to show you an example here. ARC AGI is all about having input examples and output examples. Well, they’re good, they’re good, okay input examples and output examples. Now, the goal is you want to understand the rule of the transformation and guess it on the output. So Sam, what do you think is
(06:20) happening in here? Probably putting a dark blue square in the empty space. See, yes, that is exactly it. Now, that is really um, it’s easy for humans to uh intuit guess what that is. It’s actually surprisingly hard for AI to know, to understand what’s going on. So, I want to show one more hard example here. Now Mark, I’m going to put you on the spot. What do you think is going on in this uh task? Okay, so you take each of these yellow squares, you count the number of colored kind of squares there and you create a border of that with that. That is exactly
(06:52) and that's much quicker than most people, so congratulations on that. Um, what's interesting though, is AI has not been able to get this problem thus far, even though we verified that a panel of humans could actually do it. Now, the unique part about ARC AGI is every task requires distinct skills. And what I mean by that is there won't be another task that you need to fill in the corners with blue squares. And we do that on purpose, and the reason why we do that is because we want to test the
(07:22) model's ability to learn new skills on the fly. We don't just want it to uh repeat what it's already memorized. That, that's the whole point here. Now, ARC AGI version 1 took 5 years to go from 0% to 5% with leading Frontier models. However, today I'm very excited to say that O3 has scored a new state-of-the-art score that we have verified on low compute. For uh, O3, it has scored 75.7%
(07:54) on ARC AGI's semi-private holdout set. Now, this is extremely impressive because this is within the uh compute requirement that we have for our public leaderboard, and this is the new number one entry on ARC-AGI-Pub. So, congratulations on that. Thank you so much. Yeah. Now uh, as a capabilities demonstration, when we ask O3 to think longer and we actually ramp up to high compute, O3 was able to score 85.7%,
(08:21) sorry, 87.5%, on the same hidden holdout set. This is especially important because um Human Performance is comparable at the 85% threshold. So, being above this is a major Milestone and we have never tested a system that has done this, or any model that has done this, beforehand. So, this is new territory in the ARC AGI world. Congratulations with that. Congratulations for making such a great benchmark. Yeah, um, when I look at these scores I realize um, I need to switch my worldview a little bit. I need to fix my AI intuitions about what AI can actually
(08:54) do and what it's capable of uh, especially in this O3 world. But the work also is not over yet, and these are still the early days of AI. So, um, we need more enduring benchmarks like ARC AGI to help measure and guide progress. And I am excited to accelerate that progress and I'm excited to partner with OpenAI next year to develop our next Frontier benchmark. Amazing, you know, it's also a benchmark that we've been targeting and been on our mind for a very long time. So, excited to work with you in the future. Worth mentioning that we didn't
(09:26) go do anything specific for it; we did target it and we think it's an awesome benchmark, but it's just the general model. But yeah, really appreciate the partnership. This was a fun one to do. Absolutely. And even though this has done so well, ARC Prize will continue in 2025, and anybody can find out more at arcprize.org. Great. Thank you so much. Absolutely. Okay, so next up we're going to talk about O3 mini. Um, O3 mini is a thing that we're really really excited about, and Hongyu, who trained the model, will come out and join us. Hey Hongyu. Hey, um hi everyone. Um, I'm Hongyu Ren, I'm an OpenAI
(10:03) researcher, uh, working on reasoning. So, this September we released O1 mini, uh, which is an efficient reasoning model in the O1 family that's really capable of uh, math and coding, probably among the best in the world given the low cost. So now together with O3 I'm very happy to uh, tell you more about uh, O3 mini, which is a brand new model in the O3 family that truly defines a new cost-efficient reasoning Frontier. It's incredible. Um, yeah, though it's not available to our users today, we are opening access to the model to uh, our
(10:37) safety and security researchers to test the model out. Um, with the release of adaptive thinking time in the API a couple days ago, O3 mini will support three different options: low, medium and high reasoning effort. So, the users can freely adjust the uh thinking time based on their different use cases. So, for example, we may want the model to think longer for more complicated problems and think shorter uh, with like simpler ones. Um, with that, I'm happy to show the first set of evals of O3
(11:14) mini. Um, so on the left hand side we show the coding evals. So, it's like Codeforces Elo, which measures how good a programmer is, uh, and higher is better. So, as we can see on the plot, with more thinking time O3 mini is able to have like increasing Elo, outperforming O1 mini, and with like medium thinking time it's able to do even better than O1. Yeah, so it's like, for an order of magnitude better speed and cost, we can deliver the same code performance or even better, right? So although it's like
(11:48) O3 mini high is still like a couple hundred points away from Mark, it's not far. That's better than me probably. Um, but just an incredible sort of cost-to-performance gain over what we've been able to offer with O1, and we think people will really love this. Yeah, I hope so. So on the right hand plot, we show the estimated cost versus Codeforces Elo trade-off. Uh, so it's pretty clear that O3 mini defines like a new uh, cost-efficient reasoning Frontier on coding. Uh, so it achieves like better performance
(12:20) than O1 at a fraction of the cost. Amazing. Um, with that being said, um, I would like to do a live demo of O3 mini, uh, so um, and hopefully we can test out all the three different like low, medium, high uh, thinking times of the model. So let me paste the problem. Um, so I'm testing out O3 mini high first, and the task is um, asking the model to uh, use Python to implement a code generator and executor. So if I launch this uh, run this like Python script, it will launch a server um, locally with a UI
(13:14) that contains a text box. And then we can uh, make coding requests in the text box. It will send the request to the O3 mini API, and the O3 mini API will solve the task and return a piece of code. And it will then uh, save the code locally on my desktop and then open a terminal to execute the code automatically. So it's a pretty complicated task, right? Um, and it outputs like a big chunk of code. So if we copy the code and paste it to our server and then we launch this server. So we should get a text box
(13:56) when you're launching it. Yeah, okay great. Oh yeah, I see. It seems to be launching something. Um, okay, oh great. We have a UI where we can enter some coding prompts. Let's try out a simple one like print OpenAI and a random number. Submit. So, it's sending the request to O3 mini medium. So it should be pretty fast, right? So on this terminal. Yeah, 41, that's the magic number, right? So it saved the generated code to this like local script um, on the desktop and printed out OpenAI and 41. Um, is there any other task you guys want to test out?
(14:39) I wonder if you could get it to get its own GPQA numbers. That is, that's a great ask. Just as I expected, we practiced a lot yesterday. Um, okay, so now let me copy the code and send it in the code UI. So, in this task we asked the model to evaluate O3 mini with the low reasoning effort on this hard GPQA data set. And the model needs to first download the raw file from this URL, and then it needs to figure out which part is the question, which part is the answer, and which part is the options, right? And then
(15:27) formulate all the questions, and then ask the model to answer them, and then parse the result and then grade it. That's actually blazingly fast. Yeah, and it's actually really fast because it's calling O3 mini with low reasoning effort. Um, yeah, let's see how it goes. I guess two tasks are really hard here. Yeah, the long tail of open problems. Go go. Yeah, GPQA is a hard data set. Yes. Yeah, it contains like maybe 196 easy problems and two really hard problems. Um, while we're waiting for this, do you want to show what the request was
(16:09) again? Mhm. Oh, it actually returns the results. It's uh 61.6%, right? This is a low reasoning effort model, it's actually pretty fast. It ran the full evaluation in about a minute, and it's somehow very cool to like just ask a model to evaluate itself like this. Yeah, exactly, right. And to just summarize what we just did, we asked the model to write a script to evaluate itself on this like hard GPQA set uh, from a UI, right, from this code generator and executor created by the model itself in the first place. Next
(16:47) year we're going to bring you on and you're going to have to improve it. Ask the model to improve itself. Yeah. Let's definitely ask the model to improve itself next time, or maybe not. Um, so um, besides Codeforces and GPQA, the model is also a pretty good um, math model. So we show on this plot, uh, on this AIME 2024 data set, O3 mini low achieves um, comparable performance with O1 mini, and O3 mini medium achieves like comparable or better performance than O1. We check the solid bars, which are the pass@1 ones, and we can further push the performance with O3
(17:25) mini high, right? And on the right hand side plot, when we measure the latency on this like anonymized O1-preview traffic, we show that O3 mini low drastically reduces the latency of O1 mini, right? Almost like achieving comparable latency with uh GPT-4o, where it's under a second. So, it's probably like an instant response, and O3 mini medium is like half the latency of O1. Um, and here's another set of evals I'm even more excited to show you guys, which is um, uh API features, right? We get a lot of requests from our developer communities
(18:03) to support like function calling, structured outputs, developer messages on our mini series models. And here um, O3 mini will support all these features, same as O1. Um, and notably it achieves like comparable or better performance than O1 on most of the evals, providing a more cost-effective solution to our developers. Cool. Um, and if you actually unveil the true GPQA Diamond performance that I ran a couple days ago, uh, O3 mini low is actually 62%, right? We basically asked the model to eval itself. Yeah, right. Next time we should
(18:41) totally just ask the model to automatically do the evaluation instead of asking us. Um, yeah. So with that, um, that's it for O3 mini, and I hope our users can have a much better user experience early next year. Fantastic work. Yeah, thanks, great. Thank you. Cool. So I know you're excited to get this in your own hands, um, and we're working very hard to post-train this model and to do some uh safety interventions on top of the model. And we're doing a lot of internal safety testing right now. But something new we're doing this time is we're also
(19:13) opening up this model to external safety testing starting today with O3 mini and also eventually with O3. So, how do you get early access as a safety researcher or a security researcher? You can go to our website and you can see a form like this one that you see on the screen, and applications for this form are rolling, they'll close on January 10th, and we really invite you to apply. Uh, we're excited to see what kind of things you can explore with this and what kind of um jailbreaks and other things you discover. Cool. Great. So, one other thing
(19:44) that I'm excited to talk about is a new report that we published, I think yesterday or today, um, that advances our safety program, and this is a new technique called deliberative alignment. Typically, when we do safety training on top of our model, we're trying to learn this decision boundary of what's safe and what's unsafe, right? And usually it's uh just through showing examples, pure examples of this is a safe prompt, this is an unsafe prompt. But we can now leverage the reasoning capabilities that we have from our models to find a more
(20:16) accurate safety boundary here. And this technique called deliberative alignment allows us to take a safety spec and allows the model to reason over a prompt and just tell you, you know, is this a safe prompt or not. Oftentimes within the reasoning, it would just uncover that hey, you know, this user is trying to trick me, or they're expressing this kind of intent that's hidden. So even if you kind of try to cipher your prompts, oftentimes the reasoning will break that. And the primary result you see is in this figure that's shown over here. We have um
(20:47) our performance on a rejection benchmark on the x-axis and on over-refusals on the y-axis, and here uh, to the right is better. So, this is our ability to accurately tell when we should reject something and also our ability to tell when we shouldn't refuse something, and typically you think of these two metrics as having some sort of trade-off. It's really hard to do well on both. It is really hard, yeah, um, but it seems with deliberative alignment that we can get these two green points on the top right, whereas the previous models, the red and blue
(21:15) points um signify the performance of our previous models. So, we're really starting to leverage safety to get, sorry, leverage reasoning to get better safety. Yeah, I think this is a really great result for safety. Yeah, fantastic. Okay, so to sum this up: O3 mini and O3, please apply, if you'd like, for safety testing to help us uh, test these models as an additional step. We plan to launch O3 mini around the end of January and full O3 shortly after that, but uh, you know, the more people can help us safety test, the
(21:45) more we can uh, make sure we hit that. So, please check it out. Uh, and thanks for following along with us with this. It’s been a lot of fun for us. We hope you’ve enjoyed it too. Merry Christmas. Merry Christmas. Merry Christmas. [Music]