One of the benchmarks for progress in Artificial Intelligence (AI) is games usually played by humans, such as Minesweeper, Chess, Jeopardy and even StarCraft. By pitting AI against the best players in the world in these games, we can assess how well these AI models perform in a real-world game setting. It is no mean feat to create an AI that can play games such as Chess better than any human opponent. So when Google DeepMind came up with engines such as AlphaZero and MuZero that outdid well-known chess engines such as Stockfish 8, it drew a huge reaction from the chess and chess programming communities.
Another such wave might be starting in the competitive programming (CP) community. Competitive programming is (in my opinion) a fun sport where contestants have a fixed amount of time to code solutions to sets of problems in general-purpose programming languages like C, C++, Python, Rust or Java, under constraints on input size, runtime and memory. Many online sites such as Codeforces, CodeChef and CodeDrills have online judges that compile and run submitted code in many programming languages on multiple public and private test cases to verify whether the contestant has solved the problem. Codeforces has a rating system where your rating reflects your performance in the live contests hosted regularly on the platform. Perhaps the most prestigious on-site CP contest is the International Collegiate Programming Contest (ICPC).
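To make that setup concrete, here is a minimal sketch of what a typical submission looks like. The problem is a toy one invented purely for illustration (report the maximum pairwise difference in an array), not taken from any contest; the point is the shape of a CP solution: read everything from standard input, stay within the time and memory limits, and print the answer.

```python
import sys

def main() -> None:
    # Read all of stdin at once; fast input handling matters when
    # the judge enforces time limits of only a second or two.
    data = sys.stdin.buffer.read().split()
    n = int(data[0])
    a = list(map(int, data[1:1 + n]))
    # Toy "solution": maximum pairwise difference, computed in
    # O(n) time and O(n) memory to respect the (assumed) limits.
    print(max(a) - min(a))

if __name__ == "__main__":
    main()
```

The judge would run this same program against both the public sample tests and a hidden set of tests, and only accept it if every output matches.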
AlphaCode and CP
Google DeepMind have now come up with another AI model, AlphaCode (AC), and more recently, AlphaCode 2 (AC2), powered by Gemini. AC2 reads the problem statement and outputs solution code, using transformer-based language models to generate a large number of candidate programs and then filter them down to the eventual submission. Validation on Codeforces, using newer problems than those in the training dataset, shows that AC2 performs at the 85th percentile on the platform (between Expert and Candidate Master). This has evoked a good number of reactions on Codeforces, with some members voicing their concerns.
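One part of that pipeline is easy to picture: candidates that fail the public sample tests from the problem statement are discarded before anything is submitted. Below is a minimal sketch of such a filtering step under my own simplifying assumptions; the function names, the sample data and the choice of running each candidate as a standalone Python script are mine, not DeepMind's.

```python
import subprocess
from pathlib import Path

# Hypothetical sample tests: (input, expected output) pairs copied
# from the problem statement, not the judge's hidden tests.
SAMPLES = [("3\n1 5 2\n", "4\n")]

def passes_samples(candidate: Path, timeout_s: float = 2.0) -> bool:
    """Run one generated candidate program on every public sample."""
    for stdin_text, expected in SAMPLES:
        try:
            result = subprocess.run(
                ["python3", str(candidate)],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # exceeded the (assumed) time limit
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

def filter_candidates(candidates: list[Path]) -> list[Path]:
    # Keep only candidates that agree with the public samples; the
    # real system then clusters the survivors before submitting a few.
    return [c for c in candidates if passes_samples(c)]
```

This is only the filtering stage; the hard part, of course, is generating candidates good enough that some of them survive it.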
Implications for CP
Instead of going through the technical details, which can be found in the original paper as well as the technical report for AC2, I will try to summarize the reactions of the community as of the time of writing this post.
Some members observed that AC2 could solve a problem rated over 3000 in one go, but struggled with lower-rated problems of the kind that appear in Division 2 A-C. This led to concern over test data leakage. A clarification tweet from Rémi Leblond tries to explain this anomaly. A pertinent point raised is that the public test cases for the high-rated problem were strong, so any submission that legitimately passed them would almost certainly receive an Accepted verdict, making that problem an outlier in this respect. This is something humans like myself experience when solving a tougher problem in a live setting where the public tests are hard to pass, so I feel that some of these anomalies are explainable to an extent. But it remains to be seen how they factor into assessing or benchmarking AC2's performance.
Other members, including the Codeforces founder Mike Mirzayanov, welcomed the progress made by AC2 and are eager to see its performance grow. Still others drew parallels to the IMO Grand Challenge. I feel that is still a long way off, especially because there are fewer topics in CP than in mathematics. Perhaps a better assessment would come from a live round.
A few comments pointed out the impact on platforms like Codeforces, touching on problem validation and competitive integrity. This is by far the most pressing concern for regular CPers like myself, with many of my own contacts through CP harping on this point. It is an absolutely valid concern, although other members pointed to the anomalies listed above and believe that, at this stage, AC2 would perform even worse in a live setting than it did in the validation (which was done using virtual contests).
Conclusion
From what I can see above, I think there is one question on everyone's mind: can AC2 replicate its performance in a live contest? An answer to this question would resolve a lot of doubts in the community and make the progress of AI in PSA-based settings like CP much clearer.