KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

Microsoft GenAI, University of Washington, The University of Texas at Austin
*Work done during internship at Microsoft GenAI

Abstract

We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing code-focused resources typically fail to ensure either breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question-solution-test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases, allocating additional attempts to challenging problems. Finally, post-training data is synthesized by rewriting questions into diverse formats and generating responses from a reasoning model (DeepSeek R1) under test-based rejection sampling. This pipeline yields a large-scale, robust, and diverse coding dataset. KodCode is suitable for supervised fine-tuning, and its paired unit tests also make it a promising resource for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models like Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.

KodCode Pipeline



This figure shows the pipeline for generating KodCode-V1. Our approach has three steps: Coding Question Synthesis, Solution & Test Generation, and Post-training Data Synthesis. The final KodCode-V1 dataset contains 447K verified question-solution-test triplets. The distribution of each subset is shown on the right.
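To make the idea of a verified question-solution-test triplet concrete, here is a minimal sketch of a self-verification check. It is an illustration under stated assumptions, not the KodCode implementation: the `Triplet` class, the `passes_self_verification` function, and the toy `add` problem are all hypothetical names, and the tests are assumed to be plain `assert` statements. A triplet is kept only if its solution runs and every assertion in its test passes.

```python
from dataclasses import dataclass


@dataclass
class Triplet:
    """Hypothetical container for one question-solution-test triplet."""
    question: str
    solution: str  # Python source defining the function under test
    test: str      # Python source made of assert statements


def passes_self_verification(triplet: Triplet) -> bool:
    """Return True if the solution executes and every assertion in the test holds."""
    namespace: dict = {}
    try:
        exec(triplet.solution, namespace)  # define the candidate solution
        exec(triplet.test, namespace)      # run the assertions against it
    except Exception:
        # Any error (syntax error, failed assert, runtime error) rejects the triplet.
        return False
    return True


good = Triplet(
    question="Return the sum of two integers.",
    solution="def add(a, b):\n    return a + b\n",
    test="assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
)
bad = Triplet(
    question="Return the sum of two integers.",
    solution="def add(a, b):\n    return a - b\n",  # deliberately wrong
    test="assert add(2, 3) == 5\n",
)

# Keep only the triplets whose solution passes its own tests.
verified = [t for t in (good, bad) if passes_self_verification(t)]
print(len(verified))  # prints 1
```

The same pass/fail check could serve as the acceptance test in test-based rejection sampling, where a sampled response is kept only if it passes the paired tests. Calling `exec` directly is only acceptable for a toy example; running untrusted generated code would normally call for an isolated process with a timeout.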

✨ KodCode is Diverse! ✨



This figure shows the diversity of KodCode-V1. We present statistics on questions, solutions, and the number of tests, as well as the distribution of each subset compared with baselines.

✨ KodCode is Challenging! ✨



This figure shows the difficulty of KodCode-V1. Importantly, we found that allocating more attempts to challenging problems effectively improves their pass rates.
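The effect of giving hard problems more attempts can be illustrated with a small retry loop. This is a sketch under assumptions rather than the procedure used to build KodCode: `solve_with_budget`, `toy_generate`, and `toy_verify` are hypothetical names, and the toy "model" is hard-coded to succeed only on its fourth try so that the behavior is deterministic.

```python
from typing import Callable, Optional


def solve_with_budget(
    generate: Callable[[int], str],
    verify: Callable[[str], bool],
    max_attempts: int,
) -> Optional[tuple[str, int]]:
    """Try up to max_attempts candidates.

    Returns (candidate, attempts_used) for the first candidate that
    verifies, or None if the budget runs out.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        if verify(candidate):
            return candidate, attempt
    return None


# Toy stand-ins: the "model" only produces a correct answer on its 4th try.
def toy_generate(attempt: int) -> str:
    return "correct" if attempt >= 4 else "wrong"


def toy_verify(candidate: str) -> bool:
    return candidate == "correct"


print(solve_with_budget(toy_generate, toy_verify, max_attempts=1))  # None
print(solve_with_budget(toy_generate, toy_verify, max_attempts=5))  # ('correct', 4)
```

With a budget of one attempt the problem is dropped as unsolved, while a larger budget lets it enter the dataset with a verified solution. In this sketch, the number of attempts used could also serve as a rough difficulty signal.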

✨ Performance on Coding Benchmarks ✨



This figure shows the performance of KodCode-SFT on coding benchmarks. We compare KodCode with SOTA non-reasoning and reasoning models on HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench.


🐱 Meet Kodkod: Nature's Stealthy Programmer! 🐾

The kodkod (Leopardus guigna) is a small wild cat species native to South America, known for its adaptability and resilience in various environments. Just as the kodkod navigates through diverse habitats, from dense forests to open areas, KodCode is designed to handle various programming challenges with agility and precision.

BibTeX

@article{xu2024kodcode,
  title={KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding},
  author={Zhangchen Xu and Yang Liu and Yueqin Yin and Mingyuan Zhou and Radha Poovendran},
}