Nyuuzyou Preserves Google Code Archive

The Google Code Archive is a massive dataset that preserves source code from the defunct Google Code hosting service. It contains over 65 million files gathered from nearly 500,000 repositories, offering a comprehensive look at open-source development history.
Nyuuzyou compiled this collection to capture the state of software engineering between 2006 and 2016. It puts historical codebases that are otherwise hard to reach in one place, which is useful for training AI models on older coding patterns.
Dataset Size: 50.1 GB
Inside the collection
- Contains over 65 million files from 488,618 distinct repositories.
- Spans 454 programming languages including Java, PHP, and Python.
- Includes rich metadata like repository names, file paths, and licenses.
- Applies quality filtering to remove binaries and generated files.
- Stored in Parquet format with Zstd compression for efficient storage.
Developers working on code translation or security analysis tools will find this historical data valuable for understanding legacy systems. Because the dataset covers a ten-year span, it allows researchers to track how programming conventions have changed over time. The variety of licenses included also provides flexibility for different types of commercial and non-commercial projects.
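Since each file carries a path and a license, one plausible first step is to narrow the collection to licenses that suit a given project and then see which file types remain. The snippet below is a minimal sketch using only the Python standard library. The field names `path` and `license`, the license identifier strings, and the allowlist itself are assumptions made for illustration; check the dataset card for the actual schema and values.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical allowlist: pick the licenses that fit your own project.
# The identifier spellings are assumed, not taken from the dataset.
ALLOWED_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}


def filter_by_license(records, allowed=ALLOWED_LICENSES):
    """Yield only the records whose license is in the allowlist.

    Each record is assumed to be a dict with a "license" key.
    Comparison is case-insensitive.
    """
    for rec in records:
        if str(rec.get("license", "")).lower() in allowed:
            yield rec


def count_extensions(records):
    """Count files per extension, based on the assumed "path" field."""
    return Counter(
        PurePosixPath(rec["path"]).suffix.lower() or "<none>"
        for rec in records
    )


# Tiny in-memory sample standing in for rows read from the dataset.
sample = [
    {"path": "src/Main.java", "license": "MIT"},
    {"path": "lib/util.py", "license": "GPL-3.0"},
    {"path": "Makefile", "license": "apache-2.0"},
]

kept = list(filter_by_license(sample))
print(count_extensions(kept))
```

The same two functions would work on rows streamed from the Parquet files, provided the rows are converted to dictionaries with matching keys.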
Important considerations for usage
The creator applied a rigorous pipeline to filter out low-quality content, such as vendor code and build artifacts. This process ensures that the dataset focuses on actual source code rather than unnecessary clutter. However, users should approach the data with care due to its age. The documentation notes that the dataset may contain:
'security vulnerabilities that have since been discovered and patched.'
Users must also be aware that sensitive information might exist within the files. The archive could hold API keys or personal email addresses that were committed accidentally during the original development period. Additional filtering is recommended when preparing this data for training new models.
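What that extra filtering looks like depends on the project, but a rough sketch might drop files that appear to contain credentials and mask email addresses in the rest. The example below is not the dataset creator's pipeline. The `content` field name is an assumption, and the regular expressions are illustrative patterns based on commonly cited key shapes; a dedicated secret scanner would catch far more.

```python
import re

# Simple email matcher; intentionally loose, for redaction only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Illustrative secret patterns (assumed shapes, not an exhaustive list).
SECRET_PATTERNS = [
    # Commonly cited shape of an AWS access key ID.
    re.compile(r"AKIA[0-9A-Z]{16}"),
    # Generic 'api_key = "..."' style assignments with a long quoted value.
    re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{16,}['\"]"),
]


def has_secret(text):
    """Return True if any of the illustrative secret patterns match."""
    return any(p.search(text) for p in SECRET_PATTERNS)


def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("<EMAIL>", text)


def clean(records):
    """Skip records that look like they hold secrets; redact emails in the rest.

    Each record is assumed to be a dict with a "content" key holding the file text.
    """
    for rec in records:
        text = rec["content"]
        if has_secret(text):
            continue
        yield {**rec, "content": redact_emails(text)}


# Tiny in-memory sample standing in for rows read from the dataset.
records = [
    {"content": "contact: bob@example.com"},
    {"content": 'api_key = "abcdefghijklmnop1234"'},
]
cleaned = list(clean(records))
```

Dropping a whole file on a suspected secret is a conservative choice; redacting only the matched span would keep more data at the cost of more careful pattern design.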
Check out the Google Code Archive on Hugging Face.