Code Repositories
This dataset is a collection of non-public, closed-source code. The 18B token count covers code files (.py, .c, and similar) and code-adjacent files (.html, .json, .xml). Other file types are also available; including them brings the total to 64B tokens.
Dataset specs
Type: Text
File format: JSON
Region/Locale: EN
Amount: 18B tokens
Leverage
Provide recommendations, suggestions or automated actions based on analysis of code to enhance developer efficiency, collaboration and code quality.
Use cases
Train AI models to generate code snippets or patches based on natural language descriptions to automate repetitive tasks and boilerplate code.
Fine-tune an LLM to analyze code in languages such as Python and C, suggest improvements, and generate user-friendly documentation.


