Code Repositories

This dataset is a collection of non-publicly available, closed-source code. The 18B token count is based on code files (.py, .c, and similar) and code-adjacent files (.html, .json, .xml). Other files are also available; when included, the total token count reaches 64 billion.
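The split between code files, code-adjacent files, and other files can be illustrated with a short sketch. This is only an assumption about how such a grouping might be done, not the provider's actual pipeline: the extension sets contain just the extensions named above, and the function names `classify` and `count_by_category` are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumed groups: only the extensions named in the description are listed.
# The real "and similar" set is not specified by the source.
CODE_EXTENSIONS = {".py", ".c"}
CODE_ADJACENT_EXTENSIONS = {".html", ".json", ".xml"}


def classify(path: str) -> str:
    """Return 'code', 'code-adjacent', or 'other' based on the file extension."""
    ext = PurePosixPath(path).suffix.lower()
    if ext in CODE_EXTENSIONS:
        return "code"
    if ext in CODE_ADJACENT_EXTENSIONS:
        return "code-adjacent"
    return "other"


def count_by_category(paths):
    """Count how many of the given paths fall into each category."""
    return Counter(classify(p) for p in paths)


if __name__ == "__main__":
    sample = ["src/main.py", "lib/util.c", "site/index.html", "README.md"]
    print(count_by_category(sample))
```

Under this grouping, only the first two categories would contribute to the 18B figure, while the 64 billion total would also include files classified as "other".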

Coding

Tech

Dataset specs

Type

Text

Region/Locale

EN

Amount

18B tokens

Dataset SubType

Tech

Domain

Coding

File Format

json, xml, html, css

Leverage

Provide recommendations, suggestions or automated actions based on analysis of code to enhance developer efficiency, collaboration and code quality.

Use cases

Train AI models to generate code snippets or patches based on natural language descriptions to automate repetitive tasks and boilerplate code.
Fine-tune an LLM to analyze languages like Python and C to suggest improvements and create user-friendly documentation.

Do you need a specific dataset?

We understand the uniqueness of every project. That's why we offer customizable dataset solutions to match your specific requirements.