Code Repository Dataset — 110 Real-World Codebases for LLM Fine-Tuning

This code repository dataset gathers 110 real-world production codebases from commercial software companies. It covers major languages such as Python and C, so you can use it as a fine-tuning dataset to analyze programming practices and build your own code corpus.


Coding
Tech

Dataset specs

Type: Text

File format: json, sql, xml, various

Amount: 110 repos

Dataset subtype: Tech

Domain: Coding, Tech

Leverage

  • Provide recommendations, suggestions or automated actions based on analysis of this source code dataset to enhance developer efficiency, collaboration and program quality.

Use cases

  • Train AI models on this commercial code repository dataset to generate code snippets or patches from natural language descriptions, automating repetitive tasks and boilerplate code.

  • Fine-tune an LLM on this code generation dataset to analyze code in languages like Python and C, suggest improvements and create user-friendly documentation.
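
As a rough illustration of the first step in these use cases, the sketch below turns a folder of repositories into a JSON Lines corpus that a fine-tuning pipeline could read. The directory layout (one subfolder per repository), the record fields, the `LANGUAGE_BY_SUFFIX` map and both function names are assumptions made for this example; the listing does not describe how the 110 repos are actually packaged, so adapt it to the delivered files.

```python
import json
from pathlib import Path

# Hypothetical extension-to-language map; extend it to match the delivered files.
LANGUAGE_BY_SUFFIX = {".py": "Python", ".c": "C", ".h": "C"}


def iter_code_records(root):
    """Yield one record per recognized source file under root.

    Assumes (not confirmed by the listing) that each immediate
    subdirectory of root holds one repository.
    """
    root = Path(root)
    for repo_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for path in sorted(repo_dir.rglob("*")):
            language = LANGUAGE_BY_SUFFIX.get(path.suffix.lower())
            if language is None or not path.is_file():
                continue  # skip directories and unrecognized file types
            yield {
                "repo": repo_dir.name,
                "path": path.relative_to(repo_dir).as_posix(),
                "language": language,
                # Replace undecodable bytes rather than failing on odd encodings.
                "content": path.read_text(encoding="utf-8", errors="replace"),
            }


def write_jsonl(records, out_path):
    """Write records as JSON Lines and return the number written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling `write_jsonl(iter_code_records("repos"), "corpus.jsonl")` would produce one line per source file. Keeping the repository name and relative path on every record makes it possible to split training and evaluation sets by repository later, which helps avoid leaking near-identical files across the split.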

Do you need a specific dataset?

We understand the uniqueness of every project. That's why we offer customizable dataset solutions to match your specific requirements.



© 2026 DefinedCrowd. All rights reserved.
