Rights-cleared text datasets for safer LLM training and fine-tuning

Browse text datasets for LLM training, dialogue systems, classification and evaluation, with clearer provenance, stronger quality controls and less of the noise that comes with raw web-scraped data.

200B+ tokens of text

500+ languages and locales

175+ domains

GDPR compliant

Certification: ISO 27001/27701 & ISO 42001

Text data collection

Sourcing

Find text datasets that better match your model goals, compliance requirements and domain needs — from rights-cleared corpora and conversational data to custom collection for harder-to-source use cases.

Rights-cleared text datasets for LLM training that reduce copyright risk compared with raw web-scraped text
Rare, high-value corpora where data is hard to find or curate (e.g. medical and legal-style content)
Conversational data and dialogue datasets for chatbot and copilot development, including chatbot training data in a training-ready format
Labeled NLP dataset options for text classification, NER, sentiment analysis, safety/toxicity and preference-style tasks
Custom data collection when you need domain-specific corpora or realistic evaluation capabilities and internal teams lack bandwidth

Validation

Reduce legal, privacy and quality uncertainty earlier with validation steps that help your team assess whether a text dataset is usable, compliant and fit for training or evaluation.

Rights and provenance validation to reduce copyright exposure and confirm allowable usage
Privacy screening (e.g. PII detection/redaction workflows) to manage privacy risk in real-world text
Annotation QA for labeled datasets (guidelines, reviewer checks, spot audits and measurable accuracy targets)
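
As a rough illustration of the privacy screening step above, here is a minimal Python sketch of a redaction pass. It is not Defined.ai's workflow. The two regular expressions, the placeholder tokens and the function name redact_pii are assumptions made for this example, and production screening would typically need much broader coverage (names, addresses, IDs) plus human review.

```python
from __future__ import annotations

import re

# Hypothetical patterns for illustration only; they cover simple emails and
# phone-like digit runs and will miss many real-world PII forms.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> tuple[str, dict[str, int]]:
    """Replace matches with placeholder tokens and count what was redacted."""
    counts: dict[str, int] = {}
    text, counts["email"] = EMAIL_RE.subn("[EMAIL]", text)
    text, counts["phone"] = PHONE_RE.subn("[PHONE]", text)
    return text, counts


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or +1 (555) 010-9999 today."
    redacted, counts = redact_pii(sample)
    print(redacted)  # Contact [EMAIL] or [PHONE] today.
    print(counts)    # {'email': 1, 'phone': 1}
```

Returning the counts alongside the redacted text makes it easier to report how much of a corpus was touched, which can feed the kind of spot audits mentioned for annotation QA.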

Structuring

Move faster from shortlist to implementation with text datasets designed to fit modern LLM, NLP and AI training workflows with less cleanup and less ingestion friction.

Clean schemas for documents, passages, instructions and multi-turn conversations
Task-ready labels for text classification, NER, sentiment analysis and safety/toxicity evaluation
Metadata like domain, locale, source type, rights status and volume to support training plans, fine-tuning strategy and robust evaluation
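
To make the idea of a clean schema with metadata concrete, the sketch below models one multi-turn conversation record in Python and serializes it as a JSON line. The field names (record_id, domain, locale, source_type, rights_status), the allowed roles and the class names are hypothetical choices for this example, not a published Defined.ai format.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field

# Assumed role vocabulary; a real dataset may define its own.
ALLOWED_ROLES = {"system", "user", "assistant"}


@dataclass
class Turn:
    role: str
    text: str


@dataclass
class ConversationRecord:
    record_id: str
    domain: str
    locale: str
    source_type: str
    rights_status: str
    turns: list[Turn] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if the record is empty or has malformed turns."""
        if not self.turns:
            raise ValueError("conversation has no turns")
        for turn in self.turns:
            if turn.role not in ALLOWED_ROLES:
                raise ValueError(f"unknown role: {turn.role!r}")
            if not turn.text.strip():
                raise ValueError("empty turn text")

    def to_jsonl(self) -> str:
        """Validate, then return the record as a single JSON line."""
        self.validate()
        return json.dumps(asdict(self), ensure_ascii=False)


if __name__ == "__main__":
    record = ConversationRecord(
        record_id="example-0001",
        domain="healthcare",
        locale="en-US",
        source_type="custom_collection",
        rights_status="cleared",
        turns=[Turn("user", "Hi"), Turn("assistant", "Hello")],
    )
    print(record.to_jsonl())
```

Keeping metadata on every record, rather than in a separate file, lets a training pipeline filter by locale or rights status while streaming the data.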

Featured text datasets

Browse featured text datasets ready to power LLM training, dialogue systems, classification and evaluation applications.

Code Instruction Dataset — 17,000 Human-Reviewed Prompt & Response Pairs for LLM Fine-Tuning

Tech

Code Repository Dataset — 110 Real-World Codebases for LLM Fine-Tuning

Coding

Tech

Wearable Health Data for AI Model Training

Healthcare

English books

Academic

Textbooks

Named Entity Tagged Sentences in Hindi

hi-IN

Named Entities

Aspect-Based Sentiment Annotations in European Spanish

es-ES

Various

SOAP notes of English Doctor-Patient Conversations

en-US

Healthcare

Japanese legal templates

Legal

What you can build with text datasets

Healthcare

Support medical NLP, clinical documentation and domain-specific copilots.

Gaming AI

Train assistants, moderation systems and narrative or support workflows.

Content Moderation

Improve detection, classification and policy enforcement across user-generated text.

Introducing the new and improved Defined.ai Data Marketplace

The world’s largest marketplace of AI training data

Text datasets FAQ

What are text datasets?

Text datasets are curated collections of documents, passages, conversations and labels used to train, fine-tune or evaluate AI models for language understanding, generation, retrieval and classification tasks.

Why not rely only on web-scraped text datasets?

Raw web-scraped text can introduce copyright, privacy and quality risks, and often contains noisy, duplicated or generic content that is less useful for differentiated model training. Defined.ai positions rights-cleared, quality-controlled text datasets as a safer alternative for enterprise AI development.
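
To show what "duplicated content" can look like in practice, here is a minimal Python sketch that drops exact duplicates after light normalization. It is only an illustration and not a description of how Defined.ai cleans data. The function names and the normalization rules (lowercasing, collapsing whitespace) are assumptions, and near-duplicate detection would need a different technique such as MinHash.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe(docs: list) -> list:
    """Keep the first occurrence of each document, judged by normalized hash."""
    seen = set()
    unique = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    scraped = ["Hello  world", "hello world ", "Other text"]
    print(dedupe(scraped))  # ['Hello  world', 'Other text']
```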

Do you provide text datasets for LLM training?

Yes. Defined.ai supports text datasets for LLM pre-training, fine-tuning and evaluation, including rights-cleared corpora, conversational data, instruction tuning formats and RLHF-style datasets.

Do you offer conversational data for chatbots and copilots?

Yes. Defined.ai provides conversational data and dialogue datasets for chatbot training, copilot development and evaluation workflows, including multi-turn conversation formats.

Do you offer labeled NLP datasets for classification, sentiment or safety use cases?

Yes. Defined.ai offers labeled NLP datasets for text classification, NER, sentiment analysis, and safety and toxicity-related tasks, with annotation QA and task-ready structures designed to support modern NLP workflows.

Are your text datasets rights-cleared for commercial AI use?

Defined.ai text datasets are rights-cleared and provenance-validated, with privacy screening and usage controls designed to reduce legal exposure compared with raw web-scraped text. If you have specific usage or review requirements, contact us to speak with a text data expert.

Can Defined.ai help us choose the right text dataset for our use case?

Yes. Defined.ai can help narrow options based on domain, model goals, data format, labeling needs and whether an off-the-shelf or custom dataset is the better fit.

Can you create a custom text dataset?

Yes. If you need a specific domain, tone, locale, schema or labeling approach, Defined.ai supports custom text data collection, including realistic evaluation capabilities for teams that need something more tailored than existing inventory.