Mind the (Accent) Gap: DefinedCrowd Contributing to More Inclusive Speech Technology

29 Jul 2021

In a drive to address bias in speech technology, DefinedCrowd is offering AI developers free speech datasets to enable them to test how well their speech recognition models understand nonnative English speakers with a variety of accents.

Seattle, Washington, USA, 29 July 2021 – DefinedCrowd, the one-stop shop for high-quality artificial intelligence training data, today released the first of a series of free Spanish-accented English speech datasets to allow AI developers to test how well their speech recognition models understand nonnative English speakers, a demographic of over 35 million people in the United States.

“There is an accent gap in speech technology. Research shows that speech recognition technologies are not nearly as accurate in understanding nonnative accents as they are in understanding white, non-immigrant, upper-middle-class Americans,” said Dr. Daniela Braga, founder and CEO of DefinedCrowd.

It is not a surprising phenomenon; it is this demographic that had access to and trained the technology from the beginning. To address the bias present in speech recognition technology, DefinedCrowd has released the first of four Spanish-accented English speech datasets, which developers can use to test or benchmark their models and so identify bias and areas that need more training data.

“Unfortunately, it has resulted in models that are more useful to some people than to others. And that must change,” said Dr. Braga.

However, many companies do not have the resources to train or test their systems with different accents, meaning that speech recognition systems are likely to provide an unresponsive, inaccurate, and even isolating experience to nonnative English speakers.

This is clearly bad for business: according to the U.S. Census, over 35 million people in the United States are native speakers of a language other than English. Sixty percent of these people speak Spanish at home.

“For companies with AI solutions to compete in the large nonnative English-speaking market in the U.S., speech models need to be able to understand a wide range of different Spanish accents, originating from all the Americas,” said Christopher Shulby, Director of Machine Learning Engineering at DefinedCrowd.

The first dataset, released in two phases, includes Spanish-accented English data from across the Americas: Argentina, Brazil, Canada, Chile, Colombia, Dominican Republic, Guatemala, Honduras, Mexico, Nicaragua, Panama, Peru, the United States, Uruguay, and Venezuela.

Subsequent releases will include datasets from native Spanish speakers from around the world, including Australia, China, Finland, France, Germany, India, Israel, Italy, Norway, Portugal, Russia, Spain, Sweden, and the United Kingdom.

The datasets represent speakers aged 18 to 40, with an equal distribution of male and female speakers.

To access the data, developers will need to register on DefinedCrowd’s Marketplace, after which they will receive a link to download the dataset.

To learn more about DefinedCrowd, please visit defined.ai.

About Defined.ai

Defined.ai is the leading provider of ethical AI data, offering the world’s biggest ethical AI data marketplace alongside subscriptions for flexible data access and custom services. With deep expertise in artificial intelligence and machine learning, Defined.ai delivers high-quality, ethically sourced training data, enabling companies to accelerate their AI solutions with data that is secure, bias-free, and compliant with ethical and legal standards.

Founded by Daniela Braga, PhD, in 2015, Defined.ai has earned recognition in top-tier outlets, including Forbes, Fortune, CB Insights, and Inc., and has received numerous awards, with appearances on prestigious lists like Forbes AI 50, Deloitte Fast 100, and Inc. 500. The company has raised over $80 million in funding and is headquartered in Seattle, WA, USA, with additional offices in Lisbon, Portugal.
