Interview with Head Corporate Counsel: Navigating Ethical AI Practices and Compliance
17 Oct 2024
I’m Christine Payne, the Head of Marketing at Defined.ai, and after countless data misuse headlines this summer, I wanted to understand the challenges that our customers are facing when tasked with implementing ethical AI practices internally. More importantly, I wanted to shed some light on how they can build better processes to avoid being added to the growing list of data misusers.
So, who better than our very own in-house legal counsel, Melissa Carvalho, to guide us through the complexities of answering the question on all our minds, “How can my data be protected in the unpredictable AI world?”
Melissa shares her perspective on how AI companies can implement more effective ethical practices to avoid making headlines from the misuse of data. She shares insights on the evolving legal landscape around AI and copyright, highlighting why transparency in data sourcing is so important. For AI business leaders, this conversation sheds light on the practical steps you can take to manage legal risks while remaining competitive in the fast-moving AI space.
The Role of Legal Counsel in Shaping AI Data Ethics
CP: Tell us more about your role at Defined.ai and how it aligns with our mission – creating the largest marketplace for ethically sourced AI training data?
MC: I’m the in-house legal counsel. While a common role in many companies, it’s pretty unique in AI startups – which makes it even more exciting for me. Not only does my role directly align with our mission, but it also significantly impacts it. To be an ethical marketplace, we must ensure that we’re providing data that is 100% compliant, so it’s critical that I’m viewed as a business partner among our stakeholders, partners and customers. It’s my job to create a process that requires everyone to take a beat, and take stock, as many times as needed to secure consistency with our core values. It might not sound exciting to many, but I take a lot of pride in knowing that we practice what we preach – internally and for our customers.

Earlier this year, we released our Ethical AI Manifesto, which details our commitment to strict legal practices and aligns with our Privacy Program, while staying laser-focused on enacting standard codes of practice and programs for the AI industry, informed by an iterative learning process. Guess how many AI marketplaces are investing time and resources to hold the industry accountable? Trick question – it’s just us.

I could ramble on, but long story short: I’m focused on removing ambiguity at every level at Defined.ai by improving legal awareness, which ultimately results in better cross-functional collaboration so we can deliver safer solutions to our customers – faster.
It might not sound exciting to many, but I take a lot of pride in knowing that we practice what we preach – internally and for our customers.
CP: What distinguishes an "ethical AI marketplace" like Defined.ai (from a standard AI marketplace), particularly in the context of copyright compliance?
MC: Oh, that’s an easy one – our Partner Program! Something that I've never seen in any marketplace. It requires every partner that we onboard to undergo a “cleansing process” before they can sell on our marketplace. This means that every supplier must complete our ethical questionnaire, sign a code of conduct, and provide evidence attesting to the legitimacy of their data collection process and transmission to Defined.ai. This is non-negotiable for us.
Navigating Legal Challenges in the AI Industry
CP: With lawsuits like the one against Anthropic, an AI safety-focused organization, making headlines, how do you see the legal landscape evolving around the use of copyrighted data for AI training?
MC: Well, copyright law and AI are at a crossroads. It’s like the book from Daniel Klein – “Every Time I Find the Meaning of Life, They Change It.” We’ve seen several approaches across the globe, with different stances on how to deal with this problem. At the end of the day, we need to choose how we review past mistakes and evolve our laws. Are we going to use the same copyright laws or make amendments for AI? What’s the recipe?
Look at the UK, for example. Last summer, the UK Intellectual Property Office attempted to devise a code of practice that would provide AI firms with guidance on how to ethically and legally input copyrighted work into their models. Essentially, “labeling” them as safe to use. Well, that was a great idea in theory but sadly, nothing has come of it because the IPO still can’t agree on a standard code of practice with enough transparency to protect both the AI firms and the authors.
Then in the EU, the AI Act has two requirements that AI companies must comply with. The first involves establishing a copyright policy that allows authors to opt out, and the second states that AI companies must provide a detailed summary of the content being used to train their general-purpose AI models. But it falls short of having any specific rules for copyright infringement.
Now in the U.S., it gets even trickier. The courts can’t clearly answer the potentially multibillion-dollar question – are certain AI companies massively infringing on the rights of authors by using images, writings and other scraped data to train their models? At the end of the day, however, the math is simple. Scraped data equals unconsented data. For me, and for Defined.ai, there’s no difference.
The solution lies within either the License as Incentive regime, the Mandatory License approach, or a more extreme avenue, Model Deletion, which could significantly impact the model’s performance. There have been several lawsuits on this topic over the years, and all of these options have their own flaws. There’s no quick and easy fix for such a complex technological revolution.
Approaches like the License as Incentive regime can reduce the burden of technological expertise somewhat, but they still require enough know-how to establish a system that determines which outputs are relevant and detects them reliably enough to trigger the licensing fee. Another potential approach is to impose a type of mandatory license where the underlying harm is an IP violation. For instance, courts could require a defendant to pay a fee to the plaintiff each time the model generates an output that bears sufficient resemblance to the IP in question. This is easier said than done. If, for example, a model generates text that is 90% identical to a copyrighted text, is that enough to trigger the license? How about 80%? Mandatory licenses have worked in other contexts, and despite their flaws they may still be preferable to ordering the deletion of a model.
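To make the threshold problem concrete, here is a minimal, purely illustrative sketch (not anything Defined.ai or any court actually uses) of what a similarity gate for such a mandatory-license regime might look like. The 0.8 threshold and the function names are assumptions invented for illustration; the interview’s point is precisely that choosing and defending such a number is hard.

```python
import difflib

# Hypothetical policy threshold -- an assumption, not a legal standard.
# The whole difficulty discussed above is deciding where this line sits.
LICENSE_THRESHOLD = 0.8

def similarity(generated: str, copyrighted: str) -> float:
    """Return a ratio in [0, 1] of how closely the generated output
    matches the copyrighted source text (simple character-level match)."""
    return difflib.SequenceMatcher(None, generated, copyrighted).ratio()

def triggers_license_fee(generated: str, copyrighted: str,
                         threshold: float = LICENSE_THRESHOLD) -> bool:
    """Would this output owe a per-use fee under the sketched regime?"""
    return similarity(generated, copyrighted) >= threshold
```

Even this toy version exposes the policy questions: character-level similarity is a crude proxy for substantial resemblance, and near-threshold outputs would be litigated either way.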
CP: How does transparency in data sourcing and model training impact the legal risks associated with AI development?
MC: Transparency is the backbone of this business. The data, the system and the business models need mutual transparency.
Whenever an AI system has a significant impact on people’s lives, we should be able to demand an explanation of the AI system’s decision-making process long before escalation.
This means that data sourcing and labelling must be documented with a global set of standards and practices if we expect to continue doing business across the world. And sadly, most players in the game aren’t doing this. They’re not ethically sourcing and documenting their data.
Yes, AI offers substantial benefits to individuals and society, but it also has the potential to cause severe negative impact if not managed correctly. As renowned technology and intellectual property lawyer Regina Penti once said, “Just because something is out there for free doesn’t mean it’s free of rights.”
At Defined.ai, we have open conversations with our partners on what the copyrights owner’s expectations are regarding the use of their data. Then, as thoroughly as possible, we infuse this information into our contractual agreements to ensure alignment starting from the statement of work presented by the client, all the way through to the proposals we make to the partners, detailing how the data will be used.
Let me give you a basic example of how we approach this at Defined.ai. We classify our partners based on the nature of the datasets they present us with. One of the main categories is media partners. If the partner presents us with a portfolio of data related to publications like newspapers, journals, books, or even blogs, they will fall into this category. That means my legal analysis will prioritize the IP rights assessment.
So going back to your initial question, perhaps this paints a better picture of the importance of my role here. It’s my duty to strike a balance between maximizing the benefits of AI systems and minimizing their risks. Oh sure, it’s a challenge – don’t let me convince you otherwise. We won’t solve it here, that’s for sure, but we need to keep this topic front and center. So, let’s keep talking about it.
Transparency is the backbone of this business. The data, the system and the business models need mutual transparency.
Implementing Effective Practices for Ethical AI
CP: Yes, it seems we’re just scratching the surface. My next question focuses more on taking your experience in such a niche and sensitive industry, coupled with your passion for driving change. Can you give our readers some tips and tricks for obtaining proper consent to use data, particularly copyrighted work, to train their models? Without giving away our secret sauce, of course.
MC: Oh, well let me readjust myself here, this could take a few minutes. Let’s say I've got a few things on my mind for those who are still reading.
First and foremost, organizations must be super clear about their process for obtaining consent in everything from their internal processes and documentation down to how they govern or run their businesses. If it’s not embedded in your company values, for example, it should be. Don’t overlook this piece, this impacts your recruitment process, and the right talent is key. It’s a waterfall effect from here, let’s be honest.
You’re probably wondering, okay, this is reasonable advice, but where do I start, Melissa? I’d say be purpose driven. This means having a specified purpose at every stage of the AI lifecycle so you can intimately understand the scope of each processing activity, evaluate its level of compliance from data protection to IP law, and more importantly, evidence it all. Then, along the assessment journey, you can pulse-check yourself and identify key moments where you’ll ask – is this a breach of any laws, contracts or otherwise? If yes, rinse and repeat until you get to a point where these questions are a distant memory.
The evidence, or paper trail that you create, will depend on the jurisdiction where you operate, its applicable laws, and the type of data you’re handling. No matter what, however, you’ll need to flesh out the legitimacy of the data. This includes provenance, geographical scope, lawfulness of collection, data privacy policy, IP management methodology, and fair compensation.
So, what does that look like? Let's go back to the example I used before – media partners. In this case, amongst other evidence, we’d need to review the redacted agreement between the relevant parties and, where the datasets are not in the public domain, identify the owner or the sub-licensing rights for any data protected by copyright, trademark or patent. This way, the AI company training its model can ensure that the data provider either owns the full right, title and interest to that dataset, or is entitled to sublicense it, providing Defined.ai with proper warranties and representations that third parties’ IP rights are not and won’t be infringed.
At the end of the day, the line between a trained model and its underlying training data is conceptually distinct; in practice, it can be ... how shall I say it – blurry. The only certainty we have is that the risk and cost of feeding unreliable data have proven quite high, and it can easily lead to claims of negligence, unjust enrichment and vicarious copyright infringement.
As my grandmother always said, “the more power you use Melissa, the higher our electricity bill!” Well, the same goes for creating AI models.
CP: Looking ahead, how do you foresee ethical AI marketplaces like Defined.ai influencing the broader AI industry, particularly in the context of upcoming copyright and IP regulations?
MC: Let’s look at the increasing volume of lawsuits across music, publication, entertainment, tech – no industry is exempt from the reality that the use of data requires protection and consent.
Defined.ai’s marketplace creates transparency, and we’re contributing to a trustworthy system that’s built on the buying and selling of responsible AI. This means data that is compensated for, unbiased, and consented for use. The demand and pace for high-quality data is only accelerating. We need a rock-solid and reliable licensing framework governing the use of copyrighted material to train AI models – yesterday.
But setting standards without the input of relevant stakeholders and societal participation will risk losing legitimacy in the long run. And historically we know that standardization efforts often fail where the human behavioural aspect is ignored. Defined.ai and its leadership team are committed to its mission of creating the largest marketplace for ethical AI. And as part of this mission, we’ll continue to create guidelines like the Ethical AI Manifesto for other marketplaces and AI companies to leverage.
Let’s look at the increasing volume of lawsuits across music, publication, entertainment, tech – no industry is exempt from the reality that the use of data requires protection and consent.
CP: In the last few months alone, we’ve seen unethical data use significantly impact the music industry and now publishing. What is your “non-legally binding” advice to companies buying data to fine-tune and train their AI models?
MC: There are hundreds of guides, frameworks, principles, etcetera, focused on AI governance, and yet none of them can be truly global in reach and coverage, which raises problems of coordination and implementation.
We need to foster a holistic, multifaceted and adaptive vision and hold ourselves accountable to transparency. AI companies that subscribe to this vision can work directly with content providers to license copyrighted material, train models on data they own, or buy the content from a trustworthy marketplace, like ours.
CP: Defined.ai's marketplace is a combination of owned and sourced data through our extensive partner ecosystem. From a legal and compliance perspective, what's the benefit to third parties selling their data there?
MC: Well, for third parties to sell their data on Defined.ai, they would join our Partner Program. The main benefit to partners is an added revenue stream to sell existing data, responsibly. Our partners gain access to a trusted and expansive network of AI developers looking for a one-stop-shop option for buying data.
We have a dedicated market research and business development team that scours the industry to understand trends and demand and works with our partners to secure the required data. Then, combined with my role, this ensures that every partner benefits from the same standards we hold our own data marketplace to. We align with partners on ethical standards, permitted use and restrictions on the curated data, and ensure that there is a solid contractual framework in place.
We recently sat down with one of our partners to assess their overall experience and identify ways we can grow together – something we do frequently as our team uncovers new trends in the market. During this discussion, their CEO, Aaron, was quick to note that while adding a new revenue stream to their portfolio was exciting, his team was relieved that Defined.ai guided them through the compliance and legal pathways to secure data consent for launch – something they were not equipped or prepared to do otherwise.
CP: The need for quality data is growing daily. What are some initial questions that companies can ask when looking to buy off-the-shelf datasets?
MC: Start with measurement-based questions and focus on the three tenets of high-quality data: “How good is your off-the-shelf (OTS) data in terms of accuracy, reliability, and consistency?” But to really get to the root of it all, you need to know whether the OTS data is underpinned by fair representation, transparency, regular audits and data governance. There need to be consistent efforts focused on reviewing the datasets to ensure they meet new regulatory standards.
Then lastly, I’d probably be curious about how data requirements are collected, delivered, reconciled, and what use cases are available to back it all up.
Learn more about selling your data on our marketplace by becoming a Defined.ai partner.