IP Issues With AI Code Generators

Srikanth Jandhyala; Jinwoo Kim; Arpita Bhattacharyya, Ph.D.; Finnegan, Henderson, Farabow, Garrett & Dunner, LLP.

Automated content generation based on large language and image models, also known as generative artificial intelligence (AI), drew public attention recently when an AI-generated painting won first place in an art competition. Although the most popular use of generative AI is the creation of media content, several AI-based programming tools also automatically generate software source code.

Such AI programming assistants present an enticing opportunity for developers to write code more efficiently. At the same time, however, they pose many new legal questions for in-house IP counsel about using such AI tools to develop software products. For example, can the generated code be protected through copyright or patents? Could the generated code be considered a derivative work of other copyrighted code (for example, copyrighted code used to train the AI), and is it fair use to train the AI tool on such copyrighted code?

In-house counsel may also want to know whether any licensing limitations, especially those of open-source licenses, apply to the generated code and potentially require the company’s whole source code base to be licensed under the same open-source terms. This article provides some points to consider when such IP issues related to the use of generative AI tools become relevant.

Generative AI for Source Code Generation

GitHub Copilot is one of the earliest and most popular generative AI tools for code generation. Competing tools in the market make similar code recommendations but differ in how their underlying models are trained. According to GitHub, Copilot is based on OpenAI’s Codex model and trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub. In comparison, Tabnine claims that its AI model is trained only on source code with permissive licenses, such as MIT, Apache 2.0, BSD-2-Clause, and BSD-3-Clause.
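To make the idea of restricting training data to permissively licensed code concrete, the following is a minimal sketch of an allowlist filter. It is not a description of how Tabnine or any other vendor actually selects training data. The repository records, their `name` and `license` keys, and the function name are hypothetical; the license names are the four mentioned above, written as SPDX identifiers.

```python
# Hypothetical allowlist built from the four licenses named in the text,
# written as SPDX identifiers.
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"}


def filter_permissive(repos):
    """Keep only repositories whose declared license is on the allowlist.

    `repos` is an iterable of dicts with hypothetical keys "name" and
    "license" (an SPDX identifier, or None when no license is declared).
    """
    return [r for r in repos if r.get("license") in PERMISSIVE_LICENSES]


# Made-up example records, for illustration only.
candidates = [
    {"name": "example/json-utils", "license": "MIT"},
    {"name": "example/kernel-module", "license": "GPL-3.0-only"},
    {"name": "example/http-client", "license": "Apache-2.0"},
    {"name": "example/unlabeled", "license": None},
]

selected = filter_permissive(candidates)
print([r["name"] for r in selected])
```

A filter like this relies entirely on the declared license being accurate; it cannot tell whether a permissively labeled repository itself contains code copied from a copyleft project.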

Copilot makes API (application programming interface) calls to OpenAI’s Codex model to generate code recommendations. It can recommend code in two ways: by auto-completing partly written code, which serves as the input prompt, or by generating code from a comment that the user types as a prompt. In both cases, Copilot uses the previously written code as context.

There are many other AI tools like Copilot that generate code by refactoring input code. Such AI tools could potentially face copyright and open-source licensing issues stemming from both the data used to train the AI model and the output code that the trained model generates, as discussed in the following sections.

Potential Intellectual Property Issues

Open-Source License ‘Tainting’

The AI model underlying a code generator is trained using source code available in public code repositories, for example, open-source licensed code available on GitHub. Some of these licenses are so-called copyleft licenses, such as the GPL, which place limitations on how code developed using the open-source licensed code is distributed for use by others. For example, if source code released under GPL v3 is used in a company’s proprietary software, whether by calling it or copying it, it could “taint” the whole proprietary source code and require it to be freely released under the same license terms. This may limit the company’s ability to obtain and enforce its IP rights, including patent rights on the particular embodiment that includes the GPL code. For example, GPL v3 states that each contributor to code licensed under GPL v3 grants the user of the GPL v3 software a non-exclusive, worldwide, royalty-free patent license on the contributor’s essential patent claims.

Indeed, AI code generators are known to recommend exact copies of code used to train the underlying AI model. For example, GitHub has acknowledged that code generated by Copilot at times reproduces publicly available open-source code on which it was trained. According to GitHub’s internal research, the probability is low, about 1%, but Copilot may generate code containing blocks that exactly match the training code. Consequently, the license of the open-source code may apply to the code developed using Copilot.
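As an illustration of what an exact match between generated code and existing code can mean in practice, the sketch below checks whether two pieces of code share any run of consecutive lines after whitespace is normalized. This is only one simple way to detect verbatim overlap, not the method GitHub used in its research; the function names and the three-line window size are assumptions chosen for the example.

```python
def normalize(line):
    """Collapse whitespace so formatting differences do not hide a match."""
    return " ".join(line.split())


def line_windows(code, size):
    """Return the set of all runs of `size` consecutive non-blank lines."""
    lines = [normalize(l) for l in code.splitlines() if l.strip()]
    return {tuple(lines[i:i + size]) for i in range(len(lines) - size + 1)}


def shares_block(generated, reference, size=3):
    """True if any `size`-line block appears verbatim in both texts."""
    return bool(line_windows(generated, size) & line_windows(reference, size))


# Made-up snippets: the second reuses the body of the first under a new name.
reference = """
def clamp(value, low, high):
    if value < low:
        return low
    if value > high:
        return high
    return value
"""

generated = """
def bound(value, low, high):
    if value < low:
        return low
    if value > high:
        return high
    return value
"""

unrelated = """
def add(a, b):
    return a + b
"""

print(shares_block(generated, reference))
print(shares_block(unrelated, reference))
```

A check of this kind only finds overlap with code that is available for comparison, and it does not catch renamed variables or restructured logic, which is where the derivative-work questions discussed later arise.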

If the generated code includes open-source code publicly available under a copyleft license, such as GPL v3, the entire generated code may inherit that license, which may in turn “taint” the developer’s whole source code, including proprietary code. That is, if a developer includes open-source licensed code in the developed code, the developed code may be treated as using the open-source code, and the open-source license may apply to it. See, e.g., Artifex Software, Inc. v. Hancom, Inc., No. 16-CV-06982, 2017 BL 320360 (N.D. Cal. Sept. 12, 2017), recognizing a licensee’s obligation to share source code for derivative works under the GPL license because the licensee incorporated GPL-licensed code into its software product.

In fact, if the open-source license is a copyleft license, it may override any potential proprietary license or other open-source licenses that protect patent rights to the developed code. The open-source license obligations and limitations may continue to exist even if the code developed using generative AI is relicensed under a different proprietary license.

It is reported that, in some instances, a generative AI tool such as Copilot may recommend using a proprietary license even though its generated code includes open-source code under GPL. This type of licensing conflict may create legal risks, and at a minimum some uncertainty, for the developer’s software product. For example, other downstream users of the developer’s code may be unaware that they may be using code containing open-source licensed code that has been relicensed under a different license, which may cause their own code to be tainted by an open-source license.

Such situations could be avoided by manually reviewing the code to identify known popular code, or by using automated code scanning tools to detect and manage the risk of inadvertently using open-source licensed code. GitHub has also developed a filtering tool that a developer can enable in Copilot to detect and suppress any such recommendations of open-source code. GitHub’s own research, however, shows that the filtering tool does not catch all instances of repeated code, and GitHub recommends additional checks to identify any potential IP breaches.
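The simplest form of automated scanning looks for explicit license markers in source text. The sketch below flags lines that carry an SPDX license tag for a GPL-family license or that mention a GNU General Public License notice. The `SPDX-License-Identifier:` tag is an established convention; the function name, the list of copyleft prefixes, and the sample text are assumptions for illustration, and this is not how any particular commercial scanner works.

```python
import re

# SPDX tags are a real convention ("SPDX-License-Identifier: <id>"); the
# copyleft prefixes below are an illustrative, incomplete list.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*([A-Za-z0-9.+-]+)")
COPYLEFT_PREFIXES = ("GPL", "AGPL", "LGPL")
GPL_NOTICE = re.compile(
    r"GNU (Affero |Lesser )?General Public License", re.IGNORECASE
)


def scan_source(text):
    """Return a list of (line_number, reason) findings for one source file."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        tag = SPDX_TAG.search(line)
        if tag and tag.group(1).startswith(COPYLEFT_PREFIXES):
            findings.append((number, f"SPDX tag {tag.group(1)}"))
        elif GPL_NOTICE.search(line):
            findings.append((number, "GPL-family license notice"))
    return findings


# Made-up file contents, for illustration only.
sample = """\
// SPDX-License-Identifier: GPL-3.0-or-later
int add(int a, int b) { return a + b; }
/* This program is free software under the GNU General Public License. */
"""

for line_number, reason in scan_source(sample):
    print(line_number, reason)
```

Marker-based scanning has an important limit in this context: a snippet recommended by an AI tool usually arrives without its original license header, so a check like this complements, rather than replaces, tools that compare code against known open-source code.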

Copyright Ownership Issue

When public code is released under a permissive open-source license, such as the MIT license, others may use the code with few restrictions as long as there is proper attribution. But when open-source licensed code is repeated as generated code by a generative AI tool, there may be a copyright violation if the use of that code does not follow the open-source license terms, such as requirements for attribution or for distribution in a certain manner.

An AI tool’s recitation of existing open-source code may infect only a small portion of a developer’s code (e.g., a function signature or a few lines of code representing a function) but may still be a copyright violation. Specifically, if the AI tool reproduces exact copies of open-source licensed code, the code developed using such copied code may infringe the copyright in the original source code. Furthermore, even if the auto-generated code includes a somewhat modified version of the copyrighted source code, it may still be considered a derivative work covered by the copyright in the original code.

Another point to consider is whether the developed code may be considered fair use, and thereby avoid copyright violation, because the copyrighted portion of the source code is transformed into something with new utility or meaning. See, e.g., Google LLC v. Oracle America, Inc., 141 S. Ct. 1183 (2021), where the US Supreme Court held that, even assuming Oracle’s Java function code is copyrightable, Google’s use of the code in the Android platform qualifies as fair use because use of the code in a mobile environment is transformative.

Similarly, the use of publicly available code to train AI models, and the code or other content that those models generate, may be considered fair use because the generated output is transformative. However, courts have yet to decide the metes and bounds of derivative work and fair use in the context of generative AI.

Another point to consider is who can claim ownership of code generated by an AI tool. Under current US law, a work may be entitled to copyright protection if it contains sufficient original and creative authorship by a human author. The US Copyright Office has recently decided that AI cannot be an author of a creative work. Absent human authorship, it is uncertain if anyone can claim copyright ownership of code generated by AI. It could be argued that the rightful copyright owner is the person whose queries or inputs generated the code. But without further guidance from the courts, there is no certainty that code generated by AI is copyrightable.

Conclusion

AI tools for automatic code generation are still new, and many intellectual property rights and ownership issues are yet to be resolved. Some new licenses geared toward AI software let licensors limit whether and when the underlying data and models may be used for training and in AI software. These licenses aim to improve the licensing of training data for AI tools and thereby overcome some of the issues identified above.

But it will likely take time for developers to upload their code with the new licenses to open-source repositories, such as GitHub, and for the generative AI models to be retrained using only source code with these new licenses. GitHub recently announced an enterprise version of Copilot, which may address some of the aforementioned IP issues, but many IP questions remain unresolved, and it remains to be seen how these issues will play out in the courts.

Additionally, the lack of traceability of code recommended by generative AI tools suggests that there is no straightforward way of knowing if generated code includes any repeated code that may carry baggage from the original open-source licensing terms. Until we get clearer guidance from the courts, it would be wise to use these generative AI tools with caution, and to scan developed code manually or by using scanning software to identify any open-source licensed code, and thus avoid or minimize the risk of inadvertently violating open-source licenses and copyrights.

___

This post was originally published by Finnegan, Henderson, Farabow, Garrett & Dunner, LLP.