“Foundation models are the epicenter of artificial intelligence (AI) as AI begins to shape how the economy and society function,” begins “The Foundation Model Transparency Index v1.1 (FMTI v1.1),” a May 2024 paper by a group of researchers from Stanford, MIT, and Princeton under the auspices of Stanford’s Center for Research on Foundation Models (CRFM). “For such a high-impact technology, transparency is vital to facilitate accountability, competition, and collective understanding. As an illustrative example, the current lack of transparency regarding the data used to build foundation models makes it difficult to assess what copyrighted information is used to train foundation models. Governments around the world are intervening to increase transparency.”
In October 2023, CRFM introduced the Foundation Model Transparency Index (FMTI), a scoring system designed by a multidisciplinary research team to assess the transparency of foundation models and thereby promote responsible business practices and greater public accountability. The FMTI comprises three top-level domains, which together aggregate 23 subdomains and 100 transparency indicators:
- Upstream domain — 32 indicators across 6 subdomains covering the resources used to build the model, including data size and composition, human labor, training methods, hardware and energy used, and environmental impacts;
- Model domain — 33 indicators across 8 subdomains characterizing the foundation model itself, including its size, capabilities, limitations, risks, and trustworthiness; and
- Downstream domain — 35 indicators across 9 subdomains describing how the model is used, including the release process, distribution channels, user interface, data protection, feedback mechanisms, and documentation.
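The three-domain taxonomy above can be sketched as a small data structure. The domain names and counts come from the article; the dictionary layout is my own illustration, not anything published by CRFM:

```python
# Illustrative sketch of the FMTI taxonomy (counts from the article;
# the structure itself is a hypothetical representation).
FMTI_DOMAINS = {
    "upstream":   {"indicators": 32, "subdomains": 6},
    "model":      {"indicators": 33, "subdomains": 8},
    "downstream": {"indicators": 35, "subdomains": 9},
}

# The domains partition the full index: 100 indicators, 23 subdomains.
total_indicators = sum(d["indicators"] for d in FMTI_DOMAINS.values())
total_subdomains = sum(d["subdomains"] for d in FMTI_DOMAINS.values())
print(total_indicators, total_subdomains)  # → 100 23
```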
The first iteration of the index (FMTI v1.0), launched in October 2023, scored 10 major foundation model developers (e.g., OpenAI, Google, Meta) based on publicly available information at the time. The overarching finding was that all ten developers had significant room for improvement: the average score was only 37 out of 100; Meta’s Llama 2 was the highest-scoring model at 54 out of 100, OpenAI’s GPT-4 scored 48, and Google’s PaLM 2 scored 40.
“To understand how the landscape has evolved in the last 6 months, we conduct a follow-up study (FMTI v1.1),” wrote the CRFM research team in their May 2024 FMTI v1.1 paper. “To enable direct comparison, we retain the 100 transparency indicators and the associated threshold for awarding a point from FMTI v1.0. However, instead of searching for public information as was done in FMTI v1.0, we request that developers report the relevant information for each indicator.
“We implemented this change for three reasons: (i) completeness: we obviate the concern that information was missed when searching the Internet; (ii) clarity: we reduce uncertainty by having developers affirmatively disclose information; and (iii) scalability: we remove the effort required for researchers to conduct an open-ended search for decentralized public information.”
“We contacted 19 foundation model developers, and 14 provided reports related to the 100 transparency indicators (Adept, AI21 Labs, Aleph Alpha, Amazon, Anthropic, BigCode/Hugging Face/ServiceNow, Google, IBM, Meta, Microsoft, Mistral, OpenAI, Stability AI, Writer). Given each developer’s initial report, we provided scores based on whether each disclosure satisfied the associated indicator. Developers responded to these initial scores, engaging in dialogue via email and virtual meetings, and clarifying matters in many cases. Following this iterative process, for each developer we publish a transparency report that consolidates the information it discloses. These reports contain new information, which developers had not disclosed publicly prior to the start of FMTI v1.1. On average, developers disclosed information related to 16.6 indicators that was not previously public.”
Let me summarize the key FMTI v1.1 findings.
The mean score is 58 out of 100, a 21-point improvement over the October 2023 FMTI mean of 37, and the top score is 85. In addition, the top score rose by 31 points (85 vs. 54) and the bottom score by 21 points (33 vs. 12). “All eight developers scored in both the October 2023 and May 2024 FMTI have improved their scores. Of the 100 transparency indicators, 96 are satisfied by at least one developer and 89 are satisfied by multiple developers.”
Developers still have significant room for improvement. 11 of the 14 developers scored below 65 out of 100, indicating that transparency across the foundation model ecosystem remains limited. “If developers emulate the most-transparent developer for each indicator, overall transparency would improve sharply.”
Developers disclosed significant new information, which contributed to their higher scores. Developers proactively released transparency reports, in contrast to the previous approach where the FMTI research team had to collect the information from the internet. “Developers disclosed an average of 17 new indicators-worth of information in their reports.”
Developers performed best on indicators in the downstream domain, earning 65% of available points, compared with 61% in the model domain and 46% in the upstream domain. “Developers scored worse across upstream indicators: of the 20 indicators where developers score highest, just one indicator (model objectives) is in the upstream domain. … On the whole, developers are less transparent about the data, labor, and compute used to build their models than how they evaluate or distribute their models.”
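The per-domain percentages reported above are simply points earned over points available, since each indicator awards one point when the associated disclosure meets its threshold. A minimal sketch of that aggregation (the boolean results below are invented placeholders, not real FMTI data):

```python
def domain_score(scores: dict) -> dict:
    """Percent of available points per domain, given one boolean per
    indicator (True = the disclosure satisfied that indicator)."""
    return {
        domain: 100 * sum(indicators) / len(indicators)
        for domain, indicators in scores.items()
    }

# Invented example results, one boolean per indicator.
example = {
    "upstream":   [True, False, False, True],  # 2 of 4 points
    "model":      [True, True, False, True],   # 3 of 4 points
    "downstream": [True, True, True, False],   # 3 of 4 points
}
print(domain_score(example))
# → {'upstream': 50.0, 'model': 75.0, 'downstream': 75.0}
```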
The subdomains with the highest scores are user interface (93%), capabilities (89%), model basics (89%), documentation for downstream deployers (89%), and user data protection (88%). “Each of these subdomains is in the downstream domain, where developers scored near or above 70% on 6 of the 9 subdomains.”
The subdomains with the lowest total scores are data access (7%), impact (15%), trustworthiness (29%), and model mitigations (31%). “Developers score 50% or less on 10 of 23 subdomains in the index, including 3 of the 5 largest subdomains—impact (15%), data (34%), and data labor (50%). The lack of transparency in these subdomains shows that the foundation model ecosystem is still quite opaque—there is little information about how people use foundation models, what data is used to build foundation models, and whether foundation models are trustworthy.”
Open developers generally outperform closed developers: the median open developer scores 5.5 points higher than the median closed developer. “The difference in transparency between open and closed developers is attributable to the substantial gap in upstream transparency: within the upstream domain, the median open developer scores 3 additional points on indicators in the upstream subdomain over the median closed developer.”
Opacity persists on specific issues. “While overall trends indicate significant improvement in the status quo for transparency, some areas have seen no real headway: information about data (copyright, licenses, and PII [Personally Identifiable Information]), how effective companies' guardrails are (mitigation evaluations), and the downstream impact of foundation models (how people use models and how many people use them in specific regions) all remain quite opaque.”
“The societal impact of foundation models is escalating, attracting the attention of firms, media, academia, government, and the public,” notes the May 2024 FMTI paper in conclusion. “By dissecting what developers do and do not publicly disclose, and how this has changed, the Index allows different stakeholders (e.g. developers, customers, investors, policymakers) to make more clear-eyed decisions. And, in turn, by establishing the practice of transparency reporting for foundation models, the Index surfaces a new resource that downstream developers, researchers, and journalists should capitalize on to build collective understanding. Moving forward, we hope that headway on transparency will demonstrably translate to better societal outcomes like greater accountability, improved science, increased innovation, and better policy.”