“The availability of enterprise-grade open source software (OSS) is changing how organizations develop, maintain, and deliver products,” wrote Ibrahim Haddad in a recent report, Artificial Intelligence and Data in Open Source. Haddad is VP of Strategic Programs at the Linux Foundation (LF) and Executive Director of the LF AI & Data initiative. “Using and adopting OSS can offer many benefits, including reduced development costs, faster product development, higher code quality standards, and more. The open source methodology offers key and unique benefits to the domains of AI and data, specifically in areas of fairness, robustness, explainability, lineage, availability of data, and governance.”
Earlier this year, Stanford released the 2022 AI Index report, its fifth annual study on the impact and progress of AI. The Stanford report noted that “2021 was the year that AI went from an emerging technology to a mature technology - we’re no longer dealing with a speculative part of scientific research, but instead something that has real-world impact, both positive and negative.”
A few weeks ago I wrote about the impressive scope of the Linux Foundation. The LF supports a large and growing number of open source projects in a wide variety of areas. AI is no different from other technology domains, so it’s not surprising that open source now plays a major role as AI is being increasingly integrated into the economy.
“The open source AI and data ecosystem presents several opportunities for new R&D, new startups, and innovations,” said Haddad. “The infusion of AI in products and services has created opportunities to improve people’s lives around the world. It also has raised concerns about the fairness, explainability, and security of these applications and systems. Various national and global initiatives are working to address these concerns. LF AI & Data and its member organizations consider trusted and responsible AI as a critical domain and as a global group working on policies, guidelines, and use cases to ensure the development of trustworthy AI systems and processes.”
The report focused on six AI & Data areas where open source methodologies can bring unique benefits:
- Fairness. “Methods to detect and mitigate bias in datasets and models, e.g., bias against known protected populations”;
- Robustness. “Methods to detect alterations and tampering with datasets and models, e.g., modifications from known adversarial attacks”;
- Explainability. “Methods to enhance persona’s or role’s ability to understand and interpret AI model outcomes, decisions, and recommendations, e.g., ranking and debating results and options”;
- Lineage. “Methods to ensure the provenance of datasets and AI models, e.g., reproducibility of generated datasets and AI models”;
- Data. “Open source data-specific licenses make data freely accessible for use without mechanisms of control”; and
- Governance. “A governance structure and tools to clean, sort, tag, trace, and govern data and datasets.”
Let me briefly discuss three of these areas: fairness, explainability, and data.
Fairness. A major finding of the 2022 AI Index Report was that while large language models like GPT-3 and BERT are setting new records on technical benchmarks, they're also more prone to reflect the biases that may have been included in their training data, including racist, sexist, extremist, and other harmful language, as well as overtly abusive language patterns and harmful ideologies. That's why methods to reduce bias and abusive behaviors are so important.
AI Fairness 360, for example, is an open source toolkit to help examine, report, and mitigate discrimination and bias in machine learning models throughout the AI application lifecycle. “The AI Fairness 360 Python package includes a comprehensive set of metrics for datasets and models to test for biases, explanations for these metrics, and algorithms to mitigate bias in datasets and models. The AI Fairness 360 interactive demo provides a gentle introduction to the concepts and capabilities. The tutorials and other notebooks offer a deeper, data scientist-oriented introduction.”
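To make the idea of "metrics for datasets and models to test for biases" concrete, here is a minimal sketch in plain Python of two widely used group-fairness metrics of the kind a toolkit like AI Fairness 360 reports. This is an illustration of the metrics themselves, not the AIF360 API; the loan-approval data is hypothetical.

```python
# Illustrative sketch (not the AIF360 API): two common group-fairness
# metrics computed over a binary classifier's decisions for two groups.

def selection_rate(outcomes):
    """Fraction of favorable outcomes (1 = favorable) in a group."""
    return sum(outcomes) / len(outcomes)

def statistical_parity_difference(unprivileged, privileged):
    # Difference in favorable-outcome rates; 0 means parity.
    return selection_rate(unprivileged) - selection_rate(privileged)

def disparate_impact(unprivileged, privileged):
    # Ratio of favorable-outcome rates; the "80% rule" flags values below 0.8.
    return selection_rate(unprivileged) / selection_rate(privileged)

# Hypothetical loan-approval decisions (1 = approved) for two groups.
unpriv = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]   # approval rate 0.3
priv   = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]   # approval rate 0.7

print(statistical_parity_difference(unpriv, priv))  # ≈ -0.4
print(disparate_impact(unpriv, priv))               # ≈ 0.43, well below 0.8
```

Bias-mitigation algorithms of the sort the toolkit bundles then try to move these numbers toward parity, for example by reweighing training examples before the model is fit.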
Explainability. Despite their widespread adoption, ML models remain mostly black boxes. The methods behind an ML prediction, subtle adjustments to the numerical weights that interconnect its huge number of artificial neurons, are very difficult to explain because they're so different from the methods used by humans. The bigger the training data set, the more accurate the prediction, but the more difficult it will be to provide a detailed, understandable explanation to a human of how the prediction was made. Understanding the reasons behind predictions is very important in assessing whether to trust an ML model, which is fundamental if one plans to take important actions based on the prediction, such as a medical diagnosis or a judicial decision.
AI Explainability 360 is an open source library that supports the interpretability and explainability of data sets and machine learning models throughout the AI application lifecycle. "The AI Explainability 360 interactive demo provides a gentle introduction to the concepts and capabilities by walking through an example use case from the perspective of different consumer personas. The tutorials and other notebooks offer a deeper, data scientist-oriented introduction."
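One simple way to see what "explaining" a black-box prediction can mean is feature attribution: perturb each input feature toward a baseline and record how much the model's score moves. The sketch below illustrates that idea in plain Python with a stand-in scorer; it is not the AIX360 API, and the model, input, and baseline are hypothetical.

```python
# Illustrative sketch (not the AIX360 API): a minimal feature-attribution
# explanation. For one input, replace each feature with a baseline value
# and record how much the model's score changes -- features whose removal
# moves the score the most matter most for this particular prediction.

def model(x):
    # Stand-in black-box scorer (a simple weighted sum for illustration).
    weights = [0.5, -2.0, 0.1]
    return sum(w * v for w, v in zip(weights, x))

def attributions(x, baseline):
    base_score = model(x)
    scores = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] = baseline[i]          # "remove" feature i
        scores.append(base_score - model(perturbed))
    return scores

x = [4.0, 1.0, 10.0]
baseline = [0.0, 0.0, 0.0]
print(attributions(x, baseline))  # [2.0, -2.0, 1.0]
```

Real explainers refine this idea considerably (sampling many perturbations, fitting local surrogate models, averaging over baselines), but the core question is the same: which inputs drove this prediction?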
Data. “We’re all familiar with the expression, garbage in, garbage out, referring to the importance of inputting good data to derive valuable insights. With the global digitalization and transformation of industries and economies, data has become quite abundant; the challenge has shifted from finding data to selecting quality data, efficiently mining the data for actionable insights, and effectively converting those insights into business value. The LF AI & Data community recognizes the importance of data and has been keen on hosting and supporting key projects covering data lineage, format, store, operations, feature engineering, governance, stream processing, and pipeline management.”
Open source software communities have shown the power of open collaboration for building some of the world’s most important infrastructures. AI communities are similarly looking to collaboratively build open datasets that can be shared. This is particularly important given the huge amounts of training data required for new, leading edge AI advances like foundation models. However, intellectual property for data is generally treated differently than IP for software. As a result, open source software licenses cannot be readily applied to data.
One of the most important AI & Data projects is the Community Data License Agreement (CDLA). CDLA is a legal framework for developing license agreements to enable access, sharing, and use of data openly among individuals and organizations. CDLA-Permissive-2.0, for example, is “a short license agreement, easily comprehensible to data scientists and lawyers alike, to permit recipients to broadly use, analyze, modify and share data. … Proprietary datasets will continue to exist, but data availability under the CDLA licenses (two versions exist) should allow everyone to build credible products, including smaller players.”
“Open source has already won in AI and data,” wrote Haddad in conclusion. “We are far more innovative in collaboration than in isolation. Evident by the data available to us today, open source as a methodology and practice has fueled our massive advances in AI. We’re going now through the process of open source AI dominating the software world. This situation is the new normal. Let’s celebrate it and continue our pursuit of technological advances in fair, transparent, and ethical ways.”