LAION

OpenAssistant
	Screenshot of the data collection web portal
Developer(s)	LAION and contributors
Initial release	15 April 2023
Type	Large language model; Generative pre-trained transformer; Chatbot
License	Apache License 2.0
Website	open-assistant.io

LAION
Type	Non-profit
Industry	Artificial intelligence
Founder	Christoph Schuhmann; Jenia Jitsev; Richard Vencu; Robert Kaczmarczyk; Theo Coombes; Mehdi Cherti; Aarush Katta; Jan Ebert

Short description: Non-profit German artificial intelligence organization

LAION (acronym for Large-scale Artificial Intelligence Open Network) is a German non-profit organization that produces open-source artificial intelligence models and datasets.^[1] It is best known for releasing several large datasets of images and captions scraped from the web, which have been used to train high-profile text-to-image models, including Stable Diffusion and Imagen.^[2]^[3]

In February 2023, LAION was named as a non-party in the Getty Images lawsuit against Stability AI over Stable Diffusion.^[4] In April 2023, LAION was sued directly by a German photographer who wanted his images removed from the training set.^[5] In September 2024, the Regional Court of Hamburg dismissed the lawsuit, in what was described as a "landmark ruling on TDM [Text and data mining] exceptions for AI training data" in Germany and the EU more generally.^[6]

On April 15, 2023, LAION and contributors publicly released an open-source AI assistant chatbot called OpenAssistant.

Image datasets

LAION has publicly released several large datasets of image-caption pairs, which have been widely used by AI researchers. The data is derived from Common Crawl, a dataset of scraped web pages. The developers searched the crawled HTML for <img> tags and treated their alt attributes as captions. They then used CLIP to identify and discard images whose content did not appear to match their captions.^[7] LAION does not host the scraped images themselves; rather, the datasets contain URLs pointing to the images, which researchers must download themselves.^[8]
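The extraction and filtering steps described above can be illustrated with a minimal Python sketch. It is not LAION's actual code: the names `ImgAltExtractor`, `extract_pairs`, and `filter_pairs` are hypothetical, the `embed_image` and `embed_text` callables stand in for a CLIP model that is not included here, and the default threshold of 0.3 is an assumed value chosen for illustration.

```python
from html.parser import HTMLParser
from math import sqrt


class ImgAltExtractor(HTMLParser):
    """Collect (image URL, alt text) pairs from <img> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        alt = (attr_map.get("alt") or "").strip()
        # Keep only images that have both a URL and non-empty alt text.
        if src and alt:
            self.pairs.append((src, alt))


def extract_pairs(html):
    """Return every (src, alt) pair found in an HTML string."""
    parser = ImgAltExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(pairs, embed_image, embed_text, threshold=0.3):
    """Keep pairs whose image and caption embeddings are similar enough.

    embed_image and embed_text are placeholders (assumptions) for a CLIP
    image encoder and text encoder; threshold is an illustrative value.
    """
    kept = []
    for url, caption in pairs:
        score = cosine_similarity(embed_image(url), embed_text(caption))
        if score >= threshold:
            kept.append((url, caption))
    return kept
```

Note that the output of such a pipeline is a list of URLs and captions rather than image files, which matches the way the datasets are distributed.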

The first such dataset, LAION-400M, was released in August 2021 and consisted of 400 million image-caption pairs. The pairs were extracted from a random subset of webpages scraped by Common Crawl between 2014 and 2021.^[9] It was an attempt to recreate the process OpenAI used to collect the 400 million image-caption pairs on which it trained the CLIP model; the company had open-sourced the model's code and weights, but not its training dataset.^[7] Imagen, a text-to-image model announced by Google Brain in 2022, was trained on LAION-400M in combination with private internal datasets.^[10]

A successor of more than 5 billion pairs, LAION-5B, was released in March 2022.^[11] At the time of its release, it was the largest freely available dataset of image-caption pairs in existence.^[7] Its creation was funded by Doodlebot, Hugging Face and Stability AI, the company that funded the Stable Diffusion text-to-image model, which was trained on it.^[12]

Criticism

Several studies have found that LAION-5B contains image-text pairs involving rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.^[13]^[14]

An investigation by Bayerischer Rundfunk showed that LAION's datasets, hosted on Hugging Face, contain large amounts of private and sensitive data harvested from public websites.^[15]

In December 2023, the Stanford Internet Observatory released a report on LAION-5B that found 3,226 suspected instances of links to child sexual abuse material, of which 1,008 were externally validated. In response, LAION temporarily removed LAION-5B and LAION-400M, citing its "zero tolerance policy for illegal content" and "an abundance of caution".^[16] In August 2024, LAION released a cleaned dataset called Re-LAION-5B.^[17]

OpenAssistant

OpenAssistant was an open-source, chat-based artificial intelligence (AI) assistant designed to understand tasks, interact with third-party systems and retrieve information dynamically in order to carry them out. The project was developed by a group of volunteers in collaboration with LAION. One of its development goals was free access to large language models that can be run locally on consumer hardware.^[18]^[19] The project was backed by a worldwide crowdsourcing effort involving over 13,500 volunteers, who created 600,000 human-generated data points.^[19]^[20] The project has since been shut down; however, the datasets and models remain available on Hugging Face.

References

[1] "About". LAION.ai. https://laion.ai/about/.
[2] Edwards, Benj (15 September 2022). "Have AI image generators assimilated your art? New tool lets you check". Ars Technica. https://arstechnica.com/information-technology/2022/09/have-ai-image-generators-assimilated-your-art-new-tool-lets-you-check/.
[3] Newman, Marissa; Cantrill, Aggi (24 April 2023). "The Future of AI Relies on a High School Teacher's Free Database" (in en). Bloomberg News. https://www.bloomberg.com/news/features/2023-04-24/a-high-school-teacher-s-free-image-database-powers-ai-unicorns.
[4] "Getty Images (US), Inc. v. Stability AI, Inc., 1:23-cv-00135" (in en-us). https://www.courtlistener.com/docket/66788385/getty-images-us-inc-v-stability-ai-inc/.
[5] "A Photographer Tried to Get His Photos Removed from an AI Dataset. He Got an Invoice Instead." (in en-us). 28 April 2023. https://www.vice.com/en/article/a-photographer-tried-to-get-his-photos-removed-from-an-ai-dataset-he-got-an-invoice-instead/.
[6] Goldstein, Paul; Stuetzle, Christiane; Bischoff, Susan (2024-11-13). "Kneschke vs. LAION - Landmark Ruling on TDM exceptions for AI training data – Part 1" (in en-US). https://copyrightblog.kluweriplaw.com/2024/11/13/kneschke-vs-laion-landmark-ruling-on-tdm-exceptions-for-ai-training-data-part-1/.
[7] Alford, Anthony (17 May 2022). "LAION Releases Five Billion Image-Text Pair Dataset LAION-5B". InfoQ. https://www.infoq.com/news/2022/05/laion-5b-image-text-dataset/.
[8] Edwards, Benj (21 September 2022). "Artist finds private medical record photos in popular AI training data set". Ars Technica. https://arstechnica.com/information-technology/2022/09/artist-finds-private-medical-record-photos-in-popular-ai-training-data-set/.
[9] Schuhmann, Christoph (8 August 2021). "LAION-400-Million Open Dataset". LAION blog. https://laion.ai/blog/laion-400-open-dataset/.
[10] Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Kamyar Seyed Ghasemipour, Seyed; Karagol Ayan, Burcu; Sara Mahdavi, S.; Gontijo Lopes, Rapha; Salimans, Tim; Ho, Jonathan; J Fleet, David; Norouzi, Mohammad (23 May 2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv:2205.11487 [cs.CV].
[11] Beaumont, Romain (3 March 2022). "LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets". LAION blog. https://laion.ai/blog/laion-5b/.
[12] Wiggers, Kyle (12 August 2022). "This startup is setting a DALL-E 2-like AI free, consequences be damned". TechCrunch. https://techcrunch.com/2022/08/12/a-startup-wants-to-democratize-the-tech-behind-dall-e-2-consequences-be-damned/.
[13] Birhane, Abeba; Prabhu, Vinay Uday; Kahembwe, Emmanuel (2021). Multimodal datasets: misogyny, pornography, and malignant stereotypes.
[14] Birhane, Abeba; Prabhu, Vinay; Han, Sang; Boddeti, Vishnu Naresh; Luccioni, Alexandra Sasha (2023-11-06). Into the LAIONs Den: Investigating Hate in Multimodal Datasets.
[15] Brunner, Katharina; Harlan, Elisa (2023-06-07). "We Are All Raw Material for AI". https://interaktiv.br.de/ki-trainingsdaten/en/index.html.
[16] Cole, Samantha (20 December 2023). "Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material" (in en). https://www.404media.co/laion-datasets-removed-stanford-csam-child-abuse/. Retrieved 22 December 2023.
[17] Belanger, Ashley (2024-08-30). "Nonprofit scrubs illegal content from controversial AI training dataset" (in en-us). https://arstechnica.com/tech-policy/2024/08/nonprofit-scrubs-illegal-content-from-controversial-ai-training-dataset/.
[18] Open-Assistant, LAION AI, 2023-03-09, https://github.com/LAION-AI/Open-Assistant, retrieved 2023-03-09.
[19] Köpf, Andreas; Kilcher, Yannic; von Rütte, Dimitri; Anagnostidis, Sotiris; Tam, Zhi-Rui; Stevens, Keith; Barhoum, Abdullah; Duc, Nguyen Minh; Stanley, Oliver; Nagyfi, Richárd; ES, Shahul; Suri, Sameer; Glushkov, David; Dantuluri, Arnav; Maguire, Andrew (2023-04-14). "OpenAssistant Conversations -- Democratizing Large Language Model Alignment". arXiv:2304.07327 [cs.CL].
[20] "Open Assistant: Explore the Possibilities of Open and Collaborative Chatbot Development" (in en-US). https://www.kdnuggets.com/open-assistant-explore-the-possibilities-of-open-and-collaborative-chatbot-development.html.


Original source: https://en.wikipedia.org/wiki/LAION.


