Software:LAION

OpenAssistant
	Screenshot of the data collection web portal
Developer(s)	LAION and contributors
Initial release	15 April 2023
Type	Large Language Model; Generative pre-trained transformer; Chatbot
License	Apache License 2.0
Website	open-assistant.io

LAION
Type	Non-profit
Industry	Artificial intelligence
Founder	Christoph Schuhmann; Jenia Jitsev; Richard Vencu; Robert Kaczmarczyk; Theo Coombes; Mehdi Cherti; Aarush Katta; Jan Ebert

Short description: Non-profit German artificial intelligence organization

LAION (acronym for Large-scale Artificial Intelligence Open Network) is a German non-profit which makes open-sourced artificial intelligence models and datasets.^[1] It is best known for releasing several large datasets of images and captions scraped from the web, which have been used to train high-profile text-to-image models, including Stable Diffusion and Imagen.^[2]^[3]

In February 2023, LAION was named as a non-party in the Getty Images lawsuit against Stability AI.^[4] In April 2023, LAION was directly sued by a German photographer who wanted to have his images removed from the training set.^[5]

On April 15, 2023, LAION and contributors publicly released OpenAssistant, an open-source AI assistant chatbot.

Image datasets

LAION has publicly released several large datasets of image-caption pairs which have been widely used by AI researchers. The data is derived from Common Crawl, a dataset of scraped web pages. The developers searched the crawled HTML for <img> tags and treated their alt attributes as captions. They then used CLIP to identify and discard images whose content did not appear to match their captions.^[6] LAION does not host the scraped images themselves; rather, the datasets contain URLs pointing to the images, which researchers must download themselves.^[7]
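The two steps described above (collecting alt-text captions from <img> tags, then discarding pairs whose image and caption do not match) can be illustrated with a minimal Python sketch. This is not LAION's actual pipeline: the names ImgAltExtractor, extract_pairs and keep_pair are hypothetical, the embeddings are assumed to come from a CLIP-style model that is not loaded here, and the default threshold of 0.3 is only an illustrative value.

```python
from html.parser import HTMLParser
from math import sqrt


class ImgAltExtractor(HTMLParser):
    """Collects (src, alt) pairs from <img> tags that have non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; self-closing <img/> also lands here.
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        alt = (attr.get("alt") or "").strip()
        if src and alt:
            # Only the URL is stored, never the image bytes.
            self.pairs.append((src, alt))


def extract_pairs(html):
    """Return a list of (image URL, caption) candidates found in an HTML string."""
    parser = ImgAltExtractor()
    parser.feed(html)
    return parser.pairs


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keep_pair(image_embedding, text_embedding, threshold=0.3):
    """Keep a pair only if its embeddings are similar enough.

    The embeddings are assumed to be produced elsewhere by a CLIP-style model;
    the threshold is an illustrative assumption, not a documented setting.
    """
    return cosine_similarity(image_embedding, text_embedding) >= threshold
```

In this sketch, extract_pairs would be run over each crawled page, and keep_pair would be applied once image and caption embeddings had been computed for every candidate pair.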

The first such dataset, LAION-400M, was released in August 2021 and consisted of 400 million image-caption pairs. The pairs were extracted from a random subset of webpages scraped by Common Crawl between 2014 and 2021.^[8] It was an attempt to recreate the process used by OpenAI to collect the 400 million image-caption pairs used to train the CLIP model; OpenAI had chosen to open-source the model's code and weights, but not its training dataset.^[6] Imagen, a text-to-image model announced by Google Brain in 2022, was trained on LAION-400M in combination with private internal datasets.^[9]

A successor of more than 5 billion pairs, LAION-5B, was released in March 2022.^[10] At the time of its release, it was the largest freely available dataset of image-caption pairs in existence.^[6] Its creation was funded by Doodlebot, Hugging Face and Stability AI, the AI company that funded the Stable Diffusion text-to-image model, which was trained on the dataset.^[11]

Criticism

Several studies show that LAION-5B contains problematic image-text pairs involving rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.^[12]^[13]

An investigation by Bayerischer Rundfunk showed that LAION's datasets, hosted on Hugging Face, contain large amounts of private and sensitive data.^[14]

In December 2023, the Stanford Internet Observatory released a report on LAION-5B that found 3,226 suspected instances of links to child sexual abuse material, 1,008 of which were externally validated. In response, LAION temporarily removed LAION-5B and LAION-400M, citing its "zero tolerance policy for illegal content" and "an abundance of caution".^[15]

OpenAssistant

OpenAssistant is an open-source, chat-based artificial intelligence (AI) assistant that understands tasks, can interact with third-party systems, and can retrieve information dynamically to do so. The project is developed by a group of volunteers in collaboration with LAION. One of its development goals is free access to large language models that can be run locally on consumer hardware.^[16]^[17] The project is backed by a worldwide crowdsourcing effort involving over 13,500 volunteers who have created 600k human-generated data points.^[17]^[18]

References

[1] "About". LAION.ai. https://laion.ai/about/.
[2] Edwards, Benj (15 September 2022). "Have AI image generators assimilated your art? New tool lets you check". Ars Technica. https://arstechnica.com/information-technology/2022/09/have-ai-image-generators-assimilated-your-art-new-tool-lets-you-check/.
[3] Newman, Marissa; Cantrill, Aggi (24 April 2023). "The Future of AI Relies on a High School Teacher's Free Database". Bloomberg News. https://www.bloomberg.com/news/features/2023-04-24/a-high-school-teacher-s-free-image-database-powers-ai-unicorns.
[4] "Getty Images (US), Inc. v. Stability AI, Inc., 1:23-cv-00135". https://www.courtlistener.com/docket/66788385/getty-images-us-inc-v-stability-ai-inc/.
[5] "A Photographer Tried to Get His Photos Removed from an AI Dataset. He Got an Invoice Instead.". https://www.vice.com/en/article/pkapb7/a-photographer-tried-to-get-his-photos-removed-from-an-ai-dataset-he-got-an-invoice-instead/.
[6] Alford, Anthony (17 May 2022). "LAION Releases Five Billion Image-Text Pair Dataset LAION-5B". InfoQ. https://www.infoq.com/news/2022/05/laion-5b-image-text-dataset/.
[7] Edwards, Benj (21 September 2022). "Artist finds private medical record photos in popular AI training data set". Ars Technica. https://arstechnica.com/information-technology/2022/09/artist-finds-private-medical-record-photos-in-popular-ai-training-data-set/.
[8] Schuhmann, Christoph (8 August 2021). "LAION-400-Million Open Dataset". LAION blog. https://laion.ai/blog/laion-400-open-dataset/.
[9] Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Kamyar Seyed Ghasemipour, Seyed; Karagol Ayan, Burcu; Sara Mahdavi, S.; Gontijo Lopes, Rapha; Salimans, Tim; Ho, Jonathan; J Fleet, David; Norouzi, Mohammad (23 May 2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv:2205.11487 [cs.CV].
[10] Beaumont, Romain (3 March 2022). "LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets". LAION blog. https://laion.ai/blog/laion-5b/.
[11] Wiggers, Kyle (12 August 2022). "This startup is setting a DALL-E 2-like AI free, consequences be damned". TechCrunch. https://techcrunch.com/2022/08/12/a-startup-wants-to-democratize-the-tech-behind-dall-e-2-consequences-be-damned/.
[12] Birhane, Abeba; Prabhu, Vinay Uday; Kahembwe, Emmanuel (2021). Multimodal datasets: misogyny, pornography, and malignant stereotypes.
[13] Birhane, Abeba; Prabhu, Vinay; Han, Sang; Boddeti, Vishnu Naresh; Luccioni, Alexandra Sasha (2023-11-06). Into the LAIONs Den: Investigating Hate in Multimodal Datasets.
[14] Brunner, Katharina; Harlan, Elisa. "We Are All Raw Material for AI". https://interaktiv.br.de/ki-trainingsdaten/en/index.html.
[15] Cole, Samantha (20 December 2023). "Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material". https://www.404media.co/laion-datasets-removed-stanford-csam-child-abuse/. Retrieved 22 December 2023.
[16] Open-Assistant, LAION AI, 2023-03-09, https://github.com/LAION-AI/Open-Assistant, retrieved 2023-03-09.
[17] Köpf, Andreas; Kilcher, Yannic; von Rütte, Dimitri; Anagnostidis, Sotiris; Tam, Zhi-Rui; Stevens, Keith; Barhoum, Abdullah; Duc, Nguyen Minh; Stanley, Oliver; Nagyfi, Richárd; ES, Shahul; Suri, Sameer; Glushkov, David; Dantuluri, Arnav; Maguire, Andrew (2023-04-14). "OpenAssistant Conversations -- Democratizing Large Language Model Alignment". arXiv:2304.07327 [cs.CL].
[18] "Open Assistant: Explore the Possibilities of Open and Collaborative Chatbot Development". https://www.kdnuggets.com/open-assistant-explore-the-possibilities-of-open-and-collaborative-chatbot-development.html.

Original source: https://en.wikipedia.org/wiki/LAION.
