AlexNet: Difference between revisions

From HandWiki
imported>StanislovAI
{{short description|Convolutional neural network}}
{{Infobox software
'''AlexNet''' is the name of a [[Convolutional neural network|convolutional neural network]] (CNN) architecture, designed by [[Biography:Alex Krizhevsky|Alex Krizhevsky]] in collaboration with [[Biography:Ilya Sutskever|Ilya Sutskever]] and [[Biography:Geoffrey Hinton|Geoffrey Hinton]], who was Krizhevsky's Ph.D. advisor at the University of Toronto.<ref name =":1">{{Cite web|url=https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/|title=The data that transformed AI research—and possibly the world|first=Dave|last=Gershgorn|website=Quartz|date=26 July 2017 }}</ref><ref name=":0">{{Cite journal|last1=Krizhevsky|first1=Alex|last2=Sutskever|first2=Ilya|last3=Hinton|first3=Geoffrey E.|date=2017-05-24|title=ImageNet classification with deep convolutional neural networks|url=https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf|journal=Communications of the ACM|volume=60|issue=6|pages=84–90|doi=10.1145/3065386|s2cid=195908774|issn=0001-0782|doi-access=free}}</ref>
| name = AlexNet
| logo = 150px
| developer = [[Biography:Alex Krizhevsky|Alex Krizhevsky]], [[Biography:Ilya Sutskever|Ilya Sutskever]], and [[Biography:Geoffrey Hinton|Geoffrey Hinton]]
| released = {{Start date and age|2011|06|28}}
| genre = [[Convolutional neural network]]
| license = New BSD License
| repo = {{URL|https://code.google.com/archive/p/cuda-convnet/}}
| programming language = [[Software:CUDA|CUDA]], [[C++]]
| website =  
}}
[[File:AlexNet_architecture.png|thumb|362x362px|AlexNet architecture and a possible modification. At the top is half of the original AlexNet, which is divided into two halves, one for each GPU. At the bottom is the same architecture, but the final "projection" layer is replaced by another that projects to fewer outputs. If one freezes the remaining model and only fine-tunes the last layer, one can obtain another vision model at a significantly lower cost than training one from scratch.]]
[[File:AlexNet_block_diagram.svg|thumb|245x245px|LeNet (left) and AlexNet (right) block diagram]]
'''AlexNet''' is a [[Convolutional neural network|convolutional neural network]] architecture developed for image classification tasks, notably achieving prominence through its performance in the [[ImageNet]] Large Scale Visual Recognition Challenge (ILSVRC). It classifies images into 1,000 distinct object categories and is regarded as the first widely recognized application of deep convolutional networks in large-scale visual recognition.


AlexNet competed in the ImageNet Large Scale Visual Recognition Challenge on September 30, 2012.<ref name =":2">{{Cite web|url=https://image-net.org/challenges/LSVRC/2012/results.html|title=ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC2012)|website=image-net.org}}</ref> The network achieved a top-5 error of 15.3%, more than 10.8 percentage points lower than that of the runner-up. The original paper's primary result was that the depth of the model was essential for its high performance; training such a deep model was computationally expensive, but was made feasible by the use of [[Graphics processing unit|graphics processing units]] (GPUs) during training.<ref name=":0" />
Developed in 2012 by [[Biography:Alex Krizhevsky|Alex Krizhevsky]] in collaboration with [[Biography:Ilya Sutskever|Ilya Sutskever]] and his Ph.D. advisor [[Biography:Geoffrey Hinton|Geoffrey Hinton]] at the [[Organization:University of Toronto|University of Toronto]], the model contains 60 million parameters and 650,000 [[Artificial neuron|neurons]].<ref name=":0">{{Cite journal |last1=Krizhevsky |first1=Alex |last2=Sutskever |first2=Ilya |last3=Hinton |first3=Geoffrey E. |date=2017-05-24 |title=ImageNet classification with deep convolutional neural networks |url=https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf |journal=Communications of the ACM |volume=60 |issue=6 |pages=84–90 |doi=10.1145/3065386 |issn=0001-0782 |s2cid=195908774 |doi-access=free}}</ref> The original paper's primary result was that the depth of the model was essential for its high performance, which was computationally expensive, but made feasible due to the utilization of [[Graphics processing unit|graphics processing unit]]s (GPUs) during training.<ref name=":0" />


== Historic context ==
The three formed team SuperVision and submitted AlexNet in the ImageNet Large Scale Visual Recognition Challenge on September 30, 2012.<ref name=":2">{{Cite web |title=ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC2012) |url=https://image-net.org/challenges/LSVRC/2012/results.html |website=image-net.org}}</ref> The network won the contest with a top-5 error rate of 15.3%, more than 10.8 percentage points lower than that of the runner-up.


AlexNet was not the first fast GPU-implementation of a CNN to win an image recognition contest. A CNN on GPU by K. Chellapilla et al. (2006) was 4 times faster than an equivalent implementation on CPU.<ref>{{cite book |author1=Kumar Chellapilla |title=Tenth International Workshop on Frontiers in Handwriting Recognition |author2=Sidd Puri |author3=Patrice Simard |date=2006 |publisher=Suvisoft |editor1-last=Lorette |editor1-first=Guy |chapter=High Performance Convolutional Neural Networks for Document Processing |chapter-url=https://hal.inria.fr/inria-00112631/document }}</ref> A deep CNN of [https://scholar.google.com/citations?user=dayrypAAAAAJ&hl=en Dan Cireșan] et al. (2011) at IDSIA was already 60 times faster<ref name="flexible">{{cite journal|last=Cireșan|first=Dan|author2=Ueli Meier |author3=Jonathan Masci |author4=Luca M. Gambardella |author5=Jurgen Schmidhuber |title=Flexible, High Performance Convolutional Neural Networks for Image Classification|journal=Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence-Volume Volume Two|year=2011|volume=2|pages=1237–1242|url=http://www.idsia.ch/~juergen/ijcai2011.pdf|access-date=17 November 2013}}</ref> and outperformed predecessors in August 2011.<ref>{{Cite web|url=http://benchmark.ini.rub.de/?section=gtsrb&subsection=results|title=IJCNN 2011 Competition result table|website=OFFICIAL IJCNN2011 COMPETITION|language=en-US|access-date=2019-01-14|date=2010}}</ref> Between May 15, 2011, and September 10, 2012, their CNN won no fewer than four image competitions.<ref>{{Cite web|url=http://people.idsia.ch/~juergen/computer-vision-contests-won-by-gpu-cnns.html|last1=Schmidhuber|first1=Jürgen|title=History of computer vision contests won by deep CNNs on GPU|language=en-US|access-date=14 January 2019|date=17 March 2017}}</ref><ref name="schdeepscholar">{{cite journal|last1=Schmidhuber|first1=Jürgen|title=Deep 
Learning|journal=Scholarpedia|url=http://www.scholarpedia.org/article/Deep_Learning|date=2015|volume=10|issue=11|pages=1527–54|pmid=16764513|doi=10.1162/neco.2006.18.7.1527|citeseerx=10.1.1.76.1541|s2cid=2309950}}</ref> They also significantly improved on the best performance in the literature for multiple image [[Database|database]]s.<ref name="mcdns">{{cite book |last1=Cireșan |first1=Dan |first2=Ueli |last2=Meier |first3=Jürgen |last3=Schmidhuber |title=2012 IEEE Conference on Computer Vision and Pattern Recognition |chapter=Multi-column deep neural networks for image classification |date=June 2012 |pages=3642–3649 |doi=10.1109/CVPR.2012.6248110 |arxiv=1202.2745 |isbn=978-1-4673-1226-4 |oclc=812295155 |publisher=[[Organization:Institute of Electrical and Electronics Engineers|Institute of Electrical and Electronics Engineers]] (IEEE) |location=New York, NY|citeseerx=10.1.1.300.3283 |s2cid=2161592 }}</ref>
The architecture influenced a large body of subsequent work in [[Deep learning|deep learning]], especially in applying neural networks to [[Computer vision|computer vision]].


According to the AlexNet paper,<ref name=":0" /> Cireșan's earlier net is "somewhat similar." Both were originally written with [[Software:CUDA|CUDA]] to run with GPU support. In fact, both are actually just variants of the CNN designs introduced by [[Biography:Yann LeCun|Yann LeCun]] et al. (1989)<ref name="LeCun Boser Denker Henderson 1989 pp. 541–551">{{cite journal <!-- Citation bot bypass--> |last=LeCun |first=Y. |last2=Boser |first2=B. |last3=Denker |first3=J. S. |last4=Henderson |first4=D. |last5=Howard |first5=R. E. |last6=Hubbard |first6=W. |last7=Jackel |first7=L. D. |title=Backpropagation Applied to Handwritten Zip Code Recognition |journal=Neural Computation |publisher=MIT Press - Journals |volume=1 |issue=4 |year=1989 |issn=0899-7667 |url=http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf <!-- URL!=DOI; URL is free, not the DOI --> |doi=10.1162/neco.1989.1.4.541 |pages=541–551 |oclc=364746139}}</ref><ref name="lecun98">{{cite journal|last=LeCun|first=Yann|author2=Léon Bottou |author3=Yoshua Bengio |author4=Patrick Haffner |title=Gradient-based learning applied to document recognition|journal=Proceedings of the IEEE|year=1998|volume=86|issue=11|pages=2278–2324|url=http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf|access-date=October 7, 2016|doi=10.1109/5.726791|citeseerx=10.1.1.32.9552|s2cid=14542261 }}</ref> who applied the [[Backpropagation|backpropagation]] algorithm to a variant of [[Biography:Kunihiko Fukushima|Kunihiko Fukushima]]'s original CNN architecture called "[[Neocognitron|neocognitron]]."<ref name=fukuneoscholar>{{cite journal | last1 = Fukushima | first1 = K. 
| year = 2007 | title = Neocognitron | journal = Scholarpedia | volume = 2 | issue = 1| page = 1717 | doi=10.4249/scholarpedia.1717| bibcode = 2007SchpJ...2.1717F | doi-access = free }}</ref><ref name="intro">{{cite journal|last=Fukushima|first=Kunihiko|title=Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position|journal=Biological Cybernetics|year=1980|volume=36|issue=4|pages=193–202|url=http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf|access-date=16 November 2013|doi=10.1007/BF00344251|pmid=7370364|s2cid=206775608}}</ref> The architecture was later modified by J. Weng's method called [[Convolutional neural network#Pooling layer|max-pooling]].<ref name="weng1993">{{cite journal |first1=J |last1=Weng |first2=N |last2=Ahuja |first3=TS |last3=Huang |title=Learning recognition and segmentation of 3-D objects from 2-D images |journal=Proc. 4th International Conf. Computer Vision |year=1993 |pages=121–128 }}</ref><ref name="schdeepscholar" />
== Architecture ==
AlexNet contains eight [[Layer (deep learning)|layers]]: the first five are [[Convolution|convolution]]al layers, some of them followed by [[Pooling layer#Max pooling|max-pooling]] layers, and the last three are [[Convolutional neural network#Fully connected layer|fully connected layers]]. The network, except the last layer, is split into two copies, each run on one GPU, because the network did not fit the [[VRAM]] of a single [[Company:Nvidia|Nvidia]] GTX 580 3GB GPU.<ref name=":0" />{{Pg|location=Section 3.2|quote=A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it. It turns out that 1.2 million training examples are enough to train networks which are too big to fit on one GPU. Therefore we spread the net across two GPUs.}} The entire structure can be written as<blockquote>(CONV → RN → MP)<sup>2</sup> → (CONV<sup>3</sup> → MP) → (FC → DO)<sup>2</sup> → Linear → softmax</blockquote>where


In 2015, AlexNet was outperformed by Microsoft Research Asia's [[Residual neural network|very deep CNN with over 100 layers]], which won the ImageNet 2015 contest.<ref>{{cite book|last1=He|first1=Kaiming|last2=Zhang|first2=Xiangyu|last3=Ren|first3=Shaoqing|last4=Sun|first4=Jian|title=2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) |chapter=Deep Residual Learning for Image Recognition |date=2016|pages=770–778|doi=10.1109/CVPR.2016.90|arxiv=1512.03385|isbn=978-1-4673-8851-1|s2cid=206594692}}</ref>
* CONV = convolutional layer (with ReLU activation)
* RN = local response normalization
* MP = max-pooling
* FC = fully connected layer (with ReLU activation)
* Linear = fully connected layer (without activation)
* DO = [[Dropout (neural networks)|dropout]]
 
Notably, the convolutional layers 3, 4, 5 were connected to one another without any pooling or normalization. It used the non-saturating ReLU [[Activation function|activation function]], which trained better than tanh and [[Sigmoid function|sigmoid]].<ref name=":0" />
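
The figure of 60 million parameters can be checked with a short calculation. The following is a minimal sketch in Python; the kernel sizes, channel counts, and the 6×6×256 input to the first fully connected layer are taken from the original paper rather than from this article, and the layer names are hypothetical labels used only for illustration.

```python
# Layer sizes are assumptions taken from the original AlexNet paper.
# Conv entry: (name, kernel size, input channels seen by each filter, output channels).
# Layers 2, 4 and 5 only see the feature maps held on their own GPU (48 or 192).
CONV_LAYERS = [
    ("conv1", 11, 3, 96),
    ("conv2", 5, 48, 256),
    ("conv3", 3, 256, 384),
    ("conv4", 3, 192, 384),
    ("conv5", 3, 192, 256),
]
# Fully connected entry: (name, input features, output features).
FC_LAYERS = [
    ("fc6", 6 * 6 * 256, 4096),
    ("fc7", 4096, 4096),
    ("fc8", 4096, 1000),
]


def count_parameters():
    """Return a dict of per-layer parameter counts (weights plus biases)."""
    counts = {}
    for name, kernel, c_in, c_out in CONV_LAYERS:
        counts[name] = kernel * kernel * c_in * c_out + c_out
    for name, n_in, n_out in FC_LAYERS:
        counts[name] = n_in * n_out + n_out
    return counts


PARAMS = count_parameters()
TOTAL_PARAMS = sum(PARAMS.values())
print(f"total parameters: {TOTAL_PARAMS:,}")
```

Under these assumptions the total comes to roughly 61 million, and the three fully connected layers account for most of it.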
 
== Training ==
The ImageNet training set contained 1.2 million images. The model was trained for 90 epochs over a period of five to six days using two Nvidia GTX 580 GPUs (3GB each).<ref name=":0" /> These GPUs have a theoretical performance of 1.581 [[Floating point operations per second|TFLOPS]] in float32 and were priced at US$500 upon release.<ref>{{Cite web |date=2024-11-12 |title=NVIDIA GeForce GTX 580 Specs |url=https://www.techpowerup.com/gpu-specs/geforce-gtx-580.c270 |access-date=2024-11-12 |website=TechPowerUp |language=en}}</ref> Each forward pass of AlexNet required approximately 1.43 GFLOPs.<ref>{{Cite web |title=calflops: a FLOPs and Params calculate tool for neural networks |url=https://pypi.org/project/calflops/ |access-date=2024-12-10 |website=pypi.org}}</ref> Based on these values, the two GPUs together were theoretically capable of performing over 2,200 forward passes per second under ideal conditions.
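
The throughput estimate above follows from dividing the combined peak rate of the two GPUs by the cost of one forward pass. A minimal sketch of that arithmetic in Python, using only the figures quoted in this section (the constant and function names are illustrative):

```python
GPU_PEAK_FLOPS = 1.581e12    # theoretical float32 peak of one GTX 580, as quoted above
NUM_GPUS = 2
FLOPS_PER_FORWARD = 1.43e9   # approximate cost of one forward pass, as quoted above


def ideal_forward_passes_per_second(peak=GPU_PEAK_FLOPS, gpus=NUM_GPUS,
                                    cost=FLOPS_PER_FORWARD):
    """Upper bound on forward passes per second, assuming perfect utilization."""
    return gpus * peak / cost


PASSES_PER_SECOND = ideal_forward_passes_per_second()
print(f"{PASSES_PER_SECOND:.0f} forward passes per second (ideal)")
```

This is an upper bound only: it ignores memory bandwidth, communication between the GPUs, and the backward pass needed for training.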
 
The dataset images were stored in JPEG format. They took up 27GB of disk. The neural network took up 2GB of RAM on each GPU, and around 5GB of system RAM during training. The GPUs were responsible for training, while the CPUs were responsible for loading images from disk, and data-augmenting the images.<ref>[https://www.image-net.org/static_files/files/supervision.pdf SuperVision team invited talk] at the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) workshop on 2012-10-12, held within the ECCV 2012, in Florence, Italy.</ref>
 
AlexNet was trained with momentum gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. Learning rate started at 10<sup>−2</sup> and was manually decreased 10-fold whenever validation error appeared to stop decreasing. It was reduced three times during training, ending at 10<sup>−5</sup>.
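
The update rule and learning-rate schedule described above can be sketched as follows. This is an illustrative Python version for a single scalar weight, following the form of the update given in the original paper (weight decay scaled by the learning rate); the function names are hypothetical and this is not the original CUDA implementation.

```python
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005


def sgd_momentum_step(weight, velocity, grad, lr):
    """One momentum update with weight decay; returns (new_weight, new_velocity)."""
    velocity = MOMENTUM * velocity - WEIGHT_DECAY * lr * weight - lr * grad
    weight = weight + velocity
    return weight, velocity


def learning_rate_schedule(initial=1e-2, reductions=3, factor=10.0):
    """Learning rates in use before and after each manual 10-fold reduction."""
    return [initial / factor ** i for i in range(reductions + 1)]


if __name__ == "__main__":
    w, v = 1.0, 0.0
    for lr in learning_rate_schedule():
        # A constant gradient of 2.0 stands in for a real mini-batch gradient.
        w, v = sgd_momentum_step(w, v, grad=2.0, lr=lr)
        print(f"lr={lr:g}  weight={w:.6f}  velocity={v:.6f}")
```

In the actual training run the reductions were triggered by hand when validation error stopped improving, so the fixed list here only illustrates the four learning-rate values that were used.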
 
After a fixed preprocessing step (the first item below), it used two forms of [[Data augmentation|data augmentation]] (the second and third items), both computed on the fly on the CPU, thus "computationally free":
 
* Each image from ImageNet was first scaled, so that its shorter side was of length 256. Then the central 256×256 patch was cropped out and normalized (dividing the pixel values so that they fall between 0 and 1, then subtracting by [0.485, 0.456, 0.406], then dividing by [0.229, 0.224, 0.225]. These are the mean and standard deviations for ImageNet, so this [[Whitening transformation|whitens]] the input data).
* Extracting random 224×224 patches (and their horizontal reflections) from the 256×256 crop. This increases the size of the training set 2048-fold.
* Randomly shifting the RGB value of each image along the three [[Principal component analysis|principal directions]] of the RGB values of its pixels.
The resolution 224×224 was picked, because 256 - 16 - 16 = 224, meaning that given a 256×256 image, framing out a width of 16 on its 4 sides results in a 224×224 image.
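
The 2048-fold figure can be reproduced by counting crop positions and reflections. The sketch below, in Python, assumes 32 offsets per axis (the count implied by the 2048 figure) and represents a crop by its top-left corner and a flip flag; the function names are hypothetical.

```python
import random

FULL_SIZE = 256
CROP_SIZE = 224


def augmentation_factor(full=FULL_SIZE, crop=CROP_SIZE):
    """Number of distinct (crop position, reflection) combinations."""
    offsets_per_axis = full - crop          # 32, as implied by the 2048 figure
    return offsets_per_axis * offsets_per_axis * 2   # times 2 for reflections


def random_crop_box(rng, full=FULL_SIZE, crop=CROP_SIZE):
    """Pick a random training crop; returns (top, left, flip)."""
    top = rng.randrange(full - crop)
    left = rng.randrange(full - crop)
    flip = rng.random() < 0.5
    return top, left, flip


if __name__ == "__main__":
    print(augmentation_factor())
    print(random_crop_box(random.Random(0)))
```

The resulting examples are highly interdependent, since neighbouring crops share almost all of their pixels.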
 
It used local response normalization, and [[Dropout (neural networks)|dropout regularization]] with drop probability 0.5.
 
All [[Weight initialization|weights were initialized]] from a [[Normal distribution|Gaussian distribution]] with mean 0 and standard deviation 0.01. Biases in convolutional layers 2, 4, and 5, and in all fully connected layers, were initialized to the constant 1 to avoid the dying ReLU problem.
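
A minimal sketch of this initialization scheme in Python, using only the standard library. The layer names are hypothetical labels, and setting the remaining biases to 0 follows the original paper rather than this article.

```python
import random

# Hypothetical names for the layers whose biases start at 1.
BIAS_ONE_LAYERS = {"conv2", "conv4", "conv5", "fc6", "fc7", "fc8"}


def init_layer(name, num_weights, num_biases, rng):
    """Return (weights, biases) for one layer as flat Python lists."""
    # Weights: zero-mean Gaussian with standard deviation 0.01.
    weights = [rng.gauss(0.0, 0.01) for _ in range(num_weights)]
    # Biases: constant 1 for the listed layers, 0 otherwise (assumption from the paper).
    bias_value = 1.0 if name in BIAS_ONE_LAYERS else 0.0
    biases = [bias_value] * num_biases
    return weights, biases


if __name__ == "__main__":
    w, b = init_layer("conv2", num_weights=1000, num_biases=4, rng=random.Random(0))
    print(len(w), b)
```

A positive starting bias gives each ReLU a positive input early in training, which is the motivation stated above.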
 
At test time, to use a trained AlexNet for predicting the class of an image, the image is first scaled so that its shorter side is of length 256, and the central 256×256 patch is cropped out. Then five 224×224 patches (the four corner patches and the center patch) and their horizontal reflections are computed, 10 patches in all. The network's predicted probabilities on all 10 patches are averaged to give the final predicted probability.
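
The ten-patch procedure can be sketched as follows. This Python example only computes the crop positions and averages per-patch probability vectors; the network itself is left out, and the function names are hypothetical.

```python
def ten_crop_boxes(full=256, crop=224):
    """Return the 10 test-time crops as (top, left, flip) tuples."""
    last = full - crop        # largest valid offset (32)
    center = last // 2        # 16, matching 256 - 16 - 16 = 224
    positions = [(0, 0), (0, last), (last, 0), (last, last), (center, center)]
    return [(top, left, flip) for flip in (False, True) for top, left in positions]


def average_predictions(per_patch_probs):
    """Average a list of equal-length probability vectors element-wise."""
    n = len(per_patch_probs)
    return [sum(column) / n for column in zip(*per_patch_probs)]


if __name__ == "__main__":
    for box in ten_crop_boxes():
        print(box)
    # Two made-up probability vectors stand in for real network outputs.
    print(average_predictions([[0.2, 0.8], [0.4, 0.6]]))
```

The same averaging function would also serve for combining the outputs of several networks, since it only requires probability vectors of equal length.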
 
=== ImageNet competition ===
The version they used to enter the 2012 ImageNet competition was an ensemble of 7 AlexNets.
 
Specifically, they trained 5 AlexNets of the previously described architecture (with 5 CONV layers) on the ILSVRC-2012 training set (1.2 million images). They also trained 2 variant AlexNets, obtained by adding one extra CONV layer over the last pooling layer. These were first trained on the entire ImageNet Fall 2011 release (15 million images in 22K categories) and then fine-tuned on the ILSVRC-2012 training set. The final system averaged the predicted probabilities of all 7 AlexNets.
 
== History ==
 
=== Previous work ===
{{comparison_image_neural_networks.svg}}
In 1980, [[Biography:Kunihiko Fukushima|Kunihiko Fukushima]] proposed an early CNN named [[Neocognitron|neocognitron]].<ref name="fukuneoscholar">{{cite journal |last1=Fukushima |first1=K. |year=2007 |title=Neocognitron |journal=Scholarpedia |volume=2 |issue=1 |page=1717 |bibcode=2007SchpJ...2.1717F |doi=10.4249/scholarpedia.1717 |doi-access=free}}</ref><ref name="intro">{{cite journal |last=Fukushima |first=Kunihiko |year=1980 |title=Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position |url=http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf |journal=Biological Cybernetics |volume=36 |issue=4 |pages=193–202 |doi=10.1007/BF00344251 |pmid=7370364 |s2cid=206775608 |access-date=16 November 2013}}</ref> It was trained by an [[Unsupervised learning|unsupervised learning]] algorithm. The [[LeNet|LeNet-5]] ([[Biography:Yann LeCun|Yann LeCun]] et al., 1989)<ref name="LeCun Boser Denker Henderson 1989 pp. 541–551">{{cite journal <!-- Citation bot bypass--> |last=LeCun |first=Y. |last2=Boser |first2=B. |last3=Denker |first3=J. S. |last4=Henderson |first4=D. |last5=Howard |first5=R. E. |last6=Hubbard |first6=W. |last7=Jackel |first7=L. D. 
|year=1989 |title=Backpropagation Applied to Handwritten Zip Code Recognition |url=http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf <!-- URL!=DOI; URL is free, not the DOI --> |journal=Neural Computation |publisher=MIT Press - Journals |volume=1 |issue=4 |pages=541–551 |doi=10.1162/neco.1989.1.4.541 |issn=0899-7667 |oclc=364746139}}</ref><ref name="lecun98">{{cite journal |last=LeCun |first=Yann |author2=Léon Bottou |author3=Yoshua Bengio |author4=Patrick Haffner |year=1998 |title=Gradient-based learning applied to document recognition |url=http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf |journal=Proceedings of the IEEE |volume=86 |issue=11 |pages=2278–2324 |citeseerx=10.1.1.32.9552 |doi=10.1109/5.726791 |s2cid=14542261 |access-date=October 7, 2016}}</ref> was trained by supervised learning with the [[Backpropagation|backpropagation]] algorithm, with an architecture that is essentially the same as AlexNet's on a smaller scale.
 
Max pooling was used in 1990 for speech processing (essentially a 1-dimensional CNN),<ref name="Yamaguchi111990">{{cite conference |last1=Yamaguchi |first1=Kouichi |last2=Sakamoto |first2=Kenji |last3=Akabane |first3=Toshio |last4=Fujimoto |first4=Yoshiji |date=November 1990 |title=A Neural Network for Speaker-Independent Isolated Word Recognition |url=https://www.isca-speech.org/archive/icslp_1990/i90_1077.html |conference=First International Conference on Spoken Language Processing (ICSLP 90) |location=Kobe, Japan |archive-url=https://web.archive.org/web/20210307233750/https://www.isca-speech.org/archive/icslp_1990/i90_1077.html |archive-date=2021-03-07 |access-date=2019-09-04 |url-status=dead}}</ref> and for image processing, was first used in the Cresceptron of 1992.<ref>{{cite conference |last=Weng |first=J. |last2=Ahuja |first2=N. |last3=Huang |first3=T.S. |date=1992 |title=Cresceptron: a self-organizing neural network which grows adaptively |publisher=IEEE |volume=1 |pages=576–581 |doi=10.1109/IJCNN.1992.287150 |isbn=978-0-7803-0559-5|url=https://vision.ai.illinois.edu/html-files-to-import/publications/cresceptron_1992.pdf}}</ref>
 
During the 2000s, as [[Graphics processing unit|GPU]] hardware improved, some researchers adapted it for [[General-purpose computing on graphics processing units|general-purpose computing]], including neural network training. K. Chellapilla et al. (2006) trained a CNN on GPU that was 4 times faster than an equivalent CPU implementation.<ref>{{cite book |author1=Kumar Chellapilla |title=Tenth International Workshop on Frontiers in Handwriting Recognition |author2=Sidd Puri |author3=Patrice Simard |date=2006 |publisher=Suvisoft |editor1-last=Lorette |editor1-first=Guy |chapter=High Performance Convolutional Neural Networks for Document Processing |chapter-url=https://hal.inria.fr/inria-00112631/document}}</ref> Raina et al. (2009) trained a [[Deep belief network|deep belief network]] with 100 million parameters on an Nvidia GeForce GTX 280 with up to a 70-fold speedup over CPUs.<ref>{{cite conference |last=Raina |first=Rajat |last2=Madhavan |first2=Anand |last3=Ng |first3=Andrew Y. |date=2009-06-14 |title=Large-scale deep unsupervised learning using graphics processors |url=https://dl.acm.org/doi/10.1145/1553374.1553486 |language=en |publisher=ACM |pages=873–880 |doi=10.1145/1553374.1553486 |isbn=978-1-60558-516-1|url-access=subscription }}</ref> A deep CNN of Dan Cireșan ''et al.'' (2011) at IDSIA was 60 times faster than an equivalent CPU implementation.<ref name="flexible">{{cite journal |last=Cireșan |first=Dan |author2=Ueli Meier |author3=Jonathan Masci |author4=Luca M.
Gambardella |author5=Jurgen Schmidhuber |year=2011 |title=Flexible, High Performance Convolutional Neural Networks for Image Classification |url=http://www.idsia.ch/~juergen/ijcai2011.pdf |journal=Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence-Volume Volume Two |volume=2 |pages=1237–1242 |access-date=17 November 2013}}</ref> Between May 15, 2011, and September 10, 2012, their CNN won four image competitions and achieved [[Social:State of the art|state of the art]] for multiple image [[Database|database]]s.<ref>{{Cite web |date=2010 |title=IJCNN 2011 Competition result table |url=http://benchmark.ini.rub.de/?section=gtsrb&subsection=results |access-date=2019-01-14 |website=OFFICIAL IJCNN2011 COMPETITION |language=en-US}}</ref><ref>{{Cite web |last1=Schmidhuber |first1=Jürgen |date=17 March 2017 |title=History of computer vision contests won by deep CNNs on GPU |url=http://people.idsia.ch/~juergen/computer-vision-contests-won-by-gpu-cnns.html |access-date=14 January 2019 |language=en-US}}</ref><ref name="mcdns">{{cite book |last1=Cireșan |first1=Dan |title=2012 IEEE Conference on Computer Vision and Pattern Recognition |last2=Meier |first2=Ueli |last3=Schmidhuber |first3=Jürgen |date=June 2012 |publisher=[[Organization:Institute of Electrical and Electronics Engineers|Institute of Electrical and Electronics Engineers]] (IEEE) |isbn=978-1-4673-1226-4 |location=New York, NY |pages=3642–3649 |chapter=Multi-column deep neural networks for image classification |citeseerx=10.1.1.300.3283 |doi=10.1109/CVPR.2012.6248110 |oclc=812295155 |arxiv=1202.2745 |s2cid=2161592}}</ref> According to the AlexNet paper,<ref name=":0" /> Cireșan's earlier net is "somewhat similar". Both were written with [[Software:CUDA|CUDA]] to run on GPU.
 
=== Computer vision ===
During the 1990–2010 period, neural networks were not better than other machine learning methods like [[Kernel regression|kernel regression]], [[Support vector machine|support vector machine]]s, [[AdaBoost]], structured estimation,<ref>{{Cite journal |last1=Taskar |first1=Ben |last2=Guestrin |first2=Carlos |last3=Koller |first3=Daphne |date=2003 |title=Max-Margin Markov Networks |url=https://proceedings.neurips.cc/paper/2003/hash/878d5691c824ee2aaf770f7d36c151d6-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=MIT Press |volume=16}}</ref> among others. For computer vision in particular, much progress came from manual [[Feature engineering|feature engineering]], such as [[Scale-invariant feature transform|SIFT]] features, [[Speeded up robust features|SURF]] features, [[Histogram of oriented gradients|HoG]] features, [[Bag-of-words model in computer vision|bags of visual words]], etc. It was a minority position in computer vision that features can be learned directly from data, a position which became dominant after AlexNet.<ref name=":3">{{Cite book |last1=Zhang |first1=Aston |title=Dive into deep learning |last2=Lipton |first2=Zachary |last3=Li |first3=Mu |last4=Smola |first4=Alexander J. |date=2024 |publisher=Cambridge University Press |isbn=978-1-009-38943-3 |location=Cambridge New York Port Melbourne New Delhi Singapore |chapter=8.1. Deep Convolutional Neural Networks (AlexNet) |chapter-url=https://d2l.ai/chapter_convolutional-modern/alexnet.html}}</ref>
 
In 2011, [[Biography:Geoffrey Hinton|Geoffrey Hinton]] started reaching out to colleagues about "What do I have to do to convince you that neural networks are the future?", and [[Biography:Jitendra Malik|Jitendra Malik]], a sceptic of neural networks, recommended the PASCAL Visual Object Classes challenge. Hinton said its dataset was too small, so Malik recommended to him the ImageNet challenge.<ref>{{Cite book |last=Li |first=Fei Fei |title=The worlds I see: curiosity, exploration, and discovery at the dawn of AI |date=2023 |publisher=Moment of Lift Books ; Flatiron Books |isbn=978-1-250-89793-0 |edition=First |location=New York}}</ref>
 
The [[ImageNet]] dataset, which became central to AlexNet's success, was created by [[Biography:Fei-Fei Li|Fei-Fei Li]] and her collaborators beginning in 2007. Aiming to advance visual recognition through large-scale data, Li built a dataset far larger than earlier efforts, ultimately containing over 14 million labeled images across 22,000 categories. The images were labeled using Amazon Mechanical Turk and organized via the [[Software:WordNet|WordNet]] hierarchy. Initially met with skepticism, ImageNet later became the foundation of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and a key resource in the rise of deep learning.<ref name=":1" />
 
Sutskever and Krizhevsky were both graduate students. Before 2011, Krizhevsky had already written <code>cuda-convnet</code> to train small CNNs on [[CIFAR-10]] with a single GPU. Sutskever convinced Krizhevsky, who could do [[General-purpose computing on graphics processing units|GPGPU]] well, to train a CNN on ImageNet, with Hinton serving as principal investigator. So Krizhevsky extended <code>cuda-convnet</code> for multi-GPU training. AlexNet was trained on 2 Nvidia GTX 580 in Krizhevsky's bedroom at his parents' house. During 2012, Krizhevsky performed [[Hyperparameter optimization|hyperparameter optimization]] on the network until it [[ImageNet#History of the ImageNet challenge|won the ImageNet competition]] later the same year. Hinton commented that, "Ilya thought we should do it, Alex made it work, and I got the Nobel Prize".<ref>{{Cite web |last=hhackford |date=2025-03-20 |title=CHM Releases AlexNet Source Code |url=https://computerhistory.org/blog/chm-releases-alexnet-source-code/ |access-date=2025-03-22 |website=CHM |language=en}}</ref> At the 2012 European Conference on Computer Vision, following AlexNet's win, researcher [[Biography:Yann LeCun|Yann LeCun]] described the model as "an unequivocal turning point in the history of computer vision".<ref name=":1" />
 
AlexNet's success in 2012 was enabled by the convergence of three developments that had matured over the previous decade: large-scale labeled datasets, general-purpose GPU computing, and improved training methods for deep neural networks. The availability of ImageNet provided the data necessary for training deep models on a broad range of object categories. Advances in GPU programming through [[Company:Nvidia|Nvidia]]'s [[Software:CUDA|CUDA]] platform enabled practical training of large models. Together with algorithmic improvements, these factors enabled AlexNet to achieve high performance on large-scale visual recognition benchmarks.<ref name=":1">{{Cite web |date=11 November 2024 |title=How a stubborn computer scientist accidentally launched the deep learning boom |url=https://arstechnica.com/ai/2024/11/how-a-stubborn-computer-scientist-accidentally-launched-the-deep-learning-boom/ |access-date=24 March 2025 |website=Ars Technica}}</ref> Reflecting on its significance over a decade later, Fei-Fei Li stated in a 2024 interview: "That moment was pretty symbolic to the world of AI because three fundamental elements of modern AI converged for the first time".<ref name=":1" />
 
While AlexNet and LeNet share essentially the same design and algorithm, AlexNet is much larger than LeNet and was trained on a much larger dataset on much faster hardware. Over the period of 20 years, both data and compute became cheaply available.<ref name=":3" />


== Network design ==
=== Subsequent work ===
AlexNet is highly influential, resulting in much subsequent work in using CNNs for computer vision and using GPUs to accelerate deep learning. As of early 2025, the AlexNet paper has been cited over 184,000 times according to Google Scholar.<ref>[https://scholar.google.com/citations?view_op=view_citation&hl=en&user=xegzhJcAAAAJ&citation_for_view=xegzhJcAAAAJ:u5HHmVD_uO8C AlexNet paper on Google Scholar ]</ref>


AlexNet contained eight layers; the first five were [[Convolution|convolutional]] layers, some of them followed by [[Convolutional neural network#Pooling layer|max-pooling]] layers, and the last three were fully connected layers. The network, except the last layer, is split into two copies, each run on one GPU.<ref name=":0" /> The entire structure can be written as <math display="block">(CNN \to RN\to MP)^2 \to (CNN^3 \to MP) \to (FC \to DO)^2 \to Linear \to softmax </math> where
At the time of publication, there was no framework available for GPU-based neural network training and inference. The codebase for AlexNet was released under a BSD license and was commonly used in neural network research for several years afterward.<ref>{{Cite web |last=Krizhevsky |first=Alex |date=July 18, 2014 |title=cuda-convnet: High-performance C++/CUDA implementation of convolutional neural networks |url=https://code.google.com/archive/p/cuda-convnet/ |access-date=2024-10-20 |website=Google Code Archive}}</ref><ref name=":3" />


* CNN = convolutional layer (with ReLU activation)
In one direction, subsequent work aimed to train increasingly deep CNNs that achieve increasingly higher performance on ImageNet. In this line of research are GoogLeNet (2014), [[Software:VGGNet|VGGNet]] (2014), [[Highway network]] (2015), and [[Residual neural network|ResNet]] (2015). Another direction aimed to reproduce the performance of AlexNet at a lower cost. In this line of research are [[Software:SqueezeNet|SqueezeNet]] (2016), [[Software:MobileNet|MobileNet]] (2017), and [[EfficientNet]] (2019).
* RN = local response normalization
 
* MP = max-pooling
Geoffrey Hinton, Ilya Sutskever, and Alex Krizhevsky formed DNNResearch soon afterwards and sold the company, and the AlexNet source code along with it, to Google. AlexNet has since been improved and reimplemented, but the original 2012 version, as it stood when it won the ImageNet competition, has been released under the [[Software:BSD licenses|BSD-2 license]] via the Computer History Museum.<ref>{{Citation |title=computerhistory/AlexNet-Source-Code |date=2025-03-22 |url=https://github.com/computerhistory/AlexNet-Source-Code |access-date=2025-03-22 |publisher=Computer History Museum}}</ref>
* FC = fully connected layer (with ReLU activation)
* Linear = fully connected layer (without activation)
* DO = dropout


It used the non-saturating ReLU activation function, which showed improved training performance over tanh and [[Sigmoid function|sigmoid]].<ref name=":0" />
==See also==


== Influence ==
* [[Software:Lists of open-source artificial intelligence software|List of open-source artificial intelligence software]]
AlexNet is considered one of the most influential papers published in computer vision, having spurred many more papers published employing CNNs and GPUs to accelerate [[Deep learning|deep learning]].<ref>{{Cite web|url=https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html|title=The 9 Deep Learning Papers You Need To Know About (Understanding CNNs Part 3)|last=Deshpande|first=Adit|website=adeshpande3.github.io|access-date=2018-12-04}}</ref> As of early 2023, the AlexNet paper has been cited over 120,000 times according to Google Scholar.<ref>[https://scholar.google.com/citations?view_op=view_citation&hl=en&user=xegzhJcAAAAJ&citation_for_view=xegzhJcAAAAJ:u5HHmVD_uO8C AlexNet paper on Google Scholar ]</ref>
* [[Open-source artificial intelligence]]


==References==
==References==
{{reflist}}
{{reflist}}


{{Differentiable computing}}
{{Artificial intelligence navbox}}


[[Category:Object recognition and categorization]]
[[Category:Object recognition and categorization]]


{{Sourceattribution|AlexNet}}
{{Sourceattribution|AlexNet}}

Latest revision as of 11:58, 23 May 2026

AlexNet
Developer(s): Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton
Initial release: June 28, 2011
Repository: code.google.com/archive/p/cuda-convnet/
Written in: CUDA, C++
Type: Convolutional neural network
License: New BSD License
AlexNet architecture and a possible modification. At the top is half of the original AlexNet, which is divided into two halves, one for each GPU. At the bottom is the same architecture, but the final "projection" layer is replaced by another that projects to fewer outputs. If one freezes the remaining model and only fine-tunes the last layer, one can obtain another vision model at a significantly lower cost than training one from scratch.
LeNet (left) and AlexNet (right) block diagram

AlexNet is a convolutional neural network architecture developed for image classification tasks, notably achieving prominence through its performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It classifies images into 1,000 distinct object categories and is regarded as the first widely recognized application of deep convolutional networks in large-scale visual recognition.

Developed in 2012 by Alex Krizhevsky in collaboration with Ilya Sutskever and his Ph.D. advisor Geoffrey Hinton at the University of Toronto, the model contains 60 million parameters and 650,000 neurons.[1] The original paper's primary result was that the depth of the model was essential for its high performance, which was computationally expensive, but made feasible due to the utilization of graphics processing units (GPUs) during training.[1]

The three formed team SuperVision and submitted AlexNet to the ImageNet Large Scale Visual Recognition Challenge on September 30, 2012.[2] The network won the contest with a top-5 error rate of 15.3%, more than 10.8 percentage points lower than that of the runner-up.

The architecture influenced a large body of subsequent work in deep learning, especially in applying neural networks to computer vision.

Architecture

AlexNet contains eight layers: the first five are convolutional layers, some of them followed by max-pooling layers, and the last three are fully connected layers. The network, except the last layer, is split into two copies, each run on one GPU, because the network did not fit in the VRAM of a single Nvidia GTX 580 3GB GPU.[1] The entire structure can be written as

(CONV → RN → MP)² → (CONV³ → MP) → (FC → DO)² → Linear → softmax

where

  • CONV = convolutional layer (with ReLU activation)
  • RN = local response normalization
  • MP = max-pooling
  • FC = fully connected layer (with ReLU activation)
  • Linear = fully connected layer (without activation)
  • DO = dropout

Notably, the convolutional layers 3, 4, 5 were connected to one another without any pooling or normalization. It used the non-saturating ReLU activation function, which trained better than tanh and sigmoid.[1]
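
As an illustrative sketch (not code from the original implementation), the structural formula above can be expanded into a flat list of layers in Python. The `(block, repeat)` encoding and the helper name `expand` are assumptions made for this example. Counting the layers that carry learned weights recovers the eight layers stated above.

```python
# Illustrative sketch: expand the structural formula into a flat layer list.
# The (block, repeat) encoding and the helper name are hypothetical.

FORMULA = [
    (["CONV", "RN", "MP"], 2),  # (CONV -> RN -> MP)^2
    (["CONV"], 3),              # CONV^3, no pooling or normalization in between
    (["MP"], 1),
    (["FC", "DO"], 2),          # (FC -> DO)^2
    (["Linear"], 1),
    (["softmax"], 1),
]


def expand(formula):
    """Repeat each block the given number of times and concatenate the results."""
    layers = []
    for block, repeat in formula:
        layers.extend(block * repeat)
    return layers


LAYERS = expand(FORMULA)

# Only convolutional and fully connected layers have learned weights.
WEIGHT_LAYERS = [name for name in LAYERS if name in {"CONV", "FC", "Linear"}]

if __name__ == "__main__":
    print(" -> ".join(LAYERS))
    print(len(WEIGHT_LAYERS), "layers with learned weights")
```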

Training

The ImageNet training set contained 1.2 million images. The model was trained for 90 epochs over a period of five to six days using two Nvidia GTX 580 GPUs (3GB each).[1] These GPUs have a theoretical performance of 1.581 TFLOPS in float32 and were priced at US$500 upon release.[3] Each forward pass of AlexNet required approximately 1.43 GFLOPs.[4] Based on these values, the two GPUs together were theoretically capable of performing over 2,200 forward passes per second under ideal conditions.
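
The throughput estimate above is a simple ratio of peak compute to cost per forward pass. The following Python sketch reproduces the arithmetic from the figures quoted in the text; the constant and function names are chosen for illustration only, and the result is an idealized upper bound that ignores memory bandwidth, data loading, and the backward pass.

```python
# Back-of-the-envelope check of the throughput estimate, using the figures
# quoted in the text. All names here are illustrative.

GPU_PEAK_FLOPS = 1.581e12   # float32 peak of one GTX 580, in FLOP/s
NUM_GPUS = 2
FLOPS_PER_FORWARD = 1.43e9  # approximate cost of one forward pass, in FLOPs


def ideal_forward_passes_per_second(peak_flops, num_gpus, flops_per_pass):
    """Upper bound on forward passes per second at 100% utilization."""
    return peak_flops * num_gpus / flops_per_pass


RATE = ideal_forward_passes_per_second(GPU_PEAK_FLOPS, NUM_GPUS, FLOPS_PER_FORWARD)

if __name__ == "__main__":
    print(f"{RATE:.0f} forward passes per second (ideal upper bound)")
```

With these inputs the ratio comes to roughly 2,211, consistent with the "over 2,200" figure.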

The dataset images were stored in JPEG format. They took up 27GB of disk. The neural network took up 2GB of RAM on each GPU, and around 5GB of system RAM during training. The GPUs were responsible for training, while the CPUs were responsible for loading images from disk, and data-augmenting the images.[5]

AlexNet was trained by stochastic gradient descent with momentum, using a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. The learning rate started at 10⁻² and was manually decreased 10-fold whenever the validation error appeared to stop decreasing. It was reduced three times during training, ending at 10⁻⁵.
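
The hyperparameters above can be made concrete with a short Python sketch. It shows one common formulation of a momentum update in which weight decay is added to the gradient; the text does not specify the exact update form used in cuda-convnet, so this form, along with the function names `momentum_step` and `lr_schedule`, is an assumption made for illustration. The sketch works on a single scalar weight to keep it self-contained.

```python
# Minimal sketch of SGD with momentum and weight decay on a scalar weight.
# The update form is one common formulation, assumed here for illustration.


def momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """Return the updated (weight, velocity) pair after one step."""
    v_new = momentum * v - lr * (grad + weight_decay * w)
    w_new = w + v_new
    return w_new, v_new


def lr_schedule(initial_lr=1e-2, factor=10.0, num_reductions=3):
    """List the learning rates visited when dividing by `factor` each time."""
    rates = [initial_lr]
    for _ in range(num_reductions):
        rates.append(rates[-1] / factor)
    return rates


if __name__ == "__main__":
    print(lr_schedule())  # four values, from 1e-2 down to about 1e-5
    print(momentum_step(w=1.0, v=0.0, grad=0.5, lr=1e-2))
```

In the original training the reductions were triggered manually when validation error plateaued, so `lr_schedule` only lists the values visited, not when each one was applied.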

Each image was first preprocessed and then augmented in two ways. The augmentations were computed on the fly on the CPU, thus "computationally free":

  • Preprocessing: each image from ImageNet was first scaled so that its shorter side was of length 256. Then the central 256×256 patch was cropped out and normalized: the pixel values were divided so that they fell between 0 and 1, then [0.485, 0.456, 0.406] was subtracted and the result was divided by [0.229, 0.224, 0.225]. These are the per-channel means and standard deviations for ImageNet, so this standardizes the input data.
  • Extracting random 224×224 patches (and their horizontal reflections) from the 256×256 crop. This increases the size of the training set 2048-fold.
  • Randomly shifting the RGB value of each image along the three principal directions of the RGB values of its pixels.

The resolution 224×224 was picked, because 256 - 16 - 16 = 224, meaning that given a 256×256 image, framing out a width of 16 on its 4 sides results in a 224×224 image.
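
The crop-and-reflect augmentation and the 2048-fold figure can be sketched in Python as follows. This is an illustrative sketch, not the original CPU pipeline: images are represented as plain nested lists of pixels, and the function names are hypothetical. The 2048 figure counts 32 shifts per axis and doubles it for reflections.

```python
import random

SIDE, CROP = 256, 224


def augmentation_factor(side=SIDE, crop=CROP):
    """Number of distinct views when counting (side - crop) shifts per axis."""
    shifts = side - crop        # 32 shifts per axis
    return shifts * shifts * 2  # doubled for horizontal reflections


def random_crop_params(rng, side=SIDE, crop=CROP):
    """Sample a top-left corner and a flip flag for one augmented view."""
    top = rng.randrange(side - crop + 1)   # 0..32, so the crop stays in bounds
    left = rng.randrange(side - crop + 1)
    flip = rng.random() < 0.5
    return top, left, flip


def crop_and_flip(image, top, left, flip, crop=CROP):
    """Cut a crop x crop patch from a list-of-rows image, optionally mirrored."""
    rows = [row[left:left + crop] for row in image[top:top + crop]]
    return [row[::-1] for row in rows] if flip else rows


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[(r, c) for c in range(SIDE)] for r in range(SIDE)]
    top, left, flip = random_crop_params(rng)
    patch = crop_and_flip(image, top, left, flip)
    print(augmentation_factor(), len(patch), len(patch[0]))
```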

It used local response normalization, and dropout regularization with drop probability 0.5.

All weights were initialized from a Gaussian distribution with mean 0 and standard deviation 0.01. Biases in convolutional layers 2, 4, and 5, and in all fully connected layers, were initialized to the constant 1 to avoid the dying ReLU problem.

At test time, to use a trained AlexNet to predict the class of an image, the image is first scaled so that its shorter side has length 256. Then the central 256×256 patch is cropped out. Next, five 224×224 patches (the four corner patches and the center patch) and their horizontal reflections are extracted, 10 patches in all. The network's predicted probabilities on all 10 patches are averaged to give the final predicted probability.
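
The ten test-time views described above can be enumerated with a small Python sketch. The function name `ten_crop_views` and the `(top, left, flip)` encoding are assumptions for illustration. Each view would be cut from the 256×256 central crop and passed through the network before the resulting probabilities are averaged.

```python
# Illustrative enumeration of the ten test-time views: four corners and the
# center, each with and without horizontal reflection. Names are hypothetical.

SIDE, CROP = 256, 224


def ten_crop_views(side=SIDE, crop=CROP):
    """Return (top, left, flip) for the 10 patches used at test time."""
    far = side - crop  # offset of the bottom/right patches (32)
    mid = far // 2     # offset of the center patch (16)
    origins = [(0, 0), (0, far), (far, 0), (far, far), (mid, mid)]
    return [(top, left, flip) for top, left in origins for flip in (False, True)]


VIEWS = ten_crop_views()

if __name__ == "__main__":
    for view in VIEWS:
        print(view)
```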

ImageNet competition

The version they used to enter the 2012 ImageNet competition was an ensemble of 7 AlexNets.

Specifically, they trained 5 AlexNets of the previously described architecture (with 5 CONV layers) on the ILSVRC-2012 training set (1.2 million images). They also trained 2 variant AlexNets, obtained by adding one extra CONV layer over the last pooling layer. These were trained by first training on the entire ImageNet Fall 2011 release (15 million images in 22K categories), and then finetuning it on the ILSVRC-2012 training set. The final system of 7 AlexNets was used by averaging their predicted probabilities.
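
Averaging predicted probabilities across ensemble members can be sketched in Python as below. This is a minimal illustration with toy numbers, not the competition code: the function names are hypothetical, and each model's output is assumed to be a list of class probabilities of equal length. The top-5 helper shows how a top-5 prediction is read off the averaged vector.

```python
# Minimal sketch of ensemble averaging followed by a top-k readout.
# Function names and the toy inputs are illustrative assumptions.


def average_probabilities(prob_vectors):
    """Average several equally long probability vectors class by class."""
    n = len(prob_vectors)
    return [sum(column) / n for column in zip(*prob_vectors)]


def top_k(probs, k=5):
    """Indices of the k largest probabilities, most likely first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


if __name__ == "__main__":
    # Three toy "models" scoring four classes.
    outputs = [
        [0.10, 0.60, 0.20, 0.10],
        [0.20, 0.50, 0.20, 0.10],
        [0.05, 0.70, 0.15, 0.10],
    ]
    averaged = average_probabilities(outputs)
    print(averaged)
    print(top_k(averaged, k=2))
```

A prediction counts as correct under the top-5 criterion when the true class index appears in `top_k(averaged, k=5)`.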

History

Previous work

In 1980, Kunihiko Fukushima proposed an early CNN named neocognitron.[6][7] It was trained by an unsupervised learning algorithm. LeNet (Yann LeCun et al., 1989; refined as LeNet-5 in 1998)[8][9] was trained by supervised learning with the backpropagation algorithm, with an architecture that is essentially the same as AlexNet on a small scale.

Max pooling was used in 1990 for speech processing (essentially a 1-dimensional CNN),[10] and for image processing, was first used in the Cresceptron of 1992.[11]

During the 2000s, as GPU hardware improved, some researchers adapted GPUs for general-purpose computing, including neural network training. K. Chellapilla et al. (2006) trained a CNN on a GPU that was 4 times faster than an equivalent CPU implementation.[12] Raina et al. (2009) trained a deep belief network with 100 million parameters on an Nvidia GeForce GTX 280 at up to a 70-fold speedup over CPUs.[13] A deep CNN by Dan Cireșan et al. (2011) at IDSIA was 60 times faster than an equivalent CPU implementation.[14] Between May 15, 2011, and September 10, 2012, their CNN won four image competitions and achieved state of the art for multiple image databases.[15][16][17] According to the AlexNet paper,[1] Cireșan's earlier net is "somewhat similar". Both were written with CUDA to run on GPUs.

Computer vision

During the 1990–2010 period, neural networks were not better than other machine learning methods like kernel regression, support vector machines, AdaBoost, structured estimation,[18] among others. For computer vision in particular, much progress came from manual feature engineering, such as SIFT features, SURF features, HoG features, bags of visual words, etc. It was a minority position in computer vision that features can be learned directly from data, a position which became dominant after AlexNet.[19]

In 2011, Geoffrey Hinton started reaching out to colleagues about "What do I have to do to convince you that neural networks are the future?", and Jitendra Malik, a sceptic of neural networks, recommended the PASCAL Visual Object Classes challenge. Hinton said its dataset was too small, so Malik recommended to him the ImageNet challenge.[20]

The ImageNet dataset, which became central to AlexNet's success, was created by Fei-Fei Li and her collaborators beginning in 2007. Aiming to advance visual recognition through large-scale data, Li built a dataset far larger than earlier efforts, ultimately containing over 14 million labeled images across 22,000 categories. The images were labeled using Amazon Mechanical Turk and organized via the WordNet hierarchy. Initially met with skepticism, ImageNet later became the foundation of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and a key resource in the rise of deep learning.[21]

Sutskever and Krizhevsky were both graduate students. Before 2011, Krizhevsky had already written cuda-convnet to train small CNNs on CIFAR-10 with a single GPU. Sutskever convinced Krizhevsky, who was skilled at GPGPU programming, to train a CNN on ImageNet, with Hinton serving as principal investigator. Krizhevsky therefore extended cuda-convnet for multi-GPU training. AlexNet was trained on two Nvidia GTX 580 GPUs in Krizhevsky's bedroom at his parents' house. During 2012, Krizhevsky performed hyperparameter optimization on the network until it won the ImageNet competition later the same year. Hinton commented, "Ilya thought we should do it, Alex made it work, and I got the Nobel Prize".[22] At the 2012 European Conference on Computer Vision, following AlexNet's win, researcher Yann LeCun described the model as "an unequivocal turning point in the history of computer vision".[21]

AlexNet's success in 2012 was enabled by the convergence of three developments that had matured over the previous decade: large-scale labeled datasets, general-purpose GPU computing, and improved training methods for deep neural networks. The availability of ImageNet provided the data necessary for training deep models on a broad range of object categories. Advances in GPU programming through Nvidia's CUDA platform enabled practical training of large models. Together with algorithmic improvements, these factors enabled AlexNet to achieve high performance on large-scale visual recognition benchmarks.[21] Reflecting on its significance over a decade later, Fei-Fei Li stated in a 2024 interview: "That moment was pretty symbolic to the world of AI because three fundamental elements of modern AI converged for the first time".[21]

While AlexNet and LeNet share essentially the same design and algorithm, AlexNet is much larger than LeNet and was trained on a much larger dataset on much faster hardware. Over the period of 20 years, both data and compute became cheaply available.[19]

Subsequent work

AlexNet is highly influential, resulting in much subsequent work in using CNNs for computer vision and using GPUs to accelerate deep learning. As of early 2025, the AlexNet paper has been cited over 184,000 times according to Google Scholar.[23]

At the time of publication, there was no framework available for GPU-based neural network training and inference. The codebase for AlexNet was released under a BSD license and was commonly used in neural network research for several years afterwards.[24][19]

In one direction, subsequent work aimed to train increasingly deep CNNs that achieve increasingly higher performance on ImageNet. In this line of research are GoogLeNet (2014), VGGNet (2014), Highway network (2015), and ResNet (2015). Another direction aimed to reproduce the performance of AlexNet at a lower cost. In this line of research are SqueezeNet (2016), MobileNet (2017), and EfficientNet (2019).

Geoffrey Hinton, Ilya Sutskever, and Alex Krizhevsky formed DNNResearch soon afterwards and sold the company, and the AlexNet source code along with it, to Google. AlexNet has since been improved and reimplemented, but the original 2012 version, as it stood when it won the ImageNet competition, was later released under the BSD-2 license via the Computer History Museum.[25]

See also

  • List of open-source artificial intelligence software
  • Open-source artificial intelligence

References

  1. Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. (2017-05-24). "ImageNet classification with deep convolutional neural networks". Communications of the ACM 60 (6): 84–90. doi:10.1145/3065386. ISSN 0001-0782. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
  2. "ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC2012)". https://image-net.org/challenges/LSVRC/2012/results.html. 
  3. "NVIDIA GeForce GTX 580 Specs" (in en). 2024-11-12. https://www.techpowerup.com/gpu-specs/geforce-gtx-580.c270. 
  4. "calflops: a FLOPs and Params calculate tool for neural networks". https://pypi.org/project/calflops/. 
  5. SuperVision team invited talk at the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) workshop on 2012-10-12, held within the ECCV 2012, in Florence, Italy.
  6. Fukushima, K. (2007). "Neocognitron". Scholarpedia 2 (1): 1717. doi:10.4249/scholarpedia.1717. Bibcode: 2007SchpJ...2.1717F.
  7. Fukushima, Kunihiko (1980). "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position". Biological Cybernetics 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf. Retrieved 16 November 2013. 
  8. LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; Jackel, L. D. (1989). "Backpropagation Applied to Handwritten Zip Code Recognition". Neural Computation (MIT Press - Journals) 1 (4): 541–551. doi:10.1162/neco.1989.1.4.541. ISSN 0899-7667. OCLC 364746139. http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf. 
  9. LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick Haffner (1998). "Gradient-based learning applied to document recognition". Proceedings of the IEEE 86 (11): 2278–2324. doi:10.1109/5.726791. http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf. Retrieved October 7, 2016. 
  10. Yamaguchi, Kouichi; Sakamoto, Kenji; Akabane, Toshio; Fujimoto, Yoshiji (November 1990). "A Neural Network for Speaker-Independent Isolated Word Recognition". First International Conference on Spoken Language Processing (ICSLP 90). Kobe, Japan. https://www.isca-speech.org/archive/icslp_1990/i90_1077.html. Retrieved 2019-09-04. 
  11. Weng, J.; Ahuja, N.; Huang, T.S. (1992). "Cresceptron: a self-organizing neural network which grows adaptively". 1. IEEE. pp. 576–581. doi:10.1109/IJCNN.1992.287150. ISBN 978-0-7803-0559-5. https://vision.ai.illinois.edu/html-files-to-import/publications/cresceptron_1992.pdf. 
  12. Kumar Chellapilla; Sidd Puri; Patrice Simard (2006). "High Performance Convolutional Neural Networks for Document Processing". in Lorette, Guy. Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft. https://hal.inria.fr/inria-00112631/document. 
  13. Raina, Rajat; Madhavan, Anand; Ng, Andrew Y. (2009-06-14). "Large-scale deep unsupervised learning using graphics processors" (in en). ACM. pp. 873–880. doi:10.1145/1553374.1553486. ISBN 978-1-60558-516-1. https://dl.acm.org/doi/10.1145/1553374.1553486. 
  14. Cireșan, Dan; Ueli Meier; Jonathan Masci; Luca M. Gambardella; Jurgen Schmidhuber (2011). "Flexible, High Performance Convolutional Neural Networks for Image Classification". Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Volume Two: 1237–1242. http://www.idsia.ch/~juergen/ijcai2011.pdf. Retrieved 17 November 2013.
  15. "IJCNN 2011 Competition result table" (in en-US). 2010. http://benchmark.ini.rub.de/?section=gtsrb&subsection=results. 
  16. Schmidhuber, Jürgen (17 March 2017). "History of computer vision contests won by deep CNNs on GPU" (in en-US). http://people.idsia.ch/~juergen/computer-vision-contests-won-by-gpu-cnns.html. 
  17. Cireșan, Dan; Meier, Ueli; Schmidhuber, Jürgen (June 2012). "Multi-column deep neural networks for image classification". 2012 IEEE Conference on Computer Vision and Pattern Recognition. New York, NY: Institute of Electrical and Electronics Engineers (IEEE). pp. 3642–3649. doi:10.1109/CVPR.2012.6248110. ISBN 978-1-4673-1226-4. OCLC 812295155. 
  18. Taskar, Ben; Guestrin, Carlos; Koller, Daphne (2003). "Max-Margin Markov Networks". Advances in Neural Information Processing Systems (MIT Press) 16. https://proceedings.neurips.cc/paper/2003/hash/878d5691c824ee2aaf770f7d36c151d6-Abstract.html. 
  19. Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "8.1. Deep Convolutional Neural Networks (AlexNet)". Dive into deep learning. Cambridge New York Port Melbourne New Delhi Singapore: Cambridge University Press. ISBN 978-1-009-38943-3. https://d2l.ai/chapter_convolutional-modern/alexnet.html.
  20. Li, Fei Fei (2023). The worlds I see: curiosity, exploration, and discovery at the dawn of AI (First ed.). New York: Moment of Lift Books ; Flatiron Books. ISBN 978-1-250-89793-0. 
  21. "How a stubborn computer scientist accidentally launched the deep learning boom". 11 November 2024. https://arstechnica.com/ai/2024/11/how-a-stubborn-computer-scientist-accidentally-launched-the-deep-learning-boom/.
  22. hhackford (2025-03-20). "CHM Releases AlexNet Source Code" (in en). https://computerhistory.org/blog/chm-releases-alexnet-source-code/. 
  23. AlexNet paper on Google Scholar
  24. Krizhevsky, Alex (July 18, 2014). "cuda-convnet: High-performance C++/CUDA implementation of convolutional neural networks". https://code.google.com/archive/p/cuda-convnet/. 
  25. computerhistory/AlexNet-Source-Code, Computer History Museum, 2025-03-22, https://github.com/computerhistory/AlexNet-Source-Code, retrieved 2025-03-22