ImageNet
{{Short description|Image dataset}}
{{Use dmy dates|date=September 2019}}
The ImageNet project is a large visual database designed for use in visual object recognition software research. More than 14 million{{cite news|title=New computer vision challenge wants to teach robots to see in 3D|url=https://www.newscientist.com/article/2127131-new-computer-vision-challenge-wants-to-teach-robots-to-see-in-3d/|access-date=3 February 2018|work=New Scientist|date=7 April 2017}}{{cite news|last1=Markoff|first1=John|title=For Web Images, Creating New Technology to Seek and Find|url=https://www.nytimes.com/2012/11/20/science/for-web-images-creating-new-technology-to-seek-and-find.html|access-date=3 February 2018|work=The New York Times|date=19 November 2012}} images have been hand-annotated by the project to indicate what objects are pictured, and bounding boxes are also provided for at least one million of the images.{{Cite web |date=2020-09-07 |title=ImageNet |url=http://image-net.org/about-stats.php |archive-url=https://web.archive.org/web/20200907212153/http://image-net.org/about-stats.php |archive-date=2020-09-07 |access-date=2022-10-11 }} ImageNet contains more than 20,000 categories, with a typical category, such as "balloon" or "strawberry", consisting of several hundred images.{{cite news|title=From not working to neural networking|url=https://www.economist.com/news/special-report/21700756-artificial-intelligence-boom-based-old-idea-modern-twist-not|access-date=3 February 2018|newspaper=The Economist|date=25 June 2016}} The database of annotations of third-party image URLs is freely available directly from ImageNet, though the actual images are not owned by ImageNet.{{cite web|title=ImageNet Overview|url=https://image-net.org/about.php|publisher=ImageNet|access-date=15 October 2022}} Since 2010, the ImageNet project has run an annual software contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), in which software programs compete to correctly classify and detect objects and scenes. The challenge uses a "trimmed" list of one thousand non-overlapping classes.
History
AI researcher Fei-Fei Li began working on the idea for ImageNet in 2006. At a time when most AI research focused on models and algorithms, Li wanted to expand and improve the data available to train AI algorithms.{{Cite magazine |url=https://www.wired.com/story/fei-fei-li-artificial-intelligence-humanity/ |title=Fei-Fei Li's Quest to Make AI Better for Humanity |last=Hempel |first=Jesse |magazine=Wired |quote=When Li, who had moved back to Princeton to take a job as an assistant professor in 2007, talked up her idea for ImageNet, she had a hard time getting faculty members to help out. Finally, a professor who specialized in computer architecture agreed to join her as a collaborator. |date=13 November 2018 |access-date=5 May 2019}} In 2007, Li met with Princeton professor Christiane Fellbaum, one of the creators of WordNet, to discuss the project. As a result of this meeting, Li went on to build ImageNet starting from the roughly 22,000 nouns of WordNet and using many of its features. She was also inspired by a 1987 estimate{{Cite journal |last=Biederman |first=Irving |author-link=Irving Biederman |date=1987 |title=Recognition-by-components: A theory of human image understanding. |url=http://dx.doi.org/10.1037//0033-295x.94.2.115 |journal=Psychological Review |volume=94 |issue=2 |pages=115–117 |doi=10.1037/0033-295x.94.2.115 |pmid=3575582 |issn=0033-295X}} that the average person recognizes roughly 30,000 different kinds of objects.{{Cite web |last=Lee |first=Timothy B. |date=2024-11-11 |title=How a stubborn computer scientist accidentally launched the deep learning boom |url=https://arstechnica.com/ai/2024/11/how-a-stubborn-computer-scientist-accidentally-launched-the-deep-learning-boom/ |access-date=2024-11-12 |website=Ars Technica |language=en-US}}
As an assistant professor at Princeton, Li assembled a team of researchers to work on the ImageNet project. They used Amazon Mechanical Turk to help with the classification of images. Labeling started in July 2008 and ended in April 2010; in total, the labeling effort took two and a half years to complete. The budget was sufficient to have each of the 14 million images labelled three times.
The original plan called for 10,000 images per category across 40,000 categories, or 400 million images in total, each verified three times. They found that humans can classify at most two images per second; at that rate, the required 1.2 billion verifications (400 million images × 3) were estimated to take 19 human-years of labor, without rest.Li, F-F. ImageNet. "[https://web.archive.org/web/20130115112543/http://www.image-net.org/papers/ImageNet_2010.pdf Crowdsourcing, benchmarking & other cool things]." CMU VASC Seminar 16 (2010): 18-25.
They presented their database for the first time as a poster at the 2009 Conference on Computer Vision and Pattern Recognition (CVPR) in Florida, titled "ImageNet: A Preview of a Large-scale Hierarchical Dataset".{{Cite web |title=CVPR 2009: IEEE Computer Society Conference on Computer Vision and Pattern Recognition |url=http://tab.computer.org/pamitc/archive/cvpr2009/posters.html |access-date=2024-11-13 |website=tab.computer.org}}{{cite web |url=https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/ |title=The data that transformed AI research—and possibly the world |last=Gershgorn |first=Dave |date=26 July 2017 |website=Quartz |publisher=Atlantic Media Co.|quote=Having read about WordNet's approach, Li met with professor Christiane Fellbaum, a researcher influential in the continued work on WordNet, during a 2006 visit to Princeton. |access-date=26 July 2017 }}{{Citation |last1=Deng |first1=Jia |last2=Dong |first2=Wei |last3=Socher |first3=Richard |last4=Li |first4=Li-Jia |last5=Li |first5=Kai |last6=Fei-Fei |first6=Li |contribution=ImageNet: A Large-Scale Hierarchical Image Database |year=2009 |title=2009 conference on Computer Vision and Pattern Recognition |contribution-url=http://www.image-net.org/papers/imagenet_cvpr09.pdf |access-date=26 July 2017 |archive-date=15 January 2021 |archive-url=https://web.archive.org/web/20210115185228/http://www.image-net.org/papers/imagenet_cvpr09.pdf |url-status=dead }}{{Citation|last=Li|first=Fei-Fei|title=How we're teaching computers to understand pictures|date=23 March 2015 |url=https://www.ted.com/talks/fei_fei_li_how_we_re_teaching_computers_to_understand_pictures?language=en|access-date=16 December 2018}} The poster was reused at Vision Sciences Society 2009.Deng, Jia, et al. "[https://web.archive.org/web/20130115112451/http://www.image-net.org/papers/ImageNet_VSS2009.pdf Construction and analysis of a large scale image ontology]." Vision Sciences Society 186.2 (2009).
In 2009, Alex Berg suggested adding object localization as a task. Li approached the [http://host.robots.ox.ac.uk/pascal/VOC/ PASCAL Visual Object Classes] (VOC) team in 2009 to propose a collaboration. This resulted in the ImageNet Large Scale Visual Recognition Challenge, starting in 2010, which had 1,000 classes and an object localization task, whereas PASCAL VOC had just 20 classes and 19,737 images (in 2010).
= Significance for deep learning =
On 30 September 2012, a convolutional neural network (CNN) called AlexNet{{Cite journal |last1=Krizhevsky |first1=Alex |last2=Sutskever |first2=Ilya |last3=Hinton |first3=Geoffrey E. |date=June 2017 |title=ImageNet classification with deep convolutional neural networks |url=https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf |journal=Communications of the ACM |volume=60 |issue=6 |pages=84–90 |doi=10.1145/3065386 |issn=0001-0782 |s2cid=195908774 |access-date=24 May 2017 |doi-access=free}} achieved a top-5 error of 15.3% in the ImageNet 2012 Challenge, more than 10.8 percentage points lower than that of the runner-up. Training such a deep convolutional network was feasible due to the use of graphics processing units (GPUs), an essential ingredient of the deep learning revolution. According to The Economist, "Suddenly people started to pay attention, not just within the AI community but across the technology industry as a whole."{{cite news |date=30 November 2017 |title=Machines 'beat humans' for a growing number of tasks |url=https://www.ft.com/content/4cc048f6-d5f4-11e7-a303-9060cb1e5f44 |access-date=3 February 2018 |work=Financial Times}}{{Cite web |last1=Gershgorn |first1=Dave |date=18 June 2018 |title=The inside story of how AI got good enough to dominate Silicon Valley |url=https://qz.com/1307091/the-inside-story-of-how-ai-got-good-enough-to-dominate-silicon-valley/ |access-date=10 December 2018 |website=Quartz}}
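The top-5 error used here counts a prediction as wrong only when the true label is absent from the model's five highest-scoring classes. A minimal sketch of the metric in PyTorch, using random scores as stand-in data:

```python
import torch

def topk_error(logits: torch.Tensor, labels: torch.Tensor, k: int = 5) -> float:
    """Fraction of examples whose true label is NOT among the k highest-scoring classes."""
    topk = logits.topk(k, dim=1).indices            # (N, k) indices of the top-k classes
    hit = (topk == labels.unsqueeze(1)).any(dim=1)  # True where the label is in the top k
    return 1.0 - hit.float().mean().item()

# Toy usage with random scores over the 1,000 ILSVRC classes
scores = torch.randn(8, 1000)
truth = torch.randint(0, 1000, (8,))
print(f"top-5 error: {topk_error(scores, truth):.3f}")
```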
In 2015, AlexNet was outperformed by Microsoft's very deep CNN with over 100 layers, which won the ImageNet 2015 contest, having 3.57% error on the test set.{{cite book |last1=He |first1=Kaiming |title=2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) |last2=Zhang |first2=Xiangyu |last3=Ren |first3=Shaoqing |last4=Sun |first4=Jian |year=2016 |isbn=978-1-4673-8851-1 |pages=770–778 |chapter=Deep Residual Learning for Image Recognition |doi=10.1109/CVPR.2016.90 |arxiv=1512.03385 |s2cid=206594692}}
Dataset
ImageNet crowdsources its annotation process. Image-level annotations indicate the presence or absence of an object class in an image, such as "there are tigers in this image" or "there are no tigers in this image". Object-level annotations provide a bounding box around the (visible part of the) indicated object. ImageNet uses a variant of the broad WordNet schema to categorize objects, augmented with 120 categories of dog breeds to showcase fine-grained classification.Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. (* = equal contribution) ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
In 2012, ImageNet was the world's largest academic user of Mechanical Turk. The average worker identified 50 images per minute.
The original plan for the full ImageNet called for roughly 50 million clean, diverse, full-resolution images spread over approximately 50,000 synsets. This was not achieved.
The summary statistics given on April 30, 2010:{{Cite web |date=2013-01-15 |title=ImageNet Summary and Statistics (updated on April 30, 2010) |url=http://www.image-net.org/about-stats |access-date=2024-11-13 |archive-url=https://web.archive.org/web/20130115112755/http://www.image-net.org/about-stats |archive-date=15 January 2013 }}
- Total number of non-empty synsets: 21,841
- Total number of images: 14,197,122
- Number of images with bounding box annotations: 1,034,908
- Number of synsets with SIFT features: 1,000
- Number of images with SIFT features: 1.2 million
= Categories =
The categories of ImageNet were filtered from WordNet concepts. Since a concept can be expressed by multiple synonyms (for example, "kitty" and "young cat"), each concept is called a "synonym set", or "synset". WordNet 3.0 contains more than 100,000 synsets, the majority of which (over 80,000) are nouns. ImageNet filtered these down to 21,841 synsets that are countable nouns and can be visually illustrated.
Each synset in WordNet 3.0 has a "WordNet ID" (wnid), which is a concatenation of part of speech and an "offset" (a unique identifying number). Every wnid starts with "n" because ImageNet only includes nouns. For example, the wnid of synset "dog, domestic dog, Canis familiaris" is "n02084071".{{Cite web |date=2013-01-22 |title=ImageNet API documentation |url=http://www.image-net.org/download-API |access-date=2024-11-13 |archive-url=https://web.archive.org/web/20130122145752/http://www.image-net.org/download-API |archive-date=22 January 2013 }}
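For illustration, a wnid can be reconstructed from WordNet 3.0 itself; a minimal sketch using NLTK's WordNet interface (assuming the nltk package and its wordnet corpus are installed):

```python
from nltk.corpus import wordnet as wn  # requires nltk plus nltk.download('wordnet')

# Rebuild the wnid for the synset "dog, domestic dog, Canis familiaris":
# the part of speech ('n') followed by the zero-padded 8-digit WordNet offset.
synset = wn.synset('dog.n.01')
wnid = f"{synset.pos()}{synset.offset():08d}"
print(wnid)  # -> n02084071
```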
The categories in ImageNet fall into 9 levels, from level 1 (such as "mammal") to level 9 (such as "German shepherd").
= Image format =
The images were scraped from online image search engines (Google, Picsearch, MSN, Yahoo, Flickr, etc.) using synonyms in multiple languages. For example: German shepherd, German police dog, German shepherd dog, Alsatian, ovejero alemán, pastore tedesco, 德国牧羊犬.Berg, Alex, Jia Deng, and L. Fei-Fei. "[https://www.image-net.org/static_files/files/pascal_ilsvrc.pdf Large scale visual recognition challenge 2010]." November 2010.
ImageNet consists of RGB images of varying resolutions. For example, in the "fish" category of ImageNet 2012, resolutions range from 4288 × 2848 down to 75 × 56. In machine learning, these are typically preprocessed to a standard constant resolution and whitened before further processing by neural networks.
For example, in PyTorch, ImageNet images are conventionally normalized by dividing the pixel values by 255 so that they fall between 0 and 1, then subtracting the per-channel means [0.485, 0.456, 0.406] and dividing by the per-channel standard deviations [0.229, 0.224, 0.225]. These are the means and standard deviations computed over ImageNet, so this normalization approximately whitens the input data.{{Cite web |title=std and mean for image normalization different from ImageNet · Issue #20 · openai/CLIP |url=https://github.com/openai/CLIP/issues/20 |access-date=2024-09-19 |website=GitHub |language=en}}
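A minimal sketch of this pipeline with torchvision follows; the 256-pixel resize and 224 × 224 crop are the common convention for ImageNet models, not a requirement of the dataset itself:

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),        # scale the shorter side to 256 pixels
    transforms.CenterCrop(224),    # crop to the conventional 224 x 224 input
    transforms.ToTensor(),         # uint8 [0, 255] -> float [0, 1], HWC -> CHW
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # per-channel ImageNet means
                         std=[0.229, 0.224, 0.225]),   # per-channel ImageNet stds
])
```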
= Labels and annotations =
Each image is labelled with exactly one wnid.
Dense SIFT features (raw SIFT descriptors, quantized codewords, and the coordinates of each descriptor/codeword) for ImageNet-1K were available for download, designed for use in bag-of-visual-words models.{{Cite web |date=2013-04-05 |title=ImageNet |url=http://www.image-net.org/download-features.php |access-date=2024-11-13 |archive-url=https://web.archive.org/web/20130405035300/http://www.image-net.org/download-features.php |archive-date=5 April 2013 }}
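For illustration, a minimal bag-of-visual-words sketch in the same spirit, using OpenCV's SIFT and scikit-learn's k-means; the file paths and the 1,000-word codebook size are placeholders, not the exact recipe used for the released features:

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

sift = cv2.SIFT_create()

def sift_descriptors(path: str) -> np.ndarray:
    """128-dimensional SIFT descriptors for one image."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    return desc

# 1. Build a codebook by clustering descriptors pooled from training images.
train_paths = ["img1.jpg", "img2.jpg"]  # placeholder paths
codebook = MiniBatchKMeans(n_clusters=1000).fit(
    np.vstack([sift_descriptors(p) for p in train_paths]))

# 2. Represent an image as a normalized histogram of its quantized codewords.
def bovw_histogram(path: str) -> np.ndarray:
    words = codebook.predict(sift_descriptors(path))
    hist = np.bincount(words, minlength=1000).astype(float)
    return hist / hist.sum()
```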
The bounding boxes of objects were available for about 3,000 popular synsets (list archived at https://web.archive.org/web/20181030191122/http://www.image-net.org/api/text/imagenet.sbow.obtain_synset_list), with on average 150 images in each synset.{{cite web | url=http://www.image-net.org/download-bboxes | archive-url=https://web.archive.org/web/20130405005059/http://www.image-net.org/download-bboxes | archive-date=5 April 2013 | title=ImageNet }}
Furthermore, some images have attributes. They released 25 attributes for ~400 popular synsets:{{cite web | url=http://www.image-net.org/download-attributes | archive-url=https://web.archive.org/web/20191222152337/http://www.image-net.org/download-attributes | archive-date=22 December 2019 | title=ImageNet }}{{Cite book |last1=Russakovsky |first1=Olga |last2=Fei-Fei |first2=Li |chapter=Attribute Learning in Large-Scale Datasets |series=Lecture Notes in Computer Science |date=2012 |volume=6553 |editor-last=Kutulakos |editor-first=Kiriakos N. |title=Trends and Topics in Computer Vision |chapter-url=https://link.springer.com/chapter/10.1007/978-3-642-35749-7_1 |language=en |location=Berlin, Heidelberg |publisher=Springer |pages=1–14 |doi=10.1007/978-3-642-35749-7_1 |isbn=978-3-642-35749-7}}
- Color: black, blue, brown, gray, green, orange, pink, red, violet, white, yellow
- Pattern: spotted, striped
- Shape: long, round, rectangular, square
- Texture: furry, smooth, rough, shiny, metallic, vegetation, wooden, wet
= ImageNet-21K =
The full original dataset is referred to as ImageNet-21K. It contains 14,197,122 images divided into 21,841 classes. Some papers round this up and name it ImageNet-22K.{{cite arXiv |eprint=2104.10972 |class=cs.CV |first1=Tal |last1=Ridnik |first2=Emanuel |last2=Ben-Baruch |title=ImageNet-21K Pretraining for the Masses |date=2021-08-05 |last3=Noy |first3=Asaf |last4=Zelnik-Manor |first4=Lihi}}
The full ImageNet-21K was released in the fall of 2011 as fall11_whole.tar. There is no official train-validation-test split for ImageNet-21K. Some classes contain only 1–10 samples, while others contain thousands.
= ImageNet-1K =
Various subsets of the ImageNet dataset are used in various contexts, and are sometimes referred to as "versions".
One of the most widely used subsets of ImageNet is the "ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012–2017 image classification and localization dataset". This is also referred to in the research literature as ImageNet-1K or ILSVRC2017, reflecting the original ILSVRC challenge that involved 1,000 classes. ImageNet-1K contains 1,281,167 training images, 50,000 validation images and 100,000 test images.{{Cite web |title=ImageNet |url=https://www.image-net.org/download.php |access-date=2022-10-19 |website=www.image-net.org}}
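A minimal sketch of loading these splits with torchvision, which reads a manually downloaded copy of the ILSVRC 2012 archives (the root path below is a placeholder):

```python
from torchvision import datasets

# torchvision does not download the archives automatically for license reasons;
# "/data/imagenet" stands in for wherever the tar files were unpacked.
train_set = datasets.ImageNet(root="/data/imagenet", split="train")  # 1,281,167 images
val_set = datasets.ImageNet(root="/data/imagenet", split="val")      # 50,000 images
print(len(train_set.classes))  # 1000
```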
Each category in ImageNet-1K is a leaf category, meaning that it has no child nodes below it, unlike ImageNet-21K. For example, in ImageNet-21K, some images are categorized as simply "mammal", whereas in ImageNet-1K images are only categorized as things like "German shepherd", since "German shepherd" has no child synsets below it.
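The leaf property can be checked against WordNet itself; a small sketch with NLTK (assuming its wordnet corpus is installed):

```python
from nltk.corpus import wordnet as wn

# Leaf synsets have no hyponyms (no child concepts) below them.
print(wn.synset('german_shepherd.n.01').hyponyms())   # [] -> a leaf, as in ImageNet-1K
print(len(wn.synset('mammal.n.01').hyponyms()) > 0)   # True -> an internal node
```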
= Later developments =
In the winter of 2021, ImageNet-21K was updated: 2,702 categories in the "person" subtree were filtered out to prevent "problematic behaviors" in trained models. Also in 2021, ImageNet-1K was updated by annotating the faces appearing in the 997 non-person categories; the authors found that training models on the dataset with these faces blurred caused minimal loss in performance.{{Cite web |orig-date=March 11, 2021 |title=An Update to the ImageNet Website and Dataset |url=https://www.image-net.org/update-mar-11-2021.php |access-date=2024-11-13 |website=www.image-net.org}}
ImageNet-C, constructed in 2019, is a version of ImageNet perturbed with algorithmically generated common corruptions (such as noise, blur, and simulated weather effects), used to benchmark model robustness.{{cite arXiv |eprint=1903.12261 |last1=Hendrycks |first1=Dan |last2=Dietterich |first2=Thomas |title=Benchmarking Neural Network Robustness to Common Corruptions and Perturbations |date=2019 |class=cs.LG }}
ImageNetV2 was a new dataset containing three test sets of 10,000 images each, constructed by the same methodology as the original ImageNet.{{Cite journal |last1=Recht |first1=Benjamin |last2=Roelofs |first2=Rebecca |last3=Schmidt |first3=Ludwig |last4=Shankar |first4=Vaishaal |date=2019-05-24 |title=Do ImageNet Classifiers Generalize to ImageNet? |url=https://proceedings.mlr.press/v97/recht19a.html |journal=Proceedings of the 36th International Conference on Machine Learning |language=en |publisher=PMLR |pages=5389–5400}}
ImageNet-21K-P was a filtered and cleaned subset of ImageNet-21K, with 12,358,688 images from 11,221 categories.
class="wikitable"
|+Table of datasets !Name !Published !Classes !Training !Validation !Test !Size |
PASCAL VOC
|2005 |20 | | | | |
ImageNet-1K
|2009 |1,000 |1,281,167 |50,000 |100,000 |130 GB |
ImageNet-21K
|2011 |21,841 |14,197,122 | | |1.31 TB |
ImageNetV2
|2019 | | | |30,000 | |
ImageNet-21K-P
|2021 |11,221 |11,797,632 | |561,052 | |
History of the ImageNet challenge
[Figure: ImageNet error rate history]
The ILSVRC aims to "follow in the footsteps" of the smaller-scale [http://host.robots.ox.ac.uk/pascal/VOC/ PASCAL VOC] challenge, established in 2005, which contained only about 20,000 images and twenty object classes. To "democratize" ImageNet, Fei-Fei Li proposed to the [http://host.robots.ox.ac.uk/pascal/VOC/ PASCAL VOC] team a collaboration, beginning in 2010, where research teams would evaluate their algorithms on the given data set, and compete to achieve higher accuracy on several visual recognition tasks.
The resulting annual competition is now known as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The ILSVRC uses a "trimmed" list of only 1000 image categories or "classes", including 90 of the 120 dog breeds classified by the full ImageNet schema.
The 2010s saw dramatic progress in image processing.
The first competition, in 2010, had 11 participating teams. The winning entry was a linear support vector machine (SVM). Its features were a dense grid of HoG and LBP descriptors, sparsified by local coordinate coding and pooling.[https://www.image-net.org/static_files/files/ILSVRC2010_NEC-UIUC.pdf ImageNet classification: fast descriptor coding and large-scale SVM training] It achieved 52.9% classification accuracy and 71.8% top-5 accuracy, and was trained for four days on three 8-core machines (dual quad-core 2 GHz Intel Xeon CPUs).{{Cite book |last1=Lin |first1=Yuanqing |last2=Lv |first2=Fengjun |last3=Zhu |first3=Shenghuo |last4=Yang |first4=Ming |last5=Cour |first5=Timothee |last6=Yu |first6=Kai |last7=Cao |first7=Liangliang |last8=Huang |first8=Thomas |chapter=Large-scale image classification: Fast feature extraction and SVM training |date=June 2011 |title=CVPR 2011 |chapter-url=http://dx.doi.org/10.1109/cvpr.2011.5995477 |publisher=IEEE |pages=1689–1696 |doi=10.1109/cvpr.2011.5995477|isbn=978-1-4577-0394-2 }}
The second competition, in 2011, had fewer teams. The winning team, from Xerox Research Centre Europe (XRCE), was led by Florent Perronnin and Jorge Sánchez. Their system was another linear SVM, running on quantized{{Cite book |last1=Sanchez |first1=Jorge |last2=Perronnin |first2=Florent |chapter=High-dimensional signature compression for large-scale image classification |date=June 2011 |title=CVPR 2011 |chapter-url=http://dx.doi.org/10.1109/cvpr.2011.5995504 |publisher=IEEE |pages=1665–1672 |doi=10.1109/cvpr.2011.5995504|isbn=978-1-4577-0394-2 }} Fisher vectors.{{Cite book |last1=Perronnin |first1=Florent |last2=Sánchez |first2=Jorge |last3=Mensink |first3=Thomas |chapter=Improving the Fisher Kernel for Large-Scale Image Classification |series=Lecture Notes in Computer Science |date=2010 |volume=6314 |editor-last=Daniilidis |editor-first=Kostas |editor2-last=Maragos |editor2-first=Petros |editor3-last=Paragios |editor3-first=Nikos |title=Computer Vision – ECCV 2010 |chapter-url=https://link.springer.com/chapter/10.1007/978-3-642-15561-1_11 |language=en |location=Berlin, Heidelberg |publisher=Springer |pages=143–156 |doi=10.1007/978-3-642-15561-1_11 |isbn=978-3-642-15561-1}}[https://web.archive.org/web/20201027234359/http://image-net.org/challenges/LSVRC/2011/ilsvrc11.pdf "XRCE@ILSVRC2011: Compressed Fisher vectors for LSVR"], Florent Perronnin and Jorge Sánchez, Xerox Research Centre Europe (XRCE) It achieved 74.2% top-5 accuracy, corresponding to a top-5 error rate of roughly 25%.
In 2012, a deep convolutional neural net called AlexNet achieved 84.7% in top-5 accuracy, a great leap forward.{{cite web | url=https://www.image-net.org/challenges/LSVRC/2012/results | title=ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC2012) }} In the next couple of years, top-5 accuracy grew to above 90%. While the 2012 breakthrough "combined pieces that were all there before", the dramatic quantitative improvement marked the start of an industry-wide artificial intelligence boom.
In 2013, most high-ranking entries used convolutional neural networks. The winning entry for object localization was OverFeat, an architecture for simultaneous object classification and localization.{{cite arXiv |eprint=1312.6229 |last1=Sermanet |first1=Pierre |last2=Eigen |first2=David |last3=Zhang |first3=Xiang |last4=Mathieu |first4=Michael |last5=Fergus |first5=Rob |last6=LeCun |first6=Yann |title=OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks |date=2013 |class=cs.CV }}
By 2014, more than fifty institutions participated in the ILSVRC. In 2017, 29 of 38 competing teams had greater than 95% accuracy.{{cite news|last1=Gershgorn|first1=Dave|title=The Quartz guide to artificial intelligence: What is it, why is it important, and should we be afraid?|url=https://qz.com/1046350/the-quartz-guide-to-artificial-intelligence-what-is-it-why-is-it-important-and-should-we-be-afraid/|access-date=3 February 2018|work=Quartz|date=10 September 2017}} That year, ImageNet stated it would roll out a new, much more difficult challenge in 2018 involving the classification of 3D objects using natural language. Because creating 3D data is more costly than annotating a pre-existing 2D image, the dataset was expected to be smaller. The applications of progress in this area would range from robotic navigation to augmented reality.
By 2015, researchers at Microsoft reported that their CNNs exceeded human ability at the narrow ILSVRC tasks.{{cite news |last1=Markoff |first1=John |date=10 December 2015 |title=A Learning Advance in Artificial Intelligence Rivals Human Abilities |url=https://www.nytimes.com/2015/12/11/science/an-advance-in-artificial-intelligence-rivals-human-vision-abilities.html |access-date=22 June 2016 |work=The New York Times}} However, as one of the challenge's organizers, Olga Russakovsky, pointed out in 2015, the contest is over only 1000 categories; humans can recognize a larger number of categories, and also (unlike the programs) can judge the context of an image.{{cite news |last1=Aron |first1=Jacob |date=21 September 2015 |title=Forget the Turing test – there are better ways of judging AI |url=https://www.newscientist.com/article/dn28206-forget-the-turing-test-there-are-better-ways-of-judging-ai/ |access-date=22 June 2016 |work=New Scientist}}
In 2016, the winning entry was CUImage, an ensemble model of 6 networks: Inception v3, Inception v4, Inception ResNet v2, ResNet 200, Wide ResNet 68, and Wide ResNet 3.{{cite web | url=https://image-net.org/challenges/LSVRC/2016/results | title=Ilsvrc2016 }} The runner-up was ResNeXt, which combines the Inception module with ResNet.{{Cite conference |last1=Xie |first1=Saining |last2=Girshick |first2=Ross |last3=Dollar |first3=Piotr |last4=Tu |first4=Zhuowen |last5=He |first5=Kaiming |author-link5=Kaiming He |year=2017 |title=Aggregated Residual Transformations for Deep Neural Networks |url=https://openaccess.thecvf.com/content_cvpr_2017/papers/Xie_Aggregated_Residual_Transformations_CVPR_2017_paper.pdf |conference=Conference on Computer Vision and Pattern Recognition |pages=1492–1500 |arxiv=1611.05431 |doi=10.1109/CVPR.2017.634}}
In 2017, the winning entry was the Squeeze-and-Excitation Network (SENet), reducing the top-5 error to 2.251%.{{cite arXiv |eprint=1709.01507 |last1=Hu |first1=Jie |last2=Shen |first2=Li |last3=Albanie |first3=Samuel |last4=Sun |first4=Gang |last5=Wu |first5=Enhua |title=Squeeze-and-Excitation Networks |date=2017 |class=cs.CV }}
Bias in ImageNet
It is estimated that over 6% of labels in the ImageNet-1K validation set are wrong.{{Citation |last1=Northcutt |first1=Curtis G. |title=Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks |date=2021-11-07 |arxiv=2103.14749 |last2=Athalye |first2=Anish |last3=Mueller |first3=Jonas}} Another study found that around 10% of ImageNet-1K contains ambiguous or erroneous labels, and that, when presented with a model's prediction and the original ImageNet label, human annotators preferred the prediction of a state-of-the-art model from 2020 trained on the original ImageNet, suggesting that ImageNet-1K has been saturated as a benchmark.{{Citation |last1=Beyer |first1=Lucas |title=Are we done with ImageNet? |date=2020-06-12 |arxiv=2006.07159 |last2=Hénaff |first2=Olivier J. |last3=Kolesnikov |first3=Alexander |last4=Zhai |first4=Xiaohua |last5=Oord |first5=Aäron van den}}
A study of the history of the multiple layers (taxonomy, object classes and labeling) of ImageNet and WordNet in 2019 described how bias{{clarification needed|date=December 2023}} is deeply embedded in most classification approaches for all sorts of images.{{Cite magazine|url=https://www.wired.com/story/viral-app-labels-you-isnt-what-you-think/|title=The Viral App That Labels You Isn't Quite What You Think|magazine=Wired|access-date=22 September 2019|issn=1059-1028}}{{Cite news |url=https://www.theguardian.com/technology/2019/sep/17/imagenet-roulette-asian-racist-slur-selfie |title=The viral selfie app ImageNet Roulette seemed fun – until it called me a racist slur |last=Wong |first=Julia Carrie |author-link=Julia Carrie Wong |date=18 September 2019 |work=The Guardian|access-date=22 September 2019 |issn=0261-3077}}{{Cite web|url=https://www.excavating.ai/|title=Excavating AI: The Politics of Training Sets for Machine Learning|last1=Crawford|first1=Kate|last2=Paglen|first2=Trevor|date=19 September 2019|website=-|access-date=22 September 2019}}{{cite journal|last=Lyons|first=Michael|date=24 December 2020|title=Excavating "Excavating AI": The Elephant in the Gallery|doi=10.5281/zenodo.4037538 |arxiv=2009.01215}} ImageNet is working to address various sources of bias.{{Cite web|url=http://image-net.org/update-sep-17-2019.php|title=Towards Fairer Datasets: Filtering and Balancing the Distribution of the People Subtree in the ImageNet Hierarchy|date=17 September 2019|website=image-net.org|access-date=22 September 2019}}
One downside of using WordNet is that its categories may be more "elevated" than would be optimal for ImageNet: "Most people are more interested in Lady Gaga or the iPod Mini than in this rare kind of diplodocus."{{Clarify|date=August 2019}}
See also
References
{{Reflist|30em}}
External links
- {{Official website|image-net.org}}
{{Differentiable computing}}
{{Standard test item}}
Category:Computer science competitions