MegaPixels
MegaFace Dataset
Images from the MegaFace face recognition training and benchmarking dataset

MegaFace

MegaFace is a large-scale public face recognition training dataset that serves as one of the most important benchmarks for commercial face recognition vendors. It includes 4,753,320 faces of 672,057 identities from 3,311,471 photos downloaded from 48,383 Flickr users' photo albums. All photos included a Creative Commons license, but most were not licensed for commercial use.

Oct 11: New York Times investigates MegaFace: How Photos of Your Kids Are Powering Surveillance Technology

This analysis explores how the MegaFace face recognition dataset exploited the good intentions of Flickr users and the Creative Commons license system to advance the facial recognition technologies of companies around the world, including Alibaba, Amazon, Google, CyberLink, IntelliVision, N-TechLab (FindFace.pro), Mitsubishi, Orion Star Technology, Philips, Samsung 1, SenseTime, Sogou, Tencent, and Vision Semantics, to name only a few. According to the press release from the University of Washington, "more than 300 research groups [were] working with MegaFace" as of 2016.

To understand which licenses were applied to the images in the MegaFace dataset, we analyzed the metadata for all 3,311,471 images from 48,383 Flickr accounts and found that 69% (2,284,369) of the images prohibited commercial use, while only 31% (1,027,102) allowed it. But all 3,311,471 images required some form of attribution, and none was provided by the MegaFace dataset or by any of the research projects that used it. If enforced, this would amount to 3,311,471 violations of Creative Commons licenses for each commercial use of the dataset.

MegaFace Dataset Creative Commons Licenses

Creative Commons License                         Images           Definition
BY (Attribution)                                 540,073 (16.3%)  creativecommons.org/licenses/by/2.0/
BY-ND (No-Derivs)                                179,759 (5.4%)   creativecommons.org/licenses/by-nd/2.0/
BY-SA (Attribution ShareAlike)                   307,270 (9.3%)   creativecommons.org/licenses/by-sa/2.0/
BY-NC (Attribution-NonCommercial)                433,861 (13.1%)  creativecommons.org/licenses/by-nc
BY-NC-SA (Attribution-NonCommercial-ShareAlike)  960,331 (29%)    creativecommons.org/licenses/by-nc-sa/2.0/
BY-NC-ND (Attribution-NonCommercial-NoDerivs)    890,177 (26.9%)  creativecommons.org/licenses/by-nc-nd/2.0/
No commercial use allowed                        2,284,369 (69%)
Commercial use allowed                           1,027,102 (31%)
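
The totals in this breakdown can be re-derived from the per-license counts with simple arithmetic. The Python sketch below is only an illustration: the counts are copied from the table, while the rule that a license prohibits commercial use when its code contains "NC", and all variable and function names, are assumptions made for this example rather than the method used in the original analysis.

```python
# Per-license image counts, copied from the table above.
LICENSE_COUNTS = {
    "BY": 540_073,
    "BY-ND": 179_759,
    "BY-SA": 307_270,
    "BY-NC": 433_861,
    "BY-NC-SA": 960_331,
    "BY-NC-ND": 890_177,
}


def summarize(counts):
    """Return (total, non_commercial, commercial) image counts."""
    total = sum(counts.values())
    # Assumption: a license prohibits commercial use if "NC" is one of
    # the hyphen-separated parts of its code.
    non_commercial = sum(
        n for code, n in counts.items() if "NC" in code.split("-")
    )
    commercial = total - non_commercial
    return total, non_commercial, commercial


def percent(part, whole):
    """Share of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


total, non_commercial, commercial = summarize(LICENSE_COUNTS)
print(f"Total images:              {total:,}")
print(f"No commercial use allowed: {non_commercial:,} "
      f"({percent(non_commercial, total)}%)")
print(f"Commercial use allowed:    {commercial:,} "
      f"({percent(commercial, total)}%)")
```

Summing the six rows reproduces the 3,311,471 images reported above, split into 2,284,369 non-commercial and 1,027,102 commercially usable images.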

Defining commercial use of training data is still a gray area. But the intent of the dataset is clear. According to the research paper introducing the dataset, the motivation for creating MegaFace was commercial in nature: "let's say one wishes to create an application that uses the best face recognition algorithm out there, how would they know which algorithm is better to implement or buy?" 1 In other words, how can commercial face recognition vendors prove their product is superior? Simple: they compete in open challenges, using the MegaFace dataset as a baseline for comparison with other algorithms, and then advertise the results.

According to BiometricUpdate.com, a news website for the biometrics industry, the MegaFace dataset has now become "one of the most reliable and popular frameworks of reference in assessing facial recognition performance, particularly on a massive scale". It frequently appears in press releases and promotional material for top facial recognition vendors.

 144 of 4,753,520 face images from the MegaFace face recognition dataset

Origins

The story of the MegaFace dataset begins in 2004, the year Flickr began offering free online photo sharing to Internet users. From the beginning, Flickr recommended and promoted Creative Commons (CC) licenses as a way to facilitate sharing and reposting images. Featured images on their homepage prominently displayed CC licensing, the majority of their licensing options were CC, and later they provided unlimited free hosting for images that used CC licenses.

Their strategy worked. By 2010 Flickr had surpassed 100 million CC-licensed images.

Assuming that both creators and users understood the licensing agreement, this was a huge success. Photographers gained an audience, families could easily share images, publishers had access to free content, and the Internet was a better place.

But the assumptions around sharing were changing. Three years later, a group of researchers from Lawrence Livermore National Laboratory, Berkeley, Yahoo Labs, and In-Q-Tel's Lab41 realized this valuable resource could also be shared for science. Then in 2014 they released the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M). It was at the time, and remains today, the largest public multimedia collection ever released, containing 99.2 million photos and 800,000 videos, along with their user-generated metadata. Their intention was to provide "publicly shareable and legally usable data that is flexible and rich enough to promote advancement [...] in achieving research growth and facilitating synergy within the research community". 3 Their strategy seemed to work, too.

In 2015 researchers at the University of Washington tapped YFCC100M to create the MegaFace face recognition dataset. All 4,753,320 annotated faces, from 3,311,471 images, in the MegaFace dataset were derived from the original YFCC100M dataset. The only public dataset with a comparable number of images was Microsoft Research's MS-Celeb-1M dataset. Incidentally, MS-Celeb has since been withdrawn following our joint investigation with the Financial Times.

In 2017, one year after the release of MegaFace, SenseTime Limited (CN) funded a new derivative dataset based on the original MegaFace dataset. Their new dataset, called MegaAge, was used to study facial age analysis. Then again in 2018, MegaFace was used to create another face dataset, called TinyFace, for the purpose of studying face recognition on low resolution imagery, such as CCTV. And yet again in 2019, the MegaFace dataset was used to create another face recognition dataset called DiveFace by a group called SensitiveNets from Madrid, which aims to "train unbiased and discrimination-aware face recognition algorithms".

Not only does MegaFace appear in an ever-growing list of research projects and derivative datasets funded by and used by giant technology companies, it also appears in patents. A 2018 patent from China called "Deep learning-based face recognition and face verification supervised learning method" (patent number CN108256450A) claims that "experimental data sets of the present invention comprises a largest face recognition database MegaFace Challenge" and that "The method of the present experiment only on MegaFace database to a data set of three gallery to test the proposed model of the present method." The figures in the patent publication even include images from the MegaFace dataset.

 [0035] FIG. 6 (text auto-translated) "is a schematic MegaFace some sample data set in a face image"

Despite the widespread exploitation of non-commercial Creative Commons licenses in the MegaFace dataset, it remains available to download at http://megaface.cs.washington.edu. Below we provide a list of verified research projects that have used the dataset in their academic, commercial, and defense research. The list supports the claim that the MegaFace dataset not only violates the privacy rights of those who did not consent to being added to a face recognition dataset, but also ignores the intellectual property rights of all image holders.

Who used MegaFace Dataset?

The bar chart below presents a ranking of the top countries where dataset citations originated. These charts show at most the top 10 countries.

Information Supply Chain

To help understand how the MegaFace dataset has been used around the world by commercial, military, and academic organizations, existing publicly available research citing the dataset was collected, verified, and geocoded to show how AI training data has proliferated around the world.

Citation data is collected using SemanticScholar.org, then dataset usage is verified and geolocated. Citations are used to provide an overview of how and where images were used.

Dataset Citations

The dataset citations used in the visualizations were collected from Semantic Scholar, a website which aggregates and indexes research papers. Each citation was geocoded using names of institutions found in the PDF front matter, or as listed on other resources. These papers have been manually verified to show that researchers downloaded and used the dataset to train or test machine learning algorithms. If you use our data, please cite our work.
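
The country ranking described above amounts to counting verified, geocoded citations per country and keeping at most the top 10. A minimal Python sketch of that tally follows. The record layout, the function names, and the placeholder records are hypothetical and exist only to make the example runnable; they are not real citation data and do not reflect the actual pipeline used for this analysis.

```python
from collections import Counter

# Placeholder records for illustration only; NOT real citation data.
# Assumed layout: one dict per verified paper, with the country of the
# geocoded institution and the publication year.
citations = [
    {"title": "Paper 1", "country": "Country A", "year": 2017},
    {"title": "Paper 2", "country": "Country A", "year": 2018},
    {"title": "Paper 3", "country": "Country A", "year": 2018},
    {"title": "Paper 4", "country": "Country B", "year": 2018},
    {"title": "Paper 5", "country": "Country B", "year": 2019},
    {"title": "Paper 6", "country": "Country C", "year": 2019},
]


def top_countries(records, limit=10):
    """Rank countries by number of citing papers, keeping at most `limit`."""
    counts = Counter(record["country"] for record in records)
    return counts.most_common(limit)


def yearly_totals(records, country):
    """Count one country's citing papers per year, sorted by year."""
    years = Counter(
        record["year"] for record in records if record["country"] == country
    )
    return dict(sorted(years.items()))


for country, count in top_countries(citations):
    print(country, count, yearly_totals(citations, country))
```

Keeping the ranking and the yearly breakdown as separate functions mirrors the two views mentioned in the text: an overall ranking of countries and per-year totals within each country.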

Supplementary Information

Estimated Age Distribution

Age distribution was estimated by analyzing all faces in the dataset using a pre-trained neural network. Faces were detected automatically, so the results may include additional faces appearing next to an annotated face, or may skip false faces that were erroneously included as part of the original dataset. These numbers are provided as an estimate and not as a factual representation of the exact age of all faces in this dataset.

Cite Our Work

If you find this analysis helpful, please cite our work:

@online{megapixels,
  author = {Harvey, Adam and LaPlace, Jules},
  title = {MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets},
  year = 2019,
  url = {https://megapixels.cc/},
  urldate = {2019-04-18}
}

Citing MegaFace

If you use any data from the MegaFace dataset, please cite their work as:

@article{Nech2017LevelPF,
 title={Level Playing Field for Million Scale Face Recognition},
 author={Aaron Nech and Ira Kemelmacher-Shlizerman},
 journal={2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
 year={2017},
 pages={3406-3415}
}

References