Facial recognition systems are everywhere, but they depend on one essential raw material: data, and lots of it. To train their algorithms, researchers not only have to improve their models; they also need large databases to test whether the systems work. Where does this data come from? In IBM's case, as NBC News has revealed, it comes from Creative Commons-licensed images on Flickr.

IBM published a data set called ‘Diversity in Faces’ last year. It is an interesting piece of work because, instead of relying on images of famous people, it shows the wide variety of faces that exist. That helps improve how facial recognition handles, for example, different skin tones.

What was not known is that many of these images were taken from Flickr and include personal photos. So much so that many users were surprised to learn they were in the database without having given prior consent for such use.

Having photos on Flickr and ending up in the IBM database

As one affected photographer told NBC, “none of the people I photographed had any idea that their photos were going to be used in this way.” The crux of the matter is the Creative Commons license: while it permits the use of these images, it was hard to anticipate that they would be used to train facial recognition systems that could later classify faces by gender, hair color or ethnicity.

Precisely to keep facial recognition from being biased toward one type of profile or person, a sufficiently large database was used to improve accuracy. This is where the millions of photos come in.

The images were not collected directly by IBM, but by Yahoo. Specifically, the set of faces is drawn from the YFCC100M database, a set of 99.2 million Creative Commons-licensed photos compiled by Yahoo, which owned Flickr at the time.

NBC has provided a tool to check whether your photos are in the database used. Enter your Flickr username and you will get a result.

The IBM database itself is not public. However, researchers who explain their reasons can request access from IBM to work with this data set.

The ‘Diversity in Faces’ database used by IBM started from 100 million Flickr images, which were then narrowed down to one million faces in order to work with them and identify the most important features of each one. For each face it records values such as estimated age, gender, nose size, distance between the eyes and skin color: more than 200 values that its algorithms use to identify a person.

As researcher Jack Poulson comments, “it is almost impossible to get them to remove our photos. IBM requires links to those photos, but the company has not published the list of Flickr users, so it is very difficult to know whose photos are included.”

Currently, the company offers its IBM Watson Visual Recognition system, which recognizes people and estimates their age and gender, and which customers can use to identify specific people in photos or videos. It is a capability trained in part on the faces of millions of people, used without their consent. And IBM is not the only company using our photos for that purpose.

IBM gave us the following response:


We take people’s privacy very seriously and have been very careful to comply with privacy principles, including limiting the ‘Diversity in Faces’ database to publicly available image annotations and restricting access to verified researchers.
People can opt out of this database.
IBM has been committed to creating responsible, fair and trustworthy technology for more than a century and believes that it is essential to strive for fairness and accuracy in facial recognition.