One of the apparent strengths of artificial intelligence is its ability to remove human bias. However, although this may be the intention, AI systems learn what they are taught, meaning that if they are not trained on robust and diverse data sets, bias can still emerge.

The challenge in training AI is clearly demonstrated by facial recognition technology. Facial recognition systems use biometrics to map facial features from an image, and then compare these with a database of known faces to find a match.
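
The matching step described above can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: it assumes each face has already been reduced to a short list of numbers, and the names `find_match` and `known_faces`, the cosine similarity measure and the 0.9 threshold are all hypothetical choices made for the example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_match(probe, known_faces, threshold=0.9):
    """Return the name of the closest known face, or None if nothing
    is similar enough. The threshold is an assumed value."""
    best_name, best_score = None, -1.0
    for name, features in known_faces.items():
        score = cosine_similarity(probe, features)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Invented feature vectors standing in for a database of known faces.
known_faces = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.2, 0.8, 0.5],
}

close_to_alice = find_match([0.88, 0.12, 0.31], known_faces)
unknown_face = find_match([0.0, 0.0, 1.0], known_faces)
```

If the feature mapping has mostly been learned from one kind of face, vectors for other faces are likely to be less reliable, and this comparison will fail more often for them.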

If the data used when training the machine learning software favours particular facial characteristics, problems arise. If, for example, a larger proportion of the data comes from people of a certain ethnicity or skin colour, the system will be better equipped to recognise certain facial features, and will struggle to recognise others.

This means that some users may encounter problems when using facial recognition. According to the New York Times, a study conducted last year by Joy Buolamwini, a researcher at the MIT Media Lab, found that Amazon’s facial analysis software can recognise the face of a white man 99% of the time. However, for darker-skinned women, the software made errors in 35% of cases, often misidentifying gender.
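
Gaps like this only become visible when errors are counted separately for each group rather than averaged over everyone. The sketch below shows one way that could be done. The function name `error_rate_by_group` and the sample records are invented for illustration and are not data from the study.

```python
from collections import defaultdict


def error_rate_by_group(records):
    """Compute the share of wrong predictions for each group.

    Each record is a (group, predicted_label, actual_label) tuple.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        if predicted != actual:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Made-up results, not figures from any real system.
records = [
    ("group_a", "male", "male"),
    ("group_a", "male", "male"),
    ("group_a", "female", "female"),
    ("group_a", "male", "male"),
    ("group_b", "male", "female"),
    ("group_b", "female", "female"),
    ("group_b", "female", "female"),
    ("group_b", "male", "male"),
]

rates = error_rate_by_group(records)
```

Here the overall error rate is one in eight, which hides the fact that every mistake falls on `group_b`.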

To combat this, data sets must be large and varied enough for the technology to learn to recognise a wide variety of faces, regardless of age, gender, ethnicity and skin tone. Errors are not only annoying for users; they also point to an unrepresentative data set.
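
A simple first check on whether a data set is representative is to look at how its images are spread across groups. The following is a minimal sketch under assumed conditions: every image carries a single group label, and the 15% cut-off in `under_represented` is an arbitrary figure chosen for the example rather than an established standard.

```python
from collections import Counter


def group_shares(labels):
    """Return each group's share of the data set as a fraction."""
    counts = Counter(labels)
    total = len(labels)
    return {group: count / total for group, count in counts.items()}


def under_represented(labels, min_share=0.15):
    """List groups whose share falls below min_share (an assumed cut-off)."""
    shares = group_shares(labels)
    return sorted(group for group, share in shares.items() if share < min_share)


# Invented labels for a tiny, skewed data set of ten images.
labels = ["a"] * 7 + ["b"] * 2 + ["c"] * 1

shares = group_shares(labels)
flagged = under_represented(labels)
```

A check like this can only flag groups that appear at least once. A group that is missing from the data altogether will not show up in the counts at all.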

The problem will only become more apparent as facial recognition software becomes more commonplace. The iPhone XR is equipped with Face ID, and many airports are expected to replace passports with biometric facial recognition in the future, which highlights the need for AI systems that are fair and accurate.

IBM’s facial recognition dataset

Today, IBM Research, the research division of the technology company, released a new, large and diverse data set called Diversity in Faces (DiF) to advance the study of accuracy in facial recognition technology.

Believed to be the first of its kind, DiF provides annotations of 1 million human facial images drawn from the publicly available YFCC-100M Creative Commons data set.

IBM annotated the faces using ten different coding schemes, which measure craniofacial features such as head length, nose length and forehead height, as well as other attributes including age and gender.

It is hoped that DiF, by offering a more balanced distribution and broader coverage of facial images than previous data sets, will improve the diversity of the data used to train AI facial recognition systems.

The dataset is now available to the global research community upon request.