Is your data science project ethical?
Serge Korzh
Reading Time: 5 minutes
There is no doubt that data is one of the most critical resources in today's world. Take an eCommerce shop as an example – every aspect of running it is data-driven. Market analysis data helps to decide whom to target, where to invest resources, and how to gain a competitive advantage. User feedback is crucial in determining how to improve products and services, and conversion tracking is necessary for a successful marketing campaign. Moreover, as the field of machine learning continually advances, the potential value of clean, structured data keeps increasing. But, as always, with great power comes great responsibility – data is not only an asset but also a liability. With that in mind, let us go over a few key points and issues with handling data.
Protect the data you have
Companies store a lot of confidential data about their users. It is common sense that this data should be stored securely and that access to it should be restricted. Yet massive leaks of private user data have become so common that they no longer surprise anyone. And a leak doesn't have to be the result of a malicious attack – sometimes it's just a small bug or the lack of a bulletproof data-access policy. Such was the case with the Google+ API, which for three years had a security flaw allowing developers to access the private data of hundreds of thousands of users.
Anonymisation is not as easy as you might think
A lot of the data stored by companies is what's called Personally Identifiable Information (PII): any information that can be uniquely attributed to a specific person. Usually, we don't want to expose our users' PII to the public, so we remove or obscure it so that it becomes non-PII – a process called de-identification.
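To make this concrete, here is a minimal sketch of a naive first pass at de-identification – dropping direct identifiers and assigning random IDs. The table and field names are hypothetical, purely for illustration:

```python
import uuid

# Hypothetical user records; the field names are illustrative only.
users = [
    {"ssn": "123-45-6789", "name": "John", "age": 24, "zip": "53715"},
    {"ssn": "987-65-4321", "name": "Jane", "age": 31, "zip": "10001"},
]

def naive_deidentify(records, direct_identifiers=("ssn", "name")):
    """Drop direct identifiers and give each record a random ID.

    This removes the obvious PII but leaves quasi-identifiers
    (age, zip) untouched – which, as we'll see next, may still
    allow re-identification.
    """
    cleaned = []
    for record in records:
        rest = {k: v for k, v in record.items() if k not in direct_identifiers}
        rest["user_id"] = uuid.uuid4().hex  # random, unlinkable identifier
        cleaned.append(rest)
    return cleaned

print(naive_deidentify(users))
```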
For example, a database record with a national identification number (e.g. an SSN in the US) is clearly PII, while a record with only a first name (say, "John") is not. But what if we also know that John's age is, for instance, 24? Well, we probably still wouldn't be able to track down a particular 24-year-old John. But what if we add a zip code? This is where things get complicated. It is easy to see how a person can be identified by an SSN or a driver's licence number; such attributes are meant to be identifying. What's rarely taken into account, however, is that information never exists in a vacuum. There are many ways in which identity can be revealed seemingly out of nowhere by correlating the data with other data sources.
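One way to quantify this risk is to check how many records share each combination of such quasi-identifiers – the intuition behind the notion of k-anonymity. A hedged sketch, reusing the hypothetical records from above:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers=("age", "zip")):
    """Return the size of the smallest group of records sharing the
    same quasi-identifier values. k == 1 means at least one record
    is unique on (age, zip) and therefore potentially re-identifiable.
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [
    {"user_id": "a1", "age": 24, "zip": "53715"},
    {"user_id": "b2", "age": 24, "zip": "53715"},
    {"user_id": "c3", "age": 24, "zip": "10001"},  # unique (age, zip) pair
]
print(k_anonymity(records))  # -> 1: someone in this table stands out
```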
As an example, in 2006, Netflix released an anonymised dataset containing movie ratings and their timestamps as part of a data science competition aiming to improve its movie recommendation system. All user-specific information was replaced with a random ID number, and there seemed to be no way of tracking down the users. Yet, shortly after, researchers from the University of Texas published a paper in which they managed to re-identify some of the data. It turned out that some Netflix users were leaving the same ratings on another movie-related website, IMDb, where ratings are shown publicly. By correlating the timestamps and scores from the Netflix dataset with the data on IMDb, the researchers could identify 99% of the users who had left at least eight ratings, even accounting for rating and date mismatches. Moreover, they showed that this can be partially achieved even without any timestamps. That is why, whenever user-related data is shared with a third party or the public, one must think very carefully about how much information is actually exposed, what can be reconstructed from it, and whether there is a risk of revealing people's identities.
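At its core, such an attack is a record-linkage problem: for each anonymised user, find the public profile whose ratings best match in movie, score, and time. Here is a simplified sketch on toy data, loosely inspired by the idea behind the paper rather than a reproduction of the authors' actual algorithm:

```python
from datetime import datetime, timedelta

# Toy anonymised records: (movie, rating, timestamp) per hidden user.
netflix = {
    "u42": [("Heat", 5, datetime(2005, 3, 1)),
            ("Alien", 4, datetime(2005, 3, 9))],
}
# Toy public records, e.g. profiles with publicly visible ratings.
imdb = {
    "john_doe": [("Heat", 5, datetime(2005, 3, 2)),
                 ("Alien", 4, datetime(2005, 3, 10))],
    "jane_roe": [("Heat", 2, datetime(2006, 1, 1))],
}

def match_score(anon_ratings, public_ratings, max_skew=timedelta(days=14)):
    """Count ratings agreeing on movie and score whose timestamps lie
    within max_skew of each other – tolerance matters, since the same
    person rarely rates a film on two sites at the same instant."""
    public = {(movie, rating): ts for movie, rating, ts in public_ratings}
    return sum(
        1 for movie, rating, ts in anon_ratings
        if (movie, rating) in public and abs(public[(movie, rating)] - ts) <= max_skew
    )

for anon_id, ratings in netflix.items():
    best = max(imdb, key=lambda profile: match_score(ratings, imdb[profile]))
    print(anon_id, "most likely corresponds to", best)
```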
Regulation is not everything
It's crucial to have regulations such as the GDPR and the CCPA enforcing the ethical and just handling of data. Yet there is one problem – the law is slow and always lags behind the real world. Technology, especially in data science, improves rapidly, while regulations take a long time to be written and put in place. As a result, it is possible to substantially misuse data while still complying with the regulations. That's precisely what happened during the Facebook–Cambridge Analytica scandal, when the private data of millions of Facebook users was collected and exploited without their consent. It's not enough to merely comply with the law – we always have to make sure that we handle data ethically, account for all possible consequences, and inform the data subjects about how their data is being used.
Bias in, bias out
There is a famous saying in computer science – "garbage in, garbage out" – pointing out that flawed or erroneous input inevitably yields useless output. Nowadays, we use massive amounts of data to train machine learning algorithms, which we then use to support decision-making in many aspects of our lives: banks use them to decide who is eligible for a loan, companies use them to rate job applications, and some governments are starting to experiment with predictive policing.
We tend to think that data never lies, that it is unbiased and objective. But in reality, data is produced, gathered, and organised by humans. If those humans have biases, these can creep into the collected datasets, and then into the models and algorithms that process those datasets. As an example, take word2vec, an ML model that captures semantic relations between words. After being trained on the Google News dataset, it learned to associate the word "she" with occupations such as "homemaker", "nurse", and "housekeeper", and "he" with "computer programmer", "doctor", and "architect". Such subtleties are quite hard to spot without a thorough analysis, yet models like this are used in countless other applications, reinforcing biases that can be decisive in many situations – from which ads you see to your creditworthiness.
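You can probe these associations yourself. A short sketch assuming the pretrained Google News vectors are available through gensim's downloader API (the word2vec-google-news-300 model, a roughly 1.6 GB download):

```python
import gensim.downloader as api

# Pretrained word2vec vectors trained on the Google News corpus.
model = api.load("word2vec-google-news-300")

# Analogy probe: "man" is to "computer_programmer" as "woman" is to ...?
# Biased training text tends to surface gendered occupations here.
print(model.most_similar(positive=["woman", "computer_programmer"],
                         negative=["man"], topn=3))

# Compare how strongly each pronoun associates with an occupation.
for pronoun in ("he", "she"):
    print(pronoun, "~ nurse:", model.similarity(pronoun, "nurse"))
```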
“We don’t realise how biased we are until we see an AI reproduce the same bias, and we see that it’s biased.”
Conclusion
It's hard to overestimate the impact of data digitisation, machine learning, and AI on our society. However, opinions differ on how much of this impact is for good – some say AI will solve all our problems, others argue it brings far more harm than good. In my opinion, however, AI isn't the one to blame. It is not machines that perpetuate bias, push us out of work, and reinforce discrimination. We must realise that AI is just another tool in our toolbox, and we alone are responsible for how that tool gets used.
If you want to learn more about the ethical issues in data science, check out this checklist from DrivenData that goes over many other aspects to keep in mind during a data science project. And in case you want to dive deeper into the topic, there is an excellent MOOC from the University of Michigan covering these and many other subjects in much greater detail.