Is your data science project ethical?

There is no doubt that data is one of the most important resources in today’s world. Take an eCommerce shop as an example: almost every aspect of running it is data-driven. Market analysis helps determine whom to target, where to invest resources, and how to gain a competitive advantage; user feedback and reviews are crucial for deciding how to improve products and services; conversion tracking is necessary for a successful marketing campaign. The list could go on. Moreover, as the field of machine learning advances, the potential value of clean, structured data (which a sizeable eCommerce shop has in abundance) keeps increasing. But, as always, with great power comes great responsibility: data is not only an asset, but also a liability. With that in mind, let us go over a few key points and issues in handling data.

Protect the data you have

Companies store a lot of confidential data about their users, and, needless to say, that data should be stored securely, with access to it restricted. Yet massive leaks of private user data have become so common that such stories no longer surprise anyone. And a leak doesn’t have to be the result of a malicious attack: sometimes it’s just an overlooked bug or the lack of a bulletproof data-access policy. Such was the case with the Google+ API, which for three years had a security flaw allowing developers to access the private data of hundreds of thousands of users.

Anonymization is not as easy as you might think

A lot of the data stored by companies is what’s called Personally Identifiable Information (PII): any information that can be uniquely attributed to a particular person. The process of transforming PII into non-PII is called de-identification.

For example, a database record with a national identification number (e.g. an SSN in the US) is obviously PII, while a record with only a first name (say, “John”) is not. But what if we also know that John’s age is, for instance, 24? We still probably wouldn’t be able to track down a particular 24-year-old John. But what if we add a zip code? This is where things get complicated. It is easy to see how a person can be identified by an SSN or a driver’s license number; such attributes are meant to be identifying. What’s rarely taken into account, however, is that information never exists in a vacuum: identity can be revealed seemingly out of nowhere by correlating the data with other data sources.
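To make this concrete, here is a minimal sketch, using made-up toy records rather than any real data, of how each extra quasi-identifier shrinks the group of people a record could belong to. A group of size 1 means that attribute combination pins down a single record:

```python
from collections import Counter

# Hypothetical toy records: (name, age, zip_code). Not real people.
records = [
    ("John", 24, "10001"),
    ("John", 24, "10002"),
    ("Mary", 31, "10001"),
    ("Mary", 31, "10001"),
]

def smallest_group(records, key):
    """Size of the smallest group of records sharing the same
    quasi-identifier values. A result of 1 means some combination
    uniquely identifies a person (the data is not even 2-anonymous
    with respect to that key)."""
    groups = Counter(key(r) for r in records)
    return min(groups.values())

# Name and age alone still leave at least two matching records...
print(smallest_group(records, key=lambda r: (r[0], r[1])))      # 2
# ...but adding the zip code narrows one combination to a single record.
print(smallest_group(records, key=lambda r: (r[0], r[1], r[2])))  # 1
```

This is the intuition behind k-anonymity: a release is considered safer when every quasi-identifier combination is shared by at least k records.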

As an example, in 2006, Netflix released an anonymised dataset containing movie ratings and their timestamps, as part of a data science competition aiming to improve their movie recommendation system. With all user-specific information replaced by a random ID number, there seemed to be no way of tracking down the users. Yet, shortly after, researchers from the University of Texas published a paper in which they were able to re-identify some of the data. It turned out that some Netflix users were leaving the same ratings on another movie-related website, IMDb, where the ratings were shown publicly. By correlating the timestamps and ratings of a user ID from the Netflix dataset with the data on IMDb, the researchers could identify 99% of the users who had left at least 8 ratings, even accounting for rating and date mismatches. Moreover, they showed that this can partially be achieved even without any timestamps. That is why, whenever user-related data is shared with a third party or opened to the public, one must think very carefully about how much information is actually being shared, what can be reconstructed from it, and whether there is a risk of exposing people’s identities.
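The core of such a linkage attack is simple to sketch. Below is a toy illustration in the spirit of the Netflix/IMDb study; all usernames, movie labels, ratings, and dates are invented for the example, and the matching rule (same rating within a few days) is a deliberate simplification of the statistical matching the researchers actually used:

```python
# Pseudonymous ID -> {movie: (rating, day)} — the "anonymised" release.
anonymized = {
    "u1": {"A": (5, 10), "B": (1, 12), "C": (4, 30)},
    "u2": {"A": (2, 11), "D": (5, 15)},
}

# Public username -> {movie: (rating, day)} — e.g. a public profile elsewhere.
public = {
    "alice": {"A": (5, 10), "C": (4, 31)},  # near-match for u1
    "bob":   {"D": (3, 2)},
}

def match_score(anon, pub, max_day_gap=3):
    """Count movies rated identically, within a few days, in both datasets."""
    score = 0
    for movie, (rating, day) in pub.items():
        if movie in anon:
            a_rating, a_day = anon[movie]
            if a_rating == rating and abs(a_day - day) <= max_day_gap:
                score += 1
    return score

def best_match(anon_record, public_profiles):
    """Re-identify a pseudonymous record as the best-scoring public profile."""
    return max(public_profiles,
               key=lambda name: match_score(anon_record, public_profiles[name]))

print(best_match(anonymized["u1"], public))  # alice
```

Even with rating and date mismatches (note "C" is rated a day apart in the two datasets), the pseudonymous record "u1" is linked back to the public profile, which is exactly why auxiliary public data defeats naive anonymisation.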

Regulation is not everything

While it’s crucial to have regulations such as the GDPR and the CCPA enforcing ethical and just handling of data, there is one problem: the law is slow and almost always lags behind the real world. Technology, especially in data science, improves rapidly, while regulations take a long time to be written and put in place. As a result, it is entirely possible to substantially misuse data while still complying with the regulations, which is, for example, what happened during the Facebook–Cambridge Analytica scandal, when the private data of millions of Facebook users was collected and exploited without their consent. Therefore, it’s not enough to merely comply with the law: we have to make sure that the data is handled ethically, that we have accounted for the consequences, and that the subjects of the data are informed of exactly how it is being used.

Bias in, bias out

There is a famous saying in computer science, “Garbage in, garbage out”, pointing out that flawed and erroneous input inevitably yields useless output. Nowadays, we use massive amounts of data to train machine learning algorithms, which in turn support decision making in many aspects of our lives: banks use them to decide who is eligible for a loan, companies use them to rate job applications, and some governments are starting to experiment with them in predictive policing.

We tend to think that data never lies, that it is unbiased and objective. But in reality, data is produced, gathered, and organised by humans, and if those humans have biases, the biases can creep into the collected datasets, and from there into the models and algorithms that process those datasets. As an example, take word2vec, an ML model that captures semantic relations between words. After being trained on the Google News dataset, it learned to associate the word “she” with occupations such as “homemaker”, “nurse”, and “housekeeper”, and the word “he” with “computer programmer”, “doctor”, and “architect”. Such subtleties are quite hard to spot without thorough analysis, yet models like this are used in countless downstream applications, reinforcing biases that might be decisive in many cases, such as which ads one sees or even one’s creditworthiness.
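The associations above are typically surfaced with analogy arithmetic over word vectors. Here is a minimal sketch using tiny hand-crafted vectors (illustrative stand-ins, not real word2vec embeddings) where one dimension happens to correlate with gender, the way it can in embeddings trained on biased text:

```python
import numpy as np

# Toy 2-D "embeddings": dim 0 loosely encodes a gender direction,
# dim 1 an occupation direction. These are invented for illustration.
vectors = {
    "he":         np.array([ 1.0, 0.0]),
    "she":        np.array([-1.0, 0.0]),
    "programmer": np.array([ 0.9, 1.0]),  # skewed toward the "he" direction
    "nurse":      np.array([-0.9, 1.0]),  # skewed toward the "she" direction
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy query: "he is to programmer as she is to ...?"
# In a biased space, programmer - he + she lands closest to "nurse".
query = vectors["programmer"] - vectors["he"] + vectors["she"]
candidates = ["programmer", "nurse"]
answer = max(candidates, key=lambda w: cosine(query, vectors[w]))
print(answer)  # nurse
```

No single training example says “nurses are women”; the association emerges from aggregate co-occurrence statistics, which is precisely why it is hard to spot without deliberately probing the model.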

“We don’t realise how biased we are until we see an AI reproduce the same bias, and we see that it’s biased.”


It’s hard to overestimate the impact of data digitization, machine learning, and AI on our society. Opinions on how much of that impact is for the good, however, differ vastly: some say AI will solve all our problems, others argue it brings far more harm than good. In my opinion, though, AI isn’t the one to blame. It is not machines that perpetuate bias, deprive us of work, and reinforce discrimination. We must realize that AI is just another tool in our toolbox, and we alone are responsible for how the tool gets used.

If you want to learn more about ethical issues in data science, consider checking out the checklist from DrivenData, which goes over many other potential problems and things to keep in mind during a data science project. And if you would like to dive even deeper into the topic of Data Science Ethics, there is an excellent course on edX from the University of Michigan covering these and many other topics in much greater detail.


Serge Korzh

Data Scientist

I believe data is the key to solving many of the world’s biggest problems. My journey into Data Science started from Natural Language Processing, which, in turn, grew from my passion for linguistics. Currently my interest lies in applications of Machine Learning in health and development of sustainable solutions.