23 sources of data bias for #machinelearning and #deeplearning

Author: Ajit Jaokar

In the paper A survey on bias and fairness in machine learning, the authors outline 23 types of bias in data for machine learning. The source is good, so below is a faithful summary of their list, reproduced here because I found it useful as it stands.

The full paper link is at the end of this post.


1) Historical Bias. Historical bias is the already existing bias and socio-technical issues in the world, and it can seep into the data generation process even given perfect sampling and feature selection. An example of this type of bias can be found in a 2018 image search result: searching for women CEOs returned fewer female CEO images because only 5% of Fortune 500 CEOs were women, which caused the search results to be biased towards male CEOs. These search results were of course reflecting reality, but whether or not search algorithms should reflect this reality is an issue worth considering.


2) Representation Bias. Representation bias happens because of the way we define and sample from a population. The lack of geographical diversity in datasets such as ImageNet is an example of this type of bias, demonstrating a skew towards Western countries.
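A quick way to surface this kind of skew is to audit where a dataset's examples come from and compare that to a rough expectation. The snippet below is a minimal sketch with entirely made-up region counts and expected shares (none of these numbers come from ImageNet):

```python
import pandas as pd

# Hypothetical image counts per region of origin (illustrative numbers only)
counts = pd.Series({"North America": 45_000, "Europe": 38_000,
                    "East Asia": 9_000, "South Asia": 4_000,
                    "Africa": 2_500, "South America": 1_500})
share = counts / counts.sum()
print(share.round(3))  # the sample is dominated by two regions

# Rough population-based expectation (invented for illustration), same index order
expected = pd.Series({"North America": 0.05, "Europe": 0.10, "East Asia": 0.21,
                      "South Asia": 0.25, "Africa": 0.17, "South America": 0.06})

# Flag regions whose share falls far below that expectation
under = share[share < 0.5 * expected]
print("Under-represented regions:", list(under.index))
```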


3) Measurement Bias. Measurement bias happens from the way we choose, utilize, and measure a particular feature. An example of this type of bias was observed in the recidivism risk prediction tool COMPAS, where prior arrests and friend/family arrests were used as proxy variables to measure the level of “riskiness” or “crime”, which can themselves be viewed as mismeasured proxies. This is because minority communities are controlled and policed more frequently, so they have higher arrest rates. However, one should not conclude that because people from minority groups have higher arrest rates they are therefore more dangerous, as there is a difference in how these groups are assessed and controlled.


4) Evaluation Bias. Evaluation bias happens during model evaluation. It includes the use of inappropriate and disproportionate benchmarks, such as the Adience and IJB-A benchmarks. These benchmarks are used in the evaluation of facial recognition systems, and their skew with respect to skin color and gender makes them examples of this type of bias.
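One practical way to catch this kind of problem is to report metrics per subgroup rather than only a single aggregate score. Below is a minimal sketch on synthetic predictions; the group labels and error rates are invented purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic test set: predictions are much less accurate for the smaller group "B"
n = 1000
group = rng.choice(["A", "B"], size=n, p=[0.9, 0.1])
y_true = rng.integers(0, 2, size=n)
flip = rng.random(n) < np.where(group == "A", 0.05, 0.30)  # per-group error rates
y_pred = np.where(flip, 1 - y_true, y_true)

df = pd.DataFrame({"group": group, "correct": y_true == y_pred})
print("Overall accuracy:", df["correct"].mean().round(3))      # looks fine in aggregate
print(df.groupby("group")["correct"].mean().round(3))          # per-group view reveals the gap
```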

5) Aggregation Bias. Aggregation bias happens when false conclusions are drawn about a subgroup based on observing other, different subgroups, or more generally when false assumptions about a population affect the model’s outcome and definition. An example of this type of bias can be seen in clinical aid tools. Consider diabetes patients, who show apparent differences across ethnicities and genders; in particular, HbA1c levels, which are widely used to diagnose and monitor diabetes, differ in complicated ways across genders and ethnicities. Because these factors carry different meanings and importance across sub-groups and populations, a single model is most probably not best suited for all groups in a population, even when they are represented equally in the training data. Any general assumption about different populations can result in aggregation bias.
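A simple way to see the effect is to fit one pooled model and separate per-group models on synthetic data in which the feature/outcome relationship differs by group. Everything below is invented for illustration and has nothing to do with HbA1c specifically:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two groups in which the same feature relates to the outcome with opposite slopes
x_a = rng.uniform(0, 10, 200); y_a = 2.0 * x_a + rng.normal(0, 1, 200)
x_b = rng.uniform(0, 10, 200); y_b = -2.0 * x_b + 20 + rng.normal(0, 1, 200)

# A single pooled fit averages the two relationships away
x_all = np.concatenate([x_a, x_b]); y_all = np.concatenate([y_a, y_b])
slope_pooled = np.polyfit(x_all, y_all, 1)[0]
slope_a = np.polyfit(x_a, y_a, 1)[0]
slope_b = np.polyfit(x_b, y_b, 1)[0]
print(f"pooled slope {slope_pooled:.2f}, group A {slope_a:.2f}, group B {slope_b:.2f}")
# The pooled slope (close to 0) describes neither group well, even with equal representation.
```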

6) Population Bias. Population bias arises when statistics, demographics, representativeness, and user characteristics of the user population represented in the dataset or platform differ from those of the original target population. An example of this type of bias can arise from differing user demographics on different social platforms, such as women being more likely to use Pinterest, Facebook, and Instagram, while men are more active in online forums like Reddit or Twitter. More such examples and statistics on social media use among young adults by gender, race, ethnicity, and parental educational background can be found in the work cited by the paper.


7) Simpson’s Paradox. Simpson’s paradox can bias the analysis of heterogeneous data composed of subgroups or individuals with different behaviors. According to Simpson’s paradox, a trend, association, or characteristic observed in the underlying subgroups may be quite different from the one observed when these subgroups are aggregated. One of the better-known examples of this paradox arose during the gender bias lawsuit over university admissions at UC Berkeley. After analyzing graduate school admissions data, it appeared that there was bias against women, a smaller fraction of whom were being admitted to graduate programs compared to their male counterparts. However, when the admissions data was separated and analyzed by department, women applicants had parity and in some cases even a small advantage over men. The paradox arose because women tended to apply to departments with lower admission rates for both genders. Simpson’s paradox has been observed in a variety of domains, including biology, psychology, astronomy, and computational social science.
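The reversal is easy to reproduce with a small worked example. The department names and counts below are invented to mimic the Berkeley pattern; they are not the real admissions figures:

```python
import pandas as pd

# Hypothetical admissions counts (not the real Berkeley data)
df = pd.DataFrame({
    "dept":     ["Easy", "Easy", "Hard", "Hard"],
    "gender":   ["men",  "women", "men", "women"],
    "applied":  [800,     100,    200,    700],
    "admitted": [480,      65,     30,    110],
})

df["rate"] = df["admitted"] / df["applied"]
print(df[["dept", "gender", "rate"]])        # women do as well or better within each department

agg = df.groupby("gender")[["applied", "admitted"]].sum()
print(agg["admitted"] / agg["applied"])      # yet the aggregated rate favours men,
                                             # because women applied mostly to the "Hard" department
```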

8) Longitudinal Data Fallacy. Observational studies often treat cross-sectional data as if it were longitudinal, which may create biases due to Simpson’s paradox. As an example, analysis of bulk Reddit data revealed that comment length decreased over time on average. However, bulk data represented a cross-sectional snapshot of the population, which in reality contained different cohorts who joined Reddit in different years. When data was disaggregated by cohorts, the comment length within each cohort was found to increase over time.
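A toy version of the Reddit finding can be reproduced by mixing cohorts of different sizes and then disaggregating. All the numbers below are invented:

```python
import pandas as pd

# Hypothetical per-cohort averages: each cohort writes longer comments as it ages,
# but later (much larger) cohorts start out writing shorter ones.
rows = [
    # (cohort, year, users, mean_length)
    ("2010", 2010,   10, 100), ("2010", 2011,  10, 110), ("2010", 2012, 10, 120),
    ("2011", 2011,  100,  60), ("2011", 2012, 100,  70),
    ("2012", 2012, 1000,  30),
]
df = pd.DataFrame(rows, columns=["cohort", "year", "users", "mean_length"])

# Cross-sectional view: the user-weighted average length per calendar year declines...
df["total_chars"] = df["mean_length"] * df["users"]
per_year = df.groupby("year")["total_chars"].sum() / df.groupby("year")["users"].sum()
print(per_year.round(1))   # 2010: 100.0, 2011: 64.5, 2012: 34.4

# ...while the longitudinal view shows each cohort's comment length increasing over time
print(df.pivot(index="year", columns="cohort", values="mean_length"))
```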

9) Sampling Bias. Sampling bias arises due to non-random sampling of subgroups. As a consequence of sampling bias, the trends estimated for one population may not generalize to data collected from a new population. For intuition, the paper gives a figure (not reproduced here) in which the data contains several subgroups: if the study is repeated and one of the subgroups is sampled more frequently than the rest, the positive trend found by the regression model in the first study almost completely disappears, although the subgroup trends are unaffected.
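This effect is straightforward to simulate: subgroups whose within-group trend is negative can still produce a positive pooled trend when sampled evenly, and oversampling one subgroup changes the pooled estimate even though every within-group trend stays the same. The data below is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_group(center_x, center_y, n):
    """Subgroup with a negative within-group slope around its own center."""
    x = center_x + rng.uniform(-0.5, 0.5, n)
    y = center_y - 1.0 * (x - center_x) + rng.normal(0, 0.1, n)
    return x, y

def pooled_slope(sizes):
    xs, ys = [], []
    for (cx, cy), n in zip([(1, 1), (2, 2), (3, 3)], sizes):
        x, y = make_group(cx, cy, n)
        xs.append(x); ys.append(y)
    return np.polyfit(np.concatenate(xs), np.concatenate(ys), 1)[0]

print("balanced sampling, pooled slope:", round(pooled_slope([100, 100, 100]), 2))
print("one group oversampled, pooled slope:", round(pooled_slope([100, 100, 5000]), 2))
# Within-group slopes are always about -1; only the sampling scheme changed.
```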


10) Behavioral Bias. Behavioral bias arises from different user behavior across platforms, contexts, or datasets. An example of this type of bias appears in work cited by the paper showing how differences in emoji representations among platforms can result in different reactions and behavior from people, sometimes even leading to communication errors.


11) Content Production Bias. Content production bias arises from structural, lexical, semantic, and syntactic differences in the content generated by users. An example of this type of bias is the difference in use of language across gender and age groups, discussed in work cited by the paper. Differences in language use can also be seen across and within countries and populations.

12) Linking Bias. Linking bias arises when network attributes obtained from user connections, activities, or interactions differ from and misrepresent the true behavior of the users. The paper cites work showing how social networks can be biased toward low-degree nodes when only the links in the network are considered and the content and behavior of the users in the network are ignored.

Temporal Bias. Temporal bias arises from differences in populations and behaviors over time. An example can be observed on Twitter, where people talking about a particular topic start using a hashtag at some point to capture attention, then continue the discussion about the event without using the hashtag.


13) Popularity Bias. Items that are more popular tend to be exposed more. However, popularity metrics are subject to manipulation, for example by fake reviews or social bots. This type of bias can be seen in search engines or recommendation systems, where popular items are presented to the public more often; that greater exposure may not be a result of better quality but of other, biased factors.

14) Algorithmic Bias. Algorithmic bias is when the bias is not present in the input data and is added purely by the algorithm.

15) User Interaction Bias. User interaction bias is a type of bias that is not only observed on the Web but is also triggered by two sources: the user interface, and the users themselves through their own self-selected, biased behavior and interaction. This type of bias can be influenced by other types and subtypes, such as presentation and ranking biases.

16) Presentation Bias. Presentation bias is a result of how information is presented. For example, on the Web users can only click on content that they see, so the seen content gets clicks while everything else gets none, and users may never see all the information that is available.

Ranking Bias. The assumption that top-ranked results are the most relevant and important means they attract more clicks than other results. This bias affects search engines and crowdsourcing applications.

17) Social Bias. Social bias happens when other people’s actions, or content coming from them, affect our judgment. An example of this type of bias is a case where we want to rate or review an item with a low score but, influenced by other high ratings, we change our score, thinking that perhaps we are being too harsh.

18) Emergent Bias. Emergent bias happens as a result of use and interaction with real users. This bias arises as a result of change in population, cultural values, or societal knowledge, usually some time after the completion of design. This type of bias is more likely to be observed in user interfaces, since interfaces tend to reflect the capacities, characteristics, and habits of prospective users by design.

19) Self-Selection Bias. Self-selection bias is a subtype of selection or sampling bias in which subjects of the research select themselves. An example of this type of bias can be observed in situations where survey takers decide for themselves whether they can appropriately participate in a study. For instance, in a survey about smart or successful students, some less successful students might consider themselves successful enough to take the survey, which would then bias the outcome of the analysis. In fact, the chances of this happening are high, because the more successful students probably would not spend time filling out surveys, which increases the risk of self-selection.
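A small simulation makes the distortion concrete: if the probability of answering the survey rises with the very quantity being measured, the respondent average overstates the population average. All the numbers and the 0–100 "success score" scale are invented:

```python
import numpy as np

rng = np.random.default_rng(3)

# Population "success score" for 100,000 students (invented scale 0-100)
score = rng.normal(60, 15, 100_000).clip(0, 100)

# Students who consider themselves successful are more likely to respond
p_respond = np.clip((score - 40) / 100, 0.02, 0.6)
responded = rng.random(score.size) < p_respond

print("population mean:", round(score.mean(), 1))
print("respondent mean:", round(score[responded].mean(), 1))  # noticeably higher than the population mean
```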

20) Omitted Variable Bias. Omitted variable bias occurs when one or more important variables are left out of the model. An example would be when someone designs a model to predict, with relatively high accuracy, the annual percentage rate at which customers will stop subscribing to a service, but soon observes that the majority of users are cancelling their subscriptions without any warning from the designed model. Now imagine that the reason for the cancellations is the appearance of a strong new competitor in the market offering the same solution for half the price. The appearance of the competitor was something the model was not ready for; it is therefore an omitted variable.
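A rough sketch of that scenario: a churn model fitted before the competitor appears never sees that variable, so it underestimates churn once the omitted factor kicks in. Everything below, including the churn probabilities, is simulated and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000

satisfaction = rng.uniform(0, 1, n)

def churn_prob(satisfaction, competitor_present):
    # True process: churn depends on satisfaction AND on the competitor's presence
    return np.clip(0.4 - 0.3 * satisfaction + 0.35 * competitor_present, 0, 1)

# "Model": the churn rate implied by data collected before the competitor existed
pre_churn = rng.random(n) < churn_prob(satisfaction, competitor_present=0)
predicted_rate = pre_churn.mean()

# Reality after the competitor launches (the omitted variable switches on)
post_churn = rng.random(n) < churn_prob(satisfaction, competitor_present=1)

print("predicted annual churn rate:      ", round(predicted_rate, 3))
print("observed churn after competitor:  ", round(post_churn.mean(), 3))
```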

21) Cause-Effect Bias. Cause-effect bias can happen as a result of the fallacy that correlation implies causation. An example of this type of bias can be observed when a data analyst in a company wants to analyze how successful a new loyalty program is. The analyst sees that customers who signed up for the loyalty program spend more money in the company’s e-commerce store than those who did not. It would be problematic to jump immediately to the conclusion that the loyalty program is successful, since it might be that only the more committed or loyal customers, who might have planned to spend more money anyway, are interested in the loyalty program in the first place. This type of bias can have serious consequences because of the roles it can play in sensitive decision-making policies.
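The loyalty-program scenario can be simulated directly: if an unobserved "commitment" trait drives both sign-up and spending, a raw comparison shows a large gap even when the program itself has zero causal effect. All the parameters below are invented:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000

# Hidden confounder: customer commitment
commitment = rng.uniform(0, 1, n)

# Committed customers are more likely to join the loyalty program...
joined = rng.random(n) < 0.2 + 0.6 * commitment

# ...and they also spend more, regardless of the program (zero causal effect here)
spend = 100 + 200 * commitment + rng.normal(0, 20, n)

print("avg spend, members:    ", round(spend[joined].mean(), 1))
print("avg spend, non-members:", round(spend[~joined].mean(), 1))
# The gap reflects the confounder, not the program's effect.
```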

22) Observer Bias. Observer bias happens when researchers subconsciously project their expectations onto the research. This type of bias can occur when researchers (unintentionally) influence participants during interviews and surveys, or when they cherry-pick participants or statistics that favor their research.

23) Funding Bias. Funding bias arises when biased results are reported in order to support or satisfy the funding agency or financial supporter of a research study. For example, it manifests when employees of a company report biased results in their data and statistics in order to keep the funding agencies or other parties satisfied.

Paper link: A survey on bias and fairness in machine learning by Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan

Image source: Genetic Literacy Project and Dogtown Media
