Working with Data

Evaluating Data Sources

Remember that all data is gathered by people who make decisions about what to collect. A good way to evaluate a dataset is to look at its source. Data from non-profit or governmental organizations is generally reliable, while data from private sources or data collection firms should be examined to determine its suitability for your study. Here are some questions you can ask of a dataset:

  • Who gathered it? A group of researchers, a corporation, a government agency?
  • For what purpose was it gathered? Was it gathered to answer a specific question? Or perhaps to prove a specific observation? You cannot ask questions of a dataset that it cannot answer, so carefully consider whether the data you have found is relevant to your research question. 
  • What decisions did they make about the dataset? These could be data cleaning decisions, choices about which data to publish, or something else. Decisions already made will affect what you're able to do with the data. 
  • Are you allowed to reuse it? If so, are there privacy or ethical considerations? See the Ethics in Data Use section below. 

The answers to these questions can often be found in data documentation or by web searching. 

Ethics in Data Use

Ethical data use involves keeping an eye on privacy and reuse restrictions and interrogating how and why data was collected.

Privacy and reuse

Data can include information that is potentially harmful if made public. For example, if a social scientist collects information from people addicted to drugs and shares it without appropriately anonymizing the dataset, that could affect someone's ability to get a loan or a job, or cause family issues. Ethical data use almost always includes anonymizing data to limit these risks. Similarly, if you are reusing data that contains potentially harmful information, think about what you might be able to omit from your analysis to protect privacy.
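As a rough illustration of omitting fields to protect privacy, the Python sketch below drops directly identifying columns from a small set of records before they are shared. The column names, the sample records, and the drop_identifiers helper are all hypothetical. Removing direct identifiers is only a first step: combinations of the remaining fields can still identify a person, so treat this as a starting point rather than complete anonymization.

```python
# Hypothetical column names; replace them with the identifying fields in your own data.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def drop_identifiers(rows, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of each row without the listed identifying fields."""
    return [
        {key: value for key, value in row.items() if key not in identifiers}
        for row in rows
    ]


# Made-up sample records, used only to show the shape of the data.
survey = [
    {"name": "A. Example", "email": "a@example.org", "age_group": "25-34", "response": "yes"},
    {"name": "B. Example", "email": "b@example.org", "age_group": "35-44", "response": "no"},
]

shareable = drop_identifiers(survey)
print(shareable)
```

The original records are left unchanged, so the identifying version can be stored securely while only the reduced copy is analyzed or shared.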

Data collection

Remember that data is only as good as its collection methods, and interrogate why data was collected in a certain way. Do you notice that certain groups or factors are conspicuously missing? Could the data collection method have violated privacy?
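One simple way to check for conspicuously missing groups is to compare the categories that appear in a dataset against the categories you would expect to see. The Python sketch below assumes a hypothetical age_group field and a hypothetical list of expected groups; both are placeholders for whatever categories matter to your research question.

```python
from collections import Counter


def missing_groups(rows, field, expected):
    """List the expected categories that never appear in the given field."""
    counts = Counter(row.get(field) for row in rows)
    return sorted(group for group in expected if counts[group] == 0)


# Made-up sample records and an assumed list of expected categories.
responses = [
    {"age_group": "25-34", "response": "yes"},
    {"age_group": "35-44", "response": "no"},
    {"age_group": "25-34", "response": "no"},
]
expected_age_groups = ["18-24", "25-34", "35-44", "45-54", "55+"]

absent = missing_groups(responses, "age_group", expected_age_groups)
print(absent)
```

A check like this only shows that a group is absent. Finding out why it is absent still means going back to the data documentation and the collection method.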