Unless you collected your data yourself (and it is already in the exact format you need), you may need to make some changes to the dataset before analysis or visualization. Remember to keep backups of the unedited dataset, and document any changes you make.
First, make sure you have an unedited copy somewhere safe! You can even use version control tools like Git (with hosting services such as GitHub) to save copies, ensuring you can always roll back to an earlier version.
"Data cleaning" is sometimes a contested term for its implications of deleting data. "Tidying" or "wrangling" are other terms you may hear, and it typically involves:
OpenRefine is a tool for cleaning up structured data. It works with spreadsheet formats like CSV and TSV, as well as JSON, XML, Excel, and others, and it offers powerful algorithms for detecting data-entry errors. It is a great choice for normalizing values (for example, changing "UDel" to "University of Delaware").
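To illustrate the kind of value normalization OpenRefine automates, here is a minimal Python sketch; the variant spellings and lookup table are hypothetical examples, not part of any real dataset:

```python
# Map known variant spellings to one canonical form -- the same idea
# behind OpenRefine's clustering and mass-edit features.
CANONICAL = {
    "udel": "University of Delaware",
    "u of delaware": "University of Delaware",
    "university of delaware": "University of Delaware",
}

def normalize(value: str) -> str:
    """Return the canonical spelling for a known variant, else the trimmed input."""
    return CANONICAL.get(value.strip().lower(), value.strip())

print(normalize("UDel"))       # University of Delaware
print(normalize("Unknown U"))  # Unknown U (unrecognized values pass through)
```

OpenRefine does this interactively and at scale, with clustering algorithms that suggest which values belong together; a hand-built lookup table like this is only practical for a small number of known variants.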
You can also do data cleaning by hand in a spreadsheet program like Microsoft Excel or OpenOffice Calc, or in a text editor like Notepad++. Notepad++ recognizes many file formats and is a great tool to try if you're having trouble opening a data file.
For those with coding knowledge (or those who want to learn!), programming languages like Python or R can also be used to clean up data. These are good options for working with unstructured text data. Scripts can use regular expressions (regex) to find and correct errors or make other changes.
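As a small example of regex-based cleaning in Python, the sketch below converts dates typed as "M/D/YY" into ISO format; the date pattern and the two-digit-year assumption (all years are 2000s) are hypothetical choices for illustration:

```python
import re

def clean_date(text: str) -> str:
    """Normalize 'M/D/YY' dates to ISO 'YYYY-MM-DD'; leave other values trimmed."""
    match = re.fullmatch(r"\s*(\d{1,2})/(\d{1,2})/(\d{2})\s*", text)
    if match:
        month, day, year = match.groups()
        # Assumes two-digit years all fall in the 2000s -- check your own data!
        return f"20{year}-{int(month):02d}-{int(day):02d}"
    return text.strip()

print(clean_date(" 1/5/23 "))    # 2023-01-05
print(clean_date("2023-01-05"))  # already clean; returned unchanged
```

The same pattern generalizes to any repeated data-entry inconsistency you can describe with a regular expression: find matches, capture the pieces you need, and reassemble them in a consistent form.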
Does your data format match the method or tool you want to use? You may find your chosen platform does not accept Excel files, or that you need to learn to read a JSON data format. Watch for common formatting issues as you prepare your data.
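Converting between formats is often a few lines of code. This hedged sketch uses only Python's standard library to turn a small JSON export into CSV; the records shown are made up for the example:

```python
import csv
import io
import json

# Hypothetical records exported as JSON by one tool, needed as CSV by another.
records_json = '[{"name": "Ada", "year": 1843}, {"name": "Grace", "year": 1952}]'
records = json.loads(records_json)

# Write the same records out as CSV (to an in-memory buffer here;
# use open("out.csv", "w", newline="") for a real file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "year"])
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

For Excel files specifically, third-party libraries such as pandas or openpyxl can read .xlsx directly and export to CSV in a similar way.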
Remember that the two main types of data are structured and unstructured, and your data analysis method might need one or both at different parts of the process. The Digital Scholarship and Publishing team can help you figure out what you need and when, and ensure you have the correct format for your method.
A few forms of data analysis are listed below with a brief description. The Digital Scholarship and Publishing team is available to discuss options for data analysis and figure out next steps at any point in your research.
Through data visualization, you may be able to quickly detect trends or relationships between variables. Tools for data visualization include Tableau, Excel, coding languages like Python, web-based tools, and more. See the guide on data visualization for more information.
Statistical analysis can be done with both quantitative and qualitative data, using programs like Stata, SPSS, Excel, or programming languages like R or Python. UD graduate students, faculty, and staff can consult with the StatLab on statistical analysis questions.
Computational text analysis, also called text mining, can be applied to qualitative or other free-text data. See the research guide Text Mining Methods and Tools for more information.
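One of the simplest text-mining techniques is a word-frequency count. This minimal Python sketch (the sample sentence is invented for illustration) shows the core idea using only the standard library:

```python
import re
from collections import Counter

# Hypothetical free-text passage.
text = "The data speak, and the data never lie."

# Lowercase, extract word tokens, and tally them.
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)
print(counts.most_common(2))  # [('the', 2), ('data', 2)]
```

Real text-mining projects build on this foundation with stopword removal, stemming or lemmatization, and statistical models; see the Text Mining Methods and Tools guide for tools that handle those steps.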