Unless you collected your data yourself (and it is already in the exact format you need), you may need to make some changes to the dataset before analysis or visualization. Remember to keep backups of the unedited dataset, and document any changes you make.
First, make sure you have an unedited copy somewhere safe! You can even use version control tools like Git (with hosting services such as GitHub) to save copies, ensuring you can always roll back to an earlier version.
"Data cleaning" is sometimes a contested term for its implications of deleting data. "Tidying" or "wrangling" are other terms you may hear, and it typically involves:
OpenRefine is a tool for cleaning up structured data. It works with spreadsheet formats like CSV and TSV, as well as JSON, XML, Excel, and others, and it offers powerful algorithms for detecting data-entry errors. It is a great choice for normalizing values (for example, changing "UDel" to "University of Delaware").
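To illustrate the kind of value normalization OpenRefine automates, here is a minimal Python sketch; the variant spellings and lookup table are hypothetical examples, not part of any real dataset:

```python
# Map known variant spellings to one canonical form -- the same idea
# behind OpenRefine's clustering and mass-edit features.
CANONICAL = {
    "udel": "University of Delaware",
    "u of delaware": "University of Delaware",
    "university of delaware": "University of Delaware",
}

def normalize(value: str) -> str:
    """Return the canonical spelling for a known variant, else the trimmed input."""
    return CANONICAL.get(value.strip().lower(), value.strip())

print(normalize("UDel"))       # University of Delaware
print(normalize("Unknown U"))  # Unknown U (unrecognized values pass through)
```

OpenRefine does this interactively and at scale, with clustering algorithms that suggest which values belong together; a hand-built lookup table like this is only practical for a small number of known variants.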
You can also do data cleaning by hand in a spreadsheet program like Microsoft Excel or OpenOffice Calc, or in a text editor like Notepad++. Notepad++ recognizes many file formats and is a great tool to try if you're having trouble opening a data file.
For those with coding knowledge (or those who want to learn!), programming languages like Python or R can also be used to clean up data. These are good options for working with unstructured text data. Scripts can use regular expressions (regex) to find and correct errors or make other changes.
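As a small example of regex-based cleaning in Python, the sketch below converts dates typed as "M/D/YY" into ISO format; the date pattern and the two-digit-year assumption (all years are 2000s) are hypothetical choices for illustration:

```python
import re

def clean_date(text: str) -> str:
    """Normalize 'M/D/YY' dates to ISO 'YYYY-MM-DD'; leave other values trimmed."""
    match = re.fullmatch(r"\s*(\d{1,2})/(\d{1,2})/(\d{2})\s*", text)
    if match:
        month, day, year = match.groups()
        # Assumes two-digit years all fall in the 2000s -- check your own data!
        return f"20{year}-{int(month):02d}-{int(day):02d}"
    return text.strip()

print(clean_date(" 1/5/23 "))    # 2023-01-05
print(clean_date("2023-01-05"))  # already clean; returned unchanged
```

The same pattern generalizes to any repeated data-entry inconsistency you can describe with a regular expression: find matches, capture the pieces you need, and reassemble them in a consistent form.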
Does your data format match the method or tool you want to use? You may find your chosen platform does not accept Excel files, or that you need to learn to read a JSON data format. Watch for common formatting issues as you prepare your data.
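Converting between formats is often a few lines of code. This hedged sketch uses only Python's standard library to turn a small JSON export into CSV; the records shown are made up for the example:

```python
import csv
import io
import json

# Hypothetical records exported as JSON by one tool, needed as CSV by another.
records_json = '[{"name": "Ada", "year": 1843}, {"name": "Grace", "year": 1952}]'
records = json.loads(records_json)

# Write the same records out as CSV (to an in-memory buffer here;
# use open("out.csv", "w", newline="") for a real file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "year"])
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

For Excel files specifically, third-party libraries such as pandas or openpyxl can read .xlsx directly and export to CSV in a similar way.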
Remember that the two main types of data are structured and unstructured, and your data analysis method might need one or both at different parts of the process. The Digital Scholarship and Publishing team can help you figure out what you need and when, and ensure you have the correct format for your method.
A few forms of data analysis are listed below with a brief description. The Digital Scholarship and Publishing team is available to discuss options for data analysis and figure out next steps at any point in your research.
Through data visualization, you may be able to quickly detect trends or relationships between variables. Tools for data visualization include Tableau, Excel, coding languages like Python, web-based tools, and more. See the guide on data visualization for more information.
Statistical analysis can be done with both quantitative and qualitative data, using programs like Stata, SPSS, Excel, or programming languages like R or Python. UD graduate students, faculty, and staff can consult with the StatLab on statistical analysis questions.
Computational text analysis, also called text mining, can be applied to qualitative or other free-text data. See the research guide Text Mining Methods and Tools for more information.
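One of the simplest text-mining techniques is a word-frequency count. This minimal Python sketch (the sample sentence is invented for illustration) shows the core idea using only the standard library:

```python
import re
from collections import Counter

# Hypothetical free-text passage.
text = "The data speak, and the data never lie."

# Lowercase, extract word tokens, and tally them.
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)
print(counts.most_common(2))  # [('the', 2), ('data', 2)]
```

Real text-mining projects build on this foundation with stopword removal, stemming or lemmatization, and statistical models; see the Text Mining Methods and Tools guide for tools that handle those steps.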