Research Data Management: File Formats

This guide provides information on preparing a data management plan.

Why are file formats important?

The file formats in which you record, store, and transmit your data will have a large impact on the ability of others to use your data in the future. Because technology changes rapidly, researchers should always consider the possibility of both hardware and software obsolescence. How will your data be read if the software used to produce it becomes unavailable? A good data management plan (DMP) will take these possibilities into account, list all of the software involved in the project, and, if possible, plan to store that software along with the data.
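As one possible way to keep a record of the software behind a dataset, the sketch below writes a small JSON manifest that can be stored next to the data. It assumes a Python-based project; the function names `build_software_manifest` and `write_manifest` and the manifest layout are illustrative choices, not part of any standard.

```python
import json
import platform
from importlib import metadata


def build_software_manifest(package_names):
    """Return a dict describing the interpreter, OS, and package versions."""
    packages = {}
    for name in package_names:
        try:
            packages[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            packages[name] = None  # not installed in this environment
    return {
        "python_version": platform.python_version(),
        "operating_system": platform.platform(),
        "packages": packages,
    }


def write_manifest(path, package_names):
    """Write the manifest as UTF-8 JSON (an open, plain-text format)."""
    manifest = build_software_manifest(package_names)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
    return manifest
```

A call such as `write_manifest("software_manifest.json", ["numpy", "pandas"])` would record the versions of those packages if they are installed. The package list and file name here are placeholders for whatever the project actually uses.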

Guides to File Formats

The following are links to various sites that will provide information on different types of file formats.

Characteristics of Accessible Formats

Formats more likely to be accessible in the future are:

  • Non-proprietary
  • Based on an open, documented standard
  • In common use by the research community
  • Encoded in a standard representation (ASCII, Unicode)
  • Unencrypted
  • Uncompressed

Consider migrating your data into a format with the above characteristics, in addition to keeping a copy in the original software format.     
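A rough way to check the "standard representation" characteristic is to test whether a file's bytes decode as ASCII or UTF-8 (a Unicode encoding). The sketch below is a heuristic only, and the function name `detect_text_encoding` is an illustrative choice. Binary data made up entirely of low byte values can still pass as ASCII, so a positive result does not prove that the file is meaningful plain text.

```python
def detect_text_encoding(raw: bytes):
    """Return 'ascii', 'utf-8', or None if the bytes decode as neither."""
    for encoding in ("ascii", "utf-8"):
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


def detect_file_encoding(path):
    """Read a whole file into memory and classify its encoding."""
    with open(path, "rb") as handle:
        return detect_text_encoding(handle.read())
```

A result of `None` suggests the file is binary, compressed, encrypted, or stored in a legacy encoding, and may be a candidate for migration.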

Examples of preferred format choices:

  • PDF/A, not Word
  • ASCII, not Excel
  • MPEG-4, not QuickTime
  • TIFF or JPEG2000, not GIF or JPG
  • XML or RDF, not RDBMS

For examples of how data archives treat different file formats, see the UK Data Archive page on data formats and software. Note that not all repositories are able to migrate data files to newer file formats for preservation.
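To illustrate the "XML or RDF, not RDBMS" preference listed above, the following sketch exports one table from a SQLite database to XML using only the Python standard library. SQLite stands in for whatever database a project uses. The function name `table_to_xml` and the `table`/`row`/`field` element layout are assumptions made for this example, not a recognized archival schema.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(connection, table_name):
    """Serialize every row of one table into an XML string."""
    # Table names cannot be bound as SQL parameters, so validate first.
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    cursor = connection.execute(f"SELECT * FROM {table_name}")
    columns = [description[0] for description in cursor.description]

    root = ET.Element("table", name=table_name)
    for row in cursor:
        row_element = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            # Column names go in an attribute so any name yields valid XML.
            field = ET.SubElement(row_element, "field", name=column)
            # NULL is written as an empty element.
            field.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE samples (id INTEGER, site TEXT)")
    demo.executemany(
        "INSERT INTO samples VALUES (?, ?)", [(1, "north"), (2, "south")]
    )
    print(table_to_xml(demo, "samples"))
```

This simple layout does not distinguish NULL from an empty string and drops column types, so a real migration would also need to document the schema alongside the exported file.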

Open vs. Closed Formats

Formats can be either open (also called free formats) or closed. Open formats are free or low cost, widely distributed, and not tied to particular vendors. They are maintained by organizations that document the specifications publicly, so anyone can use them.

Closed formats are controlled by organizations or vendors that do not publish the specifications. A closed format is typically protected by copyright, trademark, or patents and is subject to a variety of restrictions on use. Closed formats usually require a specific device or application to use them.