Data studies are published using a variety of formats and delivery techniques. The options a user has for accessing data are impacted by
A general rule that applies to all sources and formats for delivery is to always look for the documentation that comes with the numbers. The documentation can be in the form of a codebook, technical documentation, footnotes, links to source notes, and/or survey questionnaires.
When users are not sure which source is best for them, they can contact EDS, at eds@columbia.edu, for advice.
Formats for Data
- Data Ready for Use
This refers to studies composed of files that are ready for use by user-supplied software applications. A study can be a single file or a collection of files. Such files can be documentation (text, PDF, Word), program code (code to read text files into an application like Stata, SAS, or SPSS), system files ready to use in a particular application (SAS, Stata, shape file, etc.), or plain text files. Major parts of the EDS collections, including ICPSR, SSEDL, Roper, many public data archives, and portals for spatial data, are published this way.
Features to be aware of when obtaining these studies are file size and the speed of your Internet connection. Most often such files can be downloaded using the browser itself, though how this works varies among browsers. Generally, left-clicking on a file name prompts the browser to open the file with an application on your PC (it may ask which application to use, or whether you just want to save the file). With this option the browser may automatically choose an application other than the one you want. Right-clicking brings up the "save as" command, which downloads the file without attempting to open it, though sometimes to a default location rather than one of your choosing.
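For users who prefer a script to the browser's "save as" command, the same save-without-opening behavior can be sketched in Python. This is only an illustration, not an EDS-provided tool: the function name `save_file` is hypothetical, and the sketch assumes the file can be fetched directly from its URL without a login step.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse
from urllib.request import urlretrieve


def save_file(url: str, dest_dir: str) -> Path:
    """Download url into dest_dir without opening it; return the saved path."""
    # Use the last part of the URL path as the local file name.
    name = Path(unquote(urlparse(url).path)).name or "download.dat"
    target_dir = Path(dest_dir)
    # Create the folder you chose, so the file does not land in a default location.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    # urlretrieve copies the remote file straight to disk.
    urlretrieve(url, target)
    return target
```

A call such as `save_file(link_copied_from_the_study_page, "downloads")` would place the file in a `downloads` folder under the current directory. For very large files, check the file size listed on the study page before starting.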
Though increasingly less common for remote archives and public Internet sites, some sites may allow you to use File Transfer Protocol (FTP) to download files. It is an option still available from DataGate, where the path name to the location of the files on the local Unix server is given. Using FTP you can directly access a remote host and move files to a local host.
- Data Extraction and Analysis Interfaces
Data extraction refers to web-based programs that let users filter both variables and cases and thus download a subset of a larger dataset. These interfaces are particularly useful for large files and for multi-year studies like the Panel Study of Income Dynamics, the National Longitudinal Surveys, and the Integrated Public Use Microdata Series (IPUMS). They simplify processing by allowing users to get only the data needed for their analysis without having to handle large and/or multiple files. Some studies are published only through such a web interface; for others these interfaces are just one alternative for access.
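What an extraction interface does, keeping selected variables (columns) for selected cases (rows), can be sketched in a few lines of Python. The records below are made up for illustration, and the function `extract_subset` is hypothetical; it does not reproduce the interface of any particular archive.

```python
import csv
import io


def extract_subset(rows, variables, keep_case):
    """Return only the chosen variables for the cases that pass keep_case."""
    return [
        {name: row[name] for name in variables}
        for row in rows
        if keep_case(row)
    ]


# Made-up records standing in for a large multi-year file.
RAW = """person_id,year,age,income,state
1,1999,34,41000,NY
1,2001,36,45500,NY
2,1999,52,38000,NJ
2,2001,54,39000,NJ
"""

rows = list(csv.DictReader(io.StringIO(RAW)))
subset = extract_subset(
    rows,
    variables=["person_id", "year", "income"],   # filter variables
    keep_case=lambda r: r["year"] == "2001",     # filter cases
)
for record in subset:
    print(record)
```

The result contains three of the five variables and only the 2001 cases, which is the kind of reduced file an extraction interface would offer for download.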
Some extraction programs also have features that allow for online analysis of the data. These interfaces provide as output the results from statistical tests like frequencies, cross-tabulations, comparisons of means, or correlation matrices. Such functionality lets users do statistical analysis without the need for mastering a statistical software package and can allow experienced users to quickly assess the appropriateness of the dataset or particular variables before doing a large download. For more information refer to the ICPSR discussion about DAS and NESSTAR, two of the widely used interfaces. EDS staff have worked with these and other similar interfaces and can assist anyone who wants an introduction to them.
- Custom Data Tables
This refers to sites that allow a user to build a custom table of data by specifying the row headers and column headers (some interfaces allow for multiple row and column headers). Interfaces for building data tables are particularly useful for time series and for data that are published as aggregate data rather than microdata. Typically the choices one makes when building a table are based on features like years, countries, pairs of countries, and series name. These interfaces come in many varieties, from robust, sophisticated ones like the Census Bureau's American FactFinder - Data Sets option and the Beyond 20/20 interface used by the CDC for reports on Aging Activities and by SourceOECD databases, to straightforward ones like the one used by many World Bank applications, including WDI Online. EDS can provide assistance to anyone who wants an introduction to these. Some studies are published only through such a web interface; for others these interfaces are just one alternative for access.
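The table-building step itself, choosing one field for the row headers and another for the column headers, amounts to a pivot of aggregate records. A minimal Python sketch follows; the function names and all values are invented for illustration and are not real statistics or any site's actual interface.

```python
from collections import defaultdict


def build_table(records, row_key, col_key, value_key):
    """Pivot long-format records into {row header: {column header: value}}."""
    table = defaultdict(dict)
    for rec in records:
        table[rec[row_key]][rec[col_key]] = rec[value_key]
    return dict(table)


def format_table(table, columns, missing=".."):
    """Render the table as tab-separated text, one line per row header."""
    lines = ["\t".join(["", *map(str, columns)])]
    for row in sorted(table):
        cells = [str(table[row].get(col, missing)) for col in columns]
        lines.append("\t".join([row, *cells]))
    return "\n".join(lines)


# Made-up values for illustration only.
RECORDS = [
    {"country": "Country A", "year": 2000, "series": "population", "value": 10.0},
    {"country": "Country A", "year": 2001, "series": "population", "value": 10.2},
    {"country": "Country B", "year": 2000, "series": "population", "value": 4.5},
    {"country": "Country B", "year": 2001, "series": "gdp", "value": 99.0},
]

# Pick a series, then use countries as rows and years as columns.
population = [r for r in RECORDS if r["series"] == "population"]
table = build_table(population, "country", "year", "value")
print(format_table(table, columns=[2000, 2001]))
```

Cells with no matching record are shown with a placeholder, much as table generators mark unavailable values.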
Limitations to Access
- Studies that require visiting EDS
For a few of our products, EDS is unable to provide campus-wide access and therefore users must come to EDS to access the product. The reasons why this happens are:
- data are published on CD/DVD with custom PC-based software, like an extraction program or table generator, and therefore cannot be delivered over the Internet,
- data are published on CD/DVD and are in such low demand that uploading for campus access is done only by request,
- data, regardless of how they are published, come with license restrictions that do not permit campus-wide distribution,
- for spatial data published on CD/DVD, EDS does not have an Internet interface for efficiently storing and delivering data over the network (development work is ongoing).
- IP Authentication
Most authorization procedures result from license agreements that require us to limit our distribution to persons with a full-time affiliation with Columbia. The most common type of authorization used for licensed products is called IP authentication. It is invisible to users who are connected to the Internet via the campus network. Users connecting through another Internet provider must link to a resource using the Columbia-specific url (resolver url) that is set up for that resource and used to reference it on all the EDS or Libraries' web pages. They will then be prompted to enter their Columbia username (UNI) and password before gaining access.
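The on-campus versus off-campus distinction can be modeled in a few lines of Python. This is a simplified sketch of the idea only: the address block, the resolver address, and the function `access_route` are hypothetical placeholders, not Columbia's actual network range or resolver.

```python
import ipaddress
from urllib.parse import quote

# Hypothetical values: a documentation-only address block and resolver url.
CAMPUS_NETWORK = ipaddress.ip_network("192.0.2.0/24")
RESOLVER_URL = "https://resolver.example.edu/login?target="


def access_route(client_ip: str, resource_url: str) -> str:
    """Return the url a user would follow, given where they connect from."""
    if ipaddress.ip_address(client_ip) in CAMPUS_NETWORK:
        # On the campus network the address itself identifies the user
        # as authorized, so no login prompt appears.
        return resource_url
    # Off campus, the request goes through the resolver, which asks for
    # a username and password before passing the user on.
    return RESOLVER_URL + quote(resource_url, safe="")
```

In this model, `access_route("192.0.2.15", url)` returns the resource url unchanged, while an address outside the campus block is routed through the resolver first.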
For two archives to which Columbia subscribes and for which IP authentication applies, special situations apply.
- ICPSR
EDS provides access to the studies in the ICPSR archive at the University of Michigan both by directing users to the ICPSR main page (from which they can start searching) and to the ICPSR page for individual studies (usually after a user has done a search in DataGate). From the campus network, IP authentication works in both cases. When accessing ICPSR through another Internet provider, authentication only works when using the resolver url for ICPSR, which takes you to the ICPSR main page. The study-specific urls for ICPSR titles that are found on the pages for individual study titles within DataGate will not work from off the campus network. Off-network users can get to these studies by using the resolver url for the ICPSR main page and searching for the study at the ICPSR site. Note that to download from the ICPSR site, users must register (see the item about registration).
- Social Science Electronic Data Library
Users must always access SSEDL using the resolver url, whether on the campus network or not. The resolver links to the main SSEDL page for institutional subscribers. When an SSEDL study is listed in DataGate, the resolver url and SSEDL study number are listed so that users have the information they need to find the study at the SSEDL site. Once they have located a study to download, Columbia users must use the key icon to proceed with downloading rather than the "view/order" button.
- Authorizing downloads from DataGate
Users can download data stored locally and accessed via DataGate from anywhere. Access to the data download page (not program files or documentation) for data listed in DataGate does not rely on IP authentication. Instead, users will always be prompted for their UNI and password, even when they are connecting from the Columbia network.
- Sites requiring Registration
Some sites require that users register. The requirement can apply both to public sites, like IPUMS and the ESDS portion of the UK Data Archive, and to subscription sites covered by IP authentication, like Roper and ICPSR. Registration is a common requirement for sites outside the U.S. and is also used by domestic sites for a variety of reasons. Some sites have functionality, like saving a search or storing output, that requires individual "my data" accounts; others process search requests offline and need contact information to email users when results are ready. For security reasons, if you need to set up a password, NEVER use the password you use with your Columbia UNI or any other password you use for access to important information. The registration may involve agreeing to terms of use. See the next item for information about this.
- Products requiring a User Agreement
Licensed products and many publicly available products come with terms to which users must agree. Though these terms are listed on the sites, you may not be required to read the general terms of use, which are common to many data products, as you access the data. When you are required to read and acknowledge the terms, some sites may ask that you inform them of the specifics of any publications that result from using the product. Other terms can be more demanding. They can require a signed agreement and possibly a statement about what type of use is being made of the data. For studies that fall into this last category and that contain data based on responses from or about individuals, users should definitely consider also submitting their research plan to the campus Institutional Review Board.

