Frequently Asked Questions (FAQ)
What is IPUMS-International?
What's in the future for IPUMS-International?
Why do frequencies for IPUMS samples sometimes differ from official results?
How do IPUMS data differ from public samples that may already be in distribution?
How does IPUMS-International add value to the data?
Getting started
Where should a new user start?
How do I get access to IPUMS-International data?
What are the restrictions on use?
Basic concepts
What are microdata?
What are "harmonized variables"?
What are "source variables"?
What are "pointer variables"?
What are "general" and "detailed" versions of variables?
What are "weights"?
What does "universe" mean in the variable descriptions?
Getting data
How do I obtain data?
What format are the data in?
What is the best way to use the extract system?
How long does a data extract take?
What if the samples are too big for me to handle?
How does "sample selection" work on the IPUMS-International web site?
What does "Add to cart" mean?
Why can't I open the data file?
Is there a preferred statistical package for using the IPUMS?
Can I get the original data?
How is a record uniquely identified?
Using IPUMS data
Are there tricky aspects of IPUMS data to be particularly aware of?
What are the major limitations of the data?
Can I study multiple countries? One country?
There were questions asked in the census for which I don't see variables in IPUMS-International. Where are the data?
Can I find particular individuals in the IPUMS data?
Can I use IPUMS for genealogy?
What is the difference between spatially-harmonized and year-specific geography variables?
Can I map IPUMS data using GIS?
Using the variables page
Variables page menu
Variables page details
Using the data extract system
Your data cart
Why are some variables in my data cart preselected?
What is "Type"?
Extract request page
> Extract definition: Data format
> Extract definition: Data structure
Extract option: Customize sample sizes
Extract option: Select cases
Extract option: Attach characteristics
Extract option: Describe your extract
Extract option: Standardize monetary values
General information about the project
What is IPUMS-International? [top]
IPUMS-International (Integrated Public Use Microdata Series, International) is the world's largest collection of publicly available individual-level census data. The data are samples from population censuses from around the world taken since 1960. Names and other identifying information have been removed. The variables have been given consistent codes and have been documented to enable cross-national and cross-temporal comparisons. This "integration" process is described more fully here.
IPUMS is not a collection of compiled statistics; it is composed of microdata. Each record is a person, with all characteristics numerically coded. In most samples persons are organized into households, making it possible to study the characteristics of people in the context of their families or other co-residents. Because the data are individuals and not tables, researchers must use a statistical package to analyze the millions of records in the database. A data extraction system enables users to select only the samples and variables they require.
IPUMS-International is very similar to the IPUMS-USA data system, which contains all national-level samples of the United States from 1850 to the present. Scholars interested only in the United States are better served using IPUMS-USA, which is optimized for U.S. research.
What's in the future for IPUMS-International? [top]
IPUMS-International is funded by grants from the U.S. National Science Foundation and National Institutes of Health. We have every expectation of continuing the project indefinitely, but will have to secure further funding as our current grants expire. To be successful, we need to have a large body of users and published works we can point to. Please inform us if you have any presentations or publications using IPUMS data.
Why do frequencies for IPUMS samples sometimes differ from official results? [top]
There are a number of reasons why the IPUMS sample data may yield different counts from official census results:
1. Sample error. IPUMS microdata are samples, not the full-count data.
2. Sample bias. The IPUMS sample unit is usually the household. This introduces a slight bias in estimating statistics for individuals -- greater bias for characteristics commonly shared by household members (ethnicity, religion, birthplace) than for characteristics that vary by individual (age, sex, marital status).
3. Statistical disclosure controls. Measures to safeguard the privacy and confidentiality of individuals, households, and other entities can introduce error in the samples.
4. Omission of special populations, such as homeless, collective dwellings, or other non-private households (usually documented on sample description pages).
5. Omission of areas of the country due to loss of microdata, lack of coverage, security concerns, etc. (usually documented by a flag on the sample selection page).
6. Imputations and adjustments to the official figures by the National Census Office that are not encoded in the microdata files provided to IPUMS.
Users who require official census results should consult web pages of the National Census Offices.
How do IPUMS data differ from public samples that may already be in distribution? [top]
A limited number of the IPUMS-International samples are also distributed by national statistical offices through alternative means. The IPUMS-International samples do differ, however. They are subjected to a battery of tests for structural integrity, and we may have edited the data to fix errors in household structure. We also generate our own set of technical and constructed variables. All documentation is translated into English and edited as appropriate. Most significantly, all integrated variables are recoded and documented in an international context to enhance comparative research.
How does IPUMS-International add value to the data? [top]
The process of integration itself adds value to the data by fully documenting all codes and compiling all variable documentation in a hyperlinked web format. But we do many other things as well:
Most census samples have a certain number of structural errors: the roster for a household might be incomplete, members from different households might be merged together, person records might not have corresponding household records, and so forth. All IPUMS samples are processed using a consistent set of diagnostic tools to uncover such problems. Even most samples previously in public circulation through other outlets have a small number of such errors. We fix them, either by sampling around bad cases, by performing logical edits, or by using whole-household substitution.
IPUMS-International creates a consistent set of constructed variables for all samples. Most important are the family interrelationship "pointer" variables that indicate the location within the household of every person's mother, father, and spouse.
In the future, IPUMS will carry out missing data allocation and consistency edits on important variables. This kind of data editing performs logical fixes when possible, or it involves finding a donor record that shares key characteristics with the person in question and substituting their response for the variable. Allocation is a more statistically sound way to deal with missing data than simply excluding such cases from analyses. Missing data allocation is technically demanding, and we do not expect to carry it out in the near future.
Getting started
Where should a new user start? [top]
The natural starting point is the "Select Data" or "Browse and Select Data" links on the left navigation bar and the top banner. These links open the variables page: the primary tool for exploring the contents of IPUMS-International. By default, the variables page displays one variable group at a time for all samples in the data series. You can change the view option to show all groups simultaneously, but the page can get very large and slow to load. However, you can filter the information at any point to include only the samples of interest to you ("Select samples"). Initially, the variables screen is set to display the harmonized variables. Select the "view source variables" button to browse the variables that are specific to individual samples. More detailed information on using the variable menu is available.
When you select samples, the page will display only variables present in those censuses. An "x" indicates the availability of a variable for a particular sample.
On the variables page, clicking on a variable name brings up its documentation. The information about the variable is contained on a number of tabs. The default tab is the brief description of the variable. More information is usually available on the "comparability" tab, which discusses international and intra-national comparability issues. The "questionnaire text" tab compiles all the questionnaire text and instructions pertaining to the census question for every sample. The variables page also has direct links to the codes page for each variable (they are also accessible as a tab in the variable description). The codes page shows the codes and labels for the variable, and the availability of categories across samples. These categories can suggest the types of research possible with a given sample.
Throughout the variable documentation system there are buttons to "Add to cart." Any variables you select in this way are put in your data cart to include in a data extract. Your selections only last for the current web session.
The Data Cart in the upper right keeps track of your variable and sample selections. Once you have made some selections you can click on "View Cart" to review your choices. If you have selected variables and samples you can enter the data extract system. To make a data extract you must be registered to use IPUMS-International. The instructions for the extraction system are here.
How do I get access to IPUMS-International data? [top]
Access to the documentation is freely available without restriction; however, users must apply for access to the data. The application system requires a description of an applicant's proposed research, and asks for the user's institutional affiliation and other information to verify identity. Every application is individually reviewed by project staff. We may ask for additional information if we are uncertain about the suitability of the intended research. Applicants are required to agree to a number of conditions to use the data. Access to the system enables a user to extract data from any country in the database. To apply for access go here.
Registrations to use the data expire after one year and can be renewed.
What are the restrictions on use? [top]
Our agreements with the various national statistical offices require that IPUMS-International be used only for scholarly and educational purposes, including public policy research. Commercial use of the data is prohibited. To gain access, applicants are required to agree to a number of conditions that amount to a legal contract. Chief among them are a prohibition against redistributing the data and against attempting to identify individuals using the data. To see the full list of conditions, go to the application page here.
Basic concepts
What are microdata? [top]
Census microdata are composed of individual records containing information collected on persons and households. The unit of observation is the individual. The responses of each person to the different census questions are recorded in separate variables.
Microdata stand in contrast to more familiar "summary" or "aggregate" data. Aggregate data are compiled statistics, such as a table of marital status by sex for some locality. There are no such tabular or summary statistics in the IPUMS data.
Microdata are inherently flexible. One need not depend on published statistics from a census that compiled the data in a certain way, if at all. Users can generate their own statistics from the data in any manner desired, including individual-level multivariate analyses.
See an image of IPUMS data here. All IPUMS data are in this general format.
What are "harmonized variables"? [top]
Harmonization is the process of making variables from different censuses and countries comparable. For example, most censuses ask about marital status; however, they differ both in their classification schemes (one census might recognize only a general category of "married," while another might distinguish between civil and religious marriages) and in the numeric codes assigned to each category ("divorced" might be coded as a "4" in one census and as a "2" in another). To create a harmonized variable for marital status we recode the marital status variable from each census into a unified coding scheme that we design. Most of this work is carried out using correspondence tables like the one here.
Because some censuses provide more detail than others, a coding scheme that reduced variables down to the lowest common denominator across all samples would inevitably lose important information. As a result, many IPUMS harmonized variables use composite coding schemes. The first one or two digits of the code provide information available across all samples; the next one or two digits provide additional information available in a broad subset of samples. Finally, trailing digits provide detail only rarely available. All meaningful detail in the original enumerations is therefore available to researchers if they need it, but they can confine their attention to the less-detailed digits if they wish.
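As a rough illustration of how a composite code can be collapsed to its less-detailed digits, the sketch below truncates a fixed-width code to its leading digit or digits. The three-digit code values and the helper name `general_code` are hypothetical, chosen only for this example; the codes page for each variable shows the actual scheme.

```python
def general_code(detailed: int, total_digits: int, general_digits: int) -> int:
    """Keep the leading `general_digits` digits of a fixed-width composite code."""
    if not 0 < general_digits <= total_digits:
        raise ValueError("general_digits must be between 1 and total_digits")
    return detailed // 10 ** (total_digits - general_digits)


# Hypothetical 3-digit detailed codes (not actual IPUMS values).
detailed_codes = [110, 120, 211, 212, 300]

first_digit = [general_code(c, total_digits=3, general_digits=1) for c in detailed_codes]
two_digits = [general_code(c, total_digits=3, general_digits=2) for c in detailed_codes]

print(first_digit)  # [1, 1, 2, 2, 3]
print(two_digits)   # [11, 12, 21, 21, 30]
```

Integer division is used so that the truncated codes stay numeric, as the data themselves are.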
For some variables, the composite coding structure is recognized in our system by formally distinguishing separate "general" and "detailed" versions, as described here.
The other component of harmonization is the variable documentation. The documentation aims to highlight important comparability issues that are not self-evident from the coding structure for the variable. A general comparability discussion emphasizes issues for international comparisons, and country-specific discussions note comparability concerns when making intra-national comparisons over time. IPUMS staff must exercise their judgment in composing this documentation -- there is no formula for it. But users need not depend totally on us: the variable documentation provides links to both English-language and original-language census questionnaires and instructions. This material is readily available on every variable description page through the link to "enumeration text."
What are "source variables"? [top]
All variables in the IPUMS are processed to varying degrees: they are documented in English, associated with the relevant sections of the original census instructions, and the data are analyzed and often recoded for technical and other considerations. But not all IPUMS variables are "harmonized" for international and inter-temporal comparability.
The regular IPUMS variables -- the ones on the main variable availability screen -- are harmonized: the same codes and labels apply across all the samples that contain the variable. Source variables, in contrast, are unique to each sample (i.e., "unharmonized"). They generally correspond to the variables in the original datasets submitted by the various countries to the IPUMS project. The source variable codes and labels are not consistent across samples, but the variables have been processed to make them more regularized. Stray values are recoded; all data are converted to numeric values; data universes are empirically determined; unknown and NIU categories are coded consistently; and other edits may be made to address confidentiality concerns. In addition, each source variable is assigned a unique name in the IPUMS database, and the value labels and other variable documentation are written in English.
Many source variables serve as inputs for the harmonized IPUMS variables. For example, underlying the harmonized variable for marital status, MARST, are numerous source variables, typically one per sample -- CL70A_MARST for Chile 1970, UG91A_MARST for Uganda 1991, and so forth. Each harmonized IPUMS variable description has a link to the source variables that served as the inputs for it. The source variables are also accessible in a comprehensive list using the menu buttons on the variables page. The variable description for each source variable lists the integrated variables for which it provides the source data.
The source variables can be included in data extracts. Thus researchers can get both the harmonized and unharmonized forms of specific variables (for example, the internationally comparable employment status variable, EMPSTAT, and the employment status variable specific to 1998 Cambodia, KH98A_EMPSTAT). Perhaps more importantly, the source variables give researchers access to data that IPUMS has not been able to incorporate in an internationally comparable manner.
Some source variables are not available to researchers because of confidentiality concerns or other reasons. Even variables that serve as inputs to the regular harmonized IPUMS variables may be hidden.
What are "pointer variables"? [top]
The IPUMS family interrelationship "pointer" variables indicate the location within the household of every person's mother, father, and spouse. Nearly all samples indicate the relationship of each person to the head of household, but it is much harder to relate individuals to persons other than the head (for example, grandchildren to children, sons-in-law to daughters, or unrelated persons to each other). We have developed a complex core algorithm to make such connections, and we customize it as needed to account for peculiarities of specific samples. The pointer variables are called MOMLOC, POPLOC, and SPLOC in the IPUMS system. The variables PARRULE and SPRULE indicate the conditions under which a specific link was made. The parental pointer variables identify social parents, not strictly biological ones.
The pointer variables make it possible to construct individual-level variables representing the characteristics of co-resident persons, such as occupation of spouse, age of mother, or educational attainment of father. The data extraction system can perform this step for you. The "Variable options" step includes a feature to "Attach characteristics" of other persons in the household based on the pointer variables. These attached characteristics appear as new variables in your data extract. For maximum flexibility, you can also do this matching yourself. You need to include the serial and person ID variables (SERIAL and PERNUM) in your extract, as well as the pointer variables themselves, to perform the necessary data manipulations.
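For users who want to do the matching themselves, the following is a minimal sketch of attaching a mother's age to each person with SERIAL, PERNUM, and MOMLOC. The records are invented, the output name `AGE_MOM` is hypothetical, and the sketch assumes that MOMLOC holds the mother's PERNUM, with 0 meaning no mother was identified in the household; check the MOMLOC documentation before relying on that.

```python
# Toy person records; all values are invented for illustration.
persons = [
    {"SERIAL": 1, "PERNUM": 1, "MOMLOC": 0, "AGE": 41},
    {"SERIAL": 1, "PERNUM": 2, "MOMLOC": 1, "AGE": 12},
    {"SERIAL": 1, "PERNUM": 3, "MOMLOC": 1, "AGE": 9},
    {"SERIAL": 2, "PERNUM": 1, "MOMLOC": 0, "AGE": 30},
]

# Index every person by household serial number and person number.
# In an extract that pools several samples, SAMPLE would also belong in the key.
by_id = {(p["SERIAL"], p["PERNUM"]): p for p in persons}

for p in persons:
    # Assumption: MOMLOC is the mother's PERNUM; 0 means no mother in the household.
    mother = by_id.get((p["SERIAL"], p["MOMLOC"])) if p["MOMLOC"] else None
    p["AGE_MOM"] = mother["AGE"] if mother else None  # hypothetical variable name

print([p["AGE_MOM"] for p in persons])  # [None, 41, 41, None]
```

The same lookup would work for POPLOC and SPLOC by swapping the pointer variable.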
Some IPUMS samples are samples of individuals -- they do not contain persons organized into households. No pointer variables are available for those samples, which are identified in the notes column of the samples page.
What are "general" and "detailed" versions of variables? [top]
Most variables in the IPUMS have a composite coding structure, where the first digit is largely comparable across samples, and second and subsequent digits provide progressively more detail available in some samples and not others. For some highly requested variables, the composite coding structure is formally recognized in our system by distinguishing separate "general" and "detailed" versions of the variable. For example, researchers can access an internationally comparable 1-digit general version of "employment status", or they can use the fully detailed 3-digit version, if their research requires finer distinctions. The two sets of codes are completely consistent with one another; one simply provides more categories, while the other is simpler to use and usually more comparable across samples. Other variables only have the default full-detail version.
When you select a variable in the extract system that has general and detailed versions, both are selected by default. You can unselect either version on the "Variable options" screen. Both versions of a variable come with appropriate syntax labels. In data extracts, the detailed version of the variable gets a "D" appended onto the end of its mnemonic (for example, for "marital status," MARST is general and MARSTD is detailed).
The general and detailed versions of a variable both correspond to the same description in the documentation system. The codes and frequencies of each version are viewable separately on the relevant variable codes page.
What are "weights"? [top]
Most IPUMS samples are unweighted or "flat": every person in the sample data represents a fixed number of persons in the population. Approximately one-fourth of the IPUMS samples, however, are weighted, with some records representing more cases than others. This means that persons and households with some characteristics are over-represented in the samples, while others are under-represented. See the PERWT variable or samples page for a listing of the weighted IPUMS-International samples.
To obtain representative statistics from these samples, users must apply sample weights. Follow one of the following procedures:
1. For person-level analyses using a weighted sample, apply the PERWT variable. PERWT gives the population represented by each individual in the sample.
2. For household-level analyses using a weighted sample, weight the households using the HHWT variable. HHWT gives the number of households in the general population represented by each household in the sample. Group quarters (collective dwellings) are not usually weighted properly for household-level analyses and should generally be excluded using the GQ variable.
Even the unweighted samples have values for HHWT and PERWT, but every record in those samples receives an identical weight. This allows the application of the weight variables in pooled extracts that contain both weighted and unweighted samples. Otherwise, the use of the weights is optional in the unweighted samples.
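To show what applying a person weight means in practice, here is a small sketch that compares an unweighted share with a PERWT-weighted share. The records and the `EMPLOYED` indicator are invented for the example; in a statistical package you would normally pass PERWT to the package's own weighting option instead.

```python
# Invented person records: PERWT is the population each sampled person represents.
records = [
    {"PERWT": 10.0, "EMPLOYED": 1},
    {"PERWT": 10.0, "EMPLOYED": 0},
    {"PERWT": 30.0, "EMPLOYED": 1},
]

# Weighted totals: each record counts PERWT times.
weighted_total = sum(r["PERWT"] for r in records)
weighted_employed = sum(r["PERWT"] for r in records if r["EMPLOYED"] == 1)

unweighted_share = sum(r["EMPLOYED"] for r in records) / len(records)
weighted_share = weighted_employed / weighted_total

print(round(unweighted_share, 3))  # 0.667
print(round(weighted_share, 3))    # 0.8
```

A household-level analysis would follow the same pattern with one record per household and HHWT in place of PERWT.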
What does "universe" mean in the variable descriptions? [top]
The universe is the population at risk of having a response for the variable in question. In most cases these are the households or persons to whom the census question was asked, as reflected on the census questionnaire. For example, children are not usually asked employment questions, and men and children are not asked fertility questions. In some instances the universe suggested by the census questionnaire is not accurate, however, because of post-enumeration data processing. IPUMS-International empirically verifies universes to obtain the most accurate statement possible of the universe. In some cases, there is no independent information in a sample to verify a universe.
Cases that are outside of the universe for a variable are labeled "NIU (not in universe)" on the codes page. Differences in a variable's universe across samples are a common data comparability issue.
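The effect of NIU cases on a simple statistic can be sketched as follows. The variable values and the NIU code of 0 are hypothetical; the actual NIU code for a given variable is shown on its codes page.

```python
NIU = 0  # hypothetical NIU code; check the codes page for the real value

# Invented values of an employment-type variable:
# 0 = NIU, 1 = employed, 2 = unemployed
values = [0, 0, 1, 1, 1, 2]

# Restrict to the universe before computing a share.
in_universe = [v for v in values if v != NIU]

share_all = values.count(1) / len(values)
share_in_universe = in_universe.count(1) / len(in_universe)

print(share_all)          # 0.5
print(share_in_universe)  # 0.75
```

Leaving NIU cases in the denominator understates the share here, and the size of that distortion will differ across samples whose universes differ.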
The universes will not always be free of apparently erroneous cases. Some persons or households that should not have answered the question did, and some that should have answered may be included in the "NIU" category. But until we perform comprehensive data editing and allocation, we do not know whether the variable in question is in error, or whether the variables that define the universe (for example, age or employment status) are incorrect.
In our early data releases we performed some editing to clean up the universes, but that approach was not comprehensive enough to ensure we were properly identifying the variable with faulty information. We removed those universe edits for the December 2006 data release.
Getting data
How do I obtain data? [top]
All IPUMS-International data are delivered through our data extraction system. Users select the variables and samples they are interested in, and the system creates a custom-made extract containing only this information. The system will pool data from multiple samples into a single data file; in fact, it was primarily designed for this purpose.
IPUMS-International never distributes complete samples of any census; data are only delivered through the extract system. Instructions for downloading and reading the data are available here.
Data are generated on our server. The system sends out an email message to the user when the extract is completed. The user must download the extract and analyze it on their local machine. The extract system is accessed through the Data Cart, which becomes clickable once you have selected variables and samples.
What format are the data in? [top]
IPUMS-International produces fixed-column ASCII data. Data are entirely numeric. By default, the extraction system rectangularizes the data: that is, it puts household information on the person records and does not retain the households as separate records. No information is lost, and this is the format preferred by most researchers; however, it can be overridden to yield hierarchical data.
In addition to the ASCII data file, the system creates a statistical package syntax file to accompany each extract. The syntax file is designed to read in the ASCII data while applying appropriate variable and value labels. R, SPSS, SAS, and STATA are supported. You must download the syntax file with the extract or you will be unable to read the data. The syntax file requires minor editing to identify the location of the data file on your local computer. Alternatively, you can request your data formatted for SPSS (.sav), SAS (.sas7bdat), STATA (.dta), or as a comma delimited file (.csv) on the Extract Request page.
A codebook file is also created with each extract. It records the characteristics of your extract and should be downloaded for record-keeping.
All data files are created in gzip compressed format. You must uncompress the file to analyze it. Most data compression utilities will handle the files.
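The supported way to read an extract is the syntax file, but the sketch below may help clarify what "gzip compressed, fixed-column ASCII" means. The column positions in `LAYOUT`, the file name, and the two data lines are all invented for the example; the real positions are defined in the syntax file generated with your extract.

```python
import gzip
import os
import tempfile

# Hypothetical column layout: name -> (start, end), 0-based and end-exclusive.
LAYOUT = {"SERIAL": (0, 4), "PERNUM": (4, 6), "AGE": (6, 9)}


def read_fixed_width_gz(path, layout):
    """Decompress a gzipped fixed-column file and slice each line into integers."""
    rows = []
    with gzip.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line:
                rows.append({name: int(line[a:b]) for name, (a, b) in layout.items()})
    return rows


# Build a tiny invented file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "extract.dat.gz")
    with gzip.open(path, "wt", encoding="ascii") as handle:
        handle.write("000101041\n000102012\n")
    rows = read_fixed_width_gz(path, LAYOUT)

print(rows[0])  # {'SERIAL': 1, 'PERNUM': 1, 'AGE': 41}
```

Because every value occupies a fixed range of character positions, no delimiter is needed, which is also why the data cannot be read without the layout information.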
What is the best way to use the extract system? [top]
The data extraction system is a flexible tool. There is no need to download variables or samples you don't expect to use for your current analysis. The system records every extract you make. You can reload and modify an old extract, dropping or adding variables or samples. Go to the "Download or Revise Extracts" page and click on the "Revise" link.
If you choose several large samples and many variables, you can make extremely large extracts that will be cumbersome to analyze. The extract system is designed to minimize this problem. The "Extract Request" screen predicts the size of your data extract and provides options for reducing the size of your dataset. The system will inform you if your extract violates the maximum size allowable.
Some variables are preselected for you. They include a variable that identifies the sample, in case your extract pools data from multiple sources, as well as other technical variables. Some of the samples are truly weighted, with different records representing more persons in the population than others, so the system preselects the person weight variable (PERWT).
How long does a data extract take? [top]
The time needed to make an extract differs depending on the number and size of samples requested, whether case selection is performed, and the load on our server. Extracts can take from a few minutes to an hour or more. The system sends an email when the extract is completed, so there is no need to stay active on the IPUMS-International site while the extract is being made.
What if the samples are too big for me to handle? [top]
It is possible to make extracts that are extremely large. Extracts over a certain size will not be allowed by our server. If you have a legitimate reason to make extremely large extracts, email us to request a higher threshold.
Roughly speaking, the number of records times the number of columns of data requested yields the file size of your extract in bytes. The number of records in each sample is given on the samples page. Among the current samples, the modern U.S., Brazil, China, France, Mexico, Pakistan, and Vietnam datasets are particularly large.
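The rule of thumb above amounts to a single multiplication. The record and column counts below are invented for illustration, and the function name is hypothetical; the "Extract Request" screen remains the authoritative estimate.

```python
def estimate_extract_bytes(n_records: int, n_columns: int) -> int:
    """Rough uncompressed size of a fixed-column extract: one byte per column per record."""
    return n_records * n_columns


# Invented example: 2 million person records, 150 columns of data each.
size = estimate_extract_bytes(2_000_000, 150)

print(size)                     # 300000000
print(round(size / 1_000_000))  # 300 (approximate megabytes)
```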
The "extract request" screen predicts the size of your data file. If it is too large for your purposes, there are several things you can do.
1) Select fewer variables and/or samples.
2) Use the "Select cases" feature on the extract page to include only the particular kinds of people or households you want to include in your data. (Note: the estimated file size does not include case selection in its calculation, so this number will not change.)
3) Use the "Customize sample sizes" feature to draw smaller subsets of some or all of the samples in your extract. Entering numbers in any of the cells in the right half of the screen will tell the system to systematically extract the corresponding number of households from the selected sample(s). The households will be drawn evenly from across the entire country, and the sample weights in the data will be adjusted appropriately.
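The idea behind drawing a smaller, evenly spread subset of households and adjusting the weights can be sketched as below. This is only an illustration of systematic sampling under simple assumptions, not the extract system's actual procedure; the function name and the household records are invented.

```python
def systematic_subsample(households, n_wanted):
    """Pick n_wanted households at evenly spaced positions and rescale HHWT."""
    total = len(households)
    if not 0 < n_wanted <= total:
        raise ValueError("n_wanted must be between 1 and the number of households")
    step = total / n_wanted
    picked = [households[int(i * step)] for i in range(n_wanted)]
    # Each kept household now stands in for total / n_wanted original households.
    return [{**h, "HHWT": h["HHWT"] * step} for h in picked]


# Ten invented households, each originally representing 10 households.
households = [{"SERIAL": s, "HHWT": 10.0} for s in range(1, 11)]
subset = systematic_subsample(households, 5)

print([h["SERIAL"] for h in subset])   # [1, 3, 5, 7, 9]
print(sum(h["HHWT"] for h in subset))  # 100.0
```

Scaling the weights by the inverse of the sampling fraction keeps the weighted household total the same as in the full sample.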
How does "sample selection" work on the IPUMS-International web site? [top]
When a user first enters the variable documentation system, all samples are selected by default. Every variable in the system will display on all relevant screens.
Users can filter the information displayed by selecting only the samples of interest to them. Only the variables available in one of the selected samples will appear in the variable lists. The integrated variable descriptions and codes pages will also be filtered to display only the text and columns corresponding to the selected samples. Sample selections can be altered at any time in your session. Selections do not persist beyond the current session.
When a user enters the extract system after selecting samples, those selections are carried into the data extract system.
What does "Add to cart" mean? [top]
While browsing variables in the documentation system, you can select them to include in a data extract, sending them to your data cart. You can deselect a variable by unchecking its box in the data cart. After you proceed to "create data extract", you can return to the variable list to make more selections.
Why can't I open the data file? [top]
There are two likely explanations:
1) The data produced by the extract system are gzipped (the file has a .gz extension). You must use a data compression utility to uncompress the file before you can analyze it.
2) You cannot open the default ASCII data file directly with a statistical package. The extract system generates a syntax (set-up) file to read the ASCII file into your statistical package. You must download the syntax file along with the data file from our server, open the syntax file with your statistical package, and edit the path in the syntax file to point to the location of the data on your local computer. Now you are ready to read in the data. Alternatively, you can request your data formatted for SPSS (.sav), SAS (.sas7bdat), STATA (.dta), or as a comma delimited file (.csv) on the Extract Request page. More detailed instructions for downloading and reading the data are available here.
Is there a preferred statistical package for using the IPUMS? [top]
IPUMS-International supports R, SPSS, SAS, and STATA. By default, the extract system generates an ASCII data file (.dat) and provides R, SPSS, SAS, and STATA syntax files with which to read the data. You can request your data formatted for SPSS (.sav), SAS (.sas7bdat), STATA (.dta), or as a comma delimited file (.csv) on the Extract Request page.
Can I get the original data? [top]
In accordance with our agreements with the national statistical offices, IPUMS-International does not distribute the original samples provided by our international partners. We do provide access to source variables, but even these have undergone processing. We clean up stray codes, translate the labels into English, document the variable, and sometimes perform additional programming. We also carry out confidentiality measures that affect small numbers of cases for some variables. We take care not to lose meaningful information in these transformations, so researchers retain access to the full power of the original information in the input variables.
How is a record uniquely identified? [top]
Three variables constitute a unique identifier for each record in the IPUMS: SAMPLE, SERIAL, and PERNUM (census sample, household identifier, and person number within the household). The combination of SAMPLE and SERIAL is the unique household identifier.
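A minimal sketch of how these identifiers can be used, assuming the records have already been read into Python dictionaries. The values below are made up for illustration.

```python
from collections import Counter

# Hypothetical records; only the three identifying variables are shown.
records = [
    {"SAMPLE": 1, "SERIAL": 1000, "PERNUM": 1},
    {"SAMPLE": 1, "SERIAL": 1000, "PERNUM": 2},
    {"SAMPLE": 1, "SERIAL": 2000, "PERNUM": 1},
    {"SAMPLE": 2, "SERIAL": 1000, "PERNUM": 1},  # same SERIAL, different sample
]


def person_key(rec):
    """Unique person identifier: SAMPLE + SERIAL + PERNUM."""
    return (rec["SAMPLE"], rec["SERIAL"], rec["PERNUM"])


def household_key(rec):
    """Unique household identifier: SAMPLE + SERIAL."""
    return (rec["SAMPLE"], rec["SERIAL"])


# Any key that occurs more than once would signal a duplicated record.
person_counts = Counter(person_key(r) for r in records)
duplicates = [key for key, n in person_counts.items() if n > 1]

# Distinct households across all samples in the file.
households = {household_key(r) for r in records}
```

Note that SERIAL alone is not enough when an extract pools several samples, because the same household number can recur in different samples.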
Using IPUMS data
Are there tricky aspects of IPUMS data to be particularly aware of? [top]
Some samples are weighted: each individual does not represent the same number of persons in the population. It is important to use the weight variables when performing analyses with these samples. See the PERWT variable or samples page for a listing of the weighted IPUMS-International samples. For other samples the use of weights is optional.
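The effect of weighting can be sketched as follows. The records and weight values are invented for illustration, and the helper weighted_total is a hypothetical name; the point is only that PERWT, not the raw record count, gives the population estimate.

```python
def weighted_total(records, predicate):
    """Sum PERWT over the records that satisfy the predicate."""
    return sum(r["PERWT"] for r in records if predicate(r))


# Hypothetical person records with made-up weights.
people = [
    {"AGE": 70, "PERWT": 10.0},
    {"AGE": 30, "PERWT": 25.0},
    {"AGE": 68, "PERWT": 15.0},
]


def is_elderly(rec):
    return rec["AGE"] >= 65


unweighted_count = sum(1 for r in people if is_elderly(r))
weighted_count = weighted_total(people, is_elderly)
weighted_share = weighted_count / weighted_total(people, lambda r: True)
```

Here two of three records are elderly, but the weighted share of the population is one half, because the elderly records carry smaller weights.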
Not all samples contain the full universe of persons in the national population. Various, usually small, subpopulations may be missing. In different samples, the institutionalized population, transients, migrants, indigenous peoples, or other groups may be excluded or under-represented in the sample data. See the notes and detailed sample designs on the samples page. In a few cases, subsections of the country may be missing entirely. These particularly selective samples are identified on the pick list of samples within the data extraction system to emphasize this limitation.
It is important to examine the documentation for the variables you are using. The codes and labels for variable categories, such as those in the syntax files, do not tell the whole story. Pay particular attention to two things. First, the universe for a variable -- the population at risk for answering the question -- can differ subtly or markedly across samples. Second, read the variable comparability discussions for the samples you are interested in; important comparability issues should be mentioned there. If a variable is of particular importance in your research (for example, it is your dependent variable), you would also be well served to read the enumeration text associated with it. This text is linked directly to the variable, so it is easy to call up.
By default, the extract system rectangularizes the data: it puts the household information on the person records and drops the separate household record. This can distort analyses at the household level, because the number of observations is inflated to the number of person records. To get the proper number of household observations, either select the first person in each household (PERNUM = 1) or select the "hierarchical" box in the extract system. The rectangularizing feature also drops any vacant households, which are otherwise available in some samples. Despite these complications, the great majority of researchers prefer the rectangularized format, which is why it is the default output of our system.
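The household-level distortion can be seen in a small sketch. The records and the ROOMS values are made up, and weights are ignored to keep the example short.

```python
# Hypothetical rectangular records: household information (ROOMS)
# is repeated on every person record of the household.
rect = [
    {"SAMPLE": 1, "SERIAL": 1000, "PERNUM": 1, "ROOMS": 3},
    {"SAMPLE": 1, "SERIAL": 1000, "PERNUM": 2, "ROOMS": 3},
    {"SAMPLE": 1, "SERIAL": 1000, "PERNUM": 3, "ROOMS": 3},
    {"SAMPLE": 1, "SERIAL": 2000, "PERNUM": 1, "ROOMS": 5},
]

# Keep one record per household by selecting the first person.
household_rows = [r for r in rect if r["PERNUM"] == 1]

# Averaging over person records counts the larger household three times.
mean_over_persons = sum(r["ROOMS"] for r in rect) / len(rect)

# Averaging over one record per household gives the household-level mean.
mean_over_households = sum(r["ROOMS"] for r in household_rows) / len(household_rows)
```

In this toy file the person-level average is 3.5 rooms, while the household-level average is 4.0.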
What are the major limitations of the data? [top]
The data are composed entirely of individual person and household records from population censuses. There are no macroeconomic, business, or aggregate statistics. We do not deliver the published statistics from the population censuses.
Of particular importance for some topics, most censuses do not include a question on personal income. See the variables page for the subject content of the samples.
Some samples are samples of individuals -- the persons are not organized into households. Other samples do not provide single years of age. Both types of limitations preclude certain research. These samples are identified on the samples page and by notes in the data extract system.
IPUMS-International is composed entirely of sample data, with sample densities ranging from 1 percent to 10 percent of national populations. Some subpopulations may be too small to study with the sample data.
Because the data are public-use, measures have been taken to ensure confidentiality. Names and other identifying information are suppressed. Most importantly for many researchers, geographic information is usually limited, sometimes severely. In many samples, places with fewer than 20,000 residents are not identified; in others the threshold is higher, and in some only states or regions can be determined.
Can I study multiple countries? One country? [top]
IPUMS-International is designed to facilitate cross-national and cross-temporal research, but there is no restriction against single-country studies. Data extracts can contain records from every sample in the entire data series, or records from only a single sample. Researchers interested in studying only the United States, however, would be best served by going to IPUMS-USA, which is optimized for U.S. research and has greater temporal depth.
There were questions asked in the census for which I don't see variables in IPUMS-International. Where are the data? [top]
Some countries did not supply variables for every census question. The census responses may never have been processed, or the variable may have been left out of the public use file because of confidentiality concerns. It is also possible we have the data, but have insufficient metadata to make it available at this time. When we add a sample, we harmonize only the core group of variables that we think most researchers would desire. Other variables can be accessed as source variables. In some cases we give a harmonized IPUMS version of a variable but don't allow access to the unharmonized version because it contains too much detail for purposes of confidentiality.
Can I find particular individuals in the IPUMS data? [top]
No. A variety of steps have been taken to ensure the confidentiality of the data. Most fundamentally, samples do not contain names or addresses. The data are only samples, so there is no guarantee any given individual will be in the dataset. Low levels of geography are suppressed. We randomize the order of cases within the geographic areas that are identified, and swap a small number of dwellings across geographic boundaries. Very small variable categories in recent censuses are combined with other categories or suppressed. Finally, the user agreement explicitly forbids any attempt to identify individuals.
Can I use IPUMS for genealogy? [top]
No. IPUMS does not contain names. Moreover, the use of the data for genealogy is expressly prohibited in the user license agreement to which all persons must agree.
What is the difference between spatially-harmonized and year-specific geography variables? [top]
IPUMS-International releases two kinds of geography variables: spatially harmonized and year-specific. Spatially-harmonized variables provide consistent geographic units for a country across sample years to facilitate comparisons over time. They are useful for studying multiple countries across several years. Some detail is lost in the construction of harmonized units. Year-specific variables retain all the original detail from each sample, but they are not consistent over time. Year-specific variables are useful for studying one specific sample year for a country.
Both spatially-harmonized and year-specific geographic variables combine units that have fewer than 20,000 persons with neighboring units for confidentiality purposes and public dissemination of data. For more information on geography variables and the spatial harmonization procedure, refer to the geography variables section of our website.
Can I map IPUMS data using GIS? [top]
Users can create maps with IPUMS-International data using a statistical software program and ArcMap (a GIS mapping application). GIS boundary files (shapefiles) are available for the first and second administrative levels of geography for IPUMS-International countries. IPUMS releases spatially harmonized GIS files with stable subnational units over time. For users interested in studying and displaying one specific sample year for a country, IPUMS provides year-specific GIS boundary files.
Resulting maps are intended to facilitate visual illustrations of IPUMS-International data rather than precise geographical calculation. For more information about linking IPUMS-International data to world and select country maps and to access the maps themselves, consult the GIS Boundary Files section of our website.
Using the variables page
Variables page menu [top]
Use the left side of the menu to browse variables:
Household: household variables by group
Person: person variables by group
A-Z or Sample: harmonized variables by letter/ source variables by sample
Search: display only variables that contain specified text in particular fields
Use the buttons and links on the right side of the menu to:
Select Samples: limit the display of variable information to selected samples
View . . . Variables: toggle between viewing harmonized and unharmonized source variables
Options: alter how the variable list is displayed or get help for this page
Variables page details [top]
The Menu
The variables page allows you to browse harmonized (integrated) and unharmonized source variables while limiting and controlling how the information is displayed.
The left side of the menu is for browsing the variables. The radio button on the right switches the variable menu between showing harmonized and unharmonized source variables.
When you "Select Samples" you limit the variable list to display only variables that are available in at least one of those samples. But the effect of selecting samples extends into all the variable descriptions and codes pages you can access through the variable system. Only information relevant to your selected samples will be displayed in any context while you browse the variables. You can change your sample selections at any point.
Selecting samples is a good practice when exploring the IPUMS, because the amount of information can be unwieldy. On the other hand, sometimes you need to see everything to determine what kinds of research are possible using the database.
"Search" lets you specify search terms for specific fields of variable metadata. The system will return a list of variables that include any of the search terms you indicate. Both harmonized and source variables can be searched.
The final choices are "Options" and "Help." The "Options" item brings up a screen that offers a number of choices regarding the display of the variable list. Each selection has a default choice.
Use short country codes / Use long country codes
Switch between the 2-letter country abbreviations and longer abbreviations. The short codes are the default.
View one group / View all groups together
Switch between viewing one variable group at a time and viewing all variable groups on one screen. Unless you have a limited number of samples selected, your browser may be slow to display all groups. The default view is one group at a time.
Show availability detail / Show availability summary
Switch between displaying the full sample-specific availability matrix, and a view that only displays the total number of samples that contain each variable. Both views only display or sum the samples that the user has selected in "Select samples." The default view is the detailed availability information. This option is disabled while viewing source variables.
View available variables / View all variables
Switch between a view that only displays variables present in at least one of your selected samples, and a view that displays every variable, even if it is not available in those samples. The default view is to only display available variables.
Use long source variable names / Use short source variable names
Switch between descriptive names for the source variables (up to 16 characters) and cryptic 8-character names. The default view is to display long names.
Samples are displayed oldest to newest / Samples . . . newest to oldest
Display the samples columns indicating variable availability in chronological order or reverse chronological order. The default is oldest to newest.
The Variable List
As you browse the variables, they are displayed in a list containing a number of columns. The variable name links to the variable description, which includes detailed comparability discussions, universes, and enumeration text. The variable codes -- and their associated labels -- can be accessed directly using the "codes" links. The "type" column indicates if it is a person or household variable. In some contexts, like the alphabetic view, the two types are pooled together.
The area to the right of the "codes" column differs between harmonized and source variables. For harmonized (integrated) variables, the default view displays a column for every sample that the user chose in "Select samples." By default, all samples are selected. The country abbreviation and last two digits of the sample year identify each sample at the top of every column. Hover over the country code with the mouse to see the full country name. If a variable is available in a given sample, an "x" is printed in that column.
The source variables by definition are only available in a single sample; therefore, they do not require detailed availability information.
In the column labeled "Add to cart", each variable has a yellow circle with a "+" on the far left. Click a circle to add its variable to your data cart (the circle turns green when you hover over it to indicate that you may select the variable). Once clicked, the icon changes to a checked box, indicating that the variable is in your data cart. To remove the variable from your data cart, simply click the checkbox.
Using the data extract system
Your data cart [top]
You cannot create data from the extract system unless you are a registered user. If you are not registered, you must apply for access.
At the top right corner of the variables page is a summary of your data cart. This box displays the number of variables and samples you have selected. Clicking the yellow circle next to a variable places it in your data cart. You can view your data cart at any time by clicking "View Cart". The "View Cart" link only becomes operative when you have selected a variable or sample.
The data cart lists the variables preselected by the extract system as well as any variables you selected while browsing the documentation. As with the variable selection page, you can remove variables from your extract in this step by clicking the checkbox next to the variable in the "Add to cart" column. If you chose a variable but subsequently altered your sample selections in such a way that the variable is no longer available, it is indicated by an "i" icon.
The data cart also includes record type, links to codes pages, and sample availability for the variables in your cart.
Buttons are provided to return to the variable list to make more selections or to alter your sample choices. If you return to the variable list, click on "View Cart" again to return to the data cart.
When you are satisfied with your data selections, click "Create data extract" to finalize your extract request.
Why are some variables in my data cart preselected? [top]
Certain variables appear in your data cart even if you did not select them, and they are not included in the constantly updated count of variables in your data cart.
Unless you are absolutely certain you will not need one of these variables, we recommend that you not remove them from your data cart.
What is "Type"? [top]
The "Type" column on the variables selection pages and in your data cart indicates the record type of the variable. The variables with a "P" are from the person record, and the variables with an "H" are from the household record. Data at the household level pertain to each person in the household, and are identical on each person record within a household in the rectangular data file.
Extract request page [top]
When you click "Create data extract" in the Data Cart, you come to the Extract Request page. All of the actions on this page are optional. If you wish, you can simply hit the "Submit" button and create your data extract. You will be prompted to log in if you have not done so already.
The page summarizes your data extract and provides a number of options for customizing it. A link at the top expands to show the samples you selected. If any samples have notes associated with them, a message will appear on the samples bar to encourage you to review that information. Click the appropriate links to go back to the variable browsing and sample selection pages to alter your choices. You return to the extract request page via the data cart, where you can review the availability matrix for selections and easily drop variables by unchecking them.
All data extracts include a text data file (fixed-width format), along with R, Stata, SPSS, and SAS syntax files to load those data. On this page, you can elect to receive the data in an alternative format.
A separate link lets you choose the preferred data structure for your extract: rectangular or hierarchical. Rectangular format is the default.
Another row on the page estimates the size of your extract. If the estimated size is too large, click on the link to reduce extract size. Two of the methods for reducing the size of extracts involve options buttons on the lower half of the extract request page.
When you submit an extract, there will be a delay ranging from minutes to hours, depending on the size of the job. You do not need to wait on our site for the job to be completed. Our system will send you an email when your extract is ready.
The definitions of every extract will remain on our server indefinitely, but the data files are subject to deletion after three days. However, the screen where you download extracts has a feature that lets you revise old extracts. When you click on "revise," all your selections for that extract will be loaded into the system, after which you can edit or regenerate it. Note, however, that each successive data release can create difficulties for recreating old extracts, because codes might change.
> Extract definition: Data format [top]
By default, the extract system generates an ASCII data file (.dat) and provides R, SPSS, SAS, and STATA syntax files with which to read the data. You can request your data formatted for SPSS (.sav), SAS (.sas7bdat), STATA (.dta), or as a comma-delimited file (.csv).
> Extract definition: Data structure [top]
You can choose the preferred file structure for your extract. Rectangular data contain only person records -- requested household information is attached to each household member. Hierarchical data contain a distinct household record followed by a separate person record for each member of the household. The system defaults to rectangular format, which is the overwhelming choice of researchers.
Vacant housing units can only be extracted using the hierarchical data structure.
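The relationship between the two structures can be sketched as below. We assume each record carries a record-type field, here called RECTYPE, with "H" for household and "P" for person records; that field name, the function name, and all values are assumptions for illustration.

```python
def rectangularize(hier_records):
    """Attach each household record's fields to the person records that follow it."""
    rows = []
    current_household = None
    for rec in hier_records:
        if rec["RECTYPE"] == "H":
            current_household = rec  # remember the most recent household
        else:
            row = {k: v for k, v in current_household.items() if k != "RECTYPE"}
            row.update({k: v for k, v in rec.items() if k != "RECTYPE"})
            rows.append(row)
    return rows


# Hypothetical hierarchical file: household records followed by their members.
hier = [
    {"RECTYPE": "H", "SERIAL": 1000, "ROOMS": 3},
    {"RECTYPE": "P", "PERNUM": 1, "AGE": 40},
    {"RECTYPE": "P", "PERNUM": 2, "AGE": 38},
    {"RECTYPE": "H", "SERIAL": 2000, "ROOMS": 2},  # vacant unit: no persons follow
    {"RECTYPE": "H", "SERIAL": 3000, "ROOMS": 5},
    {"RECTYPE": "P", "PERNUM": 1, "AGE": 71},
]

rect_rows = rectangularize(hier)
```

Because the vacant unit has no person records to carry its information, it leaves no trace in the rectangular output, which is why vacant units require the hierarchical structure.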
Extract option: Customize sample sizes [top]
Near the bottom of the screen is the expected size of your data extract. Note: the predicted extract size does not take account of any case selection you may have implemented using the "Select cases" option. If you used case selection, your extract probably will be smaller than the size reported on this screen.
To alter the size of a sample, enter the desired number in one of its boxes in terms of households, persons, or sample density. For any sample, you can enter only one number to define the density; the other two cells will be calculated from that number. An entry in the first row of boxes, "All samples", will apply the same selection to every sample in your extract. The minimum number of cases for any sample is 10,000 households. If you enter a number larger than the number of cases in a sample, the tool will indicate the maximum number.
At any point you can clear your selections and return to the full sizes for every sample.
The sampling unit for the sample-size tool is the household. The system will draw a systematic sample of every Nth household -- after a random start -- at the proper density to produce the number of cases you requested. Your data extract will have altered weights that reflect the new sample densities. Thus, your subsamples will still be representative of the full population, but some divergence from the full-sample estimates should be expected, particularly for estimates of small geographic areas or uncommon categories of cases.
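The general idea of a systematic household sample with a random start can be sketched as follows. This is an illustration of the technique under our own assumptions, not the extract system's actual code; in particular, scaling each weight by the sampling interval is one simple way to reflect the new density.

```python
import random


def systematic_household_sample(serials, n_wanted, rng):
    """Pick every Nth household after a random start.

    Returns the selected serials and the factor by which the original
    weights would be multiplied to reflect the lower density.
    """
    if not 0 < n_wanted <= len(serials):
        raise ValueError("n_wanted must be between 1 and the number of households")
    step = len(serials) / n_wanted   # sampling interval N (may be fractional)
    start = rng.random() * step      # random start within the first interval
    last = len(serials) - 1
    picks = [serials[min(int(start + i * step), last)] for i in range(n_wanted)]
    return picks, step


# Hypothetical example: 1,000 households reduced to 100.
demo_serials = list(range(1, 1001))
demo_picks, demo_factor = systematic_household_sample(
    demo_serials, 100, random.Random(0)
)
```

With these inputs each selected household stands in for ten original households, so a person with an original weight of 20 would carry a weight of 200 in the subsample.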
If a sample contains vacant housing units and you request the default rectangularized data structure, the actual number of households in your extract will fall somewhat below the number displayed here.
Extract option: Select cases [top]
The "select cases" feature allows users to limit their dataset to contain only records with specific values for selected variables, such as persons age 65 and older. Multiple variables can be used in combination during case selection. Selections for multiple variables are additive, each being implicitly connected by a logical "AND" for processing purposes. You can only perform case selection on either the general or the detailed version of a variable, not both.
Simply extracting selected cases can be too crude, however, because you may need the people who co-resided with your selected population. Accordingly, the case selection function also lets you choose to include everyone living in a household with a person with the selected characteristics.
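The logic of case selection can be sketched as below, assuming records held as Python dictionaries. The function name, the data, and the use of code 2 for SEX are illustrative assumptions; check the codes page for the variables you actually use.

```python
def select_cases(records, conditions, include_household=False):
    """Keep records that satisfy every condition (logical AND).

    conditions maps a variable name to a container of accepted values.
    With include_household=True, everyone living with a selected person
    is kept as well.
    """
    def matches(rec):
        return all(rec[var] in allowed for var, allowed in conditions.items())

    if not include_household:
        return [r for r in records if matches(r)]

    selected_households = {
        (r["SAMPLE"], r["SERIAL"]) for r in records if matches(r)
    }
    return [r for r in records if (r["SAMPLE"], r["SERIAL"]) in selected_households]


# Hypothetical records from two households.
persons = [
    {"SAMPLE": 1, "SERIAL": 1000, "PERNUM": 1, "AGE": 70, "SEX": 2},
    {"SAMPLE": 1, "SERIAL": 1000, "PERNUM": 2, "AGE": 72, "SEX": 1},
    {"SAMPLE": 1, "SERIAL": 2000, "PERNUM": 1, "AGE": 30, "SEX": 2},
    {"SAMPLE": 1, "SERIAL": 2000, "PERNUM": 2, "AGE": 5, "SEX": 1},
]

# Age 65 and older AND SEX code 2.
conditions = {"AGE": range(65, 200), "SEX": {2}}

only_selected = select_cases(persons, conditions)
with_coresidents = select_cases(persons, conditions, include_household=True)
```

The first call keeps one person; the second keeps both members of that person's household.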
Users should be careful with the case selection feature. It is possible to select a specific variable category (e.g., polygamous marriage) that does not exist across all the samples in your extract, thereby inadvertently excluding those samples from your dataset.
Extract option: Attach characteristics [top]
The data extract system can attach a characteristic of a person's mother, father, or spouse as a new variable on the person's record. It can also attach the characteristics of the household head. For example, using the variable "Occupation," it can make a new variable for "Occupation of mother." All persons in the extract who reside in a household with their mother would receive a value for this new variable. Persons without a mother present in the household would receive a missing value. The extract system automatically generates a unique name for the new variable.
The attached-characteristics feature uses the constructed IPUMS family interrelationship "pointer variables" that identify co-resident mothers, fathers, and spouses for each person. The pointer variables identify social mothers and fathers, not strictly biological parents.
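A minimal sketch of how a pointer variable can be used to attach a mother's characteristic. We assume a pointer named MOMLOC that holds the PERNUM of the person's mother within the household, with 0 meaning no mother is present. The occupation codes, the variable name OCC, and the output name OCC_MOM are hypothetical; the extract system generates its own names.

```python
def attach_mother(records, var, new_name):
    """Add new_name to each record, copied from the co-resident mother's var."""
    by_person = {
        (r["SAMPLE"], r["SERIAL"], r["PERNUM"]): r for r in records
    }
    out = []
    for r in records:
        mother = None
        if r["MOMLOC"] != 0:  # 0 is assumed to mean "no mother in household"
            mother = by_person.get((r["SAMPLE"], r["SERIAL"], r["MOMLOC"]))
        row = dict(r)
        row[new_name] = mother[var] if mother is not None else None  # None = missing
        out.append(row)
    return out


# Hypothetical household: person 2 is the child of person 1.
family = [
    {"SAMPLE": 1, "SERIAL": 1000, "PERNUM": 1, "OCC": 210, "MOMLOC": 0},
    {"SAMPLE": 1, "SERIAL": 1000, "PERNUM": 2, "OCC": 999, "MOMLOC": 1},
]

with_mother_occ = attach_mother(family, "OCC", "OCC_MOM")
```

The child receives the mother's occupation code, while the mother herself, who has no mother present in the household, receives a missing value.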
Extract option: Describe your extract [top]
You can describe your extract for future reference. Our system will display the description on the page where you download your data extract.
Extract option: Standardize monetary values [top]
IPUMS standardizes monetary units across time for selected variables. For more information about the specific index(es) available via IPUMS, please see the Monetary Standardization Feature page.
Upon selecting a standardized version of an IPUMS variable for inclusion in an extract, both the original IPUMS variable and a standardized version of the variable are delivered to the user in their customized data extract.