Harmonization
International census samples employ differing numeric classification systems, and reconciling these codes is a major part of this project. Variables must be easy to use for comparisons across time and space, which requires that we provide the lowest common denominator of detail that is fully comparable. At the same time, we must retain all meaningful detail in each sample, even when it is unique to a single dataset.
For most variables, it is impossible to construct a single uniform classification without losing information. Some samples provide far more detail than others, so the lowest common denominator of all samples inevitably loses important information. Composite coding schemes offer a solution. The first one or two digits of the code provide information available across all samples. The next one or two digits provide additional information available in a broad subset of samples. Finally, trailing digits provide detail that is only rarely available. For example, in IPUMS International, the first digit of the variable for marital status is comparable across all samples. The second digit delineates consensual unions from other forms of marriage (where appropriate) and distinguishes among the categories separated, divorced, and married with spouse absent. The final digit provides additional detail within the married and married-spouse-absent categories (such as polygamous marriages in Kenya). The basic goal of our harmonization efforts is to simplify use of the data while losing no meaningful information.
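The following Python is a minimal sketch of how a composite code can be read at different levels of detail. It assumes a hypothetical three-digit code; the function names and the example code values are illustrative only and are not actual IPUMS International codes.

```python
from collections import Counter


def truncate_code(code: int, total_digits: int, keep_digits: int) -> int:
    """Return the leading `keep_digits` digits of a composite code.

    `total_digits` is the full width of the code (hypothetically 3 here).
    """
    if not 1 <= keep_digits <= total_digits:
        raise ValueError("keep_digits must be between 1 and total_digits")
    return code // 10 ** (total_digits - keep_digits)


def tabulate(codes, total_digits: int, keep_digits: int) -> Counter:
    """Count records by the leading digits of their composite codes."""
    return Counter(truncate_code(c, total_digits, keep_digits) for c in codes)


# Hypothetical codes from two samples. Sample A records fine detail,
# while sample B only fills the first digit (trailing digits are zero).
sample_a = [100, 210, 214, 330]
sample_b = [100, 200, 300, 300]

# At one digit, both samples fall into the same comparable categories.
general_a = tabulate(sample_a, total_digits=3, keep_digits=1)
general_b = tabulate(sample_b, total_digits=3, keep_digits=1)

# At three digits, the detail that only sample A carries is still there.
detailed_a = tabulate(sample_a, total_digits=3, keep_digits=3)
```

Truncating to the leading digits gives the comparable classification for cross-sample work, and the full code keeps whatever extra detail a single sample offers.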
In addition to providing harmonized codes for variables and accompanying documentation, the IPUMS International project is carrying out a variety of additional tasks to improve data quality, not all of which have been implemented at this time. These tasks include the following:
- Cleaning data to eliminate duplicate records, inappropriately merged households, and other errors. Logical fixes are implemented where possible; otherwise, a household with similar characteristics is donated in place of the corrupt one. A flag identifies the records that were substituted.
- Developing internal consistency checks to maximize data integrity. These include, for example, examining consistency between age and marital status, occupation, and school attendance; looking for persons with multiple spouses in countries where this is not an accepted custom; and checking for agreement between household and individual characteristics.
- Implementing allocation procedures to impute values for missing or inconsistent data items, using logical edits together with probabilistic "hot deck" methodology. A data quality flag identifies allocated data items.
- Creating constructed variables to simplify data analysis, including family interrelationship variables. A system of logical rules identifies the record number within each household of the individual's mother, father, or spouse, if they were present in the household. These pointers allow users to automatically attach the characteristics of these kin or to construct measures of fertility and family composition. Other constructed variables describe family and household characteristics at the individual and household level (such as family and subfamily membership, family and subfamily size, and number of own children).
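The family pointers described in the last item can be illustrated with a short Python sketch. The field names (`record_number`, `mother_pointer`, `father_pointer`) and the convention that 0 means "not present in the household" are assumptions made for this example, not the project's actual variable names or coding.

```python
def attach_mother_attribute(household, attribute):
    """Copy one attribute of each person's mother onto the person's record.

    `household` is a list of dicts, one per person. A `mother_pointer` of 0
    (assumed convention) means the mother is not in the household.
    """
    by_number = {p["record_number"]: p for p in household}
    result = []
    for person in household:
        mother = by_number.get(person["mother_pointer"])
        enriched = dict(person)  # leave the input records unmodified
        enriched[f"mother_{attribute}"] = mother[attribute] if mother else None
        result.append(enriched)
    return result


def count_own_children(household):
    """Count, for each person, the household members who point to them as a parent."""
    counts = {p["record_number"]: 0 for p in household}
    for person in household:
        for field in ("mother_pointer", "father_pointer"):
            parent = person[field]
            if parent in counts:
                counts[parent] += 1
    return counts


# A hypothetical four-person household.
household = [
    {"record_number": 1, "age": 40, "mother_pointer": 0, "father_pointer": 0},
    {"record_number": 2, "age": 42, "mother_pointer": 0, "father_pointer": 0},
    {"record_number": 3, "age": 10, "mother_pointer": 1, "father_pointer": 2},
    {"record_number": 4, "age": 7, "mother_pointer": 1, "father_pointer": 0},
]

with_mother_age = attach_mother_attribute(household, "age")
own_children = count_own_children(household)
```

Here the two children gain a `mother_age` of 40, and the count of own children follows directly from the pointers without any further matching logic.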