How We Link Data


LIFE-M links multiple data sources to provide a rich longitudinal and intergenerational picture of health, human capital, and economic outcomes.  The first step of the process is to reconstitute birth and marriage families of the late 19th and early 20th century birth cohorts (G2) (arrow 1, see generational structure).  This requires linking birth records (G2) to one another.  Linking variables are parents’ full names (G1) and other information such as parents’ birth places when available.  Also, G2 can be linked to their own children (G3), because birth records contain mothers’ birth names.  This step allows the reconstruction of two to three generations of interrelated families.
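As a rough illustration of how birth records can be linked to one another, the sketch below groups records that share the same parents' names. It is not LIFE-M's actual procedure: the field names (father_name, mother_birth_name, child_name), the normalize helper, and the sample records are all hypothetical, and real linking would need to tolerate spelling variation rather than rely on exact keys.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase and collapse whitespace so trivial formatting differences compare equal."""
    return " ".join(name.lower().split())


def group_siblings(birth_records):
    """Group birth records that share the same (father, mother birth name) key.

    Assumption: exact matching on normalized names. This only illustrates
    the idea of reconstituting a birth family from parents' names.
    """
    families = defaultdict(list)
    for rec in birth_records:
        key = (normalize(rec["father_name"]), normalize(rec["mother_birth_name"]))
        families[key].append(rec["child_name"])
    return dict(families)


# Made-up records for illustration only.
records = [
    {"child_name": "Anna Miller", "father_name": "John Miller", "mother_birth_name": "Mary Schmidt"},
    {"child_name": "Carl Miller", "father_name": "john  miller", "mother_birth_name": "Mary Schmidt"},
    {"child_name": "Ruth Baker", "father_name": "Henry Baker", "mother_birth_name": "Ella Jones"},
]
families = group_siblings(records)
```

Here the first two records fall into one family because their parents' names agree after normalization. The mother's birth name does the same job when linking a G2 woman to her own children's birth records.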

The second step of the linking process is to link G2 to their marriage records (arrow 2).  Linking variables are the bride's and groom's full birth names, exact date of birth (allowing for over-reporting of age, Blank et al. 2009), and place of birth (when available in the collection).  Although 90% of women born in this period were ever married (Bailey et al. 2014), marriage registration was highly incomplete, so we expect a match rate below 100 percent.  When possible, we repeat this step for G1 and G3.

The third step is to link G2 to their grandparents (G0) using the 1900 and 1880 censuses (arrow 3).  We first link the parents’ (birth or married) names (G1) to names in the 1900 census, which provides key information on birth place, age, and race.  Next, we link G1 to the 1880 census using either their names alone or their names together with age, birth place, and race (obtained from the 1900 link).  This step connects G2 to G0 and is important because it adds G1’s early-life family conditions, including G0 ancestry/heritage and economic circumstances (such as occupation, race, and address).

The fourth step links four generations (G0, G1, G2, G3) to the full-count 1940 census (arrow 4).  This step uses full names (including birth and married names of women), exact birth dates/age, and birth place.  The 1940 census is the first to include rich information on educational attainment, wages and salary, and many employment outcomes.  While this will only be possible for some of G0 (many will have died before 1940), most of G1, G2 as adults (in their marriage families), and G3 as children (in their birth families) will be recorded in the 1940 census.

Process for Creating Links

Our process begins with hand linking to obtain a ground truth dataset. Creating the ground truth is semi-automated, combining computer programming with human input.  After cleaning and standardizing the data, we use a bigram matching procedure to generate a list of candidate matches based on name similarity and age.
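To make the idea of bigram-based candidate generation concrete, here is a minimal sketch. It scores name similarity with the Sorensen-Dice coefficient over sets of character bigrams and keeps only candidates whose birth year is close to the target's. The scoring formula, the thresholds (min_score, max_age_gap), the record layout, and the sample names are assumptions made for illustration; the text above does not specify the details of the actual procedure.

```python
def bigrams(text):
    """Return the set of adjacent character pairs in a lowercased string with spaces removed."""
    s = "".join(text.lower().split())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_similarity(a, b):
    """Sorensen-Dice coefficient over character bigram sets, from 0.0 to 1.0.

    Using sets means repeated bigrams within a name are counted once.
    """
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))


def candidate_matches(target, pool, min_score=0.6, max_age_gap=2):
    """Return (score, record) pairs that pass both the name and age filters, best first.

    min_score and max_age_gap are hypothetical thresholds, not LIFE-M's.
    """
    out = []
    for rec in pool:
        score = dice_similarity(target["name"], rec["name"])
        close_in_age = abs(target["birth_year"] - rec["birth_year"]) <= max_age_gap
        if score >= min_score and close_in_age:
            out.append((score, rec))
    out.sort(key=lambda pair: pair[0], reverse=True)
    return out


# Made-up records for illustration only.
target = {"name": "John Smith", "birth_year": 1890}
pool = [
    {"name": "Jon Smith", "birth_year": 1891},
    {"name": "John Smith", "birth_year": 1850},
    {"name": "Mary Jones", "birth_year": 1890},
]
candidates = candidate_matches(target, pool)
```

In this example only "Jon Smith" survives: the exact-name record is ruled out by the age filter, and "Mary Jones" shares too few bigrams with the target. The surviving list is what a human reviewer would then choose from.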

From this list of candidate links, LIFE-M creates links using an independent, blinded human review process. “Data trainers” first participate in a rigorous orientation process. During this period, they receive detailed feedback on their accuracy relative to an answer key. They continue this process, which takes 10 to 20 hours of data training, until their matches agree with the truth dataset 95 percent of the time.

After completing this orientation, data trainers become part of a team that conducts independent, blinded clerical review. Each potential match is reviewed by two trainers, who choose from a set of candidate matches generated using a probabilistic bigram match on name, date of birth (or age), and birth state (Wasi 2014). When the two initial reviewers disagree, three additional trainers re-review the records and make independent determinations about whether the candidate records are true links. Our automated system randomly assigns batches among the 15 to 30 trainers who are employed at any time, so it is difficult for trainers to coordinate with peers.  Random “audit batches” provide feedback to trainers about the accuracy of their decisions, and weekly meetings encourage discussions of difficult cases to help trainers achieve consistent and accurate matches.  The result of this process is a highly vetted, hand-matched ground truth dataset for a random sample.
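The review workflow can be sketched as a small decision routine. The text says that disagreements between the first two reviewers go to three additional reviewers, but it does not say how their decisions are combined. The strict-majority rule below is an assumption, as are the function names and the decision labels ("c1", "no_match").

```python
import random
from collections import Counter


def assign_reviewers(trainers, k, rng):
    """Pick k distinct trainers at random for a batch, mimicking random batch assignment."""
    return rng.sample(trainers, k)


def resolve(first_two, extra_three=None):
    """Combine independent review decisions for one record.

    first_two:   decisions of the two initial reviewers
    extra_three: decisions of three additional reviewers, consulted only
                 when the first two disagree
    Returns the accepted decision, or None if no decision is reached yet.
    Assumption: a disagreement is settled by a strict majority (3 of 5 votes).
    """
    a, b = first_two
    if a == b:
        return a
    if extra_three is None:
        return None  # escalation to three more reviewers is needed
    votes = Counter([a, b, *extra_three])
    decision, count = votes.most_common(1)[0]
    return decision if count >= 3 else None


# Hypothetical trainer pool and decisions, for illustration only.
rng = random.Random(0)
trainers = [f"trainer_{i}" for i in range(20)]
first_pair = assign_reviewers(trainers, 2, rng)

agreed = resolve(("c1", "c1"))
needs_escalation = resolve(("c1", "no_match"))
after_escalation = resolve(("c1", "no_match"), ("c1", "c1", "no_match"))
```

Each decision is either a candidate identifier or a "no match" label, so the same routine covers both choosing among candidates and rejecting all of them.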

Validation of Human Links

To validate the quality of the LIFE-M ground truth, the Record Linking Lab at Brigham Young University (BYU) employed research assistants to use genealogical methods to hand link a sample of 543 boys, 225 of whom had been linked by LIFE-M. They used multiple sources of genealogical data (only a subset of which were used in the LIFE-M linking) to create correct record linkages and complete family trees. Although genealogical linking is cost- and time-prohibitive for larger projects, its advantage is that it produces a very low rate of false links.  The BYU team had no knowledge of LIFE-M’s links while doing this exercise, so their work can be viewed as independent and blinded.

BYU’s genealogical method ultimately linked 392 of 543 boys for a match rate of 72 percent. This was 151 more links than the LIFE-M team found for the same sample of 543 boys. The success of the BYU team can be attributed to the use of multiple data sources and more intensive searching to distinguish between seemingly similar matches. However, for the 225 links found by LIFE-M, the BYU team agreed with these matches 96 percent of the time. Only 16 of LIFE-M’s links differed from those found by the genealogical method. Taking the genealogical method as the gold standard, this implies that LIFE-M’s false positive rate would be 4 percent.  In a separate review process for a much larger sample, the LIFE-M data trainers found a false link rate of around 2 percent in the ground truth data.
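The match rate quoted above is a simple ratio of linked records to records searched. A minimal sketch, using only the 392 and 543 figures from the text (the function name is hypothetical):

```python
def rate(numerator, denominator):
    """Return numerator / denominator as a fraction, guarding against an empty sample."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


# BYU linked 392 of the 543 boys in the validation sample.
byu_match_rate = rate(392, 543)
byu_match_percent = round(100 * byu_match_rate)
```

The same helper applies to a false link rate, where the numerator is the number of links judged incorrect and the denominator is the number of links made.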