Sherman D. Hanna and Suzanne Lindamood, October 27, 2006
For various reasons beyond the scope of this note, some households in the SCF have different characteristics in different implicates (Lindamood, Hanna, & Bi, 2007; Qin, 1998). For instance, a household could be coded as Black in one implicate, Hispanic in another, and white in the remaining three. Because of this possibility, when you analyze a subsample of an SCF dataset, such as Hispanic households, you should select only the households that have the subsample characteristic in all five implicates. This selection may be accomplished by cross-tabulating the household ID number by the variable of interest for each implicate, or by using the code that Sung and Montalto (1998) developed to simplify the procedure.
The 2004 SCF illustrates how household characteristics can differ across implicates. In a subsample of Black households (Example A), the number of households coded as Black ranges from 483 in Implicates 2 and 4 to 485 in Implicates 3 and 5. Applying the code described above, which keeps only the households that have the same response for a particular variable in all five implicates (Example B), shows that only 482 households have the respondent coded as "Black" in all five implicates.
This example provides an additional reason to use all five implicates in analyses, and it shows the need to select a subsample from all five implicates rather than from just one implicate (Lindamood, Hanna, & Bi, 2007).
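data one; set data.SCF04;
  implic=y1-(10*yy1);   /* implicate number (1-5) from case ID Y1 and household ID YY1 */
  /* etc. */
  if black=1;           /* keep observations coded as Black */
run;
proc freq data=one;
  tables implic;
  title ' Example A: Black . . . . . in each implicate';
run;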
The FREQ Procedure

                                    Cumulative    Cumulative
implic    Frequency     Percent      Frequency       Percent
     1          484       20.00            484         20.00
     2          483       19.96            967         39.96
     3          485       20.04           1452         60.00
     4          483       19.96           1935         79.96
     5          485       20.04           2420        100.00
Code is based on Sung and Montalto (1998).
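data one; set data.SCF04;
  implic=y1-(10*yy1);
  /* etc. */
  if black=1;
run;
proc sort data=one; by yy1 implic; run;
/* count the implicates in which each household appears in the subsample */
proc means data=one noprint; var implic; by yy1;
  output out=FIVE n=noimp;
run;
/* keep only households present in all five implicates */
data final; merge one five; by yy1;
  if noimp=5;
run;
proc freq data=final;
  tables implic;
  title ' Example B: Black . . . . .in all 5 implicates';
run;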
The FREQ Procedure

                                    Cumulative    Cumulative
implic    Frequency     Percent      Frequency       Percent
     1          482       20.00            482         20.00
     2          482       20.00            964         40.00
     3          482       20.00           1446         60.00
     4          482       20.00           1928         80.00
     5          482       20.00           2410        100.00
Should You Delete Households That Differ Across Implicates from an Analysis of the Total Sample?
When your research involves analyzing the entire sample and also analyzing subsamples separately, you might want to delete the households whose values of the criterion variable differ across implicates. The household ID can be used to determine which households have a variable coded differently in different implicates.
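One way to find those households is sketched below. It is an illustration rather than the procedure used to build the lists that follow, and it assumes that data.SCF04 contains all five implicates with the variables YY1 and X6809. The dataset name DIFFERS and the variable NVALUES are made up for this example.

```sas
/* Sketch: list households whose X6809 value is not the same in all implicates. */
/* Assumes data.SCF04 holds all five implicates; DIFFERS and NVALUES are        */
/* hypothetical names used only in this illustration.                           */
proc sql;
  create table differs as
  select yy1,
         count(distinct x6809) as nvalues   /* number of distinct codes per household */
  from data.SCF04
  group by yy1
  having count(distinct x6809) > 1          /* more than one code means the implicates disagree */
  order by yy1;
quit;

proc print data=differs noobs;
  var yy1 nvalues;
  title 'Households with more than one value of X6809 across implicates';
run;
```

The same query can be pointed at another survey year or another variable by changing the dataset name and X6809.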
For the variable "race/ethnic group" (X6809), 6 households in the 2007 SCF have different values for the variable. The following link lists the household IDs of those households.
File listing the household ID (YY1) for those households in the 2007 SCF.
For the variable "race/ethnic group" (X6809), 13 households in the 2004 SCF have different values for the variable. The following link lists the household IDs of those households.
File listing the household ID (YY1) for those households in the 2004 SCF.
These 13 households in the 2001 SCF have different values for X6809.
File listing the household ID (YY1) for those households in the 2001 SCF.
These 7 households in the 1998 SCF have different values for X6809.
File listing the household ID (YY1) for those households in the 1998 SCF.
These 16 households in the 1995 SCF have different values for race (X5909).
File listing the household ID (YY1) for those households in the 1995 SCF.
These 15 households in the 1992 SCF have different values for X5909.
File listing the household ID (YY1) for those households in the 1992 SCF.
If you are not analyzing separate subsamples, it is less obvious whether you should delete households with multiple values of a key variable. For instance, if I am interested in the effect of race/ethnic group on stock ownership, I know that when I include all households in the 2004 SCF in the analysis, 13 of those households will have different values of race in different implicates. While this will make only a small difference in my results, conceptually perhaps I should exclude those households, since there is no way of knowing what the actual response to X6809 was. It is possible that a respondent refused to answer the question and the SCF staff imputed an answer. It is also possible that the respondent answered the question but the SCF staff assigned different responses to help protect the privacy of the respondent (see discussion in Lindamood, Hanna, & Bi, 2007).
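If you do decide to exclude those households from a full-sample analysis, the step might look like the following sketch. It assumes, as above, that data.SCF04 contains all five implicates with YY1 and X6809. CONSISTENT and HOUSEHOLDS_KEPT are hypothetical names, and whether exclusion is appropriate remains the judgment call described above.

```sas
/* Sketch: keep only households whose X6809 value is identical in every implicate. */
/* CONSISTENT is a hypothetical output dataset name.                               */
proc sql;
  create table consistent as
  select a.*
  from data.SCF04 as a
       inner join
       (select yy1
        from data.SCF04
        group by yy1
        having count(distinct x6809) = 1) as b   /* one distinct code per household */
  on a.yy1 = b.yy1;

  /* Report how many households remain in the analysis file. */
  select count(distinct yy1) as households_kept
  from consistent;
quit;
```

Comparing HOUSEHOLDS_KEPT with the number of households in the original file shows how many were dropped.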
Similar situations commonly arise whenever an analysis uses a subgroup, for instance when a subsample is selected based on age or household type. The easiest way to deal with this issue is to apply the Sung and Montalto (1998) code to the subsample: if the number of observations is the same after applying the code, then there is no variation across implicates.
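The check on the number of observations could be done as in the sketch below, which reuses the structure of Example B. AGE is a hypothetical variable assumed to be created in a step not shown here, and the cutoff of 35 is arbitrary. The dataset names SUB, FIVE, and FINAL are also chosen only for illustration.

```sas
/* Sketch: test whether an age-based subsample is identical across implicates. */
data sub; set data.SCF04;
  implic=y1-(10*yy1);
  /* AGE is hypothetical: assumed to be defined in an omitted step */
  if 0 < age < 35;
run;
proc sort data=sub; by yy1 implic; run;
proc means data=sub noprint; var implic; by yy1;
  output out=five n=noimp;      /* implicates per household within the subsample */
run;
data final; merge sub five; by yy1;
  if noimp=5;                   /* households in the subsample in all five implicates */
run;
/* Compare observation counts before and after applying the selection. */
data _null_;
  if 0 then set sub nobs=n_before;
  if 0 then set final nobs=n_after;
  put 'Observations before: ' n_before ' after: ' n_after;
  if n_before = n_after then put 'No variation across implicates for this subsample.';
  else put 'Some households are in the subsample in fewer than five implicates.';
  stop;
run;
```

The messages are written to the SAS log. If the two counts differ, FINAL is the subsample restricted to households that qualify in all five implicates.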