USECSPRO: Joining records

Joining records is a process where the information related to same logical units artificially spread over several data records during data collection is reunited back into a single record. 'Joining' should not be confused with 'merging', which has a broader meaning.

A typical situation where a join may be required is when the architect of the CSPro database has separated e.g. the educational information on household members from their employment information into two different schedules. Records in both schedules would contain a case id (household id) and a supplementary roster id, which enumerates persons within the same household.

Here is how a join is done in Stata:

  cspro join using "file.dat", dictionary("file.dcf") level("levelname") records("R1 R2 ... RN")

An equivalent Mata function is:

  cspro_join("dictionary.dcf", "data.dat", "levelname", "R1 R2... RN")

In both commands, the records are united using the identifier (one or typically more variables) specified as the identifier for the given data level in the CSPro dictionary file.

Warning: joining is not a safe operation. There is not enough information in the dictionary file to validate whether a particular join is possible or not. usecspro will only join records at the same hierarchical level of CSPro database. Any other data manipulations should be done in Stata with the help of the merge command.

An example of the data structure, where one should not attempt a join is when e.g. Schedule A collects information on the household, Schedule B collects information on household members, and Schedule C collects information on some assets. With such a hierarchy, there will be two rosters: one for individuals, and one for assets, with two roster-ids (with both usually starting with 1). One should avoid joining B+C, which in this case means setting direct association of each one household member with each one asset. It is probably safe, however, to do a join A+B and A+C.

A separate Mata function is reserved for the case where the join should occur using more variables than the ID for the given level, e.g. when the effective id should be a combination of ID and PID, while only ID is declared the identifier of the level.

cspro_guess_joinp("dictionary.dcf","data.dat", "levelname", "R1 R2 ...RN")

Note that the arguments are the same, but it is the algorithm that is different.

Windows (and only Windows) users can select multiple records in the data import dialog. In that case there will be an attempt to 'guess_join' them during conversion. The program has no way of determining whether such an operation makes sense or not and whether it is technically possible. It may or may not succeed.

Joining records, and especially 'guess_joining' records should be treated as experimental and undertaken with caution. Everything that 'join' and 'guess_join' do can be done with regular Stata's merge command, albeit with a few manipulations with the data to properly sort it and determine the appropriate IDs and merge strategy.