What is SAVE?
The SAVE study is a survey initiated in 2001 and produced by the Mannheim Research Institute for the Economics of Aging. It collects detailed quantitative information on economic variables and on relevant socio-psychological characteristics of a representative sample of German households. SAVE is constructed as a panel: since 2005, the same households have been interviewed at yearly intervals.
Who can use SAVE? How can I obtain the data?
Data and documents are released for academic research and teaching only: commercial use is strictly prohibited.
The SAVE study can be ordered from GESIS by filling in and sending the general order form for archive data; note that the button at the end of the form lets you send the order directly by e-mail.
In your order form, refer to the following study numbers and titles:
- 4051: Saving and financial investment of private households (SAVE) 2001
- 4436: Saving and financial investment of private households (SAVE) 2003/04
- 4437: Saving and financial investment of private households (SAVE) 2005
- 4521: Saving and financial investment of private households (SAVE) 2006
- 4740: Saving and financial investment of private households (SAVE) 2007
- 4970: Saving and financial investment of private households (SAVE) 2008
- 5230: Saving and financial investment of private households (SAVE) 2009
Further information on data access can be found here.
What does it mean that some data are imputed? Are the data made-up?
Partial lack of information, or item nonresponse, is a well-known phenomenon in household studies: all the main surveys (such as the German SOEP or the American Survey of Consumer Finances) have to deal with this issue. Deleting the observations with missing values and relying only on a complete-case analysis not only reduces the sample size, but might also lead to biased results, as item nonresponse need not be random among respondents.
To handle this problem, missing values in SAVE are imputed, that is, filled in with plausible values, using a multiple imputation procedure. This method aims to capture all relevant relationships between variables in order to preserve their correlation structure. To do so, the missing values in each variable are imputed conditional on as many relevant and available variables as possible. The goal of imputation is not to create artificial information, but to use the existing information in such a way that public users can analyze the resulting complete dataset with standard statistical methods for complete data. Therefore, in a first step prior to the multiple imputation, the database's panel structure was used to impute several variables logically.
As imputation is a resource-intensive process that is beyond the means of many users, the main data providers (such as the Federal Reserve Board, which produces the Survey of Consumer Finances in the US) usually provide final users with imputed data. In SAVE all imputed values are flagged, so users are free to ignore the imputations: the dataset “SAVE_[year]_indicator” (released together with the SAVE data) indicates whether a given value is original (0), stochastic (1), or imputed (2).
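In practice, the indicator dataset makes it easy to restrict an analysis to originally reported values. The following is a minimal sketch; the variable name and values are invented for illustration, and only the 0/1/2 codes come from the SAVE documentation:

```python
import numpy as np

# Toy stand-in for one SAVE variable and its indicator column
# (0 = original value, 1 = stochastic, 2 = imputed).
income = np.array([2500.0, 3100.0, 1800.0, 2200.0])
flag = np.array([0, 2, 0, 1])

# Keep only originally reported values; everything imputed becomes NaN.
original_only = np.where(flag == 0, income, np.nan)

# Share of values that were filled in by any kind of imputation.
share_imputed = np.mean(flag != 0)
print(original_only, share_imputed)
```

In the real data the indicator file mirrors the structure of the corresponding SAVE dataset, so the same element-wise masking applies column by column.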
A complete description of the imputation mechanism in SAVE is given in:
- Schunk, D. (2008): “A Markov chain Monte Carlo algorithm for multiple imputation in large surveys.” Advances in Statistical Analysis 92(1), 101 - 114.
- Ziegelmeyer, M. (2009): “Documentation of the logical imputation using the panel structure of the 2003-2008 German SAVE Survey.” MEA Discussion Paper 173-09, MEA Mannheim.
An overview of approaches to deal with item nonresponse is presented in:
- Rässler, S. and R. Riphahn (2006): “Survey item nonresponse and its treatment.” Allgemeines Statistisches Archiv, 90, 217-232.
For a more general introduction to multiple imputation you are referred to:
- Rubin, D.B. (1987): “Multiple Imputation for Nonresponse in Surveys.” Wiley, New York.
- Little, R.J.A. and D.B. Rubin (2002): “Statistical Analysis with Missing Data.” Wiley, New York.
Basic information about multiple imputation in the form of answers to Frequently Asked Questions is provided here.
Why are there five datasets for each year? Which dataset should I use?
Missing data in SAVE are imputed using a multiple imputation technique: a Monte Carlo technique in which the missing values are replaced by m > 1 simulated versions. As in other surveys, such as the Survey of Consumer Finances, m is set to five in SAVE. In other words, the whole imputation algorithm is repeated five times, producing the five datasets that are provided to the final user.
To obtain meaningful results, each of the five completed datasets should be analyzed with standard methods, and the results should be combined to produce estimates and confidence intervals that incorporate the missing-data uncertainty. Standard errors obtained from a single dataset are generally too low; moreover, single imputation is more prone to produce biased results. Analyzing a single dataset is nevertheless useful for becoming familiar with the data and for getting a first idea of the magnitude and direction of the estimated effects. For this purpose, it does not matter which of the five datasets is used.
Rubin, D.B. (1996) “Multiple Imputation After 18+ Years” Journal of the American Statistical Association, 91(434), pp. 473-489 explains how to combine the results obtained from the separate analysis of the five datasets.
See also Appendix 6.2 in Schunk, D. (2007) “A Markov Chain Monte Carlo Multiple Imputation Procedure for Dealing with Item Nonresponse in the German SAVE Survey” MEA Discussion paper 121-07, University of Mannheim.
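The combination rule described by Rubin is simple enough to sketch directly. Below, the point estimates and variances are made-up numbers standing in for a coefficient estimated once on each of the five imputed datasets:

```python
import numpy as np

def combine_rubin(estimates, variances):
    """Pool point estimates and variances from m imputed datasets
    using Rubin's rules: total variance = within-imputation variance
    plus (1 + 1/m) times the between-imputation variance."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    w_bar = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    t = w_bar + (1 + 1 / m) * b         # total variance
    return q_bar, t

# Hypothetical results from the five SAVE datasets.
est = [0.52, 0.48, 0.55, 0.50, 0.51]
var = [0.010, 0.012, 0.011, 0.009, 0.010]
q, t = combine_rubin(est, var)
print(round(q, 3), round(np.sqrt(t), 3))  # pooled estimate and standard error
```

Note how the pooled standard error exceeds the square root of the average within-dataset variance: the extra term reflects the uncertainty due to imputation, which a single-dataset analysis ignores.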
In each dataset there are three different types of weights: which one should be used?
To answer this question let us go back one step and think about why we want to use weights.
These weights are aimed at “offering protection against unfavourable sample compositions” (Holt and Smith (1979): “Post Stratification.” Journal of the Royal Statistical Society, Series A, 142(1)), recalibrating the sample so that the results are more representative of the whole population. Weights should always be used when computing descriptive statistics (e.g. sample averages), while their use in regressions is questionable, as they tend to reduce the precision of the estimates without providing much extra benefit (Winship, C. and L. Radbill (1994): “Sampling Weights and Regression Analysis.” Sociological Methods & Research, 23(2), 230-251). The following discussion is therefore tailored to the univariate case.
So, how do weights improve sample averages? Imagine you want to know the percentage of graduates in the population: if your sample contains many young individuals with high income (who also tend to be better educated, and therefore to hold a university degree more often), the simple average across the sample will overestimate the percentage of graduates in the population. Using a weight that attaches a “lighter” value to the oversampled group will, on the contrary, give you a better picture of the “real” value in the population.
In principle, therefore, which set of weights is more appropriate depends on the population characteristic you are interested in, and on the degree of influence that each stratum has on the result. Returning to the previous example, if we expect the percentage of graduates to vary more with age and income of the respondents than with age and household size (that is, if we expect the differences in the percentage of graduates among age and income groups to be much bigger than among income and household size groups), then the set of weights that recalibrates the relative weight of households by age and income is clearly better. It goes without saying, however, that once a set of weights is selected, it should then be used to produce all the descriptive statistics in the same paper.
In practice, however, the differences among the three sets of weights in SAVE are not that big. Fortunately, the different groups are about equally well covered in the SAVE sample (note, in fact, that all the weights are relatively small). The results therefore do not change much whether one set of weights or another is used: once you choose a set of weights, you can use the other two for robustness checks.
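The graduate example above can be made concrete with a toy calculation. The sample, degrees, and weights here are all invented: graduates are deliberately oversampled, and the (hypothetical) weights shrink that group's influence:

```python
import numpy as np

# 1 = holds a university degree, 0 = does not.
# Graduates are oversampled: 3 of 5 respondents hold a degree.
graduate = np.array([1, 1, 1, 0, 0])

# Hypothetical post-stratification weights: down-weight the
# oversampled graduates, up-weight the underrepresented group.
weight = np.array([0.5, 0.5, 0.5, 1.5, 1.5])

unweighted = graduate.mean()                     # overestimates the share
weighted = np.average(graduate, weights=weight)  # closer to the population
print(unweighted, round(weighted, 3))
```

The weighted mean is simply the weight-normalized sum, 1.5 / 4.5 = 1/3, compared with the unweighted 0.6; the same one-line `np.average` call applies to any descriptive statistic computed as a mean.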
Are the values in the survey 2001 in Deutsche Mark(DM) or in Euros?
Respondents reported values in the 2001 survey in Deutsche Mark (DM). They have subsequently been converted to euros, so that final users do not have to convert the data themselves.
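For anyone comparing against original DM figures, the conversion uses the irrevocably fixed official rate of 1.95583 DM per euro:

```python
# Official fixed conversion rate: 1 EUR = 1.95583 DM.
DM_PER_EUR = 1.95583

def dm_to_eur(amount_dm: float) -> float:
    """Convert a Deutsche Mark amount to euros at the fixed rate."""
    return amount_dm / DM_PER_EUR

print(round(dm_to_eur(1000), 2))  # 1000 DM = 511.29 EUR
```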
How is the variable bik coded?
In the current release of the SAVE data, the values of the variable bik (indicating the size of the municipality and whether it lies in the core of the region) are not labelled.
The labels are as follows:
0 = more than 500,000, core area
1 = more than 500,000, not core
2 = 100,000 - 499,999, core area
3 = 100,000 - 499,999, not core
4 = 50,000 - 99,999, core area
5 = 50,000 - 99,999, not core
6 = 20,000 - 49,999
7 = 5,000 - 19,999
8 = 2,000 - 4,999
9 = less than 2,000
Future releases of the data will correct this omission.
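Until then, the labels listed above can be attached in code. A minimal sketch (the dictionary simply transcribes the list; applying it to your own data is left to your statistical package of choice):

```python
# Value labels for the SAVE variable bik, as listed in this FAQ.
BIK_LABELS = {
    0: "more than 500,000, core area",
    1: "more than 500,000, not core",
    2: "100,000 - 499,999, core area",
    3: "100,000 - 499,999, not core",
    4: "50,000 - 99,999, core area",
    5: "50,000 - 99,999, not core",
    6: "20,000 - 49,999",
    7: "5,000 - 19,999",
    8: "2,000 - 4,999",
    9: "less than 2,000",
}

# Example lookup for an unlabelled bik value.
print(BIK_LABELS[2])
```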
What does the value 10 for the variable bula in the surveys 2003, 2005, 2006 and 2007 mean?
Starting with the 2003 survey, the variable bula (Bundesland, i.e. federal state) also assumes the value 10, which has unfortunately not been labelled.
The correct labelling of the values is the following: