In statistics, imputation is the process of replacing missing data with substituted values. When substituting for a data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". Missing data cause three main problems: they can introduce a substantial amount of bias, make handling and analysis of the data more arduous, and reduce efficiency. Because missing data can create problems for analyzing data, imputation is seen as a way to avoid the pitfalls involved with listwise deletion of cases that have missing values. That is to say, when one or more values are missing for a case, most statistical packages default to discarding any case that has a missing value, which may introduce bias or affect the representativeness of the results. Imputation preserves all cases by replacing missing data with an estimated value based on other available information. Once all missing values have been imputed, the data set can then be analyzed using standard techniques for complete data. Imputation theory is constantly developing and thus requires consistent attention to new information regarding the subject. There have been many theories embraced by scientists to account for missing data, but the majority of them introduce bias. A few of the well-known attempts to deal with missing data include: hot deck and cold deck imputation; listwise and pairwise deletion; mean imputation; regression imputation; last observation carried forward; stochastic imputation; and multiple imputation.
Listwise (complete case) deletion
By far the most common means of dealing with missing data is listwise deletion (also known as complete case analysis), which is when all cases with a missing value are deleted. If the data are missing completely at random, then listwise deletion does not add any bias, but it does decrease the power of the analysis by decreasing the effective sample size. For example, if 1000 cases are collected but 80 have missing values, the effective sample size after listwise deletion is 920. If the cases are not missing completely at random, then listwise deletion will introduce bias because the sub-sample of cases represented by the missing data is not representative of the original sample (and if the original sample was itself a representative sample of a population, the complete cases are not representative of that population either). While listwise deletion is unbiased when the missing data are missing completely at random, this is rarely the case in actuality.
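The effect of listwise deletion on the effective sample size can be shown directly in software. The sketch below is a minimal illustration in Python with pandas, using an invented dataset and invented column names; it drops 80 rows with randomly missing values from a sample of 1000, leaving the 920 complete cases described above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 12_000, n),
})
# Make 80 income values missing completely at random.
df.loc[rng.choice(n, size=80, replace=False), "income"] = np.nan

# Listwise deletion: any row containing a missing value is dropped.
complete_cases = df.dropna()
print(len(df), len(complete_cases))  # 1000 920
```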
Pairwise deletion (or "available case analysis") involves deleting a case when it is missing a variable required for a particular analysis, but including that case in analyses for which all of its required variables are present. When pairwise deletion is used, the total N for the analysis will not be consistent across parameter estimations. Because of the incomplete N values at some points in time, while still maintaining complete case comparison for other parameters, pairwise deletion can introduce impossible mathematical situations such as correlations that are over 100%.
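As a hedged illustration of how pairwise deletion yields a different effective N for each parameter, the sketch below (invented toy data) relies on the fact that pandas' corr() computes each correlation from the pairwise-complete observations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.0, np.nan, 3.0, 5.0, 6.0],
    "z": [1.0, 1.5, 2.0, np.nan, 3.0],
})

# Each correlation below is based on a different subset of rows,
# so the effective sample size varies across the matrix.
print(df.corr())

# Pairwise sample sizes: counts of rows where both variables are observed.
observed = df.notna().astype(int)
print(observed.T @ observed)
```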
The one advantage complete case analysis has over other methods is that it is straightforward and easy to implement. This is a large reason why complete case analysis remains the most popular method of handling missing data in spite of its many disadvantages.
Single imputation
Hot-deck
A once-common method of imputation was hot-deck imputation, where a missing value is imputed from a randomly selected similar record. The term "hot deck" dates back to the storage of data on punched cards, and indicates that the information donors come from the same dataset as the recipients. The stack of cards was "hot" because it was currently being processed.
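A minimal sketch of a random hot-deck, assuming the simplest case in which any observed record may serve as a donor (in practice donors are usually restricted to records that are similar on auxiliary variables). The data and variable name are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
scores = pd.Series([7.0, np.nan, 5.0, np.nan, 9.0, 6.0], name="score")

# Donor pool: observed values from the same dataset (the "hot" deck).
donors = scores.dropna().to_numpy()

imputed = scores.copy()
imputed[scores.isna()] = rng.choice(donors, size=scores.isna().sum())
print(imputed)
```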
One form of hot-deck imputation is called "last observation carried forward" (or LOCF for short), which involves sorting a dataset according to any of a number of variables, thus creating an ordered dataset. The technique then finds the first missing value and uses the cell value immediately prior to the missing data to impute the missing value. The process is repeated for the next cell with a missing value until all missing values have been imputed. In the common scenario in which the cases are repeated measurements of a variable for a person or other entity, this represents the belief that if a measurement is missing, the best guess is that it has not changed from the last time it was measured. This method is known to increase the risk of increasing bias and potentially false conclusions. For this reason LOCF is not recommended for use.
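A minimal LOCF sketch for one subject's repeated measurements, using pandas' forward fill on invented data; as noted above, the method is generally discouraged.

```python
import numpy as np
import pandas as pd

visits = pd.DataFrame({
    "week":  [0, 2, 4, 6, 8],
    "score": [10.0, 12.0, np.nan, np.nan, 15.0],
}).sort_values("week")

# Each missing measurement is replaced by the last observed value before it.
visits["score_locf"] = visits["score"].ffill()
print(visits)
```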
Cold-deck
Cold-deck imputation, by contrast, selects donors from another dataset. Due to advances in computer power, more sophisticated methods of imputation have generally superseded the original random and sorted hot-deck imputation techniques.
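A minimal cold-deck sketch on invented data: the donor pool is drawn from a separate dataset (here a hypothetical earlier survey) rather than from the dataset being imputed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
current = pd.Series([3.0, np.nan, 4.0, np.nan], name="hours")
previous_survey = pd.Series([2.5, 3.5, 4.0, 5.0, 3.0], name="hours")  # external donors

imputed = current.copy()
imputed[current.isna()] = rng.choice(previous_survey.to_numpy(),
                                     size=current.isna().sum())
print(imputed)
```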
Mean substitution
Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable. However, mean imputation attenuates any correlations involving the variable(s) that are imputed. This is because, in cases with imputation, there is guaranteed to be no relationship between the imputed variable and any other measured variables. Thus, mean imputation has some attractive properties for univariate analysis but becomes problematic for multivariate analysis.
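The effect described above can be seen numerically in a small sketch (invented data): mean substitution leaves the sample mean unchanged but shrinks the variance and, in a multivariate setting, attenuates correlations with the imputed variable.

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, np.nan, 8.0, np.nan, 6.0])
x_imputed = x.fillna(x.mean())  # Series.mean() ignores missing values

print(x.mean(), x_imputed.mean())  # mean is preserved (5.0 in both cases)
print(x.var(), x_imputed.var())    # variance shrinks after imputation
```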
Regression
Regression imputation has the opposite problem of mean imputation. A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where the value of that variable is missing. In other words, available information for complete and incomplete cases is used to predict the value of a specific variable. Fitted values from the regression model are then used to impute the missing values. The problem is that the imputed data do not have an error term included in their estimation, so the estimates fit perfectly along the regression line without any residual variance. This causes relationships to be over-identified and suggests greater precision in the imputed values than is warranted. The regression model predicts the most likely value of the missing data but does not supply uncertainty about that value.
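A minimal sketch of deterministic regression imputation on simulated data, assuming scikit-learn is available; the imputed points fall exactly on the fitted line, illustrating the missing residual variance discussed above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.normal(0, 1, 200)})
df["y"] = 2.0 * df["x"] + rng.normal(0, 1, 200)
df.loc[rng.choice(200, 40, replace=False), "y"] = np.nan  # y missing for 40 cases

observed = df.dropna()
model = LinearRegression().fit(observed[["x"]], observed["y"])

missing = df["y"].isna()
df.loc[missing, "y"] = model.predict(df.loc[missing, ["x"]])
# The imputed values have no residual scatter, overstating the x-y association.
```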
Stochastic regression was a fairly successful attempt to correct the lack of an error term in regression imputation by adding the average regression variance to the regression imputations to introduce error. Stochastic regression shows much less bias than the above-mentioned techniques, but it still misses one thing - if data are imputed then intuitively one would think that more noise should be introduced to the problem than simple residual variance.
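A minimal sketch of stochastic regression imputation, continuing the simulated example above: a random draw from the estimated residual distribution is added to each regression prediction to restore part of the natural scatter.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
df = pd.DataFrame({"x": rng.normal(0, 1, 200)})
df["y"] = 2.0 * df["x"] + rng.normal(0, 1, 200)
df.loc[rng.choice(200, 40, replace=False), "y"] = np.nan

observed = df.dropna()
model = LinearRegression().fit(observed[["x"]], observed["y"])
residual_sd = np.std(observed["y"] - model.predict(observed[["x"]]), ddof=2)

missing = df["y"].isna()
noise = rng.normal(0.0, residual_sd, missing.sum())
df.loc[missing, "y"] = model.predict(df.loc[missing, ["x"]]) + noise
```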
Multiple imputation
In order to deal with the problem of increased noise due to imputation, Rubin (1987) developed a method for averaging the outcomes across multiple imputed data sets to account for this. All multiple imputation methods follow three steps.
- Imputation - Similar to single imputation, missing values are imputed. However, the imputed values are drawn m times from a distribution rather than just once. At the end of this step, there should be m completed data sets.
- Analysis - Each of the m data sets is analyzed. At the end of this step there should be m analyses.
- Pooling - The m results are consolidated into one result by calculating the mean, variance, and confidence interval of the variable of concern (a numerical sketch of this pooling step follows the list).
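A minimal numerical sketch of the pooling step, assuming the m analyses have already produced a point estimate and a squared standard error each; the combining rules (mean of the estimates, within- plus between-imputation variance) follow Rubin's approach, and the numbers are invented.

```python
import numpy as np

estimates = np.array([2.1, 1.9, 2.3, 2.0, 2.2])        # m = 5 estimates of one parameter
within_var = np.array([0.30, 0.28, 0.33, 0.31, 0.29])  # squared standard errors per analysis

m = len(estimates)
pooled_estimate = estimates.mean()          # pooled point estimate
w = within_var.mean()                       # average within-imputation variance
b = estimates.var(ddof=1)                   # between-imputation variance
total_var = w + (1 + 1 / m) * b             # total variance of the pooled estimate
pooled_se = np.sqrt(total_var)
print(pooled_estimate, pooled_se)
```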
Just as there are multiple methods of single imputation, there are multiple methods of multiple imputation as well. One advantage that multiple imputation has over single imputation and complete case methods is that multiple imputation is flexible and can be used in a wide variety of scenarios. Multiple imputation can be used in cases where the data are missing completely at random, missing at random, and even when the data are missing not at random. However, the primary method of multiple imputation is multiple imputation by chained equations (MICE). It is also known as "fully conditional specification" and "sequential regression multiple imputation." An important point to note is that MICE should be implemented only when the missing data follow the missing at random mechanism.
As alluded to in the previous section, single imputation does not take into account the uncertainty in the imputations. After imputation, the data are treated as if they were the actual real values in single imputation. Neglecting the uncertainty in the imputation can and will lead to overly precise results and errors in any conclusions drawn. By imputing multiple times, multiple imputation accounts for the uncertainty and range of values that the true value could have taken.
In addition, while single imputation and complete case analysis are easier to implement, multiple imputation is not very difficult to implement either. There is a wide range of packages in different statistical software that readily allow someone to perform multiple imputation. For example, the MICE package allows users in R to perform multiple imputation using the MICE method.
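The article points to the mice package in R; as a hedged illustration in Python, the sketch below uses scikit-learn's experimental IterativeImputer, which implements a comparable chained-equations scheme, on invented data. Running it m times with sample_posterior=True and different seeds produces m differently imputed data sets that can then be analyzed and pooled as described above.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(5)
n = 300
df = pd.DataFrame({"a": rng.normal(0, 1, n)})
df["b"] = 0.5 * df["a"] + rng.normal(0, 1, n)
df["c"] = df["a"] - df["b"] + rng.normal(0, 1, n)
df.loc[rng.choice(n, 60, replace=False), "b"] = np.nan  # b is missing at random given a

m = 5
imputed_sets = []
for seed in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    imputed_sets.append(completed)  # analyze each completed data set, then pool
```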
See also
- Bootstrapping (statistics)
- Censoring (statistics)
- Geo-imputation
- Interpolation
- Expectation-maximization algorithm
References
External links
- Missing Data: Instrument-Level Heffalumps and Item-Level Woozles
- Multiple-imputation.com
- Some questions about imputation, Penn State U
- A description of hot deck imputation from Statistics Finland.
- Paper extending the Rao-Shao approach and discussing problems with multiple imputation.
- Paper on the Fuzzy Unordered Rule Induction Algorithm used as a missing value imputation method for K-means clustering on real cardiovascular data.
- Real-world application of imputation by the UK Office for National Statistics