R & hadoop: September 2014

Hey all, what tools can one use on data that has so many missing values?

Dorcas Kareithi Mathematical Modelling Intern at ICIPE - International Centre of Insect Physiology and Ecology

Comments

David Reinke

Associate Planner at Kittelson & Associates, Inc.

What do you mean by "so many"? If you're talking about more than half the data set, then you may be out of luck. If you're talking in the range up to 20% missing, there are a number of good references on dealing with missing data. The classic is Little & Rubin, Statistical Analysis with Missing Data (2nd ed.). There's a new book out by van Buuren, Flexible Imputation of Missing Data. I suggest reading up first on the different types of missing data mechanisms; if the data are missing completely at random, you're pretty safe. Dealing with other missing data mechanisms will take a bit more caution. The book by McLachlan and Krishnan on the EM algorithm may also be worth a look, but you can find information on the EM algorithm in a lot of books and papers these days. Good luck!
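
A minimal sketch of why the missing data mechanism matters, using simulated values only. The distribution, the 20% and 60% drop rates, and the 55.0 cut-off are illustrative assumptions, not figures from any real data set. It compares the mean of the observed values when data are missing completely at random with the mean when larger values are more likely to be missing:

```python
import random
import statistics

def simulate(n=20000, seed=0):
    """Return a complete column plus two copies with values knocked out."""
    rng = random.Random(seed)
    values = [rng.gauss(50.0, 10.0) for _ in range(n)]
    # Missing completely at random: every value has the same 20% chance of being dropped.
    mcar = [None if rng.random() < 0.20 else v for v in values]
    # Not at random (assumed mechanism): values above 55 are dropped 60% of the time.
    mnar = [None if (v > 55.0 and rng.random() < 0.60) else v for v in values]
    return values, mcar, mnar

def missing_fraction(column):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in column) / len(column)

def observed_mean(column):
    """Mean of the entries that were actually observed."""
    return statistics.fmean(v for v in column if v is not None)

values, mcar, mnar = simulate()
print("true mean:            ", round(statistics.fmean(values), 2))
print("MCAR observed mean:   ", round(observed_mean(mcar), 2))
print("non-random obs. mean: ", round(observed_mean(mnar), 2))
print("MCAR missing fraction:", round(missing_fraction(mcar), 3))
```

Under the completely-at-random mechanism the observed mean should stay close to the true mean, while under the value-dependent mechanism it should be pulled downward, which is why simply dropping incomplete rows is only safe in the first case.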
- 1 day ago
Mark Powell

Consultant - Rescuer of Doomed Projects; Solver of Impossible Problems; Inspired by Sharing How to Do It All

Dorcas,

Gelman et al, "Bayesian Data Analysis," also has a great chapter on missing data, and will be consistent with the references David recommended.

Mark Powell
- 18 hours ago
Dorcas Kareithi

Mathematical Modelling Intern at ICIPE - International Centre of Insect Physiology and Ecology

Thank you all for the responses. David and Mark, where can I get the books?
- 17 hours ago
Mark Powell

Consultant - Rescuer of Doomed Projects; Solver of Impossible Problems; Inspired by Sharing How to Do It All

Dorcas,

I usually just go to Amazon.com. All of the references suggested by me and David are available there.

Mark Powell
- 17 hours ago
Marx Dambaza

Lecturer at Midlands State University

For missing data treatment, especially for credit scoring purposes, I would suggest Bayesian inference with missing data using the Bound and Collapse method for nonignorable missing data mechanisms. Bound and Collapse is a deterministic imputation method based on the Dirichlet probability distribution of a multinomial random variable. It was suggested by Ramoni and Sebastiani (1998) and applied to credit scoring of SMEs by Chen and Astebro (2003), where it proved very handy. Maximum likelihood can also do the trick as long as the missingness can be modeled.
- 16 hours ago
Tymoteusz Wołodźko

Research and Data Analytics Specialist

In my opinion the book by Gelman et al. is worth considering. Another book, by Gelman and Hill on regression (http://www.stat.columbia.edu/~gelman/arm/), also has a chapter on missing data, which you can find online:
http://www.stat.columbia.edu/~gelman/arm/missing.pdf

You could also check "The BUGS Book" by Lunn et al. (http://www.amazon.co.uk/BUGS-Book-Practical-Introduction-Statistical/dp/1584888490/); it also gives some hints on handling missing data in a Bayesian approach.
- 13 hours ago
David Reinke

Associate Planner at Kittelson & Associates, Inc.

The latest edition of Gelman et al. is worth buying in its own right: the best book on Bayesian data analysis is now even better. If your book budget is limited and you want to know how to handle missing data, I recommend Gelman et al. first (it covers more than just missing data), then van Buuren (the most up-to-date book, with more complete coverage of missing data than Gelman et al.), with Little & Rubin coming in third.
- 12 hours ago
Mark Powell

Consultant - Rescuer of Doomed Projects; Solver of Impossible Problems; Inspired by Sharing How to Do It All

David,

You made a comment above: "If you're talking about more than half the data set, then you may be out of luck."

It really depends on what the data are and how the missing data are "missing." You can get good usable estimates with 100% of the data missing if you have the right kind of data and the right kind of "missing." It is all in the formulation of the likelihood.
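
One situation where this can hold, offered as an assumed example rather than the specific case meant above: every unit is inspected once, so no exact failure time is ever recorded, only whether the unit had already failed. A minimal sketch, assuming exponential lifetimes, a single shared inspection time, and simulated data (the rate 0.5 and inspection time 2.0 are made up):

```python
import math
import random

def simulate_inspections(n, true_rate, inspect_at, seed=1):
    """Return one flag per unit: True if it had failed by the inspection time.
    The exact failure times are generated but never kept."""
    rng = random.Random(seed)
    return [rng.expovariate(true_rate) <= inspect_at for _ in range(n)]

def log_likelihood(rate, failed_flags, inspect_at):
    """Log-likelihood built only from the failed / not-failed flags."""
    p_failed = 1.0 - math.exp(-rate * inspect_at)
    k = sum(failed_flags)
    n = len(failed_flags)
    return k * math.log(p_failed) + (n - k) * math.log(1.0 - p_failed)

def mle_rate(failed_flags, inspect_at):
    """Closed-form maximiser of log_likelihood for a shared inspection time."""
    k = sum(failed_flags)
    n = len(failed_flags)
    if k == n:
        raise ValueError("every unit failed; the likelihood has no finite maximum")
    return -math.log(1.0 - k / n) / inspect_at

flags = simulate_inspections(n=5000, true_rate=0.5, inspect_at=2.0)
estimate = mle_rate(flags, inspect_at=2.0)
print(f"estimated rate: {estimate:.3f} (simulated with 0.5)")
```

The estimate comes entirely from the way the likelihood is written for the flags; no individual lifetime is ever observed.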

Mark Powell
- 10 hours ago
Mark Powell

Consultant - Rescuer of Doomed Projects; Solver of Impossible Problems; Inspired by Sharing How to Do It All

David,

Let me correct myself: you can always do a prior predictive estimate without any data at all, even without any missing data.
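
A prior predictive estimate can be sketched by drawing a parameter from the prior and then drawing an outcome given that parameter, with no observed data involved. The Beta(2, 2) prior on a success probability and the 10 trials per draw are illustrative assumptions:

```python
import random
import statistics

def prior_predictive_draws(n_draws, n_trials, alpha, beta, seed=2):
    """Simulate success counts using only the prior, with no observed data."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Step 1: draw the unknown success probability from its prior.
        p = rng.betavariate(alpha, beta)
        # Step 2: draw a binomial outcome given that probability.
        successes = sum(rng.random() < p for _ in range(n_trials))
        draws.append(successes)
    return draws

draws = prior_predictive_draws(n_draws=20000, n_trials=10, alpha=2.0, beta=2.0)
print("prior predictive mean:", statistics.fmean(draws))
```

With this prior the predictive mean should land near 10 * 2 / (2 + 2) = 5 successes.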

Mark Powell
- 10 hours ago
Thomas Riebenbauer

Scientist at JOANNEUM RESEARCH Forschungsgesellschaft mbH

I like the R package VIM: Visualization and Imputation of Missing Values (http://cran.r-project.org/web/packages/VIM/index.html). The graphics in particular can really help in understanding the structure of the missing values. The package also comes with an optional graphical user interface.

There are also some good manuals and explanations on the web (e.g. http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2014/mtg1/Topic_5_Austria.pdf).
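
The idea of looking at the structure of missing values can be sketched without any plotting by counting which combinations of columns are missing together. This is plain Python, not the VIM API, and the rows and column names below are made up for illustration:

```python
from collections import Counter

def missing_patterns(rows, columns):
    """Count how often each combination of missing columns occurs."""
    patterns = Counter()
    for row in rows:
        # A pattern is the tuple of column names that are missing in this row.
        missing = tuple(c for c in columns if row.get(c) is None)
        patterns[missing] += 1
    return patterns

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "score": 0.7},
    {"age": None, "income": 48000, "score": 0.4},
    {"age": 51, "income": None, "score": None},
    {"age": 29, "income": None, "score": None},
    {"age": 45, "income": 61000, "score": 0.9},
]

patterns = missing_patterns(rows, ["age", "income", "score"])
for pattern, count in patterns.most_common():
    label = ", ".join(pattern) if pattern else "(complete)"
    print(f"{count} row(s) missing: {label}")
```

Columns that tend to go missing together, such as income and score in these made-up rows, are a hint that the missingness is not completely at random.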
- 10 hours ago
Tymoteusz Wołodźko

Research and Data Analytics Specialist

Btw, do any of you have any sources to recommend on missing (at random) data in directed social networks? Thanks.

Friday 12 September 2014
