Friday, 12 September 2014

What tools can one use on data that has so many missing values?

Hey all, what tools can one use on data that has so many missing values?

Mathematical Modelling Intern at ICIPE - International Centre of Insect Physiology and Ecology

Comments

  • David
    Associate Planner at Kittelson & Associates, Inc.
    What do you mean by "so many"? If you're talking about more than half the data set, then you may be out of luck. If you're talking in the range up to 20% missing, there are a number of good references on dealing with missing data. The classic is Little & Rubin, Statistical Analysis with Missing Data (2nd ed.). There's a new book out by van Buuren, Flexible Imputation of Missing Data. I suggest reading up first on the different types of missing data mechanisms; if the data are missing completely at random, you're pretty safe. Dealing with other missing data mechanisms will take a bit more caution. The book by McLachlan and Krishnan on the EM algorithm may also be worth a look, but you can find information on the EM algorithm in a lot of books and papers these days. Good luck!
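
To make the point about missing data mechanisms concrete, here is a minimal simulation sketch in Python (standard library only). It is not taken from any of the books mentioned; the variable names, the 30% and 60% missingness rates, and the linear relation between x and y are all assumptions chosen for illustration. It compares the complete-case mean of y when values are missing completely at random (MCAR) with the case where missingness depends on an observed covariate x (MAR).

```python
import random
import statistics


def simulate(n=20000, seed=0):
    """Compare complete-case means of y under two missingness mechanisms."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # y depends on x, so missingness driven by x distorts the observed y values
    y = [xi + rng.gauss(0.0, 1.0) for xi in x]

    # MCAR: every y value has the same 30% chance of being missing
    y_mcar = [yi for yi in y if rng.random() > 0.3]

    # MAR: y is missing with probability 0.6 when x > 0, never otherwise
    # (missingness depends only on the fully observed x)
    y_mar = [yi for xi, yi in zip(x, y) if not (xi > 0 and rng.random() < 0.6)]

    return {
        "full": statistics.fmean(y),
        "mcar": statistics.fmean(y_mcar),
        "mar": statistics.fmean(y_mar),
    }


result = simulate()
print(result)
```

Under these assumptions the MCAR complete-case mean should stay close to the full-data mean, while the MAR complete-case mean is pulled downward because high-x (and therefore high-y) cases are dropped more often. That is the situation where imputation or a likelihood-based method that uses x is needed instead of simply deleting incomplete rows.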
  • Mark Powell
    Consultant - Rescuer of Doomed Projects; Solver of Impossible Problems; Inspired by Sharing How to Do It All
    Dorcas,

    Gelman et al., "Bayesian Data Analysis," also has a great chapter on missing data, and it is consistent with the references David recommended.

    Mark Powell
  • Dorcas Kareithi
    Mathematical Modelling Intern at ICIPE - International Centre of Insect Physiology and Ecology
    Thank you all for the responses. David and Mark, where can I get these books?
  • Mark Powell
    Consultant - Rescuer of Doomed Projects; Solver of Impossible Problems; Inspired by Sharing How to Do It All
    Dorcas,

    I usually just go to Amazon.com. All of the references suggested by David and me are available there.

    Mark Powell
  • Marx
    Lecturer at Midlands State University
    For missing data treatment, especially for credit scoring purposes, I would suggest Bayesian inference with missing data using the Bound and Collapse method for nonignorable missing data mechanisms. Bound and Collapse is a deterministic imputation method based on the Dirichlet probability distribution of a multinomial random variable. It was suggested by Ramoni and Sebastiani (1998), applied to credit scoring of SMEs by Chen and Astebro (2003), and proved very handy. Maximum likelihood can also do the trick, as long as the missingness can be modeled.
  • Tymoteusz Wołodźko
    Research and Data Analytics Specialist
    In my opinion, the book by Gelman et al. is worth considering. Another book, by Gelman and Hill on regression (http://www.stat.columbia.edu/~gelman/arm/), also has a chapter on missing data, which you can find online:
    http://www.stat.columbia.edu/~gelman/arm/missing.pdf

    You could also check "The BUGS Book" by Lunn et al. ( http://www.amazon.co.uk/BUGS-Book-Practical-Introduction-Statistical/dp/1584888490/ ); it also gives some hints on handling missing data in a Bayesian approach.
  • David
    Associate Planner at Kittelson & Associates, Inc.
    The latest edition of Gelman et al. is worth buying in its own right: the best book on Bayesian data analysis is now even better. If your book budget is limited and you want to know how to handle missing data, I recommend Gelman et al. first (it covers more than just missing data), then van Buuren (the most up-to-date and more complete coverage of missing data than Gelman et al.), with Little & Rubin coming in third.
  • Mark Powell
    Consultant - Rescuer of Doomed Projects; Solver of Impossible Problems; Inspired by Sharing How to Do It All
    David,

    You made a comment above: "If you're talking about more than half the data set, then you may be out of luck."

    It really depends on what the data are and how the missing data are "missing." You can get good usable estimates with 100% of the data missing if you have the right kind of data and the right kind of "missing." It is all in the formulation of the likelihood.

    Mark Powell
  • Mark Powell
    Consultant - Rescuer of Doomed Projects; Solver of Impossible Problems; Inspired by Sharing How to Do It All
    David,

    Let me correct myself: you can always do a prior predictive estimate without any data at all, even without any missing data.

    Mark Powell
  • Thomas Riebenbauer
    Scientist at JOANNEUM RESEARCH Forschungsgesellschaft mbH
    I like the R package VIM: Visualization and Imputation of Missing Values (http://cran.r-project.org/web/packages/VIM/index.html). The graphics in particular can really help in understanding the structure of the missing values. The package also comes with an optional graphical user interface.

    There are also some good manuals and explanations on the web (e.g. http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2014/mtg1/Topic_5_Austria.pdf).
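
For readers without R at hand, the idea of inspecting the structure of missing values can be sketched in plain Python. This is not the VIM API and does not reproduce its graphics; the function names `missing_patterns` and `missing_share` and the toy records are hypothetical, made up for this illustration. The sketch simply tabulates which combinations of variables tend to be missing together and what share of each variable is missing.

```python
from collections import Counter


def missing_patterns(rows, columns):
    """Count how often each combination of missing columns occurs."""
    patterns = Counter()
    for row in rows:
        # A pattern is the tuple of column names that are missing in this row
        missing = tuple(c for c in columns if row.get(c) is None)
        patterns[missing] += 1
    return patterns


def missing_share(rows, columns):
    """Fraction of rows in which each column is missing."""
    return {c: sum(row.get(c) is None for row in rows) / len(rows) for c in columns}


# Hypothetical toy records; None marks a missing value
rows = [
    {"age": 34, "income": 52000, "score": 0.7},
    {"age": None, "income": 48000, "score": 0.4},
    {"age": 29, "income": None, "score": None},
    {"age": 41, "income": None, "score": None},
    {"age": 55, "income": 61000, "score": 0.9},
]
columns = ["age", "income", "score"]

patterns = missing_patterns(rows, columns)
shares = missing_share(rows, columns)

for pattern, count in patterns.most_common():
    label = ", ".join(pattern) if pattern else "(complete)"
    print(f"{count:3d}  missing: {label}")
print(shares)
```

In this toy data, income and score are always missing together, which is the kind of structure that hints the values are not missing completely at random and is worth knowing before choosing an imputation method.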
  • Tymoteusz Wołodźko
    Research and Data Analytics Specialist
    Btw, do any of you have any sources to recommend on missing (at random) data in directed social networks? Thanks.