Friday, October 26, 2012

New Paper on General Information Metrics for Experiment Planning

As part of our ongoing project on "turning the scientific method into math", Marc Harper and I have written a paper on expectation potential information as the key measure of information yield from a proposed experiment. Take a look at the paper; we are eager for feedback (e.g. add a comment on this post). The basic idea is:
  • empirical information (\(I_e\)) measures prediction power on observables.
  • potential information (\(I_p\)) measures the maximum additional prediction power possible for a given set of observables, relative to the current model; in other words, the theoretical increase in empirical information achievable by the best possible model. The key point is that \(I_p\) can be estimated without searching model space at all. The value of any experiment is its ability to surprise us, i.e. to demonstrate that our current model is inadequate; potential information provides a general measure of this, so the value of an experimental dataset is simply its potential information. For more details on this previous work, see here.
  • expectation potential information (\(E(I_p)\)) forecasts the expected information value of a proposed experiment, under our current beliefs (i.e. uncertainty) about its likely outcomes. Adopting the view that our "current model" is always a mixture of competing models, the \(E(I_p)\) of a proposed experiment measures its ability to resolve major uncertainties in that mixture (a rough computational sketch of all three metrics follows this list).
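The paper gives the precise definitions; here is a minimal Python sketch of how these quantities might be computed, under forms I am assuming from the descriptions above (\(I_e\) as average log-likelihood gain over a null model, \(I_p\) as the KL divergence of the empirical outcome distribution from the model's predictions, and \(E(I_p)\) as the mixture-weighted forecast of \(I_p\)). The function names are mine, not the paper's:

```python
import numpy as np

def empirical_information(counts, p_model, p_null):
    # I_e (assumed form): average log-likelihood gain, in bits, of the
    # current model over a null model on the observed outcome counts.
    p_hat = counts / counts.sum()
    return np.sum(p_hat * (np.log2(p_model) - np.log2(p_null)))

def potential_information(counts, p_model):
    # I_p (assumed form): KL divergence of the empirical outcome
    # distribution from the model's predictions. This is the headroom
    # left for a better model, and it needs no search of model space:
    # the best conceivable model just predicts the empirical
    # distribution itself.
    p_hat = counts / counts.sum()
    mask = p_hat > 0
    return np.sum(p_hat[mask] * np.log2(p_hat[mask] / p_model[mask]))

def expectation_potential_information(p_mix, likelihoods):
    # E(I_p) (assumed form): forecast of I_p for a proposed experiment,
    # weighted by our current beliefs. p_mix[k] is the belief in
    # competing model k; likelihoods[k] is that model's predicted
    # outcome distribution. The mixture is what we currently predict.
    p_pred = p_mix @ likelihoods
    e_ip = 0.0
    for w, p_k in zip(p_mix, likelihoods):
        mask = p_k > 0
        e_ip += w * np.sum(p_k[mask] * np.log2(p_k[mask] / p_pred[mask]))
    return e_ip

# Example: two competing models over three outcomes, equal belief.
p_mix = np.array([0.5, 0.5])
likelihoods = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])
print(expectation_potential_information(p_mix, likelihoods))
# ≈ 0.33 bits > 0: the models disagree, so the experiment can surprise us.
```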
Two things about this work have been fun:
  • we used an interesting "test problem", RoboMendel: a robot scientist tasked with proposing experiments to discover the laws of genetics. It's been fun working through how the basic \(E(I_p)\) metric addresses not only the fine details of experiment planning (e.g. the value of including a specific control) but also big questions like "what should we look at?"
  • Note that all these metrics are defined strictly in terms of prediction power on observable variables, contrary to the usual focus in statistical inference on our ability to infer hidden variables. Yet the \(E(I_p)\) metric comes full circle: you can prove that as the mixture probabilities converge to the true marginal probabilities of the possible "outcomes", the expectation potential information metric converges \(E(I_p) \to I(X;\Omega)\), i.e. the classic information theory metric of how "informative" the observable \(X\) is about the true hidden state of the system \(\Omega\) (see the toy numerical check below).
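To see the convergence claim concretely, here is a toy numerical check (my own construction, not from the paper). It assumes the \(I_p\) of an experiment whose true hidden state is \(\omega\) is the KL divergence of \(p(x|\omega)\) from the mixture prediction; once the mixture weights equal the true marginal \(p(\omega)\), the weighted sum is algebraically identical to the mutual information \(I(X;\Omega)\):

```python
import numpy as np

# Two hidden states, three possible outcomes.
p_omega = np.array([0.3, 0.7])            # true marginal over hidden states
likelihoods = np.array([[0.7, 0.2, 0.1],  # p(x | omega=0)
                        [0.1, 0.3, 0.6]]) # p(x | omega=1)

p_x = p_omega @ likelihoods               # mixture prediction p(x)

# E(I_p) with the mixture weights set to the true marginals:
e_ip = sum(w * np.sum(p_k * np.log2(p_k / p_x))
           for w, p_k in zip(p_omega, likelihoods))

# Classic mutual information I(X; Omega), computed from the joint:
joint = p_omega[:, None] * likelihoods
mi = np.sum(joint * np.log2(joint / (p_omega[:, None] * p_x[None, :])))

print(e_ip, mi)  # both ≈ 0.289 bits: E(I_p) = I(X; Omega) here
```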

Bioinformatics Learning 2.0: proposing an open source consortium for bioinformatics teaching materials

I've spent almost all of my time over the last year transforming how I teach, replacing lectures with Socratic active learning, in which students answer questions in class. That has meant both developing software tools for this approach and teaching two different courses this way (an Intro Bioinformatics theory course for computer science students, and a Genomics and Computational Biology course for biology students). I have made all of these tools and materials available as open source, because I believe this job is too big for one person, even for a single course; in particular, students need hundreds of exercises to learn how to actually use the concepts. Rather than each instructor "reinventing the wheel" (producing all their own course materials despite the fact that dozens of other courses overlap the same material), we should share materials as open source, so that each instructor can use the best materials from everyone else and focus their own efforts on areas where they have special interest or expertise, which in turn they contribute back to the community.

I was invited to speak about all this at RECOMB-BE this year; here's a link to a video of my talk if you're interested.