Friday, October 26, 2012

New Paper on General Information Metrics for Experiment Planning

As part of our ongoing project on "turning the scientific method into math", Marc Harper and I have written a paper on expectation potential information as the key measure of information yield from a proposed experiment. Take a look at the paper; we are eager for feedback (e.g. add a comment on this post). The basic idea is:
  • empirical information (\(I_e\)) measures prediction power on observables.
  • potential information (\(I_p\)) measures the maximum additional prediction power possible for a given set of observables, relative to the current model. In other words, it is the theoretical increase in empirical information achievable by the best possible model. The key point is that \(I_p\) can be estimated without searching model space at all. The value of any experiment is its ability to surprise us, i.e. to demonstrate that our current model is inadequate. Potential information provides a general measure of this, so the value of an experimental dataset is simply its potential information measure. For more details on this previous work, see here.
  • expectation potential information (\(E(I_p)\)) forecasts the expected information value of an experiment, under our current beliefs (uncertainty) about its likely outcomes. That is, adopting the view that our "current model" is always a mix of competing models, the \(E(I_p)\) for a proposed experiment measures its ability to resolve major uncertainties in that mixture.
Two things about this work have been fun:
  • We used an interesting "test problem", RoboMendel: a robot scientist tasked with proposing experiments to discover the laws of genetics. It's been fun working through how the basic \(E(I_p)\) metric addresses not only the fine details of experiment planning (e.g. the value of including a specific control) but also the big question of "what should we look at?"
  • All these metrics are defined strictly in terms of prediction power on observable variables, in contrast to the usual focus in statistical inference on our ability to infer hidden variables. Yet the \(E(I_p)\) metric comes full circle: you can prove that as the mixture probabilities converge to the true marginal probabilities of the possible "outcomes", the expectation potential information converges to the mutual information, \(E(I_p) \to I(X;\Omega)\), i.e. the classic information theory measure of how "informative" the observable \(X\) is about the true hidden state of the system \(\Omega\).
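To make the limiting quantity concrete, here is a minimal Python sketch that computes \(I(X;\Omega)\) for a small discrete example. It is not the \(E(I_p)\) estimator from the paper; it only evaluates the standard mutual information, written as the expected divergence between each competing model's prediction and the mixture's prediction. The function name, the model labels and all the probabilities are made up for illustration.

```python
import math


def mutual_information(prior, likelihoods):
    """Return I(X; Omega) in bits for discrete distributions.

    prior:       dict mapping each hidden state omega -> p(omega)
    likelihoods: dict mapping each omega -> dict of outcome x -> p(x | omega)
    """
    # Mixture prediction: p(x) = sum over omega of p(omega) * p(x | omega)
    marginal = {}
    for omega, p_omega in prior.items():
        for x, p_x in likelihoods[omega].items():
            marginal[x] = marginal.get(x, 0.0) + p_omega * p_x

    # Expected KL divergence of each model's prediction from the mixture
    info = 0.0
    for omega, p_omega in prior.items():
        for x, p_x in likelihoods[omega].items():
            if p_x > 0.0:  # terms with p(x | omega) = 0 contribute nothing
                info += p_omega * p_x * math.log2(p_x / marginal[x])
    return info


# Hypothetical belief state: two competing models, equally likely.
prior = {"model_1": 0.5, "model_2": 0.5}

# A proposed experiment on which the two models make opposite predictions.
decisive = {
    "model_1": {"a": 1.0, "b": 0.0},
    "model_2": {"a": 0.0, "b": 1.0},
}

# A proposed experiment on which the two models predict exactly the same thing.
uninformative = {
    "model_1": {"a": 0.5, "b": 0.5},
    "model_2": {"a": 0.5, "b": 0.5},
}

print(mutual_information(prior, decisive))       # resolves the full 1 bit of uncertainty
print(mutual_information(prior, uninformative))  # resolves nothing
```

The first experiment is worth one full bit because its outcome tells us which model is right; the second is worth zero bits because no outcome could change our beliefs. Experiments whose predictions only partly overlap fall somewhere in between.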

Bioinformatics Learning 2.0: proposing an open source consortium for bioinformatics teaching materials

I've spent almost all of my time over the last year transforming how I teach (replacing lectures with Socratic active learning, where students answer questions in class), both by developing software tools for this and by teaching two different courses this way (an Intro Bioinformatics theory course for computer science students, and a Genomics and Computational Biology course for biology students). I have made all of these tools and materials available as open source, because I believe this job (even for a single course) is always too big for one person, particularly given the need for hundreds of exercises through which students learn how to actually use the concepts. Rather than each instructor "reinventing the wheel" (producing all their own course materials despite the fact that dozens of other courses cover the same material), we should share materials as open source, so each instructor can use the best materials from everyone else and focus their own efforts on areas where they have special interest or expertise (which in turn they contribute back to the community).

I was invited to speak about all this at RECOMB-BE this year; here's a link to a video of my talk if you're interested.

Wednesday, August 22, 2012

Intro Bioinfo Theory Course Example Release 1


What is the purpose of this release?

It illustrates the kind of content I am releasing as open source course materials, to bootstrap an open source bioinformatics teaching materials consortium where instructors can selectively use, modify and share materials for their own teaching. If you're interested in using these materials or participating in such a consortium, I invite you to contribute your thoughts or feedback in the Comments section below, or by email to

What is this?

This is a snapshot of the reading, lectures, homework, projects, practice exams, and exams from my 2011 Bioinformatics Theory course offered separately as a CS undergrad course and Bioinformatics graduate course (different exams; separate graduate term project). The course uses a core set of simple genetics, sequence analysis and phylogeny problems to teach fundamental principles that arise in virtually all bioinformatics problems. This course is not for students who want to learn to use existing methods (e.g. BLAST) but rather for students who might in the future want to invent new bioinformatics analyses. It emphasizes statistical inference, graph models and computational complexity.

Note: this is not a standard lecture course; approximately half the class time was devoted to in-class concept tests, where the class was presented with a question that tests conceptual understanding. Students answered concept tests using an open-response (i.e. not multiple choice) in-class response system by typing answers on their laptops or smartphones. We then discussed our answers using Peer Instruction methods, and I analyzed all the individual students' answers in detail; at the subsequent class, I went through each of the conceptual errors the students had made on each question. I have written approximately 200 concept tests for a wide range of statistical and computational concepts relevant to bioinformatics, and a wide variety of "problems-to-work" (i.e. more conventional homework problems) covering the same material. I am making all of these materials and software available as open source; this is the first step in that release process.