Google is not a proxy for big data. Ah-ah-achoo!!

Google is a great source of data for Google. Twitter is a great source of data for Twitter. Facebook is a great source of data for Facebook. Remember: more data means more skepticism, not less. Correlation doesn’t imply causation.

Gigaom

Another year, another report about how inaccurate the Google Flu Trends predictions turned out to be for the previous year. And more warnings about the dangers of relying on Google, and therefore “big data” and algorithms, for important stuff.

Repeat after me: Google is not a proxy for big data. It also isn’t supposed to replace the Centers for Disease Control. Even Google wouldn’t make that claim.

I made the same argument in more detail when this concern popped up last year, but here it is in a nutshell: “Big data” isn’t the enemy, it’s a friend. So are algorithms. But they must be used correctly.

Google is a great source of data for Google. Twitter is a great source of data for Twitter. Facebook is a great source of data for Facebook. For everyone else, they’re just additional sources of data of varying value depending on what’s being…


Cancer, machine learning and data integration

Machine learning to fight cancer.

Follow the Data

Machine Learning Methods in the Computational Biology of Cancer is an arXiv preprint of a pretty nice article covering analysis methods that can be applied to high-dimensional biological (and other) data. Although the examples come from cancer research, they could easily be about something else. The paper does a good job of describing penalized regression methods such as lasso, ridge regression and elastic net. It also goes into compressed sensing and its applicability to biology, while cautioning that it cannot yet be straightforwardly applied to biological data. This is because compressed sensing assumes that one can choose the “measurement matrix” freely, whereas in biology that matrix (usually called the “design matrix” in this context) is already fixed by the experiment.
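To make the three penalties a bit more concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the function names, the toy data and the coordinate-descent solver are all assumptions chosen for illustration. The objective assumed here is (1/2n)·||y − Xw||² + alpha·(l1_ratio·||w||₁ + 0.5·(1 − l1_ratio)·||w||²), with no intercept, so l1_ratio=1 gives lasso, l1_ratio=0 gives ridge, and anything in between gives elastic net.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_cd(X, y, alpha, l1_ratio, n_iter=200):
    """Cyclic coordinate descent for the (assumed) elastic net objective.

    X is a list of rows, y a list of targets. No intercept is fitted.
    Hypothetical helper written for this example, not a library API.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            denom = z + alpha * (1.0 - l1_ratio)  # ridge part enlarges the denominator
            # lasso part soft-thresholds the numerator, which can zero a coefficient
            w[j] = soft_threshold(rho, alpha * l1_ratio) / denom if denom > 0 else 0.0
    return w


# Toy data (made up): y depends only on the first feature, y = 3 * x1.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [3.0, 0.0, 3.0, 6.0]

w_ols = elastic_net_cd(X, y, alpha=0.0, l1_ratio=1.0)    # no penalty
w_lasso = elastic_net_cd(X, y, alpha=5.0, l1_ratio=1.0)  # heavy L1 penalty
w_ridge = elastic_net_cd(X, y, alpha=1.0, l1_ratio=0.0)  # L2 penalty only
w_enet = elastic_net_cd(X, y, alpha=1.0, l1_ratio=0.5)   # mix of both

print("unpenalized:", w_ols)
print("lasso:      ", w_lasso)
print("ridge:      ", w_ridge)
print("elastic net:", w_enet)
```

On this toy data the heavy lasso penalty drives both coefficients to exactly zero, while ridge only shrinks them. That difference is the usual reason lasso-type penalties are attractive when there are far more features than samples, as is typical for genomic data.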

The Critical Assessment of Massive Data Analysis (CAMDA) 2014 conference has released its data analysis challenges. Last year’s challenges on toxicogenomics and toxicity prediction will be reprised (perhaps in…
