Tuesday, November 18, 2008
Gasta Search Statistics
Search engine data is the most relevant on the net
The net reveals the ties that bind
Regular columnist Bill Thompson wonders what it will take to get used to living in a networked society.
One of the throwaway remarks I sometimes make at conferences is that "Google knows you're pregnant before you do".
I can say this because the things you search for will change as your life changes, and search engine providers may well be able to spot the significance of these changes because they aggregate data from millions of people.
Now Google's philanthropic arm, google.org, has shown just what it can do with the data it gathers from us all by offering to predict where 'flu outbreaks will take place in the USA.
It has found that "certain search terms are good indicators of flu activity", in that they correlate well with reports from the official Centers for Disease Control and Prevention.
And it claims that "across each of the nine surveillance regions of the United States, we were able to accurately estimate current flu levels one to two weeks faster than published CDC reports", a result that could save people's lives by alerting them to have 'flu vaccinations earlier than they might otherwise have done.
This is a really interesting piece of work and clearly demonstrates the power of data mining. Its potential usefulness is not limited to health matters.
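To make the idea of search terms that "correlate well" with official reports concrete, here is a minimal sketch of how such a comparison might be run. It is not Google's method, which has not been published in detail: the query terms, the weekly figures and the 0.9 threshold are all invented for illustration, and the helper names `pearson` and `good_indicators` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Invented weekly figures, for illustration only (not real CDC or search data).
official_flu_rate = [1.1, 1.3, 1.8, 2.6, 3.9, 4.4, 3.7, 2.5]  # % of doctor visits
query_share = {
    "flu symptoms": [0.8, 1.0, 1.5, 2.4, 3.6, 4.1, 3.3, 2.2],
    "cheap flights": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.1, 2.0],
}


def good_indicators(queries, official, threshold=0.9):
    """Keep only the terms whose weekly share tracks the official series.

    The threshold is an assumption chosen for this example.
    """
    scores = {term: pearson(series, official) for term, series in queries.items()}
    return {term: r for term, r in scores.items() if r >= threshold}


indicators = good_indicators(query_share, official_flu_rate)
for term, r in indicators.items():
    print(f"{term}: r = {r:.2f}")
```

With these made-up numbers only "flu symptoms" passes the threshold. The terms that survive such a test are only useful because query data arrives every day, while the official figures they track are published later.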
Pick and mix
As John Naughton pointed out in The Observer, "everyone I know in business has known for months that the UK is in recession, but it's only lately that the authorities have been in a position to confirm that - because the official data always lag the current reality."
Perhaps the answer lies buried somewhere in the queries being made online, with company directors or politicians searching for terms that imply a coming recession, like details of redundancy pay or bankruptcy protection.
It isn't only Google that can do this, of course. Its database of queries is vast and fast-growing, but it is only one among many databases that underpin the modern world.
The banking system is really only a collection of connected databases recording who has which assets, while neither government nor business could operate without complex data stores.
Soon the national ID register will store details of everyone in the UK, while the forthcoming Communications Data Bill is likely to include proposals to create a vast system that will record details of every e-mail sent, every website surfed and every file downloaded.
As we have seen with Flu Trends, the "interesting" knowledge that can be extracted is sometimes well concealed until comparisons can be made with other sources: what mattered here was the correlation between some search terms and the real-world data.
Of course Google has not revealed which search terms it analysed because doing so would undermine the model's effectiveness.
Unfortunately it is being equally reticent about how it has ensured that the data it uses is properly anonymised so that users cannot be identified on the basis of their queries.
A letter from the Electronic Privacy Information Center (EPIC) and Patient Privacy Rights to Google boss Eric Schmidt has not been answered, leaving those concerned with online privacy uncertain over the broader implications of the project.
But as Cade Metz points out in an insightful article in The Register, we may all be happy to know that a 'flu outbreak is coming, but what happens when the disease involved is more life-threatening and the government asks Google for the names and IP addresses of anyone whose search terms indicate that they are infected?
It's not that I don't trust Google. I don't trust any company, government department or individual without a good reason to do so.
In the case of search engines that claim to protect my privacy I want to know just how they do it and will not accept vague reassurances.
In the case of governments that want to build vast databases, I want strong legal sanctions against their abuse and full disclosure of the technical details.
Those of us living in the west with access to technology and the network have lived through a revolution in the last decade and a half that is as radical in its impact as the industrial revolution, and it has happened a lot faster.
It is hardly surprising that we do not yet know how to operate in a networked world where amazingly detailed data is routinely stored, processed and made available.
We will need to think in new ways, learn to assess risk according to new criteria, and find ways to hold those who have power over us - whether political, social or cultural - accountable in new ways.
The US writer Curt Monash has written about this topic many times over the years, arguing that since we clearly cannot halt the move towards data capture and use we should put legal and regulatory frameworks in place as a matter of urgency.
We have made a start in Europe with data protection legislation, which could be strengthened and reinforced if politicians were willing to make the effort.
But first we need an active press and an engaged population, one that asks hard questions, forces those who want to develop new databases to be accountable and open, and makes the boundaries of acceptable surveillance a matter of public debate.
And perhaps we should ask google.org to start work on "Privacy Trends", hoping to spot privacy disasters before they happen by looking at searches for "compromised data", "hacked database" and "lost USB stick".
Bill Thompson is an independent journalist and regular commentator on the BBC World Service programme Digital Planet.