Facebook and Data Science

April 6, 2018

My favorite topics in teaching, other than Bayesian statistics (“of course”), are interesting applications, ethics and impact on society. One of the things I always do is point out that many of the big technology companies are fundamentally “data” companies selling their consumer data to advertisers. Lots of gnarly ethical issues here. But the huge sleeper issue in all this is medical informatics, where medical research really needs consumer lifestyle data if it wants to make major breakthroughs in the lifestyle diseases that are gradually strangling the Western economies. Even gathering lifestyle data is difficult (think diet, for instance), let alone dealing with the ethics and privacy involved.

Anyway, great piece by Jennifer Duke, “You’re worth $2.54 to Facebook: Care to pay more?”, in the Australian press today (SMH, The Age). We had an insightful 15-minute discussion on the phone on Friday and I managed to get a worthy quote in her article. I was impressed by her broad knowledge of the topics. Good to see our journalists know their stuff.

Reuters has an extensive piece outlining the details and the election influence, “Cambridge Analytica CEO claims influence on U.S. election, Facebook questioned”.

The Conversation seems to be surfing the media hub-bub with a dozen or more articles from the academic community in the last week or so. Here are some that caught my eye.

a novice’s guide to Facebook data harvesting;
some insights on methods: “How Cambridge Analytica’s Facebook targeting model really worked”, interesting because they seem to use matrix factorisation, an area where my group at Monash is leading edge, in the smaller scale versions at least (we don’t scale to “internet” size, yet … unless you want to pay us too ;-));
insight into how Cambridge Analytica got its data, “Cambridge Analytica scandal: legitimate researchers using Facebook data could be collateral damage”; it seems people opted to use a personality app and this harvested some additional data;
discussion of why “Online privacy must improve”, and certainly the world has the technology to do this and the EU, for instance, seems to be pushing that way with laws; it’s all a matter of whether the big social media companies want to play this game;
and the prescient Michael Brand (formerly a professor at Monash with me) beating most of us to it with “Can Facebook influence an election result?” … back in 2016.

Some other background articles are:

An older article in the Huffington Post, “Didn’t Read Facebook’s Fine Print? Here’s Exactly What It Says”, comments on an older terms of service, but a lot still applies.
The Australian Government has fairly strong privacy laws under the Privacy Act and its amendments. This is described in the “Guide to securing personal information”, which has a broad definition of personal information that probably covers most of what Facebook keeps. There is also a special class of information, sensitive information, which includes medical and financial details, silent phone numbers, etc., and requires a higher level of protection.
In 2013 Cambridge researchers Kosinski, Stillwell and Graepel published an article in PNAS (Proc. National Academy of Sciences of the USA) showing “Private traits and attributes are predictable from digital records of human behavior”. If that sounds too technical, the short version is:
If you’re a frequent user, Facebook probably knows your religion, sexual preferences, any serious diseases you might have, and your major personality traits, even if you take great care not to expose them.
Keep in mind this was the best known of a long line of research. When this information is inferred (i.e., predicted using a statistical algorithm) it is called implicit information.
Note Facebook has been relaxing its default privacy settings over the years, according to Matt McKeon’s “The Evolution of Privacy on Facebook”. This makes the job of monetising its users easier.
Note it is not clear what the Privacy Act says about implicit information. Note also that implicit information can be very hard to extract, and requires access to a fuller database to make inferences.
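
To make “implicit information” a little more concrete, here is a minimal sketch of how a hidden attribute can be predicted from likes alone. Everything in it is an assumption for illustration: the tiny like matrix, the made-up binary trait (coded +1/-1), and the function names are hypothetical, and plain least squares on truncated SVD features stands in for whatever model a real system would use. It is not Facebook’s or Cambridge Analytica’s actual method.

```python
import numpy as np

# Toy data (made up): rows are users, columns are pages, 1 means "liked".
LIKES = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])
# Hypothetical hidden trait, known only for these "training" users.
TRAIT = np.array([1, 1, 1, -1, -1, -1])

def fit_trait_model(likes, trait, k=2):
    """Reduce the like matrix with a truncated SVD, then fit a
    least-squares linear scorer for the trait on the reduced features."""
    _, _, vt = np.linalg.svd(likes, full_matrices=False)
    components = vt[:k].T             # (n_pages, k) page "factors"
    features = likes @ components     # (n_users, k) user coordinates
    weights, *_ = np.linalg.lstsq(features, trait, rcond=None)
    return components, weights

def predict_trait(likes, components, weights):
    """Project users onto the page factors and return +1 or -1 per user."""
    scores = (likes @ components) @ weights
    return np.where(scores > 0, 1, -1)

components, weights = fit_trait_model(LIKES, TRAIT, k=2)

# Two users who never disclosed the trait and whose like patterns
# do not appear in the training data.
new_users = np.array([
    [1, 1, 1, 0],
    [0, 1, 1, 1],
])
predictions = predict_trait(new_users, components, weights)
print(predictions)
```

The point of the sketch is only that the new users never state the trait; it is inferred from how their likes line up with those of users for whom it is known, which is also why a fuller database makes the inference easier.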

Personally, I believe online data privacy will evolve in fits and starts, but there are a lot of technical hurdles. Online advertising, for instance, needs to turn around impressions at great speed and doesn’t have time to work through complex APIs, so I suspect advertisers will need the personal data in some form on their own servers. Sounds like a perfect application for cryptosystems to me, if they can be made fast enough. As for data harvesting, well, I expect that will go on forever.

