[Based on my WordCamp 2014 Presentation, Prezi here]
Take a look at this video about Woosh Water’s project. Woosh provides an amazing service: ordinary municipal water fountains are replaced by hi-tech fountains that provide clean water for residents, with a better user experience and less of the dirt and homelessness associated with water fountains. At first glance, the project seems amazing. However, when you inspect the service in depth, some things change.
But it’s not just water, you know. When people interact, either online or offline, they leave crumbs of information behind. For example, the Israeli transit card, “Rav Kav”, allows you to purchase bus passes. However, in order to do so, it shares quite a lot of information with the transit operators, some of which is not really required. For example, it is not clear that the data is erased after use, and there’s really no need for the bus operators to retain your photos or travel history. The same goes for the location data held by cellular operators. While the cellular operator needs your current location to serve your calls, it does not need to retain a history of your locations. However, once it retains this information, others may use it. For example, the Israeli startup “Trendit” receives information from cellular providers to estimate the number of people in a specific venue.
We call this information “Residual Data”.
Now, when you develop an application, you’re eager to store as much information as possible; who knows what you may need it for in the future? This eagerness rests on two wrong assumptions. The first is that people will not misuse the information; we don’t have to look far to find numerous examples of police misuse of information. The second is that statistical and anonymous information, once gathered, is harmless. Reidentifying anonymous information becomes easier as computing power grows.
For me, the problem begins when you retain information: you want people to be able to access the information you retain (if you’re a social network, for example), and you can’t really protect stored information that must always be available. A good example is Yoav Even’s report on how easily medical histories can be obtained in Israel. Mr. Even called to request the medical information of one of his friends. To have the information faxed to his offices, all he had to do was give the friend’s ID number (which is generally available since the Israeli census leak). However, you usually only start to think about privacy once personal information leaks.
Here is how I (usually) work when I help clients design their projects. First, we ask whether we really need this information. This goes for every kind of data: not just names and email addresses, but also information that is considered anonymous but may later be reidentified, such as browsing history, IP addresses or browser identification. Ask yourself why you need it, and whether you can replace it (with a hash or with other information). For example, keeping your users’ email addresses to contact them is great, but keeping their IP addresses for more than 14 days has no actual use.
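To make the "replace it with a hash" and "no more than 14 days" ideas concrete, here is a minimal sketch. The function names, the record layout (`ip`, `seen_at`) and the secret key are my own assumptions for illustration, not part of any specific product. Note that a keyed hash is only pseudonymization: whoever holds the key can still test guesses against it, so treat this as a way to reduce exposure, not as anonymization.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

# Retention window taken from the 14-day figure in the text above.
IP_RETENTION = timedelta(days=14)


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256).

    A plain, unkeyed hash of an IP address is easy to reverse by
    brute force, so this sketch assumes a server-side secret key.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def purge_old_ips(records: list[dict], now: datetime) -> list[dict]:
    """Return copies of the records, dropping 'ip' once it is too old.

    Each record is assumed (hypothetically) to look like
    {"user": ..., "ip": ..., "seen_at": datetime}.
    """
    cleaned = []
    for record in records:
        if now - record["seen_at"] > IP_RETENTION:
            # Keep the rest of the record, but forget the IP address.
            record = {k: v for k, v in record.items() if k != "ip"}
        cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    log = [
        {"user": "a", "ip": "203.0.113.7", "seen_at": now - timedelta(days=20)},
        {"user": "b", "ip": "203.0.113.8", "seen_at": now - timedelta(days=3)},
    ]
    print(purge_old_ips(log, now))
    print(pseudonymize("203.0.113.7", b"example-key-change-me"))
```

In practice you would run something like `purge_old_ips` as a scheduled job, so that forgetting the data does not depend on anyone remembering to do it.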
Next, ask yourself whether the information can be stored on the end-user’s side, at the client, and not on your server. Distributed storage often saves costs for application developers, and it may also limit the scope of a data breach. Quite a lot of information, when it is not needed for processing, can be saved at the client.
Then, once we have decided that the information is needed, we ask ourselves what the benefits of retaining it are. For example, if we save a person’s purchase history in order to profile them and tailor advertisements, we might consider storing just the profile information or the categories of the purchased products.
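As a rough illustration of storing categories instead of the full purchase history, the sketch below reduces a list of purchases to per-category counts. The field names (`item`, `category`) are hypothetical; the point is only that the individual items never need to reach long-term storage.

```python
from collections import Counter


def category_profile(purchases: list[dict]) -> dict[str, int]:
    """Reduce a purchase history to per-category counts.

    Only the 'category' field of each purchase is read; the item
    names themselves are not kept in the resulting profile.
    """
    return dict(Counter(p["category"] for p in purchases))


if __name__ == "__main__":
    history = [
        {"item": "novel", "category": "books"},
        {"item": "cookbook", "category": "books"},
        {"item": "hose", "category": "garden"},
    ]
    # Store only this profile; discard the detailed history.
    print(category_profile(history))
```

Keep in mind that even a category profile can be sensitive (medical or religious products, for example), so the same "do we really need this" question applies to the profile itself.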
Then, let’s ask ourselves what the cost of retaining the information is. The cost has two parts: (a) the actual cost of storing the information; and (b) the cost of repairing a data breach. That means we need to ask whether the benefit of storing a large amount of data is lower than the cost of repairing a breach in which the personal information of X users ends up online (see eBay’s latest scandal as an example).
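The comparison above can be written down as a back-of-the-envelope calculation. This is only a sketch: the breach probability is my own added assumption (the text simply compares the benefit with the cost of a breach), and every number you feed in will be an estimate.

```python
def retention_cost(
    storage_cost: float,
    users: int,
    breach_cost_per_user: float,
    breach_probability: float = 1.0,
) -> float:
    """Estimated cost of retaining data for a given period.

    (a) storage_cost: what it costs to keep the data.
    (b) users * breach_cost_per_user: what it costs to repair a breach,
        weighted by an assumed probability that a breach happens.
        The default of 1.0 is the pessimistic "it will leak" case.
    """
    return storage_cost + breach_probability * users * breach_cost_per_user


def worth_retaining(benefit: float, cost: float) -> bool:
    """Retain the data only if the expected benefit exceeds the cost."""
    return benefit > cost


if __name__ == "__main__":
    # All figures below are made up for illustration.
    cost = retention_cost(
        storage_cost=1_000.0,
        users=10_000,
        breach_cost_per_user=5.0,
        breach_probability=0.25,
    )
    print(cost, worth_retaining(benefit=8_000.0, cost=cost))
```

The value of the exercise is less in the final number than in being forced to write down what a breach would cost per affected user.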
So, what can you do? My recommendation is to plan for privacy ahead of time. Think of your product not as “keep everything, analyze later” but as “let’s only keep what we must, and dump the rest”. This will lower the cost of a data breach, and being more privacy oriented will actually help you in the long run.