Quantifying My World: February 2012

DataLossDB is a community-driven database of incidents in which any type of data has been lost or stolen. Its goal is to record as many data losses in the world as possible. The primary sources for the database are data breach notifications filed with state government offices or bureaus, but individuals also contribute.

I find this data especially interesting because the organizations listed in the database have often lost sensitive personal information, and many databases contain my own personal information. If that information reaches the wrong people, innocent people can become victims. For example, if an online shopping site stored payment information insecurely and it was stolen, its customers could have their identities stolen or lose money from their bank accounts or credit cards. Much of the data could have been kept secure if the organizations had been more careful.

This is a screenshot of the records of the database:

Each incident is recorded in the database with the type, description of loss, date of loss, number of lost records, source of the data loss, submitter, organization, and location. The type, description of loss, source of the data loss, submitter, organization, and location are categorical values. Some examples of incident types are “snail mail”, “stolen laptop”, “web”, and “hack”. It would be particularly interesting to group the records by organization and see whether certain organizations repeatedly lose data. These organizations could then be ranked by trustworthiness using the number of incidents and the number of records lost. The locations could also be mapped to see whether data losses cluster in particular areas. The number of lost records is a numerical value. To get a numerical response variable, one could total the number of lost records by month or year. Because the date of each data loss is recorded and dates are ordered, a time series can be constructed.
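As a rough sketch of the grouping and time-series ideas above, the Python below counts incidents per organization and totals lost records by month. The records, the field names (`organization`, `type`, `date`, `records_lost`), and the organization names are all invented for illustration; they are not taken from DataLossDB, whose actual export format I have not assumed here.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical sample records. Field names and values are made up for
# illustration and do not come from DataLossDB.
incidents = [
    {"organization": "Acme Retail", "type": "hack",
     "date": date(2011, 3, 14), "records_lost": 12000},
    {"organization": "Acme Retail", "type": "stolen laptop",
     "date": date(2011, 9, 2), "records_lost": 450},
    {"organization": "Example Health", "type": "snail mail",
     "date": date(2011, 3, 30), "records_lost": 80},
    {"organization": "Sample University", "type": "web",
     "date": date(2012, 1, 9), "records_lost": 3100},
]


def incidents_per_organization(rows):
    """Count how many incidents each organization appears in."""
    return Counter(row["organization"] for row in rows)


def repeat_offenders(rows, minimum=2):
    """Return organizations with at least `minimum` incidents, sorted by name."""
    counts = incidents_per_organization(rows)
    return sorted(org for org, n in counts.items() if n >= minimum)


def records_lost_by_month(rows):
    """Total the lost records per (year, month), in chronological order."""
    totals = defaultdict(int)
    for row in rows:
        key = (row["date"].year, row["date"].month)
        totals[key] += row["records_lost"]
    return sorted(totals.items())


if __name__ == "__main__":
    print("Repeat offenders:", repeat_offenders(incidents))
    for (year, month), total in records_lost_by_month(incidents):
        print(f"{year}-{month:02d}: {total} records lost")
```

Keying the monthly totals on a (year, month) pair keeps them in chronological order when sorted, which is the shape a time-series plot would need.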

This data prompts me to think of several questions to examine. How many of the data losses are accidental, and have they harmed any individuals? Companies should be held responsible for accidental data loss. How many companies are repeat offenders? If an organization is hacked, does that result in the data being lost or leaked? Or does the hacker only want to break into the organization to prove their ingenuity? Some of this data could be used to support arguments about internet privacy laws. Those laws should be standardized (at least across the U.S.) because so many companies operate in more than one state.

Friday, February 3, 2012

B0: Data Sources