DataLossDB is a community-driven database of incidents in which any type of data has been lost or stolen. Its goal is to record as many data losses
in the world as possible. The primary
sources for the database are “data breach notifications” from state government
offices and bureaus, but individuals also contribute.
I find this data especially interesting because the
organizations listed in the database have often lost sensitive personal
information, and many databases contain my own personal information. If that information reaches the wrong people,
innocent people can become victims. For example, if an online
shopping site stored payment information insecurely and it was stolen, its customers could have their
identities stolen or lose money from their bank accounts or credit cards. Much of the data could have been kept secure
if the organizations had been more careful.
This is a screenshot of the records of the database:
Each incident is recorded in the database with the type,
description of loss, date of loss, number of lost records, source of the data
loss, submitter, organization, and location.
The type, description of loss, source of the data loss, submitter,
organization, and location are categorical values. Some examples of incident types are “snail mail”,
“stolen laptop”, “web”, and “hack”. It
would be particularly interesting to group the records by organization and see
whether certain organizations repeatedly lose data. These organizations could then be ranked by
trustworthiness using the number of records they have lost. The locations could also be mapped to see whether
incidents cluster in particular areas. The
number of lost records is a numerical value.
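The grouping by organization described above can be sketched in a few lines of Python. This is only a minimal illustration: the records, the field names (`organization`, `type`, `records_lost`), and the helper functions are hypothetical and are not taken from DataLossDB itself.

```python
from collections import defaultdict

# Hypothetical sample records; names and numbers are invented for illustration.
incidents = [
    {"organization": "Org A", "type": "stolen laptop", "records_lost": 1200},
    {"organization": "Org B", "type": "hack", "records_lost": 50000},
    {"organization": "Org A", "type": "web", "records_lost": 300},
    {"organization": "Org C", "type": "snail mail", "records_lost": 45},
    {"organization": "Org A", "type": "hack", "records_lost": 7000},
]


def summarize_by_organization(rows):
    """Return {organization: (incident_count, total_records_lost)}."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        counts[row["organization"]] += 1
        totals[row["organization"]] += row["records_lost"]
    return {org: (counts[org], totals[org]) for org in counts}


def repeat_offenders(rows, min_incidents=2):
    """List organizations with at least min_incidents incidents,
    ordered from most to fewest total records lost."""
    summary = summarize_by_organization(rows)
    repeats = [
        (org, count, total)
        for org, (count, total) in summary.items()
        if count >= min_incidents
    ]
    return sorted(repeats, key=lambda item: item[2], reverse=True)


print(repeat_offenders(incidents))
```

Ranking by total records lost is only one possible measure of trustworthiness; the number of incidents, or the records lost relative to the size of the organization, would give a different ordering.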
To get a numerical response variable, one could total the number of lost
records by month or year. Because the
date of each data loss is recorded and dates have a natural order, a time
series can be constructed.
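As a rough sketch of how such a time series might be built, the Python below totals lost records per calendar month. The records and the field names (`date`, `records_lost`) are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample records; dates and numbers are invented for illustration.
incidents = [
    {"date": date(2012, 1, 15), "records_lost": 1200},
    {"date": date(2012, 1, 28), "records_lost": 300},
    {"date": date(2012, 3, 2), "records_lost": 50000},
    {"date": date(2013, 3, 9), "records_lost": 45},
]


def records_lost_per_month(rows):
    """Return [((year, month), total_records_lost), ...] in date order."""
    totals = defaultdict(int)
    for row in rows:
        key = (row["date"].year, row["date"].month)
        totals[key] += row["records_lost"]
    # Tuples of (year, month) sort chronologically.
    return sorted(totals.items())


monthly = records_lost_per_month(incidents)
for (year, month), total in monthly:
    print(f"{year}-{month:02d}: {total}")
```

Months with no recorded incidents do not appear in the result, so they would need to be filled in with zeros before plotting an evenly spaced series. Grouping on the year alone gives the yearly version.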
This data prompts several questions. How many of the data losses are accidental,
and have those losses harmed any individuals?
Companies should be held responsible for accidental data loss. How many companies are repeat offenders? If an organization is hacked, is
the data actually stolen or leaked? Or does the hacker only want to break into the
organization to prove their skill?
Some of this data could be used to support arguments about internet
privacy laws. These laws should be
standardized (at least across the U.S.) because so many companies operate in more than
one state.