Tuesday, April 3, 2012

B1: Designing and Data-collecting

The main goal of my team’s project is to see whether there is a relationship between traditional (off-line) globalization and on-line globalization or participation for different countries.  We hope to find a correlation between a country’s global on-line activity and its economic and trade indicators, such as GDP per capita.

To accomplish this goal, we decided to combine three data sources: Alexa, World Bank, and Wikipedia.  Alexa is a site that contains immense amounts of information about websites around the world; we were particularly interested in its list of the 100 most-accessed sites for each country, which let us determine how global a country was on-line.  The World Bank maintains many databases of general data about the world; we used its financial and trade information about countries as our measure of off-line globalization.  Thirdly, Wikipedia is one of the most popular wikis, and we used it to get user participation information.

I focused on gathering data from Alexa by downloading the web pages with the top 100 site listings for each country and then processing them with Python.  We wanted a global metric that would represent how global a country was on-line based on its top 100 sites.  Figuring out how to calculate this metric proved to be more complicated than actually doing the calculation for all of the countries.  We first looked at all of the sites and assigned a global number to each based on how many countries included it among their top 100 sites.  Then, for each country, we added up the numbers for the sites in its top 100 and got a global metric that let us sort the countries by how global they were.
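The two-step calculation above can be sketched in Python.  This is a minimal illustration with made-up country codes and site lists (real lists would have 100 domains per country), not our actual processing script:

```python
from collections import Counter

# Hypothetical sample: each country mapped to its top sites.
top_sites = {
    "US": ["google.com", "facebook.com", "craigslist.org"],
    "FR": ["google.com", "facebook.com", "lemonde.fr"],
    "JP": ["google.com", "yahoo.co.jp"],
}

# Step 1: a site's "global number" is how many countries list it.
site_score = Counter()
for sites in top_sites.values():
    site_score.update(set(sites))

# Step 2: a country's global metric is the sum of its sites' scores.
global_metric = {
    country: sum(site_score[s] for s in sites)
    for country, sites in top_sites.items()
}

# Countries sorted from most to least global on-line.
ranking = sorted(global_metric, key=global_metric.get, reverse=True)
```

In this toy sample, google.com appears in all three lists and so contributes 3 to every country, while a site unique to one country contributes only 1.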

Another issue we hadn’t anticipated was receiving different country names from Alexa and World Bank.  We compared and matched each country name from Alexa to the World Bank and Wikipedia names, and then used the country names from Alexa in our actual data.
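The matching step amounts to a hand-built translation table mapping each source's spelling onto the Alexa name we kept.  A sketch of the idea, using a couple of assumed example names (the real list was built by comparing the sources by hand):

```python
# Hypothetical country names as each source might report them.
alexa_names = ["United States", "South Korea", "Russia"]
worldbank_names = ["United States", "Korea, Rep.", "Russian Federation"]

# Manual overrides for names that differ between the sources.
overrides = {"Korea, Rep.": "South Korea", "Russian Federation": "Russia"}

def to_alexa_name(wb_name):
    """Map a World Bank country name onto the Alexa name we kept."""
    return overrides.get(wb_name, wb_name)

matched = [to_alexa_name(n) for n in worldbank_names]
```

Names that already agree pass through unchanged; only the mismatches need an entry in the override table.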

We then combined the data from the three sources by country.  For each country, the data included the global metric we calculated, the GDP and GNP per capita (USD and PPP), trade, military data, the number of internet users, and a number representing on-line participation (from the Wikipedia data).
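Once the country names agree, combining the sources is a straightforward merge keyed on country.  A sketch with invented field names and values, assuming each source is a per-country dictionary:

```python
# Hypothetical per-country tables from the three sources,
# keyed by the (already matched) Alexa country name.
alexa = {"France": {"global_metric": 612}}
world_bank = {"France": {"gdp_per_capita_usd": 40000, "trade": 0.55}}
wikipedia = {"France": {"participation": 0.8}}

# One combined record per country, merging all three sources.
combined = {
    country: {**alexa[country], **world_bank[country], **wikipedia[country]}
    for country in alexa
}
```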

When we start graphically analyzing the data, we hope to find a relationship between the global metric and GDP, GNP, or trade.  It would also be interesting to see whether countries with larger military interests have lower on-line globalization or participation.  Finally, it would be interesting to see whether the on-line participation rate is directly related to the global metric.

Friday, February 3, 2012

B0: Data Sources


DataLossDB is a community-driven database of incidents in which any type of data has been stolen.  Its goal is to record as many data losses around the world as possible.  The primary sources for the database are “data breach notifications” from state government offices and bureaus, but individuals also contribute.

I find this data especially interesting because the organizations listed in the database have often lost personal information that could be sensitive, and many databases contain my personal information.  If the wrong people get that information, innocent people could become victims.  For example, if an on-line shopping site’s payment information were insecure and then stolen, all of its customers could have their identities stolen or lose money from their bank accounts or credit cards.  Much of this data could have been kept secure if the organizations had been less careless.

(Screenshot of the database records.)

Each incident is recorded in the database with the type, description of loss, date of loss, number of lost records, source of the data loss, submitter, organization, and location.  The type, description of loss, source, submitter, organization, and location are categorical values.  Some examples of breach types are “snail mail”, “stolen laptop”, “web”, and “hack”.  It would be particularly interesting to sort the records by organization and see whether certain organizations repeatedly lose data; those organizations could then be ranked by trustworthiness using the number of records they have lost.  The locations could also be mapped to see whether there is any correlation between areas.  The number of lost records is a numerical value, so to get a numerical response variable one could total the number of lost records by month or year.  Recording the date of the data loss also allows a time series to be constructed, because the date is an ordinal variable.
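Both analyses suggested above (finding repeat offenders and totaling lost records by time period) are simple aggregations.  A sketch with a few invented incidents in the style of the DataLossDB fields, not actual database entries:

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical incident records in the DataLossDB style.
incidents = [
    {"organization": "Acme Corp", "date": date(2011, 5, 3), "records": 1200},
    {"organization": "Acme Corp", "date": date(2011, 9, 17), "records": 300},
    {"organization": "Beta LLC", "date": date(2012, 1, 8), "records": 50},
]

# Repeat offenders: organizations with more than one incident.
incident_counts = Counter(i["organization"] for i in incidents)
repeat_offenders = [org for org, n in incident_counts.items() if n > 1]

# Numerical response variable: total records lost per year.
lost_per_year = defaultdict(int)
for i in incidents:
    lost_per_year[i["date"].year] += i["records"]
```

Grouping by `(year, month)` instead of `year` would give the monthly series, and sorting `lost_per_year` by key yields the time series the dates make possible.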

This data prompts me to think of several questions to examine.  How many of the data losses are accidental, and have they harmed any individuals?  Companies should be held responsible for accidental data loss.  How many companies are repeat offenders?  If an organization is hacked, does that result in the data being lost or leaked, or does the hacker only want to break into the organization to prove their ingenuity?  Some of this data could be used to back arguments about internet privacy laws, which should be standardized (at least across the U.S.) because so many companies operate across more than one state.