Tuesday, April 3, 2012

B1: Designing and Data-collecting

The main goal of my team’s project is to see whether there is a relationship between traditional (off-line) globalization and on-line globalization or participation across countries. We hope to see a correlation between a country’s global on-line activity and its economic and trade indicators, such as GDP per capita.

To accomplish this goal, we decided to combine three data sources: Alexa, the World Bank, and Wikipedia. Alexa is a site that publishes a large amount of information about websites around the world. One thing we were interested in using was the list of the top 100 most-accessed sites for each country, so that we could determine how global a country was on-line. The World Bank maintains many databases with general data about the world, but we were interested in its financial and trade information about countries as our measure of off-line globalization. Finally, Wikipedia is one of the most popular wikis, and we used it to get user participation information.

I focused on gathering data from Alexa by downloading the web pages with the top 100 site listings for each country, and then processing them with Python. We wanted a global metric that would represent how global a country was on-line based on its top 100 sites. Figuring out how to calculate this metric proved to be more complicated than actually doing the calculation for all of the countries. We first looked at all of the sites and assigned each one a global score equal to the number of countries that included it in their top 100 sites. Then, for each country, we added up the scores of the sites in its top 100 to get a global metric that we could use to sort the countries by how global they were.
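The two-step calculation can be sketched roughly as follows. This is only an illustration of the idea, not our actual script: the function name `global_metric`, the input layout (a dict from country to a list of site domains), and the tiny three-country sample are all made up for the example.

```python
from collections import Counter


def global_metric(top_sites):
    """Return {country: metric} for a dict of {country: [site, ...]}.

    Step 1: score each site by the number of countries listing it.
    Step 2: sum those scores over each country's own list.
    """
    site_score = Counter()
    for sites in top_sites.values():
        # set() so a site is counted at most once per country
        site_score.update(set(sites))

    return {
        country: sum(site_score[site] for site in set(sites))
        for country, sites in top_sites.items()
    }


# Hypothetical sample with 3 sites per country instead of 100.
sample = {
    "A": ["google.com", "facebook.com", "a-news.example"],
    "B": ["google.com", "facebook.com", "b-shop.example"],
    "C": ["google.com", "c-portal.example", "c-mail.example"],
}
metrics = global_metric(sample)

# Sort countries from most to least global.
ranking = sorted(metrics, key=metrics.get, reverse=True)
```

In this toy sample, countries A and B share two widely listed sites and so score higher than C, which has only one site in common with the others.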

Another issue that we weren’t anticipating was that Alexa and the World Bank use different names for some countries. We compared and matched each country name from Alexa against the World Bank and Wikipedia data, and then used the Alexa country names in our final data.
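One simple way to do this kind of matching is a hand-built alias table that maps the other sources' names onto the Alexa names. The sketch below assumes that approach; the `ALIASES` entries are illustrative examples of the kind of mismatch involved (the left-hand names follow World Bank conventions, while the right-hand names are assumed, not checked against Alexa), and the helper names are hypothetical.

```python
# Illustrative alias table: other-source name -> canonical (Alexa-style) name.
# The right-hand names are assumptions for the example.
ALIASES = {
    "Korea, Rep.": "South Korea",
    "Russian Federation": "Russia",
    "Egypt, Arab Rep.": "Egypt",
}


def to_canonical(name, aliases=ALIASES):
    """Collapse stray whitespace, then translate the name if an alias exists."""
    cleaned = " ".join(name.split())
    return aliases.get(cleaned, cleaned)


def unmatched(canonical_names, other_names):
    """List names from another source that still have no canonical match."""
    translated = {to_canonical(name) for name in other_names}
    return sorted(translated - set(canonical_names))


alexa_names = ["South Korea", "Russia", "Egypt"]
world_bank_names = ["Korea, Rep.", "Russian Federation", "Egypt, Arab Rep.", "Atlantis"]
leftovers = unmatched(alexa_names, world_bank_names)
```

Anything left in `leftovers` either needs a new alias or is a country that one of the sources does not cover.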

We then combined the data from all three sources by country. For each country, the data included the global metric we calculated, GDP and GNP per capita (in USD and PPP), trade and military figures, the number of internet users, and a number representing on-line participation (from the Wikipedia data).
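Combining by country amounts to a join on the country name. Here is a minimal sketch, assuming each source has already been reduced to a dict keyed by the matched country name. The function name, the field names, and all of the numbers are placeholders for illustration, not real data.

```python
def combine_by_country(*sources):
    """Merge per-country field dicts, keeping only countries present in every source.

    Each source is {country: {field: value}}. At least one source is required.
    """
    common = set.intersection(*(set(source) for source in sources))
    combined = {}
    for country in sorted(common):
        row = {}
        for source in sources:
            row.update(source[country])
        combined[country] = row
    return combined


# Placeholder values only.
alexa = {"A": {"global_metric": 6}, "B": {"global_metric": 5}}
world_bank = {
    "A": {"gdp_per_capita_usd": 1000.0},
    "B": {"gdp_per_capita_usd": 2000.0},
    "C": {"gdp_per_capita_usd": 3000.0},
}
wikipedia = {"A": {"participation": 0.2}, "B": {"participation": 0.1}}

combined = combine_by_country(alexa, world_bank, wikipedia)
```

Keeping only the countries found in every source drops "C" here, which avoids rows with missing values at the cost of losing some countries.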

When we start analyzing the data graphically, we hope to find a relationship between the global metric and GDP, GNP, or trade. It would also be interesting to see whether the countries that have larger military interests have lower on-line globalization or participation. Finally, it would be interesting to see whether the on-line participation rate is directly related to the global metric.
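Alongside the graphs, a correlation coefficient gives a quick numeric check of whether two columns move together. The sketch below computes the Pearson coefficient by hand. Pearson is just one reasonable choice and is an assumption here, not something we have settled on, and the input lists are made-up numbers.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up numbers: one value per country, in the same country order.
global_metrics = [1.0, 2.0, 3.0]
gdp_per_capita = [2.0, 4.0, 6.0]
r = pearson(global_metrics, gdp_per_capita)
```

A value near 1 or -1 suggests a strong linear relationship and a value near 0 suggests little or none, though a scatter plot is still needed to spot outliers and non-linear patterns.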