
Five dimensional scatterplot using ggplot2

“How many dimensions can you show on a scatterplot or a chart?” is a question that pops up in many situations, for example, in data scientist interviews. It also came up at my workplace, where my colleagues were of the opinion that five dimensions would be a bit too much for a two-dimensional scatterplot. Here, I write some simple R code to plot five dimensions in a two-dimensional scatterplot. Along the way, I also apply a few manipulations to the figure produced by ggplot for better (?) visuals.

First, let us load the library needed for the script:

library(ggplot2)

Next, we create a five-dimensional data frame. We will use two continuous variables for the two axes and three categorical variables to display different classes in the data. Since we use the rnorm() function to generate random points, we call set.seed() first so that we get exactly the same sequence of numbers on every run. This is important for the reproducibility of the code.

set.seed(123)

dat <- data.frame(status = rep(c("Single", "Married"), each=10), # categorical variable shown as the point shape
                  height = 41:60 + rnorm(20,sd=3), # variable for the x-axis
                  weight = 41:60 + rnorm(20,sd=3), # variable for the y-axis
                  Education = rep(4:8,4),  # numeric but categorical variable; determines the size of the points
                  gender = rep(c("Male", "Female"), each=2))  # shown as the point color; recycled to 20 rows
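
One detail of the data frame above is easy to miss: the gender vector has only four elements, and data.frame() recycles it to fill all 20 rows. The following is a small sanity check, not part of the original script, that rebuilds the same data frame and confirms the row count and the class balance.

```r
# Rebuild the data frame from the post so this check is self-contained
set.seed(123)
dat <- data.frame(status = rep(c("Single", "Married"), each = 10),
                  height = 41:60 + rnorm(20, sd = 3),
                  weight = 41:60 + rnorm(20, sd = 3),
                  Education = rep(4:8, 4),
                  gender = rep(c("Male", "Female"), each = 2))

# Structure of the five columns
str(dat)

# Cross-tabulate the two text-valued categorical variables
status_by_gender <- table(dat$status, dat$gender)
print(status_by_gender)

# Number of rows after recycling the four-element gender vector
n_rows <- nrow(dat)
```

If a vector's length did not divide 20 evenly, data.frame() would stop with an error instead of recycling, so a check like this is mostly useful when the class sizes are changed.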

Now, we use the ggplot() function from the ggplot2 package. Its most important argument is the mapping created by the aes() function. aes stands for aesthetics, and it maps variables to different visual properties of the plot. It takes as input the x-axis and y-axis variables and other properties such as shape, color, and size.

dat.plot <- ggplot(dat, 
   aes(x=height,  #  height as x-axis
   y=weight,  #  weight as y-axis
   shape=status, #  shape denotes marital status
   color = gender, #  color denotes the gender
   size = Education))   #  size denotes the education

The aes() function only creates the aesthetic mapping; we still need to add a layer of points that uses this mapping. The function geom_point() does exactly that.

dat.plot <- dat.plot + geom_point() 

We can change the labels of the x-axis and y-axis and the title using:

dat.plot <- dat.plot +  xlab("Height")                             # add xlabel to plot
dat.plot <- dat.plot +  ylab("Weight")                             # add ylabel to plot
dat.plot <- dat.plot +  ggtitle("Five Dimensional Scatterplot")    # add title to image

Figure 1: First Five Dimensional Scatterplot

This creates a plot like Figure 1 above, which shows five dimensions in a two-dimensional scatterplot. We have used size, shape, and color to add three further dimensions to the scatterplot on top of the regular X and Y axes. However, we can put the legends inside the plot, as the left-hand side of the plot is free. For this, we use the theme() function of ggplot2, which changes the non-data elements of a plot.

# Put bottom-left corner of legend box in bottom-left corner of graph
dat.plot <- dat.plot + theme(legend.position=c(0,0), #position the legend 
                             legend.justification=c(0,0)) #justification of legend
# bottom-left is 0,0; top-right is 1,1
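
As a side note that is not part of the original script: to my understanding, ggplot2 3.5.0 and later prefer legend.position = "inside" together with legend.position.inside for this kind of placement, and warn about the numeric form used above. The sketch below assumes that behavior and uses a small stand-in data set (the names toy and p are hypothetical) so that it runs on its own.

```r
library(ggplot2)

set.seed(123)
# Small stand-in data set (hypothetical), just enough to draw a legend
toy <- data.frame(x = rnorm(10),
                  y = rnorm(10),
                  group = rep(c("a", "b"), each = 5))

p <- ggplot(toy, aes(x = x, y = y, color = group)) + geom_point()

# Choose the legend syntax that matches the installed ggplot2 version
if (packageVersion("ggplot2") >= "3.5.0") {
  p <- p + theme(legend.position = "inside",
                 legend.position.inside = c(0, 0),  # bottom-left of the panel
                 legend.justification = c(0, 0))
} else {
  p <- p + theme(legend.position = c(0, 0),
                 legend.justification = c(0, 0))
}
```

Either branch anchors the bottom-left corner of the legend box at the bottom-left corner of the plot, as in the code above.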

Figure 2: Second Five Dimensional Scatterplot

This creates the plot shown in Figure 2. However, Edward Tufte would still not be happy, because the gray background of the ggplot takes some ink, so the data-ink ratio is decreased (actually, it is my personal preference that plots have a white background). Let us also improve the readability of the plot with larger font sizes for the title, the axis labels, and the tick labels, and move the legend up a little.

dat.plot <- dat.plot +  theme(legend.position=c(0,0.3),
        legend.justification=c(0,0), 
        panel.background = element_rect(fill = "white", # background color of the plot
        colour = "black",  # color of the rectangle around the plot
        size = 1, linetype = "solid"), # width and type of the border line (newer ggplot2 releases prefer linewidth over size)
        axis.title=element_text(size=18,face="bold"),  # font face and font size of both X and Y labels
        axis.text=element_text(size=14), # tick labels of both X and Y axes
        plot.title = element_text(size = rel(2), # relative size of the plot title
        face="bold", colour = "black"))  # color and font face of the title

We now change the titles and labels of the legends in the plot through scale_color_discrete() and scale_shape_discrete().

dat.plot <- dat.plot + scale_color_discrete(name ="Gender", labels=c("Female", "Male")) 
dat.plot <- dat.plot + scale_shape_discrete(name="Marital\nStatus", labels=c("Married", "Single" ))

Figure 3: Third and Final Five Dimensional Scatterplot

The final plot is shown in Figure 3, which seems better than the first one. The code is available from my GitHub page.

CA elections: A statistical Look Back

Censuses and elections are festivals for people interested in data, as they give rise to data that can be subjected to various statistical analyses. However, the practice of democracy is very new in Nepal, and not all of the related data are publicly available. Nevertheless, it is always interesting to analyze the available data statistically. Since datasets are available for only two CA elections, sophisticated algorithms and methods cannot be used. Furthermore, the small sample size limits the confidence we can place in the results of the analysis. Results from the proportional system do not give rise to interesting numbers; therefore, mostly the results of First Past The Post (FPTP) have been analyzed. Since the Election Commission (EC) has data for only 239 constituencies for the elections of 2064, the analysis of the 2064 elections is based on those 239 constituencies, whereas the analysis of 2070 is based on all 240 constituencies.

 

Poor Performance and Luck?

There is no denying the fact that the United Communist Party of Nepal (Maoist) (UCPN(M)) performed poorly in the second CA election. Many political analysts have commented on this poor performance and the reasons behind it. Statistically, too, the performance was rather poor. It can be gauged not only by the number of seats won and the percentage of votes garnered, but also by the vote margins in the constituencies the party lost. The mere difference in the number of votes between the winning and losing candidates is not a true measure of the gap between them. For example, losing by 1000 votes in a sparsely populated constituency such as Manang, with a voting population of 4,795, is different from losing by 1000 votes in a densely populated constituency such as Bhaktapur-2, where the voting population is 82,218. So, a weighted difference can better show the actual gap between the candidates. The formula for the weighted difference is:

 

\[
\text{Weighted Difference} = \frac{(\text{Votes}_{\text{winning}} - \text{Votes}_{\text{losing}}) \times 100}{\text{Total Votes Cast}}
\]

  

The formula above not only calculates the difference between the winning and losing candidates but also weighs the difference by the total votes cast in the constituency. The votes cast do not include invalid votes. The resulting ranking does not deviate much from the ranking by the original difference. In 2070, the closest loser in terms of number of votes, Mr. Gayananda Mandal (NC's candidate in Morang-4), is still first in the list, with a weighted difference of 0.02% of the total votes cast. The original difference is 8 votes, which also places him first among the candidates losing by the fewest votes. Similarly, Baburam Adhikari lost Gorkha-2 by the maximum weighted difference, 52.04% of the total votes cast. If we select the top three candidates from each constituency and look at the 50 candidates who lost by the maximum weighted difference, 27 of them are UCPN(M) candidates. Only 8 and 5 of them are from UML and NC, respectively. The remaining 10 positions are occupied by other political parties such as RPP and Sadbhawana. In contrast, in the 2064 elections only 2 of the 50 candidates who lost by the maximum weighted difference were from UCPN(M), whereas 20 were from NC and 25 from UML.
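
The weighted difference defined above can be sketched in R as follows. This is a minimal illustration rather than the code used for the analysis: the function name and the vote counts are hypothetical, and for the two constituencies from the earlier example it simply assumes a margin of 1000 votes and that the total valid votes cast equal the quoted voting population.

```r
# Weighted difference: margin between winner and loser as a
# percentage of the total valid votes cast in the constituency
weighted_difference <- function(votes_winning, votes_losing, total_votes_cast) {
  stopifnot(total_votes_cast > 0, votes_winning >= votes_losing)
  (votes_winning - votes_losing) * 100 / total_votes_cast
}

# Hypothetical vote counts giving a 1000-vote margin in both places;
# totals are the voting populations quoted in the text (an assumption,
# since actual valid votes cast would be lower)
manang    <- weighted_difference(2500, 1500, 4795)
bhaktapur <- weighted_difference(30000, 29000, 82218)

print(round(c(Manang = manang, Bhaktapur2 = bhaktapur), 2))
```

Under these assumptions the same 1000-vote margin is roughly a fifth of the vote in Manang but only a little over one percent in Bhaktapur-2, which is the point of weighting.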

  

On the other hand, if we select the top three candidates from each constituency and look at the 50 candidates who lost by the minimum weighted difference, only 3 candidates are from UCPN(M). Moreover, none of those three is among the 20 closest losers across all constituencies. By comparison, NC had 11 candidates and UML had 33 candidates among the 50. So, UML was the unlucky party, as it lost by very few votes in the largest number of constituencies. In 2064, 14 of the 50 candidates losing by the minimum weighted difference were from UCPN(M), 16 were from UML, and 12 were from NC.

Niels Bohr once said, “Prediction is difficult, especially about the future,” and we all know the volatility of Nepalese politics. In addition, this difference between the elections of 2064 and 2070 is statistically significant, which makes predicting votes in future elections a considerably arduous task.

  

Do we really need the proportional (PR) system?

There have been several opinions against the proportional election system in Nepal.

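  1. “Peepalbot Dotcom: The Humiliation of the Public Mandate” (पीपलबोट डटकम जनमतको हुर्मत), Narayan Wagle, Setopati
  2. “Peepalbot Dotcom: A Ballot Paper as Big as a Shawl” (पीपलबोट डटकमः मजेत्रोजत्रो मतपत्र), Narayan Wagle, Setopati
  3. “Girlfriend CA Member!” (गर्लफ्रेन्ड सभासद !), Akhanda Bhandari, Kantipur
  4. “The Irony of the Proportional System” (समानुपातिकको विडम्बना), Shrikrishna Aniruddha Gautam, Kantipur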

I personally do not support the PR system either, especially the way it is implemented in Nepal. The essence of democracy lies in candidates being elected by the people, not by a few dynastic leaders. If a proportional system is an absolute necessity, we should implement another form of PR, such as the Single Transferable Vote (STV) system, which has been used, for example, in Australia, Iceland, and the Republic of Ireland. However, voters do not yet have the education needed for such a system to be implemented in Nepal. People often argue that the PR system is successful in many European countries such as Finland and Denmark. However, the PR system in those countries is considerably different from the one implemented in Nepal. There, people vote for candidates in multi-member constituencies, not for a party as in Nepal, and the order in which candidates are selected is determined by the votes each candidate receives, not by the leaders of political parties. One of the basic reasons for implementing a PR system is to avoid wasting people's votes. Contrary to this basic principle, we are already losing the 518,404 votes (5.40%) cast for the 92 parties that won no seat: only 30 out of 122 parties gained a seat in the CA, so all the votes cast for the other 92 went down the drain. Additionally, this number does not include the invalid votes that result from the complexity of the PR system, such as multiple ballot papers and the large size of the ballot papers.

  

The other main reason behind adopting the PR system is inclusiveness in the parliament. For example, it is argued that we do not have a proportional number of women, marginalized groups, scheduled castes, and tribes in the parliament. If we always believe that the problems of a disadvantaged community, for example, issues relating to women or dalits, can only be raised by a woman or a dalit, then the system will never improve. Everyone, especially in the parliament, should understand everyone's problems. Nevertheless, in a democracy it is absolutely necessary that a proportional number of candidates be elected from the different classes of society. Here, I will try to use some numbers to show that we can achieve an inclusive parliament even with the FPTP system if we put requirements on how parties select their candidates. I will explain this with the example of women candidates.

  

We have seen that political parties have nominated women candidates in constituencies where the party's position is rather poor. UCPN(M) and UML both nominated women candidates in Kathmandu-1, where NC's Prakash Man Singh was almost sure to be elected. Similarly, NC nominated a woman candidate in Kathmandu-2 against the heavyweight opponents Madhav Kumar Nepal of UML and Lila Mani Pokhrel of UCPN(M). In fact, in the 2070 elections there is only one constituency where both the first and second candidates are women (Udayapur-2, where Manju Kumari Chaudhari (UML) won by 23 votes against Pramila Rai (NC)). In spite of this discrimination against women candidates, the numbers show that women perform almost as well as men, if not better. Additionally, many heavyweight leaders have filed their wives as dummy candidates (e.g. Ram Chandra Jha, http://nepalihimal.com/article/1860 ). If there are enough women candidates, the statistics suggest that they will come out as winners. A similar strategy, i.e. increasing the nominations from the political parties, can be used for other communities such as dalits and madhesis.

  

In 2070, there were 6,127 candidates in the FPTP system, of which 5,458 (89.1%) were males and 667 (10.9%) were females. However, 230 (95.8%) seats were won by males and 10 (4.2%) were won by females. The numbers show that 230 of the 5,458 male candidates (4.21%) won the election. On the other hand, 10 of the 667 women candidates (1.5%) won the election. Although this number is small in the case of women, women candidates were also among those who lost by the fewest votes: five of the ten candidates who lost by the minimum margin of votes are women. The results of 2064 show the trend more clearly. There were 3,947 candidates, of which 3,577 (90.63%) were males and 369 (9.34%) were females. 30 (12.5%) women candidates won in the FPTP system. Looking at the gender-wise percentage of candidates, we can see that women fared better than men: 30 out of 369, i.e. 8.13%, of the women candidates won the election, whereas 209 out of 3,577, i.e. 5.84%, of the male candidates won. These results suggest that if we have enough women candidates (for example, 40%), it will not be impossible to elect 33% women to the parliament even in the FPTP system. Similarly, by nominating enough candidates from different sections of society such as dalits, madhesis, and marginalized groups, we can have an inclusive parliament in the FPTP system.
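
The gender-wise success rates quoted above can be recomputed with a few lines of R. The sketch below only uses the candidate and winner counts given in the text; the data frame and column names are my own.

```r
# Candidate and winner counts by election year and gender, as quoted in the text
results <- data.frame(
  election   = c("2064", "2064", "2070", "2070"),
  gender     = c("Male", "Female", "Male", "Female"),
  candidates = c(3577, 369, 5458, 667),
  winners    = c(209, 30, 230, 10),
  stringsAsFactors = FALSE
)

# Share of candidates of each gender who won, in percent
results$win_rate_pct <- round(100 * results$winners / results$candidates, 2)

print(results)
```

Comparing the win_rate_pct column within each election year reproduces the comparison made in the paragraph: women did better than men in 2064 and worse in 2070.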

  

In the current PR system, a woman candidate is selected at the mercy of the senior leaders, who in most political parties are high-caste males. Furthermore, a political party can change a PR candidate at any time, not only on the list submitted to the Election Commission but, alarmingly, even during the parliamentary term. This system does not give women MPs enough confidence to put forth their plight before the parliament against their leaders' wishes. As Paris Hilton said, “No matter what a woman looks like, if she's confident, she's sexy.” We need confident women candidates and other candidates from the marginalized groups to make their voices heard in the parliament.

  

An alternative

If women cannot gain 33% representation even after enough women candidates (40%) are nominated in the FPTP system, we can add a small but fixed number of parliamentary seats, called leveling seats or adjustment seats, to maintain proportionality in the parliament. For these seats, we can select the candidates from the specific groups who lost by the minimum weighted difference. Such leveling seats have been used in PR systems, for example, in Norway, Denmark, Sweden, and Germany. If we look at the five women candidates who lost by the minimum weighted difference in 2070 (Pramila Rai, Surita Kumari Shah, Lila Devi Bokhim Limbu, Maina Kumari Bhandari, Sarala Kumari Yadav), most of them also belong to a marginalized community such as Janajati or Madhesi. These women candidates could be selected for the leveling seats, as they represent not only women but also marginalized groups. The order of choice can be the weighted difference in votes, considering candidates from the particular marginalized groups that are not represented in the parliament. Hence, using a little bit of statistics (complicated calculations are already used in the PR system), we can have a proportionally represented parliament that represents the true voices and aspirations of the general public.
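
The selection rule described above, picking the losing candidates from an under-represented group in order of increasing weighted difference, could be sketched as follows. Everything in this example is hypothetical: the candidate labels, the group assignments, the weighted differences, and the number of leveling seats are made up purely to show the mechanics.

```r
# Hypothetical losing candidates with their group and weighted difference (%)
losers <- data.frame(
  candidate     = c("A", "B", "C", "D", "E"),
  group         = c("women", "women", "men", "women", "men"),
  weighted_diff = c(0.5, 3.2, 0.1, 1.4, 2.0),
  stringsAsFactors = FALSE
)

# Hypothetical number of leveling seats reserved for the group
n_leveling_seats <- 2

# Keep only the under-represented group, then sort by closeness of the loss
eligible <- losers[losers$group == "women", ]
eligible <- eligible[order(eligible$weighted_diff), ]

# The closest losers fill the leveling seats
selected <- head(eligible, n_leveling_seats)
print(selected)
```

A real implementation would also need a rule for ties and for candidates who belong to more than one group, which this sketch leaves out.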

Data Science Quotes

Quotations are often used to back up claims and lend credibility to a person's views on a topic. Quotes are very popular in newspaper columns and presentations, where they clarify or reinforce the main points and strengthen the arguments. I am also a big fan of quotes and have used them in every chapter of my master's thesis and doctoral dissertation. Ever since DataMarket and even LinkedIn came up with their lists of quotes, I have planned to publish some of my favorites that were missing from those two lists. So, here it comes. Because of my background in machine learning and data mining, the list could be biased and tilted in that direction.

Science these days has basically turned into a data-management problem.

Jimmy Lin, Associate Professor, University of Maryland

The purpose of models is not to fit the data but to sharpen the questions.

Samuel Karlin, 11th R A Fisher Memorial Lecture (1983)

Although we often hear that data speak for themselves, their voices can be soft and sly.

F. Mosteller, S. Fienberg, R. Rourke from Beginning Statistics with Data Analysis

Data does not equal information; information does not equal knowledge; and, most importantly of all, knowledge does not equal wisdom. We have oceans of data, rivers of information, small puddles of knowledge, and the odd drop of wisdom.

Henry Nix, Keynote address, AURISA, 1990

With too little data, you won’t be able to make any conclusions that you trust. With loads of data you will find relationships that aren’t real... Big data isn’t about bits, it’s about talent.

Douglas Merrill, Former CIO and VP of Engineering at Google

All models are wrong, but some are useful.

George E. P. Box, Empirical model-building and response surfaces (1987), Wiley, p. 424

Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.

John W. Tukey, The future of data analysis. Annals of Mathematical Statistics, 1962, 33:1-67 (see pp. 13-14)

Statisticians, like artists, have the bad habit of falling in love with their models.

George Box

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.

John Tukey, Sunset salvo. The American Statistician 40 (1)

Prediction is very difficult, especially about the future.

Niels Bohr

Statistical thinking will one day be as necessary a qualification for efficient citizenship as the ability to read and write.

H.G. Wells

If you torture the data enough, nature will always confess.

Ronald Coase, How should economists choose? American Enterprise Institute, Washington, D. C. (1982)

Conducting data analysis is like drinking a fine wine. It is important to swirl and sniff the wine, to unpack the complex bouquet and to appreciate the experience. Gulping the wine doesn’t work.

Daniel B. Wright (2003)

We are drowning in information and starving for knowledge.

Rutherford D. Rogers

Data do not speak for themselves - they need context, and they need skeptical evaluation.

Allen Wilcox

Data is the sword of the 21st century, those who wield it well, the Samurai.

Jonathan Rosenberg, Google’s Senior Vice President for Product Management

With three constants, I can fit a dog. With four, I can make it bark.

William Reifsnyder

All information looks like noise until you break the code.

Hiro, in Neal Stephenson's Snow Crash (1992)

It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.

Arthur Conan Doyle

There are two things you are better off not watching in the making: sausages and econometric estimates.

Edward Leamer, 1983, "Let's Take the Con Out of Econometrics," American Economic Review, American Economic Association, vol. 73(1), pages 31-43

Data analysis is simply a dialogue with the data.

Stephen F. Gull, 1994

Data is the new oil? No: Data is the new soil.

David McCandless

My Finland Experience

*A shorter version of this article appeared in Helsinki Times, perhaps the most popular English weekly in Finland (http://www.helsinkitimes.fi/columns/columns/expat-view/10517-my-finnish-experience.html)

I have been living in Helsinki (Espoo, a municipality in the capital region, to be precise), Finland, since September 2008, for the first two and a half years as a student in an International Master's Degree programme and for the last three and a half years as a doctoral student. Finland is a Nordic and/or Scandinavian country situated in northernmost Europe and bordered by Sweden to the west, Russia to the east, Norway to the north, and Estonia to the south. Finland leads the world in many areas, including but not limited to social welfare (popularly known as the Scandinavian model of social welfare), standard of living, and human rights. Very few people know that Finland is the land of Nokia, formerly the largest cell phone manufacturer in the world; the land of Santa Claus; the land of Mika Häkkinen and Kimi Räikkönen, the popular F-1 drivers; of Angry Birds (Rovio) and Clash of Clans (Supercell), the popular mobile games; and of Linus Torvalds, principal developer of the Linux kernel (the core of an open-source operating system). The people of this country are also known for inventions such as the SMS message, xylitol chewing gum, the dish draining closet, the wireless heart-rate monitor, and many others. After all, who would know them, as they neither bombed Iraq nor need their staff to wear pocketless pants to stop corruption [1]?

 

I personally think that Finns are well educated, well informed, and knowledgeable. When I say I am a Nepali, unlike many of their European counterparts, they never mistake me for an Italian from Napoli (Naples). It may be because of the myriad Nepalese restaurants in Helsinki, which have now expanded to other towns outside the capital region. Helsinki is probably the only city outside Nepal where Nepalese restaurants outnumber Indian restaurants. Education has very deep roots in Finnish society. Almost everyone has a master's degree. The state encourages every citizen to study by providing students with a multitude of facilities. Education is not only free; you are fed, accommodated, and paid to be a student. The results are evident from the fact that Finland often tops the PISA rankings [2]. Finland is also a pioneer in gender equality, as Finland's parliament was the first in the world to adopt full gender equality. Additionally, Sri Lanka and Finland are the only countries that have had women serving simultaneously as president and prime minister.

 

Summer in Helsinki

Finns are generally known to be quiet and to keep a low profile. This is true to some extent, but I have found that Finns have a great sense of humor and use sarcasm without overdoing it like the Brits. For example, there were no strikes and no stone pelting when a New Zealand minister severely criticized Finland [3] or when an article mocking Finland was recently published in The Guardian [4]. Imagine the situation if these things had been said and published about Nepal: it could have induced a riot. The Finnish reply, in contrast, will be very tongue-in-cheek, and they are always up for friendly banter. For example, one Finn seemed to be very annoyed when a Nepali posted in a forum that people in the world, especially in Nepal, have no or limited knowledge about Finland. The Finn replied, in typical fashion:

“Well, yes, see now we're limited in the resources. We don't have a mad prince killing the king, no Maoist rebels, no mountains people come to visit, and nobody offers a goat to get the airplane fixed... so what do the people in Nepal know about Finland? The multitude of Finnish restaurants in Kathmandu?” [5]

As I told you earlier, and let me remind you again, there are numerous (about 50) Nepalese restaurants in Helsinki.

 

Like any developed country, Finland has been a popular destination for Nepalese students. In addition to the students, there are a lot of families here for either work or business. So, the Nepalese community is not that scarce in a country with a population of just over 5 million. We get together for Nepali festivals and organize picnics during the summer. Overall, Helsinki is a lively city for Nepalese. Outside of Helsinki, you struggle to find people at all, let alone Nepalese. However, the number has surged recently because of the influx of Nepalese students and entrepreneurs. Social life is what makes Finland one of the more difficult places in the world to live. Even people living in the same apartment building hardly talk to each other. Government policies encourage young people to live away from their parents. Unlike in Nepal, you will find most students living in student apartments despite their parents having a beautiful house nearby.

 

Winter in Helsinki

Finland maintains one of the best living standards in the world. Every aspect of life, e.g. apartments, education, medical facilities, and everyday activities such as shopping and traveling, is well organized. Finland is very pragmatic, and even the simplest of issues are taken care of, which is not the case even in most other developed countries. For example, it is very easy to clean windows in Finland, as all the windows open inwards. In other developed countries like Ireland, many window cleaning companies earn their livelihood just because the windows open outwards. Finnish society is a knowledge-based society, so research is given a very high priority here. Finland is a country with high moral values, and honesty and adherence to those values are a high priority, as seen in the lost wallet experiment [6]. This could be one of the biggest lessons for Nepalese from Finland. Additionally, crime rates are extremely low. Adherence to the law is so strong that people do not cross the street at a red light even if no vehicles are in sight. Similarly, there are no traffic policemen anywhere, not even on the busiest of Helsinki's streets, but hardly anyone violates the traffic rules.

 

Living away from home is always a problem, but you need to consider a few more issues before you take the route to Nokia land. First and foremost is the language. Finnish and Swedish (spoken by approximately 6% of the people) are the two official languages here in Finland. Almost all people, especially the younger generation, can speak English. Nevertheless, they are reluctant to speak English unless they are very good at it. For day-to-day activities like shopping and traveling, the local language is very important, because all the goods are labeled in Finnish and all the official documents are in Finnish. Finnish belongs to the Finno-Ugric language family and is very difficult to learn. There are many jokes about the difficulty of the Finnish language, such as:

“Which is the heavenly language? Finnish, because it takes an eternity to learn.”

To get a part-time or full-time job, the language is of the utmost importance. A few IT-related jobs in multinational companies like Nokia, Rovio, Supercell, and Kone have English as their working language. Therefore, the IT sector can be an easier way into a job than any other field. However, the competition is tough, as computer education is given top priority here, and Finland has been considering giving programming classes to students in elementary school [7]. The second important issue is the weather. Since Finland is located very far north, you get extremely cold weather here, with temperatures as low as -50 degrees Celsius. In addition, winters are very dark, with daylight for only 2 to 3 hours, and even in those two to three hours you hardly see any sun. Summers are rainy, and 23 hours of daylight can also be frustrating.

 

In spite of the above-mentioned difficulties, Finland can be a good destination for a student, an entrepreneur, or a worker who is dedicated and hardworking. I have seen many friends who came here as students, especially in the IT sector, gain good positions in companies like Nokia and in university departments. Since education is free, a student who doesn't live a lavish life can manage living expenses with around 500-600 euros a month. So, students can get a good education while working very few hours or only during weekends. Students in a master's degree programme can complete their education at little more cost than in Nepal, even without working. However, finding a job when you do not have the skills and the language can be very difficult. Having said that, we have a lot to learn from this country; education is only one of the flowers in the whole garland.

 

References

2. “The Programme for International Student Assessment (PISA) is a worldwide study by the Organisation for Economic Co-operation and Development (OECD) in member and non-member nations of 15-year-old school pupils' scholastic performance on mathematics, science, and reading” Wikipedia.

About Prem

Prem Raj is a Data Scientist by trade and training, and a Post Doctoral Researcher at the University of Turku, Finland. He designs and develops algorithms, tools, and methods to make sense of vast amounts of data.