2016 Election: Sentiment Analysis From Twitter on Presidential Candidates

In an effort to understand the potential future of the American people, I decided to undertake the noble (and thankless) task of analyzing textual data regarding the 2016 Presidential Candidates through the Twitter API. I used Perl with the Net::Twitter module from CPAN to perform the data extraction and processing; the lists of positive and negative words came from “Mining and Summarizing Customer Reviews” (Hu and Liu, 2004). Lastly, the word clouds were made with Tableau.

The following data visualizations were made from a sample size of 500 tweets. Each visualization used a different search term, which is specified alongside its image.
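
For the curious, here is a rough sketch of the kind of script that can gather counts like these. It is not the exact code behind these visualizations; the credential environment variables, the lexicon file names (positive-words.txt and negative-words.txt from the Hu and Liu opinion lexicon), and the single-page search call are placeholders and assumptions you would adapt to your own setup:

#!/usr/bin/perl

use strict;
use warnings;
use Net::Twitter;

# Load one of the Hu & Liu (2004) opinion lexicon files into a hash for fast lookup.
sub load_words {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %words;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line eq '' || $line =~ /^;/;   # skip blank lines and ';' header lines
        $words{ lc $line } = 1;
    }
    close $fh;
    return \%words;
}

my $positive = load_words('positive-words.txt');
my $negative = load_words('negative-words.txt');

# Authenticate against the Twitter REST API (credentials read from the environment here).
my $nt = Net::Twitter->new(
    traits              => ['API::RESTv1_1'],
    consumer_key        => $ENV{TWITTER_CONSUMER_KEY},
    consumer_secret     => $ENV{TWITTER_CONSUMER_SECRET},
    access_token        => $ENV{TWITTER_ACCESS_TOKEN},
    access_token_secret => $ENV{TWITTER_ACCESS_SECRET},
);

# Pull a batch of tweets for one hashtag (repeat or page this call to reach a 500-tweet sample).
my $results = $nt->search({ q => '#BernieSanders', count => 100 });
my @tweets  = map { $_->{text} } @{ $results->{statuses} };

# Tally how many positive and negative lexicon words appear across the sample.
my ( $pos_count, $neg_count ) = ( 0, 0 );
for my $tweet (@tweets) {
    for my $word ( split /\W+/, lc $tweet ) {
        $pos_count++ if $positive->{$word};
        $neg_count++ if $negative->{$word};
    }
}

print "positive words: $pos_count, negative words: $neg_count\n";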

The following word cloud is from the search term ‘#BernieSanders’:

[Word cloud for #BernieSanders, 2/26/16]

This search found 32 total positive words and 24 total negative words.

The next word cloud utilized the search term ‘#HillaryClinton’:

[Word cloud for #HillaryClinton, 2/26/16]

The above results yielded 41 total positive words and 65 total negative words.

The next result is from ‘#DonaldTrump’:

[Word cloud for #DonaldTrump, 2/26/16]

The above cloud had 56 total positive words and 53 total negative words.

The next result is from ‘#JohnKasich’:

[Word cloud for #JohnKasich, 2/26/16]

The above cloud had 60 positive words and 50 negative words.

The next result is from ‘#MarcoRubio’:

[Word cloud for #MarcoRubio, 2/26/16]

The above cloud had 61 positive and 59 negative words.

The next result is from ‘#TedCruz’:

[Word cloud for #TedCruz, 2/26/16]

The above cloud had 46 positive and 56 negative words.

The final result is for ‘#BenCarson’:

[Word cloud for #BenCarson, 2/26/16]

The above cloud had 43 positive and 35 negative words.

Hopefully these results are beneficial to the reader in choosing a candidate!

Using knn (K nearest neighbor) with R

In this post I would like to go over some basic prediction and analysis techniques using R. This article assumes you have R set up on your machine.
For our purposes, we will use k-NN (k-nearest neighbors) to predict which patients in a data set are diabetic. k-NN is a lazy, instance-based learner: it does not build an explicit model during training. Instead, it classifies each new observation by the majority class among the k training observations closest to it.

I am going to use a data set that ships with the MASS package: the Pima Indians Diabetes data (Pima.tr).


> library(MASS)
> pima_data  <- Pima.tr
> summary(pima_data)
     npreg            glu              bp              skin            bmi             ped              age       
 Min.   : 0.00   Min.   : 56.0   Min.   : 38.00   Min.   : 7.00   Min.   :18.20   Min.   :0.0850   Min.   :21.00  
 1st Qu.: 1.00   1st Qu.:100.0   1st Qu.: 64.00   1st Qu.:20.75   1st Qu.:27.57   1st Qu.:0.2535   1st Qu.:23.00  
 Median : 2.00   Median :120.5   Median : 70.00   Median :29.00   Median :32.80   Median :0.3725   Median :28.00  
 Mean   : 3.57   Mean   :124.0   Mean   : 71.26   Mean   :29.21   Mean   :32.31   Mean   :0.4608   Mean   :32.11  
 3rd Qu.: 6.00   3rd Qu.:144.0   3rd Qu.: 78.00   3rd Qu.:36.00   3rd Qu.:36.50   3rd Qu.:0.6160   3rd Qu.:39.25  
 Max.   :14.00   Max.   :199.0   Max.   :110.00   Max.   :99.00   Max.   :47.90   Max.   :2.2880   Max.   :63.00  
  type    
 No :132  
 Yes: 68

The above is a summary of all the fields in our data set. The ‘type’ field is the target column that we are trying to predict. With the exception of the ‘ped’ (pedigree) field, the fields span very different ranges from minimum to maximum. These fields should be rescaled with normalization, which we’ll get to shortly.

We can find out the percentage of occurrences for each type (Yes = positive for diabetes, No = they do not have diabetes):


> round(prop.table(table(pima_data$type))* 100, digits=1)

 No Yes 
 66  34

Alright. 66% of these cases are negative, 34% are positive. Next, let’s normalize the data fields, except our target column (type). One way to do that is to apply a min-max normalize function (shown just below) to the first seven columns and then check the result:

> pima_n <- as.data.frame(lapply(pima_data[, 1:7], normalize))
> summary(pima_n)
     npreg              glu               bp              skin             bmi              ped               age         
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.07143   1st Qu.:0.3077   1st Qu.:0.3611   1st Qu.:0.1495   1st Qu.:0.3157   1st Qu.:0.07649   1st Qu.:0.04762  
 Median :0.14286   Median :0.4510   Median :0.4444   Median :0.2391   Median :0.4916   Median :0.13050   Median :0.16667  
 Mean   :0.25500   Mean   :0.4753   Mean   :0.4619   Mean   :0.2415   Mean   :0.4751   Mean   :0.17057   Mean   :0.26452  
 3rd Qu.:0.42857   3rd Qu.:0.6154   3rd Qu.:0.5556   3rd Qu.:0.3152   3rd Qu.:0.6162   3rd Qu.:0.24103   3rd Qu.:0.43452  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000

We can tell our ‘normalize’ function worked because the max is now 1 and the minimum is now 0 for our fields. Here is the normalize function code:


normalize <- function(x) {
       return ((x - min(x)) / (max(x) - min(x)))
}

We are going to separate this data set into two parts: training and test. The data set has 200 rows, so we will give most of them (the first 150) to the training set and hold out the rest (the last 50) as our test set:


> pima_train <- pima_n[1:150, ]

> pima_test <- pima_n[151:200, ]

Then we will pull the corresponding class labels (the ‘type’ column, column 8) from the original data set to check our predictions against:


> pima_trainlab <- pima_data[1:150, 8]
> pima_testlab <- pima_data[151:200, 8]

It is helpful to keep in mind here that the syntax for doing this is:


trainortest_label_variable <- original_data[range, column_to_predict]

Now we will build our knn predictor with the following:


> library(gmodels) # for the CrossTable function
> library(class) # for knn function
> pima_pred <- knn(train = pima_train, test = pima_test, cl=pima_trainlab, k=39)
> CrossTable( x=pima_testlab, y=pima_pred, prop.chisq=FALSE)

In the above code, we set up our knn() function with a k value of 39. The k value will sometimes take some adjusting, and it should be an odd number so that a two-class vote cannot end in a tie. Afterwards, we call the CrossTable() function to check the results of our knn() predictions, shown below. We correctly classified 80% of the diabetes-negative patients (24 of 30) but only 45% of the diabetes-positive patients (9 of 20). Not great, but one can obtain better results with less lazy algorithms or with different data sets. For example, with the UCI Breast Cancer data set, I have predicted 95%+ of malignant and benign tumors correctly. So play around with different k values and data sets and see what you find. Good luck!


             | pima_pred 
pima_testlab |        No |       Yes | Row Total | 
-------------|-----------|-----------|-----------|
          No |        24 |         6 |        30 | 
             |     0.800 |     0.200 |     0.600 | 
             |     0.686 |     0.400 |           | 
             |     0.480 |     0.120 |           | 
-------------|-----------|-----------|-----------|
         Yes |        11 |         9 |        20 | 
             |     0.550 |     0.450 |     0.400 | 
             |     0.314 |     0.600 |           | 
             |     0.220 |     0.180 |           | 
-------------|-----------|-----------|-----------|
Column Total |        35 |        15 |        50 | 
             |     0.700 |     0.300 |           | 
-------------|-----------|-----------|-----------|

Here are my results with a similar approach using the UCI Breast Cancer data:


Total Observations in Table:  100 

 
              | wdbc_predz 
wdbc_testzlab |    Benign | Malignant | Row Total | 
--------------|-----------|-----------|-----------|
       Benign |        76 |         1 |        77 | 
              |     0.987 |     0.013 |     0.770 | 
              |     0.987 |     0.043 |           | 
              |     0.760 |     0.010 |           | 
--------------|-----------|-----------|-----------|
    Malignant |         1 |        22 |        23 | 
              |     0.043 |     0.957 |     0.230 | 
              |     0.013 |     0.957 |           | 
              |     0.010 |     0.220 |           | 
--------------|-----------|-----------|-----------|
 Column Total |        77 |        23 |       100 | 
              |     0.770 |     0.230 |           | 
--------------|-----------|-----------|-----------|

Finding the Greatest Common Factor in Perl

Once again from Knuth’s Art of Computer Programming: one of the first examples in the book of an easy algorithm to implement is Euclid’s method for finding the greatest common factor of two numbers. The code below was developed from the pseudo-code presented in his book:


#!/usr/bin/perl

use strict;
use warnings;

=ignore

Euclid's algorithm for finding greatest common factor of 2 numbers

=cut


my $m = 225;
my $n = 10;

my $remainder;
for ( 1..$n ) {
    $remainder = divide($m, $n);
    if ( $remainder == 0 ) {
        print $n,"\n"; # Found the GCF, break out of the loop
        last;
    }
    else {
        $m = $n;
        $n = $remainder;
    }
}

sub divide {

    my $m = shift;
    my $n = shift;

    my $remainder = $m % $n;
    return $remainder;
}
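
For comparison, the same algorithm can be packaged as a small subroutine with a while loop instead of the bounded for loop. This variant is not from Knuth’s pseudo-code, just a compact restatement of the same idea:

#!/usr/bin/perl

use strict;
use warnings;

# Euclid's method again: repeatedly replace (m, n) with (n, m mod n)
# until the remainder is zero; the last non-zero value is the GCF.
sub gcf {
    my ($m, $n) = @_;
    ($m, $n) = ($n, $m % $n) while $n != 0;
    return $m;
}

print gcf(225, 10), "\n"; # prints 5, matching the loop above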


The Knuth-Morris-Pratt Algorithm in Perl and using index() to match a string pattern.

In light of my recent post using a brute-force string-searching algorithm, I decided to post an implementation of the Knuth-Morris-Pratt algorithm in Perl. This implementation follows the pseudo-code in the Wikipedia article fairly closely.


#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my $string = "ABC ABCDAB ABCDABCDABDE";

my $pattern = "ABCDABD";
my $table_array = build_table($pattern);

print "found pattern at index: "
. search_string( $string, $pattern, $table_array ),"\n";


sub search_string {

    my $string      = shift;
    my $pattern     = shift;
    my $table_array = shift;

    my $m = 0; # beginning of current match
    my $i = 0; # the position of the current character in pattern sought

    my @split_string = split(//, $string);
    my @split_pattern = split(//, $pattern);

    while ( $m + $i < scalar(@split_string) ) {
         if ( $split_pattern[$i] eq $split_string[ $m + $i] ) {
             if ( $i == scalar(@split_pattern) - 1 ) {   
                 return $m;             
             }             
         $i++;
         }
         else {
             if ( @{$table_array}[$i] > -1 ) {
                $m = $m + $i - @{$table_array}[$i];
                $i = @{$table_array}[$i];
            }
            else {
                $i = 0;
                $m++;
            }
        }

    }

    return length($string);

}

sub build_table {

    my $string = shift;

    my @split_string = split (//, $string);

    my $pos = 2;
    my $cnd = 0;

    my $table_array = [ -1, 0 ];    # the first two entries of the failure table are fixed

    while ( $pos < scalar(@split_string) ) {  
       if ( $split_string[$pos-1] eq $split_string[$cnd] ) {   
          $cnd++; 
          @{$table_array}[$pos] = $cnd;
          $pos++;
       } 
       elsif ( $cnd > 0 ) {
            $cnd = @{$table_array}[$cnd];
       }
       else {
            @{$table_array}[$pos] = 0;
            $pos++;
       }

    }

    return $table_array;
}
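
As an aside (not from the original script), since Data::Dumper is already loaded, adding one line after the build_table() call lets you inspect the failure table it computes. For the pattern ‘ABCDABD’ used above it comes out to -1, 0, 0, 0, 0, 1, 2, matching the worked example in the Wikipedia article:

print Dumper($table_array);   # dumps the contents of the arrayref: -1, 0, 0, 0, 0, 1, 2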

It should be noted, however, that the index() function in Perl uses the Boyer-Moore algorithm under the hood, so a string-searching function built around it, like the following, may be an easier and faster solution. It takes a pattern and a string (to search within) as arguments.


use strict;
use warnings;
use Data::Dumper;

my $matches = occurrences('CGATGGTCG',
'TCGATGGTAAATACTGTGCGATGGTCGATGGTTCGATGGTCGATGGTCGGGACGATGGTGGGCGATGGTGCGATGGTTCGATGGTACGATGGTCGATGGTACGATGGTCAGGGCGATGGTTAACGCGATGGTGGCAGTCGATGGTTGCGATGGTTCGATGGTCCCGATGGTGCGACGATGGTATTCCGATGGTTCGATGGTCGATGGTACTGCGATGGTCGATGGTACATCGATGGTATCCGATGGTCGATGGTGGCGATGGTCGATGGTCGATGGTCGATGGTGTTATCGATGGTCCGATGGTCGATGGTTAGCGATGGTTATAGGTATCCCGATGGTCGATGGTCGATGGTTACGATGGTCCGATGGTCGATGGTCTTTGTCGATGGTTCGATGGTCGATGGTAACGATGGTCGATGGTTTGTCGATGGTCGCGATGGTCGCCGATGGTGCCGATGGTGGGTCGATGGTGCTCGATGGTCGATGGTCCGCGATGGTTGCGTCGATGGTCGATGGTCGATGGTGGACTCGATGGTCACGATGGTTTCTCGATGGTGGTTCCGATGGTCGATGGTGTCGATGGTACGCAAGTACAGATAGTGCGATGGTGAGGATAGTGCGATGGTAGCGATGGTCGCGATGGTCGATGGTTACTTGCCTGCGATGGTGTGTACGATGGTCGGAACGCCCGATGGTGACGATGGTCATGCGATGGTATTCAATTCGATGGTCTCCGGCCGAAGAAAGCGATGGTCCCAAGATGATCGATGGTCGATGGTCGATGGTGTCGATGGTCCGATGGTCCGTTTCGATGGTACTTCGATGGTTTGCGATGGTATATGTCGATGGTCCAACGATGGTGGTGCGATGGTCTGCGATGGTA');

print join(' ', @{$matches});
sub occurrences {

    my( $x, $y ) = @_;

    my $pos = 0;
    my @locations;

    while (1) {
        $pos = index($y, $x, $pos);    # find the next occurrence at or after $pos
        last if ( $pos < 0 );          # index() returns -1 when there are no more matches
        $pos++;                        # record 1-based positions
        push @locations, $pos;
    }

    return \@locations;
}

Finding the Index and Number of Occurrences of a Pattern in a String

It is a common task in programming to search for patterns in a given string; this is especially true in the field of bioinformatics. For various reasons, the programmer may want to know the location, or index, at which the pattern starts within the string. The code below is an example of how to find the index of each occurrence, and the total number of occurrences, of a given pattern in Perl.

sub pattern_count {
    my ($text, $pattern) = @_;

    my $pattern_len = length($pattern);
    my $count = 0;
    my $pos;
    for ( my $i = 0; $i < length($text); $i++) {
        if ( $pattern eq substr($text, $i, $pattern_len ) ) {
            $count += 1;
            $pos = $i + 1;
            print "found $pattern at $pos\n";
        }
    }
    return $count;

}

my $num_occurrences = pattern_count('GATATATGCATATACTT', 'ATAT');

print "The number of times the pattern occurred is : " . $num_occurrences,"\n";


The string to search within is the first argument to the subroutine, and the desired pattern is the second argument.

The difficult part about using a regular expression for this task is that occurrences of the desired pattern may overlap one another. Since a plain match with the global flag will not pick up the overlapping occurrences, this implementation walks the string one position at a time instead.
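
That said, a zero-width lookahead is a common way to make a global match count overlapping occurrences as well. The short sketch below (not from the original post) should report the same positions as the pattern_count() example above:

#!/usr/bin/perl

use strict;
use warnings;

my $text    = 'GATATATGCATATACTT';
my $pattern = 'ATAT';

# (?=...) matches without consuming any characters, so the next attempt
# can begin inside the previous occurrence and overlaps are not skipped.
my @positions;
while ( $text =~ /(?=\Q$pattern\E)/g ) {
    push @positions, pos($text) + 1;   # pos() is 0-based; report 1-based positions
}

print scalar(@positions), " occurrences at positions: @positions\n";
# prints: 3 occurrences at positions: 2 4 10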


From Hello World! to Hello Salary! – Breaking into the world of Software Development

As my first blog post I thought it would be appropriate to touch on how an individual seeking to enter the field of Software Development could ‘get their foot in the door’.

My entry into the world of Software Development was almost accidental. I was initially focused on getting a job as a System Administrator. However, after a couple months of studying and searching around, I decided that simply getting a certification was not enough for me to land a good job with the salary and benefits I was looking for. It was at this time that I decided to go back to school to get a four-year degree in Information Technology. Upon completion of an ‘Intro to Programming’ course, I decided to start applying to entry level programming positions listed in the classified ads. By having code to show off that I recently completed in my college course, I landed an interview and got the job!

Instead of giving you my life story, I think it would be beneficial to give a list of Action Items that someone looking to break into Software Development should consider implementing:

  • If you do not have a degree in a quantitative field, IT, or computer science, go back to school! The experience you get from these courses could help you land a job while you’re still in school. Certifications seem to matter very little to employers in this aspect of the industry. It’s what you can do that matters.
  • Make an account on GitHub and start pushing your work up to it. Having a public GitHub account will show employers that you care about the field and that programming is a passion of yours.
  • Make an account on Stack Overflow, because you will almost certainly run into problems where you need the insight of veterans to help you through.
  • Assuming you have a LinkedIn account, start adding skills to it that are in line with the aspect of Software Development you are looking to pursue. Join some discussion groups related to the language(s) and Operating System(s) you’re interested in.
  • Keep practicing! Make an account at Project Euler or Hacker Rank and start completing challenges. Upload your code to your GitHub to show it off when necessary.
  • Unlike many other fields, Software Development is constantly evolving, with new technologies and best practices appearing all the time. Learn about the new technologies coming out and what their adoption may mean for you or your company.
  • If you land an entry-level or Junior-level job, find a Senior Developer you like and stick with them! Humble yourself and learn everything you can from them. Ask questions and learn their thought process. It will take you from being a total newbie to someone who can actually code. If you have not earned a job in the field yet, network on LinkedIn or somewhere else where you can meet other programmers and learn from them.
  • Lastly, set goals for yourself! Since Software Development is a very large field with innumerable possibilities, you need to set clearly defined short-term and long-term goals for yourself in order to become a better Developer. Short-term goals could be completing a certain number of programming challenges or making a certain number of commits to an Open Source project. Long-term goals could be attaining that degree or a promotion within your company.