R: 2011

Sunday, October 2, 2011

Strings in R

Here's a good post for string manipulation in R.

And from the date, 2001, I'm very behind.

But at least I can make graphs with personalized text on them :).
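As a small illustration of "personalized text" on a graph, here is a minimal sketch using base R's paste() and sprintf(). The variable names (site, n, title_text, subtitle) are made up for this example and do not come from the linked post.

```r
# Hypothetical values, for illustration only
site <- "Darling"
n <- 14L

# paste() joins strings with a space by default
title_text <- paste("Counts for", site)

# sprintf() fills a template; %d expects an integer
subtitle <- sprintf("n = %d observations", n)

# Use the strings as the main title and subtitle of a plot
plot(seq_len(n), main = title_text, sub = subtitle)
```

Both functions return ordinary character vectors, so the result can be passed to any plotting argument that takes text.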

Thursday, September 29, 2011

Installing rgl: meeting dependencies

It can prove to be a bottled pickle trying to install this R package, but the solution - a walk in the park!

So if you have tried, and failed for the moment, go to your terminal and apt-get these two packages :):

$ sudo apt-get install libx11-dev r-cran-rgl

Lastly, you will need to install 'rgl' from within R, I think...:

> install.packages('rgl')

Hope it worked!

Ciao!

Thursday, September 15, 2011

Functions in R: Beginner

Multidimensional scaling with R has so far been pleasant. The two methods that I've been using are cmdscale() and isoMDS() from this awesome R-based website. But as you can see, to produce a graph, there is quite a bit of typing. So why not functionize it?

My programming skills are limited - let me warn you in advance. In programming, a function is a named block of code which can be called by its name, together with any required arguments. Two functions have already been named in this blog, know what they are? Yes, cmdscale() and isoMDS().

But we can create our own if we wish. Some useful information was found on this page - click on 'Function' to skip to it.
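For anyone who has never written one, a minimal sketch of what defining and calling a function looks like in R. The name 'midpoint' and its arguments are invented for this example.

```r
# 'midpoint' is a made-up function: it returns the value halfway
# between lo and hi. hi has a default value of 10.
midpoint <- function(lo, hi = 10) {
  (lo + hi) / 2
}

a <- midpoint(2, 4)  # both arguments given
b <- midpoint(2)     # hi falls back to its default, 10
```

The last expression evaluated inside the braces is what the function returns, so no explicit return() is needed here.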

NB: in R, '<-' is used to assign data to variables, and we'll be using it a lot throughout this post. If you're wondering why they chose a combination of a few characters...anyways. Now, looking at the code below from the Quick-R website, it produces the MDS, but it is a lot of typing if I want to manipulate the data and perform the MDS that I desire. So what I'm looking to do is automate it.

d <- dist(mydata) # euclidean distances between the rows
fit <- cmdscale(d, eig=TRUE, k=2) # k is the number of dim
fit # view results

# plot solution
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Metric MDS", type="n")
text(x, y, labels=row.names(mydata), cex=0.7)

If this code is wrapped in a function, then calling the function with the desired MDS fit runs the rest of the code. But what do I mean by 'desired MDS fit'? First, here is my read.R file with 4 groups of code: 1) import any required packages; 2) import the data to mydata and perform any transformation on the data; 3) from the Euclidean distances, perform nonmetric or metric MDS; 4) plot the points using the called 'desired MDS fit'.

library(MASS)

mydata <- read.csv('/path/to/data.csv', header=TRUE, row.names=1)

d <- dist(mydata)
cmdfit <- cmdscale(d, eig=TRUE, k=2) # metric MDS
isofit <- isoMDS(d, k=2) # nonmetric MDS

myfit <- function(x1 = cmdfit){
 x1;
 x <- x1$points[,1]
 y <- x1$points[,2]
 plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="an MDS", type="n")
 text(x, y, labels=row.names(mydata), cex=0.7)
}

The myfit definition at the end is the main workhorse of the read.R file, and hopefully the rest isn't new to you. We've assigned function(x1 = cmdfit){x1;...cex=0.7)} to the name 'myfit'. Now, if you type 'myfit' without parentheses at the R prompt, it will echo the code.

> myfit
function(x1 = cmdfit){
 x1;
 x <- x1$points[,1]
 y <- x1$points[,2]
 plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="an MDS", type="n")
 text(x, y, labels=row.names(mydata), cex=0.7)
}

The most important things to note are that 'x1' is the argument used throughout the function and that if we do not give a value, it will use the default, in this case cmdfit. The default has to be the fit object itself and not the quoted string 'cmdfit': a character string has no $points component, so x1$points would fail. Because we have already defined what cmdfit and isofit are, R should be able to run myfit() or myfit(isofit) happily.
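Here is a self-contained sketch of the same idea that can be pasted straight into R. It is not the read.R file itself: the function name plot_mds, its extra arguments, and the use of the built-in USArrests data in place of a .csv file are all assumptions made for illustration.

```r
# Built-in example data standing in for the imported .csv (assumption)
mydata <- USArrests[1:10, ]

d <- dist(mydata)                         # Euclidean distances between rows
cmdfit <- cmdscale(d, eig = TRUE, k = 2)  # metric MDS

# 'plot_mds' is a hypothetical helper: it takes a fit object that has a
# $points matrix, plots the first two coordinates, and labels each point.
plot_mds <- function(fit = cmdfit, labels = row.names(mydata), title = "an MDS") {
  x <- fit$points[, 1]
  y <- fit$points[, 2]
  plot(x, y, xlab = "Coordinate 1", ylab = "Coordinate 2", main = title, type = "n")
  text(x, y, labels = labels, cex = 0.7)
  invisible(data.frame(x = x, y = y))     # hand the coordinates back for reuse
}

coords <- plot_mds()                      # uses the default fit, cmdfit
```

Since isoMDS() from MASS also returns a list with a $points matrix, a nonmetric fit should be usable the same way, e.g. plot_mds(isofit, title = "Nonmetric MDS").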

This is my first go at creating functions in the R statistical scripting language. There is a good example of an R function here and for those who are new to programming, I hope this has been painless.

The idea here is to reduce the amount of typing and get back to the statistical analyses. For MDS, there may be several variables which contribute to the final result. So, it may be necessary to add/remove variables and then rerun the MDS in order to achieve an output which has not been subject to bias, e.g. a variable where all observations recorded a 0 will indicate that the observations are all the same, when really, all the other variables suggest otherwise! Also, as datasets tend to be large, you may simply wish to analyse only a select few variables.
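A rough sketch of removing variables and rerunning the MDS. The toy data frame and the names all_zero, reduced and fit are invented for this example; it simply drops any column in which every observation is 0 before computing the distances.

```r
# Toy data (made up): column b is 0 for every observation
mydata <- data.frame(
  a = c(1, 4, 2, 6),
  b = c(0, 0, 0, 0),
  c = c(3, 1, 5, 2),
  row.names = c("s1", "s2", "s3", "s4")
)

# TRUE for each column where all observations are 0
all_zero <- sapply(mydata, function(col) all(col == 0))

# Keep the remaining columns; drop = FALSE keeps a data frame
# even if only one column is left
reduced <- mydata[, !all_zero, drop = FALSE]

# Rerun the metric MDS on the reduced data
fit <- cmdscale(dist(reduced), eig = TRUE, k = 2)
```

To analyse only a select few variables instead, the columns can be picked by name, e.g. mydata[, c("a", "c")].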

That's all for now.
Ciao!

Monday, September 12, 2011

Trend line using lines(lowess())

For univariate data there may be a need to visualize the trending nature of the data. When dealing with diurnal ambient temperatures, one will find that they fluctuate quite a bit. If you want to find the average ambient temperature for the day, disregarding the amount of time that it was actually that temperature, then you could sum the maximum and the minimum, divide by 2 and get an idea. But if you wish to find the average temps for the last month with recorded max and mins for each day, it might get a bit tedious.
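The max-plus-min-over-two arithmetic is easy to hand over to R, which takes care of the tedious part. A minimal sketch with three invented days of readings (the column names max.temp and min.temp are assumptions):

```r
# Made-up daily maxima and minima, in degrees
temps <- data.frame(
  max.temp = c(25, 28, 22),
  min.temp = c(11, 14, 10)
)

# Daily estimate: (max + min) / 2, computed for every row at once
temps$mid <- (temps$max.temp + temps$min.temp) / 2

# Average of the daily estimates over the whole period
period_mean <- mean(temps$mid)
```

With a month of data the code stays the same; only the number of rows changes.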

This post has used information provided here.

R lets you graph univariate data as a connected line using the lines() function. This is useful at times. However, the trend line can be misleading since the variable is independent: the trend line produced here is a smoothed local average of the data points, and there is no relationship between the axes (the x-axis is just the observation index).

To simply graph the data points, I will use the following method - assuming the data has been imported to 'mydata', and the temperatures are stored with the variable name 'mmax.temp' (NB: R doesn't like spaces or '-' in column names and read.csv() will replace them with '.'):

> plot(mydata$mmax.temp, type="o", col="blue")

where type = the type of symbols and lines to use ("o" draws both points and lines, overplotted)
col = the colour of the symbols and lines

This will produce a graph like this:

To add a trend line using just the temperatures, use lines(lowess()). lowess() is the method for calculating the smoothed averages (a locally weighted regression smoother) - I'm certain other methods exist but I know only this one for now:

> lines(lowess(mydata$mmax.temp, f=0.2))

where mydata$mmax.temp = your data$variable
f = the smoother span, i.e. the proportion of points that influence the smooth at each value - the greater the value, the smoother the line

and this will superimpose a trend line on the graph we just created.

You can play around with the f value to your liking, but for what I've been working with, there is minimal difference. If you wish to read more about the lowess() function go here.
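To try the two steps without a data file, here is a sketch on simulated temperatures. The data are random numbers around a sine curve, generated purely for illustration; only the column name mmax.temp is taken from the post.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated daily maxima (assumption: 60 days, noisy seasonal curve)
mydata <- data.frame(
  mmax.temp = 20 + 5 * sin(seq(0, 2 * pi, length.out = 60)) + rnorm(60)
)

# Points joined by lines, plotted against the observation index
plot(mydata$mmax.temp, type = "o", col = "blue")

# Two trend lines with different smoother spans
tight <- lowess(mydata$mmax.temp, f = 0.2)  # follows the data more closely
loose <- lowess(mydata$mmax.temp, f = 0.8)  # smoother

lines(tight, col = "red")
lines(loose, col = "darkgreen")
```

lowess() returns a list with x and y components, which is why its result can be handed directly to lines().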

Now go make some trending lines ;D!

Ciao!

Tuesday, September 6, 2011

Read in a .R script file

Sometimes you will need to run several different commands in order to reach your desired output. If you are running the same set of code over and over again, changing only a single parameter, then this can be type-heavy. So why not just make a script file? Once you've done so, all you will need to do is make the single change, then read in the script file and let R do the rest.

To read in your awesome script file just use source():

> source('/path/to/.R')

As you can see, you will most likely not need any further options. But if you want to see those options that are available, just run help(source) to bring up the documentation.
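A small sketch of the whole round trip, for anyone who wants to see source() work before writing a real script. The script is written to a temporary file here only so the example is self-contained; the variable names n and total are invented.

```r
# Create a throwaway script file (a stand-in for your own .R file)
script <- tempfile(fileext = ".R")
writeLines(c(
  "n <- 10                   # the single parameter you would change",
  "total <- sum(seq_len(n))  # the rest of the analysis"
), script)

# Read in and run the script; by default its variables
# end up in the global workspace
source(script)
```

After source() returns, total is available at the prompt just as if the two lines had been typed by hand.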

Ciao!

Saturday, September 3, 2011

Simple correlation matrix

So we're getting better at using R now. Currently, I just load the R buffer from the Linux terminal by typing in 'R'.

But today, I want to show how to perform a simple correlation between two columns of data. These two columns of data could be the height of giraffes and the length of their legs. Or maybe even the number of people at a beach and the temperature recorded on that same day.

The command I will be using is cor. See here for details about this command and finer details about the output (sorry, but a lot of the technical stuff is technical to me too).

NB: your data needs to be arranged like this for the correlation to work using this method

     Darling  Gwydir
1          5       1
2         24      59
3          0       0
4          0       0
5          6      52
6        336       8
7        314      29
8          0       0
9         36      50
10        85     200
11      5291     406
12         0       0
13        57     231
14         0       8
14    0              8

Once your data has been imported to 'data',

> data <- read.csv('/path/to/file.csv', header=TRUE, row.names=1)

you can use 'cor' to do pairwise comparisons of all the data vectors. Of course, if you happen to have more than 2 columns, the method doesn't change, you will just get a larger matrix than the one below.

> cor(data)
                Darling    Gwydir 
Darling      1.00000000 0.7878988 
Gwydir       0.78789880 1.0000000
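To reproduce this without a .csv file, the two columns from the table above can be typed in directly. This is only a sketch: the data frame is built by hand, and the Spearman line is an optional extra that was not part of the original analysis.

```r
# The Darling and Gwydir columns from the table, entered by hand
data <- data.frame(
  Darling = c(5, 24, 0, 0, 6, 336, 314, 0, 36, 85, 5291, 0, 57, 0),
  Gwydir  = c(1, 59, 0, 0, 52, 8, 29, 0, 50, 200, 406, 0, 231, 8)
)

# Pairwise correlation matrix of all columns (Pearson by default)
m <- cor(data)

# The same coefficient for just the two vectors
r <- cor(data$Darling, data$Gwydir)

# Rank-based alternative, less sensitive to a single very large value
rs <- cor(data$Darling, data$Gwydir, method = "spearman")
```

The off-diagonal entry of m and the value of r are the same number; the matrix form just scales better once there are more columns.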

Another useful command is 'symnum'. The output is a compact table with symbols indicating the level of correlation (the example below is from a larger dataset with six columns). Neat!

> symnum(cor(data))
             D G N Mc L Mr
Darling      1            
Gwydir       , 1          
Namoi        B + 1        
Macquarie      .   1      
Lachlan      B + B    1   
Murrumbidgee   .   B    1 
attr(,"legend")
[1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1

Enjoy! Till next time!

Wednesday, August 31, 2011

Count the number of observations for a specified Variable

Data that is not numerical can be a challenge to use in R - I have been finding this out over the last few days. The example data is for a venue which records the days on which it holds a gig. Example:

Gig#     Day
1       Mon
2       Mon
3       Wed
4       Sat
5       Fri

At the moment, all I want to do is be able to count the number of gigs (Of course, the answer is 5 here, but I need R to be able to count that correctly too).

Solution: use table(). Assume the data has been imported to the variable of your choice, 'data'. To look at a single variable, specify it with $, i.e. data$variable-of-interest:

> table(data)
> table(data$Day)

The output should look something like this - this is a dataset I am working on atm.

      Day
Year   Fri Mon Sat Sun Thu Tue Wed
  2010  11   1  16   8   1   1   5
  2011  10   1  11   5   1   1   2
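Using the five example gigs from the start of this post, a minimal sketch of the counting. The data frame is typed in by hand, and the factor step at the end is an optional extra (an assumption about what might be wanted) that puts the days in calendar order instead of alphabetical order.

```r
# The five example gigs, entered by hand
data <- data.frame(
  Gig = 1:5,
  Day = c("Mon", "Mon", "Wed", "Sat", "Fri")
)

# Total number of gigs = number of rows
n_gigs <- nrow(data)

# Number of gigs on each day that occurs in the data
per_day <- table(data$Day)

# Optional: fix the order of the days and include days with no gigs
days <- factor(data$Day,
               levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
per_day_ordered <- table(days)
```

table() sorts character data alphabetically, which is why the output above starts with Fri; supplying the levels explicitly overrides that.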

To be continued...

Sunday, August 14, 2011

Import .csv with header

First rule, always make the .csv file with a header. Later on, this header is used to call the specific data.

Second, for a univariate dataset (100 observations, 1 variable), use \n as the delimiter. Basically, put the 100 observations on 100 lines.

Example for football player height:

col1
1.9
2.1
2.0
1.9
1.9

Now to import the data into R. Open R, use the read.csv command, and make sure to put the filename in quotes (single or double quotes both work). Without quotes, R will look for an object with that name instead of a file.

> data <- read.csv('/path/to/file.csv', header = TRUE)

To check the size and structure of the dataset, use the str() command:

> str(data)
'data.frame':	100 obs. of  1 variable:
 $ col1: num  -0.1128 -0.4808 -0.0156 -0.2525 0.0834 ...

As you can see, the 'num' that appears after '$ col1:' indicates that the data has been imported as numerical data. Perform a hist() on your data now using the name of the dataset (data) and the variable you wish to test (col1), i.e. hist(data$col1).
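The steps in this post can be tried end to end with the five football player heights from the example. The file is written to a temporary location here only so the sketch runs anywhere; with a real file, the path simply replaces the csv variable.

```r
# Write the example heights to a temporary .csv, one observation per line
csv <- tempfile(fileext = ".csv")
writeLines(c("col1", "1.9", "2.1", "2.0", "1.9", "1.9"), csv)

# Import with the first line treated as the header
data <- read.csv(csv, header = TRUE)

# Check the size and type of what was imported
str(data)

# Histogram of the single variable
hist(data$col1)
```

If str() reports 'num' for col1, the import worked as intended and the column can be used in numerical functions such as mean() or hist().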

Ciao

Link to first R...

Just the link to the other R post that I made - I think I need to expand my knowledge here.

http://dirtyhabanero.blogspot.com/2011/05/rkward-cli-useful-commands.html
