8 Representing Data in R
In this exercise, we will go through some aspects of linear algebra and how to implement them in R. We will also look at how to represent data and encode it properly for applications. This will teach you to implement statistical methodologies effectively in R.
After this initial introduction to the concepts, we will implement some linear algebraic manipulations. In this section you should check whether you can repeat the calculations by hand. You will have the solutions in R and can check that you reach the same results by hand.
Some of the ideas will be familiar to you, but we will phrase them in the linear algebra context. We will use data built into R or generate our own data.
8.1 Representing data as vectors
The first thing we will revisit is vectors. This is quite basic and you have already encountered it, so think of this as a refresher. We create a basic vector using c() and assign it to a variable using <-. Let us create a vector \(x\) going from 1 to 3. We can also use the function seq() to create a more elaborate vector. Let us create another vector, \(y\), that is made up of \(x\) followed by a sequence from 10 to 120 in steps of 20. Some functions also return vectors; a simple example is generating random numbers using rnorm(). Let us create another vector, \(z\), which consists of 7 random numbers drawn from \(\operatorname{N}(0,1)\).
Hint: Use ?seq (or ?rnorm) to find more information on a function. Make use of this in everything that follows.
#> [1] 1 2 3
#> [1] 1 2 3 10 30 50 70 90 110
#> [1] 0.96356355 0.05798104 0.85412413 -0.78319170
#> [5] 0.16486881 -1.65757463 0.76259245
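One way to produce vectors like those printed above is sketched below. The exact calls are an assumption: 1:3 is chosen because the colon operator returns an integer vector, and the seed passed to set.seed() is arbitrary, so the random values in \(z\) will differ from the ones shown.

```r
# x: the integers 1 to 3 (the colon operator returns an integer vector)
x <- 1:3

# y: x followed by the sequence 10, 30, ..., 110
y <- c(x, seq(from = 10, to = 120, by = 20))

# z: 7 draws from a standard normal distribution
set.seed(1) # arbitrary seed (an assumption), used only for reproducibility
z <- rnorm(7, mean = 0, sd = 1)

print(x)
print(y)
print(z)
```

Note that seq() stops at 110 because the next step, 130, would exceed the upper limit of 120.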
Whenever you implement vector operations in R, you will be manipulating data. It is important that the data is made up of numbers and nothing else, otherwise the operations will not work. One way of checking that what we have created is a vector of numbers is the function class(). If the vector only contains numbers, the response will be integer or numeric. We can check what this looks like for the vectors we have created.
Note: We can nest functions in R, so that the result of the inner function is passed directly to the outer one. We make use of that here.
# Check what each of the vectors is made of.
# We nest the print function and class function to create one output.
print(class(x))
#> [1] "integer"
print(class(y))
#> [1] "numeric"
print(class(z))
#> [1] "numeric"
What happens when one of the entries in the vector is a string? We will create a new vector x1 based on \(x\) and y2 based on \(y\) to test what happens to the vector.
#> [1] "character"
#> [1] "character"
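A minimal sketch of this coercion follows. The strings "a" and "missing" are hypothetical entries chosen for illustration; any string has the same effect.

```r
x <- 1:3
y <- c(x, seq(10, 120, by = 20))

# Adding a single string turns every entry into a character string
x1 <- c(x, "a")       # "a" is a hypothetical entry
y2 <- c(y, "missing") # "missing" is a hypothetical entry

print(x1)
print(class(x1))
print(class(y2))

# is.numeric() is a direct test for "made up of numbers"
print(is.numeric(x))
print(is.numeric(x1))

# as.numeric() converts back; entries that are not numbers become NA
x1_num <- suppressWarnings(as.numeric(x1))
print(x1_num)
```

A vector in R can only hold one type of data, which is why the numbers are converted rather than stored alongside the string.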
This is not a vector we can use for manipulations as you will see in the next section.
8.2 Vector operations
8.2.1 Addition
Next we look at operations we can perform with vectors. There are a few simple ones we will explore, and we will also see what errors can occur. You may already have noticed that R sometimes behaves in unexpected ways; this can be useful, but it can also cause errors. You need to be aware of these behaviours so that you can exploit them or recognise when something has gone wrong.
The first thing we look at is addition. We can add two vectors of the same length. The first quirk in R is that we can also add two vectors of different lengths when the length of one is a multiple of the length of the other. This is important to remember, as it does not follow the rules we looked at for linear algebra. The simplest such case is adding a single number to a vector. We will also look at the warning R gives when the length of the longer vector is not a multiple of the length of the shorter one.
#> [1] 2 4 6
#> [1] 3 4 5
#> [1] 2 4 6 11 32 53 71 92 113
#> [1] 3
#> [1] 9
#> [1] 9
#> Warning in x + z: longer object length is not a multiple of
#> shorter object length
#> [1] 1.9635636 2.0579810 3.8541241 0.2168083 2.1648688
#> [6] 1.3424254 1.7625924
We can add \(x\) and \(y\) only because the length of one is a multiple of the length of the other: the vector \(x\) has length 3 and the vector \(y\) has length 9. The addition is performed without error or warning. The shorter vector is replicated (recycled) to perform the addition, and the result has the same length as the longer vector. The final thing we tried is adding two vectors whose lengths are not multiples of each other. The code runs and produces a result, but you get a warning message. This is one of the reasons to pay attention to warnings and to check each step of the functions and code you write.
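The recycling behaviour can be sketched as follows. The vector w and the helper strict_add() are hypothetical names introduced here for illustration; the helper is one possible guard, not a built-in R function.

```r
x <- 1:3
y <- c(x, seq(10, 120, by = 20))
w <- c(10, 20, 30, 40, 50, 60, 70) # hypothetical vector of length 7

print(x + x) # same length: element-by-element sum
print(x + 2) # the single number is recycled for every entry
print(x + y) # x is recycled three times to match the length of y

# Capture the warning as text instead of letting it pass unnoticed
msg <- tryCatch(x + w, warning = function(cond) conditionMessage(cond))
print(msg)

# Hypothetical helper: refuse to add vectors of different lengths
strict_add <- function(a, b) {
  stopifnot(length(a) == length(b))
  a + b
}
print(strict_add(x, x))
```

A guard like this trades convenience for safety: it follows the linear algebra rule that only vectors of equal length can be added.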
8.2.2 Multiplication
The second important operation is multiplication. We can multiply a vector by a numerical value. We can also multiply two vectors of the same length element by element, and two vectors of different lengths when the length of one is a multiple of the length of the other. This operation has no equivalent in the linear algebra we have studied, but it can be useful and can provide a shortcut in many applications. We can of course also multiply the transpose of a vector with a vector of equal length, which results in a single number (R returns it as a 1-by-1 matrix). For this we use the special product sign %*%. Finally, we can check what happens if we multiply a numeric vector with a character vector.
#> [1] 2 4 6
#> [1] 1 4 9
#> [1] 1 4 9 10 60 150 70 180 330
#> [,1]
#> [1,] 14
#> Error in x * x1: non-numeric argument to binary operator
You get an error message for the final product because one of the vectors is not a numeric vector. There are many other useful commands: sum(x) will sum the values of a vector, and max(x) or min(x) will find the maximum or minimum entry of a vector. There are many others we do not have time to go into.
8.3 Representing data as a matrix
An important part of linear algebra is of course matrices. In R a matrix is simply data arranged in two dimensions as a square or rectangle. Here is an example of a simple matrix:
\[ A = \begin{bmatrix} 1 & 3 & 4\\ 5 & 9 & 2 \end{bmatrix} \]
In R we create a matrix using the matrix() function. We first provide a data vector, and then we specify how many rows and columns the matrix has and in which order the matrix is filled with the data: either one row at a time or one column at a time.
# Create a matrix A using the specified vector
A <- matrix(
  c(1, 3, 4, 5, 9, 2), # data
  nrow = 2,            # 2 rows
  ncol = 3,            # 3 columns
  byrow = TRUE         # fill by row
)
# Create a matrix B using the specified vector
B <- matrix(
  c(1, 3, 4, 5, 9, 2), # data
  nrow = 2,            # 2 rows
  ncol = 3,            # 3 columns
  byrow = FALSE        # fill by column
)
# Create a matrix C using the specified vector
C <- matrix(
  c(1, 3, 4, 5, 9, 2), # data
  nrow = 3,            # 3 rows
  ncol = 2,            # 2 columns
  byrow = FALSE        # fill by column
)
# Create a matrix D using the specified vector
D <- matrix(
  c(1, 3, 4, 5, 9, 2), # data (6 values, reused to fill 12 cells)
  nrow = 6,            # 6 rows
  ncol = 2,            # 2 columns
  byrow = FALSE        # fill by column
)
# print matrices A, B, C, and D
print(A)
print(B)
print(C)
print(D)
#> [,1] [,2] [,3]
#> [1,] 1 3 4
#> [2,] 5 9 2
#> [,1] [,2] [,3]
#> [1,] 1 4 9
#> [2,] 3 5 2
#> [,1] [,2]
#> [1,] 1 5
#> [2,] 3 9
#> [3,] 4 2
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 3 3
#> [3,] 4 4
#> [4,] 5 5
#> [5,] 9 9
#> [6,] 2 2
You can see that it is not only important what the data vector looks like but also how we fill the matrix. You can see different options when we created matrices \(A\), \(B\), and \(C\). We can create matrices with different dimensions and fill them up differently. It is important to know which matrix we are trying to create and to ensure we are doing it correctly.
Always print it (if not too large) to check you have done it correctly.
The other thing you will have seen when creating \(D\) is that R reuses (recycles) your data vector when the matrix has more elements than the data vector. It does so without warning when the number of elements in the matrix is an exact multiple of the length of the data vector, and with a warning when it is not.
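The following sketch contrasts the silent and the noisy case. The data vector 1:5 and the length check at the end are assumptions added for illustration; the exact wording of the warning can differ between R versions, so it is captured and printed rather than quoted.

```r
v <- c(1, 3, 4, 5, 9, 2)

# 12 cells but only 6 data values: the data is recycled silently
D <- matrix(v, nrow = 6, ncol = 2)
print(dim(D))

# 6 cells but 5 data values: R still fills the matrix, with a warning
warn <- tryCatch(
  matrix(1:5, nrow = 2, ncol = 3),
  warning = function(cond) conditionMessage(cond)
)
print(warn)

# A simple guard: check the data length before building the matrix
n_row <- 2
n_col <- 3
stopifnot(length(v) == n_row * n_col)
A <- matrix(v, nrow = n_row, ncol = n_col, byrow = TRUE)
print(A)
```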
We use dim() to check the dimensions of the data. For instance, dim(A) = 2, 3 and dim(D) = 6, 2. Another useful feature is that we can name rows and columns to identify what the data we have created represents. We can do this in two different ways: using dimnames(), or using a combination of colnames() and rownames().
# use dimnames to create both in one go using a list of vectors
dimnames(A) <- list(
c("sample1", "sample2"), # rows
c("patient1", "patient2", "patient3") # columns
)
# print matrix A
print(A)
#> patient1 patient2 patient3
#> sample1 1 3 4
#> sample2 5 9 2
# create column names for B
colnames(B) <- c("gene1", "gene2", "gene3")
# print matrix B
print(B)
#> gene1 gene2 gene3
#> [1,] 1 4 9
#> [2,] 3 5 2
# create row names for C
rownames(C) <- c("hospital1", "hospital2", "hospital3")
# print matrix C
print(C)
#> [,1] [,2]
#> hospital1 1 5
#> hospital2 3 9
#> hospital3 4 2
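Once rows and columns are named, entries can be selected by name instead of by position. The sketch below rebuilds \(A\) with the names used above so that it runs on its own.

```r
A <- matrix(c(1, 3, 4, 5, 9, 2), nrow = 2, ncol = 3, byrow = TRUE)
dimnames(A) <- list(
  c("sample1", "sample2"),              # rows
  c("patient1", "patient2", "patient3") # columns
)

# a single entry, selected by row name and column name
print(A["sample2", "patient2"])

# a whole row; the result is a named vector
print(A["sample1", ])

# two columns; the result is still a matrix
print(A[, c("patient1", "patient3")])

# the names can be read back at any time
print(rownames(A))
print(colnames(A))
```

Selecting by name keeps working even if the order of rows or columns changes later.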
This can be very useful when working with data. We can also create a matrix based on a data frame. We will make use of the iris data set, which is built into R. This is useful when the data is not made up of just numerical values and you need to perform linear algebraic manipulations. We can convert a subset of a data.frame to a matrix using the as.matrix() command.
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
# Create a matrix of the first 3 columns and the first 10 rows
iris_matrix <- as.matrix(iris[1:10, 1:3])
print(iris_matrix)
#> Sepal.Length Sepal.Width Petal.Length
#> 1 5.1 3.5 1.4
#> 2 4.9 3.0 1.4
#> 3 4.7 3.2 1.3
#> 4 4.6 3.1 1.5
#> 5 5.0 3.6 1.4
#> 6 5.4 3.9 1.7
#> 7 4.6 3.4 1.4
#> 8 5.0 3.4 1.5
#> 9 4.4 2.9 1.4
#> 10 4.9 3.1 1.5
# Create a matrix of the first 10 rows and all columns
iris_full <- as.matrix(iris[1:10, ])
print(iris_full)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 "5.1" "3.5" "1.4" "0.2"
#> 2 "4.9" "3.0" "1.4" "0.2"
#> 3 "4.7" "3.2" "1.3" "0.2"
#> 4 "4.6" "3.1" "1.5" "0.2"
#> 5 "5.0" "3.6" "1.4" "0.2"
#> 6 "5.4" "3.9" "1.7" "0.4"
#> 7 "4.6" "3.4" "1.4" "0.3"
#> 8 "5.0" "3.4" "1.5" "0.2"
#> 9 "4.4" "2.9" "1.4" "0.2"
#> 10 "4.9" "3.1" "1.5" "0.1"
#> Species
#> 1 "setosa"
#> 2 "setosa"
#> 3 "setosa"
#> 4 "setosa"
#> 5 "setosa"
#> 6 "setosa"
#> 7 "setosa"
#> 8 "setosa"
#> 9 "setosa"
#> 10 "setosa"
We can see that if the data we pass to the as.matrix() command contains anything other than numerical values, all entries are converted to characters. Matrix manipulations are then not possible, so you have to be careful when creating such a matrix.
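One way to avoid this, sketched here as a suggestion, is to select the numeric columns before converting. The variable names numeric_cols and iris_numeric are chosen for illustration.

```r
# TRUE for every column of iris that holds numbers, FALSE otherwise
numeric_cols <- sapply(iris, is.numeric)
print(numeric_cols)

# convert only the numeric columns of the first 10 rows
iris_numeric <- as.matrix(iris[1:10, numeric_cols])

print(is.numeric(iris_numeric)) # the matrix holds numbers, not strings
print(dim(iris_numeric))

# numerical operations now work, for example column means
print(colMeans(iris_numeric))
```

Checking is.numeric() on the result is a quick safeguard before any matrix manipulation.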
8.4 Matrix operations
Now we can look at operations on matrices. We can create the transpose of a matrix using the t() command. All the normal arithmetic operations, like addition and subtraction, work element by element on matrices of the same dimensions. There are multiple ways to perform multiplication. If we use the simple product * operation, it multiplies both matrices element by element. We did not cover this in the lectures, but it is called the Hadamard product. We can implement the matrix product using %*%, and we can also use this to multiply a vector with a matrix.
#> patient1 patient2 patient3
#> sample1 0 -1 -5
#> sample2 2 4 0
#> patient1 patient2 patient3
#> sample1 2 7 13
#> sample2 8 14 4
#> Error in A + D: non-conformable arrays
#> patient1 patient2 patient3
#> sample1 1 3 4
#> sample2 5 9 2
#> gene1 gene2 gene3
#> [1,] 1 4 9
#> [2,] 3 5 2
#> patient1 patient2 patient3
#> sample1 1 12 36
#> sample2 15 45 4
#> Error in A %*% B: non-conformable arguments
#> patient1 patient2 patient3
#> sample1 1 3 4
#> sample2 5 9 2
#> gene1 gene2 gene3
#> [1,] 1 4 9
#> [2,] 3 5 2
#> [,1] [,2]
#> hospital1 1 5
#> hospital2 3 9
#> hospital3 4 2
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 3 3
#> [3,] 4 4
#> [4,] 5 5
#> [5,] 9 9
#> [6,] 2 2
#> [,1] [,2]
#> sample1 26 40
#> sample2 40 110
#> Error in A %*% B: non-conformable arguments
#> patient1 patient2 patient3
#> hospital1 26 48 14
#> hospital2 48 90 30
#> hospital3 14 30 20
#> Error in A %*% D: non-conformable arguments
#> [1] 3
#> [1] 2 3
#> patient1 patient2 patient3
#> sample1 1 9 8
#> sample2 10 9 6
#> [,1]
#> sample1 19
#> sample2 29
#> Error in x %*% A: non-conformable arguments
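The output above is consistent with calls along the following lines. This is a sketch rather than the original code: the matrices are rebuilt without row and column names so that the block runs on its own, and the error is captured with tryCatch().

```r
A <- matrix(c(1, 3, 4, 5, 9, 2), nrow = 2, byrow = TRUE) # 2 x 3, filled by row
B <- matrix(c(1, 3, 4, 5, 9, 2), nrow = 2)               # 2 x 3, filled by column
C <- matrix(c(1, 3, 4, 5, 9, 2), nrow = 3)               # 3 x 2, filled by column
x <- 1:3

print(t(A))    # transpose: 3 x 2
print(A - B)   # element-by-element difference
print(A + B)   # element-by-element sum
print(A * B)   # Hadamard (element-by-element) product
print(A %*% C) # matrix product: (2 x 3) times (3 x 2) gives 2 x 2
print(C %*% A) # matrix product: (3 x 2) times (2 x 3) gives 3 x 3
print(A %*% x) # x is treated as a column vector: result is 2 x 1

# A and B are both 2 x 3, so the inner dimensions do not match
err <- tryCatch(A %*% B, error = function(e) conditionMessage(e))
print(err)
```

The rule for %*% is the one from linear algebra: the number of columns of the left factor must equal the number of rows of the right factor.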
Q1: Check the product A * x. Does it behave as you expect? What about x * A?
There are several other operations that can be useful for matrices. We will not be able to go through all of them. Here are a few you should explore:
8.5 Testing application
We will create a few matrices and vectors here. You should then implement some arithmetic operations in R and perform the same calculations by hand to check that you reach the same results.
# Creating matrices
A1 <- matrix(
  1:9,
  nrow = 3
)
A2 <- matrix(
  c(1, 3, 4, 6, 9, 2, 1, 0, 3),
  nrow = 3
)
# Creating vectors
v1 <- c(0, 1, 1)
v2 <- c(2, 1, 0)
v3 <- c(3, 1, 1)
x1 <- c(1, 2, 0, 1)
x2 <- c(2, 3, 1, 1)
x3 <- c(4, 1, 2, 0)
# Another way to create matrices
B1 <- cbind(v1, v3, v2)
B2 <- rbind(v1, v2, v3)
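To see what cbind() and rbind() do with the vectors, the short sketch below prints the results and their names. The vectors are repeated so that the block runs on its own.

```r
v1 <- c(0, 1, 1)
v2 <- c(2, 1, 0)
v3 <- c(3, 1, 1)

# cbind() places each vector in a column, in the order given
B1 <- cbind(v1, v3, v2)
print(B1)
print(colnames(B1)) # the columns are named after the vectors

# rbind() places each vector in a row, in the order given
B2 <- rbind(v1, v2, v3)
print(B2)
print(rownames(B2)) # the rows are named after the vectors

# binding by row is the transpose of binding by column
print(all(t(B2) == cbind(v1, v2, v3)))
```

Note that the order of the arguments matters: B1 has v3 as its second column, not v2.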
Q2: Calculate the following matrices.
A1 + A2
A1 * A2
A1 - A2
t(A1)
t(B2)
Q3: Compute the norms of all vectors.
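As a reminder, the Euclidean norm of a vector is the square root of the sum of its squared entries. A minimal sketch, assuming the Euclidean norm is the one intended and using a hypothetical vector w and a hypothetical helper euclid_norm():

```r
w <- c(3, 4) # hypothetical example vector

# hypothetical helper: Euclidean norm of a numeric vector
euclid_norm <- function(v) {
  sqrt(sum(v^2))
}

print(euclid_norm(w))

# the same value via the inner product of w with itself
print(sqrt(drop(w %*% w)))
```

For w = (3, 4) both expressions give 5, which is easy to confirm by hand.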
Q4: Compute the inverse of A1, A2, and B3. You don’t need to invert the matrices by hand, but check that each result really is an inverse.
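In R, solve() applied to a single square matrix returns its inverse. The sketch below uses a hypothetical 2-by-2 matrix M, not one of the exercise matrices, to show how such a check can be done.

```r
M <- matrix(c(2, 1, 1, 3), nrow = 2) # hypothetical invertible matrix

M_inv <- solve(M)
print(M_inv)

# an inverse must give the identity matrix from both sides
print(M %*% M_inv)
print(M_inv %*% M)

# compare with the identity using all.equal(), which tolerates rounding error
print(isTRUE(all.equal(M %*% M_inv, diag(2))))
print(isTRUE(all.equal(M_inv %*% M, diag(2))))

# a matrix is invertible only if its determinant is non-zero
print(det(M))
```

If the determinant is zero, the matrix is singular and solve() stops with an error instead of returning an inverse.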
Q5: Compute A1 %*% v1 and v1 %*% A1
Q6: Compute v2 %*% B2 and B2 %*% v2