Tutorial for the use of package "regpro"

Introduction

We start the tutorial with data visualization tools such as QQ plots, tail plots, and two and three dimensional scatter plots. Then we give an introduction to computing estimators in linear regression, kernel regression, local linear regression, additive models, single index models, and forward stagewise modeling. Finally we introduce the computation of linear and kernel quantile regression estimators. The functions are available in the R package ``regpro''. This package can be used to learn and to implement regression methods. We also introduce functions of the package ``denpro'', which contains tools for data visualization, function visualization, and density estimation.

1 Data Visualization

1.1 QQ Plots

QQ plots are defined in Section 6.1.2 of the book. First we generate data from the t-distribution with 6 degrees of freedom. It is useful to set the seed of the random number generator with the function ``set.seed'', so that the results can be reproduced later. In the following sections of this tutorial we skip the call to ``set.seed'' to save space.

set.seed(1)
dendat<-rt(1000,df=6)

We compare the sample to the normal distribution. The QQ plot is now the plot of the points

( x_{(i)} , F^{-1}(p_i) ) , i=1, … ,n ,

where

x_{(1)} ≤ … ≤ x_{(n)}

is the ordered sample, p_i = (i - 1/2) / n,

and F is the distribution function of the normal distribution; in the code below the mean and standard deviation of F are estimated from the sample. We also plot the line y=x with the function ``segments''.

x<-dendat[order(dendat)]
n<-length(x) 
p<-(seq(1:n)-1/2)/n
y<-qnorm(p,mean=mean(x),sd=sd(x))
plot(x,y)
segments(-6,-6,6,6) 
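
For comparison, base R provides the functions ``qqnorm'' and ``qqline'', which produce a similar plot against the quantiles of the standard normal distribution.

qqnorm(dendat)   # normal QQ plot of the sample
qqline(dendat)   # reference line through the first and third quartiles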

1.2 Tail Plots

Tail plots are defined in Section 6.1.2 of the book. We show how a left tail plot can be made; a right tail plot can be made analogously, as sketched after the code below. The left tail consists of the observations that lie to the left of the median. The observations are ordered before plotting. The left tail plot is a scatter plot of the observations in the left tail against their level, defined by the ranks of the observations. A logarithmic scale is used for the y-axis.

split<-median(dendat)
left.tail<-dendat[(dendat < split)] 
ord<-order(left.tail,decreasing=TRUE)
ordered.left.tail<-left.tail[ord]
level<-seq(length(left.tail),1)
plot(ordered.left.tail,level,log="y")
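
A right tail plot can be made with analogous code; a minimal sketch:

right.tail<-dendat[(dendat > split)]
ord<-order(right.tail,decreasing=FALSE)
ordered.right.tail<-right.tail[ord]
level<-seq(length(right.tail),1)
plot(ordered.right.tail,level,log="y")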

1.3 Two Dimensional Scatter Plots

The function ``plot'' can be used to draw two dimensional scatter plots.

n<-20000
dendat<-matrix(rnorm(2*n),n,2)
plot(dendat)

When the sample size n is large it is useful to plot binned data, as in Figure 6.1(b) of the book. Bin counts can be calculated with the function ``pcf.histo'' in package ``denpro''. Parameter N gives the number of bins for each direction.

N<-c(100,100)
pcf<-pcf.histo(dendat,N)

Then we transform the bin counts to the interval [0,1], map them to a scale of gray values, and use the function ``plot.histo'' from the package ``denpro''.

f<-sqrt(pcf$value)
colo<-1-(f-min(f)+0.5)/(max(f)-min(f)+0.5)
col<-gray(colo)
plot.histo(pcf,col=col)

Also available are the function ``hexbin'' in the R package ``hexbin'', which uses hexagonal binning, and the function ``hist2d'' in the R package ``gplots''.
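
For example, assuming these packages are installed (argument defaults may vary between package versions):

library(hexbin)
plot(hexbin(dendat[,1],dendat[,2]))      # hexagonal binning
library(gplots)
hist2d(dendat[,1],dendat[,2],nbins=100)  # rectangular binning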

1.4 Three Dimensional Scatter Plots

Let us simulate three dimensional data.

dendat<-matrix(rnorm(3*20000),20000,3)

Then we calculate a three dimensional histogram.

N<-c(100,100,100)
pcf<-pcf.histo(dendat,N)

We use the function ``histo2data'' to make data points from the histogram: the centers of the bins are the new data points. We rotate the new data and project it to two dimensions. Before plotting the data, we order it so that the observations with the largest values of the third coordinate are plotted last.

hd<-histo2data(pcf)
alpha<-pi; beta<-pi; gamma<-0
rotdat<-rotation3d(hd$dendat,alpha,beta,gamma)
i1<-1; i2<-2; i3<-3
ord<-order(rotdat[,i3]); plotdat<-rotdat[ord,c(i1,i2)]
plot(plotdat,col=hd$col[ord])

2 Linear Regression

In a two dimensional linear model the response variable Y satisfies

Y = β_0 + β_1 X_1 + β_2 X_2 + ε ,

where X_1 and X_2 are the explanatory variables and ε is an error term.

First we simulate data from a linear model. Let the regression coefficients be β_0=0, β_1=1, and β_2 = -2. Let the explanatory variable X=(X_1,X_2) be uniformly distributed on [0,1]^2 and let the error term be ε ∼ N(0,σ^2). The sample size is n and the number of explanatory variables is d=2.

n<-100
d<-2
x<-matrix(runif(n*d),n,d)
y<-matrix(x[,1]-2*x[,2]+0.1*rnorm(n),n,1)

The least squares estimator of linear regression coefficients was defined in equation (2.10) of the book as

\hat β = ({\bf X}'{\bf X})^{-1} {\bf X}'{\bf y},

where {\bf X} is the n× (d+1) matrix whose first column consists of ones and whose other columns are the observations of the d explanatory variables, and {\bf y} is the n× 1 vector of the observed values of the response variable. In the code below, we first insert the column vector of ones into the original n× d matrix x. The matrix transpose is calculated with the function t(), the matrix multiplication is calculated with the operator %*%, and the matrix inversion is calculated with the function ``solve''.

X<-matrix(0,n,d+1)
X[,1]<-1
X[,2:(d+1)]<-x
XtX<-t(X)%*%X
invXtX<-solve(XtX,diag(rep(1,d+1)))
beta<-invXtX%*%t(X)%*%y

The ridge regression estimator was defined in equation (2.32) of the book as

\hat β = ({\bf X}'{\bf X}+ λ I)^{-1} {\bf X}'{\bf y},

where I is the d× d identity matrix, and λ ≥ 0. In ridge regression the data is normalized to have zero means and unit variances.

Y<-(y-mean(y))/sd(c(y))
X<-scale(x,center=colMeans(x),scale=sqrt(colMeans(x^2)-colMeans(x)^2))
lambda<-10
XtX<-t(X)%*%X+lambda*diag(rep(1,d))
invXtX<-solve(XtX,diag(rep(1,d)))
beta<-invXtX%*%t(X)%*%Y

The above code is included in the function ``linear'' of package ``regpro''.
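
The estimate can also be computed directly with that function; its output components ``beta0'' and ``beta1'' are used in the same way later in this tutorial.

li<-linear(x,y)
li$beta0   # estimated intercept
li$beta1   # estimated coefficients of the two explanatory variables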

3 Kernel Regression

The kernel regression estimator was defined in equation (3.6) of the book. The kernel estimator is

\hat{f}(x) = ∑_{i=1}^n p_i(x) Y_i ,

where

p_i(x) = K_h(x-X_i) ⁄ ∑_{j=1}^n K_h(x-X_j) , i=1,…,n ,

K:{\bf R}^{d} → {\bf R} is the kernel function, K_h(x) = K(x/h)/h^{d}, and h>0 is the smoothing parameter.

3.1 One Dimensional Kernel Regression

First we simulate data. Let the regression function be

f(x) = φ ( x ) + φ (x-3) ,

where φ is the density function of the standard normal distribution. Let the distribution of the explanatory variable X be uniform on [-1,4] and let the error term have the distribution ε ∼ N(0,σ^2). The sample size is n.

n<-500
x<-5*matrix(runif(n),n,1)-1
phi1D<-function(x){ (2*pi)^(-1/2)*exp(-x^2/2) }
func<-function(t){ phi1D(t)+phi1D(t-3) }
y<-matrix(func(x)+0.1*rnorm(n),n,1)

We calculate \hat{f}(x_0), where x_0=0.5. Let us choose the kernel K to be the Bartlett kernel function K(x) = (1-x^2)_+. The code below is implemented in function ``kernesti.regr'' of package ``regpro''.

arg<-0.5 
h<-0.1
ker<-function(x){ (1-x^2)*(x^2<=1) }
w<-ker((arg-x)/h)
p<-w/sum(w)
hatf<-sum(p*y)   # the estimated value
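
For comparison, the same estimate is available as a single function call; note that the default kernel of ``kernesti.regr'' need not be the Bartlett kernel used above, so the value may differ slightly.

kernesti.regr(arg,x,y,h=h)   # package call with the default kernel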

Let us plot the estimate on a grid of points, together with the true regression function and the data. We choose N evaluation points.

N<-40 
t<-5*seq(1,N)/(N+1)-1
hatf<-matrix(0,length(t),1)
f<-hatf
for (i in 1:length(t)){
  hatf[i]<-kernesti.regr(t[i],x,y,h=0.2)
  f[i]<-func(t[i])
}
plot(x,y)                                # data
matplot(t,hatf,type="l",add=TRUE)        # estimate
matplot(t,f,type="l",add=TRUE,col="red") # true function

We can also use the functions ``pcf.kernesti'' and ``draw.pcf'', which allow a more automatic plotting of the function. Function ``draw.pcf'' is included in package ``denpro''.

pcf<-pcf.kernesti(x,y,h=0.2,N=N)
dp<-draw.pcf(pcf)
plot(x[order(x)],y[order(x)])              # data
matplot(dp$x,dp$y,type="l",add=TRUE)       # estimate
matplot(dp$x,func(dp$x),type="l",add=TRUE) # true func.

3.2 Moving Averages

We have defined moving averages of a time series in Section 3.2.4 of the book. The two-sided moving average is defined in equation (3.12) of the book and the one-sided moving average is defined in equation (3.14). These can be calculated with the function ``kernesti.regr'' in the following way. In the case of two-sided moving averages we use a two-sided symmetric kernel K:{\bf R}→ {\bf R}. In the case of one-sided moving averages we use a one-sided nonsymmetric kernel K:[0,∞) → {\bf R}.

Let us simulate data from a GARCH(1,1) model. The GARCH(1,1) model was defined in Section 3.9.3 of the book, see equation (3.65) of the book, as

Y_t = σ_t ε_t,

σ_t^2 = α_0 + α_1 Y_{t-1}^2 + β σ_{t-1}^2 .

We take the parameter values from equation (3.66) of the book, where the estimates for S&P 500 returns were given.

a0<-7.2*10^(-7)
a1<-0.0077
b<-0.92
n<-1000     # sample size
y<-matrix(0,n,1); sigma<-matrix(a0,n,1)
for (t in 2:n){
   sigma[t]<-sqrt( a0+a1*y[t-1]^2+b*sigma[t-1]^2 )
   y[t]<-sigma[t]*rnorm(1)
}

We estimate the conditional variance sequentially, using one-sided exponentially weighted moving averages.

ewma<-matrix(0,n,1)
for (i in 1:n){
   ycur<-matrix(y[1:i]^2,i,1)
   xcur<-matrix(seq(1:i),i,1)
   ewma[i]<-kernesti.regr(i,xcur,ycur,h=10,kernel="exp")
}
plot(ewma,type="l")

To calculate one-sided moving averages we can use the faster function ``ma'', contained in the package ``regpro''.

ewma<-matrix(0,n,1)
for (i in 1:n) ewma[i]<-ma(y[1:i]^2,h=10)
plot(ewma,type="l")
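
A two-sided moving average of the squared returns can be computed with a two-sided symmetric kernel in ``kernesti.regr'' or, for equal weights, directly with the base R function ``filter''; a minimal sketch with a window of 2k+1 observations:

k<-10                                  # half-width of the window
w<-rep(1/(2*k+1),2*k+1)                # equal weights
tsma<-stats::filter(c(y)^2,w,sides=2)  # two-sided moving average of y^2
plot(tsma,type="l")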

3.3 Two Dimensional Kernel Regression

Two dimensional regression function estimates can be visualized with perspective plots and contour plots. In addition, level set tree based methods can be used.

First we simulate data. The regression function is

f(x) = ∑_{i=1}^3 φ(x-m_i) ,

where m_1=(0,0), m_2=D× (1,0), m_3=D× (1/2,\sqrt{3}/2), D=3, and φ is the density function of the standard normal distribution. The distribution of the explanatory variables X=(X_1,X_2) is uniform on [-3,5]^2 and the error term ε has distribution N(0,σ^2).

n<-1000
d<-2 
x<-8*matrix(runif(n*d),n,d)-3
C<-(2*pi)^(-d/2)
phi<-function(x){ return( C*exp(-sum(x^2)/2) ) }
D<-3; c1<-c(0,0); c2<-D*c(1,0); c3<-D*c(1/2,sqrt(3)/2)
func<-function(x){phi(x-c1)+phi(x-c2)+phi(x-c3)}
y<-matrix(0,n,1)
for (i in 1:n) y[i]<-func(x[i,])+0.01*rnorm(1)

We calculate \hat{f}(x_0), where x_0=(0.5,0.5). Let us choose the kernel K to be the standard normal density function. The code below is implemented in function ``kernesti.regr'' of package ``regpro''.

arg<-c(0.5,0.5)
h<-0.5 
ker<-function(x){ return( exp(-rowSums(x^2)/2) ) }
argu<-matrix(arg,dim(x)[1],d,byrow=TRUE)
w<-ker((argu-x)/h)
p<-w/sum(w)
hatf<-sum(p*y)  # the estimate

3.3.1 Perspective Plots and Contour Plots

Let us plot the estimate on a grid of points. A 2D function can be plotted with a perspective plot or with a contour plot. We also plot the true regression function below.

num<-30  # number of grid points in one direction
t<-8*seq(1,num)/(num+1)-3; u<-t
hatf<-matrix(0,num,num); f<-hatf
for (i in 1:num){
   for (j in 1:num){
      arg<-matrix(c(t[i],u[j]),1,2)
      hatf[i,j]<-kernesti.regr(arg,x,y,h=0.5)
      f[i,j]<-phi(arg-c1)+phi(arg-c2)+phi(arg-c3)
   }
}
contour(t,u,hatf)   # kernel estimate
persp(t,u,hatf,ticktype="detailed",phi=30,theta=-30)
contour(t,u,f)      # true function
persp(t,u,f,phi=30,theta=-30,ticktype="detailed")

We can also use the functions ``pcf.kernesti'' and ``draw.pcf'', which allow a more automatic plotting of the function. Function ``draw.pcf'' is in package ``denpro''.

pcf<-pcf.kernesti(x,y,h=0.5,N=c(num,num))
dp<-draw.pcf(pcf,minval=min(y))
persp(dp$x,dp$y,dp$z,phi=30,theta=-30)
contour(dp$x,dp$y,dp$z,nlevels=30)

3.3.2 Level Set Trees

Function ``pcf.kernesti'' gives an output that can be used to calculate level set trees, using package ``denpro''. Function ``leafsfirst'' calculates a level set tree using the Leafsfirst algorithm. Function ``plotvolu'' plots a volume plot. Parameter ``cutlev'' can be used to cut the lower part of the tree and to zoom into the higher levels. Function ``plotbary'' plots the barycenter plot of the level set tree. Parameter ``coordi'' is used to choose the coordinate of the barycenter plot. Function ``plottree'' plots the tree structure of the level set tree. Function ``treedisc'' can be used to prune the level set tree to make the plotting faster.

pcf<-pcf.kernesti(x,y,h=0.5,N=c(15,15))
lst<-leafsfirst(pcf,lowest="regr")

plotvolu(lst,lowest="regr")
plotvolu(lst,lowest="regr",cutlev=0.05)

plotbary(lst,lowest="regr") # barycenter plot,1st coord.
plotbary(lst,coordi=2,lowest="regr") # 2nd coord.

plottree(lst,lowest="regr") # plot level set tree

td<-treedisc(lst,pcf,ngrid=30,lowest="regr")
plotvolu(td,modelabel=FALSE,lowest="regr")

3.4 Three and Higher Dimensional Kernel Regression

Three and higher dimensional regression function estimates can be visualized by one or two dimensional slices, partial dependency plots, and level set tree based visualizations.

First we simulate data. The regression function is a mixture

f(x) = ∑_{i=1}^4 φ(x-c_i) ,

where c_1=D× (1/2,0,0), c_2=D× (-1/2,0,0), c_3=D× (0,\sqrt{3}/2,0), c_4=D× (0,1/(2\sqrt{3}),\sqrt{2/3}), D=3, and φ is the density function of the standard normal distribution. The distribution of the explanatory variables X=(X_1,X_2,X_3) is uniform on [-3,5]^3 and the error term ε has distribution N(0,σ^2). The sample size is denoted n and the number of explanatory variables is denoted d.

n<-1000; d<-3
x<-8*matrix(runif(n*d),n,d)-3
C<-(2*pi)^(-d/2)
phi<-function(x){ return( C*exp(-sum(x^2)/2) ) }
D<-3
c1<-D*c(1/2,0,0)
c2<-D*c(-1/2,0,0)   
c3<-D*c(0,sqrt(3)/2,0)
c4<-D*c(0,1/(2*sqrt(3)),sqrt(2/3))  
fun<-function(x){phi(x-c1)+phi(x-c2)+phi(x-c3)+phi(x-c4)}
y<-matrix(0,n,1)
for (i in 1:n) y[i]<-fun(x[i,])+0.01*rnorm(1)

3.4.1 Slices

A one dimensional slice of regression function estimate \hat{f} :{\bf R}^d → {\bf R} is

g(x_1) = \hat{f}(x_1,x_{0,2}, … ,x_{0,d}) ,

where (x_{0,2}, … ,x_{0,d}) is a fixed point. We can choose x_{0,k} to be the sample median of the k:th coordinate of the vector of explanatory variables: x_{0,k} = median(X_{1,k},…,X_{n,k}) for k=2,…,d.

One dimensional slices can be calculated using function ``pcf.kernesti.slice''. The parameter p=0.5 indicates that the fixed point is the median. The parameter N gives the number of evaluation points.

pcf<-pcf.kernesti.slice(x,y,h=0.5,N=50,coordi=1,p=0.5)
dp<-draw.pcf(pcf); plot(dp$x,dp$y,type="l")

We can compare the one dimensional slice to the one dimensional slice of the linear regression estimate.

coordi<-1
notcoordi<-c(2,3)
li<-linear(x,y); a<-li$beta1[notcoordi]
x0<-c(median(x[,2]),median(x[,3]))
flin<-li$beta0+li$beta1[coordi]*dp$x+sum(a*x0)
plot(dp$x,flin,type="l")

We can now plot the slice of kernel regression estimate and the slice of linear regression in the same figure.

ylim<-c(min(flin,dp$y),max(flin,dp$y))
matplot(dp$x,flin,type="l",ylim=ylim)
matplot(dp$x,dp$y,type="l",add=TRUE)

3.4.2 Partial Dependence Functions

A partial dependence function was defined in Section 7.2 of the book. A one dimensional partial dependence function is

g_{X_1}(x_1) = E f(x_1,X_2, … ,X_d) ,

where f(x) = E(Y | X=x) and X=(X_1, … ,X_d). We can use function ``pcf.kernesti.marg'' to calculate a kernel estimate of a one dimensional partial dependence function.

pcf<-pcf.kernesti.marg(x,y,h=0.5,N=30,coordi=1)
dp<-draw.pcf(pcf); plot(dp$x,dp$y,type="l")
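
For intuition, the partial dependence value at a single point can also be approximated directly by averaging the kernel regression estimate over the observed values of the other coordinates. The sketch below assumes that ``kernesti.regr'' accepts a 1× d matrix as the evaluation point, as in Section 3.3.

x1<-0                                    # evaluation point for the first coordinate
vals<-numeric(n)
for (i in 1:n){
   arg<-matrix(c(x1,x[i,2],x[i,3]),1,3)  # fix x_1, plug in observed X_2, X_3
   vals[i]<-kernesti.regr(arg,x,y,h=0.5)
}
g.hat<-mean(vals)                        # partial dependence estimate at x1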

3.4.3 Level Set Trees

Function ``pcf.kernesti'' gives an output that can be used to calculate level set trees, using package ``denpro''. We proceed as explained in Section 3.3 of the tutorial, where a two dimensional regression function was visualized.

pcf<-pcf.kernesti(x,y,h=0.5,N=c(15,15,15))
lst<-leafsfirst(pcf,lowest="regr")
td<-treedisc(lst,pcf,ngrid=30,lowest="regr")
plotvolu(td,modelabel=FALSE,lowest="regr")
plotbary(td,coordi=3,lowest="regr") # 3rd coord.
plottree(td,lowest="regr") # level set tree

3.5 Kernel Estimator of Derivatives

The kernel estimator of a partial derivative was defined in Section 3.2.9 of the book; see equation (3.28) of the book. The kernel estimator of the k:th partial derivative, with the Gaussian kernel, is

\widehat{D_kf}(x) = ∑_{i=1}^n q_i(x) \, Y_i ,

where

q_i(x) = \frac{1}{∑_{j=1}^n K_h(x-X_j)} \left( \frac{\partial}{\partial x_k} K_h(x-X_i) - p_i(x) ∑_{j=1}^n \frac{\partial}{\partial x_k} K_h(x-X_j) \right) ,

where

\frac{\partial}{\partial x_k} \, K_h(x-X_i) = \frac{1}{h^{d+1}} \, (D_kK) \left( \frac{x-X_i}{h} \right) ,

D_kK(x) = -x_k K(x), and p_i(x) are the weights of the kernel estimator.

3.5.1 One Dimensional Estimator of Derivatives

First we simulate data. Let the regression function be

f(x) = \int_{-1}^x ( φ ( t ) + φ (t-3) ) dt ,

where x ∈ [-1,4] and φ is the density function of the standard normal distribution. Let the distribution of the explanatory variable X be uniform on [-1,4] and let the error term have the distribution ε ∼ N(0,σ^2).

n<-1000
x<-5*matrix(runif(n),n,1)-1
phi1D<-function(x){ (2*pi)^(-1/2)*exp(-x^2/2) }
func0<-function(t){ phi1D(t)+phi1D(t-3) }
func<-function(t){ 
    ngrid<-1000; step<-5/ngrid; grid<-seq(-1,4,step)
    i0<-floor((t+1)/step)
    return( step*sum( func0(grid[1:i0]) ) ) 
}
y<-matrix(0,n,1)
for (i in 1:n) y[i]<-func(x[i])+0.001*rnorm(1)

The code below calculates \widehat{f'}(x_0), where x_0=0. This code is implemented in function ``kernesti.der'' of package ``regpro''.

arg<-0
h<-0.5
ker<-phi1D
dker<-function(t){ return( -t*exp(-t^2/2) ) }
w<-ker((arg-x)/h); p<-w/sum(w)
u<-dker((arg-x)/h)/h^2; q<-1/sum(w)*(u-p*sum(u))  
hatf<-sum(y*q) # the estimated value     

Let us plot the estimate on a grid of points.

t<-5*seq(1,n)/(n+1)-1
hatf<-matrix(0,length(t),1)
f<-hatf
for (i in 1:length(t)){
  hatf[i]<-kernesti.der(t[i],x,y,h=0.5)
  f[i]<-func0(t[i])
}
ylim<-c(min(f,hatf),max(f,hatf))
matplot(t,hatf,type="l",ylim=ylim)       # estimate
matplot(t,f,type="l",add=TRUE,col="red") # true func.

We can also use the functions ``pcf.kernesti.der'' and ``draw.pcf'', which allow a more automatic plotting of the function. Function ``draw.pcf'' is included in package ``denpro''.

pcf<-pcf.kernesti.der(x,y,h=0.5,N=n)
dp<-draw.pcf(pcf)
plot(x,y); matplot(dp$x,dp$y,type="l",add=TRUE)  

3.5.2 Two and Higher Dimensional Estimator of Derivatives

First we simulate data. Let the regression function be

f(x_1,x_2) = f_1(x_1)f_1(x_2) ,

where f_1 is the regression function used in Section 3.5.1 of the tutorial and x_1,x_2 ∈ [-1,4]. The distribution of the explanatory variable X is uniform on [-1,4]^2. The error term ε has distribution N(0,σ^2). The sample size is n and the number of explanatory variables is d=2.

n<-1000
d<-2
x<-5*matrix(runif(n*d),n,d)-1
phi1D<-function(x){ (2*pi)^(-1/2)*exp(-x^2/2) }
func0<-function(t){ phi1D(t)+phi1D(t-3) }
func1<-function(t){ 
    ngrid<-1000; step<-5/ngrid; grid<-seq(-1,4,step)
    i0<-floor((t+1)/step)
    return( step*sum( func0(grid[1:i0]) ) ) 
}
func<-function(t){ func1(t[1])*func1(t[2]) }
y<-matrix(0,n,1)
for (i in 1:n) y[i]<-func(x[i,])+0.001*rnorm(1)

The code below calculates the first partial derivative \widehat{D_1f}(0), where D_1f(0) = \partial f(x)/\partial x_1 |_{x=0}. The code below is implemented in the function ``kernesti.der'' of package ``regpro''. The parameter direc is set equal to one to estimate the first partial derivative.

direc<-1
arg<-c(0,0)
h<-0.5
ker<-function(xx){ exp(-rowSums(xx^2)/2) }
C<-(2*pi)^(-1)
dker<-function(xx){-C*xx[,direc]*exp(-rowSums(xx^2)/2)}
argu<-matrix(arg,n,d,byrow=TRUE)
w<-ker((argu-x)/h); p<-w/sum(w)
u<-dker((argu-x)/h)/h^(d+1); q<-1/sum(w)*(u-p*sum(u))  
value<-q%*%y  # the estimated value

Let us plot a perspective plot and contour plot of the estimate of the first partial derivative and of the true first partial derivative. Parameter num gives the number of grid points in one direction.

num<-30
t<-5*seq(1,num)/(num+1)-1
u<-t
hatf<-matrix(0,num,num); df<-hatf
for (i in 1:num){ 
  for (j in 1:num){
      arg<-matrix(c(t[i],u[j]),1,2)
      hatf[i,j]<-kernesti.der(arg,x,y,h=0.5)
      df[i,j]<-func0(arg[1])*func1(arg[2])
  } 
}
contour(t,u,hatf)   # kernel estimate
persp(t,u,hatf,ticktype="detailed",phi=3,theta=-30)
contour(t,u,df)     # true function
persp(t,u,df,phi=30,theta=-30,ticktype="detailed")

We can use functions ``pcf.kernesti.der'' and ``draw.pcf'' to make perspective plots and contour plots. Function ``draw.pcf'' is in package ``denpro''.

pcf<-pcf.kernesti.der(x,y,h=0.5,N=c(50,50),direc=1)
dp<-draw.pcf(pcf)
persp(dp$x,dp$y,dp$z,phi=3,theta=-30)
contour(dp$x,dp$y,dp$z)

We can also use level set tree based methods, as in Section 3.3 of the tutorial.

pcf<-pcf.kernesti.der(x,y,h=0.5,N=c(15,15))
lst<-leafsfirst(pcf,lowest="regr")
plotvolu(lst,lowest="regr")          # volume plot
plotbary(lst,coordi=1,lowest="regr") # barycenter plot
plottree(lst,lowest="regr")          # level set tree

3.6 Combined State and Time Space Smoothing

Locally stationary data was discussed in Section 3.2.5 of the book. To estimate a regression function with locally stationary data it can be useful to combine the state space kernel estimator with moving averages. The kernel estimator combining time and state space smoothing, as defined in equation (3.20) of the book, is

\hat{f}_t(x) = ∑_{i=1}^t w_i(x,t) Y_i ,

where the weights have the form

w_i(x,t) = K((x-X_i)/h) L((t-i)/g) ⁄ ∑_{j=1}^n K((x-X_j)/h) L((t-j)/g) , i=1,…,t ,

where K:{\bf R}^{d} → {\bf R}, L:{\bf R} → {\bf R} are kernel functions and h>0, g>0 are smoothing parameters.

Let us simulate locally stationary data. Let the regression functions be

f_t(x) = 0.5 φ ( x- μ _t^{(1)} ) + 0.5 φ ( x-μ_t^{(2)} ) ,

where μ_t^{(1)} = -2t/T, μ_t^{(2)} = 2t/T, T is the number of observations, and φ is the density function of the standard normal distribution. The design variables X_t are i.i.d. N(0,1) and the errors ε_t are i.i.d. N(0,0.1^2).

n<-1000
x<-matrix(rnorm(n),n,1)
y<-matrix(0,n,1)
for (i in 1:n){
   mu1<--i/n*2; mu2<-i/n*2
   func<-function(t){
        return( 0.5*dnorm(t-mu1)+0.5*dnorm(t-mu2) )
   }
   y[i]<-func(x[i])+0.1*rnorm(1)
}

Now we apply function ``kernesti.regr''. The smoothing parameter h is the state space smoothing parameter and the smoothing parameter g is the time space smoothing parameter.

arg<-0 
kernesti.regr(arg,x,y,h=1,g=10,kernel="exp")
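
For intuition, the weights of equation (3.20) can also be computed directly. The sketch below estimates f_t(x) at x=0 and t=n, using a Gaussian state space kernel and a one-sided exponential time kernel; these specific kernel forms are assumptions made for illustration and need not coincide with the defaults of ``kernesti.regr''.

arg<-0; h<-1; g<-10; t0<-n               # evaluate at state x=0 and time t=n
K<-function(u){ exp(-u^2/2) }            # state space kernel (Gaussian, assumed)
L<-function(u){ exp(-u)*(u>=0) }         # one-sided time kernel (exponential, assumed)
w<-K((arg-x[1:t0])/h)*L((t0-(1:t0))/g)   # unnormalized weights w_i(x,t)
w<-w/sum(w)
hatf<-sum(w*y[1:t0])                     # combined time and state space estimate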

4 Local Linear Regression

The local linear estimator was discussed in Section 5.2.1 of the book.

4.1 One Dimensional Local Linear Regression

We use the same simulated data as was used in Section 3.1 of the tutorial to illustrate one dimensional kernel regression:

n<-500
x<-5*matrix(runif(n),n,1)-1
phi1D<-function(x){ (2*pi)^(-1/2)*exp(-x^2/2) }
func<-function(t){ phi1D(t)+phi1D(t-3) }
y<-matrix(func(x)+0.1*rnorm(n),n,1)

Let us choose the kernel K to be the Bartlett function K(x) = (1-x^2)_+. We want to calculate \hat{f}(x_0), where x_0=0.5. The weights of the one dimensional local linear regression are given in equation (5.9) of the book.

arg<-0.5
h<-0.5
ker<-function(x){ (1-x^2)*(x^2<=1) }
w<-ker((arg-x)/h); p<-w/sum(w)
barx<-sum(p*x); bary<-sum(p*y)
q<-p*(1-((x-barx)*(barx-arg))/sum(p*(x-barx)^2))
hatf<-sum(q*y)

The above code is implemented in function ``loclin'' of package ``regpro''. Let us plot the estimate on a grid of points.

N<-40
t<-5*seq(1,N)/(N+1)-1
hatf<-matrix(0,length(t),1); f<-hatf
for (i in 1:length(t)){
   hatf[i]<-loclin(t[i],x,y,h=0.5,kernel="bart")
   f[i]<-phi1D(t[i])+phi1D(t[i]-3)
}
plot(x,y)                         # data
matplot(t,hatf,type="l",add=TRUE) # estimate
matplot(t,f,type="l",add=TRUE)    # true function

We can also use the functions ``pcf.loclin'' and ``draw.pcf'', which allow a more automatic plotting of the function. Function ``draw.pcf'' is in package ``denpro''.

pcf<-pcf.loclin(x,y,h=0.5,N=40,kernel="bart")
dp<-draw.pcf(pcf)
plot(x,y)
matplot(dp$x,dp$y,type="l",add=TRUE)
matplot(dp$x,func(dp$x),type="l",add=TRUE)

4.2 Two Dimensional Local Linear Regression

We use the same two dimensional simulated data as was used in Section 3.3 of the tutorial to illustrate two dimensional kernel regression:

n<-1000
d<-2 
x<-8*matrix(runif(n*d),n,d)-3
C<-(2*pi)^(-d/2)
phi<-function(x){ return( C*exp(-sum(x^2)/2) ) }
D<-3; c1<-c(0,0); c2<-D*c(1,0); c3<-D*c(1/2,sqrt(3)/2)
func<-function(x){phi(x-c1)+phi(x-c2)+phi(x-c3)}
y<-matrix(0,n,1)
for (i in 1:n) y[i]<-func(x[i,])+0.01*rnorm(1)

The local linear regression estimator was defined in equation (5.6) of the book as the solution of a weighted least squares problem. The weighted least squares estimator is

\hat β = ({\bf X}'W(x){\bf X})^{-1} {\bf X}'W(x){\bf y},

where {\bf X} is the n× (d+1) matrix whose first column consists of ones, the other columns are the observations from the d explanatory variables, and {\bf y} is the n× 1 vector of the observed values of the response variable.

First we calculate the matrix W of the kernel weights. The object x is the n× d matrix of the observed values of the explanatory variables; in this example d=2. We use the Gaussian kernel and the smoothing parameter h=0.5, as in Section 3.3.

arg<-c(0,0)
h<-0.5
ker<-function(x){ return( exp(-rowSums(x^2)/2) ) }
argu<-matrix(arg,n,d,byrow=TRUE)
w<-ker((x-argu)/h); weights<-w/sum(w); W<-diag(weights)

In the code below we first insert the column vector of ones into the original n× d matrix x. The matrix transpose is calculated with the function t(), the matrix multiplication is calculated with the operator %*%, and the matrix inversion is calculated with the function ``solve''. We obtain the vector "esti" of three elements. The first element is the estimate of the regression function at the point arg, the second element is the estimate of the first partial derivative, and the third element is the estimate of the second partial derivative.

X<-cbind(matrix(1,n,1),x-argu)
A<-t(X)%*%W%*%X; invA<-solve(A,diag(rep(1,d+1))) 
esti<-invA%*%t(X)%*%W%*%y; estimate<-esti[1]

We can also use the functions ``pcf.loclin'' and ``draw.pcf'', which allow a more automatic plotting of the function. Function ``draw.pcf'' is included in package ``denpro''.

pcf<-pcf.loclin(x,y,h=0.5,N=c(20,20))
dp<-draw.pcf(pcf)
persp(dp$x,dp$y,dp$z,ticktype="detailed",phi=30,theta=3)
contour(dp$x,dp$y,dp$z,nlevels=30)

4.3 Three and Higher Dimensional Local Linear Regression

When the functions are three and higher dimensional we cannot use perspective plots and contour plots. However, we can use level set tree based methods. Let us use the same three dimensional data which was used to illustrate three dimensional kernel estimation in Section 3.4 of the tutorial:

n<-1000; d<-3
x<-8*matrix(runif(n*d),n,d)-3
C<-(2*pi)^(-d/2)
phi<-function(x){ return( C*exp(-sum(x^2)/2) ) }
D<-3
c1<-D*c(1/2,0,0)
c2<-D*c(-1/2,0,0)   
c3<-D*c(0,sqrt(3)/2,0)
c4<-D*c(0,1/(2*sqrt(3)),sqrt(2/3))  
fun<-function(x){phi(x-c1)+phi(x-c2)+phi(x-c3)+phi(x-c4)}
y<-matrix(0,n,1)
for (i in 1:n) y[i]<-fun(x[i,])+0.01*rnorm(1)

Function ``pcf.loclin'' gives an output that can be used to calculate level set trees, using package ``denpro''. We proceed as explained in Section 3.3 of the tutorial, where a two dimensional regression function was visualized. After calculating a level set tree, we plot volume plots and a barycenter plot.

pcf<-pcf.loclin(x,y,h=0.5,N=c(15,15,15))
lst<-leafsfirst(pcf,lowest="regr")

td<-treedisc(lst,pcf,ngrid=30,lowest="regr")
plotvolu(td,modelabel=FALSE,lowest="regr")
plotvolu(td,modelabel=FALSE,lowest="regr",cutlev=0.03)

plotbary(td,coordi=1,lowest="regr") 

4.4 Local Linear Derivative Estimation

We can use the same two dimensional data to illustrate local linear partial derivative estimation as was used to illustrate kernel estimation of partial derivatives in Section 3.5 of the tutorial:

n<-1000
d<-2
x<-5*matrix(runif(n*d),n,d)-1
phi1D<-function(x){ (2*pi)^(-1/2)*exp(-x^2/2) }
func0<-function(t){ phi1D(t)+phi1D(t-3) }
func1<-function(t){ 
    ngrid<-1000; step<-5/ngrid; grid<-seq(-1,4,step)
    i0<-floor((t+1)/step)
    return( step*sum( func0(grid[1:i0]) ) ) 
}
func<-function(t){ func1(t[1])*func1(t[2]) }
y<-matrix(0,n,1)
for (i in 1:n) y[i]<-func(x[i,])+0.001*rnorm(1)

We use the function ``pcf.loclin'' with the argument "type". The argument "type" is set to 1 to estimate the first partial derivative, to 2 to estimate the second partial derivative, and similarly in the higher than two dimensional cases. Function ``pcf.loclin'' is included in package ``regpro'' and function ``draw.pcf'' is included in package ``denpro''.

pcf<-pcf.loclin(x,y,h=0.5,N=c(20,20),type=1)
dp<-draw.pcf(pcf)
persp(dp$x,dp$y,dp$z,ticktype="detailed",phi=3,theta=30)
contour(dp$x,dp$y,dp$z,nlevels=30)

5 Additive Models: Backfitting

Additive models were covered in Section 4.2 of the book, where the backfitting algorithm was presented. In the two dimensional additive model the response variable Y can be written as

Y = f_1(X_1) + f_2(X_2) + ε ,

where X_1 and X_2 are the explanatory variables, f_k : {\bf R} → {\bf R} are the unknown components, and ε is an error term.

First we simulate data from an additive model. The distribution of the explanatory variable X=(X_1,X_2) is uniform on [0,1]^2. The regression function is f(x_1,x_2) = f_1(x_1) + f_2(x_2), where f_1(x_1) = x_1^2 - EX_1^2 and f_2(x_2) = log(x_2) - E log(X_2). The response variable is Y = f(X_1,X_2) + ε, where ε ∼ N(0,σ^2).

n<-100
d<-2
x<-matrix(runif(n*d),n,d)
fun1<-function(t){t^2}; fun2<-function(t){log(t)}
f<-matrix(0,n,d)
f[,1]<-fun1(x[,1])-mean(fun1(x[,1])) 
f[,2]<-fun2(x[,2])-mean(fun2(x[,2])) 
y<-f[,1]+f[,2]+0.1*rnorm(n)

We estimate the additive model using function ``additive''. We need to give as arguments the n× d-matrix x of the values of the explanatory variables, the n vector y of the values of the response variable, the smoothing parameter h>0, and the number of iterations M ≥ 1.

h<-0.1
M<-5
est<-additive(x,y,h=h,M=M)

The output est$eval is an n× d matrix that contains the evaluations \hat{f}_k(X_{i,k}), i=1,…,n, k=1,…,d, where X_{i,k} is the k:th component of the i:th observation. Next we plot the components of the estimate. The code below plots the estimate of the first component and the true first component. The functions are plotted at the observations X_{i,1}, i=1,…,n.

or<-order(x[,1])
t<-x[or,1]
hatf1<-est$eval[or,1]; f1<-f[or,1]
plot(t,y[or])                       # data
matplot(t,hatf1,type="l",add=TRUE)  # estimate
matplot(t,f1,type="l",add=TRUE)     # true function
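
For intuition about what ``additive'' computes, the backfitting iteration itself can be sketched directly with the kernel smoother of Section 3. This is an illustrative sketch for d=2 and is not necessarily identical to the implementation in ``regpro''.

f1hat<-rep(0,n); f2hat<-rep(0,n)
yc<-y-mean(y)                              # centered response
x1<-matrix(x[,1],n,1); x2<-matrix(x[,2],n,1)
for (m in 1:M){
   r1<-matrix(yc-f2hat,n,1)                # partial residuals for the 1st component
   for (i in 1:n) f1hat[i]<-kernesti.regr(x[i,1],x1,r1,h=h)
   f1hat<-f1hat-mean(f1hat)
   r2<-matrix(yc-f1hat,n,1)                # partial residuals for the 2nd component
   for (i in 1:n) f2hat[i]<-kernesti.regr(x[i,2],x2,r2,h=h)
   f2hat<-f2hat-mean(f2hat)
}
# f1hat and f2hat should be roughly comparable to est$eval[,1] and est$eval[,2]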

We can evaluate the estimate on a regular grid. We need to give the matrix est$eval as an argument. The matrix was calculated at a previous step. The code below calculates the estimates \hat{f}_1 and \hat{f}_2 and the estimate \hat{f}(x_1,x_2) = \bar{y} + \hat{f}_1(x_1) + \hat{f}_2(x_2) on a grid, where \bar{y} is the mean of the values of the response variable. Parameter num gives the number of grid points in one direction.

num<-50
t<-seq(1,num)/(num+1)
u<-t
hatf<-matrix(0,num,num)
hatf1<-matrix(0,num,1); hatf2<-matrix(0,num,1)
func<-function(arg){
  additive(x,y,arg,h=h,M=M,eval=est$eval)$valvec
}
for (i in 1:num){ for (j in 1:num){
      valvec<-func(c(t[i],u[j]))
      hatf1[i]<-valvec[1]; hatf2[j]<-valvec[2]
      hatf[i,j]<-mean(y)+sum(valvec)
} }
plot(t,hatf1,type="l")
persp(t,u,hatf,ticktype="detailed",phi=30,theta=-30)

Function ``pcf.additive'' can be used for perspective plots, contour plots, and level set trees.

N<-c(50,50)
pcf<-pcf.additive(x,y,N=N,h=h,eval=est$eval,M=M)

dp<-draw.pcf(pcf,minval=min(pcf$value))
persp(dp$x,dp$y,dp$z,phi=30,theta=-30)
contour(dp$x,dp$y,dp$z,nlevels=30)

lst<-leafsfirst(pcf)
plotvolu(lst,lowest="regr")

6 Single Index Regression

The single index model was defined in Section 4.1 of the book. In the single index model the response variable can be written

Y = g(θ'X) + ε ,

where X∈ {\bf R}^d, θ ∈ {\bf R}^d is an unknown direction vector with \|θ\|=1, and g :{\bf R} → {\bf R} is an unknown link function.

First we simulate data from a single index model. The distribution of the vector X=(X_1,X_2) of the explanatory variables is the standard normal 2D distribution. The error term is ε ∼ N(0,σ^2). The index vector is θ = (0,1) and the link function g is the distribution function Φ of the standard normal distribution.

n<-100
x<-matrix(rnorm(n*2),n,2)
theta<-matrix(c(0,1),2,1)
x1d<-x%*%theta
y<-pnorm(x1d)+0.1*rnorm(n)

6.1 Estimating the Index

We cover four estimators discussed in Section 4.1.2 of the book: derivative method, average derivative method, numerical minimization to find the least squares solution, and a stagewise algorithm to find the least squares solution.

The average derivative method was defined in equation (4.9) of the book. The following commands can be used to estimate the direction vector θ with the average derivative method.

method<-"aved"
h<-1.5
hat.theta<-single.index(x,y,h=h,method=method)
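
For intuition, the average derivative idea can also be sketched directly: estimate the gradient of the regression function at each observation with the kernel derivative weights of Section 3.5 and average the estimates. This is an illustrative sketch only; ``single.index'' with method="aved" is the supported implementation. The multiplicative constants of the Gaussian kernel are omitted, since they cancel in the final normalization.

h<-1.5; d<-2
ker<-function(xx){ exp(-rowSums(xx^2)/2) }    # Gaussian kernel without the constant
grad<-matrix(0,n,d)                           # gradient estimates at the observations
for (i in 1:n){
   argu<-matrix(x[i,],n,d,byrow=TRUE)
   dif<-(argu-x)/h
   w<-ker(dif); p<-w/sum(w)
   for (k in 1:d){
      u<--dif[,k]*ker(dif)/h^(d+1)            # derivative of the kernel weight
      grad[i,k]<-sum((u-p*sum(u))/sum(w)*y)
   }
}
avegrad<-colMeans(grad)
avegrad/sqrt(sum(avegrad^2))                  # compare with hat.theta (possibly up to sign)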

The derivative method was defined in equation (4.8) of the book. In the derivative method we have to specify additionally the point at which the gradient is estimated. We choose the point (0,0).

method<-"poid"
h<-1.5
argd<-c(0,0)
hat.theta<-single.index(x,y,h=h,method=method,argd=argd)

The direction vector can be estimated using numerical minimization to find the solution in the least squares problem of equation (4.2) of the book. The numerical minimization can be performed with the following commands.

method<-"nume"; h<-1.5
hat.theta<-single.index(x,y,h=h,method=method)

The least squares problem in equation (4.2) of the book can be solved by using an iterative method, as explained in Section 4.1.2 of the book. The following commands can be used to apply the iterative method. Argument M gives the number of iterations.

method<-"iter"
h<-1.5
M<-10
hat.theta<-single.index(x,y,h=h,method=method,M=M)

6.2 Estimating the Link Function

After estimating the direction vector θ, we have to estimate the link function g:{\bf R} → {\bf R}. We do this with the kernel estimator, using the following commands, which also plot the true link function.

x1d<-x%*%hat.theta  # project data to 1D
pcf<-pcf.kernesti(x1d,y,h=0.3,N=20)
dp<-draw.pcf(pcf)
matplot(dp$x,dp$y,type="l",ylim=c(min(y,0),max(y,1)))
matplot(dp$x,pnorm(dp$x),type="l",add=TRUE)

6.3 Plotting the Single Index Regression Function

We can estimate the regression function f(x) = g(θ'x) on a grid with the function ``pcf.single.index''.

h<-0.3
N<-c(40,40)
method<-"poid"
pcf<-pcf.single.index(x,y,h=h,N=N,method=method)
dp<-draw.pcf(pcf)
persp(dp$x,dp$y,dp$z,ticktype="detailed",phi=3,theta=30)

If the data does not come from a single index model, it can be better to use the derivative method and estimate the gradient separately at each point of evaluation. We give an example where the regression function is the standard normal density function. First we simulate data.

n<-1000
x<-matrix(6*runif(n*2)-3,n,2)
C<-(2*pi)^(-1)
phi<-function(x){ return( C*exp(-rowSums(x^2)/2) ) }
y<-phi(x)+0.1*rnorm(n)

Then we use the derivative method so that the gradient is estimated separately at each point of evaluation.

method<-"poid"
h<-1.5
num<-50
t<-6*seq(1,num)/(num+1)-3; u<-t
hatf<-matrix(0,num,num)
for (i in 1:num){  
  for (j in 1:num){
    arg<-c(t[i],u[j])
    hatf[i,j]<-
    single.index(x,y,arg=arg,h=h,method=method,argd=arg)
  } 
}
persp(t,u,hatf,ticktype="detailed",phi=30,theta=-30)

7 Forward Stagewise Modeling

Algorithms for forward stagewise modeling were given in Section 5.4 of the book. In this tutorial we cover the stagewise fitting of additive models of Section 5.4.2 of the book and the projection pursuit regression of Section 5.4.3 of the book.

7.1 Stagewise Fitting of Additive Models

An algorithm for stagewise fitting of additive models was given in Section 5.4.2 of the book. The stagewise fitting is an alternative method to backfitting for finding an estimate with the additive structure. We use the same simulated data as in the case of backfitting the additive model in Section 5 of the tutorial:

n<-100
d<-2
x<-matrix(runif(n*d),n,d)
fun1<-function(t){t^2}; fun2<-function(t){log(t)}
f<-matrix(0,n,d)
f[,1]<-fun1(x[,1])-mean(fun1(x[,1])) 
f[,2]<-fun2(x[,2])-mean(fun2(x[,2])) 
y<-f[,1]+f[,2]+0.1*rnorm(n)

We estimate the additive model using function ``additive.stage''. The arguments are the n × d matrix x of the values of the explanatory variables, the n vector y of the values of the response variable, the smoothing parameter h>0, and the number of iterations M.

h<-0.1
M<-5
est<-additive.stage(x,y,h=h,M=M)

The output ``est$eval'' is the n× d matrix that contains the evaluations \hat{f}_k(X_{i,k}), i=1,… ,n, k=1,… ,d. We can use the same code to plot the components as in the case of the backfitting algorithm.

We can evaluate the estimate on a regular grid. The evaluation proceeds differently than in the case of backfitting, because now we need to give as arguments the n× M matrix of residuals and the M vector that indicates which variable was chosen at each step. These are given as ``est$residu'' and ``est$deet''. Note that in the case of backfitting it is enough to calculate the n× d matrix of the final component values, whereas to evaluate the stagewise estimator at an arbitrary point we need to calculate and save the complete n× M matrix of residuals and the vector that records the variable chosen at each step.

num<-50
t<-seq(1,num)/(num+1)
u<-t
hatf<-matrix(0,num,num)
funi<-function(t,u){ 
   additive.stage(x,y,c(t,u),h=h,M=M,residu=est$residu,
                  deet=est$deet)$value 
}
for (i in 1:num){ 
  for (j in 1:num){
      hatf[i,j]<-funi(t[i],u[j])
  } 
}
persp(t,u,hatf,ticktype="detailed",phi=30,theta=3)

7.2 Projection Pursuit Regression

An algorithm for projection pursuit regression was given in Section 5.4.3 of the book. We simulate data where X=(X_1,X_2) is uniformly distributed on [-3,3]^2, ε ∼ N(0,σ^2), and the regression function is the density function of the standard 2D normal distribution.

n<-1000
x<-matrix(6*runif(n*2)-3,n,2)
phi<-function(x){ (2*pi)^(-1)*exp(-rowSums(x^2)/2) }
y<-phi(x)+0.1*rnorm(n)

We calculate the projection pursuit regression function estimate using function ``pp.regression''. The arguments are the n× d-matrix x of the values of the explanatory variables, the n vector y of the values of the response variable, the smoothing parameter h>0, and the number of iterations M.

h<-0.5
M<-3
est<-pp.regression(x,y,h=h,M=M)

The output ``est$eval'' is a vector of length n that contains the evaluations \hat{f}(X_{i}), i=1,… ,n. We can evaluate the estimate on a regular grid. We need to give as arguments the n× M matrix of residuals and the M vector of directions \hat{ θ }_m. These are given as ``est$residu'' and ``est$teet''.

num<-30
t<-6*seq(1,num)/(num+1)-3
u<-t
hatf<-matrix(0,num,num)
funi<-function(t,u){ 
    pp.regression(x,y,c(t,u),h=h,M=M,residu=est$residu,
    teet=est$teet)$value 
}
for (i in 1:num){  
  for (j in 1:num){
      hatf[i,j]<-funi(t[i],u[j])
  } 
}
persp(t,u,hatf,ticktype="detailed",phi=20,theta=-30)

8 Quantile Regression

Quantile regression with the kernel method was introduced in Section 3.8 of the book. Linear quantile regression was introduced in Section 5.1.2 of the book. Let us use the same data as was applied in Section 3.1 of the tutorial to illustrate one dimensional kernel regression:

n<-500
x<-5*matrix(runif(n),n,1)-1
phi1D<-function(x){ (2*pi)^(-1/2)*exp(-x^2/2) }
func<-function(t){ phi1D(t)+phi1D(t-3) }
y<-matrix(func(x)+0.1*rnorm(n),n,1)

8.1 Linear Quantile Regression

Function ``linear.quan'' implements linear quantile regression and it is included in package ``regpro''.

li<-linear.quan(x,y,p=0.1)
N<-50
t<-5*seq(1,N)/(N+1)-1
qhat<-matrix(0,N,1)
for (i in 1:N) qhat[i]<-li$beta0+li$beta1*t[i]
plot(x,y)
lines(t,qhat)

Let us look at the code of the function ``linear.quan''. First we define the function ``fn'', which calculates the quantile loss when the argument ``b'' is the (d+1)-vector consisting of the intercept and the coefficients of a linear function.

p<-0.1
n<-dim(x)[1]
d<-dim(x)[2]
rho<-function(t){ t*(p-(t<0)) }
fn<-function(b) { 
    b2<-matrix(b[2:(d+1)],d,1)
    gx<-b[1]+x%*%b2
    ro<-rho(y-gx)
    return(sum(ro)/n)
}

Then we use function ``optim'', which performs the numerical optimization. As the initial value for the optimization we give the solution of the least squares regression.

li<-linear(x,y)
par<-c(li$beta0,li$beta1) # initial value
op<-optim(par=par,fn=fn,method="L-BFGS-B")
beta0<-op$par[1]          # the intercept
beta1<-op$par[2:(d+1)]    # the coefficients
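
For comparison, linear quantile regression is also available in the R package ``quantreg''. Assuming that package is installed, the estimate above can be cross-checked as follows.

library(quantreg)               # not part of "regpro"
rq.fit<-rq(c(y)~c(x),tau=0.1)   # linear 0.1-quantile regression
coef(rq.fit)                    # compare with beta0 and beta1 above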

8.2 Kernel Quantile Regression

Function ``pcf.kern.quan'' implements kernel quantile regression.

pcf<-pcf.kern.quan(x,y,h=0.5,N=50,p=0.1)
dp<-draw.pcf(pcf); plot(x,y); lines(dp$x,dp$y,type="l")

Let us look at the code of the function ``pcf.kern.quan''. We want to calculate the estimate at the point arg. First we calculate the weights, similarly as in kernel mean regression.

arg<-1
h<-0.5
p<-0.1
ker<-function(x){ (1-x^2)*(x^2<=1) }
w<-ker((arg-x)/h); ps<-w/sum(w)

Then we implement the rule given in equation (3.55) of the book. The estimate of the p-quantile is ``hatq''.

or<-order(y)
ps.ord<-ps[or]
i<-1
zum<-0
while ( (i<=n) && (zum+ps.ord[i]<p) ){
   zum<-zum+ps.ord[i]
   i<-i+1
}
if (i>n) hatq<-max(y) else hatq<-y[or[i]]

January 2014