
1 Logistic Regression [20 pts]

Consider the logistic regression model for binary classification that takes input features $x_i \in \mathbb{R}^m$ and predicts $y_i \in \{-1, 1\}$. As we learned in class, the logistic regression model fits the probability $P(y_i = 1 \mid x_i)$ using the sigmoid function:

$$P(y_i = 1) = \sigma(w^T x_i + b) = \frac{1}{1 + \exp(-w^T x_i - b)}, \tag{1}$$

where the sigmoid function is as follows, for any $z \in \mathbb{R}$:

$$\sigma(z) = \big(1 + \exp(-z)\big)^{-1}.$$
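As a quick numerical sanity check (an illustrative Python sketch; the toy values of `w`, `b`, and `x` below are mine, not from the problem), the two equivalent forms of the sigmoid in Eq. (1) agree:

```python
import math

def sigmoid(z: float) -> float:
    """sigma(z) = (1 + exp(-z))^(-1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values (not from the problem): sigmoid(w.x + b)
# matches the expanded form 1 / (1 + exp(-w.x - b)).
w, b, x = [0.5, -0.2], 0.1, [1.0, 2.0]
z = sum(wj * xj for wj, xj in zip(w, x)) + b
direct = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x)) - b))
assert abs(sigmoid(z) - direct) < 1e-12
```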

Given $N$ training data points, we learn the logistic regression model by minimizing the following loss function:

$$J(w, b) = \frac{1}{N} \sum_{i=1}^{N} \log\Big(1 + \exp\big(-y_i (w^T x_i + b)\big)\Big). \tag{2}$$
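For concreteness, the loss in Eq. (2) can be sketched directly in Python (plain lists, no external libraries; the function name is my own choice):

```python
import math

def loss(w, b, X, y):
    """J(w, b) = (1/N) * sum_i log(1 + exp(-y_i * (w^T x_i + b))) -- Eq. (2)."""
    N = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total += math.log(1.0 + math.exp(-margin))
    return total / N
```

At $w = 0$, $b = 0$ every margin is zero, so the loss is $\log 2$ regardless of the data, which is a convenient spot check.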

(a) (7 pts) Prove the following. For any $z \in \mathbb{R}$, the derivative of the sigmoid function is as follows:

$$\frac{d\sigma(z)}{dz} = \sigma(z)\big(1 - \sigma(z)\big).$$

(Hint: use the chain rule, and $\frac{d}{dx}\exp(x) = \exp(x)$.)
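Before proving the identity analytically, it can be checked numerically with a central finite difference (a sketch; the step size and test points are arbitrary choices of mine):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Check d sigma/dz == sigma(z) * (1 - sigma(z)) at a few sample points.
for z in (-2.0, 0.0, 1.5):
    h = 1e-6
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    analytic = sigmoid(z) * (1 - sigmoid(z))
    assert abs(numeric - analytic) < 1e-8
```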

Suppose we are updating the weight $w$ and bias term $b$ with the gradient descent algorithm and the learning rate is $\alpha$. We can derive the update rule as

$$w_j \leftarrow w_j - \alpha \frac{\partial J(w, b)}{\partial w_j}, \quad \text{where} \quad \frac{\partial J(w, b)}{\partial w_j} = \frac{1}{N} \sum_{i=1}^{N} \Big(\sigma\big(y_i (w^T x_i + b)\big) - 1\Big)\, y_i\, x_{i,j},$$

where $w_j$ denotes the $j$-th element of the weight vector $w$ and $x_{i,j}$ denotes the $j$-th element of the vector $x_i$. And

$$b \leftarrow b - \alpha \frac{\partial J(w, b)}{\partial b}, \quad \text{where} \quad \frac{\partial J(w, b)}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} \Big(\sigma\big(y_i (w^T x_i + b)\big) - 1\Big)\, y_i.$$
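These update rules can be sketched as a minimal Python implementation (plain-list inputs; function names are mine, not from the assignment):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradients(w, b, X, y):
    """Gradients of J from Eq. (2):
    dJ/dw_j = (1/N) sum_i (sigma(y_i (w.x_i + b)) - 1) * y_i * x_{i,j}
    dJ/db   = (1/N) sum_i (sigma(y_i (w.x_i + b)) - 1) * y_i
    """
    N, m = len(X), len(w)
    gw, gb = [0.0] * m, 0.0
    for xi, yi in zip(X, y):
        s = sigmoid(yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)) - 1.0
        for j in range(m):
            gw[j] += s * yi * xi[j] / N
        gb += s * yi / N
    return gw, gb

def gd_step(w, b, X, y, alpha):
    """One gradient descent update: w <- w - alpha * dJ/dw, b <- b - alpha * dJ/db."""
    gw, gb = gradients(w, b, X, y)
    return [wj - alpha * g for wj, g in zip(w, gw)], b - alpha * gb
```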

We now compare the stochastic gradient descent algorithm for logistic regression with the Perceptron algorithm. Let us define the residue for example $i$ as $\delta_i = 1 - \sigma\big(y_i (w^T x_i + b)\big)$. Assume the learning rate for both algorithms is $\alpha$.

(b) (5 pts) Suppose the data point $(x_i, y_i)$ is correctly classified with high confidence, say $y_i = 1$ and $w^T x_i + b = 5$. What will be the residue? For logistic regression, what will be the update with respect to this data point? What about the update of the Perceptron algorithm? (Hint: you may assume $\sigma(5) \approx 0.9933$. The answer should take the form $w_j \leftarrow w_j - [?]$ or $w_j \leftarrow w_j + [?]$, where you fill in $[?]$. You can directly use $x_{i,j}$ and $\alpha$ as variables.)

(c) (5 pts) Suppose the data point $(x_i, y_i)$ is misclassified, say $y_i = 1$ and $w^T x_i + b = -1$. What will be the residue, and how is the weight updated for the two algorithms? (Hint: you may assume $\sigma(-1) \approx 0.2689$.)
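The hinted sigmoid values in (b) and (c) are easy to confirm numerically (a sketch applying the residue definition $\delta_i = 1 - \sigma(y_i(w^T x_i + b))$ from above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Part (b): correctly classified with high confidence, y_i = 1, w.x_i + b = 5.
residue_b = 1 - sigmoid(1 * 5)    # about 1 - 0.9933 = 0.0067 (small residue)
# Part (c): misclassified, y_i = 1, w.x_i + b = -1.
residue_c = 1 - sigmoid(1 * -1)   # about 1 - 0.2689 = 0.7311 (large residue)
print(round(residue_b, 4), round(residue_c, 4))
```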

(d) (3 pts) For the update step in the SGD and Perceptron algorithms, is the influence of the input data $x_i$ of the same magnitude? If not, what is the difference?

2 Regularization [10 pts]

Data is separable in one dimension if there exists a threshold $t$ such that all data that have values less than $t$ in this dimension have the same class label. Suppose we train logistic regression for infinite iterations according to our current gradient descent update on training data that is separable in at least one dimension.

(a) (2 pts) (multiple choice) Suppose the data is linearly separable in dimension $j$, i.e. $x_{i,j} < t$ for all $(x_i, y_i)$ such that $y_i = 1$. Which of the following is true about the corresponding weight $w_j$ when we train the model for infinite iterations?

(A) The weight $w_j$ will be 0.

(B) The weight $w_j$ will be encouraged to grow continuously during each iteration of SGD and can go to infinity or $-\infty$.
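One reading of the separability definition above can be checked mechanically: after sorting by dimension $j$, the labels should form at most two runs (one label below the threshold, the other above). This is a sketch under that interpretation; the helper name is mine:

```python
def separable_in_dim(X, y, j):
    """One-dimensional separability check: sort points by dimension j and
    count label changes; separable iff the labels change at most once."""
    labels = [yi for _, yi in sorted(zip([xi[j] for xi in X], y))]
    switches = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return switches <= 1

print(separable_in_dim([[0.1], [0.5], [2.0]], [-1, -1, 1], 0))  # True
print(separable_in_dim([[0.1], [0.5], [2.0]], [-1, 1, -1], 0))  # False
```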
(b) (5 pts) We now add a new term (called $\ell_2$-regularization) to the loss function:

$$J(w, b) = \frac{1}{N} \sum_{i=1}^{N} \log\Big(1 + \exp\big(-y_i (w^T x_i + b)\big)\Big) + 0.1 \sum_{j=0}^{M} w_j^2. \tag{3}$$
What will be the update rule of wj for the new loss function?
(c) (3 pts) How does $\ell_2$-regularization help correct the problem in (a)?
3 Neural Networks and Deep Learning [20 pts]
In the following set of questions, we will discuss feed-forward and backpropagation for neural
networks.
We will consider two kinds of hidden nodes/units for our neural network. The first cell (Cell A) operates as a weighted sum with no activation function: given a set of input nodes and their weights, it outputs the weighted sum. The second cell (Cell B) operates as a power function: it takes only one input along with a weight and outputs the input raised to the power of the weight. We provide illustrations for both nodes in Figure 1. Note that $x = [x_1, x_2, \ldots, x_n]^T$ and $w = [w_1, w_2, \ldots, w_n]^T$.
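The two cell types can be sketched as plain functions (the wiring of Figure 2 is not reproduced in the text, so this shows only the cells themselves; function names are mine):

```python
def cell_a(x, w):
    """Cell A: weighted sum of the inputs with no activation, w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def cell_b(x, w):
    """Cell B: the single input raised to the power of its weight, x ** w."""
    return x ** w

print(cell_a([1.0, 3.0, 2.0], [1.0, 0.0, 2.0]))  # 1*1 + 0*3 + 2*2 = 5.0
print(cell_b(3.0, 2.0))  # 3^2 = 9.0
```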
Using these two kinds of hidden nodes, we build a two-layer neural network as shown in Figure 2. The input to the neural network is $x = [x_1, x_2, x_3]^T$ and the output is $\hat{y}$. The parameters of the model are $v = [v_1, v_2, v_3]^T$ and $w = [w_1, w_2, w_3]^T$.

(a) (4 pts) We would like to compute the feed-forward pass for the network. Mathematically, compute the expression for $\hat{y}$ given the neural network, in terms of $x$, $v$, and $w$.

(b) (2 pts) Using the expression for $\hat{y}$ computed in (a), compute the predicted value $\hat{y}$ when $x = [1, 3, 2]^T$, $v = [1, 0, 2]^T$, $w = [-4, 3, 1]^T$.