Logistic Regression and Neural Networks

Logistic Regression [20 pts]
Consider the logistic regression model for binary classification that takes input features $x_i \in \mathbb{R}^m$ and predicts $y_i \in \{-1, 1\}$. As we learned in class, the logistic regression model fits the probability $P(y_i = 1 \mid x_i)$ using the sigmoid function:
$$P(y_i = 1 \mid x_i) = \sigma(w^T x_i + b) = \frac{1}{1 + \exp\left(-(w^T x_i + b)\right)}, \tag{1}$$
where the sigmoid function is defined, for any $z \in \mathbb{R}$, as
$$\sigma(z) = \left(1 + \exp(-z)\right)^{-1}.$$
Given $N$ training data points, we learn the logistic regression model by minimizing the following loss function:
$$J(w, b) = \frac{1}{N} \sum_{i=1}^{N} \log\left(1 + \exp\left(-y_i (w^T x_i + b)\right)\right). \tag{2}$$
(a) (7 pts) Prove the following. For any $z \in \mathbb{R}$, the derivative of the sigmoid function is
$$\frac{d\sigma(z)}{dz} = \sigma(z)\left(1 - \sigma(z)\right).$$
(Hint: use the chain rule, and $\frac{d}{dx}\exp(x) = \exp(x)$.)
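While the proof itself is the exercise, the identity is easy to sanity-check numerically before (or after) deriving it. A minimal sketch using a central finite difference (function names here are illustrative, not part of the assignment):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Central-difference check that d sigma / dz = sigma(z) * (1 - sigma(z)).
eps = 1e-6
for z in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    analytic = sigmoid(z) * (1 - sigmoid(z))
    assert abs(numeric - analytic) < 1e-8
```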
Suppose we are updating the weight $w$ and bias term $b$ with the gradient descent algorithm, where the learning rate is $\alpha$. We can derive the update rules as
$$w_j \leftarrow w_j - \alpha \frac{\partial J(w, b)}{\partial w_j}, \quad \text{where} \quad \frac{\partial J(w, b)}{\partial w_j} = -\frac{1}{N} \sum_{i=1}^{N} \left(1 - \sigma\left(y_i (w^T x_i + b)\right)\right) y_i\, x_{i,j},$$
$$b \leftarrow b - \alpha \frac{\partial J(w, b)}{\partial b}, \quad \text{where} \quad \frac{\partial J(w, b)}{\partial b} = -\frac{1}{N} \sum_{i=1}^{N} \left(1 - \sigma\left(y_i (w^T x_i + b)\right)\right) y_i,$$
where $w_j$ denotes the $j$-th element of the weight vector $w$ and $x_{i,j}$ denotes the $j$-th element of the vector $x_i$.
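These update rules translate directly into code. A minimal NumPy sketch of one full-batch gradient step (the function and variable names are illustrative choices, not prescribed by the assignment):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, y, alpha):
    """One gradient-descent step on J(w, b).

    X: (N, m) feature matrix; y: (N,) labels in {-1, +1}.
    """
    margins = y * (X @ w + b)              # y_i (w^T x_i + b)
    residues = 1.0 - sigmoid(margins)      # 1 - sigma(y_i (w^T x_i + b))
    grad_w = -(residues * y) @ X / len(y)  # dJ/dw, one entry per w_j
    grad_b = -np.mean(residues * y)        # dJ/db
    return w - alpha * grad_w, b - alpha * grad_b
```

Note how the gradient shrinks toward zero for confidently correct points (residue near 0) and stays large for misclassified ones, which is the behavior parts (b)–(d) below explore.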
We now compare the stochastic gradient descent (SGD) algorithm for logistic regression with the Perceptron algorithm. Let us define the residue for example $i$ as $\delta_i = 1 - \sigma\left(y_i (w^T x_i + b)\right)$. Assume the learning rate for both algorithms is $\alpha$.
(b) (5 pts) If the data point $(x_i, y_i)$ is correctly classified with high confidence, say $y_i = 1$ and $w^T x_i + b = 5$, what will be the residue? For logistic regression, what will be the update with respect to this data point? What about the update of the Perceptron algorithm? (Hint: you may assume $\sigma(5) \approx 0.9933$. The answer should take the form $w_j \leftarrow w_j - [?]$ or $w_j \leftarrow w_j + [?]$, where you fill in $[?]$. You can directly use $x_{i,j}$ and $\alpha$ as variables.)
(c) (5 pts) If the data point $(x_i, y_i)$ is misclassified, say $y_i = 1$ and $w^T x_i + b = -1$, what will be the residue, and how is the weight updated for the two algorithms? (Hint: you may assume $\sigma(-1) \approx 0.2689$.)
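The residues in (b) and (c) can be computed directly from the definition $\delta_i = 1 - \sigma(y_i(w^T x_i + b))$. A quick numerical check (for intuition only; the graded answer should be written out symbolically):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# (b) correctly classified with high confidence: y_i = 1, w^T x_i + b = 5
residue_b = 1.0 - sigmoid(1 * 5)    # approx 1 - 0.9933 = 0.0067
# (c) misclassified: y_i = 1, w^T x_i + b = -1
residue_c = 1.0 - sigmoid(1 * -1)   # approx 1 - 0.2689 = 0.7311
print(round(residue_b, 4), round(residue_c, 4))  # prints: 0.0067 0.7311
```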
(d) (3 pts) For the update steps of the SGD and Perceptron algorithms, is the influence of the input data $x_i$ of the same magnitude? If not, what is the difference?
Regularization [10 pts]
Data is separable in one dimension if there exists a threshold t such that all data that have values
less than t in this dimension have the same class label. Suppose we train logistic regression for
infinite iterations according to our current gradient descent update on the training data that is
separable in at least one dimension.
(a) (2 pts) (multiple choice) Suppose the data is linearly separable in dimension $j$, i.e. $x_{i,j} < t$ for all $(x_i, y_i)$ such that $y_i = 1$. Which of the following is true about the corresponding weight $w_j$ when we train the model for infinite iterations?
(A) The weight $w_j$ will be 0.
(B) The weight $w_j$ will be encouraged to grow continuously during each iteration of SGD and can go to infinity or negative infinity.

(b) (5 pts) We now add a new term (called l2-regularization) to the loss function:
$$J(w, b) = \frac{1}{N} \sum_{i=1}^{N} \log\left(1 + \exp\left(-y_i (w^T x_i + b)\right)\right) + 0.1 \sum_{j=0}^{M} w_j^2. \tag{3}$$
What will be the update rule of $w_j$ for the new loss function?

(c) (3 pts) How does l2 regularization help correct the problem in (a)?

Neural Networks and Deep Learning [20 pts]
In the following set of questions, we will discuss feed-forward and backpropagation for neural networks. We will consider two kinds of hidden nodes/units for our neural network. The first cell (Cell A) operates as a weighted sum with no activation function: given a set of input nodes and their weights, it outputs the weighted sum. The second cell (Cell B) operates as a power function: it takes only one input along with a weight and outputs the input raised to the power of the weight. We provide illustrations for both nodes in Figure 1. Note that $x = [x_1, x_2, \ldots, x_n]^T$ and $w = [w_1, w_2, \ldots, w_n]^T$.
Using these two kinds of hidden nodes, we build a two-layer neural network as shown in Figure 2. The input to the neural network is $x = [x_1, x_2, x_3]^T$ and the output is $\hat{y}$. The parameters of the model are $v = [v_1, v_2, v_3]^T$ and $w = [w_1, w_2, w_3]^T$.

(a) (4 pts) We would like to compute the feed-forward pass for the network. Mathematically, compute the expression for $\hat{y}$ given the neural network in terms of $x$, $v$, $w$.

(b) (2 pts) Utilizing the expression for $\hat{y}$ computed in (a), compute the predicted value $\hat{y}$ when $x = [1, 3, 2]^T$, $v = [1, 0, 2]^T$, $w = [-4, 3, 1]^T$.
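Figure 2 is not reproduced here, so the exact wiring is an assumption: one natural reading is a hidden layer of three Cell B units, each raising one input $x_j$ to the power $v_j$, feeding a single Cell A output node with weights $w$. Under that assumed architecture, the feed-forward pass is a short sketch (verify the expression against the actual figure before using it):

```python
def feed_forward(x, v, w):
    """Assumed wiring: hidden Cell B units compute x_j ** v_j,
    then one Cell A unit outputs the w-weighted sum."""
    hidden = [xj ** vj for xj, vj in zip(x, v)]       # Cell B layer
    return sum(wj * hj for wj, hj in zip(w, hidden))  # Cell A output
```

Under this reading, the values in (b) give $\hat{y} = -4 \cdot 1^1 + 3 \cdot 3^0 + 1 \cdot 2^2 = 3$; if the figure wires the cells differently, the expression (and the number) will differ.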
