Backpropagation

Backpropagation is a method used in artificial neural networks to calculate the gradient needed to compute the weights used in the network. It is commonly used to train deep neural networks, a term that refers to neural networks with more than one hidden layer.

Backpropagation is a special case of an older and more general technique called automatic differentiation. In the context of learning, backpropagation is commonly used by the gradient descent optimization algorithm to adjust the weights of neurons by calculating the gradient of the loss function. The technique is also sometimes called backward propagation of errors, because the error is calculated at the output and distributed back through the network layers.

The backpropagation algorithm has been repeatedly rediscovered and is equivalent to automatic differentiation in reverse accumulation mode. Backpropagation requires the derivative of the loss function with respect to the network output to be known, which typically (but not necessarily) means that a desired target value is known. For this reason it is considered a supervised learning method, although it is also used in some unsupervised networks such as autoencoders. Backpropagation is also a generalization of the delta rule to multi-layered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer. It is closely related to the Gauss-Newton algorithm, and is part of continuing research in neural backpropagation. Backpropagation can be used with any gradient-based optimizer, such as L-BFGS or truncated Newton.

Motivation

The goal of any supervised learning algorithm is to find a function that best maps a set of inputs to their correct outputs. An example is a classification task, in which the input is an image of an animal and the correct output is the name of the animal.

The motivation for backpropagation is to train a multi-layered neural network such that it can learn the appropriate internal representations to allow it to learn any arbitrary mapping of input to output.

Loss function

Sometimes referred to as the cost function or error function (not to be confused with the Gauss error function), the loss function is a function that maps the values of one or more variables onto a real number that intuitively represents some "cost" associated with those values. For backpropagation, the loss function calculates the difference between the network output and its expected output, after a training case has propagated through the network.

Assumptions

Two assumptions must be made about the form of the error function. The first is that it can be written as an average $E = \frac{1}{n} \sum_{x} E_{x}$ over error functions $E_{x}$, for $n$ individual training examples $x$. The reason for this assumption is that the backpropagation algorithm calculates the gradient of the error function for a single training example, which needs to be generalized to the overall error function. The second assumption is that it can be written as a function of the outputs of the neural network.

Example loss function

Let $y, y'$ be vectors in $\mathbb{R}^{n}$.

Select an error function $E(y, y')$ measuring the difference between two outputs. The standard choice is the square of the Euclidean distance between the vectors $y$ and $y'$:

$E(y, y') = \tfrac{1}{2} \lVert y - y' \rVert^{2}$

Note that the factor of $\tfrac{1}{2}$ conveniently cancels the exponent when the error function is subsequently differentiated.

The error function over $n$ training examples can simply be written as an average of the losses over the individual examples:

$E = \frac{1}{2n} \sum_{x} \lVert y(x) - y'(x) \rVert^{2}$

and therefore, the partial derivative with respect to the outputs:

$\frac{\partial E}{\partial y'} = y' - y$
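The squared-error loss and its derivative with respect to the output can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, vectors are plain lists of floats, and treating $y'$ as the network output (with $y$ as the target) is an assumption made here for concreteness.

```python
def squared_error(y, y_prime):
    """E(y, y') = 1/2 * ||y - y'||^2 for two equal-length vectors."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(y, y_prime))


def squared_error_grad(y, y_prime):
    """Partial derivative of E with respect to y' (assumed to be the
    network output): the vector y' - y. The factor 1/2 cancels the 2."""
    return [b - a for a, b in zip(y, y_prime)]


def mean_loss(targets, outputs):
    """Average of the per-example losses over n training examples."""
    n = len(targets)
    return sum(squared_error(y, yp) for y, yp in zip(targets, outputs)) / n


# Two one-dimensional examples: targets 1 and 3, outputs 0 and 1.
print(mean_loss([[1.0], [3.0]], [[0.0], [1.0]]))
print(squared_error_grad([1.0, 0.0], [0.0, 0.0]))
```

Because the overall error is an average of per-example errors, its gradient is the average of the per-example gradients, which is why computing the gradient for a single example is enough.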

Optimization

The optimization algorithm repeats a two-phase cycle: propagation and weight update. When an input vector is presented to the network, it is propagated forward through the network, layer by layer, until it reaches the output layer. The output of the network is then compared to the desired output, using a loss function. The resulting error value is calculated for each of the neurons in the output layer. The error values are then propagated from the output back through the network, until each neuron has an associated error value that reflects its contribution to the original output.

Backpropagation uses these error values to calculate the gradient of the loss function. In the second phase, this gradient is fed to the optimization method, which in turn uses it to update the weights, in an attempt to minimize the loss function.

Algorithm

Let $N$ be a neural network with $e$ connections, $m$ inputs, and $n$ outputs.

Below, $x_{1}, x_{2}, \dots$ will denote vectors in $\mathbb{R}^{m}$, $y_{1}, y_{2}, \dots$ vectors in $\mathbb{R}^{n}$, and $w_{0}, w_{1}, w_{2}, \ldots$ vectors in $\mathbb{R}^{e}$. These are called inputs, outputs and weights respectively.

The neural network corresponds to a function $y = f_{N}(w, x)$ which, given a weight $w$, maps an input $x$ to an output $y$.

The optimization takes as input a sequence of training examples $(x_{1}, y_{1}), \dots, (x_{p}, y_{p})$ and produces a sequence of weights $w_{0}, w_{1}, \dots, w_{p}$ starting from some initial weight $w_{0}$, usually chosen at random.

These weights are computed in turn: first compute $w_{i}$ using only $(x_{i}, y_{i}, w_{i-1})$ for $i = 1, \dots, p$. The output of the algorithm is then $w_{p}$, giving us a new function $x \mapsto f_{N}(w_{p}, x)$. The computation is the same in each step, hence only the case $i = 1$ is described.

Calculating $w_{1}$ from $(x_{1}, y_{1}, w_{0})$ is done by considering a variable weight $w$ and applying gradient descent to the function $w \mapsto E(f_{N}(w, x_{1}), y_{1})$ to find a local minimum, starting at $w = w_{0}$.

This makes $w_{1}$ the minimizing weight found by gradient descent.
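The sequential scheme described above can be sketched in Python as follows. The names `gradient_descent`, `train_online` and `example_grad`, as well as the fixed learning rate and step count, are assumptions introduced for illustration; the text does not specify how long gradient descent runs on each example.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Follow the negative gradient from w0 for a fixed number of steps."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w


def train_online(examples, w0, example_grad, learning_rate=0.1, steps=100):
    """Return [w_0, w_1, ..., w_p], computing each w_i from
    (x_i, y_i, w_{i-1}) only, as in the description above."""
    weights = [list(w0)]
    for x, y in examples:
        w_prev = weights[-1]
        w_next = gradient_descent(
            lambda w: example_grad(w, x, y), w_prev, learning_rate, steps
        )
        weights.append(w_next)
    return weights


# Hypothetical one-weight model f(w, x) = w[0] * x with E = (f - y)^2.
def linear_grad(w, x, y):
    return [2.0 * (w[0] * x - y) * x]


history = train_online([(1.0, 2.0), (2.0, 4.0)], [0.0], linear_grad)
print(history[-1])  # final weight w_p
```

Here `example_grad(w, x, y)` stands for the gradient of $w \mapsto E(f_{N}(w, x), y)$, which is exactly the quantity backpropagation supplies for a real network.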

Algorithm in code

To implement the algorithm above, explicit formulas are required for the gradient of the function $w \mapsto E(f_{N}(w, x), y)$ where the function is $E(y, y') = |y - y'|^{2}$.

The learning algorithm can be divided into two phases: propagation and weight update.

Phase 1: propagation

Each propagation involves the following steps:

Propagation forward through the network to generate the output value(s)
Calculation of the cost (error term)
Propagation of the output activations back through the network, using the training pattern target, to generate the deltas (the difference between the targeted and actual output values) of all output and hidden neurons.

Phase 2: weight update

For each weight, the following steps should be followed:

The weight's output delta and input activation are multiplied to find the gradient of the weight.
A ratio (percentage) of the weight's gradient is subtracted from the weight.

This ratio (percentage) influences the speed and quality of learning; it is called the learning rate. The greater the ratio, the faster the neuron trains, but the lower the ratio, the more accurate the training is. The sign of the gradient of a weight indicates whether the error varies directly with, or inversely to, the weight. Therefore, the weight must be updated in the opposite direction, "descending" the gradient.

Learning is repeated (on new batches) until the network performs adequately.
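The two steps of the weight-update phase amount to a one-line rule per weight. The sketch below is illustrative only; the function names and the numeric values are assumptions, not part of the original description.

```python
def weight_gradient(delta, input_activation):
    """Gradient of one weight: its output delta times its input activation."""
    return delta * input_activation


def update_weight(weight, delta, input_activation, learning_rate):
    """Subtract a ratio (the learning rate) of the gradient from the weight,
    moving the weight in the direction opposite to the gradient."""
    return weight - learning_rate * weight_gradient(delta, input_activation)


# Hypothetical values: a positive gradient lowers the weight,
# a negative gradient raises it.
print(update_weight(0.5, 0.2, 1.0, learning_rate=0.1))
print(update_weight(0.5, -0.2, 1.0, learning_rate=0.1))
```

A larger `learning_rate` takes bigger steps per update, and a smaller one takes more cautious steps, matching the trade-off between speed and accuracy described above.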

Pseudocode

Here is pseudocode for a stochastic gradient descent algorithm for training a three-layer network (only one hidden layer):

  initialize network weights (often small random values)
  do
     forEach training example named ex
        prediction = neural-net-output(network, ex)  // forward pass
        actual = teacher-output(ex)
        compute error (prediction - actual) at the output units
        compute $\Delta w_{h}$ for all weights from hidden layer to output layer  // backward pass
        compute $\Delta w_{i}$ for all weights from input layer to hidden layer  // backward pass continued
        update network weights  // input layer not modified by error estimate
  until all examples are classified correctly or another stopping criterion is met
  return the network

The lines labeled "backward pass" can be implemented using the backpropagation algorithm, which calculates the gradient of the error of the network with respect to the network's modifiable weights.
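A minimal runnable sketch of the pseudocode, in pure Python, might look as follows. Several choices here are assumptions not fixed by the text: sigmoid activations, a bias weight per neuron, the loss $\tfrac{1}{2}\lVert \text{output} - \text{target} \rVert^{2}$, a fixed number of epochs as the stopping criterion, and all function names.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_in, n_hidden, n_out, seed=0):
    """Small random weights; each row is one neuron's weights plus a bias."""
    rng = random.Random(seed)
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    return w_h, w_o


def forward(network, x):
    """Forward pass: returns hidden activations and network output."""
    w_h, w_o = network
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_h]
    output = [sigmoid(sum(w * v for w, v in zip(row, hidden + [1.0]))) for row in w_o]
    return hidden, output


def train_step(network, x, target, learning_rate):
    w_h, w_o = network
    hidden, output = forward(network, x)
    # Backward pass: deltas of the output units (sigmoid derivative is o * (1 - o)).
    delta_o = [(o - t) * o * (1.0 - o) for o, t in zip(output, target)]
    # Backward pass continued: deltas of the hidden units, using the
    # output weights as they were before this update.
    delta_h = [
        h * (1.0 - h) * sum(delta_o[k] * w_o[k][j] for k in range(len(w_o)))
        for j, h in enumerate(hidden)
    ]
    # Weight update: subtract learning_rate * delta * input activation.
    for k, row in enumerate(w_o):
        for j, v in enumerate(hidden + [1.0]):
            row[j] -= learning_rate * delta_o[k] * v
    for j, row in enumerate(w_h):
        for i, v in enumerate(x + [1.0]):
            row[i] -= learning_rate * delta_h[j] * v


def total_loss(network, examples):
    return sum(
        0.5 * sum((o - t) ** 2 for o, t in zip(forward(network, x)[1], target))
        for x, target in examples
    )


def train(network, examples, learning_rate=0.5, epochs=200):
    """Assumed stopping criterion: a fixed number of passes over the data."""
    for _ in range(epochs):
        for x, target in examples:
            train_step(network, x, target, learning_rate)
    return network


# Hypothetical training set: the logical OR of two inputs.
examples = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
            ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
net = init_network(2, 2, 1)
loss_before = total_loss(net, examples)
train(net, examples)
loss_after = total_loss(net, examples)
print(loss_before, loss_after)
```

The weights are updated after every example, which is what makes this a stochastic (online) variant rather than full-batch gradient descent.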

Intuition

Learning as an optimization problem

To understand the mathematical derivation of the backpropagation algorithm, it helps to first develop some intuition about the relationship between the actual output of a neuron and the correct output for a particular training case. Consider a simple neural network with two input units, one output unit and no hidden units. Each neuron uses a linear output which is the weighted sum of its inputs.

Initially, before training, the weights will be set randomly. Then the neuron learns from training examples, which in this case consist of a set of tuples $(x_{1}, x_{2}, t)$ where $x_{1}$ and $x_{2}$ are the inputs to the network and $t$ is the correct output (the output the network should eventually produce given those inputs). The initial network, given $x_{1}$ and $x_{2}$, will compute an output $y$ that likely differs from $t$ (given random weights). A common method for measuring the discrepancy between the expected output $t$ and the actual output $y$ is the squared error measure:

$E = (t - y)^{2},$

where $E$ is the discrepancy or error.

As an example, consider the network on a single training case: $(1, 1, 0)$, thus the inputs $x_{1}$ and $x_{2}$ are 1 and 1 respectively and the correct output, $t$, is 0. Now if the actual output $y$ is plotted on the horizontal axis against the error $E$ on the vertical axis, the result is a parabola. The minimum of the parabola corresponds to the output $y$ which minimizes the error $E$. For a single training case, the minimum also touches the horizontal axis, which means the error will be zero and the network can produce an output $y$ that exactly matches the expected output $t$. Therefore, the problem of mapping inputs to outputs can be reduced to an optimization problem of finding a function that will produce the minimal error.

However, the output of a neuron depends on the weighted sum of all its inputs:

$y = x_{1} w_{1} + x_{2} w_{2},$

where $w_{1}$ and $w_{2}$ are the weights applied to the inputs $x_{1}$ and $x_{2}$.
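For the single training case $(1, 1, 0)$, this intuition can be checked numerically. The sketch below applies plain gradient descent to the two weights of the linear neuron; the starting weights, learning rate and step count are arbitrary assumptions chosen for illustration.

```python
def neuron_output(w1, w2, x1, x2):
    """Linear neuron: weighted sum of the two inputs."""
    return x1 * w1 + x2 * w2


def error(w1, w2, x1, x2, t):
    """Squared error E = (t - y)^2."""
    return (t - neuron_output(w1, w2, x1, x2)) ** 2


def error_gradient(w1, w2, x1, x2, t):
    """dE/dw1 = -2 (t - y) x1 and dE/dw2 = -2 (t - y) x2."""
    y = neuron_output(w1, w2, x1, x2)
    common = -2.0 * (t - y)
    return common * x1, common * x2


def descend(w1, w2, x1, x2, t, learning_rate=0.1, steps=50):
    """Repeatedly step both weights against the gradient of E."""
    for _ in range(steps):
        g1, g2 = error_gradient(w1, w2, x1, x2, t)
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
    return w1, w2


# Training case (x1, x2, t) = (1, 1, 0), arbitrary starting weights.
start = (0.3, 0.5)
print(error(*start, 1.0, 1.0, 0.0))      # error before training
trained = descend(*start, 1.0, 1.0, 0.0)
print(error(*trained, 1.0, 1.0, 0.0))    # error after training
```

Since the error depends on the weights only through $y$, descending the gradient in weight space moves $y$ toward the bottom of the parabola, where the error is zero.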

Source of the article : Wikipedia

Tuesday, 10 July 2018