

# Analog Back Propagation on-chip learning



Maurizio Valle



## On-chip learning architecture (architectural mapping)




## The synaptic module

- **F** is the feed-forward four-quadrant multiplier:  $W_{k,j} X_j$
- **B1** is the backward four-quadrant multiplier:  $\delta_k W_{k,j}$
- **B2** is the weight update four-quadrant multiplier:  $\Delta W_{k,j} = \eta_{k,j} \delta_k X_j$
- **B2** also generates the sign $S_{k,j}$: $S_{k,j} = \text{sign}\left(\frac{\partial \varepsilon_p}{\partial W_{k,j}}\right) = -\text{sign}(\Delta W_{k,j})$
- **H** is the local learning rate adaptation circuit block
- **WU** is the weight block:  $W_{k,j}^{\text{new}} = W_{k,j}^{\text{old}} + \Delta W_{k,j}$

The WU block also provides short-term storage of the weight value.
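The block equations above can be collected into a short behavioural sketch of one synapse. This is only a numerical illustration, not the analog circuit: the function name `synapse_step` and the sample values are assumptions introduced here.

```python
import math

def synapse_step(w, x_j, delta_k, eta):
    """Behavioural model of one synapse (k, j); all quantities are plain floats."""
    forward = w * x_j               # F:  W_kj * X_j
    backward = delta_k * w          # B1: delta_k * W_kj
    delta_w = eta * delta_k * x_j   # B2: eta_kj * delta_k * X_j
    # S_kj = -sign(dW_kj); defined as 0 here when the update is exactly zero (assumption)
    sign = -math.copysign(1.0, delta_w) if delta_w != 0 else 0.0
    w_new = w + delta_w             # WU: W_new = W_old + dW
    return forward, backward, delta_w, sign, w_new

# Hypothetical sample values, chosen only for illustration
forward, backward, delta_w, sign, w_new = synapse_step(
    w=0.5, x_j=0.8, delta_k=-0.25, eta=0.1
)
print(forward, backward, delta_w, sign, w_new)
```

A negative update gives a positive sign, which matches $S_{k,j} = -\text{sign}(\Delta W_{k,j})$.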

## Neuron module

- **A, activation function module:**  $X_{k(j)} = \Psi(a_{k(j)})$
- **D, derivative module:**  $D_{k(j)} = 1 - (X_{k(j)})^2$
- **R, the error multiplier:** $\delta_k = (\bar{X}_k - X_k)D_k$ for an output neuron $k$; $\delta_j = \left(\sum_k \delta_k W_{k,j}\right)D_j$ for a hidden neuron $j$
- **FC, the error circuit:**  $\epsilon_k = (\bar{X}_k - X_k)^2$
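The four neuron blocks can be sketched numerically as follows. The sketch assumes $\Psi = \tanh$, which is consistent with the derivative $1 - X^2$ but is not stated explicitly on the slide; all function names are hypothetical.

```python
import math

def neuron_forward(a):
    """A and D blocks: activation and its derivative (assumes Psi = tanh)."""
    x = math.tanh(a)      # A: X = Psi(a)
    d = 1.0 - x * x       # D: D = 1 - X^2
    return x, d

def output_delta(target, x, d):
    """R block for an output neuron: delta_k = (target - X_k) * D_k."""
    return (target - x) * d

def hidden_delta(deltas_k, weights_kj, d_j):
    """R block for a hidden neuron: delta_j = (sum_k delta_k * W_kj) * D_j."""
    return sum(dk * w for dk, w in zip(deltas_k, weights_kj)) * d_j

def squared_error(target, x):
    """FC block: eps_k = (target - X_k)^2."""
    return (target - x) ** 2

x, d = neuron_forward(0.3)
print(x, d, output_delta(1.0, x, d), squared_error(1.0, x))
```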

## Correspondence tables between neural and electrical variables

| Synaptic module          |                      | Neuron module                   |                      |
|--------------------------|----------------------|---------------------------------|----------------------|
| Neural variable          | Electrical variable  | Neural variable                 | Electrical variable  |
| $X_j$                    | $V_X$                | $a_{k(j)}$                      | $I_a$                |
| $W_{k,j} \cdot X_j$      | $I_{WX}$             | $X_{j(k)}$                      | $V_X$                |
| $W_{k,j}$                | $V_W$                | $\bar{X}_k$                     | $V_T$                |
| $\Delta W_{k,j}$         | $I_{\Delta W}$       | $\sum_k \delta_k \cdot W_{k,j}$ | $V_{\delta W}$       |
| $\delta_k$               | $V_\delta$           | $\delta_{k(j)}$                 | $V_\delta$           |
| $\delta_k \cdot W_{k,j}$ | $I_{\delta W}$       | $D_{k(j)}$                      | $I_d$                |
| $\eta_{k,j}$             | $I_\eta$             | $\epsilon_k$                    | $I_\epsilon$         |
| $S_{k,j}$                | $V_{S\eta}$          |                                 |                      |

## On-chip learning algorithm

```
repeat (iteration index k)
  select a pattern P at random from the training set
  apply P to the inputs of the MLP
  perform the feed-forward phase
  parallel for each synapse (j, i)
    compute  $\Delta W_{j,i}(k) = -\eta_{j,i}(k) \cdot \frac{\partial E(k)}{\partial W_{j,i}}$ 
    compute  $S_{j,i}(k)$ 
     $W_{j,i}(k+1) = W_{j,i}(k) + \Delta W_{j,i}(k)$ 
    if  $S_{j,i}(k) = S_{j,i}(k-1)$ 
       $\eta_{j,i}(k+1) = \eta_{j,i}(k) \cdot \left[ \frac{\eta_{\max}}{\eta_{j,i}(k)} \right]^\gamma$ 
    else
       $\eta_{j,i}(k+1) = \eta_{j,i}(k) \cdot \left[ \frac{\eta_{\min}}{\eta_{j,i}(k)} \right]^\gamma$ 
    endif
  end parallel
until convergence is reached
```

Slide annotations on the algorithm steps, in order: off-chip, off-chip, **on-chip**, off-chip.
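The local learning rate rule in the algorithm can be sketched in a few lines. The function name and the value $\gamma = 0.5$ are assumptions made for illustration; the slides do not give a value for $\gamma$. The bounds 0.4 and 0.7 are the ones quoted later for the experiments.

```python
def adapt_learning_rate(eta, s_now, s_prev, eta_min, eta_max, gamma):
    """Move eta towards eta_max when the gradient sign repeats, else towards eta_min.

    eta(k+1) = eta(k) * (eta_target / eta(k)) ** gamma
    """
    eta_target = eta_max if s_now == s_prev else eta_min
    return eta * (eta_target / eta) ** gamma

# gamma = 0.5 is a hypothetical choice
eta_up = adapt_learning_rate(0.5, s_now=1, s_prev=1, eta_min=0.4, eta_max=0.7, gamma=0.5)
eta_down = adapt_learning_rate(0.5, s_now=1, s_prev=-1, eta_min=0.4, eta_max=0.7, gamma=0.5)
print(eta_up, eta_down)
```

Since the update equals $\eta^{1-\gamma}\,\eta_{target}^{\gamma}$, for $0 < \gamma < 1$ it is a geometric interpolation between the current rate and the bound, so the rate stays within $[\eta_{\min}, \eta_{\max}]$ once it starts there.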

## F and B1 four-quadrant multiplier



## The $\Psi$ Block

The $\Psi$ block is a non-linear transconductor that converts the weight voltage $V_w$ into a differential current $I_w = I_{wp} - I_{wn}$. Since $M_1$ and $M_2$ have the same aspect ratio ($W/L$), as do $M_3$ and $M_4$, and assuming all four transistors are biased in strong inversion, we can write:

$$I_{wn} = \begin{cases} \beta_n (V_w - V_{th1} - V_{th2})^2 & V_w \geq V_{th1} + V_{th2} \\ 0 & V_w < V_{th1} + V_{th2} \end{cases}$$

$$I_{wp} = \begin{cases} \beta_p (V_w - V_{dd} - V_{th3} - V_{th4})^2 & V_w \leq V_{dd} + V_{th3} + V_{th4} \\ 0 & V_w > V_{dd} + V_{th3} + V_{th4} \end{cases}$$

where

$$\frac{1}{\sqrt{\beta_n}} = \frac{1}{\sqrt{\beta_1}} + \frac{1}{\sqrt{\beta_2}} \quad \frac{1}{\sqrt{\beta_p}} = \frac{1}{\sqrt{\beta_3}} + \frac{1}{\sqrt{\beta_4}}$$

$\beta_i$ and $V_{thi}$ are the gain factor and the threshold voltage of $M_i$ ($i = 1, \dots, 4$), respectively.
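As a numerical illustration of the relations above, the sketch below combines two gain factors into the equivalent $\beta_n$ and evaluates the n-channel branch current $I_{wn}$. Only the n-channel branch is modelled; the function names and all device values are hypothetical.

```python
def beta_series(beta_a, beta_b):
    """Equivalent gain factor: 1/sqrt(beta_eq) = 1/sqrt(beta_a) + 1/sqrt(beta_b)."""
    inv_sqrt = beta_a ** -0.5 + beta_b ** -0.5
    return inv_sqrt ** -2

def nmos_branch_current(v_w, beta_n, v_th1, v_th2):
    """I_wn: quadratic above V_th1 + V_th2, zero below (strong inversion assumed)."""
    v_on = v_th1 + v_th2
    if v_w < v_on:
        return 0.0
    return beta_n * (v_w - v_on) ** 2

# Hypothetical device values (A/V^2 and V)
beta_n = beta_series(4e-5, 4e-5)   # two equal devices give beta / 4
print(beta_n, nmos_branch_current(2.0, beta_n, 0.7, 0.7))
```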

## The OTA Block

The resulting differential output current $I_{wx}$ can be written as:

$$I_{wx} = (I_{wp} - I_{wn}) \tanh\left(\frac{V_x - V_r}{2nU_T}\right) = g_w(V_w) \tanh\left(\frac{V_x - V_r}{2nU_T}\right)$$

where $n$ is the weak inversion slope coefficient, $U_T$ is the thermal voltage, and $V_r$ is the signal ground (i.e. the synaptic input is null for $V_x = V_r$).

If the argument of the $\tanh$ function is small (i.e. $|V_x - V_r| \leq 100$ mV), the $\tanh$ can be approximated by its argument:

$$I_{wx} \approx \frac{1}{2nU_T} g_w(V_w)(V_x - V_r)$$
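The sketch below compares the $\tanh$ expression with its linear approximation, so the deviation can be checked for a given input swing around $V_r$. The slope coefficient $n = 1.5$ and the values of $g_w$ and $V_r$ are assumptions; $U_T \approx 25.85$ mV is the thermal voltage at about 300 K.

```python
import math

def ota_current(g_w, v_x, v_r, n, u_t):
    """Synaptic output current with the full tanh characteristic."""
    return g_w * math.tanh((v_x - v_r) / (2 * n * u_t))

def ota_current_linear(g_w, v_x, v_r, n, u_t):
    """Small-signal approximation: tanh replaced by its argument."""
    return g_w * (v_x - v_r) / (2 * n * u_t)

N_SLOPE = 1.5      # hypothetical weak inversion slope coefficient
U_T = 0.02585      # thermal voltage in volts at about 300 K
G_W = 1e-6         # hypothetical g_w(V_w) in amperes
V_R = 2.5          # hypothetical signal ground in volts

for dv in (0.005, 0.02, 0.1):
    exact = ota_current(G_W, V_R + dv, V_R, N_SLOPE, U_T)
    linear = ota_current_linear(G_W, V_R + dv, V_R, N_SLOPE, U_T)
    print(f"dV = {dv:.3f} V  relative deviation = {(linear - exact) / exact:.3f}")
```

The deviation grows with the input swing, so how tight the approximation is at the 100 mV bound depends on the actual $n$ and $U_T$.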

## The B2 multiplier



## The Weight Unit

$$V_w(t_0 + T) = V_w(t_0) + \frac{T}{C_w} I_{\Delta W}$$
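The update equation is the integration of a constant current on the weight capacitor. A minimal sketch with hypothetical component values (1 pF, 10 nA, 1 μs) gives a feel for the step size:

```python
def weight_voltage_update(v_w, i_dw, t_pulse, c_w):
    """V_w(t0 + T) = V_w(t0) + (T / C_w) * I_dW for a constant update current."""
    return v_w + t_pulse / c_w * i_dw

# Hypothetical values: C_w = 1 pF, I_dW = 10 nA, T = 1 us, starting from 2.0 V
v_new = weight_voltage_update(v_w=2.0, i_dw=10e-9, t_pulse=1e-6, c_w=1e-12)
print(v_new - 2.0)   # weight step in volts, about 10 mV for these values
```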




## Weight update circuit (1)



$$\Delta V_w = \frac{T}{C_w + C_p} \cdot I_{\Delta W} + \Delta V_{ci} + \Delta V_{cs}$$

- $\frac{T}{C_w + C_p} \cdot I_{\Delta W}$: ideal weight update
- $\Delta V_{cs}$: charge sharing term (hundreds of millivolts)
- $\Delta V_{ci}$: charge injection term (a few millivolts)

The main error term is due to the charge sharing between  $C_p$  and  $C_w$  when the switch is closed. The value of  $\Delta V_{cs}$  depends on the values of  $C_p$  and  $C_w$ .
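The slide gives no closed form for $\Delta V_{cs}$. As a rough illustration, the sketch below uses the textbook charge-conservation model for two capacitors connected by an ideal switch; the model, the function names and all component values are assumptions, not taken from the circuit.

```python
def ideal_update(i_dw, t_pulse, c_w, c_p):
    """Ideal term of the update: T / (C_w + C_p) * I_dW."""
    return t_pulse / (c_w + c_p) * i_dw

def charge_sharing_step(v_w, v_p, c_w, c_p):
    """Change of V_w when C_w (at v_w) is connected to C_p (at v_p).

    Textbook model: total charge is conserved and both nodes settle
    to the same voltage.
    """
    v_final = (c_w * v_w + c_p * v_p) / (c_w + c_p)
    return v_final - v_w

# Hypothetical values: C_w = 1 pF, C_p = 0.1 pF, V_w = 2.0 V, parasitic node at 0 V
dv_cs = charge_sharing_step(v_w=2.0, v_p=0.0, c_w=1e-12, c_p=0.1e-12)
dv_ideal = ideal_update(i_dw=10e-9, t_pulse=1e-6, c_w=1e-12, c_p=0.1e-12)
print(dv_cs, dv_ideal)
```

With these values the charge sharing step is much larger than the ideal update, in line with the orders of magnitude quoted above.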

## Weight update circuit (2)



- Learning task: character recognition
- Network topology: 112×32×10 MLP
- Training set: 1000 characters
- Test set: 1000 characters

## Local learning rate adaptation circuit (H) (1)



$\phi_i$: four-phase non-overlapping clock





## The derivative circuit D

$$I_d = I_b \tanh\left(\frac{V_{ccm} - V_X}{2nU_T}\right) \tanh\left(\frac{V_X - V_{ssm}}{2nU_T}\right)$$

$$I_d \approx \frac{I_b}{4(nU_T)^2} (V_{ccm} - V_X)(V_X - V_{ssm})$$

$$V_{ccm} - V_r = V_r - V_{ssm} = V_{norm}$$

$$I_d \approx \frac{I_b V_{norm}^2}{4(nU_T)^2} \left(1 - \left(\frac{V_X - V_r}{V_{norm}}\right)^2\right)$$
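The sketch below evaluates the $\tanh$ product and its small-signal approximation, then checks that the approximation, normalised to its value at $V_X = V_r$, follows the parabola $1 - u^2$ with $u = (V_X - V_r)/V_{norm}$. All numerical values are hypothetical and the function names are introduced here for illustration.

```python
import math

def derivative_current(v_x, v_ccm, v_ssm, i_b, n, u_t):
    """I_d as the product of two tanh characteristics."""
    s = 2 * n * u_t
    return i_b * math.tanh((v_ccm - v_x) / s) * math.tanh((v_x - v_ssm) / s)

def derivative_current_approx(v_x, v_ccm, v_ssm, i_b, n, u_t):
    """Small-signal approximation: each tanh replaced by its argument."""
    return i_b / (4 * (n * u_t) ** 2) * (v_ccm - v_x) * (v_x - v_ssm)

# Hypothetical values; V_ccm and V_ssm are placed symmetrically around V_r
V_R, V_NORM = 2.5, 0.02
V_CCM, V_SSM = V_R + V_NORM, V_R - V_NORM
I_B, N_SLOPE, U_T = 1e-6, 1.5, 0.02585

peak = derivative_current_approx(V_R, V_CCM, V_SSM, I_B, N_SLOPE, U_T)
for v_x in (2.49, 2.50, 2.51):
    u = (v_x - V_R) / V_NORM
    approx = derivative_current_approx(v_x, V_CCM, V_SSM, I_B, N_SLOPE, U_T)
    exact = derivative_current(v_x, V_CCM, V_SSM, I_B, N_SLOPE, U_T)
    print(v_x, approx / peak, 1 - u * u, exact / peak)
```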



M. Valle

Low Power Design Techniques and Neural Applications  
Barcelona, Feb. 23-27 2004

On-chip BP learning


## The error circuit R

$$V_\delta = R_{out} I_d \tanh\left(\frac{V_1 - V_2}{2nU_T}\right)$$

$$V_\delta \approx \frac{R_{out} I_d}{2nU_T} (V_1 - V_2)$$




## The SLANP chip (1)




## The SLANP chip (2) - the synaptic module




## The SLANP chip (3) - the neuron module

## The SLANP chip (4)

## Experimental results (1)

## Experimental results (2)



$$V_X^j = R_{out} I_b \tanh \left\{ \frac{R_{in} \left[ \sum_i g(V_w^{j,i}) \frac{(V_X^i - V_r)}{2nU_T} \right] - V_r}{2nU_T} \right\}$$

$$V_X^j = R_{out} I_b \tanh \left\{ \frac{R_{in} g(V_w^{j,1}) \frac{(V_X^1 - V_r)}{2nU_T} - V_r}{2nU_T} \right\}$$



## Experimental results (3)



Transient response of the circuit for positive (a) and negative (b) weight values (upper traces: synaptic input signals; bottom traces: neuron output signals).

## Experimental results (4) (circuit FC)



## Experimental results (5) (weight decay due to leakage currents on the weight capacitor $C_w$)



## Experimental results (6) - learning



Training of the NOT function. Top trace: target signal; middle trace: output signal; bottom trace: a weight signal. The network was configured as a $1 \times 8 \times 1$ MLP and the learning rates were fixed at 0.5 V. The learning iteration time was 800 μs.

## Experimental results (7) - learning

Training set for the CLASSIFICATION problem.

| Input Pattern | Target |
|---------------|--------|
| 11001100      | 0001   |
| 10011001      | 0010   |
| 00110011      | 0100   |
| 01100110      | 1000   |



The four output neuron signals at the end of the learning process. The network was configured as an 8×16×4 MLP and all the learning rates were locally adapted: the minimum and maximum learning rate values were 0.4 V and 0.7 V. The learning iteration time was 80 μs.


## Experimental results (8) - learning



Training of the 2-input XOR function. The two input signals (first and second traces), the target signal (third trace), and the output signal (fourth trace) at the end of the training process. The network was configured as a 2×2×1 MLP and all the learning rates were locally adapted: the minimum and maximum learning rate values were 0.4 V and 0.7 V. The learning iteration time was 200 μs.


## Performance

| Parameter                  | Value                                             |
|----------------------------|---------------------------------------------------|
| Network size               | 8×16×4 MLP                                        |
| On-chip learning algorithm | by-pattern BP with local learning rate adaptation |
| Technology                 | ATMEL ES2 ECPD07                                  |
| Transistor count           | 22000                                             |
| Chip size                  | 3.5 mm × 3.5 mm                                   |
| Power consumption          | 25 mW                                             |
| Recall computational power | 106 MCPS                                          |
| Computational power        | 2.65 MCUPS                                        |
| Computational density      | 216000 CUPS/mm²                                   |
| Energy efficiency          | 106000 CUPS/mW                                    |
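The last two rows of the table can be cross-checked from the chip size, the power consumption and the learning throughput. The sketch assumes that both figures are derived from the 2.65 MCUPS value, as their units suggest.

```python
# Values taken from the performance table
chip_area_mm2 = 3.5 * 3.5        # 3.5 mm x 3.5 mm die
learning_cups = 2.65e6           # 2.65 MCUPS
power_mw = 25.0                  # 25 mW

density = learning_cups / chip_area_mm2   # CUPS per mm^2
efficiency = learning_cups / power_mw     # CUPS per mW

print(f"computational density: {density:.0f} CUPS/mm^2")
print(f"energy efficiency:     {efficiency:.0f} CUPS/mW")
```

Both results agree with the table once the density is rounded to three significant figures.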

M. Valle

Low Power Design Techniques and Neural Applications  
Barcelona, Feb. 23-27 2004

On-chip BP learning

34