

Parameter Tying and Sharing


So far, most of the regularization methods we have seen push the weights toward a fixed point, e.g. 0 in the case of a norm penalty. L2 regularization expresses the prior belief that the parameters should stay close to the fixed value of 0, and penalizes parameters that deviate significantly from that value.

However, there are situations where, instead of knowing the exact values the parameters should take, we have prior knowledge about dependencies between them, for example that certain parameters should be close to one another.

Consider two models that perform the same classification task (with the same set of classes) but with different input data:

  • Model A with parameters w(A)
  • Model B with parameters w(B)

The two models map the input to two different but related outputs:

ŷ(A) = f(w(A), x)
ŷ(B) = g(w(B), x)
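If we believe the two tasks are similar enough that w(A) and w(B) should stay close to each other, parameter tying expresses this belief as a penalty on the distance between the two parameter sets. A common choice (sketched here in its usual L2 form; other norms are possible) is:

```latex
\Omega\!\left(w^{(A)}, w^{(B)}\right) = \left\lVert w^{(A)} - w^{(B)} \right\rVert_2^2
```

Adding this term to the training objective nudges corresponding parameters of the two models toward similar values without forcing them to be equal.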

Standard regularizers like L1 and L2 penalize model parameters for deviating from the fixed value of zero. One side effect of Lasso (L1) or group-Lasso regularization when training a deep neural network is that many of the parameters may become exactly zero, which reduces the amount of memory required to store the model and lowers the computational cost of applying it.

A significant drawback of Lasso (or group-Lasso) regularization is that, in the presence of groups of highly correlated features, it tends to select only one of them, or an arbitrary convex combination of elements from each group. Moreover, the learning process of Lasso tends to be unstable, because the subset of parameters that ends up selected can change dramatically with minor changes in the data or the training procedure. In deep neural networks, correlated features are almost unavoidable: the input to each layer is high-dimensional, and neurons tend to co-adapt, producing strongly correlated features that are passed as input to the subsequent layer.
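As an illustration of how such a penalty enters training, here is a minimal PyTorch-style sketch (the model, sizes, and hyperparameters are placeholders, not from the original text) that adds an L1 penalty on the weights to the task loss; this is what drives many of them toward exactly zero:

```python
import torch
import torch.nn as nn

# Hypothetical small classifier; architecture and sizes are illustrative only.
model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
l1_lambda = 1e-4  # strength of the Lasso (L1) penalty

def training_step(x, y):
    optimizer.zero_grad()
    logits = model(x)
    task_loss = criterion(logits, y)
    # L1 penalty: sum of absolute values of all parameters, pushes them toward 0.
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    loss = task_loss + l1_lambda * l1_penalty
    loss.backward()
    optimizer.step()
    return loss.item()
```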

Parameter Sharing

Parameter sharing forces sets of parameters to be equal: we interpret various models or model components as sharing a single set of parameters. As a result, we only need to store one copy of the shared parameters in memory.

Suppose two models, A and B, perform a classification task on similar input and output distributions. In that case, we would expect their parameters to be close to each other. We could impose a norm penalty on the distance between the two sets of weights (parameter tying, as above), but a more popular approach is to force the parameters to be exactly equal, which is the idea behind parameter sharing. A significant benefit is that we only need to store a single copy of the shared parameters (e.g., storing the parameters of model A once instead of storing separate copies for A and B), which leads to significant memory savings.
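A minimal sketch of the two options, again in PyTorch style with hypothetical layer sizes: parameter tying keeps two separate parameter sets but penalizes their distance, while parameter sharing reuses the very same module (and therefore the same storage) in both models.

```python
import torch
import torch.nn as nn

# --- Parameter tying: two separate models, penalize the distance between weights ---
model_a = nn.Linear(100, 10)   # w(A); sizes are illustrative
model_b = nn.Linear(100, 10)   # w(B)
tie_lambda = 1e-3

def tying_penalty():
    # Omega(w(A), w(B)) = ||w(A) - w(B)||_2^2, summed over matching parameters.
    return sum((pa - pb).pow(2).sum()
               for pa, pb in zip(model_a.parameters(), model_b.parameters()))

# total_loss = task_loss_a + task_loss_b + tie_lambda * tying_penalty()

# --- Parameter sharing: both models reuse the very same parameter tensors ---
shared_encoder = nn.Linear(100, 64)  # stored once
head_a = nn.Sequential(shared_encoder, nn.ReLU(), nn.Linear(64, 10))
head_b = nn.Sequential(shared_encoder, nn.ReLU(), nn.Linear(64, 10))
# Gradients from both tasks flow into shared_encoder's single set of weights.
```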

Example: The most extensive use of parameter sharing is in convolutional neural networks (CNNs). Natural images have many statistical properties that are robust to translation: a photo of a cat remains a photo of a cat if it is shifted one pixel to the right. CNNs exploit this property by sharing parameters across multiple image locations, so we can find a cat with the same cat detector whether it appears at column i or column i+1 of the image.


Convolutional Neural Network (CNN)

  • Share parameters across multiple image locations
  • Parameter sharing makes the convolution operation equivariant to translation: we can find a cat with the same cat detector whether the cat appears at column i or column i+1 of the image (see the sketch after this list)
  • Parameter sharing also dramatically lowers the number of model parameters, which allows significantly larger networks without requiring a corresponding increase in training data
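The following sketch (illustrative only, not from the original text) shows this concretely: a single convolutional kernel is applied at every spatial location, and shifting the input simply shifts the response of the same shared detector.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A single 3x3 "detector" whose weights are shared across every image location.
detector = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1, bias=False)

image = torch.zeros(1, 1, 8, 8)
image[0, 0, 4, 2] = 1.0                        # a "feature" at column 2
shifted = torch.roll(image, shifts=1, dims=3)  # same feature, one column to the right

out = detector(image)
out_shifted = detector(shifted)

# Shifting the input shifts the detector's response by the same amount:
print(torch.allclose(torch.roll(out, shifts=1, dims=3), out_shifted, atol=1e-6))  # True
```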

What are the benefits of parameter sharing in CNNs? Convolutional neural networks rely on parameter sharing (and, more generally, parameter tying): all neurons in a particular feature map share the same weights. This greatly reduces the number of parameters in the whole system, making it computationally cheap.
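To make the reduction concrete, here is a small, hypothetical comparison of the number of parameters in a convolutional layer versus a fully connected layer mapping between feature maps of the same size:

```python
import torch.nn as nn

# Map a 32x32 RGB image to 16 feature maps of the same spatial size.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
fc = nn.Linear(in_features=3 * 32 * 32, out_features=16 * 32 * 32)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"convolutional layer: {count(conv):,} parameters")   # 16*3*3*3 + 16 = 448
print(f"fully connected layer: {count(fc):,} parameters")   # 3072*16384 + 16384 = 50,348,032
```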

Why is sharing parameters a good idea? Parameter sharing is used in every convolutional layer of the network. Because far fewer weights need to be learned, there are far fewer weight updates to compute during backpropagation, which in turn reduces training time.