Siamese network

From HandWiki

A Siamese network is an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors.[1][2][3] Often one of the output vectors is precomputed, thus forming a baseline against which the other output vector is compared. This is similar to comparing fingerprints or, more technically, to using a distance function for locality-sensitive hashing.
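The weight sharing described above can be illustrated with a minimal sketch. It assumes a single tanh layer as a stand-in for the network; the function names, weight values, and input vectors are hypothetical and chosen only for illustration.

```python
import math

def embed(x, weights):
    """Map an input vector to an output vector (hypothetical one-layer network)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def euclidean(a, b):
    """Euclidean distance between two output vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# One set of weights, shared by both branches of the network.
W = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.5]]

baseline = embed([1.0, 0.0, 2.0], W)      # can be precomputed once and stored
query_same = embed([1.0, 0.0, 2.0], W)    # same input, same weights
query_other = embed([-1.0, 3.0, 0.5], W)  # different input, same weights

d_same = euclidean(baseline, query_same)
d_other = euclidean(baseline, query_other)
print(d_same, d_other)
```

Because both inputs pass through the same weights, identical inputs always yield a distance of zero, and the stored baseline stays comparable with any output vector computed later.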

It is possible to build structures that are functionally similar to a siamese network but implement slightly different functions in the two branches. This is typically used for comparing similar instances drawn from sets of different types.

Similarity measures for which a siamese network might be used include recognizing handwritten checks, automatic detection of faces in camera images, and matching queries with indexed documents. Perhaps the best-known application of siamese networks is face recognition, where output vectors for images of known people are precomputed and compared to that of an image from a turnstile or similar. It is not obvious at first, but there are two slightly different problems. One is recognizing a person among a large number of other persons, which is the facial recognition problem. DeepFace is an example of such a system.[3] In its most extreme form this means recognizing a single person at a train station or airport. The other is face verification, that is, verifying whether the photo in a pass shows the same person as the one presenting it. The siamese network might be the same in both cases, but the implementation can be quite different.
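The difference between the two problems can be sketched as follows, assuming the output vectors have already been computed by the network. The gallery contents, the names, and the threshold value are hypothetical; a real system would choose the threshold from validation data.

```python
import math

def distance(a, b):
    """Euclidean distance between two output vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def verify(probe, claimed, threshold):
    """Face verification: one-to-one comparison against a single claimed identity."""
    return distance(probe, claimed) <= threshold

def identify(probe, gallery):
    """Face recognition: one-to-many search for the closest precomputed vector."""
    return min(gallery, key=lambda name: distance(probe, gallery[name]))

# Hypothetical precomputed output vectors for known people.
gallery = {"alice": [0.9, 0.1], "bob": [0.1, 0.8], "carol": [-0.7, -0.6]}
probe = [0.85, 0.2]  # output vector for the image taken at the turnstile

who = identify(probe, gallery)
accepted = verify(probe, gallery["alice"], threshold=0.3)
rejected = verify(probe, gallery["bob"], threshold=0.3)
print(who, accepted, rejected)
```

Verification needs a single distance and a threshold, whereas recognition scans the whole gallery, which is one reason the surrounding implementation can differ even when the network is shared.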

Learning

Learning in siamese networks can be done with triplet loss or contrastive loss. For learning by triplet loss, a baseline vector (anchor image) is compared against a positive vector (truthy image) and a negative vector (falsy image). The negative vector forces learning in the network, while the positive vector acts like a regularizer. For learning by contrastive loss, there must be a weight decay to regularize the weights, or some similar operation such as normalization.
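Both losses can be sketched on precomputed output vectors. The forms below are common textbook formulations and are given as an assumption, not as the exact definitions used in the cited papers; the margin value and the example vectors are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two output vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is farther from the anchor than the positive by the margin."""
    return max(sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin, 0.0)

def contrastive_loss(a, b, similar, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs apart up to the margin."""
    d = sq_dist(a, b) ** 0.5
    if similar:
        return 0.5 * d ** 2
    return 0.5 * max(margin - d, 0.0) ** 2

anchor = [0.0, 0.0]
positive = [0.1, 0.0]
far_negative = [2.0, 0.0]   # already well separated from the anchor
near_negative = [0.5, 0.0]  # too close to the anchor

easy = triplet_loss(anchor, positive, far_negative)
hard = triplet_loss(anchor, positive, near_negative)
pull = contrastive_loss(anchor, positive, similar=True)
push = contrastive_loss(anchor, near_negative, similar=False)
print(easy, hard, pull, push)
```

In this sketch only the negative that lies too close to the anchor produces a non-zero triplet loss, so it is the one that would drive a weight update.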

A distance metric for a loss function must have the following properties:[4]

  • Non-negativity: [math]\displaystyle{ \delta ( x, y ) \ge 0 }[/math]
  • Identity of indiscernibles: [math]\displaystyle{ \delta ( x, y ) = 0 \iff x=y }[/math]
  • Symmetry: [math]\displaystyle{ \delta ( x, y ) = \delta ( y, x ) }[/math]
  • Triangle inequality: [math]\displaystyle{ \delta ( x, z ) \le \delta ( x, y ) + \delta ( y, z ) }[/math]

In particular, the triplet loss algorithm is often defined with the squared Euclidean distance at its core, which, unlike the Euclidean distance, does not satisfy the triangle inequality.
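The four properties can be checked numerically on a handful of points. The sketch below is only a spot check on sample points, not a proof; the helper name `violations` and the points are hypothetical. It shows that the Euclidean distance passes all four checks on these points, while the squared Euclidean distance fails the triangle inequality.

```python
import itertools
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def squared_euclidean(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def violations(delta, points, tol=1e-12):
    """Return the names of the metric properties that delta breaks on the given points."""
    broken = set()
    for x, y in itertools.product(points, repeat=2):
        if delta(x, y) < 0:
            broken.add("non-negativity")
        if (delta(x, y) == 0) != (x == y):
            broken.add("identity of indiscernibles")
        if abs(delta(x, y) - delta(y, x)) > tol:
            broken.add("symmetry")
    for x, y, z in itertools.product(points, repeat=3):
        if delta(x, z) > delta(x, y) + delta(y, z) + tol:
            broken.add("triangle inequality")
    return broken

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
print(violations(euclidean, points))          # set()
print(violations(squared_euclidean, points))  # {'triangle inequality'}
```

The failing triple is (0, 0), (1, 0), (2, 0): the squared distance between the end points is 4, while the two intermediate squared distances sum to only 2.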

Predefined metrics, Euclidean distance metric

The common learning goal is to minimize a distance metric for similar objects and to maximize it for distinct ones. This gives a loss function like

[math]\displaystyle{ \begin{align} \text{if} \, i = j \, \text{then} & \, \| \operatorname{f} \left ( x^{(i)} \right ) - \operatorname{f} \left ( x^{(j)} \right ) \| \, \text{is small} \\ \text{otherwise} & \, \| \operatorname{f} \left ( x^{(i)} \right ) - \operatorname{f} \left ( x^{(j)} \right ) \| \, \text{is large} \end{align} }[/math]
[math]\displaystyle{ i,j }[/math] are indexes into a set of vectors
[math]\displaystyle{ \operatorname{f}(\cdot) }[/math] is the function implemented by the siamese network

This is the most common case, but it is also a special case that implements a Euclidean distance metric.

In matrix form, the previous is often expressed as

[math]\displaystyle{ \operatorname{\delta} ( \mathbf{x}^{(i)}, \mathbf{x}^{(j)} ) \approx (\mathbf{x}^{(i)} - \mathbf{x}^{(j)})^{T}(\mathbf{x}^{(i)} - \mathbf{x}^{(j)}) }[/math]

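Strictly speaking, this expression is the squared Euclidean distance rather than the Euclidean distance itself. It should also not be confused with the Manhattan distance, which sums the absolute coordinate differences.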
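A small numerical check may help keep these quantities apart. The sketch evaluates the matrix form as the inner product of the difference vector with itself and compares it with the Euclidean and Manhattan distances; the two vectors are hypothetical.

```python
import math

def squared_euclidean(x, y):
    """(x - y)^T (x - y): inner product of the difference vector with itself."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    return sum(d * d for d in diff)

def euclidean(x, y):
    """Square root of the matrix form."""
    return math.sqrt(squared_euclidean(x, y))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

x = [1.0, 2.0]
y = [4.0, 6.0]

print(squared_euclidean(x, y))  # 25.0
print(euclidean(x, y))          # 5.0
print(manhattan(x, y))          # 7.0
```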

Learned metrics, nonlinear distance metric

A more general case is where the output vector from the siamese network is passed through additional network layers implementing non-linear distance metrics.

[math]\displaystyle{ \begin{align} \text{if} \, i = j \, \text{then} & \, \operatorname{\delta} \left [ \operatorname{f} \left ( x^{(i)} \right ), \, \operatorname{f} \left ( x^{(j)} \right ) \right ] \, \text{is small} \\ \text{otherwise} & \, \operatorname{\delta} \left [ \operatorname{f} \left ( x^{(i)} \right ), \, \operatorname{f} \left ( x^{(j)} \right ) \right ] \, \text{is large} \end{align} }[/math]
[math]\displaystyle{ i,j }[/math] are indexes into a set of vectors
[math]\displaystyle{ \operatorname{f}(\cdot) }[/math] is the function implemented by the siamese network
[math]\displaystyle{ \operatorname{\delta}(\cdot) }[/math] is the function implemented by the network joining outputs from the siamese network

In matrix form, the previous is often approximated by a Mahalanobis distance for a linear space:[5]

[math]\displaystyle{ \operatorname{\delta} ( \mathbf{x}^{(i)}, \mathbf{x}^{(j)} ) \approx (\mathbf{x}^{(i)} - \mathbf{x}^{(j)})^{T}\mathbf{M}(\mathbf{x}^{(i)} - \mathbf{x}^{(j)}) }[/math]

This can be further subdivided into at least unsupervised learning and supervised learning.
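The quadratic form above can be evaluated directly. In the sketch below the matrix M is picked by hand for illustration, whereas in a learned metric it would be fitted to data; the function name and the vectors are hypothetical. With the identity matrix the expression reduces to the squared Euclidean distance.

```python
def quadratic_form_distance(x, y, M):
    """Evaluate (x - y)^T M (x - y) for a square matrix M given as a list of rows."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    M_diff = [sum(m * d for m, d in zip(row, diff)) for row in M]
    return sum(d * md for d, md in zip(diff, M_diff))

x = [1.0, 2.0]
y = [4.0, 6.0]

identity = [[1.0, 0.0],
            [0.0, 1.0]]
# Hypothetical weighting: the first coordinate counts double, the second half.
M = [[2.0, 0.0],
     [0.0, 0.5]]

print(quadratic_form_distance(x, y, identity))  # 25.0
print(quadratic_form_distance(x, y, M))         # 26.0
```

For the result to behave like a distance, M is usually required to be symmetric and positive semi-definite, so that the expression is never negative.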

Learned metrics, half-twin networks

This form also allows the siamese network to be more of a half-twin, with the two branches implementing slightly different functions:

[math]\displaystyle{ \begin{align} \text{if} \, i = j \, \text{then} & \, \operatorname{\delta} \left [ \operatorname{f} \left ( x^{(i)} \right ), \, \operatorname{g} \left ( x^{(j)} \right ) \right ] \, \text{is small} \\ \text{otherwise} & \, \operatorname{\delta} \left [ \operatorname{f} \left ( x^{(i)} \right ), \, \operatorname{g} \left ( x^{(j)} \right ) \right ] \, \text{is large} \end{align} }[/math]
[math]\displaystyle{ i,j }[/math] are indexes into a set of vectors
[math]\displaystyle{ \operatorname{f}(\cdot), \operatorname{g}(\cdot) }[/math] are the functions implemented by the half-twin network
[math]\displaystyle{ \operatorname{\delta}(\cdot) }[/math] is the function implemented by the network joining the outputs of the half-twin network
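A half-twin arrangement can be sketched with two branches that have their own weights and accept inputs of different sizes, but produce output vectors of the same size so that they can be compared. Everything here is hypothetical: the one-layer tanh branches, the weight values, and the use of a squared Euclidean distance as the joining function.

```python
import math

def make_branch(weights):
    """Build a one-layer tanh branch from its own weight matrix (hypothetical)."""
    def branch(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]
    return branch

# f takes 3-dimensional inputs, g takes 2-dimensional inputs.
# Both map into the same 2-dimensional output space.
f = make_branch([[0.4, -0.3, 0.2],
                 [0.1, 0.6, -0.5]])
g = make_branch([[0.7, 0.2],
                 [-0.3, 0.9]])

def delta(u, v):
    """Joining function: squared Euclidean distance between the two outputs."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

x_i = [1.0, 2.0, 0.5]  # instance from the first type set
x_j = [0.3, -1.0]      # instance from the second type set

score = delta(f(x_i), g(x_j))
print(score)
```

Training would adjust the weights of f and g separately so that the score is small for matching pairs and large otherwise.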

References

  1. Bromley, Jane; Guyon, Isabelle; LeCun, Yann; Säckinger, Eduard; Shah, Roopak (1994). "Signature verification using a "Siamese" time delay neural network". Advances in Neural Information Processing Systems: 737–744. https://papers.nips.cc/paper/769-signature-verification-using-a-siamese-time-delay-neural-network.pdf.
  2. Chopra, S.; Hadsell, R.; LeCun, Y. (June 2005). "Learning a similarity metric discriminatively, with application to face verification". 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) 1: 539–546. doi:10.1109/CVPR.2005.202. https://ieeexplore.ieee.org/document/1467314/.
  3. Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. (June 2014). "DeepFace: Closing the Gap to Human-Level Performance in Face Verification". 2014 IEEE Conference on Computer Vision and Pattern Recognition: 1701–1708. doi:10.1109/CVPR.2014.220. https://ieeexplore.ieee.org/document/6909616/.
  4. Chatterjee, Moitreya; Luo, Yunan. "Similarity Learning with (or without) Convolutional Neural Network". http://slazebni.cs.illinois.edu/spring17/lec09_similarity.pdf. 
  5. Mahalanobis, P. C. (1936). "On the generalized distance in statistics". Proceedings of the National Institute of Sciences of India 2 (1): 49–55. http://library.isical.ac.in:8080/jspui/bitstream/123456789/6765/1/Vol02_1936_1_Art05-pcm.pdf.