
Short description: Mathematical activation function in data analysis

The swish function is a family of mathematical functions defined as follows:

[Figure: plot of the swish function]

<math>\operatorname{swish}_\beta(x) = x \operatorname{sigmoid}(\beta x) = \frac{x}{1 + e^{-\beta x}}.</math>[1]

where β can be constant (usually set to 1) or trainable and "sigmoid" refers to the logistic function.
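
In code, the definition above is a one-liner. A minimal sketch in NumPy; the helper name swish and its default β are local choices for illustration, not a library API:

<syntaxhighlight lang="python">
import numpy as np

def swish(x, beta=1.0):
    """swish_beta(x) = x * sigmoid(beta*x) = x / (1 + exp(-beta*x))."""
    return x / (1.0 + np.exp(-beta * x))

print(swish(np.array([-2.0, 0.0, 2.0])))  # beta = 1, i.e. SiLU
</syntaxhighlight>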

The swish family was designed to smoothly interpolate between a linear function and the Rectified linear unit (ReLU) function.

For positive values, Swish is a particular case of the doubly parameterized sigmoid shrinkage function defined in [2] (Eq. 3). Variants of the swish function include Mish.[3]

Special values

For β = 0, the function is linear: f(x) = x/2.

For β = 1, the function is the Sigmoid Linear Unit (SiLU).

For β = 1.702, the function approximates GeLU.[4]

With β → ∞, the function converges to ReLU.

Thus, the swish family smoothly interpolates between a linear function and the ReLU function.[1]
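
These limiting cases are easy to verify numerically. A sketch assuming the illustrative swish helper above; a large finite β stands in for β → ∞, and GELU is computed exactly as x·Φ(x):

<syntaxhighlight lang="python">
import numpy as np
from math import erf, sqrt

def swish(x, beta):
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-4.0, 4.0, 9)
print(np.allclose(swish(x, 0.0), x / 2))                         # beta = 0: linear x/2
print(np.allclose(swish(x, 50.0), np.maximum(x, 0), atol=1e-6))  # large beta: ~ReLU

gelu = x * 0.5 * (1 + np.vectorize(erf)(x / sqrt(2)))            # exact GELU: x * Phi(x)
print(np.max(np.abs(swish(x, 1.702) - gelu)))                    # worst gap ~0.02 on this grid
</syntaxhighlight>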

Since swish_β(x) = swish_1(βx)/β, all instances of swish have the same shape as the default swish_1, zoomed by β. One usually sets β > 0. When β is trainable, this constraint can be enforced by β = e^b, where b is trainable.
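
Both properties can be confirmed directly; a short sketch (the scalar b below is an illustrative stand-in for an unconstrained trainable parameter):

<syntaxhighlight lang="python">
import numpy as np

def swish(x, beta):
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-3.0, 3.0, 13)
beta = 2.5
# Rescaling identity: swish_beta(x) == swish_1(beta*x) / beta
print(np.allclose(swish(x, beta), swish(beta * x, 1.0) / beta))

# Enforcing beta > 0 when beta is trained: optimize b freely, set beta = e^b
b = -0.7               # any real value
print(np.exp(b) > 0)   # the induced beta is always positive
</syntaxhighlight>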

<math>\operatorname{swish}_1(x) = \frac{x}{2} + \frac{x^2}{4} - \frac{x^4}{48} + \frac{x^6}{480} + O(x^8)</math>
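
The expansion can be sanity-checked near the origin (a sketch; swish1 is a local helper, not a library function):

<syntaxhighlight lang="python">
import numpy as np

def swish1(x):
    return x / (1.0 + np.exp(-x))  # swish with beta = 1

x = np.linspace(-0.01, 0.01, 5)
series = x/2 + x**2/4 - x**4/48 + x**6/480
print(np.max(np.abs(swish1(x) - series)))  # O(x^8) remainder: effectively zero here
</syntaxhighlight>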

<math>\operatorname{swish}_1(x) = \frac{x}{2}\tanh\!\left(\frac{x}{2}\right) + \frac{x}{2}</math>

<math>\operatorname{swish}_1(x) + \operatorname{swish}_1(-x) = x\tanh\!\left(\frac{x}{2}\right)</math>

<math>\operatorname{swish}_1(x) - \operatorname{swish}_1(-x) = x</math>
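
All three identities follow from sigmoid(x) = (1 + tanh(x/2))/2 and hold to machine precision in a quick check (same illustrative swish1 helper):

<syntaxhighlight lang="python">
import numpy as np

def swish1(x):
    return x / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 21)
print(np.allclose(swish1(x), (x/2)*np.tanh(x/2) + x/2))      # tanh form
print(np.allclose(swish1(x) + swish1(-x), x*np.tanh(x/2)))   # sum identity
print(np.allclose(swish1(x) - swish1(-x), x))                # difference identity
</syntaxhighlight>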

Derivatives

Because swish_β(x) = swish_1(βx)/β, it suffices to calculate its derivatives for the default case.

<math>\operatorname{swish}_1'(x) = \frac{x + \sinh(x)}{4\cosh^2(x/2)} + \frac{1}{2},</math>

so swish_1′(x) − 1/2 is odd.

<math>\operatorname{swish}_1''(x) = \frac{1 - \frac{x}{2}\tanh\left(\frac{x}{2}\right)}{2\cosh^2(x/2)},</math>

so swish_1″(x) is even.
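
The closed forms and their symmetries can be checked against finite differences (a sketch; d1 and d2 are local names for the formulas above):

<syntaxhighlight lang="python">
import numpy as np

def swish1(x):
    return x / (1.0 + np.exp(-x))

def d1(x):  # first derivative, closed form from the text
    return (x + np.sinh(x)) / (4 * np.cosh(x/2)**2) + 0.5

def d2(x):  # second derivative, closed form from the text
    return (1 - (x/2) * np.tanh(x/2)) / (2 * np.cosh(x/2)**2)

x = np.linspace(-4.0, 4.0, 17)
h = 1e-5
fd1 = (swish1(x + h) - swish1(x - h)) / (2 * h)                # central difference
fd2 = (swish1(x + h) - 2*swish1(x) + swish1(x - h)) / h**2
print(np.allclose(d1(x), fd1), np.allclose(d2(x), fd2, atol=1e-3))
print(np.allclose(d1(x) - 0.5, -(d1(-x) - 0.5)))  # d1 - 1/2 is odd
print(np.allclose(d2(x), d2(-x)))                 # d2 is even
</syntaxhighlight>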

History

SiLU was first proposed alongside the GELU in 2016,[4] and was proposed again in 2017 as the Sigmoid-weighted Linear Unit (SiL) in reinforcement learning.[5][1] The SiLU/SiL was then proposed once more, as SWISH, over a year after its initial discovery, originally without the learnable parameter β (so that β implicitly equaled 1). The swish paper was later updated to propose the activation with the learnable parameter β.

In 2017, after performing analysis on ImageNet data, researchers from Google indicated that using this function as an activation function in artificial neural networks improves performance compared to ReLU and sigmoid functions.[1] It is believed that one reason for the improvement is that the swish function helps alleviate the vanishing gradient problem during backpropagation.[6]

See also

  * Activation function
  * Gating mechanism

References

  1. Ramachandran, Prajit; Zoph, Barret; Le, Quoc V. (2017-10-27). "Searching for Activation Functions". arXiv:1710.05941v2 [cs.NE].
  2. Atto, Abdourrahmane M.; Pastor, Dominique; Mercier, Gregoire (March 2008). "Smooth sigmoid wavelet shrinkage for non-parametric estimation". 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 3265–3268. doi:10.1109/ICASSP.2008.4518347. ISBN 978-1-4244-1483-3.
  3. Misra, Diganta (2019). "Mish: A Self Regularized Non-Monotonic Neural Activation Function". arXiv:1908.08681 [cs.LG].
  4. Hendrycks, Dan; Gimpel, Kevin (2016). "Gaussian Error Linear Units (GELUs)". arXiv:1606.08415 [cs.LG].
  5. Elfwing, Stefan; Uchibe, Eiji; Doya, Kenji (2017-11-02). "Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning". arXiv:1702.03118v3 [cs.LG].
  6. Serengil, Sefik Ilkin (2018-08-21). "Swish as Neural Networks Activation Function". https://sefiks.com/2018/08/21/swish-as-neural-networks-activation-function/.