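The Pearson correlation coefficient (also called Pearson's r) is widely used to measure how strongly two variables are linearly related. Conceptually, it is defined as follows.
r = degree to which X and Y vary together / degree to which X and Y each vary on their own
Interpreting the result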
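r is +1 when X and Y move together perfectly (an exact increasing linear relationship), 0 when there is no linear relationship between them, and -1 when they move together perfectly in opposite directions. The coefficient of determination is computed as r^2 and indicates how well Y can be predicted from X: it is the proportion of the variance in Y that a straight-line fit on X explains.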
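The boundary cases and the meaning of r^2 can be checked numerically. The sketch below is only an illustration: the data are made up, and it uses NumPy's `np.corrcoef` and `np.polyfit` rather than anything from the original text.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Identical data, or any increasing linear rescaling, gives r = +1.
r_same = np.corrcoef(x, x)[0, 1]
r_scaled = np.corrcoef(x, 2.0 * x + 3.0)[0, 1]

# A perfectly linear relationship in the opposite direction gives r = -1.
r_opposite = np.corrcoef(x, -x)[0, 1]

# A symmetric V shape has no linear trend, so r = 0 even though
# y is fully determined by x.
y_sym = np.abs(x - 3.0)
r_none = np.corrcoef(x, y_sym)[0, 1]

# Noisy but increasing data: r is about 0.775 and r^2 is about 0.6.
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2

# r^2 matches the share of the variance in y explained by a
# least-squares straight line fitted on x.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
explained = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print(r_same, r_scaled, r_opposite, r_none)
print(r_squared, explained)
```

The V-shaped case is worth noting: r = 0 rules out a linear relationship, not every kind of dependence.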
In general, r can be interpreted as follows:
r between -1.0 and -0.7: strong negative linear relationship
r between -0.7 and -0.3: clear negative linear relationship
r between -0.3 and -0.1: weak negative linear relationship
r between -0.1 and +0.1: almost negligible linear relationship
r between +0.1 and +0.3: weak positive linear relationship
r between +0.3 and +0.7: clear positive linear relationship
r between +0.7 and +1.0: strong positive linear relationship
The formula for computing the Pearson correlation coefficient is as follows.
\[r = \frac{\sum (x - m_x) (y - m_y)}{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}\]
Here \(m_x\) is the mean of x and \(m_y\) is the mean of y.
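To connect this formula with the conceptual ratio given at the start, one can divide the numerator and the denominator by n - 1, where n is the number of samples (a symbol introduced here, not used in the original text). The numerator becomes the sample covariance and the denominator the product of the two sample standard deviations. The worked numbers use a small made-up data set chosen only for illustration.

```latex
r = \frac{\frac{1}{n-1}\sum (x - m_x)(y - m_y)}
         {\sqrt{\frac{1}{n-1}\sum (x - m_x)^2}\,\sqrt{\frac{1}{n-1}\sum (y - m_y)^2}}
  = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}

% Illustrative data: x = (1, 2, 3, 4, 5), y = (2, 4, 5, 4, 5), so m_x = 3 and m_y = 4.
\sum (x - m_x)(y - m_y) = 6, \quad
\sum (x - m_x)^2 = 10, \quad
\sum (y - m_y)^2 = 6

r = \frac{6}{\sqrt{10 \cdot 6}} = \frac{6}{\sqrt{60}} \approx 0.775
```

Because the factor 1/(n - 1) cancels, r does not depend on whether the covariance and standard deviations are normalized by n or by n - 1.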
The following is the implementation of the pearsonr function from scipy.stats; the version shown here returns only r.
import numpy as np


def _sum_of_squares(a):
    # Stand-in for SciPy's private helper of the same name (assumed
    # behavior: sum of the squared elements of a 1-D array).
    a = np.asarray(a)
    return np.sum(a * a)


def pearsonr(x, y):
    r"""
    Calculate a Pearson correlation coefficient and the p-value for testing
    non-correlation.

    The Pearson correlation coefficient measures the linear relationship
    between two datasets. Strictly speaking, Pearson's correlation requires
    that each dataset be normally distributed, and not necessarily zero-mean.
    Like other correlation coefficients, this one varies between -1 and +1
    with 0 implying no correlation. Correlations of -1 or +1 imply an exact
    linear relationship. Positive correlations imply that as x increases, so
    does y. Negative correlations imply that as x increases, y decreases.

    The p-value roughly indicates the probability of an uncorrelated system
    producing datasets that have a Pearson correlation at least as extreme
    as the one computed from these datasets. The p-values are not entirely
    reliable but are probably reasonable for datasets larger than 500 or so.

    Parameters
    ----------
    x : (N,) array_like
        Input
    y : (N,) array_like
        Input

    Returns
    -------
    r : float
        Pearson's correlation coefficient

    Notes
    -----
    The correlation coefficient is calculated as follows:

    .. math::

        r = \frac{\sum (x - m_x) (y - m_y)}
                 {\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}

    where :math:`m_x` is the mean of the vector :math:`x` and :math:`m_y` is
    the mean of the vector :math:`y`.

    References
    ----------
    http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
    """
    # x and y should have same length.
    x = np.asarray(x)
    y = np.asarray(y)
    n = len(x)
    mx = x.mean()
    my = y.mean()
    # Deviations of each sample from its mean.
    xm, ym = x - mx, y - my
    # Numerator: sum of the products of the paired deviations.
    r_num = np.add.reduce(xm * ym)
    # Denominator: square root of the product of the two sums of squares.
    r_den = np.sqrt(_sum_of_squares(xm) * _sum_of_squares(ym))
    r = r_num / r_den

    # Presumably, if abs(r) > 1, then it is only some small artifact of
    # floating point arithmetic.
    r = max(min(r, 1.0), -1.0)

    # Note: the p-value described in the docstring is not computed in this
    # excerpt.
    return r
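
As a usage sketch, the rule-of-thumb bands listed earlier can be turned into a small lookup. The helper name `describe_r` is hypothetical, and the treatment of the band edges (for example, counting exactly 0.7 as "strong") is an assumption, since the original ranges do not say which side a boundary belongs to. To stay self-contained the example computes r with NumPy's `np.corrcoef` instead of the function above, and the data are made up.

```python
import numpy as np


def describe_r(r):
    """Map r to a rule-of-thumb label (hypothetical helper)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    size = abs(r)
    # Assumption: each band includes its lower edge, e.g. |r| == 0.7 is "strong".
    if size < 0.1:
        return "almost negligible linear relationship"
    if size < 0.3:
        strength = "weak"
    elif size < 0.7:
        strength = "clear"
    else:
        strength = "strong"
    sign = "positive" if r > 0 else "negative"
    return f"{strength} {sign} linear relationship"


# Made-up sample data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r = np.corrcoef(x, y)[0, 1]  # about 0.775
label = describe_r(r)
print(f"r = {r:.3f}: {label}")
```

These bands are a convention rather than a statistical test, so the label is best read as a rough description of r.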