Pearson correlation coefficient


The Pearson correlation coefficient (also called Pearson's r) is widely used to measure how strongly two variables are related. The idea is as follows.

r = (degree to which X and Y vary together) / (degree to which X and Y each vary on their own)
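The ratio above can be sketched in plain Python. This is a minimal illustration, not SciPy's code: the function name `covariation_ratio` and the sample data are hypothetical, and the "variation" terms are taken to be sums of deviations from the mean.

```python
import math


def covariation_ratio(xs, ys):
    """Return r as (joint variation of x and y) / (separate variation of x and y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: how much x and y vary together (sum of cross-deviations).
    together = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: how much x and y each vary on their own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return together / (spread_x * spread_y)


hours = [1, 2, 3, 4, 5]          # hypothetical study hours
scores = [52, 58, 61, 70, 74]    # hypothetical exam scores
r = covariation_ratio(hours, scores)
print(round(r, 4))
```

When the two variables move together almost perfectly, the numerator is nearly as large as the denominator and r comes out close to 1.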

결과의 해석

r 값은 X 와 Y κ°€ μ™„μ „νžˆ λ™μΌν•˜λ©΄ +1, μ „ν˜€ λ‹€λ₯΄λ©΄ 0, λ°˜λŒ€λ°©ν–₯으둜 μ™„μ „νžˆ 동일 ν•˜λ©΄ –1 을 가진닀. κ²°μ •κ³„μˆ˜ (coefficient of determination) λŠ” r^2 둜 κ³„μ‚°ν•˜λ©° 이것은 X λ‘œλΆ€ν„° Y λ₯Ό μ˜ˆμΈ‘ν•  수 μžˆλŠ” 정도λ₯Ό μ˜λ―Έν•œλ‹€.

In general, r can be interpreted as follows:

r between -1.0 and -0.7: strong negative linear relationship
r between -0.7 and -0.3: moderate negative linear relationship
r between -0.3 and -0.1: weak negative linear relationship
r between -0.1 and +0.1: negligible linear relationship
r between +0.1 and +0.3: weak positive linear relationship
r between +0.3 and +0.7: moderate positive linear relationship
r between +0.7 and +1.0: strong positive linear relationship

The formula for the Pearson correlation coefficient is as follows.

\[r = \frac{\sum (x - m_x) (y - m_y)}{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}\]

\(m_x\) is the mean of x and \(m_y\) is the mean of y.
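As a small worked example (the data are made up for illustration), take x = (1, 2, 3) and y = (1, 3, 2), so that \(m_x = m_y = 2\). The deviations from the mean are (-1, 0, 1) for x and (-1, 1, 0) for y. The two sums of squares in the denominator are computed separately and then multiplied:

\[\sum (x - m_x)(y - m_y) = (-1)(-1) + (0)(1) + (1)(0) = 1\]

\[\sum (x - m_x)^2 = 1 + 0 + 1 = 2, \qquad \sum (y - m_y)^2 = 1 + 1 + 0 = 2\]

\[r = \frac{1}{\sqrt{2 \cdot 2}} = 0.5\]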

λ‹€μŒμ€ scipy.stats의 pearsonr ν•¨μˆ˜ κ΅¬ν˜„ μ½”λ“œμ΄λ‹€.

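import numpy as np


def _sum_of_squares(a):
    # Stand-in for SciPy's private helper of the same name:
    # square each element and return the sum.
    return np.sum(np.asarray(a) ** 2)


def pearsonr(x, y):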
    r"""
    Calculate a Pearson correlation coefficient.

    The Pearson correlation coefficient measures the linear relationship
    between two datasets. Strictly speaking, Pearson's correlation requires
    that each dataset be normally distributed, and not necessarily zero-mean.
    Like other correlation coefficients, this one varies between -1 and +1
    with 0 implying no correlation. Correlations of -1 or +1 imply an exact
    linear relationship. Positive correlations imply that as x increases, so
    does y. Negative correlations imply that as x increases, y decreases.

    This version returns only r; it does not compute the p-value for
    testing non-correlation.

    Parameters
    ----------
    x : (N,) array_like
        Input
    y : (N,) array_like
        Input

    Returns
    -------
    r : float
        Pearson's correlation coefficient

    Notes
    -----
    The correlation coefficient is calculated as follows:

    .. math::

        r = \frac{\sum (x - m_x) (y - m_y)}
                 {\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}

    where :math:`m_x` is the mean of the vector :math:`x` and :math:`m_y` is
    the mean of the vector :math:`y`.

    References
    ----------
    http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
    """
    # x and y should have same length.
    x = np.asarray(x)
    y = np.asarray(y)
    n = len(x)
    mx = x.mean()
    my = y.mean()
    xm, ym = x - mx, y - my
    r_num = np.add.reduce(xm * ym)
    r_den = np.sqrt(_sum_of_squares(xm) * _sum_of_squares(ym))
    r = r_num / r_den

    # Presumably, if abs(r) > 1, then it is only some small artifact of
    # floating point arithmetic.
    r = max(min(r, 1.0), -1.0)

    return r
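The rule-of-thumb bands for r listed earlier can be applied to a computed coefficient with a small helper. The following is a minimal sketch: `describe_correlation` is a hypothetical name, not part of SciPy, and because the listed bands share their endpoints, sending a boundary value (0.1, 0.3, 0.7) to the stronger band is an assumption made here.

```python
def describe_correlation(r):
    """Map r to a rule-of-thumb label for the strength of the linear relationship."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    size = abs(r)
    # Assumption: a value exactly on a boundary goes to the stronger band.
    if size >= 0.7:
        strength = "strong"
    elif size >= 0.3:
        strength = "moderate"
    elif size >= 0.1:
        strength = "weak"
    else:
        return "negligible linear relationship"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear relationship"


for value in (0.85, -0.45, 0.2, 0.05):
    print(value, "->", describe_correlation(value))
```

The labels are only a reading aid; the cut-offs are conventions and should not be treated as a statistical test.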