How to Find the Correlation of a Pandas DataFrame

Correlation of Pandas DataFrames: Mastering the corr() and corrwith() Functions

Correlation is a statistical measure of how strongly two variables move together. In Python, the Pandas library provides methods to calculate correlations within a DataFrame and between a DataFrame and a Series or another DataFrame. This article focuses on the two key functions for calculating correlations in Pandas: corr() and corrwith().

Understanding Correlation

Before diving into the code, let’s recap the basics of correlation:

  • Correlation values range from -1 to +1:
    • A positive correlation indicates that the variables move in the same direction.
    • A negative correlation signifies that the variables move in opposite directions.
    • A correlation of 0 means there is no linear relationship between the variables. It does not, by itself, mean the variables are independent of each other.
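
The rules above can be checked on a few small, made-up Series. This is only an illustrative sketch: the names x, same_direction, opposite_direction and squared are invented for the example and are not part of the article's dataset. It uses Series.corr(), which computes the Pearson coefficient by default. The last case shows that a correlation near 0 rules out a linear relationship but not dependence in general.

```python
import pandas as pd

# Toy data chosen for illustration only
x = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])
same_direction = 2 * x + 1      # rises whenever x rises
opposite_direction = -3 * x     # falls whenever x rises
squared = x ** 2                # fully determined by x, but not linearly

r_pos = x.corr(same_direction)
r_neg = x.corr(opposite_direction)
r_zero = x.corr(squared)

print(round(r_pos, 6))   # perfect positive correlation
print(round(r_neg, 6))   # perfect negative correlation
print(round(r_zero, 6))  # no linear relationship, although squared depends on x
```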

The corr() Function: Unveiling Correlations within a DataFrame

The corr() function is your go-to method for calculating pairwise correlations between the columns of a Pandas DataFrame. It returns a correlation matrix in which each cell holds the correlation between two columns. By default it uses the Pearson coefficient.

import pandas as pd

df = pd.DataFrame([[15,21,3,5],[8,7,9,5],
                   [5,1,5,8],[4,5,10,7]], 
                  columns = ['a','b','c', 'd'])

# Calculate the full correlation matrix
corr_matrix = df.corr()
print(corr_matrix)

# Extract the correlation between two specific columns by label
correlation_ab = corr_matrix.loc['a', 'b']
print(correlation_ab)

# Output:
#          a         b         c         d
#a  1.000000  0.956738 -0.690649 -0.760644
#b  0.956738  1.000000 -0.562501 -0.753628
#c -0.690649 -0.562501  1.000000  0.084072
#d -0.760644 -0.753628  0.084072  1.000000
#0.9567377201337892

Key points about corr():

  • The diagonal of the correlation matrix is always 1 (a variable is perfectly correlated with itself).
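  • The matrix is symmetric: the correlation of a with b equals the correlation of b with a, so the upper and lower triangles mirror each other.
  • You can read the correlation between two specific columns from the matrix, or compute that single pair directly with Series.corr(), for example df['a'].corr(df['b']).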
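
The three key points can be verified on the same example DataFrame. The sketch below is one way to do it, using NumPy's allclose and isclose to allow for floating-point rounding. The variable names (diagonal_is_one, is_symmetric, pair_ab, matches_matrix) are chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[15, 21, 3, 5], [8, 7, 9, 5],
                   [5, 1, 5, 8], [4, 5, 10, 7]],
                  columns=['a', 'b', 'c', 'd'])
corr_matrix = df.corr()
values = corr_matrix.to_numpy()

# 1. Every diagonal entry is 1 (up to floating-point rounding)
diagonal_is_one = np.allclose(np.diag(values), 1.0)

# 2. The matrix equals its own transpose, i.e. it is symmetric
is_symmetric = np.allclose(values, values.T)

# 3. A single pair can be computed without building the full matrix
pair_ab = df['a'].corr(df['b'])
matches_matrix = np.isclose(pair_ab, corr_matrix.loc['a', 'b'])

print(diagonal_is_one, is_symmetric, matches_matrix)
```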

Parameters of the corr() Function

  • min_periods: Sets the minimum number of observations required per pair of columns to have a valid result. Pairs with fewer observations produce NaN.
  • method: Specifies the correlation method to use: ‘pearson’ (the default), ‘kendall’, or ‘spearman’.
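
As a rough illustration of both parameters, the sketch below computes a Spearman matrix on the example data, then blanks out one value to show how min_periods behaves. The missing value and the names df_missing and strict_matrix are assumptions made for the example, not part of the original dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[15, 21, 3, 5], [8, 7, 9, 5],
                   [5, 1, 5, 8], [4, 5, 10, 7]],
                  columns=['a', 'b', 'c', 'd'])

# Rank-based (Spearman) correlation instead of the default Pearson
spearman_matrix = df.corr(method='spearman')
print(spearman_matrix.loc['a', 'b'])

# Copy the data and blank out one value so column 'd' has only 3 observations
df_missing = df.astype(float)
df_missing.loc[0, 'd'] = np.nan

# Require 4 paired observations: every pair involving 'd' becomes NaN,
# while pairs with all 4 observations are unaffected
strict_matrix = df_missing.corr(min_periods=4)
print(strict_matrix)
```

With the default min_periods, the pairs involving 'd' would instead be computed from the three rows where both values are present.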

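The corrwith() Function: Correlations between a DataFrame and a Series

The corrwith() function comes in handy when you want to correlate each column of a DataFrame with a Series. It also accepts another DataFrame, in which case matching columns are correlated with each other. The example below uses a Series.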

import pandas as pd

df = pd.DataFrame([[15,21,3,5],[8,7,9,5],
                   [5,1,5,8],[4,5,10,7]], 
                  columns = ['a','b','c', 'd'])

s = pd.Series([5,8,7,7])

# Calculate correlations between the DataFrame and the Series
correlations = df.corrwith(s)
print(correlations)

# Extract the correlation for a specific column
correlation_a = correlations['a']
print(correlation_a)

#Output:
#a   -0.746733
#b   -0.807023
#c    0.781722
#d    0.220755
#dtype: float64
#-0.7467330458877309

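Parameters of the corrwith() Function

  • other: The Series or DataFrame to correlate with.
  • axis: Specifies the axis along which to correlate. With axis=0 (the default), each column is correlated with the other object, giving one value per column. With axis=1, each row is correlated instead, giving one value per row.
  • method: Same as in corr(), specifies the correlation method.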
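
To see what the axis parameter changes, the sketch below correlates the example DataFrame with a second DataFrame of the same shape. The second DataFrame (other, a copy with column 'a' sign-flipped) is a hypothetical one built only for this illustration.

```python
import pandas as pd

df = pd.DataFrame([[15, 21, 3, 5], [8, 7, 9, 5],
                   [5, 1, 5, 8], [4, 5, 10, 7]],
                  columns=['a', 'b', 'c', 'd'])

# Hypothetical second DataFrame: same labels, column 'a' sign-flipped
other = df.copy()
other['a'] = -other['a']

# axis=0 (default): correlate matching columns, one result per column
by_column = df.corrwith(other, axis=0)
print(by_column)

# axis=1: correlate matching rows, one result per row
by_row = df.corrwith(other, axis=1)
print(by_row)
```

The first result is indexed by the column labels (a to d), the second by the row labels (0 to 3). Column 'a' correlates negatively with its sign-flipped copy, while the unchanged columns correlate perfectly with themselves.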

Conclusion

Understanding how to calculate and interpret correlations is crucial for data analysis. Pandas covers the common cases with two functions: corr() for pairwise correlations within a DataFrame, and corrwith() for correlations against a Series or another DataFrame.

Remember: Correlation does not imply causation. Always be cautious when interpreting correlation results and consider other factors that might influence the relationship between variables.

Happy Learning!
