Skip to main content

Outliers

Outliers are:

  • Abnormal values
  • non-standard values

This can be caused by:

  • Errors in data collection
  • By chance
  • Fraud

How to deal with outliers:

  • Remove
  • Substitute/Transform
  • Do nothing

Detecting Outliers

Boxplot

A boxplot is a graphical representation of the distribution of a dataset. It shows the median, quartiles, and outliers.

import seaborn as sns

sns.boxplot(data=df, y='column')

Scatter Plot

A scatter plot is a graphical representation of the relationship between two variables. Outliers can be identified as points that are far from the rest of the data.

import matplotlib.pyplot as plt

plt.scatter(df['column1'], df['column2'])

PyOD

PyOD is a Python library for detecting outliers in multivariate data.

For each type use:

  • Time-series outlier detection: TODS
  • Graph outlier detection: PyGOD

Using KNN

from pyod.models.knn import KNN

detector = KNN()
detector.fit(X)

detections = detector.labels_ # 0 for inliers, 1 for outliers

confidence = detector.decision_scores_ # Outlier scores

outliers_index = []
for i in range(len(prevision)):
    if detections[i] == 1:
        outliers.append(i)

outliers = X[outliers_index]