How to Find Outliers for Skewed Data

When data are skewed, the empirical rule is unreliable for identifying outliers because it assumes a roughly bell-shaped distribution. The interquartile range (IQR) method is a more reliable alternative: it flags potential outliers by measuring how far a value lies from the middle 50% of the data. In this lesson, we will define the interquartile range, explain how to use it to detect outliers, and work through examples from astronomy and the music industry.

Interquartile Range

What is the Interquartile Range?

The interquartile range (IQR) measures the spread of the middle 50% of a dataset. It is calculated as:

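\[ \text{IQR} = Q_3 - Q_1 \]

where \(Q_1\) is the first quartile (25th percentile) and \(Q_3\) is the third quartile (75th percentile).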

The IQR is useful for measuring variability in skewed data because it is resistant to extreme values.

Example 1

Astronomers measure the brightness (in apparent magnitude) of newly discovered exoplanets. Below are 20 brightness values recorded by space telescopes. Compute the interquartile range.

Exoplanet Brightness Measurements
Brightness (Magnitude)
-3.4 -2.9 -2.5 -2.2 -1.8 -1.5 -1.2 -0.9 -0.7 -0.5
-0.3 -0.1 0.2 0.4 0.7 1.0 1.3 1.7 2.1 3.8

Solution

Step 1: Use the Summary Statistics Calculator to identify \(Q_1\) and \(Q_3\)
\(Q_1 = -1.65\) and \(Q_3 = 0.85\)
Step 2: Compute the IQR

\[ \text{IQR} = Q_3 - Q_1 = 0.85 - (-1.65) = 2.50 \]

$$\tag*{\(\blacksquare\)}$$
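The calculation in Example 1 can be sketched in Python. The lesson does not say which quartile convention the Summary Statistics Calculator uses, so this sketch assumes the median-of-halves convention (\(Q_1\) is the median of the lower half of the sorted data and \(Q_3\) is the median of the upper half), which reproduces the values above. The function name `quartiles_by_halves` is illustrative only; other conventions, such as the interpolation used by some libraries, may give slightly different quartiles.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: median-of-halves convention; for an odd number of
    values the middle value is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]               # smallest half of the data
    upper_half = data[len(data) - half:]   # largest half of the data
    return median(lower_half), median(upper_half)

# Exoplanet brightness values from Example 1
brightness = [-3.4, -2.9, -2.5, -2.2, -1.8, -1.5, -1.2, -0.9, -0.7, -0.5,
              -0.3, -0.1, 0.2, 0.4, 0.7, 1.0, 1.3, 1.7, 2.1, 3.8]

q1, q3 = quartiles_by_halves(brightness)
iqr = q3 - q1
print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
# prints: Q1 = -1.65, Q3 = 0.85, IQR = 2.50
```

The values are formatted to two decimals because floating-point arithmetic can leave tiny rounding differences in the raw results.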

IQR Rule for Outliers

How to Identify Outliers with the IQR Method

For skewed data, a data point is considered an outlier if it is too far from the central 50% of the data.

The IQR Method tells us that

  • Any value that is smaller than \(Q_1-1.5(\text{IQR})\) is significantly low.
  • Any value that is larger than \(Q_3+1.5(\text{IQR})\) is significantly high.
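As a rough illustration, the two boundaries in this rule can be written as a small Python helper. The name `iqr_fences` and the parameter `k` (set to the rule's multiplier, 1.5) are assumptions made for this sketch, not part of the lesson. Here the helper is applied to the exoplanet data using the quartiles found in Example 1.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) boundaries for the IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Exoplanet brightness values and quartiles from Example 1
brightness = [-3.4, -2.9, -2.5, -2.2, -1.8, -1.5, -1.2, -0.9, -0.7, -0.5,
              -0.3, -0.1, 0.2, 0.4, 0.7, 1.0, 1.3, 1.7, 2.1, 3.8]
lower, upper = iqr_fences(-1.65, 0.85)

significantly_low = [x for x in brightness if x < lower]
significantly_high = [x for x in brightness if x > upper]

print(f"Lower bound: {lower:.2f}, upper bound: {upper:.2f}")
print("Significantly low:", significantly_low)
print("Significantly high:", significantly_high)
```

With boundaries of about -5.40 and 4.60, every brightness value falls inside the range, so this dataset has no outliers under the rule.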

Example 2

A record label tracks the number of streams (in millions) for 20 new song releases. Identify any outliers using the IQR method.

Music Streaming Counts (Millions)
Streams
1.2 1.5 2.1 2.4 2.9 3.3 3.6 4.0 4.2 4.8
5.1 5.7 6.0 6.2 6.8 7.3 8.1 8.5 9.0 15.4

Solution

Step 1: Use the Summary Statistics Calculator to identify \(Q_1\) and \(Q_3\)

\(Q_1 = 3.1\) and \(Q_3 = 7.05\)

Step 2: Compute the IQR

\[ \text{IQR} = Q_3 - Q_1 = 7.05 - 3.1 = 3.95 \]

Step 3: Determine Outlier Boundaries
  • Lower Bound: \[\begin{align*}Q_1-1.5(\text{IQR})&=3.1-1.5(3.95)\\&=-2.825\end{align*}\]
  • Upper Bound: \[\begin{align*}Q_3+1.5(\text{IQR})&=7.05+1.5(3.95)\\&=12.975\end{align*}\]
Step 4: Identify Outliers

No value falls below the lower bound of -2.825 (a song cannot have a negative number of streams), so there are no significantly low values. One song, however, has 15.4 million streams. Since 15.4 is larger than 12.975, this value is significantly high.

$$\tag*{\(\blacksquare\)}$$
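The four steps of Example 2 can be checked end to end with a short Python sketch. As with the lesson's quartile values, it assumes the median-of-halves convention for \(Q_1\) and \(Q_3\); the variable names are illustrative.

```python
from statistics import median

# Streaming counts (millions) from Example 2
streams = [1.2, 1.5, 2.1, 2.4, 2.9, 3.3, 3.6, 4.0, 4.2, 4.8,
           5.1, 5.7, 6.0, 6.2, 6.8, 7.3, 8.1, 8.5, 9.0, 15.4]

# Step 1: quartiles (assumed convention: medians of the two halves)
data = sorted(streams)
half = len(data) // 2
q1 = median(data[:half])
q3 = median(data[len(data) - half:])

# Step 2: interquartile range
iqr = q3 - q1

# Step 3: outlier boundaries
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Step 4: values outside the boundaries
outliers = [x for x in data if x < lower or x > upper]

print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"Bounds: [{lower:.3f}, {upper:.3f}]")
print("Outliers:", outliers)
```

The only value flagged is 15.4, matching the conclusion reached by hand.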

Conclusion

The IQR method gives a robust way to flag potential outliers in skewed datasets, as the astronomy and music examples show. In the next lesson, we will learn how to visualize both the five-number summary and the IQR using boxplots.