This article will delve into what outliers are, how we can identify outliers in Stata, and finally, how they can be treated in Stata.

Put simply, an outlier is a value or observation that is either extremely high or extremely low compared to the other values in a given dataset. The meaning and interpretation of the term ‘outlier’ may change depending on the type of data, the researcher’s objective, and the research setting. Whether outliers are treated or not is something the researcher decides based on the context and setting of their research.

This article will only cover univariate outliers. These are outliers that pertain to only one variable in the dataset, i.e. only one variable is affected by the presence of outliers.

Identifying and Finding Outliers in Your Data

Method 1: Sorting the Data

We start by loading our dataset of choice, which in this case will be Stata’s built-in auto dataset:

sysuse auto.dta

Let’s sort the price variable (in ascending order) to see how outliers can affect it, and then open the data for inspection:

sort price
edit

The edit command opens the dataset for you to inspect and edit. In this case, the price variable appears to have no extreme values. Sorting and inspecting serves only one purpose: it gives you a visual overview of how a variable’s values increase and whether a few extreme values exist in isolation.

If we were to change the last (maximum) value to, say, 50,000, it would become an outlier, since it would then be very high compared to the observation before it (14,500). (We replace this value only to demonstrate the process; you don’t have to replace any value in your own data.) If such a value does exist in your dataset, sorting will easily help you identify it.

Method 2: Box Plot

A box plot is the graphical equivalent of a five-number summary or the interquartile method of finding outliers.

To draw a box plot, click on the ‘Graphics’ menu option and then ‘Box plot’. In the dialogue box that opens, choose the variable that you wish to check for outliers from the drop-down menu in the first tab, called ‘Main’.

The plotted value at the very top indicates an outlier, since it lies outside the typical distribution or pattern of the variable. However, the plot does not tell us exactly which observation this is, which would give it more context. For that, we would have to sort the data manually. To get around this issue, let’s ask Stata to label the graph.

Let’s generate a variable called id that will act as a unique ID number for each row:

generate id = _n

This generates a variable that starts from 1 for the first observation and increments by 1 for each subsequent observation. In other words, this variable holds the observation number of each row.

Let’s now draw the box plot from the command window. Commands to generate graphs typically start with graph, followed by the graph type, and then the variable(s) we wish to plot. These can then be followed by any options you may want to add:

graph box price, marker(1, mlabel(id))

The mlabel() option ensures that the plotted dots for price are labelled with the corresponding observation numbers. Now, the graph shows a label saying ‘74’ beside the outlier value (as well as beside the other plotted values of price). This way, one can directly check the 74th observation (or more, if there are multiple outliers) in the dataset without having to sort and inspect manually.

One problem with a box plot is that it does not allow us to change the interquartile level. This limits our ability to inspect outliers that are defined differently. Some definitions treat any value more than 150% of the interquartile range (IQR) beyond the quartiles as an outlier, while others use 220% or 300% of the IQR. To address this, we will now turn to the extremes command.
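The effect of the different IQR multipliers mentioned above can also be checked by hand. The following is a minimal sketch rather than a definitive method: it uses the quartiles stored by summarize, detail to build the fences, and the local macro k and the variable name price_out are illustrative names that do not come from the walk-through itself.

```stata
* Minimal sketch: flag price values outside k*IQR fences by hand.
* Assumption: k and price_out are illustrative names; set k to 2.2 or 3
* to apply the stricter definitions of an outlier.
sysuse auto, clear
local k = 1.5

* summarize, detail stores the quartiles in r(p25) and r(p75)
quietly summarize price, detail
local iqr   = r(p75) - r(p25)
local lower = r(p25) - `k' * `iqr'
local upper = r(p75) + `k' * `iqr'

* 1 if the value lies outside the fences, 0 otherwise
generate byte price_out = (price < `lower' | price > `upper') if !missing(price)

* Inspect the flagged observations
list make price if price_out == 1
count if price_out == 1
```

Re-running the block with a larger k shows how the set of flagged observations shrinks as the definition becomes stricter, which is the comparison a box plot on its own cannot offer.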