Documentation/HowTo/Calc/BoxplotWithWhiskers

=========================================> WORK IN PROGRESS <========================================= TODO: Submit for review Get approval by the authors

=
=========================================================================================

Copyright
This tutorial was written by Henk van der Burg, Harry Croon, Rob Westein and Kees Kriek.

Introduction
A box plot is a graphical representation of a data set, showing: • The minimum (the lowest value). • The first quartile. • The median. • The third quartile. • The maximum (the highest value). This data set can be, for example, a large database or a sample from a population.

The median
The median of an odd number of numbers is the value of the middle number of that row of numbers after the row is sorted by size. If the number of numbers is even, the average of the two middle numbers is taken.

Median for an even number of numbers in a number sequence
With an even number of numbers, the median is always between the 2 middle numbers. The median is then the mean of these two middle numbers. For example, we have the following numbers: 942-721-483-223-934-780-317-356. We first put these numbers from smallest to large, so 223-317-356-483-721-780-934-942. The median is between the 2 middle bold numbers and is then (483 + 721) / 2 = 602.

Median for an odd number of numbers in a number sequence
With an odd number of numbers, the median is the middle number. The middle, bold, number, the fifth, in the range 223-317-356-483-721-780-934-942-966 is the median.

The first and third quartiles
After we have determined the median, we can also calculate the quartiles. The median splits the data series into two parts, as it were, creating a left and a right part. The first quartile is the median of the left portion and the third quartile is the median of the right portion. These can be determined in the same way as explained above. So split the number sequence, 223-317-356-483-721-780-934-942-966. The first quartile in the bold left part of the number sequence is therefore between 317 and 356 in and is then (317 + 356) / 2 = 336.5. The third quartile in the italicized right part is between 780 and 934 and is then (780 + 934) / 2 = 857.

Draw a box plot with pen and paper
Now that we know what the median, the first and the third quartiles mean, we can start making a box plot. In a box plot we will further use the minimum and the maximum. You make a box plot as follows. First make a line with numbers in which the box plot should be placed. Then place a dot at the lowest value, the first quartile, the median, the third quartile and the highest value. Then draw a rectangle between the first and third quartiles and place a line in this rectangle at the height of the median. This is shown schematically in Figure 1.

Create Box Plots Using Calc
Calc does not currently have a prepared diagram to create a box plot, so we need to build a box plot using the above theory. As an example we will use Table 1, which shows the annual sales of the stores in 2018, 2019 and 2020.

1) Copy the range A1:D14 from Table 1 to a blank Calc worksheet.

2) Save this worksheet with the name BoxplotBranches.

Calculate the concepts mentioned in the Introduction
1) Select and right click cells A16:D16 and select Merge Cells in the context menu or select on the main menu bar.

2) Type the text Summary Minimum, Median, Maximum, 1st and 3rd quartiles.

3) In cells A17, A18, A19, A20, and A21, type Minimum (Q0), First quartile (Q1), Median (Q2), Third Quartile (Q3), and Maximum (Q4).

Calculate the minimum
1) To calculate the minimum over the year 2018, place the mouse pointer in cell B17, type the equal sign (=) and select the Function Wizard icon on the Formula Bar, which will open the Function Wizard dialog box Figure 2.



2) In the Functions list, double-click the MIN function, or from the Category list above it, first select Static, and then double-click the MIN function in the Functions list. In the Function Wizard, the right part is now filled with required field parameters (data) for the MIN function to be filled in with a blinking mouse pointer in the Number 1 box (shown (not blinking) in Figure 3).



3) Select the Shrink icon at the right of the Number 1 field, and then select the cell range B2:B14. Your selection will be inserted into the smaller dialog box of the Function Wizard (Figure 4).



4) Click the same icon again, now named Expand, and the cell range will be included in the Function Wizard dialog box in the Number 1 box with 100 in the top right of the Function result and the same number in the Result box above the Formula box (Figure 5). The Formula box displays the formula and click OK.



5) The result of the MIN function is now listed in cell B17 and the formula is in the Input line on the Formula Bar (Figure 6).



6) In the black-outlined cell B17, place the mouse pointer on the small black square at the bottom right until the mouse pointer changes shape (under Windows into a plus sign). Click and drag cell B17 to cell D17 to copy the function and result to cells C17 and D17.

7) Save your document before continuing.

Calculate the first quartile
In the theory in the section "The first and third quartiles" it is indicated that the first quartile is the median of the left part of a number sequence sorted from lowest to highest. The seventh (bold) number in the sorted number sequence (100,110,112,117,118,119,133, 136, 149, 152, 157, 172, 189) is the median of the number sequence. The median of the left portion (100,110,112,117,118,119) is between 112 and 117 (bold). Its median is: (112 + 117) / 2 = 114.5. Calc has a function for this that calculates this from the number sequence: the function QUARTILE.EXC (excluding median). That function calculates the first quartile (K1) in cell B18. Using the Function Wizard as described in points 1 to 6 of section "Calculate the minimum" or by entering the formula directly in the Input line on the Formula Bar or directly in the cell as described in the Note after point 7 above. The function to enter is: = QUARTILE.EXC (B2:B14,1). The 1 after the number sequence indicates that the first quartile must be calculated. After entering the function in cell B18, copy this cell to cells C18 and D18, as described under point 6 of section "Calculate the minimum". The Summary, Minimum, Median, Maximum, 1st and 3rd quartiles section in your table should then look like Figure 7.



{note| The first quartile can also be calculated with the function PERCENTILE.EXC. The function for this is: =PERCENTILE.EXC (B2:B14, 0.25).}}

Calculate the median
We have already indicated the outcome of the calculation in the first paragraph of the text. Calc has the function MEDIAN to calculate this. The function for cell B19 is: =MEDIAN (B2: 14), which is processed in the same way as indicated for the calculation of the minimum and first quartile. The table after the calculation of the medians is shown in Figure 8.



Calculate the third quartile
The calculation for the third quartile is basically the same as the calculation for the first quartile, where the reference to the profile should not be 1 in this case but 3 and 0.75 in the case of the PERCENTILE.EXC function. The functions are as follows: =QUARTILE.EXC (B2:B14,3) or PERCENTILE.EXC(B2:B14,075). After calculating the third quartile, you have a table like Figure 9.



Calculate the maximum
The maximum is calculated in the same way as the minimum, but using the MAX function instead of the MIN function. The function for this in cell B21 is: =MAX(B2:B14) and copy this cell back to cells C21 and D21. All concepts for the box plots have now been calculated in Figure 10.



Determine differences between the elements of the box plot
From the theory in the section "Draw a box plot with pen and paper" and Figure 1, it can be concluded that the differences between the above data elements must be calculated. We do this in a separate table labeled Differences for the boxes and whiskers in the merged cells A23:D23 with underneath in cells B24, C24 and D24 containing 2018, 2019 and 2020 respectively.

1) Merge cells A23:D23 as indicated in point 1 of "Calculate the concepts mentioned in the Introduction" and in these merged cells type the text Differences for Boxes and Whiskers and press.

2) In cells B24, C24, and D24, type 2018, 2019, and 2020, respectively, and press or  after each entry in the cell.

3) In cells A25, A26, A27, A28, and A29, respectively, type Minimum (Q0), First quartile (Q1) - Minimum (Q0), Median (Q2) - First Quartile (Q1), Third quartile (Q3) - Median (Q2 ) and Maximum (Q4) - Third       quartile (Q3) and press  after each entry in a cell.

4) The minimum in cell B25 is equal to the minimum in cell B17. In cell B25 type = B17 as a reference to this cell and press Enter, or in cell B25 type =, click the mouse pointer in cell B17 and press.

5) In cell B26 you must fill in the difference between the First quartile (Q1) and the Minimum (Q0), otherwise write the difference between cells B18 and B17. In cell B26 type the formula = B18-B17 and press to calculate this difference, or in cell B26 type = and click the mouse pointer in B18, then type the character - and click the mouse pointer in cell B17 and press  or use the Function Wizard to calculate the difference.

6) The differences in cells B27, B28 and B29 are calculated in a similar manner. Since cell B26 contains relative references, the formulas for these cells can be copied down, as previously described in point 6 of section "Calculate the minimum". Select cell B26 and hover the mouse pointer over the small square at the bottom right of the cell and when the mouse pointer has changed shape, drag the cell down to cell B29.

7) All cells in B25 to B29 contain relative references and can be copied to the right in the same way as under point 6 above to calculate the differences for the years 2019 and 2020. Select the cell range B25:B29 and hover the mouse pointer over the small square at the bottom right of the selection in cell B29, and if the mouse pointer has changed shape, drag the selection 2 columns to the right.

8) The table now contains the data shown in Figure 11.



9) Save the file.

Create the diagram
We are now going to make the diagram. Remember that the values in row 25 and row 29 are the bottom and top whiskers and these in B26 to B28 make up the boxes.

1) Select cells B24:D29 and select on the main menu bar or click the icon Insert Chart on the Standard bar.

2) Select Column from the Choose a Chart Type list in Step 1 of the Chart Wizard and select the middle example Stacked (see Figure 12).



3) Click Next.

4) In Step 2 of the Chart Wizard, the checkbox for Data series in columns is selected, change it by clicking Data series in rows, also click First row as label to include the years in the chart (Figure 13) and click Next.



5) In Step 3 of the Chart Wizard, do not change anything and click Next.

6) In Step 4 of the Chart Wizard, deselect Display legend (Figure 14) and click Finish.



7) Your diagram will now look like the one shown in Figure 15. Save your document.



Explanation of the colored blocks in the diagram
• The tops of the blue stacks and the bottoms of the orange stacks are the bottoms of the bottom whiskers. • The tops of the orange stacks and the bottoms of the yellow stacks are the tops of the bottom whiskers. • The difference between tops and bottoms of the orange stacks make up the bottom whiskers. • The yellow and green stacks eventually form the box of the box plot. • The tops of the yellow stacks and the bottoms of the green stacks are the median in the box of the box plot. • The bottoms of the purple stacks are the bottoms of the top whiskers. • The tops of the purple stacks are the tops of the top whiskers. The above is shown schematically for the 2018 column in Figure 16.

Edit the blue blocks in the stacked column chart
The information above shows that only the tops of the blue stacks are important for determining the bottom whiskers. Also, the bottoms of the orange stacks are the same as the tops of the blue stacks. The blue stacks are therefore not really important when making the whiskers. We will therefore first remove the blue stacks from or make them invisible in the stacked columns.

1) Double-click on the created diagram and a gray border appears around the diagram with a black block in the middle of all sides and at the corner points, a total of 8 pieces (see Figure 15).

2) Then click in one of the blue blocks in the diagram. A small block (Figure 17) appears in the middle of these areas. (Note that when the mouse pointer reaches the blue block, a help tip is displayed that says: Data point x, data series x, values: Year x value.)



3) Now right click on one of the blue stacks and select Format Data Series in the context menu or select on the main menu bar.

4) The Data Series dialog box (Figure 18) opens. Click on the Area sheet tab to open the tab and select the None button (after which the block is no longer selectable) or select the Colour button and select the color White from one of the available palettes in the Colours section (after which the block is selectable and can still be changed). In the examples on the right side of the dialog box, the newly selected color is displayed under New.



5) Then click the Borders sheet tab and select - none - as Style in the Line Properties section and click OK. All blue stacks are now no longer visible (Figure 19).



Edit the orange stacks in the stacked column chart
As indicated above, the bottom whiskers are formed by making the differences between the top and bottom sides of the orange areas in Calc (see also points 1 and 2 in Figure 16). The orange blocks themselves are therefore not required for making the bottom whiskers and they are removed or made invisible after the bottom whiskers have been applied. Proceed as follows: 1) Double-click the stacked column chart if it does not have a gray border with 8 squares around it.

2) Now select the orange stacks by clicking on one of the orange stacks. A small block will now appear in the orange stacks.

3) Right-click on one of the orange stacks and select Insert y-Error Bars in the context menu or select on the main menu bar.

4) The dialog box y-Error Bars for Data Series ‘Row 26’ opens. Click the Line sheet tab to open the Line tab, and in the Width field in the Line Properties section, change the value 0.00 cm to 0.03 cm (Figure 20).



5) Now select the y-Error Bars sheet tab to open the y-Error Bars tab and select the Cell Range option in the Error Category section and Negative in the Error Indicator section. In the Parameters section, the Negative (-) box now becomes active.

6) Click the Select data range button next to the Negative (-) box and select the cell range B26: D26 (Figure 21).



7) Click OK. Note, although difficult to see, that the whiskers are drawn in the diagram.

8) Click again on one of the orange stacks (containing the small block) and eight square blocks are placed around the selected orange stack, the selection handles (Figure 22).



9) Right-click on the orange stack with the selected selection handles and select Format Data Series in the context menu.

10) The Data Series dialog box opens. In it, delete the orange stack or make it invisible, as described in points 4 and 5 in the Edit the blue blocks in the stacked column chart section. This removes all orange stacks from the stacked column chart. Your diagram with the created bottom whiskers will now look like Figure 23.



Edit the yellow and green stacks in the stacked column chart
The yellow stacks in the stacked column chart make up the bottom parts of the boxes and the green top stacks make up the top parts of the boxes. The border between the yellow and green stacks indicates the median. To do this in Calc, do the following: 1) Double-click on the stacked column chart so that it has a gray border with 8 black blocks around it (Figure 15).

2) Click on one of the yellow stacks so that a small block appears in it and right-click on one of the yellow stacks and select Format Data Series in the context menu or select on the main menu bar.

3) Click the Borders sheet tab in the Data Series dialog box to open the Borders tab.

4) Select Continuous from the Style drop-down list in the Line Properties section and change the Width to 0.03 cm. Leave the color of the stacks unchanged. See The color of the boxes in the box plots with whiskers. after this and click OK.

5) The yellow stacks now have a black border around them.

6) Carry out the same actions from points 2 to 4 on the green stacks, so that the box and the median are created in Figure 24.



Edit the purple stacks in the stacked column chart
The purple stacks indicate the length of the top whiskers, which are drawn from the bottom to the top of the purple stacks. Since the bottom of the purple stacks is the same as the top of the green stacks, that information is not required. Our last table on row 29 determined the length of the top whiskers. The purple stacks are therefore not necessary (anymore) when making the box plots. Delete the purple stacks as described in points 1 to 5 under Edit the blue blocks in the stacked column chart. Your stacked column chart now looks like Figure 25.



Make the top whiskers in the diagram
The top whiskers must be placed on top of the boxes, consisting of the yellow and green parts, and have the length calculated on row 29 of the table shown in Figure 11. Do the following to place the top whiskers.

1) Double-click on the diagram so that it has a gray border with 8 black blocks around it (Figure 15).

2) Click on one of the green stacks so that a small block appears in it (Figure 24).

3) Right-click on one of the green stacks and select Insert y-Error Bars in the context menu or select on the main menu bar to open the dialog y-Error Bars for given ‘Row 28’.

4) Click the Line sheet tab and choose Continuous from the Style drop-down list in the Line Properties section of this tab and change the Width to 0.03 cm.

5) Click the y-Error Bars sheet tab and click Cell Range in the Error Category section and Positive in the Error Indicator section to select these parameters.

6) Click the Select data range icon to the right of the Positive (+) box in the now active Parameters section.

7) Select the cell range B29:D29 in the last table. The cell range is now included in the dialog y-Error Bars for data ‘Row 28’ (Figure 26).



8) Click OK and the top whiskers have been created in the diagram and your box plots and whiskers are complete (see Figure 27) and save the document.



The color of the boxes in the box plots with whiskers
In our box plots with whiskers, the boxes are indicated by yellow stacks below the median (the middle lines in the boxes) and by green stacks above the median. In many cases, no color is given to the surfaces, which makes them appear white. Excel 2016 and Excel 2019, where a box plot with whisker is one of the standard diagrams, the boxes per column (in our diagram the columns 2018, 2019 and 2020) are given a separate color, so that visually the difference in columns is visible. Depending on your preference or the situation in which you are going to use the box plots with whiskers, you can choose one of the above options. To adjust the colors in the boxes of the box plots:

1) Double-click on the diagram so that it has a gray border with 8 black squares around it.

2) To make all stacks white, select one of the yellow or green stacks and right-click on one of the yellow or green stacks and select Format Data Series in the context menu or select on the main menu bar.

3) On the Area tab of the Data Series dialog box, select the None button or the Colour button and choose White from the palettes in the Colours section.

4) To white or color only the stack in the 2018 column, click the yellow or green stack in the 2018 column so that it has selection handles around it and select Format Data Point in the context menu or select on the menu bar. Click the Colour button and choose the color White or the color of your choice from one of the palettes in the Colours section and click OK.

5) Make the other parts of the box the same as the choices made above. A possible example is shown in Figure 28.



Summary
If the box plots with whiskers are placed next to the stacked column charts, as in Figure 29, in the box plots it seems easier to observe that there is a reasonable skewed growth between the years 2018 to 2020 than in the stacked column charts.



It is also easier to see in the box plots that the average over the year is fairly evenly distributed for the year 2018. For the year 2019, the average is well above it and for the year 2020 the average is well below it.