Wednesday, 11 July 2018

microsoft excel - Grouping data columns by shared values


I don't know how to properly describe what I need to do, so I will give an example. A colleague has a data set in Excel like so:


Col A    Col B    Col C
aaaaa aaaaa bbbbb
bbbbb ccccc ccccc
ccccc ddddd eeeee

The end result should be something like this:


Col A    Col B    Col C
aaaaa aaaaa
bbbbb bbbbb
ccccc ccccc ccccc
ddddd
eeeee

Or even:


      Col A    Col B    Col C
aaaaa Yes Yes No
bbbbb Yes No Yes

etc.


(if it helps, the columns are protein extraction methods and the letters are protein IDs - we need to determine which proteins are extracted by which methods)


My colleague is doing this by hand, but there is enough data that it would be really helpful to automate it.


Is there a formula in Excel to do this automatically?



Answer



This is not a “turn-key” solution, but if you have thousands of rows, this may save you some effort.  (Do this in a scratch copy of your file, just in case something blows up or melts down, because “Undo” doesn’t always work.)  Note: this procedure was developed for Excel 2007 (but I have re-verified it in Excel 2013).


First, copy all your data into a scratch column; let’s call it V.  Note that you must copy the heading from Column A, or else put some dummy value in cell V1.


Illustration of copying data


Now go to the “Data” tab, “Sort & Filter” group, and click on “Advanced”:


                        “Sort & Filter” group, with “Advanced” highlighted


This will bring up the “Advanced Filter” dialog box:


                  “Advanced Filter” dialog box


Verify that “List range” shows your data in Column V.  Select “Copy to another location” and “Unique records only”.  Type “W1” in the “Copy to” field — or click in the field, and then click in W1 (there are several techniques that will get the same result).  Click on “OK”.  You should get something like this:


                              Data from V copied into W with duplicates removed


i.e., a list of your unique data values.  You may need to sort Column W.


Now enter =NOT(ISNA(VLOOKUP($W2,A$2:A$4,1,FALSE))) in X2 (replace the 4 with the number of the last row that contains data), and drag/fill down to match Column W (i.e., one row for each unique value in your original data) and to the right to Column Z (i.e., the number of columns in your data).


                              table showing TRUE/FALSE for whether each value is in each column


This gives you a truth table corresponding to the second form of the desired result in the question (but with “TRUE” and “FALSE” instead of “Yes” and “No”).  For example,



  • X2 is TRUE because Column A contains “aaaaa”,

  • X3 is TRUE because Column A contains “bbbbb”,

  • Y2 is TRUE because Column B contains “aaaaa”,

  • Y3 is FALSE because Column B does not contain “bbbbb”, etc.


Delete column V, and fix the headings (in Row 1) at your leisure.  If you don’t want to keep Columns A-C in the spreadsheet, then copy Columns W-Z and paste values.




Some explanation on the formula: The formula I have presented above is for use in Column X, which corresponds to Column A.  Since I used $W2, this is an absolute reference to Column W and it will refer to cell Wn when the formula is dragged/filled to row n of any column.  By contrast, A$2:A$4 is an absolute reference to Rows 2 through 4, but a relative reference to Column A.  When the formula is dragged to Column Y, this reference will automatically change to B$2:B$4.  When the formula is dragged to Column Z, this reference will automatically change to C$2:C$4.


No comments:

Post a Comment

Where does Skype save my contact's avatars in Linux?

I'm using Skype on Linux. Where can I find images cached by skype of my contact's avatars? Answer I wanted to get those Skype avat...