Blog post

Sunday, July 23, 2017

Visualising Ancestry DNA matches-Part 6-Busy graph treatments

In the last post we cast an appraising eye over the graphs we made using NodeXL Basic (a product of the ‘Social Media Research Foundation’). In this post, you’ll see some of the features of NodeXL that may help calm a busy graph. Pick and choose from them as appropriate to your tree, research aims and aesthetic preferences.

If you haven’t made a graph yet, see the index to this series for earlier posts.

Display settings

Take it one group at a time

Once you’ve made groups, you can move to the Groups worksheet and enter ‘skip’ in the Visibility column for each group except the one(s) that you’re interested in. Click Refresh, and only the unskipped groups will be shown. You can also view a few groups at a time as I did in the previous post.

Reduce edge opacity

If there are a lot of crossing lines, it might be easier to work with the graph if you reduce the line opacity. You can change the defaults used for the graph, including the edge opacity, via the Graph Options button.

  • Click the Graph Options button
    image
  • Lower the default Edges Opacity – the lower the opacity, the more transparent the line.
    This may not remove as much visual clutter as you want, but if the dots appear to be sitting on a blanket of grey it may help you see some structure.
    image

Swap labels for tooltips

In Part 3 we used the Autofill columns button on the NodeXL ribbon to add labels to the graph. For a busy graph you may prefer to use the same button to clear the labels column and set the tooltip to ‘name’. That way you’ll see the match’s name by hovering over their dot.

Grouping

If your groups don’t break up nicely, try a different clustering algorithm.

  • On the NodeXL Ribbon select Groups, Group by Cluster…
    image
  • Select an option from those presented and click OK. The calculations may take some time for a complex graph.
  • Refresh Graph to apply the new groupings.


image

Clauset-Newman-Moore clustering algorithm

image
Same graph with Wakita-Tsurumi clustering algorithm

Remember that these algorithms were not created with your DNA results in mind! Hopefully one of them will work well with your data – but don’t assume that because it sounds scientific it must be right.
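If you'd like a rough way to compare two groupings of the same graph, here is a minimal sketch. As I understand it, both algorithms named above work by trying to maximise a score called modularity, which rewards groupings where links fall inside groups more often than chance would predict. The tiny graph, the names and the two groupings below are all made up for illustration, and this is not NodeXL's own code.

```python
from collections import Counter

def modularity(edges, group_of):
    """Newman modularity of a grouping on an undirected, unweighted graph."""
    m = len(edges)
    inside = Counter()   # edges with both ends in the same group
    degree = Counter()   # total number of edge ends that land in each group
    for a, b in edges:
        degree[group_of[a]] += 1
        degree[group_of[b]] += 1
        if group_of[a] == group_of[b]:
            inside[group_of[a]] += 1
    return sum(inside[g] / m - (degree[g] / (2 * m)) ** 2 for g in degree)

# Hypothetical graph: two triangles of matches joined by a single link.
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F"), ("C", "D")]

natural = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 2}  # one group per triangle
lumped = dict.fromkeys("ABCDEF", 1)                          # everyone in one group

print(modularity(edges, natural))
print(modularity(edges, lumped))
```

The grouping that follows the two triangles scores higher than lumping everyone together. A higher score only means the grouping suits the link pattern better; it says nothing about whether the groups correspond to real family lines.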

Try a different group box layout – or none at all

  • Different group box layout options are available from the Layout Options dropdown on the graph area toolbar or the main NodeXL ribbon. Choose the bottom item, Layout Options…
    image


image

‘Force-directed’ box layout algorithm used, box edge width 0 (i.e. no line)

Hide intergroup connections

It’s possible to hide all the lines that run between different groups. This instantly cleans up a graph and makes connections within a group easier to see, but does so at the expense of between-group information.

  • Click the Layout options dropdown on the NodeXL Ribbon or the graph area toolbar.
  • Change Intergroup edges to ‘Hide’.

image

image

Intergroup edges hidden

The graph with between-group edges hidden is clean and pretty. It’s easier to see relationships within groups – but relationships between groups are not visible. Again, remember that the grouping algorithms were not designed for your DNA data. Those between-group connections may be the clue that points you in the right direction.

Alternative: ‘Combine’ is an interesting option to try. It will draw a single, thick line between groups that interlink with each other.
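To make the difference between ‘Hide’ and ‘Combine’ concrete, here is a small sketch of the idea. The names, groups and links are invented, and I'm assuming that the thickness of a combined line reflects how many individual links it replaces; this is an illustration, not how NodeXL itself is written.

```python
from collections import Counter

# Hypothetical data: each match's group, and the in-common-with links (edges).
group_of = {"Ann": 1, "Bob": 1, "Cat": 2, "Dan": 2, "Eve": 3}
edges = [("Ann", "Bob"), ("Ann", "Cat"), ("Bob", "Cat"),
         ("Cat", "Dan"), ("Dan", "Eve")]

# 'Hide': keep only the edges whose two ends are in the same group.
within = [(a, b) for a, b in edges if group_of[a] == group_of[b]]

# 'Combine': count the edges between each pair of groups. The count
# stands in for the thickness of the single combined line (an assumption).
between = Counter(
    tuple(sorted((group_of[a], group_of[b])))
    for a, b in edges
    if group_of[a] != group_of[b]
)

print(within)
print(dict(between))
```

In this sample, hiding drops three of the five links, while combining keeps a trace of them: groups 1 and 2 share two links and groups 2 and 3 share one.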

Removing relatives

Skipping close relatives

When we created the Additional Input file we added the word ‘Skip’ to the Visibility column for you and your very close family. The ‘Skip’ direction tells NodeXL not to include that person in the graph, or in the clustering calculations.

It may be helpful to ‘Skip’ some more of your close relatives, especially if PC performance is an issue. Take care though – skipping a relative means the graph loses information.

  • Your closest relatives are probably at the top of the Vertices worksheet. If not, sort the sharedCM column from largest to smallest using the dropdown. Your closest relatives will move to the top of the list.
    image
  • When you click on the row for a match, the dot that represents that person, and all the lines representing their relationships, will be highlighted in red. This will give you a sense of how widely spread their linkages are, and how much clutter will be cleared (or information lost) by skipping them.

There’s no magic number for the relationship distance or number of links that should be the threshold for skipping people. If I had an aunt and a second cousin who had the same number of links, I might skip the aunt, since theoretically her links are spread over half my tree. I would be much more likely to leave in the second cousin whose matches theoretically sit in a quarter of my tree.
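If you'd rather see numbers than red lines, the same judgement can be sketched outside NodeXL by counting how many links touch each match. The names, cM values and links below are invented sample data, not output from any real file.

```python
from collections import Counter

# Hypothetical sample: sharedCM per match, and in-common-with links.
shared_cm = {"Aunt": 1700, "Second cousin": 230, "Cat": 40, "Dan": 25, "Eve": 12}
edges = [
    ("Aunt", "Second cousin"), ("Aunt", "Cat"), ("Aunt", "Dan"),
    ("Second cousin", "Cat"), ("Dan", "Eve"),
]

# Count how many links touch each match.
links = Counter()
for a, b in edges:
    links[a] += 1
    links[b] += 1

# List matches from closest to most distant, with their link counts.
for name in sorted(shared_cm, key=shared_cm.get, reverse=True):
    print(f"{name}: {shared_cm[name]} cM, {links[name]} links")

# The links that would remain on the graph if the aunt were skipped.
remaining = [edge for edge in edges if "Aunt" not in edge]
print(remaining)
```

Here skipping the aunt removes three of the five links, which is both the clutter cleared and the information lost.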

As many readers have realised, you can ‘Skip’ people manually by entering ‘Skip’ in the Visibility column. However, I suggest that you also add the new ‘skip’ line to the Additional Input file as explained in Part 2 (note that the directions on this point have been revised since first posting). If something goes wrong, troubleshooting a large file with complex relationships can be difficult. Keeping the information in a smaller external file makes it easier to remember what you’ve done, and allows you to reload or start again if necessary.

Note: After skipping people you might want to rerun your preferred grouping algorithm and refresh the graph.

Skipping children of known matches

It’s not a quick fix, but another category of person that you may want to ‘Skip’ is anyone who is known to be the child of another match. If they are only connected to you on the matching parent’s side you can safely ‘skip’ the child as they, at best, duplicate the parent’s relationship information. Take care that the relationship really is parent-child and not niece or nephew – the information visible to you may look the same in those cases.
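One way to sanity-check a candidate before skipping them is to see whether they have any links that the parent doesn't. This is only a sketch with made-up names and links; it can't confirm that the relationship really is parent-child.

```python
# Hypothetical in-common-with links, stored as a set of neighbours per match.
neighbours = {
    "Kim": {"Lee", "Max", "Cat", "Dan", "Eve"},  # the parent
    "Lee": {"Kim", "Cat", "Dan"},                # child with no links of their own
    "Max": {"Kim", "Cat", "Zoe"},                # child who also links to Zoe
}

def extra_links(child, parent):
    """Return the child's links that the parent does not share."""
    return neighbours[child] - neighbours[parent] - {parent}

print(extra_links("Lee", "Kim"))
print(extra_links("Max", "Kim"))
```

An empty result suggests the child only repeats the parent's information. Anything left over (Zoe, in this sample) is a link you would lose by skipping, perhaps one that comes through the child's other parent.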

Again, I suggest that you at least keep a record of these ‘skips’ outside your main file - the Additional Input file is made for this! If something goes wrong with the graph file, will you really want to track down those relationships again?

Filtering

Dynamic Filters allow you to hide your most distant and/or closest relatives.

  • Select Dynamic Filters
    image
  • Scroll down or expand the window to find the sharedCM slider
    image
  • As you adjust the slider’s lower value, your most distant cousins will disappear.
  • Adjusting the slider’s upper value will hide your closest cousins (you may need to slide it down a long way).
  • If you want to still see the filtered information, but make it less prominent, adjust the filter opacity to your liking.

image

Wakita-Tsurumi grouping with matches below 15 cM filtered out
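The slider is doing something like the range check sketched below: a match stays visible only if its sharedCM falls between the lower and upper values, and a line stays visible only if both of its ends do. The names and numbers are invented, and unlike this sketch the Dynamic Filters only hide things; nothing is deleted from the workbook.

```python
# Hypothetical sample of matches and their shared centimorgans.
shared_cm = {"Aunt": 1700, "Cat": 40, "Dan": 25, "Eve": 12, "Fay": 8}
edges = [("Aunt", "Cat"), ("Cat", "Dan"), ("Dan", "Eve"), ("Eve", "Fay")]

def filter_by_cm(lower, upper):
    """Keep matches within [lower, upper] cM and the edges between them."""
    kept = {name for name, cm in shared_cm.items() if lower <= cm <= upper}
    kept_edges = [(a, b) for a, b in edges if a in kept and b in kept]
    return kept, kept_edges

# Lower value 15 hides the distant cousins; upper value 1000 hides the aunt.
kept, kept_edges = filter_by_cm(15, 1000)
print(sorted(kept))
print(kept_edges)
```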

Excluding matches

If you have a very large number of matches you may decide not to work with distant cousins at all. In this case you could enter ‘Skip’ next to each one, or you could save some time when downloading by using the Filter: 4th Cousin option in the DNAGedcom client.

Deleting smaller matches from the match list, whether before or after importing to NodeXL, won’t help. Matches listed in the in-common-with file will still be included in the graph; you just won’t know who they are!
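Here is a tiny sketch of why that happens. I'm assuming, for illustration only, that the match list maps an ID to a name and that the in-common-with file holds pairs of IDs; the IDs and names are made up and the real files have more columns.

```python
# Hypothetical sample: match list (ID -> name) and in-common-with pairs of IDs.
match_list = {"m1": "Cat", "m2": "Dan", "m3": "Eve"}
icw_pairs = [("m1", "m2"), ("m2", "m3"), ("m1", "m3")]

# Delete a distant match from the match list only.
del match_list["m3"]

# The dots on the graph come from the in-common-with pairs, so m3 is still there.
vertices = {v for pair in icw_pairs for v in pair}

# Dots that no longer have a name to go with them.
unnamed = sorted(v for v in vertices if v not in match_list)
print(unnamed)
```

The deleted match is still drawn, along with both of its links, but there is no longer a name to show for it.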

If you want to go a bit past fourth cousins, but not all the way to those speculative distant matches, filtering or skipping may be a better option than excluding entirely.

Excel tips:
1) If you copy a cell then select multiple cells and paste, the paste value (e.g. ‘Skip’) will be entered into all of the selected cells.
2) Double click on the square at the bottom right corner of a cell to copy it down the page automatically to the next filled box, or the end of the table, whichever comes first. It can be a bit fiddly to get the right spot – the cursor should change into a black plus sign + without any arrows.
image

DNAGedcom Note:
There are two versions of the DNAGedcom client being used at present. Version 2 is necessary if you have FTDNA matches, but it doesn’t have the filter option for Ancestry DNA matches (I’m told the option will be reinstated in future). The version linked to in the first post of this series does have the option.

Coming up

In the next post, we’re going to extract more information from the files we already have.

1 comment:

  1. I'm using another approach to busy graph syndrome while I learn the toolset. I have something like 35,000 vertices and 45,000 edge rows. All of the former are my personal DNA matches. The latter are all matches between my matches - I have no edge rows where I map to someone. Since I map literally to everyone, I started out with myself out of the graph (maybe everyone does that). My first instinct was to add 35,000 more edge rows for my own matches, but I decided not to do that and I think it was a good decision so far.
    I have some specific mysteries I'm trying to resolve with DNA matching. So I start by ensuring each edge row has both the name of the ICW match and the ICW admin. Say I'm studying my known Corey cousins with an eye to finding the parents of our common ancestor John Corey who md Eve Britten. Eve can be excluded from the study because I have known cousin matches with her via her siblings.
    I start by filtering on the ICW admin name, removing all choices so there are no edge rows in play. I then select the ICW Admin person who is a known Corey cousin and all her edge rows. I graph and take a look around. That leads me to the next Corey cousin with lots of shared DNA, and I select that person's ICW Admin rows. I repeat this process, regraphing each time, until I either see something new I haven't realized before or the graphs get so cluttered I stop to poke around. I may use scaling or dynamic filters to help with "busy graph" issues. But I don't need to paint the entire set of groups across the 35,000 people in advance and then start removing some. I start small and build up to the point where the people selected are maximally useful. / Tom
