Blog post

Friday, July 28, 2017

Visualising Ancestry DNA matches-Part 7-Adding shared admin lines

I’ve loved seeing the comments on this blog, and posts on Facebook, describing success with these methods. Thank you for the positive feedback, and congratulations on your finds! We’re not finished yet…

If you're new to this series, the index will steer you through the previous posts.

In this post we’re going to squeeze more information from the match list. I’m going to show you how to quickly and easily see groups of kits that share the same administrator. I’m aware that Ancestry has recently made changes and in future each new adult’s kit will be registered in a separate account. I don’t know what this means for ‘admin’ data – but for now we have the information so let’s make the most of it.

The potential benefits of linking people with the same administrator are:

  • Identify clusters of closely related people within a busy graph.
  • Add relationship lines between distant (to you) matches who are closely related to each other. These connections may improve clustering calculations on a busy graph that uses distant cousins.
  • Add additional distant matches (who are not related to a fourth or closer cousin) to the graph.

Below is one of my groups. The newly created/identified edge lines are highlighted in red. I’ve had some success in asking kit administrators about the common ancestor of matches whose kits they manage.

image

Assumptions

The assumptions that we make matter. We need to be aware of the assumptions we’re making, because a wrong assumption can lead to a wrong interpretation. In this post, we’re assuming:

  • Each instance of the same administrator name is the same person.
  • All of our matches who are managed by the same administrator are related to each other.

These seem to be reasonable assumptions for my relatively sparse matches. As I investigate the groupings revealed, I can ‘skip’ lines if I think they’re not appropriate. So far I haven’t had to. This may not be the case for your kit – take due care.

Moving on – how to do this!

Add/identify shared matches with the same administrator

Once again, a few point and clicks on the right menus, and the job is done. There aren’t too many steps.

  • Click the Graph Metrics button on the NodeXL Basic ribbon.image
  • Clear the Overall graph metrics check box (it doesn’t matter if you don’t, but we’re not using them)
  • Tick the Edge creation by shared content similarity box
  • Select the Options… button
    image
  • An options box should appear. Select admin from the Analyze the contents of this column dropdown box
  • Set the Strength threshold for edge creation to 100% (we only want exact admin name matches)
    image
  • Click OK to accept the Edge Creation Metrics settings you have entered.
  • Click Calculate Metrics on the Graph Metrics dialog to start processing.

The new edges will take some time to process.

View the shared admin links

When processing finishes, Refresh the graph to apply the changes.

To see the new lines, move to the Edges worksheet. You will see a new column titled Shared Content. The newly created edges will be at the bottom of the sheet, with the relevant administrator’s name in the Shared Content column. Select all the new lines and you’ll see them highlighted in red on the graph.

If you have a graph with a lot of linkages between groups make sure that the between group links are set to show. If there are highlighted lines running between groups (and you think the assumptions we have made about administrators hold) this suggests that the clustering of matches could be improved. You may get a better result if you rerun your preferred grouping algorithm now.

Colour the new lines

The colouring instructions below are a quick fix. There are different ways to apply colour and we’ll do more with colour in a later post.

For now, highlight the rows with entries in the Shared Content column, then:

  • Right click any of the highlighted lines on the chart to access the right click menu.
    This can be a bit tricky. If you click a dot all the lines connected to that match will also be selected. Whoops! We don’t want that. If it happens, go back a step. Highlight the rows on the edges sheet, and try again.
  • Click Edit Selected Edge Properties…
    image
  • Select the colour you prefer and click OK.
    image

You may not be able to see the colour on the graph at first. Duplicate lines in the standard grey will be sitting on top of them. This is easily fixed – just sort the Shared Content column from Z to A so that the new entries move to the top of the page. Refresh the graph.

Remove unwanted lines

Skipping

If you administer kits for cousins from different branches of your family then new, incorrect lines will have been added. These can be dealt with by finding your name in the Shared Content column and ‘Skipping’ the offending lines (enter ‘Skip’ in the Visibility column on the Edge worksheet). Deleting the edge line entirely will also work. You will need to delete the lines again each time you recreate the shared admin links.

Alternative:
You can use a formula to specify the lines that should be skipped. The template uses Excel tables, which have special properties. If the Visibility column is all clear and you enter a formula it will automatically be entered into every row including new rows that are added later. No updating required.

You might have already noticed that some cells have a red triangle in the corner. When you hover over these cells a comment box will appear. The comment boxes contain useful information about use of each column and what the possible values mean. image

Taking a simple case where “YOURNAME” is the only value in the Shared Content column that you want to skip, a formula that will do the job is:

=IF([@[Shared Content]]="YOURNAME",0,1)

This formula tells Excel that if the value in the Shared Content column is ‘YOURNAME’ the value should be ‘0’ (which we can see from the comment box means ‘Skip’). Otherwise, the value is ‘1’ (which means ‘Show’).

Deleting

If you find that the shared admin lines are not suitable for your situation at all, simply delete the lines entirely. You won’t need the now empty Shared Content column – it can also be deleted.

Excel tip:
To remove the lines select any cell(s) in the row(s) you want to remove. On the Home ribbon click Delete, Delete Sheet Rows. This won’t work if you have filtered the table to find the rows.  

image

Retain wanted information

When you ‘count and merge duplicate edges’ the first instance of an edge (starting from the top) is kept. Duplicates further down the sheet will be deleted – even if they add information such as Shared Content or 'skip’ instructions.

To make sure you retain the new admin lines when removing duplicates send them to the top of the worksheet.

  • Sort the Shared Content column from largest to smallest, then
  • Sort the Visibility column so that skip instructions are at the top
    • If using words in the Visibility column, sort from largest to smallest
    • If using a formula that results in a number, sort from smallest to largest

Then remove duplicates as usual.

Note:
You can also use the Shared Content column (or any other column) in addition to the vertices to determine if two edges match. This is useful to tell the difference between relationships from the ICW data, and relationships that were created only through having a shared administrator.
image

Coming up….

In the next post, we’ll supplement the graph with known ancestry information.



Sunday, July 23, 2017

Visualising Ancestry DNA matches-Part 6-Busy graph treatments

In the last post we cast an appraising eye over the graphs we made using NodeXL Basic (a product of the ‘Social Media Research Foundation’). In this post, you’ll see some of the features of that may help calm a busy graph. Pick and choose from them as appropriate to you tree, research aims and aesthetic preferences.

If you haven’t made a graph yet, see the index to this series for earlier posts.

Display settings

Take it one group at a time

Once you’ve made groups, you can move to the Groups worksheet and enter ‘skip’ in the Visibility column for each group except the one(s) that you’re interested in. Click Refresh, and only the unskipped groups will be shown. You can also view a few groups at a time as I did in the previous post.

Reduce edge opacity

If there are a lot of crossing lines it might be easier to work with the graph if you reduce the line opacity. You can change the defaults used for the graph, including the edge opacity via the Graph Options button.

  • Click the Graph Options button
    image
  • Lower the default Edges Opacity – the lower the opacity, the more transparent the line.
    This may not remove as much visual clutter as you want, but if the dots appear to be sitting on a blanket of grey it may help you see some structure.
    image

Swap labels for tooltips

In Part 3 we used the Autofill columns button on the NodeXL ribbon to add labels to the graph. For a busy graph you may prefer to use same button to clear the labels column and set the tooltip to ‘name’. That way you’ll see the match’s name by hovering over their dot.

Grouping

If your groups don’t break up nicely, try a different clustering algorithm.

  • On the NodeXL Ribbon select Groups, Group by Cluster…
    image
  • Select an option from those presented and click OK. The calculations may take some time for a complex graph.
  • Refresh Graph to apply the new groupings.


image

Clauset-Newman-Moore clustering algorithm

image
Same graph with Wakita-Tsurumi clustering algorithm

Remember that these algorithms were not created with your DNA results in mind! Hopefully one of them will work well with your data – but don’t assume that because it sounds scientific it must be right.

Try a different group box layout – or none at all

  • Different group box layout options are available under ‘Layout Options’ on the graph area or main NodeXL ribbon, bottom item Layout Options…
    image


image

‘Force-directed’ box layout algorithm used, box edge width 0 (I.e. no line)

Hide intergroup connections

It’s possible to hide all the the lines that run between different groups. This instantly cleans up a graph and makes connections within a group easier to see, but does so at the expense of between-group information.

  • Click the Layout options dropdown on the NodeXL Ribbon or the graph area toolbar.
  • Change Intergroup edges to ‘Hide’.

image

image

Intragroup edges hidden

The graph with between group edges hidden is clean and pretty. It’s easier to see relationships within groups – but relationships between groups are not visible. Again, that reminder that the grouping algorithms were not designed for your DNA data. Those between group connections may be the clue that points you in the right direction.

Alternative: ‘Combine’ is an interesting option to try. It will draw a single, thick line between groups that interlink with each other.

Removing relatives

Skipping close relatives

When we created the Additional Input file we added the word ‘Skip’ to the Visibility column for you and your very close family. The ‘Skip’ direction tells NodeXL not to include that person in the graph, or in the clustering calculations.

It may be helpful to ‘Skip’ some more of your close relatives, especially if PC performance is an issue. Take care though – skipping a relative means the graph loses information.

  • Your closest relatives are probably at the top of the the Vertices worksheet. If not, sort the sharedCM column from largest to smallest using the dropdown. Your closest relatives will move to the top of the list.
    image
  • When you click on the row for a match, the dot that represents that person, and all the lines representing their relationships, will be highlighted in red. This will give you a sense of how widely spread their linkages are, and how much clutter will be cleared (or information lost) by skipping them.

There’s no magic number for the relationship distance or number of links that should be the threshold for skipping people. If I had an aunt and a second cousin who had the same number of links, I might skip the aunt, since theoretically her links are spread over half my tree. I would be much more likely to leave in the second cousin whose matches theoretically sit in a quarter of my tree.

As many readers have realised, you can ‘Skip’ people manually by entering ‘Skip’ in the Visibility column. However, I suggest that you also add the new ‘skip’ line to the Additional Input file as explained in Part 2 (note that the directions on this point have been revised since first posting). If something goes wrong, troubleshooting a large file with complex relationships can be difficult. Keeping the information in a smaller external file makes it easier remember what you’ve done, and allows you to reload or start again if necessary.

Note: After skipping people you might want to rerun your preferred grouping algorithm and refresh the graph.

Skipping children of known matches

It’s not a quick fix, but another category of person that you may want to ‘Skip’ is anyone who is known to be the child of another match. If they are only connected to you on the matching parent’s side you can safely ‘skip’ the child as they, at best, duplicate the parent’s relationship information. Take care that the relationship really is parent-child and not niece or nephew – the information visible to you may look the same in those cases.

Again, I suggest that you at least keep a record of these ‘skips’ outside your main file - the Additional Input file is made for this! If something goes wrong with the graph file, will you really want to track down those relationships again?

Filtering

Dynamic Filters allow you to hide your most distant and/or closest relatives.

  • Select Dynamic Filters
    image
  • Scroll down or expand the window to find the sharedCM slider
    image
  • As you adjust the slider’s lower value, your most distant cousins will disappear.
  • Adjusting the slider’s upper value will hide your closest cousins (you may need to slide it down a long way).
  • If you want to still see the filtered information, but make it less prominent, adjust the filter opacity to your liking.image

Wakita-Tsurumi grouping with matches below 15CM filtered out

Excluding matches

If you have a very large number of matches you may decide not to work with distant cousins at all. In this case you could enter ‘Skip’ next to each one, or you could save some time when downloading by using the Filter: 4th Cousin option in the DNAGedcom client.

Deleting smaller matches from the match list, whether before or after importing to NodeXL, won’t help. Matches listed in the in-common-with file will still be included in the graph, you just won’t know who they are!

If you want to go a bit past fourth cousins, but not all the way to those speculative distant matches, filtering or skipping may be a better option than excluding entirely.

Excel tips:
1) If you copy a cell then select multiple cells and paste, the paste value (e.g. ‘Skip’) will be entered into all of the selected cells.
2) Double click on the square at the bottom right corner of a cell to copy it down the page automatically to the next filled box, or the end of the table whichever comes first. It can be a bit fiddly to get the right spot – the curser should change into a black plus sign + without any arrows.
image

DNAGedcom Note:
There are two versions of the DNAGedcom client being used at present. Version 2 is necessary if you have FTDNA matches, but it doesn’t have the filter option for Ancestry DNA matches (I’m told the option will be reinstated in future). The version linked to in the first post of this series does have the option.

Coming up

In the next post, we’re going to extract more information from the files we already have.

Wednesday, July 19, 2017

Visualising Ancestry DNA matches-Part 5-Busy graph diagnosis

Our computers have to crunch a lot of numbers to make up these graphs. Even more so for a busy one. If you haven’t cleared duplicate relationships since you last loaded data (or ever!) head back to Post 4 and do this step now.

In writing these posts I’ve tried to choose a path that will be both useful and accessible to as many people as possible. The options I’ve chosen and methods I’ve used may not be the ones that work best for you. The choices you make should be driven by the nature of your tree and your research goals.

Before you try and clean up a busy graph, you need to understand why it’s busy. You’ll have a big head start on both understanding the relationships it shows, and what you would lose or gain by removing certain elements from the graph.

Sincere thanks to Blaine Bettinger, Joan Hanlon and Richard Rubin who allowed me to use their data to test the suggestions in this post. Thank you also to the several other people who offered me their data for the same purpose.

Why is the graph busy?

I have used one of Joan’s kits for demonstration purposes. The kit has 469 fourth cousin or closer matches. Below is the point reached, having followed the steps in earlier parts of this series.

image

The start point

Let’s take a closer look.

Distant relatives

Many of the groups have a cluster of interconnected closer relatives with a fringe of distant relatives. You can see the fringes quite clearly on the left of this group.

image

A group with a fringe of distant relatives.

Group interconnections

There’s a network of between-group streaks across the graph. In areas where these are thicker, it’s hard to tell where the streaks start and end. They obscure the relationships within groupings. We can work out where those linkages are coming from.

Interconnections from close cousins

As I click on each row of the Vertices worksheet, that person’s relationship lines are highlighted in red. I can see that some of the strong streaks between groups are due to a small number of closer relatives.

Connections from a third cousin are highlighted in the image below. This is gold! If Joan knows how that third cousin is related to the focus person, it will suggest what part of the tree those two groups are connected to. It works the other way around too. Clues from those two groups could lead to discovering how the predicted third cousin fits in.

image_thumb3

This third cousin has strong connections to two other groups

The number of lines from even a few first or second cousins, who will probably match with multiple people in other groups, may be enough to obscure what is going on in a graph.

Group interconnections – other linkages

If I move to the Groups worksheet, I can now click on each group in turn. All the people in the group and each of their relationships with other people, will be highlighted. The group below has linkages spread out to many other groups.

Remember that the relationship between two of your DNA matches may have nothing to do with your tree. It’s very likely that some of your DNA matches will be related to each other on other lines. While there are slightly more connections to the dark blue group, there’s nothing here that screams of a strong relationship between groups.   

image_thumb7[1]

On the other hand, the group at the top has multiple connections to the green group in the lower left hand corner (see below). It’s easy to imagine that the division of people between those two groups could change with a few new cousins added, or a slightly different grouping algorithm.

image

A prolific branch?

On the Groups worksheet I marked the Visibility column for all of the groups, except the two mentioned above, with the word ‘skip’. I Refreshed the graph and adjusted the Scale slider to see this:

image

It’s a bit hard to show it here – you’ll have to take my word for it – but it appears that there are perhaps half a dozen people at the furthest end of the the fourth cousin range with multiple connections to both groups. I found it interesting that the strong connection wasn’t driven by closer cousins. The group on the right also are more distantly related to the focus person, on average, than the group on the left. Perhaps a more distant, but prolific, line of descent from the same branch? Only research will tell.

Endogamy

If you are from an endogamous population, and your computer survives the journey to making a graph, you will find yourself with dots on a solid mat of grey. The following graph is from a person known to have some endogamy. Only fourth cousins and closer have been included in the graph. Almost any vertex I click on connects to multiple groups – I’m not at all sure that the groupings are meaningful in this case. In the bottom right hand corner, a few clusters from the less endogamous portions of the tree peek out.

image

If you looked at graphs in earlier post, you’ll know this is far removed from my own. I take back any complaint I may have made about not having enough matches!

Cleaning the clutter

So we’ve rummaged around in the clutter and found some items that should go, and a few we’d like to keep.

What next?

The next post



Tuesday, July 18, 2017

Researching Abroad Roadshow is coming to Canberra

Coming up next month (August 2017) is a fantastic opportunity for Australians with British or European ancestry. I’m talking about Unlock the Past’s ‘Genealogy Roadshow’ on the subject Researching Abroad: Finding British Isles and European Ancestors.

The roadshow features two very well regarded international speakers – Chris Paton from Scotland and Dirk Weissleder from Germany. If you haven’t been to a genealogy event before, this would be a great one to start with.

Researching abroad: 8-26 August. Brisbane, Auckland, Sydney, Melbourne, Canberra, Adelaide, Perth


I’ve followed Chris Paton’s blog for years. When I learned that he was going to be speaking in Canberra, I jumped at the chance to attend. I’ve had this in my calendar since February!

More recently, I signed up as a Roadshow Ambassador. I’m more than happy to support the Roadshow’s success as I would love to see more events like it. Tickets are available on the Unlock the Past site (scroll down to the city list). There’s a small discount for pre-booking, and by pre-booking you’ll also be entered into a rather substantial prize draw of genealogy goodies.



Disclosure: In return for acting as a Roadshow Ambassador I have received free entry to the event.

Saturday, July 15, 2017

Visualising Ancestry DNA matches–Part 4-Updating

This is the fourth part of a series about using a free Excel template, NodeXL Basic, to visualise Ancestry DNA match lists. For previous posts, see the index to the series.

So you’ve made a chart and it’s just the way you want it. The next thing you know you have brand new interesting DNA matches and you’re dying to know how they fit in. Once you get used to it, creating a new graph is quite quick and easy. Or…. you could save a little time and upload more information into the same chart.

How new data is treated

Before you load more information into an existing chart it’s useful to understand what will happen.

Matches

  • New matches are appended to the end of the Vertices sheet.
  • Data for existing matches is overwritten (where the columns have the same name).
  • Columns that were not present before are added.
  • Columns that were present before that are not re-imported are unaffected. The existing data will still be there. New matches will have no values in that column.
  • Matches who don’t appear in the new data are unaffected – you won’t lose them.

Relationships

  • All the relationships in the new file will be added to the end of the list on the Edges sheet.
  • This will duplicate existing relationships. Clean up is necessary.

This has some exciting implications for the kinds of things we will be able to do with the charts. I’m dying to get up to the posts where I’ll show them to you! One thing at a time…

Clearing duplicates

To make things easier on your computer, you should clear away duplicate relationship entries (edges) each time you reload data into an existing worksheet:

  • On the NodeXL ribbon select Prepare Data, Count and Merge Duplicate Edgesimage
  • Clear the Count duplicate edges box
  • Tick the Merge duplicate edges box
  • Select Vertex 1 and Vertex 2 as the columns that contain duplicates.
    image
  • Click OK.

If you have a long in-common-with list you should take this step now, even if you haven’t reimported data.

When you first import data, one ‘Vertex’ row is created for each person, and one ‘Edge’ for each relationship. For fourth and closer cousins who match each other, two relationships are recorded: ‘A matches B’ and ‘B matches  A’. Excel’s Remove duplicates button won’t help because it doesn’t recognise these pairs of reciprocal relationships as duplicates. Each duplicated relationship adds a little to the processing load for your computer.

When you clear duplicates, the first duplicate entry starting from the top is kept and any others discarded. Keep this in mind if you use any of the other edge columns – you may want to choose “Vertex 1, Vertex 2 and this column”, or to sort the edges before clean up so that the ones you want to retain are at the top.

Updating data – the process

To update a worksheet with fresh data, use the following steps.

Save a copy first in case something goes wrong.

  1. Import new in-common-with file – see Part 2
  2. Import new match file – see Part 2
  3. Clear duplicates – see above.
  4. Recreate groups – see Part 2
  5. Autofill columns – see Part 3

If the autofill process hasn’t caused your graph to refresh, then Refresh Graph now.

Note: If one or more of your new matches has a very close relationship to you and should be ‘skipped’, add them to the Additional Input file and reload that as well (see Part 2 for a reminder how to load it). Do so any time before step 3 above. Alternatively you can enter ‘Skip’ manually on the Vertices sheet.

Coming up…

I know a lot of people are interested in ways to work with a busy chart – that post will be next, I promise!

Monday, July 10, 2017

Visualising Ancestry DNA matches–Index

This series of posts describes how you can use a free Excel template (Windows only, I’m afraid) to draw a network graph of your Ancestry DNA matches.

Published posts

Part 1: Getting ready

  • NodeXL
  • DNAGedcom

Part 2: Loading files the first time

  • Additional input file (‘skip’ close relatives)
  • Load in-common-with
  • Load matches
  • Load additional input
  • Identify groups

Part 3: Navigation and presentation

  • Hiding columns
  • Zooming in
  • Changing layouts
  • Dot sizes and labels
  • Scale the features

Part 4: Updating

  • How new data is treated
  • Clearing duplicates
  • Updating – the process

Part 5: Busy graph diagnosis

Part 6: Busy graph treatments

Part 7: Adding shared administrator lines

Part 8: Adding known ancestors

Part 9: Combining kits

  • Loading more kits into the same graph
  • Using a formula to adjust dot sizes

Part 10: Colour coding

  • Show colour per vertex
  • Colour one node at a time
  • Apply colour codes in bulk
  • Colour coding examples – Branch, Side, Place, Heat map

Saturday, July 8, 2017

Visualising Ancestry DNA matches–First aid for complicated graphs

If you have tried the instructions in this series and found that no matter how carefully you followed the instructions, all you had was a big blob, it’s probably because your very close relatives have tested. Lucky you!

I will be providing a post with ideas to help with complicated charts including those with many family relationships, but in the meantime you can apply some simple first aid.

  • Add an extra line with the ‘Skip’ instruction to the Additional Input file for each parent, child or sibling who has also taken a test.

Post 2 has been updated to reflect this instruction.

Friday, July 7, 2017

Visualising Ancestry DNA matches-Part 3-Navigation and Presentation

This is the third post in a series about Visualising Ancestry DNA Matches. In previous posts we got ready, and loaded the files. In this post I’ll show you how to get around your graph, and provide some options to adjust the appearance of the chart so that it can be more easily understood.

To begin, open up the file you saved at the end of the last post. You won’t see the graph you created.

Don’t Panic.

Click Show Graph and your work will reappear with your last settings intact.

image

Navigation

Jump to a person

I suggested previously that you should move back to the Vertices worksheet and try clicking on some dots. If you haven’t done so, try it now. You’ll find that when you click on a dot the appropriate line on the Vertices worksheet is highlighted. If you scroll right on that worksheet you will see the person’s name, kit administrator and shared cM. You will also find that the ‘matchURL’ field contains a clickable hyperlink to your match page with that person. Very handy!

It works the other way as well. If you select a line or lines on the worksheet, the corresponding dot (or dots) will highlight in red. Note: You may have to click a few times to see this. Only a small proportion of all your matches are on the graph as it only displays people who have shared match information.

Hide excess columns

Scrolling to the right every time can be a bit annoying. We can quickly hide some of those excess columns. On the NodeXL Ribbon, click the Workbook Columns button. Here you hide and show the ‘Visual Properties’ and ‘Labels’ columns if you wish.

image

Take a closer look

The controls that will help you get around the chart itself are at the top of the graph display area.

image

  • The Arrow button allow you to make selections on the chart.
  • The + and – Magnifying glasses will zoom the image in or out, as will the Zoom slider
  • When you are zoomed in, the Hand button will let you move to different parts of the graph. If you can’t select dots, it’s probably because you’ve left the hand button active.
  • The Scale slider leaves the graph the same size, but will make everything on it (dots, line width, labels) smaller.
  • Notice most buttons have usage tips that will appear when you hover over them.

Have a play with the controls.

Presentation

NodeXL allows for a lot of customisation. We’re going to give our graphs a makeover! We’re going to try on different layouts, emphasise our closer cousins and accessorise with carefully chosen labels. By the time we’re finished those frumpy scribbles will be elegant figures wearing designer labels.

We’re aiming for before and after shots something like this:

imageimage

Layouts

So far we’ve stuck with the default layout algorithm. There are other layouts to choose from. When I first tried NodeXL I was suffering from a bad case of DNA circle envy, so I choose circle layouts. They worked well with small groups of matches. Since then I’ve acquired more matches and have settled on the ‘Harel-Koren Fast Multiscale’ option (used in the ‘after’ image above).

Layout options are available on the graph area toolbar and on the NodeXL Ribbon.

image

  • Select a layout option from the drop down list.
  • Each time you select a different layout option NodeXL forgets that you want to keep your groups in separate boxes. Remind it by opening up Layout options… (same menu, bottom item). It seems to retain the options you last set, so just click OK.
  • To apply the new layout, click Lay Out Again.

Go ahead and try different layouts out until you find one that works well with your data.

Dot size and labels

I’ve adjusted the dot sizes on my charts to correspond with the sharedCM value – bigger dots are closer cousins. I’ve also applied labels so I can see who is who without moving back to the vertices worksheet. When I hover over a dot, a tooltip appears with whatever note I had entered on the person’s Ancestry match page at the time I downloaded the file.

All this can be done very easily using options found under just one button.

Click the Autofill Columns button on the NodeXL Ribbon.

image

The dialog below will appear. This dialog will write values in the ‘Visual Properties’ columns and ‘Labels’ column based on the columns you choose.

  • Set Vertex Label to ‘name’
  • Set Vertex Tooltip to ‘note’
  • Set Vertex Size to ‘sharedCM’ – then click on the Options button on the right.

image

The Vertex Size options let you decide how big or small the dot representing each person should be based on numerical values in the column you select.

The settings shown below worked for well me. You may be quite happy to leave the smallest number as “The smallest number in the column”. I increased the number to 10 so that I could tell the difference between my closer cousins and everyone else more easily. The number 30 worked well for me as the upper limit (anyone with shared CM of 30 or more will be drawn at the maximum size). Experiment and see what works for you.

To get out and apply the settings:

  • Click OK on the ‘Vertex Size Options’ box
  • Click Autofill on the ‘Autofill Columns’ box.
    The information will be written into the appropriate columns and the settings applied immediately.
  • Click Close on the ‘Autofill Columns’ box

image

Scale the features

By now you should have graphs that look something like this:

image

It’s a bit cluttered and hard to see what’s going on. Use the Scale slider to adjust the dots and labels to suit the Zoom level you are using.

image

Here’s a closer look at the same group with a Zoom of 200 and a Scale of 40.

image

Don’t forget to save!

What can we do with this?

These graphs show DNA matching relationships in the Ancestry DNA data.

  • Each dot represents a person on your DNA match list.
  • The bigger the dot, the more shared DNA they have with you.
  • Each line represents a relationship between two people who are estimated to be fourth cousins or closer to each other (at least one of the two people must be estimated fourth cousin or closer to you).

When we look at it this way, we can see linkages that are not visible on the Ancestry DNA shared match pages. I can think of dozens of scenarios where this sort of information could lead to valuable clues.

For example:

  • On ‘Cousin K’s’ shared match page, I can see ‘Cousin O’ and ‘Cousin I’.
  • I don’t see ‘Cousin S’ or ‘Cousin T’ who are distantly related to me, but more closely related to Cousin K.
  • ‘Cousin S’ is a (estimated) distant relative to me, but must be a fourth cousin or closer to both ‘Cousin K’ and ‘Cousin I’ for the connecting lines to show.

Suppose the key to my connection with fourth Cousins ‘K’ and ‘I’ happens to lie with Cousin ‘S’? If Cousin ‘S’ doesn’t have a public tree linked to their DNA kit no amount of searching for names or places will find them. As I have thousands of DNA matches on Ancestry, I’m unlikely to make my way all the way to their page which will be well back in my results – let alone contact them if I have nothing else to go on.

Whether you’re taking a paper trail or a segment matching approach to your DNA matches, it helps to know which of your thousands of matches might be relevant to a particular problem.

Now that I’ve visualised the relationships this way, I know that Cousin S exists and that it could be worthwhile contacting them.

Sunday, July 2, 2017

Visualising Ancestry DNA matches-Part 2-Loading files the first time

This post is part two of a series.

In the first post I showed you the files and software you can use to visualise Ancestry DNA matches. Today we’re going to load  the match and in-common-with files you downloaded using the DNAGedcom client, and have our first look at a graph.

Getting set up and loading the files is not difficult, but there are a lot of steps to follow and details to note. I’ve suggested some check points at which you should save your progress.  If you miss a detail you won’t have to start from the beginning. Just reopen the file and resume from the last save point.

The first time you try this, give yourself at least forty five minutes at a time when you feel ready to concentrate.

It’s much quicker when you get used to it. The entire process described below takes me less than five minutes.

Thank you to my husband and to Aillin O’Brien who tested these instructions and provided invaluable feedback.

An index to this series of posts is available here.

Preparation: Set up a Spreadsheet for Additional Input

There’s one final step of preparation before we load the data.

If we load the information as it is, the chart will show connections between the test taker (I’ll call that person “you”) and every one of their matches. All you will see is a mass of dots. It’s also likely to tie up your computer while it thinks about all those lines it has to draw. I’ve made this mistake more than once... The graph appears eventually, but it isn’t very useful.

This will also occur when a direct line relative who can be expected to share a substantial number of matches with you from across your tree has also tested - a sibling, parent, child or grandchild.

The most efficient way I have found to get around this is to load in a small additional spreadsheet. We can also use the new spreadsheet to add other information, but we’ll get to that later.

  • Open Excel and create a new workbook:  File – New – Blank Workbook.

    image

In the first row of your new workbook, type in the following column headings:

  • matchID
  • Match name
  • Match admin
  • Vertex 2
  • Name
  • Vertex Type
  • Edge Type
  • Visibility
  • Comment

image

We need to enter one line of information in this table for each person with a large number of matches.

Under matchid you will enter the test ID number that was assigned to that person’s test by Ancestry. For your own test you can the URL when you go to your DNA page on Ancestry. It will look something like this – you need the part marked red:

https://www.ancestry.com.au/dna/insights/AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE

  • Copy and paste your test ID number under both matchid and Vertex 2. 
  • Type the word “Skip” under Visibility.             <<< Don’t miss this step!

That’s all that’s strictly necessary for it to work, but a little extra information will remind you what this line is for later:

  • Put your name in three columns: Match name, Match admin and Name.
  • comment is for reminders to yourself. Put whatever you like there. I added a short explanatory note about what this line does.

image

Repeat the process for each close relative (sibling, parent, child or grandchild) who has DNA tested.

This time, use your own id as the matchid, and your close relative’s id as Vertex 2.

If you’re not the administrator for the test, you can find their match ID on your DNA match page. The red part is your ID, and the blue part is your relative.

https://www.ancestry.com.au/dna/tests/AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE
/match/VVVVVVVV-WWWW-XXXX-YYYY-ZZZZZZZZZZZZ

Alternatively you can look up the match ID numbers in the matches files.

Save the file somewhere you will find it again. I’ll call this file the Additional Input file from now on.

OK, we’re all set. Now we create a NodeXL file and load the information in.

Create a NodeXL Workbook

The method required to use the template may vary with your version of Excel.
I have an Office 365 subscription. I select File New PERSONALNodeXLGraph

If this method doesn’t work for you, try searching for “NodeXL” in the Windows “Search programs and files” field or equivalent on your system, and double click the “NodeXL Template” file returned.

image

A new spreadsheet will open. It may check for template updates as it opens, and you will need to wait for 20 seconds for the splash screen to close. Once it does, your screen should look something like this:

image

A new ribbon called NodeXL Basic has appeared. It won’t be there when you open a normal file, it will only appear when you are using the special files created with the template. Click on the new ribbon and take a look. This is where most of the action will take place.

image

Load your files

Open your match list (m_yourname.csv), in-common-with list (icw_yourname.csv), and additional input file in the normal way then return to the new NodeXL sheet. With each load we have to tell the template which fields in the file hold information about people (‘vertices’) and relationships (‘edges’). I’ll tell you what to put in at each stage.

Important step before loading the first time:

On the NodeXL ribbon, click the Import button. It’s at the far left hand side.

image

  • Choose Import Options… (the bottom item on the menu).
  • Clear the box next to “Clear the NodeXL workbook before data is imported”. There should be no tick in the box.

Why: Otherwise, no matter how many files you load only the last one loaded will be in the spreadsheet. We need all three files – matches, in-common-with, and additional input – to go in.

Save the file now that you’ve adjusted the setting.

In-common-with file

  • Select Import from the NodeXL ribbon and choose From Open Workbook…image
  • Select the in-common-with file (icw_yourname.csv) in the top box of the dialog that appears
  • Leave “Columns have headers” checked
  • Tick the boxes for “match id” and “icwid” under Is Edge Column. No other boxes should be ticked
  • Confirm that “matchid” is selected in the  “Which edge column is Vertex 1?” dropdown
  • Under “Which edge column is Vertex 2?” choose “icwid”
  • Click import (say OK to the message about text wrapping if you get it.)

image

Check the import

Navigate to the Edges worksheet using the tabs at the bottom left of the screen.

image

You should see lots of ID numbers in the “Vertex 1” and “Vertex 2” columns.

The ID numbers will overlap each other and the other columns. That doesn’t matter. You should not see any other data entered in the sheet at this stage.

If the import looks correct, save your progress and carry on.

Matches file

  • Open the import dialog again:
    image
  • Click on the matches file in the top box (m_yourname.csv)
  • Set “testid” and “matchid” as edge columns (tick boxes). No other boxes in that column should be ticked.
  • Under “Is Vertex 2 Property Column” check the boxes for:
    “name”,
    “admin”
    “SharedCM”,
    “note” and
    “matchurl”   (you’ll need to scroll all the way to the bottom to find this).
  • Choose “testid” in the dropdown box under “Which edge column is Vertex 1”
  • Choose “matchid” for “Which edge column is Vertex 2”.
  • Click import (say OK to the message about text wrapping if you get it.)

image

Check the import

Navigate to the Vertices worksheet.

image

Check that the first column “Vertex 1” contains ID numbers. You should not see any names or other information in that column.

Scroll right and check that the “names”, “admin”, “shared CM”, “note” and “matchurl” columns have appeared and have information in them. You may need to scroll right to see them.

image

If this looks right, save you progress and continue.

Additional Input file

Important: Don’t forget to load this file!

  • Open the import dialog again
    image
  • Click on the Additional Input file in the top box
  • Set “matchid” and “Vertex 2” as edge columns (no other boxes should be ticked)
  • Scroll down and tick “Name” and “Visibility” under Is Vertex 2 Property Column    <<< Don’t miss this step
  • Choose “matchid” in the dropdown box under “Which edge column is Vertex 1”
  • Choose “Vertex 2” in the dropdown box under “Which edge column is Vertex 2”
  • Click import (and clear the text wrapping message if it appears)

image

Check the import

Move to the Vertices worksheet again and find the row with your own name (Control-F will bring up a search box).

Confirm that the word “Skip” is in the Visibility column.

image

If this looks right, save again and move to the next step.

Now make a chart!

Find the toolbar in the chart area, and click Show Graph.

Troubleshooting: If it takes more than a few seconds, there was probably a problem with the additional input file. When Excel has finished drawing thousands of dots, go back and check the Additional Input file instructions again and make sure you’ve loaded it. If it was missing and you’ve fixed it, click Refresh Graph which will have appeared where you found Show Graph before.

image

Don’t be disappointed if your chart looks like the image below (and it probably will). It will get better with a few tweaks.

image

Identify groups

On the NodeXL ribbon, find the Groups button. Click it and select “Group by connected component” from the option list.

This option works well for me, but if you have a lot of very interconnected matches you might find that one of the choices under “Group by Cluster” works better.

image

Refresh Graph will add the newly created grouping information to the chart. Your chart will become more colourful but no more tidy. Just one more step, and you’ll have something more interesting to look at.

Separate groups in the chart

You can access layout options from both the NodeXL Ribbon, and the chart area. Click on the dropdown in either location and select Layout Options from the dropdown menu.

image

Change the “Layout Style” option to “Lay out each of the graph’s groups in its own box” and click OK.

image

Click Lay Out Again to apply that change to the chart. You didn’t need to refresh the graph a second time because the data itself didn’t change, only the layout instructions.

image

This is how mine looks now. Each dot represents a person I have a DNA match with. Each line represents a relationship between two of my matches.

Move back to the Vertices worksheet and see what happens when you click on the chart dots.

image

That’s plenty for today. Don’t forget to save your file!

Go and get yourself a nice cup of tea (or whatever beverage you prefer) knowing that if you’ve made it this far you can definitely manage the next steps I have in mind.