Introduction

One evening in early 2014, I settled down in front of my computer to catch up on some of my family tree research. This was my second major sprint at digging into the stories of my predecessors. The previous sprint was back in 2001 with some weekend trips to the family records center in London. Back then, over the course of a year or so of dedicated effort on my part, I had slowly managed to find information on my immediate ancestors dating back to the late 1800s.

This second sprint had been very different. Over the course of only a few weeks I had created a deep and wide family tree, with some branches going all the way back to 17th-century England. On this particular evening I was in for a real surprise. When I logged in to the Ancestry website, I had a hint waiting for me, asking me to click on another member’s family tree where we seemed to have a common ancestor. I duly clicked, confirmed the common ancestor and then followed this new tree down to the present day. One of the names I saw when I got there was Alan S. P. Rickman.

A quick check of the family tree details against his Wikipedia biography revealed that this was the Alan Rickman, the star of the Harry Potter movies, the original Die Hard, and many other big-screen classics. He had always been one of my favorite British actors, but now I learned that he was my second cousin too. My latest attempt at genealogical research was proving a truly amazing experience for me. I had never imagined that I could progress so quickly and discover such rich and surprising information.

Why was my second sprint at genealogy so much more productive than my first? Of course, part of the reason was the digitization revolution, which meant that far more historic documents were instantly available to me electronically than had been the case a decade before. But the biggest reason I could make such rapid progress was that I was no longer doing this alone. I was, in fact, part of a huge open network graph created by millions of family history researchers. When that hint appeared that led me to my new-found famous cousin, an edge was created connecting the nodes in my family tree to the nodes of Alan Rickman and his ancestors. It was hundreds of these edges over the previous months that had allowed me to reach so far and wide into my own ancestral history. Thirteen years earlier when I last tried this, that graph simply did not exist.
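The effect of a single hint like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how Ancestry stores or matches family trees: the names are invented, each tree is treated as a plain set of parent-child edges, and the helper functions `build_adjacency` and `reachable` are hypothetical.

```python
from collections import deque

# Two small family trees, each a set of (parent, child) edges.
# All names are made up for illustration.
my_tree = {("Ancestor A", "Ancestor B"), ("Ancestor B", "Me")}
other_tree = {("Ancestor A", "Ancestor C"), ("Ancestor C", "Cousin")}


def build_adjacency(edges):
    """Build an undirected adjacency map from a set of edges."""
    adjacency = {}
    for parent, child in edges:
        adjacency.setdefault(parent, set()).add(child)
        adjacency.setdefault(child, set()).add(parent)
    return adjacency


def reachable(edges, start):
    """Return every person connected to `start`, using breadth-first search."""
    adjacency = build_adjacency(edges)
    seen = {start}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for relative in adjacency.get(person, ()):
            if relative not in seen:
                seen.add(relative)
                queue.append(relative)
    return seen


# Working alone, I can only reach people in my own tree.
alone = reachable(my_tree, "Me")

# Because both trees contain "Ancestor A", taking the union of the
# edge sets joins them into one connected graph.
merged = reachable(my_tree | other_tree, "Me")

print(sorted(merged - alone))  # ['Ancestor C', 'Cousin']
```

The two trees share one node, so combining their edges is enough to make everyone in the second tree reachable from the first. Repeated hundreds of times across millions of researchers, the same step produces the kind of large connected graph described above.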

Like it or not, nowadays most of us are a part of at least one big graph that has come into existence in the early years of the 21st century. When you connect with someone on Facebook or LinkedIn or follow someone on Twitter or Instagram, you are casting out yet another edge in a gigantic graph. Maybe genealogy is not your thing and you prefer music, or books, or knitting? Don’t worry, there’s a graph for all of those too, where you can connect with people who share your interests and exchange your passion and knowledge. If you are stuck in a traffic jam and your car is at a standstill, you may be contributing to the value of an edge on numerous GPS navigation graphs and thus helping some other person avoid your pain.

The technology of graphs is all around us, and enables so many of the ways in which we live our lives today. That same technology is also available to us at no cost as an analytic tool to allow us to better understand network structures and dynamics in the fields of science, technology, economics, sociology and psychology to name just a few. It is available to academics and practitioners alike, and can be used on problems ranging from a very small network analysis which takes a few minutes on a laptop, to massive scale network mining requiring days or weeks of processing time.

But here’s the problem: few people really know how to do network analysis. It is still considered by many to be a deep specialism or even a ‘dark art.’ It shouldn’t be. More academic students and researchers should know how to apply graph theory and methods to their work. More business analytics professionals should know how to store and analyze data that is not in traditional rectangular form and which focuses on connection rather than transaction. More companies and organizations should be thinking about how their data can be structured and analyzed to tell them more about how people, skills and knowledge connect and how this can influence key positive and negative institutional outcomes. What’s more, they should not need expensive and inflexible network analysis and visualization software from vendors to do this, when the best tools are freely available, open source, and require only a little programming skill to use to the full.

This book aims to make the field of graph and network analysis more approachable to students and professionals by explaining the most important elements of theory and sharing common methodologies using open source programming languages like R and Python. It does so by explaining theory in as much detail as is necessary to support analytical curiosity and interpretation, and by using a wide array of example data sets and code snippets to demonstrate the specific implementation and interpretation of methodologies. Those who start the book will learn about simple but important steps like creating graphs from data sources and visualizing them intuitively. Those who finish the book will learn about important measures like graph density and centrality and useful algorithms for partitioning graphs and identifying communities in complex populations. As you will see as you read on, these methods have many exciting applications. In organizational settings, my personal speciality, they can be applied to problems of onboarding new hires, encouraging diverse collaboration and interaction, finding efficient communication strategies, identifying new organizational structures that better reflect the flow of work, detecting intensely collaborative groups, connecting individuals with common interests, finding potential leaders and many other problems.

There is no way that this book could be a complete overview of graph theory and methods, at least not without the printer running out of ink. There are many, many elements of graph theory and methodology that are not covered in this book. My focus has been primarily on teaching the most critical elements for those who will use graphs in a sociological, psychological or organizational context. This book will improve over time, so if you feel I have missed anything very important, or if you spot errors or have any other suggestions, please do get in touch by leaving an issue on the book’s GitHub repository. If you use the contents of this book or any examples from it for teaching purposes, you don’t need to ask my permission to do so, but I would ask that you reference this book as the source.

It just remains for me to thank the various people and groups who helped me make this book a reality. Various individuals have contributed to making this book better by reading early drafts, trying out code, offering encouragement or suggesting new examples: in particular Liz Romero, Rachel Ramsay, Christopher Belanger, Jenna Eagleson and Bennet Voorhees. Nothing in this book would be possible without the fantastic packages that exist in languages like R and Python for working with graphs, and so the authors and contributors of open source packages like igraph, networkx, ggraph, pyvis, visNetwork and networkD3 deserve thanks. I am grateful to the Stanford Network Analysis Project (SNAP) and the SocioPatterns collaboration for the datasets they make available to the public, many of which I have used as examples in this book. As always, my thanks to all the developers who work on the rmarkdown and bookdown R packages which allow me to write and format these technical books with much less complexity than would otherwise be the case.

For me there is no doubt that network analysis is the most exciting, most fun and fastest developing field in People Analytics. I hope you have as much fun learning from this book as I did writing it.

Keith McNulty
October 2021