Secondary Data Management & Organization
Imagine you’ve just come out of a fantastic meeting with your research collaborators. The kind where ideas are bottomless, connections are being made lightning quick, and everybody has said, at least a few times, “oh! I’ve got a source for that!” As the hours tick away after the meeting ends, you notice that the shared drive for your team is brimming with a mass of files containing data of all types from many different origins, and an initial comb through the data feels overwhelming. How are you all ever going to make sense of all this information? Is this project doomed to fail because of information overload?
Sounds like what you need are some insights into how to manage and organize your secondary data so that research can progress! In this article, I will detail some of the tips I’ve learned over the years for weeding through mass archives of data with usability in mind.
Consider what you want to know
To begin, it is critical to take a step back from all that juicy data and consider what it is that you (and your team) want to know for your research from this data. There are a few foci that I try to keep in mind when organizing data: the overview, the narratives, and the patterns that the data contain.
The overview can be thought of similarly to compiling an encyclopedia. What is the broadest picture of your research program? Commonly, the initial answers to this question are two to three layers narrower than the broadest picture of the topic. For instance, in my work on the USAID LASER PULSE project about community violence and resilience in pastoralist communities in Southeastern Ethiopia, it might seem like the overview of the data should be violent conflict. However, violent conflict is only one component of the overview, and the true overview of the data is the country of Ethiopia. In order to understand violent conflict, we must understand more about the context (e.g., food, water, health, and climate). From this example, it might seem like it would be clearer to state that the focus is the context rather than the overview, but not all research is bounded by a specific place. For example, the overview for a project around adaptive equipment used by people with disabilities would be the adaptive equipment itself. Essentially, the overview is what sets the broadest boundary of your research.
The narratives are then the different elements within that boundary that the research is probing deeper into. In the USAID LASER PULSE example, within the overview of Ethiopia, the researchers were interested in data related to violent conflict in the areas of collaboration, the different agencies that were supporting the communities through violent conflict, and the other medical and socio-ecological factors at play around the time of the violent conflicts. We can think of the narratives of the project as the buckets that we wish to categorize our data into, based on extant literature and community need.
The patterns in what we want to know from our data are the potential associations between our narratives. Narratives about education, medical resources, food insecurity, and droughts in Ethiopia may not be explicitly about violent conflict, but the data may still show patterns across them, for example, years of extended drought that coincide with increased intercommunity violence or political unrest. Considering which potential patterns would be worth examining within the archive of data can help to guide the organizational format and layout of your data.
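To make the idea of a pattern a little more concrete, here is a minimal sketch in Python of one way to check where two narratives overlap in time. Everything in it is an assumption for illustration: the records, the years, the bucket names, and the helper functions are hypothetical placeholders, not data or tooling from the LASER PULSE project.

```python
from collections import defaultdict

# Hypothetical catalog entries: (year the source covers, narrative bucket).
# The years and bucket names are placeholders, not findings from any project.
records = [
    (2001, "drought"),
    (2001, "violent conflict"),
    (2002, "education"),
    (2003, "drought"),
    (2003, "violent conflict"),
    (2004, "violent conflict"),
]


def years_by_narrative(entries):
    """Group the years covered by the catalog under each narrative."""
    grouped = defaultdict(set)
    for year, narrative in entries:
        grouped[narrative].add(year)
    return grouped


def shared_years(entries, first, second):
    """Return the sorted years in which both narratives have at least one source."""
    grouped = years_by_narrative(entries)
    return sorted(grouped[first] & grouped[second])


overlap = shared_years(records, "drought", "violent conflict")
print(overlap)
```

A list of shared years like this only tells you where to look more closely. It says nothing on its own about whether one narrative is driving the other.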
Considering the organizational structure of your inquiry
Now that we have taken the time to think about what it is that we wish to know from the secondary data, it is time to set up the organization of the data. For this, I would like to introduce a new idea: units of inquiry. Broadly, I approach the organization of data from a generalized taxonomical perspective, which means that I am interested in how my data can be layered by its relationship to different parts of itself. Rather than having kingdoms, phyla, and so forth, as we would in a biological taxonomy, we have our units of inquiry, which consist of things like place, people, organizations, time, et cetera.
Data organized with time as the primary unit of inquiry is most commonly recognized as a timeline, which comes up in infrastructural and community development as well as history. However, organizing data with time as the primary unit of inquiry is commonly a mistake, because it neglects that what we are really interested in is the role of time within a place or within the experiences of an individual.
Returning to the example of my work with USAID LASER PULSE, the overview we were seeking to learn more about from our secondary data was Ethiopia, with special attention to some specific pastoralist communities in the Southeastern part of the country. Because our overview was about place, it follows that our primary unit of inquiry should be place. For our work, we specifically chose to have six primary units of inquiry that were all based around place: the country of Ethiopia as a whole, the two regions our communities are located within, and the three kebeles we are working with. We chose to capture data at all of these levels of place because doing so broadens the context of our primary unit of inquiry and captures information about how the kebeles are positioned within the regions of Ethiopia.
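As a rough sketch of what nested, place-based primary units of inquiry can look like, the Python below builds a small hierarchy with one country, two regions, and three kebeles. The class, the region and kebele names, and the way the kebeles are split across the regions are all hypothetical placeholders that I am assuming for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PlaceUnit:
    """One place-based primary unit of inquiry (hypothetical structure)."""
    name: str
    level: str  # "country", "region", or "kebele"
    children: list = field(default_factory=list)

    def add(self, child):
        """Nest a smaller place inside this one and return it."""
        self.children.append(child)
        return child

    def walk(self):
        """Yield this unit, then every unit nested beneath it."""
        yield self
        for child in self.children:
            yield from child.walk()


# Placeholder names; which kebele sits in which region is an assumption.
country = PlaceUnit("Ethiopia", "country")
region_a = country.add(PlaceUnit("Region A", "region"))
region_b = country.add(PlaceUnit("Region B", "region"))
region_a.add(PlaceUnit("Kebele 1", "kebele"))
region_a.add(PlaceUnit("Kebele 2", "kebele"))
region_b.add(PlaceUnit("Kebele 3", "kebele"))

primary_units = [unit.name for unit in country.walk()]
print(len(primary_units), primary_units)
```

Keeping the nesting explicit means a source filed under a kebele can still be read in the context of its region and of the country as a whole.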
Once the primary unit of inquiry has been established, it is likely that a secondary unit of inquiry will need to be identified. For this we can look to the narratives within our overview that were considered earlier. This secondary unit of inquiry will be those buckets that we identified earlier based on extant literature and community need as well as new buckets that arise as we are combing through the data. It has been my experience that the secondary units of inquiry tend to be developed through an iterative process in which new buckets are formed when there’s enough overlapping data between two categories to form one bucket or when there’s too much data going into one category, and we are able to identify new categories to split it into. The patterns we identified earlier that we thought might yield new and/or interesting correlations can also inform our secondary units of inquiry.
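The iterative bucketing described above can be sketched as a small Python catalog that files each source under a primary and a secondary unit of inquiry and then flags any bucket that has grown large enough to be worth splitting. The catalog entries, the bucket names, and the size threshold are hypothetical; in practice the decision to split or merge is a judgment call, not a fixed number.

```python
from collections import Counter

# Arbitrary threshold chosen for illustration only.
MAX_ITEMS_PER_BUCKET = 3

# Hypothetical catalog: each source is filed under a primary unit (place)
# and a secondary unit (narrative bucket). All names are placeholders.
catalog = [
    {"source": "report_01", "primary": "Kebele 1", "secondary": "health"},
    {"source": "report_02", "primary": "Kebele 1", "secondary": "health"},
    {"source": "report_03", "primary": "Kebele 1", "secondary": "health"},
    {"source": "report_04", "primary": "Kebele 1", "secondary": "health"},
    {"source": "report_05", "primary": "Kebele 1", "secondary": "water"},
    {"source": "report_06", "primary": "Region A", "secondary": "health"},
]


def bucket_sizes(items):
    """Count how many sources sit in each (primary, secondary) bucket."""
    return Counter((item["primary"], item["secondary"]) for item in items)


def buckets_to_split(items, limit=MAX_ITEMS_PER_BUCKET):
    """Return the buckets holding more sources than the chosen limit."""
    return sorted(key for key, count in bucket_sizes(items).items() if count > limit)


flagged = buckets_to_split(catalog)
print(flagged)
```

This sketch only covers the splitting half of the process. Spotting two buckets that overlap enough to merge would need each source to carry more than one secondary tag, which I have left out to keep the example short.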
As you are combing through the data and putting things into different buckets for your secondary units of inquiry, you may find a layer of nuance within a secondary unit of inquiry that doesn’t quite work as a unit on its own. These are tertiary units of inquiry. It has been my experience that the broader the anticipated overview of the project, the less likely I have been to find tertiary units of inquiry. I have a couple of recommendations for dealing with them: first, consider whether the tertiary unit might be significant enough to warrant modifying the category of the secondary unit of inquiry that it falls within; second, consider whether this data will be necessary. There’s a chance that a tertiary unit of inquiry is interesting but has no significance to the project as a whole; in that case, I would not include this data in the secondary data organization. If you are unsure about the relevance of a piece of data that falls into a tertiary unit of inquiry, however, it is better to include it and not need it than to omit it and not be able to find it again later. The more practice you get with secondary data organization, the better your sense of the importance of individual pieces of data will become.
Finally, there may come a time in the data organization process where you’ll have to ask yourself, can I simplify the data included or should I create two (or more) cleaner databases? My advice for this is to consider if you’re finding that it is either too difficult or too easy to fit all the secondary data into your database. If it is too difficult, this could mean that you have more data than your project needs, and you might consider setting up a second database for the data that is not fitting into the current database for future research. Additionally, experiencing difficulty with fitting data into your database could mean that the units of inquiry are too narrow and need to be reconsidered and/or broadened. If it is too easy to put data into your database and the file is getting unmanageable, then you may need to narrow the scope of your research into multiple overview sections that can each have their own database. In this situation, you essentially have multiple studies within the broader research program with their own databases.
Levine is part of the Purdue University team led by Dr. Stacey Connaughton (PI) with colleagues from Purdue, University of Addis Ababa, Search for Common Ground, and the Aged and Children’s Pastoralist Association. More information can be found here.