Last year, when I was looking at a publicly-released Excel file on school test scores that had been released by the Office of the State Superintendent of Education (OSSE) for the District of Columbia, I guessed that something was not working as intended. While much of the data was suppressed or redacted – that is, removed for privacy reasons – there were several features of the data that suggested that some of the missing data could be inferred, or filled in using the non-missing data.
In this article, we’ll go through the features of the data that made this possible, as well as the tools that I used in Python to solve it, and a larger takeaway about automation and data privacy.
Each row of the data shows who it represents from within a particular school, the level scored on the test, and a set of values. For instance, in the example below – which is a simplified example not using data from any real school* – we’re looking at 8th graders. The first line shows that this is data on the number of 8th graders at this school who scored a Level 1. We see there were 70 eighth graders who scored a Level 1 out of a total of 140 8th graders, or 50%. The yellow boxes show data that’s been removed for privacy reasons. We’ll keep using this example as we go over the features of the data that allow missing numbes to be inferred.
- Suppressed information is directly present in another line. The Total Count of 8th graders who took this test is 140, which we know from the information for Level 1, 2, and 3. We can now fill that in as the Total Count for the rows where this information was missing.
- Missing numbers, non-missing percentages. If we have the percentage and we have the Total Count (or in some cases, even if we don’t have that), we can infer the count. In this case, we know that the total percentage of students testing proficient is 5.71%, and we know now that the Total Count is 140, so we can infer that the missing Count for the last row is 8.
- Data relationships across rows. Within this data, there are multiple types of relationships. For instance, the Proficient Count is the sum of the Level 4 and Level 5 Count for any particular group. This means that we have another way of figuring out the total number of students testing proficient, even if we didn’t have that fraction: the Total Count of 8th graders who took the test is 140, the number who didn’t test proficient is (70+12+50)=132, therefore there are 140-132=8 proficient students. There are other types of relationships within the data - for instance, the school-level data is the sum of the grade-level data. These relationships can be translated into linear systems of equations and then, in some cases, solved.
In the table above you can eyeball that there’s an issue with the data and how to infer some of it. But even with less information or patterns of missing information where it’s not immediately obvious that there’s a solution, we can often find one in Python by combining the following tools.
Solutions Within Python
- GroupBy and forward/backward fill within the Python Pandas library. For filling in the Total Count info that we already know from another row, you just set up your data in the right way and then fill in the blanks.
- Fraction solving. I wrote a function using the Python fractions module which takes a Total Count value and a Percentage value and returns the Count. This is possible because the denominators are fairly small, and also the percentages provided are very precise – to two decimal places after the percentage sign. (For more on inferring numerators and denominators from fractions, see Uncovering Raw Data From Percentages. )
- Solving systems of linear equations via SymPy. SymPy is a Python library for doing symbolic mathematics, including solving equations. SymPy worked beautifully for solving my many systems of linear equations that stemmed from the additive relationships within the data. The harder part was figuring out how to translate my knowledge about data relationships into code.
- Additional system-of-equation-solving via brute force. The only place that the SymPy functions I was using struggled was using the fact that all of my answers had to be non-negative integers – you can’t have negative children**. After SymPy was done solving what it could, some of my equations still had unique solutions. To find these, I went through and “brute forced” it, or generated every possible set of numbers, tried all of them in combination, and then if there was only one possible solution where all of the values were integers greater than or equal to zero, that was the answer. This takes a lot longer than SymPy, but it’s possible to do because of some of the specific properties of my data that mean we don’t have an infinite possible solution space to search – that is, the solution couldn’t be any possible number out there.***
This likely isn’t a common use case, but there is a takeaway both for people with data they want to redact and for coders: Even data automation tasks where there’s not currently an out-of-the-box function in Python are still solvable somehow. This means you can’t rely on it being really boring or time-intensive to solve thousands of linear equations – or whatever it would take – to keep your data safe.
The code I wound up writing in Python was just a much larger, more complicated version of what I was able to do manually in Excel to prove the concept the first time I was looking at the data. But I never would have done it in Excel because it would have taken days, it would have been boring, and I probably would have made a lot of mistakes along the way.
But any task that you can do in Excel, even if it doesn’t seem like an obvious natural fit for automation, like translating missing data to symbols and setting up solving thousands of systems of linear equations, you can also do with code. If you’re counting on your data to be secure because it’d be a huge pain to solve it manually, that’s not a good strategy.
If you’re interested in talking about how to effectively redact your data, or you have another related use case for this type of functionality, you can reach out to me.
*OSSE has since re-released a different version of the data and taken down the original data set. My code is available, but it will not work if you run it directly on the re-released data.
**I think there might be a way to get SymPy to do this along these lines; this was my first project with SymPy.
*** The first is that the data is not that big. For instance, a Count value is never going to be bigger than the Total Count for that group. The second is that the solution space is convex, meaning that if one of the variables might be 1 but isn’t 2, then we can stop looking at 3+ because it’s definitely not going to be found there. Because of these properties, we’re not searching for an infinite number of possible solutions.