GPT Beat Our Coding Screens

Abigail Haddad

Apr 19, 2023 • 8 min read

GPT-4 successfully took the test we use as a technical screen for hiring software developers and data scientists.

With almost no human judgment necessary, it was able to score 100% on the developer test and nearly that high on the data science one. I’m going to discuss what our tests look like, potential stumbling blocks that didn’t stop it, how it performed on other questions of the sort that I ask in technical interviews, what it said was my biggest weakness based on my resume, and how I’m concerned it could be used in the later interview stages as well.

What the Screens Evaluate

First, the questions we ask on the test are a combination of multiple-choice questions about programming and math combined with some coding exercises. The developer screen is heavier on questions about what the output of certain code snippets will be, and the data science screen is heavier on math questions. In both tests, you write some SQL code, as well as other code in your choice of language.

The following is an example of the type of coding question we give:

Consider an array of numeric strings where each string is a positive number with anywhere from 1 to 10^6 digits. Sort the array's elements in non-decreasing, or ascending order of their integer values and return the sorted array.

And this is the Python code that GPT-4 wrote to solve it:

def sort_numeric_strings(arr):‌‌
return sorted(arr, key=int)‌‌‌‌

# Example usage‌‌
arr = ['123', '45', '1000', '6789', '6']‌‌
sorted_arr = sort_numeric_strings(arr)‌‌
print(sorted_arr)

It worked. It also accurately reported the errors Python would raise for various non-numeric inputs. It did incorrectly predict the outcome for a number with leading zeroes: Python raised an error, while GPT-4 assumed the function would handle it correctly.

You can already google these types of questions, but it’s a bigger challenge to sort through GitHub repos and stack overflow pages with code that might or might not run than it is to have a one-stop shop for answers.

It also performed well on conceptual questions like the following (also not an actual example from our screen):

Given three discrete probability distributions, A, B, and C, represented by histograms, determine which distribution has the highest variance and which has the highest mean.

The accompanying picture shows a graph with three different probability distributions. This was one area where I thought GPT might struggle, and also the only area where human judgment was required: I did have to figure out how to describe the picture. I was able to do so, but I was also already familiar with the material.

One option we could pursue would be to embed more information in the form of pictures that are difficult to describe the important parts of unless you know what you’re looking for. But my concern is that this doesn’t buy us much time; GPT-4 can already take image uploads as input, even if the functionality isn’t widely available yet.

Another option would be to explore writing new questions. To the extent that GPT is only getting these correct because it's trained directly on them, or on very similar problems, potentially we could regularly switch out our current problems for new problems. I don't know enough about this yet to say if that's plausible.

The purpose of coding screens is to screen people out, and we have technical interviews after the coding screens. But it performs well on those, too.

Interview Questions and GPT Responses

Here are some examples of the types of questions I ask in technical interviews for data scientists, and here are the answers GPT-4 gave me:

Can you explain to me what a binary classification model does? How do we think about false positives vs. false negatives, and how does that vary depending on the kind of problem we're trying to solve?‌‌‌‌

A binary classification model is a type of machine learning algorithm that categorizes input data into one of two possible classes. It helps in determining whether a given input belongs to class A or class B based on the patterns it has learned from the training data. Examples of binary classification problems include spam email detection (spam or not spam), medical diagnosis (disease present or not present), and credit risk assessment (default or no default).‌‌‌‌False positives and false negatives are types of errors that can occur in binary classification models.

A false positive (Type I error) occurs when the model incorrectly predicts that an instance belongs to the positive class (class A) when it actually belongs to the negative class (class B). A false negative (Type II error) occurs when the model incorrectly predicts that an instance belongs to the negative class (class B) when it actually belongs to the positive class (class A).‌‌‌‌

The importance of false positives and false negatives depends on the specific problem being solved. In some situations, minimizing false positives is more critical, while in others, minimizing false negatives is more important.‌‌‌‌For example, in a medical diagnosis scenario, where the goal is to detect a severe disease, false negatives are more dangerous because failing to identify a patient with the disease could lead to a lack of treatment and severe consequences. In this case, it might be acceptable to have a higher rate of false positives (diagnosing healthy patients as having the disease) to ensure that fewer cases are missed.‌‌‌‌

On the other hand, in a spam email detection system, false positives might be considered more problematic because classifying a legitimate email as spam can result in important messages being missed. In this situation, it might be more acceptable to have a higher rate of false negatives (allowing some spam emails to go undetected) to ensure that fewer legitimate emails are incorrectly classified as spam.‌‌‌‌

Balancing false positives and false negatives involves selecting an appropriate threshold for the model's predictions. Adjusting this threshold can help find the optimal balance between the two types of errors, depending on the specific problem and its requirements. Additionally, evaluation metrics such as precision, recall, and F1-score can be used to measure the performance of the model considering both types of errors.

This is a good answer. It’s wordy, but it hits all the points I’d want a candidate to hit.

This is another variation of a question I ask:

We're running these Tableau dashboards, but users report they're really slow to work with. Can you talk me through what might be going on and some possible solutions?‌‌‌‌

Slow Tableau dashboards can result from various factors, including issues with data sources, complexity of calculations, dashboard design, and hardware limitations. Here are some potential reasons and solutions for the slow performance of your Tableau dashboards:‌‌‌‌

1. Large data sources: Working with large data sources can impact the performance of your dashboards. To address this issue, consider:‌‌
*Aggregating data: Summarize the data to a higher level to reduce the number of records Tableau needs to process.‌‌
*Using extracts: Create a Tableau extract to work with a smaller, optimized subset of your data.‌‌
*Filtering data: Apply filters to your data source to limit the data that Tableau needs to process.‌‌‌‌

2. Complex calculations: Complex calculations, particularly those involving table calculations or LOD expressions, can slow down your dashboards. To improve performance:‌‌
*Optimize calculations: Review and simplify your calculations where possible.‌‌
*Pre-calculate metrics: Perform complex calculations in your data warehouse or during the ETL process, and store the results as additional columns in your data source.‌‌‌‌

3. Dashboard design: The design of your dashboard can also impact its performance. To address this issue:‌‌
*Limit the number of worksheets: Reduce the number of worksheets on your dashboard to decrease the rendering time.‌‌
*Use fewer filters and parameters: Each filter and parameter adds to the processing time, so try to minimize their use.‌‌
*Opt for simpler visualizations: Some chart types, like maps or scatter plots, can be more resource-intensive. Consider using simpler chart types when possible.‌‌‌‌

4. Data connections: The type of connection you use to access your data can affect dashboard performance.‌‌
*Opt for a live connection if your data source and network can support it.‌‌
*Use extracts if live connections are slow or if your data source has limited resources.‌‌‌‌

5. Hardware limitations: Insufficient hardware resources on your Tableau Server or user devices can result in slow dashboards. To address this issue:‌‌
*Upgrade hardware: Ensure that your Tableau Server and user devices meet or exceed the recommended hardware requirements.‌‌
*Scale out: Add more nodes to your Tableau Server cluster to distribute the workload and improve performance.‌‌‌‌

6. Network latency: Slow network connections can affect the performance of your dashboards, especially when using live connections.‌‌
*Optimize network performance: Work with your IT team to address any network bottlenecks or latency issues.‌‌
*Use data extracts: If network latency is an issue, consider using data extracts instead of live connections.‌‌‌‌

By addressing these potential issues, you can improve the performance of your Tableau dashboards and provide a better user experience.

This is also just a really solid, detailed answer.

Non-Technical Questions

I also asked a couple of non-technical questions. I gave it my resume and told it to write me a “tell me about yourself” introduction, and it didn’t do a good job. It brought up too many specific details and didn’t really tell a story. But when I asked it to write me the answer to “my biggest weakness”, it gave me the following, which is not a bad answer for many technical people:

One of my weaknesses is that I sometimes get too focused on the technical aspects of a project and may lose sight of the bigger picture. As a data scientist, I'm passionate about delving deep into the details and ensuring the accuracy and quality of my work. However, I understand that it's essential to maintain a balance between the technical side and the overall business objectives. To address this weakness, I've been working on strengthening my communication skills and prioritizing regular check-ins with stakeholders and team members. This allows me to align my work with the project goals and be more mindful of the broader context while still maintaining a strong technical foundation.

Feasibility of GPT Use In Interviewing

But the question for us isn’t just whether GPT can give good answers to the rest of our processes, it’s whether it would be possible for a candidate to use it during a remote interview. I think there are a few plausible ways someone could use it.

First, they could have a friend in the room, off-camera. The friend could type in questions and then send them to the candidate, who could read them from their screen. You could already have a friend assisting you in cheating on an interview – but with GPT, the friend doesn’t need to know how to do anything but type.

Second, they could develop a program that translates the interviewer’s speech into text, eliminating the need for another person in the room. I don’t know how difficult this would be, but there are already attempts to create this technology in a way that would offer prompts for real-time conversations. I suspect that the part of our current interview process that would be the most robust against GPT is where candidates share their screen and code in real time - but this is a rapidly changing situation.

It’s possible that the time lag associated with getting input from models would make it obvious to the interviewer that something was going on. But for how long?

Getting Ready Now

If GPT becomes a tool that we’re able to use on our client engagements, then we should have an evaluation component where the candidate uses it. GPT would allow us to give a more difficult, interesting coding exercise that would require the candidate's judgment and coding skills and, with this assist, could also be completed during the time frame of an interview.

However, we'll still want to interview people without them using AI assistance. So if you've shifted to interviewing people remotely, it's time to think about how you are going to institute in-person components which comprehensively test everything you want to assess.

If you don't have physical office space, or you interview people in places where you don't have staff, maybe that means renting space or working with organizations that can essentially proctor your job interviews, where the candidate is being monitored in a physical space while your interviewer is on the other end of a video chat.

There are a lot of processes that GPT and other technology that's new or becoming more widely available, like voice cloning, have the potential to make vastly more adversarial. No one really wants to think that the person they're interviewing might not be who they say they are, or might be using AI to pass a coding screen or interview – but it's still better to be prepared than to be caught off-guard.

What the Screens Evaluate

Interview Questions and GPT Responses

Non-Technical Questions

Feasibility of GPT Use In Interviewing

Getting Ready Now

Sign up for more like this.