How to use Data Science to predict whether an H1B petition will be certified, withdrawn, or denied
This case study explores the application of Data Science to predict if an H1B petition is certified, withdrawn, or denied.
This case study was brought to you by Case Studies @ Data Kapoor™. Explore one click implementation catered to your use case only at Case Studies @ Data Kapoor.
H1B is the most popular visa in the United States for long-term work options. H1B nonimmigrant visas allow employers to petition for highly educated foreign professionals to work in “specialty occupations” that require at least a bachelor’s degree or the equivalent.
To obtain an H1B Visa, an employer has to follow two steps (More information here):
- Employers first must attest, on a labor condition application (LCA) certified by the Department of Labor (DOL), that employment of the H-1B worker will not adversely affect the wages and working conditions of similarly employed U.S. workers.
- Employers must also provide existing workers with notice of their intention to hire an H-1B worker.
Problem Statement
Each fiscal year, the Department of Labor releases the LCA data of H1B petitions, including whether each application was certified or denied. The probability of an H1B petition being picked in the lottery is estimated at less than 38% (Reference).
The current annual statutory cap is 65,000 visas, with 20,000 additional visas for foreign professionals who graduate with a master’s degree or doctorate from a U.S. institution of higher learning — Reference
While there is no control over the lottery, we can frame the prediction of an H1B petition’s outcome — certified, denied, or withdrawn — as a classification problem.
We aim to find features, or combinations of features (columns of the H1B petition data), that lead a case into one of the three categories.
Tools Required
- Python 3
- PySpark
- Pandas
- Seaborn
- Scikit-Learn
- NumPy
Data Sources
We use H1B visa petition data from 2011–2016 (Data Source). The dataset contains the following features:
- CASE_STATUS: The case status
- EMPLOYER_NAME: Employer Name
- SOC_NAME: Job Title Category
- JOB_TITLE: Job Title
- FULL_TIME_POSITION: Is this a full-time position
- PREVAILING_WAGE: The prevailing wage
- YEAR: Filing year
- WORKSITE: Work location
- lon: Longitude
- lat: Latitude
Exploratory Data Analysis
The dataset contains 2,895,144 rows spanning the years 2011 to 2016. While the dataset lacks some factors that influence H1B selection (such as degree level), and we also do not know how the lottery proceeds, we can still extract a lot of information by first preprocessing the data.
Data Preprocessing
The dataset contains many categorical columns. To enable any analysis with respect to year, we will first perform label encoding on the following columns:
- CASE_STATUS
- EMPLOYER_NAME
- SOC_NAME
- JOB_TITLE
- FULL_TIME_POSITION
- WORKSITE
Additionally, we will store the label encoder object and the corresponding mapping in a dictionary.
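The encoding step described above can be sketched as follows. This is a minimal illustration, not the original notebook: the toy DataFrame stands in for the real dataset, with made-up values, and only two of the categorical columns are shown.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the H1B dataset (illustrative values only)
df = pd.DataFrame({
    "CASE_STATUS": ["CERTIFIED", "DENIED", "CERTIFIED", "WITHDRAWN"],
    "FULL_TIME_POSITION": ["Y", "N", "Y", "Y"],
})

categorical_cols = ["CASE_STATUS", "FULL_TIME_POSITION"]

encoders, mappings = {}, {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    encoders[col] = le          # keep the fitted encoder for inverse_transform later
    # integer code -> original category, handy for labeling plots
    mappings[col] = dict(enumerate(le.classes_))

print(mappings["CASE_STATUS"])
```

Storing both the encoder object and the code-to-category mapping lets us decode model outputs and plot axes back into readable labels.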
First, let’s check for any class imbalances. As you can see, the labels (3, 4, 5) — PENDING QUALITY AND COMPLIANCE REVIEW — UNASSIGNED, INVALIDATED, and REJECTED — are extremely few. Handling these cases is out of scope for this case study.
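A quick way to surface such an imbalance is to look at the relative frequency of each class. A small sketch with illustrative counts (not the real distribution):

```python
import pandas as pd

# Stand-in target column with a deliberate imbalance (illustrative counts)
status = pd.Series(
    ["CERTIFIED"] * 90 + ["DENIED"] * 7 + ["WITHDRAWN"] * 2 + ["REJECTED"] * 1
)

# normalize=True gives each class's share of the total;
# rare statuses stand out immediately
dist = status.value_counts(normalize=True)
print(dist)
```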
Looking at line plots of how CASE_STATUS has changed over the years, we see a clear trend in the growing popularity of the H1B visa.
There is a difference between “rejection” and “denial” in the immigration world. A rejection simply means that there was an error with your filing or fee payment that can be corrected. A denial occurs when either you or your employer are not considered qualified for an H-1B — Reference
Additionally, the most popular job title with the highest number of petitions is “programmer analyst”.
By performing label encoding, we were able to analyze the data categorically, highlight key trends, and, most importantly, understand the data.
Prediction Models
To start off, we will be evaluating two models:
- Random Forest
- XGBoost
Feature Selection
We currently have 9 features (columns). To reduce complexity and identify the most relevant columns, we use mutual information: sklearn’s SelectKBest with score_func set to mutual_info_classif. We select the best 3 features, changing our shape as follows:
(2895144, 9) -> (2895144, 3)
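The selection step can be sketched as below. The matrix here is synthetic (9 random features, with only the first one actually informative about the label) so the snippet is self-contained; the real code would pass the encoded petition columns instead.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic stand-in: 9 features, only the first truly informative
X = rng.integers(0, 5, size=(500, 9)).astype(float)
y = (X[:, 0] > 2).astype(int)   # label depends only on feature 0

# Keep the 3 features with the highest estimated mutual information
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_new = selector.fit_transform(X, y)

print(X.shape, "->", X_new.shape)
```

`selector.get_support()` returns a boolean mask over the original columns, which is how we map the 3 kept features back to their names.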
Train, Test, Validation Splits
We split the dataset into 60% (train), 20% (test), 20% (validation).
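One way to get the 60/20/20 split is to apply train_test_split twice: first carve off 20% for the test set, then split the remaining 80% in a 75/25 ratio. The sketch below uses a synthetic 3-class, 3-feature dataset in place of the real one and trains only the Random Forest; xgboost’s XGBClassifier exposes the same fit/score interface and slots in identically.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 3-feature matrix after selection
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=42)

# 20% held out as the test set; the remaining 80% is split 75/25,
# giving 60/20/20 train/validation/test overall
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

print(f"train={clf.score(X_train, y_train):.2f} "
      f"val={clf.score(X_val, y_val):.2f} "
      f"test={clf.score(X_test, y_test):.2f}")
```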
Following are the metrics:
Random Forest
Training Accuracy: 90%
Testing Accuracy: 86%
Validation Accuracy: 86%
Note: We do have slight overfitting
XGBoost
Training Accuracy: 87%
Testing Accuracy: 87%
Validation Accuracy: 87%
Note: There is no overfitting
Based on this brief analysis, we can infer that XGBoost is the better prediction model of the two.
Case Study Observations and Inferences
We have successfully developed a good model to predict H1B case status. With 87% testing accuracy and no overfitting, mutual information feature selection has given us solid results.
Why is this prediction problem relevant?
For this year, 483,927 H1B petitions were filed. This prediction is especially relevant in the current scenario: anyone on a work visa faces extreme anxiety and frustration while eagerly waiting for the next steps.
This prediction problem eases that anticipation.
Using this prediction model, we can predict the case status from the petition columns and potentially anticipate whether there will be a second lottery in FY 2023. Additionally, we can detect anomalous behavior and extract insights into how the H1B program is being utilized.
Code
Access Notebook: Notebook