Let’s delve deeper into the feature engineering techniques used in the IEEE-CIS Fraud Detection competition, focusing on how each approach contributed to creating a model capable of distinguishing between fraudulent and legitimate transactions:
- Behavioral Feature Engineering
This involves creating features that reflect user behavior and transaction patterns. Competitors derived features from the timestamp (TransactionDT), such as the hour of the day, day of the week, and weekend or weekday indicators. Transaction amount (TransactionAmt) was another focal point, where participants split it into integer and decimal parts to capture different aspects of the spending behavior. They also calculated session-level features, such as the average amount spent by each user over a specific period, the variance of transaction amounts, and the frequency of transactions.
Behavioral features are essential for detecting deviations from a user's typical patterns, which are strong indicators of potential fraud. Unusually large transactions at odd hours, or a burst of transactions within a short window, can signal a compromised card; by capturing these nuances, the model can flag behavior that departs from each user's established norm.
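As a minimal sketch of these time and amount features (the rows are made up, but the column names match the competition, where TransactionDT is a seconds offset from an undisclosed reference datetime rather than a calendar timestamp):

```python
import numpy as np
import pandas as pd

# Toy rows; TransactionDT is a seconds offset, not a real timestamp.
df = pd.DataFrame({
    "TransactionDT": [86400, 226800, 400000],
    "TransactionAmt": [49.95, 120.00, 13.37],
})

# Time-of-day / day-of-week via modular arithmetic on the offset.
df["hour"] = (df["TransactionDT"] // 3600) % 24
df["dayofweek"] = (df["TransactionDT"] // 86400) % 7
df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)

# Split the amount into integer and decimal parts.
df["amt_int"] = np.floor(df["TransactionAmt"]).astype(int)
df["amt_dec"] = (df["TransactionAmt"] - df["amt_int"]).round(4)
```

Because the reference date is unknown, `dayofweek` here is a cyclical feature rather than a true weekday; it still lets the model learn weekly periodicity.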
- Aggregated and Statistical Features
This method involves calculating aggregated statistics on key variables for each user or account. Competitors aggregated features like transaction amount, frequency, and counts based on identifiers such as card ID, email domain, and address. Common statistical metrics included mean, median, standard deviation, min, and max values. For example, they calculated the average transaction amount for each card or the total number of transactions per email domain. Frequency encoding was also applied, converting categorical variables into counts or probabilities based on their occurrences in the dataset.

Aggregated features help in identifying unusual patterns at the user level, such as a card being used in different locations within a short time frame or a high transaction count from an unusual email domain. By summarizing behavior over time, the model can detect changes in usage patterns that are often associated with fraudulent activity. Frequency encoding, on the other hand, highlights rare values that could be markers of fraud, such as a card being used with an unfamiliar domain.
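A short sketch of per-card aggregation and frequency encoding (column names follow the competition's schema; the data is illustrative):

```python
import pandas as pd

# Toy transactions keyed by card and email domain.
df = pd.DataFrame({
    "card1": [111, 111, 222, 222, 222],
    "P_emaildomain": ["gmail.com", "gmail.com", "rare.biz", "gmail.com", "gmail.com"],
    "TransactionAmt": [10.0, 30.0, 100.0, 100.0, 100.0],
})

# Per-card aggregates, broadcast back onto each row with transform().
df["card1_amt_mean"] = df.groupby("card1")["TransactionAmt"].transform("mean")
df["card1_amt_std"] = df.groupby("card1")["TransactionAmt"].transform("std")

# Frequency encoding: replace each category by how often it occurs.
freq = df["P_emaildomain"].value_counts()
df["P_emaildomain_freq"] = df["P_emaildomain"].map(freq)
```

Rows whose amount sits far from their card's mean, or whose email domain has a very low frequency, become easy for a tree model to isolate.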
- Missing Value Treatment and Dimensionality Reduction
Missing values were common in this dataset, especially among the V columns, which contain features engineered by Vesta. Competitors analyzed missing data patterns and grouped features by shared missing-value structures. Techniques like forward-filling, mean imputation, or dropping columns with excessive missing data were employed. Additionally, PCA (Principal Component Analysis) was used for dimensionality reduction and Lasso regression for feature selection, simplifying the dataset by retaining only the most informative features.
Properly handling missing values prevents models from learning spurious patterns due to inconsistencies in the data. By identifying and grouping missing values, competitors could either impute or remove them, maintaining data quality and model stability. Dimensionality reduction reduces noise and computational complexity, enabling the model to focus on the most informative features while improving speed and efficiency. This is particularly useful for tree-based models, which can be prone to overfitting with high-dimensional data.
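The grouping-then-reducing workflow can be sketched as follows (synthetic data; PCA is computed via SVD here simply to keep the example dependency-free, where competitors would typically use `sklearn.decomposition.PCA`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
V = pd.DataFrame(rng.normal(size=(100, 6)), columns=[f"V{i}" for i in range(1, 7)])
V.loc[:29, ["V1", "V2", "V3"]] = np.nan  # inject a shared missingness pattern

# Group columns that share an identical missing-value mask.
groups = {}
for col in V.columns:
    key = tuple(V[col].isna())
    groups.setdefault(key, []).append(col)

# Mean-impute, then reduce dimensionality with PCA (via SVD).
filled = V.fillna(V.mean())
centered = (filled - filled.mean()).to_numpy()
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
V_reduced = centered @ Vt[:3].T  # keep the top 3 principal components
```

In the actual dataset, the V columns fall into a handful of such missingness groups, and competitors often ran PCA within each group rather than across all of them at once.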
- SMOTE and Oversampling for Class Imbalance
Fraud detection datasets typically have a significant class imbalance, with only a small proportion of fraudulent transactions. To counter this, competitors used SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic samples by interpolating between existing minority-class samples. This helps balance the dataset, ensuring the model sees enough fraudulent transactions during training.
Beyond SMOTE, participants applied other techniques like random oversampling of the minority class (fraud) or undersampling of the majority class (non-fraud). These methods adjust the class distribution in the training data, ensuring a balanced representation of both fraud and non-fraud transactions. Some competitors also used stratified sampling, where batches during model training maintain a similar ratio of fraud to non-fraud cases as seen in the dataset. This provides the model with a balanced view during each iteration of training.
Fraud detection heavily depends on catching rare but critical instances, so models need a balanced dataset to accurately learn patterns associated with fraud. SMOTE and other oversampling techniques prevent the model from being overwhelmed by the majority class (non-fraud), which would otherwise bias it toward false negatives, where fraud is misclassified as legitimate. This primarily improves the model's recall (its ability to detect fraud); combined with careful validation, precision can be maintained as well, giving better overall performance on the rare but important fraud cases.
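In practice competitors would call `imblearn`'s `SMOTE`, but the interpolation idea behind it can be sketched in a few lines (the function name and toy data below are my own, not from the competition):

```python
import numpy as np

def smote_like(X_min, n_new, k=2, rng=None):
    """Create n_new synthetic minority samples by interpolating between
    a randomly chosen sample and one of its k nearest minority neighbours."""
    if rng is None:
        rng = np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all minority points
        neighbours = np.argsort(d)[1:k + 1]           # skip self at position 0
        j = rng.choice(neighbours)
        lam = rng.random()                            # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Four minority (fraud) points in 2-D; generate ten synthetic ones.
fraud = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_samples = smote_like(fraud, n_new=10)
```

Each synthetic point lies on a segment between two real fraud cases, so the new samples stay inside the minority class's region of feature space rather than being arbitrary noise.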
- Card and Identity-Based Features
Features related to the payment card and user identity were engineered by combining elements like card1, addr1, and P_emaildomain to form interaction features. Additionally, target encoding was applied to these features by calculating the likelihood of fraud based on historical occurrences.
Certain combinations of user identifiers can correlate with fraudulent transactions. For example, a high-risk email domain or specific address patterns can signal fraud. By encoding these features, the model becomes sensitive to subtle identity-related indicators that are often overlooked by standard checks.
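A hedged sketch of both ideas, interaction keys and target encoding, using the competition's column names on made-up rows (note that a naive in-sample target encoding like this leaks the label; competitors computed it out-of-fold or with smoothing):

```python
import pandas as pd

# Illustrative rows using the competition's column names.
df = pd.DataFrame({
    "card1": [111, 111, 222, 333],
    "addr1": [10, 10, 20, 30],
    "P_emaildomain": ["gmail.com", "gmail.com", "rare.biz", "rare.biz"],
    "isFraud": [0, 0, 1, 1],
})

# Interaction feature: fuse identifiers into one composite key.
df["card1_addr1"] = df["card1"].astype(str) + "_" + df["addr1"].astype(str)

# Naive target encoding: historical fraud rate per email domain.
# (In competition practice, compute this out-of-fold to avoid leakage.)
te = df.groupby("P_emaildomain")["isFraud"].mean()
df["emaildomain_te"] = df["P_emaildomain"].map(te)
```

The composite key lets the model treat "this card at this address" as a single entity, which is often a better proxy for a client than any identifier alone.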
Detailed Business Implications
Each of these feature engineering approaches directly addresses challenges unique to the fraud detection domain:
- Behavioral features allow for identifying sudden or unusual changes in user behavior, a red flag for potential fraud.
- Aggregated and statistical features capture long-term patterns and help in detecting shifts in transactional behavior over time, essential for identifying consistent patterns associated with legitimate or fraudulent transactions.
- Handling missing data and reducing dimensionality ensures that models remain efficient, avoiding distractions from irrelevant information while focusing on critical fraud indicators.
- Class imbalance techniques make sure that fraudulent cases, though rare, are effectively represented, which is critical for the model to generalize well and maintain robustness in real-world application scenarios.
These methods collectively enhance the model’s ability to generalize from historical fraud data, making it better at detecting fraudulent activities in the live environment. They align with business goals by improving fraud detection accuracy, reducing false positives, and ultimately helping companies minimize financial losses due to fraud.