Publications
Research Leading to the following publication
2019
- Why is Developing Machine Learning Applications Challenging? A Study on Stack Overflow PostsMoayad Alshangiti, Hitesh Sapkota, Pradeep K. Murukannaiah, and 2 more authorsIn 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) 2019
Background: As smart and automated applications pervade our lives, an increasing number of software developers are required to incorporate machine learning (ML) techniques into application development. However, acquiring the ML skill set can be nontrivial for software developers owing to both the breadth and depth of the ML domain. Aims: We seek to understand the challenges developers face in the process of ML application development and offer insights to simplify the process. Despite its importance, there has been little research on this topic. A few existing studies on development challenges with ML are outdated, small scale, or they do no involve a representative set of developers. Method: We conduct an empirical study of ML-related developer posts on Stack Overflow. We perform in-depth quantitative and qualitative analyses focusing on a series of research questions related to the challenges of developing ML applications and the directions to address them. Results: Our findings include: (1) ML questions suffer from a much higher percentage of unanswered questions on Stack Overflow than other domains; (2) there is a lack of ML experts in the Stack Overflow QA community; (3) the data preprocessing and model deployment phases are where most of the challenges lay; and (4) addressing most of these challenges require more ML implementation knowledge than ML conceptual knowledge. Conclusions: Our findings suggest that most challenges are under the data preparation and model deployment phases, i.e., early and late stages. Also, the implementation aspect of ML shows much higher difficulty level among developers than the conceptual aspect.
- A network-centric approach for estimating trust between open source software developersHitesh Sapkota, Pradeep K. Murukannaiah, and Yi WangPLOS ONE Dec 2019
Trust between developers influences the success of open source software (OSS) projects. Although existing research recognizes the importance of trust, there is a lack of an effective and scalable computational method to measure trust in an OSS community. Consequently, OSS project members must rely on subjective inferences based on fragile and incomplete information for trust-related decision making. We propose an automated approach to assist a developer in identifying the trustworthiness of another developer. Our two-fold approach, first, computes direct trust between developer pairs who have interacted previously by analyzing their interactions via natural language processing. Second, we infer indirect trust between developers who have not interacted previously by constructing a community-wide developer network and propagating trust in the network. A large-scale evaluation of our approach on a GitHub dataset consisting of 24,315 developers shows that contributions from trusted developers are more likely to be accepted to a project compared to contributions from developers who are distrusted or lacking trust from project members. Further, we develop a pull request classifier that exploits trust metrics to effectively predict the likelihood of a pull request being accepted to a project, demonstrating the practical utility of our approach.
2021
- Distributionally Robust Optimization for Deep Kernel Multiple Instance LearningHitesh Sapkota, Yiming Ying, Feng Chen, and 1 more authorIn Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (AISTATS) Apr 2021
Multiple Instance Learning (MIL) provides a promising solution to many real-world problems, where labels are only available at the bag level but missing for instances due to a high labeling cost. As a powerful Bayesian non-parametric model, Gaussian Processes (GP) have been extended from classical supervised learning to MIL settings, aiming to identify the most likely positive (or least negative) instance from a positive (or negative) bag using only the bag-level labels. However, solely focusing on a single instance in a bag makes the model less robust to outliers or multi-modal scenarios, where a single bag contains a diverse set of positive instances. We propose a general GP mixture framework that simultaneously considers multiple instances through a latent mixture model. By adding a top-k constraint, the framework is equivalent to choosing the top-k most positive instances, making it more robust to outliers and multimodal scenarios. We further introduce a Distributionally Robust Optimization (DRO) constraint that removes the limitation of specifying a fix k value. To ensure the prediction power over high-dimensional data (e.g., videos and images) that are common in MIL, we augment the GP kernel with fixed basis functions by using a deep neural network to learn adaptive basis functions so that the covariance structure of high-dimensional data can be accurately captured. Experiments are conducted on highly challenging real-world video anomaly detection tasks to demonstrate the effectiveness of the proposed model.
- Deep Reinforced Attention Regression for Partial Sketch Based Image RetrievalDingrong Wang, Hitesh Sapkota, Xumin Liu, and 1 more authorIn 2021 IEEE International Conference on Data Mining (ICDM) Dec 2021
Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) aims at finding a specific image from a large gallery given a query sketch. Despite the widespread applicability of FG-SBIR in many critical domains (e.g., crime activity tracking), existing approaches still suffer from a low accuracy while being sensitive to external noises such as unnecessary strokes in the sketch. The retrieval performance will further deteriorate under a more practical on-the-fly setting, where only a partially complete sketch with only a few (noisy) strokes are available to retrieve corresponding images. We propose a novel framework that leverages a uniquely designed deep reinforcement learning model that performs a dual-level exploration to deal with partial sketch training and attention region selection. By enforcing the model’s attention on the important regions of the original sketches, it remains robust to unnecessary stroke noises and improve the retrieval accuracy by a large margin. To sufficiently explore partial sketches and locate the important regions to attend, the model performs bootstrapped policy gradient for global exploration while adjusting a standard deviation term that governs a locator network for local exploration. The training process is guided by a hybrid loss that integrates a reinforcement loss and a supervised loss. A dynamic ranking reward is developed to fit the on-the-fly image retrieval process using partial sketches. The extensive experimentation performed on three public datasets shows that our proposed approach achieves the state-of-the-art performance on partial sketch based image retrieval.
2022
- Bayesian Nonparametric Submodular Video Partition for Robust Anomaly DetectionHitesh Sapkota, and Qi YuIn Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Jun 2022
Multiple-instance learning (MIL) provides an effective way to tackle the video anomaly detection problem by modeling it as a weakly supervised problem as the labels are usually only available at the video level while missing for frames due to expensive labeling cost. We propose to conduct novel Bayesian non-parametric submodular video partition (BN-SVP) to significantly improve MIL model training that can offer a highly reliable solution for robust anomaly detection in practical settings that include outlier segments or multiple types of abnormal events. BN-SVP essentially performs dynamic non-parametric hierarchical clustering with an enhanced self-transition that groups segments in a video into temporally consistent and semantically coherent hidden states that can be naturally interpreted as scenes. Each segment is assumed to be generated through a non-parametric mixture process that allows variations of segments within the same scenes to accommodate the dynamic and noisy nature of many real-world surveillance videos. The scene and mixture component assignment of BN-SVP also induces a pairwise similarity among segments, resulting in non-parametric construction of a submodular set function. Integrating this function with an MIL loss effectively exposes the model to a diverse set of potentially positive instances to improve its training. A greedy algorithm is developed to optimize the submodular function and support efficient model training. Our theoretical analysis ensures a strong performance guarantee of the proposed algorithm. The effectiveness of the proposed approach is demonstrated over multiple real-world anomaly video datasets with robust detection performance.
- Balancing Bias and Variance for Active Weakly Supervised LearningHitesh Sapkota, and Qi YuIn Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) Jun 2022
As a widely used weakly supervised learning scheme, modern multiple instance learning (MIL) models achieve competitive performance at the bag level. However, instance-level prediction, which is essential for many important applications, remains largely unsatisfactory. We propose to conduct novel active deep multiple instance learning that samples a small subset of informative instances for annotation, aiming to significantly boost the instance-level prediction. A variance regularized loss function is designed to properly balance the bias and variance of instance-level predictions, aiming to effectively accommodate the highly imbalanced instance distribution in MIL and other fundamental challenges. Instead of directly minimizing the variance regularized loss that is non-convex, we optimize a distributionally robust bag level likelihood as its convex surrogate. The robust bag likelihood provides a good approximation of the variance based MIL loss with a strong theoretical guarantee. It also automatically balances bias and variance, making it effective to identify the potentially positive instances to support active sampling. The robust bag likelihood can be naturally integrated with a deep architecture to support deep model training using mini-batches of positive-negative bag pairs. Finally, a novel P-F sampling function is developed that combines a probability vector and predicted instance scores, obtained by optimizing the robust bag likelihood. By leveraging the key MIL assumption, the sampling function can explore the most challenging bags and effectively detect their positive instances for annotation, which significantly improves the instance-level prediction. Experiments conducted over multiple real-world datasets clearly demonstrate the state-of-the-art instance-level prediction achieved by the proposed model.
2023
- Adaptive Robust Evidential Optimization For Open Set Detection from Imbalanced DataHitesh Sapkota, and Qi YuIn The Eleventh International Conference on Learning Representations Jun 2023
Open set detection (OSD) aims at identifying data samples of an unknown class (ie, open set) from those of known classes (ie, closed set) based on a model trained from closed set samples. However, a closed set may involve a highly imbalanced class distribution. Accurately differentiating open set samples and those from a minority class in the closed set poses a fundamental challenge as the model may be equally uncertain when recognizing samples from the minority class. In this paper, we propose Adaptive Robust Evidential Optimization (AREO) that offers a principled way to quantify sample uncertainty through evidential learning while optimally balancing the model training over all classes in the closed set through adaptive distributively robust optimization (DRO). To avoid the model to primarily focus on the most difficult samples by following the standard DRO, adaptive DRO training is performed, which is governed by a novel multi-scheduler learning mechanism to ensure an optimal model training behavior that gives sufficient attention to the difficult samples and the minority class while capable of learning common patterns from the majority classes. Our experimental results on multiple real-world datasets demonstrate that the proposed model outputs uncertainty scores that can clearly separate samples from closed and open sets, respectively, and the detection results outperform the competitive baselines.
- Distributionally Robust Ensemble of Lottery Tickets Towards Calibrated Sparse Network TrainingHitesh Sapkota, Dingrong Wang, ZHIQIANG TAO, and 1 more authorIn Thirty-seventh Conference on Neural Information Processing Systems Jun 2023
The recently developed sparse network training methods, such as Lottery Ticket Hypothesis (LTH) and its variants, have shown impressive learning capacity by finding sparse sub-networks from a dense one. While these methods could largely sparsify deep networks, they generally focus more on realizing comparable accuracy to dense counterparts yet neglect network calibration. However, how to achieve calibrated network predictions lies at the core of improving model reliability, especially when it comes to addressing the overconfident issue and out-of-distribution cases. In this study, we propose a novel Distributionally Robust Optimization (DRO) framework to achieve an ensemble of lottery tickets towards calibrated network sparsification. Specifically, the proposed DRO ensemble aims to learn multiple diverse and complementary sparse sub-networks (tickets) with the guidance of uncertainty sets, which encourage tickets to gradually capture different data distributions from easy to hard and naturally complement each other. We theoretically justify the strong calibration performance by showing how the proposed robust training process guarantees to lower the confidence of incorrect predictions. Extensive experimental results on several benchmarks show that our proposed lottery ticket ensemble leads to a clear calibration improvement without sacrificing accuracy and burdening inference costs. Furthermore, experiments on OOD datasets demonstrate the robustness of our approach in the open-set environment.