Machine Learning
Imprecisely Supervised Learning
Machine learning traditionally assumes that supervision labels for collected samples are either known precisely (in supervised learning) or not known at all (in unsupervised learning). Supervised learning typically relies on a domain expert playing the role of a teacher to provide the necessary labels of the data. Unsupervised learning is a class of problems in which one seeks to determine how the data is organized. It is distinguished from supervised learning in that learning algorithms are given unlabeled samples only.
Recent technological innovations have enabled us to collect large quantities of data. The proliferation of such data has facilitated knowledge discovery and pattern prediction using machine learning techniques. However, it has also made it very challenging for human annotators to label the massive data needed for proper supervision. Often, the majority of collected data is either unlabeled or labeled with imprecise supervision.
Ambiguous labels and inconsistent annotations are inevitable and give rise to a different set of machine learning problems concerning the efficient utilization, modeling and processing of imprecise supervision. Furthermore, annotation becomes imprecise not only because labeling data is expensive and time-consuming, but often also because the difficulty and complexity of the practical problems themselves hinder human annotators from acquiring objective and reliable labels. The complex nature of the problem makes the analysis of imprecisely labeled data an intensive research endeavor. In this project, we design effective algorithms based on modern machine learning theories to address several of these challenging problems, and we evaluate the proposed solutions in real-world scenarios by collaborating with industry partners and experts from across disciplines.
This project is supported by NSF IIS-1320586.
Project Demonstration Webpage: http://app.labhealthinfo.uconn.edu/ImpreciseSupervisionDemo/
Related Publications
On Multiplicative Multitask Feature Learning
Xin Wang, Jinbo Bi, Shipeng Yu, Jiangwen Sun
Advances in Neural Information Processing Systems 27 (NIPS2014), 2014
Latent Class Discovery and Prediction
This project aims to develop a toolbox of algorithms based on mathematical programming and statistics to solve the latent class discovery and prediction problem. In such a machine learning problem, researchers model two sets of variables: one set of variables (descriptors) is used to discover the hidden clusters of a population; the other set of variables (risk factors) is used to predict the identified clusters. This problem is frequently encountered in engineering, social and medical research disciplines, yet remains under-explored. The ability to accurately predict the latent classes from risk factors in the absence of observed descriptors will advance many of these disciplines. Cluster analysis using only descriptors can yield latent classes that are not predictable from risk factors. Existing multi-view data analytic methods do not perform feature learning to find the subspace in which a latent class is present, nor do they quantify the ability of the risk factors to predict latent classes. These methods therefore cannot find risk-factor-sensitive latent classes.
In this project, we systematically address the problem by deriving novel approaches, including sparse multi-view matrix decomposition, multi-objective optimization of co-training, learning methods using privileged information, and efficient metric learning methods. We will thoroughly study the algorithmic properties and theoretical justification of these approaches. The proposed solutions will be evaluated in the analysis of real-world, large-scale biological and engineering data. By collaborating with domain experts, we will use the new techniques to address previously difficult problems in disease subtyping, prediction of problematic human behaviors, and underwater acoustic communication.
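The problem setup described above (descriptors define the clusters, risk factors must predict them) can be illustrated with a small sketch. It implements only the naive two-stage baseline: cluster on a descriptor, then measure how well a risk factor recovers the clusters. It is not one of the project's methods, and the synthetic data, the function names `two_means` and `predictability`, and the nearest-centroid rule are all assumptions made for illustration.

```python
import random
from statistics import mean


def two_means(values, iters=20):
    """1-D k-means with k=2; returns a 0/1 cluster label per value."""
    lo, hi = min(values), max(values)  # start the two centroids at the extremes
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        group0 = [v for v, c in zip(values, labels) if c == 0]
        group1 = [v for v, c in zip(values, labels) if c == 1]
        if group0:
            lo = mean(group0)
        if group1:
            hi = mean(group1)
    return labels


def predictability(risk, labels):
    """Accuracy of a nearest-centroid rule predicting clusters from one risk factor."""
    c0 = mean(r for r, c in zip(risk, labels) if c == 0)
    c1 = mean(r for r, c in zip(risk, labels) if c == 1)
    predicted = [0 if abs(r - c0) <= abs(r - c1) else 1 for r in risk]
    return mean(int(p == c) for p, c in zip(predicted, labels))


# Synthetic data (assumption): a hidden 0/1 class drives the descriptor
# and one risk factor; a second risk factor is pure noise.
rng = random.Random(0)
hidden = [i % 2 for i in range(200)]
descriptor = [4 * z + rng.gauss(0, 1) for z in hidden]
informative_risk = [4 * z + rng.gauss(0, 1) for z in hidden]
noise_risk = [rng.gauss(0, 1) for _ in hidden]

labels = two_means(descriptor)                    # stage 1: discover clusters
acc_informative = predictability(informative_risk, labels)  # stage 2: predict them
acc_noise = predictability(noise_risk, labels)
print(acc_informative, acc_noise)
```

The two printed accuracies show the gap the project cares about: clusters found from descriptors alone are only useful for prediction when some risk factor happens to track them, which the two-stage baseline cannot guarantee.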
Related Publications
Multi-view Biclustering for Genotype-Phenotype Association Studies of Complex Diseases
Jiangwen Sun, Jinbo Bi, Henry R. Kranzler
Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM2013), pages 316-321, 2013
Multi-view Singular Value Decomposition for Disease Subtyping and Genetic Associations
Jiangwen Sun, Jinbo Bi, Henry R. Kranzler
BMC Genetics, 15:73, DOI: 10.1186/1471-2156-15-73, 2014
An Adaptive, Knowledge-Driven Medical Image Search Engine for Interactive Diffuse Parenchymal Lung Disease Quantification
Yimo Tao, Xiang Zhou, Jinbo Bi, Anna Jerebko, Matthias Wolf, Marcos Salganicoff and Arun Krishnan
Proceedings of SPIE Medical Imaging (oral presentation), pages 7260-7263, 2009
A Machine Learning Approach to College Drinking Prediction and Risk Factor Identification
Jinbo Bi, Jiangwen Sun, Yu Wu, Howard Tennen, Stephen Armeli
ACM Transactions on Intelligent Systems and Technology, 4(4):32:1-24, 2013
AdaBoost on Low-Rank PSD Matrices for Metric Learning
Jinbo Bi, Dijia Wu, Le Lu, Meizhu Liu, Yimo Tao and Matthias Wolf
Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR2011), pages 2617-2624, 2011
An Intelligent Risk Assessment Approach to Enhancing Distribution Resilience under Extreme Weather Events
Peng Zhang, Jinbo Bi, Guiling Wang
Technical Report of Department of Electrical and Computer Engineering, University of Connecticut, 2014
Bioinformatics
Disease Subtyping and Genomics
Despite great progress in molecular genetic methods, considerably less progress has been made in the refinement of phenotypes for substance dependence (SD) and other psychiatric disorders. SD, as defined by the Diagnostic and Statistical Manual of Mental Disorders (DSM), is clinically and etiologically heterogeneous. The DSM-defined traits are not optimal for gene finding efforts, which has substantially limited our understanding of the genetic etiology of SD. Thus, the differentiation of homogeneous subtypes of drug use, related behaviors, and co-occurring phenotypes could improve the identification of genetic variation that underlies the risk for SD and other complex traits. Existing methods are not adequate to tackle this task. The most sophisticated subtyping methods available perform unsupervised cluster analysis or latent class analysis of a disorder’s clinical features. Without theoretical guidance, blind cluster or latent class analysis can lead to subtypes of little utility in genetic analysis.
In this project, we will develop novel statistical methods to subtype SD traits quantitatively. Using data from >11,000 identically assessed subjects aggregated from family-based and case-control genetic studies (including genomewide association studies (GWAS)) of cocaine, opioid and alcohol dependence, we will identify clinical subtypes that are optimized with respect to heritability. Our preliminary results support the hypothesis that careful subtyping of substance use and related behaviors enhances the detection of genetic variants that contribute to the risk of addiction-related phenotypes and are not detected using a standard diagnostic approach. The primary aims of the proposed research are to develop: (1) bioinformatics methods to derive quantitative traits that are highly heritable in terms of traditional narrow-sense heritability and recently-defined SNP-based heritability; (2) integrative methods to jointly analyze phenotypic features and genetic markers to identify subtypes that are homogeneous phenotypically and genetically; and (3) genetic association approaches that are more efficient for subtype analysis. The derived subtypes and their association findings will be validated using multiple independent clinical samples.
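For reference, the two heritability notions named in aim (1) are commonly defined as variance ratios. The symbols below follow standard quantitative-genetics notation and are not taken from the project text.

```latex
h^2 = \frac{\sigma_A^2}{\sigma_P^2},
\qquad
h^2_{\mathrm{SNP}} = \frac{\sigma_g^2}{\sigma_P^2}
```

Here \sigma_P^2 is the total phenotypic variance, \sigma_A^2 is the additive genetic variance, and \sigma_g^2 is the additive variance captured by the genotyped SNPs.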
This project is supported by NIH R01 grant R01DA037349 and NSF DBI-1356655.
Composite Traits for Genetic Selection in Agriculture
Identifying genetic variation underlying complex phenotypes aids the understanding of their biology. Complex phenotypes characterized by a variety of features are often associated with substantial phenotypic variation. Current statistical methods are ineffective at addressing this phenotypic heterogeneity, and hence lack the power to associate genetic variants with the phenotype. This project aims to design new algorithms that differentiate homogeneous subtypes of a complex phenotype that are most informative in genetic analysis, and identify genetic variants that are associated with the subtypes but cannot be detected using the non-differentiated phenotype. The validity of the subtypes will be assessed at multiple scales, including evidence from genomic structure and phenotypic features. The new algorithms will be validated in the areas of genetic selection for complex traits of agriculturally important animals and plants, such as feed efficiency of cattle and adaptive traits of soybean. This project serves as a vehicle to train graduate students in multidisciplinary methods involving computer science and biology, and allows them to apply the methods in a variety of biological fields. A new course in bioinformatics will be developed for senior undergraduate students. High school educational materials will also be developed to teach students how to model biological data mathematically in order to solve biological problems.
This project is supported by NSF DBI-1356655 and NSF IIS-1447711.
Related Publications
A Sparse Integrative Cluster Analysis for Understanding Soybean Phenotypes
Jinbo Bi, Jiangwen Sun, Tingyang Xu, Jin Lu, Yansong Ma, Lijuan Qiu
The 5th International Workshop on Integrative Data Analysis in Systems Biology (IDASB 2014). Won the Best Paper Award for BIBM 2014 workshops, selected from 200 papers across 17 workshops.
High Performance Computing for Whole Genome Prediction
Personalized medicine is premised on predicting an individual’s genetic risk of disease. Modern animal and plant breeding programs select individuals or lines based on genotypic information, which circumvents the costly process of progeny testing and leads to greater efficiency. In these scientific areas, the ability to translate genotypic information into a quantitative prediction of disease risk or breeding targets is of utmost importance. To address the technical barriers to prediction using a whole-genome sample of genetic markers, there is an urgent need for new statistical models and high performance computing foundations that allow the concurrent use of millions of genetic markers and a large variety of variables describing a disease (or a breeding target). This project proposes to overcome several such barriers through an integrative approach that combines and develops techniques for data reduction, parallel computing and Bayesian inference. This interdisciplinary project provides educational opportunities for graduate and undergraduate students to gain first-hand research experience in computational aspects of genomics data analysis.
This project aims to understand how genome-wide markers help to predict not-yet-specified phenotypes of individuals and how the total genetic contribution to a phenotype can be better estimated. The primary goals of the proposed research are to develop: (1) parallel algorithms to reduce data comprising millions of genetic markers to lower dimensions; (2) sparse predictive modeling with correction for the uneven tagging issue due to linkage disequilibrium; (3) fast algorithms for multi-locus mapping problems; and (4) collaborative prediction methods to jointly predict multiple phenotypes. The proposed solutions will be tested in the analysis of large-scale biological data, including a dairy cattle database collected by the US Department of Agriculture and a dataset aggregated from multiple genetic studies of human diseases. This project will yield user-friendly software tools that can be broadly deployed in biological research areas that study the genetics of complex phenotypes.
This project is supported by NSF CCF-1514357 and NSF IIS-1447711.
Brain Science
Brain-Computer Interface (BCI) Enabled Memory Training for Schizophrenia
Advances in science supporting the growth and adaptability, or “neuroplasticity”, of human brain cells into late adulthood provide new promise for interventions designed to preserve and rehabilitate brain function. The merging of brain science and computer technology has created a consumer market for software designed to train brain functions, such as memory and attention, following the rationale that brain circuitry can be strengthened like muscles in response to repetitive exercise. So-called “computer-based cognitive training” software can be purchased privately at low cost, and can be used on mobile devices. However, while accessibility and portability are advantages of computer-based interventions, there is also an important and often overlooked shortcoming: it cannot be assumed that compromised brain areas, or normally expected approaches to performing cognitive training exercises, will be utilized during this training. Instead, compensatory mechanisms that have developed naturally around weakened or damaged brain tissue may be used preferentially. Therefore, as compensatory mechanisms are learned and reinforced during training, underutilization of the damaged tissue may lead to further weakening, rather than strengthening, of its natural function.
The proposed research will attempt to address a critical limitation of current cognitive training software through the development of a brain-computer interface (BCI) enabled training program. In a novel application of BCI, this project will examine how interactive control over training software functions could be used to monitor and reinforce targeted brain activity during training. This project will utilize a large pool of EEG recordings of patients and healthy community members performing a memory task. Analysis of archived data using advanced machine learning approaches will provide patterns of EEG activity associated with correct and failed memory trials, and differences in EEG that best distinguish patients from healthy comparison subjects. The memory training prototype will then feature BCI control over trial start, difficulty level, and user response according to parameters for optimal brain activity identified in the archived data.
This project is supported by VA grant I21 RX001731.
LifeRhythm: A Framework for Automatic and Pervasive Depression Screening Using Smartphones
Because of its high prevalence and significant health and economic impacts, depression is a profound public health problem. Currently, screening for depression is based on physician-administered interview tools or patient self-report. While physician-administered tools are more authoritative, availability is constrained both by cost and by lack of access to trained mental health professionals. Patient self-reporting, on the other hand, suffers from recall bias and inconsistent patient participation. In particular, neither approach satisfactorily addresses the chronic and recurring nature of depression, which requires frequent assessment to monitor onset and progress. To address depression as a public health problem, there is an urgent need for an objective, accurate, easily accessible and scalable depression screening tool. The ubiquitous adoption of smartphones around the world creates new opportunities for automatic and pervasive screening of depression across large populations. The education plan of this proposal includes developing and enhancing various undergraduate and graduate-level courses, disseminating the results to medical students through clinical supervision, and increasing the participation of under-represented groups in research and outreach activities.
The goal of this project is to develop LifeRhythm, a system for automatic and pervasive depression screening using smartphone data. LifeRhythm continuously monitors the behavioral rhythms of individuals through their smartphones, extracts normalized features from the raw data, and applies multiple machine-learning models for real-time diagnosis. The project applies LifeRhythm to two settings that have complementary strengths. The first setting uses “high-resolution” sensing data collected from smartphones, which provides extremely rich and descriptive behavioral data, allowing the best leverage for machine learning models. The second setting uses “low-resolution” wireless association meta-data collected passively from large-scale WiFi networks, which eliminates the need for data collection on smartphones and can be especially valuable for a large organization, where it could automatically provide depression screening of tens of thousands of people simultaneously at very little cost. Development of LifeRhythm will be coupled with several tightly related machine-learning research efforts, including novel techniques for collaborative prediction, integrative learning, modeling of temporal dynamics, and model refinement using multiplicative-weights-based techniques. Though this proposal is primarily focused on the development of screening tools, future work could naturally develop an associated intervention program. In addition, this research may lead to methodologies that are applicable to other mood disorders such as bipolar illness. The broader impacts will include dissemination of research results (and the annotated dataset) to the technical communities.
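The paragraph above mentions applying multiple models and refining them with multiplicative-weights-based techniques. As a rough illustration of that family of methods, the sketch below implements the classic Hedge-style multiplicative weights update over a set of models. It is an assumption that something of this shape is relevant here; the source does not describe LifeRhythm's actual refinement procedure, and the three models, their outputs, and the labels are made up.

```python
import math


def multiplicative_weights(model_predictions, outcomes, eta=0.5):
    """Hedge-style update: shrink each model's weight according to its loss.

    model_predictions[t][i] is model i's predicted probability in round t;
    outcomes[t] is the observed 0/1 label for that round.
    Returns the final normalized weights and the combined prediction per round.
    """
    n_models = len(model_predictions[0])
    weights = [1.0] * n_models
    combined = []
    for preds, y in zip(model_predictions, outcomes):
        total = sum(weights)
        # Weighted-average prediction, formed before the outcome is used.
        combined.append(sum(w * p for w, p in zip(weights, preds)) / total)
        # Absolute loss lies in [0, 1]; each weight is scaled by exp(-eta * loss).
        weights = [w * math.exp(-eta * abs(p - y)) for w, p in zip(weights, preds)]
    total = sum(weights)
    return [w / total for w in weights], combined


# Hypothetical data: model 0 tracks the label, model 1 is uninformative,
# model 2 is systematically wrong.
outcomes = [1, 0, 1, 1, 0, 0, 1, 0]
model_predictions = [
    [0.9 if y else 0.1, 0.5, 0.1 if y else 0.9] for y in outcomes
]

final_weights, combined = multiplicative_weights(model_predictions, outcomes)
print(final_weights)
```

After the eight rounds, most of the weight sits on the model that tracked the labels, so the combined prediction increasingly follows it. The learning rate `eta` controls how quickly poorly performing models are discounted.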
This project is supported by NSF grant IIS-1407205.
Clinical Decision Support
Easy Breathing: An Asthma Management Program
Asthma is a serious public health problem. In 2008, asthma affected almost 25 million people in the US, or 7.8% of the population. Worldwide, 300 million people suffer from asthma, with 250,000 annual deaths attributed to the disease. Annual expenditures for health and lost productivity due to asthma are estimated at over $20 billion, according to the National Heart Lung and Blood Institute. National guidelines for asthma diagnosis and management were first released in 1991 and have undergone multiple iterations as new science and new therapeutic approaches have emerged. The most recent guidelines (National Asthma Education and Prevention Program (NAEPP) Expert Panel Report 3) were released in 2007 and are 463 pages in length. NAEPP guidelines have not been widely adopted by primary care clinicians. Barriers to change are greater in primary care settings where resources are limited and patients face greater hardship.
Easy Breathing©, a disease management program created by Dr. Michelle Cloutier, is based upon the 2007 NAEPP Guidelines. Easy Breathing has improved the quality of pediatric and adult asthma management, and reduced unnecessary utilization of medical services. However, it uses paper-and-pencil dictations that have not been translated to electronic format. As medical institutions transition to Electronic Health Records (EHR), they have found Easy Breathing cumbersome to use in their paperless offices and new clinical workflows.
Our team in the Computer Science Department, led by Dr. Jinbo Bi, has made an initial effort to create an informatics model that translates Easy Breathing into computer logic flows and operations, and further integrates Easy Breathing into EHR systems. The success of this cross-discipline project will influence both computer engineering and medicine. Our team has worked closely with Dr. Cloutier and her team to implement the computerized Easy Breathing as a web-based system that provides clinical decision support and patient-friendly treatment plans.
Please check out the Easy Breathing website: http://app.labhealthinfo.uconn.edu/EasyBreathing/
Demo Username: demo1
Password: pass1
This project is supported by a UConn Faculty Large Grant awarded to Jinbo Bi, and a Donaghue Foundation R3 Grant.
Related Publications
Translating Effective Paper-based Disease Management into Electronic Medical Systems
Tingyang Xu, Jinbo Bi, Michelle M. Cloutier
IEEE International Conference on Healthcare Informatics (ICHI2014), 2014
An Intelligent Web-based Decision Support Tool for Enhancing Asthma Guideline Adherence
Jinbo Bi and Arun Abraham
Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, 2012
A Nursing Informatics Platform
According to the IMIA Special Interest Group on Nursing Informatics, Nursing Informatics is the “science and practice (that) integrates nursing, its information and knowledge, with management of information and communication technologies to promote the health of people, families, and communities worldwide.” Our project offers solutions in the core area of developing an information and communication platform to address inter-professional workflow needs across multiple venues for baby care and parental care. The platform, if constructed successfully, will create research methodologies to disseminate new knowledge into perinatal care practice.
In particular, we collaborate with the UConn Nursing School to build a web-based system that not only provides the regular educational materials usually given when a mother and newborn are discharged from the hospital, but also offers live nursing aids that simulate a nurse’s responses when a discharged patient calls back with urgent nursing needs. Moreover, if the live nursing aids identify a more severe situation or cannot offer the requested help, the system allows the patient to log in to a secure system and send instant messages to her or his nurses.
Our website comes with an administrator site whose software design allows nurses to edit the educational materials that appear on the website without knowing anything about the algorithms running in the background. The live nursing aids run algorithms that automatically implement flowcharts provided by an experienced nurse, reflecting how she responds to the various urgent needs of a patient.
We welcome bug reports and usability feedback. Please send your report to Dr. Jinbo Bi.
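One way a nurse-authored flowchart could be executed by software is to store it as data and walk it node by node. The sketch below is a minimal illustration under that assumption; the node names, questions, and actions in `FLOWCHART` are hypothetical placeholders, and the platform's actual representation is not described in the source.

```python
# Hypothetical flowchart stored as data: question nodes have "yes"/"no"
# branches, leaf nodes carry an "action". All content is placeholder text.
FLOWCHART = {
    "start": {
        "question": "Is this an urgent need?",
        "yes": "known_topic",
        "no": "show_materials",
    },
    "known_topic": {
        "question": "Does the need match a topic covered by the flowchart?",
        "yes": "give_guidance",
        "no": "message_nurse",
    },
    "show_materials": {"action": "Show the regular educational materials."},
    "give_guidance": {"action": "Show the nurse-authored guidance for this topic."},
    "message_nurse": {"action": "Offer secure instant messaging with a nurse."},
}


def run_flowchart(flowchart, answers, start="start"):
    """Walk the flowchart from `start`, using yes/no answers keyed by node id.

    Returns the action at the leaf that is reached and the path of visited nodes.
    """
    node_id, path = start, [start]
    while "action" not in flowchart[node_id]:
        branch = "yes" if answers[node_id] else "no"
        node_id = flowchart[node_id][branch]
        path.append(node_id)
    return flowchart[node_id]["action"], path


action, path = run_flowchart(FLOWCHART, {"start": True, "known_topic": False})
print(action)
print(path)
```

Keeping the flowchart as plain data rather than hard-coded branches is what would let a non-programmer edit it through an administrator interface: changing a question or rerouting a branch changes the dictionary, not the traversal code.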
Nursing Platform Website: http://app.labhealthinfo.uconn.edu/Nursing/