Publications
Journal papers
-
Learning from positive and unlabeled data: a survey. Jessa Bekker and Jesse Davis. Machine Learning, 2020. [Bib] [Abs] [arXiv] [PDF] [Video]
@article{bekker2020mlj,
  author  = {Bekker, Jessa and Davis, Jesse},
  title   = {{L}earning from positive and unlabeled data: a survey},
  journal = {{M}achine {L}earning},
  volume  = {109},
  number  = {4},
  pages   = {719--760},
  eprint  = {1811.04820},
  month   = may,
  year    = {2020},
  doi     = {10.1007/s10994-020-05877-5},
  url     = {https://lirias.kuleuven.be/3028883}
}
Learning from positive and unlabeled data, or PU learning, is the setting where a learner only has access to positive examples and unlabeled data. The assumption is that the unlabeled data can contain both positive and negative examples. This setting has attracted increasing interest within the machine learning literature, as this type of data naturally arises in applications such as medical diagnosis and knowledge base completion. This article surveys the current state of the art in PU learning. It proposes seven key research questions that commonly arise in this field and provides a broad overview of how the field has tried to address them.
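A key identity in this literature, due to Elkan and Noto and central to the survey's SCAR setting, helps fix ideas. The sketch below uses our notation, with s indicating whether an example is labeled and c = Pr(s=1 | y=1) the label frequency:

\Pr(s{=}1 \mid x) \;=\; \Pr(s{=}1 \mid y{=}1)\,\Pr(y{=}1 \mid x) \;=\; c \cdot \Pr(y{=}1 \mid x)
\qquad\Longrightarrow\qquad
\Pr(y{=}1 \mid x) \;=\; \frac{\Pr(s{=}1 \mid x)}{c}

A classifier trained to predict labeledness s can thus be converted into a classifier for the true class y by dividing its output by c, which is why estimating c is such a central question in the field.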
Conference papers
-
Interactive Multi-level Prosody Control for Expressive Speech Synthesis. Tobias Cornille, Fengna Wang, and Jessa Bekker. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2022. [Bib] [Abs] [PDF]
@inproceedings{cornille2022icassp,
  author    = {Cornille, Tobias and Wang, Fengna and Bekker, Jessa},
  title     = {Interactive Multi-level Prosody Control for Expressive Speech Synthesis},
  booktitle = {Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing},
  month     = may,
  year      = {2022},
  url       = {https://lirias.kuleuven.be/3704076}
}
Recent neural-based text-to-speech (TTS) models are able to produce highly natural speech. To synthesize expressive speech, the prosody of the speech has to be modeled and then predicted or controlled during synthesis. However, intuitive control over prosody remains elusive. Some techniques only allow control over the global style of the speech and do not allow fine-grained adjustments. Other techniques create fine-grained prosody embeddings, but these are difficult to manipulate to obtain a desired speaking style. We thus present ConEx, a novel model for expressive speech synthesis, which can produce speech in a certain speaking style, while also allowing local adjustments to the prosody of the generated speech. The model builds upon the non-autoregressive architecture of FastSpeech and includes a reference encoder to learn global prosody embeddings, and a vector quantized variational autoencoder to create fine-grained prosody embeddings. To realize prosody manipulation, a new interactive method is proposed. Experiments on two datasets show that the model enables multi-level prosody control.
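To illustrate the quantization step at the core of such fine-grained prosody embeddings, here is a toy numpy sketch of a VQ-VAE's nearest-codebook lookup; all shapes, sizes, and names are invented for illustration and are not taken from ConEx.

import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))   # 64 learned prosody codes of dimension 16 (invented sizes)
frames = rng.normal(size=(120, 16))    # encoder output: one embedding per frame

# Snap each frame embedding to its nearest codebook vector (squared Euclidean).
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
codes = dists.argmin(axis=1)           # discrete prosody code per frame
quantized = codebook[codes]            # quantized embeddings that feed the decoder

In a model like this toy one, a local prosody edit amounts to changing the discrete codes of the affected frames while leaving the rest of the utterance untouched.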
-
Unifying Knowledge Base Completion with PU Learning to Mitigate the Observation Bias. Jonas Schouterden, Jessa Bekker, Jesse Davis, and Hendrik Blockeel. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022. [Bib] [Abs] [PDF] [Code]
@inproceedings{schouterden2022aaai,
  author    = {Schouterden, Jonas and Bekker, Jessa and Davis, Jesse and Blockeel, Hendrik},
  title     = {Unifying Knowledge Base Completion with PU Learning to Mitigate the Observation Bias},
  booktitle = {{P}roceedings of the 36th {AAAI} {C}onference on {A}rtificial {I}ntelligence},
  month     = feb,
  year      = {2022},
  url       = {https://lirias.kuleuven.be/3654887}
}
Methods for Knowledge Base Completion (KBC) reason about a knowledge base (KB) in order to derive new facts that should be included in the KB. This is challenging for two reasons. First, KBs only contain positive examples. This complicates model evaluation, which needs both positive and negative examples. Second, the facts that were selected for inclusion in the knowledge base are most likely not an i.i.d. sample of the true facts, due to the way knowledge bases are constructed. In this paper, we focus on rule-based approaches, which traditionally address the first challenge by making assumptions that enable identifying negative examples, which in turn makes it possible to compute a rule’s confidence or precision. However, they largely ignore the second challenge, which means that their estimates of a rule’s confidence can be biased. This paper approaches rule-based KBC through the lens of PU learning, which can cope with both challenges. We make three contributions. (1) We provide a unifying view that formalizes the relationship between multiple existing confidence measures based on (i) the assumption they make about the selection mechanism and (ii) how their accuracy depends on it. (2) We introduce two new confidence measures that can mitigate known biases by using propensity scores that quantify how likely a fact is to be included in the KB. (3) We show through theoretical and empirical analysis that taking the bias into account improves the confidence estimates, even when the propensity scores are not known exactly.
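As a rough illustration of the propensity-score idea (the notation is ours, and the formula is a generic inverse-propensity-weighted estimate, not necessarily one of the paper's two measures): if a rule r predicts the set of facts P(r), and each true fact t is included in the KB with probability e(t), then every observed fact can stand in for 1/e(t) true facts, giving a bias-corrected confidence

\widehat{\mathrm{conf}}(r) \;=\; \frac{\sum_{t \,\in\, P(r) \,\cap\, \mathrm{KB}} 1/e(t)}{\lvert P(r) \rvert}.

Rules whose predictions fall in under-observed parts of the KB are then no longer unfairly penalized.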
-
Beyond the Selected Completely At Random Assumption for Learning from Positive and Unlabeled Data. Jessa Bekker, Pieter Robberechts, and Jesse Davis. In Proceedings of the 2019 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2019. [Bib] [Abs] [arXiv] [PDF] [Supp] [Code]
@inproceedings{bekker2019ecml,
  author    = {Bekker, Jessa and Robberechts, Pieter and Davis, Jesse},
  title     = {{B}eyond the {S}elected {C}ompletely {A}t {R}andom {A}ssumption for {L}earning from {P}ositive and {U}nlabeled {D}ata},
  booktitle = {{P}roceedings of the 2019 {E}uropean {C}onference on {M}achine {L}earning and {P}rinciples and {P}ractice of {K}nowledge {D}iscovery in {D}atabases},
  editor    = {Brefeld, Ulf and Fromont, {\'E}lisa and Hotho, Andreas and Knobbe, Arno J. and Maathuis, Marloes H. and Robardet, C{\'e}line},
  publisher = {Springer, Cham},
  volume    = {11907},
  pages     = {71--85},
  year      = {2019},
  doi       = {10.1007/978-3-030-46147-8_5},
  isbn      = {978-3-030-46147-8},
  url       = {https://lirias.kuleuven.be/2817987}
}
Most positive and unlabeled data is subject to selection biases. The labeled examples can, for example, be selected from the positive set because they are easier to obtain or more obviously positive. This paper investigates how learning can be enabled in this setting. We propose and theoretically analyze an empirical-risk-based method for incorporating the labeling mechanism. Additionally, we investigate under which assumptions learning is possible when the labeling mechanism is not fully understood and propose a practical method to enable this. Our empirical analysis supports the theoretical results and shows that taking into account the possibility of a selection bias, even when the labeling mechanism is unknown, improves the trained classifiers.
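A sketch of the propensity-weighted empirical risk that this style of method builds on (notation ours; e(x) = Pr(s=1 | y=1, x) is the propensity score, ℓ a loss, and s_i the labeling indicator): each labeled example is counted 1/e(x_i) times as a positive, compensated by a negative term, while unlabeled examples count as negatives:

\hat{R}(f) \;=\; \frac{1}{n} \left[ \sum_{i:\, s_i=1} \left( \frac{1}{e(x_i)}\,\ell\big(f(x_i), 1\big) + \Big(1 - \frac{1}{e(x_i)}\Big)\,\ell\big(f(x_i), 0\big) \right) + \sum_{i:\, s_i=0} \ell\big(f(x_i), 0\big) \right]

In expectation over the labeling mechanism, this recovers the risk on fully labeled data, which is what makes learning under a known labeling mechanism possible.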
-
Estimating the class prior in positive and unlabeled data through decision tree induction. Jessa Bekker and Jesse Davis. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018. [Bib] [Abs] [PDF] [Supp] [Slides] [Code]
@inproceedings{bekker2018aaai,
  author    = {Bekker, Jessa and Davis, Jesse},
  title     = {{E}stimating the class prior in positive and unlabeled data through decision tree induction},
  booktitle = {{P}roceedings of the 32nd {AAAI} {C}onference on {A}rtificial {I}ntelligence},
  editor    = {McIlraith, Sheila A. and Weinberger, Kilian Q.},
  publisher = {AAAI Press},
  pages     = {2712--2719},
  month     = feb,
  year      = {2018},
  url       = {https://lirias.kuleuven.be/1656838}
}
For tasks such as medical diagnosis and knowledge base completion, a classifier may only have access to positive and unlabeled examples, where the unlabeled data consists of both positive and negative examples. Knowing the true class prior is one way to enable learning from this type of data. In this paper, we propose a simple yet effective method for estimating the class prior, by estimating the probability that a positive example is selected to be labeled. Our key insight is that subdomains of the data give a lower bound on this probability. This lower bound gets closer to the real probability as the ratio of labeled examples increases. Finding such subsets can naturally be done via top-down decision tree induction. Experiments show that our method produces estimates that are as accurate as those of state-of-the-art methods, while being an order of magnitude faster.
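The key insight admits a one-line justification (a sketch under SCAR, in our notation, with c = Pr(s=1 | y=1) the label frequency): for any subdomain A of the instance space,

\Pr(s{=}1 \mid x \in A) \;=\; c \cdot \Pr(y{=}1 \mid x \in A) \;\le\; c,

so the observed fraction of labeled examples in A is a lower bound on c, and the bound tightens as A contains mostly positives. Decision tree induction searches for such nearly-pure subdomains, and the class prior then follows as \alpha = \Pr(s{=}1)/c.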
-
Positive and Unlabeled Relational Classification Through Label Frequency Estimation. Jessa Bekker and Jesse Davis. In Inductive Logic Programming, 2018. [Bib] [Abs] [PDF] [Poster]
@inproceedings{bekker2017ilp,
  author    = {Bekker, Jessa and Davis, Jesse},
  editor    = {Lachiche, Nicolas and Vrain, Christel},
  title     = {Positive and Unlabeled Relational Classification Through Label Frequency Estimation},
  booktitle = {Inductive Logic Programming},
  publisher = {Springer International Publishing},
  address   = {Cham},
  pages     = {16--30},
  year      = {2018},
  isbn      = {978-3-319-78090-0}
}
Many applications, such as knowledge base completion and automated diagnosis of patients, only have access to positive examples and lack the negative examples required by standard relational learning techniques, which consequently suffer under the closed-world assumption. The corresponding propositional problem is known as Positive and Unlabeled (PU) learning. In this field, it is known that using the label frequency (the fraction of true positive examples that are labeled) makes learning easier. This notion has not yet been explored in the relational domain. The goal of this work is twofold: (1) to explore whether using the label frequency is also useful when working with relational data, and (2) to propose a method for estimating the label frequency from relational positive and unlabeled data. Our experiments confirm the usefulness of knowing the label frequency and of our estimate.
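For intuition, the propositional version of the label-frequency trick fits in a few lines; a minimal sketch assuming SCAR, scikit-learn, and a known or estimated label frequency c (this is the classic propositional recipe, not the relational method proposed in the paper):

import numpy as np
from sklearn.linear_model import LogisticRegression

def pu_predict_proba(X_train, s_train, X_test, c):
    """Label-frequency correction: under SCAR, Pr(s=1|x) = c * Pr(y=1|x),
    so a model of labeled-vs-unlabeled divided by c estimates Pr(y=1|x).
    s_train is 1 for labeled (hence positive) examples and 0 for unlabeled."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, s_train)
    return np.clip(clf.predict_proba(X_test)[:, 1] / c, 0.0, 1.0)

The paper's question is how to obtain and exploit c when examples are interlinked relational structures rather than flat feature vectors.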
-
Learning the structure of probabilistic sentential decision diagrams. Yitao Liang, Jessa Bekker, and Guy Van den Broeck. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), 2017. [Bib] [Abs] [PDF] [Supp] [Poster] [Slides] [Code]
@inproceedings{bekker2017uai,
  author    = {Liang, Yitao and Bekker, Jessa and {Van den Broeck}, Guy},
  title     = {{L}earning the structure of probabilistic sentential decision diagrams},
  booktitle = {{P}roceedings of the 33rd {C}onference on {U}ncertainty in {A}rtificial {I}ntelligence ({UAI})},
  editor    = {Elidan, Gal and Kersting, Kristian and Ihler, Alexander T.},
  publisher = {AUAI Press},
  pages     = {10},
  month     = aug,
  year      = {2017},
  url       = {https://lirias.kuleuven.be/1599168}
}
The probabilistic sentential decision diagram (PSDD) was recently introduced as a tractable representation of probability distributions that are subject to logical constraints. Meanwhile, efforts in tractable learning have achieved great success in inducing complex joint distributions from data without constraints, while guaranteeing efficient exact probabilistic inference, for instance by learning arithmetic circuits (ACs) or sum-product networks (SPNs). This paper studies the efficacy of PSDDs for the standard tractable learning task without constraints and develops the first PSDD structure learning algorithm, called LearnPSDD. Experiments on standard benchmarks show competitive performance, despite the fact that PSDDs are more tractable and more restrictive than their alternatives. LearnPSDD compares favorably to SPNs, particularly in terms of model size, which is a proxy for tractability. We report state-of-the-art likelihood results on six datasets. Moreover, LearnPSDD retains the ability to learn PSDD structures in probability spaces subject to logical constraints, which is beyond the reach of other representations.
-
Tractable Learning for Complex Probability Queries. Jessa Bekker, Jesse Davis, Arthur Choi, Adnan Darwiche, and Guy Van den Broeck. In Advances in Neural Information Processing Systems, 2015. [Bib] [Abs] [PDF] [Supp] [Code]
@inproceedings{bekker2015nips,
  author    = {Bekker, Jessa and Davis, Jesse and Choi, Arthur and Darwiche, Adnan and {Van den Broeck}, Guy},
  title     = {Tractable Learning for Complex Probability Queries},
  booktitle = {Advances in Neural Information Processing Systems},
  editor    = {Cortes, C. and Lawrence, N. and Lee, D. and Sugiyama, M. and Garnett, R.},
  publisher = {Curran Associates, Inc.},
  volume    = {28},
  pages     = {2242--2250},
  year      = {2015},
  url       = {https://proceedings.neurips.cc/paper/2015/file/bb7946e7d85c81a9e69fee1cea4a087c-Paper.pdf}
}
Tractable learning aims to learn probabilistic models where inference is guaranteed to be efficient. However, the particular class of queries that is tractable depends on the model and underlying representation. Usually this class is MPE or conditional probabilities Pr(x|y) for joint assignments x, y. We propose a tractable learner that guarantees efficient inference for a broader class of queries. It simultaneously learns a Markov network and its tractable circuit representation, in order to guarantee and measure tractability. Our approach differs from earlier work by using Sentential Decision Diagrams (SDD) as the tractable language instead of Arithmetic Circuits (AC). SDDs have desirable properties, which more general representations such as ACs lack, that enable basic primitives for Boolean circuit compilation. This allows us to support a broader class of complex probability queries, including counting, threshold, and parity, in polytime.
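The primitive behind all of these queries is weighted model counting on the learned circuit. As a reminder of the standard definition (not specific to this paper): for a propositional theory \Delta and a weight w(\ell) per literal,

\mathrm{WMC}(\Delta, w) \;=\; \sum_{m \,\models\, \Delta} \;\prod_{\ell \in m} w(\ell).

A complex query such as a counting, threshold, or parity constraint is itself compiled into an SDD, conjoined with the model's SDD via the polytime Apply operation, and the resulting circuit is then weighted-model-counted.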
Workshop papers/abstracts
-
Learning from positive and unlabeled data under the selected at random assumption. Jessa Bekker and Jesse Davis. In Proceedings of The Learning with Imbalanced domains: Theory and Application Workshop @ ECML 2018, 2018. [Bib] [Abs] [arXiv] [PDF] [Code]
@inproceedings{bekker2018lidta,
  author    = {Bekker, Jessa and Davis, Jesse},
  title     = {{L}earning from positive and unlabeled data under the selected at random assumption},
  booktitle = {{P}roceedings of {T}he {L}earning with {I}mbalanced domains: {T}heory and {A}pplication {W}orkshop {@} {ECML} 2018},
  publisher = {Journal of Machine Learning Research},
  volume    = {94},
  pages     = {8--22},
  month     = sep,
  year      = {2018},
  url       = {https://lirias.kuleuven.be/1999809}
}
For many interesting tasks, such as medical diagnosis and web page classification, a learner only has access to some positively labeled examples and many unlabeled examples. Learning from this type of data requires making assumptions about the true distribution of the classes and/or the mechanism that was used to select the positive examples to be labeled. The commonly made assumptions, separability of the classes and positive examples being selected completely at random, are very strong. This paper proposes a weaker assumption: that the positive examples are selected at random, conditioned on some of the attributes. To learn under this assumption, an EM method is proposed. Experiments show that our method is not only very capable of learning under this assumption, but also outperforms the state of the art for learning under the selected completely at random assumption.
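A minimal sketch of such an EM loop, under stated assumptions: the model classes (logistic regression for both the classifier f ≈ Pr(y=1|x) and the propensity model e ≈ Pr(s=1|y=1,x)), the initialization, and the update scheme are illustrative choices of ours, not the paper's exact procedure.

import numpy as np
from sklearn.linear_model import LogisticRegression

def sar_em(X, s, n_iters=20):
    # X: feature matrix; s: 0/1 numpy array, 1 = labeled (hence positive).
    f = LogisticRegression(max_iter=1000).fit(X, s)  # crude init: labels as classes
    e = LogisticRegression(max_iter=1000).fit(X, s)
    for _ in range(n_iters):
        pf = f.predict_proba(X)[:, 1]
        pe = e.predict_proba(X)[:, 1]
        # E-step: labeled examples are positive by definition; for unlabeled
        # ones, Bayes' rule gives Pr(y=1 | x, s=0) = f(1 - e) / (1 - f e).
        py = np.where(s == 1, 1.0,
                      pf * (1 - pe) / np.clip(1 - pf * pe, 1e-12, None))
        # M-step for f: fit on fractional labels via duplicated, weighted rows.
        XX = np.vstack([X, X])
        yy = np.r_[np.ones(len(X), dtype=int), np.zeros(len(X), dtype=int)]
        f.fit(XX, yy, sample_weight=np.r_[py, 1.0 - py])
        # M-step for e: among (expected) positives, predict being labeled.
        e.fit(X, s, sample_weight=py)
    return f, e

Note that the E-step uses the current propensity estimates, so the class model and the labeling mechanism are refined jointly rather than in isolation.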
-
Measuring adverse drug effects on multimorbidity using tractable Bayesian networks. Jessa Bekker, Arjen Hommersom, Martijn Lappenschaar, and Jesse Davis. In Proceedings of Machine Learning for Health @ NIPS 2016, 2016. [Bib] [Abs] [arXiv] [PDF] [Poster]
@inproceedings{bekker2016ml4hc,
  author    = {Bekker, Jessa and Hommersom, Arjen and Lappenschaar, Martijn and Davis, Jesse},
  title     = {{M}easuring adverse drug effects on multimorbidity using tractable {B}ayesian networks},
  booktitle = {Proceedings of Machine Learning for Health @ NIPS 2016},
  month     = dec,
  year      = {2016},
  volume    = {abs/1612.03055},
  url       = {https://lirias.kuleuven.be/1656690}
}
Managing patients with multimorbidity often results in polypharmacy: the prescription of multiple drugs. However, the long-term effects of specific combinations of drugs and diseases are typically unknown. In particular, drugs prescribed for one condition may result in adverse effects for another. To investigate which types of drugs may affect the further progression of multimorbidity, we query models of diseases and prescriptions that are learned from primary care data. State-of-the-art tractable Bayesian network representations, on which such complex queries can be computed efficiently, are employed for these large medical networks. Our results confirm that prescriptions may lead to unintended negative consequences in further development of multimorbidity in cardiovascular diseases. Moreover, a drug treatment for one disease group may affect diseases of another group.
-
Learning the structure of probabilistic SDDs. Jessa Bekker, Arthur Choi, and Guy Van den Broeck. Women in Machine Learning, 2016. [Bib] [Abs] [PDF] [Poster]
@misc{bekker2016wiml,
  author       = {Bekker, Jessa and Choi, Arthur and {Van den Broeck}, Guy},
  title        = {{L}earning the structure of probabilistic {SDD}s},
  howpublished = {{W}omen in {M}achine {L}earning},
  month        = dec,
  year         = {2016},
  url          = {https://lirias.kuleuven.be/1656681}
}
Many domains, such as health care, benefit from machine learning only if a certain degree of accuracy can be guaranteed for the predictions. For techniques that model uncertainty, such as Bayesian networks and other graphical models, it is in general infeasible to make predictions with such guarantees. While we can measure the accuracy of the model, inference of even simple queries such as marginals is #P-hard, and therefore the predictions are often approximations with unknown accuracy. The domain of tractable learning provides a solution by restricting the learned models to those that do allow exact inference, so that the predictions are as accurate as the learned model itself. The key to tractable learning is the use of a tractable representation for the model. A tractable representation essentially represents the calculations needed to do inference; the size of this representation is therefore the complexity of inference. By keeping this tractable representation small enough, exact inference will always be feasible.

There exist different types of tractable representations that differ in the types of tractable queries, the tractable operations (needed during learning), and compactness. Sentential Decision Diagrams (SDDs) support the widest range of tractable queries and operations, but pay for this by being less compact [1]. Probabilistic SDDs (PSDDs) can be more compact than SDDs (they are at least as compact) while supporting the same tractable queries. SDDs represent Boolean formulas. An SDD that represents a probabilistic model has weights for its variables and supports efficient weighted model counting to make predictions. Combining two SDDs is also efficient; therefore, any query can be answered by compiling it into an SDD and combining it with the model. PSDDs are a variation on SDDs with parameters on the edges instead of the variables. PSDDs are at least as compact as SDDs because any SDD can be represented by a PSDD, but not the other way around. They can answer the same queries as SDDs because they too can be combined with an SDD that represents the query. PSDDs are therefore an attractive tractable representation, but until now they were not used for tractable learning; a parameter learner does exist [2].

In this work, we present the first structure learner for PSDDs. It starts with an initial model and incrementally changes it to improve the accuracy. For the incremental changes, we introduce new operators that change the distribution without changing the possible worlds. This method can naturally incorporate constraints by using them as the initial model and is therefore ideal for learning in structured spaces or on top of expert knowledge, but it can also handle unconstrained cases.
-
Ordering-based search for tractable Bayesian networks. Jessa Bekker, Guy Van den Broeck, and Jesse Davis. Women in Machine Learning, 2015. [Bib] [Abs] [PDF] [Poster]
@misc{bekker2015wiml,
  author       = {Bekker, Jessa and {Van den Broeck}, Guy and Davis, Jesse},
  title        = {{O}rdering-based search for tractable {B}ayesian networks},
  howpublished = {{W}omen in {M}achine {L}earning},
  month        = dec,
  year         = {2015},
  url          = {https://lirias.kuleuven.be/1656675}
}
The goal of tractable Bayesian network learning is to learn Bayesian networks (BNs) where inference is guaranteed to be efficient. The state-of-the-art techniques for tractable learning are based on knowledge compilation: they keep a tractable representation of the network next to the classic graphical model. The tractable representation allows inference with a complexity polynomial in the size of this representation and therefore provides a measure of tractability. The current best tractable BN learner is ACBN (also known as LearnAC), which uses Arithmetic Circuits (ACs) as the tractable representation. It greedily splits the conditional distributions; the splits are scored by the increase in likelihood, with the AC size as a penalty.

We propose using Sentential Decision Diagrams (SDDs), a subclass of ACs. SDDs have desirable properties that are absent in more general representations such as ACs. Among other things, these properties translate into the ability to not only efficiently add but also delete edges in the corresponding Bayesian network. Deleting edges is out of the question with ACs. This makes it possible to use learning algorithms other than greedy ones for learning tractable BNs. Using SDDs, we can thus transform existing BN learners into tractable BN learners.

In this work, we investigate using the ordering-based search algorithm for learning tractable BNs. In classic ordering-based search, one looks for a Bayesian network that optimizes a decomposable score function such as BIC, MDL, or BDe. The key idea of ordering-based search is that even though finding the structure that optimizes the score function is NP-hard, the same task becomes efficient when the variable ordering is fixed: because of the decomposability of the score function, one only needs to find the optimal set of parents for each node, which is straightforward. Ordering-based search explores the space of variable orderings by swapping adjacent variables.

In tractable learning, a different scoring function is used: a trade-off between the likelihood of the model and the inference complexity, i.e., the size of the tractable representation. Unfortunately, the SDD size is not decomposable. Different BN edges change the SDD size in different ways, depending on the current SDD structure. We therefore adjust ordering-based search with an extra search dimension: the number of edges in the tree CPT of each node. We propose SDDBN, an algorithm that searches for a BN structure that optimizes a trade-off between likelihood and tractability. It uses two search operators: “Swap” and “Split”. “Swap” swaps two adjacent variables in the variable order; it adds parents to the swapped nodes until the size of the SDD is about the same as before the swap, which allows a fair comparison between the two orderings. “Split” adds a parent to a node by adding a split to the tree CPT. We use random restarts to cope with local maxima.

Another contribution is an empirical evaluation. We conduct experiments that compare the performance of the models learned with SDDBN against the models learned by LearnAC. The preliminary results look promising.
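For reference, the classic ordering-based search skeleton that SDDBN builds on looks roughly like this (a generic Teyssier–Koller-style sketch with an assumed decomposable local_score(node, parents) function and bounded in-degree k; it omits the Swap/Split operators and the SDD-size bookkeeping that the proposed method adds):

from itertools import combinations
import random

def best_parents(node, predecessors, local_score, k=2):
    # With the ordering fixed, each node independently picks its best
    # parent set among its predecessors (decomposability makes this legal).
    candidates = [set(ps) for r in range(k + 1)
                  for ps in combinations(predecessors, r)]
    return max(candidates, key=lambda ps: local_score(node, ps))

def score_order(order, local_score, k=2):
    total, structure = 0.0, {}
    for i, node in enumerate(order):
        ps = best_parents(node, order[:i], local_score, k)
        structure[node] = ps
        total += local_score(node, ps)
    return total, structure

def ordering_based_search(variables, local_score, n_restarts=5, n_steps=200):
    best = (float("-inf"), None)
    for _ in range(n_restarts):            # random restarts against local maxima
        order = random.sample(list(variables), len(variables))
        score, structure = score_order(order, local_score)
        for _ in range(n_steps):
            i = random.randrange(len(order) - 1)
            order[i], order[i + 1] = order[i + 1], order[i]  # swap neighbours
            new_score, new_structure = score_order(order, local_score)  # rescored from scratch, for clarity
            if new_score >= score:
                score, structure = new_score, new_structure
            else:                          # revert a worsening swap
                order[i], order[i + 1] = order[i + 1], order[i]
        if score > best[0]:
            best = (score, structure)
    return best

The SDD-size penalty breaks exactly the decomposability that makes best_parents cheap here, which is why the extra search dimension and the size-matched Swap operator described above are needed.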
PhD Dissertation
-
Learning from Positive and Unlabeled Data. Jessa Bekker, 2018. [Bib] [Abs] [PDF]
@phdthesis{bekker2018phd,
  author = {Bekker, Jessa},
  title  = {{L}earning from {P}ositive and {U}nlabeled {D}ata},
  school = {Informatics Section, Faculty of Engineering Science, Science, Engineering and Technology Group},
  pages  = {42},
  month  = dec,
  year   = {2018},
  note   = {Davis, J. (supervisor)},
  url    = {https://lirias.kuleuven.be/2334467}
}
The goal of binary classification is to train a model that can distinguish between examples belonging to one of two classes: positive and negative. Traditional algorithms for training such a model assume access to labeled examples, where the label is the desired output of the model: positive or negative. However, in practice the labels might be hard to obtain for various reasons: the need for expert knowledge, costly tests, or just the vast number of examples. Therefore, it is desirable to have algorithms that can learn when only a fraction of the examples are labeled.

In some situations, only a subset of the positive examples are labeled, but none of the negative examples. For example, the task of predicting whether a patient has a disease can use patients who were diagnosed with that disease as positive examples; however, being undiagnosed does not imply not having the disease. Or, for movie recommendation, watched movies are positive examples of good movies, but unwatched movies can also be good. The field that aims to design algorithms to learn from this kind of data is called learning from positive and unlabeled data, or PU learning for short.

To enable learning in this harder setting, assumptions need to be made about the distribution of the classes, the labeling mechanism that dictates which examples get labeled, or both. This dissertation studies the commonly made assumptions in PU learning with two objectives: 1) to exploit them for improving PU algorithms, and 2) to propose more realistic assumptions that still enable learning. The assumption of interest for this dissertation is the commonly made selected completely at random (SCAR) assumption, which assumes that the set of labeled examples is a uniform subset of the positive examples. In this case, learning is possible if the class prior, that is, the fraction of positive examples in the data, is known. Therefore, estimating the class prior is a crucial subtask in PU learning.

This dissertation has four main contributions. First, it provides a comprehensive survey of the field. Second, it proposes a novel class prior estimation algorithm that is as accurate as the state-of-the-art methods while being an order of magnitude faster. Third, it investigates whether the SCAR assumption can also be leveraged when learning from relational PU data and shows how the propositional techniques can be modified to this end. Fourth, it proposes a more realistic assumption: selected at random (SAR), which assumes that the probability for a positive example to be selected to be labeled can depend on its attributes. This is a big step forward for PU learning, as it enables addressing numerous new real problems. For this setting, the dissertation shows how a known labeling mechanism can be taken into account by the learning algorithm, investigates which additional assumptions are necessary to estimate the labeling mechanism from the data, and proposes a practical algorithm to learn in this setting when the labeling mechanism is unknown.