Robustness to Data • bnmonitor

Consider the asia dataset available from bnlearn. Details of the variables can be found in the bnlearn documentation.

library(bnlearn)
data(asia)
summary(asia)
#>    A          S          T          L          B          E          X       
#>  no :4958   no :2485   no :4956   no :4670   no :2451   no :4630   no :4431  
#>  yes:  42   yes:2515   yes:  44   yes: 330   yes:2549   yes: 370   yes: 569  
#>    D       
#>  no :2650  
#>  yes:2350

Consider two candidate BN models (Lauritzen and Spiegelhalter, 1988), which only differ in the edge from S to D.

asia_bn <- bnlearn::model2network("[A][S][T|A][L|S][B|S][D|B:E][E|T:L][X|E]")
bnlearn::graphviz.plot(asia_bn) 
#> Loading required namespace: Rgraphviz

asia_bn_alt <-  bnlearn::model2network("[A][S][T|A][L|S][B|S][E|T:L][X|E][D|B:E:S]")
bnlearn::graphviz.plot(asia_bn_alt)

Global monitor

A first useful diagnostic is the global_monitor, reporting the contribution of each vertex to the log-likelihood of the model.

library(bnmonitor)
asia <- asia[sample(1:nrow(asia),50),]
glob_asia <- node_monitor(dag = asia_bn, df = asia, alpha = 3)
glob_asia_alt <- node_monitor(dag = asia_bn_alt, df = asia, alpha = 3)
glob_asia
#>   Vertex         Score
#> 1      A  8.622186e+00
#> 2      B  3.236561e+01
#> 3      D -2.220446e-15
#> 4      E  4.819082e+00
#> 5      L  1.923433e+01
#> 6      S  3.520300e+01
#> 7      T  5.769555e+00
#> 8      X  7.566835e+00
glob_asia_alt
#>   Vertex         Score
#> 1      A  8.622186e+00
#> 2      B  3.236561e+01
#> 3      D -2.442491e-15
#> 4      E  4.819082e+00
#> 5      L  1.923433e+01
#> 6      S  3.520300e+01
#> 7      T  5.769555e+00
#> 8      X  7.566835e+00

In the alternative model, Dysnopea contributes more to the log-likelihood.

Global monitors can be plotted giving a quick view of the decomposition of the log-likelihood. The darker the color, the more substantial the contribution of a vertex.

plot(glob_asia)

Node monitor

There are two variants of node monitors.

The marginal node monitor computes the probability of the \(i\)th observation in the data set in turn after passing the evidence of the \(i-1\)th cases in the data set.
The conditional node monitor computes the probability of the \(i\)th observation in the data set after passing evidence of the \(i-1\)th cases in the data set, and the evidence for all nodes in the \(i\)th case except the node of interest.

As a quick survey of the nodes, the node_monitor command computes the marginal and conditional monitors for the final observation in the data set.

node_asia <- final_node_monitor(dag = asia_bn, df = asia)
node_asia
#>   node marg.z.score cond.z.score
#> 1    A   -0.1597335    -0.161389
#> 2    S   -2.4718659    -2.052991
#> 3    T          NaN          NaN
#> 4    L    0.9056249     1.075533
#> 5    B   -4.9089630    -3.949560
#> 6    E    0.9056249     1.075533
#> 7    X    0.9056249     1.075533
#> 8    D   -4.8971373    -5.798998

The scores indicate a poor fit of the probability distributions specified for the Smoking, Bronchitis, and Dysnopea nodes, since these are larger than 1.96 in absolute value. Plots can also be created to give a visual counterpart of the node monitors.

plot(node_asia, which = "marginal")

As an illustration we investigate further the fit of the variable Dysnopea.

The sequential marginal monitor seq_marg_monitor gives us a closer look at which particular forecasts in the data set might cause this poor fit. We examine the sequential monitor for both candidate models.

seq_asia <- seq_marg_monitor(dag = asia_bn, df = asia, node.name = "D")
seq_asia_alt <- seq_marg_monitor(dag = asia_bn_alt, df = asia, node.name = "D")
seq_asia
#> Marginal Node Monitor for D 
#>  Minimum      -0.1692618 
#>  Maximum      1.732051
seq_asia_alt
#> Marginal Node Monitor for D 
#>  Minimum      -0.1692618 
#>  Maximum      1.732051

plot(seq_asia)

plot(seq_asia_alt)

Both monitors indicate that for some observations there is a poor fit (score above 1.96 in absolute value). In particular for the alternative models the marginal monitor has larger values in absolute value.

A similar analysis can be conducted with seq_marg_monitor, which would show that the model fits well (not reported here).

Parent Child monitor

Once a vertex has been identified as a poor fit, further investigation can be carried out to check for which values of the parents the model provides a bad representation. This can be achieved with the seq_pa_ch_monitor function.

As an illustration consider the asia_bn BN, the vertex D (Dysnopea), the parent variable B (Bronchitis) which can take values yes and no.

asia_pa_ch1 <- seq_pa_ch_monitor(dag = asia_bn, df = asia, node.name = "D", pa.names =  "B", pa.val =  "yes", alpha = 3)
asia_pa_ch1
#> Parent Child Node Monitor 
#>  Minimum      -0.933608 
#>  Maximum      1.358783
asia_pa_ch2 <- seq_pa_ch_monitor(dag = asia_bn, df = asia, node.name = "D", pa.names = "B", pa.val = "no", alpha = 3)
asia_pa_ch2
#> Parent Child Node Monitor 
#>  Minimum      -1.243054 
#>  Maximum      0.3804437
plot(asia_pa_ch1)

plot(asia_pa_ch2)

For this model, Dysnopea is adequately modeled for both values of Bronchitis, since most scores largely fall in the recommended range.

Influential observations

The last robustness tool is the absolute value of the log-likelihood ratio between a model learnt without one observation and the one learnt with the full dataset. Larger values are associated to atomic events which influence the structural learning.

influence <- influential_obs(dag = asia_bn, data = asia, alpha = 3)
head(influence)
#>       A   S  T   L   B   E   X   D    score
#> 2556 no  no no  no  no  no  no  no 5.855111
#> 4670 no yes no  no yes  no  no yes 6.855438
#> 1878 no  no no  no yes  no  no  no 6.760820
#> 2531 no yes no yes  no yes yes  no 7.819889
#> 3466 no yes no  no  no  no  no yes 7.581375
#> 2778 no yes no  no  no  no  no  no 7.581375
plot(influence)