Consider the asia
dataset available from bnlearn
. Details of the variables can be found in the bnlearn
documentation.
library(bnlearn)
data(asia)
summary(asia)
#> A S T L B E X
#> no :4958 no :2485 no :4956 no :4670 no :2451 no :4630 no :4431
#> yes: 42 yes:2515 yes: 44 yes: 330 yes:2549 yes: 370 yes: 569
#> D
#> no :2650
#> yes:2350
Consider two candidate BN models (Lauritzen and Spiegelhalter, 1988), which only differ in the edge from S
to D
.
asia_bn <- bnlearn::model2network("[A][S][T|A][L|S][B|S][D|B:E][E|T:L][X|E]")
bnlearn::graphviz.plot(asia_bn)
#> Loading required namespace: Rgraphviz
asia_bn_alt <- bnlearn::model2network("[A][S][T|A][L|S][B|S][E|T:L][X|E][D|B:E:S]")
bnlearn::graphviz.plot(asia_bn_alt)
Global monitor
A first useful diagnostic is the global_monitor
, reporting the contribution of each vertex to the log-likelihood of the model.
library(bnmonitor)
asia <- asia[sample(1:nrow(asia),50),]
glob_asia <- node_monitor(dag = asia_bn, df = asia, alpha = 3)
glob_asia_alt <- node_monitor(dag = asia_bn_alt, df = asia, alpha = 3)
glob_asia
#> Vertex Score
#> 1 A 8.622186e+00
#> 2 B 3.236561e+01
#> 3 D -2.220446e-15
#> 4 E 4.819082e+00
#> 5 L 1.923433e+01
#> 6 S 3.520300e+01
#> 7 T 5.769555e+00
#> 8 X 7.566835e+00
glob_asia_alt
#> Vertex Score
#> 1 A 8.622186e+00
#> 2 B 3.236561e+01
#> 3 D -2.442491e-15
#> 4 E 4.819082e+00
#> 5 L 1.923433e+01
#> 6 S 3.520300e+01
#> 7 T 5.769555e+00
#> 8 X 7.566835e+00
In the alternative model, Dysnopea contributes more to the log-likelihood.
Global monitors can be plotted giving a quick view of the decomposition of the log-likelihood. The darker the color, the more substantial the contribution of a vertex.
plot(glob_asia)
Node monitor
There are two variants of node monitors.
The marginal node monitor computes the probability of the \(i\)th observation in the data set in turn after passing the evidence of the \(i-1\)th cases in the data set.
The conditional node monitor computes the probability of the \(i\)th observation in the data set after passing evidence of the \(i-1\)th cases in the data set, and the evidence for all nodes in the \(i\)th case except the node of interest.
As a quick survey of the nodes, the node_monitor
command computes the marginal and conditional monitors for the final observation in the data set.
node_asia <- final_node_monitor(dag = asia_bn, df = asia)
node_asia
#> node marg.z.score cond.z.score
#> 1 A -0.1597335 -0.161389
#> 2 S -2.4718659 -2.052991
#> 3 T NaN NaN
#> 4 L 0.9056249 1.075533
#> 5 B -4.9089630 -3.949560
#> 6 E 0.9056249 1.075533
#> 7 X 0.9056249 1.075533
#> 8 D -4.8971373 -5.798998
The scores indicate a poor fit of the probability distributions specified for the Smoking, Bronchitis, and Dysnopea nodes, since these are larger than 1.96 in absolute value. Plots can also be created to give a visual counterpart of the node monitors.
plot(node_asia, which = "marginal")
As an illustration we investigate further the fit of the variable Dysnopea.
The sequential marginal monitor seq_marg_monitor
gives us a closer look at which particular forecasts in the data set might cause this poor fit. We examine the sequential monitor for both candidate models.
seq_asia <- seq_marg_monitor(dag = asia_bn, df = asia, node.name = "D")
seq_asia_alt <- seq_marg_monitor(dag = asia_bn_alt, df = asia, node.name = "D")
seq_asia
#> Marginal Node Monitor for D
#> Minimum -0.1692618
#> Maximum 1.732051
seq_asia_alt
#> Marginal Node Monitor for D
#> Minimum -0.1692618
#> Maximum 1.732051
plot(seq_asia)
plot(seq_asia_alt)
Both monitors indicate that for some observations there is a poor fit (score above 1.96 in absolute value). In particular for the alternative models the marginal monitor has larger values in absolute value.
A similar analysis can be conducted with seq_marg_monitor
, which would show that the model fits well (not reported here).
Parent Child monitor
Once a vertex has been identified as a poor fit, further investigation can be carried out to check for which values of the parents the model provides a bad representation. This can be achieved with the seq_pa_ch_monitor
function.
As an illustration consider the asia_bn
BN, the vertex D
(Dysnopea), the parent variable B
(Bronchitis) which can take values yes
and no
.
asia_pa_ch1 <- seq_pa_ch_monitor(dag = asia_bn, df = asia, node.name = "D", pa.names = "B", pa.val = "yes", alpha = 3)
asia_pa_ch1
#> Parent Child Node Monitor
#> Minimum -0.933608
#> Maximum 1.358783
asia_pa_ch2 <- seq_pa_ch_monitor(dag = asia_bn, df = asia, node.name = "D", pa.names = "B", pa.val = "no", alpha = 3)
asia_pa_ch2
#> Parent Child Node Monitor
#> Minimum -1.243054
#> Maximum 0.3804437
plot(asia_pa_ch1)
plot(asia_pa_ch2)
For this model, Dysnopea is adequately modeled for both values of Bronchitis, since most scores largely fall in the recommended range.
Influential observations
The last robustness tool is the absolute value of the log-likelihood ratio between a model learnt without one observation and the one learnt with the full dataset. Larger values are associated to atomic events which influence the structural learning.
influence <- influential_obs(dag = asia_bn, data = asia, alpha = 3)
head(influence)
#> A S T L B E X D score
#> 2556 no no no no no no no no 5.855111
#> 4670 no yes no no yes no no yes 6.855438
#> 1878 no no no no yes no no no 6.760820
#> 2531 no yes no yes no yes yes no 7.819889
#> 3466 no yes no no no no no yes 7.581375
#> 2778 no yes no no no no no no 7.581375
plot(influence)