PVF: A novel metric for understanding AI systems’ vulnerability against SDCs in model parameters

  • We’re introducing parameter vulnerability factor (PVF), a novel metric for understanding and measuring AI systems’ vulnerability against silent data corruptions (SDCs) in model parameters.
  • PVF can be tailored to different AI models and tasks, adapted to different hardware faults, and even extended to the training phase of AI models.
  • We’re sharing results of our own case studies using PVF to measure the impact of SDCs in model parameters, as well as potential methods of detecting SDCs in model parameters.

Reliability is a critical aspect of any successful AI implementation. But the growing complexity and diversity of AI hardware systems also brings an increased risk of hardware faults such as bit flips. Manufacturing defects, aging components, or environmental factors can lead to data corruptions – errors or alterations in data that can occur during storage, transmission, or processing and result in unintended changes in information.

Silent data corruptions (SDCs), where an undetected hardware fault results in erroneous application behavior, have become increasingly prevalent and difficult to detect. Within AI systems, an SDC can cause what’s known as parameter corruption, where AI model parameters are corrupted and their original values are altered.

When this occurs during AI inference/serving, it can potentially lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services.

Figure 1 shows an example of this, where a single bit flip can drastically alter the output of a ResNet model.

Figure 1: Flipping a random bit of one parameter in the first convolution (conv) layer in ResNet-18 drastically alters the model’s output.
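The kind of single-bit corruption shown in Figure 1 is easy to reproduce at the bit level. Here's a minimal sketch (plain Python, no ML framework; `flip_bit` is an illustrative helper, not code from the paper) that reinterprets a float32 weight as its 32-bit IEEE-754 encoding and XORs a single bit:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0-31) of a value's IEEE-754 float32 encoding."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

# Flipping a high-order exponent bit can change a weight by dozens of orders
# of magnitude, which is why a single flip can derail an entire inference.
w = 0.01
print(flip_bit(w, 30))
```

Flips in the exponent bits (bits 23–30) are typically the most damaging, while low-order mantissa flips often perturb the value too little to change the model's output.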


With this escalating threat in mind, there are two important questions: How vulnerable are AI models to parameter corruptions? And how do different components (such as modules and layers) of the models exhibit different vulnerability levels to parameter corruptions?

Answering these questions is a critical part of delivering reliable AI systems and services and offers valuable insights for guiding AI hardware system design, such as when assigning AI model parameters or software variables to hardware blocks with differing fault protection capabilities. Moreover, it can provide essential information for formulating strategies to detect and mitigate SDCs in AI systems in an efficient and effective manner.

Parameter vulnerability factor (PVF) is a novel metric we’ve introduced with the aim of standardizing the quantification of AI model vulnerability against parameter corruptions. PVF is a versatile metric that can be tailored to different AI models/tasks and is also adaptable to different hardware fault models. Moreover, PVF can be extended to the training phase to evaluate the effects of parameter corruptions on a model’s convergence capability.

What is PVF?

PVF is inspired by the architectural vulnerability factor (AVF) metric used within the computer architecture community. We define a model parameter’s PVF as the probability that a corruption in that particular model parameter will result in an incorrect output. Similar to AVF, this statistical concept can be derived from statistically extensive and meaningful fault injection (FI) experiments.

PVF has several features:

Parameter-level quantitative analysis

As a quantitative metric, PVF focuses on parameter-level vulnerability, quantifying the probability that a corruption in a particular model parameter will lead to an incorrect model output. This “parameter” can be defined at different scales and granularities, such as an individual parameter or a group of parameters.

Scalability across AI models/tasks

PVF is scalable and applicable across a wide range of AI models, tasks, and hardware fault models.

Provides insights for guiding AI system design

PVF can provide valuable insights for AI system designers, guiding them in making informed decisions about balancing fault protection with performance and efficiency. For example, engineers might leverage PVF to help map more vulnerable parameters to better-protected hardware blocks and explore tradeoffs between latency, power, and reliability by enabling a surgical approach to fault tolerance at selective locations instead of a catch-all/none approach.

Can be used as a standard metric for AI vulnerability/resilience evaluation

PVF has the potential to unify and standardize such practices, making it easier to compare the reliability of different AI systems/parameters and fostering open collaboration and progress in the industry and research community.

How PVF works

Similar to AVF as a statistical concept, PVF needs to be derived through a large number of FI experiments that are statistically meaningful. Figure 2 shows the overall flow for computing PVF through an FI process. We’ve provided a case study on open-source DLRM inference, with more details and example case studies available in our paper.
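The FI flow boils down to a Monte-Carlo loop: run inference with and without an injected fault, compare against the fault-free ("golden") output, and report the fraction of mismatches. A minimal sketch of that loop (the callback names `infer` and `inject_fault` are illustrative placeholders, not from the paper; a real harness would target an actual model and fault model):

```python
import random

def estimate_pvf(infer, params, inject_fault, inputs, trials=1000, seed=0):
    """Estimate a parameter group's PVF as the fraction of fault-injection
    trials whose corrupted output differs from the fault-free (golden) one.

    infer(params, x)          -> model output (stand-in for real inference)
    inject_fault(params, rng) -> corrupted copy of params (the fault model)
    """
    rng = random.Random(seed)
    incorrect = 0
    for _ in range(trials):
        x = rng.choice(inputs)
        golden = infer(params, x)
        faulty = infer(inject_fault(list(params), rng), x)
        if faulty != golden:  # what counts as "incorrect" is task-specific
            incorrect += 1
    return incorrect / trials

# Toy example: a one-parameter threshold "model"; the fault randomly flips
# the weight's sign, which only matters when the input would fire the unit.
infer = lambda p, x: int(p[0] * x > 0.5)
def sign_flip(p, rng):
    if rng.random() < 0.5:
        p[0] = -p[0]
    return p

pvf = estimate_pvf(infer, [1.0], sign_flip, inputs=[0.2, 0.8], trials=2000)
print(f"estimated PVF: {pvf:.3f}")
```

In the toy above the fault is only "activated" on inputs that cross the threshold, which mirrors the sparsity effect discussed for embedding tables below: a corruption that is never exercised contributes nothing to PVF.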

Figure 2: Computing PVF through FI.

Figure 3 illustrates the PVF of three DLRM parameter components – embedding table, bot-MLP, and top-MLP – under 1, 2, 4, 8, 16, 32, 64, and 128 bit flips during each inference. We observe different vulnerability levels across different components of DLRM. For example, under a single bit flip, the embedding table has relatively low PVF; this is attributed to embedding tables being highly sparse, so parameter corruptions are only activated when the particular corrupted parameter is activated by the corresponding sparse feature. However, top-MLP can have a PVF of 0.4% under even a single bit flip. This is significant – for every 1,000 inferences, four will be incorrect. This highlights the importance of protecting specific vulnerable parameters for a given model based on the PVF measurement.

Figure 3: The PVF of DLRM parameters under random bit flips.

We observe that with 128 bit flips during each inference, PVF increases to 40% and 10% for the top-MLP and bot-MLP components respectively, and we also observe several NaN values. The top-MLP component has higher PVF than bot-MLP. This is attributed to top-MLP being closer to the final model output, and hence having less chance of being mitigated by the inherent error-masking capability of neural layers.

The applicability of PVF

PVF is a versatile metric where the definition of an “incorrect output” (which will vary based on the model/task) can be adapted to suit user requirements. To adapt PVF to various hardware fault models, the method of calculating PVF remains consistent, as depicted in Figure 2. The only modification required is the manner in which the fault is injected, based on the assumed fault models.
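Concretely, swapping the hardware fault model means swapping only the injection function while the rest of the Figure 2 flow stays fixed. A sketch of two interchangeable injectors (illustrative names and fault models, not from the paper; a real study would match the assumed hardware behavior):

```python
import random
import struct

def single_bit_flip(value: float, rng: random.Random) -> float:
    """Fault model A: flip one uniformly random bit of the float32 encoding."""
    (i,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", i ^ (1 << rng.randrange(32))))
    return out

def stuck_at_zero_byte(value: float, rng: random.Random) -> float:
    """Fault model B: one random byte of the encoding reads back as all zeros."""
    (i,) = struct.unpack("<I", struct.pack("<f", value))
    mask = 0xFF << (8 * rng.randrange(4))
    (out,) = struct.unpack("<f", struct.pack("<I", i & ~mask & 0xFFFFFFFF))
    return out
```

Either function can be dropped into the same FI loop; the resulting PVF numbers then describe vulnerability under that particular fault assumption.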

Moreover, PVF can be extended to the training phase to evaluate the effects of parameter corruptions on a model’s convergence capability. During training, the model’s parameters are iteratively updated to minimize a loss function. A corruption in a parameter could potentially disrupt this learning process, preventing the model from converging to an optimal solution. By applying the PVF concept during training, we could quantify the probability that a corruption in each parameter would result in such a convergence failure.

Dr. DNA and further exploration avenues for PVF

The logical progression after understanding AI vulnerability to SDCs is to identify and minimize their impact on AI systems. To initiate this, we’ve introduced Dr. DNA, a method designed to detect and mitigate SDCs that occur during deep learning model inference. Specifically, we formulate and extract a set of unique SDC signatures from the distribution of neuron activations (DNA), based on which we propose early-stage detection and mitigation of SDCs during DNN inference.

We performed an extensive evaluation across 10 representative DNN models used in three common tasks (vision, GenAI, and segmentation), including ResNet, Vision Transformer, EfficientNet, and YOLO, under four different error models. Results show that Dr. DNA achieves a 100% SDC detection rate for most cases, a 95% detection rate on average, and a >90% detection rate across all cases, representing a 20-70% improvement over baselines. Dr. DNA can also mitigate the impact of SDCs by effectively recovering DNN model performance with <1% memory overhead and <2.5% latency overhead.
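Dr. DNA's actual activation-distribution signatures are more elaborate than this, but the core idea can be illustrated with a simplified range-based stand-in (hypothetical helper names; assumed, not the paper's method): profile each neuron's activation range on clean inputs, then flag inferences whose activations escape that range.

```python
def profile_ranges(activations_per_run):
    """Record per-neuron min/max over clean (fault-free) runs.
    activations_per_run: list of equal-length activation vectors."""
    lo = [min(col) for col in zip(*activations_per_run)]
    hi = [max(col) for col in zip(*activations_per_run)]
    return lo, hi

def looks_corrupted(activations, lo, hi, slack=0.1):
    """Flag a run whose activations escape the profiled range (plus slack).
    A simplified stand-in for Dr. DNA's SDC signatures: large bit-flip-induced
    excursions (or NaNs) fall far outside the clean distribution."""
    for a, l, h in zip(activations, lo, hi):
        margin = slack * (h - l)
        if a != a or a < l - margin or a > h + margin:  # a != a catches NaN
            return True
    return False

clean = [[0.1, 0.5, 0.9], [0.2, 0.4, 1.0], [0.0, 0.6, 0.8]]
lo, hi = profile_ranges(clean)
print(looks_corrupted([0.1, 0.5, 0.9], lo, hi))  # False: within clean range
print(looks_corrupted([0.1, 1e8, 0.9], lo, hi))  # True: SDC-like outlier
```

Detectors of this family are cheap at inference time, which is consistent with the small memory and latency overheads reported above.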

Read the research papers

PVF (Parameter Vulnerability Factor): A Novel Metric for Understanding AI Vulnerability Against SDCs in Model Parameters

Dr. DNA: Combating Silent Data Corruptions in Deep Learning using Distribution of Neuron Activations