Medical ML Model Template and Rules:

Why was this ML model made? Why was it necessary to create this model? What does it provide that we do not already have?
What techniques were used to create this model? Is this model made with Support Vector Machines, Logistic Regression, Decision Trees, Random Forests, Gradient Boosting, or something else? Why is this specific method being used? How does this model address fairness, explainability, and bias?
How many people were in the study? How big was this dataset? How does it compare to datasets of other models? Is this dataset still growing? Is the training and testing split 80-20 or a different ratio?
What were the results of this model? Report using the standard methods described below. Focus on the confusion matrix, recall, precision, and accuracy to create a standardized explanation.
How does this model compare to other non-algorithmic assessments and existing algorithms? Does this model demonstrate improvement compared to humans and other models?
Is this a viable method? Is this model a replacement for humans or a supplement? How would it be used as a supplement alongside human intervention? Should results be reported both with and without human intervention?
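The 80-20 training and testing split mentioned in the questions above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `train_test_split`, the `seed` parameter, and the placeholder patient IDs are assumptions made for this example and do not come from any particular paper.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    # A fixed seed makes the split reproducible, so it can be reported and audited.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder patient IDs, used only for illustration.
patients = list(range(100))
train, test = train_test_split(patients, test_fraction=0.2)
print(len(train), len(test))
```

When one patient contributes several records, splitting by patient rather than by record is generally advisable, so that the same person does not appear in both the training and the testing set.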

 

Standardizing Terminology - for experts as well as novices. These all must be reported at the top of the paper to give a clear understanding of the results of this model.
True Positive (TP): This is an outcome where the model correctly predicts the positive class.

False Positive (FP): Also known as a Type I error, this is an outcome where the model incorrectly predicts the positive class.

True Negative (TN): This is an outcome where the model correctly predicts the negative class.

False Negative (FN): Also known as a Type II error, this is an outcome where the model incorrectly predicts the negative class.

Confusion Matrix: This is used to organize and display the TP, FP, TN, and FN. This is a 2x2 grid with TP on the top left, FP on the top right, FN on the bottom left, and TN on the bottom right. Here is an example layout:

TP FP
FN TN
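As a rough sketch, the four cells can be counted from a list of true labels and a list of predicted labels. The function name `confusion_counts` and the example labels are hypothetical; 1 is assumed to mark the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, and TN for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # correctly predicted positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative (Type I error)
        elif actual == positive:
            fn += 1  # predicted negative, actually positive (Type II error)
        else:
            tn += 1  # correctly predicted negative
    return tp, fp, fn, tn


# Made-up labels, used only for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)

# Print in the same layout as above: TP FP on the first row, FN TN on the second.
print(f"{tp} {fp}")
print(f"{fn} {tn}")
```

Other tools may order the cells differently, so it helps to label the rows and columns whenever a matrix is reported.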

Accuracy: This is the proportion of true results (both true positives and true negatives) among the total number of cases examined. It’s calculated as:

Accuracy = (True Positives + True Negatives) / Total Predictions

Precision: Also known as positive predictive value, this metric is the ratio of true positives to the sum of true and false positives. It’s a measure of a classifier’s exactness. A low precision indicates a high number of false positives.

Precision = True Positives / (False Positives + True Positives)

Recall: Also known as sensitivity, this metric is the ratio of true positives to the sum of true positives and false negatives. It’s a measure of a classifier’s completeness. A low recall indicates a high number of false negatives.

Recall = True Positives / (False Negatives + True Positives)
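The three formulas above can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the function name `classification_metrics` and the counts passed to it are made up for this example and are not results from any study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn

    def ratio(numerator, denominator):
        # Return None instead of dividing by zero when a metric is undefined.
        return numerator / denominator if denominator else None

    return {
        "accuracy": ratio(tp + tn, total),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


# Hypothetical counts for 1,000 screened patients.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these made-up counts the accuracy is 0.96 while the recall is only 0.75, which is one reason all three metrics should be reported together rather than accuracy alone.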

 

Here are some example papers:

Diabetic Retinopathy Example

This paper on diabetic retinopathy could benefit from our standards: Diabetic Retinopathy paper

Here are some things this paper should improve:

  1. The paper should articulate the necessity of the proposed model and discuss its advantages over human verification. This would help readers understand the value and impact of the research.
  2. The authors have opted for an AutoML approach and a smaller sample size than previous models. It would be beneficial if they justified these choices to give readers a better understanding of their methodology.
  3. The paper should include a comprehensive report of key metrics. These metrics include a confusion matrix, accuracy, precision, and recall. The clear reporting of these metrics is crucial for evaluating this model's performance.

Here is an example of the template filled out for Diabetic Retinopathy: Diabetic Retinopathy Example

ECG Diagnosis

This paper serves as a solid foundation for an example. ECG Diagnosis Paper

Here are some things this paper did right and some things that could continue to be worked on:

  1. The paper effectively discusses the necessity of the proposed model and compares its performance with human capabilities. However, it would be beneficial to include a discussion on how the model performs when used in conjunction with human verification.
  2. The authors clearly state the model used in their research and present its results. To enhance understanding, they could elaborate on why they chose this particular model and how they optimized it.
  3. The paper includes almost all the key metrics, which is commendable. For better readability and quick reference, these metrics could be neatly organized and presented at the beginning of the paper.

Here is an example of the template filled out for ECG: ECG Example

More “Must Dos” for AI in medicine and drug discovery

Note: This list is non-exhaustive and will be updated continuously as AI and research continue to grow and change.
Requirements:

1. Signs of remission: The effectiveness of new medicines should be gauged by their ability to induce remission. When there are already established and well-studied drugs available, new drugs must demonstrate their benefits not only through inducing remission but also in prolonging life. This becomes even more critical in situations where there are no established drugs or in the case of terminal diseases. In such scenarios, the new drug must significantly prolong life while also ensuring a minimum acceptable quality of life for the patient. To truly understand the impact and effectiveness of the new drug, a published comparison with established medicines should be made available. This allows for a clear understanding of the differences and potential advantages of the new treatment. The goal of any new treatment should always be to improve the patient’s health and quality of life, and remission is a key indicator of this improvement.

2. An AI model is not a clinical study: These models have the power to find new and unique solutions we have not seen before and to run trials at a rapid pace. However, their results cannot yet be guaranteed without a physical study, and professionals are needed to verify and test the findings. Patients with any illness are already suffering and should not be blindly subjected to model results without proper testing. Relying solely on AI models without proper verification could lead to inaccurate diagnoses or treatments, potentially causing harm to patients. Therefore, while AI has immense potential in healthcare, it must be used responsibly and ethically, with patient safety and well-being as the top priority.

3. Publish findings and research on medical AI models: For the betterment of the field as a whole, all findings, both successes and failures, should be published for everyone to study and learn from so that future models can improve. All reports must include a confusion matrix, accuracy, precision, and recall results clearly at the top of the paper. These are only the minimum required to help explain the reliability of these models. Failing to adhere to these practices could lead to a lack of transparency and reproducibility in AI research, which could in turn hinder progress and compromise the reliability of AI models in healthcare.

4. Transparency/white box models: The decision-making process of the AI should be a transparent, white-box process that is explainable to healthcare professionals. A white-box process means that the decision-making process of the AI is clear, understandable, and explainable. This transparency is crucial, especially in healthcare, where understanding the reasoning behind predictions can impact patient care and outcomes. AI algorithms should also be thoroughly validated using large and diverse datasets before deployment, which helps ensure the accuracy and range of training. Even when a model is transparent, professionals must continuously monitor its accuracy, and this kind of monitoring is only possible with a white-box model. Black-box models, on the other hand, are harder to interpret and understand. Typically, a black-box model exposes only its inputs and outputs, which makes it less desirable in situations where transparency and auditing are critical.
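One model type that is usually regarded as white-box is logistic regression, because each feature's contribution to a prediction can be shown to a clinician. The sketch below assumes a tiny hand-written logistic model; the feature names, weights, and intercept are entirely hypothetical and are not taken from any real or validated model.

```python
import math

# Hypothetical weights for illustration only; not from any real model.
WEIGHTS = {"age_decades": 0.30, "systolic_bp_per_10": 0.20, "smoker": 0.80}
INTERCEPT = -5.0


def explain_prediction(features):
    """Return the predicted probability and each feature's contribution."""
    contributions = {name: WEIGHTS[name] * features[name] for name in WEIGHTS}
    logit = INTERCEPT + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))  # logistic function
    return probability, contributions


# A made-up patient record.
patient = {"age_decades": 6.0, "systolic_bp_per_10": 14.0, "smoker": 1}
probability, contributions = explain_prediction(patient)

# List the largest contributions first so a reviewer sees what drove the score.
for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:+.2f}")
print(f"predicted probability: {probability:.3f}")
```

Because every term in the score is visible, a professional can check whether the model's reasoning is clinically plausible, which is the kind of auditing described above.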

5. AutoML does not replace data scientists: An AutoML system begins with a raw dataset and builds a machine learning model that is ready for deployment. The problem is that these models have not been vetted for accuracy and are black-box models. They are meant to give nonprofessionals a way to create basic models for low-stakes use. They are not held to the same standards as a professional data scientist's models and will be less accurate in comparison. For medical purposes, all models should be created and vetted by professionals. This is why data scientists play a crucial role in this process. They bring a deep understanding of data and the ability to interpret it in ways that machines currently cannot. They can make judgments about the relevance and accuracy of data and understand the context of the problem, which is crucial for building effective models. Data scientists also interpret the models, understand their limitations, and validate their results. They consider the ethical implications of models, including fairness, privacy, and potential misuse. While AutoML can increase accessibility, the expertise and skills of a data scientist are still crucial in the AI industry, especially in fields like healthcare where the stakes are high.

6. Publish a database of patients with consent: Opening access to the database allows public audits for accuracy, allows the data to be cleaned up for usability, and could give similar models access to more datasets. This improves the dataset for your own use and gives researchers in similar fields more reliable data. It is also crucial to remember that any sharing of patient data must be done with the utmost respect for privacy and data security. Anonymizing the data and obtaining informed consent from patients are essential practices in this process.
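One small step toward preparing records for sharing might look like the sketch below. It assumes each record is a dictionary with a `consent` flag and a `patient_id`; those field names, the list of direct identifiers, and the function name `pseudonymize` are hypothetical. Replacing identifiers with keyed hashes is pseudonymization only and does not by itself guarantee anonymity, since the remaining fields may still identify a person.

```python
import hashlib
import hmac

# Hypothetical field names that directly identify a person.
DIRECT_IDENTIFIERS = {"name", "address", "phone"}


def pseudonymize(record, secret_key):
    """Return a shareable copy of the record, or None if consent is missing."""
    if not record.get("consent"):
        return None  # never share records without informed consent
    # A keyed hash lets records from the same patient stay linked
    # without exposing the original ID. The key must be kept private.
    token = hmac.new(
        secret_key, str(record["patient_id"]).encode("utf-8"), hashlib.sha256
    ).hexdigest()
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key not in ("patient_id", "consent")
    }
    cleaned["patient_token"] = token
    return cleaned


# Made-up record and key, used only for illustration.
record = {"patient_id": 17, "name": "Jane Doe", "consent": True, "diagnosis": "DR"}
shared = pseudonymize(record, secret_key=b"replace-with-a-private-key")
print(sorted(shared))
```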

7. Don’t Ignore Any Context: AI should always consider the context in which it is used. A solution that works in one setting may not work in another. Datasets should be provided with as much information as reasonably possible to ensure coverage of all possible outcomes, and training data should be as complete as possible about each patient. Ignoring the context could lead to inaccurate predictions or inappropriate solutions, potentially causing harm or inconvenience. Therefore, it’s crucial to ensure that AI systems are designed and trained to be context-aware and adaptable.

Why do we need these standards?

With medical AI gaining in popularity, universal standards must be created. Research reports should incorporate accepted nomenclature and definitions. Our suggested goal for AI, given its potential, is patient-centered, with an emphasis on remission and cure. Our standards are designed to enhance the use of AI. We want to ensure that new research is ethical and effective while researchers expand their knowledge about the conditions and diagnostic categories they are studying. With society expected to spend $52 billion on healthcare AI in 2026, we would expect a corresponding benefit to individual health. Currently, rates of GI and breast cancer are going up among young people, and new respiratory illnesses are being found in children; in 2020 we had a pandemic that resulted in 7.5 million deaths. Visual Intelligence LLC is creating these standards to improve overall population health.