1. Mechanism & Architecture

Understanding the Probabilistic Engine

To master Generative AI, one must abandon the mental model of a "Database" or "Search Engine." An LLM (Large Language Model) is a massive statistical function that predicts the next token in a sequence.

Tokenization & The "Cost" of Words

LLMs do not read words; they read Tokens. A token is roughly 0.75 of an English word; common fragments such as "ing", "the", or "45" each count as a single token.
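
To see this in practice, the sketch below uses the open-source tiktoken tokenizer to show how a short order fragments into tokens (assumptions: the package is installed, and the "cl100k_base" encoding used by several recent OpenAI models is representative).

# Sketch: inspect how clinical text is split into tokens.
# Assumes the open-source `tiktoken` package is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several recent OpenAI models

text = "Start lisinopril 10mg daily for hypertension."
token_ids = enc.encode(text)

print(len(text.split()), "words ->", len(token_ids), "tokens")
for tid in token_ids:
    # decode_single_token_bytes shows exactly which text fragment each token covers
    print(tid, repr(enc.decode_single_token_bytes(tid)))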

Clinical Implication: Calculation Errors

Because "1024" might be split into tokens like `10` and `24`, LLMs struggle with arithmetic. They do not "calculate" a Creatinine Clearance; they predict which numbers usually follow the text "CrCl = ". This leads to confident but mathematically incorrect outputs.

The Latent Space

Imagine a vast map (in reality hundreds or thousands of dimensions, but picture it in 3D) where every medical concept has a coordinate. "Myocardial Infarction" sits close to "Troponin," but far from "Appendicitis."

When you query an LLM, it traverses this map. Hallucinations occur when the model takes a "shortcut" through this space—connecting two concepts that are statistically related in language (e.g., a specific drug and a specific side effect) but factually unrelated in reality.
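
A toy sketch of that "map" idea: the vectors below are invented 3-dimensional coordinates (real embedding models use far more dimensions), but the cosine-similarity arithmetic is the same.

# Toy sketch: concept "coordinates" and the distance between them.
# The vectors are invented for illustration; real embeddings are far higher-dimensional.
import numpy as np

concepts = {
    "myocardial infarction": np.array([0.9, 0.1, 0.2]),
    "troponin":              np.array([0.8, 0.2, 0.3]),
    "appendicitis":          np.array([0.1, 0.9, 0.4]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for name, vec in concepts.items():
    score = cosine_similarity(concepts["myocardial infarction"], vec)
    print(f"MI vs {name}: {score:.2f}")  # closer to 1.0 = "nearer" on the map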

2. RLHF & Model Behavior

Why the AI is Polite but Sycophantic

Reinforcement Learning from Human Feedback (RLHF)

Raw LLMs are chaotic. To make them usable, companies use RLHF. Humans review AI outputs and rank them: "Answer A is better than Answer B." The model learns to optimize for human preference.

The Sycophancy Trap

Because the model is optimized to "please" the user, it suffers from Sycophancy. If a senior clinician asks, "This looks like Wegener's, right?", the model is statistically biased to agree with the user's premise to maximize the "helpfulness" score, even if the clinical evidence is weak.

Temperature & Determinism

Temperature (0.0 - 1.0) controls randomness.

  • High Temp (0.8+): Creative, diverse, high hallucination risk.
  • Low Temp (0.0 - 0.2): Deterministic, repetitive, safer for extraction tasks.

Pro Tip: You often cannot control temperature in web interfaces (ChatGPT/Gemini), which means the output is non-deterministic. You cannot rely on it to be identical twice.
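
When you call a model through an API rather than a web chat, temperature is an explicit parameter you can pin. A minimal sketch using the OpenAI Python SDK (assumptions: the openai package is installed, an API key is set in the environment, and the model name is illustrative):

# Sketch: pin temperature to 0 for extraction-style tasks via the API.
# Assumes `pip install openai` and OPENAI_API_KEY in the environment; model name is illustrative.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative model name
    temperature=0,         # minimize randomness for repeatable extraction
    messages=[
        {"role": "system", "content": "Extract the medication list from the note as a JSON array."},
        {"role": "user", "content": "Pt on metformin 500mg BID and lisinopril 10mg daily."},
    ],
)
print(response.choices[0].message.content)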

3. Privacy & Enterprise Security

BAAs, Zero-Retention, and Local LLMs

The "Training Data" Economy

If the product is free, you are the training data. Standard terms of service usually allow the vendor to use your inputs to retrain future models.

The BAA Requirement

In the US, sending PHI to a non-enterprise AI tool is a HIPAA violation: you must have a Business Associate Agreement (BAA) with the vendor first. Enterprise environments (e.g., Azure OpenAI, Gemini Enterprise) usually offer "Zero Data Retention," meaning the data is processed and immediately discarded rather than stored or used for training.

Mosaic Re-identification

Stripping the "18 Identifiers" is no longer sufficient. AI is a pattern-matching engine.

Example: A prompt containing "Welder" + "Zip Code 10001" + "Rare Sarcoma" can be cross-referenced with public news or case reports to identify a patient, even without a name.

Local LLMs (The "Air Gap")

For maximum security, hospitals are deploying open-weights models (like Llama 3 or Mistral) on internal servers. These run entirely offline. No data leaves the hospital firewall.
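
A minimal sketch of querying such a locally hosted model (assumptions: an Ollama server is running on the same machine with a Llama 3 model already pulled; the endpoint and payload follow Ollama's documented HTTP API):

# Sketch: query a locally hosted open-weights model so no text leaves the machine.
# Assumes an Ollama server at localhost:11434 with a "llama3" model pulled.
import requests

payload = {
    "model": "llama3",
    "prompt": "Summarize: 55M with HTN, atypical chest pain, troponin negative.",
    "stream": False,
}
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
print(resp.json()["response"])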

4. Prompting I: Few-Shot Learning

Moving from Zero-Shot to Pattern Matching

Zero-Shot Prompting (asking for a result with no examples) is a common cause of poor formatting and vague answers.

Few-Shot Prompting involves providing 1-3 examples of the input and the desired output format. This "grounds" the model.

STRATEGY: Few-Shot Prompting
TYPE: Clinical Transformation

# SYSTEM ROLE
You are a clinical scribe. Transform raw notes into a professional "Assessment & Plan" format.

# EXAMPLE 1
Input: 55M, hx HTN, came in with chest pain. Trop neg. EKG normal. sending home.
Output:
Assessment: 55-year-old male with history of HTN.
Presentation: Atypical chest pain.
Objective: Troponin negative, EKG NSR.
Plan: Discharge with outpatient cardiology follow-up.

# EXAMPLE 2
Input: Kid, 7yo, asthma flare. O2 sat 92%. gave albuterol and pred. improving. admit for obs.
Output:
Assessment: 7-year-old male with known asthma.
Presentation: Acute asthma exacerbation with hypoxia (SpO2 92%).
Intervention: Treated with Albuterol neb and Prednisone.
Plan: Admit to Pediatrics for observation and SpO2 monitoring.

# ACTUAL TASK
Input: [PASTE YOUR MESSY NOTES HERE]
Output:

By seeing the examples, the model understands the "voice," the level of abbreviation expansion, and the exact structure required.
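
In an API call, the same few-shot pattern is usually expressed as worked example turns in the message history. A minimal sketch (same OpenAI SDK assumptions as earlier; the examples are abbreviated):

# Sketch: few-shot prompting as example turns in the chat history.
# Assumes the `openai` package and an API key; model name is illustrative.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[
        {"role": "system", "content": "You are a clinical scribe. Transform raw notes into an Assessment & Plan."},
        # Worked example: shows the model the exact input/output shape required
        {"role": "user", "content": "55M, hx HTN, came in with chest pain. Trop neg. EKG normal. sending home."},
        {"role": "assistant", "content": "Assessment: 55-year-old male with history of HTN.\nPlan: Discharge with outpatient cardiology follow-up."},
        # Actual task
        {"role": "user", "content": "Kid, 7yo, asthma flare. O2 sat 92%. gave albuterol and pred. improving. admit for obs."},
    ],
)
print(response.choices[0].message.content)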

5. Prompting II: Advanced Reasoning

Chain of Thought (CoT) & Chain of Verification (CoV)

Chain of Thought (CoT)

Standard prompts ask for the answer immediately. CoT forces the model to "show its work," which substantially reduces error rates on multi-step reasoning tasks.

The "Let's think step by step" Magic

Simply appending the phrase "Let's think step by step" to a prompt can trigger CoT processing in the model's latent space.

Chain of Verification (CoV)

This is a self-correction loop. You instruct the model to generate an answer, and then critique itself.

STRATEGY: Chain of Verification
User: "Propose a diagnosis for [Symptoms]." System Instructions: 1. Draft an initial differential diagnosis. 2. Fact Check: Look at each diagnosis in your draft. Does the patient's age/gender fit? Do the labs support it? List any contradictions. 3. Revised Output: Based on the fact check, write the final response.

Adversarial Prompting

To combat Sycophancy, assign the AI a "Critic" persona:

"I believe this is [Diagnosis X]. Act as a skeptical Attending Physician. Review my hypothesis and tell me why I might be WRONG. What 'Do Not Miss' diagnoses am I ignoring?"

6. RAG & Knowledge Retrieval

Connecting the AI to Trusted Documents

An LLM's training data is frozen in time (e.g., "Knowledge cutoff: Jan 2024"). It does not know about a drug recalled yesterday.

Retrieval Augmented Generation (RAG)

RAG is the architecture used by modern clinical AI tools. It works in two steps:

  1. Retrieve: The system searches a trusted source (e.g., UpToDate, hospital guidelines, PubMed) for the passages most relevant to the question.
  2. Generate: It pastes those passages into the LLM's prompt and instructs: "Using ONLY the text below, answer the doctor's question." (A minimal sketch follows this list.)
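
A toy sketch of that loop: retrieval here is a trivial keyword-overlap score over three hard-coded snippets (real systems use vector embeddings and a document index), and the assembled prompt would then be sent to the model via any of the earlier API sketches.

# Toy sketch of the RAG loop: retrieve the most relevant snippets, then
# constrain the model to answer only from them.
guideline_snippets = [
    "Sepsis bundle: obtain blood cultures and a lactate level before antibiotics.",
    "Community-acquired pneumonia: choose first-line therapy per the local antibiogram.",
    "DKA protocol: start IV fluids, then an insulin infusion at 0.1 units/kg/hr.",
]

def retrieve(question, snippets, top_k=2):
    # Score each snippet by how many question words it shares (illustrative only;
    # real retrieval uses embeddings, not word overlap).
    q_words = set(question.lower().split())
    scored = sorted(snippets, key=lambda s: len(q_words & set(s.lower().split())), reverse=True)
    return scored[:top_k]

question = "What should I do before starting antibiotics in suspected sepsis?"
context = "\n".join(retrieve(question, guideline_snippets))

prompt = (
    "Using ONLY the text below, answer the question. "
    "If the answer is not in the text, say 'I don't know.'\n\n"
    f"TEXT:\n{context}\n\nQUESTION: {question}"
)
print(prompt)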

Why RAG is Safer

It grounds the AI. If the answer isn't in the retrieved documents, a well-built RAG system is instructed to say "I don't know," whereas a standard LLM is more likely to hallucinate an answer.

The Context Window Limit: Even with RAG, LLMs have a limit on how much text they can read. They suffer from "Lost in the Middle" syndrome—where they successfully retrieve info from the start and end of a long document, but miss details buried in the middle.
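
A small guard against silently overrunning that limit is to count tokens before stuffing documents into the prompt. A minimal sketch (same tiktoken assumption as in the tokenization sketch earlier; the budget figure is illustrative, not a real model limit):

# Sketch: check retrieved context against an illustrative context-window budget.
# Assumes the `tiktoken` package; the budget is a made-up figure.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_BUDGET = 8000  # illustrative token budget for retrieved documents

def fits_in_budget(text):
    n_tokens = len(enc.encode(text))
    return n_tokens <= CONTEXT_BUDGET, n_tokens

ok, n = fits_in_budget("...long retrieved guideline text...")
print("tokens:", n, "| within budget:", ok)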

7. Failure Modes & Bias

When Good Models Go Bad

Automation Bias

The human tendency to over-trust automated decision support. When an AI provides a differential diagnosis, clinicians tend to stop brainstorming their own ideas, which narrows the cognitive field.

Rule: AI should be the second opinion, never the first.

Demographic Bias

LLMs are trained on the internet and historical medical data. Historical data contains biases (e.g., dermatology images primarily featuring lighter skin tones; pain management data showing racial disparities).

The "Default Male" Problem

Unless specified, LLMs often assume the patient is male. In cardiac presentations, this can lead to the model missing "atypical" (female) presentations of MI.

8. Final Certification Exam

Scenario-Based Assessment

The following exam consists of 10 randomized questions drawn from a master bank. You need 80% to pass.