👨‍🏫 Tutorial

A complete walkthrough: Train powerful sentence embeddings for medical text.

This tutorial demonstrates how to train domain-specific sentence embeddings using PubMed data with the AnglE framework. You’ll learn data preparation, model training, evaluation, and practical application.


📋 Overview

In this tutorial, you will:

  1. 📦 Prepare a medical text dataset from PubMed

  2. 🚂 Train a sentence embedding model

  3. 📊 Evaluate the model performance

  4. 🔧 Apply the model in practice

Expected Time: 2-4 hours (depending on hardware)

Prerequisites:

  • Python 3.7+

  • CUDA-compatible GPU(s)

  • Basic knowledge of PyTorch and HuggingFace
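
Before starting, it can help to confirm that the environment meets these prerequisites. The following is a minimal sketch, not part of AnglE: check_prerequisites is a hypothetical helper, and it assumes PyTorch is the package used to detect a CUDA device (via torch.cuda.is_available()).

```python
import sys


def check_prerequisites(min_python=(3, 7)):
    """Return a small report on the Python version and GPU availability.

    Hypothetical helper for illustration; not part of the angle-emb package.
    """
    report = {
        "python_ok": sys.version_info >= min_python,
        "torch_installed": False,
        "cuda_available": False,
    }
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return report
    report["torch_installed"] = True
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


print(check_prerequisites())
```

If cuda_available is False, the training commands in Step 2 will likely need to be adapted, since they assume several GPUs.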


Step 1: Data Preparation

📥 Dataset Selection

We’ll use the PubMedQA dataset for training.

Pre-processed Dataset Available:

For convenience, we’ve already processed the data into AnglE’s Format C (query, positive, negative) and made it available on HuggingFace:

📦 WhereIsAI/medical-triples

Note

Format C is ideal for contrastive learning with hard negatives. See 🚂 Training and Finetuning for more format options.
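
To make Format C concrete, the sketch below builds (query, positive, negative) rows from a few made-up question/passage pairs. It is an illustration only: the records, the build_triples helper, and the word-overlap heuristic for picking a hard negative are assumptions, not the procedure used to create WhereIsAI/medical-triples.

```python
import re
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word set used for a crude lexical-overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_triples(pairs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Turn (query, positive) pairs into Format C rows.

    For each query, the negative is the passage from a different pair that
    overlaps most with the query: similar on the surface, but not the answer.
    """
    triples = []
    for i, pair in enumerate(pairs):
        query_tokens = tokens(pair["query"])
        candidates = [p["positive"] for j, p in enumerate(pairs) if j != i]
        negative = max(candidates, key=lambda c: jaccard(query_tokens, tokens(c)))
        triples.append({
            "query": pair["query"],
            "positive": pair["positive"],
            "negative": negative,
        })
    return triples


# Made-up records, for illustration only.
pairs = [
    {"query": "Does aspirin reduce the risk of stroke?",
     "positive": "Low-dose aspirin lowered stroke incidence in the treated group."},
    {"query": "Does aspirin cause stomach bleeding?",
     "positive": "Aspirin use was associated with gastrointestinal bleeding."},
    {"query": "Is metformin effective for type 2 diabetes?",
     "positive": "Metformin improved glycemic control in type 2 diabetes."},
]

for row in build_triples(pairs):
    print(row)
```

Lexical overlap is only a stand-in here; in practice hard negatives are often mined with a retriever or an existing embedding model.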


Step 2: Train the Model

⬇️ Installation

First, install the angle-emb library:

python -m pip install -U angle-emb

🎯 Training with angle-trainer

Use the angle-trainer CLI for streamlined training. You’ll need to specify:

  • Dataset path and hyperparameters

  • Model architecture

  • Training configuration

See 🚂 Training and Finetuning for detailed parameter descriptions.

📝 Training Examples:

Example 1: Train BERT-base Model

Train a base model suitable for general medical text:

CUDA_VISIBLE_DEVICES=1,2,3 WANDB_MODE=disabled accelerate launch \
--multi_gpu \
--num_processes 3 \
--main_process_port 1234 \
-m angle_emb.angle_trainer \
--train_name_or_path WhereIsAI/medical-triples \
--train_subset_name all_pubmed_en_v1 \
--save_dir ckpts/pubmedbert-medical-base-v1 \
--model_name_or_path microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext \
--pooling_strategy cls \
--maxlen 75 \
--ibn_w 1.0 \
--cln_w 1.0 \
--cosine_w 0.0 \
--angle_w 0.02 \
--learning_rate 1e-6 \
--logging_steps 5 \
--save_steps 500 \
--warmup_steps 50 \
--batch_size 64 \
--seed 42 \
--gradient_accumulation_steps 3 \
--push_to_hub 1 --hub_model_id pubmed-angle-base-en --hub_private_repo 1 \
--epochs 1 \
--fp16 1

Key Parameters Explained:

  • --model_name_or_path: Pre-trained model specialized for biomedical text

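  • --ibn_w, --cln_w, --angle_w: Weights for the in-batch negative loss, the contrastive loss with hard negatives, and the angle loss, the terms used with Format C (--cosine_w 0.0 turns the cosine loss off)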

  • --maxlen 75: Sequence length optimized for PubMed abstracts

  • --push_to_hub 1: Automatically upload to HuggingFace Hub
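
As a rough picture of what the loss weights do, the sketch below combines per-term loss values as a weighted sum, using the weights from the command above. The weighted-sum form, the combine_losses helper, and the per-term loss values are assumptions made for illustration; see the AnglE source for the actual loss implementation.

```python
import math


def combine_losses(losses, weights):
    """Weighted sum of loss terms; a term with weight 0 contributes nothing."""
    return sum(weights[name] * value for name, value in losses.items())


# Weights taken from the training command above.
weights = {"ibn": 1.0, "cln": 1.0, "cosine": 0.0, "angle": 0.02}

# Hypothetical per-term loss values for a single batch.
losses = {"ibn": 2.0, "cln": 1.5, "cosine": 0.7, "angle": 3.0}

total = combine_losses(losses, weights)
print(f"total loss: {total:.2f}")  # total loss: 3.56
```

With these numbers the angle term contributes only 0.06 of the total, which shows how a small --angle_w keeps that term as a light regularizer next to the two contrastive terms.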


Example 2: Train BERT-large Model

Train a larger model for better performance:

CUDA_VISIBLE_DEVICES=1,2,3 WANDB_MODE=disabled accelerate launch \
--multi_gpu \
--num_processes 3 \
--main_process_port 1234 \
-m angle_emb.angle_trainer \
--train_name_or_path WhereIsAI/medical-triples \
--column_rename_mapping "text:query" \
--train_subset_name all_pubmed_en_v1 \
--save_dir ckpts/uae-medical-large-v1 \
--model_name_or_path WhereIsAI/UAE-Large-V1 \
--pooling_strategy cls \
--maxlen 75 \
--ibn_w 1.0 \
--cln_w 1.0 \
--cosine_w 0.0 \
--angle_w 0.02 \
--learning_rate 1e-6 \
--logging_steps 5 \
--save_steps 500 \
--warmup_steps 50 \
--batch_size 32 \
--seed 42 \
--gradient_accumulation_steps 2 \
--push_to_hub 1 --hub_model_id pubmed-angle-large-en --hub_private_repo 1 \
--epochs 1 \
--fp16 1

Tip

Fine-tuning from a general-purpose model (like UAE-Large-V1) often yields better results than training from scratch.
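
The two commands also differ in how many examples contribute to each optimizer step. A quick calculation, assuming --batch_size is the per-GPU batch size and that gradient accumulation multiplies it in the usual way (the effective_batch_size helper is hypothetical):

```python
def effective_batch_size(per_device_batch, num_processes, grad_accum_steps):
    """Number of training examples contributing to one optimizer step."""
    return per_device_batch * num_processes * grad_accum_steps


# Values from Example 1 (base) and Example 2 (large).
base = effective_batch_size(per_device_batch=64, num_processes=3, grad_accum_steps=3)
large = effective_batch_size(per_device_batch=32, num_processes=3, grad_accum_steps=2)

print(base, large)  # 576 192
```

When adapting the commands to a different number of GPUs, adjusting --gradient_accumulation_steps is one way to keep this number roughly constant.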


Step 3: Evaluate the Model

📊 Evaluation Setup

AnglE provides a CorrelationEvaluator to measure embedding quality using Spearman’s correlation.

Evaluation Dataset:

We’ve prepared the PubMedQA test set in Format A (text1, text2, label):

📦 WhereIsAI/pubmedqa-test-angle-format-a
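
Spearman’s correlation compares the ranking of the model’s similarity scores with the ranking of the gold labels. The pure-Python sketch below shows the computation on made-up scores and labels. It is not CorrelationEvaluator’s implementation, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; NaN when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else float("nan")


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up cosine similarities and gold labels for four text pairs.
predicted = [0.91, 0.35, 0.78, 0.12]
labels = [1, 0, 1, 0]

print(f"Spearman: {spearman(predicted, labels):.4f}")
```

Because only the ranks matter, the metric rewards a model for ordering pairs correctly, regardless of the absolute similarity values it produces.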

📈 Evaluation Code

Evaluate your trained model:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from angle_emb import AnglE, CorrelationEvaluator
from datasets import load_dataset

# Load trained model
angle = AnglE.from_pretrained(
    'WhereIsAI/pubmed-angle-base-en',
    pooling_strategy='cls'
).cuda()

# Load evaluation dataset
ds = load_dataset('WhereIsAI/pubmedqa-test-angle-format-a', split='train')

# Evaluate
metric = CorrelationEvaluator(
    text1=ds['text1'],
    text2=ds['text2'],
    labels=ds['label']
)(angle, show_progress=True)

print(metric)

📊 Benchmark Results

Comparison of models trained on PubMed data:

Model                                     Spearman’s Correlation
tavakolih/all-MiniLM-L6-v2-pubmed-full    84.56
NeuML/pubmedbert-base-embeddings          84.88
WhereIsAI/pubmed-angle-base-en            86.01
WhereIsAI/pubmed-angle-large-en           86.21 🏆

Note

Both AnglE-trained models outperform the two baseline models in this comparison, and the large variant achieves the highest Spearman’s correlation (86.21).


Step 4: Use the Model

🔧 Practical Application

Load and use your trained model for semantic similarity tasks:

from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained(
    'WhereIsAI/pubmed-angle-base-en',
    pooling_strategy='cls'
).cuda()

# Define query and documents
query = 'How to treat childhood obesity and overweight?'
docs = [
    query,
    'The child is overweight. Parents should relieve their children\'s '
    'symptoms through physical activity and healthy eating. First, they '
    'can let them do some aerobic exercise, such as jogging, climbing, '
    'swimming, etc. In terms of diet, children should eat more cucumbers, '
    'carrots, spinach, etc. Parents should also discourage their children '
    'from eating fried foods and dried fruits, which are high in calories '
    'and fat. Parents should not let their children lie in bed without '
    'moving after eating. If their children\'s condition is serious during '
    'the treatment of childhood obesity, parents should go to the hospital '
    'for treatment under the guidance of a doctor in a timely manner.',
    'If you want to treat tonsillitis better, you can choose some '
    'anti-inflammatory drugs under the guidance of a doctor, or use local '
    'drugs, such as washing the tonsil crypts, injecting drugs into the '
    'tonsils, etc. If your child has a sore throat, you can also give him '
    'or her some pain relievers. If your child has a fever, you can give '
    'him or her antipyretics. If the condition is serious, seek medical '
    'attention as soon as possible. If the medication does not have a good '
    'effect and the symptoms recur, the author suggests surgical treatment. '
    'Parents should also make sure to keep their children warm to prevent '
    'them from catching a cold and getting tonsillitis again.',
]

# Encode all texts
embeddings = angle.encode(docs)
query_emb = embeddings[0]

# Calculate similarities
for doc, emb in zip(docs[1:], embeddings[1:]):
    similarity = cosine_similarity(query_emb, emb)
    print(f"Similarity: {similarity:.4f}")

Output:

Similarity: 0.8030  # Highly relevant (obesity treatment)
Similarity: 0.4261  # Less relevant (tonsillitis treatment)

Tip

Higher similarity scores indicate more relevant documents. Use this for search, ranking, or clustering tasks.
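
For search or ranking, the same idea extends to sorting many documents by their similarity to a query. The sketch below uses small made-up vectors so that it runs without a model; in practice the vectors would come from angle.encode(...). The cosine and rank_documents helpers are hypothetical and written in pure Python for clarity.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: Optional[int] = None,
) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs, most similar first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]  # top_k=None keeps every document


# Made-up 3-dimensional "embeddings", for illustration only.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.9],
    [0.5, 0.5, 0.5],
]

for index, score in rank_documents(query_vec, doc_vecs):
    print(f"doc {index}: {score:.4f}")
```

For large collections, a vectorized or approximate nearest-neighbor search would usually replace this loop.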


🎓 Summary

Congratulations! You’ve learned how to:

✅ Prepare domain-specific datasets for sentence embedding training

✅ Train BERT-based models using the angle-trainer CLI

✅ Evaluate model performance with correlation metrics

✅ Apply trained models for semantic similarity tasks

📚 Next Steps

Questions? See 🫡 Citation for how to cite this work in your research.