π¨βπ« Tutorialο
A complete walkthrough: Train powerful sentence embeddings for medical text.
This tutorial demonstrates how to train domain-specific sentence embeddings using PubMed data with the AnglE framework. Youβll learn data preparation, model training, evaluation, and practical application.
π Overviewο
In this tutorial, you will:
π¦ Prepare a medical text dataset from PubMed
π Train a sentence embedding model
π Evaluate the model performance
π§ Apply the model in practice
Expected Time: 2-4 hours (depending on hardware)
Prerequisites: - Python 3.7+ - CUDA-compatible GPU(s) - Basic knowledge of PyTorch and HuggingFace
Step 1: Data Preparationο
π₯ Dataset Selectionο
Weβll use the PubMedQA dataset for training.
Pre-processed Dataset Available:
For convenience, weβve already processed the data into AnglEβs Format C (query, positive, negative) and made it available on HuggingFace:
π¦ WhereIsAI/medical-triples
Note
Format C is ideal for contrastive learning with hard negatives. See π Training and Finetuning for more format options.
Step 2: Train the Modelο
β¬οΈ Installationο
First, install the angle-emb library:
python -m pip install -U angle-emb
π― Training with angle-trainerο
Use the angle-trainer CLI for streamlined training. Youβll need to specify:
Dataset path and hyperparameters
Model architecture
Training configuration
See π Training and Finetuning for detailed parameter descriptions.
π Training Examples:
Example 1: Train BERT-base Model
Train a base model suitable for general medical text:
CUDA_VISIBLE_DEVICES=1,2,3 WANDB_MODE=disabled accelerate launch \
--multi_gpu \
--num_processes 3 \
--main_process_port 1234 \
-m angle_emb.angle_trainer \
--train_name_or_path WhereIsAI/medical-triples \
--train_subset_name all_pubmed_en_v1 \
--save_dir ckpts/pubmedbert-medical-base-v1 \
--model_name_or_path microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext \
--pooling_strategy cls \
--maxlen 75 \
--ibn_w 1.0 \
--cln_w 1.0 \
--cosine_w 0.0 \
--angle_w 0.02 \
--learning_rate 1e-6 \
--logging_steps 5 \
--save_steps 500 \
--warmup_steps 50 \
--batch_size 64 \
--seed 42 \
--gradient_accumulation_steps 3 \
--push_to_hub 1 --hub_model_id pubmed-angle-base-en --hub_private_repo 1 \
--epochs 1 \
--fp16 1
Key Parameters Explained:
--model_name_or_path: Pre-trained model specialized for biomedical text--ibn_w,--cln_w,--angle_w: Loss weights for Format C--maxlen 75: Sequence length optimized for PubMed abstracts--push_to_hub 1: Automatically upload to HuggingFace Hub
Example 2: Train BERT-large Model
Train a larger model for better performance:
CUDA_VISIBLE_DEVICES=1,2,3 WANDB_MODE=disabled accelerate launch \
--multi_gpu \
--num_processes 3 \
--main_process_port 1234 \
-m angle_emb.angle_trainer \
--train_name_or_path WhereIsAI/medical-triples \
--column_rename_mapping "text:query" \
--train_subset_name all_pubmed_en_v1 \
--save_dir ckpts/uae-medical-large-v1 \
--model_name_or_path WhereIsAI/UAE-Large-V1 \
--pooling_strategy cls \
--maxlen 75 \
--ibn_w 1.0 \
--cln_w 1.0 \
--cosine_w 0.0 \
--angle_w 0.02 \
--learning_rate 1e-6 \
--logging_steps 5 \
--save_steps 500 \
--warmup_steps 50 \
--batch_size 32 \
--seed 42 \
--gradient_accumulation_steps 2 \
--push_to_hub 1 --hub_model_id pubmed-angle-large-en --hub_private_repo 1 \
--epochs 1 \
--fp16 1
Tip
Fine-tuning from a general-purpose model (like UAE-Large-V1) often yields better results than training from scratch.
Step 3: Evaluate the Modelο
π Evaluation Setupο
AnglE provides a CorrelationEvaluator to measure embedding quality using Spearmanβs correlation.
Evaluation Dataset:
Weβve prepared the PubMedQA test set in Format A (text1, text2, label):
π Evaluation Codeο
Evaluate your trained model:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from angle_emb import AnglE, CorrelationEvaluator
from datasets import load_dataset
# Load trained model
angle = AnglE.from_pretrained(
'WhereIsAI/pubmed-angle-base-en',
pooling_strategy='cls'
).cuda()
# Load evaluation dataset
ds = load_dataset('WhereIsAI/pubmedqa-test-angle-format-a', split='train')
# Evaluate
metric = CorrelationEvaluator(
text1=ds['text1'],
text2=ds['text2'],
labels=ds['label']
)(angle, show_progress=True)
print(metric)
π Benchmark Resultsο
Comparison of models trained on PubMed data:
Model |
Spearmanβs Correlation |
|---|---|
tavakolih/all-MiniLM-L6-v2-pubmed-full |
84.56 |
NeuML/pubmedbert-base-embeddings |
84.88 |
WhereIsAI/pubmed-angle-base-en |
86.01 |
WhereIsAI/pubmed-angle-large-en |
86.21 π |
Note
The AnglE-trained models outperform existing popular models, with the large variant achieving the highest correlation of 86.21.
Step 4: Use the Modelο
π§ Practical Applicationο
Load and use your trained model for semantic similarity tasks:
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity
# Load model
angle = AnglE.from_pretrained(
'WhereIsAI/pubmed-angle-base-en',
pooling_strategy='cls'
).cuda()
# Define query and documents
query = 'How to treat childhood obesity and overweight?'
docs = [
query,
'The child is overweight. Parents should relieve their children\'s '
'symptoms through physical activity and healthy eating. First, they '
'can let them do some aerobic exercise, such as jogging, climbing, '
'swimming, etc. In terms of diet, children should eat more cucumbers, '
'carrots, spinach, etc. Parents should also discourage their children '
'from eating fried foods and dried fruits, which are high in calories '
'and fat. Parents should not let their children lie in bed without '
'moving after eating. If their children\'s condition is serious during '
'the treatment of childhood obesity, parents should go to the hospital '
'for treatment under the guidance of a doctor in a timely manner.',
'If you want to treat tonsillitis better, you can choose some '
'anti-inflammatory drugs under the guidance of a doctor, or use local '
'drugs, such as washing the tonsil crypts, injecting drugs into the '
'tonsils, etc. If your child has a sore throat, you can also give him '
'or her some pain relievers. If your child has a fever, you can give '
'him or her antipyretics. If the condition is serious, seek medical '
'attention as soon as possible. If the medication does not have a good '
'effect and the symptoms recur, the author suggests surgical treatment. '
'Parents should also make sure to keep their children warm to prevent '
'them from catching a cold and getting tonsillitis again.',
]
# Encode all texts
embeddings = angle.encode(docs)
query_emb = embeddings[0]
# Calculate similarities
for doc, emb in zip(docs[1:], embeddings[1:]):
similarity = cosine_similarity(query_emb, emb)
print(f"Similarity: {similarity:.4f}")
Output:
Similarity: 0.8030 # Highly relevant (obesity treatment)
Similarity: 0.4261 # Less relevant (tonsillitis treatment)
Tip
Higher similarity scores indicate more relevant documents. Use this for search, ranking, or clustering tasks.
π Summaryο
Congratulations! Youβve learned how to:
β Prepare domain-specific datasets for sentence embedding training
β
Train BERT-based models using the angle-trainer CLI
β Evaluate model performance with correlation metrics
β Apply trained models for semantic similarity tasks
π Next Stepsο
Explore π Training and Finetuning for advanced configuration options
Learn about different π― Evaluation methods
Check out ποΈ Official Pretrained Models for ready-to-use models
Return to π Quick Start for basic inference examples
Questions? See π«‘ Citation for how to cite this work in your research.