Model Extraction Attack Test | ML Model Defenses

A model extraction attack is an adversary's attempt to reconstruct a surrogate model by systematically querying a machine learning model. Such attacks pose a significant risk to intellectual property and expose the model to further threats, such as adversarial attacks and the potential extraction of sensitive data.

Test objectives

Identify the model's susceptibility to model extraction attempts that use different query strategies.
Evaluate the model's resilience against attempts to replicate its behavior.
Verify the effectiveness of the defensive mechanisms deployed to hinder or prevent model extraction.

Test methods and payloads

Payload 1: systematic querying with adaptive strategies (black-box extraction)

Query the model systematically, using adaptive query strategies.
Vulnerability signal: the reconstructed surrogate model achieves high accuracy and high predictive similarity to the original model.
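
As one illustration of an adaptive strategy, the sketch below uses uncertainty sampling: after each round, the attacker queries the inputs on which the current surrogate is least confident. The target is a local stand-in (a scikit-learn LogisticRegression trained on synthetic data), not a real API. The helper `target_predict`, the pool and hold-out sizes, and the round and batch settings are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the remote model; in a real test this is an API call.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5, random_state=0)
_target = LogisticRegression(max_iter=1000).fit(X[:500], y[:500])

def target_predict(inputs):
    """Returns labels only, as a black-box API would."""
    return _target.predict(inputs)

pool = X[500:1500]   # unlabeled inputs the attacker is able to query
holdout = X[1500:]   # inputs used only to measure fidelity

rng = np.random.default_rng(0)
queried = rng.choice(len(pool), size=20, replace=False).tolist()  # random seed batch

for _ in range(5):  # five adaptive rounds of 20 queries each
    labels = target_predict(pool[queried])
    surrogate = LogisticRegression(max_iter=1000).fit(pool[queried], labels)
    # Uncertainty is highest where the surrogate's top probability is closest to 0.5.
    uncertainty = 1.0 - surrogate.predict_proba(pool).max(axis=1)
    uncertainty[queried] = -1.0  # never re-query an input
    queried.extend(np.argsort(uncertainty)[-20:].tolist())

# Final surrogate trained on everything queried so far.
labels = target_predict(pool[queried])
surrogate = LogisticRegression(max_iter=1000).fit(pool[queried], labels)
fidelity = float(np.mean(surrogate.predict(holdout) == target_predict(holdout)))
print(f"Queries used: {len(queried)}, fidelity: {fidelity:.2%}")
```

Comparing this fidelity with the one obtained from the same number of purely random queries gives a rough idea of how much an adaptive attacker gains against a given target.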

Payload 2: confidence-based extraction

Use techniques that exploit the output probabilities to drive the extraction.
Vulnerability signal: the surrogate model mimics the behavior and predictions of the original model with significant precision.

The test simulates a black-box attack in which an adversary with API access tries to steal the model's functionality by training a surrogate on its predictions.

Prerequisites

An accessible API that accepts input data and returns predictions.
A representative dataset for querying, split into a training set (for the queries) and a test set (for the evaluation).
A Python environment with requests, numpy, and scikit-learn installed.
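
The query/test split can be prepared with scikit-learn's train_test_split. This is a minimal sketch: the `all_inputs` list is placeholder data and the 80/20 ratio is an assumption, not a requirement of the test.

```python
from sklearn.model_selection import train_test_split

# Placeholder inputs; in a real test, load data representative of the target's domain.
all_inputs = [f"sample review number {i}" for i in range(100)]

# Hold out 20% of the inputs so fidelity is measured on data never used for querying.
query_dataset, test_dataset = train_test_split(all_inputs, test_size=0.2, random_state=42)

print(f"{len(query_dataset)} inputs for querying, {len(test_dataset)} for evaluation")
```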

Step 1: data acquisition via API queries

import requests
import numpy as np

# --- Configuration ---
API_URL = "https://api.example.com/predict"  # Target model's API endpoint
API_KEY = "your_api_key_here"

# Load your dataset (e.g., a list of text inputs)
# For this example, we'll use a simple list.
query_dataset = [
    "This is a great product, I love it!",
    "The service was terrible, I am very disappointed.",
    "It's an okay experience, neither good nor bad.",
    # ... add at least 1,000-5,000 data points for a meaningful test
]

# --- Data Acquisition ---
def query_target_model(text_input):
    """Sends a request to the target model's API and returns the prediction."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    payload = {"text": text_input}
    try:
        # A timeout keeps a stalled request from blocking the whole run
        response = requests.post(API_URL, json=payload, headers=headers, timeout=10)
        response.raise_for_status()  # Raise an exception for bad status codes
        # Assuming the API returns a JSON with a 'label' key (e.g., 'positive', 'negative')
        return response.json().get('label')
    except requests.exceptions.RequestException as e:
        print(f"API request failed: {e}")
        return None

# Create a new dataset with labels from the target model
labeled_pairs = []
for text in query_dataset:
    label = query_target_model(text)
    if label is not None:
        labeled_pairs.append((text, label))

# Keep inputs and labels aligned: drop every input whose query failed,
# otherwise the two lists would differ in length when training in Step 2.
query_dataset = [text for text, _ in labeled_pairs]
stolen_labels = [label for _, label in labeled_pairs]

# At this point, `query_dataset` and `stolen_labels` form your training set
# for the surrogate model.
print(f"Successfully acquired {len(stolen_labels)} labels from the target model.")

Step 2: training the surrogate model

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# Ensure you have data from Step 1
if not stolen_labels:
    raise ValueError("No labels were acquired from the target model. Cannot train surrogate.")

# Inputs and labels must be aligned one-to-one before training
if len(query_dataset) != len(stolen_labels):
    raise ValueError("query_dataset and stolen_labels differ in length; drop inputs whose query failed.")

# Create and train the surrogate model pipeline
# We use a simple TF-IDF vectorizer and a Decision Tree for simplicity.
surrogate_model = make_pipeline(
    TfidfVectorizer(),
    DecisionTreeClassifier(random_state=42)
)

# Train the model on the data acquired from the target API
surrogate_model.fit(query_dataset, stolen_labels)
print("Surrogate model trained successfully.")

Step 3: evaluating the fidelity of the surrogate model

from sklearn.metrics import accuracy_score

# --- Evaluation ---
# Load your unseen test set (should not have been used in Step 1)
test_dataset = [
    "I would definitely recommend this to my friends.",
    "A complete waste of money and time.",
    # ... add a representative set of test data
]

# 1. Get ground truth predictions from the TARGET model for the test set
target_model_predictions = [query_target_model(text) for text in test_dataset]

# 2. Get predictions from your SURROGATE model for the same test set
surrogate_model_predictions = surrogate_model.predict(test_dataset)

# 3. Compare the predictions to measure fidelity
# Ensure there are no None values from failed API calls
valid_indices = [i for i, label in enumerate(target_model_predictions) if label is not None]
if not valid_indices:
    raise ValueError("Could not get any valid predictions from the target model for the test set.")

target_preds_filtered = [target_model_predictions[i] for i in valid_indices]
surrogate_preds_filtered = [surrogate_model_predictions[i] for i in valid_indices]

model_fidelity = accuracy_score(target_preds_filtered, surrogate_preds_filtered)
print(f"Surrogate Model Fidelity (Agreement with Target Model): {model_fidelity:.2%}")

# --- Interpretation ---
if model_fidelity > 0.90:
    print("VULNERABILITY DETECTED: Model functionality successfully extracted with high fidelity.")
elif model_fidelity > 0.75:
    print("WARNING: Model shows susceptibility to extraction. Fidelity is moderately high.")
else:
    print("INFO: Model appears resilient to this extraction attempt. Fidelity is low.")

Expected results

Surrogate fidelity >90%: indicates a vulnerability. A near-perfect copy of the model's functionality can be built with minimal effort.
Fidelity between 75% and 90%: a warning. The model shows susceptibility to extraction.
Fidelity <75%: the desired result. The behavior is not easily replicable, thanks to defensive mechanisms (such as rate limiting or output perturbation).
Queries must not allow a surrogate model to be reconstructed effectively.
Defensive mechanisms must detect and limit suspicious activity, hindering data collection.
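
One way to detect and limit suspicious query volume is a per-client sliding-window budget. The following is a minimal in-memory sketch. The class name `QueryRateLimiter`, the default of 100 queries per 60 seconds, and the injectable clock are illustrative assumptions; a production service would normally keep this state in a shared store and log rejected clients for review.

```python
import time
from collections import defaultdict, deque

class QueryRateLimiter:
    """Allows at most `max_queries` per client within a sliding time window."""

    def __init__(self, max_queries=100, window_seconds=60.0, clock=time.monotonic):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._history = defaultdict(deque)  # client id -> timestamps of recent queries

    def allow(self, client_id):
        """Returns True and records the query if the client is under its budget."""
        now = self.clock()
        timestamps = self._history[client_id]
        # Discard timestamps that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_queries:
            return False  # over budget: reject the query
        timestamps.append(now)
        return True

# Usage with a fake clock: 3 queries allowed per 10 seconds.
fake_now = [0.0]
limiter = QueryRateLimiter(max_queries=3, window_seconds=10.0, clock=lambda: fake_now[0])
results = [limiter.allow("client-a") for _ in range(4)]
fake_now[0] = 10.0  # the window has passed, so the budget is available again
results.append(limiter.allow("client-a"))
print(results)  # [True, True, True, False, True]
```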

Remediation

Apply query rate limiting, anomaly detection, and throttling to mitigate extraction risks.
Use differential privacy and noise injection techniques on the model's outputs.
Implement monitoring systems to detect and respond to extraction attempts.
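
As an illustration of noise injection, the sketch below perturbs and coarsens the probability vector returned to the client while keeping the top-1 label unchanged, so legitimate users see the same prediction but the confidences carry less information. The function name `perturb_probabilities`, the noise scale, and the rounding precision are assumptions made for the example. This is not a differential privacy mechanism and gives no formal guarantee.

```python
import numpy as np

def perturb_probabilities(probs, noise_scale=0.05, decimals=1, rng=None):
    """Adds Gaussian noise to a probability vector, rounds it, and preserves the argmax.

    Expects at least two classes.
    """
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    top = int(np.argmax(probs))
    # Add noise, keep values positive, renormalize, then coarsen by rounding.
    noisy = np.clip(probs + rng.normal(scale=noise_scale, size=probs.shape), 1e-6, None)
    noisy = np.round(noisy / noisy.sum(), decimals)
    # If noise or rounding removed the original winner's lead, restore it.
    others_max = np.delete(noisy, top).max()
    if noisy[top] <= others_max:
        noisy[top] = others_max + 10.0 ** -decimals
    return noisy / noisy.sum()

# Usage: the released vector differs from the original but names the same class.
rng = np.random.default_rng(0)
original = [0.70, 0.20, 0.10]
released = perturb_probabilities(original, rng=rng)
print(released, "top-1 preserved:", int(np.argmax(released)) == int(np.argmax(original)))
```

Preserving the top-1 label keeps accuracy unchanged for honest clients; the trade-off is that label-only extraction, as in Steps 1 to 3, is not affected by this defense and still needs rate limiting or monitoring.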

Suggested tools

ML Privacy Meter: a tool for quantifying extraction and privacy risks
ML Privacy Meter GitHub
PrivacyRaven: a tool for testing and defending models against extraction vulnerabilities
PrivacyRaven GitHub
ART (Adversarial Robustness Toolbox): modules for detecting and mitigating model extraction vulnerabilities
ART GitHub

In summary

Model extraction can be tested by issuing repeated queries that feed a surrogate model and then measuring the surrogate's fidelity to the original. Defensive mechanisms must prevent excessive data collection and make the model's behavior hard to replicate.
