Intelligenza Artificiale Sostenibile

Sustainable Artificial Intelligence

by Bruno Tessaro, posted on June 16, 2026


L'intelligenza artificiale generativa rappresenta una delle trasformazioni tecnologiche più rilevanti dell'inizio del XXI secolo. La diffusione di sistemi come ChatGPT ha portato al centro dell'attenzione pubblica strumenti capaci di comprendere il linguaggio naturale, generare testi coerenti, assistere nella scrittura di codice, sintetizzare informazioni e supportare attività cognitive che fino a pochi anni fa sembravano esclusivamente umane. Nel giro di pochi mesi, l'AI generativa è passata dall'essere un argomento per specialisti a una tecnologia utilizzata quotidianamente da milioni di persone.
Alla base di questa rivoluzione si trovano i Large Language Models (LLM), reti neurali addestrate su enormi quantità di dati testuali. Il loro sviluppo ha richiesto investimenti senza precedenti in infrastrutture computazionali, energia e capacità di elaborazione. Per molti anni il progresso del settore è stato guidato da una convinzione apparentemente semplice: aumentando il numero di parametri, la quantità di dati e la potenza di calcolo, le prestazioni dei modelli sarebbero migliorate in modo costante. Questa strategia ha effettivamente prodotto risultati straordinari, ma ha anche messo in luce limiti economici, energetici e operativi sempre più evidenti.

L'addestramento dei modelli più avanzati richiede oggi grandi cluster di GPU, consumi energetici significativi e costi accessibili soltanto a poche organizzazioni globali. Parallelamente, aziende, istituzioni e centri di ricerca hanno iniziato a interrogarsi sulla sostenibilità di questo paradigma. È davvero necessario costruire modelli sempre più grandi? Esistono alternative capaci di offrire prestazioni elevate con risorse molto inferiori?
La risposta ha iniziato a delinearsi attraverso nuove strategie di progettazione. La qualità dei dati è diventata importante quanto la loro quantità. Tecniche come la quantizzazione, il pruning, il fine-tuning efficiente e il Retrieval-Augmented Generation hanno dimostrato che è possibile ottenere risultati competitivi senza aumentare indefinitamente le dimensioni dei modelli. Allo stesso tempo, la crescita dell'ecosistema open source ha reso accessibili tecnologie che fino a poco tempo fa erano riservate ai grandi laboratori industriali.

Questa evoluzione ha riportato in primo piano il concetto di autonomia tecnologica. La possibilità di eseguire modelli localmente, all'interno di aziende o infrastrutture controllate direttamente dagli utenti, offre vantaggi in termini di privacy, sicurezza e indipendenza dal cloud. In questo contesto il rapporto tra software e hardware assume un'importanza crescente. L'efficienza non dipende soltanto dagli algoritmi, ma anche dalla capacità di progettare architetture computazionali specializzate.
Tra le tecnologie che stanno attirando interesse figurano gli FPGA, dispositivi elettronici riconfigurabili che consentono di implementare acceleratori dedicati per specifici carichi di lavoro. La loro flessibilità li rende particolarmente interessanti per l'inferenza locale di modelli linguistici ottimizzati, aprendo la strada a una nuova generazione di sistemi di AI sostenibili e ad alta efficienza energetica.

Dalle reti neurali al Transformer
Per comprendere la nascita dei moderni modelli linguistici è necessario ripercorrere brevemente l'evoluzione delle reti neurali artificiali. Le prime idee risalgono agli anni Cinquanta, quando il perceptron di Frank Rosenblatt cercò di simulare in forma semplificata il funzionamento dei neuroni biologici. Nonostante l'entusiasmo iniziale, i limiti computazionali dell'epoca e alcune difficoltà teoriche rallentarono per decenni lo sviluppo del settore.
La situazione cambiò progressivamente a partire dagli anni Novanta e soprattutto nel primo decennio del Duemila, grazie all'aumento della potenza di calcolo, alla disponibilità di grandi quantità di dati e alla diffusione delle GPU come acceleratori per il deep learning. Le reti neurali profonde iniziarono a ottenere risultati sempre migliori in ambiti come la visione artificiale, il riconoscimento vocale e l'elaborazione del linguaggio naturale.
Uno dei problemi più complessi riguardava la gestione delle sequenze. A differenza delle immagini, il linguaggio è caratterizzato da una struttura temporale nella quale ogni parola dipende dal contesto precedente. Per affrontare questa sfida furono sviluppate le Recurrent Neural Networks (RNN), progettate per elaborare informazioni in modo sequenziale mantenendo una forma di memoria interna.
Le RNN rappresentarono un importante passo avanti, ma soffrivano di limiti significativi. Durante l'addestramento era difficile conservare informazioni provenienti da sequenze molto lunghe, un problema noto come vanishing gradient. Per mitigare queste difficoltà furono introdotte architetture più sofisticate come le Long Short-Term Memory (LSTM) e successivamente le Gated Recurrent Units (GRU). Questi modelli migliorarono la capacità di apprendere dipendenze a lungo termine e dominarono per anni il panorama dell'elaborazione del linguaggio naturale.
Nonostante i progressi, le architetture ricorrenti presentavano ancora un limite fondamentale: la natura sequenziale del calcolo. Ogni elemento della frase doveva essere elaborato dopo il precedente, riducendo la possibilità di sfruttare pienamente il parallelismo offerto dall'hardware moderno.
La svolta arrivò grazie al concetto di attenzione. Nel 2014 Dzmitry Bahdanau e collaboratori introdussero un meccanismo che permetteva ai modelli di concentrarsi dinamicamente sulle parti più rilevanti di una sequenza durante la traduzione automatica. L'idea si rivelò estremamente potente e aprì la strada a una nuova generazione di architetture.
Nel 2017 un gruppo di ricercatori di Google pubblicò il celebre articolo "Attention Is All You Need". Il lavoro propose un'architettura completamente nuova, denominata Transformer, che eliminava del tutto la ricorrenza e si basava esclusivamente sul meccanismo di self-attention. Invece di elaborare le parole una alla volta, il Transformer poteva analizzare simultaneamente l'intera sequenza, identificando le relazioni tra i vari elementi del testo.
La self-attention consente a ogni token di valutare l'importanza relativa degli altri token presenti nella sequenza. Questo approccio permette di catturare dipendenze linguistiche molto lunghe e di sfruttare efficacemente il parallelismo delle GPU. Il risultato fu un miglioramento significativo sia delle prestazioni sia della velocità di addestramento.
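A titolo puramente illustrativo, il calcolo descritto sopra può essere abbozzato in poche righe di Python. Lo schema segue la scaled dot-product attention: ogni token confronta la propria query con le chiavi di tutti i token, i punteggi vengono normalizzati con una softmax e usati per pesare i valori. Si tratta di uno schizzo semplificato: gli embedding sono inventati e query, chiavi e valori coincidono con l'input, mentre in un Transformer reale sono ottenuti tramite proiezioni lineari apprese.

```python
import math

def softmax(scores):
    # Sottrae il massimo per stabilità numerica prima dell'esponenziale.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention su liste di vettori (uno per token)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Punteggio della query rispetto a ogni chiave, scalato per sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # L'uscita è la media dei valori pesata dai coefficienti di attenzione.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Tre token con embedding bidimensionali inventati (ipotesi illustrativa).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Semplificazione: Q = K = V = x, senza proiezioni apprese.
outputs, weights = self_attention(x, x, x)
```

Il ciclo sulle query è indipendente da un token all'altro: è questa proprietà che, nelle implementazioni reali, permette di eseguire il calcolo in parallelo come prodotto tra matrici.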

L'architettura Transformer è composta da blocchi fondamentali che includono meccanismi di attenzione multi-head, reti feed-forward e connessioni residue. Grazie a questa struttura, il modello può apprendere rappresentazioni linguistiche molto ricche e generalizzabili.
L'impatto scientifico fu immediato. Nel giro di pochi anni il Transformer divenne l'architettura dominante in quasi tutti i campi dell'intelligenza artificiale generativa. Traduzione automatica, generazione di testo, visione artificiale e persino biologia computazionale iniziarono a utilizzare varianti derivate da questa idea.
La vera importanza del Transformer non risiede soltanto nelle sue prestazioni, ma nella sua scalabilità. Per la prima volta era possibile addestrare modelli di dimensioni enormemente superiori rispetto al passato, sfruttando grandi cluster distribuiti. Questa caratteristica avrebbe dato origine alla successiva generazione di Large Language Models e trasformato profondamente l'intero settore dell'intelligenza artificiale.

L’ascesa dei Large Language Models
L'evoluzione dei Large Language Models è strettamente legata alla diffusione dell'architettura Transformer. Una volta dimostrata l'efficacia della self-attention, i ricercatori iniziarono a esplorare la possibilità di addestrare modelli sempre più grandi su enormi quantità di testo raccolto dal Web, da libri digitalizzati e da altre fonti documentali.
Nel 2018 OpenAI presentò GPT-1, un modello relativamente piccolo secondo gli standard attuali, ma innovativo per il modo in cui utilizzava il pretraining generativo. L'idea era semplice e potente: addestrare il modello a prevedere il token successivo all'interno di grandi corpus testuali e successivamente adattarlo a compiti specifici tramite fine-tuning. Questo approccio dimostrò che una rappresentazione linguistica generale poteva essere riutilizzata con successo in molte applicazioni diverse.
L'anno successivo arrivò GPT-2, che attirò grande attenzione per la qualità dei testi generati. Per la prima volta un modello linguistico mostrava capacità narrative e argomentative sufficientemente convincenti da suscitare discussioni sui possibili rischi di abuso. Sebbene le sue dimensioni fossero modeste rispetto agli standard attuali, GPT-2 evidenziò il potenziale dei modelli generativi su larga scala.
Il vero punto di svolta fu GPT-3, pubblicato nel 2020. Con 175 miliardi di parametri, il modello dimostrò che l'aumento delle dimensioni poteva produrre capacità emergenti inattese. GPT-3 riusciva a eseguire compiti per i quali non era stato addestrato esplicitamente, sfruttando pochi esempi forniti direttamente nel prompt. Questo fenomeno contribuì a rafforzare la convinzione che la crescita delle dimensioni rappresentasse il principale motore del progresso.
Successivamente, all'inizio del 2022, OpenAI introdusse InstructGPT, un'evoluzione progettata per seguire meglio le istruzioni degli utenti. Il sistema utilizzava il Reinforcement Learning from Human Feedback (RLHF), una tecnica che integra valutazioni umane nel processo di ottimizzazione. L'obiettivo era rendere le risposte più utili, sicure e coerenti con le aspettative degli utenti.
Questa linea di ricerca culminò nel lancio di ChatGPT alla fine del 2022. Più che una semplice innovazione tecnica, ChatGPT rappresentò una svolta culturale. Per la prima volta milioni di persone poterono interagire direttamente con un modello linguistico avanzato attraverso un'interfaccia semplice e intuitiva. L'adozione rapidissima del servizio dimostrò che l'AI generativa era pronta per un utilizzo di massa.

Nel frattempo il panorama si stava ampliando. Meta pubblicò la famiglia LLaMA, dimostrando che modelli relativamente compatti potevano raggiungere prestazioni sorprendenti. Il rilascio dei pesi favorì una straordinaria crescita dell'ecosistema open source, accelerando la ricerca indipendente e lo sviluppo di applicazioni locali.
Successivamente emersero modelli come Mistral e Mixtral, caratterizzati da architetture efficienti e da un ottimo rapporto tra prestazioni e costo computazionale. Microsoft sviluppò la famiglia Phi, focalizzata sull'utilizzo di dati accuratamente selezionati piuttosto che sulla semplice crescita dimensionale. Google introdusse Gemma, mentre IBM rese disponibili i modelli Granite, orientati agli utilizzi aziendali e alla trasparenza del processo di sviluppo.
Questa evoluzione ha modificato profondamente il dibattito sugli LLM. Se nei primi anni la strategia dominante consisteva nell'aumentare continuamente le dimensioni dei modelli, oggi l'attenzione si sta spostando verso l'efficienza, la qualità dei dati, la specializzazione e la sostenibilità. I modelli più piccoli non vengono più considerati una semplice alternativa economica, ma una possibile direzione evolutiva dell'intero settore.
La storia degli LLM dimostra che il progresso dell'intelligenza artificiale non è il risultato di una singola innovazione, ma della convergenza di architetture, dati, potenza di calcolo e tecniche di allineamento. Comprendere questa evoluzione è essenziale per analizzare le sfide che il settore dovrà affrontare negli anni a venire, a partire da quella più urgente: il costo energetico della crescita degli LLM.

I limiti della crescita e la sfida energetica
Il successo dei Large Language Models è stato accompagnato da una convinzione che per alcuni anni ha guidato gran parte della ricerca nel settore: aumentando le dimensioni dei modelli, la quantità di dati utilizzati per l'addestramento e la potenza di calcolo disponibile, le prestazioni sarebbero migliorate in modo prevedibile. Questa idea è stata formalizzata nelle cosiddette scaling laws, studi che hanno analizzato il rapporto tra dimensione dei modelli e capacità di apprendimento.
Le ricerche condotte da OpenAI e successivamente da DeepMind hanno mostrato che l'aumento dei parametri produce effettivamente miglioramenti significativi in numerosi compiti linguistici. Per diversi anni questa evidenza ha spinto laboratori e aziende a costruire modelli sempre più grandi, dando origine a una competizione che ha portato alla nascita di sistemi con centinaia di miliardi di parametri.
Tuttavia, la crescita dimensionale ha rapidamente evidenziato un problema fondamentale: il costo. Addestrare un LLM richiede enormi quantità di energia, hardware specializzato e tempo di calcolo. I moderni cluster utilizzati per il training possono comprendere migliaia di GPU collegate tra loro da reti ad altissima velocità. Il consumo energetico di queste infrastrutture è paragonabile a quello di piccole comunità urbane e richiede sistemi di raffreddamento sempre più sofisticati.
Un contributo importante al dibattito arrivò nel 2022 con il modello Chinchilla sviluppato da DeepMind. I ricercatori dimostrarono che molti modelli erano stati addestrati utilizzando una quantità insufficiente di dati rispetto al numero di parametri. In altre parole, non era sempre conveniente aumentare le dimensioni della rete; in molti casi risultava più efficace utilizzare dataset più grandi e meglio bilanciati. Questo risultato contribuì a mettere in discussione l'idea che il progresso dipendesse esclusivamente dalla crescita dimensionale.
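Per dare un ordine di grandezza al risultato di Chinchilla si può ricorrere a due approssimazioni molto diffuse, che non compaiono nel testo e vanno prese come ipotesi: il costo di addestramento stimato come C ≈ 6 · N · D (N parametri, D token) e la regola empirica di circa 20 token per parametro. Lo schizzo seguente, con nomi di funzione scelti solo a scopo illustrativo, ricava da un budget di calcolo la coppia (N, D) corrispondente.

```python
import math

TOKENS_PER_PARAM = 20        # regola empirica approssimativa (ipotesi)
FLOPS_PER_PARAM_TOKEN = 6    # approssimazione comune: C ≈ 6 * N * D (ipotesi)

def training_flops(n_params, n_tokens):
    """Stima approssimativa delle operazioni necessarie per l'addestramento."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def compute_optimal(flops_budget):
    # Con D = 20 * N si ha C = 6 * 20 * N^2, quindi N = sqrt(C / 120).
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

# Configurazione di Chinchilla: circa 70 miliardi di parametri e 1,4 bilioni di token.
budget = training_flops(70e9, 1.4e12)
n_opt, d_opt = compute_optimal(budget)

# Per confronto, GPT-3: 175 miliardi di parametri addestrati su circa 300 miliardi di token.
gpt3_tokens_per_param = 300e9 / 175e9
```

Con queste ipotesi il budget di Chinchilla restituisce proprio la sua configurazione, mentre GPT-3 risulta addestrato con meno di due token per parametro: un esempio del sottoutilizzo dei dati descritto dai ricercatori.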

Parallelamente è emerso il concetto di Green AI, introdotto per promuovere una valutazione più ampia dei sistemi di intelligenza artificiale. Secondo questa prospettiva, l'accuratezza non dovrebbe essere l'unico criterio di giudizio. È necessario considerare anche il consumo energetico, i costi computazionali, la replicabilità degli esperimenti e l'impatto ambientale complessivo delle tecnologie sviluppate.
Il tema non riguarda soltanto l'addestramento. Anche l'inferenza, cioè l'utilizzo quotidiano dei modelli da parte degli utenti, richiede risorse significative. Ogni richiesta inviata a un chatbot attiva processi computazionali distribuiti all'interno di grandi datacenter. Con centinaia di milioni di utenti e miliardi di richieste giornaliere, il consumo energetico complessivo può diventare estremamente rilevante.
A questi aspetti si aggiungono ulteriori fattori. I datacenter richiedono sistemi di raffreddamento, infrastrutture di rete, apparati di alimentazione e continui aggiornamenti hardware. Inoltre, la produzione delle componenti elettroniche comporta l'utilizzo di materiali critici e processi industriali ad alta intensità energetica. L'impatto ambientale dell'intelligenza artificiale deve quindi essere valutato considerando l'intero ciclo di vita delle infrastrutture coinvolte.
Le implicazioni economiche sono altrettanto importanti. Soltanto poche organizzazioni dispongono delle risorse necessarie per addestrare modelli di frontiera. Questa concentrazione rischia di ridurre la diversità dell'ecosistema tecnologico e di aumentare la dipendenza da un numero limitato di fornitori. Per molte aziende e istituzioni, l'accesso alle tecnologie più avanzate avviene esclusivamente attraverso servizi cloud gestiti da grandi operatori internazionali.
Di fronte a queste sfide, il settore sta progressivamente cambiando prospettiva. Sempre più ricercatori ritengono che il futuro dell'AI non dipenderà soltanto dalla costruzione di modelli più grandi, ma dalla capacità di migliorare l'efficienza complessiva dei sistemi. L'obiettivo è ottenere prestazioni elevate riducendo il numero di parametri, il consumo energetico e il costo operativo.
La questione della sostenibilità non rappresenta quindi un ostacolo al progresso dell'intelligenza artificiale. Al contrario, potrebbe diventare il principale motore della sua prossima fase evolutiva. L'efficienza sta emergendo come una nuova metrica di innovazione, destinata ad affiancare e in alcuni casi a sostituire la semplice crescita della potenza computazionale.

La via dell’efficienza e dei modelli compatti
La crescente attenzione verso i costi energetici e computazionali degli LLM ha favorito la nascita di una nuova filosofia progettuale. Invece di inseguire esclusivamente la crescita delle dimensioni, molti gruppi di ricerca hanno iniziato a concentrarsi sulla qualità dei dati, sull'efficienza delle architetture e sull'ottimizzazione dell'addestramento.
Da questo approccio sono nati i cosiddetti Small Language Models (SLM), modelli più compatti progettati per operare con risorse limitate. L'obiettivo non è sostituire completamente i grandi modelli generalisti, ma offrire soluzioni più accessibili e sostenibili per specifici contesti applicativi.
Uno degli aspetti che ha contribuito maggiormente a questa evoluzione riguarda la qualità dei dataset. Per molti anni l'attenzione si è concentrata sulla raccolta di quantità sempre maggiori di testo proveniente dal Web. Con il tempo è emerso che la selezione accurata delle informazioni può produrre benefici comparabili o superiori all'aumento indiscriminato dei dati.
A partire da raccolte grezze come Common Crawl, dataset come The Pile, RefinedWeb e FineWeb hanno introdotto procedure sempre più sofisticate di filtraggio, deduplicazione e controllo della qualità. Lo scopo è eliminare contenuti ridondanti, errori, spam e informazioni poco affidabili, migliorando l'efficacia dell'addestramento.

Un'altra tendenza significativa riguarda l'utilizzo di dati sintetici. Grazie agli stessi modelli linguistici è possibile generare nuovi esempi di addestramento, ampliare dataset esistenti e creare materiale specializzato per compiti specifici. Sebbene questa strategia richieda attenzione per evitare fenomeni di degrado della qualità, rappresenta uno strumento sempre più importante per la costruzione di modelli efficienti.
Un caso particolarmente interessante è rappresentato dalla famiglia Phi sviluppata da Microsoft. Questi modelli hanno dimostrato che dataset accuratamente selezionati e procedure di addestramento ottimizzate possono consentire a modelli relativamente piccoli di ottenere risultati competitivi rispetto a sistemi molto più grandi.

L'efficienza non riguarda soltanto l'addestramento iniziale. Tecniche di fine-tuning come LoRA (Low-Rank Adaptation) permettono di adattare modelli preaddestrati a nuovi compiti modificando soltanto una piccola parte dei parametri. Questo approccio riduce drasticamente i costi di personalizzazione e rende accessibile l'uso degli LLM anche a organizzazioni prive di grandi infrastrutture.
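Il risparmio si può quantificare con un calcolo elementare. In LoRA la matrice dei pesi W, di dimensione d_out × d_in, resta congelata e viene corretta da un prodotto di due matrici a basso rango, W' = W + B·A (a meno di un fattore di scala), dove B ha dimensione d_out × r e A ha dimensione r × d_in. I valori usati qui sotto (uno strato 4096 × 4096 e rango 8) sono ipotesi scelte solo a titolo di esempio.

```python
def full_params(d_out, d_in):
    """Parametri addestrabili con un fine-tuning completo dello strato."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    # B ha forma (d_out, rank) e A ha forma (rank, d_in).
    return rank * (d_out + d_in)

# Dimensioni e rango ipotetici, scelti solo per l'esempio.
d_out = d_in = 4096
rank = 8

full = full_params(d_out, d_in)        # 16_777_216
lora = lora_params(d_out, d_in, rank)  # 65_536
fraction = lora / full                 # 0.00390625, cioè circa lo 0,4%
```

In questo scenario i parametri da addestrare per lo strato scendono a meno dello 0,4% del totale, il che riduce in proporzione la memoria richiesta per gradienti e stati dell'ottimizzatore.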
Successivamente sono emerse varianti ancora più efficienti, come QLoRA, che combinano fine-tuning e quantizzazione per ridurre ulteriormente il consumo di memoria. Queste tecniche consentono di lavorare con modelli avanzati utilizzando hardware relativamente economico.

Anche i metodi di allineamento stanno evolvendo rapidamente. Il Reinforcement Learning from Human Feedback ha svolto un ruolo fondamentale nello sviluppo dei chatbot moderni, ma richiede processi complessi e costosi. Approcci più recenti, come Direct Preference Optimization (DPO), cercano di ottenere risultati simili attraverso procedure più semplici e meno onerose dal punto di vista computazionale.
L'insieme di queste innovazioni suggerisce una conclusione importante. Le prestazioni di un modello non dipendono esclusivamente dal numero di parametri. La qualità dei dati, l'efficienza dell'architettura e le strategie di addestramento possono influenzare in modo determinante il risultato finale.
Questa consapevolezza sta contribuendo a ridefinire le priorità della ricerca. In questo nuovo contesto, modelli più piccoli e intelligenti potrebbero rappresentare una delle direzioni più promettenti per il futuro dell'intelligenza artificiale.

La conoscenza esterna e il paradigma RAG
Uno dei limiti fondamentali dei modelli linguistici riguarda la natura della loro memoria. Le informazioni apprese durante l'addestramento vengono incorporate nei parametri della rete neurale e non possono essere aggiornate facilmente senza eseguire nuove procedure di training. Questo rende difficile mantenere un modello costantemente allineato a dati recenti o specializzati.
Per affrontare questo problema è nata la Retrieval-Augmented Generation (RAG), una tecnica che combina modelli linguistici e sistemi di recupero delle informazioni. L'idea è semplice: invece di affidarsi esclusivamente alla conoscenza memorizzata nei parametri, il modello può consultare documenti esterni al momento della generazione della risposta.
Alla base di questo approccio si trovano gli embedding, rappresentazioni numeriche che trasformano parole, frasi o documenti in vettori matematici. Grazie a queste rappresentazioni è possibile confrontare semanticamente contenuti differenti e identificare quelli più pertinenti rispetto a una determinata richiesta.
I documenti vengono generalmente indicizzati tramite librerie di ricerca vettoriale come FAISS o database vettoriali come Milvus e Qdrant. Quando un utente formula una domanda, il sistema ricerca i contenuti più rilevanti e li fornisce al modello come contesto aggiuntivo. In questo modo la risposta può basarsi su informazioni aggiornate senza modificare i parametri della rete neurale.
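Il flusso appena descritto può essere abbozzato senza alcuna libreria esterna. Nello schizzo seguente gli embedding sono vettori inventati a tre dimensioni, l'indice è una semplice lista in memoria e i nomi `retrieve` e `build_prompt` sono ipotetici. In un sistema reale i vettori sarebbero prodotti da un modello di embedding e la ricerca sarebbe affidata a un indice vettoriale dedicato.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, top_k=2):
    """Restituisce i top_k testi più simili alla query (ricerca esaustiva)."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

def build_prompt(question, passages):
    # I passaggi recuperati vengono anteposti alla domanda come contesto.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Contesto:\n{context}\n\nDomanda: {question}\nRisposta:"

# Indice giocattolo: (testo, embedding). Valori inventati per l'esempio.
index = [
    ("Manuale di manutenzione della pompa P-12", [0.9, 0.1, 0.0]),
    ("Politica aziendale sulle ferie", [0.0, 0.2, 0.9]),
    ("Scheda tecnica della pompa P-12", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # embedding ipotetico della domanda
passages = retrieve(query_vec, index, top_k=2)
prompt = build_prompt("Ogni quanto va revisionata la pompa P-12?", passages)
```

Aggiornare la conoscenza del sistema equivale ad aggiungere o sostituire voci in `index`: i parametri del modello linguistico non vengono toccati.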
Un elemento cruciale del processo è il retrieval, ovvero la capacità di recuperare rapidamente i documenti più utili. La qualità di questa fase influisce direttamente sull'accuratezza delle risposte generate. Per migliorare ulteriormente i risultati vengono spesso utilizzati sistemi di reranking, che rivalutano i documenti recuperati, li riordinano in base alla pertinenza e passano al modello soltanto i più utili.

Dal punto di vista della sostenibilità, il RAG offre vantaggi particolarmente interessanti. Un modello relativamente piccolo può accedere a enormi basi documentali senza dover incorporare tutta la conoscenza nei propri parametri. Questo consente di ridurre le dimensioni della rete e il costo computazionale mantenendo elevata la qualità delle risposte.
La tecnica è oggi ampiamente utilizzata in contesti aziendali, dove l'accesso a documentazione interna aggiornata rappresenta un requisito fondamentale. Invece di addestrare continuamente nuovi modelli, le organizzazioni possono aggiornare semplicemente le basi documentali consultate dal sistema.
Questa integrazione rappresenta uno dei principali strumenti per costruire sistemi efficienti, aggiornabili e sostenibili.

Compressione, quantizzazione e architetture sparse
Con l'aumento delle dimensioni degli LLM, la compressione è diventata una delle aree di ricerca più importanti dell'intelligenza artificiale moderna. L'obiettivo è ridurre il costo computazionale senza compromettere in modo significativo le prestazioni del modello.
Una delle tecniche più diffuse è la quantizzazione. Normalmente i parametri di una rete neurale vengono rappresentati utilizzando numeri in virgola mobile ad alta precisione. La quantizzazione riduce il numero di bit necessari per memorizzare tali valori, consentendo di diminuire il consumo di memoria e accelerare l'inferenza. Formati come INT8, INT4 e più recentemente FP8 sono diventati strumenti fondamentali per l'esecuzione efficiente degli LLM.
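Un esempio minimo aiuta a fissare l'idea. Lo schizzo seguente applica una quantizzazione simmetrica a INT8 con un unico fattore di scala per l'intero vettore, che è solo una delle varianti possibili. I pesi sono inventati e la funzione `model_size_gb` è un'aggiunta ipotetica per stimare la memoria occupata dai soli parametri.

```python
def quantize_int8(weights):
    # Scala simmetrica: il valore assoluto massimo viene mappato su 127.
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Ricostruisce valori approssimati a partire dagli interi.
    return [v * scale for v in q]

def model_size_gb(n_params, bits):
    """Memoria dei soli pesi, in gigabyte (10^9 byte)."""
    return n_params * bits / 8 / 1e9

weights = [0.42, -1.27, 0.05, 0.9, -0.33]  # pesi inventati per l'esempio
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Modello ipotetico da 7 miliardi di parametri a 16 e a 4 bit.
size_fp16 = model_size_gb(7e9, 16)
size_int4 = model_size_gb(7e9, 4)
```

L'errore introdotto dall'arrotondamento è al più metà del fattore di scala. Per il modello ipotetico da 7 miliardi di parametri, il passaggio da 16 a 4 bit riduce la memoria dei pesi da 14 GB a 3,5 GB, senza contare l'overhead dei fattori di scala.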
Un approccio complementare è il pruning, che consiste nell'eliminazione dei pesi considerati meno rilevanti. Molte reti neurali contengono infatti una quantità significativa di parametri che contribuiscono in misura limitata al risultato finale. Rimuovendo questi elementi è possibile ridurre dimensioni e consumi mantenendo prestazioni comparabili.
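Il criterio più semplice per decidere quali pesi eliminare è la loro ampiezza: si azzerano quelli con valore assoluto più piccolo. Lo schizzo seguente mostra questo magnitude pruning non strutturato su un vettore di pesi inventati. È una delle molte strategie possibili e non tiene conto del riaddestramento che di norma segue la potatura.

```python
def magnitude_prune(weights, sparsity):
    """Azzera la frazione `sparsity` dei pesi con valore assoluto più piccolo."""
    n_prune = int(len(weights) * sparsity)
    # Indici ordinati per valore assoluto crescente.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

weights = [0.8, -0.02, 0.3, 0.01, -0.6, 0.05, 0.9, -0.1]  # valori inventati
pruned = magnitude_prune(weights, sparsity=0.5)
# pruned == [0.8, 0.0, 0.3, 0.0, -0.6, 0.0, 0.9, 0.0]
```

Il vantaggio effettivo in termini di velocità dipende dal supporto dell'hardware e delle librerie per le matrici sparse: azzerare i pesi riduce la dimensione del modello soltanto se la rappresentazione in memoria sfrutta gli zeri.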
La ricerca ha inoltre esplorato il concetto di sparsità. Nei modelli sparsi non tutti i parametri vengono utilizzati contemporaneamente durante il processo di inferenza. Questo consente di diminuire il numero effettivo di operazioni necessarie per generare una risposta, migliorando l'efficienza complessiva.
Tra le architetture più interessanti emerse negli ultimi anni figurano le Mixture of Experts (MoE). In questi sistemi il modello è composto da diverse sottoreti specializzate, chiamate esperti. Per ogni richiesta viene attivata soltanto una parte degli esperti disponibili, riducendo il carico computazionale pur mantenendo una capacità rappresentativa molto elevata.
Mixtral, sviluppato da Mistral AI, rappresenta uno degli esempi più noti di questa filosofia. Sebbene il numero totale di parametri sia elevato, soltanto una frazione viene utilizzata durante ogni fase di inferenza. Questo approccio offre un compromesso particolarmente efficace tra prestazioni e costi operativi.
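Il meccanismo di selezione può essere illustrato con uno schizzo di routing top-k. Gli otto esperti e la scelta dei due migliori riprendono la configurazione resa nota da Mixtral 8x7B, ma tutto il resto è un'ipotesi semplificata: gli esperti sono funzioni scalari giocattolo e i punteggi del router sono forniti a mano, mentre in un modello reale sono calcolati per ogni token da uno strato appreso.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, router_logits, experts, top_k=2):
    # Seleziona gli indici dei top_k esperti con punteggio più alto.
    top = sorted(range(len(experts)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Normalizza i pesi soltanto sugli esperti selezionati.
    gates = softmax([router_logits[i] for i in top])
    # Solo gli esperti scelti vengono effettivamente eseguiti.
    output = sum(g * experts[i](x) for g, i in zip(gates, top))
    return output, top

# Otto esperti giocattolo: l'esperto k moltiplica l'ingresso per k.
experts = [lambda x, k=k: k * x for k in range(1, 9)]
# Punteggi del router inventati per l'esempio.
router_logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]
y, active = moe_forward(3.0, router_logits, experts)
```

In questo esempio vengono valutati soltanto due esperti su otto: il costo di calcolo per token dipende da `top_k`, mentre la capacità complessiva del modello dipende dal numero totale di esperti.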
L'importanza di queste tecniche va oltre il semplice miglioramento delle prestazioni. Compressione e ottimizzazione consentono infatti di eseguire modelli avanzati su workstation, server aziendali e persino dispositivi edge. Ciò favorisce la diffusione dell'intelligenza artificiale in contesti dove l'accesso a grandi infrastrutture cloud non è praticabile o desiderabile.

L’hardware riconfigurabile e il ritorno al silicio
Le GPU hanno svolto un ruolo fondamentale nella diffusione del deep learning. Originariamente progettate per l'elaborazione grafica, si sono rivelate estremamente efficaci nell'esecuzione delle operazioni matriciali necessarie alle reti neurali. Aziende come NVIDIA hanno costruito gran parte del proprio successo sulla crescente domanda di acceleratori per l'AI.
Accanto alle GPU sono emerse altre categorie di dispositivi specializzati. Le TPU sviluppate da Google rappresentano uno dei primi esempi di hardware progettato specificamente per il machine learning. Questi acceleratori consentono di ottenere elevata efficienza energetica in applicazioni particolari, soprattutto all'interno di infrastrutture cloud.
In questo panorama gli FPGA occupano una posizione peculiare. A differenza delle GPU, che possiedono un'architettura relativamente fissa, gli FPGA possono essere riconfigurati per implementare circuiti personalizzati. Questo permette di adattare l'hardware alle caratteristiche specifiche di un algoritmo.
La programmabilità degli FPGA è stata tradizionalmente considerata complessa, poiché richiedeva competenze di progettazione digitale e linguaggi hardware come Verilog o VHDL. Negli ultimi anni la situazione è cambiata grazie agli strumenti di High-Level Synthesis (HLS), che consentono di descrivere algoritmi utilizzando linguaggi più vicini alla programmazione tradizionale.
Aziende come AMD, dopo l'acquisizione di Xilinx, e Intel hanno investito significativamente nello sviluppo di ecosistemi dedicati all'AI. Strumenti come Vitis AI permettono di ottimizzare modelli neurali per l'esecuzione su FPGA, riducendo la distanza tra sviluppo software e implementazione hardware.
L'interesse verso queste soluzioni è particolarmente forte nel settore edge. In applicazioni industriali, sistemi embedded, robotica e Internet of Things, l'elaborazione locale offre vantaggi importanti in termini di latenza, sicurezza e affidabilità. In questi contesti l'efficienza energetica diventa spesso più importante della massima prestazione assoluta.
Il passaggio dall'algoritmo al silicio rappresenta quindi una delle trasformazioni più significative dell'AI contemporanea. La progettazione congiunta di software e hardware sta emergendo come uno dei principali strumenti per superare i limiti energetici e operativi delle architetture tradizionali.

Sovranità digitale e autonomia tecnologica
L'evoluzione degli LLM non è soltanto una questione tecnica. Le scelte relative alle infrastrutture, ai modelli e ai dati influenzano aspetti economici, sociali e geopolitici sempre più rilevanti. Per questo motivo il concetto di sovranità digitale è diventato uno dei temi centrali nel dibattito sull'intelligenza artificiale.

Molte organizzazioni utilizzano oggi servizi di AI ospitati nel cloud. Questo approccio offre vantaggi evidenti in termini di semplicità operativa e accesso a modelli avanzati, ma comporta anche una dipendenza significativa da fornitori esterni. I dati elaborati possono attraversare confini nazionali, essere soggetti a normative differenti o essere gestiti da infrastrutture sulle quali gli utenti finali hanno un controllo limitato.
La possibilità di eseguire modelli localmente rappresenta quindi una soluzione strategica per numerosi settori. Sanità, finanza, pubblica amministrazione, difesa e industria manifatturiera gestiscono spesso informazioni sensibili che richiedono elevati livelli di controllo e riservatezza. In questi contesti l'inferenza on-premise può ridurre i rischi associati alla trasmissione dei dati verso piattaforme esterne.
L'ecosistema open source svolge un ruolo fondamentale in questa trasformazione. La disponibilità di modelli aperti consente a organizzazioni e ricercatori di studiare, modificare e adattare le tecnologie alle proprie esigenze. Questo favorisce l'innovazione e riduce la dipendenza da un numero ristretto di fornitori globali.
Anche il quadro normativo sta assumendo un'importanza crescente. L'Unione Europea, attraverso l'AI Act, ha introdotto una delle prime regolamentazioni organiche dedicate all'intelligenza artificiale. L'obiettivo è promuovere l'innovazione garantendo al tempo stesso trasparenza, sicurezza e tutela dei diritti fondamentali. In questo contesto, la possibilità di comprendere e controllare il funzionamento dei modelli acquisisce un valore strategico.

Parallelamente sta emergendo una nuova generazione di Small Language Models progettati per applicazioni specializzate. Questi sistemi dimostrano che molte attività possono essere svolte efficacemente senza ricorrere a modelli giganteschi. La combinazione di architetture compatte, tecniche di retrieval e ottimizzazione hardware rende possibile la realizzazione di soluzioni altamente efficienti.
Anche l'hardware continuerà a evolvere. Oltre agli FPGA stanno acquisendo importanza NPU, ASIC dedicati e acceleratori progettati specificamente per l'inferenza neurale. L'obiettivo comune è aumentare l'efficienza energetica e ridurre il costo computazionale delle applicazioni di AI.

La convergenza tra modelli più piccoli, dati di qualità, tecniche di compressione e hardware specializzato suggerisce una possibile direzione per il futuro. Invece di concentrare tutte le risorse in pochi sistemi enormi, l'intelligenza artificiale potrebbe evolvere verso un ecosistema distribuito composto da modelli specializzati, eseguibili localmente e integrati direttamente nei processi operativi delle organizzazioni.
In questa prospettiva la sostenibilità non rappresenta soltanto una necessità ambientale, ma un fattore abilitante dell'innovazione. Un'intelligenza artificiale più efficiente consuma meno risorse, riduce le barriere di accesso e rende possibile una diffusione più ampia delle tecnologie avanzate. La ricerca della frugalità computazionale potrebbe quindi diventare uno degli elementi distintivi della prossima fase evolutiva dell'AI.

Generative artificial intelligence represents one of the most significant technological transformations of the early twenty-first century. The widespread adoption of systems such as ChatGPT has brought into the public spotlight tools capable of understanding natural language, generating coherent text, assisting with code development, synthesizing information, and supporting cognitive activities that until only a few years ago appeared to be exclusively human. Within a matter of months, generative AI evolved from a topic reserved for specialists into a technology used daily by millions of people.
At the core of this revolution are Large Language Models (LLMs), neural networks trained on vast quantities of textual data. Their development has required unprecedented investments in computational infrastructure, energy, and processing capacity. For many years, progress in the field was driven by a seemingly simple assumption: increasing the number of parameters, the volume of training data, and the amount of computational power would steadily improve model performance. This strategy has indeed produced extraordinary results, but it has also revealed increasingly evident economic, energy-related, and operational limitations.

Training the most advanced models now requires large GPU clusters, substantial energy consumption, and financial resources accessible to only a handful of global organizations. At the same time, companies, institutions, and research centers have begun to question the sustainability of this paradigm. Is it truly necessary to build ever-larger models? Are there alternative approaches capable of delivering high performance with significantly fewer resources?
The answer has gradually emerged through new design strategies. Data quality has become just as important as data quantity. Techniques such as quantization, pruning, efficient fine-tuning, and Retrieval-Augmented Generation have demonstrated that competitive results can be achieved without continuously increasing model size. At the same time, the growth of the open-source ecosystem has opened up technologies that until recently were reserved for major industrial research laboratories.

This evolution has brought the concept of technological autonomy back to the forefront. The ability to run models locally, within organizations or infrastructures directly controlled by users, offers important advantages in terms of privacy, security, and independence from cloud providers. In this context, the relationship between software and hardware is becoming increasingly important. Efficiency depends not only on algorithms but also on the ability to design specialized computational architectures.
Among the technologies attracting growing interest are FPGAs (Field-Programmable Gate Arrays), reconfigurable electronic devices that make it possible to implement dedicated accelerators for specific workloads. Their flexibility makes them particularly attractive for local inference of optimized language models, paving the way for a new generation of sustainable and energy-efficient AI systems.

From Neural Networks to the Transformer
To understand the emergence of modern language models, it is necessary to briefly retrace the evolution of artificial neural networks. The earliest concepts date back to the 1950s, when Frank Rosenblatt's perceptron attempted to simulate, in simplified form, the functioning of biological neurons. Despite initial enthusiasm, the computational limitations of the time and several theoretical challenges slowed the development of the field for decades.
The situation began to change gradually during the 1990s and especially from the late 2000s onward, thanks to increased computational power, the availability of massive datasets, and the adoption of GPUs as accelerators for deep learning. By the early 2010s, deep neural networks were achieving increasingly impressive results in fields such as computer vision, speech recognition, and natural language processing.
One of the most challenging problems involved the handling of sequential data. Unlike images, language is characterized by a temporal structure in which each word depends on the preceding context. To address this challenge, Recurrent Neural Networks (RNNs) were developed, specifically designed to process information sequentially while maintaining a form of internal memory.
RNNs represented a major step forward, but they suffered from significant limitations. During training, the gradients propagated back through many time steps tend to shrink toward zero, a problem known as the vanishing gradient, which makes it hard for the network to learn dependencies that span long sequences. To mitigate these difficulties, more sophisticated architectures such as Long Short-Term Memory (LSTM) networks and later Gated Recurrent Units (GRUs) were introduced. These models improved the ability to learn long-range dependencies and dominated the natural language processing landscape for many years.
Despite these advances, recurrent architectures still faced a fundamental limitation: the sequential nature of computation. Each element in a sentence had to be processed after the previous one, limiting the ability to fully exploit the parallelism offered by modern hardware.
The breakthrough came with the concept of attention. In 2014, Dzmitry Bahdanau and colleagues introduced a mechanism that enabled models to dynamically focus on the most relevant parts of a sequence during machine translation. The idea proved extraordinarily powerful and paved the way for a new generation of architectures.
In 2017, a team of Google researchers published the landmark paper *Attention Is All You Need*. The work introduced a completely new architecture, known as the Transformer, which eliminated recurrence entirely and relied exclusively on the self-attention mechanism. Instead of processing words one at a time, the Transformer could analyze an entire sequence simultaneously, identifying relationships among all elements of the text.
Self-attention enables each token to evaluate the relative importance of every other token in the sequence. This approach allows the model to capture very long-range linguistic dependencies while efficiently leveraging GPU parallelism. The result was a significant improvement in both performance and training speed.
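The mechanism described above can be sketched in a few lines. This is a minimal illustration of scaled dot-product self-attention in plain Python, under a simplifying assumption: the token vectors are used directly as queries, keys, and values, whereas a real Transformer first applies learned projection matrices. The function names and the toy vectors are illustrative, not taken from any library.

```python
import math

def softmax(xs):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy "token" vectors; the same vectors act as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to one and shows how strongly one token attends to the others. Because every row can be computed independently, the whole operation maps naturally onto parallel hardware.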

The Transformer architecture consists of fundamental building blocks that include multi-head attention mechanisms, feed-forward networks, and residual connections. Thanks to this structure, the model can learn highly expressive and generalizable linguistic representations.
Its scientific impact was immediate. Within just a few years, the Transformer became the dominant architecture across virtually every field of generative artificial intelligence. Machine translation, text generation, computer vision, and even computational biology began adopting variants derived from this idea.
The true significance of the Transformer lies not only in its performance but also in its scalability. For the first time, it became possible to train models of unprecedented size by leveraging large distributed computing clusters. This capability would give rise to the next generation of Large Language Models and profoundly transform the entire field of artificial intelligence.

The Rise of Large Language Models
The evolution of Large Language Models is closely tied to the emergence of the Transformer architecture. Once the effectiveness of self-attention had been demonstrated, researchers began exploring the possibility of training increasingly large models on massive quantities of text collected from the web, digitized books, and other documentary sources.
In 2018, OpenAI introduced GPT-1, a model that would be considered relatively small by today's standards but was innovative in its use of generative pretraining. The idea was both simple and powerful: train the model to predict the next token within large text corpora and subsequently adapt it to specific tasks through fine-tuning. This approach demonstrated that a general linguistic representation could be successfully reused across a wide range of applications.
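As a rough illustration of the next-token objective, the sketch below computes the average cross-entropy loss that generative pretraining minimizes. The probability tables are invented toy values over a hypothetical four-token vocabulary; in a real model they would come from the network's output layer.

```python
import math

def next_token_loss(predicted_probs, target_ids):
    """Average negative log-likelihood of the correct next tokens."""
    total = 0.0
    for probs, target in zip(predicted_probs, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)

# Toy predictions for three positions over a 4-token vocabulary (assumed values).
uniform = [[0.25, 0.25, 0.25, 0.25]] * 3
confident = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
targets = [0, 1, 3]

print(next_token_loss(uniform, targets))
print(next_token_loss(confident, targets))
```

A model that guesses uniformly scores a loss of log(4), while one that assigns high probability to the correct token scores lower. Pretraining consists of driving this quantity down across very large corpora.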
The following year, GPT-2 attracted considerable attention due to the quality of its generated text. For the first time, a language model displayed narrative and argumentative capabilities convincing enough to spark discussions about potential misuse. Although modest in size compared to modern standards, GPT-2 highlighted the potential of large-scale generative models.
The true turning point came with GPT-3, released in 2020. With 175 billion parameters, the model demonstrated that scaling could produce unexpected emergent capabilities. GPT-3 was able to perform tasks for which it had never been explicitly trained, relying solely on a few examples provided within the prompt, a capability known as few-shot or in-context learning. This phenomenon strengthened the belief that increasing model size was the primary driver of progress.
In early 2022, OpenAI introduced InstructGPT, an evolution designed to follow user instructions more effectively. The system employed Reinforcement Learning from Human Feedback (RLHF), a technique that trains a reward model on human preference rankings and then optimizes the language model against it. The goal was to produce responses that were more useful, safer, and better aligned with user expectations.
This line of research culminated in the launch of ChatGPT at the end of 2022. More than a technical innovation, ChatGPT represented a cultural turning point. For the first time, millions of people could interact directly with an advanced language model through a simple and intuitive interface. The service's rapid adoption demonstrated that generative AI was ready for mainstream use.

Meanwhile, the landscape was expanding. Meta released the LLaMA family, demonstrating that relatively compact models could achieve remarkable performance. The release of model weights fueled extraordinary growth within the open-source ecosystem, accelerating independent research and local deployment initiatives.
Subsequently, models such as Mistral and Mixtral emerged, characterized by highly efficient architectures and an excellent balance between performance and computational cost. Microsoft developed the Phi family, emphasizing carefully curated datasets rather than simple dimensional growth. Google introduced Gemma, while IBM released the Granite models, designed for enterprise applications and greater transparency in the development process.
This evolution has profoundly changed the debate surrounding LLMs. While the dominant strategy in the early years focused on continuously increasing model size, attention is now shifting toward efficiency, data quality, specialization, and sustainability. Smaller models are no longer viewed merely as lower-cost alternatives but as a potential evolutionary direction for the entire field.
The history of LLMs demonstrates that progress in artificial intelligence is not the result of a single breakthrough but of advances in architectures, data, and training methods building on one another. Understanding this evolution is essential for analyzing the challenges the field will face in the years ahead, beginning with the most pressing one: the energy cost of scaling large language models.

The Limits of Scaling and the Energy Challenge
The success of Large Language Models was accompanied by a belief that guided much of the industry's research for several years: increasing model size, training data volume, and available computational power would lead to predictable improvements in performance. This concept was formalized through the so-called scaling laws, empirical studies showing that a model's loss falls as a smooth power law in parameter count, dataset size, and training compute.
Research conducted by OpenAI and later by DeepMind showed that increasing parameter counts does indeed produce significant improvements across numerous language tasks. For several years, this evidence encouraged laboratories and companies to build increasingly large models, leading to a competitive race that resulted in systems containing hundreds of billions of parameters.
However, rapid growth soon revealed a fundamental problem: cost. Training an LLM requires enormous amounts of energy, specialized hardware, and computational time. Modern training clusters may include thousands of GPUs interconnected through ultra-high-speed networks. The energy consumption of such infrastructures is comparable to that of small urban communities and requires increasingly sophisticated cooling systems.
An important contribution to the debate arrived in 2022 with DeepMind's Chinchilla model. Researchers demonstrated that many existing models had been trained on too little data relative to their number of parameters. For a fixed compute budget, a smaller model trained on more tokens could outperform a larger one: the 70-billion-parameter Chinchilla surpassed the 280-billion-parameter Gopher, which had been trained with a comparable compute budget. In other words, increasing network size was not always the most effective strategy; in many cases, larger and better-balanced datasets produced superior results. This finding challenged the notion that progress depended solely on scaling model dimensions.
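The trade-off between parameters and data can be made concrete with some back-of-the-envelope arithmetic. The sketch below uses the widely cited approximation that training compute is about 6 × N × D floating-point operations (N parameters, D training tokens) and the publicly reported sizes of GPT-3 and Chinchilla. The helper names are illustrative, and the 6ND rule is an approximation rather than a measured cost.

```python
def training_flops(params, tokens):
    # Common rule of thumb: roughly 6 FLOPs per parameter per training token.
    return 6 * params * tokens

def tokens_per_param(params, tokens):
    return tokens / params

# (parameters, training tokens) as publicly reported for each model.
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (n, d) in models.items():
    print(f"{name}: {tokens_per_param(n, d):.1f} tokens/param, "
          f"~{training_flops(n, d):.2e} training FLOPs")
```

GPT-3 saw fewer than two tokens per parameter, while Chinchilla saw about twenty. That ratio is the practical reading of the Chinchilla result: for the same order of compute, a smaller model fed far more data can be the better investment.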

At the same time, the concept of Green AI emerged as a broader framework for evaluating artificial intelligence systems. According to this perspective, accuracy should not be the sole criterion of success. Energy consumption, computational cost, experimental reproducibility, and overall environmental impact must also be considered.
The issue extends beyond training. Inference—the day-to-day use of models by end users—also requires substantial resources. Every query sent to a chatbot activates distributed computational processes within large-scale data centers. With hundreds of millions of users and billions of requests each day, the cumulative energy demand becomes highly significant.
Additional factors further amplify the challenge. Data centers require cooling systems, networking infrastructure, power management equipment, and continuous hardware upgrades. Furthermore, the production of electronic components involves critical materials and energy-intensive industrial processes. The environmental impact of artificial intelligence must therefore be assessed across the entire lifecycle of the underlying infrastructure.
The economic implications are equally important. Only a limited number of organizations possess the resources necessary to train frontier models. This concentration risks reducing diversity within the technological ecosystem and increasing dependence on a small number of providers. For many businesses and institutions, access to state-of-the-art AI technologies is available only through cloud services operated by major international companies.
In response to these challenges, the industry is gradually changing its perspective. Increasingly, researchers believe that the future of AI will depend not only on building larger models but on improving overall system efficiency. The goal is to achieve high performance while reducing parameter counts, energy consumption, and operational costs.
Sustainability should therefore not be viewed as an obstacle to the advancement of artificial intelligence. On the contrary, it may become the primary driver of its next evolutionary phase. Efficiency is emerging as a new metric of innovation, destined to complement—and in some cases replace—the simple pursuit of greater computational power.

The Path Toward Efficiency and Compact Models
Growing concern over the energy and computational costs of LLMs has given rise to a new design philosophy. Rather than focusing exclusively on scale, many research groups have begun concentrating on data quality, architectural efficiency, and training optimization.
This approach has led to the development of so-called Small Language Models (SLMs), compact models designed to operate with limited resources. The objective is not to replace large general-purpose models entirely but to provide more accessible and sustainable solutions for specific application domains.
One of the factors that has contributed most significantly to this evolution is the increasing emphasis on dataset quality. For many years, attention focused on collecting ever-larger quantities of text from the web. Over time, it became evident that careful data selection could produce benefits comparable to—or even greater than—simply increasing dataset size.
Datasets such as The Pile, RefinedWeb, and FineWeb, the latter two built from raw Common Crawl web snapshots, introduced increasingly sophisticated procedures for filtering, deduplication, and quality control. Their purpose is to eliminate redundant content, errors, spam, and unreliable information, thereby improving training effectiveness.
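A heavily simplified version of such a cleaning pipeline might look like the sketch below. It combines exact deduplication on normalized text with two crude quality heuristics. The thresholds and function names are hypothetical, and production pipelines typically use stronger techniques such as fuzzy (near-duplicate) matching and learned quality classifiers.

```python
import hashlib
import re

def normalize(text):
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return re.sub(r"\s+", " ", text.lower()).strip()

def passes_quality(text, min_words=5, max_symbol_ratio=0.3):
    # Hypothetical heuristics: reject very short or symbol-heavy documents.
    if len(text.split()) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio

def clean_corpus(docs):
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen or not passes_quality(doc):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

docs = [
    "Transformers process whole sequences in parallel.",
    "transformers   process whole sequences in parallel.",  # duplicate after normalization
    "$$$ !!! ### buy now !!! $$$",                          # symbol-heavy spam
    "Too short.",
]
print(clean_corpus(docs))
```

Only the first document survives: the second is a duplicate once normalized, the third fails the symbol-ratio check, and the fourth is too short.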

Another important trend is the use of synthetic data. Language models themselves can generate new training examples, expand existing datasets, and create specialized material for specific tasks. Although this strategy requires careful management to avoid quality degradation, it is becoming an increasingly valuable tool for building efficient models.
A particularly interesting example is Microsoft's Phi family. These models have demonstrated that carefully curated datasets and optimized training procedures can enable relatively small models to achieve results competitive with systems many times larger.

Efficiency extends beyond initial training. Fine-tuning techniques such as LoRA (Low-Rank Adaptation) allow pretrained models to be adapted to new tasks by freezing the original weights and training only a small set of added low-rank matrices. This approach dramatically reduces customization costs and makes LLM adoption accessible even to organizations without extensive computational infrastructure.
More memory-efficient variants such as QLoRA have subsequently emerged. QLoRA keeps the frozen base model in 4-bit quantized form while training LoRA adapters on top of it, further reducing memory requirements. These techniques make it possible to work with advanced models using relatively inexpensive hardware.
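The saving offered by low-rank adaptation follows from simple arithmetic. In the sketch below, a frozen weight matrix W of shape d_out × d_in is complemented by two small trainable matrices, B (d_out × r) and A (r × d_in), and the layer output becomes Wx + (alpha / r) · B(Ax). The dimensions, rank, and scaling factor are illustrative assumptions, not values from any specific model.

```python
def full_params(d_out, d_in):
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    # B has d_out * rank entries, A has rank * d_in entries.
    return rank * (d_out + d_in)

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """h = W x + (alpha / rank) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]

# Parameter counts for a hypothetical 4096 x 4096 projection with rank 8.
print(full_params(4096, 4096), lora_params(4096, 4096, 8))

# Tiny numeric example with rank 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5], [0.0]]
print(lora_forward(W, A, B, [2.0, 3.0], alpha=1.0, rank=1))
```

For the 4096 × 4096 case, the adapter trains 65,536 values instead of 16,777,216, about 0.4% of the original matrix. When B is initialized to zeros, as is usual in LoRA, the adapted layer starts out identical to the pretrained one.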

Alignment methods are also evolving rapidly. Reinforcement Learning from Human Feedback played a crucial role in the development of modern chatbots, but it requires complex and costly processes. More recent approaches such as Direct Preference Optimization (DPO) aim to achieve similar outcomes through simpler and computationally less expensive procedures.
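To give a sense of why DPO is simpler, the sketch below computes its loss for a single preference pair. It needs only the log-probabilities that the model being trained and a frozen reference model assign to a preferred and a rejected response; no separate reward model or reinforcement learning loop is involved. The numeric inputs and the beta value are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair."""
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) written as log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))

# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy has shifted probability toward the preferred response.
print(dpo_loss(-3.0, -7.0, -5.0, -5.0))
```

The loss starts at log(2) when the policy matches the reference and decreases as the policy raises the likelihood of preferred responses relative to rejected ones.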
Together, these innovations point to an important conclusion: model performance does not depend exclusively on parameter count. Data quality, architectural efficiency, and training strategies can play a decisive role in determining outcomes.
This realization is helping redefine research priorities. In this new landscape, smaller and smarter models may represent one of the most promising directions for the future of artificial intelligence.

External Knowledge and the RAG Paradigm
One of the fundamental limitations of language models lies in the nature of their memory. The information acquired during training is embedded within the parameters of the neural network and cannot be easily updated without performing additional training procedures. This makes it difficult to keep a model continuously aligned with recent or highly specialized information.
To address this challenge, Retrieval-Augmented Generation (RAG) emerged as a technique that combines language models with information retrieval systems. The concept is straightforward: rather than relying exclusively on knowledge encoded in model parameters, the model can consult external documents at the moment a response is generated.
At the foundation of this approach are embeddings, numerical representations that transform words, sentences, or documents into mathematical vectors. These representations make it possible to compare different pieces of content semantically and identify those most relevant to a particular query.
Documents are typically indexed with vector search tools such as FAISS (a similarity-search library) or vector databases such as Milvus and Qdrant. When a user submits a question, the system retrieves the most relevant content and provides it to the language model as additional context. In this way, responses can incorporate up-to-date information without requiring any modification of the model's parameters.
A crucial component of the process is retrieval itself, that is, the ability to rapidly identify and access the most useful documents. The quality of this stage directly influences the accuracy of generated responses. To further improve results, reranking systems are often employed to reassess retrieved documents and select the most relevant subset.
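The retrieval step can be illustrated with a toy example. In the sketch below, a simple bag-of-words vector stands in for a real embedding model, and a sorted list stands in for a vector database; both are deliberate simplifications. The documents, the query, and the prompt template are invented for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in for an embedding model: word counts as a sparse vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    # Rank every document by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

documents = [
    "The expense policy was updated in March and caps hotel costs at 150 euros per night.",
    "FPGAs are reconfigurable devices used to build custom accelerators.",
    "Quantization stores model weights with fewer bits.",
]
query = "What is the cap on hotel costs?"
print(build_prompt(query, documents, k=1))
```

The resulting prompt is what would be passed to the language model. Updating the system's knowledge then means editing the document list, not retraining the model.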

From a sustainability perspective, RAG offers particularly attractive advantages. A relatively small model can access enormous knowledge repositories without needing to encode all that information within its parameters. This makes it possible to reduce model size and computational cost while maintaining high-quality outputs.
Today, the technique is widely used in enterprise environments, where access to continuously updated internal documentation is often essential. Rather than retraining models repeatedly, organizations can simply update the knowledge bases consulted by the system.
This integration represents one of the most effective tools currently available for building efficient, updatable, and sustainable AI systems.

Compression, Quantization, and Sparse Architectures
As LLMs have grown in size, model compression has become one of the most important research areas in modern artificial intelligence. The objective is to reduce computational cost without significantly compromising model performance.
One of the most widely adopted techniques is quantization. Neural network parameters are normally represented using high-precision floating-point numbers. Quantization reduces the number of bits required to store these values, lowering memory consumption and accelerating inference. Formats such as INT8, INT4, and more recently FP8 have become essential tools for the efficient deployment of LLMs.
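As a minimal sketch of the idea, the code below applies symmetric INT8 quantization to a handful of weights using a single scale factor, then estimates the memory footprint of a model at different precisions. The weight values and the 7-billion-parameter model size are illustrative assumptions. Real deployments usually quantize per channel or per block and rely on optimized kernels.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def model_size_gb(num_params, bits_per_param):
    return num_params * bits_per_param / 8 / 1e9

weights = [0.12, -0.5, 0.33, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)

# Memory needed just to store the weights of a hypothetical 7B-parameter model.
for bits in (16, 8, 4):
    print(bits, "bits:", model_size_gb(7e9, bits), "GB")
```

The rounding error per weight is bounded by half the scale, while storage drops from 14 GB at 16 bits to 7 GB at 8 bits and 3.5 GB at 4 bits for the hypothetical model.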
A complementary approach is pruning, which consists of removing weights considered less important. Many neural networks contain a substantial number of parameters that contribute only marginally to the final result. Eliminating these elements makes it possible to reduce model size and energy consumption while preserving comparable performance.
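One common criterion for deciding which weights are "less important" is their magnitude. The sketch below shows unstructured magnitude pruning on a flat list of weights. It is one of several possible pruning strategies, and the values and sparsity level are chosen only for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

In practice, pruning is often followed by a short round of fine-tuning to recover accuracy, and the zeros only translate into speed or energy savings when the runtime or hardware can exploit sparsity.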
Research has also explored entirely new concepts. In sparse models, not all parameters are activated simultaneously during inference. This reduces the effective number of operations required to generate a response, thereby improving overall efficiency.
Among the most interesting architectures to emerge in recent years are Mixture of Experts (MoE) models. In these systems, the model is composed of multiple specialized subnetworks, known as experts. For each token, a routing network activates only a subset of the available experts, reducing computational load while maintaining a very high representational capacity.
Mixtral, developed by Mistral AI, is one of the best-known examples of this philosophy. In Mixtral 8x7B, each token is routed to two of eight experts per layer, so only about 13 billion of the model's roughly 47 billion parameters are used at each inference step. This approach offers a particularly effective balance between performance and operational cost.
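The routing idea can be sketched as follows. A router assigns a score to each expert, only the top-k experts are executed, and their outputs are combined using softmax weights computed over the selected scores. In this toy version the experts are simple scaling functions and the router scores are supplied by hand; in a real model both are learned neural network layers.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    # Pick the k highest-scoring experts and normalize their scores into gates.
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))

def moe_layer(x, experts, router_logits, k=2):
    out = [0.0] * len(x)
    for idx, gate in top_k_route(router_logits, k):
        y = experts[idx](x)  # only the selected experts are ever executed
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

# Four toy experts that scale their input by 1, 2, 3, and 4 (illustrative only).
experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]
print(top_k_route(router_logits))
print(moe_layer([1.0, 2.0], experts, router_logits))
```

Here two of the four experts do all the work for this input, so the compute per token scales with k while the total capacity scales with the number of experts.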
The significance of these techniques extends beyond simple performance improvements. Compression and optimization make it possible to run advanced models on workstations, enterprise servers, and even edge devices. This promotes the adoption of artificial intelligence in environments where access to large cloud infrastructures is impractical or undesirable.

Reconfigurable Hardware and the Return to Silicon
GPUs have played a fundamental role in the rise of deep learning. Originally designed for graphics processing, they proved exceptionally effective at executing the matrix operations required by neural networks. Companies such as NVIDIA have built a significant portion of their success on the growing demand for AI accelerators.
Alongside GPUs, other categories of specialized hardware have emerged. Google's Tensor Processing Units (TPUs) were among the first examples of processors specifically designed for machine learning workloads. These accelerators can achieve outstanding energy efficiency in particular applications, especially within cloud environments.
Within this landscape, FPGAs occupy a unique position. Unlike GPUs, which feature relatively fixed architectures, FPGAs can be reconfigured to implement custom hardware circuits. This capability makes it possible to adapt hardware directly to the specific characteristics of a given algorithm.
Historically, FPGA programming was considered complex because it required expertise in digital design and hardware description languages such as Verilog and VHDL. In recent years, however, the situation has changed thanks to High-Level Synthesis (HLS) tools, which allow developers to describe algorithms using languages much closer to traditional software programming.
Companies such as AMD, following its acquisition of Xilinx, and Intel have invested heavily in AI-oriented FPGA ecosystems. Tools such as Vitis AI enable neural models to be optimized for FPGA deployment, reducing the gap between software development and hardware implementation.
Interest in these solutions is particularly strong in the edge computing sector. Industrial applications, embedded systems, robotics, and the Internet of Things all benefit from local processing capabilities that provide lower latency, greater security, and improved reliability. In these contexts, energy efficiency often becomes more important than achieving the highest possible peak performance.
The transition from algorithms to silicon therefore represents one of the most significant transformations in contemporary AI. The co-design of software and hardware is emerging as one of the most effective approaches for overcoming the energy and operational limitations of traditional computing architectures.

Digital Sovereignty and Technological Autonomy
The evolution of LLMs is not merely a technical matter. Decisions regarding infrastructure, models, and data increasingly influence economic, social, and geopolitical dynamics. For this reason, the concept of digital sovereignty has become one of the central themes in the broader debate surrounding artificial intelligence.

Today, many organizations rely on cloud-hosted AI services. While this approach offers clear advantages in terms of operational simplicity and access to advanced models, it also creates significant dependence on external providers. Processed data may cross national borders, become subject to different regulatory frameworks, or be managed within infrastructures over which end users have limited control.
The ability to run models locally therefore represents a strategic solution for many sectors. Healthcare, finance, public administration, defense, and manufacturing frequently handle sensitive information that requires high levels of control and confidentiality. In such environments, on-premises inference can significantly reduce the risks associated with transmitting data to external platforms.
The open-source ecosystem plays a crucial role in this transformation. The availability of open models allows organizations and researchers to study, modify, and adapt technologies to their specific needs. This fosters innovation while reducing dependence on a small number of global providers.
Regulation is also becoming increasingly important. Through the AI Act, the European Union has introduced one of the first comprehensive regulatory frameworks dedicated to artificial intelligence. The objective is to encourage innovation while ensuring transparency, safety, and the protection of fundamental rights. In this context, the ability to understand and control the behavior of AI models acquires strategic value.

At the same time, a new generation of Small Language Models is emerging, specifically designed for specialized applications. These systems demonstrate that many tasks can be performed effectively without relying on gigantic models. The combination of compact architectures, retrieval techniques, and hardware optimization makes it possible to develop highly efficient solutions.
Hardware will continue to evolve as well. Beyond FPGAs, Neural Processing Units (NPUs), dedicated ASICs, and specialized inference accelerators are becoming increasingly important. Their common objective is to improve energy efficiency while reducing the computational cost of AI applications.

The convergence of smaller models, high-quality data, compression techniques, and specialized hardware suggests a possible direction for the future. Rather than concentrating resources in a few enormous systems, artificial intelligence may evolve toward a distributed ecosystem composed of specialized models, locally deployable solutions, and technologies integrated directly into organizational workflows.
From this perspective, sustainability is not merely an environmental necessity but an enabling factor for innovation. More efficient artificial intelligence consumes fewer resources, lowers barriers to adoption, and enables broader access to advanced technologies. The pursuit of computational frugality may therefore become one of the defining characteristics of the next evolutionary phase of AI.
