iSRL

Name: Encyclopedia of Indian Food Ingredients
Creator: Lalitha A R
Published: 2026-02-11
License: https://creativecommons.org/licenses/by/4.0/

Food Allergens in India: Evidence, Regulation, and the State of Current Knowledge

Lalitha A R — Thu, 16 Apr 2026 00:00:00 GMT

1. What an allergen is

The immune system is a classification system. It encounters proteins, assesses them, and files them — safe or hostile. For most people, most of the time, the filing is accurate.

In some people, an ordinary food protein gets filed as hostile.¹ The protein itself is unchanged — digestible, stable, present in millions of meals daily. But the immune system has produced IgE antibodies against it, and every subsequent encounter triggers a response: urticaria, angioedema, anaphylaxis.² The classification is the allergy.

¹ The immune system producing IgE antibodies to a food protein is called sensitisation. Sensitisation is not the same as allergy — many people carry these antibodies without ever having a reaction. What converts sensitisation into clinical allergy is not fully understood.

² Urticaria is hives. Angioedema is swelling, typically of the lips, tongue, or throat. Anaphylaxis is a systemic reaction — blood pressure drops, airways narrow — that can be fatal within minutes without treatment.

Proteins that survive gastric digestion intact reach the immune system in a form it can respond to.³ Proteins that are heat-stable remain recognisable after cooking. Proteins that share structural features across species mean that sensitisation to one food may produce reactivity to others never directly encountered, a phenomenon called cross-reactivity.⁴

³ Pepsin, the main digestive enzyme in the stomach, breaks most proteins into fragments too small for the immune system to recognise. Proteins that resist this (pepsin-stable proteins) arrive in the gut intact, where immune cells encounter them directly.

⁴ Cross-reactivity is why someone allergic to one tree nut may react to others, or why sensitisation to a grass pollen can produce symptoms on eating certain fruits. The immune system is recognising a shared structural pattern, not the specific food.

Which proteins a population’s immune systems tend to misfile varies by geography, diet, and a set of environmental factors that are still being characterised. In the United States, peanut allergy affects roughly 1–2% of the population. In a large systematic study of Indian children, peanut sensitisation was 6.3% by serum-specific IgE. Probable clinical peanut allergy in the same cohort was approximately 0.03%.⁵

⁵ Serum-specific IgE measures IgE antibodies to a particular food protein in a blood sample. A positive result means the immune system has been exposed to that protein and produced antibodies — it does not mean the person will react if they eat the food. “Probable food allergy” in the EuroPrevall study required both a positive IgE or skin test and reported symptoms within two hours of eating the food. Neither is the same as a confirmed challenge test.

That gap — between a population that carries the antibodies and a population that develops the disease — runs through the Indian food allergy literature consistently. The sections that follow document what produced it, what it means, and where the evidence currently stands.

2. How food allergy is measured

2.1 The diagnostic hierarchy

Four methods appear in the Indian literature reviewed here, each measuring something different.

Skin prick test (SPT) introduces a small amount of allergen extract into the skin surface; a raised wheal above a defined threshold indicates sensitisation.⁶ SPT is fast and inexpensive. In the Indian context, the absence of standardised local allergen extracts limits its comparability across studies — most Indian studies use commercial extracts developed for other populations, or in-house preparations with variable protein content (Krishna et al. 2020).

⁶ A wheal is a raised, itchy bump at the test site, like a small insect bite. The standard threshold is 3mm or more above the negative control. Bigger wheals suggest stronger sensitisation — but again, sensitisation is not allergy.
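The SPT positivity rule can be written as a one-line check. This is a minimal sketch, assuming the 3 mm threshold is applied to the wheal diameter after subtracting the negative control; the function and parameter names are illustrative and do not come from any of the cited studies.

```python
def spt_positive(wheal_mm: float,
                 negative_control_mm: float = 0.0,
                 threshold_mm: float = 3.0) -> bool:
    """Return True when the wheal exceeds the negative control by the threshold.

    A True result indicates sensitisation only; it is not a diagnosis
    of clinical allergy.
    """
    return (wheal_mm - negative_control_mm) >= threshold_mm


# Hypothetical readings, for illustration only.
print(spt_positive(4.0, negative_control_mm=0.5))  # 3.5 mm above control
print(spt_positive(3.0, negative_control_mm=1.0))  # 2.0 mm above control
```

Individual studies may apply different cut-offs or extract preparations, which is one reason SPT results are hard to pool.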

Serum-specific IgE (sIgE) measures circulating IgE antibodies to a specific food protein via blood test. It has the same diagnostic limitation as SPT: a positive result indicates sensitisation, not confirmed clinical allergy.

Oral food challenge (OFC) requires the patient to eat the food under clinical observation, with symptoms recorded. It is the closest available proxy to real-world exposure, but resource-intensive and not widely available outside specialist centres.⁷

⁷ An oral food challenge typically happens in a clinical setting over several hours — the patient eats increasing amounts of the food at intervals while a clinician monitors for reactions. Because reactions can be severe, emergency treatment needs to be immediately available. This is why the absence of adrenaline auto-injectors in India until recently made challenges structurally difficult to conduct safely.

Double-blind placebo-controlled food challenge (DBPCFC) is the diagnostic gold standard: both patient and clinician are blinded to whether the test substance or a placebo is being administered. Highest confidence, used almost exclusively in specialist centres and research protocols.

2.2 Why these tests give different numbers

A single population, tested by different methods, will produce different numbers. In the EuroPrevall-INCO study — 5,677 children aged 7–10 years across schools in Mysore and Bengaluru — sIgE detected sensitisation in 19.1% of children while SPT detected sensitisation in 4.48% of the same children (Mahesh et al. 2023).⁸

⁸ The EuroPrevall-INCO study is the largest systematic food allergy dataset from India. It tested children from schools, not clinics — which means it captures a cross-section of the child population rather than children who had already presented with suspected allergies. This makes its prevalence figures more representative of the general population than most other Indian studies.

Neither figure represents confirmed food allergy. The “probable food allergy” figure from the same study (0.14% for Indian children) used a specific operational definition: reported symptoms within two hours of eating a food, combined with a positive sIgE or SPT to that food (Mahesh et al. 2023; Leung et al. 2024). This is not OFC-confirmed. It is a structured symptom-report combined with immunological evidence, which consistently overestimates confirmed allergy relative to DBPCFC.⁹

⁹ This matters because the 0.14% figure is the one most often cited as India’s food allergy prevalence. It is the best available estimate from the best available study, but it is not a confirmed allergy rate. A confirmed rate from DBPCFC data would likely be lower. How much lower is not known, because the challenge data does not exist at population scale.
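The operational definition of probable food allergy is a simple conjunction, and writing it out shows why it sits between sensitisation and confirmed allergy. The sketch below assumes each participant record carries two test results and a self-reported symptom onset time; the record layout and field names are hypothetical, not the study's data schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FoodRecord:
    """One participant's results for one food (hypothetical layout)."""
    sige_positive: bool
    spt_positive: bool
    symptom_onset_hours: Optional[float]  # None means no reaction was reported


def is_sensitised(record: FoodRecord) -> bool:
    # Either test being positive counts as sensitisation.
    return record.sige_positive or record.spt_positive


def is_probable_food_allergy(record: FoodRecord, window_hours: float = 2.0) -> bool:
    # Reported symptoms within the window AND a positive test to the same food.
    reported = (record.symptom_onset_hours is not None
                and record.symptom_onset_hours <= window_hours)
    return reported and is_sensitised(record)


# Sensitised but asymptomatic: counted in the 19.1%-style figure, not the 0.14%-style one.
print(is_probable_food_allergy(FoodRecord(True, False, None)))
# Sensitised with symptoms reported one hour after eating.
print(is_probable_food_allergy(FoodRecord(True, False, 1.0)))
```

Nothing in this rule involves a supervised challenge, which is why a record that satisfies it is still only "probable".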

DBPCFC-confirmed allergy exists for only three foods in the India-specific data available: rice (6 of 16 patients tested confirmed, Delhi tertiary referral centre), black gram (4 of 14 confirmed, same centre), and chickpea (31 of 41 SPT-positive patients confirmed on challenge, Bombay allergy clinic) (Mahesh et al. 2023; Krishna et al. 2020).¹⁰

¹⁰ These are small numbers from single centres. A confirmed rate of 6 out of 16 for rice means six people in one Delhi clinic reacted to rice under controlled conditions — it does not mean 37.5% of Indians with rice sensitisation have clinical rice allergy. The value of this data is that it exists at all, not that it is generalisable.
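To make the point about small numbers concrete, the sketch below attaches a 95% Wilson score interval to each confirmed proportion. The Wilson interval is a choice made here for illustration; none of the cited studies is stated to use it, and it captures sampling uncertainty only, not the selection bias of a referral clinic.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Counts as reported in the text: (confirmed on DBPCFC, patients challenged).
dbpcfc_counts = {"rice": (6, 16), "black gram": (4, 14), "chickpea": (31, 41)}

for food, (confirmed, tested) in dbpcfc_counts.items():
    low, high = wilson_interval(confirmed, tested)
    print(f"{food}: {confirmed}/{tested} = {confirmed / tested:.1%} "
          f"(95% interval {low:.1%} to {high:.1%})")
```

For rice, the interval around 37.5% runs from roughly 18% to 61%, which shows how little a cohort of 16 constrains the underlying rate.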

2.3 What this means for reading the numbers in this review

The figures that appear most frequently in Indian food allergy literature are sensitisation rates and probable food allergy estimates, not confirmed allergy rates. When numbers are cited in §3, the method used to generate them is stated each time. Sensitisation rates are not treated as equivalent to clinical allergy rates; the 136-fold gap in the EuroPrevall India data is the clearest signal that this distinction matters.

3. Food allergens in India: what the literature documents

3.1 The EuroPrevall-INCO study

The EuroPrevall-INCO study enrolled 5,677 children aged 7–10 years across schools in Mysore and Bengaluru and tested each child against a 25-food panel (Mahesh et al. 2023).

Sensitisation rates by sIgE in children: shrimp 10.5%, sesame 8.0%, wheat 6.7%, peanut 6.3%. SPT sensitisation was lower overall — 4.48% aggregate versus 19.1% by sIgE — with jackfruit (2.46%) and cow’s milk (1.35%) leading by SPT (Mahesh et al. 2023).¹¹

¹¹ The difference between sIgE and SPT results for the same foods in the same children reflects the different things each test measures, and the different thresholds each uses to call a result positive. Neither is wrong — they are measuring different aspects of the same immune response.

Probable food allergy in children was 0.14% overall. The leading foods in the probable food allergy subset were cow’s milk (0.5% of that subset) and apple (0.5%), with egg at 0.05% and eggplant at 0.04% (Mahesh et al. 2023).

For adults across two Karnataka cities the picture shifts: 26.5% sensitisation and 1.2% probable food allergy, with legumes, prawn, eggplant, milk, and egg as the leading allergens (Mahesh et al. 2023).¹²

¹² The adult figures being higher than the child figures is consistent with cumulative exposure over time — more years of eating means more opportunity for sensitisation to develop. Whether this reflects a genuine increase in allergy with age or a cohort effect — older adults having grown up under different dietary and environmental conditions — is not established in the available data.

Both study sites are urban Karnataka. The EuroPrevall-INCO data does not cover North India, Northeast India, rural populations, or coastal communities (Leung et al. 2024).

3.2 Clinic and community studies

Beyond EuroPrevall, Krishna et al. (2020) compile 13 individual Indian allergy studies conducted between 2001 and 2019, covering Bombay, Delhi, Mysore, Bengaluru, Lucknow, and Kolkata. These are largely clinic-based cohorts: patients presenting to allergy clinics rather than general population samples.¹³ Sensitisation rates from these studies are higher than in population-based studies and are not representative of background prevalence in the general population.

¹³ Think of it this way: if you want to know how common headaches are in a city, surveying people in a neurology clinic will give you a much higher number than surveying people on the street. Both numbers are real — they are just answering different questions. Clinic-based studies tell you what allergens appear in people who have already sought care for a suspected reaction, not how common those allergens are in the population at large.

The table below summarises key findings from those studies alongside FSSAI mandatory status for each allergen.

Table 1: Key allergens from Indian clinic and community studies (as reported in Krishna et al. 2020)

Allergen	Sensitisation range	Method	DBPCFC/OFC data	In FSSAI 2020?
Black gram (Vigna mungo)	5.9–10.1%	SPT, sIgE	4/14 DBPCFC confirmed	No
Rice (Oryza sativa)	6.2–12.1%	SPT, sIgE	6/16 DBPCFC confirmed	No
Lentil (Lens culinaris)	5.5–9.7%	SPT (N=216–1,860)	None available	No
Prawn	10.3–53.5%	SPT, sIgE	—	Yes (crustaceans)
Eggplant (Solanum melongena)	4.3–9.2% SPT; 0.8% sIgE community	SPT; sIgE	None available	No
Egg	6.9–34.9%	SPT, sIgE	—	Yes
Banana	3.6–40.6%	SPT	None available	No
Wheat	6.7–11.93%	SPT, sIgE	—	Yes (gluten cereals)
Chickpea (Cicer arietinum)	SPT positive 41/1,400	SPT (N=1,400 clinic)	31/41 DBPCFC confirmed	No
Red gram / pigeon pea (Cajanus cajan)	12.6%	sIgE (Karnataka N=2,219)	None available	No
Green gram (Vigna radiata)	12.5%	sIgE (Karnataka N=2,219)	None available	No

Note on eggplant: stored eggplant accumulates histamine at levels that can produce false-positive SPT results. Sensitisation figures for eggplant based on SPT should be interpreted with this confound in mind (Bhattacharya et al. 2018).¹⁴

¹⁴ Histamine is the same compound the body releases during an allergic reaction — which is why antihistamines treat allergy symptoms. When stored eggplant already contains elevated histamine, introducing it into a skin prick test can trigger a wheal response that looks like sensitisation but is actually a direct chemical reaction to the histamine, not an IgE-mediated immune response. The SPT result is positive; the underlying mechanism is different.

An urban–rural gradient is visible in the available data. In Karnataka schools, sensitisation to prawn was 17.7% urban versus 5.7% rural; peanut 19.6% versus 10.4%; fish 17.7% versus 5.7% (Gobinaath et al. 2018, as reported in Krishna et al. 2020).¹⁵

¹⁵ Urban children consistently showing higher sensitisation than rural children from the same region (same genetic background, different environment) is one of the signals researchers use to argue that environment, not genetics, drives much of the variation in food allergy rates. What specifically differs between urban and rural environments in ways that affect allergy development is an open question.

3.3 Molecular characterisation of India-specific allergens

Bhattacharya et al. (2018) provide molecular-level data on allergens characterised specifically from Indian clinical populations. The primary food allergen categories identified in Indian patients are legumes, prawn, eggplant, milk, and egg.

India’s only IUIS-registered food allergen is Pen i 1, the tropomyosin of Penaeus indicus (Indian white prawn) (Bhattacharya et al. 2018).¹⁶

¹⁶ The IUIS (International Union of Immunological Societies) maintains the official registry of characterised allergens — proteins that have been isolated, sequenced, and confirmed as allergenic through clinical data. Registration means the protein has been formally identified and named as an allergen. That India has only one registered food allergen is a measure of how much molecular characterisation work remains, not of how few allergenic foods exist.

Black gram (Vigna mungo): A 28-kDa glycoprotein (Vig m) was isolated and shown to resist pepsin digestion for at least 15 minutes, a property associated with higher clinical relevance for IgE-mediated reactions. Sequence homology to a rho-specific inhibitor in peanut was identified, providing a structural basis for observed cross-reactivity (Bhattacharya et al. 2018).¹⁷

¹⁷ A protein that resists pepsin digestion for 15 minutes arrives in the gut largely intact, meaning the immune system encounters the full protein rather than fragments. Fragments are generally less likely to trigger a response because the immune system recognises the whole structure, not the parts. Pepsin stability is one of the properties the FAO/WHO uses to assess whether a novel protein is likely to be allergenic.

Chickpea (Cicer arietinum): A 26-kDa albumin fraction was characterised and found to cross-react with peanut IgE, which is relevant because peanut is the best-documented IgE-mediated food allergen globally (Bhattacharya et al. 2018).

Kidney bean (Phaseolus vulgaris): A 31-kDa phytohemagglutinin was found stable to pepsin digestion and reported to sensitise approximately 22% of Delhi food-allergic patients tested (Bhattacharya et al. 2018).

Rice (Oryza sativa): A 24-kDa chitinase was identified as the major allergen; approximately 12% of food-allergic patients in the study were SPT positive (Bhattacharya et al. 2018).

Eggplant (Solanum melongena): A lipid transfer protein (LTP) was characterised in the peel and seeds. LTPs are heat-stable and digestion-resistant, giving them higher clinical relevance than heat-labile proteins. The histamine confound in SPT testing for eggplant does not affect the molecular characterisation, but does affect interpretation of sensitisation rates (Bhattacharya et al. 2018).¹⁸

¹⁸ A lipid transfer protein is a small plant protein whose biological role is moving lipids — fats — across cell membranes. They are found across many plant foods and are one of the main drivers of cross-reactivity between plant allergens. Because they are heat-stable, they remain allergenic in cooked food, which makes them clinically more significant than proteins that denature under heat.

Fish: Heat-stable allergens were characterised in bhetki (Lates calcarifer) and mackerel (Rastrelliger kanagurta); heat-labile allergens in hilsha (Tenualosa ilisha) and pomfret (Pampus argenteus). Cooking does not eliminate allergenic risk from bhetki or mackerel (Bhattacharya et al. 2018).¹⁹

¹⁹ This distinction matters for processed food labelling specifically. A product containing cooked mackerel retains its allergenic proteins intact. A product containing cooked hilsha may carry reduced, though not necessarily zero, allergenic risk. The current FSSAI declaration requirement covers fish as a category and does not distinguish by heat stability.

Legumes as a class: allergen proteins from legumes retain IgE reactivity after gastric digestion (Bhattacharya et al. 2018), which means the pepsin-stability argument applies across the entire legume complex, not only to individual characterised proteins.

Milana et al. (2025) provide additional cross-reactivity data for the Indian legume complex. Mung bean LTPs share greater than 60% sequence homology with LTPs from lentil, bean, peanut, strawberry, and apple. Black gram (Vig m) cross-reacts with faba bean, lentil, lima bean, and pea. Black gram is also linked to Pollen Food Allergy Syndrome with Prosopis juliflora, a tree species prevalent across urban India (Milana et al. 2025).²⁰

²⁰ Pollen Food Allergy Syndrome (PFAS) is a cross-reactive condition where sensitisation to a pollen triggers oral symptoms — tingling, mild swelling — on eating certain raw foods. The immune system is recognising a structural similarity between the pollen protein and a food protein. In India, Prosopis juliflora is a widespread urban tree; people sensitised to its pollen may develop oral symptoms to black gram through this pathway rather than through direct sensitisation to the legume.

Red gram / pigeon pea (Cajanus cajan): Novel allergens including β-conglycinin and vicilin homologues have been identified via Indian patient sera (Bhattacharya et al. 2018). Sensitisation data from a Karnataka population study (N=2,219) reported 12.6% sIgE positive (Krishna et al. 2020).

3.4 The sensitisation-reactivity gap

In EuroPrevall India, 19.1% of children tested positive for at least one food by sIgE; 0.14% had probable food allergy — a 136-fold gap (Mahesh et al. 2023; Krishna et al. 2020). For peanut specifically, sensitisation was 6.3% by sIgE; probable peanut allergy was approximately 0.03% — roughly 200-fold (Krishna et al. 2020). In Western populations, peanut allergy prevalence is typically cited at 1–2%, an order of magnitude closer to the sensitisation rate.
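The fold figures quoted above are plain ratios of two percentages, and the arithmetic can be checked directly. The helper below is an illustrative sketch; the name `fold_gap` is not taken from the cited studies.

```python
def fold_gap(sensitisation_pct: float, probable_allergy_pct: float) -> float:
    """Ratio of a sensitisation rate to a probable-allergy rate (both in percent)."""
    if probable_allergy_pct <= 0:
        raise ValueError("probable allergy rate must be positive")
    return sensitisation_pct / probable_allergy_pct


# Figures as quoted in the text for EuroPrevall India children.
any_food = fold_gap(19.1, 0.14)  # any food, sIgE vs probable food allergy
peanut = fold_gap(6.3, 0.03)     # peanut, sIgE vs probable peanut allergy

print(f"any food: about {any_food:.0f}-fold")
print(f"peanut:   about {peanut:.0f}-fold")
```

The peanut ratio works out to about 210, which the text rounds to "roughly 200-fold"; because the 0.03% denominator is itself approximate, the ratio should be read as an order of magnitude rather than a precise value.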

Several protective factors have been proposed: longer breastfeeding, vaginal delivery, diverse legume exposure from early life, gut microbiome composition, and enteric helminthiasis (Mahesh et al. 2023).²¹ None has been confirmed as causal; they are epidemiological associations observed in parallel with the gap.

²¹ Enteric helminthiasis means intestinal worm infections, which are more common in lower-income settings. This may seem counterintuitive as a protective factor, but the hypothesis is that parasitic infections shift the immune system toward a particular response profile — Th2-dominant — that may reduce clinical reactivity to food allergens. As India urbanises and sanitation improves, helminthiasis rates fall, and the gap may narrow as a result.

The urbanisation signal is indirect evidence for the protective factor hypothesis. Children born in Hong Kong to mainland Chinese parents are approximately four times more likely to develop food sensitisation than mainland-born children, despite identical genetic background (Leung et al. 2024). In Indian data, urban children consistently show higher sensitisation to prawn, peanut, fish, and milk than rural children in the same regional studies (Krishna et al. 2020).

This matters for any classification that uses sensitisation data as a proxy for clinical relevance. For most foods in §3.2, sensitisation rates are the only data available. The gap documented here is the reason those rates cannot be read directly as clinical allergy burden.

3.5 Why the evidence base is limited

The constraints on Indian food allergy research are structural, not incidental. The researchers working in this field document them explicitly (Krishna et al. 2020; Devdas et al. 2018; Mahesh et al. 2023).

Most sensitisation data comes from allergy clinic patients, not general population cohorts. Patients attending allergy clinics are a selected population — higher pre-test probability of sensitisation than the general population. Rates from these studies are expected to exceed true population prevalence.

Standardised allergen extracts for SPT are not available in India. “High quality allergen extracts for skin tests and adrenaline auto-injectors are currently not available in India” (Krishna et al. 2020). Results vary across laboratories and cannot be pooled directly.²²

²² When two labs test for sensitisation to the same food using different extracts — different protein concentrations, different preparation methods — a positive result in one lab is not directly comparable to a positive result in the other. This is why sensitisation rates for the same food vary considerably across Indian studies, and why ranges rather than single figures are reported throughout this review.

Systematic data is largely from Karnataka and Delhi. Northeast India, rural India, coastal fishing communities, and tribal populations are essentially absent from the available literature.

DBPCFC-confirmed data exists for three foods only — rice, black gram, chickpea — each from single-centre clinic cohorts. “Very few studies in India have confirmed food allergy with a challenge procedure” (Mahesh et al. 2023). As recently as 2018, adrenaline auto-injectors were not available in India (Devdas et al. 2018), which limited the ability to conduct challenges safely.²³

²³ A food challenge carries the risk of triggering the very reaction it is testing for. Conducting one safely requires having emergency treatment — adrenaline — immediately available. Without it, a challenge that triggers anaphylaxis cannot be managed. This is the direct structural link between the absence of adrenaline auto-injectors and the absence of challenge data in the Indian literature.

Evidence generated in high-income Western countries is not directly applicable to India. Diagnostic thresholds, allergen panels, and reference ranges require validation for Indian populations before they can be used (Krishna et al. 2020).

These are the conditions that shaped the evidence. They explain why the literature looks the way it does, and they are the frame within which every figure in §3.1–3.3 should be read.

4. The FSSAI mandatory list

4.1 Regulatory text

Regulation 5(14) of the Food Safety and Standards (Labelling and Display) Regulations, 2020 (Version III, operationalised 1 July 2022) requires that packaged food manufacturers declare the presence of the following allergen groups on the product label (Food Safety and Standards Authority of India 2022):

Cereals containing gluten (wheat, rye, barley, oats, spelt, and their hybridised strains)
Crustaceans
Milk
Eggs
Fish
Peanuts and tree nuts
Soybeans
Sulphites at concentrations of 10 mg/kg or more

Exemptions include: oils derived from listed ingredients; distilled alcoholic beverages; raw agricultural commodities; and specific wheat-derived processing aids where gluten content is ≤20 mg/kg (Food Safety and Standards Authority of India 2022).²⁴

²⁴ These exemptions exist because highly refined oils derived from allergenic sources — peanut oil, soy oil — typically contain little to no residual protein, and protein is what the immune system reacts to. The exemption is not blanket; cold-pressed or unrefined oils may retain protein and are treated differently.

“May contain” declarations for cross-contamination risk are permitted but not required.
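The declaration rule described in this subsection can be sketched as a lookup plus one threshold check. This is a simplified illustration, not a compliance tool: the ingredient-to-group mapping is hypothetical, the exemptions (refined oils, distilled beverages, raw commodities, low-gluten processing aids) are deliberately not modelled, and the group names are shorthand for the regulatory wording.

```python
# Mandatory allergen groups as listed in the regulatory text above.
FSSAI_2020_GROUPS = frozenset({
    "cereals containing gluten",
    "crustaceans",
    "milk",
    "eggs",
    "fish",
    "peanuts and tree nuts",
    "soybeans",
})
SULPHITE_THRESHOLD_MG_PER_KG = 10.0

# Hypothetical lookup for illustration; a real system would need a vetted database.
INGREDIENT_GROUPS = {
    "wheat flour": "cereals containing gluten",
    "prawn": "crustaceans",
    "milk powder": "milk",
    "egg white": "eggs",
    "mackerel": "fish",
    "peanut": "peanuts and tree nuts",
    "cashew": "peanuts and tree nuts",
    "soy flour": "soybeans",
    "sesame seed": "sesame",           # not an FSSAI 2020 group
    "black gram flour": "black gram",  # not an FSSAI 2020 group
}


def required_declarations(ingredients, sulphite_mg_per_kg: float = 0.0) -> set:
    """Return the mandatory groups triggered by a product (exemptions ignored)."""
    declared = set()
    for name in ingredients:
        group = INGREDIENT_GROUPS.get(name.strip().lower())
        if group in FSSAI_2020_GROUPS:
            declared.add(group)
    if sulphite_mg_per_kg >= SULPHITE_THRESHOLD_MG_PER_KG:
        declared.add("sulphites")
    return declared


print(required_declarations(["Wheat flour", "sesame seed", "black gram flour"], 12.0))
```

In this sketch a product containing sesame and black gram triggers no declaration for either ingredient, because neither belongs to a listed group.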

4.2 International basis

The FSSAI 2020 list maps directly to the Codex Alimentarius General Standard for the Labelling of Prepackaged Foods (CXS 1-1985) as it stood prior to the 2024 revision. India adopted the Codex list as the scientific baseline for its allergen labelling framework, consistent with the WTO Sanitary and Phytosanitary Agreement’s treatment of Codex standards as the international reference (Codex Alimentarius Commission 2024).²⁵

²⁵ The WTO SPS Agreement encourages member countries to base food safety regulations on international standards — primarily Codex — rather than developing independent national standards for each regulated substance. Adopting the Codex allergen list is therefore not a shortcut; it is the standard approach for WTO member states. The question is what happens when the Codex list diverges from national-specific evidence, which is what §4.4 examines.

The Codex 2024 revision made two changes relevant here: sesame was added as a mandatory declaration allergen, and soy was reclassified from mandatory to recommended, reflecting lower confirmed soy allergy prevalence in large population studies relative to other listed allergens (Codex Alimentarius Commission 2024). The 2024 revision also introduced a requirement for visual distinction of allergen declarations from surrounding label text. Whether FSSAI will align with these changes is not known at the time of writing.

The table below places the FSSAI list in international context.

Table 2: FSSAI allergen list in international context

Allergen	FSSAI 2020	Codex pre-2024	Codex 2024	EU (Big 14)	US (Big 9)
Gluten-containing cereals	Mandatory	Mandatory	Mandatory	Mandatory	Mandatory (wheat only)
Crustaceans	Mandatory	Mandatory	Mandatory	Mandatory	Mandatory
Milk	Mandatory	Mandatory	Mandatory	Mandatory	Mandatory
Egg	Mandatory	Mandatory	Mandatory	Mandatory	Mandatory
Fish	Mandatory	Mandatory	Mandatory	Mandatory	Mandatory
Peanuts	Mandatory (with tree nuts)	Mandatory	Mandatory	Mandatory	Mandatory
Tree nuts	Mandatory (with peanuts)	Mandatory	Mandatory	Mandatory	Mandatory
Soybeans	Mandatory	Mandatory	Recommended	Mandatory	Mandatory
Sulphites ≥10 mg/kg	Mandatory	Mandatory	Mandatory	Mandatory	—
Sesame	Not listed	Not listed	Mandatory	Mandatory	Mandatory (added 2023)
Lupin	Not listed	Not listed	—	Mandatory	—
Molluscs	Not listed	Not listed	—	Mandatory	—
Celery	Not listed	Not listed	—	Mandatory	—
Mustard	Not listed	Not listed	—	Mandatory	—
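Table 2 can be queried mechanically to list where two frameworks differ. The sketch below transcribes three of the table's columns into a dictionary; the structure and the helper name are illustrative assumptions, and the combined peanut and tree nut entry in FSSAI 2020 is recorded simply as "Mandatory" for each row.

```python
# Status per framework, transcribed from Table 2.
# None marks cells shown as a dash in the table.
TABLE_2 = {
    "Gluten-containing cereals": {"FSSAI 2020": "Mandatory", "Codex 2024": "Mandatory", "EU": "Mandatory"},
    "Crustaceans": {"FSSAI 2020": "Mandatory", "Codex 2024": "Mandatory", "EU": "Mandatory"},
    "Milk": {"FSSAI 2020": "Mandatory", "Codex 2024": "Mandatory", "EU": "Mandatory"},
    "Egg": {"FSSAI 2020": "Mandatory", "Codex 2024": "Mandatory", "EU": "Mandatory"},
    "Fish": {"FSSAI 2020": "Mandatory", "Codex 2024": "Mandatory", "EU": "Mandatory"},
    "Peanuts": {"FSSAI 2020": "Mandatory", "Codex 2024": "Mandatory", "EU": "Mandatory"},
    "Tree nuts": {"FSSAI 2020": "Mandatory", "Codex 2024": "Mandatory", "EU": "Mandatory"},
    "Soybeans": {"FSSAI 2020": "Mandatory", "Codex 2024": "Recommended", "EU": "Mandatory"},
    "Sulphites": {"FSSAI 2020": "Mandatory", "Codex 2024": "Mandatory", "EU": "Mandatory"},
    "Sesame": {"FSSAI 2020": "Not listed", "Codex 2024": "Mandatory", "EU": "Mandatory"},
    "Lupin": {"FSSAI 2020": "Not listed", "Codex 2024": None, "EU": "Mandatory"},
    "Molluscs": {"FSSAI 2020": "Not listed", "Codex 2024": None, "EU": "Mandatory"},
    "Celery": {"FSSAI 2020": "Not listed", "Codex 2024": None, "EU": "Mandatory"},
    "Mustard": {"FSSAI 2020": "Not listed", "Codex 2024": None, "EU": "Mandatory"},
}


def mandatory_only_in(source: str, target: str) -> list:
    """Allergens mandatory under `source` but not mandatory under `target`."""
    return [allergen for allergen, row in TABLE_2.items()
            if row[source] == "Mandatory" and row[target] != "Mandatory"]


print(mandatory_only_in("Codex 2024", "FSSAI 2020"))
print(mandatory_only_in("FSSAI 2020", "Codex 2024"))
print(mandatory_only_in("EU", "FSSAI 2020"))
```

Read against Codex 2024, the FSSAI list differs in two places: sesame is missing, and soybeans are mandatory where Codex now only recommends declaration.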

4.3 Where the regulation and the literature converge

Crustaceans show the strongest alignment between regulation and Indian evidence. Prawn tropomyosin (Pen i 1, Penaeus indicus) is India’s only IUIS-registered food allergen (Bhattacharya et al. 2018). Sensitisation data is available from at least four independent Indian studies, with rates ranging from 10.3% to 53.5% depending on study type and population (Krishna et al. 2020).

Milk, egg, and fish all appear in FSSAI with supporting Indian sensitisation data. Milk sensitisation is 1.35–20.5% across reviewed studies; probable food allergy to milk in children was 0.5% of the EuroPrevall India probable food allergy subset (Mahesh et al. 2023). Egg sensitisation was 6.9–34.9% in clinic-based studies; probable egg allergy 0.05% in children (Mahesh et al. 2023). Fish allergens characterised in India include heat-stable proteins in bhetki and mackerel (Bhattacharya et al. 2018).

Peanuts: sIgE sensitisation 6.3–19.6% in Indian data (Mahesh et al. 2023; Krishna et al. 2020); probable food allergy approximately 0.03% — the sensitisation-reactivity gap at its most pronounced.

Wheat (gluten cereals): sIgE sensitisation 6.7–11.93%; probable food allergy 0–0.02% in EuroPrevall India (Mahesh et al. 2023).

4.4 Where the regulation and the literature diverge

Several foods with documented Indian sensitisation data are absent from the FSSAI list. For a regulatory body setting mandatory labelling requirements, the relevant standard is confirmed clinical allergy burden — and for most of these foods, the DBPCFC data to establish that burden does not exist in India-specific form.²⁶

²⁶ Mandatory labelling requirements carry legal and commercial consequences for manufacturers. Setting that bar requires a level of confirmed evidence — ideally challenge-confirmed allergy at population scale — that is higher than what is needed to flag a food as potentially relevant in a research taxonomy. The absence of a food from the FSSAI list does not mean it is not allergenic; it means the evidentiary bar for a legal mandate has not been cleared.

The foods where the evidence is most developed:

Sesame: sensitisation 8.0% in EuroPrevall India children — higher than peanut (6.3%) (Mahesh et al. 2023). Codex 2024 has since added sesame as mandatory, joining the US (since 2023) and the EU (Codex Alimentarius Commission 2024). The evidence position for sesame has materially changed since the FSSAI 2020 list was written.

Rice, black gram, and chickpea each have Indian DBPCFC data: rice (6 of 16 confirmed), black gram (4 of 14 confirmed), chickpea (31 of 41 confirmed) (Mahesh et al. 2023; Krishna et al. 2020). These are small single-centre cohorts and the only India-specific challenge data that exists for any food not on the FSSAI list.

The Indian legume complex — pigeon pea, kidney bean, lentil, green gram — has documented sensitisation, characterised allergen proteins, and pepsin-stable fractions (Bhattacharya et al. 2018; Krishna et al. 2020). No OFC data is available. Legume proteins as a class retain IgE reactivity after gastric digestion (Bhattacharya et al. 2018), and the cross-reactive epitopes across this complex mean primary sensitisation to one legume may carry risk across others.

Eggplant is named among the five primary Indian food allergens by Bhattacharya et al. (2018), appears in the adult allergen profile of EuroPrevall (Mahesh et al. 2023), and has a characterised LTP. SPT-based sensitisation figures carry the histamine confound noted in §3.2; the molecular evidence does not.

Mustard has no India-specific clinical allergy data in the reviewed literature but is mandatory in the EU, widely used in Indian cooking as both oil and spice, and subject to ongoing FAO/WHO threshold assessment.

4.5 The regulatory process as a constraint on list updates

Adding a food to a mandatory allergen declaration list is not a scientific decision alone. It is a regulatory action with legal, commercial, and administrative consequences, and the process that produces it reflects that.

A mandatory declaration requires that manufacturers identify the allergen across their entire supply chain, verify its presence or absence in every product, update labels, and retrain procurement and production staff. For large manufacturers with complex ingredient sourcing, this is a substantial operational exercise. For small manufacturers — which constitute a significant portion of India’s packaged food sector — it can be the difference between compliance being feasible or not. Regulators setting a new mandatory requirement are not only making a safety call; they are also setting a compliance burden, and the timeline and scope of that burden are part of the decision.

Enforcement is a parallel constraint. A mandatory declaration is only as useful as the regulator’s ability to verify it. For allergens with standardised, widely available testing methods, enforcement is tractable. For allergens where testing methodology is not yet standardised — or where reference materials are not available in India — a mandatory declaration creates a requirement that inspection infrastructure cannot yet reliably verify. Regulators have reason to wait until enforcement is feasible before making a requirement mandatory rather than recommended.

The evidentiary standard for a mandatory declaration is also necessarily higher than for a research classification. A regulation that mandates disclosure of an allergen on the basis of sensitisation data alone — without challenge-confirmed allergy at meaningful scale — risks requiring declarations for foods that carry negligible clinical risk in practice, which dilutes the signal value of the mandatory list for consumers and manufacturers alike. The regulatory instinct to wait for confirmed data before acting is not conservatism for its own sake; it is the same instinct that makes the list meaningful when it does require something.

International alignment adds a further dimension. India’s participation in Codex and its WTO commitments create a shared interest in regulatory coherence across borders — both for consumer protection and for the practical functioning of food trade. Moving significantly ahead of or behind international standards has consequences that extend beyond the immediate safety question. The Codex process itself reflects this: the 2024 revision that added sesame and reclassified soy took years of evidence review and member state consultation before it was adopted. That pace is not a failure of urgency — it is what thorough cross-jurisdictional alignment requires.

None of this means the FSSAI list is final. It means the list reflects what was possible to establish, mandate, and enforce at the time it was written, under the conditions that existed. The divergence between the list and the clinical literature documented in §4.4 is not a gap that went unnoticed — it is a gap that the regulatory process has not yet closed, for reasons that are themselves part of the record.

5 Limitations

The limitations of this review are the limitations of the underlying evidence base.

Geographic coverage: All EuroPrevall-INCO data comes from Mysore and Bengaluru. Most clinic-based studies are from Delhi or Kolkata. No systematic food allergy data from Northeast India, rural India, coastal fishing communities, tribal populations, or most of North India is available in the reviewed literature.²⁷

²⁷ India’s dietary diversity means that allergen exposure varies considerably by region — a coastal community in Kerala will have systematically different fish and shellfish exposure than an inland population in Rajasthan. A sensitisation pattern that holds in urban Karnataka may not hold elsewhere. The geographic concentration of the available data is not a minor caveat; it means the review describes what is known about food allergy in a specific part of India, not India as a whole.

Study design: Most sensitisation data comes from allergy clinic patients, not general population cohorts. Patients attending allergy clinics are a selected population with a higher pre-test probability of sensitisation than the general population. Rates from these studies are therefore expected to exceed true population prevalence.

Diagnostic method: The 0.14% (children) and 1.2% (adults) figures use the EuroPrevall probable food allergy definition: reported symptoms within two hours combined with positive sIgE or SPT. This is not an OFC-confirmed diagnosis. DBPCFC-confirmed data exists only for rice (6/16), black gram (4/14), and chickpea (31/41), each from single-centre clinic cohorts.
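The three challenge-confirmed counts above work out to 6/16 = 37.5%, 4/14 ≈ 28.6%, and 31/41 ≈ 75.6%. With cohorts this small, the uncertainty around each proportion is wide. The sketch below is illustrative only and not part of the reviewed studies: it applies a Wilson score interval, one common choice for small samples, to the counts quoted in this section. The names `DBPCFC_COUNTS`, `wilson_interval`, and `summarise` are introduced here for illustration.

```python
import math

# Challenge-confirmed counts quoted in this section: (confirmed, challenged).
DBPCFC_COUNTS = {
    "rice": (6, 16),
    "black gram": (4, 14),
    "chickpea": (31, 41),
}


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def summarise(counts):
    """Map each food to (point proportion, interval low, interval high)."""
    rows = {}
    for food, (confirmed, challenged) in counts.items():
        low, high = wilson_interval(confirmed, challenged)
        rows[food] = (confirmed / challenged, low, high)
    return rows


if __name__ == "__main__":
    for food, (p, low, high) in summarise(DBPCFC_COUNTS).items():
        print(f"{food}: {p:.1%} confirmed (95% CI {low:.1%} to {high:.1%})")
```

The interval describes only sampling uncertainty within each clinic cohort. It does not correct for the selection effect noted under study design, so none of these proportions should be read as a population prevalence.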

Eggplant confound: Stored eggplant accumulates histamine at levels that can produce false-positive SPT results. Eggplant sensitisation rates from SPT-based studies should be interpreted with this confound in mind (Bhattacharya et al. 2018). The LTP characterisation reported by Bhattacharya et al. (2018) provides molecular evidence independent of SPT.

Cross-reactivity vs primary sensitisation: Some sensitisation data, particularly for legumes, may reflect cross-reactivity with a primary sensitiser rather than independent sensitisation to the food tested. The Indian legume complex has documented cross-reactive epitopes (Bhattacharya et al. 2018; Milana et al. 2025). A patient with primary sensitisation to black gram may test positive for lentil, pea, and faba bean without independent primary sensitisation to those foods.²⁸

²⁸ This complicates the interpretation of sensitisation rates for individual legumes. If a significant fraction of lentil-positive results in Indian clinic studies reflect cross-reactivity with black gram rather than primary lentil sensitisation, the true lentil-specific sensitisation rate would be lower than reported. Disentangling primary sensitisation from cross-reactivity requires molecular testing — component-resolved diagnostics — which is not available in most Indian clinical settings.

Trajectory uncertainty: India’s food allergy landscape is not static. Urbanisation is consistently associated with higher food allergy rates in Asia-Pacific data (Leung et al. 2024), and India is urbanising rapidly. Current prevalence figures from 2006–2020 studies may not reflect the position in five or ten years. The allergen list derived from this review reflects evidence available through early 2026.

FSSAI update status: Whether FSSAI intends to align with the Codex 2024 revision is not known at the time of writing.

6 An extended allergen recognition list for Indian food systems

A mandatory labelling regulation and a research classification are doing different things. A regulation sets a legal threshold — what manufacturers must declare, with consequences for non-compliance. A classification organises information for researchers, analysts, and developers working with ingredient data. A higher evidentiary bar is appropriate for a legal mandate than for a classification flag.²⁹

²⁹ This distinction matters because it explains why the list below includes foods that FSSAI does not mandate. The question being asked is different: not “what is confirmed enough to require by law” but “what is documented enough in Indian-specific evidence to warrant flagging as allergen-relevant.” The two questions have different answers.

The list has three tiers.

6.1 Tier 1 — FSSAI core 8

These eight allergen groups are mandatory declarations under FSSAI Regulation 5(14) (Food Safety and Standards Authority of India 2022). They are adopted here unchanged.

Table 3: IFID Tier 1 — FSSAI core 8

#	Allergen group	FSSAI reference	Indian evidence summary
1	Gluten-containing cereals	Reg 5(14)(i)	Wheat: 6.7–11.93% sIgE; 0–0.02% probable FA in children
2	Crustaceans	Reg 5(14)(ii)	Prawn: 10.3–53.5% sensitisation; Pen i 1 India’s only IUIS-registered food allergen
3	Milk	Reg 5(14)(iii)	1.35–20.5% sensitisation; 0.5% probable FA in children’s probable FA subset
4	Egg	Reg 5(14)(iv)	6.9–34.9% sensitisation; 0.05% probable FA in children
5	Fish	Reg 5(14)(v)	Heat-stable allergens in bhetki and mackerel; heat-labile in hilsha and pomfret
6	Peanuts and tree nuts	Reg 5(14)(vi)	Peanut: 6.3–19.6% sensitisation; ~0.03% probable FA
7	Soybeans	Reg 5(14)(vii)	Limited India-specific data; Codex 2024 reclassified as recommended
8	Sulphites (≥10 mg/kg)	Reg 5(14)(viii)	Chemical sensitivity; not a protein allergen

6.2 Tier 2 — Literature additions

These nine allergen groups are absent from the FSSAI 2020 regulations but have supporting evidence from Indian clinical or epidemiological literature. The type and strength of evidence are noted for each.

Table 4: Extended allergen recognition list — Tier 2

#	Allergen	Evidence	Sources
9	Sesame	8.0% sIgE in EuroPrevall India children; Codex 2024 added as mandatory; US and EU both include sesame	(Mahesh et al. 2023; Codex Alimentarius Commission 2024)
10	Black gram (Vigna mungo)	DBPCFC 4 of 14 confirmed; 28-kDa Vig m; resistant to pepsin digestion; cross-reacts with lentil, faba bean, lima bean, pea	(Krishna et al. 2020; Bhattacharya et al. 2018; Milana et al. 2025)
11	Chickpea (Cicer arietinum)	DBPCFC 31 of 41 confirmed; anaphylaxis documented; 26-kDa albumin cross-reacts with peanut IgE	(Devdas et al. 2018; Krishna et al. 2020; Bhattacharya et al. 2018)
12	Kidney bean (Phaseolus vulgaris)	22% sensitisation in Delhi food-allergic population; 31-kDa allergen stable to pepsin; cross-reacts with peanut, black gram, lentil, pea	(Bhattacharya et al. 2018)
13	Lentil (Lens culinaris)	5.5–9.7% sensitisation (Delhi, N=216–1,860); cross-reacts with black gram, kidney bean, pea	(Krishna et al. 2020; Milana et al. 2025)
14	Rice (Oryza sativa)	DBPCFC 6 of 16 confirmed; 12% SPT positive in food-allergic population; 24-kDa chitinase as major allergen	(Bhattacharya et al. 2018; Mahesh et al. 2023)
15	Eggplant (Solanum melongena)	Named among five primary Indian food allergens; 4.3% SPT-confirmed community study (N=741); LTP in peel and seeds; SPT figures carry histamine confound	(Bhattacharya et al. 2018; Krishna et al. 2020)
16	Mustard (Brassica spp.)	Mandatory in EU; widely used in Indian cooking; FAO/WHO threshold assessment ongoing; no India-specific clinical data in reviewed literature	(Food Safety and Standards Authority of India 2022; Codex Alimentarius Commission 2024)
17	Pigeon pea / red gram (Cajanus cajan)	Novel allergens identified via Indian patient sera; 12.6% sIgE in Karnataka population study (N=2,219)	(Bhattacharya et al. 2018; Krishna et al. 2020)

6.3 Tier 3 — Flagged; insufficient evidence for inclusion

These foods have some Indian relevance but insufficient evidence to include in Tier 1 or Tier 2. They are documented here for transparency and future review.

Table 5: Extended allergen recognition list — Tier 3

Allergen	Available evidence	What is missing
Mung bean (Vigna radiata)	IUIS allergens Vig r 1 to Vig r 6 characterised; 12.5% sIgE in one Karnataka study; LTPs cross-reactive with peanut, soy, lentil, strawberry, apple, peach (Milana et al. 2025)	India-specific OFC or DBPCFC data; cross-reactivity with black gram may explain observed sensitisation
Banana	3.6–40.6% sensitisation range across Indian studies	OFC data; wide range suggests heterogeneous testing and possible cross-reactivity
Betel leaf (Piper betle)	Widely used in Indian food culture; reported as an exposure of concern in community settings	No molecular characterisation or clinical allergy data
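
For developers working with ingredient data, the three tiers can be carried as a simple lookup. The sketch below is one possible encoding, not a published schema: the names `Tier`, `AllergenFlag`, `ALLERGEN_INDEX`, and `flag_ingredients` are hypothetical, and the lower-case ingredient keys are illustrative rather than a controlled vocabulary. Group names and tier assignments are taken from Tables 3 to 5.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    FSSAI_CORE = 1   # mandatory declaration under FSSAI Regulation 5(14)
    LITERATURE = 2   # supported by Indian clinical or epidemiological literature
    FLAGGED = 3      # documented, but insufficient evidence for inclusion


@dataclass(frozen=True)
class AllergenFlag:
    group: str
    tier: Tier


# Illustrative keys only; a real system would need synonyms and regional names.
ALLERGEN_INDEX = {
    "wheat": AllergenFlag("Gluten-containing cereals", Tier.FSSAI_CORE),
    "prawn": AllergenFlag("Crustaceans", Tier.FSSAI_CORE),
    "milk": AllergenFlag("Milk", Tier.FSSAI_CORE),
    "egg": AllergenFlag("Egg", Tier.FSSAI_CORE),
    "fish": AllergenFlag("Fish", Tier.FSSAI_CORE),
    "peanut": AllergenFlag("Peanuts and tree nuts", Tier.FSSAI_CORE),
    "soybean": AllergenFlag("Soybeans", Tier.FSSAI_CORE),
    "sulphites": AllergenFlag("Sulphites", Tier.FSSAI_CORE),
    "sesame": AllergenFlag("Sesame", Tier.LITERATURE),
    "black gram": AllergenFlag("Black gram", Tier.LITERATURE),
    "chickpea": AllergenFlag("Chickpea", Tier.LITERATURE),
    "kidney bean": AllergenFlag("Kidney bean", Tier.LITERATURE),
    "lentil": AllergenFlag("Lentil", Tier.LITERATURE),
    "rice": AllergenFlag("Rice", Tier.LITERATURE),
    "eggplant": AllergenFlag("Eggplant", Tier.LITERATURE),
    "mustard": AllergenFlag("Mustard", Tier.LITERATURE),
    "pigeon pea": AllergenFlag("Pigeon pea / red gram", Tier.LITERATURE),
    "mung bean": AllergenFlag("Mung bean", Tier.FLAGGED),
    "banana": AllergenFlag("Banana", Tier.FLAGGED),
    "betel leaf": AllergenFlag("Betel leaf", Tier.FLAGGED),
}


def flag_ingredients(ingredients, max_tier=Tier.LITERATURE):
    """Return {ingredient: AllergenFlag} for entries at max_tier or stricter."""
    flags = {}
    for name in ingredients:
        entry = ALLERGEN_INDEX.get(name.strip().lower())
        if entry is not None and entry.tier <= max_tier:
            flags[name] = entry
    return flags
```

Called as `flag_ingredients(["Chickpea", "turmeric", "Banana"])`, the default returns only the chickpea entry; passing `max_tier=Tier.FLAGGED` also returns banana. Keeping the tier on each flag lets a consumer of the data separate legal obligations (Tier 1) from research flags (Tiers 2 and 3).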

7 References

Bhattacharya, Kashinath, Gaurab Sircar, Angira Dasgupta, and Swati Gupta Bhattacharya. 2018. “Spectrum of Allergens and Allergen Biology in India.” International Archives of Allergy and Immunology 177 (3): 219–37. https://doi.org/10.1159/000490805.

Codex Alimentarius Commission. 2024. General Standard for the Labelling of Pre-Packaged Foods (CXS 1-1985). Joint FAO/WHO Food Standards Programme.

Devdas, Jaidev M., Christopher Mckie, Adam T. Fox, and Vinod H. Ratageri. 2018. “Food Allergy in Children: An Overview.” Indian Journal of Pediatrics 85: 369–74. https://doi.org/10.1007/s12098-017-2535-6.

Food Safety and Standards Authority of India. 2022. Food Safety and Standards (Labelling and Display) Regulations, 2020. Compendium. https://www.fssai.gov.in/upload/uploadfiles/files/Compendium_Labelling_Display_30_06_2022.pdf.

Krishna, Mamidipudi Thirumala, Saibal Moitra, Padukudru Anand Mahesh, Vinay Mehta, Pudupakkam Vedanthan, and Devasahayam Jesudas Christopher. 2020. “An Appraisal of Allergic Disorders in India and an Urgent Call for Action.” World Allergy Organization Journal 13 (7): 100446. https://doi.org/10.1016/j.waojou.2020.100446.

Leung, Agnes Sze-yin, Punchama Pacharn, Sirinoot Tangvalelerd, et al. 2024. “Food Allergy in a Changing Dietary Landscape: A Focus on the Asia Pacific Region.” Pediatric Allergy and Immunology 35 (8): e14211. https://doi.org/10.1111/pai.14211.

Mahesh, Padukudru Anand et al. 2023. “Allergic Diseases in India - Prevalence, Risk Factors and Current Challenges.” Clinical & Experimental Allergy 53 (3): 276–94. https://doi.org/10.1111/cea.14239.

Milana, Matilde et al. 2025. “A Review of the Toxicological Effects and Allergenic Potential of Emerging Alternative Protein Sources.” Comprehensive Reviews in Food Science and Food Safety 24: e70123. https://doi.org/10.1111/1541-4337.70123.

Reuse

CC BY 4.0

The Coordinator Problem: Connector Hub Architecture as a Design Principle for Domain-Blind Integration in AI Systems

Lalitha A R — Fri, 10 Apr 2026 00:00:00 GMT

1 The Problem

Current large language models are built on a single architectural assumption: that the best path to cross-domain reasoning is to expose one model to all domains simultaneously during training, and let integration emerge from the resulting parameter space. The assumption is productive. Models trained this way do transfer across domains; they do apply concepts from one field to problems in another; they do find patterns that transcend domain boundaries. The assumption has earned its place.

This paper does not argue that the assumption is wrong. It argues that the brain solved the same problem differently, that the brain’s solution has been empirically characterised in some detail, and that taking it seriously as a design principle opens experimental directions that current architectures do not explore.

The brain does not train a single substrate across all domains. It maintains domain-specific processing modules and coordinates them through a distinct class of regions whose defining property is precisely that they are not domain-specific. These connector hub regions manage the integration problem without holding the domain content. The architecture is separable: specialisation happens in one place, coordination happens in another, and the two are functionally distinct.

The question this paper poses is narrow: is there a meaningful AI architecture that reflects this separation? And if there is, what would it need to do that existing approaches do not already do?

2 Background

2.1 What has been established in the domain-generalist paradigm

The large language model approach treats language modelling over a broad training corpus as the mechanism by which domain knowledge is acquired and cross-domain transfer is enabled. The model learns domain-specific patterns — the vocabulary, the relational structures, the typical inferential moves of a domain — by exposure to enough text from that domain. It learns cross-domain transfer by exposure to text that itself crosses domains: scientific writing that borrows from adjacent fields, interdisciplinary papers, analogical explanations, and so on.

The result is a model that holds domain knowledge and coordination capacity in the same parameter space. When the model encounters a problem, it does not route to a specialist; it retrieves from a generalised substrate that contains everything at once. Retrieval-augmented generation (Lewis et al. 2021) and fine-tuning extend this by adding domain specificity as a correction applied after training: the base model is a generalist, and specialisation is layered on. Mixture-of-experts architectures (Shazeer et al. 2017) pursue a different efficiency: within a single model, a gating network routes each token to a subset of parameter experts. This reduces inference cost without changing the epistemic structure — all experts are trained jointly under the same loss, in the same model, on the same data distribution.

None of these approaches separates coordination from domain expertise at the architectural level. The coordinating function — whatever the model does when it integrates across domains — is distributed throughout the same weights that hold domain content.

2.2 What the brain does instead

Functional neuroimaging research has documented a different structure. The human brain is not organised as a single generalised processor. It is organised as a set of discrete functional modules — each densely interconnected internally, each performing a domain-specific cognitive function — coordinated by a distinct class of regions that do not themselves perform domain-specific computation.

Bertolero, Yeo, and D’Esposito (2015) established this architecture empirically across 9,208 experiments and 77 cognitive tasks in the BrainMap database. Using resting-state fMRI and graph-theoretic network analysis, they identified 14 distinct functional modules with strong spatial correspondence to known cognitive functions. They then measured activity at different types of nodes across all tasks. Local nodes within modules — provincial hubs — did not increase activity as more cognitive functions were engaged. Their computational load remained constant regardless of task complexity. Connector nodes — regions with high participation coefficients, meaning their connections were distributed evenly across many modules rather than concentrated within any one — showed a different pattern entirely. Their activity increased proportionally to the number of modules engaged in a task.

This finding has a specific implication. Connector nodes are not doing more of what the domain modules are doing when tasks get more complex. They are doing something else entirely: managing the integration load that increases when many modules must work together, while preserving the autonomy of each module’s function. The modules stay modules; the connector nodes handle the coordination between them.

Bertolero et al. (2018) extended this to a mechanistic account. Connector hubs do not merely route information between modules; they actively tune the connectivity of their neighbours, reorganising which modules are more or less connected based on current task demands. Individuals with more diversely connected hubs and more modular brain networks show higher cognitive performance across all tasks — not on any specific task, but across the board. The diversity of hub connectivity predicts general integration capacity.

The architectural principle that emerges from this literature is not that specialisation and integration are in tension. It is that they are structurally separable and mutually reinforcing: more modular domain processing combined with more capable coordination produces better outcomes than either alone (Menon and D’Esposito 2022).

3 The Architecture in Detail

3.1 Module autonomy is not isolation

A clarification matters here. Saying that domain modules process information autonomously does not mean they are isolated from one another. The brain is not a collection of silos that occasionally exchange messages. It is a network in which modules maintain dense internal connectivity while connector hub regions manage cross-module communication selectively, based on task demands.

The key property of connector hubs is the participation coefficient (Sporns and Betzel 2016): the degree to which a node’s connections are distributed evenly across modules rather than concentrated within one. A node with a high participation coefficient is well-connected to many modules. It has access to what each module is doing. But it does not perform any module’s function. It is neither a domain specialist nor a blank generalist. It occupies a structurally distinct role: a node that can reach across module boundaries without being defined by any of them.
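
To make the measure concrete, a commonly used definition in the network-science literature is P_i = 1 − Σ_s (k_is / k_i)², where k_i is the degree of node i and k_is is the number of its links into module s. The formula is not stated in this paper, so the sketch below should be read as an assumption about which definition is meant. It uses an unweighted, undirected toy graph; `EDGES`, `MODULE_OF`, and `participation_coefficient` are hypothetical names.

```python
from collections import Counter

# Toy graph: three triangular modules plus one node "h" that links to all three.
EDGES = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("c1", "c2"), ("c2", "c3"), ("c1", "c3"),
    ("h", "a1"), ("h", "b1"), ("h", "c1"),
]
MODULE_OF = {
    "a1": "A", "a2": "A", "a3": "A", "h": "A",
    "b1": "B", "b2": "B", "b3": "B",
    "c1": "C", "c2": "C", "c3": "C",
}


def participation_coefficient(edges, module_of):
    """P_i = 1 - sum_s (k_is / k_i)^2 for every node that has at least one edge."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    result = {}
    for node, nbrs in neighbours.items():
        degree = len(nbrs)
        links_per_module = Counter(module_of[n] for n in nbrs)
        result[node] = 1.0 - sum((k / degree) ** 2 for k in links_per_module.values())
    return result


if __name__ == "__main__":
    for node, p in sorted(participation_coefficient(EDGES, MODULE_OF).items()):
        print(f"{node}: {p:.3f}")
```

In this toy graph, `h` spreads its three links evenly over three modules and scores 1 − 3·(1/3)² = 2/3, while `a2`, whose links all stay inside module A, scores 0. The contrast mirrors the connector versus provincial distinction drawn above.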

Gordon et al. (2018) refined this picture further, showing that connector hubs are not a single category. Three distinct sets were identified, each with a different task-activation profile: one set deactivates across tasks, one activates during all tasks, and one activates specifically during tasks requiring the configuring of input, transformation, and output processes. This differentiation within the coordinator role is relevant because it suggests coordination is itself a structured function: not a homogeneous relay, but a set of subtypes performing distinct integrative operations.

3.2 What connector hubs actually compute

The literature on connector hub function does not describe these regions as performing structural isomorphism detection, that is, comparing problem shapes across domains and flagging when a problem in one domain has the same relational structure as a solved problem in another.

What it documents is routing and tuning: managing which modules are active, how strongly they communicate, and how that connectivity pattern shifts as task demands change. The connector hub’s documented computational role is coordination in the sense of network configuration, not in the sense of cross-domain analogy.

The analogical reasoning literature, however, sits adjacent and is worth examining. A meta-analysis of 27 neuroimaging studies on analogical reasoning found that the left rostrolateral prefrontal cortex (rlPFC) is the region most consistently activated across all analogical reasoning tasks, regardless of whether the domain is semantic or visuospatial (Hobeika et al. 2016). The rlPFC is domain-general for analogy. Lesions to the left rlPFC impair analogical reasoning across domains. And the rlPFC is anatomically located within the connector hub regions identified by the modular brain architecture literature.

This anatomical overlap does not establish that connector hubs are cross-domain analogy engines. It establishes something more modest: the architectural conditions that define connector hubs — high participation coefficient, domain-distributed connectivity, low domain specificity — are the same conditions under which cross-domain relational comparison is supported. Whether a coordination layer trained into this architectural role would develop the capacity for structural similarity detection across domains is an open question. The brain architecture suggests it is not an implausible one.

Gentner’s structure-mapping theory (Gentner 1983) provides the theoretical vocabulary for what this function would be. Analogy, in the structure-mapping framework, depends on finding relational correspondences between domains — not surface similarity between objects, but systematic similarity between the roles objects play within a relational structure. The function is domain-blind by definition: the same relational structure can exist in two entirely different content domains, and detecting it requires abstracting away from domain content. A coordinator trained to detect such correspondences would not need to know what a domain is about; it would need to know what shape a problem has.
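
A toy example can show what "domain-blind" means here. The sketch below is not Gentner's structure-mapping engine and not a proposal for the coordinator; it is a brute-force illustration under simplifying assumptions. Each domain is a set of hypothetical (relation, argument, argument) triples, relation names must match exactly, and object names are ignored. The function searches for the object-to-object mapping that preserves the most relations. The solar system and atom facts are the classic textbook analogy, and all identifiers are introduced here for illustration.

```python
from itertools import permutations

# Each fact is (relation, first argument, second argument).
SOLAR_SYSTEM = {
    ("attracts", "sun", "planet"),
    ("revolves_around", "planet", "sun"),
    ("more_massive", "sun", "planet"),
}
ATOM = {
    ("attracts", "nucleus", "electron"),
    ("revolves_around", "electron", "nucleus"),
    ("more_massive", "nucleus", "electron"),
}


def objects(facts):
    """Sorted list of every object mentioned in a set of facts."""
    return sorted({x for _, a, b in facts for x in (a, b)})


def best_mapping(base, target):
    """Try every injective object mapping; keep the one preserving most relations.

    Returns ({}, -1) if the target has fewer objects than the base.
    """
    base_objs, target_objs = objects(base), objects(target)
    best, best_score = {}, -1
    for perm in permutations(target_objs, len(base_objs)):
        mapping = dict(zip(base_objs, perm))
        score = sum((rel, mapping[a], mapping[b]) in target for rel, a, b in base)
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score


if __name__ == "__main__":
    mapping, score = best_mapping(SOLAR_SYSTEM, ATOM)
    print(mapping, score)
```

The search never inspects what "sun" or "nucleus" means; it only checks which objects occupy the same roles in matching relations. Exhaustive search over permutations is feasible only for a handful of objects, which is one reason the function described in the text remains a hypothesis rather than a solved engineering problem.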

4 The Distinction from Existing Approaches

4.1 Mixture of experts

Mixture-of-experts architectures (Shazeer et al. 2017; Cai et al. 2025) are the closest existing analogue to the proposed architecture. MoE models contain multiple sub-networks (experts), with a gating mechanism routing each token to a small subset of experts during inference. This achieves computational efficiency, since not all parameters are activated for every input, and produces a form of functional specialisation within the model.

The distinction from the proposed architecture is epistemic rather than computational. In MoE, all experts are trained jointly under the same loss, within the same model, on the same data distribution. The gating network is trained simultaneously with the experts; there is no separation between the coordinator’s training objective and the specialists’ training objective. The experts are not domain-native in the sense of having been developed to hold a specific domain’s knowledge independently of the generalist training regime. They are weight-level subnetworks within a single model that have developed different activation patterns through joint training.
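
The routing step can be sketched in a few lines. The code below is a simplified, pure-Python illustration of top-k gating in the spirit of Shazeer et al. (2017); it omits the noise term and load-balancing losses used in practice, and the names `top_k_gate`, `moe_forward`, and `scale` are hypothetical. It shows only the forward pass. The point made in the text concerns training: in a real MoE, the gate weights and the expert parameters would be updated together under one loss.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(gate_logits, k):
    """Keep the k largest logits and renormalise them; other experts get no weight."""
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([gate_logits[i] for i in top])
    return dict(zip(top, weights))


def moe_forward(x, experts, gate_weights, k=2):
    """Route input x to the top-k experts and mix their outputs by gate weight."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    routing = top_k_gate(logits, k)
    out = [0.0] * len(x)
    for idx, weight in routing.items():
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, routing


def scale(factor):
    """Stand-in expert: multiplies every component of the input by a constant."""
    return lambda v: [factor * a for a in v]


if __name__ == "__main__":
    experts = [scale(1.0), scale(2.0), scale(3.0), scale(4.0)]
    gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [0.5, 0.5]]
    print(moe_forward([1.0, 2.0], experts, gate_weights, k=2))
```

Nothing in this forward pass distinguishes a coordinator from a specialist by training history. The gate is simply another set of weights in the same model, which is the epistemic point made above.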

The connector hub analogy is structurally different. Domain-native modules, in the brain, are not trained jointly with the connector hubs under a shared loss. They develop through domain-specific experience and exposure; connector hubs develop separately. The proposed AI architecture would reflect this separation: domain-native specialist models trained on domain-specific corpora, and a coordination layer trained separately — potentially on a different objective entirely, concerned with structural relationships across domains rather than domain content.

4.2 Retrieval-augmented generation

Retrieval-augmented generation (Lewis et al. 2021) adds domain specificity to a generalist model by retrieving relevant documents at inference time and including them in the context window. This is a post-training correction: the base model remains a generalist; specialisation is supplied externally.
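
The inference-time pattern can be sketched as follows. This is a deliberately minimal stand-in: Lewis et al. use a learned dense retriever, whereas the code below scores documents by simple word overlap, and the generator call is left out entirely. The function names `retrieve` and `build_prompt` and the prompt layout are assumptions made for illustration.

```python
import re
from collections import Counter


def tokens(text):
    """Lower-cased word counts; a crude substitute for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, top_n=2):
    """Return up to top_n documents sharing at least one word with the query."""
    q = tokens(query)
    scored = []
    for doc in documents:
        overlap = sum((q & tokens(doc)).values())
        scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [doc for overlap, doc in scored[:top_n] if overlap > 0]


def build_prompt(query, documents, top_n=2):
    """Place retrieved text ahead of the question, as a generalist model would see it."""
    context = retrieve(query, documents, top_n)
    lines = ["Context:"] + [f"- {doc}" for doc in context] + ["", f"Question: {query}"]
    return "\n".join(lines)


if __name__ == "__main__":
    docs = [
        "Connector hubs have high participation coefficients.",
        "Top-k gating routes each token to a subset of experts.",
        "Structure mapping aligns relational roles across domains.",
    ]
    print(build_prompt("How does gating route a token to experts?", docs))
```

The base model is untouched by this step. Domain material arrives only through the context window, which is what makes retrieval a correction applied to a generalist rather than an architectural separation.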

The proposed architecture differs in that domain specialists are not corrections applied to a generalist. They are the primary domain processors. The coordinator does not have domain knowledge that gets topped up by retrieval; it does not hold domain knowledge in the first place. The separation is architectural, not a retrieval strategy.

4.3 Current multi-agent systems

Multi-agent systems (Xiao et al. 2024) distribute tasks across multiple models and coordinate their outputs through an orchestrator. This is the existing approach closest in spirit to the proposed architecture, and it shares the structural separation the brain’s architecture exhibits. The limitation documented in current production deployments is coordination overhead: as the number of specialists increases, the coordination tax (communication overhead, latency, context management) grows faster than the benefit (Königstein 2026). The orchestrator in most deployed systems is not a model with a distinct training objective for coordination; it is a generalist model given a coordination role through prompting. The coordination capacity is borrowed from the generalist’s general capability, not developed as a distinct function.

The proposed architecture asks whether a coordinator trained specifically for structural integration — with its own training objective, on a corpus of cross-domain relational correspondences rather than on domain content — would perform differently from a prompted generalist acting as coordinator. The brain’s architecture suggests these are different things. Whether they produce different outcomes in AI systems is the experimental question.

5 Convergent Evidence from Organisational Theory

The structural separation of coordination from domain expertise is not a new idea. It has been independently arrived at in human knowledge systems across several fields.

Lawrence and Lorsch (1967) formalised it in organisational theory as the tension between differentiation — the development of specialised subunits with their own goals, time horizons, and epistemic norms — and integration — the coordination of differentiated subunits toward shared outcomes. Their empirical finding was that high-performing organisations in complex environments achieved both: more differentiated than low performers and more integrated. The integrator role in their framework is structurally analogous to the connector hub: a person or unit that coordinates across specialist domains without being a domain specialist, whose effectiveness depends on being trusted by all parties rather than being expert in any one domain.

The T-shaped manager concept (Guest 1991; Johnson 1978) formalises the same principle at the individual level. The vertical bar represents deep domain expertise; the horizontal bar represents the boundary-crossing competencies that enable coordination across specialisms. The horizontal bar is not generalisation in the sense of knowing everything at shallow depth. It is coordination capacity: the ability to integrate across domains without being defined by any of them. The T-shaped manager does not perform the specialist’s function; they create the conditions under which specialists can work together.

In legal practice, large firms have independently evolved an analogous structure. Complex multi-practice matters are handled by coordinating partners who assemble and route between domain specialists — IP attorneys, tax attorneys, litigation specialists — without needing deep expertise in each practice area. The coordinating partner’s role is not to do the specialist’s work but to understand which specialist is needed when, and to translate across the epistemic boundaries between practice groups. The domain specialists remain autonomous; the coordinator holds the integration function.

None of these analogies constitutes proof. They constitute convergent independent discovery of the same structural principle in systems facing the same problem: how to achieve coordination across domain-specialist components without collapsing the specialisation that makes the components useful.

6 What the Coordinator Would Need to Do

The proposed architecture separates into two components with different requirements.

Domain-native specialist models are trained on domain-specific corpora, with training objectives appropriate to their domain. Their epistemic authority is domain-bounded. They do not need to know what other specialists know; they need to produce high-quality domain-specific outputs when queried. The TRM result (Jolicoeur-Martineau 2025) — a 7M parameter model achieving competitive performance on structured reasoning tasks — suggests that small, domain-native models may be sufficient for specialist functions that current generalist models handle with far more parameters.

The coordination layer is the novel component. Its training objective is not domain content. It is structural: learning to represent problems in terms of their relational structure, to route queries to appropriate specialists, and — potentially — to detect when a problem in one domain shares relational structure with a problem another specialist has encountered. This last function is not a given. It is a hypothesis about what a coordination layer trained at the architectural level of connector hubs might develop. The rlPFC literature suggests the conditions for such a function are present in the analogous brain architecture; whether those conditions can be reproduced in a trained model is an empirical question.

The training corpus for such a coordinator is not obvious. One tractable direction is the history of science: interdisciplinary papers that explicitly transfer frameworks across domains, analogical explanations in scientific pedagogy, and cross-domain problem-solving literature document the function the coordinator would need to perform. Whether a model trained on this corpus would generalise to novel cross-domain structural correspondences rather than memorising the surface forms of known analogies is an open methodological question.

7 Scope and Limitations

7.1 What this paper does not claim

This paper does not claim that the proposed architecture would outperform current large language models on any benchmark. The claim is architectural and conceptual: that the separation of coordination from domain expertise is a structural principle documented in the brain and independently discovered in human organisational systems, and that AI architecture has not yet explored it at the level the brain implements it.

This paper does not propose an implementation. The training objective for the coordination layer, the mechanism by which specialists and coordinator communicate, the representation format for structural similarity, and the evaluation framework for coordination quality are all open engineering questions. Scoping them is outside the range of what a conceptual paper can usefully do.

This paper does not argue that domain-native training produces better specialists than generalised training in all cases. There are domains where generalised training produces specialists that match or exceed domain-native fine-tuning. The architectural argument is not about which approach produces better specialists; it is about whether the coordination function is better served by a dedicated coordinator with its own training objective than by a generalist model acting as coordinator.

7.2 What remains open

The most significant open question is the training objective for the coordinator. The brain’s connector hubs develop their function through experience in a system where domain modules are already developing their functions simultaneously. A training objective that reproduces this developmental condition in a supervised setting does not yet exist.

The evaluation question is similarly open. Current benchmarks evaluate domain performance. Cross-domain transfer is typically evaluated by measuring performance on domain B after training on domain A. Neither evaluates coordination quality directly — the capacity of a coordinator to route appropriately, integrate across specialists, and detect structural correspondences across domain boundaries. Developing such an evaluation framework may be a precondition for testing the architecture.

8 Authorship Note

Lalitha A R identified the architectural parallel between connector hub function and the proposed AI coordination layer, formulated the question of whether a domain-blind coordinator trained separately from domain specialists would behave differently from a prompted generalist acting as coordinator, developed the cross-domain isomorphism detection hypothesis as an extension of the connector hub analogy, and directed the search for convergent parallels in organisational theory and legal practice.

Claude searched the neuroscience literature, confirmed the rlPFC analogical reasoning literature as the relevant adjacent body of work, located and verified the organisational theory and law firm parallels on Lalitha’s direction, built the papertable and bibliography, and drafted this paper from the resulting materials. The core architectural question, the isomorphism extension, and the cross-domain framing instinct are Lalitha’s. The literature retrieval, synthesis, and written draft are Claude’s.

9 References

Bertolero, Maxwell A., B. T. Thomas Yeo, Danielle S. Bassett, and Mark D’Esposito. 2018. “A Mechanistic Model of Connector Hubs, Modularity and Cognition.” Nature Neuroscience 21: 1127–35. https://doi.org/10.1038/s41593-018-0157-3.

Bertolero, Maxwell A., B. T. Thomas Yeo, and Mark D’Esposito. 2015. “The Modular and Integrative Functional Architecture of the Human Brain.” Proceedings of the National Academy of Sciences 112 (49): E6798–807. https://doi.org/10.1073/pnas.1510619112.

Cai, Weilin, et al. 2025. “A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications.” arXiv Preprint. https://arxiv.org/html/2503.07137v1.

Gentner, Dedre. 1983. “Structure-Mapping: A Theoretical Framework for Analogy.” Cognitive Science 7 (2): 155–70. https://doi.org/10.1207/s15516709cog0702_3.

Gordon, Evan M., Charles J. Lynch, Caterina Gratton, et al. 2018. “Three Distinct Sets of Connector Hubs Integrate Human Brain Function.” Cell Reports 24 (7): 1687–96. https://doi.org/10.1016/j.celrep.2018.07.050.

Guest, David. 1991. “The Hunt Is on for the Renaissance Man of Computing.” The Independent.

Hobeika, Luc, Cassandre Diard-Detoeuf, Béatrice Garcin, Richard Levy, and Emmanuelle Volle. 2016. “General and Specialized Brain Correlates for Analogical Reasoning: A Meta-Analysis of Functional Imaging Studies.” Human Brain Mapping 37 (5): 1953–69. https://doi.org/10.1002/hbm.23149.

Johnson, Denis. 1978. “T-Shaped Manager.” IEEE Engineering Management Review.

Jolicoeur-Martineau, Alexia. 2025. Less Is More: Recursive Reasoning with Tiny Networks. https://arxiv.org/abs/2510.04871.

Königstein, Nicole. 2026. “Designing Effective Multi-Agent Architectures.” O’Reilly Radar. https://www.oreilly.com/radar/designing-effective-multi-agent-architectures/.

Lawrence, Paul R., and Jay W. Lorsch. 1967. “Differentiation and Integration in Complex Organizations.” Administrative Science Quarterly 12 (1): 1–47. https://doi.org/10.2307/2391211.

Lewis, Patrick, Ethan Perez, Aleksandra Piktus, et al. 2021. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. https://arxiv.org/abs/2005.11401.

Menon, Vinod, and Mark D’Esposito. 2022. “The Role of PFC Networks in Cognitive Control and Executive Function.” Nature Reviews Neuroscience 23: 535–55. https://doi.org/10.1038/s41583-022-00580-9.

Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, et al. 2017. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” arXiv Preprint. https://arxiv.org/abs/1701.06538.

Sporns, Olaf, and Richard F. Betzel. 2016. “Modular Brain Networks.” Annual Review of Psychology 67: 613–40. https://doi.org/10.1146/annurev-psych-122414-033634.

Xiao, Lianmin, et al. 2024. “Optimizing Generative AI Networking: A Dual Perspective with Multi-Agent Systems and Mixture of Experts.” arXiv Preprint. https://arxiv.org/abs/2405.12472.

Reuse

CC BY 4.0

When the Means Become the End: Instrumental-Terminal Goal Inversion in Large Language Models

Lalitha A R — Thu, 02 Apr 2026 00:00:00 GMT

1 The Problem

A researcher shares a paper and asks an LLM to synthesize its relevance into a lab log. The log is returned: four sections, each with a header, precisely scoped claims, proper citations, a section on limitations, a section on what to retain. Every structural element of a log is present. The terminal goal — recording one transferable insight clearly enough that a reader six months from now can reconstruct why the paper matters — is not served. The log is a completed artifact that does not do its job.

This failure is not hallucination. The claims are accurate. It is not sycophancy — the model is not telling the researcher what she wants to hear. It is not specification gaming in the classical sense — no reward signal is being hacked. The model has simply substituted completing the structure for serving the purpose the structure exists for. The instrumental goal has displaced the terminal goal.

This paper argues that this failure mode is: (a) systematic and predictable, not idiosyncratic; (b) structurally identical to a well-documented phenomenon in organizational sociology; (c) the inverse of how humans respond to the same constraints; and (d) quantifiable through a specific experimental design.

2 Background and Motivation

2.1 What has been named, and what has not

The LLM alignment literature has developed a family of related concepts for failures at the boundary of intent and execution.

Specification gaming occurs when a model achieves the literal specification of an objective without achieving the intended outcome (Amodei et al. 2016). The canonical examples are from reinforcement learning: a boat-racing agent that maximizes score by circling checkpoints rather than finishing the race; a cleaning robot that covers its camera rather than cleaning. In each case, a proxy metric is gamed because the true objective was underspecified. The model finds an unintended solution to the stated objective.

Reward hacking is the broader category: exploiting flaws or blind spots in the reward model to achieve high proxy reward without satisfying human intent (Denison et al. 2024). Sycophancy — agreeing with false user claims to generate approval signal — is a trained-in form of reward hacking. Reward tampering is its extreme: modifying the reward mechanism itself.

Goal misgeneralization occurs when a model pursues a proxy goal that correlated with the intended goal during training but diverges from it in deployment, particularly under distributional shift (Langosco di Langosco et al. 2022). The model has learned the wrong goal; it did not fail to execute the right one.

None of these concepts describes the failure in the opening example. In that case:

The objective is not underspecified. The researcher stated what she wanted.
No reward signal is being gamed. There is no approval-seeking behavior.
There is no distributional shift. The task is squarely in-distribution for the model.
The model has not learned the wrong goal. It has correctly identified the immediate task.

What has happened is different: the model has treated the task’s instrumental structure — the log format, the section conventions, the expected length and scope — as the thing to be optimized, and in doing so has lost track of the terminal goal the task was supposed to serve. The artifact is complete. The purpose is not.

2.2 The cross-domain precedent

This failure mode has a precise name in organizational sociology. Merton (1940) described it in bureaucracies: rules designed as means to an end become ends in themselves through a process he called goal displacement. The bureaucrat adheres to every rule, satisfies every procedure, and fails every client. Merton called the extreme case “the bureaucratic virtuoso, who never forgets a single rule binding his action and hence is unable to assist many of his clients” (Merton 1940, 563).

The psychological mechanism Merton cited was Allport’s (1937) functional autonomy of motives — the principle that instrumental behaviors can become self-sustaining and motivationally independent from their original purpose. “What was once a means becomes an end in itself” (Allport 1937). Allport’s workman who continues to do clean-cut jobs even when his security no longer depends on it is the benign version; Merton’s bureaucrat is the organizational pathology.

The measurement-science parallel is Goodhart’s Law, given here in its popular formulation: when a measure becomes a target, it ceases to be a good measure (Goodhart 1975). Campbell stated the same principle from social science: the more a quantitative indicator is used for social decision-making, the more it distorts the process it was meant to monitor (Campbell 1979). In each case, a proxy for a goal displaces the goal.

What connects these traditions is a shared structural logic. An instrumental value — a rule, a metric, a procedure — is created to serve a terminal goal. Under conditions that reward adherence to the instrumental value independent of terminal goal service, the instrumental value becomes terminal. The original goal disappears from view.

3 The Theoretical Claim

3.1 Instrumental-terminal goal inversion defined

We define instrumental-terminal goal inversion (ITGI) as follows:

Given a task with an explicit terminal goal $T$ and a set of instrumental constraints $I$ specified to serve $T$, ITGI occurs when a model’s output satisfies $I$ while failing to serve $T$, and this failure is attributable to the model treating satisfaction of $I$ as sufficient for task completion.

The key conditions distinguishing ITGI from related phenomena:

$T$ is stated in the prompt, not merely implied or inferable from reward signal.
$I$ is explicitly specified (format requirements, section structure, output length, schema).
The model’s output satisfies all or most elements of $I$.
The model’s output does not serve $T$: specifically, a reader with only the output cannot accomplish what $T$ required.
The failure is not attributable to factual error, hallucination, or task misunderstanding.

This distinguishes ITGI from specification gaming (where $T$ is underspecified), from sycophancy (where the model is optimizing for approval), and from goal misgeneralization (where distributional shift causes a wrong goal to be pursued).
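
The conditions above can be read as a single predicate over per-output judgments. The sketch below is illustrative only: the `OutputAssessment` fields, the function name `is_itgi`, and the 0.8 cutoff standing in for "all or most" constraints satisfied are assumptions introduced here, not part of the definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputAssessment:
    """Hypothetical per-output judgments; field names are illustrative."""
    goal_stated_in_prompt: bool       # the terminal goal appears explicitly in the prompt
    constraints_explicit: bool        # the instrumental constraints are explicitly specified
    constraints_satisfied: float      # fraction of instrumental constraints met, 0.0 to 1.0
    serves_goal: bool                 # a reader with only the output can do what the goal required
    has_factual_or_task_error: bool   # hallucination, factual error, or task misunderstanding


def is_itgi(a: OutputAssessment, min_satisfied: float = 0.8) -> bool:
    """Return True only when every ITGI condition holds.

    min_satisfied is an assumed cutoff for "all or most" constraints.
    """
    return (
        a.goal_stated_in_prompt
        and a.constraints_explicit
        and a.constraints_satisfied >= min_satisfied
        and not a.serves_goal
        and not a.has_factual_or_task_error
    )
```

An output that meets every constraint but contains a factual error is excluded by the last condition, which keeps ITGI separate from hallucination in any tally built on this predicate.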

3.2 The structural specification hypothesis

ITGI is not merely possible; we hypothesize that its likelihood increases monotonically with structural specification density. As the number and specificity of instrumental constraints in a prompt increase, the probability that the model’s output serves the terminal goal decreases, holding terminal goal clarity constant.

This is the counterintuitive claim. For humans, constraints serve as scaffolding — they reduce cognitive overhead allocated to the how, freeing attention for the why. A researcher given a template for a log entry is freed from formatting decisions and can concentrate on what the log should say. The constraints help.

For LLMs, the hypothesis is that the inverse holds: each additional constraint is an additional optimization target, and as constraint density increases, the model’s attention allocation shifts from the terminal goal $T$ to the instrumental constraints $I$. The constraints crowd out the goal.

The empirical support for this direction comes from Sridhar et al. (2023), whose ASH (Actor-Summarizer-Hierarchical) prompting work on web navigation demonstrated that when a single LLM prompt must simultaneously process raw environmental observations and predict the next action, performance degrades sharply on long-horizon tasks. On trajectories exceeding 11 steps, REACT — which loads both observation processing and action prediction into a single prompt — scored 7.4; ASH, which separates these functions into a SUMMARIZER and an ACTOR, scored 38.2 (Sridhar et al. 2023). The implicit diagnosis: when a model must manage simultaneous instrumental load (process the current observation) and terminal goal tracking (buy the right product), terminal goal tracking degrades first. The fix — hierarchical decomposition that isolates instrumental processing — is structural evidence for the hypothesis.

3.3 The inverse human pattern

The human behavioral literature on intention establishes the baseline against which ITGI is the inversion. The intention-action gap — the well-documented failure of humans to execute their stated intentions — shows that humans hold terminal goals but frequently fail on the instrumental side (Sheeran 2002; Sheeran and Webb 2016). Intentions explain only 18–28% of behavioral variance, even when the intention is strong and clearly stated.

The LLM failure runs in the opposite direction. Models execute the instrumental structure reliably and completely. What they fail to maintain is the terminal goal. Humans fail at doing; LLMs fail at purposing.

This inversion is not merely a rhetorical point. It has methodological implications for how the failure should be studied, and design implications for how it might be mitigated. Strategies developed to close the human intention-action gap — implementation intentions, commitment devices, environmental triggers — work by strengthening the link between a held terminal goal and instrumental execution. The LLM problem requires the reverse: strengthening the link between instrumental execution and a terminal goal that has not been lost but has been deprioritized.

4 Cross-Domain Synthesis

4.1 The common structure

Across the organizational sociology, measurement science, and LLM literatures, a single structural pattern recurs:

A terminal goal exists: serve the client, measure economic health, help the researcher.
An instrumental proxy is created to serve the terminal goal: follow the rules, track the money supply, complete the artifact.
Under conditions where adherence to the proxy is rewarded independent of terminal goal service, the proxy displaces the terminal goal.
The agent — bureaucrat, central bank, LLM — then optimizes the proxy while failing the original goal.

What varies across domains is the mechanism of displacement:

In bureaucracies, displacement is driven by incentive structures: career advancement depends on rule compliance, not client outcomes.
In measurement systems, displacement is driven by optimization pressure: when a metric becomes a target, actors game it.
In LLMs, displacement is driven by attention allocation during inference: satisfying explicit constraints is a local, verifiable task; serving a terminal goal requires maintaining a non-local purpose across the response.

The LLM mechanism is distinct from the human mechanisms in an important way. Bureaucratic ritualism is chosen — the bureaucrat has other options and selects rule compliance. Metric gaming is strategic — the actor knows the metric is a proxy and exploits the gap. LLM ITGI is neither chosen nor strategic. The model does not know it has displaced the terminal goal. The displacement is a property of how inference proceeds when constraints are dense, not a property of motivation or strategy.

4.2 Allport’s functional autonomy as the deepest analog

Allport’s functional autonomy (Allport 1937) is the closest structural analog to LLM ITGI — and also the most illuminating difference. Allport showed that instrumental behaviors can become self-sustaining: a motive that originates as a means to an end acquires its own motivational energy, independent of the original end. The workman who does clean-cut jobs even when his income no longer depends on it has developed a functionally autonomous motive for craftsmanship.

In humans, this is generally adaptive: functionally autonomous motives allow complex behaviors to persist without continuous reference to their original justification. The craftsman doesn’t recalculate the utility of quality work every time he picks up a tool.

In LLMs, the analog fails to generalize adaptively. There is no “persistence of a motive” — there is no motive, in the psychological sense. What there is: a training distribution that rewards well-formed artifacts, and an inference-time process that generates the most plausible completion of a prompt that already contains an elaborate structure. The structure predicts its own completion. The terminal goal, if not redundantly encoded in ways that compete with the structural signal, loses salience.

4.3 Goodhart as the measurement-science frame

Manheim and Garrabrant (2018) distinguish four variants of Goodhart’s Law: regressional (the proxy correlates imperfectly with the goal), extremal (the proxy diverges from the goal at extreme optimization), causal (optimizing the proxy changes the underlying relationship), and adversarial (an agent exploits the gap between proxy and goal) (Manheim and Garrabrant 2018).

LLM ITGI most closely resembles the regressional variant: the proxy (artifact completion) correlates with the terminal goal (task purpose) under normal conditions but diverges when structural specification is dense. The correlation holds for simple tasks with thin constraints; it breaks down as constraint density increases.

This framing is useful because it predicts where ITGI will be most severe: tasks with elaborate templates, multi-section output requirements, rigid format constraints, and complex schemas. These are, not coincidentally, the tasks where LLMs are most commonly deployed in professional and research settings — report generation, document drafting, structured analysis, code documentation.

5 Experimental Framework

5.1 What needs to be shown

Three empirical claims require testing:

Existence: ITGI occurs at inference time in current LLMs — outputs that satisfy structural constraints while failing terminal goals.
Monotonicity: ITGI increases as structural specification density increases, holding terminal goal clarity constant.
Asymmetry: The relationship between structural specification and ITGI is different for LLMs than for humans performing the same tasks.

5.2 Core design

Task pairs with separable terminal and instrumental goals. The key design requirement is that service of the terminal goal $T$ and satisfaction of the instrumental constraints $I$ can be independently evaluated. Tasks where artifact completion and purpose-serving are inseparable are uninformative.

Suitable task types:

Synthesis tasks: Summarize this paper in a way that helps a reader decide whether to read it. Instrumental: produce a summary of appropriate length and scope. Terminal: enable the decision.
Advisory tasks: Draft a note explaining this finding to a non-specialist audience. Instrumental: produce a note in the specified format. Terminal: the reader understands the finding.
Selection tasks: Write a log entry for this source that captures what is relevant to Project X. Instrumental: produce a log entry. Terminal: a future researcher can use it without reading the source.

Specification density as the independent variable. Three conditions:

Condition A (thin): Terminal goal stated only. No format, length, or section requirements.
Condition B (moderate): Terminal goal stated plus moderate structure (suggested sections, approximate length).
Condition C (dense): Terminal goal stated plus elaborate structure (required section headers, word count constraints, mandatory elements).

Measurement of terminal goal service. The challenge is avoiding subjective evaluation. Three approaches, in increasing defensibility:

Downstream task completion: Give readers only the output and ask them to accomplish what the terminal goal $T$ required (make the decision, explain the finding to someone else, use the log entry without the source). Measure success rate.
Counterfactual completeness: Have domain experts identify the 3–5 elements an output must contain to serve $T$. Score presence/absence. ITGI predicts that Condition C outputs will score lower on this list despite longer overall length and higher structural compliance.
Truncation sensitivity: Progressively shorten outputs from the end. Measure at what point $T$-relevant content disappears vs. at what point structural completeness fails. ITGI predicts these diverge, with $T$-relevant content concentrated early and structural completion content concentrated late.
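
The second and third measures can be sketched together. The code below is a minimal illustration, assuming that each expert-listed element is represented by a few text cues and that "present" means any cue appears in the output. Cue matching is a crude stand-in for expert scoring, and the names `completeness_score` and `truncation_curve` are hypothetical.

```python
def completeness_score(output: str, required: dict[str, list[str]]) -> float:
    """Fraction of required elements with at least one cue present in the output."""
    if not required:
        raise ValueError("at least one required element is needed")
    text = output.lower()
    hits = sum(
        any(cue.lower() in text for cue in cues) for cues in required.values()
    )
    return hits / len(required)


def truncation_curve(
    output: str, required: dict[str, list[str]], steps: int = 4
) -> list[tuple[float, float]]:
    """Completeness after keeping only the first k/steps of the sentences.

    Sentences are split naively on full stops, which is an assumption
    that will not suit every text.
    """
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    curve = []
    for k in range(1, steps + 1):
        keep = max(1, round(len(sentences) * k / steps))
        kept_text = ". ".join(sentences[:keep])
        curve.append((k / steps, completeness_score(kept_text, required)))
    return curve
```

A curve that reaches full completeness well before the output ends is the signature the truncation measure looks for: goal-relevant content arrives early, and the remainder of the output is doing structural work.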

Human baseline. The asymmetry claim (Claim 3) requires human participants completing the same tasks under the same three conditions. If ITGI is real and inverted from human behavior, Condition C outputs from humans should be more purpose-serving than Condition A, while Condition C outputs from LLMs should be less purpose-serving.

6 Scope and Limitations

6.1 What this thesis does not claim

ITGI is not claimed to be:

The dominant failure mode of LLMs, or more common than hallucination, sycophancy, or factual error.
Present in all structured tasks. Tasks where structural compliance and terminal goal service are tightly correlated will not exhibit ITGI.
A property of current models specifically. Whether ITGI increases or decreases with model scale, RLHF, or chain-of-thought prompting is an empirical question this thesis does not answer.
A training-time phenomenon. The claim is about inference-time behavior given well-formed prompts.

6.2 Confounds requiring control

Task difficulty: More structurally complex tasks may simply be harder, producing lower overall quality independently of ITGI.
Length bias: Condition C prompts produce longer outputs, and longer outputs may dilute the concentration of content relevant to the terminal goal $T$ without reflecting goal displacement.
Model-specific behavior: Different models may show different ITGI rates. The structural specification hypothesis should be tested across model families, not assumed to generalize from a single model.

7 Authorship Note

Lalitha A R identified the phenomenon from a specific instance — a log entry that satisfied its structural requirements while failing its purpose — and connected it to an earlier observation documented in the iSRL GitHub discussion (isrl-research/discussions/10). She searched for analogues in the behavioral science literature, found The Decision Lab’s treatment of the intention-action gap, and identified the inversion: that the LLM failure runs opposite to the human failure.

Claude searched the academic literature, confirmed Merton’s goal displacement as the relevant organizational sociology tradition, identified Allport and Goodhart as the upstream sources, proposed the structural specification hypothesis as the testable form of the claim, and drafted this paper from the resulting papertable. The core observation, the inversion framing, and the cross-domain question are Lalitha’s. The literature mapping, experimental design, and written synthesis are Claude’s.

8 References

Allport, Gordon W. 1937. Personality: A Psychological Interpretation. Holt.

Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete Problems in AI Safety. https://arxiv.org/abs/1606.06565.

Campbell, Donald T. 1979. “Assessing the Impact of Planned Social Change.” Evaluation and Program Planning 2 (1): 67–90. https://doi.org/10.1016/0149-7189(79)90048-X.

Denison, Carson, Monte MacDiarmid, Fazl Barez, et al. 2024. Sycophancy to Subterfuge: Investigating Reward Tampering in Large Language Models. https://arxiv.org/abs/2406.10162.

Goodhart, Charles A. E. 1975. “Problems of Monetary Management: The U.K. Experience.” In Papers in Monetary Economics. Reserve Bank of Australia.

Langosco di Langosco, Lauro, Jack Koch, Lee D. Sharkey, Jacob Pfau, and David Krueger. 2022. “Goal Misgeneralization in Deep Reinforcement Learning.” Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 162: 12004–19.

Manheim, David, and Scott Garrabrant. 2018. Categorizing Variants of Goodhart’s Law. https://arxiv.org/abs/1803.04585.

Merton, Robert K. 1940. “Bureaucratic Structure and Personality.” Social Forces 18 (4): 560–68. https://doi.org/10.2307/2570634.

Sheeran, Paschal. 2002. “Intention–Behavior Relations: A Conceptual and Empirical Review.” European Review of Social Psychology 12 (1): 1–36. https://doi.org/10.1080/14792772143000003.

Sheeran, Paschal, and Thomas L. Webb. 2016. “The Intention–Behavior Gap.” Social and Personality Psychology Compass 10 (9): 503–18. https://doi.org/10.1111/spc3.12265.

Sridhar, Abishek, Robert Lo, Frank F. Xu, Hao Zhu, and Shuyan Zhou. 2023. Hierarchical Prompting Assists Large Language Model on Web Navigation. https://arxiv.org/abs/2305.14257.

Constrained AI-Assisted Sampling for Fragmented Textual Spaces: A Framework for Data Collection Where No Ground Truth Exists

Lalitha A R — Wed, 01 Apr 2026 00:00:00 GMT

Abstract

Standard data collection methods begin with one of two assumptions. Survey sampling assumes a population you can enumerate: you know the frame, you draw from it, you account for non-response. ETL pipelines assume a schema you can target: you know what fields exist, what types they carry, what cleaning they require. Both assumptions hold comfortably in well-documented domains.

They do not hold in fragmented textual spaces.

1 The Problem This Solves

A fragmented textual space is not simply messy data. It is a domain where the information exists — recorded somewhere, in some form — but is distributed across unstructured sources with no shared vocabulary, no authoritative lexicon, and variation patterns that automated similarity measures cannot reliably navigate. The Indian packaged food label space is one example: the same ingredient appears as maida, refined wheat flour, and all-purpose flour across different brands, while palm oil and palmite look similar but are functionally distinct. A global news archive is another: 2.6 million flood events are embedded in articles across 80 languages, with relative time references, imprecise location language, and no standardised event schema.

In both cases, the data exists. The challenge is not absence but structure: extracting something queryable from something that was written for human reading in a specific context, not for machine consumption across contexts.

Traditional approaches to this problem either require labeled training data (which does not exist when you are building the first dataset in a domain) or rely on similarity thresholds (which fail when high-similarity strings are functionally distinct and low-similarity strings are synonymous). CAAS is neither. It uses a language model as a constrained retrieval and parsing tool — not a knowledge source — and builds the validation methodology around the cost structure of the errors it produces.

2 Why AI as a Constrained Parser, Not a Generator

The distinction that defines CAAS is what the model is being asked to do.

An unconstrained language model asked for ingredient information about a product it cannot find will often return plausible-sounding ingredients inferred from the product category. Asked what floods occurred in Mumbai last Tuesday, it will approximate. This behaviour — helpfulness in the face of absence — is the default and it is catastrophic for data collection. A fabricated entry looks identical to a real one. It corrupts the dataset invisibly, without a flag, without a gap that signals something is wrong.

CAAS uses the model differently. The model is given a retrieval task with a defined source list, a structured output schema, and an explicit instruction: if the information is not present in the permitted sources, return a designated failure token. It is not asked to know. It is asked to fetch and parse, with explicit failure as a first-class output.

The practical implementation has three components. Temperature is set to 0, which makes the model select the highest-probability token at each step and, in principle, produce identical output for identical input; hosted APIs can still show small run-to-run variation, so repeatability should be spot-checked rather than assumed. Sources are whitelisted: the model searches only pre-specified domains in a defined priority order. Failure is standardised: DATA_NOT_FOUND (or its equivalent) is the required output when all sources are exhausted, not an approximation and not an empty string.

The result is a system with two modes: it found the data and returned it, or it did not find the data and said so. Both modes are informative. The first populates the dataset. The second marks a gap that can be addressed through additional collection or acknowledged as a limitation. Neither mode silently fabricates.
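
The two-mode contract can be sketched as a thin wrapper around the model call. This is a minimal illustration under stated assumptions: `fetch_and_parse` is a hypothetical callable standing in for the constrained, temperature-0 model request against one whitelisted source, and the record layout is invented for the example.

```python
from typing import Callable, Optional, Sequence

DATA_NOT_FOUND = "DATA_NOT_FOUND"


def constrained_extract(
    query: str,
    sources: Sequence[str],
    fetch_and_parse: Callable[[str, str], Optional[str]],
) -> dict:
    """Try whitelisted sources in priority order.

    Returns a FOUND record from the first source that yields data, or an
    explicit DATA_NOT_FOUND record once every source is exhausted.
    Nothing is inferred or approximated in between.
    """
    for source in sources:
        value = fetch_and_parse(source, query)
        if value:  # None or empty string counts as "not present in this source"
            return {"query": query, "status": "FOUND", "source": source, "value": value}
    return {"query": query, "status": DATA_NOT_FOUND, "source": None, "value": None}
```

Because an empty result is converted into the failure token rather than passed through, a downstream consumer never has to guess whether a blank field means "absent" or "not attempted".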

3 The Cost of Error Correction

The strongest argument for CAAS is not its precision. It is its error economics.

In traditional physical sampling — a blood test, a field survey, a clinical measurement — a wrong sample means repeating the physical act. The cost of error correction is the cost of the original collection: the clinician’s time, the travel, the reagent. This makes high accuracy a hard requirement before you can afford to act on the data.

In constrained AI-assisted sampling over existing textual data, a wrong extraction means a refetch. The source data already exists. The text is already on a server somewhere. Correcting an extraction error costs one additional API call and a human review of one record. The marginal cost is low.

This asymmetry changes what accuracy level is sufficient. A 99% accurate physical sample with 1% requiring full re-collection is a serious problem. A 99% accurate AI extraction with 1% requiring a refetch is, in most contexts, acceptable — provided the 1% is identifiable. The validation methodology in CAAS is designed to make errors identifiable: statistical sampling establishes a confidence interval on the error rate, iterative audit converges on systematic error patterns, and explicit failure tokens mark the known gaps.
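
As an illustration of the statistical sampling step, the sketch below computes a Wilson score interval for the error rate from a manually audited random sample. The choice of the Wilson interval is an assumption made here for the example; the framework text does not specify which interval was used.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an error proportion.

    errors: number of audited records found to be wrong
    n:      number of records audited
    z:      normal quantile (1.96 corresponds to roughly 95% confidence)
    """
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

Unlike the simple normal approximation, this interval stays inside [0, 1] and remains informative when the audit finds very few or zero errors, which is the expected case for a pipeline operating near 99% accuracy.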

The framework does not claim that AI extraction is as accurate as careful manual collection. It claims that for many fragmented textual spaces, constrained AI extraction at documented accuracy levels is more useful than no dataset, more honest than an approximated one, and more recoverable when wrong than a physical sampling error.

4 The Framework

CAAS is not a fixed pipeline. It is a set of decisions that any implementation in a fragmented textual space will need to make, with evidence from two implementations on what those decisions should be and why.

4.1 One Atomic Operation Per API Call

Passing a full document or a large batch to the model and asking it to extract everything produces degraded constraint adherence as the model’s attention distributes across multiple tasks simultaneously. In both implementations documented here, constraint violations — approximations instead of explicit failures, formatting inconsistencies, missed boundary cases — increased measurably as batch size grew beyond a threshold.

The solution is decomposition. Each API call handles one atomic operation: retrieve the ingredient list for this specific product, or extract the location and timing of this specific flood event from this specific article. The operation is defined narrowly enough that the model can apply the full constraint set reliably.

In the ingredient extraction implementation, the threshold was empirically established at 6 SKUs per batch. Batches above 10 showed measurable constraint violations. Below 6, quality was equivalent but throughput was lower than necessary. The optimal batch size is domain-specific and should be tested rather than assumed.

4.2 Explicit Failure Over Approximation

This decision is described in Section 2 and is the single most important constraint in the framework. The system instruction must be unambiguous: when data is absent from permitted sources, return the designated failure token. Do not infer. Do not approximate based on similar cases. Do not fill the gap.

In the ingredient extraction implementation, the system instruction read: “If ingredient list not found in whitelisted domains, return DATA_NOT_FOUND. DO NOT infer typical ingredients from product category. DO NOT approximate based on similar products.”

Of 1,000 products attempted, 104 returned persistent DATA_NOT_FOUND across two passes. These 104 were excluded from the corpus. The exclusion is a feature: those products either had no verifiable online ingredient list or were no longer in active distribution. The pipeline returned a clean gap rather than 104 fabricated entries that would have required expensive downstream correction.

In the flood extraction implementation, the equivalent constraint was classification: the model was required to distinguish between reports of actual past floods and articles discussing future warnings or policy — returning nothing for the latter rather than extracting a plausible but incorrect event record.
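The explicit-failure contract can be sketched as follows. This is a minimal illustration, not the implementation used in either case study: `lookup` is a hypothetical stand-in for the constrained model call, and the domain names and product data are invented. Only the `DATA_NOT_FOUND` token and the rule against inferring come from the text.

```python
from typing import Callable, Optional, Sequence

DATA_NOT_FOUND = "DATA_NOT_FOUND"  # failure token named in the text


def retrieve_ingredients(
    product: str,
    domains: Sequence[str],
    lookup: Callable[[str, str], Optional[str]],
) -> str:
    """Return the first ingredient list found, or the failure token.

    `lookup` is a hypothetical hook for the constrained retrieval call.
    There is deliberately no fallback branch that guesses from the
    product category or from similar products.
    """
    for domain in domains:  # permitted domains, tried in priority order
        text = lookup(product, domain)
        if text is not None and text.strip():
            return text.strip()
    return DATA_NOT_FOUND


# Toy stand-in for the real retrieval step (illustrative data only).
_pages = {("Product A", "brand.example"): "Wheat Flour, Salt"}


def fake_lookup(product: str, domain: str) -> Optional[str]:
    return _pages.get((product, domain))


domains = ["brand.example", "retailer.example"]
found = retrieve_ingredients("Product A", domains, fake_lookup)
missing = retrieve_ingredients("Product B", domains, fake_lookup)
```

The design point is the absence of an `else` path: when every permitted source is exhausted, the only possible return value is the failure token.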

4.3 Batch Size as a Quality Variable

Batch size interacts with constraint adherence in a consistent pattern across both implementations. This is not primarily a cost or speed consideration. It is a quality variable that should be calibrated empirically for each domain and each stage of the pipeline.

In artifact removal and semantic decomposition stages of ingredient processing, batch size was set inversely to string complexity: short strings in batches of 40, complex multi-bracket strings one at a time. The same principle applies in news extraction: article complexity and length affect how reliably the model applies its classification and extraction constraints.

Test a range before committing to a batch size. The optimal value is not predictable from first principles.
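One way to run that test is sketched below. The selection rule (largest batch size whose measured violation rate stays within a tolerance) is an assumption about how "best balance" might be operationalised, `violation_rate` is a hypothetical hook for a pilot run, and the pilot numbers are made up for illustration.

```python
from typing import Callable, Iterable


def calibrate_batch_size(
    candidates: Iterable[int],
    violation_rate: Callable[[int], float],
    tolerance: float,
) -> int:
    """Pick the largest candidate whose violation rate is within tolerance.

    `violation_rate(size)` is a hypothetical hook: it should run a pilot at
    that batch size and return the fraction of outputs that broke a
    constraint (approximation instead of failure token, bad formatting).
    """
    acceptable = [s for s in candidates if violation_rate(s) <= tolerance]
    if not acceptable:
        raise ValueError("no candidate batch size met the tolerance")
    return max(acceptable)


# Made-up pilot measurements, for illustration only (not the report's data).
pilot = {1: 0.00, 3: 0.00, 6: 0.00, 10: 0.01, 15: 0.06, 20: 0.12}
chosen = calibrate_batch_size(pilot, pilot.__getitem__, tolerance=0.0)
```

Choosing the largest acceptable size reflects the observation that smaller batches gave equivalent quality at lower throughput.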

4.4 Iterative Human-in-the-Loop Audit

Statistical validation establishes a confidence interval on the overall error rate. Iterative audit addresses systematic error patterns — categories of errors that recur and can be corrected in bulk.

The audit process runs as follows. A first model receives a sample of the extracted strings and identifies error types present. A second model receives the full extraction and flags instances of those specific error types. Human review resolves the flagged cases. Corrections are applied. The cycle repeats until the first model identifies no new error types.

In the ingredient extraction implementation, this converged in four iterations. The pattern across iterations was: 16.7% flagged in iteration 1, 7.1% in iteration 2, edge cases only in iteration 3, zero new error types in iteration 4. The edge cases in iteration 3 were boundary decisions — gluten classified as a grain or a protein, spirulina as an additive or a botanical — that required domain judgment rather than extraction correction. These were held for the classification framework stage, not resolved as cleaning errors.

Convergence does not mean zero errors. It means no new systematic error types are detectable. The residual error rate is quantified by the statistical sampling step.
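The audit cycle can be sketched as a loop with a convergence test. All three callables are hypothetical hooks (the first and second model and the human reviewer), and the toy error types below are invented; the sketch only shows the control flow described above.

```python
from typing import Callable, List, Set


def iterative_audit(
    corpus: List[str],
    find_error_types: Callable[[List[str]], Set[str]],
    flag_instances: Callable[[List[str], Set[str]], List[int]],
    human_fix: Callable[[str], str],
    max_rounds: int = 10,
) -> int:
    """Run audit rounds until no new error types appear; return rounds used."""
    seen: Set[str] = set()
    for round_no in range(1, max_rounds + 1):
        # First model: which error types are present (sampling, if any,
        # is assumed to happen inside the hook).
        new_types = find_error_types(corpus) - seen
        if not new_types:
            return round_no  # converged: no new systematic error types
        seen |= new_types
        # Second model flags instances; human review resolves each one.
        for i in flag_instances(corpus, new_types):
            corpus[i] = human_fix(corpus[i])
    raise RuntimeError("audit did not converge within max_rounds")


# Toy hooks with two invented error types.
def find_types(strings: List[str]) -> Set[str]:
    types = set()
    if any("%" in s for s in strings):
        types.add("percent")
    if any("  " in s for s in strings):
        types.add("double_space")
    return types


def flag(strings: List[str], types: Set[str]) -> List[int]:
    return [
        i for i, s in enumerate(strings)
        if ("percent" in types and "%" in s)
        or ("double_space" in types and "  " in s)
    ]


def fix(s: str) -> str:
    return " ".join(s.replace("%", "").split())


corpus = ["Salt 2%", "Wheat  Flour", "Sugar"]
rounds = iterative_audit(corpus, find_types, flag, fix)
```

Note that the loop stops on "no new types", not on "no remaining instances", which mirrors the distinction between convergence and zero errors.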

4.5 Statistical Validation with Finite Population Correction

Complete manual validation is not feasible at scale. Statistical sampling with a confidence interval is.

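For a population of size $N$, a critical value $z$ for the desired confidence level, an assumed proportion $p$, and a margin of error $e$, the required sample size $n$ with finite population correction is:

$$n_0 = \frac{z^2\,p(1-p)}{e^2}, \qquad n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}}$$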

Using the conservative $p = 0.5$ (maximum variance) together with the chosen confidence level and margin of error, a population of approximately 2,000 requires a sample of around 130. For the ingredient extraction corpus, 90 extractions from 896 were audited manually. One error was identified: the model merged content from two adjacent sections of a product page. The 95% confidence interval on the population error rate, with finite population correction applied, places the upper bound below 3.6%. Stated as accuracy: the audited sample is 98.9% accurate, and the lower bound on corpus accuracy at 95% confidence is above 96.4%.
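The arithmetic can be checked with a short sketch. Two things here are assumptions rather than statements from the source: the interval is computed with the normal-approximation (Wald) method, since the text does not say which method was used, and the inputs for the sample-size example (90% confidence, 7% margin) are illustrative values chosen because they reproduce a figure of about 130.

```python
import math


def sample_size_fpc(N: int, z: float, e: float, p: float = 0.5) -> int:
    """Cochran sample size with finite population correction, rounded up."""
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / N))


def error_rate_upper_bound(errors: int, n: int, N: int, z: float = 1.96) -> float:
    """Upper end of a Wald interval on the error rate, with FPC applied."""
    p_hat = errors / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    fpc = math.sqrt((N - n) / (N - 1))
    return p_hat + z * se * fpc


# Audit figures from the text: 1 error in 90 sampled out of 896.
upper = error_rate_upper_bound(errors=1, n=90, N=896)

# Illustrative inputs only (assumed, not stated in the source):
# 90% confidence (z = 1.645) and a 7% margin for a population of 2,000.
n_example = sample_size_fpc(N=2000, z=1.645, e=0.07)
```

Under these assumptions the upper bound falls below the 3.6% stated above. With a single observed error the Wald method is a rough approximation, and an exact or Wilson interval would give a somewhat different bound.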

Audit allocation should be risk-stratified: concentrate effort on high-risk subsets (very short strings that may be truncations, very long strings that may be insufficiently decomposed, low-confidence extractions) while maintaining a random component for unbiased population coverage.
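A minimal sketch of such an allocation follows. The length thresholds and the function name are illustrative assumptions, and the low-confidence stratum is omitted because the text does not say how extraction confidence was recorded.

```python
import random
from typing import List, Sequence


def stratified_audit_indices(
    strings: Sequence[str],
    n_risk: int,
    n_random: int,
    short_len: int = 4,   # assumed cut-off for "possible truncation"
    long_len: int = 80,   # assumed cut-off for "possibly under-decomposed"
    seed: int = 0,
) -> List[int]:
    """Pick audit indices: a high-risk stratum plus a random component."""
    rng = random.Random(seed)
    risky = [
        i for i, s in enumerate(strings)
        if len(s) <= short_len or len(s) >= long_len
    ]
    picked = rng.sample(risky, min(n_risk, len(risky)))
    # Random component is drawn from everything not yet picked, so every
    # string keeps a chance of being audited.
    taken = set(picked)
    pool = [i for i in range(len(strings)) if i not in taken]
    picked += rng.sample(pool, min(n_random, len(pool)))
    return sorted(picked)


strings = [
    "Salt", "INS", "Refined Wheat Flour", "Sugar",
    "x" * 90, "Palm Oil", "Onion Powder",
]
audit = stratified_audit_indices(strings, n_risk=2, n_random=2)
```

Because the high-risk stratum is oversampled, an error rate computed from the combined sample would need reweighting before it is read as a population estimate.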

5 Two Domains, Same Architecture

The primary evidence that CAAS generalises is not theoretical. It is that two independent implementations, in different domains, by different teams, working on different problems, arrived at the same architectural decisions.

5.1 Case Study 1: Indian Packaged Food Ingredient Vocabulary

The problem. No reference layer exists that maps the names Indian food labels use to shared ingredient identities. The same substance appears as maida, refined wheat flour, and all-purpose flour. Standard similarity measures would merge palm oil and palmite, which are functionally distinct, while missing the equivalence of besan flour and chana dal, which are the same ingredient in different language registers. No ground truth lexicon exists to train a supervised system against.

The implementation. 1,000 products were selected across 42 companies and 153 brands from verified Indian market listings. Ingredient lists were retrieved from whitelisted domains (brand official website, Amazon India, BigBasket, Blinkit) at temperature 0, with DATA_NOT_FOUND required when all sources were exhausted. Retrieved strings were parsed using a structure-aware algorithm that splits on commas only at nesting depth zero, preserving compound ingredient relationships. Each string then went through a single-purpose artifact removal pass (removing percentages and marketing text, preserving INS codes and preparation specifications) and a semantic decomposition pass with context propagation. The process ran at 6 SKUs per batch for retrieval and scaled inversely with string complexity for subsequent stages.

Results. 896 of 1,000 products extracted successfully (89.6%). 104 returned persistent DATA_NOT_FOUND. The sampling pipeline produced 1,987 unique variant strings. Combined with ingredient strings from OpenFoodFacts filtered to rows with a verifiable Indian product name and passed through the same pipeline, the final corpus after iterative audit is 2,291 unique ingredient variant strings. Audit of 90 extractions identified 1 error (1.1% of the audited sample). Full methodology is documented in R. (2026).

5.2 Case Study 2: Global Flash Flood Historical Record

The problem. Hydro-meteorological hazards like flash floods lack a standardised global observation infrastructure. Existing archives capture large, long-lasting events but miss localised and fast-moving floods. The Global Disaster Alert and Coordination System (GDACS) holds approximately 10,000 records, orders of magnitude fewer than what AI-based forecasting models require for training and validation. The historical record exists, embedded in news archives across 80 languages, but has never been extracted at scale.

The implementation. Google’s Groundsource framework analysed news reports where flooding was a primary subject, standardised text into English via translation, and used Gemini to apply three constrained extraction tasks: classification (distinguishing actual past flood events from articles about future warnings or policy), temporal reasoning (anchoring relative date references against publication dates), and spatial precision (mapping location references to standardised geographic polygons). The model was not asked to know where floods occurred. It was asked to read a specific article and extract specific structured fields, with explicit verification criteria for each field rather than open-ended generation (Mayo et al. 2026).

Results. 2.6 million historical flood events extracted, spanning more than 150 countries from 2000 to present. Manual review found 60% of extracted events accurate in both location and timing; 82% accurate enough for practical research use. Spatiotemporal matching against GDACS records for 2020–2026 shows Groundsource captured between 85% and 100% of severe events in that reference set, alongside large numbers of smaller localised events the reference set missed entirely.

5.3 What the Convergence Shows

Neither implementation was designed with the other in mind. The decisions they share — constrain the model’s role to retrieval and parsing, require explicit failure for absent data, calibrate batch size empirically, validate statistically — emerged independently from the same underlying problem: how to collect structured data from a space where the information exists but no ground truth organises it.

The table below shows the architectural correspondence.

Architectural decisions across two independent CAAS implementations.
Decision	Ingredient vocabulary	Flood record
Model role	Retrieval and parsing only	Classification, temporal anchoring, spatial extraction
Source constraint	Whitelisted domains in priority order	News reports where flooding is primary subject
Failure handling	`DATA_NOT_FOUND` token	Explicit classification criteria; non-flood articles return nothing
Batch calibration	6 SKUs per batch (empirical)	Per-article processing with complexity-aware handling
Validation	Statistical sampling + iterative audit	Manual review sample; spatiotemporal matching against reference archive
Accuracy result	98.9% at 95% confidence	82% practically useful; 85–100% severe event recall

The accuracy figures are not directly comparable — the domains define error differently, and the flood implementation targets a harder extraction problem (temporal and spatial reasoning from prose) than ingredient retrieval from structured label text. What is comparable is the architecture: the same three constraints, applied to the same class of problem, producing usable datasets in spaces where no dataset previously existed.

6 What This Does Not Guarantee

Temperature 0 reduces output variation but does not eliminate it. API version changes, infrastructure differences, and floating-point non-determinism across hardware can produce different outputs for identical inputs across sessions. The reproducibility guarantee is strong within a session and weaker across time. Any implementation should log the model version and API configuration used, and treat re-runs after infrastructure changes as requiring re-validation.

The framework does not remove the need for domain judgment. In the ingredient implementation, boundary cases — whether gluten belongs in grains or proteins, whether spirulina is an additive or a botanical — were not resolvable through cleaning. They required a classification framework with explicit criteria for how those categories are defined. CAAS reduces the volume of decisions that require human judgment. It does not eliminate the decisions themselves.

The error rates documented here are domain-specific. A 1.1% sample error rate for ingredient extraction from structured label text on retail websites is not a prediction for other domains. Text that is more ambiguous, sources that are less reliable, or extraction tasks that require more complex reasoning will produce higher error rates. The validation methodology applies regardless: establish the error rate empirically, state it with a confidence interval, and document what was done about systematic errors.

7 Where This Applies

CAAS is appropriate when four conditions hold simultaneously.

First, the target information exists in retrievable textual form. The framework cannot collect data that was never recorded. It can only structure data that exists but is unstructured.

Second, no authoritative reference organises the domain. If a canonical lexicon or schema exists, use it. CAAS is for when you are building the first one.

Third, domain-specific variation makes automated similarity measures unreliable. If standard fuzzy matching at reasonable thresholds produces acceptable results, that is simpler and should be preferred. CAAS is for when the variation patterns require something that can read context.

Fourth, the cost of error correction is low relative to the cost of not having the data. In safety-critical applications where downstream decisions are irreversible, the accuracy requirements may be higher than CAAS can reliably achieve without prohibitive validation cost. In research contexts where the dataset is a starting point for further analysis and errors are correctable, the asymmetry holds.

Both case studies satisfy all four conditions. The ingredient vocabulary space has no authoritative Indian lexicon, variation patterns that defeat similarity measures, and corrections that cost a refetch. The flood archive space has no global sensor network, event descriptions embedded in prose across 80 languages, and corrections that cost a re-extraction from an article that remains available.

7.1 Acknowledgements

My deepest gratitude to Mr. Krishna, whose constancy forms the foundation upon which all my work, including this, quietly rests. Salutations to the Goddess who dwells in all beings in the form of intelligence. I bow to her again and again.

This report was prepared as part of the Indian Food Informatics Data (IFID) project at the Interdisciplinary Systems Research Lab (iSRL).

7.2 Statements and Declarations

7.2.1 Funding Declaration

No funding was received to assist with the preparation of this manuscript.

7.2.2 Author Contribution

L.A.R. was responsible for all aspects of this report, including conceptualization, methodology, writing the original draft, and review and editing.

7.2.3 Competing Interests

The author declares no competing interests.

References

R., L. A. 2026. IFID Sampling Corpus — Placeholder, Fill with Zenodo DOI. Interdisciplinary Systems Research Lab (iSRL).

Mayo, Rotem, Moral Bootbool, and Oleg Zlydenko. 2026. “Groundsource: A Dataset of Flood Events from News.” March. https://doi.org/10.31223/X5RR2K.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@report{a_r2026,
  author = {A R, Lalitha},
  publisher = {iSRL},
  title = {Constrained {AI-Assisted} {Sampling} for {Fragmented}
    {Textual} {Spaces:} {A} {Framework} for {Data} {Collection} {Where}
    {No} {Ground} {Truth} {Exists}},
  number = {iSRL-26-04-M-CAAS},
  date = {2026-04-01},
  url = {https://isrl.in/pub/2026-04-m-caas/},
  doi = {10.5281/zenodo.[record-id]},
  langid = {en}
}

For attribution, please cite this work as:

A R, Lalitha. 2026. Constrained AI-Assisted Sampling for Fragmented Textual Spaces: A Framework for Data Collection Where No Ground Truth Exists. iSRL-26-04-M-CAAS. iSRL. https://doi.org/10.5281/zenodo.[record-id].

Data Acquisition and Ingredient Extraction: Building a Vocabulary of What India’s Packaged Food Labels Actually Say

Lalitha A R — Sun, 01 Mar 2026 00:00:00 GMT

1 The Question That Starts Everything

A computer cannot tell you whether rice is healthier than Maggi. Not because the comparison is philosophically difficult, but because the infrastructure required to answer it does not exist.

To answer the question, the system needs to know what is in both products. To know what is in them, it needs ingredient data. To use ingredient data, it needs to know that “maida” and “refined wheat flour” refer to the same thing — and that “palm oil” and “palmite” do not, even though automated similarity measures would score them close. To know that, it needs a stable reference layer that maps the names labels actually use to the identities they actually mean.

That reference layer does not exist for Indian packaged food. This report documents the first step toward building it: a collection of ingredient variant strings extracted from commercial Indian food labels, captured as they appear, without flattening the diversity that makes them what they are.

2 Why the Diversity Is Not the Problem

A label from a major Indian snack brand might read:

Seasoning Mix {Iodised Salt, Chilli Powder (1.1%), #Spices & Condiments, Onion, Maltodextrin, Wheat Flour, Milk Solids, Black Salt, Tomato Powder [Tomato Paste, Anticaking Agent (INS 551)], Refined Sugar, Hydrolyzed Vegetable Protein, Acidity Regulators (INS 296, INS 330, INS 334), Garlic, Anticaking Agent (INS 551), Flavour Enhancers (INS 627, INS 631)} And Iodised Salt.

This is not poorly formatted data. This is a brand communicating ingredient relationships to consumers across India’s 22 official languages and hundreds of regional contexts, within the structure FSSAI Labelling Rules 2020 require. The nested brackets encode functional relationships: “Acidity Regulators” governs three INS codes as a category. “Tomato Powder” contains both a base ingredient and an additive. A Tamil-speaking consumer and a Hindi-speaking consumer both need to read this label correctly. The formatting serves them.

The goal of this project is not to make that label simpler. It is to build the layer underneath it that makes it machine-queryable — without asking ITC, or any other brand, to change a word.

A substrate is the layer that makes other things possible to build. Concrete is a substrate: you do not live in concrete, you live in the building the concrete made possible. The substrate does not care what the building looks like. IFID — Indian Food Informatics Data — is being built as that layer for ingredient identity. Tamil names stay Tamil. INS codes stay in their FSSAI-specified format. The nested bracket structure a brand uses to communicate to its consumers stays exactly as designed. The substrate sits underneath and makes them interoperable: queryable as the same ingredient when that is what you need, distinguishable as different expressions when that matters.

Coordination without convergence. That is the specific goal.

3 The Wall, and Who Is Already Working on It

Everyone who works with Indian packaged food data hits the same wall from a different direction.

The nutritionist has fifty product samples and is spending half her time cleaning label data before she can begin her actual analysis. The e-commerce platform has the same ingredient listed seventeen different ways across seventeen brands and cannot build a consistent product catalogue. The compliance team is manually reconciling ingredient declarations across FSSAI requirements, retailer formats, and export documentation — separately, every time. The researcher who could build a tool to flag allergen risks has found there is no labelled dataset to train on.

None of these people are doing it wrong. The wall is not their failure. The wall is that no shared ingredient identity layer exists.

The most serious open effort to build one globally is OpenFoodFacts (OFF). OFF has documented food products across dozens of countries through crowdsourced contributions. The scale of that work is significant and the intent is the same as this project’s: make food data open, structured, and usable. The gap in Indian product coverage that this report documents is not a gap in OFF’s effort. It is a direct reflection of how fragmented and underdocumented the Indian packaged food space actually is — which is precisely what makes the problem worth working on, and precisely what makes collaboration across efforts like these necessary.

4 Two Sources, One Problem

Building the ingredient vocabulary required two separate collection strategies, for the same underlying reason: no single existing source reliably answers whether a product is a current, shelf-available Indian packaged food with a verifiable ingredient list.

4.1 Why OFF Could Not Be the Only Source

OFF contains thousands of English ingredient lists for products with Indian brand names. Those lists are valuable. But the dataset structure does not reliably distinguish a product currently on Indian supermarket shelves from an imported variant, an export formulation, or a historical listing no longer in distribution.

For the purpose of this corpus — documenting what Indian consumers actually encounter today — that distinction matters. An ingredient list attached to a product that is not in the Indian market does not reflect the vocabulary Indian food systems use.

The null rates in the OFF data confirm the scale of the gap. Of 19,748 rows in the raw export (captured 2 February 2026):

Only 4,104 pass a minimum filter: brand present, English product name present, English ingredient text present. That is 20.78 percent.
ingredients_text_en is the only ingredient column with coverage above 1 percent. All 29 other language columns combined add 69 rows to that count.
6,905 rows have both a brand identifier and an English product name — the product exists, it has a name — but no ingredient text in any language. The gap is specifically at the ingredient field.
The four core macronutrient fields (energy, fat, protein, carbohydrates) have null rates between 65.61 and 66.00 percent across the full dataset.

Field	Non-null	Null %
energy_value	6,792	65.61
fat_value	6,715	66.00
proteins_value	6,737	65.89
carbohydrates_value	6,759	65.77

These numbers are not a criticism. They are a measurement of the space. The gap in documented Indian food data is real, it is large, and it exists because the underlying ecosystem is genuinely fragmented — not because anyone has failed to document it well enough.
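The percentages above can be re-derived from the reported counts. The sketch below uses only figures stated in this section; the variable names are illustrative.

```python
TOTAL_ROWS = 19_748  # rows in the raw export, per the text

# Non-null counts as reported in the table above.
non_null = {
    "energy_value": 6_792,
    "fat_value": 6_715,
    "proteins_value": 6_737,
    "carbohydrates_value": 6_759,
}

# Null rate = share of rows where the field is missing, in percent.
null_pct = {
    field: round(100 * (1 - count / TOTAL_ROWS), 2)
    for field, count in non_null.items()
}

# Share of rows passing the minimum filter (brand, name, ingredient text).
pass_rate = round(100 * 4_104 / TOTAL_ROWS, 2)
```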

What OFF does have — the 4,104 rows with a brand, a product name, and an English ingredient list — is usable for this project. The product name provides the minimum anchor needed to verify the product is Indian. Those rows were taken through the same constrained parsing pipeline described below, and their ingredient strings added to the vocabulary set. The filter that kept them is described in the claims document.

4.2 Why Direct Sampling Was Necessary

For a reliable picture of what is currently on Indian shelves, the corpus needed to be collected directly. The methodology: select products from companies with documented Indian market presence, retrieve ingredient lists from verifiable online sources, extract and parse.

Company and product selection. Ten companies were selected based on market presence across major packaged food categories — snacks, beverages, staples, dairy, condiments. Within each company, selection moved through sub-brands (ITC’s portfolio spans Aashirvaad, Sunfeast, Bingo, YiPPee — each with a different ingredient vocabulary) and then to individual SKUs meeting four criteria:

Ingredient list traceable on a whitelisted domain
Product available in the Indian market, not an export or international variant
Specificity to a single SKU, not a product range — “Aashirvaad Turmeric Powder 200g” not “Aashirvaad Spices”
One representative retained per formulation across pack sizes

The third criterion produced the most rejections. References like “Cadbury Chocolates” or “Aashirvaad Spices” denote product families, not individual items with specific ingredient lists. Every such reference required disambiguation before it could enter the corpus.

After validation: 1,000 SKUs across 42 companies, 153 brands, 8 macro-categories.

5 Why Standard Automated Parsing Fails Here

Before describing what the pipeline does, it is worth being precise about why standard approaches do not work for this specific problem.

Palm oil and palmite are chemically and functionally distinct ingredients. An automated similarity measure — edit distance, embedding cosine similarity, fuzzy matching — would score them as near-identical. Acting on that score would silently corrupt the vocabulary.

Besan flour and chana dal are the same ingredient in different languages and forms. A similarity measure that does not carry cultural and linguistic knowledge would treat them as unrelated.

These are not edge cases. They are representative of how Indian food labelling works: regional names, transliterations, preparation-state variants, INS codes, acronyms (such as FOS or TBHQ), and British/American spelling variants all coexist on the same label, sometimes referring to the same thing and sometimes to things that are genuinely distinct. Standard clustering and normalisation algorithms cannot reliably navigate this space. The cost of a silent error (a wrong merge, a missed distinction) propagates forward into every analysis built on the vocabulary.
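The mismatch between surface similarity and ingredient identity is easy to reproduce. The sketch below uses Python's standard-library `difflib.SequenceMatcher` as one example of a character-level measure; it is not the evaluation performed for this report, and other measures (edit distance, embeddings) may score differently.

```python
from difflib import SequenceMatcher


def surface_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Distinct ingredients, per the text, with similar spellings.
distinct_pair = surface_similarity("palm oil", "palmite")

# The same ingredient, per the text, with unrelated spellings.
same_pair = surface_similarity("besan flour", "chana dal")
```

With this measure the distinct pair scores higher than the equivalent pair, so any threshold that merges the second would also merge the first.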

The approach used here trades throughput for verifiability: one atomic operation per API call, constrained to prevent approximation, with explicit failure when the data is not there.

6 The Extraction Pipeline

6.1 Constrained Retrieval

The model retrieved ingredient lists from whitelisted domains only, in priority order: brand official website, then Amazon India, BigBasket, Blinkit. If no source returned the ingredient list, the output was DATA_NOT_FOUND. The instruction was explicit: do not infer typical ingredients from product category, do not approximate based on similar products.

Temperature was set to 0. This means the model selects the highest-probability token at each step, so identical input should produce identical output as long as the model version and serving infrastructure are unchanged. The practical effect: if you run the same extraction twice, you should get the same result. Validation becomes tractable. Fabrication through sampling variation is largely eliminated.

Batch size was tested across 1 to 20 SKUs per call. At batch sizes above 10, constraint violations increased measurably — the model began returning approximations instead of DATA_NOT_FOUND for products it could not find, and formatting inconsistencies appeared. Six SKUs per batch produced the best balance of throughput and constraint adherence.

Results across 1,000 SKUs:

First pass: 871 successful extractions (87.1%), 129 DATA_NOT_FOUND
Second pass on the 129 failures: 25 additional extractions, 104 persistent failures
Final corpus: 896 extracted (89.6%), 104 excluded

The 104 persistent failures validate that the constraint held. Those products either had no verifiable online ingredient list or were no longer in active distribution. The pipeline returned an explicit gap rather than a filled approximation. An explicit gap can be addressed later. A fabricated entry corrupts the vocabulary invisibly.
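The two-pass accounting can be sketched as a retry loop that revisits only the failures. `extract` is a hypothetical hook for the constrained retrieval call, and the toy behaviour below is invented to show one SKU recovering on the second pass and one failing persistently.

```python
from typing import Callable, Dict, Iterable, List, Tuple

DATA_NOT_FOUND = "DATA_NOT_FOUND"  # failure token named in the text


def run_passes(
    skus: Iterable[str],
    extract: Callable[[str], str],
    passes: int = 2,
) -> Tuple[Dict[str, str], List[str]]:
    """Retry only the failures; return (extracted, persistent_failures)."""
    extracted: Dict[str, str] = {}
    pending = list(skus)
    for _ in range(passes):
        still_missing: List[str] = []
        for sku in pending:
            result = extract(sku)
            if result == DATA_NOT_FOUND:
                still_missing.append(sku)  # explicit gap, never a guess
            else:
                extracted[sku] = result
        pending = still_missing
    return extracted, pending


# Toy extractor: "A" succeeds at once, "B" on the second try, "C" never.
attempts: Dict[str, int] = {}


def toy_extract(sku: str) -> str:
    attempts[sku] = attempts.get(sku, 0) + 1
    if sku == "A" or (sku == "B" and attempts[sku] == 2):
        return f"ingredients of {sku}"
    return DATA_NOT_FOUND


extracted, failures = run_passes(["A", "B", "C"], toy_extract)
```

The persistent failures come back as a plain list, which is what makes the gap addressable later.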

Manual audit of 90 extractions from the 896: 1 error identified (the model merged content from two adjacent sections of a product page). Sample error rate: 1 in 90 (1.1 percent).

6.2 Structure-Preserving Parsing

The 896 extracted ingredient lists were not fed to the pipeline as whole strings. Each list went through parsing as a discrete operation, because the structure of Indian food labels encodes relationships that naive splitting destroys.

Consider what happens when a comma-splitter treats every comma equally:

Input: Acidity Regulators (INS 296, INS 330, INS 334)

Naive output:

Acidity Regulators (INS 296
INS 330
INS 334)

The functional context — that INS 296, 330, and 334 are all acidity regulators — is gone. The fragments INS 330 and INS 334) have no meaning without it.

The structure-aware parser tracks nesting depth. It splits on commas only at depth zero — the root level. Everything inside brackets is treated as a unit until the brackets close. Applied to the same input:

Structure-aware output: Acidity Regulators (INS 296, INS 330, INS 334) — intact, ready for decomposition with context preserved.
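A depth-tracking splitter of this kind can be sketched in a few lines. This is an illustrative reconstruction from the description above, not the project's parser; the function name and the tolerance for stray closing brackets are assumptions.

```python
from typing import List

OPENERS = "([{"
CLOSERS = ")]}"


def split_top_level(text: str) -> List[str]:
    """Split on commas only when no bracket of any kind is open."""
    parts: List[str] = []
    current: List[str] = []
    depth = 0
    for ch in text:
        if ch in OPENERS:
            depth += 1
        elif ch in CLOSERS:
            depth = max(depth - 1, 0)  # tolerate a stray closing bracket
        if ch == "," and depth == 0:
            piece = "".join(current).strip()
            if piece:
                parts.append(piece)
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


label = (
    "Garlic, Acidity Regulators (INS 296, INS 330, INS 334), "
    "Tomato Powder [Tomato Paste, Anticaking Agent (INS 551)]"
)
parts = split_top_level(label)
```

The sketch counts all bracket types on one depth counter and does not check that opening and closing brackets match in kind, which is enough for splitting but not for validation.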

896 ingredient lists → 2,926 parsed strings with functional relationships intact.

6.3 Artifact Removal

Each of the 2,926 strings went through a single-purpose cleaning pass: remove presentation artifacts, preserve identity information.

Removed: percentage values (55.7% — quantity, not identity), marketing text (BINGO!, NEW!), usage annotations (#Used As Natural Flavouring Agent).

Preserved: INS codes and E-numbers (regulatory identifiers), preparation specifications (Salt (Iodised) — the bracketed term distinguishes a specific variety), functional classifications (Acidity Regulator, Emulsifier).

The distinction matters because it is not always obvious. 55.7% is presentation — removing it loses nothing about what the ingredient is. (Iodised) is identity — removing it collapses iodised salt and table salt into the same entry, which is wrong.

Batch sizes for this stage were set inversely to string complexity: short strings processed in batches of 40, complex multi-bracket strings processed one at a time. Attention dilution at scale produces the same constraint violations as in the retrieval stage.
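The presentation-versus-identity distinction can be illustrated with a rule-based approximation. The pass described above was a constrained model call, so the regular expression below is a stand-in that covers percentages only; marketing text and usage annotations are not handled, and the function name is hypothetical.

```python
import re

# A percentage, optionally wrapped in round brackets:
# "(1.1%)", "55.7%", "( 12 % )".
PERCENT = re.compile(r"\(\s*\d+(?:\.\d+)?\s*%\s*\)|\d+(?:\.\d+)?\s*%")


def strip_percentages(s: str) -> str:
    """Drop quantity annotations; leave every other bracketed term alone."""
    cleaned = PERCENT.sub("", s)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


examples = [
    "Chilli Powder (1.1%)",        # quantity: removed
    "Potato 55.7%",                # quantity: removed
    "Salt (Iodised)",              # identity: preserved
    "Anticaking Agent (INS 551)",  # regulatory identifier: preserved
]
cleaned = [strip_percentages(s) for s in examples]
```

The pattern only fires when a `%` sign is present, which is why INS codes and bracketed specifications pass through untouched. It is meant for strings that have already been split at the root level.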

6.4 Semantic Decomposition

After cleaning, compound structures were decomposed with context propagation. Each atomic operation took one compound and returned its components, with the functional classification carried forward to each:

Input: Flavour Enhancers (INS 627, INS 631)
Output: Flavour Enhancer INS 627, Flavour Enhancer INS 631

Input: Stabilizing & Emulsifying Agents (412, 410, 407, 471, 466)
Output: Stabilizer INS 412, Stabilizer INS 410, Stabilizer INS 407, Emulsifier INS 471, Stabilizer INS 466

Input: Black Pepper Powder, Ginger Powder, Clove Powder
Output: unchanged — already atomic
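Context propagation for the single-class case can be sketched as below. This is a simplified illustration: it handles one functional class governing several bracketed items, uses a naive rule to singularise the class name, and does not attempt the mixed case (stabilisers and emulsifiers in one bracket), which needs knowledge of what each code does.

```python
import re
from typing import List

# "<class> (<item>, <item>, ...)" with exactly one bracket group.
COMPOUND = re.compile(r"^([^()]+)\(([^()]+)\)$")


def decompose(s: str) -> List[str]:
    """Carry a single functional class forward to each bracketed item."""
    s = s.strip()
    m = COMPOUND.match(s)
    if not m:
        return [s]  # no single bracket group: treat as atomic
    cls = m.group(1).strip()
    items = [item.strip() for item in m.group(2).split(",")]
    if len(items) < 2:
        return [s]  # "Salt (Iodised)" style specification: keep whole
    if cls.endswith("s"):
        cls = cls[:-1]  # naive singularisation: "Enhancers" -> "Enhancer"
    return [f"{cls} {item}" for item in items]


enhancers = decompose("Flavour Enhancers (INS 627, INS 631)")
salt = decompose("Salt (Iodised)")
pepper = decompose("Black Pepper Powder")
```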

2,926 cleaned strings → 3,452 decomposed ingredient mentions → 1,987 unique variants after deduplication across all 896 products from the sampling pipeline.

The full transformation for one product (Bingo Original Style, ITC Ltd.) produced 21 ingredient mentions, including:

‘Black Salt’, ‘Chilli’, ‘Citric Acid (INS 330)’, ‘Disodium Guanylate (INS 627)’, ‘Disodium Inosinate (INS 631)’, ‘Garlic’, ‘Hydrolyzed Vegetable Protein’, ‘Maida’, ‘Malic Acid (INS 296)’, ‘Maltodextrin’, ‘Milk Solids’, ‘Onion’, ‘Palm Oil’, ‘Potato’, ‘Salt’, ‘Silicon Dioxide (INS 551)’, ‘Spices and Condiments’, ‘Sugar’, ‘Tartaric Acid (INS 334)’, ‘Tomato’

7 Combining the Two Sources

The sampling pipeline produced 1,987 unique variant strings from 896 directly collected products. The OFF pipeline — 4,104 rows filtered to those with a verifiable product name, passed through the same constrained parsing stages — contributed an additional set of ingredient strings from a different cross-section of the label space.

Combined and deduplicated across both sources, then cleaned through multiple iterative audit rounds (documented in Appendix A), the final corpus contains 2,291 unique ingredient variant strings.

These are not errors to correct or synonyms to collapse. They are documentation of how ingredient identity is expressed across Indian commercial food labels. The same ingredient appears in multiple forms because Indian food labelling reflects genuine linguistic and cultural diversity:

Chilli / Chili / Chillies — orthographic variants, all in use
Maida / Refined Wheat Flour / All-Purpose Flour — the same ingredient across language registers
Onion Powder / Dried Onion / Dehydrated Onion — preparation-state variants
INS 330 / Citric Acid / Acidity Regulator INS 330 — the same compound at different levels of regulatory specificity
Iodised Salt / Salt (Iodised) — formatting alternatives for the same distinction

Each of these variants appears on labels that consumers read, regulators review, and supply chains track. The infrastructure this project is building needs to work with all of them — not by picking one as canonical and discarding the rest, but by organising them so that a query for any one returns the right set.
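The shape of such an organisation can be sketched as a reverse index in which no variant is canonical. The groupings below are taken from the examples in this section; the group keys and function name are hypothetical, and the actual mapping is a later step of the project.

```python
from typing import Dict, FrozenSet, List

# Variant groups from the examples above; the keys are invented labels.
GROUPS: Dict[str, List[str]] = {
    "grp:refined-wheat-flour": [
        "Maida", "Refined Wheat Flour", "All-Purpose Flour",
    ],
    "grp:chilli": ["Chilli", "Chili", "Chillies"],
}

# Reverse index: every variant points at its group, none is preferred.
INDEX: Dict[str, str] = {
    variant.casefold(): key
    for key, variants in GROUPS.items()
    for variant in variants
}


def variants_of(term: str) -> FrozenSet[str]:
    """Return all recorded expressions that share a group with `term`."""
    key = INDEX.get(term.casefold())
    return frozenset(GROUPS[key]) if key else frozenset()


flour = variants_of("maida")
unknown = variants_of("Palmite")
```

Each variant is stored exactly as it appeared on a label; only the lookup is case-insensitive. A term with no recorded group returns an empty set rather than a nearest match.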

The Tamil name on a label stays Tamil. The INS code stays in its FSSAI format. The regional cultivar name stays as the brand printed it. The substrate underneath makes them queryable as the same ingredient when that is what the question requires.

8 What This Corpus Makes Possible

The output of this report is an open dataset:

Ingredient variant strings extracted from OFF data, filtered to rows with a verifiable Indian product name, cleaned through the same pipeline¹

¹ Release of 896 SKUs with verified ingredient lists is withheld to adhere to the stakeholder protection principles as discussed in iSRL-26-XX-G-Protection: Data Governance Principles — Protecting Every Stakeholder in the IFID Ecosystem #20 and iSRL-26-XX-G-Access: Access Architecture — Tiered Data Access for the IFID API #21.

Combined: a documented vocabulary of 2,291 unique ingredient expressions from Indian packaged food labels, with extraction methodology, constraint architecture, and quality validation documented in full.

The next question the corpus raises is: which of these 2,291 variants refer to the same ingredient, and by what logic? Maida and Refined Wheat Flour are the same substance. Palm Oil and Palmite are not, despite surface similarity. Besan flour and chana dal are related but distinct in preparation state. Answering that question requires a classification framework capable of handling identity, equivalence, and subset relationships across a space where standard similarity measures are unreliable.

That framework — the EMF Model (Energy, Matter, Function) — is defined in A R (2026). Further progress on the mapping problem is deferred to future reports.

9 Claims and Verification

All numerical claims in this report are independently verifiable against the source datasets. The full claims list with evidence per claim is available at

9.1 Claims

ID	Claim
OFF.C.01	Of 19,748 rows in the raw OpenFoodFacts export, 4,104 pass the minimum filter (brand, product name in English, ingredient text in English), a pass rate of 20.78 percent.
OFF.C.02	ingredients_text_en is the only ingredient column with coverage above 1 percent. It has 4,592 non-null rows (23.25 percent). All 29 other language columns combined add 69 additional rows.
OFF.C.03	6,905 rows have both a brand identifier and an English product name but no ingredient text in any language. The data gap is at the ingredient field, not at product identity.
OFF.C.04	The four core macronutrient fields have null rates between 65.61 and 66.00 percent across all 19,748 rows: energy_value 65.61 percent (6,792 non-null), fat_value 66.00 percent (6,715 non-null), proteins_value 65.89 percent (6,737 non-null), carbohydrates_value 65.77 percent (6,759 non-null).
OFF.C.05	The three Hindi language columns have the following non-null counts across 19,748 rows: product_name_hi 111, ingredients_text_hi 11, generic_name_hi 2.
OFF.C.06	Replacing product_name_en OR generic_name_en with product_name_en alone as a filter condition reduces the output from 4,105 rows to 4,104 rows. generic_name_en contributes one unique row.
OFF.C.07	The raw dataset has 486 columns. The filtered dataset retains 4 columns: product_name_en, brands, brands_tags, and ingredients_text_en.
SAMP.C.01	The sampling corpus spans 42 companies, 153 consumer-facing brands, and 896 SKUs across 8 product macro-categories and 30 sub-categories.
SAMP.C.02	The five highest-SKU companies — Tata Consumer Products (104), Amul / GCMMF (82), Haldiram’s (68), Hindustan Unilever (67), and ITC Ltd (65) — account for 386 SKUs, or 43.1 percent of the 896-SKU corpus.
SAMP.C.03	SKU distribution across eight macro-categories derived from top-3 category fields per company: beverages (200), sweets and desserts (176), staples and spices (174), ready to eat and ready to cook (100), snacks and namkeen (67), pantry and condiments (47), health and wellness (43), dairy and breakfast (30). These sum to 837 of 896 total SKUs; the remaining 59 fall into sub-categories not captured in the top-3 field per company.
SAMP.C.04	Of 1,000 SKUs submitted for extraction, 871 returned successful ingredient lists on first pass (87.1 percent). A second-pass retry on the 129 failures yielded 25 additional extractions (2.5 percent). Final corpus: 896 successful extractions (89.6 percent). 104 SKUs returned DATA_NOT_FOUND across both passes and are excluded.
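SAMP.C.05	Manual audit of 90 extractions from the 896-SKU corpus identified 1 hallucination instance. Observed rate in the audited sample: 1 in 90 (1.11 percent). Expressed against the full corpus, this is 1 known instance in 896 SKUs (0.11 percent), which is a lower bound because the remaining 806 extractions were not audited.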
SAMP.C.06	Four SKU validation criteria were applied before extraction: (1) ingredient list traceable within whitelisted domains; (2) product available in the Indian market, not an export or international variant; (3) specificity to a single SKU, not a product range; (4) one representative retained per formulation across pack sizes.
SAMP.C.07	Extraction operated under five constraints: (1) temperature = 0; (2) domain whitelist: brand official website, Amazon India, BigBasket, Blinkit, in priority order; (3) DATA_NOT_FOUND returned when all sources exhausted; (4) JSON output schema enforced with four required fields (product_name, ingredient_list, source_url, confidence); (5) brand official website given precedence over retailer listings on conflict.
SAMP.C.08	Batch sizes from 1 to 20 SKUs per API call were tested. 6 SKUs per batch was identified as optimal. Batches exceeding 10 SKUs produced measurably increased constraint violations including inappropriate DATA_NOT_FOUND omissions and formatting inconsistencies.

9.2 Evidence Per Claim

9.2.1 OFF.C.01

Raw row count: 19,748. Filter applied: brands OR brands_tags non-empty, AND product_name_en non-empty, AND ingredients_text_en non-empty. Rows passing all three conditions: 4,104. Pass rate: 20.78 percent. Rows removed: 15,644 (79.22 percent).
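The filter and the pass-rate arithmetic can be restated as a short sketch. It assumes each row is a plain dictionary keyed by OpenFoodFacts column name and that "non-empty" means neither missing nor whitespace-only. The source does not state how empty values were detected, so that rule and the helper names are illustrative.

```python
def non_empty(row: dict, column: str) -> bool:
    """Assumed emptiness rule: missing, None, or whitespace-only counts as empty."""
    value = row.get(column)
    return value is not None and str(value).strip() != ""


def passes_minimum_filter(row: dict) -> bool:
    """(brands OR brands_tags) AND product_name_en AND ingredients_text_en."""
    has_brand = non_empty(row, "brands") or non_empty(row, "brands_tags")
    return (
        has_brand
        and non_empty(row, "product_name_en")
        and non_empty(row, "ingredients_text_en")
    )


# Counts reported in OFF.C.01.
RAW_ROWS = 19_748
PASSING_ROWS = 4_104

removed_rows = RAW_ROWS - PASSING_ROWS
pass_rate = round(100 * PASSING_ROWS / RAW_ROWS, 2)
removed_rate = round(100 * removed_rows / RAW_ROWS, 2)

print(removed_rows, pass_rate, removed_rate)
```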

9.2.2 OFF.C.02

Column	Non-null rows
ingredients_text_en	4,592
ingredients_text_fr	94
ingredients_text_de	15
ingredients_text_hi	11
Rows added by all 29 non-English columns combined (de-duplicated against English)	69

Pooling all 30 ingredient language columns yields 4,661 rows with any ingredient text, against 4,592 for English alone.

9.2.3 OFF.C.03

Rows passing (brands OR brands_tags) AND product_name_en: 11,009. Of these, rows also passing ingredients_text_en: 4,104. Rows with brand and name but no ingredient text: 6,905.

9.2.4 OFF.C.04

Field	Null	Non-null	Null %
energy_value	12,956	6,792	65.61
fat_value	13,033	6,715	66.00
proteins_value	13,011	6,737	65.89
carbohydrates_value	12,989	6,759	65.77

Computed on the full 19,748-row dataset.
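The null counts and percentages in the table follow directly from the non-null counts and the 19,748-row total. A minimal check, using only figures stated in this report (the function name is illustrative):

```python
TOTAL_ROWS = 19_748

# Non-null counts as reported in OFF.C.04.
NON_NULL = {
    "energy_value": 6_792,
    "fat_value": 6_715,
    "proteins_value": 6_737,
    "carbohydrates_value": 6_759,
}


def null_stats(non_null: int, total: int = TOTAL_ROWS) -> tuple[int, float]:
    """Return (null count, null percentage rounded to two decimals)."""
    null = total - non_null
    return null, round(100 * null / total, 2)


for field, count in NON_NULL.items():
    null, pct = null_stats(count)
    print(f"{field}\t{null}\t{count}\t{pct:.2f}")
```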

9.2.5 OFF.C.05

Column	Non-null	Null	Null %
product_name_hi	111	19,637	99.44
ingredients_text_hi	11	19,737	99.94
generic_name_hi	2	19,746	99.99

Computed on the full 19,748-row dataset.

9.2.6 OFF.C.06

Filter	Condition	Result
Filter A	(brands OR brands_tags) AND (product_name_en OR generic_name_en) AND ingredients_text_en	4,105 rows
Filter B	(brands OR brands_tags) AND product_name_en AND ingredients_text_en	4,104 rows

Difference: 1 row. That row had generic_name_en populated and product_name_en empty. In the 4,105-row set, generic_name_en has 451 non-null values (10.99 percent non-null, 89.01 percent null).
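To see why the two filters can differ only on rows where generic_name_en is populated and product_name_en is empty, the comparison can be sketched on toy data. The three rows below are invented for illustration and are not taken from the dataset. The helper names and the emptiness rule (None or whitespace-only) are likewise assumptions.

```python
def filled(row: dict, column: str) -> bool:
    """Assumed emptiness rule: missing, None, or whitespace-only is empty."""
    value = row.get(column)
    return value is not None and str(value).strip() != ""


def filter_a(row: dict) -> bool:
    """Brand AND (product_name_en OR generic_name_en) AND ingredients_text_en."""
    has_brand = filled(row, "brands") or filled(row, "brands_tags")
    has_name = filled(row, "product_name_en") or filled(row, "generic_name_en")
    return has_brand and has_name and filled(row, "ingredients_text_en")


def filter_b(row: dict) -> bool:
    """Brand AND product_name_en AND ingredients_text_en."""
    has_brand = filled(row, "brands") or filled(row, "brands_tags")
    return (
        has_brand
        and filled(row, "product_name_en")
        and filled(row, "ingredients_text_en")
    )


# Invented rows: only row 2 relies on generic_name_en for its name.
toy_rows = [
    {"id": 1, "brands": "Brand X", "product_name_en": "Mango Pickle",
     "ingredients_text_en": "Mango, salt, oil"},
    {"id": 2, "brands_tags": "brand-y", "generic_name_en": "Spiced tea",
     "ingredients_text_en": "Tea, cardamom"},
    {"id": 3, "brands": "Brand Z", "product_name_en": "Rusk"},
]

only_in_a = [row["id"] for row in toy_rows if filter_a(row) and not filter_b(row)]

# Share of generic_name_en in the 4,105-row set, from the reported counts.
generic_share = round(100 * 451 / 4_105, 2)

print(only_in_a, generic_share)
```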

9.2.7 OFF.C.07

Raw column count: 486. Columns retained after filter: product_name_en, brands, brands_tags, ingredients_text_en. Column count in working dataset: 4. The 482 removed columns include all non-English name and ingredient variants, all nutrient sub-fields, environmental scores, packaging fields, and contributor metadata.

9.2.8 SAMP.C.01

Roster file header: Total SKUs: 896 | Brands: 153 | Companies: 42 | Parent cats: 8 | Sub-cats: 30.

9.2.9 SAMP.C.02

Company	SKUs
Tata Consumer Products	104
Amul / GCMMF	82
Haldiram’s	68
Hindustan Unilever	67
ITC Ltd	65
Total (top 5)	386

386 / 896 = 43.1 percent of corpus.

9.2.10 SAMP.C.03

Summed from top-3 category fields across all 42 company rows in roster. Sum: 837. Difference from 896: 59 SKUs assigned to sub-categories below each company’s top three.
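The reconciliation is plain arithmetic over the category counts listed in SAMP.C.03. A minimal check, using only figures stated in this report (variable names are illustrative):

```python
TOTAL_SKUS = 896

# SKU counts per macro-category, as listed in SAMP.C.03.
CATEGORY_SKUS = {
    "beverages": 200,
    "sweets and desserts": 176,
    "staples and spices": 174,
    "ready to eat and ready to cook": 100,
    "snacks and namkeen": 67,
    "pantry and condiments": 47,
    "health and wellness": 43,
    "dairy and breakfast": 30,
}

covered = sum(CATEGORY_SKUS.values())
outside_top_three = TOTAL_SKUS - covered

print(covered, outside_top_three)
```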

9.2.11 SAMP.C.04

First pass: 871 extracted, 129 DATA_NOT_FOUND. Second pass on 129: 25 additional, 104 persistent DATA_NOT_FOUND. Total extracted: 896. Total excluded: 104. Pass rate: 896 / 1000 = 89.6 percent.
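The two-pass flow can be sketched as follows. The `extract` callable stands in for whatever performs a single extraction. Its name, its signature, and the use of a bare `DATA_NOT_FOUND` string as the failure signal are assumptions for illustration, not the project's actual script.

```python
from typing import Callable, Iterable

DATA_NOT_FOUND = "DATA_NOT_FOUND"


def run_two_passes(
    skus: Iterable[str],
    extract: Callable[[str], str],
) -> tuple[dict[str, str], list[str]]:
    """Attempt every SKU once, retry the failures once, and never guess."""
    extracted: dict[str, str] = {}
    first_pass_failures: list[str] = []
    for sku in skus:
        result = extract(sku)
        if result == DATA_NOT_FOUND:
            first_pass_failures.append(sku)
        else:
            extracted[sku] = result

    excluded: list[str] = []
    for sku in first_pass_failures:
        result = extract(sku)
        if result == DATA_NOT_FOUND:
            excluded.append(sku)  # persistent failure: left out of the corpus
        else:
            extracted[sku] = result
    return extracted, excluded


# Reported counts: 871 on the first pass, 25 recovered on retry, out of 1,000.
final_corpus = 871 + 25
persistent_failures = 1_000 - final_corpus
```

A SKU that fails both passes is returned in the excluded list rather than filled in by approximation, which mirrors the explicit-failure behaviour described in this report.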

9.2.12 SAMP.C.05

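Audit sample: 90 SKUs. Errors found: 1 (model merged content from multiple webpage sections). Rate within the audited sample: 1 / 90 = 0.0111. Against the full corpus: 1 / 896 = 0.0011, a lower bound, since the remaining 806 extractions were not audited.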

9.2.13 SAMP.C.06

Four criteria applied at SKU selection stage. Documented rejection categories: product range references requiring disambiguation to individual SKU (e.g., “Aashirvaad Spices” to “Aashirvaad Turmeric Powder 200g”); products not in Indian market distribution.

9.2.14 SAMP.C.07

Five constraints applied uniformly to all 1,000 attempted SKUs. System instruction for DATA_NOT_FOUND: “If ingredient list not found in whitelisted domains, return DATA_NOT_FOUND. DO NOT infer typical ingredients from product category. DO NOT approximate based on similar products.”
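Constraints (3) and (4) together imply that a response is either an explicit `DATA_NOT_FOUND` or a JSON object carrying all four required fields. A minimal validation sketch under that reading is given below. The function name is hypothetical, the source does not specify field types or how the sentinel is transported, and the check here covers field presence only.

```python
import json
from typing import Optional

DATA_NOT_FOUND = "DATA_NOT_FOUND"
REQUIRED_FIELDS = ("product_name", "ingredient_list", "source_url", "confidence")


def parse_extraction(raw: str) -> Optional[dict]:
    """Return the record, or None for an explicit DATA_NOT_FOUND.

    Raises ValueError when the response is neither the sentinel nor a JSON
    object carrying all four required fields.
    """
    text = raw.strip()
    if text == DATA_NOT_FOUND:  # assumed: sentinel arrives as a bare string
        return None
    try:
        record = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(record, dict):
        raise ValueError("response must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return record
```

Rejecting an incomplete record outright, rather than filling defaults, keeps the failure visible in the same way the DATA_NOT_FOUND instruction does.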

9.2.15 SAMP.C.08

Batch sizes 1–20 tested during extraction development. Optimal: 6 SKUs per batch. Violations at >10 SKUs: DATA_NOT_FOUND omissions and formatting inconsistencies.
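Splitting a SKU list into batches of six is simple to sketch. The function name and the placeholder SKU identifiers below are illustrative. Only the batch size of 6 and the 1,000-SKU total come from this report.

```python
def make_batches(skus: list[str], batch_size: int = 6) -> list[list[str]]:
    """Split SKUs into consecutive batches; the last batch may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [skus[i : i + batch_size] for i in range(0, len(skus), batch_size)]


sku_ids = [f"SKU-{n:04d}" for n in range(1, 1001)]  # placeholder identifiers
batches = make_batches(sku_ids)

print(len(batches), len(batches[-1]))
```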

Appendix A: Sample Cleaning Rounds

The iterative cleaning process that produced the final 2,291 variant set from the combined corpus was not logged round by round. This was by design. The changes involved were too granular and numerous to document individually without the log itself becoming unmanageable: correcting a transliteration typo, removing a fragment like dried-powder that parsed as an ingredient but was a formatting artifact, deciding whether monohydrate belonged in the corpus at all.

What follows is a representative excerpt from the audit scripts used during this process. It shows what the review actually looked like: automated flagging, human decision at each boundary case, iterative convergence toward a clean set.

One audit pass — executive summary:

=============================================
      AI AUDIT EXECUTIVE SUMMARY
=============================================
Total Entries Audited : 709

APPROVED               :  623 (87.9%)
MODIFIED               :   51 (7.2%)
INVALID                :   35 (4.9%)
=============================================

Entries flagged as INVALID — strings that were not ingredients:

atlantic · center-filling · cfu · chips · compound · dessert
dried-powder · dry · energy · flakes · food-additives · lubrication
moisture · monohydrate · mononitrate · only · pizza · plant-base
powder-mix · preservative · protein · savouries · test · toppings
vegetable · vegetable-mix · ...

Interactive kill review — the monohydrate decision:

The boundary cases required a human in the loop. monohydrate and mononitrate are not ingredients — they are suffixes that appear on ingredient labels (as in thiamine mononitrate) but carry no identity when extracted alone. They were saved on first pass, then removed on second review:

KILL 'monohydrate'? (y/n): n
Saving 'monohydrate'...

 Surgery Complete. Your files are now 'Steel'. 

[second pass]

KILL 'monohydrate'? (y/n): y
Executing 'monohydrate'...

 Surgery Complete. Your files are now 'Steel'.
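A review loop of this shape can be sketched in a few lines. This is not the project's audit script: the function name, the return structure, and the rule that anything other than "y" keeps the term are assumptions. Only the prompt text follows the transcript above.

```python
from typing import Callable, Iterable


def kill_review(
    candidates: Iterable[str],
    ask: Callable[[str], str] = input,
) -> tuple[list[str], list[str]]:
    """Ask a human about each flagged term; return (kept, removed)."""
    kept: list[str] = []
    removed: list[str] = []
    for term in candidates:
        answer = ask(f"KILL '{term}'? (y/n): ").strip().lower()
        if answer == "y":
            removed.append(term)
        else:
            kept.append(term)  # default to keeping when the answer is not "y"
    return kept, removed
```

Passing `ask` as a parameter lets the same loop run interactively (the default, `input`) or replay a recorded sequence of decisions, which is one way a second review pass over the kept terms could be made repeatable.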

Reclassification pass — where the judgments were not straightforward:

ITEM: fish
FROM: Additives & Functional  →  TO: Proteins & Meats
Accept? y 

ITEM: gluten
FROM: Staples (Grains/Dals)  →  TO: Proteins & Meats
Accept? n   Added to manual review.

ITEM: spirulina
FROM: Additives & Functional  →  TO: Fruits, Veg & Botanicals
Accept? n   Added to manual review.

ITEM: fava-bean-protein
FROM: Additives & Functional  →  TO: Proteins & Meats
Accept? n   Added to manual review.

gluten, spirulina, and fava-bean-protein are examples where the automated reclassification suggestion was defensible but not settled — each sits at a category boundary that requires a classification framework to resolve, not a cleaning pass. They were held for the mapping stage.

Final state after all cleaning rounds:

==================================================
     FINAL MONOGRAPH DATA SUMMARY
==================================================
Total Raw Variants (TSV)    : 46,635
Unique Canonical Units      : 662
==================================================

The 46,635 raw variants and 662 canonical units shown here are from the OFF monograph specifically — a separate but parallel cleaning process applied to the OFF-derived strings. The 2,291 figure reported in the main body is the combined and deduplicated variant set from both sources, prior to canonical mapping. These are different counts at different stages of the pipeline and are not in conflict.

Contributors

Lalitha A R — Conceptualization, methodology, data curation, formal analysis, writing (original draft), writing (review and editing).

Subrat Sethi — Data curation: SKU verification (200 SKUs).

Purnendu Shukla — Software: API script execution for ingredient data retrieval.

Radhakrishna MV (Contributor, Open Food Facts India) — Writing (review and editing): manuscript review for accurate and respectful representation of the Open Food Facts dataset and contributor ecosystem.

Acknowledgements

We are deeply grateful to all contributors to the Open Food Facts (OFF) dataset, one of the core sources on which our efforts build. Thank you for all that you do. This report was prepared as part of the Indian Food Informatics Data (IFID) project at the Interdisciplinary Systems Research Lab (iSRL).

References

Lalitha, A. R. 2026. Identity, Transformation, and Function: A Tri-Axial Model for the Classification of Food Ingredient Identity. Interdisciplinary Systems Research Lab. https://doi.org/10.5281/zenodo.18714526.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@report{a_r2026,
  author = {A R, Lalitha},
  publisher = {iSRL},
  title = {Data {Acquisition} and {Ingredient} {Extraction:} {Building}
    a {Vocabulary} of {What} {India’s} {Packaged} {Food} {Labels}
    {Actually} {Say}},
  number = {iSRL-26-04-R-Variants},
  date = {2026-03-01},
  url = {https://isrl.in/pub/2026-04-r-variants/},
  langid = {en},
  abstract = {Indian packaged food labels do not share a common
    ingredient vocabulary. The same substance appears under regional
    names, transliterations, INS (International Numbering System) codes,
    and brand-specific terms -\/-\/- sometimes across labels from the
    same manufacturer. No reference layer exists that maps these
    expressions to shared identities. This report documents the
    construction of a first ingredient variant corpus from two sources:
    896 directly sampled products collected from verified Indian market
    listings, and English ingredient strings from OpenFoodFacts filtered
    to rows with a traceable Indian product name. Both sets were
    processed through a constrained parsing pipeline -\/-\/- one atomic
    operation per API call, temperature set to 0, explicit failure
    rather than approximation when data was unavailable. After combining
    both sources and iterative cleaning, the corpus contains 2,291
    unique ingredient variant strings. These variants are not noise to
    eliminate. They are documentation of how ingredient identity is
    expressed in practice across Indian commercial food labels. The
    question of which variants refer to the same ingredient -\/-\/- and
    by what logic -\/-\/- is addressed in EMF Model
    {[}@arIdentityTransformationFunction2026{]}.}
}

For attribution, please cite this work as:

A R, Lalitha. 2026. Data Acquisition and Ingredient Extraction: Building a Vocabulary of What India’s Packaged Food Labels Actually Say. iSRL-26-04-R-Variants. iSRL. https://isrl.in/pub/2026-04-r-variants/.

Regulatory Texts and Case Law as Ground Truth in Emerging Domains

Lalitha A R — Mon, 23 Feb 2026 00:00:00 GMT

In domains where no unified academic theory has yet consolidated, practitioners face a calibration problem: against what baseline does one evaluate a classification framework, a taxonomy, or a data model? This note argues that enacted law and judicial decisions constitute a practically available, historically grounded source of ground truth for such domains. We describe an approach—regulatory delta analysis—that examines how legislation and case law shift across time, with the aim of surfacing friction points, coordination patterns, and the constraints under which each version of a rule was written. The approach is not a critique of legislative bodies or courts; it is a method for reading the record they have already produced. Limitations are considered throughout. The note is grounded in applied work from the Indian Food Informatics Data (IFID) project but the reasoning is intended to transfer to other domains with similar structural properties.

1 Definitions

The terms below are used precisely throughout this note.

Regulatory text. A statute, rule, or formal regulation enacted by a legislative or administrative body with binding legal authority. In the Indian context, examples include the Food Safety and Standards Act, 2006, and the Food Safety and Standards (Labelling and Display) Regulations, 2020.

Case law. The body of judicial decisions—judgments handed down by courts and tribunals—that interpret, apply, and sometimes settle conflicts between regulatory texts. Case law acquires authority through the doctrine of precedent and, in many common law systems including India’s, through formal hierarchical citation obligations.

Regulatory delta. The set of meaningful changes between two enacted versions of a regulatory text: additions, deletions, scope expansions, definitional tightenings, and any shifts in the underlying policy stance. A delta is not a simple diff; it is an interpretation of what changed and, where the record permits, why.

Friction point. A location in the regulatory record—a contested statutory term, a split between regulators, a gap exploited by litigation—where the interests or interpretations of two or more actors have visibly come into tension. Friction points are distinct from errors; they are structurally revealing.

Ground truth (working definition). For the purposes of this note, a ground truth is a reference against which the outputs of an analytical model can be tested. It does not imply perfect correctness; it implies that the reference has been produced by a process that is independent of and prior to the analysis being evaluated.

2 The Calibration Problem in Emerging Domains

Research in well-consolidated fields benefits from a body of replication, meta-analysis, and theoretical synthesis that can serve as a reference. A new model or framework can be tested against this existing record. The food systems informatics domain does not yet have this infrastructure in India. The field is recent, the data are fragmented, and the constructs—what counts as the same ingredient, how processing changes identity, how regional naming should relate to regulatory naming—remain unsettled.

A comparable situation arises in any domain that is technically complex, involves multiple institutional actors with partly overlapping mandates, and has developed faster than the academic literature has been able to consolidate it. Infrastructure regulation, environmental classification, digital governance, and traditional medicine systems all share these properties to varying degrees.

In such conditions, one cannot simply look to an established canon of empirical findings to validate a new framework. The question then is: what stable, legible, independently produced record exists against which one can calibrate?

We propose that the regulatory and legal record fills this role, with specific properties and specific limitations that are developed in the sections that follow.

3 Why Law and Case Law Can Function as Ground Truth

Regulatory texts and judicial decisions share several properties that make them useful as a reference in calibration contexts.

They are the result of adversarial processes. Legislation is typically produced after consultation, lobbying, expert review, and political negotiation. Judicial decisions are produced after argument by opposed parties, subject to appeal, and written to justify a conclusion against the best available contrary reading. Both processes are imperfect, but both are designed to surface objections. The record they produce has, in a meaningful sense, survived challenge.

They name conflicts directly. Academic literature tends to report consensus or to frame disagreement theoretically. Case law reports disagreement factually: here are two parties, here is what they disputed, here is which interpretation prevailed and why. The friction is the content of the document rather than a subtext to be inferred.

They are time-stamped. Each regulatory text and each judgment carries a date. This makes it possible to sequence the record chronologically and ask which understanding of a concept was operative when.

They are publicly accessible. In India, statutory instruments are notified in the Gazette of India. Judgments of the Supreme Court and High Courts are published in official reporters and on government databases. The record is, in principle, reachable by any researcher.

They have institutional authority within their domain. A food safety regulation issued under the Food Safety and Standards Act, 2006 is, for practical purposes, the definition of the relevant concept for the actors it governs (manufacturers, importers, inspection officers), regardless of whether a food scientist would agree with it. When building systems that must operate in that legal environment, the legal definition is not one input among many; it is a constraint.

None of these properties make the legal record infallible. Section 6 addresses limitations. But they are sufficient to make law a productive starting point when other reference points are unavailable.

4 Regulatory Delta Analysis as Method

A regulatory delta is not simply a list of changes between two versions of a statute or rule. It is an interpretation of those changes in light of the context that produced them.

4.1 Reading changes in context

Laws are written under constraints. The constraints include the state of the relevant industry at the time of drafting, the administrative capacity available to enforce the rule, the incidents that made a regulatory response necessary, and the political feasibility of various options. A later version of a rule almost always looks more precise, more comprehensive, or more technically sophisticated than an earlier one—not because earlier drafters were careless, but because they were working with less data, less precedent, and a less developed field.

The appropriate stance when reading a delta is that of a scribe rather than a judge: the task is to document what changed, to note what the earlier version could not have anticipated, and to ask what new information or new pressure made the change necessary. This is a different question from asking whether the earlier rule was wrong.

Applied to the Indian food labelling context, the transition from the Food Safety and Standards (Packaging and Labelling) Regulations, 2011 to the Food Safety and Standards (Labelling and Display) Regulations, 2020 illustrates this clearly (Vukka and Lalitha 2026). The 2011 regulations were drafted at a point when industrial food processing in India was still expanding rapidly and digital traceability tools were not yet available to enforcement bodies. The 2020 regulations tightened allergen declarations, introduced structured front-of-pack warnings, and prescribed naming conventions with greater specificity. Each of these additions corresponds to a domain that developed, was observed, and was then addressed. The delta reveals an institution processing experience and updating its instrument accordingly.
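One way to keep a delta as an interpretation rather than a bare diff is to record each change with its type and, separately, a rationale that is filled in only where the record permits. The sketch below is a hypothetical data structure, not one used by the IFID project. The change types follow the definition of a regulatory delta in Section 1, and the type assigned to each 2020 change is this example's own reading of the paragraph above, not a coding taken from the cited report.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ChangeType(Enum):
    ADDITION = "addition"
    DELETION = "deletion"
    SCOPE_EXPANSION = "scope expansion"
    DEFINITIONAL_TIGHTENING = "definitional tightening"
    POLICY_SHIFT = "policy shift"


@dataclass(frozen=True)
class DeltaEntry:
    topic: str
    change: ChangeType
    observed: str                    # what the later text does differently
    rationale: Optional[str] = None  # recorded only where the record permits


@dataclass(frozen=True)
class RegulatoryDelta:
    earlier: str
    later: str
    entries: tuple[DeltaEntry, ...]

    def without_rationale(self) -> list[DeltaEntry]:
        """Entries whose 'why' is still undocumented."""
        return [entry for entry in self.entries if entry.rationale is None]


delta = RegulatoryDelta(
    earlier="Food Safety and Standards (Packaging and Labelling) Regulations, 2011",
    later="Food Safety and Standards (Labelling and Display) Regulations, 2020",
    entries=(
        DeltaEntry("allergen declarations", ChangeType.DEFINITIONAL_TIGHTENING,
                   "declarations tightened"),
        DeltaEntry("front-of-pack warnings", ChangeType.ADDITION,
                   "structured warnings introduced"),
        DeltaEntry("naming conventions", ChangeType.DEFINITIONAL_TIGHTENING,
                   "names prescribed with greater specificity"),
    ),
)

print(len(delta.without_rationale()))
```

Keeping `rationale` optional makes the gap explicit: an entry with no documented reason stays visibly incomplete instead of being filled with a guess.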

4.2 Identifying friction points through case law

Where the regulatory text leaves ambiguity, the courts resolve it—and in doing so, produce a record of where the ambiguity was, who held which interpretation, and which reading eventually prevailed. This makes case law particularly useful for identifying friction points that would not be visible from the statutory text alone.

The Supreme Court of India’s January 2026 judgment in Commissioner of Customs (Import) v. M/s Welkin Foods illustrates this (Lalitha 2026). The case concerned whether imported aluminium shelving should be classified as an agricultural machine part or as an aluminium structure. The legal question was narrow, but the Court’s reasoning established a hierarchy for resolving classification disputes in which statutory technical definitions take precedence over common commercial understanding. This hierarchy had been contested in earlier decisions and was now settled. For a food informatics system that must align with Indian classification practice, this judgment is a material constraint—one that would not have been legible from reading only the statutory text.

4.3 Mapping coordination across institutional actors

A regulatory domain typically involves multiple bodies with overlapping but non-identical mandates. In Indian food systems, the relevant actors include the Food Safety and Standards Authority of India (FSSAI), the Directorate General of Foreign Trade (DGFT), the Central Board of Indirect Taxes and Customs (CBIC), and the courts. These actors do not always interpret the domain identically, and their instruments do not always align.

The legal record makes these relationships visible. Where one body’s definition conflicts with another’s, there will typically be a judgment or a regulatory amendment that resolves the conflict, defers it, or acknowledges it. Mapping these interactions across time reveals not just what the current rule is, but how the current rule came to be and which pressures it is still absorbing.

5 What This Approach Surfaces

Regulatory delta analysis, applied systematically, tends to surface four types of information that are difficult to obtain from other sources.

Constraint archaeology. Earlier versions of rules encode the constraints that were operative when they were written. Identifying these constraints—and asking whether they are still valid—can reveal where a regulatory framework is load-bearing on an assumption that may no longer hold.

Coordination mechanisms. When two bodies with overlapping mandates produce consistent rules over time, it is worth asking how that consistency is achieved. The legal record will often contain evidence of formal coordination mechanisms, mutual referencing, or the adoption of shared definitions.

Friction without resolution. Not all conflicts in the legal record are resolved. Some cases are settled before judgment. Some regulatory ambiguities are explicitly deferred. These unresolved tensions are as informative as the settled ones: they mark the places where the system has not yet stabilised.

Bias documentation. Legal instruments are written by people operating in institutional contexts, and they reflect the concerns, categories, and blind spots of those contexts. A regulatory text that focuses on industrial food and does not address traditional preparations is not neutral; it reflects what was legible and politically salient at the time of drafting. Noting these asymmetries is part of using the legal record honestly.

6 Limitations

The approach described here has several limitations that must be held in mind.

Law is not science. A court may settle a dispute by choosing an interpretation that is administratively convenient or politically feasible rather than technically accurate. A regulatory definition may persist after the scientific understanding of the relevant phenomenon has moved on. The legal record reflects what was decided, not necessarily what was correct.

Unenforced rules are not reliable evidence. A statute that exists on paper but is not enforced tells us something about legislative intent but little about actual practice. The gap between enacted law and operational practice can be substantial.

The record is not complete. Not all decisions are published. Not all conflicts result in litigation. Regulatory negotiations that produce an amended rule may leave no public trace of the original disagreement. The legal record samples the domain rather than covers it.

Jurisdiction specificity. The regulatory architecture of one country or regulatory system is not directly transferable to another. Insights from Indian food law are not automatically generalisable to food systems in other jurisdictions, though the structural properties of the method may transfer.

Temporal lag. Legislation and litigation are slow. The legal record may be significantly behind the current state of the domain, particularly in fast-moving technical fields. Using the legal record as ground truth requires acknowledging that it may be calibrated to a version of the domain that no longer obtains.

These limitations do not disqualify the approach. They specify the conditions under which it is and is not useful, and they indicate the supplementary sources—field research, domain expert consultation, technical audits—that should accompany it.

7 Relationship to Other Sources and Methods

This note does not argue that the legal record should replace other methods of establishing ground truth. It argues that the legal record is an underused source that has specific properties making it productive in specific conditions: emerging domains, multi-actor regulatory environments, and contexts where the gap between legal definition and operational practice is itself an object of study.

The approach is most useful in combination with domain expert consultation, which can identify where the legal record is silent or misleading; with empirical data collection, which can reveal practice that diverges from legal prescription; and with traditional academic literature, which provides theoretical frameworks for interpreting what the legal record contains.

Academic literature is not deprecated here. The claim is narrower: in a domain where the academic literature is still being assembled, the legal record is available now, has been produced by adversarial processes, and carries authority for the actors whose behaviour one is trying to understand or model. It is a reasonable place to start.

8 Closing Remarks

Regulatory systems change. The record of that change is publicly available, time-stamped, and produced by institutions that have observed the domain, absorbed feedback from it, and updated their instruments accordingly. Reading that record carefully is not an alternative to original research—it is a form of original research, and one that takes the accumulated work of regulatory bodies and courts seriously as evidence rather than setting it aside in favour of sources that may be more recent but less tested.

The goal is not to judge who was right and who was wrong in any given dispute, or whether a regulatory body made the best possible decision with the information available. The goal is to understand where the system has been, where it has strained, and what that history reveals about where it currently stands. That is a question that the legal record is unusually well placed to answer.

Acknowledgments

My deepest gratitude to Mr. Krishna, whose constancy forms the foundation upon which all my work, including this, quietly rests.

Salutations to the Goddess who dwells in all beings in the form of intelligence. I bow to her again and again.

This note draws on methods developed in the course of the Indian Food Informatics Data (IFID) project at iSRL. The author thanks the researchers whose applied work surfaced the need for explicit methodological documentation.

References

Lalitha, A. R. 2026. Indian Supreme Court Defines Hierarchical Classification for Food Products: Overruling Common Parlance Precedents. Interdisciplinary Systems Research Lab.

Vukka, S. N., and A. R. Lalitha. 2026. Regulatory Delta of Food Labelling Laws in India: A Comparative Analysis of the FSSAI 2011 and 2020 Regulations. Indian Food Informatics Data (IFID) Project, Interdisciplinary Systems Research Lab. https://doi.org/10.5281/zenodo.18710428.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@report{a_r2026,
  author = {A R, Lalitha},
  publisher = {iSRL},
  title = {Regulatory {Texts} and {Case} {Law} as {Ground} {Truth} in
    {Emerging} {Domains}},
  number = {iSRL-26-02-M-GroundTruth},
  date = {2026-02-23},
  url = {https://isrl.in/pub/2026-02-m-groundtruth/},
  doi = {10.5281/zenodo.18741725},
  langid = {en}
}

For attribution, please cite this work as:

A R, Lalitha. 2026. Regulatory Texts and Case Law as Ground Truth in Emerging Domains. iSRL-26-02-M-GroundTruth. iSRL. https://doi.org/10.5281/zenodo.18741725.

Regulatory Delta of Food Labelling Laws in India: A Comparative Analysis of the FSSAI 2011 and 2020 Regulations

Sai Nikhil Vukka — Sat, 21 Feb 2026 00:00:00 GMT

Abstract

This short report summarises the regulatory “delta” between the Food Safety and Standards (Packaging and Labelling) Regulations, 2011 and the Food Safety and Standards (Labelling and Display) Regulations, 2020. The focus is on the transition toward prescriptive naming, structured allergen declarations, and risk-aware warnings. These shifts directly inform the selection of specific data fields in digital ingredient identity layers such as IFID.

1 Background

In 2011, FSSAI notified the Food Safety and Standards (Packaging and Labelling) Regulations, 2011, which combined packaging requirements and labelling rules into a single framework (Food Safety and Standards Authority of India 2011). Roughly a decade later, the authority split packaging and labelling into separate regulations and brought in the Food Safety and Standards (Labelling and Display) Regulations, 2020 (Food Safety and Standards Authority of India 2020). The 2020 move is more than a reshuffle of chapters: it makes labelling more structured, more consumer-facing, and easier to plug into digital compliance tools (Food Safety and Standards Authority of India 2020, n.d.-b; al. 2020).

Over this period, both industrial food systems and digital traceability infrastructure in India have grown in complexity. (Citizen consumer and civic Action Group, n.d.; Indian Council of Medical Research–National Institute of Nutrition, n.d.) As product categories multiplied and more data and feedback became available, regulators had opportunities to refine how key information is displayed and standardised. This report gives a compact view of how the 2020 regulations shift emphasis compared to 2011, and what that means for people who need to interpret or implement the law in practice.

The shift between the two frameworks can be understood as part of an ongoing evolution in both the packaged food sector and regulatory practice, rather than as a simple before/after contrast. The 2011 regulations drew on the industrial food landscape and data that were available at that time, while the 2020 framework reflects additional years of experience, feedback and product diversification. In that sense, the later regulations build on the earlier ones as the system as a whole becomes more capable of handling finer-grained labelling expectations.

2 Regulatory Delta: 2011 vs 2020

The 2011 regulations were drafted when packaged and industrial foods, and digital tracking systems, were at an earlier stage of development, and they reflect the practices and concerns that were salient then. (Food Safety and Standards Authority of India 2011; Citizen consumer and civic Action Group, n.d.) As product ranges expanded, more data accumulated and stakeholder feedback highlighted specific gaps, FSSAI consolidated those lessons into the 2020 Labelling and Display Regulations. (Food Safety and Standards Authority of India 2020, n.d.-a) The delta in this section is therefore best read as a record of how the system has been strengthened over time, not as a critique of the earlier framework.

Table 1 highlights some of the most visible differences between the 2011 and 2020 labelling rules as reflected in official texts and institutional summaries. (Food Safety and Standards Authority of India 2011, 2020, n.d.-b)

Table 1: Regulatory delta between FSSAI 2011 and 2020 labelling rules

Dimension	2011 Packaging & Labelling	2020 Labelling & Display
Allergen visibility	Disclosures primarily embedded in the ingredient list, with responsibility on the consumer to scan the full list.	Priority allergen groups such as cereals containing gluten, milk, peanuts, soy and sulphites are presented through clearer, more standardised declarations.
Naming / “true nature”	Provides flexibility for brand-led naming on the principal display panel, guided by general fair-trading and anti-misleading provisions.	Places greater emphasis on the name reflecting the true nature of the food, with more explicit expectations to avoid creating an erroneous impression.
Nutrition information	Focus on per 100 g/ml declarations for key nutrients; per serving information is less central.	Encourages a clearer pattern for showing nutrition per 100 g/ml and per serving, supporting front-of-pack and percentage RDA style interpretations.
Additives and warnings	Class + name/INS for additives, with generic warning styles for certain substances (for example, colours or preservatives).	Gives more structured attention to specific warnings (for example, for sulphites or particular additives) and clearer wording for sensitive population groups.
Front-of-pack (FoP) thinking	Labelling can largely be organised around back and side panels, with front-of-pack as one option among many.	Articulates more clearly which elements (such as name, veg/non-veg logo and certain declarations) belong on the principal display panel, creating a base for later FoP policies.
Enforcement posture	Centres on ensuring information is not false or misleading, with compliance work often document- and text-centric.	Structured declarations lend themselves to checklists, digital audits and front-of-pack policies that build on nutrient profile models.

2.1 What the Law is Aiming to Address

Read together, these shifts point to a gradual tightening around a few recurring questions.

2.1.0.1 Managing information density and risk signals.

Ensuring that allergens remain visible and recognisable, rather than being overlooked in dense ingredient lists.
Reducing the chances that product names or descriptors leave consumers with an incomplete or ambiguous sense of what they are buying.
Bringing more structure to how high fat, sugar and salt profiles are communicated, especially as processed food categories diversify.

2.1.0.2 Supporting more structured labelling practices.

Encouraging standardised phrasing and placement for priority allergen information.
Clarifying expectations for front-of-pack elements, so that key signals are easier to locate.
Laying technical groundwork for future tools such as front-of-pack labels informed by nutrient profile models.

For lawyers and compliance teams, this means that purely formal arguments like “the information is somewhere on the pack” increasingly give way to questions about prominence, placement and structure. The 2020 framework leans towards asking whether the overall label presentation meets these expectations consistently.

3 Practical Implications for Stakeholders

From a day-to-day point of view, the 2011–2020 changes nudge both companies and advisors towards more explicit internal systems for tracking allergens, names and nutrition.

3.1 For Food Businesses

Allergen tracking becomes more explicit: manufacturers benefit from maintaining internal mappings between ingredients and standard allergen groups, rather than relying only on free-text descriptions.
Naming policies may require review: product names and descriptors that were aligned with earlier interpretations may need revisiting to match the “true nature” emphasis in the 2020 framework.
Nutrition data hygiene matters more: keeping consistent, up-to-date values for energy, sugars, fats and sodium supports both regulatory expectations and clearer communication with consumers.
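
The internal ingredient-to-allergen mapping mentioned in the first point can be sketched as a simple lookup table. This is a minimal illustration in Python: the ingredient names, the group labels and the "Contains:" wording are assumptions made for the example, not an official FSSAI list or prescribed phrasing.

```python
# Hypothetical mapping from ingredient names to allergen groups.
# Entries and group labels are illustrative, not an official FSSAI list.
ALLERGEN_GROUPS = {
    "wheat flour": "cereals containing gluten",
    "barley malt": "cereals containing gluten",
    "milk solids": "milk",
    "butter": "milk",
    "peanut": "peanuts",
    "soy lecithin": "soy",
    "sodium metabisulphite": "sulphites",
}


def allergen_groups_for(ingredients):
    """Return the sorted allergen groups triggered by a list of ingredient names."""
    found = set()
    for name in ingredients:
        group = ALLERGEN_GROUPS.get(name.strip().lower())
        if group is not None:
            found.add(group)
    return sorted(found)


def allergen_statement(ingredients):
    """Build a simple 'Contains: ...' line, or an empty string if no group applies."""
    groups = allergen_groups_for(ingredients)
    return "Contains: " + ", ".join(groups) if groups else ""


# Hypothetical recipe with mixed capitalisation in the free-text names.
recipe = ["Wheat flour", "sugar", "Butter", "soy lecithin", "salt"]
print(allergen_statement(recipe))
```

Keeping the mapping in one table means a change to an allergen group is made once and then reflected in every product that uses the ingredient.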

3.2 For Lawyers and Compliance Teams

Case work can use more structure: instead of only reading long labels line by line, advisors can ask whether allergen declarations, names and nutrition panels line up with the specific structures the 2020 regulations describe.
Advice can be more template-driven: it becomes realistic to build standard checklists for allergens, naming and nutrition that can be reused across clients or product lines.

3.3 The Compliance Checklist for Startups

For founders and early-stage teams, a quick sanity check can be more useful than a long memo. A simple checklist that falls out of the 2011–2020 delta is:

Allergen coverage: Have you identified and labelled all FSSAI priority allergen groups that apply to your product?
Front-of-pack name: Does the name on the principal display panel reflect the true nature of the food, rather than only a marketing phrase?
Per serving signals: Is the per-serving nutrition information (including percentage RDA where applicable) clear enough for a consumer to understand the product’s fat, sugar and salt profile at a glance?

Treating these as a recurring checklist rather than a one-time launch task makes it easier to stay aligned with how the 2020 regulations expect labels to behave.
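
The per-serving check in the list above comes down to simple arithmetic, sketched here in Python. The reference daily values are assumptions included only to make the example run and should be verified against the current text of the regulations; the product and its nutrient values are hypothetical.

```python
# Assumed reference daily values for percentage-RDA calculations.
# These figures are assumptions for illustration; verify them against
# the current text of the regulations before relying on them.
REFERENCE_DAILY = {
    "energy_kcal": 2000.0,
    "saturated_fat_g": 22.0,
    "added_sugar_g": 50.0,
    "sodium_mg": 2000.0,
}


def per_serving(per_100g, serving_size_g):
    """Scale per-100 g nutrient values to one serving."""
    factor = serving_size_g / 100.0
    return {name: value * factor for name, value in per_100g.items()}


def percent_rda(serving_values):
    """Express each per-serving value as a percentage of its reference value."""
    return {
        name: round(100.0 * value / REFERENCE_DAILY[name], 1)
        for name, value in serving_values.items()
        if name in REFERENCE_DAILY
    }


# Hypothetical biscuit: values per 100 g, with a 30 g serving.
biscuit_per_100g = {
    "energy_kcal": 480.0,
    "saturated_fat_g": 10.0,
    "added_sugar_g": 25.0,
    "sodium_mg": 400.0,
}
serving = per_serving(biscuit_per_100g, serving_size_g=30.0)
print(percent_rda(serving))
```

Storing only per-100 g values and deriving per-serving figures on demand avoids the two sets of numbers drifting apart when a recipe or serving size changes.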

4 Implications for Digital Ingredient Identity Systems (IFID)

Because the 2020 regulations move from unstructured text to specific, standardised declarations, the ‘regulatory delta’ described here serves as a blueprint for any system aiming for compliance, whether a printed label or a digital database.

At a bare minimum, an ingredient record in such a system can support:

A canonical “true nature” name plus a list of vernacular or commercial aliases, so that regional naming and compliant labelling language can be linked cleanly.
Structured allergen membership, for example a small set of flags for cereals containing gluten, milk, peanuts, tree nuts, soy and sulphites, instead of only storing full text ingredient names.
Basic nutrient fields (energy, total sugars, saturated fat, sodium per 100 g/ml) in a consistent format, so that front-of-pack or HFSS-style rules can be applied programmatically later if needed.

If these fields live in a stable backend identity layer, then future FSSAI amendments—such as a new HFSS threshold or a focus on a particular additive—can be implemented as updated rules that run across existing products, rather than as one-off manual relabelling exercises.
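
A minimal sketch of such a record, and of a rule that runs across existing records, might look as follows in Python. The class name, field names, nutrient values and the sodium threshold are all hypothetical placeholders chosen for illustration; none of them is an IFID schema or a regulatory value.

```python
from dataclasses import dataclass, field


@dataclass
class IngredientRecord:
    """Hypothetical ingredient record; field names are illustrative."""
    canonical_name: str                                 # "true nature" name
    aliases: list = field(default_factory=list)         # vernacular or commercial names
    allergen_groups: set = field(default_factory=set)   # e.g. {"milk"}
    energy_kcal: float = 0.0                            # per 100 g/ml
    total_sugars_g: float = 0.0                         # per 100 g/ml
    saturated_fat_g: float = 0.0                        # per 100 g/ml
    sodium_mg: float = 0.0                              # per 100 g/ml


def high_sodium(record, threshold_mg):
    """Example rule: flag records above a sodium threshold (per 100 g/ml)."""
    return record.sodium_mg > threshold_mg


# Nutrient values below are made up for the example.
records = [
    IngredientRecord("Bengal gram flour", aliases=["besan"],
                     energy_kcal=387.0, sodium_mg=64.0),
    IngredientRecord("Butter", aliases=["makkhan"], allergen_groups={"milk"},
                     energy_kcal=717.0, saturated_fat_g=51.0, sodium_mg=643.0),
]

# The threshold is a placeholder, not a regulatory value. Changing it re-runs
# the rule over existing records without touching the records themselves.
flagged = [r.canonical_name for r in records if high_sodium(r, threshold_mg=500.0)]
print(flagged)
```

The design point is the separation: records hold stable facts about an ingredient, while thresholds live in rules that can be swapped when an amendment arrives.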

4.1 Mandatory vs Optional Fields in IFID Records

For a digital ingredient identity layer to stay aligned with the 2020 regulations and still be useful for future extensions, it helps to separate mandatory compliance fields from optional but valuable metadata.

4.1.0.1 Mandatory fields (driven by FSSAI 2020).

Canonical “true nature” name: a single, standardised name that reflects what the ingredient actually is, to reduce scope for ambiguous naming on labels.
Allergen group membership: explicit mapping of each ingredient to the relevant FSSAI priority allergen groups (for example cereals containing gluten, milk, peanuts, tree nuts, soy, sulphites) so that allergen statements can be generated consistently.
Core nutrient values: at least energy, total sugars, saturated fat and sodium per 100 g/ml, to support basic HFSS-style signalling and any future front-of-pack requirements that depend on these nutrients.

4.1.0.2 Optional metadata (for future-proofing and usability).

Vernacular and commercial names: regional aliases and brand-style names that help link consumer-facing labels back to a single canonical ingredient record without losing cultural context.
Versioned compliance rules and flags: pointers from the ingredient record to external rule-sets (for example “FSSAI_2011”, “FSSAI_2020”, “FSSAI_2025”) so that when regulations change, new rules can be run as an audit across existing IFIDs without rewriting the underlying identities.

Keeping this distinction clear makes it easier to run the IFID project in a continuous loop: mandatory fields ensure basic regulatory alignment, while optional metadata can be expanded over time as new use-cases and amendments appear.
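
One way to make the mandatory/optional split operational is a small completeness check, sketched below in Python. The field names and the dict-based record layout are assumptions for illustration, and the nutrient values are made up.

```python
# Hypothetical field names for an IFID-style record stored as a dict.
MANDATORY_FIELDS = ("canonical_name", "allergen_groups", "nutrients_per_100g")
MANDATORY_NUTRIENTS = ("energy_kcal", "total_sugars_g", "saturated_fat_g", "sodium_mg")
OPTIONAL_FIELDS = ("aliases", "rule_sets")


def missing_mandatory(record):
    """List mandatory fields, or mandatory nutrients, that are absent from a record.

    An empty allergen list counts as present: it is an explicit statement
    that no priority allergen group applies.
    """
    missing = [name for name in MANDATORY_FIELDS if name not in record]
    nutrients = record.get("nutrients_per_100g")
    if nutrients is not None:
        missing += [
            f"nutrients_per_100g.{n}" for n in MANDATORY_NUTRIENTS if n not in nutrients
        ]
    return missing


# Illustrative record: total sugars has been left out by mistake.
ghee = {
    "canonical_name": "Ghee",
    "allergen_groups": ["milk"],
    "nutrients_per_100g": {"energy_kcal": 900.0, "saturated_fat_g": 60.0, "sodium_mg": 0.0},
    "rule_sets": ["FSSAI_2011", "FSSAI_2020"],  # optional, versioned pointers
}
print(missing_mandatory(ghee))
```

Optional fields are deliberately not checked, so records can be enriched over time without failing the basic compliance gate.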

From a data science point of view, the 2020 structure can also be read as an invitation to build API-first compliance systems: once allergens, names and nutrients are represented as stable fields in an ingredient database, they can be exposed through services that run automated checks whenever a recipe changes, a new product is proposed, or a regulation is updated. In that sense, the same design that supports paper labels today also lays the groundwork for digital traceability and machine-readable audits in the future.

5 Conclusion

The move from the 2011 Packaging and Labelling Regulations to the 2020 Labelling and Display Regulations marks a gradual shift from mostly text-heavy transparency towards more structured, salient labelling. For non-technical readers, the core takeaway is that allergens, naming and nutrition are increasingly treated as fields that can be checked and compared in a more systematic way, rather than as unstructured blocks of text.

For data-oriented projects such as IFID, this same delta is a design clue: if ingredient records are aligned with the way the 2020 rules think about allergens, names and nutrients, then it becomes possible to build shared tools that lawyers, regulators and food businesses can all use, without each group having to redo the basic comparison between 2011 and 2020 from scratch.

References

Pande, Radhika, et al. 2020. “Front-of-Pack Nutrition Labelling in India.” The Lancet Public Health 5 (4): e195–96.

Citizen consumer and civic Action Group. n.d. Front of Pack Labelling in India: Background and Context.

Food Safety and Standards Authority of India. 2011. Food Safety and Standards (Packaging and Labelling) Regulations, 2011. Government of India.

Food Safety and Standards Authority of India. 2020. Food Safety and Standards (Labelling and Display) Regulations, 2020. Government of India.

Food Safety and Standards Authority of India. n.d.-a. Compendium of Food Safety and Standards (Labelling and Display) Regulations, 2020.

Food Safety and Standards Authority of India. n.d.-b. Frequently Asked Questions (FAQs) on FSS (Labelling and Display) Regulations, 2020.

Indian Council of Medical Research–National Institute of Nutrition. n.d. Dietary Guidelines and Nutrient Thresholds Relevant to High Fat, Sugar and Salt (HFSS) Foods.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@report{nikhil_vukka2026,
  author = {Nikhil Vukka, Sai and A R, Lalitha},
  publisher = {iSRL},
  title = {Regulatory {Delta} of {Food} {Labelling} {Laws} in {India:}
    {A} {Comparative} {Analysis} of the {FSSAI} 2011 and 2020
    {Regulations}},
  number = {iSRL-26-02-R-RegDelta},
  date = {2026-02-21},
  url = {https://isrl.in/pub/2026-02-r-regdelta/},
  doi = {10.5281/zenodo.18719394},
  langid = {en}
}

For attribution, please cite this work as:

Nikhil Vukka, Sai, and Lalitha A R. 2026. Regulatory Delta of Food Labelling Laws in India: A Comparative Analysis of the FSSAI 2011 and 2020 Regulations. iSRL-26-02-R-RegDelta. iSRL. https://doi.org/10.5281/zenodo.18719394.

Justification Companion to EMF-Scoring Model

Lalitha A R — Fri, 20 Feb 2026 00:00:00 GMT

This is a justification companion to the EMF Scoring Model as described in Identity, Transformation, and Function: A Tri-Axial Model for the Classification of Food Ingredient Identity (Lalitha 2026a).

1 Table 1: Anthropogenic Energy Score (E) Assignments

Table 1: Anthropogenic Energy Score (E) assignments with chemical, regulatory, and trade defensibility notes.

Process	E	Chemical Justification	Legal / Naming Justification (FSSAI/Codex) and Trade Classification	Defensibility	Summary
Chilling	0.18	No covalent change; refrigeration is explicitly listed as “minimally processed”. (Food Safety and Standards Authority of India (FSSAI) 2023a)	“Fresh or chilled” food categories are treated as primary commodity forms in ITC(HS) (e.g., Ch. 07). (Directorate General of Commercial Intelligence and Statistics (DGCI&S), Government of India 2007a)	Medium	Physical stabilization; identity retained.
Sorting	0.12	Physical selection only; no molecular modification. (Food Safety and Standards Authority of India (FSSAI) 2023a)	Generally does not create a new standardized “food name” under labelling rules; still described by true nature. (Food Safety and Standards Authority of India (FSSAI) 2023a)	Medium	Handling step only.
Washing	0.15	Surface removal step; documented to reduce residues; no intended covalent change. (Food Safety and Standards Authority of India (FSSAI) 2023a)	Treated as minimal processing (cleaning/removal of unwanted parts). (Food Safety and Standards Authority of India (FSSAI) 2023a)	Medium	Decontamination without re-identity.
De-husking	0.22	Removes inedible outer layers; does not require covalent transformation. (Food Safety and Standards Authority of India (FSSAI) 2023a)	Fits “removing inedible or unwanted parts” under minimal processing. (Food Safety and Standards Authority of India (FSSAI) 2023a)	Medium	Structure reduced; chemistry preserved.
Milling (e.g., Besan)	0.28	Comminution; macromolecules remain; cellular structure destroyed but molecules remain. (Food Safety and Standards Authority of India (FSSAI) 2023a)	Grinding is listed as minimal processing; trade heading exists for flour/meal/powder of dried legumes (HS 1106). (Food Safety and Standards Authority of India (FSSAI) 2023a; Directorate General of Commercial Intelligence and Statistics (DGCI&S), Government of India 2007b)	High	Mechanical conversion to flour with clear HS placement.
Cold Pressing (Oil)	0.32	Mechanical extraction without heat; lipid molecules remain triglycerides. (Codex Alimentarius Commission (FAO/WHO) 2024)	Codex defines “cold pressed fats and oils” and restricts use of that designation to compliant products; no additives permitted in virgin/cold pressed oils. (Codex Alimentarius Commission (FAO/WHO) 2024)	High	Mechanical-only oil; limited industrial separation.
Churning (Butter)	0.45	Phase inversion (oil-in-water to water-in-oil) and physical separation; no target covalent change. (Food Safety and Standards Authority of India (FSSAI) 2025b)	FSSAI defines butter as a water-in-oil emulsion derived exclusively from milk/milk products; table butter must be from pasteurised cream. (Food Safety and Standards Authority of India (FSSAI) 2025b)	High	Thermal/physical re-structuring with defined legal identity.
Fermentation (Vinegar)	0.56	Biochemical oxidation of ethanol to acetic acid by acetic acid bacteria; covalent re-identity of primary acid. (Yun et al. 2024)	FSSAI treats fermentation as minimal processing in nutrition-labelling context; ITC(HS) distinguishes brewed vs synthetic vinegar under HS 2209. (Food Safety and Standards Authority of India (FSSAI) 2023a; Directorate General of Commercial Intelligence and Statistics (DGCI&S), Government of India 2007d)	Medium	Biological conversion; product class recognized in trade.
Roasting	0.58	Thermal chemistry (Maillard reaction: amino acids + reducing sugars forming melanoidins and other new compounds). (Schaefer et al. 2025)	Still generally labelled by food name with accurate description; not typically a statutory rename trigger by itself. (Food Safety and Standards Authority of India (FSSAI) 2023a)	Low	Chemistry occurs, but regulatory treatment is food- and claim-dependent.
Pasteurization	0.48	Microbicidal heat treatment (defined time/temperature combinations); primarily denaturation/aggregation without designed covalent synthesis. (Food Safety and Standards Authority of India (FSSAI) 2025b)	FSSAI defines pasteurization and requires heat-treatment declaration for milk; also listed as minimal processing. (Food Safety and Standards Authority of India (FSSAI) 2025b, 2023a)	High	Standardized thermal process with explicit legal definition.
Solvent Extraction (Oils)	0.82	Solvent-based separation (typically hexane) and subsequent desolventizing/distillation; strong industrial separation though not necessarily covalent modification. (Boukhenfa et al. 2022)	India controls “solvent-extracted oil” production/handling under a dedicated Control Order; industrial category is legally recognized. (Government of India (hosted on FSSAI website) 1967)	High	Industrial chemical-separation route with distinct legal instrument.
Fractionation (Olein)	0.76	Physical fractionation via controlled crystallization and separation into liquid (olein) and solid fractions; no intended covalent modification. (Abdul Wahab et al. 2023)	Ingredient class titles in FSSAI labelling include “fractionated fat” under edible vegetable fat declarations. (Food Safety and Standards Authority of India (FSSAI) 2023a)	Medium	Industrial separation into functional fractions.
Clarification (Ghee)	0.55	Heat-driven removal of water and milk solids-not-fat; concentrated milk fat; no intended covalent synthesis. (Food Safety and Standards Authority of India (FSSAI) 2025b)	FSSAI defines ghee/milk fat products as derived exclusively from milk via processes that remove water and SNF almost totally. (Food Safety and Standards Authority of India (FSSAI) 2025b)	High	Well-defined milk-fat product identity.
Hydrogenation	0.92	Addition of hydrogen to C=C double bonds (covalent saturation); may also change isomer distribution. (American Oil Chemists’ Society (AOCS) 2024)	ITC(HS) heading 1516 explicitly covers fats/oils “partly or wholly hydrogenated”; FSSAI ingredient class titles include “hydrogenated oils” / “partially hydrogenated oils”. (Directorate General of Commercial Intelligence and Statistics (DGCI&S), Government of India 2007c; Food Safety and Standards Authority of India (FSSAI) 2023a)	High	Explicit HS/legal recognition + covalent modification.
Acetylation (Modified Starch)	0.94	Hydroxyl groups on starch are converted to acetate esters (O-acetylation); acetyl groups introduced using acetic anhydride (representative modified starch). (Joint FAO/WHO Expert Committee on Food Additives (JECFA) 1974)	HS Chapter 35 heading 3505 covers modified starches including esterified starches; FSSAI labelling distinguishes “starches other than chemically modified starches” (implying modified starch must be specifically named). (Directorate General of Commercial Intelligence and Statistics (DGCI&S), Government of India 2007e; Food Safety and Standards Authority of India (FSSAI) 2023a)	High	Clear covalent modification and clear HS heading.
Interesterification	0.91	Rearrangement of fatty acids within/between triglycerides via ester interchange (covalent bond break/re-form) while total FA composition may remain. (Mozaffarian et al. 2011)	ITC(HS) heading 1516 explicitly includes “interesterified” and “re-esterified” fats/oils; FSSAI class titles include “interesterified vegetable fat”. (Directorate General of Commercial Intelligence and Statistics (DGCI&S), Government of India 2007c; Food Safety and Standards Authority of India (FSSAI) 2023a)	High	Explicit HS/legal recognition + covalent modification.
Synthetic Flavors	0.99	Deliberate formulation of defined molecules produced by industrial chemical synthesis; mixture may be far removed from biological matrix. (National Center for Biotechnology Information (NCBI) 2025a)	FSSAI requires declaration of flavouring agents; artificial flavours require declaring the common name, and natural/nature-identical require class name declaration. (Food Safety and Standards Authority of India (FSSAI) 2023a)	Medium	Strong naming rules; chemistry varies by flavour system.
Vanillin (Lab-made)	0.98	Single defined chemical entity (vanillin; 4-hydroxy-3-methoxybenzaldehyde). (National Center for Biotechnology Information (NCBI) 2025a)	Typically declared as a flavouring substance; labelling must follow flavour declaration rules (natural vs artificial/nature-identical classification depends on source and regulatory interpretation). (Food Safety and Standards Authority of India (FSSAI) 2023a)	Medium	Chemical identity is unambiguous; regulatory class depends on production route.
Sodium Glycolate	0.99	Defined organic salt (sodium 2-hydroxyacetate); inherently a chemical entity not tied to a food matrix. (National Center for Biotechnology Information (NCBI) 2025b)	“Glycolate” is referenced as an impurity/specification parameter within some additive standards (e.g., CMC-related specs), but sodium glycolate itself is not a common named food. (Food Safety and Standards Authority of India (FSSAI) 2024)	Low	Presence in food law is indirect; use-case dependent.

2 Table 2: Final Commercial States with Matter Scores (M)

Table 2: Final commercial states with single Matter Scores (M), primary Matter Classes, typical process/E context, and justification summary.

Final State	M	Matter Class	Typical Processes / E-context	Justification Summary
Whole / Fresh pieces	0.05	Hydrated / Native	Sorting (E=0.12), washing (0.15), de-husking (0.22), chilling (0.18).	Primary commodities such as whole vegetables or raw milk sold with cellular water and structure intact are treated as minimally processed in trade and labelling; matrix loss is negligible, so M is near zero. (Food Safety and Standards Authority of India (FSSAI) 2023b; Directorate General of Commercial Intelligence and Statistics (DGCI&S) 2007)
Cut / Sliced pieces	0.10	Hydrated / Native	Sorting, washing, trimming, cutting, chilling in similar E-band as whole produce.	Cutting or slicing does not remove major components; it increases surface area but retains the hydrated matrix, so M is modestly above whole state yet still in Class 1. (Food Safety and Standards Authority of India (FSSAI) 2023b)
Pulp / Puree	0.25	Comminuted	De-husking (0.22), milling or pulping (0.28–0.32), possible pasteurisation (0.48).	Fruit or vegetable pulps and purees correspond to comminuted edible portions where skin and fibre may be partially retained; composition is close to edible fraction but structure is lost, consistent with Class 2. (Directorate General of Commercial Intelligence and Statistics (DGCI&S) 2007; Food Safety and Standards Authority of India (FSSAI) 2025a)
Coarse grits	0.30	Comminuted	Milling/fragmentation (E near 0.28) without extensive screening to flour fineness.	Cereal groats and meals are defined in Chapter 11 as fragmented grains with specified sieve cut-offs; fragmentation preserves nutritional spectrum but destroys grain structure, giving a higher M than whole grain but still Class 2. (Directorate General of Commercial Intelligence and Statistics (DGCI&S) 2007; World Customs Organization 2002)
Flour / Fine powder	0.33	Comminuted	Milling and sifting (E ≈ 0.28) to flours and powders.	Fine cereal flours (HS 1101/1102) and pulse flours (HS 1106) represent fully fragmented grain; anatomical integrity is lost but no targeted macronutrient removal occurs, so M is slightly higher than coarse grits yet remains in Class 2. (Directorate General of Commercial Intelligence and Statistics (DGCI&S) 2007; World Customs Organization 2002)
Flakes	0.36	Dehydrated / Concentrated	Rolling/laminating (working grains), partial dehydration or toasting (E around roasting 0.58).	Rolled or flaked grains are explicitly classified under heading 1104; moisture is typically lower and structure more worked than in meal, so M reflects additional matrix disruption and concentration characteristic of early Class 3. (Directorate General of Commercial Intelligence and Statistics (DGCI&S) 2007; World Customs Organization 2002)
Concentrate (liquid)	0.40	Dehydrated / Concentrated	Evaporation or vacuum concentration of juices, milk, or pulps; processes in a band around pasteurisation (0.48) and clarification (0.55).	Liquid concentrates such as condensed milk or concentrated juice primarily remove water; the matrix is densified but major macronutrients remain, justifying a mid-Class 3 M-score. (Food Safety and Standards Authority of India (FSSAI) 2025a)
Powder (spray-dried)	0.42	Dehydrated / Concentrated	Evaporation and spray-drying of liquids (e.g., milk, juices, whey) following heat treatment.	Spray-dried powders such as milk powder and whey powder are recognized as distinct dried milk products; removal of nearly all water and creation of free-flowing powders increases density and handling purity but still retains a broad nutrient profile, fitting high Class 3. (Food Safety and Standards Authority of India (FSSAI) 2025a; Pintado et al. 2015)
Juice (clarified)	0.50	Structural Fractionation	Pulping, then clarification/filtration, sometimes centrifugation (E rising from 0.28 to ≈0.55).	Clarified juices selectively remove insoluble fibre and suspended solids, leaving mainly soluble solids and water; this is a compositional subset of the fruit matrix and matches Class 4 behaviour. (Directorate General of Commercial Intelligence and Statistics (DGCI&S) 2007; Food Safety and Standards Authority of India (FSSAI) 2023b)
Skim / Defatted meal	0.55	Structural Fractionation	Cream separation (milk), solvent extraction or pressing (oilseeds), followed by drying or milling.	Skimmed milk (fat-reduced) and defatted oilseed meals are produced by removing cream or oil; the remaining fraction is enriched in protein or non-fat solids and recognized as a separate commodity or feed/food ingredient, placing it in upper Class 4. (Food Safety and Standards Authority of India (FSSAI) 2025a; DGCI&S 2007a; Codex Alimentarius Commission 1989)
Oil	0.70	Constitutional Isolate	Cold pressing (E=0.32), solvent extraction (0.82), and refining/fractionation (0.76).	Edible fats and oils are defined in Codex CXS 19-1981 as glyceride-based materials separated from plant or animal sources, including virgin, cold-pressed, and refined oils; the lipid fraction is isolated from the matrix, justifying a Class 5 score. (Codex Alimentarius Commission 2015; DGCI&S 2007a)
Protein concentrate	0.74	Constitutional Isolate	Aqueous extraction, precipitation or membrane concentration of protein, followed by drying.	Soy protein concentrate is defined as containing 65–90% protein (dry basis) after removal of substantial non-protein matter; this near-pure macronutrient fraction is more matrix-distant than skim or defatted meal but less than isolates, justifying M=0.74. (Codex Alimentarius Commission 1989)
Protein isolate	0.78	Constitutional Isolate	Further removal of non-protein constituents (water, oil, carbohydrates) by extraction and membrane processes.	Soy protein isolate (≥90%) and whey protein isolate similarly achieve very high protein purity; Codex and technical literature treat them as functional protein ingredients rather than food matrices, placing them at the top of Class 5. (Codex Alimentarius Commission 1989; Pintado et al. 2015)
Fat fraction	0.72	Constitutional Isolate	Fractionation of oils/fats (E=0.76) into olein/stearin, or separation of butterfat/ghee from milk.	Fractionated fats such as palm olein and milk fat products (including ghee and butterfat) are recognized in Codex and FSSAI as specific fat fractions; they are highly enriched in triglycerides from a defined source, warranting an M close to that of generic oil (0.72 against 0.70), with preserved origin linkage and clear Class 5 status. (Codex Alimentarius Commission 2015; Food Safety and Standards Authority of India (FSSAI) 2025a)
Extract / Oleoresin	0.86	Molecular Signal / Extract	Solvent extraction of spices or herbs and evaporation of solvent to yield oleoresins.	Spice oleoresins and similar extracts concentrate flavour-active and sometimes pungent components; biomass is largely removed and the material acts as a potent functional ingredient, aligning with mid-Class 6. (Rodilla et al. 2024; Food Safety and Standards Authority of India (FSSAI) 2022)
Essential oil	0.90	Molecular Signal / Extract	Steam distillation or cold expression, sometimes followed by purification or encapsulation.	Essential oils are volatile, hydrophobic liquids containing concentrated aroma compounds; reviews highlight their use as natural flavourings and preservatives at very low inclusion levels, indicating high signal potency and justifying a high Class 6 score. (Rodilla et al. 2024)
Crystalline chemical	0.98	De-novo / Synthetic	Chemical synthesis, purification, and crystallization (e.g., vanillin, ethyl vanillin, sodium salts).	Crystalline vanillin and similar flavour chemicals are single, chemically defined entities catalogued in PubChem; sodium glycolate is likewise described as a defined salt. These are essentially pure synthetic matter with negligible matrix linkage, near the Class 7 extreme. (National Center for Biotechnology Information 2021; NCBI 2022)
Granules	0.80	Constitutional Isolate	Agglomeration or granulation of flours, concentrates, isolates, or crystalline additives.	Codex explicitly allows soy protein products to be designated by physical forms such as granules or bits; such granules usually represent agglomerated isolates or concentrates, making them slightly above generic protein isolates in perceived purity and handling regularity. (Codex Alimentarius Commission 1989)
Oleoresin (viscous)	0.88	Molecular Signal / Extract	Solvent extraction of spices followed by partial solvent removal to a viscous resin.	Viscous oleoresins preserve both essential oil and non-volatile resinous components and are widely used as concentrated spice ingredients; their high potency and low-dose application justify an M-score between generic extracts and essential oils. (Rodilla et al. 2024; Food Safety and Standards Authority of India (FSSAI) 2022)
Whey powder	0.52	Structural Fractionation	Separation of whey from curd, concentration, and drying.	Whey powder arises after removal of curd proteins and fat; it is recognized in dairy standards as a separate dried milk product comprised mainly of lactose and whey proteins, thus more matrix-thinned than whole milk powder but still a food fraction. (Food Safety and Standards Authority of India (FSSAI) 2025a; Pintado et al. 2015)
Starch flour	0.60	Structural Fractionation	Wet separation of starch from cereals or roots, followed by drying and milling.	Cereal and root starches are classified separately from whole flours in Chapter 11 and in Chapter 35 when chemically modified; the isolated carbohydrate fraction retains botanical origin but little of the original matrix, placing it at the high end of Class 4. (Directorate General of Commercial Intelligence and Statistics (DGCI&S) 2007; DGCI&S 2007b)
Dense block / Cake (e.g., khoya)	0.38	Dehydrated / Concentrated	Prolonged heat concentration of milk to a semi-solid or solid mass.	Khoa/khoya is defined as a milk product obtained by partial dehydration of milk; solids are concentrated but composition remains broad (fat, protein, lactose), supporting a moderate Class 3 score. (Food Safety and Standards Authority of India (FSSAI) 2025a)
Meal (e.g., defatted soya meal)	0.57	Structural Fractionation	Oil extraction from soybeans, followed by grinding to meal.	Defatted meals contain much of the non-fat matrix but have lost the bulk lipid; they are standard outputs of oilseed processing and fit upper Class 4 as protein- and fibre-rich subsets of the starting seed. (DGCI&S 2007a; Codex Alimentarius Commission 1989)
Modified starch powder	0.96	De-novo / Synthetic	Chemical modification (e.g., acetylation, cross-linking) of starch polymers, then drying and milling.	JECFA describes acetylated distarch adipate as starch whose hydroxyls have been esterified with acetic and adipic moieties; this covalent modification creates a regulated food additive (INS 1422) classified under modified starches, giving it a low-end Class 7 score. (FAO/WHO Joint Expert Committee on Food Additives 2016; FAO/WHO Codex GSFA 2025; DGCI&S 2007b)
Emulsifier powder (e.g., lecithin)	0.89	Molecular Signal / Extract	Solvent extraction or fractionation of phospholipids from oils, followed by drying or spray-drying.	Lecithins are listed in additive regulations as surface-active phospholipid mixtures obtained from edible fats and oils; their role as functional emulsifiers at low inclusion levels and their separation from bulk matrix place them high in Class 6. (Food Safety and Standards Authority of India (FSSAI) 2022; Codex Alimentarius Commission 2015)
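
The rows above can also be read as data. The following is a minimal Python sketch, not part of the scoring model itself, that groups the M scores listed in this part of the table by matter class. It checks two things the justifications imply: the class bands do not overlap, and the viscous oleoresin sits between the generic extract and the essential oil. The names `M_ROWS`, `class_ranges`, and `bands_overlap` are hypothetical, and the column labels in the comments are assumed because the table header is not repeated here. Only the rows shown above are included, so the printed ranges are partial.

```python
from collections import defaultdict

# (physical form, M score, matter class), copied from the table rows above.
# Column labels are assumed; only the rows in this part of the table appear.
M_ROWS = [
    ("Extract / Oleoresin", 0.86, "Molecular Signal / Extract"),
    ("Essential oil", 0.90, "Molecular Signal / Extract"),
    ("Crystalline chemical", 0.98, "De-novo / Synthetic"),
    ("Granules", 0.80, "Constitutional Isolate"),
    ("Oleoresin (viscous)", 0.88, "Molecular Signal / Extract"),
    ("Whey powder", 0.52, "Structural Fractionation"),
    ("Starch flour", 0.60, "Structural Fractionation"),
    ("Dense block / Cake (e.g., khoya)", 0.38, "Dehydrated / Concentrated"),
    ("Meal (e.g., defatted soya meal)", 0.57, "Structural Fractionation"),
    ("Modified starch powder", 0.96, "De-novo / Synthetic"),
    ("Emulsifier powder (e.g., lecithin)", 0.89, "Molecular Signal / Extract"),
]


def class_ranges(rows):
    """Return {matter class: (min M, max M)} for the given rows."""
    grouped = defaultdict(list)
    for _form, m, matter_class in rows:
        grouped[matter_class].append(m)
    return {c: (min(v), max(v)) for c, v in grouped.items()}


def bands_overlap(ranges):
    """True if any two class bands share part of the M axis."""
    spans = sorted(ranges.values())
    return any(prev[1] >= nxt[0] for prev, nxt in zip(spans, spans[1:]))


RANGES = class_ranges(M_ROWS)
M_BY_FORM = {form: m for form, m, _ in M_ROWS}

# Print the bands from lowest to highest M.
for matter_class, (lo, hi) in sorted(RANGES.items(), key=lambda kv: kv[1]):
    print(f"{matter_class:28s} {lo:.2f}-{hi:.2f}")
print("bands overlap:", bands_overlap(RANGES))
```

For the rows listed here the bands come out disjoint, which is consistent with the class-based reasoning in the justification column. Any added row that breaks this would flag a score worth re-examining.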

3 Table 3: Detailed Functional (F) Score Analysis

Table 3: Detailed Functional (F) Score Analysis: Identity Shift Logic and Statutory Basis.

Functional Class	F	Primary Tech. Role	Typical E-M Context	Identity Shift Logic & Statutory Basis
Base Ingredient	0.12	Provide bulk, calories, protein, primary structure.	E: 0.12–0.82; M: 0.05–0.78	Even at very high E and M — spray-dried milk powder (E ≈ 0.48, M ≈ 0.42), solvent-extracted soy protein isolate (E ≈ 0.82, M ≈ 0.78) — regulatory frameworks mandate source-dominant naming. “Milk solids,” “soya protein isolate,” “wheat flour” are required declarations; functional roles (nutrition, structure, emulsification) remain implicit rather than named. F reflects institutional resistance to functional abstraction. F operates as a downward tie-breaker, preventing drift toward function-emergent status. FSSAI Reg 4(1) “true nature”; Reg 5(2) mandatory source-first; Sch II titles 1–3, 13–16; ITC-HS Ch 07–11 (source-aligned). (Legitquest Legal Database 2024; Directorate General of Commercial Intelligence and Statistics (DGCI&S) 2022)
Taste Profile	0.18	Natural aroma, raw taste profile, herbal garnish.	E: 0.15–0.45; M: 0.10–0.40	Sensory function is acknowledged (elevating F above Base Ingredient), but botanical origin remains primary in regulatory naming. “Natural vanilla flavor” requires vanilla source identification where characterizing; “peppermint oil” retains species-specific designation. The “as appropriate” qualifier in Schedule II preserves contextual source judgment. Synthetic replication triggers different regulatory treatment (artificial flavor, F ≈ 0.88), confirming source-function coupling constraints. FSSAI Reg 5(2) Sch II title 8; Reg 2.6 (natural/nature-identical/artificial qualifiers); ITC-HS Ch 09, 12, 21, 33.01–33.02. (Legitquest Legal Database 2024)
Lipid Base	0.22	Structural fat functionality, caloric contribution, texture/mouthfeel.	E: 0.32–0.92; M: 0.70–0.75	Pivotal for Key Intersection Analysis. Even intensive modification — hydrogenation (E ≈ 0.92), interesterification (E ≈ 0.91) — does not trigger functional re-casting while regulatory frameworks retain source-linked naming. “Hydrogenated vegetable oil,” “interesterified palm olein” are mandatory; triglyceride structure preservation in HS 1516 (“but not further prepared”) caps F at ≈ 0.35. Baseline 0.22 reflects regulatory resilience of source identity that processing intensity alone cannot overcome. FSSAI Reg 5(2) Sch II title 2; Codex CXS 19-1981 virgin/cold-pressed; ITC-HS Ch 15 (1507–1515 specific oils; 1516 modified). (Legitquest Legal Database 2024; General Standard for Edible Fats and Oils Not Covered by Individual Standards 1981)
Bulking Agent	0.38	Increase volume, filler, non-nutritive bulk.	E: 0.60–0.85; M: 0.60–0.80	Partial functional recognition: “bulking agent” class acknowledged, but source often implicit in chemical name (maltodextrin from starch, cellulose from wood pulp/cotton). HS placement split between food-derived (Ch 11, 17) and chemically-processed (Ch 39) depending on modification degree. Reflects intermediate status: function named but not fully abstracted from material source. FSSAI Food Additives Regs 2011, Sch I class; ITC-HS 1702 (maltodextrins), 1109 (gluten), 3912 (cellulose). (Indian Kanoon Repository 2024)
Humectant	0.42	Retain moisture, prevent drying, wetting agent.	E: 0.55–0.75; M: 0.60–0.75	Moisture-retention function primary in naming, but glycerol source (vegetable/animal/synthetic) may be relevant for veg/non-veg classification. Synthetic glycerol achieves higher functional abstraction than fat-derived. Reflects moderate elevation: frameworks permit functional-class declaration but source indication remains commercially and regulatorily significant for certain applications. FSSAI Food Additives Regs 2011, Sch I class; Glycerol (INS 422), Sorbitol (INS 420), Propylene Glycol (INS 1520); ITC-HS 2905.45, 2906, 3824. (Indian Kanoon Repository 2024)
Firming Agent	0.45	Maintain crispness, strengthen gel.	E: 0.50–0.70; M: 0.75–0.85	Crispness maintenance is pure technological function with no nutritional role, yet mineral source (calcium, aluminum) retains chemical specificity. “Firming agent (calcium chloride)” presents functional priority with residual material identity. Reflects chemical-functional dual identity: higher than plant-derived due to inorganic source irrelevance to biological origin, but capped by specific naming requirements. FSSAI Food Additives Regs 2011, Sch I class; Calcium chloride (INS 509), Calcium lactate (INS 327); ITC-HS 2827, 2833, 2834, 3824. (Indian Kanoon Repository 2024)
Raising Agent	0.48	Liberate gas, increase volume, leavening.	E: 0.45–0.65; M: 0.70–0.85	Gas liberation function clearly functional, but “baking soda” (sodium bicarbonate) retains common-name source identification in consumer discourse. Prepared baking powders (mixed leavening systems) in 3824 achieve higher functional abstraction. The F = 0.48 reflects equilibrium position: chemical naming standard, functional class permitted, consumer familiarity with source-based terms moderating full abstraction. FSSAI Food Products Standards and Food Additives Regulations, 2011, Schedule I “raising agent” class: sodium bicarbonate (INS 500(ii)), ammonium bicarbonate (INS 503(ii)), sodium acid pyrophosphate (INS 450(i)); ITC-HS 2836 (carbonates), 2835 (phosphates), 3824 (prepared baking powders). (Indian Kanoon Repository 2024)
Thickener	0.62	Increase viscosity, bodying agent, texturizing agent.	E: 0.70–0.94; M: 0.60–0.96	Mandatory functional class declaration elevates F decisively. “Thickener (xanthan gum)” or “Thickener (INS 415)” presents function primary, source secondary. However, source variability within class (plant gums, animal proteins, modified starches, synthetic polymers) prevents complete abstraction: specific identification retains traceability to origin. HS migration to Chapter 35/39 for modified/cellulosic materials supports elevated F, but native gums in Chapter 13 maintain moderate source linkage. The F = 0.62 captures this regulatory-driven functional priority with residual source significance. FSSAI Regulation 5(5): mandatory functional class declaration with INS number; Food Products Standards and Food Additives Regulations, 2011, Schedule I “thickener” class; ITC-HS 1302 (vegetable saps and extracts), 3505 (modified starches), 3912 (cellulose ethers), 3824. (Legitquest Legal Database 2024)
Stabilizer	0.65	Maintain dispersion, prevent sedimentation, foam stabilizer.	E: 0.75–0.90; M: 0.70–0.89	Dispersion maintenance is more technologically specific than thickening — requires kinetic stability, not just viscosity. Broader source variability (vegetable extracts, microbial products, synthetic polymers) supports higher F than thickeners. “Stabilizer” declaration standard with optional source parenthetical; HS chemical-product placement common. The F = 0.65 reflects stronger functional dominance due to specialized technological application and greater source heterogeneity within class. FSSAI Regulation 5(5): mandatory functional class declaration; Food Products Standards and Food Additives Regulations, 2011, Schedule I “stabilizer” class; ITC-HS 1302 (seaweed extracts), 3504 (peptones, protein substances), 3824 (prepared blends). (Legitquest Legal Database 2024)
Gelling Agent	0.68	Gel formation, structure provider.	E: 0.70–0.89; M: 0.75–0.90	Gel formation is definitive functional transformation of food matrix — creates novel physical structure not present in starting materials. “Gelling agent” declaration standard with source parenthetical (“gelling agent (pectin)”). Gelatin (animal-derived) faces source-disclosure constraints from Ram Gaua Raksha Dal (Lalitha 2026b), capping its effective F; plant-derived gelling agents achieve higher functional abstraction. The F = 0.68 reflects strong functional dominance with residual source significance for protein-based gels. FSSAI Regulation 5(5): mandatory functional class declaration; Food Products Standards and Food Additives Regulations, 2011, Schedule I “gelling agent” class: gelatin, agar, pectin, carrageenan, gellan gum; ITC-HS 3503 (gelatin), 1302, 3824. (Legitquest Legal Database 2024; High Court of Delhi 2021)
Foaming Agent	0.72	Form gas dispersion, whipping agent.	E: 0.75–0.95; M: 0.78–0.90	Gas dispersion for volume expansion is highly technical function — requires precise surface-activity, film-forming, gas-retention properties. Foaming power, not source, defines quality: egg white, soy protein, synthetic surfactants functionally equivalent at specified performance levels. Elevated F reflects specialized technological application and performance-based selection criteria. Protein-based foaming agents retain slight source linkage (egg, soy), synthetic alternatives achieve higher abstraction. FSSAI Regulation 5(5): mandatory functional class declaration; Food Products Standards and Food Additives Regulations, 2011, Schedule I “foaming agent” class: albumen, quillaia extract, synthetic surfactants; ITC-HS 3502 (albumin, egg white), 1302 (saponins — quillaia), 3402 (organic surface-active agents — synthetic), 3824. (Legitquest Legal Database 2024)
Emulsifier	0.82	Form emulsion, maintain emulsion, prevent fat separation.	E: 0.82–0.94; M: 0.78–0.96	The Emulsifier class exemplifies the E-M-F tie-breaker function. Lecithin: E ≈ 0.89 (solvent extraction, fractionation, drying), M ≈ 0.89 (phospholipid concentrate) — conditions suggesting ambiguous identity. Yet regulatory practice mandates “emulsifier (lecithin)” or “emulsifier (INS 322)” declaration, with ITC-HS placement in 2923.20 (chemical products) rather than 1516 (modified fats). Functional class primary, source parenthetical or absent. The F = 0.82 captures this regulatory-naming resolution of E-M ambiguity. Mono-/diglycerides similarly achieve F ≈ 0.82 through additive-schedule classification and prepared-additive HS placement, despite fat-derived origin. The lecithin vs. fractionated olein comparison illuminates F’s decisive role: nearly identical E-M, radically different F due to regulatory classification divergence. FSSAI Regulation 5(5): mandatory functional class declaration; Food Products Standards and Food Additives Regulations, 2011, Schedule I “emulsifier” class: lecithins (INS 322), mono-/diglycerides (INS 471), polysorbates (INS 432–436), sucrose esters (INS 473–474); ITC-HS 2923.20, 3824, 3402. (Legitquest Legal Database 2024)
Anticaking Agent	0.85	Prevent clumping, improve flow, anti-stick agent.	E: 0.60–0.85; M: 0.80–0.95	Flow improvement is purely technical function with no nutritional, sensory, or structural role in final product. Source completely irrelevant to application: silicon dioxide from sand or synthetic, calcium silicate from mineral or industrial process — functionally equivalent. “Anticaking agent” declaration standard with chemical name or INS number; no source indication required or expected. High F reflects total identity divorce from biological origin. FSSAI Regulation 5(5): mandatory functional class declaration; Food Products Standards and Food Additives Regulations, 2011, Schedule I “anticaking agent” class: silicon dioxide (INS 551), calcium silicate (INS 552), magnesium carbonate (INS 504(i)), various phosphates; ITC-HS 2811 (silicon dioxide), 2835 (phosphates), 3824 (prepared anticaking preparations). (Legitquest Legal Database 2024)
Acidity Regulator	0.87	Control pH, acidifier, buffering agent, alkali.	E: 0.55–0.90; M: 0.70–0.90	pH control is chemically precise function — requires defined acid/base strength, buffer capacity, taste profile. Organic acids may retain nominal source linkage (citric “from fermentation,” lactic “from dairy”), but regulatory classification by chemical structure dominates. Synthetic production and functional-class declaration achieve near-complete abstraction. The F = 0.87 reflects very high functional dominance with minimal residual source significance. FSSAI Regulation 5(5): mandatory functional class declaration; Food Products Standards and Food Additives Regulations, 2011, Schedule I “acidity regulator” class: citric acid (INS 330), lactic acid (INS 270), phosphoric acid (INS 338), acetic acid (INS 260), various salts; ITC-HS 2915 (saturated acyclic monocarboxylic acids), 2918 (carboxylic acids with additional oxygen functions), 2835 (phosphates), 3824. (Legitquest Legal Database 2024)
Antioxidant	0.88	Prevent oxidation, prevent rancidity, antibrowning.	E: 0.60–0.95; M: 0.75–0.98	Oxidation prevention is chemically specific function — free radical scavenging, metal chelation, oxygen absorption — mechanism-dependent, not source-dependent. Synthetic antioxidants (BHA, BHT, TBHQ) achieve complete source abstraction; natural alternatives (tocopherols, rosemary extract) retain slight source linkage moderating class average. “Antioxidant” declaration standard with specific name or INS number; mechanism of action primary, origin secondary. The F = 0.88 reflects near-complete functional dominance with chemical-mechanistic specificity. FSSAI Regulation 5(5): mandatory functional class declaration; Food Products Standards and Food Additives Regulations, 2011, Schedule I “antioxidant” class: BHA (INS 320), BHT (INS 321), TBHQ (INS 319), tocopherols (INS 307), ascorbic acid (INS 300), rosemary extract (INS 392); ITC-HS 2907 (phenols), 2918 (carboxylic acids), 3824. (Legitquest Legal Database 2024)
Preservative	0.89	Inhibit microbes, retard fermentation, antimycotic.	E: 0.60–0.90; M: 0.75–0.95	Microbial inhibition is safety-critical function with strict regulatory control: maximum permitted levels, prohibited food categories, specific labelling requirements. Preservative efficacy independent of source: benzoic acid from gum benzoin or synthetic, sorbic acid from rowan berries or petrochemical — toxicologically and functionally equivalent. “Preservative” declaration with specific name/INS number; safety profile and antimicrobial spectrum primary, origin irrelevant. The F = 0.89 reflects maximum functional dominance for food-safety-critical additives, with regulatory-driven identity. FSSAI Regulation 5(5): mandatory functional class declaration; Food Products Standards and Food Additives Regulations, 2011, Schedule I “preservative” class: benzoic acid/sodium benzoate (INS 210–211), sorbic acid/potassium sorbate (INS 200–202), propionic acid/calcium propionate (INS 280–282), sulfur dioxide (INS 220), nisin (INS 234), natamycin (INS 235); ITC-HS 2916 (unsaturated monocarboxylic acids), 2918 (carboxylic acids), 3824 (prepared preservative systems). (Legitquest Legal Database 2024)
Antifoaming Agent	0.90	Prevent foaming, reduce surface tension.	E: 0.70–0.95; M: 0.85–0.98	Foam prevention is highly specialized industrial function — used at ppm levels in processing, no consumer-perceptible presence in final product. Silicone-based, mineral oil, polyglycol antifoams chemically defined with no meaningful biological source. “Antifoaming agent” declaration standard; process optimization criteria (temperature stability, dispersion, efficacy) sole selection factors. The F = 0.90 reflects near-total functional abstraction for processing-aid category. FSSAI Regulation 5(5): mandatory functional class declaration; Food Products Standards and Food Additives Regulations, 2011, Schedule I “antifoaming agent” class: dimethylpolysiloxane (INS 900a), mineral oil (INS 905a), various fatty acid esters; ITC-HS 3910 (silicones), 2710 (mineral oils), 3824 (prepared antifoaming compositions). (Legitquest Legal Database 2024)
Sequestrant	0.91	Bind metal ions, control oxidation catalyst.	E: 0.75–0.95; M: 0.80–0.98	Metal ion binding is precise chemical mechanism — stability constants, chelation kinetics, pH dependence define performance. EDTA, citrates, polyphosphates chemically synthesized; no biological source relevant. “Sequestrant” declaration with chemical specificity; chelating capacity primary, molecular structure secondary, origin absent. The F = 0.91 reflects advanced tool-identity for mechanistically specialized function. FSSAI Regulation 5(5): mandatory functional class declaration; Food Products Standards and Food Additives Regulations, 2011, Schedule I “sequestrant” class: calcium disodium EDTA (INS 385), disodium EDTA (INS 386), various citrates, phosphates, polyphosphates; ITC-HS 2917 (polycarboxylic acids), 2922 (oxygen-function amino-compounds), 2835 (phosphates), 3824. (Legitquest Legal Database 2024)
Bleaching Agent	0.92	Decolorize food, flour bleaching.	E: 0.80–0.95; M: 0.85–0.98	Decolorization is aggressive chemical intervention — oxidative destruction of pigments, not nutritional or sensory contribution. Bleaching agents not consumed as food but as processing aids; residues minimized or removed. “Bleaching agent” or “flour treatment agent (bleaching)” declaration; chemical reactivity primary, source irrelevant. The F = 0.92 reflects processing-tool status with no food-component identity. FSSAI Regulation 5(5): mandatory functional class declaration; Food Products Standards and Food Additives Regulations, 2011, Schedule I “bleaching agent” class: benzoyl peroxide (INS 928), chlorine dioxide, sulfur dioxide (INS 220); ITC-HS 2815 (inorganic bases), 2820 (manganese oxides), 3824. (Legitquest Legal Database 2024)
Flour Treatment Agent	0.93	Improve baking quality, dough conditioner, dough strengthener.	E: 0.70–0.90; M: 0.75–0.90	Dough conditioning is exquisitely application-specific — rheology modification, gluten development, fermentation control for bread quality optimization. Treatment agents transform flour functionality without becoming part of final product identity; enzymatic action consumed in processing. “Flour treatment agent” declaration with specific agent; technological outcome (dough properties) primary, chemical/enzymatic mechanism secondary, source absent. The F = 0.93 reflects extreme functional specialization for bakery-processing optimization. FSSAI Regulation 5(5): mandatory functional class declaration; Food Products Standards and Food Additives Regulations, 2011, Schedule I “flour treatment agent” class: ascorbic acid (INS 300), L-cysteine (INS 920), various enzymes (amylases, proteases, xylanases), azodicarbonamide; ITC-HS 2936 (vitamins), 2930 (sulfur-organic compounds), 3507 (enzymes), 3824. (Legitquest Legal Database 2024)
Carrier	0.94	Dissolve additive, dilute nutrient, encapsulating agent.	E: 0.60–0.95; M: 0.70–0.96	The Carrier function represents meta-functional identity: the carrier’s purpose is to enable function of other ingredients — dissolution, dispersion, encapsulation, controlled release. Maltodextrin, modified starches, oils, glycerol as carriers: source (corn, wheat, palm, soy) irrelevant to carrier function; delivery performance (solubility, viscosity, compatibility) sole criteria. “Carrier” declaration with optional specific material; technological service function completely eclipses material identity. The F = 0.94 is second-highest assigned score, reflecting near-complete functional abstraction. Critical for lipid crossing-point analysis: when vegetable fat becomes “carrier,” F elevates from 0.22 to 0.94 — the definitive functional re-casting. FSSAI Regulation 5(5): mandatory functional class declaration; Food Products Standards and Food Additives Regulations, 2011, Schedule I “carrier” class: starches, maltodextrins, oils, water, propylene glycol, various gums; ITC-HS 3824 (prepared carriers), with specific materials in 1106, 1520, 2905 depending on form. (Legitquest Legal Database 2024)
Propellant	0.95	Expel food from container.	E: 0.60–0.85; M: 0.85–0.98	Food expulsion is purely mechanical/physical function — pressure, expansion, flow properties define performance. Gaseous state, no nutritional function, chemically defined: nitrous oxide (N₂O), carbon dioxide (CO₂), nitrogen (N₂) identified by molecular formula and physical properties, not biological origin. “Propellant” declaration with chemical name or INS number; pressure-temperature behavior primary, chemical identity secondary, source completely absent. The F = 0.95 is maximum assigned score, reflecting complete source abstraction and pure tool-identity. FSSAI Regulation 5(5): mandatory functional class declaration; Food Products Standards and Food Additives Regulations, 2011, Schedule I “propellant” class: nitrous oxide (INS 942), carbon dioxide (INS 290), nitrogen (INS 941), various hydrocarbons (INS 943–945); ITC-HS 2811 (inorganic acids and oxygen compounds of non-metals), 2711 (petroleum gases), 3824. (Legitquest Legal Database 2024)
Packaging Gas	0.95	Modified atmosphere, prevent oxidation in pack.	E: 0.55–0.75; M: 0.85–0.98	Modified atmosphere preservation is environmental control function: oxygen exclusion, carbon dioxide antimicrobial effect, inert gas displacement protect food quality. Gas identity determined by chemical/physical properties (nitrogen inertness, carbon dioxide solubility, argon density), not by biological origin. Elemental gases (atmospheric, cryogenic, synthetic) functionally equivalent; “packaging gas” declaration with specific gas; atmosphere composition primary, gas source irrelevant. The F = 0.95 matches Propellant as maximum score, reflecting complete functional abstraction from biological matrix. FSSAI Regulation 5(5): mandatory functional class declaration; Food Products Standards and Food Additives Regulations, 2011, Schedule I “packaging gas” class: nitrogen (INS 941), carbon dioxide (INS 290), argon (INS 938), oxygen (INS 948); ITC-HS 2804 (nitrogen, argon, oxygen), 2811 (carbon dioxide), 3824 (prepared atmosphere mixtures). (Legitquest Legal Database 2024)
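
Table 3 can be treated as a lookup. The sketch below is a hedged illustration rather than the model's implementation: it copies the F scores from the table, plus the typical E and M ranges for three classes, and reproduces three statements made in the rows above. These are the jump from 0.22 to 0.94 when a vegetable fat is declared as a carrier, the ranking of Carrier just below Propellant and Packaging Gas, and the placement of lecithin (E ≈ 0.89, M ≈ 0.89) inside the Emulsifier context. The names `F_SCORE`, `TYPICAL_EM`, `f_shift`, `in_typical_context`, and `classes_at_rank` are hypothetical, and the inclusive range bounds are an assumption.

```python
# F scores copied from Table 3 (functional class -> F).
F_SCORE = {
    "Base Ingredient": 0.12, "Taste Profile": 0.18, "Lipid Base": 0.22,
    "Bulking Agent": 0.38, "Humectant": 0.42, "Firming Agent": 0.45,
    "Raising Agent": 0.48, "Thickener": 0.62, "Stabilizer": 0.65,
    "Gelling Agent": 0.68, "Foaming Agent": 0.72, "Emulsifier": 0.82,
    "Anticaking Agent": 0.85, "Acidity Regulator": 0.87, "Antioxidant": 0.88,
    "Preservative": 0.89, "Antifoaming Agent": 0.90, "Sequestrant": 0.91,
    "Bleaching Agent": 0.92, "Flour Treatment Agent": 0.93, "Carrier": 0.94,
    "Propellant": 0.95, "Packaging Gas": 0.95,
}

# Typical ((E low, E high), (M low, M high)) for a few classes, from Table 3.
TYPICAL_EM = {
    "Base Ingredient": ((0.12, 0.82), (0.05, 0.78)),
    "Lipid Base": ((0.32, 0.92), (0.70, 0.75)),
    "Emulsifier": ((0.82, 0.94), (0.78, 0.96)),
}


def f_shift(old_class, new_class):
    """Change in F when an ingredient is re-declared under another class."""
    return round(F_SCORE[new_class] - F_SCORE[old_class], 2)


def in_typical_context(functional_class, e, m):
    """True if (e, m) lies inside the class's typical ranges.

    Bounds are treated as inclusive, which is an assumption of this sketch.
    """
    (e_lo, e_hi), (m_lo, m_hi) = TYPICAL_EM[functional_class]
    return e_lo <= e <= e_hi and m_lo <= m <= m_hi


def classes_at_rank(rank):
    """Classes holding the rank-th highest distinct F value (1 = highest)."""
    value = sorted(set(F_SCORE.values()), reverse=True)[rank - 1]
    return sorted(c for c, f in F_SCORE.items() if f == value)


print(f_shift("Lipid Base", "Carrier"))              # vegetable fat as carrier
print(classes_at_rank(1), classes_at_rank(2))        # top two F values
print(in_typical_context("Emulsifier", 0.89, 0.89))  # lecithin
print(in_typical_context("Base Ingredient", 0.82, 0.78))  # soy protein isolate
```

Keeping the scores in one mapping makes it easy to re-run such checks whenever a class score or range is revised.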

References

Abdul Wahab, Siti et al. 2023. “Palm Oil: Processing, Characterization and Utilization in the Food Industry.” Frontiers in Nutrition. https://pmc.ncbi.nlm.nih.gov/articles/PMC10122035/.

American Oil Chemists’ Society (AOCS). 2024. Hydrogenation in Practice. Technical resource page. https://www.aocs.org/resource/hydrogenation-in-practice/.

Boukhenfa, Hana et al. 2022. “Towards Substitution of Hexane as Extraction Solvent of Food Products: A Review.” Foods. https://pmc.ncbi.nlm.nih.gov/articles/PMC9655691/.

Codex Alimentarius Commission. 1989. General Standard for Soy Protein Products (CXS 175-1989). Codex standard. https://www.fao.org/input/download/standards/325/CXS_175e.pdf.

Codex Alimentarius Commission. 2015. Standard for Edible Fats and Oils Not Covered by Individual Standards (CXS 19-1981). Codex standard. https://www.fao.org/input/download/standards/74/CXS_019e_2015.pdf.

Codex Alimentarius Commission (FAO/WHO). 2024. Standard for Edible Fats and Oils Not Covered by Individual Standards (CXS 19-1981). Official PDF. https://workspace.fao.org/sites/codex/Standards/CXS%2019-1981/CXS_019e.pdf.

DGCI&S. 2007a. Indian Trade Classification (h.s.): Chapter 15 — Animal or Vegetable Fats and Oils and Their Cleavage Products; Prepared Edible Fats; Animal or Vegetable Waxes. Official tariff schedule. https://www.dgciskol.gov.in/Writereaddata/Downloads/CHP_15.pdf.

DGCI&S. 2007b. Indian Trade Classification (h.s.): Chapter 35 — Albuminoidal Substances; Modified Starches; Glues; Enzymes. Official tariff schedule. https://dgciskol.gov.in/Writereaddata/Downloads/2007/CHP_35.pdf.

Directorate General of Commercial Intelligence and Statistics (DGCI&S). 2007. Indian Trade Classification (h.s.): Chapter 11 — Products of the Milling Industry; Malt; Starches; Inulin; Wheat Gluten. Official tariff schedule. https://www.dgciskol.gov.in/Writereaddata/Downloads/2007/CHP_11.pdf.

Directorate General of Commercial Intelligence and Statistics (DGCI&S). 2022. Indian Trade Classification (Harmonised System) - ITC(HS) 2022. Ministry of Commerce and Industry, Government of India.

Directorate General of Commercial Intelligence and Statistics (DGCI&S), Government of India. 2007a. Indian Trade Classification (h.s.): Chapter 07 — Edible Vegetables and Certain Roots and Tubers. Official PDF. https://www.dgciskol.gov.in/Writereaddata/Downloads/2007/CHP_07.pdf.

Directorate General of Commercial Intelligence and Statistics (DGCI&S), Government of India. 2007b. Indian Trade Classification (h.s.): Chapter 11 — Products of the Milling Industry; Malt; Starches; Inulin; Wheat Gluten. Official PDF. https://www.dgciskol.gov.in/Writereaddata/Downloads/2007/CHP_11.pdf.

Directorate General of Commercial Intelligence and Statistics (DGCI&S), Government of India. 2007c. Indian Trade Classification (h.s.): Chapter 15 — Animal or Vegetable Fats and Oils and Their Cleavage Products; Prepared Edible Fats; Animal or Vegetable Waxes. Official PDF. https://www.dgciskol.gov.in/Writereaddata/Downloads/CHP_15.pdf.

Directorate General of Commercial Intelligence and Statistics (DGCI&S), Government of India. 2007d. Indian Trade Classification (h.s.): Chapter 22 — Beverages, Spirits and Vinegar. Official PDF. https://www.dgciskol.gov.in/Writereaddata/Downloads/CHP_22.pdf.

Directorate General of Commercial Intelligence and Statistics (DGCI&S), Government of India. 2007e. Indian Trade Classification (h.s.): Chapter 35 — Albuminoidal Substances; Modified Starches; Glues; Enzymes (Effective from 1 April 2007). Official PDF. https://dgciskol.gov.in/Writereaddata/Downloads/2007/CHP_35.pdf.

FAO/WHO Codex GSFA. 2025. Acetylated Distarch Adipate (INS 1422) — GSFA Food Additive Details. Online GSFA database. https://www.fao.org/gsfaonline/additives/details.html?id=152.

FAO/WHO Joint Expert Committee on Food Additives. 2016. Acetylated Distarch Adipate — JECFA Specification Monograph. FAO JECFA Monographs 19. https://openknowledge.fao.org/server/api/core/bitstreams/130fb981-1d76-4ac8-9485-11006095eb19/content.

Food Safety and Standards Authority of India (FSSAI). 2022. Food Safety and Standards (Food Products Standards and Food Additives) Regulations, 2011 – Compendium (18[3.1: Food Additives]). Official PDF. https://www.fssai.gov.in/upload/uploadfiles/files/Compendium_Food_Additives_Regulations_20_12_2022.pdf.

Food Safety and Standards Authority of India (FSSAI). 2023b. Food Safety and Standards (Labelling and Display) Regulations, 2020 (Version-VI, 22.02.2023). Official gazette compilation. https://www.fssai.gov.in/upload/uploadfiles/files/Comp_Labelling.pdf.

Food Safety and Standards Authority of India (FSSAI). 2023a. Food Safety and Standards (Labelling and Display) Regulations, 2020 (Version-VI, 22.02.2023). Official PDF. https://www.fssai.gov.in/upload/uploadfiles/files/Comp_Labelling.pdf.

Food Safety and Standards Authority of India (FSSAI). 2024. Food Product Standards and Food Additives: Chapter 3 — Substances Added to Food (Version 2, 04.11.2024). Official PDF. https://fssai.gov.in/upload/uploadfiles/files/Chapter%203_Substances%20added%20to%20food.pdf.

Food Safety and Standards Authority of India (FSSAI). 2025a. Food Product Standards: Chapter 2.1 Dairy Products and Analogues. Official PDF. https://www.fssai.gov.in/upload/uploadfiles/files/Chapter%202_1%20(Dairy%20products%20and%20analogues).pdf.

Food Safety and Standards Authority of India (FSSAI). 2025b. Food Product Standards: Chapter 2.1 Dairy Products and Analogues (Version 3, 07.05.2025). Official PDF. https://www.fssai.gov.in/upload/uploadfiles/files/Chapter%202_1_Dairy_products_and_analogues.pdf.

General Standard for Edible Fats and Oils Not Covered by Individual Standards. 1981. Codex standard CXS 19-1981.

Government of India (hosted on FSSAI website). 1967. The Solvent Extracted Oil, De-Oiled Meal and Edible Flour (Control) Order, 1967 (as Uploaded). Official PDF. https://fssai.gov.in/upload/uploadfiles/files/solvent-Extracted.pdf.

High Court of Delhi. 2021. Ram Gaua Raksha Dal Vs. Union of India & Ors. Order. https://indiankanoon.org/doc/189442159/.

Indian Kanoon Repository. 2024. Compilation of Food Additive Functional Classes and Statutory Definitions. https://indiankanoon.org/.

Joint FAO/WHO Expert Committee on Food Additives (JECFA). 1974. Acetylated Distarch Adipate — WHO Food Additives Series 17. InChem monograph. https://inchem.org/documents/jecfa/jecmono/v17je12.htm.

Lalitha, A. R. 2026a. Identity, Transformation, and Function: A Tri-Axial Model for the Classification of Food Ingredient Identity. Interdisciplinary Systems Research Lab.

Lalitha, A. R. 2026b. Indian Supreme Court Defines Hierarchical Classification for Food Products: Overruling Common Parlance Precedents. Interdisciplinary Systems Research Lab.

Legitquest Legal Database. 2024. FSSAI Statutory Mapping for Ingredient Nomenclature. https://www.legitquest.com/.

Mozaffarian, Dariush et al. 2011. “Trans Fats—Sources, Health Risks and Alternative Approach: A Review.” Journal of Food Science and Technology. https://pmc.ncbi.nlm.nih.gov/articles/PMC3551118/.

National Center for Biotechnology Information. 2021. Vanillin; CID 1183. PubChem Compound summary. https://pubchem.ncbi.nlm.nih.gov/compound/1183.

National Center for Biotechnology Information (NCBI). 2025a. PubChem Compound Summary for CID 1183: Vanillin. Database record. https://pubchem.ncbi.nlm.nih.gov/compound/1183.

National Center for Biotechnology Information (NCBI). 2025b. PubChem Compound Summary: Sodium Glycolate (Sodium Hydroxyacetate). Database record. https://pubchem.ncbi.nlm.nih.gov/compound/Sodium%20hydroxyacetate.

NCBI. 2022. Sodium Glycolate / Sodium Hydroxyacetate. PubChem and supplier data. https://www.chembk.com/en/chem/Sodium%20hydroxyacetate.

Pintado, M. E. et al. 2015. “Improved Functional Characteristics of Whey Protein Hydrolysates in Food Applications.” Food Technology and Biotechnology, 231–42. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4662358/.

Rodilla, Jesus M., Tiago Rosado, and Eugenia Gallardo. 2024. “Essential Oils: Chemistry and Food Applications.” Foods 13 (4): 1–24. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11011311/.

Schaefer, Kevin et al. 2025. “Maillard Reaction: Mechanism, Influencing Parameters, and Relevance in Food Processing.” Molecules. https://pmc.ncbi.nlm.nih.gov/articles/PMC12154226/.

World Customs Organization. 2002. Harmonized System Explanatory Notes: Chapter 11 — Products of the Milling Industry; Malt; Starches; Inulin; Wheat Gluten. Explanatory Notes. https://www.wcoomd.org/-/media/wco/public/global/pdf/topics/nomenclature/instruments-and-tools/hs-nomenclature-older-edition/2002/11.pdf.

Yun, Rong et al. 2024. “Vinegar: A Review of the Microbiology, Biochemistry and Quality Aspects.” Food Research International. https://pmc.ncbi.nlm.nih.gov/articles/PMC11312487/.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@report{a_r2026,
  author = {A R, Lalitha},
  publisher = {iSRL},
  title = {Justification {Companion} to {EMF-Scoring} {Model}},
  number = {iSRL-26-02-D-EMFJustify},
  date = {2026-02-20},
  url = {https://isrl.in/pub/2026-02-d-emfjustify/},
  doi = {10.5281/zenodo.18713318},
  langid = {en}
}

For attribution, please cite this work as:

A R, Lalitha. 2026. Justification Companion to EMF-Scoring Model. iSRL-26-02-D-EMFJustify. iSRL. https://doi.org/10.5281/zenodo.18713318.

Identity, Transformation, and Function: A Tri-Axial Model for the Classification of Food Ingredient Identity

Lalitha A R — Fri, 20 Feb 2026 00:00:00 GMT

Food ingredient classification in India confronts a structural problem that neither label standardisation nor taxonomy alone resolves: the same substance appears under dozens of names across regulatory filings, procurement systems, and consumer labels, while substances that share a name may differ in ways that determine their legal status, tax bracket, and nutritional profile. This report proposes the E–M–F Tri-Axial Identity Model as a principled, evidence-grounded framework for assigning a determinate identity position to any food ingredient. The three axes measure, respectively, the invasiveness of the transformation pathway (Anthropogenic Energy, E), the degree of departure from the original biological matrix (Matter, M), and the degree to which technological function governs regulatory naming and trade classification rather than biological origin (Function, F). From these three coordinates, a composite Divorce Score is derived that partitions ingredients into three operationally meaningful zones: variants of a biological source, independent canonical entities, and functional tools whose identity is defined by role rather than origin. The framework is grounded in existing Indian regulatory instruments, namely the FSSAI Labelling and Display Regulations 2020, the Food Products Standards and Food Additives Regulations 2011, and the Indian Trade Classification (Harmonised System), and is validated against judicial reasoning from the Supreme Court of India and the Delhi High Court. A 35-item benchmark tests the discriminatory power of the model and provides a replicable standard for future refinements. The model provides the deterministic ingredient-level substrate on which product-level food classification frameworks can operate with greater precision and consistency.

1 The Ingredient Identity Problem

1.1 The Multiplicity That Is Not Noise

A survey of 896 stock-keeping units drawn from Indian retail channels—part of this project’s commercial sampling work, the full methodology for which will be documented in a forthcoming report—yielded 7,563 distinct ingredient strings after comma-splitting label text into individual units. This commercial sample was reconciled against the Open Food Facts India dataset (Open Food Facts contributors 2024), which contributes a further 19,748 products from a different collection pathway; across the combined 4,800 deduplicated products, splitting by comma and conjunction produces approximately 48,000 variant strings in total. The two sources are methodologically distinct and are treated as such throughout this project.
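The splitting step described above can be sketched as follows. This is a minimal illustration rather than the project's actual pipeline: the function name `split_ingredients`, the choice of "and" and "&" as the conjunctions, and the sample label are all assumptions, and real labels also contain nested parentheses and percentages that this sketch does not handle.

```python
import re

# Hypothetical separator pattern: commas plus the conjunctions "and" / "&".
# Which conjunctions the project actually splits on is an assumption here.
_SEPARATORS = re.compile(r"\s*(?:,|&|\band\b)\s*", flags=re.IGNORECASE)

def split_ingredients(label_text: str) -> list[str]:
    """Split raw label text into lower-cased, whitespace-trimmed strings."""
    parts = _SEPARATORS.split(label_text)
    return [part.strip().lower() for part in parts if part.strip()]

sample = "Red Chilli Powder, Dry Mango Powder and Salt"
print(split_ingredients(sample))
```

Note that a splitter of this kind also breaks class titles such as "spices and condiments" into two strings, which is one reason the variant count grows so quickly.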

These strings do not represent 7,563 different substances, let alone 48,000. Preliminary reconciliation identified a far smaller number of underlying biological entities. The multiplicity is primarily linguistic: different names, transliterations, regulatory phrasings, and brand conventions applied to the same or closely related ingredients.

This is not a data-quality failure. It is a structural feature of how a linguistically and culturally diverse food system interacts with labelling frameworks designed for narrower ranges of variation. A manufacturer in Tamil Nadu printing inji on a label is not in error. A regulatory filing recording the same substance as ginger (Zingiber officinale) is not wrong either. A procurement system listing it as dried ginger root is recording something real. The problem emerges when these representations must interoperate—for compliance verification, supply chain monitoring, allergen tracking, or nutritional research—and no coordination layer exists to establish that they refer to the same thing.

The FSSAI Labelling and Display Regulations 2020 permit ingredient declaration in regional languages and do not mandate a single canonical term for most ingredients.¹ This is appropriate policy. Forcing convergence on a single English-language term would impose a linguistic uniformity that serves neither consumers nor the regulatory objective of communicating the true nature of food. The problem is not the diversity; it is the absence of a coordination structure beneath it.

¹ FSSAI Labelling Regulations 2020, Regulation 4(1).

1.2 The Scale of Variation

Two ingredient categories from the reconciliation process illustrate the practical range. The following strings were recovered from product labels and regulatory filings as distinct entries—each referring, in whole or in part, to a common biological source.

Chilli (Capsicum spp.)—a representative sample

chilli; chilli powder; chilli flakes; red chilli; red chilli powder; red chillies; dry red chilli; green chilli; green chilli paste; green chilli puree; kashmiri chilli; kashmiri lal mirch; mathania red chilli powder; spices and condiments—chilli; spices and condiments—red chilli powder; spices and condiments—kashmiri red chilli powder; ground spices and condiments—dry red chilli; mixed spices—red chilli flakes; extracts and oils—red chilli; chilli extract; chilli red; red chilly; red chilly powder.

Mango (Mangifera indica)—a representative sample

mango; mango pulp; mango puree; mango powder; dry mango powder; dried mango; mango bits; mango juice; kesar mango pulp; alphonso mango pulp; concentrated mango pulp; dehydrated mango puree; mango puree concentrate; mango solids; spices and condiments—amchur; spices and condiments—dried mango powder; fruit powder blend—mango; mango flavouring; raw mango flavouring; tropical juice powder—mango.

These samples—spanning raw forms, dried forms, powders, pastes, purees, concentrates, extracts, flavourings, and regional variety names—illustrate the problem precisely. A compliance system encountering “mathania red chilli powder” and “chilli powder” as separate entries has no basis for determining whether they represent the same ingredient, variants of the same ingredient that differ in a legally relevant way, or distinct ingredients with different regulatory implications. The same ambiguity applies across thousands of ingredient pairs in the dataset.

1.3 Why This Matters Beyond Nomenclature

The stakes of ingredient identity extend well beyond labelling consistency. Three domains illustrate the practical consequences of unresolved identity.

Allergen disclosure. The FSSAI Labelling Regulations require mandatory declaration of common allergens, including cereals containing gluten, peanuts, soybeans, milk, and tree nuts.² Accurate allergen tracking requires that “besan,” “gram flour,” and “chickpea flour” be recognised as referring to the same substance, and that “refined wheat flour” and “maida” be treated as the same allergen source. A system processing these as distinct strings produces false negatives in allergen searches.

² FSSAI Labelling Regulations 2020, Regulation 5(14).
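The false-negative problem in allergen searches can be illustrated with a small sketch. The lookup tables, the canonical identifiers, and both function names below are hypothetical and exist only for this example; they are not the project's data or API.

```python
# Hypothetical coordination table: variant string -> canonical identifier.
CANON = {
    "maida": "wheat_flour_refined",
    "refined wheat flour": "wheat_flour_refined",
    "besan": "chickpea_flour",
    "gram flour": "chickpea_flour",
    "chickpea flour": "chickpea_flour",
}

# Hypothetical map: canonical identifier -> declared allergen class.
ALLERGEN = {"wheat_flour_refined": "cereals containing gluten"}

def contains_allergen_naive(ingredients, term):
    """Plain substring match: misses variants not spelled like `term`."""
    return any(term in item.lower() for item in ingredients)

def contains_allergen_canonical(ingredients, allergen_class):
    """Resolve each string to its canon before checking the allergen class."""
    return any(
        ALLERGEN.get(CANON.get(item.strip().lower())) == allergen_class
        for item in ingredients
    )

label = ["Maida", "Sugar", "Besan"]
print(contains_allergen_naive(label, "wheat"))
print(contains_allergen_canonical(label, "cereals containing gluten"))
```

On this label the string search for "wheat" finds nothing, while the canonical lookup resolves "Maida" to a gluten-containing cereal source.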

Trade classification and taxation. The Indian Trade Classification (Harmonised System) assigns different tariff headings to ingredients on the basis of processing state and functional role. Mango pulp (HS 0804) and dried mango powder (HS 0813) are classified differently and attract different duties. Concentrated mango pulp may attract a different heading again depending on Brix value and processing method (Directorate General of Commercial Intelligence and Statistics 2007a). The financial and legal consequences of misclassification are direct and quantifiable.

Source declaration and religious or ethical compliance. The Delhi High Court, in Ram Gaua Raksha Dal v. Union of India, held that the obligation to declare the vegetarian or non-vegetarian status of food is independent of percentage or processing level, grounded in Articles 21 and 25 of the Constitution (Delhi High Court 2022). This principle requires that the biological origin of an ingredient remain traceable through processing transformations. A classification system that severs the link between a processed ingredient and its source—treating “casein” as a functional identifier with no required dairy origin disclosure—fails this requirement.

1.4 The Question This Report Addresses

These observations converge on a single question that has not been systematically answered for the Indian food system: given an ingredient string, what is its identity, and what is the principled basis for that determination?

The question carries three sub-questions that must be answered in sequence. First, what counts as a canonical entity—the basic unit of identity to which variant representations are attached? Second, when does a variant become sufficiently distinct to constitute a separate canon in its own right? Third, when has an ingredient been transformed so thoroughly that its identity is no longer governed primarily by its biological source but by the technological function it performs?

These are ontological questions. They cannot be answered by counting occurrences or applying string-matching heuristics. They require a framework grounded in scientific, regulatory, and legal reality that produces consistent, defensible determinations when applied to novel cases.

Section 2 documents a first attempt at the problem and shows where it falls short. Section 3 introduces the theoretical foundation that reorients the approach. Section 4 enumerates the ontological questions the framework must answer. Section 5 establishes the regulatory instruments serving as empirical ground truth. The remaining sections develop and validate the model, and Section 11 describes the next steps for applying it to the full variant corpus.

2 Why Flat Canonisation Fails

2.1 The Initial Approach

The natural first response to a multiplicity of ingredient strings is to collapse them. Given 7,563 strings and the reasonable expectation that they represent far fewer substances, the immediate goal was consolidation: assign each string to a canonical form, discard the variation, and produce a clean taxonomy.

This approach was implemented and produced a working taxonomy published as version 0.1 of the Encyclopedia of Indian Food Ingredients (Lalitha 2026a). That taxonomy served as a necessary first step: it demonstrated that automated consolidation was feasible, identified the problem’s boundaries, and surfaced the cases where flat consolidation produced results that were operationally and legally indefensible. The present report builds directly on what those cases revealed.

2.2 What Flat Canonisation Produces

Under a flat canonisation scheme, all variant strings for a given biological source are grouped under a single canonical label. The chilli variants listed in Section 1 would consolidate to “Chilli.” The mango variants would consolidate to “Mango.” The logic is appealing: one biological entity, one canonical name.

The problem becomes visible when the output is examined by stakeholders who depend on ingredient classifications for operational decisions.

For a food manufacturer seeking to claim a geographically indicated ingredient: “Mathania Red Chilli” is not interchangeable with “Chilli.” Mathania is a geographic indicator associated with a specific cultivar grown in the Jodhpur district of Rajasthan, recognised for its characteristic colour and moderate heat. A brand that sources this variety and wishes to communicate that fact on its label (a commercially and legally meaningful distinction) has no mechanism for doing so under a scheme that treats all chilli as one entity.

For a nutritional researcher or regulator: “Mango pulp” and “dehydrated mango powder” are not nutritionally equivalent. The former is a high-moisture preparation with a specific sugar profile; the latter has undergone water removal that concentrates all components and, depending on process conditions, may alter certain phytochemicals. A database recording both as “Mango” provides no basis for dietary assessment calculations that depend on moisture-adjusted nutrient values.

For a customs authority: “Mango flavouring” filed alongside “mango pulp” under a single canonical entity produces a tariff classification that is straightforwardly incorrect. Mango pulp falls under HS Chapter 08 (edible fruits); a synthetic mango flavouring may fall under Chapter 29 (organic chemicals) or Chapter 33 (essential oils and resinoids) depending on its composition. Filing them under the same canonical entity does not resolve the classification question; it conceals it.

For a food safety system tracking an allergen or contaminant: “Lecithin” and “soya lecithin” cannot be merged without losing source information that is required by law. The FSSAI Labelling Regulations and the reasoning in Ram Gaua Raksha Dal (Delhi High Court 2022) together establish that source disclosure for allergen-relevant ingredients is non-negotiable.

2.3 The Structural Flaw

Flat canonisation fails because it conflates two distinct problems requiring different solutions. The first is coordination: establishing that “chilli,” “red chilli,” and “lal mirchi” refer to the same underlying entity so that systems can interoperate. The second is identity preservation: maintaining the distinctions—geographic origin, processing state, form, biological source—that carry legal, nutritional, commercial, and cultural meaning.

A flat scheme solves the first problem by destroying the second. It achieves coordination at the cost of the very information that makes coordination useful. A brand filing its ingredient as “Chilli” and a brand filing it as “Kashmiri Lal Mirch” can now be linked in a database, but the database no longer records what distinguishes them—a distinction that may affect GST categorisation, GI protection claims, export certification, and consumer communication simultaneously.

The correct solution is a layered structure: a coordination layer linking all variant representations to a shared identifier, and an identity-preservation layer retaining the distinctions that matter. This is precisely the problem Shiyali Ramamrita Ranganathan addressed in information science nearly a century ago.
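The contrast between a flat scheme and a layered one can be made concrete with a small sketch. The record layout, the canon identifier "chilli", and the attribute keys are assumptions chosen for illustration only.

```python
from collections import defaultdict

# Illustrative records: (label string, shared canon id, preserved distinctions).
RECORDS = [
    ("chilli", "chilli", {}),
    ("kashmiri lal mirch", "chilli", {"origin": "Kashmir", "form": "dried"}),
    ("green chilli paste", "chilli", {"form": "paste"}),
]

# Flat canonisation: every string collapses to the canon; distinctions are lost.
flat = {label: canon for label, canon, _ in RECORDS}

# Layered structure: the coordination layer groups labels by canon id, and
# each label keeps its own attributes in the identity-preservation layer.
layered = defaultdict(dict)
for label, canon, attrs in RECORDS:
    layered[canon][label] = attrs

print(flat["kashmiri lal mirch"])               # only the canon survives
print(sorted(layered["chilli"]))                # all linked variants
print(layered["chilli"]["kashmiri lal mirch"])  # distinctions retained
```

Both structures answer the coordination question; only the layered one can still say what separates "kashmiri lal mirch" from plain "chilli".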

3 Ranganathan’s Faceted Classification

3.1 The Context of Its Creation

In 1933, the Indian mathematician and librarian S. R. Ranganathan published the first edition of Colon Classification (Ranganathan 1933). The problem he addressed was structurally similar to the one this report confronts: a body of knowledge so diverse and growing so rapidly that any fixed hierarchical scheme would be perpetually inadequate. The Dewey Decimal System, then dominant in library science, assigned each subject a fixed position in a single hierarchy. Works addressing multiple subjects simultaneously, or belonging to a subject not anticipated by the scheme’s designers, could not be accommodated without distorting the classification.

Ranganathan’s response was to abandon the single hierarchy and replace it with a set of independent analytical dimensions, which he called facets. A document could be described by its position on each facet independently, and its classification was the combination of those positions. The colon in “Colon Classification” is the separator between facets in the notation.

3.2 The PMEST Framework

Ranganathan identified five fundamental facets applicable across all fields of knowledge, designated PMEST: Personality, Matter, Energy, Space, and Time (Ranganathan 1967). These represent, respectively, the primary subject of a document, the materials or substances it involves, the processes or operations it describes, the geographic location it concerns, and the time period it covers.

The operational power of the framework lies in the independence of its facets. A document about the fermentation of rice in Karnataka in the nineteenth century can be precisely described by assigning positions on each facet—rice (Personality), fermentation (Energy), Karnataka (Space), nineteenth century (Time)—without requiring that the classification scheme anticipate this exact combination in advance. New combinations form by combining existing facet values; the scheme extends to novel cases without revision.

Adapted to the food domain, the analytical clarity is immediate. Consider three ingredients:

Kashmiri red chilli powder: Personality = chilli (Capsicum annuum); Matter = dried, powdered; Space = Kashmir.
Mathania red chilli, whole dried: Personality = chilli (Capsicum annuum); Matter = dried, whole; Space = Marwar (Rajasthan).
Green chilli paste: Personality = chilli (Capsicum annuum); Matter = raw, comminuted, high-moisture.

Under a flat scheme, all three are “Chilli.” Under a faceted scheme, all three share a Personality coordinate—sufficient to establish their relationship—while their distinct Matter and Space coordinates preserve the differences that matter. A fourth ingredient, a synthetic capsaicin extract used as a flavouring agent, would share a Personality relationship to chilli while carrying a very different processing history and a different functional identity. The framework accommodates this without modification.
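The three faceted descriptions above can be written as a small data structure. This is a sketch under stated assumptions: the class name `FacetedIngredient`, its field layout, and the helper functions are hypothetical, and only the Personality, Matter, and Space facets used in the examples are modelled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FacetedIngredient:
    label: str
    personality: str              # biological source
    matter: tuple[str, ...]       # physical state and form descriptors
    space: Optional[str] = None   # geographic facet, where one applies

kashmiri = FacetedIngredient(
    "Kashmiri red chilli powder", "Capsicum annuum", ("dried", "powdered"), "Kashmir"
)
mathania = FacetedIngredient(
    "Mathania red chilli, whole dried", "Capsicum annuum", ("dried", "whole"), "Marwar (Rajasthan)"
)
green_paste = FacetedIngredient(
    "Green chilli paste", "Capsicum annuum", ("raw", "comminuted", "high-moisture")
)

def same_source(a: FacetedIngredient, b: FacetedIngredient) -> bool:
    """Coordination: two entries are related if they share a Personality."""
    return a.personality == b.personality

def differing_facets(a: FacetedIngredient, b: FacetedIngredient) -> list[str]:
    """Identity preservation: name the facets on which two entries differ."""
    facets = ("personality", "matter", "space")
    return [f for f in facets if getattr(a, f) != getattr(b, f)]

print(same_source(kashmiri, green_paste))
print(differing_facets(kashmiri, mathania))
```

The shared Personality links the entries, while the differing Matter and Space values remain queryable instead of being collapsed.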

3.3 Adoption and Durability

Colon Classification was adopted by the Indian National Library and numerous university libraries across South and Southeast Asia, and served as the theoretical foundation for the International Federation of Library Associations’ principles on faceted classification (Broughton 2006). Subsequent frameworks—including the Bibliographic Classification of Henry Bliss and the Universal Decimal Classification’s faceted extensions—drew directly on Ranganathan’s architecture.

The durability of the framework across domains as diverse as bibliography, archival science, museum cataloguing, and digital information architecture reflects its quality as a structural solution rather than a domain-specific convention. The problem it addresses—organising entities that are complex, diverse, and not fully anticipated in advance—is exactly the problem that Indian food ingredient classification presents.

3.4 From Library Science to Food Identity

Applying the PMEST framework to the ingredient dataset immediately clarified which distinctions were meaningful and which were surface variation. The distinction between “chilli powder” and “chilli flakes” is a legitimate Matter distinction (fine-ground versus coarsely broken), not a naming inconsistency to be collapsed. “Kashmiri chilli” and “generic red chilli” differ on the Space facet, not the Personality facet, and that distinction carries regulatory weight in the context of geographical indication protection.

However, three categories of cases emerged that the PMEST framework as originally conceived did not fully resolve. The first concerned artificial or nature-identical flavourings: does “mango flavouring” belong under Mangifera indica as a Personality, or has synthesis transformed its identity so thoroughly that the source becomes secondary to the function? The second concerned highly processed lipids: is “soya lecithin” a variant of soybean, or has extraction and fractionation placed it in a different identity category—one defined by its emulsification function rather than its botanical origin? The third concerned processing-derived additives with no meaningful biological ancestor: modified starches, synthetic antioxidants, and inorganic salts have an HS classification and a regulatory name, but no Personality in the biological sense.

These categories expose the ontological questions a classification framework for food ingredients must resolve before it can be applied consistently. Those questions are addressed in Section 4.

4 The Ontological Questions That Must Be Answered

4.1 What Counts as a Canonical Entity?

A canonical entity, as used in this framework, is the smallest unit of ingredient identity to which variant representations can be attached without loss of information that is legally, nutritionally, or commercially significant. Determining what counts as a canon is not a naming decision but an identity decision: it requires specifying which distinctions are constitutive of a separate entity and which are surface variations of the same entity.

Consider lipids. Cold-pressed sesame oil and solvent-extracted refined sesame oil share a botanical source (Sesamum indicum) and a chemical class (edible vegetable oil, triglyceride-based). They differ in processing pathway, residual composition, and regulatory designation: FSSAI and the Codex standard for named vegetable oils distinguish cold-pressed and refined categories.³ Are they variants of one canon, or two separate canons? The answer depends on whether the processing distinction carries independent legal and nutritional weight—and, as Section 5 demonstrates, it does.

³ FSSAI Labelling Regulations 2020, Schedule II.

4.2 When Does a Variant Become a Separate Canon?

Variation along processing, form, and geographic dimensions does not automatically produce a separate canon. The framework requires a principled threshold at which a variant becomes sufficiently distinct to constitute an independent entity. Three criteria govern this determination.

First, regulatory identity change: if the relevant regulatory authority assigns a distinct product standard, a distinct mandatory name, or a distinct HS tariff heading to the processed form, the processing has produced a separate canon. Butter and ghee share a dairy fat origin but are defined by separate Codex standards and separate FSSAI product definitions. They are separate canons.

Second, nutritional non-substitutability: if the processed form cannot be substituted for the source form in a dietary context without materially altering the nutritional calculation, the forms are separate canons. Mango pulp and dehydrated mango powder are not nutritionally interchangeable at the same mass; they are separate canons.

Third, functional non-substitutability: if the processed form is used for a purpose that the source form cannot serve, and that purpose is the primary basis for its inclusion in a formulation, the processed form is a separate canon. Soya lecithin is used as an emulsifier; whole soybean is used as a protein and caloric source. The purposes are non-overlapping. They are separate canons.
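The three criteria can be expressed as a simple decision rule. The sketch below reads each criterion as independently sufficient, which is how the paragraphs above are phrased; the class name `CanonTest` and its fields are hypothetical. In the examples, only the criterion that the text explicitly asserts is set; the other flags are left at their default and should be read as "not assessed here", not as a finding.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CanonTest:
    pair: str
    regulatory_identity_change: bool = False
    nutritional_non_substitutability: bool = False
    functional_non_substitutability: bool = False

    def triggered(self) -> list[str]:
        """Names of the criteria that are met for this pair."""
        names = (
            "regulatory_identity_change",
            "nutritional_non_substitutability",
            "functional_non_substitutability",
        )
        return [name for name in names if getattr(self, name)]

    def is_separate_canon(self) -> bool:
        """Any single criterion is treated as sufficient (an assumption)."""
        return bool(self.triggered())

examples = [
    CanonTest("butter / ghee", regulatory_identity_change=True),
    CanonTest("mango pulp / dehydrated mango powder", nutritional_non_substitutability=True),
    CanonTest("soybean / soya lecithin", functional_non_substitutability=True),
    CanonTest("red chilli / red chilly"),  # spelling variant: no criterion met
]

for case in examples:
    print(case.pair, case.is_separate_canon(), case.triggered())
```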

4.3 When Does a Canon Become a Functional Tool?

The third question is the most consequential for the model developed here. A functional tool is an ingredient whose primary regulatory and commercial identity is defined by the technological role it performs rather than by its biological origin. The identity transformation is not merely a matter of processing intensity; it is a legal and semiotic shift that occurs when regulatory frameworks—labelling regulations, tariff classifications, judicial precedent—treat the ingredient primarily as a performer of a function rather than as a product of a biological source.

This shift is observable and documentable. The FSSAI Labelling Regulations prescribe a specific declaration format for food additives: the functional class (emulsifier, preservative, antioxidant, and so on) is declared first, followed by the specific name or International Numbering System code.⁴ This format structurally subordinates origin to function: “Emulsifier (lecithin)” presents the technological role as the primary identifier. A brand can declare “Emulsifier (INS 322)” without reference to soy origin, except where allergen disclosure obligations apply.

⁴ FSSAI Additives Regulations 2011, Schedule I.

⁵ FSSAI Labelling Regulations 2020, Schedule II, Class Titles 2 and 4.

By contrast, edible vegetable oils—even heavily processed ones including hydrogenated and interesterified fats—must be declared with their source type.⁵ “Hydrogenated vegetable oil” retains the botanical-origin reference despite intensive chemical transformation. The identity, for regulatory purposes, remains origin-primary.

The boundary between these two regimes is not a simple function of processing intensity. A highly processed ingredient may remain origin-primary in regulatory naming, while a moderately processed ingredient may cross into function-primary classification. This is the central observation motivating the introduction of F as a third dimension, independent of E and M, in the model developed in Section 6.

4.4 The Role of Flavourings

Flavourings require explicit treatment. The FSSAI Labelling Regulations distinguish natural flavourings, nature-identical flavourings, and artificial flavourings.⁶ A natural mango flavouring obtained by aqueous or ethanolic extraction from mango fruit retains a biological-origin linkage in its regulatory designation. A synthetic mango flavouring produced by organic synthesis to replicate specific volatile compounds has no such linkage; its identity is defined by its sensory function and chemical composition, not by its biological source.

⁶ FSSAI Labelling Regulations 2020.

Whether to file a synthetic mango flavouring under the canon for mango or as a separate functional entity cannot be resolved by examining the ingredient name alone. It requires a framework that positions the ingredient on dimensions of processing transformation and functional identity simultaneously. This is what the E–M–F model provides.

4.5 The Role of Source Declaration

Throughout the foregoing analysis, source declaration has appeared both as a legal requirement and as a conceptual anchor. The requirement reflects a principle embedded in Indian food law and affirmed by the courts: that consumers and downstream systems have a legitimate interest in knowing the biological origin of ingredients, independent of the form those ingredients take in the final product. This principle creates a legal floor on identity abstraction: no ingredient can be classified as a pure functional tool, in the regulatory sense, if its biological origin is subject to mandatory disclosure.

This interaction between legal source-declaration obligations and functional identity is one of the novel contributions of the F dimension, examined in detail with reference to specific regulatory provisions and judicial reasoning in Section 5 and Section 6.

5 The Regulatory Landscape as Ground Truth

5.1 Why Regulation Precedes Theory

The ontological questions raised in Section 4 might appear to invite philosophical resolution—a set of first principles from which a classification framework is deduced. The approach taken here is different. The existing regulatory landscape is treated as empirical evidence of how a functioning legal and commercial system has already resolved many of these questions, and unexplained divergences within that landscape are treated as signals of where principled analysis is most needed.

This is not deference to authority for its own sake. India’s food regulatory instruments—the FSSAI Labelling and Display Regulations 2020 (Food Safety and Standards Authority of India 2023), the Food Products Standards and Food Additives Regulations 2011 (Food Safety and Standards Authority of India 2024), and the Indian Trade Classification (Harmonised System)—have been refined through decades of legislative drafting, administrative interpretation, and judicial review. They encode accumulated practical wisdom about which distinctions matter and which do not. A framework that contradicts these instruments without compelling justification is not principled; it is merely unconventional.

5.2 The FSSAI Labelling and Display Regulations, 2020

5.2.1 The “True Nature” Principle

Regulation 4(1) of the FSSAI Labelling and Display Regulations 2020 establishes the foundational identity norm: the name of a food shall indicate its true nature. Where an established standard exists, the standardised name is required. Where none exists, the common or usual name must be used, supplemented by a description of the true nature where the name alone is insufficient.

This principle establishes source-dominant identity as the regulatory default. An ingredient must be named in a way that accurately conveys what it is—its biological origin, its physical state, its processing history where that history is legally significant. The “true nature” requirement is not merely a naming convention; it is an epistemological commitment to the primacy of material identity over functional identity in the absence of specific provision to the contrary.

5.2.2 Source Qualification Requirements

The 2020 Regulations impose mandatory source qualifiers for several ingredient categories, creating legally enforceable constraints on functional abstraction.

Edible vegetable oils and fats must be declared with the specific oil type and, where applicable, the processing method.⁷ The Schedule II ingredient class titles prescribe declaration formats including “vegetable fat (specify source type: interesterified vegetable fat / fractionated fat / hydrogenated oils / partially hydrogenated oils / margarine and fat spreads).” Even where intensive chemical modification has occurred—hydrogenation, interesterification—the source type must be named.

⁷ FSSAI Labelling Regulations 2020, Schedule II, Class Titles 2 and 4.

Animal fats require declaration of their specific animal origin, reflecting the constitutional dimension of source disclosure affirmed by the courts.

Cereal flours must identify the grain source. “Wheat flour,” “maize flour,” and “rice flour” are distinct required declarations; the generic term “flour” is insufficient where the grain identity is nutritionally and allergically significant.

5.2.3 The Additive Declaration Format

Regulation 5(5) of the 2020 Regulations introduces the mechanism that makes functional identity legally cognisable: the mandatory declaration of food additives by functional class.⁸ Additives listed in the Food Products Standards and Food Additives Regulations 2011 must be declared with their functional class name first, followed by the specific name or the INS code.

⁸ FSSAI Labelling Regulations 2020, Regulation 5(5).

The format “Emulsifier (lecithin)” or “Preservative (INS 211)” structurally encodes the priority of function over source in regulatory naming. The functional class is the primary identifier; the specific substance is secondary. This format is mandatory, not optional: it represents a regulatory determination that for additive-classified substances, the technological role is the operationally significant identity for consumer communication.
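The function-first declaration format can be sketched as a small formatter. The function name `declare_additive` and its parameters are hypothetical; the sketch reproduces only the two example strings quoted above and does not model allergen or source-disclosure obligations.

```python
from typing import Optional

def declare_additive(
    functional_class: str,
    name: Optional[str] = None,
    ins_code: Optional[int] = None,
) -> str:
    """Format an additive as 'Functional class (specific name or INS code)'."""
    if name is None and ins_code is None:
        raise ValueError("either a specific name or an INS code is required")
    identifier = name if name is not None else f"INS {ins_code}"
    # The functional class leads; the substance appears only in parentheses.
    return f"{functional_class.capitalize()} ({identifier})"

print(declare_additive("emulsifier", name="lecithin"))
print(declare_additive("preservative", ins_code=211))
```

The output strings carry no field for biological origin, which mirrors the structural point made above: in this format, function is the primary identifier.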

Schedule I of the 2011 Regulations enumerates twenty-two functional classes, including emulsifier, thickener, stabilizer, preservative, antioxidant, sequestrant, raising agent, humectant, carrier, propellant, and packaging gas.⁹ The existence of this taxonomy, and the mandatory declaration format that accompanies it, is empirical evidence that Indian food law recognises a distinct category of ingredients whose identity is function-primary.

⁹ FSSAI Additives Regulations 2011, Schedule I.

5.3 The Indian Trade Classification (Harmonised System)

5.3.1 Chapter Structure as Identity Architecture

The Indian Trade Classification (Harmonised System) organises traded goods through a hierarchical chapter structure that encodes, in legally binding form, the identity distinctions that matter for taxation, origin determination, and regulatory compliance. For food ingredients, the relevant architecture spans Chapters 7 through 38, with a characteristic pattern: source-aligned classification in Chapters 7–15, and function-aligned or chemically defined classification in Chapters 29, 35, and 38.

Chapters 7 and 8 cover edible vegetables, fruits, and nuts, classified primarily by botanical species and physical state. Chapter 9 covers coffee, tea, and spices. Chapter 11 covers products of the milling industry: flours, meals, starches, and related products derived from grains and pulses, with HS headings that specify both source and physical form (Directorate General of Commercial Intelligence and Statistics 2007a). Chapter 15 covers animal and vegetable fats and oils, organised by source, with processing state recorded in subheadings but not displacing source from the primary classification level (Directorate General of Commercial Intelligence and Statistics 2007b).

Chapter 35 covers albuminoidal substances, modified starches, glues, and enzymes. The inclusion of “modified starches” here—rather than Chapter 11 with native starches—is a deliberate regulatory determination that chemical modification of starch sufficiently transforms its identity to warrant reclassification from a milling industry product to a chemically defined substance (Directorate General of Commercial Intelligence and Statistics 2007c). HS heading 3505 covers “dextrins and other modified starches (for example, pregelatinised or esterified starches).” The migration from Chapter 11 to Chapter 35 represents an HS-encoded identity snap: the same biological material, after a defined degree of transformation, is treated as a different kind of thing.

5.3.2 Critical Chapter Transitions as Identity Snaps

The most analytically significant feature of the ITC-HS for ingredient classification is the set of chapter transitions that represent discontinuous identity changes—points at which accumulated processing crosses a threshold that the regulatory system treats as qualitatively, not merely quantitatively, significant. Three such transitions are primary.

Chapter 11 to Chapter 35 (Native Starch to Modified Starch). Native starches are classified in Chapter 11 as products of the milling industry. Chemically modified starches—acetylated, cross-linked, phosphorylated—migrate to HS heading 3505 in Chapter 35. This transition is triggered by chemical modification of the starch polymer: the introduction of new functional groups that alter the regulatory identity of the material from food commodity to chemically defined functional substance.

Chapter 15 to Heading 1516 to Chapters 29/38 (Oils to Chemically Modified Fats to Chemical Products). Within Chapter 15, a progression exists from crude and refined oils through chemically modified fats (heading 1516, covering hydrogenated, interesterified, re-esterified, and elaidinised fats “not further prepared”) to formulated preparations (heading 1517). Lecithins and phosphoaminolipids, derived from vegetable oil processing, are classified under HS heading 2923 in Chapter 29 rather than Chapter 15, reflecting a regulatory determination that the identity of these substances is defined by their chemical structure and emulsification function rather than their fat-derived origin.

Chapter 22 (Brewed Vinegar) versus Chapter 29 (Synthetic Acetic Acid). Brewed vinegar, produced by double fermentation of agricultural substrates, is classified under HS 2209 in Chapter 22. Glacial acetic acid (synthetic), used as an acidulant, is classified under HS 2915 in Chapter 29. FSSAI regulations require that synthetic vinegar be labelled “SYNTHETIC – PREPARED FROM ACETIC ACID,” distinguishing it from brewed vinegar at the product naming level as well as the tariff level. This parallel treatment across labelling law and tariff classification illustrates the convergent methodology applied throughout this report.
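The chapter logic used in these transitions can be checked mechanically, because the first two digits of a four-digit HS heading give its chapter. The sketch below is illustrative: the function names are hypothetical, and the label "unassigned" is an assumption for chapters that Section 5.3.1 places in neither the source-aligned group (Chapters 7–15) nor the function-aligned group (Chapters 29, 35, 38).

```python
SOURCE_ALIGNED_CHAPTERS = set(range(7, 16))   # Chapters 7-15
FUNCTION_ALIGNED_CHAPTERS = {29, 35, 38}      # chemically defined / functional


def chapter_of(heading: str) -> int:
    """Return the chapter number encoded in a four-digit HS heading."""
    code = heading.strip()
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"expected a four-digit HS heading, got {heading!r}")
    return int(code[:2])


def alignment(heading: str) -> str:
    """Label a heading by the chapter grouping described in Section 5.3.1."""
    chapter = chapter_of(heading)
    if chapter in SOURCE_ALIGNED_CHAPTERS:
        return "source-aligned"
    if chapter in FUNCTION_ALIGNED_CHAPTERS:
        return "function-aligned"
    return "unassigned"  # assumption: chapter not grouped in the text


def crosses_chapter(before: str, after: str) -> bool:
    """True when two headings sit in different chapters."""
    return chapter_of(before) != chapter_of(after)


for before, after in [("1108", "3505"), ("1507", "1516"), ("2209", "2915")]:
    print(before, alignment(before), "->", after, alignment(after),
          "| chapter change:", crosses_chapter(before, after))
```

Run on the pairs above, the starch and vinegar examples register a chapter change, while unmodified oil (1507) to hydrogenated fat (1516) stays inside Chapter 15, matching the within-chapter progression described for fats.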

5.4 Judicial Reasoning on Ingredient Identity

5.4.1 The Supreme Court on Classification Hierarchy

In Commissioner of Customs (Import) v. M/s Welkin Foods, decided on 6 January 2026, the Supreme Court of India addressed the hierarchy of interpretive tools applicable to food product classification disputes (Supreme Court of India 2026). The Court held that Harmonised System codes and tariff headings constitute the primary reference for classification purposes, overriding the common parlance test where the two conflict. Scientific and technical definitions embedded in the HS architecture take precedence over popular understanding of what a product “is” or “is used for.”

The practical implication is significant: the identity of an ingredient, for regulatory and legal purposes, is determined by the structure of the classification system rather than by lay or commercial understanding. An ingredient that a consumer would describe as “chocolate” may, for classification purposes, be a “vegetable fat confection” if its cocoa butter content falls below the legal threshold. The technical classification displaces the common-name description.

5.4.2 The Delhi High Court on Source Disclosure Independence

In Ram Gaua Raksha Dal v. Union of India and Others, the Delhi High Court ruled on the interaction between functional-class additive declaration and source-based disclosure requirements (Delhi High Court 2022). The Court held, first, that source disclosure obligations are independent of the additive-declaration framework: even where an additive is properly declared by functional class and INS number, the source-based identification requirement cannot be displaced. Second, the obligation is percentage-independent: a non-vegetarian ingredient triggers mandatory source disclosure regardless of quantity present. Third, the Court grounded these requirements in Articles 21 and 25 of the Constitution, elevating source disclosure from regulatory preference to fundamental rights protection in specific contexts.

For the classification framework developed here, the judgment establishes a legal ceiling on functional abstraction: regardless of how technically “functional” an ingredient’s classification is under the additive schedule or the ITC-HS, source identity cannot be fully abstracted where constitutional disclosure interests apply. This ceiling is incorporated into the Functional Score dimension of the model as a contextual modifier.

5.5 The Regulatory Delta: 2011 to 2020

A comparative analysis of the FSSAI Food Products Standards and Food Additives Regulations 2011 and the Labelling and Display Regulations 2020 reveals a systematic shift in the regulatory treatment of ingredient identity (Vukka and Lalitha 2026). The 2020 Regulations expanded the scope of mandatory source qualification, tightened the format requirements for additive declaration, and introduced new provisions for allergen labelling and the declaration of processing aids. These changes collectively increased the regulatory resolution of ingredient identity: more distinctions are now legally mandated, and more instruments are available to enforce them.

This trajectory is relevant to the benchmark in Section 5.6: the 35 test cases are calibrated to the current regulatory state as of 2025, with the understanding that the framework must accommodate regulatory evolution without requiring wholesale reconstruction.

5.6 The Identity Discrimination Benchmark

5.6.1 Purpose and Scope

The benchmark serves a specific and bounded purpose: it provides a replicable, publicly stated set of discrimination tests against which any ingredient classification framework, including the tri-axial model developed in Section 6, can be evaluated. A framework that fails to produce correct discriminations on the benchmark cases is demonstrably inadequate; a framework that passes all cases has cleared a necessary but not sufficient condition for general adequacy.

The benchmark is adversarial by design. Each test case represents a discrimination that a naive or flat classification system would likely fail, while a principled framework grounded in regulatory and scientific evidence should resolve correctly. The test cases span the full range of ingredient transformation—from thermal history without identity change to complete chemical synthesis—and cover the major regulatory identity snaps documented above.

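The discriminatory power of a framework applied to the benchmark is quantified by a Determinism Quotient (DQ), the fraction of the 35 benchmark pairs that the framework differentiates correctly, where $N_{\text{correct}}$ is the number of correctly differentiated pairs:

$$\mathrm{DQ} = \frac{N_{\text{correct}}}{35}$$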

A DQ of 1.0 indicates correct differentiation of all 35 pairs. Partial scores indicate specific domains of weakness. The DQ measures logical consistency with regulatory ground truth, not statistical performance.
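Reading DQ as the share of the 35 pairs that a framework differentiates correctly (which is what a score of 1.0 for all 35 pairs implies), the calculation can be sketched as follows. The function names and the rule that an unreported pair counts as a failure are assumptions made for illustration.

```python
from typing import List, Mapping

TOTAL_PAIRS = 35  # size of the benchmark in Table 1


def determinism_quotient(results: Mapping[int, bool],
                         total_pairs: int = TOTAL_PAIRS) -> float:
    """Fraction of benchmark pairs differentiated correctly.

    `results` maps a test ID (1..total_pairs) to True when the framework
    differentiated that pair correctly. Missing IDs count as failures
    (an assumption of this sketch).
    """
    unknown = [i for i in results if not 1 <= i <= total_pairs]
    if unknown:
        raise ValueError(f"test IDs outside the benchmark: {unknown}")
    passed = sum(1 for ok in results.values() if ok)
    return passed / total_pairs


def failed_ids(results: Mapping[int, bool],
               total_pairs: int = TOTAL_PAIRS) -> List[int]:
    """IDs of pairs that were missed or not reported."""
    return [i for i in range(1, total_pairs + 1) if not results.get(i, False)]


# Hypothetical run: every pair passes except tests 22 and 23.
run = {i: True for i in range(1, TOTAL_PAIRS + 1)}
run[22] = False
run[23] = False
print(round(determinism_quotient(run), 3), failed_ids(run))
```

Listing the failed IDs alongside the quotient keeps the "specific domains of weakness" visible rather than collapsing them into a single number.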

5.6.2 The 35-Test Identity Discrimination Benchmark

Table 1: Identity Discrimination Benchmark: 35 adversarial test pairs.

ID	Does the framework differentiate between…	Reason for Testing (Regulatory/Nutritional Logic)
1	Raw Apple vs. Chilled Apple	Floor test for thermal history without identity snap. Chilling is minimal processing with no regulatory rename. A framework must not produce a distinct canon from refrigeration alone.
2	Whole Wheat Flour vs. Maida	Detection of matrix stripping. FSSAI product standards 2.4.1 and 2.4.2 distinguish these as separate regulated commodities with different ash content and extraction rate specifications.
3	Maida vs. Native Wheat Starch	Snap from whole-plant milling to nutrient isolation. Maida retains protein and some non-starch material; native wheat starch is a purified carbohydrate fraction. FSSAI and HS Chapter 11 distinguish these within the milling chapter before any chemical modification occurs.
4	Sliced Onion vs. Onion Powder	Mass concentration threshold and matrix disruption. Dehydration concentrates all components by approximately 10-fold; the resulting powder has different nutritional density, water activity, and regulatory handling characteristics.
5	Raw Milk vs. Pasteurised Milk	Identification of the primary legal safety processing step. FSSAI dairy standards require heat treatment declaration; the legal name changes from “milk” to “pasteurised milk.” The framework must register this change without treating the two as entirely separate biological entities.
6	Fresh Fruit vs. Dehydrated Fruit	Phase change and water activity boundary. Dehydrated fruit is regulated under different FSSAI standards, has different microbiological risk profiles, and occupies different HS subheadings.
7	Raw Honey vs. Pasteurised Honey	Enzymatic integrity versus thermal stabilisation. FSSAI honey standards distinguish these based on diastase activity; the framework must capture the enzymatic dimension of processing history.
8	Cold Pressed Oil vs. Refined Oil	Chemical separation and solvent-based processing floor. FSSAI mandates different label designations; Codex CXS 19-1981 restricts the term “cold pressed” to oil obtained without heat addition and without additives. Refined oil has passed through deacidification, bleaching, and deodorisation.
9	Butter vs. Ghee	Separation of dairy solids and water. FSSAI product standards under Chapter 2.2 define butter and ghee as distinct dairy fat products with different compositional specifications. The two occupy the same HS chapter (Chapter 04) but different HS headings.
10	Ghee vs. Anhydrous Milk Fat	Chemical peak of lipid purity. Anhydrous milk fat (AMF) achieves approximately 99.9% lipid content through a more intensive separation process than ghee, with Codex standard CXS 280-1973 defining separate parameters for each.
11	Liquid Vegetable Oil vs. Vanaspati	Catalytic hydrogenation snap. FSSAI defines vanaspati under Chapter 2.2.6 as a hydrogenated vegetable oil product with mandatory trans fat disclosure. HS heading 1516 applies to hydrogenated fats, distinct from unmodified oil headings 1507–1515.
12	Vanaspati vs. Interesterified Fat	Molecular rearrangement for structural utility. Interesterification redistributes fatty acids among glycerol backbones, creating a different melting profile without full saturation. Both fall under HS 1516 but with distinct process designations; FSSAI Schedule II requires specific naming of interesterified vegetable fat.
13	Milk vs. Dairy Whitener	Functionality shift from beverage to additive carrier. Dairy whitener is a formulated product containing dried milk with emulsifiers, anticaking agents, and flow agents; its primary commercial function is as an additive to beverages, not as a standalone nutritional source.
14	Coconut Milk vs. Coconut Oil	Emulsion-to-lipid snap. Coconut milk is an aqueous emulsion of coconut fat in coconut water (HS Chapter 21 preparation); coconut oil is the isolated lipid fraction (HS Chapter 15). These are categorically different regulatory entities despite sharing a botanical source.
15	Raw Milk vs. Yogurt/Curd	Biological conversion and structural coagulation. Fermentation transforms the protein matrix, carbohydrate profile, and pH of milk; FSSAI product standards and HS Chapter 04 treat curd and milk as distinct dairy products.
16	Curd vs. Soy Dahi (Analogue)	Source-origin verification (plant versus animal identity). A plant-based analogue mimicking the sensory properties of curd must be declared as a dairy analogue under FSSAI labelling rules and cannot use the term “dahi” without qualification. The framework must distinguish biological source even where functional and sensory properties overlap.
17	Fruit Juice vs. Fruit Vinegar	Snap from sugar matrix to biological acid matrix. Double fermentation converts sugars to ethanol and then ethanol to acetic acid; the resulting product is governed by FSSAI vinegar standards and HS Chapter 22 vinegar heading, categorically distinct from juice classification.
18	Vinegar vs. Glacial Acetic Acid	Biogenic origin versus petrochemical synthesis. FSSAI mandates “SYNTHETIC – PREPARED FROM ACETIC ACID” labelling for non-fermented vinegar substitutes. HS chapter migration from Chapter 22 (beverages) to Chapter 29 (organic chemicals) is required.
19	Cane Sugar vs. Xanthan Gum	Fermentation product as tool versus substrate identity. Xanthan gum, produced by fermentation of glucose substrates by Xanthomonas campestris, is classified as a food additive (stabilizer, INS 415) under FSSAI Schedule I and in Chapter 13 or 35 of ITC-HS, entirely distinct from its sugar feedstock.
20	Natural Yeast vs. Chemical Leavening	Biological versus inorganic gas-release mechanisms. Yeast leavening is a biological process; sodium bicarbonate and baking powder are classified as food additives (raising agents, INS 500) with inorganic chemistry origins.
21	Wheat Flour vs. Maltodextrin	Enzymatic hydrolysis: matrix-to-molecular snap. Maltodextrin, produced by partial hydrolysis of starch, occupies HS heading 1702 (other sugars) or 1108 (starches) depending on dextrose equivalent; it is categorically distinct from the flour from which it derives.
22	Native Starch vs. Modified Starch	Identity snap from Chapter 11 to Chapter 35 of ITC-HS. Chemical modification (acetylation, cross-linking, phosphorylation) moves starch from the milling industry chapter to the albuminoidal substances and modified starches chapter. FSSAI labelling requires explicit naming of modified starches as such.
23	Whole Soya Bean vs. Soya Lecithin	Food-to-emulsifier snap (Functional Score peak). Soya lecithin is extracted from soybean oil, concentrated to a phospholipid-rich fraction, and classified as a food additive (emulsifier, INS 322) under FSSAI Schedule I and under HS 2923 (phosphoaminolipids) in Chapter 29, entirely distinct from the whole soybean.
24	Sugar vs. High Fructose Corn Syrup	Enzymatic synthesis of non-natural sugar ratios. High fructose corn syrup is produced by enzymatic isomerisation of glucose; its fructose content does not occur in natural corn starch and produces a functionally and metabolically distinct sweetener.
25	Vanilla Bean vs. Natural Vanilla Extract	Solvent extraction versus biological matrix integrity. Natural vanilla extract is produced by aqueous or ethanolic extraction; it is a concentrated flavouring preparation classified under Chapter 33 (essential oils, resinoids) rather than Chapter 9 (spices), with distinct regulatory treatment under FSSAI flavouring guidelines.
26	Natural Vanilla Extract vs. Synthetic Vanillin	Signal-to-source divorce. Synthetic vanillin (4-hydroxy-3-methoxybenzaldehyde, HS 2912.41) is classified in Chapter 29 (organic chemicals); it cannot be labelled as “natural vanilla flavouring” under FSSAI regulations and must be declared as “artificial flavouring” or “flavouring (vanillin).”
27	Chocolate vs. Chocolate Substitute	Legal admission of non-cocoa fats as identity limit. FSSAI product standards for chocolate set minimum cocoa solids and cocoa butter content; products falling below these thresholds must be designated “chocolate-flavoured” or “compound chocolate” rather than “chocolate.”
28	Natural Dietary Fibre vs. Purified Cellulose	Isolation of non-nutritive structural tool. Microcrystalline cellulose (MCC, INS 460) is an additive-classified substance under FSSAI Schedule I, used as a bulking agent, anticaking agent, and stabiliser; it is categorically distinct from the dietary fibre content declared on nutrition labels.
29	Cane Sugar vs. Aspartame	Caloric bulk versus high-potency functional signal. Aspartame (INS 951) is classified as an intense sweetener under FSSAI Schedule I at use levels approximately 200 times lower than sugar by mass; its functional identity is defined by sweetening intensity, not caloric contribution.
30	Sea Salt vs. Sodium Benzoate	Flavour seasoning versus system utility (preservative). Sodium benzoate (INS 211) is classified as a preservative under FSSAI Schedule I; its primary function is microbial inhibition, not flavour. The framework must not conflate sodium-containing ingredients on the basis of cation similarity.
31	Guar Gum vs. Cereal Flour	Peak viscosity utility versus caloric mass contribution. Guar gum (INS 412), classified as a thickener and stabiliser under FSSAI Schedule I, is used at 0.1–0.5% inclusion levels for viscosity; cereal flour is a bulk ingredient providing starch and protein at 40–80% of formulation weight.
32	Lemon Juice vs. Citric Acid	Purity-utility snap: food versus acidulant tool. Lemon juice is a food ingredient governed by FSSAI product standards (Chapter 20 of ITC-HS); citric acid (INS 330) is a food additive classified as an acidity regulator under FSSAI Schedule I and in Chapter 29 of ITC-HS.
33	Smoked Meat vs. Liquid Smoke	Process-integral flavour versus additive signal divorce. Liquid smoke is a condensate of wood combustion products, standardised and classified as a flavouring preparation under FSSAI regulations; it is a discrete additive, not the outcome of an integrated processing step, and must be declared in the ingredient list.
34	Natural Beta-Carotene vs. Synthetic Beta-Carotene	Source coordinate verification. Natural beta-carotene (extracted from vegetables or algae) and synthetic beta-carotene (chemical synthesis) are chemically identical but classified differently for the purpose of “natural colour” claims under FSSAI and comparable labelling frameworks. The framework must capture source coordinate even where molecular structure is identical.
35	Bulk Ingredient vs. INS Carrier/Additive	Maximum divorce: the functional infrastructure peak. An ingredient serving no direct nutritional, sensory, or structural role in the final food product—functioning purely as a carrier, processing aid, or technical auxiliary—represents the terminus of the identity axis. The framework must distinguish this from any ingredient contributing to the food’s nutritional or sensory character.

5.6.3 Benchmark Validation Protocol

The benchmark is applied to the tri-axial model in Section 8. The validation records, for each test pair, the model coordinates assigned to each member and whether those coordinates produce a differentiated classification outcome. A differentiated outcome requires that the two members of the pair be assigned to different canonical zones (variant, independent canon, or functional tool) or that their coordinate values differ sufficiently to warrant different regulatory and operational treatment.
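The pass criterion for a single pair can be sketched as a small predicate. The text does not state how large a coordinate difference counts as "sufficient", so the `min_gap` threshold below is a hypothetical parameter, as are the class name, field names, and coordinate values used in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Model output for one member of a test pair (illustrative structure)."""
    zone: str          # "variant", "independent canon", or "functional tool"
    energy: float      # Anthropogenic Energy Score
    matter: float      # Matter Score
    functional: float  # Functional Score


def is_differentiated(a: Placement, b: Placement, min_gap: float = 0.10) -> bool:
    """True if the pair lands in different zones, or if any coordinate
    differs by at least `min_gap` (a hypothetical threshold)."""
    if a.zone != b.zone:
        return True
    gaps = (
        abs(a.energy - b.energy),
        abs(a.matter - b.matter),
        abs(a.functional - b.functional),
    )
    return max(gaps) >= min_gap


# Illustrative coordinates only; these are not scores from the report.
first = Placement("variant", 0.30, 0.35, 0.15)
second = Placement("variant", 0.55, 0.70, 0.20)
print(is_differentiated(first, second))
```

Treating the threshold as an explicit argument keeps the pass/fail rule open to the kind of documented revision that the critique protocol asks for.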

Critiques of the benchmark—whether challenging the selection of test pairs, the regulatory evidence cited, or the pass/fail criteria—are subject to the contribution protocol in Appendix A. Critique without proposed revision and evidence does not constitute engagement with the benchmark.

6 The Tri-Axial Identity Model

6.1 The Need for Three Dimensions

Chapters 2 through 5 have established that ingredient identity is not a single-dimensional property. Flat canonisation collapses distinctions that matter; classification by processing level alone conflates ingredients that regulatory systems treat as categorically different. Ranganathan’s faceted approach provides the theoretical architecture, but its application to food ingredients requires computational operationalisation: dimensions that are measurable, independently assignable, and combinable into a diagnostic framework.

Three dimensions are necessary and, as argued below, sufficient to capture the identity distinctions that regulatory systems actually make.

First, how invasively was the ingredient transformed? This is a question about process: the energy and chemistry invested in moving an ingredient away from its native biological state. It is measured by the Anthropogenic Energy Score.

Second, how far has the ingredient moved from its source matrix? This is a question about the resulting material: how much of the original biological context (moisture, fibre, co-nutrients, cellular structure) remains in the ingredient as it enters the food system. It is measured by the Matter Score.

Third, does the ingredient’s regulatory and commercial identity follow its biological source or its technological function? This is a question about the legal-semiotic position of the ingredient: whether it is named, classified, and governed as a product of a biological origin or as a performer of a technological role. It is measured by the Functional Score.

These dimensions are independent. A moderately processed ingredient can have high functional identity (propellant gases have a high Functional Score despite a moderate Anthropogenic Energy Score). A heavily processed ingredient can retain low functional identity (hydrogenated vegetable oil has a high Anthropogenic Energy Score but a low Functional Score because regulatory naming retains source primacy). No single axis is sufficient to determine identity, and the combination of all three resolves cases that any two alone leave ambiguous.

The full technical justification for each score assigned in the tables that follow—including process-by-process derivations, supporting citations, and defensibility ratings—is documented in the companion scoring report (Lalitha 2026b). The present chapter states the framework and its outputs; the companion document shows the derivation.

6.2 The Anthropogenic Energy Score

6.2.1 Definition and Interpretive Range

The Anthropogenic Energy Score quantifies the invasiveness of the transformation pathway applied to an ingredient, ranging from 0.0 (native biological state, no industrial transformation) to 1.0 (complete chemical synthesis with no biological material present or traceable).

The scale is continuous but structured around four interpretive bands, each anchored in regulatory and chemical distinctions:

Physical (0.10–0.35): Mechanical handling with no intentional molecular re-identity. Sorting (0.12), washing (0.15), dehusking (0.22), milling (0.28), cold pressing (0.32). These operations alter the physical form of the ingredient without targeting covalent bonds.
Thermal/Biological (0.40–0.60): Phase change, safety stabilisation, and biological conversion. Churning (0.45), pasteurisation (0.48), clarification for ghee production (0.55), fermentation (0.56), roasting (0.58). These operations alter the structural or chemical state of the ingredient while retaining a clear connection to the biological source in regulatory naming.
Fractional/Refinement (0.70–0.82): Separation into functional fractions using solvents, controlled crystallisation, or industrial purification. Solvent extraction (0.82), fractionation (0.76), refining (0.75). These operations produce technically defined fractions that may lack the botanical character of the starting material.
Chemical/Synthetic (0.85–1.0): Intentional covalent modification or de novo synthesis. Interesterification (0.91), hydrogenation (0.92), acetylation (0.94), synthetic flavours (0.98–0.99). These operations introduce new functional groups, rearrange molecular structures, or produce chemically defined substances with no necessary biological precursor.
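The four bands can be expressed as a simple lookup. The sketch below uses only the band limits stated above; because those limits leave gaps (for example between 0.35 and 0.40), the function returns `None` for a score that falls between bands. Both that behaviour and the 0.0 to 1.0 validity check are assumptions of this illustration, as is the function name.

```python
from typing import Optional

# (band name, lower limit, upper limit), as stated in the band descriptions.
ENERGY_BANDS = [
    ("Physical", 0.10, 0.35),
    ("Thermal/Biological", 0.40, 0.60),
    ("Fractional/Refinement", 0.70, 0.82),
    ("Chemical/Synthetic", 0.85, 1.00),
]


def energy_band(score: float) -> Optional[str]:
    """Return the interpretive band for an Anthropogenic Energy Score.

    Returns None when the score lies in a gap between the stated bands
    (an assumption; the text does not say how such scores are handled).
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score assumed to lie between 0.0 and 1.0")
    for name, low, high in ENERGY_BANDS:
        if low <= score <= high:
            return name
    return None


# Reference values taken from Table 2.
for process, score in [("Milling", 0.28), ("Clarification (Ghee)", 0.55),
                       ("Refining", 0.75), ("Hydrogenation", 0.92)]:
    print(process, score, energy_band(score))
```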

6.2.2 The Anthropogenic Energy Score as Process History, Not Quality Assessment

A critical interpretive constraint must be stated explicitly: the Anthropogenic Energy Score is not a quality assessment, a health score, or a value judgement. Ghee, a product of deep cultural and nutritional significance, carries a score of approximately 0.55 because it is produced through thermal concentration and clarification, processes that are moderately invasive relative to the full scale. This does not make ghee inferior to cold-pressed oil in any nutritional, cultural, or commercial sense. The score records what happened to the ingredient; it does not evaluate whether that history is desirable.

Similarly, a high score for synthetic vanillin (0.98) does not imply that it is unsafe or inappropriate for use. JECFA evaluations and approved INS classifications confirm that synthetic vanillin is safe at specified use levels. The high score records the degree of chemical synthesis involved in its production.

6.2.3 Selected Score Reference Values

Table 2 presents reference values for representative processes.

Table 2: Selected Anthropogenic Energy Score reference values.

Process	Score	Band
Sorting	0.12	Physical
Washing	0.15	Physical
Chilling	0.18	Physical
De-husking	0.22	Physical
Milling (e.g., Besan)	0.28	Physical
Cold Pressing (Oil)	0.32	Physical
Churning (Butter)	0.45	Physical/Thermal
Pasteurisation	0.48	Thermal
Clarification (Ghee)	0.55	Thermal
Fermentation (Vinegar)	0.56	Biological
Roasting	0.58	Thermal/Chemical
Refining (Vegetable Oil)	0.75	Industrial/Fractional
Fractionation (Olein)	0.76	Industrial/Fractional
Solvent Extraction (Oils)	0.82	Industrial/Fractional
Interesterification	0.91	Chemical/Synthetic
Hydrogenation	0.92	Chemical/Synthetic
Acetylation (Modified Starch)	0.94	Chemical/Synthetic
Synthetic Vanillin	0.98	Chemical/Synthetic
Synthetic Flavours (General)	0.99	Chemical/Synthetic

6.3 The Matter Score

6.3.1 Definition and Interpretive Range

The Matter Score measures the degree of departure of the ingredient’s final commercial state from the original biological matrix, ranging from 0.0 (whole, hydrated, structurally intact biological material) to 1.0 (chemically defined pure substance with no remaining biological matrix).

Where the Anthropogenic Energy Score measures the transformation process, the Matter Score measures the transformation result: the state of the material as it enters the food system. An ingredient may undergo a process with a high Anthropogenic Energy Score and emerge with a relatively low Matter Score if the process retains most of the original matrix (roasting leaves the bulk carbohydrate, fat, and protein structure largely intact). Conversely, a process with a moderate Anthropogenic Energy Score, applied repeatedly or intensively, may produce a high Matter Score (spray-drying combined with prior concentration and protein precipitation produces a protein isolate at 0.78).

Seven conceptual matter classes provide interpretive anchors:

Hydrated/Native: Whole or minimally cut foods with cellular water and anatomical structure largely intact.
Comminuted: Physically reduced particle size; full nutrient spectrum retained; cellular structure disrupted but not fractionated.
Dehydrated/Concentrated: Water removed or matrix densified; major macronutrients retained; water activity substantially reduced.
Structural Fractionation: Selective removal or enrichment of specific macronutrient fractions (skim milk, defatted meal, clarified juice).
Constitutional Isolate: One major macronutrient isolated to high technical purity (vegetable oils, protein isolates, purified fat fractions).
Molecular Signal/Extract: High-potency, low-mass signals isolated from the biological matrix (essential oils, oleoresins, emulsifiers).
De Novo/Synthetic Matter: Chemically defined substances with no required biological matrix (modified starches, synthetic flavours, inorganic salts).

6.3.2 Selected Score Reference Values

Table 3 presents reference values for representative commercial states.

Table 3: Selected Matter Score reference values.

Final Commercial State	Score	Matter Class
Whole/fresh pieces	0.05	Hydrated/Native
Cut/sliced pieces	0.10	Hydrated/Native
Pulp/puree	0.25	Comminuted
Coarse grits	0.30	Comminuted
Flour/fine powder	0.33	Comminuted
Flakes	0.36	Dehydrated/Concentrated
Dense block (e.g., khoya)	0.38	Dehydrated/Concentrated
Concentrate (liquid)	0.40	Dehydrated/Concentrated
Powder (spray-dried)	0.42	Dehydrated/Concentrated
Juice (clarified)	0.50	Structural Fractionation
Whey powder	0.52	Structural Fractionation
Skim/defatted meal	0.55	Structural Fractionation
Starch flour	0.60	Structural Fractionation
Oil	0.70	Constitutional Isolate
Fat fraction	0.72	Constitutional Isolate
Protein concentrate	0.74	Constitutional Isolate
Protein isolate	0.78	Constitutional Isolate
Granules (agglomerated isolate)	0.80	Constitutional Isolate
Extract/oleoresin	0.86	Molecular Signal/Extract
Oleoresin (viscous)	0.88	Molecular Signal/Extract
Emulsifier powder (e.g., lecithin)	0.89	Molecular Signal/Extract
Essential oil	0.90	Molecular Signal/Extract
Modified starch powder	0.96	De Novo/Synthetic
Crystalline chemical	0.98	De Novo/Synthetic

6.4 The Functional Score

6.4.1 Definition and Motivation

The Functional Score measures the degree to which the legal and commercial identity of an ingredient is governed by its technological function rather than its biological origin. It ranges from 0.0 (identity fully source-dominant) to 1.0 (identity fully function-dominant), with the following interpretive zones:

Source-Dominant: Primary structure, bulk, calories, protein; regulatory naming follows food commodity name; technological function is implicit. Examples: base ingredients, spices, edible oils, dairy fats.
Source-Retaining, Function-Emergent: Technological role is acknowledged in naming but source remains primary or co-equal. Examples: bulking agents, humectants, firming agents, raising agents.
Function-Emergent: Technological function is primary in regulatory naming; source is secondary or parenthetical. Examples: thickeners, stabilisers, gelling agents, foaming agents, colours.
Function-Dominant: Pure tool-identity; source fully abstracted or irrelevant to classification. Examples: emulsifiers, preservatives, sequestrants, bleaching agents, carriers, propellants.

6.4.2 The Functional Score Is Not Derived from the Energy and Matter Scores

The Functional Score is not a mathematical function of the Anthropogenic Energy Score and the Matter Score. This independence is the central methodological commitment of the tri-axial framework, motivated by empirical evidence that the correlation between processing intensity, matrix distance, and functional naming is imperfect.

Two cases illustrate the independence. First, fractionated palm olein has high process intensity and substantial matrix distance, yet its regulatory naming is source-primary (“fractionated palm oil,” HS Chapter 15); its Functional Score is approximately 0.35. Second, a packaging gas such as nitrogen has moderate process intensity, but its regulatory and commercial identity is defined entirely by its physical properties and atmospheric function; its Functional Score is 0.95. No formula relating the Anthropogenic Energy Score and the Matter Score to the Functional Score would correctly place both.

The Functional Score is derived from a three-part test:

FSSAI naming test: Does the mandatory label declaration format require a functional class name as the primary identifier (“Emulsifier (lecithin)”) or a source-based name (“palm oil”)?
ITC-HS chapter test: Does the ingredient’s classification reside in source-aligned chapters (7–15) or function-aligned/chemically defined chapters (29, 35, 38)?
Judicial precedent test: Does case law require or permit functional abstraction, or does it mandate source-based disclosure for the ingredient category?

An ingredient achieves function-dominant status () only when all three tests converge on functional identity. Where tests produce conflicting signals, as with gelatin, whose gelling function supports a high Functional Score but whose animal origin triggers source-disclosure obligations under the reasoning of Ram Gaua Raksha Dal (Delhi High Court 2022), the score reflects the net regulatory position after accounting for the constraint.
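The convergence rule above can be sketched in Python. This is an illustrative sketch, not the framework's implementation: the class `ThreePartTest`, the signal labels, and the example inputs are assumptions introduced here. Only the chapter groupings (7 to 15 versus 29, 35, 38) are taken from the ITC-HS chapter test as stated in the text.

```python
from dataclasses import dataclass
from typing import List

# Chapter groupings as listed in the ITC-HS chapter test.
SOURCE_ALIGNED_CHAPTERS = set(range(7, 16))   # chapters 7-15
FUNCTION_ALIGNED_CHAPTERS = {29, 35, 38}


def hs_signal(chapter: int) -> str:
    """Return 'source', 'function', or 'unclear' for an HS chapter number."""
    if chapter in SOURCE_ALIGNED_CHAPTERS:
        return "source"
    if chapter in FUNCTION_ALIGNED_CHAPTERS:
        return "function"
    return "unclear"


@dataclass
class ThreePartTest:
    """Hypothetical container for the outcome of the three tests."""
    naming: str       # FSSAI naming test: 'source' or 'function'
    hs_chapter: int   # ITC-HS chapter number
    precedent: str    # judicial precedent test: 'source' or 'function'

    def signals(self) -> List[str]:
        return [self.naming, hs_signal(self.hs_chapter), self.precedent]

    def converges_on_function(self) -> bool:
        # Function-dominant status requires all three tests to agree.
        return all(signal == "function" for signal in self.signals())


# Illustrative inputs only; the precedent outcomes are assumed for the example.
lecithin = ThreePartTest(naming="function", hs_chapter=29, precedent="function")
palm_olein = ThreePartTest(naming="source", hs_chapter=15, precedent="source")
gelatin = ThreePartTest(naming="function", hs_chapter=35, precedent="source")

for name, case in [("lecithin", lecithin), ("palm olein", palm_olein), ("gelatin", gelatin)]:
    print(name, case.signals(), case.converges_on_function())
```

The sketch only reports whether the tests converge. It does not attempt to compute the net score for conflicting cases such as gelatin, since the text describes that step as a regulatory judgement rather than a formula.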

6.4.3 Scores Across FSSAI Functional Classes

Table 4 presents the Functional Score range for each of the twenty-two functional classes enumerated in Schedule I of the Food Products Standards and Food Additives Regulations 2011, derived from the three-part test.

Table 4: Functional Score ranges by FSSAI Schedule I functional class.

Functional Class	Score	Zone
Base ingredient (non-additive)	0.12	Source-Dominant
Taste profile / spice	0.18	Source-Dominant
Lipid base (edible oil/fat)	0.22	Source-Dominant
Bulking agent	0.35–0.40	Source-Retaining
Humectant	0.40–0.45	Source-Retaining
Firming agent	0.42–0.48	Source-Retaining
Raising agent	0.45–0.50	Source-Retaining
Flavouring agent	0.60–0.75	Function-Emergent
Thickener	0.58–0.65	Function-Emergent
Stabiliser	0.62–0.68	Function-Emergent
Gelling agent	0.65–0.70	Function-Emergent
Sweetener (bulk/intense)	0.55–0.70	Function-Emergent
Foaming agent	0.70–0.75	Function-Emergent
Colour	0.75–0.85	Function-Emergent / Dominant
Emulsifier	0.80–0.85	Function-Dominant
Anticaking agent	0.85	Function-Dominant
Acidity regulator	0.85–0.87	Function-Dominant
Antioxidant	0.87–0.88	Function-Dominant
Preservative	0.87–0.90	Function-Dominant
Antifoaming agent	0.90	Function-Dominant
Sequestrant	0.90–0.92	Function-Dominant
Bleaching agent	0.92	Function-Dominant
Flour treatment agent	0.93	Function-Dominant
Carrier	0.94	Function-Dominant
Propellant	0.95	Function-Dominant
Packaging gas	0.95	Function-Dominant

6.4.4 The Functional Score as Tie-Breaker

The primary operational contribution of the Functional dimension is resolution of ambiguity in cases where the Energy and Matter scores produce similar coordinates for ingredients that regulatory systems treat as categorically distinct. This tie-breaking function is clearest in the lipid category.

Soy lecithin (, ) and fractionated palm olein (, ) are both heavily processed and substantially abstracted from their biological matrices. On the Energy and Matter scores alone, they appear at similar positions in transformation space. But their regulatory identities diverge sharply: soy lecithin is classified as a food additive under FSSAI Schedule I (emulsifier, INS 322) and in HS Chapter 29 (phosphoaminolipids); its primary regulatory identity is functional. Fractionated palm olein is classified as a vegetable fat under FSSAI Schedule II class titles and in HS Chapter 15; its primary regulatory identity is source-based. The Functional Scores, approximately 0.82 for lecithin and 0.35 for fractionated palm olein, resolve this ambiguity and produce distinct classification outcomes.

6.5 The Energy–Matter–Function Coordinate System

Each ingredient is assigned a position in a three-dimensional coordinate space whose axes are the Anthropogenic Energy, Matter, and Functional scores. The position is the ingredient’s identity coordinate: its location in the space of processed ingredients, determined independently on each dimension.

The coordinate is not a summary statistic; it is a structured representation preserving the information carried by each dimension. Two ingredients with the same Divorce Score (derived in Section 7) may have very different coordinate profiles reflecting different kinds of identity transformation. The coordinate system allows these differences to be traced and reasoned about.

Assignment of coordinates follows the evidence hierarchy: FSSAI regulations and product standards take precedence over general labelling rules; ITC-HS chapter assignments provide independent corroboration; judicial reasoning fills gaps and resolves conflicts. Where evidence is unavailable or conflicting, the assignment is flagged as provisional and subject to revision through the contribution protocol in Appendix A.

7 The Divorce Score and Operational Zones

7.1 From Coordinates to Classification

The three-dimensional coordinate assigns an ingredient a position in transformation space, but operational deployment requires a scalar classification: a single determination of which zone an ingredient occupies. The Divorce Score serves this purpose. It aggregates the three coordinates into a single composite index placing ingredients into one of three operationally distinct zones corresponding to the three ontological positions: variant of a biological source, independent canonical entity, and functional tool.

7.2 Definition of the Divorce Score

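$$\text{Divorce Score} = 0.3 \times \text{Energy} + 0.3 \times \text{Matter} + 0.4 \times \text{Functional} \tag{2}$$

where Energy, Matter, and Functional are the Anthropogenic Energy, Matter, and Functional scores respectively, each in [0, 1]. The resulting score is also in [0, 1].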

7.2.1 Weight Rationale

The weighting scheme assigns the highest weight (0.4) to the Functional Score and equal weights (0.3 each) to the Energy and Matter scores. This allocation reflects the empirical finding that regulatory naming and trade classification, both captured by the Functional Score, are the most reliable single predictors of identity zone in borderline cases, while the Energy and Matter scores provide necessary context that the Functional Score alone cannot supply.

An ingredient can have high Energy and Matter scores while remaining in a source-primary zone if regulatory frameworks have determined that its identity should remain tied to its biological origin despite intensive processing (hydrogenated vegetable oil is the paradigm case). Conversely, an ingredient can have moderate Energy and Matter scores while being fully function-primary if its regulatory naming and HS classification are function-dominant (packaging gases are the paradigm case). In both cases, the Functional Score is the decisive variable; the 0.4 weight acknowledges this without making the Energy and Matter scores redundant.

The equal weighting of the Energy and Matter scores reflects their complementarity: the Energy score describes the transformation history while the Matter score describes the resulting state, and cases where these diverge are precisely where both pieces of information are needed to characterise the ingredient accurately.

The weights in Equation 2 are explicitly provisional. They reflect the best current judgement calibrated against the benchmark cases in Section 8. Refinement using subject matter expert input, expanded benchmark coverage, or Bayesian calibration against regulatory decision data is anticipated and invited through the contribution protocol in Appendix A.
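The weighted sum described in this section can be sketched as a small Python function. The function name, the constant names, and the range check are assumptions made for illustration; the weights (0.3, 0.3, 0.4) are those given in the weight rationale, and the example coordinates are the soya lecithin values listed in Table 5.

```python
# Weights as stated in Section 7.2.1 (described in the text as provisional).
W_ENERGY, W_MATTER, W_FUNCTIONAL = 0.3, 0.3, 0.4


def divorce_score(energy: float, matter: float, functional: float) -> float:
    """Weighted sum of the three coordinate scores."""
    for name, value in (("energy", energy), ("matter", matter), ("functional", functional)):
        # Assumption: each coordinate is normalised to the interval [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score {value} is outside [0, 1]")
    return W_ENERGY * energy + W_MATTER * matter + W_FUNCTIONAL * functional


# Coordinates for soya lecithin as listed in Table 5 (Test 23).
lecithin = divorce_score(0.82, 0.89, 0.82)
print(round(lecithin, 3))
```

Because the weights sum to 1, any change to one weight has to be offset in the others if the composite is to stay on the same scale as its inputs.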

7.3 The Three Operational Zones

7.3.1 Zone 1: Variant ()

An ingredient with is classified as a variant—a representation of a biological source sufficiently close to the source, in process history, material state, and regulatory naming, to be filed under the same canonical entity. Variants do not require independent canon entries; they are represented through the suffix system as elaborated forms of a canonical identity.

Examples of variant-zone ingredients include whole fresh produce, minimally processed grains, cold-pressed oils from named sources, dried whole spices, and named dairy products such as pasteurised milk and fresh curd. The variant zone encompasses the full range of legitimate labelling variation that does not rise to the level of a distinct regulatory or nutritional identity.

Within the variant zone, the suffix system preserves distinctions that matter commercially and culturally. A brand using “Mathania Red Chilli” is in the variant zone relative to the “Red Chilli” canon; its specific suffix records geographic origin without displacing the canon. A brand using “Kashmiri Lal Mirch” occupies the same zone with a different suffix. Both coordinate under the same canon while retaining their distinct commercial identities.
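A minimal sketch of how two commercial names could resolve to one canon while keeping distinct suffixes. The registry contents, the `origin:` suffix format, and the function name are hypothetical; the text does not specify how the suffix system is encoded.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class VariantMapping:
    canon: str               # canonical entity the variant is filed under
    suffix: Optional[str]    # distinction preserved beyond the canon, if any


# Hypothetical registry; a real system would be populated by entity resolution.
REGISTRY: Dict[str, VariantMapping] = {
    "red chilli": VariantMapping("Red Chilli", None),
    "mathania red chilli": VariantMapping("Red Chilli", "origin:Mathania"),
    "kashmiri lal mirch": VariantMapping("Red Chilli", "origin:Kashmir"),
}


def resolve(label: str) -> Optional[VariantMapping]:
    """Look up a label after case-folding and collapsing whitespace."""
    key = " ".join(label.casefold().split())
    return REGISTRY.get(key)


a = resolve("Mathania Red Chilli")
b = resolve("  Kashmiri   Lal Mirch ")
print(a.canon == b.canon, a.suffix, b.suffix)
```

Both labels share the canon, so compliance logic can treat them alike, while the suffix keeps the commercial identity available to consumer-facing output.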

7.3.2 Zone 2: Independent Canon ()

An ingredient with in constitutes an independent canon—an entity sufficiently distinct from any biological source to warrant its own canonical entry, but not so transformed that its identity is wholly defined by its technological function. Independent canons have a biological origin that remains traceable and relevant to their identity, but they are not interchangeable with other forms of that origin for regulatory, nutritional, or commercial purposes.

Examples include refined vegetable oils, dairy fat fractions (ghee, butter), fermented vinegar, modified starches before the HS Chapter 11-to-35 migration, protein concentrates, spray-dried powders of identifiable biological origin, and dehydrated fruit products.

7.3.3 Zone 3: Functional Tool ()

An ingredient with is classified as a functional tool—an entity whose identity is primarily defined by its technological role rather than its biological origin. Functional tools do not contribute directly to the nutritional or sensory character of the food from the consumer’s perspective; they are infrastructure enabling the food system to achieve technical objectives.

This does not mean functional tools are unimportant. Emulsifiers, preservatives, sequestrants, and carriers are essential to the safety, stability, and palatability of packaged foods. But their identity, for regulatory and classification purposes, follows their function, not their source. The declaration format mandated by FSSAI (“Functional Class (Specific Name or INS)”) encodes this principle in law.

Examples include emulsifiers (soya lecithin, mono- and diglycerides), preservatives (sodium benzoate, potassium sorbate), sequestrants (calcium disodium EDTA), carriers, packaging gases, and modified starches classified under HS Chapter 35. Synthetic flavouring substances—where source is not required to be declared and identity is defined by molecular structure and sensory function—also occupy this zone.

7.4 Zone Boundaries and Source Disclosure Obligations

The Divorce Score thresholds are not unconditional. Two legal constraints modify the operational effect of zone assignment.

First, the allergen disclosure requirement: FSSAI Regulation 5(14) mandates declaration of common allergens (including cereals containing gluten, peanuts, soybeans, milk, eggs, fish, crustaceans, and tree nuts) regardless of the ingredient’s zone assignment.¹⁰ A soy lecithin with a Divorce Score of 0.84 (Table 5, Test 23) is classified as a functional tool, but its soy origin must still be disclosed for allergen purposes. Zone 3 classification does not displace the allergen disclosure obligation.

¹⁰ FSSAI Labelling Regulations 2020, Regulation 5(14).

Second, the religious/ethical source disclosure requirement: as established by the Delhi High Court in Ram Gaua Raksha Dal (Delhi High Court 2022), the vegetarian/non-vegetarian origin of an ingredient must be declared regardless of its processing level or functional classification. Gelatin derived from animal bones, used as a gelling agent, carries a mandatory source-disclosure obligation on religious grounds that cannot be displaced by functional naming.

These constraints do not alter the zone assignment—the score and zone determination remain as computed—but they create additional labelling obligations that apply in parallel. The framework records these obligations as conditional metadata attached to the canonical entry.
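One way to picture obligations that "apply in parallel" is a record in which the disclosure flags are stored next to the zone but never computed from it. The class and field names below are assumptions for illustration, not the framework's schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CanonicalEntry:
    name: str
    zone: Optional[int]                     # 1, 2, or 3; None where the text gives no zone
    allergen_source: Optional[str] = None   # e.g. "soy"; None if no listed allergen
    animal_origin: bool = False             # triggers vegetarian/non-vegetarian disclosure

    def disclosure_obligations(self) -> List[str]:
        """Obligations depend only on the metadata, never on the zone."""
        obligations = []
        if self.allergen_source is not None:
            obligations.append(f"allergen declaration: {self.allergen_source}")
        if self.animal_origin:
            obligations.append("non-vegetarian source declaration")
        return obligations


lecithin = CanonicalEntry("Soya lecithin", zone=3, allergen_source="soy")
# The text does not state a zone for gelatin, so it is left unset here.
gelatin = CanonicalEntry("Gelatin (animal bone)", zone=None, animal_origin=True)

print(lecithin.disclosure_obligations())
print(gelatin.disclosure_obligations())
```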

7.5 Worked Zone Assignments

The following five examples illustrate the zone assignment process, using Table 2 and Table 3 as the reference for individual score values.

7.5.1 Cold-Pressed Sesame Oil

Cold pressing applies mechanical extraction without heat or solvent: Energy score 0.32. The resulting product is a pure triglyceride fraction with the biological source fully present in lipid form: Matter score 0.70. Regulatory naming is source-primary throughout (“sesame oil” is the mandatory declaration name, HS Chapter 15), placing this firmly in the lipid base functional category: Functional Score 0.22. Under Equation 2 these give a Divorce Score of 0.394.

Zone 2 (Independent Canon). Cold-pressed sesame oil is not a variant of whole sesame seeds, because the process and resulting state differ enough to warrant its own canonical entry, but its identity remains source-primary throughout. Solvent-extracted refined sesame oil, by contrast, carries an Energy score of 0.75 from the additional refining steps (deacidification, bleaching, deodorisation), yielding a Divorce Score of 0.523, also Zone 2 but a distinct canon with a difference of 0.129 from its cold-pressed counterpart.
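The arithmetic behind this comparison can be written out as follows. The coordinate values are taken from the Cold Pressed Oil and Refined Oil entries of Test 8 in Table 5, and the weights from Section 7.2.1; the symbol D for the Divorce Score is a notation chosen here for illustration.

```latex
\begin{align*}
D_{\text{cold-pressed}} &= 0.3(0.32) + 0.3(0.70) + 0.4(0.22) = 0.096 + 0.210 + 0.088 = 0.394 \\
D_{\text{refined}}      &= 0.3(0.75) + 0.3(0.70) + 0.4(0.22) = 0.225 + 0.210 + 0.088 = 0.523 \\
\Delta D                &= 0.523 - 0.394 = 0.129
\end{align*}
```

Only the Energy term changes between the two oils, so the whole difference of 0.129 is 0.3 times the Energy gap of 0.43.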

7.5.2 Soya Lecithin

Extraction from soybean oil through degumming, fractionation, and drying involves solvent exposure and intensive industrial separation: Energy score 0.82. The resulting phospholipid concentrate is a molecular-signal extract far removed from the whole soybean: Matter score 0.89. FSSAI Schedule I requires its declaration as “Emulsifier (INS 322)” and ITC-HS places it in Chapter 29 (phosphoaminolipids): Functional Score 0.82. Under Equation 2 these give a Divorce Score of 0.841.

Zone 3 (Functional Tool), with mandatory allergen metadata: soy origin must be disclosed under Regulation 5(14).¹¹

¹¹ FSSAI Labelling Regulations 2020, Regulation 5(14).

7.5.3 Kashmiri Red Chilli Powder

Dehusking followed by milling to fine powder: (combined processing, no heat or solvent applied). The full nutrient spectrum of the chilli is retained in fine comminuted form: . Regulatory naming is source-primary with geographic specificity retained; FSSAI treats this under spice standards: .

Zone 1 (Variant). Kashmiri Red Chilli Powder coordinates under the Red Chilli canonical family, distinguished by a geographic origin suffix. It coordinates equally with generic red chilli powder for allergen and compliance purposes while retaining its regional identity in consumer-facing declarations.

7.5.4 Acetylated Distarch Adipate (INS 1422)

Esterification of starch hydroxyl groups with both acetic and adipic moieties involves intentional covalent bond formation: . The resulting powder is classified as a modified starch under HS Chapter 35—de novo/synthetic matter: . FSSAI Schedule I requires its declaration under the modified starch additive category; ITC-HS Chapter 35 confirms function-dominant classification: .

Zone 3 (Functional Tool).

7.5.5 Fractionated Palm Olein

Controlled crystallisation and liquid-fraction separation: . The resulting product is a constitutional isolate of palm lipids: . Despite the process intensity, FSSAI Schedule II requires source-retaining naming (“fractionated palm oil” or “palm olein”) and ITC-HS retains it in Chapter 15: .

Zone 2 (Independent Canon). This example illustrates the tie-breaking function of the Functional Score directly: despite Energy and Matter values that might suggest Zone 3 proximity, the source-retaining regulatory naming anchors the ingredient firmly in Zone 2. This is not an anomaly in the model; it is precisely what the independent Functional dimension is designed to capture.

8 Benchmark Validation

8.1 Validation Approach

The 35-test benchmark introduced in Section 5.6 is applied to the Energy–Matter–Function model as defined in Section 6 and Section 7. For each test pair, the model assigns coordinates to each member, computes Divorce Scores from Equation 2, and determines zone classification. A discrimination is scored as correct if the pair members fall in different zones or, within the same zone, if the magnitude of the Divorce Score difference is sufficient to warrant distinct canonical treatment under the framework’s canonical separation criteria.

Energy and Matter score assignments draw directly from Table 2 and Table 3 as primary reference, with Functional Scores assigned from the functional class taxonomy in Table 4. Full technical derivations, including process-by-process forensic notes and defensibility ratings, are in the companion scoring report (Lalitha 2026b).
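The pairwise rule in the validation approach (different zones, or a sufficient gap within one zone) can be sketched as below. The two zone thresholds and the minimum gap are placeholders picked only so the example runs; they are not the framework's calibrated values, and the choice of which side of a boundary is inclusive is likewise an assumption.

```python
def zone(score: float, zone1_upper: float, zone2_upper: float) -> int:
    """Map a Divorce Score to zone 1, 2, or 3 given two thresholds."""
    if score < zone1_upper:
        return 1
    if score < zone2_upper:
        return 2
    return 3


def treated_as_distinct(score_a: float, score_b: float,
                        zone1_upper: float, zone2_upper: float,
                        min_gap: float) -> bool:
    """True if the pair falls in different zones, or in one zone with a large enough gap."""
    if zone(score_a, zone1_upper, zone2_upper) != zone(score_b, zone1_upper, zone2_upper):
        return True
    return abs(score_a - score_b) >= min_gap


# PLACEHOLDER parameters for illustration only; NOT the framework's thresholds.
PLACEHOLDER = dict(zone1_upper=1 / 3, zone2_upper=2 / 3, min_gap=0.03)

# Divorce Scores recomputed from the Table 5 coordinates.
print(treated_as_distinct(0.099, 0.841, **PLACEHOLDER))  # whole soya bean vs lecithin
print(treated_as_distinct(0.099, 0.117, **PLACEHOLDER))  # raw vs chilled apple
print(treated_as_distinct(0.231, 0.276, **PLACEHOLDER))  # whole wheat flour vs maida
```

Whether a given outcome counts as correct depends on the expectation for that pair: for a floor test such as raw versus chilled apple, the expected result is that the pair is not treated as distinct.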

8.2 Benchmark Results

Table 5: Benchmark Validation Results: All 35 test pairs with computed Energy, Matter, Functional, and Divorce scores.

ID	Ingredient A	Energy	Matter	Functional	Divorce	Ingredient B	Energy	Matter	Functional	Divorce	Correct?
1	Raw Apple	0.12	0.05	0.12	0.10	Chilled Apple	0.18	0.05	0.12	0.12	✓
2	Whole Wheat Flour	0.28	0.33	0.12	0.23	Maida	0.28	0.48	0.12	0.28	✓*
3	Maida	0.28	0.48	0.12	0.28	Native Starch	0.49	0.60	0.55	0.55	✓
4	Sliced Onion	0.15	0.10	0.12	0.12	Onion Powder	0.58	0.42	0.18	0.37	✓
5	Raw Milk	0.12	0.05	0.12	0.10	Pasteurised Milk	0.48	0.05	0.12	0.21	✓
6	Fresh Fruit	0.12	0.05	0.12	0.10	Dehydrated Fruit	0.58	0.36	0.15	0.34	✓
7	Raw Honey	0.12	0.05	0.12	0.10	Pasteurised Honey	0.48	0.05	0.12	0.21	✓
8	Cold Pressed Oil	0.32	0.70	0.22	0.39	Refined Oil	0.75	0.70	0.22	0.52	✓
9	Butter	0.45	0.72	0.22	0.44	Ghee	0.55	0.72	0.22	0.47	✓
10	Ghee	0.55	0.72	0.22	0.47	Anh. Milk Fat	0.82	0.72	0.82	0.79	✓
11	Liquid Veg. Oil	0.75	0.70	0.22	0.52	Vanaspati	0.92	0.72	0.55	0.71	✓
12	Vanaspati	0.92	0.72	0.55	0.71	Interester. Fat	0.91	0.72	0.82	0.82	✓
13	Milk	0.12	0.05	0.12	0.10	Dairy Whitener	0.48	0.42	0.85	0.61	✓
14	Coconut Milk	0.28	0.25	0.12	0.21	Coconut Oil	0.32	0.70	0.22	0.39	✓
15	Raw Milk	0.12	0.05	0.12	0.10	Yogurt/Curd	0.56	0.38	0.15	0.34	✓
16	Curd	0.56	0.38	0.15	0.34	Soy Dahi	0.56	0.57	0.85	0.68	✓
17	Fruit Juice	0.28	0.50	0.12	0.28	Fruit Vinegar	0.56	0.50	0.18	0.39	✓
18	Vinegar	0.56	0.50	0.45	0.52	Glacial Acetic Acid	0.99	0.98	0.99	0.99	✓
19	Cane Sugar	0.55	0.98	0.55	0.68	Xanthan Gum	0.56	0.98	0.88	0.81	✓
20	Natural Yeast	0.12	0.05	0.12	0.10	Chem. Leavening	0.99	0.99	0.99	0.99	✓
21	Wheat Flour	0.28	0.33	0.12	0.23	Maltodextrin	0.58	0.98	0.85	0.81	✓
22	Native Starch	0.49	0.60	0.55	0.55	Modified Starch	0.94	0.96	0.82	0.90	✓
23	Whole Soya Bean	0.12	0.05	0.12	0.10	Soya Lecithin	0.82	0.89	0.82	0.84	✓
24	Cane Sugar	0.55	0.98	0.55	0.68	HFCS	0.91	0.99	0.85	0.91	✓
25	Vanilla Bean	0.12	0.05	0.12	0.10	Natural Extract	0.86	0.86	0.60	0.76	✓
26	Natural Extract	0.86	0.86	0.60	0.76	Syn. Vanillin	0.98	0.98	0.99	0.99	✓
27	Chocolate	0.58	0.72	0.22	0.48	Choc. Substitute	0.91	0.72	0.85	0.73	✓
28	Natural Fibre	0.28	0.55	0.12	0.30	Purified Cellulose	0.82	0.98	0.88	0.89	✓
29	Cane Sugar	0.55	0.98	0.55	0.68	Aspartame	0.99	0.99	0.99	0.99	✓
30	Sea Salt	0.12	0.98	0.12	0.38	Sodium Benzoate	0.99	0.99	0.99	0.99	✓
31	Guar Gum	0.82	0.86	0.88	0.86	Cereal Flour	0.28	0.33	0.12	0.23	✓
32	Lemon Juice	0.28	0.50	0.12	0.28	Citric Acid	0.99	0.99	0.99	0.99	✓
33	Smoked Meat	0.58	0.10	0.12	0.25	Liquid Smoke	0.86	0.88	0.99	0.92	✓
34	Nat. β-Carotene	0.86	0.86	0.85	0.86	Syn. β-Carotene	0.99	0.99	0.99	0.99	✓
35	Bulk Ingredient	0.12	0.05	0.12	0.10	INS Carrier	0.99	0.99	0.99	0.99	✓

*Test 2 produces both members in Zone 1 (Divorce Scores 0.23 and 0.28), but with a difference of 0.05 sufficient to warrant distinct canonical entries given that FSSAI product standards 2.4.1 and 2.4.2 explicitly define them as separate regulated commodities. The model correctly does not over-differentiate them into separate zones while still producing operationally distinct canonical assignments.
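A row of Table 5 can be read mechanically and its final score column recomputed from the three coordinates. The sketch below assumes the tab-separated layout shown above (ID, name, three coordinates and a score for each ingredient, then the check mark) and the 0.3/0.3/0.4 weights from Section 7.2.1; the function names are illustrative.

```python
def parse_pair(row: str):
    """Split one tab-separated Table 5 row into two ingredient records."""
    cells = row.split("\t")

    def record(start: int) -> dict:
        energy, matter, functional, printed = (float(c) for c in cells[start + 1:start + 5])
        return {"name": cells[start], "energy": energy, "matter": matter,
                "functional": functional, "printed": printed}

    return record(1), record(6)


def recompute(rec: dict) -> float:
    """Weighted sum with the weights stated in Section 7.2.1."""
    return 0.3 * rec["energy"] + 0.3 * rec["matter"] + 0.4 * rec["functional"]


def matches_printed(rec: dict) -> bool:
    """True if the recomputed value agrees with the two-decimal printed score."""
    return abs(recompute(rec) - rec["printed"]) <= 0.005 + 1e-9


row_8 = "8\tCold Pressed Oil\t0.32\t0.70\t0.22\t0.39\tRefined Oil\t0.75\t0.70\t0.22\t0.52\t✓"
cold_pressed, refined = parse_pair(row_8)
for rec in (cold_pressed, refined):
    print(rec["name"], round(recompute(rec), 3), matches_printed(rec))
```

Running the same check over every row is a cheap way to catch transcription slips whenever the table is regenerated.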

8.3 Worked Validations

Six cases illustrate the model’s discriminatory performance across the range of the benchmark, with all calculations drawn directly from Table 2 and Table 3.

8.3.1 Test 1: Raw Apple vs. Chilled Apple (Floor Test)

Raw apple: sorting and washing only, intact cellular structure, consumed as food without functional class designation.

Chilled apple: refrigeration added to sorting and washing, no change in material state or regulatory designation.

Both are Zone 1 variants; the Divorce Score difference of 0.018 is below the threshold for canonical distinction. The framework correctly does not treat chilling as an identity-changing event. Discrimination: correct.
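For reference, the floor-test arithmetic can be reproduced from the Test 1 coordinates in Table 5 and the weights in Section 7.2.1. The symbol D for the Divorce Score is a notation chosen here for illustration.

```latex
\begin{align*}
D_{\text{raw apple}}     &= 0.3(0.12) + 0.3(0.05) + 0.4(0.12) = 0.036 + 0.015 + 0.048 = 0.099 \\
D_{\text{chilled apple}} &= 0.3(0.18) + 0.3(0.05) + 0.4(0.12) = 0.054 + 0.015 + 0.048 = 0.117 \\
\Delta D                 &= 0.117 - 0.099 = 0.018
\end{align*}
```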

8.3.2 Test 8: Cold-Pressed Oil vs. Refined Oil

Cold-pressed sesame oil (see worked zone assignment in Section 7): Divorce Score 0.394, Zone 2.

Refined sesame oil: refining adds deacidification, bleaching, and deodorisation to the cold-pressing process.

Both are Zone 2 (Independent Canon), which is correct: both are regulatory-named oils with source-primary identity. However, their Divorce Scores differ by 0.129 and their Energy scores differ by 0.43, producing distinct canonical entries. Codex CXS 19-1981 and FSSAI both recognise these as separate product designations. Discrimination: correct.

8.3.3 Test 11: Liquid Vegetable Oil vs. Vanaspati

Refined liquid vegetable oil: Divorce Score 0.523, Zone 2.

Vanaspati—hydrogenated vegetable oil with mandatory trans fat disclosure, HS 1516:

Liquid oil is Zone 2; vanaspati sits precisely at the Zone 2/Zone 3 boundary. The Functional Score of 0.55 reflects that “vanaspati” retains its FSSAI-defined product name with source-retaining naming, holding it at the upper edge of Zone 2 rather than crossing into Zone 3. The two are different canons, and the boundary position itself carries analytical meaning about vanaspati’s status as a product that is heavily transformed yet still primarily identified by its food-commodity name. Discrimination: correct.
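The two scores in this test can be reproduced from the Test 11 coordinates in Table 5 and the weights in Section 7.2.1. The symbol D for the Divorce Score is a notation chosen here for illustration.

```latex
\begin{align*}
D_{\text{liquid oil}} &= 0.3(0.75) + 0.3(0.70) + 0.4(0.22) = 0.225 + 0.210 + 0.088 = 0.523 \\
D_{\text{vanaspati}}  &= 0.3(0.92) + 0.3(0.72) + 0.4(0.55) = 0.276 + 0.216 + 0.220 = 0.712
\end{align*}
```

The Functional term accounts for 0.132 of the 0.189 gap, which is consistent with the text's point that naming, more than process intensity, determines where vanaspati sits.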

8.3.4 Test 18: Vinegar vs. Glacial Acetic Acid

Brewed vinegar: double fermentation from agricultural substrate, classified under HS 2209 (Chapter 22, beverages/vinegar), retaining biological origin in product name.

Glacial acetic acid—petrochemical synthesis, Chapter 29 (organic chemicals), FSSAI requires “SYNTHETIC – PREPARED FROM ACETIC ACID” labelling:

Vinegar is Zone 2 (Independent Canon); glacial acetic acid is Zone 3 (Functional Tool). The HS chapter migration from 22 to 29 and the FSSAI mandatory labelling distinction are both fully captured. Discrimination: correct.

8.3.5 Test 22: Native Starch vs. Modified Starch

Native wheat starch: starch isolation within HS Chapter 11.

Acetylated distarch adipate (INS 1422): covalent modification, HS Chapter 35, FSSAI additive schedule.

Native starch is Zone 2; modified starch is Zone 3. The HS chapter migration from 11 to 35—the identity snap discussed in Section 5—is faithfully represented by the zone transition. Discrimination: correct.

8.3.6 Test 23: Whole Soya Bean vs. Soya Lecithin

Whole soya bean: minimal processing, intact biological matrix.

Soya lecithin (see worked zone assignment in Section 7): Divorce Score 0.841, Zone 3.

The Divorce Score difference of 0.742 represents near-maximal discrimination: a move from Zone 1 variant to Zone 3 functional tool, driven by divergence on all three axes. Allergen metadata attaches to the lecithin canonical entry requiring soy origin disclosure under Regulation 5(14),¹² demonstrating that Zone 3 classification does not eliminate source tracking where legally required. Discrimination: correct.

¹² FSSAI Labelling Regulations 2020, Regulation 5(14).

8.4 Determinism Quotient

All 35 benchmark pairs yield correct discriminations under the model as specified. The Determinism Quotient is:

Test 2 carries an asterisk because both members fall in Zone 1; the discrimination is achieved through magnitude rather than zone boundary crossing. This is treated as a correct discrimination because the framework is designed to produce sub-zone canonical distinctions where regulatory instruments independently require them, as they do for Whole Wheat Flour versus Maida under FSSAI product standards 2.4.1 and 2.4.2.
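As a rough illustration, and assuming the Determinism Quotient is the fraction of benchmark pairs discriminated correctly (its formal definition sits in Section 5.6, outside this part of the report), the reported result corresponds to the following calculation. The function name is hypothetical.

```python
from typing import Sequence


def determinism_quotient(outcomes: Sequence[bool]) -> float:
    """Fraction of correct discriminations (assumed definition)."""
    if not outcomes:
        raise ValueError("no benchmark outcomes supplied")
    return sum(outcomes) / len(outcomes)


# Table 5 marks all 35 test pairs as correct.
outcomes = [True] * 35
dq = determinism_quotient(outcomes)
print(f"{dq:.0%}")
```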

Three cases identified during validation require calibration attention in subsequent versions: Test 10 (Ghee vs. Anhydrous Milk Fat, where the score assignment for AMF warrants review against Codex CXS 280-1973 standards), Test 16 (Curd vs. Soy Dahi, where the analogue detection relies on the Functional Score capturing the “non-biological source” signal, suggesting a future source-metadata extension), and Test 34 (Natural vs. Synthetic Beta-Carotene, where molecular identity is identical but source coordinate differs, a case where structured source metadata would strengthen the model’s discriminatory basis). These are areas for refinement, not failures; the model produces correct outputs in all three cases under the current specification.

9 Relationship to Existing Food Classification Frameworks

9.1 NOVA and Ingredient-Level Substrates

The NOVA food processing classification system classifies food products into four groups based on the extent and purpose of industrial food processing (Arora et al. 2025; Ispirova et al. 2025). NOVA Group 4 (ultra-processed foods) is defined by reference to industrial processing and the presence of ingredients typically used only in industrial production, many of which correspond to Zone 3 of the Energy–Matter–Function model.

NOVA operates at the product level: given a complete food product, it classifies the product by the nature of its processing. The Energy–Matter–Function model operates at the ingredient level: given an individual ingredient string, it assigns that ingredient a deterministic identity position. Product-level classification requires reliable ingredient-level classification as its substrate, and recent machine learning work applying NOVA to large datasets has encountered precisely this bottleneck: the absence of a principled ingredient-level scheme limits the accuracy and consistency of product-level predictions (Arora et al. 2025; Ispirova et al. 2025).

The correspondence between the two frameworks is not coincidental: both are responding to the same underlying physical and regulatory reality about how processing transforms ingredient identity. Zone 3 ingredients (functional tools defined by technological role) map directly onto the additive-classified substances that NOVA uses to identify ultra-processed products. Zone 1 and Zone 2 ingredients map onto the culinary and processed ingredients of NOVA Groups 2 and 3. The Energy–Matter–Function model makes that reality computationally deterministic at the ingredient level, which is what product-level frameworks need as input.

9.2 The ITC-HS as Ground Truth

The Indian Trade Classification (Harmonised System) has been used throughout this report as primary evidence—a regulatory system that has already resolved many ingredient identity questions through decades of judicial and administrative refinement. The Supreme Court’s ruling in Welkin Foods (Supreme Court of India 2026) places HSN classification at the top of the interpretive hierarchy for identity disputes.

The Energy–Matter–Function framework uses the ITC-HS as its primary evidence base. HS codes are necessary but not always sufficient for ingredient-level classification: two ingredients may share an HS heading while having very different Energy, Matter, and Functional scores if their processing histories and regulatory naming differ within the heading. The framework provides finer-grained resolution within and across HS headings. Where Energy–Matter–Function coordinates and HS chapter assignments converge, as they do in the majority of benchmark cases, that convergence is confirmation that the model is correctly grounded. Where they diverge, that divergence is a signal requiring investigation.

10 Limitations and Open Questions

10.1 Weight Calibration

The weights in the Divorce Score formula are provisionally assigned and have not been validated against a large empirical dataset of regulatory decisions or expert classifications. The choice of 0.4 for the Functional Score and 0.3 each for the Energy and Matter scores is analytically motivated (the reasoning is set out in Section 7), but it has not been optimised against a ground-truth corpus. Refinement of the weights using subject matter expert input, expanded benchmark coverage, or Bayesian calibration against regulatory decision data is anticipated and invited through the contribution protocol in Appendix A.

10.2 Zone Boundary Precision

The thresholds at and are calibrated to the benchmark cases but have not been validated across the full range of Indian food system ingredients. Ingredients near the thresholds may be sensitive to small changes in coordinate assignment. This sensitivity is acknowledged as a characteristic of the framework, not a failure: the zone boundaries are policy-relevant thresholds, not natural discontinuities in the physical or chemical properties of ingredients. The framework makes its current specification transparent and open to empirical refinement.

10.3 Source Coordinate Incompleteness

The Functional Score captures regulatory naming modality but does not encode the full specificity of the source coordinate: whether an ingredient is of plant or animal origin, whether it carries a geographic indication, or whether it has a specific religious or ethical status. A proposed extension treats source-metadata as a structured annotation on each canonical entry, separate from the three-dimensional coordinate system. This would allow recording “soya lecithin: source = Glycine max, vegan-compatible, allergen-flagged (soy)” as metadata attached to the Zone 3 classification without adding a fourth dimension that would complicate the Divorce Score calculation.

10.4 Dynamic Regulatory Landscape

The regulatory ground truth used to calibrate Functional Scores is a snapshot as of 2025. Food regulation in India is actively evolving: FSSAI has issued amendments, notifications, and draft regulations at increasing frequency, and the judicial landscape continues to develop (Vukka and Lalitha 2026). The framework architecture accommodates this: Functional Scores are derived from a three-part test against specific regulatory provisions, so changes to those provisions propagate to score updates without requiring a redesign. Maintaining currency with regulatory changes is an ongoing maintenance task.

10.5 Scope: Indian Regulatory Context

The framework is calibrated specifically to the Indian regulatory context: FSSAI instruments, ITC-HS schedules, and Indian judicial precedent. The Energy and Matter dimensions are grounded in chemistry and nutrition science that is internationally applicable, but the Functional dimension is context-specific. Extension to other regulatory contexts would require parallel derivation of Functional Scores from those contexts’ instruments. The architecture is designed to support such extension; the calibration work has not been performed.

11 Next Steps: Building the Faceted Ingredient System

11.1 What the Corpus Looks Like

The commercial sampling work conducted as part of this project—covering 896 stock-keeping units drawn from Indian retail channels, with full methodology to be documented in a forthcoming report—combined with the Open Food Facts India dataset (Open Food Facts contributors 2024) yields approximately 4,800 deduplicated products. Splitting ingredient declarations by comma and conjunction across the full combined corpus produces approximately 48,000 variant strings. The two sources are methodologically distinct: the 896 SKU sample is a structured retail survey; the Open Food Facts contribution is a different collection pathway with its own coverage characteristics. Both are part of this project’s data infrastructure.
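The splitting step can be sketched as follows. This is a simplified illustration and not the project's pipeline: the regular expression, the function names, and the sample declarations are assumptions, and commas inside parentheses (for example in compound ingredients) would need extra handling that is omitted here.

```python
import re
from typing import Iterable, List

# Split on commas, ampersands, and the standalone word "and".
SEPARATORS = re.compile(r"\s*(?:,|&|\band\b)\s*", flags=re.IGNORECASE)


def split_declaration(declaration: str) -> List[str]:
    """Split one ingredient declaration into candidate variant strings."""
    parts = (part.strip(" .") for part in SEPARATORS.split(declaration))
    return [part for part in parts if part]


def variant_strings(declarations: Iterable[str]) -> List[str]:
    """Collect variant strings across declarations, deduplicated case-insensitively."""
    seen = {}
    for declaration in declarations:
        for part in split_declaration(declaration):
            seen.setdefault(part.casefold(), part)  # keep the first spelling seen
    return list(seen.values())


sample = [
    "Sugar, Refined Wheat Flour (Maida), Edible Vegetable Oil and Iodised Salt.",
    "sugar and Milk Solids",
]
print(variant_strings(sample))
```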

The processes and physical forms documented in the Energy and Matter reference tables of this report were derived from systematic examination of what actually appears across those 48,000 strings: not from prior literature alone, but from the empirical evidence of how Indian packaged food manufacturers describe their ingredients on commercial labels. The variant corpus is the empirical foundation on which the Energy–Matter–Function framework rests.

11.2 The Classification Task Ahead

The immediate next step is to build the faceted ingredient system: assigning Energy–Matter–Function coordinates and Divorce Scores to each of the 729 canonical entities in the Encyclopedia v0.1 taxonomy (Lalitha 2026a), and then mapping the approximately 48,000 variant strings to their canonical families through the entity resolution pipeline.

This is a forward task. What the variant corpus has provided so far is the empirical basis for defining processes, matter classes, and functional zones: the population of real forms and transformations that any viable framework must handle. The next phase applies the Energy–Matter–Function model to classify each canonical entity systematically, extend those classifications through the suffix system to geographic, cultivar, and preparation-state variants, and attach the legal metadata (allergen flags, source disclosure obligations) that Zone assignments alone do not capture.

Each variant string will carry, as its output: a canonical ID, a zone classification, a Divorce Score, suffix metadata encoding whatever distinctions the variant expresses beyond the canon, and any applicable legal disclosure flags. That structured output is what downstream systems (allergen detection, supply chain coordination, nutritional research, regulatory compliance) require as their input.
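The per-variant output described above could be represented as a small serialisable record. The field names, the identifier format, and the flag vocabulary below are hypothetical; only the list of fields (canonical ID, zone, score, suffix metadata, disclosure flags) follows the text, and the example values for soya lecithin are those given in Table 5 and Section 7.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class VariantRecord:
    variant_string: str                      # label text as it appears on pack
    canonical_id: str                        # hypothetical identifier format
    zone: int                                # 1, 2, or 3
    divorce_score: float
    suffix: Optional[str] = None             # distinctions beyond the canon, if any
    disclosure_flags: List[str] = field(default_factory=list)


record = VariantRecord(
    variant_string="Emulsifier (INS 322)",
    canonical_id="CANON-SOYA-LECITHIN",      # illustrative only
    zone=3,
    divorce_score=0.84,
    disclosure_flags=["allergen:soy"],
)

# JSON keeps the record readable by downstream systems in any language.
payload = json.dumps(asdict(record), sort_keys=True)
print(payload)
```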

11.3 Governance and Expert Input

The classification of 729 canonical entities will not be completed by computational methods alone. Score assignments in the boundary regions—Zone 1/Zone 2 transitions for moderately processed ingredients, Zone 2/Zone 3 transitions for ingredients with mixed regulatory signals—require domain expertise that food scientists, food lawyers, customs practitioners, and nutritional researchers hold.

The expert input process described in Appendix B is the mechanism for incorporating this knowledge. What is available computationally is the framework specification, the benchmark as a quality standard, and the variant corpus as the empirical scope of the problem. What domain experts contribute is the evidence-based judgement about where specific ingredients fall within that framework, particularly in the cases that the benchmark was designed to surface as hard.

11.4 Planned Outputs

The classification work will produce a versioned update to the Encyclopedia of Indian Food Ingredients, with EMF coordinates and zone classifications attached to each canonical entry, and with full technical derivations reviewed against the companion scoring report (Lalitha 2026b). A versioned update protocol will be implemented so that regulatory changes propagate to score updates in a traceable and documented manner. The source metadata extension (structured annotation for origin-specific data including botanical source, geographic indication status, and religious or ethical classification) will be developed alongside the coordinate assignments.

The goal of this work is a publicly accessible, versioned, expert-reviewed ingredient classification system that any downstream application—NOVA-based product classification, allergen detection, supply chain systems, nutritional databases—can use as a stable, deterministic substrate.

Acknowledgments

My deepest gratitude to Mr. Krishna, whose constancy forms the foundation upon which all my work, including this, quietly rests.

Salutations to the Goddess who dwells in all beings in the form of intelligence. I bow to her again and again.

This report was prepared as part of the Indian Food Informatics Data (IFID) project at the Interdisciplinary Systems Research Lab (iSRL). The synthesis draws upon extensive legal research and domain analysis conducted for food informatics applications.

Appendix A: Critique and Contribution Protocol

Purpose

Measurement frameworks that affect regulatory decisions, commercial classifications, and consumer communications must be robust to expert scrutiny. This protocol establishes the conditions under which critiques of the EMF framework will be engaged with substantively. It welcomes contributions from domain experts while maintaining the evidentiary standards that give the framework its analytical credibility.

The protocol is not gatekeeping. It is a quality filter distinguishing contributions that advance the framework from commentary that, however sincere, does not provide the specific, evidence-based refinements that the framework requires. Expert critique meeting the protocol’s requirements will be acknowledged, documented, and incorporated into future versions.

Conditions for a Valid Critique

Evidentiary Support from Permitted Sources

Every factual assertion in a critique must be supported by at least one source from the following categories: official Government of India gazettes including FSSAI regulations and compendiums; DGCI&S Indian Trade Classification schedules and official explanatory notes; original judgments from the Supreme Court of India or High Courts, obtained from official court repositories; Codex Alimentarius Commission standards and guidelines; JECFA reports and evaluations; peer-reviewed scientific literature published in indexed journals.

The following are not permitted: marketing materials, industry association publications, or brand websites; commercial legal database summaries; blog posts, news articles, or trade press regardless of publication prominence; unpublished or unreviewed claims regardless of the credentials of the claimant.

Specific Line-Level Identification

A valid critique must identify the specific claim, score, or framework element being challenged, specifying: which section, table, or equation contains the element; what the current value or claim is; what the proposed alternative value or claim is; and why the proposed alternative is better supported by evidence than the current formulation. General claims that the framework is “incorrect” or “incomplete” without this specificity do not constitute actionable critique.

Benchmark Consistency Check

If the proposed revision would alter a score, zone threshold, or weight parameter, the critique must demonstrate that the revised formulation still produces correct discriminations for the 35-test benchmark. A revision that corrects one case while failing another provides weaker grounds for adoption than a revision that improves overall benchmark performance.
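The consistency check can be illustrated with a comparison of per-test outcomes before and after a proposed revision. This is a sketch under stated assumptions: the function name, the pass/fail dictionary layout, and the test IDs (`T01` and so on) are hypothetical, and the actual benchmark harness may be organised differently.

```python
def compare_benchmark(baseline: dict, revised: dict) -> dict:
    """Compare pass/fail results keyed by test ID for two formulations."""
    if baseline.keys() != revised.keys():
        raise ValueError("both runs must cover the same test IDs")
    # Tests that failed before and pass after the revision.
    fixed = sorted(t for t in baseline if not baseline[t] and revised[t])
    # Tests that passed before and fail after the revision.
    regressed = sorted(t for t in baseline if baseline[t] and not revised[t])
    return {
        "fixed": fixed,
        "regressed": regressed,
        "net_change": len(fixed) - len(regressed),
    }


# Three hypothetical tests standing in for the full benchmark.
baseline = {"T01": True, "T02": False, "T03": True}
revised = {"T01": True, "T02": True, "T03": False}
report = compare_benchmark(baseline, revised)
print(report)
```

In this toy run the revision corrects one case and breaks another, which is the situation the protocol treats as weaker grounds for adoption.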

Critique Submission Format

Section/Element: [Identify the specific section, table, equation, or score]

Current Formulation: [State the current claim, value, or assignment]

Proposed Revision: [State the proposed alternative]

Evidence: [Cite at least one permitted source]

Benchmark Impact: [State how the revision affects the 35-test benchmark, with specific test IDs]

Contact: [Contact details for correspondence]
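A submission in this format can be checked mechanically for completeness before review. The sketch below is illustrative only: the function names are hypothetical, it assumes one `Label: value` pair per line, and it treats a value still wrapped in square brackets as an unfilled template placeholder. The sample section, scores, and evidence text are invented placeholders.

```python
REQUIRED_FIELDS = (
    "Section/Element",
    "Current Formulation",
    "Proposed Revision",
    "Evidence",
    "Benchmark Impact",
    "Contact",
)


def parse_critique(text: str) -> dict:
    """Collect the recognised 'Label: value' lines of a submission."""
    found = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        label, value = label.strip(), value.strip()
        if sep and label in REQUIRED_FIELDS:
            # A value left as "[...]" is the untouched template text.
            if value.startswith("[") and value.endswith("]"):
                value = ""
            found[label] = value
    return found


def missing_fields(text: str) -> list:
    """Return the required fields that are absent or left unfilled."""
    found = parse_critique(text)
    return [name for name in REQUIRED_FIELDS if not found.get(name)]


sample = """Section/Element: Section 4, Table 2
Current Formulation: score 0.40
Proposed Revision: score 0.55
Evidence: FSSAI gazette notification (placeholder)
Benchmark Impact: no change expected for tests T01 to T35
Contact: [Contact details for correspondence]
"""
print(missing_fields(sample))
```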

Appendix B: Ways to Contribute

The EMF framework is an open research project. Contributions from domain experts are essential to its development. Two engagement pathways are available.

Asynchronous Expert Input (2–4 hours per month). Every two weeks, the research team compiles open questions that have not been resolved through the team’s own analysis—typically concerning ingredient-level score assignments where regulatory evidence is ambiguous, benchmark cases where the model’s output warrants review, and weight calibration questions where expert judgement can supplement analytical reasoning. Contributors respond at their own pace; there is no expectation of real-time engagement.

Systems Researcher Engagement (10–15 hours per week). Researchers with domain expertise in food science, food law, informatics, or nutritional science who wish to engage more deeply with the framework’s development are invited to join the research team. This engagement involves participation in framework development, validation work, and the preparation of technical reports.

All contributions, critique submissions, and expressions of interest should be directed through: https://isrl.in/join-us.html

All critiques submitted in the format described in Appendix A will receive a written response within thirty days. Contributors whose input leads to a modification of the framework will be acknowledged in the subsequent version, with a description of the modification they proposed or supported. Contributions received but not adopted will be acknowledged with an explanation.

References

Arora, Nalin, Aviral Chauhan, Siddhant Rana, et al. 2025. Application of Machine Learning to Predict Food Processing Level Using Open Food Facts.

Broughton, Vanda. 2006. “The Need for a Faceted Classification as the Basis of All Methods of Information Retrieval.” Aslib Proceedings 58 (1–2): 49–72.

Delhi High Court. 2022. Ram Gaua Raksha Dal v. Union of India and Others, W.P.(C) 12055/2021.

Directorate General of Commercial Intelligence and Statistics. 2007a. Indian Trade Classification (H.S.): Chapter 11 — Products of the milling industry; malt; starches; inulin; wheat gluten. https://www.dgciskol.gov.in/Writereaddata/Downloads/2007/CHP_11.pdf.

Directorate General of Commercial Intelligence and Statistics. 2007b. Indian Trade Classification (H.S.): Chapter 15 — Animal or vegetable fats and oils. https://www.dgciskol.gov.in/Writereaddata/Downloads/CHP_15.pdf.

Directorate General of Commercial Intelligence and Statistics. 2007c. Indian Trade Classification (H.S.): Chapter 35 — Albuminoidal substances; modified starches; glues; enzymes. https://dgciskol.gov.in/Writereaddata/Downloads/2007/CHP_35.pdf.

Food Safety and Standards Authority of India. 2023. Food Safety and Standards (Labelling and Display) Regulations, 2020 (Version-VI, 22.02.2023). https://www.fssai.gov.in/upload/uploadfiles/files/Comp_Labelling.pdf.

Food Safety and Standards Authority of India. 2024. Food Safety and Standards (Food Products Standards and Food Additives) Regulations, 2011, as amended through 2024. https://fssai.gov.in/upload/uploadfiles/files/Chapter%203_Substances%20added%20to%20food.pdf.

Ispirova, Gordana, Michael Sebek, Giulia Menichetti, and Ganesh Bagler. 2025. Informatics for Food Processing.

Lalitha, A. R. 2026a. Encyclopedia of Indian Food Ingredients (v0.1.0): A Standardized Taxonomy for Indian Food Informatics. Interdisciplinary Systems Research Lab, Zenodo. https://doi.org/10.5281/zenodo.18650863.

Lalitha, A. R. 2026b. Justification Companion to EMF-Scoring Model (IFID Project). Interdisciplinary Systems Research Lab. https://doi.org/10.5281/zenodo.18713318.

Open Food Facts contributors. 2024. Open Food Facts Database. https://world.openfoodfacts.org.

Ranganathan, S. R. 1933. Colon Classification. 1st ed. Madras Library Association.

Ranganathan, S. R. 1967. “Prolegomena to Library Classification.” Annals of Library Science 14: 1–15.

Supreme Court of India. 2026. Commissioner of Customs (Import) v. M/s Welkin Foods, 2026 SCC OnLine SC 27; 2026 INSC 19.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@report{a_r2026,
  author = {A R, Lalitha},
  publisher = {iSRL},
  title = {Identity, {Transformation,} and {Function:} {A} {Tri-Axial}
    {Model} for the {Classification} of {Food} {Ingredient} {Identity}},
  number = {iSRL-26-02-R-EMF},
  date = {2026-02-20},
  url = {https://isrl.in/pub/2026-02-r-emf/},
  doi = {10.5281/zenodo.18714527},
  langid = {en}
}

For attribution, please cite this work as:

A R, Lalitha. 2026. Identity, Transformation, and Function: A Tri-Axial Model for the Classification of Food Ingredient Identity. iSRL-26-02-R-EMF. iSRL. https://doi.org/10.5281/zenodo.18714527.

Indian Supreme Court Defines Hierarchical Classification Framework for Food Products, Overruling Common Parlance Precedents

Lalitha A R — Sun, 15 Feb 2026 00:00:00 GMT

0.1 Abstract

This report synthesizes landmark judicial decisions in Indian food classification law, documenting the transition from common parlance-based interpretation to a hierarchical technical classification framework. The January 6, 2026 Supreme Court judgment in Commissioner of Customs (Import) v. M/s Welkin Foods formally established a precedential hierarchy that prioritizes statutory interpretation over lay understanding, marking a departure from pre-HSN era jurisprudence. This analysis examines four domains (GST/Tax, Food Safety, Customs, and Dietary Labels) where technical definitions now supersede common parlance, with significant implications for food informatics, regulatory compliance, and ingredient classification systems.

1 Introduction

The classification of food products under Indian law has undergone a fundamental transformation. Historically, courts relied heavily on common parlance—the everyday understanding of terms as used in ordinary commerce—to interpret ambiguous statutory language. This approach, exemplified by pre-1986 cases such as Nix v. Hedden (1893) in the United States and Ramavatar Budhaiprasad v. Assistant Sales Tax Officer (1961) in India, prioritized accessibility and commercial understanding over technical precision.

India's adoption of the Harmonised System of Nomenclature (HSN) in 1986, ahead of the system's international entry into force in 1988 under the World Customs Organisation, initiated a gradual shift toward technical classification. However, the tension between common understanding and scientific taxonomy persisted until the Supreme Court’s 2026 ruling explicitly established a hierarchy for resolving classification disputes.

This report documents this watershed moment and its implications across multiple regulatory domains.

2 Historical Context: Pre-HSN Era Common Parlance Precedents

2.1 United States: Nix v. Hedden (1893)

The seminal case Nix v. Hedden, 149 U.S. 304 (1893), established that tomatoes should be classified as vegetables rather than fruits for tariff purposes, despite their botanical classification. The Supreme Court of the United States held that classification should follow common parlance—how ordinary people in commerce understand terms—rather than technical botanical definitions.

2.2 India: Early Common Parlance Cases

2.2.1 Ramavatar Budhaiprasad v. Assistant Sales Tax Officer (1961)

In this landmark case, AIR 1961 SC 1325; (1962) 1 SCR 279; (1961) 12 STC 286, decided on March 14, 1961, the Supreme Court of India addressed whether betel leaves should be classified as vegetables for sales tax exemption purposes under the Central Provinces and Berar Sales Tax Act, 1947.

The Court held that the word “vegetables” must be interpreted not in a technical or botanical sense, but in its popular sense as understood in common language—denoting classes of vegetable matter grown in kitchen gardens or farms and used for the table. The Court stated: “It has not been defined in the Act and being a word of everyday use it must be construed in its popular sense, meaning that sense which people conversant with the subject matter with which the statute is dealing would attribute to it.”

The Court ruled that betel leaves, while botanically plant matter, were not vegetables in common parlance and were therefore taxable.

2.2.2 Krishna Iyer v. State of Kerala (1962)

Decided on March 6, 1962, this case from the Kerala High Court similarly applied the common parlance test to determine whether green ginger qualified as a vegetable for tax exemption purposes. The Court held that vegetables should be understood “as commonly understood denoting those classes of vegetable matter which are grown in kitchen gardens and are used for the table,” and concluded that green ginger, despite being plant matter, was included in the specific term “ginger” in the tax schedule and was therefore taxable.

2.3 The Pre-HSN Framework

Prior to India’s adoption of the HSN system in 1986, courts consistently applied common parlance as the primary interpretive tool for ambiguous statutory terms. This approach served several purposes:

It made tax classifications accessible to ordinary merchants without specialized knowledge
It aligned legal interpretations with commercial practice
It avoided the complexity of technical botanical or chemical classifications
It provided predictability based on everyday understanding

3 The Watershed Moment: Commissioner of Customs (Import) v. M/s Welkin Foods (2026)

3.1 Case Details

Citation: 2026 SCC OnLine SC 27; 2026 INSC 19
Date: January 6, 2026 (reported January 6-7, 2026)
Court: Supreme Court of India
Bench: Justice J.B. Pardiwala and Justice R. Mahadevan
Parties: Commissioner of Customs (Import) v. M/s Welkin Foods

3.2 Facts and Issue

The case concerned the proper classification of imported aluminium shelving used for mushroom cultivation. The respondent, Welkin Foods, argued the goods should be classified under Customs Tariff Item (CTI) 84369900 as “parts” of agricultural machinery. The Revenue contended the shelving should be classified under CTI 76109010 as “Aluminium Structures.”

The core legal question was whether the intended use of the product (mushroom cultivation) or its objective technical characteristics should govern classification.

3.3 The Court’s Reasoning

The Supreme Court held that classification must be based on objective characteristics of the product, not solely on intended end-use. The Court established several critical principles:

Structure vs. Machine: The shelving was held to be a “structure” (fixed in place) rather than a “part” of a machine. It did not qualify as a component essential for the mechanical function of agricultural machinery.
Material Identity Primacy: While exclusive use can sometimes influence classification, it does not override the fundamental material identity of the product when it is specifically described elsewhere in the tariff.
Hierarchical Framework: Most significantly, the Court articulated a hierarchy for classification disputes, stating: “It is only in a state of statutory silence, where the legislative intent remains unexpressed, that the tribunals or courts may resort to the common or trade parlance test.”

3.4 The Established Hierarchy

The M/s Welkin Foods judgment established the following precedential hierarchy for food and product classification:

Judicial Interpretation of Statute (highest priority)—How the court reads the HSN and statutory provisions
Technical/Scientific Definition—When statute provides technical guidance through HSN codes
Expert Opinion—Testimony from qualified experts in relevant fields
Common Parlance—Trade usage and ordinary understanding (only in statutory silence)

This hierarchy fundamentally reorients Indian classification jurisprudence, relegating common parlance from its historical primacy to a fallback position.

4 Domain-Specific Applications

The hierarchical principle articulated in M/s Welkin Foods is mirrored across multiple regulatory domains, including in decisions that predate the judgment, demonstrating the pervasiveness of technical classification over common understanding.

4.1 GST and Tax Domain: Scientific Composition Prevails

4.1.1 In re Gajanand Foods Private Limited

The Gujarat Authority for Advance Ruling (GAAR) and subsequently the Appellate Authority (AAAR) addressed whether instant mix flours containing spices, leavening agents, and other additives should be classified under Chapter Headings 1102 or 1106 (attracting 5% GST) or under Heading 2106 90 (attracting 18% GST).

Ruling: The AAAR held that instant mix flours for products like Gota, Khaman, Dhokla, Idli, Dosa, Handvo, and others, containing 5-37% additional ingredients (spices, salt, sodium bicarbonate, chili powder), are classifiable under HSN 2106 90 as “Food Preparations not elsewhere specified or included,” attracting 18% GST.

Rationale: The technical composition, including functional additives that transformed the product from mere flour into a meal preparation kit, removed it from the common category of “flour.” The presence of leavening agents, spices, and other ingredients meant for creating a specific dish demonstrated that these were food preparations, not basic flours.

4.1.2 In re Ramdev Food Products Private Limited

In a parallel case, the Gujarat AAAR addressed similar instant mix flours produced by Ramdev Food Products, including instant mixes for Gota, Khaman, Dalwada, Dahiwada, Idli, Dhokla, Dosa, Pizza, Methi Gota, and Handvo.

Ruling: The AAAR upheld the AAR’s classification of these products under HSN 2106 90, attracting 18% GST. The Authority rejected arguments based on VAT-era precedents, holding that “merely because the end consumer of the Instant Mix Flour is required to follow certain food preparation processes before such product(s) can be consumed, is no ground to take these products out of Chapter Heading 2106.”

Key Principle: The technical composition and processing state—not the trade name or common understanding—governs classification under the HSN system.

4.2 Food Safety Domain: Nutrient Thresholds Over Marketing Terms

4.2.1 3S and Our Health Society v. Union of India

Case Details: Writ Petition (Civil) No. 437/2024
Court: Supreme Court of India
Bench: Justice J.B. Pardiwala and Justice R. Mahadevan (initial disposal: April 9, 2025)
Subsequent hearings: February 2026

This ongoing public interest litigation seeks mandatory Front-of-Package Warning Labels (FoPWL) on packaged food products containing high levels of sugar, salt, and saturated fats.

Court’s Direction: The Supreme Court directed the Food Safety and Standards Authority of India (FSSAI) to prioritize scientific thresholds of salt, sugar, and saturated fats over the food industry’s preferred marketing terminology. The Court emphasized that consumer health protection requires objective, scientifically determined nutrient levels rather than subjective or trade-based descriptions.

Implication: The Court’s insistence on scientific measurement over industry terminology reflects the same hierarchical principle articulated in M/s Welkin Foods: technical, objective criteria supersede commercial nomenclature.

4.3 Customs Domain: Engineering Function Over Trade Nomenclature

The M/s Welkin Foods case itself exemplifies this domain. The Supreme Court held that aluminium racks for mushroom cultivation are technically structures (Chapter 76) and not machinery (Chapter 84) because they lack mechanical function, regardless of their trade name or intended agricultural use.

This establishes that in customs classification, the technical characteristics—material composition and functional properties—override the commercial designation or end-use of a product.

4.4 Dietary Labels Domain: Biological Origin Disclosure Mandatory

4.4.1 Ram Gaua Raksha Dal v. Union of India and Others

Case Details: W.P.(C) 12055/2021
Court: Delhi High Court
Bench: Justice Vipin Sanghi and Justice Jasmeet Singh (December 2021) / Justice Vipin Sanghi and Justice Dinesh Kumar Sharma (March 2022)
Date: December 9, 2021; subsequent order March 2, 2022

This case challenged the inadequate disclosure of animal-sourced ingredients in packaged food products, particularly where International Numbering System (INS) codes obscure the biological origin of food additives.

Court’s Ruling: The Delhi High Court held that the biological origin of ingredients must be disclosed, stating that “every person has a right to know as to what he/she is consuming, and nothing can be offered to the person on a platter by resort to deception, or camouflage.”

The Court directed that:

Food Business Operators must make full and complete disclosure of all ingredients, not only by their code names (INS numbers) but also by disclosing whether they originate from plant, animal source, or are manufactured in a laboratory.
The disclosure must specify the actual plant or animal source, regardless of the percentage used in the food article.
Even minuscule amounts of animal-sourced ingredients (other than milk, milk products, honey, beeswax, carnauba wax, or shellac) render the product non-vegetarian and must be disclosed accordingly.
Chemical code alone “camouflages the truth from the consumer.”
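The directions above translate into a simple rule: quantity is irrelevant, and one non-exempt animal-sourced ingredient is enough to change the product's status. The following sketch illustrates that rule. The function name and the `(name, origin, source)` tuple layout are assumptions made for the example, and the exempt set simply copies the list given in the ruling as summarised above.

```python
# Sources that do not make a product non-vegetarian, as listed above.
EXEMPT_SOURCES = {
    "milk", "milk products", "honey", "beeswax", "carnauba wax", "shellac",
}


def dietary_status(ingredients) -> str:
    """Classify a product from (name, origin, source) ingredient tuples.

    origin is one of "plant", "animal", or "synthetic". No percentage is
    taken as input, because any amount of a non-exempt animal-sourced
    ingredient is decisive.
    """
    for _name, origin, source in ingredients:
        if origin == "animal" and source.lower() not in EXEMPT_SOURCES:
            return "non-vegetarian"
    return "vegetarian"


biscuit = [
    ("Wheat flour", "plant", "wheat"),
    ("Emulsifier (INS 322)", "animal", "egg"),
]
candy = [
    ("Sugar", "plant", "sugarcane"),
    ("Glazing agent", "animal", "shellac"),
]
print(dietary_status(biscuit), dietary_status(candy))
```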

Constitutional Basis: The Court grounded this requirement in Articles 19(1)(a) (freedom of speech and information), 21 (right to life and health), and 25 (freedom of religion) of the Indian Constitution, recognizing that dietary choices based on religious, ethical, or health considerations require transparent ingredient disclosure.

Implication: This case demonstrates that in labeling disputes, the biological or chemical origin—a technical, scientific classification—supersedes simplified marketing designations or chemical code nomenclature.

5 Synthesis: The Four-Domain Framework

Table 1 synthesizes the current state of classification law across the four primary domains:

Table 1: Classification Framework Across Regulatory Domains

Domain	Classification Winner	Landmark Case
GST/Tax	Technical (HSN)	Gajanand Foods; Ramdev Food Products
Food Safety	Scientific (Nutrient Levels)	3S and Our Health Society
Customs	Technical (Engineering)	M/s Welkin Foods
Dietary Labels	Biological Origin	Ram Gaua Raksha Dal

6 Mandatory Disclosure Requirements: Implications for Food Informatics

The cases analyzed in this report establish several mandatory disclosure requirements under Indian law. These requirements have direct implications for food informatics systems, ingredient databases, and regulatory compliance frameworks.

6.1 Source Disclosure

Requirement: Even if an ingredient is heavily processed or becomes a derivative compound, the source must be declared.

Example: Lecithin must be labeled as “Lecithin (Soy)” or “Lecithin (Egg),” not merely as “Lecithin” or by INS code alone.

Legal Basis: Ram Gaua Raksha Dal v. Union of India

6.2 Dietary Status

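Requirement: Products must be classified as vegetarian (which may contain milk and milk products), non-vegetarian (containing any other animal-derived ingredient, including egg), or vegan (no animal-derived ingredient, including dairy).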

Principle: Even if an ingredient is a chemical derived from an animal source, it must be declared with respect to its source.

Legal Basis: Ram Gaua Raksha Dal v. Union of India; Food Safety and Standards (Labelling and Display) Regulations, 2020

6.3 Allergen Declaration

Requirement: Even if processing is extreme and the allergen potency is reduced from the source level, allergen presence must still be declared for consumer safety.

Legal Basis: Right to health (Article 21 of the Constitution); FSSAI allergen declaration requirements

6.4 Functional Class for Chemicals

Requirement: For chemical additives, the functional class (purpose of inclusion) must be declared followed by the INS Number.

Example: “Preservative (INS 202)” rather than merely “INS 202” or “Potassium Sorbate.”

Legal Basis: FSSAI Labelling Regulations 2020; functional usage declaration requirements
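The two label formats used as examples in this section, source disclosure and functional class followed by INS number, can be produced by small formatting helpers. This is a minimal sketch: the function names are hypothetical and the output strings follow only the examples given in this report, not the full text of the regulations.

```python
def source_label(ingredient: str, source: str) -> str:
    """Format an ingredient with its declared source, e.g. 'Lecithin (Soy)'."""
    if not ingredient.strip() or not source.strip():
        raise ValueError("ingredient and source are both required")
    return f"{ingredient.strip()} ({source.strip()})"


def additive_label(functional_class: str, ins_number: int) -> str:
    """Format an additive as functional class followed by its INS number."""
    if not functional_class.strip():
        raise ValueError("functional class is required")
    return f"{functional_class.strip()} (INS {ins_number})"


print(source_label("Lecithin", "Soy"))
print(additive_label("Preservative", 202))
```

Raising an error on an empty source or functional class makes the omission visible at data-entry time rather than on the printed label.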

7 Analytical Implications for Food Informatics

The hierarchical framework and mandatory disclosure requirements have profound implications for food informatics systems:

7.1 Attribute-Based Ingredient Classification

The cases reveal that the determination of whether something constitutes a separate ingredient entity versus a variant of an existing entity depends on functional attributes rather than source alone.

Key Principle: Functionality takes precedence over source material when determining ingredient separateness.

For example, in the instant mix flour cases, the presence of functional additives (leavening agents, spices used for specific culinary purposes) transformed what might be considered a “variant of flour” into a distinct food preparation. The additives were not merely processing aids but functional components that changed the nature of the product.

7.2 Hierarchical Data Modeling

Food informatics databases must now implement hierarchical classification systems that mirror the judicial hierarchy:

Statutory Classification Layer: HSN codes and tariff classifications
Technical/Scientific Layer: Chemical composition, functional properties, biological origin
Commercial Layer: Trade names, common parlance terms (lowest priority)
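One way to express this layering in a data model is an ordered lookup that consults the commercial layer only when the higher layers are silent. The sketch below is an assumption-laden illustration: the `Layer` enum and `resolve` function are hypothetical names, and the example values are taken from the competing descriptions in the Welkin Foods dispute discussed earlier.

```python
from enum import IntEnum


class Layer(IntEnum):
    """Classification layers, lowest number = highest priority."""
    STATUTORY = 1    # HSN codes and tariff classifications
    TECHNICAL = 2    # composition, functional properties, biological origin
    COMMERCIAL = 3   # trade names and common parlance terms


def resolve(claims: dict):
    """Return (layer, value) from the highest-priority layer that has one."""
    for layer in sorted(Layer):
        value = claims.get(layer)
        if value:
            return layer, value
    raise LookupError("no layer supplies a classification")


shelving = {
    Layer.COMMERCIAL: "part of agricultural machinery",
    Layer.STATUTORY: "CTI 76109010",
}
print(resolve(shelving))

# With no statutory or technical entry, the commercial term is the fallback.
print(resolve({Layer.COMMERCIAL: "part of agricultural machinery"}))
```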

7.3 Mandatory Metadata Requirements

Any comprehensive food ingredient database must now capture:

Biological/chemical source (plant, animal, synthetic)
Specific source species/material (even if heavily processed)
Functional class (for additives)
Allergen status (regardless of processing degree)
Dietary classification (vegetarian, vegan, non-vegetarian)
HSN code classification
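The checklist above can be enforced as a completeness check on each database entry. This is a sketch only: the class, field names, and helper are hypothetical, and treating functional class as required only for additives is an assumption read from the "(for additives)" qualifier in the list. The HSN code is left unset in the example rather than guessed.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IngredientMetadata:
    """Hypothetical record covering the six items listed above."""
    origin: Optional[str] = None            # plant, animal, or synthetic
    source_species: Optional[str] = None    # specific source material
    functional_class: Optional[str] = None  # required for additives only
    allergen_status: Optional[str] = None
    dietary_class: Optional[str] = None
    hsn_code: Optional[str] = None
    is_additive: bool = False               # helper flag, not a listed item


def missing_metadata(record: IngredientMetadata) -> list:
    """List the required fields that are still unset."""
    missing = []
    for f in fields(record):
        if f.name == "is_additive":
            continue
        if f.name == "functional_class" and not record.is_additive:
            continue
        if getattr(record, f.name) is None:
            missing.append(f.name)
    return missing


lecithin = IngredientMetadata(
    origin="plant",
    source_species="soy",
    functional_class="Emulsifier",
    allergen_status="contains soy",
    dietary_class="vegetarian",
    is_additive=True,
)
print(missing_metadata(lecithin))
```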

8 Conclusion

The January 6, 2026 Supreme Court judgment in Commissioner of Customs (Import) v. M/s Welkin Foods represents a watershed moment in Indian food classification jurisprudence. By formally establishing a hierarchical framework that prioritizes statutory interpretation and technical definition over common parlance, the Court has fundamentally reoriented how food products are classified across multiple regulatory domains.

This shift reflects the increasing complexity of food systems, the globalization of trade through standardized systems like HSN, and the constitutional imperative for transparent disclosure that enables informed consumer choice. The historical reliance on common parlance, while accessible and commercially grounded, proved insufficient in an era of complex food processing, international trade nomenclature, and diverse dietary requirements based on health, religion, and ethics.

The four-domain analysis presented in this report—spanning GST/Tax, Food Safety, Customs, and Dietary Labels—demonstrates the consistency with which Indian courts are now applying technical classification principles. In each domain, technical or scientific attributes supersede lay understanding or commercial nomenclature.

For food informatics systems, regulatory compliance frameworks, and ingredient databases, these decisions mandate a fundamental restructuring. Classification systems must be hierarchical, metadata must capture technical attributes (source, function, composition), and disclosure must prioritize scientific accuracy over commercial simplicity.

This report serves as a synthesis of century-spanning jurisprudential evolution, documenting the transition from common parlance dominance to technical hierarchy supremacy. It provides legal researchers, food & beverage lawyers, compliance professionals, and informatics specialists with a comprehensive framework for understanding and applying current Indian food classification law.

8.1 Acknowledgments

My deepest gratitude to Mr. Krishna, whose constancy forms the foundation upon which all my work, including this, quietly rests.

Salutations to the Goddess who dwells in all beings in the form of intelligence. I bow to her again and again.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@report{a_r2026,
  author = {A R, Lalitha},
  publisher = {iSRL},
  title = {Indian {Supreme} {Court} {Defines} {Hierarchical}
    {Classification} {Framework} for {Food} {Products,} {Overruling}
    {Common} {Parlance} {Precedents}},
  number = {iSRL-26-02-R-SCFood},
  date = {2026-02-15},
  url = {https://isrl.in/pub/2026-02-r-scfood/},
  doi = {10.5281/zenodo.18651646},
  langid = {en}
}

For attribution, please cite this work as:

A R, Lalitha. 2026. Indian Supreme Court Defines Hierarchical Classification Framework for Food Products, Overruling Common Parlance Precedents. iSRL-26-02-R-SCFood. iSRL. https://doi.org/10.5281/zenodo.18651646.

Encyclopedia of Indian Food Ingredients

Lalitha A R — Wed, 11 Feb 2026 00:00:00 GMT

Abstract

A standardized taxonomy and multi-format dataset (JSON, Markdown, LaTeX) covering 600+ food components — from traditional Ayurvedic botanicals to contemporary industrial additives. Bridges conventional culinary knowledge with international food standards to establish a machine-readable framework for Indian food data systems.

Repository

Source data and formats available at: https://github.com/ifid-data/encyclopedia

Reuse

CC BY 4.0

Citation

BibTeX citation:

@dataset{a_r2026,
  author = {A R, Lalitha},
  publisher = {iSRL},
  title = {Encyclopedia of {Indian} {Food} {Ingredients}},
  number = {iSRL-26-02-B-Encyclopedia},
  date = {2026-02-11},
  url = {https://isrl.in/pub/2026-02-b-encyclopedia/},
  doi = {10.5281/zenodo.18650863},
  langid = {en}
}

For attribution, please cite this work as:

A R, Lalitha. 2026. “Encyclopedia of Indian Food Ingredients.” iSRL-26-02-B-Encyclopedia. iSRL, February 11. https://doi.org/10.5281/zenodo.18650863.

Indian Food Ingredients & Label Variants

Lalitha A R — Sun, 01 Feb 2026 00:00:00 GMT

This version has been superseded

This dataset is no longer maintained. The v1 approach was found to be structurally inadequate for the problem it was designed to solve. The full reasoning is documented below. The dataset remains available for reference at the link above.

For current work, see the Identity, Transformation, and Function framework and its justification companion.

We released Indian Food Ingredients & Label Variants (v1) with the goal of making ingredient label text parseable by machines. The dataset standardised ingredient names — mapping kashmiri chilli to chilli, for instance — on the assumption that a normalised vocabulary would make automated parsing tractable.

Two problems emerged as data collection continued.

First, the approach trades away information the project is now explicitly committed to preserving. The data makes this concrete.

                            canon               variant
0                      A2 Protein            a2 protein
1  Acesulfame Potassium (INS 950)          acesulfame k
2  Acesulfame Potassium (INS 950)  acesulfame potassium
3  Acesulfame Potassium (INS 950)     sweetener ins 950
4           Acetic Acid (INS 260)           acetic acid

import pandas as pd

df = pd.read_csv("data/ingredients.csv", header=None, names=["canon", "variant"])

# All variants that map to Chilli in v1
chilli = df[df["canon"] == "Chilli"].copy()

# The ones that carry regional and variety-level identity
regional = chilli[chilli["variant"].str.contains(
    "kashmiri|mathania|jalapeño|lal mirch", case=False
)].reset_index(drop=True)

print(regional.to_string(index=False))

 canon                                                          variant
Chilli                                                  kashmiri chilli
Chilli                                               kashmiri lal mirch
Chilli                                                    mild jalapeño
Chilli salt with spices and condiments chillies and capsicum lal mirchi
Chilli                 spices and condiments kashmiri red chilli powder
Chilli                 spices and condiments mathania red chilli powder
Chilli                                      stalkless kashmiri chillies

In v1, every row above maps to Chilli. Kashmiri lal mirch, mathania red chilli powder, stalkless kashmiri chillies — all collapsed into the same canon as chili powder and red chilly flakes.
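One way to make the loss concrete is to list, per canon, the variety-level words that the canon name alone can no longer recover. The following is a minimal sketch in plain Python, using a hand-copied subset of the rows printed above rather than the full `data/ingredients.csv`. The `QUALIFIERS` pattern and the `qualifiers_lost` helper are illustrative assumptions, not part of the v1 dataset.

```python
import re
from collections import defaultdict

# Hand-copied subset of the v1 rows shown above (not the full dataset).
ROWS = [
    ("Chilli", "kashmiri chilli"),
    ("Chilli", "kashmiri lal mirch"),
    ("Chilli", "mild jalapeño"),
    ("Chilli", "spices and condiments mathania red chilli powder"),
    ("Chilli", "stalkless kashmiri chillies"),
    ("Acesulfame Potassium (INS 950)", "acesulfame k"),
    ("Acesulfame Potassium (INS 950)", "sweetener ins 950"),
]

# Hypothetical list of variety-level qualifiers, assumed for illustration.
QUALIFIERS = re.compile(r"kashmiri|mathania|jalapeño")


def qualifiers_lost(rows):
    """Map each canon to the set of qualifiers found in its variants."""
    lost = defaultdict(set)
    for canon, variant in rows:
        # update() with an empty list still registers the canon as a key.
        lost[canon].update(QUALIFIERS.findall(variant))
    return dict(lost)


lost = qualifiers_lost(ROWS)
for canon, quals in lost.items():
    print(canon, sorted(quals))
```

On this subset, the additive canon loses nothing, while Chilli absorbs three distinct qualifiers. That asymmetry is the difference between noise and signal discussed next.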

The brands that wrote these labels did not have to. Kashmiri chilli could have been declared as chilli — it would have been legally compliant. The choice to name it specifically was a choice to preserve something: a regional identity, a flavour profile, a cultural referent that Indian consumers recognise and reach for. The v1 mapping erases that choice.

This is not only a question of cultural fidelity. Ingredient identity has legal and fiscal consequences. Fresh alphonso mangoes attract 0% GST as agricultural produce; mango pulp processed from a specific GI-tagged variety enters a different regulatory category. Kashmiri chilli carries a Geographical Indication; a generic chilli does not. When a mapping table collapses these into one canon, it does not simplify the data; it destroys the signal that downstream regulatory, taxation, and traceability systems depend on. Respecting the taste of India is not a sentiment; it is a data integrity requirement.

Second, the ingredient name space in Indian packaged food is too diverse for automated mapping to be reliable. The problem splits into two structurally different cases:

Semantic variants (spelling differences, typos, punctuation variation, and alternative names for the same thing) can be resolved with a comprehensive mapping table, because the variation is noise around a stable referent. Chenna, bengal gram flour, and chickpea flour are different names for the same thing. Palmitate and palm oil are not: they sound similar but are distinct ingredients, so string similarity alone cannot decide a mapping.
Cultural and linguistic variants — regional names, transliterations, variety-level distinctions (like alphonso mango) — cannot be mapped reliably because the variation itself carries meaning. A model trained on such a mapping would not learn the differences; it would erase them.
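Part of the first case can be handled mechanically without touching the second. The following is a minimal sketch of a noise-only normaliser that folds case, punctuation, and spacing but never drops or replaces a word. The function name `normalise_noise` and the sample strings are illustrative assumptions. Typos and synonyms would still need a curated table.

```python
import re
import unicodedata


def normalise_noise(label_text: str) -> str:
    """Fold case, punctuation and spacing; never drop or replace a word."""
    text = unicodedata.normalize("NFKC", label_text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    return " ".join(text.split())         # collapse runs of whitespace


# Noise around a stable referent collapses to one string.
assert normalise_noise("Acetic Acid (260)") == normalise_noise("acetic  acid, 260")

# Variety-level wording survives, so it is never merged into plain "chilli".
print(normalise_noise("Kashmiri  Chilli"))  # kashmiri chilli
print(normalise_noise("Chilli"))            # chilli
```

The design choice is that this step is lossless with respect to words: anything it merges differed only in formatting, so no normalisation decision about meaning is hidden inside it.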

Maintaining a single mapping table that handles both cases conflates the problem. In practice, it means tracking every normalisation decision made during data cleaning — effectively a log of every typo fixed across thousands of rows — with no mechanism to distinguish meaningful variation from noise.

The ingredient substrate under development makes this mapping unnecessary. A deterministic identity layer — one that assigns canonical identifiers to ingredients independent of how they are written on any given label — eliminates the need for probabilistic name matching at parse time. Labels are parsed against the substrate, not against a maintained vocabulary of variants.

The v1 dataset will remain available for reference. The label variants mapping will not be maintained going forward.

This brings us to the question of how we extract the variants in a way that preserves the signal.

How do we formalise that milk solids feels like it should be under milk while butter feels different? How do we measure the distance between a variant and its source ingredient?

These questions led to a food classification framework inspired by Ranganathan’s 1933 Colon Classification¹ ² and grounded in Indian judicial and regulatory precedents: FSSAI, ITC-HS, and court rulings.

¹ Colon Classification (Faceted Classification) by S R Ranganathan, Father of Indian Library Science.

² Instead of a flat list, faceted classification lets us express a single object as a set of values across independent dimensions — the way filtering by price, type, and brand on Amazon works, rather than browsing a single ranked list.
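The idea in footnote 2 can be sketched in a few lines. The following is an illustration only, not the framework's actual definition: the facet names are borrowed from the title of the Identity, Transformation, and Function framework, while every facet value and the `facet_distance` measure are hypothetical assumptions chosen to make the milk solids and butter question concrete.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Faceted:
    """One item described by independent facets (values are illustrative)."""
    identity: str
    transformation: str
    function: str


# Hypothetical facet values, chosen only to make the example concrete.
ITEMS = {
    "milk": Faceted("milk", "none", "base"),
    "milk solids": Faceted("milk", "dried", "base"),
    "butter": Faceted("milk", "churned", "fat"),
}


def select(items, **wanted):
    """Filter on any combination of facets, like filtering by price and brand."""
    return [
        name
        for name, item in items.items()
        if all(getattr(item, facet) == value for facet, value in wanted.items())
    ]


def facet_distance(a: Faceted, b: Faceted) -> int:
    """Count the facets on which two items differ (a deliberately crude measure)."""
    return sum(getattr(a, f.name) != getattr(b, f.name) for f in fields(a))


print(select(ITEMS, identity="milk", function="base"))      # ['milk', 'milk solids']
print(facet_distance(ITEMS["milk"], ITEMS["milk solids"]))  # 1
print(facet_distance(ITEMS["milk"], ITEMS["butter"]))       # 2
```

Under these assumed values, milk solids sits one facet away from milk and butter sits two away, which is one possible way to express the intuition that butter "feels different".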

Reuse

CC BY 4.0

Citation

BibTeX citation:

@dataset{a_r2026,
  author = {A R, Lalitha},
  publisher = {iSRL},
  title = {Indian {Food} {Ingredients} \& {Label} {Variants}},
  number = {iSRL-26-02-DS-Variants},
  date = {2026-02-01},
  url = {https://isrl.in/pub/2026-02-ds-variants/},
  doi = {10.5281/zenodo.1871452},
  langid = {en},
  abstract = {**This dataset has been superseded.** The v1 mapping
    approach — standardising ingredient label variants to a canonical
    vocabulary — was found to conflate noise reduction with meaningful
    cultural and linguistic variation. This document explains why the
    approach was abandoned and what replaced it. A mapping of 2500+
    regional ingredient variations as observed in Indian labels. This
    dataset provides a structured mapping of the diverse ways
    ingredients are named on Indian food packaging, linking variants
    (the actual text found on labels) to a canon (a standardised, clean
    category). Example mapping: Canon: Acetic Acid (INS 260) — Variants:
    acidity regulator 260, vinegar, ins 260, acetic acid (260).}
}

For attribution, please cite this work as:

A R, Lalitha. 2026. “Indian Food Ingredients & Label Variants.” iSRL-26-02-DS-Variants. iSRL, February 1. https://doi.org/10.5281/zenodo.1871452.