# profner_classification_master

**Repository Path**: hf-datasets/profner_classification_master

## Basic Information

- **Project Name**: profner_classification_master
- **Description**: Mirror of https://huggingface.co/datasets/luisgasco/profner_classification_master
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-08-09
- **Last Updated**: 2025-08-09

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

---
dataset_info:
  features:
  - name: tweet_id
    dtype: string
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': SIN_PROFESION
          '1': CON_PROFESION
  splits:
  - name: train
    num_bytes: 711780
    num_examples: 2786
  - name: validation
    num_bytes: 238488
    num_examples: 999
  - name: test
    num_bytes: 242754
    num_examples: 1001
  download_size: 807660
  dataset_size: 1193022
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

# Binary Classification Dataset: Profession Detection in Tweets

This dataset is a derived version of the original **PROFNER** task, adapted for binary text classification. The goal is to determine whether a tweet **mentions a profession or not**.

## 🧠 Objective

Each example contains:

- A `tweet_id` (document identifier)
- A `text` field (full tweet content)
- A `label`, which has been normalized into two classes:
  - `CON_PROFESION`: The tweet contains a reference to a profession.
  - `SIN_PROFESION`: The tweet does not contain any profession-related term.
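The dataset card metadata lists the number of examples in each split (2786 train, 999 validation, 1001 test). As a small worked example, the sketch below turns those counts into proportions, which come out at roughly 58% / 21% / 21%. The counts are copied from the metadata; `SPLIT_SIZES` and `split_shares` are hypothetical names introduced only for this illustration.

```python
# Split sizes copied from the dataset card metadata (num_examples per split).
SPLIT_SIZES = {"train": 2786, "validation": 999, "test": 1001}


def split_shares(sizes):
    """Return each split's share of all examples as a fraction in [0, 1]."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


shares = split_shares(SPLIT_SIZES)
for name, share in shares.items():
    # Print the raw count next to the percentage of the full dataset.
    print(f"{name}: {SPLIT_SIZES[name]} examples ({share:.1%})")
```

Nothing here touches the network; it only restates the metadata arithmetic.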
## 📦 Dataset Structure

The dataset is formatted as a `DatasetDict` with three splits:

| Split        | Description                                                        |
|--------------|--------------------------------------------------------------------|
| `train`      | Balanced dataset containing both classes                           |
| `validation` | Contains an equal distribution of profession/no-profession tweets  |
| `test`       | Also balanced, for evaluating binary classification                |

Each example follows the structure:

```python
{
    "tweet_id": "1242399976644325376",
    "text": "Nuestros colaboradores y conductores se quedan en casa!",
    "label": "CON_PROFESION"  # or "SIN_PROFESION"
}
```

The `label` column is implemented with Hugging Face `ClassLabel`. The value is stored as an integer id (the example above shows it by name for readability), and `ClassLabel` makes it easy to convert between the string and integer representations.

## 🔄 Label Mapping

The dataset uses the following class labels:

```python
label_list = ["SIN_PROFESION", "CON_PROFESION"]

label2id = {
    "SIN_PROFESION": 0,
    "CON_PROFESION": 1
}

id2label = {
    0: "SIN_PROFESION",
    1: "CON_PROFESION"
}
```

This mapping is applied automatically through the `ClassLabel` feature declared in the dataset's `datasets.Features`.

## 📥 How to Load

```python
from datasets import load_dataset

ds = load_dataset("luisgasco/profner_classification_master")
print(ds["train"][0])

# Show features
print(ds["train"].features)

# View the label of one example as an integer id and as a string
example = ds["train"][5]
print(example["label"])  # integer id
print(ds["train"].features["label"].int2str(example["label"]))  # label name
```

## ✍️ Author

Processed by [Luis Gasco](https://huggingface.co/luisgasco) for educational purposes, based on the PROFNER corpus.
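The split table in the Dataset Structure section describes each split as balanced. One way to check that after loading is to count the integer label ids per split. The sketch below is a minimal, dependency-free illustration: `class_balance` is a hypothetical helper, and the input is a toy list, not data taken from the dataset.

```python
from collections import Counter

# Label names in id order, as listed in the Label Mapping section.
LABEL_NAMES = ["SIN_PROFESION", "CON_PROFESION"]


def class_balance(label_ids):
    """Map each label name to its share of `label_ids` (an iterable of 0/1 ids)."""
    counts = Counter(label_ids)
    total = sum(counts.values())
    if total == 0:
        # Empty input: report zero share for every class instead of dividing by zero.
        return {name: 0.0 for name in LABEL_NAMES}
    return {name: counts.get(i, 0) / total for i, name in enumerate(LABEL_NAMES)}


# Toy input for illustration only; these are not values from the dataset.
toy_labels = [0, 1, 1, 0, 1, 0]
balance = class_balance(toy_labels)
print(balance)
```

With the dataset loaded as in the How to Load section, passing `ds["train"]["label"]` to `class_balance` should report the actual proportions for the training split, assuming the column yields integer ids as described above.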