# profner_classification_master

**Repository Path**: hf-datasets/profner_classification_master

## Basic Information

- **Project Name**: profner_classification_master
- **Description**: Mirror of https://huggingface.co/datasets/luisgasco/profner_classification_master
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-08-09
- **Last Updated**: 2025-08-09

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

---
dataset_info:
  features:
  - name: tweet_id
    dtype: string
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': SIN_PROFESION
          '1': CON_PROFESION
  splits:
  - name: train
    num_bytes: 711780
    num_examples: 2786
  - name: validation
    num_bytes: 238488
    num_examples: 999
  - name: test
    num_bytes: 242754
    num_examples: 1001
  download_size: 807660
  dataset_size: 1193022
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

# Binary Classification Dataset: Profession Detection in Tweets

This dataset is a derived version of the original **PROFNER** task, adapted for binary text classification. The goal is to determine whether a tweet **mentions a profession or not**.

## 🧠 Objective

Each example contains:
- A `tweet_id` (document identifier)
- A `text` field (full tweet content)
- A `label`, which has been normalized into two classes:
  - `CON_PROFESION`: The tweet contains a reference to a profession.
  - `SIN_PROFESION`: The tweet does not contain any profession-related term.

## 📦 Dataset Structure

The dataset is formatted as a `DatasetDict` with three splits:

| Split        | Examples | Description                                                  |
|--------------|----------|--------------------------------------------------------------|
| `train`      | 2,786    | Balanced set containing both classes                         |
| `validation` | 999      | Roughly equal distribution of profession/no-profession tweets |
| `test`       | 1,001    | Balanced set for evaluating binary classification            |
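
The balance described in the table can be checked by counting label ids per split. The sketch below is a minimal example that runs on a toy list; `class_distribution` and `LABEL_NAMES` are hypothetical names introduced here, and the toy input stands in for a real column such as `ds["train"]["label"]`.

```python
from collections import Counter

# Class names in id order (0, 1), as declared in the dataset card.
LABEL_NAMES = ["SIN_PROFESION", "CON_PROFESION"]

def class_distribution(label_ids, names=LABEL_NAMES):
    """Return {class name: (count, fraction)} for an iterable of integer label ids."""
    counts = Counter(label_ids)
    total = sum(counts.values())
    return {
        name: (counts.get(i, 0), counts.get(i, 0) / total if total else 0.0)
        for i, name in enumerate(names)
    }

# Toy input standing in for ds["train"]["label"] (assumption: integer ids).
toy_labels = [0, 1, 1, 0, 1, 0]
print(class_distribution(toy_labels))
```

Running the same function on each split gives the actual counts rather than relying on the descriptions above.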

Each example follows the structure:

```python
{
  "tweet_id": "1242399976644325376",
  "text": "Nuestros colaboradores y conductores se quedan en casa!",
  "label": 1  # ClassLabel id: 1 = CON_PROFESION, 0 = SIN_PROFESION
}
```

The `label` column is a Hugging Face `ClassLabel` feature: values are stored as integer ids, and the feature converts between ids and class names.

## 🔄 Label Mapping

The dataset uses the following class labels:

```python
label_list = ["SIN_PROFESION", "CON_PROFESION"]
label2id = { "SIN_PROFESION": 0, "CON_PROFESION": 1 }
id2label = { 0: "SIN_PROFESION", 1: "CON_PROFESION" }
```

This mapping is stored in the dataset's `ClassLabel` feature (part of `datasets.Features`), so it is applied automatically when the dataset is loaded.
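
When working outside `datasets` (for example, when decoding a classifier's predicted ids), both dictionaries can be derived from `label_list` instead of being typed by hand. This is a small sketch; `decode_predictions` is a hypothetical helper name, not part of the dataset or of any library.

```python
label_list = ["SIN_PROFESION", "CON_PROFESION"]

# Derive both directions from the ordered list so they cannot drift apart.
label2id = {name: idx for idx, name in enumerate(label_list)}
id2label = {idx: name for name, idx in label2id.items()}

def decode_predictions(pred_ids):
    """Map an iterable of integer class ids to their class names."""
    return [id2label[int(i)] for i in pred_ids]

print(decode_predictions([1, 0, 1]))
```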

## 📥 How to Load

```python
from datasets import load_dataset

ds = load_dataset("luisgasco/profner_classification_master")
print(ds["train"][0])

# Show the feature schema
print(ds["train"].features)

# Show the label of one example as an integer id and as a string
example = ds["train"][5]
print(example["label"])  # integer class id
print(ds["train"].features["label"].int2str(example["label"]))  # class name
```

## ✍️ Author

Processed by [Luis Gasco](https://huggingface.co/luisgasco) for educational purposes, based on the PROFNER corpus.