========
6.0.2
========
----------------
New Features & Enhancements
----------------
* Introducing new Vision Language Models:
* InternVL (SPARKNLP-1123)
* Florence-2 (SPARKNLP-1131)
* Introducing Partition and PartitionTransformer for unstructured data processing (SPARKNLP-1174)
----------------
Bug Fixes
----------------
* Adjusted python type annotations for the `AutoGGUFModel` (#14576)
========
6.0.1
========
----------------
New Features & Enhancements
----------------
* Introducing new Vision Language Models:
* SmolVLM (SPARKNLP-1115)
* PaliGemma + PaliGemma2 (SPARKNLP-1121)
* Gemma 3 (SPARKNLP-1124)
* Adding additional parameter options to the PDF Reader (SPARKNLP-1158)
----------------
Bug Fixes
----------------
* Fixed typo in Python for RoBertaForMultipleChoice (SPARKNLP-1177)
* Fixing typos in various Jupyter notebooks
========
6.0.0
========
----------------
New Features & Enhancements
----------------
* Introducing new large language models:
* OLMo model support (SPARKNLP-1006)
* Phi 3.5 Vision model support (SPARKNLP-1060)
* LLAVA model support (SPARKNLP-1033)
* CoHere model support (SPARKNLP-1032)
* Qwen2-VL model support (SPARKNLP-1077)
* Llama 3.2 Vision models (SPARKNLP-1078)
* Deepseek Janus model support (SPARKNLP-1088)
* Added LLAVA v1.5 7b quantized model
* Added StarCoder2 3b int8 model
* New MultipleChoice Transformers:
* AlbertForMultipleChoice (SPARKNLP-1105)
* DistilBertForMultipleChoice (SPARKNLP-1106)
* RoBertaForMultipleChoice (SPARKNLP-1107)
* XlmRoBertaForMultipleChoice (SPARKNLP-1108)
* New file format support:
* Excel files reader (SPARKNLP-1102)
* PowerPoint files reader (SPARKNLP-1103)
* PDF reader (SPARKNLP-1098)
* Text reader (SPARKNLP-1113)
* Other improvements:
* AutoGGUFVisionModel for vision model support (SPARKNLP-1079)
* Added Extractor to SparkNLP (SPARKNLP-1109)
* Updated Python and Scala model names
* Improved error handling for AutoGGUF models
----------------
Bug Fixes
----------------
* Fixed typo in MXBAI notebook
========
5.5.3
========
----------------
Bug Fixes & Enhancements
----------------
* BGEEmbeddings: The default pretrained model for BGEEmbeddings has been changed from "bge_base" to "bge_small_en_v1.5". Users relying on the old default will need to explicitly specify "bge_base" in the pretrained method.
* Added useCLSToken parameter to allow users to choose between CLS token pooling and attention-based average pooling for sentence embeddings.
* BGEEmbeddings: BGEEmbeddings now supports a `useCLSToken` parameter, which defaults to True. This affects the embedding calculation strategy, so existing users should verify their usage and potentially set this parameter explicitly (see the sketch after this list).
* Added HasClsTokenProperties in Scala: Introduced the HasClsTokenProperties trait in Scala providing useCLSToken parameter functionality for relevant annotators.
* Fixing wrong padding in attention mask in `MPNet`, `BGE`, `E5`, `Mxbai`, `Nomic`, `SnowFlake`, and `UAE`. This resulted in wrong inference results in some cases and not equal to the ONNX version in transformers/sentence-transformers.
* Various Performance Optimizations: Multiple changes across different models (Albert, Bart, CLIP, CamemBert, ConvNextClassifier, DeBerta, DistilBert, E5, Instructor, MPNet, Mxbai, Nomic, RoBerta, SnowFlake, UAE, ViTClassifier, VisionEncoderDecoder, Wav2Vec2, XlmRoBertaClassification, XlmRoberta) appear to focus on performance improvements and code cleanup, especially related to OpenVINO and ONNX inference. These may lead to faster inference times.
* Fixing Llama3 download issue in Python.
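A minimal sketch of both BGEEmbeddings changes above, assuming the Python setter mirrors the new parameter name as `setUseCLSToken`:
```python
from sparknlp.annotator import BGEEmbeddings

# Pin the old default explicitly; the new default is "bge_small_en_v1.5".
embeddings = (
    BGEEmbeddings.pretrained("bge_base")
    .setInputCols(["document"])
    .setOutputCol("bge_embeddings")
    # True (the default) pools the CLS token; False switches to
    # attention-based average pooling for sentence embeddings.
    .setUseCLSToken(True)
)
```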
========
5.5.2
========
----------------
New Features & Enhancements
----------------
* OpenVINO Support for Transformers (PR #14408):
Added OpenVINO inference support to a broad range of transformer-based annotators, including DeBertaForQuestionAnswering, DeBertaForSequenceClassification, RoBertaForTokenClassification, XlmRobertaForZeroShotClassification, BartTransformer, GPT2Transformer, and many others.
* BLIPForQuestionAnswering Transformer (PR #14422):
Introduced a new transformer BLIPForQuestionAnswering for image-based question answering tasks. The transformer processes images alongside associated questions to provide relevant answers.
* AutoGGUFEmbeddings Annotator (PR #14433):
Added AutoGGUFEmbeddings to support embeddings from AutoGGUFModels, providing rich sentence embeddings. Includes an end-to-end example notebook for usage.
* HTML Parsing into DataFrame (PR #14449):
Introduced sparknlp.read().html() to parse local or remote HTML files and convert them into structured Spark DataFrames for easier analysis.
* Email Parsing into DataFrame (PR #14455):
Added sparknlp.read().email() method to parse email files into structured DataFrames, enabling scalable analysis of email content; a usage sketch follows this feature list. (Note: Dependent on #14449)
* Microsoft Word Document Parsing into DataFrame (PR #14476):
Added a new feature to parse .docx and .doc files into a Spark DataFrame, streamlining the integration of Word documents into NLP pipelines.
* Microsoft Fabric Support (PR #14467):
Introduced support for leveraging Microsoft Fabric for word embeddings storage and retrieval, enhancing scalability and efficiency.
* cuDNN Upgrade Instructions on Databricks (PR #14451):
Added instructions on upgrading cuDNN for GPU inference and cleaned up redundant Databricks installation instructions.
* ChunkEmbeddings Metadata Preservation (PR #14462):
Modified ChunkEmbeddings to preserve the original chunk’s metadata in the resulting embeddings, ensuring richer contextual information is retained.
* Default Names and Languages for Annotators (PR #14469):
Updated default names and language configurations for newly created seq2seq annotators to improve consistency and clarity.
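A quick sketch of the new reader entry points described above; the file paths are illustrative:
```python
import sparknlp

spark = sparknlp.start()

# Parse a local or remote HTML page into a structured Spark DataFrame.
html_df = sparknlp.read().html("https://www.wikipedia.org")

# Parse a directory of email files into a DataFrame (depends on #14449).
email_df = sparknlp.read().email("./email-files")

html_df.printSchema()
```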
----------------
Bug Fixes
----------------
* Spark Version Errors (PR #14467):
Resolved issues related to long Spark versions when integrating Microsoft Fabric support.
========
5.5.1
========
----------------
New Features & Enhancements
----------------
* `BertForMultipleChoice` Transformer Added. Enhanced BERT’s capabilities to handle multiple-choice tasks such as standardized test questions and survey or quiz automation.
* Integrated New Tasks and Documentation:
* Added support and documentation for the following tasks:
* Automatic Speech Recognition
* Dependency Parsing
* Image Captioning
* Image Classification
* Landing Page
* Question Answering
* Summarization
* Table Question Answering
* Text Classification
* Text Generation
* Text Preprocessing
* Token Classification
* Translation
* Zero-Shot Classification
* Zero-Shot Image Classification
* `PromptAssembler` Annotator Introduced. Introduced a new annotator that constructs prompts for LLMs using a chat template and a sequence of messages. Accepts an array of tuples with roles (“system”, “user”, “assistant”) and message texts. Utilizes llama.cpp as a backend for template parsing, supporting basic template applications.
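A minimal sketch of the new annotator; the template string, the input data shape, and the `setChatTemplate` setter are assumptions based on the description above:
```python
from sparknlp.base import PromptAssembler

# A toy Jinja-style chat template (assumed here for illustration).
template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
)

# Assumes an active Spark session from sparknlp.start().
messages = [[("system", "You are a helpful assistant."), ("user", "Hello!")]]
data = spark.createDataFrame([messages]).toDF("messages")

prompt_assembler = (
    PromptAssembler()
    .setInputCol("messages")
    .setOutputCol("prompt")
    .setChatTemplate(template)
)
```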
----------------
Bug Fixes
----------------
* Resolved Pretrained Model Loading Issue on DBFS Systems.
* Fixed a bug where pretrained models were not found when running AutoGGUF model pipelines on Databricks due to incorrect path handling of gguf files.
========
5.5.0
========
----------------
New Features & Enhancements
----------------
* Introduced QWEN2Transformer (#14188)
* Introduced MiniCPM (#14205)
* Introduced NLLB (#14209)
* Implemented Nomic embeddings (#14217)
* Introduced CamemBertForZeroShotClassification annotator (#14354)
* Implemented Mxbai Embeddings (#14355)
* Introduced AlbertForZeroShotClassification (#14361)
* Introduced Phi-3 (#14373)
* Implemented Starcoder2 for causal language modeling (#14358)
* Integrated llama.cpp (#14364)
* Implemented SnowFlake (#14353)
* Introduced ONNX support to vision annotators (#14356)
* Introduced ONNX and OpenVINO support to Missing Annotators (#14359)
* Added OpenVINO install instructions (#14382)
* Exported notebooks for release candidate (#14393)
========
5.4.2
========
----------------
New Features & Enhancements
----------------
* Added demo notebook for Image Classification Annotators
* Added aggressiveMatching parameter to DateMatcher and MultiDateMatcher annotators
* Added aggressiveMatching parameter to DocumentSimilarityRanker annotator
========
5.4.1
========
----------------
New Features & Enhancements
----------------
* Added support for loading duplicate models in Spark NLP, allowing multiple models from the same annotator to be loaded simultaneously.
* Updated the README for better coherence and added new pages to the website.
* Added support for a stop IDs list to halt text generation in Phi, Mistral, and Llama annotators.
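For instance, a hedged sketch with the Mistral annotator, assuming the setter follows the usual `setStopTokenIds` naming:
```python
from sparknlp.annotator import MistralTransformer

mistral = (
    MistralTransformer.pretrained()
    .setInputCols(["document"])
    .setOutputCol("generation")
    # Generation halts once any of these token IDs is produced;
    # the ID below is illustrative.
    .setStopTokenIds([2])
)
```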
----------------
Bug Fixes
----------------
* Fixed the default model names for Phi2 and Mistral AI annotators.
========
5.4.0
========
----------------
New Features & Enhancements
----------------
* Added OpenVINO Runtime integration for various models, enabling enhanced inference performance. (#14246)
* Added Python APIs to incorporate OpenVINO support. (#14242)
* Introduced support for ONNX models and average pooling in ONNX-based annotators. (#14245)
* Implemented MPNet for token classification. (#14244)
* Added support for MistralAI LLM and LLAMA2. (#14243)
* Improved caching mechanisms in Streamlit demos. (#14241)
* Enhanced models' card and README documentation for Models Hub. (#14240)
* Added OpenVINO GPU dependencies. (#14236)
* Locked macOS version for runners and added missing SBT setup. (#14235)
----------------
Bug Fixes
----------------
* Fixed bugs in Colab notebooks. (#14239)
* Resolved issues with BERT backend and broken annotators. (#14238)
* Corrected LLAMA2 position ID and generation bug. (#14237)
========
5.3.3
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introduce UAEEmbeddings for sentence embeddings using Universal AnglE Embedding, aimed at improving semantic textual similarity tasks
* Introduce critical enhancements and optimizations to the processing of the CoNLL-U format for Dependency Parsers training, including enhanced multiword token handling and improved handling of missing uPos values
* Add example notebook for `DocumentCharacterTextSplitter`
* Add example notebook for `DeBertaForZeroShotClassification`
* Add example notebooks for `BGEEmbeddings` and `MPNetEmbeddings`
* Add example notebook for `MPNetForQuestionAnswering`
* Add example notebook for `MPNetForSequenceClassification`
* Implement cache mechanism for `metadata.json`, enhancing efficiency by avoiding unnecessary downloads
----------------
Bug Fixes
----------------
* Address a bug with serializing ONNX models that lack a `.onnx_data` file, ensuring better reliability in model serialization processes
* Delete redundant `Multilingual_Translation_with_M2M100.ipynb` notebook entries
* Fix Colab link for the M2M100 notebook
========
5.3.2
========
----------------
Bug Fixes
----------------
* Fix and add notebooks to import models from Hugging Face
* Add ONNX and TensorFlow notebooks
* Fix XlnetForSequenceClassification and add XlnetForTokenClassification
* Rename DistilBertForZeroShotClassification
* Add missing notebooks
* Add MPNetEmbeddings to annotator
* Fix XLMRoBertaForQuestionAnswering, XLMRoBertaForTokenClassification, and XLMRoBertaForSequenceClassification: Reverted the change in tfFile naming that was causing exceptions while loading and saving the models
* Fix documentation for sparknlp.start()
========
5.3.1
========
----------------
Bug Fixes
----------------
* Fix M2M100 not working on the second run (closing the ONNX Session by mistake)
* Fix ONNX models failing in clusters like Databricks
* Fix `ZeroShotNerClassification` issue with NerConverter
* Add Colab notebook for M2M100
========
5.3.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing Llama-2 and all the models fine-tuned based on this architecture. This is our very first CausalLM annotator in ONNX, and it comes with support for quantization in INT4 and INT8 for CPUs.
* **NEW:** Introducing `MPNetForSequenceClassification` annotator for sequence classification tasks. This annotator is based on the MPNet architecture and is designed to classify sequences of text into a set of predefined classes.
* **NEW:** Introducing `MPNetForQuestionAnswering` annotator for question answering tasks. This annotator is based on the MPNet architecture and is designed to answer questions based on a given context.
* **NEW:** Introducing `M2M100` state-of-the-art multilingual translation. M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation. The model can directly translate between the 9,900 directions of 100 languages.
* **NEW:** Introducing a new `DeBertaForZeroShotClassification` annotator for zero-shot classification tasks. This annotator is based on the DeBERTa architecture and is designed to classify sequences of text into a set of predefined classes.
* **NEW:** Implement retrieval feature in our `DocumentSimilarity` annotator. The new DocumentSimilarity ranker is a powerful tool for ranking documents based on their similarity to a given query document. It is designed to be efficient and scalable, making it ideal for a variety of RAG applications.
* Add ONNX support for `BertForZeroShotClassification` annotator.
* Add support for in-memory use of the `WordEmbeddingsModel` annotator in serverless clusters. We initially introduced this in-memory feature for users on Kubernetes clusters without any `HDFS`; today it also runs without issue `locally`, on Google `Colab`, `Kaggle`, `Databricks`, `AWS EMR`, `GCP`, and `AWS Glue` (see the sketch after this list).
* New Whisper Large and Distil models.
* Update ONNX Runtime to 1.17.0
* Support new Databricks Runtimes of 14.2, 14.3, 14.2 ML, 14.3 ML, 14.2 GPU, 14.3 GPU
* Support new EMR 6.15.0 and 7.0.0 versions
* Add notebook to fine-tune a BERT for Sentence Embeddings in Hugging Face and import it to Spark NLP
* Add notebook to import BERT for Zero-Shot classification from Hugging Face
* Add notebook to import DeBERTa for Zero-Shot classification from Hugging Face
* Update EntityRuler documentation
* Improve SBT project and resolve warnings (almost!)
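A short sketch of the in-memory mode mentioned above, using the common `glove_100d` pretrained model:
```python
from sparknlp.annotator import WordEmbeddingsModel

word_embeddings = (
    WordEmbeddingsModel.pretrained("glove_100d")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
    # Keep the lookup table in memory instead of on-disk storage,
    # which is what enables HDFS-free, serverless clusters.
    .setEnableInMemoryStorage(True)
)
```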
----------------
Bug Fixes
----------------
* Fix Spark NLP configuration to set `cluster_tmp_dir` on Databricks' DBFS via `spark.jsl.settings.storage.cluster_tmp_dir` https://github.com/JohnSnowLabs/spark-nlp/issues/14129
* Fix score calculation in `RoBertaForQuestionAnswering` annotator https://github.com/JohnSnowLabs/spark-nlp/pull/14147
* Fix optional input col validations https://github.com/JohnSnowLabs/spark-nlp/pull/14153
* Fix notebooks for importing DeBERTa classifiers https://github.com/JohnSnowLabs/spark-nlp/pull/14154
* Fix GPT2 deserialization over the cluster (Databricks) https://github.com/JohnSnowLabs/spark-nlp/pull/14177
========
5.2.3
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in XLMRoBertaForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in XLMRoBertaForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in XLMRoBertaForQuestionAnswering annotator
* Refactoring AWS SDK use in Spark NLP to reduce the overall size of the library. We have dropped the use of `bundle` and started using the `S3` SDK directly. This will also minimize incompatibilities with other libraries that use AWS SDKs
* Add new notebooks to import DeBertaForQuestionAnswering, DeBertaForSequenceClassification, and DeBertaForTokenClassification models from HuggingFace
* Add a new `DocumentTokenSplitter` notebook
* Add a new training NER notebook by using DeBerta Embeddings
* Add a new training text classification notebook by using INSTRUCTOR Embeddings
* Update `RoBertaForTokenClassification` notebook
* Update `RoBertaForSequenceClassification` notebook
* Update `OpenAICompletion` notebook with new `gpt-3.5-turbo-instruct` model
----------------
Bug Fixes
----------------
* Fix `BGEEmbeddings` not downloading in Python
========
5.2.2
========
----------------
Enhancements
----------------
* Update `aws-java-sdk-bundle` dependency to a version without any CVEs
----------------
Bug Fixes
----------------
* Fix the missing `BGEEmbeddings` from the annotator module in Python
* Add a new BGE notebook to import models into Spark NLP
* Upload the new true `BGE` models to Spark NLP for text embeddings
========
5.2.1
========
----------------
New Features & Enhancements
----------------
* Add support for Spark and PySpark 3.5 major release
* Support Databricks Runtimes of 14.0, 14.1, 14.2, 14.0 ML, 14.1 ML, 14.2 ML, 14.0 GPU, 14.1 GPU, and 14.2 GPU
* **NEW:** Introducing the `BGEEmbeddings` annotator for Spark NLP. This annotator enables the integration of `BGE` models, based on the BERT architecture, into Spark NLP. The `BGEEmbeddings` annotator is designed for generating dense vectors suitable for a variety of applications, including `retrieval`, `classification`, `clustering`, and `semantic search`. Additionally, it is compatible with `vector databases` used in `Large Language Models (LLMs)`.
* **NEW:** Introducing support for ONNX Runtime in DeBertaForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in DeBertaForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in DeBertaForQuestionAnswering annotator
* Add a new notebook to show how to import any model from `T5` family into Spark NLP with TensorFlow format
* Add a new notebook to show how to import any model from `T5` family into Spark NLP with ONNX format
* Add a new notebook to show how to import any model from `MarianNMT` family into Spark NLP with ONNX format
----------------
Bug Fixes
----------------
* Fix serialization issue in `DocumentTokenSplitter` annotator failing to be saved and loaded in a Pipeline
* Fix serialization issue in `DocumentCharacterTextSplitter` annotator failing to be saved and loaded in a Pipeline
========
5.2.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing the `CLIPForZeroShotClassification` for Zero-Shot Image Classification using OpenAI's CLIP models
* **NEW:** Introducing the `DocumentTokenSplitter` which allows users to split large documents into smaller chunks to be used in RAG with LLM models
* **NEW:** Introducing support for ONNX Runtime in T5Transformer annotator
* **NEW:** Introducing support for ONNX Runtime in MarianTransformer annotator
* **NEW:** Introducing support for ONNX Runtime in BertSentenceEmbeddings annotator
* **NEW:** Introducing support for ONNX Runtime in XlmRoBertaSentenceEmbeddings annotator
* **NEW:** Introducing support for ONNX Runtime in CamemBertForQuestionAnswering, CamemBertForTokenClassification, and CamemBertForSequenceClassification annotators
* Adding caching support for newly imported T5 models in TF format to improve performance, making it competitive with the ONNX version
* Improve ZIP util and add tests for both ZipArchiveUtil and OnnxWrapper
* Refactor ONNX and add OnnxSession to broadcast
* Update ONNX Runtime to 1.16.3
* Add a new notebook for Structured Streaming
----------------
Bug Fixes
----------------
* Fix random dimension mismatch in E5Embeddings and MPNetEmbeddings due to a missing average_pool after last_hidden_state in the output
* Fix batching exception in E5 and MPNet embeddings annotators failing when sentence is used instead of document
* Fix chunk construction when an entity is found
* Fix a bug in the library's version in Scala
* Fix Whisper models not downloading due to the wrong library version
* Fix and refactor saving best model based on given metrics during NerDL training
========
5.1.4
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing the `DocumentCharacterTextSplitter` which allows users to split large documents into smaller chunks. `DocumentCharacterTextSplitter` takes a list of separators in order and splits subtexts if they are over the chunk length, considering optional overlap of the chunks (see the sketch after this list).
* **NEW:** Introducing support for ONNX Runtime in RobertaForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in RobertaForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in RobertaForQuestionAnswering annotator
* Adding an example to load a model directly from Azure using .load() method. This example helps users to understand how to set Spark NLP to load models from Azure
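A minimal sketch of the splitter described above; treat the exact setter names as assumptions:
```python
from sparknlp.annotator import DocumentCharacterTextSplitter

splitter = (
    DocumentCharacterTextSplitter()
    .setInputCols(["document"])
    .setOutputCol("splits")
    # Separators are tried in order; a subtext is split once it
    # exceeds the chunk size.
    .setSplitPatterns(["\n\n", "\n", " ", ""])
    .setChunkSize(1000)
    # Optional overlap between consecutive chunks.
    .setChunkOverlap(100)
)
```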
----------------
Bug Fixes
----------------
* Fix a bug in the `Whisper` annotator that prevented some models from being imported
* Fix BPE Tokenizer to include a flag whether or not to always prepend a space before words (previous behavior for embeddings)
* Fix BPE Tokenizer to correctly convert and tokenize non-latin and other special characters/words
* Fix `RobertaForQuestionAnswering` to produce the same logits and indexes as the implementation in Transformer library
* Fix the return order of logits in `BertForQuestionAnswering` and `DistilBertForQuestionAnswering` annotators
========
5.1.3
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in BertForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in BertForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in BertForQuestionAnswering annotator
* **NEW:** Introducing support for ONNX Runtime in DistilBertForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in DistilBertForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in DistilBertForQuestionAnswering annotator
* **NEW:** Setting ONNX configuration such as GPU device id, execution mode, etc. via Spark NLP configs (see the sketch after this list)
* Update Whisper documentation with minimum required version of Spark/PySpark (3.4)
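A sketch of supplying such settings at session startup; the exact keys under the `spark.jsl.settings.onnx.*` namespace are assumptions:
```python
import sparknlp

# Hypothetical ONNX Runtime settings passed as Spark NLP configs.
spark = sparknlp.start(params={
    "spark.jsl.settings.onnx.gpuDeviceId": "0",
    "spark.jsl.settings.onnx.executionMode": "sequential",
})
```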
----------------
Bug Fixes
----------------
* Fix `module 'sparknlp.annotator' has no attribute 'Token2Chunk'` error in Python when using `Token2Chunk` annotator inside loaded PipelineModel
========
5.1.2
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing VisionEncoderDecoder annotator to generate captions from images
* Add missing entries in the docs and update them with the new features
* Improve beam search results in BART Transformer
========
5.1.1
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in MPNet embedding annotator
* **NEW:** Introducing support for ONNX Runtime in AlbertForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in AlbertForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in AlbertForQuestionAnswering annotator
* Implement `getVectors` feature in Word2VecModel, Doc2VecModel, and WordEmbeddingsModel annotators. This new feature allows access to the entire tokens and their vectors in the loaded model.
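For example, a sketch of the new feature (that `getVectors` returns a DataFrame is an assumption here):
```python
from sparknlp.annotator import Word2VecModel

model = Word2VecModel.pretrained()
# Access every token and its vector from the loaded model.
vectors = model.getVectors()
vectors.show(5, truncate=False)
```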
----------------
Bug Fixes
----------------
* Fix how to save and load `Whisper` models
* Fix saving ONNX model on Windows operating system
========
5.1.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing WhisperForCTC annotator for Automatic Speech Recognition (ASR)
* **NEW:** Introducing OpenAICompletion and OpenAIEmbeddings annotators
* **NEW:** Introducing MPNet Text Embeddings annotators
* **NEW:** Introducing a new BART for Zero-Shot Text Classification annotator
* **NEW:** Adding ONNX support to E5 Embeddings annotator
* **NEW:** New full support for GCP and Azure distributed storages
* New 150+ MPNet models
* New Databricks 13.3 runtime support
* New EMR 6.12.0 version support
----------------
Bug Fixes
----------------
* Fix max sentence length issue in E5Embeddings
========
5.0.2
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in ALBERT, CamemBERT, and XLM-RoBERTa annotators
* **NEW:** Implement ZeroShotNerModel annotator for zero-shot NER based on XLM-RoBERTa architecture
----------------
Bug Fixes
----------------
* Fix MarianTransformers annotator breaking with `java.lang.ClassCastException` in Python
* Fix out of 0.0/1.0 accuracy in SentenceDetectorDL and MultiClassifierDL annotators
* Fix BART issue with low temperature values that only occurred when there are no non-infinite logits satisfying the low temperature and top_k values
* Add missing E5Embeddings and InstructorEmbeddings annotators to `annotators` in Scala for easy all-in-one import
========
5.0.1
========
----------------
Bug Fixes & Enhancements
----------------
* Fix `multiLabel` param issue in `XXXForSequenceClassification` and `XXXForZeroShotClassification` annotators
* Add the missing `threshold` param to all `XXXForSequenceClassification` annotators in Python
* Fix issue with passing `spark.driver.cores` config as a param into start() function in Python and Scala
* Add new notebooks to export BERT, DistilBERT, RoBERTa, and DeBERTa models to ONNX format
========
5.0.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in Spark NLP. ONNX Runtime is a high-performance inference engine for machine learning models in the ONNX format. ONNX Runtime has proved to considerably increase the performance of inference for many models.
* **NEW:** Introducing **InstructorEmbeddings** annotator in Spark NLP 🚀. `InstructorEmbeddings` can load new state-of-the-art INSTRUCTOR Models inherited from T5 for Text Embeddings.
* **NEW:** Introducing **E5Embeddings** annotator in Spark NLP 🚀. `E5Embeddings` can load new state-of-the-art E5 Models inherited from BERT for Text Embeddings.
* **NEW:** Introducing **DocumentSimilarityRanker** annotator in Spark NLP 🚀. `DocumentSimilarityRanker` is a new annotator that uses LSH techniques present in Spark ML lib to execute approximate nearest neighbour search on top of sentence embeddings. It aims to capture the semantic meaning of a document in a dense, continuous vector space and return it to the ranker search.
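A hedged sketch of wiring the ranker on top of sentence embeddings; the setter names are assumptions based on the LSH description:
```python
from sparknlp.annotator import DocumentSimilarityRankerApproach

doc_similarity_ranker = (
    DocumentSimilarityRankerApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("doc_similarity_rankings")
    # LSH family from Spark ML; "brp" = bucketed random projection.
    .setSimilarityMethod("brp")
    .setNumberOfNeighbours(3)
)
```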
----------------
Bug Fixes
----------------
* Fix BART issue with maxInputLength
========
4.4.4
========
----------------
New Features & Enhancements
----------------
* Add `Warmup` stage to loading all Transformers for word embeddings: ALBERT, BERT, CamemBERT, DistilBERT, RoBERTa, XLM-RoBERTa, and XLNet. This helps reduce the first inference time and also validates importing external models from HuggingFace https://github.com/JohnSnowLabs/spark-nlp/pull/13851
* Add new notebooks to import ZeroShot Classifiers for Bert, DistilBERT, and RoBERTa fine-tuned based on NLI datasets https://github.com/JohnSnowLabs/spark-nlp/pull/13845
----------------
Bug Fixes
----------------
* Fix not being able to save models from XXXForSequenceClassification and XXXForZeroShotClassification annotators https://github.com/JohnSnowLabs/spark-nlp/pull/13842
========
4.4.3
========
----------------
New Features & Enhancements
----------------
* New `multilabel` parameter to switch from multi-class to multi-label on all Classifiers in Spark NLP: AlbertForSequenceClassification, BertForSequenceClassification, DeBertaForSequenceClassification, DistilBertForSequenceClassification, LongformerForSequenceClassification, RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, XlnetForSequenceClassification, BertForZeroShotClassification, DistilBertForZeroShotClassification, and RobertaForZeroShotClassification (see the sketch after this list)
* Refactor protected Params and Features to avoid unwanted exceptions during runtime https://github.com/JohnSnowLabs/spark-nlp/pull/13797
* Add proper documentation and instructions for ZeroShot classifiers: BertForZeroShotClassification, DistilBertForZeroShotClassification, and RobertaForZeroShotClassification https://github.com/JohnSnowLabs/spark-nlp/pull/13798
* Extend support for downloading models/pipelines directly by given name or S3 path in ResourceDownloader https://github.com/JohnSnowLabs/spark-nlp/pull/13796
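A short sketch of the switch on one of the listed classifiers, assuming the Python setter is `setMultilabel`:
```python
from sparknlp.annotator import BertForSequenceClassification

classifier = (
    BertForSequenceClassification.pretrained()
    .setInputCols(["document", "token"])
    .setOutputCol("class")
    # False (the default) keeps multi-class scoring (softmax over labels);
    # True switches to multi-label scoring (sigmoid per label).
    .setMultilabel(True)
)
```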
----------------
Bug Fixes
----------------
* Fix pretrained pipelines that stopped working since the 4.4.2 release on PySpark 3.0 and 3.1 versions (123 new pipelines were added) https://github.com/JohnSnowLabs/spark-nlp/pull/13805
* Fix pretrained pipelines that stopped working since the 4.4.2 release on PySpark 3.2 and 3.3 versions (120 new pipelines were added) https://github.com/JohnSnowLabs/spark-nlp/pull/13811
* Fix Java compatibility issue caused by SystemUtils dependency https://github.com/JohnSnowLabs/spark-nlp/pull/13806
========
4.4.2
========
----------------
New Features & Enhancements
----------------
* Implement a new Zero-Shot Text Classification for RoBERTa annotator called `RobertaForZeroShotClassification`
* Support Apache Spark 3.4
* Optimize BART models for memory efficiency
* Introducing `cache` feature in BartTransformer
* Improve error handling for max sequence length for transformers in Python
* Improve `MultiDateMatcher` annotator to return multiple dates
----------------
Bug Fixes
----------------
* Fix a bug in Tapas due to exceeding the maximum rank value
* Fix loading Transformer models via loadSavedModel() method from DBFS on Databricks
========
4.4.1
========
----------------
New Features & Enhancements
----------------
* Implement a new Zero-Shot Text Classification for DistilBERT annotator called `DistilBertForZeroShotClassification`
* Adding `threshold` param to `AlbertForSequenceClassification`, `BertForSequenceClassification`, `BertForZeroShotClassification`, `DistilBertForSequenceClassification`, `CamemBertForSequenceClassification`, `DeBertaForSequenceClassification`, `LongformerForSequenceClassification`, `RoBertaForQuestionAnswering`, `XlmRoBertaForSequenceClassification`, and `XlnetForSequenceClassification` annotators
* Add new notebooks to import models for `SwinForImageClassification` and `ConvNextForImageClassification` annotators for Image Classification
========
4.4.0
========
----------------
New Features
----------------
* Implement a new Zero-Shot Text Classification for BERT annotator called `BertForZeroShotClassification`
* Implement a new ConvNextForImageClassification annotator
* Introducing BART Transformer for text-to-text generation tasks like translation and summarization
* Set custom entity name in Data2Chunk via `setEntityName` param
* Add a new `nerHasNoSchema` param for NerConverter when labels coming from NerDLModel and NerCrfModel don't have any schema
----------------
Bug Fixes & Enhancements
----------------
* Fix loading `WordEmbeddingsModel` bug when loading a model from S3 via `cache_folder` config
* Fix `WordEmbeddingsModel` bug failing when it's used with `setEnableInMemoryStorage` set to `True` and LightPipeline
* Remove deprecated parameter enablePatternRegex from EntityRulerApproach & EntityRulerModel
* Deprecate Python 3.6
========
4.3.2
========
----------------
New Features & Enhancements
----------------
* Add S3 support for CoNLL(), POS(), CoNLLU() training classes https://github.com/JohnSnowLabs/spark-nlp/pull/13596
* Add support for non-schema NER (`I-` or `B-`) tags in NerConverter annotator https://github.com/JohnSnowLabs/spark-nlp/pull/13642
* Improve self-hosted examples with better documentation, Docker examples, no broken links, and more https://github.com/JohnSnowLabs/spark-nlp/pull/13575
* Improve error handling for validation evaluation in ClassifierDL and MultiClassifierDL trainable annotators https://github.com/JohnSnowLabs/spark-nlp/pull/13615
----------------
Bug Fixes
----------------
* Fix `Date2Chunk` and `Chunk2Doc` annotators compatibility with PipelineModel https://github.com/JohnSnowLabs/spark-nlp/pull/13609
* Fix `DependencyParserModel` predicting all Chunks as `<no-type>` https://github.com/JohnSnowLabs/spark-nlp/pull/13620
* Removed the `calculationsCol` parameter from MultiDocumentAssembler in Python, which doesn't actually exist https://github.com/JohnSnowLabs/spark-nlp/pull/13594
========
4.3.1
========
----------------
New Features
----------------
* Easily use external Tokenizers such as spaCy in Spark NLP pipeline
* Implement `params` parameter which can supply custom configurations to the SparkSession
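For example, a sketch assuming `params` takes a plain dict of Spark configs:
```python
import sparknlp

spark = sparknlp.start(params={
    "spark.driver.memory": "8g",
    "spark.kryoserializer.buffer.max": "1000M",
})
```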
----------------
Bug Fixes & Enhancements
----------------
* Add `entity` field to the metadata in Date2Chunk
* Fix ViT models & pipelines examples in Models Hub
========
4.3.0
========
----------------
New Features
----------------
* Implement HubertForCTC annotator for automatic speech recognition
* Implement SwinForImageClassification annotator for Image Classification
* Introducing CamemBERT for Question Answering annotator
* Implement ZeroShotNerModel annotator for zero-shot NER based on RoBERTa architecture
* Implement Date2Chunk annotator
* Enable params argument in spark_nlp start() function
* Allow reading `doc_id` from CoNLL file datasets
----------------
Bug Fixes & Enhancements
----------------
* Relocating all notebooks back to examples directory
* Improve download/loading of models & pipelines from AWS and GCP. When `cache_pretrained` is set to an AWS or GCP location, existing models/pipelines will not be copied again
* Improve GitHub templates for bug reports, documentation, and feature requests
* Add documentation to ResourceDownloader
* Refactor `ml` package to allow another DL engine in future
* Apache Spark 3.3.1 is now the base version of Spark NLP
* Spark NLP supports M2 in addition to M1. Therefore, we are renaming `spark-nlp-m1` to `spark-nlp-silicon` on Maven
* Fix calculating delimiter id in CamemBERT
* Fix loadSavedModel for private buckets
========
4.2.8
========
----------------
Bug Fixes & Enhancements
----------------
* Fix the issue with optional keys (labels) in metadata when using XXXForSequenceClassification annotators. This fixes `Some(neg) -> 0.13602075` as `neg -> 0.13602075` to be in harmony with all the other classifiers. https://github.com/JohnSnowLabs/spark-nlp/pull/13396
* Introducing a config to skip `LightPipeline` validation for `inputCols` on the Python side for projects depending on Spark NLP. This toggle should only be used for specific annotators that do not follow the convention of predefined `inputAnnotatorTypes` and `outputAnnotatorType`.
========
4.2.7
========
----------------
Bug Fixes & Enhancements
----------------
* Fix `outputAnnotatorType` issue in pipelines with `Finisher` annotator. This change adds `outputAnnotatorType` to `AnnotatorTransformer` to avoid loading `outputAnnotatorType` attribute when a stage in pipeline does not use it.
* Fix the wrong sentence index calculation in metadata by annotators in the pipeline when `setExplodeSentences` param was set to `true` in SentenceDetector annotator
* Fix the issue in `Tokenizer` when a custom pattern is used with `lookahead/-behinds` and it has `0 width` matches. This led to indexes not being calculated correctly
* Fix missing to output embeddings in `.fullAnnotate()` method when `parseEmbeddings` param was set to `True/true`
* Fix broken links to the Python API pages, as the generation of the PyDocs was slightly changed in a previous release. This makes the Python APIs accessible from the Annotators and Transformers pages like before
* Change default values of `explodeEntities` and `mergeEntities` parameters to `true`
* Better error handling when there are empty paths/relations in the `GraphExtraction` annotator. The new message will better guide the user on how to configure `GraphExtraction` to output meaningful relationships
* Removed the duplicated definition of method `setWeightedDistPath` from `ContextSpellCheckerApproach`
========
4.2.6
========
----------------
Enhancements
----------------
* Updating Spark & PySpark dependencies from 3.2.1 to 3.2.3 in provided scripts and in all the documentation
----------------
Bug Fixes
----------------
* Fix the broken TypedDependencyParserApproach and TypedDependencyParserModel annotators used in Python (this bug was introduced in 4.2.5 release)
* Fix the broken Python API documentation
========
4.2.5
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing **CamemBertForSequenceClassification** annotator in Spark NLP 🚀. `CamemBertForSequenceClassification` can load CamemBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `CamembertForSequenceClassification` for PyTorch or `TFCamembertForSequenceClassification` for TensorFlow in HuggingFace 🤗
* **NEW:** Add `AnnotatorType` validation in Spark NLP `LightPipeline`. Currently, a misconfiguration of `inputCols` in an annotator in a pipeline raises an exception when using `transform` method, but in `LightPipeline` it only outputs empty values. This behavior can confuse users, this change introduces a validation that will raise an exception now in `LightPipeline` too.
* Add outputAnnotatorType for all annotators in Python
* Add inputAnnotatorTypes and outputAnnotatorType requirement validation for all subclasses derived from `AnnotatorApproach` and `AnnotatorModel`
* Adding AnnotatorType validation in `LightPipeline`
* Add validation for the number and type of columns set in `TFNerDLGraphBuilder` annotator. In efforts to avoid wrong definition of columns when using Spark NLP annotators in Python
* Add more details to Alphabet error message in `EntityRuler` annotator to better guide users
* Add instructions on how to resolve RocksDB incompatibilities when using Spark NLP with an M1 machine
* Refactor and implement better error handling in ResourceDownloader. This change removes `getObjectFromS3`, allowing the AWS SDK to raise the corresponding error. In addition, this change also refactors ResourceDownloader to reflect the intention of each credential type on the downloader
* Implement full build and test of all unit tests based on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x major releases
* Upgrade `sbt-assembly` to `1.2.0` that comes with lots of performance improvements. This benefits those who are trying to package Spark NLP as a Fat JAR
* Update `sbt` to `1.8.0` with improvements and bug fixes, but mostly for CVEs fixes:
* Updates to Coursier 2.1.0-RC1 to address [https://github.com/advisories/GHSA-wv7w-rj2x-556x](https://github.com/advisories/GHSA-wv7w-rj2x-556x "https://github.com/advisories/GHSA-wv7w-rj2x-556x")
* Updates to Ivy 2.3.0-sbt-a8f9eb5bf09d0539ea3658a2c2d4e09755b5133e to address [https://github.com/advisories/GHSA-wv7w-rj2x-556x](https://github.com/advisories/GHSA-wv7w-rj2x-556x "https://github.com/advisories/GHSA-wv7w-rj2x-556x")
* Use the new withIncludeScala in assemblyOption instead of value
----------------
Bug Fixes
----------------
* Fix an issue with the `BigTextMatcher` Annotator, where it would not match entities with overlapping definitions. For Example, if both `lung` and `lung cancer` are defined, `lung` would not be matched in a given text. This was due to an abstraction error of one of the subclasses of the `BigTextMatcher` during construction of the underlying data structure
* Fix indexing issue for `RegexTokenizer` annotator. If the document was split into sentences, the index of the sentence inside the document was not taken into consideration for the indexes of the tokens. This would lead to further issues down the pipeline, where tokens would be filtered while unpacking them for other Annotators
* Refactor the `Resolvers` object in Spark NLP's dependency to avoid the conflict with the Resolvers inside the new `sbt`
========
4.2.4
========
----------------
New Features & Enhancements
----------------
* Introduce support for GCP storage to be allowed as `cache_pretrained` directory for keeping all downloaded models and pipelines
* Update to TensorFlow 2.7.4 with bug and CVEs fixes
* Update documentation on how to use `testDataset` param in NerDLApproach, ClassifierDLApproach, MultiClassifierDLApproach, and SentimentDLApproach
* Update installation instructions for Apple M1 chip
* Improve error handling while importing external TensorFlow models into Spark NLP
* Improve error messages when importing external models from remote storages like DBFS, S3, and HDFS
* Add support for future decoder-encoder models (2 separate models)
----------------
Bug Fixes
----------------
* Add missing setPreservePosition in NerConverter
* Add missing inputAnnotatorTypes to BigTextMatcher, ViveknSentimentModel, and NerConverter annotators
* Fix all wrong example codes provided for LemmatizerModel in Models Hub
* Fix provided notebook to import Longformer models from HF: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/transformers/HuggingFace%20in%20Spark%20NLP%20-%20Longformer.ipynb
* Fix the t5_grammar_error_corrector model to be compatible with Spark NLP 4.0+
========
4.2.3
========
----------------
New Features & Enhancements
----------------
* Implement a new control over the number of accepted columns in Python. This syncs the behavior between Scala and Python when a user sets more columns than allowed inside setInputCols
* Adding a metadata sentence key parameter in order to select which metadata field to use as the sentence for the CoNLLGenerator annotator
* Include escaping in the CoNLLGenerator annotator when writing to CSV and preserve special char tokens
* Add documentation for new `IAnnotation` feature for Scala users
* Add rules and delimiter parameters to RegexMatcher annotator to support string as input in addition to a file
```python
regexMatcher = RegexMatcher() \
    .setRules(["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]) \
    .setDelimiter(",") \
    .setInputCols(["sentence"]) \
    .setOutputCol("regex") \
    .setStrategy("MATCH_ALL")
```
----------------
Bug Fixes
----------------
* Fix NotSerializableException when WordEmbeddings is used over K8s cluster while `setEnableInMemoryStorage` is set to `true`
* Fix a bug in RegexTokenizer annotator when it outputs the wrong indexes if the pattern includes splits that are not followed by a space
* Fix training module failing on EMR due to a bad Apache Spark version detection. The following classes were fixed: `CoNLL()`, `CoNLLU()`, `POS()`, and `PubTator()`
* Fix a bug in the CoNLLGenerator annotator when a token has non-integer metadata
* Fix the wrong SentencePiece model's name required for DeBertaForQuestionAnswering and DeBertaEmbeddings when importing models
* Fix `NaNs` result in some ViTForImageClassification models/pipelines
========
4.2.2
========
----------------
New Features & Enhancements
----------------
* Add support for importing TensorFlow SavedModel from remote storages like DBFS, S3, and HDFS
* Add support for `fullAnnotate` in `LightPipeline` for path of images in Scala
* Add `fullAnnotate` method in `PretrainedPipeline` for Scala
* Add `fullAnnotateJava` method in `PretrainedPipeline` for Java
* Add `fullAnnotateImage` to `PretrainedPipeline` for Scala
* Add `fullAnnotateImageJava` to `PretrainedPipeline` for Java
* Add support for QA in the `fullAnnotate` method in `PretrainedPipeline` (see the sketch after this list)
* Add `Predicted Entities` to all Vision Transformers (ViT) models and pipelines
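A sketch of the QA-style `fullAnnotate` call on a `PretrainedPipeline`, assuming a hypothetical pipeline name from Models Hub and that the two lists pair each question with its context:
```python
from sparknlp.pretrained import PretrainedPipeline

# Hypothetical QA pipeline name; replace with a real one from Models Hub.
pipeline = PretrainedPipeline("roberta_base_qa_squad2", lang="en")

# Each question is paired with the context at the same position.
results = pipeline.fullAnnotate(
    ["What is my name?"],
    ["My name is Clara and I live in Berkeley."]
)
```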
----------------
Bug Fixes
----------------
* Unify `annotatorType` name in Python and Scala for Spark schema in Annotation, AnnotationImage and AnnotationAudio
* Fix missing indexes in `RecursiveTokenizer` annotator
========
4.2.1
========
----------------
New Features & Enhancements
----------------
* Support for multi-lingual WordSegmenter. Add `enableRegexTokenizer` feature in WordSegmenter to support word segmentation within mixed and multi-lingual content https://github.com/JohnSnowLabs/spark-nlp/pull/12854
* Add Audio/ASR (Wav2Vec2) support to LightPipeline https://github.com/JohnSnowLabs/spark-nlp/pull/12895
* Add support for Double type in addition to Float type to AudioAssembler annotator https://github.com/JohnSnowLabs/spark-nlp/pull/12904
* Improve error handling in fullAnnotateImage for LightPipeline https://github.com/JohnSnowLabs/spark-nlp/pull/12868
* Add SpanBertCoref annotator to all docs https://github.com/JohnSnowLabs/spark-nlp/pull/12889
----------------
Bug Fixes
----------------
* Fix feeding `fullAnnotate` in LightPipeline with a list, which started to fail in the 4.2.0 release
* Fix exception in ContextSpellCheckerModel when updateVocabClass is used with append set to true https://github.com/JohnSnowLabs/spark-nlp/pull/12875
* Fix exception in Chunker annotator https://github.com/JohnSnowLabs/spark-nlp/pull/12901
========
4.2.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing **Wav2Vec2ForCTC** annotator in Spark NLP 🚀. `Wav2Vec2ForCTC` can load `Wav2Vec2` models for the Automatic Speech Recognition (ASR) task. Wav2Vec2 is a multi-modal model that combines speech and text, and the first multi-modal model of its kind we welcome in Spark NLP. This annotator is compatible with all the models trained/fine-tuned by using `Wav2Vec2ForCTC` for **PyTorch** or `TFWav2Vec2ForCTC` for **TensorFlow** models in HuggingFace 🤗 (https://github.com/JohnSnowLabs/spark-nlp/pull/12767); see the sketch after this list
* **NEW:** Introducing **TapasForQuestionAnswering** annotator in Spark NLP 🚀. `TapasForQuestionAnswering` can load TAPAS Models with a cell selection head and optional aggregation head on top for question-answering tasks on tables (linear layers on top of the hidden-states output to compute logits and optional logits_aggregation), e.g. for SQA, WTQ or WikiSQL-supervised tasks. TAPAS is a BERT-based model specifically designed (and pre-trained) for answering questions about tabular data. This annotator is compatible with all the models trained/fine-tuned by using `TapasForQuestionAnswering` for **PyTorch** or `TFTapasForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **CamemBertForTokenClassification** annotator in Spark NLP 🚀. `CamemBertForTokenClassification` can load CamemBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `CamembertForTokenClassification` for PyTorch or `TFCamembertForTokenClassification` for TensorFlow in HuggingFace 🤗
(https://github.com/JohnSnowLabs/spark-nlp/pull/12752)
* Implementing `setTestDataset` to evaluate metrics on an external dataset during training of Text Classifiers in Spark NLP. This feature is similar to NerDLApproach where metrics are calculated on each Epoch and have been added to the following multi-class/multi-label text classifier annotators: `ClassifierDLApproach`, `SentimentDLApproach`, and `MultiClassifierDLApproach` (https://github.com/JohnSnowLabs/spark-nlp/pull/12796)
* Refactor and improve `EntityRuler` annotator inference, making it up to 24x faster, especially when used with a long list of labels/entities. We sped up the inference process by implementing the Aho-Corasick algorithm to match patterns in a string. This requires the following changes when using `EntityRuler` https://github.com/JohnSnowLabs/spark-nlp/pull/12634
* Add support for S3 storage in the `cache_folder` where models are downloaded, extracted, and loaded from. Previously, we only supported all local file systems, HDFS, and DBFS. This new feature is especially useful for users on Kubernetes clusters with no access to HDFS or any other distributed file systems (https://github.com/JohnSnowLabs/spark-nlp/pull/12707)
* Implementing `lookaround` functionalities in the `DocumentNormalizer` annotator. Currently, `DocumentNormalizer` has both `lookahead` and `lookbehind` functionalities. To extend support for more complex normalizations, especially within clinical text, we are introducing the `lookaround` feature (https://github.com/JohnSnowLabs/spark-nlp/pull/12735)
* Implementing `setReplaceEntities` param to `NerOverwriter` annotator to replace all the NER labels (entities) with the given new labels (entities) (https://github.com/JohnSnowLabs/spark-nlp/pull/12745)
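A minimal ASR pipeline sketch for the new `Wav2Vec2ForCTC` annotator, assuming an input DataFrame with an `audio_content` column of float arrays and a model name in the style published on Models Hub:
```python
from pyspark.ml import Pipeline
from sparknlp.base import AudioAssembler
from sparknlp.annotator import Wav2Vec2ForCTC

# Assemble raw float-array audio into Spark NLP's AUDIO annotation type.
audio_assembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

# Model name is illustrative; it loads a Wav2Vec2 CTC model for English.
speech_to_text = Wav2Vec2ForCTC.pretrained("asr_wav2vec2_base_960h", "en") \
    .setInputCols(["audio_assembler"]) \
    .setOutputCol("text")

pipeline = Pipeline(stages=[audio_assembler, speech_to_text])
```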
----------------
Bug Fixes
----------------
* Fix a bug in generating the NerDL graph by using TF v2. The previous graph generated by the `TFGraphBuilder` annotator resulted in an exception when the length of the sequence was 1. This issue has been resolved and the new graphs created by `TFGraphBuilder` won't have this issue anymore (https://github.com/JohnSnowLabs/spark-nlp/pull/12636)
* Fix a bug introduced in the 4.0.0 release between Transformer-based Word Embeddings annotators. In the 4.0.0 release, the following annotators were migrated to BatchAnnotate to improve their performance, especially on GPU. However, a bug was introduced in the sentence indices which, when combined with SentenceEmbeddings for Text Classification tasks (ClassifierDLApproach, SentimentDLApproach, and MultiClassifierDLApproach), resulted in low accuracy: AlbertEmbeddings, CamemBertEmbeddings, DeBertaEmbeddings, DistilBertEmbeddings, LongformerEmbeddings, RoBertaEmbeddings, XlmRoBertaEmbeddings, and XlnetEmbeddings (https://github.com/JohnSnowLabs/spark-nlp/pull/12641)
* Add support for a list of questions and context in LightPipeline. Previously, only one context and question at a time were supported in LightPipeline for Question Answering annotators. We have added support to `fullAnnotate` and `annotate` to receive two lists of questions and contexts (https://github.com/JohnSnowLabs/spark-nlp/pull/12653)
* Fix division by zero exception in the `GPT2Transformer` annotator when the `setDoSample` param was set to true (https://github.com/JohnSnowLabs/spark-nlp/pull/12661)
========
4.1.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing **ViTForImageClassification** annotator in Spark NLP 🚀. `ViTForImageClassification` can load Vision Transformer `ViT` Models with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for ImageNet. This annotator is compatible with all the models trained/fine-tuned by using `ViTForImageClassification` for **PyTorch** or `TFViTForImageClassification` for **TensorFlow** models in HuggingFace 🤗
* Provide support for AWS Graviton processors and ARM64 processors with architecture greater than ARMv8
* Introducing **TFNerDLGraphBuilder** annotator. `TFNerDLGraphBuilder` can be used to automatically detect the parameters of a needed NerDL graph and generate the graph within a pipeline when the default NER graphs are not suitable for your training datasets.
* Allow passing confidence scores from all XXXForTokenClassification annotators to NerConverter. From this release it is possible to access the confidence scores coming from the following annotators via NerConverter: AlbertForTokenClassification, BertForTokenClassification, DeBertaForTokenClassification, DistilBertForTokenClassification, LongformerForTokenClassification, RoBertaForTokenClassification, XlmRoBertaForTokenClassification, and XlnetForTokenClassification
* Introducing PushToHub Python class to easily push public models/pipelines to Models Hub
* Introducing fullAnnotateImage to the existing LightPipeline to support ImageAssembler and ViTForImageClassification annotators in a Spark NLP pipeline (see the sketch below).
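A sketch of `fullAnnotateImage` over an image-classification pipeline, assuming an active `spark` session, a local `images/` folder, and the default pretrained ViT model:
```python
from pyspark.ml import Pipeline
from sparknlp.base import ImageAssembler, LightPipeline
from sparknlp.annotator import ViTForImageClassification

image_assembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

image_classifier = ViTForImageClassification.pretrained() \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[image_assembler, image_classifier])
model = pipeline.fit(spark.read.format("image").load("images/"))

# LightPipeline accepts a local image path directly.
result = LightPipeline(model).fullAnnotateImage("images/hippopotamus.JPEG")
```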
========
4.0.2
========
----------------
New Features
----------------
* SentenceDetector now comes with a new parameter `customBoundsStrategy` for returning custom bounds https://github.com/JohnSnowLabs/spark-nlp/pull/10567
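A sketch of the new parameter, assuming the strategy values follow the linked PR (for example, `append` keeps a matched custom bound attached to the preceding sentence):
```python
from sparknlp.annotator import SentenceDetector

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setCustomBounds(["\n\n"]) \
    .setCustomBoundsStrategy("append")  # assumed values: none, prepend, append
```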
----------------
Bug Fixes
----------------
* Fix bug that attempts to create spark session on executors when using GraphExtraction https://github.com/JohnSnowLabs/spark-nlp/pull/9905
========
4.0.1
========
----------------
New Features
----------------
* Full support for Apache Spark & PySpark 3.3.0
* Add Apache Spark 3.3.0 to Google Colab and Kaggle setup scripts
* New `-g` option for Google Colab and Kaggle setup on GPU device to upgrade `libcudnn8` to 8.1.0 to solve the issue on GPU
* Support for Databricks Runtime 11.0
----------------
Bug Fixes
----------------
* Fix the error caused by PySpark 3.3.0 in CoNLL, CoNLLU, POS, and PubTator annotators as training helpers
* Fix and re-upload Dependency and Type Dependency parser pre-trained models
* Update pre-trained pipelines with issues on PySpark 3.2 and 3.3
========
4.0.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing **AlbertForQuestionAnswering** annotator in Spark NLP 🚀. `AlbertForQuestionAnswering` can load `ALBERT` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `AlbertForQuestionAnswering` for **PyTorch** or `TFAlbertForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **BertForQuestionAnswering** annotator in Spark NLP 🚀. `BertForQuestionAnswering` can load `BERT` & `ELECTRA` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `BertForQuestionAnswering` and `ElectraForQuestionAnswering` for **PyTorch** or `TFBertForQuestionAnswering` and `TFElectraForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **DeBertaForQuestionAnswering** annotator in Spark NLP 🚀. `DeBertaForQuestionAnswering` can load `DeBERTa` v2&v3 Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2ForQuestionAnswering` for **PyTorch** or `TFDebertaV2ForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **DistilBertForQuestionAnswering** annotator in Spark NLP 🚀. `DistilBertForQuestionAnswering` can load `DistilBERT` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `DistilBertForQuestionAnswering` for **PyTorch** or `TFDistilBertForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **LongformerForQuestionAnswering** annotator in Spark NLP 🚀. `LongformerForQuestionAnswering` can load `Longformer` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `LongformerForQuestionAnswering` for **PyTorch** or `TFLongformerForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **RoBertaForQuestionAnswering** annotator in Spark NLP 🚀. `RoBertaForQuestionAnswering` can load `RoBERTa` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `RobertaForQuestionAnswering` for **PyTorch** or `TFRobertaForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **XlmRoBertaForQuestionAnswering** annotator in Spark NLP 🚀. `XlmRoBertaForQuestionAnswering` can load `XLM-RoBERTa` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForQuestionAnswering` for **PyTorch** or `TFXLMRobertaForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **MultiDocumentAssembler** annotator for cases where multiple inputs need to be converted to DOCUMENT, such as in XXXForQuestionAnswering annotators
* Optimizing batch processing for transformer-based Word Embeddings on a GPU device. These optimizations result in performance improvements from +50% to +700% (more details in Benchmarks section)
* **NEW:** Introducing **SpanBertCorefModel** annotator for Coreference Resolution on BERT and SpanBERT models based on [BERT for Coreference Resolution: Baselines and Analysis](https://arxiv.org/abs/1908.09091) paper. An implementation of a SpanBert based coreference resolution model.
* Support for 2 inputs in LightPipeline with MultiDocumentAssembler
* Migrate T5Transformer to TensorFlow v2 architecture and re-upload all the existing models
* Official support for Apple silicon M1 on macOS devices. From Spark NLP 4.0.0 you can use `spark-nlp-m1` package that supports Apple silicon M1 on your macOS machine
* Official support for Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP by default is shipped for Spark 3.2.x and supports Spark/PySpark 3.0.x and 3.1.x in addition
* Unifying all supported Apache Spark packages on Maven into `spark-nlp` for CPU, `spark-nlp-gpu` for GPU, and `spark-nlp-m1` for the new Apple silicon M1 on macOS. The need for Apache Spark-specific packages like `spark-nlp-spark32` has been removed.
* Adding a new param to the sparknlp.start() function in Python and Scala for Apple silicon M1 on macOS (`m1=True`); see the example after this list
* Update Colab, Kaggle, and SageMaker scripts
* Add new default NerDL graph for xsmall DeBERTa embeddings model (384 dimensions)
* Adding annotateJava method to PretrainedPipeline class in Java to facilitate the use of LightPipelines
* Allow changing case sensitivity. Previously, users could not set the setCaseSensitive param. This allows users to change this value if the model was saved/uploaded with the wrong case sensitivity parameter (BERT, ALBERT, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, and Longformer for XXXForSequenceClassification and XXXForTokenClassification).
* Keep accuracy in ClassifierDL and SentimentDL during the training between 0.0 and 1.0
* Preserve the original form of the token in BPE Tokenizer used in RoBERTa annotators (used in embeddings, sequence and token classification)
* Refactor the entire Python module in Spark NLP to make the development and maintenance easier
* Refactor unit tests in Python and migrate to pytest
* Welcoming 6x new Databricks runtimes to our Spark NLP family:
* Databricks 10.4 LTS
* Databricks 10.4 LTS ML
* Databricks 10.4 LTS ML GPU
* Databricks 10.5
* Databricks 10.5 ML
* Databricks 10.5 ML GPU
* Welcoming a new EMR 6.x series to our Spark NLP family:
* EMR 6.6.0 (Apache Spark 3.2.0 / Hadoop 3.2.1)
* Upgrade TensorFlow to 2.7.1 and start supporting Apple silicon M1
* Upgrade RocksDB with new enhancements and support for Apple silicon M1
* Upgrade SentencePiece tokenizer TF ops to 2.7.1
* Upgrade SentencePiece JNI to v0.1.96 and provide support for Apple silicon M1 on macOS
* Upgrade to Scala 2.12.15
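For the new `m1=True` param mentioned above, a minimal start-up sketch on an Apple silicon machine (assumes the `spark-nlp-m1` package is resolvable):
```python
import sparknlp

# Starts a Spark session backed by the spark-nlp-m1 package.
spark = sparknlp.start(m1=True)
print(sparknlp.version())
```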
----------------
Bug Fixes
----------------
* Fix the default pre-trained model for DeBertaForTokenClassification in Scala and Python
* Remove a requirement in DocumentNormalizer so that consecutive stage processing can produce empty text annotations without breaking the pipeline
* Fix WordSegmenterModel outputting wrong order of tokens. The regex that groups the tagging format was refactored to preserve the order of segmented outputs (tokens)
* Fix encoding sentences not respecting the max sequence length given by a user in XlmRobertaSentenceEmbeddings
* Fix encoding sentences by using SentencePiece to calculate the correct tokens indexing
* Fix SentencePiece serialization issue when XlmRoBertaEmbeddings and XlmRoBertaSentenceEmbeddings annotators are used from a Fat JAR on GPU
* Remove non-existing parameters from DocumentAssembler in Python
----------------
Backward Compatibility
----------------
* Deprecate support for Spark/PySpark 2.3, Spark/PySpark 2.4, and Scala 2.11 https://github.com/JohnSnowLabs/spark-nlp/pull/8319
* The start() functions in Python and Scala will no longer have `spark23`, `spark24`, and `spark32` parameters. The default `sparknlp.start()` works on PySpark 3.0.x, 3.1.x, and 3.2.x without the need of any Spark related flags
* Some models/pipelines which were trained or saved by using Spark and PySpark 2.3/2.4 will no longer work on Spark NLP 4.0.0
* Remove json4s-ext dependency to allow the support for all Apache Spark major releases in one build
========
3.4.4
========
----------------
New Features
----------------
* **NEW:** Introducing **DeBertaForTokenClassification** annotator in Spark NLP 🚀. `DeBertaForTokenClassification` can load DeBERTa v2&v3 models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2ForTokenClassification` for **PyTorch** or `TFDebertaV2ForTokenClassification` for **TensorFlow** models in HuggingFace
* **NEW:** Introducing **CamemBertEmbeddings** annotator in Spark NLP 🚀
* Add support for BatchAnnotate to UniversalSentenceEncoder
----------------
Bug Fixes & Enhancements
----------------
* Optimize Tokenizer performance by up to 400% when there is an exceptions list
* Support latest PySpark releases in Colab, Kaggle, and SageMaker scripts
* Removing trove4j dependency
* Fix bug that caused get input/output/LazyAnnotator to return None
* Fix DeBertaForSequenceClassification in Python failing to load pretrained models
========
3.4.3
========
----------------
New Features
----------------
* **NEW:** Introducing **DeBertaForSequenceClassification** annotator in Spark NLP 🚀. `DeBertaForSequenceClassification` can load DeBERTa v2&v3 models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `DebertaForSequenceClassification` for **PyTorch** or `TFDebertaForSequenceClassification` for **TensorFlow** models in HuggingFace
* New multi-label feature in all SequenceForClassification annotators. The following annotators now have the option to switch to the sigmoid activation function instead of softmax for the output layer (see the sketch after this list): AlbertForSequenceClassification, BertForSequenceClassification, DeBertaForSequenceClassification, DistilBertForSequenceClassification, LongformerForSequenceClassification, RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, and XlnetForSequenceClassification
* New minLength, maxLength, splitLength, customBounds, and useCustomBoundsOnly parameters in SentenceDetectorDL
* New impossiblePenultimates feature in SentenceDetectorDLModel
* New feature to set names for columns in CoNLLU class: textCol, documentCol, sentenceCol, formCol, uposCol, xposCol, and lemmaCol
* New formCol and lemmaCol parameters in Lemmatizer annotator
* Add new functionality to download and extract models from S3 via direct link
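As referenced above, a sketch of switching one of these sequence classifiers to the sigmoid output layer for multi-label predictions, assuming the default pretrained model:
```python
from sparknlp.annotator import DistilBertForSequenceClassification

multi_label_classifier = DistilBertForSequenceClassification.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class") \
    .setActivation("sigmoid")  # the default softmax yields single-label output
```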
----------------
Bug Fixes & Enhancements
----------------
* Fix and train new English spell checker models for Spark NLP 3.4.1 on Spark 3.x and 2.x
* Update SentenceDetector documentation
* Add a missing notebook to demonstrate training a WordSegmenterApproach annotator for word segmentation
========
3.4.2
========
----------------
New Features
----------------
* Introducing DeBertaEmbeddings annotator. DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%).
This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2Model` for **PyTorch** or `TFDebertaV2Model` for **TensorFlow** models (DeBERTa-v2 & DeBERTa-v3) in HuggingFace
* Introducing a new param enableCaching in Doc2VecApproach and Word2VecApproach which, if enabled, speeds up the training
* Support Databricks runtime 10.3, 10.3 ML, and 10.3 ML & GPU
* Support EMR emr-5.34.0 and emr-6.5.0
----------------
Bug Fixes
----------------
* Fix bestModelMetric param whose set value was being ignored https://github.com/JohnSnowLabs/spark-nlp/pull/6978
========
3.4.1
========
----------------
New Features & Enhancements
----------------
* Implement TF Session warmup for MarianTransformer, T5Transformer, and GPT2Transformer annotators. The first inference for these annotators used to take 15-20 seconds; now, with the warmup session, all inferences, including the first one, take the same time https://github.com/JohnSnowLabs/spark-nlp/pull/6773
* Add bestModelMetric param to choose between Micro-average or Macro-average for best model https://github.com/JohnSnowLabs/spark-nlp/pull/6749
* Add trimWhitespace and preservePosition params to RegexTokenizer https://github.com/JohnSnowLabs/spark-nlp/pull/6806
* Add a new `setSentenceMatchAdd` param to EntityRuler to match entities across documents/sentences and not just tokens https://github.com/JohnSnowLabs/spark-nlp/pull/6841
* Add support for using the spark32 and real_time_output flags in the sparknlp.start() function at the same time https://github.com/JohnSnowLabs/spark-nlp/pull/6822
----------------
Bug Fixes
----------------
* Fix random NullPointerException when using TensorFlow models without Kryo serialization https://github.com/JohnSnowLabs/spark-nlp/pull/6741
* Fix RecursiveTokenizerModel not being readable in a saved Pipeline https://github.com/JohnSnowLabs/spark-nlp/pull/6748
* Fix ContextSpellCheckerApproach not being trained on Databricks https://github.com/JohnSnowLabs/spark-nlp/pull/6750
* Fix ContextSpellCheckerModel returning the wrong order of tokens when it's used with Sentence Detectors https://github.com/JohnSnowLabs/spark-nlp/pull/6799
* Fix GraphExtraction when fullAnnotate and document are used at the same time https://github.com/JohnSnowLabs/spark-nlp/pull/6845
* Fix Word2VecModel being cast to Doc2VecModel by mistake https://github.com/JohnSnowLabs/spark-nlp/pull/6849
* Fix broken sentence indexing in BertEmbeddings that impacted SentenceEmbeddings for text classification https://github.com/JohnSnowLabs/spark-nlp/pull/6867
* Fix missing setExceptionsPath param in Tokenizer when it's used in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6868
* Fix the wrong metric being mentioned when useBestModel was enabled. The documentation said Micro-averaged F1, but in fact it was Macro-averaged F1 (it is now possible to choose which metric to track)
* Update broken slow unit tests https://github.com/JohnSnowLabs/spark-nlp/pull/6767
========
3.4.0
========
----------------
Major features and improvements
----------------
* **NEW:** Introducing **GPT2Transformer** annotator in Spark NLP 🚀. OpenAI GPT2 - huggingface `TFGPT2LMHeadModel`
* **NEW:** Introducing **RoBertaForSequenceClassification** annotator in Spark NLP 🚀. `RoBertaForSequenceClassification` can load RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `RobertaForSequenceClassification` for **PyTorch** or `TFRobertaForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **XlmRoBertaForSequenceClassification** annotator in Spark NLP 🚀. `XlmRoBertaForSequenceClassification` can load XLM-RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForSequenceClassification` for **PyTorch** or `TFXLMRobertaForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **LongformerForSequenceClassification** annotator in Spark NLP 🚀. `LongformerForSequenceClassification` can load ALBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `LongformerForSequenceClassification` for **PyTorch** or `TFLongformerForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **AlbertForSequenceClassification** annotator in Spark NLP 🚀. `AlbertForSequenceClassification` can load ALBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `AlbertForSequenceClassification` for **PyTorch** or `TFAlbertForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **XlnetForSequenceClassification** annotator in Spark NLP 🚀. `XlnetForSequenceClassification` can load XLNet Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLNetForSequenceClassification` for **PyTorch** or `TFXLNetForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing trainable and distributed Word2Vec annotators based on Word2Vec in Spark ML
* Support for Apache Spark and PySpark 3.2.x on Scala 2.12
* Introducing `useBestModel` param in NerDLApproach annotator. This param in the NerDLApproach preserves and restores the model that has achieved the best performance at the end of the training. The priority is metrics from testDataset (micro F1), metrics from validationSplit (micro F1), and if none is set it will keep track of loss during the training
* Welcoming 6x new Databricks runtimes to our Spark NLP family:
* Databricks 10.0
* Databricks 10.0 ML GPU
* Databricks 10.1
* Databricks 10.1 ML GPU
* Databricks 10.2
* Databricks 10.2 ML GPU
* Welcoming 3x new EMR 6.x series to our Spark NLP family:
* EMR 5.33.1 (Apache Spark 2.4.7 / Hadoop 2.10.1)
* EMR 6.3.1 (Apache Spark 3.1.1 / Hadoop 3.2.1)
* EMR 6.4.0 (Apache Spark 3.1.2 / Hadoop 3.2.1)
* Adding a new param to sparknlp.start() function in Python for Apache Spark 3.2.x (`spark32=True`)
* Add new scripts/notebook to generate custom TensorFlow graphs for the `ContextSpellCheckerApproach` annotator
* Add a new `graphFolder` param to the `ContextSpellCheckerApproach` annotator. This param allows training ContextSpellChecker from a custom-made TensorFlow graph
* Support DBFS file system in `graphFolder` param. Starting Spark NLP 3.4.0 you can point NerDLApproach or ContextSpellCheckerApproach to a TF graph hosted on Databricks
* Add new feature to all classifiers (`ForTokenClassification` and `ForSequenceClassification`) to retrieve classes from the pretrained models
* Add `inputFormats` param to DateMatcher and MultiDateMatcher annotators. DateMatcher and MultiDateMatcher can now define a list of acceptable input formats via date patterns to search for in the text, while the output format defines the single pattern used for all outputs (see the sketch after this list)
* Enable batch processing in T5Transformer and MarianTransformer annotators
* Add Schema to `readDataset` in CoNLL() class
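As referenced above, a sketch of constraining DateMatcher to a list of acceptable input patterns while emitting a single output pattern (setter names assumed to mirror the param names):
```python
from sparknlp.annotator import DateMatcher

date_matcher = DateMatcher() \
    .setInputCols(["document"]) \
    .setOutputCol("date") \
    .setInputFormats(["yyyy/MM/dd", "MM/dd/yyyy"]) \
    .setOutputFormat("yyyy-MM-dd")
```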
----------------
Bug Fixes
----------------
* Fix a race condition in cluster mode when the TF session is accessed for the very first time by as many threads as there are available cores on the Driver machine. Loading a model multiple times results in disk activity, and IO becomes a bottleneck for larger models, especially on machines with slower disks https://github.com/JohnSnowLabs/spark-nlp/pull/6575
* Fix a performance issue introduced in the 3.3.3 release for the T5Transformer and MarianTransformer annotators. While adding support for ignored tokens, we accidentally introduced a bug that degraded the performance of these two annotators (sometimes making them twice as slow). Please do update to 3.4.0 if you are using either of these annotators. https://github.com/JohnSnowLabs/spark-nlp/pull/6605
* Fix a bug in model resolution by not filtering based on the timestamp
* Fix configProtoBytes param type in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6549
* Fix missing DefaultParamsReadable in RegexTokenizer annotator https://github.com/JohnSnowLabs/spark-nlp/pull/6653
* Fix missing models `lemma_antbnc`, `sentiment_vivekn`, and `spellcheck_norvig` for Spark 3.x
* Fix missing pipelines `clean_slang`, `check_spelling`, `match_chunks`, and `match_datetime` for Spark 3.x
* Fix `saveModel` in TrainingHelper
* Fix Keyword/Yake module naming in Scala https://github.com/JohnSnowLabs/spark-nlp/pull/6562
----------------
Backward Compatibility
----------------
* The parameter `dateFormat` in DateMatcher and MultiDateMatcher annotators has been renamed to `outputFormat`:
```python
# previously
.setDateFormat("yyyy/MM/dd")
# after 3.4.0 release
.setOutputFormat("yyyy/MM/dd")
```
* Deprecating xling TF Hub models for UniversalSentenceEncoder annotator (there are `CMLM` models available which outperform xling models with support for more languages)
* Deprecating Finnish old BERT models (there are newer models available now)
========
3.3.4
========
----------------
Patch release
----------------
* Fix "ClassCastException" error in pretrained function for DistilBertForSequenceClassification in Python
========
3.3.3
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing **DistilBertForSequenceClassification** annotator in Spark NLP 🚀. `DistilBertForSequenceClassification` can load DistilBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `DistilBertForSequenceClassification` or `TFDistilBertForSequenceClassification` in HuggingFace 🤗
* **NEW:** Introducing trainable and distributed **Doc2Vec** annotators based on Word2Vec in Spark ML
* Improving BertEmbeddings for single document/sentence DataFrame per row on a single machine with a GPU device
* Improving BertSentenceEmbeddings for single document/sentence DataFrame per row on a single machine with a GPU device
* Add a new feature to the CoNLL() class, allowing it to read multiple CoNLL files at the same time into a single DataFrame
* Add support for Long type in label column for ClassifierDLApproach and SentimentDLApproach
----------------
Bug Fixes
----------------
* Improve model and pipeline resolution in Spark NLP to prevent downloading wrong models/pipelines for a given Apache Spark version
* Fix MarianTransformer bug on empty sequences
* Fix TFInvalidArgumentException in MarianTransformer for sequences longer than 512
* Fix MarianTransformer multi-lingual models and pipelines such as `opus_mt_mul_en` and `opus_mt_en_mul`
========
3.3.2
========
----------------
New Features
----------------
* Comet.ml integration with Spark NLP
* Introducing BertForSequenceClassification annotator
----------------
Bug Fixes
----------------
* Fix EntityRulerApproach name from import
* Fix missing EntityRulerModel in ResourceDownloader
* Fix NerDLApproach logs format on Databricks
* Fix a missing batchSize param in NerDLModel that degraded GPU performance
========
3.3.1
========
----------------
New Features
----------------
* Introducing the EntityRuler annotator to receive either a JSON or CSV ontology file that maps entities to patterns. You can implement a purely rule-based entity recognition system by using EntityRuler; it can be saved as a Model and reused in other pipelines to annotate your document against your knowledge base.
----------------
Bug Fixes
----------------
* Fix compatibility issue between NerOverwriter and AlbertForTokenClassification, BertForTokenClassification, DistilBertForTokenClassification, LongformerForTokenClassification, RoBertaForTokenClassification, XlmRoBertaForTokenClassification, XlnetForTokenClassification annotators
* Fix a bug in ContextSpellCheckerApproach annotator failing to find an appropriate TF graph
* Fix a bug in ContextSpellCheckerModel not being able to load a trained model
* Fix token alignment with token pieces in BertEmbeddings resulting in missing vectors with Unicode characters
* Add the missing pretrained NER models for the XlmRoBertaForTokenClassification annotator
* Add the missing pretrained NER models for the LongformerForTokenClassification annotator
----------------
Backward compatibility
----------------
* Renaming YakeModel to YakeKeywordExtraction to represent the actual purpose of this annotator more clearly.
========
3.3.0
========
----------------
Major features and improvements
----------------
* **NEW:** Beginning with the Spark NLP 3.3.0 release, there is no size limitation when you import TensorFlow models! You can now import TF Hub & HuggingFace models larger than 2G in size.
* **NEW:** Up to 50x faster saving of Spark NLP models and pipelines! 🚀 We have improved the way we package TensorFlow SavedModel while saving Spark NLP models & pipelines. For instance, it used to take up to 10 minutes to save the `xlm_roberta_base` model prior to Spark NLP 3.3.0, and now it only takes up to 15 seconds!
* **NEW:** Introducing **AlbertForTokenClassification** annotator in Spark NLP 🚀. `AlbertForTokenClassification` can load ALBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `AlbertForTokenClassification` or `TFAlbertForTokenClassification` in HuggingFace 🤗
* **NEW:** Introducing **XlnetForTokenClassification** annotator in Spark NLP 🚀. `XlnetForTokenClassification` can load XLNet Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLNetForTokenClassification` or `TFXLNetForTokenClassification` in HuggingFace 🤗
* **NEW:** Introducing **RoBertaForTokenClassification** annotator in Spark NLP 🚀. `RoBertaForTokenClassification` can load RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `RobertaForTokenClassification` or `TFRobertaForTokenClassification` in HuggingFace 🤗
* **NEW:** Introducing **XlmRoBertaForTokenClassification** annotator in Spark NLP 🚀. `XlmRoBertaForTokenClassification` can load XLM-RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForTokenClassification` or `TFXLMRobertaForTokenClassification` in HuggingFace 🤗
* **NEW:** Introducing **LongformerForTokenClassification** annotator in Spark NLP 🚀. `LongformerForTokenClassification` can load Longformer Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `LongformerForTokenClassification` or `TFLongformerForTokenClassification` in HuggingFace 🤗
* **NEW:** Introducing new ResourceDownloader functions to easily look for pretrained models & pipelines inside Spark NLP (Python and Scala). You can filter models or pipelines via `language`, `version`, or the name of the `annotator`
* Welcoming [Databricks Runtime 9.1 LTS](https://docs.databricks.com/release-notes/runtime/9.1.html), 9.1 ML, and 9.1 ML with GPU
* Fix printing a wrong version return in sparknlp.version()
----------------
Bug Fixes
----------------
* Fix a bug in RoBertaEmbeddings when all special tokens were identical
* Fix a bug in RoBertaEmbeddings when special token contained valid regex
* Fix a bug leading to a memory leak inside the NorvigSweeting spell checker. This issue broke pretrained pipelines such as `explain_document_ml` and `explain_document_dl` on some inputs
* Fix the wrong types being assigned to `minCount` and `classCount` in Python for `ContextSpellCheckerApproach` annotator
* Fix `explain_document_ml` pretrained pipeline for Spark NLP 3.x on Apache Spark 2.x
========
3.2.3
========
----------------
Bug Fixes & Enhancements
----------------
* Add delimiter feature to CoNLL() class to support other delimiters in CoNLL files https://github.com/JohnSnowLabs/spark-nlp/pull/5934
* Add support for IOB in addition to the IOB2 format in GraphExtraction https://github.com/JohnSnowLabs/spark-nlp/pull/6101
* Change YakeModel output type from KEYWORD to CHUNK to have more available features after the YakeModel annotator such as Chunk2Doc or ChunkEmbeddings https://github.com/JohnSnowLabs/spark-nlp/pull/6065
* Fix the default language for XlmRoBertaSentenceEmbeddings pretrained model in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6057
* Fix SentenceEmbeddings issue concatenating sentences instead of each correspondent sentence https://github.com/JohnSnowLabs/spark-nlp/pull/6060
* Fix GraphExtraction usage in LightPipeline https://github.com/JohnSnowLabs/spark-nlp/pull/6101
* Fix compatibility issue in `explain_document_ml` pipeline
* Better import process for corrupted merges file in Longformer tokenizer https://github.com/JohnSnowLabs/spark-nlp/pull/6083
========
3.2.2
========
----------------
New Features
----------------
* A new RoBertaSentenceEmbeddings annotator for sentence embeddings used in SentimentDL, ClassifierDL, and MultiClassifierDL annotators
* A new XlmRoBertaSentenceEmbeddings annotator for sentence embeddings used in SentimentDL, ClassifierDL, and MultiClassifierDL annotators
* Add support for AWS MFA via Spark NLP configuration
* Add new AWS configs to the Spark NLP configuration for when a private S3 bucket is used to store logs for training models or to access TF graphs needed in NerDLApproach (see the sketch after this list):
* spark.jsl.settings.aws.credentials.access_key_id
* spark.jsl.settings.aws.credentials.secret_access_key
* spark.jsl.settings.aws.credentials.session_token
* spark.jsl.settings.aws.s3_bucket
* spark.jsl.settings.aws.region
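As referenced above, a sketch of wiring these configs into a Spark session; the key names come from the list above, the values are placeholders, and jar/dependency setup is omitted:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .config("spark.jsl.settings.aws.credentials.access_key_id", "MY_ACCESS_KEY_ID") \
    .config("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY") \
    .config("spark.jsl.settings.aws.s3_bucket", "my-private-bucket") \
    .config("spark.jsl.settings.aws.region", "us-east-1") \
    .getOrCreate()
```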
----------------
Bug Fixes & Enhancements
----------------
* Improve loading merges file for RoBERTa tokenizer
* Remove batchSize param from broadcast in XlmRoBertaEmbeddings to be set after it is created
* Preserve previously generated metadata in the BertSentenceEmbeddings annotator
* Set `elmo` as a default poolingLayer in ElmoEmbeddings
* Fix special tokens ids in XlmRoBertaEmbeddings annotator
* Fix distilbert_base_token_classifier_ontonotes model
* Fix distilbert_base_token_classifier_conll03 model
* Fix distilbert_base_token_classifier_few_nerd model
* Fix distilbert_token_classifier_persian_ner model
* Fix ner_conll_longformer_base_4096 model
========
3.2.1
========
----------------
Patch release
----------------
* Fix "unsupported model" error in pretrained function for LongformerEmbeddings, BertForTokenClassification, and DistilBertForTokenClassification
========
3.2.0
========
----------------
Major features and improvements
----------------
* **NEW:** Introducing **LongformerEmbeddings** annotator
* **NEW:** Introducing **BertForTokenClassification** annotator
* **NEW:** Introducing **DistilBertForTokenClassification** annotator
* **NEW:** Introducing **GraphExtraction** and **GraphFinisher** annotators.
* **NEW:** Introducing support for multilingual **DateMatcher** and **MultiDateMatcher** annotators. These two annotators will support **English**, **French**, **Italian**, **Spanish**, **German**, and **Portuguese** languages
* **NEW:** Introducing new **Python APIs** and fully documented **Pydoc**
* **NEW:** Introducing new **Spark NLP configurations** via spark.conf() by deprecating `application.conf` usage
* Add support for S3 to `log_folder` Spark NLP config and `outputLogsPath` param in `NerDLApproach`, `ClassifierDlApproach`, `MultiClassifierDlApproach`, and `SentimentDlApproach` annotators
* Added examples to all Spark NLP Scaladoc
* Added examples to all Spark NLP Pydoc
* Welcoming new Databricks runtimes to our Spark NLP family:
* Databricks 8.4 ML & GPU
* Fix printing a wrong version return in sparknlp.version()
========
3.1.3
========
----------------
Bug Fixes & Enhancements
----------------
* Fix serialization issue in NorvigSweetingModel
* Fix the issue with BertSentenceEmbeddings model in TF v2
* Update ArrayType structure to fix Finisher failing to clean up some annotators
========
3.1.2
========
----------------
New Features
----------------
* Migrate XlnetEmbeddings to TensorFlow v2. This allows the importing of HuggingFace XLNet models to Spark NLP
* Migrate XlnetEmbeddings to BatchAnnotate to allow better performance on accelerated hardware such as GPU
* Dynamically extract special tokens from SentencePiece model in XlmRoBertaEmbeddings
* Add setIncludeAllConfidenceScores param in NerDLModel to merge confidence scores per label to only predicted label
* Sync Python params with Scala params in ContextSpellCheckerApproach, WordSegmenterApproach, RegexMatcher, and ViveknSentimentApproach
----------------
Bug Fixes & Enhancements
----------------
* Fix issue with SymmetricDeleteModel
* Fix issue with encoding unknown bytes in RoBertaEmbeddings
* Fix issue with multi-lingual UniversalSentenceEncoder models
----------------
Backward compatibility
----------------
We have migrated XlnetEmbeddings to TensorFlow v2, the earlier models prior to 3.1.2 won't work after this release.
We have already updated the models and uploaded them on Models Hub. You can use `pretrained()` that takes care of it automatically or please make sure you download the new models manually.
========
3.1.1
========
----------------
New Features
----------------
* Migrate AlbertEmbeddings to TensorFlow v2. This allows the importing of HuggingFace ALBERT models to Spark NLP
* Migrate AlbertEmbeddings to BatchAnnotate to allow better performance on accelerated hardware such as GPU
* Enable stdout/stderr in real time for child processes of `sparknlp.start()`. Thanks to PySpark 3.x, this is now possible with `sparknlp.start(real_time_output=True)` to have the outputs of Spark NLP (such as metrics during training) right in your Jupyter, Colab, and Kaggle notebooks.
* Complete examples for all annotators in Scaladoc APIs https://github.com/JohnSnowLabs/spark-nlp/pull/5668
----------------
Bug Fixes & Enhancements
----------------
* Fix YakeModel issue with empty token https://github.com/JohnSnowLabs/spark-nlp/pull/5683 thanks to @shaddoxac
* Fix getAnchorDateMonth method in DateMatcher and MultiDateMatcher https://github.com/JohnSnowLabs/spark-nlp/pull/5693
* Fix the broken PubTator class in Python https://github.com/JohnSnowLabs/spark-nlp/pull/5702
* Fix relative dates in DateMatcher and MultiDateMatcher such as `day after tomorrow` or `day before yesterday` https://github.com/JohnSnowLabs/spark-nlp/pull/5706
* Add isPaddedToken param to PubTator https://github.com/JohnSnowLabs/spark-nlp/pull/5702
* Fix issue with `logger` inside session on some setup https://github.com/JohnSnowLabs/spark-nlp/pull/5715
* Add signatures to TF session to handle inputs/outputs more dynamically in BertEmbeddings, DistilBertEmbeddings, RoBertaEmbeddings, and XlmRoBertaEmbeddings https://github.com/JohnSnowLabs/spark-nlp/pull/5715
* Fix XlmRoBertaEmbeddings issue with `init_all_tables` https://github.com/JohnSnowLabs/spark-nlp/pull/5715
* Add missing random seed param to ClassifierDLApproach, MultiClassifierDLApproach, and SentimentDLApproach https://github.com/JohnSnowLabs/spark-nlp/pull/5697
* Make the Java Exceptions appear before Py4J exceptions for ease of debugging in Python https://github.com/JohnSnowLabs/spark-nlp/pull/5709
* Make sure batchSize set in NerDLModel is the same internally to feed TensorFlow https://github.com/JohnSnowLabs/spark-nlp/pull/5716
----------------
Backward compatibility
----------------
We have migrated AlbertEmbeddings to TensorFlow v2, the earlier models prior to 3.1.1 won't work after this release.
We have already updated the models and uploaded them on Models Hub. You can use `pretrained()` that takes care of it automatically or please make sure you download the new models manually.
========
3.1.0
========
----------------
New Features
----------------
* **NEW:** Introducing DistilBertEmbeddings annotator. DistilBERT is a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving over 95% of BERT’s performance
* **NEW:** Introducing RoBERTaEmbeddings annotator. RoBERTa (Robustly Optimized BERT-Pretraining Approach) models deliver state-of-the-art performance on NLP/NLU tasks and a sizable performance improvement on the GLUE benchmark. With a score of 88.5, RoBERTa reached the top position on the GLUE leaderboard
* **NEW:** Introducing XlmRoBERTaEmbeddings annotator. XLM-RoBERTa (Unsupervised Cross-lingual Representation Learning at Scale) is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data with 100 different languages. It also outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model
* **NEW:** Introducing support for HuggingFace exported models in equivalent Spark NLP annotators. Starting this release, you can easily use the `saved_model` feature in HuggingFace within a few lines of code and import any BERT, DistilBERT, RoBERTa, and XLM-RoBERTa models to Spark NLP (see the sketch after this list). We will work on the remaining annotators and extend this support to the rest with each release - For more information please visit [this discussion](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669)
* **NEW:** Migrate MarianTransformer to BatchAnnotate to control the throughput when you are on accelerated hardware such as GPU to fully utilize it
* Upgrade to TensorFlow v2.4.1 with native support for Java to take advantage of many optimizations for CPU/GPU and new features/models introduced in TF v2.x
* Update to CUDA11 and cuDNN 8.0.2 for GPU support
* Implement ModelSignatureManager to automatically detect inputs, outputs, save and restore tensors from SavedModel in TF v2. This allows Spark NLP 3.1.x to extend support for external Encoders such as HuggingFace and TF Hub (coming soon!)
* Implement a new BPE tokenizer for RoBERTa and XLM models. This tokenizer will use the custom tokens from `Tokenizer` or `RegexTokenizer` and generates token pieces, encodes, and decodes the results
* Welcoming new Databricks runtimes to our Spark NLP family:
* Databricks 8.1 ML & GPU
* Databricks 8.2 ML & GPU
* Databricks 8.3 ML & GPU
* Welcoming a new EMR 6.x series to our Spark NLP family:
* EMR 6.3.0 (Apache Spark 3.1.1 / Hadoop 3.2.1)
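As referenced above, a sketch of importing a HuggingFace `saved_model` export into the equivalent Spark NLP annotator; the export path is a placeholder and `spark` is an active session:
```python
from sparknlp.annotator import BertEmbeddings

# Path to the folder produced by HuggingFace's saved_model export.
bert = BertEmbeddings.loadSavedModel("/export/bert-base-cased/saved_model/1", spark) \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Save once, then reuse offline with BertEmbeddings.load(...)
bert.write().overwrite().save("./bert_base_cased_spark_nlp")
```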
----------------
Backward compatibility
----------------
* We have updated our MarianTransformer annotator to be compatible with TF v2 models. This change is not compatible with previous models/pipelines. However, we have updated and uploaded all the models and pipelines for `3.1.x` release. You can either use `MarianTransformer.pretrained(MODEL_NAME)` and it will automatically download the compatible model or you can visit [Models Hub](https://sparknlp.org/models) to download the compatible models for offline use via `MarianTransformer.load(PATH)`
========
3.0.3
========
----------------
New Features
----------------
* Add new functionalities for text generation in T5Transformer
----------------
Bug Fixes
----------------
* Fix ChunkEmbeddings Array out of bounds exception
* Fix pretrained tfhub_use_multi and tfhub_use_multi_lg models in UniversalSentenceEncoder
* Fix anchorDateMonth in Python and case sensitivity in relative dates
========
3.0.2
========
----------------
New Features and Enhancements
----------------
* Experimental support for community models and pipelines https://github.com/JohnSnowLabs/spark-nlp/pull/2743
* Add proper conversions for Scala 2.11/2.12 in ContextSpellChecker to use models from Spark 2.x in Spark 3.x https://github.com/JohnSnowLabs/spark-nlp/pull/2758
* Provide confidence scores for all available tags in NerDLModel and NerCrfModel https://github.com/JohnSnowLabs/spark-nlp/pull/2760
```
# Previously in NerDLModel and NerCrfModel
[[named_entity, 0, 4, B-LOC, [word -> Japan, confidence -> 0.9998], []]
```
```
# In Spark NLP 3.0.2
[[named_entity, 0, 4, B-LOC, [B-LOC -> 0.9998, I-ORG -> 0.0, I-MISC -> 0.0, I-LOC -> 0.0, I-PER -> 0.0, B-MISC -> 0.0, B-ORG -> 1.0E-4, word -> Japan, O -> 0.0, B-PER -> 0.0], []]
```
* Add confidence score to NerConverter metadata https://github.com/JohnSnowLabs/spark-nlp/pull/2784
```
[chunk, 30, 37, john, [entity -> PERSON, sentence -> 0, chunk -> 0, confidence -> 0.44035]
```
* Refactoring SentencePiece encoding in AlbertEmbeddings and XlnetEmbeddings https://github.com/JohnSnowLabs/spark-nlp/pull/2777
----------------
Bug Fixes
----------------
* Fix an exception in NerConverter when the documents/sentences don't carry the used tokens in NerDLModel https://github.com/JohnSnowLabs/spark-nlp/pull/2784
* Fix an exception in AlbertEmbeddings when the original tokens are longer than the piece tokens https://github.com/JohnSnowLabs/spark-nlp/pull/2777
========
3.0.1
========
----------------
New Features
----------------
* Add minLength and maxLength parameters to Normalizer annotator https://github.com/JohnSnowLabs/spark-nlp/pull/2614
* 1 line to set up [Google Colab](https://github.com/JohnSnowLabs/spark-nlp#google-colab-notebook)
* 1 line to set up [Kaggle Kernel](https://github.com/JohnSnowLabs/spark-nlp#kaggle-kernel)
----------------
Enhancements
----------------
* Adjust shading rule for amazon AWS to support sub-projects from Spark NLP Fat JAR https://github.com/JohnSnowLabs/spark-nlp/pull/2613
* Fix the missing variables in BertSentenceEmbeddings https://github.com/JohnSnowLabs/spark-nlp/pull/2615
* Restrict loading Sentencepiece ops only to supported models https://github.com/JohnSnowLabs/spark-nlp/pull/2623
* Improve dependency management and resolvers https://github.com/JohnSnowLabs/spark-nlp/pull/2479
========
3.0.0
========
----------------
New Features
----------------
* Support for Apache Spark and PySpark 3.0.x on Scala 2.12
* Support for Apache Spark and PySpark 3.1.x on Scala 2.12
* Migrate to TensorFlow v2.3.1 with native support for Java to take advantage of many optimizations for CPU/GPU and new features/models introduced in TF v2.x
* Welcoming 9x new Databricks runtimes to our Spark NLP family:
* Databricks 7.3
* Databricks 7.3 ML GPU
* Databricks 7.4
* Databricks 7.4 ML GPU
* Databricks 7.5
* Databricks 7.5 ML GPU
* Databricks 7.6
* Databricks 7.6 ML GPU
* Databricks 8.0
* Databricks 8.0 ML (there is no GPU in 8.0)
* Databricks 8.1 Beta
* Welcoming 2x new EMR 6.x series to our Spark NLP family:
* EMR 6.1.0 (Apache Spark 3.0.0 / Hadoop 3.2.1)
* EMR 6.2.0 (Apache Spark 3.0.1 / Hadoop 3.2.1)
* Starting Spark NLP 3.0.0 the default packages for CPU and GPU will be based on Apache Spark 3.x and Scala 2.12 (`spark-nlp` and `spark-nlp-gpu` will be compatible only with Apache Spark 3.x and Scala 2.12)
* Starting Spark NLP 3.0.0 we have two new packages to support Apache Spark 2.4.x and Scala 2.11 (`spark-nlp-spark24` and `spark-nlp-gpu-spark24`)
* Spark NLP 3.0.0 still is and will be compatible with Apache Spark 2.3.x and Scala 2.11 (`spark-nlp-spark23` and `spark-nlp-gpu-spark23`)
* Adding a new param to sparknlp.start() function in Python for Apache Spark 2.4.x (`spark24=True`)
* Adding a new param to adjust Driver memory in sparknlp.start() function (`memory="16G"`)
----------------
Performance Improvements
----------------
Introducing a new batch annotation technique implemented in Spark NLP 3.0.0 for NerDLModel, BertEmbeddings, and BertSentenceEmbeddings annotators to radically improve prediction/inferencing performance.
From now on the `batchSize` for these annotators means the number of rows that can be fed into the models for prediction instead of sentences per row.
You can control the throughput when you are on accelerated hardware such as GPU to fully utilize it.
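A sketch of tuning the row-based `batchSize` on one of these annotators; the best value depends on your hardware:
```python
from sparknlp.annotator import NerDLModel

# batchSize now counts DataFrame rows fed to the model per batch.
ner = NerDLModel.pretrained() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") \
    .setBatchSize(8)
```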
----------------
Breaking changes
----------------
There are only 5 annotators that are not compatible with both Scala 2.11 (Apache Spark 2.3 and Apache Spark 2.4) and Scala 2.12 (Apache Spark 3.x).
You can either train and use them on Apache Spark 2.3.x/2.4.x or train and use them on Apache Spark 3.x. The rest of our models/pipelines can be used on all Apache Spark and Scala major versions.
- TokenizerModel
- PerceptronApproach (POS Tagger)
- WordSegmenter
- DependencyParser
- TypedDependencyParser
========
2.7.5
========
----------------
Bugfixes
----------------
* Fix BigDecimal error in NerDL when includeConfidence is true
----------------
Enhancements
----------------
* Shade Hadoop AWS and AWS Java SDK dependencies
========
2.7.4
========
----------------
Bugfixes
----------------
* Fix Tensors with a 0 dimension issue in ClassifierDL and SentimentDL
* Fix index error in TokenAssembler
* Fix MatchError in DateMatcher and MultiDateMatcher annotators
* Fix setOutputAsArray and its default value for valueSplitSymbol in Finisher annotator
----------------
Enhancements
----------------
* Implement missing frequencyThreshold and ambiguityThreshold params in WordSegmenterApproach annotator
* Downgrade Hadoop from 3.2 to 2.7, as 3.2 caused an issue with S3
* Update Apache HTTP Client
========
2.7.3
========
---------------
New Features
---------------
* Add anchorDateYear, anchorDateMonth, and anchorDateDay to DateMatcher and MultiDateMatcher to be used for relative dates extraction
----------------
Bugfixes
----------------
* Fix the default value for action parameter in Python wrapper for DocumentNormalizer annotator
* Fix Lemmatizer pretrained models published in 2021
----------------
Enhancements
----------------
* Improve T5Transformer performance on documents with many sentences
========
2.7.2
========
----------------
Bugfixes
----------------
* Fix causal mask calculations resulting in bad translations in MarianTransformer
* Fix Serialization issue in the cluster while training ContextSpellChecker
* Fix calculating CHUNK spans based on the sentences' boundaries in RegexMatcher
----------------
Enhancements
----------------
* Add GPU support for training ContextSpellChecker
* Add the ability to control Scalatest runs by tags
========
2.7.1
========
----------------
Bugfixes
----------------
* Fix default pretrained model T5Transformer
* Fix default pretrained model WordSegmenter
* Fix missing reference to WordSegmenter in ResourceDownloader
* Fix T5Transformer models crashing due to unknown task
* Fix the issue of saving and reading ClassifierDL, SentimentDL, and MultiClassifierDL models introduced in the 2.7.0 release
----------------
Enhancements
----------------
* Export new T5 models with optimized Encoder/Decoder
* Add support for alternative tagging with the positional parser in RegexTokenizer
* Refactor AssertAnnotations
----------------
Backward compatibility
----------------
* In order to fix the issue with Classifiers in clusters, we had to export new TF models and change the read/write functions of these annotators. As a result, any model trained prior to the 2.7.0 release is not compatible with 2.7.1 and requires retraining, including pre-trained models. (We are re-training all the existing text classification models with 2.7.1.)
========
2.7.0
========
------------------------------
Major features and improvements
------------------------------
* Introducing MarianTransformer annotator for machine translation based on MarianNMT models. Marian is an efficient, free Neural Machine Translation framework mainly being developed by the Microsoft Translator team (646+ pretrained models & pipelines in 192+ languages)
* Introducing T5Transformer annotator for Text-To-Text Transfer Transformer (Google T5) models to achieve state-of-the-art results on multiple NLP tasks such as Translation, Summarization, Question Answering, Sentence Similarity, and so on (see the sketch after this list)
* Introducing brand new and refactored language detection and identification models. The new LanguageDetectorDL is faster, more accurate, and supports up to 375 languages
* Introducing WordSegmenter annotator, a trainable annotator for word segmentation of languages without any rule-based tokenization such as Chinese, Japanese, or Korean
* Introducing DocumentNormalizer annotator for cleaning content from HTML or XML documents, applying either data cleansing with an arbitrary number of custom regular expressions or data extraction driven by its parameters
* [Spark NLP Display](https://github.com/JohnSnowLabs/spark-nlp-display) for visualization of different types of annotations
* Add support for new multi-lingual models in UniversalSentenceEncoder annotator
* Add support to Lemmatizer to be trained directly from a DataFrame instead of a text file
* Add training helper to transform CoNLL-U into Spark NLP annotator type columns
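As a quick illustration of the new T5Transformer mentioned above, a hedged Python sketch (model name and task prefix are illustrative):

```python
from sparknlp.annotator import T5Transformer

# The task prefix selects which of T5's text-to-text tasks to run,
# e.g. summarization of the incoming documents.
t5 = T5Transformer.pretrained("t5_small", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("summary") \
    .setTask("summarize:")
```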
----------------
Bugfixes and Enhancements
----------------
* Fix all the known issues in ClassifierDL, SentimentDL, and MultiClassifierDL annotators in a Cluster
* NerDL enhancements for memory optimization and logging during the training with the test dataset
* SentenceEmbeddings annotator now reuses the storageRef of any embeddings used earlier in the pipeline
* Fix dropout in SentenceDetectorDL models for more deterministic results. Both English and Multi-lingual models are retrained for the 2.7.0 release
* Fix Python dataType Annotation
* Upgrade to Apache Spark 2.4.7
========
2.6.5
========
----------------
Bugfixes
----------------
* Fix a bug in batching sentences in BertSentenceEmbeddings
* Fix AttributeError when trying to load a saved EmbeddingsFinisher in Python
----------------
Enhancements
----------------
* Improve exception handling in DocumentAssembler when the user provides a corrupted DataFrame
========
2.6.4
========
----------------
Bugfixes
----------------
* Fix loading from a local folder with no access to the cache folder
* Fix NullPointerException in DocumentAssembler when there are null values in the rows
* Fix dynamic padding in BertSentenceEmbeddings
========
2.6.3
========
---------------
New Features
---------------
* Add enableMemoryOptimizer to allow training NerDLApproach on a dataset larger than available memory (see the sketch after this list)
* Add option to explode sentences in SentenceDetectorDL
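A hedged Python sketch of the memory optimizer (column and label names are illustrative):

```python
from sparknlp.annotator import NerDLApproach

# Stream training data from disk rather than holding it all in memory,
# trading some speed for the ability to train on datasets larger than RAM.
ner = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") \
    .setLabelColumn("label") \
    .setEnableMemoryOptimizer(True)
```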
----------------
Enhancements
----------------
* Improve POS (AveragedPerceptron) performance
* Improve Norvig Spell Checker performance
----------------
Bugfixes
----------------
* Fix SentenceDetectorDL unsupported model error in pretrained function
* Fix a race condition in Lru that could cause NullPointerException during LightPipeline operations with embeddings
* Fix max sequence length calculation in BertEmbeddings and BertSentenceEmbeddings
* Fix threshold in YakeModel on Python side
========
2.6.2
========
---------------
New Features
---------------
* Introducing a new SentenceDetectorDL
----------------
Enhancements
----------------
* Improved the quality of BioBERT models for BertEmbeddings (achieving higher accuracy in sequence classification)
* Improved the quality of Sentence BioBERT models for BertSentenceEmbeddings (achieving higher accuracy in text classification)
* Add unit test to MultiClassifierDL annotator
* Better error handling in SentimentDLApproach
* Improve loadSavedModel in BertEmbeddings and BertSentenceEmbeddings
----------------
Bugfixes
----------------
* Fix BERT LaBSE model for BertSentenceEmbeddings
* Fix loadSavedModel for BertSentenceEmbeddings in Python
---------------
Deprecations
---------------
* DeepSentenceDetector is deprecated in favor of SentenceDetectorDL
========
2.6.1
========
----------------
Bugfixes
----------------
* Fix a bug in ClassifierDL that resulted in low accuracy during the training
========
2.6.0
========
------------------------------
Major features and improvements
------------------------------
* **NEW:** A new MultiClassifierDL annotator for multi-label text classification
* **NEW:** A new BertSentenceEmbeddings annotator with 41 available pre-trained models for sentence embeddings used in SentimentDL, ClassifierDL, and MultiClassifierDL annotators
* **NEW:** A new YakeModel annotator implementing an unsupervised, corpus-independent, domain- and language-independent, single-document keyword extraction algorithm
* Integrate 24 new Small BERT models, where the smallest model is 24x smaller and 28x faster compared to BERT base models
* Add 3 new ELECTRA small, base, and large models
* Add 4 new Finnish BERT models for BertEmbeddings and BertSentenceEmbeddings
* Improve BertEmbeddings memory consumption by 30%
* Improve BertEmbeddings performance by more than 70% with new built-in dynamic shape inputs
* Remove the poolingLayer parameter in BertEmbeddings in favor of sequence_output that is provided by TF Hub models for new BERT models
* Add validation loss, validation accuracy, validation F1, and validation True Positive Rate during the training in MultiClassifierDL
* Add parameter to enable/disable list detection in SentenceDetector
* Unify the logging in ClassifierDL and SentimentDL during training
----------------
Bugfixes
----------------
* Fix Tokenization bug with Bigrams in the exception list
* Fix the versioning error in secondary SBT projects that caused models not to be found via the pretrained function
* Fix logging to file in NerDLApproach, ClassifierDL, SentimentDL, and MultiClassifierDL on HDFS
* Fix ignored modified tokens in BertEmbeddings; it now considers modified tokens instead of the originals
========
2.5.5
========
---------------
New Features
---------------
- Add getClasses() function to NerDLModel (example after this list)
- Add getClasses() function to ClassifierDLModel
- Add getClasses() function to SentimentDLModel
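For instance, a minimal Python sketch (the model name is illustrative):

```python
from sparknlp.annotator import NerDLModel

# Inspect the labels a pretrained model can predict without running it.
ner_model = NerDLModel.pretrained("ner_dl", "en")
print(ner_model.getClasses())  # e.g. ['B-PER', 'I-PER', 'B-LOC', ...]
```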
---------------------
Enhancements
---------------------
- Improve max sequence length calculation in BertEmbeddings and XlnetEmbeddings
----------------
Bugfixes
----------------
- Fix a bug in RegexTokenizer in Python
- Fix StopWordsCleaner exception in Python when pretrained() is used
- Fix max sequence length issue in AlbertEmbeddings and SentencePiece generation
- Fix HDFS support for setGraphFolder param in NerDLApproach
========
2.5.4
========
---------------
New Features
---------------
* Add support for Apache Spark 2.3.x including new Maven artifacts and full support of all pre-trained models/pipelines
* Add 43 new pre-trained models in 43 languages to StopWordsCleaner annotator
* Introduce a new RegexTokenizer to split text by regex pattern
---------------------
Enhancements
---------------------
* Retrained 6 new BioBERT and ClinicalBERT models
* Add a new param to `start()` function to start the session for Apache Spark 2.3.x
----------------
Bugfixes
----------------
* Add missing library for SentencePiece used by AlbertEmbeddings and XlnetEmbeddings on Windows
* Fix ModuleNotFoundError in LanguageDetectorDL pipelines in Python
========
2.5.3
========
---------------
New Features
---------------
* TextMatcher can now construct chunks from tokens instead of the original documents via the buildFromTokens param
* CoNLLGenerator is now accessible in Python
----------------
Bugfixes
----------------
* Fix a bug in ContextSpellChecker resulting in IllegalArgumentException
---------------------
Enhancements
---------------------
* Improve RocksDB connection to support different storage capabilities
* Improve parameters naming convention in ContextSpellChecker
* Add NerConverter to documentation
* Fix multi-language tabs in documentation
========
2.5.2
========
---------------
New Features
---------------
* Introducing a new LanguageDetectorDL state-of-the-art annotator to detect and identify languages in documents and sentences
* Add a new param entityValue to TextMatcher to add custom value inside metadata. Useful in post-processing when there are multiple TextMatcher annotators with multiple dictionaries https://github.com/JohnSnowLabs/spark-nlp/issues/920
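A hedged Python sketch of the new param (the setter name follows the param name above; the dictionary file and value are illustrative):

```python
from sparknlp.annotator import TextMatcher

# Tag matches from this dictionary with a custom metadata value so chunks
# from different TextMatcher instances can be told apart downstream.
drug_matcher = TextMatcher() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("drug_chunks") \
    .setEntities("drugs.txt") \
    .setEntityValue("drug")
```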
----------------
Bugfixes
----------------
* Add missing TensorFlow graphs to train ContextSpellChecker annotator https://github.com/JohnSnowLabs/spark-nlp/issues/912
* Fix misspelled classThreshold param in ContextSpellChecker annotator https://github.com/JohnSnowLabs/spark-nlp/issues/911
* Fix a bug where setGraphFolder in NerDLApproach annotator couldn't find a graph on Databricks (DBFS) https://github.com/JohnSnowLabs/spark-nlp/issues/739
* Fix a bug in NerDLApproach when includeConfidence was set to true https://github.com/JohnSnowLabs/spark-nlp/issues/917
* Fix a bug in BertEmbeddings https://github.com/JohnSnowLabs/spark-nlp/issues/906 https://github.com/JohnSnowLabs/spark-nlp/issues/918
---------------------
Enhancements
---------------------
* Improve TF backend in ContextSpellChecker annotator
========
2.5.1
========
---------------
New Features
---------------
* Add Python support for PubTator reader to convert automatic annotations of the biomedical datasets into DataFrame
* Add 6 new pre-trained BERT models from BioBERT and ClinicalBERT
---------------------
Enhancements
---------------------
* Add unit tests for XlnetEmbeddings
* Add unit tests for AlbertEmbeddings
* Add unit tests for ContextSpellChecker
========
2.5.0
========
---------------
New Features
---------------
* A new AlbertEmbeddings annotator with 4 available pre-trained models
* A new XlnetEmbeddings annotator with 2 available pre-trained models
* A new ContextSpellChecker annotator, the state-of-the-art annotator for spell checking
* A new SentimentDL annotator for multi-class sentiment analysis. This annotator comes with 2 available pre-trained models trained on IMDB and Twitter datasets
* Add new PubTator reader to convert automatic annotations of the biomedical datasets into DataFrame
* Introducing a new outputLogsPath param for NerDLApproach, ClassifierDLApproach and SentimentDLApproach annotators
* Refactored CoNLLGenerator to actually use NER labels from the DataFrame
* Unified params in NerDLModel in both Scala and Python
* Extend and complete Scaladoc APIs for all the annotators
----------------
Bugfixes
----------------
* Fix position of tokens in Normalizer
* Fix Lemmatizer exception on a bad input
* Fix annotator logs failing on object storage file systems like DBFS
----------------
Documentation
----------------
* Update documentation for release of Spark NLP 2.5.x
* Update the entire [spark-nlp-workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop) notebooks for Spark NLP 2.5.x
* Update the entire [spark-nlp-models](https://github.com/JohnSnowLabs/spark-nlp-models) repository with new pre-trained models and pipelines
========
2.4.5
========
---------------
Overview
---------------
We are very excited to extend Spark NLP support to several new Databricks runtimes (listed below) and to add support for cluster-mode in Cloudera and EMR YARN clusters.
As always, we thank our community for their feedback and questions in our Slack channel.
---------------
New Features
---------------
* Extend Spark NLP support for Databricks runtimes:
* 6.2
* 6.2 ML
* 6.3
* 6.3 ML
* 6.4
* 6.4 ML
* 6.5
* 6.5 ML
* Add support for cluster-mode in Cloudera and EMR YARN clusters
* New splitPattern param in Tokenizer to split tokens by regex rules
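A minimal Python sketch of the new param (the regex is illustrative):

```python
from sparknlp.annotator import Tokenizer

# Split tokens on whitespace, hyphens, or underscores via a regex rule.
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setSplitPattern("[\\s\\-_]")
```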
----------------
Bugfixes
----------------
* Fix ClassifierDLModel save and load in Python
* Fix ClassifierDL TensorFlow session reuse
* Fix Normalizer positions of new tokens
----------------
Documentation
----------------
* Update documentation for release of Spark NLP 2.4.x
* Update the entire [spark-nlp-workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop) notebooks for Spark NLP 2.4.x
* Update the entire [spark-nlp-models](https://github.com/JohnSnowLabs/spark-nlp-models) repository with new pre-trained models and pipelines
========
2.4.4
========
---------------
Overview
---------------
* We are very excited to release the very first multi-class text classifier in Spark NLP v2.4.4! We have built a generic ClassifierDL annotator that uses the state-of-the-art Universal Sentence Encoder as input for text classification. The ClassifierDL annotator uses a deep learning model (a DNN) built inside TensorFlow and supports up to 50 classes.
* We are also happy to announce the support of yet another language: Russian! We have trained and prepared 5 pre-trained models and 6 pre-trained pipelines in Russian.
**NOTE**: ClassifierDL is an experimental feature in 2.4.4 release. We have worked hard to aim for simplicity and we are looking forward to your feedback as always.
---------------
New Features
---------------
* Introducing an experimental multi-class text classifier, `ClassifierDL`, built on a DNN model in TensorFlow. This annotator can be trained on any dataset with 2 up to 50 classes.
* 5 new pretrained Russian models (Lemma, POS, 3x NER)
* 6 new pretrained Russian pipelines
---------------
Enhancements
---------------
* Add param to NerConverter to override modified tokens instead of original tokens
----------------
Bugfixes
----------------
* Fix TokenAssembler
* Fix NerConverter exception when NerDL is trained with different tagging style than IOB/IOB2
========
2.4.3
========
---------------
Overview
---------------
This minor release fixes a bug on our Python side that was introduced in 2.4.2 release.
As always, we thank our community for their feedback and questions in our Slack channel.
----------------
Bugfixes
----------------
* Fix Python imports which resulted in AttributeError: module 'sparknlp' has no attribute
========
2.4.2
========
---------------
Overview
---------------
This minor release fixes a few bugs in some of our annotators reported by our community.
As always, we thank our community for their feedback and questions in our Slack channel.
----------------
Bugfixes
----------------
* Fix UniversalSentenceEncoder.pretrained() that failed in Python
* Fix ElmoEmbeddings.pretrained() that failed in Python
* Fix ElmoEmbeddings poolingLayer param to be a string as expected
* Fix ChunkEmbeddings to preserve chunk's index
* Fix NGramGenerator and missing chunk metadata
---------------
New Features
---------------
* Add GPU support param in the Spark NLP start function: sparknlp.start(gpu=True)
* Improve create_model.py to create custom TF graph for NerDLApproach
----------------
Documentation
----------------
* Update documentation for release of Spark NLP 2.4.x
* Update the entire [spark-nlp-workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop) notebooks for Spark NLP 2.4.x
* Update the entire [spark-nlp-models](https://github.com/JohnSnowLabs/spark-nlp-models) repository with new pre-trained models and pipelines
========
2.4.1
========
---------------
Overview
---------------
This minor release fixes a few bugs in some of our annotators reported by our community.
As always, we thank our community for their feedback and questions in our Slack channel.
----------------
Bugfixes
----------------
* Improve ChunkEmbeddings annotator and fix the empty chunk result
* Fix UniversalSentenceEncoder crashing on empty Tensor
* Fix NorvigSweetingModel missing sentenceId that resulted in NGramGenerator crashing
* Fix missing storageRef in embeddings' column for ElmoEmbeddings annotator
----------------
Documentation
----------------
* Update documentation for release of Spark NLP 2.4.x
* Add new features such as ElmoEmbeddings and UniversalSentenceEncoder
* Add multiple programming languages for demos and examples
* Update the entire [spark-nlp-models](https://github.com/JohnSnowLabs/spark-nlp-models) repository with new pre-trained models and pipelines
========
2.4.0
========
---------------
Overview
---------------
We are very excited to finally release Spark NLP v2.4.0! This has been one of the largest releases we have ever made since the inception of the library!
The new release of Spark NLP `2.4.0` has been migrated to TensorFlow `1.15.0` which takes advantage of the latest deep learning technologies and pre-trained models.
As always, thanks to the community for the feedback and questions in our Slack channel.
Please beware that this release breaks backwards compatibility with previously saved models, particularly for TensorFlow and Embeddings, aside from code-breaking changes in the API.
We will be working on our documentation to ease the learning curve.
---------------
New Features
---------------
* TensorFlow 1.15.0 now works behind Spark NLP. This brings implicit improvements in performance, accuracy and functionalities
* New Annotator UniversalSentenceEncoder with 2 pre-trained models from TF Hub. Check our spark-nlp-models repo for updates
* New Annotator MultiDateMatcher capable of matching more than one date per sentence (Extends DateMatcher algorithm)
* New Annotator NGramGenerator with Param tweaks for customization
* New Annotator BigTextMatcher works best with large amounts of input data
* New Annotator ElmoEmbeddings with a pre-trained model from TF Hub. Check our spark-nlp-models repo for updates
* BertEmbeddings improvements with 5 new models from TF Hub
* RecursivePipelineModel, an enhanced PipelineModel, allows Annotators to access previous annotators in the pipeline for more ML strategies
* LazyAnnotators: a new Param in Annotators allows them to stand idle in the Pipeline and do nothing. They can be called by other Annotators in a RecursivePipeline
---------------
Enhancements
---------------
* RocksDB is now available as a flexible API called `Storage`, allowing any annotator to have its own distributed local index database
* Our TensorFlow pre-trained models are now cross-platform, enabling multi-language models and other improvements for Windows users
* Improved IO performance in general for handling embeddings
* Improved cache cleanup and GC by liberating open files utilized in RocksDB (to be improved further)
* Tokenizer and SentenceDetector Params minLength and maxLength filter out annotations outside these bounds
* Tokenizer improvements in splitChars and simplified rules
* DateMatcher improvements
* TextMatcher now preloads algorithm information within the model for faster prediction
* Annotators that utilize embeddings now have strict validation to ensure they use exactly the embeddings they were trained with
* Improvements in the API allow Annotators with Storage to save and load their RocksDB database independently and let it be shared across Annotators
----------------
Bugfixes
----------------
* Fixes in Chunk and SentenceEmbeddings to better deal with empty cleaned-up Annotations
* Fixed PretrainedPipeline in Python to allow accessing the inner PipelineModel in the instance
* Probably a bunch of unmentioned bugfixes along the way :)
========
2.3.6
========
---------------
Overview
---------------
This minor release fixes a bug in ChunkEmbeddings causing an out-of-boundaries exception in some scenarios. We
also switched to Maven coordinates as the default source for the start() function, since spark-packages has not been responsive
in their package approval process. Thank you all for your consistent feedback.
---------------
Bugfixes
---------------
* Fixed a bug in ChunkEmbeddings caused by an out-of-bounds exception in some scenarios
---------------
Other
---------------
* start() function switched to use maven coordinates instead
========
2.3.5
========
---------------
Overview
---------------
We would like to thank you all for your valuable feedback via our Slack channels and our GitHub repositories.
Spark NLP `2.3.4` is a very stable and rock-solid release. However, we wanted to fix the few remaining minor bugs before moving to our bigger release `2.4.0`!
---------------
Bugfixes
---------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/702 Date matcher fixes flexible dates
* https://github.com/JohnSnowLabs/spark-nlp/pull/718 Fixed a bug in the pragmatic sentence detector where a sub-matched group contained a dollar sign.
* https://github.com/JohnSnowLabs/spark-nlp/pull/719 Move import to top-level to avoid import fail in Spark NLP functions
* https://github.com/JohnSnowLabs/spark-nlp/pull/709 https://github.com/JohnSnowLabs/spark-nlp/pull/716 Some improvements in our documentation thanks to @marcinic @howmuchcomputer
========
2.3.4
========
---------------
Overview
---------------
Thank you, as always, for the feedback given on Slack and in our repos. The most important part of this release
is how we internally started organizing models. We'll be publishing our model news in
https://github.com/JohnSnowLabs/spark-nlp-models . The models repo will be kept up to date.
As for this release, it improves various internal API functionalities, allowing for positive side effects across
the library. As an important enhancement, we have added UDFs and functions for both Scala and Python users
to easily manipulate annotations on DataFrames. Finally, we have fixed various bugs in embeddings
metadata to make sure we provide accurate offset information for other annotators to consume successfully.
---------------
Enhancements
---------------
* Revamped functions in Scala and Python to help users deal with annotations from DataFrames or in UDF form, such as `map_annotations` and `filter_by_annotations`
---------------
Bugfixes
---------------
* Fixed bugs in ChunkEmbeddings and SentenceEmbeddings causing them to report wrong metadata and offset values
* Fixed a nested import issue in Python causing LightPipelines not to work in some environments
---------------
Developer API
---------------
* downloadModel is now flexible as to which inner downloader class is being used to access AnnotatorModel reference
* pretrained API now deals with defaultModelName as an Option to allow non default pretrained models
---------------
Other
---------------
* version() now returns the version string instead of just printing it
========
2.3.3
========
---------------
Overview
---------------
We are very glad to announce this release; it actually ended up much bigger than we expected.
Thanks to community feedback, we landed many bugfixes. We also spent some time building
models for the TextMatcher, so it got various improvements and bugfixes when dealing with empty sentences or cleaned-up tokens.
We also added UDF-ready functions in Python to easily deal with Annotations. Finally, we fixed a few bugs when loading models from disk.
Thank you very much for the constant feedback on Slack.
---------------
New Features
---------------
* TextMatcher new param `mergeOverlapping` allows for handling overlapping output chunks when matching entities share keywords
* NER overwriter annotator allows for overwriting NER output with custom entities
* Added `map_annotations`, `map_annotations_strict`, `map_annotations_col`, `filter_by_annotations_col` and `explode_annotations_col` functions to the Python side, allowing Annotations to be handled easily
---------------
Enhancement
---------------
* Made ChunkEmbeddings output compatible with SentenceEmbeddings for better flexibility in pipelines
---------------
Bugfixes
---------------
* Fixed BertEmbeddings crashing on empty input sentences
* Fixed missing load API and import shortcuts on the new Embeddings annotators
* Added missing metadata fields in ChunkEmbeddings
* Fixed wrong sentence IDs in sentences or tokens that got a cleanup during the pipeline
* Fixed typos in docs. Thanks @marcinic
* Fixed bad deprecated OCR and SpellChecker python classpath
========
2.3.2
========
---------------
Overview
---------------
This release addresses multiple bug fixes and some enhancements regarding memory consumption in BertEmbeddings annotator.
Thanks for your feedback and reports!
---------------
Bugfixes
---------------
* Fix missing EmbeddingsFinisher in Scala and Python
* Reverted embeddings move to copy due to CRC issue
* Fix IndexOutOfBoundsException in SentenceEmbeddings
---------------
Enhancement
---------------
* Optimize BertEmbeddings memory consumption
========
2.3.1
========
---------------
Overview
---------------
This quick release addresses a bug in the Lemmatizer loading/pretrained function causing it not to work in 2.3.0.
We took the chance to include a feature that did not make it into the base 2.3.0 release, and slightly changed protected variables for
a better Java API, also including a Java-compatible pretrained function. Thanks for the quick issue feedback again!
---------------
New Features
---------------
* New EmbeddingsFinisher specializes in dealing with the output of embedding annotators. The traditional Finisher still behaves the same as in 2.3.0
---------------
Bugfixes
---------------
* Fixed a bug in the previous release causing LemmatizerModel not to load from disk or via the pretrained function
* Fixed pretrained() function to return proper type in Java
---------------
Developer API
---------------
* defaultModelName, defaultLang and defaultLoc static pretrained properties are now public
========
2.3.0
========
---------------
Overview
---------------
Thanks for your contributions and feedback on Slack. This amazing release comes with many new features in the embeddings scope,
allowing pipeline builders to retrieve embeddings for specific bodies of text in any form given, from sentences to chunks or n-grams.
We also worked a lot on making sure Spark NLP on Java works as intended. Finally, we improved AWS profile compatibility for frameworks
that utilize multiple credential profiles. Unfortunately, we have deprecated Eval and OCR due to internal patents in some of the latest improvements
John Snow Labs has contributed to.
---------------
New Features
---------------
* New SentenceEmbeddings annotator utilizes WordEmbeddings or BertEmbeddings to generate sentence or document embeddings
* New ChunkEmbeddings annotator utilizes WordEmbeddings or BertEmbeddings to generate chunk embeddings from Chunker or NGramGenerator outputs
* New StopWordsCleaner integrates Spark ML StopWordsRemoval function into Spark NLP pipeline
* New NGramGenerator annotator integrates the Spark ML NGram function into Spark NLP with a new cumulative feature to also generate range ngrams like the scikit-learn library
---------------
Enhancements
---------------
* Improved Java intercompatibility on Pretrained and LightPipeline APIs. Examples added.
* The Parse Embeddings Vector flag in Finisher and LightPipeline allows optional vector processing to save memory and improve performance
* setInputCols in Python can now be passed as *args
* new Param enableScore in SentimentDetector to switch output types between confidence score and results (Thanks @maxwellpaulm)
* The AWS config now defaults to the spark_nlp profile name, allowing compatibility with multiple-profile downloads
---------------
Bugfixes
---------------
* Fixed POS training dataset creator to improve performance
---------------
Deprecations
---------------
* OCR Module dropped from open source support
* Eval Module dropped from open source support
========
2.2.2
========
---------------
Overview
---------------
Thank you again for all your feedback and questions in our Slack channel. Such feedback from users and contributors
(thank you Stuart Lynn @sllynn) helped us find several Python module bugs. We also fixed and improved OCR support
for extracting page coordinates, and fixed the NerDL evaluator in Python.
---------------
Enhancements
---------------
* Added a create_models.py Python script to generate graphs for NerDL without the need for Jupyter
* Added a new annotator Token2Chunk to convert all tokens to chunk types (useful for extracting token coordinates from OCR)
* Added OCR Page Dimensions
* Python setInputCols now accepts *args, so there is no need to pass a list (see the sketch below)
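A minimal sketch of the two now-equivalent call styles (model and column names are illustrative):

```python
from sparknlp.annotator import WordEmbeddingsModel

embeddings = WordEmbeddingsModel.pretrained()
embeddings.setInputCols("document", "token")    # new *args style
embeddings.setInputCols(["document", "token"])  # original list style
embeddings.setOutputCol("embeddings")
```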
---------------
Bugfixes
---------------
* Fixed python support of NerDL evaluation not taking all params appropriately
* Fixed a bug in case sensitivity matching of embeddings format in python (Thanks @sllynn)
* Fixed a bug in python DateMatcher with dateFormat param not working (Thanks @sllynn)
* Fixed a bug in PositionFinder reporting duplicate coordinate elements
----------------
Developer API
----------------
* Renamed trainValidationProp to validationSplit in NerDLApproach
----------------
Documentation
----------------
* Added several missing annotator documentation in docs page
========
2.2.1
========
---------------
Overview
---------------
This short release is to address a few uncovered issues in the previous 2.2.0 release. Thank you all for quick feedback.
---------------
Enhancements
---------------
* NerDLApproach new param includeValidationProp allows partitioning the training set and excluding a fraction
* NerDLApproach trainValidationProp now randomly samples the data instead of taking it from the head
---------------
Bugfixes
---------------
* Fixed a bug in ResourceHelper causing folder resources to fail when a folder is empty (affects various annotators)
* Fixed a bug in Python where the embeddings format was not parsed to upper case
* Fixed a bug in Python causing an inability to load PipelineModels after loading embeddings
========
2.2.0
========
---------------
Overview
---------------
Last time, following a release-candidate schedule proved to be a quite effective method of avoiding silly bugs right after release!
Fortunately, there were no breaking bugs thanks to carefully testing releases alongside the community,
which resulted in various pull requests. This huge release features OCR-based coordinate highlighting, a BERT embeddings refactor and tuning, more tools for accuracy evaluation in Python, and much more.
We welcome your feedback in our Slack channels, as always!
---------------
New Features
---------------
* OCRHelper now returns coordinate positions matrix for text converted from PDF
* New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
* Evaluation module now also ported to Python
* WordEmbeddings now include coverage metadata information and new static functions `withCoverageColumn` and `overallCoverage` offer metric analysis
* NerDL now has an `includeConfidence` param that enables confidence scores in prediction metadata (see the sketch after this list)
* NerDLApproach now has `enableOutputLog`, which outputs training metric logs to a file
* New Param `poolingLayer` in BERT allows for pooling layer selection
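A hedged Python sketch of the new confidence option (model and column names are illustrative):

```python
from sparknlp.annotator import NerDLModel

# Emit a confidence score for each predicted tag in the annotation metadata.
ner = NerDLModel.pretrained() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") \
    .setIncludeConfidence(True)
```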
---------------
Enhancements
---------------
* BERT Embeddings now merge much better with Spark NLP, returning state-of-the-art accuracy numbers for NER (details to be expanded). Thank you for the community feedback.
* Progress bar and size estimate report when downloading pretrained models and loading embeddings
* Models and pipelines cache is now more efficiently managed and includes a CRC (not retroactive)
* Finisher and LightPipeline now deal with embeddings properly, including them in pre-processed results (thank you Will Held)
* Tokenizer now allows regular expressions in the list of Exceptions (Thank you @atomobianco)
* PretrainedPipelines now allow the `fullAnnotate` function to retrieve the full information of Annotations
* DocumentAssembler new cleanup modes: each, each_full and delete_full allow more control over text cleanup (different ways of dealing with new lines and tabs)
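A minimal Python sketch of the cleanup modes (the mode choice is illustrative):

```python
from sparknlp.base import DocumentAssembler

# The new modes (each, each_full, delete_full) give finer-grained control
# over how new lines and tabs are removed from the source text.
assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("each_full")
```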
---------------
Bugfixes
---------------
* Fixed a bug in NerConverter caused by empty entities, returning an error when flushing entities
* Fixed a bug when creating BERT Models from python, where contrib libraries were not loaded
* Fixed missing setters for whitelist param in NerConverter
* Fixed a bug where parameters from a BERT model were incorrectly being read from python because of not being correctly serialized
* Fixed a bug where ResourceDownloader conflicted S3 credentials with public model access (Thank you Dimitris Manikis)
* Fixed Context Spell Checker bugs with performance improvements (pretrained model disabled until we get a better one)
========
2.1.1
========
---------------
Overview
---------------
Thank you so much for your feedback on Slack. This release extends the life of the 2.1.x series with important bugfixes from upstream
---------------
Bugfixes
---------------
* Fixed a bug in NerConverter caused by empty entities, returning an error when flushing entities
* Fixed a bug when creating BERT Models from python, where contrib libraries were not loaded
* Fixed missing setters for whitelist param in NerConverter
========
2.1.0
========
---------------
Overview
---------------
Thank you for following up with release candidates. This release is backwards breaking because two basic annotators have been redesigned.
The tokenizer now has easier-to-customize params and simplified exception management.
DocumentAssembler `trimAndClearNewLines` was redesigned into a `cleanupMode` for further control over the cleanup process.
Tokenizer now supports pretrained models, meaning you'll be capable of accessing any of our language-based Tokenizers.
Another big introduction is the `eval` module: an optional Spark NLP sub-module that provides evaluation scripts, to
make it easier to measure how your own models perform against a validation dataset, now using MLFlow.
Some work also began on metrics during training, starting now with the `NerDLApproach`.
Finally, we'll have Scaladocs ready for easy library reference.
Thank you for your feedback in our Slack channels.
Particular thanks to @csnardi for fixing a bug in one of the release candidates.
---------------
New Features
---------------
* Spark NLP Eval module, includes functions to evaluate NER and Spell Checkers with MLFlow (Python support and more annotators to come)
---------------
Enhancements
---------------
* DocumentAssembler new param `cleanupMode` allows user to decide what kind of cleanup to apply to source
* Tokenizer has been severely enhanced to allow easier and more intuitive customization
* Norvig and Symmetric spell checkers now report confidence scores in metadata
* NerDLApproach now reports metrics and F1 scores with automated dataset splitting through `setTrainValidationProp`
* Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence score, etc.), laying the groundwork for further development
---------------
Bugfixes
---------------
* Fixed Dependency Parser not reporting offsets correctly
* Dependency Parser now only shows head token as part of the result, instead of pairs
* Fixed NerDLModel not allowing users to pick noncontrib versions from Linux
* Fixed a bug in embeddingsRef validation allowing the user to override ref when not possible
* Removed unintentional gc calls causing some performance issues
---------------
Framework
---------------
* ResourceDownloader now capable of utilizing credentials from AWS standard means (variables, credentials folder)
---------------
Documentation
---------------
* Scaladocs for Spark NLP reference
* Added a Google Colab walkthrough guide
* Added Approach and Model class names in reference documentation
* Fixed various typos and outdated pieces in documentation
========
2.0.8
========
---------------
Overview
---------------
This release fixes a few tiny but meaningful issues that caused newly trained models to have internal compatibility issues.
---------------
Bugfixes
---------------
* Fixed wrong logic when checking embeddingsRef is being overwritten in a WordEmbeddingsModel
* Deleted unnecessary chunk index from tokens
* Fixed compatibility issues in some newly trained models where the Python API had mismatching pretrained models compared to Scala
========
2.0.7
========
---------------
Overview
---------------
This release addresses bugs related to cluster support, improving error messages and fixing various potential bugs that depend
on the cluster configuration, such as Kryo serialization or non-default FS systems
---------------
Bugfixes
---------------
* Fixed a bug introduced in 2.0.5 that caused NerDL not to work in clusters with Kryo serialization enabled
* NerDLModel was not properly reading user provided config proto bytes during prediction
* Improved the cluster embeddings message to hint users about cluster mode without shared filesystems
* Removed lazy model downloading in PretrainedPipeline; the model is now downloaded at instantiation
* Fixed URI construction for cluster embeddings on non-defaultFS configurations, improving cluster compatibility
========
2.0.6
========
---------------
Overview
---------------
Following 2.0.5 (read the notes below), this release fixes a bug when disabling the contrib param in NerDLApproach on non-Windows OSs
---------------
Bugfixes
---------------
* Fixed NerDLApproach failing when training with setUseContrib(false)
========
2.0.5
========
---------------
Overview
---------------
This release bumps Spark NLP to Apache Spark 2.4.3 by default. Spark has been undergoing testing with Scala 2.12 and is back on 2.11 now, so this should be a working release.
In this version, we fixed a series of pretrained models and focused on improving the flexibility of the NerDL annotator, which is arguably the most popular one based on user feedback.
Users can point to graphs they create without having to re-compile the library; graph options, as well as whether to use TensorFlow contrib, are now user-defined.
Particular thanks to @CyborgDroid for reporting important and well-documented bugs that helped us improve Spark NLP.
Thank you for reporting issues and feedback; we always welcome more. Join us on Slack!
---------------
Enhancements
---------------
* ViveknSentiment annotator now includes confidence score in metadata
* NerDL now has setGraphFolder to allow a path to a folder with custom graphs generated using python/tensorflow code
* NerDL now has setConfigProtoBytes to allow users to submit their own serialized ConfigProto to the graph settings
* NerDLApproach now has setUseContrib to let the training user decide whether or not to use contrib. Contrib LSTM Cells are proven to return more accurate results, but do not work on Windows yet
* Updated default TensorFlow settings to include GPU allow_growth by default, and disabled the log-device-placement spamming message
* Spark version bumped to 2.4.3
---------------
Bugfixes
---------------
* Fixed contrib NerDL models not working properly in clusters such as Databricks (Thanks @CyborgDroid)
* Fixed sparknlp.start(include_ocr=True) missing dependencies for OCR
* Fixed DependencyParser pretrained models not working properly in Python
---------------
Models and Pipelines
---------------
* NerDL will download the noncontrib model if Windows is detected, for better compatibility
* noncontrib version of pipelines with NerDL have been uploaded, as well as new models. Check documentation for complete list
* Improved the error message shown when a user on Windows tries to load a contrib NerDL model
* Fixed ViveknSentimentModel not working properly (Thanks @CyborgDroid)
---------------
Developer API
---------------
* Embeddings in python moved to annotator module for consistency
* SourceStream ResourceHelper class now properly handles cluster files for Dependency Parser
* Metadata model reader now ignores empty lines instead of failing
* Unified lang instead of language attribute name in pretrained API
========
2.0.4
========
---------------
Overview
---------------
We are excited about the Spark NLP workshop (spark-nlp-workshop repository) being so useful for many users.
We also made a step forward by moving the website's documentation to an easy-to-maintain wiki! The Spark NLP library received key bug fixes
in this release. Thanks to the community for reporting issues on GitHub. Much more to come, as always.
---------------
Bugfixes
---------------
* Fixed DependencyParser and TypedDependencyParser working inaccurately
* Fixed a bug preventing the load of WordEmbeddingsModel class from python
* Fixed wrong pretrained model names preventing some pretrained models to work properly
* Fixed BertEmbeddings not being capable of loading from a file due to a reader exception
---------------
Documentation
---------------
* Website documentation migrated to GitHub wiki page (WIP)
---------------
Developer API
---------------
* OcrHelper now reports failed file name when throwing exceptions (Thanks @kgeis)
* Fixed Annotation function explodeAnnotations to consider replacing output column scenarios
* Fixed TRAVIS CI unit tests
========
2.0.3
========
---------------
Overview
---------------
Shortly after 2.0.2, a hotfix release was made to address two bugs that prevented users from using pretrained TensorFlow models in clusters.
Please read release notes for 2.0.2 to catch up!
---------------
Bugfixes
---------------
* Fixed logger serializability, which caused issues when executors serialize TensorflowWrapper
* Fixed contrib loading in cluster, when retrieving a Tensorflow session
========
2.0.2
========
---------------
Overview
---------------
Thank you for joining us in this exciting Spark NLP year! We continue to make progress towards a better-performing library, both in speed and in accuracy.
This release focuses strongly on the quality and stability of the library, making sure it works well in most cluster environments
and improving compatibility across systems. Word Embeddings continue to be improved for better performance and a lower memory footprint.
The Context Spell Checker continues to receive enhancements in concurrency and usage of Spark. Finally, TensorFlow-based annotators
have been significantly improved by refactoring the serialization design. Help us with feedback, and we'll welcome any issue reports!
---------------
New Features
---------------
* NerCrf annotator now has an includeConfidence param that includes confidence scores for predictions in metadata
---------------
Enhancements
---------------
* Cluster-mode performance improved in TensorFlow annotators by serializing internal information to bytes
* Doc2Chunk annotator's new params startCol, startColByTokenIndex, failOnMissing and lowerCase allow better chunking of documents
* All annotations that derive from sentence or chunk types now contain metadata information referring to the sentence or chunk ID they belong to
* ContextSpellChecker now creates a window around the token to improve computation performance
* Improved WordEmbeddings matching accuracy by trying alternative case sensitive tokens
* WordEmbeddings won't load twice if already loaded
* WordEmbeddings can use embeddingsRef if a source was not provided, improving reuse of embeddings in a pipeline
* WordEmbeddings new param includeEmbeddings allows annotators not to save the entire embeddings source along with them
* Contrib tensorflow dependencies now only load if necessary
---------------
Bugfixes
---------------
* Added missing Symmetric delete pretrained model
* Fixed a broken param name in Normalizer (thanks @RobertSassen)
* Fixed Cloudera cluster support
* Fixed concurrent access in ContextSpellChecker in high partition number use cases and LightPipelines
* Fixed POS dataset creator to better handle corrupted pairs
* Fixed a bug in Word Embeddings not matching exact case sensitive tokens in some scenarios
* Fixed OCR Tess4J initialization problems in concurrent scenarios
---------------
Models and Pipelines
---------------
* Renaming of models and pipelines (work in progress)
* Better output column naming in pipelines
---------------
Developer API
---------------
* Unified more WordEmbeddings interface with dimension params and individual setters
* Improved unit tests for better compatibility on Windows
* Python embeddings moved to sparknlp.embeddings
========
2.0.1
========
---------------
Overview
---------------
Thanks for following up after our 2.0.0 release! This release covers a few holes left by the immense 2.0.0 release,
addressing high-priority issues found after release. More importantly, the library should now behave correctly when using
Spark cluster modes, and memory and CPU utilization should be reduced to normal levels after some serious profiling of serialization
revealed a bunch of problems. Aside from performance and resource-management improvements, we include an OCR dependency handler in the start() function as well
as improved GPU support for NER Deep Learning models. Finally, check out our spark-nlp-workshop repo; it has cool features!
---------------
Enhancements
---------------
* Improved serialization of Deep Learning models, shows performance boosts of up to 2.5 times over 1.8.3
* Tensorflow contrib libraries now managed correctly across a cluster
* Reverted useFeatureBroadcasting after internal benchmarks proved it was performing better
* SparkNLP.start() and sparknlp.start() now accept an includeOCR parameter which automatically includes the OCR library
* Recreated NerDL Graphs to allow GPU allow_growth in tensorflow to improve memory management with GPU
* Expanded GPU coverage in NerDL graph
* Reduced NerDL Batch Size for better compatibility with GPUs
---------------
Bugfixes
---------------
* Fixed deep learning models not working across clusters due to a bug in inputBuffers from graph reading
* Fixed a bug in the POS() training function which did not work correctly from Python
* Fixed a bug in OCR where page number and intersection were not correctly matched
* Correctly handle exceptions when training Norvig and Symmetric Spell Checkers from dataframes
---------------
Developer API
---------------
* ContextSpellChecker now follows Features API correctly
---------------
Documentation
---------------
* spark-nlp-workshop repository has been expanded with better documentation and new notebooks
* We are still catching up with the 2.x release!
========
2.0.0
========
---------------
Overview
---------------
Thank you for following up with the biggest changelog ever on Spark NLP: Spark NLP 2.0.0! Where to begin?
We have no less than 50 Pull Requests merged this time. Most importantly, we became the first library to have a production-ready
implementation of BERT embeddings. Along with this interesting deep learning and context-based embeddings algorithm, here is a quick overview of new things:
* Word Embeddings as well as Bert Embeddings are now annotators, just like any other component in the library. This means embeddings can be
cached in memory through DataFrames, saved on disk and shared as part of pipelines!
* We revamped and enhanced Named Entity Recognition (NER) Deep Learning models to a new state of the art level, reaching up to 93% F1 micro-averaged accuracy in the industry standard.
* We upgraded tensorflow version and also started using contrib LSTM Cells.
* Performance and memory usage improvements also tag along, achieved by improving serialization throughput of Deep Learning annotators, thanks to feedback from Apache Spark contributor Davies Liu.
* Revamping and expanding our pretrained pipelines list, plus the addition of new pretrained models for different languages together with
tons of new example notebooks, which include changes that aim to make the library easier to use. The API overall was modified to help newcomers get started.
* OCR module comes with a handful of improvements that increase accuracy.
All of this comes together with a full range of bug fixes and annotator improvements, follow up the details below!
Bear with us, since documentation is still catching up a little bit, as are new models to be made available. Stay tuned on Slack!
----------------
New Features
----------------
* BertEmbeddings annotator, with four Google models ready to be used through Spark NLP as part of your pipelines; includes Wordpiece tokenization.
* WordEmbeddings, our previous embeddings system is now an Annotator to be serialized along Spark ML pipelines
* Created training helper functions that create spark datasets from files, such as CoNLL and POS tagging
* NER DL has been revamped by using contrib LSTM Cells. Added library handling for different OS.
----------------
Enhancements
----------------
* OCR improved handling of images by adding binarizing of buffered segments
* OCR now allows automatic adaptive scaling
* SentenceDetector params merged between DL and Rule based annotators
* SentenceDetector max length has been disabled by default, and now truncates by whitespace
* Part of Speech, NER, Spell Checking and Vivekn Sentiment Analysis annotators now train from dataset passed to fit() using Spark in the process
* Tokens and Chunks now hold metadata information regarding which sentence they belong to by sentence ID
* AnnotatorApproach annotators now allow a trainingCols param, letting them use different inputs in training and in prediction. Improves Pipeline versatility.
* LightPipelines now allow method transform() to call against a DataFrame
* Noticeable performance gains by improving serialization performance in annotators through removal of transient variables
* Spark NLP in 30 seconds: new functions SparkNLP.start() (Scala) and sparknlp.start() (Python) automatically create a local Spark session.
* Improved DateMatcher accuracy
* Improved Normalizer annotator by supporting and tokenizing a slang dictionary, with case sensitivity matching option
* ContextSpellChecker now is capable of handling multiple sentences in a row
* PretrainedPipeline feature now allows handling John Snow Labs remote pretrained pipelines to make it easy to update and access new models
* Symmetric Delete spell checking model improved training performance
----------------
Models and Pipelines
----------------
* Added more than 15 pretrained pipelines that cover a huge range of use cases. To be documented
* Improved multi-language support by adding French and Italian pipelines and models. More to come!
* Dependency Parser annotators now include a pretrained english model based on CoNLL-U 2009
----------------
Bugfixes
----------------
* Fixed python classname reference when deserializing pipelines
* Fixed serialization in ContextSpellChecker
* Fixed a bug in LightPipeline causing not to include output from embedded pipelines in a PipelineModel
* Fixed DateMatcher wrong param name not allowing to access it properly
* Fixed a bug where DateMatcher didn't know how to handle dash in dates where year had two digits instead of four
* Fixed a ContextSpellChecker bug that prevented it from being used repeatedly with collections in LightPipeline
* Fixed a bug in OCR that made it blow up with some image formats when using text preferred method
* Fixed a bug on OCR which made params not to work in cluster mode
* Fixed OCR setSplitPages and setSplitRegions to work properly if tesseract detected multiple regions
----------------
Developer API
----------------
* AnnotatorType params renamed to inputAnnotatorTypes and outputAnnotatorTypes
* Embeddings now serialize along a FloatArray in Annotation class
* Disabled useFeatureBroadcasting, which showed better performance numbers when training large models in annotators that use Features
* OCR must be instantiated
* OCR works best with 4.0.0-beta.1
----------------
Build and release
----------------
* Added GPU build with tensorflow-gpu to Maven coordinates
* Removed .jar file from pip package
========
1.8.3
========
---------------
Overview
---------------
We're glad to announce a new release for Spark NLP. This one highlights the community, who contributed
immensely by reporting bugs and giving feedback on the library. This release focuses on various bugfixes around DeepSentenceDetector
and Python deserialization of some specific pipelines. It also improves the DeepSentenceDetector, allowing further fine-tuning
and customization. Embeddings are now being cached in the models folder, with further improvements towards accessing
them through S3 storage. Finally, we have made serious improvements to the notebooks and documentation around the library.
Special thanks to @Tshimanga and @haimco10 for very interesting contributions. See you on Slack!
---------------
Enhancements
---------------
* Improved OCR performance in skew detection
* SentenceDetector now better handles single quote protections (Thanks @haimco10)
* DeepSentenceDetector now can explodeSentences (Thanks @Tshimanga from Deep6.ai)
* EmbeddingsHelper is now capable of caching downloaded embeddings to avoid re-downloading
* Application.conf file may now be read from an s3 location
* DeepSentenceDetector has now access to all pragmatic SentenceDetector params in order to fine-tune it
---------------
Bugfixes
---------------
* Fixed ambiguous classpath resolution in pyspark, causing errors in deserializing some models
* Fixed DeepSentenceDetector not being deserializable in PySpark
* Fixed Chunk2Doc and Doc2Chunk annotators not being loadable in PySpark
* Fixed a bug where DeepSentenceDetector wouldn't correctly denote start and end offsets (Thanks @Tshimanga from Deep6.ai)
* Fixed a bug where DeepSentenceDetector would miss sentence parts when NER model missed header sentence (Thanks @Tshimanga from Deep6.ai)
* Cleaned and optimized DeepSentenceDetector code (Thanks @danilojsl)
* Fixed a missing dependency for OCR
---------------
Documentation and notebooks
---------------
* Added support and instructions for Anaconda deployment (Thanks @Maziyar)
* Updated various python notebooks to show utilization of spark packages instead of jars
* Added a new conference talk with Spark NLP in French at XebiCon'18
* Updated documentation towards less use of jars in favor of dependency solving
========
1.8.2
========
---------------
Overview
---------------
This release targets improved performance and resource usage in some pipelines that use word embeddings. It also comes
with a very interesting auto-rotation feature in OCR, and a couple of new annotators to solve particular needs, including the ChunkTokenizer
and a Param to limit sentence lengths. Finally, we are starting to organize our multilingual store of models and data for training models.
Check the examples for some Italian notebooks! Thanks again to the whole community for such quick feedback all the time.
---------------
New Features
---------------
* OCR now capable of automatic rotation, significantly improving accuracy in some scenarios
* ChunkTokenizer is a new annotator that tokenizes CHUNK type annotations. It extends the Tokenizer algorithm and stores the chunk ID for reference.
* SentenceDetector new Param maxLength now cuts off sentences longer than (by default) 240 characters. It avoids Deep Learning annotator issues and may improve performance in some scenarios.
* NerConverter new Param whiteList now allows a list of NER labels to be considered, while discarding the rest. May be useful for selective CHUNKing pipelines (see the sketch below).
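A hedged Python sketch of the whiteList param (the label values are illustrative):

```python
from sparknlp.annotator import NerConverter

# Keep only person and location entities; all other NER labels are discarded.
converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("entities") \
    .setWhiteList(["I-PER", "I-LOC"])
```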
---------------
Enhancements
---------------
* Pipelines using Word Embeddings should now perform faster due to a group of RocksDB optimizations allowing annotators to reuse current open connections to DB
---------------
Bugfixes
---------------
* Fixed a bug where DeepSentenceDetector was missing the load() interface (Thanks @Tshimanga from Deep6!)
* Fixed a bug where RocksDB opened too many files at once causing pipelines to fail or to work very slowly
* Fixed NerCrfModel prefetching RocksDB, which caused slower performance
---------------
Framework
---------------
* Added missing artifact resolution dependencies for OCR Module
* Started adding and organizing multilanguage models (Thanks @maziyarpanahi)
* Updated RocksDB to 5.17.2
========
1.8.1
========
---------------
Overview
---------------
This hotfix version of Spark-NLP improves framework support by adding Maven coordinates for OCR and allowing S3 retrieval of files.
We also included code for generating graphs for NerDL and for creating your own metadata files for a private model downloader.
As a new feature, we are including an experimental machine learning based sentence detector, which uses NER for bounds detection.
Aside from this, we are including a few bug fixes and OCR improvements. Enjoy, and thanks again for the community contributions!
---------------
New Features
---------------
* New DeepSentenceDetector annotator takes Spark-NLP's NER Deep Learning models as a base to improve sentence detection
---------------
Enhancements
---------------
* Improved accuracy of ContextSpellChecker by enabling re-ranking of candidate words according to a weighted Levenshtein distance
* OCR process now defaults to splitting content into rows when paragraphs or pages are identified, for improved parallelism. May be turned off
---------------
Examples and use cases
---------------
* Added Scala examples for Sentiment analysis and Lemmatizer in Italian (Thanks Vincenzo Gaudenzi from DXC.technology for dataset and model contribution!!!)
---------------
Bugfixes
---------------
* Fixed a bug in Norvig and Symmetric SpellCheckers where the pattern parameter was not provided properly on the Scala side (Thanks @johnmccain for reporting!)
---------------
Framework
---------------
* Added hadoop-aws dependency for remote download capabilities (e.g. word embeddings sets)
---------------
Other
---------------
* Code for the metadata files behind pretrained model downloads is now included. This may be useful if anyone wants to set up their own private local model downloader service
* NerDL Graphs generation code is now included in the library. This allows the usage of custom word embedding dimensions and feature counts.
---------------
Special mentions
---------------
* Vincenzo Gaudenzi (DXC.technology) for contributing Italian datasets and models, and @maziyar for creating examples with them.
* @correlator from Deep6.ai for contributing feedback in Slack and feature feedback in general
* @johnmccain for reporting bugs in spell checker
* @rohit-nlp for delivering maven coordinates for OCR
* @haimco10 for contributing a sentence detector improvement with an apostrophe use case. Not merged due to specific issues involved.
========
1.8.0
========
---------------
Overview
---------------
This release is huge! Spark-NLP made the leap into Spark 2.4.0, even with the challenge of not having everyone on board there yet (e.g. Zeppelin doesn't support it yet).
In this version we release three new NLP annotators. Two for dependency parsing processes and one for contextual deep learning based spell checking.
We also significantly improved OCR functionality, fine-tuning capabilities and general output performance, particularly on tesseract.
Finally, there's plenty of bug fixes and improvements in the word embeddings field, along with performance boosts and reduced disk IO.
Feel free to shoot us any feedback you have! Particularly on your Spark 2.4.x experience.
---------------
New Features
---------------
* Built on top of Spark 2.4.0
* Dependency Parser annotator allows for sentence relationship encoding
* Typed Dependency Parser annotator allows for labeling relationships within dependency tags
* ContextSpellChecker is our first Deep Learning based Spell Checker that evaluates context and not only tokens
---------------
Enhancements
---------------
* More OCR parameters exposed for further fine tuning, including preferred methods priority and page segmentation modes
* OCR now has a setting setSplitPages() which allows setting whether to output one page per row or the entire document instead
* Improved word embeddings performance when working in local filesystems
* Reduced the amount of disk IO when working with Word Embeddings
* All python notebooks improved for better readability and better documentation
* Simplified PySpark interface API
* CoNLLGenerator utility class which helps build CoNLL-2003 files for NER training
* EmbeddingsHelper now allows reading word embeddings files directly from s3a:// paths
---------------
Bugfixes
---------------
* Solved race-condition issues regarding cluster usage of the RocksDB index for embeddings
* Fixed application.conf reading bug which didn't properly refresh AWS credentials
* RocksDB index no longer uses compression, in order to support Windows without native RocksDB compression libraries
* Fixed various Python default parameter settings
* Fixed circular dependency with jbig pdfbox image OCR
---------------
Deprecations
---------------
* DeIdentification annotator is no longer supported in the open source version of Spark-NLP
* AssertionStatus annotator is no longer supported in the open source version of Spark-NLP
========
1.7.3
========
---------------
Overview
---------------
This hotfix release focuses on fixing word-embeddings cluster problems on some frameworks such as Databricks, while keeping the 1.7.x performance benefits. Various YARN-based clusters have been tested, Databricks cloud among them, to validate this hotfix.
Aside from that, multiple improvements have been committed towards better support of PySpark-NLP, fixing diverse technical issues in the API that help consistency in Annotators' super classes.
Finally, PIP installation has been made easier with a SparkNLP class that creates SparkSession automatically, for those who are learning Python Spark on their local computers.
Thanks to all the community for reporting issues.
---------------
Bugfixes
---------------
* Fixed 'RocksDB not serializable' when running LightPipeline scenarios or using _.functions implicits
* Fixed dependency with apache.commons.codec causing Apache Zeppelin 0.8.0 not to work in %pyspark
* Fixed Python pretrained() downloader not correctly setting Params and incorrectly creating new Model UIDs
* Fixed error 'JavaPackage not callable' when using AnnotatorModel.load() API without instantiating the class first
* Fixed Spark addFiles missing a local file, causing Word Embeddings not to work properly in some cluster-based frameworks
* Fixed broadcast NoSuchElementException `Failed to get broadcast_6_piece0 of broadcast_6` causing pretrained models not work in cluster frameworks (thanks @EnricoMi)
---------------
Developer API
---------------
* EmbeddingsHelper.setRef() has been removed. Reference is now set implicitly through EmbeddingsHelper.load(). Does not need to be loaded before deserializing models.
* Fixed and properly renamed chunk2doc and dock2chunk transformers, should now be working as expected
* Renamed setCompositeTokens to setCompositeTokensPatterns to help users remember that regexes are used in this Param
* Fixed PySpark automatic getter and setter Param generation when using pretrained() or load() models
* Simplified cluster path resolution for word embeddings
---------------
Other
---------------
* sparknlp.base now contains the SparkNLP() class, which automatically creates a SparkSession using appropriate jar settings. Helps newcomers get started in PySpark NLP.
========
1.7.2
========
---------------
Overview
---------------
Quick release with another hotfix, due to a newly found bug when deserializing word embeddings in a distributed fs. Also introduces changes in the application.conf reader in order
to allow run-time changes, and renames parts of the EmbeddingsHelper API.
---------------
Bugfixes
---------------
* Fixed embeddings deserialization from distributed filesystems (caused by the Windows path fix)
* Fixed application.conf not reading changes in runtime
* Added missing remote_locs argument in python pretrained() functions
* Fixed wrong build version introduced in 1.7.1 to detect proper pretrained models version
---------------
Developer API
---------------
* Renamed EmbeddingsHelper functions for more convenience
========
1.7.1
========
---------------
Overview
---------------
Thanks to our slack community (Bryan Wilkinson, @maziyarpanahi, @apiltamang), a few bugs in the 1.7.0 release were pointed out very quickly. This hotfix fixes an embeddings deserialization issue when cache_pretrained is located on a distributed filesystem.
Also, it fixes some path resolution in Windows OS. Thanks to Maziyar, .gitattributes has been added in order to identify proper languages in GitHub.
Finally, 1.7.1 adds Chunk2Doc, an annotator missing from 1.7.0, which converts CHUNK types into DOCUMENT types, for further retokenization or other annotations.
---------------
Enhancements
---------------
* Chunk2Doc annotator converts annotatorType from CHUNK to DOCUMENT
---------------
Bugfixes
---------------
* Fixed embedding-based annotators deserialization error when cache_pretrained is on distributed fs (Thanks Bryan Wilkinson for pointing out issue and testing fix)
* Fixed windows path reading when deserializing embeddings (Thanks @apiltamang)
---------------
Other
---------------
* .gitattributes added in order to properly discard jupyter as main language for GitHub repo (thanks @maziyarpanahi)
========
1.7.0
========
---------------
Overview
---------------
Having multiple annotators that use the same word embeddings set may result in huge pipelines and high driver memory and storage consumption.
From now on, embeddings may be shared and reused across annotators, making the process much more efficient.
Also, thanks to @apiltamang, we now better support path resolution for Windows implementations.
---------------
Enhancements
---------------
* Memory and storage savings: annotators with embeddings now expose the Params 'includeEmbeddings' and 'embeddingsRef', which set whether embeddings should be included when saved, or referenced by id from other annotators (as sketched below)
* EmbeddingsHelper class allows embeddings management
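A hedged sketch of the sharing mechanism, using NerDLApproach as an example annotator with embeddings; the setter names are assumed from the Param names above and may differ in this version:

    import com.johnsnowlabs.nlp.annotator._

    // Save this annotator without bundling its embeddings, and let other
    // annotators point to the same embeddings by reference id instead
    val ner = new NerDLApproach()
      .setInputCols("sentence", "token")
      .setOutputCol("ner")
      .setLabelColumn("label")
      .setIncludeEmbeddings(false)      // assumed setter for 'includeEmbeddings'
      .setEmbeddingsRef("my_glove_ref") // assumed setter for 'embeddingsRef'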
---------------
Bug fixes
---------------
* Thanks to @apiltamang for improving URI path support for Windows Servers
---------------
Developer API
---------------
* Embeddings interfaces and method names completely refactored, hopefully simplified and easier to understand
========
1.6.3
========
---------------
Overview
---------------
This release includes a new annotator for de-identification of sensitive information. It uses CHUNK annotations, meaning its accuracy will depend on previous annotators on the pipeline.
Also, OCR capabilities have been improved in the OCR module.
In terms of broken stuff, we've fixed a few annoying bugs on SymmetricDelete and SentenceDetector explode feature.
Finally, Spark NLP is now part of the official pip repositories, meaning you can install it just like any other module. It also includes jars, and we've added a SparkNLP class which creates a SparkSession easily for you.
Thanks again for all the community contributions in issues, feedback and comments, in GitHub and in Slack.
---------------
New features
---------------
* DeIdentification annotator, takes DOCUMENT and TOKEN from the original sentence, plus a CHUNK annotation to anonymize target chunk in sentence. CHUNK annotation might come from NerConverter, TextMatcher or other chunk annotators.
---------------
Enhancements
---------------
* Kernel zoom and region erosion improve overall detection quality. Fixed some stability bugs. Improved parallelism
---------------
Bug fixes
---------------
* SentenceDetector exploding sentences into rows now works properly
* Fixed Dictionary-based sentiment detector not working on pyspark
* Added missing NerConverter to annotator._ imports
* Fixed SymmetricDelete spell checker deleting tokens in some scenarios
* Fixed SymmetricDelete spell checker's unwanted lower-casing
---------------
Other
---------------
* PySpark package now part of official pip repos
* Pip installation now includes the corresponding spark-nlp jar; the base module includes the SparkNLP SparkSession creator
========
1.6.2
========
---------------
Overview
---------------
In this release, we focused on reviewing our streaming performance by measuring the number of sentences processed per second through a LightPipeline.
We increased Norvig Spell Checker throughput by more than 300% by disabling DoubleVariants and improving algorithm orders. It is now reported capable of 42K sentences per second.
Symmetric Delete Spell Checker is more performant, although it has been reported to process 2K sentences per second.
NerCRF has been reported to process 300 sentences per second, while NerDL can do twice as fast (about 700 sentences per second).
Vivekn Sentiment Analysis was improved and is now capable of processing 100K sentences per second (before it was below 500).
Finally, SentenceDetector performance was improved by 40%, from ~30K rows processed per second to ~40K. But we have now enabled abbreviation processing by default, which reduces the final speed to 22K rows per second, a net slowdown but with better accuracy.
Again, thanks to the community for helping with feedback. We welcome everyone asking questions or giving feedback in our Slack channel, or reporting issues on GitHub.
---------------
Enhancements
---------------
* OCR now features kernel segmentation. Significantly improves image based PDF processing
* Vivekn Sentiment Analysis prediction performance improved by better data structures
* Both Norvig and Symmetric Delete spell checkers now have improved performance
* SentenceDetector improved accuracy by better handling abbreviations. useAbbreviations is now also turned ON by default
* SentenceDetector improved performance significantly by improved preloading of rules
---------------
Bug fixes
---------------
* Fixed NerDL not training correctly (broken since 1.6.0). Pretrained models not affected
* Fixed NerConverter not properly considering multiple sentences per row (after using SentenceDetector), causing an unhandled exception to occur in some scenarios.
* Tensorflow sessions now all support allow_soft_placement, supporting GPU based graphs to work with and without GPU
* Norvig Spell Checker fixed a missing step from the algorithm to check for additional variants. May improve accuracy
* Norvig Spell Checker disabled DoubleVariants by default. Was not improving accuracy significantly and was hitting performance very hard
---------------
Developer API
---------------
* New FeatureSet allows HashSet params
---------------
Models
---------------
* Vivekn Sentiment Pipeline doesn't have Spell Checker anymore
* Fixed Vivekn Sentiment pretrained model, improving accuracy
========
1.6.1
========
---------------
Overview
---------------
Hi! We're glad to announce new hotfix 1.6.1. Although the changes seem modest or very specific, there is a lot going on under the hood. First of all, we've worked hard with the community to understand S3-based clusters,
which don't have a common fs.defaultFS configuration, the setting we use to tell where the cluster temp folder is located in order to distribute word embeddings. We fixed two things here:
on one side, we fixed a bug pointing to the wrong filesystem. Second, we added a custom override setting in application.conf that allows manually setting where to put temp folders in clusters. This should help S3 users.
Please share your feedback in this regard.
On the other hand, we created a new annotator type internally. The CHUNK type allows better modularity in the communication between different annotators. Its impact will be noticed implicitly and over time.
---------------
New features
---------------
* New Scala-only functions that make it easier to work with Annotations in Dataframes. May be imported through com.johnsnowlabs.nlp.functions._ and allow mapping and filtering within and outside Annotations.
filterByAnnotations, mapAnnotations and explodeAnnotations work by providing a column and a function (see the sketch below). Check out the documentation. Possibly coming to Python later.
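An illustrative Scala sketch of these functions; the argument lists are assumptions based on the description above (a column plus a function) and may differ slightly from the actual signatures:

    import com.johnsnowlabs.nlp.Annotation
    import com.johnsnowlabs.nlp.functions._

    // df is assumed to be a DataFrame produced by a Spark-NLP pipeline.
    // Keep rows whose "ner_chunk" column holds at least one annotation
    val filtered = df.filterByAnnotations("ner_chunk", _.nonEmpty)

    // Upper-case every token annotation into a new column
    val mapped = df.mapAnnotations("token", "token_upper",
      (annotations: Seq[Annotation]) =>
        annotations.map(a => a.copy(result = a.result.toUpperCase)))

    // One output row per sentence annotation
    val exploded = df.explodeAnnotations("sentence", "sentence_exploded")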
---------------
Bug fixes
---------------
* Fixed incorrect filesystem readings in some S3 environments for word embeddings
* Fixed NerCRF not correctly training from CONLL, labeling everything as -O- (Thanks @arnound from Slack Channel)
---------------
Enhancements
---------------
* Added overridable config sparknlp.settings.cluster_tmp_dir, which allows setting the cluster location for the temporary embeddings file (see the config sketch after this list). May help S3-based clusters with no fs.defaultFS set to a proper distributed storage.
* New annotator type: CHUNK. Represents a SUBSTRING of DOCUMENT and is used as output from NerConverter, TextMatcher, RegexMatcher and other annotators that retrieve a substring from the original document.
This will make for better modularity and integration within various annotators, such as between NER and AssertionStatus.
* New annotation transformer: ChunkAssembler. Takes a string or array(string) column from a dataset and creates a CHUNK type annotation. The content must also belong to the current DOCUMENT annotation's content.
* SentenceDetector new param explodeSentences allows exploding sentences within a single row into different rows, to increase parallelism and performance in some scenarios. Particularly OCR-based.
* AssertionDLApproach now may be used within LightPipelines
* AssertionDLApproach and AssertionLogRegApproach now work from CHUNK type instead of start/end bounds. May still be trained with Start/end though. This means target for assertion may be any CHUNK output annotator now (e.g. RegexMatcher)
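A hedged example of what the override might look like in application.conf; the s3a path is hypothetical:

    sparknlp {
      settings {
        # hypothetical temp location on distributed storage
        cluster_tmp_dir = "s3a://my-bucket/sparknlp/tmp"
      }
    }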
---------------
Other
---------------
* PerceptronApproachLegacy moved back to default PerceptronApproach. Distributed PerceptronApproach moved to PerceptronApproachDistributed due to not meeting accuracy expectations yet.
* Some configuration parameters in application.conf have been appropriately moved to proper annotator Params (NorvigSweeting Spell Checker, Vivekn Approach and Sentiment Detector affected)
* application.conf renamed configuration values for better consistency
---------------
Developer API
---------------
* Added beforeAnnotate() and afterAnnotate() to manipulate dataframes after or before calling annotate() UDF
* Added extraValidate() and extraValidateMsg() in all annotators to allow developers to add additional SCHEMA checks in the transformSchema() stage
* Removed validation() stage in fit() stage. Allows for more flexible training when some of the columns are not really required yet.
* WrapColumnMetadata() will wrap an Annotation column with its appropriate Metadata. Makes it easier not to forget about Metadata in Schema.
* RawAnnotator trait has now all the basics needed to start a new Annotator without annotate() function. It is a complete previous stage before AnnotatorModel, which inherits from RawAnnotator.
========
1.6.0
========
---------------
Overview
---------------
We're late! But it was worth it. We're glad to release 1.6.0, which brings new features, lots of enhancements and many bugfixes. First of all, we are thankful for the community's participation in Slack and GitHub, reporting feedback and issues.
In this one, we have a new annotator, the Chunker, which allows grabbing pieces of text following a particular Part-of-Speech pattern.
On the other hand, we have a brand new OCR to Spark Dataframe utility, which bundles as an optional component of Spark-NLP. This one requires Tesseract 4.x+ to be installed on your system, and may be downloaded from our website or readme pages.
Aside from that, we improved in many areas, from the DocumentAssembler working better with OCR output, down to our Deep Learning models having better consistency and accuracy. Word Embedding based annotators also receive improvements when working in cluster environments.
Finally, we are glad a user contributed a fix to the AWS dependency issue, particularly happening in Cloudera environments. We're still waiting for feedback, and gladly welcome it.
We'll be working on the documentation as this release follows. Thank you.
---------------
New Features
---------------
* New annotator: Chunker. This annotator takes regexes over Part-of-Speech tags and returns appropriate chunks of text following such patterns (see the sketch after this list)
* OCR to Spark-NLP: As an optional jar module, users may use OcrHelper class in order to convert PDF files into Spark Dataset, ready to be utilized by Spark-NLP's document assembler. May be used without Spark-NLP. Requires Tesseract 4.x on your system.
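A minimal Scala sketch of the new Chunker, assuming earlier stages already produced "sentence" and "pos" columns; the input columns and setter name follow the current API and may have differed at this release:

    import com.johnsnowlabs.nlp.annotator._

    // Grab noun phrases: optional determiner, any adjectives, one or more nouns
    val chunker = new Chunker()
      .setInputCols("sentence", "pos")
      .setOutputCol("chunk")
      .setRegexParsers(Array("<DT>?<JJ>*<NN>+"))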
---------------
Enhancements
---------------
* TextMatcher now has a caseSensitive (setCaseSensitive) Param, which allows matching with or without case sensitivity (ignored if a Normalizer already handled it). The returned word is still the original (see the sketch after this list).
* LightPipelines in Python should now be faster thanks to an optimization prefetching results into Python memory instead of going through the py4j bridge
* LightPipelines can now handle embedded Pipelines
* PerceptronApproach now trains utilizing a fully distributed Spark algorithm. Still experimental. PerceptronApproachLegacy may still be used, which might be better for local non-cluster setups.
* Tokenizer now has a param 'includeDefaults' which may be set to False to disable all preset rules (also sketched after this list)
* WordEmbedding based annotators may now decide to normalize tokens before matching embeddings vectors through the 'useNormalizedTokensForEmbeddings' Param. Generally improves consistency, less overfitting.
* DocumentAssembler may now better deal with large amounts of text by using 'trimAndClearNewLines', to work with OCR outputs and be ready for further Sentence Detection
* Improved SentenceDetector handling of enumerations and lists
* Slightly improved SentenceDetector performance through non-tail-recursive optimizations
* Finisher no longer has default delimiters when outputting a String (not Array) (thanks @S_L)
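A hedged sketch of the Tokenizer and TextMatcher settings mentioned above; setIncludeDefaults is assumed from the 'includeDefaults' Param name:

    import com.johnsnowlabs.nlp.annotator._

    // Rely only on custom rules by disabling the preset ones
    val tokenizer = new Tokenizer()
      .setInputCols("sentence")
      .setOutputCol("token")
      .setIncludeDefaults(false) // assumed setter for the 'includeDefaults' Param

    // Match entity phrases regardless of casing
    val textMatcher = new TextMatcher()
      .setInputCols("document", "token")
      .setOutputCol("entity")
      .setCaseSensitive(false)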
---------------
Bug fixes
---------------
* AWS library dependency conflict now resolved (Thanks to @apiltamang for proposing the solution, and thanks to the community for the follow-up). The solution is experimental, waiting for feedback.
* Fixed wrong order of further added Tokenizer's infixPatterns in Python (Thanks @sethah)
* Training annotators that use Word Embeddings in a distributed cluster does no longer throw file not found exceptions sporadically
* Fixed NerDLModel returning non-deterministic results during prediction
* Deep-Learning based models and graphs now allow running them on CPU if trained on GPU and GPU is not available on client
* WordEmbeddings temporary location no longer in HOME dir, moved to tmp.dir
* Fixed SentenceDetector incorrectly bounding sentences with non-English characters (Thanks @lorenz-nlp)
* Python Spark-NLP annotator models should now have all appropriate setter and getter functions for Params
* Fixed wrong-format of column when showing Metadata through Finisher's output as Array
* Added missing python Finisher's include metadata function (thanks @PinusSilvestris for reporting the bug)
* Fixed Symmetric Delete Spell Checker throwing wrong error when training with an empty dataset (Thanks @ankush)
---------------
Developer API
---------------
* Deep Learning models may now be read through SavedModelBundle API into Tensorflow for Java in TensorflowWrapper
* WordEmbeddings now allow checking if word exists with contains()
* Included a tool that converts text into CoNLL format for further labeling for training NER models
========
1.5.4
========
---------------
Overview
---------------
This release improves various annotators: the Normalizer, SymmetricDelete, TextMatcher, DocumentAssembler and Finisher
allowing them to cover more use-cases that were mentioned in our Slack channel. We also fixed two important bugs.
Finally, this will be our first release with PIP support for python sparknlp, for those entirely Python based.
---------------
Enhancements
---------------
* Normalizer now allows multiple to-delete regex patterns.
* Normalizer slangDictionary param allows converting tokens into something else (e.g. 'lol' into 'laughing out loud') from a dictionary file (see the sketch after this list)
* SymmetricDelete spell checker may now be trained from the dataset passed to fit if an external corpus is not provided
* SymmetricDelete spell checker improved training and prediction performance
* Finisher param includeMetadata now outputs annotation metadata content in both Array and String output formats
* DocumentAssembler may now read from Array[String] column if provided. This improves compatibility for some SparkML transformers
* TextMatcher now includes identifier name in metadata
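A minimal sketch of the new Normalizer options; the slang file path and delimiter are hypothetical, and the setter names follow the current API (setCleanupPatterns, setSlangDictionary), which may have differed in 1.5.4:

    import com.johnsnowlabs.nlp.annotator._

    val normalizer = new Normalizer()
      .setInputCols("token")
      .setOutputCol("normalized")
      .setCleanupPatterns(Array("[^A-Za-z]"))        // multiple to-delete regexes allowed
      .setSlangDictionary("/path/to/slang.txt", ",") // e.g. 'lol' -> 'laughing out loud'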
---------------
Bug fixes
---------------
* Fixed a bug introduced in 1.5.3 that made spark-nlp not to work in Python2 (thanks @surendralalwani)
* Fixed SymmetricDeleteApproach wrong annotator type
---------------
Other
---------------
* setup.py for PIP support (instructions will be added to readme and website). Still needs spark-nlp jar in SparkSession classpath.
========
1.5.3
========
---------------
Overview
---------------
This quick release is a hotfix for issues found in 1.5.2 after its release. Thanks to the users who quickly tested this out.
It fixes the Symmetric spell checker not being capable of reading the pretrained model and a SentenceDetector missing default value, and adds retroactive version matching to the downloader
---------------
Bug fixes
---------------
* Fixed a bug causing the library to fail when trying to save or read an annotator with an unset Feature without default
* Added missing default Param value to SentenceDetector. Thanks @superman24-7
* Symmetric spell checker now utilizes List instead of ListBuffer on its prediction layer
* Fixed Vivekn Sentiment Analysis failing when training with a sentiment column
---------------
Models
---------------
* Symmetric Spell Checker pretrained model now works well and may be downloaded
* Vivekn Sentiment pretrained model now defaults to "token" input column instead of "spell"
---------------
Other
---------------
* Downloader now works retroactively when a newer version finds a model of a previous release
* Renamed folder argument to remote_loc for downloader remote location, which caused confusion. Thanks @AtulSehgal
* Added new Scala example in example folder, also available on website
========
1.5.2
========
---------------
Overview
---------------
This release focuses on improving model downloader stability, fixing word embedding reading issues and properly joining
the Spark ecosystem's filesystem configuration, utilizing Spark's defined default filesystem, in order to work
properly with clusters and multi-node environments. This includes Databricks cloud clusters and Amazon EMR YARN HDFS nodes.
Aside from that, we come up with exciting new features: a brand new Spell Checker with higher accuracy, inspired by the
Symmetric Delete algorithm.
Finally, Assertion Status can be trained and predicted on top of NER output, since before
this only worked by providing assertion status Start and End boundaries for the target to assert.
---------------
New Features
---------------
* Assertion status annotators can now be trained and predict against NER output instead of start and end boundaries. Entities can now be directly asserted
* Brand new Symmetric Delete annotator (SymmetricDeleteApproach) with close to state-of-the-art accuracy (80%)
---------------
Enhancements
---------------
* Model downloader now uses the proper Spark filesystem. Works seamlessly with distributed storage, Databricks cloud clusters and Amazon EMR
* Fixed several race conditions when loading word embeddings from disk or downloading resources; the library is more stable
* Improved several assertion status validations and error messages
---------------
Bug fixes
---------------
* Stand-alone Annotator models are now properly read from disk in Python
---------------
Models
---------------
* New Symmetric Delete Spell checker pretrained model
* Vivekn Sentiment annotator may now be downloaded standalone with pretrained()
========
1.5.1
========
---------------
Overview
---------------
This release is an enhancement release to 1.5.0 which includes improved downloader properties and better annotator defaults.
Also, assertion status models have been included as pretrained; these are models trained on top of Stanford GloVe word embeddings
---------------
Enhancements
---------------
* SentenceDetector now has a useCustomOnly param, which enforces using only the custom bounds provided (thanks @atomobianco)
* Normalizer now defaults to not lowercasing words, which leads to better implicit accuracy in pipelines (thanks @marek.modry)
* SpellChecker now defaults to being case sensitive, which leads to better accuracy
* DateMatcher improved speed performance
* com.johnsnowlabs.annotator._ in Scala now also includes RecursivePipelines and LightPipelines for easier imports
* ModelDownloader has been improved with better directory management
---------------
Models
---------------
* New Assertion Status (LogisticRegression and DeepLearning) pretrained models now available
* Vivekn, Basic and Advanced pretrained Pipelines improved accuracy (thanks @marek.modry)
---------------
Other
---------------
* S3 library dependencies updated
========
1.5.0
========
---------------
Overview
---------------
We are proud to announce what may be the biggest release of Spark-NLP yet in terms of content!
This release makes the library miles easier to use for newcomers, with easier-to-import
annotators and the extended use of the model downloader throughout pretrained models and pipelines.
It also includes two new annotators that use deep learning algorithms with graphs from TensorFlow, which
is a first for us.
Apart from this, we include new Light Pipelines that are 10x faster when working with datasets smaller than about
50,000 rows.
Finally, we included several bugfixes across the library, from the algorithms to the developer API.
We'll gladly welcome any feedback! The website has been extensively updated.
---------------
New features
---------------
* Light Pipelines are Annotator Pipelines created from SparkML pipelines that run more than 10x faster in small datasets
* Deep Learning NER based on Bi-LSTM and Convolutional Neural Networks from word embeddings datasets
* Deep Learning Assertion Status model based on LSTM to compute status identification from word embeddings
* Easier to use Spark-NLP:
1. Imports have been made easy in scala API (com.johnsnowlabs.annotator._) to bring all annotators
2. BasicPipeline and AdvancedPipeline downloadable pipelines created for quick annotation of text
3. Light Pipelines are easy to use and accept simple strings to annotate a Spark ML Pipeline without spark datasets
* New Downloadable models: CRF NER, Lemmatizer, POS and Spell checker
* New Downloadable pipelines: Vivekn Sentiment analysis, BasicPipeline and AdvancedPipeline (see the sketch after this list)
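A hedged Scala sketch combining the downloader and a Light Pipeline; the model class names follow the current API and may have differed at this release, and pipelineModel stands for a previously fitted Spark ML PipelineModel:

    import com.johnsnowlabs.nlp.LightPipeline
    import com.johnsnowlabs.nlp.annotator._

    // Download standalone pretrained models from the public storage
    val pos = PerceptronModel.pretrained()       // POS tagger
    val spell = NorvigSweetingModel.pretrained() // spell checker

    // Wrap an already fitted PipelineModel for fast, string-based annotation
    val light = new LightPipeline(pipelineModel)
    val annotated = light.annotate("Light Pipelines accept plain strings.")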
---------------
Enhancements
---------------
* Model downloader significantly improved in terms of usability
---------------
Documentation
---------------
* Website widely improved
* Added invite to our first slack chat channel
---------------
Bugfixes
---------------
* Fixed wrong positional index value when creating Annotations from the constructor
* Fixed hamming distance calculation in spell checker
* Fixed Downloadable NER model failing sporadically due to missing temporary files
* Fixed SearchTrie algorithm used in TextMatcher (formerly EntityExtractor); thanks @avenka11 for reporting and proposing a solution
* Fixed some model deserialization issues happening on Windows
---------------
Other
---------------
* Thanks to @showy we have TravisCI automatic integration testing
* Finisher now outputs to array by default
* Training example resources removed in favor of using the model downloader
========
1.4.2
========
---------------
Bugfixes
---------------
* Filesystem protocols now properly read across the library, fixed use case for S3:// protocol (thanks @avenka11)
* Library now works properly in Windows environments
* PySpark annotator param getters now work properly when retrieving default values
* Fixed stemmer serialization due to misspelled param name
* Fixed Tokenizer infixPattern param name to infixPatterns, leading to broken pyspark serialization of such param
* Added missing addInfixPattern() function to PySpark, to allow adding patterns to current value
* Model Downloader clearCache now properly removes both .zip files and extracted content
* Model Downloader is now capable of reading all types of models properly
* Added missing clearCache function into PySpark
---------------
Developer API
---------------
* Function names in the model downloader code have been refactored consistently
---------------
Other
---------------
* RocksDB rolled back to previous version to support Windows
* NerCRF unittest modified to reduce time to test
* Removed training scripts from repository
* Updated build Spark and Scala versions
========
1.4.1
========
---------------
New features
---------------
* Model and Pipeline Downloader
We are glad to announce our first experimental model downloader, working in both Python and Scala.
This allows downloading pre-trained models from our public storage. It does not include any pre-trained models yet,
just the logic to be able to do it.
---------------
Enhancements
---------------
* Improved ExternalResource API (introduced in 1.4.0) to make it easier to provide external corpus and resource information
on annotators, such as readAs (which allows setting how you would like SparkNLP to read your source), delimiters and parse settings, among
other options that might be passed to the Spark Reader directly. Annotators using external sources now all share this functionality.
WordEmbeddings are not yet supported in this format.
* All python annotators now properly have getter functions to retrieve param values
--------------
Bugfixes
--------------
* Fixed some annotators in Python not being deserializable on their own outside a Pipeline
* Fixed CRF NER not working when not using word embeddings (thanks @crisliu for reporting)
* Fixed Tokenizer not properly recognizing some stop words (thanks @easimadi)
* Fixed Tokenizer not properly recognizing composite tokens when changing target pattern param (thanks @easimadi)
* ReadAs parameter now properly read from string in all ExternalResource setters
---------------
Developer API
---------------
* PySpark API further improvements within AnnotatorApproach, AnnotatorModel and now private internal _AnnotatorModel for fit() result representation
* Automated getters have been written so that getter functions don't have to be written manually in every annotator
-----------
Other
-----------
* RocksDB dependency rolled back to 5.2.1 for better universal compatibility particularly to support databricks platform
---------------
Documentation
---------------
* Updated website components page to match 1.4.x
* Replaced the notebooks site with a placeholder linking to the current Python notebooks, for lower maintenance
========
1.4.0
========
---------------
New features
---------------
* All annotator external sources have been unified through an ExternalResource component.
This is used to represent external data information; it deals with content in HDFS or locally, just as Spark deals with data.
It also improves performance globally and allows customization
of how these sources are read (e.g. as RDD or line-by-line sequences)
* NorvigSweeting SpellChecker, ViveknSentiment and POS Perceptron can now train from the dataset passed to fit().
For Spell Checker, this will be applied if the user did not supply a corpus, forcing fit() to learn from words in the data column.
For ViveknSentiment and POS Perceptron, this strategy will be applied if sentimentCol and posCol params have been set respectively.
---------------
Enhancements
---------------
* ResourceHelper now has an improved SourceStream class which allows for more consistent HDFS/Filesystem reading by using
more of the Hadoop APIs.
* application.conf is now a global setting and can be overridden at run-time through ConfigLoader.setConfigPath() (see the sketch after this list). It may also be accessed from PySpark
* PySpark API improved by creating AnnotatorApproach and AnnotatorModel classes
* EntityMatcher now uses recursive Pipelines
* Part-of-Speech tagging performance has been improved throughout the prediction algorithm
* EntityMatcher may now use RecursivePipeline in order to tokenize external data with the same pipeline provided by the user
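A short sketch of the run-time override described above; the path is hypothetical and the ConfigLoader package is assumed to be com.johnsnowlabs.util:

    import com.johnsnowlabs.util.ConfigLoader

    // Point the library at a custom application.conf before running annotators
    ConfigLoader.setConfigPath("/path/to/application.conf")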
---------------
Developer API
---------------
* PySpark API has been severely improved to make it easier to extend JVM classes
* PySpark API improved for extending annotator approaches and models appropriately
----------------
Bugfixes
----------------
* Reverted a bug introduced causing NER not to read datasets properly from HDFS
* Fixed EntityMatcher wrongly normalizing external content (thanks @sofianeh)
----------------
Documentation
----------------
* Fixed EntityMatcher documentation obsolete params (Thanks @sofianeh)
* Fixed NER CRF documentation in website
========
1.3.0
========
IMPORTANT: Pipelines from 1.2.6 or older cannot be loaded from 1.3.0
---------------
New features
---------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/94
Tokenizer annotator has been revamped. It now follows standard NLP rules, matching above 90% of StanfordNLP tokens.
This annotator now has more complex rules, allowing setting custom composite words as exceptions (e.g. to not break New York)
and custom Prefix, Infix, Suffix and Breaking rules (see the sketch after this list). It uses regular expression groups in order to match various tokens per target word.
Defaults have been updated to also be language agnostic and support foreign characters from the Unicode charset
* https://github.com/JohnSnowLabs/spark-nlp/pull/93
Assertion Status. This annotator identifies negated sequences within a target scope. Assertion status is a machine learning
annotator and works with a set of Word Embeddings, a sample of which is provided as part of our Python notebook examples.
* https://github.com/JohnSnowLabs/spark-nlp/pull/90
Recursive Pipelines. We have created our own Pipeline class, which takes more advantage of Spark-NLP annotators.
Although this Pipeline is completely optional and works well with default Apache Spark estimators and transformers, it allows
training our annotators more efficiently by letting annotator approaches access the previous state of the Pipeline,
allowing them to use it to tokenize or transform their own external content. It is recommended to use such Pipelines.
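A hedged sketch of the revamped Tokenizer; the composite-token setter is the one from this era (later renamed to setCompositeTokensPatterns, per the 1.7.3 notes above) and the patterns are illustrative:

    import com.johnsnowlabs.nlp.annotator._

    val tokenizer = new Tokenizer()
      .setInputCols("sentence")
      .setOutputCol("token")
      .setCompositeTokens(Array("New York")) // keep compounds unbroken
      .setPrefixPattern("\\A([^\\s\\w]*)")   // illustrative custom prefix regex group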
----------------
Enhancements
----------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/83
Part of Speech training has been improved in both performance and quality, and now makes better use of the input corpus provided.
New params have been added for more control over its training, through corpusFormat and corpusLimit, allowing
choosing whether to read training data as a Dataset or raw text files, and limiting the number of files if a folder is provided
* https://github.com/JohnSnowLabs/spark-nlp/pull/84
Thanks to @lambdaofgod, Normalizer can now optionally lowercase tokens
* Thanks to Lorenz Bernauer, Normalizer default pattern now becomes language agnostic by not breaking unicode characters such as Spanish or German letters
* Features now have appropriate default values which are lazy by nature and executed only once upon request. As a side effect, this improves Lemmatizer performance.
* RuleFactory (a regex rule factory) performance has been improved by using a Factory pattern and not re-checking its strategy on every transformation at run-time.
This might have positive side effects in SentenceDetector, DateMatcher and RegexMatcher, which extensively use this class.
----------------
Class Renames
----------------
RegexTokenizer -> Tokenizer (it is not just regex anymore)
SentenceDetectorModel -> SentenceDetector (it is not a model, it is a rule-based algorithm)
SentimentDetectorModel -> SentimentDetector (it is not a model, it is a rule-based algorithm)
----------------
User Utilities
----------------
* ResourceHelper has a function createDatasetFromText which allows the user to more
easily read one or multiple text files from a path into a dataset, with various options,
including filename by row or by file aggregation. This class should be more widely
used since it helps with parsing local files. It shall be better documented.
* com.johnsnowlabs.util now contains a Benchmark class which allows measuring the time of
any function easily, by using it as Benchmark.time("Description of measured") {someFunction()}
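For instance, a quick sketch following that usage (the pipeline and training data are hypothetical placeholders):

    import com.johnsnowlabs.util.Benchmark

    // Prints the description together with the elapsed time of the block
    Benchmark.time("Fitting the pipeline") {
      pipeline.fit(trainingData) // hypothetical pipeline and training data
    }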
----------------
Developer API
----------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/89/files
Word embedding traits have been generalized. Now any annotator who might want to use them can easily access their properties
* Recursive pipelines now allow injecting PipelineModel object into train() stage. It is an optional parameter. If the user
utilizes RecursivePipeline, the annotator might use this pipeline for transforming secondary data inputs.
* Annotator abstract class has been divided into a previous RawAnnotator class which contains all annotator properties
and validations, but does not make use of the annotate() function. This allows annotators that need to work directly with
the transform() call to still participate alongside other annotators in the pipeline
----------------
Bugfixes
----------------
* Fixed a bug in annotators with word embeddings not correctly serializing into disk
* Fixed a bug creating temporary folders in home folder
* Fixed a broken geospatial pattern in sentence detection
========
1.2.6
========
---------------
Enhancements
---------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/82
Vivekn Sentiment Analysis improved memory consumption and training performance.
Parameter pruneCorpus is an adjustable value now; it defaults to 1. Higher values lead to better performance,
but are meant for larger corpora. tokenPattern params allow different tokenization regexes
within the corpora provided to the Vivekn and Norvig models.
* https://github.com/JohnSnowLabs/spark-nlp/pull/81
Serialization improvements. The new default format (parquet lasted little) is RDD objects, which proved to be lighter on
heap memory. Also added lazier default values for Feature containers. New application.conf performance tuning
settings allow customizing whether Features are broadcast or not, and whether to use parquet or objects in serialization.
========
1.2.5
========
IMPORTANT: Pipelines from 1.2.4 or older cannot be loaded from 1.2.5
---------------
New features
---------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/70
Word embeddings parameter for CRF NER annotator
* https://github.com/JohnSnowLabs/spark-nlp/pull/78
Annotator features replace params and are now serialized using KRYO and partitioned files, increasing performance and lowering
memory consumption in the Driver when saving and loading pipelines with large corpora. Such features are now also broadcast
for better performance in distributed environments. This enhancement is a breaking change and does not allow loading older pipelines
----------------
Bug fixes
----------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/cb9aa4366f3e2c9863482df39e07b7bacff13049
Stemmer was not capable of being deserialized (Implements DefaultParamsReadable)
* https://github.com/JohnSnowLabs/spark-nlp/pull/75
Sentence Boundary detector was not properly setting bounds
----------------
Documentation (thanks @maziyarpanahi)
----------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/79
Typo in code
* https://github.com/JohnSnowLabs/spark-nlp/pull/74
Bad description
========
1.2.4
========
---------------
New features
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/c17ddac7a5a9e775cddc18d672e80e60f0040e38
ResourceHelper now allows input files to be read in the shape of a Spark Dataset, implicitly enabling HDFS paths and allowing larger annotator input files. Set 'TXTDS' as the input format Param to let annotators read this way. Allowed in: Lemmatizer, EntityExtractor, RegexMatcher, Sentiment Analysis models, Spell Checker and Dependency Parser.
---------------
Enhancements and progress
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/4920e5ce394b25937969cc4cab1d81172be722a3
CRF NER Benchmarking progress
* https://github.com/JohnSnowLabs/spark-nlp/pull/64
EntityExtractor refactored. This annotator uses an input file containing a list of entities to look for inside the target text. It has been refactored to be easier to use and notably faster, by using a Trie search algorithm. Proper examples included in the Python notebooks.
---------------
Bug fixes
---------------
* Issue https://github.com/JohnSnowLabs/spark-nlp/issues/41 <> https://github.com/JohnSnowLabs/spark-nlp/commit/d3b9086e834233f3281621d7c82e32195479fc82
Fixed default resources not being loaded properly when using the library through --spark-packages. Improved input reading from resources and folder resources, and falling back to disk, with better error handling.
* https://github.com/JohnSnowLabs/spark-nlp/commit/08405858c6186e6c3e8b668233e30df12fa50374
Corrected param names in DocumentAssembler
* Issue https://github.com/JohnSnowLabs/spark-nlp/issues/58 <> https://github.com/JohnSnowLabs/spark-nlp/commit/5a533952cdacf67970c5a8042340c8a4c9416b13
Deleted a left-over deprecated function which was misleading.
* https://github.com/JohnSnowLabs/spark-nlp/commit/c02591bd683db3f615150d7b1d121ffe5d9e4535
Added a filtering to ensure no empty sentences arrive to unnormalized Vivekn Sentiment Analysis
---------------
Documentation and examples
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/b81e95ce37ed3c4bd7b05e9f9c7b63b31d57e660
Added additional resources into FAQ page.
* https://github.com/JohnSnowLabs/spark-nlp/commit/0c3f43c0d3e210f3940f7266fe84426900a6294e
Added Spark Summit example notebook with full Pipeline use case
* Issue https://github.com/JohnSnowLabs/spark-nlp/issues/53 <> https://github.com/JohnSnowLabs/spark-nlp/commit/20efe4a3a5ffbceedac7bf775466b7a8cde5044f
Fixed scala python documentation mistakes
* https://github.com/JohnSnowLabs/spark-nlp/commit/782eb8dce171b69a615887b3defaf8b729b735f2
Typos fix
---------------
Other
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/91d8acb1f0f4840dad86db3319d0b062bd63b8c6
Removed Regex NER due to slowness and little use. CRF NER to replace NER.
========
1.2.3
========
---------------
Bugfixes
---------------
* Sentence detection not properly bounding punctuation marks
* Sentence detection abbreviations feature disabled by default, since it is a slow feature and not necessarily beneficial
* Sentence detection punctuation bounds: improved algorithm performance by removing redundant formatting
* Fixed a bug causing sentiment analysis not to work when normalization deleted tokens from the sentence
* Fixed Resource Helper text reading missing the first line from the text stream. More improvements to come later.
---------------
Other
---------------
CRF NER progress, word embeddings support. Not yet officially released.
========
1.2.2
========
---------------
New features
---------------
* Finisher parameter to export output as Array
========
1.2.1
========
---------------
Bugfixes
---------------
* Finisher not properly deleting annotation columns even when set to true explicitly
========
1.2.0
========
--------------
New features
--------------
* New annotator: CRF NER
* New transformer: Token Assembler
* Added custom bounds parameter in sentence detector
* Added custom pattern parameter in Normalizer
* SpellChecker able to train from text files as dataset
* Typesafe configuration may now be read from disk at runtime
-------------------------
Performance improvements
-------------------------
* Enabled and suggested KRYO Serializer for better pipeline write-read
--------------------
Code improvements
--------------------
* Pack/Unpack in annotators leads to better code reuse
* ResourceHelper has reduced IO responsibility
* Annotations centralized main result and Finisher improvements
---------------
Bugfixes
---------------
* RegexMatcher fixed to read from input files properly
* DocumentAssembler fixed relative positioning
------------------
Release framework
------------------
* Better package release readiness for different repositories
* Organization package name change due to central's standards