Subject Matter Expert, eComms

10+ years in data engineering or data processing, with at least 5 years focused on electronic communications (eComms) data in financial services
Deep expertise in text processing and NLP preprocessing: tokenisation, normalisation, encoding handling, language detection, and noise reduction
Proven experience building production-grade data pipelines for communication surveillance, e-discovery, or regulatory compliance
Hands-on experience with at least two archiving or surveillance platforms (e.g. Global Relay, Smarsh, NICE Actimize, Behavox, Relativity)
Strong understanding of electronic communication formats: EML, MSG, Bloomberg FLP, Teams/Slack JSON exports, voice transcription formats
Experience with data de-duplication at scale: near-duplicate detection (MinHash, SimHash, Jaccard similarity) and attachment hashing
Understanding of ASIC INFO 283 data completeness requirements and multi-jurisdiction retention obligations
Bachelor’s or Master’s degree in Computer Science, Data Science, Computational Linguistics, or Information Science
Experience in an Australian Tier-1 bank environment with ASIC, AUSTRAC, and APRA oversight
Knowledge of WORM-compliant archiving standards and chain-of-custody requirements
Experience with multi-lingual text processing for APAC languages (Mandarin, Cantonese, Japanese, Malay)
Familiarity with transformer model data preparation: sub-word tokenisation, attention masking, context windowing
Experience with real-time streaming pipelines (Kafka Streams, Apache Flink)
Data processing: Apache Kafka, Kafka Streams, Apache Flink, Apache Spark, Azure Data Factory
Text processing: spaCy, NLTK, regex, Beautiful Soup, Apache Tika, textract
Languages: Python 3.10+, Java, SQL, Bash scripting
De-duplication: MinHash, SimHash, Locality-Sensitive Hashing (LSH), ssdeep, TLSH
Databases: PostgreSQL, Elasticsearch, MongoDB, Azure Cosmos DB
Infrastructure: Docker, Kubernetes, Azure cloud services, Terraform
Monitoring: Grafana, Prometheus, ELK Stack, Great Expectations (data quality)
Communication platforms: Bloomberg Vault, Global Relay Archive, Smarsh Enterprise Archive, Microsoft Purview
Define and govern the end-to-end eComms data processing pipeline architecture: from raw ingestion through sanitisation, normalisation, de-duplication, and enrichment to ML-ready output
Design data sanitisation processes: HTML/RTF stripping, embedded image noise removal, email header cleaning, and thread delineation for reply chains
Build disclaimer detection and removal systems using pattern matching and ML classifiers, covering legal footers, confidentiality notices, and regulatory boilerplate
Develop signature block detection and extraction using structural analysis
Design whitelist management frameworks: approved counterparties, internal distribution lists, automated system message exclusions, with periodic review cycles and jurisdiction-specific separation
Implement cross-channel message de-duplication: forwarded message detection, near-duplicate fuzzy matching (MinHash/SimHash), attachment fingerprinting, and conversation threading
Build entity resolution pipelines: trader identity mapping (aliases, nicknames), counterparty normalisation, and channel-type classification
Design metadata enrichment workflows: desk assignment, book mapping, counterparty risk tier, jurisdiction tagging, and timestamp alignment to UTC
Define and measure data quality KPIs: completeness rate, dedup accuracy, noise removal precision, signal loss rate (target < 2%), and pipeline throughput/latency SLAs
Advise the ML team on data preparation requirements for transformer models: tokenisation strategies, sequence formatting, label engineering, and data augmentation
Conduct regular pipeline quality audits and produce data quality scorecards for compliance review
Document pipeline specifications, data dictionaries, and operational runbooks for regulatory examination readiness