Subject Matter Expert eComms

  • 10+ years in data engineering or data processing, with at least 5 years focused on eComms data in financial services
  • Deep expertise in text processing and NLP preprocessing: tokenisation, normalisation, encoding handling, language detection, and noise reduction
  • Proven experience building production-grade data pipelines for communication surveillance, e-discovery, or regulatory compliance
  • Hands-on experience with at least two archiving or surveillance platforms (Global Relay, Smarsh, NICE Actimize, Behavox, Relativity)
  • Strong understanding of electronic communication formats: EML, MSG, Bloomberg FLP, Teams/Slack JSON exports, voice transcription formats
  • Experience with data de-duplication at scale: near-duplicate detection techniques (MinHash, SimHash, Jaccard similarity) and attachment hashing
  • Understanding of ASIC INFO 283 data completeness requirements and multi-jurisdiction retention obligations
  • Bachelor’s or Master’s degree in Computer Science, Data Science, Computational Linguistics, or Information Science
  • Experience in an Australian Tier-1 bank environment with ASIC, AUSTRAC, and APRA oversight
  • Knowledge of WORM-compliant archiving standards and chain-of-custody requirements
  • Experience with multi-lingual text processing for APAC languages (Mandarin, Cantonese, Japanese, Malay)
  • Familiarity with transformer model data preparation: sub-word tokenisation, attention masking, context windowing
  • Experience with real-time streaming pipelines (Kafka Streams, Apache Flink)
  • Data processing: Apache Kafka, Kafka Streams, Apache Flink, Apache Spark, Azure Data Factory
  • Text processing: spaCy, NLTK, regex, Beautiful Soup, Apache Tika, textract
  • Languages: Python 3.10+, Java, SQL, Bash scripting
  • De-duplication: MinHash, SimHash, Locality-Sensitive Hashing (LSH), ssdeep, TLSH
  • Databases: PostgreSQL, Elasticsearch, MongoDB, Azure Cosmos DB
  • Infrastructure: Docker, Kubernetes, Azure cloud services, Terraform
  • Monitoring: Grafana, Prometheus, ELK Stack, Great Expectations (data quality)
  • Communication platforms: Bloomberg Vault, Global Relay Archive, Smarsh Enterprise Archive, Microsoft Purview
  • Define and govern the end-to-end eComms data processing pipeline architecture: from raw ingestion through sanitisation, normalisation, de-duplication, and enrichment to ML-ready output
  • Design data sanitisation processes: HTML/RTF stripping, embedded image noise removal, email header cleaning, and thread delineation for reply chains
  • Build disclaimer detection and removal systems using pattern matching and ML classifiers, covering legal footers, confidentiality notices, and regulatory boilerplate
  • Develop signature block detection and extraction using structural analysis
  • Design whitelist management frameworks: approved counterparties, internal distribution lists, automated system message exclusions, with periodic review cycles and jurisdiction-specific separation
  • Implement cross-channel message de-duplication: forwarded message detection, near-duplicate fuzzy matching (MinHash/SimHash), attachment fingerprinting, and conversation threading
  • Build entity resolution pipelines: trader identity mapping (aliases, nicknames), counterparty normalisation, and channel-type classification
  • Design metadata enrichment workflows: desk assignment, book mapping, counterparty risk tier, jurisdiction tagging, and timestamp alignment to UTC
  • Define and measure data quality KPIs: completeness rate, de-duplication accuracy, noise removal precision, signal loss rate (target < 2%), and pipeline throughput/latency SLAs
  • Advise the ML team on data preparation requirements for transformer models: tokenisation strategies, sequence formatting, label engineering, and data augmentation
  • Conduct regular pipeline quality audits and produce data quality scorecards for compliance review
  • Document pipeline specifications, data dictionaries, and operational runbooks for regulatory examination readiness