Ask HN: How are you cleaning and transforming data before imports/uploads?

34 points by dataflowmapper 15 hours ago

Hi all,

I’m curious how folks handle the prep work for data imports/uploads into systems like Salesforce, Workday, NetSuite, or really any app that uses template-based imports for data loading, migration, or implementation.

Specifically:

- How do you manage conversions/transformations like formatting dates, getting everything aligned with the templates, mapping old codes to new ones, etc.?

- Are you primarily using Excel, custom scripts, Power Query or something else?

- What are the most tedious/painful parts of this process and what have you found that works?

Really appreciate any insights and am curious to learn from everyone's experience.

PaulHoule 15 hours ago

"Scripts" in Python, Java and other conventional programming languages (e.g. whatever it is you already use)

Not Bash, not Excel, not any special-purpose tool, because the motto of those is "you can't get there from here". Maybe you can get 80% of the way there, which is really seductive, but that last 20% is like going to the moon. Specifically, real programming languages have the tools to format dates correctly with a few lines of code you can wrap into a function; fake programming languages don't. Mapping codes is straightforward, etc.
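
A minimal sketch of the kind of date-normalizing wrapper described above (the accepted input formats and the target format are assumptions):

  from datetime import datetime

  # Hypothetical helper: try a few known source formats, emit the target
  # system's format, and fail loudly on anything unrecognized.
  def normalize_date(value, formats=("%m/%d/%Y", "%Y-%m-%d", "%d-%b-%y")):
      for fmt in formats:
          try:
              return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
          except ValueError:
              continue
      raise ValueError(f"Unrecognized date: {value!r}")

  normalize_date("03/07/2024")  # -> '2024-03-07'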

  • dataflowmapper 14 hours ago

    Yeah, programming definitely offers the most flexibility if you have that skillset. I'm particularly interested in your 'last 20% is like going to the moon' analogy for special-purpose tools, or even Excel/Bash. Do you have any examples off the top of your head of the kinds of transformation or validation challenges that fall into that really difficult 20%, where only a 'real programming language' can effectively get the job done?

    • PaulHoule 12 hours ago

      For one thing, a lot of tools like Excel do unwanted data transformations: when importing from a CSV they try to guess whether text is meant to be a string or a number, and sometimes guess wrong. You can spend a lot of time disabling this behavior, preventing it, or fixing it up afterwards, but frequently the bad data just plain corrupts the analysis or the target system.
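
      A small sketch of the usual workaround in pandas, assuming you want everything read as text and converted deliberately afterwards (the file and column names are made up):

        import pandas as pd

        # Read every column as a string so nothing is silently coerced
        # (no stripped leading zeros, no scientific notation, no date guessing).
        df = pd.read_csv("export.csv", dtype=str, keep_default_na=False)

        # Convert only the columns that are actually meant to be numeric.
        df["amount"] = pd.to_numeric(df["amount"])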

      You can't really trust the time-handling functions on any platform, which is part of the reason languages like Python and Java have two or three libraries for working with dates and times in the standard library: people realized the old one was unfixable. Plenty of JavaScript date-handling libraries have told people "look, this is obsolete, it's time to move on", not just because that's the JavaScript way, but because the libraries really were error-prone.

      In a real programming language it's straightforward to fix those problems; in a fake programming language it is difficult or impossible.

      If you've got a strange index structure in the source or destination data, many tools fall down. For instance, if you want to convert nested set trees

      https://blog.uniauth.com/nested-set-model

      to something more normal like an adjacency list (or vice versa), a lot of simple tools are going to choke.
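
      A rough sketch of that nested-set-to-adjacency-list conversion in Python, assuming rows of (id, lft, rgt) as in the linked article:

        # Parent of each node = nearest enclosing (lft, rgt) interval.
        def nested_set_to_adjacency(rows):
            parents, stack = {}, []
            for node_id, lft, rgt in sorted(rows, key=lambda r: r[1]):
                while stack and stack[-1][2] < lft:  # pop subtrees already closed
                    stack.pop()
                parents[node_id] = stack[-1][0] if stack else None
                stack.append((node_id, lft, rgt))
            return parents  # {child_id: parent_id or None}

        # Example: root(1,10) contains (2,5) and (6,9); (2,5) contains (3,4), etc.
        nested_set_to_adjacency([(1, 1, 10), (2, 2, 5), (3, 3, 4), (4, 6, 9), (5, 7, 8)])
        # -> {1: None, 2: 1, 3: 2, 4: 1, 5: 4}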

      • dataflowmapper 4 hours ago

        Gotcha, totally agree on those points. I think everyone's dealt with the Excel typing crap. My team uses Workato for some integrations, and we use scripts any time we need math because of questionable precision, so I see your point about unreliable functions.

  • chaos_emergent 15 hours ago

    for the longest time I envisioned some sort of configuration specification that could retrieve URLs, transform and map data, handle complex conditional flows...and then I realized that I wanted a Normal Programming Language for Commerce and started asking o3 to write me Python scripts.

    • aaronbrethorst 14 hours ago

      Hell, for me, would be what you described, implemented in YAML.

  • dlachausse 13 hours ago

    It’s not sexy, but Perl is the purpose-built tool for this job.

    • PaulHoule 12 hours ago

      Got a lot of bustedness though. I was writing cgi-bin scripts in Perl, and I remember Perl's urldecode being fundamentally busted: it would rewrite '%20' -> SP but not '+' -> SP, so I had to write my own urldecode.

      My impression is that people use Python to do most of the things that we used to do with Perl.

      Circa 2001 I wrote a cookie-based authentication system for the web which had an "authentication module" of roughly 100 lines that I wound up rewriting in at least ten different languages such as PHP, Java, Cold Fusion, Tango, etc. The OO Perl version was the shortest and my favorite.

schmookeeg 14 hours ago

We're an AWS shop, so for lightweight or one-off stuff, it's a typescript lambda. Everything else ends up in a python script to output glue-friendly stuff to S3.

Assume at some point, the data will bork up.

If you ingest Excel (ugh), treat it like free-range data. I have a typescript lambda that just shreds spreadsheets in an "ok, scan for this string, then assume the thing to the right of it is the value we want" style -- it's goofy AF but it's one of my favorite tools in the toolbox, since I look magical when I use it. It allows me to express-pass janky spreadsheets into Athena in minutes, not days.

It is based on the convert-excel-to-json library, and once you grok how it wants to work (excel -> giant freaky JSON object with keys that correspond to columns, so object.A, object.B, object.C, etc., and an array index for the row number), you can use it as a real blunt-force chainsaw approach to unstructured data LARPing as an excel doc :D
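
A rough Python analogue of that scan-for-a-label trick, sketched with openpyxl rather than convert-excel-to-json (the file name and label are made up):

  from openpyxl import load_workbook

  # Find a known label anywhere in the sheet and take the cell to its right --
  # blunt, but it survives spreadsheets with no consistent layout.
  def value_right_of(path, label):
      ws = load_workbook(path, data_only=True).active
      for row in ws.iter_rows():
          for cell in row:
              if cell.value == label:
                  return cell.offset(column=1).value
      return None

  total = value_right_of("janky_report.xlsx", "Grand Total")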

chaos_emergent 15 hours ago

as I sit in front of my computer waiting for a transformation-for-import job to complete, I can describe my basic workflow:

1. define a clean interface target - for me, that's an interface that I made for my startup to import call data.

2. explore the data a little to get a sense of transformation mappings.

3. create a PII-redacted version of the file, upload it to ChatGPT along with the shape of my interface, and ask it to write a transformation script in Python (a redaction sketch follows this list)

4. run it on a subset of the data locally to verify that it works.

5. run it in production against my customer's account.
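
A minimal sketch of the redaction in step 3, assuming pandas and a known list of PII columns (all names here are made up):

  import hashlib
  import pandas as pd

  PII_COLUMNS = ["caller_name", "phone", "email"]  # hypothetical

  df = pd.read_csv("calls.csv", dtype=str, keep_default_na=False)

  # Hash PII so the sample is safe to share, while joins on those fields still work.
  for col in PII_COLUMNS:
      df[col] = df[col].map(lambda v: hashlib.sha256(v.encode()).hexdigest()[:12])

  df.head(50).to_csv("calls_redacted_sample.csv", index=False)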

I'm curious - that seems like a reasonably standard flow, and it involves a bit of manual work, but it seems like the right tradeoff between toil and automation. Do you struggle with that workflow or think it could be better somehow?

  • dataflowmapper 14 hours ago

    Thanks for sharing that workflow; for more straightforward flows, that sounds like a decent approach. My main thoughts on where it could be improved, or where I see potential struggles, are when:

      - People aren't comfortable or familiar with coding/Python.
      - You get into more complex imports like historical data, transactional data, etc. There you might have like 15 transaction types that have to be mapped, all with different fields, math, and conditional logic where the requirements become too much for just prompting ChatGPT effectively, and iterating on the Python can get pretty involved.
      - The source data structure and transformation needs aren't consistently the same, leading to a lot of 'throwaway' or heavily modified scripts for each unique case.
      - Tasks like VLOOKUPs or enriching data come into play, which might add manual coding or complexity beyond a simple 1-to-1 source-to-destination script.
    
    These are the areas where I'm exploring if a more structured way could offer benefits in terms of repeatability and accessibility for a wider range of users or complex scenarios. Appreciate the insight into your process and your thoughts on this.

    • sargstuff 10 hours ago

      From doing historical data restorations (pre-JSON): using documented industry/standard structures & protocols and normalized data formats/layouts helps out quite a bit.

      Note: awk/shell scripts, circa the '80s, because they were relatively portable across platforms (Mac, PC, DEC, IBM (ignoring EBCDIC), etc.).

crypto_is_king 8 hours ago

This has been an interest of mine for quite a while.

Right now the most relevant mapping process I have to do is taking Amazon product data and transforming it into a Shopify upload CSV.

The largest of these is around 20k rows, so nothing crazy.

There are apps that do this via APIs, but at a large enough scale, they are not reliable.

The process takes around 10 hours, with the help of a VA who does the lower level cleaning stuff.

I made a bare bones spreadsheet with under 10 columns, which makes it much easier to work with the data.

Once I'm satisfied, I run it through a python script, which turns it into a Shopify-conforming CSV.
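
The core of that kind of script is usually a column mapping plus a few per-field fix-ups; a hedged sketch with pandas (the source column names are invented, the Shopify headers are assumed from its product CSV template):

  import pandas as pd

  src = pd.read_csv("amazon_cleaned.csv", dtype=str)  # the bare-bones sheet

  shopify = pd.DataFrame({
      "Handle": src["asin"].str.lower(),
      "Title": src["title"].str.strip(),
      "Body (HTML)": src["description"],
      "Vendor": src["brand"],
      "Variant SKU": src["sku"],
      "Variant Price": pd.to_numeric(src["price"]).round(2),
  })

  shopify.to_csv("shopify_upload.csv", index=False)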

Because of the different needs of clients, I almost always have to modify the script, but less and less each time, and Cursor makes the edits pretty easy, as long as I review each proposed change and keep it from gutting the script.

The good thing about Cursor is that it can run its own tests against the input CSV in agent mode and debug autonomously.

  • dataflowmapper 4 hours ago

    Thanks for sharing, these kinds of tasks can definitely become a time sink. Out of curiosity, you mentioned the apps using APIs aren't reliable at scale; if you don't mind sharing, where do things break down when using them?

akgfab 4 hours ago

I’ve had great success with radicli, weasel, and pydantic: writing scripts and defining inputs/outputs in a YAML is the sweet spot. I learned ML from Explosion and their level of abstraction is just right, though I am curious how they are weathering the LLM storm.

atentaten 13 hours ago

I use SQLite. We use CSVs and Google Sheets a lot, so I usually import the CSV as a SQLite table, use SQL to produce the template layout as the query results, and then export those as CSV for input into the templated system.
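
A minimal version of that round-trip, sketched with pandas and the stdlib sqlite3 (the table, column, and template names are made up):

  import sqlite3
  import pandas as pd

  con = sqlite3.connect(":memory:")
  pd.read_csv("contacts.csv", dtype=str).to_sql("contacts", con, index=False)

  # Reshape into the target template with plain SQL, then export as CSV.
  out = pd.read_sql_query("""
      SELECT first_name || ' ' || last_name AS "Full Name",
             lower(email)                   AS "Email",
             CASE status WHEN 'A' THEN 'Active' ELSE 'Inactive' END AS "Status"
      FROM contacts
  """, con)
  out.to_csv("import_template.csv", index=False)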

  • dataflowmapper 3 hours ago

    Nice, sounds like a pretty standard approach. Have you had any issues or time sinks around repeatability where you need to tweak scripts or create new ones for different CSVs?

ensemblehq 6 hours ago

Through many different companies, we’ve primarily handled this with validation/transformation scripts across a variety of tools. What’s useful is attaching observability/logging and having each transformation step do one thing only (borrowing from the Unix philosophy). Having a good understanding of the schema designs for each system is key - at some organizations, we’ve codified the schemas. Pydantic models (and similar) have been very useful as well.
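
A sketch of what the Pydantic piece can look like, with a made-up target schema; each bad row produces a readable error instead of silently flowing downstream:

  from datetime import date
  from pydantic import BaseModel, ValidationError

  class EmployeeRow(BaseModel):  # hypothetical target schema
      employee_id: str
      hire_date: date
      salary: float

  rows = [{"employee_id": "E100", "hire_date": "2024-03-07", "salary": "85000"}]

  valid, errors = [], []
  for i, raw in enumerate(rows):
      try:
          valid.append(EmployeeRow(**raw))
      except ValidationError as exc:
          errors.append((i, exc.errors()))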

stop50 15 hours ago

Python's csv module is extremely powerful. It has done everything I needed to do with it.
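
For anyone who hasn't leaned on it, the DictReader/DictWriter pair covers a lot of this reshaping on its own (the file and field names below are invented):

  import csv

  with open("source.csv", newline="") as src, \
       open("target.csv", "w", newline="") as dst:
      reader = csv.DictReader(src)
      writer = csv.DictWriter(dst, fieldnames=["Account", "Amount"])
      writer.writeheader()
      for row in reader:
          writer.writerow({
              "Account": row["acct_code"].strip().upper(),
              "Amount": f"{float(row['amt']):.2f}",
          })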

francisofascii 14 hours ago

Writing C# / LINQ scripts for this gives you the flexibility to deal with whatever impedance mismatch you have. It gets tedious, and maybe makes less sense, when you have dozens of model properties that are straight copies of property X from A to B. Then maybe an ETL tool like FME makes more sense.

Date example:

  var dateValue = DateTime.ParseExact(yyyymmdd, "yyyyMMdd", null);
  var dateString = dateValue.ToString("yyyy-MM-dd HH:mm:ss");

jlpk 13 hours ago

This is a major part of my job - we're a small team that works with close to 100 partners in the nonprofit sector, who all store similar data differently in CSVs, PDFs, 2-3 industry-specific CRMs, etc., which we need to standardize and load. Our partners have small datasets (usually <2000 rows, maybe 20 columns or so) that are, most of the time, extremely poorly formatted. We work in a single domain, so each partner's data is largely a differently formatted version of the same common elements.

We have a template with accompanying documentation - partners with technical teams and well-structured data can basically self-serve with SQL. We have a series of meetings where we map columns and values, review errors, etc. More irritating than the data transformations is understanding the structure of the data and its practical use, e.g., the same column for Partner A means something entirely different for Partner B. Transforming it might not be the problem - making sure everyone across our teams understands what it means and where it should go is a large pain point in mapping, but the coding of this logic is trivial.

For non-technical partners where we handle the prep, over time I wrote a small internal Python library, just building off DataFrames, that contains a few hundred tests to validate the data, plus a set of commonly used data-cleaning functions that are flexible in their inputs. We connect to APIs to verify addresses and time zones where we can. We now use LLMs more frequently to parse and structure fields to our standard, but it is still painful to review results from the LLM and ensure correctness. Each incoming file gets an accompanying Jupyter notebook that runs all the relevant scripts.

The issue with working in Excel (formulas, Office JS/PY scripts, manual edits) has always been version tracking - it's difficult to replicate work if new versions of files come in while we are prepping the data. If we find an error post-import, it is great to be able to track down where and why we made the decision or coding error. I haven't tried Power Query though. I have tried OpenRefine, but I think it sometimes slows me down for easy functions, and API-based data transformations become a separate task.

When we have interns, coordinating cleaning / prep on a single file across users can be tricky.

We did try an internal POC of a UI-based tool to allow users to map their own data and start cleaning it, but honestly, either a) the partner would have a data analyst or steward who didn't need it, or b) the partner wouldn't have enough of a skillset to feel comfortable cleaning it. Plus, pretty often we'll need to join data from multiple spreadsheets but conditionally use certain rows from one or the other to get the most up-to-date data, which can be difficult to do. It didn't feel as if this was worth the effort to continue with.

Fuzzy de-duplication validation is a pain, as is anything where you actually want to check with the partner about correctness - like if I want to check whether an email is spelled correctly because it almost matches the person's last name but is one letter off - it becomes a long list of bullet points in an email. Something I would like is an easy way to flag and share errors and have a partner correct those in a friendly UI, then store those changes as code, without a long series of emails or meetings. We've a few times uploaded an Excel file to SharePoint with a script that automatically adds a comment to cells with errors -> but again, some people just aren't comfortable working with Excel, or get confused when they see a very structured version of their data with additional formatting, and there is no convenient way to track or replicate their changes.

There are always one-off corrections that will need to be made as well - I've just sort of accepted this, so we generate a list of warnings/recommendations for the partner to review or correct post-import based on the test results, rather than us trying to do it. That has worked fine.

  • dataflowmapper 12 hours ago

    Wow, thanks for the detailed and insightful breakdown of your process. This really resonates with a lot of the complexities I've been exploring and have encountered as an implementation consultant.

    A few things particularly stood out to me:

      - Your point about semantic mapping ("understanding the structure... what it means and where it should go") being more irritating than the actual coding of transformations is spot on. Domain expertise on these things can make a big difference.
    
      - The version tracking issues with Excel and the need to trace back decisions or errors are a big reason Excel isn't scalable. I totally agree with you that Excel is bad for this kind of stuff and that repeatable transformations/processes are important.
      - Your desire for an easier way to flag/share errors with partners for correction, and then effectively store those changes as part of a repeatable process (like 'code' or a versioned configuration), is a fascinating challenge in collaborative data onboarding. It seems crucial for those fixes not to be one-offs, especially when dealing with partners/clients, who are notoriously slow. I've used Flatfile in the past, but it's limiting in that it's all about one-offs and transformations aren't repeated.
    
    It sounds like you've built a robust internal Python library to manage a lot of this, which is impressive. When you use Jupyter notebooks for each incoming file:

      - How do you manage the reusability of common logic or tests across those notebooks, especially if a core cleaning function needs an update that should apply to many different partner files?
      - And how do less technical team members (like the interns you mentioned) interact with or contribute to that Python/notebook-based process, particularly if they need to understand or slightly modify a transformation rule?
    
      And for your usage of LLMs, is that mostly for mapping source fields to target fields?

    Thanks again for sharing such a comprehensive view, it's incredibly helpful for understanding the friction points and the need for traceable, repeatable solutions.

    • jlpk 7 hours ago

      Course! These are one-time migrations, so at most we have 3-4 projects happening concurrently, and we don't need to worry about backwards compatibility either. We can just import a specific version or branch of the library, or at worst, we just copy and paste the function and make the change in the notebook. But the majority of the functions we've put work into in the library are the ones consistently applied across any incoming data files - API scripts, de-duplication, assigning IDs... honestly, any changes like this are usually pretty easy to make.

      The tests are always consistent across files and primarily check the validity of the upload (right type, for example, or logic: start times have end times, etc.). Every test should work for every file ever, since they are based on the platform and upload constraints rather than partner data.

      Short answer is they don't. For onboarding interns into the process, I write a skeleton notebook that imports the internal library and walks them through cleaning a file. But we would hire interns who have a background in coding and data cleaning. Starting out, rather than change an existing function, they might add a line of code in the notebook that changes the data in a way that lets the existing function work, for example. There are cases where specific business logic needs to be coded into a function, but we would just write those ad hoc. This isn't an upload that needs to be done automatically or very quickly, so that hasn't been a problem.

      The only reason other team members would contribute to this would be around shaping the business logic of which fields should be uploaded where, how they should be formatted, etc. But the data review part that sometimes needs to be done is very tricky, e.g., checking that the transformed results are what they want. It mostly happens in Excel - we did build a POC UI where they would upload a CSV and could click through each record and review and suggest changes in a cleaner way.

      For LLMs, we don't use them for mapping, but we've tested it and it works fine. The LLM mapping doesn't really save us a ton of time compared to just looking through the file and assigning columns in our mapping file (which is composed of about 120 columns). If we wanted to deploy it and allow the partner to upload a sheet and have the LLM suggest a mapping, that could work. (The mapping step in the demo video on your site looks great!) Instead we use them to format and extract unstructured data -> for example, we might need to pull tags out of descriptive text, or extract event hours out of a string of text describing the event. The LLM can do this really well now with structured JSON outputs, but reviewing it to ensure it is correct is still a manual process.
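
      A hedged sketch of that extraction step with the openai Python client; the model name, prompt, and fields are placeholders, and the output still needs the manual review described above:

        import json
        from openai import OpenAI

        client = OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content":
                    "Extract event_date, start_time, end_time, and tags from the "
                    "text. Reply with a single JSON object."},
                {"role": "user", "content": "Join us Saturday June 7, 5-8pm. Family friendly!"},
            ],
        )
        extracted = json.loads(resp.choices[0].message.content)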

      • dataflowmapper 4 hours ago

        Appreciate the detailed follow-up and all this information; it's been super insightful to get more real-world perspective on this. Your processes all sound very well thought out and address most of the shortcomings/pain points I've experienced personally. There's definitely a lot of valuable info in there for anyone in this space. And thank you for taking the time to check out the site - feedback is limited at this stage, so hearing that something resonated is great. Hoping it all makes sense.

aerhardt 14 hours ago

DuckDB, Python and LLMs. I can write in more detail when I have time!

nonameiguess 12 hours ago

This was roughly half my job a decade ago. Literally because I split my time 50/50 between two separate projects, one of which was a business analytics automation project that was designed for exactly this purpose. Take all of the data exported from the various project management and issue tracking systems used by different product teams, transform as necessary, then import to a common corporate analytics system upper management used to track progress metrics.

The idea was to replace all of the bespoke custom Excel workflows that various program management teams used to do it themselves, so we were self-funding based on the estimated cost savings of replacing manual work, meaning I had pretty much free rein to do it however I wanted as long as I could get people to use it. I did it entirely in Python, making heavy use of Pandas, creating a toolchain that ingested mapping definitions from the users and relied on fuzzy matching (i.e. stemming, de-capitalization, recognizing word roots) to categorize field names into a common definition set used by the corporate system.
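
Roughly the kind of matching involved, sketched with difflib from the stdlib (the canonical field list is invented; the real toolchain also did stemming and word-root matching):

  import difflib

  CANONICAL = ["issue_id", "created_date", "closed_date", "story_points", "assignee"]

  def map_field(raw_name):
      # Normalize, then fall back to the closest canonical name, if any.
      key = raw_name.strip().lower().replace(" ", "_")
      matches = difflib.get_close_matches(key, CANONICAL, n=1, cutoff=0.6)
      return matches[0] if matches else None

  map_field("Created Date")  # -> 'created_date'
  map_field("Story Pts")     # -> 'story_points'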

It wasn't a 100% solution, more like 95%, but the reality of this kind of work is we had a real garbage-in problem anyway that no software could solve. Extracting data from an issue tracker relies upon developers using it correctly in the first place. If you have a bunch of workflow transitions but whoever works an issue just transitions them all at once when work is complete, there's no way to know how long everything actually took. You can only extract information that was actually recorded. Coercing dates into a single format standard is fairly simple to do, but if the dates were wrong in the original format, they'll still be wrong in the corrected format.

  • dataflowmapper 3 hours ago

    Thanks for sharing, good point about garbage in and garbage out. I mostly work with software implementation so this use case is pretty insightful.

sargstuff 11 hours ago

Ah, modern tech (standards/protocols, Python/ODBC/LLMs, etc.) takes the fun out of restoring unknown data and data formats from archived sources, aka mylar tape: regular expressions, awk, shell scripts, Perl, hand-rolled statistical inference, and trial and error to figure out which bit patterns/chunks generate the most appropriate data.

Much better to filter/normalize data when it's entered, or soon after completion of batch data entry, when 'is this ok?' can be answered by those using the data (vs. a few decades later with no documentation or knowledge of the intended purpose of the data).